In [None]:
!python -m spacy download fr_core_news_sm
!pip install pandas==2.2.2 -q
!pip install spacy==3.7.5 -q
!pip install ydata-profiling==4.12.2 -q

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

## Context

> In the age of information overload, text data is generated at an unprecedented rate. Whether it's social media posts, news articles, customer reviews, or scientific research papers, text is one of the most abundant and valuable forms of data in the digital era. However, this wealth of textual information often comes with a significant challenge - the need for data cleaning and preprocessing.
>
> Cleaning textual data involves the systematic identification and removal of noise, inconsistencies, and irrelevant information to ensure that the data is in a format suitable for analysis, machine learning, or natural language processing (NLP) tasks. Python, with its versatile libraries and tools, has emerged as a powerhouse for text data preprocessing.
>
> In this course, we will explore the importance of text data cleaning and demonstrate how Python can be a valuable ally in this process. We will cover a range of techniques and methods to transform raw, messy text into clean and structured data that can be used for various purposes, such as sentiment analysis, information retrieval, language modeling, and more.
>
> Throughout this course, you will learn how to:
>
> * **Remove Noise**: Identify and eliminate irrelevant characters, symbols, and formatting issues that can interfere with text analysis.
>
> * **Normalize Text**: Standardize text by converting it to lowercase and handling encoding issues.
>
> * **Tokenize Text**: Break text into individual words or tokens, making it easier to process and analyze.
>
> * **Handle Stop Words**: Remove common words (e.g., "the," "and," "in") that add little meaning to the analysis.
>
> * **Lemmatize**: Reduce words to their base or root forms to handle variations and improve consistency in text data.
>
> By the end of this course, you will have the knowledge and tools necessary to prepare text data for analysis, making it more reliable and valuable for a wide range of applications. Let's embark on this journey to harness the power of clean text data with Python!
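
To make the steps listed above concrete, here is a minimal sketch that uses only the Python standard library. The `clean_text` function and the tiny `STOP_WORDS` set are hypothetical names introduced for illustration. Lemmatization is left out because it needs a language model, which is introduced later with spaCy.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list; real projects use a library-provided list.
STOP_WORDS = {"the", "and", "in", "a", "is", "of"}

def clean_text(raw: str) -> list[str]:
    """Normalize, de-noise, tokenize, and filter one piece of text."""
    text = unicodedata.normalize("NFKC", raw)   # normalize the Unicode representation
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", " ", text)    # remove noise: URLs
    text = re.sub(r"[^\w\s]", " ", text)         # remove noise: punctuation and symbols
    tokens = text.split()                        # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words

print(clean_text("The price is GREAT!!! See https://example.com and enjoy :)"))
# → ['price', 'great', 'see', 'enjoy']
```

Whitespace tokenization is a simplification; the spaCy tokenizer used later in this notebook handles punctuation and contractions more carefully.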

## Load Data

>**What Are Dataframes in Python?**
>
> In pandas, a DataFrame is like a special table that helps us organize and work with data. Imagine it as a neat and organized way to store information, kind of like a spreadsheet you'd use in Excel. Each row in a DataFrame represents a single "thing" (like a person, a product, or a date), and each column stores a specific piece of information about that "thing" (like a name, age, or price).
>
>**Why Use Dataframes?**
>
>DataFrames are super helpful because they make it easy to:
>
>* **Organize Data**: They help us keep our data structured and tidy, making it easier to understand and work with.
>* **Manipulate Data**: We can change, filter, or calculate things with our data easily, just like in Excel.
>* **Analyze Data**: DataFrames are provided by the pandas library, which offers powerful tools to analyze and make sense of our data.
>* **Visualize Data**: We can create charts and graphs to visualize our data for better insights.
>* **Handle Big Data**: DataFrames can handle large amounts of data efficiently, as long as the data fits in memory, which is crucial for big projects.
>
>So, in a nutshell, DataFrames help us manage and make the most of our data in Python, making them a go-to tool for data analysis and manipulation.
>
>We will use the **pandas** library to load comments extracted by a team of data scientists. Here's an example of how to load a CSV file:
>```python
> import pandas as pd
> df = pd.read_csv("my_path/to/my/file.csv")
> ```
>
> In the context of this course, you can choose to consider only 10000 comments. Here's an example using the **sample** method:
>```python
> df_sample = df.sample(10000)
> ```
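
The loading and sampling steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the in-memory CSV and its `verbatim` and `note` columns are hypothetical stand-ins for a real file, and `n` is an illustrative sample size. Passing `random_state` makes the sample reproducible, and capping `n` at the number of rows avoids an error when the dataset is smaller than the requested sample.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = """verbatim,note
Très bon service,5
,3
Facture incompréhensible,1
Technicien ponctuel,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop rows whose comment is missing before sampling.
df = df.dropna(subset=["verbatim"])

# Sample at most n rows; random_state makes the draw reproducible.
n = 2
df_sample = df.sample(n=min(n, len(df)), random_state=0)
print(df_sample.shape)
```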

In [None]:
# If needed, you can access your Drive files with this code snippet
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Insert your code here
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/mba_x_hec/fournisseurs_energie_top5_forums.csv", index_col=0)
df = df.rename(columns={"verbatim":"text"}).dropna(subset="text")
df = df.sample(10000, random_state=0)

In [None]:
import gdown
import pandas as pd

In [None]:
# get the ID of needed file in Google Drive
file_id = "1tQVli7lpXAT7npHpqa-YqBQasdr7DEye"
# create the download link
download_url = f"https://drive.google.com/uc?id={file_id}"
# download the file
gdown.download(download_url, "data.csv", quiet=False)

# read csv
df = pd.read_csv("data.csv")
df = df.sample(1000)

Downloading...
From: https://drive.google.com/uc?id=1tQVli7lpXAT7npHpqa-YqBQasdr7DEye
To: /content/data.csv
100%|██████████| 16.1M/16.1M [00:00<00:00, 79.2MB/s]


In [None]:
df = df.drop(['Unnamed: 0'], axis=1)

## Explore Data
>
> To efficiently explore textual data, you can use the automatically generated reports provided by the Python library **ydata_profiling**. Here's an example code snippet to do so:
> ```python
> from ydata_profiling import ProfileReport
> profile = ProfileReport(df, title="Profiling Report")
> profile
>```
> Explore the report and identify key insights about the sample you have extracted.
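
Alongside the report, a few plain pandas checks can confirm what you see in it. The sketch below is illustrative only: `df_demo` is a hypothetical miniature dataset whose column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical miniature dataset; column names mirror the ones used in this notebook.
df_demo = pd.DataFrame({
    "verbatim": ["Très bon service", None, "Facture trop élevée, service injoignable"],
    "note": [5, 3, 1],
})

# Share of missing comments.
missing_ratio = df_demo["verbatim"].isna().mean()

# Comment length in words (missing comments count as 0 words).
word_counts = df_demo["verbatim"].fillna("").str.split().str.len()

print(f"missing: {missing_ratio:.0%}")
print(word_counts.describe())
print(df_demo["note"].value_counts().sort_index())
```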

In [None]:
## Insert your code here
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Report on scraped comments.")
profile




## Clean Data

> **How to use the apply method with a pandas DataFrame**
>
> The `apply` method applies a user-specified function to each element of a column (or Series) of a pandas DataFrame. It takes a function as an argument, typically a lambda function or a user-defined function, and calls it on each element in the column. The result is a new Series containing the transformed values. Here is an example that replaces punctuation in comments with spaces:
>```python
># Define a function to replace punctuation with spaces
>def replace_punctuation(comment):
>    return ''.join([' ' if c in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' else c for c in comment])
>
># Apply the punctuation replacement function to the "comments" column
>df['comments'] = df['comments'].apply(replace_punctuation)
>```
> **How to use the lower method with a pandas DataFrame**
>
> The `str.lower()` method converts all uppercase letters in a string to lowercase. Through the `.str` accessor, it is applied to each element in a column (or Series) of a pandas DataFrame. Here is an example that lowercases comments:
>```python
>df['comments'] = df['comments'].str.lower()
>```
>
> Drawing inspiration from the two previous examples, clean the 1000 extracted comments. Feel free to add additional cleaning steps if you are aware of any.
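
One possible additional cleaning step: the punctuation list in the example above is ASCII-only, so typographic characters such as the apostrophe in "l’installation" pass through untouched. A minimal sketch using the standard `re` module is shown below; `clean_comment` is a hypothetical helper name, not part of pandas.

```python
import re

def clean_comment(comment: str) -> str:
    """Lowercase a comment and replace punctuation and line breaks with single spaces."""
    text = comment.lower()
    # [^\w\s] matches any character that is neither a word character nor whitespace,
    # so typographic marks such as ’ or … are covered too.
    # Note: the underscore counts as a word character and is therefore kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("Nous attendons l’installation…\nTrès BIEN !"))
# → nous attendons l installation très bien
```

It can be applied to a column in the same way as before, for example `df['comments'].apply(clean_comment)`.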

In [None]:
df1 = df.copy()

In [None]:
# insert your code here

# Define a function to replace punctuation with spaces
def replace_punctuation(comment):
    return ''.join([' ' if c in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' else c for c in comment])

# Cast the "verbatim" column to str, replace punctuation, then lowercase
df1['verbatim'] = df1['verbatim'].astype('str')
df1['verbatim'] = df1['verbatim'].apply(replace_punctuation)
df1['verbatim'] = df1['verbatim'].str.lower()

In [None]:
# Label-based lookup: raises KeyError when row 11754 is not part of the random sample
df1['verbatim'].loc[11754]

KeyError: 11754

In [None]:
import numpy as np
df1['verbatim'] = df1['verbatim'].replace(' na ', np.nan)
df1['verbatim'].isnull().sum()

33

In [None]:
df1['verbatim'].loc[7523]

' na '

In [None]:
df1.head(40)

| | page | titre | verbatim | date | note | reponse | date_experience | fournisseur | source |
|---|---|---|---|---|---|---|---|---|---|
| 9795 | 240 | Scandaleux | scandaleux factures à triplé ils retir... | 9 sept. 2018 | 1 | Bonjour Natacha Natacha,\n\nSi vous souhaitez ... | Date de l'expérience: 09 septembre 2018 | https://fr.trustpilot.com/review/totalenergies.fr | trustpilot |
| 23161 | 96 | Avis client | nous attendons l’installation des panneaux sol... | le 24/10/2022 par Jean C. | 4 | | suite à une expérience du 20/09/2022 | https://www.avis-verifies.com/avis-clients/edf... | avis_verifies |
| 7277 | 114 | Entretien chaudière | j ai un contrat gaz et éléctricité chez eux \n... | 3 juin 2021 | 1 | | Date de l'expérience: 03 juin 2021 | https://fr.trustpilot.com/review/totalenergies.fr | trustpilot |
| 31684 | 125 | Avis client | pour l instant impressionnant | le 01/12/2022 par HENRI R. | 5 | | suite à une expérience du 23/11/2022 | https://www.avis-verifies.com/avis-clients/eng... | avis_verifies |
| 33572 | 313 | Avis client | technicien super pro il a réglé mon problème | le 26/11/2022 par HERVÉ R. | 5 | | suite à une expérience du 21/11/2022 | https://www.avis-verifies.com/avis-clients/eng... | avis_verifies |
| 13099 | 156 | Recommande vivement et inconditionnellement :) | je recommande vivement et sans condition \nile... | 21 juil. 2017 | 5 | | Date de l'expérience: 21 juillet 2017 | https://fr.trustpilot.com/review/ilek.fr | trustpilot |
| 15143 | 82 | Avis client | très bon fournisseur d’énergie | le 07/08/2020 par Kyllian J.* | 5 | | suite à une expérience du 08/06/2020\n*Informa... | https://www.avis-verifies.com/avis-clients/fr.... | avis_verifies |
| 35907 | 370 | Avis client | explications claires et fidèle au distributeur... | le 20/05/2022 par Andre D. | 5 | | suite à une expérience du 27/04/2022 | https://www.avis-verifies.com/avis-clients/edf... | avis_verifies |
| 4939 | 37 | Que dire sur Eni | que dire sur eni à part fuyez \nce sont... | 11 oct. 2019 | 1 | | Date de l'expérience: 11 octobre 2019 | https://fr.trustpilot.com/review/eni.fr | trustpilot |
| 18834 | 224 | Avis client | pas toujours sur qu ils soit les moins cher | le 14/10/2022 par BERNARD L. | 4 | | suite à une expérience du 07/07/2021 | https://www.avis-verifies.com/avis-clients/tot... | avis_verifies |


In [None]:
# Bonus: Try to avoid using "apply" method to optimize your code
import string

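# Create a single translation table that maps punctuation and newlines to spaces
# (string.punctuation already contains the double quote; reassigning `translator`
# several times would keep only the last table)
chars_to_replace = string.punctuation + '\n'
translator = str.maketrans(chars_to_replace, ' ' * len(chars_to_replace))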

# Replace punctuation with spaces and convert text to lowercase using vectorized operations
df['verbatim'] = df['verbatim'].str.translate(translator).str.lower()

# Display the dataframe
print(df)

       page                                     titre  \
5118      6  Il ne m'a manqué que le rappel du tarif…   
24555   236                               Avis client   
32552   211                               Avis client   
9132    207      Démarchage irrespectueux et intrusif   
20171   357                               Avis client   
...     ...                                       ...   
35743   353                               Avis client   
37022   481                               Avis client   
33982   354                               Avis client   
13679     9                          Très bon accueil   
4249      3   Je suis vraiment déçu par ENI car il m…   

                                                    text  \
5118   il ne m a manqué que le rappel du tarif et abo...   
24555  une demarche tres simple et tres bien explique...   
32552  le technicien a fait correctement son boulot i...   
9132   comment ils ont eu mon nom et mon téléphone   ...   
20171          

## Tokenization and Lemmatization with spaCy

> **What is spaCy?**
>
>spaCy is an open-source Python library designed for Natural Language Processing (NLP).
>It is known for its speed, efficiency, and ease of use, making it a top choice for NLP tasks.
>spaCy supports a wide range of NLP tasks, including tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and more.
>
> **Why spaCy?**
>
>* High Performance: spaCy is renowned for its fast execution, making it ideal for processing large volumes of text.
>* User-Friendly: It offers a simple and consistent API that allows beginners to quickly get started.
>* Pre-Trained Models: spaCy provides pre-trained language models for multiple languages, simplifying the onboarding process.
>
> **Key Features of spaCy**:
>
>* **Tokenization**: Tokenization involves breaking down text into words, phrases, or meaningful subunits. spaCy excels at this task.
>
>* **Part-of-Speech Tagging (POS)**: spaCy can tag each word in a sentence with its part of speech (verb, noun, adjective, etc.).
>
>* **Dependency Parsing**: This feature enables the analysis of grammatical structure within sentences and represents dependencies between words.
>
>* **Named Entity Recognition (NER)**: spaCy can extract named entities such as names of people, organizations, or locations from text.
>
> Run the following cell to see an example of using spaCy.

In [None]:
import spacy

# Load a pre-trained language model (e.g., French)
nlp = spacy.load("fr_core_news_sm")

# Process text
text = "SpaCy est une excellente librairie pour faire du NLP."
doc = nlp(text)

# Loop to print tokens and their lemmas
for token in doc:
    print(token.text,'->', token.lemma_)

SpaCy -> spacy
est -> être
une -> un
excellente -> excellent
librairie -> librairie
pour -> pour
faire -> faire
du -> de
NLP -> NLP
. -> .


> In Natural Language Processing (NLP), a "stop word" (or "mot vide" in French) is a common word that is often filtered or removed during text preprocessing because it is considered uninformative for content analysis. Stop words typically consist of very frequent words in the language, such as articles (the, a, an), prepositions (in, on, at), conjunctions (and, or, but), pronouns (I, you, he, she), and so on.
>
> The main goal of removing stop words is to reduce the dimensionality of the text while eliminating unnecessary noise, allowing the analysis to focus on keywords and more significant concepts. This can improve the speed and accuracy of text analysis tasks, such as information retrieval, text classification, sentiment analysis, and many others.
>
> Like the `text` and `lemma_` attributes, the `is_stop` attribute of a token tells you whether a word is a stop word.
>
```python
# Process text
stop_word = "je"
print(nlp(stop_word)[0].is_stop)
```
> Determine the number of stop words in **text** variable defined previously.

In [None]:
# Process text
stop_word = "je"
print(nlp(stop_word)[0].is_stop)

True


In [None]:
# insert your code here

print("Number of stop words in text variable :", len([token.text for token in doc if token.is_stop]))

Number of stop words in text variable : 4


In [None]:
print(doc)
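print(doc.text) # get the original text
print(doc.ents) # named entities (proper nouns)
print(doc.vector)  # get the vector for the whole document (if the model supports it)

# spacy.tokens.doc.Doc
# In spaCy, a Doc is text that has been processed by the NLP pipeline. It contains:
#  Tokens: the smallest units the text is split into, such as words and punctuation marks.
#  Sentences: the sentences that spaCy detects automatically.
#  POS tags: the part of speech of each word (e.g., noun, verb).
#  Dependency parsing: the grammatical relations between words.
#  Word vectors: if the model supports them, words can be converted to vectors.
#  Entities: proper nouns found by named entity recognition (NER), such as names of people and places.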

SpaCy est une excellente librairie pour faire du NLP.
SpaCy est une excellente librairie pour faire du NLP.
(NLP,)
[-0.25828084 -1.7273871   0.8262991  -0.07041977  1.1610556   0.7935073
  0.28170657  0.46771032 -1.644994    0.05820196  1.4873457  -0.31353542
 -1.4711758   1.827647   -0.18569568 -0.64692605  0.24157414  0.42058125
 -0.76890546 -1.4656975  -1.4037007   1.0816054   0.59598225 -0.6314422
  0.27752575 -0.7492221   0.9611038  -0.5333277  -0.79855233  0.17871478
 -0.5242398  -0.74631107 -1.1074226   0.13465807  0.8419918  -0.19741967
 -1.3778862  -0.9776704  -1.5590267  -0.90035975 -1.5684083  -0.334244
 -1.7317717  -0.3305711   0.63080716  0.5334758   0.1169302  -0.9090589
 -0.39877328 -1.3492047   0.14053626  0.99729174 -0.52114475  0.10204411
 -0.08973354 -1.1193656  -1.723135   -0.1668816   0.83558303 -0.18845055
  0.86055917 -0.26746526  0.16951828  0.09355048  0.69791234  2.0537138
  1.249892   -0.02382379  1.5578821   0.10708094 -0.9471078   0.3274816
  0.02269915  0.

> Drawing inspiration from the previous examples, define a function to tokenize and lemmatize the 1000 extracted comments.
>
> Hint: Don't forget to use the **apply** method.
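
As an alternative to calling the model once per row with `apply`, spaCy's `nlp.pipe` processes texts in batches, which is usually faster on thousands of comments. The sketch below is a minimal illustration, not a definitive implementation: `lemmatize_texts` and `nlp_demo` are hypothetical names, and a blank French pipeline is assumed so that the example runs without a model download.

```python
import spacy

def lemmatize_texts(texts, nlp, batch_size=64):
    """Return, for each input text, its non-stop-word lemmas joined by spaces."""
    cleaned = []
    # nlp.pipe processes texts in batches instead of one nlp() call per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        kept = [
            token.lemma_ or token.text  # fall back to the raw text if no lemma is set
            for token in doc
            if not token.is_stop and not token.is_space
        ]
        cleaned.append(" ".join(kept))
    return cleaned

# Assumption: a blank French pipeline keeps this sketch runnable without a model download.
# It has no lemmatizer, so lemmas fall back to the token text; swap in
# spacy.load("fr_core_news_sm") to get real lemmas.
nlp_demo = spacy.blank("fr")
print(lemmatize_texts(["je recommande la chaudière", ""], nlp_demo))
```

With the model loaded earlier in this notebook, the result can be written back to the DataFrame, for example `df["verbatim"] = lemmatize_texts(df["verbatim"].astype(str).tolist(), nlp)`.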

In [None]:
# insert your code here

def tokenize_lemm_func(text):
    # Analyze the text with the spaCy model
    doc = nlp(text)

    # Lemmatize the tokens and drop stop words
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]

    # Join the lemmatized tokens back into a single string
    lemmatized_text = ' '.join(lemmatized_tokens)

    return lemmatized_text

df["verbatim"] = df["verbatim"].apply(tokenize_lemm_func)

> Generate a report using the Python library **ydata-profiling** and compare it with the report generated at the beginning.

In [None]:
# insert your code here
from ydata_profiling import ProfileReport
profile2 = ProfileReport(df, title="Report 2 on scraped comments.")
profile2


