In [None]:
!pip install dataprep==0.4.5 -q
!pip install pandas==1.5.3 -q
!pip install spacy==3.6.1 -q
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.6.0
  Downloading https

## Context

> In the age of information overload, text data is generated at an unprecedented rate. Whether it's social media posts, news articles, customer reviews, or scientific research papers, text is one of the most abundant and valuable forms of data in the digital era. However, this wealth of textual information often comes with a significant challenge - the need for data cleaning and preprocessing.
>
> Cleaning textual data involves the systematic identification and removal of noise, inconsistencies, and irrelevant information to ensure that the data is in a format suitable for analysis, machine learning, or natural language processing (NLP) tasks. Python, with its versatile libraries and tools, has emerged as a powerhouse for text data preprocessing.
>
> In this course, we will explore the importance of text data cleaning and demonstrate how Python can be a valuable ally in this process. We will cover a range of techniques and methods to transform raw, messy text into clean and structured data that can be used for various purposes, such as sentiment analysis, information retrieval, language modeling, and more.
>
> Throughout this course, you will learn how to:
>
> * **Remove Noise**: Identify and eliminate irrelevant characters, symbols, and formatting issues that can interfere with text analysis.
>
> * **Normalize Text**: Standardize text by converting it to lowercase and handling encoding issues.
>
> * **Tokenize Text**: Break text into individual words or tokens, making it easier to process and analyze.
>
> * **Handle Stop Words**: Remove common words (e.g., "the," "and," "in") that add little meaning to the analysis.
>
> * **Lemmatize**: Reduce words to their base or root forms to handle variations and improve consistency in text data.
>
> By the end of this course, you will have the knowledge and tools necessary to prepare text data for analysis, making it more reliable and valuable for a wide range of applications. Let's embark on this journey to harness the power of clean text data with Python!

## Load Data

>**What Are Dataframes in Python?**
>
> In Python, a dataframe is like a special table that helps us organize and work with data. Imagine it as a neat and organized way to store information, kind of like a spreadsheet you'd use in Excel. Each row in a dataframe represents a single "thing" (like a person, a product, or a date), and each column stores a specific piece of information about that "thing" (like a name, age, or price).
>
>**Why Use Dataframes?**
>
>Dataframes are super helpful because they make it easy to:
>
>* **Organize Data** : They help us keep our data structured and tidy, making it easier to understand and work with.
>* **Manipulate Data** : We can change, filter, or calculate things with our data easily, just like in Excel.
>* **Analyze Data** : In Python, dataframes are provided by the pandas library, which comes with powerful tools to analyze and make sense of our data.
>* **Visualize Data** : We can create charts and graphs to visualize our data for better insights.
>* **Handle Big Data** : Dataframes can handle large amounts of data efficiently, which is crucial for big projects.
>
>So, in a nutshell, dataframes help us manage and make the most of our data in Python, making it a go-to tool for data analysis and manipulation.
>
>We will use the **pandas** library to load comments extracted by a team of data scientists. Here's an example of how to load a CSV file:
>```python
> import pandas as pd
> df = pd.read_csv("my_path/to/my/file.csv")
> ```
>
> In the context of this course, you can choose to consider only 10000 comments. Here's an example using the **sample** method:
>```python
> df_sample = df.sample(10000)
> ```
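
Sampling is random, so two runs of the notebook will usually pick different comments. The sketch below shows one way to make the sample reproducible with the `random_state` parameter of `sample`. The toy dataframe, the seed value `42`, and the name `n_wanted` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real comments file.
df = pd.DataFrame({"text": [f"comment {i}" for i in range(50)]})

n_wanted = 10  # assumed size for the toy data; the course suggests 10000
n = min(n_wanted, len(df))  # sample() raises an error if n exceeds the row count

# A fixed random_state returns the same rows on every run.
df_sample = df.sample(n=n, random_state=42)
df_again = df.sample(n=n, random_state=42)

# Both samples contain exactly the same rows, in the same order.
print(df_sample.index.equals(df_again.index))
```

Fixing the seed makes it easier to compare the reports generated before and after cleaning, since both are computed on the same comments.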

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Insert your code here
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Capegemini Bootcamp/Cours-2 Cleaning Embedding/fournisseurs_energie_top5_forums.csv", index_col=0)
df = df.rename(columns={"verbatim":"text"}).dropna(subset="text")    # Drop rows with a missing comment
df = df.sample(10000)
df.head()

Unnamed: 0,page,titre,text,date,note,reponse,date_experience,fournisseur,source
4324,7,ce sont des escrocs de premieres,"ce sont des escrocs de premieres, résilier par...",18 oct. 2021,1,,Date de l'expérience: 18 octobre 2021,https://fr.trustpilot.com/review/eni.fr,trustpilot
7807,141,ouverture du contrat très rapide et…,ouverture du contrat très rapide et très simpl...,21 févr. 2021,4,,Date de l'expérience: 21 février 2021,https://fr.trustpilot.com/review/totalenergies.fr,trustpilot
114,6,Très bien pas de surprise comme avec…,Très bien pas de surprise comme avec certains ...,14 nov. 2022,5,,Date de l'expérience: 14 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
22509,31,Avis client,Pour ça j'attends que l'installation soit fait...,le 19/11/2022 par Denis B.,5,,suite à une expérience du 25/10/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies
23330,113,Avis client,clair - aimable - patient. N'a pas compté ses ...,le 16/10/2022 par Monique J.,5,,suite à une expérience du 19/09/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies


## Explore Data
>
> To efficiently explore textual data, you can use the automatically generated reports provided by the Python library **dataprep**. Here's an example code snippet to do so:
> ```python
>   from dataprep.eda import create_report
>   create_report(df)
>```
> Explore the report and identify key insights about the sample you have extracted.
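
If the full report is slow to generate or hard to read, a few numbers can also be computed directly with pandas. The sketch below is only a lightweight complement to the report, not a replacement for it. The toy dataframe and the choice of statistics are assumptions.

```python
import pandas as pd

# Hypothetical toy comments, including one missing value and one duplicate.
df = pd.DataFrame(
    {"text": ["Très bien, pas de surprise", "ce sont des escrocs", None, "ce sont des escrocs"]}
)

text = df["text"]
summary = {
    "n_rows": len(df),
    "n_missing": int(text.isna().sum()),
    "n_duplicates": int(text.dropna().duplicated().sum()),
    # Missing values are ignored when computing the means.
    "mean_chars": float(text.str.len().mean()),
    "mean_words": float(text.str.split().str.len().mean()),
}

for name, value in summary.items():
    print(name, value)
```

Running the same summary again after cleaning gives a quick way to see how much the comments have shrunk.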

In [None]:
## Insert your code here

from dataprep.eda import create_report
create_report(df)

Output hidden; open in https://colab.research.google.com to view.

## Clean Data

> **How to use apply method with pandas DataFrame**
>
> The `apply` method applies a user-specified function to each element of a column (a Series) of a pandas DataFrame. It takes a function as an argument, typically a lambda function or a user-defined function, and returns a new Series containing the transformed values. The following example replaces the punctuation in the comments with spaces:
>```python
># Define a function to replace punctuation with spaces
>def replace_punctuation(comment):
>    return ''.join([' ' if c in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' else c for c in comment])
>
># Apply the punctuation replacement function to the "comments" column
>df['comments'] = df['comments'].apply(replace_punctuation)
>```
> **How to use lower method with pandas DataFrame**
>
> The `lower()` method converts all uppercase letters in a string to lowercase. On a pandas column (a Series), it is available through the `.str` accessor, which applies it to each element and returns a new Series. The following example lowercases the comments:
>```python
>df['comments'] = df['comments'].str.lower()
>```
>
> Drawing inspiration from the two previous examples, clean the 10000 extracted comments. Feel free to add additional cleaning steps if you are aware of any.
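
As one possible set of additional steps, the sketch below lowercases the text, replaces punctuation with spaces, and collapses the runs of spaces that this replacement leaves behind. The function name `clean_comment`, the toy data, and the extra typographic characters (`’`, `«`, `»`, `…`) are assumptions and not part of the course material.

```python
import re
import string

import pandas as pd

# ASCII punctuation plus a few typographic marks common in French text
# (the extra characters are an assumption about the data).
PUNCT = string.punctuation + "’«»…"
PUNCT_TABLE = str.maketrans({c: " " for c in PUNCT})
MULTI_SPACE = re.compile(r"\s+")


def clean_comment(comment: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    comment = comment.lower()
    comment = comment.translate(PUNCT_TABLE)
    return MULTI_SPACE.sub(" ", comment).strip()


# Hypothetical toy comments.
df = pd.DataFrame({"text": ["Clair - aimable - patient.", "J’attends l'installation !"]})
df["text"] = df["text"].apply(clean_comment)
print(df["text"].tolist())
```

Like the course example, this splits French elisions such as `j'` and `l'` into separate single-letter words. Whether that is acceptable depends on the analysis that follows.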

In [None]:
df['text'].isna().sum()

0

In [None]:
# insert your code here

# Define a function to replace punctuation with spaces
def replace_punctuation(comment):
    return ''.join([' ' if c in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' else c for c in comment])

# Apply the punctuation replacement to the "text" column, then lowercase it
df['text'] = df['text'].apply(replace_punctuation)
df['text'] = df['text'].str.lower()
df.head()

Unnamed: 0,page,titre,text,date,note,reponse,date_experience,fournisseur,source
4324,7,ce sont des escrocs de premieres,ce sont des escrocs de premieres résilier par...,18 oct. 2021,1,,Date de l'expérience: 18 octobre 2021,https://fr.trustpilot.com/review/eni.fr,trustpilot
7807,141,ouverture du contrat très rapide et…,ouverture du contrat très rapide et très simpl...,21 févr. 2021,4,,Date de l'expérience: 21 février 2021,https://fr.trustpilot.com/review/totalenergies.fr,trustpilot
114,6,Très bien pas de surprise comme avec…,très bien pas de surprise comme avec certains ...,14 nov. 2022,5,,Date de l'expérience: 14 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
22509,31,Avis client,pour ça j attends que l installation soit fait...,le 19/11/2022 par Denis B.,5,,suite à une expérience du 25/10/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies
23330,113,Avis client,clair aimable patient n a pas compté ses ...,le 16/10/2022 par Monique J.,5,,suite à une expérience du 19/09/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies


## Tokenization and Lemmatization with spaCy

> **What is spaCy?**
>
>SpaCy is an open-source Python library designed for Natural Language Processing (NLP).
>It is known for its speed, efficiency, and ease of use, making it a top choice for NLP tasks.
>SpaCy supports a wide range of NLP tasks, including tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and more.
>
> **Why spaCy?**
>
>* High Performance: SpaCy is renowned for its fast execution, making it ideal for processing large volumes of text.
>* User-Friendly: It offers a simple and consistent API that allows beginners to quickly get started.
>* Pre-Trained Models: SpaCy provides pre-trained language models for multiple languages, simplifying the onboarding process.
>
> **Key Features of spaCy**:
>
>* **Tokenization**: Tokenization involves breaking down text into words, phrases, or meaningful subunits. SpaCy excels at this task.
>
>* **Part-of-Speech Tagging (POS)**: SpaCy can tag each word in a sentence with its part of speech (verb, noun, adjective, etc.).
>
>* **Dependency Parsing**: This feature enables the analysis of grammatical structure within sentences and represents dependencies between words.
>
>* **Named Entity Recognition (NER)**: SpaCy can extract named entities such as names of people, organizations, or locations from text.
>
> Run the following cell to see an example of using SpaCy

In [None]:
import spacy

# Load a pre-trained language model (e.g., French)
nlp = spacy.load("fr_core_news_sm")

# Process text
text = "SpaCy est une excellente librairie pour faire du NLP."
doc = nlp(text)

# Loop to print each token and its lemma
for token in doc:
    print(token.text, token.lemma_)

SpaCy spacy
est être
une un
excellente excellent
librairie librairie
pour pour
faire faire
du de
NLP NLP
. .


> In Natural Language Processing (NLP), a "stop word" (or "mot vide" in French) is a common word that is often filtered or removed during text preprocessing because it is considered uninformative for content analysis. Stop words typically consist of very frequent words in the language, such as articles (the, a, an), prepositions (in, on, at), conjunctions (and, or, but), pronouns (I, you, he, she), and so on.
>
> The main goal of removing stop words is to reduce the text's dimension while eliminating unnecessary noise, allowing the analysis to focus on keywords and more significant concepts. This can improve the speed and accuracy of text analysis tasks, such as information retrieval, text classification, sentiment analysis, and many others.
>
> Like the `text` and `lemma_` attributes, the `is_stop` attribute of a token tells you whether the word is a stop word.
>
>```python
># Check whether a single word is a stop word
>stop_word = "je"
>print(nlp(stop_word)[0].is_stop)
>```
> Determine the number of stop words in the **text** variable defined previously.

In [None]:
# Process text
stop_word = "je"
print(nlp(stop_word)[0].is_stop)

True


In [None]:
# insert your code here


print("The stop words are:", [token.text for token in doc if token.is_stop])
print("Number of stop words in text variable :", len([token.text for token in doc if token.is_stop]))

The stop words are: ['est', 'une', 'pour', 'du']
Number of stop words in text variable : 4
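
The stop-word check relies on a word list that ships with spaCy for each language. The sketch below reads the French list directly and builds a custom version that keeps negation words, which can matter for sentiment analysis. The helper `remove_stop_words` and the choice of words to keep are assumptions made for illustration.

```python
from spacy.lang.fr.stop_words import STOP_WORDS

# Negation words are often worth keeping for sentiment analysis
# (which words to keep is an assumption, not part of the course).
KEEP = {"ne", "pas"}
custom_stop_words = set(STOP_WORDS) - KEEP


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["je", "ne", "suis", "pas", "satisfait"]
print(remove_stop_words(tokens, custom_stop_words))
```

Working with a plain Python set keeps the filtering rule explicit and easy to adjust, at the cost of bypassing the `is_stop` attribute.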


> Drawing inspiration from the previous examples, define a function to tokenize and lemmatize the 10000 extracted comments.
>
> Hint: Don't forget to use the **apply** method.

In [None]:
# insert your code here

def tokenize_lemm_func(text):
    # Parse the text with the spaCy model
    doc = nlp(text)

    # Keep the lemma of every token that is not a stop word
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]

    # Join the lemmas back into a single space-separated string
    lemmatized_text = ' '.join(lemmatized_tokens)

    return lemmatized_text

df["text"] = df["text"].apply(tokenize_lemm_func)

In [None]:
df.head()

Unnamed: 0,page,titre,text,date,note,reponse,date_experience,fournisseur,source
4324,7,ce sont des escrocs de premieres,escroc premiere résilier sms facture payer e...,18 oct. 2021,1,,Date de l'expérience: 18 octobre 2021,https://fr.trustpilot.com/review/eni.fr,trustpilot
7807,141,ouverture du contrat très rapide et…,ouverture contrat rapide simple téléphone bo...,21 févr. 2021,4,,Date de l'expérience: 21 février 2021,https://fr.trustpilot.com/review/totalenergies.fr,trustpilot
114,6,Très bien pas de surprise comme avec…,bien surprise opérateur,14 nov. 2022,5,,Date de l'expérience: 14 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
22509,31,Avis client,j attend l installation faire conseiller top,le 19/11/2022 par Denis B.,5,,suite à une expérience du 25/10/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies
23330,113,Avis client,clair aimable patient ne compter heure...,le 16/10/2022 par Monique J.,5,,suite à une expérience du 19/09/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies
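
Calling the model once per comment through `apply` can be slow on thousands of comments. As a possible alternative, the sketch below processes the texts in batches with spaCy's `nlp.pipe`. The function name `lemmatize_texts` and the batch size are assumptions. Unlike the function above, this version also drops whitespace tokens and falls back to the original word when the pipeline provides no lemma.

```python
import spacy


def lemmatize_texts(texts, nlp, batch_size=256):
    """Lemmatize an iterable of strings in batches and drop stop words."""
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas = [
            # Fall back to the surface form if no lemma is available.
            token.lemma_ or token.text
            for token in doc
            if not token.is_stop and not token.is_space
        ]
        results.append(" ".join(lemmas))
    return results


# Usage (assumes the French model downloaded at the top of the notebook):
# nlp = spacy.load("fr_core_news_sm")
# df["text"] = lemmatize_texts(df["text"].tolist(), nlp)
```

The function returns one string per input text, in the same order, so the result can be assigned back to the dataframe column.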


> Generate a report using the Python library **dataprep** and compare it with the report generated at the beginning.

In [None]:
# insert your code here
from dataprep.eda import create_report
create_report(df)

Output hidden; open in https://colab.research.google.com to view.