<strong>
    <font color="#0E1117">
        Author: lprtk
    </font>
</strong>

<br/>
<br/>


<Center>
    <h1 style="font-family: Arial">
        <font color="#0E1117">
            NLP: sentiment analysis, topic modeling & sentiment prediction
        </font>
    </h1>
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            Notebook 2/5
        </font>
    </h3>
</Center>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Introduction & context
        </font>
    </h2>
</div>

<p style="text-align: justify">
    This project focuses on extracting information and value from large volumes of textual data using Natural Language Processing (NLP). There are several reasons for doing so:
</p>
<ul>
    <li><p style="text-align: justify">Improve the customer experience on the website, in the mobile application or in the office.</p></li>
    <li><p style="text-align: justify">Assess customer satisfaction differently.</p></li>
    <li><p style="text-align: justify">Evaluate the company's image.</p></li>
    <li><p style="text-align: justify">Be more available and accessible to customers.</p></li>
    <li><p style="text-align: justify">Depending on the company's activity: find new ways to improve the banking services offered, evaluate a seller on an online sales platform, or improve a product based on customer reviews.</p></li>
</ul>

<p style="text-align: justify">
    Our approach is organized in 5 main steps:
</p>
<ul>
    <li>
        <u>Step 1:</u> Web Scraping
        <ul>
            <li>Collect the data and create the data schema.</li>
            <li>Parse customer reviews to enrich the database: extract the title, description, date, time, nickname and rating.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 2:</u> Sentiment Analysis and Scoring
        <ul>
            <li>Understand and gauge the satisfaction of each customer.</li>
            <li>Score the intensity and polarity of the sentiment expressed in the review description.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 3:</u> Text mining and data cleaning
        <ul>
            <li>Text cleaning adapted to the sales domain and to the general content of reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 4:</u> Topic Modeling (unsupervised learning)
        <ul>
            <li>To improve availability and speed up response time, reviews can be separated and prioritized according to the topic they address.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 5:</u> Machine Learning (supervised learning)
        <ul>
            <li>Without reading future reviews, design a robust model to identify the overall sentiment expressed by the customer.</li>
        </ul>
    </li>
</ul>

In [None]:
!git clone https://github.com/lprtk/pyTCTK
!pip install git+https://github.com/cjhutto/vaderSentiment

fatal: destination path 'pyTCTK' already exists and is not an empty directory.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/cjhutto/vaderSentiment
  Cloning https://github.com/cjhutto/vaderSentiment to /tmp/pip-req-build-m8oqr1b0
  Running command git clone -q https://github.com/cjhutto/vaderSentiment /tmp/pip-req-build-m8oqr1b0


In [None]:
import sys


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Libraries import
        </font>
    </h2>
</div>

In [None]:
import pandas as pd
from pyTCTK.codefile.pyTCTK import TextNet, WordNet
from vaderSentiment.vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings("ignore")

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Data import
        </font>
    </h2>
</div>

In [None]:
df_data = pd.read_csv(filepath_or_buffer="amzn_customer_reviews.csv", sep=",")

In [None]:
df_data.head(3)

Unnamed: 0,Pseudo,Title,Review,Rating,Verified Purchase,Date
0,R.A.O,"\nA Small, But Very Powerful Device\n",\nReview is based of having the computer as a ...,4.5 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...
1,Bee Lor,\nExcellent portable gaming laptop\n,\nI'm writing this review for anyone who's on ...,5.0 out of 5 stars,True,"Reviewed in the United States 🇺🇸 on January 6,..."
2,R.A.O,\nFour Stars\n,"\nTruly a nice product, but overpriced in gene...",3.0 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Text cleaning: first part
        </font>
    </h2>
</div>

> "It’s all about the Data. Key to any successful algorithm is a good dataset" (source: National Aeronautics and Space Administration, You Can Help Train NASA’s Rovers to Better Explore Mars | NASA.com).

<p style="text-align: justify">
    Before moving on to the modeling stage, the data must be pre-processed: check for missing values, absurd values (outliers), duplicates, uninterpretable features and strongly correlated variables; re-encode certain features; convert formats; and create or drop variables as needed. This cleaning is the crucial first step of the project and directly influences the performance and predictions of the later models. Data quality is receiving more and more attention because poor-quality, unprepared or uninterpretable variables only amplify the black-box effect of certain models.
</p>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Filter reviews, remove spaces and clean some specific characters
        </font>
    </h3>
</div>

<p style="text-align: justify">
    We first apply a very light cleaning (removing extra spaces and HTML tags) before the sentiment analysis, because the VADER algorithm takes capital letters, negations and punctuation into account when computing its sentiment scores.
</p>
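
<p style="text-align: justify">
    To make the idea of a "light" cleaning concrete, here is a minimal standard-library sketch. It is not the pyTCTK implementation: the helper name <code>light_clean</code> and both regular expressions are assumptions made for illustration. It removes tags and extra whitespace while deliberately leaving case and punctuation untouched, since those are the cues mentioned above.
</p>

```python
import html
import re

# Hypothetical helper, for illustration only (not part of pyTCTK).
TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs and newlines


def light_clean(text: str) -> str:
    """Strip HTML tags and extra whitespace; keep case and punctuation."""
    text = TAG_RE.sub(" ", text)        # drop tags such as <br/>
    text = html.unescape(text)          # turn entities like &amp; back into characters
    return SPACE_RE.sub(" ", text).strip()


print(light_clean("\nGREAT laptop,<br/> but   NOT cheap!!!\n"))
# GREAT laptop, but NOT cheap!!!
```

<p style="text-align: justify">
    Capital letters ("GREAT", "NOT") and the repeated exclamation marks survive this step; they are only removed in the second cleaning pass, after the sentiment scores have been computed.
</p>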

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_space()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_space()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_whitespace()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_whitespace()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).additional_cleaning(
    add_regexs=None
)

df_data = TextNet(
    data=df_data,
    column="Title"
).additional_cleaning(
    add_regexs=None
)

In [None]:
df_data.head(3)

Unnamed: 0,Pseudo,Title,Review,Rating,Verified Purchase,Date
0,R.A.O,"A Small, But Very Powerful Device",Review is based of having the computer as a da...,4.5 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...
1,Bee Lor,Excellent portable gaming laptop,I'm writing this review for anyone who's on th...,5.0 out of 5 stars,True,"Reviewed in the United States 🇺🇸 on January 6,..."
2,R.A.O,Four Stars,"Truly a nice product, but overpriced in genera...",3.0 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Sentiment analysis: VADER
        </font>
    </h2>
</div>

In [None]:
vader = SentimentIntensityAnalyzer()

In [None]:
df_data["Score"] = df_data["Review"].apply(
    lambda review: vader.polarity_scores(review)
)
df_data["Compound"] = df_data["Score"].apply(
    lambda score_dict: score_dict["compound"]
)
df_data["Sentiment"] = df_data["Compound"].apply(
    lambda sent: "positive" if sent > 0 else ("neutral" if sent == 0 else "negative")
)
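
<p style="text-align: justify">
    The labeling rule above can also be written as a small function, which makes the neutral band explicit and easy to change. The name <code>label_sentiment</code> and its <code>threshold</code> parameter are assumptions introduced for this sketch. With <code>threshold=0.0</code> it reproduces the rule used in this notebook; the VADER README suggests a neutral band of roughly ±0.05 around zero, which may be worth trying (in this sketch, values exactly on the boundary are labeled neutral).
</p>

```python
def label_sentiment(compound: float, threshold: float = 0.0) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    Hypothetical helper: `threshold` is the half-width of the neutral band.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Same rule as the notebook (neutral only when the score is exactly 0)
print(label_sentiment(0.6369))                  # positive
# Wider neutral band: weak scores are no longer forced into a polarity
print(label_sentiment(0.03, threshold=0.05))    # neutral
```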

In [None]:
df_data["Sentiment"].value_counts()

positive    743
negative    166
neutral      82
Name: Sentiment, dtype: int64

In [None]:
df_data.head(3)

Unnamed: 0,Pseudo,Title,Review,Rating,Verified Purchase,Date,Score,Compound,Sentiment
0,R.A.O,"A Small, But Very Powerful Device",Review is based of having the computer as a da...,4.5 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...,"{'neg': 0.015, 'neu': 0.82, 'pos': 0.165, 'com...",0.9984,positive
1,Bee Lor,Excellent portable gaming laptop,I'm writing this review for anyone who's on th...,5.0 out of 5 stars,True,"Reviewed in the United States 🇺🇸 on January 6,...","{'neg': 0.016, 'neu': 0.88, 'pos': 0.104, 'com...",0.9921,positive
2,R.A.O,Four Stars,"Truly a nice product, but overpriced in genera...",3.0 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...,"{'neg': 0.066, 'neu': 0.831, 'pos': 0.102, 'co...",0.6369,positive


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Text cleaning: second part
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Lowercase
        </font>
    </h3>
</div>

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).lowercase()

df_data = TextNet(
    data=df_data,
    column="Title"
).lowercase()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Punctuation
        </font>
    </h3>
</div>

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_punctuation()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_punctuation()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            3) Specific cleaning
        </font>
    </h3>
</div>

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_url()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_url()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_html()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_html()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_email()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_email()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_digit()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_digit()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_mention()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_mention()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_hastag()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_hastag()

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_emoji()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_emoji()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            4) Remove stopwords
        </font>
    </h3>
</div>

In [None]:
stopwords_to_keep = [
    "doesn", "doesn't", "doesnt", "dont", "don't", "not", "wasn't", "wasnt",
    "aren", "aren't", "arent",  "couldn", "couldn't", "couldnt", "didn",
    "didn't", "didnt", "hadn", "hadn't", "hadnt",  "hasn", "hasn't", "hasnt",
    "haven't", "havent", "isn", "isn't", "isnt", "mightn",  "mightn't",
    "mightnt", "mustn", "mustn't", "mustnt", "needn", "needn't", "neednt",
    "shan", "shan't", "shant", "shouldn", "shouldn't", "shouldnt", "wasn",
    "wasn't",  "wasnt", "weren", "weren't", "werent", "won", "won't", "wont",
    "wouldn", "wouldn't", "wouldnt", "good", "bad", "worst", "wonderfull",
    "best", "better"
]

stopwords_to_add = [
    "es", "que", "en", "la", "las", "le", "les", "lo", "los", "de", "no",
    "el", "al", "un", "una", "se", "sa", "su", "sus", "por", "con", "mi",
    "para", "todo", "gb", "laptop", "computer", "pc"
]

In [None]:
df_data = WordNet(
    data=df_data,
    column="Review"
).remove_stopword(
    language="english",
    lowercase=False,
    remove_accents=False,
    add_stopwords=stopwords_to_add,
    remove_stopwords=stopwords_to_keep
)

df_data = WordNet(
    data=df_data,
    column="Title"
).remove_stopword(
    language="english",
    lowercase=False,
    remove_accents=False,
    add_stopwords=stopwords_to_add,
    remove_stopwords=stopwords_to_keep
)
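
<p style="text-align: justify">
    The effect of the two lists can be illustrated with a plain-Python sketch. This is an assumption about the behavior suggested by the parameter names <code>add_stopwords</code> and <code>remove_stopwords</code>, not the actual pyTCTK code; the helper names and the tiny base list are hypothetical. Words in the "keep" list are taken out of the stopword set so that negations and opinion words remain in the text.
</p>

```python
def build_stopwords(base, to_add, to_keep):
    """Extend a base stopword list, then protect the words we want to keep."""
    return (set(base) | set(to_add)) - set(to_keep)


def drop_stopwords(text, stopwords):
    """Remove stopwords from a whitespace-tokenized, lowercased text."""
    return " ".join(word for word in text.split() if word not in stopwords)


# Tiny stand-in for a real English stopword list (illustrative only)
base = {"the", "is", "not", "a", "this"}
stopwords = build_stopwords(base, to_add=["laptop"], to_keep=["not"])

print(drop_stopwords("this laptop is not good", stopwords))
# not good
```

<p style="text-align: justify">
    Without the "keep" list, "not" would be dropped and the review would be reduced to "good", which inverts its meaning for the later modeling steps.
</p>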

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            5) Lemmatization process
        </font>
    </h3>
</div>

In [None]:
df_data = WordNet(
    data=df_data,
    column="Review"
).lemmatize(
    language="english",
    lowercase=False,
    remove_accents=False
)

df_data = WordNet(
    data=df_data,
    column="Title"
).lemmatize(
    language="english",
    lowercase=False,
    remove_accents=False
)

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            6) Empty lines
        </font>
    </h3>
</div>

In [None]:
df_data = df_data.dropna()
df_data.reset_index(drop=True, inplace=True)

In [None]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_whitespace()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_whitespace()

In [None]:
df_data.head(3)
df_data["Date"][0]

'Reviewed in the United States 🇺🇸 on November 27, 2021'

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Feature Engineering
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Rating
        </font>
    </h3>
</div>

In [None]:
def feature_rating(text, dataframe) -> pd.core.frame.DataFrame:
    """
    Function that extracts the customer's rating and converts it to a float.
    
    Parameters
    ----------
    text : str or pandas.core.series.Series
        Text from which to extract the customer's rating.
    
    dataframe : pandas.core.frame.DataFrame
        Dataframe that allows the extraction of the final results.

    Returns
    -------
    pandas.core.frame.DataFrame
        Dataframe that contains the final result.

    """
    rating = []
    for i in range(0, len(text)):
        row = text[i].split(" ")
        row = row[0].replace(",", ".")
        row = float(row)
        rating.append(row)
    
    df_data = pd.DataFrame({"New rating": rating})
    dataframe = pd.concat([dataframe, df_data["New rating"]], axis=1)
    
    return dataframe

df_data = feature_rating(text=df_data["Rating"], dataframe=df_data)
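
<p style="text-align: justify">
    As a possible alternative to splitting on spaces, the leading number can be extracted with a regular expression. The sketch below is a hypothetical helper (<code>parse_rating</code> is not part of the notebook) that accepts either a dot or a comma as decimal separator and returns <code>None</code> instead of raising when no rating is found.
</p>

```python
import re
from typing import Optional

# Leading number with an optional decimal part, e.g. "4.5" or "4,5"
RATING_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)")


def parse_rating(text: str) -> Optional[float]:
    """Return the rating at the start of `text`, or None if there is none."""
    match = RATING_RE.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


print(parse_rating("4.5 out of 5 stars"))   # 4.5
print(parse_rating("no rating"))            # None
```

<p style="text-align: justify">
    It could then be applied column-wise, for example with <code>df_data["Rating"].apply(parse_rating)</code>, which also avoids relying on a freshly reset index.
</p>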

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Date
        </font>
    </h3>
</div>

In [None]:
def feature_date(text, dataframe) -> pd.core.frame.DataFrame:
    """
    Function to extract the date from a text and convert it to datetime.

    Parameters
    ----------
    text : str or pandas.core.series.Series
        Text from which to extract the review's date.
    
    dataframe : pandas.core.frame.DataFrame
        Dataframe that allows the extraction of the final results.

    Returns
    -------
    pandas.core.frame.DataFrame
        Dataframe that contains the final result.

    """
    # Month name -> month number (both capitalized and lowercase spellings)
    root = {
        "January": "01", "january": "01",
        "February": "02", "february": "02",
        "March": "03", "march": "03",
        "April": "04", "april": "04",
        "May": "05", "may": "05",
        "June": "06", "june": "06",
        "July": "07", "july": "07",
        "August": "08", "august": "08",
        "September": "09", "september": "09",
        "October": "10", "october": "10",
        "November": "11", "november": "11",
        "December": "12", "december": "12"
    }

    date = []
    for i in range(0, len(text)):
        row = text[i].split(" ")
        row = " ".join(row[row.index("on") + 1:row.index("on") + 4])
        date.append(row)

    datetime = [word.split(" ") for word in date]

    date = []
    for element in datetime:
        for key, value in root.items():
            if key not in element:
                continue

            index = element.index(key)
            element[index] = value
        row = "/".join(element)
        date.append(row)
    
    df_data = pd.DataFrame({"New date": date})
    df_data["New date"] = pd.to_datetime(df_data["New date"])
    dataframe = pd.concat([dataframe, df_data["New date"]], axis=1)
    
    return dataframe

df_data = feature_date(text=df_data["Date"], dataframe=df_data)
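
<p style="text-align: justify">
    The same extraction can be sketched with the standard library only. The helper below is hypothetical (<code>parse_review_date</code> is not part of the notebook) and assumes the date always follows the word "on" in the form "Month day, year" with an English month name, as in the sample shown earlier.
</p>

```python
import re
from datetime import date
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
# "... on November 27, 2021" -> ("November", "27", "2021")
DATE_RE = re.compile(r"\bon\s+([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})")


def parse_review_date(text: str) -> Optional[date]:
    """Return the review date found after 'on', or None if it cannot be parsed."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    if month_name.lower() not in MONTHS:
        return None
    month = MONTHS.index(month_name.lower()) + 1
    return date(int(year), month, int(day))


print(parse_review_date("Reviewed in the United States on November 27, 2021"))
# 2021-11-27
```

<p style="text-align: justify">
    Looking the month up in an explicit list keeps the parsing independent of the machine's locale, and unparseable rows yield <code>None</code> rather than a malformed string.
</p>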

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            3) Verified purchase
        </font>
    </h3>
</div>

In [None]:
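# Not necessary given how the data was processed upstream: the column
# already holds booleans. Assigning the result avoids a chained in-place update.
df_data["Verified Purchase"] = df_data["Verified Purchase"].replace(
    to_replace="Verified Purchase",
    value="TRUE"
)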

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            4) Country
        </font>
    </h3>
</div>

In [None]:
def feature_country(text, dataframe) -> pd.core.frame.DataFrame:
    """
    Function to extract the country from a text.

    Parameters
    ----------
    text : str or pandas.core.series.Series
        Text from which to extract the review's country.
    
    dataframe : pandas.core.frame.DataFrame
        Dataframe that allows the extraction of the final results.

    Returns
    -------
    pandas.core.frame.DataFrame
        Dataframe that contains the final result.

    """
    country = []
    for i in range(0, len(text)):
        row = text[i].split(" ")
        row = row[3] + " " + row[4]
        country.append(row)
    
    df_data = pd.DataFrame({"Country": country})
    dataframe = pd.concat([dataframe, df_data["Country"]], axis=1)
    
    return dataframe

df_data = feature_country(text=df_data["Date"], dataframe=df_data)
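
<p style="text-align: justify">
    Taking the fourth and fifth tokens works for "the United States" but would not for a one-word country. A more tolerant variant can be sketched with a regular expression; the helper <code>extract_country</code> is hypothetical, and it assumes the text always starts with "Reviewed in", optionally followed by "the", and that the country name is followed (after an optional flag emoji) by the word "on".
</p>

```python
import re
from typing import Optional

# Country = one or more words made of letters, lazily extended until the
# remaining text is "<spaces/emoji> on <date>".
COUNTRY_RE = re.compile(
    r"Reviewed in (?:the )?"
    r"(?P<country>[^\W\d_]+(?: [^\W\d_]+)*?)"
    r"(?=\W+on\s)"
)


def extract_country(text: str) -> Optional[str]:
    """Return the country named after 'Reviewed in', or None if not found."""
    match = COUNTRY_RE.match(text)
    return match.group("country") if match else None


print(extract_country("Reviewed in the United States 🇺🇸 on November 27, 2021"))
# United States
print(extract_country("Reviewed in Canada on May 1, 2020"))
# Canada
```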

In [None]:
df_data.head(3)

Unnamed: 0,Pseudo,Title,Review,Rating,Verified Purchase,Date,Score,Compound,Sentiment,New rating,New date,Country
0,R.A.O,small powerful device,review base daily driver week meant gaming med...,4.5 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...,"{'neg': 0.015, 'neu': 0.82, 'pos': 0.165, 'com...",0.9984,positive,4.5,2021-11-27,United States
1,Bee Lor,excellent portable gaming,write review anyone fence purchasing since rea...,5.0 out of 5 stars,True,"Reviewed in the United States 🇺🇸 on January 6,...","{'neg': 0.016, 'neu': 0.88, 'pos': 0.104, 'com...",0.9921,positive,5.0,2022-01-06,United States
2,R.A.O,stars,truly nice product overpriced general example ...,3.0 out of 5 stars,True,Reviewed in the United States 🇺🇸 on November 2...,"{'neg': 0.066, 'neu': 0.831, 'pos': 0.102, 'co...",0.6369,positive,3.0,2021-11-27,United States


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Data export
        </font>
    </h2>
</div>

In [None]:
df_data.to_csv(path_or_buf="amzn_customer_reviews.csv", sep=",", index=False)