-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Library imports
        </font>
    </h2>
</div>

In [57]:
import pandas as pd
from pyTCTK import TextNet, WordNet
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings("ignore")

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Data import
        </font>
    </h2>
</div>

In [58]:
df_data = pd.read_csv(filepath_or_buffer="upv1.csv", sep=",", encoding="ISO-8859-1")

In [59]:
df_data['Review'] = df_data['Review'].astype(str)

In [60]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1009 entries, 0 to 1008
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    1009 non-null   object
 1   Rating  1009 non-null   int64 
 2   Date    1009 non-null   object
 3   Review  1009 non-null   object
 4   Title   1009 non-null   object
dtypes: int64(1), object(4)
memory usage: 39.5+ KB


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Text cleaning: first part
        </font>
    </h2>
</div>

> "It's all about the data. Key to any successful algorithm is a good dataset." (Source: National Aeronautics and Space Administration, "You Can Help Train NASA's Rovers to Better Explore Mars", NASA.com)

<p style="text-align: justify">
    Before moving on to the modeling stage, the data needs thorough pre-processing: check for missing values, absurd values (outliers), duplicates, uninterpretable features, and correlation between variables; re-encode features where needed; convert formats; and create new variables or drop obsolete ones. This cleaning is the first, crucial step of the project, and it directly influences the performance and predictions of the later models. Data quality matters more and more, because poor-quality, unprepared, or uninterpretable variables only amplify the black-box effect of certain models.
</p>
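
A few of the checks listed above can be sketched with plain pandas. This is only an illustration on a made-up toy frame (`df_demo` and its values are assumptions, not the review data), and the 1 to 5 rating scale is assumed from the previews shown later in the notebook.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration).
df_demo = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "Rating": [5, 4, 4, 40],
        "Review": ["good", "nice paste", "nice paste", None],
    }
)

# Number of missing values in each column.
missing_per_column = df_demo.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_count = int(df_demo.duplicated().sum())

# Rows whose rating falls outside the assumed 1-5 scale.
out_of_range = df_demo[~df_demo["Rating"].between(1, 5)]

print(missing_per_column)
print("duplicates:", duplicate_count)
print(out_of_range)
```

On the real data, the same three expressions could be applied to `df_data` directly.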

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Filter reviews, remove spaces and clean some specific characters
        </font>
    </h3>
</div>

<p style="text-align: justify">
    We start with a very light cleaning pass (removing extra spaces and HTML tags) before running the sentiment analysis, because the VADER algorithm uses capitalization, negation, and punctuation when computing its sentiment scores.
</p>
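
To make the idea of "light" cleaning concrete, here is a minimal standard-library sketch. It is not how pyTCTK implements its methods; `light_clean` and the two regular expressions are hypothetical helpers written for this illustration. The point is that tags and extra whitespace go away while case and punctuation stay untouched.

```python
import re

# Hypothetical patterns for this sketch: anything between < and >, and runs of whitespace.
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def light_clean(text: str) -> str:
    """Strip HTML tags and collapse whitespace, leaving case and punctuation intact."""
    without_tags = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", without_tags).strip()


sample = "  This toothpaste is GREAT!!! <br/>  Not  bad at all. "
cleaned = light_clean(sample)
print(cleaned)
```

The capital letters, the exclamation marks, and the negation all survive, which is what the sentiment scoring step relies on.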

In [61]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_space()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_space()

In [63]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_whitespace()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_whitespace()

In [64]:
df_data = TextNet(
    data=df_data,
    column="Review"
).additional_cleaning(
    add_regexs=None
)

df_data = TextNet(
    data=df_data,
    column="Title"
).additional_cleaning(
    add_regexs=None
)

In [65]:
df_data.head(3)

Unnamed: 0,Name,Rating,Date,Review,Title
0,Dharmesh Tolia,5,27-01-2024,Dabur Meswak Complete Oral Care Toothpaste is ...,t's more than just toothpaste; it's a holistic...
1,Rohit yadav,4,26-01-2024,Dabur Meswak Complete Oral Care Toothpaste I u...,Dabur Meswak Complete Oral Care Toothpaste
2,Anabayan,5,29-01-2024,Unique taste and good product,Unique taste and good product


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Sentiment analysis: VADER
        </font>
    </h2>
</div>

In [66]:
vader = SentimentIntensityAnalyzer()

In [67]:
df_data["Score"] = df_data["Review"].apply(
    lambda review: vader.polarity_scores(review)
)
df_data["Compound"] = df_data["Score"].apply(
    lambda score_dict: score_dict["compound"]
)
df_data["Sentiment"] = df_data["Compound"].apply(
    lambda sent: "positive" if sent > 0 else ("neutral" if sent == 0 else "negative")
)

In [68]:
df_data["Sentiment"].value_counts()

positive    765
neutral     171
negative     73
Name: Sentiment, dtype: int64
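
The labels above split on the sign of the compound score. The VADER project's own documentation suggests a small dead zone instead: positive at 0.05 or above, negative at -0.05 or below, neutral in between. The sketch below shows that alternative rule; `label_from_compound` is a hypothetical helper, and applying it would change the counts reported above.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a label using a symmetric neutral band."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A few compound scores, including a weakly positive one that falls in the neutral band.
for score in (0.9879, 0.0, -0.1027, 0.03):
    print(score, label_from_compound(score))
```

With this rule a score such as 0.03 is labeled neutral, whereas the sign-based rule used in the notebook labels it positive.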

In [69]:
df_data.head(3)

Unnamed: 0,Name,Rating,Date,Review,Title,Score,Compound,Sentiment
0,Dharmesh Tolia,5,27-01-2024,Dabur Meswak Complete Oral Care Toothpaste is ...,t's more than just toothpaste; it's a holistic...,"{'neg': 0.0, 'neu': 0.671, 'pos': 0.329, 'comp...",0.9879,positive
1,Rohit yadav,4,26-01-2024,Dabur Meswak Complete Oral Care Toothpaste I u...,Dabur Meswak Complete Oral Care Toothpaste,"{'neg': 0.0, 'neu': 0.57, 'pos': 0.43, 'compou...",0.9575,positive
2,Anabayan,5,29-01-2024,Unique taste and good product,Unique taste and good product,"{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compou...",0.4404,positive


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Text cleaning: second part
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Lowercase
        </font>
    </h3>
</div>

In [70]:
df_data = TextNet(
    data=df_data,
    column="Review"
).lowercase()

df_data = TextNet(
    data=df_data,
    column="Title"
).lowercase()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Punctuation
        </font>
    </h3>
</div>

In [71]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_punctuation()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_punctuation()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            3) Specific cleaning
        </font>
    </h3>
</div>

In [72]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_url()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_url()

In [73]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_html()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_html()

In [74]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_email()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_email()

In [75]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_digit()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_digit()

In [76]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_mention()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_mention()

In [77]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_hastag()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_hastag()

In [78]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_emoji()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_emoji()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            4) Remove stopwords
        </font>
    </h3>
</div>

In [79]:
# Negations and sentiment-bearing words that should stay in the text;
# they are passed below as remove_stopwords.
stopwords_to_keep = [
    "doesn", "doesn't", "doesnt", "dont", "don't", "not",
    "aren", "aren't", "arent",  "couldn", "couldn't", "couldnt", "didn",
    "didn't", "didnt", "hadn", "hadn't", "hadnt",  "hasn", "hasn't", "hasnt",
    "haven't", "havent", "isn", "isn't", "isnt", "mightn",  "mightn't",
    "mightnt", "mustn", "mustn't", "mustnt", "needn", "needn't", "neednt",
    "shan", "shan't", "shant", "shouldn", "shouldn't", "shouldnt", "wasn",
    "wasn't",  "wasnt", "weren", "weren't", "werent", "won", "won't", "wont",
    "wouldn", "wouldn't", "wouldnt", "good", "bad", "worst", "wonderfull",
    "best", "better"
]

stopwords_to_add = [
    "es", "que", "en", "la", "las", "le", "les", "lo", "los", "de", "no",
    "el", "al", "un", "una", "se", "sa", "su", "sus", "por", "con", "mi",
    "para", "todo", "gb", "laptop", "computer", "pc"
]

In [80]:
df_data = WordNet(
    data=df_data,
    column="Review"
).remove_stopword(
    language="english",
    lowercase=False,
    remove_accents=False,
    add_stopwords=stopwords_to_add,
    remove_stopwords=stopwords_to_keep
)

df_data = WordNet(
    data=df_data,
    column="Title"
).remove_stopword(
    language="english",
    lowercase=False,
    remove_accents=False,
    add_stopwords=stopwords_to_add,
    remove_stopwords=stopwords_to_keep
)
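
The interplay between the two lists can be illustrated without any library. This is a sketch of the general idea only, not pyTCTK's implementation: the tiny `base_stopwords` set stands in for a real English list, and giving the "keep" list priority over the "add" list is an assumption of this example.

```python
# Tiny stand-in for a real English stopword list (assumption for this sketch).
base_stopwords = {"the", "is", "not", "a", "this"}
to_add = {"la", "de"}   # extra words to treat as stopwords
to_keep = {"not"}       # words to protect, even if the base list contains them

# Effective stopword set: base list plus additions, minus protected words.
effective = (base_stopwords | to_add) - to_keep


def drop_stopwords(text: str, stopwords: set) -> str:
    """Remove every whitespace-separated token found in the stopword set."""
    return " ".join(token for token in text.split() if token not in stopwords)


result = drop_stopwords("this is not a good paste de la", effective)
print(result)
```

Protecting negations matters here because dropping "not" would turn "not good" into "good" in the cleaned text.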

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            5) Lemmatization process
        </font>
    </h3>
</div>

In [81]:
df_data = WordNet(
    data=df_data,
    column="Review"
).lemmatize(
    language="english",
    lowercase=False,
    remove_accents=False
)

df_data = WordNet(
    data=df_data,
    column="Title"
).lemmatize(
    language="english",
    lowercase=False,
    remove_accents=False
)

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            6) Empty lines
        </font>
    </h3>
</div>

In [82]:
#df_data.drop(index=[282, 749, 859, 883, 1013, 1014, 1061], axis=0, inplace=True)
df_data.reset_index(drop=True, inplace=True)
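
Rather than listing row indices by hand, as in the commented-out line above, rows whose review became empty during cleaning can be found programmatically. The sketch below uses a made-up toy frame (`df_demo` is an assumption); on the real data the same mask could be built from `df_data["Review"]`.

```python
import pandas as pd

# Toy frame: two reviews are empty or whitespace-only after cleaning.
df_demo = pd.DataFrame(
    {"Review": ["good product", "", "   ", "worth price"], "Rating": [5, 3, 4, 5]}
)

# True where the review still contains at least one non-whitespace character.
has_text = df_demo["Review"].str.strip().ne("")

# Keep only rows with text and renumber the index from zero.
df_demo = df_demo[has_text].reset_index(drop=True)
print(df_demo)
```

This avoids hard-coded indices that stop matching once the input file changes.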

In [83]:
df_data = TextNet(
    data=df_data,
    column="Review"
).remove_whitespace()

df_data = TextNet(
    data=df_data,
    column="Title"
).remove_whitespace()

In [84]:
df_data.head(20)

Unnamed: 0,Name,Rating,Date,Review,Title,Score,Compound,Sentiment
0,Dharmesh Tolia,5,27-01-2024,dabur meswak complete oral care toothpaste gam...,toothpaste holistic approach oral care,"{'neg': 0.0, 'neu': 0.671, 'pos': 0.329, 'comp...",0.9879,positive
1,Rohit yadav,4,26-01-2024,dabur meswak complete oral care toothpaste use...,dabur meswak complete oral care toothpaste,"{'neg': 0.0, 'neu': 0.57, 'pos': 0.43, 'compou...",0.9575,positive
2,Anabayan,5,29-01-2024,unique taste good product,unique taste good product,"{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compou...",0.4404,positive
3,Placeholder,5,29-12-2023,first time use product nice,good product,"{'neg': 0.0, 'neu': 0.646, 'pos': 0.354, 'comp...",0.669,positive
4,G,3,04-01-2024,quality definitely since start product disappo...,use better,"{'neg': 0.196, 'neu': 0.633, 'pos': 0.171, 'co...",-0.1027,negative
5,vinay sainath reddy,4,02-02-2024,taste flavour toothpaste good refreshingelimin...,flavour good,"{'neg': 0.115, 'neu': 0.729, 'pos': 0.155, 'co...",0.4019,positive
6,Seysmail,5,13-01-2024,worth price,nice,"{'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'comp...",0.2263,positive
7,Suresh Singh,5,26-12-2023,good product,nice,"{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...",0.4404,positive
8,Zaki,4,10-02-2024,good toothpaste usesmell good,dabur miswak,"{'neg': 0.0, 'neu': 0.408, 'pos': 0.592, 'comp...",0.7003,positive
9,Vikas,4,17-01-2024,good,paste,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.4404,positive


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Feature Engineering
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Rating
        </font>
    </h3>
</div>

In [85]:
# def feature_rating(text, dataframe) -> pd.core.frame.DataFrame:
#     """
#     Function that extracts the customer's rating and converts it to a float.
    
#     Parameters
#     ----------
#     text : str or pandas.core.series.Series
#         Text from which to extract the client's note.
    
#     dataframe : pandas.core.frame.DataFrame
#         Dataframe that allows the extraction of the final results.

#     Returns
#     -------
#     pandas.core.frame.DataFrame
#         Dataframe that contains the final result.

#     """
#     rating = []
#     for i in range(0, len(text)):
#         row = text[i].split(" ")
#         row = row[0].replace(",", ".")
#         row = float(row)
#         rating.append(row)
    
#     df_data = pd.DataFrame({"New rating": rating})
#     dataframe = pd.concat([dataframe, df_data["New rating"]], axis=1)
    
#     return dataframe

# df_data = feature_rating(text=df_data["Rating"], dataframe=df_data)

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Date
        </font>
    </h3>
</div>

In [86]:
# def feature_date(text, dataframe) -> pd.core.frame.DataFrame:
#     """
#     Function to extract the date from a text and convert it to datetime.

#     Parameters
#     ----------
#     text : str or pandas.core.series.Series
#         Text from which to extract the review's date.
    
#     dataframe : pandas.core.frame.DataFrame
#         Dataframe that allows the extraction of the final results.

#     Returns
#     -------
#     pandas.core.frame.DataFrame
#         Dataframe that contains the final result.

#     """
#     root = {
#         "January": "01", "january": "01",
#         "February": "02", "february": "02",
#         "March": "03", "march": "03",
#         "April": "04", "april": "04",
#         "May": "05", "may": "05",
#         "June": "06", "june": "06",
#         "July": "07", "july": "07",
#         "August": "08", "august": "08",
#         "September": "09", "september": "09",
#         "October": "10", "october": "10",
#         "November": "11", "november": "11",
#         "December": "12", "december": "12"
#     }

#     date = []
#     for i in range(0, len(text)):
#         row = text[i].split(" ")
#         row = row[7] + " " + row[6] + " " + row[8]
#         date.append(row)

#     datetime = [word.split(" ") for word in date]

#     date = []
#     for element in datetime:
#         for key, value in root.items():
#             if key not in element:
#                 continue

#             index = element.index(key)
#             element[index] = value
#         row = "/".join(element)
#         date.append(row)
    
#     df_data = pd.DataFrame({"New date": date})
#     df_data["New date"] = pd.to_datetime(df_data["New date"])
#     dataframe = pd.concat([dataframe, df_data["New date"]], axis=1)
    
#     return dataframe

# df_data = feature_date(text=df_data["Date"], dataframe=df_data)
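
The `Date` values shown in the previews (for example `27-01-2024`) already look like day-month-year strings, so a direct conversion may be enough for this dataset. The sketch below assumes that format holds for every row, which has not been checked here; the sample values are copied from the previews.

```python
import pandas as pd

# Sample values in the day-month-year layout seen in the previews.
dates = pd.Series(["27-01-2024", "26-01-2024", "04-01-2024"])

# An explicit format avoids ambiguous dates such as 04-01-2024 being read month-first.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

With the explicit format, `04-01-2024` is read as 4 January rather than 1 April.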

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            3) Verified purchase
        </font>
    </h3>
</div>

In [87]:
# df_data["Verified Purchase"].replace(
#     to_replace="Verified Purchase",
#     value="Yes",
#     inplace=True
# )

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            4) Country
        </font>
    </h3>
</div>

In [88]:
# def feature_country(text, dataframe) -> pd.core.frame.DataFrame:
#     """
#     Function to extract the country from a text.

#     Parameters
#     ----------
#     text : str or pandas.core.series.Series
#         Text from which to extract the review's country.
    
#     dataframe : pandas.core.frame.DataFrame
#         Dataframe that allows the extraction of the final results.

#     Returns
#     -------
#     pandas.core.frame.DataFrame
#         Dataframe that contains the final result.

#     """
#     country = []
#     for i in range(0, len(text)):
#         row = text[i].split(" ")
#         row = row[3] + " " + row[4]
#         country.append(row)
    
#     df_data = pd.DataFrame({"Country": country})
#     dataframe = pd.concat([dataframe, df_data["Country"]], axis=1)
    
#     return dataframe

# df_data = feature_country(text=df_data["Date"], dataframe=df_data)

In [89]:
df_data.head(3)

Unnamed: 0,Name,Rating,Date,Review,Title,Score,Compound,Sentiment
0,Dharmesh Tolia,5,27-01-2024,dabur meswak complete oral care toothpaste gam...,toothpaste holistic approach oral care,"{'neg': 0.0, 'neu': 0.671, 'pos': 0.329, 'comp...",0.9879,positive
1,Rohit yadav,4,26-01-2024,dabur meswak complete oral care toothpaste use...,dabur meswak complete oral care toothpaste,"{'neg': 0.0, 'neu': 0.57, 'pos': 0.43, 'compou...",0.9575,positive
2,Anabayan,5,29-01-2024,unique taste good product,unique taste good product,"{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compou...",0.4404,positive


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Data export
        </font>
    </h2>
</div>

In [90]:
df_data.to_csv(path_or_buf="amzn_customer_reviews_pre.csv", sep=",", index=False)

PermissionError: [Errno 13] Permission denied: 'amzn_customer_reviews_pre.csv'