
## NLP Lab 

The purpose of today's lab is to calculate and validate customer sentiment (polarity) from reviews of purchased women's clothing. It is an introduction to using NLP techniques on text data and to retrieving actionable insights per department / division at the women's clothing company.


> This lab has been designed to be completed at either **Starter level** or **Stretch level**, in order to accommodate different technical / time / motivation levels in the class. **Starter** is the minimum required deliverable for today's lab, so you should complete the steps below as directed before submitting the url to your notebook via the student portal. **Stretch** will take you longer, and the actions you take are prompted rather than prescribed.


You will follow the instructions and concepts you saw in class today to:
+ retrieve the data 
+ sample customer reviews using textblob
+ use a function to apply sentiment analysis to the whole data set
+ visualise the sentiment by department / division

*(optional - Stretch)*

+ visualise to validate the sentiment analysis
+ apply another sentiment analyser
+ identify and evaluate the differences between each approach

#### Each step below describes what to do, and key stages are accompanied by prompt images in [this folder](https://github.com/student-IH-labs-and-stuff/BER-DAFT-MAR21/tree/main/Labs/NLPscreenshots) to confirm you are on the right track

------

### Starter steps 

1. retrieve the data from this [kaggle link](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews)
2. install textblob onto your conda environment
3. launch jupyter notebook in the same environment, and import pandas, matplotlib/seaborn and textblob 
4. read the data into a dataframe with pandas
5. filter the data frame to a relevant subset of columns required for our data scenario (image clothing_columns.png)
6. use df.columns.str.lower() (or other preferred method) to standardise your column headers
7. EDA - with descriptive statistics (eg `describe()`, `shape`, `info()`) and/or simple charts, explore and familiarise yourself with the data at your own pace - clarify what each column means/contains and what cleaning steps could be employed (and if needed for our scenario)
8. use the pandas [groupby function](https://realpython.com/pandas-groupby) to summarise the average rating by division and department as a new dataframe (image clothing_rating_groupby.png)
9. this dataframe can be easily visualised as a bar chart - do so now (image clothing_rating_chart.png) 
10. OPTIONAL - if you find this groupby and visualise task difficult in Python, you can connect to the csv with Tableau and build the same chart there. This is also a useful reminder of how to work with Tableau (image clothing_rating_tableau.png)
11. do a spot/sample check on the review column at index position 5 (hint: iloc/loc/at) by applying textblob to the selected review text (image clothing_sentiment_index5.png). Do this for at least 3 samples to evaluate the accuracy of the sentiment polarity and subjectivity against the text itself
12. for any selected customer review, use textblob to break the text into sentences (image clothing_sentences.png)
13. using dropna, remove any rows in your data which have a null in the review column (hint: your new data will have 22641 rows)
14. define a function with lambda (or other preferred method) to calculate sentiment polarity for each row of the filtered review data set, as a new column on the data frame (image clothing_sentiment_allrows.png)
15. using the pandas groupby function again, summarise the minimum review polarity by division and department as a new dataframe (image clothing_min_review.png). This is the lowest sentiment score seen in each department and division, so every value will be less than 0
16. visualise this summary as a simple sorted bar chart (image clothing_min_review_chart.png)
17. using the pandas groupby function again, summarise the average review polarity, by division and department as a new dataframe
18. visualise this summary as a simple sorted bar chart 
19. OPTIONAL - if you find these groupby and visualise tasks difficult in Python, you can output your data frame to a csv, then connect to that csv with Tableau and build the same charts there. This is also a useful reminder of how to work with Tableau (image clothing_polarity_tableau.png)
20. tidy up your notebook as much as possible, removing any redundant code, and adding annotations where useful 
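
Steps 13 and 14 can be sketched as follows. This is a minimal illustration rather than the expected solution: the names `textblob_polarity`, `add_polarity` and `scorer` are hypothetical, the toy reviews are made up, and the column name `review text` assumes the headers were lower-cased in step 6. The scorer is passed in as a parameter so the flow can be tried with a stand-in before textblob is installed.

```python
import pandas as pd


def textblob_polarity(text):
    """Polarity from TextBlob's default analyser (requires textblob)."""
    from textblob import TextBlob  # imported here so the rest runs without it

    return TextBlob(text).sentiment.polarity


def add_polarity(df, text_col="review text", scorer=textblob_polarity):
    """Drop rows whose review is null, then score each remaining review."""
    out = df.dropna(subset=[text_col]).copy()
    out["polarity"] = out[text_col].apply(lambda text: scorer(text))
    return out


# Hypothetical reviews and a stand-in scorer, used only to check the flow;
# leave out the scorer argument to use TextBlob on the real data.
toy = pd.DataFrame({"review text": ["love it", None, "poor fit"]})
scored = add_polarity(toy, scorer=lambda text: 1.0 if "love" in text else -1.0)
```

On the lab data, `add_polarity(df)` would return a copy with the null reviews removed and a new `polarity` column, leaving the original dataframe unchanged.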

---- 

### Stretch guidance (optional follow on activities)

1. how accurate is the sentiment polarity calculated on this data set? how can you tell?
2. what are the most useful ways to visualise the sentiment polarity against the other data in the reviews data set? (hint: tableau or seaborn, exploratory data visualisation) - I have started this in Tableau, here's [my workbook](https://public.tableau.com/profile/sianedavies#!/vizhome/Customer_reviews_viz/reviewcategorytotals)
3. through sample / limited experimentation, investigate whether processing / cleaning the review text might lead to a more accurate sentiment calculation
4. textblob has an alternative, naive bayes sentiment analyser that is trained on movie reviews - do you think this could be more accurate? what features would you choose to include if training a sentiment analysis model?
5. could there be advantages to utilising spacy+textblob instead? why? how would you identify a sentiment tool better suited to this data set?
6. install the needed packages and apply a second sentiment analysis method end to end, to evaluate the accuracy of the first approach
7. summarise what you have learnt in a .md file to accompany your notebook, or in annotations / images in the notebook itself
8. tidy up your code as much as possible, consider modularising any elements of what you have done for re-usability and efficiency
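
One rough way to approach question 1 is to compare the calculated polarity with the star rating the customer gave. The sketch below assumes that the rating is a reasonable proxy for sentiment and that a `polarity` column already exists; the function name `polarity_vs_rating` and the toy numbers are hypothetical.

```python
import pandas as pd


def polarity_vs_rating(df, polarity_col="polarity", rating_col="rating"):
    """Return mean polarity per star rating and the Pearson correlation."""
    mean_by_rating = df.groupby(rating_col)[polarity_col].mean()
    correlation = df[polarity_col].corr(df[rating_col])
    return mean_by_rating, correlation


# Hypothetical scores, only to show the shape of the output.
toy = pd.DataFrame({
    "rating": [1, 1, 3, 5, 5],
    "polarity": [-0.6, -0.4, 0.0, 0.5, 0.7],
})
mean_by_rating, correlation = polarity_vs_rating(toy)
```

If a second analyser is applied for question 6, calling the same function with a different `polarity_col` gives a like-for-like comparison of how closely each approach tracks the ratings.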

--------

### When you are ready submit your lab via the student portal (github /google colab url)






In [7]:
# 1. retrieve the data from kaggle √
# 2. install textblob onto your conda environment √
# 3. launch jupyter notebook in the same environment, import libraries √
import numpy as np
import pandas as pd 

import matplotlib
%matplotlib inline

import seaborn as sns

from tqdm import tqdm
import textblob

In [8]:
# 4. Read the data into a dataframe with pandas


data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df = data
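
The `Unnamed: 0` column that shows up in the output below looks like a row index that was saved into the CSV. A minimal sketch, assuming that is the case, of how `index_col=0` keeps such a column out of the data; the in-memory CSV text is made up and stands in for the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the Kaggle CSV: the first, unnamed
# column is a saved row index.
csv_text = (
    ",Clothing ID,Rating\n"
    "0,767,4\n"
    "1,1080,5\n"
)

# Default read: the unnamed first column is loaded as "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the file is read this way, `"Unnamed: 0"` no longer exists as a column and would have to be left out of the `drop` call in step 5.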

In [9]:
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [13]:
# 5. filter the data frame to a relevant subset of columns required for our data scenario(image clothing_columns.png) √

# we need to get rid of these columns: Unnamed: 0, Clothing ID, Age, Recommended IND, Positive Feedback Count, Class Name

df.drop(["Unnamed: 0", "Clothing ID", "Age", "Recommended IND", "Positive Feedback Count", "Class Name"], axis=1, inplace=True)
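
Because `df = data` only gives the same dataframe a second name, the in-place `drop` above also removes the columns from `data`. A minimal alternative sketch, using a made-up one-row frame with the lab's column names, selects the columns to keep and copies them so the raw data stays intact:

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the lab data.
data = pd.DataFrame({
    "Clothing ID": [767],
    "Age": [33],
    "Title": ["My favorite buy!"],
    "Review Text": ["I love this jumpsuit."],
    "Rating": [5],
    "Division Name": ["General"],
    "Department Name": ["Bottoms"],
})

keep = ["Title", "Review Text", "Rating", "Division Name", "Department Name"]

# Selecting with a list and calling .copy() yields an independent frame,
# so the raw data remains available for later checks.
df = data[keep].copy()
```

Either approach gives the same five columns; the difference is only whether the original frame is still available afterwards.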

In [14]:
df.head()

| | Title | Review Text | Rating | Division Name | Department Name |
|---|---|---|---|---|---|
| 0 | | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate |
| 1 | | Love this dress! it's sooo pretty. i happene... | 5 | General | Dresses |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | General | Dresses |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | General Petite | Bottoms |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | General | Tops |


In [16]:
# 6. use df.columns.str.lower() (or other preferred method) to standardise your column headers √

df.columns = df.columns.str.lower()
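
As an example of the "other preferred method" mentioned in step 6, `rename` accepts a function that is applied to every column label. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with two of the lab's column names.
df = pd.DataFrame({"Review Text": ["Nice"], "Division Name": ["General"]})

# rename returns a new frame with str.lower applied to each column label.
lowered = df.rename(columns=str.lower)
```

Unlike assigning to `df.columns`, this leaves `df` untouched, which can be handy when chaining several cleaning steps.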

In [17]:
df.head()

| | title | review text | rating | division name | department name |
|---|---|---|---|---|---|
| 0 | | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate |
| 1 | | Love this dress! it's sooo pretty. i happene... | 5 | General | Dresses |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | General | Dresses |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | General Petite | Bottoms |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | General | Tops |


In [18]:
# 7. EDA - with descriptive statistics (eg describe(), shape, info()) and/or simple charts,
# explore and familiarise yourself with the data at your own pace
#- clarify what each column means/contains and what cleaning steps could be employed (and if needed for our scenario)

In [19]:
df.describe()

| | rating |
|---|---|
| count | 23486.0 |
| mean | 4.196032 |
| std | 1.110031 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


In [21]:
df.shape

# 23486 rows and 5 columns

(23486, 5)

In [23]:
df.info()

# at first sight:
# title has approx. 3800 nulls
# review text has approx. 850 nulls
# rating has no nulls
# division name and department name have a negligible number of nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            19676 non-null  object
 1   review text      22641 non-null  object
 2   rating           23486 non-null  int64 
 3   division name    23472 non-null  object
 4   department name  23472 non-null  object
dtypes: int64(1), object(4)
memory usage: 917.5+ KB


In [28]:
# detect missing values

df.isnull().sum()

# as said above
# 3810 nulls in title
# 845 nulls in review text
# rating without nulls
# 14 nulls in division name and department name (negligible amounts)

title              3810
review text         845
rating                0
division name        14
department name      14
dtype: int64

In [31]:
# calculate the percentage of nulls
nulls_perc = df.isnull().sum() * 100 / len(df)
nulls_perc

# according to what we learnt in Unit 1, these percentages are low (title is the highest at about 16%),
# so removing the nulls is not worthwhile at this stage

title              16.222430
review text         3.597888
rating              0.000000
division name       0.059610
department name     0.059610
dtype: float64
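
The percentage calculation in the cell above can also be written as the mean of the null mask, since the mean of a boolean column is the fraction of `True` values. A small sketch on made-up data showing that both forms agree:

```python
import pandas as pd

# Hypothetical frame: half of the titles are missing.
df = pd.DataFrame({
    "title": ["Great", None, None, "Fine"],
    "rating": [5, 4, 3, 2],
})

# Count-based version, as in the cell above.
by_count = df.isnull().sum() * 100 / len(df)

# Mean-based version: the mean of a boolean mask is the share of nulls.
by_mean = df.isna().mean() * 100
```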

In [35]:
# check for correlations

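# restrict to numeric columns first; pandas 2.x raises an error if text columns are passed to corr()
correlations = df.select_dtypes(include="number").corr(method="pearson")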
correlations

| | rating |
|---|---|
| rating | 1.0 |


In [39]:
# 8. use the pandas groupby function to summarise the average rating 
# by division and department as a new dataframe (image clothing_rating_groupby.png)

df.groupby(["division name", "department name"]).rating.mean().sort_values()

division name   department name
General Petite  Trend              3.782609
General         Trend              3.822917
General Petite  Dresses            4.133256
General         Tops               4.148749
                Dresses            4.163003
General Petite  Tops               4.216469
General         Jackets            4.240310
General Petite  Intimate           4.240343
General         Bottoms            4.268686
Initmates       Intimate           4.286285
General Petite  Jackets            4.304910
                Bottoms            4.329356
Name: rating, dtype: float64
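
The result above is a Series with a two-level index. One possible way to get the "new dataframe" that step 8 asks for, and the bar chart of step 9, is sketched below. The helper names `summarise_by_department` and `plot_summary` are hypothetical, the toy rows are made up, and the plotting helper assumes matplotlib is installed.

```python
import pandas as pd


def summarise_by_department(df, value_col="rating", agg="mean"):
    """Aggregate value_col per division and department, sorted ascending."""
    return (
        df.groupby(["division name", "department name"])[value_col]
        .agg(agg)
        .sort_values()
        .reset_index()  # turn the grouped Series into a regular dataframe
    )


def plot_summary(summary, value_col="rating"):
    """Horizontal bar chart of the summary (needs matplotlib installed)."""
    labels = summary["division name"] + " / " + summary["department name"]
    return summary.assign(group=labels).plot.barh(
        x="group", y=value_col, legend=False
    )


# Hypothetical rows, only to show the shape of the result.
toy = pd.DataFrame({
    "division name": ["General", "General", "General Petite"],
    "department name": ["Tops", "Tops", "Trend"],
    "rating": [5, 3, 2],
})
summary = summarise_by_department(toy)
```

Once a polarity column exists, the same helper could serve steps 15 and 17, for example `summarise_by_department(df, "polarity", "min")`.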