# NLP Lab with Customer Review Data

### Launch Jupyter Notebook in the same environment, and import pandas, matplotlib/seaborn and textblob


In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
from textblob import TextBlob

### Read the data into a dataframe with pandas


In [2]:
df=pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

### Filter the data frame to a relevant subset of columns required for our data scenario (image clothing_columns.png)


In [3]:
df.head(1)

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |


In [4]:
df_filtered = df[['Title', 'Review Text', 'Rating', 'Division Name', 'Department Name']].copy()

In [5]:
df_filtered.head(1)

| | Title | Review Text | Rating | Division Name | Department Name |
|---|---|---|---|---|---|
| 0 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate |


### Use df.columns.str.lower() (or other preferred method) to standardise your column headers


In [6]:
df_filtered.columns=df_filtered.columns.str.lower()
df_filtered

| | title | review text | rating | division name | department name |
|---|---|---|---|---|---|
| 0 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate |
| 1 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | General | Dresses |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | General | Dresses |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | General Petite | Bottoms |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | General | Tops |
| ... | ... | ... | ... | ... | ... |
| 23481 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | General Petite | Dresses |
| 23482 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | General Petite | Tops |
| 23483 | Cute, but see through | This fit well, but the top was very see throug... | 3 | General Petite | Dresses |
| 23484 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | General | Dresses |


### EDA - with descriptive statistics (e.g. describe(), shape, info()) and/or simple charts
### Explore and familiarise yourself with the data at your own pace
### Clarify what each column means/contains and what cleaning steps could be employed (and whether they are needed for our scenario)


In [7]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            19676 non-null  object
 1   review text      22641 non-null  object
 2   rating           23486 non-null  int64 
 3   division name    23472 non-null  object
 4   department name  23472 non-null  object
dtypes: int64(1), object(4)
memory usage: 917.5+ KB


 - There are quite a few nulls in the title and review text columns (3,810 and 845 respectively), and a handful (14 each) in the division name and department name columns
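The null counts read off `info()` above can also be computed directly. The sketch below uses a small made-up frame called `sample` (a hypothetical stand-in, not the real data), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of df_filtered; values are made up for illustration
sample = pd.DataFrame({
    "title": [None, "My favorite buy!", None, "Disappointed"],
    "review text": ["Love it", "Great fit", None, "Too big"],
    "rating": [5, 5, 3, 1],
})

# Count of nulls per column, and the same as a percentage of all rows
null_counts = sample.isna().sum()
null_share = (sample.isna().mean() * 100).round(1)

print(pd.DataFrame({"nulls": null_counts, "percent": null_share}))
```

On the real data, replacing `sample` with `df_filtered` should give the same per-column picture as `info()`, just expressed as nulls rather than non-nulls.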

In [8]:
df_filtered.describe()

| | rating |
|---|---|
| count | 23486.0 |
| mean | 4.196032 |
| std | 1.110031 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


In [9]:
df_filtered.shape

(23486, 5)

In [10]:
df_filtered.groupby(['department name']).agg({'rating':'mean'})

| department name | rating |
|---|---|
| Bottoms | 4.28876 |
| Dresses | 4.150815 |
| Intimate | 4.280115 |
| Jackets | 4.264535 |
| Tops | 4.172239 |
| Trend | 3.815126 |


In [11]:
df_filtered.groupby(['division name']).agg({'rating':'mean'})

| division name | rating |
|---|---|
| General | 4.176606 |
| General Petite | 4.211084 |
| Initmates | 4.286285 |


Column meanings

- title: headline of each review; probably an optional field, which would explain the number of nulls
- review text: the review of the product in text form; not mandatory, hence the nulls
- rating: overall rating of the product bought, an integer from 1 to 5; a mandatory field, hence no nulls
- division name: name of each of the 3 divisions within the company, so a review can be directed accordingly
- department name: names of the 6 product departments

The last 2 columns have exactly the same number of nulls (14 each), suggesting those values are missing because of an error or because the fields did not exist at the time.
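Equal null counts in two columns do not by themselves prove that the same rows are affected. A minimal check, sketched here on a made-up frame (`sample` is hypothetical, not the real data), compares the two null masks row by row.

```python
import pandas as pd

# Hypothetical stand-in for df_filtered; values are made up for illustration
sample = pd.DataFrame({
    "division name": ["General", None, "Initmates", None],
    "department name": ["Tops", None, "Intimate", None],
})

div_null = sample["division name"].isna()
dep_null = sample["department name"].isna()

# True only if every row is null in both columns or in neither
same_rows = bool((div_null == dep_null).all())
both_null = int((div_null & dep_null).sum())

print(same_rows, both_null)
```

If `same_rows` comes back `True` on the real frame, the two columns are missing together, which would support the idea of a single upstream cause.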


### Use the pandas groupby function to summarise the average rating by division and department as a new dataframe (image clothing_rating_groupby.png)

In [12]:
df_groupby = df_filtered.groupby(['division name','department name']).agg({'rating':'mean'})
df_groupby

| division name | department name | rating |
|---|---|---|
| General | Bottoms | 4.268686 |
| General | Dresses | 4.163003 |
| General | Jackets | 4.24031 |
| General | Tops | 4.148749 |
| General | Trend | 3.822917 |
| General Petite | Bottoms | 4.329356 |
| General Petite | Dresses | 4.133256 |
| General Petite | Intimate | 4.240343 |
| General Petite | Jackets | 4.30491 |
| General Petite | Tops | 4.216469 |
| General Petite | Trend | 3.782609 |
| Initmates | Intimate | 4.286285 |


In [13]:
print(df_groupby)

                                  rating
division name  department name          
General        Bottoms          4.268686
               Dresses          4.163003
               Jackets          4.240310
               Tops             4.148749
               Trend            3.822917
General Petite Bottoms          4.329356
               Dresses          4.133256
               Intimate         4.240343
               Jackets          4.304910
               Tops             4.216469
               Trend            3.782609
Initmates      Intimate         4.286285

### This dataframe can be easily visualised as a bar chart - do so now (image clothing_rating_chart.png)


![alt text](image.png "Title")
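The notebook does not show the code behind the chart, so here is one way it could be drawn with pandas and matplotlib. This is a sketch under assumptions: the frame `ratings` holds a few of the averages printed earlier, re-entered by hand, and the layout (departments on the x axis, one bar colour per division) is a choice, not necessarily what the image uses.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few of the averages printed above, re-entered by hand for a standalone demo
ratings = pd.DataFrame({
    "division name": ["General", "General", "General Petite", "General Petite"],
    "department name": ["Bottoms", "Trend", "Bottoms", "Trend"],
    "rating": [4.268686, 3.822917, 4.329356, 3.782609],
})

# One row per department, one column (bar colour) per division
pivoted = ratings.pivot(index="department name", columns="division name", values="rating")

# In a notebook the figure renders inline after this cell
ax = pivoted.plot(kind="bar", figsize=(8, 4), rot=0)
ax.set_ylabel("average rating")
ax.set_title("Average rating by department and division")
plt.tight_layout()
```

With the real grouped frame, `df_groupby['rating'].unstack('division name')` should produce the same wide shape as `pivoted` without re-entering any numbers.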


### Do a spot/sample check on the review column at index position 5
### (hint: iloc/loc/at) and apply textblob to the selected review text (image clothing_sentiment_index5.png)
### Do this for at least 3 samples to evaluate the accuracy of the sentiment polarity and subjectivity against the text itself


In [14]:
df_filtered.iloc[5]

title                                        Not for the very petite
review text        I love tracy reese dresses, but this one is no...
rating                                                             2
division name                                                General
department name                                              Dresses
Name: 5, dtype: object

In [15]:
df_filtered.iloc[5,1]

'I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment. i love the color and the idea of the style but it just did not work on me. i returned this dress.'

In [16]:
TextBlob('I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment o the garment. i love the color and the idea of the style but it just did not work on me. i returned this dress.').sentiment

Sentiment(polarity=0.17874999999999996, subjectivity=0.533125)

In [17]:
df_filtered.iloc[300]

title                                                            NaN
review text        This dress is stunning- vibrant colors and fli...
rating                                                             4
division name                                                General
department name                                              Dresses
Name: 300, dtype: object

In [18]:
df_filtered.iloc[300,1]

"This dress is stunning- vibrant colors and flirty feel to it. i got the small and i am a 34b/27 pants, 132 lbs- great fit. i only question two things- am i tall enough to pull off the extra fabric in the back and what the heck do you where for a bra? those two considerations are why i didn't give it 5 stars."

In [19]:
TextBlob("This dress is stunning- vibrant colors and flirty feel to it. i got the small and i am a 34b/27 pants, 132 lbs- great fit. i only question two things- am i tall enough to pull off the extra fabric in the back and what the heck do you where for a bra? those two considerations are why i didn't give it 5 stars.").sentiment

Sentiment(polarity=0.17962962962962964, subjectivity=0.49814814814814806)

In [20]:
df_filtered.iloc[2000]

title                                                   Disappointed
review text        The pleats on the bib make this look like some...
rating                                                             1
division name                                         General Petite
department name                                              Dresses
Name: 2000, dtype: object

In [21]:
df_filtered.iloc[2000,1]

"The pleats on the bib make this look like something from chloe sevigny's wardrobe on the set of big love. and the shoulders are cut for an offensive lineman."

In [22]:
TextBlob("The pleats on the bib make this look like something from chloe sevigny's wardrobe on the set of big love. and the shoulders are cut for an offensive lineman.").sentiment


Sentiment(polarity=0.25, subjectivity=0.35)

Overall, sentiment polarity was not very accurate: the 1-star review scored highest (0.25), while the 2-star and 4-star reviews scored almost identically (about 0.18). The subjectivity scores made a bit more sense, but were still somewhat inaccurate for the 3 examples chosen.

### For any selected customer review, use textblob to break out the text into sentences (image clothing_sentences.png)


In [23]:
sentence5 = df_filtered.iloc[5,1]

In [24]:
sentence5=TextBlob(sentence5)

In [25]:
print(sentence5.sentences)

[Sentence("I love tracy reese dresses, but this one is not for the very petite."), Sentence("i am just under 5 feet tall and usually wear a 0p in this brand."), Sentence("this dress was very pretty out of the package but its a lot of dress."), Sentence("the skirt is long and very full so it overwhelmed my small frame."), Sentence("not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment."), Sentence("i love the color and the idea of the style but it just did not work on me."), Sentence("i returned this dress.")]


### Using dropna, remove any rows in your data which have a null in the review column (hint: your new data will have 22641 rows)


In [26]:
df_filtered1 = df_filtered.dropna(axis=0, subset=['review text'])


In [27]:
df_filtered1

| | title | review text | rating | division name | department name |
|---|---|---|---|---|---|
| 0 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate |
| 1 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | General | Dresses |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | General | Dresses |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | General Petite | Bottoms |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | General | Tops |
| ... | ... | ... | ... | ... | ... |
| 23481 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | General Petite | Dresses |
| 23482 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | General Petite | Tops |
| 23483 | Cute, but see through | This fit well, but the top was very see throug... | 3 | General Petite | Dresses |
| 23484 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | General | Dresses |


In [28]:
df_filtered1.reset_index(drop=True,inplace=True)

In [29]:
df_filtered1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22641 entries, 0 to 22640
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            19675 non-null  object
 1   review text      22641 non-null  object
 2   rating           22641 non-null  int64 
 3   division name    22628 non-null  object
 4   department name  22628 non-null  object
dtypes: int64(1), object(4)
memory usage: 884.5+ KB
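The drop and the re-index can also be chained into one expression. The sketch below uses a made-up frame (`sample` and `cleaned` are hypothetical names); the trailing `.copy()` is an assumption about good practice here, intended to make it explicit that later column assignments write to an independent frame.

```python
import pandas as pd

# Hypothetical stand-in for df_filtered; values are made up for illustration
sample = pd.DataFrame({
    "review text": ["Love it", None, "Too big", None, "Great fit"],
    "rating": [5, 4, 2, 3, 5],
})

# Drop rows with no review, renumber the index, and take an explicit copy
cleaned = (
    sample.dropna(subset=["review text"])
    .reset_index(drop=True)
    .copy()
)

print(len(sample), "->", len(cleaned))
print(cleaned.index.tolist())
```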


### Define a function with lambda (or other preferred method) to calculate sentiment polarity for each row of the filtered review data set, as a new column on the data frame. (image clothing_sentiment_allrows.png)


In [30]:
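def sentiment(text):
    # Return TextBlob's Sentiment(polarity, subjectivity), or None for non-string input
    try:
        return TextBlob(text).sentiment
    except TypeError:
        # TextBlob raises TypeError when text is not a string (e.g. NaN)
        return None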

In [31]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is written into the result of dropna()
df_filtered1 = df_filtered1.assign(polarity=df_filtered1['review text'].apply(sentiment).apply(lambda s: s.polarity))


In [32]:
df_filtered1.head()

| | title | review text | rating | division name | department name | polarity |
|---|---|---|---|---|---|---|
| 0 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | Initmates | Intimate | 0.633333 |
| 1 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | General | Dresses | 0.339583 |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | General | Dresses | 0.073675 |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | General Petite | Bottoms | 0.55 |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | General | Tops | 0.512891 |
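Once every row has a polarity, the earlier three-sample spot check can be widened by comparing polarity against the star rating. The sketch below is only an illustration on the five rows shown above (re-entered by hand in a hypothetical frame `head`); five rows are far too few to draw conclusions from.

```python
import pandas as pd

# The five rows shown above: star rating and computed polarity
head = pd.DataFrame({
    "rating": [4, 5, 3, 5, 5],
    "polarity": [0.633333, 0.339583, 0.073675, 0.55, 0.512891],
})

# Mean polarity for each star rating
by_rating = head.groupby("rating")["polarity"].mean()
print(by_rating)

# Pearson correlation between the two columns (1.0 would be perfect agreement)
corr = head["rating"].corr(head["polarity"])
print(round(corr, 3))
```

Running the same two steps on `df_filtered1` would show whether mean polarity rises with the rating across the full data set.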


### Using the pandas groupby function again, summarise the minimum review polarity
### by division and department as a new dataframe (image clothing_min_review.png).
### This means: what is the lowest sentiment score seen in each department and division? Everything will be less than 0


In [33]:
df_groupby4 = df_filtered1.sort_values(['polarity']).groupby(['division name','department name'],sort=False).agg({'polarity':'min'})



In [34]:
df_groupby4

| division name | department name | polarity |
|---|---|---|
| General | Tops | -0.975 |
| General | Dresses | -0.916667 |
| General | Jackets | -0.75 |
| General Petite | Tops | -0.7 |
| General Petite | Intimate | -0.575 |
| General Petite | Jackets | -0.5625 |
| General | Bottoms | -0.533333 |
| General Petite | Dresses | -0.4 |
| Initmates | Intimate | -0.392333 |
| General | Trend | -0.270833 |
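The minimum score alone does not show which review produced it. One way to pull up the matching rows is `idxmin`, sketched here on a made-up frame (`sample` and its values are hypothetical, not taken from the data set).

```python
import pandas as pd

# Hypothetical stand-in for df_filtered1; values are made up for illustration
sample = pd.DataFrame({
    "review text": ["awful fabric", "nice top", "terrible fit", "lovely dress"],
    "division name": ["General", "General", "General Petite", "General Petite"],
    "department name": ["Tops", "Tops", "Dresses", "Dresses"],
    "polarity": [-0.9, 0.6, -0.4, 0.5],
})

# Index label of the lowest-polarity row within each group
lowest_idx = sample.groupby(["division name", "department name"])["polarity"].idxmin()

# Look those rows up to see the review text alongside the score
lowest_reviews = sample.loc[lowest_idx]
print(lowest_reviews[["division name", "department name", "polarity", "review text"]])
```

Reading the actual text behind each minimum is a quick way to judge whether the most negative scores really correspond to the most negative reviews.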


### Visualise this summary as a simple sorted bar chart (image clothing_min_review_chart.png)


In [35]:
df_filtered1.to_csv(r'C:\Users\zeyad\Documents\GitHub\mylabs\mylabs\NLP\polarity_reviews.csv', index = False, header = True)

![alt text](image2.png "Title")
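A sorted bar chart like the one above can also be drawn directly in matplotlib. The sketch below assumes a Series called `minimums` (a hypothetical name) holding some of the minimum polarities printed earlier, re-entered by hand with the two group keys joined into a single label.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Some of the minimum polarities printed above, re-entered by hand
minimums = pd.Series(
    {
        "General / Dresses": -0.916667,
        "General / Trend": -0.270833,
        "General / Tops": -0.975,
        "General Petite / Tops": -0.7,
    },
    name="polarity",
)

# Sort so the most negative group comes first, then draw one bar per group
ordered = minimums.sort_values()

# In a notebook the figure renders inline after this cell
ax = ordered.plot(kind="bar", figsize=(8, 4), rot=45)
ax.set_ylabel("minimum polarity")
ax.set_title("Lowest review polarity by division and department")
plt.tight_layout()
```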



### Using the pandas groupby function again, summarise the average review polarity, by division and department as a new dataframe


In [36]:
df_groupby5 = df_filtered1.groupby(['division name', 'department name']).agg({'polarity':'mean'})

In [37]:
df_groupby5

| division name | department name | polarity |
|---|---|---|
| General | Bottoms | 0.245849 |
| General | Dresses | 0.251091 |
| General | Jackets | 0.237141 |
| General | Tops | 0.247025 |
| General | Trend | 0.203986 |
| General Petite | Bottoms | 0.259258 |
| General Petite | Dresses | 0.24838 |
| General Petite | Intimate | 0.239517 |
| General Petite | Jackets | 0.241773 |
| General Petite | Tops | 0.256835 |



### OPTIONAL - if you find it difficult to do these groupby and visualisation tasks in Python, you can output your data frame to a csv, then connect to that csv with Tableau and build the same charts there! This is also a useful exercise to remind you of how to work with Tableau (image clothing_polarity_tableau.png)

![alt text](avg_polarity.png "Title")
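If the grouped summary itself is exported rather than the row-level data, flattening the group keys first keeps the CSV simple. The sketch below rebuilds two of the averages above by hand (`rows` is a hypothetical frame) and writes to an in-memory buffer instead of a real file path.

```python
import io
import pandas as pd

# Two of the averages printed above, re-entered by hand for a standalone demo
rows = pd.DataFrame({
    "division name": ["General", "General Petite"],
    "department name": ["Bottoms", "Bottoms"],
    "polarity": [0.245849, 0.259258],
})
grouped = rows.groupby(["division name", "department name"]).agg({"polarity": "mean"})

# reset_index() turns the group keys back into ordinary columns,
# so the CSV has one header row and one flat row per group
flat = grouped.reset_index()

buffer = io.StringIO()  # in-memory file; swap for a real path when exporting
flat.to_csv(buffer, index=False)
print(buffer.getvalue())
```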