<a href="https://colab.research.google.com/github/therajmaurya/assignment/blob/main/LeadSquared.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Business Context**
- We are building a Customer Experience platform for B2B Customers.
One of the core components of this Customer Experience Platform is our NLP Engine, which would consume textual data and generate insights.

**Problem statement:**
- We have recently started building an initial iteration for an e-commerce customer using their reviews data.


**Data:**
- Let us use a dataset available on Kaggle for our reference. Link to the e-commerce reviews data: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/version/1?select=Womens+Clothing+E-Commerce+Reviews.csv

**Description:**
- As an initial solution, we extracted key phrases (using bigrams and trigrams) from the review text, built a word cloud, and presented it to our business teams to gauge its usefulness. The feedback we received is that it is very generic, is missing context, and does not add much value.
- When we dug further into this feedback, we realized that they are looking for meaningful labels. As an example, they shared Amazon's mobile product page, where labels are generated based on the review text. Even though the business team gave Amazon's reviews as a reference, they mentioned that it still misses the context.
- Say for Example,
  - It doesn’t say whether the battery life is good or bad.
  - Memory card is good/bad, available/not available
  - Camera Quality is good or bad
- Without the context, users need to do Analysis on their own.

**Objective:**

You are the NLP expert in the team, and we want you to come up with a solution that meets the business team's expectations. This will help in iterating and taking this solution further. Following are our expectations:

*As a data scientist, we want you to come up with*
1. How would you approach this problem statement? We would like to understand your thought process.
2. Can you come up with an initial model using any of your favorite language, frameworks, libraries, and model?
  - Generate a CSV file as an output which contains all the existing columns + add a new column for the inference/output for each review.
  - You may want to take Clothing ID 1078 for your analysis and code (model), as it has the maximum number of reviews.
3. What are the assumptions you have made to come up with a solution?

*What are we trying to achieve with this?*
1. Our objective is to understand your approach (thought process, code) when a one-liner like this is given to you.
2. We understand that we have given you a one-liner. We are not expecting a perfect model that takes care of all the edge cases.
3. Feel free to make your own assumptions. Document your assumptions so that we can review them later.

**MY SOLUTION APPROACH**

***Assumptions:***
- Assuming that we have sufficient reviews and ratings data, so we do not face the cold-start problem.
- For the scenario where reviews are missing but ratings are present, we assume that the rating reflects the same level of sentiment for all the topics of that category.
- For the scenario where reviews are present but ratings are missing, we can treat it in two ways: *either assign a rating based on the overall sentiment of the text, or fill it with the average rating for now.*
- If both reviews and ratings are missing, we will ignore those rows for now.
- We will use the "Recommended IND", "Positive Feedback Count" and "Age" columns wherever appropriate during modelling and for handling the different scenarios.
- Meaning of the rating for my own analysis:
  - 1 Star: Hate it. [negative]
  - 2 Star: Don't like it. [negative]
  - 3 Star: It's okay. [neutral]
  - 4 Star: Like it. [positive]
  - 5 Star: Love it. [positive]
- Wherever I use the word "review", I am referring to the combined "Review Title and Review Text".
- We will use out-of-the-box models where they are available, to save time.
- We will not spend much time on text cleaning and pre-processing; we will do just the minimum for now (not in-depth).

**Potential Todo Tasks:**
- Text Cleaning (removing stop words, short-hands, punctuation, HTML tags, etc.)
- Handling Accented Characters
- Spelling correction (not going to implement it right now)
- Text Language Check and if other languages found (ignore or translate those text if applicable)
- Missing Value Treatment (if applicable)
- Class Imbalance Handling (if any)
- Stemming / Lemmatisation (when not using embedding)
- Tokenisation & Encoding/Embedding
- Modelling (model selection, hyperparameter-tuning/fine-tuning etc.)
- Evaluation on Metrics and re-iteration.
- Handling sarcasm (as the next step)


**Approach:**
- First, we can group all the products into categories that share similar properties/features/specifications. Groups can be like "electronic/mobile-phone", "electronic/laptop", "clothing/wearables", "clothing/cleaning", etc. (For our analysis, the data we have already belongs to one such group.)
- For each category/group of products, if a feature list is not available from the business, we can use topic modelling to extract the topics that people talk about in the reviews. After going through all the reviews, we will have a consolidated list of topics per group, e.g. "battery life, memory, RAM" for mobile phones and "color, cloth quality, design, fit" for clothing/wearables.
- Next, we want the sentiment for each identified topic/feature in a given review. If a topic is not mentioned in the review, we assume neutral.
- There are simple and complex methods, supervised as well as unsupervised, to extract feature-level sentiment. (We need to research these further.)
- The way I approach this problem: given the topics/features (extracted via topic modelling) and the review, I treat it as a naive question-answering problem, where the topic/feature words are the question and the review text is the context. The answer is a relevant sub-span of the review.
- We assume that sentiment-bearing words such as adjectives are likely to be located close to a feature word rather than far away from it. So we also take the nearby 1-2 words into consideration and run a sentiment analyser on that window to get the feature sentiment with probabilities.
- We multiply the probabilities by 5 and round them so that they fall in the range of the ratings. That meets our objective for the first iteration; we can improve on it in later iterations with custom models and further fine-tuning of the approach.
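
The last step can be sketched as follows. This is only an illustration under stated assumptions: `probability_to_rating` and `compound_to_probability` are hypothetical helper names, the rounding is half-up with clamping to 1-5, and the linear rescaling of a [-1, 1] polarity score to [0, 1] is my own assumption, not something fixed by the approach above.

```python
def probability_to_rating(p: float) -> int:
    """Map a positive-sentiment probability in [0, 1] to a 1-5 star rating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Multiply by 5 and round half up; clamp because 5 * p can round down to 0.
    return min(5, max(1, int(p * 5 + 0.5)))


def compound_to_probability(compound: float) -> float:
    """Assumption: rescale a [-1, 1] polarity score linearly to [0, 1]."""
    return (compound + 1.0) / 2.0


ratings = [probability_to_rating(p) for p in (0.05, 0.5, 0.93)]
neutral_rating = probability_to_rating(compound_to_probability(0.0))
```

With this mapping a neutral polarity score of 0.0 lands on 3 stars, which matches the "3 Star: It's okay. [neutral]" reading used in the assumptions.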


**More Ideas:**
- We can use a sarcasm detection module (open-source / pre-trained or custom-trained models) to detect reviews that may contain sarcasm.
- Once we identify such a review, we can choose to ignore it or invert the outcomes of the previous results.
- If we invert the scores of all topics, the result might not be truly accurate: the user might not be sarcastic about every topic and may be using sarcasm to point out only a few features.
- One assumption can be that if the rating is above 3, the chance of sarcasm is lower. We can use this to our advantage in the analysis.
- Similarly, we can use the Positive Feedback Count and Recommended IND features in our analysis.
- A user's age might also change their preferences and taste for the features of a product.
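
One possible reading of the rating-based idea is sketched below. It is a heuristic only, not a sarcasm detector: `sarcasm_candidate` is a hypothetical helper, `text_polarity` is assumed to be a score in [-1, 1] from any sentiment model, and the `threshold` of 0.5 is an untuned placeholder.

```python
def sarcasm_candidate(rating: int, text_polarity: float, threshold: float = 0.5) -> bool:
    """Flag a review whose star rating and text polarity disagree strongly.

    rating: 1-5 stars; text_polarity: score in [-1, 1] from any sentiment model.
    threshold is a hypothetical cut-off, not a tuned value.
    """
    if rating > 3:
        # Assumption from the list above: high ratings are rarely sarcastic.
        return False
    # Low or neutral rating but strongly positive wording: worth a second look.
    return text_polarity >= threshold


flags = [
    sarcasm_candidate(1, 0.8),   # low rating, glowing text
    sarcasm_candidate(5, 0.8),   # high rating, glowing text
    sarcasm_candidate(2, -0.6),  # low rating, negative text
]
```

Flagged reviews could then be routed to a dedicated sarcasm model or excluded, rather than having all their topic scores inverted.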

**MY SOLUTION IMPLEMENTATION:**

- Solution provided here is just a sample and not the exhaustive/optimal solution (as per the expectation)

In [None]:
import pandas as pd
import numpy as np
pd.set_option("display.max_colwidth", None)

In [None]:
import re
import string
import time
import warnings

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# corpora needed for stop-word removal, tokenisation and lemmatisation
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,"I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c",3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6,General,Tops,Blouses


In [None]:
df.info() # some missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              23486 non-null  int64 
 1   Age                      23486 non-null  int64 
 2   Title                    19676 non-null  object
 3   Review Text              22641 non-null  object
 4   Rating                   23486 non-null  int64 
 5   Recommended IND          23486 non-null  int64 
 6   Positive Feedback Count  23486 non-null  int64 
 7   Division Name            23472 non-null  object
 8   Department Name          23472 non-null  object
 9   Class Name               23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB


In [None]:
# df[df.Title == ''] 
# df[df['Review Text']=='']
df.Title = df.Title.fillna('')
df['Review Text'] = df['Review Text'].fillna('')
df['user_comment'] = df.Title + ' ' + df['Review Text']
cols = ['Age', 'user_comment', 'Rating', 'Recommended IND', 'Positive Feedback Count']
print("Columns of interest:", cols)

Columns of interest: ['Age', 'user_comment', 'Rating', 'Recommended IND', 'Positive Feedback Count']


In [None]:
df[cols].head()

Unnamed: 0,Age,user_comment,Rating,Recommended IND,Positive Feedback Count
0,33,Absolutely wonderful - silky and sexy and comfortable,4,1,0
1,34,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4
2,60,"Some major design flaws I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c",3,0,0
3,50,"My favorite buy! I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0
4,47,Flattering shirt This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6


**Text Cleaning/Preprocessing (basic, not exhaustive)**

In [None]:
char_corpus = set() # to check Accented Characters
_ = df['user_comment'].apply(lambda x: char_corpus.update(set(x)))
print(char_corpus) 

{'~', 'h', 'E', ')', 'f', 'o', 'm', 'g', ':', 'i', 'W', '#', 'M', '%', 'H', 'q', 'p', 'I', 'D', '_', 'B', '-', 'e', 'N', 'C', '*', '|', 'k', 'P', 'F', ']', '3', 'G', 'V', "'", 's', '1', '\r', 'J', 'A', '}', 'n', ',', 'u', '<', '9', '.', 'U', 'd', '5', 'L', 'X', '&', '6', '\xa0', '\\', 'a', '0', '$', '"', 'K', '¨', '(', '>', 'â', ';', '\n', ' ', 'w', '+', 'x', 'v', '!', 'j', '¼', '2', '`', 'z', 'Q', 'Y', '/', 'O', '8', 'ã', 'y', '[', '©', '?', '7', 'l', 'S', 'R', 't', '4', 'c', 'b', 'T', 'r', '@', '{', 'Z', '='}


In [None]:
# handling accented characters 
import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')
df['user_comment'] = df['user_comment'].str.lower().apply(strip_accents)
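
A quick check of what `strip_accents` does and does not change may help here. The sample strings are made up for illustration; the function body is repeated so the snippet runs on its own.

```python
import unicodedata


def strip_accents(s: str) -> str:
    # NFD splits "é" into "e" + a combining acute accent (category "Mn");
    # the filter then drops the combining mark and keeps the base letter.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


samples = ["café", "ã â", "¼ ©"]
cleaned = [strip_accents(s) for s in samples]
```

Accented letters lose their marks, while symbols such as `¼` and `©` pass through unchanged because they have no canonical decomposition into a base letter plus a combining mark.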

In [None]:
def strip_html_tags(text):
    # Initiating BeautifulSoup object soup.
    soup = BeautifulSoup(text, "html.parser")
    # Get all the text other than html tags.
    stripped_text = soup.get_text(separator=" ")
    return stripped_text
df['user_comment'] = df['user_comment'].apply(lambda x: strip_html_tags(x))

In [None]:

CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}
def clean_text(text): # text in lower case already
    # replace special characters with a space; "$-," inside the class is a range, so & ' ( ) * + are kept as well
    text = re.sub(r"[^a-zA-Z0-9:$-,%.?!]+", ' ', text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    for k, v in CONTRACTION_MAP.items():
      text = re.sub(k, v, text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r"\. com", " .com ", text)
    text = re.sub(r"\ [A-Za-z]*\.com", " ", text)
    pattern = re.compile(r'\s+') 
    text = re.sub(pattern, ' ', text) 
    # text = re.sub('\W', ' ', text)
    # text = re.sub('\s+', ' ', text)
    text = text.strip(' ')
    return text

df['user_comment'] = df['user_comment'].map(lambda com : clean_text(com))
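
The character class used in `clean_text` contains `$-,`, which Python's `re` module reads as a range from `$` to `,` rather than three literal characters. The small check below, on a made-up sample string, shows which characters survive because of that range.

```python
import re

# Same character class as in clean_text; "$-," is the range U+0024..U+002C.
pattern = re.compile(r"[^a-zA-Z0-9:$-,%.?!]+")

sample = "it's a-b (ok) #1"
result = pattern.sub(' ', sample)

# Characters the class keeps only because of the "$-," range.
kept_by_range = [chr(c) for c in range(ord('$'), ord(',') + 1)]
```

The apostrophe is inside the range, so contractions such as `it's` are still intact when the contraction rules run, while a literal hyphen is not kept.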

In [None]:
sp_list = ["\n", "\\n", "\r", "\\r", "\t", "\\t", "\'", "\"", "/", "//", "\\"] + list(string.punctuation)
for k in sp_list: # replacing the special character and punctuations
  df['user_comment'] = df['user_comment'].apply(lambda x: x.replace(k, " "))

**Preprocessing only for Non-embedding based model**

In [None]:
# remove stopwords; we will not remove stop words etc for BERT based model
stoplist = stopwords.words('english') 
stoplist = set(stoplist)

df['user_comment_non_embedding'] = df['user_comment'].map(lambda x : \
          (" ".join([word for word in x.split() if word.lower() not in stoplist ])).strip())

In [None]:
# remove numbers
df['user_comment_non_embedding'] = df['user_comment_non_embedding'].map(lambda x : re.sub(r"[^a-zA-Z:$-,%.?!]+", ' ', x))

In [None]:
# lemmatization
# we will perform lemmatisation (no stemming) for CountVectorizer / TF-IDF (Non-BERT) based models

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
for pos in ["n", "v", "a", "r", "s"]:
  df['user_comment_non_embedding'] = df['user_comment_non_embedding'].map(lambda x :\
          (' '.join([lemmatizer.lemmatize(w,pos) for w in w_tokenizer.tokenize(x)])).strip())

In [None]:
df.head(1)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,user_comment,user_comment_non_embedding
0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable,absolutely wonderful silky sexy comfortable


**Topic Modelling**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# discard words that appear in more than 75% of the reviews or in fewer than 100 reviews
vectorizer = CountVectorizer(max_df=0.75, min_df=100, token_pattern=r'\w+|\$[\d\.]+|\S+')
tf = vectorizer.fit_transform(df['user_comment_non_embedding']).toarray()
# tf_feature_names tells us which word each column of the matrix represents
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions call it get_feature_names())
tf_feature_names = vectorizer.get_feature_names_out()

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

number_of_topics = 10
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)

LatentDirichletAllocation(random_state=0)

In [None]:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

no_top_words = 10
display_topics(model, tf_feature_names, no_top_words)
# we are just taking single words, but we can also consider bi-grams for the topic words

Unnamed: 0,Topic 0 words,Topic 0 weights,Topic 1 words,Topic 1 weights,Topic 2 words,Topic 2 weights,Topic 3 words,Topic 3 weights,Topic 4 words,Topic 4 weights,Topic 5 words,Topic 5 weights,Topic 6 words,Topic 6 weights,Topic 7 words,Topic 7 weights,Topic 8 words,Topic 8 weights,Topic 9 words,Topic 9 weights
0,sweater,3165.3,look,3378.2,pant,2280.9,waist,978.0,top,4255.4,dress,10552.2,size,5891.0,great,3365.1,wear,2369.8,color,3200.0
1,soft,1218.9,like,3152.9,jean,2133.2,back,910.1,fit,1344.2,fit,2619.2,small,4085.4,wear,2896.5,dress,1750.9,top,2567.8
2,love,992.2,nice,1479.8,fit,2090.3,blouse,904.6,look,1143.0,size,1511.2,order,2964.9,love,1888.3,love,1669.7,love,2173.2
3,sleeve,807.5,fabric,1474.8,love,1811.9,short,848.9,fabric,1018.2,beautiful,1353.9,run,2577.3,perfect,1561.8,get,1487.2,size,1377.4
4,warm,671.6,would,1372.1,great,1333.7,fit,844.8,like,943.4,flat,1352.8,large,2345.2,summer,1476.5,one,1135.1,fit,1358.6
5,wear,641.5,skirt,1339.5,wear,1102.9,low,789.3,bra,769.9,love,1274.1,fit,2067.9,top,1382.1,buy,977.0,blue,1120.6
6,coat,618.1,really,1308.5,pair,1064.6,cut,779.0,back,768.6,make,854.7,petite,1914.7,dress,1188.1,compliment,965.2,buy,1058.5
7,fit,611.5,good,1264.6,comfortable,1011.2,beautiful,748.3,size,733.7,fabric,845.6,would,1755.9,comfortable,1077.6,time,953.9,white,1024.4
8,jacket,553.0,material,1259.2,size,1003.4,love,746.8,bottom,724.6,perfect,813.1,x,1740.5,look,1009.8,wash,918.0,beautiful,886.4
9,color,525.0,color,1134.0,perfect,902.4,arm,679.9,would,723.6,look,801.8,medium,1443.3,fall,974.8,retailer,811.9,order,871.8


In [None]:
# I have hand-picked some classes based on the topic words for now; this can be improved later
classes = ['size/fit/waist', 'look/style/design', 'color/shade', 'fabric/make/feel/quality/wash/button']

**Feature-wise Sentiment Scoring**

In [None]:
df.head(1)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,user_comment,user_comment_non_embedding
0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable,absolutely wonderful silky sexy comfortable


In [None]:
# !pip install transformers
from transformers import pipeline
qa_model = pipeline("question-answering")
question = (' '.join("size/fit/waist".split('/'))).strip()
context = "Some major design flaws I had such high hopes for this dress and \
really wanted it to work for me. i initially ordered the petite small (my \
usual size) but i found this to be outrageously small. so small in fact that \
i could not zip it up! i reordered it in petite medium, which was just ok. \
overall, the top half was comfortable and fit nicely, but the bottom half had \
a very tight under layer and several somewhat cheap (net) over layers. imo, \
a major design flaw was the net over layer sewn directly into the zipper - it c"
context = "This shirt is very flattering to all due to the adjustable front tie. \
 it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!"
ans = qa_model(question = question, context = context)
ans

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'sleeveless', 'end': 138, 'score': 0.5273441076278687, 'start': 128}

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
import operator
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sia.polarity_scores(ans['answer'])["compound"]

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


0.0

**Scaling to the entire data**

In [None]:
def extract_feature_phrase(text, clas, window=3):
  # returns up to `window` words on each side of the feature mention
  tokens = text.split(' ')
  clas_set = set(clas.split('/'))
  phrases = clas_set.intersection(tokens)
  if phrases:
    # a class keyword occurs verbatim in the review: take the window around it
    feature_idx = tokens.index(list(phrases)[0])
    start = max(0, feature_idx - window)  # avoid a negative slice start
    feature_phrases = tokens[start: feature_idx + window + 1]
  else:
    # otherwise ask the QA model for the most relevant span
    phrase = qa_model(question = (' '.join(clas.split('/'))).strip(), context = text)["answer"]
    text_list = [part.split(' ') for part in text.split(phrase)]
    # last `window` tokens before the span + the span + the first tokens after it
    feature_phrases = text_list[0][-window:] + phrase.split(' ') + text_list[1][:window+1]
  return (' '.join(feature_phrases)).strip()

extract_feature_phrase(context, "size/fit/waist")

'it is  sleeveless  so it pairs'
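
The windowing step can also be looked at in isolation, without the QA fallback. The sketch below is a simplified stand-in, not the function above: `keyword_window` is a hypothetical helper that works on a token list, picks the first keyword hit in reading order, and clamps the left edge of the window.

```python
from typing import List, Set


def keyword_window(tokens: List[str], keywords: Set[str], k: int = 3) -> List[str]:
    """Return up to k tokens on each side of the first keyword hit, else []."""
    for idx, tok in enumerate(tokens):
        if tok in keywords:
            start = max(0, idx - k)  # clamp so a hit near the start keeps its context
            return tokens[start: idx + k + 1]
    return []


tokens = "fit perfectly love color fabric feel cheap wash".split()
size_window = keyword_window(tokens, {"size", "fit", "waist"})
fabric_window = keyword_window(tokens, {"fabric"})
no_hit = keyword_window(tokens, {"button"})
```

An empty result signals that no keyword was found, which is the case where a fallback such as the QA model would take over.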

In [None]:
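# choosing only 10 clothing ids to save compute for now, this can be scaled to all later
cloth_id_list = df['Clothing ID'].unique()[:10]
# .copy() so that the columns added below are written to an independent frame, not a slice of the original
df = df[df['Clothing ID'].isin(cloth_id_list)].copy()
df.shape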

(996, 12)

In [None]:
for clas in classes:
  df[clas+'_phrase'] = df.user_comment_non_embedding.apply(lambda x: extract_feature_phrase(x, clas) if x else '')
  df[clas+"_sentiment_score"] = df[clas+'_phrase'].apply(lambda x: sia.polarity_scores(x)["compound"])
  # the three conditions cover every score; default='neu' keeps the result a pure string array
  df[clas+"_sentiment"] = np.select([df[clas+"_sentiment_score"] < 0, df[clas+"_sentiment_score"] == 0, df[clas+"_sentiment_score"] > 0],
                             ['neg', 'neu', 'pos'], default='neu')
df.head(1)


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,...,size/fit/waist_sentiment,look/style/design_phrase,look/style/design_sentiment_score,look/style/design_sentiment,color/shade_phrase,color/shade_sentiment_score,color/shade_sentiment,fabric/make/feel/quality/wash/button_phrase,fabric/make/feel/quality/wash/button_sentiment_score,fabric/make/feel/quality/wash/button_sentiment
0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,...,pos,silky sexy comfortable,0.7717,pos,silky sexy comfortable,0.7717,pos,silky sexy comfortable,0.7717,pos


In [None]:
# handling the scenario where there is no review just rating (converting the same rating to sentiment for all classes)
'''
1 Star: Hate it. [negative]
2 Star: Don't like it. [negative]
3 Star: It's okay. [neutral]
4 Star: Like it. [positive]
5 Star: Love it. [positive]
'''
def rating_to_sentiment(rating):
  if rating<3:
    return 'neg'
  elif rating>3:
    return 'pos'
  else:
    return 'neu'

for clas in classes:
  df[clas+'_sentiment'] = df.apply(lambda x: rating_to_sentiment(x.Rating) if not x[clas+'_phrase'] else x[clas+'_sentiment'], axis=1)



In [None]:
df[['Clothing ID', "user_comment_non_embedding"]].groupby('Clothing ID').count()

Unnamed: 0_level_0,user_comment_non_embedding
Clothing ID,Unnamed: 1_level_1
767,2
847,4
853,6
858,21
1049,32
1065,16
1077,297
1080,289
1095,327
1120,2


In [None]:
# computing product-wise feature-wise sentiment
cols = ['Clothing ID'] + [clas + '_sentiment' for clas in classes]
df[cols].groupby('Clothing ID').agg(lambda x: pd.Series.value_counts(x).to_dict())

Unnamed: 0_level_0,size/fit/waist_sentiment,look/style/design_sentiment,color/shade_sentiment,fabric/make/feel/quality/wash/button_sentiment
Clothing ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
767,{'pos': 2},{'pos': 2},"{'pos': 1, 'neg': 1}",{'pos': 2}
847,{'pos': 4},{'pos': 4},{'pos': 4},{'pos': 4}
853,{'pos': 6},{'pos': 6},{'pos': 6},"{'pos': 5, 'neu': 1}"
858,"{'pos': 16, 'neg': 4, 'neu': 1}","{'pos': 18, 'neg': 3}","{'pos': 17, 'neg': 3, 'neu': 1}","{'pos': 17, 'neu': 2, 'neg': 2}"
1049,"{'pos': 28, 'neu': 3, 'neg': 1}","{'pos': 31, 'neu': 1}","{'pos': 29, 'neu': 2, 'neg': 1}","{'pos': 30, 'neu': 1, 'neg': 1}"
1065,"{'pos': 10, 'neu': 6}","{'pos': 15, 'neu': 1}","{'pos': 15, 'neu': 1}","{'pos': 11, 'neu': 4, 'neg': 1}"
1077,"{'pos': 227, 'neu': 42, 'neg': 28}","{'pos': 239, 'neu': 32, 'neg': 26}","{'pos': 269, 'neu': 17, 'neg': 11}","{'pos': 250, 'neu': 32, 'neg': 15}"
1080,"{'pos': 243, 'neu': 39, 'neg': 7}","{'pos': 258, 'neu': 18, 'neg': 13}","{'pos': 259, 'neu': 20, 'neg': 10}","{'pos': 250, 'neu': 29, 'neg': 10}"
1095,"{'pos': 243, 'neu': 57, 'neg': 27}","{'pos': 271, 'neu': 30, 'neg': 26}","{'pos': 294, 'neu': 20, 'neg': 13}","{'pos': 249, 'neu': 48, 'neg': 30}"
1120,{'pos': 2},{'pos': 2},{'pos': 2},{'pos': 2}
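
To turn these per-product counts into the kind of contextual label the business team asked for, one option is to report the dominant sentiment and its share. This is a minimal sketch: `dominant_sentiment` is a hypothetical helper, and treating an empty count as neutral is my own assumption.

```python
from typing import Dict, Tuple


def dominant_sentiment(counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the most frequent sentiment label and its share of all reviews."""
    total = sum(counts.values())
    if total == 0:
        return "neu", 0.0  # assumption: no reviews is reported as neutral
    label = max(counts, key=counts.get)
    return label, counts[label] / total


# Counts copied from the size/fit/waist column for Clothing ID 1077 above.
label, share = dominant_sentiment({"pos": 227, "neu": 42, "neg": 28})
summary = f"size/fit/waist: {label} ({share:.0%} of reviews)"
```

A label of this form states both the feature and the direction of the sentiment, which is the context the word cloud and the Amazon-style labels were said to be missing.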
