# Understanding Customers Using Review Data

The purpose of this project is to try out Python concepts and techniques I've learned from Computational Concept in HCDE to understand the customers of a women's clothing ecommerce website. To understand the customers, I will calculate and visualize descriptive statistics to gain a basic understanding of the dataset, conduct sentiment analysis on the review text to uncover key words in positive and negative reviews, and conduct topic modeling to find customer segments.

## Dataset

The dataset used in this project is the [women's ecommerce clothing reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) dataset found on Kaggle. It consists of real reviews from an anonymized women’s clothing e-commerce platform. The data is a collection of 22641 rows and 10 column variables. Each row consists of a written review along with additional customer information. The 10 variables are clothing ID, age, title, review text, rating, recommended IND, positive feedback count, division name, department name, and class name.


In [None]:
%pylab inline

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv('../input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
data = data.drop(['Unnamed: 0'], axis=1)

Remove rows with missing values and store the result as a new CSV file

In [None]:
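# Keep the unfiltered frame in `data` so the original row count stays available for comparison
clean_data = data.dropna(how='any')
clean_data.to_csv('Womens_Clothing_clean.csv')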
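
As a small illustration of what `dropna(how='any')` does, here is a sketch on a made-up three-row frame (the values are hypothetical and not taken from the reviews dataset). It contrasts `how='any'` with `how='all'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the reviews dataset
toy = pd.DataFrame({
    "Title": ["Great dress", np.nan, "Too small"],
    "Review Text": ["Fits well", "Lovely color", np.nan],
    "Rating": [5, 4, 2],
})

# how='any' drops a row if at least one of its columns is missing
clean_any = toy.dropna(how="any")
# how='all' only drops a row if every column is missing
clean_all = toy.dropna(how="all")

print(len(toy), len(clean_any), len(clean_all))
```

With `how='any'`, a review that is missing only its title is dropped as well, which is worth keeping in mind when comparing row counts before and after cleaning.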

Use clean data

In [None]:
df = pd.read_csv('/kaggle/input/womens-clothing-cleancsv/Womens_Clothing_clean.csv')
df = df.drop(['Unnamed: 0'], axis=1)

print('Number of rows of original dataset: %d' % len(data))
print('Number of rows after dropping na: %d' % len(df))

In [None]:
df

## Descriptive statistics

Mean age and mean rating: 

In [None]:
print('Mean reviewer age: ' , round(df['Age'].mean(), 3))
print('Mean rating: ', round(df['Rating'].mean(), 3))

#### Plotting distribution of age and rating

In [None]:
# histogram for age
age = pd.Series(df['Age'])
# age.value_counts()
age_plot = age.plot.hist(bins=20, rwidth=0.9, color='#607c8e', title = 'Age distribution')
plt.xlabel('Age')

The age distribution plot indicates a slightly right-skewed age distribution (the longer tail extends toward older ages) with a median value around 35 - 40. Most of the customers who shop at this website are younger than 60 years old.
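
The direction of skew can be hard to judge from a histogram alone. A minimal sketch of a numeric check, using made-up ages rather than the real `Age` column: compare the mean with the median and look at the sign of the sample skewness from pandas' `Series.skew()`.

```python
import pandas as pd

# Hypothetical ages with a long tail toward older values
toy_ages = pd.Series([25, 30, 32, 35, 36, 38, 40, 41, 45, 60, 75, 90])

mean_age = toy_ages.mean()
median_age = toy_ages.median()
# Positive skewness means a longer right tail, negative means a longer left tail
skewness = toy_ages.skew()

print("mean:", round(mean_age, 2))
print("median:", median_age)
print("skewness:", round(skewness, 2))
```

The same three lines could be run on `df['Age']` to confirm what the plot suggests.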

In [None]:
# histogram for rating
rating = pd.Series(df['Rating'])
# rating.value_counts()
rating_plot = rating.plot.hist(color='#607c8e')
plt.title('Rating distribution')
plt.xlabel('Rating')
plt.ylabel('Frequency')

The largest group of reviewers gave the highest rating, 5.

In [None]:
rating_by_age = df.groupby('Age')['Rating'].mean()

rating_by_age_plot = rating_by_age.plot(kind='bar', title='Ratings by Age', color='#4290be')



The Ratings by Age graph shows that reviewers aged 76 or younger have similar mean ratings, between 4 and about 4.5. Mean ratings of reviewers older than 76 fluctuate more. Whether the reviewers who identified themselves as above 80 years old reported their real age is questionable. However, after reading the reviews from the higher age groups, I did not find any evidence of fake ages, so I decided to keep the data for those age groups.
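
One possible reason for the larger fluctuation, which I have not verified on this dataset, is that the oldest age groups may contain only a few reviewers, so a single rating moves the mean a lot. A sketch of how to look at group size next to the mean, on hypothetical data:

```python
import pandas as pd

# Hypothetical ratings: many reviewers at age 30, very few at 85 and 92
toy = pd.DataFrame({
    "Age": [30, 30, 30, 30, 85, 85, 92],
    "Rating": [4, 5, 4, 5, 5, 2, 1],
})

# Report the mean rating together with the number of reviews behind it
summary = toy.groupby("Age")["Rating"].agg(["mean", "count"])
print(summary)
```

Running the same aggregation on `df` would show whether the noisy means above 76 come from small groups.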


In [None]:
rating_by_department = df.groupby('Department Name')['Rating'].mean()

rating_by_department_plot = rating_by_department.plot(kind='bar', title='Ratings by Department', color='#4290be')


Mean ratings do not differ much across departments.

In [None]:
rating_by_class = df.groupby('Class Name')['Rating'].mean()
rating_by_class_plot = rating_by_class.plot(kind='bar', title='Ratings by Class', color='#4290be')

Similarly, mean ratings do not differ much across clothing classes.

## Sentiment Analysis on review text

### Labeling reviews as positive or negative
Since far more customers rated items 5 than any other score, and reviews with ratings below 5 typically mention something the customer was not satisfied with, I will label reviews with a rating of 5 as positive and anything below as negative.

In [None]:
sentiment_df = df
sentiment_df['label'] = ['pos' if rating == 5 else 'neg' for rating in df['Rating']] 
sentiment_df.head()

### Split training and testing sets for the final model

In [None]:
from sklearn.model_selection import train_test_split
target = [1 if label == 'pos'  else 0 for label in df['label']] # use 1 and 0 to represent labels
review_train, review_test, target_train, target_test = train_test_split(sentiment_df['Review Text'].values, target, test_size=0.20, random_state=1)
print(review_train)

### Vectorization

In order to feed the review text into machine learning algorithms, we need to convert the text into numbers. 

After testing out combinations of different text processing techniques such as tf-idf, bi-grams, stemming and others, using CountVectorizer with unigrams through tri-grams (`ngram_range=(1, 3)`) and a short stop-word list yielded the best accuracy.
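
To make the vectorization step concrete, here is a minimal sketch on two made-up reviews (the sentences are hypothetical). It uses the same `CountVectorizer` settings as this notebook and lists the features that come out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews
toy_reviews = ["love the fit", "did not love the fabric"]

# Same settings as in this notebook: binary indicators, 1- to 3-grams, short stop-word list
vec = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=['in', 'of', 'at', 'a', 'the'])
matrix = vec.fit_transform(toy_reviews)

# vocabulary_ maps each n-gram to its column index in the matrix
features = sorted(vec.vocabulary_)
print(features)
print(matrix.shape)
```

Stop words are removed before the n-grams are built, so "love the fit" produces the bi-gram `love fit`. With `binary=True`, each cell records only whether an n-gram occurs in a review, not how often.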

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
stop_words=['in','of','at','a','the']
trigram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words) #initialize vectorizer using tri-gram and removing stop words
trigram_vectorizer.fit(review_train)
X = trigram_vectorizer.transform(review_train) # convert training data
X_test = trigram_vectorizer.transform(review_test) # convert testing data

### Build classifier

Compared with logistic regression, a support vector machine (SVM) yielded higher accuracy and had a shorter computation time, so I decided to use an SVM to fit the model.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# using svm
X_train, X_val, y_train, y_val = train_test_split(
    X, target_train, train_size = 0.75, random_state=1) # set random state to retain the same data split each time.  

# Tuning hyperparameter c to adjust regularization and see which one yields the highest accuracy
for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, svm.predict(X_val))))




From the above results we see that C=0.005 yields the highest accuracy. Therefore, we will train our final model using C=0.005.
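
Instead of reading the best value off the printed output, the validation scores can be collected in a dictionary and the best `C` picked programmatically. The sketch below runs on synthetic data from `make_classification`, which is only a stand-in for the vectorized reviews; the variable names are my own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized reviews and their labels
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.75, random_state=1)

candidates = [0.001, 0.005, 0.01, 0.05, 0.1]
scores = {}
for c in candidates:
    model = LinearSVC(C=c)
    model.fit(X_tr, y_tr)
    # Validation accuracy for this value of C
    scores[c] = accuracy_score(y_va, model.predict(X_va))

# Pick the C with the highest validation accuracy (first one wins on ties)
best_c = max(scores, key=scores.get)
print("Best C:", best_c, "accuracy:", scores[best_c])
```

The selected value can then be passed to the final model rather than hard-coded.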

In [None]:
final_model = LinearSVC(C=0.005)
final_model.fit(X, target_train)
print ("Final Accuracy: %s" 
       % accuracy_score(target_test, final_model.predict(X_test)))

Let’s look at the 10 most discriminating words for both positive and negative reviews. We’ll do this by looking at the largest and smallest coefficients, respectively.

In [None]:
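# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0 onward
feature_to_coef = {
    word: coef for word, coef in zip(
        trigram_vectorizer.get_feature_names_out(), final_model.coef_[0]
    )
}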
print('positive:')

for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:10]:
    print (best_positive)

print('\nnegative:')
    
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:10]:
    print (best_negative)
    


From the output we can see that the words most strongly associated with positive reviews usually express positive emotions, such as 'perfect' and 'love.' Among the negative features, besides adjectives that usually describe negative emotions, words related to returning items rank high on the list.

## Clustering review text - Topic modeling using LDA

Topic modeling is used to extract topics from large collections of documents. Latent Dirichlet Allocation (LDA) is one way to do topic modeling. 

### Prepare data

In [None]:
df

In [None]:
reviews = df[['Review Text']].copy()  # copy to avoid pandas' SettingWithCopyWarning
reviews['index'] = reviews.index
documents = reviews
reviews

### Preprocessing and vectorization

#### Lemmatizing, stemming, removing stop words

* Stopwords are removed.
* Lemmatizing — words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Stemming — words are reduced to their root form.

In [None]:
# Loading libraries for text processing
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

In [None]:
# functions for tokenizing, stemming, lemmatizing, and removing stop words
stemmer = SnowballStemmer('english')
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


A sample review being processed

In [None]:
doc_sample = documents[documents['index'] == 4].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))



Process all review text

In [None]:
processed_docs = documents['Review Text'].map(preprocess)
processed_docs

#### Compute Bi/Tri-gram

Bigrams are pairs of words that frequently occur together in the documents. Trigrams are groups of three words that frequently occur together.
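
gensim's `Phrases` decides which pairs to join using a scoring function, but the underlying idea of counting adjacent word pairs can be sketched in plain Python. The documents and the `min_count` value below are hypothetical, and this simplified version uses raw counts only.

```python
from collections import Counter

# Hypothetical tokenized reviews
toy_docs = [
    ["run", "small", "order", "size"],
    ["run", "small", "return"],
    ["love", "color", "run", "small"],
]

# Count every pair of adjacent tokens across all documents
pair_counts = Counter()
for doc in toy_docs:
    pair_counts.update(zip(doc, doc[1:]))

# Keep pairs seen at least min_count times and join them with '_'
min_count = 2
frequent_pairs = {"_".join(pair): n for pair, n in pair_counts.items() if n >= min_count}
print(frequent_pairs)  # {'run_small': 3}
```

The joined form `run_small` mirrors how detected phrases appear as single tokens containing an underscore.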

In [None]:
# using gensim library to generate bi-grams and tri-grams
from gensim.models import Phrases

# Add bigrams and trigrams to docs (bigrams must appear 10 times or more; the trigram model uses gensim's default min_count).
bigram = Phrases(processed_docs, min_count=10)
trigram = Phrases(bigram[processed_docs])

for idx in range(len(processed_docs)):
    for token in bigram[processed_docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            processed_docs[idx].append(token)
    for token in trigram[processed_docs[idx]]:
        if '_' in token:
            # Token is a phrase found by the trigram model, add to document.
            processed_docs[idx].append(token)

#### Corpus and Dictionary 

The two main inputs to the LDA topic model are the dictionary and the corpus.
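
To show what these two inputs look like, here is a plain-Python sketch that mimics the idea behind a dictionary and a bag-of-words corpus. It does not call gensim, and both the toy documents and the way ids are assigned are my own simplification.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [["dress", "fit", "fit"], ["dress", "color"]]

# "Dictionary": one integer id per unique token (assigned alphabetically here)
vocab = sorted({token for doc in toy_docs for token in doc})
token2id = {token: i for i, token in enumerate(vocab)}

# "Corpus": each document becomes a list of (token_id, count) pairs
bow = [
    sorted((token2id[token], n) for token, n in Counter(doc).items())
    for doc in toy_docs
]

print(token2id)
print(bow)
```

Word order is discarded in this representation; only which tokens occur in a document, and how often, is kept.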

In [None]:
dictionary = gensim.corpora.Dictionary(processed_docs)
print('Number of unique words in initial documents:', len(dictionary))
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

# Filter out words that occur in fewer than 10 documents (the default no_above=0.5 also drops words found in more than 50% of documents).
dictionary.filter_extremes(no_below=10)
print('Number of unique words after removing rare and common words:', len(dictionary))



In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Preview sample processed documents:

In [None]:
bow_doc_4 = bow_corpus[4]
# Each entry is a (token_id, count) pair
for token_id, token_count in bow_doc_4:
    print("Word {} (\"{}\") appears {} time(s).".format(token_id, dictionary[token_id], token_count))

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(bow_corpus))

### Train Baseline LDA model

In [None]:
from gensim.models import LdaModel

num_topics = 9
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=4, workers=3, random_state=100)

Look at the terms in each topic

In [None]:
from pprint import pprint
# Print the keywords for each of the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[bow_corpus]

#### Compute Baseline Coherence Score


In [None]:
from gensim.models import CoherenceModel
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nBaseline Coherence Score: ', coherence_lda)

### Hyperparameter Tuning

For the simplicity of the project, I will only tune the number of topics (K).


The C_v coherence score will be used as the metric for performance comparison.

In [None]:
def compute_coherence_values(dictionary, corpus, texts, max_topics=25, min_topics=3, step_size=2):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    max_topics : Upper bound on the number of topics (exclusive)
    min_topics : Smallest number of topics to try
    step_size : Increment between numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(min_topics, max_topics, step_size):
        # Use the corpus argument rather than the global bow_corpus; workers=3 trains in parallel
        model = gensim.models.LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=4, workers=3, random_state=100)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())

    return model_list, coherence_values

In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=bow_corpus, texts=processed_docs, max_topics = 25, min_topics=3, step_size=2)

*Took about 15 min to run models with different numbers of topic*

In [None]:
# Show graph
max_topics = 25
min_topics=3
step_size=2
x = range(min_topics, max_topics, step_size)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # pass a list; a bare string would be split into characters
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

### Running final model

Based on the coherence scores above, I chose 17 topics for the final model. In the previous models, I kept the number of passes low to save time, but in the final model I increase passes to 10 for better training results.

In [None]:
optimal_num_topics = 17
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=optimal_num_topics, id2word=dictionary, passes=10, workers=3, random_state=100)

In [None]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nFinal Model Coherence Score: ', coherence_lda)

Compared to the baseline model, the coherence score has increased. 

#### Finding the dominant topic in each review

To determine what topic a given document is about, we find the topic number that has the highest percentage contribution in that document.
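
For a single document, this boils down to taking the maximum over its topic distribution. A tiny sketch with hypothetical numbers, written in the `(topic_id, probability)` shape that the LDA model returns per document:

```python
# Hypothetical topic distribution for one review
doc_topics = [(0, 0.10), (3, 0.65), (7, 0.25)]

# The dominant topic is the pair with the largest probability
dominant_topic, contribution = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, contribution)  # 3 0.65
```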

In [None]:
def format_topics_sentences(ldamodel=lda_model, corpus=bow_corpus, texts=processed_docs):
    # Collect one row per document; DataFrame.append was removed in pandas 2.0
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the one with the highest percentage contribution
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=bow_corpus, texts=processed_docs) #aggregates this information in a presentable table.

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']



In [None]:
# Show
df_dominant_topic

In [None]:
reviews.iloc[3497,0]

#### Finding the most representative document for each topic 

To help interpret each topic, we find the document in which that topic has the highest percentage contribution and infer the topic by reading that document.

In [None]:
# Keep the single most representative document for each topic
sent_topics_sorteddf = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0) # Perc_Contribution is percentage contribution of the topic in the given document.

# Reset Index    
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf

#### Checking age and rating differences for each topic group

In [None]:
age_rating_by_topics = df_dominant_topic
age_rating_by_topics['Rating'] = df['Rating']
age_rating_by_topics['Age'] = df['Age']


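# Select multiple columns with a list; tuple-style selection after groupby is no longer supported in recent pandas
age_rating_by_topics.groupby('Dominant_Topic')[['Age', 'Rating']].mean()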


There is no major difference in age among the topic groups. Group 15 has the lowest rating.

### Visualizing topics

In [None]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:

pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)

The interactive visualization allows you to explore the most relevant terms in each topic. By comparing the estimated term frequency with overall term frequency we can also identify words that are unique to the topic.

### Interpreting topics

Some topics are harder to interpret than others. By combining all the information above and based on my subjective interpretation, the following topic groups are formed:
* Topic 1: petite buyer
* Topic 2: loves to report the compliments they receive when wearing the item
* Topic 3: recommender, also loves to try in store
* Topic 4: loves the purchase because it fits very well.
* Topic 5: finds that the sizing runs large and loves to tell other reviewers so
* Topic 6: loves reading reviews
* Topic 7: buys during sales and usually happy purchase
* Topic 8: recommender, buying fall and winter clothes
* Topic 9: review reader
* Topic 10: loves the high quality
* Topic 11: jeans/pants buyer, cares about the fit
* Topic 12: intimates buyer, loves light weight
* Topic 13: cami buyer, cares most about color
* Topic 14: skirt buyer 
* Topic 15: disappointed buyer 
* Topic 16: bathing suit buyer
* Topic 17: (difficult to interpret)


## Conclusion

A lot of assumptions were made when exploring the data. For example, we assume that people honestly reported their age. In addition, not all customers leave reviews for the products they've purchased, so this exploration of customer segmentation only captures those who write reviews. Even though I was able to interpret some of the topics, the interpretation is very subjective and not always obvious. When doing customer segmentation in the real world (not just exploring Python like me), it would be best to have other features about the customers. Such features may come from other data collection methods such as large-scale surveys.

Thank you for reading my exploration of the women's clothing reviews dataset!