# Dataset - Women's E-Commerce Clothing Reviews

## References 

* https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
* http://math.mit.edu/~gs/linearalgebra/ila0601.pdf
* http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
* https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/
* https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/
* https://medium.com/analytics-vidhya/text-mining-101-a-stepwise-introduction-to-topic-modeling-using-latent-semantic-analysis-using-add9c905efd9



## Setting up environment

In [None]:
# Global variables 
dataset_filename = "../input/Womens Clothing E-Commerce Reviews.csv"

In [None]:
# Import libraries
import re
import string

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

%matplotlib inline

In [None]:
# Loading dataset
data = pd.read_csv(dataset_filename, index_col=0)

# Question 1: Preprocess the corpus of customer reviews dataset

## Performing Some Basic EDA

In [None]:
# Looking at some of the top rows of dataset
data.head()

In [None]:
# Description
data.describe()

In [None]:
# Info
data.info()

## Selecting Clothing ID

In [None]:
# Printing the unique Clothing IDs along with their frequencies
data['Clothing ID'].value_counts()[:5]

In [None]:
# Clothing ID 1078 is the most frequently reviewed item, so we use it for the rest of the project
datax = data.loc[data['Clothing ID'] == 1078, :] # This subset is called datax from here on
datax.head()
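
Rather than hard-coding 1078, the most frequent Clothing ID can be looked up programmatically. The sketch below uses a tiny made-up DataFrame (`toy`, with hypothetical values) in place of the real dataset; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    "Clothing ID": [1078, 862, 1078, 1094, 1078, 862],
    "Review Text": ["a", "b", "c", "d", "e", "f"],
})

# value_counts() counts each ID; idxmax() returns the ID with the highest count
most_common_id = toy["Clothing ID"].value_counts().idxmax()

# Keep only the reviews for that item
toy_subset = toy.loc[toy["Clothing ID"] == most_common_id, :]
print(most_common_id, len(toy_subset))
```

On the real data, the same two lines applied to `data` should select the item that the notebook picks by hand.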

In [None]:
datax.info()

## Extracting our Text Corpus

In [None]:
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent call
corpus = [review for (_, review) in datax['Review Text'].items() if isinstance(review, str)]

## Creating Review to ID Dictionary

In [None]:
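# Creating a dictionary that maps each review to its position in the corpus
# (identical review texts share one entry, keeping the last position)
review_to_id_dict = {review: idx for (idx, review) in enumerate(corpus)}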

## Tokenize

In [None]:
# Reviews have different lengths, so the array must hold Python lists (dtype=object)
corpus_tokenized = np.array([review.split() for review in corpus], dtype=object)

print(corpus_tokenized[:5])

# Question 2: Remove stopwords, standardize tokens

### Removing All Symbols other than Alphabets and Converting all Letters to Lowercase

In [None]:
for i in range(5):
  print(i, corpus[i])

In [None]:
corpus1 = []

for review in corpus:
  if isinstance(review,str):
    review = review.split()
    review = [re.sub('[^A-Za-z]+', '', x) for x in review]
    review = [x.lower() for x in review if len(x) > 0]
    corpus1.append(' '.join(review))
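
To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same regular expression to one sample string. The helper name `clean_review` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(review):
    """Strip every non-letter from each token, drop empty tokens, lowercase the rest."""
    tokens = review.split()
    tokens = [re.sub('[^A-Za-z]+', '', t) for t in tokens]
    return ' '.join(t.lower() for t in tokens if t)

# Hypothetical sample in the style of the reviews
sample = "I didn't hate it, but i don't love it. 5'5'' 125lbs"
cleaned = clean_review(sample)
print(cleaned)
```

Note that apostrophes are removed inside words, so "don't" becomes "dont", and tokens made only of digits and symbols disappear entirely.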

### Getting Set of Stopwords

In [None]:
# We are using the NLTK list of English stopwords
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
             "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
             'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
             'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
             'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
             'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
             'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
             'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
             'by', 'for', 'with', 'about', 'against', 'between', 'into',
             'through', 'during', 'before', 'after', 'above', 'below', 'to',
             'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
             'again', 'further', 'then', 'once', 'here', 'there', 'when',
             'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
             'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
             'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
             'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
             'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
             "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
             'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
             'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
             'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Adding the empty string so that empty tokens are filtered out as well
stopwords.append('')

# Converting it to a set
stopwords = set(stopwords)

### Removing Stopwords from Reviews

In [None]:
# Removing stopwords and storing the result in a new list

corpus_sr = [] # Corpus after removing stopwords

for review in corpus1 :
  if isinstance(review, str):
    review = review.split()
    new_review = []
    for x in review:
      if x not in stopwords:
        new_review.append(x)
    corpus_sr.append(" ".join(new_review))
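
One caveat: the reviews were stripped of apostrophes before this step, so stopwords such as "don't" can no longer match their cleaned form "dont". A possible workaround, sketched here on a small hypothetical subset of the list (`stopwords_subset`), is to add apostrophe-free variants to the set.

```python
import re

# Small hypothetical subset of the stopword list, for illustration only
stopwords_subset = {"i", "it", "but", "don't", "didn't"}

# Add apostrophe-free variants so they match tokens cleaned with [^A-Za-z]+
stopwords_extended = stopwords_subset | {
    re.sub("[^a-z]+", "", w) for w in stopwords_subset
}

cleaned_review = "i didnt hate it but i dont love it"
kept = [w for w in cleaned_review.split() if w not in stopwords_extended]
print(" ".join(kept))
```

Whether negations like "dont" should be removed at all depends on the task; for retrieval by topic words it is probably harmless, but that is a judgment call.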

### Stemming and Lemmatization

In [None]:
# Creating a list of all words present in review
word_list = []

for review in corpus_sr:
  word_list.extend(review.split())

word_list = list(set(word_list))

print(word_list[:10])

In [None]:
# Stemming
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

corpus_stemmed = []

for review in corpus_sr:
  review = [porter_stemmer.stem(x) for x in review.split()]
  corpus_stemmed.append(' '.join(review))

for i in range(5):
  print(corpus_sr[i])
  print(corpus_stemmed[i])

In [None]:
# Lemmatization
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

corpus_lemmatized = []

for review in corpus_sr:
  review = [wordnet_lemmatizer.lemmatize(x) for x in review.split()]
  corpus_lemmatized.append(' '.join(review))

for i in range(5):
  print(corpus_sr[i])
  print(corpus_lemmatized[i])

## Standardize Tokens

In [None]:
# Reviews have different lengths, so the array must hold Python lists (dtype=object)
corpus_tokenized = np.array([review.split() for review in corpus_lemmatized], dtype=object)

print(corpus_tokenized[:5])

# Question 3: Build the Term-Frequency Inverse-Document-Frequency (TF-IDF) matrix and apply the Latent Semantic Analysis (LSA) method

### Creating Word List

In [None]:
vocabulary = []
for review in corpus_lemmatized:
  vocabulary.extend(review.split())
vocabulary = sorted(set(vocabulary)) # Sorted so that word ids are reproducible across runs

# Printing the first few elements of vocabulary and the number of words in it
print(vocabulary[:5])
print(len(vocabulary))

# Mapping each word to an integer id and back
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
id_to_word = {idx: word for idx, word in enumerate(vocabulary)}

### Document-term matrix

In [None]:
m = len(corpus_lemmatized) # m = number of reviews 
n = len(vocabulary) # n = number of unique words

tfm = np.zeros((m, n),dtype=int) # Term frequency matrix
for i in range(m):
  words = corpus_lemmatized[i].split()
  for j in range(len(words)):
    word = words[j]
    tfm[i][word_to_id[word]] += 1 

### Term frequency inverse document frequency matrix

In [None]:
tmpm = tfm != 0 # Temporary matrix
dft = tmpm.sum(axis = 0) #the number of documents where term t appears
tfidfm = np.multiply(tfm, np.log(m/dft))
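
The cell above weights each count by the inverse document frequency, i.e. `tfidf[i, t] = tf[i, t] * ln(m / df[t])`. The following sketch runs the same three lines on a tiny hypothetical matrix (`toy_tfm`, 2 documents by 3 terms) so the effect is easy to check by hand.

```python
import numpy as np

# Hypothetical term-frequency matrix: 2 documents (rows) x 3 terms (columns)
toy_tfm = np.array([[1, 0, 2],
                    [0, 1, 1]])
toy_m = toy_tfm.shape[0]

# Document frequency: number of documents in which each term appears
toy_dft = (toy_tfm != 0).sum(axis=0)

# Same weighting as in the notebook: tf * ln(m / df)
toy_tfidf = np.multiply(toy_tfm, np.log(toy_m / toy_dft))
print(toy_tfidf)
```

The third term appears in every document, so its IDF is ln(2/2) = 0 and its column is zeroed out, while terms unique to one document keep a weight of ln 2 per occurrence.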

### Perform LSA using Singular Value Decomposition (SVD). Consider the TF matrix for SVD. You can also perform SVD on the TF-IDF matrix.

In [None]:
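# Thin SVD: tfm = U @ diag(s) @ VT (full_matrices=False avoids building the full m x m U)
U, s, VT = np.linalg.svd(tfm, full_matrices=False)

K = 2 # number of components

# Rank-K approximation of the term-frequency matrix
tfm_reduced = np.dot(U[:,:K], np.dot(np.diag(s[:K]), VT[:K, :]))
# Documents and terms projected onto the first K singular directions
docs_rep = np.dot(tfm, VT[:K, :].T)
term_rep = np.dot(tfm.T, U[:,:K])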
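
The two projections used here are equivalent to scaling the leading singular vectors by their singular values: projecting the matrix onto the first K right singular vectors gives `U_K * s_K`, and projecting its transpose onto the first K left singular vectors gives `V_K * s_K`. A minimal check on a small random matrix (`A`, a hypothetical stand-in for `tfm`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical count matrix: 6 documents x 4 terms
A = rng.integers(0, 3, size=(6, 4)).astype(float)

U, s, VT = np.linalg.svd(A, full_matrices=False)
K = 2

# Same projections as in the notebook
docs_rep = A @ VT[:K, :].T
term_rep = A.T @ U[:, :K]

# Equivalent closed forms: singular vectors scaled column-wise by singular values
docs_match = np.allclose(docs_rep, U[:, :K] * s[:K])
terms_match = np.allclose(term_rep, VT[:K, :].T * s[:K])
print(docs_match, terms_match)
```

This is why documents and terms can be compared in the same K-dimensional space: both representations are expressed along the same K singular directions.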

### Plot documents in the LSA/TF-IDF space

In [None]:
plt.scatter(docs_rep[:,0], docs_rep[:,1])
plt.title("Document Representation")
plt.show()

In [None]:
plt.scatter(term_rep[:,0], term_rep[:,1])
plt.title("Term Representation")
plt.show()

# Question 4: Compare the performance of Information Retrieval (IR) using both TF-IDF and LSA methods

In [None]:
query = 'nice good'

# Indices of the query words that exist in the vocabulary (unknown words are skipped)
key_word_indices = []

for x in query.split():
  if x in word_to_id:
    key_word_indices.append(word_to_id[x])

## IR using LSA with TF matrix

In [None]:
key_words_rep = term_rep[key_word_indices,:]     
query_rep = np.sum(key_words_rep, axis = 0)

print (query_rep)

In [None]:
# Cosine distance between the query and every document in the LSA space
query_doc_cos_dist = [cosine(query_rep, doc_rep) for doc_rep in docs_rep]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))

# Printing the five closest reviews
for rank, sort_index in enumerate(query_doc_sort_index[:5]):
    print(rank, query_doc_cos_dist[sort_index], corpus[sort_index])

## IR using TF-IDF matrix

In [None]:
# Query as a 1-D term-count vector (scipy's cosine expects 1-D inputs)
query_vector = np.zeros(n)
for x in key_word_indices:
  query_vector[x] += 1

# Applying the same IDF weighting as for the documents
query_vector = np.multiply(query_vector, np.log(m/dft))

query_doc_cos_dist = [cosine(query_vector, tfidfm[i]) for i in range(m)]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))

top_reviews = [] # Top five reviews returned by TF-IDF

for rank, sort_index in enumerate(query_doc_sort_index[:5]):
    print(rank, query_doc_cos_dist[sort_index], corpus[sort_index])
    top_reviews.append(corpus[sort_index])
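
Both retrieval cells follow the same pattern: compute the cosine distance from the query to every document, sort, and keep the first five. A minimal sketch of that pattern as one reusable function follows; the name `top_k_by_cosine` and the toy vectors are assumptions for illustration. It also treats all-zero document vectors (for example, reviews that are empty after stopword removal) as maximally unrelated instead of producing NaN.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=5):
    """Return (row index, cosine distance) for the k rows closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float).ravel()
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_matrix, axis=1)
    sims = np.zeros(len(doc_matrix))
    valid = denom > 0  # zero vectors keep similarity 0, i.e. distance 1
    sims[valid] = doc_matrix[valid] @ query_vec / denom[valid]
    dists = 1.0 - sims
    order = np.argsort(dists, kind="stable")[:k]
    return [(int(i), float(dists[i])) for i in order]

# Hypothetical 2-D document vectors; the third one is all zeros
docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 2.0]])
ranking = top_k_by_cosine([1.0, 0.0], docs, k=3)
print(ranking)
```

The same function could be called once with the LSA representations and once with the TF-IDF rows, which keeps the two rankings directly comparable.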

## Comparison Between LSA and TF-IDF

Query : nice good

### Result by LSA with TF matrix


* 3.906716264934218e-07 I really like this dress. i tried it on in the store and i had to take it home with me. the print is fun and bright and interesting, without being over the top. the grommets give it a little bit of edgy balance to the sweet, flowy shape. a note on the the fit: i am a 12/14 and the 12 fit well. it is a bit loose in some places, like the waist, but it certainly doesn't look like a maternity dress. i think the looseness is a good part of the overall look. i am high-waisted, and the waistline hit
* 4.924336040046384e-07 I really like the soft, flowy satin fabric and vibrant color of this dress! i am 5'5'' 125lbs, the s was too large, so i ordered the xs instead. things to note: pockets, elastic waistband with nice detailing on the sides, slip underneath, and large flutter sleeves. i think that women with smaller torsos should consider petite sizing.
* 2.0761380433720333e-06 I didn't hate it, but i don't love it. i purchased it in 2 colors b/c i thought i'd fall in love. it is just ok. maybe it's b/c i have a larger chest, but i felt like it was too small in the chest area and too loose elsewhere. not the maxi fit i normally go for. however, the material is great and it's a simple sunday dress to lounge around in and run errands comfortably.
* 2.6978156143497856e-06 This is the most flattering dress i have bought in years. the fact that it's machine washable was a huge plus. fabric is so comfortable and the length is perfect! i wear either m or l and went with the large for a little extra room, but either would have fit well.
* 4.388323377790826e-06 I was lucky enough to get a hold of this intarsia sweater dress after the sale and i wish i had purchased this the first time round. it is absolutely stunning, flattering, comfortable and unique! i am 5'3" and the regular hem fit me just fine at the ankles. i think this dress runs both tts to a bit large so i would size down if your small framed and stay your usual if your busty or broad shouldered. perfectly complements my taupe booties that i already owned and a my taupe maxi sweater that has


### Result by TF-IDF
* 0.5753370983500936 I love this dress . perfect fit and very good quality.
* 0.6377201772710546 I bought this dress because i thought it would look good on her, and it does.
* 0.6633070868863611 This dress really was huge, and not at all flattering. i don't know how they got it to look good on the model in the picture. material was nice and soft, but i really don't see how anyone could look good in this, no matter what your body shape is.
* 0.7419458491496504 The pattern and fabric on this dress are very nice. there is just too much fabric. it's much too baggy but could make a nice maternity dress. i'll be returning it.
* 0.7570324564069156 Really cute fun print. nice summer dress..



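For this query, TF-IDF gives the more relevant results: each of its top five reviews contains "good" or "nice", whereas only two of the five reviews returned by LSA (with K = 2 on the TF matrix) mention either word.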
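
One rough way to put a number on this comparison is to count how many of the retrieved reviews contain at least one query word. The sketch below does this for three short reviews; the helper `mentions_query` is an assumption for illustration, and the third review is made up. Keyword overlap is only a crude proxy for relevance, since LSA is meant to retrieve related documents even without exact word matches.

```python
import re

def mentions_query(review, query):
    """Return True if the review contains at least one query word as a whole token."""
    tokens = set(re.findall(r"[a-z]+", review.lower()))
    return any(word in tokens for word in query.lower().split())

query = "nice good"
reviews = [
    "I love this dress . perfect fit and very good quality.",
    "Really cute fun print. nice summer dress..",
    "The length is perfect and the fabric is so comfortable.",  # hypothetical review
]

hits = sum(mentions_query(r, query) for r in reviews)
print(f"{hits} of {len(reviews)} reviews mention a query word")
```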