## Word2Vec Practice with Clothing Review Data

Let's first import the necessary libraries and read our data file.

In [None]:
import pandas as pd
import numpy as np
import spacy
import re
from time import time

In [None]:
df = pd.read_csv('../input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.head()

In [None]:
df = df[['Title', 'Review Text']]  # keep only the text-based columns
df.head()

In [None]:
df.isnull().sum() #finding the null values

In [None]:
df = df.dropna().reset_index(drop=True) #and dropping them 
df.isnull().sum() 

In [None]:
df.shape

Let's load spaCy to lemmatize and clean our text. This simple approach suits fairly tidy review text; it is unlikely to cope well with dirtier, more unruly data.

In [None]:
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser'])

In [None]:
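def cleaning(doc):
    """Lemmatize a spaCy Doc and drop stop words."""
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Reviews with two tokens or fewer give Word2Vec too little context,
    # so they return None and are removed later by dropna().
    if len(txt) > 2:
        return ' '.join(txt)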

In [None]:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['Review Text'])
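
To make the effect of this regular expression concrete, here is a small sketch that applies the same pattern to a single string. The sample review is made up for illustration and does not come from the dataset.

```python
import re

# Hypothetical review text, invented for this example only.
sample = "LOVE this dress!! It's sooo pretty (size 4), 100% worth it."

# Same pattern as in the cell above: every run of characters that is not
# a letter or an apostrophe collapses into one space, then we lowercase.
cleaned = re.sub("[^A-Za-z']+", ' ', sample).lower()
print(repr(cleaned))
```

Digits and punctuation disappear, apostrophes survive (so "it's" stays one token), and a trailing space can remain where the final punctuation was.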

In [None]:
t = time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

This is the final clean text data we are going to be working with.

In [None]:
df_clean.head() 

Finally, we import the Word2Vec model.

In [None]:
import gensim 
from gensim.models import Word2Vec

In [None]:
sent = [row.split() for row in df_clean['clean']]  # Word2Vec expects a list of token lists, one per review

In [None]:
print(sent[:10])

Training the model:

In [None]:
t = time()

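# sg=1 selects skip-gram; min_count=1 keeps every word, however rare.
# Note: gensim 4.0+ names the `size` argument `vector_size`.
model = Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)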

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))
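
With `sg=1` the model is trained as skip-gram: each word is used to predict the words around it, and `window=3` limits how far away those context words can be. The sketch below shows which (center, context) pairs a window of 3 produces for one tokenized review. It is a simplified illustration, not gensim's actual training loop: the helper `skipgram_pairs` and the example sentence are hypothetical, and details such as negative sampling are ignored.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions: up to `window` tokens on each side of i.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical cleaned review, invented for this example.
example = ["love", "dress", "fit", "perfectly", "soft", "fabric"]
pairs = skipgram_pairs(example, window=3)
for pair in pairs[:5]:
    print(pair)
```

A larger window tends to capture broader topical similarity, while a small one like 3 stays close to each word's immediate neighbours.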

## Finally we are able to get some insights out of our model

Most similar words:

In [None]:
model.wv.most_similar(positive=["dress"])

In [None]:
model.wv.most_similar(positive=["jumper"])

In [None]:
model.wv.most_similar(positive=["skirt"])

In [None]:
model.wv.most_similar(positive=["favorite"])

In [None]:
model.wv.most_similar(positive=["favourite"])

How similar are two words?

In [None]:
model.wv.similarity("little", 'petite')

In [None]:
model.wv.similarity("pencil", 'skirt')
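
The score returned by `similarity` is the cosine similarity between the two word vectors. Here is a minimal pure-Python sketch of that computation; the three-dimensional vectors are toy values chosen for illustration, not vectors from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not real embeddings).
little = [1.0, 2.0, 0.0]
petite = [2.0, 4.0, 0.0]   # same direction as `little`
pencil = [0.0, 0.0, 5.0]   # orthogonal to both

print(cosine_similarity(little, petite))
print(cosine_similarity(little, pencil))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.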

Which one doesn't fit?

In [None]:
model.wv.doesnt_match(['skirt', 'dress', 'book'])
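
As far as I understand it, `doesnt_match` works roughly by averaging the normalized vectors of all the given words and returning the word whose vector is least similar to that average. The sketch below reproduces that idea in pure Python under that assumption; `odd_one_out` is a hypothetical helper and the two-dimensional vectors are toy values, not embeddings from the model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors, itself rescaled to unit length.
    mean = unit([sum(u[k] for u in units.values()) / len(units)
                 for k in range(dim)])
    # The word with the lowest cosine similarity to the mean stands out.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Toy vectors (assumed values): two garments point one way, "book" another.
toy = {"skirt": [1.0, 0.1], "dress": [0.9, 0.2], "book": [0.1, 1.0]}
print(odd_one_out(toy))
```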