## Text Preprocessing 
- Tokenization
- Stemming and Lemmatization
- Stop words
- POS (part-of-speech) tagging: assigning each word in a document a grammatical category (noun, verb, adjective, adverb, etc.)
- BOW (Bag Of Words)
- TF-IDF
- Word2Vec

### Word2Vec

In Text Preprocessing 1, we discussed the BOW (bag of words) and TF-IDF approaches. However, those methods do not store semantic information, and there is a chance of overfitting. The Word2Vec model addresses both problems.
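To make the missing semantic information concrete, here is a minimal sketch of a bag-of-words encoding on three toy token lists. The corpus and the helper name `bag_of_words` are illustrative assumptions, not part of this notebook's data. Each word gets its own column, so "jacket" and "coat" share no dimension and their vectors have a dot product of zero, even though the words are near-synonyms.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy corpus, assumed for illustration only
docs = [["warm", "jacket"], ["warm", "coat"], ["cold", "water"]]
vocabulary = sorted({word for doc in docs for word in doc})
doc_vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# In this scheme a single word is a one-hot vector over the vocabulary
jacket = bag_of_words(["jacket"], vocabulary)
coat = bag_of_words(["coat"], vocabulary)
overlap = sum(a * b for a, b in zip(jacket, coat))

print(vocabulary)   # ['coat', 'cold', 'jacket', 'warm', 'water']
print(doc_vectors)  # [[0, 0, 1, 1, 0], [1, 0, 0, 1, 0], [0, 1, 0, 0, 1]]
print(overlap)      # 0: the encoding sees no relation between the two words
```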

In Word2Vec, each word is represented as a dense vector of 32 or more dimensions instead of a single number (gensim's `Word2Vec` defaults to 100). Semantic information and the relationships between different words are also preserved.


In [1]:
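# Import libraries
# sent_tokenize and stopwords need NLTK data: nltk.download('punkt') and
# nltk.download('stopwords') ('punkt_tab' on recent NLTK releases)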
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import re

In [2]:
# Amazon review of Grangers Down Wash
paragraph = """You should know two things about using this product: it takes over four hours from start to finish to clean a down jacket right, and this is not your usual casual laundry.

I own the Patagonia Primo ski jacket. 800 count down, three ply Gore-Tex. I did not want to screw it up. Here in Chicago, a warm waterproof jacket literally keeps me alive while working outside in the winter. I cannot afford (in all meanings of the word) for my winter coat to fail.

I did a lot of research before I put it in the wash with this cleaner. I can say it works well; the down is as lofty as new, the Gore-Tex unaffected, and the Durable Water Resistance (DWR) renewed.

I took several steps the ensure success, based on advice from the Arcteryx clothing YouTube channel, Patagonia's site and clothing label, and other sources.

First, I ran my washing machine empty, normal wash, hot water, to get all detergent residue out. Regular detergents have enzymes that can damage down; you don't want it in your coat.

I zipped my main zipper closed (always a good thing to do in any case), and closed my pocket zippers halfway, to protect them but yet allow the pockets to get clean. Ditto my pit zips.

I loosened all cord locks so the hood and waist were fully relaxed and opened.

I set my Velcro cuffs to their widest.

I washed the coat with two caps of cleaner, with the machine set for delicates, warm water, gentle spin.

When it was completed I ran a rinse/spin cycle, cold water, gentle spin.

The jacket was sopping wet when all that was done. I laid it flat on a beach towel, then gently rolled it up like a burrito, without squeezing or wringing, to remove excess water.

Then, into the dryer, on low heat, with two tennis balls, as specified by Patagonia. I added a dry beach towel to help absorb the water; this seemed to speed thing up a bit, but I'd like to hear if people think this is a good or bad idea. I checked on it every half hour or so; I turned it inside out to facilitate the drying. It takes about three hours in the dryer to dry completely.

Out of the dryer the loft was exceptional, and a little water dribbled on the coat ran right off without being absorbed, proving the DWR was renewed. I've taken it out in a moderate rain storm, and it's still waterproof.

I hope these tips help you out. I'm pleased with the results I achieved.

Winter is coming. Grangers has helped me get ready for it."""

In [3]:
# Preprocessing
text = re.sub(r'\[[0-9]*\]', ' ', paragraph)  # drop bracketed numbers such as [12]
text = re.sub(r'\s+', ' ', text)  # collapse any run of whitespace (spaces, tabs, newlines) into one space
text = text.lower()
text = re.sub(r'\d', ' ', text)  # replace every digit with a space
text = re.sub(r'\s+', ' ', text)  # collapse the gaps left by the removed digits
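As a quick check of what these substitutions do, the sketch below wraps the same five steps in a function and runs them on a short made-up string. The function name `clean_text` and the sample input are assumptions for illustration.

```python
import re

def clean_text(raw):
    """Apply the same five steps as the preprocessing cell to any string."""
    text = re.sub(r'\[[0-9]*\]', ' ', raw)  # drop bracketed numbers such as [12]
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs
    text = text.lower()
    text = re.sub(r'\d', ' ', text)         # replace each digit with a space
    text = re.sub(r'\s+', ' ', text)        # collapse the gaps left by digits
    return text

sample = "Grangers [12] works in 4 hours.\nGreat!"  # made-up input
print(clean_text(sample))  # grangers works in hours. great!
```

Note that removing digits also strips numbers that carry meaning in the review, such as the "800" in "800 count down".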

In [4]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
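The stop-word filter leaves punctuation tokens in place, which is why ',' and '.' show up in the trained vocabulary. If that is not wanted, one possible extra step is sketched below. The token lists are a made-up stand-in for `sentences`, and keeping any token that contains at least one letter or digit is an assumption about what counts as a word, chosen so that tokens such as 'gore-tex' survive.

```python
# Hypothetical tokenized sentences standing in for `sentences`
tokenized = [
    ["warm", "waterproof", "jacket", ",", "gore-tex", "."],
    ["winter", "coming", "(", "dwr", ")", ";"],
]

def has_word_character(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

filtered = [[tok for tok in sent if has_word_character(tok)] for sent in tokenized]
print(filtered)
# [['warm', 'waterproof', 'jacket', 'gore-tex'], ['winter', 'coming', 'dwr']]
```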

In [5]:
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Vocabulary: maps each token to its integer index (most frequent tokens first)
words = model.wv.key_to_index

In [6]:
print(words)

{',': 0, '.': 1, 'water': 2, ';': 3, 'jacket': 4, 'coat': 5, 'dryer': 6, 'winter': 7, 'patagonia': 8, '(': 9, 'ran': 10, ')': 11, 'get': 12, 'two': 13, 'without': 14, 'like': 15, 'dry': 16, 'help': 17, 'towel': 18, 'wash': 19, 'thing': 20, 'beach': 21, 'renewed': 22, 'spin': 23, 'gentle': 24, 'set': 25, 'clothing': 26, "'s": 27, 'machine': 28, 'good': 29, 'closed': 30, 'dwr': 31, 'cleaner': 32, 'right': 33, 'warm': 34, 'hours': 35, 'three': 36, 'takes': 37, 'gore-tex': 38, 'want': 39, 'clean': 40, 'waterproof': 41, 'detergents': 42, 'label': 43, 'steps': 44, 'ensure': 45, 'success': 46, 'based': 47, 'advice': 48, 'start': 49, 'arcteryx': 50, 'damage': 51, 'youtube': 52, 'channel': 53, 'four': 54, 'site': 55, 'sources': 56, 'regular': 57, 'first': 58, 'things': 59, 'enzymes': 60, 'several': 61, ':': 62, 'empty': 63, 'using': 64, 'normal': 65, 'hot': 66, 'product': 67, 'detergent': 68, 'residue': 69, 'washing': 70, 'keeps': 71, 'finish': 72, 'took': 73, 'working': 74, 'outside': 75, 'lit

In [9]:
# Finding Word Vectors
vector = model.wv['water']

# Most similar words
similar = model.wv.most_similar('water')

In [10]:
print(vector)

[ 9.4669449e-05  3.1122367e-03 -6.8130204e-03 -1.3196648e-03
  7.6049580e-03  7.2199828e-03 -3.6538129e-03  2.7571912e-03
 -8.3689252e-03  6.1584376e-03 -4.6531884e-03 -3.2273859e-03
  9.2729628e-03  9.2281576e-04  7.4480027e-03 -6.0846391e-03
  5.1994571e-03  9.8198513e-03 -8.5367570e-03 -5.2658594e-03
 -7.0919800e-03 -4.8739729e-03 -3.7834318e-03 -8.5326703e-03
  7.9363380e-03 -4.8710369e-03  8.4203435e-03  5.2250256e-03
 -6.6436008e-03  4.0003024e-03  5.4777670e-03 -7.4295225e-03
 -7.3083849e-03 -2.5296691e-03 -8.7129120e-03 -1.4305762e-03
 -3.9131049e-04  3.2640360e-03  1.4350056e-03 -1.0357924e-03
 -5.6137429e-03  1.6137923e-03 -9.8565349e-04  6.7918324e-03
  4.0379530e-03  4.5901234e-03  1.4064645e-03 -2.7221863e-03
 -4.4015869e-03 -9.9315797e-04  1.4715518e-03 -2.6860726e-03
 -7.0302659e-03 -7.8357477e-03 -9.0995217e-03 -5.9431670e-03
 -1.8578147e-03 -4.3345988e-03 -6.5662628e-03 -3.7178348e-03
  4.2760894e-03 -3.7415200e-03  8.4257452e-03  1.5076697e-03
 -7.3691774e-03  9.48908

In [11]:
print(similar)

[('detergent', 0.3722821772098541), ('waterproof', 0.21293485164642334), ('hope', 0.20585587620735168), ('youtube', 0.20357394218444824), ('patagonia', 0.20141823589801788), ('remove', 0.20058925449848175), ('little', 0.19965402781963348), ('research', 0.18946188688278198), ('unaffected', 0.18259188532829285), (')', 0.1777937114238739)]
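
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation on hand-written 3-dimensional vectors (hypothetical values, not taken from the model) is shown below. With a corpus as small as a single review, low and noisy scores like the ones above are to be expected.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen so the results are easy to verify by hand
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a
c = [2.0, -1.0, 0.0]  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```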
