## Text Preprocessing 
- Tokenization
- Stemming and Lemmatization
- Stop Words
- POS: Part-of-speech tagging assigns each word in a document a grammatical category (adverb, adjective, verb, etc.).
- Bag Of Words
- TF-IDF
- Unigrams
- Bigrams
- n-grams 
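
The last three items in the list can be illustrated with a short sketch. An n-gram is a run of n consecutive tokens, so unigrams are single tokens and bigrams are adjacent pairs. The helper name `ngrams` and the whitespace tokenization are assumptions made for this example; NLTK and scikit-learn provide their own utilities for this.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# One sentence from the review, lowercased and split on whitespace.
tokens = "i did not want to screw it up".split()

unigrams = ngrams(tokens, 1)   # single tokens
bigrams = ngrams(tokens, 2)    # adjacent pairs
trigrams = ngrams(tokens, 3)   # adjacent triples

print(bigrams[:3])
print(len(unigrams), len(bigrams), len(trigrams))
```

A sentence with k tokens yields k - n + 1 n-grams, so longer n-grams are fewer but carry more context.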

In [1]:
import nltk
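# nltk.download()  # uncomment on first run to fetch the NLTK data used below (tokenizer models, stop words, WordNet)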

In [2]:
## Amazon review of Grangers down wash
paragraph = """You should know two things about using this product: it takes over four hours from start to finish to clean a down jacket right, and this is not your usual casual laundry.

I own the Patagonia Primo ski jacket. 800 count down, three ply Gore-Tex. I did not want to screw it up. Here in Chicago, a warm waterproof jacket literally keeps me alive while working outside in the winter. I cannot afford (in all meanings of the word) for my winter coat to fail.

I did a lot of research before I put it in the wash with this cleaner. I can say it works well; the down is as lofty as new, the Gore-Tex unaffected, and the Durable Water Resistance (DWR) renewed.

I took several steps the ensure success, based on advice from the Arcteryx clothing YouTube channel, Patagonia's site and clothing label, and other sources.

First, I ran my washing machine empty, normal wash, hot water, to get all detergent residue out. Regular detergents have enzymes that can damage down; you don't want it in your coat.

I zipped my main zipper closed (always a good thing to do in any case), and closed my pocket zippers halfway, to protect them but yet allow the pockets to get clean. Ditto my pit zips.

I loosened all cord locks so the hood and waist were fully relaxed and opened.

I set my Velcro cuffs to their widest.

I washed the coat with two caps of cleaner, with the machine set for delicates, warm water, gentle spin.

When it was completed I ran a rinse/spin cycle, cold water, gentle spin.

The jacket was sopping wet when all that was done. I laid it flat on a beach towel, then gently rolled it up like a burrito, without squeezing or wringing, to remove excess water.

Then, into the dryer, on low heat, with two tennis balls, as specified by Patagonia. I added a dry beach towel to help absorb the water; this seemed to speed thing up a bit, but I'd like to hear if people think this is a good or bad idea. I checked on it every half hour or so; I turned it inside out to facilitate the drying. It takes about three hours in the dryer to dry completely.

Out of the dryer the loft was exceptional, and a little water dribbled on the coat ran right off without being absorbed, proving the DWR was renewed. I've taken it out in a moderate rain storm, and it's still waterproof.

I hope these tips help you out. I'm pleased with the results I achieved.

Winter is coming. Grangers has helped me get ready for it."""

### Tokenization
Tokenization splits a text into smaller units, known as tokens. Here the review is split first into sentences and then into words.

In [3]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)

In [4]:
len(sentences)

29

In [5]:
sentences[0:5]

['You should know two things about using this product: it takes over four hours from start to finish to clean a down jacket right, and this is not your usual casual laundry.',
 'I own the Patagonia Primo ski jacket.',
 '800 count down, three ply Gore-Tex.',
 'I did not want to screw it up.',
 'Here in Chicago, a warm waterproof jacket literally keeps me alive while working outside in the winter.']

In [6]:
words[0:5]

['You', 'should', 'know', 'two', 'things']
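
To make the idea concrete, here is a minimal sketch of rule-based tokenization using only Python's `re` module. It is not how `nltk.sent_tokenize` or `nltk.word_tokenize` work internally; the two regular expressions and the helper names `naive_sentences` and `naive_words` are assumptions chosen for illustration.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (assumed rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def naive_words(sentence):
    """Return words (keeping internal hyphens/apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "I own the Patagonia Primo ski jacket. 800 count down, three ply Gore-Tex."

sents = naive_sentences(text)
print(sents)
print(naive_words(sents[1]))
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a trained tokenizer for real text.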

### Stemming and Lemmatization

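Stemming reduces inflected words to a word stem by stripping suffixes with fixed rules. The stem need not be a dictionary word. E.g. history -- histori; laundry -- laundri; going -- go.

Lemmatization, on the other hand, maps each word to its dictionary form (lemma), so the result is always a readable word. E.g. things -- thing; hours -- hour. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, so verb forms such as "using" are left unchanged.

Note that lemmatization typically takes more time than stemming, because it looks words up in a dictionary instead of applying suffix rules.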
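
The contrast between the two approaches can be sketched with a toy example. This is deliberately not the Porter algorithm or WordNet: the suffix list `TOY_SUFFIXES`, the mini dictionary `TOY_LEMMAS`, and both function names are hypothetical and exist only to show rule-based stripping next to dictionary lookup.

```python
# Hypothetical suffix list, checked in order; NOT the Porter algorithm.
TOY_SUFFIXES = ("ing", "ly", "es", "s")

# Hypothetical mini dictionary standing in for a real lexicon such as WordNet.
TOY_LEMMAS = {"takes": "take", "using": "use", "things": "thing", "hours": "hour"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; unknown words come back unchanged (lowercased)."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for w in ["takes", "using", "things", "hours"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns "takes" into "tak" and "using" into "us", which are not words, while the lookup returns "take" and "use". The price of the lookup is that it needs a dictionary covering the vocabulary.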

### Stop words
Stop words are words that are so widely used that they carry very little useful information, such as "the", "is", and "to".
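
A short sketch of stop-word filtering follows. The set `TOY_STOP_WORDS` is a tiny hand-written stand-in, not NLTK's English list. The example also shows why casing matters: a stop-word list stored in lowercase will not match capitalized tokens such as "I" unless the token is lowercased before the comparison.

```python
# Tiny hand-written stop-word set (assumption); a real list is much longer.
TOY_STOP_WORDS = {"i", "you", "own", "the", "this", "about", "should"}

tokens = ["I", "own", "the", "Patagonia", "Primo", "ski", "jacket", "."]

# Case-sensitive check: "I" is not equal to "i", so it is kept.
case_sensitive = [t for t in tokens if t not in TOY_STOP_WORDS]

# Case-insensitive check: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in TOY_STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the task; for bag-of-words features it is usually done so that "You" and "you" count as the same word.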

In [7]:
## Stemming 
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [8]:
stemmer = PorterStemmer()

In [9]:
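stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # The membership test is case-sensitive, so capitalized tokens such as 'You' and 'I' are kept
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)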
    

In [10]:
sentences[0:5]

['you know two thing use product : take four hour start finish clean jacket right , usual casual laundri .',
 'i patagonia primo ski jacket .',
 '800 count , three pli gore-tex .',
 'i want screw .',
 'here chicago , warm waterproof jacket liter keep aliv work outsid winter .']

In [11]:
## Lemmatization
from nltk.stem import WordNetLemmatizer

In [12]:
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()

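stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # lemmatize() assumes a noun by default, so verb forms such as 'using' stay unchanged
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)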

sentences[0:5]

['You know two thing using product : take four hour start finish clean jacket right , usual casual laundry .',
 'I Patagonia Primo ski jacket .',
 '800 count , three ply Gore-Tex .',
 'I want screw .',
 'Here Chicago , warm waterproof jacket literally keep alive working outside winter .']

### Bag Of Words
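Bag of Words converts each sentence into a vector of word counts over the vocabulary, ignoring word order. These vectors can serve as input features for tasks such as sentiment analysis. Note that the vocabulary, and with it the vector length, grows with the data set, so this representation scales poorly to huge data sets.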
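
The counting idea can be sketched by hand with the standard library. This is not scikit-learn's implementation; the names `docs`, `vocab`, and `bow_vector` are assumptions, and the three documents are preprocessed sentences of the kind this notebook builds from the review.

```python
from collections import Counter

docs = [
    "patagonia primo ski jacket",
    "count three ply gore tex",
    "want screw",
]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bow_vector(doc) for doc in docs]

print(vocab)
print(matrix[0])
```

Each row has one entry per vocabulary word, and most entries are zero, which is why libraries usually keep such matrices in a sparse format.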

In [13]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace every character other than a-zA-Z with a space
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # keep at most the 1500 most frequent terms
X = cv.fit_transform(corpus).toarray()   # dense array: one row per sentence, one column per term

In [14]:
sentences[:4]

['You should know two things about using this product: it takes over four hours from start to finish to clean a down jacket right, and this is not your usual casual laundry.',
 'I own the Patagonia Primo ski jacket.',
 '800 count down, three ply Gore-Tex.',
 'I did not want to screw it up.']

In [15]:
corpus[:4]

['know two thing using product take four hour start finish clean jacket right usual casual laundry',
 'patagonia primo ski jacket',
 'count three ply gore tex',
 'want screw']

In [16]:
X.shape

(29, 182)

In [17]:
X[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

### TF-IDF
Term Frequency - Inverse Document Frequency is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).
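
As a rough sketch of the arithmetic, the code below computes TF-IDF by hand with the standard library. It assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is the default documented for scikit-learn's `TfidfVectorizer`; other definitions of TF-IDF exist. The function name `tfidf` and the two sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows), where rows[i][j] is the weight of vocab[j] in docs[i]."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in tokenized for word in set(doc))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts
        raw = [tf[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(["patagonia primo ski jacket", "clean jacket"])
for word, weight in zip(vocab, rows[0]):
    print(f"{word:10s} {weight:.4f}")
```

"jacket" appears in both documents, so it receives a lower weight in the first row than "ski" or "primo", which appear in only one.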

In [18]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace every character other than a-zA-Z with a space
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
    
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()

In [19]:
X.shape

(29, 182)

In [20]:
X[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.26793307,
       0.        , 0.        , 0.        , 0.23863533, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.26793307, 0.        , 0.        , 0.26793307, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     