# Introduction to natural language processing


Natural language processing (NLP) is a field concerned with analysing text to extract meaning and knowledge from it.


In [78]:
import spacy
import pandas as pd

Find two real use cases where NLP can be used to solve problems

NLP has one particularity: words are strings, and we cannot do mathematics (and thus machine learning) on strings. Therefore, before doing machine learning, we need to transform words and sentences into a numerical representation such as vectors.

In addition, several other preprocessing steps are usually applied.
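
As an illustration of what "a numerical representation" can mean, here is a minimal bag-of-words sketch using only the standard library. The example sentences and the helper name `to_count_vector` are made up for this illustration; they are not part of the practical.

```python
from collections import Counter

# Toy sentences, invented for the example
sentences = ["the movie was good", "the movie was bad", "good good film"]

# Build a vocabulary: one vector position per distinct word, in sorted order
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

def to_count_vector(sentence):
    """Return how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(s) for s in sentences]
print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
```

Each sentence becomes a list of numbers of the same length, which is the kind of input a machine learning model can work with.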

What is tokenisation?

Tokenisation consists of splitting a text into tokens, most often words.
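
A rough sketch of the idea with a regular expression is shown below. The function name `simple_tokenize` is hypothetical, and real tokenizers such as spaCy's apply more elaborate, language-specific rules (for instance, the spaCy output later in this notebook contains `n't` as a separate token).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("If you like cartoons, watch this!")
print(tokens)
```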

What are lemmatization and stemming?

**Stemming** algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting works in some cases but not always, which is the main limitation of the approach.

**Lemmatization**, on the other hand, relies on a morphological analysis of the words. It requires detailed dictionaries which the algorithm can look through to link each inflected form back to its lemma.
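
The difference can be sketched with two toy functions. Both the suffix list and the lemma dictionary below are invented and far too small for real use; they only show why suffix chopping and dictionary lookup give different answers.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def naive_stem(word):
    """Chop off the first matching suffix, with no linguistic knowledge."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer relies on a dictionary; this tiny one is made up for the example
LEMMAS = {"studies": "study", "was": "be", "better": "good", "girls": "girl"}

def naive_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["girls", "studies", "was", "better"]:
    print(word, naive_stem(word), naive_lemmatize(word))
```

The stemmer turns "studies" into the non-word "stud", while the dictionary lookup recovers "study" and can even map irregular forms such as "was" to "be".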

What is a stop word?

**Stop words** are very common words in a language (e.g. a, an, the in English). They do not help with most NLP problems, such as semantic analysis or classification.

They are usually filtered out before processing natural language data (text). There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools avoid removing stop words in order to support phrase search.
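
Filtering stop words amounts to a membership test against a list. The sketch below uses a small hand-picked set, which is an assumption made for the example; spaCy ships its own, much longer list.

```python
# Illustrative subset only; real stop word lists are much longer
STOP_WORDS = {"a", "an", "the", "is", "of", "this", "about"}

def drop_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "This is nearly a similar format about the small adventures".split()
content_words = drop_stop_words(tokens)
print(content_words)
```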

In this practical we will learn how to process text and build our first machine learning model on it. For the processing part we will use a library named spaCy, which is very popular in industry for text preprocessing and NLP.

Install spaCy. You may also have to install the English model of spaCy (cf. https://spacy.io/usage/models).

In [8]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 2.2MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /Users/romane/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
    /Users/romane/anaconda3/lib/python3.7/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [9]:
nlp = spacy.load("en_core_web_sm")

The dataset at http://ai.stanford.edu/~amaas/data/sentiment/ contains movie reviews with ratings. Download it: we will use it to build a model which classifies a review as positive or negative.

With a text editor, open a positive review and a negative review.

With Python, load the first positive review and store it in the variable **review**.

In [28]:
file_name = './aclImdb/train/pos/1_7.txt'
# Use a context manager so the file is closed once it has been read
with open(file_name, encoding='utf-8') as f:
    review = f.read()
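
The file names in this dataset appear to follow an `<id>_<rating>.txt` pattern (for example `1_7.txt`). Assuming that pattern holds, the rating can be recovered from the path alone. The helper name `parse_review_filename` is hypothetical.

```python
from pathlib import Path

def parse_review_filename(path):
    """Split a name such as '1_7.txt' into (review_id, rating).

    Assumes the '<id>_<rating>.txt' naming seen in the paths of this notebook.
    """
    review_id, rating = Path(path).stem.split("_")
    return int(review_id), int(rating)

review_id, rating = parse_review_filename("./aclImdb/train/pos/1_7.txt")
print(review_id, rating)
```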

Load the **review** variable into spaCy and store the result in the **doc** variable. The result is a `Doc` object, which can be iterated over with a for loop.

In [29]:
doc = nlp(review)

With a for loop, iterate through the tokens of **doc**. For each token, you can use the attribute `is_stop` to check whether it is a stop word. Tokens have many other attributes: can you find out what they are?

In [30]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

If if ADP IN mark Xx True False
you -PRON- PRON PRP nsubj xxx True True
like like ADP IN advcl xxxx True False
adult adult NOUN NN compound xxxx True False
comedy comedy NOUN NN compound xxxx True False
cartoons cartoon NOUN NNS dobj xxxx True False
, , PUNCT , punct , False False
like like ADP IN prep xxxx True False
South south PROPN NNP compound Xxxxx True False
Park park PROPN NNP pobj Xxxx True False
, , PUNCT , punct , False False
then then ADV RB advmod xxxx True True
this this DET DT nsubj xxxx True True
is be VERB VBZ ROOT xx True True
nearly nearly ADV RB advmod xxxx True False
a a DET DT det x True True
similar similar ADJ JJ amod xxxx True False
format format NOUN NN attr xxxx True False
about about ADP IN prep xxxx True True
the the DET DT det xxx True True
small small ADJ JJ amod xxxx True False
adventures adventure NOUN NNS pobj xxxx True False
of of ADP IN prep xx True True
three three NUM CD nummod xxxx True True
teenage teenage ADJ JJ amod xxxx True False
girls girl N

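Code a function `remove_stop_words(path)` which reads the text file at a given path and returns its tokens with all stop words removed.

In [27]:
def remove_stop_words(path):
    # Read the review, tokenise it with spaCy and keep the tokens that are not stop words
    with open(path, encoding='utf-8') as f:
        return [token for token in nlp(f.read()) if not token.is_stop]

print(remove_stop_words('./aclImdb/train/pos/1_7.txt'))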

[If, like, adult, comedy, cartoons, ,, like, South, Park, ,, nearly, similar, format, small, adventures, teenage, girls, Bromwell, High, ., Keisha, ,, Natella, Latrina, given, exploding, sweets, behaved, like, bitches, ,, I, think, Keisha, good, leader, ., There, small, stories, going, teachers, school, ., There, 's, idiotic, principal, ,, Mr., Bip, ,, nervous, Maths, teacher, ., The, cast, fantastic, ,, Lenny, Henry, 's, Gina, Yashere, ,, EastEnders, Chrissie, Watts, ,, Tracy, -, Ann, Oberman, ,, Smack, The, Pony, 's, Doon, Mackichan, ,, Dead, Ringers, ', Mark, Perry, Blunder, 's, Nina, Conti, ., I, n't, know, came, Canada, ,, good, ., Very, good, !]


Code a function `load_texts(folder)` which collects the paths of all the text files in a folder into a list. The files themselves are read later by `remove_stop_words`.

In [34]:
from os import listdir
from os.path import isfile, join

In [49]:
def load_texts(folder):
    return [join(folder, f) for f in listdir(folder) if isfile(join(folder, f))]
load_texts('./aclImdb/train/pos/')

['./aclImdb/train/pos/4715_9.txt',
 './aclImdb/train/pos/12390_8.txt',
 './aclImdb/train/pos/8329_7.txt',
 './aclImdb/train/pos/9063_8.txt',
 './aclImdb/train/pos/3092_10.txt',
 './aclImdb/train/pos/9865_8.txt',
 './aclImdb/train/pos/6639_10.txt',
 './aclImdb/train/pos/10460_10.txt',
 './aclImdb/train/pos/10331_10.txt',
 './aclImdb/train/pos/11606_10.txt',
 './aclImdb/train/pos/6168_10.txt',
 './aclImdb/train/pos/2712_10.txt',
 './aclImdb/train/pos/3225_10.txt',
 './aclImdb/train/pos/3574_10.txt',
 './aclImdb/train/pos/3192_10.txt',
 './aclImdb/train/pos/716_10.txt',
 './aclImdb/train/pos/2612_10.txt',
 './aclImdb/train/pos/5568_8.txt',
 './aclImdb/train/pos/6554_7.txt',
 './aclImdb/train/pos/1807_7.txt',
 './aclImdb/train/pos/3474_10.txt',
 './aclImdb/train/pos/11057_10.txt',
 './aclImdb/train/pos/10231_10.txt',
 './aclImdb/train/pos/11706_10.txt',
 './aclImdb/train/pos/11167_9.txt',
 './aclImdb/train/pos/803_10.txt',
 './aclImdb/train/pos/5245_8.txt',
 './aclImdb/train/pos/7935_8.txt
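
The cell above returns file paths rather than file contents. If the contents are wanted directly, a variant can be sketched with `pathlib`. The name `load_text_contents` is hypothetical, and the demonstration writes to a temporary folder so that it runs without the dataset.

```python
import tempfile
from pathlib import Path

def load_text_contents(folder):
    """Return the content of every .txt file in folder, sorted by file name."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]

# Demonstration on a throwaway folder with two fake reviews and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0_9.txt").write_text("Great film.", encoding="utf-8")
    Path(tmp, "1_2.txt").write_text("Dull film.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not a review", encoding="utf-8")
    texts = load_text_contents(tmp)

print(texts)
```

Sorting the paths makes the order reproducible, whereas `listdir` returns files in arbitrary order.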

With the two previous functions, load the positive reviews and remove the stop words from all of them. Do the same for the negative reviews. The cell below only processes the first 100 files of each folder.

In [50]:
pos_folder = './aclImdb/train/pos'
neg_folder = './aclImdb/train/neg'

pos_review = load_texts(pos_folder)
neg_review = load_texts(neg_folder)

# Only process the first 100 files of each class
max_reviews = 100
pos_review_list = [remove_stop_words(file) for file in pos_review[:max_reviews]]
neg_review_list = [remove_stop_words(file) for file in neg_review[:max_reviews]]

Store positive and negative reviews into a single list named **all_reviews**, then join the tokens of each review back into one string in **new_reviews**.

In [72]:
all_reviews = pos_review_list + neg_review_list
# Join the remaining tokens of each review back into a single string
new_reviews = [' '.join(str(t) for t in tokens) for tokens in all_reviews]
new_reviews

['For movie gets respect sure lot memorable quotes listed gem . Imagine movie Joe Piscopo actually funny ! Maureen Stapleton scene stealer . The Moroni character absolute scream . Watch Alan " The Skipper " Hale jr . police Sgt .',
 'Bizarre horror movie filled famous faces stolen Cristina Raines ( later TV \'s " Flamingo Road " ) pretty somewhat unstable model gummy smile slated pay attempted suicides guarding Gateway Hell ! The scenes Raines modeling captured , mood music perfect , Deborah Raffin charming Cristina \'s pal , Raines moves creepy Brooklyn Heights brownstone ( inhabited blind priest floor ) , things start cooking . The neighbors , including fantastically wicked Burgess Meredith kinky couple Sylvia Miles & Beverly D\'Angelo , diabolical lot , Eli Wallach great fun wily police detective . The movie nearly cross - pollination " Rosemary \'s Baby " " The Exorcist "-- combination ! Based best - seller Jeffrey Konvitz , " The Sentinel " entertainingly spooky , shocks brought d

Our reviews are still text. We will transform them into vectors with TF-IDF.
Load the TF-IDF module from scikit-learn, fit it on the review strings in **new_reviews** and store the result into X.

In [83]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [89]:
# Step 1: count how often each word occurs in each review
cv = CountVectorizer()
word_count_vector = cv.fit_transform(new_reviews)

# Step 2: learn the IDF weights from the counts.
# X holds the fitted transformer; the TF-IDF matrix is computed in a later cell.
X = TfidfTransformer(smooth_idf=True, use_idf=True)
X.fit(word_count_vector)


TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
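
To make the weighting less of a black box, here is a hand-rolled sketch of the computation on three toy documents. It assumes the formula documented for scikit-learn with `smooth_idf=True` and `norm='l2'` (the parameters printed above): idf = ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count, then scaled to unit length. The toy documents are invented.

```python
import math
from collections import Counter

# Toy tokenised documents, invented for the example
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

def tfidf(doc):
    """Raw term count times IDF, then scaled so the vector has Euclidean length 1."""
    counts = Counter(doc)
    weights = {word: counts[word] * idf[word] for word in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

for doc in docs:
    print(doc, tfidf(doc))
```

Words that appear in fewer documents ("bad", "film") get a larger IDF than words shared by several documents ("movie", "good"), which is why rare names dominate the per-document scores shown later in this notebook.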

Add a column which contains the labels of the reviews. The cell below first inspects the IDF weight learned for each word.

In [90]:
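# IDF weight learned for each word of the vocabulary
# (newer scikit-learn versions provide get_feature_names_out() instead of get_feature_names())
df_idf = pd.DataFrame(X.idf_, index=cv.get_feature_names(), columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=['idf_weights'])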

| term | idf_weights |
|---|---|
| the | 1.319698 |
| movie | 1.451275 |
| br | 1.602825 |
| film | 1.649345 |
| it | 1.781516 |
| like | 1.848958 |
| this | 1.848958 |
| good | 1.985817 |
| story | 2.192431 |
| people | 2.295972 |


In [92]:
# count matrix
count_vector=cv.transform(new_reviews)
 
# tf-idf scores
tf_idf_vector=X.transform(count_vector)

In [94]:
feature_names = cv.get_feature_names()

# get the TF-IDF vector of the first document
first_document_vector = tf_idf_vector[0]

# show the scores, highest first
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)

| term | tfidf |
|---|---|
| listed | 0.222322 |
| quotes | 0.222322 |
| stapleton | 0.222322 |
| skipper | 0.222322 |
| hale | 0.222322 |
| jr | 0.222322 |
| stealer | 0.222322 |
| piscopo | 0.222322 |
| moroni | 0.222322 |
| maureen | 0.222322 |


Do a train test split on the values

Train a logistic regression on the train set. Display the scores on train and test
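
In practice scikit-learn's `train_test_split` and `LogisticRegression` are the usual tools for these two steps. To show what they do under the hood, the sketch below spells out the mechanics with the standard library only. The feature vectors are invented stand-ins for TF-IDF rows, the labels follow the convention 1 = positive and 0 = negative, and the learning rate and number of epochs are arbitrary choices.

```python
import math
import random

# Toy (features, label) pairs; hypothetical stand-ins for TF-IDF rows
data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.8, 0.0], 1), ([0.7, 0.3], 1),
        ([0.1, 0.9], 0), ([0.2, 1.0], 0), ([0.0, 0.8], 0), ([0.3, 0.7], 0)]

# Train/test split: shuffle, then keep 75% for training
random.seed(0)
random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit(rows, epochs=500, lr=0.5):
    """Logistic regression trained by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        for x, y in rows:
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def accuracy(weights, bias, rows):
    """Share of rows whose predicted class (threshold 0.5) matches the label."""
    correct = sum((predict_proba(weights, bias, x) >= 0.5) == bool(y) for x, y in rows)
    return correct / len(rows)

weights, bias = fit(train)
train_score = accuracy(weights, bias, train)
test_score = accuracy(weights, bias, test)
print("train accuracy:", train_score, "test accuracy:", test_score)
```

Comparing the two scores is the point of the split: a train score much higher than the test score suggests the model has memorised the training reviews rather than learned something general.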

That's it! You did your first logistic regression on text data. Of course, you can do a lot more.