# Intro2AI: Lecture 6 - Introduction to Natural Language Processing


![](https://drive.google.com/uc?export=view&id=1jiB_rcx8777OqnMHcll2jYJQZF-sIaPm)

In this practical session, we will see how to:
- Pre-process data using Spacy
- Build a sentiment analysis system, using Scikit-Learn
- Generate word embeddings, using Gensim

Our corpus for sentiment classification will be the Pop-corn dataset (https://www.kaggle.com/ymanojkumar023/kumarmanoj-bag-of-words-meets-bags-of-popcorn/code).


We make available a reduced and cleaned version here (code to download the data below):
- Training set: https://drive.google.com/file/d/1VcPE4bo8ygubyLmxwA-jycckUtVcC6l6/view?usp=sharing
- Test set: https://drive.google.com/file/d/1GQf17s5Tf7rXobjhDON9gek2gOHyJ4-K/view?usp=sharing
- Full data for training word embeddings: https://drive.google.com/file/d/1TokJd_dnYksfHjCqMpKEqrLHDhkwSFyV/view?usp=sharing



### Organization of the session: please read

The practical session is organized in several parts, with some code to run and a detailed explanation.
You simply have to read and run the code. The goal is to make you understand that NLP systems are not perfect, but also how easy it is to build a simple classification system (with rather good performance).
Each time, try to **understand the code and its output**: ask questions whenever something is unclear.

In [None]:
# install wget, a module used to download data from the web
!pip install wget

In [None]:
# Downloading data
# wiki_ai.txt
#wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dayJql47Thz8dmOF1txyQj77FoktWtGo' -O wiki_ai.txt

import wget
# wiki_ai.txt (for Part 1)
url = "https://docs.google.com/uc?export=download&id=1dayJql47Thz8dmOF1txyQj77FoktWtGo"
filename = wget.download(url)

# Training data (for Part 2)
url = "https://docs.google.com/uc?export=download&id=1VcPE4bo8ygubyLmxwA-jycckUtVcC6l6"
filename = wget.download(url)

# Test data (for Part 2)
url = "https://docs.google.com/uc?export=download&id=1GQf17s5Tf7rXobjhDON9gek2gOHyJ4-K"
filename = wget.download(url)

# Full dataset (for Part 3)
url = "https://docs.google.com/uc?export=download&id=1BzE2l8R51ONSp3ADvXXukuYyuAgobnMW"
filename = wget.download(url)

# Original dataset (fyi, not used in this Practical Session)
#url = "https://docs.google.com/uc?export=download&id=1HFGLcWDn_vcmze-L_0jzB40ybmnD2_c1"
#filename = wget.download(url)


# Part 1: Using Spacy for data pre-processing

![](https://drive.google.com/uc?export=view&id=1L9JLLQHPZoMRwzYfmKcyM9VME_SHeZrr)

Within a computer, text is encoded as a string of characters.
In order to analyze textual data within NLP applications, we first need to properly preprocess it.
An NLP preprocessing pipeline generally consists of the following steps :
* sentence segmentation
* tokenisation
* normalization: lower-casing, lemmatization, removing stop-words
* pos-tagging
* named entity recognition
* parsing

The first two steps are in general necessary, while the others are optional.
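
To make these steps concrete, here is a minimal sketch of sentence segmentation, tokenisation and normalization using only regular expressions. The rules, the stop-word list and the example text are illustrative assumptions, and they are far cruder than what Spacy does. Note how the naive segmenter wrongly splits after the abbreviation "Dr.".

```python
import re

# Hypothetical toy stop-word list, for illustration only
TOY_STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def split_sentences(raw_text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", raw_text.strip()) if s]

def tokenize(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def normalize(tokens):
    """Lower-case the tokens, then drop punctuation and stop-words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t.isalnum() and t not in TOY_STOP_WORDS]

toy_text = "Dr. Smith studies AI. The field is growing fast!"
toy_sentences = split_sentences(toy_text)
toy_tokens = [normalize(tokenize(s)) for s in toy_sentences]

print(toy_sentences)  # ['Dr.', 'Smith studies AI.', 'The field is growing fast!']
print(toy_tokens)     # [['dr'], ['smith', 'studies', 'ai'], ['field', 'growing', 'fast']]
```

Hand-written rules like these break quickly on real text, which is why we rely on a trained pipeline in the rest of this part.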

For these exercises, we will use the module **spacy** (already installed on google colab): https://spacy.io/api/doc

Spacy is a python module that implements an NLP pipeline, in order to carry out tasks such as segmentation, tokenization, lemmatization and pos-tagging.
We will use it in order to preprocess a document in English.




The text comes from Wikipedia: https://www.wikiwand.com/en/Artificial_intelligence




In [None]:
# The code below opens the file, reads its content into a variable and prints it
with open( 'wiki_ai.txt') as infile:
  text = infile.read()

print(text)

## 1.1 Tokenisation

Spacy can be used to directly tokenize any text.
To make it work, you need to load a model specific to the target language, here 'en_core_web_sm' for English. There are also some domain specific models, and models for other languages: https://spacy.io/models/en


This model corresponds to a processing 'pipeline' that includes, depending on the model: tokenisation, lemmatization, POS tagging, Named Entity Recognition and parsing.

### Import and load

The code below is used to:
- import the spacy module into Python
- load all the necessary models for English (other models can be found here: https://spacy.io/usage/models)

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

### Read and process a text
The code below  is used to:
- open the file 'wiki_ai.txt' for reading
- process it using spacy’s nlp pipeline

In [None]:
# Read in string of characters
with open('wiki_ai.txt') as inFile:
    text = inFile.read()

# Preprocess using spacy's pipeline
doc = nlp(text)

print('Preprocessing done')

### Inspect tokens

Our preprocessed document is now present as a list of tokens in our doc variable, and we can access its different annotations by looping through it.

The code below  is used to:
- print each individual token, together with its lemmatized form and part of speech tag

In [None]:
# Inspect tokens, lemmas, and pos tags
for token in doc:
  print( token.text, token.lemma_, token.pos_)

### Visualizing using Pandas

You can use Pandas, another Python library, to better visualize the results.

In [None]:
# Using pandas for a better visualization
import pandas as pd

# w.text is the token string, w.tag_ the fine-grained tag, w.pos_ the coarse-grained tag
spacy_pos_tagged = [(w.text, w.tag_, w.pos_) for w in doc]
pd.DataFrame(spacy_pos_tagged,
             columns=['Word', 'POS tag', 'Tag type'])

### Look at the results

Do you see some errors?

### POS tags explanation

* You can use the function 'explain' to get information about an annotation, for example the POS tags (see the code below).
* Here we use a very small set of POS tags, the coarse-grained ones stored in 'token.pos_' (vs. e.g. 36 in the PTB: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [None]:
# Inspect POS tags
all_tags = set()
for token in doc:
  all_tags.add(token.pos_)
for tag in all_tags:
  print( tag, spacy.explain(tag)) # explain each label

## 1.2 Segmenting into sentences

Apart from token segmentation, Spacy has also automatically segmented our document into sentences.



### Printing sentences
The code below can be used to print out the different sentences of the document. Do you see any errors?

In [None]:
# Print the sentences
for i, sent in enumerate( doc.sents ):
  print( i, sent.text.strip() )

## 1.3 Named entity recognition

As part of the preprocessing pipeline, Spacy has also carried out named entity recognition.

### Printing Named Entities

The code below will:
* print out each named entity, together with the label assigned to it

In [None]:
entity_labels = set()
for entity in doc.ents:
  label = entity.label_
  print( entity.text, '\t', label )
  entity_labels.add( label )

What do the labels stand for? We can use the method 'explain' again:

In [None]:
for l in entity_labels:
  print( l, spacy.explain(l))

### Visualization

A module called 'displacy' can be used to visualize the Named Entities directly in the text. It's easier to read.

Can you see some errors?

In [None]:
from spacy import displacy

# Visually
displacy.render(doc, style='ent', jupyter=True)

## 1.4 Syntactic parsing

Syntactic parsers produce an analysis of the sentences, where the words are connected to each other through syntactic relations.
We can easily parse sentences with Spacy, in order to produce a dependency graph over the sentences.
The dependency relations can be used as features for other systems, to know who did what, or to know which word is modified by an adjective.

More info: https://spacy.io/usage/linguistic-features#dependency-parse
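
As a small illustration of how dependency relations can be used as features, the sketch below extracts "who did what" and adjective-noun pairs from a parse. The parse is hand-written for the example (it is not the output of a parser), and the function names are hypothetical. The relation labels 'nsubj', 'dobj' and 'amod' are assumed to follow the scheme used by Spacy's English models.

```python
# Hand-written toy parse of "The young student wrote a long report",
# given as (word, relation, head) triples. The root is its own head.
toy_parse = [
    ("The", "det", "student"),
    ("young", "amod", "student"),
    ("student", "nsubj", "wrote"),
    ("wrote", "ROOT", "wrote"),
    ("a", "det", "report"),
    ("long", "amod", "report"),
    ("report", "dobj", "wrote"),
]

def who_did_what(parse):
    """Return (subject, verb, object) triples found in the parse."""
    subjects = {head: word for word, rel, head in parse if rel == "nsubj"}
    objects = {head: word for word, rel, head in parse if rel == "dobj"}
    return [(subjects[verb], verb, objects[verb]) for verb in subjects if verb in objects]

def adjective_modifiers(parse):
    """Return (adjective, modified word) pairs."""
    return [(word, head) for word, rel, head in parse if rel == "amod"]

print(who_did_what(toy_parse))         # [('student', 'wrote', 'report')]
print(adjective_modifiers(toy_parse))  # [('young', 'student'), ('long', 'report')]
```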

### Printing a parse tree

The code below will:
- import Spacy and load the model
- process the sentence using the spacy pipeline
- visualize the parse tree with displacy

In [None]:
from spacy import displacy

#nlp = spacy.load('en')
example_sentence = "You can make dependency trees."
example_doc = nlp(example_sentence)

# Visualization
displacy.render(example_doc, style="dep", jupyter=True)

The code below will:
- iterate over the sentences in the wikipedia document
- print the parse tree of the first sentence (index 0)

In [None]:
# Print the first sentence of our document
sentences = [sent.text for sent in doc.sents]
print(sentences[0])
doc = nlp(sentences[0])

# Visualization
displacy.render(doc, style="dep", jupyter=True)

### Navigating the parse tree

Each element of the tree is associated with attributes: you can use them to inspect the different elements of the tree.

The code below will print a tabular version of the tree where each token is associated with its head, together with the relation (e.g. 'amod') between them. The children of the current token, if any, are also printed.

In [None]:
# Navigating the parse tree
spacy_dep_rel = [(w.text, w.dep_, w.head.text, w.head.pos_, [child.text for child in w.children]) for w in doc]
pd.DataFrame(spacy_dep_rel,
             columns=['Word', 'Dep', 'Head text', 'Head pos', 'children'])


# Part 2: Sentiment analysis, "Bag of Words Meets Bags of Popcorn"


![](https://drive.google.com/uc?export=view&id=13nwT3niIwy8jJKEyTF0dRaHeEi1q8Zlv)

In this part, we will run sentiment analysis experiments on movie reviews.
The reviews are either positive (label 1) or negative (label 0).

The data come from: https://www.kaggle.com/ymanojkumar023/kumarmanoj-bag-of-words-meets-bags-of-popcorn/code

In this part, we will:
- vectorize the data using a bag-of-words representation
- train and evaluate a classifier for sentiment analysis.

To this aim, we will use the **scikit-learn** library.
It is already installed within google colab (https://scikit-learn.org/).

## 2.1 Retrieving the data

The data have already been tokenized and normalized (i.e. lowercased).
The data are balanced: there is an equal number of positive and negative examples in both the training and test sets.
We have 5000 training instances, and 500 test instances.



### Read data
The code below will:
- read the data
- print the first instances: do the labels seem correct?

In [None]:
import numpy as np

# Read data using pandas
import pandas as pd

def read_data( infile ):
  # quoting=3 is csv.QUOTE_NONE: quote characters are read as ordinary text
  data = pd.read_csv(infile, header=0,
                     delimiter="\t", quoting=3)
  print("Number of examples:", data.shape[0],"\n")

  reviews = data["review"]
  labels = data["sentiment"]
  return data, reviews, labels

print( "\n-- Reading training data ")
train, train_reviews, train_labels = read_data( "popcorn_clean_train_5000.tsv" )

train.head()

## 2.2 Feature extraction

Now, we are going to transform our textual data into vectors.
We'll start with simple bag-of-words features.

The class CountVectorizer implements this transformation:
- It converts a collection of text documents to a matrix of token counts (= raw frequency): each document becomes a numerical vector.
  

### How to vectorize data

To transform your data, you need to:
- (1) build a CountVectorizer object with the desired options
```
vectorizer = CountVectorizer( analyzer = 'word', max_features=1000 )
```
- (2) learn the transformation on your input data
```
vectorizer.fit( train_reviews )
```
- (3) transform your data into the desired output
```
train_features = vectorizer.transform( train_reviews )
```

Note that the method "fit_transform" automatically learns AND applies the transformation to the input data (steps (2) and (3)).
```
train_features = vectorizer.fit_transform( train_reviews )
```

### Filtering

Without filtering, this produces vectors of **39328 dimensions!**
Here we arbitrarily reduce to 1000 (but other values should be tested).
- max_features=1000: build a vocabulary that only considers the top 1000 features ordered by term frequency across the corpus.
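
To see what "fit" and "transform" do, here is a minimal pure-Python sketch of a bag-of-words vectorizer with a 'max_features' filter. It is not the scikit-learn implementation: the function names and the toy reviews are assumptions made for the example, and tokens are obtained by simply splitting on whitespace.

```python
from collections import Counter

def fit_vocabulary(documents, max_features=None):
    """Learn the vocabulary: keep the most frequent words, indexed in alphabetical order."""
    counts = Counter(word for document in documents for word in document.split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(kept))}

def transform(documents, vocabulary):
    """Turn each document into a vector of raw counts over the vocabulary."""
    vectors = []
    for document in documents:
        vector = [0] * len(vocabulary)
        for word in document.split():
            if word in vocabulary:  # words outside the vocabulary are ignored
                vector[vocabulary[word]] += 1
        vectors.append(vector)
    return vectors

toy_reviews = ["good movie good actors", "bad movie", "good plot bad ending"]
toy_vocab = fit_vocabulary(toy_reviews, max_features=3)
toy_bow = transform(toy_reviews, toy_vocab)

print(toy_vocab)  # {'bad': 0, 'good': 1, 'movie': 2}
print(toy_bow)    # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

With 'max_features=3', the rarer words ("actors", "plot", "ending") are dropped, exactly as the less frequent words of the corpus are dropped when we keep 1000 features.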



### Vectorize data

The code below will:
- import the required module
- create an instance of CountVectorizer, used to build vectors from text
- vectorize the training set (i.e. the text 'train_reviews' is transformed into numerical vectors in 'train_features')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer( analyzer = 'word', max_features=1000 )
train_features = vectorizer.fit_transform( train_reviews )

### Look at the vectorization

The code below will print several pieces of information. Check that you understand each part:

- a- Print the shape of the matrix, that is the set of vectors representing the reviews (nb of instances x nb of features)


In [None]:
# a- array of shape (n_samples, n_features)
print( "Shape of the data, ie nb of examples x number of features:", train_features.shape )

- b- Print the vocabulary, i.e. the unique words used as features

In [None]:
# b- print the vocabulary (= unique words, here 1000)
print( "\nVocabulary:", list( vectorizer.vocabulary_.keys() ) )
vocab = vectorizer.get_feature_names_out()
print( "Size of the vocabulary:", len(vocab))
print(  "Sorted vocabulary:", vocab[:10] )

- c- Print the vector representing the first review

In [None]:
# c- print the vector representing the first review (1000 dimensions)
# use toarray() to densify the matrix --> many 0s = sparsity
print( "\nVector representing the first review", train_features[0].toarray())

- d- Print the word corresponding to the first non-zero dimension, here dimension 6 (index = 5). Check that it appears once in the first review.

In [None]:
# d- sixth dimension (index = 5)
# what is the corresponding word?
# - invert the dictionary
index_to_token = {v: k for k, v in vectorizer.vocabulary_.items()}
print( "\nWord corresponding to the 6th dimension (index 5):", index_to_token[5])
print( "First review", train_reviews[0]) # 2nd sentence: "some human drama about what could"

## 2.3 Preparing test data

We also need to pre-process and vectorize the test set.

The difference is:
- the vectorization is 'learned' on the training data only, so here we use the 'transform' method of the vectorizer (without the 'fit' part): words that are not in the vocabulary built from the training set are considered 'unknown' and are simply ignored.

### Test data

The code below will prepare the test data (note that we should use development data instead, see last part for more details).

In [None]:
print( "-- Reading test data ")
test, test_reviews, test_labels = read_data( "popcorn_clean_test_500.tsv" )

test_features = vectorizer.transform( test_reviews )
print( "Vectorized, shape:", test_features.shape )

## 2.4 Classification without neurons: Scikit-Learn

Now we can train a model and use it to make predictions on our test set.
- Choose an algorithm, e.g. LogisticRegression (aka MaxEnt)
- Train on the training set, meaning that we fit the model to the training data
- Make predictions on the evaluation set (here, the test set)
- Report performance by comparing the gold labels from the evaluation set (i.e. test_labels) to the predictions


### Step 1- Training

The code below will:
- import the required module
- initialize a classifier based on logistic regression
- train/fit the classifier using the training data


In [None]:
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

classifier = LogisticRegression()
classifier.fit( train_features, train_labels )
print( 'Training done')

### Step 2- Making predictions

The code below uses the model learned (the classifier) to make predictions on the test data.

In [None]:
preds = classifier.predict( test_features )
print( "Prediction done")

### Step 3- Computing scores

Scoring is done by comparing the gold labels (i.e. the ones annotated by a human) to the predicted labels assigned by the model.

Scikit-learn provides a function called "classification_report" that gives an overview of the performance using different metrics, as done in the code below.
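
The report includes precision, recall and F1-score for each class. As a reminder of what these numbers mean, here is a minimal sketch that computes them by hand for one class. The gold and predicted labels are invented for the example, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted, positive_label=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_gold = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]

# 2 true positives, 1 false positive, 1 false negative for the positive class
toy_p, toy_r, toy_f1 = precision_recall_f1(toy_gold, toy_pred)
print(round(toy_p, 3), round(toy_r, 3), round(toy_f1, 3))  # 0.667 0.667 0.667
```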

In [None]:
from sklearn.metrics import classification_report

print( classification_report( test_labels, preds ) )

## 2.5 Improving the results: modifying data representation

Many parts of a model can be modified to try to improve the performance:
- the data representation
- the values of the hyper-parameters (in Part 4)
- the choice of the algorithm (in Part 4)

Data representation corresponds to the choice of features.
Here, we chose a simple bag-of-words representation (BOW) with raw frequency.

### TF-IDF normalization

As said during the course, BOW comes in many flavors, and a good option in general is to use TF-IDF normalization instead of raw frequencies.

With scikit-learn, you can either directly vectorize using TF-IDF (with the class 'TfidfVectorizer') or transform a count-based representation (with the class 'TfidfTransformer').
Here, we use this second option.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
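
To give an idea of what the transformation computes, here is a minimal pure-Python sketch. It assumes the settings that scikit-learn documents as defaults for 'TfidfTransformer': a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each vector. The function name and the toy count matrix are assumptions made for the example.

```python
import math

def toy_tfidf_transform(count_vectors):
    """Re-weight raw counts by a smoothed IDF, then L2-normalize each document vector."""
    n_docs = len(count_vectors)
    n_features = len(count_vectors[0])
    # document frequency: in how many documents does each feature appear?
    df = [sum(1 for vector in count_vectors if vector[j] > 0) for j in range(n_features)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = []
    for vector in count_vectors:
        row = [count * weight for count, weight in zip(vector, idf)]
        norm = math.sqrt(sum(x * x for x in row))
        weighted.append([x / norm if norm else 0.0 for x in row])
    return weighted

# 3 documents x 3 features; feature 0 appears in every document
toy_counts = [[2, 1, 0], [1, 0, 1], [1, 0, 0]]
toy_tfidf = toy_tfidf_transform(toy_counts)

for row in toy_tfidf:
    print([round(x, 3) for x in row])
```

In the second document, both words occur once, but the word that appears in every document (feature 0) ends up with a lower weight than the rarer one (feature 2).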



The code below will produce training and test sets with TF-IDF representations.
The next cell will train and evaluate the classifier.

- Look at how it changed the values in the vector representing the first review.
- Does it change the performance?

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
train_features_tfidf = transformer.fit_transform(train_features)
test_features_tfidf = transformer.transform(test_features)

# Print the vector representing the first review (1000 dimensions)
# use toarray() to densify the matrix --> many 0s = sparsity
print( "\nVector representing the first review", train_features_tfidf[0].toarray())

In [None]:
# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf )
# Scores
print( classification_report( test_labels, preds ) )

### Tri-gram features

As said during the course, BOW doesn't take into account the context of each word, which can be crucial for the task.

Let's try with tri-grams, or here, more specifically a concatenation of:
- unigrams: single tokens, same as BOW
- bigrams: two words
- trigrams: three words

This is done with the option 'ngram_range'.

Note that here we directly use the TF-IDF vectorizer.

Without filtering, this produces vectors of **1366006 dimensions**! Here, we choose to keep 5000 features, more than previously, to take into account the new features.
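
To see which features a range of (1, 3) produces, here is a minimal sketch that lists the word n-grams of a short text. The function name is hypothetical and the tokenisation is a simple whitespace split, so the exact features may differ from those extracted by scikit-learn's own tokenizer.

```python
def word_ngrams(raw_text, min_n=1, max_n=3):
    """List all word n-grams of the text, for n from min_n to max_n."""
    tokens = raw_text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

toy_ngrams = word_ngrams("not a good movie")
print(toy_ngrams)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie', 'not a good', 'a good movie']
```

A four-word text already yields nine features, which explains why the number of dimensions grows so quickly. The trigram "not a good" also carries information that the unigram "good" alone would miss.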

The code below will:
- Build a representation of the data based on n-grams (up to tri-grams)
- Train a classifier
- Make predictions
- Print scores

Do you see any improvement?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer( analyzer = 'word', max_features = 5000, ngram_range=(1,3) )
train_features_tfidf_ngram = vectorizer.fit_transform( train_reviews )
test_features_tfidf_ngram = vectorizer.transform( test_reviews )

print( train_features_tfidf_ngram.shape, test_features_tfidf_ngram.shape)

# Training
classifier = LogisticRegression()
classifier.fit( train_features_tfidf_ngram, train_labels )
# Predictions
preds = classifier.predict( test_features_tfidf_ngram )
# Scores
print( classification_report( test_labels, preds ) )

### Vocabulary size

The code below will print the vocabulary for this representation of the data: can you see what has changed?

In [None]:
# Print the vocabulary (= unique words, here 5000)
print( "\nVocabulary:", list( vectorizer.vocabulary_.keys() ) )
vocab = vectorizer.get_feature_names_out()
print( "Size of the vocabulary:", len(vocab))
print(  "Sorted vocabulary:", vocab[:100] )

## 2.6 Inspecting the model

Linear classifiers work by learning weights over the features.
Looking at these weights can give some insight into your model.

With LogisticRegression in the binary setting, we have:
- the most positive weights are the best indicators of the positive class (here positive reviews)
- the most negative weights are the best indicators of the negative class (here negative reviews)
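
The sketch below illustrates how such weights lead to a prediction: the weights of the features present in a review are summed (with a bias), and the logistic function turns the score into a probability for the positive class. The weights here are invented for the example, not learned from our data, and the function name is hypothetical.

```python
import math

# Hypothetical weights, for illustration only
toy_weights = {"great": 1.8, "boring": -2.1, "movie": 0.05}
toy_bias = 0.0

def positive_probability(tokens, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    score = bias + sum(weights.get(token, 0.0) for token in tokens)
    return 1 / (1 + math.exp(-score))

p_great = positive_probability(["great", "movie"], toy_weights, toy_bias)
p_boring = positive_probability(["boring", "movie"], toy_weights, toy_bias)
print(round(p_great, 3), round(p_boring, 3))  # 0.864 0.114
```

A word with a weight close to zero, like "movie" here, barely changes the decision, while strongly weighted words dominate it.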

### Comparing positive and negative features

The code below will print the 50 most positive and negative features: do the results make sense?

In [None]:
# Here we look at the last model trained above (n-gram features with TF-IDF normalization)

vocab = vectorizer.get_feature_names_out()
allCoefficients = [(classifier.coef_[0,i], vocab[i]) for i in range(len(vocab))]
allCoefficients.sort()
allCoefficients.reverse()

print("Top features for positive class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[:50]] ) )

print("\nTop features for negative class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[-50:]] ) )

### Error analysis

The code below will print the errors of our system. Look at some examples, what do you think?

In [None]:
# Retrieve the errors of the system, i.e. reviews that were wrongly classified
pos_as_neg = []
neg_as_pos = []
for i, r in enumerate( test_reviews):
  if test_labels[i] != preds[i]:
    if test_labels[i] == 1:
      pos_as_neg.append( [test_labels[i], preds[i], r] )
    else:
      neg_as_pos.append( [test_labels[i], preds[i], r] )
print( "------ Positive reviews that have been wrongly predicted as negative: "+str(len(pos_as_neg))+"\n")
print( '\n'.join( [r for g,p,r in pos_as_neg] ))

print( "\n\n------ Negative reviews that have been wrongly predicted as positive: "+str(len(neg_as_pos))+"\n")
print( '\n'.join( [r for g,p,r in neg_as_pos] ))

# Part 3: generating word embeddings

![](https://drive.google.com/uc?export=view&id=1eLkKWp8yOP6AJsK2h6Btbr3L95TvDsyD)



As introduced during the course, we can use neural networks to generate vectors representing words.
These vectors, learned on massive amounts of data, make it possible to compute similarity measures between words.

As an introductory exercise, we will generate word embeddings from the sentiment review dataset and take a look at the generated vectors.

Keep in mind that this corpus is "small" compared to what is generally used for generating embeddings: here around 40k words, against millions of words in general!
The resulting vectors will thus not be of extremely good quality (but the model will run very fast :).

## 3.1 Generating word embeddings

We will use gensim in order to induce word embeddings from text.
gensim is a vector space modeling and topic modeling toolkit for python, and contains an efficient implementation of the word2vec algorithms.

word2vec consists of two different algorithms: skipgram (sg) and continuous-bag-of-words (cbow).
The underlying prediction task of the former is to estimate the context words from the target word; the prediction task of the latter is to estimate the target word from the sum of the context words.
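
To make the difference between the two prediction tasks concrete, the sketch below lists the training examples each one would see for a short sentence. It only illustrates the input/output pairs: the function names are hypothetical, and the real word2vec training (window sampling, negative sampling, the network itself) is not reproduced here.

```python
def context_of(tokens, position, window):
    """Words within `window` positions of the target, excluding the target itself."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[j] for j in range(start, end) if j != position]

def skipgram_pairs(tokens, window=2):
    """Skipgram: (input = target word, output = one context word)."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_of(tokens, i, window)]

def cbow_pairs(tokens, window=2):
    """CBOW: (input = all context words, output = target word)."""
    return [(context_of(tokens, i, window), target) for i, target in enumerate(tokens)]

w2v_tokens = "the movie was great".split()
print(skipgram_pairs(w2v_tokens, window=1))
# [('the', 'movie'), ('movie', 'the'), ('movie', 'was'), ('was', 'movie'), ('was', 'great'), ('great', 'was')]
print(cbow_pairs(w2v_tokens, window=1))
# [(['movie'], 'the'), (['the', 'was'], 'movie'), (['movie', 'great'], 'was'), (['was'], 'great')]
```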

In [None]:
from gensim.models import Word2Vec

import gzip
import logging

import time

# set up logging for gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# we define a PlainTextCorpus class; this will provide us with an
# iterator over the corpus (so that we don't have to load the corpus
# into memory)
class PlainTextCorpus(object):
    def __init__(self, fileName):
        self.fileName = fileName

    def __iter__(self):
        for line in gzip.open(self.fileName, 'rt', encoding='utf-8'):
            yield  line.split()

# instantiate the corpus class using corpus location
#sentences = PlainTextCorpus('raw_reviews.txt.gz')
sentences = PlainTextCorpus('raw_reviews_cleaned2.txt.gz')

# we only take into account words with a frequency of at least 50, and
# we iterate over the corpus only once
model = Word2Vec(sentences, min_count=50, epochs=1)

# finally, save the constructed model to disk
model.save('model_word2vec')

## 3.2 Compute word similarity

You can now compute the most similar words (which is measured by cosine similarity between the word vectors) by issuing the following command:

model.wv.most_similar(myword)

Don't hesitate to test with other words, such as "movie", "good", etc.
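
As a reminder of what cosine similarity measures, here is a minimal sketch on hand-made 3-dimensional vectors. The vectors and the helper names are invented for the example; real embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, for illustration only
toy_embeddings = {
    "actor": [0.9, 0.1, 0.3],
    "actress": [0.8, 0.2, 0.35],
    "popcorn": [0.1, 0.9, 0.0],
}

def toy_most_similar(word, embeddings, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(embeddings[word], vector))
              for other, vector in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for other_word, score in toy_most_similar("actor", toy_embeddings):
    print(other_word, round(score, 3))
```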

In [None]:
model.wv.most_similar('actor')

In [None]:
model.wv.most_similar('romance')

Word embeddings allow us to do analogical reasoning using vector addition and subtraction.
gensim offers the possibility to do so.

Try to perform analogical reasoning, e.g. actor - man + woman = ?
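
The idea behind this arithmetic can be sketched on hand-made 2-dimensional vectors, where one axis loosely encodes "profession" and the other "gender". The vectors and function names are invented for the example, and gensim's own implementation differs in its details (for instance in how vectors are normalized before being combined).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors, for illustration only
analogy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "actor": [3.0, 0.0],
    "actress": [3.0, 1.0],
    "movie": [0.0, 3.0],
}

def toy_analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # the input words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(toy_analogy(positive=["actor", "woman"], negative=["man"], vectors=analogy_vectors))  # actress
```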

In [None]:
model.wv.most_similar(positive=["actor", "woman"], negative=["man"])

## 3.3 Modify the model

By default, the word2vec module creates word embeddings with the following settings:
- algorithm: CBOW
- window: 5
- embeddings size: 100

Try other options, including:
- algorithm: skipgram
- window: try varied sizes, from very small to large
- embeddings size: try varied sizes, from very small to large

Each time, evaluate the impact on the similarity computation.
What configuration works best?

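See doc: https://radimrehurek.com/gensim_3.8.3/models/word2vec.html (this page describes gensim 3.8.3, where the parameters 'vector_size' and 'epochs' used below were still called 'size' and 'iter')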

In [None]:
# a- MODIFYING THE WINDOW SIZE (here 1)

model_w1 = Word2Vec(sentences, min_count=50, epochs=1, window=1)

In [None]:
# a- MODIFYING THE WINDOW SIZE (here 20)

model_w20 = Word2Vec(sentences, min_count=50, epochs=1, window=20)

In [None]:
# b- MODIFYING THE EMBEDDINGS SIZE (here 10)

model_s10 = Word2Vec(sentences, min_count=50, epochs=1, vector_size=10)

In [None]:
# b- MODIFYING THE EMBEDDINGS SIZE (here 300)

model_s300 = Word2Vec(sentences, min_count=50, epochs=1, vector_size=300)

In [None]:
# c- WITH SKIPGRAM

# we only take into account words with a frequency of at least 50, and
# we iterate over the corpus only once
model_sg = Word2Vec(sentences, min_count=50, epochs=1, sg=1)

# finally, save the skip-gram model to disk
model_sg.save('model_word2vec_sg')

In [None]:
# Results with CBOW
model.wv.most_similar('romance')

In [None]:
# Results with Skip-gram
model_sg.wv.most_similar('romance')

# Part 4: additional notes on classification

## Finding the best model

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:
- Different features
- Different classification algorithms
- Different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we’ll be optimizing our parameters for that particular test set, and run the risk of overfitting, which means we are not able to properly generalize to data we haven’t trained on.
We want to build a model that is robust, meaning that it will get good performance on unseen data.
That's why we only use the test set at the end, with the best model.

For this reason, we need to make use of a validation or development set.
However, our training set is already quite small; creating a separate validation set would give us even less training data.

Fortunately, there is another option: we can use k-fold cross validation.
The idea is the following:
- Break up the data into k (e.g. 10) parts (folds)
- For each fold
    - The current fold is used as a temporary test set
    - The other k-1 folds (here 9) are used as training data
    - Performance is computed on the test fold
- Average the performance over the k (here 10) runs
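
The splitting step can be sketched in a few lines of pure Python. This is only an illustration with a hypothetical function name: the folds are contiguous and not shuffled, whereas scikit-learn offers several splitting strategies (e.g. stratified folds that preserve the class balance).

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # the first (n_examples % k) folds get one extra example
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_indices = indices[start:start + size]
        train_indices = indices[:start] + indices[start + size:]
        yield train_indices, test_indices
        start += size

toy_folds = list(k_fold_indices(10, 5))
for train_indices, test_indices in toy_folds:
    print("train:", train_indices, "test:", test_indices)
```

Each example is used exactly once for testing and k-1 times for training, so no data is wasted.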

Scikit-learn provides efficient ways of performing cross-validation.
Below, we will test grid search, which selects the best values for the hyper-parameters using cross-validation over the training set.

## 4.1 Optimizing the hyper-parameters

Each algorithm comes with some "options" called hyper-parameters.
The chosen values can have an important effect on the results.  

For example, Logistic Regression has:
- 'C' a coefficient C used for regularization, with smaller values specifying stronger regularization.
- 'max_iter' (default=100) Maximum number of iterations taken for the solvers to converge.

Here we use the class 'GridSearchCV' that will perform an exhaustive search over specified parameter values for an estimator (i.e. a classifier).
We specify the algorithm we want (here 'LogisticRegression') and the parameter values we want to test (see the dictionary 'parameters').

Then the 'fit' method of the GridSearchCV object performs the search over the parameters, using cross-validation (default: 5-fold CV).

Then, you can print the best set of parameters and the best score (i.e. the mean cross-validated score of the best estimator), and use a pandas DataFrame to visualize the results for each set of parameters.


See the doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
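
To get a feeling for the cost of an exhaustive search, the sketch below enumerates every combination of a parameter grid and counts the number of models to train. It assumes the same grid as the one used in this section and 5-fold cross-validation; the variable names are chosen for the example.

```python
from itertools import product

# Same values as the grid used in this section (assumption for the example)
toy_grid = {'C': [0.0001, 0.01, 0.1, 1, 100], 'max_iter': [1, 10]}
toy_n_folds = 5

# One dictionary per combination of parameter values
param_names = sorted(toy_grid)
toy_settings = [dict(zip(param_names, values))
                for values in product(*(toy_grid[name] for name in param_names))]

for setting in toy_settings[:3]:
    print(setting)
print(len(toy_settings), "settings x", toy_n_folds, "folds =",
      len(toy_settings) * toy_n_folds, "models to train")
```

Each additional hyper-parameter multiplies the number of settings, so the grid should stay small when training is slow.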


In [None]:

from sklearn.model_selection import GridSearchCV

parameters = {'C':[0.0001, 0.01, 0.1, 1, 100], 'max_iter':[1, 10] }
lr = LogisticRegression()
clf_lr = GridSearchCV(lr, parameters, verbose=1)
clf_lr.fit( train_features_tfidf_ngram, train_labels )
sorted(clf_lr.cv_results_.keys())

print( "Best parameters found:", clf_lr.best_params_)
print( "Best score found:", clf_lr.best_score_)

pd.concat([pd.DataFrame(clf_lr.cv_results_["params"]),pd.DataFrame(clf_lr.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)

You can then directly use the GridSearchCV object (here called 'clf_lr') to make predictions on your test set: it corresponds to the best model found during the search.

In [None]:
preds = clf_lr.predict( test_features_tfidf_ngram )
print( classification_report( test_labels, preds ) )

## 4.2 Optional exercise: Try other algorithms

Now, you can use the grid search to test another algorithm (e.g. Naive Bayes, SVM).
You only need to perform the grid search, then report the results on the test set for the best algorithm only.

Doc for Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

Doc for SVM: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

Which one performs the best?

In [None]:
# Testing Naive Bayes

from sklearn.naive_bayes import MultinomialNB


In [None]:
# Testing SVM

from sklearn.svm import LinearSVC



## 4.3 Notes: Other options for vectorization

When converting our data into vectors using bag-of-words, many other options are implemented in scikit-learn, e.g.:
- stop_words='english': will automatically remove stop-words from a list (but be careful: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words)
- binary (default False): If True, all non zero counts are set to 1.
- ngram_range (tuple (min_n, max_n), default=(1, 1)): The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
- analyzer='word': can be changed to 'char' if you want to use characters as features

See the doc: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer