# Introduction to Natural Language Processing

### Tutorial 3

---

So far, we have seen how to tokenize a document, extract attributes from its tokens, and create a bag of words. Today, we'll use the `CountVectorizer` and the `TfidfVectorizer` from `scikit-learn` to create a matrix that represents the words in each document.

Text is vectorized because most algorithms are not designed to handle raw text. We therefore need to represent each document in numerical form so that calculations can be performed on it.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import naive_bayes

**Task:**
Use the text provided below to create a bag of words with the `CountVectorizer` class.

In [2]:
docs = ['Albert Einstein, who became a Swiss citizen in 1901 and worked for years in Switzerland, is the most famous Nobel Prize winner in the sciences.',
        'UK is too risky after the Brexit for a Gigafactory',
        'Tesla wants to build a Gigafactory in Berlin',
        'Brexit has made it too risky for Tesla to put a Gigafactory in the UK.']

In [3]:
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(docs)
print(count_vectorizer.vocabulary_.keys())

dict_keys(['albert', 'einstein', 'who', 'became', 'swiss', 'citizen', 'in', '1901', 'and', 'worked', 'for', 'years', 'switzerland', 'is', 'the', 'most', 'famous', 'nobel', 'prize', 'winner', 'sciences', 'uk', 'too', 'risky', 'after', 'brexit', 'gigafactory', 'tesla', 'wants', 'to', 'build', 'berlin', 'has', 'made', 'it', 'put'])


In [4]:
print(X.toarray())

[[1 0 1 1 1 0 0 0 1 1 1 1 0 0 3 1 0 0 1 1 1 0 0 1 1 1 0 2 0 0 0 0 1 1 1 1]
 [0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0]]


However, counting is not the only way of representing text with numbers. We can also penalize tokens that occur very often. Why?

Because, in general, very frequent tokens are often less relevant than tokens that appear less often. For instance, suppose we want to retrieve documents that are relevant to the query _The President Donald Trump_. One idea would be to retrieve all documents containing every word in the query. Since this search may still return many documents, we might then count how often the query words appear in each of them. But _the_ and _president_ may still have many occurrences in documents that have little to do with the query. We should therefore give more weight to documents that contain the rarer terms _donald_ and _trump_. How do we get there? By calculating the term frequency–inverse document frequency (tf-idf).

**Tf-Idf**: the term frequency of a token in a document, multiplied by the token's inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token).

- Tf: Term frequency: $\mathrm{tf}(t,d) = \log(\mathrm{count}(t,d)+1)$
- Idf: Inverse document frequency: $\mathrm{idf}(t) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$

Notice that we calculate the tf-idf for each term in each document.
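
The two formulas can be sketched directly in plain Python. The toy corpus below is made up for illustration (already lowercased and split on whitespace), and the helper names `tf`, `idf` and `tf_idf` are hypothetical, not part of any library:

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens
corpus = [
    "the president spoke today".split(),
    "the president met donald trump".split(),
    "the weather is nice".split(),
]

def tf(term, doc):
    """Log-scaled term frequency: log(count(t, d) + 1)."""
    return math.log(Counter(doc)[term] + 1)

def idf(term, corpus):
    """Inverse document frequency: log(|D| / number of documents containing t).
    Assumes the term occurs in at least one document."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    """Tf-idf weight of one term in one document."""
    return tf(term, doc) * idf(term, corpus)

query_doc = corpus[1]
for term in ["the", "president", "trump"]:
    print(term, tf_idf(term, query_doc, corpus))
```

In this sketch _the_ occurs in every document, so its idf is $\log 1 = 0$ and its weight vanishes, while _trump_, which occurs in a single document, receives the largest weight of the three.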

**Question**
Using the same data and the `TfidfVectorizer`, create a matrix and print the features of the vectorizer.

**Steps** 
Use `?` to explore the vectorizer's documentation and refer to [this link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) for more of its options. Increase the n-gram range and observe the matrix. Can you see any difference?
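
As a rough illustration of the last step, the sketch below fits the vectorizer twice on two of the sentences above, once with unigrams only and once with unigrams and bigrams. The variable names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two of the example sentences from above
small_docs = ['UK is too risky after the Brexit for a Gigafactory',
              'Tesla wants to build a Gigafactory in Berlin']

# Unigrams only (the default)
unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
unigram_matrix = unigram_vectorizer.fit_transform(small_docs)

# ngram_range=(1, 2) keeps the unigrams and adds every pair of adjacent tokens
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bigram_matrix = bigram_vectorizer.fit_transform(small_docs)

print(unigram_matrix.shape, bigram_matrix.shape)

# Features that exist only in the bigram vocabulary
new_features = sorted(set(bigram_vectorizer.vocabulary_) - set(unigram_vectorizer.vocabulary_))
print(new_features)
```

The number of rows stays the same, but the matrix gains one column for each distinct bigram, so it becomes wider and sparser.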

In [5]:
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = tf_idf_vectorizer.fit_transform(docs)
print("FEATURES:")
print(tf_idf_vectorizer.vocabulary_.keys())
print("\nMATRIX:")
print(X.toarray())

FEATURES:
dict_keys(['albert', 'einstein', 'who', 'became', 'swiss', 'citizen', 'in', '1901', 'and', 'worked', 'for', 'years', 'switzerland', 'is', 'the', 'most', 'famous', 'nobel', 'prize', 'winner', 'sciences', 'uk', 'too', 'risky', 'after', 'brexit', 'gigafactory', 'tesla', 'wants', 'to', 'build', 'berlin', 'has', 'made', 'it', 'put'])

MATRIX:
[[0.20705515 0.         0.20705515 0.20705515 0.20705515 0.
  0.         0.         0.20705515 0.20705515 0.20705515 0.13216062
  0.         0.         0.39648185 0.16324466 0.         0.
  0.20705515 0.20705515 0.20705515 0.         0.         0.20705515
  0.20705515 0.20705515 0.         0.26432124 0.         0.
  0.         0.         0.20705515 0.20705515 0.20705515 0.20705515]
 [0.         0.43314018 0.         0.         0.         0.
  0.34149269 0.         0.         0.         0.         0.27646777
  0.27646777 0.         0.         0.34149269 0.         0.
  0.         0.         0.         0.         0.34149269 0.
  0.         0.  
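
The values in this matrix do not follow the log-scaled formulas given earlier exactly. With its default parameters, `TfidfVectorizer` uses the raw count as term frequency and a smoothed idf, $\ln\frac{1+|D|}{1+\mathrm{df}(t)} + 1$, and then scales each row to unit Euclidean length. A small sketch that recomputes these weights by hand, on a made-up toy corpus with illustrative variable names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus, for illustration only
toy_docs = ['tesla builds a gigafactory in berlin',
            'tesla avoids the uk after brexit',
            'the uk is risky']

# Defaults: smooth_idf=True, norm='l2', sublinear_tf=False
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(toy_docs).toarray()

# Recompute the same weights step by step
counts = CountVectorizer().fit_transform(toy_docs).toarray()  # raw term counts
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed idf
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

print(np.allclose(tfidf, manual))
```

This is also why, in the first row of the matrix above, a token that occurs three times in the sentence has three times the weight of an equally common token that occurs once.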

Let's explore these vectorizers with real data, using Yelp reviews to compare them.

In [6]:
data = pd.read_csv("yelp_polarity.txt", sep="\t", header=None)
display(data)

|     | 0 | 1 |
|-----|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


### Training a classifier

To train any model, you need to split your data, because you want to test the performance of your classifier at the end, and this can't be done on the same data you trained on. You therefore need to hold out a small set of data that you never use until testing. Let's create our train and test sets:

In [7]:
text_train, text_test, label_train, label_test = train_test_split(data[0], data[1], 
                                                                  test_size=0.20, 
                                                                  random_state=1234, shuffle=True)

In [8]:
text_train.size

800

## Count Vectorizer: Bag of Words (BOW)

We use the training set to generate a feature matrix with the `CountVectorizer`. This matrix will be used to train our classifier.

**Hint:** Note that you need to instantiate a new vectorizer.

In [9]:
polarity_count_vectorizer = CountVectorizer()
polarity_bow_matrix = polarity_count_vectorizer.fit_transform(text_train) # determine features and transform train corpus to those features

In [10]:
# Rows -> samples; columns -> feature values
polarity_bow_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0]])

To train our model, we need to pass it the training features and their labels. In this case, we will use a support vector machine (SVM).

In [11]:
polarity_bow_matrix.shape

(800, 1822)

Let's instantiate our classifier:

In [12]:
svm_classifier = svm.LinearSVC()

Now it's time to train...

In [13]:
svm_classifier.fit(polarity_bow_matrix, label_train);



And after training we can test... But first, we need to convert our test data into numerical features.

**Question**
Vectorize your test set with the vectorizer that was fitted on the training data (use `transform`, not `fit_transform`). After vectorizing, call the `predict()` method with your test matrix as its argument. This method returns a class prediction for each document in the matrix.

Use NumPy to compare the original labels with the ones predicted by our model.

**Hint:** Check `np.sum` and `np.equal`.

In [14]:
polarity_bow_matrix_test = polarity_count_vectorizer.transform(text_test)
test = svm_classifier.predict(polarity_bow_matrix_test)

In [15]:
print(test)

[0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0
 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0
 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1
 0 1 0 1 1 0 0 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 1
 1 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0
 0 1 1 0 1 1 0 1 1 1 0 1 0 1 0]


In [16]:
label_test.values

array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0])

In [17]:
polarity_bow_matrix_test.shape

(200, 1822)

In [18]:
polarity_bow_matrix.shape

(800, 1822)
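
Both matrices have 1822 columns because `transform` reuses the vocabulary learned from the training set; words that occur only in the test set are dropped. A minimal sketch of this behaviour with made-up sentences (all names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews, for illustration only
train_texts = ['the food was great', 'the service was slow']
test_texts = ['the dessert was great']   # 'dessert' never occurs in the training texts

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses it, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print('dessert' in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the weights learned by the classifier would no longer line up with the columns.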

In [19]:
correct_answers = np.sum(np.equal(test, label_test))
accuracy = correct_answers / len(test) * 100
print(accuracy)

85.5
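
The same percentage can be obtained in a few equivalent ways. The sketch below uses made-up labels and predictions rather than the Yelp results, and compares the manual count with `np.mean` and with `accuracy_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only
true_labels = np.array([1, 0, 1, 1, 0])
predicted_labels = np.array([1, 0, 0, 1, 0])

# Count the matches and divide by the number of samples
manual_accuracy = np.sum(np.equal(predicted_labels, true_labels)) / len(true_labels) * 100

# The mean of a boolean array is the fraction of True values
mean_accuracy = np.mean(predicted_labels == true_labels) * 100

# scikit-learn's helper returns a fraction between 0 and 1
sklearn_accuracy = accuracy_score(true_labels, predicted_labels) * 100

print(manual_accuracy, mean_accuracy, sklearn_accuracy)
```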


## TF-IDF

**Question**: Train a new model with `TfidfVectorizer`.

**Hint:** Note that you need to instantiate a new TF-IDF vectorizer.
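
One possible shape of a solution, sketched on a handful of made-up reviews rather than the Yelp data. It mirrors the bag-of-words steps: fit the vectorizer on the training texts, train the classifier, then transform the test texts with the same vectorizer. All sentences and variable names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

# Made-up reviews and labels (1 = positive, 0 = negative), for illustration only
train_texts = ['great food and great service', 'loved this place',
               'the crust was not good', 'bad service and nasty texture']
train_labels = [1, 1, 0, 0]
test_texts = ['great place', 'nasty crust']

# A new vectorizer: fit on the training texts, then reuse it for the test texts
tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(train_texts)
tfidf_test = tfidf_vectorizer.transform(test_texts)

# A new classifier trained on the tf-idf features
tfidf_classifier = svm.LinearSVC()
tfidf_classifier.fit(tfidf_train, train_labels)

predictions = tfidf_classifier.predict(tfidf_test)
print(predictions)
```

On the real data, the accuracy can then be computed exactly as in the bag-of-words section and compared with the earlier result.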