# Text vectorisation: Turning Text into Features

More advanced forms of text analysis require text documents to be converted into numerical values, or features. In this section we will examine:

* different methods for representing a collection of texts as numbers
* the decisions we need to make when generating a particular representation as well as the kinds of insights each numerical representation can give us.

We will use tools from the Python libraries `scikit-learn` and `gensim` to perform some popular text vectorisation methods:
* Recap of n-gram (unigram and bigram) term frequency
* TF-IDF (Term Frequency–Inverse Document Frequency)
* Word embeddings (Word2Vec)

In [None]:
# Import libraries

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

import gensim
from gensim.models import Word2Vec

from matplotlib import pyplot as plt

## Turning text into n-gram features
### Unigrams

Compute the frequency of word occurrence using the count vectoriser (`CountVectorizer`) in `scikit-learn`.

### Toy example

In [None]:
# Text corpus

# Load the parsed news dataset 
corpus = pd.read_csv('sample_news_large_phrased.csv', index_col='index')

In [None]:
corpus.head(1)

In [None]:
# Subset news stories about brexit
corpus_brexit = corpus[corpus['query']=='brexit']

corpus_toy=corpus_brexit.iloc[[7,22], [1]]

# Set the maximum width of columns
pd.options.display.max_colwidth = 200

corpus_toy.head(20)

In [None]:
# Use CountVectorizer to tokenize a collection of text documents and convert it into a matrix of token counts

# Create an instance of the CountVectorizer class
vectorizer = CountVectorizer()

# Learn the vocabulary from the toy corpus (its single column holds the text)
vectorizer.fit(corpus_toy.iloc[:, 0])

# Encode documents as vectors
vector = vectorizer.transform(corpus_toy.iloc[:, 0])

# The vocabulary_ attribute maps the tokens (keys) to the integer feature indices (values) in a dictionary
print(vectorizer.vocabulary_)


Note that punctuation and single-letter words are removed. Below, we will use the tokens you have already preprocessed.

In [None]:
# Access the feature index of a token
print(vectorizer.vocabulary_.get("brexit"))

The numbers assigned to each token (e.g., "brexit") are column indices in the count matrix, not counts. For clarity, the vocabulary is sorted by index in a cell below.

In [None]:
# Print the matrix of rows (documents) and columns (count for the number of times a token appeared in the document)
print(vector.toarray())

`vector.toarray()` returns a matrix with one row per document (two in our case) and one column per token in the vocabulary of the entire corpus (all documents).

Each document is encoded as a vector with a length indicating the size of the vocabulary of the entire corpus and an integer count for the number of times each token appeared in the document.
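To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. It is not how `scikit-learn` is implemented; the two sentences are invented for illustration, and the regular expression is only an approximation of the default tokenisation (lowercasing, dropping punctuation and single-letter words).

```python
import re
from collections import Counter

# Hypothetical two-document corpus, invented for illustration
docs = [
    "Brexit deal: a new Brexit vote",
    "Election called after Brexit vote",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # which also drops punctuation and single-letter words
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: each distinct token gets a column index (alphabetical order)
all_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(all_tokens)}

# One row per document, one column per vocabulary entry
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
print(matrix)
```

Each row has as many entries as there are tokens in the vocabulary, and most entries are zero because a document only uses a small part of the vocabulary.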

In [None]:
# Sort the dictionary of terms (keys) and indices (values) in the feature matrix by values in ascending order
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vector.toarray())

The output consists of 24 unigram features. The 1st token `brexit` has appeared twice in the first title and once in the second title.

In [None]:
# Find the most frequent token in the corpus and the number of times it appeared in the corpus
token_counts = vector.toarray().sum(axis=0)  # total count of each token across all documents
maximum = token_counts.max()
index_of_maximum = np.where(token_counts == maximum)[0]

print("max:", maximum)
print("index:", index_of_maximum)

In [None]:
# Sort the integer counts within each document (row) in ascending order
np.sort(vector.toarray())

### Example using the entire data set of News Tokens

In [None]:
corpus['text']

In [None]:
# Convert a collection of text documents to a matrix of token counts

vectorizer_corpus = CountVectorizer()

# Learn the vocabulary from the corpus and tokenise
vectorizer_corpus.fit(corpus['text'])

# Encode documents as vectors, using the vectoriser fitted on the full corpus
vector_corpus = vectorizer_corpus.transform(corpus['text'])

# summarize & generate output
print(vectorizer_corpus.vocabulary_)
print(vector_corpus.toarray())

## Exercise 1

For the entire corpus, find the most frequent token and the number of times it appears.

In [None]:
# Please write below the code for Exercise 1




### Bi-grams (combination of two tokens)
In the unigram transformation, each token is a feature. For example, `general` and `election` are two separate features. The bi-gram transformation relaxes this constraint by pairing each word with the word that follows it, so that `general election` becomes a single feature.
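As a rough illustration of what "pairing" means, the sketch below builds n-grams from a list of tokens in plain Python. The helper name `ngrams` and the token list are made up for this example; this is a conceptual sketch rather than the `scikit-learn` implementation.

```python
def ngrams(tokens, n):
    """Return the n-grams (joined by a space) found in a sequence of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenised headline
tokens = ["general", "election", "called", "after", "brexit", "vote"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A document with `k` tokens yields `k - 1` bigrams, and each distinct bigram becomes its own column in the feature matrix.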

In [None]:
# Extracting unigrams and bigrams
    # ngram_range of (1, 1) extracts unigrams
    # ngram_range of (1, 2) extracts unigrams and bigrams
    # ngram_range of (2, 2) extracts bigrams

# Create an instance of the CountVectorizer class set to extract bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Learn the vocabulary from the toy corpus and tokenise
vectorizer.fit(corpus_toy.iloc[:, 0])

# Encode documents as vectors
vector = vectorizer.transform(corpus_toy.iloc[:, 0])

# The vocabulary_ attribute maps the tokens (keys) to the integer feature indices (values) in a dictionary
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vector.toarray())

The output consists of 28 bigram-based features. The count is either 1 or 0 for each of our bigrams.

## Term frequency–inverse document frequency (TF-IDF)

TF-IDF vectorisation down-weights tokens that are present across many documents in the corpus (in particular, words like "of" and "the" if stop words are not removed) and are therefore less informative than tokens that are present only in specific documents.

### Toy example

In [None]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()

# Learn the vocabulary from the toy corpus and tokenise
vectorizer.fit(corpus_toy.iloc[:, 0])

# Summarize & print the tokens and their IDF weights
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vectorizer.idf_)

###### How is TF-IDF computed by `scikit-learn`?  

In [None]:
# TF-IDF
# IDF = ln((1 + N) / (1 + n)) + 1
# N is the total number of documents
# n is the number of documents in which the word appears
# the constant 1 is added to the numerator and denominator to prevent zero divisions
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    
import math as m
# the term "brexit" is present in two of two documents
m.log((2+1)/(2+1))+1 

In [None]:
# the term "election" is present in one of two documents
m.log((2+1)/(1+1))+1
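The two calculations above can be generalised into a small function. The sketch below applies the smoothed IDF formula to a made-up tokenised corpus; the function name `smoothed_idf` and the documents are assumptions for illustration only.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """IDF = ln((1 + N) / (1 + n)) + 1, using the natural logarithm."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# Hypothetical tokenised corpus of three documents
docs = [
    ["brexit", "deal", "vote"],
    ["brexit", "election"],
    ["brexit", "deal", "protesters"],
]
n_documents = len(docs)
vocabulary = sorted({token for doc in docs for token in doc})

idf = {}
for token in vocabulary:
    # Document frequency: number of documents that contain the token
    document_frequency = sum(token in doc for doc in docs)
    idf[token] = smoothed_idf(n_documents, document_frequency)

for token, weight in idf.items():
    print(f"{token:12s} {weight:.3f}")
```

A token that occurs in every document gets the minimum weight of 1, and the weight grows as the token becomes rarer.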

#### TF-IDF vectorisation of the `raw` news sub-corpus related to Brexit

In [None]:
# Convert our corpus of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()

# Learn the vocabulary from the corpus and tokenise
vectorizer.fit(corpus_brexit['text'])

# Summarize & print the tokens and their IDF weights
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vectorizer.idf_)

In [None]:
# IDF weight of the token "the" in the brexit corpus
print("IDF weight of the term 'the':", vectorizer.idf_[vectorizer.vocabulary_["the"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "brexit" in the brexit corpus
print("IDF weight of the term 'brexit':", vectorizer.idf_[vectorizer.vocabulary_["brexit"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "deal" in the brexit corpus
print("IDF weight of the term 'deal':", vectorizer.idf_[vectorizer.vocabulary_["deal"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "protesters" in the brexit corpus
print("IDF weight of the term 'protesters':", vectorizer.idf_[vectorizer.vocabulary_["protesters"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

The word `"the"` is present in many documents, hence its IDF weight is close to 1; conversely, the term `"protesters"` is present in few documents and has a higher weight.
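Note that `idf_` holds only the inverse-document-frequency part of the weighting. With default settings, the value stored for a token in a given document is its count in that document multiplied by its IDF, after which each document vector is scaled to unit Euclidean length. The sketch below reproduces that calculation in plain Python on an invented corpus; the helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenised corpus, invented for illustration
docs = [
    ["the", "brexit", "deal", "the", "vote"],
    ["the", "election"],
    ["the", "protesters", "the", "deal"],
]
n_documents = len(docs)

def idf(token):
    # Smoothed IDF: ln((1 + N) / (1 + n)) + 1
    document_frequency = sum(token in doc for doc in docs)
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

def tfidf(doc):
    # Raw weight: term count in the document times the token's IDF
    counts = Counter(doc)
    raw = {token: count * idf(token) for token, count in counts.items()}
    # Scale the document vector to unit Euclidean (L2) length
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {token: weight / norm for token, weight in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document both tokens occur once, so the rarer token `election` ends up with a larger weight than `the`.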

#### Let's explore some parameters of the TfidfVectorizer function

In [None]:
# Key parameters of TfidfVectorizer
    # min_df: float or int, default=1. Ignores terms that have a document frequency lower than the given threshold
    # max_df: float or int, default=1.0. Ignores terms that have a document frequency higher than the given threshold
    #         (a float is a proportion of documents, an int is an absolute document count)
    # stop_words: 'english' removes words on a built-in English stop-word list, which has known issues.
    # Alternatively, setting max_df to a value in the range [0.7, 1.0) automatically filters
    # corpus-specific stop words based on intra-corpus document frequency of terms.

# Convert our corpus of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', min_df = 0.2, max_df = 0.9) # threshold depends on corpus and question

# Learn the vocabulary from the corpus and tokenise
vectorizer.fit(corpus_brexit['text'])

# Summarize & print the tokens and their IDF weights
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vectorizer.idf_)

In [None]:
# Check whether the token "the" is still in the vocabulary of the brexit corpus
print("'the' in vocabulary:", "the" in vectorizer.vocabulary_)
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

The word `"the"` is on the built-in English stop-word list and also appears in more than 90% of the documents, so it is no longer in the vocabulary; looking it up with `vectorizer.vocabulary_["the"]` would raise a `KeyError`.

In [None]:
# IDF weight of the token "brexit" in the brexit corpus
print("IDF weight of the term 'brexit':", vectorizer.idf_[vectorizer.vocabulary_["brexit"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "deal" in the brexit corpus
print("IDF weight of the term 'deal':", vectorizer.idf_[vectorizer.vocabulary_["deal"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# Check whether the token "protesters" is still in the vocabulary of the brexit corpus
print("'protesters' in vocabulary:", "protesters" in vectorizer.vocabulary_)
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

The word `"protesters"` appears in fewer than 20% of the documents and is removed on that basis.
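The effect of the two thresholds can be sketched without `scikit-learn`. The example below keeps only tokens whose share of documents lies between `min_df` and `max_df`; the function name `filter_by_document_frequency` and the ten toy documents are invented for illustration, and only the proportion (float) form of the thresholds is covered.

```python
def filter_by_document_frequency(docs, min_df=0.2, max_df=0.9):
    """Keep tokens whose share of documents lies within [min_df, max_df]."""
    n_documents = len(docs)
    vocabulary = {token for doc in docs for token in doc}
    kept = set()
    for token in vocabulary:
        # Share of documents that contain the token at least once
        share = sum(token in doc for doc in docs) / n_documents
        if min_df <= share <= max_df:
            kept.add(token)
    return kept

# Hypothetical corpus of ten tokenised documents
docs = (
    [["the", "brexit", "deal"]] * 5
    + [["the", "brexit"]] * 3
    + [["the", "protesters"]]
    + [["the"]]
)

kept = filter_by_document_frequency(docs, min_df=0.2, max_df=0.9)
print(sorted(kept))
```

Here `the` occurs in every document and `protesters` in only one of ten, so both fall outside the thresholds, while `brexit` and `deal` are kept.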

#### TF-IDF vectorisation using the `tokenised` News sub-corpus related to Brexit

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', min_df = 0.2, max_df = 0.9) # threshold depends on corpus and question
#Tokenize and build vocab
vectorizer.fit(corpus_brexit['tokens'])
#Summarize
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))
print(vectorizer.idf_)

In [None]:
# Check whether the token "the" is in the vocabulary (it is on the English stop-word list)
print("'the' in vocabulary:", "the" in vectorizer.vocabulary_)
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "brexit" in the brexit corpus
print("IDF weight of the term 'brexit':", vectorizer.idf_[vectorizer.vocabulary_["brexit"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the token "deal" in the brexit corpus
print("IDF weight of the term 'deal':", vectorizer.idf_[vectorizer.vocabulary_["deal"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

In [None]:
# IDF weight of the collocation "prime_minister" in the brexit corpus
print("IDF weight of the term 'prime_minister':", vectorizer.idf_[vectorizer.vocabulary_["prime_minister"]])
print("Mean IDF in corpus:", np.mean(vectorizer.idf_))

## Word Embeddings and word2vec

> You shall know a word by the company it keeps (Firth, 1957).

`Word2vec` ([Mikolov et al. 2013](https://arxiv.org/abs/1301.3781)) and related techniques (e.g., [GloVe](https://nlp.stanford.edu/projects/glove/)) use the context of a given word, i.e., the words surrounding it, to learn its meaning and represent it as a vector.
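To illustrate what "context" means here, the sketch below lists the (target word, context word) pairs that a skip-gram model would be trained to predict for one sentence and a given window size. It is a simplified illustration that ignores refinements such as subsampling and negative sampling; the function name `skipgram_pairs` and the sentence are made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical tokenised sentence
tokens = ["prime_minister", "seeks", "brexit", "deal"]

pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Words that tend to appear with the same neighbours produce similar training pairs, which is why they end up with similar vectors.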

In [None]:
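# Convert the '|*|'-delimited token strings in the News dataset into lists of tokens
# (work on a copy to avoid pandas' SettingWithCopyWarning on the filtered subset)
corpus_brexit = corpus_brexit.copy()
corpus_brexit['tokens'] = corpus_brexit['tokens'].apply(lambda token_string: token_string.split('|*|'))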


In [None]:
import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

# Train the word2vec model (sg=1 selects the skip-gram architecture)
# Note: gensim >= 4.0 names the vector dimensionality `vector_size`; older versions call it `size`
skipgram = Word2Vec(corpus_brexit['tokens'].tolist(), vector_size=300, window=3, min_count=1, sg=1)

print("Dimensionality (size of vocabulary and size of vectors):", skipgram)

# Access the vector for one word, "brexit" in this instance
print("vector for 'brexit':", skipgram.wv['brexit'])

In [None]:
skipgram.wv.similarity('brexit', 'migration')

In [None]:
skipgram.wv.most_similar(positive = "brexit")
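The similarity scores returned above are cosine similarities between word vectors. As a rough sketch of that measure, the example below computes it in plain Python for three made-up three-dimensional vectors; real word2vec vectors have many more dimensions, and these numbers are purely illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, invented for illustration
vectors = {
    "brexit": [0.9, 0.1, 0.3],
    "deal": [0.8, 0.2, 0.4],
    "weather": [-0.2, 0.9, -0.5],
}

similarity_deal = cosine_similarity(vectors["brexit"], vectors["deal"])
similarity_weather = cosine_similarity(vectors["brexit"], vectors["weather"])

print("brexit vs deal:", round(similarity_deal, 3))
print("brexit vs weather:", round(similarity_weather, 3))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.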

In [None]:
# Fit principal component analysis (PCA) on the skip-gram word vectors and plot the first 2 components
# (gensim >= 4.0 API; older versions use skipgram[skipgram.wv.vocab] and list(skipgram.wv.vocab))

data = skipgram.wv.vectors
pca = PCA(n_components=2)
result = pca.fit_transform(data)
# Create a scatter plot of the projection
plt.figure(figsize=(28, 20))
plt.scatter(result[:, 0], result[:, 1])
words = skipgram.wv.index_to_key  # words in the same order as the rows of `data`

for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()


## Acknowledgement

1. Akshay Kulkarni and Adarsha Shivananda. 2019. Natural Language Processing Recipes. [Chapter 3: Converting Text to Features](https://learning.oreilly.com/library/view/natural-language-processing/9781484242674/html/475440_1_En_3_Chapter.xhtml#)

2. [Sklearn's module on feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html)