# Text vectorisation: Turning Text into Features

# Part 1: N-grams and TF-IDF models 

More advanced forms of text analysis require that text documents are converted into numerical values, or features. In this section we will examine:

* different methods for representing a collection of texts as numbers
* the decisions we need to make when generating a particular representation as well as the kinds of insights each numerical representation can give us.

We will use tools from the Python libraries `scikit-learn` and `gensim` to perform some popular text vectorisation methods:
* Recap of n-gram (unigram and bigram) term frequency
* TF-IDF (Term Frequency–Inverse Document Frequency)
* Word embedding—Word2Vec

In [None]:
# Import libraries

! pip install gensim
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Turning text into n-gram features
### Unigrams

Compute the frequency of word occurrences using the `CountVectorizer` class in `scikit-learn`.

### Toy example

In [None]:
# Text corpus

# Load the parsed news dataset 
corpus = pd.read_csv('https://raw.githubusercontent.com/valdanchev/SC207/main/sample_news_large_phrased.csv', index_col='index')

In [None]:
corpus.head(1)

In [None]:
corpus[corpus['query']=='brexit']

In [None]:
# Subset news stories about brexit
corpus_brexit = corpus[corpus['query']=='brexit']

corpus_toy=corpus_brexit.iloc[[7,22], [1]]

# Set the maximum width of columns
pd.options.display.max_colwidth = 200

corpus_toy.head(5)

In [None]:
# Use CountVectorizer to tokenize a collection of text documents and convert 
# it into a matrix of token counts

# Create an instance of the CountVectorizer class
vectorizer = CountVectorizer()

# Learn the vocabulary from the corpus using the toy corpus
vectorizer.fit(corpus_toy['title'])

# Transform documents to document-term matrix
vector = vectorizer.transform(corpus_toy['title'])

# Print the tokens as a dictionary with tokens (keys) and 
# integer feature indices (values) using the vocabulary_ attribute
print(vectorizer.vocabulary_)

Note that punctuation and single-letter words are removed: by default, `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters. Below, we will use the tokens you have already preprocessed.

In [None]:
# Access the feature index of a token
vectorizer.vocabulary_.get('block')

The number assigned to each token (e.g., "brexit") is its column index in the document-term matrix. For clarity, the indices are sorted in a later cell below.

In [None]:
# Print the document-term matrix of rows (documents) and 
# columns (count for the number of times a token appeared in the document) 
print(vector.toarray())

`vector.toarray()` returns a matrix with one row per document (two in our case) and one column per token in the vocabulary of the entire corpus (all documents).

Each document is encoded as a vector whose length equals the size of the vocabulary of the entire corpus; each entry is an integer count of the number of times the corresponding token appeared in the document.
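
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. It is not the `scikit-learn` implementation: the helper `count_vectorise` and the two titles are made up for illustration, and the token pattern only approximates the `CountVectorizer` default.

```python
import re
from collections import Counter

# Approximation of the default tokenisation: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_vectorise(docs):
    """Return (vocabulary, matrix) for a list of raw text documents (hypothetical helper)."""
    tokenised = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Indices are assigned to tokens in alphabetical order
    all_tokens = sorted({tok for tokens in tokenised for tok in tokens})
    vocabulary = {tok: i for i, tok in enumerate(all_tokens)}
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One entry per vocabulary token, in index order
        matrix.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, matrix

# Made-up titles, not taken from the news dataset
docs = ["Brexit: a deal or no deal?", "No-deal Brexit vote"]
vocabulary, matrix = count_vectorise(docs)
print(vocabulary)
print(matrix)
```

Each row has as many entries as there are tokens in the vocabulary, and the single-letter word "a" does not become a feature.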

In [None]:
# Sort the dictionary of terms (keys) and indices (values) in the feature matrix by values in ascending order
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))

# Print the document-term matrix
print(vector.toarray())

The output consists of 24 unigram features. The token `brexit` (feature index 1) appears twice in the first title and once in the second title.

In [None]:
# Find (1) the most frequent token in a document, (2) the number of times it appears in that document
# and (3) the document in which it appears
maximum = vector.toarray().max()
index_of_maximum = np.where(vector.toarray() == maximum)

print("max:", maximum)
print("index:", index_of_maximum)

In [None]:
# Sort the integer counts within each document (row) in ascending order
np.sort(vector.toarray())

### Example using the entire data set of News Tokens

In [None]:
corpus['text'].head()

In [None]:
# Convert a collection of text documents to a matrix of token counts

vectorizer_corpus = CountVectorizer()

#  Learn the vocabulary from the corpus and tokenise
vectorizer_corpus.fit(corpus['text'])

# Transform documents to document-term matrix
vector_corpus = vectorizer_corpus.transform(corpus['text'])

# Print the tokens as a dictionary with tokens (keys) and integer feature indices (values) using the vocabulary_ attribute
print(dict(sorted(vectorizer_corpus.vocabulary_.items(), key=lambda item: item[1])))

In [None]:
# Print the document-term matrix
print(vector_corpus.toarray())

In [None]:
# Dimensions of vector_corpus.toarray(), i.e., number of rows and columns
vector_corpus.toarray().shape

## Exercise 1

Using the entire corpus, find (1) the most frequent token in a document, (2) the number of times it appears in that document and (3) the document in which it appears.

In [None]:
# Please write below the code for Exercise 1

maximum = vector_corpus.toarray().max()
index_of_maximum = np.where(vector_corpus.toarray() == maximum)

print("max:", maximum)
print("token index:", index_of_maximum)

The most frequent token is in document 3 and has feature index 12823.

In [None]:
# Find the token indexed 12823 by getting a key in a dictionary by its value 
# The value in the "vectorizer_corpus.vocabulary_" is the token index

dict((v,k) for k,v in vectorizer_corpus.vocabulary_.items())[12823]

In [None]:
# To double check, get value by key

vectorizer_corpus.vocabulary_.get('the')

### Bi-grams (combination of two tokens)
In the unigram transformation, each token is a feature. For example, `general` and `election` are two separate features. The bi-gram transformation relaxes this constraint by pairing each word with the words immediately before and after it, so that a pair of adjacent words becomes a feature in its own right.
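
As a rough illustration of what bi-gram extraction does, the sketch below slides a window of size n over a list of tokens. The helper `ngrams` and the three-token example are assumptions made for illustration; `CountVectorizer` does the equivalent work internally when `ngram_range` is set.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up token list for illustration
tokens = ["general", "election", "called"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['general', 'election', 'called']
print(bigrams)   # ['general election', 'election called']
```

A document with k tokens yields k - 1 bi-grams, which is why the vocabulary grows quickly once bi-grams are included.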

In [None]:
# Extracting unigrams and bigrams
    # ngram_range of (1, 1) extracts unigrams
    # ngram_range of (1, 2) extracts unigrams and bigrams
    # ngram_range of (2, 2) extracts only bigrams

# Create an instance of the CountVectorizer class that extracts unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1,2))

# Learn the vocabulary from the corpus and tokenise
vectorizer.fit(corpus_toy['title'])

# Transform documents to document-term matrix
vector = vectorizer.transform(corpus_toy['title'])

# Print the tokens as a dictionary with tokens (keys) and integer feature indices (values) using vocabulary_
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))

# Print the document-term matrix
print(vector.toarray())

The output consists of 28 bigram-based features. The count is either 1 or 0 for each of our bigrams.

## Term frequency–inverse document frequency (TF-IDF)

TF-IDF vectorisation down-weights tokens that are present across many documents in the corpus (in particular, words like "of" and "the" if stop words are not removed) and are therefore less informative than tokens that are present in only a few documents.

### Toy example

### Let's first get the `TF` (term frequency) as before 

In [None]:
# We use the CountVectorizer function we used above to count n-grams
vectorizer = CountVectorizer()
vectorizer.fit(corpus_toy['title'])
vector = vectorizer.transform(corpus_toy['title'])
print(vector.toarray())

#### Let's now compute the `IDF` part

$\mathrm{IDF} = \ln\left(\frac{N + 1}{n + 1}\right) + 1$, where N is the total number of documents and n is the number of documents in which the term appears; the constant 1 is added to the numerator and denominator to prevent zero divisions (see [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)). This is the formula used when `smooth_idf=True`, which is the default.
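
To get a feel for how the IDF behaves, the sketch below evaluates the smoothed formula for a hypothetical corpus of 100 documents. The helper name `smoothed_idf` and the corpus size are assumptions for illustration only.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((N + 1) / (n + 1)) + 1 (hypothetical helper)."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

# Hypothetical corpus of 100 documents: rarer terms receive a larger IDF
for doc_freq in (1, 10, 100):
    print(doc_freq, round(smoothed_idf(100, doc_freq), 3))
```

A term that appears in every document gets an IDF of exactly 1, the lowest possible value, so it is down-weighted relative to rarer terms but never ignored entirely.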

In [None]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(norm=None)

# Learn the vocabulary from the corpus and tokenise
matrix = vectorizer.fit_transform(corpus_toy['title'])

# Print the tokens as a dictionary with tokens (keys) and integer feature indices (values) using vocabulary_
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))

# Print the IDF scores 
print(vectorizer.idf_)

In [None]:
# IDF for the term 'block'
import math as m
m.log((2+1)/(1+1))+1

### Exercise
Compute the IDF for the term 'uk'

In [None]:
# Write your code here


#### Below we get the TF-IDF for our toy corpus

In [None]:
# Convert the TF-IDF matrix into a DataFrame   
tf_idf_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tf_idf_df

### How is TF-IDF computed by `scikit-learn`?  


TF-IDF(t,d) = TF * IDF

What is the TF-IDF of the term 'brexit', which is term 1, in document 0, i.e. TF-IDF(1,0)?

TF = 2

$\mathrm{IDF} = \ln\left(\frac{N + 1}{n + 1}\right) + 1$, where N is the total number of documents and n is the number of documents in which the term appears; the constant 1 is added to the numerator and denominator to prevent zero divisions (see [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)).


In [None]:
# the term "brexit" is present in two of two documents
IDF = m.log((2+1)/(2+1))+1 
IDF

So TF-IDF for term 1 (brexit) in document 0 is **TF-IDF (1,0) = TF * IDF = 2 * 1 = 2**

#### Let's try another example, the term with index 4 ('election') in document 0

TF-IDF(4,0) = TF * IDF

TF = 1

In [None]:
# the term "election" is present in one of two documents
IDF = m.log((2+1)/(1+1))+1
IDF

So TF-IDF for term 4 ('election') in document 0 is **TF-IDF (4,0) = TF * IDF = 1 * 1.405 = 1.405**

#### Normalising the TF-IDF matrix

The above TF-IDF matrix is not normalised. Typically, it is recommended that the TF-IDF weights are normalised. With L2 normalisation, each document vector is rescaled to unit Euclidean length, so the weights in the matrix range between 0 and 1. Below is the normalisation code (L2 normalisation is the default in `TfidfVectorizer`, but we indicate it below for clarity).
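
As a small worked example of what L2 normalisation does to a single document vector, consider the sketch below. The helper `l2_normalise` and the weights are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import math

def l2_normalise(row):
    """Divide each weight by the Euclidean length of the row (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in row))
    # Assumption: an all-zero row is returned unchanged to avoid dividing by zero
    return [x / norm for x in row] if norm > 0 else list(row)

# Hypothetical unnormalised TF-IDF weights for one document
row = [3.0, 0.0, 4.0]
unit = l2_normalise(row)

print(unit)                        # [0.6, 0.0, 0.8]
print(sum(x * x for x in unit))    # squared weights sum to approximately 1
```

After normalisation, long and short documents become comparable because every document vector has the same length.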

In [None]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(norm ='l2')

# Learn the vocabulary from the corpus and create a document-term matrix
matrix = vectorizer.fit_transform(corpus_toy['title'])

# Convert the TF-IDF matrix into a DataFrame
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())

### TF-IDF vectorisation of the 'raw' news sub-corpus related to Brexit

In [None]:
# Convert our corpus of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()

# Learn the vocabulary from the corpus and create a document-term matrix
matrix = vectorizer.fit_transform(corpus_brexit['text'])

# Print the tokens as a dictionary with tokens (keys) and integer feature indices (values) using vocabulary_
print(dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])))

In [None]:
# Print the IDF scores
print(vectorizer.idf_)

In [None]:
# IDF of a few tokens in the brexit corpus
print("IDF score of the term 'the':",vectorizer.idf_[vectorizer.vocabulary_["the"]])
print("IDF score of the term 'brexit':",vectorizer.idf_[vectorizer.vocabulary_["brexit"]])
print("IDF score of the term 'deal':",vectorizer.idf_[vectorizer.vocabulary_["deal"]])
print("IDF score of the term 'protesters':", vectorizer.idf_[vectorizer.vocabulary_["protesters"]])

The word `"the"` is present in many documents and hence its IDF value is close to 1, the lowest possible value. Conversely, the term `"protesters"` is present in few documents and has a higher IDF value.

In [None]:
# TF-IDF matrix
# vectorizer.get_feature_names_out() gives you the array of feature names
tf_idf_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tf_idf_df

In [None]:
# TF-IDF of a few tokens in the brexit corpus
tf_idf_df.loc[:,['the','brexit','deal','protesters']]

The token `"the"` is down-weighted but still has high TF-IDF weights because of its high term frequency (note that the TF-IDF score is the product of term frequency and inverse document frequency). The term `"protesters"` is present in only a few documents; in the many documents where its term frequency is 0, its TF-IDF score is 0 too.

### Let's explore some parameters of the `TfidfVectorizer` class
As with other functions, you can use Shift + Tab to explore the parameters.

`stop_words` removes stop words. The only built-in list is `'english'`, which has some known issues; you can also pass your own list. If left as `None` (the default), no stop-word list is used, and `max_df` can instead be used to filter corpus-specific stop words based on the intra-corpus document frequency of terms.

`min_df` ignores terms that have a document frequency lower than the given threshold (float or int, default=1). A float is interpreted as a proportion of documents and an int as an absolute count.

`max_df` ignores terms that have a document frequency higher than the given threshold (float or int, default=1.0).

`max_features` (default=None): if not None, builds a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.
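
The effect of `min_df` and `max_df` can be sketched in plain Python. The helper `filter_by_document_frequency` and the four tiny tokenised documents are made up for illustration, and the thresholds are treated as proportions of documents.

```python
def filter_by_document_frequency(tokenised_docs, min_df=0.0, max_df=1.0):
    """Keep tokens whose share of documents lies within [min_df, max_df] (hypothetical helper)."""
    n_docs = len(tokenised_docs)
    doc_freq = {}
    for doc in tokenised_docs:
        for token in set(doc):  # count each token at most once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(tok for tok, df in doc_freq.items() if min_df <= df / n_docs <= max_df)

# Made-up tokenised documents, not taken from the news dataset
docs = [
    ["brexit", "deal", "vote"],
    ["brexit", "deal", "protesters"],
    ["brexit", "election"],
    ["brexit", "deal", "election"],
]

kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
print(kept)  # ['deal', 'election']
```

Here "brexit" is dropped because it appears in every document, while "vote" and "protesters" are dropped because they each appear in only one of the four documents.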

In [None]:
# Convert our corpus of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', 
                             min_df = 0.2, 
                             max_df = 0.9) # threshold depends on corpus and question
                             # max_features=5
    
# Learn the vocabulary from the corpus and create a document-term matrix
matrix = vectorizer.fit_transform(corpus_brexit['text'])

# Summarize & print the tokens and the matrix of TF-IDF features
tf_idf_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tf_idf_df

#### TF-IDF vectorisation using the `tokenised` News sub-corpus related to Brexit

In [None]:
# Compute TF-IDF on your tokenised news corpus related to Brexit
            
vectorizer = TfidfVectorizer(stop_words='english', 
                             min_df = 0.2, 
                             max_df = 0.9) # threshold depends on corpus and question
                             # max_features = 5 # you can specify a subset of features to consider

# Learn the vocabulary from the corpus and create a document-term matrix
matrix = vectorizer.fit_transform(corpus_brexit['tokens'])

# Create a DataFrame 
tf_idf_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tf_idf_df

As the next cells show, the word `"the"` appears in more than 90% of the documents and is removed on that basis (it is also on the English stop-word list). Likewise, the word `"protesters"` appears in fewer than 20% of the documents and is removed on that basis.

In [None]:
# Show the TF-IDF vectors for a few tokens
# Uncommenting the line below raises a KeyError because some of the tokens
# were removed from the vocabulary by the thresholds we applied
# tf_idf_df.loc[:,['the','brexit','deal','protesters']]

In [None]:
# Show only tokens that are in the tf_idf_df DataFrame
tf_idf_df.loc[:,['brexit','deal']]

#### Plot two features using a scatter plot

In [None]:
# Create figure and set figure size
sns.set_context("notebook", font_scale=1.5)
plt.figure(figsize = (15,10))

# Create scatterplot — alpha controls the transparency and s controls the size of markers
fig = sns.scatterplot(data=tf_idf_df, x='brexit', y='deal', alpha=0.4, s=600, color = 'm')
# fig.set_xlabel("Brexit")
# fig.set_ylabel("Deal")

# Add label for each point
for line in range(0,tf_idf_df.shape[0]):
    fig.text(tf_idf_df.brexit[line], tf_idf_df.deal[line], tf_idf_df.index[line], 
             horizontalalignment='center', size='small', color='black', weight='light') # possibly add fontsize=15

In the figure above, dots are documents and dot labels indicate the ID of the document in the index column of the DataFrame; if you have more interpretable labels, you could easily plot them instead of the index. You can identify from the figure, for example, the documents that focus on the word 'deal', including documents 11, 24, 12, 3, 0, and 21. While all documents are related to Brexit, the TF-IDF score for 'brexit' is low for some documents because the word was not mentioned in the text of those documents.

### Cluster the 25 documents about Brexit using scikit-learn's implementations of [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [K-means clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

#### Principal Component Analysis (PCA)
The above analysis visualises documents with respect to only two tokens (features). To visualise documents with respect to all tokens, we will apply a useful dimensionality-reduction technique called Principal Component Analysis (PCA). PCA takes multidimensional data and projects each data point onto a few components (we will use the first two) that preserve as much of the variance in the data as possible. For more information about PCA and how to implement it in Python, read [here](http://www.textbook.ds100.org/ch/25/pca_dims.html).
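
To build some intuition for what a principal component is, the sketch below finds the first component of a tiny made-up data set in plain Python. It uses power iteration on the covariance matrix, which is a simple stand-in chosen for readability and not the method used by `scikit-learn`; the helper names and the data are assumptions for illustration.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (mean, unit direction) of the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)] for i in range(d)]
    # Assumptions: the data are not constant and the start vector is not
    # orthogonal to the leading direction
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Coordinate of each centred data point along the given direction."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean))) for r in rows]

# Made-up 2-D points lying exactly on the line y = 2x
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, direction = first_principal_component(points)
scores = project(points, mean, direction)

print(direction)  # proportional to (1, 2), the direction of greatest variance
print(scores)     # one coordinate per point along that direction
```

Because these points lie on a single line, one component captures all of the variance; with real TF-IDF data the first two components capture only part of it.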


In [None]:
# Principal Component Analysis

# Initialise the PCA estimator and keep the first 2 components
pca = PCA(n_components=2)

# Fit the PCA estimator; first convert the sparse matrix to an array using toarray 
pca_components=pca.fit_transform(matrix.toarray())
pca_components

#### K-means clustering
Clustering is an approach that aims to group a set of observations into subgroups or clusters (without any prior information about cluster membership) such that observations assigned to the same cluster are more similar to each other than those in other clusters. We will employ the _k_-means clustering algorithm. 
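
The core of the algorithm alternates between two steps: assign each point to its nearest centre, then move each centre to the mean of its points. The sketch below is a minimal version under simplifying assumptions (the initial centres are supplied by hand and the data are made up); `scikit-learn`'s `KMeans` adds smarter initialisation and multiple restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda c: squared_distance(p, centres[c]))
            for p in points]

def k_means(points, initial_centres, max_iter=100):
    """Plain k-means sketch (hypothetical helper, not the scikit-learn implementation)."""
    centres = [list(c) for c in initial_centres]
    for _ in range(max_iter):
        labels = assign(points, centres)
        new_centres = []
        for c, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centre
            new_centres.append([sum(col) / len(members) for col in zip(*members)]
                               if members else old)
        if new_centres == centres:  # converged: the centres no longer move
            break
        centres = new_centres
    return assign(points, centres), centres

# Made-up 2-D points forming two well-separated groups
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centres = k_means(points, initial_centres=[points[0], points[3]])

print(labels)   # [0, 0, 0, 1, 1, 1]
print(centres)
```

The cluster numbers themselves carry no meaning; only the grouping of the documents does.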

In [None]:
# Initialise the k-means estimator with 3 clusters
kmeans = KMeans(n_clusters=3)

# Fit the k-means estimator using the two components 
kmeans.fit(pca_components)
kmeans.labels_

In [None]:
# Add the cluster variable as a column in the tf_idf_df variable
tf_idf_df['cluster'] = kmeans.labels_
tf_idf_df

In [None]:
# Assign a document to a category 
tf_idf_df['category'] = kmeans.labels_
tf_idf_df['pca_components_1'] = pca_components[:, 0]
tf_idf_df['pca_components_2'] = pca_components[:, 1]

# Set figure size
sns.set_context("notebook", font_scale=1.5)
plt.figure(figsize = (11.7,8.27))

# Scatterplot with the 1st principal component on the horizontal x axis and 2nd principal component on the vertical y axis
fig = sns.scatterplot(x = pca_components[:, 0], y = pca_components[:, 1], hue=kmeans.labels_, alpha=0.8, s=200)

# This for loop adds the document index as a label to each data point
for line in range(0,tf_idf_df.shape[0]):
    fig.text(pca_components[line,0]+0.015, pca_components[line,1], # where the labels should be positioned
             tf_idf_df.index[line], # add labels to each data point
             horizontalalignment='left', size='small', color='black', weight='light') # possibly add fontsize=10

# Add labels to the horizontal x axis and vertical y axis
labels = fig.set(xlabel='1st principal component', ylabel='2nd principal component')

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
legend = plt.legend(title='Cluster', loc='upper right')

### Cluster the TF-IDF for the entire corpus using Principal Component Analysis and K-means clustering  

In [None]:
# Compute TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', 
                             min_df = 0.1, 
                             max_df = 0.9, # threshold depends on corpus and question
                             max_features=100) 
matrix = vectorizer.fit_transform(corpus['tokens'])

# DataFrame
tf_idf_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tf_idf_df

In [None]:
# Repeat the Principal Component Analysis workflow

# Initialise the PCA estimator with 2 components
pca = PCA(n_components=2)

# Fit the PCA estimator; first convert the sparse matrix to an array using toarray 
pca_components=pca.fit_transform(matrix.toarray())
pca_components

### How do we know how many clusters to form? 
We can estimate a suitable number of clusters from the data using the elbow method. We run the k-means algorithm with various values of _k_ and plot each value of _k_ against the sum of squared distances between each data point (document) and its cluster centre. The value of _k_ at which the curve flattens out (the "elbow") is a reasonable choice.
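
The quantity being plotted can be computed by hand for a toy example. In the sketch below, the helper `inertia` and the four one-dimensional points are assumptions for illustration; the cluster labels are supplied directly, without running k-means.

```python
def inertia(points, labels):
    """Sum of squared distances between each point and the mean of its cluster."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - b) ** 2 for p in members for a, b in zip(p, centre))
    return total

# Made-up 1-D points forming two obvious groups
points = [[0.0], [1.0], [10.0], [11.0]]

inertia_k1 = inertia(points, [0, 0, 0, 0])  # one cluster
inertia_k2 = inertia(points, [0, 0, 1, 1])  # two clusters
inertia_k4 = inertia(points, [0, 1, 2, 3])  # every point is its own cluster

print(inertia_k1, inertia_k2, inertia_k4)  # 101.0 1.0 0.0
```

Going from one to two clusters removes almost all of the squared distance, while further splits gain very little: that sharp bend is the "elbow".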

In [None]:
Sum_of_squared_errors = [] # Initialise a list

K = range(1,31)
for k in K:
  kmeans = KMeans(n_clusters=k)
  kmeans.fit(pca_components)
  Sum_of_squared_errors.append(kmeans.inertia_)   

Sum_of_squared_errors

#### Plot k against the sum of squared distances
We perform multiple runs of the k-means clustering algorithm, and the plot below shows how the sum of squared distances varies with values of _k_ between 1 and 30. 

In [None]:
# Plot appearance and size
sns.set(rc={'figure.figsize':(8.2,5.8)})
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

# Generate the plot
fig = sns.lineplot(x= K, y = Sum_of_squared_errors)    

# Add x and y labels
labels = fig.set(xlabel='Number of clusters, k', ylabel='Total squared distances')

The total squared distance decreases only slowly once _k_ exceeds the range 4 to 6. We therefore run our k-means algorithm on the entire dataset with _k_ = 4.

In [None]:
# Initialise the k-means estimator with 4 clusters
kmeans = KMeans(n_clusters=4)

# Fit the k-means estimator using the two components 
kmeans.fit(pca_components)
kmeans.labels_

In [None]:
# Assign a document to a category 
tf_idf_df['category'] = kmeans.labels_
tf_idf_df['pca_components_1'] = pca_components[:, 0]
tf_idf_df['pca_components_2'] = pca_components[:, 1]

# Set figure size
sns.set_context("notebook", font_scale=1.5)
plt.figure(figsize = (11.7,8.27))

# Scatterplot with the 1st principal component on the horizontal x axis and 2nd principal component on the vertical y axis
fig = sns.scatterplot(x = pca_components[:, 0], y = pca_components[:, 1], hue=kmeans.labels_, alpha=0.8, s=200)

# This for loop adds the document index as a label to each data point
for line in range(0,tf_idf_df.shape[0]):
    fig.text(pca_components[line,0]+0.015, pca_components[line,1], # where the labels should be positioned
             tf_idf_df.index[line], # add labels to each data point
             horizontalalignment='left', size='small', color='black', weight='light') # possibly add fontsize=10

# Add labels to the horizontal x axis and vertical y axis
labels = fig.set(xlabel='1st principal component', ylabel='2nd principal component')

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
legend = plt.legend(title='Cluster', loc='upper right')

### Use the TF-IDF matrix to compute the cosine similarity

In [None]:
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Generate the cosine similarity matrix between all pairs of documents
cosine_sim = cosine_similarity(matrix, matrix)
cosine_sim
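
Cosine similarity measures the angle between two document vectors: the dot product divided by the product of their lengths. The sketch below computes it for a single pair of vectors in plain Python; the helper `cosine_similarity_pair` and the vectors are made up for illustration, and returning 0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty document as dissimilar to everything
    return dot / (norm_u * norm_v)

# Made-up TF-IDF-like vectors
same_direction = cosine_similarity_pair([1.0, 2.0], [2.0, 4.0])  # approximately 1
no_overlap = cosine_similarity_pair([1.0, 0.0], [0.0, 1.0])      # 0.0

print(same_direction, no_overlap)
```

Because TF-IDF weights are never negative, the similarity between two documents lies between 0 (no shared terms) and 1 (identical term profile).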

In [None]:
# Convert to a DataFrame
print(len(cosine_sim))  # number of documents
cosine_sim_df = pd.DataFrame(cosine_sim)

In [None]:
plt.figure(figsize = (30,30))
sns.heatmap(cosine_sim_df)

## Acknowledgements

1. [Converting Text to Features](https://learning.oreilly.com/library/view/natural-language-processing/9781484242674/html/475440_1_En_3_Chapter.xhtml#), in _Natural Language Processing Recipes_. Akshay Kulkarni & Adarsha Shivananda. 2019.
2. [Sklearn's module on feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html).
3. [Vector Semantics and Embeddings](https://web.stanford.edu/~jurafsky/slp3/6.pdf), in _Speech and Language Processing_. Daniel Jurafsky & James H. Martin. Draft of December 30, 2020.
4. [K-Means Clustering with scikit-learn](http://jonathansoma.com/lede/algorithms-2017/classes/clustering/k-means-clustering-with-scikit-learn/).
5. [Pandas for Everyone](https://www.pearson.com/us/higher-education/program/Chen-Pandas-for-Everyone-Python-Data-Analysis/PGM335102.html). Daniel Chen. 2018.