# Vectorization, TF-IDF, and Document Classification

A central difficulty in analyzing text data is that most machine learning models expect numeric input, and raw text is not numeric. Luckily, we can convert text data to numeric representations through vectorization.

In [None]:
import pandas as pd
import numpy as np
import re

from utils import clean_text

# Data
from sklearn.datasets import fetch_20newsgroups

# Vectorization methods
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Classification model
from sklearn.linear_model import LogisticRegression

### Compile train/test DataFrames using scikit-learn's [`fetch_20newsgroups`](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

In [None]:
n_docs = 100
categories = ['alt.atheism', 'sci.med', 'comp.graphics', 'sci.space']
# categories = ['misc.forsale', 'sci.electronics', 'comp.sys.ibm.pc.hardware', 'rec.autos']
    
# Gather data from sklearn's fetch_20newsgroups
news_train = fetch_20newsgroups(subset="train",
                                remove=('headers', 'footers', 'quotes'),
                                categories=categories)
news_test = fetch_20newsgroups(subset="test",
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)

# get documents and classification labels
train_docs = news_train.data[:n_docs]
train_labels = news_train.target[:n_docs]
test_docs = news_test.data[:n_docs]
test_labels = news_test.target[:n_docs]

# Convert to pandas DataFrame
train_df = pd.DataFrame({"body": train_docs, "category": train_labels})
test_df = pd.DataFrame({"body": test_docs, "category": test_labels})

# View the shapes of our datasets
print(f"Train Shape: {train_df.shape}")
print(f"Test Shape: {test_df.shape}")

## CountVectorizer

`CountVectorizer` is a simple tool that turns raw text into feature vectors. We vectorize the text in 2 steps: 
1. First, we `fit` the vectorizer on the training data to compute the vocabulary (feature set). 
2. Then, we `transform` both the train and test text to count the number of occurrences of each word in our vocabulary.

The output of the CountVectorizer's `transform` step is a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), which stores only the non-zero values instead of holding a large number of zeros in memory.
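
To make the fit/transform split concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are made up for illustration, stop-word removal is skipped, and the rows are plain dense lists rather than a sparse matrix. The regular expression mirrors `CountVectorizer`'s default token pattern, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(docs):
    # "fit": learn the vocabulary from the training documents only,
    # mapping each unique term to a column index (alphabetical order)
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    # "transform": count occurrences of each vocabulary term per document;
    # terms that were never seen during fit are silently ignored
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            row[vocabulary[term]] = n
        rows.append(row)
    return rows


# Hypothetical toy corpus
toy_train = ["the rocket reached orbit", "the doctor studied the orbit of a rocket"]
toy_test = ["a new rocket, a new doctor, a new planet"]

vocab = fit_vocabulary(toy_train)
train_rows = transform(toy_train, vocab)
test_rows = transform(toy_test, vocab)

print(vocab)
print(train_rows)
print(test_rows)
```

Note that "new" and "planet" in the test sentence contribute nothing, because the vocabulary is fixed at fit time. This is why the vectorizer is fit on the training data only.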

In [None]:
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(train_df['body'])
train_vecs = vectorizer.transform(train_df['body'])
test_vecs = vectorizer.transform(test_df['body'])

#### What is the size of our vocabulary?

In [None]:
print(f"Number of documents: {train_vecs.shape[0]}")
print(f"Size of vocabulary: {train_vecs.shape[1]}")

#### How much of our feature set is just zeros?

As mentioned above, our vectorizer's `transform` function returns a sparse matrix. The `nnz` attribute of a sparse matrix returns the number of stored (non-zero) values.

In [None]:
# Train
print(f"Number of TRAINING non-zero features: {train_vecs.nnz}")
print(f"Number of TRAINING zero features: {(train_vecs.shape[0]*train_vecs.shape[1])-train_vecs.nnz}")

# Test
print(f"Number of TEST non-zero features: {test_vecs.nnz}")
print(f"Number of TEST zero features: {(test_vecs.shape[0]*test_vecs.shape[1])-test_vecs.nnz}")
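
The zero and non-zero counts can also be summarized as a single sparsity ratio. The sketch below is an illustrative helper, not part of the notebook's `utils` module, and the numbers passed to it are hypothetical rather than results from this dataset.

```python
def sparsity(n_rows, n_cols, nnz):
    """Return the fraction of cells in an n_rows x n_cols matrix that are zero."""
    total = n_rows * n_cols
    if total == 0:
        raise ValueError("matrix has no cells")
    return 1 - nnz / total


# Hypothetical numbers: 100 documents, 5,000 vocabulary terms, 6,000 stored counts
example = sparsity(100, 5000, 6000)
print(f"{example:.1%} of the cells are zero")
```

With the objects defined in this notebook, the call would look like `sparsity(*train_vecs.shape, train_vecs.nnz)`.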

### Display a few terms and their counts for a few documents

This is only meant to be used for demonstration purposes. The cell below has no impact on the actual execution of our task. Also, this cell is only intended for use when the number of documents is small (<100), otherwise it will likely only display a bunch of zeros.

In [None]:
# get_feature_names_out() requires scikit-learn >= 1.0
df_counts = pd.DataFrame(train_vecs.toarray(), 
                         columns=vectorizer.get_feature_names_out())[:15].T
df_counts.tail(10)

## Term Frequency-Inverse Document Frequency (TF-IDF)

Tf-idf is a statistical representation of how relevant a word is to a particular document within a corpus. _Relevance_, in this scenario, can be defined as how much information a word provides about the context of one document vs all other documents in the corpus. 

In short, tf-idf is calculated by comparing the number of times that a particular term occurs in a given document with the number of documents in the corpus that contain that term. A word that occurs frequently in one document, but appears in only a small number of other documents, will have a high tf-idf score.

The calculation for tf-idf is the product of two smaller calculations:

$$TF_{i,j} = \frac{\text{Number of times word}_{i}\text{ occurs in document}_{j}}{\text{Total number of words in document}_{j}}$$

$$IDF_{i} = \log_{10}\left(\frac{\text{Total number of documents in corpus}}{\text{Number of documents that contain word}_{i}}\right)$$

##### Example: 

Let's say we have 10,000 documents about the solar system. If we were to take one single document with 200 terms and see that _Europa_ (one of Jupiter's moons) was mentioned 5 times, then _Europa's_ term frequency (tf) for that document would be: 

$$TF_{Europa, document} = \frac{5}{200}=0.025$$


Now if we were to see that _Europa_ only occurs in 50 of the total 10,000 documents, then the inverse document frequency (idf) would be: 

$$IDF_{Europa} = \log_{10}\left(\frac{10{,}000}{50}\right) \approx 2.3$$

Therefore our tf-idf score for _Europa_ for that given document would be:

$$ 0.025 \times 2.3 = 0.0575 $$
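
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and a base-10 logarithm is assumed because that is the base for which the logarithm of 200 is about 2.3.

```python
import math


def term_frequency(term_count, doc_length):
    # Share of the document's terms taken up by this term
    return term_count / doc_length


def inverse_document_frequency(n_docs, n_docs_with_term):
    # Base-10 log of (corpus size / number of documents containing the term)
    return math.log10(n_docs / n_docs_with_term)


tf = term_frequency(5, 200)                   # Europa appears 5 times in 200 terms
idf = inverse_document_frequency(10_000, 50)  # Europa appears in 50 of 10,000 documents
tfidf = tf * idf

print(f"tf={tf:.3f}, idf={idf:.3f}, tf-idf={tfidf:.4f}")
# prints: tf=0.025, idf=2.301, tf-idf=0.0575
```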

### TF-IDF Vectorization

As you can imagine, this tf-idf score seems to be a bit more informative than a simple count of occurrences. Below, we'll vectorize our data using this calculation and then compare baseline classification results.

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_vectorizer.fit(train_df['body'])
train_tfidf_vecs = tfidf_vectorizer.transform(train_df['body'])
test_tfidf_vecs = tfidf_vectorizer.transform(test_df['body'])
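
The scores produced by `TfidfVectorizer` will not match a hand calculation with the textbook formulas from the previous section. With its documented defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn uses the raw count as the term frequency, computes idf as `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit length. The sketch below re-implements that default weighting in plain Python on a hypothetical count matrix. It illustrates the formula only and is not scikit-learn's own code.

```python
import math


def sklearn_style_tfidf(count_rows):
    """Apply smoothed idf and L2 normalization to a dense list-of-lists count matrix."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])

    # Document frequency: number of documents in which each term appears
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]

    # Smoothed idf with natural log; the +1 keeps terms found in every document
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # L2-normalize so each document vector has length 1 (all-zero rows stay zero)
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return idf, weighted_rows


# Hypothetical counts: 3 documents (rows) by 3 terms (columns)
toy_counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
idf, rows = sklearn_style_tfidf(toy_counts)
print(idf)
print(rows)
```

The first term appears in every document, so its idf is exactly 1 rather than 0. Under the textbook formula it would be weighted out entirely.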

### Display a few terms and their tf-idf scores for a few documents

This is only meant to be used for demonstration purposes. The cell below has no impact on the actual execution of our task. Also, this cell is only intended for use when the number of documents is small (<100), otherwise it will likely only display a bunch of zeros.

In [None]:
df_tfidf = pd.DataFrame(train_tfidf_vecs.toarray(), 
                        columns=tfidf_vectorizer.get_feature_names_out())[:15].T
df_tfidf.tail(20)

#### Comparison of the representation of the word "space" between the two vectorizers

In [None]:
pd.DataFrame({"TF-IDF: Space":df_tfidf.loc['space'], "CountVectorizer: Space":df_counts.loc['space']})

# Document Classification

Vectorizing our data has converted our text data into a numeric feature set. Using these vectors, we can now begin to develop machine learning models for things like classification.

Below, we'll use Logistic Regression, but now that our data is numerically structured, you can apply any appropriate model.

To improve this model, look into better preprocessing, regularization strength, vocabulary pruning for feature selection, and hyperparameter tuning.

### Run a logistic regression classification on the count vectors

In [None]:
count_logReg = LogisticRegression(multi_class="auto", solver='liblinear')
count_logReg.fit(train_vecs, train_df['category'])
count_preds = count_logReg.predict(test_vecs)

# Calculate the percentage of accurate predictions
accuracy = np.mean(count_preds==test_df['category'])
print(f"LogReg CountVectorizer accuracy: {accuracy}")

### Run a logistic regression classification on the TF-IDF vectors

In [None]:
tfidf_logReg = LogisticRegression(multi_class="auto", solver='liblinear')
tfidf_logReg.fit(train_tfidf_vecs, train_df['category'])
tfidf_preds = tfidf_logReg.predict(test_tfidf_vecs)

# Calculate the percentage of accurate predictions
accuracy = np.mean(tfidf_preds==test_df['category'])
print(f"LogReg TF-IDF accuracy: {accuracy}")

### View the terms with the highest coefficient values for each category

Notice that the terms highly weighted for each category seem to have highly negative weights for other categories. If we were to use more similarly related categories, we may not see such drastic differences.

Ignore the code behind this table. It is poorly written, but demonstrates the correct results.

In [None]:
from utils import getTopCoefs

getTopCoefs(num_terms=5, model=tfidf_logReg, class_labels=news_train.target_names, feature_names=list(tfidf_vectorizer.get_feature_names_out()))
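
Since the helper's source is not shown here, the following sketch illustrates one way such a ranking could be computed. It is not the `getTopCoefs` implementation from `utils`. The function name, the labels, and the weights are all hypothetical. It assumes one row of coefficients per class, which is the layout of a fitted multiclass `LogisticRegression`'s `coef_` attribute.

```python
def top_terms_per_class(coef_rows, class_labels, feature_names, num_terms=5):
    """Return the highest-weighted (term, weight) pairs for each class."""
    top = {}
    for label, weights in zip(class_labels, coef_rows):
        # Column indices ordered from the largest coefficient to the smallest
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        top[label] = [(feature_names[j], weights[j]) for j in ranked[:num_terms]]
    return top


# Hypothetical labels, vocabulary, and coefficient rows (one row per class)
toy_labels = ["sci.med", "sci.space"]
toy_features = ["doctor", "orbit", "patient", "rocket"]
toy_coefs = [
    [1.2, -0.8, 0.9, -1.1],
    [-1.0, 1.5, -0.7, 1.3],
]

top = top_terms_per_class(toy_coefs, toy_labels, toy_features, num_terms=2)
for label, terms in top.items():
    print(label, terms)
```

In the toy weights, the terms that rank highest for one class carry negative weights for the other, which is the pattern described above.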

### View coefficient weights for CountVectorizer features

In [None]:
getTopCoefs(num_terms=5, model=count_logReg, class_labels=news_train.target_names, feature_names=list(vectorizer.get_feature_names_out()))

### Investigate incorrect predictions

Particularly with text analytics, it is often useful to investigate the records that your model predicted incorrectly. This can help you identify places where a little more preprocessing may increase performance.

In [None]:
# Expand the max width of how our dataFrames display on screen
pd.options.display.max_colwidth = 1000

# Compile a dataframe with our text, the actual label, and the predicted label
final_df = pd.DataFrame({"body": test_df['body'], "Actual": test_df['category'], "Prediction": tfidf_preds})

# Display the rows of our dataframe where the actual label and predicted label don't match
final_df.loc[(final_df['Actual'] != final_df['Prediction'])]
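
Reading individual misclassified rows is easier once you know which pairs of categories get confused most often. The sketch below tallies (actual, predicted) pairs for the mismatches. The function name and the toy labels are hypothetical and are not results from the models above.

```python
from collections import Counter


def count_confusions(actual, predicted, class_names):
    """Count (actual, predicted) name pairs for the rows where the two disagree."""
    pairs = Counter(
        (class_names[a], class_names[p])
        for a, p in zip(actual, predicted)
        if a != p
    )
    # Most frequent confusion first
    return pairs.most_common()


# Hypothetical integer labels that index into toy_names
toy_names = ["alt.atheism", "comp.graphics", "sci.med", "sci.space"]
toy_actual = [0, 1, 2, 3, 3, 2, 1]
toy_predicted = [0, 3, 2, 1, 3, 3, 3]

confusions = count_confusions(toy_actual, toy_predicted, toy_names)
for (actual_name, predicted_name), n in confusions:
    print(f"{actual_name} predicted as {predicted_name}: {n}")
```

With this notebook's variables, the call would be `count_confusions(test_df['category'], tfidf_preds, news_train.target_names)`.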