# Create an Sklearn TFIDF Vectorizer from Scratch

![tfidf equation](static/tfidf-equation.jpeg)

**Import a text dataset**

Below we import a dataset of scraped policies proposed by the 2020 Democratic Presidential Candidates Bernie Sanders and Elizabeth Warren. 

In this notebook, we will build our own version of sklearn's `TfidfVectorizer` from scratch and fit a Logistic Regression model to predict which candidate proposed a policy, using the policy text as the predictor.

In [5]:
import pandas as pd

df = pd.read_csv('data/2020_policies_feb_24.csv')
df.head()

| Unnamed: 0 | name | policy | candidate |
|---|---|---|---|
| 0 | 100% Clean Energy for America | As published on Medium on September 3rd, 2019:... | warren |
| 1 | A Comprehensive Agenda to Boost America’s Smal... | Small businesses are the heart of our economy.... | warren |
| 2 | A Fair and Welcoming Immigration System | As published on Medium on July 11th, 2019:\nIm... | warren |
| 3 | A Fair Workweek for America’s Part-Time Workers | Working families all across the country are ge... | warren |
| 4 | A Great Public School Education for Every Student | I attended public school growing up in Oklahom... | warren |


## Create a count vectorizer

The first step in calculating tfidf is to calculate `tf`, which stands for term frequency.

To do this, we will first create a `CountVectorizer` that takes a list of documents and produces the following matrix:
- Each row of the matrix represents an individual document
- Each column of the matrix represents an individual word
- Each value of the matrix is the number of times a word occurs in a given document
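
To make the layout of this matrix concrete, here is a minimal sketch on a made-up three-document corpus. The `docs` list is hypothetical and the code uses plain Python lists rather than the sparse matrix built in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
docs = ["tax the rich", "tax cuts", "the rich get the cuts"]

# Vocabulary in order of first appearance: one column per unique word
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)

# One row per document, one count per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most entries in a real document-term matrix are zero, which is why the class in this notebook stores the counts in a sparse format instead of nested lists.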

In [152]:
import numpy as np
import scipy.sparse as sp

class CountVectorizer:
    
    def fit_transform(self, docs):
        
        # Create a bag of words
        tokens = docs.str.cat(sep=' ').split()
        
        # Assign a unique index val for each token
        self.feature_idx = {}
        idx = 0
        for token in tokens:
            if token not in self.feature_idx:
                self.feature_idx[token] = idx
                idx += 1
        
        # Create a list that will hold the index for each word, whenever it is counted
        features = []
        # Create a list that will hold the counts for each word
        values = []
        # Create a index pointer that will be used to create a sparse matrix
        # See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
        indptr = [0]
        # Loop over each document
        for doc in docs:
            # Create a document frequency counter
            feature_counter = {}
            # Loop over each word
            for token in doc.split():
                # Collect the unique token index
                token_idx = self.feature_idx[token]
                # Update the frequency counter
                # using the index for the word rather than
                # the text data (This makes creating a sparse matrix easier)
                if token_idx not in feature_counter:
                    feature_counter[token_idx] = 1
                else:
                    feature_counter[token_idx] += 1
            
            # Extend the features list with the document's counted word indices
            features.extend(feature_counter.keys())
            # Extend the values list with the document's count frequencies
            values.extend(feature_counter.values())
            # Update the index pointer so that indptr[i]:indptr[i+1] marks
            # the slice of `features` and `values` that belongs to document i
            indptr.append(len(features))
        
        # Convert each list to numpy array
        features = np.asarray(features)
        indptr = np.asarray(indptr)
        values = np.asarray(values)
        
        # Create a sparse matrix where 
        # each row represents a document
        # each column represents a word
        # each value represents the frequency of a word in a document
        X = sp.csr_matrix((values, features, indptr),
                          shape=(len(indptr) - 1, len(self.feature_idx.keys())))
        
        # Sort the index so it aligns with the feature_index values
        X.sort_indices()
        
        # Return the unique vocab and the sparse matrix
        return list(self.feature_idx.keys()), X
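
The three arrays passed to `sp.csr_matrix` can be hard to picture, so here is a small sketch of how they fit together. The numbers are made up: two documents over a four-word vocabulary.

```python
import numpy as np
import scipy.sparse as sp

# Two hypothetical documents over a 4-word vocabulary:
#   doc 0 -> word 0 once, word 2 twice
#   doc 1 -> word 1 three times
values = np.array([1, 2, 3])    # the counts
features = np.array([0, 2, 1])  # the column (word) index for each count
indptr = np.array([0, 2, 3])    # row i owns entries indptr[i]:indptr[i+1]

X = sp.csr_matrix((values, features, indptr), shape=(2, 4))
dense = X.toarray()
print(dense)
```

Row 0 reads entries `0:2` of `values` and `features`, and row 1 reads entry `2:3`, which is the same bookkeeping the `indptr.append(len(features))` line performs for each document.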
    

## Create a `TfidfTransformer`

This object will receive the output of a `CountVectorizer` and will produce the `tfidf` matrix.

In [152]:
class TfidfTransformer:
    
    def fit(self, X):
        
        # Num rows and Num Columns
        n_samples, n_features = X.shape
        
        # Document Frequency
        # Count the number of documents that contain each word
        # (Here bincount is counting the number of rows in which each column != 0)
        df = np.bincount(X.indices, minlength=X.shape[1])
        
        # Inverse Document Frequency
        # (Number of documents/ Document Frequency)
        idf = np.log(n_samples / df) + 1
        
        # Place the idf values on the diagonal of a square matrix.
        # This allows us to multiply the Inverse Document Frequency
        # with the term frequency of every word, for every document
        self._idf_diag = sp.diags(idf, offsets=0,
                          shape=(n_features, n_features),
                          format='csr',
                          dtype=np.float64)

    def transform(self, X):
        n_samples, n_features = X.shape
        X = X * self._idf_diag
        return X
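
As a rough illustration of what `fit` and `transform` compute, the sketch below applies the same steps to a small made-up count matrix. The `counts` array is hypothetical and chosen so the numbers are easy to follow.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical count matrix: 3 documents, 3 words
counts = sp.csr_matrix(np.array([[1, 1, 0],
                                 [2, 0, 0],
                                 [1, 0, 3]]))

n_samples = counts.shape[0]

# Document frequency: how many documents contain each word.
# A CSR matrix stores one column index per nonzero entry, so counting
# the column indices counts the documents that contain each word.
doc_freq = np.bincount(counts.indices, minlength=counts.shape[1])

# idf = ln(N / df) + 1, the same formula used in `fit` above
idf = np.log(n_samples / doc_freq) + 1

# Multiplying by a diagonal matrix scales column j by idf[j]
tfidf = (counts @ sp.diags(idf)).toarray()
print(doc_freq)
print(tfidf)
```

The first word appears in all three documents, so its idf is `ln(3/3) + 1 = 1` and its counts pass through unchanged. The other two words each appear in one document and are scaled up by `ln(3) + 1`.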

## Create a `TfidfVectorizer`

This object will:
- Inherit the `fit_transform` method from `CountVectorizer`
- Receive a list of documents 
- Convert the documents to a count matrix using `CountVectorizer.fit_transform`
- Fit a `TfidfTransformer` object
- Transform the data using `TfidfTransformer.transform`. 

In [None]:
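class TfidfVectorizer(CountVectorizer):

    def fit(self, docs):
        # Build the vocabulary and count matrix, then learn the idf weights
        vocab, X = super().fit_transform(docs)
        self.transformer = TfidfTransformer()
        self.transformer.fit(X)

    def transform(self, X):
        # If raw documents are passed, convert them to a count matrix first.
        # Note: this rebuilds the vocabulary, so the result only lines up
        # with the fitted idf weights for the documents passed to `fit`.
        if not sp.issparse(X):
            vocab, X = super().fit_transform(X)
        return self.transformer.transform(X)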
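
Because `transform` calls `fit_transform` again, it rebuilds the vocabulary from whatever documents it receives, so it is only reliable on the documents used in `fit`. One possible way to count unseen documents against an existing vocabulary is sketched below. `count_with_vocab` is a hypothetical helper, not part of the classes in this notebook, and skipping unknown words is an assumption about the desired behavior.

```python
import numpy as np
import scipy.sparse as sp

def count_with_vocab(docs, feature_idx):
    """Hypothetical helper: count tokens using an already-fitted vocabulary."""
    features, values, indptr = [], [], [0]
    for doc in docs:
        counter = {}
        for token in doc.split():
            # Skip words that were not seen when the vocabulary was built
            if token in feature_idx:
                idx = feature_idx[token]
                counter[idx] = counter.get(idx, 0) + 1
        features.extend(counter.keys())
        values.extend(counter.values())
        indptr.append(len(features))
    X = sp.csr_matrix(
        (np.asarray(values), np.asarray(features), np.asarray(indptr)),
        shape=(len(docs), len(feature_idx)),
    )
    X.sort_indices()
    return X

# A made-up vocabulary in the same format as `self.feature_idx`
feature_idx = {"tax": 0, "the": 1, "rich": 2}
new_counts = count_with_vocab(["tax the tax", "unseen words only"], feature_idx).toarray()
print(new_counts)
```

The resulting matrix always has one column per fitted vocabulary word, so it can be multiplied by the fitted idf diagonal without a shape mismatch.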

## Test our work

In the cell below, we initialize our `TfidfVectorizer` and fit it to the text data.

In [165]:
tfidf = TfidfVectorizer()
tfidf.fit(df.policy)
X_tfidf = tfidf.transform(df.policy)
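
One way to sanity-check the weighting is to compare it with sklearn's own `TfidfTransformer`. By default sklearn smooths the idf and L2-normalizes each row, so the sketch below turns both off (`smooth_idf=False`, `norm=None`) to match the `ln(N / df) + 1` formula used in this notebook. The count matrix is made up for illustration, and the sklearn class is imported under the alias `SkTfidfTransformer` to avoid clashing with the class defined above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer as SkTfidfTransformer

# Hypothetical count matrix: 4 documents, 3 words
counts = sp.csr_matrix(np.array([[1, 1, 0],
                                 [2, 0, 0],
                                 [0, 4, 1],
                                 [1, 0, 3]]))

# The formula used in this notebook: tf * (ln(N / df) + 1)
doc_freq = np.bincount(counts.indices, minlength=counts.shape[1])
idf = np.log(counts.shape[0] / doc_freq) + 1
ours = counts.toarray() * idf

# sklearn with idf smoothing and row normalization switched off
sk = SkTfidfTransformer(smooth_idf=False, norm=None)
theirs = sk.fit_transform(counts).toarray()

matches = np.allclose(ours, theirs)
print(matches)
```

With sklearn's default settings the two matrices would not match, since the smoothing and normalization change every value.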

**Now let's fit a classifier to the transformed data**

In [170]:
# Import modeling tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2

# Create a train test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df.candidate)

# Initialize and fit a model
model = LogisticRegression()
model.fit(X_train, y_train)

# Print train and test scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print('Train score:', '{:.0%}'.format(train_score))
print('Test score:', '{:.0%}'.format(test_score))

Train score: 100%
Test score: 89%
