<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Natural Language Processing: Vectorization
              
</p>
</div>

Data Science Cohort Live NYC Nov 2022
<p>Phase 4: Topic 37</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import string
import re

# Notice that these vectorizers are from `sklearn` and not `nltk`!
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Load in original satire data and normalized corpus

In [None]:
satire_df = pd.read_csv(
    'data/satire_nosatire.csv')
satire_df.head()

In [None]:
corpus = pd.read_csv(
    'data/satire_norm.csv').drop(
    columns = ['Unnamed: 0'])
corpus

#### Feature Extraction for NLP

- learn vector representation of tokenized data
- representing text in form for ML model:
    - encoding semantic information in numeric form
- A simple (yet surprisingly effective) method for many tasks: **Bag-of-words (BoW)**.

"Bag" of words: **information about the order of words in the document is discarded**.

- Intuition behind BoW: documents are similar if they have similar token frequency distributions.



<img src = "images/bag_of_words.png" >

Represented as **document-term matrix**:
- columns are tokens
- rows are documents
- values are token counts for given document.

$\downarrow$Doc\|Word$\rightarrow$|I|love|dogs|cats|all|animals|hate
-|-|-|-|-|-|-|-
Document_1|1|1|1|0|0|0|0
Document_2|1|1|0|1|0|0|0
Document_3|1|1|0|0|1|1|0
Document_4|1|0|1|0|0|0|1
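
As a rough illustration, the table above can be rebuilt by hand with plain Python. The four sentences below are assumptions inferred from the table's counts (the original document texts are not shown), and the vocabulary order is copied from the table header.

```python
from collections import Counter

# Hypothetical documents, reconstructed from the counts in the table above
docs = {
    "Document_1": "I love dogs",
    "Document_2": "I love cats",
    "Document_3": "I love all animals",
    "Document_4": "I hate dogs",
}

# Column order taken from the table header
vocab = ["I", "love", "dogs", "cats", "all", "animals", "hate"]

# One row of token counts per document
doc_term = {}
for name, text in docs.items():
    counts = Counter(text.split())
    doc_term[name] = [counts[token] for token in vocab]

for name, row in doc_term.items():
    print(name, row)

# Word order is discarded: a shuffled sentence gives the same row
shuffled_counts = Counter("dogs love I".split())
shuffled_row = [shuffled_counts[token] for token in vocab]
print(shuffled_row == doc_term["Document_1"])
```

The last line shows the "bag" property: reordering the words of Document_1 leaves its vector unchanged.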

#### Vectorization with sklearn

Sklearn has a few classes for constructing document-term matrices:
- CountVectorizer
- TfidfVectorizer
- HashingVectorizer

#### `CountVectorizer`: simplest of the vectorizers
- Term counts for each document in corpus
- has options for cutting too common/too uncommon words

- CountVectorizer(min_df, max_df)

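    - min_df: lower cutoff for the document frequency of a term (a float is a proportion of documents; an int is an absolute document count)

    - max_df: upper cutoff for document frequency, same conventions (removes corpus-specific stop words)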

**Important hyperparameters to tune when in pipeline**

In [None]:
corpus.body

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Convert our preprocessed strings (normalized token sequence) to a matrix of token counts

vec = CountVectorizer(min_df = 0.06, max_df = 0.95)
X = vec.fit_transform(corpus['body'])

# .get_feature_names_out() is a method that returns the vocabulary terms in column order
countvec_df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
countvec_df.head()

Note that the output before converting to an array is a **sparse matrix**

In [None]:
X

There are many zeros:
- Typical of a document-term matrix
- Compressed Sparse Row (CSR) representation stores only the nonzero entries, which saves memory and speeds up computation.
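
To make the sparsity concrete, here is a small sketch that inspects the matrix returned by `fit_transform`. The corpus `toy_docs` is a made-up example; a real document-term matrix is typically far sparser than this one.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the satire data
toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew away",
]

X_toy = CountVectorizer().fit_transform(toy_docs)

n_docs, n_terms = X_toy.shape
stored = X_toy.nnz                    # entries actually stored by the sparse format
density = stored / (n_docs * n_terms)

print(issparse(X_toy), X_toy.format)  # sparse, CSR format
print(f"{stored} stored values out of {n_docs * n_terms} cells (density {density:.2f})")
```

Only the nonzero counts are stored; calling `.toarray()` materializes every cell, zeros included, which is why it is best reserved for small matrices.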

In [None]:
countvec_df.shape

Substantially smaller feature set now:
- Some algorithms can handle count data with this many features for modeling purposes.

Important thing to think about when engineering cutoffs: class imbalance

- if the imbalance is extreme, the cutoffs may have removed relevant predictors for the minority document class.

In [None]:
satire_df['target'].value_counts()

Not a problem here.

Vectorization complete.

In [None]:
corpus.body

In [None]:
countvec_df

#### The TfidfVectorizer (Term Frequency Inverse Document Frequency)

An approach to weight tokens based on how rare/common in corpus:
- want to downweight words that are too common throughout corpus.

- TF (Term Frequency):
    - Count of the word in the document
    - divided by the total number of words in the document.

- IDF (inverse document frequency)
    - how much information a word possesses for document differentiation

$$idf(w) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing } w}\right)$$

**word present in every document likely not useful for document differentiation**

**Putting together**: TF-IDF

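$$ w_{ij} = tf_{ij} \log\left(\frac{N}{df_{i}}\right) $$

where $tf_{ij}$ is the term frequency of term $i$ in document $j$, $N$ is the number of documents, and $df_i$ is the number of documents containing term $i$.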
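
The formula can be checked by hand on a tiny example. This sketch implements the textbook definition given above (length-normalized tf, unsmoothed log idf); the tokenized documents in `toy_docs` are invented. Note that sklearn's `TfidfVectorizer` uses a slightly different variant by default (raw counts for tf, a smoothed idf, and L2 row normalization), so its numbers will not match these.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [
    ["nerds", "love", "power"],
    ["power", "also", "matters"],
    ["also", "power", "power"],
]
N = len(toy_docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(N / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(doc) for doc in toy_docs]
for w in weights:
    print(w)
```

Because "power" appears in all three documents, its idf is $\log(3/3) = 0$ and its weight is zero everywhere, even in the document where it occurs twice. "nerds" appears in one document only and receives the largest weight there.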

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tf_vec = TfidfVectorizer()
X_tfidf = tf_vec.fit_transform(corpus['body'])

vec_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tf_vec.get_feature_names_out())
vec_tfidf.head()

In [None]:
X_tfidf

Much larger matrix: with default settings (no `min_df`/`max_df`), no features are thrown away

It just downweights features that are common across the corpus; rare terms receive a higher idf

In [None]:
vec_tfidf.iloc[313].sort_values(ascending=False)[:10]

Let's compare the tfidf to the count vectorizer output for one document.

In [None]:
countvec_df.iloc[313].sort_values(ascending=False)[:10]

In [None]:
vec_tfidf.iloc[313].sort_values(ascending=False)[:10]

The tfidf downweighted common words:
- "also", which might have made it into the stopword list.
- Assigns "nerds" more weight than "power" (factoring in count and idf)