# Advanced Data Science

### Data Preprocessing - Transformation (Text Vectorization)

In [6]:
#Optional: suppress warnings (for example, a NumPy "version out of range" warning)
import warnings
warnings.filterwarnings("ignore", category=Warning)

#Pull in the libraries we need 
import pandas as pd
from pandas import DataFrame

#New libraries for access to transformation tasks
from sklearn.feature_extraction.text import TfidfVectorizer

## Scikit Learn `TfidfVectorizer`
---
We access the Scikit Learn Toolkit at: https://scikit-learn.org

Our experiments today will be using the `TfidfVectorizer` class to convert a collection of raw documents to a matrix of TF-IDF features. <br>

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure of how important a word is to a document within a collection (corpus). At a high level, the approach works as follows:
- Term frequency (TF) measures how frequently a term (word) appears in a document
- Inverse document frequency (IDF) measures how distinctive a term is by checking how commonly it appears across all documents in the corpus; the rarer the term, the higher its IDF
- TF and IDF are multiplied to give the weight of a term in a specific document relative to its occurrence across the corpus
- A higher TF-IDF value indicates that a term is more important to that document
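
The steps above can be sketched in plain Python. This is a minimal illustration using the textbook formulas (TF as a relative count, IDF as `log(N / df)`), not scikit-learn's exact variant; the toy sentences and the helper names `tf`, `idf`, and `tfidf` are made up for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq[term])

def tfidf(term, tokens):
    """Weight of a term in one document: TF multiplied by IDF."""
    return tf(term, tokens) * idf(term)

# Score every distinct term of the first document
first = tokenized[0]
scores = {term: tfidf(term, first) for term in set(first)}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:>4}: {score:.4f}")
```

In this sketch "the" scores 0 because it occurs in every document, while "mat", which occurs in only one document, scores highest.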

In [2]:
#First step is to set up a variable holding some sample text; each string
#represents one document in the corpus (collection of documents)
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

## Text Vectorization using TfidfVectorizer

In [3]:
#Initialize the TF-IDF vectorizer object, then fit it to the text and transform the text
#With the default values, the features are word N-grams of size 1 (unigrams),
#so each feature is a single word (with analyzer='char' it would be a character instead)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

#Get feature names (words or terms)
feature_names = vectorizer.get_feature_names_out()

#Convert the resulting sparse TF-IDF matrix to a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

In [4]:
#Interpreting the output: the table holds one TF-IDF score per (document, term) pair
#Each row is a document in the corpus and each column is a term (word) in the vocabulary
#For example, the value 0.687624 in the second row (index 1) under the "document" column
#is the TF-IDF score of the term "document" in the second document.
#
#Min: 0 means the term does not appear in that document. With scikit-learn's defaults
#(smooth_idf=True), a term that appears in every document (e.g. "is", "the", "this")
#still receives a small non-zero score; only the textbook idf = log(N/df) maps it to 0.
#
#Max: with the default norm='l2', each row is scaled to unit length, so scores lie in [0, 1].
#Within a row, the highest scores go to terms that appear frequently in that document
#but rarely in the rest of the corpus, i.e. terms that are unique or rare within the corpus.
tfidf_df

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
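
As a cross-check, the scores in this table can be recomputed by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the IDF is `ln((1 + N) / (1 + df)) + 1` and each row is then scaled to unit length. The variable names (`counts`, `manual`, `reference`) are illustrative, and the corpus is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts: one row per document, one column per term
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents in which each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF as used by the default settings
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply counts by IDF, then scale each row to unit L2 norm
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
print(round(manual[1, 1], 6))  # score of "document" in the second document
```

The `+ 1` in the IDF explains why "is", "the", and "this" keep non-zero scores even though they occur in all four documents, and the row normalization explains why no score exceeds 1.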
