# 0. TF-IDF Tutorial
- TF-IDF representation of a document $d$ in a corpus $D$:

$r_d = [\text{tf-idf}(w_1, d, D), \text{tf-idf}(w_2, d, D), \dots, \text{tf-idf}(w_{|V|}, d, D)]$

where $r_d \in \mathbb{R}^{|V|}$ is a $|V|$-dimensional vector and $V = \{w_i\}$ is the vocabulary of $D$ (all words that appear in $D$).

- Each component is defined as:

$\text{tf-idf}(w_i, d, D) = \text{tf}(w_i, d) \cdot \text{idf}(w_i, D)$

where $f(w, d)$ is the number of times the word $w$ occurs in $d$, and

$\text{tf}(w_i, d) = \dfrac{f(w_i, d)}{\max\{f(w_j, d) : w_j \in V\}}$

$\text{idf}(w_i, D) = \log_{10} \dfrac{|D|}{|\{d' \in D : w_i \in d'\}|}$
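
As a quick sanity check of the two formulas, here is a minimal sketch on a toy corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration and are not part of the notebook code below.

```python
import math

# Hypothetical toy corpus: each document is already a list of tokens.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]

def tf(word, doc):
    """Count of `word` in `doc`, divided by the count of the most frequent word."""
    counts = {w: doc.count(w) for w in set(doc)}
    return counts.get(word, 0) / max(counts.values())

def idf(word, corpus):
    """log10(|D| / number of documents containing `word`).

    Raises ZeroDivisionError if `word` appears in no document.
    """
    doc_freq = sum(1 for doc in corpus if word in doc)
    return math.log10(len(corpus) / doc_freq)

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

apple_score = tf_idf("apple", docs[0], docs)    # 1.0 * log10(2 / 1)
banana_score = tf_idf("banana", docs[0], docs)  # 0.5 * log10(2 / 2)
print(round(apple_score, 5), banana_score)      # prints: 0.30103 0.0
```

A word that occurs in every document, like `banana` here, gets an idf of 0 and therefore a tf-idf of 0 no matter how often it occurs.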

- Build the vocabulary $V$:

  - For each document $d$ in $D$:
    - Split $d$ into words on whitespace and punctuation to get the word set $W_d$
    - Remove stop words from $W_d$
    - Reduce each remaining word to its root form (stemming) and update $W_d$
  - Finally:
    $V = \bigcup_{d \in D} W_d$, the union of all $W_d$
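
The vocabulary steps can be sketched roughly as follows, taking $V$ as the union of the per-document word sets. The stop-word set, the `naive_stem` helper, and the two example sentences are hypothetical placeholders; a real pipeline would use a full stop-word list and a proper stemmer.

```python
import re

# Hypothetical stand-in for a real stop-word list.
STOP_WORDS = {"is", "the", "of", "for"}

def naive_stem(word):
    """Toy stemmer (assumption): strip a trailing 's' from words longer than three letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def document_words(text):
    """Return W_d: the set of stemmed, non-stop-word tokens of one document."""
    tokens = re.findall(r"\w+", text.lower())          # split on whitespace and punctuation
    kept = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return {naive_stem(t) for t in kept}               # stem what is left

def build_vocabulary(corpus):
    """Return V as the sorted union of W_d over all documents."""
    vocabulary = set()
    for text in corpus:
        vocabulary |= document_words(text)
    return sorted(vocabulary)

corpus = ["Cats chase the mice.", "The dog chases cats!"]
vocab = build_vocabulary(corpus)
print(vocab)  # prints: ['cat', 'chase', 'dog', 'mice']
```
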
## 0.1. Processing Data

In [2]:
# Libraries used below
import math

import numpy as np
import pandas as pd

In [3]:
# Init data
sentence_1 = "Data Science is the sexiest job of the 21st century"
sentence_2 = "Machine Learning is the key for Data Science"

# Lowercase and split each sentence, then take the union of their words
sentence_1, sentence_2 = sentence_1.lower().split(), sentence_2.lower().split()
sentence_1n2 = set(sentence_1).union(sentence_2)

print(sentence_1, sentence_2, sentence_1n2, sep = "\n")

['data', 'science', 'is', 'the', 'sexiest', 'job', 'of', 'the', '21st', 'century']
['machine', 'learning', 'is', 'the', 'key', 'for', 'data', 'science']
{'learning', 'the', 'science', '21st', 'data', 'job', 'century', 'machine', 'for', 'key', 'of', 'is', 'sexiest'}


In [4]:
# Download the NLTK stop-word list
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Remove stop words from each sentence using the nltk list
ft_sentence_1 = [word for word in sentence_1 if word not in stop_words]
ft_sentence_2 = [word for word in sentence_2 if word not in stop_words]
ft_sentence_1n2 = [word for word in sentence_1n2 if word not in stop_words]

print(ft_sentence_1, ft_sentence_2, ft_sentence_1n2, sep = "\n")

['data', 'science', 'sexiest', 'job', '21st', 'century']
['machine', 'learning', 'key', 'data', 'science']
['learning', 'science', '21st', 'data', 'job', 'century', 'machine', 'key', 'sexiest']




In [5]:
# dict.fromkeys creates a dictionary with the given keys, each mapped to the same initial value (0 here)
dictA, dictB = dict.fromkeys(ft_sentence_1n2, 0), dict.fromkeys(ft_sentence_1n2, 0)
# Count how often each word occurs in each sentence
for word in ft_sentence_1:
    dictA[word] = dictA.get(word, 0) + 1
for word in ft_sentence_2:
    dictB[word] = dictB.get(word, 0) + 1

# Build a DataFrame of raw counts: one row per sentence, one column per word
df = pd.DataFrame([dictA, dictB])
df

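|   | learning | science | 21st | data | job | century | machine | key | sexiest |
|---|----------|---------|------|------|-----|---------|---------|-----|---------|
| 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |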


## 0.2. TF-IDF

In [6]:
# Compute TF-IDF for one document, given its word counts
def compute_tfidf(word_dict, data = df):
    # TF: raw count divided by the largest count in the same document
    def compute_tf(word_dict):
        max_count = max(word_dict.values())
        return {key: count / max_count for key, count in word_dict.items()}

    # IDF: log10(|D| / number of documents containing the term)
    def compute_idf(word_dict, data):
        # |D| is the number of documents, i.e. the number of rows in the count table
        n_docs = data.shape[0]
        idf_dict = {}
        for key in word_dict:
            # Number of documents in which the term appears
            doc_freq = np.sum(data[key] != 0)
            # Every key comes from the vocabulary built from the corpus, so doc_freq >= 1
            # and no +1 smoothing of the denominator is applied here
            idf_dict[key] = math.log10(n_docs / doc_freq)
        return idf_dict

    # TF-IDF: multiply TF and IDF term by term
    tf_dict, idf_dict = compute_tf(word_dict), compute_idf(word_dict, data)
    return {key: tf_dict[key] * idf_dict[key] for key in tf_dict}

# Build a DataFrame of TF-IDF values (kept separate from the count table df)
tfidf_df = pd.DataFrame([compute_tfidf(dictA), compute_tfidf(dictB)])
tfidf_df

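|   | learning | science | 21st | data | job | century | machine | key | sexiest |
|---|----------|---------|------|------|-----|---------|---------|-----|---------|
| 0 | 0.0 | 0.0 | 0.30103 | 0.0 | 0.30103 | 0.30103 | 0.0 | 0.0 | 0.30103 |
| 1 | 0.30103 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.30103 | 0.30103 | 0.0 |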


## 0.3. Scikit-Learn

In [7]:
# Import sklearn tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
vectorize = TfidfVectorizer(stop_words='english')

# Init data 
first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "Machine Learning is the key for Data Science"

# fit & transform data 
process_text = vectorize.fit_transform([first_sentence, second_sentence])
print(process_text)

  (0, 1)	0.4466561618018052
  (0, 0)	0.4466561618018052
  (0, 3)	0.4466561618018052
  (0, 8)	0.4466561618018052
  (0, 7)	0.31779953783628945
  (0, 2)	0.31779953783628945
  (1, 4)	0.4992213265230509
  (1, 5)	0.4992213265230509
  (1, 6)	0.4992213265230509
  (1, 7)	0.35520008546852583
  (1, 2)	0.35520008546852583
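
The scikit-learn numbers differ from the manual table because `TfidfVectorizer`, with its default settings, uses the raw count as tf, a smoothed natural-log idf of the form $\ln\frac{1 + |D|}{1 + df} + 1$, and then scales each row to unit Euclidean length. The sketch below reproduces the printed values in plain Python under those assumptions. The helper names are illustrative, and the token lists are copied from the filtered sentences earlier rather than produced by scikit-learn's own tokenizer.

```python
import math

# Tokens after stop-word removal, copied from the filtered sentences above.
docs = [
    ["data", "science", "sexiest", "job", "21st", "century"],
    ["machine", "learning", "key", "data", "science"],
]
# Alphabetical order of the vocabulary gives the column indices in the output.
vocab = sorted(set(docs[0]) | set(docs[1]))
n_docs = len(docs)

def smoothed_idf(word):
    """ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    """Raw count times smoothed idf, then scaled to unit Euclidean (L2) length."""
    raw = [doc.count(word) * smoothed_idf(word) for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]
for doc_index, row in enumerate(rows):
    for col_index, value in enumerate(row):
        if value:
            print((doc_index, col_index), vocab[col_index], round(value, 6))
```

With this weighting, words shared by both sentences (`data`, `science`) keep a nonzero weight, whereas the $\log_{10}$ definition used in the manual computation sets them to 0.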
