# TF-IDF with Scikit-Learn

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html#visualize-tf-idf

In [20]:
# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path
import glob
import altair as alt
import numpy as np

## U.S. Inaugural Addresses Dataset

In [11]:
directory_path = "US_Inaugural_Addresses/"
# Sort the paths so documents are processed in file-name (chronological) order
text_files = sorted(glob.glob(f"{directory_path}*.txt"))
text_titles = [Path(text).stem for text in text_files]
text_titles

['01_washington_1789',
 '02_washington_1793',
 '03_adams_john_1797',
 '04_jefferson_1801',
 '05_jefferson_1805',
 '06_madison_1809',
 '07_madison_1813',
 '08_monroe_1817',
 '09_monroe_1821',
 '10_adams_john_quincy_1825',
 '11_jackson_1829',
 '12_jackson_1833',
 '13_van_buren_1837',
 '14_harrison_1841',
 '15_polk_1845',
 '16_taylor_1849',
 '17_pierce_1853',
 '18_buchanan_1857',
 '19_lincoln_1861',
 '20_lincoln_1865',
 '21_grant_1869',
 '22_grant_1873',
 '23_hayes_1877',
 '24_garfield_1881',
 '25_cleveland_1885',
 '26_harrison_1889',
 '27_cleveland_1893',
 '28_mckinley_1897',
 '29_mckinley_1901',
 '30_roosevelt_theodore_1905',
 '31_taft_1909',
 '32_wilson_1913',
 '33_wilson_1917',
 '34_harding_1921',
 '35_coolidge_1925',
 '36_hoover_1929',
 '37_roosevelt_franklin_1933',
 '38_roosevelt_franklin_1937',
 '39_roosevelt_franklin_1941',
 '40_roosevelt_franklin_1945',
 '41_truman_1949',
 '42_eisenhower_1953',
 '43_eisenhower_1957',
 '44_kennedy_1961',
 '45_johnson_1965',
 '46_nixon_1969',
 '47_

## Calculate TF-IDF

We can calculate tf-idf scores using TfidfVectorizer from scikit-learn.

In [12]:
# Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

# Run TfidfVectorizer on text_files
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

# Make a DataFrame out of the resulting vector
# Set feature names (words) as columns and titles as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

# Add row for document frequency (number of documents in which each word appears)
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()

tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)



| | government | borders | people | obama | war | honor | foreign | men | women | children |
|---|---|---|---|---|---|---|---|---|---|---|
| 00_Document Frequency | 53.0 | 5.0 | 56.0 | 3.0 | 45.0 | 32.0 | 32.0 | 47.0 | 15.0 | 22.0 |
| 01_washington_1789 | 0.11 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 |
| 02_washington_1793 | 0.06 | 0.0 | 0.05 | 0.0 | 0.0 | 0.08 | 0.0 | 0.0 | 0.0 | 0.0 |
| 03_adams_john_1797 | 0.16 | 0.0 | 0.19 | 0.0 | 0.01 | 0.1 | 0.12 | 0.04 | 0.0 | 0.0 |
| 04_jefferson_1801 | 0.16 | 0.0 | 0.01 | 0.0 | 0.01 | 0.04 | 0.0 | 0.04 | 0.0 | 0.0 |
| 05_jefferson_1805 | 0.03 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 | 0.06 | 0.01 | 0.0 | 0.02 |
| 06_madison_1809 | 0.0 | 0.0 | 0.02 | 0.0 | 0.02 | 0.05 | 0.05 | 0.0 | 0.0 | 0.0 |
| 07_madison_1813 | 0.04 | 0.0 | 0.04 | 0.0 | 0.25 | 0.02 | 0.02 | 0.0 | 0.0 | 0.0 |
| 08_monroe_1817 | 0.17 | 0.0 | 0.11 | 0.0 | 0.09 | 0.01 | 0.1 | 0.04 | 0.0 | 0.0 |
| 09_monroe_1821 | 0.08 | 0.0 | 0.06 | 0.0 | 0.11 | 0.02 | 0.04 | 0.01 | 0.0 | 0.01 |


In [13]:
# Drop "00_Document Frequency"
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

# Reorganize DataFrame so words are in rows rather than columns
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={'level_0': 'document', 'level_1': 'term', 0: 'tfidf'})

# Top 10 words with highest tf-idf for each address
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

| | document | term | tfidf |
|---|---|---|---|
| 3707 | 01_washington_1789 | government | 0.113681 |
| 4108 | 01_washington_1789 | immutable | 0.103883 |
| 4175 | 01_washington_1789 | impressions | 0.103883 |
| 6337 | 01_washington_1789 | providential | 0.103883 |
| 5631 | 01_washington_1789 | ought | 0.103728 |
| 6351 | 01_washington_1789 | public | 0.103102 |
| 6117 | 01_washington_1789 | present | 0.097516 |
| 6389 | 01_washington_1789 | qualifications | 0.096372 |
| 5811 | 01_washington_1789 | peculiarly | 0.090546 |
| 653 | 01_washington_1789 | article | 0.085786 |
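
The wide-to-long reshaping used here can be easier to follow on a tiny table. The sketch below uses a hypothetical two-document DataFrame (`wide`, `long`, and `top_1` are illustrative names, and the scores are made up) to show what `stack()` and `reset_index()` do before the top terms are selected.

```python
import pandas as pd

# Hypothetical wide table: one row per document, one column per term
wide = pd.DataFrame(
    {"government": [0.11, 0.06], "people": [0.05, 0.05], "war": [0.0, 0.25]},
    index=["doc_a", "doc_b"],
)

# stack() moves the column labels into a second index level;
# reset_index() then turns both index levels into ordinary columns
long = wide.stack().reset_index()
long.columns = ["document", "term", "tfidf"]

# Highest-scoring term per document: sort within each document, keep the first row
top_1 = (
    long.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(1)
)

print(long)
print(top_1)
```

Each (document, term) pair becomes its own row, which is the shape that `groupby(...).head(n)` and Altair both expect.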


In [14]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

In [15]:
# Zoom in on a particular word
top_tfidf[top_tfidf['term'].str.contains('women')]

| | document | term | tfidf |
|---|---|---|---|
| 503861 | 56_obama_2009 | women | 0.084859 |


Obama's 2009 address is the only one with 'women' among its ten highest-scoring terms, which suggests the word is distinctive of that speech relative to the rest of the corpus.

In [16]:
top_tfidf[top_tfidf['document'].str.contains('obama')]

| | document | term | tfidf |
|---|---|---|---|
| 495406 | 56_obama_2009 | america | 0.148351 |
| 500298 | 56_obama_2009 | nation | 0.120229 |
| 500358 | 56_obama_2009 | new | 0.118002 |
| 503093 | 56_obama_2009 | today | 0.114792 |
| 498590 | 56_obama_2009 | generation | 0.100654 |
| 499762 | 56_obama_2009 | let | 0.0911 |
| 499578 | 56_obama_2009 | jobs | 0.090727 |
| 496911 | 56_obama_2009 | crisis | 0.087235 |
| 498779 | 56_obama_2009 | hard | 0.084859 |
| 503861 | 56_obama_2009 | women | 0.084859 |


In [17]:
top_tfidf[top_tfidf['document'].str.contains('trump')]

| | document | term | tfidf |
|---|---|---|---|
| 513404 | 58_trump_2017 | america | 0.350162 |
| 515585 | 58_trump_2017 | dreams | 0.156436 |
| 513405 | 58_trump_2017 | american | 0.149226 |
| 517576 | 58_trump_2017 | jobs | 0.142766 |
| 519262 | 58_trump_2017 | protected | 0.132439 |
| 518409 | 58_trump_2017 | obama | 0.120288 |
| 518766 | 58_trump_2017 | people | 0.11237 |
| 521001 | 58_trump_2017 | thank | 0.109171 |
| 513989 | 58_trump_2017 | borders | 0.107075 |
| 521596 | 58_trump_2017 | ve | 0.107075 |


In [18]:
top_tfidf[top_tfidf['document'].str.contains('kennedy')]

| | document | term | tfidf |
|---|---|---|---|
| 391774 | 44_kennedy_1961 | let | 0.267869 |
| 394306 | 44_kennedy_1961 | sides | 0.262849 |
| 392921 | 44_kennedy_1961 | pledge | 0.16096 |
| 387632 | 44_kennedy_1961 | ask | 0.107713 |
| 387864 | 44_kennedy_1961 | begin | 0.106495 |
| 388991 | 44_kennedy_1961 | dare | 0.106495 |
| 395895 | 44_kennedy_1961 | world | 0.10311 |
| 390313 | 44_kennedy_1961 | final | 0.102311 |
| 392370 | 44_kennedy_1961 | new | 0.0966 |
| 390120 | 44_kennedy_1961 | explore | 0.094223 |


## Visualize TF-IDF

We can also visualize the TF-IDF results using Altair, a data visualization library.

Let's make a heatmap of the highest scoring words for each president and put a red dot next to the two terms of interest:
+ war
+ peace

In [21]:
# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
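
The random jitter in the cell above appears to be needed because a rank window operation gives tied scores the same rank, so two terms with identical tf-idf values would land in the same heatmap cell. As an alternative sketch (not what the chart above does), ranks could be computed in pandas before plotting, with ties broken deterministically by order of appearance. The DataFrame `scores` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical scores with a tie inside doc_a
scores = pd.DataFrame({
    "document": ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b"],
    "term": ["war", "peace", "people", "union", "law"],
    "tfidf": [0.2, 0.2, 0.1, 0.3, 0.1],
})

# method="first" gives tied scores distinct ranks in the order the rows appear;
# ascending=False makes the highest score rank 1 within each document
scores["rank"] = (
    scores.groupby("document")["tfidf"]
    .rank(method="first", ascending=False)
    .astype(int)
)

print(scores)
```

A precomputed `rank` column like this could then be encoded directly on the x axis, which would make the chart reproducible from run to run.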

**What is the difference between a tf-idf score and raw word frequency?**

+ Raw word frequency (term frequency, TF) measures how often a word appears in a single document, either as a plain count or relative to the total number of words in the document. A TF-IDF score multiplies that frequency by the inverse document frequency (IDF), a weight that shrinks as the share of documents in the corpus containing the word grows. TF tells us how common a word is within one document; IDF tells us how rare it is across the collection.
+ The motivation behind TF-IDF is that a word becomes less useful for characterizing a document the more widely it appears across the corpus. A word found in nearly every document, such as 'people' (document frequency 56 in the table above), says little about what distinguishes any one of them.
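
To make the difference concrete, the sketch below computes TF-IDF by hand on a hypothetical tokenized corpus. It follows the default formula documented for scikit-learn's `TfidfVectorizer`: raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each document's vector. The corpus and the helper names `smoothed_idf` and `tfidf_vector` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus (stop words already removed)
docs = {
    "doc_a": ["people", "government", "people"],
    "doc_b": ["people", "war"],
    "doc_c": ["people", "government", "peace"],
}
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def smoothed_idf(term):
    # ln((1 + n) / (1 + df)) + 1: equals 1 for a term found in every document
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)  # raw term frequency
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
    return {term: w / norm for term, w in weights.items()}

scores = {name: tfidf_vector(tokens) for name, tokens in docs.items()}
for name, vec in scores.items():
    print(name, {term: round(w, 2) for term, w in vec.items()})
```

In `doc_b`, "people" and "war" have the same raw frequency, yet "war" scores higher because it appears in only one of the three documents.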

**Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?**

+ TF-IDF scores cannot capture the context of words, because they weigh a term only by its term and document frequencies. Synonyms are scored as unrelated terms, and a word with multiple meanings is scored as a single term. Handling either case requires information about syntax (language structure) and semantics (meaning).
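
A small sketch of this limitation, using four hypothetical sentences: the first two say roughly the same thing with different words, and the last two share the word "bank" in unrelated senses. Because `TfidfVectorizer` L2-normalizes each row by default, the product of the matrix with its transpose gives cosine similarities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences chosen to expose synonymy and polysemy
sentences = [
    "automobile prices rose",   # paraphrase of the next sentence
    "car costs increased",
    "river bank flooded",       # "bank" as a riverside
    "bank raised rates",        # "bank" as a financial institution
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences)

# Rows have unit length, so the dot product is the cosine similarity
similarity = (matrix @ matrix.T).toarray()

print("paraphrases:", round(similarity[0, 1], 2))
print("shared 'bank':", round(similarity[2, 3], 2))
```

The paraphrases share no tokens and so look completely unrelated, while the two "bank" sentences look partly similar even though they are about different things.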

**What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?**

+ TF-IDF works well for large corpora because it is cheap to compute. One use case is to surface the similarities and differences between documents. An interesting collection to analyze would be reviews or public opinion on a shared topic, such as tweets about current events or reviews of similar products from competing companies.