# tf-idf

*Lauren F. Klein wrote version 1.0 of this notebook, based of tutorials by [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) and [Kavita Ganesan](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XZVlcOdKhSw). I have supplemented it with material from Melanie Walsh's chapter [Web Scraping — Part 1](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Collecting-Cultural-Data/Web-Scraping.html) from her online textbook [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html).*

We will learn powerful data science techniques soon. But, in many cases, just counting words can tell you a lot. 

Today, we're going to explore a method called Term Frequency - Inverse Document Frequency (tf-idf). Tf-idf comes up a lot in text analysis projects because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

The procedure was introduced in a 1972 paper by Karen Spärck Jones under the name “term specificity,” and the basic idea is this:

Instead of representing a term in a document by its raw frequency or its relative frequency (the term count divided by the document length), each term is *weighted* by dividing the term frequency by the number of documents in the corpus containing the word. 

The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in a document are often the most frequently used words in all of the documents.

By contrast, terms with the highest tf-idf scores are the terms in a document that are distinctively frequent in a document when that document is compared other documents. When you sort by tf-idf score, these distinctive terms rise to the top. 

## *New York Times* Obituaries

In this lesson, we're going to use tf-idf to study 378 obituaries published by *The New York Times*. This dataset is based on data originally collected by Matt Lavin for his *Programming Historian* [TF-IDF tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset). Melanie Walsh re-scraped the obituaries so that the subject's name and death year are included in each text file name; she also added 12 more ["Overlooked"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries.

## Pre-processing: prepare the documents

Tf-idf works on a set of documents. Each document needs to be a single string. You'll get very familiar with writing document and text pre-processing code like this by the end of the class.

In [None]:
import os

base_dir = "../docs/NYT-Obituaries/"

all_docs = []
text_titles = []

docs = os.listdir(base_dir)

for doc in docs:
    with open(base_dir + doc, "r") as file:
        text = file.read()
        all_docs.append(text)
        text_titles.append(str(doc))
# just take a look at the first item to be sure
print(docs[0]) 
print("\n")
print(all_docs[0])

## Import libraries

Conveniently scikit-learn, which we were introduced to in the previous lesson, allows us to calculate tf-idf with just a few lines of code.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd # this will help us keep track of our data; 
# we'll talk about pandas in more detail later this semester

## Creat document-term matrix

We'll use a doc-term matrix to calculate tf-idf. Remember how to create a doc-term matrix from last lesson?

In [None]:
#instantiate CountVectorizer()
cv=CountVectorizer()

# this steps generates document-term matrix for the docs
dtm=cv.fit_transform(all_docs)

# check shape
dtm.shape

## Initialize TfidfTransformer

When you initialize TfidfTransformer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf. The recommended way to run `TfidfTransformer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in story length, and, overall, they'll produce more meaningful tf–idf scores. 

In [None]:
# Call tfidf_transformer.fit on the word count vector we computed earlier.
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(dtm)

## Produce inverse document frequence (idf) values

In [None]:
# print idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"])
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

## Produce & print tf-idf scores

Once you have the idf values, you can compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the documents in our corpus.

In [None]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(dtm)

Now, let’s print the tf-idf values of the first document to see if they make sense. 

We'll place the tf-idf scores from the first document (The Who's "Baba O'Reilly") into a pandas dataframe and sort the dataframe in descending order of scores.

In [None]:
feature_names = cv.get_feature_names()

#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]
 
#print the scores for the first doc
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

Notice that only certain words have scores. This is because only the words in this document have a tf-idf score and everything else, from other documents, shows up as zeroes.

Sometimes very common words ("the", "and", "a") are evidently distinctive, but they're not interesting. 

## tf-idf: the fast way

So now we're going to do it again with scikit-learn's stopword list. And since we're tf-idf pros, we're going to use scikit-learn's all-in-one tf-idf vectorizer. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

# to exclude stopwords, add the argument `stop_words='english'`
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True)
 
# send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(all_docs)

In [None]:
# as above, get the first vector out (for the first document)
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]
 
# place tf-idf values in a pandas dataframe
# reminder: we'll cover pandas in a later lesson
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

In [None]:
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names())
tfidf_df = tfidf_df.sort_index()

In [None]:
# Add row for number of times word appears in all documents
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [None]:
tfidf_slice = tfidf_df[['south', 'war', 'politics', 'peace','america', 'woman', 'music', 'art']]
tfidf_slice

## Store tf-idf vectors & print top five for each doc

Finally, let's store our tf-idf vectors to files. Don't worry if you can't follow every bit of code below.

In [14]:
base_dir = "../docs/NYT-Obituaries/"

# make a directory to store them in
os.mkdir("./tf_idf_output")

docs = os.listdir(base_dir)

csvs = []

for doc in docs:
    csv = doc.replace(".txt",".csv")
    csvs.append(csv)

# convert sparse matrix to array
tfidf_vectors_as_array = tfidf_vectorizer_vectors.toarray()

# loop each item in tfidf_vectors_as_array, 
titles = []
for counter, doc in enumerate(tfidf_vectors_as_array): # note enumerate. useful! 
    # construct a dataframe
    tf_idf_tuples = list(zip(tfidf_vectorizer.get_feature_names(), doc))
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    # output to a csv using the enumerated value for the filename
    for csv in csvs:
        one_doc_as_df.to_csv(path_or_buf="./tf_idf_output/" + str(csv))
        title = csv.replace(".csv", "")
        titles.append(title)
    print("\n" + str(titles[counter]) + " top 5 terms: ")
    print(one_doc_as_df.head())


1959-Cecil-De-Mille top 5 terms: 
           term     score
0         mille  0.814593
1         lasky  0.116370
2            mr  0.110954
3  commandments  0.106190
4        screen  0.096381

1928-Mabel-Craty top 5 terms: 
        term     score
0     cratty  0.754049
1       miss  0.194203
2   bellaire  0.150810
3    council  0.139811
4  christian  0.123576


FileNotFoundError: [Errno 2] No such file or directory: './tf_idf_output/1955-Dale-Carnegie.csv'

## Exercise

**Analyze the printed results, the top five terms for each song in terms of tf-idf scores. The results are likely obvious for a toy corpus like ours. But does anything surprise you? And what is a corpus, an experiment, that you can imagine using tf-idf for?**

ANSWER HERE

## That's it!