# TF-IDF – Project Notebook

Use this notebook to run the analyses from the workshop notebook on your own subreddit data.

### Icons Used in This Notebook
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

## Retrieving the Dataset

In [None]:
import os
import pandas as pd

In [None]:
# Replace this with your own preprocessed file!    
df = pd.read_csv('../../data/YOUR_FILE_PP.csv')

# Make sure the index is reset
df.reset_index(drop=True, inplace=True)

In [None]:
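# Remove all rows that are '[removed]' or '[deleted]'
df = df.loc[~df['pp_text'].isin(['[removed]', '[deleted]']), :]

# Select only rows that have >3 characters in pp_text
df = df.loc[df['pp_text'].str.len() > 3]

# Reset the index again so row labels match row positions in the TF-IDF matrix
df = df.reset_index(drop=True)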

## Using TF-IDF on Your Data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Settings for the TF-IDF vectorizer go here
tfidf_vectorizer = TfidfVectorizer(max_df=0.85,
                                   decode_error='ignore',
                                   stop_words='english',
                                   smooth_idf=True,
                                   use_idf=True)

# Fit and transform the texts
tfidf = tfidf_vectorizer.fit_transform(df['pp_text'])
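
For intuition about what `fit_transform` returns, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with `use_idf=True`, `smooth_idf=True`, and its default `norm='l2'`: raw term counts are multiplied by `ln((1 + n) / (1 + df)) + 1`, and each document's vector is then scaled to unit length. The three toy documents and the helper name `toy_tfidf` are made up for illustration, and tokenization here is a plain whitespace split rather than scikit-learn's tokenizer, so treat it as a sketch and not a replacement for the vectorizer.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # L2 normalisation so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Hypothetical toy corpus
toy_docs = ["cats chase mice", "cats sleep", "cats purr loudly"]
toy_weights = toy_tfidf(toy_docs)
for doc, weights in zip(toy_docs, toy_weights):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

In this toy corpus "cats" appears in every document, so it receives the lowest weight in each one, while words that occur in only one document score higher.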

Let's have a look at some of the TF-IDF values:

In [None]:
# Place TF-IDF values in a DataFrame
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
tfidf_df.reset_index(drop=True, inplace=True)

In [None]:
tfidf_df.head()

In [None]:
# Highest TF-IDF values across documents
tfidf_df.sum().sort_values(ascending=False)

## Top TF-IDF Terms per Post

Change `tfidf[10]` to another number if you want to see the TF-IDF scores for a different post.

In [None]:
import numpy as np

def get_top_tfidf_words(row, features, top_n=10):
    top_indices = np.argsort(row)[::-1][:top_n]
    return [(features[i], row[i]) for i in top_indices]

# Example: document 10
top_words = get_top_tfidf_words(tfidf[10].toarray()[0], tfidf_vectorizer.get_feature_names_out())
for word, score in top_words:
    print(f"{word}: {score:.4f}")

Let's look at the post itself to see what terms TF-IDF is considering "distinctive".

In [None]:
# Use .iloc so we select by row position, matching row 10 of the TF-IDF matrix
df['selftext'].iloc[10]

We can visualize these TF-IDF-weighted terms as well. This code saves the plot in a PNG file.

In [None]:
import matplotlib.pyplot as plt

def plot_top_terms(tfidf_vector, feature_names, doc_id=0, top_n=10):
    row = tfidf_vector[doc_id].toarray()[0]
    top_indices = row.argsort()[-top_n:][::-1]
    terms = [feature_names[i] for i in top_indices]
    scores = [row[i] for i in top_indices]

    plt.figure(figsize=(8, 5))
    plt.barh(terms[::-1], scores[::-1])
    plt.title(f"Top {top_n} TF-IDF Terms for Document {doc_id}")
    plt.xlabel("TF-IDF Score")
    plt.tight_layout()
    os.makedirs("outputs_project", exist_ok=True)  # create the output folder if it is missing
    plt.savefig(f"outputs_project/top_terms_doc_{doc_id}.png", dpi=300)
    plt.show()

# Change doc_id below to get data for a different post
plot_top_terms(tfidf, tfidf_vectorizer.get_feature_names_out(), doc_id=10)

## Top Terms Across the Corpus (Mean TF-IDF)

Now, let's move from document-level to corpus-level views:

In [None]:
mean_tfidf = tfidf.mean(axis=0).A1
terms = tfidf_vectorizer.get_feature_names_out()
top_indices = mean_tfidf.argsort()[-10:][::-1]

top_terms = [terms[i] for i in top_indices]
top_scores = [mean_tfidf[i] for i in top_indices]

plt.figure(figsize=(8, 5))
plt.barh(top_terms[::-1], top_scores[::-1])
plt.title("Top TF-IDF Terms Across Corpus")
plt.xlabel("Mean TF-IDF Score")
plt.tight_layout()
os.makedirs("outputs_project", exist_ok=True)  # create the output folder if it is missing
plt.savefig("outputs_project/top_terms_corpus.png", dpi=300)
plt.show()

## Using TF-IDF to Find Similar Posts

Choose a post or comment from your data that has an interesting topic or tone. 

In [None]:
doc_idx = 25

Change this `selftext` column to `body` if you are working with a comments DataFrame!

In [None]:
df['selftext'].iloc[doc_idx]

Let's have a quick look at the TF-IDF scores for the words in this submission to see if these words are indeed typical for this particular submission. Do the distinctive words have to do with the topic of the post?

In [None]:
tfidf_df.loc[doc_idx].sort_values(ascending=False)

Now let's find the closest posts to this one.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(tfidf)
similarities.shape
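
As a reminder of what these similarity scores mean, the sketch below computes cosine similarity by hand for a few made-up vectors. The function name `cosine` and the four vectors are illustrative only and are not part of the workshop code. Because `TfidfVectorizer` scales each row to unit length by default, the cosine similarity between two TF-IDF rows is simply their dot product.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows over a three-term vocabulary
doc_a = [0.6, 0.8, 0.0]
doc_b = [0.6, 0.8, 0.0]  # same term weights as doc_a
doc_c = [0.0, 0.0, 1.0]  # shares no terms with doc_a
doc_d = [0.8, 0.6, 0.0]  # same terms as doc_a, different emphasis

print(cosine(doc_a, doc_b))  # close to 1: identical direction
print(cosine(doc_a, doc_c))  # 0: no shared terms
print(cosine(doc_a, doc_d))  # in between
```

Note that `similarities` holds one score for every pair of posts, so its size grows with the square of the number of posts. If that becomes too large for memory, one option is to compare a single post against the rest with `cosine_similarity(tfidf[doc_idx], tfidf)`, which returns a single row of scores.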

Put the text and scores in a dataframe, and sort by the score:

In [None]:
similar_df = pd.DataFrame({
    # Change this to "body" if working with comments
    'text': df['selftext'].values,
    'score': similarities[doc_idx]}).sort_values('score', ascending=False)

The top document will be the document itself (it's going to have a similarity of 1 with itself). So we look at the next document - does it seem similar?

In [None]:
similar_df['text'].iloc[0]

In [None]:
similar_df['text'].iloc[1]

💭 **Reflection**: Reading similar posts like this can help you expand your ideas about the **research question** you have about your data. For instance, you might find that the ideological concepts you are interested in are used in other contexts you hadn't previously considered. Or you might find "adjacent" concepts that give you a more robust understanding of the discourse of your community. 

## Using TF-IDF to Find Posts

💭 **Reflection**: Enter a term that makes sense given your dataset. It should be a term that says something about a dominant theme or topic you are expecting to find in your data. **Check out the output from `tfidf_df` above to see some distinctive terms**.

The resulting DataFrame will contain the posts in which the term's TF-IDF score is above the threshold, that is, posts where the term is especially prominent and specific compared to the other posts.

If the resulting DataFrame is empty, lower the threshold from `.5` to a smaller value such as `.3`.

In [None]:
# Subset df with a boolean mask built from tfidf_df.
# .values makes the mask positional, so it works even if the two indexes differ.
tfidf_someword_df = df[(tfidf_df['SOME_TERM'] > .5).values]
tfidf_someword_df.head(3)

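The DataFrame keeps the original row order, so its first post is the first one in which your chosen word scores above the threshold. According to TF-IDF, at least, the word is highly significant for that post, though it is not necessarily the highest-scoring post.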

In [None]:
print(tfidf_someword_df['selftext'].iloc[0])
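
If you want the posts ranked by the term's score rather than in their original order, one option is to sort the TF-IDF column first and use the resulting positions to pick rows. The sketch below uses small made-up stand-ins (`toy_posts`, `toy_scores`, and the term `budget`) rather than your real `df` and `tfidf_df`. It assumes, as in this notebook, that row *i* of the scores corresponds to row *i* of the posts.

```python
import pandas as pd

# Hypothetical stand-ins for df and tfidf_df
toy_posts = pd.DataFrame(
    {"selftext": ["post about rent", "post about budget cuts", "budget budget budget"]},
    index=[4, 7, 9],  # non-consecutive labels, as left behind by row filtering
)
toy_scores = pd.DataFrame({"budget": [0.0, 0.45, 0.9], "rent": [0.8, 0.0, 0.0]})

# Row positions ordered from highest to lowest score for the chosen term
order = toy_scores["budget"].to_numpy().argsort()[::-1]

# Keep only positions above the threshold, then select rows by position with .iloc
threshold = 0.3
keep = [i for i in order if toy_scores["budget"].iloc[i] > threshold]
ranked = toy_posts.iloc[keep].assign(score=toy_scores["budget"].to_numpy()[keep])
print(ranked)
```

With your own data you would swap in `df`, `tfidf_df`, and your chosen term. Selecting by position with `.iloc` avoids relying on the two DataFrames having matching index labels.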

## Using TF-IDF Correlations to Explore Biases

In [None]:
corr = tfidf_df.corr()

💭 **Reflection**: Pick two words you are interested in comparing. Ideally, they should be binary constructs like "man" and "woman", or "progressive" and "conservative", or "Islam" and "Christianity". 

Change the `WORD1` and `WORD2` values to two binary concepts you are interested in comparing. Also change the `by=` argument to one of the words you picked.

The resulting DataFrame will be a list of words ordered by the amount of correlation with the word you picked when setting the `by=` argument. These correlations are a rough representation of other words that are frequently co-occurring with the word you picked.

In [None]:
corr[['WORD1','WORD2']].sort_values(by='WORD1_OR_WORD2',ascending=False)[:30]
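
To make the numbers in `corr` more concrete: `DataFrame.corr()` computes the Pearson correlation by default, which is high when two terms tend to have high scores in the same posts. Here is a small hand-rolled sketch with made-up score columns. The function `pearson` and the three lists are illustrative names and values, not part of the notebook's data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical TF-IDF scores for three terms across five posts
term_a = [0.5, 0.0, 0.4, 0.0, 0.3]
term_b = [0.4, 0.0, 0.5, 0.0, 0.2]  # tends to appear in the same posts as term_a
term_c = [0.0, 0.6, 0.0, 0.5, 0.0]  # appears only where term_a does not

print(round(pearson(term_a, term_b), 3))  # strongly positive
print(round(pearson(term_a, term_c), 3))  # strongly negative
```

Since most TF-IDF entries are zero, these correlations mostly reflect whether two terms show up in the same posts, which is why they are only a rough proxy for association.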

## 💭 Reflection

- Do these related terms make sense? 
- Do you see some terms that could be indicative of a bias towards a binary construct in the data? 