# SISU Digital Humanities: Textual and Language Analysis on Social Media<br />
### Session 2: Preprocessing and tf-idf
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk) <br />


# Preprocessing and comparing subreddits

Today we will (1) learn how to preprocess text in a DataFrame, and (2) learn about tf-idf. Tf-idf allows us to compare different related subreddits, in order to find the most distinctive words in a particular subreddit. It can also help us to find similar posts to ones we're interested in.

**After completing this notebook, you will be able to:**
- Preprocess social media data, including removing punctuation, tokenizing, and lemmatizing;
- Understand how tf-idf can be used to compare datasets;
- Find most-distinctive words in a subreddit using tf-idf;
- Find similar posts using tf-idf.

## Import packages and data

In [None]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import pandas as pd
import os
import pickle
import re 
import string
import time

Let's get our files. These three datasets are taken from an American social media platform that is organized around interest communities (like Hupu or Douban in China). 

The three communities we have here are, to different degrees, related to the "Manosphere" (see https://en.wikipedia.org/wiki/Manosphere for more). We might say they are in the same "genre" of discourse, which arguably allows us to compare them.  

In [None]:
# load the data
trp = pd.read_csv("data/TRP-submissions.csv", lineterminator="\n")
sed = pd.read_csv("data/seduction.csv", lineterminator="\n")
mgtow = pd.read_csv("data/mgtow.csv", lineterminator="\n")

How big are our datasets?

In [None]:
print("seduction: " + str(len(sed)))
print("redpill: " + str(len(trp)))
print("men going their own way: " + str(len(mgtow)))

### Removing rows
Missing values (`NaN`) in a DataFrame can cause a lot of errors. In general, it's a good idea to get rid of those rows whose "selftext" is missing. Here's an example of how this works:

In [None]:
data = {'Name':['Sai', 'Jack', 'Angela', 'Matt', 'Alisha', 'Ricky'],'Age':[28,34,None,42, "[removed]", "[deleted]"]}
df = pd.DataFrame(data) 
df

In [None]:
clean_df = df.dropna(subset=['Age'])  # Drop NaN in the column 'Age'
clean_df

We can also remove rows whose cells contain particular text (this is relevant as Reddit datasets often contain posts that are removed or deleted!). We do this by using the `.isin()` method, which selects the matching rows. We then invert this selection with the `~` operator (Python's bitwise NOT), so that only the non-matching rows remain. See if you understand how this works!

In [None]:
cleaner_df = clean_df[~clean_df['Age'].isin(['[removed]', '[deleted]' ])]
cleaner_df
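If the `~` step feels like magic, it can help to look at the intermediate results. The sketch below uses a small made-up DataFrame (`toy`, `mask` and `kept` are hypothetical names, not part of the course data) to show what `.isin()` returns and what `~` does to it.

```python
import pandas as pd

# Hypothetical toy data, not part of the course datasets
toy = pd.DataFrame({"selftext": ["first post", "[removed]", None, "second post", "[deleted]"]})

# .isin() returns a boolean Series: True where the cell matches one of the values
mask = toy["selftext"].isin(["[removed]", "[deleted]"])
print(mask.tolist())

# ~ flips every True to False and vice versa, so we keep the rows that do NOT match
kept = toy[~mask]

# A missing value never matches .isin() here, so it still needs .dropna()
kept = kept.dropna(subset=["selftext"])
print(kept["selftext"].tolist())
```

Note that the missing value survives the `.isin()` filter, which is why both steps are needed.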

Your turn! Drop the missing (`NaN`), removed (`[removed]`) and deleted (`[deleted]`) values from our three DataFrames and assign the result to the same variable names.

In [None]:
# Your code here






Let's see if that shrinks our DataFrames.

In [None]:
print("seduction: " + str(len(sed)))
print("theredpill: " + str(len(trp)))
print("mgtow: " + str(len(mgtow)))

Looks like it did!

### Getting a slice
It's usually good to start small and see if all of your preprocessing functions work as expected, then scale up. Let's start with a slice of 10 posts based on the highest score. Yesterday, we saw how we can do that using the `.sort_values()` method. Recall that sorting and slicing a DataFrame works like this:

In [None]:
# Sorting
sorted_df = trp.sort_values(by=['score'], ascending=False)

# Slicing
sliced_df = trp[:100]

Let's combine these two expressions to filter the 10 highest-scoring posts of `trp`. We assign it to a new variable: `trp_10`.

In [None]:
# Your code here





## Preprocessing data

Great, we got our data. Now, we need to preprocess it. This includes:
1. Removing special characters and punctuation
2. Tokenizing
3. Removing stopwords
4. Part of Speech (POS) tagging & filtering
5. Stemming / lemmatizing

### Removing punctuation

First, have a look at how to use `string.punctuation` to get rid of some punctuation characters. `string.punctuation` is not a function: it's a pre-initialized string which we can use to get rid of punctuation in a string.



In [None]:
old_sent = "I. don't. know. why. I'm. speaking. like. this."
new_sent = ""
for ch in old_sent:
  if ch not in string.punctuation:
    new_sent += ch

new_sent
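The loop above is the clearest way to see what is going on. As an aside, Python strings also have a built-in route to the same result: `str.maketrans()` combined with `.translate()`. This is just an alternative sketch; the variable name `table` is made up for illustration.

```python
import string

old_sent = "I. don't. know. why. I'm. speaking. like. this."

# The third argument of str.maketrans lists the characters that should be deleted
table = str.maketrans("", "", string.punctuation)
new_sent = old_sent.translate(table)
print(new_sent)  # I dont know why Im speaking like this
```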

Your turn! Try to create a function called `strip_punctuation` that strips punctuation from a string. It takes a string as a parameter, and returns a new string with all punctuation stripped out.

In [None]:
# Your code here






Try to see if it works:

1. Create an empty list called `trp_strip_punct`;
2. Run a `for`-loop that iterates over all the "selftexts" in the `trp_10` DataFrame, and that applies your function to each; 
3. Save the result in a new variable;
4. Print your new variable to see if it worked!

In [None]:
# Your code here






Looks good, except that our function also removed the punctuation inside URLs, and some escaped newlines (`\n`) are left over. We will learn how to deal with those later on.
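For the curious, here is one possible way to handle URLs and newlines *before* stripping punctuation, using Python's `re` module. This is only a sketch: the helper name `strip_urls_and_newlines` and the sample text are made up, and the URL pattern is deliberately simple (it assumes URLs start with `http://` or `https://` and end at the next whitespace).

```python
import re
import string

def strip_urls_and_newlines(text):
    """Hypothetical helper: remove newlines and URLs before stripping punctuation."""
    # replace literal backslash-n sequences and real newlines with a space
    text = text.replace("\\n", " ").replace("\n", " ")
    # drop anything that starts with http:// or https:// up to the next whitespace
    text = re.sub(r"https?://\S+", " ", text)
    # collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "Read this: https://example.com/a-post?id=1\nIt changed my mind."
cleaned = strip_urls_and_newlines(sample)
no_punct = "".join(ch for ch in cleaned if ch not in string.punctuation)
print(no_punct)  # Read this It changed my mind
```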

### Tokenizing
Next, we need to tokenize our texts. Create another list called `trp_tokens`, then use another for-loop that applies NLTK's `word_tokenize()` function to each entry of our `trp_strip_punct` list. Use `.append()` instead of `.extend()` so that your loop creates 10 lists of tokens, one per post (instead of one long list of tokens, like we did yesterday).

In [None]:
# Your code here





`trp_tokens` is a list of lists: each list contains the individual tokens of a post. What if we want to access a list within a list? It works like this:

In [None]:
list1 = [[10,13,17],[3,5,1],[13,11,12]]
list1[0][2]

Your turn! Print out the first 10 tokens of the first entry of the `trp_tokens` list.

In [None]:
# Your code here







### Programming basics: Sets
Do these exercises if you need to learn about sets!

A set is an **unordered** and **unindexed** collection of unique items. This makes sets different from lists, which are ordered, and from dictionaries, which are indexed by keys. Sets are useful for quickly checking whether an item is present in a collection, when the order of that collection doesn't matter.

In Python sets are written with curly brackets, like so:

In [None]:
my_set = {"apple", "pear", "orange"}
print(my_set)

Note that the order is not preserved!
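A few more set behaviours are worth seeing once. The sketch below (with made-up fruit data) shows that duplicates disappear, that membership is checked with `in`, and that a set cannot be indexed.

```python
fruits = ["apple", "pear", "apple", "orange", "pear"]

# Converting a list to a set drops the duplicates
unique_fruits = set(fruits)
print(len(fruits), len(unique_fruits))  # 5 3

# Membership tests use the `in` keyword
print("pear" in unique_fruits)    # True
print("banana" in unique_fruits)  # False

# Sets have no positions, so indexing raises a TypeError
try:
    unique_fruits[0]
except TypeError as err:
    print("Cannot index a set:", err)
```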

### Removing stopwords
Next, let's remove stopwords. We can do so using NLTK's stopwords list, which we imported above. Let's have a look at some of these stopwords.

In [None]:
stopwords.words('english')[:10]

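Checking this list for *every* word in our three corpora is going to take a long time, so let's turn it into a set. This saves us time: to find out whether a list contains a word, Python has to scan it item by item, whereas a set can answer that question almost instantly, regardless of its size.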

Remember, when creating a set it shouldn't matter which order items are in – and for our stopwords list, that is the case!

In [None]:
set(stopwords.words('english'))
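If you want to see the speed difference for yourself, here is a rough timing sketch using the standard library's `timeit` module. The stopword list and tokens are made-up stand-ins (so no NLTK download is needed), and the exact timings will differ from machine to machine.

```python
import timeit

# Hypothetical stand-ins for a stopword list and a tokenized corpus
stop_list = ["word" + str(i) for i in range(200)]
stop_set = set(stop_list)
tokens = ["token" + str(i) for i in range(10_000)]  # none of these are stopwords

def filter_with(stop):
    """Keep only the tokens that are not in `stop`."""
    return [t for t in tokens if t not in stop]

# Run each version 5 times and report the total time
list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)
print(f"list: {list_time:.3f}s, set: {set_time:.3f}s")
```

Both versions return exactly the same tokens; only the time it takes to get there differs.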

Your turn! 
1. Create a function called `strip_stopwords()` that takes `tokens` as a parameter;
2. In the function, create a list named `no_stop`; 
3. Turn `stopwords.words('english')` into a set (like above), then assign it to a variable named `stop`;
4. Run a for-loop that fills the `no_stop` list with only those tokens that are *not* in `stop` (you need an `if`-statement!);
5. Finally, `return` the list.

In [None]:
# Your code here






Run the following line of code to see if it worked. You should get a printout of the first 10 tokens in the first post of `trp_10` – without the stopwords of course!

In [None]:
# Run this
trp_clean = [strip_stopwords(tokens) for tokens in trp_tokens]
trp_clean[0][:10]

### Stemming
Tokenizers are great, but they're often not perfect. Look at the example below:

In [None]:
word_tokenize("Why won't this work?")

Looks like it did a pretty good job, except it splits "won't" into two separate tokens, "wo" and "n't". Annoying. This is where **stemming** and **lemmatizing** come in handy. These are two text normalization techniques that are used to prepare text, words, and documents for further processing. 

See [this link](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python?utm_source=adwords_ppc&utm_campaignid=1455363063&utm_adgroupid=65083631748&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034361&utm_targetid=dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=1012831&gclid=Cj0KCQjwgJv4BRCrARIsAB17JI4kMKOUrJcdearlvPx4kl3VNVcqeZz-oeTSlbgikK3tJbXMrAmWTCwaAvUzEALw_wcB) for more information.

**Stemming** is the process of reducing inflected words to a common root form (the stem), typically by chopping off suffixes. A group of related words is mapped to the same stem, even if the stem itself is not a valid word in the language. First, let's load our stemmer:

In [None]:
stemmer = nltk.stem.LancasterStemmer()

In [None]:
for each in ["think", "thinker", "thinking"]:
    print(stemmer.stem(each))

...but stemming doesn't always produce the prettiest results:

In [None]:
for each in ["create", "creating", "creator"]:
    print(stemmer.stem(each))

### Lemmatizing
A lemma is the canonical, dictionary or citation form of a word. For instance, the lemma for "thinks" is "think." Lemmatization, in other words, is the process of converting a word to its base form.

Lemmatizing your data typically is a bit less intrusive than stemming it. Let's see it in action:

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
for each in ["trade", "trades", "trading", "trader", "traders"]:
    print(lemmatizer.lemmatize(each))

Your turn! 
1. Create a function called `lemmatize()` that takes `tokens` as a parameter;
2. In the function, create a new list called `lemmas`;
3. Assign `nltk.stem.WordNetLemmatizer()` to a variable called `lemmatizer`, like above; 
4. Run a `for`-loop that uses `lemmatizer.lemmatize(each)` to lemmatize each token in `tokens`; append the output to our `lemmas` list;  
5. Finally, `return` the list.

In [None]:
# Your code here






Run the following line of code to see if it worked.

In [None]:
trp_clean[0][20:30]

In [None]:
# Run this
trp_lemmas = [lemmatize(tokens) for tokens in trp_clean]
trp_lemmas[0][20:30]

### Forcing to string
Sometimes, when we have a list, we actually want a string. For instance, some libraries of NLP tools require strings as input. In those cases, we can force lists into strings by using the string `.join()` method: we call it on a separator (here, a space) and pass the list as its argument. Let's use it to turn the first entry of our `trp_lemmas` list into a string.

In [None]:
trp_str = ' '.join(trp_lemmas[0])
trp_str

## Putting it all together
After all that, you should be well-equipped to understand this preprocessing function. It takes a DataFrame in, drops the removed, deleted and missing values, then lowercases the selftext, removes punctuation, tokenizes and lemmatizes it. It then spits the text of all posts back out as one string.

In [None]:
def preprocessing(df):
    """Lowercase, strip punctuation, tokenize and lemmatize every selftext; return one string."""
    # drop removed, deleted and missing posts
    clean = df[~df['selftext'].isin(['[removed]', '[deleted]'])].dropna(subset=['selftext'])
    n_posts = len(clean)
    processed = []
    for counter, text in enumerate(clean['selftext'], start=1):
        # turn to lowercase
        text = text.lower()
        # remove punctuation
        text = ''.join(ch for ch in text if ch not in string.punctuation)
        # tokenize
        tokens = word_tokenize(text)
        # lemmatize
        lemmas = ' '.join(wordnet_lemmatizer.lemmatize(token) for token in tokens)
        # save
        processed.append(lemmas)
        if counter % 100 == 0:
            print("Saved " + str(counter) + " out of " + str(n_posts) + " entries")
    # join with spaces so the last word of one post does not fuse with the first word of the next
    return ' '.join(processed)

Let's run our function on the first 1000 entries of our DataFrames (just to save some time).

In [None]:
trp_pp = preprocessing(trp[:1000])

In [None]:
sed_pp = preprocessing(sed[:1000])

In [None]:
mgtow_pp = preprocessing(mgtow[:1000])

In [None]:
trp_pp[:1000]

## Tf-idf
Tf-idf or TFIDF, short for *term frequency-inverse document frequency*, is a numerical statistic that reflects how important a word is to a document in a collection or corpus.
Tf-idf is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document (the term frequency, or tf), and is offset by the number of documents in the corpus that contain the word (the inverse document frequency, or idf). This helps to adjust for the fact that some words, such as articles and prepositions, appear more frequently in general.
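To make the arithmetic concrete, here is a minimal hand-rolled sketch in plain Python. It uses one common variant of the formula, the smoothed idf that scikit-learn documents as its default: `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the word. The three mini-documents and the helper name `tfidf` are made up for illustration, and the sketch skips the length normalization that scikit-learn applies afterwards.

```python
import math
from collections import Counter

# Three tiny hypothetical "documents", already tokenized
docs = [
    "the snake ate mice".split(),
    "the comedy show aired".split(),
    "the snake was long".split(),
]
n_docs = len(docs)

# document frequency: in how many documents does each word occur?
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return a {word: tf-idf score} dictionary for one tokenized document."""
    tf = Counter(doc)  # raw term counts
    return {w: count * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, count in tf.items()}

scores = tfidf(docs[0])
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:6} {score:.3f}")
```

In this toy corpus "the" occurs in every document and ends up with the lowest score, while "ate" and "mice", which occur in only one document, score highest.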

### Testing tf-idf with a toy dataset

Let's try tf-idf out with a toy dataset. Here we have six documents about "Python", but with different meanings.

In [None]:
document1 = """Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen."""

document2 = """Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known."""

document3 = """Monty Python (also collectively known as the Pythons) are a British 
surreal comedy group who created the sketch comedy television show Monty Python's 
Flying Circus, which first aired on the BBC in 1969. Forty-five episodes were made 
over four series."""

document4 = """Python is an interpreted, high-level, general-purpose programming language. 
Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes 
code readability with its notable use of significant whitespace. Its language constructs and 
object-oriented approach aim to help programmers write clear, logical code for small and 
large-scale projects."""

document5 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment."""

document6 = """The Pythonidae, commonly known simply as pythons, from the Greek word python 
(πυθων), are a family of nonvenomous snakes found in Africa, Asia, and Australia. 
Among its members are some of the largest snakes in the world. Eight genera and 31
species are currently recognized."""

test_list = [document1, document2, document3, document4, document5, document6]

We will be using scikit-learn's `TfidfVectorizer`. It is a class that allows us to create a matrix of word counts and immediately transform them into tf-idf values. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for the documentation if you want to learn more.

In the second line below, we instantiate an object of the vectorizer. Then, we run it by applying the `fit_transform` method to our `test_list`.

In [None]:
# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, decode_error='ignore', stop_words='english',smooth_idf=True,use_idf=True)

# fit and transform the texts
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(test_list)

Let's have a peek at our matrix by running the `.toarray()` method. 
This shows us one value per word in the total vocabulary.

Notice that we're printing the vector at index 2: this shows us the tf-idf features of all the words in `document3` (due to zero-based indexing). All the zero values are simply words that are not present in `document3`!


In [None]:
tfidf_vectorizer_vectors.toarray()[2]

We can also have a look at some of the words in our total vocabulary by running the `get_feature_names_out()` method (in scikit-learn versions before 1.0, this method was called `get_feature_names()`). The vocabulary is sorted alphabetically, with numbers first. Let's grab the first few.

In [None]:
tfidf_vectorizer.get_feature_names_out()[:10]

The second word is '1969', which, as the printout of our `.toarray()` shows, is a distinctive word for Monty Python (the year their TV show first aired).

### Putting distinctive words in a DataFrame
We can now take out one vector (i.e., the tf-idf values of one text) that `.fit_transform()` yielded. We can put them in a DataFrame, and print out that DataFrame after sorting it based on the highest score.

In [None]:
# get the vector for the third document
vector_tfidfvectorizer = tfidf_vectorizer_vectors[2] # Note that 2 refers to document3, due to zero-based indexing

# place tf-idf values in a DataFrame
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)[:10]

Looks like it works! Through tf-idf, we have found the words that are most-distinctive of `document3`!

Note that we have used `TfidfVectorizer` here, which internally computes word counts, idf values, and tf-idf scores for our dataset. If you also want to keep the term frequency (term count) vectors for other tasks, you can instead compute the counts with `CountVectorizer` and then turn them into tf-idf scores with `TfidfTransformer`. See e.g. [here](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XviBUJMzbzg) for more information.
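As a small illustration of that two-step route, the sketch below runs `CountVectorizer` followed by `TfidfTransformer` on a made-up mini-corpus and checks that the result matches what `TfidfVectorizer` produces in one go. It uses default settings throughout; the variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Hypothetical mini-corpus
texts = ["the snake ate the mouse", "the comedy show was funny", "the snake was long"]

# Step 1: raw term counts, which you could reuse for other tasks
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(texts)

# Step 2: turn the counts into tf-idf weights
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(texts)

# Compare the two matrices element by element
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The advantage of the two-step version is that `counts` stays available, for example for plain word-frequency analysis.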

## Using tf-idf on Reddit datasets
Tf-idf is a basic but intuitive way to find words that are typical of a particular subreddit, when compared to other comparable subreddits.

We'll use scikit-learn's tf-idf functionality to find distinctive words for each document (i.e. subreddit). So we'll treat each subreddit we feed into the `TfidfVectorizer` as a document.

In [None]:
reddit_list = [trp_pp, sed_pp, mgtow_pp]

Your turn! Using `TfidfVectorizer`, repeat what we did with our test data, but this time with the `reddit_list`! It's exactly the same procedure – only the name of the list changes.

In [None]:
# Your code here











As you can see, there are still some HTML artifacts left, such as "nbsp". At a later point we'll look at how to remove these annoying tokens. 
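Tokens like "nbsp" are what remains of HTML entities such as `&nbsp;` once the punctuation has been stripped. One possible way to avoid them, sketched here with a made-up example string, is to decode the entities with the standard library's `html.unescape()` before any other preprocessing.

```python
import html

raw = "Rock&nbsp;&amp;&nbsp;roll isn&#39;t dead"

# html.unescape turns entities such as &amp; and &nbsp; back into characters
decoded = html.unescape(raw)

# &nbsp; decodes to a non-breaking space (\xa0); swap it for a regular space
decoded = decoded.replace("\xa0", " ")
print(decoded)  # Rock & roll isn't dead
```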

## Bonus: Using TF-IDF to find similar documents
*Note: the below code is a bit more advanced, and for demonstration purposes. Don't worry if you don't fully get it!*

We can also use tf-idf to work out the similarity between any pair of documents. So given one post or comment, we could see which posts or comments are most similar. This can be useful if you're trying to find other examples of a pattern you have found and want to explore further.

This time, our "documents" will not be entire subreddits, but posts/submissions within one subreddit. Let's import the submissions and run the vectorizer without the preprocessing and lemmatizing. Tf-idf will still work this way, and this way, we will be able to read our posts.

In [None]:
# min_df=1 (the default) keeps every term; recent scikit-learn versions raise an error for min_df=0
tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')
word_count_vectors = tfidf_vectorizer.fit_transform([post for post in trp['selftext']])

We'll start by finding a post with a clear topic. Let's grab the 1st entry in our dataframe:

In [None]:
trp['selftext'].iloc[0]  # .iloc selects by position, which still works after rows have been dropped

This one seems to be about Joe Rogan, an American celebrity. Let's have a quick look at the tfidf scores for the words in this submission to see if his name is indeed typical for this particular submission:

In [None]:
# get a vector out
vector_tfidfvectorizer = word_count_vectors[0] # change this number if you want to pick out a different vector / text

# place tf-idf values in a pandas data frame
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)[:10]

Looks like it is.

Now let's find some similar document(s). 
The fact that our documents are now in a vector space is convenient: it allows us to make use of mathematical similarity metrics.

**Cosine similarity** is one metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

Don't worry too much about this function for now: just run it and let's see how it works!

In [None]:
def find_similar(word_count_vectors, index, top_n=5):
    """Return (row, score) pairs for the `top_n` documents most similar to row `index`."""
    # dot product of the chosen row with every row of the matrix
    cosine_similarities = linear_kernel(word_count_vectors[index:index+1], word_count_vectors).flatten()
    # sort from most to least similar, skipping the document itself
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(i, cosine_similarities[i]) for i in related_docs_indices[:top_n]]

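The above function finds similar documents. It uses scikit-learn's `linear_kernel`, which computes the dot product between vectors. Because `TfidfVectorizer` scales each document vector to length 1 by default, this dot product is equal to the cosine similarity.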
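To get a feel for what cosine similarity measures, here is a small sketch in plain Python with made-up three-word "documents" (the vectors and helper names are hypothetical). It shows that a document and a ten times longer version of it are maximally similar, and that the dot product of length-1 vectors gives the same number as the cosine similarity.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(a):
    """Scale a vector to length 1 (L2 normalization)."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

short_doc = [1, 2, 0]    # hypothetical word weights
long_doc = [10, 20, 0]   # same proportions, ten times "longer"
other_doc = [0, 1, 3]    # mostly different words

print(round(cosine_similarity(short_doc, long_doc), 3))   # 1.0
print(round(cosine_similarity(short_doc, other_doc), 3))  # 0.283

# After normalization, a plain dot product gives the same value
print(round(dot(normalize(short_doc), normalize(other_doc)), 3))  # 0.283
```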

We can now throw the resulting scores and similar posts in a list, feed that list into a DataFrame, and check out one of them to see if it works:

In [None]:
cosine = []
for index, score in find_similar(word_count_vectors, 0):
    cosine.append(
        {'cos_score': score,
         # .iloc selects by position, matching the row numbers of the tf-idf matrix
         'text': trp['selftext'].iloc[index]
        }
    )
cosine_df = pd.DataFrame(cosine)
cosine_df

In [None]:
cosine_df['text'][1]

This post does seem comparable! It's also about Joe Rogan.

## Reflection: hypothesis generation using tf-idf

Think about a hypothesis or research question you could construct about your own subreddit based on these distinctive words and related posts.