<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 3-3: Topic modeling <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>
</div>


# Topic modeling

Topic modeling is a type of statistical modeling for the discovery of abstract "topics" that occur in a collection of documents. It is frequently used in NLP to aid the discovery of hidden semantic structures in a collection of texts.

By the end of this notebook, you should:

* have practiced creating and visualising topic models using scikit-learn;
* have practiced close reading the top documents associated with topics of interest.

There are lots of Python packages for topic modeling. We will use scikit-learn.
If you have access to your own Reddit dataset, feel free to use that instead.

The example dataset once again comes from the banned subreddit "The Red Pill".


## Importing libraries

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text
import re
import pandas as pd
import numpy as np
from more_itertools import chunked
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize
from nltk.corpus import brown
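!pip install pyLDAvis
# Note: pyLDAvis 3.4 and later renamed this module to `pyLDAvis.lda_model`;
# if the import below fails, use `import pyLDAvis.lda_model` instead.
import pyLDAvis.sklearn 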
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

## Importing data

Our (abridged) DataFrame includes 10,000 posts and 10,000 comments, sorted by the highest score. We have two files: one for the submissions, and one for the comments.

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
downloaded = drive.CreateFile({'id':"1hN5eqCYVZOX_O0i8waJUQxDlJjqDMenK"})   # replace the id with id of file you want to access
downloaded.GetContentFile('TRP_submissions.csv')       

In [None]:
downloaded = drive.CreateFile({'id':"1nY9JtXoGJa7B-OmU6afh4qfcFGPCQIHW"})   # replace the id with id of file you want to access
downloaded.GetContentFile('TRP_comments.csv')        

In [None]:
trp_sub = pd.read_csv("TRP_submissions.csv", lineterminator='\n')
trp_com = pd.read_csv("TRP_comments.csv", lineterminator='\n')

Let's have a quick look at the comments:

In [None]:
trp_com.sort_values(by=['score'], ascending=False).head()

How many comments do we have?

In [None]:
len(trp_com)

## Preprocessing
Let's have a look at the text in our `trp_com['body']` column.

In [None]:
trp_com['body'][1]

Looks like we have a bit of cleanup to do...

### Data cleaning using RegEx
Regular Expressions are often used to clean up data: special characters, newlines, and so on. Here we use RegEx to collapse newlines and other runs of whitespace into single spaces, and to remove single quotes, in our `trp_com['body']` column:

In [None]:
# Collapse newlines and other runs of whitespace into a single space
trp_com['body'] = [re.sub(r'\s+', ' ', sent) for sent in trp_com['body']]
# Remove distracting single quotes
trp_com['body'] = [re.sub(r"\'", "", sent) for sent in trp_com['body']]

In [None]:
trp_com['body'][1]
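
To see what the two substitutions do without loading the dataset, here is a minimal sketch on a made-up sentence (the `sample` string is invented for illustration and is not taken from the corpus):

```python
import re

sample = "I don't know.\n\nIt's   been a\tlong day."

# Collapse every run of whitespace (newlines, tabs, repeated spaces) into one space
cleaned = re.sub(r"\s+", " ", sample)
# Drop single quotes, so "don't" becomes "dont"
cleaned = re.sub(r"'", "", cleaned)

print(cleaned)
```

Note that removing the quotes also merges contractions into single tokens ("dont", "Its"), which affects how they are tokenized and tagged later on.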

### Getting a slice of data
Next, let's create a small list for testing purposes from the texts in one of our DataFrames. We can easily create lists from DataFrame columns using the Pandas `.tolist()` method.

In [None]:
trp_test = trp_com['body'][:10].tolist()
trp_test[1]

### POS tagging & filtering

POS refers to "Part Of Speech". Traditional grammar distinguishes eight parts of speech in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. A word's part of speech indicates how it functions within the sentence, both grammatically and in terms of meaning.

NLTK and other libraries allow us to POS tag our tokens. This is pretty straightforwardly done using the `nltk.pos_tag()` function.

Instead of creating our topic model from all parts of speech in our corpus, we can limit it to nouns. This is a way to focus on thematic information, though it does not capture attitudes towards the theme, or other nuances that may be very important to our analysis. 

*Make sure that, when you create your own topic models, you think about which kinds of topics you aim to find: if you want attitudes, for instance, you might be more interested in adjectives and adverbs.*


In [None]:
# Run me to see how it works
tokens = ["that", "is", "the", "most", "foul", "cruel", "and", "bad-tempered", "rodent", "you", "ever", "set", "eyes", "on"]
pos = nltk.pos_tag(tokens)     
pos

As you can see, `.pos_tag()` returns a list of tuples, with the word at the 0th index, and a POS tag at the 1st index.

This is useful, as we can create a new list which only includes certain tags. Let's focus on nouns for now: singular common nouns are tagged 'NN', while plural and proper nouns have their own tags that also start with 'NN' ('NNS', 'NNP', 'NNPS'). For a rundown of all the NLTK POS tags, see [here](https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/).

In [None]:
nouns = []
for tup in pos:   
  if tup[1] == "NN":
    nouns.append(tup[0])
nouns

As you see, POS tagging is not perfect.
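
Filtering on `== "NN"` keeps singular common nouns only. As a minimal sketch of a broader filter, the helper below (a hypothetical `filter_by_tag`, not part of NLTK) keeps every word whose tag starts with a given prefix. The `tagged` list is hand-written to stand in for the output of `nltk.pos_tag`, so the example runs without any NLTK downloads:

```python
# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag
tagged = [("the", "DT"), ("rodents", "NNS"), ("are", "VBP"), ("foul", "JJ"),
          ("and", "CC"), ("London", "NNP"), ("is", "VBZ"), ("a", "DT"), ("city", "NN")]

def filter_by_tag(tagged_tokens, prefixes):
    """Keep the words whose POS tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

all_nouns = filter_by_tag(tagged, ["NN"])    # NN, NNS, NNP and NNPS
adjectives = filter_by_tag(tagged, ["JJ"])   # JJ, JJR and JJS

print(all_nouns)
print(adjectives)
```

The same helper with `["JJ", "RB"]` would keep adjectives and adverbs, which may be more useful if you are after attitudes rather than themes.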

Your turn! Create a function called `pos_tagger`. It takes one parameter called `text`, which is a string.
1. Use `re.sub` to get rid of newlines and special characters; 
2. Use `word_tokenize` to tokenize the string; 
3. Use `nltk.pos_tag` to tag the tokens;
4. Put only the nouns (so *not* the tags!) in a new list;
5. `return` that list.

In [None]:
# Your code here





Now, run your function over the `trp_test` list we just created, and print out some nouns to see if it worked (remember to run your function in a `for`-loop!)

In [None]:
trp_nouns = [pos_tagger(each) for each in trp_test]
trp_nouns[1]

Now, let's apply the function to our entire DataFrame and store the result in a new column, so that you can work with either the cleaned-up or the original text. This is generally good practice.

1. Create a new column in our `trp_com` DataFrame, called "body_nouns";
2. Assign it to the output of your `pos_tagger` function, which you loop over each comment in `trp_com['body']`. Make sure to use `' '.join()` on the output of the `pos_tagger` first, so that you get a string.

Have a look at your DataFrame once you're done to see if you succeeded.



In [None]:
# Your code here






## Topic modeling
Time to build our topic model! Before we do so, we need to turn our corpus into word counts.

### Using `CountVectorizer`
We use scikit-learn's `CountVectorizer` from last week's tf-idf exercise again, but this time we only look at raw term frequencies (TF). The result is a document-term matrix: one row per comment, one column per word in the (filtered) vocabulary, with word counts in the cells.

Note that we set `max_features` to 1000, which means we only keep the 1000 words with the highest total term frequency across the corpus. We also ignore words that occur in fewer than two documents (`min_df=2`) and words that occur in more than 95% of the documents (`max_df=0.95`), and we drop common English stop words (`stop_words='english'`).
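
The two document-frequency thresholds are easy to mix up with raw word counts. The sketch below reproduces the filtering rule in plain Python on a four-document toy corpus. It illustrates the rule only, not scikit-learn's actual implementation, and assumes simple whitespace tokenization:

```python
from collections import Counter

docs = ["cat dog", "cat bird", "cat dog fish", "cat"]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

min_df = 2      # absolute count: keep terms found in at least 2 documents
max_df = 0.95   # proportion: drop terms found in more than 95% of documents

kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df / len(docs) <= max_df
)
print(kept)
```

Here "cat" is dropped because it appears in every document, "bird" and "fish" are dropped because they appear in only one, and "dog" survives.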

In [None]:
# We're only training for 1000 features (i.e., the most frequent words). Feel free to change this.
no_features = 1000

# Using TF vectorizer to get top terms
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(trp_com['body_nouns'])
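# scikit-learn 1.2 removed `get_feature_names()`; on versions before 1.0, use that method instead
tf_feature_names = tf_vectorizer.get_feature_names_out()

Note that `tf_feature_names`, which we got through scikit-learn's `.get_feature_names_out()` method, is just an array of the words in our 1000-word vocabulary.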

In [None]:
len(tf_feature_names)

In [None]:
tf_feature_names[:10]

### Topic modeling using `LatentDirichletAllocation` 

Next, we run scikit-learn's `LatentDirichletAllocation` class. Note that we can choose how many topics we want to find, which is by far the most important parameter to set when creating a topic model. Let's start with 10.

Some other parameters to understand:
- `max_iter` sets the maximum number of passes over the data when fitting the model.
- `random_state` seeds the random number generator used by scikit-learn. Fixing it (here to 0) makes the topics reproducible between runs.

For more info, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).


In [None]:
# We're only training for 10 topics in our topic model (feel free to change this)
no_topics = 10

# Run LDA
lda_model = LatentDirichletAllocation(
    n_components=no_topics,
    max_iter=5,
    learning_method='online',
    learning_offset=50.,
    random_state=0,
).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

In [None]:
len(lda_W)

Okay, what variables have we created here? 
- `lda_model` is our model: we first set its parameters, and then call `.fit(tf)` to *fit* it to our TF matrix;
- `lda_W` is a documents-to-topics matrix (the probability distribution of the topics present in each document, or in our case, comment), so it has 10,000 rows (the number of comments we have here) and one column per topic;
- `lda_H` is a topics-to-words matrix (the weight of each word in each topic; dividing a row by its sum turns it into a probability distribution over words), so it has 10 rows (the number of topics we decided to infer) and one column per word in our vocabulary.

We can use these two matrices to print out the most significant words for each topic in the next step.
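
To make the shapes concrete, here is a small sketch with invented numbers standing in for the two matrices (`toy_W` and `toy_H` are hypothetical; the real ones come from `lda_model.transform(tf)` and `lda_model.components_`). It also shows the row normalization that turns topic-word weights into probabilities:

```python
import numpy as np

# Toy stand-ins: 3 documents, 2 topics, 4 vocabulary words (values invented)
toy_W = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
toy_H = np.array([[4.0, 1.0, 0.5, 0.5],
                  [0.5, 0.5, 6.0, 3.0]])

print(toy_W.shape)  # (documents, topics)
print(toy_H.shape)  # (topics, words)

# Each row of W sums to 1: it is a topic distribution for one document
row_sums_W = toy_W.sum(axis=1)

# Dividing each row of H by its sum gives a word distribution for one topic
topic_word_probs = toy_H / toy_H.sum(axis=1, keepdims=True)
print(topic_word_probs.round(2))
```

The ranking of words within a topic is the same before and after normalization, so the top-word functions further down can work on the raw weights directly.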

### Displaying the topics

So we have a topic model. But how to display it?
We'll write a `display_topics()` function, which takes both the words-to-topics matrix (`H`) and the `feature_names` as parameters.

Our `display_topics` function prints a numerical index as the topic name, followed by the top words in the topic. NumPy's `argsort()` method returns the indices that would sort a row of the matrix in ascending order of weight; the slice `[:-no_top_words - 1:-1]` then takes the last `no_top_words` of those indices in reverse, so the highest-weighted words come first.
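
The `argsort()` slice is compact but easy to misread, so here is a minimal sketch on a single invented topic row (the `vocab` and `weights` values are made up for illustration):

```python
import numpy as np

vocab = ["car", "gym", "job", "money"]
weights = np.array([0.1, 0.5, 0.2, 0.9])  # one topic's weight per word
no_top_words = 2

order = weights.argsort()             # indices from smallest to largest weight
top = order[:-no_top_words - 1:-1]    # walk backwards from the end, taking 2 indices
top_words = [vocab[i] for i in top]

print(order)
print(top_words)
```

An equivalent and arguably more readable form is `weights.argsort()[::-1][:no_top_words]`: reverse the order first, then take the first few indices.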

In [None]:
def display_topics(H, feature_names,no_top_words):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [None]:
# Now print out the top words for each topic
no_top_words = 10
display_topics(lda_H, tf_feature_names, no_top_words)

Here we have our 10 topics and the most-associated words. While these topics are probably not very accurate (we used only a very small dataset), we could derive some insights from this already: for instance, topic 8 seems to be about posts in which members of The Red Pill discuss how men are turned into "betas" through the force of feminism. 

You can see that, in order to make sense of the topics you create, you have to understand the lingo and logic of a particular community – hence the annotations you've been doing!

### Retrieving top documents per topic

The output of our `display_topics` function involved assigning a numeric label to the topics and printing out the top words in each topic. This is common practice. However, just displaying the top words in a topic may not help us to understand what each topic is about or determine the *context* in which these words are used.

So, let's define a function that prints both the topics and their top documents.

This function also needs the documents-to-topics matrix (`W`), the original "document" collection (our `trp_com['body']` column), and the number of top documents (`no_top_documents`), in addition to the words (`feature_names`) and the number of top words (`no_top_words`). For each topic, it prints the top words and then the top documents, that is, the words and documents with the highest weights for that topic in the two matrices.

The function prints our 10 topics again, but this time followed by the top documents associated with each of them.
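
The document-selection step can be tried on a toy matrix first. In this sketch, `toy_W` and `toy_docs` are invented stand-ins for `lda_W` and the comment texts. It also shows a related view, the single strongest topic for each document:

```python
import numpy as np

toy_docs = ["doc A", "doc B", "doc C", "doc D"]
toy_W = np.array([[0.7, 0.3],
                  [0.1, 0.9],
                  [0.4, 0.6],
                  [0.8, 0.2]])

topic_idx = 0
no_top_documents = 2

# Sort documents by their weight for this topic, highest first
ranked = np.argsort(toy_W[:, topic_idx])[::-1]
top_doc_indices = ranked[:no_top_documents]
top_docs = [toy_docs[i] for i in top_doc_indices]
print(top_docs)

# The other way round: which topic dominates each document?
dominant_topic = toy_W.argmax(axis=1)
print(dominant_topic)
```

Keep in mind that a "top document" is simply the one with the largest share of that topic; very short comments can score highly on the strength of a single word.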

In [None]:
def display_topic_docs(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])

# We're printing 10 top words per topic, and 3 "most representative" documents per topic. Feel free to change this.
no_top_words = 10
no_top_documents = 3
display_topic_docs(lda_H, lda_W, tf_feature_names, trp_com['body'], no_top_words, no_top_documents)

### Visualizing topics using pyLDAvis
We can visualize our topic model using pyLDAvis:

In [None]:
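# On pyLDAvis 3.4 and later, use `pyLDAvis.lda_model` in place of `pyLDAvis.sklearn` (here and in `prepare` below)
import pyLDAvis.sklearn 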

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, tf, tf_vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda.html')
panel

To the left, you see your topics, represented as bubbles. To the right, you see the top words based on overall term frequency. You can click on the bubbles to see the most-prevalent words within particular topics.

Using the λ slider, you can rank the terms according to their relevance. By default (λ = 1), the terms of a topic are ranked in decreasing order of their topic-specific probability. Moving the slider towards 0 gives more weight to terms that are distinctive for the topic, that is, frequent within the topic relative to their frequency in the whole corpus. The suggested "optimal" value of λ is 0.6.
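
The relevance score behind the slider is defined in the LDAvis paper (Sievert & Shirley, 2014) as λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The sketch below computes it for two invented terms; the probabilities are made up for illustration, and the `relevance` function is a hand-written helper, not a pyLDAvis API:

```python
import numpy as np

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of terms for a topic, following Sievert & Shirley (2014)."""
    return (lam * np.log(p_word_given_topic)
            + (1 - lam) * np.log(p_word_given_topic / p_word))

terms = ["time", "gym"]
p_w_t = np.array([0.10, 0.05])   # probability of each term within the topic
p_w = np.array([0.08, 0.01])     # probability of each term in the whole corpus

for lam in (1.0, 0.6, 0.0):
    scores = relevance(p_w_t, p_w, lam)
    print(lam, terms[int(scores.argmax())], scores.round(3))
```

With λ = 1 the more frequent term ranks first; as λ decreases, the term that is rare in the corpus overall but comparatively common in the topic takes over.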

*Note: a "good" topic model will have non-overlapping, fairly big sized blobs for each topic.*

**OPTIONAL CHALLENGE: sorting data by thread!**

If you have time left, have another look at your original two DataFrames. Try to think of a way to use the IDs of both comments and submissions to create a list in which the original submission and comments are put together. Tip: look into the `pd.merge()` method.


In [None]:
# Your code here






## Exercise: from distant to close reading

- Play around with the number of topics, and the number of tokens being processed per "document";
- Try to locate interesting topics, and their associated top documents;
- Close read the top documents associated with your topic of interest, paying attention to concerns we discussed this week:
    - Formal aspects; e.g. emotive VS rationalist language, specialised lexicon
    - Russian Formalists: *defamiliarization*
    - Derrida: *breaking down implied dichotomies*
    - Barthes: *myth-making*
    
- **If you've been able to download your own Reddit dataset, try to use that data!**