# SISU Digital Humanities: Textual and Language Analysis on Social Media
### Distant Reading using NLTK and Pandas
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)


# Distant Reading

This notebook focuses on methods to engage in a basic distant reading using NLTK. It also reiterates some basic Pandas operations from the previous notebook. By the end of this notebook, you will:

* Know how to open and perform simple operations on a DataFrame;
* Use NLTK's `Text()` object to perform some basic distant reading operations on a corpus of MOOC comments.


## Importing packages

Let's start by importing some packages:

In [None]:
import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist  # importing from nltk.book would also try to load the NLTK book corpora
from nltk.text import Text
import pandas as pd
from datetime import datetime
import collections
import string

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for
nltk.download('wordnet')
nltk.download('stopwords')

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

## Pandas basics & working with MOOC data

Let's get the data, which is taken from the MOOC Intercultural Communication (https://centerforinterculturaldialogue.org/2021/03/17/mooc-intercultural-communication-2021-china/). 

After loading the CSV file into a DataFrame (df), we can use the `.head()` method to get its first n rows. The default is 5; the optional argument (here written out as 5) sets how many rows we want to print.

In [None]:
df = pd.read_csv("data/icc3-comments.csv") 

In [None]:
df.head(5)

Here's what we're seeing. Pay special attention to the "NaN" labels, indicating missing values. 

This particular dataset includes the comments on the MOOC under the "text" column, which is what we're most interested in.

Other columns contain valuable metadata you can use in your analyses, such as the "likes" column (i.e., the number of likes a comment has received), the "timestamp" column, which indicates when the comment was written, and the "step" column, which refers to pedagogically organised "Learning Steps" focusing on particular activities.

### Sorting a DF
Using the `.sort_values()` method we can sort the df by particular columns. We use two parameters: the `by` parameter indicates by which column we want to sort, and the `ascending` parameter indicates whether we sort in ascending or descending order.

Here, I'm assigning my sorted DataFrame to the same variable `df`, effectively overwriting the old version. These are the top 10 comments based on the number of "likes" they have received!

In [None]:
df = df.sort_values(by=['likes'], ascending=False)
df.head(10)
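
As a minimal sketch of how `by` and `ascending` work together, here is the same operation on a toy DataFrame. The rows are invented for illustration; only the column names "text" and "likes" follow the dataset described above.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["first comment", "second comment", "third comment"],
    "likes": [2, 15, 7],
})

# Sort from most to least liked; each row keeps its original index label
toy_sorted = toy.sort_values(by="likes", ascending=False)
print(toy_sorted)

# After sorting, the index labels are no longer in the order 0, 1, 2
print(list(toy_sorted.index))
```

Note that the index labels travel with their rows, which matters when you later look up an entry by its label.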

### Selecting a column
To select a single column of data, put the name of the column between square brackets. Let’s select the "text" column. Because we just sorted the DataFrame, its row labels are no longer in order, so we use `.iloc[0]` to print the first entry by position:

In [None]:
df['text'].iloc[0]

As you see, using the `[]` operator selects a set of rows and/or columns from a DataFrame.
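
The difference between selecting by position and selecting by label is easy to miss once a DataFrame has been sorted. The sketch below uses a made-up three-row DataFrame (the values are assumptions for illustration) to show both.

```python
import pandas as pd

# Toy data, invented for illustration; sorted so that labels and positions differ
toy = pd.DataFrame(
    {"text": ["a", "b", "c"], "likes": [2, 15, 7]}
).sort_values(by="likes", ascending=False)

first_by_position = toy["text"].iloc[0]  # first row in the current order
row_with_label_0 = toy["text"].loc[0]    # the row whose index label is 0
top_two = toy["text"].iloc[:2]           # explicit positional slice

print(first_by_position, row_with_label_0, list(top_two))
```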

Your turn! Use slicing to retrieve the first 10 "text" entries in our DataFrame.

In [None]:
# Your code here

df['text'][:10]

One thing we often do when we’re exploring a dataset is filtering the data based on a given condition. For example, we might need to find all the rows in our dataset that have received more than 10 "likes". We can use `.loc[]` to do so.

In [None]:
df.loc[df['likes'] > 10]

`.loc[]` is a powerful method that can be used for all kinds of research purposes, if you want to filter or prune your dataset based on some condition. For more info, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).
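
Under the hood, the condition inside `.loc[]` is a Boolean Series (a "mask") with one True/False value per row. The sketch below, on invented toy data, shows the mask on its own and how two conditions can be combined with `&`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["short", "a much longer comment", None, "another one"],
    "likes": [12, 3, 40, 11],
})

# The condition evaluates to one True/False value per row
mask = toy["likes"] > 10
print(list(mask))

# Rows where the mask is True
popular = toy.loc[mask]

# Combine conditions with & and select a single column at the same time
popular_texts = toy.loc[mask & toy["text"].notna(), "text"]
print(list(popular_texts))
```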

Your turn! Use `.loc[]` to retrieve only the comments that have received over 200 likes.

### Removing rows
Missing values (`NaN`) in a DataFrame can cause a lot of errors. In general, it's a good idea to get rid of those rows whose "text" is missing. This works as follows:

In [None]:
print("length of df before dropping rows: " + str(len(df)))
clean_df = df.dropna(subset=['text'])
print("length of df after dropping rows: " + str(len(clean_df)))
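
The `subset` parameter matters here: without it, `.dropna()` removes every row that has a missing value in any column. A small sketch on invented toy data shows the difference, and also that `.dropna()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Toy data, invented for illustration: one missing text, one missing like count
toy = pd.DataFrame({
    "text": ["hello there", None, "bye"],
    "likes": [1, 5, None],
})

# Drop only the rows whose "text" is missing
only_text_checked = toy.dropna(subset=["text"])

# Drop rows with a missing value in any column
any_missing_dropped = toy.dropna()

print(len(toy), len(only_text_checked), len(any_missing_dropped))
```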


## Distant reading with NLTK 
Let's use NLTK's `word_tokenize()` function, which we imported at the beginning of this notebook.
`word_tokenize()` works like this:


In [None]:
word_tokenize("He is a lumberjack and he is okay. He sleeps all night and he works all day.")

Your turn! Let's tokenize our "text" column. Here's what you need to do: 
- Create a new list called `df_tokens`;
- Begin a for-loop that iterates over the "text" column of our `clean_df` DataFrame; 
- For each text in that column, tokenize it using `word_tokenize()`; 
- Add these tokenized words to our new `df_tokens` list using the list `.extend()` method*.
 
*We use `.extend` instead of `.append`. This is because we want one long list, instead of a list of lists. While `append` adds its argument as a single element to the end of a list – meaning the length of the list itself will increase by one – `extend` adds each element to the list, extending the list.
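
The difference between the two list methods can be checked with a couple of already-tokenized toy texts (made up for illustration):

```python
nested = []  # filled with .append()
flat = []    # filled with .extend()

for tokens in (["a", "b"], ["c"]):
    nested.append(tokens)  # adds the whole list as one element
    flat.extend(tokens)    # adds each token separately

print(nested)
print(flat)
```

Only the flat list is suitable as input for the frequency counts and concordances used later in this notebook.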

In [None]:
# Your code here

df_tokens = []

# Use clean_df so that rows with a missing "text" do not break the tokenizer
for text in clean_df['text']:
    df_tokens.extend(word_tokenize(text))

## The NLTK `Text()` class
Now, let's have a look at our data. NLTK provides a `Text()` class, which is a "wrapper" that allows for initial exploration of texts. It supports counting, concordancing, collocation discovery, etc. 

In [None]:
# Here, we create our NLTK Text object
df_t = Text(df_tokens)

Let's print out the "docstring" of NLTK's `Text()` object, as well as all the things you can do with this object. Have a read through this to see what it allows you to do!

In [None]:
help(Text)

### Word counts
Let's run a few of these methods. How often do people talk about a "problem"? Let's find out using the `.count()` method. Note that it counts exact matches of the token, so "Problem" and "problems" are not included.

In [None]:
df_t.count('problem')

### Concordances 
One of the most basic, but quite helpful, ways to quickly get an overview of the contexts in which a word appears is through a `.concordance()` view. 

In [None]:
df_t.concordance('problem', width=115)
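
To make the idea of a concordance (or "keyword in context" view) concrete, here is a rough pure-Python sketch. It is not how NLTK implements `.concordance()`; the function name `kwic` and the `window` parameter are made up for this illustration.

```python
def kwic(tokens, target, window=3):
    """Return (left context, keyword, right context) for each occurrence of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Toy sentence, split on spaces for simplicity
tokens = "The main problem is time . Another problem is money .".split()

for left, keyword, right in kwic(tokens, "problem"):
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>20} [{keyword}] {right}")
```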

### Collocations

A collocation is a sequence of words that often appear together. The `.collocations()` method can find these in our data.

In [None]:
df_t.collocations()
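
A simplified way to think about collocations is to count adjacent word pairs (bigrams) and keep the ones that repeat. The sketch below does only that, on a made-up sentence; NLTK's `.collocations()` goes further by filtering out stopwords and ranking pairs with a statistical association measure, so its output will differ.

```python
from collections import Counter

# Toy sentence, invented for illustration
tokens = "social media use and social media habits shape social life".split()

# Pair every token with the token that follows it
bigrams = Counter(zip(tokens, tokens[1:]))

# Keep only the pairs that occur more than once
repeated = [(pair, n) for pair, n in bigrams.most_common() if n > 1]
print(repeated)
```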

### Word plotting
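Using the `.dispersion_plot()` method we can visualize where in the text particular words occur: each mark shows the position (offset) of one occurrence. We have to feed it a list of words.

If the DataFrame is sorted by date before tokenizing, the plot lets us look "through time" to see whether particular words start (dis)appearing in our data. Note that our `df` is currently sorted by likes, so the horizontal axis here does not represent time.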

In [None]:
df_t.dispersion_plot(["problem", "issue", "worry"])
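
To get a chronological token list, the DataFrame has to be sorted by time before the tokens are collected. The sketch below shows this on invented toy data; the timestamp format is an assumption, and plain `.split()` stands in for `word_tokenize()` to keep the example self-contained. The resulting offsets are the kind of positions a dispersion plot draws.

```python
import pandas as pd

# Toy data, invented for illustration; the timestamp format is an assumption
toy = pd.DataFrame({
    "timestamp": ["2021-03-03 10:00", "2021-03-01 09:00", "2021-03-02 12:00"],
    "text": ["no problem now", "a problem here", "all fine today"],
})

# Convert the strings to real datetimes, then sort from oldest to newest
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
by_time = toy.sort_values(by="timestamp")

# Collect the tokens in chronological order
tokens = []
for text in by_time["text"]:
    tokens.extend(text.split())

# Positions at which the word occurs in the chronological token list
offsets = [i for i, tok in enumerate(tokens) if tok == "problem"]
print(offsets)
```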

### Similar words
Using the `.similar()` method we can look at "distributional similarity": finding other words which appear in the same contexts as the specified word.
 

In [None]:
df_t.similar('problem')

### Common context
The `.common_contexts()` method allows us to study the contexts that two or more words share, that is, the pairs of words that appear directly before and after each of them. We pass the words as a list: square brackets inside the round brackets of the method call, with the words separated by commas. For instance, do "problem" and "job" appear in the same contexts?

In [None]:
df_t.common_contexts(['problem', 'job'])
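
A "context" here means the word directly before and the word directly after a given word. The rough sketch below collects these contexts for every word in a made-up sentence and intersects them for two words. It illustrates the idea only and is not NLTK's implementation; the helper name `contexts` is invented.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    ctx = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((left, right))
    return ctx

# Toy sentence, invented for illustration
tokens = "the problem is hard but the job is fun and a problem was solved".split()
ctx = contexts(tokens)

# Contexts in which both words occur
shared = ctx["problem"] & ctx["job"]
print(shared)
```

NLTK prints such a shared context in the form `the_is`, with the underscore marking the position of the word itself.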

### Frequency distributions
NLTK's `FreqDist` counts how often each token occurs. Its `.most_common()` method returns the n most frequent tokens together with their counts.

In [None]:
fdist = FreqDist(df_t)
fdist.most_common(30)

We can visualize these numbers.

In [None]:
fdist.plot(50, cumulative=False)
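
`FreqDist` is built on Python's `collections.Counter`, so the core idea can be tried out with the standard library alone. The sketch below uses a made-up sentence and also shows how to turn a raw count into a relative frequency.

```python
from collections import Counter

# Toy sentence, invented for illustration
tokens = "to be or not to be that is the question".split()

counts = Counter(tokens)

# The two most frequent tokens with their counts
top = counts.most_common(2)
print(top)

# Relative frequency: occurrences of a token divided by the total number of tokens
relative_to = counts["to"] / sum(counts.values())
print(relative_to)
```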

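Finally, let's have a look at words whose length is greater than 6 and whose frequency is greater than 1500. Note that `sorted()` orders these words alphabetically, so the slice `[:10]` gives us the first 10 in alphabetical order rather than the 10 most frequent ones.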

In [None]:
sorted(w for w in set(df_t) if len(w)>6 and fdist[w]>1500)[:10]
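
If a ranking by frequency is wanted instead of an alphabetical list, the filtered words can be sorted by their counts. The sketch below shows both orderings on a made-up token list, with much lower thresholds than above; the names `min_len` and `min_count` are invented for this example.

```python
from collections import Counter

# Toy tokens, invented for illustration
tokens = ("respect culture respect language respect "
          "culture language communication").split()
counts = Counter(tokens)

min_len, min_count = 6, 2  # illustrative thresholds

# Alphabetical order, as in the cell above
alphabetical = sorted(
    w for w in counts if len(w) > min_len and counts[w] >= min_count
)

# The same words, most frequent first
by_frequency = sorted(alphabetical, key=lambda w: counts[w], reverse=True)

print(alphabetical)
print(by_frequency)
```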