<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics
    </span>
    <br />
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 2-3: Distant reading on Reddit <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>
</div>

# Distant Reading on Reddit

This notebook focuses on some basics of Pandas, as well as some methods to engage in a simple distant reading using NLTK. By the end of this notebook, you will:

* Know how to open and perform simple operations on a DataFrame;
* Use NLTK's `Text()` object to perform some basic distant reading operations on a subreddit.

We'll be using data from the subreddit [r/seduction](https://via.hypothes.is/https://www.reddit.com/r/seduction/top/?t=all). The community describes itself as a space for "Help with dating, with a focus on how to get something started up, whether the goal is casual sex or a relationship. Learn how to connect with the ones you're trying to get with!"

Begin your investigation by taking 10 minutes to explore [some of the most popular posts](https://via.hypothes.is/https://www.reddit.com/r/seduction/top/?t=all) using hypothes.is. The dataset we'll be using only includes posts, so you can disregard the comments for now. 

When you have identified some themes, return here to take a more "distant" perspective.

## Importing packages

Let's start by importing some packages:

In [None]:
import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.text import Text
import pandas as pd
from datetime import datetime
import collections
import string

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

## Retrieving the dataset
Let's authenticate ourselves and get the dataset from Google Drive.


In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
downloaded = drive.CreateFile({'id':"1fOe3l9dLKb51jrwqUNOvwO4A7F7sM6Xx"})   # replace the id with id of file you want to access
downloaded.GetContentFile('seduction-submissions.csv')       

In [None]:
sed = pd.read_csv("seduction-submissions.csv", lineterminator="\n")

## Pandas basics & Working with Reddit data

Using the `.head()` method we can get the first n rows of a DataFrame. The default is 5. We can pass an *argument* (here 3) to indicate how many rows we want to display.

In [None]:
sed.head(3)

Remember what Pandas DataFrames look like? Pay special attention to the "NaN" labels, indicating missing values (we might want to get rid of them). Also remember the naming convention for the column and row axes (which Pandas uses when accessing particular rows/columns).
![df](http://www.digitalhermeneutics.com/wp-content/uploads/2020/07/df.png)
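To make the point about "NaN" labels concrete, here is a minimal sketch of how missing values can be counted and dropped. The `toy` DataFrame is made up for illustration (it only borrows two column names from our dataset), and dropping rows is just one possible way to deal with missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Reddit data: the column names mirror the real dataset,
# but the values are invented for illustration.
toy = pd.DataFrame({
    "selftext": ["first post", np.nan, "third post"],
    "score": [10, 3, 7],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where the post text itself is present
toy_clean = toy.dropna(subset=["selftext"])
print(toy_clean)
```

Passing `subset` restricts the check to the columns we care about, so rows are not discarded just because some unrelated metadata column is empty.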

This particular dataset only includes the original posts in the subreddit (so not the comments on the posts). The "selftext" column contains the actual posts.

Other columns contain valuable metadata you can use in your analyses, such as: 
- "created" (the time of the post's creation)
- "score" (number of upvotes minus downvotes)
- "textlen" (number of words)
- "num_comments" (number of comments)
- "flair_text" (a 'tag' that users within a subreddit can add)
- "augmented_count" (how often a user or moderator has edited the text)

### Sorting a DF
Using the `.sort_values()` method we can sort the df by particular columns. We use two parameters: the `by` parameter indicates by which column we want to sort, and the `ascending` parameter indicates whether we sort in ascending or descending order.

Here, I'm assigning my sorted DataFrame to the same variable `sed`, effectively overwriting the old version.

In [None]:
sed = sed.sort_values(by=['score'], ascending=False)
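Why do we need to assign the result back to a variable? A small sketch on made-up data (the `toy` DataFrame and its values are purely illustrative) suggests the reason: `.sort_values()` returns a new, sorted DataFrame and leaves the original untouched.

```python
import pandas as pd

# Invented example data; only the "score" column name echoes our dataset
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "score": [5, 42, 17],
})

# sort_values() returns a sorted copy...
toy_sorted = toy.sort_values(by="score", ascending=False)
print(toy_sorted)

# ...while the original DataFrame keeps its old order
print(toy)
```

If we did not assign the result to a name, the sorted version would simply be displayed and then lost.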

Your turn! Sort the DataFrame by **creation date** (look up the name of this column first), and set `ascending` to `True`. Then, assign that sorted dataframe to the same variable name, `sed`.

In [None]:
# Your code here





### Converting to datetime
Did you ever wonder which format the "created" column is in? It is a Unix timestamp: the number of seconds that have elapsed since the Unix epoch, minus leap seconds; the Unix epoch is 00:00:00 UTC on 1 January 1970.

In [None]:
pd.to_datetime(1207632114,unit='s')

Pandas allows us to create a new column that converts the Unix timestamps into more readable datetimes using the `pd.to_datetime()` function.

Creating a new column in Pandas is as easy as using the bracket notation to write a new column name, then assigning it. In this case, we just use `pd.to_datetime()` again, this time passing it the entire "created" column.

In [None]:
sed['created_datetime'] = pd.to_datetime(sed['created'],unit='s')
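Once a column holds datetimes, we can pull out parts of the date through the `.dt` accessor. The sketch below uses a made-up two-row DataFrame and a hypothetical `year` column name; it is meant as an illustration, not as a required step in this notebook.

```python
import pandas as pd

# Two invented Unix timestamps (in seconds); the first is the one used above
toy = pd.DataFrame({"created": [1207632114, 1577836800]})

# Convert the whole column at once
toy["created_datetime"] = pd.to_datetime(toy["created"], unit="s")

# The .dt accessor exposes datetime parts such as the year
toy["year"] = toy["created_datetime"].dt.year
print(toy)
```

A column like this could be handy later on, for instance to count posts per year.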

### Selecting a column
To select a single column of data, simply put the name of the column in between brackets. Let’s select the 'selftext' column. We can print out the first entry in this column as follows:

In [None]:
# .iloc selects by position, so this is the first row in the current sort order
sed['selftext'].iloc[0]

As you see, using the `[]` operator selects a set of rows and/or columns from a DataFrame.
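One subtlety is worth a quick look. After sorting, the index labels travel along with their rows, so on a Series with integer labels `[0]` looks up the row *labelled* 0, while `.iloc[0]` returns the row in first *position*. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data: row labels are 0, 1, 2 before sorting
toy = pd.DataFrame({
    "selftext": ["low", "high", "mid"],
    "score": [1, 99, 50],
})
toy = toy.sort_values(by="score", ascending=False)

by_label = toy["selftext"][0]          # the row whose index label is 0
by_position = toy["selftext"].iloc[0]  # the first row in the current order
first_two = toy["selftext"].iloc[:2]   # positional slice: the first two rows

print(by_label, by_position)
print(first_two)
```

If you prefer labels and positions to line up again after sorting, `.reset_index(drop=True)` renumbers the rows from 0.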

Your turn! Use slicing to retrieve the first 10 "selftext" entries in our DataFrame.

In [None]:
# Your code here





One thing we often do when we’re exploring a dataset is filtering the data based on a given condition. For example, we might need to find all the rows in our dataset where the score is at least 500. We can use the `.loc[]` indexer to do so.

In [None]:
sed.loc[sed.score >= 500]

`.loc[]` is useful whenever you want to filter or prune your dataset based on some condition. For more info, see [the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).
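Conditions can also be combined. The sketch below, on invented numbers, shows two patterns that tend to come up: joining conditions with `&` (each wrapped in parentheses) and selecting a column at the same time as filtering rows.

```python
import pandas as pd

# Invented values; the column names echo our dataset
toy = pd.DataFrame({
    "score": [700, 120, 560, 30],
    "textlen": [50, 400, 900, 20],
})

# A single condition
high_score = toy.loc[toy["score"] >= 500]

# Two conditions joined with & (use | for "or"); the parentheses are required
long_and_popular = toy.loc[(toy["score"] >= 500) & (toy["textlen"] > 100)]

# Filter rows and pick one column in the same step
scores_of_long_posts = toy.loc[toy["textlen"] > 100, "score"]

print(high_score)
print(long_and_popular)
print(scores_of_long_posts)
```

Note that Python's own `and` and `or` do not work on whole columns here; pandas expects `&` and `|`.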

Your turn! Use `loc[]` to retrieve only the posts that have over 200 comments.

In [None]:
# Your code here






## Distant reading with NLTK 
Tomorrow, we will look at preprocessing our text in more detail. For now, let's automate most of it using NLTK's `word_tokenize()` function. We've imported this function at the beginning of this notebook.
`word_tokenize()` works like this:


In [None]:
word_tokenize("He is a lumberjack and he is okay. He sleeps all night and he works all day.")

Your turn! Let's tokenize our "selftext" column. Here's what you need to do: 
- Create a new list called `sed_tokens`;
- Begin a for-loop that iterates over the "selftext" column of our `sed` DataFrame; 
- For each text in that column, tokenize it using `word_tokenize()`; 
- Add these tokenized words to our new `sed_tokens` list using the list `.extend()` method*.
 
*We use `.extend` instead of `.append`. This is because we want one long list, instead of a list of lists. While `append` adds its argument as a single element to the end of a list – meaning the length of the list itself will increase by one – `extend` adds each element to the list, extending the list.
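The difference between the two methods is easy to see on a tiny example. The token lists below are typed in by hand (they are not output from `word_tokenize()`), so this is only an illustration of how the lists grow.

```python
# Two hand-written "tokenized posts"
tokenized_posts = [["He", "is", "okay"], ["He", "sleeps"]]

nested = []  # filled with .append()
flat = []    # filled with .extend()

for tokens in tokenized_posts:
    nested.append(tokens)  # adds the whole list as ONE element
    flat.extend(tokens)    # adds each token separately

print(nested)  # a list of lists
print(flat)    # one long list of tokens
```

Here `nested` ends up with 2 elements (one per post), while `flat` ends up with 5 (one per token).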

In [None]:
# Your code here






## The NLTK `Text()` class
Now, let's have a look at our data. NLTK provides a `Text()` class, which is a "wrapper" that allows for initial exploration of texts. It supports counting, concordancing, collocation discovery, etc. 

In [None]:
sed_t = Text(sed_tokens)

Let's print out the "docstring" of NLTK's `Text()` object, as well as all the things you can do with this object. Have a read through this to see what it allows you to do!

In [None]:
help(Text)

### Concordances 
One of the most basic, but quite helpful, ways to quickly get an overview of the contexts in which a word appears is through a concordance view. 

In [None]:
sed_t.concordance('game', width=115)
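To get a feel for what a concordance view does, here is a rough plain-Python sketch of the idea: find every occurrence of a keyword and show a few tokens on either side. The function name `kwic` and its `window` parameter are made up for this illustration, and this is not how NLTK implements it internally.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        # This sketch ignores case when matching
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# A hand-made token list, split on whitespace for simplicity
tokens = "He is a lumberjack and he is okay".split()

for left, match, right in kwic(tokens, "he", window=2):
    print(f"{left:>20} [{match}] {right}")
```

Lining up the keyword in the middle, as the `print` call does, is what makes recurring patterns around a word easy to spot.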

### Word plotting
Using the `dispersion_plot()` method we can easily visualize where in the text particular words appear, and how often. We have to feed it a list of one or more words.

Sorting our df by date allows us to look "through time" to see whether particular words start (dis)appearing in our data.

In [None]:
sed_t.dispersion_plot(["feminism", "feminist"])
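The horizontal axis of a dispersion plot is the position of each token in the text, which is why the plot only reads as a timeline if the DataFrame was sorted by date before tokenizing. As a rough sketch of the underlying idea (the helper name `token_offsets` is invented, and this is not NLTK's own code), we can collect those positions ourselves:

```python
def token_offsets(tokens, words):
    """Map each word to the list of positions at which it occurs."""
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:  # exact, case-sensitive match
            offsets[token].append(position)
    return offsets

# A hand-made token list, split on whitespace for simplicity
tokens = "game theory is not a game but the game goes on".split()

offsets = token_offsets(tokens, ["game", "theory"])
print(offsets)
```

Each position in these lists corresponds to one mark in the plot for that word.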

### Similar words
Using the `.similar()` method we can look at "distributional similarity": finding other words which appear in the same contexts as the specified word.
 

In [None]:
sed_t.similar('love')
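What does "appearing in the same contexts" mean in practice? One simplified reading, sketched below in plain Python, treats the context of a word as the pair of words directly before and after it. The function name `shared_context_words` is hypothetical, and NLTK's own method additionally ranks the results, so take this as an illustration of the idea only.

```python
from collections import defaultdict

def shared_context_words(tokens, target):
    """Return words that share at least one (previous, next) context with target."""
    contexts = defaultdict(set)
    # Skip the first and last token: they lack a neighbour on one side
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    target_contexts = contexts[target]
    return sorted(
        word for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    )

# A hand-made token list, split on whitespace for simplicity
tokens = "i love my dog and i like my cat but i love my cat".split()

print(shared_context_words(tokens, "love"))
```

Both "love" and "like" occur between "i" and "my" in this toy text, so they count as distributionally similar here.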

**Exploring texts using NLTK**

Use these NLTK methods on this dataset. The `Text()` object has other functionalities; look through the `help(Text)` instructions we just printed out if you want to check them out.

In [None]:
# Your code here





