<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 2 exercises: CLOSE AND DISTANT READING <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>

<img style="width: 240px; height: 120px; float: right; margin: 0 0 0 0;" src="http://www.merritt.edu/wp/histotech/wp-content/uploads/sites/275/2018/08/berkeley-logo.jpg" />
</div>

# Distant Reading

This notebook focuses on some simple methods for close and distant reading using NLTK. Make sure you have this package installed (if you have Anaconda, it should already be there).

By the end of this notebook, you should:

* have practiced with moving from close to distant reading, and applying research questions using Natural Language Processing methods;
* be able to work with some basic natural language processing methods using NLTK.

**Make sure to read and respond to the assignment at the end of the notebook.**

### Distant reading Ferrante and Knausgård

For this distant reading, you can explore the themes you found in your close reading related to how Knausgård and Ferrante (either separately, or in relation to each other) construct gendered and bodily identity.

If you have found a pattern you want to explore yourself, you’re free to do so. Otherwise, read the next paragraph for a more pointed question. 

_Film scholar Linda Williams (1991) has written about the so-called ‘body genres’: pornography, horror, and melodrama, which are characterized by an excess that often causes them to be classified as ‘low culture.’ This is related to the fact that such works elicit certain bodily reactions in the viewer, framed in specifically gendered ways (‘tear jerker,’ ‘fear jerker,’ to ‘jerk off’), that make them suspect because of a lack of proper aesthetic distance and a sense of over-involvement in sensation and emotion. We can apply Williams’ notions to the literary texts of Ferrante and Knausgård to see what this yields in terms of gender analysis, by mining words related to bodily materiality, such as bodily fluids: e.g. "sweat" or "tears". This can then be offset against cerebral terms, e.g. "spirit" or "thinking"._

Try to think about how you could operationalize the patterns and tensions we discussed yesterday by looking for **keywords, word contexts, and stylistic differences** between these two authors. 

* First, make a list of as many possible English words related to the topics you have encountered in your close reading. Make sure to include disambiguations. If you are interested in ‘tears,’ for instance, also include ‘cry,’ ‘cries,’ ‘crying’ etc.  

* Explore the different ways, described below, in which distant reading might help you better understand these themes and topics.  

## Importing

Let's start by importing some packages:

In [None]:
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.text import Text

import collections
import string

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

## Loading the data

Now, let's get our data. Locate the data files: they need to be in the same folder as this notebook for the code below to find them. Look closely at what they're called. Now we read them into Python by calling `open()` and `read()`, lowercasing the text as we go with `lower()`.

In [None]:
# Make sure that ferrante.txt and knaus.txt are in the same folder as this notebook!
ferrante = open('ferrante.txt', encoding="UTF-8").read().lower()
knaus = open('knaus.txt', encoding="UTF-8").read().lower()

Great, we got our data. Now, we need a tokenizer. A tokenizer is basically a function that takes a string and splits it into separate units, such as words and punctuation marks, usually returned as a list. You can create your own, but NLTK helpfully has one too.
Let's see NLTK's `word_tokenize()` function in action:

In [None]:
ferranteTokens = word_tokenize(ferrante)
knausTokens = word_tokenize(knaus)

In [None]:
# See if it works!
knausTokens[:7]
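
To make the idea of "creating your own" tokenizer concrete, here is a minimal sketch built on a regular expression. The function name `simple_tokenize` is hypothetical, and this is not how `word_tokenize()` works internally; it only illustrates the string-in, list-out behaviour described above. It assumes straight apostrophes (`'`), so curly ones (`’`) would be split off as separate tokens.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "won't" together;
    # [^\w\s] matches any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())

print(simple_tokenize("Why won't this work?"))
# → ['why', "won't", 'this', 'work', '?']
```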

Let's see how long our lists are:

In [None]:
print(len(ferranteTokens))
print(len(knausTokens))

Turns out Knausgård's text is much longer than Ferrante's (almost double!). This means we'll have to normalize many of our findings before comparing them.
**Think about why this is important.**

## Stemming

Tokenizers are great, but they're often not perfect. Look at the example below:

In [None]:
word_tokenize("Why won't this work?")

Hm, looks like it did a pretty good job, except that it splits "won't" into two tokens, "wo" and "n't". Annoying. Tokens often need some further normalizing, and this is where **stemming** and **lemmatizing** come in handy.

First we have to load our stemmer:

In [None]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

In [None]:
for each in ["think", "thinker", "thinking"]:
    print(stemmer.stem(each))

...but stemming doesn't always produce the prettiest results:

In [None]:
for each in ["create", "creating", "creator"]:
    print(stemmer.stem(each))
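
How aggressive the results look depends on which stemmer you pick. As a rough sketch, the cell below runs the Lancaster stemmer used above next to NLTK's `PorterStemmer`, which is generally considered the gentler of the two. The word list is just an illustration; run it to compare the outputs yourself.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()

# Print each word followed by its Lancaster stem and its Porter stem
for word in ["create", "creating", "creator", "thinking"]:
    print(word, lancaster.stem(word), porter.stem(word))
```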

## Lemmatizing
A lemma is the canonical, dictionary or citation form of a word. For instance, the lemma for "thinks" is "think." 
Lemmatizers are typically a bit less intrusive to your data than stemmers. Let's see one in action:

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
for each in ["trade", "trades", "trading", "trader", "traders"]:
    print(lemmatizer.lemmatize(each))
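
One thing to be aware of: `lemmatize()` treats every word as a noun unless told otherwise, so a verb form like "trading" is left alone by default. The sketch below passes the optional `pos` argument to show the difference. The helper name `lemmatize_with_pos` is hypothetical, and the code assumes the WordNet data has been downloaded (`nltk.download('wordnet')`); if it has not, the `except` branch prints a reminder instead.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize_with_pos(words, pos):
    """Lemmatize each word as the given part of speech ('n' = noun, 'v' = verb)."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]

words = ["trades", "trading", "traders"]
try:
    print(lemmatize_with_pos(words, "n"))  # treated as nouns (the default)
    print(lemmatize_with_pos(words, "v"))  # treated as verbs
except LookupError:
    # Raised when the WordNet corpus is not installed locally
    print("WordNet data not found; run nltk.download('wordnet') first.")
```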

**Optional challenge: lemmatizing / stemming**

*Note: if the exercise below is too difficult, don't worry about it for now; just tokenize your data using `word_tokenize()`*.

We're going to tokenize our data, then stem or lemmatize it. In order to do that, you have to:
1. Assign your tokenized data to a variable
2. Create a new empty list, assigning it to a variable
3. Create a `for`-loop that iterates through all the words in your tokenized data
4. For each word you loop through, stem or lemmatize it and assign the result to some variable
5. ...and add that result to your new list

Good luck!

In [None]:
# Enter your code here







## Word Frequencies and Keywords

NLTK allows us to analyze the word frequencies for both Ferrante and Knausgård. First, we should remove punctuation from our texts, tokenize them, and remove stopwords. (We already lowercased the texts when we loaded them.) In the future, we can write a longer function to do all of this at once, but for now, let's see how it works step by step.

In [None]:
# Here, we define a function that will strip punctuation from a string (your `knaus` and `ferrante` variables)
def stripPunctuation(s):
    return ''.join(ch for ch in s if ch not in string.punctuation)

Next, we run this function on our two variables:

In [None]:
ferranteNoPunct = stripPunctuation(ferrante)
knausNoPunct = stripPunctuation(knaus)

Now, we tokenize:

In [None]:
ferranteTokens = word_tokenize(ferranteNoPunct)
knausTokens = word_tokenize(knausNoPunct)

Now let's remove stopwords:

In [None]:
def stripStopwords(tokens):
    stopWords = set(stopwords.words('english'))
    return [w for w in tokens if w not in stopWords]

In [None]:
ferranteTokensClean = stripStopwords(ferranteTokens)
knausTokensClean = stripStopwords(knausTokens)

In [None]:
# Let's see if it worked!
knausTokensClean[:10]
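
For reference, the steps above could be rolled into a single function along these lines. This is only a sketch under some assumptions: the name `clean_tokens` is hypothetical, it splits on whitespace with `str.split()` instead of calling `word_tokenize()`, and it takes the stopword set as an argument (here a tiny stand-in set rather than NLTK's English list). Note also that `string.punctuation` only covers ASCII punctuation, so curly quotes such as `’` are not removed.

```python
import string

def clean_tokens(text, stop_words):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    # Translation table that deletes every character in string.punctuation
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in stop_words]

toy_stop_words = {"the", "was", "and"}  # stand-in for stopwords.words('english')
print(clean_tokens("The heart was heavy, and the tears came.", toy_stop_words))
# → ['heart', 'heavy', 'tears', 'came']
```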

Now, let's start counting these words. Using the `Counter()` function (from the `collections` module), we can count the elements in our lists.

In [None]:
ferranteCounts = collections.Counter(ferranteTokensClean)
knausCounts = collections.Counter(knausTokensClean)

We can now use this to count any individual word type in both our texts:

In [None]:
knausCounts['heart']
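
If you made a list of related words earlier (for instance "tears", "cry", "cries", "crying"), you can add up their counts in the same way. The sketch below uses a toy `Counter` and two made-up word lists; `group_count` is a hypothetical helper name, and in the notebook you would pass `knausCounts` or `ferranteCounts` and your own lists instead. These are raw counts, so they are not yet comparable across texts of different lengths.

```python
import collections

def group_count(counts, words):
    """Sum the counts of every word in `words`; words that never occur count as 0."""
    return sum(counts.get(word, 0) for word in words)

# Toy data standing in for knausCounts / ferranteCounts
toy_counts = collections.Counter(["tears", "cry", "tears", "spirit", "crying"])
bodily = ["tears", "cry", "cries", "crying"]  # hypothetical word list
cerebral = ["spirit", "thinking"]             # hypothetical word list

print(group_count(toy_counts, bodily), group_count(toy_counts, cerebral))
# → 4 1
```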

Remember we were talking about normalizing our frequencies? That's what we'll do next. We'll build a dictionary to compare the relative word proportions in Ferrante's and Knausgard's texts. This way, we can define the **keywords** for both authors.

In [None]:
knausKeys = {}
for word in knausCounts: 
    # How often does Knausgard use the word we're interested in? 
    knausCount = knausCounts[word]
    # Normalizing the frequencies
    knausProportion = knausCount / len(knausTokens)
    # Now we need to compare this to Ferrante's use of the same word. 
    # We will use the dictionary `.get()` method, as it allows us to return something even if the word
    # isn't in our dictionary.
    ferranteCount = ferranteCounts.get(word, 0)
    ferranteProportion = ferranteCount / len(ferranteTokens)  
    # We can now define the "keywords" for Knausgard, which is the relative proportion of this word 
    # as compared to Ferrante.
    knausKey = (knausProportion - ferranteProportion)*100
    # Finally, we add the word to our dictionary
    knausKeys[word] = knausKey

In [None]:
# Let's retrieve the first seven key-value pairs from our new dictionary
firstPairs = {k: knausKeys[k] for k in list(knausKeys)[:7]}
firstPairs

Now we can find the top 10 words for Knausgard!

In [None]:
knausSorted = sorted(knausKeys.items(), key=lambda item: item[1], reverse=True)
for key, value in knausSorted[:10]: 
    print(key, value)
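
To get Ferrante's keywords as well, the same calculation can be wrapped in a function and run in the other direction. The sketch below is one possible way to do that, not the only one: `keyness` is a hypothetical name, and the toy counters and totals stand in for `ferranteCounts`, `knausCounts`, `len(ferranteTokens)` and `len(knausTokens)`.

```python
import collections

def keyness(target_counts, reference_counts, target_total, reference_total):
    """Return {word: difference in relative frequency, in percentage points}."""
    scores = {}
    for word, count in target_counts.items():
        target_prop = count / target_total
        # .get() returns 0 for words the reference text never uses
        reference_prop = reference_counts.get(word, 0) / reference_total
        scores[word] = (target_prop - reference_prop) * 100
    return scores

# Toy counts standing in for the two authors' Counters
toy_a = collections.Counter({"tears": 3, "heart": 1})
toy_b = collections.Counter({"tears": 1, "heart": 4, "spirit": 5})
scores = keyness(toy_a, toy_b, target_total=4, reference_total=10)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(word, round(score, 1))
```

With the notebook's variables, a call such as `keyness(ferranteCounts, knausCounts, len(ferranteTokens), len(knausTokens))` would give the corresponding scores for Ferrante.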

## Concordances

A concordance is traditionally an alphabetical list of the words (especially the important ones) present in a text, usually with citations of the passages concerned. NLTK's version shows every occurrence of a word you choose, together with the text surrounding it, which makes it fairly easy to explore how a word is used in our texts.

In [None]:
# Let's get our punctuation back, as it reads a bit easier
ferranteTokens = word_tokenize(ferrante)
knausTokens = word_tokenize(knaus)

NLTK contains a `Text()` object, which is a "wrapper" that supports initial exploration of texts.

In [None]:
# Here, we create our NLTK Text object
ferranteT = Text(ferranteTokens)
knausT = Text(knausTokens)

Let's print out the "docstring" of NLTK's `Text()` object, as well as all the things you can do with this object. **Have a read through this** to see what it allows you to do!

In [None]:
help(Text)

In [None]:
knausT.concordance('heart', width=115)
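
If you are curious what a concordance does under the hood, the idea can be sketched in a few lines of plain Python: walk through the tokens and, for every match, keep a window of neighbouring tokens. This is an illustrative approximation, not NLTK's implementation; the function name `kwic` ("keyword in context") and the toy sentence are made up for the example.

```python
def kwic(tokens, keyword, window=4):
    """Return each occurrence of keyword with `window` tokens of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines

toy_tokens = "my heart was beating fast and my heart sank".split()
for line in kwic(toy_tokens, "heart", window=2):
    print(line)
# → my [heart] was beating
# → and my [heart] sank
```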

**Optional challenge: exploring texts using NLTK**

If the above is familiar, and you have explored the data using these methods, you are invited to use other NLTK methods on our texts. The `Text()` object has many other functionalities, for instance. You can also have a look at POS tagging: https://www.nltk.org/book/ch05.html. Use it to seek out specific nouns or verbs!

In [None]:
# Enter your code here







## ASSIGNMENT

1.	What functionalities of NLTK did you use, and what patterns were you able to find?
2.	How did these methods help you answer the questions you had?
3.	What have you not been able to answer using these methods?
4.	How did the move from close to distant reading influence your findings?

**Respond to these questions in the following Google Doc (don't forget to include your name):**
https://docs.google.com/document/d/1WxElsUWg8WHQB_wh8N_35maDgwXpIP6Gi6HyVUpffMk/edit?usp=sharing
