# SISU Digital Humanities: Textual and Language Analysis on Social Media

### Session 1: Introduction
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)


# Welcome!

In this notebook we will go over some basic operations in Python. If you are familiar with Python, this notebook will cover things you already know.

## 1. Working with .txt files

Write a small utility function `read_file(filename)` that reads a specified file and simply returns all contents as a single string.

In [3]:
# Opening file - your code here



'\nOne mild, overcast day in August 1969, a bus came'

In [4]:
# Opening file with function - your code here



'\nOne mild, overcast day in August 1969, a bus came'
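If you get stuck, here is one possible shape for `read_file`, offered as a minimal sketch rather than the official solution. It assumes the file is UTF-8 encoded text. The file name `example.txt` and its contents are hypothetical placeholders, not the file used in this notebook.

```python
from pathlib import Path
import tempfile

def read_file(filename):
    """Read a file and return its contents as a single string."""
    # Assumption: the file is UTF-8 encoded text
    with open(filename, encoding="utf-8") as infile:
        return infile.read()

# Hypothetical demo file, written to a temporary folder so the sketch runs anywhere
demo_path = Path(tempfile.mkdtemp()) / "example.txt"
demo_path.write_text("A first line.\nA second line.", encoding="utf-8")

demo_text = read_file(demo_path)
print(demo_text[:13])
```

Using `with` makes sure the file is closed again once it has been read, even if something goes wrong halfway.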

Now, we are going to create a function `split_sentences` that performs some very simple sentence splitting when passed a text string. Each sentence will be represented as a new string, so the function as a whole returns a list of sentence strings. We assume that any occurrence of `.`, `!`, or `?` marks the end of a sentence.

First, we'll create a function called `end_of_sentence_marker` that takes as argument a character and returns True if it is an end-of-sentence marker, otherwise it returns False.

In [11]:
# Define your function here




# these tests should return True if your code is correct
print(end_of_sentence_marker("?") == True)
print(end_of_sentence_marker("a") == False)

True
True
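One way to write this function, as a sketch and not the only valid answer, is a membership test against the three markers. The set `{".", "!", "?"}` is simply the list of markers assumed in the text above.

```python
def end_of_sentence_marker(character):
    """Return True if the character ends a sentence, otherwise False."""
    # Membership test against the three markers we assume end a sentence
    return character in {".", "!", "?"}

print(end_of_sentence_marker("?"))
print(end_of_sentence_marker("a"))
```

A set is used here instead of the string `".!?"` because `"" in ".!?"` is `True` in Python, which would wrongly treat an empty string as a marker.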


An important function we will use is the built-in `enumerate`. `enumerate` takes as its argument any iterable (a string, a list, etc.). Let's see it in action:

In [6]:
for i, character in enumerate("Python"):
    print(i, character)

0 P
1 y
2 t
3 h
4 o
5 n


As we can see, `enumerate` allows us to iterate over an iterable and, for each element, gives us its corresponding index.
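The same works for lists. The short sketch below uses a made-up list of words and also shows the optional `start` argument of `enumerate`, which changes the number the count begins at.

```python
words = ["bus", "road", "island"]

# Default: counting starts at 0
for i, word in enumerate(words):
    print(i, word)

# The optional `start` argument changes the first index
numbered = list(enumerate(words, start=1))
print(numbered)
```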

Now we can create our function `split_sentences`. 

In [12]:
def split_sentences(text):
    "Split a text string into a list of sentences."
    sentences = []
    start = 0
    for end, character in enumerate(text):
        if end_of_sentence_marker(character):
            sentence = text[start: end + 1]
            sentences.append(sentence)
            start = end + 1
    return sentences

split = split_sentences(test)
split[:10]

['\nOne mild, overcast day in August 1969, a bus came winding its way along a narrow road at the far end of an island in southern Norway, between gardens and rocks, meadows and woods, up and down dale, around sharp bends, sometimes with trees on both sides, as if through a tunnel, sometimes with the sea straight ahead.',
 ' It belonged to the Arendal Steamship Company and was, like all its buses, painted in two-tone-light and dark-brown livery.',
 ' It drove over a bridge, along a bay, signaled right, and drew to a halt.',
 ' The door opened and out stepped a little family.',
 ' The father, a tall, slim man in a white shirt and light polyester trousers, was carrying two suitcases.',
 ' The mother, wearing a beige coat and with a light-blue kerchief covering her long hair, was clutching a stroller in one hand and holding the hand of a small boy in the other.',
 ' The oily, gray exhaust fumes from the bus hung in the air for a moment as it receded into the distance.',
 '\n\n“It’s quite a

Within `split_sentences`, we define a variable `sentences` in which we store the individual sentences. Next, we define a variable `start` and set it to zero. We do this because we need both the start position and the end position of each sentence, and we know that the first sentence always starts at position 0.

Next, we use `enumerate` to *loop* over all individual characters in the text. Remember that `enumerate` returns pairs of indexes and their corresponding elements (here characters). For each character we check whether it is an end-of-sentence marker. If it is, the variable `end` marks the position in `text` where a sentence ends. We then slice the sentence out with `text[start: end + 1]`, append it to `sentences`, and set `start` to `end + 1` so that the next sentence begins right after the marker.
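To make the loop concrete, here is the same function run on a short, made-up string. The definition of `end_of_sentence_marker` is an assumed one (your own version may look different), and both functions are repeated so the block runs on its own.

```python
def end_of_sentence_marker(character):
    # Assumed definition: ., ! and ? end a sentence
    return character in {".", "!", "?"}

def split_sentences(text):
    "Split a text string into a list of sentences."
    sentences = []
    start = 0
    for end, character in enumerate(text):
        if end_of_sentence_marker(character):
            # Slice from the current start up to and including the marker
            sentences.append(text[start: end + 1])
            start = end + 1
    return sentences

sample = "It stopped. Who got off? A family! And then"
result = split_sentences(sample)
print(result)
# → ['It stopped.', ' Who got off?', ' A family!']
```

Two things are worth noticing: every sentence after the first keeps its leading space, and the trailing fragment `" And then"` is dropped because no end-of-sentence marker follows it.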

There is an easier way to do this, however, which is through NLTK. We will import the `sent_tokenize` function from NLTK and run it on our `test` data.

In [14]:
# your code here




['\nOne mild, overcast day in August 1969, a bus came winding its way along a narrow road at the far end of an island in southern Norway, between gardens and rocks, meadows and woods, up and down dale, around sharp bends, sometimes with trees on both sides, as if through a tunnel, sometimes with the sea straight ahead.',
 'It belonged to the Arendal Steamship Company and was, like all its buses, painted in two-tone-light and dark-brown livery.',
 'It drove over a bridge, along a bay, signaled right, and drew to a halt.',
 'The door opened and out stepped a little family.',
 'The father, a tall, slim man in a white shirt and light polyester trousers, was carrying two suitcases.']

Now, let's visualize some of these results. We'll create a new variable called `sentence_length`, assigning an empty list to it. We'll then loop over our `split` variable (which contains all split sentences in our test file) and add the length of each sentence to the `sentence_length` variable (tip: use the built-in `len()` function).
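As a hint, the pattern looks roughly like the sketch below. The three-sentence `split` list is a hypothetical stand-in for the real one created earlier in the notebook.

```python
# Hypothetical stand-in for the `split` list created earlier
split = ["It stopped.", " Who got off?", " A family!"]

sentence_length = []
for sentence in split:
    # len() counts characters, including spaces and punctuation
    sentence_length.append(len(sentence))

print(sentence_length)
# → [11, 13, 10]
```

Note that this measures length in characters, not in words.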

In [None]:
# your code here




Finally, we'll import matplotlib and plot `sentence_length`. If you did everything right, the code below should give you a graph of the sentence lengths!

In [None]:
import matplotlib.pyplot as plt

plt.plot(sentence_length)

## 2. Working with Pandas

Next, we'll import a .csv into a Pandas dataframe. The data was taken from a forum run by supporters of former US president Donald Trump.

We're just getting started today, but ask yourself already: what kinds of questions could I ask of this data? What kinds of themes, patterns, or regularities might I be interested in exploring?

In [None]:
# Your code here




In [None]:
dfNew = df[:10].dropna()

In [None]:
dfNew

Now, we'll write a function `tokenizer()` that takes a string as input. There are lots of ways to do this, but here's one option:

- turn the string into lower case
- replace newlines and tabs with spaces using `translate`
- clean up punctuation using `translate` and `string.punctuation`
- put all words in a list, leaving out words that consist only of digits
- return said list
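To see what each step does before wrapping them in a function, the steps above can be traced on a small, made-up string. This is only an illustration. The variable names are hypothetical and the sample sentence is not taken from the dataset.

```python
import string

sample = "Hello,\tWorld!\nIt is 2020."

# Step 1: lower case
lowered = sample.lower()

# Step 2: map newlines and tabs to spaces
whitespace_table = str.maketrans({"\n": " ", "\t": " "})
cleaned = lowered.translate(whitespace_table)

# Step 3: delete punctuation (mapping a character to None removes it)
punct_table = str.maketrans({ch: None for ch in string.punctuation})
no_punct = cleaned.translate(punct_table)

# Step 4: split into words and drop tokens made up only of digits
tokens = [word for word in no_punct.split() if not word.isdigit()]
print(tokens)
# → ['hello', 'world', 'it', 'is']
```

`str.maketrans` builds a lookup table from characters to their replacements, and `translate` applies that table to every character of the string in one pass.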

In [None]:
import string

def tokenizer(text):
    '''Cleans up and tokenizes an input string.'''
    text = text.lower()
    # Replace newlines, tabs, and curly quotes with spaces
    bad_chars = ['\n', '\t', '”', '“']
    textClean = text.translate(str.maketrans({ch: " " for ch in bad_chars}))
    # Delete ASCII punctuation from each whitespace-separated word
    table = str.maketrans({ch: None for ch in string.punctuation})
    no_punct = (s.translate(table) for s in textClean.split())
    # Drop tokens that are empty after stripping, or that consist only of digits
    digits_out = [word for word in no_punct if word and not word.isdigit()]
    return digits_out

If you don't understand all these operations, don't worry! For now, we can do something similar (with less code) using NLTK. All we need to do is call the `word_tokenize` function.

In [None]:
import nltk
nltk.download('punkt')

def nltk_tokenizer(text): 
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    return tokens

Next, we might want to use our tokenizer on only a subset of our data. Create a new dataframe which only contains the first 20 entries of our DataFrame. Also, remove all rows containing empty values (you can use Pandas' `dropna()` method).

In [None]:
# your code here



Now let's run one of our tokenizer functions, taking as input the `body` column from your dataframe. Loop over each row of your dataframe, and print out the tokenized `body` of each row to see if it works.
Also, notice how the output of our own tokenizer differs from the NLTK one.

In [None]:
# your code here




Next, we'll compute the type-token ratio (TTR) for each post in our df, to see whose language is the most 'complex'. The TTR is the number of unique words (types) divided by the total number of words (tokens). First, we'll create a function for you that computes the TTR (see if you understand how it works!).

In [None]:
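def typeTokenRatio(tokens):
    '''Returns the number of unique tokens divided by the total number of tokens.'''
    numTokens = len(tokens)
    if numTokens == 0:
        return 0.0  # avoid dividing by zero when a post has no tokens
    numTypes = len(set(tokens))
    return numTypes / numTokens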
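A small worked example may help. The token list below is made up, and the function is repeated so the block runs on its own.

```python
def typeTokenRatio(tokens):
    numTokens = len(tokens)        # total number of words
    numTypes = len(set(tokens))    # number of distinct words
    return numTypes / numTokens

# "the" appears twice, so there are 5 tokens but only 4 types
tokens = ["the", "bus", "left", "the", "island"]
ratio = typeTokenRatio(tokens)
print(ratio)
# → 0.8
```

Converting the list to a `set` removes duplicates, which is why `len(set(tokens))` gives the number of types.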

Now, loop over the `body` column of each row in your df again. This time, within the loop, create a variable `tokens` and assign to it the output of your tokenizer function. Then, print the output of the `typeTokenRatio` function, which you run on `tokens`.

If things go well, you'll see the TTR for each of the 20 posts.

In [None]:
# your code here




We see that some posts have a TTR of 1, meaning that no word is repeated. TTR does not tell us much here: short posts have fewer chances to repeat a word, so they tend to score high regardless of vocabulary. Still, it is one way of finding more unique posts.

## 3. Most-used words

Finally, we'll write a short program that tells you the *10 most-used words* for a single user comment in the DataFrame. We will use the `Counter` class from the `collections` module. If you don't know the answer: the Internet is your friend!
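If you want a nudge in the right direction, the core of `Counter` looks like the sketch below. The token list is a hypothetical stand-in for one tokenized comment. In your own solution it would come from running a tokenizer on a row's `body`.

```python
from collections import Counter

# Hypothetical token list standing in for one tokenized comment
tokens = ["the", "bus", "left", "the", "island", "and", "the", "bus"]

# Counter maps each distinct token to the number of times it occurs
counts = Counter(tokens)

# most_common(10) returns up to 10 (word, count) pairs, highest count first
top_words = counts.most_common(10)
print(top_words)
```

If the comment has fewer than 10 distinct words, `most_common(10)` simply returns all of them.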

In [None]:
# Your code here
