#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science

# Notebook 1: Getting Started with Text Data and Jupyter Notebooks

#### A note about class notebooks

Most class sessions will involve a Jupyter Notebook not unlike this one. These notebooks will include code you can modify and execute to get a feel for how things work. Many of the notebooks will have additional exercises at the end that ask you to use code provided elsewhere in the notebook in new ways to answer research questions you come up with yourself. Over the course of the quarter, you will be expected to complete the exercises for four of those notebooks.

Each notebook will be used to follow along with lecture material explaining the use of different tools, various considerations when using them, how they have been used in published social research, and other ways they might be used.

#### A note about notebooks in general

Project Jupyter ([wiki link](https://en.wikipedia.org/wiki/Project_Jupyter)) "exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages," according to their [official site](https://jupyter.org/). According to Project Jupyter,

> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

While you are welcome to use the cloud-based implementation from Google, [Google Colab](https://research.google.com/colaboratory/), I recommend running notebooks locally. Although we will use Python 3 for this class, notebooks can be used for certain other languages (including R, Julia, and even Stata, among others).

Notebooks allow you to blend code with markdown. One use case is that you can write code and comment the code in a more attractive and effective way. At the other extreme, another use case is writing a document in markdown and including a bit of code here and there.

#### A note about markdown

Here are some basics of formatting markdown:

```# Heading 1```

# Heading 1


```## Heading 2```

## Heading 2

``` *Italics* ```

*Italics*

``` **Boldface** ```

**Boldface**

``` ***Italics and boldface***```

***Italics and boldface***

``` `Monospace` ```

`Monospace`

``` [Link to a guide](https://ingeh.medium.com/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed) ```

[Link to a guide](https://ingeh.medium.com/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed)

Everything to this point was written in markdown cells. If you are actively in a cell and click on `Cell` in the menu at the top, you can change the cell type. The two main types we'll use are Code and Markdown. If you double-click on a markdown cell, it temporarily looks like a code cell (but isn't one, and won't behave like one). You can just `Run` the cell after editing it to make it look nice again.

Try editing this cell. Maybe write in a secret. You won't be turning this in, after all. Double-click on the cell to edit it, then `Run` the cell. This executes the code only in the current cell. You can also use `ctrl + enter` to run the cell.

Now delete the previous cell, where you may have written a secret. There's no use leaving that on your hard drive in a random file you might forget to protect. To delete the cell, select it, and then click the scissors icon on the menu at the top.

You can add a new cell by clicking the `+` button. You can also execute your current cell and add a new cell at the same time using `alt + enter` (which I think is `option + enter` on Macs).

The `Kernel` button at the top may come in handy when we start running more complicated code, but it also gives you the option to restart the notebook and clear all the output, restart the notebook and then run the entire thing, or change kernels.

Now let's look at Code cells.

In [None]:
# This is a comment in a code cell
# Anything after "#" (number sign, pound sign, hash, etc.) is "commented out"

# If you want to write a comment on multiple lines,
# you can keep using this symbol to "comment out" the line.

"""You can write comments that span multiple lines by surrounding them with triple quotation marks, but they just run off the screen unless you manually break the line"""

"""But triple quotes can be useful if you pay attention to line width and insert linebreaks at appropriate points, 
preventing the line from running outside the visible portion of the cell. You may see these inside Python functions,
where they are called docstrings. Docstrings describe functions and are accessible outside of the function."""

def example_function():
    """This function prints a string telling you it's an example."""
    print("This is just an example.")

Now call the function:

In [None]:
example_function()

If a function has a docstring, we can access that by writing `func.__doc__`, where "func" is the name of the function. You can see that when you execute the code in the cell below (replacing "func" with "example_function"). In this case, it'd be fine to just look at the cell above where we've defined the function. However, we will often import libraries and use functions when the code is not visibly present in the notebook. If the code is documented well, accessing the docstrings can be useful.

In [None]:
example_function.__doc__

We can also look at the docstring for the `print()` function itself.

In [None]:
print.__doc__

...and we can `print` that to make it a bit more attractive:

In [None]:
print(print.__doc__)
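
The same docstring can be reached in other ways. Python's built-in `help()` prints it along with the function's signature, and `inspect.getdoc()` from the standard library returns it as a string with the indentation tidied. Here is a minimal sketch using a hypothetical function, `shout`, that is not part of the class materials:

```python
import inspect

def shout(text):
    """Return the text in uppercase with an exclamation point added."""
    return text.upper() + "!"

# inspect.getdoc() returns the docstring as a plain string
doc = inspect.getdoc(shout)
print(doc)

# For a one-line docstring like this one, it matches __doc__ exactly
print(doc == shout.__doc__)
```

In a notebook, you could also run `help(shout)` in its own cell to see the signature and docstring together.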

In [None]:
1 - 1 # You can also write comments after code

As you can see in the cell above, you can execute Python code in code cells and get the output directly. We wrote 
`1 - 1` and got 0 as the output. Let's try that twice and see what happens.

In [None]:
1 - 1 # first time
2 - 3 # second time

Why didn't we get 0 *and* -1? The answer is that we only see the output of the last line, although each line of code is executed sequentially.

Let's verify this:

In [None]:
assert 2 + 2 == 5

print("If this line were to print, that would mean the previous line of code was not executed")

If we want to see the output from more than one line, we can use the `print()` function for each, like so:

In [None]:
print(1 - 1)
print(2 - 3)

The code is interpreted line by line, from top to bottom. If a line throws an error, the subsequent code does not execute. Print statements are also printed sequentially.

#### Now let's interact with some text data.

Let's start with a few things we can do with string variables.

`str.upper()` makes text uppercase.

`str.lower()` makes text lowercase.

`str.capitalize()` does what it says on the tin.

`str.startswith()` checks whether a string variable starts with the specified characters.

`str.endswith()` checks whether a string ends with the specified characters.

`str.replace()` takes two arguments: the substring you want to replace and what you want to replace it with.

`str.strip()` removes leading and trailing characters from the string. By default it removes whitespace; if you pass a string of characters, it removes any of those characters from both ends.

`str.lstrip()` does the same, but only for the left (start) of the string.

`str.rstrip()` does the same for the right (end).

`str.split()` defaults to splitting on white space, like spaces. However, if you supply a character to the `split()` method, it will split text on that instead. When you split a string variable, a list is returned.
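
Some of these methods (`strip()`, `lstrip()`, `rstrip()`, `capitalize()`, and `endswith()`) are easiest to understand on a string with messy edges. The sketch below uses a made-up string, `messy`, purely for illustration; `repr()` is used so the surrounding whitespace is visible in the output.

```python
# A made-up string with extra whitespace and punctuation, for illustration only
messy = "   hello, world!!   "

stripped = messy.strip()           # removes whitespace from both ends
print(repr(stripped))              # 'hello, world!!'
print(repr(messy.lstrip()))        # 'hello, world!!   '
print(repr(messy.rstrip()))        # '   hello, world!!'

# With an argument, rstrip() removes any of those characters from the end
no_bangs = stripped.rstrip("!")
print(no_bangs)                    # hello, world

print(no_bangs.capitalize())       # Hello, world
print(no_bangs.endswith("world"))  # True
print(no_bangs.split(", "))        # ['hello', 'world']
```

Note that none of these methods change `messy` itself. Each one returns a new string (or list), which is why the results are saved to new variables.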

Let's use as an example the first principle from Grimmer and Stewart (2013, p. 269): "All quantitative models of language are wrong—but some are useful."

In [None]:
s = "All quantitative models of language are wrong—but some are useful."

In [None]:
s

In [None]:
print(s)

In [None]:
s.lower()

In [None]:
s.upper()

In [None]:
s.startswith("Sphinx")

In [None]:
s.split()

In [None]:
s.split(",") # length == 1 because there are no commas, but it's still a list!

In [None]:
len(s.split()) # splitting on whitespace

In [None]:
s = s.replace("—", ", ")
s

In [None]:
len(s.split(","))

We can save the result of `str.split()` as a variable and, because it's a list, loop over it one word at a time.

In [None]:
words = s.split()

In [None]:
for word in words:
    print(word)
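
To check directly that `str.split()` returns a list, we can ask Python for the object's type. This sketch uses new variable names (`sentence` and `tokens`, chosen here so that `s` and `words` are left untouched):

```python
sentence = "All quantitative models of language are wrong"
tokens = sentence.split()

print(type(tokens))              # <class 'list'>
print(isinstance(tokens, list))  # True
print(len(tokens))               # 7
print(tokens[0])                 # All
```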

In [None]:
s.replace("useful", "really cool")

Initially, we'll just look at *The Picture of Dorian Gray* by Oscar Wilde and *Frankenstein* by Mary Shelley. These are both in the public domain, and they are here courtesy of [Project Gutenberg](https://www.gutenberg.org/).

The two lines of code below declare the variable `dorian_gray` to be the string "picture_of_dorian_gray.txt" and the variable `frankenstein` to be the string "frankenstein.txt".

In [None]:
dorian_gray = "picture_of_dorian_gray.txt"
frankenstein = "frankenstein.txt"

If we execute code with just the variable names, we get the assigned values (i.e., the file names).

In [None]:
dorian_gray

In [None]:
print(frankenstein)

To access the text, we need to read the file into memory. We can read in lines of text and interact with them straight away, or we can read them into memory and store them.

In [None]:
# Warning--this will print the entire book!

with open(dorian_gray, "r", encoding="utf-8") as reader:
    for line in reader:
        print(line)

The "with open(...) as..." approach closes the file after it executes, which is quite handy. "r" tells the function `open()` to read (instead of write, append, etc.). We can change the variable names, though:

In [None]:
with open(dorian_gray, "r", encoding="utf-8") as thing1:
    for thing2 in thing1:
        print(thing2)
        break # this stops the loop after the first iteration, so it only prints the first line
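
One way to see that the `with` block closes the file for us is to check the file object's `closed` attribute inside and after the block. The sketch below writes a small throwaway file (using the standard library's `tempfile` module) so that it runs without the book files; the file contents are placeholders.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on the book files
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path, "r", encoding="utf-8") as reader:
    first_line = reader.readline()
    print(reader.closed)  # False: we are still inside the with block

print(reader.closed)      # True: the file was closed when the block ended
print(first_line.rstrip("\n"))

os.remove(path)  # clean up the temporary file
```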

We can also store the book in memory.

In [None]:
with open(dorian_gray, "r", encoding="utf-8") as reader:
    whole_book = reader.read()

If we execute a cell with just the variable name, we get the whole book, but it isn't very attractive.

In [None]:
whole_book

In [None]:
whole_book[:100] # first 100 characters

Let's use `print()` instead.

In [None]:
print(whole_book)

Now might be a good time to tell you that you can press `ctrl + /` to comment out selected code. If you don't have any code selected, it will comment out whichever line you are on. If you want to get rid of all of the output, you can comment out the code in the cells above and run them again.

In [None]:
dorian_gray_lines = 0
frankenstein_lines = 0

with open(dorian_gray, "r", encoding="utf-8") as reader:
    for line in reader:
        dorian_gray_lines += 1
        
with open(frankenstein, "r", encoding="utf-8") as reader:
    for line in reader:
        frankenstein_lines += 1
        
print(f"The Picture of Dorian Gray has {dorian_gray_lines} lines.")
print(f"Frankenstein has {frankenstein_lines} lines.")

In [None]:
wilde_wordcount = 0
shelley_wordcount = 0

with open(dorian_gray, "r", encoding="utf-8") as reader:
    for line in reader:
        line = line.split()
        wilde_wordcount += len(line)
        
with open(frankenstein, "r", encoding="utf-8") as reader:
    for line in reader:
        line = line.split()
        shelley_wordcount += len(line)
        
print(f"The Picture of Dorian Gray has {wilde_wordcount} words.")
print(f"Frankenstein has {shelley_wordcount} words.")
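
The two counting cells above repeat the same loop for each book. One possible way to avoid the repetition is to wrap the loop in a function. The helper name `count_lines_and_words` is hypothetical, and the sketch runs on a throwaway file with placeholder text rather than on the novels.

```python
import os
import tempfile

def count_lines_and_words(path):
    """Return (number of lines, number of whitespace-separated words) in a text file."""
    n_lines = 0
    n_words = 0
    with open(path, "r", encoding="utf-8") as reader:
        for line in reader:
            n_lines += 1
            n_words += len(line.split())
    return n_lines, n_words

# Throwaway file standing in for a book; note the blank line in the middle
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("one two three\n\nfour five\n")
    demo_path = tmp.name

demo_counts = count_lines_and_words(demo_path)
print(demo_counts)  # (3, 5)

os.remove(demo_path)  # clean up the temporary file
```

Blank lines still count as lines, but they add nothing to the word count because splitting an empty line returns an empty list. With the book files in place, the same function could be called as `count_lines_and_words(dorian_gray)`.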

In [None]:
questions_in_jc = []

with open("julius_caesar.txt", "r", encoding="utf-8") as reader:
    for line in reader:
        line = line.rstrip("\n")
        if line.endswith("?"):
            questions_in_jc.append(line)
        
questions_in_jc

Remember docstrings? Let's take a look at one for a function we're about to use.

In [None]:
import pandas as pd

print(pd.read_csv.__doc__)

`pandas` is a fantastic library for manipulating data. Visually, the dataframes are a lot like spreadsheets.

Let's have an initial go at using `pandas` with a dataset of tweets from January 6, 2021. You can download the dataset from Kaggle at [this link](https://www.kaggle.com/mrmorj/capitol-riot-tweets). Just unzip/extract the archive and place the CSV in the same directory as this notebook (or edit the `f = "..."` line in the following cell to include the file path).

In [None]:
f = "tweets_2021-01-06.csv"
df = pd.read_csv(f)

In [None]:
df.head()

In [None]:
df[["follower_count", "likes", "retweets"]].corr()

In [None]:
df[["tweet_id", "query"]].groupby("query").count()

In [None]:
df["wordcount"] = df["text"].apply(lambda x: len(str(x).split()))

In [None]:
df.head()
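
To see what the `apply` line does without the full dataset, here is a small sketch on a made-up dataframe. The rows are invented for illustration and are not from the Kaggle data; only the column names `query` and `text` mirror the ones used above.

```python
import pandas as pd

# Made-up rows for illustration; these are not from the Kaggle dataset
toy = pd.DataFrame({
    "query": ["a", "a", "b"],
    "text": ["one two three", "four five", None],
})

# str(x) keeps the lambda from failing on missing values,
# but a missing value becomes a one-word string and is counted as 1
toy["wordcount"] = toy["text"].apply(lambda x: len(str(x).split()))
print(toy)

# Average word count per query
means = toy.groupby("query")["wordcount"].mean()
print(means)
```

One design consequence worth keeping in mind is that tweets with missing text are counted as having one word rather than zero. If that matters for an analysis, dropping or filling missing values first would be a reasonable alternative.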

In [None]:
df[["query", "wordcount"]].groupby("query").mean().plot.bar()

In [None]:
import matplotlib.pyplot as plt

df[["query", "wordcount"]].groupby("query").mean().plot.bar()
plt.title("Average Wordcount of Tweets by Query")
plt.ylabel("Wordcount")
plt.xlabel("Query")
plt.show()

In [None]:
import matplotlib.pyplot as plt

df[["query", "retweets"]].groupby("query").mean().plot.bar()
plt.title("Average Retweets of Tweets by Query")
plt.ylabel("Retweets")
plt.xlabel("Query")
plt.show()