# Introduction to Python for Humanists

In this lesson, we'll provide a brief introduction to using Python for some common tasks that humanists encounter when working with data.

## Jupyter Notebooks and Literate Programming

We are using something called a Jupyter notebook, which provides an integrated environment that combines human-readable text and computer-readable coding blocks.

Jupyter notebooks allow you to do any or all of the following:
- Write and execute your code in the browser
- Display the results of your analysis in a table or visualization
- Interweave your code with narrative to give it context or explain your thought process
- Output the notebook to a static webpage, PDF, slideshow, etc.

Jupyter notebooks are a valuable tool for creating a literate programming environment.[<sup>1</sup>](#fn1) They also allow you to document your research methods in order to produce publishable and reproducible results for peer review.

While some of the methods described below might be achieved in Excel, using a Jupyter notebook environment provides a better way to document your thought process and save your processing steps. That way, when you add new data, you don't have to remember all the earlier steps you took to clean your data. Or if a new researcher joins your team, she can easily follow along and understand what is happening. 

[<sup>1</sup>](#fn1) <span id="fn1">You can learn more about the literate programming paradigm and how to install and use Jupyter notebooks at this tutorial from the Programming Historian: https://programminghistorian.org/en/lessons/jupyter-notebooks.</span> 






## Using the Jupyter Notebook

A notebook is made up of a series of cells (like paragraphs). There are two main types: a code cell and a Markdown cell.

### Code Cells

Code cells contain snippets of code that the notebook can run. You'll see a `[ ]:` to the left of the cell, which tells you that the cell has code to run. After you've run the cell, you'll see a number, such as `[1]:`, that shows the order in which the cell was run.

To run the code in a cell block, you can do one of the following:
- press shift + enter, or
- click the <i class="fa-step-forward fa"></i> run button in the toolbar at the top of the page

The notebook will remember the code that you ran earlier and save any variables you created. If you want to re-run a cell, you can do so. Or you can re-run the whole notebook from the beginning. 

### Markdown Cells

Markdown cells are where you add your narrative (like this one). Markdown is a lightweight syntax for formatting your text. Some digital humanists prefer to take notes in a text editor using Markdown. You can learn more about writing in Markdown here: https://daringfireball.net/projects/markdown/syntax.   

### If you get stuck

Sometimes the notebook just gets stuck. If that happens, go to `Kernel` in the menu bar and select `Restart Kernel and Clear All Outputs`. The notebook will reset and you'll have to rerun the cells.

## Your first code

Type this phrase in the code cell below.

`print('Hello world')`

then run the cell -- [remember: shift + enter or the <i class="fa-step-forward fa"></i> run button in the toolbar]

and you'll see the results printed immediately below the code cell block.

In [None]:
print("Hello world")

Anything within quotation marks (single or double is fine) is interpreted by Python as a string (text). Python also recognizes numerical values, such as integers (whole numbers) and floats (decimal numbers). Try this, and see if you can predict what will happen:

`print(1+3)`

`print('1+3')` 

In [None]:
print(1+3)
print("1+3")

In the first case, Python performed the mathematical calculation and output the result. In the second one, the quotation marks told Python to interpret it as a string, so it output the characters between the quotation marks.
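
If you are ever unsure how Python is interpreting a value, the built-in `type()` function will tell you. The short sketch below uses example values chosen for illustration (they are not from the lesson's data) and also shows `int()`, which converts a string of digits into a number:

```python
# type() reports how Python is interpreting a value
print(type(1 + 3))     # an integer
print(type("1+3"))     # a string
print(type(1.5))       # a float

# int() turns a string of digits into a number so we can do math with it
year_text = "1822"     # example value chosen for illustration
year_number = int(year_text)
print(year_number + 1)
```

This matters when working with spreadsheets, because a column of years is sometimes read in as text rather than as numbers.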

You can also add strings together. For example, you have a first names column and a last names column and you want to make a single full names column.

In [None]:
first_name = "Cornelius"
last_name = "Vanderbilt"

full_name = first_name + " " + last_name 

print(full_name)

Sometimes you have full names and you want to split them into first and last names. You can do that too. This is very useful when you want to alphabetize a list of names written first name, last name.

In [None]:
full_name = "Cornelius Vanderbilt"

first_name = full_name.split()[0]
last_name = full_name.split()[1]

print(first_name)
print(last_name)
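
To show how splitting helps with alphabetizing, here is a minimal sketch. The list of names and the `last_name_first` helper are made up for illustration, and the sketch assumes every name has at least a first and a last name:

```python
# an example list of names, made up for illustration
names = ["Cornelius Vanderbilt", "Gloria Vanderbilt", "Amelia Earhart"]

def last_name_first(full_name):
    # split once, at the last space, so "Mary Ann Evans" gives "Mary Ann" and "Evans"
    first, last = full_name.rsplit(" ", 1)
    return last + ", " + first

# rewrite each name as "Last, First" and then sort alphabetically
sorted_names = sorted(last_name_first(name) for name in names)
print(sorted_names)
```

Splitting at the last space (`rsplit`) rather than the first keeps middle names with the first name.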

## Working with Variables

You can assign values to variables and manipulate them in functions.

In the following code cell, we assign the `name` variable a string. Then we tell Python to retrieve that variable to complete the sentence.


In [None]:
name = "Cornelius"
print("My name is " + name + ".")

If we make this a function, we can call the function repeatedly with different variables.

In [None]:
name1 = "Cornelius"
name2 = "Gloria"

def print_name(name):
    print("My name is " + name + ".")
    
print_name(name1)
print_name(name2)
    


Let's try some math.

Go ahead and run the code cell below, and feel free to change the `var` variable and see what happens. Or change the math operator.

In [None]:
var = 20
def double_it(var):
    return var*2

double_it(var)

## How might you use this?

Let's say you have a list of birth and death dates for several hundred people and you want to calculate their age upon death because you are analyzing life expectancy. You might have a spreadsheet that looks something like this:

| Name | Birth | Death |
|------|-------|-----|
|Amy|1822|1863|
|Ben|1847|1902|
|Chet|1835|1837|


First we will want to import our dataset into our notebook. We do this using the csv library (or package), an add-on for Python that simplifies our task. Another library we will want is pandas, which reads the csv file into a table (or dataframe) of rows and columns.


In [None]:
import csv
import pandas as pd

# this will import my file called "demo.csv" that is in the same file directory as my notebook
# and assign it to the pandas dataframe variable named "df"
df = pd.read_csv("demo.csv")  
df

We want to tell Python to add a new column for Age and then for each name, calculate the age.

In [None]:
df["Age"] = df['Death'] - df['Birth'] 
df


You can imagine how fast this is for a list of hundreds or thousands of names.
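
If you don't have a csv file at hand, you can try the same calculation on a table typed in by hand. This sketch assumes the column names match the small table shown above:

```python
import pandas as pd

# the same three rows shown in the table above, typed in by hand
df = pd.DataFrame({
    "Name": ["Amy", "Ben", "Chet"],
    "Birth": [1822, 1847, 1835],
    "Death": [1863, 1902, 1837],
})

# subtracting one column from another works row by row
df["Age"] = df["Death"] - df["Birth"]
print(df)
```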


## Basic Data Analysis

Now let's do some data analysis. Let's have it compute the average age.

In [None]:
number_list = df["Age"] 
avg = sum(number_list)/len(number_list)

print("The average life expectancy is", round(avg, 2))
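
Pandas can also do this calculation for us. As a rough sketch, using the ages worked out from the three example rows in the table above, the built-in `mean()` and `median()` methods give the same kind of summary:

```python
import pandas as pd

# ages worked out from the three example rows in the table above
ages = pd.Series([41, 55, 2])

avg = ages.mean()
print("The average age at death is", round(avg, 2))

# the median is less affected by unusually low or high values
print("The median age at death is", ages.median())
```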

Or we want to find out all the people born before 1840:

In [None]:
pre_1840 = df.loc[df['Birth'] < 1840]
pre_1840
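
It may help to see what the comparison inside `df.loc[...]` is doing. The sketch below rebuilds the three example rows by hand (assuming the same column names as the table above) and prints the intermediate step:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amy", "Ben", "Chet"],
    "Birth": [1822, 1847, 1835],
    "Death": [1863, 1902, 1837],
})

# the comparison produces one True/False answer per row
born_before_1840 = df["Birth"] < 1840
print(born_before_1840)

# .loc keeps only the rows where the answer is True
print(df.loc[born_before_1840])

# conditions can be combined with & (and) or | (or); each needs its own parentheses
born_1830s = df.loc[(df["Birth"] < 1840) & (df["Birth"] > 1830)]
print(born_1830s)
```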

Q: Write the code to identify those who died after 1860

In [None]:
# Write your code below this line


Let's say we have painstakingly accumulated data about all the passengers on the Titanic into a spreadsheet and we would like to do some data analysis about our passenger list. (The Titanic passenger list is a commonly used dataset in beginner data science classes and the full list can be found here: https://www.kaggle.com/c/titanic/data).

First we'll import our dataset into our notebook and assign it to the variable table. 


In [None]:
import csv
import pandas as pd 

table = pd.read_csv("titanic_data.csv")  # we are assigning the imported csv file to the variable table
table

You're seeing the first and last 5 rows of our dataset so you can see what types of information our spreadsheet has. It's telling us that our dataset has 891 rows (passengers) and 12 columns of information about those passengers, like name, age, ticket price, and whether or not they survived.

Let's say we want to see how many unique values there are for the 'Pclass' (passenger class) column.

In [None]:
class_list = pd.unique(table['Pclass'])
print(class_list)

This code gave us the values 1, 2, and 3 (listed in the order they first appear in the data), telling us that the 'Pclass' column has three possible values, for First Class, Second Class, and Third Class passengers.

Let's analyze our passengers by class. The Pandas library provides some built-in functions that let us do so quickly.

First we are going to group our passengers by class.
Then we ask pandas to provide some summary statistics.
And then we ask it to give us the mean (average) for any column that has a number in it.


In [None]:
# group data table by passenger class
grouped_class_data = table.groupby('Pclass')
# summary statistics for all numeric columns by class
grouped_class_data.describe()
# provide the mean for each requested column by class
grouped_class_data[['Survived', 'Age', 'Fare']].mean()

We can quickly see that First Class passengers had a higher survival rate than Third Class passengers. We can also see that First Class passengers tended to be older and paid over four times as much as Second Class passengers. 
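
Why does the mean of the 'Survived' column give a survival rate? That column stores 1 for a passenger who survived and 0 for one who did not, so its average is the share of survivors. Here is a small sketch with six made-up rows (these are not real passenger records):

```python
import pandas as pd

# six made-up rows for illustration; these are not real passenger records
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# the mean of a column of 0s and 1s is the share of 1s in each group
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates)
```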

Q: How would you calculate survival rate by gender instead of passenger class?

Now we want to count how many passengers are in each category.

In [None]:
grouped_sex_data = table.groupby('Sex')
# summary statistics for all numeric columns by sex
grouped_sex_data.describe()
# count the non-missing values in each column by sex
grouped_sex_data.count()

Our Titanic dataset has 314 female passengers and 577 male passengers.

Q: Write code to find out how many passengers are in each class.

In [None]:
grouped_data = table.groupby('Pclass')
grouped_data['Name'].count()

Here we learned that there were more Third Class passengers than First and Second Class passengers combined.
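
Another way to count, shown here as a sketch on made-up rows rather than the real passenger list, is the pandas `value_counts()` method, which tallies a single column without grouping first:

```python
import pandas as pd

# made-up rows for illustration; not real passenger records
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# value_counts() tallies how often each value appears, most common first
counts = toy["Pclass"].value_counts()
print(counts)

# normalize=True reports shares of the total instead of raw counts
shares = toy["Pclass"].value_counts(normalize=True)
print(shares)
```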

## Preparing a dataset

We would like to make an Omeka collection of photographs from productions of the Vanderbilt University Theatre Department (https://theatre.library.vanderbilt.edu/). We have a list of plays that we would like to batch import into Omeka as a csv file.

Omeka wants our csv file to be formatted in a specific way. In this exercise, we are going to read in the existing file, make some changes to it, and then write the changes to a new csv file.


In [None]:
import csv
import pandas as pd 

table = pd.read_csv("plays.csv")  # we are assigning the imported csv file to the variable table
table

It looks like our dates are written as M-D-YY or MM-DD-YY and we want them to all be written as YYYY-MM-DD. The YYYY-MM-DD format is standard for cultural heritage and helps avoid confusion between American and non-American dating conventions.

Let's first see how it works on a single date.

In [None]:
import datetime

date_str = "2/20/14" 
new_date = datetime.datetime.strptime(date_str, "%m/%d/%y")

print(new_date)
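
The printed result includes a time of day as well as the date. If we want only the YYYY-MM-DD text, one option is the `strftime` method, sketched here with the same example date:

```python
import datetime

date_str = "2/20/14"
parsed = datetime.datetime.strptime(date_str, "%m/%d/%y")

# strftime turns the datetime back into text in whatever layout we ask for
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)   # 2014-02-20
```

Note that `%y` reads two-digit years from 69 to 99 as 1969 to 1999 and from 00 to 68 as 2000 to 2068, so records of older productions may need a four-digit year in the source data.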


Now we want to change all the dates in the 'OpeningDate' and 'ClosingDate' columns to the YYYY-MM-DD format.

In [None]:
table['OpeningDate'] = pd.to_datetime(table['OpeningDate'])
table['ClosingDate'] = pd.to_datetime(table['ClosingDate'])

table
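
Real spreadsheets often contain a few cells that are not dates at all. The sketch below uses made-up values, and assumes the dates are laid out like the "2/20/14" example above, to show two optional arguments to `pd.to_datetime` that make the conversion more predictable:

```python
import pandas as pd

# example values made up for illustration, including one that is not a date
dates = pd.Series(["2/20/14", "11/5/13", "unknown"])

# format= tells pandas exactly how the text is laid out;
# errors="coerce" turns anything unreadable into NaT (a missing date) instead of stopping
parsed = pd.to_datetime(dates, format="%m/%d/%y", errors="coerce")
print(parsed)

# .dt.strftime writes the dates back out as YYYY-MM-DD text
as_text = parsed.dt.strftime("%Y-%m-%d")
print(as_text)
```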

Notice how the names in the Director column are written as FirstName LastName. We want to write them as LastName, FirstName so we can alphabetize them.

We can use the split string function we learned earlier.

In [None]:
# extract the Director column and assign it to the names list variable

names = table['Director'].values.tolist()

new_names = []

for name in names:
    try:
        # split the name on the space and assign the string 
        # to the left of the space to first name and 
        # the string to the right of the space to the last name
        
        first_name = name.split()[0]
        last_name = name.split()[1]    
        
        # create new full_name field with last_name, first_name format 
        full_name = last_name + ", " + first_name
        # add it to a list
        new_names.append(full_name)
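    except (AttributeError, IndexError):
        # we have some blanks in the director column (and a one-word
        # name would have no last name), so mark those rows as "n/a"
        new_names.append("n/a")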

# then replace Director column with our new list of names
table['Director'] = new_names
table
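
The same idea can be written as a small function and applied to the whole column at once. This is only a sketch: the `last_first` helper and the example rows are made up for illustration, and it assumes that whatever follows the last space is the last name:

```python
import pandas as pd

# example data made up for illustration; None stands in for a blank cell
table = pd.DataFrame({"Director": ["Cornelius Vanderbilt", None, "Mary Ann Evans"]})

def last_first(name):
    # blank cells arrive as missing values rather than strings
    if not isinstance(name, str):
        return "n/a"
    # split once, at the last run of whitespace
    parts = name.rsplit(None, 1)
    if len(parts) < 2:
        return "n/a"
    return parts[1] + ", " + parts[0]

# apply() runs the function on every value in the column
table["Director"] = table["Director"].apply(last_first)
print(table)
```

Checking for blanks explicitly, rather than relying on `try` and `except`, makes it easier to see which cases the code expects.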

We've changed our date and director fields in the Jupyter notebook, but now we want to export the revised table to a csv file so we can import it into Omeka.

Pandas will help us again with its `to_csv` method.



In [None]:
# index=False leaves the pandas row numbers out of the exported file
table.to_csv('revised_plays.csv', mode='w', header=True, index=False)

And voila, our csv is now ready to be imported into Omeka.
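
Before importing, it can be worth previewing exactly what pandas will write. As a sketch on a one-row table made up for illustration, calling `to_csv()` without a filename returns the csv text instead of saving it:

```python
import pandas as pd

# a one-row table made up for illustration
table = pd.DataFrame({"Title": ["Hamlet"], "Director": ["Vanderbilt, Cornelius"]})

# with no filename, to_csv() returns the csv text instead of saving a file
with_index = table.to_csv()
without_index = table.to_csv(index=False)

print(with_index)      # includes the row numbers as an extra first column
print(without_index)   # leaves the row numbers out
```

Notice that pandas puts quotation marks around the director's name, because the name itself contains a comma.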

## Creating unique identifiers

Let's say you've been analyzing correspondence and want to keep track of all the letters from various archives you have visited. You want to assign a unique identifier to each letter that combines the name of the archive with the letter number. Having a unique identifier for each letter will help you keep track and disambiguate any letters with similar metadata. You will also need to have a unique identifier for each item if you are importing into a content management system or digital archive.

Here's an unlikely list of correspondents:

In [None]:
import csv
import pandas as pd 

ltrs = pd.read_csv("demoltrs.csv")  
ltrs

In [None]:
ltrs['UniqueID'] = ltrs['Archive'] + "_" + ltrs['LetterNo'].astype(str)
ltrs

This example illustrates different data types. The 'Archive' column is a string, but the 'LetterNo' column is a number. We needed to tell Python to read the 'LetterNo' column as a string so we could concatenate it to the 'Archive' column to create our 'UniqueID' field.
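
Two optional refinements are sketched below on made-up rows (the archive names are placeholders, not the contents of the lesson's file): padding the letter numbers with zeros so the identifiers sort in order, and checking that no identifier appears twice.

```python
import pandas as pd

# example rows made up for illustration; the archive names are placeholders
ltrs = pd.DataFrame({
    "Archive": ["ArchiveA", "ArchiveA", "ArchiveB"],
    "LetterNo": [1, 12, 1],
})

# zfill(3) pads each number with leading zeros, so 1 becomes "001"
ltrs["UniqueID"] = ltrs["Archive"] + "_" + ltrs["LetterNo"].astype(str).str.zfill(3)
print(ltrs)

# is_unique is True only if no identifier appears twice
print(ltrs["UniqueID"].is_unique)
```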

## Next Steps

If time permits before our live session, you are welcome to review the following tutorials.

The Programming Historian has a 15-part series on learning Python, with an emphasis on web-scraping. Start here: https://programminghistorian.org/en/lessons/introduction-and-installation

The Programming Historian offers a more advanced tutorial on using Python to summarize and visualize data on millions of texts from the HathiTrust Research Center’s Extracted Features dataset: https://programminghistorian.org/en/lessons/text-mining-with-extracted-features

The Art of Literary Text Analysis is a series of Jupyter notebooks developed in conjunction with a literary text analysis class. They introduce concepts such as analyzing parts of speech, sentiment analysis, topic modelling, collocations, and more. You can find it at this GitHub repository: https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb

General Python lessons from Vanderbilt Library's Digital Scholarship Office: https://heardlibrary.github.io/digital-scholarship/script/python/wg/