# Introduction to Python for Humanists

In this lesson, we'll provide a brief introduction to using Python for some common tasks that humanists encounter when working with data.

## Jupyter Notebooks and Literate Programming

We are using something called a Jupyter notebook, which provides an integrated environment that combines human-readable text and computer-readable coding blocks.

Jupyter notebooks allow you to do any or all of the following:
- Write and execute your code in the browser
- Display the results of your analysis in a table or visualization
- Interweave your code with narrative to give it context or explain your thought process
- Output the notebook to a static webpage, PDF, slideshow, etc.

Jupyter notebooks are a valuable tool for creating a literate programming environment.[<sup>1</sup>](#fn1) They also allow you to document your research methods in order to produce publishable and reproducible results for peer review.

While some of the methods described below might be achieved in Excel, using a Jupyter notebook environment provides a better way to document your thought process and save your processing steps. That way, when you add new data, you don't have to remember all the earlier steps you took to clean your data. Or if a new researcher joins your team, she can easily follow along and understand what is happening. 

[<sup>1</sup>](#fn1) <span id="fn1">You can learn more about the literate programming paradigm and how to install and use Jupyter notebooks at this tutorial from the Programming Historian: https://programminghistorian.org/en/lessons/jupyter-notebooks.</span> 






## Using the Jupyter Notebook

A notebook is made up of a series of cells (like paragraphs). There are two main types: a code cell and a Markdown cell.

### Code Cells

Code cells contain snippets of code that the notebook can run. To run the code in a cell, you can do one of the following:
- press shift + enter, or
- click the <i class="fa-step-forward fa"></i> run button in the toolbar at the top of the page

The notebook will remember the code that you ran earlier and save any variables you created. If you want to re-run a cell, you can do so. Or you can re-run the whole notebook from the beginning. 

### Markdown Cells

Markdown cells are where you add your narrative (like this one). Markdown is a lightweight syntax for formatting your text. Some digital humanists prefer to take notes in a text editor using Markdown. You can learn more about writing in Markdown here: https://daringfireball.net/projects/markdown/syntax.

## Your first code

Type this phrase in the code cell below.

`print('Hello world')`

then run the cell -- [remember: shift + enter or the <i class="fa-step-forward fa"></i> run button in the toolbar]

and you'll see the results printed immediately below the code cell block.

In [1]:
print("Hello world")

Hello world


Anything within quotation marks (single or double is fine) is interpreted by Python as a string (text). Python also recognizes numerical values, such as integers (whole numbers) and floats (decimal numbers). Try this, and see if you can predict what will happen:

`print(1+3)`

`print('1+3')` 

In [2]:
print(1+3)
print("1+3")

4
1+3


In the first case, Python performed the mathematical calculation and output the result. In the second, the quotation marks told Python to interpret the input as a string, so it output the characters between the quotation marks exactly as written.

You can also add strings together. For example, suppose you have a first names column and a last names column and you want to make a single full names column.

In [3]:
first_name = "Cornelius"
last_name = "Vanderbilt"

full_name = first_name + " " + last_name 

print(full_name)

Cornelius Vanderbilt


Sometimes you have full names and you want to split them into first and last names. You can do that too. This is very useful when you want to alphabetize a list of names written first name, last name.

In [None]:
full_name = "Cornelius Vanderbilt"

first_name = full_name.split()[0]
last_name = full_name.split()[1]

print(first_name)
print(last_name)
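To connect this to the alphabetizing task mentioned above, here is a minimal sketch that sorts a short list of names by last name. The list and the helper function `last_name` are made up for illustration, and the sketch assumes the last word of each name is the surname, which will not hold for every name you meet in real data.

```python
# hypothetical list of names, written "First Last"
names = ["Virginia Woolf", "Cornelius Vanderbilt", "Maya Angelou", "Gloria Vanderbilt"]

def last_name(full_name):
    # split() breaks the string on whitespace; [-1] picks the final word
    return full_name.split()[-1]

# sorted() calls last_name on each entry and orders the list by the result
sorted_names = sorted(names, key=last_name)
print(sorted_names)
```

Names that share a last name keep their original relative order, because Python's sort is stable.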

## Working with Variables

You can assign values to variables and manipulate them in functions.

In the following code cell, we assign the `name` variable a string. Then we tell Python to retrieve that variable to complete the sentence.


In [None]:
name = "Cornelius"
print("My name is " + name + ".")

If we make this a function, we can call the function repeatedly with different values.

In [None]:
name1 = "Cornelius"
name2 = "Gloria"

def print_name(name):
    print("My name is " + name + ".")
    
print_name(name1)
print_name(name2)
    


Let's try some math.

Go ahead and run the code cell below, and feel free to change the `var` variable and see what happens. Or change the math operator (for example, from `*` to `+`).

In [None]:
var = 20
def double_it(var):
    return var*2

double_it(var)

## How might you use this?

Let's say you have a list of birth and death dates for several hundred people and you want to calculate their age at death because you are analyzing life expectancy. You might have a spreadsheet that looks something like this:

| Name | Birth | Death |
|------|-------|-----|
|Amy|1822|1863|
|Ben|1847|1902|
|Chet|1835|1837|

First, we're going to load our spreadsheet data into our notebook as a pandas table (a DataFrame). Here we type the values in directly; if the data lived in a csv file, you would read the file in instead.

In [4]:
import pandas as pd

# each list holds one column of the spreadsheet
names = ["Amy", "Ben", "Chet"]
birth_yrs = [1822, 1847, 1835]
death_yrs = [1863, 1902, 1837]

# zip pairs up the values row by row; columns= gives each column its name
df = pd.DataFrame(list(zip(names, birth_yrs, death_yrs)), columns=['Name', 'Birth', 'Death'])
df


Unnamed: 0,Name,Birth,Death
0,Amy,1822,1863
1,Ben,1847,1902
2,Chet,1835,1837


We want to tell Python to add a new column for Age and then for each name, calculate the age.

In [5]:
df["Age"] = df['Death'] - df['Birth'] 
df


Unnamed: 0,Name,Birth,Death,Age
0,Amy,1822,1863,41
1,Ben,1847,1902,55
2,Chet,1835,1837,2


You can imagine how fast this is for a list of hundreds or thousands of names.
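To see what the column subtraction is doing, here is a sketch of the same calculation written out in plain Python, one row at a time. It reuses the three birth and death years from the table above; the variable names are only for illustration.

```python
birth_yrs = [1822, 1847, 1835]
death_yrs = [1863, 1902, 1837]

# zip pairs each birth year with the matching death year;
# the subtraction then runs once per pair
ages = [death - birth for birth, death in zip(birth_yrs, death_yrs)]
print(ages)  # [41, 55, 2]
```

The pandas version, `df['Death'] - df['Birth']`, does this pairing and subtracting for the whole column in a single step.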


## Basic Data Analysis

Now let's do some data analysis. Let's have Python compute the average age.

In [6]:
number_list = df["Age"] 
avg = sum(number_list)/len(number_list)
print("The average life expectancy is ", round(avg,2))

The average life expectancy is  32.67
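As an alternative sketch, pandas columns also come with their own `mean()` and `median()` methods, so the sum and the division do not have to be written by hand. The example below rebuilds the three ages from above as a standalone column so that it can run on its own.

```python
import pandas as pd

# the three ages calculated above, as a standalone pandas column
ages = pd.Series([41, 55, 2], name="Age")

avg = ages.mean()       # same result as sum(ages) / len(ages)
middle = ages.median()  # the middle value once the ages are sorted

print(round(avg, 2))  # 32.67
print(middle)         # 41.0
```

With a very young death like Chet's in the data, the median can describe the typical age better than the mean.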


Or suppose we want to find everyone born before 1840:

In [7]:
pre_1840 = df.loc[df['Birth'] < 1840]
pre_1840

Unnamed: 0,Name,Birth,Death,Age
0,Amy,1822,1863,41
2,Chet,1835,1837,2
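It can help to look at the filter in two steps. The sketch below rebuilds the small table so it runs on its own; the variable name `mask` is just an illustrative choice. The comparison produces one True or False per row, and the table then keeps only the rows marked True.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amy", "Ben", "Chet"],
    "Birth": [1822, 1847, 1835],
    "Death": [1863, 1902, 1837],
})

# step 1: the comparison gives one True/False answer per row
mask = df["Birth"] < 1840
print(mask.tolist())  # [True, False, True]

# step 2: keep only the rows where the answer is True
pre_1840 = df[mask]
print(pre_1840["Name"].tolist())  # ['Amy', 'Chet']
```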


Q: Write the code to identify those who died after 1860

In [8]:
# write your code below this line


## Preparing a dataset

Let's say you've been analyzing correspondence and want to keep track of all the letters from the various archives you have visited. You want to assign a unique identifier to each letter that combines the name of the archive with the letter number. Here's an unlikely list of correspondents:

| Archive | Letter No | Sender |
|------|-------|-----|
|British Library| 422 |Virginia Woolf|
|Folger| 98 |William Shakespeare|
|Harvard| 735 |Maya Angelou|

In [25]:
archive = ["British Library", "Folger", "Harvard"]
letternum = [422, 98, 735]
sender = ["Virginia Woolf", "William Shakespeare", "Maya Angelou"]

df = pd.DataFrame(list(zip(archive, letternum, sender)), columns=['Archive', 'LetterNum', 'Sender'] )
df

Unnamed: 0,Archive,LetterNum,Sender
0,British Library,422,Virginia Woolf
1,Folger,98,William Shakespeare
2,Harvard,735,Maya Angelou


In [26]:
df['UniqueID'] = df['Archive'] + df['LetterNum'].astype(str)
df

Unnamed: 0,Archive,LetterNum,Sender,UniqueID
0,British Library,422,Virginia Woolf,British Library422
1,Folger,98,William Shakespeare,Folger98
2,Harvard,735,Maya Angelou,Harvard735


In [27]:
df['UniqueID'] = df['Sender'] + df['LetterNum'].astype(str)
df

Unnamed: 0,Archive,LetterNum,Sender,UniqueID
0,British Library,422,Virginia Woolf,Virginia Woolf422
1,Folger,98,William Shakespeare,William Shakespeare98
2,Harvard,735,Maya Angelou,Maya Angelou735
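Identifiers such as `British Library422` contain a space and run the name straight into the number. One possible variant is sketched below: lowercase the archive name, replace spaces with hyphens, and pad the letter number with zeros. The underscore separator and the width of four digits are arbitrary assumptions made for this example, not a requirement of any archive.

```python
import pandas as pd

df = pd.DataFrame({
    "Archive": ["British Library", "Folger", "Harvard"],
    "LetterNum": [422, 98, 735],
})

# lowercase the archive name and swap spaces for hyphens
slug = df["Archive"].str.lower().str.replace(" ", "-", regex=False)

# turn the number into text and pad it with zeros to four digits
padded = df["LetterNum"].astype(str).str.zfill(4)

df["UniqueID"] = slug + "_" + padded
print(df["UniqueID"].tolist())
# ['british-library_0422', 'folger_0098', 'harvard_0735']
```

Zero-padding keeps the identifiers in numerical order when they are sorted as text.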


## Further Resources

The Programming Historian has a 15-part series on learning Python for web-scraping. Start here: https://programminghistorian.org/en/lessons/introduction-and-installation

The Art of Literary Text Analysis is a series of Jupyter notebooks developed in conjunction with a literary text analysis class. They introduce concepts such as analyzing parts of speech, sentiment analysis, topic modelling, collocations, and more. You can find it in this GitHub repository: https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb

Let's say we have painstakingly accumulated data about all the passengers on the Titanic into a spreadsheet and we would like to do some data analysis about our passenger list. (The Titanic passenger list is a commonly used dataset in beginner data science classes and the full list can be found here: https://www.kaggle.com/c/titanic/data).

First we will want to import our dataset into our notebook. We do this using a library (or package): a collection of ready-made code that simplifies our task. Python comes with a built-in csv library for reading csv files, but here we rely on pandas, which reads the file and turns its rows and columns into a table (or dataframe).



In [20]:
import csv
import pandas as pd 

table = pd.read_csv("titanic_data.csv")  # we are assigning the imported csv file to the variable table
table

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


You're seeing the first and last 5 rows of our dataset so you can see what types of information our spreadsheet has. It's telling us that our dataset has 891 rows (passengers) and 12 columns of information about those passengers, like name, age, ticket price, and whether or not they survived.

Let's say we want to see how many unique values there are in the 'Pclass' (passenger class) column.

In [26]:
class_list = pd.unique(table['Pclass'])
print(class_list)

[3 1 2]


This code gave us the values [3 1 2], listed in the order they first appear in the data. This tells us that the 'Pclass' column has three possible values: 1, 2, and 3, for First Class, Second Class, and Third Class passengers.
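Here is a minimal sketch of the same idea on a small, made-up column (the values are hypothetical, not the Titanic data). It shows the distinct values in order of first appearance, the same values sorted, and a count of how many there are, using the pandas methods `unique()` and `nunique()`.

```python
import pandas as pd

# hypothetical passenger classes, for illustration only
pclass = pd.Series([3, 1, 3, 2, 1])

distinct = pclass.unique().tolist()  # distinct values, in order of first appearance
ordered = sorted(distinct)           # the same values, sorted
how_many = pclass.nunique()          # the number of distinct values

print(distinct)  # [3, 1, 2]
print(ordered)   # [1, 2, 3]
print(how_many)  # 3
```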

Let's say we want to analyze our passengers by class. The Pandas library provides some built-in functions that let us do so quickly.

First we are going to group our passengers by class.
Then we ask pandas to provide some summary statistics.
And then we ask it to give us the mean (average) for any column that has a number in it.


In [27]:
# group data table by passenger class
grouped_class_data = table.groupby('Pclass')
# summary statistics for all numeric columns by class
grouped_class_data.describe()
# provide the mean for each requested column by class
grouped_class_data[['Survived', 'Age', 'Fare']].mean()

Unnamed: 0_level_0,Survived,Age,Fare
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0.62963,38.233441,84.154687
2,0.472826,29.87763,20.662183
3,0.242363,25.14062,13.67555


We can quickly see that First Class passengers had a higher survival rate than Third Class passengers. We can also see that First Class passengers tended to be older and paid over four times as much as Second Class passengers. 
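To make the grouping step less mysterious, here is a sketch on a tiny, invented table of four passengers (hypothetical values, not rows from the Titanic data). Because 'Survived' is recorded as 1 or 0, the mean of that column within each group is the share of passengers in that group who survived.

```python
import pandas as pd

# four invented passengers, for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Survived": [1, 0, 0, 0],
    "Fare": [80.0, 60.0, 10.0, 8.0],
})

# group the rows by class, then average the chosen columns within each group
means = toy.groupby("Pclass")[["Survived", "Fare"]].mean()
print(means)
```

In this toy table, one of the two First Class passengers survived, so the mean of 'Survived' for class 1 is 0.5.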

Q: How would you calculate survival rate by gender instead of passenger class?

Now we want to count how many passengers are in each category.

In [43]:
grouped_sex_data = table.groupby('Sex')
# summary statistics for all numeric columns by sex
grouped_sex_data.describe()
# count the non-empty values in each column by sex
grouped_sex_data.count()

Unnamed: 0_level_0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
female,314,314,314,314,261,314,314,314,314,97,312
male,577,577,577,577,453,577,577,577,577,107,577


Our Titanic dataset has 314 female passengers and 577 male passengers.

Q: Write code to find out how many passengers are in each class.

In [50]:
grouped_data = table.groupby('Pclass')
grouped_data['Name'].count()

Pclass
1    216
2    184
3    491
Name: Name, dtype: int64

Here we learned that there were more Third Class passengers than First and Second Class passengers combined.

## Let's switch to a different dataset.

We would like to make an Omeka collection of photographs from productions of the Vanderbilt University Theatre Department (https://theatre.library.vanderbilt.edu/). We have a list of plays that we would like to batch import into Omeka as a csv file.

Omeka wants our csv file to be formatted in a specific way. In this exercise, we are going to read in the existing file, make some changes to it, and then write the changes to a new csv file.


In [1]:
import csv
import pandas as pd 

table = pd.read_csv("plays.csv")  # we are assigning the imported csv file to the variable table
table

Unnamed: 0,Title,Author,Season,OpeningDate,ClosingDate,Director
0,24-Hour Production,Vanderbilt University Theatre,2013-2014,4/5/14,4/6/14,
1,A Shayna Maidel,"Lebow, Barbara",2016-2017,11/4/16,11/6/16,Santiago Sosa
2,Beaux' Stratagem,"Farquhar, George",2014-2015,2/13/15,2/21/15,Jon Hallquist
3,Cabaret Vanderbilt,Vanderbilt University Theatre,2013-2014,4/16/14,4/16/14,Christin Essin
4,Cabaret Vanderbilt: Gender Play,Vanderbilt University Theatre,2016-2017,4/13/17,4/15/17,Christin Essin
5,Children's Hour,"Hellman, Lillian",2011-2012,2/17/12,2/24/12,Jon Hallquist
6,City of Songs,"Granger, Brian | Vanderbilt University Theatre",2015-2016,11/6/15,11/14/15,Brian Granger
7,Cradle Will Rock,"Blitzstein, Marc",2013-2014,11/1/13,11/9/13,Leah Lowe
8,Dead Man's Cell Phone,"Ruhl, Sarah",2013-2014,9/27/12,9/30/12,Leah Lowe
9,How to Build a Forest,"Damour, Pearl | Hall, Shawn",2013-2014,3/28/14,3/29/14,


It looks like our dates are written as M/D/YY or MM/DD/YY and we want them all to be written as YYYY-MM-DD. The YYYY-MM-DD format (defined in the ISO 8601 standard) is widely used for cultural heritage data and helps avoid confusion between American and non-American dating conventions.

Let's first see how it works on a single date.

In [2]:
import datetime

date_str = "2/20/14" 
new_date = datetime.datetime.strptime(date_str, "%m/%d/%y")

print(new_date)


2014-02-20 00:00:00
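The printed result above still carries a time of day. As a small sketch of the remaining step, the `strftime` method of Python's built-in `datetime` objects turns the parsed date back into text in exactly the YYYY-MM-DD shape; the variable names here are only illustrative.

```python
import datetime

date_str = "2/20/14"

# strptime reads text into a date object: month/day/two-digit year
parsed = datetime.datetime.strptime(date_str, "%m/%d/%y")

# strftime writes the date object back out as text in the format we ask for
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)  # 2014-02-20
```

One caution for historical sources: with `%y`, Python reads two-digit years 69 to 99 as 1969 to 1999 and 00 to 68 as 2000 to 2068, so older dates need a four-digit year in the data.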


Now we want to change all the dates in the 'OpeningDate' and 'ClosingDate' columns to the YYYY-MM-DD format.

In [95]:
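# format= spells out that the dates are written month/day/two-digit year
table['OpeningDate'] = pd.to_datetime(table['OpeningDate'], format="%m/%d/%y")
table['ClosingDate'] = pd.to_datetime(table['ClosingDate'], format="%m/%d/%y")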

table

Unnamed: 0.1,Unnamed: 0,Title,Author,Season,OpeningDate,ClosingDate,Director
0,0,24-Hour Production,Vanderbilt University Theatre,2013-2014,2014-04-05,2014-04-06,
1,1,A Shayna Maidel,"Lebow, Barbara",2016-2017,2016-11-04,2016-11-06,Santiago Sosa
2,2,Beaux' Stratagem,"Farquhar, George",2014-2015,2015-02-13,2015-02-21,Jon Hallquist
3,3,Cabaret Vanderbilt,Vanderbilt University Theatre,2013-2014,2014-04-16,2014-04-16,Christin Essin
4,4,Cabaret Vanderbilt: Gender Play,Vanderbilt University Theatre,2016-2017,2017-04-13,2017-04-15,Christin Essin
5,5,Children's Hour,"Hellman, Lillian",2011-2012,2012-02-17,2012-02-24,Jon Hallquist
6,6,City of Songs,"Granger, Brian | Vanderbilt University Theatre",2015-2016,2015-11-06,2015-11-14,Brian Granger
7,7,Cradle Will Rock,"Blitzstein, Marc",2013-2014,2013-11-01,2013-11-09,Leah Lowe
8,8,Dead Man's Cell Phone,"Ruhl, Sarah",2013-2014,2012-09-27,2012-09-30,Leah Lowe
9,9,How to Build a Forest,"Damour, Pearl | Hall, Shawn",2013-2014,2014-03-28,2014-03-29,


In [3]:
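# Rewrite each director's name as "Last, First"; empty entries stay empty.
# A whole column needs the .str accessor: table['Director'].split() raises an AttributeError.
# rsplit with n=1 cuts at the last space, assuming the final word is the surname.
name_parts = table['Director'].str.rsplit(" ", n=1, expand=True)
table['Director'] = name_parts[1] + ", " + name_parts[0]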

We've changed our date and director fields in the Jupyter notebook, but now we want to export the revised table to a csv file so we can import it into Omeka.

The pandas library will help us again: its `to_csv` method writes the table out to a file.



In [92]:
# index=False leaves out the row numbers so they don't become an extra column
table.to_csv('revised_plays.csv', mode='w', header=True, index=False)
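Before importing the file into Omeka, it can be worth checking what was actually written. The sketch below uses a one-row, made-up table and writes it to an in-memory text buffer (Python's `io.StringIO`) instead of a real file, so the csv text can be inspected directly. The table contents and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# a one-row stand-in for the plays table, for illustration only
table = pd.DataFrame({
    "Title": ["Cabaret Vanderbilt"],
    "OpeningDate": pd.to_datetime(["4/16/14"], format="%m/%d/%y"),
})

# write the csv text into memory rather than onto disk
buffer = io.StringIO()
table.to_csv(buffer, index=False)  # index=False: no extra column of row numbers
csv_text = buffer.getvalue()

print(csv_text)
```

The first line of the printed text should contain only the column names, with no unnamed leading column.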