# Introduction to Python (Continued)

Thursday, Sep. 28, 2022

Today, we're going to practice what we've learned to do so far in Python and learn a few more basic elements of Python data analysis.

- [Exercise 1. Using Python to calculate the most frequent words in a text file](#Exercise-1:-Using-Python-to-calculate-the-most-frequent-words-in-a-text-file)
- [2. Manipulate, clean, and sort lists](#2.-Using-Python-to-manipulate,-clean,-and-sort-lists-of-data)
- [3. Read and explore tabular data](#3.-Using-Python-to-read-&-explore-tabular-data)
- [4. (time permitting) Make simple data visualizations](#4.-Making-simple-data-visualizations)

## Recap of Python basics:

As part of your homework, you learned some of the basic ways we can use Python to open and manipulate text files (files that end in .txt). 

We also learned how to store and sort variables in short lists.

We learned the basic shape of a Python script:

- **import** statements come first, telling Python which libraries we'll use
- next we **define** filepaths and assign variables
- then we **define** any functions
- then we **read in** any external text files (files ending .txt) or tabular data that we'll be working with (files ending .csv)
- then we **manipulate** and **analyze** the file using the functions we've imported or defined
- then we **output** results
- and throughout the script we **add** comments with # hashtags to explain what the script does
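The steps above can be sketched as a tiny script. This is only an illustration of the overall shape: the sample sentence, the variable names, and the `count_words` function are made up for this example, and a string stands in for the text file we would normally read in.

```python
# script-shape-sketch.py (illustrative only)

# 1. Import libraries
from collections import Counter

# 2. Define filepaths and assign variables
sample_text = "the cat sat on the mat and the cat slept"  # stand-in for a file's contents
number_of_desired_words = 2

# 3. Define functions
def count_words(any_chunk_of_text):
    """Split text on whitespace and tally how often each word appears."""
    return Counter(any_chunk_of_text.lower().split())

# 4. Read in data (here we reuse the string above instead of opening a file)
full_text = sample_text

# 5. Manipulate and analyze
word_tally = count_words(full_text)
most_frequent_words = word_tally.most_common(number_of_desired_words)

# 6. Output results
print(most_frequent_words)
```

Running it prints `[('the', 3), ('cat', 2)]`.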



## QUESTIONS??? <img align="right" src="../_images/cat-typing.gif" width="300" height="200"/>

What was confusing? Interesting?

Are there words or terms that it would be helpful to define?

What lingering questions do you have about the exercises?

> Python can be confusing. When I first encountered it, it seemed confusing! (Why do I need quotation marks around some things? What's the difference between a function and a variable? Why do I keep getting errors?) 

>If you, like me, felt like this cat flailing around on a keyboard, that's TOTALLY NORMAL!!! =>

### Let's practice what we've learned! 

## Exercise 1: Using Python to calculate the most frequent words in a text file

### Step 1. 
Copy our script for counting word frequency in [Introduction to Python Basics](https://mybinder.org/v2/gh/sceckert/introdhfall2022/main?urlpath=lab/tree/_week4/introduction-to-python.ipynb), part 1 into the cell below:

In [4]:
# word-frequencies.py

# Import Libraries and Modules

import re
from collections import Counter

# Define Functions


def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)  # raw string (r"...") so Python passes \W to `re` unchanged
    return split_words

# Define Filepaths and Assign Variables

# Try substituting in a new file here!
filepath_of_text = '../_datasets/literature/19th-and-early-20th-century-literature/Mary-Shelley-Frankenstein-1818.txt' 
number_of_desired_words = 40

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

# Read in File

full_text = open(filepath_of_text, encoding="utf-8").read()

# Manipulate and Analyze File

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in stopwords]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

# Output Results

most_frequent_meaningful_words

[('one', 198),
 ('could', 182),
 ('would', 176),
 ('yet', 160),
 ('father', 144),
 ('man', 134),
 ('upon', 125),
 ('may', 121),
 ('every', 120),
 ('life', 111),
 ('time', 111),
 ('might', 111),
 ('shall', 107),
 ('said', 107),
 ('first', 101),
 ('eyes', 99),
 ('gutenberg', 97),
 ('day', 94),
 ('saw', 94),
 ('towards', 91),
 ('elizabeth', 91),
 ('night', 90),
 ('project', 87),
 ('mind', 87),
 ('found', 87),
 ('death', 84),
 ('ever', 83),
 ('even', 82),
 ('feelings', 80),
 ('work', 79),
 ('felt', 78),
 ('heart', 77),
 ('must', 76),
 ('dear', 73),
 ('thought', 73),
 ('many', 71),
 ('friend', 70),
 ('also', 69),
 ('never', 68),
 ('soon', 67)]

### Step 2. 
As a group, take a minute to **describe what this script does**, in plain English. 

- What is the sequence of steps in the script?
- What are stopwords? 
- What effect do they have on the output?

> Sidenote:  
> If you want to learn more about stopwords (and their history), check out Daniel Rosenberg's article ["Stop, Words"](https://www-jstor-org.ezproxy.princeton.edu/stable/10.1525/rep.2014.127.1.83?seq=1#metadata_info_tab_contents) (2014)

### Step 3. 
Let's test out our script on other texts. 

1. Choose a text from the following list and paste its filepath into the `filepath_of_text` variable in the cell above.
     - Start by comparing the 1818 and 1831 editions of Mary Shelley's novel (both downloaded from the [Project Gutenberg](https://www.gutenberg.org/) text library)
     - Then try some others, like the full text collections that Amardeep Singh put together of 19th-century African American Literature (read more about it [here](https://github.com/amardeepmsingh/African-American-Literature-Text-Corpus-1853-1923)) or Colonial South Asian Literature (read more about this collection [here](https://github.com/amardeepmsingh/Colonial-South-Asian-Literature))

```
filepath_of_text = '../_datasets/literature/sample-corpus/Mary-Shelley-Frankenstein-1818.txt'
filepath_of_text = '../_datasets/literature/sample-corpus/Mary-Shelley-Frankenstein-1831.txt'
filepath_of_text = '../_datasets/literature/sample-corpus/Jane-Austen-Pride-and-Prejudice.txt'
filepath_of_text = '../_datasets/literature/sample-corpus/Mary-Shelley-The-Last-Man-1826.txt'
filepath_of_text = '../_datasets/literature/African-American-Literature-Text-Corpus/African-American-Literature-text-files/w-e-b-du-bois-the-quest-of-the-silver-fleece-1911.txt' 

```

### Step 4.

Look carefully back at the list of filepaths above that we've used to read in different text files. 

1. What do the different parts of the filepath mean? (Think back to our lesson on the command line & what we learned about directories)
2. What files do we need (and where do they need to be in relation to this Jupyter notebook) in order to run this code?
3. If you were running this notebook on your own computer through Anaconda Navigator's version of JupyterLab (not on this cloud-hosted Binder), what files would you need to make sure to have (and where would you need to put them)?


***‼️ This might seem trivial, but a small mistake in a filepath can be a significant source of error when you're running Python! ‼️*** 
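One way to catch a filepath mistake early is to ask Python whether a file actually exists at that path before trying to open it. The sketch below uses the standard library's `pathlib`; the helper name `check_filepath` is made up for this example.

```python
from pathlib import Path

def check_filepath(filepath):
    """Return a short message saying whether a file exists at `filepath`."""
    path = Path(filepath)
    if path.is_file():
        return f"Found: {path.name}"
    return f"No file at: {path} (looked from inside {Path.cwd()})"

# A relative path is resolved starting from the folder the notebook is running in;
# '..' means "go up one directory".
print(check_filepath('../_datasets/literature/sample-corpus/Mary-Shelley-Frankenstein-1818.txt'))
```

If the message starts with "No file at", compare the path piece by piece against your folders: a missing `..`, a misspelled directory, or a different file extension is usually the culprit.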

### ✨For the future✨:
### I want to try this out on other texts!  Where else can I find text files? 

[Project Gutenberg](https://www.gutenberg.org/)
<img src="../_images/gutenberg.png" width="700" height="40"/>

[Oxford Text Archive](https://ota.bodleian.ox.ac.uk/repository/xmlui/)
<img src="../_images/ota.png" width="700" height="40"/>

Alan Liu's [list of demo corpora](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora) -- collections of text files

Projects that create specialized databases:
- [Old Bailey Online](https://www.oldbaileyonline.org/) (a database of 18th-century criminal trials)
- [Early Novels Database](https://github.com/earlynovels/end-dataset) (a database of early fiction)


-----
## 💡Let's learn some new Python tricks! 💡
---

## 2. Using Python to manipulate, clean, and sort lists of data

Let's say we're historians interested in critically analyzing representations of disease in public health data. We might turn to a dataset like the [Bellevue Almshouse Dataset](https://nyuirish.net/almshouse/the-almshouse-records/), which draws on 19th-century records of Irish immigrants from the Bellevue Almshouse in New York City, and ask how "disease" was captured in it.

We could look at a small sample of this data. 

In [4]:
import pandas as pd # This command imports the library `pandas` -- we'll be learning more about it in a later lesson!
pd.read_csv('../_datasets/bellevue-almshouse-dataset/bellevue_almshouse_modified.csv').head(10)

| Unnamed: 0 | date_in | first_name | last_name | age | disease | profession | gender | children |
|---|---|---|---|---|---|---|---|---|
| 0 | 1847-04-17 | Mary | Gallagher | 28.0 | recent emigrant | married | w | Child Alana 10 days |
| 1 | 1847-04-08 | John | Sanin (?) | 19.0 | recent emigrant | laborer | m | Catherine 2 mo |
| 2 | 1847-04-17 | Anthony | Clark | 60.0 | recent emigrant | laborer | m | Charles Riley afed 10 days |
| 3 | 1847-04-08 | Lawrence | Feeney | 32.0 | recent emigrant | laborer | m | Child |
| 4 | 1847-04-13 | Henry | Joyce | 21.0 | recent emigrant | NaN | m | Child 1 mo |
| 5 | 1847-04-14 | Bridget | Hart | 20.0 | recent emigrant | spinster | w | Child |
| 6 | 1847-04-14 | Mary | Green | 40.0 | recent emigrant | spinster | w | And child 2 months |
| 7 | 1847-04-19 | Daniel | Loftus | 27.0 | destitution | laborer | m | NaN |
| 8 | 1847-04-10 | James | Day | 35.0 | recent emigrant | laborer | m | NaN |
| 9 | 1847-04-10 | Margaret | Farrell | 30.0 | recent emigrant | widow | w | NaN |


In our dataset above, notice those NaN values?? This is the way that the dataframe datatype indicates MISSING DATA, i.e., a blank field in our CSV file. Keep this at the back of your mind; we'll come back to what those blank values might mean for our dataset. 

What if we wanted to:

- know how many times a certain value appears in the data (e.g., the so-called disease “recent emigrant”)

- programmatically change all blank values in the data (e.g., from a blank to “no disease recorded”)

- find the most and least common values in the data (e.g., most common “diseases” or professions)?

We can use something called a Python list to store data and perform an operation on it!
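As a quick warm-up, here is a minimal sketch of what a list can do on its own. The list `sample_labels` is made up for illustration and is not taken from the real dataset.

```python
# A short, made-up list of labels (not taken from the real dataset)
sample_labels = ['sickness', '', 'recent emigrant', 'sickness', '']

print(len(sample_labels))                # how many items are in the list
print(sample_labels[0])                  # the first item (Python counts from 0)
print(sample_labels.count('sickness'))   # how many times one value appears
print(sample_labels.count(''))           # how many blanks there are
```

The built-in `.count()` method answers the first question above for a single value; the tools later in this section handle every value at once.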

Let's look at some sample lists. Each of the lists below contains values drawn from one column of the dataset above.

In [None]:
first_names = ['Unity', 'Catherine', 'Thomas', 'William', 'Patrick', 'Mary Anne', 'Morris',
               'Michael', 'Ellen', 'James', 'Michael', 'Hannah', 'Alexander', 'Mary A', 'Serena?',
               'Margaret', 'Michael', 'Jane', 'Rosanna', 'James', 'Michael', 'John', 'John', 'Mary',
               'Bantel', 'Marcella', 'Arthur', 'Michael', 'Mary', 'Martin']

last_names =  ['Harkin', 'Doyle', 'McDonald', 'Jordan', 'Rouse', 'Keene', 'Brown',
               'McLoughlin', 'Cassidy', 'Whittle', 'Coyle', 'Cullen', 'Cozens', 
               'Maly', 'McGuire', 'Laly', 'Bahan', 'Combs', 'McGovern', 'Gallagher', 
               'Crone', 'Brannon', 'McDonal', 'Atkins', 'Garragan', 'Wood', 'Kelly', 'Galeny', 'Welch', 'Kerly']

diseases = ['', 'recent emigrant', 'sickness', '', '', '', 'destitution', '', 'sickness', '',
            'sickness', 'recent emigrant', '', 'insane', 'recent emigrant', 'insane', '', '',
            'sickness', 'sickness', '', 'syphilis', 'sickness', '', 'recent emigrant', 'destitution',
            'sickness', 'recent emigrant', 'sickness', 'sickness']


ages = ['22', '21', '23', '47', '45', '28', '23', '50', '26', '28', '30', '30', '65', '17', '35',
        '27', '32', '40', '22', '30', '27', '40', '41', '37', '16', '20', '30', '30', '35', '9']

### Using a `for` loop to iterate over a list
In your homework, you learned about `for` loops: they're a way of iterating over a set of items in a list in Python. 

For instance, we could iterate over all of the categories that appear in our `diseases` list:

In [None]:
for disease in diseases:
    print(disease)

We could also add more information, like printing each item's position in the list.

To do this, we use the `enumerate()` function:

###  `enumerate()` 

In [None]:
# We can get a little fancy and add more to our label: 
for number, disease in enumerate(diseases):
    print(f'Person {number}:', disease) # Here we use an f-string to insert the value of `number` into a text label.
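Note that `enumerate()` starts counting at 0, so the first entry is labeled "Person 0". If you would rather count from 1, `enumerate()` accepts an optional `start` argument. A minimal sketch, using a made-up three-item list:

```python
sample_diseases = ['sickness', 'recent emigrant', 'destitution']  # made-up short list

# start=1 makes the counter begin at 1 instead of 0
for number, disease in enumerate(sample_diseases, start=1):
    print(f'Person {number}: {disease}')
```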


### Using loops to extract subsets of our data

We can use a loop to select only some items from a list.

We create a new empty list and then use a `for` loop and an `if` to add items to it from our list *if* they meet our requirements:

In [None]:
original_list = ['oranges', 'item_we_want', 'item_we_want', 'apples','item_we_want','item_we_DO_NOT_want']

new_list = []
for item in original_list:
    if item == 'item_we_want':
        new_list.append(item)

In [None]:
print(new_list)

Say we wanted a list of persons whose listed `disease` was "sickness."  We create a new (empty) list, called `diseases_subset`, and then use a `for` loop and an `if` to add items to it.

In [None]:
diseases_subset = []

for disease in diseases:
    if disease == "sickness":
        diseases_subset.append(disease)

In [None]:
print(diseases_subset)

We can create this list in a slightly more compact fashion, called a *list comprehension*.

Instead of writing out the full `for` loop, we can put the loop *INSIDE* the brackets that contain the list we want to make:

In [None]:
newer_list = [item for item in original_list if item == 'item_we_want']

In [None]:
print(newer_list)
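To see that the two approaches really are equivalent, here is a sketch that builds the "sickness" subset both ways and compares the results. The list `sample_diseases` is a small made-up stand-in so that the cell runs on its own.

```python
sample_diseases = ['', 'recent emigrant', 'sickness', '', 'sickness', 'insane']  # small stand-in

# The long way: an empty list, a `for` loop, and an `if`
sickness_loop = []
for disease in sample_diseases:
    if disease == 'sickness':
        sickness_loop.append(disease)

# The compact way: the same loop and condition, written inside the brackets
sickness_comprehension = [disease for disease in sample_diseases if disease == 'sickness']

print(sickness_loop == sickness_comprehension)  # True: both lists hold the same items
```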

### Using loops to clean our data
What if we wanted to update our data, so that blank data was not just a blank, but something more descriptive?

We can use `for` loops and conditional statements to add items to the list if they meet some conditions, or modify and add them if they meet others:

We create a new list called `updated_diseases`, and then use `if` and `else` statements:

In [None]:
updated_diseases = []

for disease in diseases:
    if disease == '': # Here, we're telling the loop to look for diseases that are marked as blank
        new_disease = 'no disease recorded' # assigning a new variable called `new_disease`
        updated_diseases.append(new_disease) # tell the loop to record 'no disease recorded' if disease field originally blank
    else:
        updated_diseases.append(disease) # Here we're telling the loop to add the disease label as is, if it's not blank.

Let's check inside our new list:

In [None]:
updated_diseases
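The same cleaning step can also be written as a list comprehension, by putting the `if` / `else` choice in front of the `for`. This is just an alternative sketch, shown on a made-up four-item list; the loop version above does exactly the same job.

```python
sample_diseases = ['', 'sickness', '', 'destitution']  # made-up short list

# For each item: use the replacement text if it is blank, otherwise keep it as is
updated_sample = ['no disease recorded' if disease == '' else disease
                  for disease in sample_diseases]

print(updated_sample)
```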

### Counting items in a list or collection

To count items in lists or collections, we'll have to **`import`** a pre-written function called `Counter` from a library called `collections`


In [None]:
from collections import Counter

To use the `Counter()` function, we just need to put the thing we want counted inside the parentheses.

In [None]:
Counter(updated_diseases)

What we've created is a `Counter`, a special kind of dictionary where each key is a category of "disease" and each value is the number of times that category appears in our list.
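Because the tally behaves like a dictionary, we can look up one label at a time or loop over every label and its count. A minimal sketch, using a handful of made-up values:

```python
from collections import Counter

sample_tally = Counter(['sickness', '', 'sickness', 'insane', ''])  # made-up values

print(sample_tally['sickness'])   # look up one label's count, like a dictionary
print(sample_tally['typhus'])     # a label that never appears counts as 0, with no error

# Loop over every label and its count
for label, count in sample_tally.items():
    print(repr(label), count)
```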

We can do some nifty things with this dictionary, like sort it into the most and least common diseases.


In [None]:
# To sort into the most common diseases:
disease_tally = Counter(updated_diseases)
disease_tally.most_common()

In [None]:
# To find the two least common diseases:
disease_tally = Counter(updated_diseases)
disease_tally.most_common()[-2:] # most_common() runs from most to least frequent, so the last two entries are the least common

### Exercise 2: Counting and sorting "professions" in a list

In [None]:
professions = ['married', 'married', 'laborer', 'laborer', 'widow', 'married', 'spinster',
                     'laborer', 'spinster', 'laborer', 'spinster', 'spinster', 'married', 'laborer',
                     'laborer', 'spinster', 'laborer', 'laborer', 'laborer', 'laborer', 'laborer', 'spinster',
                     'laborer', 'spinster', 'widow', 'spinster', 'painter', 'laborer', 'weaver', 'laborer']


1. Count the number of times each profession appears in this list

In [None]:
## Your code here

2. Make a list of the top 5 most common professions

In [None]:
## Your code here

3. Make a new list of professions that contains only the entries where `profession` is "spinster"

In [None]:
## Your code here

4. Make a `for` loop that considers each item in the professions list and prints "Person's profession is ___"

In [None]:
## Your code here
    ## Your code here

----
## 💡Check in: 

Let's take a minute to think about the categories we've been analyzing-- `diseases` and `professions`. 

What are we actually counting when we count these labels? 

Is there anything in the Anelise Shrout essay that could help us think about these labels? 

Remember that these are categories that these Irish immigrants were slotted into by the government. For example, the so-called "disease" that many of the people in this dataset exhibited (the reason they were admitted to the Almshouse in the first place) is "recent emigrant." What does this uncomfortable fact tell us about data more broadly? What should we make of the fact that Python, as a programming language, doesn't understand the meaning or historical context of this data?

---

## 3. Using Python to read & explore tabular data

We'll be covering this more in a future week, but for now, we can look at a few more things that Python can help us do:

- Read in a CSV file (remember, CSV is short for comma-separated values, a format for storing tabular data)
- Explore and filter data
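Before we bring in any special tools, it can help to see that a CSV is just plain text: one row per line, with commas between the columns. The sketch below uses Python's built-in `csv` module on a few made-up lines (the names and values are invented, and a string stands in for a file on disk).

```python
import csv
import io

# A header line and two made-up rows; a real CSV file holds text just like this
csv_text = """first_name,last_name,disease
Mary,Example,sickness
John,Sample,
"""

# DictReader uses the header line to label each value in every row
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row['first_name'], '|', repr(row['disease']))
```

Notice that the second person's `disease` comes back as an empty string: nothing was written after the last comma on that line.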

Let's look back at our Bellevue Almshouse Dataset. First we have to load a special library called "pandas", which will help us load in tabular data (from a CSV) and store it as something called a **dataframe** -- a special Python object that we can perform operations on. Think of a dataframe as a souped-up spreadsheet: it stores values in rows and columns.

In [None]:
# Import Pandas library, nicknaming it "pd"
import pandas as pd
# Set the maximum number of rows to display
pd.options.display.max_rows = 100

### Read in our CSV
Here, we're going to read in a CSV file of Bellevue Almshouse data using pandas (which we've nicknamed `pd`) and its `.read_csv()` function, and assign the result to the variable `bellevue_df` (so we can remind ourselves that this is the Bellevue dataframe).

In [None]:
bellevue_df = pd.read_csv('../_datasets/bellevue_almshouse_modified.csv', delimiter=",", parse_dates=['date_in'])
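To see what the `parse_dates` argument does, here is a small sketch that reads two made-up rows from a string instead of a file (the rows and the name `sample_df` are invented for this example). With `parse_dates`, pandas stores the `date_in` column as real dates rather than plain text, which lets us pull out pieces such as the year.

```python
import io
import pandas as pd

# Two made-up rows standing in for a CSV file on disk
csv_text = "date_in,first_name\n1847-04-17,Mary\n1847-04-08,John\n"

sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_in'])

print(sample_df.dtypes)               # `date_in` is a datetime column, not plain text
print(sample_df['date_in'].dt.year)   # so we can extract the year from each date
```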

### Display our DataFrame
Like other Python variables, we can simply type the name of our DataFrame to get a peek at what it contains

In [None]:
bellevue_df

Display just the first 5 rows:

In [None]:
bellevue_df.head(5) # Use the .head() operation. Remember head and tail from the command line? It's the same principle

### Calculate summary statistics of the DataFrame
We can calculate some statistics on a DataFrame. This is a little like calculating summary statistics in Microsoft Excel or Google Sheets.

In [None]:
bellevue_df.describe(include='all')

### Select columns
To select a column from the DataFrame, we will type the name of the DataFrame followed by square brackets `[]` and a column name in quotation marks.

In [None]:
# Select only the column labeled "disease"
bellevue_df['disease']

To select multiple columns, we need to hand pandas a *list* of column names, so we enclose them in TWO sets of square brackets `[[ ]]`. The result is a smaller dataframe.

In [None]:
bellevue_df[['first_name', 'last_name', 'disease']]
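The difference between one and two sets of brackets shows up in the kind of object we get back. A minimal sketch on a made-up two-row dataframe (`sample_df` and its values are invented for illustration):

```python
import pandas as pd

sample_df = pd.DataFrame({'first_name': ['Mary', 'John'],
                          'disease': ['sickness', 'destitution']})  # made-up rows

one_column = sample_df['disease']        # one set of brackets: a single column
as_dataframe = sample_df[['disease']]    # two sets of brackets: a (one-column) dataframe

print(type(one_column).__name__)     # Series
print(type(as_dataframe).__name__)   # DataFrame
```

A single column is called a Series in pandas; operations like `.value_counts()` are usually applied to a Series.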

### Count Values
To count values, we select a column and use the `.value_counts()` method:

In [None]:
bellevue_df['disease'].value_counts()

To show only the 10 most common "diseases", use brackets to slice the result:

In [None]:
bellevue_df['disease'].value_counts()[:10]

### Exercise 3: Counting and sorting professions from a DataFrame


Using the operations we've learned, count the top 10 most common "professions" in our `bellevue_df`:

In [None]:
### Your code here

## 4. Making simple data visualizations

We can make some simple visualizations of what we've found!
Run the cell below:

In [None]:
bellevue_df['disease'].value_counts()[:10].plot(kind='bar', title='Bellevue Almshouse:\nMost Frequent "Diseases"') 

But wait --- what about all those ***blank spots*** that show up as NaN??

By default, `.value_counts()` leaves missing (NaN) values out of its tally. So all of those blank entries never make it into our plot.
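Here is a small sketch of how missing values drop out of a count, using a made-up column with two blanks. The `dropna=False` argument to `.value_counts()` asks pandas to count the missing values as well.

```python
import pandas as pd

# A tiny made-up column with two missing values
sample = pd.Series(['sickness', None, 'sickness', 'insane', None])

print(sample.value_counts())               # missing values are left out by default
print(sample.value_counts(dropna=False))   # dropna=False counts them too
```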

We can try and make these blank spots a little more descriptive, much like we did for our simpler lists! 


### `.isna()` , `.notna()`, and `.fillna()`
For dataframes, there are built-in ways of handling missing data. The methods `.isna()` and `.notna()` let us check whether each value is NaN (or not), and `.fillna()` lets us fill in blank values in a dataframe or in a section of a dataframe (like a column).

In [None]:
# Create a new column called `disease_updated` that fills in all the blank spots in our `disease` column with 
# "no disease recorded"
bellevue_df['disease_updated'] = bellevue_df['disease'].fillna('no disease recorded')
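The cell above uses `.fillna()`. As a companion, here is a minimal sketch of `.isna()` and `.notna()` on a made-up four-row dataframe (`sample_df` is invented for this example), showing how to count the blanks and how to keep only the rows that have a value.

```python
import pandas as pd

sample_df = pd.DataFrame({'disease': ['sickness', None, 'insane', None]})  # made-up rows

print(sample_df['disease'].isna())         # True wherever the value is missing
print(sample_df['disease'].isna().sum())   # the number of missing values

# Keep only the rows where `disease` is NOT missing
recorded_only = sample_df[sample_df['disease'].notna()]
print(recorded_only)
```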

Now, let's check our dataframe -- we should see a new column called `disease_updated`:

In [None]:
bellevue_df

And now let's try and plot again, this time, plotting the column of updated diseases:

In [None]:
bellevue_df['disease_updated'].value_counts()[:10].plot(kind='bar', title='Bellevue Almshouse:\nMost Frequent "Diseases"') 

Compare this to our earlier plot. What do you notice?

### Exercise 4:
Let's try to explore another column of our dataset, `profession`.
With a partner, try to outline the code you would have to write if you wanted to plot the top 10 professions.

> Hint: Don't forget about missing data! 

> Are there persons in our dataset whose professions are not recorded? How might we capture that fact in our visualization?