# Introduction to Python (Continued)

Thursday, Feb. 25, 2021

Today, we're going to learn to do a few more things with Python.


## Recap of Python basics:

As part of your homework, you learned some of the basic ways we can use Python to open and manipulate text files (files that end in .txt). We also learned how to store and sort variables in short lists.

We learned how any basic work with Python takes the same basic shape:

- **import** any libraries 
- **define** filepaths and assign variables
- **define** any functions
- **read in** any external text files (files ending .txt) or tabular data that you'll be working with (files ending .csv)
- **manipulate** and **analyze** your file
- **output** results 

And you remembered to 
- **add** comments with # hashtags
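
As a reminder, that shape might look something like the minimal sketch below. To keep it self-contained, a short made-up string (`sample_text`) stands in for the contents of a .txt file, and the function name `count_words` is just an illustrative choice. It borrows `Counter`, which comes up again later in this lesson.

```python
# import any libraries
from collections import Counter

# define filepaths and assign variables
# (a made-up string stands in here for a text file we would normally read in)
sample_text = "the cat sat on the mat and the dog sat too"

# define any functions
def count_words(text):
    """Split the text on whitespace and tally each lowercase word."""
    words = text.lower().split()
    return Counter(words)

# manipulate and analyze
word_counts = count_words(sample_text)

# output results
print(word_counts.most_common(2))
```

With a real file, the only step that changes is where `sample_text` comes from.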



## QUESTIONS??

<img src="https://media.giphy.com/media/lJNoBCvQYp7nq/giphy.gif" width="300" height="200"/>

## 1. Using Python to calculate the word frequency of a text file

### Let's practice! 

1. Copy our script for counting word frequency in [Introduction to Python Basics](https://mybinder.org/v2/gh/sceckert/introdhspring2021/main?urlpath=lab/tree/_week4/introduction-to-python.ipynb), part 1 into the cell below:

2. Choose a text from one of the following sets to serve as your file


#### Where else can I find text files? 

[Project Gutenberg](https://www.gutenberg.org/)
<img src="images/gutenberg.png" width="700" height="40"/>

[Oxford Text Archive](https://ota.bodleian.ox.ac.uk/repository/xmlui/)
<img src="images/ota.png" width="700" height="40"/>

Alan Liu's [list of demo corpora](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora) -- collections of text files

Projects that create specialized databases:
- [Old Bailey Online](https://www.oldbaileyonline.org/) (a database of 18th-century criminal trials)
- [Early Novels Database](https://github.com/earlynovels/end-dataset) (a database of early fiction)


## 2. Using Python to manipulate, clean, and sort lists of data

Let's say we're historians who are interested in critically analyzing representations of disease in public health data. We might look at how "disease" was captured in the [Bellevue Almshouse Dataset](https://nyuirish.net/almshouse/the-almshouse-records/), which draws on records of Irish immigrants from the 19th-century Bellevue Almshouse in New York City.

We could look at a small sample of this data. 

In [None]:
import pandas as pd # This command imports the library `pandas`; we'll be learning more about it in a later lesson!
pd.read_csv('../_datasets/bellevue_almshouse_modified.csv').head(10)

In our dataset above, notice those NaN values? This is how a dataframe indicates MISSING DATA, i.e., a blank field in our CSV file. Keep this in the back of your mind; we'll come back to what those blank values might mean for our dataset.

What if we wanted to:

- know how many times a certain value appears in the data (e.g., the so-called disease “recent emigrant”)

- programmatically change all blank values in the data (e.g., from a blank to “no disease recorded”)

- find the most and least common values in the data (e.g., most common “diseases” or professions)?

We can use something called a Python list!

Let's learn what that is using some sample lists. Each of the lists below contains values drawn from one column of the dataset above.

In [None]:
first_names = ['Unity', 'Catherine', 'Thomas', 'William', 'Patrick', 'Mary Anne', 'Morris',
               'Michael', 'Ellen', 'James', 'Michael', 'Hannah', 'Alexander', 'Mary A', 'Serena?',
               'Margaret', 'Michael', 'Jane', 'Rosanna', 'James', 'Michael', 'John', 'John', 'Mary',
               'Bantel', 'Marcella', 'Arthur', 'Michael', 'Mary', 'Martin']

last_names =  ['Harkin', 'Doyle', 'McDonald', 'Jordan', 'Rouse', 'Keene', 'Brown',
               'McLoughlin', 'Cassidy', 'Whittle', 'Coyle', 'Cullen', 'Cozens', 
               'Maly', 'McGuire', 'Laly', 'Bahan', 'Combs', 'McGovern', 'Gallagher', 
               'Crone', 'Brannon', 'McDonal', 'Atkins', 'Garragan', 'Wood', 'Kelly', 'Galeny', 'Welch', 'Kerly']

diseases = ['', 'recent emigrant', 'sickness', '', '', '', 'destitution', '', 'sickness', '',
            'sickness', 'recent emigrant', '', 'insane', 'recent emigrant', 'insane', '', '',
            'sickness', 'sickness', '', 'syphilis', 'sickness', '', 'recent emigrant', 'destitution',
            'sickness', 'recent emigrant', 'sickness', 'sickness']


ages = ['22', '21', '23', '47', '45', '28', '23', '50', '26', '28', '30', '30', '65', '17', '35',
        '27', '32', '40', '22', '30', '27', '40', '41', '37', '16', '20', '30', '30', '35', '9']
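
Before we loop over these lists, here is a small sketch of a few things any list lets us do. It uses only the last five ages from the list above, stored under the made-up name `ages_sample`. Note that the ages are text strings, not numbers, which matters when we sort them.

```python
# The last five ages from the `ages` list above (stored as text strings)
ages_sample = ['20', '30', '30', '35', '9']

print(len(ages_sample))    # how many items are in the list
print(ages_sample[0])      # the first item (Python counts from 0)
print(ages_sample[-1])     # the last item
print(ages_sample[:3])     # a "slice" containing the first three items

# Sorting strings goes character by character, so '9' lands after '35'
sorted_as_text = sorted(ages_sample)

# Telling sorted() to treat each item as an integer sorts by numeric value
sorted_as_numbers = sorted(ages_sample, key=int)

print(sorted_as_text)
print(sorted_as_numbers)
```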

### Using a `for` loop to iterate over a list
In your homework, you learned about `for` loops: they're a way of iterating over a set of items in a list in Python. 

For instance, we could iterate over all of the categories that appear in our `diseases` list:

In [None]:
for disease in diseases:
    print(disease)

We could also add more information, such as each item's position in the list.

To do this, we use the `enumerate()` function, which counts from 0 unless we tell it otherwise:

###  `enumerate()` 

In [None]:
# We can get a little fancy and add more to our label: 
for number, disease in enumerate(diseases):
    print(f'Person {number}:', disease) # Here we use an `f-string` to insert the value of `number` into our label.
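
If we would rather number people starting from 1, `enumerate()` accepts a `start` argument. Here is a small sketch using just the first five entries of our `diseases` list; the names `diseases_sample` and `labels` are made up for this example.

```python
# The first five entries from the `diseases` list above
diseases_sample = ['', 'recent emigrant', 'sickness', '', '']

labels = []
for number, disease in enumerate(diseases_sample, start=1):
    label = f'Person {number}: {disease}'  # the f-string fills in both variables
    labels.append(label)                   # keep each label so we can reuse it later
    print(label)
```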


### Using loops to extract subsets of our data

We can use a loop to select only some items from a list.

We create a new empty list and then use a `for` loop and an `if` to add items to it from our list *if* they meet our requirements:

In [None]:
original_list = ['oranges', 'item_we_want', 'item_we_want', 'apples','item_we_want','item_we_DO_NOT_want']

new_list = []
for item in original_list:
    if item == 'item_we_want':
        new_list.append(item)

In [None]:
print(new_list)

Say we wanted a list of persons whose listed `disease` was "sickness."  We create a new (empty) list, called `diseases_subset`, and then use a `for` loop and an `if` to add items to it.

In [None]:
diseases_subset = []

for disease in diseases:
    if disease == "sickness":
        diseases_subset.append(disease)

In [None]:
print(diseases_subset)
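
Notice that this gives us a list of the word "sickness" repeated, not the people themselves. One way to get their names is to walk through several lists at once with Python's built-in `zip()` function. The sketch below assumes the lists line up row by row and uses only the first five entries of each; the `_sample` names are made up for this example.

```python
# The first five entries from each of the lists above
first_names_sample = ['Unity', 'Catherine', 'Thomas', 'William', 'Patrick']
last_names_sample = ['Harkin', 'Doyle', 'McDonald', 'Jordan', 'Rouse']
diseases_sample = ['', 'recent emigrant', 'sickness', '', '']

people_with_sickness = []

# zip() hands us one item from each list at a time, like reading across a row
for first, last, disease in zip(first_names_sample, last_names_sample, diseases_sample):
    if disease == 'sickness':
        people_with_sickness.append(f'{first} {last}')

print(people_with_sickness)
```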

We can create this list in a slightly more compact fashion, called a *list comprehension*.

Instead of writing out the full `for` loop, we can put the loop *INSIDE* the brackets that contain the list we want to make:

In [None]:
newer_list = [item for item in original_list if item == 'item_we_want']

In [None]:
print(newer_list)
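
The same pattern works on our disease data. Here is a short sketch using the first ten entries of the `diseases` list; the variable names are made up for this example.

```python
# The first ten entries from the `diseases` list above
diseases_sample = ['', 'recent emigrant', 'sickness', '', '', '',
                   'destitution', '', 'sickness', '']

# Keep only the entries that say 'sickness'
sickness_only = [disease for disease in diseases_sample if disease == 'sickness']

# Keep every entry that is NOT blank
recorded_only = [disease for disease in diseases_sample if disease != '']

print(sickness_only)
print(recorded_only)
```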

### Using loops to clean our data
What if we wanted to update our data, so that blank data was not just a blank, but something more descriptive?

We can use `for` loops and conditional statements to add items to the list if they meet some conditions, or modify and add them if they meet others:

We create a new list called `updated_diseases`, and then use `if` and `else` statements:

In [None]:
updated_diseases = []

for disease in diseases:
    if disease == '': # Here, we're telling the loop to look for diseases that are blank
        new_disease = 'no disease recorded' # assign a new variable called `new_disease`
        updated_diseases.append(new_disease) # record 'no disease recorded' if the disease field was originally blank
    else:
        updated_diseases.append(disease) # otherwise, add the disease label as is

Let's check inside our new list:

In [None]:
updated_diseases
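
For the curious, the same cleaning step can also be written as a list comprehension, with the `if` and `else` moved in front of the `for`. This is only an alternative sketch on the first five entries of `diseases`; the loop above does exactly the same job.

```python
# The first five entries from the `diseases` list above
diseases_sample = ['', 'recent emigrant', 'sickness', '', '']

# Read it as: use the new label if the entry is blank, otherwise keep the entry
updated_sample = ['no disease recorded' if disease == '' else disease
                  for disease in diseases_sample]

print(updated_sample)
```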

### Counting items in a list or collection

To count items in lists or collections, we'll have to **`import`** a ready-made tool called `Counter` from a module called `collections`, which comes with Python.


In [None]:
from collections import Counter

To use `Counter()`, we just need to put the thing we want counted inside the parentheses.

In [None]:
Counter(updated_diseases)

What we've created is a dictionary-like object in which each key is a category of "disease" and each value is the number of times that category appears in our list.

We can do some nifty things with this dictionary, like sort it into the most and least common diseases.


In [None]:
# To sort into the most common diseases:
disease_tally = Counter(updated_diseases)
disease_tally.most_common()

In [None]:
# To find the two least common diseases:
disease_tally = Counter(updated_diseases)
disease_tally.most_common()[-2:] # .most_common() sorts from most to least common, so the last two items are the two least common
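
We can also ask the tally about one value at a time, using square brackets just as we would with a dictionary. The sketch below uses the first ten entries of the original `diseases` list (blanks included); `'cholera'` is a made-up value that does not appear in our sample, included only to show what happens when we look up something that is missing.

```python
from collections import Counter

# The first ten entries from the `diseases` list above
diseases_sample = ['', 'recent emigrant', 'sickness', '', '', '',
                   'destitution', '', 'sickness', '']

sample_tally = Counter(diseases_sample)

print(sample_tally['sickness'])     # how many times 'sickness' appears
print(sample_tally[''])             # how many entries are blank
print(sample_tally['cholera'])      # a value that never appears gives 0, not an error
print(sample_tally.most_common(1))  # a number in the parentheses limits how many we get back
```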

Try to do the same with professions:

## 3. Using Python to read & explore tabular data

We'll be covering this more in a future week, but for now, we can look at a few more things that Python can help us do:

- Read in a CSV file (remember, CSV is short for comma-separated values, a format for storing tabular data)
- Explore and filter data
- Make simple plots and data visualizations

Let's look back at our Bellevue Almshouse Dataset. We're going to take a peek at it.

In [None]:
# Import Pandas library, nicknaming it "pd"
import pandas as pd
# Set the maximum number of rows to display
pd.options.display.max_rows = 100

### Read in our CSV

In [None]:
bellevue_df = pd.read_csv('../_datasets/bellevue_almshouse_modified.csv', delimiter=",", parse_dates=['date_in'])
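
To see what `parse_dates` and blank fields do without needing the file itself, here is a small sketch that reads a three-row CSV from a text string. The column names follow our dataset, but the dates are invented purely for illustration, and `csv_text` and `sample_df` are made-up names.

```python
import io
import pandas as pd

# A tiny stand-in CSV; the dates are invented for illustration only.
# Notice the first row ends in a comma: its `disease` field is blank.
csv_text = """date_in,first_name,last_name,disease
1847-01-01,Unity,Harkin,
1847-01-02,Catherine,Doyle,recent emigrant
1847-01-03,Thomas,McDonald,sickness
"""

# io.StringIO lets pandas read the string as if it were a file
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_in'])

print(sample_df.dtypes)             # `date_in` is stored as dates, not plain text
print(sample_df['disease'].isna())  # the blank field was read in as missing (NaN)
```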

### Display our dataframe

In [None]:
bellevue_df

Display just the first 5 rows:

In [None]:
bellevue_df.head(5) # Use the .head() method. Remember head and tail from the command line? It's the same principle

### Calculate statistics:

In [None]:
bellevue_df.describe(include='all')

### Select Columns
To select a column from the DataFrame, we will type the name of the DataFrame followed by square brackets and a column name in quotation marks.

In [None]:
# Select only the column labeled "disease"
bellevue_df['disease']

To select multiple columns, we put a list of column names inside the brackets, which means TWO sets of square brackets. The result is a smaller dataframe:

In [None]:
bellevue_df[['first_name', 'last_name', 'disease']]
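
Square brackets can also filter rows: if we put a condition inside them, we keep only the rows where that condition is true. Here is a minimal sketch on a small hand-built dataframe, using the first five entries from our earlier lists with blanks entered as missing values; `sample_df` and the other variable names are made up for this example.

```python
import pandas as pd

# A small dataframe built from the first five entries of our earlier lists
sample_df = pd.DataFrame({
    'first_name': ['Unity', 'Catherine', 'Thomas', 'William', 'Patrick'],
    'last_name': ['Harkin', 'Doyle', 'McDonald', 'Jordan', 'Rouse'],
    'disease': [None, 'recent emigrant', 'sickness', None, None],
})

# One set of brackets gives a single column
one_column = sample_df['disease']

# Two sets of brackets give a smaller dataframe
two_columns = sample_df[['first_name', 'disease']]

# A condition inside the brackets keeps only the rows where it is True
sickness_rows = sample_df[sample_df['disease'] == 'sickness']

print(sickness_rows)
```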

### Count Values
To count values, we select a column and use the `.value_counts()` method.

In [None]:
bellevue_df['disease'].value_counts()

To show only the 10 most common values, use brackets to slice:

In [None]:
bellevue_df['disease'].value_counts()[:10]

In [None]:
bellevue_df['profession'].value_counts()[:10]
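
Two small variations may be handy here: `.head()` does the same job as the slice, and `normalize=True` gives proportions instead of raw counts. The sketch below uses the first ten entries of our earlier `diseases` list, with blanks entered as missing values; the variable names are made up for this example.

```python
import pandas as pd

# The first ten entries from the earlier `diseases` list, blanks entered as missing
sample_diseases = pd.Series([None, 'recent emigrant', 'sickness', None, None, None,
                             'destitution', None, 'sickness', None])

# .head(2) keeps the first two rows, just like slicing with [:2]
top_two = sample_diseases.value_counts().head(2)

# normalize=True reports each value's share of the recorded (non-missing) entries
proportions = sample_diseases.value_counts(normalize=True)

print(top_two)
print(proportions)
```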

## 4. Making simple data visualizations

We can make some simple visualizations of what we've found!
Run the cell below:

In [None]:
bellevue_df['disease'].value_counts()[:10].plot(kind='bar', title='Bellevue Almshouse:\nMost Frequent "Diseases"') 

But wait: what about all those ***blank spots*** that show up as NaN?

By default, `.value_counts()` leaves out missing values (NaN), so all of those blank entries never make it into our counts or our plot.

We can try and fix this, much like we did for our simpler lists! 

### `.isna()` , `.notna()`, and `.fillna()`
For dataframes, there are ways of working with missing data. Three useful methods are `.isna()`, `.notna()`, and `.fillna()`. The first two check whether each value is NaN (or not), and the third fills in blank values in a dataframe or in a section of a dataframe (like a column).

In [None]:
# Create a new column called `disease_updated` that fills in all the blank spots in our `disease` column with 
# "no disease recorded"
bellevue_df['disease_updated'] = bellevue_df['disease'].fillna('no disease recorded')
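
The cell above uses `.fillna()`. To see all three methods side by side, here is a minimal sketch on a small hand-built dataframe, using the first five entries from our earlier lists with blanks entered as missing values; `sample_df` and the other variable names are made up for this example.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'first_name': ['Unity', 'Catherine', 'Thomas', 'William', 'Patrick'],
    'disease': [None, 'recent emigrant', 'sickness', None, None],
})

# .isna() marks each missing value as True; .sum() then counts the Trues
number_missing = sample_df['disease'].isna().sum()

# .notna() is the opposite, so it can keep only the rows with a disease recorded
recorded_rows = sample_df[sample_df['disease'].notna()]

# .fillna() replaces the missing values with a label of our choosing
sample_df['disease_updated'] = sample_df['disease'].fillna('no disease recorded')

print(number_missing)
print(sample_df['disease'].value_counts())          # missing values are left out
print(sample_df['disease_updated'].value_counts())  # now they are counted
```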

Now, let's check our dataframe -- we should see a new column called `disease_updated`:

In [None]:
bellevue_df

And now let's try and plot again, this time, plotting the column of updated diseases:

In [None]:
bellevue_df['disease_updated'].value_counts()[:10].plot(kind='bar', title='Bellevue Almshouse:\nMost Frequent "Diseases"') 

Compare this to our earlier plot. What do you notice?