# Introduction to Python (Continued)

Wednesday, Feb. 2, 2022

Today, we're going to practice what we've learned to do so far in Python and learn a few more basic elements of Python data analysis.

- [Exercise 1. Using Python to calculate the most frequent words in a text file](#Exercise-1:-Using-Python-to-calculate-the-most-frequent-words-in-a-text-file)
- [2. Manipulate, clean, and sort lists](#2.-Using-Python-to-manipulate,-clean,-and-sort-lists-of-data) 


## Recap of Python basics:

As part of your homework, you learned some of the basic ways we can use Python to open and manipulate text files (files that end in .txt). 

We also learned how to store and sort variables in short lists.

We learned the basic shape of a Python script:

- **import** statements come first, telling Python which libraries to load
- next we **define** filepaths and assign variables
- then we **define** any functions
- then we **read in** any external text files (files ending .txt) or tabular data that you'll be working with (files ending .csv)
- then we **manipulate** and **analyze your file** using the functions we've imported or defined
- then we **output** results
- And throughout the script we **add** comments with # hashtags to explain what your script does
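
To make that shape concrete, here is a minimal sketch of a script that follows the same order. It is only an illustration: the filename in the first comment and the `sample_text` string are made up for this example, and the string stands in for a real `.txt` file so that the cell runs on its own.

```python
# script-skeleton.py (a hypothetical example, not one of the course files)

# 1. Import libraries
from collections import Counter

# 2. Define filepaths and assign variables
# (a short made-up string stands in for a real .txt file here)
sample_text = "the cat sat on the mat and the dog sat too"
number_of_desired_words = 2

# 3. Define functions
def split_into_words(any_chunk_of_text):
    # lowercase the text, then split it wherever there is whitespace
    return any_chunk_of_text.lower().split()

# 4. Read in the data (skipped here because we are using sample_text)

# 5. Manipulate and analyze
word_tally = Counter(split_into_words(sample_text))
most_frequent_words = word_tally.most_common(number_of_desired_words)

# 6. Output results
print(most_frequent_words)  # [('the', 3), ('sat', 2)]
```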



## QUESTIONS??? <img align="right" src="images/cat-typing.gif" width="300" height="200"/>

What was confusing? Interesting?

Are there words or terms that it would be helpful to define?

What lingering questions do you have about the exercises?

> Python can be confusing––when I first encountered it, it seemed confusing! (Why do I need quotation marks around some things? What's the difference between a function and a variable? Why do I keep getting errors?) 

>If you, like me, felt like this cat flailing around on a keyboard, that's TOTALLY NORMAL!!! =>

### Let's practice what we've learned! 

## Exercise 1: Using Python to calculate the most frequent words in a text file

### Step 1. 
Copy our script for counting word frequency in [Introduction to Python Basics](https://mybinder.org/v2/gh/sceckert/introdhspring2021/main?urlpath=lab/tree/_week4/introduction-to-python.ipynb), part 1 into the cell below:

In [4]:
# word-frequencies.py

# Import Libraries and Modules

import re
from collections import Counter

# Define Functions


def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)  # raw string (r"...") so Python leaves the backslash in \W alone
    return split_words

# Define Filepaths and Assign Variables

# Try substituting in a new file here!
filepath_of_text = '../_datasets/texts/literature/Mary-Shelley-Frankenstein-1818.txt' 
number_of_desired_words = 40

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

# Read in File

full_text = open(filepath_of_text, encoding="utf-8").read()

# Manipulate and Analyze File

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in stopwords]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

# Output Results

most_frequent_meaningful_words

[('one', 198),
 ('could', 182),
 ('would', 176),
 ('yet', 160),
 ('father', 144),
 ('man', 134),
 ('upon', 125),
 ('may', 121),
 ('every', 120),
 ('life', 111),
 ('time', 111),
 ('might', 111),
 ('shall', 107),
 ('said', 107),
 ('first', 101),
 ('eyes', 99),
 ('gutenberg', 97),
 ('day', 94),
 ('saw', 94),
 ('towards', 91),
 ('elizabeth', 91),
 ('night', 90),
 ('project', 87),
 ('mind', 87),
 ('found', 87),
 ('death', 84),
 ('ever', 83),
 ('even', 82),
 ('feelings', 80),
 ('work', 79),
 ('felt', 78),
 ('heart', 77),
 ('must', 76),
 ('dear', 73),
 ('thought', 73),
 ('many', 71),
 ('friend', 70),
 ('also', 69),
 ('never', 68),
 ('soon', 67)]

### Step 2. 
As a group, take a minute to **describe what this script does**, in plain English. 

- What is the sequence of steps in the script?
- What are stopwords? 
- What effect do they have on the output?

> Sidenote:  
> If you want to learn more about stopwords (and their history), check out Daniel Rosenberg's article ["Stop, Words"](https://www-jstor-org.ezproxy.princeton.edu/stable/10.1525/rep.2014.127.1.83?seq=1#metadata_info_tab_contents) (2014)

### Step 3. 
Let's test out our script on other texts. 

1. Choose a text from the following list and paste its filepath into the `filepath_of_text` variable in the cell above.
     - Start by comparing the 1818 and 1831 editions of Mary Shelley's novel (both downloaded from the [Project Gutenberg](https://www.gutenberg.org/) text library)
     - Then try some others, like the full text collections that Amardeep Singh put together of 19th-century African American Literature (read more about [here](https://github.com/amardeepmsingh/African-American-Literature-Text-Corpus-1853-1923)) or Colonial South Asian Literature (read more about this collection [here](https://github.com/amardeepmsingh/Colonial-South-Asian-Literature))

```
filepath_of_text = '../_datasets/texts/literature/Mary-Shelley-Frankenstein-1818.txt'
filepath_of_text = '../_datasets/texts/literature/Mary-Shelley-Frankenstein-1831.txt'
filepath_of_text = '../_datasets/texts/literature/Jane-Austen-Pride-and-Prejudice.txt'
filepath_of_text = '../_datasets/texts/literature/Daniel-Defoe-Journal-of-a-Plague-Year.txt'
filepath_of_text = '../_datasets/texts/literature/African-American-Literature-Text-Corpus/African-American-Literature-text-files/w-e-b-du-bois-the-quest-of-the-silver-fleece-1911.txt' 

```

### Step 4.

Look carefully back at the list of filepaths above that we've used to read in different text files. 

1. What do the different parts of the filepath mean? (Think back to our lesson on the command line & what we learned about directories)
2. What files do we need (and where do they need to be in relation to this Jupyter notebook) in order to run this code?
3. If you were running this notebook on your own computer through Anaconda Navigator's version of JupyterLab (not on this cloud-hosted Binder), what files would you need to make sure to have (and where would you need to put them)?


***‼️ This might seem trivial, but a small mistake in a filepath  can be a significant source of error when you're running Python!‼️*** 
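
One way to catch a filepath mistake early is to ask Python whether the file exists before trying to open it. The sketch below uses the standard `pathlib` module, which is not part of the script above; it is just one possible check, and it assumes the same relative path we used earlier.

```python
from pathlib import Path

# The path used earlier in this notebook; it is relative to the notebook's own folder
filepath_of_text = Path('../_datasets/texts/literature/Mary-Shelley-Frankenstein-1818.txt')

if filepath_of_text.exists():
    print('Found it:', filepath_of_text.resolve())
else:
    # resolve() shows the full path Python built from the relative one
    print('No file at:', filepath_of_text.resolve())
    print('Python is currently looking from:', Path.cwd())
```

If the file is not found, comparing the two printed paths usually shows which folder name is misspelled or missing.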

### ✨For the future✨:
### I want to try this out on other texts!  Where else can I find text files? 

[Project Gutenberg](https://www.gutenberg.org/)
<img src="../_images/gutenberg.png" width="700" height="40"/>

[Oxford Text Archive](https://ota.bodleian.ox.ac.uk/repository/xmlui/)
<img src="../_images/ota.png" width="700" height="40"/>

Alan Liu's [list of demo corpora](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora) -- collections of text files

Projects that create databases for specific topics:
- [Old Bailey Online](https://www.oldbaileyonline.org/) (a database of 18th-century criminal trials)
- [Early Novels Database](https://github.com/earlynovels/end-dataset) (a database of early fiction)


-----
## 💡Let's learn some new Python tricks! 💡
---

## 2. Using Python to manipulate, clean, and sort lists of data

Let's say we're interested in early African American fiction from the 19th and early 20th centuries. We might turn to a small curated dataset like the African American Literature Text Corpus that Amardeep Singh put together, which lists around 100 texts by African American writers. Its accompanying "metadata" file ([link](https://github.com/amardeepmsingh/African-American-Literature-Text-Corpus-1853-1923#readme)) gives us some information about the full-text collection he assembled, along with links to the repository each text came from. (Many come from Project Gutenberg, HathiTrust, the American Verse Project at the University of Michigan, the Library of Congress, and the History of Black Writing Novel Corpus.)

We could look at a small sample of this data:

In [2]:
import pandas as pd # This command imports the library `pandas`; we'll be learning more about it in a later lesson!
pd.read_csv('../_datasets/texts/literature/African-American-Literature-Text-Corpus/African-American-Literature-Corpus-Metadata-Amardeep-Singh.csv').head(10)

| Unnamed: 0 | Author (last, first) | Title | Year Published | Genre | Publisher | Location of Publisher | Location signed by author | Keywords | Derived From | Status and Links |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams, Clayton | Ethiopia: the Land of Promise; A Book With a P... | 1917 | Fiction | Cosmopolitan Press | New York | NaN | Black utopia; segregation; reconstruction | HathiTrust | https://catalog.hathitrust.org/Record/008407122 |
| 1 | Anderson, William and Walter H. Stowers | Appointed: An American Novel | 1894 | Fiction | Detroit Law Printing Co. | Detroit | NaN | Interracial friendship; Northerners going south | HathiTrust | https://catalog.hathitrust.org/Record/005568825 |
| 2 | Andrews, W.T. | A Waif--A Prince; or, A Mother's Triumph | 1895 | Fiction | Publishing House, Methodist Episcopal Church S... | Nashville, Tennessee | NaN | Religious allegory; Egypt (Hebrews as oppresse... | History of Black Writing Corpus | Also see LOC: https://www.loc.gov/item/06002450/ |
| 3 | Ashby, William M. | Redder Blood | 1915 | Fiction | Cosmopolitan Press | New York | NaN | Passing; Interracial desire | History of Black Writing Corpus | https://catalog.hathitrust.org/Record/004237253 |
| 4 | Bennett, John | Madam Margot, a Grotesque Legend of Old Charle... | 1917 | Fiction | Century Co. | New York | NaN | Supernatural; Romance | History of Black Writing Corpus | https://catalog.hathitrust.org/Record/00858464... |
| 5 | Bibb, Eloise A. | Poems | 1895 | Poetry | Monthly Review Press | Boston, Massachusetts | NaN | Mentions Alice Dunbar-Nelson; Poem to Frederic... | Digital Schomburg | Also see American Verse Project: https://quod.... |
| 6 | Blackson, Lorenzo D. | Rise and Progress of the Kingdoms of Light and... | 1867 | Fiction | J. Nicholas, Printer | Philadelphia, Pennsylvania | NaN | Christian; Allegory | History of Black Writing Corpus | Also see Archive.org: https://archive.org/deta... |
| 7 | Braithwaite, William Stanley | Lyrics of Life and Love | 1904 | Poetry | Herbert B. Turner co. | Boston, Massachusetts | NaN | NaN | U-Michigan American Verse Project | NaN |
| 8 | Braithwaite, William Stanley | House of Falling Leaves, With Other Poems | 1908 | Poetry | John W. Luce and Co | Boston, Massachusetts | NaN | NaN | U-Michigan American Verse Project | NaN |
| 9 | Brown, William Wells | Clotel; Or, the President's Daughter: A Narrat... | 1853 | Fiction | Partridge and Oakey | London, England | NaN | Slavery; Passing; Interracial; Fugitive Slave ... | History of Black Writing Corpus | Also see Documenting the American South: https... |


In our dataset above, notice those `NaN` values? This is how a dataframe indicates MISSING DATA, i.e., a blank field in our CSV file. Keep this in the back of your mind: we'll come back to what those blank values might mean for our dataset.

What if we wanted to:

- know how many times a certain value appears in the data (e.g., the appearance of "London, England" as a publication location)

- programmatically change all blank values in the data (e.g., from a blank to “no data recorded”)

- find the most and least common values in the data (e.g., most common “Genre” or Publishers)?

We can use something called a Python list to store data and perform an operation on it!
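
As a quick warm-up, here is a small sketch of a few things a list can do on its own. The list name `sample_locations` is made up for this example; its values are the first five publisher locations from the table above.

```python
# A short sample list (the first five publisher locations in the table above)
sample_locations = ['New York', 'Detroit', 'Nashville, Tennessee', 'New York', 'New York']

print(len(sample_locations))               # how many items are in the list: 5
print(sample_locations[0])                 # the first item (Python starts counting at 0): New York
print(sample_locations.count('New York'))  # how many times one value appears: 3
```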

Let's look at some sample lists. Each of the lists below contains values drawn from the dataset above.

In [49]:
authors = ['Adams, Clayton', 'Anderson, William and Walter H. Stowers',
       'Andrews, W.T.', 'Ashby, William M.', 'Bennett, John',
       'Bibb, Eloise A.', 'Blackson, Lorenzo D.',
       'Braithwaite, William Stanley', 'Braithwaite, William Stanley',
       'Brown, William Wells', 'Bruce, John Edward',
       'Burgess, Marie Louise', 'Bush-Banks, Olivia Ward',
       'Bush-Banks, Olivia Ward', 'Chesnutt, Charles',
       'Chesnutt, Charles', 'Chesnutt, Charles', 'Chesnutt, Charles',
       'Chesnutt, Charles', 'Clifford, Carrie Williams']

titles =  ['Ethiopia: the Land of Promise; A Book With a Purpose',
       'Appointed: An American Novel',
       "A Waif--A Prince; or, A Mother's Triumph", 'Redder Blood',
       'Madam Margot, a Grotesque Legend of Old Charleston', 'Poems',
       'Rise and Progress of the Kingdoms of Light and Darkness: Or, The Reign of the Kings Alpha and Abedon',
       'Lyrics of Life and Love',
       'House of Falling Leaves, With Other Poems',
       "Clotel; Or, the President's Daughter: A Narrative of Slave Life in the United States",
       'Awakening of Hezekiah Jones', 'Ave Maria, A Tale',
       'Original Poems', 'Driftwood', 'The Conjure Woman',
       "The Colonel's Dream", 'The House Behind the Cedars',
       'The Marrow of Tradition',
       'The Wife of His Youth and Other Stories of the Color Line',
       'Race Rhymes']

genres = ['Fiction', 'Fiction', 'Fiction', 'Fiction', 'Fiction', 'Poetry',
       'Fiction', 'Poetry', 'Poetry', 'Fiction', 'Fiction', 'Fiction',
       'Poetry', 'Poetry', 'Fiction', 'Fiction', 'Fiction', 'Fiction',
       'Fiction, Nonfiction', 'Poetry']

location_signed = ['', '', '', '', '', '', '', '', '', '', '', '',
       'Providence, Rhode Island', '', '', '', '', '', '', '']


year_published = [1917, 1894, 1895, 1915, 1917, 1895, 1867, 1904, 1908, 1853, 1916,
       1895, 1899, 1914, 1899, 1905, 1900, 1901, 1899, 1911]

In [61]:
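df = pd.read_csv('../_datasets/texts/literature/African-American-Literature-Text-Corpus/African-American-Literature-Corpus-Metadata-Amardeep-Singh.csv') # load the full metadata file into a dataframe called `df`
M = df['Location of Publisher'].head(20).to_numpy() # take the first 20 publisher locations as an array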
M

array(['New York', 'Detroit', 'Nashville, Tennessee', 'New York',
       'New York', 'Boston, Massachusetts', 'Philadelphia, Pennsylvania',
       'Boston, Massachusetts', 'Boston, Massachusetts',
       'London, England', 'Hopkinsville, Kentucky', nan,
       'Providence, Rhode Island', 'Providence, Rhode Island', 'New York',
       'New York', 'Boston, Massachusetts; New York, New York',
       'Boston, Massachusetts; New York, New York',
       'Boston, Massachusetts; New York, New York', 'Washington, DC'],
      dtype=object)

### Using a `for` loop to iterate over a list
In your homework, you learned about `for` loops: they're a way of iterating over a set of items in a list in Python. 

For instance, we could iterate over all of the categories that appear in our `genres` list:

In [25]:
for genre in genres:
    print(genre)

Fiction
Fiction
Fiction
Fiction
Fiction
Poetry
Fiction
Poetry
Poetry
Fiction
Fiction
Fiction
Poetry
Poetry
Fiction
Fiction
Fiction
Fiction
Fiction, Nonfiction
Poetry


We could also add more information, like printing the position number of each item in our list (remember that Python starts counting at 0).

To do this, we use the `enumerate()` function:

###  `enumerate()` 

In [26]:
# We can get a little fancy and add more to our label: 
for number, genre in enumerate(genres):
    print(f'Title {number}:', genre) # Here we use an `f-string` to add a text string to our variable.

Title 0: Fiction
Title 1: Fiction
Title 2: Fiction
Title 3: Fiction
Title 4: Fiction
Title 5: Poetry
Title 6: Fiction
Title 7: Poetry
Title 8: Poetry
Title 9: Fiction
Title 10: Fiction
Title 11: Fiction
Title 12: Poetry
Title 13: Poetry
Title 14: Fiction
Title 15: Fiction
Title 16: Fiction
Title 17: Fiction
Title 18: Fiction, Nonfiction
Title 19: Poetry
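
If we would rather start counting at 1, or label each row with more than one piece of information, one option is to combine `enumerate()` with `zip()`, which walks through two lists in step. This is a sketch using short sample lists (the `_sample` names are made up here; the values come from the lists above), not a required part of the lesson.

```python
# Three authors from the dataset, with the genre listed for each of their works
authors_sample = ['Adams, Clayton', 'Bibb, Eloise A.', 'Brown, William Wells']
genres_sample = ['Fiction', 'Poetry', 'Fiction']

labeled_rows = []
# zip() pairs up the two lists item by item; start=1 makes the count begin at 1
for number, (author, genre) in enumerate(zip(authors_sample, genres_sample), start=1):
    label = f'Title {number}: {author} ({genre})'
    labeled_rows.append(label)
    print(label)
```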



### Using loops to extract subsets of our data

We can use a loop to select only some items from a list.

We create a new empty list and then use a `for` loop and an `if` to add items to it from our list *if* they meet our requirements:

In [27]:
original_list = ['oranges', 'item_we_want', 'item_we_want', 'apples','item_we_want','item_we_DO_NOT_want']

new_list = []
for item in original_list:
    if item == 'item_we_want':
        new_list.append(item)

In [28]:
print(new_list)

['item_we_want', 'item_we_want', 'item_we_want']


Say we wanted to pull out only the entries in our `genres` list that are labeled "Poetry." We create a new (empty) list, called `genres_subset`, and then use a `for` loop and an `if` to add items to it.

In [34]:
genres_subset = []

for genre in genres:
    if genre == "Poetry":
        genres_subset.append(genre)

In [31]:
print(genres_subset)

['Poetry', 'Poetry', 'Poetry', 'Poetry', 'Poetry', 'Poetry']
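
The loop above collects the genre labels themselves. If we wanted the matching titles instead, one possible approach is to loop over the titles and genres together with `zip()`. The sketch below uses two short sample lists (names made up for this example, values taken from the lists above) and assumes both lists are in the same row order.

```python
# Four titles and their genres, in the same order
titles_sample = ['Redder Blood', 'Poems', 'Lyrics of Life and Love', 'The Conjure Woman']
genres_sample = ['Fiction', 'Poetry', 'Poetry', 'Fiction']

poetry_titles = []
for title, genre in zip(titles_sample, genres_sample):
    if genre == 'Poetry':            # check the genre...
        poetry_titles.append(title)  # ...but keep the title

print(poetry_titles)  # ['Poems', 'Lyrics of Life and Love']
```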


We could do the same for `location_signed`:

In [50]:
location_signed_subset = []

for location in location_signed:
    if location != "": # Here, the `if` condition keeps only the locations that are not blank
        location_signed_subset.append(location)

In [51]:
print(location_signed_subset)

['Providence, Rhode Island']


We can create this list in a slightly more compact fashion, called a *list comprehension*.

Instead of writing out the full `for` loop, we can put the loop *INSIDE* the brackets that contain the list we want to make:

In [32]:
newer_list = [item for item in original_list if item == 'item_we_want']

In [33]:
print(newer_list)

['item_we_want', 'item_we_want', 'item_we_want']
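
The same pattern works for the genre and location examples from earlier. Here is a sketch using shortened sample lists (the `_sample` names are made up for this example):

```python
genres_sample = ['Fiction', 'Poetry', 'Fiction', 'Poetry', 'Fiction, Nonfiction']
location_signed_sample = ['', '', 'Providence, Rhode Island', '']

# Read each one as: "give me the item, for each item in the list, if it passes the test"
poetry_only = [genre for genre in genres_sample if genre == 'Poetry']
non_blank_locations = [location for location in location_signed_sample if location != '']

print(poetry_only)          # ['Poetry', 'Poetry']
print(non_blank_locations)  # ['Providence, Rhode Island']
```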


### Using loops to clean our data
What if we wanted to update our data, so that blank data was not just a blank, but something more descriptive?

We can use `for` loops and conditional statements to add items to the list if they meet some conditions, or modify and add them if they meet others:

We create a new list called `updated_location_signed`, and then use `if` and `else` statements:

In [52]:
updated_location_signed = []

for location in location_signed:
    if location == '':# Here, we're telling the loop to look for location_signed that are marked as blank
        new_location_signed = 'no location recorded' # assigning a new variable called `new_location_signed`
        updated_location_signed.append(new_location_signed) # tell the loop to record 'no location recorded' if location_signed field originally blank
    else:
        updated_location_signed.append(location) # Here we're telling the loop to add the location signed label as is, if it's not blank.

Let's check inside our new list:

In [53]:
updated_location_signed

['no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'Providence, Rhode Island',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded',
 'no location recorded']
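
For comparison, the same cleaning step can be written as a list comprehension that contains an `if ... else` expression. This is only an alternative sketch on a three-item sample list (the names are made up for this example); the longer loop above does the same job and is often easier to read when you're starting out.

```python
location_signed_sample = ['', 'Providence, Rhode Island', '']

# For each location: use the placeholder if it is blank, otherwise keep it as is
updated_sample = [
    'no location recorded' if location == '' else location
    for location in location_signed_sample
]

print(updated_sample)
# ['no location recorded', 'Providence, Rhode Island', 'no location recorded']
```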

### Counting items in a list or collection

To count items in lists or collections, we'll have to **`import`** a pre-written function called `Counter` from a library called `collections`.


In [54]:
from collections import Counter

To use the `Counter()` function, we just need to put the thing we want counted inside the parentheses.

In [56]:
Counter(genres)

Counter({'Fiction': 13, 'Poetry': 6, 'Fiction, Nonfiction': 1})

What we've created is a dictionary-like object in which each entry pairs a "genre" category with the number of times it appears in our list.

We can do some nifty things with it, like sort it into the most and least common genres.


In [57]:
# To sort from the most common genre to the least common:
genres_tally = Counter(genres)
genres_tally.most_common()

[('Fiction', 13), ('Poetry', 6), ('Fiction, Nonfiction', 1)]

In [58]:
# To find the two least common genres:
genres_tally = Counter(genres)
genres_tally.most_common()[-2:] # most_common() sorts from most to least common, so [-2:] takes the last two items

[('Poetry', 6), ('Fiction, Nonfiction', 1)]
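
A few other things a `Counter` can do are sketched below on a short sample list (the `genres_sample` names are made up for this example). Looking up a count works the same way as looking up a value in a dictionary.

```python
from collections import Counter

genres_sample = ['Fiction', 'Fiction', 'Poetry', 'Fiction', 'Fiction, Nonfiction', 'Poetry']
genres_sample_tally = Counter(genres_sample)

print(genres_sample_tally['Poetry'])       # look up one count by its key: 2
print(genres_sample_tally['Drama'])        # a value that never appears counts as 0
print(genres_sample_tally.most_common(1))  # only the single most common value: [('Fiction', 3)]
print(sum(genres_sample_tally.values()))   # total number of items counted: 6
```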

### Exercise 2: Counting and sorting publisher locations in a list

In [64]:
publisher_location = ['New York', 'Detroit', 'Nashville, Tennessee', 'New York',
       'New York', 'Boston, Massachusetts', 'Philadelphia, Pennsylvania',
       'Boston, Massachusetts', 'Boston, Massachusetts',
       'London, England', 'Hopkinsville, Kentucky', '',
       'Providence, Rhode Island', 'Providence, Rhode Island', 'New York',
       'New York', 'Boston, Massachusetts; New York, New York',
       'Boston, Massachusetts; New York, New York',
       'Boston, Massachusetts; New York, New York', 'Washington, DC']


1. Count the number of times each publisher location appears in this list

In [None]:
## Your code here

2. Make a list of the top 5 most common locations that these works were published

In [None]:
## Your code here

3. Make a new list of publisher locations that contains only the entries where `publisher_location` is "London, England"

In [None]:
## Your code here

4. Make a `for` loop that considers each item in the publisher locations list and prints "Title was published in ___"

In [None]:
## Your code here
    ## Your code here