# Loops

A loop repeats a process over and over until a given condition is met. It works much like the way we search a text for a specific quote.

You read a sentence and ask yourself, "Is this the sentence I'm looking for?" "No," your brain answers, and so you repeat the process, reading each sentence until you find the one you're looking for. Translated into something closer to computer code, the process would look something like this:

```
for every sentence on this page:
    if this is the quote I am looking for:
        I can stop reading, I've found it!
    if this isn't the quote:
        Let's read the next sentence.
```
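
As a minimal sketch, the pseudocode above might look like this in Python. The `sentences` list and the `target` phrase are made up for illustration, and `break` is what lets the loop stop as soon as the quote is found.

```python
# hypothetical sentences, invented for this illustration
sentences = [
    "It was a bright cold day.",
    "The quote I want is here.",
    "This sentence is never read.",
]
target = "quote I want"

found = None
for sentence in sentences:
    if target in sentence:
        # I can stop reading, I've found it!
        found = sentence
        break
    # otherwise, the loop moves on to the next sentence by itself

print(found)
```

Without `break`, the loop would keep reading to the end of the list even after finding a match.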

In DH applications, for loops allow you to search a large amount of data very quickly.

In [1]:
# here is a list of names
list_of_names = ["George Atherton",
                "Marian Hyde",
                "Sybil Dickenson",
                "Sabina Dobson",
                "Jessica Bradbury",
                "Cindy Salter",
                "Carolina McCabe",
                "Glynis Graves",
                "Laurie Dobson",
                "Phoebe Watkins",
                "Noel Boardman"]

# first, we can make a loop to "iterate" over the list with no conditions
# it will simply continue to go over each item in the list until there is nothing left
# here, it prints out every name in the list:

for name in list_of_names:
    print(name)

# NOTE: "name" is a variable created by the loop, and it stores the item that the loop is currently looking at
# in our case, on the first pass "name" = "George Atherton"; after that name is printed, the loop repeats and "name" = "Marian Hyde", then "Sybil Dickenson", and so on until the end of the list

George Atherton
Marian Hyde
Sybil Dickenson
Sabina Dobson
Jessica Bradbury
Cindy Salter
Carolina McCabe
Glynis Graves
Laurie Dobson
Phoebe Watkins
Noel Boardman


In [2]:
# now, as stated earlier, you'll more likely want to use a loop to find something relevant to your work
# let's say we're only interested in people with the surname "Dobson"
# we can use a combination of for loops and if statements to create a new list of only Dobsons!

# declare your new list that we will add to
only_dobsons = []

for name in list_of_names:
    # we check if this name includes "Dobson"
    if "Dobson" in name:
        # if this is True, we add this name to our new list
        only_dobsons.append(name)

print(only_dobsons)

['Sabina Dobson', 'Laurie Dobson']
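
One thing to keep in mind: `"Dobson" in name` is true wherever "Dobson" appears in the string, so it would also match a first name. Here is a hedged sketch of a stricter check, assuming the surname is always the last word of each name (the entry "Dobson Reid" is invented for this example):

```python
# "Dobson Reid" is a hypothetical entry added to show the difference
names = ["Sabina Dobson", "Dobson Reid", "Laurie Dobson", "Noel Boardman"]

surname_matches = []
for name in names:
    # split() breaks the name into words; [-1] picks the last one
    if name.split()[-1] == "Dobson":
        surname_matches.append(name)

print(surname_matches)
```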


## Practice Activity #4: Loop and Look 👀
Here is a list of quotes from the novel *Little Women* by Louisa May Alcott. In this activity, use a `for` loop and `if` statement as done above to find quotes that include the word **"work"**. These quotes should be added to a new list, then printed.

In [4]:
little_women_quotes = [
                        "...I do think washing dishes and keeping things tidy is the worst work in the world. It makes me cross; and my hands get so stiff, I can't practise well at all.",
                        "I don't see how you can write and act such splendid things, Jo. You're a regular Shakespeare!",
                        "But it does seem so nice to have little suppers and bouquets, and go to parties, and drive home, and read and rest, and not work. It's like other people, you know, and I always envy girls who do such things; I'm so fond of luxury",
                        "She caught up her knitting, which had dropped out of her hands, gave me a sharp look through her specs, and said, in her short way, 'Finish the chapter, and don't be impertinent, miss.'",
                        "You may try your experiment for a week, and see how you like it. I think by Saturday night you will find that all play and no work is as bad as all work and no play."
                    ]

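# create a new list to hold the matching quotes
work_quotes = []

for quote in little_women_quotes:
    # we check if this quote includes "work"
    if "work" in quote:
        work_quotes.append(quote)

# print each matching quote on its own line
for quote in work_quotes:
    print(quote)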

...I do think washing dishes and keeping things tidy is the worst work in the world. It makes me cross; and my hands get so stiff, I can't practise well at all.
But it does seem so nice to have little suppers and bouquets, and go to parties, and drive home, and read and rest, and not work. It's like other people, you know, and I always envy girls who do such things; I'm so fond of luxury
You may try your experiment for a week, and see how you like it. I think by Saturday night you will find that all play and no work is as bad as all work and no play.


# Functions

Functions are blocks of reusable code. As you know by now, Python has many built-in functions, such as `print()` or `len()`, which were designed to perform specific tasks when called. If there is a task in your own code that you will need to repeat at various points, you can write a function yourself! For example, instead of having something like this:

In [None]:
# find age of each person from records
life_records = [["Cindy Salter", "Born: 1903", "Died: 1933"], ["Glynis Graves", "Born: 1911", "Died: 1989"], ["Noel Boardman", "Born: 1908", "Died: 1972"]]

cs_born = life_records[0][1]
cs_death = life_records[0][2]

# get only the number
for word in cs_born.split():
    if word.isdigit():
        cs_born = int(word)

for word in cs_death.split():
    if word.isdigit():
        cs_death = int(word)

cs_age = cs_death - cs_born
print(life_records[0][0] + "'s age: " + str(cs_age))

gg_born = life_records[1][1]
gg_death = life_records[1][2]

# get only the number
for word in gg_born.split():
    if word.isdigit():
        gg_born = int(word)

for word in gg_death.split():
    if word.isdigit():
        gg_death = int(word)

gg_age = gg_death - gg_born
print(life_records[1][0] + "'s age: " + str(gg_age))

nb_born = life_records[2][1]
nb_death = life_records[2][2]

# get only the number
for word in nb_born.split():
    if word.isdigit():
        nb_born = int(word)

for word in nb_death.split():
    if word.isdigit():
        nb_death = int(word)

nb_age = nb_death - nb_born
print(life_records[2][0] + "'s age: " + str(nb_age))


...We could have something much tidier and easier to read, like this:

In [5]:
# find age of each person from records
life_records = [["Cindy Salter", "Born: 1903", "Died: 1933"], ["Glynis Graves", "Born: 1911", "Died: 1989"], ["Noel Boardman", "Born: 1908", "Died: 1972"]]

def find_age(record):
    born = ''
    death = ''
    for word in record[1].split():
        if word.isdigit():
            born = int(word)

    for word in record[2].split():
        if word.isdigit():
            death = int(word)

    age = record[0] + "'s age: " + str(death - born)
    return age

print(find_age(life_records[0]))
print(find_age(life_records[1]))
print(find_age(life_records[2]))

Cindy Salter's age: 30
Glynis Graves's age: 78
Noel Boardman's age: 64
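
Since `life_records` is a list, the three `print()` calls can also be handled by a loop. The sketch below is one possible variation rather than the only way to write it: `extract_year` is a hypothetical helper name introduced here so that the digit-finding loop is written only once.

```python
life_records = [
    ["Cindy Salter", "Born: 1903", "Died: 1933"],
    ["Glynis Graves", "Born: 1911", "Died: 1989"],
    ["Noel Boardman", "Born: 1908", "Died: 1972"],
]

def extract_year(text):
    # return the first all-digit word in the text as a number
    for word in text.split():
        if word.isdigit():
            return int(word)
    # no year was found in this text
    return None

def find_age(record):
    born = extract_year(record[1])
    death = extract_year(record[2])
    return record[0] + "'s age: " + str(death - born)

# one loop replaces the three separate print() calls
for record in life_records:
    print(find_age(record))
```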


# Libraries

Now, what can make your code even *tidier*, plus easier to read *and* write? Libraries! Also referred to as "packages", these helpful tools are essentially large collections of pre-written functions that you can install in your Python environment and import so that you can use these functions in your code. 
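
Not every library needs installing: Python ships with a "standard library" that you can import straight away. As a small illustration (the `ages` list simply reuses the three ages calculated in the Functions section), the built-in `statistics` module saves us from writing our own median function:

```python
import statistics

# the three ages found in the Functions section
ages = [30, 78, 64]

# median() sorts the values and returns the middle one
median_age = statistics.median(ages)
print(median_age)
```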

## NLTK (Natural Language Tool Kit)

In this workshop, we will introduce two libraries that are necessities for any digital humanist's tool kit, the first of which is [NLTK (Natural Language Tool Kit)](https://www.nltk.org/). This is an all-encompassing library that supports work in natural language processing (NLP), a multidisciplinary field that deals with the interactions between "natural" human language and computers. It has its roots in linguistics, which is why it can do things like this:

In [1]:
!pip3 install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting click (from nltk)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.5.15-cp312-cp312-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)

In [2]:
import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# tagging PoS in inputted text
text = word_tokenize("Be careful with that butter knife.")
nltk.pos_tag(text)



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wjncu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\wjncu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[('Be', 'VB'),
 ('careful', 'JJ'),
 ('with', 'IN'),
 ('that', 'DT'),
 ('butter', 'NN'),
 ('knife', 'NN'),
 ('.', '.')]

...But as the `word_tokenize()` function hints, NLTK is also excellent at preparing text for, and performing, textual analysis on a broader scale!

(**Note**: NLTK uses the Penn Treebank Tag Set for POS tagging, [which can be found here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).)
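
Because `pos_tag()` returns a list of `(word, tag)` pairs, the loops from earlier work on it directly. Here is a small sketch that keeps only the nouns, using the tagged output copied from above rather than calling NLTK again (Penn Treebank noun tags all begin with `NN`):

```python
# the tagged output from the cell above, copied in by hand
tagged = [
    ("Be", "VB"),
    ("careful", "JJ"),
    ("with", "IN"),
    ("that", "DT"),
    ("butter", "NN"),
    ("knife", "NN"),
    (".", "."),
]

nouns = []
# each item is a pair, so we can unpack it into two variables
for word, tag in tagged:
    if tag.startswith("NN"):
        nouns.append(word)

print(nouns)
```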

In [3]:
import nltk

# tokenization is the process of splitting strings into their individual "tokens"
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# to import a .txt file we use the "open" function, giving it the path to our text file and an instruction about what we want to do with the file
# here, we would like to "read" our file into a variable, and we lowercase the text at the same time
transcript = open('Bette-Smith-Transcript.txt', encoding="utf-8").read().lower()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wjncu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wjncu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
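
A common alternative for reading files is a `with` block, which closes the file automatically once the block ends. The sketch below is self-contained, so it first writes a tiny stand-in file; the file name `sample-transcript.txt` and its one-line contents are assumptions made for the example, not the workshop transcript.

```python
import os
import tempfile

# build a path for a small stand-in file in the system's temporary folder
sample_path = os.path.join(tempfile.gettempdir(), "sample-transcript.txt")

# write one line of text so that there is something to read back
with open(sample_path, "w", encoding="utf-8") as sample_file:
    sample_file.write("My name is Bette Smith.")

# "with" closes the file for us as soon as the indented block is finished
with open(sample_path, encoding="utf-8") as sample_file:
    sample_text = sample_file.read().lower()

print(sample_text)
```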


In [4]:
# we could then tokenize by sentence, which splits the text into sentences
transcript_sentences = sent_tokenize(transcript)
transcript_sentences

["my name is bette smith, and i grew up in ottawa in the carleton, carling area, sorry i'm going back a little bit.",
 'and i went to fisher park high school, and while i was there, i went to grade 13, which usually prepares you for university.',
 "however, in my grade 13 year, i really didn't feel i wanted to go to university, but i decided instead to take the money that my parents had put aside for me for schooling or whatever and went to business college instead.",
 'so i took a year at willis business college in downtown ottawa, and at the end of that, they just said to me well, where would you like to work?',
 'and i said maybe i would like to work at the university.',
 "so they actually got me the job, i didn't apply.",
 'and i started at carleton university in 1972, so i would have been just 19 years of age.',
 'i started as a steno 03, which is basically the bottom of the pile, and i remember that i made 3,500 a year.',
 "which at that point wasn't enough to live on by yourself

In [5]:
# or, more commonly, we can tokenize into words, which splits the text into individual words and punctuation marks
transcript_words = word_tokenize(transcript)
transcript_words

['my',
 'name',
 'is',
 'bette',
 'smith',
 ',',
 'and',
 'i',
 'grew',
 'up',
 'in',
 'ottawa',
 'in',
 'the',
 'carleton',
 ',',
 'carling',
 'area',
 ',',
 'sorry',
 'i',
 "'m",
 'going',
 'back',
 'a',
 'little',
 'bit',
 '.',
 'and',
 'i',
 'went',
 'to',
 'fisher',
 'park',
 'high',
 'school',
 ',',
 'and',
 'while',
 'i',
 'was',
 'there',
 ',',
 'i',
 'went',
 'to',
 'grade',
 '13',
 ',',
 'which',
 'usually',
 'prepares',
 'you',
 'for',
 'university',
 '.',
 'however',
 ',',
 'in',
 'my',
 'grade',
 '13',
 'year',
 ',',
 'i',
 'really',
 'did',
 "n't",
 'feel',
 'i',
 'wanted',
 'to',
 'go',
 'to',
 'university',
 ',',
 'but',
 'i',
 'decided',
 'instead',
 'to',
 'take',
 'the',
 'money',
 'that',
 'my',
 'parents',
 'had',
 'put',
 'aside',
 'for',
 'me',
 'for',
 'schooling',
 'or',
 'whatever',
 'and',
 'went',
 'to',
 'business',
 'college',
 'instead',
 '.',
 'so',
 'i',
 'took',
 'a',
 'year',
 'at',
 'willis',
 'business',
 'college',
 'in',
 'downtown',
 'ottawa',
 '

In [6]:
# now remember that huge block of stopwords manually typed out in the sample block of code from the first lesson? That comes built in to NLTK, as you may have guessed from the earlier import statement
# we can assign the NLTK stopwords to a variable like so:
stop_words = stopwords.words('english')

# and then remove the stopwords from our text, using a loop to check each word in the transcript and keep only the words that are NOT in our stopword list
filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words:
        filtered_transcript_words.append(word)


In [7]:
# finally, we can find word frequency with NLTK's frequency distribution function
from nltk import FreqDist

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)



[(',', 885),
 ('.', 712),
 ('?', 119),
 ("n't", 111),
 ('would', 94),
 ("'s", 86),
 ('know', 73),
 ('staff', 61),
 ('faculty', 60),
 ('think', 58)]

In [9]:
# now, as you can see, our list is topped by punctuation and contractions!

# to remove punctuation, we can use Python's string library to create a list of punctuation
from string import punctuation
punctuation = list(punctuation)
# print(punctuation)

# and luckily, you can modify your stopwords and punctuation lists like any other list!
# let's add "n't", "'s", and "would"
# to add multiple elements to a list at once, we use extend() rather than append()
stop_words.extend(["n't", "'s", 'would'])

In [10]:
# let's re-run with our new stopwords and punctuation list to see the improved results
filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

[('know', 73),
 ('staff', 61),
 ('faculty', 60),
 ('think', 58),
 ('well', 52),
 ('work', 52),
 ('going', 47),
 ('yeah', 44),
 ('really', 41),
 ('like', 39)]
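
If you are curious what a frequency distribution is doing, it can be approximated with a plain dictionary and a loop. This is only an illustrative sketch on a made-up token list, not NLTK's actual implementation:

```python
# a short, invented list of tokens
tokens = ["work", "staff", "work", "faculty", "staff", "work"]

counts = {}
for token in tokens:
    # start at 0 the first time we see a token, then add 1
    counts[token] = counts.get(token, 0) + 1

# sort the (word, count) pairs from most to least frequent
most_common = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(most_common[:2])
```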

In [13]:
# now that we have a word frequency list, we can even use NLTK for concordance analysis (seeing word in context)
# we can choose a word from the word frequency list, and search the original tokenized text for it after making it a Text object
from nltk.text import Text

text_list = Text(transcript_words)
text_list.concordance("work", lines=10)


Displaying 10 of 52 matches:
to me well , where would you like to work ? and i said maybe i would like to w
k ? and i said maybe i would like to work at the university . so they actually
counts payable , accounts receivable work . my brother gordon went on to do a 
 went on to do a master 's in social work , sorry not social work , sociology 
's in social work , sorry not social work , sociology and actually did some do
ology and actually did some doctoral work at london school of economics . marg
n dj of course is also done graduate work in political science , actually as w
h it 's very hard , as you know , to work , to work full time and do courses .
ry hard , as you know , to work , to work full time and do courses . and have 
ves . so then did he , when , did he work that farm ? dad was never really , h
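
Under the hood, a concordance is a search that keeps a little context on either side of each match. Here is a rough sketch of the idea in plain Python, using a short made-up token list and a hypothetical `window` of three tokens (NLTK's own version measures context in characters and aligns the output, which this does not attempt):

```python
# a short, invented list of tokens
tokens = ["i", "would", "like", "to", "work", "at", "the", "university", ".",
          "it", "is", "hard", "to", "work", "full", "time", "."]
target = "work"
window = 3  # how many tokens of context to keep on each side

contexts = []
# enumerate() gives us each token together with its position in the list
for position, token in enumerate(tokens):
    if token == target:
        start = max(0, position - window)
        end = position + window + 1
        contexts.append(" ".join(tokens[start:end]))

for context in contexts:
    print(context)
```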


In [None]:
# Here is what our original Python script from the Python-I notebook now looks like with NLTK
import nltk

from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from string import punctuation
punctuation = list(punctuation)

nltk.download('punkt')
nltk.download('stopwords')

transcript = open('Bette-Smith-Transcript.txt', encoding="utf-8").read().lower()

transcript_words = word_tokenize(transcript)

stop_words = stopwords.words('english')
stop_words.extend(["n't", "'s", 'would'])

filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

## Practice Activity #5: Investigate your own text 🔍
For this activity, use a `.txt` file you have on hand, or download a plain text file from [Project Gutenberg](https://www.gutenberg.org/). Place it in the same folder as this notebook, then open it in your code and see if you can use NLTK to perform a frequency distribution or concordance analysis!

## Pandas

[Pandas](https://pandas.pydata.org/docs/) is a data analysis and manipulation tool, working with data in the form of a `dataframe`. A `dataframe` is a Python version of a spreadsheet!

Like a spreadsheet, each column can be of a different type, and using Pandas means we can quickly perform a number of operations on our `dataframe` to prepare our data for use in analysis. To demonstrate functionality, we will be using an exported list of individuals accused of witchcraft in Scotland, from the [Survey of Scottish Witchcraft](https://www.shca.ed.ac.uk/Research/witches/).
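
To make the spreadsheet comparison concrete, here is a minimal sketch that builds a small `DataFrame` by hand. The three rows are invented for illustration and do not come from the Survey of Scottish Witchcraft; each dictionary key becomes a column name.

```python
import pandas as pd

# invented rows; each key is a column, each list holds that column's cells
toy_df = pd.DataFrame({
    "firstname": ["Agnes", "Janet", "John"],
    "age": [45, 60, 38],
    "occupation": ["Midwife", None, "Farmer"],
})

# shape is (number of rows, number of columns)
print(toy_df.shape)
print(list(toy_df.columns))
```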

In [None]:
!pip3 install pandas

In [None]:
import pandas as pd
# we can add these arguments to set how many columns and rows we want Jupyter Notebook to display
pd.options.display.max_columns = 70
pd.options.display.max_rows = 70

# we can import a CSV file very simply using Pandas's built in function
witches_df = pd.read_csv("wdb_accused.csv",  delimiter=",") 
witches_df

In [None]:
# wow! that's a lot of confusing data!

# to get the contents of only one column you can call the column by name
print(witches_df['res_county'])

In [None]:
# you can treat individual columns like lists by assigning them to a variable
witches_residence = witches_df['res_county']
print(type(witches_residence))

# ...but this is still a pandas series, so to make a column into a list "officially" to avoid surprise errors, you can cast the column to be a list
witches_residence = list(witches_df['res_county'])
print(type(witches_residence))

In [None]:
# there's a lot of columns, so let's reshape our dataframe to only have a few we're interested in
witches_df = witches_df[['firstname', 'lastname', 'sex', 'age', 'res_county', 'maritalstatus', 'socioecstatus', 'occupation', 'notes']].copy()
witches_df

In [None]:
# much better! now we can change a column name to make naming clearer
witches_df = witches_df.rename(columns={"res_county": "residing_county"})
witches_df

In [None]:
# let's say we want to look at the occupation of each accused witch
# there are a lot of NaN values ("Not a Number", which is how Pandas marks blank cells), which we can filter out using Pandas's .loc[] indexer and .notna() method
witches_df.loc[witches_df["occupation"].notna()]

# NOTE: there is also a function .isna() that does the opposite of .notna()!

In [None]:
# if I want to look only at those who were midwives, I can use .loc[] with a comparison operator
witches_df.loc[witches_df["occupation"] == "Midwife"]
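
The expression inside the square brackets produces a column of `True`/`False` values, and `.loc[]` keeps only the rows marked `True`. Conditions can be combined with `&` (and) or `|` (or), with each condition wrapped in parentheses. Here is a sketch on a small invented table (not the real survey data):

```python
import pandas as pd

# invented rows for illustration
toy_df = pd.DataFrame({
    "firstname": ["Agnes", "Janet", "John", "Bessie"],
    "sex": ["Female", "Female", "Male", "Female"],
    "occupation": ["Midwife", None, "Farmer", "Midwife"],
})

# the comparison gives one True/False value per row
is_midwife = toy_df["occupation"] == "Midwife"
print(is_midwife.tolist())

# combine two conditions with & to keep rows where both are True
midwives_named_agnes = toy_df.loc[is_midwife & (toy_df["firstname"] == "Agnes")]
print(midwives_named_agnes)
```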

In [None]:
# like FreqDist in NLTK, Pandas has .value_counts() which will tally up the occurrences of unique values in a given column
# so let's check the distribution of occupations
witches_df["occupation"].value_counts()

In [None]:
# if we want all basic statistics for numerical columns we can use .describe()
# I'm interested to see the mean age of the accused
witches_df.describe()

In [None]:
# if we want to replace all instances of NaN in the dataframe with something more meaningful we can use the .fillna() function
witches_df = witches_df.fillna("Unknown")
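
One caveat worth knowing: filling the whole dataframe with a word such as "Unknown" also places that word in numerical columns like `age`, and a column that mixes numbers and text can no longer be averaged directly. A possible alternative, sketched on invented data, is to fill only the text columns you choose:

```python
import pandas as pd

# invented rows with one missing age and one missing occupation
toy_df = pd.DataFrame({
    "firstname": ["Agnes", "Janet", "John"],
    "age": [45, None, 38],
    "occupation": ["Midwife", None, "Farmer"],
})

# fill the blanks in one text column only, leaving "age" numeric
toy_df["occupation"] = toy_df["occupation"].fillna("Unknown")

print(toy_df["occupation"].tolist())
# mean() skips the blank age rather than failing
print(toy_df["age"].mean())
```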

In [None]:
witches_df

In [None]:
# and take note that you can apply string methods to any column! 
# let's make everything in the "notes" column lowercase so it's normalised in case you need it for text analysis later
witches_df["notes"] = witches_df["notes"].str.lower()


In [None]:
witches_df

In [None]:
# finally, Pandas makes it really easy to export your dataframe as a CSV for publication or later use
witches_df.to_csv("accused_cleaned.csv")
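
By default, `.to_csv()` also writes the row numbers (the index) as an unnamed first column. If you would rather leave them out, `index=False` does so. Here is a short sketch on an invented two-row table, written to an in-memory buffer instead of a real file so that nothing is saved to disk:

```python
import io
import pandas as pd

# invented rows for illustration
toy_df = pd.DataFrame({
    "firstname": ["Agnes", "John"],
    "occupation": ["Midwife", "Farmer"],
})

# a StringIO buffer behaves like a file but lives only in memory
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

csv_lines = buffer.getvalue().splitlines()
# the header row contains only our two column names
print(csv_lines[0])
```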

# Putting Everything Together

In [None]:
# write activity code here

## Answer key (No peeking!)

In [None]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

pd.options.display.max_rows = 100

In [None]:
litrev_df = pd.read_csv('digihum-lit-rev.csv', delimiter=",")

litrev_df

In [None]:
litrev_df = litrev_df[litrev_df["Description"].notna()]

abstracts_as_text = ""

for i in litrev_df["Description"]:
    abstracts_as_text += i + "\n"    
    
abstractTokens = word_tokenize(abstracts_as_text.lower())

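cleaned_abstractTokens = []

# look up the stopword list once, rather than on every pass through the loop
english_stopwords = set(stopwords.words("english"))

for word in abstractTokens:
    if word not in english_stopwords and word.isalpha():
        cleaned_abstractTokens.append(word)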

abstracts_df = pd.DataFrame(cleaned_abstractTokens, columns =['uniqueWords'])
        
keywords = abstracts_df["uniqueWords"].value_counts()

# show the 100 most frequent words
keywords.head(100)

# Identifying and Solving Errors

Try and correct the following errors! For more of a challenge, try and identify the errors before running the code 🔎

In [None]:
# Error 1

people = [
    {'name': 'Jolene', 'birth_year': 1955, 'death_year': 1972},
    {'name': 'George', 'birth_year': 1942, 'death_year': 2010},
    {'name': 'Charlene', 'birth_year': 1927, 'death_year': 1941},
    {'name': 'David', 'birth_year': 1830, 'death_year': 1923},
    {'name': 'Eve', 'birth_year': 1899, 'death_year': 1940},
]

print(people[5])

In [None]:
# Error 2

# takes two arguments and returns their sum
def add_numbers(x, y):
    return x + y

result = add_numbers(5, 10)

print("The sum of the numbers is:", result)
print("The difference of the numbers is:", result2)

In [None]:
# Error 3

year = 1955
name = "Jolene Barrie"

result = name + " was born in " + year
print(result)

In [None]:
# Error 4

people = [
    {'name': 'Jolene', 'birth_year': 1955, 'death_year': 1972},
    {'name': 'George', 'birth_year': 1942, 'death_year': 2010},
    {'name': 'Charlene', 'birth_year': 1927, 'death_year': 1941},
    {'name': 'David', 'birth_year': 1830, 'death_year' 1923},
    {'name': 'Eve', 'birth_year': 1899, 'death_year': 1940},
]

for person in people:
    print("Age at death: " + str(person['death_year'] - person['birth_year']))

In [None]:
# Error 5

# convert strings to int
def convert_to_num(year):
    return int(year)

year = "1955"
name = "Jolene Barrie"

convert_to_num(name)