# Dictionaries

## Instantiating a dictionary

Examples

In [1]:
catalog = {'1008':'widget', '2149':'flange', '19x5':'smoke shifter', '992':'poiuyt'}
profile = {'name':'Mickey Mouse', 'company':'Disney', 'animated':True, 'fingers':8}

print(catalog['2149'])
print(profile['name'])

flange
Mickey Mouse


In [3]:
# The key can be a string variable rather than a literal
characteristic = 'animated'
print(profile[characteristic])

trait = input('What do you want to know about the character? ')
print("The character's", trait, 'is', profile[trait])

True


KeyError: 'tail'
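
The `KeyError` above is what Python raises when the requested key (here `tail`) is not in the dictionary. A minimal sketch of two ways to guard against it, assuming the same `profile` dictionary; the `trait` value is hard-coded instead of read with `input()`, and the fallback text `'not recorded'` is only an illustration:

```python
profile = {'name': 'Mickey Mouse', 'company': 'Disney', 'animated': True, 'fingers': 8}
trait = 'tail'  # stands in for whatever the user typed

# Option 1: test for the key before using it
if trait in profile:
    answer = profile[trait]
else:
    answer = 'not recorded'

# Option 2: .get() returns a default value instead of raising KeyError
same_answer = profile.get(trait, 'not recorded')

print("The character's", trait, 'is', answer)
```

Either form keeps the cell from stopping with an error; which one reads better depends on how much needs to happen when the key is missing.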

## Editing a dictionary

In [5]:
my_dict = {}
my_dict['name'] = input('What is the character name? ')
print(my_dict)
my_dict['company'] = input('Who does the character work for? ')
print(my_dict)
my_dict['fingers'] = int(input('How many fingers does the character have? '))
print(my_dict)

{'name': 'Scooby-Doo'}
{'name': 'Scooby-Doo', 'company': 'Shaggy'}


ValueError: invalid literal for int() with base 10: 'idk'
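
The `ValueError` occurs because `int()` cannot convert text such as `idk` into a number. One possible guard is sketched below with a hypothetical helper, `to_int_or_none()`, which is not part of the original lesson; the typed text is hard-coded so the cell runs without prompting:

```python
def to_int_or_none(text):
    """Convert text to an int, or return None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

my_dict = {'name': 'Scooby-Doo', 'company': 'Shaggy'}
my_dict['fingers'] = to_int_or_none('idk')  # stands in for the user's input
print(my_dict)
```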

In [6]:
print(catalog)

catalog['2149'] = 'thingamajig'
print(catalog)

del catalog['1008']
print(catalog)

{'1008': 'widget', '2149': 'flange', '19x5': 'smoke shifter', '992': 'poiuyt'}
{'1008': 'widget', '2149': 'thingamajig', '19x5': 'smoke shifter', '992': 'poiuyt'}
{'2149': 'thingamajig', '19x5': 'smoke shifter', '992': 'poiuyt'}


## Practice

The starter code has some data retrieved from the Twitter API for a tweet. Print the text of the tweet, then change the value of the `lang` key to Spanish (language code `es`) and print the dictionary.

In [13]:
tweet = {'created_at':'Wed Sep 18 14:08:54 +0000 2019', 'text':'RT @wnprwheelhouse: @wnprharriet кричать @wnpr !','lang':'ru'}
# print(tweet['entities']['user_mentions'][0]['screen_name'])
print(tweet['text'])
# change the language of the tweet to Spanish
tweet['lang'] = 'es'
print(tweet)

RT @wnprwheelhouse: @wnprharriet кричать @wnpr !
{'created_at': 'Wed Sep 18 14:08:54 +0000 2019', 'text': 'RT @wnprwheelhouse: @wnprharriet кричать @wnpr !', 'lang': 'es'}


# Complex data structures

## Lists of lists


In [14]:
first_row = [3, 5, 7, 9]
second_row = [4, 11, -1, 5]
third_row = [-99, 0, 45, 0]
data = [first_row, second_row, third_row]
print(data)

[[3, 5, 7, 9], [4, 11, -1, 5], [-99, 0, 45, 0]]


In [15]:
print(len(data))

3


In [16]:
print(data[1])
print(len(data[1]))

[4, 11, -1, 5]
4


In [17]:
data = [[3, 5, 7, 9], [4, 11, -1, 5], [-99, 0, 45, 0]]
print(data[2][0])

-99


## Lists of dictionaries

In [18]:
characters = [{'name':'Mickey Mouse', 'company':'Disney', 'gender': 'male'}, {'name':'Daisy Duck', 'company':'Disney', 'gender': 'female'}, {'name':'Daffy Duck', 'company':'Warner Brothers', 'gender': 'male'},  {'name':'Fred Flintstone', 'company':'Hanna Barbera', 'gender': 'male'}, {'name':'WALL-E', 'company':'Pixar', 'gender': 'neutral'}, {'name':'Fiona', 'company':'DreamWorks', 'gender': 'female'}]
print(characters[1]['company'])
print(characters[0]['name'])
print(characters[4]['gender'])

Disney
Mickey Mouse
neutral


# Pandas

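Install pandas if it is not already available (the `!` prefix runs `pip` as a shell command from the notebook), then use the standard import statement for pandas.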

In [23]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.1-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting numpy<2,>=1.26.0 (from pandas)
  Downloading numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.1-cp312-cp312-win_amd64.whl (11.5 MB)

In [24]:
import pandas as pd

# Create a dictionary to use for experiments
states_dict = {'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'}
print(states_dict)

{'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'}


## pandas Series

Series are one-dimensional pandas data structures that combine features of lists and dictionaries: their items are ordered, like a list, and each item also has a label, like a dictionary key.

We can create an instance of a Series by passing a dictionary as an argument into `pd.Series()`:

In [25]:
states_series = pd.Series(states_dict)
print(states_series)

OH            Ohio
TN       Tennessee
AZ         Arizona
PA    Pennsylvania
AK          Alaska
dtype: object


When a Series is displayed, the label index is shown on the left and the Series items are shown on the right.

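We can refer to items in a Series either by position (using an integer index) or by name (using the label index for the item). Integer indexing is zero-based, as it is elsewhere in Python. With recent pandas releases such as the 2.2.1 version installed above, a bare integer position in square brackets on a label-indexed Series, as in `states_series[2]`, still works but emits a `FutureWarning`; the `.iloc[]` accessor is the explicit way to select by position.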

In [26]:
print(states_series[2])
print(states_series['TN'])

Arizona
Tennessee



Series item 2 is the third item in the Series since we start counting with zero.

There is an alternate way of referring to Series items by position that makes the indexing system explicit. `.loc[]` locates items by label index, and `.iloc[]` locates items by integer index. WARNING: One gotcha here is that the "i" in "iloc" should be thought of as referring to "integer", NOT "index". In pandas, when the term "index" is used by itself, it refers to the label index, not the integer index.

Specifying a single index in `.loc[]` or `.iloc[]` returns a single value from the Series. In this case the values are strings, so the type of the returned value is `str`.

In [27]:
print(states_series.iloc[2])
print(states_series.loc['TN'])
print(type(states_series.loc['TN']))

Arizona
Tennessee
<class 'str'>
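
The difference between the two accessors is easiest to see when the labels are themselves integers. A small sketch, using a made-up Series named `scores` (not part of the lesson data) whose integer labels deliberately do not match the positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the item positions
scores = pd.Series(['low', 'medium', 'high'], index=[2, 0, 1])

by_label = scores.loc[2]      # the item whose label is 2 (the first item)
by_position = scores.iloc[2]  # the item at position 2 (the third item)
print(by_label, by_position)
```

Here `.loc[2]` and `.iloc[2]` return different items, which is why spelling out the accessor avoids ambiguity.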


## pandas DataFrames

DataFrames are two-dimensional data structures composed of Series with shared indices.

DataFrames can be created from a dictionary of Series.

In [28]:
text_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville', 'AZ': 'Phoenix', 'PA': 'Harrisburg', 'AK': 'Juneau'})
population_series = pd.Series({'OH': 11799448, 'TN': 6910840, 'AZ': 7151502, 'PA': 13002700, 'AK': 733391})
print(text_series)
print()
print(capital_series)
print()
print(population_series)

states_dict = {'text': text_series, 'capital': capital_series, 'population': population_series}
states_df = pd.DataFrame(states_dict)

OH            Ohio
TN       Tennessee
AZ         Arizona
PA    Pennsylvania
AK          Alaska
dtype: object

OH      Columbus
TN     Nashville
AZ       Phoenix
PA    Harrisburg
AK        Juneau
dtype: object

OH    11799448
TN     6910840
AZ     7151502
PA    13002700
AK      733391
dtype: int64


When a DataFrame is created in this way, the dictionary keys become the column headers (column label indices) and each Series becomes a column. The label indices that the Series have in common become the row label indices of the DataFrame.
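
Because the columns are matched up by label rather than by position, the Series do not have to list their items in the same order. A brief sketch with two illustrative Series (`names` and `capitals` are made-up variable names), where one label is missing from the second Series:

```python
import pandas as pd

# 'AK' is absent from the second Series, and the order of labels differs
names = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AK': 'Alaska'})
capitals = pd.Series({'TN': 'Nashville', 'OH': 'Columbus'})

aligned_df = pd.DataFrame({'text': names, 'capital': capitals})
print(aligned_df)

# Values are paired by label; the missing capital shows up as NaN
print(aligned_df.loc['OH', 'capital'])
print(pd.isna(aligned_df.loc['AK', 'capital']))
```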

When you print a pandas DataFrame, you get a plain-text representation. If the DataFrame's name is the last line of a notebook cell, the notebook displays it as a formatted table instead.

In [31]:
print(states_df)
states_df

            text     capital  population
OH          Ohio    Columbus    11799448
TN     Tennessee   Nashville     6910840
AZ       Arizona     Phoenix     7151502
PA  Pennsylvania  Harrisburg    13002700
AK        Alaska      Juneau      733391


| | text | capital | population |
|---|---|---|---|
| OH | Ohio | Columbus | 11799448 |
| TN | Tennessee | Nashville | 6910840 |
| AZ | Arizona | Phoenix | 7151502 |
| PA | Pennsylvania | Harrisburg | 13002700 |
| AK | Alaska | Juneau | 733391 |


## Specifying a column

We can specify a column by using its column header as the label index in square brackets. The resulting column is a pandas Series.

In [32]:
print(states_df['capital'])
print()
print(type(states_df['capital']))

OH      Columbus
TN     Nashville
AZ       Phoenix
PA    Harrisburg
AK        Juneau
Name: capital, dtype: object

<class 'pandas.core.series.Series'>


## Specifying a row

Select a row using `.loc[]` with the label index or `.iloc[]` with the integer position. The result is a Series whose label indices are the column labels.

In [33]:
print(states_df.loc['AZ'])
print()
print(states_df.iloc[1])

text          Arizona
capital       Phoenix
population    7151502
Name: AZ, dtype: object

text          Tennessee
capital       Nashville
population      6910840
Name: TN, dtype: object


## Specifying a cell

Select a cell using `.loc[]` with the row label index and the column label. The result has the type of the data contained in the cell.

In [34]:
print(states_df.loc['PA', 'population'])
print(type(states_df.loc['PA', 'population']))
print(states_df.loc['AK', 'capital'])
print(type(states_df.loc['AK', 'capital']))

13002700
<class 'numpy.int64'>
Juneau
<class 'str'>
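
A cell can also be selected by position with `.iloc[]`, giving the row position first and the column position second. A minimal sketch, assuming a two-row stand-in for `states_df` so that the cell is self-contained:

```python
import pandas as pd

# Reduced stand-in for the states DataFrame built earlier
states_df = pd.DataFrame(
    {'text': ['Ohio', 'Tennessee'],
     'capital': ['Columbus', 'Nashville'],
     'population': [11799448, 6910840]},
    index=['OH', 'TN'])

# Row position 1 is TN; column position 2 is population
cell = states_df.iloc[1, 2]
print(cell)
```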


## Practice

Write expressions that print each of the following:
- The population column
- The row for Tennessee, using the label index
- The row for Alaska, using the integer index
- The capital of Pennsylvania

In [42]:
print(states_df['population'])
print(states_df.loc['TN'])
print(states_df.iloc[4])
print(states_df.loc['PA', 'capital'])

OH    11799448
TN     6910840
AZ     7151502
PA    13002700
AK      733391
Name: population, dtype: int64
text          Tennessee
capital       Nashville
population      6910840
Name: TN, dtype: object
text          Alaska
capital       Juneau
population    733391
Name: AK, dtype: object
Harrisburg


# Loading a DataFrame from a file

Although there are a number of ways to build a pandas DataFrame from simpler Python objects, most of the time we will create them from data that are already in tabular form in a file.

You can load a CSV file by passing its URL as the argument of the `pd.read_csv()` function. Since the `School ID` column is a unique identifier for each row, we can use it as the index column.

In [43]:
schools_df = pd.read_csv('https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/gis/wg/Metro_Nashville_Schools.csv')
# Set the row label index to be the School ID column
schools_df = schools_df.set_index('School ID')
schools_df

| School ID | School Year | School Level | School Name | State School ID | Zip Code | Grade PreK 3yrs | Grade PreK 4yrs | Grade K | Grade 1 | Grade 2 | ... | Native Hawaiian or Other Pacific Islander | White | Male | Female | Economically Disadvantaged | Disability | Limited English Proficiency | Latitude | Longitude | Mapped Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 496 | 18-19 | Elementary School | A. Z. Kelley Elementary | 1 | 37013 | NaN | 40.0 | 153.0 | 146.0 | 177.0 | ... | NaN | 218 | 423 | 424 | 300.0 | 91 | 301.0 | 36.021817 | -86.658848 | (36.02181712, -86.65884778) |
| 375 | 18-19 | Elementary School | Alex Green Elementary | 5 | 37189 | NaN | 37.0 | 53.0 | 46.0 | 40.0 | ... | NaN | 15 | 123 | 143 | 183.0 | 19 | 25.0 | 36.252961 | -86.832229 | (36.2529607, -86.8322292) |
| 105 | 18-19 | Elementary School | Amqui Elementary | 10 | 37115 | 2.0 | 29.0 | 91.0 | 81.0 | 89.0 | ... | NaN | 73 | 230 | 234 | 244.0 | 47 | 122.0 | 36.273766 | -86.703832 | (36.27376585, -86.70383153) |
| 460 | 18-19 | Elementary School | Andrew Jackson Elementary | 15 | 37138 | 4.0 | 29.0 | 93.0 | 85.0 | 90.0 | ... | 1.0 | 270 | 252 | 250 | 99.0 | 66 | 33.0 | 36.231585 | -86.623775 | (36.23158465, -86.62377469) |
| 110 | 18-19 | High School | Antioch High School | 20 | 37013 | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 442 | 1047 | 909 | 716.0 | 223 | 544.0 | 36.046675 | -86.599418 | (36.04667464, -86.59941833) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 770 | 18-19 | Middle School | West End Middle | 690 | 37205 | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 253 | 276 | 256 | 132.0 | 106 | 27.0 | 36.131975 | -86.824323 | (36.13197471, -86.82432265) |
| 775 | 18-19 | Elementary School | Westmeade Elementary | 695 | 37205 | NaN | NaN | 88.0 | 85.0 | 88.0 | ... | 2.0 | 198 | 237 | 187 | 128.0 | 54 | 61.0 | 36.091997 | -86.894137 | (36.09199678, -86.89413665) |
| 787 | 18-19 | High School | Whites Creek High School | 704 | 37189 | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 80 | 316 | 336 | 370.0 | 132 | 18.0 | 36.276645 | -86.818833 | (36.27664532, -86.81883299) |
| 612 | 18-19 | Middle School | William Henry Oliver Middle | 538 | 37211 | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 444 | 476 | 484 | 253.0 | 113 | 208.0 | 36.020174 | -86.712207 | (36.02017398, -86.7122071) |
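
Loading and setting the index can also be done in one step with the `index_col` parameter of `pd.read_csv()`. The sketch below reads from a short text string through `io.StringIO` instead of the real URL, so the three rows and the reduced set of columns are only a stand-in for the actual file:

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file; only three of the real columns are included
csv_text = """School ID,School Level,School Name
496,Elementary School,A. Z. Kelley Elementary
375,Elementary School,Alex Green Elementary
110,High School,Antioch High School
"""

# index_col does the same job as calling .set_index() after loading
sample_df = pd.read_csv(io.StringIO(csv_text), index_col='School ID')
print(sample_df.loc[110, 'School Name'])
```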


If a DataFrame is large, it will be difficult to examine the whole thing at once. We can use several methods to view characteristics of the DataFrame.

The `.head()` method will display the first 5 rows of the DataFrame. You can pass in a different number of rows to display as an argument. 

In [44]:
schools_df.head()

| School ID | School Year | School Level | School Name | State School ID | Zip Code | Grade PreK 3yrs | Grade PreK 4yrs | Grade K | Grade 1 | Grade 2 | ... | Native Hawaiian or Other Pacific Islander | White | Male | Female | Economically Disadvantaged | Disability | Limited English Proficiency | Latitude | Longitude | Mapped Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 496 | 18-19 | Elementary School | A. Z. Kelley Elementary | 1 | 37013 | NaN | 40.0 | 153.0 | 146.0 | 177.0 | ... | NaN | 218 | 423 | 424 | 300.0 | 91 | 301.0 | 36.021817 | -86.658848 | (36.02181712, -86.65884778) |
| 375 | 18-19 | Elementary School | Alex Green Elementary | 5 | 37189 | NaN | 37.0 | 53.0 | 46.0 | 40.0 | ... | NaN | 15 | 123 | 143 | 183.0 | 19 | 25.0 | 36.252961 | -86.832229 | (36.2529607, -86.8322292) |
| 105 | 18-19 | Elementary School | Amqui Elementary | 10 | 37115 | 2.0 | 29.0 | 91.0 | 81.0 | 89.0 | ... | NaN | 73 | 230 | 234 | 244.0 | 47 | 122.0 | 36.273766 | -86.703832 | (36.27376585, -86.70383153) |
| 460 | 18-19 | Elementary School | Andrew Jackson Elementary | 15 | 37138 | 4.0 | 29.0 | 93.0 | 85.0 | 90.0 | ... | 1.0 | 270 | 252 | 250 | 99.0 | 66 | 33.0 | 36.231585 | -86.623775 | (36.23158465, -86.62377469) |
| 110 | 18-19 | High School | Antioch High School | 20 | 37013 | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 442 | 1047 | 909 | 716.0 | 223 | 544.0 | 36.046675 | -86.599418 | (36.04667464, -86.59941833) |
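
Besides `.head()`, a few other attributes and methods give a quick overview of a large DataFrame. A short sketch on a made-up six-row DataFrame (`sample_df` and its values are placeholders rather than the full schools data):

```python
import pandas as pd

sample_df = pd.DataFrame(
    {'Male': [423, 123, 230, 252, 1047, 276],
     'Female': [424, 143, 234, 250, 909, 256]},
    index=[496, 375, 105, 460, 110, 770])

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # the column labels
print(sample_df.tail(2))        # the last 2 rows
```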


## Practice

Print the `School Name` column. Then print the first 3 rows of the DataFrame.

In [47]:
print(schools_df['School Name'])
schools_df.head(3)

School ID
496        A. Z. Kelley Elementary
375          Alex Green Elementary
105               Amqui Elementary
460      Andrew Jackson Elementary
110            Antioch High School
                  ...             
770                West End Middle
775           Westmeade Elementary
787       Whites Creek High School
612    William Henry Oliver Middle
805                  Wright Middle
Name: School Name, Length: 169, dtype: object


| School ID | School Year | School Level | School Name | State School ID | Zip Code | Grade PreK 3yrs | Grade PreK 4yrs | Grade K | Grade 1 | Grade 2 | ... | Native Hawaiian or Other Pacific Islander | White | Male | Female | Economically Disadvantaged | Disability | Limited English Proficiency | Latitude | Longitude | Mapped Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 496 | 18-19 | Elementary School | A. Z. Kelley Elementary | 1 | 37013 | NaN | 40.0 | 153.0 | 146.0 | 177.0 | ... | NaN | 218 | 423 | 424 | 300.0 | 91 | 301.0 | 36.021817 | -86.658848 | (36.02181712, -86.65884778) |
| 375 | 18-19 | Elementary School | Alex Green Elementary | 5 | 37189 | NaN | 37.0 | 53.0 | 46.0 | 40.0 | ... | NaN | 15 | 123 | 143 | 183.0 | 19 | 25.0 | 36.252961 | -86.832229 | (36.2529607, -86.8322292) |
| 105 | 18-19 | Elementary School | Amqui Elementary | 10 | 37115 | 2.0 | 29.0 | 91.0 | 81.0 | 89.0 | ... | NaN | 73 | 230 | 234 | 244.0 | 47 | 122.0 | 36.273766 | -86.703832 | (36.27376585, -86.70383153) |


## Vectorized operations

Pandas Series and DataFrames support vectorized operations, which means that operations are applied to every item in the Series or DataFrame at once. 

We can calculate the total number of students in each school by adding the `Male` and `Female` columns together.

In [48]:
total_students = schools_df['Male'] + schools_df['Female']
print(total_students)

# Create a DataFrame to make the answers easier to read.
summary_df = pd.DataFrame({'male': schools_df['Male'], 'female': schools_df['Female'], 'total': total_students})
summary_df

School ID
496     847
375     266
105     464
460     502
110    1956
       ... 
770     532
775     424
787     652
612     960
805     738
Length: 169, dtype: int64


| School ID | male | female | total |
|---|---|---|---|
| 496 | 423 | 424 | 847 |
| 375 | 123 | 143 | 266 |
| 105 | 230 | 234 | 464 |
| 460 | 252 | 250 | 502 |
| 110 | 1047 | 909 | 1956 |
| ... | ... | ... | ... |
| 770 | 276 | 256 | 532 |
| 775 | 237 | 187 | 424 |
| 787 | 316 | 336 | 652 |
| 612 | 476 | 484 | 960 |
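
Vectorized operations also work between a Series and a single number, and with comparisons. A small sketch using the first three schools' counts; the variable names `male`, `female`, `doubled`, and `is_large` are illustrative only:

```python
import pandas as pd

male = pd.Series({496: 423, 375: 123, 105: 230})
female = pd.Series({496: 424, 375: 143, 105: 234})

total = male + female   # element-by-element addition, matched by label
doubled = total * 2     # a single number is applied to every item
is_large = total > 400  # a comparison produces a Series of True/False values
print(is_large)
```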


## Practice

Calculate the percentage of male students in each school and print the series.

In [50]:
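# The lowercase 'male' and 'total' columns exist only in summary_df;
# schools_df names its column 'Male' and has no 'total' column.
male_percent = summary_df['male'] / summary_df['total'] * 100
male_percent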

# Loops

## Iterating using `for`

Example

In [None]:
basket = ['apple', 'orange', 'banana', 'lemon', 'lime']
for fruit in basket:
    print('I ate one ' + fruit)
print("I'm full now!")

In [None]:
word = 'supercalifragilisticexpialidocious'
print('Spell it out!')
for letter in word:
    print(letter)
print('That wore me out.')

## Building a sequence with a for loop

A common pattern is to create an empty object and then add items to it one at a time in a loop. Within the loop, the statement

```
sequence = sequence + item
```

can be replaced with 

```
sequence += item
```

Code with explicit concatenation:

In [None]:
list_of_words = ['The ', 'quick ', 'brown ', 'fox ', 'jumps ', 'over ', 'the ', 'lazy ', 'dog ']
sentence = ''
for word in list_of_words:
    sentence = sentence + word # Concatenate the word to the sentence
print(sentence + '!')

Code with shorthand:


In [None]:
sentence = ''
for word in list_of_words:
    sentence += word
print(sentence + '!')

The same strategy, but creating a total by summing a list of numbers:


In [None]:
total = 0
for number in [3, 5, 7, 9]:
    total += number
print('The total is', total)

Using a `range()` object to repeat a prompt a fixed number of times and build a list from user input:


In [None]:
bird_list = []
for i in range(4):
    bird = input('Enter a bird name: ')
    bird_list.append(bird)
print('Your bird list is:', bird_list)
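
To see what the loop variable `i` receives from `range(4)`, the values can be listed directly. The sketch below builds a list the same way as the bird example but without user input; the `squares` list is only an illustration:

```python
print(list(range(4)))  # the values the loop variable takes, starting at zero

squares = []
for i in range(4):
    squares.append(i * i)
print(squares)
```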

# Iterating through rows in a DataFrame

One of the main purposes of pandas is to make it possible to perform operations on entire columns using vectorized operations. However, there are some situations where it makes sense to iterate through each row in the DataFrame and deal with values one row at a time. These situations would include complex operations that require multiple lines of code to describe, or actions that must happen sequentially, such as retrieving data from a URL.

Our example will use information about websites.

In [None]:
websites = {
    'name': {'alphabet': 'Google', 'vu': 'Vanderbilt', 'fake': 'Obsolete Website'}, 
    'url': {'alphabet': 'https://www.google.com/', 'vu': 'https://www.vanderbilt.edu/', 'fake': 'https://example.org/fake_url'},
    'status': {'alphabet': 'unknown', 'vu': 'unknown', 'fake': 'unknown'}
           }
websites_df = pd.DataFrame(websites)
websites_df

To generate an iterable object from the DataFrame, we use the `.iterrows()` method. Each pass through a `for` loop yields a tuple consisting of the row's label index and the row's data in the form of a Series.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(label_index)
    print()
    print(website_series)
    print()
    print()
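
Writing `for label_index, website_series in ...` unpacks each tuple into two variables in one step. The sketch below takes the tuple apart by hand to show the same thing, using a two-row stand-in DataFrame named `sites_df` (a hypothetical name, not the lesson's `websites_df`):

```python
import pandas as pd

sites_df = pd.DataFrame(
    {'name': ['Google', 'Vanderbilt'],
     'url': ['https://www.google.com/', 'https://www.vanderbilt.edu/']},
    index=['alphabet', 'vu'])

for pair in sites_df.iterrows():
    label_index = pair[0]  # first element: the row label
    row_series = pair[1]   # second element: the row as a Series
    print(label_index, type(row_series).__name__, row_series['name'])
```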

To access a value from the row Series, we can use direct indexing by providing the column label index.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(website_series['url'])
    print()

Iterating will allow us to check the status of each website one at a time.

In [None]:
import requests
for label_index, website_series in websites_df.iterrows():
    # The timeout keeps the loop from hanging on an unresponsive site
    response = requests.get(website_series['url'], timeout=10)
    # HTTP status code 200 means the request succeeded; 404 means the page was not found.
    print(label_index, website_series['url'], response.status_code)
    # Assign the status to the status column in the DataFrame
    websites_df.loc[label_index, 'status'] = response.status_code

# Print the updated DataFrame
websites_df

## Practice

Use the `.head()` method to assign the first 10 rows of the schools DataFrame to a new DataFrame called `schools_subset`. Then iterate through the rows of `schools_subset` and print the `School Name` and `Zip Code` for each row.
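
One possible approach is sketched below. So that the cell runs without downloading the file, it assumes a three-row stand-in for `schools_df` containing only the two columns the exercise needs; with the real DataFrame, only the last three lines would be required:

```python
import pandas as pd

# Stand-in for schools_df with only the columns used in this exercise
schools_df = pd.DataFrame(
    {'School Name': ['A. Z. Kelley Elementary', 'Alex Green Elementary', 'Amqui Elementary'],
     'Zip Code': [37013, 37189, 37115]},
    index=pd.Index([496, 375, 105], name='School ID'))

schools_subset = schools_df.head(10)  # returns every row when there are fewer than 10
for school_id, school_series in schools_subset.iterrows():
    print(school_series['School Name'], school_series['Zip Code'])
```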