**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 2. Warm Up

## In this lesson...

- A very brief introduction to Pandas


- Method chaining


- Formatting Python code: line continuation

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## A very brief introduction to Pandas 🐼

- In the `data` folder next to this notebook, there is a file `gapminder.csv` containing health and population data for a number of countries between 1955 and 2005
    - The path `data/gapminder.csv` means the file `gapminder.csv` inside the folder `data`
    - This data was collected by [Gapminder](https://www.gapminder.org/)

- CSV stands for **comma-separated values**


- CSV files are a common way to store tabular data


- Let's see what the Gapminder data looks like, using JupyterLab's file viewer...

- How can we use this data in Python?


- [__Pandas__](https://pandas.pydata.org/) is a Python package for data analysis and manipulation 
    - [Here is the Pandas documentation](https://pandas.pydata.org/docs/)

- We will spend a significant portion of this course getting fluent in Pandas


- For now, we just need a few basics so that we can start visualizing data in Python

- First, let's import Pandas as `pd`:

In [1]:
import pandas as pd

## The DataFrame object

- A Pandas __DataFrame__ is a two-dimensional table, with rows and columns


- Sometimes we refer to the columns as _variables_ and the rows as *observations*, since tabular data is commonly set up this way


- We can use the Pandas `read_csv()` function to read the Gapminder data `data/gapminder.csv` into a DataFrame called `df`, like this:

In [2]:
df = pd.read_csv('data/gapminder.csv')

- By default, `read_csv()` assumes the first row of the CSV file contains the column names

- It's a good habit to take a quick look at the DataFrame that `read_csv()` creates, just in case something went wrong


- To view the first 5 rows of a DataFrame, we can use the `.head()` method:

In [3]:
# Solution
df.head()

| |year|country|cluster|pop|life_expect|fertility|
|-|----|-------|-------|---|-----------|---------|
|0|1955|Afghanistan|South Asia|8891209|30.332|7.7|
|1|1960|Afghanistan|South Asia|9829450|31.997|7.7|
|2|1965|Afghanistan|South Asia|10997885|34.02|7.7|
|3|1970|Afghanistan|South Asia|12430623|36.088|7.7|
|4|1975|Afghanistan|South Asia|14132019|38.438|7.7|
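
- To see the header assumption and `.head()` at work without the Gapminder file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The names `csv_text` and `small_df` are made up for illustration, and the three rows are copied from the output above; reading from `io.StringIO` instead of a file path is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# A tiny CSV held in a string; the first line holds the column names
csv_text = """year,country,pop
1955,Afghanistan,8891209
1960,Afghanistan,9829450
1965,Afghanistan,10997885
"""

# read_csv() accepts a file path or a file-like object such as StringIO
small_df = pd.read_csv(io.StringIO(csv_text))

# The first row became the column names, not a row of data
print(list(small_df.columns))

# .head() takes an optional number of rows (the default is 5)
print(small_df.head(2))
```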


- Each row of this DataFrame contains the following data for one `country` in one `year`:
    - region of the world (`cluster`)
    - total population (`pop`)
    - average life expectancy in years (`life_expect`)
    - number of children per woman (`fertility`)

- By default, Pandas assigns a **label** to each row/observation: these labels are called the **index**
    - Above, you can see the labels all the way on the left
    - Note: the index does *not* count as a column/variable of the DataFrame
    - We'll come back to this later in the semester

- The `.shape` attribute of a DataFrame contains the number of rows and columns in the DataFrame:

In [4]:
# Solution
df.shape

(693, 6)
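
- The relationship between the index, the columns, and `.shape` can be checked on a small table. This is only a sketch: `demo_df` is a hypothetical DataFrame built by hand from a few values in the output above, and the `.index` and `.columns` attributes are not covered in this lesson.

```python
import pandas as pd

# Hypothetical three-row table, using values from the Gapminder output
demo_df = pd.DataFrame({
    'year': [1955, 1960, 1965],
    'life_expect': [30.332, 31.997, 34.02],
})

# .shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = demo_df.shape
print(n_rows, n_cols)

# The row labels live in .index and are not counted among the columns
print(list(demo_df.index))
print(list(demo_df.columns))
```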

- The `.info()` method of a DataFrame prints some useful information about the DataFrame, including the number of non-null values and the type of values in each column:

In [5]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693 entries, 0 to 692
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   year         693 non-null    int64  
 1   country      693 non-null    object 
 2   cluster      693 non-null    object 
 3   pop          693 non-null    int64  
 4   life_expect  693 non-null    float64
 5   fertility    693 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 32.6+ KB


- The `.describe()` method of a DataFrame outputs summary statistics for the columns with numeric data:

In [6]:
# Solution
df.describe()

| |year|pop|life_expect|fertility|
|-|----|---|-----------|---------|
|count|693.0|693.0|693.0|693.0|
|mean|1980.0|56234310.0|66.146406|3.605755|
|std|15.822809|155301400.0|10.714033|1.921234|
|min|1955.0|53865.0|23.599|0.94|
|25%|1965.0|4563732.0|59.957|2.015|
|50%|1980.0|12292000.0|69.498|2.93|
|75%|1995.0|44434440.0|73.84|5.0005|
|max|2005.0|1303182000.0|82.603|8.5|
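
- As a rough illustration of "numeric columns only", the sketch below calls `.describe()` on a hypothetical hand-built table (`demo_df`, with values copied from the earlier output). Looking up single statistics with `.loc` is an assumption that goes beyond this lesson; it is shown only to make the point that the summary is itself a DataFrame.

```python
import pandas as pd

# Hypothetical table with one text column and two numeric columns
demo_df = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Afghanistan'],
    'year': [1955, 1960, 1965],
    'pop': [8891209, 9829450, 10997885],
})

summary = demo_df.describe()

# Only the numeric columns appear; the text column 'country' is left out
print(list(summary.columns))

# The summary is itself a DataFrame, so single statistics can be looked up
print(summary.loc['min', 'pop'])
print(summary.loc['max', 'pop'])
```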


- That's all we need for now


- We'll learn _much_ more about Pandas later in the semester

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Method chaining

- Consider the following string in the variable `sentence`:

In [7]:
sentence = '  the QUICK foX jumPED over the laZY CAt.    '

- We can format this string programmatically using Python's string methods to:
    - Remove whitespace at the beginning and end of the string with `.strip()`
    - Capitalize the sentence correctly with `.capitalize()`
    - Replace any instance of 'cat' with 'dog' with `.replace()`

- We want to end up with:

    ```python
    'The quick fox jumped over the lazy dog.'
    ```

- Such code might look like this:

In [8]:
# Solution
# Remove whitespace at the beginning and end of the string
sentence2 = sentence.strip()

# Capitalize the sentence correctly
sentence3 = sentence2.capitalize()

# Replace any instance of 'cat' with 'dog'
sentence4 = sentence3.replace('cat', 'dog')

# Print the final sentence
print(sentence4)

The quick fox jumped over the lazy dog.


- We can inspect the results of the intermediate steps:

In [9]:
print(sentence2)
print(sentence3)

the QUICK foX jumPED over the laZY CAt.
The quick fox jumped over the lazy cat.


- This works, but is a bit cumbersome: we have to come up with variable names for the intermediate steps, which we don't really care about


- There is a cleaner way!


- Note that each of the string methods (`.strip()`, `.capitalize()`, `.replace()`) we used above *returns another string*


- So instead of storing these results in intermediate variables, we can directly *chain* the method calls one after another, like this:

In [10]:
# Solution
sentence.strip().capitalize().replace('cat', 'dog')

'The quick fox jumped over the lazy dog.'

- In the code above, we can easily read the sequence of actions we took from left to right


- **Method chaining** often improves the readability of code and reduces the amount of code needed


- We will use method chaining a lot in this course: the libraries for data wrangling and visualization that we will use are well-suited for method chaining
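
- Chaining only works when each call returns an object that the next method can be called on. The sketch below contrasts string methods with `list.sort()`, which sorts in place and returns `None`. The variable names are made up for illustration.

```python
words = ['navy', 'army', 'tulane']

# String methods return new strings, so they chain naturally
shout = '  go navy!  '.strip().upper()
print(shout)

# list.sort() sorts in place and returns None, so nothing can follow it
result = words.sort()
print(result)

# sorted() returns a new list, so the expression can keep going
first = sorted(words)[0].capitalize()
print(first)
```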

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Formatting Python code &mdash; line continuation

- Consider the following encrypted message:

    ```
    giuifg cei iprc tpnn du cei qprcni
    ```


- The message above was encrypted by the following *simple substitution cipher*:


|plain    |a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|
|---------|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|encrypted|p|h|q|g|i|u|m|e|a|y|l|n|o|f|d|x|j|k|r|c|v|s|t|z|w|b|


- So the first character of the encrypted message, `g`, translates to `d` in plain text


- Here is a function that decrypts any message encrypted with this cipher, using `.replace()` repeatedly in a long method chain:

In [11]:
def decrypt(message):
    return message.replace('p', 'A').replace('h', 'B').replace('q', 'C').replace('g', 'D').replace('i', 'E').replace('u', 'F').replace('m', 'G').replace('e', 'H').replace('a', 'I').replace('y', 'J').replace('l', 'K').replace('n', 'L').replace('o', 'M').replace('f', 'N').replace('d', 'O').replace('x', 'P').replace('j', 'Q').replace('k', 'R').replace('r', 'S').replace('c', 'T').replace('v', 'U').replace('s', 'V').replace('t', 'W').replace('z', 'X').replace('w', 'Y').replace('b', 'Z')

- First, let's see if it works...

In [12]:
message = 'giuifg cei iprc tpnn du cei qprcni'
decrypt(message)

'DEFEND THE EAST WALL OF THE CASTLE'

- Now, unless you have a very, very wide screen, you probably could not read the entire definition of the function `decrypt()` above without scrolling to the right


- Long lines make code harder to read, especially on smaller screens or when you have multiple files open side-by-side


- In fact, it is common to limit lines in Python code to 79 characters
    - Similar limits, typically around 80 characters, are common in many programming languages
    - Why 79? [Probably historical reasons](https://en.wikipedia.org/wiki/Characters_per_line)

- So, how do we do this? Python provides several ways to continue a statement across several lines


- The recommended way is **implied line continuation**: Python will assume line continuation if code is contained within parentheses `()`, brackets `[]`, or braces `{}`


- One way to rewrite the function `decrypt` above would be:

In [13]:
def decrypt(message):
    return message.replace(
        'p', 'A'
    ).replace(
        'h', 'B'
    ).replace(
        'q', 'C'
    ).replace(
        'g', 'D'
    ).replace(
        'i', 'E'
    ).replace(
        'u', 'F'
    ).replace(
        'm', 'G'
    ).replace(
        'e', 'H'
    ).replace(
        'a', 'I'
    ).replace(
        'y', 'J'
    ).replace(
        'l', 'K'
    ).replace(
        'n', 'L'
    ).replace(
        'o', 'M'
    ).replace(
        'f', 'N'
    ).replace(
        'd', 'O'
    ).replace(
        'x', 'P'
    ).replace(
        'j', 'Q'
    ).replace(
        'k', 'R'
    ).replace(
        'r', 'S'
    ).replace(
        'c', 'T'
    ).replace(
        'v', 'U'
    ).replace(
        's', 'V'
    ).replace(
        't', 'W'
    ).replace(
        'z', 'X'
    ).replace(
        'w', 'Y'
    ).replace(
        'b', 'Z'
    )

decrypt(message)

'DEFEND THE EAST WALL OF THE CASTLE'

- Putting each substitution on a separate line makes it easy to see what the cipher looks like at a glance, almost like a table


- Another similar, perhaps prettier, way would be to first wrap the entire method chain with parentheses, like this:

In [14]:
def decrypt(message):
    return (
      message.replace('p', 'A')
             .replace('h', 'B')
             .replace('q', 'C')
             .replace('g', 'D')
             .replace('i', 'E')
             .replace('u', 'F')
             .replace('m', 'G')
             .replace('e', 'H')
             .replace('a', 'I')
             .replace('y', 'J')
             .replace('l', 'K')
             .replace('n', 'L')
             .replace('o', 'M')
             .replace('f', 'N')
             .replace('d', 'O')
             .replace('x', 'P')
             .replace('j', 'Q')
             .replace('k', 'R')
             .replace('r', 'S')
             .replace('c', 'T')
             .replace('v', 'U')
             .replace('s', 'V')
             .replace('t', 'W')
             .replace('z', 'X')
             .replace('w', 'Y')
             .replace('b', 'Z')
    )

decrypt(message)

'DEFEND THE EAST WALL OF THE CASTLE'

- Note: wrapping the entire chain in parentheses in the example above is important. Without them, Python would not treat the lines that follow as part of the same statement, and the code would fail with a syntax error

- Finally, note that in the examples above, we *indented* the line continuations


- This improves the readability of your code
    - It helps the reader distinguish between two lines of code and a single line of code that spans two lines
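
- Implied line continuation is not limited to method chains. Here is a short sketch of the three bracket types, plus the backslash alternative for comparison. All variable names are hypothetical; the values are borrowed from earlier in the lesson.

```python
# Implied continuation inside brackets: a list spread over several lines
clusters = [
    'Americas',
    'South Asia',
    'Sub-Saharan Africa',
]

# Implied continuation inside braces: a dictionary, one entry per line
cipher_sample = {
    'p': 'A',
    'h': 'B',
    'q': 'C',
}

# Implied continuation inside parentheses: a long arithmetic expression
total = (
    1955
    + 1960
    + 1965
)

# A backslash also continues a line, but it breaks easily
# (for example, if a space follows it), so parentheses are usually preferred
total_backslash = 1955 + 1960 \
    + 1965

print(len(clusters), len(cipher_sample), total, total_backslash)
```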

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## What's next?

- Introduction to data visualization with [Altair](https://altair-viz.github.io/)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 1

In the same folder as this notebook, there is a file `data/cars.csv`, which contains data about cars in the 1970s and 1980s.

1. Read this CSV file into a DataFrame.
1. Display the first 5 rows of this dataset.
1. How many observations and variables does this dataset have?
1. What variables are contained in this dataset?
1. What are the minimum and maximum weights among the cars in this dataset?

In [15]:
# Solution
cars_df = pd.read_csv('data/cars.csv')

In [16]:
# Solution
cars_df.head()

| |Name|Miles_per_Gallon|Cylinders|Displacement|Horsepower|Weight_in_lbs|Acceleration|Year|Origin|
|-|----|----------------|---------|------------|----------|-------------|------------|----|------|
|0|chevrolet chevelle malibu|18.0|8|307.0|130.0|3504|12.0|1970|USA|
|1|buick skylark 320|15.0|8|350.0|165.0|3693|11.5|1970|USA|
|2|plymouth satellite|18.0|8|318.0|150.0|3436|11.0|1970|USA|
|3|amc rebel sst|16.0|8|304.0|150.0|3433|12.0|1970|USA|
|4|ford torino|17.0|8|302.0|140.0|3449|10.5|1970|USA|


In [17]:
# Solution
cars_df.shape

(406, 9)

In [18]:
# Solution
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    int64  
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.7+ KB


In [19]:
# Solution
cars_df.describe()

| |Miles_per_Gallon|Cylinders|Displacement|Horsepower|Weight_in_lbs|Acceleration|Year|
|-|----------------|---------|------------|----------|-------------|------------|----|
|count|398.0|406.0|406.0|400.0|406.0|406.0|406.0|
|mean|23.514573|5.475369|194.779557|105.0825|2979.413793|15.519704|1975.995074|
|std|7.815984|1.71216|104.922458|38.768779|847.004328|2.803359|3.856689|
|min|9.0|3.0|68.0|46.0|1613.0|8.0|1970.0|
|25%|17.5|4.0|105.0|75.75|2226.5|13.7|1973.0|
|50%|23.0|4.0|151.0|95.0|2822.5|15.5|1976.0|
|75%|29.0|8.0|302.0|130.0|3618.25|17.175|1979.0|
|max|46.6|8.0|455.0|230.0|5140.0|24.8|1982.0|


_Solution._ This dataset has 406 observations. It has 9 variables: `Name`, `Miles_per_Gallon`, `Cylinders`, `Displacement`, `Horsepower`, `Weight_in_lbs`, `Acceleration`, `Year`, and `Origin`. The minimum weight among the cars in this dataset is 1613 lbs, and the maximum is 5140 lbs.

### Problem 2

Use the string methods `.replace()`, `.strip()`, and `.upper()` in a chain to change the string in the variable `banner` defined in the code cell below to

```
'GO NAVY BEAT ARMY!'
```

In [20]:
banner = 'go navy beat tulane!  '

In [21]:
# Solution
banner = 'go navy beat tulane!  '

banner.strip().replace('tulane', 'army').upper()

'GO NAVY BEAT ARMY!'

### Problem 3

In the simple substitution cipher example above, we might have written the function `decrypt` to make the substitutions with lowercase letters, like this:

```python
def decrypt(message):
    return (
      message.replace('p', 'a')
             .replace('h', 'b')
             .replace('q', 'c')
             .replace('g', 'd')
             # and so on
    )
```

Would this have worked? Why or why not?

_Solution._ This would not have worked, because the replacements are not done simultaneously. For example, the first call of `replace` replaces all `p`s with `a`s. Then, later on in the chain, these new `a`s would be replaced with `i`s, which is not what we want.
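
The difference between sequential and simultaneous replacement can be checked directly. The sketch below is not the approach used in this lesson: it swaps in Python's built-in `str.maketrans()` and `str.translate()`, which apply all substitutions in one pass. The names `PLAIN`, `ENCRYPTED`, `decrypt_sequential`, and `decrypt_simultaneous` are made up for illustration, and the loop in `decrypt_sequential` stands in for the full lowercase chain.

```python
PLAIN = 'abcdefghijklmnopqrstuvwxyz'
ENCRYPTED = 'phqgiumeaylnofdxjkrcvstzwb'


def decrypt_sequential(message):
    # Same idea as the lowercase chain: one replace after another,
    # in the same order as the chain ('p' -> 'a', 'h' -> 'b', ...)
    for enc, plain in zip(ENCRYPTED, PLAIN):
        message = message.replace(enc, plain)
    return message


def decrypt_simultaneous(message):
    # str.maketrans builds a character-to-character table;
    # str.translate applies every substitution in a single pass
    table = str.maketrans(ENCRYPTED, PLAIN)
    return message.translate(table)


message = 'giuifg cei iprc tpnn du cei qprcni'

# Garbled, because letters produced by early steps get replaced again later
print(decrypt_sequential(message))

# Each character is substituted exactly once
print(decrypt_simultaneous(message))
```

Writing the decrypted text in uppercase, as the lesson's `decrypt` does, avoids the problem in a different way: the uppercase output letters never match any of the lowercase letters still waiting to be replaced.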

### Problem 4

The code cell below contains a chain of methods called on a Pandas DataFrame. Don't worry too much about how the code works for now &mdash; you'll learn about this later in the semester.

Reformat the code using the line continuation techniques above so that it fits on your screen and is easy to read.

In [22]:
df.query('year == 1970').groupby('cluster').agg(mean_pop=('pop', 'mean'), min_pop=('pop', 'min'), max_pop=('pop', 'max'), mean_life_expect=('life_expect', 'mean'), max_fertility=('fertility', 'max'))

|cluster|mean_pop|min_pop|max_pop|mean_life_expect|max_fertility|
|-------|--------|-------|-------|----------------|-------------|
|Americas|23317880.0|59039|205052000|63.323286|6.5|
|East Asia & Pacific|126989800.0|2828050|818315000|65.135764|6.0|
|Europe & Central Asia|21831740.0|204104|77783164|70.969105|5.3|
|Middle East & North Africa|13886040.0|2383029|33574026|59.043|7.298|
|South Asia|171634800.0|12430623|541000000|45.98|7.7|
|Sub-Saharan Africa|22195950.0|3769171|51027516|48.669|8.29|


In [23]:
# Solution - one possibility
(
    df
    .query('year == 1970')
    .groupby('cluster')
    .agg(
        mean_pop=('pop', 'mean'),
        min_pop=('pop', 'min'),
        max_pop=('pop', 'max'),
        mean_life_expect=('life_expect', 'mean'),
        max_fertility=('fertility', 'max')
    )
)

|cluster|mean_pop|min_pop|max_pop|mean_life_expect|max_fertility|
|-------|--------|-------|-------|----------------|-------------|
|Americas|23317880.0|59039|205052000|63.323286|6.5|
|East Asia & Pacific|126989800.0|2828050|818315000|65.135764|6.0|
|Europe & Central Asia|21831740.0|204104|77783164|70.969105|5.3|
|Middle East & North Africa|13886040.0|2383029|33574026|59.043|7.298|
|South Asia|171634800.0|12430623|541000000|45.98|7.7|
|Sub-Saharan Africa|22195950.0|3769171|51027516|48.669|8.29|


<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- The simple substitution cipher example was taken from [Practical Cryptography](http://practicalcryptography.com/ciphers/simple-substitution-cipher/)


- [PEP 8](https://www.python.org/dev/peps/pep-0008/) is a document that provides guidelines and best practices on how to write Python code
    - For example, it has suggestions on how to do line continuation
    - Real Python has a nice [tutorial on PEP 8](https://realpython.com/python-pep8)