**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 2. Warm Up

## In this lesson...

- A very brief introduction to Pandas


- Method chaining


- Formatting Python code: line continuation

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## A very brief introduction to Pandas 🐼

- In the same folder as this notebook, there is a file `data/gapminder.csv` containing health and population data for a number of countries between 1955 and 2005
    - `data/gapminder.csv` = `gapminder.csv` in the folder `data`
    - This data was collected by [Gapminder](https://www.gapminder.org/)

- CSV stands for **comma-separated values**


- CSV files are a common way to store tabular data


- Let's see what the Gapminder data looks like, using JupyterLab's file viewer...

- How can we use this data in Python?


- [__Pandas__](https://pandas.pydata.org/) is a Python package for data analysis and manipulation 
    - [Here is the Pandas documentation](https://pandas.pydata.org/docs/)

- We will spend a significant portion of this course getting fluent in Pandas


- For now, we just need a few basics so that we can start visualizing data in Python

- First, let's import Pandas as `pd`:

In [None]:
import pandas as pd

## The DataFrame object

- A Pandas __DataFrame__ is a two-dimensional table, with rows and columns


- Sometimes we refer to the columns as _variables_ and the rows as *observations*, since tabular data is commonly set up this way


- We can use the Pandas `read_csv()` function to read the Gapminder data `data/gapminder.csv` into a DataFrame called `df`, like this:

In [None]:
df = pd.read_csv('data/gapminder.csv')

- By default, `read_csv()` assumes the first row of the CSV file contains the column names

- It's a good habit to take a quick look at the DataFrame that `read_csv()` creates, just in case something went wrong


- To view the first 5 rows of a DataFrame, we can use the `.head()` method:
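As a minimal sketch of what `.head()` does, the example below builds a small stand-in table from inline CSV text instead of reading `data/gapminder.csv`. The names `csv_text` and `toy_df` and all of the values are made up for illustration; only the column names follow the ones described in this lesson.

```python
import io

import pandas as pd

# Made-up CSV text standing in for data/gapminder.csv (values are NOT real data)
csv_text = """country,year,cluster,pop,life_expect,fertility
A,1955,0,1000,50.0,6.0
A,1960,0,1100,52.0,5.8
A,1965,0,1200,54.0,5.5
B,1955,1,2000,60.0,4.0
B,1960,1,2100,61.0,3.8
B,1965,1,2200,62.0,3.5
"""

# read_csv() accepts a file-like object as well as a file path
toy_df = pd.read_csv(io.StringIO(csv_text))

# .head() returns the first 5 rows by default; .head(2) would return the first 2
first_rows = toy_df.head()
print(first_rows)
```

Running `df.head()` on the Gapminder DataFrame `df` from the earlier cell shows its first 5 rows in the same way.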

- Each row of this DataFrame contains the following data for one `country` in one `year`:
    - region of the world (`cluster`)
    - total population (`pop`)
    - average life expectancy in years (`life_expect`)
    - number of children per woman (`fertility`)

- By default, Pandas assigns a **label** to each row/observation: these labels are called the **index**
    - Above, you can see the labels all the way on the left
    - Note: the index does *not* count as a column/variable of the DataFrame
    - We'll come back to this later in the semester

- The `.shape` attribute of a DataFrame contains the number of rows and columns in the DataFrame:

- The `.info()` method of a DataFrame prints some useful information about a DataFrame, including the type of values in each column:

- The `.describe()` method of a DataFrame outputs summary statistics for the columns with numeric data:
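The three tools above can be sketched together on a small stand-in table. As before, `csv_text`, `toy_df`, and the values are made up for illustration, and the inline CSV text is an assumption standing in for a real file.

```python
import io

import pandas as pd

# Made-up CSV text (values are NOT real data)
csv_text = """country,year,pop,life_expect
A,1955,1000,50.0
A,1960,1100,52.0
B,1955,2000,60.0
B,1960,2100,62.0
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

# .shape is an attribute (no parentheses): a tuple (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# .info() prints the column names, non-null counts, and the type of each column
toy_df.info()

# .describe() returns a DataFrame of summary statistics for the numeric columns only
summary = toy_df.describe()
print(summary)
```

Note that `.shape` is written without parentheses because it is an attribute, while `.info()` and `.describe()` are methods and need them.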

- That's all we need for now


- We'll learn _much_ more about Pandas later in the semester

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Method chaining

- Consider the following string in the variable `sentence`:

In [None]:
sentence = '  the QUICK foX jumPED over the laZY CAt.    '

- We can format this string programmatically using Python's string methods to:
    - Remove whitespace at the beginning and end of the string with `.strip()`
    - Capitalize the sentence correctly with `.capitalize()`
    - Replace any instance of 'cat' with 'dog' with `.replace()`

- We want to end up with:

    ```python
    'The quick fox jumped over the lazy dog.'
    ```

- Such code might look like this:

In [None]:
# Remove whitespace at the beginning and end of the string
sentence2 = sentence.strip()

# Capitalize the sentence correctly
sentence3 = sentence2.capitalize()

# Replace any instance of 'cat' with 'dog'
sentence4 = sentence3.replace('cat', 'dog')

# Print the final sentence
print(sentence4)

In [None]:
print(sentence2)
print(sentence3)

- This works, but is a bit cumbersome: we have to come up with variable names for the intermediate steps, which we don't really care about


- There is a cleaner way!


- Note that each of the string methods (`.strip()`, `.capitalize()`, `.replace()`) we used above *returns another string*


- So instead of storing these results in intermediate variables, we can directly *chain* the method calls one after another, like this:
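One way to write the chain described above is sketched here. The variable name `new_sentence` is an arbitrary choice, and `sentence` is redefined so the example runs on its own.

```python
sentence = '  the QUICK foX jumPED over the laZY CAt.    '

# Each method returns a new string, so the next method can be called on it directly
new_sentence = sentence.strip().capitalize().replace('cat', 'dog')
print(new_sentence)
```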

- In the code above, we can easily read the sequence of actions we took from left to right


- **Method chaining** often improves the readability of code and reduces the amount of code needed


- We will use method chaining a lot in this course: the libraries for data wrangling and visualization that we will use are well-suited for method chaining

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Formatting Python code &mdash; line continuation

- Consider the following encrypted message:

    ```
    giuifg cei iprc tpnn du cei qprcni
    ```


- The message above was encrypted by the following *simple substitution cipher*:


|plain    |a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|
|---------|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|encrypted|p|h|q|g|i|u|m|e|a|y|l|n|o|f|d|x|j|k|r|c|v|s|t|z|w|b|


- So the first character of the encrypted message, `g`, translates to `d` in plain text
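To see how the table is read, here is a small sketch that stores the two rows of the table as strings and builds a lookup from encrypted letters to plain letters. The names `plain`, `encrypted`, and `lookup` are illustrative, and this dictionary-based lookup is a different technique from the `.replace()` chain used in this lesson.

```python
# The two rows of the cipher table, as strings of 26 letters each
plain = 'abcdefghijklmnopqrstuvwxyz'
encrypted = 'phqgiumeaylnofdxjkrcvstzwb'

# Pair each encrypted letter with the plain letter in the same column
lookup = dict(zip(encrypted, plain))

# Look up the first character of the encrypted message
print(lookup['g'])

# Decrypt the first word of the message, one character at a time
first_word = ''.join(lookup[ch] for ch in 'giuifg')
print(first_word)
```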


- Here is a function that decrypts any message encrypted with this cipher, using `.replace()` repeatedly in a long method chain:

In [None]:
def decrypt(message):
    return message.replace('p', 'A').replace('h', 'B').replace('q', 'C').replace('g', 'D').replace('i', 'E').replace('u', 'F').replace('m', 'G').replace('e', 'H').replace('a', 'I').replace('y', 'J').replace('l', 'K').replace('n', 'L').replace('o', 'M').replace('f', 'N').replace('d', 'O').replace('x', 'P').replace('j', 'Q').replace('k', 'R').replace('r', 'S').replace('c', 'T').replace('v', 'U').replace('s', 'V').replace('t', 'W').replace('z', 'X').replace('w', 'Y').replace('b', 'Z')

- First, let's see if it works...

In [None]:
message = 'giuifg cei iprc tpnn du cei qprcni'
decrypt(message)

- Now, unless you have a very, very wide screen, you probably could not read the entire definition of the function `decrypt()` above without scrolling to the right


- Long lines make code harder to read, especially on smaller screens or when you have multiple files open side-by-side


- In fact, it is common to limit lines in Python code to 79 characters
    - Similar line-length limits are common in many other programming languages
    - Why 79? [Probably historical reasons](https://en.wikipedia.org/wiki/Characters_per_line)

- So, how do we do this? Python provides several ways to split a statement across several lines


- The recommended way is **implied line continuation**: Python will assume line continuation if code is contained within parentheses `()`, brackets `[]`, or braces `{}`
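As a small illustration of implied line continuation, each statement below spans several lines because it sits inside parentheses, brackets, or braces. The variable names and values are arbitrary examples.

```python
# A function call split across lines inside parentheses ()
total = sum(
    [1, 2, 3]
)

# A list split across lines inside brackets []
colors = [
    'red',
    'green',
    'blue',
]

# A dictionary split across lines inside braces {}
point = {
    'x': 1,
    'y': 2,
}

print(total, colors, point)
```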


- One way to rewrite the function `decrypt` above would be:

In [None]:
def decrypt(message):
    return message.replace(
        'p', 'A'
    ).replace(
        'h', 'B'
    ).replace(
        'q', 'C'
    ).replace(
        'g', 'D'
    ).replace(
        'i', 'E'
    ).replace(
        'u', 'F'
    ).replace(
        'm', 'G'
    ).replace(
        'e', 'H'
    ).replace(
        'a', 'I'
    ).replace(
        'y', 'J'
    ).replace(
        'l', 'K'
    ).replace(
        'n', 'L'
    ).replace(
        'o', 'M'
    ).replace(
        'f', 'N'
    ).replace(
        'd', 'O'
    ).replace(
        'x', 'P'
    ).replace(
        'j', 'Q'
    ).replace(
        'k', 'R'
    ).replace(
        'r', 'S'
    ).replace(
        'c', 'T'
    ).replace(
        'v', 'U'
    ).replace(
        's', 'V'
    ).replace(
        't', 'W'
    ).replace(
        'z', 'X'
    ).replace(
        'w', 'Y'
    ).replace(
        'b', 'Z'
    )

decrypt(message)

- Putting each substitution on a separate line makes it easy to see what the cipher looks like at a glance, almost like a table


- Another similar, perhaps prettier, way would be to first wrap the entire method chain with parentheses, like this:

In [None]:
def decrypt(message):
    return (
      message.replace('p', 'A')
             .replace('h', 'B')
             .replace('q', 'C')
             .replace('g', 'D')
             .replace('i', 'E')
             .replace('u', 'F')
             .replace('m', 'G')
             .replace('e', 'H')
             .replace('a', 'I')
             .replace('y', 'J')
             .replace('l', 'K')
             .replace('n', 'L')
             .replace('o', 'M')
             .replace('f', 'N')
             .replace('d', 'O')
             .replace('x', 'P')
             .replace('j', 'Q')
             .replace('k', 'R')
             .replace('r', 'S')
             .replace('c', 'T')
             .replace('v', 'U')
             .replace('s', 'V')
             .replace('t', 'W')
             .replace('z', 'X')
             .replace('w', 'Y')
             .replace('b', 'Z')
    )

decrypt(message)

- Note: wrapping the entire chain in parentheses in the example above is important: without the parentheses, Python would not treat each new line beginning with `.replace()` as a continuation of the previous line
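The effect of the parentheses can be checked with a short sketch that asks Python to parse the same two-step chain with and without them. The helper `is_valid_python` and the sample source strings are hypothetical names made up for this illustration; `compile()` is a Python built-in.

```python
# Source code for the same chain, without and with wrapping parentheses
without_parens = (
    "result = 'abc'.replace('a', 'A')\n"
    "              .replace('b', 'B')\n"
)
with_parens = (
    "result = (\n"
    "    'abc'.replace('a', 'A')\n"
    "         .replace('b', 'B')\n"
    ")\n"
)


def is_valid_python(source):
    """Return True if Python can parse the source code, False otherwise."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False


print(is_valid_python(without_parens))  # False
print(is_valid_python(with_parens))     # True
```

Without the parentheses, the first line is already a complete statement, so Python reports an error when it reaches the indented second line.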

- Finally, note that in the examples above, we *indented* the line continuations


- This improves the readability of your code
    - It helps the reader distinguish between two lines of code and a single line of code that spans two lines

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## What's next?

- Introduction to data visualization with [Altair](https://altair-viz.github.io/)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 1

In the same folder as this notebook, there is a file `data/cars.csv`, which contains data about cars in the 1970s and 1980s.

1. Read this CSV file into a DataFrame.
1. Display the first 5 rows of this dataset.
1. How many observations and variables does this dataset have?
1. What variables are contained in this dataset?
1. What are the minimum and maximum weights among the cars in this dataset?

_Write your answer here. Double-click to edit._

### Problem 2

Use the string methods `.replace()`, `.strip()`, and `.upper()` in a chain to change the string in the variable `banner` defined in the code cell below to

```
'GO NAVY BEAT ARMY!'
```

In [None]:
banner = 'go navy beat tulane!  '

### Problem 3

In the simple substitution cipher example above, we might have written the function `decrypt` to make the substitutions with lowercase letters, like this:

```python
def decrypt(message):
    return (
      message.replace('p', 'a')
             .replace('h', 'b')
             .replace('q', 'c')
             .replace('g', 'd')
             # and so on
    )
```

Would this have worked? Why or why not?

_Write your answer here. Double-click to edit._

### Problem 4

The code cell below contains a chain of methods called on a Pandas DataFrame. Don't worry too much about how the code works for now &mdash; you'll learn about this later in the semester.

Reformat the code using the line continuation techniques above so that it fits on your screen and is easy to read.

In [None]:
df.query('year == 1970').groupby('cluster').agg(mean_pop=('pop', 'mean'), min_pop=('pop', 'min'), max_pop=('pop', 'max'), mean_life_expect=('life_expect', 'mean'), max_fertility=('fertility', 'max'))

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- The simple substitution cipher example was taken from [Practical Cryptography](http://practicalcryptography.com/ciphers/simple-substitution-cipher/)


- [PEP 8](https://www.python.org/dev/peps/pep-0008/) is a document that provides guidelines and best practices on how to write Python code
    - For example, it has suggestions on how to do line continuation
    - Real Python has a nice [tutorial on PEP 8](https://realpython.com/python-pep8)