# Lecture 4 – Fall 2023

A demonstration of advanced `pandas` syntax to accompany Lecture 4.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell below downloads baby name data from the Social Security Administration (SSA) website and loads it into a `DataFrame`. The code shown here is outside the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "/content/drive/MyDrive/data/babynamesbystate.zip"
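# Download the archive only if it is not already cached in Drive
if not os.path.exists(local_filename):
    os.makedirs(os.path.dirname(local_filename), exist_ok=True)  # Make sure the target folder exists
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())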

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

### Exercises
We want to obtain the first three baby names with `Count > 250`.

1. Code this using `.loc` and `.head()`.

2. Code this using `.loc` and `.iloc`.

3. Code this using `[]` and `.head()`.
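
As a rough sketch of the three selection styles, the cell below runs them on a small made-up `DataFrame`. The name `toy` and its values are hypothetical stand-ins for `babynames`, not data from the real dataset.

```python
import pandas as pd

# Hypothetical toy stand-in for `babynames` (values are made up)
toy = pd.DataFrame({
    "State": ["CA"] * 6,
    "Sex": ["F", "F", "F", "M", "M", "M"],
    "Year": [1990] * 6,
    "Name": ["Jessica", "Ava", "Ashley", "Michael", "Bo", "Daniel"],
    "Count": [6635, 12, 4537, 8247, 7, 7226],
})

# 1. .loc with a boolean condition, then .head
via_loc_head = toy.loc[toy["Count"] > 250, "Name"].head(3)

# 2. .loc with a boolean condition, then positional slicing with .iloc
via_loc_iloc = toy.loc[toy["Count"] > 250, "Name"].iloc[0:3]

# 3. [] with a boolean condition, then .head
via_brackets = toy[toy["Count"] > 250]["Name"].head(3)

print(via_loc_head)
```

All three expressions select the same rows; they differ only in how the filter and the "first three" step are written.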


In [None]:
# Answer Here

In [None]:
# Answer Here


In [None]:
# Answer Here

### `.isin` for Selection based on a list, array, or `Series`

In [None]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability

(babynames[(babynames["Name"] == "Bella") |
           (babynames["Name"] == "Alex") |
           (babynames["Name"] == "Narges") |
           (babynames["Name"] == "Lisa")])


In [None]:
# A more concise method to achieve the above: .isin
#Answer Here
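
One possible way to express the same filter with `.isin`, shown on a made-up `DataFrame`. The frame `toy` and the list `names` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Name": ["Bella", "Alex", "Maria", "Lisa", "Narges", "John"],
    "Count": [120, 85, 300, 40, 5, 210],
})

# .isin returns a boolean Series: True where the name is in the list
names = ["Bella", "Alex", "Narges", "Lisa"]
selected = toy[toy["Name"].isin(names)]

print(selected)
```

A single `.isin` call stands in for the chain of `==` comparisons joined by `|`.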

### `.str` Functions for Defining a Condition

In [None]:
# What if we only want names that start with "J"?
#Answer Here
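
A minimal sketch of a `.str` condition, assuming a made-up `DataFrame` named `toy`. Note that `.str.startswith` is case-sensitive, so a lowercase "jade" would not match "J".

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Name": ["Jennifer", "Ava", "John", "jade", "Maria"],
    "Count": [500, 20, 450, 3, 60],
})

# Boolean Series: True where the name begins with an uppercase "J"
starts_with_j = toy["Name"].str.startswith("J")
j_names = toy[starts_with_j]

print(j_names)
```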

## Adding, Removing, and Modifying Columns

### Add a Column
To add a column, use `[]` to reference the desired new column, then assign it to a `Series` or array of appropriate length.
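
As an illustration on made-up data (the frame `toy` is hypothetical), the sketch below builds a `Series` of name lengths with `.str.len()` and assigns it to a new column.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({"Name": ["Ava", "Michael", "Jennifer"]})

# Series holding the number of characters in each name
lengths = toy["Name"].str.len()

# Assigning to a label that does not exist yet creates the column
toy["name_lengths"] = lengths

print(toy)
```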

In [None]:
# Create a Series of the length of each name

# Add a column named "name_lengths" that includes the length of each name


### Modify a Column
To modify a column, use `[]` to access the desired column, then re-assign it to a new array or Series.

In [None]:
# Modify the "name_lengths" column to be one less than its original value


### Rename a Column Name
Rename a column using the `.rename()` method.
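
A short sketch of `.rename()` on a made-up frame (`toy` is hypothetical). By default `.rename()` returns a new `DataFrame` and leaves the original untouched, so the result must be assigned to a variable to keep it.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({"Name": ["Ava", "Michael"], "name_lengths": [3, 7]})

# Map old column labels to new ones with the `columns` argument
renamed = toy.rename(columns={"name_lengths": "Length"})

print(renamed.columns.tolist())
```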

In [None]:
# Rename "name_lengths" to "Length"


### Delete a Column
Remove a column using `.drop()`.
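
A sketch of `.drop()` on a made-up frame (`toy` is hypothetical). Like `.rename()`, it returns a new `DataFrame` by default rather than changing the original.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({"Name": ["Ava", "Michael"], "Length": [3, 7]})

# Name the column(s) to remove with the `columns` argument
dropped = toy.drop(columns="Length")

# Equivalent spelling: pass the label and state that it refers to a column
dropped_alt = toy.drop("Length", axis="columns")

print(dropped.columns.tolist())
```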

In [None]:
# Remove our new "Length" column

## Custom sorting

In [None]:
# Sort a Series Containing Names



In [None]:
# Sort a DataFrame – there are lots of Michaels in California


### Approach 1: Create a temporary column

In [None]:
# Create a Series of the length of each name

# Add a column named "name_lengths" that includes the length of each name

# Sort by the temporary column


In [None]:
# Drop the "name_lengths" column


### Approach 2: Sorting using the `key` argument

In [None]:
# Answer Here
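
One way the `key` argument of `.sort_values` might be used, sketched on a made-up frame (`toy` is hypothetical). The `key` function receives the whole column as a `Series` and must return a `Series` of the same length, whose values are used for ordering.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Name": ["Ava", "Christopher", "Li", "Jennifer"],
    "Count": [20, 300, 5, 450],
})

# Sort by name length, longest first, without adding a temporary column
by_length = toy.sort_values("Name", key=lambda s: s.str.len(), ascending=False)

print(by_length)
```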

### Approach 3: Sorting Using the `map` Function

We can also use `map` if we want to sort by an arbitrary, user-defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea" in each name.
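
A possible sketch of this approach on made-up data (`toy` is hypothetical). Lowercasing the name before counting is an assumption made here so that a leading "Dr" is counted too; drop the `.lower()` call for a case-sensitive count.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({"Name": ["Andrea", "Ava", "Deandrea", "Drew"]})

def dr_ea_count(name):
    """Count occurrences of "dr" plus occurrences of "ea" in one name."""
    lowered = name.lower()  # Assumption: treat "Dr" and "dr" the same
    return lowered.count("dr") + lowered.count("ea")

# Series.map applies the function to every element of the column
toy["dr_ea_count"] = toy["Name"].map(dr_ea_count)

# Sort by the new column, largest count first
sorted_toy = toy.sort_values("dr_ea_count", ascending=False)

print(sorted_toy)
```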

In [None]:
# First, define a function to count the number of times "dr" or "ea" appear in each name


# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork



In [None]:
# Drop the `dr_ea_count` column


## Grouping

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

<img src="images/groupby.png" width="800"/>
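
As a small worked example of grouping and aggregating, the sketch below totals a made-up `Count` column per year. The frame `toy` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Name": ["Olivia", "Emma", "Olivia", "Emma", "Mia"],
    "Count": [10, 20, 5, 15, 30],
})

# Split rows by Year, then sum the Count values inside each group
totals = toy.groupby("Year")["Count"].sum()

print(totals)
```

The result is a `Series` indexed by the group labels (here, the years).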

In [None]:
# DataFrame with baby girl names only
# Answer Here
# Group by a shared feature, such as year, and apply an aggregation
# Answer Here
# Sort by Count
# Answer Here


In [None]:
# print first 10 entries


In [None]:
# The total baby count in each year
# Answer Here


There are many different aggregation functions we can use, all of which are useful in different applications.
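
To illustrate two of these aggregations, the sketch below applies `.min()` and `.max()` to groups in a made-up frame (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Name": ["Ava", "Ava", "Liam", "Liam", "Ava"],
    "Year": [1995, 2001, 1980, 1999, 2010],
    "Count": [5, 50, 70, 20, 30],
})

# Earliest year in which each name appears
first_year = toy.groupby("Name")["Year"].min()

# Largest single-year count for each name
peak_count = toy.groupby("Name")["Count"].max()

print(first_year)
print(peak_count)
```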

In [None]:
# What is the earliest year in which each name appeared?
# Answer Here

In [None]:
# What is the largest single-year count of each name?
# Answer Here

In [None]:
# Can you find the most popular baby name in California (CA) for each year? Use the idxmax function.
# Provide a list of years along with the corresponding most popular names.
result = babynames.groupby("Year")['Count'].idxmax()
#Answer Here
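
A sketch of how the `idxmax` result can be turned into names, using a made-up frame (`toy` is hypothetical). `idxmax` returns, for each group, the index label of the row holding the largest value, and those labels can then be passed to `.loc`.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "Name": ["Olivia", "Emma", "Olivia", "Mia"],
    "Count": [100, 90, 80, 95],
})

# For each year, the index label of the row with the largest Count
top_idx = toy.groupby("Year")["Count"].idxmax()

# Look up those rows to recover the year and the matching name
top_names = toy.loc[top_idx, ["Year", "Name"]]

print(top_names)
```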

## Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage. We'll start by filtering `babynames` to only include names corresponding to sex "F".

In [None]:
#Answer Here

In [None]:
# We sort the data by year

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [None]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(f_babynames[f_babynames["Name"] == "Jennifer"],
              x="Year", y="Count")
fig.update_layout(font_size=18,
                  autosize=False,
                  width=1000,
                  height=400)

We'll need a mathematical definition for the change in popularity of a name.

Define the metric "ratio to peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in *any* year.

A demo calculation for Jennifer:

In [None]:
# Find the highest Jennifer 'count'


In [None]:
# Remember that we sorted f_babynames by year.
# This means that grabbing the final entry gives us the most recent count of Jennifers: 114
# In 2022, the most recent year for which we have data, 114 Jennifers were born


In [None]:
# Compute the RTP


We can also write a function that computes the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to compute the RTP for every name in the dataset at once.
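
A minimal sketch of such a function, assuming the counts are already sorted by year so that the final entry is the most recent one. The sample `Series` is made up.

```python
import pandas as pd

def ratio_to_peak(series):
    """Most recent count divided by the largest count in the Series.

    Assumes the Series is ordered by year, oldest first.
    """
    return series.iloc[-1] / series.max()

# Hypothetical counts for one name over three years (values are made up)
counts = pd.Series([10, 40, 20])
rtp = ratio_to_peak(counts)

print(rtp)
```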

In [None]:
# define the function for RTP
"""
Compute the RTP for a Series containing the counts per year for a single name
"""


In [None]:
# Construct a Series containing our Jennifer count data

# Then, find the RTP using the function defined above


Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

You may see a warning or an error when running the cell below, depending on your `pandas` version. As discussed in lecture, `pandas` can't apply an aggregation function to non-numeric data (it doesn't make sense to divide "CA" by a number). Older versions of `pandas` drop any columns that cannot be aggregated and emit a warning; newer versions raise a `TypeError` instead.

In [None]:
# Results in a TypeError
#rtp_table = f_babynames.groupby("Name").agg(ratio_to_peak)
#rtp_table

In [None]:
# Find the RTP for all names at once using groupby, as described in the lecture slides


To avoid the warning or error above, we explicitly select only the columns relevant to our analysis before using `.agg`.
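
One way this might look on made-up data (the frame `toy` is hypothetical, and this `ratio_to_peak` is an illustrative definition that assumes rows are sorted by year). Selecting `[["Count"]]` before `.agg` means the function only ever sees numeric data.

```python
import pandas as pd

def ratio_to_peak(series):
    # Assumes rows are ordered by year, so the last entry is the most recent
    return series.iloc[-1] / series.max()

# Hypothetical toy data, already sorted by Year (values are made up)
toy = pd.DataFrame({
    "State": ["CA"] * 4,
    "Name": ["Debra", "Luna", "Debra", "Luna"],
    "Year": [2000, 2000, 2022, 2022],
    "Count": [100, 10, 25, 40],
})

# Keep only the Count column, then aggregate each name's counts
rtp_table = toy.groupby("Name")[["Count"]].agg(ratio_to_peak)

# Rename for clarity and put the steepest decline first
rtp_table = rtp_table.rename(columns={"Count": "Count RTP"})
rtp_sorted = rtp_table.sort_values("Count RTP")

print(rtp_sorted)
```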

In [None]:
# Recompute the RTPs, but only performing the calculation on the "Count" column


In [None]:
# Rename "Count" to "Count RTP" for clarity


In [None]:
# What name has fallen the most in popularity?


We can visualize the decrease in the popularity of the name "?:"

In [None]:
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)],
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size=18,
                      autosize=False,
                      width=1000,
                      height=400)
    return fig
# pass the name into plot_name
plot_name("-")

In [None]:
# Find the 10 names that have decreased the most in popularity
# Answer Here

In [None]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.