# 👩‍⚕️ Lecture 5 (Part 1, Tuberculosis) – Data 100, Spring 2025

[Acknowledgments Page](https://ds100.org/sp25/acks/)

In [11]:
import numpy as np
import pandas as pd

In [12]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# Use 5 decimal places instead of scientific notation in pandas
pd.set_option('display.float_format', '{:.5f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# 🦠 Tuberculosis in the United States

What can we say about the presence of Tuberculosis in the United States?

Let's work with the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w) published in 2022.

<br>

---

# 📖 Reading CSVs

The TB case count data is saved as a CSV file located at `data/cdc_tuberculosis.csv`.

We can explore the CSV file in many ways:
1. Using the JupyterLab explorer tool (read-only!).
2. Opening the CSV in DataHub, or Excel, or Google Sheets, etc.
3. Inspecting the Python file object.
4. With `pandas`, using `pd.read_csv()`.

<br>


---

## 🧭 Methods 1 and 2: Play with the data in the JupyterLab Explorer and DataHub
 To solidify the idea of a CSV as **rectangular data** (i.e., tabular data) stored as comma-separated values, let's start with the first two methods.  

 **1. Use the file browser in JupyterLab to locate the CSV at `data/cdc_tuberculosis.csv`, and double-click on it.**

  **2. Right-click on the CSV in the file browser. Select `Open With` --> `Editor`. But, don't make any changes to the file!**

<br>

---

## 🐍 Method 3: Play with the data in Python

Next, we will load in the data as a Python file object and inspect a couple of lines.

With the code below, we can check out the first four lines of the CSV:

In [None]:
# Open the TB case count CSV, and print the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 3: break

As expected, we have four lines of comma-separated values!

> Why are there blank lines between each line of the CSV file?
>
> You may recall that line breaks in text files are encoded with the special newline character `\n`. 
> 
> Python's `print()` function prints each line, interpreting the `\n` at the end of each line as a newline, **and also adds an additional newline**.

We can use the `repr()` ("representation") function to get the raw string representation of each line, in which all special characters are visible.

- In other words, `repr()` returns a string that spells out the newline as the two characters `\n`, so `print()` displays a literal `\n` instead of breaking the line.

- Note, though, that `print()` still adds its own newline each time it is called. Otherwise, we would have one long string below instead of four lines.
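
As a small, file-free illustration of the same idea, the sketch below uses a made-up line of text (not an actual row from the TB file) and compares the string with its `repr()`:

```python
# A made-up line of text ending in a newline, standing in for one CSV row.
line = "Alabama,87,72\n"

# repr() returns a new string in which the newline is spelled out as the
# two characters backslash and n, and the whole thing is wrapped in quotes.
raw = repr(line)

print(line)  # the text, followed by a blank line ("\n" plus print's own newline)
print(raw)   # 'Alabama,87,72\n' on a single line

# The original string has 14 characters; its repr has 17
# (two quote characters, and the newline now takes two characters).
print(len(line), len(raw))
```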

In [None]:
# Open the TB case count CSV, and print the raw representation of
# the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(repr(row)) # print raw strings
        if i >= 3: break

<br/>

---

## 🐼 Method 4: Play with the data using `pandas`

It's time for the tried-and-true Data 100 approach: `pandas`.

In [None]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df

What's going on with the "Unnamed" column names? 

And why does the first row look different than the other rows?

We're ready to wrangle the data! 

A reasonable first step is to **identify the row with the right header** (i.e., the row with the column names). 

The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has a convenient `header` parameter for specifying the index of the row you want to use as the header:

In [None]:
# header=1 tells pandas to ignore row 0, and use row 1 as the column names
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1)
tb_df

Notice that we no longer have "Unnamed" columns.
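
To see the effect of `header` in isolation, here is a minimal sketch on a made-up, in-memory CSV. Its layout only loosely imitates the TB file, and the column names and values are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical file contents: row 0 holds a group label with empty cells around it,
# and row 1 holds the real column names.
csv_text = (
    ",No. of cases,\n"
    "Jurisdiction,2019,2020\n"
    "A,10,12\n"
    "B,20,18\n"
)

default_df = pd.read_csv(io.StringIO(csv_text))          # header=0 is the default
fixed_df = pd.read_csv(io.StringIO(csv_text), header=1)  # use row 1 as the header

# Empty header cells are what produce the "Unnamed: ..." column names.
print(list(default_df.columns))
print(list(fixed_df.columns))
```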

<br>

**Instructor note: Return to slides!**

<br><br><br>

<br/><br/>

---

# 🔎 Granularity of records

Notice that the first record (i.e., row 0) differs from the other records:

In [None]:
tb_df.head()

Row 0 is what we call a **rollup record**, or a summary record. 

- The **granularity** of record 0 (i.e., the total counts summed over all states) differs from the granularity of the rest of the records (i.e., the counts for individual states).

- Rollup records are often useful when displaying tables to humans. But, rollup records are generally less useful when using the data for further analysis, since the rollup record "overlaps" with other records (i.e., info from other rows is aggregated to create the rollup record).

<br/>

Okay, EDA step two. How was the rollup record aggregated?

Let's check if total TB cases (row 0) is indeed the sum of all state TB cases (all other rows). 

- To do this, we can drop row 0, and sum up all the remaining rows. 

In [None]:
tb_df.drop(0)

In [None]:
tb_df.drop(0).sum()

<br/>

This doesn't look very pretty!

Let's check out the column types:

In [None]:
tb_df.dtypes

<br/>

The commas within the counts (e.g., `1,234`) cause `pd.read_csv` to read in the counts as the `object` datatype, or **storage type**. Strings are of the `object` datatype.

- So, `pandas` is concatenating strings (e.g., `'1' + '2' = '12'`) instead of adding integers (e.g., `1 + 2 = 3`).

<br/>

Fortunately, `read_csv` also has a `thousands` parameter to handle exactly this issue ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

- Note: This is not a fact that most data scientists would know off the top of their head! At this point, it would be very natural to Google/ask an LLM `How do I get pandas to ignore commas in numeric columns?`, and then learn about the `thousands` parameter from the results.
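
The sketch below reproduces both behaviors on a tiny made-up CSV (the values are hypothetical, not taken from the CDC file): without `thousands` the counts are strings and `+` concatenates them, while with `thousands=","` they are parsed as integers.

```python
import io
import pandas as pd

# Hypothetical mini-CSV: counts of 1,000 or more are quoted and contain a comma.
csv_text = 'Jurisdiction,2019\nA,"1,234"\nB,56\n'

as_text = pd.read_csv(io.StringIO(csv_text))
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Without `thousands`, each value is a string, so "+" concatenates.
print(as_text["2019"].iloc[0] + as_text["2019"].iloc[1])  # 1,23456

# With `thousands`, the column is numeric, so the sum is arithmetic.
print(as_numbers["2019"].sum())  # 1290
```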

In [None]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
tb_df

Notice that there are no more commas in the first row!

Now, let's sum up the columns, ignoring the first row:

In [None]:
tb_df.drop(0).sum()

Much better! 

- Though you should note that string concatenation is still happening with the state names. To improve our code, we probably should not sum up the state name column. This exercise is left to you!

Finally, let's compare this output to the first row of the original data:

In [None]:
tb_df.head(1)

The sums of the three TB cases columns are the same as the counts in the rollup record. Excellent!

Next, we will compute TB **incidence** for each state and the U.S. as a whole.

**Instructor note: Return to the lecture!**

<br/><br/>

<br/><br/>

---

# 🧺 Gather Census Data

**Run the code in this section, but we won't review it during lecture.**

- This section is a nice example of transforming data downloaded directly from
a public website into a format that is convenient for analysis.

The code in this section transforms CSV files with U.S. Census population estimates ([source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2010s), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020s)) into a form that is compatible with the TB case count data.

- We encourage you to explore the CSVs and study these lines outside of lecture.

There are a few new methods here:
* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) converts each column to the best-fitting dtype that supports missing values. Here, its main effect is to turn float columns that hold whole numbers into integer columns. It is out of scope for the class.
* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) drops rows containing any NA value. This method will be explained in more detail in a future lecture.
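
As a rough preview of what these two methods do, here is a sketch on a made-up table (the names and numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical table: whole-number populations stored as floats
# because the column also contains a missing value.
toy_df = pd.DataFrame({
    "Geographic Area": ["A", "B", None],
    "2019": [100.0, 250.0, np.nan],
})

converted = toy_df.convert_dtypes()  # "2019" becomes the nullable Int64 dtype
cleaned = converted.dropna()         # drops the row that contains missing values

print(converted.dtypes)
print(cleaned)
```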

In [None]:
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")

# Notice we have more than just state data!
display(census_2010s_df.head(10))

# Also notice that the bottom of the file includes metadata (data about data).
# We want to ignore this!
display(census_2010s_df.tail())

Here we do a bit more basic data cleaning:

In [None]:
census_2010s_df = (
    census_2010s_df
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .drop(columns=["Census", "Estimates Base"])
    .convert_dtypes() # "smart" converting of columns to int, use at your own risk
    .dropna()  # we'll introduce this very soon
)
census_2010s_df

You might ask yourself: What is the granularity of each row in this table?

Notice there is a `'.'` at the beginning of each state name. We need to remove that.
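
One thing to keep in mind is that `.str.strip('.')` removes periods from both ends of each string, not just the start. The sketch below uses made-up labels (the second one is hypothetical and chosen to end in a period) to contrast it with `.str.lstrip('.')`:

```python
import pandas as pd

# Made-up labels in the style of the Census file.
areas = pd.Series([".Alabama", ".Example D.C."])

both_ends = areas.str.strip(".")    # removes leading AND trailing periods
left_only = areas.str.lstrip(".")   # removes leading periods only

print(both_ends.tolist())  # ['Alabama', 'Example D.C']
print(left_only.tolist())  # ['Alabama', 'Example D.C.']
```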

In [None]:
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
census_2010s_df

The 2020s data is in a separate file.

So, we will repeat the same data cleaning process on the 2020s dataset.

- Even better, we could write a reusable function to carry out the same cleaning process for both datasets. For this demo, we will use the same code twice.
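
One possible shape for such a function is sketched below. The name `clean_census` and its parameters are hypothetical, and it is only exercised here on a tiny made-up table, so treat it as a starting point rather than a tested replacement for the cells in this section.

```python
import numpy as np
import pandas as pd

def clean_census(df, drop_columns, last_year):
    """Hypothetical helper mirroring the cleaning steps used in this section."""
    return (
        df
        .rename(columns={"Unnamed: 0": "Geographic Area"})
        .drop(columns=drop_columns)
        .loc[:, "Geographic Area":last_year]  # ignore any columns after last_year
        .convert_dtypes()
        .dropna()
        .assign(**{"Geographic Area": lambda d: d["Geographic Area"].str.strip(".")})
    )

# Tiny stand-in for a raw Census table (made-up numbers and footer text).
raw = pd.DataFrame({
    "Unnamed: 0": [".Alabama", ".Alaska", "Note: made-up footer"],
    "Census": [1.0, 2.0, np.nan],
    "2019": [10.0, 20.0, np.nan],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})
cleaned = clean_census(raw, drop_columns=["Census"], last_year="2019")
print(cleaned)
```

Passing the columns to drop and the last year as arguments is what lets a single function cover both files, since the two files differ in exactly those details.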

In [None]:
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2024-POP.csv", header=3, thousands=",")

# Once again, we have more than just state data, and metadata at the bottom.
# But, we also have a ton of extra blank columns!
display(census_2020s_df.head(10))
display(census_2020s_df.tail())

In [None]:
# census 2020s data
census_2020s_df = (
    census_2020s_df
    .drop(columns=["Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    # ignore all the blank extra columns
    .loc[:, "Geographic Area":"2024"]
    .convert_dtypes()       
    .dropna()                  
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
census_2020s_df

With that, we're in business!

We now have U.S. Census population estimates that cover 2019, 2020, and 2021 in a format that is compatible with our TB case count data.

<br/><br/>

---

# 👥 Joining TB case counts with census data

Time to `merge` our datasets (i.e., join them)! 

In [None]:
# Show the three tables that we are going to join.
# To keep things simple, let's just look at the last two rows of each df.
display(tb_df.tail(2))
display(census_2010s_df.tail(2))
display(census_2020s_df.tail(2))

We're only interested in the population for the years 2019, 2020, and 2021, so let's select just those columns:

In [None]:
# Select only the relevant population years
census_2019_df = census_2010s_df[['Geographic Area', '2019']]
census_2020_2021_df = census_2020s_df[['Geographic Area', '2020', '2021']]

display(tb_df.tail(2))
display(census_2019_df.tail(2))
display(census_2020_2021_df.tail(2))

All three dataframes have a column containing U.S. states, along with some other geographic areas. These columns are our **join keys**.

- Below, we use `df1.merge(right=df2, ...)` to carry out the merge ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). 

- We could have alternatively used the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)).
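
A quick sketch on made-up tables (the column values are hypothetical) suggests that the two spellings give the same result:

```python
import pandas as pd

# Made-up tables whose key columns have different names.
cases = pd.DataFrame({"U.S. jurisdiction": ["A", "B"], "cases": [10, 20]})
pop = pd.DataFrame({"Geographic Area": ["A", "B"], "population": [1000, 4000]})

via_method = cases.merge(
    right=pop, left_on="U.S. jurisdiction", right_on="Geographic Area"
)
via_function = pd.merge(
    left=cases, right=pop, left_on="U.S. jurisdiction", right_on="Geographic Area"
)

print(via_method.equals(via_function))  # True
```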

In [None]:
# merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df.tail(2)

To see what's going on with the duplicate columns, and the `_x` and `_y`, let's do just the first merge:

In [None]:
tb_df.merge(right=census_2019_df, 
            left_on="U.S. jurisdiction", 
            right_on="Geographic Area").head()

Notice that the columns containing the **join keys** have all been retained, and all contain the same values.

- Furthermore, notice that the duplicated columns are appended with `_x` and `_y` to keep the column names unique.

- In the TB case count data, column `2019` represents the number of TB cases in 2019, but in the Census data, column `2019` represents the population of each geographic area in 2019.
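
The same behavior can be reproduced on a pair of made-up tables that both contain a column named `2019` (the values are hypothetical). The sketch also shows that the default inner join keeps only keys present in both tables:

```python
import pandas as pd

# Made-up tables that share the column name "2019".
toy_tb = pd.DataFrame({"U.S. jurisdiction": ["A", "B"], "2019": [10, 20]})
toy_census = pd.DataFrame(
    {"Geographic Area": ["A", "B", "C"], "2019": [1000, 4000, 500]}
)

merged = toy_tb.merge(
    right=toy_census, left_on="U.S. jurisdiction", right_on="Geographic Area"
)

print(merged.columns.tolist())
# ['U.S. jurisdiction', '2019_x', 'Geographic Area', '2019_y']
print(len(merged))  # 2, since "C" has no match in toy_tb
```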

We can use the `suffixes` argument to modify the `_x` and `_y` defaults to our liking ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)).

In [None]:
# Specify the suffixes to use for duplicated column names
tb_df.merge(right=census_2019_df,
           left_on="U.S. jurisdiction", 
           right_on="Geographic Area",
           suffixes=('_cases', '_population')).head()

Notice the `_x` and `_y` have changed to `_cases` and `_population`, just like we specified.

Putting it all together, and dropping the duplicated `Geographic Area` columns:

In [None]:
# Redux: merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")

    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")
    
)
tb_census_df.tail(2)

## ♻️ Reproduce incidence

Let's see if we can reproduce the original CDC numbers from our augmented dataset of TB case counts and state populations.

- Recall that the nationwide TB incidence was **2.7 in 2019**, **2.2 in 2020**, and **2.4 in 2021**.

- Along the way, we'll also compute state-level incidence.

From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”

Let's start with a simpler question: What is the per person incidence? 

- In other words, what is the probability that a randomly selected person in the population had TB within a given year?

$$\text{TB incidence per person} = \frac{\text{\# TB cases in population}}{\text{Total population size}}$$

Let's calculate per person incidence for 2019:

In [None]:
# Calculate per person incidence for 2019
tb_census_df["per person incidence 2019"] = (
    tb_census_df["2019_cases"]/tb_census_df["2019_population"]
)
tb_census_df

TB is really rare in the United States, so per person TB incidence is really low, as expected.

- But, if we were to consider a group of 100,000 people, the expected number of TB cases in that group is a much easier number to read.

- In fact, it is 100,000 times the per person incidence!

$$\text{TB incidence per 100,000} = 100{,}000 \times \text{TB incidence per person}$$
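
As a worked example of the two formulas, here is the arithmetic with made-up numbers (they are chosen to be easy to follow and are not from the CDC or Census data):

```python
# Made-up numbers, for illustration only.
cases = 2_000
population = 40_000_000

per_person = cases / population   # incidence per person
per_100k = 100_000 * per_person   # incidence per 100,000 people

print(per_person)
print(round(per_100k, 1))  # 5.0
```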

In [None]:
# To help read bigger numbers in Python, you can use _ to separate thousands,
# akin to using commas. 100_000 is the same as writing 100000, but more readable.
tb_census_df["per 100k incidence 2019"] = (
    100_000 * tb_census_df["per person incidence 2019"] 
)
tb_census_df

Now we're seeing more human-readable values.

- For example, there were 5.3 tuberculosis cases for every 100,000 California residents in 2019.

To wrap up this exercise, let's calculate the nationwide incidence of TB in 2019.
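
Note that a nationwide rate is total cases divided by total population, which generally differs from the plain average of the per-state rates because states have very different populations. A small sketch with two made-up jurisdictions illustrates the difference:

```python
import pandas as pd

# Two made-up jurisdictions with different populations.
toy = pd.DataFrame({
    "cases": [90, 10],
    "population": [3_000_000, 1_000_000],
})

# Pooled rate: sum the cases and populations first, then divide.
pooled = 100_000 * toy["cases"].sum() / toy["population"].sum()

# Unweighted mean of the per-jurisdiction rates (3.0 and 1.0 per 100,000).
mean_of_rates = (100_000 * toy["cases"] / toy["population"]).mean()

print(pooled)         # 2.5
print(mean_of_rates)  # 2.0
```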

In [None]:
# Recall that the CDC reported an incidence of 2.7 per 100,000 in 2019.
tot_tb_cases_50_states = tb_census_df["2019_cases"].sum()
tot_pop_50_states = tb_census_df["2019_population"].sum()
tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
tb_per_100k_50_states

We can use a `for` loop to compute the incidence for 2019, 2020, and 2021.

- You'll notice that we get the same numbers reported by the CDC!

In [None]:
# f-strings (f"...") are a handy way to insert variables into strings.
for year in [2019, 2020, 2021]:
    tot_tb_cases_50_states = tb_census_df[f"{year}_cases"].sum()
    tot_pop_50_states = tb_census_df[f"{year}_population"].sum()
    tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
    print(tb_per_100k_50_states)

<br><br><br>

**Instructor Note: Return to Slides!**