<a href="https://colab.research.google.com/github/shradsb19/econ470/blob/main/data_management.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basics of Data Management

This notebook follows the **Basics of Data Management** slides and implements the examples in both **R** and **Python** using Medicare Advantage data snippets for Georgia.

## Setup: R in Colab

First, enable R in this Colab environment using `rpy2`. Run this cell once at the top of the notebook.

In [20]:
!pip -q install rpy2
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


## Setup: R packages

Install and load the R packages we will use. You only need to install once; on later runs, `pacman::p_load()` will just load them.

In [21]:
%%R
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Install pacman if it is not already installed
if (!"pacman" %in% rownames(installed.packages())) {
  install.packages("pacman")
}

library(pacman)

# Load the packages you want
p_load(tidyverse, janitor, lubridate, readr)

print("R is ready in Colab ✅")


[1] "R is ready in Colab ✅"


## Loading the Georgia Medicare Advantage snippets (R)

We will read in three CSV files hosted on GitHub: enrollment, contracts, and service areas.

In [22]:
%%R
base_url <- "https://raw.githubusercontent.com/imccart/empirical-methods-content/main/data/output/ma-snippets/"

ga_enrollment <- read_csv(paste0(base_url, "ga-enrollment.csv"))
ga_contract   <- read_csv(paste0(base_url, "ga-contract.csv"))
ga_service    <- read_csv(paste0(base_url, "ga-service-area.csv"))

glue::glue("ga_enrollment: {nrow(ga_enrollment)} rows, {ncol(ga_enrollment)} cols")


Rows: 7333 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): contractid, state, county
dbl (4): planid, ssa, fips, enrollment

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 172 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): contractid, org_type, plan_type, partd, snp, eghp, org_name, org_m...
dbl  (1): planid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 4012 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): contractid, org_name, org_type, plan_type, county, state
dbl (2): ssa, fips
lgl (2): partial, eghp

ℹ Use `spec()` to retrieve the full column specification for this data.

## Setup: Python imports

Now set up the Python side with `pandas` (and `numpy` for a few transformations).

In [23]:
import pandas as pd
import numpy as np

print("Python is ready in Colab ✅")


Python is ready in Colab ✅


## Loading the Georgia Medicare Advantage snippets (Python)

We read the same CSVs directly from GitHub using `pandas.read_csv()`.

In [24]:
base_url = "https://raw.githubusercontent.com/imccart/empirical-methods-content/main/data/output/ma-snippets/"

ga_enrollment = pd.read_csv(base_url + "ga-enrollment.csv")
ga_contract   = pd.read_csv(base_url + "ga-contract.csv")
ga_service    = pd.read_csv(base_url + "ga-service-area.csv")

ga_enrollment.shape, ga_contract.shape, ga_service.shape


((7333, 7), (172, 12), (4012, 10))
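
By default `pandas.read_csv()` infers column types, which is why `fips` and `ssa` load as integers. That is harmless for Georgia, but county codes with a leading zero would lose it. The sketch below illustrates the `dtype` argument on a small in-memory CSV; the rows (including contract `H9999` and FIPS `01001`) are made up for illustration and are not in the Georgia files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the second FIPS code has a leading zero.
csv_text = (
    "contractid,planid,fips,enrollment\n"
    "H0111,1,13001,11\n"
    "H9999,2,01001,25\n"
)

# Default type inference reads fips as an integer and drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing dtype keeps the code as text, exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"fips": str})

print(inferred["fips"].tolist())
print(as_text["fips"].tolist())
```

Whether to store codes as text is a judgment call; it mostly matters when the codes are later used as merge keys against files that keep the leading zeros.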

# Looking at your data

We start by doing basic checks: object type, dimensions, and the first few rows.

### First checks after loading (R)

In [25]:
%%R
# Basic info about ga_enrollment
class(ga_enrollment)
dim(ga_enrollment)
nrow(ga_enrollment)
ncol(ga_enrollment)

# Peek at the data
head(ga_enrollment)


# A tibble: 6 × 7
  contractid planid   ssa  fips state county   enrollment
  <chr>       <dbl> <dbl> <dbl> <chr> <chr>         <dbl>
1 H0111           1 11000 13001 GA    Appling          11
2 H0111           1 11030 13009 GA    Baldwin          27
3 H0111           1 11050 13013 GA    Barrow           47
4 H0111           1 11060 13015 GA    Bartow           43
5 H0111           1 11070 13017 GA    Ben Hill         14
6 H0111           1 11080 13019 GA    Berrien          13


### First checks after loading (Python)

In [26]:
# Basic info about ga_enrollment
type(ga_enrollment)
ga_enrollment.shape
ga_enrollment.shape[0]   # rows
ga_enrollment.shape[1]   # columns

# Peek at the data
ga_enrollment.head()


   contractid  planid    ssa   fips  state    county  enrollment
0       H0111       1  11000  13001     GA   Appling          11
1       H0111       1  11030  13009     GA   Baldwin          27
2       H0111       1  11050  13013     GA    Barrow          47
3       H0111       1  11060  13015     GA    Bartow          43
4       H0111       1  11070  13017     GA  Ben Hill          14


### Structure and summaries

Next, look at variable names, types, and basic numeric summaries.

#### R

In [27]:
%%R
library(dplyr)

# Structure
glimpse(ga_enrollment)

# Basic numeric summaries
summary(ga_enrollment)


Rows: 7,333
Columns: 7
$ contractid <chr> "H0111", "H0111", "H0111", "H0111", "H0111", "H0111", "H011…
$ planid     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ssa        <dbl> 11000, 11030, 11050, 11060, 11070, 11080, 11090, 11100, 111…
$ fips       <dbl> 13001, 13009, 13013, 13015, 13017, 13019, 13021, 13023, 130…
$ state      <chr> "GA", "GA", "GA", "GA", "GA", "GA", "GA", "GA", "GA", "GA",…
$ county     <chr> "Appling", "Baldwin", "Barrow", "Bartow", "Ben Hill", "Berr…
$ enrollment <dbl> 11, 27, 47, 43, 14, 13, 116, 21, 40, 73, 44, 42, 113, 108, …
  contractid            planid            ssa             fips      
 Length:7333        Min.   :  1.00   Min.   :11000   Min.   :13001  
 Class :character   1st Qu.:  7.00   1st Qu.:11320   1st Qu.:13077  
 Mode  :character   Median : 39.00   Median :11590   Median :13149  
                    Mean   : 92.56   Mean   :11560   Mean   :13155  
                    3rd Qu.:185.00   3rd Qu.:11821   3rd Qu.:13231  
   

#### Python

In [28]:
# Structure (column names and types)
ga_enrollment.info()

# Basic numeric summaries
ga_enrollment.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7333 entries, 0 to 7332
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   contractid  7333 non-null   object
 1   planid      7333 non-null   int64 
 2   ssa         7333 non-null   int64 
 3   fips        7333 non-null   int64 
 4   state       7333 non-null   object
 5   county      7333 non-null   object
 6   enrollment  7333 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 401.2+ KB


          planid          ssa          fips  enrollment
count     7333.0       7333.0        7333.0      7333.0
mean   92.555162  11559.76708  13155.324015  161.169099
std    103.61958   279.211177     89.938112  408.294312
min          1.0      11000.0       13001.0        11.0
25%          7.0      11320.0       13077.0        23.0
50%         39.0      11590.0       13149.0        53.0
75%        185.0      11821.0       13231.0       141.0
max        392.0      11980.0       13321.0      9958.0


### What to look for

- **Ranges and typical values**: negative ages, impossible years, absurd enrollment counts
- **Missingness**: are key variables mostly missing? do some years/counties have more missing than others?
- **Outliers**: a few very large or very small values that might be data errors
- **Types**: are IDs stored as strings? are dates actually dates?
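
The checks in this list can be sketched in a few lines of pandas. The table below is made up for illustration (it is not taken from the Georgia files), and the plausibility bounds of 0 and 100,000 are assumptions chosen only to show the pattern.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the Georgia files.
toy = pd.DataFrame({
    "contractid": ["H0001", "H0001", "H0002", None],
    "fips": [13001, 13003, 13001, 13005],
    "enrollment": [25.0, -4.0, np.nan, 120000.0],
})

# Missingness: share of missing values in each column.
missing_share = toy.isna().mean()

# Ranges and outliers: flag values outside a plausible interval
# (the bounds here are assumptions, not rules from the data source).
implausible = toy[(toy["enrollment"] < 0) | (toy["enrollment"] > 100_000)]

# Types: confirm identifiers are text and counts are numeric.
column_types = toy.dtypes

print(missing_share)
print(implausible)
print(column_types)
```

Comparisons against a missing value evaluate to `False`, so the row with missing enrollment is not flagged as implausible; missingness has to be checked separately.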


### Simple summaries and counts

We can quickly summarise enrollment by county (identified by `fips`); the same pattern works for any other grouping variable.

#### R

In [29]:
%%R
library(dplyr)

# How many plans per county
ga_enrollment %>%
  count(fips)

# Average enrollment per plan and number of plans, by county
ga_enrollment %>%
  group_by(fips) %>%
  summarise(
    avg_enrollment = mean(enrollment, na.rm = TRUE),
    n_plans        = n(),
    .groups        = "drop"
  )


# A tibble: 159 × 3
    fips avg_enrollment n_plans
   <dbl>          <dbl>   <int>
 1 13001           64.5      40
 2 13003           39.2      21
 3 13005           53.6      27
 4 13007           27.2      15
 5 13009           95.5      55
 6 13011           61.6      40
 7 13013          133.       72
 8 13015          182.       72
 9 13017           63.7      40
10 13019           66.4      37
# ℹ 149 more rows
# ℹ Use `print(n = ...)` to see more rows


#### Python

In [30]:
# How many plans per county
ga_enrollment.groupby("fips").size()

# Average enrollment per plan and number of plans, by county
ga_enrollment.groupby("fips").agg(
    avg_enrollment=("enrollment", "mean"),
    n_plans=("enrollment", "size")
)


       avg_enrollment  n_plans
fips
13001       64.500000       40
13003       39.190476       21
13005       53.592593       27
13007       27.200000       15
13009       95.509091       55
...               ...      ...
13313      275.422222       45
13315       40.416667       24
13317       51.257143       35
13319       41.633333       30


# Cleaning and managing your data

Now we start selecting variables, transforming them, and handling missing or implausible values.

### Selecting variables and filtering rows (R)

In [31]:
%%R
library(dplyr)

ga_enrollment_small <- ga_enrollment %>%
  select(contractid, planid, county, enrollment) %>%  # keep plan identifiers, county, and enrollment
  filter(!is.na(enrollment))

glimpse(ga_enrollment_small)


Rows: 7,333
Columns: 4
$ contractid <chr> "H0111", "H0111", "H0111", "H0111", "H0111", "H0111", "H011…
$ planid     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county     <chr> "Appling", "Baldwin", "Barrow", "Bartow", "Ben Hill", "Berr…
$ enrollment <dbl> 11, 27, 47, 43, 14, 13, 116, 21, 40, 73, 44, 42, 113, 108, …


### Selecting variables and filtering rows (Python)

In [32]:
ga_enrollment_small = (
    ga_enrollment
        [["contractid", "planid", "county", "enrollment"]]
        .dropna(subset=["enrollment"])
)

ga_enrollment_small.head()


   contractid  planid    county  enrollment
0       H0111       1   Appling          11
1       H0111       1   Baldwin          27
2       H0111       1    Barrow          47
3       H0111       1    Bartow          43
4       H0111       1  Ben Hill          14


### Creating and transforming variables (R)

In [33]:
%%R
ga_enrollment_clean <- ga_enrollment_small %>%
  mutate(
    log_enrollment = log(enrollment)
  )

glimpse(ga_enrollment_clean)


Rows: 7,333
Columns: 5
$ contractid     <chr> "H0111", "H0111", "H0111", "H0111", "H0111", "H0111", "…
$ planid         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ county         <chr> "Appling", "Baldwin", "Barrow", "Bartow", "Ben Hill", "…
$ enrollment     <dbl> 11, 27, 47, 43, 14, 13, 116, 21, 40, 73, 44, 42, 113, 1…
$ log_enrollment <dbl> 2.397895, 3.295837, 3.850148, 3.761200, 2.639057, 2.564…


### Creating and transforming variables (Python)

In [34]:
ga_enrollment_clean = ga_enrollment_small.copy()
ga_enrollment_clean["log_enrollment"] = np.log(ga_enrollment_clean["enrollment"])

ga_enrollment_clean.head()


   contractid  planid    county  enrollment  log_enrollment
0       H0111       1   Appling          11        2.397895
1       H0111       1   Baldwin          27        3.295837
2       H0111       1    Barrow          47        3.850148
3       H0111       1    Bartow          43        3.761200
4       H0111       1  Ben Hill          14        2.639057
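
The log transform is safe here because the smallest enrollment in this file is 11, but the natural log of zero is negative infinity. The sketch below, on a made-up column, shows what happens with a zero and one common alternative, `np.log1p`, which computes log(1 + x). Which transform is appropriate depends on the analysis; this is only an illustration.

```python
import numpy as np
import pandas as pd

# Made-up values, including a zero.
toy = pd.DataFrame({"enrollment": [0, 11, 100]})

# np.log(0) returns -inf (NumPy would normally warn about the division),
# so zeros need an explicit decision before taking logs.
with np.errstate(divide="ignore"):
    toy["log_enrollment"] = np.log(toy["enrollment"])

# One common alternative: log(1 + x), which maps 0 to 0.
toy["log1p_enrollment"] = np.log1p(toy["enrollment"])

print(toy)
```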


### Handling missing and implausible values (R)

In [35]:
%%R
ga_enrollment_checked <- ga_enrollment_clean %>%
  mutate(
    enrollment = if_else(enrollment < 0, NA_real_, enrollment)
  ) %>%
  filter(!is.na(enrollment), !is.na(contractid))

glimpse(ga_enrollment_checked)


Rows: 7,333
Columns: 5
$ contractid     <chr> "H0111", "H0111", "H0111", "H0111", "H0111", "H0111", "…
$ planid         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ county         <chr> "Appling", "Baldwin", "Barrow", "Bartow", "Ben Hill", "…
$ enrollment     <dbl> 11, 27, 47, 43, 14, 13, 116, 21, 40, 73, 44, 42, 113, 1…
$ log_enrollment <dbl> 2.397895, 3.295837, 3.850148, 3.761200, 2.639057, 2.564…


### Handling missing and implausible values (Python)

In [36]:
ga_enrollment_checked = ga_enrollment_clean.copy()

# Recode negative enrollment as missing
# (np.nan is the missing-value marker for NumPy-backed numeric columns)
ga_enrollment_checked.loc[
    ga_enrollment_checked["enrollment"] < 0,
    "enrollment"
] = np.nan

# Drop rows missing key fields
ga_enrollment_checked = ga_enrollment_checked.dropna(
    subset=["enrollment", "contractid"]
)

ga_enrollment_checked.head()


   contractid  planid    county  enrollment  log_enrollment
0       H0111       1   Appling        11.0        2.397895
1       H0111       1   Baldwin        27.0        3.295837
2       H0111       1    Barrow        47.0        3.850148
3       H0111       1    Bartow        43.0        3.761200
4       H0111       1  Ben Hill        14.0        2.639057
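
The Georgia file has no negative or missing enrollment, so the row count stays at 7,333 and the recode is hard to see in action. A minimal sketch on made-up rows, using `Series.mask` as an alternative to the `.loc` assignment, makes the effect visible and reports how many rows were dropped.

```python
import pandas as pd

# Made-up rows: one negative enrollment and one missing contract ID.
toy = pd.DataFrame({
    "contractid": ["H0001", "H0002", None, "H0004"],
    "enrollment": [25, -3, 40, 12],
})

# Series.mask replaces values with NaN wherever the condition is True.
toy["enrollment"] = toy["enrollment"].mask(toy["enrollment"] < 0)

# Drop rows missing either key field and record how many were lost.
n_before = len(toy)
checked = toy.dropna(subset=["enrollment", "contractid"])
n_dropped = n_before - len(checked)

print(f"Dropped {n_dropped} of {n_before} rows")
```

Reporting the number of dropped rows at each cleaning step is a cheap way to notice when a filter removes far more data than expected.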


# Merging and reshaping data

We now combine information across the three tables and practice reshaping between long and wide formats.

### Merging tables with keys (R)

In [37]:
%%R
library(dplyr)

# Merge enrollment with contract-level info
ga_enroll_contract <- ga_enrollment_checked %>%
  left_join(
    ga_contract,
    by = c("contractid")
  )

# Add service area info
ga_enroll_full <- ga_enroll_contract %>%
  left_join(
    ga_service,
    by = c("contractid", "county")
  )

glue::glue("Rows before merge: {nrow(ga_enrollment_checked)}, after merge: {nrow(ga_enroll_full)}")


Rows before merge: 7333, after merge: 52509
FALSE


1: In left_join(., ga_contract, by = c("contractid")) :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
2: In left_join(., ga_service, by = c("contractid", "county")) :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 7990 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =


### Merging tables with keys (Python)

In [38]:
# Merge enrollment with contract-level info
ga_enroll_contract = ga_enrollment_checked.merge(
    ga_contract,
    on=["contractid"],
    how="left"
)

# Add service area info
ga_enroll_full = ga_enroll_contract.merge(
    ga_service,
    on=["contractid", "county"],
    how="left"
)

ga_enroll_full.shape


(52509, 24)
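
The row count grows from 7,333 to 52,509 after the merges, and R warns about a many-to-many relationship. That pattern usually means the join key does not uniquely identify rows in the table being joined. Since `ga_contract` also has a `planid` column, `contractid` alone may not identify one row there; that is an assumption about the file, not something verified here. The sketch below reproduces the problem on made-up tables and shows two checks: testing the key for duplicates, and the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Made-up tables: enrollment is plan-by-county, contracts are plan-level.
enroll = pd.DataFrame({
    "contractid": ["H0001", "H0001", "H0001"],
    "planid": [1, 1, 2],
    "county": ["Appling", "Baldwin", "Appling"],
    "enrollment": [11, 27, 40],
})
contract = pd.DataFrame({
    "contractid": ["H0001", "H0001"],
    "planid": [1, 2],
    "plan_type": ["HMO", "PPO"],
})

# Joining on contractid alone matches every enrollment row to both plans.
loose = enroll.merge(contract, on="contractid", how="left")

# Check that the full key identifies one row on the right-hand side...
key = ["contractid", "planid"]
assert not contract.duplicated(subset=key).any()

# ...and let pandas enforce it: validate raises MergeError if the
# right-hand key turns out not to be unique.
tight = enroll.merge(contract, on=key, how="left", validate="many_to_one")

print(len(enroll), len(loose), len(tight))
```

A left join on a key that is unique in the right-hand table never changes the number of rows, so comparing row counts before and after is a quick sanity check.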

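# Stacking (binding) multiple datasets

Both cells in this section read yearly files named `ga-enrollment-<year>.csv` from a local `data/output/ma-snippets/` folder. If those files are not present, the read fails with a file-not-found error, as the outputs show.
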
### Stacking with R

In [None]:
%%R
years <- 2015:2018

# Read each yearly file, tag it with its year, and row-bind the results
ga_enrollment_multi <- map_dfr(
  years,
  ~ read_csv(paste0("data/output/ma-snippets/ga-enrollment-", .x, ".csv")) %>%  # adjust the path to your files
      mutate(year = .x)
)

Error in `map()`:
ℹ In index: 1.
Caused by error:
! 'data/output/ma-snippets/ga-enrollment-2015.csv' does not exist in current working directory ('/Users/shraddha/Documents/GitHub/econ470').
Run `rlang::last_trace()` to see where the error occurred.


### Stacking with Python

In [None]:
years = list(range(2015, 2019))

frames = []
for y in years:
    df_y = pd.read_csv(f"data/output/ma-snippets/ga-enrollment-{y}.csv")  # adjust the path to your files
    df_y["year"] = y
    frames.append(df_y)

ga_enrollment_multi = pd.concat(frames, ignore_index=True)

FileNotFoundError: [Errno 2] No such file or directory: 'data/output/ma-snippets/ga-enrollment-2015.csv'
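
The stacking pattern itself does not depend on the files. A minimal sketch with two small in-memory frames standing in for the yearly CSVs (all values made up) shows the same steps: tag each frame with its year, then concatenate.

```python
import pandas as pd

# Stand-ins for the yearly files; the values are made up.
yearly = {
    2015: pd.DataFrame({"contractid": ["H0001", "H0002"], "enrollment": [10, 20]}),
    2016: pd.DataFrame({"contractid": ["H0001", "H0002"], "enrollment": [15, 25]}),
}

# Tag each frame with its year, then stack them row-wise.
frames = [df.assign(year=year) for year, df in yearly.items()]
stacked = pd.concat(frames, ignore_index=True)

print(stacked)
```

`ignore_index=True` renumbers the rows from 0 so the stacked table does not carry repeated index values from the individual frames.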

# Reshaping: Wide vs Long

### Reshaping with R

In [None]:
%%R
# Aggregate enrollment by contract and year
# (uses the stacked multi-year data; the single-year snippet has no `year` column)
ga_contract_year <- ga_enrollment_multi %>%
  group_by(contractid, year) %>%
  summarise(
    total_enrollment = sum(enrollment, na.rm = TRUE),
    .groups = "drop"
  )

# Long → wide: years as columns
ga_contract_wide <- ga_contract_year %>%
  pivot_wider(
    names_from = year,
    values_from = total_enrollment,
    names_prefix = "enroll_"
  )

# Wide → long: back to year/enrollment columns
ga_contract_long <- ga_contract_wide %>%
  pivot_longer(
    cols = starts_with("enroll_"),
    names_to = "year",
    names_prefix = "enroll_",
    values_to = "total_enrollment"
  ) %>%
  mutate(year = as.integer(year))

### Reshaping with Python

In [40]:
# Aggregate enrollment by contract and year
# (uses the stacked multi-year data; the single-year snippet has no `year` column)
ga_contract_year = (
    ga_enrollment_multi
      .groupby(["contractid", "year"], as_index=False)
      .agg(total_enrollment=("enrollment", "sum"))
)

# Long → wide: years as columns
ga_contract_wide = ga_contract_year.pivot(
    index="contractid",
    columns="year",
    values="total_enrollment"
).add_prefix("enroll_").reset_index()

# Wide → long: back to year/enrollment columns
ga_contract_long = ga_contract_wide.melt(
    id_vars="contractid",
    var_name="year",
    value_name="total_enrollment"
)

# Strip prefix and convert year to int
ga_contract_long["year"] = (
    ga_contract_long["year"].str.replace("enroll_", "", regex=False).astype(int)
)
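
The long-to-wide-to-long round trip can be tried without the yearly files on a small made-up table that already has a `year` column. The sketch below follows the same `pivot` and `melt` steps and then sorts the result so it can be compared with the starting table.

```python
import pandas as pd

# Made-up long table: one row per contract and year.
long_df = pd.DataFrame({
    "contractid": ["H0001", "H0001", "H0002", "H0002"],
    "year": [2015, 2016, 2015, 2016],
    "total_enrollment": [10, 15, 20, 25],
})

# Long to wide: one row per contract, one column per year.
wide = (
    long_df.pivot(index="contractid", columns="year", values="total_enrollment")
           .add_prefix("enroll_")
           .reset_index()
)

# Wide to long: melt the year columns back into rows.
back = wide.melt(id_vars="contractid", var_name="year", value_name="total_enrollment")
back["year"] = back["year"].str.replace("enroll_", "", regex=False).astype(int)

# melt stacks column by column, so sort to restore the original order.
back = back.sort_values(["contractid", "year"]).reset_index(drop=True)

print(wide)
print(back)
```

`pivot` requires each (contract, year) pair to appear only once, which is why the enrollment data is aggregated to that level before reshaping.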

# Saving cleaned data to CSV

### Saving with R

In [None]:
%%R
# Save cleaned enrollment data as CSV
write_csv(
  ga_enrollment_checked,
  "data/output/ma-snippets/ga-enrollment-clean.csv"
)

### Saving with Python

In [None]:
import pandas as pd

# Save cleaned enrollment data as CSV
ga_enrollment_checked.to_csv(
    "data/output/ma-snippets/ga-enrollment-clean.csv",
    index=False
)
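
`to_csv()` does not create missing folders, so writing to `data/output/ma-snippets/` fails if that path does not exist yet. A minimal sketch, assuming a temporary folder in place of the project directory and a made-up two-row table, creates the folder first and reads the file back to confirm the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up table standing in for the cleaned enrollment data.
toy = pd.DataFrame({"contractid": ["H0001", "H0002"], "enrollment": [11, 27]})

# A temporary folder stands in for the project directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "output" / "toy-clean.csv"

    # Create the parent folders before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm what was written.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

`index=False` keeps the row index out of the file; without it, reading the CSV back adds an extra unnamed column.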