## Project 1

The goal of the first project is to do some wrangling, EDA, and visualization, and generate sequences of values. We will focus on:

- CDC National Health and Nutrition Examination Survey (NHANES, 1999-2000): https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=1999
- CDC Linked Mortality File (LMF, 1999-2000): https://www.cdc.gov/nchs/data-linkage/mortality-public.htm

NHANES is a rich panel dataset on health and behavior, collected in two-year cycles from 1999 to the present. We will focus on the 1999--2000 wave because it has the longest follow-up window and therefore the richest mortality data. The mortality data are provided by the CDC Linked Mortality File.

The purpose of the project is to use $k$-NN to predict who dies (hard or soft classification) and how long they live (regression).

### Part 1: Wrangling and EDA (40/100 pts)

First, go to the NHANES and LMF websites and familiarize yourself with the data sources. Download the codebooks. Think about what resources are available. The CDC Linked Mortality File is somewhat of a pain to work with, so I have pre-cleaned it for you. It is available at https://github.com/ds4e/undergraduate_ml_assignments in the data folder, as `lmf_parsed.csv`. From the CDC LMF web page, get the SAS program that loads the data; it is the real codebook.

Second, download the demographic data for the 1999--2000 wave from the NHANES page. You can use the following code chunk to merge the LMF and DEMO data:

``` python
import pandas as pd
mdf = pd.read_csv('linked_mortality_file_1999_2000.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable
```

Third, the variables `ELIGSTAT`, `MORTSTAT`, `PERMTH_INT`, and `RIDAGEEX` are particularly important. Look them up in the documentation and clearly describe them. (5/100 pts.)

Fourth, the goal of the project is to use whatever demographic, behavioral, and health data you like to predict mortality (`MORTSTAT`) and life expectancy (`PERMTH_INT`). Go to the NHANES 1999--2000 web page, select your data, and download it. Clearly explain your rationale for selecting these data. Use `.merge` to combine your data into one complete dataframe. Document missing values. (5/100 pts)

Fifth, do basic EDA and visualization of the key variables. Are any important variables skewed? Are there outliers? How correlated are pairs of variables? Do pairs of categorical variables exhibit interesting patterns in contingency tables? Provide a clear discussion and examination of the data and the variables you are interested in using. (20/100 pts)
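
As a rough illustration of the kinds of summaries this EDA step asks about, here is a minimal sketch on synthetic data. The dataframe `toy` and all of its column names are made up for illustration; they are not NHANES variables.

``` python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): not drawn from NHANES.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age_months": rng.integers(216, 1020, size=n).astype(float),
    "income_ratio": rng.lognormal(mean=0.5, sigma=0.6, size=n),  # right-skewed by construction
    "sex": rng.choice(["male", "female"], size=n),
    "died": rng.choice([0, 1], size=n, p=[0.7, 0.3]),
})

numeric = toy[["age_months", "income_ratio"]]
skewness = numeric.skew()   # sample skewness of each numeric column
correlations = numeric.corr()  # pairwise Pearson correlations

# Contingency table of two categorical variables, as row-wise proportions.
ct = pd.crosstab(toy["sex"], toy["died"], normalize="index")

print(skewness)
print(correlations)
print(ct)
```

Swapping `normalize="index"` for `normalize=False` gives raw counts instead of proportions.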


### Part 2: $k$-NN classification/regression, write-up (50/100 pts)

Submit a notebook that clearly addresses the following, using code and markdown chunks:

1. Describe the data, particularly what an observation is and whether there are any missing data that might impact your analysis. Who collected the data and why? What known limitations are there to analysis? (10/100 pts)
2. Describe the variables you selected to predict mortality and life expectancy, and the rationale behind them. Analyze your variables using describe tables, kernel densities, scatter plots, and conditional kernel densities. Are there any patterns of interest to notice? (10/100 pts)
3. Use your variables to predict mortality with a $k$-Nearest Neighbor Classifier. Analyze its performance and explain clearly how you selected $k$. (10/100 pts)
4. Use your variables to predict life expectancy with a $k$-Nearest Neighbor Regressor. Analyze its performance and explain clearly how you selected $k$. (10/100 pts)
5. Describe how your model could be used for health interventions based on patient characteristics. Are there any limitations or risks to consider? (10/100 pts)
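
One common way to choose $k$ for items 3 and 4 is cross-validation. The sketch below is illustrative only: it uses synthetic data in place of the NHANES features, the grid `candidate_ks` is an arbitrary assumption, and it assumes scikit-learn is installed.

``` python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and MORTSTAT labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

candidate_ks = [1, 3, 5, 11, 21, 41]  # hypothetical grid of neighbor counts
cv_scores = {}
for k in candidate_ks:
    # Scaling sits inside the pipeline so each fold is standardized
    # using only its own training portion.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

For the regression task, the same loop should work with `KNeighborsRegressor` and a scoring rule such as `"neg_mean_squared_error"`.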

## Submission (10/100 pts)

Submit your work in a well-organized GitHub repo, where the code is appropriately commented and all members of the group have made significant contributions to the commit history. (10/100 pts)

``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

``` python
import os
os.listdir()
os.listdir('sample_data')
```

```
['README.md',
 'anscombe.json',
 'linked_mortality_file_1999_2000.csv',
 'mnist_train_small.csv',
 'california_housing_test.csv',
 'mnist_test.csv',
 'california_housing_train.csv']
```

``` python
mdf = pd.read_csv('sample_data/linked_mortality_file_1999_2000.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("sample_data/DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable
```

```
   SEQN  ELIGSTAT  MORTSTAT  UCOD_LEADING  DIABETES  HYPERTEN  PERMTH_INT  \
0     1         2       NaN           NaN       NaN       NaN         NaN   
1     2         1       1.0           6.0       0.0       0.0       177.0   
2     3         2       NaN           NaN       NaN       NaN         NaN   
3     4         2       NaN           NaN       NaN       NaN         NaN   
4     5         1       0.0           NaN       NaN       NaN       244.0   

   PERMTH_EXM  
0         NaN  
1       177.0  
2         NaN  
3         NaN  
4       244.0  
   SEQN  SDDSRVYR  RIDSTATR  RIDEXMON  RIAGENDR  RIDAGEYR  RIDAGEMN  RIDAGEEX  \
0   1.0       1.0       2.0       2.0       2.0       2.0      29.0      31.0   
1   2.0       1.0       2.0       2.0       1.0      77.0     926.0     926.0   
2   3.0       1.0       2.0       1.0       2.0      10.0     125.0     126.0   
3   4.0       1.0       2.0       2.0       1.0       1.0      22.0      23.0   
4   5.0       1.0       2.0       2.
```

``` python
# Keep only linkage-eligible participants with a recorded mortality status
df = df[df["ELIGSTAT"] == 1]
df = df[df["MORTSTAT"].notna()]

print("Shape:", df.shape)
print("Death rate:", df["MORTSTAT"].mean())
print(df["MORTSTAT"].value_counts())
```

```
Shape: (5445, 151)
Death rate: 0.3076216712580349
MORTSTAT
0.0    3770
1.0    1675
Name: count, dtype: int64
```


## Description of Key Mortality Variables

### ELIGSTAT – Linkage Eligibility Status

The variable **ELIGSTAT** indicates whether a survey participant was eligible for linkage to the National Death Index (NDI) for mortality follow-up. In the public-use linked mortality files, eligibility is determined based on the availability of sufficient identifying information.

The coding is as follows:

- **1** = Eligible for mortality linkage  
- **2** = Under age 18 and not eligible for public-use mortality release  
- **3** = Not linkage-eligible due to insufficient identifying data  

In this analysis, we restrict the dataset to individuals with `ELIGSTAT = 1` to ensure that mortality outcomes are valid and interpretable.

---

### MORTSTAT – Mortality Status

The variable **MORTSTAT** represents the participant’s vital status at the end of the mortality follow-up period. This variable serves as the primary outcome for mortality classification.

The coding is as follows:

- **0** = Assumed alive  
- **1** = Assumed deceased  

Vital status is determined through probabilistic linkage to the National Death Index. Only linkage-eligible participants have valid mortality status values. In this project, **MORTSTAT is used as the dependent variable in mortality prediction models.**

---

### PERMTH_INT – Follow-up Time (Person-Months from Interview Date)

The variable **PERMTH_INT** represents the number of person-months from the date of the participant’s interview until either:

- The date of death (if deceased), or  
- The end of the mortality follow-up period (December 31, 2019), if assumed alive.

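This variable captures survival duration and is useful for time-to-event analyses. For participants assumed alive, it is censored at the end of follow-up, so it measures observed follow-up time rather than completed lifespan. It is not used as a predictor in the classification model because it contains post-baseline follow-up information and would introduce data leakage.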
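
As a rough illustration of the leakage point, the sketch below separates follow-up columns from baseline predictors. The toy values are made up, and `leakage_cols` is an assumed list built from the column names in the mortality file preview; `DIABETES` and `HYPERTEN` are treated here as follow-up fields, which should be confirmed against the LMF codebook.

``` python
import pandas as pd

# Toy rows standing in for the merged dataframe (values are made up).
df = pd.DataFrame({
    "SEQN": [2, 5, 7],
    "RIDAGEEX": [926.0, 597.0, 716.0],
    "RIAGENDR": [1.0, 2.0, 2.0],
    "MORTSTAT": [1.0, 0.0, 0.0],
    "UCOD_LEADING": [6.0, None, None],
    "DIABETES": [0.0, None, None],
    "HYPERTEN": [0.0, None, None],
    "PERMTH_INT": [177.0, 244.0, 236.0],
    "PERMTH_EXM": [177.0, 244.0, 235.0],
})

# Columns recorded during follow-up (assumed list); none are known at baseline.
leakage_cols = ["MORTSTAT", "UCOD_LEADING", "DIABETES", "HYPERTEN",
                "PERMTH_INT", "PERMTH_EXM"]

X = df.drop(columns=leakage_cols + ["SEQN"])  # baseline predictors only
y_class = df["MORTSTAT"]                      # classification target
y_reg = df["PERMTH_INT"]                      # regression target

print(list(X.columns))
```

Dropping `SEQN` as well keeps an arbitrary identifier from entering the distance calculation in $k$-NN.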

---

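### RIDAGEEX – Age at Examination (Months)

The variable **RIDAGEEX** indicates the participant’s age in months at the time of examination. In the demographics preview, for example, the participant with `RIDAGEYR` of 77 has `RIDAGEEX` of 926. Age is a fundamental demographic variable and is strongly associated with mortality risk.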

In predictive modeling, age is expected to be one of the most influential predictors of mortality, as mortality risk increases substantially across the lifespan.
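
In the demographics preview earlier in the notebook, the participant with `RIDAGEYR` of 77 has `RIDAGEEX` of 926, which is consistent with the variable being stored in months (926 / 12 ≈ 77.2). A minimal sketch of converting to years under that assumption; `age_exam_years` is a hypothetical column name.

``` python
import pandas as pd

# RIDAGEEX values copied from the first rows of the demographics preview.
ages = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "RIDAGEEX": [31.0, 926.0, 126.0, 23.0],
})

# Hypothetical helper column: age at examination expressed in years.
ages["age_exam_years"] = ages["RIDAGEEX"] / 12

print(ages)
```

Because $k$-NN relies on distances, the unit matters less than scaling all predictors consistently before fitting.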

---

Together, these variables define eligibility, outcome status, follow-up duration, and a key demographic risk factor. Proper understanding of these variables is essential to ensure valid modeling decisions and appropriate interpretation of mortality predictions.

## Data Selection and Integration

### Rationale for Data Selection

The objective of this project is to predict mortality status (**MORTSTAT**) and survival time in person-months (**PERMTH_INT**) using demographic, behavioral, and health-related variables from NHANES 1999–2000.

To accomplish this, we selected variables from the following NHANES components:

1. **Demographic Data (DEMO.xpt)**  
   This file includes core demographic variables such as:
   - Age at examination in months (RIDAGEEX)
   - Sex (RIAGENDR)
   - Race/ethnicity (RIDRETH1)
   - Socioeconomic indicators (e.g., education, income proxies)

   These variables are foundational predictors of mortality risk and are commonly used in epidemiological modeling.

2. **Mortality Linkage File (Linked Mortality File, 1999–2000)**  
   This file provides:
   - Mortality status (MORTSTAT)
   - Eligibility status (ELIGSTAT)
   - Follow-up time in person-months (PERMTH_INT)
   - Cause-of-death indicators

   This file supplies the outcome variables required for both classification (mortality) and regression (life expectancy proxy).

Any additional behavioral or health variables added to the model (such as smoking, BMI, hypertension, or diabetes indicators) are selected on the basis of established associations with mortality risk in the public health literature.

Overall, the selected variables represent a combination of demographic, socioeconomic, and clinical risk factors that are theoretically and empirically linked to mortality outcomes.

---

### Data Merging Procedure

All NHANES datasets contain a unique respondent identifier variable, **SEQN**.  
This variable was used as the merge key to combine files.

The datasets were merged using an inner join:

- Demographic file (DEMO.xpt)
- Linked Mortality File (linked_mortality_file_1999_2000.csv)

The merge ensured that each observation corresponds to a single participant with both baseline demographic data and mortality follow-up information.

After merging, the resulting dataframe contains 151 variables and 9,965 total observations before filtering.
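
A hedged sketch of how the merge could be checked, using toy frames in place of the real files (all values are made up). `validate` and `indicator` are standard `DataFrame.merge` parameters; the cast of `SEQN` reflects the float values seen in the `read_sas` preview.

``` python
import pandas as pd

# Toy stand-ins for the demographics and mortality frames.
gdf = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "RIDAGEYR": [2.0, 77.0, 10.0]})
mdf = pd.DataFrame({"SEQN": [1, 2, 3], "ELIGSTAT": [2, 1, 2]})

# Cast the float key to integer so both sides share a dtype.
gdf["SEQN"] = gdf["SEQN"].astype("int64")

# validate raises MergeError if SEQN is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = gdf.merge(mdf, on="SEQN", how="outer",
                   validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())

# Rows present in both files are the same rows an inner join would keep.
df = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(df.shape)
```

Any rows flagged `left_only` or `right_only` would be participants missing from one of the files, which is worth reporting before filtering.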

---

### Handling Linkage Eligibility

Because mortality status is only defined for linkage-eligible individuals, we restricted the dataset to:

- `ELIGSTAT == 1` (eligible for linkage)
- Non-missing `MORTSTAT` values

This resulted in:

- **5,445 eligible participants**
- 1,675 deceased (30.76%)
- 3,770 alive (69.24%)

This restriction ensures that mortality outcomes are valid and interpretable.
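
These counts also give a reference point for the classifier: always predicting the larger class is a baseline that any $k$-NN model should beat. The arithmetic, using the counts reported above:

``` python
n_deceased = 1675
n_alive = 3770
n_total = n_deceased + n_alive

death_rate = n_deceased / n_total
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(n_deceased, n_alive) / n_total

print(n_total, round(100 * death_rate, 2), round(100 * majority_baseline, 2))
```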

---

### Documentation of Missing Values

Missing values are present in several variables for the following reasons:

1. Participants under age 18 are coded as ineligible (ELIGSTAT = 2), resulting in missing mortality outcomes.
2. Some health and behavioral variables contain item nonresponse.
3. Follow-up time variables are missing for linkage-ineligible participants.

After filtering to eligible individuals, mortality status contains no missing values.  
Other predictor variables were evaluated for missingness using summary statistics and will be handled appropriately (e.g., removal, imputation, or encoding) prior to model fitting.
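
One way such a missingness summary could be produced is sketched below on a toy frame. The values are made up, and `BMXBMI` is included only as a hypothetical predictor column.

``` python
import numpy as np
import pandas as pd

# Toy frame standing in for the filtered dataframe (values are made up).
df = pd.DataFrame({
    "MORTSTAT": [1.0, 0.0, 0.0, 1.0],
    "RIDAGEEX": [926.0, np.nan, 716.0, 840.0],
    "BMXBMI": [np.nan, np.nan, 24.9, 31.2],  # hypothetical predictor
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (100 * df.isna().mean()).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing)
```

A table like this makes it easier to decide, column by column, between dropping rows, imputing, or excluding the variable.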

---

### Summary

The final analytical dataset combines demographic and mortality follow-up data using SEQN as the merge key. The selected variables were chosen based on epidemiological relevance to mortality risk, ensuring theoretical justification for predictive modeling.