## Project 1

The goal of the first project is to do some wrangling, EDA, and visualization, and generate sequences of values. We will focus on:

- CDC National Health and Nutrition Examination Survey (NHANES, 1999-2000): https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=1999
- CDC Linked Mortality File (LMF, 1999-2000): https://www.cdc.gov/nchs/data-linkage/mortality-public.htm

NHANES is a rich survey dataset on health and behavior, released in two-year cycles from 1999 to the present. We will focus on the 1999--2000 wave because it has the longest follow-up window and therefore the richest mortality data. The mortality data come from the CDC Linked Mortality File.

The purpose of the project is to use $k$-NN to predict who dies (hard or soft classification) and how long they live (regression).

### Part 1: Wrangling and EDA (40/100 pts)

First, go to the NHANES and LMF web sites and familiarize yourself with the data sources. Download the codebooks. Think about what resources are available. The CDC Linked Mortality File is somewhat of a pain to work with, so I have pre-cleaned it for you. It is available at https://github.com/ds4e/undergraduate_ml_assignments in the data folder, as `lmf_parsed.csv`. From the CDC LMF web page, get the SAS program that loads the data; it is the real codebook.

Second, download the demographic data for the 1999--2000 wave from the NHANES page. You can use the following code chunk to merge the LMF and DEMO data:

``` python
import pandas as pd
mdf = pd.read_csv('linked_mortality_file_1999_2000.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable
```
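
Before relying on the merged frame, it can help to confirm that `SEQN` is unique in both files and to count how many rows fail to match. The sketch below uses small made-up frames in place of the real files; the `validate` and `indicator` arguments are standard `pandas.DataFrame.merge` options, while the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the mortality (mdf) and demographics (gdf) frames; values are made up.
mdf = pd.DataFrame({"SEQN": [1, 2, 3, 5], "MORTSTAT": [None, 1.0, None, 0.0]})
gdf = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0, 4.0], "RIDAGEYR": [2.0, 77.0, 10.0, 1.0]})

# read_sas returns SEQN as float; cast it so both keys share an integer dtype.
gdf["SEQN"] = gdf["SEQN"].astype("int64")

# An outer merge with indicator=True shows which rows matched;
# validate="one_to_one" raises an error if SEQN is duplicated in either frame.
check = gdf.merge(mdf, on="SEQN", how="outer", validate="one_to_one", indicator=True)
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# Keeping only the matched rows reproduces the inner merge.
df = check.loc[check["_merge"] == "both"].drop(columns="_merge")
print(df)
```

If the `left_only` or `right_only` counts are large on the real data, that is worth documenting before moving on.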

In [None]:
# setup
import pandas as pd
mdf = pd.read_csv('linked_mortality_file_1999_2000.csv') # Load mortality file
print( mdf.head() )
gdf = pd.read_sas("DEMO.xpt", format="xport") # Load demographics file
print( gdf.head() )
df = gdf.merge(mdf, on="SEQN", how="inner") # Merge mortality and demographics on SEQN variable

   SEQN  ELIGSTAT  MORTSTAT  UCOD_LEADING  DIABETES  HYPERTEN  PERMTH_INT  \
0     1         2       NaN           NaN       NaN       NaN         NaN   
1     2         1       1.0           6.0       0.0       0.0       177.0   
2     3         2       NaN           NaN       NaN       NaN         NaN   
3     4         2       NaN           NaN       NaN       NaN         NaN   
4     5         1       0.0           NaN       NaN       NaN       244.0   

   PERMTH_EXM  
0         NaN  
1       177.0  
2         NaN  
3         NaN  
4       244.0  
   SEQN  SDDSRVYR  RIDSTATR  RIDEXMON  RIAGENDR  RIDAGEYR  RIDAGEMN  RIDAGEEX  \
0   1.0       1.0       2.0       2.0       2.0       2.0      29.0      31.0   
1   2.0       1.0       2.0       2.0       1.0      77.0     926.0     926.0   
2   3.0       1.0       2.0       1.0       2.0      10.0     125.0     126.0   
3   4.0       1.0       2.0       2.0       1.0       1.0      22.0      23.0   
4   5.0       1.0       2.0       2.

Third, the variables `ELIGSTAT`, `MORTSTAT`, `PERMTH_INT`, and `RIDAGEEX` are particularly important. Look them up in the documentation and clearly describe them. (5/100 pts.)
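
One way to ground your written descriptions is to tabulate these variables alongside the codebook. The sketch below is illustrative only: the toy values follow the first rows of the printed output above where they are visible (the fifth `RIDAGEEX` value is made up), and the comment about which codes imply missing outcomes is an assumption to verify against the documentation.

```python
import numpy as np
import pandas as pd

# Toy frame; on the real data, use the merged df instead.
df = pd.DataFrame({
    "ELIGSTAT": [2, 1, 2, 2, 1],
    "MORTSTAT": [np.nan, 1.0, np.nan, np.nan, 0.0],
    "PERMTH_INT": [np.nan, 177.0, np.nan, np.nan, 244.0],
    "RIDAGEEX": [31.0, 926.0, 126.0, 23.0, 597.0],  # last value is made up
})

# Frequency tables for the coded variables, keeping missing values visible.
for col in ["ELIGSTAT", "MORTSTAT"]:
    print(df[col].value_counts(dropna=False).sort_index())

# Summary statistics for the variables measured in months.
print(df[["PERMTH_INT", "RIDAGEEX"]].describe())

# Share of missing MORTSTAT within each ELIGSTAT code. In this toy data the
# outcome is missing exactly when ELIGSTAT == 2; check the real pattern yourself.
missing_by_elig = df.groupby("ELIGSTAT")["MORTSTAT"].apply(lambda s: s.isna().mean())
print(missing_by_elig)

# RIDAGEEX is recorded in months, so dividing by 12 gives age in years.
df["age_exam_years"] = df["RIDAGEEX"] / 12
```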

In [None]:
# Girl Arya

Fourth, the goal of the project is to use whatever demographic, behavioral, and health data you like to predict mortality (`MORTSTAT`) and life expectancy (`PERMTH_INT`). Go to the NHANES 1999--2000 web page, then select and download your data. Clearly explain your rationale for selecting these data. Use `.merge` to combine your data into one complete dataframe. Document missing values. (5/100 pts)
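
A minimal sketch of combining several component files and documenting missing values is given below. The three toy frames stand in for whichever NHANES files you choose; the column names `BMXBMI` and `SMQ020` and all values are illustrative assumptions, not a recommended variable set.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Toy stand-ins for three NHANES component files, each keyed by SEQN.
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [2, 77, 10, 49]})
bmx = pd.DataFrame({"SEQN": [1, 2, 3], "BMXBMI": [14.9, 24.9, np.nan]})
smq = pd.DataFrame({"SEQN": [2, 4], "SMQ020": [1, 2]})

# Left-merge each file onto the demographics frame so no participant is dropped.
merged = reduce(
    lambda left, right: left.merge(right, on="SEQN", how="left", validate="one_to_one"),
    [demo, bmx, smq],
)

# Missing-value report: count and share of missing entries per column.
missing = pd.DataFrame({
    "n_missing": merged.isna().sum(),
    "share_missing": merged.isna().mean().round(3),
}).sort_values("share_missing", ascending=False)
print(missing)
```

A left merge keeps every participant in the first frame, so the report reflects both item nonresponse and participants who are absent from a component file.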

In [None]:
# Eshal

'''
To predict mortality (MORTSTAT) and life expectancy (PERMTH_INT), we focused on three broad domains from NHANES 1999 to 2000:

Demographic variables (from DEMO.xpt)
1. Age (RIDAGEYR), because mortality risk and remaining life expectancy rise steeply with age.
2. Sex (RIAGENDR), race/ethnicity (RIDRETH1), and family income poverty ratio (INDFMPIR), since social and structural factors strongly shape mortality risk and access to care.
3. Marital status (DMDMARTL), which is often associated with social support and mortality.

Behavioral variables (e.g., smoking and alcohol questionnaires)
1. Smoking status and intensity (e.g., SMQ/SMD variables), because tobacco use is a major predictor of cardiovascular, cancer, and overall mortality.
2. Alcohol use (ALQ variables), given its established relationship with liver disease, injury, and all-cause mortality.
3. Physical activity indicators, where available, as low activity is strongly associated with higher mortality.

Health status and clinical variables (exam and lab files)
1. Body mass index and other anthropometrics (e.g., BMXBMI, waist circumference), because obesity and central adiposity predict diabetes, cardiovascular disease, and death.
2. Self-rated health (e.g., general health question) as a global, subjective predictor that consistently correlates with all-cause mortality.
3. Selected chronic condition indicators (e.g., self-reported diabetes, hypertension) where available, because existing chronic disease burden is directly related to near-term mortality risk.

Our rationale was to choose variables that:
1. Are plausibly causal or strongly associated with mortality from prior literature (age, smoking, obesity, chronic disease).
2. Capture structural risk (income, race/ethnicity), behavioral risk (smoking, alcohol, physical activity), and clinical risk (BMI, chronic conditions).
3. Are relatively well measured in NHANES 1999 to 2000 and not limited to tiny subsamples, so that we retain enough observations for kNN.

'''

Fifth, do basic EDA and visualization of the key variables. Are any important variables skewed? Are there outliers? How correlated are pairs of variables? Do pairs of categorical variables exhibit interesting patterns in contingency tables? Provide a clear discussion and examination of the data and the variables you are interested in using. (20/100 pts)
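
The questions above map onto a handful of pandas calls. The sketch below runs them on synthetic data; the distributions, the sample size, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not properties of NHANES.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a merged NHANES frame.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 85, size=n).astype(float),
    "BMXBMI": rng.lognormal(mean=3.3, sigma=0.4, size=n),  # right-skewed by construction
    "RIAGENDR": rng.integers(1, 3, size=n),
    "MORTSTAT": rng.integers(0, 2, size=n),
})
num_cols = ["RIDAGEYR", "BMXBMI"]

# Skewness: values far from 0 suggest a transform (e.g. log) may help.
skewness = df[num_cols].skew()

# Outliers: share of observations outside 1.5 * IQR of the quartiles.
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
outlier_share = ((df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)).mean()

# Rank correlation is less sensitive to skew and outliers than Pearson.
corr = df[num_cols].corr(method="spearman")

# Contingency table of two categorical variables, as row proportions.
table = pd.crosstab(df["RIAGENDR"], df["MORTSTAT"], normalize="index")

print(skewness, outlier_share, corr, table, sep="\n\n")
```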

In [None]:
# Boy Arya and Neha

### Part 2: $k$-NN classification/regression, write-up (50/100 pts)

Submit a notebook that clearly addresses the following, using code and markdown chunks:

1. Describe the data, particularly what an observation is and whether there are any missing data that might impact your analysis. Who collected the data and why? What known limitations are there to analysis? (10/100 pts)


In [None]:
# Eshal

'''
What is an observation? 
Each row in our merged dataset is one NHANES 1999 to 2000 participant (identified by SEQN), with their survey variables plus linked mortality information (MORTSTAT, PERMTH_INT, ELIGSTAT) from the Linked Mortality File.

Who collected the data and why?
The National Center for Health Statistics (CDC) collected NHANES using interviews and physical exams in mobile examination centers to monitor the health and nutritional status of the civilian, noninstitutionalized U.S. population and track disease risk factors over time. Mortality follow-up is added by NCHS through probabilistic linkage of NHANES participants to the National Death Index, yielding vital status, follow-up months, and cause of death up to a fixed censoring date.

Missing data that might impact our analysis
Important missingness includes: (a) participants who are not linkage-eligible (missing mortality outcomes), (b) public-use perturbation of some follow-up times and causes of death, and (c) substantial item nonresponse and subsampling in NHANES exam/lab components, which can leave many predictors missing and reduce the usable sample size if we rely on complete-case analysis.

Known limitations
Key limitations are:
1. The complex survey design and weights are not fully incorporated in standard kNN, so our models are predictive for this sample, not strictly nationally representative.
2. Mortality outcomes involve probabilistic linkage and some data perturbation, which introduce outcome error and cap achievable performance.
3. PERMTH_INT is a right-censored follow-up time, but we treat it as a fully observed regression target instead of using survival methods, which can bias life expectancy predictions.
4. Many risk factors are measured once and with error, and we only use a subset of available variables, so important predictors may be omitted, and models trained on 1999 to 2000 data may not generalize to today's population.

'''
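
The censoring limitation noted above can be quantified with a quick tabulation. This is a sketch on made-up values; it assumes that `MORTSTAT == 0` marks participants still alive at the end of follow-up, so their `PERMTH_INT` is only a lower bound on survival time. Confirm that coding in the codebook before using it.

```python
import numpy as np
import pandas as pd

# Toy linked data; values are made up.
df = pd.DataFrame({
    "MORTSTAT": [1.0, 0.0, 0.0, 1.0, 0.0, np.nan],
    "PERMTH_INT": [177.0, 244.0, 240.0, 60.0, 238.0, np.nan],
})

# Keep only participants with a linked outcome.
linked = df.dropna(subset=["MORTSTAT", "PERMTH_INT"])

# Assumed coding: MORTSTAT == 0 means the follow-up time is right-censored.
share_censored = (linked["MORTSTAT"] == 0).mean()

# Compare follow-up months for censored (0) and observed-death (1) rows.
summary = linked.groupby("MORTSTAT")["PERMTH_INT"].agg(["count", "mean", "max"])

print(f"Share of censored follow-up times: {share_censored:.2f}")
print(summary)
```

A large censored share means a regression on `PERMTH_INT` mostly predicts the length of follow-up rather than the length of life.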

2. Describe the variables you selected to predict mortality and life expectancy, and the rationale behind them. Analyze your variables using describe tables, kernel densities, scatter plots, and conditional kernel densities. Are there any patterns of interest to notice? (10/100 pts)
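
As a starting point, a describe table and a conditional kernel density can be produced as sketched below. The data are synthetic, with ages drawn so that the two `MORTSTAT` groups differ by construction, and `scipy.stats.gaussian_kde` with its default bandwidth is one choice among several (seaborn's `kdeplot` is a common alternative).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Synthetic data: older ages are more common among the deceased by construction.
rng = np.random.default_rng(1)
n = 400
mort = rng.integers(0, 2, size=n)
age = np.where(mort == 1, rng.normal(70, 10, n), rng.normal(45, 15, n))
df = pd.DataFrame({"MORTSTAT": mort, "RIDAGEYR": age})

# Describe table of age, conditional on mortality status.
summary = df.groupby("MORTSTAT")["RIDAGEYR"].describe()
print(summary)

# Conditional kernel densities evaluated on a common grid of ages.
grid = np.linspace(df["RIDAGEYR"].min(), df["RIDAGEYR"].max(), 200)
fig, ax = plt.subplots()
densities = {}
for status, group in df.groupby("MORTSTAT"):
    densities[status] = gaussian_kde(group["RIDAGEYR"].to_numpy())(grid)
    ax.plot(grid, densities[status], label=f"MORTSTAT = {status}")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Density")
ax.legend()
```

Densities that separate clearly across groups suggest the variable carries signal for the classifier; heavily overlapping densities suggest it does not.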

In [None]:
# Heer and Shalmali

3. Use your variables to predict mortality with a $k$-Nearest Neighbor Classifier. Analyze its performance and explain clearly how you select $k$. (10/100 pts)
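
One common way to select $k$ is cross-validation on the training set, sketched below with scikit-learn. The data are synthetic, and the two features, the grid of $k$ values, the accuracy metric, and the 75/25 split are all assumptions you should adapt. Features are standardized inside the pipeline because $k$-NN distances depend on scale.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the probability of death rises with age by construction.
rng = np.random.default_rng(2)
n = 600
age = rng.uniform(18, 85, n)
bmi = rng.normal(27, 5, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(age - 65) / 8)))
X = np.column_stack([age, bmi])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 52, 5))}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_test, y_test)                 # hard classification accuracy
cm = confusion_matrix(y_test, search.predict(X_test))   # rows: truth, columns: prediction
soft = search.predict_proba(X_test)[:, 1]               # soft classification: P(death)

print(f"best k = {best_k}, test accuracy = {test_acc:.3f}")
print(cm)
```

When deaths are much rarer than survivals, accuracy alone can be misleading, so the confusion matrix and the predicted probabilities deserve a close look.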

In [None]:
# Shalmali

4. Use your variables to predict life expectancy with a $k$-Nearest Neighbor Regressor. Analyze its performance and explain clearly how you select $k$. (10/100 pts)
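
The same cross-validation idea carries over to regression, with RMSE in place of accuracy. The sketch below uses a synthetic follow-up time in months; the features, the grid of $k$ values, and the comparison against a predict-the-mean baseline are illustrative assumptions, and it does not address how censored follow-up times should be handled.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic target: follow-up months decline with age, plus noise.
rng = np.random.default_rng(3)
n = 600
age = rng.uniform(18, 85, n)
bmi = rng.normal(27, 5, n)
y = np.clip(260 - 2.0 * (age - 18) + rng.normal(0, 25, n), 1, 250)
X = np.column_stack([age, bmi])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validated RMSE for each candidate k.
cv_rmse = {}
for k in range(1, 52, 5):
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[k] = -scores.mean()
best_k = min(cv_rmse, key=cv_rmse.get)

# Refit on the full training set and compare with predicting the training mean.
final = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=best_k)).fit(X_train, y_train)
test_rmse = np.sqrt(mean_squared_error(y_test, final.predict(X_test)))
baseline_rmse = np.sqrt(np.mean((y_test - y_train.mean()) ** 2))

print(f"best k = {best_k}, test RMSE = {test_rmse:.1f}, baseline RMSE = {baseline_rmse:.1f}")
```

Plotting `cv_rmse` against $k$ is a simple way to show how the choice was made.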

In [None]:
# Heer

5. Describe how your model could be used for health interventions based on patient characteristics. Are there any limitations or risks to consider? (10/100 pts)

In [None]:
# Arya


## Submission (10/100 pts)

Submit your work in a well-organized GitHub repo, where the code is appropriately commented and all members of the group have made significant contributions to the commit history. (10/100 pts)