#### Motivation

In this project, we are interested in predicting life expectancy for the age group under 1 year old, that is, the "life expectancy at birth."

Let's first build the dataset for a single year, so that we can test the process on a small dataset before applying it to a larger one.

Once we have the dataset for each individual year, we can then combine them all into a single, larger dataset that includes the information for all the years.
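The combining step could look roughly like the sketch below. It assumes the per-year files follow the `life_expectancy_{year}.csv` naming used for the processed output in this notebook. The helper name `combine_yearly_files` and its parameters are hypothetical, introduced only for illustration. It also assumes every file has the same columns. If the per-year files do not already carry a year column, one would need to be added before concatenating.

```python
from pathlib import Path

import pandas as pd


def combine_yearly_files(processed_dir, years):
    """Concatenate per-year CSV files into one DataFrame (hypothetical helper)."""
    frames = []
    for year in years:
        # Assumed naming scheme: life_expectancy_<year>.csv inside processed_dir
        path = Path(processed_dir) / f"life_expectancy_{year}.csv"
        frames.append(pd.read_csv(path))
    # ignore_index=True replaces the per-file indexes with one fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Possible usage once the per-year files have been written:
# combined = combine_yearly_files("../data/processed/le_single_year", range(2010, 2020))
```

Reading each file into a list and calling `pd.concat` once avoids growing a DataFrame repeatedly inside the loop.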

The function below extracts the dataset for a single year. Only the rows with 'age_name' == '<1 year' and 'race_name' == 'Total' are kept; rows with missing values and rows that are not county-level are then dropped.

In [7]:
import pandas as pd 

In [9]:
def process_life_expectancy_data(year):
    """
    Processes life expectancy data for a specific year.
    
    Parameters:
    year (int): The year for which the data needs to be processed (e.g., 2010).
    
    Returns:
    None: Saves the processed CSV file in the specified directory.
    """
    # Construct the input file path dynamically
    input_file_path = f"../data/raw/IHME_USA_COD_COUNTY_RACE_ETHN_2000_2019_LT_BOTH 2/IHME_USA_COD_COUNTY_RACE_ETHN_2000_2019_LT_{year}_ALL_BOTH_Y2023M06D12.CSV"
    
    # Read the raw data
    df1 = pd.read_csv(input_file_path)

    # Filter for rows where race_name is 'Total' and age_name is '<1 year'
    df2 = df1[(df1['race_name'] == 'Total') & (df1['age_name'] == '<1 year')]

    # Drop rows that contain any missing values
    df3 = df2.dropna()

    # Keep only county-level data (fips > 60)
    df4 = df3[(df3['fips'] > 60)]

    # Drop unnecessary columns
    columns_to_drop = ['measure_id', 'location_id', 'fips', 'measure_name', 'race_id',
                       'race_name', 'sex_id', 'sex_name', 'age_group_id', 'age_name',
                       'metric_id', 'metric_name', 'upper', 'lower']
    df5 = df4.drop(columns=columns_to_drop)

    # Rename the 'val' column to 'mean_life_expectancy'
    df5 = df5.rename(columns={'val': 'mean_life_expectancy'})

    # Output file path for the processed CSV
    output_file_path = f"../data/processed/le_single_year/life_expectancy_{year}.csv"

    # Save the processed data to a CSV file
    df5.to_csv(output_file_path, index=False)

    print(f"Processed life expectancy data for {year} saved to {output_file_path}")

# Example usage:
# process_life_expectancy_data(2010)
# process_life_expectancy_data(2019)

In [39]:
process_life_expectancy_data(2010)

Processed life expectancy data for 2010 saved to ../data/processed/le_single_year/life_expectancy_2010.csv


In [11]:
for year in range(2010,2020):
    process_life_expectancy_data(year)

Processed life expectancy data for 2010 saved to ../data/processed/le_single_year/life_expectancy_2010.csv
Processed life expectancy data for 2011 saved to ../data/processed/le_single_year/life_expectancy_2011.csv
Processed life expectancy data for 2012 saved to ../data/processed/le_single_year/life_expectancy_2012.csv
Processed life expectancy data for 2013 saved to ../data/processed/le_single_year/life_expectancy_2013.csv
Processed life expectancy data for 2014 saved to ../data/processed/le_single_year/life_expectancy_2014.csv
Processed life expectancy data for 2015 saved to ../data/processed/le_single_year/life_expectancy_2015.csv
Processed life expectancy data for 2016 saved to ../data/processed/le_single_year/life_expectancy_2016.csv
Processed life expectancy data for 2017 saved to ../data/processed/le_single_year/life_expectancy_2017.csv
Processed life expectancy data for 2018 saved to ../data/processed/le_single_year/life_expectancy_2018.csv
Processed life expectancy data for 20