## Predict heart disease rate by county in the US: Regression

The goal is to predict the heart disease mortality rate (deaths per 100,000 individuals) for each county in the United States from other socioeconomic indicators. The data was scraped from the USDA ERS website.

The target column, `heart_disease_mortality_per_100k`, is in the 'Training_labels.csv' file.

For reference, the original Kaggle dataset is available here: [Microsoft Data Science Capstone](https://www.kaggle.com/nandvard/microsoft-data-science-capstone).

---

In [8]:
# Import necessary packages for data wrangling
import pandas as pd
import numpy as np
import os

# show all columns
pd.set_option('display.max_columns', None)

In [10]:
# Check out all the files used for this project
for file in os.listdir('../Data/predict heart disease rate'):
    print(file)

Test_values.csv
Training_labels.csv
Training_values.csv


### <font color='teal'> For this project, we will only be using the two 'Training_xxx' files.

We will not use the 'Test_values.csv' file because it has no target values associated with it. This dataset was originally built for a Kaggle competition, where 'Test_values.csv' was used to generate the prediction file submitted to the competition.

In [338]:
# Load the training dataset
df = pd.read_csv('../Data/predict heart disease rate/Training_values.csv')
df.head(2)

Unnamed: 0,row_id,area__rucc,area__urban_influence,econ__economic_typology,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__homicides_per_100k,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,yr
0,0,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Manufacturing-dependent,0.408,0.057,0.254,0.066,0.516,0.235,0.176,0.109,0.039,0.829,0.004,0.011,0.194223,0.424303,0.227092,0.154382,12,12,0.297,0.23,0.131,0.089,,0.332,13.0,2.8,15.09,1650.0,1489.0,a
1,1,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Mining-dependent,0.556,0.039,0.26,0.143,0.503,0.272,0.101,0.41,0.07,0.493,0.008,0.015,0.164134,0.234043,0.342452,0.259372,19,7,0.288,0.19,0.09,0.082,0.181,0.265,10.0,2.3,19.79,2010.0,2480.0,a


In [339]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3198 entries, 0 to 3197
Data columns (total 34 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   row_id                                            3198 non-null   int64  
 1   area__rucc                                        3198 non-null   object 
 2   area__urban_influence                             3198 non-null   object 
 3   econ__economic_typology                           3198 non-null   object 
 4   econ__pct_civilian_labor                          3198 non-null   float64
 5   econ__pct_unemployment                            3198 non-null   float64
 6   econ__pct_uninsured_adults                        3196 non-null   float64
 7   econ__pct_uninsured_children                      3196 non-null   float64
 8   demo__pct_female                                  3196 non-null   float64
 9   demo__pct_below_18_

In [7]:
df.describe()

Unnamed: 0,row_id,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__homicides_per_100k,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician
count,3198.0,3198.0,3198.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3198.0,3198.0,3198.0,3198.0,3198.0,3198.0,3196.0,2734.0,3196.0,3016.0,2220.0,3196.0,3170.0,1231.0,2781.0,2954.0,2968.0
mean,3116.985616,0.467191,0.059696,0.217463,0.086067,0.498811,0.227715,0.170043,0.090207,0.091046,0.769989,0.02468,0.013109,0.148815,0.350567,0.301143,0.199475,11.676986,10.301126,0.307668,0.213628,0.10926,0.083896,0.164841,0.277161,11.625868,5.947498,21.132618,3431.433649,2551.339286
std,1830.236781,0.0744,0.022947,0.067362,0.039849,0.024399,0.034282,0.043694,0.142763,0.147165,0.20785,0.084563,0.025431,0.068208,0.070554,0.052318,0.089308,2.739516,2.786143,0.043228,0.062895,0.023216,0.022251,0.050474,0.053003,1.557996,5.031822,10.485923,2569.450603,2100.459467
min,0.0,0.207,0.01,0.046,0.012,0.278,0.092,0.045,0.0,0.0,0.053,0.0,0.0,0.015075,0.065327,0.109548,0.011078,4.0,0.0,0.131,0.046,0.032,0.033,0.038,0.09,7.0,-0.4,3.14,339.0,189.0
25%,1504.25,0.42,0.044,0.166,0.057,0.493,0.206,0.141,0.019,0.006,0.649,0.002,0.002,0.096588,0.305357,0.264861,0.139234,10.0,8.0,0.284,0.172,0.094,0.069,0.13,0.24275,10.0,2.62,13.49,1812.25,1420.0
50%,3113.5,0.468,0.057,0.216,0.077,0.503,0.226,0.167,0.035,0.022,0.853,0.007,0.007,0.133234,0.355015,0.301587,0.176471,11.0,10.0,0.309,0.21,0.109,0.081,0.164,0.28,12.0,4.7,19.63,2690.0,1999.0
75%,4724.75,0.514,0.072,0.261,0.106,0.512,0.246,0.195,0.087,0.096,0.93625,0.014,0.013,0.194796,0.399554,0.33659,0.231354,13.0,12.0,0.334,0.249,0.124,0.095,0.197,0.313,13.0,7.89,26.49,4089.75,2859.0
max,6276.0,1.0,0.248,0.496,0.281,0.573,0.417,0.346,0.932,0.858,0.99,0.859,0.341,0.473526,0.558912,0.473953,0.798995,29.0,27.0,0.471,0.513,0.203,0.238,0.367,0.442,15.0,50.49,110.45,28130.0,23399.0


---
#### <font color='teal'> Since our target column is not in this CSV file, we will add it to our dataframe using the `Training_labels.csv` file

In [187]:
target_vals = pd.read_csv('../Data/predict heart disease rate/Training_labels.csv')
target_vals.head()

Unnamed: 0,row_id,heart_disease_mortality_per_100k
0,0,312
1,1,257
2,4,195
3,5,218
4,6,355


In [340]:
# merge the two dataframes on their shared 'row_id' column
df = pd.merge(df, target_vals, on='row_id')
df.head()

Unnamed: 0,row_id,area__rucc,area__urban_influence,econ__economic_typology,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__homicides_per_100k,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,yr,heart_disease_mortality_per_100k
0,0,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Manufacturing-dependent,0.408,0.057,0.254,0.066,0.516,0.235,0.176,0.109,0.039,0.829,0.004,0.011,0.194223,0.424303,0.227092,0.154382,12,12,0.297,0.23,0.131,0.089,,0.332,13.0,2.8,15.09,1650.0,1489.0,a,312
1,1,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Mining-dependent,0.556,0.039,0.26,0.143,0.503,0.272,0.101,0.41,0.07,0.493,0.008,0.015,0.164134,0.234043,0.342452,0.259372,19,7,0.288,0.19,0.09,0.082,0.181,0.265,10.0,2.3,19.79,2010.0,2480.0,a,257
2,4,Metro - Counties in metro areas of 1 million p...,Large-in a metro area with at least 1 million ...,Nonspecialized,0.541,0.057,0.07,0.023,0.522,0.179,0.115,0.202,0.198,0.479,0.013,0.085,0.158573,0.237859,0.186323,0.417245,12,6,0.212,0.156,0.084,0.098,0.195,0.209,10.0,9.31,3.14,629.0,690.0,b,195
3,5,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore adjacent to a small metro with town of...,Nonspecialized,0.5,0.061,0.203,0.059,0.525,0.2,0.164,0.013,0.049,0.897,0.007,0.001,0.181637,0.407186,0.248503,0.162675,11,12,0.285,,0.104,0.058,,0.238,13.0,,,1810.0,6630.0,b,218
4,6,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore not adjacent to a metro/micro area and...,Nonspecialized,0.471,0.05,0.225,0.103,0.511,0.237,0.171,0.025,0.008,0.953,0.003,0.0,0.122367,0.41324,0.306921,0.157472,14,12,0.284,0.234,0.137,0.07,0.194,0.29,9.0,,29.39,3489.0,2590.0,a,355


In [341]:
# Print the shape of df
df.shape

(3198, 35)
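
The merge above relies on every `row_id` appearing exactly once in both files. A minimal sketch of how that assumption could be checked is shown below. It uses three rows copied from the outputs above rather than the real CSV files, and the names `values`, `labels`, `merged`, and `unmatched` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for Training_values.csv and Training_labels.csv
# (three rows copied from the outputs above, one feature column only)
values = pd.DataFrame({'row_id': [0, 1, 4],
                       'econ__pct_unemployment': [0.057, 0.039, 0.057]})
labels = pd.DataFrame({'row_id': [0, 1, 4],
                       'heart_disease_mortality_per_100k': [312, 257, 195]})

# validate='one_to_one' raises a MergeError if row_id is duplicated on either side;
# indicator=True adds a '_merge' column recording which side each row came from
merged = pd.merge(values, labels, on='row_id', how='outer',
                  validate='one_to_one', indicator=True)

# rows that exist in only one of the two frames
unmatched = merged[merged['_merge'] != 'both']
print(len(unmatched))

# drop the helper column once the check has passed
merged = merged.drop(columns='_merge')
print(merged.shape)
```

If `unmatched` is empty and no error is raised, an inner merge on `row_id` keeps every row, which is consistent with the shape reported above.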

### <font color='teal'> Reviewing the columns

#### <font color='teal'> Column descriptions
    
| Column | Description |
| --- | --- |
| **area__** | **information about the county** |
| area__rucc | Rural-Urban Continuum Codes "form a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area. The official Office of Management and Budget (OMB) metro and nonmetro categories have been subdivided into three metro and six nonmetro categories. Each county in the U.S. is assigned one of the 9 codes." (USDA Economic Research Service, https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/) |
| area__urban_influence | Urban Influence Codes "form a classification scheme that distinguishes metropolitan counties by population size of their metro area, and nonmetropolitan counties by size of the largest city or town and proximity to metro and micropolitan areas." (USDA Economic Research Service, https://www.ers.usda.gov/data-products/urban-influence-codes/) |
| **econ__** | **economic indicators** |
| econ__economic_typology | County Typology Codes "classify all U.S. counties according to six mutually exclusive categories of economic dependence and six overlapping categories of policy-relevant themes. The economic dependence types include farming, mining, manufacturing, Federal/State government, recreation, and nonspecialized counties. The policy-relevant types include low education, low employment, persistent poverty, persistent child poverty, population loss, and retirement destination." (USDA Economic Research Service, https://www.ers.usda.gov/data-products/county-typology-codes.aspx) |
| econ__pct_civilian_labor | Civilian labor force, annual average, as percent of population (Bureau of Labor Statistics, http://www.bls.gov/lau/) |
| econ__pct_unemployment | Unemployment, annual average, as percent of population (Bureau of Labor Statistics, http://www.bls.gov/lau/) |
| econ__pct_uninsured_adults | Percent of adults without health insurance (Bureau of Labor Statistics, http://www.bls.gov/lau/) |
| econ__pct_uninsured_children | Percent of children without health insurance (Bureau of Labor Statistics, http://www.bls.gov/lau/) |
| **health__** | **health indicators** |
| health__pct_adult_obesity | Percent of adults who meet clinical definition of obese (National Center for Chronic Disease Prevention and Health Promotion) |
| health__pct_adult_smoking | Percent of adults who smoke (Behavioral Risk Factor Surveillance System) |
| health__pct_diabetes | Percent of population with diabetes (National Center for Chronic Disease Prevention and Health Promotion, Division of Diabetes Translation) |
| health__pct_low_birthweight | Percent of babies born with low birth weight (National Center for Health Statistics) |
| health__pct_excessive_drinking | Percent of adult population that engages in excessive consumption of alcohol (Behavioral Risk Factor Surveillance System) |
| health__pct_physical_inacticity | Percent of adult population that is physically inactive (National Center for Chronic Disease Prevention and Health Promotion) |
| health__air_pollution_particulate_matter | Fine particulate matter in µg/m³ (CDC WONDER, https://wonder.cdc.gov/wonder/help/pm.html) |
| health__homicides_per_100k | Deaths by homicide per 100,000 population (National Center for Health Statistics) |
| health__motor_vehicle_crash_deaths_per_100k | Deaths by motor vehicle crash per 100,000 population (National Center for Health Statistics) |
| health__pop_per_dentist | Population per dentist (HRSA Area Resource File) |
| health__pop_per_primary_care_physician | Population per Primary Care Physician (HRSA Area Resource File) |
| **demo__** | **demographics information** |
| demo__pct_female | Percent of population that is female (US Census Population Estimates) |
| demo__pct_below_18_years_of_age | Percent of population that is below 18 years of age (US Census Population Estimates) |
| demo__pct_aged_65_years_and_older | Percent of population that is aged 65 years or older (US Census Population Estimates) |
| demo__pct_hispanic | Percent of population that identifies as Hispanic (US Census Population Estimates) |
| demo__pct_non_hispanic_african_american | Percent of population that identifies as African American (US Census Population Estimates) |
| demo__pct_non_hispanic_white | Percent of population that identifies as non-Hispanic White (US Census Population Estimates) |
| demo__pct_american_indian_or_alaskan_native | Percent of population that identifies as Native American (US Census Population Estimates) |
| demo__pct_asian | Percent of population that identifies as Asian (US Census Population Estimates) |
| demo__pct_adults_less_than_a_high_school_diploma | Percent of adult population that does not have a high school diploma (US Census, American Community Survey) |
| demo__pct_adults_with_high_school_diploma | Percent of adult population which has a high school diploma as highest level of education achieved (US Census, American Community Survey) |
| demo__pct_adults_with_some_college | Percent of adult population which has some college as highest level of education achieved (US Census, American Community Survey) |
| demo__pct_adults_bachelors_or_higher | Percent of adult population which has a bachelor's degree or higher as highest level of education achieved (US Census, American Community Survey) |
| demo__birth_rate_per_1k | Births per 1,000 of population (US Census Population Estimates) |
| demo__death_rate_per_1k | Deaths per 1,000 of population (US Census Population Estimates) |

In [342]:
# collect the name, datatype, NaN count, and percent NaN for each column
dtypes = pd.DataFrame(df.dtypes.values, columns=['dtype'])
dcolumns = pd.DataFrame(df.columns, columns=['Column Name'])
dnull = pd.DataFrame(df.isnull().sum().values, columns=['NaN count'])
dnullpct = pd.DataFrame(data=(100 * (df.isnull().sum() / len(df))).values, columns=['Percent NaN'])

# print column names with associated datatype and percent of null values
df_defs = pd.concat([dcolumns, dtypes, dnull, dnullpct], axis=1)
df_defs

Unnamed: 0,Column Name,dtype,NaN count,Percent NaN
0,row_id,int64,0,0.0
1,area__rucc,object,0,0.0
2,area__urban_influence,object,0,0.0
3,econ__economic_typology,object,0,0.0
4,econ__pct_civilian_labor,float64,0,0.0
5,econ__pct_unemployment,float64,0,0.0
6,econ__pct_uninsured_adults,float64,2,0.062539
7,econ__pct_uninsured_children,float64,2,0.062539
8,demo__pct_female,float64,2,0.062539
9,demo__pct_below_18_years_of_age,float64,2,0.062539


In [224]:
# Create a dictionary with the categorical columns and the number of unique values
df_cat = {col:df[col].nunique() for col in df.columns if df[col].dtype == 'O'}
df_cat

{'area__rucc': 9,
 'area__urban_influence': 12,
 'econ__economic_typology': 6,
 'yr': 2}

## <font color='teal'> Data Cleaning
    
Handling the missing values

In [225]:
# again review the columns with missing values
df_defs[df_defs['Percent NaN'] > 0]

Unnamed: 0,Column Name,dtype,NaN count,Percent NaN
6,econ__pct_uninsured_adults,float64,2,0.062539
7,econ__pct_uninsured_children,float64,2,0.062539
8,demo__pct_female,float64,2,0.062539
9,demo__pct_below_18_years_of_age,float64,2,0.062539
10,demo__pct_aged_65_years_and_older,float64,2,0.062539
11,demo__pct_hispanic,float64,2,0.062539
12,demo__pct_non_hispanic_african_american,float64,2,0.062539
13,demo__pct_non_hispanic_white,float64,2,0.062539
14,demo__pct_american_indian_or_alaskan_native,float64,2,0.062539
15,demo__pct_asian,float64,2,0.062539


#### <font color='teal'> Many columns have only 2 missing values; we will ignore those for now. Let's start with the column that has the most missing data: `health__homicides_per_100k`

In [193]:
df['health__homicides_per_100k'].describe()

count    1231.000000
mean        5.947498
std         5.031822
min        -0.400000
25%         2.620000
50%         4.700000
75%         7.890000
max        50.490000
Name: health__homicides_per_100k, dtype: float64

<font color='teal'> This column is missing more than 61% of its values, and nothing indicates that a NaN stands for a rate of 0, so we will drop the column.

In [343]:
# drop the 'health__homicides_per_100k' column
df.drop('health__homicides_per_100k', axis=1, inplace=True)
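
The same decision can be expressed as a rule over all columns. The sketch below is illustrative only: it runs on a small made-up frame, and the 60% cutoff (`threshold`) is an assumed value chosen so that it mirrors the single drop made above.

```python
import numpy as np
import pandas as pd

# Made-up frame: one mostly empty column, one with a few gaps, one complete
toy = pd.DataFrame({
    'mostly_missing': [1.0, np.nan, np.nan, np.nan, np.nan],
    'some_missing': [1.0, 2.0, np.nan, 4.0, 5.0],
    'complete': [1, 2, 3, 4, 5],
})

# share of missing values per column (the mean of a boolean mask is a fraction)
missing_share = toy.isnull().mean()

# hypothetical cutoff: drop any column missing more than 60% of its values
threshold = 0.6
to_drop = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(list(trimmed.columns))
```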

<font color='teal'> Let's look at all the columns with less than 10% missing values.
    
We will create a table that can be used to get the average value for each column based on 3 categorical columns
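
The idea of filling a gap with the mean of its group can be seen on a small made-up frame. This sketch uses `groupby(...).transform('mean')`, a common pandas alternative to a lookup table. It is not the approach taken in the cells that follow, and the column names `group`, `value`, and `value_filled` are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up data: two groups, each with one missing value
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, np.nan, 10.0, np.nan],
})

# mean of 'value' within each group, broadcast back to every row of that group
# (NaN entries are skipped when the mean is computed)
group_mean = toy.groupby('group')['value'].transform('mean')

# keep existing values and fill only the gaps with the group mean
toy['value_filled'] = toy['value'].fillna(group_mean)
print(toy)
```

Grouping by several columns works the same way by passing a list of column names to `groupby`.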

In [344]:
# Create a table that produces the mean values for each column based on:
# 'area__rucc', 'area__urban_influence', and 'econ__economic_typology'
# numeric_only=True skips the remaining text column ('yr'), which cannot be averaged
mean_table = df.groupby(['area__rucc', 'area__urban_influence', 'econ__economic_typology']).mean(numeric_only=True)

In [345]:
# Create a function to be used in the .apply() method to retrieve mean values for missing data
def fillna_vals(val, table, columns=None, id_col=None):
    '''
    Return val unchanged unless it is NaN, in which case look up a replacement in table.
    val - Value from the target column that may need filling
    table - Table that will be used for indexing from and retrieving a value
    columns - Default is None. List of index labels used to locate the row in the given table argument.
    id_col - Default is None. Used to identify a column from a table that has multiple target columns with values.
    '''
    # values that are present are returned unchanged
    if not pd.isnull(val):
        return val
    # a missing index label cannot be looked up, so the value stays NaN
    if columns is None or any(pd.isnull(c) for c in columns):
        return np.nan
    # .xs() takes a tuple (not a list) to address one row of a MultiIndex
    row = table.xs(tuple(columns))
    if id_col is None:
        # the table has a single value column: return its only entry
        return float(row.iloc[0])
    return float(row[id_col])

In [346]:
# Create a df with columns that are less than 10% missing values (ignoring columns with only 2 missing values!)
df_10pct = df_defs[(df_defs['Percent NaN'] > 0) & (df_defs['Percent NaN'] <= 10) & (df_defs['NaN count'] != 2)]
df_10pct

Unnamed: 0,Column Name,dtype,NaN count,Percent NaN
25,health__pct_low_birthweight,float64,182,5.691057
28,health__air_pollution_particulate_matter,float64,28,0.875547
31,health__pop_per_dentist,float64,244,7.629769
32,health__pop_per_primary_care_physician,float64,230,7.191995


In [347]:
for col in df_10pct['Column Name']:
    df[col] = df.apply(lambda x: fillna_vals(x[col], mean_table, columns=[x['area__rucc'], x['area__urban_influence'],
                                                                          x['econ__economic_typology']], id_col=col), axis=1)

In [348]:
df.head()

Unnamed: 0,row_id,area__rucc,area__urban_influence,econ__economic_typology,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,yr,heart_disease_mortality_per_100k
0,0,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Manufacturing-dependent,0.408,0.057,0.254,0.066,0.516,0.235,0.176,0.109,0.039,0.829,0.004,0.011,0.194223,0.424303,0.227092,0.154382,12,12,0.297,0.23,0.131,0.089,,0.332,13.0,15.09,1650.0,1489.0,a,312
1,1,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Mining-dependent,0.556,0.039,0.26,0.143,0.503,0.272,0.101,0.41,0.07,0.493,0.008,0.015,0.164134,0.234043,0.342452,0.259372,19,7,0.288,0.19,0.09,0.082,0.181,0.265,10.0,19.79,2010.0,2480.0,a,257
2,4,Metro - Counties in metro areas of 1 million p...,Large-in a metro area with at least 1 million ...,Nonspecialized,0.541,0.057,0.07,0.023,0.522,0.179,0.115,0.202,0.198,0.479,0.013,0.085,0.158573,0.237859,0.186323,0.417245,12,6,0.212,0.156,0.084,0.098,0.195,0.209,10.0,3.14,629.0,690.0,b,195
3,5,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore adjacent to a small metro with town of...,Nonspecialized,0.5,0.061,0.203,0.059,0.525,0.2,0.164,0.013,0.049,0.897,0.007,0.001,0.181637,0.407186,0.248503,0.162675,11,12,0.285,,0.104,0.058,,0.238,13.0,,1810.0,6630.0,b,218
4,6,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore not adjacent to a metro/micro area and...,Nonspecialized,0.471,0.05,0.225,0.103,0.511,0.237,0.171,0.025,0.008,0.953,0.003,0.0,0.122367,0.41324,0.306921,0.157472,14,12,0.284,0.234,0.137,0.07,0.194,0.29,9.0,29.39,3489.0,2590.0,a,355


In [349]:
# Check the null count for all the columns in df
df.isnull().sum()

row_id                                                0
area__rucc                                            0
area__urban_influence                                 0
econ__economic_typology                               0
econ__pct_civilian_labor                              0
econ__pct_unemployment                                0
econ__pct_uninsured_adults                            2
econ__pct_uninsured_children                          2
demo__pct_female                                      2
demo__pct_below_18_years_of_age                       2
demo__pct_aged_65_years_and_older                     2
demo__pct_hispanic                                    2
demo__pct_non_hispanic_african_american               2
demo__pct_non_hispanic_white                          2
demo__pct_american_indian_or_alaskan_native           2
demo__pct_asian                                       2
demo__pct_adults_less_than_a_high_school_diploma      0
demo__pct_adults_with_high_school_diploma       

<font color='teal'> Now let's look at the remaining columns with more than 10% missing values

In [350]:
df_GT10pct = df_defs[(df_defs['Percent NaN'] > 10) & (df_defs['Percent NaN'] < 60)]
df_GT10pct

Unnamed: 0,Column Name,dtype,NaN count,Percent NaN
23,health__pct_adult_smoking,float64,464,14.509068
26,health__pct_excessive_drinking,float64,978,30.581614
30,health__motor_vehicle_crash_deaths_per_100k,float64,417,13.0394


<font color='teal'> Let's review the 'health__pct_excessive_drinking' column

In [351]:
df['health__pct_excessive_drinking'].describe()

count    2220.000000
mean        0.164841
std         0.050474
min         0.038000
25%         0.130000
50%         0.164000
75%         0.197000
max         0.367000
Name: health__pct_excessive_drinking, dtype: float64

In [352]:
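# determine which columns are most correlated to this column
# (restricted to numeric columns, since newer pandas versions no longer skip text columns in .corr();
#  the last entry is dropped because it is the column's correlation with itself)
df.select_dtypes(include='number').corr()['health__pct_excessive_drinking'].sort_values()[:-1]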

demo__pct_adults_less_than_a_high_school_diploma   -0.412436
health__pct_diabetes                               -0.384207
heart_disease_mortality_per_100k                   -0.382172
health__pct_low_birthweight                        -0.378372
econ__pct_uninsured_adults                         -0.340959
health__pct_physical_inacticity                    -0.337469
demo__pct_non_hispanic_african_american            -0.284243
health__pct_adult_obesity                          -0.235439
demo__pct_female                                   -0.215347
econ__pct_unemployment                             -0.214149
health__air_pollution_particulate_matter           -0.195515
health__motor_vehicle_crash_deaths_per_100k        -0.175239
demo__death_rate_per_1k                            -0.149701
demo__pct_below_18_years_of_age                    -0.125961
health__pop_per_dentist                            -0.114424
demo__birth_rate_per_1k                            -0.108256
health__pct_adult_smokin

<font color='teal'> Of the correlated columns, `econ__pct_civilian_labor` is the most positively correlated and `demo__pct_adults_less_than_a_high_school_diploma` is the most negatively correlated.
    
**We will use these columns to fill NaN values with reasonably correlated mean values.**
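
When picking helper columns this way, it can be convenient to rank correlations by absolute value so that strong negative and strong positive relationships appear together. The following is a sketch on made-up numbers; the column names `target`, `pos`, `neg`, `noise`, and `label` are placeholders, not columns from the dataset.

```python
import pandas as pd

# Made-up frame: 'pos' rises with 'target', 'neg' falls with it, 'noise' is unrelated
toy = pd.DataFrame({
    'target': [1, 2, 3, 4, 5, 6],
    'pos': [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    'neg': [6, 4, 5, 3, 1, 2],
    'noise': [2, 5, 1, 5, 1, 4],
    'label': ['a', 'b', 'a', 'b', 'a', 'b'],  # text column, excluded below
})

# correlation of every numeric column with 'target', without the self-correlation
corr = toy.select_dtypes(include='number').corr()['target'].drop('target')

# reorder by strength of the relationship while keeping the original sign
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```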

In [289]:
# Review the .describe() method for the two correlated columns to determine how to chunk the data into categorical columns
# example: 0-5%, 5-10%, etc...
df[['econ__pct_civilian_labor', 'demo__pct_adults_less_than_a_high_school_diploma']].describe()

Unnamed: 0,econ__pct_civilian_labor,demo__pct_adults_less_than_a_high_school_diploma
count,3198.0,3198.0
mean,0.467191,0.148815
std,0.0744,0.068208
min,0.207,0.015075
25%,0.42,0.096588
50%,0.468,0.133234
75%,0.514,0.194796
max,1.0,0.473526


<font color='teal'> Based on the summary above, these two columns will be binned into 2 new categorical columns as follows:
    
| Column | range, mean, std | Chunk categories (%) |
| --- | --- | --- |
| econ__pct_civilian_labor | 0.207-1.000, 0.467, 0.074 | 20-30, 30-40, 40-45, 45-50, 50-55, 55-60, 60-80, 80-100 |
| demo__pct_adults_less_than_a_high_school_diploma | 0.015-0.474, 0.149, 0.068 | 0-5, 5-10, 10-15, 15-20, 20-30, 30-40, 40-50 |

In [175]:
# Create the functions that will generate the new columns based on chunk size
def chunk_econ_pcl(val):
    if val > 0.20 and val <= 0.30:
        return '20-30'
    elif val > 0.30 and val <= 0.40:
        return '30-40'
    elif val > 0.40 and val <= 0.45:
        return '40-45'
    elif val > 0.45 and val <= 0.50:
        return '45-50'
    elif val > 0.50 and val <= 0.55:
        return '50-55'
    elif val > 0.55 and val <= 0.60:
        return '55-60'
    elif val > 0.60 and val <= 0.80:
        return '60-80'
    elif val > 0.80 and val <= 1.00:
        return '80-100'
    
def chunk_demo_hs_diplo(val):
    if val > 0.00 and val <= 0.05:
        return '0-5'
    elif val > 0.05 and val <= 0.10:
        return '5-10'
    elif val > 0.10 and val <= 0.15:
        return '10-15'
    elif val > 0.15 and val <= 0.20:
        return '15-20'
    elif val > 0.20 and val <= 0.30:
        return '20-30'
    elif val > 0.30 and val <= 0.40:
        return '30-40'
    elif val > 0.40 and val <= 0.50:
        return '40-50'
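
For comparison, the same binning can be sketched with `pd.cut`, which takes the bin edges and labels directly. The edges and labels below are copied from the `econ__pct_civilian_labor` row of the table above, and the sample values are the first five values of that column shown earlier. The variable names are illustrative.

```python
import pandas as pd

# Bin edges and labels for econ__pct_civilian_labor, taken from the table above
edges = [0.20, 0.30, 0.40, 0.45, 0.50, 0.55, 0.60, 0.80, 1.00]
labels = ['20-30', '30-40', '40-45', '45-50', '50-55', '55-60', '60-80', '80-100']

# first five values of econ__pct_civilian_labor from df.head()
sample = pd.Series([0.408, 0.556, 0.541, 0.5, 0.471])

# right=True builds intervals of the form (low, high], the same rule as
# `val > low and val <= high` in the functions above
binned = pd.cut(sample, bins=edges, labels=labels, right=True)
print(binned.tolist())
```

Values that fall outside every bin come back as NaN, which plays the same role as the implicit `None` returned by the hand-written functions.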

In [353]:
# create the dataframe that will be used for filtering the data
# (.copy() makes df_drink independent of df, so adding columns does not trigger a SettingWithCopyWarning)
df_drink = df[['health__pct_excessive_drinking','econ__pct_civilian_labor', 'demo__pct_adults_less_than_a_high_school_diploma']].copy()

# use the .apply() method to create new columns based on the previously created functions
df_drink['econ__pct_civlab_cat'] = df_drink['econ__pct_civilian_labor'].apply(chunk_econ_pcl)
df_drink['demo__pct_less_hs_diplo_cat'] = df_drink['demo__pct_adults_less_than_a_high_school_diploma'].apply(chunk_demo_hs_diplo)
df_drink.isnull().sum()

health__pct_excessive_drinking                      978
econ__pct_civilian_labor                              0
demo__pct_adults_less_than_a_high_school_diploma      0
econ__pct_civlab_cat                                  0
demo__pct_less_hs_diplo_cat                           0
dtype: int64

In [291]:
# view the head of the new df df_drink
df_drink.head()

Unnamed: 0,health__pct_excessive_drinking,econ__pct_civilian_labor,demo__pct_adults_less_than_a_high_school_diploma,econ__pct_civlab_cat,demo__pct_less_hs_diplo_cat
0,,0.408,0.194223,40-45,15-20
1,0.181,0.556,0.164134,55-60,15-20
2,0.195,0.541,0.158573,50-55,15-20
3,,0.5,0.181637,45-50,15-20
4,0.194,0.471,0.122367,45-50,10-15


In [354]:
# Let's create the new table that will be used to filter the mean values by the correlated features
df_drink_table = df_drink.drop(['econ__pct_civilian_labor','demo__pct_adults_less_than_a_high_school_diploma'],
                               axis=1).groupby(['econ__pct_civlab_cat','demo__pct_less_hs_diplo_cat']).mean()

In [355]:
# use the df_drink_table to retrieve values for rows that have NaN values for 'health__pct_excessive_drinking'
# using the fillna_vals() function
df['health__pct_excessive_drinking'] = df_drink.apply(lambda x: fillna_vals(x['health__pct_excessive_drinking'], df_drink_table,
                                                                      columns=[x['econ__pct_civlab_cat'], x['demo__pct_less_hs_diplo_cat']]),
                                                axis=1)

### <font color='teal'> Do the same as above for the remaining two columns with more than 10% missing values

In [356]:
# drop index 26 ('health__pct_excessive_drinking') to view the remaining columns
df_GT10pct.drop(26)

Unnamed: 0,Column Name,dtype,NaN count,Percent NaN
23,health__pct_adult_smoking,float64,464,14.509068
30,health__motor_vehicle_crash_deaths_per_100k,float64,417,13.0394


In [357]:
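# determine which columns are most correlated to this column
# (restricted to numeric columns, since newer pandas versions no longer skip text columns in .corr())
df.select_dtypes(include='number').corr()['health__pct_adult_smoking'].sort_values()[:-1]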

demo__pct_adults_bachelors_or_higher               -0.522427
econ__pct_civilian_labor                           -0.432774
demo__pct_asian                                    -0.300080
demo__pct_adults_with_some_college                 -0.220110
demo__pct_hispanic                                 -0.203549
health__pct_excessive_drinking                     -0.143918
demo__pct_female                                   -0.061542
row_id                                             -0.041828
demo__pct_below_18_years_of_age                    -0.001050
econ__pct_uninsured_children                        0.005351
demo__pct_non_hispanic_african_american             0.025176
demo__pct_aged_65_years_and_older                   0.038207
demo__birth_rate_per_1k                             0.038505
demo__pct_non_hispanic_white                        0.039222
health__air_pollution_particulate_matter            0.195959
demo__pct_american_indian_or_alaskan_native         0.206414
health__pct_low_birthwei

<font color='teal'> The most strongly correlated features are `health__pct_physical_inacticity` and `demo__pct_adults_bachelors_or_higher`.
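
Because the list above is sorted by signed value, the strongest relationships sit at both ends of the output. A small sketch that ranks by absolute correlation instead, on made-up data (the column names `target`, `pos`, `neg`, and `noise` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical columns: one positively, one negatively, and one weakly related to 'target'
toy = pd.DataFrame({
    'target': base,
    'pos': base + rng.normal(scale=0.5, size=200),
    'neg': -base + rng.normal(scale=1.0, size=200),
    'noise': rng.normal(size=200),
})

# Drop the self-correlation, then rank the remaining columns by magnitude
ranked = toy.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(ranked)
```

On recent pandas versions, a frame that still contains string columns needs `corr(numeric_only=True)`.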

In [358]:
df[['health__pct_physical_inacticity', 'demo__pct_adults_bachelors_or_higher']].describe()

Unnamed: 0,health__pct_physical_inacticity,demo__pct_adults_bachelors_or_higher
count,3196.0,3198.0
mean,0.277161,0.199475
std,0.053003,0.089308
min,0.09,0.011078
25%,0.24275,0.139234
50%,0.28,0.176471
75%,0.313,0.231354
max,0.442,0.798995


<font color='teal'> Based on the summary statistics above, each of the two columns will be binned into a new categorical column using the following ranges:

| Column | Range, mean, std | Chunk categories (%) |
| --- | --- | --- |
| health__pct_physical_inacticity | 0.090-0.442, 0.277, 0.053 | 0-10, 10-20, 20-25, 25-30, 30-35, 35-45 |
| demo__pct_adults_bachelors_or_higher | 0.011-0.799, 0.199, 0.089 | 0-10, 10-15, 15-20, 20-25, 25-30, 30-40, 40-60, 60-80 |

In [359]:
# Create the functions that will generate the new columns based on chunk size
def chunk_physinact(val):
    if val > 0.00 and val <= 0.10:
        return '0-10'
    elif val > 0.10 and val <= 0.20:
        return '10-20'
    elif val > 0.20 and val <= 0.25:
        return '20-25'
    elif val > 0.25 and val <= 0.30:
        return '25-30'
    elif val > 0.30 and val <= 0.35:
        return '30-35'
    elif val > 0.35 and val <= 0.45:
        return '35-45'
    
def chunk_bachorhigher(val):
    if val > 0.00 and val <= 0.10:
        return '0-10'
    elif val > 0.10 and val <= 0.15:
        return '10-15'
    elif val > 0.15 and val <= 0.20:
        return '15-20'
    elif val > 0.20 and val <= 0.25:
        return '20-25'
    elif val > 0.25 and val <= 0.30:
        return '25-30'
    elif val > 0.30 and val <= 0.40:
        return '30-40'
    elif val > 0.40 and val <= 0.60:
        return '40-60'
    elif val > 0.60 and val <= 0.80:
        return '60-80'
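
The same binning can also be sketched with `pd.cut`, which takes the bin edges and labels directly. The edges below mirror `chunk_physinact()`; the sample values are made up, and this is an alternative rather than what the notebook uses.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirroring chunk_physinact(); intervals are right-closed, e.g. (0.20, 0.25]
edges = [0.00, 0.10, 0.20, 0.25, 0.30, 0.35, 0.45]
labels = ['0-10', '10-20', '20-25', '25-30', '30-35', '35-45']

# Hypothetical sample values, including a missing one
sample = pd.Series([0.09, 0.2425, 0.28, 0.313, 0.442, np.nan])

# pd.cut uses right-closed bins by default, matching the `val > lo and val <= hi` checks
cats = pd.cut(sample, bins=edges, labels=labels)
print(cats.tolist())
```

Missing inputs stay missing in the result, much like the functions above return `None` for them.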

In [360]:
# create the dataframe that will be used for filtering the data
# .copy() makes df_smoke an independent frame, so adding columns does not trigger SettingWithCopyWarning
df_smoke = df[['health__pct_adult_smoking','health__pct_physical_inacticity', 'demo__pct_adults_bachelors_or_higher']].copy()

# use the .apply() method to create new columns based on the previously created functions
df_smoke['health__phys_inact_cat'] = df_smoke['health__pct_physical_inacticity'].apply(chunk_physinact)
df_smoke['demo__bach_higher_cat'] = df_smoke['demo__pct_adults_bachelors_or_higher'].apply(chunk_bachorhigher)
df_smoke.isnull().sum()


health__pct_adult_smoking               464
health__pct_physical_inacticity           2
demo__pct_adults_bachelors_or_higher      0
health__phys_inact_cat                    2
demo__bach_higher_cat                     0
dtype: int64

In [361]:
df_smoke_table = df_smoke.drop(['health__pct_physical_inacticity',
                                'demo__pct_adults_bachelors_or_higher'],
                               axis=1).groupby(['health__phys_inact_cat','demo__bach_higher_cat']).mean()

In [362]:
# use the df_smoke_table to retrieve values for rows that have NaN values for 'health__pct_adult_smoking'
# using the fillna_vals() function
df['health__pct_adult_smoking'] = df_smoke.apply(lambda x: fillna_vals(x['health__pct_adult_smoking'], df_smoke_table,
                                                                      columns=[x['health__phys_inact_cat'], x['demo__bach_higher_cat']]),
                                                axis=1)

#### <font color='teal'> Lastly, we will work with the `health__motor_vehicle_crash_deaths_per_100k` column

In [363]:
# determine which columns are most correlated to this column
df.corr()['health__motor_vehicle_crash_deaths_per_100k'].sort_values()[:-1]

demo__pct_adults_bachelors_or_higher               -0.536784
demo__pct_asian                                    -0.371687
econ__pct_civilian_labor                           -0.346759
health__pct_excessive_drinking                     -0.213703
demo__pct_non_hispanic_white                       -0.114775
demo__pct_adults_with_some_college                 -0.095026
demo__pct_female                                   -0.078898
health__air_pollution_particulate_matter           -0.063060
demo__pct_hispanic                                 -0.053457
row_id                                              0.000465
demo__pct_non_hispanic_african_american             0.106834
demo__birth_rate_per_1k                             0.122453
demo__pct_below_18_years_of_age                     0.136475
demo__pct_aged_65_years_and_older                   0.214857
econ__pct_unemployment                              0.229797
health__pct_low_birthweight                         0.259686
health__pop_per_primary_

<font color='teal'> The most correlated features are the same as for the previous column. We will reuse the functions `chunk_physinact()` and `chunk_bachorhigher()` created above to build the new table.

In [364]:
# create the dataframe that will be used for filtering the data
# .copy() makes df_motor an independent frame, so adding columns does not trigger SettingWithCopyWarning
df_motor = df[['health__motor_vehicle_crash_deaths_per_100k','health__pct_physical_inacticity', 'demo__pct_adults_bachelors_or_higher']].copy()

# use the .apply() method to create new columns based on the previously created functions
df_motor['health__phys_inact_cat'] = df_motor['health__pct_physical_inacticity'].apply(chunk_physinact)
df_motor['demo__bach_higher_cat'] = df_motor['demo__pct_adults_bachelors_or_higher'].apply(chunk_bachorhigher)
df_motor.isnull().sum()


health__motor_vehicle_crash_deaths_per_100k    417
health__pct_physical_inacticity                  2
demo__pct_adults_bachelors_or_higher             0
health__phys_inact_cat                           2
demo__bach_higher_cat                            0
dtype: int64

In [365]:
df_motor_table = df_motor.drop(['health__pct_physical_inacticity',
                                'demo__pct_adults_bachelors_or_higher'],
                               axis=1).groupby(['health__phys_inact_cat','demo__bach_higher_cat']).mean()

In [366]:
# use the df_motor_table to retrieve values for rows that have NaN values for 'health__motor_vehicle_crash_deaths_per_100k'
# using the fillna_vals() function
df['health__motor_vehicle_crash_deaths_per_100k'] = df_motor.apply(lambda x: fillna_vals(x['health__motor_vehicle_crash_deaths_per_100k'], df_motor_table,
                                                                                         columns=[x['health__phys_inact_cat'], x['demo__bach_higher_cat']]),
                                                                   axis=1)

### <font color='teal'> Now we will check the dataframe for the remaining null values.

In [367]:
df.isnull().sum()

row_id                                               0
area__rucc                                           0
area__urban_influence                                0
econ__economic_typology                              0
econ__pct_civilian_labor                             0
econ__pct_unemployment                               0
econ__pct_uninsured_adults                           2
econ__pct_uninsured_children                         2
demo__pct_female                                     2
demo__pct_below_18_years_of_age                      2
demo__pct_aged_65_years_and_older                    2
demo__pct_hispanic                                   2
demo__pct_non_hispanic_african_american              2
demo__pct_non_hispanic_white                         2
demo__pct_american_indian_or_alaskan_native          2
demo__pct_asian                                      2
demo__pct_adults_less_than_a_high_school_diploma     0
demo__pct_adults_with_high_school_diploma            0
demo__pct_

<font color='teal'> We can now use the `.dropna()` method to drop the rows that still contain null values.

In [368]:
df = df.dropna()
print(df.shape)
df.head()

(3151, 34)


Unnamed: 0,row_id,area__rucc,area__urban_influence,econ__economic_typology,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,yr,heart_disease_mortality_per_100k
0,0,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Manufacturing-dependent,0.408,0.057,0.254,0.066,0.516,0.235,0.176,0.109,0.039,0.829,0.004,0.011,0.194223,0.424303,0.227092,0.154382,12,12,0.297,0.23,0.131,0.089,0.124094,0.332,13.0,15.09,1650.0,1489.0,a,312
1,1,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,Mining-dependent,0.556,0.039,0.26,0.143,0.503,0.272,0.101,0.41,0.07,0.493,0.008,0.015,0.164134,0.234043,0.342452,0.259372,19,7,0.288,0.19,0.09,0.082,0.181,0.265,10.0,19.79,2010.0,2480.0,a,257
2,4,Metro - Counties in metro areas of 1 million p...,Large-in a metro area with at least 1 million ...,Nonspecialized,0.541,0.057,0.07,0.023,0.522,0.179,0.115,0.202,0.198,0.479,0.013,0.085,0.158573,0.237859,0.186323,0.417245,12,6,0.212,0.156,0.084,0.098,0.195,0.209,10.0,3.14,629.0,690.0,b,195
3,5,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore adjacent to a small metro with town of...,Nonspecialized,0.5,0.061,0.203,0.059,0.525,0.2,0.164,0.013,0.049,0.897,0.007,0.001,0.181637,0.407186,0.248503,0.162675,11,12,0.285,0.208778,0.104,0.058,0.15155,0.238,13.0,19.403221,1810.0,6630.0,b,218
4,6,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore not adjacent to a metro/micro area and...,Nonspecialized,0.471,0.05,0.225,0.103,0.511,0.237,0.171,0.025,0.008,0.953,0.003,0.0,0.122367,0.41324,0.306921,0.157472,14,12,0.284,0.234,0.137,0.07,0.194,0.29,9.0,29.39,3489.0,2590.0,a,355


<font color='teal'> Comparing the new shape to the original shape of the dataframe, we went from 3198 rows to 3151.

### Export data to `DataWrangling_output.csv`

In [370]:
df.to_csv('../Data/predict heart disease rate/DataWrangling_output.csv')
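
Note that `to_csv` writes the row index as an unnamed first column by default, which appears as `Unnamed: 0` when the file is read back. A minimal sketch of the round trip, using an in-memory buffer and a made-up two-row frame:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the wrangled data
toy = pd.DataFrame({'row_id': [0, 1], 'rate': [312, 257]})

# Default: the index is written as an extra, unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` depends on how later notebooks load the file.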

---

### <font color='teal'> Quick model with only the numeric columns - Linear Regression using scikit-learn.

In [373]:
# import the necessary packages for creating a Linear Regression model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# import metrics to view results
from sklearn.metrics import explained_variance_score, mean_absolute_error

In [374]:
# Create the X and y variables for splitting the data
X = df.drop(['heart_disease_mortality_per_100k', 'area__rucc', 'area__urban_influence', 'econ__economic_typology', 'yr'], axis=1)
y = df['heart_disease_mortality_per_100k']

# scale the data using StandardScaler()
X_scaled = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=1)
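
The cell above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data (not the notebook's dataset), is to split first and wrap the scaler and model in a pipeline so that scaling is fitted on the training rows only. The names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, then let the pipeline fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# R^2 on the held-out rows; the scaler reuses the training mean and std here
print(pipe.score(X_te, y_te))
```

For plain linear regression the difference is usually small, but the pattern matters more once cross-validation or regularized models are involved.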

In [375]:
# create the linear model
lm = LinearRegression()
model = lm.fit(X_train, y_train)

In [376]:
# Predict using the X_test data
y_pred = model.predict(X_test)

In [379]:
print(f'Explained variance score:        {explained_variance_score(y_test,y_pred)}')
print(f'Mean absolute error:             {mean_absolute_error(y_test, y_pred)}')

Explained variance score:        0.662445465358438
Mean absolute error:             26.65479216091521
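
`explained_variance_score` and R^2 (`r2_score`) agree only when the prediction errors average to zero; a constant bias lowers R^2 but leaves the explained variance unchanged. A tiny made-up example:

```python
from sklearn.metrics import explained_variance_score, r2_score

# Hypothetical values: every prediction is off by exactly +1
y_true = [1.0, 2.0, 3.0]
y_hat = [2.0, 3.0, 4.0]

ev = explained_variance_score(y_true, y_hat)  # ignores the constant offset
r2 = r2_score(y_true, y_hat)                  # penalizes the constant offset

print(ev, r2)
```

Here the explained variance is 1.0 while R^2 is -0.5, so reporting both can reveal a systematic offset in the predictions.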


In [384]:
# Create a quick dataframe showing the model coefficients, sorted from largest to smallest
pd.DataFrame(data=lm.coef_, index=X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False)

Unnamed: 0,Coefficient
demo__pct_adults_bachelors_or_higher,3027340.0
demo__pct_adults_with_high_school_diploma,2409773.0
demo__pct_adults_less_than_a_high_school_diploma,2280656.0
demo__pct_adults_with_some_college,1777066.0
demo__death_rate_per_1k,22.14253
health__pct_physical_inacticity,8.3387
health__pct_diabetes,8.060895
health__motor_vehicle_crash_deaths_per_100k,6.875089
econ__pct_uninsured_adults,5.911669
health__pct_adult_smoking,4.829057
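
The four very large education coefficients are likely a symptom of collinearity: in the rows shown earlier, the four `demo__pct_adults_*` shares sum to 1, so each one is a linear combination of the other three and the individual coefficients are not well determined. A quick check using the first two rows of `df.head()` above; dropping one of the four columns is one common remedy, stated here as a suggestion rather than something the notebook does.

```python
import pandas as pd

# Education shares copied from the first two rows of df.head() above
edu = pd.DataFrame({
    'demo__pct_adults_less_than_a_high_school_diploma': [0.194223, 0.164134],
    'demo__pct_adults_with_high_school_diploma': [0.424303, 0.234043],
    'demo__pct_adults_with_some_college': [0.227092, 0.342452],
    'demo__pct_adults_bachelors_or_higher': [0.154382, 0.259372],
})

# Each row sums to (approximately) 1, so the four columns are linearly dependent
row_sums = edu.sum(axis=1)
print(row_sums.round(4).tolist())

# Possible remedy (an assumption, not the notebook's method): drop one share as the reference
edu_reduced = edu.drop(columns='demo__pct_adults_with_some_college')
print(list(edu_reduced.columns))
```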
