# First Part
### Data and Features
The dataset H_MHAS_c2.sas7bdat has 26839 rows and 5241 features.

SOME IMPORTANT INSIGHTS
-   The first character of the majority of variables indicates whether the variable refers to the reference person (“r”), spouse (“s”), or household (“h”).
-   The second character indicates the wave to which the variable pertains: “1”, “2”, “3”, “4”, “5”, or “A”. The “A” indicates “all”.
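
The naming convention above can be checked programmatically. The following is a minimal sketch and not part of the original notebook: the helper `parse_variable_name` and its regular expression are illustrative assumptions. Because the convention only holds for the majority of variables, names that do not fit simply return `None`.

```python
import re

# Assumed pattern: unit prefix (r/s/h), wave (1-5 or "a" for all), then the concept name.
NAME_PATTERN = re.compile(r"^(?P<unit>[rsh])(?P<wave>[1-5a])(?P<concept>\w+)$")

UNIT_LABELS = {"r": "reference person", "s": "spouse", "h": "household"}


def parse_variable_name(name):
    """Split a variable name into unit, wave, and concept, or return None if it does not fit."""
    match = NAME_PATTERN.match(name.lower())
    if match is None:
        return None
    return {
        "unit": UNIT_LABELS[match["unit"]],
        "wave": match["wave"].upper(),
        "concept": match["concept"],
    }


print(parse_variable_name("r3bpref"))
print(parse_variable_name("s5dmonth"))
```

This is only a heuristic: a name that happens to start with the same characters without following the convention would be parsed as well, so the result should be cross-checked against the guide document.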

All features are divided into the following sections.

- SECTION A: DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS
- SECTION B: HEALTH
- SECTION C: HEALTH CARE UTILIZATION AND INSURANCE
- SECTION D: COGNITION
- SECTION E: FINANCIAL AND HOUSING WEALTH
- SECTION F: INCOME
- SECTION G: FAMILY STRUCTURE
- SECTION H: EMPLOYMENT HISTORY
- SECTION I: RETIREMENT
- SECTION J: PENSION
- SECTION K: PHYSICAL MEASURES
- SECTION L: ASSISTANCE AND CAREGIVING
- SECTION M: STRESS
- SECTION O: END OF LIFE PLANNING
- SECTION Q: PSYCHOSOCIAL

We have decided to analyze the features by sections...

In [1]:
# Import libraries
from sas7bdat import SAS7BDAT

import pandas as pd
import numpy as np

from src.section_dict import (
    section_A, # A: DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS
    section_B, # B: HEALTH
    section_C, # C: HEALTH CARE UTILIZATION AND INSURANCE
    section_D, # D: COGNITION
    section_E, # E: FINANCIAL AND HOUSING WEALTH
    section_F, # F: INCOME
    section_G, # G: FAMILY STRUCTURE
    section_H, # H: EMPLOYMENT HISTORY
    section_I, # I: RETIREMENT
    section_J, # J: PENSION
    section_K, # K: PHYSICAL MEASURES
    section_L, # L: ASSISTANCE AND CAREGIVING
    section_M, # M: STRESS
    section_O, # O: END OF LIFE PLANNING
    section_Q  # Q: PSYCHOSOCIAL
)
# read the dataset from the file (470 MB)
with SAS7BDAT('./dataset/H_MHAS_c2.sas7bdat') as file:
    df = file.to_data_frame()

print('the dataset has '+ str(df.shape[0]) + ' rows and ' + str(df.shape[1]) + ' features')

[H_MHAS_c2.sas7bdat] column count mismatch


the dataset has 26839 rows and 5241 features


As a first step, we check whether every column in the dataset is listed in the guide document...

In [2]:
# Collect every variable name listed in the guide document
Total_sections = []

# Each section dictionary holds lists of variable names as its values
for section_dict in ( section_A, section_B, section_C, section_D, section_E, section_F, section_G, section_H, section_I, section_J, section_K, section_L, section_M, section_O, section_Q):
    for values_list in section_dict.values():
        # Add the variable names from the current list to Total_sections
        Total_sections.extend(values_list)


# all variables in the dataset...
all_variables = df.columns.tolist()

# Compare the number of documented variables with the number of dataset columns
print('The document shows ' + str(len(Total_sections)) +  ' features and the data has ' + str(len(all_variables)) + ' features.\n')

# Searching for variable names that are not in the document...
variables_not_in_list = [variable for variable in all_variables if variable not in Total_sections]

print('The ' + str(len(variables_not_in_list)) + ' variables that are not in the document:')
print(variables_not_in_list)


The document shows 5237 features and the data has 5241 features.

The 4 variables that are not in the document:
['r2relgwk', 's2relgwk', 'r5riccaredpmm', 's1rpfcaredpm']


Since they are not in the PDF guide document, we can delete them.

In [3]:
df.drop(columns=variables_not_in_list,inplace=True)
df.shape

(26839, 5237)
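
The check above only looks for dataset columns that are missing from the guide. A two-way comparison can be sketched with sets, as below. This is an illustrative sketch: `compare_variable_lists` is a hypothetical helper, and the toy lists are not the real variable lists.

```python
def compare_variable_lists(documented, in_data):
    """Return (undocumented, absent): names only in the data, and names only in the guide."""
    documented_set = set(documented)
    data_set = set(in_data)
    undocumented = sorted(data_set - documented_set)
    absent = sorted(documented_set - data_set)
    return undocumented, absent


# Toy names for illustration only; they are not the real variable lists.
guide_names = ["r1agey", "r2agey", "s1agey", "r9agey"]
data_names = ["r1agey", "r2agey", "s1agey", "r2relgwk"]

undocumented, absent = compare_variable_lists(guide_names, data_names)
print("In the data but not in the guide:", undocumented)
print("In the guide but not in the data:", absent)
```

In the notebook this would be called with `Total_sections` and `df.columns`. Sets also make the membership test faster than scanning a list for each of the several thousand names.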

Next, we look at null values: since the data come from interviews, many columns contain a high percentage of null data...

In [4]:
# Calculate the percentage of missing data for each column
total_values = df.isnull().sum()

missing_percentage = round((total_values / len(df)) * 100,1)

# Create a DataFrame to store the results
missing_data_df = pd.DataFrame({
    'Total': total_values,
    'Missing Percentage': missing_percentage
})

# Sort the DataFrame by missing percentage in descending order
missing_data_df = missing_data_df.sort_values(by='Missing Percentage', ascending=False)

# Display the 25 columns with the highest missing percentage
top_missing_columns = missing_data_df.head(25)
print("Top 25 columns with the highest missing percentage:")
print(top_missing_columns)

Top 25 columns with the highest missing percentage:
            Total  Missing Percentage
s5penage    26835               100.0
r3bpref     26839               100.0
s3wghtsft   26827               100.0
s3gripothr  26836               100.0
s2penage    26827               100.0
r5penage    26826               100.0
s5dmonth    26839               100.0
s5dyear     26839               100.0
r3bpsft     26839               100.0
s3bpsft     26839               100.0
s3bpref     26839               100.0
r3wghtref   26837               100.0
s3walksft   26832               100.0
s3walktryu  26832               100.0
s3walkref   26832               100.0
s3walkothr  26832               100.0
r3gripsft   26833               100.0
s3gripsft   26836               100.0
r3gripref   26833               100.0
s3gripref   26836               100.0
s3wghttryu  26827               100.0
r3gripothr  26833               100.0
s3wghtref   26839               100.0
s3hipref    26824                99.9
r3w

As we can see, many columns are entirely or almost entirely null (their missing percentage rounds to 100%), and others also have a high percentage of nulls. For this reason, as a first data-cleaning step, we drop the columns in which more than 90% of the values are null.

In [5]:
var_high_null_percentage = missing_data_df[missing_data_df['Missing Percentage']>90].index.tolist()
print(str(len(var_high_null_percentage)) + ' columns with more than 90% null values will be dropped')

df.drop(columns=var_high_null_percentage,inplace=True)
df.shape

1753 columns with more than 90% null values will be dropped


(26839, 3484)
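
The same filter can be wrapped in a small reusable function. The sketch below is an assumption-laden variant rather than the notebook's own code: `drop_sparse_columns` is a hypothetical helper and the toy frame is invented. It compares the unrounded null fraction against the threshold, whereas the cell above compares a percentage rounded to one decimal, so a column at 90.04% would round to 90.0 and be kept there. Whether any column of the real dataset falls in that narrow band is not checked here.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.90):
    """Drop columns whose fraction of null values is strictly greater than `threshold`."""
    null_fraction = frame.isna().mean()  # per-column share of nulls, unrounded
    sparse = null_fraction[null_fraction > threshold].index.tolist()
    return frame.drop(columns=sparse), sparse


# Toy frame (10,000 rows): "a" has no nulls, "b" is 90.04% null, "c" is entirely null.
n_rows = 10_000
toy = pd.DataFrame({
    "a": np.arange(n_rows, dtype=float),
    "b": np.where(np.arange(n_rows) < 9004, np.nan, 1.0),
    "c": np.full(n_rows, np.nan),
})

cleaned, dropped = drop_sparse_columns(toy)
print("Dropped columns:", dropped)

# For comparison: the rounded percentage used in the notebook cell above.
rounded_percentage = (toy.isna().mean() * 100).round(1)
print(rounded_percentage)
```

Returning the list of dropped names alongside the cleaned frame keeps a record of what was removed, and avoiding `inplace=True` leaves the original frame untouched.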

In total, 1753 features exceed the 90% null threshold, i.e. we have eliminated about 33% of the columns.
We now have 3484 features for our model. <br>
As mentioned before, the exploratory data analysis (EDA) will be done by sections (15 in total), in order to stay aligned with the guide document.
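
To see how many features each section retains after the cleaning steps, the section dictionaries can be intersected with the remaining columns. The sketch below assumes each section dictionary maps a subsection title to a list of variable names. `summarize_sections` is a hypothetical helper, and the toy dictionaries and column names are for illustration only.

```python
def summarize_sections(sections, columns):
    """Count the subsections and the still-available features for each section."""
    available = set(columns)
    summary = {}
    for section_name, subsections in sections.items():
        kept = [
            variable
            for names in subsections.values()
            for variable in names
            if variable in available
        ]
        summary[section_name] = {"subsections": len(subsections), "features": len(kept)}
    return summary


# Toy stand-ins for the real section dictionaries and dataframe columns.
toy_sections = {
    "A": {"Gender": ["ragender", "sagender"], "Urban or Rural": ["h1rural", "h2rural"]},
    "I": {"Retirement year": ["r1retyr"]},
}
toy_columns = ["ragender", "h1rural", "r1retyr"]

summary = summarize_sections(toy_sections, toy_columns)
for name, counts in summary.items():
    print(name, counts)
```

In the notebook, `sections` could be built as `{"A": section_A, "B": section_B, ...}` and `columns` would be `df.columns`.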

<span style="font-size:0.8em;">

<h3><center> EDA Section A - DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS </center></h3>

This section consists of 26 subsections (245 features) which are shown below:

-  Person Specific Identifier
-  Household Identifier
-  Spouse Identifier
-  Wave Status: Response Indicator
-  Wave Status: Interview Status
-  Sample Cohort
-  Whether Proxy Interview
-  Number of Household Respondents
-  Whether Couple Household
-  Household Analysis Weight
-  Person-Level Analysis Weight
-  Interview Dates
-  Birth Date: Month and Year
-  Death Date: Month and Year
-  Age at Interview (Months and Years)
-  Gender
-  Education
-  Education: Categories by ISCED Codes
-  Education: Harmonized Education
-  Literacy and Numeracy
-  Indigenous Language
-  Current Marital Status: Current Partnership Status
-  Current Marital Status: With Partnership
-  Current Marital Status: Without Partnership
-  Number of Marriages
-  Urban or Rural
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section B - HEALTH </center></h3>

This section consists of 27 subsections (1624 features) which are shown below:

-  Self-Report of Health
-  Activities of Daily Living (ADLs): Raw Recodes
-  Activities of Daily Living (ADLs): Some Difficulty
-  Instrumental Activities of Daily Living (IADLs): Raw Recodes
-  Instrumental Activities of Daily Living (IADLs): Some Difficulty
-  Other Functional Limitations: Raw Recodes
-  Other Functional Limitations: Some Difficulty
-  ADL Summary: Sum ADLs Where Respondent Reports Any Difficulty
-  IADL Summary: Sum IADLs Where Respondent Reports Any Difficulty
-  Other Summary Indices: Mobility, Large Muscle, Gross, Fine Motor, Total, Upper, Lower Body Mobility, and NAGI Activities
-  Doctor Diagnosed Health Problems: Ever Have Condition
-  Doctor Diagnosed Diseases: Whether Receives Treatment or Medication for Disease
-  Doctor Diagnosed Diseases: Whether Disease Limits Activity
-  Doctor Diagnosed Diseases: Age of Diagnosis
-  Vision
-  Hearing
-  Falls
-  Urinary Incontinence
-  Persistent Health Problems
-  Sleep
-  Pain
-  Menopause
-  BMI
-  Health Behaviors: Physical Activity or Exercise
-  Health Behaviors: Drinking
-  Health Behaviors: Smoking (Cigarettes)
-  Health Behaviors: Preventive Care
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section C - HEALTH CARE UTILIZATION AND INSURANCE </center></h3>

This section consists of 8 subsections (236 features) which are shown below:

-  Medical Care Utilization: Hospital
-  Medical Care Utilization: Doctor
-  Medical Care Utilization: Other Medical Care Utilization
-  Medical Expenditures: Out of Pocket and Total
-  Covered by Federal Government Health Insurance Program
-  Covered by Private Health Insurance
-  Covered by Health Insurance from a Current or Previous Employer
-  Number of Health Insurance Plans
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section D - COGNITION </center></h3>

This section consists of 15 subsections (706 features) which are shown below:
 
-  Cognition Testing Conditions
-  Self-Reported Memory
-  Immediate Word Recall
-  Delayed Word Recall
-  Summary Scores
-  Picture Drawing
-  Verbal Fluency
-  Visual Scanning
-  Backwards Counting From 20
-  Date Naming/Orientation
-  Serial 7’s
-  Proxy Cognition: JORM IQCODE
-  Proxy Cognition: Ratings of Memory and Abilities
-  Proxy Cognition: Cognitive Impairment
-  Proxy Cognition: Problem Behaviors in Past Week
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section E - FINANCIAL AND HOUSING WEALTH </center></h3>

This section consists of 15 subsections (148 features) which are shown below:

-  Inflation Multiplier
-  Net Value of Real Estate (Not Primary Residence)
-  Net Value of Cars
-  Net Value of Businesses 
-  Value of Stocks, Shares, and Bonds
-  Value of Checking, Savings Accounts
-  Value of Other Assets
-  Value of Primary Residence
-  Value of All Mortgages (Primary Residence)
-  Net Value of Primary Residence
-  Home ownership
-  Value of Other Debt
-  Value of Loans Lent
-  Net Value of Non-Housing Financial Wealth (Excluding IRAs)
-  Total Wealth

</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section F - INCOME </center></h3>

This section consists of 10 subsections (212 features) which are shown below:

-  Individual Earnings
-  Household Capital Income
-  Individual Income from Private Pension
-  Individual Public Pension Income
-  Individual Other Pensions Income
-  Individual Total Pensions Income
-  Individual Income from Other Government Transfers
-  All Other Income
-  Total Household Income (respondent & spouse)
-  Total Household Consumption (full household)
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section G - FAMILY STRUCTURE </center></h3>

This section consists of 19 subsections (223 features) which are shown below:

-  Number of People Living in Household
-  Number of Living Children
-  Number of Deceased Children
-  Number of Children Ever Born
-  Number of Grandchildren
-  Number of Living Siblings
-  Number of Deceased Siblings
-  Number of Living Parents
-  Parental Mortality
-  Parents' Current Age or Age at Death
-  Parents' Education
-  Any Child Co-Resides with Respondent
-  Any Children Living in the Same City
-  Any Weekly Contact with Children
-  Frequent or Weekly Contact with Relatives and Friends
-  Any Weekly Social Activities or Participate in Religious Groups
-  Financial Transfer from Children
-  Financial Transfer to Children
-  Financial Transfer to Parents
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section H - EMPLOYMENT HISTORY </center></h3>

This section consists of 12 subsections (110 features) which are shown below:

-  Currently Working for Pay
-  Whether Self-Employed
-  Labor Force Status
-  In the Labor Force
-  Unemployment Status
-  Retired Employment Status
-  Hours at Main Job
-  Main Activity Years of Tenure
-  Job Allows Move to Less Demanding Work
-  Occupation Code for Job with Longest Reported Tenure
-  Year Last Job Ended
-  Reason Job Ended
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section I - RETIREMENT </center></h3>

This section consists of 2 subsections (16 features) which are shown below:

-  Whether Retired: Retirement year, if says retired
-  Whether Retired: Retirement age, if says retired
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section J - PENSION </center></h3>

This section consists of 7 subsections (54 features) which are shown below:

-  Whether Receives Public Pension
-  Whether Receives Private Pension
-  Whether Receives Other Pension
-  Age When Started to Receive a Public Pension
-  Age When Started to Receive a Private Pension
-  Whether Current Public Pension(s) Can Continue
-  Whether Current Private Pension Can Continue
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section K - PHYSICAL MEASURES </center></h3>

This section consists of 12 subsections (282 features) which are shown below:
 
-  Height, Weight, Waist and Hip Circumference Measurements
-  Height, Weight, Waist and Hip Circumference Measurements: Reason Didn't Complete
-  Sitting Height
-  Sitting Height: Reason Didn't Complete
-  Balance Test
-  Balance Test: Reason Didn't Complete
-  Blood Pressure Measurements
-  Blood Pressure Measurements: Reason Didn't Complete
-  Timed Walk Measurements
-  Timed Walk Measurements: Reason Didn't Complete
-  Hand Grip Strength Measurements
-  Hand Grip Strength Measurements: Reason Didn't Complete
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section L - ASSISTANCE AND CAREGIVING </center></h3>

This section consists of 32 subsections (1103 features) which are shown below:
 
-  ADL Help
-  IADL Help
-  Whether Uses Personal Aids
-  Future ADL Help
-  Activities of Daily Living: Whether Receives Any Care
-  Activities of Daily Living: Whether Receives Any Informal Care
-  Activities of Daily Living: Receives Informal Care from Spouse
-  Activities of Daily Living: Receives Informal Care from Children or Grandchildren
-  Activities of Daily Living: Receives Informal Care from Relatives
-  Activities of Daily Living: Receives Informal Care from Other Individuals
-  Activities of Daily Living: Whether Receives Any Formal Care
-  Activities of Daily Living: Receives Formal Care from Paid Professional
-  Instrumental Activities of Daily Living: Whether Receives Any Care
-  Instrumental Activities of Daily Living: Whether Receives Any Informal Care
-  Instrumental Activities of Daily Living: Receives Informal Care from Spouse
-  Instrumental Activities of Daily Living: Receives Informal Care from Children or Grandchildren
-  Instrumental Activities of Daily Living: Receives Informal Care from Relatives
-  Instrumental Activities of Daily Living: Receives Informal Care from Other Individuals
-  Instrumental Activities of Daily Living: Whether Receives Any Formal Care
-  Instrumental Activities of Daily Living: Receives Formal Care from Paid Professional
-  Activities of Daily Living and Instrumental Activities of Daily Living: Whether Receives Any Care
-  Activities of Daily Living and Instrumental Activities of Daily Living: Whether Receives Any Informal Care
-  Activities of Daily Living and Instrumental Activities of Daily Living: Receives Informal Care from Spouse
-  Activities of Daily Living and Instrumental Activities of Daily Living: Receives Informal Care from Children or Grandchildren
-  Activities of Daily Living and Instrumental Activities of Daily Living: Receives Informal Care from Relatives
-  Activities of Daily Living and Instrumental Activities of Daily Living: Receives Informal Care from Other Individuals
-  Activities of Daily Living and Instrumental Activities of Daily Living: Whether Receives Any Formal Care
-  Activities of Daily Living and Instrumental Activities of Daily Living: Receives Formal Care from Paid Professional
-  Receives Help with Chores from Children or Grandchildren
-  Provides Informal Care to Children or Grandchildren
-  Provides Personal Care to Parents
-  Provides Informal Care for Sick or Disabled Adults
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section M - STRESS </center></h3>

This section consists of 4 subsections (77 features) which are shown below:
  
-  Social Support: Spouse
-  Social Support: Children
-  Social Support: Friends
-  Experienced Death of a Child
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section O - END OF LIFE PLANNING </center></h3>

This section consists of 3 subsections (33 features) which are shown below:

-  Will: Whether Has a Will
-  Will: Beneficiaries of Will
-  Covered by Life Insurance
</span>

<span style="font-size:0.8em;">

<h3><center> EDA Section Q - PSYCHOSOCIAL </center></h3>

This section consists of 8 subsections (166 features) which are shown below:

-  Depressive Symptoms: CESD
-  Satisfaction with Life Scale
-  Single Life Satisfaction Question
-  Cantril Ladder
</span>