#### About the dataset
1. YearStart: The year the data collection began.
2. YearEnd: The year the data collection ended.
3. LocationAbbr: The abbreviation for the location where the data was collected.
4. LocationDesc: The full name of the location where the data was collected.
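5. Datasource: The source of the data (e.g. BRFSS).
6. Class: The broad class the indicator belongs to (e.g. Cognitive Decline).
7. Topic: The topic within the class (e.g. subjective cognitive decline or memory loss among older adults).
8. Question: The full text of the indicator being reported.
9. Data_Value_Unit: The unit of measurement for the data value (e.g. %).
10. DataValueTypeID: The ID for the type of data value (e.g. PRCTG).
11. Data_Value_Type: The type of data value (e.g. mean, percentage).
12. Data_Value: The reported value of the indicator, in the unit given by Data_Value_Unit.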
13. Data_Value_Alt: An alternative data value, if applicable.
14. Low_Confidence_Limit: The lower limit of the confidence interval for the data value.
15. High_Confidence_Limit: The upper limit of the confidence interval for the data value.
16. Sample_Size: The size of the sample used to collect the data.
17. StratificationCategory1: The first category used for stratification (e.g. age group).
18. Stratification1: The specific stratification used (e.g. 18-24 years old).
19. StratificationCategory2: The second category used for stratification, if applicable (e.g. gender, race/ethnicity).
20. Stratification2: The specific stratification used for the second category, if applicable (e.g. Female, Hispanic).
21. Geolocation: The coordinates of the location where the data was collected, stored as a `POINT (longitude latitude)` string.
22. ClassID: The ID for the class of the data.
23. TopicID: The ID for the topic of the data.
24. QuestionID: The ID for the question related to the data.
25. LocationID: The ID for the location where the data was collected.
26. StratificationCategoryID1: The ID for the first category used for stratification.
27. StratificationID1: The ID for the specific stratification used for the first category.
28. StratificationCategoryID2: The ID for the second category used for stratification, if applicable.
29. StratificationID2: The ID for the specific stratification used for the second category, if applicable.

Two further columns appear in the data but are not listed above: `Data_Value_Footnote_Symbol` and `Data_Value_Footnote`, which hold the footnote attached to a data value (for example, the reason a value is missing).

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Load the dataset
file_path = "../data/Alzheimer_s_Disease_and_Healthy_Aging_Indicators__Cognitive_Decline_20250131.csv"
df = pd.read_csv(file_path)

# display the first few rows
df.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,Datasource,Class,Topic,Question,Data_Value_Unit,DataValueTypeID,...,Stratification2,Geolocation,ClassID,TopicID,QuestionID,LocationID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2
0,2022,2022,AZ,Arizona,BRFSS,Cognitive Decline,Functional difficulties associated with subjec...,Percentage of older adults who reported subjec...,%,PRCTG,...,Female,POINT (-111.7638113 34.86597028),C06,TCC02,Q31,4,AGE,AGE_OVERALL,GENDER,FEMALE
1,2022,2022,AZ,Arizona,BRFSS,Cognitive Decline,Functional difficulties associated with subjec...,Percentage of older adults who reported subjec...,%,PRCTG,...,Hispanic,POINT (-111.7638113 34.86597028),C06,TCC02,Q31,4,AGE,5064,RACE,HIS
2,2022,2022,AZ,Arizona,BRFSS,Cognitive Decline,Functional difficulties associated with subjec...,Percentage of older adults who reported subjec...,%,PRCTG,...,"White, non-Hispanic",POINT (-111.7638113 34.86597028),C06,TCC02,Q31,4,AGE,65PLUS,RACE,WHT
3,2022,2022,AZ,Arizona,BRFSS,Cognitive Decline,Functional difficulties associated with subjec...,Percentage of older adults who reported subjec...,%,PRCTG,...,Native Am/Alaskan Native,POINT (-111.7638113 34.86597028),C06,TCC02,Q31,4,AGE,65PLUS,RACE,NAA
4,2022,2022,AZ,Arizona,BRFSS,Cognitive Decline,Functional difficulties associated with subjec...,Percentage of older adults who reported subjec...,%,PRCTG,...,"Black, non-Hispanic",POINT (-111.7638113 34.86597028),C06,TCC02,Q31,4,AGE,AGE_OVERALL,RACE,BLK


In [3]:
# get general information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22182 entries, 0 to 22181
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   YearStart                   22182 non-null  int64  
 1   YearEnd                     22182 non-null  int64  
 2   LocationAbbr                22182 non-null  object 
 3   LocationDesc                22182 non-null  object 
 4   Datasource                  22182 non-null  object 
 5   Class                       22182 non-null  object 
 6   Topic                       22182 non-null  object 
 7   Question                    22182 non-null  object 
 8   Data_Value_Unit             22182 non-null  object 
 9   DataValueTypeID             22182 non-null  object 
 10  Data_Value_Type             22182 non-null  object 
 11  Data_Value                  14136 non-null  float64
 12  Data_Value_Alt              14136 non-null  float64
 13  Data_Value_Footnote_Symbol  121
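
The `df.info()` output shows 14136 non-null entries out of 22182 for `Data_Value` and `Data_Value_Alt`, so 8046 rows (about 36%) have no value. A per-column table of missing counts makes such gaps easier to scan than the raw listing. The sketch below runs on a small hypothetical frame named `toy`, not on the real file, and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "YearStart": [2022, 2022, 2022, 2022],
    "Data_Value": [31.6, np.nan, 19.9, np.nan],
    "Data_Value_Alt": [31.6, np.nan, 19.9, np.nan],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Replacing `toy` with `df` should give the same table for the full dataset.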

In [4]:
df["Data_Value_Footnote"].unique()

array([nan,
       'Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%',
       'No Data Available',
       'Regional estimates may not represent all states in the region',
       'Fewer than 50 States reporting'], dtype=object)

In [5]:
df["Low_Confidence_Limit"].unique()

array([23.5,  nan, 14. , 10.2, 16.4, 15.1, 20.9, 21. , 18.2, 26.2, 25.2,
       13.8, 14.9, 21.6, 30.9, 24. , 26.8, 32.2, 24.9, 26. , 17.8, 11.6,
       29.7, 22. , 31.6, 15.3, 17.4, 21.8, 24.8, 17.6, 29.4, 22.8, 19.8,
       31. , 34.4, 30.5, 19.3, 26.4, 20.8, 18.6, 21.2, 24.5, 22.6, 21.4,
       23.4, 24.4, 19.4, 25.5, 18.3, 34.7, 23.3, 19.2, 37.1, 16.7, 16.2,
        9.2, 28.9, 32.7, 17.9, 27. , 35.3, 14.3, 21.7, 29.2, 41.2, 22.5,
       41. , 35.6, 33.5, 37.4, 83.8, 40.9, 26.3, 30.1, 37.2, 15.2, 37.8,
       27.6, 30.7, 34. ,  7.1, 44.5, 73.2, 42.7, 30.2, 33.8, 25.1, 34.1,
       19.7, 16.8, 25.9, 33.6, 23.7, 29. , 31.7, 22.3, 20.2, 32.4, 25. ,
       33. , 31.8, 28.2, 30.8, 25.3, 18.1, 15. ,  0.5, 32.8, 33.1, 29.3,
        6.9, 17. , 27.5, 48. , 78.8, 33.9, 18.7, 29.9, 20.6, 21.5, 39.6,
       43.2, 13.6, 12.1, 32. , 25.7, 12.2, 19.5, 19.1, 28.8, 22.4, 28. ,
       35. , 27.9, 22.7, 15.4, 23.6, 14.4, 14.6, 18. , 26.1, 31.1, 37. ,
       24.1, 34.2, 18.9, 18.4, 24.2, 27.7, 36.6, 28

In [6]:
df["High_Confidence_Limit"].unique()

array([ 41. ,   nan,  27.3,  25. ,  35.2,  27.2,  49.4,  33. ,  31.6,
        55.6,  47.9,  28.7,  47.1,  59. ,  48.6,  45.7,  55.8,  59.8,
        61.1,  51.2,  34.5,  31. ,  53.8,  46.6,  48.1,  54.2,  41.5,
        54.6,  43.1,  38.2,  41.7,  52.1,  31.5,  45.9,  49.3,  50.5,
        59.6,  48.7,  61.2,  58.4,  52.5,  47. ,  39.9,  57. ,  47.6,
        45.6,  42.2,  45.2,  37. ,  42.5,  50.1,  43.5,  45.3,  55.3,
        37.7,  63.5,  29.6,  28.3,  24.1,  36.2,  37.9,  44.9,  51.4,
        32.7,  40.9,  55.9,  31.9,  34.6,  43. ,  60. ,  54.9,  36.7,
        59.1,  48.8,  52.2,  99.5,  56. ,  74.8,  69.3,  47.2,  78.3,
        38.7,  36.4,  37.6,  38.6,  88.4,  80.6,  97.9,  70.4,  46. ,
        70.6,  35.8,  31.7,  57.3,  49.6,  44.3,  37.8,  29.2,  32.4,
        49. ,  57.5,  61.4,  48. ,  42.9,  64.1,  52.3,  31.2,  48.4,
        29.5,  50.4,  89.4,  43.2,  68.4,  84.9,  98.1,  83.5,  99.8,
        79.1,  39.6,  35.1,  78.2,  35.9,  76.2,  34. ,  72.9,  31.3,
        31.1,  64. ,
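
For continuous columns such as the confidence limits, `unique()` returns a long array that is hard to interpret. Summary statistics plus a simple consistency check are often more informative. The following is a minimal sketch on a hypothetical frame `toy`; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real values live in df.
toy = pd.DataFrame({
    "Low_Confidence_Limit": [23.5, np.nan, 14.0, 10.2],
    "Data_Value": [31.6, np.nan, 19.9, 16.8],
    "High_Confidence_Limit": [41.0, np.nan, 27.3, 25.0],
})

# count, mean, std, min, quartiles and max for both limits.
summary = toy[["Low_Confidence_Limit", "High_Confidence_Limit"]].describe()
print(summary)

# Rows with a reported value should satisfy low <= value <= high.
reported = toy.dropna(subset=["Data_Value"])
inconsistent = reported[
    (reported["Low_Confidence_Limit"] > reported["Data_Value"])
    | (reported["Data_Value"] > reported["High_Confidence_Limit"])
]
print(len(inconsistent))
```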

In [7]:
df["StratificationCategory1"].unique()

array(['Age Group'], dtype=object)

In [8]:
df["Stratification1"].unique()

array(['Overall', '50-64 years', '65 years or older'], dtype=object)

In [9]:
df["StratificationCategory2"].unique()

array(['Gender', 'Race/Ethnicity', nan], dtype=object)

In [10]:
df["Stratification2"].unique()

array(['Female', 'Hispanic', 'White, non-Hispanic',
       'Native Am/Alaskan Native', 'Black, non-Hispanic', 'Male', nan,
       'Asian/Pacific Islander'], dtype=object)

In [11]:
df["Geolocation"].unique()

array(['POINT (-111.7638113 34.86597028)',
       'POINT (-120.9999995 37.63864012)',
       'POINT (-72.64984095 41.56266102)',
       'POINT (-81.92896054 28.93204038)',
       'POINT (-93.81649056 42.46940091)',
       'POINT (-114.36373 43.68263001)',
       'POINT (-86.14996019 39.76691045)', nan,
       'POINT (-68.98503134 45.25422889)',
       'POINT (-84.71439027 44.66131954)',
       'POINT (-117.0718406 39.49324039)',
       'POINT (-82.40426006 40.06021014)',
       'POINT (-120.1550313 44.56744942)',
       'POINT (-71.52247031 41.70828019)',
       'POINT (-81.04537121 33.9988213)',
       'POINT (-111.5871306 39.36070017)',
       'POINT (-78.45789046 37.54268067)',
       'POINT (-72.51764079 43.62538124)',
       'POINT (-89.81637074 44.39319117)',
       'POINT (-106.1336109 38.84384076)', 'POINT (-77.036871 38.907192)',
       'POINT (-83.62758035 32.83968109)',
       'POINT (-157.8577494 21.30485044)',
       'POINT (-76.60926011 39.29058096)',
       'POINT (-89.5
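
The `Geolocation` strings follow the `POINT (x y)` pattern, with longitude first and latitude second. If numeric coordinates are needed, for example for a map, they can be extracted with a regular expression. This is a sketch under that assumption; the column names `Longitude` and `Latitude` are introduced here for illustration and do not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Two strings copied from the output above, plus a missing entry.
geo = pd.Series([
    "POINT (-111.7638113 34.86597028)",
    np.nan,
    "POINT (-77.036871 38.907192)",
], name="Geolocation")

# Named groups become the column names of the extracted frame.
pattern = r"POINT \((?P<Longitude>-?\d+(?:\.\d+)?) (?P<Latitude>-?\d+(?:\.\d+)?)\)"
coords = geo.str.extract(pattern).astype(float)

print(coords)
```

Missing `Geolocation` entries stay missing in both new columns.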

In [12]:
df["ClassID"].unique()

array(['C06'], dtype=object)

In [13]:
df["TopicID"].unique()

array(['TCC02', 'TCC03', 'TCC01', 'TCC04'], dtype=object)

In [14]:
df["QuestionID"].unique()

array(['Q31', 'Q41', 'Q30', 'Q42'], dtype=object)

In [15]:
df["QuestionID"]

0        Q31
1        Q31
2        Q31
3        Q31
4        Q31
        ... 
22177    Q42
22178    Q42
22179    Q42
22180    Q42
22181    Q42
Name: QuestionID, Length: 22182, dtype: object

In [16]:
df["LocationID"].unique()

array([   4,    6,    9,   12,   19,   16,   18, 9002,   23,   26, 9001,
         32,   39,   41,   44,   45, 9003,   59,   49,   51,   50, 9004,
         55,    8,   11,   13,   15,   24,   28,   36,   40,   42,   47,
         48,    2,    5,   10,   17,   21,   37,   33,   72,   53,   56,
          1,   20,   22,   27,   29,   38,   31,   35,   46,   54,   34,
         25,   30])

In [17]:
df["StratificationCategoryID1"].unique()

array(['AGE'], dtype=object)

In [18]:
df["StratificationID1"].unique()

array(['AGE_OVERALL', '5064', '65PLUS'], dtype=object)

In [19]:
df["StratificationCategoryID2"].unique()

array(['GENDER', 'RACE', 'OVERALL'], dtype=object)

In [20]:
df["StratificationID2"].unique()

array(['FEMALE', 'HIS', 'WHT', 'NAA', 'BLK', 'MALE', 'OVERALL', 'ASN'],
      dtype=object)
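
The outputs above show that `StratificationCategory2` and `Stratification2` contain `nan`, while the matching ID columns contain `'OVERALL'` instead. That suggests the missing labels mean "no second stratification", but this is an inference that should be checked before filling anything. A minimal sketch of such a check, on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the label/ID pairing seen above.
toy = pd.DataFrame({
    "Stratification2": ["Female", "Hispanic", np.nan, "Male", np.nan],
    "StratificationID2": ["FEMALE", "HIS", "OVERALL", "MALE", "OVERALL"],
})

# Which ID values accompany a missing label?
ids_for_missing = toy.loc[toy["Stratification2"].isna(), "StratificationID2"].unique()
print(ids_for_missing)

# Fill only if every missing label pairs with 'OVERALL'.
if set(ids_for_missing) <= {"OVERALL"}:
    toy["Stratification2"] = toy["Stratification2"].fillna("Overall")

print(toy["Stratification2"].value_counts(dropna=False))
```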

### Cleaning the dataset

In [21]:
# create a new "Year" column from the midpoint of YearStart and YearEnd, rounded to an integer
# (round() sends exact .5 midpoints to the nearest even integer, not always upward)
df['Year'] = ((df['YearStart'] + df['YearEnd']) / 2).round().astype(int)
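
One detail of the cell above is worth illustrating: `Series.round()` rounds exact halves to the nearest even integer, so a midpoint ending in .5 does not always round up. Whether the dataset contains multi-year ranges is not shown here, so the year pairs below are hypothetical.

```python
import pandas as pd

# Hypothetical year ranges: one single-year row and two multi-year rows.
years = pd.DataFrame({
    "YearStart": [2022, 2015, 2019],
    "YearEnd": [2022, 2020, 2022],
})

midpoint = (years["YearStart"] + years["YearEnd"]) / 2  # 2022.0, 2017.5, 2020.5

# round() sends exact halves to the nearest even integer.
half_even = midpoint.round().astype(int)

# Integer arithmetic that always rounds halves up instead.
half_up = (years["YearStart"] + years["YearEnd"] + 1) // 2

print(half_even.tolist())  # [2022, 2018, 2020]
print(half_up.tolist())    # [2022, 2018, 2021]
```

If every row has `YearStart == YearEnd`, both approaches give the same result.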

In [22]:
# move the "Year" column to the front
col_order = ["Year"] + [col for col in df.columns if col != "Year"]
df = df[col_order]

In [23]:
# rename "LocationDesc" to "Location" and "Data_Value" to "Percentage_Value"
df.rename(columns={"LocationDesc" : "Location",
                  "Data_Value" : "Percentage_Value"}, 
          inplace=True)

# move "Location" column next to "Year"
col_order = ["Year", "Location"] + [col for col in df.columns if col not in ["Year", "Location"]]
df = df[col_order]
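
An alternative way to move a column is to combine `DataFrame.pop` and `DataFrame.insert`, which avoids rebuilding the full column list. This is a sketch on a hypothetical one-row frame `toy`, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical frame with "Year" and "Location" in the wrong positions.
toy = pd.DataFrame({
    "LocationAbbr": ["AZ"],
    "Location": ["Arizona"],
    "Percentage_Value": [31.6],
    "Year": [2022],
})

# pop() removes the column and returns it; insert() places it at a given position.
toy.insert(0, "Year", toy.pop("Year"))
toy.insert(1, "Location", toy.pop("Location"))

print(toy.columns.tolist())
# ['Year', 'Location', 'LocationAbbr', 'Percentage_Value']
```

Both calls modify the frame in place.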

In [24]:
# modifying the topic column
# Ensure 'Topic' exists before proceeding
if 'Topic' in df.columns:
    # Define shortened names for the topics
    topic_mapping = {
        'Functional difficulties associated with subjective cognitive decline or memory loss among older adults': 'Functional_Difficulties',
        'Need assistance with day-to-day activities because of subjective cognitive decline or memory loss': 'Needs_Assistance',
        'Subjective cognitive decline or memory loss among older adults': 'Cognitive_Decline',
        'Talked with health care professional about subjective cognitive decline or memory loss': 'Consulted_Professional'
    }

    # Create a 0/1 indicator column for each topic
    for full_name, short_name in topic_mapping.items():
        df[short_name] = (df['Topic'] == full_name).astype(int)

In [25]:
# modifying the question column
# Define shortened names for the questions
question_mapping = {
    'Percentage of older adults who reported subjective cognitive decline or memory loss that interferes with their ability to engage in social activities or household chores': 'q_Interferes_Activities',
    'Percentage of older adults who reported that as a result of subjective cognitive decline or memory loss that they need assistance with day-to-day activities': 'q_Needs_Assistance',
    'Percentage of older adults who reported subjective cognitive decline or memory loss that is happening more often or is getting worse in the preceding 12 months': 'q_Worsening_Decline',
    'Percentage of older adults with subjective cognitive decline or memory loss who reported talking with a health care professional about it': 'q_Consulted_Professional'
}

# Create a 0/1 indicator column for each question
for full_name, short_name in question_mapping.items():
    df[short_name] = (df['Question'] == full_name).astype(int)
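
The two loops above build one indicator column per topic and per question by exact string comparison. The same result can be sketched with `Series.map` followed by `pd.get_dummies`, which also makes it easy to spot strings that matched no key. The frame `toy` and the short placeholder strings below are hypothetical stand-ins for the long topic texts.

```python
import pandas as pd

# Hypothetical placeholders for the long Topic strings.
toy = pd.DataFrame({
    "Topic": ["long topic A", "long topic B", "long topic A", "unmapped topic"],
})
topic_mapping = {"long topic A": "Topic_A", "long topic B": "Topic_B"}

# Values without a key in the mapping become NaN.
short = toy["Topic"].map(topic_mapping)
print(short.isna().sum())  # number of rows that matched no key

# One 0/1 column per short name; NaN rows get 0 everywhere.
indicators = pd.get_dummies(short).astype(int)

# Keep every mapped name as a column even if it never occurs in the data.
indicators = indicators.reindex(columns=list(topic_mapping.values()), fill_value=0)

toy = pd.concat([toy, indicators], axis=1)
print(toy)
```

A non-zero count of unmatched rows would indicate that a key in the mapping differs from the text in the data, for example by a trailing space.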

**New Columns** (each derived from the text of `Data_Value_Footnote`):
- `Small_Sample_Size`: 1 if the footnote reports a sample size below 50 or a relative standard error above 30%, else 0.
- `No_Data_Available`: 1 if **no data is available**, else 0.
- `Regional_Issue`: 1 if **regional estimates may not represent all states in the region**, else 0.
- `Few_States_Reported`: 1 if **fewer than 50 states reported**, else 0.

In [26]:
# Create 0/1 flag columns from the footnote text
df['Small_Sample_Size'] = df['Data_Value_Footnote'].str.contains("Sample size", na=False).astype(int)
df['No_Data_Available'] = df['Data_Value_Footnote'].str.contains("No Data Available", na=False).astype(int)
df['Regional_Issue'] = df['Data_Value_Footnote'].str.contains("Regional estimates", na=False).astype(int)
df['Few_States_Reported'] = df['Data_Value_Footnote'].str.contains("Fewer than 50 States", na=False).astype(int)
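
`str.contains` treats its pattern as a regular expression by default. The phrases used above contain no special characters, so the result is the same, but passing `regex=False` makes the literal match explicit. The sketch below also cross-tabulates the flags against missing values. The frame `toy` is hypothetical, and the pairing of footnotes with values in it is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; footnote texts are copied from the unique() output above.
toy = pd.DataFrame({
    "Data_Value": [31.6, np.nan, np.nan, 19.9],
    "Data_Value_Footnote": [
        np.nan,
        "Sample size of denominator and/or age group for age-standardization "
        "is less than 50 or relative standard error is more than 30%",
        "No Data Available",
        "Regional estimates may not represent all states in the region",
    ],
})

flags = {
    "Small_Sample_Size": "Sample size",
    "No_Data_Available": "No Data Available",
    "Regional_Issue": "Regional estimates",
    "Few_States_Reported": "Fewer than 50 States",
}

# regex=False matches the phrase literally; na=False maps missing footnotes to 0.
for column, phrase in flags.items():
    toy[column] = (
        toy["Data_Value_Footnote"].str.contains(phrase, regex=False, na=False).astype(int)
    )

# How often does each flag coincide with a missing value?
print(toy.groupby(toy["Data_Value"].isna())[list(flags)].sum())
```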

In [27]:
# drop the unnecessary columns
df.drop(columns=["YearStart", "YearEnd", "LocationAbbr",
                "Datasource", "Class", "Data_Value_Unit",
                "DataValueTypeID", "Data_Value_Type",
                "Data_Value_Alt", "Data_Value_Footnote_Symbol",
                "Topic", "Question", "Data_Value_Footnote", "Geolocation",
                "ClassID"], 
        inplace=True)
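
The exploration above showed that some columns hold a single value throughout: `ClassID` is always `'C06'`, and `StratificationCategory1` and `StratificationCategoryID1` are always `'Age Group'` and `'AGE'`. Such columns can be found programmatically instead of by inspection. This is a minimal sketch on a hypothetical frame `toy`; whether to drop them is a judgment call for the analysis.

```python
import pandas as pd

# Hypothetical frame with two constant columns and one varying column.
toy = pd.DataFrame({
    "ClassID": ["C06", "C06", "C06"],
    "StratificationCategoryID1": ["AGE", "AGE", "AGE"],
    "StratificationID1": ["AGE_OVERALL", "5064", "65PLUS"],
})

# A column with one distinct value (NaN counted as a value) does not vary across rows.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['ClassID', 'StratificationCategoryID1']

trimmed = toy.drop(columns=constant_cols)
print(trimmed.columns.tolist())  # ['StratificationID1']
```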

In [28]:
df.head()

Unnamed: 0,Year,Location,Percentage_Value,Low_Confidence_Limit,High_Confidence_Limit,StratificationCategory1,Stratification1,StratificationCategory2,Stratification2,TopicID,...,Cognitive_Decline,Consulted_Professional,q_Interferes_Activities,q_Needs_Assistance,q_Worsening_Decline,q_Consulted_Professional,Small_Sample_Size,No_Data_Available,Regional_Issue,Few_States_Reported
0,2022,Arizona,31.6,23.5,41.0,Age Group,Overall,Gender,Female,TCC02,...,0,0,1,0,0,0,0,0,0,0
1,2022,Arizona,,,,Age Group,50-64 years,Race/Ethnicity,Hispanic,TCC02,...,0,0,1,0,0,0,1,0,0,0
2,2022,Arizona,19.9,14.0,27.3,Age Group,65 years or older,Race/Ethnicity,"White, non-Hispanic",TCC02,...,0,0,1,0,0,0,0,0,0,0
3,2022,Arizona,,,,Age Group,65 years or older,Race/Ethnicity,Native Am/Alaskan Native,TCC02,...,0,0,1,0,0,0,1,0,0,0
4,2022,Arizona,,,,Age Group,Overall,Race/Ethnicity,"Black, non-Hispanic",TCC02,...,0,0,1,0,0,0,1,0,0,0
