# **Analyzing Chronic Diseases and Behavioural Patterns in the US with a Focus on Diabetes**

Contributors: Arnav Sachdeva, Haojiang Wu, Hui Gao, Pin-Hao Pan, Rashi Jaiswal, Tharfeed Ahmed Unus

# **1. Executive Summary**

Chronic diseases such as diabetes, asthma, arthritis, depression and cancer are leading causes of mortality and healthcare expenditure in the US. Diabetes in particular imposes a significant economic burden, with annual costs reaching approximately 413 billion USD as of 2022. This analysis investigates the prevalence of chronic diseases in the United States, with a focus on diabetes, using data from the CDC's Chronic Disease Indicators (CDI), and examines the factors influencing these diseases using the 2022 Behavioral Risk Factor Surveillance System (BRFSS) data. Our findings reveal significant regional disparities, with higher rates of diabetes observed in Southern states like Louisiana, Mississippi and Arkansas, as well as the US territories of Puerto Rico and Guam. Socioeconomic factors, particularly low income, are strongly correlated with diabetes prevalence, highlighting the challenges of managing healthcare costs in these populations. Young adults with diabetes often face barriers to accessing medical care due to affordability, potentially worsening health outcomes as they age. Depression was found to be frequently comorbid with diabetes, highlighting the need for integrated care models that address both physical and mental health.

To mitigate these challenges, we recommend health interventions in high-risk regions and populations. Increasing subsidies for insulin, expanding insurance coverage for preventative care, and introducing sliding-scale healthcare costs can improve access for low-income groups. States with low participation in diabetes education, such as Mississippi and Georgia, should invest in community outreach and awareness programs to empower individuals with the knowledge to manage their condition. It is also important to address the mental health burden among diabetic individuals through integrated care programs that cover both conditions. Finally, childhood-focused initiatives in states like Oregon and Florida are recommended to improve long-term health outcomes by addressing adverse childhood experiences. Going forward, this analysis can be expanded to examine disparities among different demographic groups and to cover multiple years in order to analyse time-based trends and outcomes. These efforts can significantly reduce the burden of diabetes and improve public health outcomes nationwide.

# **2. Introduction**

### **i. Motivation**

Chronic diseases pose significant health and economic burdens across the United States. Despite widespread public health efforts, some regions consistently show higher rates of chronic diseases than others. A deeper understanding of the factors driving chronic disease prevalence—such as lifestyle choices, environmental conditions, socioeconomic status, and access to healthcare—can inform comprehensive strategies to improve health outcomes. This could lead to a reduction in the overall incidence of chronic diseases, improved quality of life for those affected, and longer life expectancies.

Among the various chronic diseases, diabetes stands out not only due to its widespread prevalence (an estimated 38 million Americans have diabetes, with another 98 million having prediabetes)<sup>1</sup>, but also because it is a key contributor to other major health complications such as cardiovascular disease, kidney disease and amputations. Addressing diabetes through prevention, early detection, and management could reduce these enormous costs and alleviate the broader strain on the healthcare system.

### **ii. Business Problem**

Chronic diseases, such as diabetes, are leading causes of morbidity, mortality, and healthcare expenditure in the U.S. Understanding their prevalence is crucial for identifying high-risk populations and regions, enabling more targeted public health interventions and resource allocation. The economic burden of diabetes in the United States is substantial, with annual costs reaching approximately 413 billion USD as of 2022. This includes 307 billion USD in direct healthcare costs, such as hospital stays, outpatient care and medications, and 106 billion USD in reduced productivity due to lost working days, disability, etc. This makes diabetes one of the most expensive chronic diseases in the US, in addition to being extremely widespread and a key contributor to other health complications.<sup>1</sup>

Identifying underlying behavioral and societal factors that may contribute to these diseases can help support public health policy and better target prevention efforts. By analyzing these trends, we hope to gain insights into the behavioral patterns that might be contributing to disparities across regions, and aid public policy decisions in the healthcare domain.

# **3. Data and Source**

Our data comes primarily from the Centers for Disease Control and Prevention (CDC), through the following sources:
1.   The CDC website on Chronic Disease Indicators (CDI) in the US: https://data.cdc.gov/Chronic-Disease-Indicators/U-S-Chronic-Disease-Indicators/hksd-2xuw/about_data

  CDC's Division of Population Health provides a cross-cutting set of 115 indicators developed by consensus among CDC, the Council of State and Territorial Epidemiologists, and the National Association of Chronic Disease Directors. These indicators allow states and territories to uniformly define, collect, and report chronic disease data that are important to public health practice in their area. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources.
2.   The CDC website on 2022 Behavioral Risk Factor Surveillance System (BRFSS) Survey Data and Documentation: https://www.cdc.gov/brfss/annual_data/annual_2022.html

  The 2022 BRFSS data continue to reflect the changes initially made in 2011 for weighting methodology and adding cell-phone-only respondents. The aggregate BRFSS combined landline and cell phone data set is built from the landline and cell phone data submitted for 2022 and includes data from 50 states, the District of Columbia, Guam, Puerto Rico, and the US Virgin Islands.

We have opted to analyze data from 2022 even though BRFSS data is available for 2023, since the CDI dataset is only complete for 2022.

## **i. Importing the Dataset to BigQuery**

In [29]:
from google.colab import auth
auth.authenticate_user()

In [30]:
import pandas as pd
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [31]:
!pip install plotly_express -q
import plotly_express as px
from plotly.offline import plot
from plotly.subplots import make_subplots

In [32]:
!pip install google-cloud-bigquery pandas
from google.cloud import bigquery
client = bigquery.Client(project="fall24-ba775-a08", location="US")



The CDI dataset is available for download as a CSV file at the link listed above, making the import to BigQuery straightforward. However, the BRFSS data is distributed in the SAS transport (.xpt) format, which is also compatible with SPSS and Stata, so we used Python to convert the dataset to CSV and then imported the resulting file to BigQuery. The Python code we used for the conversion can be found below:

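```
import io
import zipfile

import pandas as pd
import requests

# Download the zipped 2022 BRFSS file from the CDC website
url = 'https://www.cdc.gov/brfss/annual_data/2022/files/LLCP2022XPT.zip'
response = requests.get(url)

if response.status_code == 200:
    # Extract the archive into the current working directory
    with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
        zip_ref.extractall()
    print("File extracted successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

# The trailing space in the file name is kept from the original code and is
# assumed to match the extracted file; the format is given explicitly so that
# it does not have to be inferred from the file name
df_brfss = pd.read_sas('LLCP2022.XPT ', format='xport')

# The DataFrame index is written as the first column of the CSV
df_brfss.to_csv("LLCP2022.csv")
```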
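
One side effect of this conversion is visible in the preview of the backup table further down: character fields such as `IDATE` arrive in the CSV as byte-string literals like `b'11142022'`. The following is a minimal sketch of how such values could be decoded, assuming the literal format shown in that preview and an `MMDDYYYY` date layout (which agrees with the `IMONTH`, `IDAY` and `IYEAR` values in the same row). The helper names are hypothetical and are not part of the original pipeline.

```python
import ast
from datetime import date, datetime


def decode_bytes_literal(value: str) -> str:
    """Turn a CSV cell such as "b'11142022'" into the plain string '11142022'."""
    if value.startswith(("b'", 'b"')):
        # literal_eval safely evaluates the bytes literal without running code
        return ast.literal_eval(value).decode("ascii")
    return value


def parse_idate(value: str) -> date:
    """Parse a BRFSS interview date, assumed to be laid out as MMDDYYYY."""
    return datetime.strptime(decode_bytes_literal(value), "%m%d%Y").date()


print(parse_idate("b'11142022'"))  # 2022-11-14
```

Passing an `encoding` argument to `pd.read_sas` would be another way to avoid byte strings at the source; the conversion above does not do this.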

Once both CSV files were prepared, they were saved to the project's Google Cloud Storage bucket. From there, we created the respective tables in the `group_project` dataset of the GCP project corresponding to the term project, `fall24-ba775-a08`.

## **ii. Data Dictionary**

#### **a. BRFSS Data Dictionary**

The Behavioral Risk Factor Surveillance System (BRFSS) data available to us takes the form of an elaborate survey questionnaire with over 200 questions.<sup>2</sup> Although we will be reducing the number of columns we use going forward, the challenge of using the data appropriately remains.

In order to help us navigate the questions on the survey better, we have housed the data dictionary for the BRFSS data in `fall24-ba775-a08.group_project.brfss_2022_data_dictionary`. This allows us to query the table whenever information is required, allowing for easier access to the relevant questions.

For example, in order to get all relevant columns in the `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022` table that are related to income or insurance for further analysis, we would now be able to do the following by utilizing BigQuery's INFORMATION_SCHEMA views:

In [33]:
%%bigquery --project=fall24-ba775-a08
# Looking for Income related columns in the Data Dictionary
SELECT col.column_name, dd.field_definition, dd.field_key
FROM fall24-ba775-a08.group_project.INFORMATION_SCHEMA.COLUMNS col
INNER JOIN `fall24-ba775-a08.group_project.brfss_2022_data_dictionary` dd
ON col.column_name = dd.field_name
WHERE LOWER(dd.field_definition) LIKE '%income%'
AND col.table_name = 'behavioral_risk_factor_surveillance_system_2022'

column_name | field_definition | field_key
----------- | ---------------- | ---------
INCOME3 | Is your annual household income from all sources— | 01 Less than $10,000? 02 Less than $15,000? ($10,000 to less than $15,000) 03 Less than $20,000? ($15,000 to less than $20,000) 04 Less than $25,000 05 Less than $35,000 If ($25,000 to less than $35,000) 06 Less than $50,000 If ($35,000 to less than $50,000) 07 Less than $75,000? ($50,000 to less than $75,000) 08 Less than $100,000? ($75,000 to less than $100,000) 09 Less than $150,000? ($100,000 to less than $150,000)? 10 Less than $200,000? ($150,000 to less than $200,000) 11 $200,000 or more Do not read: 77 Don’t know / Not sure 99 Refused
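
The `field_key` for `INCOME3` can be turned into a small lookup so that numeric codes become readable brackets and the non-response codes (77 and 99) are treated as missing. The following is a minimal sketch, with labels transcribed from the key above. The function name is hypothetical, and the handling of codes stored as floats (for example `7.0`) is an assumption based on how values appear in the raw table preview.

```python
from typing import Optional

# Labels transcribed from the INCOME3 field_key in the data dictionary.
INCOME3_LABELS = {
    1: "Less than $10,000",
    2: "$10,000 to less than $15,000",
    3: "$15,000 to less than $20,000",
    4: "Less than $25,000",  # the key states no lower bound; presumably $20,000
    5: "$25,000 to less than $35,000",
    6: "$35,000 to less than $50,000",
    7: "$50,000 to less than $75,000",
    8: "$75,000 to less than $100,000",
    9: "$100,000 to less than $150,000",
    10: "$150,000 to less than $200,000",
    11: "$200,000 or more",
}

# 77 = Don't know / Not sure, 99 = Refused
NON_RESPONSE_CODES = {77, 99}


def income_label(code) -> Optional[str]:
    """Map an INCOME3 code to its bracket; return None for missing or non-response."""
    if code is None:
        return None
    code = int(float(code))  # codes may arrive as floats such as 7.0
    if code in NON_RESPONSE_CODES:
        return None
    return INCOME3_LABELS.get(code)


print(income_label(7.0))   # $50,000 to less than $75,000
print(income_label(77.0))  # None
```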


In [34]:
%%bigquery --project=fall24-ba775-a08
# Looking for Insurance related columns in the Data Dictionary
SELECT col.column_name, dd.field_definition, dd.field_key
FROM fall24-ba775-a08.group_project.INFORMATION_SCHEMA.COLUMNS col
INNER JOIN `fall24-ba775-a08.group_project.brfss_2022_data_dictionary` dd
ON col.column_name = dd.field_name
WHERE LOWER(dd.field_definition) LIKE '%insurance%'
AND col.table_name = 'behavioral_risk_factor_surveillance_system_2022'

column_name | field_definition | field_key
----------- | ---------------- | ---------
PRIMINSR | What is the current primary source of your health insurance? | 01 A plan purchased through an employer or union (including plans purchased through another person's employer) 02 A private nongovernmental plan that you or another family member buys on your own 03 Medicare 04 Medigap 05 Medicaid 06 Children's Health Insurance Program (CHIP) 07 Military related health care: TRICARE (CHAMPUS) / VA health care / CHAMP- VA 08 Indian Health Service 09 State sponsored health plan 10 Other government program 88 No coverage of any type 77 Don’t Know/Not Sure 99 Refused
CSRVINSR | With your most recent diagnosis of cancer, did you have health insurance that paid for all or part of your cancer treatment? | 1 Yes 2 No 7 Don’t know/ not sure 9 Refused
CSRVDEIN | Were you ever denied health insurance or life insurance coverage because of your cancer? | 1 Yes 2 No 7 Don’t know/ not sure 9 Refused


From the above fields, `PRIMINSR` would be the most relevant field for our analysis later on.
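
The same keyword lookup can be expressed outside BigQuery. The following is a minimal Python sketch of the join-and-filter logic used in the two queries above: it keeps only dictionary entries whose `field_name` is an actual table column and whose `field_definition` contains the keyword, ignoring case. The function name and the in-memory sample rows are illustrative assumptions, and the definitions are abbreviated from the query results.

```python
def find_fields(dictionary_rows, table_columns, keyword):
    """Return field names defined in the dictionary that exist in the table
    and whose definition mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        row["field_name"]
        for row in dictionary_rows
        if row["field_name"] in table_columns
        and keyword in row["field_definition"].lower()
    ]


# Abbreviated sample of the data dictionary, based on the query results above.
sample_dictionary = [
    {"field_name": "INCOME3",
     "field_definition": "Is your annual household income from all sources"},
    {"field_name": "PRIMINSR",
     "field_definition": "What is the current primary source of your health insurance?"},
    {"field_name": "CSRVINSR",
     "field_definition": "did you have health insurance that paid for your cancer treatment?"},
    {"field_name": "CSRVDEIN",
     "field_definition": "Were you ever denied health insurance or life insurance coverage"},
]
sample_columns = {"INCOME3", "PRIMINSR", "CSRVINSR", "CSRVDEIN", "DIABETE4"}

print(find_fields(sample_dictionary, sample_columns, "insurance"))
print(find_fields(sample_dictionary, sample_columns, "income"))
```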

The full data dictionary can be accessed through Google Sheets, and is available [here](https://docs.google.com/spreadsheets/d/1awpTOHHZMqdlawvJ6A47deXhjesijlORI6H_yvWOZ2Q/edit?usp=sharing).

#### **b. CDI Data Dictionary**

Column | Description | Data Type
------ | ----------- | ---------
**YearStart** | The year for which the data is being reported | ``int64``
**LocationDesc** | The US state name in full | ``string``
**LocationID** | The US state ID as a number | ``int64``
**Topic** | The domain under which the reported data falls | ``string``
**Question** | The definition of the specific metric being reported | ``string``
**DataValue** | The numeric quantification of the metric  | ``float64``
**DataValueType** | Crude or Age-adjusted Prevalence | ``string``
**StratificationCategory1** | The basis by which the data is classified | ``string``
**Stratification1** | The specific demographic group for which the metric is reported | ``string``

# **4. Entity Relationship Diagram**

![EntityRelationshipDiagram](https://drive.google.com/uc?export=view&id=1VuwZLqQAMRNrh9PNFP3w1vv916mT24GO)

The Crow's Foot Entity Relationship Diagram (ERD) provided depicts two tables: the `behavioral_risk_factor_surveillance_system_2022` (BRFSS) table and the `chronic_disease_indicators` (CDI) table. The BRFSS table contains survey responses from various individuals across the United States, while the CDI table contains aggregated data related to chronic conditions and behavioral factors at the state level. The PERSONID column is the Primary Key in the BRFSS table, as it uniquely identifies each respondent of the survey. The relationship between the BRFSS and CDI tables is denoted as mandatory-many-to-one. This means that each record in the BRFSS table (representing individual respondents) corresponds to one record in the CDI table (representing the aggregated state-level data), and each record in the CDI table is associated with multiple respondents from the BRFSS table. We have used the State ID (denoted by STATE and LocationDesc in the BRFSS and CDI tables) to link them together.
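
To make the many-to-one relationship concrete, the following is a small Python sketch in which several respondents resolve to a single state-level record through the state ID. It assumes the CDI side has been reduced to one record per state. The state IDs and names are taken from the CDI preview in the data cleaning section, and `PERSONID` 26718 with `_STATE` 6 appears in the BRFSS preview. The other two respondents and the function name are hypothetical.

```python
from collections import Counter

# State-level side: one entry per state (LocationID -> LocationDesc).
cdi_states = {1: "Alabama", 6: "California", 24: "Maryland"}

# Respondent-level side: many rows may share the same state ID.
respondents = [
    {"PERSONID": 26718, "_STATE": 6},
    {"PERSONID": 900001, "_STATE": 6},  # hypothetical respondent
    {"PERSONID": 900002, "_STATE": 1},  # hypothetical respondent
]


def attach_state(rows, states):
    """Add the state name to each respondent row.

    The relationship is mandatory, so a respondent whose state ID has no
    match raises a KeyError instead of being dropped silently.
    """
    return [{**row, "LocationDesc": states[row["_STATE"]]} for row in rows]


joined = attach_state(respondents, cdi_states)
respondents_per_state = Counter(row["LocationDesc"] for row in joined)
print(respondents_per_state)
```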

# **5. Data Cleaning**

We created tables suffixed with "_backup" from the files uploaded to the project's cloud storage buckets, so that the final tables can be created from them after determining which columns would need to be dropped/created.

In [35]:
%%bigquery --project=fall24-ba775-a08
SELECT * FROM `fall24-ba775-a08.group_project.chronic_disease_indicators_backup` LIMIT 10

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,DataValue,DataValueAlt,DataValueFootnoteSymbol,DataValueFootnote,LowConfidenceLimit,HighConfidenceLimit,StratificationCategory1,Stratification1,StratificationCategory2,Stratification2,StratificationCategory3,Stratification3,Geolocation,LocationID,TopicID,QuestionID,ResponseID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2019,2019,MD,Maryland,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,21.0,21.0,,,19.6,22.4,Sex,Male,,,,,POINT (-76.60926011099963 39.29058096400047),24,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
1,2019,2019,PR,Puerto Rico,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,26.4,26.4,,,21.3,32.2,Sex,Male,,,,,POINT (-66.590149 18.220833),72,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
2,2019,2019,GU,Guam,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,28.4,28.4,,,22.4,35.2,Sex,Male,,,,,POINT (144.793731 13.444304),66,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
3,2019,2019,AL,Alabama,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,20.1,20.1,,,16.3,24.6,Sex,Male,,,,,POINT (-86.63186076199969 32.84057112200048),1,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
4,2019,2019,AR,Arkansas,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,26.6,26.6,,,21.3,32.8,Sex,Male,,,,,POINT (-92.27449074299966 34.74865012400045),5,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
5,2019,2019,AZ,Arizona,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,23.5,23.5,,,18.3,29.7,Sex,Male,,,,,POINT (-111.76381127699972 34.865970280000454),4,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
6,2019,2019,CA,California,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,19.3,19.3,,,16.2,22.7,Sex,Male,,,,,POINT (-120.99999953799971 37.63864012300047),6,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
7,2019,2019,AK,Alaska,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,19.0,19.0,,,16.1,22.4,Sex,Male,,,,,POINT (-147.72205903599973 64.84507995700051),2,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
8,2019,2019,DC,District of Columbia,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,16.7,16.7,,,15.7,17.9,Sex,Male,,,,,POINT (-77.036871 38.907192),11,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,
9,2019,2019,DE,Delaware,YRBSS,Alcohol,Alcohol use among high school students,,%,Crude Prevalence,,,*,No data available,,,Sex,Male,,,,,POINT (-75.57774116799965 39.008830667000495),10,ALC,ALC01,,CRDPREV,SEX,SEXM,,,,


In [36]:
%%bigquery --project=fall24-ba775-a08
SELECT * FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022_backup` LIMIT 10

Unnamed: 0,int64_field_0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,CELPHON1,LADULT1,COLGSEX1,NUMADULT,LANDSEX1,NUMMEN,NUMWOMEN,RESPSLCT,SAFETIME,CTELNUM1,CELLFON5,CADULT1,CELLSEX1,PVTRESD3,CCLGHOUS,CSTATE1,LANDLINE,HHADULT,SEXVAR,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,PRIMINSR,PERSDOC3,MEDCOST1,CHECKUP1,EXERANY2,SLEPTIM1,LASTDEN4,RMVTETH4,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNC1,CHCOCNC1,CHCCOPD3,ADDEPEV3,CHCKDNY2,HAVARTH4,DIABETE4,DIABAGE4,MARITAL,EDUCA,RENTHOM1,NUMHHOL4,NUMPHON4,CPDEMO1C,VETERAN3,EMPLOY1,CHILDREN,INCOME3,PREGNANT,WEIGHT2,HEIGHT3,DEAF,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,HADMAM,HOWLONG,CERVSCRN,CRVCLCNC,CRVCLPAP,CRVCLHPV,HADHYST2,HADSIGM4,COLNSIGM,COLNTES1,SIGMTES1,LASTSIG4,COLNCNCR,VIRCOLO1,VCLNTES2,SMALSTOL,STOLTEST,STOOLDN2,BLDSTFIT,SDNATES1,SMOKE100,SMOKDAY2,USENOW3,ECIGNOW2,LCSFIRST,LCSLAST,LCSNUMCG,LCSCTSC1,LCSSCNCR,LCSCTWHN,ALCDAY4,AVEDRNK3,DRNK3GE5,MAXDRNKS,FLUSHOT7,FLSHTMY3,PNEUVAC4,TETANUS1,HIVTST7,HIVTSTD3,HIVRISK5,COVIDPOS,COVIDSMP,COVIDPRM,PDIABTS1,PREDIAB2,DIABTYPE,INSULIN1,CHKHEMO3,EYEEXAM1,DIABEYE1,DIABEDU1,FEETSORE,TOLDCFS,HAVECFS,WORKCFS,IMFVPLA3,HPVADVC4,HPVADSHT,SHINGLE2,COVIDVA1,COVACGET,COVIDNU1,COVIDINT,COVIDFS1,COVIDSE1,COPDCOGH,COPDFLEM,COPDBRTH,COPDBTST,COPDSMOK,CNCRDIFF,CNCRAGE,CNCRTYP2,CSRVTRT3,CSRVDOC1,CSRVSUM,CSRVRTRN,CSRVINST,CSRVINSR,CSRVDEIN,CSRVCLIN,CSRVPAIN,CSRVCTL2,PSATEST1,PSATIME1,PCPSARS2,PSASUGST,PCSTALK1,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,CAREGIV1,CRGVREL4,CRGVLNG1,CRGVHRS1,CRGVPRB3,CRGVALZD,CRGVPER1,CRGVHOU1,CRGVEXPT,ACEDEPRS,ACEDRINK,ACEDRUGS,ACEPRISN,ACEDIVRC,ACEPUNCH,ACEHURT1,ACESWEAR,ACETOUCH,ACETTHEM,ACEHVSEX,ACEADSAF,ACEADNED,LSATISFY,EMTSUPRT,SDHISOLT,SDHEMPLY,FOODSTMP,SDHFOOD1,SDHBILLS,SDHUTILS,SDHTRNSP,SDHSTRE1,MARIJAN1,MARJSMOK,MARJEAT,MARJVAPE,MARJDAB,MARJOTHR,USEMRJN4,LASTSMK2,STOPSMK2,MENTCIGS,MENTECIG,HEATTBCO,ASBIALCH,ASBIDRNK,ASBIBING,ASBIADVC,ASBIRDUC,FIREARM5,GUNLOAD,LOADULK2,RCSGEND1,RCSXBRTH,RCSRLTN2,CASTHDX2,CASTHNO2,BIRTHSEX,SOMALE,SOFEMALE,TRNSGNDR,HADSEX,PFPPRVN4,TYPCNTR9,BRTHCNT4,WHEREGET,NOBCUSE8,BCPREFER,RRCLASS3,RRCOGNT2,RRTREAT,RRATWRK2,RRHCARE4,RRPHYSM2,QSTVER,QSTLANG,_METSTAT,_URBSTAT,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_IMPRACE,_CHISPNC,_CRACE2,_CPRACE2,CAGEG,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT2,_LLCPWT,_RFHLTH,_PHYS14D,_MENT14D,_HLTHPLN,_HCVU652,_TOTINDA,_EXTETH3,_ALTETH3,_DENVST3,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR2,_PRACE2,_MRACE2,_HISPANC,_RACE1,_RACEG22,_RACEGR4,_RACEPR1,_SEX,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG1,_RFMAM22,_MAM5023,_HADCOLN,_CLNSCP1,_HADSIGM,_SGMSCP1,_SGMS101,_RFBLDS5,_STOLDN1,_VIRCOL1,_SBONTI1,_CRCREC2,_SMOKER3,_RFSMOK3,_CURECI2,_YRSSMOK,_PACKDAY,_PACKYRS,_YRSQUIT,_SMOKGRP,_LCSREC,DRNKANY6,DROCDY4_,_RFBING6,_DRNKWK2,_RFDRHV8,_FLSHOT7,_PNEUMO3,_AIDTST4
0,276511,37.0,11.0,b'11142022',b'11',b'14',b'2022',1100.0,b'2022000822',2022001000.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,2.0,88.0,2.0,88.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,8.0,2.0,2.0,2.0,2.0,,7.0,7.0,2.0,2.0,2.0,2.0,3.0,,3.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,7.0,,120.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,4.0,1.0,2.0,2.0,1.0,1.0,4.0,,,2.0,,,,,,,,7.0,,3.0,1.0,,,,1.0,7.0,,220.0,1.0,88.0,1.0,1.0,102022.0,1.0,1.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,1.0,,2.0,,32021.0,42021.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,4.0,2.0,2.0,5.0,2.0,2.0,2.0,3.0,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,2.0,4.0,,,,,,,,1.0,4.0,2.0,,7.0,2.0,10.0,1.0,1.0,1.0,1.0,371071.0,97.773912,1.0,97.773912,1.0,,,,,,1.0,0.200964,368.466148,171.282098,1.0,1.0,2.0,1.0,9.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,62.0,157.0,5443.0,2195.0,2.0,1.0,1.0,4.0,5.0,1.0,,1.0,,2.0,,,,,,,,9.0,9.0,1.0,,,,,,,1.0,67.0,1.0,467.0,1.0,1.0,1.0,2.0
1,276333,37.0,5.0,b'05052022',b'05',b'05',b'2022',1100.0,b'2022000644',2022001000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,2.0,88.0,88.0,,3.0,1.0,2.0,1.0,1.0,7.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,2.0,2.0,1.0,1.0,62.0,1.0,5.0,1.0,2.0,,1.0,2.0,1.0,88.0,6.0,,226.0,506.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,9.0,2.0,9.0,1.0,1.0,3.0,3.0,5.0,,1.0,2.0,,2.0,,2.0,,,7.0,,3.0,1.0,,,,1.0,2.0,,101.0,1.0,88.0,1.0,1.0,92021.0,1.0,1.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,1.0,,3.0,,777777.0,777777.0,,,,,,1.0,55.0,29.0,2.0,5.0,1.0,1.0,1.0,1.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,4.0,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,2.0,4.0,,,,,,,,1.0,1.0,2.0,2.0,7.0,2.0,10.0,1.0,1.0,1.0,1.0,371071.0,97.773912,2.0,195.547825,1.0,,,,,,1.0,0.200964,736.932297,660.520226,1.0,1.0,1.0,1.0,9.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,10.0,2.0,69.0,6.0,66.0,168.0,10251.0,3648.0,4.0,2.0,1.0,3.0,4.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,3.0,3.0,2.0,1.0,9.0,9.0,1.0,,,,,,,1.0,14.0,1.0,100.0,1.0,1.0,1.0,2.0
2,298778,39.0,7.0,b'07262022',b'07',b'26',b'2022',1200.0,b'2022013865',2022014000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,,1.0,2.0,1.0,2.0,4.0,15.0,30.0,30.0,5.0,3.0,1.0,1.0,1.0,2.0,4.0,3.0,2.0,1.0,2.0,2.0,,2.0,2.0,1.0,1.0,7.0,1.0,3.0,,3.0,5.0,2.0,,,1.0,2.0,7.0,88.0,77.0,,220.0,506.0,1.0,1.0,2.0,1.0,1.0,1.0,7.0,,2.0,,,,1.0,1.0,3.0,2.0,2.0,,1.0,1.0,3.0,1.0,5.0,1.0,7.0,3.0,7.0,,3.0,1.0,,,,1.0,2.0,,888.0,,,,2.0,,2.0,3.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,22.0,1.0,1.0,1.0,,392102.0,40.363842,1.0,40.363842,1.0,9.0,,,,,9.0,,1186.814414,608.294722,2.0,3.0,3.0,1.0,9.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,66.0,168.0,9979.0,3551.0,4.0,2.0,1.0,3.0,9.0,9.0,,1.0,,1.0,,,,,,,,9.0,9.0,1.0,,,,,,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.0,2.0,2.0
3,294835,39.0,10.0,b'11032022',b'11',b'03',b'2022',1100.0,b'2022009922',2022010000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,,1.0,1.0,2.0,2.0,2.0,88.0,5.0,88.0,1.0,1.0,2.0,1.0,2.0,7.0,1.0,8.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,1.0,6.0,1.0,,,1.0,2.0,1.0,88.0,9.0,,240.0,504.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,1.0,4.0,1.0,7.0,2.0,1.0,1.0,1.0,,,2.0,,,,,,,,7.0,,3.0,1.0,,,,2.0,,,202.0,1.0,88.0,1.0,2.0,,2.0,4.0,2.0,,2.0,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,,,,,,,2.0,,2.0,4.0,,,,,,,,1.0,1.0,2.0,2.0,2.0,2.0,21.0,1.0,1.0,1.0,,392101.0,40.363842,1.0,40.363842,1.0,9.0,,,,,2.0,0.564933,1882.021679,1039.565451,1.0,1.0,2.0,1.0,1.0,2.0,1.0,,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,9.0,1.0,60.0,5.0,64.0,163.0,10886.0,4120.0,4.0,2.0,1.0,4.0,6.0,1.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,9.0,9.0,1.0,,,,,,,1.0,7.0,1.0,47.0,1.0,,,2.0
4,183641,26.0,5.0,b'05062022',b'05',b'06',b'2022',1100.0,b'2022001483',2022001000.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,3.0,77.0,88.0,88.0,3.0,1.0,2.0,1.0,1.0,5.0,1.0,8.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,99.0,,170.0,503.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,7.0,,,,2.0,1.0,1.0,3.0,,,7.0,,,,,,,,7.0,,3.0,4.0,,,,7.0,,,101.0,1.0,88.0,2.0,1.0,32022.0,1.0,4.0,7.0,,2.0,2.0,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,7.0,7.0,2.0,2.0,,,,,,,,,,,2.0,4.0,,,,,,,,,,,,,,12.0,1.0,1.0,1.0,2.0,261101.0,28.952948,1.0,28.952948,1.0,9.0,,,,,1.0,0.490989,372.536568,166.394531,1.0,9.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,12.0,2.0,77.0,6.0,63.0,160.0,7711.0,3011.0,4.0,2.0,1.0,4.0,9.0,1.0,,1.0,,2.0,,,,,,,,9.0,9.0,1.0,,,,,,,1.0,14.0,1.0,100.0,1.0,1.0,1.0,9.0
5,182725,26.0,5.0,b'05232022',b'05',b'23',b'2022',1200.0,b'2022000567',2022001000.0,1.0,1.0,,1.0,2.0,1.0,,3.0,,1.0,2.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,1.0,2.0,2.0,2.0,1.0,7.0,1.0,8.0,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,9.0,2.0,,1.0,2.0,7.0,88.0,99.0,,9999.0,508.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,7.0,2.0,1.0,1.0,3.0,,,2.0,,,,,,,,9.0,,3.0,1.0,,,,2.0,,,999.0,,,,,,,,,,,,,,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,4.0,,,,,,,,,,,,,,11.0,1.0,1.0,1.0,2.0,261101.0,28.952948,3.0,86.858845,1.0,9.0,,,,,1.0,0.490989,1117.609703,2294.593337,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,8.0,1.0,57.0,5.0,68.0,173.0,,,,9.0,1.0,3.0,9.0,1.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,9.0,9.0,1.0,,,,,,,9.0,900.0,9.0,99900.0,9.0,,,
6,184044,26.0,11.0,b'11152022',b'11',b'15',b'2022',1100.0,b'2022001886',2022002000.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,1.0,88.0,88.0,,3.0,2.0,2.0,1.0,1.0,4.0,1.0,8.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,7.0,,9999.0,507.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,7.0,7.0,2.0,2.0,1.0,1.0,3.0,,,2.0,,,,,,,,7.0,,3.0,1.0,,,,2.0,,,203.0,1.0,88.0,1.0,1.0,102022.0,1.0,3.0,2.0,,2.0,2.0,,,8.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,2.0,2.0,,,,,,,,,,,,2.0,4.0,,,,,,,,,,,,,,12.0,1.0,1.0,1.0,2.0,261101.0,28.952948,1.0,28.952948,1.0,9.0,,,,,1.0,0.490989,372.536568,224.393454,1.0,1.0,1.0,1.0,9.0,1.0,1.0,9.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,14.0,3.0,72.0,6.0,67.0,170.0,,,,9.0,1.0,4.0,5.0,,,1.0,,2.0,,,,,,,,9.0,9.0,1.0,,,,,,,1.0,10.0,1.0,70.0,1.0,9.0,9.0,2.0
7,183853,26.0,8.0,b'08162022',b'08',b'16',b'2022',1100.0,b'2022001695',2022002000.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,3.0,77.0,5.0,30.0,3.0,2.0,2.0,1.0,1.0,6.0,1.0,8.0,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,1.0,3.0,,3.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,5.0,,127.0,411.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,5.0,1.0,7.0,1.0,1.0,1.0,5.0,,,1.0,2.0,,2.0,,7.0,,,7.0,,3.0,1.0,,,,2.0,,,888.0,,,,1.0,102021.0,1.0,4.0,2.0,,2.0,2.0,,,8.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,70.0,22.0,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,7.0,2.0,2.0,,,,,,,,,,,2.0,4.0,,,,,,,,,,,,,,12.0,1.0,1.0,1.0,2.0,261101.0,28.952948,1.0,28.952948,1.0,9.0,,,,,1.0,0.490989,372.536568,345.60525,1.0,9.0,2.0,1.0,9.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,59.0,150.0,5761.0,2565.0,3.0,2.0,1.0,2.0,3.0,1.0,,1.0,,2.0,,,,,,,,9.0,9.0,1.0,,,,,,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.0,1.0,2.0
8,26718,6.0,11.0,b'12142022',b'12',b'14',b'2022',1100.0,b'2022000928',2022001000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,1.0,,,,,,,,,,,1.0,1.0,88.0,88.0,,3.0,1.0,2.0,2.0,1.0,7.0,4.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,1.0,2.0,,1.0,2.0,7.0,88.0,5.0,,216.0,602.0,1.0,2.0,2.0,1.0,2.0,2.0,,,,,,,,1.0,7.0,,,3.0,2.0,,,,,,,,7.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,5.0,5.0,2.0,2.0,3.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,1.0,7.0,2.0,2.0,2.0,2.0,,,,,,,,1.0,,,,,,,,,,,6.0,1.0,7.0,,3.0,2.0,12.0,1.0,2.0,1.0,3.0,61011.0,27.347587,2.0,54.695175,5.0,9.0,,,,,1.0,0.215825,712.750383,819.912131,1.0,1.0,1.0,1.0,9.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,2.0,88.0,88.0,1.0,8.0,2.0,5.0,7.0,1.0,11.0,2.0,72.0,6.0,74.0,188.0,9798.0,2773.0,3.0,2.0,1.0,3.0,3.0,,,,,,,,3.0,3.0,3.0,,,9.0,9.0,1.0,,,,,,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.0,2.0,2.0
9,83844,13.0,10.0,b'11032022',b'11',b'03',b'2022',1200.0,b'2022007262',2022007000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,2.0,2.0,1.0,2.0,88.0,88.0,,1.0,9.0,2.0,9.0,1.0,8.0,9.0,8.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,4.0,2.0,,,1.0,2.0,1.0,88.0,99.0,,9999.0,9999.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,,,,,1.0,9.0,,,,2.0,,,,,,,,9.0,,3.0,,,,,,,,,,,,,,,,,,,,,,9.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,4.0,,,,,,,,,,,,,,20.0,1.0,1.0,1.0,,132041.0,58.187866,1.0,58.187866,1.0,9.0,,,,,9.0,,814.839633,1235.546626,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,6.0,1.0,47.0,4.0,,,,,,9.0,1.0,2.0,9.0,,,,,,,,3.0,3.0,3.0,,,9.0,9.0,9.0,,,,,,,9.0,900.0,9.0,99900.0,9.0,,,


### **i. Cleaning the BRFSS Data**

The BRFSS dataset had over 329 columns, including an index column created as a consequence of the xpt to csv conversion, as well as multiple other columns for which entries were not available in the dataset. We also identified a host of other columns unrelated to the analysis we will be carrying out. These included columns containing data pertaining to HIV, cognitive decline, pregnancy and birth control. There were more columns that we may not necessarily use over the course of this project, but we decided to keep most of those that fall within the scope of our analysis in case we find use cases for them over time.

In [37]:
%%bigquery --project=fall24-ba775-a08
CREATE OR REPLACE TABLE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022` AS (
  SELECT
    int64_field_0 AS PERSONID,
    _STATE,
    _SEX,
    _AGE80,
    FMONTH,
    IDATE,
    IMONTH,
    IDAY,
    IYEAR,
    GENHLTH,
    PHYSHLTH,
    MENTHLTH,
    POORHLTH,
    PRIMINSR,
    PERSDOC3,
    MEDCOST1,
    CHECKUP1,
    EXERANY2,
    SLEPTIM1,
    LASTDEN4,
    RMVTETH4,
    CVDINFR4,
    CVDCRHD4,
    CVDSTRK3,
    ASTHMA3,
    ASTHNOW,
    CHCSCNC1,
    CHCOCNC1,
    CHCCOPD3,
    ADDEPEV3,
    CHCKDNY2,
    HAVARTH4,
    DIABETE4,
    DIABAGE4,
    MARITAL,
    EDUCA,
    RENTHOM1,
    NUMHHOL4,
    NUMPHON4,
    CPDEMO1C,
    VETERAN3,
    EMPLOY1,
    CHILDREN,
    INCOME3,
    PREGNANT,
    WEIGHT2,
    HEIGHT3,
    DEAF,
    BLIND,
    DECIDE,
    DIFFWALK,
    DIFFDRES,
    DIFFALON,
    HADMAM,
    HOWLONG,
    CERVSCRN,
    CRVCLCNC,
    CRVCLPAP,
    CRVCLHPV,
    HADHYST2,
    HADSIGM4,
    COLNSIGM,
    COLNTES1,
    SIGMTES1,
    LASTSIG4,
    COLNCNCR,
    VIRCOLO1,
    VCLNTES2,
    SMALSTOL,
    STOLTEST,
    STOOLDN2,
    BLDSTFIT,
    SDNATES1,
    SMOKE100,
    SMOKDAY2,
    USENOW3,
    ECIGNOW2,
    LCSFIRST,
    LCSLAST,
    LCSNUMCG,
    LCSCTSC1,
    LCSSCNCR,
    LCSCTWHN,
    ALCDAY4,
    AVEDRNK3,
    DRNK3GE5,
    MAXDRNKS,
    FLUSHOT7,
    FLSHTMY3,
    PNEUVAC4,
    TETANUS1,
    COVIDPOS,
    COVIDSMP,
    COVIDPRM,
    PDIABTS1,
    PREDIAB2,
    DIABTYPE,
    INSULIN1,
    CHKHEMO3,
    EYEEXAM1,
    DIABEYE1,
    DIABEDU1,
    FEETSORE,
    IMFVPLA3,
    HPVADVC4,
    HPVADSHT,
    SHINGLE2,
    COVIDVA1,
    COVACGET,
    COVIDNU1,
    COVIDINT,
    COVIDFS1,
    COVIDSE1,
    COPDCOGH,
    COPDFLEM,
    COPDBRTH,
    COPDBTST,
    COPDSMOK,
    CNCRDIFF,
    CNCRAGE,
    CNCRTYP2,
    CSRVTRT3,
    CSRVDOC1,
    CSRVSUM,
    CSRVRTRN,
    CSRVINST,
    CSRVINSR,
    CSRVDEIN,
    CSRVCLIN,
    CSRVPAIN,
    CSRVCTL2,
    PSATEST1,
    PSATIME1,
    PCPSARS2,
    PSASUGST,
    PCSTALK1,
    ACEDEPRS,
    ACEDRINK,
    ACEDRUGS,
    ACEPRISN,
    ACEDIVRC,
    ACEPUNCH,
    ACEHURT1,
    ACESWEAR,
    ACETOUCH,
    ACETTHEM,
    ACEHVSEX,
    ACEADSAF,
    ACEADNED,
    LSATISFY,
    EMTSUPRT,
    SDHISOLT,
    SDHEMPLY,
    FOODSTMP,
    SDHFOOD1,
    SDHBILLS,
    SDHUTILS,
    SDHTRNSP,
    SDHSTRE1,
    MARIJAN1,
    MARJSMOK,
    MARJEAT,
    MARJVAPE,
    MARJDAB,
    MARJOTHR,
    USEMRJN4,
    LASTSMK2,
    STOPSMK2,
    MENTCIGS,
    MENTECIG,
    HEATTBCO,
    ASBIALCH,
    ASBIDRNK,
    ASBIBING,
    ASBIADVC,
    ASBIRDUC,
    CASTHNO2,
    RRCLASS3,
    RRCOGNT2,
    RRTREAT,
    RRATWRK2,
    RRHCARE4,
    RRPHYSM2
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022_backup`
);

Now that the columns we do not need are removed, we move on to examining the data in the tables. State information is stored in a different format in each of our datasets, so we will add a new column to the BRFSS table for clarity, and later check the LocationID column in the CDI table to ensure that the tables can be joined easily.
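The new state-name column amounts to a lookup from the numeric `_STATE` code to a name. As a minimal sketch, the lookup could be kept as a Python dictionary and rendered into a SQL `CASE` expression. The dictionary below holds only a handful of codes for brevity, and the helper names `build_state_case` and `state_name` are hypothetical, not part of the original notebook.

```python
# Hypothetical helpers: keep the code-to-name lookup as data and render it as SQL.
# Only a few of the BRFSS _STATE codes are listed here for brevity.
STATE_NAMES = {
    1: "Alabama",
    5: "Arkansas",
    22: "Louisiana",
    28: "Mississippi",
    66: "Guam",
    72: "Puerto Rico",
}


def build_state_case(mapping, column="_STATE"):
    """Return a CASE expression mapping integer codes in `column` to names."""
    lines = [f"CASE {column}"]
    for code in sorted(mapping):
        # Escape single quotes so each name is a valid SQL string literal.
        name = mapping[code].replace("'", "\\'")
        lines.append(f"  WHEN {code} THEN '{name}'")
    lines.append("END")
    return "\n".join(lines)


def state_name(raw_code, mapping=STATE_NAMES):
    """Look up a name for a code that may arrive as a float (e.g. 22.0)."""
    return mapping.get(int(raw_code))


print(build_state_case(STATE_NAMES))
```

Keeping the lookup as data makes it easier to review against the data dictionary than a long hand-written statement, although the end result is the same.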

In [38]:
%%bigquery --project=fall24-ba775-a08
# First we will convert the _STATE column from FLOAT to INT64
ALTER TABLE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
ADD COLUMN _STATE_INT64 INT64;

UPDATE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
SET _STATE_INT64 = CAST(_STATE AS INT64)
WHERE TRUE;

ALTER TABLE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
DROP COLUMN _STATE;

ALTER TABLE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
RENAME COLUMN _STATE_INT64 TO _STATE;

In [39]:
%%bigquery --project=fall24-ba775-a08
# Using the BRFSS data dictionary to assign the names of the appropriate States to a new column STATENAME based on the _STATE column
ALTER TABLE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
ADD COLUMN STATENAME STRING;

UPDATE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
SET STATENAME =
  CASE _STATE
    WHEN 1 THEN 'Alabama'
    WHEN 2 THEN 'Alaska'
    WHEN 4 THEN 'Arizona'
    WHEN 5 THEN 'Arkansas'
    WHEN 6 THEN 'California'
    WHEN 8 THEN 'Colorado'
    WHEN 9 THEN 'Connecticut'
    WHEN 10 THEN 'Delaware'
    WHEN 11 THEN 'District of Columbia'
    WHEN 12 THEN 'Florida'
    WHEN 13 THEN 'Georgia'
    WHEN 15 THEN 'Hawaii'
    WHEN 16 THEN 'Idaho'
    WHEN 17 THEN 'Illinois'
    WHEN 18 THEN 'Indiana'
    WHEN 19 THEN 'Iowa'
    WHEN 20 THEN 'Kansas'
    WHEN 21 THEN 'Kentucky'
    WHEN 22 THEN 'Louisiana'
    WHEN 23 THEN 'Maine'
    WHEN 24 THEN 'Maryland'
    WHEN 25 THEN 'Massachusetts'
    WHEN 26 THEN 'Michigan'
    WHEN 27 THEN 'Minnesota'
    WHEN 28 THEN 'Mississippi'
    WHEN 29 THEN 'Missouri'
    WHEN 30 THEN 'Montana'
    WHEN 31 THEN 'Nebraska'
    WHEN 32 THEN 'Nevada'
    WHEN 33 THEN 'New Hampshire'
    WHEN 34 THEN 'New Jersey'
    WHEN 35 THEN 'New Mexico'
    WHEN 36 THEN 'New York'
    WHEN 37 THEN 'North Carolina'
    WHEN 38 THEN 'North Dakota'
    WHEN 39 THEN 'Ohio'
    WHEN 40 THEN 'Oklahoma'
    WHEN 41 THEN 'Oregon'
    WHEN 42 THEN 'Pennsylvania'
    WHEN 44 THEN 'Rhode Island'
    WHEN 45 THEN 'South Carolina'
    WHEN 46 THEN 'South Dakota'
    WHEN 47 THEN 'Tennessee'
    WHEN 48 THEN 'Texas'
    WHEN 49 THEN 'Utah'
    WHEN 50 THEN 'Vermont'
    WHEN 51 THEN 'Virginia'
    WHEN 53 THEN 'Washington'
    WHEN 54 THEN 'West Virginia'
    WHEN 55 THEN 'Wisconsin'
    WHEN 56 THEN 'Wyoming'
    WHEN 66 THEN 'Guam'
    WHEN 72 THEN 'Puerto Rico'
    WHEN 78 THEN 'Virgin Islands'
  END
WHERE TRUE;

Looking for NULL values across all fields:

In [40]:
%%bigquery --project=fall24-ba775-a08
SELECT SUM(IF(PERSONID IS NULL, 1, 0)) AS PERSONID_nullcount,
 SUM(IF(_STATE IS NULL, 1, 0)) AS _STATE_nullcount,
 SUM(IF(STATENAME IS NULL, 1, 0)) AS STATENAME_nullcount,
 SUM(IF(_SEX IS NULL, 1, 0)) AS _SEX_nullcount,
 SUM(IF(_AGE80 IS NULL, 1, 0)) AS _AGE80_nullcount,
 SUM(IF(FMONTH IS NULL, 1, 0)) AS FMONTH_nullcount,
 SUM(IF(IDATE IS NULL, 1, 0)) AS IDATE_nullcount,
 SUM(IF(IMONTH IS NULL, 1, 0)) AS IMONTH_nullcount,
 SUM(IF(IDAY IS NULL, 1, 0)) AS IDAY_nullcount,
 SUM(IF(IYEAR IS NULL, 1, 0)) AS IYEAR_nullcount,
 SUM(IF(GENHLTH IS NULL, 1, 0)) AS GENHLTH_nullcount,
 SUM(IF(PHYSHLTH IS NULL, 1, 0)) AS PHYSHLTH_nullcount,
 SUM(IF(MENTHLTH IS NULL, 1, 0)) AS MENTHLTH_nullcount,
 SUM(IF(POORHLTH IS NULL, 1, 0)) AS POORHLTH_nullcount,
 SUM(IF(PRIMINSR IS NULL, 1, 0)) AS PRIMINSR_nullcount,
 SUM(IF(PERSDOC3 IS NULL, 1, 0)) AS PERSDOC3_nullcount,
 SUM(IF(MEDCOST1 IS NULL, 1, 0)) AS MEDCOST1_nullcount,
 SUM(IF(CHECKUP1 IS NULL, 1, 0)) AS CHECKUP1_nullcount,
 SUM(IF(EXERANY2 IS NULL, 1, 0)) AS EXERANY2_nullcount,
 SUM(IF(SLEPTIM1 IS NULL, 1, 0)) AS SLEPTIM1_nullcount,
 SUM(IF(LASTDEN4 IS NULL, 1, 0)) AS LASTDEN4_nullcount,
 SUM(IF(RMVTETH4 IS NULL, 1, 0)) AS RMVTETH4_nullcount,
 SUM(IF(CVDINFR4 IS NULL, 1, 0)) AS CVDINFR4_nullcount,
 SUM(IF(CVDCRHD4 IS NULL, 1, 0)) AS CVDCRHD4_nullcount,
 SUM(IF(CVDSTRK3 IS NULL, 1, 0)) AS CVDSTRK3_nullcount,
 SUM(IF(ASTHMA3 IS NULL, 1, 0)) AS ASTHMA3_nullcount,
 SUM(IF(ASTHNOW IS NULL, 1, 0)) AS ASTHNOW_nullcount,
 SUM(IF(CHCSCNC1 IS NULL, 1, 0)) AS CHCSCNC1_nullcount,
 SUM(IF(CHCOCNC1 IS NULL, 1, 0)) AS CHCOCNC1_nullcount,
 SUM(IF(CHCCOPD3 IS NULL, 1, 0)) AS CHCCOPD3_nullcount,
 SUM(IF(ADDEPEV3 IS NULL, 1, 0)) AS ADDEPEV3_nullcount,
 SUM(IF(CHCKDNY2 IS NULL, 1, 0)) AS CHCKDNY2_nullcount,
 SUM(IF(HAVARTH4 IS NULL, 1, 0)) AS HAVARTH4_nullcount,
 SUM(IF(DIABETE4 IS NULL, 1, 0)) AS DIABETE4_nullcount,
 SUM(IF(DIABAGE4 IS NULL, 1, 0)) AS DIABAGE4_nullcount,
 SUM(IF(MARITAL IS NULL, 1, 0)) AS MARITAL_nullcount,
 SUM(IF(EDUCA IS NULL, 1, 0)) AS EDUCA_nullcount,
 SUM(IF(RENTHOM1 IS NULL, 1, 0)) AS RENTHOM1_nullcount,
 SUM(IF(NUMHHOL4 IS NULL, 1, 0)) AS NUMHHOL4_nullcount,
 SUM(IF(NUMPHON4 IS NULL, 1, 0)) AS NUMPHON4_nullcount,
 SUM(IF(CPDEMO1C IS NULL, 1, 0)) AS CPDEMO1C_nullcount,
 SUM(IF(VETERAN3 IS NULL, 1, 0)) AS VETERAN3_nullcount,
 SUM(IF(EMPLOY1 IS NULL, 1, 0)) AS EMPLOY1_nullcount,
 SUM(IF(CHILDREN IS NULL, 1, 0)) AS CHILDREN_nullcount,
 SUM(IF(INCOME3 IS NULL, 1, 0)) AS INCOME3_nullcount,
 SUM(IF(PREGNANT IS NULL, 1, 0)) AS PREGNANT_nullcount,
 SUM(IF(WEIGHT2 IS NULL, 1, 0)) AS WEIGHT2_nullcount,
 SUM(IF(HEIGHT3 IS NULL, 1, 0)) AS HEIGHT3_nullcount,
 SUM(IF(DEAF IS NULL, 1, 0)) AS DEAF_nullcount,
 SUM(IF(BLIND IS NULL, 1, 0)) AS BLIND_nullcount,
 SUM(IF(DECIDE IS NULL, 1, 0)) AS DECIDE_nullcount,
 SUM(IF(DIFFWALK IS NULL, 1, 0)) AS DIFFWALK_nullcount,
 SUM(IF(DIFFDRES IS NULL, 1, 0)) AS DIFFDRES_nullcount,
 SUM(IF(DIFFALON IS NULL, 1, 0)) AS DIFFALON_nullcount,
 SUM(IF(HADMAM IS NULL, 1, 0)) AS HADMAM_nullcount,
 SUM(IF(HOWLONG IS NULL, 1, 0)) AS HOWLONG_nullcount,
 SUM(IF(CERVSCRN IS NULL, 1, 0)) AS CERVSCRN_nullcount,
 SUM(IF(CRVCLCNC IS NULL, 1, 0)) AS CRVCLCNC_nullcount,
 SUM(IF(CRVCLPAP IS NULL, 1, 0)) AS CRVCLPAP_nullcount,
 SUM(IF(CRVCLHPV IS NULL, 1, 0)) AS CRVCLHPV_nullcount,
 SUM(IF(HADHYST2 IS NULL, 1, 0)) AS HADHYST2_nullcount,
 SUM(IF(HADSIGM4 IS NULL, 1, 0)) AS HADSIGM4_nullcount,
 SUM(IF(COLNSIGM IS NULL, 1, 0)) AS COLNSIGM_nullcount,
 SUM(IF(COLNTES1 IS NULL, 1, 0)) AS COLNTES1_nullcount,
 SUM(IF(SIGMTES1 IS NULL, 1, 0)) AS SIGMTES1_nullcount,
 SUM(IF(LASTSIG4 IS NULL, 1, 0)) AS LASTSIG4_nullcount,
 SUM(IF(COLNCNCR IS NULL, 1, 0)) AS COLNCNCR_nullcount,
 SUM(IF(VIRCOLO1 IS NULL, 1, 0)) AS VIRCOLO1_nullcount,
 SUM(IF(VCLNTES2 IS NULL, 1, 0)) AS VCLNTES2_nullcount,
 SUM(IF(SMALSTOL IS NULL, 1, 0)) AS SMALSTOL_nullcount,
 SUM(IF(STOLTEST IS NULL, 1, 0)) AS STOLTEST_nullcount,
 SUM(IF(STOOLDN2 IS NULL, 1, 0)) AS STOOLDN2_nullcount,
 SUM(IF(BLDSTFIT IS NULL, 1, 0)) AS BLDSTFIT_nullcount,
 SUM(IF(SDNATES1 IS NULL, 1, 0)) AS SDNATES1_nullcount,
 SUM(IF(SMOKE100 IS NULL, 1, 0)) AS SMOKE100_nullcount,
 SUM(IF(SMOKDAY2 IS NULL, 1, 0)) AS SMOKDAY2_nullcount,
 SUM(IF(USENOW3 IS NULL, 1, 0)) AS USENOW3_nullcount,
 SUM(IF(ECIGNOW2 IS NULL, 1, 0)) AS ECIGNOW2_nullcount,
 SUM(IF(LCSFIRST IS NULL, 1, 0)) AS LCSFIRST_nullcount,
 SUM(IF(LCSLAST IS NULL, 1, 0)) AS LCSLAST_nullcount,
 SUM(IF(LCSNUMCG IS NULL, 1, 0)) AS LCSNUMCG_nullcount,
 SUM(IF(LCSCTSC1 IS NULL, 1, 0)) AS LCSCTSC1_nullcount,
 SUM(IF(LCSSCNCR IS NULL, 1, 0)) AS LCSSCNCR_nullcount,
 SUM(IF(LCSCTWHN IS NULL, 1, 0)) AS LCSCTWHN_nullcount,
 SUM(IF(ALCDAY4 IS NULL, 1, 0)) AS ALCDAY4_nullcount,
 SUM(IF(AVEDRNK3 IS NULL, 1, 0)) AS AVEDRNK3_nullcount,
 SUM(IF(DRNK3GE5 IS NULL, 1, 0)) AS DRNK3GE5_nullcount,
 SUM(IF(MAXDRNKS IS NULL, 1, 0)) AS MAXDRNKS_nullcount,
 SUM(IF(FLUSHOT7 IS NULL, 1, 0)) AS FLUSHOT7_nullcount,
 SUM(IF(FLSHTMY3 IS NULL, 1, 0)) AS FLSHTMY3_nullcount,
 SUM(IF(PNEUVAC4 IS NULL, 1, 0)) AS PNEUVAC4_nullcount,
 SUM(IF(TETANUS1 IS NULL, 1, 0)) AS TETANUS1_nullcount,
 SUM(IF(COVIDPOS IS NULL, 1, 0)) AS COVIDPOS_nullcount,
 SUM(IF(COVIDSMP IS NULL, 1, 0)) AS COVIDSMP_nullcount,
 SUM(IF(COVIDPRM IS NULL, 1, 0)) AS COVIDPRM_nullcount,
 SUM(IF(PDIABTS1 IS NULL, 1, 0)) AS PDIABTS1_nullcount,
 SUM(IF(PREDIAB2 IS NULL, 1, 0)) AS PREDIAB2_nullcount,
 SUM(IF(DIABTYPE IS NULL, 1, 0)) AS DIABTYPE_nullcount,
 SUM(IF(INSULIN1 IS NULL, 1, 0)) AS INSULIN1_nullcount,
 SUM(IF(CHKHEMO3 IS NULL, 1, 0)) AS CHKHEMO3_nullcount,
 SUM(IF(EYEEXAM1 IS NULL, 1, 0)) AS EYEEXAM1_nullcount,
 SUM(IF(DIABEYE1 IS NULL, 1, 0)) AS DIABEYE1_nullcount,
 SUM(IF(DIABEDU1 IS NULL, 1, 0)) AS DIABEDU1_nullcount,
 SUM(IF(FEETSORE IS NULL, 1, 0)) AS FEETSORE_nullcount,
 SUM(IF(IMFVPLA3 IS NULL, 1, 0)) AS IMFVPLA3_nullcount,
 SUM(IF(HPVADVC4 IS NULL, 1, 0)) AS HPVADVC4_nullcount,
 SUM(IF(HPVADSHT IS NULL, 1, 0)) AS HPVADSHT_nullcount,
 SUM(IF(SHINGLE2 IS NULL, 1, 0)) AS SHINGLE2_nullcount,
 SUM(IF(COVIDVA1 IS NULL, 1, 0)) AS COVIDVA1_nullcount,
 SUM(IF(COVACGET IS NULL, 1, 0)) AS COVACGET_nullcount,
 SUM(IF(COVIDNU1 IS NULL, 1, 0)) AS COVIDNU1_nullcount,
 SUM(IF(COVIDINT IS NULL, 1, 0)) AS COVIDINT_nullcount,
 SUM(IF(COVIDFS1 IS NULL, 1, 0)) AS COVIDFS1_nullcount,
 SUM(IF(COVIDSE1 IS NULL, 1, 0)) AS COVIDSE1_nullcount,
 SUM(IF(COPDCOGH IS NULL, 1, 0)) AS COPDCOGH_nullcount,
 SUM(IF(COPDFLEM IS NULL, 1, 0)) AS COPDFLEM_nullcount,
 SUM(IF(COPDBRTH IS NULL, 1, 0)) AS COPDBRTH_nullcount,
 SUM(IF(COPDBTST IS NULL, 1, 0)) AS COPDBTST_nullcount,
 SUM(IF(COPDSMOK IS NULL, 1, 0)) AS COPDSMOK_nullcount,
 SUM(IF(CNCRDIFF IS NULL, 1, 0)) AS CNCRDIFF_nullcount,
 SUM(IF(CNCRAGE IS NULL, 1, 0)) AS CNCRAGE_nullcount,
 SUM(IF(CNCRTYP2 IS NULL, 1, 0)) AS CNCRTYP2_nullcount,
 SUM(IF(CSRVTRT3 IS NULL, 1, 0)) AS CSRVTRT3_nullcount,
 SUM(IF(CSRVDOC1 IS NULL, 1, 0)) AS CSRVDOC1_nullcount,
 SUM(IF(CSRVSUM IS NULL, 1, 0)) AS CSRVSUM_nullcount,
 SUM(IF(CSRVRTRN IS NULL, 1, 0)) AS CSRVRTRN_nullcount,
 SUM(IF(CSRVINST IS NULL, 1, 0)) AS CSRVINST_nullcount,
 SUM(IF(CSRVINSR IS NULL, 1, 0)) AS CSRVINSR_nullcount,
 SUM(IF(CSRVDEIN IS NULL, 1, 0)) AS CSRVDEIN_nullcount,
 SUM(IF(CSRVCLIN IS NULL, 1, 0)) AS CSRVCLIN_nullcount,
 SUM(IF(CSRVPAIN IS NULL, 1, 0)) AS CSRVPAIN_nullcount,
 SUM(IF(CSRVCTL2 IS NULL, 1, 0)) AS CSRVCTL2_nullcount,
 SUM(IF(PSATEST1 IS NULL, 1, 0)) AS PSATEST1_nullcount,
 SUM(IF(PSATIME1 IS NULL, 1, 0)) AS PSATIME1_nullcount,
 SUM(IF(PCPSARS2 IS NULL, 1, 0)) AS PCPSARS2_nullcount,
 SUM(IF(PSASUGST IS NULL, 1, 0)) AS PSASUGST_nullcount,
 SUM(IF(PCSTALK1 IS NULL, 1, 0)) AS PCSTALK1_nullcount,
 SUM(IF(ACEDEPRS IS NULL, 1, 0)) AS ACEDEPRS_nullcount,
 SUM(IF(ACEDRINK IS NULL, 1, 0)) AS ACEDRINK_nullcount,
 SUM(IF(ACEDRUGS IS NULL, 1, 0)) AS ACEDRUGS_nullcount,
 SUM(IF(ACEPRISN IS NULL, 1, 0)) AS ACEPRISN_nullcount,
 SUM(IF(ACEDIVRC IS NULL, 1, 0)) AS ACEDIVRC_nullcount,
 SUM(IF(ACEPUNCH IS NULL, 1, 0)) AS ACEPUNCH_nullcount,
 SUM(IF(ACEHURT1 IS NULL, 1, 0)) AS ACEHURT1_nullcount,
 SUM(IF(ACESWEAR IS NULL, 1, 0)) AS ACESWEAR_nullcount,
 SUM(IF(ACETOUCH IS NULL, 1, 0)) AS ACETOUCH_nullcount,
 SUM(IF(ACETTHEM IS NULL, 1, 0)) AS ACETTHEM_nullcount,
 SUM(IF(ACEHVSEX IS NULL, 1, 0)) AS ACEHVSEX_nullcount,
 SUM(IF(ACEADSAF IS NULL, 1, 0)) AS ACEADSAF_nullcount,
 SUM(IF(ACEADNED IS NULL, 1, 0)) AS ACEADNED_nullcount,
 SUM(IF(LSATISFY IS NULL, 1, 0)) AS LSATISFY_nullcount,
 SUM(IF(EMTSUPRT IS NULL, 1, 0)) AS EMTSUPRT_nullcount,
 SUM(IF(SDHISOLT IS NULL, 1, 0)) AS SDHISOLT_nullcount,
 SUM(IF(SDHEMPLY IS NULL, 1, 0)) AS SDHEMPLY_nullcount,
 SUM(IF(FOODSTMP IS NULL, 1, 0)) AS FOODSTMP_nullcount,
 SUM(IF(SDHFOOD1 IS NULL, 1, 0)) AS SDHFOOD1_nullcount,
 SUM(IF(SDHBILLS IS NULL, 1, 0)) AS SDHBILLS_nullcount,
 SUM(IF(SDHUTILS IS NULL, 1, 0)) AS SDHUTILS_nullcount,
 SUM(IF(SDHTRNSP IS NULL, 1, 0)) AS SDHTRNSP_nullcount,
 SUM(IF(SDHSTRE1 IS NULL, 1, 0)) AS SDHSTRE1_nullcount,
 SUM(IF(MARIJAN1 IS NULL, 1, 0)) AS MARIJAN1_nullcount,
 SUM(IF(MARJSMOK IS NULL, 1, 0)) AS MARJSMOK_nullcount,
 SUM(IF(MARJEAT IS NULL, 1, 0)) AS MARJEAT_nullcount,
 SUM(IF(MARJVAPE IS NULL, 1, 0)) AS MARJVAPE_nullcount,
 SUM(IF(MARJDAB IS NULL, 1, 0)) AS MARJDAB_nullcount,
 SUM(IF(MARJOTHR IS NULL, 1, 0)) AS MARJOTHR_nullcount,
 SUM(IF(USEMRJN4 IS NULL, 1, 0)) AS USEMRJN4_nullcount,
 SUM(IF(LASTSMK2 IS NULL, 1, 0)) AS LASTSMK2_nullcount,
 SUM(IF(STOPSMK2 IS NULL, 1, 0)) AS STOPSMK2_nullcount,
 SUM(IF(MENTCIGS IS NULL, 1, 0)) AS MENTCIGS_nullcount,
 SUM(IF(MENTECIG IS NULL, 1, 0)) AS MENTECIG_nullcount,
 SUM(IF(HEATTBCO IS NULL, 1, 0)) AS HEATTBCO_nullcount,
 SUM(IF(ASBIALCH IS NULL, 1, 0)) AS ASBIALCH_nullcount,
 SUM(IF(ASBIDRNK IS NULL, 1, 0)) AS ASBIDRNK_nullcount,
 SUM(IF(ASBIBING IS NULL, 1, 0)) AS ASBIBING_nullcount,
 SUM(IF(ASBIADVC IS NULL, 1, 0)) AS ASBIADVC_nullcount,
 SUM(IF(ASBIRDUC IS NULL, 1, 0)) AS ASBIRDUC_nullcount,
 SUM(IF(CASTHNO2 IS NULL, 1, 0)) AS CASTHNO2_nullcount,
 SUM(IF(RRCLASS3 IS NULL, 1, 0)) AS RRCLASS3_nullcount,
 SUM(IF(RRCOGNT2 IS NULL, 1, 0)) AS RRCOGNT2_nullcount,
 SUM(IF(RRTREAT IS NULL, 1, 0)) AS RRTREAT_nullcount,
 SUM(IF(RRATWRK2 IS NULL, 1, 0)) AS RRATWRK2_nullcount,
 SUM(IF(RRHCARE4 IS NULL, 1, 0)) AS RRHCARE4_nullcount,
 SUM(IF(RRPHYSM2 IS NULL, 1, 0)) AS RRPHYSM2_nullcount
 FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`

Unnamed: 0,PERSONID_nullcount,_STATE_nullcount,STATENAME_nullcount,_SEX_nullcount,_AGE80_nullcount,FMONTH_nullcount,IDATE_nullcount,IMONTH_nullcount,IDAY_nullcount,IYEAR_nullcount,GENHLTH_nullcount,PHYSHLTH_nullcount,MENTHLTH_nullcount,POORHLTH_nullcount,PRIMINSR_nullcount,PERSDOC3_nullcount,MEDCOST1_nullcount,CHECKUP1_nullcount,EXERANY2_nullcount,SLEPTIM1_nullcount,LASTDEN4_nullcount,RMVTETH4_nullcount,CVDINFR4_nullcount,CVDCRHD4_nullcount,CVDSTRK3_nullcount,ASTHMA3_nullcount,ASTHNOW_nullcount,CHCSCNC1_nullcount,CHCOCNC1_nullcount,CHCCOPD3_nullcount,ADDEPEV3_nullcount,CHCKDNY2_nullcount,HAVARTH4_nullcount,DIABETE4_nullcount,DIABAGE4_nullcount,MARITAL_nullcount,EDUCA_nullcount,RENTHOM1_nullcount,NUMHHOL4_nullcount,NUMPHON4_nullcount,CPDEMO1C_nullcount,VETERAN3_nullcount,EMPLOY1_nullcount,CHILDREN_nullcount,INCOME3_nullcount,PREGNANT_nullcount,WEIGHT2_nullcount,HEIGHT3_nullcount,DEAF_nullcount,BLIND_nullcount,DECIDE_nullcount,DIFFWALK_nullcount,DIFFDRES_nullcount,DIFFALON_nullcount,HADMAM_nullcount,HOWLONG_nullcount,CERVSCRN_nullcount,CRVCLCNC_nullcount,CRVCLPAP_nullcount,CRVCLHPV_nullcount,HADHYST2_nullcount,HADSIGM4_nullcount,COLNSIGM_nullcount,COLNTES1_nullcount,SIGMTES1_nullcount,LASTSIG4_nullcount,COLNCNCR_nullcount,VIRCOLO1_nullcount,VCLNTES2_nullcount,SMALSTOL_nullcount,STOLTEST_nullcount,STOOLDN2_nullcount,BLDSTFIT_nullcount,SDNATES1_nullcount,SMOKE100_nullcount,SMOKDAY2_nullcount,USENOW3_nullcount,ECIGNOW2_nullcount,LCSFIRST_nullcount,LCSLAST_nullcount,LCSNUMCG_nullcount,LCSCTSC1_nullcount,LCSSCNCR_nullcount,LCSCTWHN_nullcount,ALCDAY4_nullcount,AVEDRNK3_nullcount,DRNK3GE5_nullcount,MAXDRNKS_nullcount,FLUSHOT7_nullcount,FLSHTMY3_nullcount,PNEUVAC4_nullcount,TETANUS1_nullcount,COVIDPOS_nullcount,COVIDSMP_nullcount,COVIDPRM_nullcount,PDIABTS1_nullcount,PREDIAB2_nullcount,DIABTYPE_nullcount,INSULIN1_nullcount,CHKHEMO3_nullcount,EYEEXAM1_nullcount,DIABEYE1_nullcount,DIABEDU1_nullcount,FEETSORE_nullcount,IMFVPLA3_nullcount,HPVADVC4_nullcount,HPVADSHT_nullcount,SHINGLE2_nullcount,COVIDVA1_nullcount,COVACGET_nullcount,COVIDNU1_nullcount,COVIDINT_nullcount,COVIDFS1_nullcount,COVIDSE1_nullcount,COPDCOGH_nullcount,COPDFLEM_nullcount,COPDBRTH_nullcount,COPDBTST_nullcount,COPDSMOK_nullcount,CNCRDIFF_nullcount,CNCRAGE_nullcount,CNCRTYP2_nullcount,CSRVTRT3_nullcount,CSRVDOC1_nullcount,CSRVSUM_nullcount,CSRVRTRN_nullcount,CSRVINST_nullcount,CSRVINSR_nullcount,CSRVDEIN_nullcount,CSRVCLIN_nullcount,CSRVPAIN_nullcount,CSRVCTL2_nullcount,PSATEST1_nullcount,PSATIME1_nullcount,PCPSARS2_nullcount,PSASUGST_nullcount,PCSTALK1_nullcount,ACEDEPRS_nullcount,ACEDRINK_nullcount,ACEDRUGS_nullcount,ACEPRISN_nullcount,ACEDIVRC_nullcount,ACEPUNCH_nullcount,ACEHURT1_nullcount,ACESWEAR_nullcount,ACETOUCH_nullcount,ACETTHEM_nullcount,ACEHVSEX_nullcount,ACEADSAF_nullcount,ACEADNED_nullcount,LSATISFY_nullcount,EMTSUPRT_nullcount,SDHISOLT_nullcount,SDHEMPLY_nullcount,FOODSTMP_nullcount,SDHFOOD1_nullcount,SDHBILLS_nullcount,SDHUTILS_nullcount,SDHTRNSP_nullcount,SDHSTRE1_nullcount,MARIJAN1_nullcount,MARJSMOK_nullcount,MARJEAT_nullcount,MARJVAPE_nullcount,MARJDAB_nullcount,MARJOTHR_nullcount,USEMRJN4_nullcount,LASTSMK2_nullcount,STOPSMK2_nullcount,MENTCIGS_nullcount,MENTECIG_nullcount,HEATTBCO_nullcount,ASBIALCH_nullcount,ASBIDRNK_nullcount,ASBIBING_nullcount,ASBIADVC_nullcount,ASBIRDUC_nullcount,CASTHNO2_nullcount,RRCLASS3_nullcount,RRCOGNT2_nullcount,RRTREAT_nullcount,RRATWRK2_nullcount,RRHCARE4_nullcount,RRPHYSM2_nullcount
0,0,0,0,0,0,0,0,0,0,0,3,5,3,189386,4,2,4,3,2,3,1363,1363,4,2,2,2,378438,2,3,2,7,2,3,3,383973,8,5,9,349083,438189,2827,4173,6196,9312,12932,366114,15901,17055,18644,19855,20986,22155,22879,23942,223319,274020,224318,309555,309745,309960,227453,152413,232309,237848,413769,441658,154046,383265,436153,383456,395758,383730,433738,433770,31777,281079,32600,33579,283417,324513,288992,38210,279557,413627,40763,234769,235283,235721,43593,236258,44732,45383,49235,320933,418417,304884,304910,432532,432532,432532,432532,432532,432532,432532,436089,437786,443460,430395,292089,414837,320385,436735,321193,328043,437863,437870,437883,437900,437917,421733,422567,422588,427830,435325,435349,435361,437882,435384,435392,435403,428035,443591,437952,441186,441196,440738,438364,396846,396901,396912,396938,396960,397003,397055,397103,397160,397196,397256,397311,397375,190644,190991,191342,191617,191893,192303,192610,192853,193189,193921,350213,433622,433628,433629,433637,433645,440126,397723,426243,440138,442968,430103,384190,384292,384404,385687,398512,439804,283394,283788,284167,362095,284624,284942
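The null-count query above lists every column by hand. As a hedged alternative, the same query text could be generated from a list of column names. The sketch below assumes a hypothetical helper `build_null_count_query` and uses BigQuery's `COUNTIF`, which is swapped in here for the `SUM(IF(...))` pattern used in the notebook.

```python
# Hypothetical helper: build the per-column NULL-count query from a column list.
def build_null_count_query(columns, table):
    """Return a BigQuery query with one <column>_nullcount output per column."""
    select_items = [
        f"  COUNTIF({col} IS NULL) AS {col}_nullcount" for col in columns
    ]
    return "SELECT\n" + ",\n".join(select_items) + f"\nFROM `{table}`"


# A few column names from the table above; the full list would be passed in practice.
query = build_null_count_query(
    ["GENHLTH", "POORHLTH", "DIABETE4"],
    "fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022",
)
print(query)
```

The generated string could then be run in place of the hand-written statement. Generating it from one list keeps the query in step with the table when columns are added or dropped.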


We see that many columns contain NULL values, which is expected: this is a collection of survey responses, and not every question applies to every section of the population. Rather than imputing values, we have chosen to leave most of these columns as they are. Aggregation functions ignore NULL values, so imputing would change their results and could accidentally include sections of society that should have been excluded from the discussion.
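To illustrate why imputation would change aggregate results, the sketch below compares an average that skips missing answers with one computed after filling them with 0. The values are made up for illustration, `None` stands in for NULL, and both function names are hypothetical.

```python
# Illustrative values only: None marks respondents who were not asked the question.
responses = [2, None, 4, None, 6]


def mean_ignoring_missing(values):
    """Average only the answered values, mirroring how SQL AVG skips NULLs."""
    answered = [v for v in values if v is not None]
    return sum(answered) / len(answered) if answered else None


def mean_with_imputed_zero(values):
    """Average after replacing missing values with 0 (shown for contrast)."""
    filled = [0 if v is None else v for v in values]
    return sum(filled) / len(filled) if filled else None


print(mean_ignoring_missing(responses))   # 4.0
print(mean_with_imputed_zero(responses))  # 2.4
```

The imputed version pulls the average down because respondents who were never asked are counted as if they had answered 0.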

However, a few columns that apply to the general population have fewer than 10 NULL values each. Since the number of NULL values is insignificant compared to the number of records, we will set them to the code equivalent of "Refused", assuming that the respondents refused to answer these questions.
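The rule described above can be sketched in Python as a per-column lookup of "Refused" codes applied only where a value is missing. The mapping below covers just a few of the affected columns, with the codes taken from the update statement in this step; the helper name `fill_refused` and the sample row are hypothetical.

```python
# Refused codes for a few of the columns updated in this step (9 or 99).
REFUSED_CODES = {"GENHLTH": 9, "PHYSHLTH": 99, "MENTHLTH": 99, "DIABETE4": 9}


def fill_refused(record, refused_codes=REFUSED_CODES):
    """Return a copy of `record` with missing mapped answers set to Refused."""
    filled = dict(record)
    for column, code in refused_codes.items():
        if column in filled and filled[column] is None:
            filled[column] = code
    return filled


# Hypothetical respondent row; POORHLTH stays None because it is not in the mapping.
row = {"GENHLTH": None, "PHYSHLTH": 5, "POORHLTH": None}
print(fill_refused(row))
```

Columns outside the mapping keep their missing values, which matches the decision to leave question-specific NULLs untouched.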

In [41]:
%%bigquery --project=fall24-ba775-a08
UPDATE `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
SET
    GENHLTH = COALESCE(GENHLTH, 9),
    PHYSHLTH = COALESCE(PHYSHLTH, 99),
    MENTHLTH = COALESCE(MENTHLTH, 99),
    PRIMINSR = COALESCE(PRIMINSR, 99),
    PERSDOC3 = COALESCE(PERSDOC3, 9),
    MEDCOST1 = COALESCE(MEDCOST1, 9),
    CHECKUP1 = COALESCE(CHECKUP1, 9),
    EXERANY2 = COALESCE(EXERANY2, 9),
    SLEPTIM1 = COALESCE(SLEPTIM1, 99),
    CVDINFR4 = COALESCE(CVDINFR4, 9),
    CVDCRHD4 = COALESCE(CVDCRHD4, 9),
    CVDSTRK3 = COALESCE(CVDSTRK3, 9),
    ASTHMA3 = COALESCE(ASTHMA3, 9),
    CHCSCNC1 = COALESCE(CHCSCNC1, 9),
    CHCOCNC1 = COALESCE(CHCOCNC1, 9),
    CHCCOPD3 = COALESCE(CHCCOPD3, 9),
    ADDEPEV3 = COALESCE(ADDEPEV3, 9),
    CHCKDNY2 = COALESCE(CHCKDNY2, 9),
    HAVARTH4 = COALESCE(HAVARTH4, 9),
    DIABETE4 = COALESCE(DIABETE4, 9),
    MARITAL = COALESCE(MARITAL, 9),
    EDUCA = COALESCE(EDUCA, 9),
    RENTHOM1 = COALESCE(RENTHOM1, 9)
WHERE
    GENHLTH IS NULL OR
    PHYSHLTH IS NULL OR
    MENTHLTH IS NULL OR
    PRIMINSR IS NULL OR
    PERSDOC3 IS NULL OR
    MEDCOST1 IS NULL OR
    CHECKUP1 IS NULL OR
    EXERANY2 IS NULL OR
    SLEPTIM1 IS NULL OR
    CVDINFR4 IS NULL OR
    CVDCRHD4 IS NULL OR
    CVDSTRK3 IS NULL OR
    ASTHMA3 IS NULL OR
    CHCSCNC1 IS NULL OR
    CHCOCNC1 IS NULL OR
    CHCCOPD3 IS NULL OR
    ADDEPEV3 IS NULL OR
    CHCKDNY2 IS NULL OR
    HAVARTH4 IS NULL OR
    DIABETE4 IS NULL OR
    MARITAL IS NULL OR
    EDUCA IS NULL OR
    RENTHOM1 IS NULL;

Confirming that the NULL counts are now 0 for these columns:

In [42]:
%%bigquery --project=fall24-ba775-a08
SELECT SUM(IF(PERSONID IS NULL, 1, 0)) AS PERSONID_nullcount,
 SUM(IF(_STATE IS NULL, 1, 0)) AS _STATE_nullcount,
 SUM(IF(STATENAME IS NULL, 1, 0)) AS STATENAME_nullcount,
 SUM(IF(_SEX IS NULL, 1, 0)) AS _SEX_nullcount,
 SUM(IF(_AGE80 IS NULL, 1, 0)) AS _AGE80_nullcount,
 SUM(IF(FMONTH IS NULL, 1, 0)) AS FMONTH_nullcount,
 SUM(IF(IDATE IS NULL, 1, 0)) AS IDATE_nullcount,
 SUM(IF(IMONTH IS NULL, 1, 0)) AS IMONTH_nullcount,
 SUM(IF(IDAY IS NULL, 1, 0)) AS IDAY_nullcount,
 SUM(IF(IYEAR IS NULL, 1, 0)) AS IYEAR_nullcount,
 SUM(IF(GENHLTH IS NULL, 1, 0)) AS GENHLTH_nullcount,
 SUM(IF(PHYSHLTH IS NULL, 1, 0)) AS PHYSHLTH_nullcount,
 SUM(IF(MENTHLTH IS NULL, 1, 0)) AS MENTHLTH_nullcount,
 SUM(IF(PRIMINSR IS NULL, 1, 0)) AS PRIMINSR_nullcount,
 SUM(IF(PERSDOC3 IS NULL, 1, 0)) AS PERSDOC3_nullcount,
 SUM(IF(MEDCOST1 IS NULL, 1, 0)) AS MEDCOST1_nullcount,
 SUM(IF(CHECKUP1 IS NULL, 1, 0)) AS CHECKUP1_nullcount,
 SUM(IF(EXERANY2 IS NULL, 1, 0)) AS EXERANY2_nullcount,
 SUM(IF(SLEPTIM1 IS NULL, 1, 0)) AS SLEPTIM1_nullcount,
 SUM(IF(CVDINFR4 IS NULL, 1, 0)) AS CVDINFR4_nullcount,
 SUM(IF(CVDCRHD4 IS NULL, 1, 0)) AS CVDCRHD4_nullcount,
 SUM(IF(CVDSTRK3 IS NULL, 1, 0)) AS CVDSTRK3_nullcount,
 SUM(IF(ASTHMA3 IS NULL, 1, 0)) AS ASTHMA3_nullcount,
 SUM(IF(CHCSCNC1 IS NULL, 1, 0)) AS CHCSCNC1_nullcount,
 SUM(IF(CHCOCNC1 IS NULL, 1, 0)) AS CHCOCNC1_nullcount,
 SUM(IF(CHCCOPD3 IS NULL, 1, 0)) AS CHCCOPD3_nullcount,
 SUM(IF(ADDEPEV3 IS NULL, 1, 0)) AS ADDEPEV3_nullcount,
 SUM(IF(CHCKDNY2 IS NULL, 1, 0)) AS CHCKDNY2_nullcount,
 SUM(IF(HAVARTH4 IS NULL, 1, 0)) AS HAVARTH4_nullcount,
 SUM(IF(DIABETE4 IS NULL, 1, 0)) AS DIABETE4_nullcount,
 SUM(IF(MARITAL IS NULL, 1, 0)) AS MARITAL_nullcount,
 SUM(IF(EDUCA IS NULL, 1, 0)) AS EDUCA_nullcount,
 SUM(IF(RENTHOM1 IS NULL, 1, 0)) AS RENTHOM1_nullcount
 FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`

Unnamed: 0,PERSONID_nullcount,_STATE_nullcount,STATENAME_nullcount,_SEX_nullcount,_AGE80_nullcount,FMONTH_nullcount,IDATE_nullcount,IMONTH_nullcount,IDAY_nullcount,IYEAR_nullcount,GENHLTH_nullcount,PHYSHLTH_nullcount,MENTHLTH_nullcount,PRIMINSR_nullcount,PERSDOC3_nullcount,MEDCOST1_nullcount,CHECKUP1_nullcount,EXERANY2_nullcount,SLEPTIM1_nullcount,CVDINFR4_nullcount,CVDCRHD4_nullcount,CVDSTRK3_nullcount,ASTHMA3_nullcount,CHCSCNC1_nullcount,CHCOCNC1_nullcount,CHCCOPD3_nullcount,ADDEPEV3_nullcount,CHCKDNY2_nullcount,HAVARTH4_nullcount,DIABETE4_nullcount,MARITAL_nullcount,EDUCA_nullcount,RENTHOM1_nullcount
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We will also check for any duplicates on the PERSONID column to ensure that there are no instances where anyone has been included multiple times within the dataset.

In [43]:
%%bigquery --project=fall24-ba775-a08
SELECT PERSONID, COUNT(*) AS duplicate_person_count
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
GROUP BY PERSONID
HAVING COUNT(*) > 1

Unnamed: 0,PERSONID,duplicate_person_count


We do not see any instances of the same person being represented multiple times within the BRFSS dataset.
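The duplicate check above follows a group-and-count pattern. A minimal Python sketch of the same idea, using made-up identifiers and a hypothetical helper `find_duplicates`:

```python
from collections import Counter


def find_duplicates(ids):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical identifiers for illustration.
print(find_duplicates([0, 1, 2, 3]))        # {}
print(find_duplicates([0, 1, 1, 2, 2, 2]))  # {1: 2, 2: 3}
```

An empty result corresponds to the empty output of the query: no identifier occurs more than once.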

### **ii. Cleaning the CDI Data**

The CDI data contains 35 columns and is an aggregation of data points from various sources, including but not limited to the BRFSS.

Looking for NULL values across all fields:

In [44]:
%%bigquery --project=fall24-ba775-a08
SELECT
  SUM(IF(YearStart IS NULL, 1, 0)) AS YearStart_nullcount,
  SUM(IF(YearEnd IS NULL, 1, 0)) AS YearEnd_nullcount,
  SUM(IF(LocationAbbr IS NULL, 1, 0)) AS LocationAbbr_nullcount,
  SUM(IF(LocationDesc IS NULL, 1, 0)) AS LocationDesc_nullcount,
  SUM(IF(DataSource IS NULL, 1, 0)) AS DataSource_nullcount,
  SUM(IF(Topic IS NULL, 1, 0)) AS Topic_nullcount,
  SUM(IF(Question IS NULL, 1, 0)) AS Question_nullcount,
  SUM(IF(Response IS NULL, 1, 0)) AS Response_nullcount,
  SUM(IF(DataValueUnit IS NULL, 1, 0)) AS DataValueUnit_nullcount,
  SUM(IF(DataValueType IS NULL, 1, 0)) AS DataValueType_nullcount,
  SUM(IF(DataValue IS NULL, 1, 0)) AS DataValue_nullcount,
  SUM(IF(DataValueAlt IS NULL, 1, 0)) AS DataValueAlt_nullcount,
  SUM(IF(DataValueFootnoteSymbol IS NULL, 1, 0)) AS DataValueFootnoteSymbol_nullcount,
  SUM(IF(DataValueFootnote IS NULL, 1, 0)) AS DataValueFootnote_nullcount,
  SUM(IF(LowConfidenceLimit IS NULL, 1, 0)) AS LowConfidenceLimit_nullcount,
  SUM(IF(HighConfidenceLimit IS NULL, 1, 0)) AS HighConfidenceLimit_nullcount,
  SUM(IF(StratificationCategory1 IS NULL, 1, 0)) AS StratificationCategory1_nullcount,
  SUM(IF(Stratification1 IS NULL, 1, 0)) AS Stratification1_nullcount,
  SUM(IF(StratificationCategory2 IS NULL, 1, 0)) AS StratificationCategory2_nullcount,
  SUM(IF(Stratification2 IS NULL, 1, 0)) AS Stratification2_nullcount,
  SUM(IF(StratificationCategory3 IS NULL, 1, 0)) AS StratificationCategory3_nullcount,
  SUM(IF(Stratification3 IS NULL, 1, 0)) AS Stratification3_nullcount,
  SUM(IF(Geolocation IS NULL, 1, 0)) AS Geolocation_nullcount,
  SUM(IF(LocationID IS NULL, 1, 0)) AS LocationID_nullcount,
  SUM(IF(TopicID IS NULL, 1, 0)) AS TopicID_nullcount,
  SUM(IF(QuestionID IS NULL, 1, 0)) AS QuestionID_nullcount,
  SUM(IF(ResponseID IS NULL, 1, 0)) AS ResponseID_nullcount,
  SUM(IF(DataValueTypeID IS NULL, 1, 0)) AS DataValueTypeID_nullcount,
  SUM(IF(StratificationCategoryID1 IS NULL, 1, 0)) AS StratificationCategoryID1_nullcount,
  SUM(IF(StratificationID1 IS NULL, 1, 0)) AS StratificationID1_nullcount,
  SUM(IF(StratificationCategoryID2 IS NULL, 1, 0)) AS StratificationCategoryID2_nullcount,
  SUM(IF(StratificationID2 IS NULL, 1, 0)) AS StratificationID2_nullcount,
  SUM(IF(StratificationCategoryID3 IS NULL, 1, 0)) AS StratificationCategoryID3_nullcount,
  SUM(IF(StratificationID3 IS NULL, 1, 0)) AS StratificationID3_nullcount
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators_backup`

Unnamed: 0,YearStart_nullcount,YearEnd_nullcount,LocationAbbr_nullcount,LocationDesc_nullcount,DataSource_nullcount,Topic_nullcount,Question_nullcount,Response_nullcount,DataValueUnit_nullcount,DataValueType_nullcount,DataValue_nullcount,DataValueAlt_nullcount,DataValueFootnoteSymbol_nullcount,DataValueFootnote_nullcount,LowConfidenceLimit_nullcount,HighConfidenceLimit_nullcount,StratificationCategory1_nullcount,Stratification1_nullcount,StratificationCategory2_nullcount,Stratification2_nullcount,StratificationCategory3_nullcount,Stratification3_nullcount,Geolocation_nullcount,LocationID_nullcount,TopicID_nullcount,QuestionID_nullcount,ResponseID_nullcount,DataValueTypeID_nullcount,StratificationCategoryID1_nullcount,StratificationID1_nullcount,StratificationCategoryID2_nullcount,StratificationID2_nullcount,StratificationCategoryID3_nullcount,StratificationID3_nullcount
0,0,0,0,0,0,0,0,309215,0,0,100019,100019,207499,207499,120330,120325,0,0,309215,309215,309215,309215,5763,0,0,0,309215,0,0,0,309215,309215,309215,309215


We did not expect to see NULL values in DataValue, since this table aggregates societal and behavioral indicators. However, we believe imputing values here would not be the right approach, as the gaps may indicate under-reporting for certain demographic or social groups, and we expect to learn more about this over the course of our analysis. We would like to examine these fields alongside the BRFSS dataset and explore whether data was collected as part of the survey but not reported within the CDI reporting period.

Taking a closer look at a few of the other columns with a large number of NULL values:

In [45]:
%%bigquery --project=fall24-ba775-a08
SELECT DISTINCT Response,
DataValueFootnoteSymbol,
StratificationCategory2,
Stratification2,
StratificationCategory3,
Stratification3,
StratificationCategoryID2,
StratificationID2,
StratificationCategoryID3,
StratificationID3
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators_backup`

Unnamed: 0,Response,DataValueFootnoteSymbol,StratificationCategory2,Stratification2,StratificationCategory3,Stratification3,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,,,,,,,,,,
1,,*,,,,,,,,
2,,****,,,,,,,,
3,,~,,,,,,,,
4,,~~,,,,,,,,
5,,#,,,,,,,,
6,,~~~~,,,,,,,,
7,,&,,,,,,,,
8,,##,,,,,,,,
9,,###,,,,,,,,


In [46]:
%%bigquery --project=fall24-ba775-a08
SELECT COUNT(*)
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators_backup`
WHERE DataValue <> DataValueAlt;

Unnamed: 0,f0_
0,6


We will be dropping the columns examined above, since they are either entirely NULL or, in the case of DataValueFootnoteSymbol, hold only footnote markers. We will also be dropping DataValueAlt, since it largely duplicates DataValue: the two differ in only 6 rows.
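One caveat when reading the count above: in SQL, a comparison involving NULL evaluates to NULL rather than TRUE, so `DataValue <> DataValueAlt` only counts rows where both values are present. The sketch below mimics that behaviour in Python with made-up value pairs, using `None` for NULL; the function names are hypothetical.

```python
def sql_not_equal(a, b):
    """Mimic SQL `a <> b`: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def count_where_different(pairs):
    """Count rows a `WHERE a <> b` filter would keep (only rows that are TRUE)."""
    return sum(1 for a, b in pairs if sql_not_equal(a, b) is True)


# Hypothetical (DataValue, DataValueAlt) pairs.
pairs = [(10.5, 10.5), (7.0, 7.1), (None, None), (3.2, None)]
print(count_where_different(pairs))  # 1
```

Rows where only one of the two columns is NULL would therefore not show up in the count and would need a separate check.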

In [47]:
%%bigquery --project=fall24-ba775-a08
CREATE OR REPLACE TABLE `fall24-ba775-a08.group_project.chronic_disease_indicators` AS (
  SELECT YearStart,
    YearEnd,
    LocationAbbr,
    LocationDesc,
    DataSource,
    Topic,
    Question,
    DataValueUnit,
    DataValueType,
    DataValue,
    DataValueFootnote,
    LowConfidenceLimit,
    HighConfidenceLimit,
    StratificationCategory1,
    Stratification1,
    Geolocation,
    LocationID,
    TopicID,
    QuestionID,
    ResponseID,
    DataValueTypeID,
    StratificationCategoryID1,
    StratificationID1
    FROM `fall24-ba775-a08.group_project.chronic_disease_indicators_backup`
);

Checking whether LocationID in the CDI dataset matches the _STATE column in the BRFSS dataset:

In [48]:
%%bigquery --project=fall24-ba775-a08
SELECT DISTINCT LocationDesc AS state_cdi, LocationId AS state_id_cdi, _STATE AS state_id_brfss, STATENAME AS state_brfss
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
FULL JOIN `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
ON LocationDesc = STATENAME
WHERE LocationId != _STATE
ORDER BY state_cdi

Unnamed: 0,state_cdi,state_id_cdi,state_id_brfss,state_brfss


The query returns no rows, so the two identifiers agree wherever the state names match. Confirming this is necessary because we will use LocationId and _STATE to join the tables.

Now that the cleaning is complete and the dataset is ready, we will move on to the EDA portion of the project.

# **6. Data Analysis**

### **i. Introduction**

We explored the CDI dataset to examine a set of chronic health problems affecting the US population, including Asthma, Arthritis, Cancer, Chronic Obstructive Pulmonary Disease (COPD), Diabetes and Mental Health Issues. The dataset presents metrics such as frequency, intensity, and means for different health indicators like binge drinking, obesity, depression, and smoking. Some metrics are presented as crude values, while others are adjusted for age to provide more standardized comparisons. The data covers a wide range of health conditions and behaviors, as well as access to healthcare, providing insights into public health trends and disparities.

In the analysis that follows, we delve into the complex landscape of chronic conditions, with a primary focus on diabetes. For most of our analysis we have chosen to consider Crude Prevalence over Age-adjusted Prevalence wherever the option is available within the CDI data to keep findings consistent with findings from the BRFSS data.

### **ii. Diabetes**

Diabetes is one of the most prevalent chronic conditions worldwide, with a significant impact on public health and individual well-being. As a leading cause of complications such as heart disease, kidney failure and neuropathy, diabetes presents a complex challenge that requires comprehensive care. Additionally, with insulin affordability being an area of concern in the US, we believe it is critical to understand how diabetes affects the population.

**Overview of Diabetes among Adults across the United States of America (2019-2022)**

In [49]:
diabetes_query = """
SELECT *
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
WHERE Topic = 'Diabetes'
AND Question = 'Diabetes among adults'
AND DataValueType = 'Crude Prevalence'
AND StratificationCategory1 = 'Overall'
"""

# Execute the query and load the data into a DataFrame
df_diabetes = client.query(diabetes_query).to_dataframe()

In [50]:
# Choropleth depicting diabetes across the United States of America, over a period from 2019 to 2022
fig_diabetes = px.choropleth(df_diabetes, locations="LocationAbbr", color="DataValue", hover_name="LocationDesc", animation_frame="YearStart",
                     locationmode='USA-states', scope='usa',
                    title='State-Wise Diabetes Prevalence (Crude) (2019 to 2022)', range_color=[8, 20], height=500, width=900,
                    labels={'LocationDesc':'State', 'DataValue':'Prevalence', 'YearStart':'Year'}, color_continuous_scale='ylgnbu')
# Formatting the Figure (notably the Slider and Play/Stop Buttons)
years = sorted(df_diabetes['YearStart'].unique())
fig_diabetes.update_layout(height=500, width=900, showlegend=True,
                   updatemenus=[{'direction': 'left', 'pad': {'r': 45, 't': 90}, 'type': 'buttons', 'x': 0.1, 'xanchor': 'right', 'y': 0.15, 'yanchor': 'top'}],
                   sliders=[{'yanchor': 'top', 'xanchor': 'left',
                             'currentvalue': {'font': {'size': 14}, 'prefix': 'Year:', 'visible': True, 'xanchor': 'right'},
                             'pad': {'b': 0}, 'len': 0.9, 'x': 0.1, 'y': 0.15,
                             'steps': [{'args': [[str(year)],
                                                 {'frame': {'duration': 3000, 'redraw': True}, 'mode': 'immediate'}],
                                        'label': str(year), 'method': 'animate'} for year in years]}])
fig_diabetes.show()

The map<sup>3, 4</sup> shows state-level diabetes prevalence for each year from 2019 to 2022. It shows a significant increase in diabetes prevalence in the southeastern United States over this period. The color shift indicates that West Virginia, Arkansas, Mississippi, and Alabama experienced the most severe increases in diabetes prevalence from 2019 to 2022. This pattern may be linked to socioeconomic factors, healthcare access, dietary habits, and lifestyle choices prevalent in these areas.
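The year-over-year shift that the map conveys through color can also be expressed as a number per state. The following is a minimal sketch, assuming a frame shaped like df_diabetes (one row per location and year); the toy rows and the helper name `prevalence_change` are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Toy rows in the same shape as df_diabetes (values are made up for illustration).
toy = pd.DataFrame({
    "LocationDesc": ["State A", "State A", "State B", "State B"],
    "YearStart": [2019, 2022, 2019, 2022],
    "DataValue": [12.0, 14.5, 10.0, 10.4],
})

def prevalence_change(df, start_year, end_year):
    """Return end-minus-start prevalence per location, largest increase first."""
    wide = df.pivot_table(index="LocationDesc", columns="YearStart", values="DataValue")
    change = (wide[end_year] - wide[start_year]).dropna()
    return change.sort_values(ascending=False).rename("change_pp")

change = prevalence_change(toy, 2019, 2022)
print(change)
```

Applied to the real query result, the top rows of such a table would show which states changed most in percentage points, which is harder to read off an animated color scale.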

**What was the overall prevalence of diabetes among adults across all states in 2022?**

In [51]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER (ORDER BY DataValue DESC) AS rank,
  LocationDesc AS state, DataValue AS overall_diabetes_prevalence
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
WHERE Topic = 'Diabetes'
AND Question = 'Diabetes among adults'
AND YearStart = 2022
AND StratificationCategory1 = 'Overall'
AND DataValueType = 'Crude Prevalence'
ORDER BY rank;

Unnamed: 0,rank,state,overall_diabetes_prevalence
0,1,Guam,21.6
1,2,Puerto Rico,17.7
2,3,West Virginia,17.4
3,4,Virgin Islands,15.9
4,5,Arkansas,15.7
5,6,Alabama,15.5
6,7,Mississippi,15.3
7,8,Tennessee,14.8
8,8,Kentucky,14.8
9,10,Louisiana,14.7


In 2022, Guam, Puerto Rico, West Virginia, Virgin Islands, Arkansas, Alabama and Mississippi show a high overall prevalence of diabetes, while District of Columbia, Colorado, Vermont, Montana and Utah have the lowest prevalence rates.

The territories of Guam (21.6%) and Puerto Rico (17.7%) have the highest diabetes prevalence rates in 2022, surpassing every U.S. state. While this might suggest specific public health challenges or lifestyle factors in these territories that contribute to higher rates of diabetes than in mainland states, their much smaller populations compared to the states may also inflate the estimated prevalence.

Among the U.S. states, West Virginia (17.4%), Arkansas (15.7%), Alabama (15.5%), and Mississippi (15.3%) show the highest diabetes prevalence rates. These states are all located in the South, a region that often has higher rates of chronic conditions, possibly due to a combination of socioeconomic factors, lifestyle habits, and access to healthcare.

**What is the relationship between economic status and diabetes prevalence, specifically in terms of the percentage of the general population versus the diabetic population that is classified as low income?**

Based on the data from the Legal Services Corporation (LSC), in 2022, household incomes below 125% of the federal poverty line correspond to annual incomes below 34,500 USD.<sup>5</sup>

To more accurately reflect real-world conditions and establish a clear threshold for annual household income, we will define a low-income household as one with an annual income below 35,000 USD. This adjustment is aimed at simplifying income categorization while aligning closely with the poverty-based income guidelines.
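To make the threshold concrete, here is a small sketch of how a low-income share can be computed from INCOME3 codes. Codes 1 to 5 are the brackets below 35,000 USD used in this analysis. Treating 77 and 99 as "don't know" and "refused" is an assumption taken from the BRFSS codebook, and the helper `low_income_share` and its toy input are hypothetical.

```python
import pandas as pd

LOW_INCOME_CODES = {1, 2, 3, 4, 5}   # INCOME3 brackets below 35,000 USD
NON_ANSWER_CODES = {77, 99}          # assumed "don't know" / "refused" codes

def low_income_share(income_codes, drop_non_answers=False):
    """Percentage of non-missing INCOME3 values that fall in the low-income brackets."""
    s = pd.Series(income_codes, dtype="float").dropna()
    if drop_non_answers:
        s = s[~s.isin(NON_ANSWER_CODES)]
    if s.empty:
        return None
    return round(s.isin(LOW_INCOME_CODES).mean() * 100, 1)

toy_codes = [1, 3, 5, 6, 7, 9, 77, 99, None, 4]
print(low_income_share(toy_codes))                         # non-answers kept in the denominator
print(low_income_share(toy_codes, drop_non_answers=True))  # non-answers excluded
```

With the default setting, every non-null value stays in the denominator, as with `COUNT(INCOME3)` in SQL. The second call shows how the share rises once non-answers are excluded, which is worth keeping in mind when comparing states with different refusal rates.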

In [52]:
%%bigquery --project=fall24-ba775-a08
WITH low_income AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(INCOME3) > 0 THEN ROUND((COUNTIF(INCOME3 IN (1.0, 2.0, 3.0, 4.0, 5.0)) / COUNT(INCOME3)) * 100, 1)
      ELSE NULL
    END AS low_income_proportion
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  GROUP BY state, state_id
),
low_income_diabetic AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(INCOME3) > 0 THEN ROUND((COUNTIF(INCOME3 IN (1.0, 2.0, 3.0, 4.0, 5.0)) / COUNT(INCOME3)) * 100, 1)
      ELSE NULL
    END AS low_income_proportion_diabetic
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  WHERE DIABETE4 = 1.0
  GROUP BY state, state_id)
SELECT RANK() OVER(ORDER BY low_income_proportion_diabetic DESC) AS rank,
  li.state, low_income_proportion, low_income_proportion_diabetic
FROM low_income li
INNER JOIN low_income_diabetic lid
ON li.state_id = lid.state_id
ORDER BY rank;

Unnamed: 0,rank,state,low_income_proportion,low_income_proportion_diabetic
0,1,Puerto Rico,60.9,66.7
1,2,New Mexico,32.2,43.4
2,3,Louisiana,29.9,41.6
3,4,Illinois,27.7,41.0
4,5,Mississippi,31.5,40.6
5,6,Arkansas,30.5,39.8
6,7,West Virginia,31.0,39.6
7,8,Virgin Islands,30.7,39.5
8,9,Tennessee,27.7,39.1
9,10,Kentucky,27.8,38.0


Regions with higher proportions of low-income households, such as Puerto Rico (60.9%) and Guam (31.5%) among territories, and West Virginia (31%) and Louisiana (29.9%) among states, also exhibit elevated diabetes prevalence rates (17.7% for Puerto Rico, 21.6% for Guam, 17.4% for West Virginia and 14.7% for Louisiana). This pattern suggests a positive correlation between low-income status and diabetes prevalence, likely linked to limited access to healthcare, nutritious food, and health education.

Lower-income communities often face obstacles to preventive healthcare, increasing the risk of chronic conditions like diabetes. These economic barriers may indirectly contribute to higher diabetes rates by limiting access to essential health resources and support.

The fourth column, which measures the proportion of low-income individuals among those diagnosed with diabetes, indicates that a significant portion of people with diabetes in these regions are from low-income backgrounds. For example, in Puerto Rico, a notable 66.7% of people with diabetes are in the low-income category. This highlights a compounded vulnerability where economically disadvantaged individuals face greater challenges in managing and preventing diabetes.
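The suggested correlation can be checked numerically. The sketch below is illustrative only: it uses the eight regions that appear in both top-10 tables above (2022 diabetes prevalence and low_income_proportion), not the full set of states, so the coefficient describes this small subset. The helper name `pearson_r` is hypothetical.

```python
import numpy as np

# Regions that appear in BOTH top-10 tables above (not the full set of states).
regions = ["Puerto Rico", "West Virginia", "Virgin Islands", "Arkansas",
           "Mississippi", "Tennessee", "Kentucky", "Louisiana"]
low_income_pct = np.array([60.9, 31.0, 30.7, 30.5, 31.5, 27.7, 27.8, 29.9])
diabetes_pct = np.array([17.7, 17.4, 15.9, 15.7, 15.3, 14.8, 14.8, 14.7])

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    return float(np.corrcoef(x, y)[0, 1])

r_all = pearson_r(low_income_pct, diabetes_pct)

# Puerto Rico sits far from the other points, so check how much it drives the result.
keep = np.array([name != "Puerto Rico" for name in regions])
r_without_pr = pearson_r(low_income_pct[keep], diabetes_pct[keep])

print(f"r (8 regions): {r_all:.2f}")
print(f"r (without Puerto Rico): {r_without_pr:.2f}")
```

Running the same calculation on all states and territories, rather than on rows that happen to rank highly in both tables, would give a fairer estimate of the relationship.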


**How does insurance coverage among individuals with diabetes compare with coverage in the general population?**

In [53]:
%%bigquery --project=fall24-ba775-a08
WITH uninsured AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(PRIMINSR) > 0 THEN ROUND((COUNTIF(PRIMINSR = 88.0) / COUNT(PRIMINSR)) * 100, 1)
      ELSE NULL
    END AS uninsured_proportion
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  GROUP BY state, state_id
),
uninsured_diabetic AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(PRIMINSR) > 0 THEN ROUND((COUNTIF(PRIMINSR = 88.0) / COUNT(PRIMINSR)) * 100, 1)
      ELSE NULL
    END AS uninsured_proportion_diabetic
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  WHERE DIABETE4 = 1.0
  GROUP BY state, state_id)
SELECT RANK() OVER(ORDER BY uninsured_proportion_diabetic DESC) AS rank,
  u.state, uninsured_proportion, uninsured_proportion_diabetic
FROM uninsured u
INNER JOIN uninsured_diabetic ud
ON u.state_id = ud.state_id
ORDER BY rank;

Unnamed: 0,rank,state,uninsured_proportion,uninsured_proportion_diabetic
0,1,Virgin Islands,15.6,10.2
1,2,Texas,12.3,8.1
2,3,Illinois,9.4,7.8
3,4,North Carolina,9.3,5.4
4,5,Colorado,7.4,5.1
5,6,Guam,10.8,4.9
6,7,Tennessee,7.7,4.6
7,8,Georgia,8.0,4.3
8,8,Nevada,8.1,4.3
9,10,Utah,6.7,4.2


Locations with a higher proportion of uninsured individuals, such as the Virgin Islands, Texas, Illinois, and Guam, often exhibit higher diabetes prevalence. This suggests a potential positive correlation between lack of insurance and diabetes prevalence, highlighting that those without health insurance may have reduced access to healthcare resources needed to prevent or manage diabetes effectively.

Without insurance, individuals may be less likely to seek regular check-ups, screenings, and diabetes management resources, potentially leading to higher rates of undiagnosed or poorly managed diabetes.

The fourth column, which measures the proportion of uninsured individuals among those diagnosed with diabetes, suggests that individuals diagnosed with diabetes are more likely to have insurance coverage than the general population. For instance, while the overall uninsured rate in the Virgin Islands is 15.6%, only 10.2% of those with diabetes are uninsured. This pattern is consistent across the other regions shown, such as Texas (12.3% overall uninsured rate vs. 8.1% uninsured among those with diabetes), indicating that individuals with diabetes may prioritize or seek out insurance to help manage their condition.

**Are there significant correlations between states with high diabetes prevalence and other chronic conditions like obesity and hypertension?**

In [54]:
%%bigquery --project=fall24-ba775-a08
WITH obesity AS (
  SELECT LocationDesc AS state, LocationId as state_id, DataValue AS obesity_prevalence
  FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
  WHERE Topic = 'Nutrition, Physical Activity, and Weight Status'
  AND DataSource = 'BRFSS'
  AND Question = 'Obesity among adults'
  AND DataValueType = 'Crude Prevalence'
  AND StratificationCategory1 = 'Overall'
  AND YearStart = 2022
),
diabetes AS (
  SELECT distinct LocationDesc AS state, LocationId as state_id, DataValue AS diabetes_prevalence
  FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
  WHERE Topic = 'Diabetes'
  AND DataSource = 'BRFSS'
  AND Question = 'Diabetes among adults'
  AND DataValueType = 'Crude Prevalence'
  AND StratificationCategory1 = 'Overall'
  AND YearStart = 2022
),
hypertension AS (
  SELECT distinct LocationDesc AS state, LocationId as state_id, DataValue AS hypertension_prevalence
  FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
  WHERE Topic = 'Cardiovascular Disease'
  AND DataSource = 'BRFSS'
  AND Question = 'High blood pressure among adults'
  AND DataValueType = 'Crude Prevalence'
  AND StratificationCategory1 = 'Overall'
  AND YearStart = 2021  -- hypertension uses 2021 here; the obesity and diabetes CTEs use 2022
)
SELECT RANK() OVER(ORDER BY obesity_prevalence DESC) AS rank,
  o.state, o.obesity_prevalence, d.diabetes_prevalence, h.hypertension_prevalence
FROM obesity o
INNER JOIN diabetes d
ON o.state_id = d.state_id
INNER JOIN hypertension h
ON o.state_id = h.state_id
ORDER BY rank;

Unnamed: 0,rank,state,obesity_prevalence,diabetes_prevalence,hypertension_prevalence
0,1,West Virginia,41.0,17.4,43.4
1,2,Louisiana,40.1,14.7,40.2
2,3,Oklahoma,40.0,13.3,38.9
3,4,Mississippi,39.5,15.3,43.9
4,5,Tennessee,38.9,14.8,37.7
5,6,Alabama,38.3,15.5,42.7
6,7,Ohio,38.1,13.1,35.6
7,8,Delaware,37.9,13.9,36.2
8,9,Kentucky,37.7,14.8,39.9
9,9,Wisconsin,37.7,10.3,31.6


States such as West Virginia, Louisiana, Oklahoma, Mississippi, Tennessee and Alabama not only show high obesity prevalence but are also characterized by elevated rates of diabetes and hypertension. While it can be noted that Oklahoma's diabetes prevalence is close to the national average, this clustering suggests that these states may face common challenges that contribute to the widespread prevalence of these chronic conditions, such as socioeconomic factors, lifestyle, and limited healthcare access.

As per the CDC's global obesity prevalence maps, Mississippi has historically stood out for consistently ranking among the highest in the country for obesity, diabetes, and hypertension prevalence. This pattern indicates that Mississippi is a critical area for public health intervention, requiring targeted efforts to address the persistent health challenges facing its population.<sup>6</sup>

The suggested correlation between obesity, diabetes, and hypertension in these states supports the well-established link between obesity and these other chronic diseases that can be explored in detail going forward. Obesity is a primary risk factor for both diabetes and hypertension, as excess body weight can lead to insulin resistance and increased cardiovascular strain. This emphasizes the need for comprehensive healthcare strategies that address multiple health risks simultaneously.
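As a rough check on the clustering described above, the ten rows of the result table can be loaded into a small frame and correlated pairwise. This is only a sketch on the top-10 obesity states shown, so it says nothing about the remaining states; the frame name `top10` is hypothetical.

```python
import pandas as pd

# Values copied from the ten rows of the obesity/diabetes/hypertension table above.
top10 = pd.DataFrame(
    {
        "obesity": [41.0, 40.1, 40.0, 39.5, 38.9, 38.3, 38.1, 37.9, 37.7, 37.7],
        "diabetes": [17.4, 14.7, 13.3, 15.3, 14.8, 15.5, 13.1, 13.9, 14.8, 10.3],
        "hypertension": [43.4, 40.2, 38.9, 43.9, 37.7, 42.7, 35.6, 36.2, 39.9, 31.6],
    },
    index=["West Virginia", "Louisiana", "Oklahoma", "Mississippi", "Tennessee",
           "Alabama", "Ohio", "Delaware", "Kentucky", "Wisconsin"],
)

# Pairwise Pearson correlations between the three prevalence columns.
corr = top10.corr().round(2)
print(corr)
```

Because the rows were selected for high obesity, the range of obesity values is narrow, which tends to weaken any correlation involving that column. Repeating the calculation on the full query result would be more informative.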

### **iii. Comorbidities**

We will now investigate different comorbidities and their occurrence across different populations, as well as how they relate to diabetes prevalence. Understanding these co-occurring conditions allows us to uncover patterns that may influence both management and outcomes in various demographic groups.

**How does the prevalence of specific disease combinations (e.g., diabetes and heart disease) vary across different age groups?**

In [55]:
%%bigquery --project=fall24-ba775-a08
SELECT
  CASE
    WHEN _AGE80 >= 18 AND _AGE80 <= 24 THEN '18 to 24'
    WHEN _AGE80 >= 25 AND _AGE80 <= 34 THEN '25 to 34'
    WHEN _AGE80 >= 35 AND _AGE80 <= 44 THEN '35 to 44'
    WHEN _AGE80 >= 45 AND _AGE80 <= 64 THEN '45 to 64'
    ELSE '65 and above'
  END AS age_group,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (CVDCRHD4 = 1.0 OR CVDSTRK3 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_heartdisease_pct,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (ASTHMA3 = 1.0 OR CHCCOPD3 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_respiratorydisease_pct,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (CHCSCNC1 = 1.0 OR CHCOCNC1 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_cancer_pct,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (ADDEPEV3 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_depression_pct,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (CHCKDNY2 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_kidneydisease_pct,
  ROUND(SUM(CASE WHEN DIABETE4 IN (1.0, 2.0) AND (HAVARTH4 = 1.0) THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_arthritis_pct
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
GROUP BY age_group
ORDER BY age_group;

Unnamed: 0,age_group,diabetes_heartdisease_pct,diabetes_respiratorydisease_pct,diabetes_cancer_pct,diabetes_depression_pct,diabetes_kidneydisease_pct,diabetes_arthritis_pct
0,18 to 24,0.1,0.51,0.11,0.71,0.09,0.24
1,25 to 34,0.13,1.0,0.13,1.49,0.15,0.51
2,35 to 44,0.49,2.04,0.46,2.66,0.46,1.92
3,45 to 64,2.62,4.5,2.17,4.82,1.55,7.4
4,65 and above,5.8,5.75,6.58,4.2,3.46,13.12


The analysis of chronic disease combinations reveals distinct patterns across age groups. The 65+ age group has the highest prevalence of every combination except diabetes and depression, led by diabetes and arthritis (13.12%), diabetes and cancer (6.58%), and diabetes and heart disease (5.8%), indicating a significant burden of comorbidities. This group also has the highest rates of diabetes with respiratory disease (5.75%) and diabetes with kidney disease (3.46%), emphasizing the health challenges older adults face. The 45-64 age group shows the highest prevalence of diabetes and depression (4.82%), along with a noticeable rise in diabetes and arthritis (7.4%) and diabetes and cancer (2.17%) compared to younger age groups.

Notably, the diabetes and arthritis combination is far more common in the 65+ group (13.12%) than in the 45-64 group (7.4%), suggesting it becomes a more prevalent concern as people age. In contrast, younger age groups (18-34) have much lower rates of chronic disease combinations, though diabetes and depression is the most common combination in the 25-34 group (1.49%).

These findings highlight the growing burden of chronic diseases, especially arthritis and respiratory diseases, in older adults, while pointing to a rise in mental health comorbidities like depression in middle-aged adults with diabetes. The need for targeted healthcare interventions becomes evident, especially for managing multiple comorbidities in older populations and addressing mental health challenges in the 45-64 age group.
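The age grouping behind these figures can be reproduced outside SQL with explicit bin edges, which makes boundary ages such as 24 and 25 easy to verify. The sketch below assumes _AGE80 is an age in years capped at 80; the toy flags and the helper `age_group` are hypothetical. Unlike a catch-all ELSE branch, `pd.cut` leaves missing ages unassigned instead of placing them in the oldest group.

```python
import pandas as pd

# Right-closed bins: (17, 24], (24, 34], (34, 44], (44, 64], (64, 80]
AGE_BINS = [17, 24, 34, 44, 64, 80]
AGE_LABELS = ["18 to 24", "25 to 34", "35 to 44", "45 to 64", "65 and above"]

def age_group(ages):
    """Map ages in years to the five age-group labels used in the analysis."""
    return pd.cut(pd.Series(ages), bins=AGE_BINS, labels=AGE_LABELS)

# Toy respondents: 1 means the condition was reported, 0 means it was not.
toy = pd.DataFrame({
    "_AGE80": [18, 24, 25, 34, 35, 64, 65, 80],
    "diabetes": [0, 0, 0, 1, 0, 1, 1, 1],
    "depression": [0, 1, 0, 1, 0, 0, 1, 0],
})
toy["age_group"] = age_group(toy["_AGE80"])
toy["both"] = (toy["diabetes"] == 1) & (toy["depression"] == 1)

# Share of ALL respondents in each age group who report both conditions.
pct = toy.groupby("age_group", observed=True)["both"].mean().mul(100).round(2)
print(pct)
```

The denominator here is every respondent in the age group, matching the `COUNT(*)` denominator in the query, so the percentages describe the whole group and not only people with diabetes.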

**How do chronic disease rates differ between states with varying levels of routine checkup fulfillment? Are there noticeable differences between states with a higher focus on preventive care?**

In [56]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER (ORDER BY ROUND((COUNTIF(CHECKUP1 IN (3.0, 4.0, 8.0, 9.0)) / COUNT(CHECKUP1)) * 100, 1) DESC) AS checkup_rank,
  STATENAME AS state,
  ROUND((SUM(IF((CASE WHEN CVDINFR4 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CVDCRHD4 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CVDSTRK3 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN ASTHMA3 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CHCSCNC1 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CHCOCNC1 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CHCCOPD3 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN ADDEPEV3 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN CHCKDNY2 = 1.0 THEN 1 ELSE 0 END +
  CASE WHEN DIABETE4 = 1.0 THEN 1 ELSE 0 END) > 1, 1, 0)) / COUNT(*)) * 100, 2) as two_or_more_chronic_conditions,
  CASE
    WHEN COUNT(CHECKUP1) > 0 THEN ROUND((COUNTIF(CHECKUP1 IN (3.0, 4.0, 8.0, 9.0)) / COUNT(CHECKUP1)) * 100, 1)
    ELSE NULL
  END AS checkup_proportion
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
GROUP BY state
ORDER BY checkup_rank;

Unnamed: 0,checkup_rank,state,two_or_more_chronic_conditions,checkup_proportion
0,1,Alaska,20.53,16.0
1,2,Colorado,21.19,15.3
2,2,Utah,23.09,15.3
3,4,Nevada,26.32,14.8
4,5,Oregon,24.41,14.3
5,6,California,20.71,14.0
6,7,Washington,25.56,13.3
7,7,Guam,17.7,13.3
8,9,Wyoming,24.75,13.2
9,9,New Mexico,26.12,13.2


The analysis of chronic disease rates in relation to routine checkup fulfilment highlights significant trends across different states.

Note that checkup_proportion counts CHECKUP1 codes 3, 4, 8 and 9, which in the BRFSS codebook correspond to a last routine checkup two or more years ago, never having had one, or a refused answer. It therefore measures the share of respondents without a recent routine checkup, and the ranking above lists the states where checkups are most often overdue. Read this way, states with a smaller overdue share tend to have more individuals with two or more chronic conditions. For instance, West Virginia (35.45%) and Alabama (30.8%) report some of the highest rates of multiple chronic conditions, while only 8.3% and 7.5% of their respondents fall into the overdue category. Maine (28.5%), Arkansas (33.09%) and Kentucky (30.62%) show a similar combination. In contrast, Alaska (20.53%) and Colorado (21.19%) have the largest overdue shares (16.0% and 15.3%) but lower rates of multiple chronic conditions. This association should not be read as causal: one plausible explanation is that people already living with chronic conditions see a doctor more regularly, and differences in age structure between states may also play a role.
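The two_or_more_chronic_conditions measure in the query above adds up ten yes/no flags per respondent and checks whether the total exceeds one. A compact sketch of the same idea in pandas follows, assuming code 1 means "yes" for every flag as in the query; the toy frame and the helper `multimorbidity_share` are hypothetical.

```python
import pandas as pd

CONDITION_COLUMNS = [
    "CVDINFR4", "CVDCRHD4", "CVDSTRK3", "ASTHMA3", "CHCSCNC1",
    "CHCOCNC1", "CHCCOPD3", "ADDEPEV3", "CHCKDNY2", "DIABETE4",
]

def multimorbidity_share(df):
    """Percentage of rows reporting 'yes' (code 1) for two or more conditions."""
    has_condition = df[CONDITION_COLUMNS].eq(1)      # missing values compare as False
    two_or_more = has_condition.sum(axis=1) > 1
    return round(two_or_more.mean() * 100, 2)

# Four toy respondents, all flags initially 2 (treated here as "no").
toy = pd.DataFrame(2.0, index=range(4), columns=CONDITION_COLUMNS)
toy.loc[0, ["DIABETE4", "ADDEPEV3"]] = 1.0               # two conditions
toy.loc[1, ["ASTHMA3"]] = 1.0                             # one condition
toy.loc[2, ["CVDINFR4", "CVDCRHD4", "CHCKDNY2"]] = 1.0   # three conditions

print(multimorbidity_share(toy))
```

Grouping the real data by state before calling such a helper would reproduce the per-state column, and keeping the column list in one place makes it easier to add or remove a condition consistently.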


**What is the most prevalent combination of chronic conditions among the population, based on the provided data on comorbidities?**
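The cell that follows writes out every pair among ten condition flags by hand, which comes to 45 SELECT statements joined by UNION ALL. As an alternative sketch, not the approach used in this notebook, the same statements could be generated with `itertools.combinations`. The dictionary below restates the labels and columns from that query; the helper `build_pair_selects` is hypothetical.

```python
from itertools import combinations

# Display label -> BRFSS column, in the order used by the hand-written query.
CONDITIONS = {
    "Heart Attack": "CVDINFR4",
    "Angina": "CVDCRHD4",
    "Stroke": "CVDSTRK3",
    "Asthma": "ASTHMA3",
    "Skin Cancer": "CHCSCNC1",
    "Other Cancer": "CHCOCNC1",
    "COPD": "CHCCOPD3",
    "Depression": "ADDEPEV3",
    "Kidney Disease": "CHCKDNY2",
    "Diabetes": "DIABETE4",
}
TABLE = "`fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`"

def build_pair_selects(conditions, table):
    """Return one counting SELECT per unordered pair of conditions."""
    selects = []
    for (name_a, col_a), (name_b, col_b) in combinations(conditions.items(), 2):
        selects.append(
            f"SELECT '{name_a}, {name_b}' AS chronic_disease_combination, "
            f"COUNT(*) AS combination_count\n"
            f"FROM {table}\n"
            f"WHERE {col_a} = 1 AND {col_b} = 1"
        )
    return selects

pair_selects = build_pair_selects(CONDITIONS, TABLE)
comorbidities_sql = "\nUNION ALL\n".join(pair_selects)
print(len(pair_selects))
```

The joined string could be placed inside the comorbidities CTE. Generating it keeps labels and column names in one mapping, so a typo in a single branch is less likely.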

In [57]:
%%bigquery --project=fall24-ba775-a08
WITH comorbidities AS (
    SELECT 'Heart Attack, Angina' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CVDCRHD4 = 1
    UNION ALL
    SELECT 'Heart Attack, Stroke' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CVDSTRK3 = 1
    UNION ALL
    SELECT 'Heart Attack, Asthma' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND ASTHMA3 = 1
    UNION ALL
    SELECT 'Heart Attack, Skin Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CHCSCNC1 = 1
    UNION ALL
    SELECT 'Heart Attack, Other Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CHCOCNC1 = 1
    UNION ALL
    SELECT 'Heart Attack, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Heart Attack, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Heart Attack, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Heart Attack, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Angina, Stroke' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND CVDSTRK3 = 1
    UNION ALL
    SELECT 'Angina, Asthma' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND ASTHMA3 = 1
    UNION ALL
    SELECT 'Angina, Skin Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND CHCSCNC1 = 1
    UNION ALL
    SELECT 'Angina, Other Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND CHCOCNC1 = 1
    UNION ALL
    SELECT 'Angina, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Angina, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Angina, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Angina, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDCRHD4 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Stroke, Asthma' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND ASTHMA3 = 1
    UNION ALL
    SELECT 'Stroke, Skin Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND CHCSCNC1 = 1
    UNION ALL
    SELECT 'Stroke, Other Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND CHCOCNC1 = 1
    UNION ALL
    SELECT 'Stroke, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Stroke, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Stroke, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Stroke, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDSTRK3 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Asthma, Skin Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND CHCSCNC1 = 1
    UNION ALL
    SELECT 'Asthma, Other Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND CHCOCNC1 = 1
    UNION ALL
    SELECT 'Asthma, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Asthma, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Asthma, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Asthma, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ASTHMA3 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Skin Cancer, Other Cancer' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCSCNC1 = 1 AND CHCOCNC1 = 1
    UNION ALL
    SELECT 'Skin Cancer, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCSCNC1 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Skin Cancer, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCSCNC1 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Skin Cancer, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCSCNC1 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Skin Cancer, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCSCNC1 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Other Cancer, COPD' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCOCNC1 = 1 AND CHCCOPD3 = 1
    UNION ALL
    SELECT 'Other Cancer, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCOCNC1 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'Other Cancer, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCOCNC1 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Other Cancer, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCOCNC1 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'COPD, Depression' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCCOPD3 = 1 AND ADDEPEV3 = 1
    UNION ALL
    SELECT 'COPD, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCCOPD3 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'COPD, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCCOPD3 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Depression, Kidney Disease' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ADDEPEV3 = 1 AND CHCKDNY2 = 1
    UNION ALL
    SELECT 'Depression, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE ADDEPEV3 = 1 AND DIABETE4 = 1
    UNION ALL
    SELECT 'Kidney Disease, Diabetes' AS chronic_disease_combination, COUNT(*) AS combination_count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CHCKDNY2 = 1 AND DIABETE4 = 1
),
overall_count AS (
    SELECT COUNT(*) AS count
    FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
    WHERE CVDINFR4 IS NOT NULL OR
    CVDCRHD4 IS NOT NULL OR
    CVDSTRK3 IS NOT NULL OR
    ASTHMA3 IS NOT NULL OR
    CHCSCNC1 IS NOT NULL OR
    CHCOCNC1 IS NOT NULL OR
    CHCCOPD3 IS NOT NULL OR
    ADDEPEV3 IS NOT NULL OR
    CHCKDNY2 IS NOT NULL OR
    DIABETE4 IS NOT NULL
)
SELECT RANK() OVER(ORDER BY combination_count DESC) as comorbidity_rank, chronic_disease_combination, ROUND((combination_count/count) * 100, 2) AS comorbidity_pct
FROM comorbidities, overall_count
ORDER BY comorbidity_rank
LIMIT 5;

Unnamed: 0,comorbidity_rank,chronic_disease_combination,comorbidity_pct
0,1,"Asthma, Depression",5.25
1,2,"Depression, Diabetes",3.45
2,3,"Asthma, COPD",3.16
3,4,"COPD, Depression",3.03
4,5,"Skin Cancer, Other Cancer",2.91


We see that the most common comorbidity combination is Asthma and Depression (5.25%), indicating that these conditions frequently co-occur in the population. This may suggest a potential interaction between mental health and chronic respiratory conditions, possibly driven by stress or lifestyle factors.

Depression appears in three of the top four combinations, suggesting that mental health challenges are a significant comorbidity across various chronic illnesses and underscoring the need for integrated mental health care in chronic disease management.

Depression and Diabetes is the second most prevalent comorbidity combination, reported by 3.45% of the population. Shared risk factors may partly explain this. For example, chronic stress, which is common in depression, can also affect insulin resistance and glucose metabolism, worsening diabetes symptoms.
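The pairwise counting behind the query above can also be sketched in Python. This is a minimal illustration, not the notebook's method: the `records` list is hypothetical toy data, only four of the condition columns are included, and the denominator is simply the number of records rather than the "any condition answered" count used in the SQL.

```python
from collections import Counter
from itertools import combinations

# Subset of the BRFSS condition columns used in the query above (1 = "yes").
CONDITIONS = ["ASTHMA3", "ADDEPEV3", "DIABETE4", "CHCCOPD3"]


def pair_percentages(records):
    """Return {(cond_a, cond_b): percent of records reporting both conditions}."""
    counts = Counter()
    for row in records:
        present = [c for c in CONDITIONS if row.get(c) == 1]
        # Every unordered pair of reported conditions counts once per respondent.
        for pair in combinations(present, 2):
            counts[pair] += 1
    total = len(records)
    return {pair: round(100 * n / total, 2) for pair, n in counts.items()}


# Hypothetical toy respondents, for illustration only.
records = [
    {"ASTHMA3": 1, "ADDEPEV3": 1, "DIABETE4": 2, "CHCCOPD3": 2},
    {"ASTHMA3": 2, "ADDEPEV3": 1, "DIABETE4": 1, "CHCCOPD3": 2},
    {"ASTHMA3": 1, "ADDEPEV3": 1, "DIABETE4": 1, "CHCCOPD3": 2},
    {"ASTHMA3": 2, "ADDEPEV3": 2, "DIABETE4": 2, "CHCCOPD3": 2},
]

pcts = pair_percentages(records)
for pair, pct in sorted(pcts.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, pct)
```

Generating the pairs with `itertools.combinations` avoids writing one `UNION ALL` branch per combination by hand, which makes it harder to miss or mistype a pair.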

**How does the prevalence of asthma among individuals with diabetes vary across different U.S. states?**

In [58]:
%%bigquery --project=fall24-ba775-a08
SELECT
  RANK() OVER(ORDER BY SUM(CASE WHEN ASTHMA3 = 1 THEN 1 ELSE 0 END) / COUNT(*) DESC) AS diabetic_asthma_rank,
  STATENAME as state,
  ROUND((SUM(CASE WHEN ASTHMA3 = 1 THEN 1 ELSE 0 END) / COUNT(*)) * 100.0, 2)  AS asthma_percentage
FROM
  `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
WHERE
  DIABETE4 = 1
GROUP BY
  STATENAME
ORDER BY
  asthma_percentage desc;

Unnamed: 0,diabetic_asthma_rank,state,asthma_percentage
0,1,West Virginia,23.43
1,2,Hawaii,23.4
2,3,Colorado,22.76
3,4,Tennessee,22.57
4,5,Oregon,22.48
5,6,Puerto Rico,22.36
6,7,Rhode Island,22.31
7,8,Connecticut,22.3
8,9,District of Columbia,22.09
9,10,Virginia,21.95


West Virginia (23.43%), Hawaii (23.40%), Colorado (22.76%), Tennessee (22.57%), Oregon (22.48%) and Puerto Rico (22.36%) have the highest prevalence of asthma in the diabetic population, with nearly 1 in 4 people with diabetes in these states also reporting asthma.

Washington has the highest number of asthma cases reported despite having a relatively lower prevalence of 21.49%. This may reflect a larger pool of respondents or better reporting and diagnosis infrastructure.

The Northeast region (e.g., Rhode Island, Connecticut, Vermont) generally shows asthma prevalence rates around 22%, indicating a consistent trend across this area. States in the Midwest tend to rank lower in asthma prevalence.

States and territories with the lowest prevalence include Nebraska (14.47%), Guam (15.93%) and Virgin Islands (10.90%), indicating significant variation as the lowest prevalence is less than half of the highest.

### **iv. Mental Health Struggles**

Given that depression is a common factor in the most prevalent comorbid conditions, it is essential to explore mental health struggles in detail. Depression can significantly impact diabetes management, as individuals with both conditions may face challenges in adhering to treatment plans, maintaining healthy lifestyles, and coping with the emotional strain of managing a chronic illness. Understanding how depression interacts with diabetes not only helps identify potential barriers to effective care but also highlights the need for integrated mental health and diabetes management strategies to improve overall well-being and outcomes for affected individuals.

**Overview of Depression among Adults across the United States of America (2019-2022)**

In [59]:
mh_query = """
SELECT *
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
WHERE Topic = 'Mental Health'
AND Question = 'Depression among adults'
AND DataValueType = 'Crude Prevalence'
AND StratificationCategory1 = 'Overall'
"""

# Execute the query and load the data into a DataFrame
df_mh = client.query(mh_query).to_dataframe()

In [60]:
# Choropleth depicting depression among adults across the United States of America, over a period from 2019 to 2022
fig_mh = px.choropleth(df_mh, locations="LocationAbbr", color="DataValue", hover_name="LocationDesc", animation_frame="YearStart",
                     locationmode='USA-states', scope='usa',
                    title='State-Wise Depression Prevalence (Crude, among adults) (2019 to 2022)', range_color=[10, 40], height=500, width=900,
                    labels={'LocationDesc':'State', 'DataValue':'Prevalence', 'YearStart':'Year'}, color_continuous_scale='ylgnbu')
# Formatting the Figure (notably the Slider and Play/Stop Buttons)
years = sorted(df_mh['YearStart'].unique())
fig_mh.update_layout(height=500, width=900, showlegend=True,
                   updatemenus=[{'direction': 'left', 'pad': {'r': 45, 't': 90}, 'type': 'buttons', 'x': 0.1, 'xanchor': 'right', 'y': 0.15, 'yanchor': 'top'}],
                   sliders=[{'yanchor': 'top', 'xanchor': 'left',
                             'currentvalue': {'font': {'size': 14}, 'prefix': 'Year:', 'visible': True, 'xanchor': 'right'},
                             'pad': {'b': 0}, 'len': 0.9, 'x': 0.1, 'y': 0.15,
                             'steps': [{'args': [[str(year)],
                                                 {'frame': {'duration': 3000, 'redraw': True}, 'mode': 'immediate'}],
                                        'label': str(year), 'method': 'animate'} for year in years]}])
fig_mh.show()

Over the last two years, Maine, Louisiana, Arkansas, Oklahoma and Ohio have shown an increase in the prevalence of depression. West Virginia has consistently shown high prevalence rates over the last four years. Idaho, Colorado, Wyoming and New Mexico have shown a slight increase in the prevalence of depression over the last four years.

A pattern emerges from the state-wise prevalence heatmap<sup>3, 4</sup>: southeastern states have high prevalence rates, which suggests that social determinants of health may be contributing to depression in these populations.

One interesting observation is that California has not shown a significant increase in prevalence rates. Unlike states with a higher prevalence of depression, people in California may have better access to mental healthcare services, earlier diagnosis of mental health conditions, and socio-economic conditions that mitigate some of the factors leading to depression.

**What was the overall prevalence of depression among adults across all states in 2022?**

In [61]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER(ORDER BY DataValue DESC) AS depression_rank,
  LocationDesc AS state, DataValue AS overall_depression_prevalence
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
WHERE Topic = 'Mental Health'
AND Question = 'Depression among adults'
AND YearStart = 2022
AND StratificationCategory1 = 'Overall'
AND DataValueType = 'Crude Prevalence'
ORDER BY depression_rank;

Unnamed: 0,depression_rank,state,overall_depression_prevalence
0,1,Tennessee,29.2
1,2,West Virginia,26.9
2,2,Oklahoma,26.9
3,4,Arkansas,26.6
4,5,Utah,26.5
5,6,Louisiana,26.4
6,7,Maine,26.3
7,8,Kentucky,25.8
8,9,New Hampshire,25.2
9,10,Ohio,25.0


Tennessee (29.2%), Oklahoma (26.9%), West Virginia (26.9%), Arkansas (26.6%), and Utah (26.5%) reported the highest overall prevalence of depression among adults in 2022. Southeastern states such as Louisiana, Arkansas, and Oklahoma consistently show high prevalence rates over the years.

Socioeconomic disparities, access to mental healthcare, and education may significantly contribute to the higher prevalence in southeastern states.
Rural or economically disadvantaged regions may lack adequate mental health resources, exacerbating the issue.

States like New Jersey and Hawaii may have better mental health infrastructure, early diagnosis capabilities, and socio-economic advantages, leading to comparatively lower depression rates.

States with persistently high prevalence, such as Louisiana, Oklahoma and Maine, could benefit from targeted policies focusing on improving mental health services, addressing socio-economic inequalities, and public awareness campaigns.


**Is there a pattern between higher rates of mental health struggles and the prevalence of two or more chronic conditions in each state?**

In [62]:
%%bigquery --project=fall24-ba775-a08
WITH chronic_condition AS (
  SELECT DISTINCT LocationDesc AS state, LocationId AS state_id,  DataValue AS two_or_more_chronic_conditions
  FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
  WHERE YearStart = 2022
  AND Topic = 'Health Status'
  AND Question = '2 or more chronic conditions among adults'
  AND DataValueType = 'Crude Prevalence'
  AND StratificationCategory1 = 'Overall'
),
mental_health AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(MENTHLTH) > 0 THEN ROUND((COUNTIF(MENTHLTH IN (20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0)) / COUNT(MENTHLTH)) * 100, 1)
      ELSE NULL
    END AS low_mental_health_proportion
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  GROUP BY state, state_id
)
SELECT RANK() OVER(ORDER BY m.low_mental_health_proportion DESC) AS low_mental_health_rank,
  m.state, m.low_mental_health_proportion, c.two_or_more_chronic_conditions
FROM mental_health m
INNER JOIN chronic_condition c
ON m.state_id = c.state_id
ORDER BY low_mental_health_rank;

Unnamed: 0,low_mental_health_rank,state,low_mental_health_proportion,two_or_more_chronic_conditions
0,1,West Virginia,13.3,30.8
1,2,Tennessee,12.7,25.8
2,3,Louisiana,12.3,23.1
3,4,Arkansas,11.6,25.3
4,5,Oregon,11.4,20.2
5,6,Ohio,11.2,23.4
6,6,Alabama,11.2,25.9
7,8,Oklahoma,11.0,21.8
8,9,Missouri,10.7,20.2
9,10,Texas,10.5,17.3


There appears to be a correlation between low mental health (counted in the query as 20 to 30 days of poor mental health in the past 30 days) and the presence of multiple chronic conditions. States with higher proportions of low mental health also tend to have higher percentages of individuals reporting two or more chronic conditions.

Regions with higher mental health issue rates, such as West Virginia, Tennessee, Louisiana and Arkansas, also tend to appear frequently in data related to high diabetes prevalence or other health challenges. This might suggest an overlap between mental health conditions and chronic diseases, although further analysis would be needed to establish a direct correlation.

California, New York, and Colorado report relatively low proportions of poor mental health but still have notable chronic condition rates. This suggests that while these states have comparatively fewer mental health struggles, a significant portion of the population is still dealing with multiple chronic conditions. Lifestyle, preventive healthcare efforts, or social determinants of health might contribute to these findings.
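As a rough check on the pattern described above, the correlation can be computed directly from the rows displayed in the output. This is a sketch under a strong assumption: it uses only the ten top-ranked states shown in the table, so the coefficient is illustrative and does not describe all states. The `pearson_r` helper is written here for illustration and is not part of the original notebook.

```python
from math import sqrt

# Values copied from the ten displayed rows (the top of the ranking only).
low_mental_health = [13.3, 12.7, 12.3, 11.6, 11.4, 11.2, 11.2, 11.0, 10.7, 10.5]
two_or_more_chronic = [30.8, 25.8, 23.1, 25.3, 20.2, 23.4, 25.9, 21.8, 20.2, 17.3]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(low_mental_health, two_or_more_chronic)
print(f"Pearson r over the displayed rows: {r:.2f}")
```

Running the same function over the full query result (all states and territories) would give a more meaningful estimate than this truncated sample.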

**Which U.S. states report the highest percentage of individuals who feel that their childhood safety and general needs were not fulfilled?**

In [63]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER(ORDER BY CASE WHEN COUNTIF(CAST(ACEADSAF AS FLOAT64) IN (1.0, 2.0, 3.0, 4.0, 5.0)) > 0 THEN ROUND((COUNTIF(CAST(ACEADSAF AS FLOAT64) IN (1.0, 2.0)) / COUNT(ACEADSAF)) * 100, 2) ELSE NULL END DESC) AS childhood_factor_rank,
  STATENAME AS state,
  CASE
    WHEN COUNTIF(CAST(ACEADNED AS FLOAT64) IN (1.0, 2.0, 3.0, 4.0, 5.0)) > 0 THEN
      ROUND((COUNTIF(CAST(ACEADNED AS FLOAT64) IN (1.0, 2.0)) / COUNT(ACEADNED)) * 100, 2)
    ELSE NULL
  END AS childhood_needs,
  CASE
    WHEN COUNTIF(CAST(ACEADSAF AS FLOAT64) IN (1.0, 2.0, 3.0, 4.0, 5.0)) > 0 THEN
      ROUND((COUNTIF(CAST(ACEADSAF AS FLOAT64) IN (1.0, 2.0)) / COUNT(ACEADSAF)) * 100, 2)
    ELSE NULL
  END AS childhood_safety
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
GROUP BY state
HAVING childhood_needs IS NOT NULL AND childhood_safety IS NOT NULL
ORDER BY childhood_factor_rank;

Unnamed: 0,childhood_factor_rank,state,childhood_needs,childhood_safety
0,1,Oregon,4.19,8.77
1,2,Nevada,4.19,8.37
2,3,Florida,2.99,6.42
3,4,Arkansas,3.01,6.39
4,5,Iowa,2.83,5.98
5,6,South Dakota,2.52,5.13
6,7,Virginia,2.19,4.7
7,8,North Dakota,1.87,4.19


Nevada (4.19%) and Oregon (4.19%) have the highest percentages of individuals whose responses (codes 1 and 2 of `ACEADNED` in the query) indicate that an adult in their household rarely or never tried to ensure their basic needs were met during childhood. Oregon (8.77%) and Nevada (8.37%) also have the highest percentages of individuals reporting that they felt unsafe or unprotected as children.

Western states show higher percentages for both questions, possibly reflecting socio-economic or cultural factors impacting family dynamics and childhood experiences.

Correlation Between Adverse Childhood Experiences (ACEs) and Later Outcomes: States with higher percentages of childhood adversity (e.g., Nevada, Oregon) may face greater challenges related to mental health, education, and socio-economic outcomes in adulthood. This would be consistent with research on the long-term effects of adverse childhood experiences.

### **v. Smoking and Drinking Habits**

We now extend our analysis to investigate the role of lifestyle factors, specifically smoking and drinking habits and their impact on the diabetic population. By examining these habits, we aim to understand how they may exacerbate challenges faced by the diabetic population and suggest measures that alleviate these challenges.

**Is smoking or drinking significantly associated with the incidence of specific chronic diseases (Diabetes)?**

In [64]:
%%bigquery --project=fall24-ba775-a08
WITH smoke AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(SMOKE100) > 0 THEN ROUND((COUNTIF(SMOKE100 = 1.0) / COUNT(SMOKE100)) * 100, 1)
      ELSE NULL
    END AS smoke_proportion
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  GROUP BY state, state_id
),
smoke_diabetic AS (
  SELECT STATENAME AS state, _STATE AS state_id,
  CASE
    WHEN COUNT(SMOKE100) > 0 THEN ROUND((COUNTIF(SMOKE100 = 1.0) / COUNT(SMOKE100)) * 100, 1)
      ELSE NULL
    END AS smoke_proportion_diabetic
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  WHERE DIABETE4 = 1.0
  GROUP BY state, state_id
)
SELECT RANK() OVER(ORDER BY smoke_proportion_diabetic DESC) AS diabetic_smoking_rank,
  u.state, smoke_proportion, smoke_proportion_diabetic
FROM smoke u
INNER JOIN smoke_diabetic ud
ON u.state_id = ud.state_id
ORDER BY diabetic_smoking_rank;

Unnamed: 0,diabetic_smoking_rank,state,smoke_proportion,smoke_proportion_diabetic
0,1,Indiana,43.5,53.8
1,2,Vermont,42.6,53.7
2,3,Maine,44.5,52.8
3,4,Rhode Island,43.5,51.9
4,4,North Dakota,40.0,51.9
5,6,New Hampshire,43.6,51.5
6,7,Tennessee,46.0,51.1
7,8,Florida,45.4,51.0
8,9,North Carolina,39.4,50.9
9,10,Ohio,44.0,50.8


We notice that across nearly all states, the smoking proportion for diabetics is higher than for the general population. This suggests that people with diabetes are more likely to smoke than the general population, which may exacerbate health complications.

States like Indiana (43.5%), Vermont (42.6%), and Maine (44.5%) report high smoking proportions in the general population and even higher proportions among people with diabetes (53.8%, 53.7% and 52.8% respectively). The pattern suggests a notable overlap between smoking and diabetes.

California has one of the lowest smoking rates in both populations (33.6% for the general population and 41.6% for diabetics). This suggests that smoking-related health issues in California may be relatively lower, but there is still room for improvement in reducing smoking among diabetics.

States with high smoking rates among diabetics may benefit from targeted health policies aimed at reducing smoking specifically in diabetic populations. These could include subsidized nicotine replacement therapies, expanded access to smoking cessation programs, and increased education on the added risks smoking poses to people with diabetes.
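To make the comparison between the two columns concrete, the gap between diabetic and overall smoking proportions can be computed from the displayed rows. This is a small sketch that covers only the ten states shown in the output, with values copied by hand. The variable names are illustrative.

```python
# (state, smoke_proportion, smoke_proportion_diabetic) from the displayed rows.
rows = [
    ("Indiana", 43.5, 53.8),
    ("Vermont", 42.6, 53.7),
    ("Maine", 44.5, 52.8),
    ("Rhode Island", 43.5, 51.9),
    ("North Dakota", 40.0, 51.9),
    ("New Hampshire", 43.6, 51.5),
    ("Tennessee", 46.0, 51.1),
    ("Florida", 45.4, 51.0),
    ("North Carolina", 39.4, 50.9),
    ("Ohio", 44.0, 50.8),
]

# Gap in percentage points between diabetics and the general population.
gaps = {state: round(diabetic - overall, 1) for state, overall, diabetic in rows}
widest = max(gaps, key=gaps.get)

for state, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state:15s} +{gap:.1f} pp")
print("Widest gap:", widest)
```

Ranking by the gap rather than by the diabetic proportion alone highlights states where smoking is disproportionately common among people with diabetes, which may be a more useful target for cessation programs.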

**What is the average age at which people start smoking in each state?**

In [65]:
%%bigquery --project=fall24-ba775-a08
WITH state_avg_lcsfirst AS (
  SELECT
    STATENAME AS state,
    _STATE AS state_id,
    ROUND(AVG(LCSFIRST), 2) AS avg_lcsfirst
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  WHERE LCSFIRST BETWEEN 0 AND 100
  GROUP BY state, state_id
)
SELECT RANK() OVER(ORDER BY avg_lcsfirst DESC) AS smoke_start_age_rank,
  state, avg_lcsfirst
FROM state_avg_lcsfirst
ORDER BY smoke_start_age_rank;

Unnamed: 0,smoke_start_age_rank,state,avg_lcsfirst
0,1,Virgin Islands,19.03
1,2,Guam,18.8
2,3,Puerto Rico,18.65
3,4,District of Columbia,18.56
4,5,California,18.38
5,6,Georgia,18.35
6,7,Texas,18.31
7,8,Mississippi,18.17
8,8,Louisiana,18.17
9,10,Alabama,18.1


Despite slight variations, the average starting age for smoking remains remarkably consistent across states, with state averages falling in late adolescence (17-19 years). This uniformity suggests that societal and educational interventions to prevent smoking initiation may need to target this specific age group more aggressively.

Implementing integrated health policies that focus on smoking cessation as a preventive measure against diabetes might help in target states like Indiana and Maine. Focusing educational campaigns on adolescents aged 15-17 is also essential, as this demographic is at the highest risk of initiating smoking. This could be done by enhancing school and community-based programs to address smoking risks during critical formative years.

**How does the prevalence of current cigarette smoking among adults vary across U.S. states, and is there a correlation with the occurrence of chest CT or CAT scans performed for lung cancer screening?**

In [66]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER (ORDER BY AVG(CAST(DataValue AS FLOAT64)) DESC) AS smoking_prevalence_rank,
    LocationDesc AS state,
    AVG(CAST(DataValue AS FLOAT64)) AS overall_smoking_prevalence,
    CASE
        WHEN COUNT(LCSSCNCR) > 0 THEN ROUND((COUNTIF(LCSSCNCR = 1) / COUNT(LCSSCNCR)) * 100, 1)
        ELSE NULL
    END AS check_lung_cancer_proportion
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
FULL OUTER JOIN `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
ON LocationDesc = STATENAME
WHERE Topic = 'Tobacco'
AND YearStart = 2022
AND DataSource = 'BRFSS'
AND Question = 'Current cigarette smoking among adults'
AND StratificationCategory1 = 'Overall'
AND DataValueType = 'Crude Prevalence'
GROUP BY state
ORDER BY smoking_prevalence_rank;

Unnamed: 0,smoking_prevalence_rank,state,overall_smoking_prevalence,check_lung_cancer_proportion
0,1,West Virginia,21.0,19.7
1,2,Guam,19.8,27.5
2,3,Arkansas,18.7,20.3
3,4,Tennessee,18.5,18.6
4,5,Mississippi,17.4,17.7
5,5,Kentucky,17.4,23.9
6,7,Ohio,17.1,20.1
7,8,Missouri,16.8,19.1
8,9,Louisiana,16.7,17.6
9,10,Indiana,16.2,21.3


West Virginia (21.0%), Guam (19.8%), Arkansas (18.7%), Tennessee (18.5%), Kentucky (17.4%) and Ohio (17.1%) are among the states with high smoking prevalence rates. Guam (27.5%), Rhode Island (25.2%), Kentucky (23.9%), New York (22.9%), Georgia (22.5%) and Virgin Islands (22.3%) have higher lung cancer screening rates.

Guam stands out for both a high prevalence of smoking among adults and a high lung cancer screening rate, which suggests that screening is being targeted where smoking is common. Utah and Puerto Rico have a low prevalence of smoking among adults but much higher lung cancer screening rates, indicating targeted efforts towards preventive health action.

Kentucky and Mississippi have the same smoking prevalence rate (17.4%), but lung cancer screening is far more prevalent in Kentucky (23.9%) than in Mississippi (17.7%).

Oklahoma has a low lung cancer screening rate (16.1%) compared to other states with a comparable prevalence of smoking among adults (15.6%).
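Comparisons between states with similar smoking prevalence can be made systematic by grouping on the smoking rate and looking at the spread in screening rates. The sketch below uses only the ten rows displayed in the output and groups on exact ties, which is a simplifying assumption. A fuller analysis would likely bin states into prevalence bands instead.

```python
from collections import defaultdict

# (state, smoking prevalence %, lung cancer screening %) from the displayed rows.
rows = [
    ("West Virginia", 21.0, 19.7),
    ("Guam", 19.8, 27.5),
    ("Arkansas", 18.7, 20.3),
    ("Tennessee", 18.5, 18.6),
    ("Mississippi", 17.4, 17.7),
    ("Kentucky", 17.4, 23.9),
    ("Ohio", 17.1, 20.1),
    ("Missouri", 16.8, 19.1),
    ("Louisiana", 16.7, 17.6),
    ("Indiana", 16.2, 21.3),
]

# Group states that share the same smoking prevalence.
by_smoking = defaultdict(list)
for state, smoking, screening in rows:
    by_smoking[smoking].append((state, screening))

# For each tied group, record the spread in screening rates (percentage points).
screening_spread = {}
for smoking, group in by_smoking.items():
    if len(group) > 1:
        rates = [screening for _, screening in group]
        screening_spread[smoking] = round(max(rates) - min(rates), 1)
        print(smoking, sorted(group, key=lambda g: g[1]), screening_spread[smoking])
```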


**What trends in binge drinking or excessive alcohol consumption emerge across different ages?**

In [67]:
%%bigquery --project=fall24-ba775-a08
SELECT
    Stratification1 AS Age_Group,
    AVG(CASE WHEN YearStart = 2019 THEN CAST(DataValue AS FLOAT64) END) AS BD_Prevalence_2019,
    AVG(CASE WHEN YearStart = 2020 THEN CAST(DataValue AS FLOAT64) END) AS BD_Prevalence_2020,
    AVG(CASE WHEN YearStart = 2021 THEN CAST(DataValue AS FLOAT64) END) AS BD_Prevalence_2021,
    AVG(CASE WHEN YearStart = 2022 THEN CAST(DataValue AS FLOAT64) END) AS BD_Prevalence_2022,
    AVG(CAST(DataValue AS FLOAT64)) AS binge_drinking_Prevalence
FROM `fall24-ba775-a08.group_project.chronic_disease_indicators`
WHERE Topic = 'Alcohol'
AND DataSource = 'BRFSS'
AND Question = 'Binge drinking prevalence among adults'
AND StratificationCategory1 = 'Age'
AND DataValueType = 'Crude Prevalence'
GROUP BY Age_Group
ORDER BY Age_Group;

Unnamed: 0,Age_Group,BD_Prevalence_2019,BD_Prevalence_2020,BD_Prevalence_2021,BD_Prevalence_2022,binge_drinking_Prevalence
0,Age 18-44,24.586792,22.742593,22.435185,23.94,23.423148
1,Age 45-64,14.118868,13.77963,13.761111,14.870909,14.136111
2,Age >=65,5.043396,4.916981,4.845283,5.422222,5.058685


The age group of 18 to 44 consistently reports the highest prevalence of binge drinking across all years (e.g., 23.94% in 2022), making them the most at-risk group for excessive alcohol consumption. Age Group 45-64 shows a moderate prevalence of binge drinking, with rates steadily around 14% over the years. While less frequent than younger adults, the rate still warrants attention given its potential health implications.

Young adults (18-44) are the primary group for targeted interventions due to their high binge drinking rates. Middle-aged adults (45-64) could benefit from education on alcohol-related health risks to prevent long-term consequences.
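The year columns in the table above can be differenced to see how each age group moved from one year to the next. This is a minimal sketch using the displayed averages. The `yearly_changes` helper is illustrative and not part of the original notebook.

```python
# Average binge drinking prevalence (%) by age group and year, from the table above.
prevalence = {
    "Age 18-44": {2019: 24.586792, 2020: 22.742593, 2021: 22.435185, 2022: 23.94},
    "Age 45-64": {2019: 14.118868, 2020: 13.77963, 2021: 13.761111, 2022: 14.870909},
    "Age >=65": {2019: 5.043396, 2020: 4.916981, 2021: 4.845283, 2022: 5.422222},
}


def yearly_changes(series):
    """Return {year: change from the previous year}, in percentage points."""
    years = sorted(series)
    return {y: round(series[y] - series[prev], 2) for prev, y in zip(years, years[1:])}


changes = {group: yearly_changes(series) for group, series in prevalence.items()}
for group, deltas in changes.items():
    print(group, deltas)
```

Because these are averages of state-level crude prevalence values, the year-over-year differences are descriptive only and are not population-weighted national estimates.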

### **vi. Society and Demographics**

We now explore the broader societal, financial and demographic factors around the prevalence of diabetes, with a specific focus on income, food security and ability to pay bills to understand how these factors intersect with chronic conditions.

**How does the prevalence of diabetes differ between low-income and high-income individuals across various U.S. states?**

In [68]:
%%bigquery --project=fall24-ba775-a08
SELECT STATENAME AS state,
  CASE
    WHEN INCOME3 IN (1, 2, 3, 4, 5) THEN 'Low Income'
    WHEN INCOME3 > 5 THEN 'High Income'
    ELSE 'Unknown'
  END AS income_group,
  ROUND((SUM(CASE WHEN DIABETE4 = 1 THEN 1 ELSE 0 END) / COUNT(*))*100, 2) AS diabetes_prevalence
  FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
  WHERE DIABETE4 IN (1, 3, 4) AND INCOME3 IS NOT NULL
GROUP BY state, income_group
ORDER BY state, income_group;

Unnamed: 0,state,income_group,diabetes_prevalence
0,Alabama,High Income,17.51
1,Alabama,Low Income,26.74
2,Alaska,High Income,9.62
3,Alaska,Low Income,14.2
4,Arizona,High Income,13.7
5,Arizona,Low Income,20.86
6,Arkansas,High Income,17.13
7,Arkansas,Low Income,25.76
8,California,High Income,9.96
9,California,Low Income,16.87


Southern states such as Alabama (26.7%), West Virginia (25.9%), Arkansas (25.8%), Tennessee (25.2%) and Louisiana (25.0%) generally have higher diabetes prevalence rates in the lower income group (characterized by income levels lower than 35,000 USD)<sup>5</sup>.

These states face broader health challenges, as they also have a higher prevalence of other chronic diseases. Socio-economic factors, limited access to healthcare, lifestyle habits, or diet-related issues may be contributing to higher diabetes rates in these low-income groups. States with higher diabetes prevalence could benefit from targeted public health interventions, including nutrition education, better healthcare access, and diabetes screening programs.

Focusing efforts on regions like Alabama, West Virginia, and South Carolina could help lower the diabetes rates and improve overall health outcomes.

Guam has the highest diabetes prevalence rate among the high-income group, at 18.61%. This indicates that even in higher-income groups, some regions face significant diabetes-related health challenges.

West Virginia and Alabama follow with prevalence rates of 17.91% and 17.51% respectively. These states exhibit higher rates compared to others in the high-income group, highlighting persistent public health issues despite higher income levels.

Puerto Rico and Arkansas also have prevalence rates slightly above 17%, showing that diabetes remains a concern in these areas despite higher income.

Public health interventions targeting Guam, West Virginia, and Alabama are needed to bring down the diabetes prevalence among high-income populations in these regions. These states/territories may benefit from increased awareness campaigns, improved healthcare facilities, or diabetes management programs.
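The income disparity within each state can be summarized as an absolute gap and a ratio between the two groups. The sketch below uses only the five states visible in the displayed output (Alabama through California), so it illustrates the calculation rather than the full ranking.

```python
# (state, high-income prevalence %, low-income prevalence %) from the rows shown above.
rows = [
    ("Alabama", 17.51, 26.74),
    ("Alaska", 9.62, 14.2),
    ("Arizona", 13.7, 20.86),
    ("Arkansas", 17.13, 25.76),
    ("California", 9.96, 16.87),
]

# Absolute gap (percentage points) and relative ratio of low- to high-income prevalence.
summary = {
    state: {"gap_pp": round(low - high, 2), "ratio": round(low / high, 2)}
    for state, high, low in rows
}

for state, stats in summary.items():
    print(f"{state:12s} gap={stats['gap_pp']:.2f} pp  ratio={stats['ratio']:.2f}")

largest_ratio = max(summary, key=lambda s: summary[s]["ratio"])
print("Largest relative disparity among these rows:", largest_ratio)
```

Reporting both measures is a deliberate choice: a state with low overall prevalence can have a small absolute gap but a large relative disparity between income groups.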


**How does the prevalence of diabetes differ between individuals of different sexes across various U.S. states?**

In [69]:
%%bigquery --project=fall24-ba775-a08
SELECT STATENAME AS state,
  CASE
    WHEN _SEX = 1 THEN 'Male'
    WHEN _SEX = 2 THEN 'Female'
    WHEN _SEX = 3 THEN 'Non-Binary'
    ELSE 'Other/Unknown'
  END AS gender,
  ROUND((SUM(CASE WHEN DIABETE4 = 1 THEN 1 ELSE 0 END) / COUNT(*)) * 100.0, 2) AS diabetes_prevalence
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
WHERE DIABETE4 IN (1, 3, 4)
GROUP BY state, gender
ORDER BY state, gender, diabetes_prevalence desc;

Unnamed: 0,state,gender,diabetes_prevalence
0,Alabama,Female,21.07
1,Alabama,Male,18.58
2,Alaska,Female,10.2
3,Alaska,Male,10.67
4,Arizona,Female,14.04
5,Arizona,Male,17.01
6,Arkansas,Female,19.49
7,Arkansas,Male,19.59
8,California,Female,10.89
9,California,Male,11.7


The male identifying population in West Virginia has the highest diabetes prevalence rate at 21.12%, indicating a significant health concern for males in this state. The female identifying population in Puerto Rico also has a high diabetes prevalence rate at 20.90%, followed by that in Guam at 20.38%. These prevalence rates highlight that females in certain states are also facing high risks.

In Alabama, the female-identifying population has a diabetes prevalence of 21.07%, compared to 18.58% for males. Puerto Rico follows a similar trend, with prevalence among the female population at 20.90% and among males at 18.51%. This suggests that females may be at greater risk of diabetes in some states, possibly due to socio-economic, lifestyle, or health care access factors.

The male-identifying population in Louisiana has a much lower diabetes prevalence (16%) than the female population (20.8%). The male population of Virgin Islands also has a relatively low diabetes prevalence of 16.87%, indicating better diabetes health outcomes in this region.

States like West Virginia, Puerto Rico, and Guam (especially for females) may benefit from targeted interventions, such as increased screening, nutrition education, and health care access.

**What are the trends in difficulty paying bills and food insecurity among individuals with diabetes across various U.S. states?**

In [70]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER(ORDER BY ROUND(SUM(CASE WHEN DIABETE4 = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) DESC) AS diabetes_prevalence_rank,
  STATENAME AS state,
  ROUND(SUM(CASE WHEN DIABETE4 = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS diabetes_prevalence,
  ROUND(AVG(CASE WHEN DIABETE4 = 1 AND SDHFOOD1 IN (1, 2) THEN 1
           WHEN DIABETE4 = 1 AND SDHFOOD1 IN (3, 4, 5) THEN 0
           ELSE NULL END) * 100, 2) AS food_insecurity_12,
  ROUND(AVG(CASE WHEN DIABETE4 = 1 AND SDHBILLS = 1 THEN 1
           WHEN DIABETE4 = 1 AND SDHBILLS = 2 THEN 0
           ELSE NULL END) * 100, 2) AS no_bill_payment_12,
  ROUND(AVG(CASE WHEN DIABETE4 = 1 AND SDHUTILS = 1 THEN 1
           WHEN DIABETE4 = 1 AND SDHUTILS = 2 THEN 0
           ELSE NULL END) * 100, 2) AS no_utility_12,
  ROUND(AVG(CASE WHEN DIABETE4 = 1 AND SDHTRNSP = 1 THEN 1
           WHEN DIABETE4 = 1 AND SDHTRNSP = 2 THEN 0
           ELSE NULL END) * 100, 2) AS no_transportation_12
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
WHERE SDHFOOD1 IN (1, 2, 3, 4, 5)
  AND SDHBILLS IN (1, 2)
  AND SDHUTILS IN (1, 2)
  AND SDHTRNSP IN (1, 2)
GROUP BY state
ORDER BY diabetes_prevalence_rank
LIMIT 10;

Unnamed: 0,diabetes_prevalence_rank,state,diabetes_prevalence,food_insecurity_12,no_bill_payment_12,no_utility_12,no_transportation_12
0,1,West Virginia,20.53,6.15,12.64,12.42,12.42
1,2,Alabama,20.36,5.83,11.79,8.49,8.62
2,3,Puerto Rico,20.03,8.36,14.55,10.7,13.43
3,4,Tennessee,17.97,7.44,11.37,8.53,9.61
4,5,Kentucky,17.87,8.59,12.24,10.33,10.97
5,6,Georgia,17.4,7.11,11.98,7.19,9.09
6,7,Virgin Islands,17.36,9.17,19.21,14.41,15.28
7,8,Texas,17.11,7.85,14.83,10.98,10.82
8,9,Mississippi,16.84,11.11,16.75,10.95,10.63
9,10,South Carolina,16.74,4.7,10.85,7.28,7.21


**Food Insecurity:** Among the ten states and territories with the highest diabetes prevalence, Mississippi has the highest food insecurity rate in the diabetic population at 11.11%, followed by Virgin Islands (9.17%), Kentucky (8.59%) and Puerto Rico (8.36%). The elevated prevalence of diabetes in these areas, coupled with high food insecurity, suggests a potential link between diet quality and diabetes.

**Bill Payments:** The Virgin Islands shows the highest rate of difficulty paying bills among the diabetic population at 19.21%, followed by Mississippi (16.75%), Texas (14.83%), and Puerto Rico (14.55%). The high rate of financial insecurity in these areas suggests that diabetic individuals there are more likely to struggle with financial obligations, which could negatively impact their ability to access consistent healthcare.

**Utility Bill Payments:** The Virgin Islands also has the highest utility payment insecurity rate at 14.41%, indicating that many individuals with diabetes face difficulties paying for basic utilities. West Virginia (12.42%), Texas (10.98%), Mississippi (10.95%), and Puerto Rico (10.70%) also have relatively high rates, showing a significant issue with utility payment insecurity that could impact individuals' overall well-being and ability to manage health conditions effectively.

The Virgin Islands stands out with high rates across multiple metrics: food insecurity (9.17%), inability to pay bills (19.21%), utility payment insecurity (14.41%), and transportation challenges (15.28%). This suggests a significant socio-economic burden that correlates with diabetes prevalence and may impede effective diabetes management.<sup>7</sup>

Puerto Rico also shows consistently high levels across multiple socio-economic indicators: food insecurity (8.36%), inability to pay bills (14.55%), utility payment insecurity (10.70%), and transportation issues (13.43%).

These challenges could exacerbate diabetes prevalence and health outcomes, indicating the need for targeted socio-economic support in addition to healthcare interventions.
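The percentages in the table above rely on a pattern that is easy to misread: `AVG(CASE ... ELSE NULL END)` averages only over diabetic respondents who gave a valid answer, because SQL's `AVG` skips `NULL`. The following is a minimal Python sketch of that logic, not the notebook's actual pipeline; the toy rows and the helper names `pct_among_diabetics` and `diabetes_prevalence` are hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical toy rows for a single state (assumption: real BRFSS rows
# have many more columns and have already passed the query's WHERE filter).
rows = [
    {"DIABETE4": 1, "SDHFOOD1": 1},  # diabetic, counted as food insecure
    {"DIABETE4": 1, "SDHFOOD1": 5},  # diabetic, not food insecure
    {"DIABETE4": 1, "SDHFOOD1": 3},  # diabetic, not food insecure
    {"DIABETE4": 1, "SDHFOOD1": 2},  # diabetic, counted as food insecure
    {"DIABETE4": 3, "SDHFOOD1": 1},  # not coded as diabetic: ignored below
    {"DIABETE4": 3, "SDHFOOD1": 5},  # not coded as diabetic: ignored below
]


def pct_among_diabetics(rows, column, yes_codes, no_codes):
    """Mirror ROUND(AVG(CASE ... ELSE NULL END) * 100, 2) for one column."""
    flags = []
    for row in rows:
        if row["DIABETE4"] != 1:
            continue  # the CASE yields NULL, which AVG ignores
        if row[column] in yes_codes:
            flags.append(1)
        elif row[column] in no_codes:
            flags.append(0)
        # any other code also maps to NULL and is skipped
    if not flags:
        return None  # AVG over no values is NULL
    return round(mean(flags) * 100, 2)


def diabetes_prevalence(rows):
    """Mirror ROUND(SUM(CASE WHEN DIABETE4 = 1 ...) * 100.0 / COUNT(*), 2)."""
    diabetic = sum(1 for row in rows if row["DIABETE4"] == 1)
    return round(diabetic * 100.0 / len(rows), 2)


food_insecurity = pct_among_diabetics(rows, "SDHFOOD1", {1, 2}, {3, 4, 5})
prevalence = diabetes_prevalence(rows)
print(food_insecurity, prevalence)  # 50.0 66.67
```

Note that the two metrics use different denominators: prevalence divides by all rows in the group, while the socio-economic rates divide only by diabetic respondents. Python's `round` may also differ from BigQuery's `ROUND` at exact halfway values, so small last-digit differences are possible.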


**Which U.S. states have the highest percentage of individuals reporting worse healthcare experiences compared to people of other races?**

In [71]:
%%bigquery --project=fall24-ba775-a08
SELECT RANK() OVER(ORDER BY ROUND((SUM(CASE WHEN SAFE_CAST(RRHCARE4 AS FLOAT64) IN (1.0, 4.0) THEN 1 ELSE 0 END) / IF(COUNT(RRHCARE4) > 0, COUNT(RRHCARE4), COUNT(*))) * 100, 2) DESC) AS race_care_discrimination_rank,
  STATENAME AS state,
  ROUND((SUM(CASE WHEN SAFE_CAST(RRHCARE4 AS FLOAT64) IN (1.0, 4.0) THEN 1 ELSE 0 END) / IF(COUNT(RRHCARE4) > 0, COUNT(RRHCARE4), COUNT(*))) * 100, 2) AS race_reaction_pct
FROM `fall24-ba775-a08.group_project.behavioral_risk_factor_surveillance_system_2022`
GROUP BY state
ORDER BY race_care_discrimination_rank
LIMIT 10;

| race_care_discrimination_rank | state | race_reaction_pct |
|---|---|---|
| 1 | District of Columbia | 5.39 |
| 2 | Louisiana | 5.0 |
| 3 | Illinois | 4.69 |
| 3 | Georgia | 4.69 |
| 5 | South Carolina | 3.85 |
| 6 | Maryland | 3.74 |
| 7 | California | 3.73 |
| 8 | North Carolina | 3.69 |
| 9 | New Mexico | 3.6 |
| 10 | Tennessee | 3.49 |


The District of Columbia stands out with the highest percentage of respondents feeling that their healthcare experiences were worse compared to people of other races, at 5.39%. Louisiana follows closely with 5.00%. Georgia and Illinois both reported 4.69%, indicating similar perceptions of healthcare inequality in these states. New Mexico (3.60%) and Tennessee (3.49%) have the lowest percentages among the top ten.
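The tie between Illinois and Georgia in the table above follows from how `RANK()` numbers tied rows: both receive rank 3, and the next row jumps to rank 5. As a minimal sketch of that behavior, the hypothetical helper `sql_rank_desc` below reproduces `RANK() OVER (ORDER BY value DESC)` on the percentages from the query output.

```python
def sql_rank_desc(values):
    """Return the RANK() OVER (ORDER BY value DESC) of each value, in input order."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Tied values keep the position of their first occurrence, so the
        # positions they "use up" are skipped for the next distinct value.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]


# race_reaction_pct values copied from the query output above
pcts = [5.39, 5.0, 4.69, 4.69, 3.85, 3.74, 3.73, 3.69, 3.6, 3.49]
ranks = sql_rank_desc(pcts)
print(ranks)  # [1, 2, 3, 3, 5, 6, 7, 8, 9, 10]
```

If consecutive rank numbers without gaps were preferred, `DENSE_RANK()` would be the usual alternative in SQL.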

Although their rates are lower, these states still show notable shares of respondents perceiving racial inequities in healthcare. The District of Columbia and Louisiana report the highest rates, indicating that these areas may have systemic issues regarding healthcare equity.

The relatively high percentages in states such as Georgia, South Carolina, and North Carolina indicate a trend across parts of the Southern region where respondents perceive their healthcare experiences to be worse than those of people of other races.

California and Maryland, which are generally known for having diverse populations and relatively robust healthcare systems, still report moderate levels of perceived inequality, suggesting that disparities may exist regardless of the overall quality of healthcare infrastructure.
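When reading these percentages, it helps to be explicit about the denominator the query uses: the count of non-NULL `RRHCARE4` values, falling back to all rows only when a state has no non-NULL values at all. The sketch below mirrors that arithmetic in Python under stated assumptions: `None` stands in for SQL `NULL`, the sample responses are hypothetical, and `safe_float` and `worse_experience_pct` are illustrative helper names.

```python
def safe_float(value):
    """Rough stand-in for SAFE_CAST(x AS FLOAT64): None instead of an error."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def worse_experience_pct(responses):
    """Share of responses coded 1 or 4, using the query's denominator rule."""
    if not responses:
        return None  # an empty group would not appear in the SQL output
    worse = sum(1 for r in responses if safe_float(r) in (1.0, 4.0))
    answered = sum(1 for r in responses if r is not None)  # COUNT(RRHCARE4)
    denominator = answered if answered > 0 else len(responses)  # else COUNT(*)
    return round(worse / denominator * 100, 2)


# Hypothetical responses for one state; None marks respondents not asked.
sample = ["1", "2", "4", "2", None, "2", "7", None]
print(worse_experience_pct(sample))        # 33.33
print(worse_experience_pct([None, None]))  # 0.0
```

Because every non-NULL code enters the denominator, including any non-substantive codes such as the hypothetical "7" above, the reported share is relative to everyone with a recorded value rather than only those who gave a substantive answer.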

# **7. Tableau Dashboard**

![TableauDashboard](https://drive.google.com/uc?export=view&id=1tjMlGSJUjFjfl00PMRs-EqEy6H1mZe7B)

The above dashboard, which can be accessed through [this link](https://public.tableau.com/views/AnalyzingDiabetesintheUS/Dashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link), summarizes key insights into diabetes prevalence, healthcare barriers and socioeconomic challenges affecting diabetic populations across the US. The filter at the top allows insight into the metrics for the different sexes, as well as the general population. The map highlights regions with high diabetes prevalence, while the bar chart compares diabetes and prediabetes rates in the 15 states with the highest prediabetes prevalence. Early diagnosis of prediabetes can prepare individuals to manage or delay the onset of diabetes. Most diabetes diagnoses occur in older age groups (45 and above), indicating opportunities for improved diagnosis in teens and young adults.

Type 2 diabetes is the most prevalent form; however, about 7% of the diabetic population is unsure of their diabetes type, suggesting that diabetes support programs are inadequate. This issue is further emphasized in the map that identifies states with gaps in diabetes self-management support programs, particularly in the Midwest and South, with up to 15% of the diabetic population reporting a lack of recent education on self-management of diabetes. Cost and transport barriers to healthcare services also disproportionately affect the younger diabetic population and diminish slightly with age. Lastly, the table reveals socioeconomic challenges, such as food insecurity and difficulty paying bills, affecting diabetic populations in the states with the highest rates of diabetes. West Virginia, Puerto Rico and Kentucky show higher levels of insecurity, reflecting broader economic challenges in these regions.

# **8. Conclusion**

The analysis highlights significant regional disparities in the prevalence of conditions such as diabetes and depression. Key findings reflect the expected relationships between socioeconomic factors, lifestyle choices, healthcare access, and chronic disease prevalence, offering valuable insights for public health interventions.

1. **Regional Trends and Risk Factors:**
States in the Southeastern U.S., such as Tennessee, Arkansas, Louisiana, and Oklahoma, consistently exhibit higher rates of chronic diseases like diabetes and depression. These states also tend to have lower routine checkup rates and are more likely to have populations with incomes below 35,000 USD, suggesting socioeconomic status as a critical factor. The territories of Guam and Puerto Rico show high rates of smoking and alcohol consumption, both of which correlate with elevated diabetes rates.

2. **Behavioral and Lifestyle Contributions:**
Smoking prevalence remains high in states like Indiana, Vermont, and Maine, contributing to increased rates of diabetes and lung cancer. Despite high smoking rates, Guam's focus on lung cancer screening seems to highlight the potential of targeted healthcare interventions in mitigating risk. Similarly, states like West Virginia and Alabama, which report high rates of both chronic disease prevalence and limited healthcare access, underscore the need for enhanced preventive care and healthcare accessibility.

3. **Vulnerable Populations:**
Certain groups, such as females in Alabama, Guam, and Puerto Rico, may face a disproportionate risk of chronic diseases like diabetes due to intersecting socioeconomic, lifestyle, and healthcare barriers. The diabetic population also appears particularly vulnerable to comorbidities, with high asthma prevalence reported in states like Hawaii, Colorado, and Oregon.

4. **Implications for Public Health:**
The consistent patterns of high chronic disease prevalence in specific regions highlight the need for targeted, localized public health strategies. Policies aimed at improving access to healthcare, increasing routine screenings, and addressing socioeconomic disparities could significantly reduce the burden of chronic diseases. Additionally, behavioral interventions promoting healthier lifestyles, such as smoking cessation and alcohol moderation programs, should be prioritized in high-risk areas.

This analysis reinforces the critical role of understanding regional and demographic variations in chronic disease prevalence to inform effective public health strategies. By addressing the underlying behavioral, societal, and systemic factors contributing to these disparities, policy makers can implement targeted interventions to improve health outcomes and reduce the economic and social costs associated with chronic diseases.

### **i. Regional Observations**

1. **Northeastern States** (e.g., Vermont, Maine, Rhode Island, Connecticut): Vermont and Maine report high smoking rates, contributing to elevated diabetes prevalence. While healthcare systems in this region are generally stronger, pockets of chronic disease prevalence remain an area for improvement.

2. **Southeastern States** (e.g., Tennessee, West Virginia, Arkansas, Alabama, Louisiana): These states exhibit a high prevalence of chronic diseases such as diabetes and depression, accompanied by lower rates of routine healthcare checkups. There is also a strong correlation between lower income levels and increased rates of diabetes and other chronic illnesses.
  
3. **Southern and Central Regions** (e.g., Kentucky, Mississippi, Oklahoma): Chronic disease prevalence remains consistently high in these regions, coupled with lower engagement in healthcare services. These areas also tend to reflect long-standing health disparities influenced by economic challenges and lifestyle factors.  

4. **Western States** (e.g., Nevada, Utah, Oregon, Colorado): States like Nevada and Oregon display links between childhood neglect and higher rates of depression, while Utah reports high overall depression prevalence. Colorado stands out with a notable prevalence of asthma among diabetic populations, highlighting unique regional health challenges.  

5. **Territories** (e.g., Puerto Rico, Guam): High alcohol consumption is linked to increased diabetes prevalence in these territories. Guam exhibits high smoking rates but has implemented targeted lung cancer screening programs to address this issue. Women in Puerto Rico and Guam face a disproportionately higher risk of diabetes, pointing to the role socio-economic factors play in chronic disease prevalence.  

### **ii. Key Insights & Recommendations**

1. **Support for Low-Income Diabetic Populations**

  The diabetic population in the U.S. is disproportionately represented among low-income groups compared to the non-diabetic population. Despite this, diabetic individuals are more likely to be insured, potentially reflecting a reliance on insurance plans to manage the high costs of diabetes, including insulin.
  
  Recommendation: To address this, increased support is needed for low-income diabetic populations, particularly in Guam, Puerto Rico, Louisiana, and Illinois, where the financial burden may be especially acute. Policies should include subsidies, reduced-cost insulin programs, and tailored healthcare initiatives to alleviate financial stress and improve access to necessary treatments. Given the history of pharmaceutical companies lobbying against insulin price reductions, and the large profit margins involved, we would also suggest that the government enter the market as a manufacturer and distributor of insulin to bring down costs. By positioning itself as a non-profit participant in the industry, the government could push prices down across the board.

2. **Addressing Barriers to Medical Appointments for Young Adults**

  A notable proportion of diabetic individuals under the age of 35 report difficulties in affording medical appointments, with 17.45% of those aged 19-24 highlighting cost as a barrier. This age group often transitions to financial independence, which may leave limited resources for medical care.
  
  Recommendation: Expand insurance plans to cover preventative checkups and routine care for young adults, along with introducing sliding-scale payment models to ensure this population receives necessary medical support to prevent complications as they age.

3. **Improved Diabetes Education and Awareness**

  States such as Delaware, South Carolina, Mississippi, Georgia, and Indiana report that approximately 15% of the diabetic population has not attended any diabetes education class in the past one to two years. This gap in education indicates a need for better outreach and awareness programs in these areas.
  
  Recommendation: Implement state-sponsored diabetes education courses, community workshops, and awareness campaigns to provide critical information about diabetes management, prevention of complications, and healthy lifestyle choices.

4. **Integrating Mental Health Support into Diabetes Treatment**

  Across most age groups, depression prevalence is higher among diabetic individuals, underscoring the need for mental health support as part of diabetes care.
  
  Recommendation: Incorporate mental health screenings into routine diabetes checkups, expand access to counseling and support groups, and integrate psychological care into diabetes management programs to improve overall health outcomes.

5. **Childcare and Safety Programs to Address Adverse Childhood Experiences (ACEs)**

  States like Oregon, Nevada, Florida, Arkansas, Iowa, and South Dakota report a significant proportion of individuals indicating they lacked a safe and supportive adult presence during childhood. These adverse childhood experiences (ACEs) are linked to poorer mental health outcomes in adulthood.

  Recommendation: To mitigate these long-term effects, targeted childcare programs and family support initiatives are recommended in these regions. These programs should aim to provide safe, nurturing environments for children and include community-based interventions that address basic needs, parenting education, and mental health resources for families.

### **iii. Future Scope**

The future scope of this project can span several areas to deepen our understanding of diabetes and its associated factors, including the following:
1. Extend the analysis beyond the 2022 BRFSS data. By using past and future years (based on availability), long-term studies can track how diabetes and its comorbidities evolve over time.

2. Research policy changes and approaches to insulin pricing and access, including programs or subsidies aimed at reducing the financial burden of insulin for patients in underserved sections of society.

3. Expand the study of racial, ethnic and socioeconomic disparities by analyzing specific differences in access to healthcare, medication affordability and care quality across population groups.

4. Incorporate additional datasets, particularly those detailing state or federal-level healthcare programs. This will refine the findings and provide policymakers with actionable insights to develop more targeted interventions.

# **9. Challenges**

The BRFSS dataset included over 320 columns, presenting a significant challenge in identifying the most relevant variables for our analysis. Sifting through such a vast dataset required extensive data exploration and preprocessing, which consumed a considerable portion of our time. The lack of a proper data dictionary for the BRFSS dataset meant we had to prepare one manually for all 320+ columns from the Questionnaire itself. We created a table to store these definitions for ease of access, which helped us speed up the data analysis portion of the project. Our analysis was also limited by the absence of time-series data to explore trends over time. Although previous years' BRFSS data were available, inconsistencies due to annual updates in survey questions and variable definitions made it impractical to conduct a time-based analysis within the constraints of this project. Many columns in the dataset contained missing or incomplete values, further restricting our ability to conduct comprehensive analyses in several areas. For example, we aimed to examine disparities across racial and ethnic groups in detail but found the race data inconsistently populated, forcing us to abandon this aspect of our analysis. These gaps in the data, inherent to working with survey-based information, limited the directions in which we could extend our analysis.

The nature of the data in our analysis, derived from BRFSS, inherently limited the scope of visualization options. Primarily consisting of survey responses which tend to fall on a numeric scale, the healthcare data lacked a high degree of variety. Consequently, this restricted our ability to produce more dynamic visualizations and instead led us to focus on descriptive analytics to highlight key insights.

# **10. Generative AI Disclosure**

In completing this project, we have utilized Generative AI tools to assist with various aspects of our work. Below is a detailed account of how these tools were used:

**File Conversion:** We used ChatGPT to assist with converting our dataset from the .xpt file format to a usable format for analysis. The AI guided us through the conversion process, ensuring that the data was in a workable format.

**Brainstorming Ideas:** ChatGPT helped us brainstorm and refine ideas regarding the project. We faced early challenges in finding complementary datasets, and it helped summarize various datasets we were evaluating.

**Error Troubleshooting:** During data conversion and processing, we encountered a few errors. ChatGPT provided detailed explanations of the issues and offered potential solutions, enabling us to troubleshoot and debug effectively.

**Data Preprocessing Suggestions:** We also consulted ChatGPT for advice on handling missing data and other preprocessing challenges.

**Navigating Tableau:** Given the relative unfamiliarity with Tableau within the team, we consulted ChatGPT on how we might be able to make the desired changes to our visualizations.

**Grammar Check and Language Correction:** We also used ChatGPT to correct the grammar of our written content, wherever necessary.

Our team has reviewed, edited, and validated all AI-generated content to ensure its accuracy, relevance, and originality in accordance with academic integrity guidelines.






# **11. References**

<sup>1</sup> Centers for Disease Control and Prevention. (n.d.). Diabetes interventions. U.S. Department of Health and Human Services. Retrieved December 4, 2024, from https://www.cdc.gov/nccdphp/priorities/diabetes-interventions.html

<sup>2</sup> Centers for Disease Control and Prevention. (2022). Behavioral Risk Factor Surveillance System (BRFSS) 2022 Questionnaire (Publication No. 508). https://www.cdc.gov/brfss/questionnaires/pdf-ques/2022-BRFSS-Questionnaire-508.pdf

<sup>3</sup> Plotly (2024). plotly.express.choropleth [Documentation]. Plotly. https://plotly.com/python/choropleth-maps/

<sup>4</sup> Plotly (2024). plotly.express.fig.update_layout [Documentation]. Plotly. https://plotly.com/python/reference/layout/

<sup>5</sup> Legal Services Corporation. (n.d.). Section 2: Today’s Low-Income America. Retrieved November 14, 2024, from https://justicegap.lsc.gov/resource/section-2-todays-low-income-america/

<sup>6</sup> Centers for Disease Control and Prevention. (n.d.). Adult Obesity Prevalence Maps. Retrieved November 14, 2024, from https://www.cdc.gov/obesity/data-and-statistics/adult-obesity-prevalence-maps.html

<sup>7</sup> V.I. Consortium. (2023, November 30). Virgin Islands' diabetes deemed 'silent epidemic' in USVI, affecting over 11,600; VIDCOE seeks increased funding to combat rising cases. The V.I. Consortium. https://viconsortium.com/vi-health/virgin-islands-diabetes-deemed--silent-epidemic--in-usvi--affecting-over-11-600--vidcoe-seeks-increased-funding-to-combat-rising-cases

<sup>8</sup> Commonwealth Fund. (2023, June). 2023 scorecard on state health system performance. The Commonwealth Fund. https://www.commonwealthfund.org/publications/scorecard/2023/jun/2023-scorecard-state-health-system-performance