Dataset: **Normal weight, overweight, and obesity among adults aged 20 and over, by selected characteristics: United States**

Origin: Centers for Disease Control and Prevention (CDC)

Link: [https://catalog.data.gov/dataset/normal-weight-overweight-and-obesity-among-adults-aged-20-and-over-by-selected-characteris-8e2b1](https://catalog.data.gov/dataset/normal-weight-overweight-and-obesity-among-adults-aged-20-and-over-by-selected-characteris-8e2b1)

## 1. Project Introduction

This project investigates the correlation between obesity and social factors such as sex, race, and age in the US during 2015-2018.

## 2. Importing modules and reading the CSV file

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

In [55]:
df = pd.read_csv("Normal_weight__overweight__and_obesity_among_adults_aged_20_and_over__by_selected_characteristics__United_States.csv")
df.head()

Unnamed: 0,INDICATOR,PANEL,PANEL_NUM,UNIT,UNIT_NUM,STUB_NAME,STUB_NAME_NUM,STUB_LABEL,STUB_LABEL_NUM,YEAR,YEAR_NUM,AGE,AGE_NUM,ESTIMATE,SE,FLAG
0,"Normal weight, overweight, and obesity among a...",Normal weight (BMI from 18.5 to 24.9),1,"Percent of population, age-adjusted",1,Total,1,20 years and over,1.1,1988-1994,1,20 years and over,1.0,41.6,0.8,
1,"Normal weight, overweight, and obesity among a...",Normal weight (BMI from 18.5 to 24.9),1,"Percent of population, age-adjusted",1,Total,1,20 years and over,1.1,1999-2002,2,20 years and over,1.0,33.0,0.8,
2,"Normal weight, overweight, and obesity among a...",Normal weight (BMI from 18.5 to 24.9),1,"Percent of population, age-adjusted",1,Total,1,20 years and over,1.1,2001-2004,3,20 years and over,1.0,32.3,0.7,
3,"Normal weight, overweight, and obesity among a...",Normal weight (BMI from 18.5 to 24.9),1,"Percent of population, age-adjusted",1,Total,1,20 years and over,1.1,2003-2006,4,20 years and over,1.0,31.6,0.8,
4,"Normal weight, overweight, and obesity among a...",Normal weight (BMI from 18.5 to 24.9),1,"Percent of population, age-adjusted",1,Total,1,20 years and over,1.1,2005-2008,5,20 years and over,1.0,30.8,0.7,


## 3. Data cleaning


In [56]:
def getInfo(df):
    print("Info about the data:")
    df.info()
    print("\nNum Unique values per column:")
    print(df.nunique())

In [57]:
getInfo(df)

Info about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3360 entries, 0 to 3359
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   INDICATOR       3360 non-null   object 
 1   PANEL           3360 non-null   object 
 2   PANEL_NUM       3360 non-null   int64  
 3   UNIT            3360 non-null   object 
 4   UNIT_NUM        3360 non-null   int64  
 5   STUB_NAME       3360 non-null   object 
 6   STUB_NAME_NUM   3360 non-null   int64  
 7   STUB_LABEL      3360 non-null   object 
 8   STUB_LABEL_NUM  3360 non-null   float64
 9   YEAR            3360 non-null   object 
 10  YEAR_NUM        3360 non-null   int64  
 11  AGE             3360 non-null   object 
 12  AGE_NUM         3360 non-null   float64
 13  ESTIMATE        2899 non-null   float64
 14  SE              2899 non-null   float64
 15  FLAG            844 non-null    object 
dtypes: float64(4), int64(4), object(8)
memory usage: 420.1+ K

In [31]:
print("Panels:", df["PANEL"].unique(), end="\n\n")
print("Unit:", df["UNIT"].unique(), end="\n\n")
print("Stub_name:", df["STUB_NAME"].unique(), end="\n\n")
print("Stub_label:", df["STUB_LABEL"].unique(), end="\n\n")
print("Year:", df["YEAR"].unique(), end="\n\n")
print("Age:", df["AGE"].unique(), end="\n\n")
print("Flag:", df["FLAG"].unique(), end="\n\n")

Panels: ['Normal weight (BMI from 18.5 to 24.9)'
 'Obesity (BMI greater than or equal to 30.0)'
 'Overweight or obese (BMI greater than or equal to 25.0)'
 'Grade 1 obesity (BMI from 30.0 to 34.9)'
 'Grade 2 obesity (BMI from 35.0 to 39.9)'
 'Grade 3 obesity (BMI greater than or equal to 40.0)']

Unit: ['Percent of population, age-adjusted' 'Percent of population, crude']

Stub_name: ['Total' 'Sex' 'Race and Hispanic origin'
 'Sex and race and Hispanic origin' 'Percent of poverty level'
 'Sex and age']

Stub_label: ['20 years and over' 'Male' 'Female' 'Not Hispanic or Latino: White only'
 'Male: Not Hispanic or Latino: White only'
 'Female: Not Hispanic or Latino: White only'
 'Not Hispanic or Latino: Black or African American only'
 'Male: Not Hispanic or Latino: Black or African American only'
 'Female: Not Hispanic or Latino: Black or African American only'
 'Not Hispanic or Latino: Asian only'
 'Male: Not Hispanic or Latino: Asian only'
 'Female: Not Hispanic or Latino: Asian only'

**Flag meaning**

'*' : unreliable estimate; the standard error is greater than 20%

'---' : no data available

nan : normal (no flag set)

'.' : normal; marks data from 2015-2018
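The flag meanings listed above suggest a simple reliability filter: keep rows whose `FLAG` is NaN or `'.'`, and drop rows flagged `'*'` or `'---'`. The following is a minimal sketch under that assumption. The helper name `drop_flagged_rows` is hypothetical, and the last two rows of the sample frame are placeholders made up for illustration (the first two rows reuse values shown in this notebook's outputs).

```python
import numpy as np
import pandas as pd

# Flags treated as usable, following the meanings listed above.
# NaN is checked separately because NaN never compares equal to itself.
USABLE_FLAGS = {"."}


def drop_flagged_rows(frame):
    """Keep rows whose FLAG is NaN or '.', dropping '*' and '---' rows."""
    usable = frame["FLAG"].isna() | frame["FLAG"].isin(USABLE_FLAGS)
    return frame[usable]


# Rows 1-2 reuse ESTIMATE/SE/FLAG values shown in this notebook.
# Rows 3-4 are illustrative placeholders, not real data.
sample = pd.DataFrame(
    {
        "ESTIMATE": [41.6, 23.4, 10.0, np.nan],
        "SE": [0.8, 1.4, 3.0, np.nan],
        "FLAG": [np.nan, ".", "*", "---"],
    }
)

reliable = drop_flagged_rows(sample)
print(reliable)
```

On the real data, the same call would be `drop_flagged_rows(df)`. Whether `'*'` rows should be dropped or only marked in plots is a judgment call that depends on the analysis.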

In [82]:
def convertPairsListToDict(pairsList):
    """Turn an array of (code, label) pairs into a {code: label} dict."""
    return {code: label for code, label in pairsList}

def getMapping(df):
    """
    Build a code-to-label dict for each encoded column in the data file.
    """
    panelList = df[["PANEL_NUM", "PANEL"]].drop_duplicates().values
    unitList = df[["UNIT_NUM", "UNIT"]].drop_duplicates().values
    stubNameList = df[["STUB_NAME_NUM", "STUB_NAME"]].drop_duplicates().values
    stubLabelList = df[["STUB_LABEL_NUM", "STUB_LABEL"]].drop_duplicates().values
    yearList = df[["YEAR_NUM", "YEAR"]].drop_duplicates().values
    ageList = df[["AGE_NUM", "AGE"]].drop_duplicates().values
    
    panelDict = convertPairsListToDict(panelList)
    unitDict = convertPairsListToDict(unitList)
    stubNameDict = convertPairsListToDict(stubNameList)
    stubLabelDict = convertPairsListToDict(stubLabelList)
    yearDict = convertPairsListToDict(yearList)
    ageDict = convertPairsListToDict(ageList)
    return panelDict, unitDict, stubNameDict, stubLabelDict, yearDict, ageDict

In [83]:
panelDict, unitDict, stubNameDict, stubLabelDict, yearDict, ageDict = getMapping(df)
print("Panel:", panelDict, end="\n\n")
print("Unit:", unitDict, end="\n\n")
print("Stub Name:", stubNameDict, end="\n\n")
print("Stub Label:", stubLabelDict, end="\n\n")
print("Year:", yearDict, end="\n\n")
print("Age:", ageDict, end="\n\n")

Panel: {1: 'Normal weight (BMI from 18.5 to 24.9)', 3: 'Obesity (BMI greater than or equal to 30.0)', 2: 'Overweight or obese (BMI greater than or equal to 25.0)', 4: 'Grade 1 obesity (BMI from 30.0 to 34.9)', 5: 'Grade 2 obesity (BMI from 35.0 to 39.9)', 6: 'Grade 3 obesity (BMI greater than or equal to 40.0)'}

Unit: {1: 'Percent of population, age-adjusted', 2: 'Percent of population, crude'}

Stub Name: {1: 'Total', 2: 'Sex', 3: 'Race and Hispanic origin', 4: 'Sex and race and Hispanic origin', 5: 'Percent of poverty level', 6: 'Sex and age'}

Stub Label: {1.1: '20 years and over', 2.1: 'Male', 2.2: 'Female', 3.11: 'Not Hispanic or Latino: White only', 3.111: 'Male: Not Hispanic or Latino: White only', 3.112: 'Female: Not Hispanic or Latino: White only', 3.12: 'Not Hispanic or Latino: Black or African American only', 3.121: 'Male: Not Hispanic or Latino: Black or African American only', 3.122: 'Female: Not Hispanic or Latino: Black or African American only', 3.13: 'Not Hispanic or 

In [88]:
# clean data for gender, race, year = 2015-2018 , unit = age-adjusted
encoded_df = df[["PANEL_NUM", "UNIT_NUM", "STUB_NAME_NUM", "STUB_LABEL_NUM", "YEAR_NUM", "AGE_NUM", "ESTIMATE", "SE", "FLAG"]]
filtered_df = encoded_df[(encoded_df["UNIT_NUM"] == 1) & (encoded_df["YEAR_NUM"] == 10)]
raceGender_df = filtered_df[filtered_df["STUB_NAME_NUM"] == 4]
raceGender_df.head()

Unnamed: 0,PANEL_NUM,UNIT_NUM,STUB_NAME_NUM,STUB_LABEL_NUM,YEAR_NUM,AGE_NUM,ESTIMATE,SE,FLAG
49,1,1,4,3.111,10,1.0,23.4,1.4,.
59,1,1,4,3.112,10,1.0,31.9,1.7,.
79,1,1,4,3.121,10,1.0,26.4,1.7,.
89,1,1,4,3.122,10,1.0,19.2,1.2,.
109,1,1,4,3.131,10,1.0,41.9,1.7,.
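Because `raceGender_df` keeps only the numeric codes, the labels have to be restored before plotting. One way to do this, sketched here as an assumption rather than the notebook's actual method, is to map `STUB_LABEL_NUM` through `stubLabelDict` and split the label into sex and race parts. The sample frame reuses the first four rows shown above, and the dictionary contains only the entries visible in the printed `stubLabelDict`. The column names `SEX` and `RACE` are hypothetical.

```python
import pandas as pd

# Subset of stubLabelDict as printed earlier; entries past the
# truncated part of that output are deliberately left out.
stub_label_dict = {
    3.111: "Male: Not Hispanic or Latino: White only",
    3.112: "Female: Not Hispanic or Latino: White only",
    3.121: "Male: Not Hispanic or Latino: Black or African American only",
    3.122: "Female: Not Hispanic or Latino: Black or African American only",
}

# First four rows of raceGender_df, copied from the output above.
race_gender_sample = pd.DataFrame(
    {
        "PANEL_NUM": [1, 1, 1, 1],
        "STUB_LABEL_NUM": [3.111, 3.112, 3.121, 3.122],
        "ESTIMATE": [23.4, 31.9, 26.4, 19.2],
    },
    index=[49, 59, 79, 89],
)

# Restore the text label from its numeric code.
decoded = race_gender_sample.assign(
    STUB_LABEL=race_gender_sample["STUB_LABEL_NUM"].map(stub_label_dict)
)

# Labels look like "<sex>: <ethnicity>: <race>"; split once on the
# first ": " so the sex lands in one column and the rest in another.
parts = decoded["STUB_LABEL"].str.split(": ", n=1, expand=True)
decoded["SEX"] = parts[0]
decoded["RACE"] = parts[1]

print(decoded[["SEX", "RACE", "ESTIMATE"]])
```

Mapping on float keys such as `3.111` works here because the keys and the column values come from the same parsed numbers. If the codes were ever recomputed, rounding them first would be safer.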


In [89]:
# filter data for age, unit = crude, year = 2015-2018
# (the mask must be built from encoded_df; cleaned_df is not defined anywhere)
filtered_df = encoded_df[(encoded_df["UNIT_NUM"] == 2) & (encoded_df["YEAR_NUM"] == 10)]
age_df = filtered_df[filtered_df["STUB_NAME_NUM"] == 6]
age_df.head()

Unnamed: 0,PANEL_NUM,UNIT_NUM,STUB_NAME_NUM,STUB_LABEL_NUM,YEAR_NUM,AGE_NUM,ESTIMATE,SE,FLAG
450,1,2,6,6.11,10,1.1,32.3,2.1,.
460,1,2,6,6.12,10,1.2,18.7,2.0,.
470,1,2,6,6.13,10,1.3,16.2,1.9,.
480,1,2,6,6.14,10,1.4,21.5,2.6,.
490,1,2,6,6.15,10,1.5,15.6,2.4,.


In [91]:
# clean data for poverty, unit = age-adjusted, year = 2015-2018
filtered_df = encoded_df[(encoded_df["UNIT_NUM"] == 1) & (encoded_df["YEAR_NUM"] == 10)]
poverty_df = filtered_df[filtered_df["STUB_NAME_NUM"] == 5]
poverty_df.head()

Unnamed: 0,PANEL_NUM,UNIT_NUM,STUB_NAME_NUM,STUB_LABEL_NUM,YEAR_NUM,AGE_NUM,ESTIMATE,SE,FLAG
189,1,1,5,5.1,10,1.0,24.3,1.5,.
199,1,1,5,5.2,10,1.0,25.4,1.5,.
209,1,1,5,5.3,10,1.0,24.6,1.4,.
219,1,1,5,5.4,10,1.0,27.6,1.7,.
750,2,1,5,5.1,10,1.0,73.5,1.5,.
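The three filtering cells above repeat the same pattern: pick one unit, one survey period, and one stub name. A small helper can make that pattern explicit and reduce the chance of referencing the wrong frame in a mask. This is a sketch only: `select_subset` is a hypothetical name, and the sample frame holds a few rows copied from the outputs above as a stand-in for `encoded_df`.

```python
import pandas as pd


def select_subset(frame, unit_num, year_num, stub_name_num):
    """Return rows matching one unit, one survey period, and one stub name."""
    mask = (
        (frame["UNIT_NUM"] == unit_num)
        & (frame["YEAR_NUM"] == year_num)
        & (frame["STUB_NAME_NUM"] == stub_name_num)
    )
    # Copy so later column assignments do not touch the source frame.
    return frame[mask].copy()


# A few rows copied from the outputs above, standing in for encoded_df.
encoded_sample = pd.DataFrame(
    {
        "PANEL_NUM": [1, 1, 1, 2],
        "UNIT_NUM": [1, 2, 1, 1],
        "STUB_NAME_NUM": [4, 6, 5, 5],
        "STUB_LABEL_NUM": [3.111, 6.11, 5.1, 5.1],
        "YEAR_NUM": [10, 10, 10, 10],
        "ESTIMATE": [23.4, 32.3, 24.3, 73.5],
    },
    index=[49, 450, 189, 750],
)

# Same selections as the three cells above.
race_gender_rows = select_subset(encoded_sample, unit_num=1, year_num=10, stub_name_num=4)
age_rows = select_subset(encoded_sample, unit_num=2, year_num=10, stub_name_num=6)
poverty_rows = select_subset(encoded_sample, unit_num=1, year_num=10, stub_name_num=5)

print(poverty_rows)
```

With the full data, each call would take `encoded_df` as its first argument instead of `encoded_sample`.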


## 4. EDA

Plan: 

- race vs obesity
- age vs obesity
- poverty vs obesity
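
Each item in the plan comes down to the same operation: pick one panel, then compare `ESTIMATE` across the groups in a filtered frame. The sketch below shows one possible way to do that. `rank_groups` is a hypothetical helper, and the sample reuses the five panel 1 (normal weight) rows printed for `raceGender_df` in section 3, since no obesity rows are shown in this notebook. According to `panelDict`, obesity is `PANEL_NUM` 3, so the real comparison would pass `panel_num=3`.

```python
import pandas as pd


def rank_groups(frame, panel_num):
    """Return ESTIMATE per STUB_LABEL_NUM for one panel, highest first."""
    panel_rows = frame[frame["PANEL_NUM"] == panel_num]
    return (
        panel_rows.set_index("STUB_LABEL_NUM")["ESTIMATE"]
        .sort_values(ascending=False)
    )


# Panel 1 rows copied from the raceGender_df output in section 3.
sample = pd.DataFrame(
    {
        "PANEL_NUM": [1, 1, 1, 1, 1],
        "STUB_LABEL_NUM": [3.111, 3.112, 3.121, 3.122, 3.131],
        "ESTIMATE": [23.4, 31.9, 26.4, 19.2, 41.9],
    }
)

ranked = rank_groups(sample, panel_num=1)
print(ranked)
```

The resulting Series can be passed to a bar chart once the codes are replaced with their labels.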
    

In [None]:
# race vs obesity

In [None]:
# age vs obesity

In [None]:
# poverty vs obesity

## 5. Visualizations
