# Final Report: Analysis of Factors Leading to Obesity in America
## Project Introduction
With obesity at all-time high rates in the United States, it is important to understand which individuals are most at risk. This analysis focuses on obesity among individuals in America, using data on selected population characteristics made available by the Centers for Disease Control and Prevention (CDC). Our goals are to investigate whether race plays a role in obesity risk, whether income is the leading factor in obesity, and to explore other interesting relationships found within the data.

**GitHub Repository**: https://github.com/uic-cs418/group-project-mind-masters

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
%matplotlib inline
import matplotlib.pyplot as plt
from obesity_children import * #cleaning python file
import data1_helper # for analyzing data 1

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import confusion_matrix

df = pd.read_csv('physicaldataset2.csv')

## Reading Data / Cleaning Data
**Data - Obesity Among Children and Adolescents 2-19**

**Data - Obesity Among Adults 20+**

**Data - Nutrition, Physical Activity, and Obesity**

## Machine Learning Analysis
**Linear Regression Model**

**Logistic Regression**

In [None]:
# Select the categorical features used as predictors
features = ['Age(years)', 'Gender', 'Race/Ethnicity']
# Filter the dataset to include only physical activity-related data
# (na=False treats rows with a missing 'Question' as non-matching)
activity_df = df[(df['Topic'] == 'Physical Activity - Behavior') & (df['Question'].str.contains('no leisure-time physical activity', na=False))].copy()
# Drop rows with no reported value so they are not silently labelled as Active
activity_df = activity_df.dropna(subset=['Data_Value'])
# Convert 'Data_Value' to a binary variable based on a threshold
threshold = 30
activity_df['Inactive'] = (activity_df['Data_Value'] >= threshold).astype(int)
# Use 'Inactive' as the target
target = 'Inactive'
# Perform one-hot encoding for categorical features
activity_df_encoded = pd.get_dummies(activity_df[features], drop_first=True)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(activity_df_encoded, activity_df[target], test_size=0.2, random_state=42)
# Create and train the Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = lr_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion_mat)
# Create sample inputs for different types of individuals
sample_inputs = [
    {"Age(years)": "25 - 34", "Gender": "Male", "Race/Ethnicity": "Non-Hispanic White"},
    {"Age(years)": "45 - 54", "Gender": "Female", "Race/Ethnicity": "Non-Hispanic Black"},
    {"Age(years)": "65 or older", "Gender": "Male", "Race/Ethnicity": "Hispanic"},
    {"Age(years)": "18 - 24", "Gender": "Female", "Race/Ethnicity": "Asian"},
    {"Age(years)": "35 - 44", "Gender": "Male", "Race/Ethnicity": "American Indian/Alaska Native"}
]
# Get the feature names from the trained model
feature_names = lr_model.feature_names_in_
# Predict inactivity for each sample input
for i, sample_input in enumerate(sample_inputs, start=1):
    # Create a one-row DataFrame of zeros with the encoded feature columns the model was trained on
    sample_df = pd.DataFrame(0, index=[0], columns=feature_names)
    # Set the matching one-hot columns to 1; baseline categories removed by drop_first stay all-zero
    for feature, value in sample_input.items():
        encoded_feature = f"{feature}_{value}"
        if encoded_feature in feature_names:
            sample_df[encoded_feature] = 1
    # Make a prediction for the sample input
    sample_prediction = lr_model.predict(sample_df)
    print(f"\nSample Input {i}:")
    print(f"Age: {sample_input['Age(years)']}")
    print(f"Gender: {sample_input['Gender']}")
    print(f"Race/Ethnicity: {sample_input['Race/Ethnicity']}")
    print(f"Predicted Inactivity: {'Inactive' if sample_prediction[0] == 1 else 'Active'}")

In [1]:
# Testing the Logistic Regression ML - confusion matrix plot

In [None]:
# Create a DataFrame for plotting
plot_data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
# Create a confusion matrix plot
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, cmap='Blues', fmt='d', cbar=False,
            xticklabels=['Active', 'Inactive'], yticklabels=['Active', 'Inactive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
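
Accuracy alone can hide how a classifier treats each class, so the confusion matrix is often read together with precision, recall, and F1 (all of which are already imported at the top of this notebook). The sketch below uses hypothetical labels, `demo_true` and `demo_pred`, rather than the project's `y_test` and `y_pred`, purely to show how the four cells of the matrix relate to those scores.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for illustration only (1 = Inactive, 0 = Active).
demo_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() on the 2x2 matrix yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

# Precision: of the records predicted Inactive, how many really are Inactive.
demo_precision = precision_score(demo_true, demo_pred)
# Recall: of the records that are Inactive, how many the model found.
demo_recall = recall_score(demo_true, demo_pred)
# F1: harmonic mean of precision and recall.
demo_f1 = f1_score(demo_true, demo_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={demo_precision:.2f} Recall={demo_recall:.2f} F1={demo_f1:.2f}")
```

The same three calls could be applied to `y_test` and `y_pred` from the logistic regression cell, assuming both classes appear in the test split.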

In [None]:
# Logistic Regression ML - Visualization Plot

In [None]:
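# Percentage of records in each group whose value is at or above the threshold
# (the mean of a 0/1 column is a fraction, so multiply by 100 to match the axis label)
race_activity = activity_df.groupby('Race/Ethnicity')['Inactive'].mean() * 100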

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=race_activity.index, y=race_activity.values)
plt.xlabel('Race/Ethnicity')
plt.ylabel('Inactivity Percentage')
plt.title('Inactivity Percentage by Race/Ethnicity')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
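
Because `Inactive` is a 0/1 indicator, its group mean is the share of records in that group at or above the threshold, and that share can be unstable when a group has few records. The sketch below is an illustration on a hypothetical frame named `demo`, not the project data; it reports the share next to the record count per group so small groups are easy to spot.

```python
import pandas as pd

# Hypothetical records for illustration only.
demo = pd.DataFrame({
    "Race/Ethnicity": ["Asian", "Asian", "Hispanic", "Hispanic", "Hispanic", "Hispanic"],
    "Inactive": [0, 1, 1, 1, 0, 1],
})

# Named aggregation: mean of the 0/1 flag (a fraction) and number of records per group.
summary = demo.groupby("Race/Ethnicity")["Inactive"].agg(share="mean", n="count")
# Express the fraction as a percentage for reporting.
summary["share_pct"] = summary["share"] * 100

print(summary)
```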

## Visualizations + Additional Visualization
**Wealthy adults in 2018 are more likely to be obese than Low-Income Adults in 1988**

**Hispanic individuals aged 2-19 have the highest obesity rate from 1988-2018**

**Asians have a lower obesity rate than people of other races at all levels of obesity**


**Interactive Visualization**

Directions for accessing the interactive visualization are in the README.

## Results
Fully explain and analyze the results from your data, i.e. the inferences or correlations you uncovered, the tools you built, or the visualizations you created. 