# Midterm Project

> Develop classification models to predict a target variable.
> Evaluate the classification models based on the different performance metrics.

[Link to Dataset](https://www.kaggle.com/datasets/adeniranstephen/obesity-prediction-dataset)

In [7]:
# Import dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

raw = pd.read_csv('data/obesity_dataset.csv')

raw

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21,1.62,64.00,yes,no,2.0,3.0,Sometimes,no,2.00,no,0.00,1.000,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.00,yes,no,3.0,3.0,Sometimes,yes,3.00,yes,3.00,0.000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.80,77.00,yes,no,2.0,3.0,Sometimes,no,2.00,no,2.00,1.000,Frequently,Public_Transportation,Normal_Weight
3,Male,27,1.80,87.00,no,no,3.0,3.0,Sometimes,no,2.00,no,2.00,0.000,Frequently,Walking,Overweight_Level_I
4,Male,22,1.78,89.80,no,no,2.0,1.0,Sometimes,no,2.00,no,0.00,0.000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,21,1.71,131.41,yes,yes,3.0,3.0,Sometimes,no,1.73,no,1.68,0.906,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,22,1.75,133.74,yes,yes,3.0,3.0,Sometimes,no,2.01,no,1.34,0.599,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,23,1.75,133.69,yes,yes,3.0,3.0,Sometimes,no,2.05,no,1.41,0.646,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24,1.74,133.35,yes,yes,3.0,3.0,Sometimes,no,2.85,no,1.14,0.586,Sometimes,Public_Transportation,Obesity_Type_III


### Data Cleaning

1. CALC, CAEC Column must only have the values [Never, Sometimes, Frequently, Always]
2. Standardize CAEC, CALC to numeric values
3. Change underscores to whitespace
4. Simplify Gender to M and F
5. Change yes/no fields to binary 1/0
6. Rearrange Columns
7. Set aside 10% of the data

In [21]:
# TODO: Cleanup CALC and CAEC [Transform no to Never] - John Mihael
raw['CALC'] = raw['CALC'].replace("no", "Never")
raw['CAEC'] = raw['CAEC'].replace("no", "Never")
# TODO: Cleanup CALC and CAEC [0-3 scale] - Shaun
import pandas as pd

df = pd.read_csv('data/obesity_dataset.csv')

mapping = {'no': 0, 'Sometimes': 1, 'Frequently': 2, 'Always': 3}

df['CAEC'] = df['CAEC'].map(mapping)
df['CALC'] = df['CALC'].map(mapping)

# TODO: Change underscore values to whitespace and standardized - Cazindra
# Replace underscores in COLUMN NAMES (if they exist)
raw.columns = raw.columns.str.replace('_', ' ')

# Apply title case ONLY to column names containing spaces (multi-word columns)
raw.columns = [
    col.title() if ' ' in col else col  # Title case only if space exists
    for col in raw.columns
]

# Replace underscores in DATA ROWS for specific columns
for col in ['MTRANS', 'NObeyesdad']:
    raw[col] = raw[col].astype(str).str.replace('_', ' ', regex=True)

# Columns to standardize to title case
title_case_columns = [
    'Family History With Overweight',  # After underscore replacement
    'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC'
]

# Convert values in these columns to title case
raw[title_case_columns] = raw[title_case_columns].apply(lambda x: x.astype(str).str.title())

# TODO: Simplify Gender to M and F - Maria

# TODO: Transform No/Yes to 0/1 - Joyce
columns_with_yes_no= ['Family History With Overweight', 'FAVC', 'SMOKE', 'SCC']
raw[columns_with_yes_no] = raw[columns_with_yes_no].replace({'yes': 1, 'no': 0})
# TODO: Rearrange columns - Jude (Autofill)

# Randomize Dataset
raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)

# Get unseen dataset
unseen = raw.iloc[1900:2110].reset_index(drop=True) # Get unseen sample
raw = raw.iloc[0:1899].reset_index(drop=True) # Cut original dataset

Updated Columns: ['Gender', 'Age', 'Height', 'Weight', 'Family History With Overweight', 'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS', 'NObeyesdad']

Sample Data (NObeyesdad column):
0    Overweight Level I
1        Obesity Type I
2       Obesity Type II
3      Obesity Type III
4       Obesity Type II
Name: NObeyesdad, dtype: object

Sample Data (MTRANS column):
0               Automobile
1    Public Transportation
2    Public Transportation
3    Public Transportation
4    Public Transportation
Name: MTRANS, dtype: object

After Title Case Standardization:
  Family History With Overweight FAVC       CAEC SMOKE SCC       CALC
0                            Yes  Yes  Sometimes    No  No  Sometimes
1                            Yes  Yes  Sometimes    No  No         No
2                            Yes  Yes  Sometimes    No  No  Sometimes
3                            Yes  Yes  Sometimes    No  No  Sometimes
4                            Yes  Yes  Sometim

### Prepare Data

1. Split features and target column
2. Prepare training and testing data.

In [9]:
# Split Features and Target Column
features = raw.drop('NObeyesdad', axis=1)
target_col = raw['NObeyesdad']

features_train, features_test, target_train, target_test = train_test_split(
    features, target_col, test_size=0.2
)

### Perform Multinomial Training

In [10]:
# TODO: Import Naive Bayes model
# model = MultinomialNB()

# model.fit(features_train, target_train)

# y_pred = model.predict(features_test)

# print("Accuracy:", accuracy_score(target_test, y_pred))

### Simplified Gender to M or F

In [11]:
raw['Gender'] = raw['Gender'].replace({'Female': 'F', 'Male': 'M'})
raw

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,F,20,1.76,53.70,yes,yes,2.00,3.89,Frequently,no,1.86,no,2.87,2.000,Never,Public_Transportation,Insufficient_Weight
1,F,26,1.62,111.00,yes,yes,3.00,3.00,Sometimes,no,2.70,no,0.00,0.323,Sometimes,Public_Transportation,Obesity_Type_III
2,M,18,1.85,60.00,yes,yes,3.00,4.00,Sometimes,no,2.00,yes,2.00,0.000,Sometimes,Automobile,Insufficient_Weight
3,F,21,1.52,42.00,no,yes,3.00,1.00,Frequently,no,1.00,no,0.00,0.000,Sometimes,Public_Transportation,Insufficient_Weight
4,M,22,1.75,74.00,yes,no,2.00,3.00,Sometimes,no,2.00,no,1.00,2.000,Sometimes,Bike,Normal_Weight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1894,F,17,1.78,44.76,no,yes,2.91,3.10,Sometimes,no,2.20,no,2.33,1.550,Sometimes,Public_Transportation,Insufficient_Weight
1895,F,38,1.70,78.64,yes,yes,3.00,3.00,Sometimes,no,2.50,no,0.00,0.000,Never,Automobile,Overweight_Level_II
1896,F,17,1.49,53.62,no,yes,1.84,1.87,Sometimes,no,2.00,yes,0.32,1.970,Sometimes,Public_Transportation,Overweight_Level_I
1897,M,31,1.76,118.57,yes,yes,2.92,3.00,Sometimes,no,2.24,no,1.08,1.481,Sometimes,Automobile,Obesity_Type_II
