# Midterm Project

> Develop classification models to predict a target variable.
> Evaluate the classification models based on the different performance metrics.

[Link to Dataset](https://www.kaggle.com/datasets/adeniranstephen/obesity-prediction-dataset)

In [84]:
# Import libraries and load the dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

raw = pd.read_csv('data/obesity_dataset.csv')

raw

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21,1.62,64.00,yes,no,2.0,3.0,Sometimes,no,2.00,no,0.00,1.000,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.00,yes,no,3.0,3.0,Sometimes,yes,3.00,yes,3.00,0.000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.80,77.00,yes,no,2.0,3.0,Sometimes,no,2.00,no,2.00,1.000,Frequently,Public_Transportation,Normal_Weight
3,Male,27,1.80,87.00,no,no,3.0,3.0,Sometimes,no,2.00,no,2.00,0.000,Frequently,Walking,Overweight_Level_I
4,Male,22,1.78,89.80,no,no,2.0,1.0,Sometimes,no,2.00,no,0.00,0.000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,21,1.71,131.41,yes,yes,3.0,3.0,Sometimes,no,1.73,no,1.68,0.906,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,22,1.75,133.74,yes,yes,3.0,3.0,Sometimes,no,2.01,no,1.34,0.599,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,23,1.75,133.69,yes,yes,3.0,3.0,Sometimes,no,2.05,no,1.41,0.646,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24,1.74,133.35,yes,yes,3.0,3.0,Sometimes,no,2.85,no,1.14,0.586,Sometimes,Public_Transportation,Obesity_Type_III


### Data Cleaning

1. Ensure the CALC and CAEC columns contain only the values [Never, Sometimes, Frequently, Always]
2. Encode CAEC and CALC as numeric values
3. Replace underscores with spaces
4. Simplify Gender to M and F
5. Convert yes/no fields to binary 1/0
6. Rearrange the columns
7. Set aside about 10% of the data as an unseen sample
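
Steps 1 and 2 can be checked on a small example. The sketch below uses a hypothetical `toy` DataFrame (not the real dataset) and assumes the ordering Never < Sometimes < Frequently < Always for the 0-3 encoding:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "CAEC": ["Sometimes", "no", "Always", "Frequently"],
    "CALC": ["no", "Sometimes", "Frequently", "no"],
})

allowed = ["Never", "Sometimes", "Frequently", "Always"]

# Step 1: rename "no" to "Never" so both columns share one vocabulary
for col in ["CAEC", "CALC"]:
    toy[col] = toy[col].replace("no", "Never")

# Collect any value that falls outside the allowed set (empty lists mean clean)
unexpected = {
    col: sorted(set(toy[col]) - set(allowed)) for col in ["CAEC", "CALC"]
}

# Step 2: encode the ordered categories as integers 0-3
codes = {label: i for i, label in enumerate(allowed)}
encoded = toy.apply(lambda s: s.map(codes))

print(unexpected)
print(encoded)
```

Running the check before the encoding matters because `map` turns any label missing from `codes` into `NaN` without raising an error.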

In [85]:
# Work on the DataFrame loaded in the first cell (raw); pandas is already imported

# Clean up CALC and CAEC: rename "no" to "Never" - John Mihael
raw['CALC'] = raw['CALC'].replace("no", "Never")
raw['CAEC'] = raw['CAEC'].replace("no", "Never")

# Encode CALC and CAEC on a 0-3 scale - Shaun
# Applied to raw after the rename above, so the key is 'Never' rather than 'no'
mapping = {'Never': 0, 'Sometimes': 1, 'Frequently': 2, 'Always': 3}
raw['CAEC'] = raw['CAEC'].map(mapping)
raw['CALC'] = raw['CALC'].map(mapping)

# Replace underscores with spaces and standardize case - Cazindra
# Replace underscores in column names
raw.columns = raw.columns.str.replace('_', ' ')

# Apply title case only to multi-word column names
raw.columns = [col.title() if ' ' in col else col for col in raw.columns]

# Replace underscores in the values of specific columns
for col in ['MTRANS', 'NObeyesdad']:
    raw[col] = raw[col].astype(str).str.replace('_', ' ', regex=False)

# Convert the yes/no values to title case
# (CAEC and CALC are already numeric at this point, so they are not included)
columns_with_yes_no = ['Family History With Overweight', 'FAVC', 'SMOKE', 'SCC']
raw[columns_with_yes_no] = raw[columns_with_yes_no].apply(lambda x: x.astype(str).str.title())

# Simplify Gender to M or F
raw['Gender'] = raw['Gender'].replace({'Female': 'F', 'Male': 'M'})

# Transform No/Yes to 0/1 - Joyce
raw[columns_with_yes_no] = raw[columns_with_yes_no].apply(lambda s: s.map({'Yes': 1, 'No': 0}))

# Rearrange columns - Jude (Autofill)
# CAEC, CALC and MTRANS are not listed, so they are dropped here.
# .copy() makes raw an independent DataFrame, which avoids SettingWithCopyWarning.
raw = raw[[
    # Demographic and yes/no columns
    'Gender', 'Age', 'Height', 'Weight', 'Family History With Overweight', 'FAVC', 'SMOKE', 'SCC',
    # Decimal (numeric) columns
    'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE',
    # Target column (must remain last)
    'NObeyesdad'
]].copy()

# Encode Gender: M -> 0, F -> 1
raw['Gender'] = raw['Gender'].map({'M': 0, 'F': 1})

# Set aside the unseen sample
# Note: iloc end bounds are exclusive, so row 1899 falls in neither slice
unseen = raw.iloc[1900:2110].reset_index(drop=True)  # Unseen sample
raw = raw.iloc[0:1899].reset_index(drop=True)  # Remaining data for training and testing

raw


Unnamed: 0,Gender,Age,Height,Weight,Family History With Overweight,FAVC,SMOKE,SCC,FCVC,NCP,CH2O,FAF,TUE,NObeyesdad
0,1,21,1.62,64.00,1,0,0,0,2.0,3.0,2.00,0.00,1.000,Normal Weight
1,1,21,1.52,56.00,1,0,1,1,3.0,3.0,3.00,3.00,0.000,Normal Weight
2,0,23,1.80,77.00,1,0,0,0,2.0,3.0,2.00,2.00,1.000,Normal Weight
3,0,27,1.80,87.00,0,0,0,0,3.0,3.0,2.00,2.00,0.000,Overweight Level I
4,0,22,1.78,89.80,0,0,0,0,2.0,1.0,2.00,0.00,0.000,Overweight Level II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1894,1,23,1.74,133.49,1,1,0,0,3.0,3.0,2.84,1.34,0.803,Obesity Type III
1895,1,26,1.57,102.00,1,1,0,0,3.0,3.0,1.00,0.00,1.000,Obesity Type III
1896,1,26,1.58,102.13,1,1,0,0,3.0,3.0,1.01,0.02,0.696,Obesity Type III
1897,1,20,1.78,154.62,1,1,0,0,3.0,3.0,2.32,1.97,0.768,Obesity Type III


### Prepare Data

1. Split the features from the target column
2. Split the data into training and testing sets

In [86]:
# Split Features and Target Column
features = raw.drop('NObeyesdad', axis=1)
target_col = raw['NObeyesdad']

features_train, features_test, target_train, target_test = train_test_split(
    features, target_col, test_size=0.2
)
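
The split above draws a different random sample on every run, so the reported scores will vary between runs. A minimal sketch of a repeatable, class-balanced split follows. It uses hypothetical toy data, and the seed value `42` is an arbitrary assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two balanced classes
toy_features = pd.DataFrame({"Weight": range(50, 60), "Height": [1.6] * 10})
toy_target = pd.Series(["Normal Weight"] * 5 + ["Obesity Type I"] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,
    random_state=42,      # fixed seed (arbitrary value) for a repeatable split
    stratify=toy_target,  # keep class proportions equal in both parts
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

With `stratify`, each class keeps the same share in the training and testing parts, which helps when some classes have few rows.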

### Train the Multinomial Naive Bayes Model

In [87]:
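# Create the Multinomial Naive Bayes model (imported in the first cell)
model = MultinomialNB()

# Fit the model on the training data
model.fit(features_train, target_train)

# Predict the classes of the test data
y_pred = model.predict(features_test)

print("Accuracy:", accuracy_score(target_test, y_pred))
print(classification_report(target_test, y_pred))  # Per-class precision, recall and F1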

Accuracy: 0.4710526315789474
                     precision    recall  f1-score   support

Insufficient Weight       0.60      0.80      0.68        54
      Normal Weight       0.45      0.27      0.34        63
     Obesity Type I       0.37      0.37      0.37        78
    Obesity Type II       0.53      0.96      0.68        53
   Obesity Type III       1.00      0.46      0.63        24
 Overweight Level I       0.24      0.10      0.14        52
Overweight Level II       0.37      0.41      0.39        56

           accuracy                           0.47       380
          macro avg       0.51      0.48      0.46       380
       weighted avg       0.46      0.47      0.44       380
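
The first cell also imports `GaussianNB` and `BernoulliNB`, and the project brief asks for several models. `MultinomialNB` is designed for count-like features, while columns such as Age, Height and Weight are continuous, so a comparison may be worthwhile. The sketch below shows one way to compare the three variants. It runs on hypothetical synthetic data rather than the obesity dataset, so its scores say nothing about the real results:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical synthetic data: three classes with two continuous,
# non-negative features; only the second feature's mean differs by class
n_per_class = 100
X = np.vstack([
    rng.normal(loc=[1.6, 50 + 25 * k], scale=[0.1, 5.0], size=(n_per_class, 2))
    for k in range(3)
])
y = np.repeat([0, 1, 2], n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit each Naive Bayes variant on the same split and record its accuracy
scores = {}
for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = accuracy_score(y_test, clf.predict(X_test))

print(scores)
```

The same loop could be pointed at `features_train` and `target_train` from the cell above, reusing one split so that the models are compared on identical rows.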



### Use the model

In [88]:
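# Keep a copy with the true labels for later comparison
original = unseen.copy()
# Drop the target column so the unseen data matches the training features
unseen.drop('NObeyesdad', axis=1, inplace=True)
results = model.predict(unseen)

# Attach the predictions to the unseen data
unseen['Predictions'] = results

unseen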

Unnamed: 0,Gender,Age,Height,Weight,Family History With Overweight,FAVC,SMOKE,SCC,FCVC,NCP,CH2O,FAF,TUE,Predictions
0,1,26,1.66,111.96,1,1,0,0,3.0,3.0,2.78,0.00,0.052,Obesity Type I
1,1,26,1.68,104.59,1,1,0,0,3.0,3.0,1.30,0.28,0.929,Obesity Type I
2,1,25,1.68,104.85,1,1,0,0,3.0,3.0,1.24,0.26,0.852,Obesity Type I
3,1,19,1.72,127.64,1,1,0,0,3.0,3.0,1.31,0.91,0.707,Obesity Type III
4,1,20,1.68,125.42,1,1,0,0,3.0,3.0,1.12,0.88,0.552,Obesity Type III
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,1,21,1.73,131.34,1,1,0,0,3.0,3.0,1.80,1.73,0.898,Obesity Type III
206,1,21,1.71,131.41,1,1,0,0,3.0,3.0,1.73,1.68,0.906,Obesity Type III
207,1,22,1.75,133.74,1,1,0,0,3.0,3.0,2.01,1.34,0.599,Obesity Type III
208,1,23,1.75,133.69,1,1,0,0,3.0,3.0,2.05,1.41,0.646,Obesity Type III


In [89]:
original

Unnamed: 0,Gender,Age,Height,Weight,Family History With Overweight,FAVC,SMOKE,SCC,FCVC,NCP,CH2O,FAF,TUE,NObeyesdad
0,1,26,1.66,111.96,1,1,0,0,3.0,3.0,2.78,0.00,0.052,Obesity Type III
1,1,26,1.68,104.59,1,1,0,0,3.0,3.0,1.30,0.28,0.929,Obesity Type III
2,1,25,1.68,104.85,1,1,0,0,3.0,3.0,1.24,0.26,0.852,Obesity Type III
3,1,19,1.72,127.64,1,1,0,0,3.0,3.0,1.31,0.91,0.707,Obesity Type III
4,1,20,1.68,125.42,1,1,0,0,3.0,3.0,1.12,0.88,0.552,Obesity Type III
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,1,21,1.73,131.34,1,1,0,0,3.0,3.0,1.80,1.73,0.898,Obesity Type III
206,1,21,1.71,131.41,1,1,0,0,3.0,3.0,1.73,1.68,0.906,Obesity Type III
207,1,22,1.75,133.74,1,1,0,0,3.0,3.0,2.01,1.34,0.599,Obesity Type III
208,1,23,1.75,133.69,1,1,0,0,3.0,3.0,2.05,1.41,0.646,Obesity Type III
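
The two tables can be compared programmatically instead of by eye. The sketch below uses small hypothetical stand-ins for `original` and `unseen`, with only a `Weight` column and four rows taken from the output above, and assumes both frames share the same row order:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for the notebook's `original` and `unseen` frames
original = pd.DataFrame({
    "Weight": [111.96, 104.59, 127.64, 125.42],
    "NObeyesdad": ["Obesity Type III"] * 4,
})
unseen = pd.DataFrame({
    "Weight": [111.96, 104.59, 127.64, 125.42],
    "Predictions": ["Obesity Type I", "Obesity Type I",
                    "Obesity Type III", "Obesity Type III"],
})

# Put true and predicted labels side by side (rows share the same index)
comparison = pd.DataFrame({
    "Actual": original["NObeyesdad"],
    "Predicted": unseen["Predictions"],
})
comparison["Correct"] = comparison["Actual"] == comparison["Predicted"]

unseen_accuracy = accuracy_score(comparison["Actual"], comparison["Predicted"])
print(comparison)
print("Accuracy on unseen sample:", unseen_accuracy)  # 0.5 for this toy example
```

Every row displayed in `original` is labelled Obesity Type III, so accuracy on this unseen sample may mostly reflect how well the model handles that one class.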
