
# Step-5: ML Model Training & Evaluation

This notebook trains **Logistic Regression** and **Random Forest** models  
on the `user_food_suitability_dataset_v1.csv` dataset.

Each step is explained clearly for learning, exams, and reviews.


In [2]:

# Import core libraries
import pandas as pd
import numpy as np


## 1. Load Dataset

In [4]:

# Load the final dataset generated in Step-4
df = pd.read_csv("/kaggle/input/user-data-with-ntrition-value/user_food_suitability_dataset_v1.csv")

# Display first few rows
df.head()


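|   | Weight | BMI | Calories | Protein | Fat | Carbohydrates | Label |
|---|--------|-----|----------|---------|-----|---------------|-------|
| 0 | 96 | 31.708284 | 88.0 | 0.058 | 9.8 | 0.073 | 0 |
| 1 | 96 | 31.708284 | 99.0 | 2.8 | 8.8 | 3.7 | 0 |
| 2 | 96 | 31.708284 | 120.0 | 0.0 | 10.6375 | 0.0 | 0 |
| 3 | 96 | 31.708284 | 271.125 | 0.0 | 10.6375 | 0.0 | 0 |
| 4 | 96 | 31.708284 | 123.0 | 0.0 | 10.6375 | 0.0 | 0 |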



We separate the DataFrame into:
- X → input features (every column except `Label`)
- y → target label (the `Label` column)


In [5]:

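# Features: every column except the target
X = df.drop("Label", axis=1)
# Target: the Label column
y = df["Label"]

# Check the class balance of the target
y.value_counts()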


Label
0    174522
1     85398
Name: count, dtype: int64

## 2. Train-Test Split

In [6]:

from sklearn.model_selection import train_test_split

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
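
The label counts above are imbalanced (85,398 of 259,920 rows, about 33%, belong to class 1), which is why the split passes `stratify=y`. The sketch below illustrates the effect on a small synthetic array with a similar ratio. The names `X_demo`, `y_demo`, and the split variables are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real labels: 67% class 0 and 33% class 1,
# roughly the ratio reported by y.value_counts() above.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 670 + [1] * 330)
X_demo = rng.normal(size=(1000, 6))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# Share of class 1 in the full data, the training split, and the test split
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

With stratification, the three printed shares should be nearly identical, so neither split over- or under-represents the minority class.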


## 3. Logistic Regression (Baseline Model)


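Logistic Regression is a **linear baseline model**.
It is trained on standardized features here, because its default L2 regularization and the solver's convergence are both sensitive to feature scale.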


In [8]:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [9]:

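# Scale features (important for Logistic Regression).
# Fit the scaler on the training split only, then apply the same
# transform to the test split so no test statistics leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)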

# Train Logistic Regression model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_scaled, y_train)
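
The scale-then-fit steps above can also be bundled into a single scikit-learn pipeline, which keeps the scaler and the classifier together as one object. This is an optional alternative sketch, not the notebook's method. It runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `log_pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the food suitability dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics whenever it predicts on new data.
log_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_pipeline.fit(X_tr, y_tr)

print("Pipeline accuracy:", log_pipeline.score(X_te, y_te))
```

A pipeline like this can be passed directly to cross-validation helpers or saved as one file, so the scaler is never forgotten at prediction time.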


In [10]:

# Predictions using Logistic Regression
y_pred_log = log_model.predict(X_test_scaled)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log))

print("Confusion Matrix:")
confusion_matrix(y_test, y_pred_log)


Logistic Regression Accuracy: 0.9310172360726378

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95     34904
           1       0.88      0.91      0.90     17080

    accuracy                           0.93     51984
   macro avg       0.92      0.93      0.92     51984
weighted avg       0.93      0.93      0.93     51984

Confusion Matrix:


array([[32827,  2077],
       [ 1509, 15571]])
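
The figures in the classification report follow directly from the confusion matrix above. As a worked check, the sketch below recomputes accuracy and the class 1 precision, recall, and F1 from the four reported counts. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class)
cm = np.array([[32827, 2077],
               [1509, 15571]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)              # of rows truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    : {accuracy:.4f}")
print(f"precision(1): {precision_1:.2f}")
print(f"recall(1)   : {recall_1:.2f}")
print(f"f1(1)       : {f1_1:.2f}")
```

The rounded values agree with the report: accuracy 0.93, and precision 0.88, recall 0.91, F1 0.90 for class 1.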

## 4. Random Forest (Primary Model)


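Random Forest is a **non-linear ensemble model**.
It does NOT require feature scaling, because each tree split compares a single feature against a threshold, so the model is trained on the unscaled `X_train`.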


In [11]:

from sklearn.ensemble import RandomForestClassifier


In [12]:

# Train Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)


In [13]:

# Predictions using Random Forest
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

print("Confusion Matrix:")
confusion_matrix(y_test, y_pred_rf)


Random Forest Accuracy: 0.9999807633117882

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34904
           1       1.00      1.00      1.00     17080

    accuracy                           1.00     51984
   macro avg       1.00      1.00      1.00     51984
weighted avg       1.00      1.00      1.00     51984

Confusion Matrix:


array([[34903,     1],
       [    0, 17080]])
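
The Random Forest score above comes from a single 80/20 split. One common way to check that a score like this is stable across different splits is stratified k-fold cross-validation. The sketch below shows the mechanics on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `rf_demo` are hypothetical, and the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the food suitability dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=42
)

# Five stratified folds: each fold keeps the overall class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy value per fold, each computed on data the model did not see
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(4))
print("Mean accuracy  :", scores.mean().round(4))
print("Std deviation  :", scores.std().round(4))
```

In the notebook itself, the same call could be run with `X` and `y` in place of the synthetic arrays; a small standard deviation across folds would suggest the result does not depend on one lucky split.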

## 5. Feature Importance (Random Forest)

In [14]:

# Display feature importance
feature_importance = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importance


Protein          0.371839
Fat              0.251253
Calories         0.216711
Carbohydrates    0.158385
BMI              0.001391
Weight           0.000421
dtype: float64
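
The values above are impurity-based importances, which are computed from the training data. Permutation importance on held-out data is a common cross-check: it shuffles one feature at a time and measures how much the score drops. The sketch below is illustrative only. It uses synthetic data, and the column names are borrowed from the notebook purely for readability, so the resulting ranking does not reflect the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Column names borrowed for readability; the values are synthetic
feature_names = ["Weight", "BMI", "Calories", "Protein", "Fat", "Carbohydrates"]
X_arr, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test split and record the score drop
result = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=5, random_state=42
)
perm_importance = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)

print(perm_importance)
```

If both methods rank the same features at the top, the importance ordering is less likely to be an artifact of how impurity-based scores are computed.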

## 6. Save the Model

In [17]:

import joblib

# Save trained Random Forest model
joblib.dump(
    rf_model,
    "random_forest_food_suitability_model_v1.pkl"
)

print("Random Forest model saved successfully.")


Random Forest model saved successfully.
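
A saved model is only useful if it can be loaded back and produces the same predictions. The sketch below shows a save and load round trip with `joblib` on a small synthetic model. The file name `rf_demo.pkl` and the names `rf_demo` and `loaded_model` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the notebook's rf_model
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=42)
rf_demo.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_demo.pkl")  # hypothetical file name
    joblib.dump(rf_demo, model_path)        # serialize the fitted model
    loaded_model = joblib.load(model_path)  # restore it from disk

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(
    rf_demo.predict(X_demo), loaded_model.predict(X_demo)
)
print("Predictions identical after reload:", same_predictions)
```

When loading a model in a different environment, it is safest to use the same scikit-learn version that was used to save it, and to pass features in the same column order as during training.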



## Conclusion

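- Logistic Regression reaches about 93.1% test accuracy, which shows the labels are largely learnable even with a linear model
- Random Forest reaches about 99.998% test accuracy (1 misclassification in 51,984 test rows), capturing non-linear interactions that the linear model misses
- Random Forest, the better-performing model, is the one saved for deployment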
