# 🧬 NovaGen Research Labs – Machine Learning Project

## 📌 Project Overview
This project builds and evaluates several machine learning models to identify the most effective algorithm for the given dataset. The objective is not only to achieve strong predictive performance but also to follow industry best practices such as data preprocessing, pipeline usage, and proper evaluation techniques.


In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

In [7]:
df = pd.read_csv("novagen_dataset.csv")

## 📊 Data Understanding & Exploratory Analysis (EDA)
The dataset was explored to understand structure, feature types, missing values, and basic statistical properties. This step helped guide preprocessing and model selection decisions.


In [8]:
df.head()
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9549 entries, 0 to 9548
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    9549 non-null   float64
 1   BMI                    9549 non-null   float64
 2   Blood_Pressure         9549 non-null   float64
 3   Cholesterol            9549 non-null   float64
 4   Glucose_Level          9549 non-null   float64
 5   Heart_Rate             9549 non-null   float64
 6   Sleep_Hours            9549 non-null   float64
 7   Exercise_Hours         9549 non-null   float64
 8   Water_Intake           9549 non-null   float64
 9   Stress_Level           9549 non-null   float64
 10  Target                 9549 non-null   int64  
 11  Smoking                9549 non-null   int64  
 12  Alcohol                9549 non-null   int64  
 13  Diet                   9549 non-null   int64  
 14  MentalHealth           9549 non-null   int64  
 15  Phys

Statistic,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,Target,Smoking,Alcohol,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies
count,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0,9549.0
mean,33.806786,25.660697,130.382658,199.091528,100.225678,73.613782,6.951409,1.892345,3.580899,4.382134,0.521416,0.99047,0.995183,1.005864,0.998429,1.003351,1.004713,0.989318
std,24.566473,1.942369,27.878476,1.969234,2.157999,1.681538,2.352152,1.378714,1.622874,2.078593,0.499567,0.815521,0.816653,0.815877,0.821844,0.8088,0.813506,0.815699
min,0.0,19.0,22.0,192.0,93.0,67.0,0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,14.0,24.0,113.0,198.0,99.0,73.0,5.0,1.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,29.0,26.0,134.0,199.0,100.0,74.0,7.0,2.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,50.0,27.0,150.0,200.0,102.0,75.0,9.0,3.0,5.0,6.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
max,100.0,32.0,225.0,207.0,107.0,80.0,14.0,8.0,10.0,12.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


## 🛠️ Data Preprocessing & Feature Engineering
The dataset was split into training and testing sets using `train_test_split`. Feature scaling was applied using `StandardScaler` to normalize numerical features. Pipelines were used to ensure clean, reproducible preprocessing and to avoid data leakage.


In [9]:
X = df.drop(["Target"], axis=1)
y = df["Target"]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [13]:
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)
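
As a hedged illustration of the leakage point in this section, the sketch below fits the scaler on the training split only and then reuses its statistics on the test split. It runs on a synthetic stand-in generated with `make_classification` (an assumption; the real `novagen_dataset.csv` is not used here) and adds `stratify=y`, which the cells above do not use, to keep the class ratio equal in both splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 10 numeric features).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse its statistics.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
test_means = X_test_scaled.mean(axis=0)
print("max |train mean|:", np.abs(train_means).max())
print("max |test mean| :", np.abs(test_means).max())
print("class-1 share (train, test):", y_train.mean(), y_test.mean())
```

The training means come out at zero up to floating-point error, while the test means are only close to zero. That small difference is expected: the test split is transformed with training statistics and never contributes to them.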

## 🤖 Model Building Strategy
Multiple classification models were trained on the same train/test split to allow a fair comparison between algorithms. Scale-sensitive models (SVM, KNN) were additionally given standardized features.
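
A comparison of this kind can be sketched as a small harness in which every candidate shares the same scaling step and the same cross-validation folds. This is a minimal sketch under stated assumptions, not the notebook's own procedure: the data is a synthetic stand-in from `make_classification`, and the `candidates` dictionary and its hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# Hypothetical candidate set; every model gets the same scaling step.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=9),
    "svc": SVC(kernel="rbf", C=10, gamma="scale"),
}

results = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    # The scaler is refit inside each fold, so no fold sees held-out statistics.
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    results[name] = scores.mean()

for name, mean_f1 in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean F1 = {mean_f1:.3f}")
```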


# Logistic Regression

In [14]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

In [15]:
y_pred = model.predict(X_test)

In [16]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.8230366492146597
precision score:  0.8287671232876712
recall score:  0.8386138613861386
f1 score:  0.8336614173228346


Since accuracy alone can be misleading for imbalanced datasets, Precision, Recall, and F1-score were also evaluated. All four metrics fall between 0.82 and 0.84, which indicates consistent performance rather than an accuracy figure driven by a single class.
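
To make the relationship between these four metrics concrete, the worked example below recomputes them from confusion-matrix counts. The counts are not notebook output: they are inferred from the Logistic Regression scores printed above together with the test-set class sizes (900 negatives, 1010 positives) shown in the SVM classification reports.

```python
# Confusion-matrix counts inferred from the printed scores and the
# test-set class sizes; derived values, not separate notebook output.
tp, fp, fn, tn = 847, 175, 163, 725

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8230 precision=0.8288 recall=0.8386 f1=0.8337
```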

# Decision Tree

In [17]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [18]:
y_pred_test = dt.predict(X_test)
y_pred_train = dt.predict(X_train)

In [19]:
print("Training Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Training precision: ", precision_score(y_train, y_pred_train))
print("Training recall: ", recall_score(y_train, y_pred_train))
print("Training f1 score: ", f1_score(y_train, y_pred_train))

print("\nTesting Accuracy: ", accuracy_score(y_test, y_pred_test))
print("Testing precision: ", precision_score(y_test, y_pred_test))
print("Testing recall: ", recall_score(y_test, y_pred_test))
print("Testing f1: ", f1_score(y_test, y_pred_test))

Training Accuracy:  1.0
Training precision:  1.0
Training recall:  1.0
Training f1 score:  1.0

Testing Accuracy:  0.8890052356020942
Testing precision:  0.9006024096385542
Testing recall:  0.8881188118811881
Testing f1:  0.8943170488534397


The Decision Tree model achieved perfect training accuracy (1.0) against ~88.9% test accuracy. This gap indicates overfitting: the unconstrained tree has memorized the training data. The test accuracy is still good, but the model has high variance, so constraining the tree, for example by limiting `max_depth` or increasing `min_samples_leaf`, is worth trying to improve generalization.
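
The train/test gap described here can be observed directly by growing trees of different depths. The sketch below is illustrative only: it uses a synthetic stand-in from `make_classification` with label noise (`flip_y=0.1`, an assumption), so its numbers do not correspond to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with some label noise (assumption, not the real data).
X, y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gaps = {}
for depth in (2, 4, 8, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    gaps[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f}")
```

The unconstrained tree reaches a training accuracy of 1.0 even on noisy labels, which is the same memorization signal seen above.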

# Using pruning

In [20]:
dt = DecisionTreeClassifier(
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=5,
    random_state=42
)
dt.fit(X_train, y_train)

In [21]:
y_pred_test = dt.predict(X_test)
y_pred_train = dt.predict(X_train)

In [22]:
print("Training Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Training precision: ", precision_score(y_train, y_pred_train))
print("Training recall: ", recall_score(y_train, y_pred_train))
print("Training f1 score: ", f1_score(y_train, y_pred_train))

print("\nTesting Accuracy: ", accuracy_score(y_test, y_pred_test))
print("Testing precision: ", precision_score(y_test, y_pred_test))
print("Testing recall: ", recall_score(y_test, y_pred_test))
print("Testing f1: ", f1_score(y_test, y_pred_test))

Training Accuracy:  0.8283806780992277
Training precision:  0.805939226519337
Training recall:  0.8820861678004536
Training f1 score:  0.8422952002887044

Testing Accuracy:  0.8324607329842932
Testing precision:  0.8212290502793296
Testing recall:  0.8732673267326733
Testing f1:  0.8464491362763915


After pruning, the Decision Tree model showed similar training and testing performance (~83%), so the gap that signalled overfitting is gone. Test accuracy, however, fell from ~88.9% for the unpruned tree to ~83.2%, which suggests that `max_depth=4` is more restrictive than necessary. Pruning controlled model complexity and produced a more balanced classifier, but the depth and leaf-size settings would benefit from tuning.
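
One way to choose the pruning parameters instead of fixing them by hand is a cross-validated grid search on the training split. The following is a minimal sketch, assuming a synthetic stand-in dataset and a hypothetical `param_grid`; the values the search selects here say nothing about the best settings for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with label noise (assumption, not the real data).
X, y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space for the pruning parameters.
param_grid = {"max_depth": [3, 4, 6, 8, 12], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # the test split is never used for selection

print("best params  :", search.best_params_)
print("cv accuracy  :", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```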

# Random Forest

In [23]:
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=5,
    oob_score=True,
    random_state=42
)

rf.fit(X_train, y_train)

In [24]:
y_pred_test = rf.predict(X_test)
y_pred_train = rf.predict(X_train)

In [25]:
print("Training Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Training precision: ", precision_score(y_train, y_pred_train))
print("Training recall: ", recall_score(y_train, y_pred_train))
print("Training f1 score: ", f1_score(y_train, y_pred_train))

print("\nTesting Accuracy: ", accuracy_score(y_test, y_pred_test))
print("Testing precision: ", precision_score(y_test, y_pred_test))
print("Testing recall: ", recall_score(y_test, y_pred_test))
print("Testing f1: ", f1_score(y_test, y_pred_test))

print("\nOOB Score: ", rf.oob_score_)

Training Accuracy:  0.8367587380547192
Training precision:  0.8043381037567084
Training recall:  0.9062736205593348
Training f1 score:  0.8522686885440114

Testing Accuracy:  0.8471204188481676
Testing precision:  0.8240072202166066
Testing recall:  0.903960396039604
Testing f1:  0.8621340887629839

OOB Score:  0.8355805733734782


The Random Forest classifier reached ~84.7% testing accuracy with strong recall (~90%), ahead of Logistic Regression and of the pruned Decision Tree that uses the same depth and leaf constraints. The minimal gap between training and testing performance and an OOB score (~83.6%) close to test accuracy indicate good generalization and model stability. Compared to a single decision tree, Random Forest reduces variance by averaging many trees, making it a more stable choice than a single pruned tree on this dataset.
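
The OOB score works as a built-in validation estimate because each tree is scored on the training rows left out of its bootstrap sample. The sketch below compares it with held-out accuracy on a synthetic stand-in dataset (an assumption), so the figures differ from the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=4,
    oob_score=True,  # score each sample using only trees that never saw it
    random_state=42,
)
rf.fit(X_train, y_train)

oob = rf.oob_score_
test_acc = rf.score(X_test, y_test)
print(f"OOB score    : {oob:.3f}")
print(f"test accuracy: {test_acc:.3f}")
```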

# SVM Classifier

In [26]:
svc = SVC()
svc.fit(X_train, y_train)

In [27]:
y_pred = svc.predict(X_test)

In [28]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("classification report:\n ", classification_report(y_test, y_pred))

accuracy score:  0.7366492146596859
classification report:
                precision    recall  f1-score   support

           0       0.67      0.88      0.76       900
           1       0.85      0.61      0.71      1010

    accuracy                           0.74      1910
   macro avg       0.76      0.74      0.73      1910
weighted avg       0.76      0.74      0.73      1910



The SVM classifier achieved lower accuracy (~73.7%) compared to other models. It showed high precision but low recall for class 1, indicating many false negatives. This suggests SVM did not capture the underlying data patterns effectively, possibly due to lack of feature scaling or kernel tuning. Tree-based ensemble methods performed better for this dataset.

# Using Scaling

In [29]:
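# Caution: svc was fitted on the unscaled X_train in In [26]. Predicting on
# X_test_scaled mixes feature spaces, which explains the output below.
# A valid comparison needs svc.fit(X_train_scaled, y_train) first.
y_pred = svc.predict(X_test_scaled)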

In [30]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("classification report:\n ", classification_report(y_test, y_pred))

accuracy score:  0.5287958115183246
classification report:
                precision    recall  f1-score   support

           0       0.00      0.00      0.00       900
           1       0.53      1.00      0.69      1010

    accuracy                           0.53      1910
   macro avg       0.26      0.50      0.35      1910
weighted avg       0.28      0.53      0.37      1910



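Accuracy dropped to ~52.9% because `svc` was fitted on the unscaled `X_train` but asked to predict on `X_test_scaled`. The model therefore predicted class 1 for every sample: recall is 1.00 for class 1 and 0.00 for class 0, and the accuracy equals the share of class 1 in the test set (1010/1910). scikit-learn also emitted `UndefinedMetricWarning` because precision for class 0 is undefined when no sample is predicted as class 0. Scaling only helps when the model is trained and evaluated in the same feature space, which the pipeline in the next section guarantees.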
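
The collapse to a single predicted class can be reproduced in isolation. The sketch below is a hedged illustration on synthetic data: the `+100` offset is an assumption that mimics raw features sitting far from zero (such as `Cholesterol`), and `svc_raw` and `svc_scaled` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; the offset moves raw features far away from zero.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    random_state=42,
)
X = X + 100.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Mismatch: fitted on raw features, asked to predict standardized ones.
svc_raw = SVC().fit(X_train, y_train)
mismatched_pred = svc_raw.predict(X_test_scaled)

# Consistent: fit and predict in the same (scaled) feature space.
svc_scaled = SVC().fit(X_train_scaled, y_train)
consistent_pred = svc_scaled.predict(X_test_scaled)

print("classes predicted under mismatch:", np.unique(mismatched_pred))
print("accuracy, mismatch  :", (mismatched_pred == y_test).mean())
print("accuracy, consistent:", (consistent_pred == y_test).mean())
```

Under the mismatch, the standardized test points lie far from every support vector, so the RBF kernel evaluates to zero and the decision function reduces to the intercept. Every sample then receives the same label.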


# Using Pipeline

In [32]:
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=10, gamma="scale", class_weight="balanced")),
])

In [33]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [34]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("classification report:\n ", classification_report(y_test, y_pred))

accuracy score:  0.9287958115183246
classification report:
                precision    recall  f1-score   support

           0       0.93      0.92      0.92       900
           1       0.93      0.93      0.93      1010

    accuracy                           0.93      1910
   macro avg       0.93      0.93      0.93      1910
weighted avg       0.93      0.93      0.93      1910



After applying feature scaling and using an SVM pipeline with RBF kernel and balanced class weights, the model achieved the best performance (~92.9% accuracy) among all tested algorithms. The classifier showed balanced precision and recall across both classes, indicating strong generalization and effective decision boundary learning.
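
The values `C=10` and `gamma="scale"` were set by hand above. A hedged sketch of how they could be searched inside the same kind of pipeline follows; the dataset is a synthetic stand-in and the `param_grid` values are hypothetical, so the selected parameters are not a recommendation for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

# The "svc__" prefix routes each parameter to the step named "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)  # scaler is refit within every CV fold

print("best params  :", search.best_params_)
print("cv F1        :", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Note that `search.score` reports the metric named in `scoring`, so the last line prints test F1 despite the label; use `accuracy_score` on `search.predict(X_test)` if accuracy is needed.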

# KNN

In [35]:
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train_scaled, y_train)

In [36]:
y_pred = knn_classifier.predict(X_test_scaled)

In [37]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.8717277486910995
precision score:  0.8859737638748738
recall score:  0.8693069306930693
f1 score:  0.8775612193903048


In [38]:
# k=7

knn_classifier = KNeighborsClassifier(n_neighbors=7)
knn_classifier.fit(X_train_scaled, y_train)

y_pred = knn_classifier.predict(X_test_scaled)

print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.8806282722513089
precision score:  0.8902195608782435
recall score:  0.8831683168316832
f1 score:  0.8866799204771372


In [39]:
# k=9

knn_classifier = KNeighborsClassifier(n_neighbors=9)
knn_classifier.fit(X_train_scaled, y_train)

y_pred = knn_classifier.predict(X_test_scaled)

print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.8848167539267016
precision score:  0.8942115768463074
recall score:  0.8871287128712871
f1 score:  0.8906560636182903


In [40]:
# k=12

knn_classifier = KNeighborsClassifier(n_neighbors=12)
knn_classifier.fit(X_train_scaled, y_train)

y_pred = knn_classifier.predict(X_test_scaled)

print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.875392670157068
precision score:  0.9088983050847458
recall score:  0.8495049504950495
f1 score:  0.8781985670419652


KNN performance improved as K increased from 5 to 9, with K=9 giving the best balance between precision and recall (F1 ≈ 0.891). Increasing K further to 12 raised precision (~0.909) but reduced recall (~0.850), indicating a smoother decision boundary that misses some positive cases. Thus, K=9 was selected as the best of the values tried.
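
The K values above were compared on the test set, which can make the chosen K look better than it would on new data. A common alternative is to select K by cross-validation on the training split and touch the test split once at the end. The sketch below shows that pattern on a synthetic stand-in dataset (an assumption); `k_values` mirrors the values tried above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=800, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def make_knn(k):
    """Build a scaler + KNN pipeline for a given number of neighbours."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])


k_values = [5, 7, 9, 12]
cv_f1 = {}
for k in k_values:
    scores = cross_val_score(make_knn(k), X_train, y_train, cv=5, scoring="f1")
    cv_f1[k] = scores.mean()
    print(f"k={k}: mean CV F1 = {cv_f1[k]:.3f}")

# Pick K on cross-validation results only, then evaluate once on the test set.
best_k = max(cv_f1, key=cv_f1.get)
final_model = make_knn(best_k).fit(X_train, y_train)
print("selected k:", best_k)
print("test accuracy:", round(final_model.score(X_test, y_test), 3))
```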

# Naive Bayes

In [41]:
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

In [42]:
y_pred = gnb_model.predict(X_test)

In [43]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("precision score: ", precision_score(y_test, y_pred))
print("recall score: ", recall_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred))

accuracy score:  0.8209424083769633
precision score:  0.8450413223140496
recall score:  0.80990099009901
f1 score:  0.8270980788675429


The Naive Bayes classifier achieved ~82% accuracy with balanced precision and recall. While it performed comparably to Logistic Regression, its independence assumption limits its ability to capture complex feature interactions. Ensemble and kernel-based models achieved higher performance on this dataset.

# Final Conclusion

In this project, multiple classification algorithms were implemented and compared on the same dataset, including Logistic Regression, Decision Tree, Pruned Decision Tree, Random Forest, SVM, KNN, and Naive Bayes.

Initial experiments showed that an unpruned Decision Tree overfitted the data (100% training accuracy). Pruning closed the gap between training and testing accuracy, although at `max_depth=4` it also lowered test accuracy, so the pruning parameters remain a tuning target. Ensemble learning with Random Forest improved on the pruned tree and reduced variance. Distance-based learning using KNN showed strong performance after tuning the value of K, while Naive Bayes and Logistic Regression served as effective baseline models.

The SVM classifier initially underperformed due to scaling and hyperparameter issues, but after applying feature scaling and using an RBF kernel within a pipeline, it achieved the best performance (~93% accuracy) with balanced precision and recall across both classes.

Overall, the comparative analysis shows that model performance depends heavily on preprocessing, hyperparameter tuning, and controlling overfitting. Among all models, the tuned SVM demonstrated the strongest generalization ability and classification accuracy, making it the most suitable model for this dataset.