# Elite Tech Intern Task 1: Predictive Modeling

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [4]:
# Load the Pima Indians Diabetes Dataset
# Dataset source: https://www.kaggle.com/uciml/pima-indians-diabetes-database
# The CSV mirror used below has no header row, so column names are passed explicitly.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'
]
data = pd.read_csv(url, names=columns)

In [6]:
print("Dataset Information:")
data.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## The dataset contains 768 rows and 9 columns.

In [8]:
print("\nFirst 5 Rows of Data:")
print(data.head())


First 5 Rows of Data:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [10]:
# Define features (X) and target (y)
feature_columns = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'
]
target_column = 'Outcome'

X = data[feature_columns]
y = data[target_column]

In [14]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model (Random Forest in this case)
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

In [17]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Model Evaluation:
Accuracy: 0.7207792207792207

Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.78      0.78        99
           1       0.61      0.62      0.61        55

    accuracy                           0.72       154
   macro avg       0.70      0.70      0.70       154
weighted avg       0.72      0.72      0.72       154


Confusion Matrix:
 [[77 22]
 [21 34]]


## Overall Performance 
Accuracy: The model achieves approximately 72% accuracy on the 154-sample test set, meaning it correctly classified 111 of those cases as diabetic or non-diabetic. While not perfect, it is a reasonable starting point for a baseline model.
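
To put the 72% figure in context, it can help to compare it with a trivial classifier that always predicts the majority class. The sketch below is only an illustration: it uses the test-set support counts from the classification report (99 non-diabetic, 55 diabetic), and the variable names are introduced here rather than taken from the notebook.

```python
# Test-set class counts taken from the classification report above.
support = {0: 99, 1: 55}
model_accuracy = 0.7207792207792207  # value printed by accuracy_score

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs in the test set.
total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Gain over baseline: {model_accuracy - baseline_accuracy:.3f}")
```

On these counts the always-non-diabetic baseline scores about 64%, so the Random Forest gains roughly 8 percentage points over it.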

## Class-wise Performance
### Class 0 (Non-diabetic):
- Precision: 0.79 – When the model predicts a person is non-diabetic, it is correct 79% of the time.
- Recall: 0.78 – The model correctly identifies 78% of the actual non-diabetic cases.
- F1-score: 0.78 – The harmonic mean of precision and recall, which are nearly equal for the non-diabetic class.

### Class 1 (Diabetic):
- Precision: 0.61 – When the model predicts a person is diabetic, it is correct 61% of the time.
- Recall: 0.62 – The model correctly identifies 62% of the actual diabetic cases.
- F1-score: 0.61 – Noticeably lower than for the non-diabetic class, indicating room for improvement on diabetic predictions.

## Confusion Matrix
- True Positives (TP): 34 – The model correctly predicted 34 diabetic cases.
- True Negatives (TN): 77 – The model correctly predicted 77 non-diabetic cases.
- False Positives (FP): 22 – The model incorrectly predicted 22 non-diabetic cases as diabetic.
- False Negatives (FN): 21 – The model missed 21 diabetic cases, predicting them as non-diabetic.
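
The four counts above are enough to reproduce the accuracy, precision, and recall values in the classification report. The following is a minimal sketch of that arithmetic. It assumes scikit-learn's layout for `confusion_matrix` (rows are actual classes, columns are predicted classes) and treats class 1 (diabetic) as the positive class; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class.
cm = [[77, 22],
      [21, 34]]

tn, fp = cm[0]  # actual non-diabetic: predicted 0, predicted 1
fn, tp = cm[1]  # actual diabetic:     predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Diabetic class (class 1)
precision_diabetic = tp / (tp + fp)
recall_diabetic = tp / (tp + fn)

# Non-diabetic class (class 0), treating class 0 as the class of interest
precision_non_diabetic = tn / (tn + fn)
recall_non_diabetic = tn / (tn + fp)

print(f"Accuracy: {accuracy:.2f}")
print(f"Class 0 precision / recall: {precision_non_diabetic:.2f} / {recall_non_diabetic:.2f}")
print(f"Class 1 precision / recall: {precision_diabetic:.2f} / {recall_diabetic:.2f}")
```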

## Key Observations
### Class Imbalance:
The test set contains more non-diabetic cases (99) than diabetic cases (55). This imbalance may explain the better performance on the majority class (non-diabetic) and the weaker performance on the minority class (diabetic).
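
Two common ways to account for imbalance in scikit-learn are a stratified split and class weighting. The sketch below illustrates both on synthetic data generated with `make_classification`, which stands in for the diabetes data so the example runs without a download. The 65/35 class split and all variable names are assumptions for illustration, and the notebook's reported results were not produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 768 rows, 8 features, roughly 65% / 35% classes.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# stratify=y_demo keeps the class proportions (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight="balanced" weights samples inversely to their class frequency,
# so errors on the minority class count for more during training.
balanced_model = RandomForestClassifier(class_weight="balanced", random_state=42)
balanced_model.fit(X_tr, y_tr)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Positive rate in train: {train_pos_rate:.3f}, in test: {test_pos_rate:.3f}")
```

Whether either option actually improves recall on the diabetic class would need to be checked on the real data.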

### Diabetic Predictions (Class 1):
Lower precision and recall for the diabetic class indicate that the model struggles with this minority class. In the test set it missed 21 of the 55 diabetic cases (false negatives), which would be a critical failure mode in real-world applications.

### Weighted Performance:
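The weighted average F1-score is 0.72, which aligns with the overall accuracy. Because each class is weighted by its support, this figure is dominated by the non-diabetic class; the macro average (0.70) gives the weaker diabetic class equal weight.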
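
The difference between the macro and weighted averages can be worked out by hand from the confusion matrix counts. This is a small sketch of that calculation, using illustrative variable names and the standard identity F1 = 2·TP / (2·TP + FP + FN), applied with each class treated as the positive class in turn.

```python
# Counts from the confusion matrix above (class 1 = diabetic).
tn, fp, fn, tp = 77, 22, 21, 34

# Per-class F1: for class 0 the roles of TP/TN and FP/FN are swapped.
f1_diabetic = 2 * tp / (2 * tp + fp + fn)
f1_non_diabetic = 2 * tn / (2 * tn + fn + fp)

support_non_diabetic = tn + fp  # 99 actual non-diabetic cases
support_diabetic = fn + tp      # 55 actual diabetic cases
total = support_non_diabetic + support_diabetic

# Macro average: every class counts equally.
macro_f1 = (f1_non_diabetic + f1_diabetic) / 2

# Weighted average: every class counts in proportion to its support.
weighted_f1 = (
    f1_non_diabetic * support_non_diabetic + f1_diabetic * support_diabetic
) / total

print(f"F1 (class 0): {f1_non_diabetic:.2f}, F1 (class 1): {f1_diabetic:.2f}")
print(f"Macro F1: {macro_f1:.2f}, Weighted F1: {weighted_f1:.2f}")
```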

In [19]:
# Feature importance
feature_importances = pd.DataFrame(
    {'Feature': feature_columns, 'Importance': model.feature_importances_}
).sort_values(by='Importance', ascending=False)
print("\nFeature Importances:\n", feature_importances)


Feature Importances:
                     Feature  Importance
1                   Glucose    0.258864
5                       BMI    0.169984
7                       Age    0.140931
6  DiabetesPedigreeFunction    0.123768
2             BloodPressure    0.088134
0               Pregnancies    0.076551
4                   Insulin    0.076122
3             SkinThickness    0.065646


## Key Observations
Top Features:
- Glucose (Importance: 0.258864): Glucose is the most important feature, which aligns with medical understanding, as blood glucose levels are a primary indicator of diabetes.
- BMI (Importance: 0.169984): BMI is the second most important feature, consistent with the role body mass plays in diabetes risk.
- Age (Importance: 0.140931): Age is the third most important feature, likely because the risk of diabetes increases with age.
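
The values above come from `feature_importances_`, which for a Random Forest is computed from impurity reductions on the training data. One possible cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below shows how the two rankings could be compared. It uses synthetic data from `make_classification` as a stand-in so it runs without a download; the data, the `n_repeats=10` setting, and all names are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 768 rows, 8 features.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

# Feature indices ordered from most to least important under each method.
impurity_ranking = rf.feature_importances_.argsort()[::-1]
permutation_ranking = result.importances_mean.argsort()[::-1]

print("Impurity-based ranking:   ", impurity_ranking.tolist())
print("Permutation-based ranking:", permutation_ranking.tolist())
```

If both rankings put the same features on top for the real data, that would lend more confidence to the observations above.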