## Stroke Prediction Project

This notebook is my complete workflow for the Tabular Data project, where I chose to work on stroke prediction using structured data analysis, preprocessing, and machine learning.

### Step 1: Import Libraries
Start by importing the necessary libraries for data analysis and visualization.

In [18]:
# Data analysis and manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Profiling for detailed data analysis
from ydata_profiling import ProfileReport

# Suppress warnings for clean output
import warnings
warnings.filterwarnings('ignore')

### Step 2: Load and Explore the Dataset
Load the stroke dataset and perform an initial exploration to understand its structure.

In [19]:
# Load dataset
df = pd.read_csv('stroke_data.csv')

# Display the first few rows
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [20]:
# Dataset overview
df.info()

# Generate a profile report for detailed insights
profile = ProfileReport(df, title="Stroke Data Profile", explorative=True)
profile.to_file("stroke_data_profile.html")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
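
The `df.info()` output lists 4909 non-null `bmi` values out of 5110 rows, so 201 entries are missing, and `bmi` is the only column with gaps. A minimal sketch of counting missing values per column directly follows. The small frame is made up for illustration and does not contain rows from `stroke_data.csv`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up, not rows from stroke_data.csv
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Number of missing values per column
missing_counts = sample.isnull().sum()

# Share of missing values per column (between 0 and 1)
missing_share = sample.isnull().mean()

print(missing_counts)
print(missing_share)
```

On the real data, the same two calls on `df` should report the missing `bmi` entries before any imputation is applied.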



### Step 3: Data Cleaning and Preprocessing
Handle missing values, encode categorical variables, and prepare the dataset for modeling.

In [21]:
# Drop irrelevant columns (e.g., 'id')
df.drop(['id'], axis=1, inplace=True)

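# Handle missing values in BMI by imputing with the median
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in pandas)
df['bmi'] = df['bmi'].fillna(df['bmi'].median())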

# Encode categorical variables using label encoding
categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
from sklearn.preprocessing import LabelEncoder

label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Remove duplicates
df.drop_duplicates(inplace=True)

# Display the cleaned dataset
df.head()

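|   | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 67.0 | 0 | 1 | 1 | 2 | 1 | 228.69 | 36.6 | 1 | 1 |
| 1 | 0 | 61.0 | 0 | 0 | 1 | 3 | 0 | 202.21 | 28.1 | 2 | 1 |
| 2 | 1 | 80.0 | 0 | 1 | 1 | 2 | 0 | 105.92 | 32.5 | 2 | 1 |
| 3 | 0 | 49.0 | 0 | 0 | 1 | 2 | 1 | 171.23 | 34.4 | 3 | 1 |
| 4 | 0 | 79.0 | 1 | 0 | 1 | 3 | 0 | 174.12 | 24.0 | 2 | 1 |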
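
The integer codes in the cleaned table come from `LabelEncoder`, which sorts the labels it sees and assigns codes by position. A minimal sketch of recovering that mapping is shown here for the smoking status labels. The label list is the one used later in this notebook; the `mapping` name is only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["never smoked", "formerly smoked", "smokes", "Unknown"])

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
```

In the notebook itself, the same lookup can be applied to any entry of `label_encoders`, for example `label_encoders['smoking_status'].classes_`.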


### Step 4: Exploratory Data Analysis (EDA)
Visualize data distributions and relationships to gain insights.

In [22]:
# Plot the distribution of the target variable
sns.countplot(x='stroke', data=df)
plt.title("Distribution of Stroke Outcomes")
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
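
The count plot shows the class balance visually. A minimal sketch of getting the same information as numbers follows, using a made-up target column rather than the real `stroke` values.

```python
import pandas as pd

# Illustrative target column; the values are made up, not taken from the dataset
stroke_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="stroke")

# Absolute number of rows per class
class_counts = stroke_demo.value_counts()

# Share of rows per class (sums to 1)
class_share = stroke_demo.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

Running `df['stroke'].value_counts(normalize=True)` on the real data gives the exact share of positive cases, which is useful context for the resampling in the next step.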

### Step 5: Model Development
Build, train, and evaluate a Random Forest classifier for stroke prediction.

In [23]:
# Split the data into features and target
X = df.drop('stroke', axis=1)
y = df['stroke']

# Apply SMOTE to handle class imbalance
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

# Train a Random Forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9406924922865958
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      1457
           1       0.92      0.96      0.94      1460

    accuracy                           0.94      2917
   macro avg       0.94      0.94      0.94      2917
weighted avg       0.94      0.94      0.94      2917
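
One caveat about the numbers above: SMOTE is applied before the train/test split, so the test set is itself balanced (support 1457 vs 1460) and contains synthetic samples. The reported accuracy may therefore overstate how the model performs on real, imbalanced data. A minimal sketch of the alternative order is shown below: split first, then resample only the training portion. It rests on two assumptions. The data is synthetic (`make_classification`), and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE to keep the example dependency-light. With imbalanced-learn, the equivalent step would be calling `smote.fit_resample` on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the stroke features (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# 1. Split first, stratified, so the test set keeps the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# 2. Oversample the minority class in the training portion only
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=42
)
X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

# 3. Train on the balanced data, evaluate on the untouched test set
clf = RandomForestClassifier(random_state=42)
clf.fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

With this order, recall and precision for class 1 are measured on real minority samples only, so they are usually lower than the figures above but closer to what deployment would see.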



### Step 6: Save the Model
Save the trained model and the label encoders for deployment.

In [24]:
import joblib

# Save the model
joblib.dump(model, "stroke_model.pkl")

# Save the label encoders
for col, le in label_encoders.items():
    joblib.dump(le, f"{col}_encoder.pkl")

print("Model and encoders saved!")

Model and encoders saved!


In [25]:
from sklearn.preprocessing import LabelEncoder
import joblib

# Create and save encoders with all expected categories
le_gender = LabelEncoder()
le_gender.fit(['Male', 'Female'])
joblib.dump(le_gender, "gender_encoder.pkl")

le_ever_married = LabelEncoder()
le_ever_married.fit(['No', 'Yes'])
joblib.dump(le_ever_married, "ever_married_encoder.pkl")

le_work_type = LabelEncoder()
le_work_type.fit(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'])
joblib.dump(le_work_type, "work_type_encoder.pkl")

le_residence_type = LabelEncoder()
le_residence_type.fit(['Urban', 'Rural'])
joblib.dump(le_residence_type, "residence_type_encoder.pkl")

le_smoking_status = LabelEncoder()
le_smoking_status.fit(['never smoked', 'formerly smoked', 'smokes', 'Unknown'])
joblib.dump(le_smoking_status, "smoking_status_encoder.pkl")

['smoking_status_encoder.pkl']
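
The cells above write one file per encoder, and the file names differ slightly between the two cells (`Residence_type_encoder.pkl` vs `residence_type_encoder.pkl`). One possible alternative is to store all encoders in a single dictionary and dump that once. The sketch below assumes a temporary directory and a hypothetical file name `encoders.pkl`; neither is part of the original project.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit encoders on the full set of expected categories (two columns shown for brevity)
encoders_demo = {
    "gender": LabelEncoder().fit(["Male", "Female"]),
    "Residence_type": LabelEncoder().fit(["Urban", "Rural"]),
}

# Hypothetical output location; a temp dir keeps the sketch side-effect free
bundle_path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")

# One dump and one load cover every encoder, keyed by column name
joblib.dump(encoders_demo, bundle_path)
restored = joblib.load(bundle_path)

print({col: list(le.classes_) for col, le in restored.items()})
```

Keying the bundle by the DataFrame column names means the loading code does not need to know any file naming convention.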

### Step 7: Preprocessing and Unit Tests

In [26]:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess_data(data):
    # Create LabelEncoders
    label_encoders = {}
    categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
    
    # Impute missing BMI values (assign back; inplace fillna on a selected column is deprecated in pandas)
    data['bmi'] = data['bmi'].fillna(data['bmi'].median())
    
    # Encode categorical columns
    for col in categorical_columns:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col].astype(str))
        label_encoders[col] = le
        
    return data, label_encoders
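
`preprocess_data` fits new encoders and recomputes the BMI median on whatever data it receives, which is appropriate for training data. At prediction time, the values learned during training are usually reused instead, because re-fitting on a single row would assign different codes. A minimal sketch of such a transform-only step follows. The helper name `apply_preprocessing` and its parameters are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def apply_preprocessing(data, encoders, bmi_median):
    """Hypothetical helper: apply already-fitted encoders and a stored BMI median."""
    out = data.copy()  # leave the caller's frame untouched
    out["bmi"] = out["bmi"].fillna(bmi_median)
    for col, le in encoders.items():
        out[col] = le.transform(out[col].astype(str))
    return out


# "Training" data: fit the encoder and compute the median once
train = pd.DataFrame({"gender": ["Male", "Female", "Male"], "bmi": [24.5, 25.5, np.nan]})
encoders_fit = {"gender": LabelEncoder().fit(train["gender"])}
bmi_median = train["bmi"].median()

# New row at prediction time: reuse the fitted encoder and the stored median
new_row = pd.DataFrame({"gender": ["Male"], "bmi": [np.nan]})
encoded = apply_preprocessing(new_row, encoders_fit, bmi_median)
print(encoded)
```

Here `Male` keeps the code it had during training. Fitting a fresh encoder on the single new row would instead assign it code 0.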


In [27]:

import unittest
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

class TestPreprocessing(unittest.TestCase):
    
    def setUp(self):
        data = {
            "gender": ["Male", "Female", "Male"],
            "age": [34, 45, 29],
            "hypertension": [0, 1, 0],
            "heart_disease": [0, 0, 1],
            "ever_married": ["Yes", "No", "Yes"],
            "work_type": ["Private", "Self-employed", "Private"],
            "Residence_type": ["Urban", "Rural", "Urban"],
            "avg_glucose_level": [120.0, 140.5, 130.0],
            "bmi": [24.5, 25.5, np.nan],
            "smoking_status": ["never smoked", "smokes", "never smoked"]
        }
        self.df = pd.DataFrame(data)

    def test_missing_values(self):
        # The fixture must contain a missing value, otherwise the check below proves nothing
        self.assertTrue(self.df.isnull().values.any(), "Test fixture should contain missing values.")
        processed_data, _ = preprocess_data(self.df)
        self.assertFalse(processed_data.isnull().any().any(), "Missing values found after preprocessing.")

    def test_label_encoding(self):
        # Run the function under test so the check covers preprocess_data itself
        processed_data, encoders = preprocess_data(self.df.copy())
        for column in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
            self.assertIn(column, encoders, f"No encoder returned for column {column}.")
            self.assertTrue(pd.api.types.is_integer_dtype(processed_data[column]), f"Column {column} is not encoded correctly.")

if __name__ == '__main__':
    unittest.main(argv=[''], verbosity=2, exit=False)


test_label_encoding (__main__.TestPreprocessing.test_label_encoding) ... ok
test_missing_values (__main__.TestPreprocessing.test_missing_values) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.015s

OK


### Step 8: Deployment with Gradio
Create an interactive interface for predictions.

In [12]:
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()
# checkout git
# To get a token: log in to Hugging Face, open Settings > Access Tokens, and generate a new token




In [28]:
from huggingface_hub import notebook_login

# Log in to Hugging Face (interactive method)
notebook_login()

# Path to your model file
model_path = "stroke_model.pkl"

# Upload the model to the repository
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj=model_path,  # Path to the model file
    path_in_repo="stroke_model.pkl",  # The file name you want in the repo
    repo_id="robinho46/Project",  # Your Hugging Face repo
    token=None  # No need to pass the token if you're using notebook_login()
)

print("Model uploaded successfully to: https://huggingface.co/robinho46/Project")

Model uploaded successfully to: https://huggingface.co/robinho46/Project


In [15]:
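import os
from huggingface_hub import upload_file

# Read the token from an environment variable instead of hard-coding it in the notebook
# (assumes the token was exported as HF_TOKEN before starting the notebook)
hf_token = os.environ["HF_TOKEN"]  # Your Hugging Face token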

# Path to your model file
model_path = "stroke_model.pkl"

# Upload the model to the repository
upload_file(
    path_or_fileobj=model_path,  # Path to the model file
    path_in_repo="stroke_model.pkl",  # The file name you want in the repo
    repo_id="robinho46/Project",  # Your Hugging Face repo
    token=hf_token  # Pass the token directly
)

print("Model uploaded successfully to: https://huggingface.co/robinho46/Project")

No files have been modified since last commit. Skipping to prevent empty commit.


Model uploaded successfully to: https://huggingface.co/robinho46/Project


In [17]:
import joblib
import numpy as np
import pandas as pd

# Load the trained model saved in Step 6 (needed when this cell runs in a fresh kernel)
model = joblib.load("stroke_model.pkl")

# Load and update encoders globally once
def load_and_update_encoders():
    encoders = {
        "gender": joblib.load("gender_encoder.pkl"),
        "ever_married": joblib.load("ever_married_encoder.pkl"),
        "work_type": joblib.load("work_type_encoder.pkl"),
        "Residence_type": joblib.load("residence_type_encoder.pkl"),
        "smoking_status": joblib.load("smoking_status_encoder.pkl"),
    }
    # Ensure all necessary classes are included.
    # The encoders saved in Step 6 already contain these defaults, so this is only a safeguard.
    for category, default in zip(
        ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'],
        ['Female', 'No', 'Private', 'Urban', 'Unknown']
    ):
        if default not in encoders[category].classes_:
            encoders[category].classes_ = np.append(encoders[category].classes_, default)
    return encoders

# Load and update encoders globally
encoders = load_and_update_encoders()

# Define the prediction function
def predict_stroke(gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status):
    # Encode inputs; column names and order match the training features
    data = pd.DataFrame([{
        'gender': encoders['gender'].transform([gender])[0],
        'age': age,
        'hypertension': hypertension,
        'heart_disease': heart_disease,
        'ever_married': encoders['ever_married'].transform([ever_married])[0],
        'work_type': encoders['work_type'].transform([work_type])[0],
        'Residence_type': encoders['Residence_type'].transform([Residence_type])[0],
        'avg_glucose_level': avg_glucose_level,
        'bmi': bmi,
        'smoking_status': encoders['smoking_status'].transform([smoking_status])[0],
    }])

    # Make prediction
    prediction = model.predict(data)[0]
    return "Stroke" if prediction == 1 else "No Stroke"

In [10]:
import gradio as gr

# Create Gradio interface
interface = gr.Interface(
    fn=predict_stroke,  # The function to handle predictions
    inputs=[
        gr.Radio(["Male", "Female"], label="Gender"),
        gr.Number(label="Age"),
        gr.Radio([0, 1], label="Hypertension (0: No, 1: Yes)"),
        gr.Radio([0, 1], label="Heart Disease (0: No, 1: Yes)"),
        gr.Radio(["No", "Yes"], label="Ever Married"),
        gr.Dropdown(["Private", "Self-employed", "Govt_job", "children", "Never_worked"], label="Work Type"),
        gr.Dropdown(["Urban", "Rural"], label="Residence Type"),
        gr.Number(label="Average Glucose Level"),
        gr.Number(label="BMI"),
        gr.Dropdown(["never smoked", "formerly smoked", "smokes", "Unknown"], label="Smoking Status"),
    ],
    outputs=gr.Textbox(label="Prediction"),
    title="Stroke Prediction",
)

# Launch the Gradio interface
interface.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


