**Task**
#####This is a binary classification problem: the prediction is whether a person had a brain stroke (1) or not (0).
**Dataset**
#####Both the train and test datasets were generated from a deep learning model trained on the Stroke Prediction Dataset.
**Attributes**
#####id: unique identifier
#####gender: "Male", "Female" or "Other"
#####age: age of the patient
#####hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
#####heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
#####ever_married: "No" or "Yes"
#####work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
#####Residence_type: "Rural" or "Urban"
#####avg_glucose_level: average glucose level in blood
#####bmi: body mass index
#####smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
#####stroke: 1 if the patient had a stroke or 0 if not

**Library imports**

In [None]:
import numpy as np             #provides support for multi-dimensional arrays and mathematical functions.
import pandas as pd            #provides data manipulation and analysis tools.
import matplotlib.pyplot as plt #plotting library for creating visualizations
import seaborn as sns           #data visualization library based on matplotlib, providing additional tools for creating statistical graphics
import missingno as msno        #visualization tool for exploring missing data in datasets

**Loading dataset**

In [None]:
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv")
original_data = pd.read_csv("stroke_data.csv")

In [None]:
#check for missing values
df_train.isna().sum()

In [None]:
df_test.isna().sum()

In [None]:
original_data.isna().sum()

In [None]:
df_train.head(3)

In [None]:
df_test.head(3)

In [None]:
original_data.head(3)

**Merge train and original dataset**

In [None]:
#stack the original and train rows with a fresh index; drop the id column since we don't need it for visualization or modelling
original_data = (pd.concat([original_data,df_train],axis=0,ignore_index=True)).drop(columns=["id"])
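
A minimal sketch of the merge step on two hypothetical toy frames (`orig_toy` and `train_toy` are made up for illustration and are not part of the notebook). It suggests why renumbering the rows can matter: `pd.concat` keeps each frame's own row labels unless `ignore_index=True` is passed.

```python
import pandas as pd

# Hypothetical toy frames standing in for the original and synthetic data.
orig_toy = pd.DataFrame({"id": [1, 2], "age": [67.0, 61.0], "stroke": [1, 1]})
train_toy = pd.DataFrame({"id": [0, 1, 2], "age": [28.0, 33.0, 42.0], "stroke": [0, 0, 0]})

# Stack the rows, renumber the index, and drop the identifier column.
merged = pd.concat([orig_toy, train_toy], axis=0, ignore_index=True).drop(columns=["id"])

# Without ignore_index the row labels 0 and 1 each appear twice.
stacked = pd.concat([orig_toy, train_toy], axis=0)

print(merged.shape)
print(merged.index.is_unique)
print(stacked.index.is_unique)
```

A unique index is not required for the steps in this notebook, but it avoids surprises in label-based operations later on.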

In [None]:
original_data.info()

In [None]:
#count null values per column after the merge
original_data.isnull().sum()

In [None]:
#separate categorical and numerical variables for visualization
numerical_vars = original_data.select_dtypes(np.number).drop(columns=["stroke"])
categorical_vars = original_data.select_dtypes("object")

In [None]:
numerical_vars.head(3)

In [None]:
categorical_vars.tail(3)

In [None]:
#list the distinct values of each categorical variable
#pd.unique returns the unique values of a column in order of appearance
print(pd.unique(categorical_vars["gender"]))
print(pd.unique(categorical_vars["smoking_status"]))
print(pd.unique(categorical_vars["ever_married"]))
print(pd.unique(categorical_vars["work_type"]))
print(pd.unique(categorical_vars["Residence_type"]))
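
The same check can be written as a loop so that no column name has to be typed by hand. The sketch below uses a hypothetical toy frame (`toy`) with a few of the dataset's column names; `value_counts` is added because it also reports how often each level occurs.

```python
import pandas as pd

# Hypothetical toy sample reusing some of the dataset's column names.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban", "Urban"],
})

# Collect the distinct levels of every column in one pass.
levels = {col: sorted(toy[col].unique()) for col in toy.columns}

# value_counts additionally shows how many rows fall into each level.
gender_counts = toy["gender"].value_counts()

for col, values in levels.items():
    print(col, values)
print(gender_counts)
```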

In [None]:
numerical_vars.describe().T.style.background_gradient()

#####The .describe() method generates summary statistics for each numerical column of the numerical_vars dataframe. The .T attribute transposes the summary table, so that each variable becomes a row and each statistic a column.

#####The .style.background_gradient() method then applies a background color gradient to the resulting table. By default the gradient is computed separately for each column of the table, so within each statistic lower values get a lighter color and higher values a darker one.
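
As a small illustration of the transposed layout, the sketch below runs `describe().T` on a hypothetical three-row frame (the values are invented). The styling call is left out so that the example depends on pandas alone.

```python
import pandas as pd

# Hypothetical values, chosen only to make the statistics easy to verify by hand.
toy = pd.DataFrame({"age": [20.0, 40.0, 60.0], "bmi": [22.0, 28.0, 34.0]})

summary = toy.describe().T  # one row per variable, one column per statistic

print(summary.index.tolist())
print(summary.columns.tolist())
print(summary.loc["age", "mean"], summary.loc["bmi", "50%"])
```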

**EDA**

In [None]:
#histplot
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
sns.histplot(numerical_vars["age"])
plt.subplot(1,3,2)
g1=sns.histplot(numerical_vars["avg_glucose_level"],kde=True)
g1.set(ylabel=None)
plt.subplot(1,3,3)
g1=sns.histplot(numerical_vars["bmi"],kde=True)
g1.set(ylabel=None)
plt.show()

##### numerical_vars contains the numerical variables, such as age, average glucose level, and BMI. This code produces a figure with three subplots: the first shows a histogram of the "age" variable, while the second and third show histograms with KDE overlays of the "avg_glucose_level" and "bmi" variables, respectively. The g1.set(ylabel=None) call removes the y-axis label from the second and third plots.
##### We can see that most ages fall between 20 and 60. Glucose levels look mostly normal, but there are some high values; the same holds for bmi.

In [None]:
#boxplot
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
sns.boxplot(numerical_vars["age"]);
plt.subplot(1,3,2)
sns.boxplot(numerical_vars["avg_glucose_level"]);
plt.subplot(1,3,3)
sns.boxplot(numerical_vars["bmi"]);

#####This code produces a figure with three subplots, each showing a boxplot of one of the numerical variables.
#####avg_glucose_level looks mostly normal, but the boxplot shows a number of unusually high values. bmi also has high values, some of them greater than 40; people with a bmi above 40 may carry a high obesity-related risk.
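
The points a boxplot draws beyond its whiskers follow, with the default settings, the 1.5 × IQR rule. The sketch below applies that rule by hand to hypothetical BMI values (not taken from the dataset); the helper name `iqr_outliers` is made up for this example.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than whis * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical BMI values with one clearly extreme entry.
bmi_toy = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 58.0])
flagged = iqr_outliers(bmi_toy)

print(flagged.tolist())
```

Applied to the real bmi or avg_glucose_level columns, the same helper would give a count of the points that appear as outliers in the plots.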

In [None]:
#countplot
plt.figure(figsize=(17,6))

plt.subplot(1,3,1)
plt.title("Smoking Status by Gender",fontsize=18)
plt.ylabel("Count",fontsize=18)
plt.xlabel("Gender",fontsize=18)
sns.countplot(x=categorical_vars["gender"],hue="smoking_status",data=categorical_vars);
plt.legend(loc=1, prop={'size': 11})

plt.subplot(1,3,2)
plt.title("Work Type by Gender",fontsize=18)
plt.xlabel("Gender",fontsize=18)
g1=sns.countplot(x=categorical_vars["gender"],hue="work_type",data=categorical_vars);
g1.set(ylabel=None)
plt.legend(loc=1, prop={'size': 11})

plt.subplot(1,3,3)
plt.title("Residence Type by Gender",fontsize=18)
plt.xlabel("Gender",fontsize=18)
g1=sns.countplot(x=categorical_vars["gender"],hue="Residence_type",data=categorical_vars);
plt.legend(loc=1, prop={'size': 11})
g1.set(ylabel=None)

plt.show()

#####The first sns.countplot call, with the "gender" column on the x-axis and hue="smoking_status", plots the count of each smoking status within each gender in the first subplot. The plt.title, plt.xlabel, plt.ylabel, and plt.legend calls add a title, axis labels, and a legend to the subplot.
#####The second call uses hue="work_type" to count work types by gender, and the third uses hue="Residence_type" to count residence types by gender.
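
The bars of a grouped countplot correspond to the cells of a contingency table, so `pd.crosstab` can serve as a numeric counterpart. This is a sketch on a hypothetical toy frame; `normalize="index"` turns the counts into within-gender shares, which are easier to compare when the groups differ in size.

```python
import pandas as pd

# Hypothetical toy sample reusing two of the dataset's column names.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "smoking_status": ["smokes", "never smoked", "smokes", "never smoked", "never smoked"],
})

# Raw counts: one row per gender, one column per smoking status.
counts = pd.crosstab(toy["gender"], toy["smoking_status"])

# Row-wise shares: each gender's row sums to 1.
shares = pd.crosstab(toy["gender"], toy["smoking_status"], normalize="index")

print(counts)
print(shares.round(2))
```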

In [None]:
#scatterplot
plt.figure(figsize=(20,6))

plt.subplot(1,3,1)
plt.title("Age and Bmi Relationship",fontsize=18)
plt.xlabel("Age",fontsize=18)
plt.ylabel("Bmi",fontsize=18)
sns.scatterplot(x="age",y="bmi",data=original_data)

plt.subplot(1,3,2)
plt.title("Age and Average Glucose Level Relationship",fontsize=18)
plt.xlabel("Age",fontsize=18)
plt.ylabel("Average Glucose Level",fontsize=18)
sns.scatterplot(x="age",y="avg_glucose_level",data=original_data)

plt.subplot(1,3,3)
plt.title("Average Glucose Level and Bmi Relationship",fontsize=18)
plt.xlabel("Average Glucose Level",fontsize=18)
plt.ylabel("Bmi",fontsize=18)
sns.scatterplot(x="avg_glucose_level",y="bmi",data=original_data)

plt.show()

#####There are three graphs. In the first, people between 20 and 60 have the highest bmi values in this data. In the second, people older than 40 tend to have higher glucose levels, which may indicate diabetes. In the last, low glucose values tend to go together with low bmi values, so the two variables may move in the same direction.

In [None]:
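#numeric_only=True (pandas >= 1.5) skips the text columns, which newer pandas versions refuse to correlate
df_corr = original_data.corr(numeric_only=True)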

#####This creates a correlation matrix of the numerical variables in the dataset. It helps identify which variables are strongly correlated with each other, which can be useful in feature selection and modeling.

In [None]:
sns.set(font_scale=0.8)
# plt.figure(figsize=(20,20))
sns.heatmap(df_corr,annot=True,cmap="coolwarm",square=True)
plt.show()

#####The heatmap shows the correlation matrix between the numerical variables. The color scale represents the correlation coefficient ranging from -1 to 1, where blue indicates negative correlation and red indicates positive correlation.

#####Based on the heatmap, we can see that there are no strong correlations between the variables. However, we can observe a mild positive correlation between age and hypertension, as well as between age and heart disease. There is also a mild positive correlation between average glucose level and stroke, and between age and stroke.
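
To read the correlations with the target as numbers rather than colors, the target's column of the matrix can be sorted by absolute value. The sketch below does this on a hypothetical five-row frame (invented values); it assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Hypothetical toy data; the text column is ignored by numeric_only=True.
toy = pd.DataFrame({
    "age": [25, 40, 55, 70, 85],
    "bmi": [30.0, 22.0, 28.0, 24.0, 27.0],
    "stroke": [0, 0, 0, 1, 1],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, strongest first.
target_corr = corr["stroke"].drop("stroke")
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)

print(target_corr.round(3))
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful feature.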

**Modelling**

In [None]:
#Prepare the dataset for the model: check for nulls, fill them with the mean, then separate the dependent and independent variables.
original_data.isna().sum()

In [None]:
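#fill only the bmi column; a frame-wide fillna would put the bmi mean into any other column that has nulls
original_data["bmi"] = original_data["bmi"].fillna(original_data["bmi"].mean())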
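
A small sketch of the difference between filling the whole frame and filling one column, on a hypothetical toy frame in which a second column also has a missing value (in the notebook's data this may not occur, so the example is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a gap in both columns.
toy = pd.DataFrame({
    "bmi": [20.0, np.nan, 30.0],
    "avg_glucose_level": [90.0, np.nan, 110.0],
})
bmi_mean = toy["bmi"].mean()  # NaN is skipped, so this is 25.0

# Frame-wide fill: the glucose gap also receives the BMI mean.
filled_all = toy.fillna(bmi_mean)

# Column-wise fill: only bmi is imputed, the glucose gap stays visible.
filled_bmi = toy.assign(bmi=toy["bmi"].fillna(bmi_mean))

print(filled_all)
print(filled_bmi)
```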

In [None]:
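#select the target by name instead of relying on it being the last column
X = original_data.drop(columns=["stroke"])
y = original_data["stroke"]
#one-hot encode the categorical columns; drop_first removes one redundant dummy per variable
X = pd.get_dummies(X,drop_first=True)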
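
The effect of `drop_first=True` is easiest to see on a tiny example. In this sketch (hypothetical toy frame, with the columns to encode listed explicitly) each categorical variable with k levels becomes k-1 indicator columns, and the dropped level is represented by all of its indicators being zero.

```python
import pandas as pd

# Hypothetical toy sample reusing some of the dataset's column names.
toy = pd.DataFrame({
    "age": [30, 50, 70],
    "Residence_type": ["Rural", "Urban", "Rural"],
    "gender": ["Male", "Female", "Other"],
})

encoded = pd.get_dummies(toy, columns=["Residence_type", "gender"], drop_first=True)

# "Rural" and "Female" are the first levels alphabetically, so they get no column.
print(encoded.columns.tolist())
```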

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_percentage_error,roc_auc_score,roc_curve
from sklearn.model_selection import GridSearchCV

Preprocessing the data: split the data into training and testing sets using the train_test_split function from sklearn. This allows us to evaluate the performance of the logistic regression model on unseen data. Because test_size is not passed, the function uses its default and holds out 25% of the rows, giving a 75% training and 25% testing split, with a random state of 42 for reproducibility.

In [None]:
X_train, X_test, y_train , y_test = train_test_split(X,y,random_state=42)
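
A sketch on synthetic labels showing how the split size is controlled. The arrays `X_toy` and `y_toy` are invented, and the `stratify` argument is an optional addition that the notebook does not use; it keeps the share of positive cases equal in both parts, which can help when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positive labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# Default behaviour: a quarter of the rows is held out.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# Explicit 80/20 split that preserves the class ratio in both parts.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))
print(len(X_tr2), len(X_te2), int(y_te2.sum()))
```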

**Create ROC curve**

In [None]:
model = LogisticRegression(solver="liblinear").fit(X_train,y_train)

#####This code initializes a logistic regression model and trains it on the training data using the solver "liblinear". The solver parameter specifies the algorithm to use in the optimization problem, and "liblinear" is a solver that is commonly used for small to medium-sized datasets. The .fit() method fits the logistic regression model to the training data. The training data X_train contains the features, and y_train contains the target variable or labels.

In [None]:

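#use predicted probabilities (not hard 0/1 labels) so the AUC matches the plotted curve,
#and evaluate on the held-out test set rather than on rows the model was trained on
y_score = model.predict_proba(X_test)[:,1]
roc_auc=roc_auc_score(y_test,y_score)
fpr,tpr,thresholds = roc_curve(y_test,y_score)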
plt.figure()
plt.plot(fpr,tpr,label="AUC (area=%0.2f)"%roc_auc)
plt.plot([0,1],[0,1],"r--")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()

#####This code calculates the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) score for a logistic regression model. It plots the ROC curve with FPR on the x-axis and TPR on the y-axis. plt.plot([0, 1], [0, 1], "r--") adds a dashed diagonal line representing a random classifier. plt.xlim([0.0, 1.0]) sets the limits of the x-axis to [0, 1]. plt.ylim([0.0, 1.05]) sets the limits of the y-axis to [0, 1.05].
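
The AUC depends on whether it is computed from continuous scores or from hard class labels. The sketch below uses six hypothetical observations with hand-picked scores (not model output) to show that thresholded labels collapse the ranking information and usually give a different, typically lower, value than the probabilities behind the curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.80])

# Hard labels at the usual 0.5 threshold, as model.predict would return.
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(auc_from_scores, auc_from_labels)
```

Here the scores rank 7 of the 8 positive/negative pairs correctly (AUC 0.875), while the hard labels tie on half of them (AUC 0.75).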


In [None]:
model.score(X_test,y_test)

We get a score of 0.957. For a classifier, model.score returns the mean accuracy, so the logistic regression model correctly identifies the occurrence or non-occurrence of stroke in 95.7% of the test cases based on the selected features. Accuracy should be read with care here: if stroke cases are a small minority of the rows, a model that always predicts 0 would score almost as high, so the figure is best compared with the majority-class share and with metrics such as ROC AUC or recall.
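
One way to put an accuracy figure into context is to compare it with a majority-class baseline. The sketch below uses synthetic labels with an assumed 5% positive rate (a hypothetical figure, not measured on this dataset) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 5 positives out of 100 (assumed rate for illustration).
y_toy = np.array([1] * 5 + [0] * 95)
X_toy = np.zeros((100, 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, pred)
baseline_recall = recall_score(y_toy, pred)

print(baseline_accuracy, baseline_recall)
```

The baseline reaches 95% accuracy while finding none of the positive cases, which is why recall or ROC AUC are worth reporting alongside accuracy for rare outcomes.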