## Perform Exploratory Data Analysis on the data set

### I- Assessing
In this stage we load and inspect the data to assess its quality and tidiness.

The main quality dimensions are:

**1**- Completeness (checking whether there are any missing records).

**2**- Validity (checking whether the values are 'valid', i.e. whether they follow certain known rules).

**3**- Accuracy (checking for values that are implausibly low or high, which are treated as 'inaccurate data').

**4**- Consistency (there should be only one way to represent or refer to a value, otherwise the data is said to be 'inconsistent').
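
As a rough illustration of how some of these dimensions can be checked with pandas, here is a minimal sketch. The `toy` frame, its values and the "age must be non-negative" rule are assumptions made up for this example and are not taken from the stroke dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "female", "Female", "Other"],
    "age": [45.0, -3.0, 61.0, 30.0],
    "bmi": [27.1, None, 31.4, 22.0],
})

# Completeness: number of missing values per column.
missing_per_column = toy.isnull().sum()

# Validity: ages must be non-negative (a rule assumed for this sketch).
invalid_age_rows = toy[toy["age"] < 0]

# Consistency: labels that differ only by letter case refer to the same value.
labels = pd.Series(toy["gender"].dropna().unique())
inconsistent_labels = sorted(labels[labels.str.lower().duplicated(keep=False)])

print(missing_per_column)
print(invalid_age_rows)
print(inconsistent_labels)
```

The cells below apply the same idea to the real data with `info()`, `describe()` and `value_counts()`.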

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numpy as np
stroke_data =pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
stroke_data.shape

In [None]:
stroke_data.info()


In [None]:
stroke_data.head(5)

In [None]:
stroke_data.tail(5)

In [None]:
stroke_data.describe()

In [None]:
# Gender column
stroke_data.gender.value_counts()

In [None]:
# Age column
stroke_data.age.value_counts()

* Note: some values in the 'age' column don't make sense.

In [None]:
# Hypertension column
stroke_data.hypertension.value_counts()

In [None]:
# Heart disease column
stroke_data.heart_disease.value_counts()

In [None]:
# Ever-married column
stroke_data.ever_married.value_counts()

In [None]:
# Work type column
stroke_data.work_type.value_counts()

* Note: 687 records in this column are assigned to the 'children' category, which is not a suitable work type.

In [None]:
# Residence type column
stroke_data.Residence_type.value_counts()

In [None]:
# Average glucose level column
stroke_data.avg_glucose_level.describe()

In [None]:
# BMI column
stroke_data.bmi.isnull().sum()

In [None]:
# Smoking status column
stroke_data.smoking_status.value_counts()

In [None]:
# Stroke column
stroke_data.stroke.value_counts()

### II-Cleaning
- Drop the 'id' column since it isn't needed in our analysis.
- Change the type of 'hypertension', 'heart_disease', 'stroke' and 'gender' to category.
- Fill the 201 null values in the 'bmi' column.
- Drop the 'Other' category in the 'gender' column.
- Merge the records under the 'children' category into the 'Never_worked' category.
- Create an 'Age_Category' column.

In [None]:
# Drop ID column.

stroke_data= stroke_data.drop(columns= 'id')
stroke_data.info()

In [None]:
# Change the type of 'hypertension', 'heart_disease', 'stroke' and 'gender' to category.

stroke_data['hypertension'] = stroke_data['hypertension'].astype('category')
stroke_data['heart_disease'] = stroke_data['heart_disease'].astype('category')
stroke_data['stroke'] = stroke_data['stroke'].astype('category')
stroke_data['gender'] = stroke_data['gender'].astype('category')
stroke_data.info()

In [None]:
# Fill the 201 null values in 'BMI' column.
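# Let's ignore the 'Other' category since it will be removed.
# First compare the mean bmi of each gender to decide how to fill the missing values.
stroke_data.groupby('gender')['bmi'].mean()

In [None]:
# Mean bmi for male = 28.6 // mean bmi for female = 29, so pretty much the same value.
# We therefore fill the missing values with the overall mean of the 'bmi' column only
# (calling mean() on the whole DataFrame fails on non-numeric columns in recent pandas).
stroke_data['bmi'] = stroke_data['bmi'].fillna(stroke_data['bmi'].mean())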

# Now let's check that all null values are replaced with the mean values.
stroke_data.bmi.isnull().sum()
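
If the group means were further apart, each missing value could instead be filled with the mean of its own gender. The following is only a sketch of that alternative on a made-up `toy` frame; the column name `bmi_filled` is introduced for illustration and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values, only to show the mechanics.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "bmi": [28.0, 30.0, np.nan, 26.0, np.nan, 30.0],
})

# Mean bmi of each row's own gender group, aligned to the original index.
group_mean = toy.groupby("gender")["bmi"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
toy["bmi_filled"] = toy["bmi"].fillna(group_mean)
print(toy)
```

`transform("mean")` returns one value per row rather than one per group, which is what lets `fillna` line the group means up with the missing entries.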

In [None]:
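# Drop the 'Other' category in the gender column.
other = stroke_data[stroke_data['gender'] == 'Other'].index
stroke_data.drop(other, axis=0, inplace= True)
# 'gender' is a categorical column, so also remove the now-empty 'Other' level.
stroke_data['gender'] = stroke_data['gender'].cat.remove_unused_categories()
stroke_data.gender.value_counts()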

In [None]:
# Add values under children category to those of never worked category.
stroke_data.work_type = np.where(stroke_data['work_type'] == 'children','Never_worked',stroke_data.work_type)
stroke_data.work_type.value_counts()

In [None]:
# Create 'age category' column.
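# Half-open ranges are used so that fractional ages (e.g. 14.5) also fall into a category.
conditions= [(stroke_data['age'] < 15),
             (stroke_data['age'] >= 15) & (stroke_data['age'] < 25),
             (stroke_data['age'] >= 25) & (stroke_data['age'] < 65),
             (stroke_data['age'] >= 65)
]

values= ['Child','Youth','Adult','Senior']

#Create the new column; a string default keeps the column purely textual if no condition matches
stroke_data['Age_Category']= np.select(conditions, values, default='Unknown')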

#Now we check if the new column is added
stroke_data.head(5)
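
The same binning can also be written with `pd.cut`. This is a sketch under the assumption that the categories are Child (under 15), Youth (15 to 24), Adult (25 to 64) and Senior (65 and over); the `ages` values are made up for illustration.

```python
import pandas as pd

# Made-up ages, including boundary and fractional values.
ages = pd.Series([0.08, 14.0, 14.5, 15.0, 24.0, 25.0, 64.0, 65.0, 82.0])

# right=False makes every bin closed on the left:
# [0, 15), [15, 25), [25, 65), [65, inf)
age_category = pd.cut(
    ages,
    bins=[0, 15, 25, 65, float("inf")],
    labels=["Child", "Youth", "Adult", "Senior"],
    right=False,
)

print(pd.DataFrame({"age": ages, "Age_Category": age_category}))
```

Compared with a list of boolean conditions, the bin edges are stated once, so it is harder to leave a gap between two neighbouring categories.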

**Further Exploration**
- Some values in the BMI column make no sense.
A BMI of about 12 is taken here as the lower limit for human survival, and a BMI above 50 falls well into the 'extremely obese' range.
Hence BMI values less than 12 or more than 50 are treated as outliers and should be dealt with.
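
A compact way to flag both ends of that range at once is `Series.between`. This is a minimal sketch on made-up rows; `BMI_MIN`, `BMI_MAX` and the `toy` frame are names introduced only for this example.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "age": [0.6, 40.0, 52.0, 33.0],
    "bmi": [10.3, 11.5, 27.9, 55.2],
})

BMI_MIN, BMI_MAX = 12, 50  # thresholds taken from the text above

# between() is inclusive on both ends by default, so 12 and 50 count as in range.
# Note that a missing bmi would also be reported as out of range.
in_range = toy["bmi"].between(BMI_MIN, BMI_MAX)
outliers = toy[~in_range]
print(outliers)
```

Inspecting `outliers` before dropping anything makes it easier to spot rows, such as infants, where a low BMI may be plausible.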

In [None]:
stroke_data.loc[stroke_data['bmi'] < 12 ]


Record 1609 is a baby, so a BMI of 10.3 is acceptable, but the other two records are a 40-year-old male and a 79-year-old female with very low BMI values. These patients didn't have a stroke, so we can safely remove them. Note that the next cell drops all three records, including record 1609.


In [None]:
stroke_data = stroke_data.drop(labels=[1609,2187,3307], axis=0)


In [None]:
#Now let's make sure the records are deleted
stroke_data.loc[stroke_data['bmi'] < 12 ]


In [None]:
#Investigate records with BMI more than 50
stroke_data.loc[stroke_data['bmi'] > 50 ]

In [None]:
stroke_data.drop(stroke_data.index[stroke_data['bmi'] > 50], inplace = True)

In [None]:
#Now let's make sure the records are deleted
stroke_data.loc[stroke_data['bmi'] > 50 ]

### III- Build a simple logistic regression model to predict strokes using other variables.


Split the dataset into features (X) and target (y):

In [None]:
X= stroke_data[['age', 'avg_glucose_level', 'bmi']]
y= stroke_data['stroke']

Split the dataset into train and test:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0) 

Create a logistic regression model:

In [None]:
logreg= LogisticRegression()
logreg.fit(X_train,y_train) 

In [None]:
y_pred=logreg.predict(X_test)
print (X_test) #test dataset
print (y_pred) #predicted values

### Model Evaluation using Confusion Matrix
#### A confusion matrix is a table used to evaluate the performance of a classification model: it counts the correct and incorrect predictions for each class. It can also be visualized, for example as a heatmap.
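
To make the four cells concrete, the counts can be computed by hand. The labels below are made up for this sketch; the layout (rows for actual classes, columns for predicted classes) matches the one scikit-learn uses.

```python
import numpy as np

# Made-up labels: 1 = stroke, 0 = no stroke.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # correctly predicted "no stroke"
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # missed strokes
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # correctly predicted strokes

# Rows are the actual class, columns are the predicted class.
matrix = np.array([[tn, fp],
                   [fn, tp]])
print(matrix)

# Dividing by the total gives each cell as a fraction of all samples.
print(matrix / matrix.sum())
```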

In [None]:
from sklearn.metrics import confusion_matrix
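# normalize='all' returns each cell as a fraction of all test samples instead of a raw count
matrix = confusion_matrix(y_test, y_pred, normalize= 'all')
print(matrix)

#Let's visualize the matrix (seaborn was already imported as sns above)
sns.heatmap(matrix, annot=True)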

### Let's evaluate the model we've just built using model evaluation metrics such as accuracy, recall and precision.

In [None]:
from sklearn import metrics
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
print('Recall: ',metrics.recall_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))
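
These three metrics follow directly from the confusion-matrix counts. The sketch below uses hypothetical counts (not results from this notebook) to show why accuracy alone can look good when strokes are rare; `classification_scores` is a helper written for this example.

```python
def classification_scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are reported as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical case 1: 100 patients, 5 strokes, model always predicts "no stroke".
scores_all_negative = classification_scores(tp=0, fp=0, fn=5, tn=95)

# Hypothetical case 2: model finds 3 of the 5 strokes but raises 10 false alarms.
scores_some_positive = classification_scores(tp=3, fp=10, fn=2, tn=85)

print(scores_all_negative)
print(scores_some_positive)
```

In the first case accuracy is 0.95 although no stroke is detected, which is why recall and precision are worth reading next to accuracy.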

Now if we have other patients with different ages, BMI or glucose level, the model can predict if they may have a stroke or not.

In [None]:
#This dataset is fictional and for illustration only.
new_patients= {'age': [20, 35, 70, 80, 90, 100], 'avg_glucose_level': [120, 140, 160, 200, 170, 150], 'bmi': [20, 25, 18, 30, 19, 31]}
stroke_new= pd.DataFrame(new_patients, columns= ['age', 'avg_glucose_level', 'bmi'])
stroke_new

Now let's try the model on the new dataset:

In [None]:
# Refit the model on the full dataset (not only the training split) before scoring the new patients.
X= stroke_data[['age', 'avg_glucose_level', 'bmi']]
y= stroke_data['stroke']

logreg= LogisticRegression()
logreg.fit(X, y)

# Fictional patients, for illustration only.
new_patients= {'age': [20, 35, 70, 80, 90, 100], 'avg_glucose_level': [120, 140, 160, 200, 170, 150], 'bmi': [20, 25, 18, 30, 19, 31]}
stroke_new= pd.DataFrame(new_patients, columns= ['age', 'avg_glucose_level', 'bmi'])

y_pred=logreg.predict(stroke_new)

print (stroke_new)
print (y_pred)


### IV- Analysis and visualization 
#### We'll now look at the relationships between the stroke and different variables.

In [None]:
stroke_data.head(5)

#### 1- The relation between 'Age' and 'Stroke'
#### I- According to age range

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows each age value and its corresponding stroke state
age_data= pd.concat([stroke_data['age'], y], axis=1)

#Create dataset for the plot
age_plot= pd.melt(age_data, id_vars= 'stroke', var_name= 'age')

#Create the plot
sns.boxplot(x= 'age', y= 'value', hue= 'stroke', data= age_plot, palette="Set2")

#### II- According to age category

In [None]:
fig,ax = plt.subplots(figsize = (8,10))
#Create dataset that shows age category and its corresponding stroke state
agecat_data= pd.concat([stroke_data['Age_Category'], y], axis=1)

#Create dataset for the plot; the count column is named explicitly because its default name differs between pandas versions
agecat_plot= stroke_data[['Age_Category', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'Age_Category', y= 'count', hue= 'stroke', data= agecat_plot, palette="Set2")

#### It's clear from the plot that older patients, the 'seniors' (65 years and older), are more likely to have a stroke than younger people, so age is a crucial factor in predicting strokes.

#### 2-The relation between 'Glucose level' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows each glucose level value and its corresponding stroke state
glucose_data= pd.concat([stroke_data['avg_glucose_level'], y], axis=1)

#Create dataset for the plot
glucose_plot= pd.melt(glucose_data, id_vars= 'stroke', var_name= 'avg_glucose_level')

#Create the plot
sns.boxplot(x= 'avg_glucose_level', y= 'value', hue= 'stroke', data= glucose_plot, palette="Set1")

#### From the plot we can see that higher glucose levels are associated with a higher risk of stroke, although there are many outliers.
#### Glucose levels between 80 and 120 are not a precise indicator of stroke, since some patients in this range had a stroke and others didn't.
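
The points a boxplot draws as outliers are, with the default settings, the values lying more than 1.5 times the interquartile range beyond the quartiles. Here is a minimal sketch of that rule on made-up glucose values, which can be used to count the outliers rather than only eyeball them.

```python
import pandas as pd

# Made-up glucose values for illustration.
glucose = pd.Series([70, 80, 85, 90, 95, 100, 105, 110, 120, 250], dtype=float)

q1 = glucose.quantile(0.25)
q3 = glucose.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the default boxplot rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = glucose[(glucose < lower) | (glucose > upper)]
print(f"limits: [{lower}, {upper}]")
print(outliers)
```

On the real data the same calculation could be run separately for each stroke group, for example inside a `groupby('stroke')`.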

#### 3-The relation between 'BMI' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows each bmi value and its corresponding stroke state
bmi_data= pd.concat([stroke_data['bmi'], y], axis=1)

#Create dataset for the plot
bmi_plot= pd.melt(bmi_data, id_vars= 'stroke', var_name= 'bmi')

#Create the plot
sns.boxplot(x= 'bmi', y= 'value', hue= 'stroke', data= bmi_plot, palette="Set3")

#### There are many outliers in the bmi values. Also, some patients with a bmi between 28 and 33 had a stroke while others with the same bmi didn't, so bmi is a bit misleading and cannot be considered a good predictor of stroke.

#### 4-The relation between 'Gender' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows each gender and its corresponding stroke state
gender_data= pd.concat([stroke_data['gender'], y], axis=1)

#Create dataset for the plot
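#The count column is named explicitly because its default name differs between pandas versions
gender_plot= stroke_data[['gender', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'gender', y= 'count', hue= 'stroke', data= gender_plot, palette="Set2")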

In [None]:
gender_plot= stroke_data[['gender', 'stroke']].value_counts().reset_index()
gender_plot.head(5)

#### From the previous plot and table, more females (140) than males (108) had a stroke in this dataset.

#### 5-The relation between 'Hypertension' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows hypertension state and its corresponding stroke state
hypertn_data= pd.concat([stroke_data['hypertension'], y], axis=1)

#Create dataset for the plot
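#The count column is named explicitly because its default name differs between pandas versions
hypertn_plot= stroke_data[['hypertension', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'hypertension', y= 'count', hue= 'stroke', data= hypertn_plot, palette="Set2")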

In [None]:
hypertn_plot= stroke_data[['hypertension', 'stroke']].value_counts().reset_index()
hypertn_plot.head(5)

#### From the previous plot and table we notice that:
#### 415 patients are hypertensive but didn't have a stroke, which means that hypertension isn't a probable risk factor for strokes.
#### 182 patients had a stroke but aren't hypertensive, which signifies that stroke isn't necessarily associated with hypertension.
#### 66 patients both had a stroke and are hypertensive, so that's 66 out of 5110, which is clearly not a reliable proportion to tell that hypertension is a crucial predictor.
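
A complementary way to read such a table is the share of strokes within each group rather than the raw counts, since the groups differ in size. The following is a sketch on made-up data, assuming the same 0/1 coding as the notebook; the numbers are not results from the stroke dataset.

```python
import pandas as pd

# Made-up data: 8 patients without hypertension, 4 with hypertension.
toy = pd.DataFrame({
    "hypertension": [0] * 8 + [1] * 4,
    "stroke": [0, 0, 0, 0, 0, 0, 0, 1] + [0, 0, 1, 1],
})

# Raw counts per (hypertension, stroke) combination.
counts = pd.crosstab(toy["hypertension"], toy["stroke"])

# normalize="index" divides each row by its own total,
# giving the stroke rate inside each hypertension group.
rates = pd.crosstab(toy["hypertension"], toy["stroke"], normalize="index")

print(counts)
print(rates)
```

The same two calls can be pointed at any of the categorical columns explored in this section to compare groups of different sizes.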

#### 6-The relation between 'Heart disease' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows heart disease state and its corresponding stroke state
heart_data= pd.concat([stroke_data['heart_disease'], y], axis=1)

#Create dataset for the plot
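#The count column is named explicitly because its default name differs between pandas versions
heart_plot= stroke_data[['heart_disease', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'heart_disease', y= 'count', hue= 'stroke', data= heart_plot, palette="Set1")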

In [None]:
heart_plot= stroke_data[['heart_disease', 'stroke']].value_counts().reset_index()
heart_plot.head()

#### From the previous plot and table we notice that:
#### 228 patients have heart disease but didn't have a stroke, which means that heart disease isn't a probable risk factor for strokes.
#### 201 patients had a stroke but don't suffer from heart disease, which signifies that stroke isn't necessarily associated with heart disease.
#### 47 patients both had a stroke and suffer from heart disease, so that's 47 out of 5110, which is clearly not a reliable proportion to tell that heart disease is a crucial predictor.

#### 7-The relation between 'Marital status' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows marital status and its corresponding stroke state
social_data= pd.concat([stroke_data['ever_married'], y], axis=1)

#Create dataset for the plot
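#The count column is named explicitly because its default name differs between pandas versions
social_plot= stroke_data[['ever_married', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'ever_married', y= 'count', hue= 'stroke', data= social_plot, palette="Set1")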

In [None]:
social_plot= stroke_data[['ever_married', 'stroke']].value_counts().reset_index()
social_plot.head()

#### The data shows that marriage is not a significant factor in predicting a stroke.

#### 8-The relation between 'Work type' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (6,8))
#Create dataset that shows work type and its corresponding stroke state
work_data= pd.concat([stroke_data['work_type'], y], axis=1)

#Create dataset for the plot
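#The count column is named explicitly because its default name differs between pandas versions
work_plot= stroke_data[['work_type', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'work_type', y= 'count', hue= 'stroke', data= work_plot, palette="Set2")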

In [None]:
work_plot= stroke_data[['work_type', 'stroke']].value_counts().reset_index()
work_plot.head(10)

#### It's clear that the largest number of those who had a stroke work in the private sector (148). Although it isn't an important risk factor, it's worth taking into consideration.

#### 9-The relation between 'Residence type' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (8,10))
#Create dataset that shows residence type and its corresponding stroke state
residence_data= pd.concat([stroke_data['Residence_type'], y], axis=1)

#Create dataset for the plot; the count column is named explicitly because its default name differs between pandas versions
residence_plot= stroke_data[['Residence_type', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'Residence_type', y= 'count', hue= 'stroke', data= residence_plot, palette="Set2")

#### It seems that residence type doesn't have much of an effect on predicting strokes.

#### 10-The relation between 'Smoking status' and 'Stroke'

In [None]:
fig,ax = plt.subplots(figsize = (8,10))
#Create dataset that shows smoking state and its corresponding stroke state
smoking_data= pd.concat([stroke_data['smoking_status'], y], axis=1)

#Create dataset for the plot
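#The count column is named explicitly because its default name differs between pandas versions
smoking_plot= stroke_data[['smoking_status', 'stroke']].value_counts().reset_index(name='count')

#Create the plot
sns.barplot(x= 'smoking_status', y= 'count', hue= 'stroke', data= smoking_plot, palette="Set2")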

In [None]:
smoking_plot= stroke_data[['smoking_status', 'stroke']].value_counts().reset_index()
smoking_plot.head(10)

#### The result here is quite interesting. Smokers were expected to have a higher risk of stroke, but the data shows that 735 smokers never experienced a stroke while 89 non-smokers had one, so, contrary to expectation, smoking status is definitely not a good predictor of stroke in this dataset.

## Final conclusions

### This dataset has 5110 records and 12 columns representing 12 different clinical and demographic features.
#### From the analysis, the following points were concluded:
#### 1- Seniors have a higher risk of stroke compared to adults and youth, so age is a good indicator.
#### 2- The average glucose level isn't a good predictor.
#### 3- BMI values are misleading and were not a useful risk factor for strokes in this dataset.
#### 4- Gender data showed that both males and females have strokes, but more females suffered from strokes than males.
#### 5- Hypertension is not a good indicator of strokes in this analysis, since many hypertensive patients didn't have a stroke and others who aren't hypertensive did. The same conclusion applies to heart disease, which surprisingly was not a probable risk factor.
#### 6- Marital status, residence type and smoking status are all unimportant in predicting strokes for this dataset.
#### 7- On the other hand, people working in the private sector showed a higher number of strokes than people with other work types. The number isn't significant, but it's worth taking into consideration in further studies.

In [None]:
stroke_data.to_csv(r'C:\Users\sss-a\Desktop\Practicum Internship\Project\healthcare-dataset-stroke-data.csv')

In [None]:
stroke_data.head(5)