
<h1 align="center"> Stroke data Analysis and Exploration</h1>
<center><img src="https://th.bing.com/th/id/OIP.jJJIzT7fHknFNlCVk1dJigHaEK?pid=ImgDet&rs=1" width="60%" ></center>


**In this notebook I review several patient features that may be associated with stroke, and I explain the results of each step in detail.
Enjoy learning...**

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Reading Dataset

In [None]:
data=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

# Let's take a look at the data

In [None]:
data.shape

> **The dataset looks large enough to train a neural network later if necessary.**

In [None]:
data.head(10)

In [None]:
list(data.columns)

# Here is a definition of these columns:
Attribute Information
* id: unique identifier
* gender: "Male", "Female" or "Other"
* age: age of the patient
* hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* ever_married: "No" or "Yes"
* work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* Residence_type: "Rural" or "Urban"
* avg_glucose_level: average glucose level in blood
* bmi: body mass index
* smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* stroke: 1 if the patient had a stroke or 0 if not
> * **Note: "Unknown" in smoking_status means that the information is unavailable for this patient**

In [None]:
data=data.drop(["id"], axis=1)

In [None]:
# describe() summarizes only the numeric columns by default, so it also shows us which columns are numeric.
data.describe()

In [None]:
data.info()

In [None]:
data_dtype = data.dtypes
data_dtype.value_counts()

In [None]:
# Let's look at the unique values for each column:
for i in data.columns:
    print(i,data[i].unique())

# Missing Values:

In [None]:
data.isnull().sum().any()

> **It seems that we have missing values that we will deal with when we delve deeper into the analysis at the column level.**
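
The check above only answers yes or no. A minimal sketch of a per-column breakdown follows, using a small hypothetical frame (`toy`, with made-up values) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the stroke data (values are made up).
toy = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 1, 0],
})

missing_count = toy.isnull().sum()          # number of missing cells per column
missing_share = toy.isnull().mean() * 100   # the same, as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary[summary["missing"] > 0])      # keep only columns that have gaps
```

Running the same two lines on `data` should show which columns need attention and how large the gap is.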

In [None]:
# Visualize the missing values as a heatmap.
sns.heatmap(data.isnull(),cbar=True, cmap='viridis')

# EDA : General data exploration

In [None]:
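# The heatmap shows the pairwise correlations between the numeric columns.
fig, ax = plt.subplots(figsize=(15,15))
# Keep only numeric columns: recent pandas versions raise an error when corr() meets text columns.
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, linewidths=.5, ax=ax)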
plt.show()

In [None]:
data.hist(figsize=(18,10))
plt.show()

> **At least one of the columns is positively skewed, with a long right tail.**
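
Skewness can also be read as a number instead of being judged from the histograms. The sketch below uses two hypothetical columns (made-up values) to show how `DataFrame.skew()` separates a symmetric column from a right-skewed one:

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 20],
})

skewness = toy.skew()  # sample skewness per numeric column
print(skewness.sort_values(ascending=False))
```

Values near zero indicate a roughly symmetric column, while clearly positive values indicate a right tail.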

> **Let's proceed by asking questions and answering them through analysis. Along the way we will handle the missing values and outliers and compute aggregate statistics.**

**Let's continue.....**

# Handling Missing Values

In [None]:
data["bmi"].isnull().sum()

> **To handle the empty values in that column, I will fill them in rather than drop the rows. I create two new columns: one where the missing values are replaced with the mean of the original column and one where they are replaced with the median. To check which choice distorts the data least, I plot the density curve of the original data, the mean-filled data, and the median-filled data in a single figure.**

In [None]:
bmi_mean = data.bmi.mean()
bmi_median = data.bmi.median()
data['bmi_mean'] = data.bmi.fillna(bmi_mean)
data['bmi_median'] = data.bmi.fillna(bmi_median)

#Now let's draw the new columns
sns.kdeplot(data['bmi_mean'],color='red',label='Mean')
sns.kdeplot(data['bmi_median'],color='blue',label='Median')
sns.kdeplot(data['bmi'],color='black',label='Original')
plt.legend()


> **The figure shows kernel density estimates rather than a normal distribution. The three curves almost coincide, which suggests that neither fill strategy distorts the overall shape much. The one thing that bothers me is the peak of the curve, where the filled columns most likely rise above the original because all the imputed values pile up at a single point, but that is acceptable for now.**
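
The visual comparison can be backed by summary statistics. The following is a minimal sketch on a hypothetical series (`bmi`, made-up values), not on the real column. It illustrates that filling with the mean keeps the mean unchanged but shrinks the standard deviation:

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with two gaps (made up for illustration).
bmi = pd.Series([22.0, 25.0, np.nan, 31.0, 48.0, np.nan, 27.0])

filled_mean = bmi.fillna(bmi.mean())
filled_median = bmi.fillna(bmi.median())

# Put the summaries side by side to see what each strategy changes.
comparison = pd.DataFrame({
    "original": bmi.describe(),
    "mean_filled": filled_mean.describe(),
    "median_filled": filled_median.describe(),
})
print(comparison.loc[["count", "mean", "50%", "std"]])
```

Both strategies reduce the spread, because every filled value sits at the center of the distribution.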

In [None]:
#Take a look at the exact results we got, the results are good, aren't they?
data.head(10)

> **Now let's start exploring. A small change of plan: instead of posing specific questions first, I will let the data lead and see where it takes me.**

# EDA : Deeper

# EDA : for stroke (target) column

In [None]:
data["stroke"].value_counts()

In [None]:
sns.countplot(data=data,x='stroke')

> **The following figure shows the proportion of patients who had a stroke versus those who did not.**

In [None]:
class_occur = data['stroke'].value_counts()
class_names = ['No Stroke','Stroke']
fig, ax = plt.subplots()
ax.pie(class_occur, labels=class_names, autopct='%1.2f%%',
        shadow=True, startangle=0, counterclock=False)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax.set_title('Class distribution')
plt.show()

> **It is very clear that stroke cases make up only a very small share of this data, so the two classes are highly imbalanced.**
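
The imbalance can be expressed as a single number. The sketch below uses a hypothetical target column with one positive case in twenty (an assumed ratio for illustration, not the ratio in the real data):

```python
import pandas as pd

# Hypothetical target: 19 patients without a stroke, 1 with a stroke.
stroke = pd.Series([0] * 19 + [1], name="stroke")

shares = stroke.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = shares[0] / shares[1]        # negatives per positive case

print(shares)
print(f"negatives per positive: {imbalance_ratio:.0f}")
```

A ratio this lopsided is worth keeping in mind later, since a model can score high accuracy simply by always predicting the majority class.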

# EDA : FOR Gender column 

In [None]:
# Encode the three gender categories as integers.
data["gender"]=data["gender"].map({"Male":0 , "Female":1 , "Other":2})

In [None]:
# Let's look at the different values for that column.
data["gender"].unique()

In [None]:
#The results of the previous code are interesting so let's look at the number of each of those values.
data["gender"].value_counts()

In [None]:
# The following figure shows that there are more women than men in the data, so raw counts may be biased toward women.
sns.countplot(data=data,x='gender')

In [None]:
sns.countplot(data=data,x='gender',hue='stroke')

> **We note here that the number of women who had a stroke is slightly greater than the number of men who had one. This may simply be because the data includes more women than men.**
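
Because the groups differ in size, rates are a fairer basis for comparison than counts. A minimal sketch, using a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical data: more women than men, one stroke case in each group.
toy = pd.DataFrame({
    "gender": ["Female"] * 6 + ["Male"] * 4,
    "stroke": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of positive cases in the group.
by_gender = toy.groupby("gender")["stroke"].agg(
    cases="sum", patients="count", rate="mean"
)
print(by_gender)
```

In this toy example both groups have the same number of cases, yet the smaller group has the higher rate, which a count plot alone would hide.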

# EDA : for Age column

In [None]:
data["age"].describe()

In [None]:
sns.boxplot(x='age', data=data)  # pass the column by keyword; recent seaborn versions reject it positionally
plt.xlabel('age Column Distribution')

> **We note that the middle half of the ages (the box) lies roughly between 30 and 60 years.**

In [None]:
# Let's see the distribution of the data.
# Histogram with a density curve (histplot replaces the deprecated distplot).
sns.histplot(data['age'], kde=True)

In [None]:
# This instruction shows the number of unique elements in the column, not the values.
data["age"].nunique()

> **To facilitate the analysis process, I divided the age column into age groups, as follows:**
* 0 - 18
* 19 - 25 
* 26 - 35 
* 36 - 45 
* 46 - 55 
* 56 - 65 
* 66 - 75 
* +75 

In [None]:
# I set this function to achieve what we want:
def age_groups(age):
    if age > 75:  # 75 itself belongs to the "66 - 75" group
        return('+75 ')
    elif age > 65:
        return('66 - 75 ')
    elif age > 55:
        return('56 - 65 ')
    elif age > 45:
        return('46 - 55 ')
    elif age > 35:
        return('36 - 45 ')
    elif age > 25:
        return('26 - 35 ')
    elif age > 18:
        return('19 - 25 ')
    elif age > 0:
        return('0 - 18 ')
    else:
        return(None)

In [None]:
# Now I will create a new column to display the results.
data['Age_group'] = data['age'].apply(age_groups)
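
As an alternative sketch, the same grouping can be expressed with `pd.cut`, which keeps the bin edges and labels in one place. The `ages` series is hypothetical, and the labels are written without trailing spaces. With the default `right=True`, each bin includes its upper edge, so 75 falls into "66 - 75" as the list of groups describes:

```python
import pandas as pd

# Hypothetical ages chosen to probe the bin boundaries.
ages = pd.Series([0.08, 18, 19, 25, 26, 75, 76, 82])

edges = [0, 18, 25, 35, 45, 55, 65, 75, float("inf")]
labels = ["0 - 18", "19 - 25", "26 - 35", "36 - 45",
          "46 - 55", "56 - 65", "66 - 75", "+75"]

# Bins are (lower, upper], so the upper edge belongs to the bin.
age_group = pd.cut(ages, bins=edges, labels=labels)
print(age_group.tolist())
```

The result is an ordered categorical, so plots and group-by tables keep the groups in age order rather than in order of appearance.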

In [None]:
data.head(10)

In [None]:
sns.countplot(data=data,x='gender',hue='Age_group')

> **Note that this figure counts patients per age group and gender, not strokes. For women, the number of patients rises from the 36 - 45 group and peaks in the 46 - 55 group.**

In [None]:
fig = sns.FacetGrid(data, hue="stroke",aspect=4)
fig.map(sns.kdeplot,'age',fill=True)  # fill replaces the deprecated shade argument
oldest = data['age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

> **In general, for both sexes the density of stroke cases rises sharply from about age 50 and keeps climbing, peaking between roughly 74 and 80 years, close to the largest value in the age column.**
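
The density curves show where the stroke cases sit, but not how likely a stroke is at a given age. A row-normalized cross-tabulation gives that share directly. The frame below is hypothetical (made-up values):

```python
import pandas as pd

# Hypothetical patients in two age groups.
toy = pd.DataFrame({
    "Age_group": ["26 - 35"] * 4 + ["66 - 75"] * 4,
    "stroke": [0, 0, 0, 0, 1, 0, 1, 0],
})

# normalize="index" makes every row sum to 1, so column 1 is the stroke rate.
rates = pd.crosstab(toy["Age_group"], toy["stroke"], normalize="index")
print(rates)
```

Column `1` of the result is the stroke rate within each age group, which can be compared across groups of different sizes.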

# EDA : for hypertension column

In [None]:
data["hypertension"].unique()

> **hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension**

In [None]:
sns.countplot(data=data,x='gender',hue='hypertension')

> **Any difference between women and men here has to be read with care: the bars show counts, not rates, and the data contains more women than men. The comparison becomes meaningful only after dividing by the number of patients of each gender.**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="Age_group",hue="hypertension" , data=data,palette="rocket")

> **The rate of high blood pressure rises noticeably from the age of 36 and keeps increasing up to the age of 80. These are the same age ranges noted earlier, in which the number of stroke cases also rises.**

> **focus more...**

# EDA : for heart_disease column 

In [None]:
data["heart_disease"].unique()

> **heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease**

In [None]:
fig = sns.FacetGrid(data, hue="heart_disease",aspect=4)
fig.map(sns.kdeplot,'age',fill=True)  # fill replaces the deprecated shade argument
oldest = data['age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

> **From the figure, we notice that heart disease becomes more common from about age 50 and keeps increasing steadily up to age 80, so this age group is probably also the most vulnerable to stroke.**

In [None]:
sns.countplot(data=data,x='hypertension',hue='heart_disease')

> **Hypertension means high blood pressure, not an increased heart rate. A person who suffers from it is also at greater risk of developing heart disease and, consequently, of having a stroke.**

# EDA : for ever_married column

In [None]:
data["ever_married"].unique()

In [None]:
# Encode the two values of this column as integers (Yes -> 1, No -> 0).
data["ever_married"]=data["ever_married"].map({"Yes":1 , "No":0 })

In [None]:
sns.countplot(data=data,x='gender',hue='ever_married')

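> **The number of married women is higher than the number of married men. Since the plot shows counts rather than percentages, part of this gap reflects the larger number of women in the data.**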

In [None]:
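# Here are the exact shares of married and unmarried people.
class_occur = data['ever_married'].value_counts()
# Derive the labels from the counted values so they always match the slice order.
class_names = class_occur.index.map({1: 'Married', 0: 'Not married'})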
fig, ax = plt.subplots()
ax.pie(class_occur, labels=class_names, autopct='%1.2f%%',
        shadow=True, startangle=0, counterclock=False)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax.set_title('Class distribution')
plt.show()

In [None]:
fig = sns.FacetGrid(data, hue="ever_married",aspect=4)
fig.map(sns.kdeplot,'gender',fill=True)  # fill replaces the deprecated shade argument
oldest = data['gender'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

In [None]:
fig = sns.FacetGrid(data, hue="stroke",aspect=4)
fig.map(sns.kdeplot,'gender',fill=True)  # fill replaces the deprecated shade argument
oldest = data['gender'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

In [None]:
sns.countplot(data=data,x='Age_group',hue='ever_married')

> **The number of married people rises sharply from the age of 35, peaks in the 46 - 55 group, and then declines slightly.**

> **Based on the previous graphs, married people appear to have more strokes than unmarried people. This should not be read as marriage causing strokes: women are more numerous, married people tend to be older, and both high blood pressure and stroke become more common with age, so age is the most likely explanation.**
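
One way to test whether age explains the difference is to compare stroke rates within each age group. The sketch below uses a hypothetical frame built so that married people are older on average. All values are made up to illustrate the effect and say nothing about the real data:

```python
import pandas as pd

# Hypothetical data: most young patients are unmarried, most old ones married.
toy = pd.DataFrame({
    "Age_group":    ["19 - 25"] * 4 + ["66 - 75"] * 6,
    "ever_married": [0, 0, 0, 1,      0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 0,      1, 0, 1, 0, 1, 0],
})

# Overall rate per marital status (age is ignored).
overall = toy.groupby("ever_married")["stroke"].mean()

# Rate per marital status inside each age group.
stratified = toy.pivot_table(
    index="Age_group", columns="ever_married", values="stroke", aggfunc="mean"
)

print(overall)
print(stratified)
```

In this toy example the married group has the higher overall rate, but within each age group the two rates are equal, so the overall gap comes entirely from age. Applying the same table to `data` would show whether the real gap survives stratification.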

In [None]:
data.head()

# EDA : for work_type column

In [None]:
data["work_type"].unique()

In [None]:
data["work_type"].value_counts()

In [None]:
# Here we get the exact proportions.
class_occur = data['work_type'].value_counts()
# Take the labels from value_counts() itself; a hand-written list can fall out of step with the count order.
class_names = class_occur.index
fig, ax = plt.subplots()
ax.pie(class_occur, labels=class_names, autopct='%1.2f%%',
        shadow=True, startangle=0, counterclock=False)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax.set_title('Class distribution')
plt.show()

In [None]:
sns.countplot(data=data,x='work_type',hue='ever_married')

> **We note that married people working in the private sector outnumber every other group. This may have several causes, for example higher salaries in the private sector that make marriage more affordable, but the data cannot confirm that.**

In [None]:
sns.countplot(data=data,x='work_type',hue='stroke')

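**But everything has its consequences. From the figure, we note that the largest number of stroke cases occurs among people working in the private sector. Since the bars show counts, comparing rates would be needed to claim a higher risk.**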

In [None]:
sns.countplot(data=data,x='work_type',hue='gender')

**It is clear that most workers in the private sector are women, many of them married, so this is also the group in which most of the stroke cases in the data are found.**

In [None]:
# An encoder tool could do this automatically, but with so few categories a manual mapping is fine.
# Encode the five work types as integers.
data["work_type"]=data["work_type"].map({'Private':0, 'Self-employed':1, 'Govt_job':2, 'children':3, 'Never_worked':4 })

In [None]:
data["work_type"].unique()
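
For reference, one of the tools that can do this encoding automatically is `pd.factorize`. The sketch below runs on a hypothetical series. Note that it numbers the categories in order of first appearance, so the codes will generally differ from the hand-written mapping above:

```python
import pandas as pd

# Hypothetical work types (made-up sample).
work_type = pd.Series(["Private", "children", "Govt_job", "Private", "Self-employed"])

# codes: integer per row; categories: the label behind each integer.
codes, categories = pd.factorize(work_type)
lookup = dict(enumerate(categories))  # integer code -> original label

print(codes.tolist())
print(lookup)
```

Keeping `lookup` around makes it possible to translate the integers back into readable labels when plotting.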

# EDA : for Residence_type column

In [None]:
data["Residence_type"].value_counts()

In [None]:
# Here we get the exact proportions.
class_occur = data['Residence_type'].value_counts()
# Take the labels from value_counts() itself so they always match the slice order.
class_names = class_occur.index
fig, ax = plt.subplots()
ax.pie(class_occur, labels=class_names, autopct='%1.2f%%',
        shadow=True, startangle=0, counterclock=False)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax.set_title('Class distribution')
plt.show()

**The two places of residence are almost equally represented, so let's continue and see whether the place of residence affects anything else.**

In [None]:
sns.countplot(data=data,x='Residence_type',hue='work_type')

In [None]:
sns.countplot(data=data,x='Residence_type',hue='gender')

**This feature seems to have little effect on its own. It might become informative if the data were extended with more detail behind each column. Its effect probably also varies from one country to another with education, income, health care, marriage patterns, and other factors that influence the stroke rate.**

# EDA : for avg_glucose_level column

In [None]:
data["avg_glucose_level"].describe()

In [None]:
data["avg_glucose_level"].value_counts()

In [None]:
sns.boxplot(x='avg_glucose_level', data=data)  # pass the column by keyword; recent seaborn versions reject it positionally
plt.xlabel('avg_glucose_level Column Distribution')

In [None]:
sns.displot(data['avg_glucose_level'])

> **To facilitate the analysis, I adapted the earlier binning function to split this column into glucose-level groups.**

In [None]:
# A separate binning function for glucose levels.
# It has its own name so that it does not overwrite the age_groups function defined earlier.
def glucose_groups(level):
    if level > 145:  # 145 itself belongs to the "136 - 145" group
        return('+146 ')
    elif level > 135:
        return('136 - 145 ')
    elif level > 125:
        return('126 - 135 ')
    elif level > 115:
        return('116 - 125 ')
    elif level > 105:
        return('106 - 115 ')
    elif level > 95:
        return('96 - 105 ')
    elif level > 85:
        return('86 - 95 ')
    elif level > 75:
        return('76 - 85 ')
    elif level > 65:
        return('66 - 75 ')
    elif level > 54:
        return('55 - 65')
    else:
        return(None)  # levels of 54 or below are left ungrouped

In [None]:
# Now I will create a new column to display the results.
data['avg_glucose_level_group'] = data['avg_glucose_level'].apply(glucose_groups)

In [None]:
data.head()

In [None]:
fig = sns.FacetGrid(data, hue="Age_group",aspect=4)
fig.map(sns.kdeplot,'avg_glucose_level',fill=True)  # fill replaces the deprecated shade argument
oldest = data['avg_glucose_level'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

> **Here we note that the average glucose values cluster between about 80 and 100. We also note a smaller group of values above 150, away from the main mass of the data.**

> **I could not find an explanation for those outliers.**

> **So if you are a specialist or have an explanation, please tell me in the comments.**
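
To look at these points more closely, they can be isolated with the same 1.5 × IQR rule that the boxplot whiskers use by default. A minimal sketch on a hypothetical series (made-up values):

```python
import pandas as pd

# Hypothetical glucose readings with two unusually high values.
glucose = pd.Series([70, 78, 85, 90, 92, 95, 100, 105, 210, 250])

q1, q3 = glucose.quantile([0.25, 0.75])
iqr = q3 - q1                 # interquartile range
lower = q1 - 1.5 * iqr        # lower fence
upper = q3 + 1.5 * iqr        # upper fence

# Keep only the values outside the fences.
outliers = glucose[(glucose < lower) | (glucose > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers.tolist())
```

Filtering `data` with the same mask would return the full rows behind the outliers, so that their age, BMI, and stroke status can be inspected together.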

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=data,x='avg_glucose_level_group',hue='gender')

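**We note that women outnumber men in most glucose-level groups. This again partly reflects the larger number of women in the data, so it does not by itself show that women have less stable glucose levels.**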

# EDA : for bmi_median column

> bmi: body mass index

In [None]:
data["bmi_median"].nunique()

In [None]:
data["bmi_median"].describe()

In [None]:
sns.boxplot(x='bmi_median', data=data)  # pass the column by keyword; recent seaborn versions reject it positionally
plt.xlabel('bmi_median Column Distribution')

In [None]:
sns.histplot(data['bmi_median'], kde=True);  # histplot replaces the deprecated distplot

> **We note here the presence of some outliers far from the bulk of the data.**

> **After looking at some reviews and papers, we can conclude that the relationship between body mass and stroke incidence is ambiguous, and there is no firm basis to build on in this analysis. We were not able to establish a strong relationship between stroke and body mass, so to dig deeper into this question we would need to extend the data with more features that connect the two.**

# EDA : for smoking_status column

In [None]:
data["smoking_status"].unique()

In [None]:
#Let's see these values on the graph.
sns.countplot(data=data,x='smoking_status')

In [None]:
sns.countplot(data=data,x='smoking_status',hue='gender')

In [None]:
sns.countplot(data=data,x='smoking_status',hue='stroke')

> **We note from the figure that the incidence of stroke barely differs between smokers, former smokers, and people who never smoked.**

> **But we do notice a very small increase in the incidence of stroke among smokers and former smokers.**

> **So you have to quit smoking.**

In [None]:
#Here we perform the process of data transformation to facilitate data visualizations.
data["smoking_status"]=data["smoking_status"].map({'formerly smoked':0, 'never smoked':1, 'smokes':2, 'Unknown':3 })

In [None]:
fig = sns.FacetGrid(data, hue="Age_group",aspect=4)
fig.map(sns.kdeplot,'smoking_status',fill=True)  # fill replaces the deprecated shade argument
oldest = data['smoking_status'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

In [None]:
data["smoking_status"].value_counts()

# The End

> **In this stage, I explored and analyzed the data in detail, reviewing the features that may affect the target and extracting information from them. That was the first stage.**

> **In the next stage, I will use machine learning algorithms and neural networks to predict the test data as accurately as possible.**

# Thank you very much.