## Author: Lirilkumar Amal

## Goal: Stroke dataset exploration

##### Understand the data -> learn how the target variable is shaped by the other columns -> create features based on correlation -> model -> woohoo!

#### Tip: Write down the questions you are trying to answer, and add a conclusion for each question

#### Data exploration is used to
- know your data and its distribution
- see what is available in the data
- check whether the sample captures (covers) the population
- validate any assumptions against the data
- understand the relationship between the target and the predictors
- find anomalies

#### Note:
- Data exploration can be performed with visualization as well as with statistical methods
- Plots are enough for a simple understanding, but when the data is large or complex relationships need to be tested, statistical methods are preferred.

# imports and installations

In [None]:
import pandas as pd
import missingno
import seaborn as sns
import matplotlib.pyplot as plt

---

## Data  

In [None]:
data_folder = r"/kaggle/input/heart-stroke/"

In [None]:
strokes_df = pd.read_csv(data_folder + "train_strokes.csv", index_col="id")
# Cast the discrete columns to the pandas category dtype
cat_cols = ['gender','ever_married','Residence_type','work_type','smoking_status','hypertension','stroke']
strokes_df[cat_cols] = strokes_df[cat_cols].astype('category')

## Data  Dictionary

In [None]:
strokes_df.shape

Target: stroke (1 = yes, 0 = no)

In [None]:
strokes_df.info()

In [None]:
# column groups used in the rest of the analysis
categorical_vars = ['gender','hypertension','heart_disease', 'ever_married','Residence_type', 'work_type', 'smoking_status', 'stroke']
continuous_vars = ['age','avg_glucose_level','bmi']

# Data exploration  

In [None]:
strokes_df.head()

In [None]:
strokes_df[continuous_vars].describe()

Check for gaps (missing values) in the data

In [None]:
missingno.matrix(strokes_df, figsize = (30,5))
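The matrix plot shows where the gaps sit; a numeric summary makes the same information easier to compare across columns. A minimal sketch on a small made-up frame (`toy_df` and its values are illustrative assumptions, not the stroke data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for strokes_df
toy_df = pd.DataFrame({
    "age": [67, 45, 80, 12],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "smoking_status": ["smokes", None, "never smoked", None],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing": toy_df.isna().sum(),
    "percent": toy_df.isna().mean() * 100,
})
print(missing_summary)
```

Running the same two expressions on `strokes_df` should give one row per column of the real data.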

---

# univariate analysis for all columns
U-Axx

##### Tip : Zipf's law

Categorical variables often have many unique values, most of which occur very rarely.

(Zipf’s law: frequency is roughly inversely proportional to rank, so the most frequent value occurs about twice as often as the second most frequent, three times as often as the third, and so on.)
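One way to check a categorical column against this rule of thumb is to rank its values by frequency and compare each count with the top count divided by the rank. A small sketch with made-up values (the series below is a hypothetical example built to follow the law exactly):

```python
import pandas as pd

# Hypothetical category values, chosen so the counts are 60, 30, 20, 15
values = pd.Series(["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15)

counts = values.value_counts()             # sorted from most to least frequent
zipf_df = counts.to_frame("count")
zipf_df["rank"] = range(1, len(counts) + 1)
# Under Zipf's law, a count is about (top count) / rank
zipf_df["zipf_expected"] = counts.iloc[0] / zipf_df["rank"]
print(zipf_df)
```

Real columns will only follow the pattern loosely; large gaps between `count` and `zipf_expected` simply mean the column is not Zipf-like.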

U-A1: count of records per category

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4,figsize=(25,7))

fig.suptitle("Countplot for strokes_df", fontsize=35)

sns.countplot(x="gender", data=strokes_df,ax=ax1)
sns.countplot(x="stroke", data=strokes_df,ax=ax2)
sns.countplot(x="ever_married", data=strokes_df,ax=ax3)
sns.countplot(x="hypertension", data=strokes_df,ax=ax4)

conclusion:
- A few observations have a gender other than the two main categories
- Most observations in the dataset are married

---

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(25,7))

fig.suptitle("Countplot for strokes_df", fontsize=35)

sns.countplot(x="work_type", data=strokes_df,ax=ax1)
sns.countplot(x="Residence_type", data=strokes_df,ax=ax2)
sns.countplot(x="smoking_status", data=strokes_df,ax=ax3)


conclusion:
- private jobs dominate the observations
- rural and urban observations are about equal
- about half of the observations never smoked

In [None]:
g = sns.catplot(x="Residence_type", hue="smoking_status", col="work_type",
                data=strokes_df, kind="count",
                height=4, aspect=.7)

---

U-B1: distribution of numerical columns

##### Motive
- What range do the observations cover?
- What is their central tendency?
- Are they heavily skewed in one direction?
- Is there evidence for bimodality?
- Are there significant outliers?

In [None]:
sns.histplot(strokes_df[continuous_vars], kde=True)
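The questions in the motive list can also be answered numerically. A minimal sketch for a single column, using made-up BMI values and the common 1.5 × IQR rule of thumb for outliers (both the data and the choice of rule are assumptions for illustration):

```python
import pandas as pd

# Hypothetical values for one numeric column
bmi = pd.Series([18.5, 22.0, 24.3, 27.1, 29.8, 31.0, 33.4, 92.0], name="bmi")

q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1
# Flag points more than 1.5 * IQR outside the quartiles
outliers = bmi[(bmi < q1 - 1.5 * iqr) | (bmi > q3 + 1.5 * iqr)]

summary = {
    "min": bmi.min(), "max": bmi.max(),            # range
    "mean": bmi.mean(), "median": bmi.median(),    # central tendency
    "skew": bmi.skew(),                            # > 0 means a longer right tail
    "n_outliers": len(outliers),
}
print(summary)
```

Bimodality is harder to reduce to one number, so the histogram and KDE plots remain the better tool for that question.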

#### KDE plot

A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:

In [None]:
sns.displot(x="age", data=strokes_df, kind="kde", hue="gender", col="smoking_status", row="Residence_type")
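To make the Gaussian-kernel smoothing concrete, here is a hand-rolled sketch of a one-dimensional KDE in NumPy. The function name `gaussian_kde_1d`, the sample ages, and the fixed bandwidth of 5 are assumptions for illustration; seaborn picks the bandwidth for you.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ages = np.array([23, 25, 31, 58, 61, 64, 70])   # hypothetical ages
grid = np.linspace(0, 100, 501)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to about 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one smooths the two age clusters into a single hump.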

---

## Bivariate analysis: checking the relationship between two variables
BI-Axx

#### Pairs can be
- Categorical and categorical: grouped count plot, contingency table
    - Contingency table: a table that summarizes two categorical variables by counting the observations in each combination of their levels.
- Categorical and continuous: box plots, Z-test / t-test
- Continuous and continuous: scatter plot
BI-A1: How the values of the numeric columns differ between stroke and no-stroke observations

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(20,7))
fig.suptitle("Boxplot for Strokes", fontsize=35)

sns.boxplot(x="stroke", y="avg_glucose_level", data=strokes_df,ax=ax1)
sns.boxplot(x="stroke", y="bmi", data=strokes_df,ax=ax2)
sns.boxplot(x="stroke", y="age", data=strokes_df,ax=ax3)

conclusions:
- higher average glucose levels are observed in stroke cases
- very high BMI values are not seen in stroke observations
- stroke cases skew toward older ages, with a few anomalies at young ages

---

BI-B1: how residence type affects the risk of stroke

In [None]:
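temp = strokes_df  # alias, not a copy: the cast below also changes strokes_df
temp['stroke'] = temp['stroke'].astype(int)
# stroke cases ('sum') and observations ('count') per residence type
residence_type_df = strokes_df.groupby(["Residence_type"])['stroke'].agg(['sum','count']).reset_index()
residence_type_df['risk'] = residence_type_df['sum'] / residence_type_df['count'] * 100  # risk in percent
residence_type_df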

conclusion
- rural residents are slightly less prone to stroke
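The same sum / count / risk pattern is repeated for every grouping column, so it can be wrapped in a small helper. This is a sketch only: `risk_by` is a hypothetical name, and the toy frame stands in for `strokes_df`.

```python
import pandas as pd

def risk_by(df, group_cols, target="stroke"):
    """Return cases, total and risk (%) of `target` for each group."""
    out = (
        df.assign(**{target: df[target].astype(int)})   # cast on a new frame, df is untouched
          .groupby(group_cols, observed=True)[target]
          .agg(cases="sum", total="count")
          .reset_index()
    )
    out["risk"] = out["cases"] / out["total"] * 100
    return out

# Hypothetical toy data standing in for strokes_df
toy_df = pd.DataFrame({
    "Residence_type": ["Urban", "Urban", "Rural", "Rural", "Urban", "Rural"],
    "stroke": pd.Categorical([1, 0, 0, 0, 1, 1]),
})
print(risk_by(toy_df, "Residence_type"))
```

Passing a list such as `["ever_married", "smoking_status"]` as `group_cols` gives the risk per combination of levels, and the original frame keeps its `category` dtype.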

---

BI-B2: HYP: are private-sector employees more prone to stroke?

In [None]:
temp = strokes_df
temp['stroke'] = temp['stroke'].astype(int)
work_type_df = strokes_df.groupby(["work_type"])['stroke'].agg(['sum','count']).reset_index()
work_type_df['risk'] = work_type_df['sum'] / work_type_df['count'] * 100
work_type_df

conclusion
 - the self-employed are at the highest risk, especially compared to children

---

BI-B3: Is smoking a factor that leads to stroke, compared to not smoking?

In [None]:
temp = strokes_df
temp['stroke'] = temp['stroke'].astype(int)
smoking_status_df = temp.groupby(["smoking_status"])['stroke'].agg(['sum','count']).reset_index()
smoking_status_df['risk'] = smoking_status_df['sum'] / smoking_status_df['count'] * 100
smoking_status_df

conclusion
 - smoking doubles the risk of stroke

---

BI-B4: Is marriage a factor that leads to stroke, compared to never being married?

In [None]:
temp = strokes_df
temp['stroke'] = temp['stroke'].astype(int)
ever_married_df = temp.groupby(["ever_married"])['stroke'].agg(['sum','count']).reset_index()
ever_married_df['risk'] = ever_married_df['sum'] / ever_married_df['count'] * 100
ever_married_df

conclusion
 - married people have a 5 times higher risk of stroke
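A statement of the form "N times higher risk" is a relative risk: the risk in one group divided by the risk in a reference group. A small sketch of the arithmetic, using made-up counts (the numbers below are hypothetical and not taken from the stroke data):

```python
import pandas as pd

# Hypothetical counts per ever_married level
risk_df = pd.DataFrame(
    {"cases": [5, 30], "total": [500, 600]},
    index=["No", "Yes"],
)
risk_df["risk"] = risk_df["cases"] / risk_df["total"]

# Relative risk: risk in the group of interest divided by risk in the reference group
relative_risk = risk_df.loc["Yes", "risk"] / risk_df.loc["No", "risk"]
print(relative_risk)
```

The same division applied to the `risk` column of `ever_married_df` gives the ratio behind the conclusion above.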

---

BI-B5: Are marriage and smoking status together a factor that leads to stroke?

In [None]:
temp = strokes_df
temp['stroke'] = temp['stroke'].astype(int)
ever_married_df = temp.groupby(["ever_married","smoking_status"])['stroke'].agg(['sum','count']).reset_index()
ever_married_df['risk'] = ever_married_df['sum'] / ever_married_df['count'] * 100
ever_married_df

conclusion
 - the risk is very high for people who have ever been married and smoke

---

In [None]:
sns.boxplot(x="stroke", y="bmi",data=strokes_df)
plt.title("Feature idea: does BMI over 60 mean no stroke?")

In [None]:
sns.boxplot(x="gender", y="bmi", hue="stroke",data=strokes_df)
plt.title("Thought: No stroke if gender is Other?")

---

In [None]:
sns.boxplot(x="stroke", y="age", hue="gender",data=strokes_df)
plt.title("Anomaly: risk rises with age, but a few young people have strokes")

---

BI-B6: Do smokers and non-working people have a higher chance of stroke due to glucose level, and how does this look by gender?

In [None]:
sns.set(rc={'figure.figsize':(17,5)})
sns.boxplot(x="work_type", y="avg_glucose_level", hue="smoking_status",data=strokes_df)
plt.title("Never-worked smokers are at very high risk")

In [None]:
sns.set(rc={'figure.figsize':(17,5)})
sns.boxplot(x="work_type", y="avg_glucose_level", hue="stroke",data=strokes_df)
plt.title("Never-worked smokers are at very high risk")

---

In [None]:
g = sns.FacetGrid(strokes_df, col="heart_disease", hue="stroke", height=4, aspect=1.6, row="work_type")
g.map(sns.scatterplot, "avg_glucose_level", "age", alpha=.7)
g.add_legend()

conclusion
- private and self-employed workers of high age are prone to stroke
- children rarely have strokes
---

In [None]:
sns.displot(data=strokes_df, x="age", hue="stroke", multiple="stack", kind="kde", col="work_type", row="smoking_status")
plt.title("Stroke and No Stroke observation per combination")

---

In [None]:
sns.pairplot(data=strokes_df[continuous_vars+["stroke"]], hue="stroke")

---

In [None]:
g = sns.FacetGrid(strokes_df, col="work_type", hue="gender", height=4, aspect=1.6,row="ever_married")
g.map(sns.countplot, "smoking_status", alpha=.7)
g.add_legend()

---

Tip: grouped Summary

In [None]:
strokes_df.columns

In [None]:
import researchpy as rp
rp.summary_cont(strokes_df[['avg_glucose_level','age']].groupby(strokes_df['stroke']))

---

### Correlation matrix between variables

Pearson correlation coefficient

One of the simplest methods for understanding a feature’s relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables. The resulting value lies in [-1, 1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation, and 0 meaning no linear correlation between the two variables.

- Methods accepted by pandas `corr`: ‘pearson’, ‘kendall’, ‘spearman’
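As a worked example, the sketch below computes Pearson's coefficient directly from its definition and compares it with what pandas returns, then shows the rank-based Spearman alternative. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements
toy_df = pd.DataFrame({
    "age":     [20, 35, 50, 65, 80],
    "glucose": [80, 95, 90, 130, 150],
})

# Pearson r from its definition: co-variation divided by the product of the spreads
x, y = toy_df["age"], toy_df["glucose"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# The same coefficient via pandas, plus the rank-based Spearman version
r_pandas = toy_df.corr(method="pearson").loc["age", "glucose"]
rho = toy_df.corr(method="spearman").loc["age", "glucose"]
print(round(r_manual, 3), round(r_pandas, 3), round(rho, 3))
```

Spearman works on ranks, so it picks up monotonic relationships that are not linear; `method="kendall"` is called the same way.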

In [None]:
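# Compute a correlation matrix over the numeric columns and convert to long-form
# (recent pandas versions raise an error if string/category columns are passed to corr)
corr_mat = strokes_df.select_dtypes(include="number").corr(method="kendall").stack().reset_index(name="correlation")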

# Draw each cell as a scatter point with varying size and color
g = sns.relplot(
    data=corr_mat,
    x="level_0", y="level_1", hue="correlation", size="correlation",
    palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
    height=5, sizes=(50, 250), size_norm=(-.2, .8),
)

# Tweak the figure to finalize
g.set(xlabel="", ylabel="", aspect="equal")
g.despine(left=True, bottom=True)
g.ax.margins(0.25)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)
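# matplotlib 3.7 renamed Legend.legendHandles to legend_handles; try the new name first
for artist in getattr(g.legend, "legend_handles", getattr(g.legend, "legendHandles", [])):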
    artist.set_edgecolor(".1")

In [None]:
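strokes_temp_df = strokes_df.copy()  # work on a copy so strokes_df keeps its dtypes
strokes_temp_df[['stroke','hypertension']] = strokes_temp_df[['stroke','hypertension']].astype('int')
# Pearson correlation over the numeric columns only
corr = strokes_temp_df.select_dtypes(include="number").corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3
corr.style.background_gradient().format(precision=2)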

conclusion:
- stroke is correlated with age and heart disease

---

# Outcome expected from EDA

- Deep understanding of data and distribution
- Columns to consider when building features
- Edge cases for which less data is available
- Anomalies and Outliers in Data

#  Links