# 1. Background Information of Dataset

Features of dataset:

1. sex: (0 represents female & 1 represents male)
2. age: patient's age in years
3. hypertension: (0 represents no history of hypertension & 1 represents history of hypertension)
4. heart_disease: (0 represents no history of heart disease & 1 represents history of heart disease)
5. ever_married: (0 represents that patient has not been married before & 1 represents patient has been married before)
6. work_type: (0 represents never_worked, 1 represents children, 2 represents Govt-Job, 3 represents Self-Employed & 4 represents Private)
7. Residence_type: (0 represents Rural & 1 represents Urban)
8. avg_glucose_level: (numeric data to represent the patient's average glucose level)
9. bmi: (numeric data to represent Body Mass Index)
10. smoking_status: (0 represents never smoked & 1 represents smokes)
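
The integer codes listed above are easier to read once they are mapped back to labels. A minimal sketch of that decoding follows; the labels are transcribed from the list above, while the names `CODE_LABELS` and `decode_codes` and the small demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Label maps transcribed from the feature list; the dict name is hypothetical.
CODE_LABELS = {
    "sex": {0: "female", 1: "male"},
    "work_type": {0: "never_worked", 1: "children", 2: "Govt-Job",
                  3: "Self-Employed", 4: "Private"},
    "Residence_type": {0: "Rural", 1: "Urban"},
    "smoking_status": {0: "never smoked", 1: "smokes"},
}


def decode_codes(df):
    """Return a copy of df with the coded columns replaced by readable labels."""
    out = df.copy()
    for column, labels in CODE_LABELS.items():
        if column in out.columns:
            out[column] = out[column].map(labels)
    return out


# Made-up rows for illustration only (not taken from the dataset).
demo = pd.DataFrame({
    "sex": [1.0, 0.0],
    "work_type": [4, 3],
    "Residence_type": [1, 0],
    "smoking_status": [1, 0],
})
decoded = decode_codes(demo)
print(decoded)
```

Working on a copy keeps the numeric codes available for modelling, so the decoded frame is only meant for inspection and plots.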

The raw dataset has 40910 rows. Three rows have a missing value for 'sex'; dropping them leaves 40907 rows.
We then removed the rows with a negative value for 'age'.

In [1]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# 2. Libraries and Packages

In [2]:
# Import general packages - numpy,pandas,seaborn,matplotlib
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()  # call the function; a bare `sb.set` does nothing

# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy import stats

from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

### Step 1: Import the CSV file

In [3]:
# Import the dataset
sourcedata = pd.read_csv('stroke_data.csv')
sourcedata.head()

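|   | sex | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 63 | 0 | 1 | 1 | 4 | 1 | 228.69 | 36.6 | 1 | 1 |
| 1 | 1.0 | 42 | 0 | 1 | 1 | 4 | 0 | 105.92 | 32.5 | 0 | 1 |
| 2 | 0.0 | 61 | 0 | 0 | 1 | 4 | 1 | 171.23 | 34.4 | 1 | 1 |
| 3 | 1.0 | 41 | 1 | 0 | 1 | 3 | 0 | 174.12 | 24.0 | 0 | 1 |
| 4 | 1.0 | 85 | 0 | 0 | 1 | 4 | 1 | 186.21 | 29.0 | 1 | 1 |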


# 3. Data Cleaning

### Step 1: Check the data type of each column in the dataset

In [4]:
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40910 entries, 0 to 40909
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sex                40907 non-null  float64
 1   age                40910 non-null  int64  
 2   hypertension       40910 non-null  int64  
 3   heart_disease      40910 non-null  int64  
 4   ever_married       40910 non-null  int64  
 5   work_type          40910 non-null  int64  
 6   Residence_type     40910 non-null  int64  
 7   avg_glucose_level  40910 non-null  float64
 8   bmi                40910 non-null  float64
 9   smoking_status     40910 non-null  int64  
 10  stroke             40910 non-null  int64  
dtypes: float64(3), int64(8)
memory usage: 3.4 MB


In [5]:
sourcedata.describe()

|   | sex | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 40907.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 | 40910.0 |
| mean | 0.555162 | 51.327255 | 0.213835 | 0.127719 | 0.82134 | 3.461134 | 0.514886 | 122.075901 | 30.406355 | 0.488609 | 0.500122 |
| std | 0.496954 | 21.623969 | 0.410017 | 0.333781 | 0.383072 | 0.780919 | 0.499784 | 57.561531 | 6.835072 | 0.499876 | 0.500006 |
| min | 0.0 | -9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 55.12 | 11.5 | 0.0 | 0.0 |
| 25% | 0.0 | 35.0 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | 78.75 | 25.9 | 0.0 | 0.0 |
| 50% | 1.0 | 52.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 97.92 | 29.4 | 0.0 | 1.0 |
| 75% | 1.0 | 68.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 167.59 | 34.1 | 1.0 | 1.0 |
| max | 1.0 | 103.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 | 271.74 | 92.0 | 1.0 | 1.0 |


### Step 2: Check whether there are any NaN values in the csv file

In [6]:
# Count the number of NaN values in each column
sourcedata.isnull().sum()

sex                  3
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
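
The per-column counts show how many values are missing but not which rows they sit in. A small sketch of how the affected rows could be inspected before dropping them is given below; the demo frame is made up for illustration and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from stroke_data.csv.
demo = pd.DataFrame({
    "sex": [1.0, np.nan, 0.0],
    "age": [63, 42, 61],
})

# Boolean mask: True for every row that has at least one NaN.
has_missing = demo.isnull().any(axis=1)

missing_rows = demo[has_missing]
missing_index = missing_rows.index.tolist()
print(missing_rows)
```

On the notebook's own frame the same mask would be built from `sourcedata` instead of `demo`.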

### Step 3: Remove the rows with NaN values 

In [7]:
# Remove the rows with NaN values in 'sex'
sourcedata.dropna(subset=['sex'], inplace=True)
sourcedata.head()

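|   | sex | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 63 | 0 | 1 | 1 | 4 | 1 | 228.69 | 36.6 | 1 | 1 |
| 1 | 1.0 | 42 | 0 | 1 | 1 | 4 | 0 | 105.92 | 32.5 | 0 | 1 |
| 2 | 0.0 | 61 | 0 | 0 | 1 | 4 | 1 | 171.23 | 34.4 | 1 | 1 |
| 3 | 1.0 | 41 | 1 | 0 | 1 | 3 | 0 | 174.12 | 24.0 | 0 | 1 |
| 4 | 1.0 | 85 | 0 | 0 | 1 | 4 | 1 | 186.21 | 29.0 | 1 | 1 |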


### Step 4: Remove rows with negative age

In [8]:
# Keep only the rows with a non-negative 'age' (drops rows with negative ages)
sourcedata = sourcedata[sourcedata["age"]>= 0].dropna()
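
The filter above does not report how many rows it removed. One possible way to keep track is sketched here on made-up ages; the variable names are hypothetical, and the counts printed apply to the demo frame only, not to the real dataset.

```python
import pandas as pd

# Made-up ages for illustration; not rows from stroke_data.csv.
demo = pd.DataFrame({"age": [63, -9, 42, -1, 85]})

rows_before = len(demo)
negative_ages = int((demo["age"] < 0).sum())  # rows the filter will remove

# Same filter as in the cell above, applied to the demo frame.
demo = demo[demo["age"] >= 0]
rows_after = len(demo)

# The filter should remove exactly the rows counted beforehand.
assert rows_before - rows_after == negative_ages
print(f"removed {negative_ages} rows, {rows_after} remain")
```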

### Step 5: Check to ensure there is no NaN value after data cleaning

In [9]:
#check to ensure the dataset does not contain any NaN values after data cleaning
sourcedata.isnull().sum()

sex                  0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

### Step 6: Convert the categorical columns to the category data type

In [10]:
sourcedata['sex']= sourcedata['sex'].astype('category')

In [11]:
sourcedata['hypertension']= sourcedata['hypertension'].astype('category')

In [12]:
sourcedata['heart_disease']= sourcedata['heart_disease'].astype('category')

In [13]:
sourcedata['ever_married']= sourcedata['ever_married'].astype('category')

In [14]:
sourcedata['work_type']= sourcedata['work_type'].astype('category')

In [15]:
sourcedata['Residence_type']= sourcedata['Residence_type'].astype('category')

In [16]:
sourcedata['smoking_status']= sourcedata['smoking_status'].astype('category')

In [17]:
sourcedata['stroke']= sourcedata['stroke'].astype('category')
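
The eight conversions above could also be written as a single `astype` call that takes a column-to-dtype mapping. This is a sketch of that alternative, not the notebook's own code; the column names come from the cells above, while `CATEGORICAL_COLUMNS` and the demo frame are hypothetical.

```python
import pandas as pd

# Column names taken from the cells above; the list name is hypothetical.
CATEGORICAL_COLUMNS = [
    "sex", "hypertension", "heart_disease", "ever_married",
    "work_type", "Residence_type", "smoking_status", "stroke",
]

# Made-up values for illustration only.
demo = pd.DataFrame({name: [0, 1, 1] for name in CATEGORICAL_COLUMNS})
demo["age"] = [63, 42, 61]

# One call converts every listed column; other columns keep their dtype.
demo = demo.astype({name: "category" for name in CATEGORICAL_COLUMNS})
print(demo.dtypes)
```

Keeping the column names in one list also makes it harder to miss a column if the set of categorical features changes later.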

### Step 7: Export the cleaned dataset 

In [18]:
sourcedata.to_csv('cleaned_data.csv', index=False)
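
A CSV file stores only text, so the category dtypes set in Step 6 are not saved in `cleaned_data.csv`. Whoever reloads the file has to request them again, for example through the `dtype` argument of `pd.read_csv`. The sketch below shows this on a small in-memory buffer rather than the real file; the demo frame is made up for illustration.

```python
import io

import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
demo = pd.DataFrame({"sex": [1.0, 0.0], "age": [63, 61]})
demo["sex"] = demo["sex"].astype("category")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Reading back without hints: 'sex' is inferred as a numeric column again.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with an explicit dtype restores the category dtype.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"sex": "category"})

print(plain.dtypes)
print(typed.dtypes)
```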