# Libraries and Dataset

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
print('data loaded')

data loaded


In [5]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


# 10-Point Inspection

In [8]:
# 1 Shape: returns (rows, columns)
df.shape

(5110, 12)

**Findings**

* 5110 Rows
* 12 Columns/Features
* Each row represents a patient, while each column is a feature of that patient

In [9]:
# 2 Column Names
# enumerate(..., start=1) numbers the columns without a manual counter
for n, col in enumerate(df.columns, start=1):
    print(f'Column {n}: {col}')

Column 1: id
Column 2: gender
Column 3: age
Column 4: hypertension
Column 5: heart_disease
Column 6: ever_married
Column 7: work_type
Column 8: Residence_type
Column 9: avg_glucose_level
Column 10: bmi
Column 11: smoking_status
Column 12: stroke


For columns such as 'work_type' and 'Residence_type', I will need to look at the values to understand how they have been used. I will also need to research the medical columns (such as 'bmi' and 'avg_glucose_level') so that I know what impossible or illogical values look like.
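
Once that research is done, the check itself could look something like the sketch below. The bounds in `plausible_ranges` are placeholders I made up for illustration, not researched medical limits, and `count_out_of_range` is a hypothetical helper name. The sample rows are copied from the `df.head()` output above.

```python
import pandas as pd

# Sample rows copied from the df.head() output above
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12],
    "bmi": [36.6, None, 32.5, 34.4, 24.0],
})

# ASSUMPTION: placeholder bounds, to be replaced after researching each measure
plausible_ranges = {
    "age": (0, 120),
    "avg_glucose_level": (20, 600),
    "bmi": (10, 100),
}

def count_out_of_range(frame, ranges):
    """Count non-missing values that fall outside [low, high] per column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()  # missing values are handled separately
        counts[col] = int((~values.between(low, high)).sum())
    return counts

out_of_range = count_out_of_range(sample, plausible_ranges)
print(out_of_range)
```

Missing values are dropped before the comparison so that they are not counted as illogical; they get their own step later in the inspection.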

In [12]:
# 3 Data Types
# Split the columns by dtype so each group can be printed with its type

numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()

print('Sorting Values...\n')

print('Numerical Columns:')
for col in numerical_cols:
    print(f"{col}: {df.dtypes[col]}")

print()
print('Categorical Columns:')
for col in categorical_cols:
    print(f"{col}: {df.dtypes[col]}")


Sorting Values...

Numerical Columns:
id: int64
age: float64
hypertension: int64
heart_disease: int64
avg_glucose_level: float64
bmi: float64
stroke: int64

Categorical Columns:
gender: object
ever_married: object
work_type: object
Residence_type: object
smoking_status: object


Some of the columns in the numerical section cannot be considered truly numerical. The 'heart_disease' column, while stored as integers, is actually categorical, with 0 meaning the patient does not have heart disease and 1 meaning they do. As such, going forward I shouldn't rely on the dtypes alone to determine whether a column is numerical or categorical.
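
One possible way to separate these cases is to look at the values as well as the dtype. The sketch below treats any numeric column that only contains 0 and 1 as categorical. `split_columns` is a hypothetical helper, and the 0/1 rule is my own assumption about how flags are encoded in this dataset. The sample rows are copied from the `df.head()` output.

```python
import pandas as pd

# Sample rows copied from the df.head() output above
sample = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182, 1665],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 1, 0, 0],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
})

def split_columns(frame):
    """Treat non-numeric columns and 0/1-only numeric columns as categorical."""
    categorical, numerical = [], []
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        # ASSUMPTION: a column holding only 0 and 1 is a yes/no flag
        is_flag = set(frame[col].dropna().unique()) <= {0, 1}
        if not is_numeric or is_flag:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical

categorical, numerical = split_columns(sample)
print("Categorical:", categorical)
print("Numerical:", numerical)
```

Note that 'id' still lands in the numerical group here even though it is an identifier rather than a measurement, so a rule like this would still need a manual review.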

In [18]:
#4 First look
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


* I was curious what the dataset meant by work and residence type, and it is now a little clearer. I do wonder why suburban was not included as a residence type, though perhaps it was considered close enough to urban.
* Nothing unexpected or unusual jumps out, except for the potentially missing values, which are discussed next.
* Some missing values are visible immediately: the second row has 'NaN' for bmi. Furthermore, although not seen here, the tail in the next block shows an 'Unknown' value for smoking_status, which may be a placeholder for missing data.
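
If 'Unknown' does turn out to be a placeholder, it will not show up in a standard null check. A minimal sketch of how it could be counted and converted to a real missing value is below. Treating 'Unknown' as missing is an assumption at this point, and the sample rows are copied from the `df.tail()` output in the next block.

```python
import pandas as pd

# Sample rows copied from the df.tail() output
sample = pd.DataFrame({
    "smoking_status": ["never smoked", "never smoked", "never smoked",
                       "formerly smoked", "Unknown"],
    "bmi": [None, 40.0, 30.6, 25.6, 26.2],
})

# Count how often the suspected placeholder appears
unknown_count = int(sample["smoking_status"].eq("Unknown").sum())

# ASSUMPTION: 'Unknown' means the value was not recorded.
# mask() replaces the flagged entries with NaN so isna() can see them.
cleaned = sample.copy()
cleaned["smoking_status"] = cleaned["smoking_status"].mask(
    cleaned["smoking_status"].eq("Unknown")
)
missing = cleaned.isna().sum()

print(f"'Unknown' entries: {unknown_count}")
print(missing)
```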

In [19]:
#5 Last Look
df.tail()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,44679,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


* The data seems to have a clean end
* All the rows remain consistent with the beginning

In [20]:
#6 Memory Usage
df.memory_usage()

Index                  132
id                   40880
gender               40880
age                  40880
hypertension         40880
heart_disease        40880
ever_married         40880
work_type            40880
Residence_type       40880
avg_glucose_level    40880
bmi                  40880
smoking_status       40880
stroke               40880
dtype: int64

In [13]:
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / (1024 ** 2):.2f} MB")

Total memory usage: 1.62 MB


At roughly 1.62 MB in memory, this is a very small dataset by data science standards.
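
The per-column figures from `df.memory_usage()` are identical (40880 bytes, which is 5110 rows times 8 bytes) because, without `deep=True`, an object column is only charged for its 8-byte references and not for the strings they point to. The sketch below shows the difference on a few rows copied from the `df.head()` output. The `dtype="object"` is set explicitly to mirror the dtypes reported earlier in this notebook.

```python
import pandas as pd

# Sample rows copied from the df.head() output above
sample = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182, 1665],
    "gender": pd.Series(["Male", "Female", "Male", "Female", "Female"],
                        dtype="object"),
})

shallow = sample.memory_usage()        # counts references only for object columns
deep = sample.memory_usage(deep=True)  # also counts the Python string objects

print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

For the integer column the two numbers agree, while the object column is larger under `deep=True`, which is why the total above was computed with it.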

In [22]:
#7 Missing Values
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
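
Only 'bmi' has nulls: 201 of 5110 rows, or about 3.9%. A small sketch for showing the count and the percentage side by side is below. It runs on sample rows copied from the `df.head()` output, so its percentages describe the sample and not the full dataset.

```python
import pandas as pd

# Sample rows copied from the df.head() output above
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "bmi": [36.6, None, 32.5, 34.4, 24.0],
})

missing_count = sample.isna().sum()
# The mean of a boolean mask is the fraction of True values
missing_pct = (sample.isna().mean() * 100).round(2)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```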

In [23]:
#8 Duplicates
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
5105    False
5106    False
5107    False
5108    False
5109    False
Length: 5110, dtype: bool
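
The boolean Series above is truncated, so it cannot confirm on its own that every row is `False`. Summing it gives a single count instead. The sketch below uses a tiny frame with one deliberately repeated row (a made-up duplicate, not a row that is duplicated in the real data) so that the count is non-zero.

```python
import pandas as pd

# The first two rows are copied from df.head(); the third row is a
# HYPOTHETICAL repeat added only to demonstrate a non-zero count
sample = pd.DataFrame({
    "id": [9046, 51676, 51676],
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 61.0],
})

# True counts as 1, so the sum is the number of repeated rows
n_duplicate_rows = int(sample.duplicated().sum())

# Checking the identifier alone catches repeated patients even if
# another column differs
n_duplicate_ids = int(sample["id"].duplicated().sum())

print(f"Duplicate rows: {n_duplicate_rows}")
print(f"Duplicate ids: {n_duplicate_ids}")
```

On the full dataset, the unique counts in step 10 report 5110 distinct 'id' values, which matches the row count, so no fully duplicated rows are expected.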

In [24]:
#9 Basic Statistics
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [25]:
#10 Unique Counts
df.nunique()

id                   5110
gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3979
bmi                   418
smoking_status          4
stroke                  2
dtype: int64
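
The counts alone do not say what the levels are; for example, 'gender' has 3 distinct values and 'smoking_status' has 4. One way to follow up is to print the value counts for every low-cardinality column, as in the sketch below. The `max_levels` cutoff is an arbitrary assumption, set to 3 here only because the sample has five rows (copied from the `df.tail()` output); something like 10 may suit the full dataset better.

```python
import pandas as pd

# Sample rows copied from the df.tail() output
sample = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Self-employed",
                  "Private", "Govt_job"],
    "smoking_status": ["never smoked", "never smoked", "never smoked",
                       "formerly smoked", "Unknown"],
    "avg_glucose_level": [83.75, 125.2, 82.99, 166.29, 85.28],
})

max_levels = 3  # ASSUMPTION: cutoff chosen for this 5-row sample

# Keep only the columns with few enough distinct values to list
low_card = [col for col in sample.columns if sample[col].nunique() <= max_levels]

level_counts = {col: sample[col].value_counts() for col in low_card}
for col, counts in level_counts.items():
    print(f"\n{col}")
    print(counts)
```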

# Data Dictionary

# Data Validation

# Create Age Groups

# Research Questions

# Target Variable Analysis 