## Data Cleaning

In [3]:
import pandas
import numpy
from matplotlib import pyplot as plt
import seaborn as sns

The raw data, a CSV file, has been loaded into the notebook and is presented as a dataframe. The preview shows a total of sixteen variables.
<br/>

In [5]:
stroke_data = pandas.read_csv("framinghamdata.csv")
stroke_data.head()

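| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |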


To better understand the structure of the dataset before preparing it for further evaluation, I used the shape attribute to see how many rows and columns are present in the original dataset. The info method was then used to check each column's data type and its number of non-null data points.

Here, it can be observed that there are 4238 rows and 16 columns in total. Every column contains either integers or floats. Looking at the non-null counts for each column, some attributes are complete while others are missing values.

In [6]:
print(stroke_data.shape)
stroke_data.info()

(4238, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


Missing data can complicate further data analysis and evaluation. Chaining isnull and sum shows how many data points are missing from each attribute (column). The columns education, cigsPerDay, BPMeds, totChol, BMI, heartRate and glucose are all missing data points.

In [7]:
stroke_data.isnull().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64
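
The raw counts above can be easier to compare as percentages, and the per-column counts (which sum to 645) can overstate the number of affected rows when one row is missing several values. The following is a minimal sketch on a small made-up frame, not the Framingham data; the names `toy`, `missing_summary` and `rows_with_gaps` are illustrative assumptions.

```python
import numpy
import pandas

# Small made-up frame standing in for stroke_data (values are illustrative only).
toy = pandas.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, numpy.nan, 70.0, numpy.nan],
    "BMI": [26.97, 28.73, numpy.nan, 28.58],
})

# isnull().sum() counts missing entries; isnull().mean() gives their fraction.
missing_summary = pandas.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

# Number of rows with at least one missing value (what dropna would remove).
rows_with_gaps = int(toy.isnull().any(axis=1).sum())

print(missing_summary)
print(rows_with_gaps)
```

Applied to the real dataframe, the same two expressions would show which attribute dominates the missing data and how many rows a full dropna would remove.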

To deal with this issue, I decided to drop the rows with missing data rather than impute values. For continuous attributes such as glucose or totChol, which are measured in fine increments, replacing the missing entries with, for example, the mean could distort the distribution of the data. Filling the nominal attributes with their most frequent value may lead to the same problem. Hence, after weighing these concerns, the rows with missing data points were dropped.
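
One way to see the concern about mean imputation is to compare the spread of a column before and after filling. The sketch below uses made-up glucose-like readings (not values from the dataset) and only illustrates the general effect: the mean is preserved, but the filled values all sit at the centre, so the standard deviation shrinks.

```python
import numpy
import pandas

# Made-up glucose-like readings; not taken from the Framingham data.
glucose = pandas.Series([70.0, 77.0, 85.0, 103.0, numpy.nan, numpy.nan, numpy.nan])

observed_std = glucose.std()               # NaN values are skipped by default
imputed = glucose.fillna(glucose.mean())   # replace gaps with the observed mean
imputed_std = imputed.std()

# The imputed series has the same mean but a smaller standard deviation.
print(round(observed_std, 2), round(imputed_std, 2))
```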

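To drop the rows with missing data, any entries recorded as the string "NA" were first replaced with numpy's NaN value, and the rows containing NaN were then removed using the dropna function. Since isnull already reported the missing values in the previous step, read_csv had evidently parsed them as NaN on load, so the replace step acts only as a safeguard.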

In [8]:
# Convert any literal "NA" strings to NaN
full_data = stroke_data.replace("NA", numpy.nan)

# Drop every row that contains at least one NaN
stroke_data = full_data.dropna()

stroke_data.head()

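| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |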
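
As a small illustration of how "NA" markers behave on load, the sketch below reads a made-up three-row CSV from a string rather than from framinghamdata.csv. "NA" is one of read_csv's default missing-value markers, so the entry arrives as NaN and dropna can remove its row directly; the names `csv_text` and `toy` are illustrative assumptions.

```python
import io
import pandas

# Made-up CSV text with an "NA" marker; stands in for framinghamdata.csv.
csv_text = "age,glucose,TenYearCHD\n39,77,0\n46,NA,0\n48,70,1\n"

toy = pandas.read_csv(io.StringIO(csv_text))

# "NA" has already been parsed as NaN, which also makes the column float64.
print(toy["glucose"].isnull().sum())   # 1
print(toy["glucose"].dtype)            # float64

# dropna removes the row holding the missing glucose value.
print(toy.dropna().shape)              # (2, 3)
```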


Now we check whether the previous step was successful. It appears to have worked correctly, as no missing values remain in any column of the dataset.

In [9]:
stroke_data.isnull().sum()

male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64

After dropping these rows, the number of data entries has decreased from 4238 to 3656. Although I would prefer to use as many data points as possible for a more holistic analysis, the remaining 3656 rows should still provide sufficient data for meaningful insight.

In [10]:
print(stroke_data.shape)
stroke_data.info()

(3656, 16)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3656 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             3656 non-null   int64  
 1   age              3656 non-null   int64  
 2   education        3656 non-null   float64
 3   currentSmoker    3656 non-null   int64  
 4   cigsPerDay       3656 non-null   float64
 5   BPMeds           3656 non-null   float64
 6   prevalentStroke  3656 non-null   int64  
 7   prevalentHyp     3656 non-null   int64  
 8   diabetes         3656 non-null   int64  
 9   totChol          3656 non-null   float64
 10  sysBP            3656 non-null   float64
 11  diaBP            3656 non-null   float64
 12  BMI              3656 non-null   float64
 13  heartRate        3656 non-null   float64
 14  glucose          3656 non-null   float64
 15  TenYearCHD       3656 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 485.6 KB
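
To put the reduction in perspective, the two row counts reported by the shape checks can be turned into a number of rows removed and a share of the original dataset. This is plain arithmetic on the figures already shown; the variable names are illustrative.

```python
rows_before = 4238   # from the first shape check
rows_after = 3656    # from the shape check after dropna

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped, f"{share_dropped:.1%}")   # 582 13.7%
```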

I decided to change the column names because the original names did not maintain any consistent format. I therefore adhered to these rules:
- All text will be lowercase (apart from acronyms)
- Acronyms will be all uppercase
- Words will be separated by underscores.

In [11]:
# Rename all columns in a single call using one mapping of old name -> new name
stroke_data = stroke_data.rename(columns = {
    "male": "sex",
    "currentSmoker": "current_smoker",
    "cigsPerDay": "cigs_per_day",
    "BPMeds": "BP_meds",
    "prevalentStroke": "stroke",
    "prevalentHyp": "hypertension",
    "totChol": "tot_chol",
    "sysBP": "sys_BP",
    "diaBP": "dia_BP",
    "heartRate": "heart_rate",
    "TenYearCHD": "10yr_CHD_risk",
})
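
The purely mechanical part of the naming rules (lowercase words, uppercase acronyms, underscores between words) could also be applied programmatically. The sketch below uses a hypothetical helper, `to_snake_case`, which is not part of pandas and was not used in this notebook. Renames that change the meaning of a name, such as male to sex or TenYearCHD to 10yr_CHD_risk, would still need an explicit mapping.

```python
import re

def to_snake_case(name):
    """Hypothetical helper: underscores at camelCase boundaries, acronyms kept uppercase."""
    # Boundary between a lowercase letter or digit and an uppercase letter
    # (currentSmoker -> current_Smoker).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Boundary between an acronym and a following capitalised word
    # (BPMeds -> BP_Meds).
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", spaced)
    # Lowercase ordinary words; leave all-uppercase parts (acronyms) untouched.
    parts = [part if part.isupper() else part.lower() for part in spaced.split("_")]
    return "_".join(parts)

for original in ["currentSmoker", "cigsPerDay", "BPMeds", "sysBP", "BMI"]:
    print(original, "->", to_snake_case(original))
```

With a dataframe, such a function could be passed directly as `stroke_data.rename(columns=to_snake_case)`, since rename accepts a callable as well as a dictionary.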

To keep the reporting of the nominal (categorical) attributes consistent, the education and BP_meds attributes were converted from floats to integers.

In [12]:
stroke_data["education"] = stroke_data["education"].astype(int)
stroke_data["BP_meds"] = stroke_data["BP_meds"].astype(int)

stroke_data.head()

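| | sex | age | education | current_smoker | cigs_per_day | BP_meds | stroke | hypertension | diabetes | tot_chol | sys_BP | dia_BP | BMI | heart_rate | glucose | 10yr_CHD_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4 | 0 | 0.0 | 0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2 | 0 | 0.0 | 0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1 | 1 | 20.0 | 0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3 | 1 | 30.0 | 0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3 | 1 | 23.0 | 0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |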
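
Because astype(int) truncates any fractional part without warning, and raises an error if NaN values are still present, it can be worth confirming that a float column holds only whole numbers before casting it. The sketch below does this on a made-up stand-in for the two columns; it is an optional check, not a step that was run in this notebook.

```python
import pandas

# Made-up stand-in for the education and BP_meds columns after dropna().
toy = pandas.DataFrame({"education": [4.0, 2.0, 1.0], "BP_meds": [0.0, 0.0, 1.0]})

for column in ["education", "BP_meds"]:
    # Confirm that no value has a fractional part before converting.
    is_whole = (toy[column] % 1 == 0).all()
    assert is_whole, f"{column} has non-integer values"
    toy[column] = toy[column].astype(int)

print(toy.dtypes)
```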


The table above shows the cleaned working dataset.

In [26]:
stroke_data.to_csv("stroke_data.csv", index = False)

The line of code above exports the curated dataset to a new CSV file called "stroke_data.csv".
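
A quick way to confirm an export like this is to read the file back and compare it with what was written. The sketch below does so with a made-up two-row frame and a temporary file (the name toy_stroke_data.csv is hypothetical). It also shows why index = False matters: without it, the index is written as an extra column that reappears as "Unnamed: 0" on reload.

```python
import os
import tempfile
import pandas

# Made-up frame standing in for the cleaned dataset.
toy = pandas.DataFrame({"sex": [1, 0], "BMI": [26.97, 28.73], "10yr_CHD_risk": [0, 1]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_stroke_data.csv")   # hypothetical file name
    toy.to_csv(path, index=False)                        # do not write the index
    reloaded = pandas.read_csv(path)

# Same columns and shape, and no stray index column.
print(list(reloaded.columns))
print(reloaded.shape)
```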

This curated dataset file can be found at the link below:

https://drive.google.com/file/d/1FrD7v2XudKIc9Y9bOdwI5vhBlAhl0JK9/view?usp=sharing