# Question 1. Data Exploration
## [CM1] Data Cleaning and Normalization

Importing all necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler,StandardScaler

# Loading Iris Dataset

In [2]:
df_iris = pd.read_csv("iris_dataset_missing.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'iris_dataset_missing.csv'

## Data Cleaning for Iris Data set

#### 1. Check for duplicate values

In [None]:
df_iris.duplicated().sum()

<b> There are no duplicate values in Iris Data. </b>

#### 2. Replacing negative values with NaN

In [None]:
(df_iris.iloc[:,0:-1]<0).sum()

In [None]:
df_iris[df_iris['petal_width']<0]  #negative values in petal_width column

In [None]:
df_iris[df_iris.iloc[:,0:-1]<0] = np.nan

<b> Thus all the negative values in the petal_width column are replaced with NaN so that they can be handled in the later cleaning steps. </b>
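As a minimal sketch of the same idea on a toy frame (the values and the name `toy` are made up for illustration), `DataFrame.mask` replaces every entry where the condition holds with NaN:

```python
import pandas as pd

# Toy data (hypothetical values), with the label in the last column
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.1],
    "petal_width": [0.2, -1.4, 1.8],
    "species": ["setosa", "versicolor", "virginica"],
})

num_cols = toy.columns[:-1]                    # every column except the label
negative = toy[num_cols] < 0                   # boolean mask of negative entries
toy[num_cols] = toy[num_cols].mask(negative)   # masked entries become NaN

print(toy.isna().sum())
```

Restricting the mask to the numeric columns keeps the string label column out of the comparison.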

#### 3. Checking Outliers
<b> Outliers are handled first so that the mean used to replace NaN values in the next step is more accurate. To find the outliers we draw a box-and-whisker plot; values beyond the whiskers are either dropped or replaced with the nearest whisker limit (lower or upper). </b>

In [None]:
sns.boxplot(data=df_iris.iloc[:,0:-1])

<b> Outliers = Observations > Q3 + 1.5*IQR  or  < Q1 - 1.5*IQR </b>

In [None]:
temp = df_iris.describe() 
#We are extracting the inter quartile range from describe()
q3 = temp['sepal_width']['75%']
q1 = temp['sepal_width']['25%']
IQR = q3-q1
right_limit = q3+1.5*IQR
left_limit = q1-1.5*IQR
print("right limit is:",right_limit)
print("left limit is:",left_limit)

#printing outlier values
upper = df_iris['sepal_width']>right_limit
lower = df_iris['sepal_width']<left_limit
print(df_iris[upper | lower]) #all outliers

#replacing outlier values with upper limit or lower limit
df_iris['sepal_width'] = np.where(df_iris['sepal_width']>right_limit, right_limit, df_iris['sepal_width'])
df_iris['sepal_width'] = np.where(df_iris['sepal_width']<left_limit, left_limit, df_iris['sepal_width'])

<b> The sepal_width column has 4 outliers. These outliers are replaced by the nearest limit (upper or lower). We do not drop them because they lie very close to the limits. </b>
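The capping step can also be sketched with `Series.clip`, which applies both limits in one call. This is only an illustration on made-up values (the series `s` is hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Toy measurements (hypothetical values) with one low and one high extreme
s = pd.Series([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 4.6])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values below `lower` and above `upper` in one step
capped = s.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Values already inside the limits are left unchanged; only the two extremes are moved to the nearest limit.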

#### 4. Data Cleaning by replacing missing values with the mean

In [None]:
print("-------------Iris Data--------------")
print("Size of Iris data set:", df_iris.size)
print("Shape of Iris data set:",df_iris.shape)
print("Total NAN values in iris data are :",df_iris.isna().sum().sum(),"\n")
df_iris.describe()


In [None]:
#using groupby to find categorical mean values
df_iris.groupby("species").mean()          

In [None]:
#replacing na values with the mean of the same species
#transform keeps the original row index, so the result aligns with df_iris
df_iris['sepal_width'] = df_iris.groupby('species')['sepal_width'].transform(lambda x:x.fillna(x.mean()))
df_iris['petal_length'] = df_iris.groupby('species')['petal_length'].transform(lambda x:x.fillna(x.mean()))
df_iris['petal_width'] = df_iris.groupby('species')['petal_width'].transform(lambda x:x.fillna(x.mean()))

<b> We replaced the NaN values with the mean of the respective column within each species, so that missing entries get approximate values without losing any rows. </b>
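A minimal sketch of species-wise mean imputation on a toy frame (all values are hypothetical). It computes the group mean once with `transform("mean")`, which returns a series aligned with the original rows, and passes it to `fillna`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry per species
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa",
                "virginica", "virginica", "virginica"],
    "petal_width": [0.2, np.nan, 0.4, 2.0, 2.2, np.nan],
})

# Mean of each species, broadcast back to the original row order
group_mean = toy.groupby("species")["petal_width"].transform("mean")

# Each NaN is filled with the mean of its own species
toy["petal_width"] = toy["petal_width"].fillna(group_mean)
print(toy)
```

Each missing value is filled from its own species only, so a setosa gap is never filled with a virginica average.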

In [None]:
print(df_iris.isna().sum())
df_iris.to_csv("cleaned_data_iris.csv",index=False) #clean iris data saved to new csv file

# Normalization for Iris Data

## 1. Min-Max Normalization

In [None]:
Scaler = MinMaxScaler()
temp = df_iris.drop(["species"],axis=1)
MinMax_iris = Scaler.fit_transform(temp)
MinMax_df_iris = pd.DataFrame(MinMax_iris,columns=temp.columns)
MinMax_df_iris.head()

## 2. Z-Score Normalization

In [None]:
#referred z_score formula from lecture notes
z_score_iris = (temp-temp.mean())/temp.std()
z_score_iris.head()
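One detail worth noting: pandas `std()` uses the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two give slightly different z-scores on small datasets. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

x = pd.Series([4.9, 5.1, 5.8, 6.4, 7.0])  # hypothetical values

z_sample = (x - x.mean()) / x.std()             # pandas default, ddof=1
z_population = (x - x.mean()) / x.std(ddof=0)   # population std, ddof=0

# Both are centred at zero, but the scaling differs slightly
print(z_sample.std(), z_population.std(ddof=0))
print(np.allclose(z_sample, z_population))
```

Either convention is acceptable as long as the same one is used consistently.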

In [None]:
#Visualizing
plt.figure(figsize=(15,10))
plt.subplot(1,3,1)
sns.histplot(df_iris['sepal_width']).set(title='Un-normalized Data')
plt.subplot(1,3,2)
sns.histplot(MinMax_df_iris['sepal_width']).set(title='MinMax Normalized Data')
plt.subplot(1,3,3)
sns.histplot(z_score_iris['sepal_width']).set(title='Z_score Normalized Data')

Note: a comparison of un-normalized vs normalized data is given at the end of the document.

In [None]:
df_iris.info()

# Loading Heart Dataset


In [None]:
df = pd.read_csv('heart_disease_missing.csv')
df.head()

In [None]:
df.info()

### Data Cleaning

#### 1. Check for duplicate values

In [None]:
df.duplicated().sum()

<b> Two patients are very unlikely to have exactly the same values for all of these measures, so any identical rows would most likely be duplicate entries. Our dataset has no duplicate rows; had there been any, we would have dropped them. </b>
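For reference, this is a sketch of how duplicate rows would be counted and dropped if any were present. The toy frame and its values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values); the third row repeats the first
toy = pd.DataFrame({
    "age": [63, 45, 63],
    "chol": [233, 210, 233],
})

print(toy.duplicated().sum())   # number of rows repeating an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

`drop_duplicates()` keeps the first occurrence of each repeated row by default.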

#### 2. Check for negative values

In [None]:
df[df<0].count()

In [None]:
df["oldpeak"][df["oldpeak"]<0].head(10)

oldpeak is the ST depression measured in an Exercise Stress Test (EST).  
The results of an EST are usually reported as negative, positive or inconclusive.  
<b>Negative</b>: A negative test result indicates a normal test, which significantly decreases the likelihood of coronary artery disease.  
<b> Therefore, negative values of oldpeak are valid in our dataset and we will leave them as they are. </b>

#### 3. Checking outliers

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=df)

##### a. Removing outliers from "trestbps" column

In [None]:
sns.boxplot(data=df["trestbps"])

In [None]:
t = df.describe()
#We are extracting the inter quartile range from describe()
q3=t['trestbps']['75%']
q1=t['trestbps']['25%']
print("q1:",q1)
print("q3:",q3)
IQR=q3-q1
print("IQR:",IQR)
#note: this column uses a wider 2*IQR fence than the 1.5*IQR rule used for the other columns
right_limit=q3+2*IQR
left_limit=q1-2*IQR
print("right_limit:",right_limit)
print("left_limit:",left_limit)

#to print outliers
upper = df['trestbps']>right_limit
lower = df['trestbps']<left_limit
print(df[upper | lower]) #all outliers
print("mean:",df['trestbps'].mean())    

#finding the index of outliers and dropping them
index = df[upper|lower].index
df.drop(index, inplace=True)

<b> Trestbps is the patient's resting blood pressure. Because this is sensitive clinical data, we do not replace outliers with substitute values; instead, we drop them. </b>
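Dropping can also be written as a boolean filter that keeps only the rows inside the limits. The sketch below uses made-up readings and a hypothetical multiplier variable `k`, set to 2 to match the trestbps cell above:

```python
import pandas as pd

# Hypothetical resting blood pressure readings with one extreme value
toy = pd.DataFrame({"trestbps": [110, 120, 125, 130, 135, 140, 220]})

k = 2  # fence multiplier (assumption: same value as the trestbps cell)
q1, q3 = toy["trestbps"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - k * iqr, q3 + k * iqr

# between() is inclusive on both ends by default
inside = toy["trestbps"].between(low, high)
kept = toy[inside]
print(len(toy) - len(kept), "row(s) dropped")
```

Filtering into a new frame leaves the original untouched, which makes it easy to check how many rows were removed.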

##### b. Removing outliers from "thalach" column

In [None]:
sns.boxplot(data=df["thalach"])

In [None]:
q3=t['thalach']['75%']
q1=t['thalach']['25%']
print("q1:",q1)
print("q3:",q3)
IQR=q3-q1
print("IQR:",IQR)
right_limit=q3+1.5*IQR
left_limit=q1-1.5*IQR
print("right_limit:",right_limit)
print("left_limit:",left_limit)

upper = df['thalach']>right_limit
lower = df['thalach']<left_limit
print(df[upper | lower]) #all outliers
print("mean:",df['thalach'].mean())    

#np.where(condition, x, y): where condition is True yield x, otherwise yield y
df['thalach']=np.where(df['thalach']>right_limit, right_limit, df['thalach'])
df['thalach']=np.where(df['thalach']<left_limit, left_limit, df['thalach'])

#outlier values of thalach are very near to left_limit, so instead of dropping them we set them to the nearest boundary value.

<b> The outlier value for thalach is not dropped because it is very near the lower limit. Thalach is the maximum heart rate achieved during the thallium stress test, so adjusting it slightly to the limit will not harm the data. </b>

##### c. Removing outliers from "oldpeak" column

In [None]:
sns.boxplot(data=df["oldpeak"])

In [None]:
q3=t['oldpeak']['75%']
q1=t['oldpeak']['25%']
print("q1:",q1)
print("q3:",q3)
IQR=q3-q1
print("IQR:",IQR)
right_limit=q3+1.5*IQR
left_limit=q1-1.5*IQR
print("right_limit:",right_limit)
print("left_limit:",left_limit)

upper = df['oldpeak']>right_limit
lower = df['oldpeak']<left_limit
print("Outlier Values:",df['oldpeak'][upper | lower]) #all outliers
print("mean:",df['oldpeak'].mean())    

#clear outliers, will drop them
index = df[upper|lower].index
df.drop(index, inplace=True)

<b> The 2 outlier values for oldpeak are dropped because they lie far beyond the upper limit and can be considered clear outliers. </b>

##### d. Removing outliers from "chol" column


In [None]:
sns.boxplot(data=df["chol"])

In [None]:
q3=t['chol']['75%']
q1=t['chol']['25%']
print("q1:",q1)
print("q3:",q3)
IQR=q3-q1
print("IQR:",IQR)
right_limit=q3+1.5*IQR
left_limit=q1-1.5*IQR
print("right_limit:",right_limit)
print("left_limit:",left_limit)

upper = df['chol']>right_limit
lower = df['chol']<left_limit
print(df[upper | lower]) #all outliers
print("mean:",df['chol'].mean())    

#Dropping outliers
index = df[upper|lower].index
df.drop(index, inplace=True)

<b> Chol = 406.93 at index 186 is a clear outlier, as normal serum cholesterol levels range up to about 200 mg/dl, so we drop it. </b>

#### 4. Data Cleaning by dropping data

##### Checking column 'ca'

In [None]:
df.ca.hist()

<b> The histogram shows some ca values equal to 4. This category is not defined, so we need to remove those rows. </b>

In [None]:
index = df[df.ca > 3].index
print(index)

df.drop(index, inplace=True)
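An alternative sketch is to list the valid categories explicitly and flag everything else with `isin`. The list `valid_ca` and the toy values are assumptions for illustration. Note that, unlike `df.ca > 3`, this check also flags missing values, because NaN is not in the list:

```python
import pandas as pd

toy = pd.DataFrame({"ca": [0, 1, 4, 2, 3, 4]})  # hypothetical values
valid_ca = [0, 1, 2, 3]                         # assumed valid categories

# True for every row whose ca value is not one of the valid categories
invalid = ~toy["ca"].isin(valid_ca)
print(toy.loc[invalid, "ca"].value_counts())

toy = toy[~invalid]  # keep only rows with a valid category
```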

In [None]:
df = df.dropna()

<b> We dropped all rows with NaN values in the heart dataset. Since this is health data, it should not be meddled with, so imputing the mean is not appropriate. </b>

In [3]:
print(df.isna().sum())
print(df.describe())
df.to_csv("heart_disease_cleaned.csv", index=False) #clean Heart data saved to new csv file

NameError: name 'df' is not defined

# Normalization for Heart Data

## 1. Min Max Normalization for Heart Disease Dataset

In [None]:
names = ['target','sex','cp','restecg','fbs','exang','slope','ca'] #categorical columns, kept un-normalised
#referred min-max formula from lecture notes
normalised_df = (df-df.min()) / (df.max()-df.min())
for i in normalised_df:
    if i in names:
        normalised_df[i]=df[i]
normalised_df.to_csv("minmax_cleaned_heart.csv",index=False)
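The same result can be sketched by scaling only the continuous columns, rather than scaling everything and then restoring the categorical ones. The toy frame, its values and the two column lists below are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values): two continuous and two categorical columns
toy = pd.DataFrame({
    "age": [29.0, 54.0, 77.0],
    "chol": [180.0, 240.0, 300.0],
    "sex": [0, 1, 1],
    "target": [0, 1, 0],
})

categorical = ["sex", "target"]
continuous = [c for c in toy.columns if c not in categorical]

scaled = toy.copy()
col_min = toy[continuous].min()
col_range = toy[continuous].max() - col_min

# Min-max formula applied column by column to the continuous columns only
scaled[continuous] = (toy[continuous] - col_min) / col_range
print(scaled)
```

The categorical codes are never touched, so there is nothing to copy back afterwards.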

## 2. Z score Normalization for Heart Disease Dataset

In [None]:
z_score_heart = (df-df.mean())/df.std()
#referred z-score formula from lecture notes
for i in z_score_heart:
    if i in names:
        z_score_heart[i]=df[i]
z_score_heart.to_csv("zscore_cleaned_heart.csv",index=False)

In [None]:
#Visualizing
plt.figure(figsize = (15,10))
plt.subplot(1,3,1)
sns.histplot(df['chol']).set(title = 'Un-normalised Data')
plt.subplot(1,3,2)
sns.histplot(normalised_df['chol']).set(title = 'MinMax Normalised Data')
plt.subplot(1,3,3)
sns.histplot(z_score_heart['chol']).set(title = 'Z-Score Normalised Data')

# Un-normalized vs Normalized Data comparison

<b> The un-normalised data has different units and ranges in different columns, which can bias computations towards the columns with larger values. To reduce this, the dataset is normalised.  
Min-Max normalization rescales each feature to the range [0, 1], but a single outlier can compress the remaining values into a narrow band. It is therefore highly recommended to remove outliers before performing min-max normalisation.  
In z-score normalisation, the features are rescaled to have zero mean and unit standard deviation, like a standard normal distribution. From the above graphs it is quite evident that neither method changes the shape of the distribution; only the values on the axis change. </b>
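The effect of an outlier on min-max scaling can be illustrated with a small sketch. The arrays and the helper `min_max` are made up for this example:

```python
import numpy as np

def min_max(x):
    """Rescale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 50.0])  # hypothetical outlier

print(min_max(clean))         # evenly spread over [0, 1]
print(min_max(with_outlier))  # first four values squeezed near 0
```

With the outlier present, the first four values all fall below 0.1, which is why outliers are handled before normalising.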