# Lab 1-b: Data Exploration, Pre-processing, Plotting and Dimensionality Reduction

This lab will guide you through data exploration and data pre-processing (including plotting, data cleaning, missing values, outliers, standardization and dimensionality reduction) in Python. HW1 is linked to the topic of this lab. Additionally, this lab includes a practice assignment (Lab1_practice.ipynb) for anyone who wants to practice basic Python concepts before assignment 1. The practice assignment is not mandatory and will not be graded; its solutions will be discussed in the Q&A session for lab 1.

**Outline:**

1. Exploring the data

2. Missing Values

3. Imputation

4. Plotting

5. Outliers

6. Highly correlated features

7. Normalization vs Standardization

8. Principal Component Analysis


In [None]:
#Import libraries

import pandas as pd
import matplotlib.pyplot as plt
#Install scikit-learn
#We will use the scikit-learn library during this course. Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning.
#The package is installed under the name scikit-learn but imported as sklearn.
!pip install scikit-learn
#The ! tells the notebook to execute the line as a shell command.

In [None]:
"""
In this session we will use the Cleveland heart disease dataset.
link to the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
"""


data = pd.read_csv("./datasets/processed.cleveland.data", header=None)


In [None]:
data

In [None]:
"""
The names of the columns, and general information about this dataset, can be found in the heart-disease.names file.

set_axis assigns a desired index to the given axis and returns a new dataframe.
Input:
    a list-like index
    the axis to update (axis=1 means the columns)

Note: the inplace argument of set_axis was removed in pandas 2.0, so we assign the result back to data.

"""


data = data.set_axis(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'], axis=1)


In [None]:
data

## Dataset Information as found in heart-disease.names


This database contains 14 attributes and 303 patients, who may or may not have been diagnosed with heart disease.

There are both continuous and categorical attributes.

For example, the age of the patients, the resting blood pressure and the maximum heart rate are numerical features,
while sex, chest pain type and number of major vessels are categorical features.

The class label is categorical: it takes values from 0 to 4 and refers to the diagnosis of heart disease.



### Feature explanation
- age = in years // numerical, continuous

- sex = (0 is female, 1 is male) // categorical

- cp = chest pain type (1 -> typical angina, 2 -> atypical angina, 3 -> non-anginal, 4 -> asymptomatic) // categorical

- trestbps = resting blood pressure // continuous

- chol = serum cholesterol in mg/dl // continuous

- fbs = fasting blood sugar > 120 mg/dl is 1, otherwise 0 // categorical

- restecg = resting electrocardiographic result, 0 -> normal, 1 -> ST-T wave abnormality, 2 -> probable or definite hypertrophy // categorical

- thalach = maximum heart rate achieved // continuous

- exang = exercise induced angina (1 = yes, 0 = no) // categorical

- oldpeak = ST depression induced by exercise relative to rest // continuous

- slope = the slope of the peak exercise ST segment (1 -> upsloping, 2 -> flat, 3 -> downsloping) // categorical

- ca = number of major vessels (0-3) colored by fluoroscopy // categorical

- thal = (3 -> normal, 6 -> fixed defect, 7 -> reversible defect) // categorical

- num = diagnosis of heart disease (the class label) // categorical


In [None]:
## Essentially:
#numerical are: age, trestbps, chol, thalach, oldpeak
#categorical are: sex, cp, fbs, restecg, exang, slope, ca, thal, num (the class label)

## 1. Exploring the data

In [None]:
data.head(5)

In [None]:
#check number of rows (patients) and columns (features)
data.shape

In [None]:
"""
Check how many patients are diagnosed with heart disease.

value_counts(): returns a series containing counts of unique values
"""


data["num"].value_counts()
#count the class label

In [None]:

"""

There are 164 patients that do not have heart disease,
while the patients that are positive for heart disease are distributed among labels 1 to 4.
We can make the problem more balanced by transforming all these positive class labels to 1.


Use the replace method to map the values 2, 3, 4 to 1!

"""
#assign the result back instead of using inplace=True on a selected column, which newer pandas versions warn about
data["num"] = data["num"].replace({2: 1, 3: 1, 4: 1})
#alternative (needs import numpy as np): data['num'] = np.where(data['num'] > 0, 1, 0)



In [None]:
"""
Let's check the distribution again! The class now seems balanced.
It's also binary: 0 for negative to heart disease and 1 for positive.

"""


data["num"].value_counts()
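
As a quick check of the binarization step, the sketch below applies the same idea to a small made-up series (the values are hypothetical, not taken from the Cleveland file) and uses a comparison instead of replace. Passing `normalize=True` is just one convenient option: it reports class proportions rather than counts.

```python
import pandas as pd

# Hypothetical class labels in the original 0-4 encoding
toy_num = pd.Series([0, 2, 1, 0, 3, 0, 4, 1])

# Everything above 0 becomes 1 (disease present), 0 stays 0
toy_binary = (toy_num > 0).astype(int)

print(toy_binary.value_counts())                # counts per class
print(toy_binary.value_counts(normalize=True))  # proportions per class
```

Proportions make it easier to judge how balanced the two classes are than raw counts.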


## 2. Missing Values

In [None]:
"""
Missing values occur when no data is stored for an attribute. Missing data can have a significant effect on the conclusions and the ML process. We need to handle them!


We can use the isna and isnull methods to check for None and np.nan


# np.nan, None and NaT (for datetime64[ns] types) are the standard missing values for pandas.
# np.nan is a float, and pandas converts None to np.nan in numeric columns, so a column of integers that contains them will be upcast to a floating-point data type


Write the name of the dataset and call isna() or isnull() (isnull is an alias of isna). These return a dataframe of the same shape as the original, with True where a value is missing and False otherwise.

To summarize the results along each column call .any()
"""

print(data.isna().any()) #checks for None, numpy.nan and NaT
print(data.isnull().any()) #alias of isna, gives the same result


These methods say that there are no missing values. Remember that they only look for None and NaN values. 

But missing values come in many forms!

Rule of thumb! Always open the dataset file and investigate the data yourself. 
Missing values might be written like: "missing", "?", "_", etc.
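
One way to make pandas recognise such placeholders is the `na_values` argument of `read_csv`. The sketch below uses a small made-up sample (hypothetical rows and a hypothetical subset of columns, not the real file) to show that `?` then becomes NaN and is picked up by `isna`. The lab itself keeps the `?` strings and handles them explicitly in the next steps.

```python
import io
import pandas as pd

# Hypothetical three-row sample with "?" as the missing-value placeholder
raw = "63.0,1.0,0.0,6.0\n67.0,1.0,3.0,3.0\n52.0,1.0,?,7.0\n"

toy = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["age", "sex", "ca", "thal"],
    na_values="?",  # treat "?" as missing while parsing
)

print(toy.isna().any())  # only the ca column reports a missing value
print(toy.dtypes)        # ca is parsed as float instead of object
```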



In [None]:
"""
Let's print some information about the dataset before proceeding

We can see that the attributes ca and thal are of object type.

"""

print('Data Show Info\n')
data.info()

In [None]:
"""
We notice that the attributes ca and thal are of object type.
Let's see what is unique about them as compared to the others!
Remember that ca and thal are categorical and are encoded as numbers.

We can print and see the unique values of the ca and thal attributes.
We notice that there are instances of ?.


unique(): Returns unique values of Series object.
"""


print(data["ca"].unique())
print(data["thal"].unique())



### How to handle the missing values

In general,  there are different ways of dealing with missing values

For example:

 
1. if the feature is nominal (categorical), replace the missing values with the most frequent category
2. if the feature is numeric, replace them with the mean value
3. or simply delete the rows that have missing values if there is an insignificant amount of them (the simplest approach is to remove the entire predictors and/or samples that contain missing values)
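
The three options above can be sketched with plain pandas on a tiny made-up dataframe (the values are hypothetical and only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: chol is numeric, thal is categorical
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 260.0, 240.0],
    "thal": [3.0, 7.0, np.nan, 3.0],
})

filled = toy.copy()
# 1. categorical feature: fill with the most frequent category
filled["thal"] = filled["thal"].fillna(filled["thal"].mode()[0])
# 2. numeric feature: fill with the mean of the observed values
filled["chol"] = filled["chol"].fillna(filled["chol"].mean())

# 3. alternatively, drop every row that has a missing value
dropped = toy.dropna()

print(filled)
print(dropped)
```

Filling keeps all four rows, while dropping keeps only the two complete ones, which is why dropping is only reasonable when few rows are affected.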

In [None]:
"""

Count how many "?" the columns ca and thal have

First index the columns that you are interested in and get their values through the values attribute.

When you access values on a dataframe or series only the values are returned (as a numpy array), without the axes.

Check which of these values are equal to ?

and sum them!
"""


#.sum() on the boolean array counts over the whole dataframe (the built-in sum would return one count per column)
print("Number of missing (?) values for the whole dataframe:\t", (data.values=="?").sum())

print("Number of missing (?) values for ca:\t", sum(data["ca"].values=="?"))
print("Number of missing (?) values for thal:\t", sum(data["thal"].values=="?"))

## 3.  Imputation

In [None]:
"""
We will use the SimpleImputer from sklearn.impute

1. First you need to initialize the SimpleImputer object, specifying the placeholder used for missing values and an imputation strategy

2. Create a new dataset by calling the imputer's fit_transform method.

The fit method: learns the fill value for each column (here the most frequent value), to be used later
The transform method: replaces the missing values with the learned fill values
fit_transform: does the above two steps, in one step!

Returns a numpy array!
"""



#install scikit-learn first:
#!pip install scikit-learn

from sklearn.impute import SimpleImputer
#first we define the object
imputer = SimpleImputer(missing_values="?", strategy='most_frequent')#it works along the columns
#then fit it to the data
data_imputed = imputer.fit_transform(data)


In [None]:
data_imputed
#returns a numpy array

In [None]:
#Convert it to a dataframe
data_imputed = pd.DataFrame(data_imputed, columns = data.columns)


In [None]:
data_imputed.info()

In [None]:
"""

All columns became object type after the imputation.
We'll convert them all to float so we do not lose any information; the dataset is not so big so it's fine.


astype(): Casts a pandas object to a specified dtype.

"""


data_imputed=data_imputed.astype(float)
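
To see what `fit` and `transform` do separately, here is a minimal sketch on made-up arrays. It uses `np.nan` as the placeholder instead of `"?"`, which is an assumption made to keep the example numeric; the fill value learned by `fit` is stored in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing entry
train = np.array([[3.0], [3.0], [7.0], [np.nan]])
new = np.array([[np.nan], [6.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp.fit(train)                # learns the most frequent value of each column

print(imp.statistics_)        # the learned fill value per column
print(imp.transform(new))     # missing entries in new data get that value
```

Keeping the two steps apart matters once there is separate training and test data: the fill values are learned from one set and reused on the other.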



### Drop the missing values

In [None]:
"""
In case you have empty values in the form of nan: 
dataframe.dropna(inplace=True)
"""


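#drop every row that contains ?
#.copy() makes data_dropped an independent dataframe, so later assignments to it do not trigger a SettingWithCopyWarning

data_dropped = data[(data["ca"]!= "?") & (data["thal"]!="?")].copy()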

In [None]:
data_dropped

In [None]:
data_dropped[['ca','thal']]=data_dropped[['ca','thal']].astype(float).astype(int)

print('Data Show Info\n')
data_dropped.info()

In [None]:
"""
To see statistical information about the dataset use the describe method. It will generate descriptive statistics 

"""

print('Data Show Describe\n')
data.describe()

## 4. Plotting

In [None]:
"""
1. Visualize the class distribution 

The figure object is your empty canvas.

Next you can specify the limits of the plot frame by calling .add_axes

Then you can call ax.bar

plt.style.use('ggplot'): the "ggplot" style adjusts the style to emulate ggplot (a popular plotting package for R).

Barplot parameters:
    x: a sequence of scalars (unique values of the class label)
    height: the height of the bars (how many samples each class has)

add_axes(): Adds an axes to the figure.
    Input: The dimensions [left, bottom, width, height] of the new axes. All quantities are in fractions of figure width and height.
    !left, bottom: the coordinates of the lower-left corner of the axes
set_xticks(): Sets the x ticks with a list of ticks
    Essentially here we will pass the names of the bars.
grid(): adds a grid to the plot
        the first argument says whether to show the grid lines. True/False

"""

#first create a series where you will store the class label
class_label = data_dropped["num"]
#replace the 0 with negative and 1 with positive, for the sake of interpretation
class_label = class_label.replace({0: "Negative", 1: "Positive"})



plt.style.use('ggplot')
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

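#value_counts() returns the labels (index) together with their counts, so bars and names always match
counts = class_label.value_counts()
ax.bar(counts.index, counts.values)
ax.set_title("Distribution of class label")
ax.set_xticks(counts.index)


ax.grid(True) #the old keyword b= was removed in recent matplotlib versions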


In [None]:
"""

2. plot the sex in relation to the class label

Group the class label according to sex

Groupby: allows you to group together rows based on a column and perform an aggregate function on them
After groupby, specify a summarization function!

So groupby sex and count the values of male and female for each of the class labels. 

To plot: 
Call unstack, that pivots the grouped dataframe back, and just call plot with kind equals to bar!


stacked: The bars for the different class labels will be put on top of each other, instead of next to each other. Set it to False if you want to see the difference
"""


sex_by_class = data_dropped.groupby("sex").num.value_counts()
sex_by_class.unstack().plot(kind='bar', stacked= True)

In [None]:
#check how groupby and unstack works
sex_by_class
#for each female and male we have the counts of the class labels

In [None]:
#if you call unstack the dataframe now is pivoted back!
sex_by_class.unstack()
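
The same table can also be built in one step with `pd.crosstab`. The sketch below uses a made-up dataframe (hypothetical rows, same column names) to show that both routes give the same counts:

```python
import pandas as pd

# Hypothetical patients: sex (0/1) and binary class label
toy = pd.DataFrame({
    "sex": [0, 0, 1, 1, 1, 1],
    "num": [0, 1, 0, 1, 1, 1],
})

# Route 1: groupby + value_counts + unstack, as in the lab
grouped = toy.groupby("sex")["num"].value_counts().unstack(fill_value=0)

# Route 2: crosstab counts the combinations directly
table = pd.crosstab(toy["sex"], toy["num"])

print(grouped)
print(table)
```

Either result can be passed to `.plot(kind='bar', stacked=True)` in the same way.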

In [None]:
"""
Same procedure: 

Distribution of age
"""

by_age= data_dropped.groupby(["age"]).num.value_counts()

In [None]:
by_age.unstack().plot(kind='bar', stacked=True )



In [None]:
"""
Pairplot from seaborn: investigates pairwise relationships.

Make a list of the numerical features, and pass the class label as hue!


On the diagonal you see the distribution of these different numerical variables

"""

import seaborn as sns

#height sets the size of each subplot (this argument was called size in older seaborn versions)
sns.pairplot(data_dropped[['age','trestbps','thalach','chol','num']], hue='num', height=2.5);

## Data Preprocessing steps

1. Check and handle  Missing Values (We've done that already)
2. Check for inconsistencies in the data (example: wrongly spelled words, abbreviations etc)
3. Check for outliers
4. Check for highly correlated features
5. Standardize or Normalize numeric features 


## 5. Check for outliers
To check for outliers we can use boxplots to see the distribution of the attributes.
Outliers are drawn as individual points beyond the whiskers of the box.


Outliers can either be a mistake or just variance.
If they are the result of a mistake, then we can ignore them, but if it is just variance in the data we need to take into consideration the target population, subject area, research question and research methodology.


!! **Rule of thumb**: Be careful before deleting outliers; the decision is domain specific!


If the outlier in question is:

- a measurement error or data entry error, correct the error if possible. If you can't fix it, remove that observation because you know it's incorrect.
- not a part of the population you are studying (i.e., unusual properties or conditions), you can remove the outlier.
- a natural part of the population you are studying, you should not remove it.

In [None]:

"""

Create boxplots for all the numerical features

Instead of plt.figure you can call plt.subplots and specify how many rows and columns you want.
if nrows=1, ncols=2 it will create 1 row with 2 plots/ if nrows=2, ncols=3 it will create 2 rows with 3 plots each.

Note!

The axes returned by plt.subplots is an array of matplotlib axes. So you can actually iterate through it and create different plots!


tight_layout(): automatically adjusts subplot params so that the subplot(s) fit into the figure area


boxplot:
    x: The input data.
"""

fig, ax = plt.subplots(1, 4)#create 1 row with 4 plots

ax[0].set_title('trestbps')
ax[0].boxplot(data["trestbps"])

ax[1].set_title('chol')
ax[1].boxplot(data["chol"])

ax[2].set_title('thalach')
ax[2].boxplot(data["thalach"])

ax[3].set_title('oldpeak')
ax[3].boxplot(data["oldpeak"])

plt.tight_layout()#call it after plotting so that the titles are taken into account
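
The boxplot's idea of an outlier can also be computed directly. The sketch below assumes the common 1.5 x IQR whisker rule (matplotlib's default) and uses made-up blood pressure values, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical resting blood pressure measurements
values = np.array([120, 130, 125, 140, 135, 128, 132, 200], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # first and third quartile
iqr = q3 - q1                             # interquartile range

lower = q1 - 1.5 * iqr                    # anything below is flagged
upper = q3 + 1.5 * iqr                    # anything above is flagged

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Flagging a value this way only marks it for inspection; whether it should be removed still depends on the domain considerations listed above.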

##  6. Highly correlated features


Correlated features in general don't improve models (although it depends on the specifics of the problem, like the number of variables and the degree of correlation).


A strong correlation is indicated by a Pearson correlation coefficient with an absolute value near 1. Therefore, when looking at the heatmap, we want to see which features correlate most with the class label.

The correlation coefficient has values between -1 and 1:

- A value closer to 0 implies weaker linear correlation (exactly 0 implying no linear correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation
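
To make the range of the coefficient concrete, the sketch below computes Pearson's r from its definition on made-up vectors and compares it with `np.corrcoef`. The helper name `pearson` is hypothetical and only used for this illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1   # increases exactly linearly with x
y_down = -x        # decreases exactly linearly with x

print(pearson(x, y_up))            # perfect positive correlation
print(pearson(x, y_down))          # perfect negative correlation
print(np.corrcoef(x, y_up)[0, 1])  # NumPy's built-in gives the same value
```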

In [None]:

import seaborn as sns 
#Using Pearson Correlation
plt.figure(figsize=(12,10))

cor = data_dropped.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()


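We can see that thalach is strongly correlated with the class label! Read the num row (or column) of the heatmap and compare absolute values, since a strong negative correlation is as informative as a strong positive one.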


## 7. Normalization vs Standardization 

The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable to have values between 0 and 1, while standardization transforms data to have a mean of zero and a standard deviation of 1.
https://www.statisticshowto.com/normalized/


For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance. https://sebastianraschka.com/Articles/2014_about_feature_scaling.html



Compare the effects of different scalers from sklearn:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
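
The difference between the two can be seen on a single made-up feature. This is a minimal NumPy sketch of the two formulas, not the sklearn scalers themselves; note that `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values
x = np.array([50.0, 60.0, 70.0, 80.0, 100.0])

# Normalization (min-max scaling): values end up in [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0 and standard deviation 1
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
print(standardized.mean(), standardized.std())
```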

In [None]:

"""
Standardization transforms data to have a mean of zero and a standard deviation of 1. 

It is a crucial step before performing PCA, since we are interested in the components that maximize the variance; without it, features measured on larger scales would dominate the components.


Make a list of the numerical features and extract them from the original dataset. 

Create our StandardScaler object


fit: computes the mean and std to be used for later scaling. 

transform: uses a previously computed mean and std to autoscale the data 

fit_transform:  does both at the same time. So you can do it with 1 line of code instead of 2.


"""



from sklearn.preprocessing import StandardScaler

numerical = ["age", "trestbps", "chol", "thalach", "oldpeak"] #list of num features


X = data_dropped[[c for c in data_dropped.columns if c in numerical]] #list comprehension


#Scale Data
scaler = StandardScaler()

X = scaler.fit_transform(X)

X = pd.DataFrame(X, columns=numerical)


In [None]:

#This: 
#X = data_dropped[[c for c in data_dropped.columns if c in numerical]]

#can be considered equal to this: 

#S = pd.DataFrame()
#for c in data_dropped.columns:
    #if c in numerical:
        #S[c] = data_dropped[c]
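
The split between `fit` and `transform` described for `StandardScaler` becomes useful when new data arrives later. The sketch below uses made-up numbers: the mean and standard deviation are learned from the first array and then reused on a new value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: values seen at fit time and one value seen later
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[4.0]])

scaler = StandardScaler().fit(train)  # learns mean_ and scale_ from train only
scaled_new = scaler.transform(new)    # reuses them: (4 - mean_) / scale_

print(scaler.mean_, scaler.scale_)
print(scaled_new)
```

Calling `fit_transform` on the new data instead would recompute the statistics and put the two sets on different scales.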

## 8. Principal Component Analysis (PCA)


In [None]:
"""

PCA is an unsupervised statistical technique that aims to investigate the interrelations among a set of variables and their underlying structure.


Overall PCA attempts to find the directions (linear combinations of features) that explain the most variance in your data

PCA also helps you visualize your data. Our dataset has 14 attributes, and even the 5 standardized numerical features in X are impossible to plot directly. So we can reduce the dimensionality and plot the first two principal components of the dataset


"""

In [None]:

#import PCA
from sklearn.decomposition import PCA

In [None]:
"""
Instantiate a PCA object and specify how many principal components you want

Fit your dataset
Transform: then apply the rotation and dimensionality reduction

"""

pca = PCA(n_components=2)

In [None]:
#fit
pca.fit(X)

In [None]:
#and then transform and store into a variable
x_pca = pca.transform(X)

In [None]:
#original shape
X.shape

In [None]:
#now let's check the dimensions after pca 
x_pca.shape

In [None]:
pca.explained_variance_ratio_

#the first component explains 35% and the second 21%

#PCA is a linear transformation that chooses a new coordinate system for the dataset such that the greatest variance of the dataset comes to lie on the first axis,
#and likewise the second greatest variance lies on the second axis.
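
Where the explained variance ratio comes from can be sketched with NumPy alone: the eigenvalues of the covariance matrix of the standardized data, each divided by their sum. The data below is randomly generated for illustration (two strongly related features plus one independent feature), so it is an assumption and not the heart disease data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: feature 2 is almost a copy of feature 1, feature 3 is independent
a = rng.normal(size=200)
X_toy = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Standardize, as done before PCA in the lab
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along the principal axes
cov = np.cov(X_toy, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending order, so sort from largest to smallest
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]

ratio = eigvals / eigvals.sum()
print(ratio)  # share of the total variance captured by each component
```

Because two of the three features carry nearly the same information, most of the variance ends up on the first component.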

In [None]:

"""

Scatterplot to plot the two principal components. 

Note:
In general, interpreting the components is not easy.
But based on the two components we can see if we have a clear separation between the labels.

For this dataset the separation is not as clear. (In toy datasets you might be able to see clear groups forming.)

"""

plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=data_dropped['num'],cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')

In [None]:

"""

The components correspond to a combination of features from your dataset. 
The components themselves are stored as an attribute to the pca object

"""
pca.components_ 
# Each row represents a principal component and each column actually relates back to the original features



In [None]:
#let's transform this array into a df

df_comp = pd.DataFrame(pca.components_,columns=X.columns)

In [None]:
df_comp

In [None]:
"""
If we create a heatmap now we will be able to see the weight (loading) of each original feature in the principal components themselves

Each principal component is shown here as a row. With the plasma colormap, the more yellow a cell is, the larger the positive weight of the feature in that column, and the darker it is, the more negative the weight.
Features with large absolute weights are the most important for that principal component
"""

plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma')
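
To connect the weights in `components_` with the transformed data, the sketch below checks on randomly generated data (an assumption, used only for illustration) that each PCA coordinate is the centered sample multiplied by the rows of `components_`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))  # hypothetical data: 50 samples, 3 features

pca_toy = PCA(n_components=2).fit(X_toy)

# Each new coordinate is a weighted sum of the centered original features
manual = (X_toy - pca_toy.mean_) @ pca_toy.components_.T

print(np.allclose(manual, pca_toy.transform(X_toy)))
print(pca_toy.components_.shape)  # one row per component, one column per feature
```

This is why a feature with a weight near zero in a row of the heatmap contributes little to that component.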

# END OF LAB 1-b