# Lab 4: Attribute Transformation and Dimensionality Reduction

## Attribute Transformation

An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.

A dataset often contains features measured in different units and on different scales. For example, in the diabetes dataset "pregnant" and "insulin" are measured on different scales, and "insulin" values are much larger in magnitude than "pregnant" values. Many algorithms that are sensitive to feature scale will therefore be biased towards the feature with the larger magnitude. Neural networks, for example, are highly sensitive to the scaling of the input attributes, so the dataset needs to be converted into a suitable format before it is fed into the network.
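As a rough illustration of this bias, the sketch below computes the Euclidean distance between two hypothetical patients described only by "pregnant" and "insulin". The values are made up for illustration and are not claimed to be rows of the dataset; the point is only that the larger-scale attribute accounts for almost all of the distance.

```python
import numpy as np

# Hypothetical patients: [pregnant, insulin] (illustrative values only)
a = np.array([1.0, 94.0])
b = np.array([8.0, 125.0])

diff = b - a
distance = np.linalg.norm(diff)

# Share of the squared distance contributed by each attribute
contribution = diff**2 / np.sum(diff**2)

print(distance)
print(contribution)
```

Any distance-based or gradient-based method sees the unscaled data in roughly this way, which is why the attributes are rescaled first.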

#### The solution to varying scale values

We need a mechanism that brings all the attribute values onto a common scale, typically the range 0 to 1 or another specified range. This approach is called feature scaling.

Below are two approaches that convert each feature to the same scale:

1. Min-Max Scaler (Normalization)
2. Standardization


## Using MinMaxScaler() 

Rescaling the X_train dataset

$min_{j}$ and $max_{j}$ represent the minimum and maximum values of attribute $j$. The $j^{th}$ attribute value $x_{i}^{j}$ of the $i^{th}$ row is scaled as:

### $y_{i}^{j} = (x_{i}^{j} - min_{j})/(max_{j}-min_{j})$
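The formula can be applied by hand with NumPy, as in the minimal sketch below. The small array `X` is an illustrative stand-in (not the lab's data), and the sketch assumes every attribute has $max_{j} \neq min_{j}$ so that the division is defined.

```python
import numpy as np

# Toy data: rows are samples, columns are attributes (illustrative values)
X = np.array([[1.0, 94.0],
              [8.0, 125.0],
              [0.0, 168.0]])

col_min = X.min(axis=0)  # min_j for each attribute j
col_max = X.max(axis=0)  # max_j for each attribute j

# y_i^j = (x_i^j - min_j) / (max_j - min_j), applied column-wise via broadcasting
Y = (X - col_min) / (col_max - col_min)

print(Y)
```

After the transformation, the smallest value in each column maps to 0 and the largest to 1.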

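For scaling, as for any other data transformation, we fit the transformer on the training dataset only. This lab scales only X_train; test data would later be transformed with the same fitted scaler rather than with a scaler fitted again on the test data.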

#### Split the cleaned data into input features $(X_{i})$ and output component (Y)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
df = pd.read_csv('dataset/imputed_data_diabetes.csv')
df.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,85.0,66.0,29.0,125.0,26.6,0.351,31,0
1,8,183.0,64.0,29.142593,125.0,23.3,0.672,32,1
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
4,5,116.0,74.0,29.142593,125.0,25.6,0.201,30,0


In [3]:
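# Convert the DataFrame to a NumPy array so it can be sliced by position
splitted_data = df.values

X = splitted_data[:, 0:8]  # columns at index 0-7: input features
Y = splitted_data[:, 8]    # column at index 8: output label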

#### Split the data into training and test sets, with 80% of the cleaned data for training and 20% for testing

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)

#### Use scikit-learn's MinMaxScaler() for normalization

In [5]:
from sklearn.preprocessing import MinMaxScaler
sclr = MinMaxScaler(feature_range=(0, 1))
scaled_data_X_train = sclr.fit_transform(X_train)
np.set_printoptions(precision=4)
print(scaled_data_X_train[0:5, :])

[[0.3529 0.4685 0.4898 0.413  0.2596 0.3149 0.2778 0.2167]
 [0.7647 0.6783 0.6531 0.3261 0.1514 0.4581 0.4666 0.3   ]
 [0.4118 0.972  0.4694 0.2826 0.1575 0.1411 0.0338 0.5667]
 [0.1176 0.2378 0.3673 0.2407 0.1334 0.1084 0.0458 0.0667]
 [0.7647 0.4895 0.6735 0.2407 0.1334 0.5153 0.2136 0.35  ]]


#### The above code converted all the feature values to a scale between 0 and 1 using normalization (min-max scaling).

Some learning algorithms, such as neural networks, typically train better when input values lie in a small range such as [0, 1], so normalization is used for scaling in such cases.
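The notebook scales only `X_train`. A minimal sketch of how a test set could be scaled with the same fitted scaler is given below; `X_train_demo` and `X_test_demo` are hypothetical toy arrays standing in for the real splits.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for X_train and X_test (illustrative values only)
X_train_demo = np.array([[0.0, 90.0], [5.0, 120.0], [10.0, 150.0]])
X_test_demo = np.array([[12.0, 100.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train_demo)  # learn min_j and max_j from the training data only

X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)  # reuse the training min/max

print(X_test_scaled)
```

Because the minimum and maximum come from the training data, a test value outside the training range (12 here, against a training maximum of 10) is mapped outside [0, 1]. That is expected behaviour with the default settings, not an error.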

## Standardization

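Standardization is another approach to scaling in which the scaled values are not confined to the [0, 1] range. It is often preferred when the data collection process has errors that produce extreme values or outliers, because a single extreme value does not squeeze all the remaining values into a narrow band the way min-max scaling does.

The $j^{th}$ attribute value $x_{i}^{j}$ of the $i^{th}$ row is standardized by: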

#### Z-score normalization: $z_{i}^{j} = (x_{i}^{j} - \mu_{j})/\sigma_{j}$

where the $j^{th}$ attribute has mean $\mu_{j}$ and standard deviation $\sigma_{j}$.

We use scikit-learn's `StandardScaler()` for standardization.

In [6]:
from sklearn.preprocessing import StandardScaler
scale_ftrs_stndrd = StandardScaler().fit(X_train)
scaled_stndrd_X_train = scale_ftrs_stndrd.transform(X_train)

# summarize transformed data
np.set_printoptions(precision=3)
print(scaled_stndrd_X_train[0:5, :])

[[ 0.614  0.024 -0.052  1.826  0.98   0.133  0.769  0.059]
 [ 2.656  1.007  1.267  0.91  -0.029  1.13   2.067  0.481]
 [ 0.905  2.383 -0.217  0.451  0.027 -1.077 -0.907  1.832]
 [-0.554 -1.057 -1.041  0.01  -0.197 -1.305 -0.825 -0.701]
 [ 2.656  0.122  1.431  0.01  -0.197  1.529  0.328  0.734]]
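The z-score formula can be checked by hand against `StandardScaler`. The sketch below uses a small illustrative array `X_demo` (not the lab's data) and the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: rows are samples, columns are attributes (illustrative values)
X_demo = np.array([[1.0, 94.0],
                   [8.0, 125.0],
                   [0.0, 168.0],
                   [5.0, 125.0]])

mu = X_demo.mean(axis=0)    # mu_j for each attribute j
sigma = X_demo.std(axis=0)  # sigma_j with ddof=0 (population standard deviation)
Z_manual = (X_demo - mu) / sigma

Z_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(Z_manual, Z_sklearn))
```

Each standardized column has mean 0 and standard deviation 1, but its values are not bounded to a fixed range.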


## Dimensionality Reduction

Dimensionality reduction summarizes the data in a compact form while preserving most of the information. Reducing the dimension of the feature space leaves fewer relationships between variables to model, so the model is less likely to overfit.

> One such technique, discussed here, is Principal Component Analysis (PCA).


## Principal Component Analysis (PCA)

PCA is a technique for reducing the dimensionality of large datasets by transforming a large set of input features into a smaller set that still contains most of the information in the original dataset. Before applying PCA, the dataset must be rescaled: PCA looks for directions of maximum variance, so without rescaling the attributes with the largest magnitudes dominate the components, and the reduced data may do little to improve a model's accuracy.
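To illustrate why rescaling matters before PCA, the sketch below fits PCA to synthetic data with two independent attributes on very different scales, once on the raw values and once after standardization. The data is randomly generated for illustration only and is unrelated to the diabetes dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two independent attributes on very different scales
small = rng.normal(0.0, 1.0, size=200)
large = rng.normal(0.0, 100.0, size=200)
X_demo = np.column_stack([small, large])

# PCA on the raw values: the large-scale attribute dominates the variance
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both attributes contribute on equal terms
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the raw data the first component captures nearly all of the variance simply because one attribute has larger numbers. After standardization the variance is shared much more evenly between the two components.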

In [7]:
from sklearn.decomposition import PCA

# use three principal components for data reduction and summarization
principal_components = PCA(n_components=3)
principal_summary = principal_components.fit(scaled_stndrd_X_train)

In [8]:
# summarize the components
print(f"Explained Variance: {principal_summary.explained_variance_ratio_}")

Explained Variance: [0.283 0.185 0.145]


In [9]:
print(principal_summary.components_)

[[ 0.303  0.414  0.379  0.408  0.305  0.4    0.167  0.382]
 [ 0.569 -0.1    0.158 -0.273 -0.236 -0.388 -0.3    0.52 ]
 [-0.024  0.451 -0.282 -0.406  0.555 -0.398  0.272  0.098]]



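The code above computed three principal components, shown as three separate arrays. Each array holds the weights (loadings) of the eight scaled input features for one component, i.e. a direction in feature space along which the data is summarized. Together, the three components account for about 61% of the variance in the scaled training data (0.283 + 0.185 + 0.145 = 0.613).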
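The notebook stops after fitting PCA. A minimal sketch of the remaining step, projecting the data onto the components to obtain the reduced dataset, could look as follows. `X_demo` is a randomly generated stand-in for `scaled_stndrd_X_train`, so the printed numbers will not match the outputs above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))  # stand-in for the 8 scaled features

pca = PCA(n_components=3).fit(X_demo)

# Project every row onto the three principal components
X_reduced = pca.transform(X_demo)

print(X_reduced.shape)                           # 8 columns reduced to 3
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share of variance retained
```

The cumulative explained variance ratio is one common way to judge whether the chosen number of components retains enough of the original information.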