### **Importing necessary packages**

In [2]:
pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (1.2.0)
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy

### **Numpy Arrays**

In [4]:
data = np.array([
    [3, -1.5, 2, -2.5],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])
print(data)

[[ 3.  -1.5  2.  -2.5]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]


In [5]:
NpArray1 = np.arange(10) # Similar to Python's built-in range function
print(NpArray1)

[0 1 2 3 4 5 6 7 8 9]


In [6]:
NpArray2 = np.arange(10, 100, 5) # Array with elements from 10 up to (but not including) 100, with a step of 5
print(NpArray2)

[10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95]


In [7]:
NpArray3 = np.linspace(0, 10, 50) # Array of 50 equally spaced numbers from 0 to 10, both endpoints included
print(NpArray3)

[ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ]
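
The difference between `np.arange` (step-based, stop value excluded) and `np.linspace` (count-based, stop value included by default) can be seen side by side in a small sketch. The variable names `by_step`, `by_count`, and `half_open` are illustrative and not part of the original notebook.

```python
import numpy as np

# arange takes a step size and excludes the stop value.
by_step = np.arange(0, 10, 2.5)

# linspace takes a number of points and includes the stop value by default.
# retstep=True also returns the spacing between consecutive points.
by_count, step = np.linspace(0, 10, 5, retstep=True)

# endpoint=False drops the stop value, giving the same half-open interval as arange.
half_open = np.linspace(0, 10, 4, endpoint=False)

print(by_step)
print(by_count, step)
print(half_open)
```

As a rule of thumb, `np.linspace` is usually the safer choice for non-integer spacing, because the number of points is fixed up front rather than derived from a floating-point step.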


### **Data preprocessing using mean removal**

Mean removal, or _standardization_, rescales each feature so that it has `mean = 0` and `standard deviation = 1`, the same location and spread as the **standard normal distribution**. This places all of our features on the same scale.

The transformation subtracts each feature's mean and divides by its standard deviation. This changes the **location** and **spread** of the data distribution, but because the operation is linear, the relative spacing of the values within a feature is preserved.

In [8]:
from sklearn import preprocessing

In [9]:
# Checking the characteristics of the `data` array we created earlier.
# axis=0 means along each column, so 1.3333 is the mean of [3., 0., 1.], the first column of `data`.

print("Mean: ", data.mean(axis=0))
print("Standard deviation: ", data.std(axis=0))

Mean:  [ 1.33333333  1.93333333 -0.06666667 -1.56666667]
Standard deviation:  [1.24721913 2.44449495 1.60069429 2.69485106]


In [10]:
data_standardize = preprocessing.scale(data)

Notice that the mean values below are all approximately 0 (up to floating-point rounding error) and each standard deviation is 1.

In [11]:
print("Mean: ", data_standardize.mean(axis=0))
print("Standard deviation: ", data_standardize.std(axis=0))

Mean:  [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -1.48029737e-16]
Standard deviation:  [1. 1. 1. 1.]
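
The same result can be reproduced by hand, which makes the formula explicit: subtract each column's mean and divide by its standard deviation. This is a minimal NumPy sketch, assuming the population standard deviation (`ddof=0`, NumPy's default); the name `manual_standardized` is illustrative and not from the original notebook.

```python
import numpy as np

data = np.array([
    [3, -1.5, 2, -2.5],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])

# Per-column (per-feature) statistics.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, broadcast across the rows.
manual_standardized = (data - col_mean) / col_std

print("Mean: ", manual_standardized.mean(axis=0))
print("Standard deviation: ", manual_standardized.std(axis=0))
```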


### **Data Scaling**

Another way to place our features on the same scale is a method called _data scaling_.

Formally speaking: **_Scaling is the process of transforming data to fit within a specific range._** This type of transformation changes only the **range** of the data distribution.

A well-known data scaling technique is **min-max scaling**.

The special case of min-max scaling where the target range is [0, 1] is also known as '_normalization_' (not to be confused with vector normalization, covered in the next section).

In [18]:
# In this example we are trying to fit our points within the (0,1) range.
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)

In [14]:
print("Min: ", data.min(axis=0))
print("Max: ", data.max(axis=0))

Min:  [ 0.  -1.5 -1.9 -4.3]
Max:  [3.  4.  2.  2.1]


In [15]:
print("Min: ", data_scaled.min(axis=0))
print("Max: ", data_scaled.max(axis=0))

Min:  [0. 0. 0. 0.]
Max:  [1. 1. 1. 1.]


We can now compare our original data values to the scaled ones.

In [17]:
print(data)

[[ 3.  -1.5  2.  -2.5]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]


In [16]:
print(data_scaled)

[[1.         0.         1.         0.28125   ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.        ]]
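
The scaled values above follow directly from the min-max formula, `(x - min) / (max - min)`, applied per column and then mapped onto the target range. The sketch below reproduces them with plain NumPy; `low`, `high`, and `manual_scaled` are illustrative names, not part of the original notebook.

```python
import numpy as np

data = np.array([
    [3, -1.5, 2, -2.5],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])

low, high = 0.0, 1.0  # target range, the same as feature_range=(0, 1)

col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Map each column onto [0, 1], then stretch and shift it to [low, high].
unit_scaled = (data - col_min) / (col_max - col_min)
manual_scaled = unit_scaled * (high - low) + low

print(manual_scaled)
```

Note that this sketch assumes no column is constant; a column whose minimum equals its maximum would cause a division by zero here.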


### **(Vector) Normalization**

(Vector) normalization is a technique that rescales the values in a vector so that the vector has a magnitude (norm) of 1, turning it into a _unit_ vector. This does not change the direction of the vector, just its magnitude. What counts as magnitude depends on the chosen norm, for example L1 (the sum of absolute values) or L2 (the Euclidean length).

This method is also known as _unit vector scaling_.

In [22]:
# Normalization can use different norms; here we use the L1 norm.
# axis=0 normalizes each column (feature) instead of each row (sample), which is the default.
data_normalized = preprocessing.normalize(data, norm='l1', axis=0)
print(data_normalized)

[[ 0.75       -0.17045455  0.47619048 -0.28089888]
 [ 0.          0.45454545 -0.07142857  0.23595506]
 [ 0.25        0.375      -0.45238095 -0.48314607]]


In [23]:
# Take the absolute value of every entry (each column is one of our feature vectors).
data_norms_abs = np.abs(data_normalized)
# Sum the absolute values in each column to show that every feature vector now has an L1 norm of 1.
print(data_norms_abs.sum(axis=0))

[1. 1. 1. 1.]
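
For comparison, here is a minimal NumPy sketch that computes the column-wise L1 normalization by hand and shows the L2 variant next to it. The names `manual_l1` and `manual_l2` are illustrative and not from the original notebook, and the sketch assumes no column is all zeros.

```python
import numpy as np

data = np.array([
    [3, -1.5, 2, -2.5],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])

# L1 norm of each column: the sum of absolute values.
l1_norms = np.abs(data).sum(axis=0)
manual_l1 = data / l1_norms

# L2 norm of each column: the square root of the sum of squares.
l2_norms = np.sqrt((data ** 2).sum(axis=0))
manual_l2 = data / l2_norms

print(manual_l1)
print(np.abs(manual_l1).sum(axis=0))          # L1 norm of each column
print(np.sqrt((manual_l2 ** 2).sum(axis=0)))  # L2 norm of each column
```

Both versions divide every column by a single positive number, so the direction of each feature vector is unchanged; only the definition of "length 1" differs.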
