**Post-processing of the numerical dataset**

<u>To do :</u>

The processed dataset is in a numerical format. 

However, it still requires post-processing to ensure good machine learning performance.

Post-processing is a two-step process :
- Normalization
- Dimension Reduction
 
<u>Note :</u> Replace every `#?` you find with the required Python code

**1. Load the processed numerical dataset**

In [1]:
# Import pandas
import pandas as pd

In [2]:
# Load the processed numerical employees dataset as a DataFrame
df_employees = pd.read_csv('employes_num.csv', header=0, index_col=0)
df_employees.head(3)

nom,salaire,prime,anciennete,etat civil celibataire,etat civil marie
Ali,1200.675,100.56,5,1.0,0.0
Sonia,2800.786,400.876,18,0.0,1.0
Rahma,2192.082091,130.987,6,1.0,0.0


In [3]:
# Since the dataset is now a numerical matrix, import the numpy module
# The subsequent processing is performed on that matrix
import numpy as np

# To limit the precision of float numbers to 2 when displaying numpy arrays
# Use this trick :
np.set_printoptions(precision=2)

**2. Normalization of the columns**

The columns are on very different scales, so they need to be normalized.

We can confirm this imbalance from the means and variances of the variables 'salaire', 'prime', 'anciennete', ... or from the covariance matrix of the dataset.
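
As a rough illustration of the covariance-matrix check, the sketch below uses only the three rows shown in the preview above (columns 'salaire', 'prime', 'anciennete'). The name `X_demo` is hypothetical and the subset is for illustration only; the diagonal of the covariance matrix holds the per-column variances.

```python
import numpy as np

# Three rows copied from the preview above (salaire, prime, anciennete).
# X_demo is a hypothetical name; the subset is only for illustration.
X_demo = np.array([
    [1200.675, 100.56, 5.0],
    [2800.786, 400.876, 18.0],
    [2192.082091, 130.987, 6.0],
])

# Rows are observations, so transpose: np.cov expects one variable per row.
# ddof=0 matches the default of np.var (population variance).
cov_raw = np.cov(X_demo.T, ddof=0)

# The diagonal holds the per-column variances,
# which differ by orders of magnitude on this data.
variances = np.diag(cov_raw)
print(variances)
```

If the diagonal values span several orders of magnitude, as they do here, the large-scale columns would dominate any distance-based or variance-based method.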

In [4]:
# Get the numerical values of the dataset df_employees as a numpy matrix X
X=df_employees.values
X

array([[1.20e+03, 1.01e+02, 5.00e+00, 1.00e+00, 0.00e+00],
       [2.80e+03, 4.01e+02, 1.80e+01, 0.00e+00, 1.00e+00],
       [2.19e+03, 1.31e+02, 6.00e+00, 1.00e+00, 0.00e+00],
       [2.50e+03, 3.41e+02, 1.30e+01, 0.00e+00, 1.00e+00],
       [3.10e+03, 2.57e+02, 1.90e+01, 0.00e+00, 1.00e+00],
       [1.30e+03, 1.51e+02, 6.00e+00, 1.00e+00, 0.00e+00],
       [1.10e+03, 1.31e+02, 4.00e+00, 1.00e+00, 0.00e+00],
       [3.00e+03, 2.57e+02, 2.30e+01, 0.00e+00, 1.00e+00],
       [1.51e+03, 1.60e+02, 6.00e+00, 1.00e+00, 0.00e+00],
       [2.70e+03, 4.00e+02, 2.40e+01, 0.00e+00, 1.00e+00],
       [1.20e+03, 2.57e+02, 8.00e+00, 1.00e+00, 0.00e+00]])

**Check data statistics**

To justify the normalization, we should verify that the column statistics, mainly the means and variances, are on very different scales

In [5]:
# Check the means of the dataset X
# To compute the mean per column of a matrix
# Call np.mean() and pass as arguments :
# - the matrix
# - axis=0
np.mean(X , axis=0)

array([2.05e+03, 2.35e+02, 1.20e+01, 5.45e-01, 4.55e-01])

In [6]:
# Check the variances of the dataset X
# To compute the variance per column of a matrix
# Call np.var() and pass as arguments :
# - the matrix
# - axis=0
np.var(X , axis=0)

array([5.84e+05, 1.09e+04, 5.35e+01, 2.48e-01, 2.48e-01])

**Apply Standard Scaling process**

Standard Scaling is implemented using `sklearn.preprocessing.StandardScaler`

It is a way to normalize the data so that it has : 
- a mean (average) of 0
- a standard deviation (square root of the variance) of 1. 

Process description :

1. Compute the mean and the standard deviation of each column.

2. Subtract the mean: the column mean is subtracted from each data point.
   
   -> This operation centers the data around 0, so the new mean is 0.

3. Divide by the standard deviation: after centering, each data point is divided by the column's standard deviation.

   -> This step scales the data, making the standard deviation equal to 1.

   Once the normalization is applied, every variable (dataset column) has a mean of 0 and a variance of 1.
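
The three steps above can be sketched by hand with NumPy and compared against `StandardScaler`. This is a minimal sketch on a hypothetical toy matrix (`X_demo` is not the employees dataset), assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 rows (individuals), 2 columns on different scales.
X_demo = np.array([
    [1200.0, 5.0],
    [2800.0, 18.0],
    [2200.0, 6.0],
    [2500.0, 13.0],
])

# Step 1: per-column mean and standard deviation (population, ddof=0).
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)

# Steps 2 and 3: center each column, then scale it.
X_manual = (X_demo - mu) / sigma

# StandardScaler should give the same result on this data.
X_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```

The manual version is only meant to make the arithmetic visible; in practice the fitted scaler is preferable because it stores `mean_` and `var_` and can apply the same transformation to new data.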

In [7]:
# Import StandardScaler class from sklearn.preprocessing module
from sklearn.preprocessing import StandardScaler

In [8]:
# Create an instance of StandardScaler class
ss=StandardScaler()

In [10]:
# Call fit() function of ss and pass X as argument
ss.fit(X)

In [11]:
ss.mean_

array([2.05e+03, 2.35e+02, 1.20e+01, 5.45e-01, 4.55e-01])

In [12]:
ss.var_

array([5.84e+05, 1.09e+04, 5.35e+01, 2.48e-01, 2.48e-01])

In [13]:
# Call transform() of ss and pass X as argument
# It returns the normalized dataset, denoted X_ss
X_ss = ss.transform(X)
X_ss

array([[-1.12, -1.29, -0.96,  0.91, -0.91],
       [ 0.98,  1.59,  0.82, -1.1 ,  1.1 ],
       [ 0.18, -1.  , -0.82,  0.91, -0.91],
       [ 0.58,  1.01,  0.14, -1.1 ,  1.1 ],
       [ 1.37,  0.21,  0.96, -1.1 ,  1.1 ],
       [-0.99, -0.81, -0.82,  0.91, -0.91],
       [-1.25, -1.  , -1.09,  0.91, -0.91],
       [ 1.24,  0.21,  1.5 , -1.1 ,  1.1 ],
       [-0.72, -0.72, -0.82,  0.91, -0.91],
       [ 0.84,  1.58,  1.64, -1.1 ,  1.1 ],
       [-1.12,  0.21, -0.55,  0.91, -0.91]])

**Check statistics of normalized dataset**

After the normalization, we can check that the column statistics, mainly the means and variances, are now balanced

In [14]:
# After normalization, check the means of the columns of the dataset X_ss
# They should be almost 0
np.mean(X_ss , axis=0)

array([ 1.21e-16, -1.26e-17,  0.00e+00,  1.21e-16, -6.06e-17])

In [15]:
# After normalization, check the variances of the columns of the dataset X_ss
# The variance values should be 1
np.var(X_ss , axis=0)

array([1., 1., 1., 1., 1.])

**3. Reduction of the dimension**

The numerical data matrix serves as a representation of population data.

The rows within this matrix represent individuals within the population.

The columns of this matrix correspond to characteristics that describe the population.

Because the data matrix is normalized, it becomes possible to compare these characteristics, revealing potential redundancies.

In such cases, reducing the number of dimensions helps to eliminate redundant information.

We can assess the <font color='red'>correlation</font> between these dimensions by examining the covariance matrix of the normalized data matrix, which coincides with the correlation matrix of the original data.

**Check dataset statistics**

To justify the dimension reduction, we should verify that there are <font color='red'>high correlations</font> between dimensions

In [16]:
# Check the covariance matrix of the normalized dataset
# Mainly the off-diagonal values, which correspond to the correlations between the variables (salaire, prime, ...)
# These values lie in [-1, 1]
# If they are close to 0, there is almost no linear correlation
# If they are close to 1 or -1, there is a strong correlation
# In that case, it is worth reducing the dimension (number of variables)
np.cov(X_ss.T , ddof=0)

array([[ 1.  ,  0.69,  0.89, -0.92,  0.92],
       [ 0.69,  1.  ,  0.8 , -0.84,  0.84],
       [ 0.89,  0.8 ,  1.  , -0.92,  0.92],
       [-0.92, -0.84, -0.92,  1.  , -1.  ],
       [ 0.92,  0.84,  0.92, -1.  ,  1.  ]])
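
The covariance matrix of standardized data can be read as a correlation matrix. The sketch below checks this on a hypothetical toy matrix (`X_demo` is illustrative, not the employees dataset) by comparing it with `np.corrcoef` applied to the unscaled data.

```python
import numpy as np

# Hypothetical toy matrix: 5 individuals, 3 variables.
X_demo = np.array([
    [1200.0, 100.0, 5.0],
    [2800.0, 400.0, 18.0],
    [2200.0, 130.0, 6.0],
    [2500.0, 340.0, 13.0],
    [3100.0, 260.0, 19.0],
])

# Standardize each column (population standard deviation, ddof=0).
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Covariance of the standardized data (one variable per row, hence the transpose)
cov_std = np.cov(X_std.T, ddof=0)

# Pearson correlation matrix of the original, unscaled data
corr = np.corrcoef(X_demo.T)

print(np.allclose(cov_std, corr))
```

Note that the match relies on using `ddof=0` in both the standardization and the covariance; mixing `ddof=0` and `ddof=1` would rescale the matrix by a constant factor.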

**Principal Component Analysis (PCA)**

In [17]:
# Import PCA class from sklearn.decomposition module
from sklearn.decomposition import PCA

In [19]:
# Create an instance of PCA class
pca=PCA()

In [20]:
# Call the function fit() and pass X_ss as an argument
pca.fit(X_ss)

In [21]:
# Check the share of the total variance explained by each principal component
pca.explained_variance_ratio_

array([9.01e-01, 6.41e-02, 2.05e-02, 1.45e-02, 1.48e-33])
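
One common way to read these ratios is to accumulate them and keep the smallest number of components that reaches a chosen share of the variance. The sketch below reuses the ratios printed above; the 95% threshold and the names `threshold` and `n_keep` are assumptions for illustration, not part of the original exercise.

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([9.01e-01, 6.41e-02, 2.05e-02, 1.45e-02, 1.48e-33])

# Running total of the variance retained when keeping the first k components.
cumulative = np.cumsum(ratios)

# Hypothetical threshold: keep at least 95% of the total variance.
threshold = 0.95

# Index of the first cumulative value reaching the threshold, plus one
# to convert a 0-based index into a number of components.
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(n_keep)
```

With these ratios, the first two components together cover about 96.5% of the variance, so two components pass the assumed 95% threshold.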

In [22]:
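# Project X_ss onto all principal components, then check the covariance matrix
# The off-diagonal values should be close to 0 : the components are uncorrelated
X_pca=pca.transform(X_ss)
np.cov(X_pca.T , ddof=0)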

array([[ 4.50e+00,  2.08e-15, -2.98e-16, -6.56e-17,  6.52e-16],
       [ 2.08e-15,  3.21e-01, -7.63e-18, -1.68e-17, -3.04e-17],
       [-2.98e-16, -7.63e-18,  1.02e-01, -3.69e-17,  1.36e-17],
       [-6.56e-17, -1.68e-17, -3.69e-17,  7.26e-02,  3.68e-17],
       [ 6.52e-16, -3.04e-17,  1.36e-17,  3.68e-17,  1.19e-31]])

**Dimension reduction using PCA**

In [23]:
from sklearn.decomposition import PCA
# Apply PCA with reduced dimension=2
pca=PCA(n_components=2)
pca.fit(X_ss)
X_pca=pca.transform(X_ss)
X_pca

array([[-2.3 , -0.25],
       [ 2.47,  0.57],
       [-1.55, -0.73],
       [ 1.76,  0.37],
       [ 2.14, -0.8 ],
       [-1.99,  0.07],
       [-2.3 ,  0.08],
       [ 2.33, -0.79],
       [-1.83, -0.01],
       [ 2.78,  0.55],
       [-1.5 ,  0.95]])

**Check statistics of reduced dataset**

Following dimension reduction, we examine the covariance matrix of the reduced data matrix to confirm that the new dimensions are no longer correlated.

In [24]:
# Check the covariance matrix of the reduced dataset
# Check that the off-diagonal values are reduced to (almost) 0.
np.cov(X_pca.T , ddof=0)

array([[4.50e+00, 2.08e-15],
       [2.08e-15, 3.21e-01]])
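
To see why the off-diagonal values vanish, PCA can be sketched as an eigendecomposition of the covariance matrix. This is a simplified NumPy sketch on a hypothetical toy matrix, not scikit-learn's actual implementation (which relies on an SVD), and the signs of the axes may differ from those returned by `PCA`.

```python
import numpy as np

# Hypothetical toy matrix: 5 individuals, 3 variables.
X_demo = np.array([
    [1200.0, 100.0, 5.0],
    [2800.0, 400.0, 18.0],
    [2200.0, 130.0, 6.0],
    [2500.0, 340.0, 13.0],
    [3100.0, 260.0, 19.0],
])

# Standardize each column (population standard deviation, ddof=0).
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Covariance matrix of the standardized data (symmetric).
C = np.cov(X_std.T, ddof=0)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the first two principal axes.
X_proj = X_std @ eigvecs[:, :2]

# The covariance of the projected data is diagonal:
# its diagonal holds the two largest eigenvalues.
cov_proj = np.cov(X_proj.T, ddof=0)
print(np.round(cov_proj, 6))
```

Because the eigenvectors of a symmetric matrix are orthogonal, projecting onto them removes the linear correlation between the new dimensions.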

**Save the transformed dataset**

In [26]:
# To save the reduced normalized dataset, call to_csv() and pass as arguments :
# - file name 'employes_num_2.csv'
# - header : False because the reduced columns have no labels
# - index : True to write the employee names as the first column
df_employes_proc=pd.DataFrame(X_pca, index=df_employees.index)
df_employes_proc.to_csv('employes_num_2.csv', header=False, index=True)
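
A file saved this way has no header row, so it needs matching options when it is read back. The sketch below is a hypothetical round trip in a temporary directory; the three names and the values are copied from the first rows shown earlier, and `df_demo` and `df_back` are illustrative names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical reduced data: 3 employees, 2 principal components.
df_demo = pd.DataFrame(
    np.array([[-2.30, -0.25], [2.47, 0.57], [-1.55, -0.73]]),
    index=pd.Index(["Ali", "Sonia", "Rahma"], name="nom"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "employes_num_2.csv")

    # Write without a header row, with the names in the first column.
    df_demo.to_csv(path, header=False, index=True)

    # header=None tells read_csv that the file has no header row;
    # index_col=0 restores the names as the index.
    df_back = pd.read_csv(path, header=None, index_col=0)

print(df_back)
```

Without `header=None`, `read_csv` would treat the first employee's row as column labels and silently drop one observation.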