## Scale and Normalize Categorical Dataset
Important summary points:

1) Data scaling and normalization are often necessary when preparing data for machine learning.

2) Features can have different scales, and features with larger values can have a disproportionate impact on the model.

3) Scaling changes the range of the data; normalization changes the shape of the distribution of the data.
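
To make the second and third points concrete, here is a minimal sketch of scaling, assuming two hypothetical toy features (the `amount` and `age` arrays and the `min_max_scale` helper are made up for illustration and are not part of the loans dataset workflow). It only rescales the range; it does not change the shape of the distribution.

```python
import numpy as np

# Hypothetical toy features: loan amount (large values) and age (small values).
amount = np.array([300.0, 500.0, 800.0, 1000.0])
age = np.array([27.0, 33.0, 45.0, 50.0])


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())


amount_scaled = min_max_scale(amount)
age_scaled = min_max_scale(age)

# Before scaling, the two features cover very different ranges.
print(amount.max() - amount.min(), age.max() - age.min())

# After scaling, both features span the same range, from 0 to 1.
print(amount_scaled.min(), amount_scaled.max())
print(age_scaled.min(), age_scaled.max())
```

Because the rescaling is linear, the ordering and relative spacing of the values within each feature stay the same; only the range changes.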

In [1]:
# Import dependencies.
import pandas as pd
from pathlib import Path

In [3]:
# Load data
file_path = Path("../Resources/loans_data_encoded.csv")
encoded_df = pd.read_csv(file_path)
encoded_df.head()

Unnamed: 0,amount,term,age,bad,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male,month_num
0,1000,30,45,0,0,1,0,0,0,1,6
1,1000,30,50,0,1,0,0,0,1,0,7
2,1000,30,33,0,1,0,0,0,1,0,8
3,1000,15,27,0,0,0,0,1,0,1,9
4,1000,30,28,0,0,0,0,1,1,0,10


## Scaling Data

Note: in previous notebooks, the values in each column were rescaled to be between 0 and 1. Scaling is often necessary with models that are sensitive to large numerical values.

1. Models that especially benefit from scaling include support vector machines (SVMs).

2. Scaling is performed here with Scikit-learn's StandardScaler class, which rescales each feature to have a mean of 0 and a standard deviation of 1.

3. The familiar model -> fit -> predict/transform workflow also applies when scaling data.
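
As a rough illustration of the fit and transform steps, the sketch below uses small made-up arrays (`X_train` and `X_new` are hypothetical, not the loans data). It fits the scaler on one array, reuses the learned statistics on another, and compares the result with the by-hand formula z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_train = np.array([[1000.0, 45.0], [800.0, 33.0], [300.0, 27.0], [500.0, 50.0]])
X_new = np.array([[900.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learn each column's mean and std
X_train_scaled = scaler.transform(X_train)  # apply them to the same data
X_new_scaled = scaler.transform(X_new)      # reuse them on unseen data

# The same computation by hand, using the population standard deviation (ddof=0).
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_scaled, manual))
```

Calling `fit_transform` combines the first two steps; keeping `fit` and `transform` separate is useful when the same statistics must be applied to data the scaler has not seen.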

In [4]:
# Creating the scaler instance
from sklearn.preprocessing import StandardScaler
data_scaler = StandardScaler()

In [5]:
# Fit the scaler and transform the data in one step
loans_data_scaled = data_scaler.fit_transform(encoded_df)
loans_data_scaled[:5]

array([[ 0.49337687,  0.89789115,  2.28404253, -0.81649658, -0.39336295,
         1.17997648, -0.08980265, -0.88640526, -0.42665337,  0.42665337,
        -0.16890147],
       [ 0.49337687,  0.89789115,  3.10658738, -0.81649658,  2.54218146,
        -0.84747452, -0.08980265, -0.88640526,  2.34382305, -2.34382305,
         0.12951102],
       [ 0.49337687,  0.89789115,  0.3099349 , -0.81649658,  2.54218146,
        -0.84747452, -0.08980265, -0.88640526,  2.34382305, -2.34382305,
         0.42792352],
       [ 0.49337687, -0.97897162, -0.67711892, -0.81649658, -0.39336295,
        -0.84747452, -0.08980265,  1.12815215, -0.42665337,  0.42665337,
         0.72633602],
       [ 0.49337687,  0.89789115, -0.51260995, -0.81649658, -0.39336295,
        -0.84747452, -0.08980265,  1.12815215,  2.34382305, -2.34382305,
         1.02474851]])

In [6]:
import numpy as np
print(np.mean(loans_data_scaled[:,0]))
print(np.std(loans_data_scaled[:,0]))

-3.552713678800501e-16
0.9999999999999999
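
The mean and standard deviation printed above are not exactly 0 and 1 because of floating-point rounding. A tolerance-based comparison is one way to check them; the sketch below assumes a hypothetical toy column standardized by hand rather than the loans data.

```python
import numpy as np

# Hypothetical toy column, standardized by hand.
values = np.array([1000.0, 1000.0, 800.0, 300.0, 500.0])
standardized = (values - values.mean()) / values.std()

# Compare against the targets with a tolerance instead of exact equality.
mean_ok = np.isclose(standardized.mean(), 0.0, atol=1e-9)
std_ok = np.isclose(standardized.std(), 1.0)
print(mean_ok, std_ok)
```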


In [7]:
# Check the shape of loans_data_scaled
loans_data_scaled.shape

(500, 11)

#### Skill Drill
Create a for loop to verify that all columns are standardized.


In [9]:
# Print the scaled array
print(loans_data_scaled)

[[ 0.49337687  0.89789115  2.28404253 ... -0.42665337  0.42665337
  -0.16890147]
 [ 0.49337687  0.89789115  3.10658738 ...  2.34382305 -2.34382305
   0.12951102]
 [ 0.49337687  0.89789115  0.3099349  ...  2.34382305 -2.34382305
   0.42792352]
 ...
 [-1.24386563 -0.97897162 -0.18359201 ... -0.42665337  0.42665337
  -0.16890147]
 [ 0.49337687  0.89789115  1.13247975 ...  2.34382305 -2.34382305
  -1.06413896]
 [ 0.49337687  0.89789115 -0.51260995 ... -0.42665337  0.42665337
  -1.06413896]]


In [8]:
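# Put the scaled array into a DataFrame, reusing the original column names
loans_data_scaled_df = pd.DataFrame(loans_data_scaled, columns=encoded_df.columns)

# Check that every column has a mean close to 0 and a standard deviation close to 1
for column in loans_data_scaled_df.columns:
    col_mean = loans_data_scaled_df[column].mean()
    # ddof=0 gives the population standard deviation, which StandardScaler uses
    col_std = loans_data_scaled_df[column].std(ddof=0)
    print(f"{column}: mean={col_mean:.4f}, std={col_std:.4f}")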