<h1 style='text-align: center;'>Examples of normscaler package</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor</h3>
<h4 style='text-align: center;'>Email: shouke.wei@gmail.com</h4>

## Objective
This example shows how to use the ***normscaler*** package to normalize a dataset for model training/estimation and model testing.

For background, see the articles on [data normalization methods](https://medium.com/@shouke.wei/different-methods-to-normalize-dataset-for-model-development-with-python-scikit-learn-e1752a6ef1b6) and [the normalization procedure](https://medium.com/@shouke.wei/right-data-normalization-procedure-for-model-development-with-python-eb9a6a6a268e).

The normalization scalers include:

- MinMaxScaler
- MaxAbsScaler
- RobustScaler
- StandardScaler
- Normalizer
- DecimalScaler

In this example, we will use `DecimalScaler` and `MinMaxScaler` to show how to use this package for data normalization. 

## 1. Import required packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from normscaler.scaler import DecimalScaler, MinMaxScaler

## 2. Read data

In [2]:
url = 'https://raw.githubusercontent.com/shoukewei/data/main/data-pydm/gdp_china_encoded.csv'
df = pd.read_csv(url)
df

Unnamed: 0,year,gdp,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
0,2000,1.074125,8.650000,0.314513,1.408147,0.108032,0.976157,1.0,0.0,0.0,0.0,0.0
1,2001,1.203925,8.733000,0.348443,1.501391,0.132133,1.041519,1.0,0.0,0.0,0.0,0.0
2,2002,1.350242,8.842000,0.385078,1.830169,0.152108,1.113720,1.0,0.0,0.0,0.0,0.0
3,2003,1.584464,8.963000,0.481320,2.346735,0.169563,1.238043,1.0,0.0,0.0,0.0,0.0
4,2004,1.886462,9.052298,0.587002,2.955899,0.185295,1.362765,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
90,2014,3.493824,9.436000,3.078217,0.399111,0.602869,2.367206,0.0,1.0,0.0,0.0,0.0
91,2015,3.700216,9.480000,3.566035,0.459535,0.679935,2.557561,0.0,1.0,0.0,0.0,0.0
92,2016,4.047179,9.532000,4.041509,0.471385,0.745374,2.723292,0.0,1.0,0.0,0.0,0.0
93,2017,4.455283,9.392000,4.449690,0.474870,0.821552,2.955790,0.0,1.0,0.0,0.0,0.0


## 3. Slice data into features X and target y

In [3]:
X = df.drop(['gdp'], axis=1)
y = df['gdp']

## 4. Split dataset for model training and testing

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [5]:
# display data for training
X_train

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
66,2009,5.276,1.074232,1.282390,0.265335,2.461081,0.0,0.0,0.0,0.0,1.0
54,2016,9.947,5.332294,1.547657,0.875521,3.401208,0.0,0.0,0.0,1.0,0.0
36,2017,7.656,5.327700,3.999750,1.062103,4.362180,0.0,0.0,1.0,0.0,0.0
45,2007,9.367,1.253770,0.931296,0.226185,1.426470,0.0,0.0,0.0,1.0,0.0
52,2014,9.789,4.249555,1.701122,0.717731,2.922194,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
75,2018,5.155,3.169770,2.851160,0.862953,5.557430,0.0,0.0,0.0,0.0,1.0
9,2009,10.130,1.293312,4.174383,0.433437,2.157472,1.0,0.0,0.0,0.0,0.0
72,2015,5.539,2.732332,2.159908,0.664598,4.371448,0.0,0.0,0.0,0.0,1.0
12,2012,10.594,1.875150,6.211629,0.738786,3.022671,1.0,0.0,0.0,0.0,0.0


In [6]:
# display data for test
X_test

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
40,2002,9.082,0.348331,0.280899,0.086065,0.761436,0.0,0.0,0.0,1.0,0.0
31,2012,7.92,3.08542,3.459007,0.702767,2.967697,0.0,0.0,1.0,0.0,0.0
46,2008,9.417,1.543593,1.100156,0.270466,1.630541,0.0,0.0,0.0,1.0,0.0
58,2001,4.729,0.283494,0.271497,0.05973,1.046467,0.0,0.0,0.0,0.0,1.0
77,2001,9.555,0.154406,0.023027,0.050858,0.526742,0.0,1.0,0.0,0.0,0.0
49,2011,9.637,2.674968,1.523541,0.500207,2.279184,0.0,0.0,0.0,1.0,0.0
87,2011,9.388,1.776895,0.210703,0.424882,1.81948,0.0,1.0,0.0,0.0,0.0
44,2006,9.309,1.111142,0.759025,0.183344,1.219224,0.0,0.0,0.0,1.0,0.0
88,2012,9.406,2.145,0.326601,0.50064,2.044262,0.0,1.0,0.0,0.0,0.0
90,2014,9.436,3.078217,0.399111,0.602869,2.367206,0.0,1.0,0.0,0.0,0.0
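
The row labels in the two outputs above are out of order because `train_test_split` shuffles the rows before splitting and keeps each row's original index. The sketch below illustrates this on a small made-up frame; the `toy` data and all variable names are hypothetical and are not part of the original dataset. It also shows that fixing `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only to illustrate the split behaviour
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})
X_toy, y_toy = toy[["x"]], toy["y"]

# Same call pattern as in the example: 30% of the rows go to the test set
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1
)
# Repeating the call with the same random_state gives the same partition
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1
)

print(len(Xa_train), len(Xa_test))             # sizes of the two parts
print(Xa_train.index.equals(Xb_train.index))   # same rows on both calls
print(Xa_train.index.equals(ya_train.index))   # features and target stay aligned
```

Because the original index is preserved, a scaled row can always be traced back to the corresponding row of `df`.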


## 5. Data normalization with DecimalScaler

In [7]:
X_train_scaled, X_test_scaled = DecimalScaler(X_train,X_test)

In [8]:
# display the normalized training dataset as a pandas DataFrame
X_train_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
66,0.2009,0.05276,0.107423,0.128239,0.026533,0.246108,0.0,0.0,0.0,0.0,1.0
54,0.2016,0.09947,0.533229,0.154766,0.087552,0.340121,0.0,0.0,0.0,1.0,0.0
36,0.2017,0.07656,0.532770,0.399975,0.106210,0.436218,0.0,0.0,1.0,0.0,0.0
45,0.2007,0.09367,0.125377,0.093130,0.022618,0.142647,0.0,0.0,0.0,1.0,0.0
52,0.2014,0.09789,0.424955,0.170112,0.071773,0.292219,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
75,0.2018,0.05155,0.316977,0.285116,0.086295,0.555743,0.0,0.0,0.0,0.0,1.0
9,0.2009,0.10130,0.129331,0.417438,0.043344,0.215747,1.0,0.0,0.0,0.0,0.0
72,0.2015,0.05539,0.273233,0.215991,0.066460,0.437145,0.0,0.0,0.0,0.0,1.0
12,0.2012,0.10594,0.187515,0.621163,0.073879,0.302267,1.0,0.0,0.0,0.0,0.0


In [9]:
# display the normalized test dataset as a pandas DataFrame
X_test_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
40,0.2002,0.09082,0.034833,0.02809,0.008606,0.076144,0.0,0.0,0.0,1.0,0.0
31,0.2012,0.0792,0.308542,0.345901,0.070277,0.29677,0.0,0.0,1.0,0.0,0.0
46,0.2008,0.09417,0.154359,0.110016,0.027047,0.163054,0.0,0.0,0.0,1.0,0.0
58,0.2001,0.04729,0.028349,0.02715,0.005973,0.104647,0.0,0.0,0.0,0.0,1.0
77,0.2001,0.09555,0.015441,0.002303,0.005086,0.052674,0.0,1.0,0.0,0.0,0.0
49,0.2011,0.09637,0.267497,0.152354,0.050021,0.227918,0.0,0.0,0.0,1.0,0.0
87,0.2011,0.09388,0.17769,0.02107,0.042488,0.181948,0.0,1.0,0.0,0.0,0.0
44,0.2006,0.09309,0.111114,0.075903,0.018334,0.121922,0.0,0.0,0.0,1.0,0.0
88,0.2012,0.09406,0.2145,0.03266,0.050064,0.204426,0.0,1.0,0.0,0.0,0.0
90,0.2014,0.09436,0.307822,0.039911,0.060287,0.236721,0.0,1.0,0.0,0.0,0.0
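
The outputs above are consistent with decimal scaling, in which each column is divided by a power of ten: `year` by 10^4, `pop` by 10^2, and the 0/1 province columns by 10^0. The sketch below reproduces that arithmetic with plain pandas. It is an illustration under assumptions, not the package's source code: the helper name `decimal_scale_sketch`, the exponent rule `ceil(log10(max|x|))`, and the choice to derive the exponents from the training data only are all inferred from the numbers shown here.

```python
import numpy as np
import pandas as pd

def decimal_scale_sketch(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical helper: divide each column by a power of ten taken from the training data."""
    max_abs = train.abs().max()
    # All-zero columns get a factor of 10**0 = 1, which avoids log10(0)
    safe_max = max_abs.where(max_abs > 0, 1.0)
    # Assumed rule: exponent j = ceil(log10(max|x|)); a maximum of exactly 1 is left unscaled
    factors = 10.0 ** np.ceil(np.log10(safe_max))
    return train / factors, test / factors

# Values copied from rows 66 and 12 of X_train and row 40 of X_test above
train = pd.DataFrame(
    {"year": [2009, 2012], "pop": [5.276, 10.594], "prov_zj": [1.0, 0.0]},
    index=[66, 12],
)
test = pd.DataFrame({"year": [2002], "pop": [9.082], "prov_zj": [0.0]}, index=[40])

train_scaled, test_scaled = decimal_scale_sketch(train, test)
print(train_scaled)
print(test_scaled)
```

With only these two training rows the factors happen to match the full example (10^4 for `year`, 10^2 for `pop`), so the printed values agree with the corresponding entries in the tables above.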


## 6. Data normalization with MinMaxScaler

In [10]:
X_train_scaled, X_test_scaled = MinMaxScaler(X_train,X_test)

X_train_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
66,0.500000,0.094319,0.173947,0.176927,0.145251,0.390579,0.0,0.0,0.0,0.0,1.0
54,0.888889,0.833518,0.964883,0.214073,0.544119,0.575614,0.0,0.0,0.0,1.0,0.0
36,0.944444,0.470961,0.964029,0.557440,0.666084,0.764752,0.0,0.0,1.0,0.0,0.0
45,0.388889,0.741731,0.207296,0.127763,0.119660,0.186948,0.0,0.0,0.0,1.0,0.0
52,0.777778,0.808514,0.763764,0.235562,0.440974,0.481335,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
75,1.000000,0.075170,0.563194,0.396602,0.535903,1.000000,0.0,0.0,0.0,0.0,1.0
9,0.500000,0.862478,0.214641,0.581894,0.255137,0.330823,1.0,0.0,0.0,0.0,0.0
72,0.833333,0.135939,0.481940,0.299806,0.406242,0.766576,0.0,0.0,0.0,0.0,1.0
12,0.666667,0.935908,0.322718,0.867170,0.454738,0.501111,1.0,0.0,0.0,0.0,0.0


In [11]:
X_test_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_gd,prov_hn,prov_js,prov_sd,prov_zj
40,0.111111,0.696629,0.039111,0.036688,0.028066,0.056056,0.0,0.0,0.0,1.0,0.0
31,0.666667,0.512739,0.547526,0.481719,0.431193,0.490291,0.0,0.0,1.0,0.0,0.0
46,0.444444,0.749644,0.261131,0.151409,0.148605,0.227113,0.0,0.0,0.0,1.0,0.0
58,0.055556,0.007754,0.027068,0.035371,0.010851,0.112156,0.0,0.0,0.0,0.0,1.0
77,0.055556,0.771483,0.003089,0.000578,0.005052,0.009864,0.0,1.0,0.0,0.0,0.0
49,0.611111,0.78446,0.471284,0.210696,0.298783,0.354778,0.0,0.0,0.0,1.0,0.0
87,0.611111,0.745055,0.304467,0.026858,0.249544,0.2643,0.0,1.0,0.0,0.0,0.0
44,0.333333,0.732553,0.180803,0.10364,0.091655,0.146158,0.0,0.0,0.0,1.0,0.0
88,0.666667,0.747903,0.372843,0.043088,0.299066,0.308541,0.0,1.0,0.0,0.0,0.0
90,0.777778,0.752651,0.546188,0.053241,0.365891,0.372103,0.0,1.0,0.0,0.0,0.0
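
In the outputs above, the `year` column maps 2009 to 0.5 and 2018 to 1.0, which is what min-max scaling gives if the training years run from 2000 to 2018: (2009 - 2000) / (2018 - 2000) = 0.5. The test rows appear to be scaled with the same training minimum and maximum, since 2002 becomes 2/18 ≈ 0.111111. The sketch below shows that procedure with plain pandas. The helper name `min_max_sketch` and the toy data are hypothetical, and the code is an assumption about the behaviour inferred from these numbers, not the package's implementation.

```python
import pandas as pd

def min_max_sketch(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical helper: rescale columns to [0, 1] using the training minimum and maximum."""
    col_min = train.min()
    col_range = train.max() - col_min
    # Constant columns have zero range; divide by 1 instead so they map to 0
    col_range = col_range.where(col_range != 0, 1.0)
    return (train - col_min) / col_range, (test - col_min) / col_range

# Hypothetical toy data: the training years span 2000 to 2018
train = pd.DataFrame({"year": [2000, 2009, 2018]})
test = pd.DataFrame({"year": [2002, 2020]})

train_scaled, test_scaled = min_max_sketch(train, test)
print(train_scaled["year"].tolist())  # [0.0, 0.5, 1.0]
print(test_scaled["year"].tolist())
```

Because the minimum and maximum come from the training data only, a test value outside the training range (2020 in the toy data) is mapped outside [0, 1]. That is expected when the test set is treated as unseen data.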
