<h1> Day 9 - Class </h1>

## Dataset
Golden dataset / Customer 360 dataset: a dataset that is truly representative of all kinds of scenarios.

- Training dataset
- Validation dataset
- Holdout (testing) dataset

Partitioning data into training, validation, and holdout sets lets you develop models that stay accurate on data you collect in the future, not just on the data the model was trained on. By training the model on one subset, validating it on another, and testing it on the holdout set, you get a realistic sense of how accurate the model’s outcomes will be, which leads to better decisions and greater confidence in the model.

### What is a Training Set?
A training set is the subsection of a dataset from which the machine learning algorithm uncovers, or “learns,” relationships between the features and the target variable. In supervised machine learning, training data is labeled with known outcomes.

### What is a Validation Set?
A validation set is another subset of the input data, kept out of training, to which we apply the trained model to see how accurately it has captured the relationships between the target variable and the dataset’s other features. It is the set used to compare algorithms and tune them.

### What is a Holdout Set?
Sometimes referred to as “testing” data, a holdout subset provides a final estimate of the machine learning model’s performance after it has been trained and validated. Holdout sets should never be used to make decisions about which algorithms to use or for improving or tuning algorithms.
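
The three-way partition described above can be sketched with scikit-learn's `train_test_split` applied twice. This is only an illustration: the synthetic `X` and `y` arrays and the 60/20/20 proportions are assumptions, not values from the class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1000 rows, 5 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve out the holdout set (20% of the rows) ...
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# ... then split the remainder into training (60% of the total)
# and validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_holdout))
```

The holdout rows are set aside before any modelling starts, so they cannot influence algorithm choice or tuning.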

### Cross-validation

Cross-validation, or ‘k-fold cross-validation’, randomly splits the dataset into ‘k’ groups (folds). One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. The process is then repeated until each group has been used as the test set exactly once.

For example, for 5-fold cross-validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so that each group gets a chance to be the test set. This can be seen in the figure below.

<img src='img\cross-validation-01.png' />
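
As a rough sketch of the 5-fold procedure, scikit-learn's `KFold` can generate the train/test indices for each round. The ten-row toy array `X` is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples (assumption)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each row ends up in a test fold
test_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_counts[test_idx] += 1
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

print(test_counts)   # every row is used for testing exactly once
```

In practice a model would be fitted on `X[train_idx]` and scored on `X[test_idx]` inside the loop, and the five scores averaged.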

## Memory Management (Pre-Processing)

In [1]:
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

In [2]:
df=pd.read_csv(r"D:\sanooj\datascience\data\home-credit-default-risk\application_test.csv")
df.head()

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |


In [3]:
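# Numeric columns are those with an int/float dtype; everything else is treated as categorical.
# (df.mean() raises a TypeError on mixed-type frames in pandas 2.x, so select by dtype instead.)
num_columns = df.select_dtypes(include='number').columns
cat_columns = [col for col in df.columns if col not in num_columns]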

## QC check .. cat_columns + num_columns should be the total columns
len(cat_columns),len(num_columns),len(df.columns)

(16, 105, 121)

In [4]:
null_values = (df.isna().sum() / df.shape[0]) * 100
null_values

SK_ID_CURR                     0.000000
NAME_CONTRACT_TYPE             0.000000
CODE_GENDER                    0.000000
FLAG_OWN_CAR                   0.000000
FLAG_OWN_REALTY                0.000000
                                ...    
AMT_REQ_CREDIT_BUREAU_DAY     12.409732
AMT_REQ_CREDIT_BUREAU_WEEK    12.409732
AMT_REQ_CREDIT_BUREAU_MON     12.409732
AMT_REQ_CREDIT_BUREAU_QRT     12.409732
AMT_REQ_CREDIT_BUREAU_YEAR    12.409732
Length: 121, dtype: float64

In [5]:
## Columns with at most 30% null values: keep these and treat (impute) their nulls later
treatment_columns = null_values[null_values <= 30].index

## Columns with more than 30% null values: drop these from the analysis
drop_columns = null_values[null_values > 30].index

treatment_columns
drop_columns

Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CN

Index(['OWN_CAR_AGE', 'OCCUPATION_TYPE', 'EXT_SOURCE_1', 'APARTMENTS_AVG',
       'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG',
       'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE',
       'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE',
       'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE',
       'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE',
       'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
       'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI',
       'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI',
       'NONLIVINGAPARTMENTS_MEDI', 'NONLI

In [6]:
## to drop the columns permanently, use inplace=True
df.drop(drop_columns,axis=1,inplace=True)

In [7]:
# Recompute the numeric/categorical split after dropping columns
num_columns = df.select_dtypes(include='number').columns
print(num_columns)

cat_columns = [col for col in df.columns if col not in num_columns]

len(cat_columns),len(num_columns),len(df.columns)

Index(['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLA

(11, 60, 71)

In [10]:
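## Outlier treatment: replace values outside the IQR fences with the column median.
## Note: np.quantile returns nan for columns that still contain nulls, so their
## fences are nan and nothing is replaced; impute nulls before running this step.
for i in df.columns:
    if i  in num_columns:
        print('processing ',i)
        q1 = np.quantile(df[i].values,0.25)
        q3 = np.quantile(df[i].values,0.75)
        iqr = q3 - q1
        utv = q3 + (1.5*iqr)   # upper threshold value
        ltv = q1 - (1.5*iqr)   # lower threshold value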
        target = []
        print(utv)
        print(ltv)
        
        for x in df[i].values:
            if x < ltv or x > utv:
                target.append(df[i].median())
            else:
                target.append(x)
        np.array(target).shape
        df[i].values.shape
        df[i] = target

processing  SK_ID_CURR
636052.125
-79938.875


(48744,)

(48744,)

processing  CNT_CHILDREN
2.5
-1.5


(48744,)

(48744,)

processing  AMT_INCOME_TOTAL
393750.0
-56250.0


(48744,)

(48744,)

processing  AMT_CREDIT
1296540.0
-360900.0


(48744,)

(48744,)

processing  AMT_ANNUITY
nan
nan




(48744,)

(48744,)

processing  AMT_GOODS_PRICE
1237500.0
-382500.0


(48744,)

(48744,)

processing  REGION_POPULATION_RELATIVE
0.056648500000000004
-0.017979500000000002


(48744,)

(48744,)

processing  DAYS_BIRTH
-1784.5
-30348.5


(48744,)

(48744,)

processing  DAYS_EMPLOYED
3625.0
-6831.0


(48744,)

(48744,)

processing  DAYS_REGISTRATION
6436.375
-15796.625


(48744,)

(48744,)

processing  DAYS_ID_PUBLISH
2407.0
-8561.0


(48744,)

(48744,)

processing  FLAG_MOBIL
1.0
1.0


(48744,)

(48744,)

processing  FLAG_EMP_PHONE
1.0
1.0


(48744,)

(48744,)

processing  FLAG_WORK_PHONE
0.0
0.0


(48744,)

(48744,)

processing  FLAG_CONT_MOBILE
1.0
1.0


(48744,)

(48744,)

processing  FLAG_PHONE
2.5
-1.5


(48744,)

(48744,)

processing  FLAG_EMAIL
0.0
0.0


(48744,)

(48744,)

processing  CNT_FAM_MEMBERS
4.5
0.5


(48744,)

(48744,)

processing  REGION_RATING_CLIENT
2.0
2.0


(48744,)

(48744,)

processing  REGION_RATING_CLIENT_W_CITY
2.0
2.0


(48744,)

(48744,)

processing  HOUR_APPR_PROCESS_START
20.0
4.0


(48744,)

(48744,)

processing  REG_REGION_NOT_LIVE_REGION
0.0
0.0


(48744,)

(48744,)

processing  REG_REGION_NOT_WORK_REGION
0.0
0.0


(48744,)

(48744,)

processing  LIVE_REGION_NOT_WORK_REGION
0.0
0.0


(48744,)

(48744,)

processing  REG_CITY_NOT_LIVE_CITY
0.0
0.0


(48744,)

(48744,)

processing  REG_CITY_NOT_WORK_CITY
0.0
0.0


(48744,)

(48744,)

processing  LIVE_CITY_NOT_WORK_CITY
0.0
0.0


(48744,)

(48744,)

processing  EXT_SOURCE_2
nan
nan


(48744,)

(48744,)

processing  EXT_SOURCE_3
nan
nan


(48744,)

(48744,)

processing  OBS_30_CNT_SOCIAL_CIRCLE
nan
nan


(48744,)

(48744,)

processing  DEF_30_CNT_SOCIAL_CIRCLE
nan
nan


(48744,)

(48744,)

processing  OBS_60_CNT_SOCIAL_CIRCLE
nan
nan


(48744,)

(48744,)

processing  DEF_60_CNT_SOCIAL_CIRCLE
nan
nan


(48744,)

(48744,)

processing  DAYS_LAST_PHONE_CHANGE
1741.875
-3871.125


(48744,)

(48744,)

processing  FLAG_DOCUMENT_2
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_3
1.0
1.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_4
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_5
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_6
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_7
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_8
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_9
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_10
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_11
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_12
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_13
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_14
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_15
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_16
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_17
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_18
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_19
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_20
0.0
0.0


(48744,)

(48744,)

processing  FLAG_DOCUMENT_21
0.0
0.0


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_HOUR
nan
nan


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_DAY
nan
nan


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_WEEK
nan
nan


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_MON
nan
nan


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_QRT
nan
nan


(48744,)

(48744,)

processing  AMT_REQ_CREDIT_BUREAU_YEAR
nan
nan


(48744,)

(48744,)
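
The `nan` thresholds printed above appear for columns that still contain nulls, so those columns are effectively skipped. One possible NaN-aware variant is sketched below; the helper name `replace_outliers_with_median` and the toy series are hypothetical and were not part of the class notebook. It relies on `Series.quantile`, which ignores nulls.

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(s: pd.Series) -> pd.Series:
    """Replace values outside the 1.5*IQR fences with the column median."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)   # Series.quantile skips NaN
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Comparisons with NaN are False, so nulls are left untouched here
    is_outlier = (s < lower) | (s > upper)
    return s.mask(is_outlier, s.median())

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, np.nan])
cleaned = replace_outliers_with_median(toy)
print(cleaned.tolist())
```

Applied to the DataFrame, this could be run per numeric column, for example `df[col] = replace_outliers_with_median(df[col])`, without building a Python list row by row.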

In [11]:
df

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0.0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0.0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0.0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2.0 | 315000.0 | 450000.0 | 49018.5 | 396000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1.0 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0.0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 48740 | 456222 | Cash loans | F | N | N | 2.0 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | 0.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48741 | 456223 | Cash loans | F | Y | Y | 1.0 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 |
| 48742 | 456224 | Cash loans | M | N | N | 0.0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |


### Reduce memory footprint

In [1]:
import numpy as np
np.iinfo('int8')

iinfo(min=-128, max=127, dtype=int8)

In [2]:
np.iinfo('int32')

iinfo(min=-2147483648, max=2147483647, dtype=int32)

As you can see, int32 covers a far wider range than int8, but each int32 value takes 4 bytes of memory instead of 1. So by checking the max and min values of a column, we can assign the smallest datatype that can hold them, thereby reducing the overall memory utilisation.
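
To make the size difference concrete, the short sketch below prints the bytes per value for each integer type and compares two arrays of one million zeros. The array size is an arbitrary choice for illustration.

```python
import numpy as np

# Bytes per value and representable range for each signed integer type
for name in ("int8", "int16", "int32", "int64"):
    info = np.iinfo(name)
    print(f"{name}: {np.dtype(name).itemsize} byte(s), range [{info.min}, {info.max}]")

# One million values stored as int8 versus int64
small = np.zeros(1_000_000, dtype=np.int8)
large = np.zeros(1_000_000, dtype=np.int64)
print(small.nbytes, large.nbytes)
```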

In [24]:
df=pd.read_csv(r"D:\sanooj\datascience\data\home-credit-default-risk\application_test.csv")

In [25]:
df.memory_usage()

Index                            128
SK_ID_CURR                    389952
NAME_CONTRACT_TYPE            389952
CODE_GENDER                   389952
FLAG_OWN_CAR                  389952
                               ...  
AMT_REQ_CREDIT_BUREAU_DAY     389952
AMT_REQ_CREDIT_BUREAU_WEEK    389952
AMT_REQ_CREDIT_BUREAU_MON     389952
AMT_REQ_CREDIT_BUREAU_QRT     389952
AMT_REQ_CREDIT_BUREAU_YEAR    389952
Length: 122, dtype: int64

In [26]:
## Memory usage in MB
df.memory_usage().sum()/(1024 * 1024)

44.99847412109375
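
Note that `memory_usage()` without arguments counts only 8 bytes per cell for `object` columns (the pointer to each Python string), which is consistent with every column above reporting 389952 bytes. Passing `deep=True` also counts the strings themselves. A small sketch with a made-up two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": range(1000),
    # object (string) column; dtype is forced to object for this illustration
    "contract": pd.Series(["Cash loans"] * 1000, dtype=object),
})

shallow = toy.memory_usage().sum()           # 8 bytes per object pointer
deep = toy.memory_usage(deep=True).sum()     # also counts the string payloads
print(shallow, deep)
```

For frames with many text columns, the `deep=True` figure is usually the more realistic baseline when measuring the effect of converting to `category`.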

In [28]:
df.dtypes

SK_ID_CURR                      int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
FLAG_OWN_REALTY                object
                               ...   
AMT_REQ_CREDIT_BUREAU_DAY     float64
AMT_REQ_CREDIT_BUREAU_WEEK    float64
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
Length: 121, dtype: object

In [32]:
np.iinfo('int8')
np.iinfo('int16')
np.iinfo('int32')

iinfo(min=-128, max=127, dtype=int8)

iinfo(min=-32768, max=32767, dtype=int16)

iinfo(min=-2147483648, max=2147483647, dtype=int32)

In [53]:
df=pd.read_csv(r"D:\sanooj\datascience\data\home-credit-default-risk\application_test.csv")
start_mem = df.memory_usage().sum() / 1024**2
print('Memory usage before optimization is: {:.2f} MB'.format(start_mem))

for col in df.columns:
    col_type = df[col].dtype
    if col_type != object:
        c_min = df[col].min()
        c_max = df[col].max()
    #type(col_type)
        if str(col_type)[0:3] == 'int':
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                df[col] = df[col].astype(np.int64)  
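        else:
            # Caution: float16 keeps only about 3 significant decimal digits and
            # overflows above 65504, so this branch trades precision for memory
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)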
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
    else:
        df[col] = df[col].astype('category')

end_mem = df.memory_usage().sum() / 1024**2
print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

Memory usage before optimization is: 45.00 MB
Memory usage after optimization is: 9.40 MB
Decreased by 79.1%


In [66]:
df=pd.read_csv(r"D:\sanooj\datascience\data\amazon-reviews-unlocked-mobile-phones\Amazon_Unlocked_Mobile.csv")

## Memory Management
start_mem = df.memory_usage().sum() / 1024**2
print('Memory usage before optimization is: {:.2f} MB'.format(start_mem))

for col in df.columns:
    col_type = df[col].dtype
    if col_type != object:
        c_min = df[col].min()
        c_max = df[col].max()
    #type(col_type)
        if str(col_type)[0:3] == 'int':
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                df[col] = df[col].astype(np.int64)  
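        else:
            # Caution: float16 keeps only about 3 significant decimal digits and
            # overflows above 65504, so this branch trades precision for memory
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)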
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
    else:
        df[col] = df[col].astype('category')

end_mem = df.memory_usage().sum() / 1024**2
print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

Memory usage before optimization is: 18.94 MB
Memory usage after optimization is: 11.57 MB
Decreased by 38.9%
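
Since the same loop was used for both datasets, it could be wrapped in a reusable function. The sketch below is an alternative, not the code run in class: it uses `pd.to_numeric(..., downcast=...)`, which stops at float32 for floats and therefore avoids float16 altogether. The function name `downcast_numeric` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")   # smallest is float32
    return out

toy = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "days": np.array([-20000, -15000, -300, -1], dtype=np.int64),
    "amount": np.array([135000.0, 99000.0, 202500.0, 315000.0]),
})
small = downcast_numeric(toy)
print(small.dtypes)
```

The memory saving is smaller than with float16, but aggregate statistics such as mean and standard deviation remain usable.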


In [67]:
## Descriptive Statistics
df.describe()

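| | Price | Rating | Review Votes |
|---|---|---|---|
| count | 407907.0 | 413840.0 | 401544.0 |
| mean | NaN | 3.819578 | NaN |
| std | NaN | 1.548216 | NaN |
| min | 1.730469 | 1.0 | 0.0 |
| 25% | 80.0 | 3.0 | 0.0 |
| 50% | 144.75 | 5.0 | 0.0 |
| 75% | 270.0 | 5.0 | 1.0 |
| max | 2598.0 | 5.0 | 645.0 |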
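
The missing (NaN) mean and std values for Price and Review Votes are most likely a side effect of the float16 downcast in the previous cell: sums over hundreds of thousands of rows exceed the largest float16 value. The sketch below reproduces the effect with NumPy on made-up prices; it does not use the Amazon data.

```python
import numpy as np

print(np.finfo(np.float16).max)          # largest finite float16 value

prices64 = np.full(1000, 200.0)          # toy prices (assumption), float64
prices16 = prices64.astype(np.float16)

total64 = prices64.sum()
total16 = prices16.sum()                 # result is float16 and overflows to inf
print(total64, total16)
```

Keeping such columns as float32 (or float64) avoids the problem at the cost of a smaller memory saving.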


In [69]:
df.shape
df.head()

(413840, 6)

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 200.0 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 200.0 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 200.0 | 5 | Very pleased | 0.0 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 200.0 | 4 | It works good but it goes slow sometimes but i... | 0.0 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 200.0 | 4 | Great phone to replace my lost phone. The only... | 0.0 |


In [None]:
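# Select numeric columns by dtype (df.mean() fails on category columns)
numerical_columns = df.select_dtypes(include='number').columns
category_columns = [col for col in df.columns if col not in numerical_columns]

# QC: numerical + categorical columns should add up to the total number of columns
len(numerical_columns), len(category_columns), len(df.columns)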

In [None]:
# Find Null Values
null_values = (df.isna().sum() / df.shape[0]) * 100
null_values

## Columns with at most 30% null values: keep these and treat (impute) their nulls
treatment_columns = null_values[null_values <= 30].index

## Columns with more than 30% null values: drop these from the analysis
drop_columns = null_values[null_values > 30].index

treatment_columns
drop_columns

In [None]:
## to drop the columns permanently, use inplace=True
df.drop(drop_columns,axis=1,inplace=True)
df.shape

In [None]:
## for all numeric columns, replace the null values with the median
## for all categorical columns, replace the null values with the mode
for i in df.columns:
    if i in numerical_columns:
        # assign back instead of using inplace=True on a column selection
        df[i] = df[i].fillna(df[i].median())
    else:
        df[i] = df[i].fillna(df[i].mode().iloc[0])

In [None]:
## Outlier treatment: replace values outside the IQR fences with the column median
for i in df.columns:
    if i in numerical_columns:
        q1 = np.quantile(df[i].values,0.25)
        q3 = np.quantile(df[i].values,0.75)
        iqr = q3 - q1
        utv = q3 + (1.5*iqr)   # upper threshold value
        ltv = q1 - (1.5*iqr)   # lower threshold value
        median = df[i].median()
        target = []

        for x in df[i].values:
            if x < ltv or x > utv:
                target.append(median)
            else:
                target.append(x)
        # assign the column back; `df[i].values` cannot be assigned to
        df[i] = target

## Transformations

Sometimes the raw data doesn't show any pattern (e.g. a normal distribution or a correlation), so we can apply different kinds of transformations to the data to see whether a pattern emerges.

For example:
- x -> 1/x
- x -> x * 2
- x -> log(x)

Transformations come at the cost of interpretability: the transformed values are no longer in the original units.
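
As a small illustration of a transformation revealing structure, the sketch below applies a log transform to a synthetic right-skewed variable and compares skewness before and after. The variable name `income` and the lognormal parameters are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, strongly right-skewed values (assumption)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=10_000))

log_income = np.log1p(income)   # log(1 + x), which is also safe for zeros

# Skewness close to 0 indicates a roughly symmetric distribution
print(round(income.skew(), 2), round(log_income.skew(), 2))
```

The transformed column is closer to symmetric, but a value such as 11.2 now means "log of income" rather than an amount, which is the interpretability cost noted above.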

### Standard Scaler
The idea behind StandardScaler is to transform the data so that each feature has a mean of 0 and a standard deviation of 1 (μ = 0 and σ = 1).

StandardScaler() standardizes each feature, i.e. each column of X, INDIVIDUALLY (!!!), and this is typically done before applying any machine learning model.

StandardScaler works best when the data within each feature is roughly normally distributed. It only shifts and rescales the values so that they are centred around 0 with a standard deviation of 1; it does not change the shape of the distribution.

The mean and standard deviation are calculated for each feature, and the feature is then scaled based on:

<img src='img/std-scalar-01.png'/>
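
In case the image does not render, the standardization it refers to is presumably the usual z-score, written here for a value $x$ of a feature with mean $\mu$ and standard deviation $\sigma$ (scikit-learn uses the population standard deviation, i.e. division by $N$):

$$
z = \frac{x - \mu}{\sigma}, \qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
$$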


In [2]:
from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
scaler.fit(data)
output = scaler.transform(data)
output

array([[-1., -1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [5]:
import numpy as np
np.mean(data)

0.5

In [6]:
np.std(data)

0.5

In [3]:
print(scaler.transform([[2, 2]]))

[[3. 3.]]
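
The same numbers can be reproduced by hand. Note that `np.mean(data)` and `np.std(data)` above are computed over the flattened array and only match here because both columns are identical; the scaler works per column, which corresponds to `axis=0`. A minimal check:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

mu = data.mean(axis=0)       # per-column mean
sigma = data.std(axis=0)     # per-column population standard deviation (ddof=0)
manual = (data - mu) / sigma

scaler = StandardScaler().fit(data)
print(np.allclose(manual, scaler.transform(data)))

# The new point [2, 2] is scaled with the statistics learned from `data`
print((np.array([[2.0, 2.0]]) - mu) / sigma)
```

The second print matches the `[[3. 3.]]` result: (2 - 0.5) / 0.5 = 3.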


## Preprocessing steps
- Read training, testing datasets
- Print descriptive statistics
- Memory management
- Null value treatment
- Outlier treatment
- MinMax Scaler
- Standard Scaler
- Transformations
- Garbage value removal
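
The MinMax scaling step in the list is not demonstrated in these notes. As a hedged sketch, scikit-learn's `MinMaxScaler` rescales each column to the range [0, 1] using (x - min) / (max - min); the toy values below are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (toy values, assumption)
data = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 400.0]])

scaled = MinMaxScaler().fit_transform(data)
print(scaled)
```

Unlike StandardScaler, the output is bounded, but a single extreme value in a column compresses all the other values towards 0, which is one reason outlier treatment comes earlier in the list.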

## TODO
- Read about Python constructors
- Read about generators
- Refer to the YouTube video and plot the data to show the benefits of the MinMax and Standard scalers