- In this notebook, we work through a case study to learn how data must be prepared before it can be analyzed and used with ML algorithms.

#### Problem Statement
__Case 1__
- We are given data about the customers of a bank. The bank wants to use an automated system to predict whether a loan can be granted to a customer based on his or her credentials. Before building that system, we need to understand the data clearly and prepare it accordingly. This exercise gives you a glimpse of the general pre-processing steps followed during data preparation.


- Steps involved:

    1. Read the data and preliminary analysis of data
        - How many records and how many variables are available in data
        - What are the data types that are available
        - Structure and basic summary of the data
    2. Variable interpretation: checking whether R interpreted the variable types correctly
        - If not, perform a type conversion
    3. Missing value analysis
    
    4. Creating different kinds of data
        - Binning the numeric variables
        - Dummification of categorical attributes
        - Data standardization
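
The steps 3 and 4 listed above can be sketched on a small example before they are applied to the real data. This is only a minimal illustration in base R: the data frame `toy`, its values, the choice of three bins, and the bin labels are all made up for this sketch and are not taken from the bank data.

```r
# Toy data frame; the values are invented for illustration only
toy <- data.frame(
  Age       = c(25, 45, 39, NA, 35, 53),
  Income    = c(49, 34, 11, 100, 45, 72),
  Education = factor(c(1, 1, 2, 2, 3, 3))
)

# Step 3: count the missing values in each column
na_counts <- colSums(is.na(toy))

# Step 4a: bin a numeric variable into three equal-width intervals
# (the number of bins and the labels are arbitrary choices here)
toy$Income_bin <- cut(toy$Income, breaks = 3,
                      labels = c("low", "medium", "high"))

# Step 4b: dummify a categorical attribute (one 0/1 column per level)
edu_dummies <- model.matrix(~ Education - 1, data = toy)

# Step 4c: standardize a numeric variable to mean 0 and standard deviation 1
toy$Income_std <- as.numeric(scale(toy$Income))

print(na_counts)
print(toy)
print(edu_dummies)
```

Dropping the intercept with `- 1` in the formula keeps one indicator column per level; whether to keep all levels or drop one depends on the algorithm used later.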


### 1. Read the data and preliminary analysis of data

In [12]:
# Read the data

data = read.csv('Data/Bank.csv')

In [13]:
# Understanding the data structure (similar to data.dtypes in Python)

str(data)

'data.frame':	5000 obs. of  14 variables:
 $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
 $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
 $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
 $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
 $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
 $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
 $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
 $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
 $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
 $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...
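
The first line of the `str()` output already answers how many records and variables there are. If only those numbers or only the column types are needed, base R offers more direct calls. The sketch below uses a small stand-in data frame named `bank` (a hypothetical name) holding three columns of the first three rows shown above, so that it runs without the CSV file.

```r
# Stand-in for the data frame read from the CSV file;
# three columns of the first three rows shown in the str() output
bank <- data.frame(
  ID    = 1:3,
  Age   = c(25L, 45L, 39L),
  CCAvg = c(1.6, 1.5, 1.0)
)

dims      <- dim(bank)            # number of rows, number of columns
n_records <- nrow(bank)           # number of records
n_vars    <- ncol(bank)           # number of variables
col_types <- sapply(bank, class)  # class of each column

print(dims)
print(col_types)
```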


- Identify the numeric and categorical attributes in the data

Here we need to identify the correct data type of each variable. Understanding the data helps with this, and so does knowledge of the domain.
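
As a sketch of the type conversion, the code below turns integer-coded columns into factors. Which columns count as categorical is an assumption made here for illustration (`Education`, `Personal.Loan`, and `Securities.Account`); the final choice should follow from the data and the domain. The stand-in data frame `bank` is a hypothetical name and holds the first four rows of four columns from the `str()` output.

```r
# Stand-in data frame with the first four rows of four columns
bank <- data.frame(
  Age                = c(25L, 45L, 39L, 35L),
  Education          = c(1L, 1L, 1L, 2L),
  Personal.Loan      = c(0L, 0L, 0L, 0L),
  Securities.Account = c(1L, 1L, 0L, 0L)
)

# Assumed list of categorical attributes (adjust after inspecting the data)
cat_cols <- c("Education", "Personal.Loan", "Securities.Account")

# Convert the selected columns to factors; the other columns are left unchanged
bank[cat_cols] <- lapply(bank[cat_cols], as.factor)

str(bank)
```

Running `str()` again after the conversion is a quick way to confirm that R now treats these columns as factors.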

In [17]:
# View the first rows of the data (similar to data.head(10) in Python)

head(data,10)
# tail(data,5)

ID,Age,Experience,Income,ZIP.Code,Family,CCAvg,Education,Mortgage,Personal.Loan,Securities.Account,CD.Account,Online,CreditCard
1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
6,37,13,29,92121,4,0.4,2,155,0,0,0,1,0
7,53,27,72,91711,2,1.5,2,0,0,0,0,1,0
8,50,24,22,93943,1,0.3,3,0,0,0,0,0,1
9,35,10,81,90089,3,0.6,2,104,0,0,0,1,0
10,34,9,180,93023,1,8.9,3,0,1,0,0,0,0


In [19]:
# Describe the data (similar to data.describe() in Python)

summary(data)

       ID            Age          Experience       Income          ZIP.Code    
 Min.   :   1   Min.   :23.00   Min.   :-3.0   Min.   :  8.00   Min.   : 9307  
 1st Qu.:1251   1st Qu.:35.00   1st Qu.:10.0   1st Qu.: 39.00   1st Qu.:91911  
 Median :2500   Median :45.00   Median :20.0   Median : 64.00   Median :93437  
 Mean   :2500   Mean   :45.34   Mean   :20.1   Mean   : 73.77   Mean   :93153  
 3rd Qu.:3750   3rd Qu.:55.00   3rd Qu.:30.0   3rd Qu.: 98.00   3rd Qu.:94608  
 Max.   :5000   Max.   :67.00   Max.   :43.0   Max.   :224.00   Max.   :96651  
     Family          CCAvg          Education        Mortgage    
 Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :  0.0  
 1st Qu.:1.000   1st Qu.: 0.700   1st Qu.:1.000   1st Qu.:  0.0  
 Median :2.000   Median : 1.500   Median :2.000   Median :  0.0  
 Mean   :2.396   Mean   : 1.938   Mean   :1.881   Mean   : 56.5  
 3rd Qu.:3.000   3rd Qu.: 2.500   3rd Qu.:3.000   3rd Qu.:101.0  
 Max.   :4.000   Max.   :10.000   Max.   :3.

In [22]:
# Print the column names of the data frame (similar to data.columns in Python)

names(data)