## Agenda
* Setting the working directory and finding the current working directory
* Reading the data and preliminary analysis
  - How many records and how many variables are available in the data
  - What data types are available
  - Structure and basic summary of the data
  - Checking the first and last records of the data

* Variable interpretation: has R interpreted the variable types correctly?
  - If not, perform a type conversion
* Missing value analysis
* Binning the numeric variables
* Numeric Data standardization


#### Present Working Directory – getwd()
* R is always pointed at a directory. You can find out which directory by running the getwd (get working directory) function.

<img src ="img/r_setdir.png">

In [44]:
getwd()

#### Setting the working directory – setwd()
To change your working directory, use setwd and specify the path to the desired folder.
* We generally set the working directory to the folder where our data is present

In [45]:
setwd('/home/srilakshmik/CSE 7212o/Session 4')
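The following is a minimal sketch of how `getwd()` and `setwd()` work together. It uses a temporary directory (an assumption for illustration, not the course folder) so that it runs on any machine, and it restores the original directory at the end.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() stands in for a real project folder (illustrative only)
target <- tempdir()
setwd(target)

# normalizePath() resolves symlinks so that both paths are comparable
same_dir <- normalizePath(getwd()) == normalizePath(target)
print(same_dir)

# Restore the original working directory
setwd(old_wd)
```

Restoring the previous directory is a small habit that keeps relative file paths in later cells predictable.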

### Installing packages in R

<img src ="img/r_libraries.png">

In [46]:
# install.packages('DMwR')  # run once to install the package
library(DMwR)               # load the package for this session

### Data Description

The data describes the customers of a bank. The bank eventually wants an automatic system that predicts whether a loan can be granted to a customer based on his/her credentials. Before building that system, we need to understand the data clearly and prepare it accordingly. This exercise gives a glimpse of the general pre-processing steps followed during data preparation. Note, however, that not every step is needed every time or on every kind of data.

 The available customer datasets are:
 * Customer_Bank Details_MV.csv  - Customer bank details
 * Customer_Demographics_MV_DOB.csv - Customer demographic details

### Variable Description

#### Customer_Bank

* ID : Customer ID 
* CCAvg  : Avg. spending on credit cards per month in dollars
* Mortgage : Value of house mortgage, if any, in dollars
* Personal Loan : Did this customer accept the personal loan offered in the last campaign? (Target attribute)
* Securities Account : Does the customer have a securities account with the bank?
* CD Account : Does the customer have a certificate of deposit (CD) account with the bank?
* Online : Does the customer use internet banking facilities?
* CreditCard : Does the customer use a credit card issued by UniversalBank?


#### Customer_Demographics
* Customer ID : ID of customer
* DOB : Customer's date of birth
* Experience : Years of professional experience
* Income : Annual income of the customer in dollars
* ZIP Code : Home address ZIP code
* Family : Family size of the customer
* Education : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
 

#### Reading data

* Read the data from a csv file and create a dataframe
* The data must be read into the R environment before any processing can be done on it


<img src ="img/r_datareading.png">

In [47]:
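# header=FALSE treats the first row as data, so the columns are named V1, V2, ... (see the output below)
Customer_Bank<-read.csv("data/Customer_Bank Details_MV.csv",header=FALSE)
Customer_Demographics<-read.csv("data/Customer_Demographics_MV_DOB.csv",header=FALSE)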

In [48]:
Customer_Demographics

V1,V2,V3,V4,V5,V6,V7
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Customer ID,DOB,Experience,Income,ZIP Code,Family,Education
1,2/17/1993,1,49,91107,4,1
2,2/2/1973,19,34,90089,3,1
3,2/7/1979,15,11,94720,1,1
4,2/10/1983,9,100,94112,1,2
5,2/10/1983,8,45,91330,?,2
6,2/8/1981,13,29,92121,4,2
7,1/27/1965,27,72,91711,2,2
8,1/30/1968,24,22,93943,1,3
9,2/10/1983,10,81,90089,3,2


In [49]:
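# header=TRUE uses the first row as column names; "NA" and "?" are both read as missing values
Customer_Bank<-read.csv("data/Customer_Bank Details_MV.csv",header=TRUE,na.strings=c("NA","?"))
Customer_Demographics<-read.csv("data/Customer_Demographics_MV_DOB.csv",header=TRUE,
                               na.strings=c("NA","?"))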

In [50]:
Customer_Demographics

Customer.ID,DOB,Experience,Income,ZIP.Code,Family,Education
<int>,<chr>,<int>,<int>,<int>,<int>,<int>
1,2/17/1993,1,49,91107,4,1
2,2/2/1973,19,34,90089,3,1
3,2/7/1979,15,11,94720,1,1
4,2/10/1983,9,100,94112,1,2
5,2/10/1983,8,45,91330,,2
6,2/8/1981,13,29,92121,4,2
7,1/27/1965,27,72,91711,2,2
8,1/30/1968,24,22,93943,1,3
9,2/10/1983,10,81,90089,3,2
10,2/11/1984,9,180,93023,1,3
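The effect of `na.strings` can be reproduced without the course files. The sketch below reads a tiny inline CSV (made-up values that only mimic the demographics file) once without and once with `"?"` declared as a missing-value marker.

```r
# A tiny inline CSV that mimics the demographics file (values are made up)
csv_text <- "Customer ID,Family,Education
1,4,1
2,?,2
3,1,?"

# Without na.strings, the "?" entries prevent the columns from being numeric
raw <- read.csv(text = csv_text, header = TRUE)
print(sapply(raw, class))

# Declaring "?" as a missing-value marker lets R infer integer columns
clean <- read.csv(text = csv_text, header = TRUE, na.strings = c("NA", "?"))
print(sapply(clean, class))
print(colSums(is.na(clean)))
```

Note that `read.csv` also rewrites the header `Customer ID` as `Customer.ID`, which is why the merged data later refers to `Customer.ID`.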


### Understanding the Data

Checking the class of data

In [51]:
class(Customer_Bank)

class(Customer_Demographics)

Check the number of rows and number of columns in data

In [52]:
dim(Customer_Bank)

dim(Customer_Demographics)

<img src ="img/merge.png">

In [53]:
# Merging the two datasets on the customer ID
customer_data<-merge(Customer_Demographics,Customer_Bank,
                  by.x="Customer.ID",by.y="ID")

`merge()` performs an inner join by default, keeping only the IDs present in both datasets. Since both datasets here contain the same IDs, a left, right or full outer join would return the same rows, so the default is sufficient.
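The join types only differ when the IDs do not match. A small sketch with two made-up tables (the names `demo` and `bank` are illustrative) shows how the `all.x` and `all` arguments of `merge()` change the result.

```r
# Toy tables (made-up values); ID 3 is missing from the bank table on purpose
demo <- data.frame(Customer.ID = 1:3, Income = c(49, 34, 11))
bank <- data.frame(ID = c(1, 2, 4), CCAvg = c(1.6, 1.5, 2.7))

# Inner join (default): only IDs present in both tables
inner <- merge(demo, bank, by.x = "Customer.ID", by.y = "ID")

# Left join: keep every row of the first table and fill the gaps with NA
left <- merge(demo, bank, by.x = "Customer.ID", by.y = "ID", all.x = TRUE)

# Full outer join: keep every ID found in either table
full <- merge(demo, bank, by.x = "Customer.ID", by.y = "ID", all = TRUE)

print(c(inner = nrow(inner), left = nrow(left), full = nrow(full)))
```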

### Understanding the descriptive statistics of the data

Descriptive statistics are often the first step and an important part of any statistical analysis, as they aim to summarize, describe and present a series of values or a dataset. They allow us to check the quality of the data and help us “understand” the data by giving a clear overview of it.


#### Understanding the structure of the data

* The str() command shows the structure of the data

In [54]:
str(customer_data)

'data.frame':	5000 obs. of  14 variables:
 $ Customer.ID       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DOB               : chr  "2/17/1993" "2/2/1973" "2/7/1979" "2/10/1983" ...
 $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
 $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
 $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
 $ Family            : int  4 3 1 1 NA 4 2 1 3 1 ...
 $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
 $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
 $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
 $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
 $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...


#### summary( )
* The summary( ) command gives a statistical summary of the data. For numeric columns it reports the five-number summary (minimum, Q1, median, Q3 and maximum) together with the mean and the number of NA's.

In [55]:
#summary of R object: 
summary(customer_data)


  Customer.ID       DOB              Experience        Income      
 Min.   :   1   Length:5000        Min.   :-6.00   Min.   :  8.00  
 1st Qu.:1251   Class :character   1st Qu.:10.00   1st Qu.: 39.00  
 Median :2500   Mode  :character   Median :20.00   Median : 64.00  
 Mean   :2500                      Mean   :20.11   Mean   : 73.76  
 3rd Qu.:3750                      3rd Qu.:30.00   3rd Qu.: 98.00  
 Max.   :5000                      Max.   :43.00   Max.   :224.00  
                                   NA's   :8       NA's   :7       
    ZIP.Code         Family        Education         CCAvg       
 Min.   : 9307   Min.   :1.000   Min.   :1.000   Min.   : 0.000  
 1st Qu.:91911   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 0.700  
 Median :93437   Median :2.000   Median :2.000   Median : 1.500  
 Mean   :93152   Mean   :2.396   Mean   :1.881   Mean   : 1.932  
 3rd Qu.:94608   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 2.500  
 Max.   :96651   Max.   :4.000   Max.   :3.000   Max.   :10.
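To see where the numbers in a summary come from, the sketch below applies `summary()` and `quantile()` to a short made-up vector with one missing entry.

```r
# Made-up experience values, including one missing entry
x <- c(1, 19, 15, 9, 8, NA, 27)

# summary() reports min, quartiles, median, mean, max and the NA count
s <- summary(x)
print(s)

# The same quartiles can be obtained directly with quantile()
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
print(q)
```

A minimum that looks implausible, such as the negative Experience value in the output above, is exactly the kind of data-quality issue this summary helps to spot.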


#### head ( )
- The head( ) function in R is used to display the first n rows present in the input data frame.



In [56]:
#First few of R object: 
head(customer_data, n=15)


,Customer.ID,DOB,Experience,Income,ZIP.Code,Family,Education,CCAvg,Mortgage,Personal.Loan,Securities.Account,CD.Account,Online,CreditCard
,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,2/17/1993,1,49,91107,4.0,1,1.6,0,0,1,0,0,0
2,2,2/2/1973,19,34,90089,3.0,1,1.5,0,0,1,0,0,0
3,3,2/7/1979,15,11,94720,1.0,1,1.0,0,0,0,0,0,0
4,4,2/10/1983,9,100,94112,1.0,2,2.7,0,0,0,0,0,0
5,5,2/10/1983,8,45,91330,,2,1.0,0,0,0,0,0,1
6,6,2/8/1981,13,29,92121,4.0,2,0.4,155,0,0,0,1,0
7,7,1/27/1965,27,72,91711,2.0,2,1.5,0,0,0,0,1,0
8,8,1/30/1968,24,22,93943,1.0,3,0.3,0,0,0,0,0,1
9,9,2/10/1983,10,81,90089,3.0,2,0.6,104,0,0,0,1,0
10,10,2/11/1984,9,180,93023,1.0,3,8.9,0,1,0,0,0,0


#### tail ( )
- The tail( ) function in R is used to display the last n rows present in the input data frame.



In [57]:
#last few of R object: 
tail(customer_data, n=15)


,Customer.ID,DOB,Experience,Income,ZIP.Code,Family,Education,CCAvg,Mortgage,Personal.Loan,Securities.Account,CD.Account,Online,CreditCard
,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
4986,4986,1/31/1970,23,30,94720,3,2,1.7,162,0,0,0,1,0
4987,4987,2/12/1986,6,78,95825,1,3,2.9,0,0,0,0,0,0
4988,4988,1/31/1970,23,43,93943,3,2,1.7,159,0,0,0,1,0
4989,4989,2/11/1984,8,85,95134,1,1,2.5,136,0,0,0,0,1
4990,4990,2/18/1994,0,38,93555,1,3,1.0,0,0,0,0,1,0
4991,4991,1/26/1963,25,58,95023,4,3,2.0,219,0,0,0,0,1
4992,4992,1/29/1967,25,92,91330,1,2,1.9,100,0,0,0,0,1
4993,4993,2/14/1988,5,13,90037,4,3,0.5,0,0,0,0,0,0
4994,4994,2/2/1973,21,218,91801,2,1,6.67,0,0,0,0,1,0
4995,4995,1/19/1954,40,75,94588,3,3,2.0,0,0,0,0,1,0


In [58]:
# count the unique values in each column (sapply works column by column without converting to a matrix)
sapply(customer_data, function(x) length(unique(x)))

### Variable interpretation

* We observe that the data types of a few variables are not interpreted correctly.

* So, we convert these variables to the required data types using the ```as.*``` functions.


In [59]:
## Columns to convert with an as.* function; listing them in a vector is convenient when there are several
cat_att<-c("Family","Education","Personal.Loan","CD.Account","Securities.Account","Online","CreditCard")

In [60]:
## Type conversion: turn each listed column into a factor
customer_data[cat_att] <- lapply(customer_data[cat_att], as.factor)
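Why the conversion matters can be seen on a small made-up data frame: a coded category summarized as a number yields a meaningless mean, while the same column as a factor yields counts per level. The last line is a common precaution when converting a factor back to numbers.

```r
# Made-up data: Education is a coded category, Income is a true number
df <- data.frame(Education = c(1, 2, 3, 2), Income = c(49, 34, 11, 100))

# As a number, summary() reports a mean that has no meaning for codes
print(summary(df$Education))

# As a factor, summary() reports a count per level instead
df$Education <- as.factor(df$Education)
print(summary(df$Education))

# To recover the original codes, convert to character first;
# as.numeric() on a factor alone would return the internal level indices
codes <- as.numeric(as.character(df$Education))
```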

In [61]:
str(customer_data)

'data.frame':	5000 obs. of  14 variables:
 $ Customer.ID       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DOB               : chr  "2/17/1993" "2/2/1973" "2/7/1979" "2/10/1983" ...
 $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
 $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
 $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
 $ Family            : Factor w/ 4 levels "1","2","3","4": 4 3 1 1 NA 4 2 1 3 1 ...
 $ Education         : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 3 2 3 ...
 $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
 $ Personal.Loan     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ Securities.Account: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
 $ CD.Account        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Online            : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
 $ CreditCard        : Fac

In [62]:
summary(customer_data)

  Customer.ID       DOB              Experience        Income      
 Min.   :   1   Length:5000        Min.   :-6.00   Min.   :  8.00  
 1st Qu.:1251   Class :character   1st Qu.:10.00   1st Qu.: 39.00  
 Median :2500   Mode  :character   Median :20.00   Median : 64.00  
 Mean   :2500                      Mean   :20.11   Mean   : 73.76  
 3rd Qu.:3750                      3rd Qu.:30.00   3rd Qu.: 98.00  
 Max.   :5000                      Max.   :43.00   Max.   :224.00  
                                   NA's   :8       NA's   :7       
    ZIP.Code      Family     Education       CCAvg           Mortgage     
 Min.   : 9307   1   :1471   1   :2093   Min.   : 0.000   Min.   :  0.00  
 1st Qu.:91911   2   :1296   2   :1401   1st Qu.: 0.700   1st Qu.:  0.00  
 Median :93437   3   :1009   3   :1498   Median : 1.500   Median :  0.00  
 Mean   :93152   4   :1220   NA's:   8   Mean   : 1.932   Mean   : 56.34  
 3rd Qu.:94608   NA's:   4               3rd Qu.: 2.500   3rd Qu.:101.00  
 Max. 

### Checking for the missing values 

In [63]:
sum(is.na(customer_data))

In [64]:
# to check column-wise
colSums(is.na(customer_data))
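The same checks can be tried on a small made-up data frame. The sketch also adds two related views that are often useful: the percentage of missing values per column and the number of fully observed rows.

```r
# Made-up data with three missing entries
df <- data.frame(Family = c(4, NA, 1, 2),
                 Income = c(49, 34, NA, NA))

total_na  <- sum(is.na(df))             # missing cells overall
per_col   <- colSums(is.na(df))         # missing cells per column
pct_col   <- 100 * colMeans(is.na(df))  # percentage missing per column
full_rows <- sum(complete.cases(df))    # rows without any missing value

print(per_col)
print(pct_col)
```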

### Imputing missing values

In [65]:
library(DMwR)
customer_data_imputed<-centralImputation(customer_data) # central imputation: median for numeric columns, mode for factors
sum(is.na(customer_data_imputed))
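In case DMwR cannot be installed in your environment, the idea behind central imputation can be sketched in base R. The helper names `mode_value` and `impute_central` are hypothetical, and this is an illustration of the technique rather than the package's implementation.

```r
# Hypothetical helper: most frequent non-missing value of a vector
mode_value <- function(x) {
  counts <- table(x)  # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Fill NAs with the median (numeric columns) or the mode (factor columns)
impute_central <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      df[[col]][missing] <- mode_value(df[[col]])
    }
  }
  df
}

# Made-up example data
toy <- data.frame(Income = c(10, 20, NA, 40),
                  Family = factor(c("1", "2", "2", NA)))
toy_imputed <- impute_central(toy)
print(toy_imputed)
```

Columns of other types (for example character columns such as DOB) are left untouched by this sketch.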

## Data Manipulations

In this data set we have the date of birth rather than the age.
- How can we extract the age from the date of birth?

In [66]:
library(lubridate)

# Approximate age: difference in calendar years (ignores whether this year's birthday has passed)
customer_data_imputed$Age<-year(today()) - year( mdy(customer_data_imputed$DOB) )
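Subtracting calendar years can overstate the age by one when the birthday has not yet occurred in the current year. If completed years are needed, a base R sketch could look like the following; the function name `age_in_years` is hypothetical and the reference date is fixed only to make the example reproducible.

```r
# Age in completed years, given birth dates and a reference date
age_in_years <- function(dob, ref = Sys.Date()) {
  dob <- as.POSIXlt(dob)
  ref <- as.POSIXlt(ref)
  age <- ref$year - dob$year
  # Subtract one if the birthday has not yet occurred in the reference year
  before_birthday <- (ref$mon < dob$mon) |
    (ref$mon == dob$mon & ref$mday < dob$mday)
  age - as.integer(before_birthday)
}

# Two DOB strings in the same month/day/year format as the dataset
dob <- as.Date(c("2/17/1993", "2/2/1973"), format = "%m/%d/%Y")
ages <- age_in_years(dob, ref = as.Date("2020-02-10"))
print(ages)
```

With the reference date 2020-02-10, the first customer's birthday (17 February) has not yet passed, so the result is one year lower than the calendar-year difference.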

In [67]:
names(customer_data_imputed)

Do we need DOB, ID and ZIP.Code for analysis?
- ID is only a reference element. For example, a bank does not decide whether to grant a loan based on the account number, and a university does not award grades based on the admission number.
- How can the ZIP code be used? Technically, is it numeric or categorical, and how should we deal with it?

In [68]:
customer_data_imputed <- subset(customer_data_imputed,select=-c(Customer.ID,ZIP.Code,DOB))

In [69]:
names(customer_data_imputed)

Substitute negative values in Experience with something appropriate.
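What counts as "appropriate" depends on why the negative values occur, which the data does not tell us. Two possible treatments are sketched below on made-up values; both are assumptions for illustration, not a recommendation from the original exercise.

```r
experience <- c(1, 19, -3, 9, -1, 27)  # made-up values

# Option 1: assume the sign is a data-entry error and flip it
exp_abs <- abs(experience)

# Option 2: treat negatives as missing and fill with the median of valid values
exp_med <- experience
exp_med[exp_med < 0] <- NA
exp_med[is.na(exp_med)] <- median(exp_med, na.rm = TRUE)

print(exp_abs)
print(exp_med)
```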

### Binning : To convert a numeric to categorical (not just type conversion!!)

Binning - Binning or discretization is the process of transforming numerical variables into categorical values.
This is accomplished by grouping the values into a pre-defined number of bins. 

To convert numerical to categorical we can use any of the following approaches:
* Manual
* Equal frequency
    - Number of samples in each bin
* Equal width
    - Interval is same (good for uniform distributions)
    
Let's consider Income: for example, suppose we want to categorize customers as high, middle and low income. How can we do this?

- Getting the numeric attributes

In [70]:
num_data <- customer_data_imputed[!names(customer_data_imputed) %in% cat_att]

head(num_data)

,Experience,Income,CCAvg,Mortgage,Age
,<dbl>,<int>,<dbl>,<int>,<dbl>
1,1,49,1.6,0,27
2,19,34,1.5,0,47
3,15,11,1.0,0,41
4,9,100,2.7,0,37
5,8,45,1.0,0,37
6,13,29,0.4,155,39


In [71]:
range(num_data$Income)

#### Manual binning

* `cut` divides the range of x into intervals and codes each value in x according to the interval it falls into.


In [72]:
bins <- cut(num_data$Income, breaks = seq(0, 250, 50), right = TRUE)

bins

In [73]:
class(bins)

In [74]:
# Count the number of rows that comes under a particular bin
table(bins)

bins
   (0,50]  (50,100] (100,150] (150,200] (200,250] 
     1913      1876       770       425        16 
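For the low/middle/high question raised earlier, `cut()` also accepts custom break points and labels. The cut points below (50 and 100) are hypothetical choices for illustration, applied to a few made-up incomes.

```r
income <- c(8, 49, 64, 100, 180, 224)  # made-up values in the observed range

# Hypothetical cut points: up to 50 is "low", 50-100 "middle", above 100 "high"
income_level <- cut(income,
                    breaks = c(0, 50, 100, 250),
                    labels = c("low", "middle", "high"),
                    right = TRUE)  # intervals are (a, b], closed on the right

print(table(income_level))
```

Because `right = TRUE`, an income of exactly 100 falls into the "middle" bin rather than "high".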

We will use the *discretize* function from the infotheo package.

* Equal width binning: 
    - It divides the data into k intervals of equal size.
    - The width of intervals is: w = (max-min)/k


* Equal frequency binning: 
    - It divides the data into k groups.
    - Each group contains approximately the same number of values.
    
    
<img src ="img/binning.png">

In [75]:
library(infotheo)

In [76]:
IncomeBin <- discretize(num_data$Income, disc="equalfreq", nbins=4)
table(IncomeBin)

IncomeBin
   1    2    3    4 
1311 1244 1200 1245 

In [77]:

IncomeBin <- discretize(num_data$Income, disc="equalwidth", nbins=4)
table(IncomeBin)


IncomeBin
   1    2    3    4 
2387 1692  661  260 
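Both strategies can also be sketched in base R, which makes the break points explicit. This is an illustration of the two definitions on made-up incomes; `discretize` from infotheo may place boundary values slightly differently.

```r
x <- c(8, 11, 22, 29, 34, 45, 49, 72, 81, 100, 180, 224)  # made-up incomes
k <- 4

# Equal width: k intervals, each of width (max - min) / k
width_breaks <- seq(min(x), max(x), length.out = k + 1)
width_bin <- cut(x, breaks = width_breaks, include.lowest = TRUE, labels = FALSE)

# Equal frequency: break points at the quantiles, so counts are about equal
freq_breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
freq_bin <- cut(x, breaks = freq_breaks, include.lowest = TRUE, labels = FALSE)

print(table(width_bin))
print(table(freq_bin))
```

On skewed data such as income, equal-width bins tend to be crowded at the low end and sparse at the high end, while equal-frequency bins stay balanced. If many values are tied, the quantile break points can coincide, in which case `cut()` needs distinct breaks.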

In [78]:
IncomeBin

X
<int>
1
1
1
2
1
1
2
1
2
4


### Scaling the Numerical Attribute

Normalization or standardization is a common pre-processing step before fitting many machine learning models.

It matters when the features of the input dataset have very different ranges, e.g. the age and income of a person.

It transforms the features to comparable scales.


<img src ="img/normalization.png">

### Normalization

Normalization: the values are rescaled so that they range from 0 to 1.

* Formula for Normalization = (X-Xmin)/(Xmax-Xmin)


In [38]:
library(vegan)
num_data1 <- decostand(num_data,"range") # using range method 

Loading required package: permute

This is vegan 2.5-7



In [39]:
num_data1


,Experience,Income,CCAvg,Mortgage,Age
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0.1428571,0.18981481,0.16,0.0000000,0.04545455
2,0.5102041,0.12037037,0.15,0.0000000,0.50000000
3,0.4285714,0.01388889,0.10,0.0000000,0.36363636
4,0.3061224,0.42592593,0.27,0.0000000,0.27272727
5,0.2857143,0.17129630,0.10,0.0000000,0.27272727
6,0.3877551,0.09722222,0.04,0.2440945,0.31818182
7,0.6734694,0.29629630,0.15,0.0000000,0.68181818
8,0.6122449,0.06481481,0.03,0.0000000,0.61363636
9,0.3265306,0.33796296,0.06,0.1637795,0.27272727
10,0.3061224,0.79629630,0.89,0.0000000,0.25000000
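The normalization formula can be applied directly in base R, which helps to check the output above. The helper name `min_max` is hypothetical and the data is made up; the Income column reuses the minimum (8) and maximum (224) seen in the summary.

```r
# Min-max scaling of one numeric vector to the range [0, 1]
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up data
toy <- data.frame(Income = c(8, 49, 116, 224), Age = c(23, 27, 45, 67))

# Apply the formula column by column
toy_scaled <- as.data.frame(lapply(toy, min_max))
print(toy_scaled)
```

An income of 49 becomes (49 - 8) / (224 - 8), about 0.19, which agrees with the first row of the normalized output above.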


### Standardization

Standardization: each value is expressed as the number of standard deviations it lies from the mean.

ZScore = (X-mean)/standard_deviation

In [40]:
num_data2 <- decostand(num_data,"standardize") 

In [41]:
num_data2

,Experience,Income,CCAvg,Mortgage,Age
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-1.667943480,-0.5379631,-0.19119341,-0.5544382,-1.77423939
2,-0.096553203,-0.8639882,-0.24883075,-0.5544382,-0.02952064
3,-0.445751042,-1.3638934,-0.53701742,-0.5544382,-0.55293627
4,-0.969547801,0.5705222,0.44281726,-0.5544382,-0.90188002
5,-1.056847261,-0.6249032,-0.53701742,-0.5544382,-0.90188002
6,-0.620349962,-0.9726632,-0.88284142,0.9730266,-0.72740814
7,0.601842476,-0.0380580,-0.24883075,-0.5544382,0.66836686
8,0.339944097,-1.1248083,-0.94047876,-0.5544382,0.40665905
9,-0.882248341,0.1575571,-0.76756676,0.4704414,-0.90188002
10,-0.969547801,2.3093226,4.01633199,-0.5544382,-0.98911595
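The z-score formula can be checked in base R as well. The sketch below computes it by hand on made-up values and compares the result with `scale()`, assuming the sample standard deviation (n - 1 denominator) is intended.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values

# Manual z-score using the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and scales in the same way; it returns a matrix
z_scale <- as.vector(scale(x))

print(all.equal(z_manual, z_scale))
print(round(c(mean = mean(z_manual), sd = sd(z_manual)), 10))
```

After standardization every column has mean 0 and standard deviation 1, so the values are no longer bounded between 0 and 1 as they are after normalization.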
