# Chapter-5: Model Selection and Cross-Validation

## Steps in model building
1. Defining the objectives
2. Explore, validate and prepare the data
3. Build the model 
4. Validate the model
5. Deploy the model


Importing all the required packages and libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import  linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score
import statsmodels.api as sm
from matplotlib.pyplot import plot 
pd.set_option('display.max_columns', None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
git_hub_path="https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter5_Model_Selection_Feature_engg/Datasets/"

## Model validating measures

### 1.Regression
* Mean Absolute Deviation - MAD
* Mean Absolute Percentage Error - MAPE
* Root Mean Squared Error - RMSE

We will see these measures one by one 

Firstly we will import the data and find the basic details of the data.

This dataset contains house sale prices for King County, which includes Seattle. The data contains details of all houses sold between May 2014 and May 2015. It is available under License CC0: Public Domain. We want to predict the house price using features like the number of bedrooms, the number of bathrooms, age of construction, square feet area, location of the house, etc.

In [None]:
#kc_house_data = pd.read_csv(r'/content/drive/My Drive/DataSets/Chapter-5/datasets/kc_house_data/kc_house_data.csv')
kc_house_data = pd.read_csv(git_hub_path+ "/kc_house_data/kc_house_data.csv")

In [None]:
print(kc_house_data.shape)

In [None]:
print(kc_house_data.columns)

There are 21,613 rows and 21 columns in the dataset. The column names are self-explanatory. The columns try to describe the home properties like bedrooms, bathrooms, living room area, number of floors, year of construction. The target variable that we are trying to predict here is “price.” 

In [None]:
print(kc_house_data.dtypes)

All the variables are numeric (integers or floats) except for the date variable. We will use the rest of the variables for building our model. We will keep aside the date variable and the id column.

In [None]:
kc_house_data.info()

All the columns have data populated. None of the columns has missing values. Now we will go through the summary of each column.

In [None]:
all_cols_summary=kc_house_data.describe()
print(round(all_cols_summary,2))

Overall, the data seems to be in good shape. There are a few columns with outliers. We will go ahead with model building.

* Till now, we have used the “statsmodels” package to perform regression tasks. We will try to use an alternative package in this exercise. The syntax will be different, and the results will be the same.  
* A lot of data scientists use the “sklearn” package. We will try to understand the “sklearn” package as well. The interpretation of R-squared and all the related regression measures remain the same. Only syntax changes.
* The summary() function is available in statsmodels package. That is why we started with that package. However, there is no summary function in sklearn package. Now we know what values are important in the regression output. We can fetch all those values from the model object individually in this package.

First, we will write the code to create train data and test data



In [None]:
X = kc_house_data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']]

y = kc_house_data['price']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

80% of the overall data is considered for training, and the rest of the data is considered for testing. Now we are ready to build the model. Below is the code for building the model using sklearn package. 

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

When we execute the above code, the model will be fit and stored in model_1. There is no summary() function in this package. We can fetch the coefficients and r-squared values using the below code. 

In [None]:
print(model_1.intercept_)
print(model_1.coef_)

#### R-squared
R-Squared value is a good measure. It talks about the overall accuracy of the model and the total variance explained by the model. If we want to get an intuitive idea of how close or far away the predicted values are from the actual values, then we can look at the deviation-based measures that follow.
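To make the "variance explained" reading concrete, here is a minimal sketch that computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays `y_actual` and `y_predicted` are made-up values for illustration, not numbers from the house price data.

```python
import numpy as np

# Hypothetical actual and predicted values (not from the dataset)
y_actual = np.array([200.0, 300.0, 400.0, 500.0])
y_predicted = np.array([220.0, 280.0, 410.0, 490.0])

# Residual sum of squares: variation left unexplained by the model
ss_res = np.sum((y_actual - y_predicted) ** 2)
# Total sum of squares: variation of the target around its own mean
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

For the same pair of arrays, `metrics.r2_score` should return the same number, since it is based on the same ratio.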

In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print(metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print(metrics.r2_score(y_test,y_pred_test))

The intercept is a large value, mainly due to the scale of the target, which we can confirm from the summary of price below. The rest of the coefficients are shown in scientific number format.
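Coefficients printed in scientific format are hard to read. The sketch below shows one possible way to display them in plain decimal form and pair each one with its column name. The feature names and coefficient values here are hypothetical placeholders, not the fitted values of model_1.

```python
import numpy as np

# Hypothetical feature names and coefficients (illustrative only)
feature_names = ["bedrooms", "sqft_living", "lat"]
coefs = np.array([-3.5e+04, 1.1e+02, 6.0e+05])

# suppress=True asks NumPy to avoid scientific notation where it can
with np.printoptions(suppress=True, precision=2):
    formatted = str(coefs)
print(formatted)

# Pair every coefficient with its column name for easier reading
coef_table = {name: round(float(c), 2) for name, c in zip(feature_names, coefs)}
for name, value in coef_table.items():
    print(name, value)
```

With the fitted model, the same idea would use `zip(X.columns, model_1.coef_)` in place of the placeholder lists.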

In [None]:
round(kc_house_data.price.describe())

In [None]:
print("R-Squared on Train data : ", metrics.r2_score(y_train,y_pred_train))
print("R-Squared on Test data : ", metrics.r2_score(y_test,y_pred_test))

The R-squared value of the model on train data is 0.700, i.e., 70%, and the R-squared value on the test data is 0.696, i.e., 69.6%.

#### Mean Absolute Deviation(MAD)

MAD = $\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value
* Find the deviation of the predicted value from the actual value. Both negative and positive deviations are errors. Calculate the absolute deviation.
* The average of all these absolute deviations is known as Mean Absolute Deviation – MAD. 
* Calculate MAD on train and test data. 
* A good model will have MAD near zero on train and test data.
* MAD tells us the average deviation of predictions from the actual values. 
* While working with MAD, we are not sure if a value of 1000 is high or low unless we see the scale of the target variable y. If y is in millions, then a MAD of 1000 is very low; if y is in thousands, then a MAD of 1000 is considered high.
 

In [None]:
print("MAD on Train data : ", round(np.mean(np.abs(y_train - y_pred_train)),2))
print("MAD on Test data : ", round(np.mean(np.abs(y_test - y_pred_test)),2))

#### Mean Absolute Percentage Error(MAPE)

MAPE = $\frac{100}{n}\sum_{i=1}^{n}\frac{|y_i-\hat{y}_i|}{|y_i|}$
* We tweak the MAD formula and convert each deviation into a percentage of the actual value. 
* If we are interested in knowing the deviation percentage instead of the actual deviation, then we can use MAPE.
* In MAPE, we don’t need to worry about the scale of the variable. A MAPE value of 2% is always lower than a MAPE of 10%, no matter what the scale of y is. 
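The scale argument above can be checked with a small sketch. The helper names `mad` and `mape` and the numbers are assumptions made for this illustration; the same errors are expressed once in thousands and once in millions.

```python
import numpy as np

def mad(y_true, y_pred):
    """Mean absolute deviation between actual and predicted values."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Hypothetical targets in thousands, and the same targets scaled to millions
y_small = np.array([1000.0, 2000.0, 4000.0])
pred_small = np.array([1100.0, 1800.0, 4400.0])
y_large, pred_large = y_small * 1000, pred_small * 1000

print(mad(y_small, pred_small), mape(y_small, pred_small))
print(mad(y_large, pred_large), mape(y_large, pred_large))
```

MAD grows with the scale of the target, while MAPE stays the same, which is why MAPE is easier to compare across problems.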


In [None]:
# Multiply by 100 so that the result is a percentage, as in the MAPE formula
print("MAPE on Train data (%) : ", round(100*np.mean(np.abs(y_train - y_pred_train)/y_train),2))
print("MAPE on Test data (%) : ", round(100*np.mean(np.abs(y_test - y_pred_test)/y_test),2))

#### Root Mean Squared Error(RMSE)

RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$

* Root mean squared error is another alternative. 
* All these measures are trying to explain the error using different formulas. 
* When we are comparing two models, a lower value of RMSE is preferred. 
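One practical difference between MAD and RMSE is that RMSE squares the deviations before averaging, so a few large errors weigh more heavily. The sketch below illustrates this with two made-up sets of errors that have the same MAD; the helper names and values are assumptions for illustration only.

```python
import numpy as np

def mad(errors):
    """Average of the absolute errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Square root of the average squared error."""
    return np.sqrt(np.mean(errors ** 2))

# Two hypothetical sets of prediction errors with the same total absolute error
even_errors = np.array([10.0, 10.0, 10.0, 10.0])
one_big_error = np.array([0.0, 0.0, 0.0, 40.0])

print(mad(even_errors), rmse(even_errors))
print(mad(one_big_error), rmse(one_big_error))
```

Both sets have the same MAD, but the set with one large error has a higher RMSE.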

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

From the above results, we can note a few things. The above model is just a benchmark model. We have not cleaned the data for outliers. We have not put any effort into improving the model. We have used some variables, like zipcode, as they are. If we use data cleaning and feature engineering techniques, then we can improve the overall accuracy of the model with the same data and with the same model building algorithm. 

### 2 . Classification
While dealing with regression models, we have used R-squared and other deviation-based validation measures. When it comes to classification, we have 0 and 1 in the output. The deviation value, actual minus predicted, may not work here. We will create a confusion matrix and derive accuracy from the actual and predicted classes. 

We will first import the data and find the basic details about the data. This dataset is created from the “Give me some credit” competition on the kaggle.com website. A bank wants to predict whether a customer is a good or bad customer. The bank has collected two years of historical data. We will build a model on this historical data and use it for predicting the defaults in the new data.

In [None]:
import pandas as pd
#credit_risk_data = pd.read_csv(r'/content/drive/My Drive/DataSets/Chapter-5/datasets/loans_data/credit_risk_data_v1.csv')
credit_risk_data = pd.read_csv(git_hub_path+"/loans_data/credit_risk_data_v1.csv")

In [None]:
print(credit_risk_data.shape)

In [None]:
print(credit_risk_data.columns)

In [None]:
print(credit_risk_data.dtypes)

In [None]:
credit_risk_data.info()

From the output, we can see that there are 150,008 records. Along with the customer id and the target variable, there are eight predictor columns. The columns are self-explanatory. All the columns are numerical. The table below quickly explains the columns.

Column_name | Description
--- | ---
Cust_num | Customer id or number
Bad | Bad indicator. Target variable. Defaulters are denoted with 1
Credit_Limit | The credit limit on their card. 
Late_Payments_Count | Number of times customer was late in paying the bill 
Card_Utilization_Percent | Customer credit line average utilization 
Age | Age of the customer
Debt_to_income_ratio | Debt to income ratio
Monthly_Income | Monthly income
Num_loans_personal_loans | Number of personal loans
Family_dependents | Number of dependents


In [None]:
pd.set_option('display.max_columns', None)

In [None]:
all_cols_summary=credit_risk_data.describe()
print(round(all_cols_summary,2))

This data is in good shape. There are no missing values and no noticeable outliers. We will directly go ahead with model building. Before that, we will create the train and test data. 

In [None]:
X = credit_risk_data[['Credit_Limit', 'Late_Payments_Count',
       'Card_Utilization_Percent', 'Age', 'Debt_to_income_ratio',
       'Monthly_Income', 'Num_loans_personal_loans', 'Family_dependents']]

y = credit_risk_data['Bad']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Now we will build the logistic regression model and find the accuracy of the model.

In [None]:
from sklearn.linear_model import LogisticRegression
model_2= LogisticRegression()

In [None]:
model_2.fit(X_train,y_train)

In [None]:
print(model_2.intercept_)
print(model_2.coef_)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_pred_train=model_2.predict(X_train)
cm1 = confusion_matrix(y_train,y_pred_train)
print(cm1)

In [None]:
accuracy1=(cm1[0,0]+cm1[1,1])/(cm1[0,0]+cm1[0,1]+cm1[1,0]+cm1[1,1])
print(accuracy1)

In [None]:
y_pred_test=model_2.predict(X_test)
cm2 = confusion_matrix(y_test,y_pred_test)
print(cm2)

In [None]:
accuracy2=(cm2[0,0]+cm2[1,1])/(cm2[0,0]+cm2[0,1]+cm2[1,0]+cm2[1,1])
print(accuracy2)


We can see similar results on the train and test data. Overall accuracy is 93% on train data and test data. If we look at only accuracy, then the model is good. Is accuracy a sufficient measure here? 

As discussed earlier, in the case of credit risk models, accuracy may not be the right measure. One bad customer is not the same as one good customer. In this data, more than 90% are good customers; less than 10% are bad customers. We can confirm the same by looking at the frequency of good customers(0’s) and bad customers(1’s) in the data. 


In [None]:
credit_risk_data['Bad'].value_counts()

Out of about 150,000 records, we can see that the number of 1’s is less than 15,000, which means the class-1 percentage is less than 10%. We can see the class imbalance. In this example, bad customers are denoted with class-1. We need to focus on class-1 accuracy instead of overall accuracy.
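To see why overall accuracy can mislead under class imbalance, consider a minimal sketch with made-up labels. The "model" here is a hypothetical one that ignores its inputs and always predicts class-0; the class counts are chosen only for illustration.

```python
import numpy as np

# Hypothetical labels: 93 good customers (0) and 7 bad customers (1)
y_actual = np.array([0] * 93 + [1] * 7)

# A "model" that ignores its inputs and always predicts class-0
y_always_good = np.zeros_like(y_actual)

overall_accuracy = np.mean(y_always_good == y_actual)
class_1_accuracy = np.mean(y_always_good[y_actual == 1] == 1)

print(overall_accuracy)
print(class_1_accuracy)
```

The overall accuracy looks high even though not a single bad customer is detected, which is why we look at class-wise accuracy next.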

#### Sensitivity

Sensitivity is the accuracy of the first class. In this chapter, the first class is class-0, and we treat it as the positive class. A model has high sensitivity when it has predicted many of the class-0 records accurately. Note that several sklearn functions, such as roc_curve() and f1_score(), treat class-1 as the positive class by default.

Sensitivity = $\frac{Number~of~times~0~is~predicted~as~0}{Overall~occurrences~of~0}$

Sensitivity=$\frac{cm[0,0]}{(cm[0,0]+cm[0,1])}$

Sensitivity=$\frac{True~Positives(TP)}{True~Positives(TP)+ False~Negatives(FN)}$

#### Specificity
The accuracy of the second class, i.e., class-1 accuracy, is specificity. Out of all the records in class-1, how many has our model predicted correctly? 

Specificity = $\frac{Number~of~times~1~is~predicted~as~1}{Overall~occurrences~of~1}$

Specificity = $\frac{cm[1,1]}{cm[1,0]+cm[1,1]}$

Specificity = $\frac{True~Negatives(TN)}{False~Positives(FP)+True~Negatives(TN)}$
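The formulas above can be tried on a small, made-up confusion matrix. The counts below are hypothetical, and the layout assumes rows are actual classes and columns are predicted classes, which is how `confusion_matrix` arranges its output.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predicted
cm = np.array([[90, 10],   # actual 0: 90 predicted as 0, 10 predicted as 1
               [15, 5]])   # actual 1: 15 predicted as 0, 5 predicted as 1

accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
sensitivity = cm[0, 0] / (cm[0, 0] + cm[0, 1])  # class-0 accuracy in this chapter
specificity = cm[1, 1] / (cm[1, 0] + cm[1, 1])  # class-1 accuracy in this chapter

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```

In this made-up case, the overall accuracy hides the fact that only a quarter of the class-1 records are predicted correctly.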


***Example:***

Let us look at the confusion matrix below. We are trying to predict whether a customer is good or bad before giving them a personal loan. Bad customers are known as defaulters, and good customers are non-defaulters. These models are known as credit risk models. In this illustration, class-0 denotes the bad customer, whereas in our dataset bad customers are denoted with class-1. Rows show the actual class, and columns show the predicted class.



Actual / Predicted | 0 – Bad customer | 1 – Good Customer | Class-wise Accuracy
---|---|---|---
0 – Bad customer | Model is predicting bad customer as bad | Model is predicting bad customer as good | Sensitivity
1 – Good Customer | Model is predicting good customer as bad | Model is predicting good customer as good | Specificity

In the above matrix, what is important for us? We are not worried about the diagonal elements.  The model predicting bad customers as bad and predicting good customers as good are the right predictions. There are two types of errors here. The model predicting bad customers as good customers is one type of error. The second type of error is the model predicting good customers as bad customers. Let us look at the business implications of all these cells.

Actual / Predicted | 0 – Bad customer | 1 – Good Customer | Class-wise Accuracy
---|---|---|---
0 – Bad customer | Reject the loan | Approve the loan | Sensitivity
1 – Good Customer | Reject the loan | Approve the loan | Specificity


In [None]:
Sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print(round(Sensitivity1,3))

In [None]:
Specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print(round(Specificity1,3))

In [None]:
Sensitivity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
print(round(Sensitivity2,3))

In [None]:
Specificity2=cm2[1,1]/(cm2[1,0]+cm2[1,1])
print(round(Specificity2,3))

Sensitivity on the train and test data is 99.5%, which is nearly perfect. The measure that matters is specificity. Specificity on both train and test data is very poor. Train data has a specificity of 6.8%, and test data shows 5.7% specificity. The model needs much improvement in this class. 

How do we improve the specificity? We have already built the model. There is a small trick that can help boost the specificity: we can experiment with the threshold. Logistic regression gives us a probability as the prediction. We then convert it into a class by taking 0.5 as the threshold. If the predicted probability of class-1 is less than 0.5, then the predicted class is “0”; else it is “1”. If we do not mention any threshold, the predict() function uses 0.5 by default. 

In the case of class imbalance, we can lower the threshold to 0.2 or 0.25 or 0.3 to increase the chances of finding class-1. In that process, we may misclassify some of the class-0 as class-1. Since we are interested in one class, we can afford to misclassify a few records. 
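The effect of lowering the threshold can be sketched on a handful of made-up probabilities before we try it on the real model. The helper `class_wise_accuracy` and the arrays below are assumptions for this illustration; it follows this chapter's convention of sensitivity as class-0 accuracy and specificity as class-1 accuracy.

```python
import numpy as np

def class_wise_accuracy(y_true, prob_class_1, threshold):
    """Return (sensitivity, specificity) in this chapter's convention:
    sensitivity = class-0 accuracy, specificity = class-1 accuracy."""
    y_pred = (prob_class_1 > threshold).astype(int)
    sensitivity = np.mean(y_pred[y_true == 0] == 0)
    specificity = np.mean(y_pred[y_true == 1] == 1)
    return sensitivity, specificity

# Hypothetical actual classes and predicted probabilities of class-1
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
prob_class_1 = np.array([0.05, 0.08, 0.15, 0.22, 0.30, 0.12,
                         0.15, 0.25, 0.45, 0.60])

results = {}
for threshold in [0.5, 0.2, 0.1]:
    results[threshold] = class_wise_accuracy(y_true, prob_class_1, threshold)
    print(threshold, results[threshold])
```

On these toy numbers, each step down in the threshold raises specificity and lowers sensitivity.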

In [None]:
y_pred_prob=model_2.predict_proba(X_train)
print(y_pred_prob.shape)
print(y_pred_prob)
print(y_pred_prob[0,])
print(y_pred_prob[0,0])
print(y_pred_prob[0,1])
print(y_pred_prob[0:5,1])
print(y_pred_prob[:,1])

In [None]:
y_pred_prob_1=y_pred_prob[:,1]

Now we will create the confusion matrix and recalculate sensitivity and specificity for different thresholds. 

* ***Threshold = 0.5***

In [None]:
threshold=0.5
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1
print(y_pred_class)

In [None]:
cm3 = confusion_matrix(y_train,y_pred_class)
print("confusion Matrix with Threshold ",  threshold,  "\n",cm3)
accuracy3=(cm3[0,0]+cm3[1,1])/(cm3[0,0]+cm3[0,1]+cm3[1,0]+cm3[1,1])
print("Accuracy is ", round(accuracy3,3))

In [None]:
Sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
print("Sensitivity is", round(Sensitivity3,3))

In [None]:
Specificity3=cm3[1,1]/(cm3[1,0]+cm3[1,1])
print("Specificity is ", round(Specificity3,3))

We will get the same results as default settings for the threshold 0.5. We can compare these results with the train data results in the previous section, and they are the same. In this problem statement, class-1 is important for us. We need to maximize the probability of detecting class-1. We will lower the threshold and try to lift the specificity from 6.8% to a higher number. We can compromise a little on sensitivity. We use the same code, and we just need to change the threshold. 

* ***Threshold = 0.2***

In [None]:
threshold=0.2
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1

In [None]:
cm3 = confusion_matrix(y_train,y_pred_class)
print("confusion Matrix with Threshold ",  threshold,  "\n",cm3)
accuracy3=(cm3[0,0]+cm3[1,1])/(cm3[0,0]+cm3[0,1]+cm3[1,0]+cm3[1,1])
print("Accuracy is ", round(accuracy3,3))

In [None]:
Sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
print("Sensitivity is", round(Sensitivity3,3))

In [None]:
Specificity3=cm3[1,1]/(cm3[1,0]+cm3[1,1])
print("Specificity is ", round(Specificity3,3))

By changing the threshold to 0.2, we have lifted specificity from 6.8% to 32.9%. Let us further reduce the threshold to 0.1

* ***Threshold = 0.1***

In [None]:
threshold=0.1
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1

In [None]:
cm3 = confusion_matrix(y_train,y_pred_class)
print("confusion Matrix with Threshold ",  threshold,  "\n",cm3)
accuracy3=(cm3[0,0]+cm3[1,1])/(cm3[0,0]+cm3[0,1]+cm3[1,0]+cm3[1,1])
print("Accuracy is ", round(accuracy3,3))

In [None]:
Sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
print("Sensitivity is", round(Sensitivity3,3))

In [None]:
Specificity3=cm3[1,1]/(cm3[1,0]+cm3[1,1])
print("Specificity is ", round(Specificity3,3))

By changing the threshold to 0.1, we have further lifted specificity from 6.8% to 59.6%. However, sensitivity decreased from 99.5% to 79.7%. If we further reduce the threshold, specificity will further increase, but sensitivity will decrease. We will be losing much business if we keep on classifying good customers as bad customers. There has to be a tradeoff between sensitivity and specificity.

* ***Threshold = 0.01***

In [None]:
threshold=0.01
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1

In [None]:
cm3 = confusion_matrix(y_train,y_pred_class)
print("confusion Matrix with Threshold ",  threshold,  "\n",cm3)
accuracy3=(cm3[0,0]+cm3[1,1])/(cm3[0,0]+cm3[0,1]+cm3[1,0]+cm3[1,1])
print("Accuracy is ", round(accuracy3,3))

In [None]:
Sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
print("Sensitivity is", round(Sensitivity3,3))

In [None]:
Specificity3=cm3[1,1]/(cm3[1,0]+cm3[1,1])
print("Specificity is ", round(Specificity3,3))

By changing the threshold to 0.01, we have further lifted specificity from 6.8% to 97.8%. However, sensitivity decreased from 99.5% to 13.1%. At this threshold, we would classify most good customers as bad customers and lose much business. There has to be a tradeoff between sensitivity and specificity. In the next section, we will see how to choose the optimal threshold where we are satisfied with both sensitivity and specificity. 

#### ROC and AUC
In some cases, sensitivity is important, and in some cases, specificity is important. By lowering the threshold, we can increase the specificity. We have seen that in the previous example. Similarly, by increasing the threshold, we can increase the sensitivity. The important point is that when we lower the threshold, specificity increases and, at the same time, sensitivity decreases. While we focus on increasing the accuracy of one class, the accuracy of the other class decreases. How do we manage this risk? 


Sensitivity and specificity move in opposite directions. If one increases, the other decreases. First, we decide which one of these two is our priority. We try to maximize that one and, at the same time, minimize the loss associated with the other. The ROC curve helps us choose that optimal pair of sensitivity and specificity. 


In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

In [None]:
# roc_curve() treats class-1 as the positive class by default
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, y_pred_prob_1)
plt.title('ROC Curve')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate(Sensitivity)')
plt.xlabel('False Positive Rate(1-Specificity)')
plt.show()

In [None]:
# Store the result under a new name so that the imported auc() function is not overwritten
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)

The ROC curve is created by plotting sensitivity on the Y-axis against 1-specificity on the X-axis for different threshold values. 
ROC stands for Receiver Operating Characteristic. The dotted line in the middle is the no-discrimination line. On this line, sensitivity and 1-specificity are equal. This line corresponds to a model that guesses at random, giving a 50-50 chance to both the classes. 

**AUC** (Area Under the ROC Curve) is a better validation measure in case of class imbalance. We know that accuracy changes when the threshold value changes. AUC is calculated from the ROC curve, and it considers all the threshold values. If the AUC value is near 1, then the model is considered to be good. A value near 0.5 means the model is no better than the no-discrimination line. 
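One way to build intuition for AUC is to read it as a ranking measure: the fraction of (class-1, class-0) pairs in which the class-1 record gets the higher predicted probability. The sketch below computes this by brute force on made-up values. The function name `auc_by_ranking` is a hypothetical helper, not an sklearn function.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Fraction of (class-1, class-0) pairs in which the class-1 record
    receives the higher score; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all pairwise score differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical labels and predicted probabilities of class-1
y_true = np.array([0, 0, 0, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.8])

pair_auc = auc_by_ranking(y_true, scores)
print(pair_auc)
```

Because only the ordering of the scores matters here, the value does not depend on any single threshold.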

The important function in the above code is **roc_curve()**. It takes the actual values and the predicted probabilities of y as input. This function considers the candidate threshold values and returns three arrays: the false-positive rates, the true-positive rates, and the corresponding thresholds.
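The three arrays returned by roc_curve() can also be used to shortlist a threshold. The sketch below uses Youden's J statistic (true-positive rate minus false-positive rate), which is one common heuristic and not something prescribed in this chapter. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical actual classes and predicted probabilities of class-1
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
prob_class_1 = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, prob_class_1)

# Youden's J = TPR - FPR; a heuristic choice, assumed here for illustration
j_scores = tpr - fpr
best = np.argmax(j_scores)
print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
```

In a real project, the final choice would depend on the business cost of each type of error rather than on a single statistic.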

#### F1 Score
The F1 score can be considered an extension of sensitivity and specificity. With sensitivity, we focus on a single class. The F1 score is also calculated for an individual class. 

Sensitivity is the True Positive Rate. It is also known as recall. Out of all the records in the positive class, how many are predicted correctly? If we are focusing on a single class, sensitivity tells us how likely the model is to detect a record of that class. There is one more angle to this single class. Out of all the records predicted as positive, how many are actually positive? This is the accuracy within the first column of the confusion matrix. This measure is known as precision.

F1 score is the harmonic mean of recall and precision. The harmonic mean is a different type of average. It is preferred when we are dealing with fractions or rates. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. 

F1 Score=harmonic mean(recall,precision)

F1 Score=$\frac{2}{\frac{1}{recall}+\frac{1}{precision}}$

F1 Score=2*($\frac{precision*recall}{precision+recall}$)
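A short worked example may help here. The counts below are hypothetical and are chosen so that recall and precision differ noticeably; the sketch compares their arithmetic mean with the F1 score.

```python
# Hypothetical counts for the class of interest (illustrative only)
true_positives = 30    # records of the class predicted correctly
false_negatives = 70   # records of the class that the model missed
false_positives = 20   # records wrongly predicted as this class

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

arithmetic_mean = (recall + precision) / 2
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3))
print(round(arithmetic_mean, 3), round(f1, 3))
```

The F1 score comes out lower than the arithmetic mean because the harmonic mean is pulled towards the smaller of the two values.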

In [None]:
from sklearn.metrics import f1_score

In [None]:
threshold=0.5
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1
print(f1_score(y_train, y_pred_class))

In [None]:
threshold=0.2
y_pred_class=y_pred_prob_1*0
y_pred_class[y_pred_prob_1>threshold]=1
print(f1_score(y_train, y_pred_class))

### Bias-Variance tradeoff
* While building the models, we focus on bringing the best out of the data. We want to have the best model with high accuracy. While we concentrate a lot on increasing accuracy, we may run into two types of problems: the problem of overfitting and the problem of underfitting. 
* The model should be neither overfitted nor under fitted; in other words, the model should have neither high variance nor high bias. 
* The overall error in a model can be divided into three parts: the irreducible error, bias, and variance. A model will rarely be 100% accurate. There will always be some error inherent in the data that cannot be reduced. That component is known as the irreducible error. The bias component happens due to underfitting, and the variance component happens due to overfitting. 
* Bias and variance move in opposite directions. If we increase the complexity of the model, then variance increases and bias reduces. If we decrease the complexity, then variance decreases and bias increases. To keep both bias and variance low, we need to build models with optimal complexity. 
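The link between complexity and error can be sketched with a toy regression before we move to the real data. The data below is synthetic (a smooth curve plus noise), and polynomial degree is used as a stand-in for model complexity; both are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth curve plus noise (not one of the chapter's datasets)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-0.95, 0.95, 20)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, x_test.size)

def mse(y, y_hat):
    """Mean squared error between actual and fitted values."""
    return np.mean((y - y_hat) ** 2)

errors = {}
for degree in [1, 3, 9]:   # higher degree = more complex model
    coefs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(y_train, np.polyval(coefs, x_train)),
                      mse(y_test, np.polyval(coefs, x_test)))
    print(degree, errors[degree])
```

The train error keeps falling as the degree grows, while the test error typically stops improving at some point; the gap between the two is the sign of overfitting.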


This data set was originally shared by the National Institute of Diabetes and Digestive and Kidney Diseases. The goal is to predict whether a person has diabetes or not based on several diagnostic measurements. The diagnostic measurements are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. The target variable name is “Outcome,” and it takes two values, 0 and 1. Class-1 indicates diabetes. The table below gives us the details of all the columns.

Column Name | Details
--- | ---
Pregnancies  | Number of times pregnant
Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure | Diastolic blood pressure (mm Hg)
SkinThickness | Triceps skinfold thickness (mm)
Insulin | 2-Hour serum insulin (mu U/ml)
BMI | Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction | Diabetes pedigree function
Age | Age (years)
Outcome | It takes two values 0 and 1. Class-1 indicates diabetes

We will import this data into python and get some basic statistics before proceeding with model building.

In [None]:
#diabetes_data= pd.read_csv(r'/content/drive/My Drive/DataSets/Chapter-5/datasets/pima/diabetes.csv')
diabetes_data= pd.read_csv(git_hub_path+"/pima/diabetes.csv")

In [None]:
print(diabetes_data.shape)

In [None]:
print(diabetes_data.columns)

In [None]:
print(diabetes_data.dtypes)

In [None]:
diabetes_data.info()

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
all_cols_summary=diabetes_data.describe()
print(round(all_cols_summary,2))

From the summary, we can see that the data is clean enough to go ahead with model building. The provider already cleaned this data. We will define the train and test data.

In [None]:
X = diabetes_data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = diabetes_data['Outcome']  # 1-D Series; avoids a column-vector warning in fit()

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=33)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier()
diabetes_tree1.fit(X_train, y_train)

The score() function directly calculates the accuracy value. It internally predicts the classes and compares them with the actual classes. 

In [None]:
print("Max Depth = None")
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Test data Accuracy", diabetes_tree1.score(X_test, y_test))

We can see from the output that the model is overfitted. The above model is an example of a high variance model. We will try to decrease the model complexity and simplify it. In this case, we need to reduce the size of the tree. If that model is under-fitted, then we may have to increase the complexity slightly. Finally, we need to choose our model with optimal complexity. To check whether the model that we built is optimal or not, we need to perform cross-validation. 

### Cross-Validation
Building the model on train data and validating it on test data is called cross-validation. 


#### Train-Test cross-validation
Build the model on train data, find the accuracy, and validate it on test data. If the model shows high accuracy on train data and low accuracy on test data, then the model is overfitted. If the model shows low accuracy on train data, then the model is under fitted.  

We will build different models by changing the value of pruning parameter "max_depth".
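The same comparison can be written as a single loop over several values of max_depth. The sketch below uses synthetic data from `make_classification` as a stand-in, so the numbers it prints are not the diabetes results; the variable names with the `_demo` and `d_` parts are chosen here to avoid overwriting the notebook's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for depth in [None, 6, 3, 2, 1]:   # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    results[depth] = (tree.score(Xd_train, yd_train),
                      tree.score(Xd_test, yd_test))
    print(depth, results[depth])
```

A large gap between the two accuracies for a given depth points to overfitting, while low train accuracy points to underfitting.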

***max_depth = 6***

In [None]:
print("Max Depth = 6")

In [None]:
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=6)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Test data Accuracy", diabetes_tree1.score(X_test, y_test))

***max_depth = 3***

In [None]:
print("Max Depth = 3")
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=3)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Test data Accuracy", diabetes_tree1.score(X_test, y_test))

***max_depth = 2***

In [None]:
print("Max Depth = 2")
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=2)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Test data Accuracy", diabetes_tree1.score(X_test, y_test))

From the output, we can see that the max_depth=6 tree is overfitted, and a very shallow tree, such as max_depth=1, tends to be slightly under fitted. As discussed earlier, detecting underfitting is a little tricky. We need to look only at the train data accuracy to detect under-fitting. Out of these results, we can take either max_depth=3 or max_depth=2 as the final decision tree. 

#### k-fold cross-validation
**step 1**: Take the whole dataset. Divide it into K subsets (K-folds). Usually, K is taken as a number between 5 and 10.

**step 2**: Build K models. While building the first model, take the first K-1 folds as the train data and take the last part as test data. Build and finetune the model that best suits the pair of train and test data. Repeat the same by changing the test data. Every fold will be used as a test dataset for one model. 

**step 3**: Find the accuracy of all models. Note that all the above K models were built and finetuned for their own combination of train and test data. It was NOT a single model applied on several datasets. We will take the average accuracy of all those K models; that will be the final result.
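The three steps above can be sketched by hand on record positions alone, without building any model. The sizes below (10 records, 5 folds) are hypothetical and chosen so that the output is easy to read.

```python
import numpy as np

n_records, k = 10, 5                   # hypothetical sizes
indices = np.arange(n_records)
folds = np.array_split(indices, k)     # K roughly equal parts

splits = []
for i in range(k):
    test_idx = folds[i]                                    # one fold for testing
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining K-1 folds
    splits.append((train_idx, test_idx))
    print("model", i + 1, "train:", train_idx, "test:", test_idx)
```

Every record lands in the test data of exactly one model and in the train data of the other K-1 models.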

K-fold cross-validation repeats the train-test scenario K times. Since the K-fold method takes the average as the final result, it gives us a more dependable estimate of the optimal accuracy than a single train-test split. We should see K-fold cross-validation as a model validation method, not a model building method. 
Below is the code for K-fold cross-validation.

In [None]:
diabetes_tree_KF = DecisionTreeClassifier(max_depth=3)

In [None]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10)

We need to use the function KFold() and set n_splits, which is the number of folds. In this example, we are trying tenfold cross-validation. Here we are using a static model with max_depth=3. In reality, we have to finetune each model and arrive at this value. Since we already tried different values of max_depth, we will go ahead with max_depth=3.

In [None]:
from sklearn import model_selection
acc10 = model_selection.cross_val_score(diabetes_tree_KF,X, y,cv=kfold)
print(acc10)
print(acc10.mean())

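From the above output, we can infer that the optimal accuracy of the model is around 74.4%. We can treat this value as a benchmark: a model that shows much higher accuracy than this on train data is likely overfitted, and a model that falls well below this value is likely under fitted. 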

#### Train-Validation-Holdout
K-fold cross-validation is computationally expensive. It takes a lot of iterations and much time to arrive at the optimal accuracy value. There is one more reliable method that we can follow to arrive at the optimal accuracy. Divide the data into three parts. Call them train data, validation data, and holdout data.
* Train data - The dataset used for building the model. We learn the patterns from this data.
* Validation data – While building the model, use this as the data for validating the model hyperparameters. We use this dataset to finetune parameters and finalize our model.
* Holdout data – This is like our final test data. We have not used it in finetuning the parameters. We will use this data to test the finalized model. This data should NOT be used for finetuning the parameters.

**The issue with train-test cross-validation was overfitting on both train data and test data. The issue with K-fold cross-validation was the computation cost. The Train-Validation-Holdout approach sits between the two methods.**


In [None]:
from sklearn  import model_selection

**Split overall data into train and test split**

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.3, random_state=99)

**Split test data into holdout and validation data split**

In [None]:
X_val, X_hold, y_val, y_hold = model_selection.train_test_split(X_test, y_test ,test_size=0.5 , random_state=11)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)
print(X_hold.shape)
print(y_hold.shape)

The above code first splits the overall data into two parts: 70% for the train data and the remaining 30% for the test data. In the next line, the test data is further split into two halves: 50% goes to the validation data and the remaining 50% to the holdout data. This split puts 15% of the overall data in the validation data and 15% in the holdout data.

We will try different values of pruning parameter max_depth

**max_depth = 6**

In [None]:
print("Max Depth 6")
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=6)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Validation data Accuracy", diabetes_tree1.score(X_val, y_val))

**max_depth = 1**

In [None]:
print("Max Depth 1")
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=1)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Validation data Accuracy", diabetes_tree1.score(X_val, y_val))

**max_depth = 3**

In [None]:
print("Max Depth 3")
from sklearn.tree import DecisionTreeClassifier
diabetes_tree1= DecisionTreeClassifier(max_depth=3)
diabetes_tree1.fit(X_train, y_train)

In [None]:
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Validation data Accuracy", diabetes_tree1.score(X_val, y_val))

We can finalize max_depth as three. Finally, we can test this model on holdout data. The below code gives us the result on holdout data.

In [None]:
print("Max Depth 3")
print("Train data Accuracy", diabetes_tree1.score(X_train, y_train))
print("Validation data Accuracy", diabetes_tree1.score(X_val, y_val))
print("Holdout data Accuracy", diabetes_tree1.score(X_hold, y_hold))

We can see that the model validates well on the holdout data. We may not always see higher accuracy on holdout data.
The above-discussed methods are the most widely used methods of cross-validation.
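
The three manual experiments with max_depth can also be written as one loop. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from make_classification() instead of the diabetes data, and the names X_tr, X_va, X_ho, val_scores and best_depth are made up for this example. The holdout data is touched only once, after the depth has been chosen on the validation data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the real X and y)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# 70% train, then the remaining 30% split equally into validation and holdout
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.3, random_state=99)
X_va, X_ho, y_va, y_ho = train_test_split(X_rest, y_rest, test_size=0.5, random_state=11)

val_scores = {}
for depth in [1, 3, 6]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    val_scores[depth] = model.score(X_va, y_va)
    print("Max Depth", depth, "Train", model.score(X_tr, y_tr), "Validation", val_scores[depth])

# Pick the depth with the best validation accuracy, then use the holdout data once
best_depth = max(val_scores, key=val_scores.get)
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
holdout_accuracy = final_model.score(X_ho, y_ho)
print("Best depth", best_depth, "Holdout accuracy", holdout_accuracy)
```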

Until now, to choose the optimal value of the pruning parameter, we built different models one by one and compared them with each other. Instead of doing this, we can use GridSearchCV. We just need to mention the model we want to build and the values of the different pruning parameters to try.

In [None]:
from sklearn.model_selection import GridSearchCV
grid_param={'max_depth': range(1,10,1), 'max_leaf_nodes': range(2,30,1)}
clf_tree=DecisionTreeClassifier()
clf=GridSearchCV(clf_tree,grid_param)
clf.fit(X_train,y_train)

In [None]:
print(clf.best_score_)

In [None]:
print(clf.best_params_)

In [None]:
print(clf.best_estimator_)

In [None]:
grid_result_tree= clf.best_estimator_
print("Train data Accuracy", grid_result_tree.score(X_train, y_train))
print("Validation data Accuracy", grid_result_tree.score(X_val, y_val))
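
One detail worth keeping in mind: GridSearchCV scores every parameter combination with its own internal cross-validation on the data passed to fit() (five folds by default in recent scikit-learn versions), so best_score_ is a cross-validated accuracy and not a train accuracy. The sketch below makes the number of folds explicit and looks at cv_results_. It uses synthetic data and a smaller grid than the text, both chosen only to keep the illustration quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# A smaller grid than in the text, only to keep the illustration quick
grid_param = {'max_depth': range(1, 6), 'max_leaf_nodes': [4, 8, 16]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid_param, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)                 # mean accuracy over the 5 internal folds
print(len(search.cv_results_['params']))  # number of parameter combinations tried
```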

### Feature engineering tips and tricks
* Manually adding new features that are derived from the existing features is called feature engineering.
* Feature engineering requires both statistical knowledge and business knowledge.
* Sometimes there is some information hidden in dates. Sometimes hidden information is in latitude and longitude; sometimes, there is some information for a particular region.
* Feature engineering methods are not common or standard across all datasets and industries. We should carefully study the data and business to create these new features.
* We will gain the intuition and knowledge about feature engineering with practice and experience. In this section, we will discuss some tips and tricks for feature engineering.

We will revisit the case study: House Sales in King County, USA. We have to predict the price of the house based on certain features. A regression model was built and its R-squared value was 70%. We used the data as is. Now the question is, can we use the same data and increase the R-squared value using feature engineering techniques?

Defining X data

In [None]:
X = kc_house_data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']]
y = kc_house_data['price']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

In [None]:
print(model_1.intercept_)
print(model_1.coef_)

In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print(metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print(metrics.r2_score(y_test,y_pred_test))

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

From the above output, we can see that the R-squared value on train and test data is around 70%. RMSE value is around 200,000. 
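
The RMSE expression is repeated in several cells of this section, so it can be wrapped in a small helper. The function name rmse and the toy numbers below are made up for illustration; the formula is the same one used in the cell above (square the errors, average them, take the square root).

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: square the errors, average them, take the square root."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy numbers (made up for illustration)
actual = [300000, 450000, 500000]
predicted = [310000, 430000, 520000]
print(round(rmse(actual, predicted), 2))
```

With such a helper, the train and test values could be printed as rmse(y_train, y_pred_train) and rmse(y_test, y_pred_test).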

#### The Dummy Variable Creation or One-hot encoding 
Dummy variable creation is one of the basic and easiest ways to extract hidden information.  Dummy variable creation is used for non-numerical or categorical variables.

The below code tries to draw the boxplots for all the categorical variables vs. the target variable.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["bedrooms"],y=kc_house_data["price"])
plt.title('bedrooms vs House Price', fontsize=20)

Inference from the above graph – As the number of bedrooms increases the price of the house increases


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["bathrooms"],y=kc_house_data["price"])
plt.title('bathrooms vs House Price', fontsize=20)

Inference from the above graph – As the number of bathrooms increases the price of the house increases


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["floors"],y=kc_house_data["price"])
plt.title('floors vs House Price', fontsize=20)

Inference from the above graph – The number of floors does not have a direct relation with the price. One hot encoding may help. 

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["waterfront"],y=kc_house_data["price"])
plt.title('waterfront vs House Price', fontsize=20)

Inference from the above graph – Waterfront takes only two values (0 and 1). Houses on the waterfront tend to be priced higher than the rest.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["view"],y=kc_house_data["price"])
plt.title('view vs House Price', fontsize=20)

Inference from the above graph – House price increases as view increases. 

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["condition"],y=kc_house_data["price"])
plt.title('condition vs House Price', fontsize=20)

Inference from the above graph – Condition does not show a direct strong relation with house price. 

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["grade"],y=kc_house_data["price"])
plt.title('grade vs House Price', fontsize=20)

Inference from the above graph - House price increases as grade increases


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_data["zipcode"],y=kc_house_data["price"])
plt.title('zipcode vs House Price', fontsize=20)

Inference from the above graph – House price has no apparent relation with zip code. 


We have already used all these variables directly in the model. It is always a good idea to create dummy variables and check the impact of these variables on the model. 


In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
print(kc_house_data.shape)
categorical_vars=['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'zipcode']

In [None]:
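encoding=OneHotEncoder()
encoding.fit(kc_house_data[categorical_vars])
onehotlabels = encoding.transform(kc_house_data[categorical_vars]).toarray()
# Give the new columns string names (e.g. 'bedrooms_3'); recent scikit-learn versions raise an error when string and integer column names are mixed
onehotlabels_data=pd.DataFrame(onehotlabels, columns=encoding.get_feature_names_out(categorical_vars))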

The OneHotEncoder() function converts the categorical columns to one-hot encoded columns. We pass the categorical columns to fit(), and transform() then converts them to one-hot encoded columns. The number of new columns depends on the number of unique values in each original column. We then drop the actual columns and update the dataset with the one-hot encoded columns.
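
To see what one-hot encoding produces on a small scale, the sketch below uses pandas get_dummies() on a made-up four-row table. This is an alternative shown for illustration, not the method used in this chapter; the table toy and its values are invented.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({'floors': [1.0, 2.0, 1.5, 1.0],
                    'price': [200000, 350000, 300000, 210000]})

# columns= forces one-hot encoding even though 'floors' is numeric
toy_onehot = pd.get_dummies(toy, columns=['floors'])
print(toy_onehot.columns.tolist())
print(toy_onehot.shape)
```

The three unique values of floors become three indicator columns, and exactly one of them is switched on in every row.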

In [None]:
print(kc_house_data.shape)

In [None]:
kc_house_data1 = kc_house_data.drop(categorical_vars,axis = 1)
print(kc_house_data1.shape)

In [None]:
kc_house_data_onehot=kc_house_data1.join(onehotlabels_data)
print(kc_house_data_onehot.shape)

Now we will use this updated dataset to build the regression line. Below code is used for creating train and test data.

In [None]:
col_names = kc_house_data_onehot.columns.values
print(col_names)

In [None]:
x_col_names=col_names[3:]
print(x_col_names)

In [None]:
X = kc_house_data_onehot[x_col_names]
y = kc_house_data_onehot['price']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Now we are ready to build the model. Remember the previous model R-square value was 70% and the RMSE value was around 200,000

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

In [None]:
print(model_1.intercept_)
print(model_1.coef_)

In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print("Train data R-Squared : ", metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print("Test data R-Squared : " , metrics.r2_score(y_test,y_pred_test))

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

R-squared jumped from 70% to 84%. RMSE dropped from 200,000 to 145,000. This improvement is huge. It is so huge that it raises doubts about our methodology. The only variable that is fishy here is zipcode. It has too many distinct values. However, it looks perfectly fine after a second verification. Even if we look at additional checks like adjusted R-square, this model passes all those tests.  By using the same data and same model building technique, we have achieved far better accuracy.

#### Handling Longitude and Latitude
House price varies based on the location of the house. The location of the house is captured in longitude and latitude.  The model may not be able to learn from the numerical values of longitude and latitude directly. We have to derive new features from longitude and latitude. 

In this example, we will try to see the relation between price and the longitude and latitude values.


In [None]:
bubble_col= kc_house_data["price"] > kc_house_data["price"].quantile(0.7)

The below code tries to draw a scatter plot between longitude and latitude. The bubble is filled with green if the house price is in the top 30 percent (above the 70th percentile).

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,12))
plt.scatter(kc_house_data["long"],kc_house_data["lat"], c=bubble_col,cmap="RdYlGn",s=10)
plt.title('House Price vs Longitude and Latitude', fontsize=20)
plt.xlabel('Longitude', fontsize=15)
plt.ylabel('Latitude', fontsize=15)
plt.show()

We can see that the high-priced houses are clustered around the top left side. Even so, longitude and latitude taken as raw numbers do not look like they have a significant impact on price. We can still create a feature to extract the maximum information out of these two features. We will create a center value for high-priced houses and calculate the distance of each house from that center.

In [None]:
high_long_mean=kc_house_data["long"][bubble_col].mean()
high_lat_mean=kc_house_data["lat"][bubble_col].mean()

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,12))
plt.scatter(kc_house_data["long"],kc_house_data["lat"], c=bubble_col,cmap="RdYlGn",s=10)
plt.scatter(high_long_mean,high_lat_mean, c="blue", s=1000)
plt.title('House Price vs Longitude and Latitude', fontsize=20)
plt.xlabel('Longitude', fontsize=15)
plt.ylabel('Latitude', fontsize=15)
plt.show()

The point (high_long_mean, high_lat_mean) is the center of high-priced houses. We can see the center included in the map. We will now create a new column: the distance from this center of high-priced houses. We will later use it in model building. We will see whether that variable can lift the accuracy of the standard model.

In [None]:
kc_house_data["High_cen_distance"]=np.sqrt((kc_house_data["long"] - high_long_mean) ** 2 + (kc_house_data["lat"] - high_lat_mean) ** 2)

In [None]:
plt.figure(figsize=(15,15))
plt.scatter(kc_house_data["High_cen_distance"],np.log(kc_house_data["price"]))
plt.title('House Price vs Distance from center', fontsize=20)
plt.xlabel('Distance from center', fontsize=15)
plt.ylabel('log(house price)', fontsize=15)

From the output, we can see that as the distance from the center increases overall, the price goes down. It is not a strong pattern, but it is a hidden pattern nonetheless. We will now use this variable in our initial standard model. 
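
The distance computed above is a straight-line distance measured in degrees. Away from the equator, a degree of longitude covers less ground than a degree of latitude, so this is only an approximation of the real distance. If a distance in kilometres is preferred, the haversine formula is one common alternative. The sketch below is a possible variation, not what the chapter's model uses; the function name haversine_km, the column name High_cen_km and the Earth radius constant are assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Works on scalars as well as on whole columns, for example (hypothetical usage):
# kc_house_data["High_cen_km"] = haversine_km(kc_house_data["lat"], kc_house_data["long"], high_lat_mean, high_long_mean)
print(haversine_km(47.5, -122.3, 47.5, -122.3))  # same point twice
print(haversine_km(0.0, 0.0, 0.0, 1.0))          # one degree of longitude on the equator
```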

In [None]:
col_names = kc_house_data.columns.values
print(col_names)

In [None]:
x_col_names=col_names[3:]
print(x_col_names)

In [None]:
X = kc_house_data[x_col_names]
y = kc_house_data['price']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

In [None]:
print(model_1.intercept_)
print(model_1.coef_)

In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print("Train data R-Squared : ", metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print("Test data R-Squared : " , metrics.r2_score(y_test,y_pred_test))

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

The R-squared value increased by 1%, and RMSE reduced by 5000. This improvement is not huge in this case. Sometimes these features add great value. Apart from this, if we have any additional relevant knowledge about King County, then we can add it as another feature.

#### Handling Date variables
The date variables should always be considered for feature engineering. Date and DateTime variables are stored in fixed formats such as DD-MM-YY-HH-MM-SS. It is very difficult for the model to learn a pattern from such a single column.

In [None]:
print(kc_house_data.columns)
date_vars = ['date', 'yr_built', 'yr_renovated']
kc_house_dates=kc_house_data[date_vars].copy() # work on a copy, so that new columns can be added without a SettingWithCopyWarning
kc_house_dates.head()

We have a ‘date’ variable; this is the sale date. We will derive the year, month and day of sale from it. We will also derive the age of construction from yr_built. We will derive a new indicator, Ind_renovated, that flags all the houses that were renovated. Less than 10% of homes were renovated. Hence we are not considering time since renovation.

In [None]:
# The sale date is a string that starts with YYYYMMDD; slice out each part and convert it to an integer
kc_house_dates['sale_year'] = kc_house_dates["date"].str[0:4].astype("int64")
kc_house_dates['sale_month'] = kc_house_dates["date"].str[4:6].astype("int64")
kc_house_dates['day_sold'] = kc_house_dates["date"].str[6:8].astype("int64")
kc_house_dates['age_of_house'] = kc_house_dates['sale_year'] - kc_house_dates['yr_built']
kc_house_dates['Ind_renovated'] = kc_house_dates['yr_renovated']>0

We will draw the relevant graphs to see the relation of all these new columns with the price variable. 


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_dates['sale_year'],y=kc_house_data["price"])
plt.title('sale_year vs House Price', fontsize=20)

Inference from the above graph- Sale year has no direct relation with house prices. 


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_dates['sale_month'],y=kc_house_data["price"])
plt.title('sale_month vs House Price', fontsize=20)

Inference from the above graph- Sale month has no direct relation with house prices.


In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_dates['day_sold'],y=kc_house_data["price"])
plt.title('day_sold vs House Price', fontsize=20)

Inference from the above graph- Sale day has no direct relation with house prices.


In [None]:
plt.figure(figsize=(10,10))
plt.scatter(kc_house_dates["age_of_house"],kc_house_data["price"])
plt.title('age_of_house vs House Price', fontsize=20)

Inference from the above graph- Age of the construction has no direct relation with house prices. This result is surprising.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot( x=kc_house_dates['Ind_renovated'],y=kc_house_data["price"])
plt.title('Ind_renovated vs House Price', fontsize=20)

Inference from the above graph- Renovation has an impact on house prices. This result is in line with our intuition.

In [None]:
kc_house_dates1=kc_house_dates.drop(date_vars, axis=1) #keep only newly derived variables
kc_house_with_dates=kc_house_data.join(kc_house_dates1)
print(kc_house_with_dates.shape)

In [None]:
col_names = kc_house_with_dates.columns.values
print(col_names)

In [None]:
x_col_names=col_names[3:]
print(x_col_names)

In [None]:
X = kc_house_with_dates[x_col_names]
y = kc_house_with_dates['price']


In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

In [None]:
print(model_1.intercept_)
print(model_1.coef_)


In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print("Train data R-Squared : ", metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print("Test data R-Squared : " , metrics.r2_score(y_test,y_pred_test))

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

The R-squared value increased by 1%, and RMSE reduced by 5000. This improvement is not huge in this case. We can try some more derived features such as the quarter or the day of the week.
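
Derived features such as quarter and day of the week are easier to get once the string is parsed into a real datetime. The sketch below is a minimal illustration on two made-up dates. It assumes only what the slicing code in this section already relies on, namely that the string starts with YYYYMMDD; the suffix after the eighth character is an assumption and is ignored. The column names sale_quarter and sale_dayofweek are hypothetical.

```python
import pandas as pd

# Toy sale dates (made up) in a YYYYMMDD-prefixed string layout
toy_dates = pd.DataFrame({'date': ['20141013T000000', '20150225T000000']})

# Parse only the first eight characters (YYYYMMDD) into a proper datetime column
parsed = pd.to_datetime(toy_dates['date'].str[:8], format='%Y%m%d')

toy_dates['sale_year'] = parsed.dt.year
toy_dates['sale_month'] = parsed.dt.month
toy_dates['sale_quarter'] = parsed.dt.quarter
toy_dates['sale_dayofweek'] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
print(toy_dates)
```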

#### Transformations
We discussed categorical variables, and we can apply one-hot encoding on categorical variables. We also discussed the date variables. What about the continuous variables? We can apply transformations to these variables. If some variables take exponentially large values, then we can apply log transformation for better predictions. We can handle outliers by replacing them with the mean or median. Alternatively, we can try transformations. Sometimes we can derive polynomial terms from the existing data. If the data is right-skewed, a log transformation can make it more symmetric. We can even apply a transformation to the target column. Log is defined only for strictly positive values and square root only for non-negative values, so we have to check the column for zeros and negative values before we apply these transformations.
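
The effect of a log transformation on skewness can be checked numerically before looking at any chart. The sketch below uses synthetic log-normal values (an assumption made purely for illustration, not the house price data) and compares the skewness before and after the transformation. It also shows log1p(), a common choice when a column contains zeros.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values (assumption: log-normal, purely for illustration)
raw = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=5000))

print("Skewness before log :", round(raw.skew(), 2))
print("Skewness after log  :", round(np.log(raw).skew(), 2))

# log is undefined for zero or negative values; log1p(x) = log(1 + x) is a common choice when zeros occur
with_zeros = pd.Series([0.0, 10.0, 100.0])
print(np.log1p(with_zeros).round(3).tolist())
```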


In [None]:
grid_plot1= sns.PairGrid(kc_house_data, y_vars=["price"], x_vars=["sqft_living", "sqft_lot"], height=5)
grid_plot1.map(sns.regplot)

In [None]:
grid_plot2 = sns.PairGrid(kc_house_data, y_vars=["price"], x_vars=["sqft_above", "sqft_basement"], height=5)
grid_plot2.map(sns.regplot)

In [None]:
grid_plot3 = sns.PairGrid(kc_house_data, y_vars=["price"], x_vars=["sqft_living15","sqft_lot15"], height=5)
grid_plot3.map(sns.regplot)

From the above graphs, we can see that some variables like sqft_living, sqft_above and sqft_living15 have a direct relation with the price variable. The price itself has some extreme values. Let us draw the distribution of the price variable. 


In [None]:
plt.figure(figsize=(10,10))
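# distplot() is deprecated in recent seaborn versions; histplot() with kde=True draws a similar chart
sns.histplot(kc_house_data["price"], kde=True)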
plt.title('House Price distribution', fontsize=20)

We can see the distribution is skewed. We can perform outlier treatment or apply log transformation on this data. The below code creates the log_price variable and draws the distribution chart for the transformed variable.


In [None]:
kc_house_data["log_price"]=np.log(kc_house_data["price"])
plt.figure(figsize=(10,10))
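# distplot() is deprecated in recent seaborn versions; histplot() with kde=True draws a similar chart
sns.histplot(kc_house_data["log_price"], kde=True)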
plt.title('log(House Price) distribution', fontsize=20)

The graph shows negligible skewness. We will try to build the model by taking this log transformation on the target variable. As usual, we will compare these results with our initial model. The below code is used for building the model after log transformation. 


In [None]:
X = kc_house_data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']]

y = kc_house_data['log_price']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
import sklearn 
model_1 = sklearn.linear_model.LinearRegression()
model_1.fit(X_train, y_train)

In [None]:
print(model_1.intercept_)
print(model_1.coef_)

In [None]:
from sklearn import metrics
y_pred_train=model_1.predict(X_train)
print("Train data R-Squared : ", metrics.r2_score(y_train,y_pred_train))

In [None]:
y_pred_test=model_1.predict(X_test)
print("Test data R-Squared : " , metrics.r2_score(y_test,y_pred_test))

In [None]:
print("RMSE on Train data : ", round(math.sqrt(np.mean(np.abs(y_train - y_pred_train)**2)),2))
print("RMSE on Test data : ", round(math.sqrt(np.mean(np.abs(y_test - y_pred_test)**2)),2))

We can see that the R-squared value increased significantly. It has gone up to 77%. Since the target is now log(price), the RMSE is in log units and cannot be compared directly with the earlier RMSE values, which were in price units. This log transformation is just one example. We can apply a transformation on predictor variables also. Square root, square, cube root, cube, inverse and binning are other examples of transformation.
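
If a comparison with the earlier models is still wanted, one option is to convert the predictions back to price units with exp() before computing RMSE. The sketch below shows the idea on made-up numbers; the helper rmse and the variable names are assumptions for this example, not part of the chapter's code.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error between two sequences of numbers."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy numbers (made up): actual prices and model predictions made on the log scale
actual_price = np.array([110.0, 190.0])
predicted_log_price = np.log(np.array([100.0, 200.0]))

# RMSE in log units is not comparable with an RMSE in price units
print(rmse(np.log(actual_price), predicted_log_price))

# Undo the log with exp() first, then compute RMSE in price units
predicted_price = np.exp(predicted_log_price)
print(rmse(actual_price, predicted_price))
```

In the notebook, the same idea would apply to np.exp(y_pred_test) and the original price column.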

There is no guarantee that every new feature increases the accuracy of the model. In our example, one-hot encoding and log transformation worked well. Feature engineering on the date, longitude and latitude variables did not show much improvement. Certain feature engineering tricks work for certain types of datasets. What works best for our dataset needs to be discovered manually.


### Dealing with class imbalance
While discussing sensitivity and specificity, we discussed class imbalance.  In some classification problems, the classes in the target have this problem of class imbalance. In those cases, the overall accuracy is driven mainly by a single class. If we are not interested in that class, then overall accuracy looks good but the model fails to fulfill its basic objective.  We then looked at individual class accuracy called sensitivity and specificity.  In that section, we discussed the model validation measures in case of class imbalance. Here we will discuss the adjustments that we need to do before building the model so that the model can learn the patterns related to rare events. 

#### Oversampling and Undersampling
Taking a subset of the majority class is known as undersampling. Taking duplicate copies of the minority class is known as oversampling. We will try to create balanced data from the imbalanced dataset. We expect the model to pick up patterns associated with the minority class from the balanced data. In a way, we are deliberately changing the class proportions in the data sent to the model so that it can focus on the minority class.

In [None]:
import pandas as pd
#credit_risk_data = pd.read_csv(r'/content/drive/My Drive/DataSets/Chapter-5/datasets/loans_data/credit_risk_data_v1.csv')
credit_risk_data = pd.read_csv(git_hub_path+"/loans_data/credit_risk_data_v1.csv")

In [None]:
print("Actual Data :", credit_risk_data.shape)

In [None]:
print("Overall Data - Frequency")
freq=credit_risk_data['Bad'].value_counts()
print(freq)

In [None]:
print("Percentage")
print((freq/freq.sum())*100)

In [None]:
credit_risk_class0 = credit_risk_data[credit_risk_data['Bad'] == 0]
credit_risk_class1 = credit_risk_data[credit_risk_data['Bad'] == 1]

In [None]:
print("Class0 Actual :", credit_risk_class0.shape)
print("Class1 Actual  :", credit_risk_class1.shape)

**Undersampling of class 0**

In [None]:
credit_risk_class0_under = credit_risk_class0.sample(int(0.5*len(credit_risk_class0)))
print("Class0 Undersample :", credit_risk_class0_under.shape)

**Oversampling of class 1**

In [None]:
credit_risk_class1_over = credit_risk_class1.sample(4*len(credit_risk_class1),replace=True)
print("Class1 Oversample :", credit_risk_class1_over.shape)

In the above code, we used the sample() function to fetch a sample from the data. For the undersample, we chose 50% of the records from class-0. In the case of the oversample, we increased the records by four times. We need to use the replace=True option for oversampling, because we are drawing more records than the class actually contains.

In [None]:
credit_risk_balanced=pd.concat([credit_risk_class0_under,credit_risk_class1_over])
print("Final Balanced Data :", credit_risk_balanced.shape)

In [None]:
print("Balanced Data")
freq=credit_risk_balanced['Bad'].value_counts()
print(freq)
print((freq/freq.sum())*100)

We can see from the output that class-1 was just 6% of the overall data, while in the balanced data class-1 is 36%. We will build a model with the balanced data. We expect the updated model to have a better specificity.
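
One point to watch: when copies of minority records are made before the train-test split, the same record can land in both the train and the test data, which may flatter the test results. A cautious variation, sketched below on made-up data, is to split first and balance only the training part. The column name Bad and the sampling fractions follow the text; the table toy_data, the feature x and the other variable names are assumptions for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Made-up imbalanced data: 6% of the records belong to class 1
toy_data = pd.DataFrame({'x': rng.normal(size=1000),
                         'Bad': [1] * 60 + [0] * 940})

# Split first, keeping the class ratio the same in both parts
train_part, test_part = train_test_split(toy_data, test_size=0.2, stratify=toy_data['Bad'], random_state=55)

# Balance only the training part; the test part keeps its natural class ratio
train_class0 = train_part[train_part['Bad'] == 0]
train_class1 = train_part[train_part['Bad'] == 1]
train_class0_under = train_class0.sample(int(0.5 * len(train_class0)), random_state=1)
train_class1_over = train_class1.sample(4 * len(train_class1), replace=True, random_state=1)
train_balanced = pd.concat([train_class0_under, train_class1_over])

print(train_balanced['Bad'].value_counts(normalize=True) * 100)
print(test_part['Bad'].value_counts(normalize=True) * 100)
```

The model would then be trained on train_balanced and evaluated on test_part, which contains no duplicated records.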

In [None]:
print("Actual Data :", credit_risk_data.shape)
print("Class0 Actual :", credit_risk_class0.shape)
print("Class1 Actual  :", credit_risk_class1.shape)
print("Class0 Undersample :", credit_risk_class0_under.shape)
print("Class1 Oversample :", credit_risk_class1_over.shape)
print("Final Balanced Data :", credit_risk_balanced.shape)

In [None]:
X = credit_risk_balanced[['Credit_Limit', 'Late_Payments_Count',
       'Card_Utilization_Percent', 'Age', 'Debt_to_income_ratio',
       'Monthly_Income', 'Num_loans_personal_loans', 'Family_dependents']]

y = credit_risk_balanced['Bad']

In [None]:
from sklearn  import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
model_2= LogisticRegression()
model_2.fit(X_train,y_train)

In [None]:
print(model_2.intercept_)
print(model_2.coef_)

In [None]:
from sklearn.metrics import confusion_matrix

y_pred_train=model_2.predict(X_train)
cm1 = confusion_matrix(y_train,y_pred_train)
print("Confusion Matrix  on Train Data")
print(cm1)

In [None]:
accuracy1=(cm1[0,0]+cm1[1,1])/(cm1[0,0]+cm1[0,1]+cm1[1,0]+cm1[1,1])
print("Accuracy on Train data ",accuracy1)


In [None]:
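# Sensitivity here is the accuracy within class 0 (row 0 of the confusion matrix holds the actual class-0 records)
Sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])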
print("Sensitivity Train data ", round(Sensitivity1,3))

In [None]:
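# Specificity here is the accuracy within class 1 (row 1 of the confusion matrix holds the actual class-1 records)
Specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])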
print("Specificity Train data ",round(Specificity1,3))

In [None]:
y_pred_test=model_2.predict(X_test)
cm2 = confusion_matrix(y_test,y_pred_test)
print("Confusion Matrix  on Test Data")
print(cm2)

In [None]:
accuracy2=(cm2[0,0]+cm2[1,1])/(cm2[0,0]+cm2[0,1]+cm2[1,0]+cm2[1,1])
print("Accuracy on Test data ", accuracy2)


In [None]:
Sensitivity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
print("Sensitivity Test data ",round(Sensitivity2,3))

In [None]:
Specificity2=cm2[1,1]/(cm2[1,0]+cm2[1,1])
print("Specificity Test data ", round(Specificity2,3))

We want a model with high specificity. By creating balanced data, we lifted the specificity of the model from 6.8% to 55.8%. Oversampling and undersampling is one method of handling class imbalance. There are other methods like synthetic sampling and cluster centroids. We can explore them if the above-mentioned technique does not work well on our data.
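
In the cells above, sensitivity is computed as the accuracy within class 0 and specificity as the accuracy within class 1. These class-wise numbers can be cross-checked with recall_score() from scikit-learn by naming the class through pos_label. The sketch below uses made-up labels purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (made up for illustration)
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class
class0_accuracy = cm[0, 0] / (cm[0, 0] + cm[0, 1])
class1_accuracy = cm[1, 1] / (cm[1, 0] + cm[1, 1])

# recall_score with pos_label gives the same per-class numbers
print(class0_accuracy, recall_score(y_true, y_pred, pos_label=0))
print(class1_accuracy, recall_score(y_true, y_pred, pos_label=1))
```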