<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
Ridge Regression using Python (sklearn):</p><br>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold"><br>
Kumar Rahul</p><br>

### We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the DAD hospital data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using pandas
2.	Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3.	Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4.	Past medical history code has 175 missing values (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5.	Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model building and why?
6.	Split the data into training set and test set. Use 80% of data for model training and 20% for model testing. 
7.	Build a model using age as the independent variable and cost of treatment as the dependent variable.
    * Is age a significant feature in this model?
    * What inferences can be drawn from this model?
8.	Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9.	Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10.	Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build a model using gradient descent to get an intuition about the inner workings of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**

|Sl.No.|Variable|	Description|
|------|--------|--------------|
|1|Age|	 Age of the patient in years|
|2|Body Weight|	 Weight of the patient in Kilograms|
|3|Body Height| 	Height of the patient in cm|
|4|HR Pulse|	 Pulse of patient at the time of admission|
|5|BP-High|	 High BP of patient (Systolic)|
|6|BP-Low|	 Low BP of patient (Diastolic)|
|7|RR|	 Respiratory rate of patient|
|8|HB|	 Hemoglobin count of patient|
|9|Urea|	 Urea levels of patient|
|10|Creatinine|	 Creatinine levels of patient|
|11|Marital Status|	 Marital status of the patient|
|12|Gender|	  Gender code for patient|
|13|Past Medical History Code|	 Code given to the past medical history of the Patient|
|14|Mode of Arrival|	 Way in which the patient arrived at the hospital|
|15|State at the Time of Arrival|	 State in which the patient arrived|
|16|Type of Admission|	 Type of admission for the patient|
|17|Key Complaints Code|	 Codes given to the key complaints faced by the patient|
|18|Total Cost to Hospital|	 Actual cost incurred by the hospital|
|19|Total Length of Stay|	 Number of days patient stayed in the hospital|
|20|Length of Stay - ICU|	 Number of days patient stayed in the ICU|
|21|Length of Stay - Ward|	 Number of days patient stayed in the ward|
|22|Implant used (Y/N)|	 Any implant done on the patient|
|23|Cost of Implant|	 Total cost of all the implants done on the patient, if any|


***

# Code starts here

To find out which Python environment the kernel is running in



In [None]:
import sys, os

sys.executable


Suppress the warnings

In [None]:
import warnings

warnings.filterwarnings("ignore")

We are going to use the libraries below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt

#the output of plotting commands is displayed inline within Jupyter notebook
%matplotlib inline 


## Data Import and Manipulation

### 1. Importing a data set

_Give the correct path to the data_



Modify the `ast_node_interactivity` option of the IPython shell to display the value of every statement in a cell, not just the last one.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Change the display settings for columns

In [None]:
pd.options.display.max_columns = None
#pd.set_option('display.max_columns', None)

pandas resolves a relative path against the current working directory, which for a notebook is usually the folder that contains the notebook. You can move from that directory to where your data is located with '..'

> * The single period . means the current working directory
> * The double period .. means the parent of the current working directory

In [None]:
raw_df = pd.read_csv( "<give the pathname>", 
                        sep = ',', na_values = ['', ' '])

In [None]:
# regex=False makes '.' match a literal dot on every pandas version
raw_df.columns = raw_df.columns.str.lower().str.replace('.', '_', regex=False)
raw_df.head()

Drop the SL No column, as it will not be used for any analysis or model building.

In [None]:
#?raw_df.drop()

In [None]:
if set(['sl no']).issubset(raw_df.columns):
    #write your code
    pass  # placeholder so the cell runs until the drop is filled in

raw_df.head()
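
One possible way to fill in the placeholder above is sketched here on a small made-up frame. The column values are hypothetical, and the serial-number column in the real file may carry a different name after the renaming step.

```python
import pandas as pd

# Toy stand-in for raw_df; the values are made up for illustration
toy_df = pd.DataFrame({"sl no": [1, 2, 3], "age": [58, 41, 63]})

# Drop the serial-number column only if it is present
if "sl no" in toy_df.columns:
    toy_df = toy_df.drop(columns=["sl no"])

print(list(toy_df.columns))
```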


### 2. Structure of the dataset



In [None]:
raw_df.info()

Give Statistical summary of the data

In [None]:
#Write your code

Get the numeric features from the data and find the correlation among them

In [None]:
numerical_features = [x for x in raw_df.select_dtypes(include=[np.number])]
numerical_features

numerical_features_df = raw_df.select_dtypes(include=[np.number])
numerical_features_df.corr()
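
To report the feature pairs that cross the 0.70 cut-off asked for in question 2, the correlation matrix can be filtered rather than read by eye. The sketch below uses a made-up numeric frame (`toy_num_df` and its values are assumptions); in the notebook the same steps would apply to `numerical_features_df`.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with made-up values
toy_num_df = pd.DataFrame({
    "body_weight": [50.0, 62.0, 70.0, 81.0, 95.0],
    "body_height": [150.0, 160.0, 168.0, 175.0, 185.0],
    "hr_pulse": [80.0, 72.0, 95.0, 68.0, 88.0],
})

# Absolute correlations, so strong negative relationships are reported too
corr = toy_num_df.corr().abs()

# Keep only the upper triangle: each pair appears once and the diagonal is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format (feature_1, feature_2) -> correlation, filtered at the cut-off
high_corr_pairs = upper.stack()
high_corr_pairs = high_corr_pairs[high_corr_pairs >= 0.70]
print(high_corr_pairs)
```

A common action for such pairs is to keep one feature of the pair or to combine them into a single feature, as done later with BMI.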

In [None]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
categorical_features = [x for x in raw_df.select_dtypes(include=['object'])]
categorical_features

### 2. Summarizing the dataset
Create a new data frame from the raw data so that the raw data stays intact for further manipulation if needed. The *dropna()* function is used here for row-wise deletion of missing values: axis = 0 drops rows, axis = 1 drops columns.


In [None]:
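# how='any' drops a row if any value in it is missing; thresh is left unset because newer
# pandas versions reject passing both. .copy() keeps later column assignments warning-free.
filter_df = raw_df.dropna(axis=0, how='any').copy()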

list(filter_df.columns )

In [None]:
filter_df.info()
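
Row-wise deletion is one way to handle missing values. Question 4 instead asks for the label 'None' to be imputed in the past medical history code. A minimal sketch of that alternative on a made-up frame (`toy_df` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_df with one missing past medical history code
toy_df = pd.DataFrame({
    "age": [58, 41, 63],
    "past_medical_history_code": ["hypertension1", np.nan, "Diabetes1"],
})

# Work on a copy so the original frame stays untouched
imputed_df = toy_df.copy()
imputed_df["past_medical_history_code"] = (
    imputed_df["past_medical_history_code"].fillna("None")
)

print(imputed_df["past_medical_history_code"].value_counts())
```

With this approach no rows are lost because of that feature, at the cost of adding one more label to it.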

We will first start by printing the unique labels in categorical features

In [None]:
#write your code using `unique` and `value_counts` method
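
One way to approach this step is to loop over the non-numeric columns and report both the labels and their frequencies. The frame below is made up for illustration; in the notebook the loop would run over `filter_df`.

```python
import pandas as pd

# Toy stand-in for filter_df; the values are made up
toy_df = pd.DataFrame({
    "gender": ["M", "F", "M", "M"],
    "marital_status": ["MARRIED", "UNMARRIED", "MARRIED", "MARRIED"],
    "age": [58, 41, 63, 35],
})

label_counts = {}
for col in toy_df.select_dtypes(exclude="number").columns:
    # unique() lists the labels, value_counts() adds how often each one occurs
    print(col, toy_df[col].unique())
    label_counts[col] = toy_df[col].value_counts()
    print(label_counts[col])
```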

Clubbing some of the feature labels together

In [None]:
filter_df['past_medical_history_code']=np.where(
        (filter_df['past_medical_history_code'] =='hypertension1') |
         (filter_df['past_medical_history_code'] =='hypertension2') | 
         (filter_df['past_medical_history_code'] =='hypertension3'),
    'hypertension', filter_df['past_medical_history_code'])

filter_df['past_medical_history_code']=np.where(
    (filter_df['past_medical_history_code'] =='Diabetes1') |
    (filter_df['past_medical_history_code'] =='Diabetes2'), 
    'diabetes', filter_df['past_medical_history_code'])


filter_df['key_complaints__code']=np.where(
        (filter_df['key_complaints__code'] =='other- respiratory') |
         (filter_df['key_complaints__code'] =='PM-VSD') | 
         (filter_df['key_complaints__code'] =='CAD-SVD') |
        (filter_df['key_complaints__code'] =='CAD-VSD') |
        (filter_df['key_complaints__code'] =='other-nervous') |
        (filter_df['key_complaints__code'] =='other-general'), 
        'others', filter_df['key_complaints__code'])

#filter_df.past_medical_history_code.value_counts()

We will use the **groupby** function of pandas to see how the numeric features, such as the total cost to hospital, vary across the labels of a categorical feature. We will write a generic function to report the mean by any categorical variable.

In [None]:
def group_by(categorical_feature):
    # numeric_only=True skips text columns, which cannot be averaged
    return filter_df.groupby(categorical_feature).mean(numeric_only=True)



In [None]:
group_by("past_medical_history_code")
group_by("key_complaints__code")
group_by("marital_status")

Calculating BMI

\begin{equation}
 \mathrm{BMI} = \frac{\text{Body Weight}_{kg}}{\left(\text{Body Height}_{mtr}\right)^2}
\end{equation} 

In [None]:
#Write your code to compute bmi here.
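
A minimal sketch of the BMI calculation, assuming the renamed columns are `body_weight` (kg) and `body_height` (cm) as listed in Exhibit 1. The frame and its values are made up; the height is converted from centimetres to metres before squaring.

```python
import pandas as pd

# Toy stand-in for filter_df; weights in kg, heights in cm
toy_df = pd.DataFrame({
    "body_weight": [70.0, 54.0],
    "body_height": [175.0, 160.0],
})

# Convert height to metres, then apply BMI = weight / height^2
height_m = toy_df["body_height"] / 100
toy_df["bmi"] = toy_df["body_weight"] / height_m ** 2

print(toy_df)
```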

### 3. Visualizing the Data

Plots can be created using the plotting functions of:

>1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html)
2. matplotlib library (https://matplotlib.org/) or
3. seaborn library (https://seaborn.pydata.org/) which is based on matplotlib and provides interface for drawing attractive statistical graphics.

#### 3a. Visualizing the Data using seaborn

Write a custom function to create bar plot to visualize the average of numeric features w.r.t each categorical feature. Say, average age w.r.t gender.

In [None]:
#Write your custom function

In [None]:
#Call the function to plot the bar charts
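
The custom function could look roughly like the sketch below. The function name `plot_mean_by_category` and the toy data are assumptions. It relies on seaborn's `barplot`, which aggregates with the mean by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn


def plot_mean_by_category(df, categorical_feature, numeric_feature):
    """Bar plot of the mean of numeric_feature for each label of categorical_feature."""
    fig, ax = plt.subplots(figsize=(5, 3))
    # barplot draws one bar per label, with the mean as its height
    sn.barplot(data=df, x=categorical_feature, y=numeric_feature, ax=ax)
    ax.set_title(f"Average {numeric_feature} by {categorical_feature}")
    return ax


# Toy data, made up for illustration
toy_df = pd.DataFrame({"gender": ["M", "F", "M", "F"], "age": [60, 40, 50, 30]})
ax = plot_mean_by_category(toy_df, "gender", "age")
```

Calling the function in a loop over the categorical features gives one chart per feature.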

## Model using sklearn:

### Dummy Variable coding

Remove the response variable, and the features that will not be used for modelling, from the dataset


In [None]:
#X_features = list(filter_df.columns)

X_features = [x for x in filter_df if x not in ['body_weight','body_height',
                                           'creatinine','state_at_the_time_of_arrival',
                                           'total_amount_billed_to_the_patient','concession',
                                          'actual_receivable_amount','total_length_of_stay',
                                          'length_of_stay___icu','length_of_stay__ward',
                                            'total_cost_to_hospital']]

In [None]:
X_features

In [None]:
categorical_features = ['gender','marital_status','key_complaints__code',
                        'past_medical_history_code','mode_of_arrival','type_of_admsn','implant_used']

In [None]:
encoded_X_df = pd.get_dummies(filter_df[X_features], columns = categorical_features, drop_first = True )
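
To see what `drop_first = True` does, here is a small made-up example (the frame and its values are assumptions). Each categorical feature with k labels becomes k-1 indicator columns, and the dropped label acts as the reference level.

```python
import pandas as pd

# Toy frame with one numeric and two categorical features
toy_df = pd.DataFrame({
    "age": [58, 41, 63],
    "gender": ["M", "F", "M"],
    "implant_used": ["Y", "N", "N"],
})

toy_encoded = pd.get_dummies(
    toy_df, columns=["gender", "implant_used"], drop_first=True
)

# The first label of each feature (in sorted order) is dropped
print(list(toy_encoded.columns))
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for linear models with an intercept.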

In [None]:
pd.options.display.max_columns = None
encoded_X_df.info()

In [None]:
Y = filter_df.filter(['total_cost_to_hospital'], axis =1)
X = encoded_X_df
Y.info()

### Train and test data split using Python

The train and test split can also be done using the **sklearn module**

In [None]:
from sklearn.model_selection import train_test_split

# 80:20 split; random_state is an arbitrary seed added here for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

## Model Building: Using the **sklearn** 



To list everything a module provides, call `dir` on it after importing it

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score


#dir(linear_model)

In [None]:
# Create ridge regression object
ridge_reg_model = linear_model.Ridge(alpha = 0.5) #alpha = 0 is the same as ordinary least squares (OLS) regression

# Train the model using the training sets
ridge_reg_model.fit(X_train, y_train)
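
The comment about alpha can be checked on synthetic data. `Ridge` adds a penalty of alpha times the squared L2 norm of the coefficients to the least-squares loss, so a tiny alpha should reproduce the OLS coefficients and a large alpha should shrink them. The data below is randomly generated and all names in it are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data with known coefficients
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
ridge_small = Ridge(alpha=1e-8).fit(X_toy, y_toy)    # almost no penalty
ridge_large = Ridge(alpha=1000.0).fit(X_toy, y_toy)  # heavy penalty

# A near-zero alpha gives coefficients that match OLS
matches_ols = np.allclose(ols.coef_, ridge_small.coef_, atol=1e-6)

# A larger alpha shrinks the coefficient vector towards zero
norm_small = np.linalg.norm(ridge_small.coef_)
norm_large = np.linalg.norm(ridge_large.coef_)
print(matches_ols, norm_small, norm_large)
```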

Building the model is as simple as calling the `fit` method of `Ridge`. The value of alpha above was picked by hand; the best value will be selected later with a cross-validated search. First, let's check how this model fits the training data.

In [None]:
# Make predictions on the training set
y_pred = ridge_reg_model.predict(X_train)

In [None]:
# The coefficients
print('Coefficients: \n', ridge_reg_model.coef_)
print('Intercept: \n', ridge_reg_model.intercept_)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_train, y_pred))

# R-squared (coefficient of determination): 1 is perfect prediction
print('R-squared: %.2f' % r2_score(y_train, y_pred))


### Search grid for cross validation

Both `RandomizedSearchCV` and `GridSearchCV` need a set of candidate parameter values. Create a grid of alpha values that the search will evaluate with cross validation:

In [None]:
# Candidate values for the regularization strength alpha (0.0 to 5.0 in steps of 0.5)
alpha = [x for x in np.arange(0.0,5.5,.5)]

# Create the grid
random_grid = {'alpha': alpha}
random_grid

### Model with Grid Search

To report the performance on the selected KPI, use `sklearn.metrics.get_scorer_names()` to get the list of all the available metrics and pass the relevant one to `RandomizedSearchCV` or `GridSearchCV`

In [None]:
# SCORERS was removed in scikit-learn 1.3; get_scorer_names() is its replacement
from sklearn.metrics import get_scorer_names

get_scorer_names()

In [None]:
# Use the grid to search for the best hyperparameters
from sklearn.model_selection import GridSearchCV

ridge_reg_model = linear_model.Ridge()

# Exhaustive grid search over the alpha values, using 3-fold cross validation
ridge_best_model = GridSearchCV(estimator = ridge_reg_model, 
                               param_grid = random_grid, scoring = "r2",
                               cv = 3, verbose=0)
# Fit the grid search model
ridge_best_model.fit(X_train, y_train.values.ravel())

### Report the parameter

The best model has the following parameter selected from the search grid

In [None]:
ridge_best_model.best_params_
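
The notebook also mentions `RandomizedSearchCV`. As a rough sketch of how it differs from the grid search above, it samples a fixed number of candidates from a distribution instead of trying every value in a list. The synthetic data, the alpha range, and `n_iter=10` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=120)

# Sample alpha uniformly between 0.01 and 5.01 instead of enumerating a grid
param_distributions = {"alpha": uniform(loc=0.01, scale=5.0)}

random_search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled candidates
    scoring="r2",
    cv=3,
    random_state=42,  # fixed seed so the sampled candidates are reproducible
)
random_search.fit(X_toy, y_toy)

print(random_search.best_params_)
```

Random search tends to pay off when there are several hyperparameters; with a single alpha, the grid search used in this notebook is just as practical.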

## Model Evaluation


### 1. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
def get_predictions ( test_actual, model, test_data ):
    df = pd.DataFrame(model.predict(test_data))
    df.columns = ['predicted']
    y_pred_df = pd.concat([test_actual.reset_index(drop=True), df], axis = 1)
    return y_pred_df

In [None]:
predict_test_df = pd.DataFrame(get_predictions(y_test.total_cost_to_hospital, ridge_best_model, X_test))
predict_test_df.head()

In [None]:
r2_score(predict_test_df.total_cost_to_hospital,predict_test_df.predicted)
mean_squared_error(predict_test_df.total_cost_to_hospital,predict_test_df.predicted)

## Deployment - Save model

Two ways to save the model.

* Using joblib
* Using pickle

In [None]:
import joblib

In [None]:
joblib.dump( ridge_best_model, "ridge_best_model.joblib" )

Write your code to save the model using pickle. Explore `pickle.dumps` or `pickle.dump`

In [None]:
#Write your code here
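
A possible shape for the pickle version is sketched below on a tiny stand-in model. The file name, the temporary directory, and the toy model are assumptions. `pickle.dump` writes to an open binary file, while `pickle.dumps` returns the serialized bytes.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Tiny stand-in for ridge_best_model
toy_model = Ridge(alpha=0.5).fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([0.0, 1.0, 2.0])
)

# pickle.dump: serialize straight into a file opened in binary write mode
model_path = os.path.join(tempfile.gettempdir(), "toy_ridge_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# pickle.dumps: serialize to an in-memory bytes object instead
model_bytes = pickle.dumps(toy_model)

print(os.path.exists(model_path), len(model_bytes) > 0)
```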

## Use model on New Cases

We can load the model object for later use. Assume that X_test is new data on which we want to use the model. We can load the model object in two different ways:

* Using joblib
* Using pickle

In [None]:
import joblib
model_joblib = joblib.load( "ridge_best_model.joblib" )

Predict on the test set:

In [None]:
model_joblib.predict( X_test )

Model performance on the test set

In [None]:
model_joblib.score(X_test,y_test)

Write your code to load the model using pickle and predict on the test set. Explore `pickle.load` or `pickle.loads`, then call `predict` on the loaded model

In [None]:
#Write your code here
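
A minimal sketch of loading with pickle, using a stand-in model and made-up data so that it runs on its own. A model restored from its pickled form should give the same predictions as the original.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge

# Stand-in data and model; in the notebook these would be X_test and the saved model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = Ridge(alpha=0.5).fit(X_toy, y_toy)

# Serialize to bytes, then restore; pickle.load(f) does the same from an open binary file
model_bytes = pickle.dumps(toy_model)
model_pickle = pickle.loads(model_bytes)

# The restored model predicts exactly like the original
same_predictions = np.allclose(model_pickle.predict(X_toy), toy_model.predict(X_toy))
print(same_predictions)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.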


#### End of Document

***
***
