<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
OLS Regression using Python (statsmodels.api):</p><br>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold"><br>
Kumar Rahul</p><br>

### We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the DAD hospital data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using pandas
2. Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3.	Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4.	Past medical history code has 175 instances of missing value (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5.	Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model building and why?
6.	Split the data into training set and test set. Use 80% of data for model training and 20% for model testing. 
7.	Build a model using age as independent variable and cost of treatment as dependent variable.
    * Is age a significant feature in this model?
    * What inferences can be drawn from this model?
8. Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9. Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10. Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build model using gradient descent to get an intuition about the inner working of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**

|Sl.No.|Variable|	Description|
|------|--------|--------------|
|1|Age|	 Age of the patient in years|
|2|Body Weight|	 Weight of the patient in Kilograms|
|3|Body Height| 	Height of the patient in cm|
|4|HR Pulse|	 Pulse of patient at the time of admission|
|5|BP-High|	 High BP of patient (Systolic)|
|6|BP-Low|	 Low BP of patient (Diastolic)|
|7|RR|	 Respiratory rate of patient|
|8|HB|	 Hemoglobin count of patient|
|9|Urea|	 Urea levels of patient|
|10|Creatinine|	 Creatinine levels of patient|
|11|Marital Status|	 Marital status of the patient|
|12|Gender|	  Gender code for patient|
|13|Past Medical History Code|	 Code given to the past medical history of the Patient|
|14|Mode of Arrival|	 Way in which the patient arrived at the hospital|
|15|State at the Time of Arrival|	 State in which the patient arrived|
|16|Type of Admission|	 Type of admission for the patient|
|17|Key Complaints Code|	 Codes given to the key complaints faced by the patient|
|18|Total Cost to Hospital|	 Actual cost incurred by the hospital|
|19|Total Length of Stay|	 Number of days patient stayed in the hospital|
|20|Length of Stay - ICU|	 Number of days patient stayed in the ICU|
|21|Length of Stay - Ward|	 Number of days patient stayed in the ward|
|22|Implant used (Y/N)|	 Any implant done on the patient|
|23|Cost of Implant|	 Total cost of all the implants done on the patient, if any|


***

# Code starts here

Import `os` to inspect the environment of the Python kernel



In [None]:
import os

Suppress the warnings

In [None]:
import warnings

warnings.filterwarnings("ignore")

We are going to use the libraries below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt

#the output of plotting commands is displayed inline within Jupyter notebook
%matplotlib inline 


## Data Import and Manipulation

### 1. Importing a data set

Modify the `ast_node_interactivity` kernel option to see the value of multiple statements at once.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Change the display settings for columns

In [None]:
pd.options.display.max_columns

In [None]:
pd.set_option('display.max_columns', None)

pandas resolves a relative file path against the current working directory (the value returned by `os.getcwd()`), which is not necessarily the folder that holds the notebook. You can move from the current directory to where your data is located with '..'

* The single period . means the current working directory
* The double period .. means the parent of the current working directory

Correct the path name in the code below.

In [None]:
os.getcwd()

In [None]:
raw_df = ##read the data
# regex=False treats '.' as a literal dot; as a regex it would match every character
raw_df.columns = raw_df.columns.str.lower().str.replace('.', '_', regex=False)
raw_df.head()
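
As a hedged illustration of the read-and-rename step, the sketch below loads a tiny in-memory CSV instead of the real file. The column names in `csv_text` are assumptions made for the demo, not the actual headers of the DAD file, and `demo_raw_df` is a hypothetical name chosen so that it does not overwrite `raw_df`.

```python
import io
import pandas as pd

# Hypothetical two-row extract; the real DAD column names may differ.
csv_text = "SL.,AGE,KEY.COMPLAINTS..CODE\n1,58,other-heart\n2,59,CAD-DVD\n"

# For the real data, pass the file path instead of the in-memory buffer.
demo_raw_df = pd.read_csv(io.StringIO(csv_text))

# Lower-case the names and swap literal dots for underscores.
demo_raw_df.columns = (
    demo_raw_df.columns.str.lower().str.replace('.', '_', regex=False)
)
print(list(demo_raw_df.columns))
```

Each dot becomes one underscore, which is why names such as `key_complaints__code` end up with a double underscore.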

Drop the SL No column, as it will not be used for any analysis or model building. Refer to the `drop()` function of pandas.

In [None]:
#add your code here.

In [None]:
raw_df.head()


### 2. Structure of the dataset

In [None]:
raw_df.info()

Get the statistical summary of the data using the `describe` method of pandas. Include the statistical summary of the categorical features as well.

In [None]:
## your code here
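
One possible way to cover both kinds of feature is the `include="all"` argument of `describe`. The sketch below uses a small made-up frame (`demo_df` is a hypothetical name), so the numbers are not from the DAD data.

```python
import pandas as pd

# Made-up rows, only to show the shape of the output.
demo_df = pd.DataFrame({
    "age": [58, 59, 82, 46],
    "gender": ["M", "M", "F", "M"],
})

# include="all" adds unique/top/freq rows for the non-numeric columns.
demo_summary = demo_df.describe(include="all")
print(demo_summary)
```

Cells that do not apply to a column, such as the mean of `gender`, are reported as NaN.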

By default, `describe()` summarizes only the numeric features of a mixed data frame, so make sure your code also describes the categorical features.

Get the numeric features from the data and find the correlation amongst them

In [None]:
numerical_features = list(raw_df.select_dtypes(include=[np.number]).columns)
numerical_features

In [None]:
numerical_features_df = raw_df.select_dtypes(include=[np.number])

Get correlation between numerical features. Use the corr method from pandas.

In [None]:
## your code here.
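
A minimal sketch of reporting the pairs at the 0.70 cut-off follows. It runs on synthetic data in which `body_height` is deliberately built to track `body_weight`; all values and the `demo_` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weight = rng.normal(60, 10, 200)
demo_num_df = pd.DataFrame({
    "body_weight": weight,
    # Constructed to follow weight closely, so the pair is highly correlated.
    "body_height": 100 + 0.9 * weight + rng.normal(0, 2, 200),
    # Independent noise, so it should not pass the cut-off.
    "hr_pulse": rng.normal(80, 8, 200),
})

corr_matrix = demo_num_df.corr()

# Keep only the upper triangle so each pair appears once, without the diagonal.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack()
high_corr_pairs = pairs[pairs.abs() >= 0.70]
print(high_corr_pairs)
```

Filtering on the absolute value keeps strong negative correlations as well as positive ones.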

Get list of categorical features.

In [None]:
categorical_features = ## your code here
categorical_features
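
One way to list the categorical features is to take every column that is not numeric. The sketch below assumes a small made-up frame; `demo_df` and `demo_categorical` are hypothetical names.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    "age": [58, 59],
    "gender": ["M", "F"],
    "marital_status": ["MARRIED", "UNMARRIED"],
})

# Everything that is not a numeric dtype is treated as categorical here.
demo_categorical = list(demo_df.select_dtypes(exclude=[np.number]).columns)
print(demo_categorical)
```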

### 2. Summarizing the dataset

Create a new data frame that holds a cleaned copy of the raw data, so that `raw_df` stays intact for further manipulation if needed. Use the *dropna()* function for row-wise deletion of missing values: `axis=0` drops rows that contain a missing value, and `axis=1` drops columns that contain one.


In [None]:
filter_df = # write your code here. Look for the method to be used on raw_df
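
The row-wise deletion described above could look like the sketch below, shown on a made-up frame with two incomplete rows. `demo_raw` and `demo_filter` are hypothetical names.

```python
import numpy as np
import pandas as pd

demo_raw = pd.DataFrame({
    "age": [58, np.nan, 46],
    "gender": ["M", "F", None],
})

# axis=0 drops every row that has at least one missing value.
# dropna returns a new frame; .copy() makes the independence explicit.
demo_filter = demo_raw.dropna(axis=0).copy()
print(len(demo_raw), len(demo_filter))
```

Only the first row survives, and the original frame still has all three rows.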

In [None]:
filter_df.info()

Loop through the list of categorical features created above and print the unique labels of each feature.

In [None]:
#Write your code here.
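
A minimal sketch of such a loop, on made-up data with hypothetical `demo_` names:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "marital_status": ["MARRIED", "UNMARRIED", "MARRIED"],
})
demo_categorical = ["gender", "marital_status"]

# Collect the distinct labels of each categorical column, then print them.
unique_labels = {col: sorted(demo_df[col].unique()) for col in demo_categorical}
for col, labels in unique_labels.items():
    print(col, ":", labels)
```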

Clubbing some of the feature labels together

In [None]:
# Club the numbered hypertension labels into a single label
filter_df['past_medical_history_code'] = np.where(
    filter_df['past_medical_history_code'].isin(
        ['hypertension1', 'hypertension2', 'hypertension3']),
    'hypertension', filter_df['past_medical_history_code'])

# Club the numbered diabetes labels into a single label
filter_df['past_medical_history_code'] = np.where(
    filter_df['past_medical_history_code'].isin(['Diabetes1', 'Diabetes2']),
    'diabetes', filter_df['past_medical_history_code'])

# Club the listed key complaints under 'others'
filter_df['key_complaints__code'] = np.where(
    filter_df['key_complaints__code'].isin(
        ['other- respiratory', 'PM-VSD', 'CAD-SVD', 'CAD-VSD',
         'other-nervous', 'other-general']),
    'others', filter_df['key_complaints__code'])

#filter_df.past_medical_history_code.value_counts()

We will use **groupby** function of pandas to summarize numerical features by each categorical feature.

In [None]:
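def group_by(categorical_features):
    # numeric_only=True skips the text columns, which recent pandas versions
    # would otherwise refuse to aggregate
    grouped = filter_df.groupby(categorical_features)
    std = grouped.std(numeric_only=True)
    mean = grouped.mean(numeric_only=True)
    return std, mean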

Call the above function to group the numeric value by gender and marital_status.

In [None]:
## your code here
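
To show what such a call returns, the sketch below defines a stand-alone variant of the helper that takes the data frame as an argument (an assumption made so the example runs on its own) and applies it to made-up data.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "gender": ["M", "M", "F", "F"],
    "marital_status": ["MARRIED", "UNMARRIED", "MARRIED", "UNMARRIED"],
    "age": [50, 60, 30, 40],
})

def demo_group_by(df, by):
    """Return (std, mean) of the numeric columns for each label of `by`."""
    grouped = df.groupby(by)
    return grouped.std(numeric_only=True), grouped.mean(numeric_only=True)

std_by_gender, mean_by_gender = demo_group_by(demo_df, "gender")
print(mean_by_gender)
```

Passing `"marital_status"` instead of `"gender"` gives the same summary per marital status.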

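Calculating BMI

\begin{equation}
BMI = \frac{\text{body weight (kg)}}{\left(\text{body height (m)}\right)^2}
\end{equation}

Add a new feature to calculate BMI and include it as a part of filter_df. Body height is recorded in cm (see Exhibit 1), so convert it to metres before squaring.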

In [None]:
filter_df['bmi'] = ## your code here
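
A worked instance of the formula, assuming weight in kilograms and height in centimetres as listed in Exhibit 1. The two rows are made up, and `demo_df` is a hypothetical name.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "body_weight": [49.2, 70.0],   # kilograms
    "body_height": [160, 175],     # centimetres
})

# Divide by 100 to convert cm to metres before squaring.
demo_df["bmi"] = demo_df["body_weight"] / (demo_df["body_height"] / 100) ** 2
print(demo_df)
```

For the first row this is 49.2 / 1.6² = 19.21875.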

### 3. Visualizing the Data

Plots can be drawn using the functions of the

1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html),
2. matplotlib library (https://matplotlib.org/), or
3. seaborn library (https://seaborn.pydata.org/), which is based on matplotlib and provides an interface for drawing attractive statistical graphics.

#### 3a. Visualizing the Data using seaborn

Write a custom function to create bar plot to visualize the average of numeric features w.r.t each categorical feature. Say, average age w.r.t gender.

In [None]:
#Write your custom function

In [None]:
#Call the function to plot the bar charts
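
One possible shape for such a function is sketched below. It relies on seaborn's `barplot`, which aggregates with the mean by default; the function name `plot_mean_by_category` and the demo data are assumptions, not part of the notebook.

```python
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

def plot_mean_by_category(df, categorical_feature, numeric_feature):
    """Bar plot of the mean of numeric_feature per label of categorical_feature."""
    fig, ax = plt.subplots(figsize=(5, 3))
    # barplot computes the mean of y within each x category by default.
    sn.barplot(x=categorical_feature, y=numeric_feature, data=df, ax=ax)
    ax.set_title(f"Mean {numeric_feature} by {categorical_feature}")
    return ax

demo_df = pd.DataFrame({
    "gender": ["M", "M", "F", "F"],
    "age": [50, 60, 30, 40],
})
ax = plot_mean_by_category(demo_df, "gender", "age")
```

Looping this call over the categorical and numeric feature lists produces one chart per combination.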

## Model Approach 1:  With dummy variable coding

### Dummy Variable coding

Create an X feature list that excludes the following features: `body_weight`, `body_height`, `creatinine`, `state_at_the_time_of_arrival`, `total_amount_billed_to_the_patient`, `concession`, `actual_receivable_amount`, `total_length_of_stay`, `length_of_stay___icu`, `length_of_stay__ward` and `total_cost_to_hospital`.

Excluding `total_cost_to_hospital` removes the response variable from the feature list.

In [None]:
X_features = [x for x in filter_df if x not in ['body_weight','body_height', 
                                                'creatinine','state_at_the_time_of_arrival',
                                                'total_amount_billed_to_the_patient',
                                                'concession', 'actual_receivable_amount',
                                                'total_length_of_stay', 
                                                'length_of_stay___icu',
                                                'length_of_stay__ward', 
                                                'total_cost_to_hospital']]

In [None]:
X_features

In [None]:
categorical_features = ['gender','marital_status','key_complaints__code',
                        'past_medical_history_code','mode_of_arrival','type_of_admsn','implant_used__y_n_']

Add code below to create the dummy-variable-coded columns. Look up the use of `pd.get_dummies`.

In [None]:
encoded_X_df = ## your code here
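
A hedged sketch of dummy coding with `pd.get_dummies` on made-up data. Two choices here are assumptions rather than requirements of the exercise: `drop_first=True` drops one label per feature so that the dummies are not collinear with the constant term, and `dtype=float` avoids the boolean dummy columns that recent pandas versions produce, which can trip up `sm.OLS`.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "age": [58, 59, 82],
    "gender": ["M", "F", "M"],
})

# One 0/1 column per label, minus the first label of each feature.
demo_encoded = pd.get_dummies(
    demo_df, columns=["gender"], drop_first=True, dtype=float
)
print(demo_encoded)
```

Numeric columns such as `age` pass through unchanged, and `gender` is replaced by `gender_M`.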

In [None]:
pd.options.display.max_columns = None
encoded_X_df.head()

In [None]:
Y = filter_df.filter(['total_cost_to_hospital'], axis =1)
X = encoded_X_df
Y.info()

### Train and test data split using Python

The train and test split can also be done using the **sklearn module**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2,
                                                   random_state = 42)

In [None]:
X_train.shape
X_test.shape

y_train.shape

y_test.shape

X_train.info()

## Model Building: Using the **statsmodels.api** 



In [None]:
import statsmodels.api as sm
regression_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
regression_model.summary()

## Find the significant variables


In [None]:
def get_significant_vars (modelobject):
    var_p_vals_df = pd.DataFrame(modelobject.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    return list(var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'])

In [None]:
significant_vars = get_significant_vars(regression_model)
significant_vars

## Model Evaluation


### 1. The prediction on train data.
There are two ways to predict the outcome on the **train set**:

* Use the **predict** function of the model object
* Use the **get_prediction** function of the model object

For the model with dummy variable coding done explicitly, we need to add the constant term to the test set. For the model with dummy variable coding carried out automatically, there is no need to add the constant term to the test set.

Below is the output from model with dummy variable coding

In [None]:
predict_train_df = regression_model.predict(sm.add_constant(X_train))
predict_train_df.head()

predict_train_df = regression_model.get_prediction(sm.add_constant(X_train))
predict_train_df.predicted_mean[0:5]

### 2. Model Evaluation - heteroscedasticity

In [None]:
pred_val = regression_model.fittedvalues.copy()
true_val = y_train['total_cost_to_hospital'].values.copy()
residual = true_val - pred_val

In [None]:
plt.scatter(pred_val, residual)  # fitted values on x, residuals on y

### 3. Model Evaluation - Test for Normality

In [None]:
normality_plot = sm.qqplot(residual, line = 'r')

### 4. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
def get_predictions ( test_actual, model, test_data ):
    y_pred_df = pd.DataFrame( { 'actual': test_actual,
                               'predicted': model.get_prediction(sm.add_constant(test_data)).predicted_mean})
    return y_pred_df

In [None]:
predict_test_df = pd.DataFrame(get_predictions(y_test.total_cost_to_hospital, regression_model, X_test))
predict_test_df
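
To report performance on the test set, RMSE and R-squared can be computed from a frame with `actual` and `predicted` columns like the one above. The sketch below uses four made-up rows so the arithmetic is easy to follow; `demo_pred_df` is a hypothetical name.

```python
import numpy as np
import pandas as pd

demo_pred_df = pd.DataFrame({
    "actual":    [100.0, 150.0, 200.0, 250.0],
    "predicted": [110.0, 140.0, 190.0, 260.0],
})

errors = demo_pred_df["actual"] - demo_pred_df["predicted"]

# Root mean squared error: typical size of a prediction error.
rmse = np.sqrt((errors ** 2).mean())

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = (errors ** 2).sum()
ss_tot = ((demo_pred_df["actual"] - demo_pred_df["actual"].mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

Here every error is 10 in magnitude, so the RMSE is 10, and R-squared is 1 - 400/12500 = 0.968.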


#### End of Document

***
***
