<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
OLS Regression using Python (statsmodels.formula.api):</p><br>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold"><br>
Kumar Rahul</p><br>

### We will be using DAD hospital data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the DAD Hospital data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using pandas
2.	Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3.	Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4.	Past medical history code has 175 instances of missing values (NaN). Impute ‘None’ as a label wherever the value is NaN for this feature.
5.	Create a new data frame with the numeric features and with the categorical features coded as dummy variables. Which features will you include for model building and why?
6.	Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
7.	Build a model using age as independent variable and cost of treatment as dependent variable.
    > * Is age a significant feature in this model?
    * What inferences can be drawn from this model? 
8.	Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
9.	Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What difference do you observe between the model built here and the one built in step 8?
10.	Build a model using the sklearn package to estimate the total cost to hospital. What difference do you observe in this model compared to the models built in steps 8 and 9?
11. Build a model using lasso, ridge and elastic net regression. What differences do you observe?
12. Build a model using gradient descent to get an intuition about the inner workings of optimization algorithms.
13. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

**PS: Not all the questions are answered in this notebook. You are encouraged to answer any that you find missing.**

**Exhibit 1**

|Sl.No.|Variable|	Description|
|------|--------|--------------|
|1|Age|	 Age of the patient in years|
|2|Body Weight|	 Weight of the patient in Kilograms|
|3|Body Height| 	Height of the patient in cm|
|4|HR Pulse|	 Pulse of patient at the time of admission|
|5|BP-High|	 High BP of patient (Systolic)|
|6|BP-Low|	 Low BP of patient (Diastolic)|
|7|RR|	 Respiratory rate of patient|
|8|HB|	 Hemoglobin count of patient|
|9|Urea|	 Urea levels of patient|
|10|Creatinine|	 Creatinine levels of patient|
|11|Marital Status|	 Marital status of the patient|
|12|Gender|	  Gender code for patient|
|13|Past Medical History Code|	 Code given to the past medical history of the Patient|
|14|Mode of Arrival|	 Way in which the patient arrived at the hospital|
|15|State at the Time of Arrival|	 State in which the patient arrived|
|16|Type of Admission|	 Type of admission for the patient|
|17|Key Complaints Code|	 Codes given to the key complaints faced by the patient|
|18|Total Cost to Hospital|	 Actual cost incurred by the hospital|
|19|Total Length of Stay|	 Number of days patient stayed in the hospital|
|20|Length of Stay - ICU|	 Number of days patient stayed in the ICU|
|21|Length of Stay - Ward|	 Number of days patient stayed in the ward|
|22|Implant used (Y/N)|	 Any implant done on the patient|
|23|Cost of Implant|	 Total cost of all the implants done on the patient, if any|

***

# Code starts here

Check which Python executable (and therefore which environment) the notebook kernel is running



In [None]:
import sys, os

sys.executable


Suppress the warnings

In [None]:
import warnings

warnings.filterwarnings("ignore")

We are going to use the libraries below for **data import, processing and visualization**. As we progress, we will use other, more specific libraries for model building and evaluation.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pylab as plt

#the output of plotting commands is displayed inline within Jupyter notebook
%matplotlib inline 


## Data Import and Manipulation

### 1. Importing a data set

Modify the ast_node_interactivity kernel option to see the value of multiple statements at once.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Change the display settings for columns

In [None]:
pd.options.display.max_columns
pd.set_option('display.max_columns', None)

Pandas resolves a relative path against the current working directory, which for a notebook is normally the folder that contains the notebook. You can therefore move from the current directory to where your data is located with '..'

> * The single period . means current working directory
* The double period .. means parent of the current working directory
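
To check where a relative path actually points before reading the file, a minimal sketch using the standard-library `pathlib` is shown here. The path is the one passed to `read_csv` in this notebook; whether the file exists depends on your own folder layout.

```python
from pathlib import Path

# Relative paths are resolved against the current working directory (cwd).
cwd = Path.cwd()

# Same relative path as in the read_csv call in this notebook.
relative_path = Path("../DAD_hospital/data/DAD_Case_Data_Corrected.csv")

# resolve() collapses the ".." and returns the absolute location.
absolute_path = (cwd / relative_path).resolve()

print("Working directory:", cwd)
print("Resolved data path:", absolute_path)
print("File exists:", absolute_path.exists())
```

If the last line prints `False`, adjust the number of `..` segments or change the working directory before calling `read_csv`.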

In [None]:
raw_df = pd.read_csv( "../DAD_hospital/data/DAD_Case_Data_Corrected.csv", 
                        sep = ',', na_values = ['', ' '])

# regex=False treats '.' as a literal dot rather than the regex "any character"
raw_df.columns = raw_df.columns.str.lower().str.replace('.', '_', regex=False)
raw_df.head()

In [None]:
#?pd.read_csv

Drop the SL No column, as it will not be used for any analysis or model building.

In [None]:
#?raw_df.drop()

In [None]:
if set(['sl no']).issubset(raw_df.columns):
    raw_df.drop(['sl no'],axis=1, inplace=True)
    
raw_df.head()


### 2. Structure of the dataset



In [None]:
raw_df.info()

In [None]:
raw_df.describe(include='all').transpose()
#raw_df.describe().transpose()

Get the numeric features from the data and find the correlation amongst them

In [None]:
numerical_features = [x for x in raw_df.select_dtypes(include=[np.number])]
numerical_features

In [None]:
numerical_features_df = raw_df.select_dtypes(include=[np.number])
numerical_features_df.corr()
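
The full correlation matrix is hard to scan by eye, so the pairs at or above the 0.70 cut-off can be listed directly. The sketch below is one possible way to do this; the helper name `high_correlation_pairs` and the synthetic `toy_df` are assumptions made for illustration and are not part of the DAD data.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, cutoff=0.70):
    """Return feature pairs whose absolute correlation is >= cutoff."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= cutoff].sort_values(ascending=False)

# Synthetic data: 'b' is almost a multiple of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_correlation_pairs(toy_df)
print(pairs)
```

On the notebook data the call would be `high_correlation_pairs(numerical_features_df)`. A common follow-up for a highly correlated pair is to keep only one of the two features or to combine them into a single feature.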

In [None]:
# 'object' replaces np.object, which was removed in NumPy 1.24
categorical_features = [x for x in raw_df.select_dtypes(include=['object'])]
categorical_features

### 2. Summarizing the dataset
Create a new data frame from the raw data so that the raw copy stays intact for further manipulation if needed. The *dropna()* function deletes missing values: with axis=0 it drops every row that contains a missing value, and with axis=1 it drops every column that contains one.


In [None]:
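# how='any' drops every row with at least one missing value. Recent pandas
# versions reject passing how and thresh together, so thresh is omitted.
# .copy() makes filter_df independent of raw_df for the edits that follow.
filter_df = raw_df.dropna(axis=0, how='any').copy()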

list(filter_df.columns )

In [None]:
filter_df.info()
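
Row-wise deletion discards every patient with a missing past medical history code. Question 4 asks for imputation instead, and the sketch below contrasts the two options. The three-row `toy_df` is made up for illustration; only the column name is taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the hospital data with one missing code.
toy_df = pd.DataFrame({
    "age": [34, 61, 47],
    "past_medical_history_code": ["Diabetes1", np.nan, "hypertension1"],
})

# Option 1: impute the label 'None' so the row is kept.
imputed_df = toy_df.copy()
imputed_df["past_medical_history_code"] = (
    imputed_df["past_medical_history_code"].fillna("None")
)

# Option 2: drop any row that has a missing value (the approach used above).
dropped_df = toy_df.dropna(axis=0, how="any")

print("Rows after imputation:", len(imputed_df))
print("Rows after dropna:", len(dropped_df))
```

Imputing keeps the full sample at the cost of adding an extra category level to the model.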

We will first start by printing the unique labels in categorical features

In [None]:
for f in categorical_features:
    print("\nThe unique labels in {} are {}\n".format(f, filter_df[f].unique()))
    print("The value counts in {} are \n{}\n".format(f,  filter_df[f].value_counts()))


Clubbing some of the feature labels together

In [None]:
filter_df['past_medical_history_code']=np.where(
        (filter_df['past_medical_history_code'] =='hypertension1') |
         (filter_df['past_medical_history_code'] =='hypertension2') | 
         (filter_df['past_medical_history_code'] =='hypertension3'),
    'hypertension', filter_df['past_medical_history_code'])

filter_df['past_medical_history_code']=np.where(
    (filter_df['past_medical_history_code'] =='Diabetes1') |
    (filter_df['past_medical_history_code'] =='Diabetes2'), 
    'diabetes', filter_df['past_medical_history_code'])


filter_df['key_complaints__code']=np.where(
        (filter_df['key_complaints__code'] =='other- respiratory') |
         (filter_df['key_complaints__code'] =='PM-VSD') | 
         (filter_df['key_complaints__code'] =='CAD-SVD') |
        (filter_df['key_complaints__code'] =='CAD-VSD') |
        (filter_df['key_complaints__code'] =='other-nervous') |
        (filter_df['key_complaints__code'] =='other-general'), 
        'others', filter_df['key_complaints__code'])

#filter_df.past_medical_history_code.value_counts()
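
The same clubbing can be expressed with a lookup table, which is easier to extend when more labels need merging. This is an alternative sketch, not the method used in the cells above; `label_map` and `toy_codes` are illustrative names, and the label spellings are copied from the comparisons above.

```python
import pandas as pd

# Old label -> clubbed label. Labels not listed are left unchanged.
label_map = {
    "hypertension1": "hypertension",
    "hypertension2": "hypertension",
    "hypertension3": "hypertension",
    "Diabetes1": "diabetes",
    "Diabetes2": "diabetes",
}

# Hypothetical sample of past medical history codes.
toy_codes = pd.Series(["hypertension1", "Diabetes2", "other", "hypertension3"])

clubbed = toy_codes.replace(label_map)
print(clubbed.value_counts())
```

On the notebook data this would read `filter_df['past_medical_history_code'].replace(label_map)`.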

We will use **groupby** function of pandas to summarize numerical features by each categorical feature.

In [None]:
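def group_by (categorical_features):
    # numeric_only=True restricts the summary to numeric columns;
    # recent pandas versions raise an error on text columns otherwise
    std = filter_df.groupby(categorical_features).std(numeric_only=True)
    mean = filter_df.groupby(categorical_features).mean(numeric_only=True)
    return std, mean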

Call the above function to summarize the numeric features by gender. The same call works for marital_status.

In [None]:
s,m =group_by('gender')
s
m
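
As a compact variant, `agg` can compute the mean and standard deviation in a single pass and return them side by side. The sketch below uses a made-up four-row frame; the column names mirror the notebook but the values are invented.

```python
import pandas as pd

# Hypothetical sample with one categorical and two numeric features.
toy_df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [40, 30, 60, 50],
    "rr": [20, 22, 24, 26],
})

# One row per gender, one (feature, statistic) column pair per summary.
summary = toy_df.groupby("gender")[["age", "rr"]].agg(["mean", "std"])
print(summary)
```

The result has a two-level column index, so a single value is read with a tuple, for example `summary.loc["M", ("age", "mean")]`.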

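Calculate BMI as body weight in kilograms divided by the square of body height in metres. Body height is recorded in cm, so it is divided by 100 first.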

In [None]:
filter_df['bmi'] = filter_df.body_weight/(np.power((filter_df.body_height/100),2))
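
A quick worked example helps confirm the unit conversion. The helper `bmi` and the sample values (70 kg, 175 cm) are assumptions chosen for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100  # convert cm to metres
    return weight_kg / height_m ** 2

# 70 kg at 175 cm: 70 / 1.75**2 = 70 / 3.0625
value = bmi(70, 175)
print(round(value, 2))
```

A value far outside the usual range in the real data is often a sign that height or weight was recorded in a different unit.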

### 3. Visualizing the Data

Plots can be drawn using the functions of

>1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html)
2. matplotlib library (https://matplotlib.org/) or
3. seaborn library (https://seaborn.pydata.org/) which is based on matplotlib and provides interface for drawing attractive statistical graphics.

#### 3a. Visualizing the Data using seaborn

Write a custom function to create bar plot to visualize the average of numeric features w.r.t each categorical feature. Say, average age w.r.t gender.

In [None]:
filter_df[numerical_features].info()

In [None]:
def bar_plot(xlabel, ylabel):
    # bar height is the mean of ylabel within each level of xlabel
    sn.barplot(x=xlabel, y=ylabel, data=filter_df)
    plt.xlabel(xlabel, size=14)
    plt.ylabel(ylabel, size=14)
    plt.show()

In [None]:
numerical_features_set = ['age','rr']
categorical_features_set = ['gender','marital_status']

for c in categorical_features_set:
    for n in numerical_features_set:
        bar_plot(c,n)

## Model Approach 2:  Without dummy variable coding

In [None]:
import statsmodels.formula.api as smf

To list all the names (including the model classes) available in a library, use dir()

In [None]:
#dir(smf)

In [None]:
X_features = [x for x in filter_df if x not in ['body_weight','body_height',
                                           'creatinine','state_at_the_time_of_arrival',
                                           'total_amount_billed_to_the_patient','concession',
                                          'actual_receivable_amount','total_length_of_stay',
                                          'length_of_stay___icu','length_of_stay__ward']]

In [None]:
new_df = filter_df.filter(X_features, axis =1)

new_df.info()

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split( new_df, test_size = 0.2, random_state = 42)

Write the formula with the set of variables to be used in model building. The formula takes the form Y ~ X, and wrapping a variable in C() tells the formula API to treat it as categorical.

In [None]:
pass_formula = 'total_cost_to_hospital ~ \
            C(gender) + \
            C(marital_status) + \
            C(key_complaints__code) + \
            C(past_medical_history_code) + \
            C(mode_of_arrival) + \
            C(type_of_admsn)+ \
            C(implant_used) + \
            age + hr_pulse + bp__high + bp_low + \
            rr +hb + urea + cost_of_implant + bmi'
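
To get a feel for what `C(...)` does to a categorical variable, the sketch below builds a comparable dummy column by hand with `pandas.get_dummies`. This is an approximation for intuition only: the formula API does its own coding and names the columns differently (for example `C(gender)[T.M]`). The toy frame is invented.

```python
import pandas as pd

# Hypothetical sample with one categorical feature.
toy_df = pd.DataFrame({"gender": ["M", "F", "M"], "age": [40, 30, 60]})

# drop_first=True removes the first level ('F'), which becomes the
# reference category absorbed by the intercept.
dummies = pd.get_dummies(toy_df, columns=["gender"], drop_first=True)
print(dummies)
```

The coefficient on a dummy column is then read as the expected difference in cost relative to the reference level, holding the other features fixed.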

In [None]:
regression_model = smf.ols(formula=pass_formula, data=train_df).fit()
regression_model.summary()
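
For intuition about what the fit estimates, the sketch below solves a one-variable least-squares problem with NumPy on synthetic data. The data-generating numbers (intercept 50,000, slope 1,200, noise scale 5,000) are made up and unrelated to the DAD data; the point is only that OLS picks the coefficients that minimize the sum of squared residuals.

```python
import numpy as np

# Synthetic data: cost rises linearly with age plus random noise.
rng = np.random.default_rng(42)
age = rng.uniform(1, 80, size=100)
cost = 50_000 + 1_200 * age + rng.normal(scale=5_000, size=100)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(age), age])

# Least-squares solution: minimizes ||cost - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, cost, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
residuals = cost - X @ beta
r_squared = 1 - (residuals ** 2).sum() / ((cost - cost.mean()) ** 2).sum()

print("Intercept and slope:", beta)
print("R-squared:", r_squared)
```

The summary table reports the same kind of coefficients and R-squared, along with standard errors and p-values that this sketch does not compute.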

## Find the significant variables


In [None]:
def get_significant_vars (modelobject):
    var_p_vals_df = pd.DataFrame(modelobject.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    return list(var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'])

In [None]:
significant_vars = get_significant_vars(regression_model)
significant_vars
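
Because `pvalues` is a pandas Series indexed by term name, the same filter can also be written as a single boolean selection. The p-values below are invented to show the behaviour at the 0.05 threshold.

```python
import pandas as pd

# Hypothetical p-values keyed by model term.
toy_pvalues = pd.Series(
    {"Intercept": 0.001, "age": 0.03, "hb": 0.40, "urea": 0.051}
)

# Keep the terms whose p-value is at or below 0.05.
significant = toy_pvalues[toy_pvalues <= 0.05].index.tolist()
print(significant)
```

Note that `urea` at 0.051 falls just outside the cut-off, which is a reminder that the threshold is a convention rather than a sharp boundary.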

## Model Evaluation


### 1. The prediction on train data.
Two ways to predict the outcome on the **train set**:
> * Use **predict** function of the model object 
* Use **get_prediction** function of the model object

For the model with dummy variable coding done explicitly, we need to add the constant term to the test set. For the model with dummy variable coding carried out automatically by the formula API, there is no need to add the constant term to the test set.

Here is the output of the model built without explicit dummy variable coding

In [None]:
predict_train_df = regression_model.predict((train_df))
predict_train_df.head()

predict_train_df = regression_model.get_prediction(train_df)
predict_train_df.predicted_mean[0:5]

### 2. Model Evaluation - heteroscedasticity

In [None]:
pred_val = regression_model.fittedvalues.copy()
true_val = train_df['total_cost_to_hospital'].values.copy()
residual = true_val - pred_val

In [None]:
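# residuals on the y-axis against fitted values on the x-axis;
# a funnel shape suggests heteroscedasticity
plt.scatter(pred_val, residual)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")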

### 3. Model Evaluation - Test for Normality

In [None]:
import statsmodels.api as sm
normality_plot = sm.qqplot(residual,line = 'r')

### 4. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
def get_predictions ( test_actual, model, test_data ):
    y_pred_df = pd.DataFrame( { 'actual': test_actual,
                               'predicted': model.get_prediction((test_data)).predicted_mean})
    return y_pred_df

In [None]:
predict_test_df = get_predictions( test_df.total_cost_to_hospital, regression_model, test_df)
predict_test_df.head()
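
To report performance on the test set, the actual and predicted columns can be summarized with RMSE, MAE and R-squared. The sketch below computes them with NumPy on a made-up frame that has the same two column names as `predict_test_df`; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for predict_test_df.
toy_predictions = pd.DataFrame({
    "actual": [100.0, 150.0, 200.0, 250.0],
    "predicted": [110.0, 140.0, 190.0, 260.0],
})

errors = toy_predictions["actual"] - toy_predictions["predicted"]

# Root mean squared error and mean absolute error, in the units of the target.
rmse = float(np.sqrt((errors ** 2).mean()))
mae = float(errors.abs().mean())

# R-squared: share of the variance in 'actual' explained by the predictions.
ss_res = float((errors ** 2).sum())
ss_tot = float(((toy_predictions["actual"] - toy_predictions["actual"].mean()) ** 2).sum())
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse, "MAE:", mae, "R-squared:", r2)
```

Comparing these test-set figures with the R-squared in the training summary gives a rough check for overfitting.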


#### End of Document

***
***
