<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">
Ridge Regression using Julia (ScikitLearn):</p>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold">
Kumar Rahul</p><br>


ScikitLearn.jl implements the popular scikit-learn interface and algorithms in Julia. It supports both models from the Julia ecosystem and those of the scikit-learn library (via PyCall.jl).

* More at: https://cstjean.github.io/ScikitLearn.jl/dev/man/python/
* Examples at: https://github.com/cstjean/ScikitLearn.jl/blob/master/docs/src/man/examples.md


### We will be using the DAD hospital data in this exercise. Refer to Exhibit 1 for the feature list, then use the data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using CSV
2.	Build a correlation matrix between all the numeric features in the dataset.
3.	Build a new feature named BMI using body height and body weight. Include this as a part of the data frame created in step 1.
4.	Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model building and why?
5.	Split the data into training set and test set. Use 80% of data for model training and 20% for model testing. 
6.	Build a model using age as independent variable and cost of treatment as dependent variable.
    * Is age a significant feature in this model?
    * What inferences can be drawn from this model?
7.	Build a model with statsmodels.api to estimate the total cost to hospital. How do you interpret the model outcome? Report the model performance on the test set.
8.	Build a model with statsmodels.formula.api to estimate the total cost to hospital and report the model performance on the test set. What differences do you observe between the model built here and the one built in step 7?
9.	Build a model using the sklearn package to estimate the total cost to hospital. What differences do you observe between this model and the models built in steps 7 and 8?
10. Build a model using lasso, ridge and elastic net regression. What differences do you observe?

**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**

|Sl.No.|Variable|	Description|
|------|--------|--------------|
|1|Age|	 Age of the patient in years|
|2|Body Weight|	 Weight of the patient in Kilograms|
|3|Body Height| 	Height of the patient in cm|
|4|HR Pulse|	 Pulse of patient at the time of admission|
|5|BP-High|	 High BP of patient (Systolic)|
|6|BP-Low|	 Low BP of patient (Diastolic)|
|7|RR|	 Respiratory rate of patient|
|8|HB|	 Hemoglobin count of patient|
|9|Urea|	 Urea levels of patient|
|10|Creatinine|	 Creatinine levels of patient|
|11|Marital Status|	 Marital status of the patient|
|12|Gender|	  Gender code for patient|
|13|Past Medical History Code|	 Code given to the past medical history of the Patient|
|14|Mode of Arrival|	 Way in which the patient arrived at the hospital|
|15|State at the Time of Arrival|	 State in which the patient arrived|
|16|Type of Admission|	 Type of admission for the patient|
|17|Key Complaints Code|	 Codes given to the key complaints faced by the patient|
|18|Total Cost to Hospital|	 Actual cost incurred by the hospital|
|19|Total Length of Stay|	 Number of days patient stayed in the hospital|
|20|Length of Stay - ICU|	 Number of days patient stayed in the ICU|
|21|Length of Stay - Ward|	 Number of days patient stayed in the ward|
|22|Implant used (Y/N)|	 Any implant done on the patient|
|23|Cost of Implant|	 Total cost of all the implants done on the patient, if any|


***

# Code starts here

We are going to use the libraries listed below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation. 

In [1]:
#import pandas as pd 
#import numpy as np
#import seaborn as sn # visualization library based on matplotlib
#import matplotlib.pylab as plt

#the output of plotting commands is displayed inline within Jupyter notebook
#%matplotlib inline 

**Use Pkg.add("Package-name") to install the packages before proceeding further.**

In [2]:
using Pkg
using CSV
using DataFrames
using Statistics
using FreqTables
using StatsBase
using Gadfly
using Printf
using MLJ ##Machine Learning Julia, schema() from this package.
using ScikitLearn ##Machine Learning using SciKitLearn
using JLD ##To save model object
using PyCallJLD #to save model object


## Data Import and Manipulation

### 1. Importing a data set

_Give the correct path to the data_



Change the display settings for columns

In [3]:
ENV["COLUMNS"] = 1000

ENV["LINES"] = 30

30

In [4]:
#%pwd ## IPython magic; not valid Julia syntax

The analogue of IPython's `%pwd` is `pwd()` in Julia.


In [5]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_regression/Code"

In [6]:
raw_df = CSV.read( "../DAD_hospital/data/DAD_Case_Data_Corrected.csv", DataFrame, 
                    delim = ",", header =1,
                    normalizenames=true,
                    missingstrings = ["", " "]
                    )
first(raw_df, 6) ## first(df, n) replaces the deprecated head(df)

Unnamed: 0_level_0,Sl_NO,AGE,GENDER,MARITAL_STATUS,KEY_COMPLAINTS_CODE,BODY_WEIGHT,BODY_HEIGHT,HR_PULSE,BP_HIGH,BP_LOW,RR,PAST_MEDICAL_HISTORY_CODE,HB,UREA,CREATININE,MODE_OF_ARRIVAL,STATE_AT_THE_TIME_OF_ARRIVAL,TYPE_OF_ADMSN,TOTAL_COST_TO_HOSPITAL,TOTAL_AMOUNT_BILLED_TO_THE_PATIENT,CONCESSION,ACTUAL_RECEIVABLE_AMOUNT,TOTAL_LENGTH_OF_STAY,LENGTH_OF_STAY_ICU,LENGTH_OF_STAY_WARD,IMPLANT_USED,COST_OF_IMPLANT
Unnamed: 0_level_1,Int64,Float64,String,String,String,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64,Int64,Float64,String,String,String,Float64,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64
1,1,58.0,M,MARRIED,other- heart,49,160,118,100,80,32,,11,33,0.8,AMBULANCE,ALERT,EMERGENCY,660293.0,474901,0,474901,25,12,13,Y,38000
2,2,59.0,M,MARRIED,CAD-DVD,41,155,78,70,50,28,,11,95,1.7,AMBULANCE,ALERT,EMERGENCY,809130.0,944819,96422,848397,41,20,21,Y,39690
3,3,82.0,M,MARRIED,CAD-TVD,47,164,100,110,80,20,Diabetes2,12,15,0.8,WALKED IN,ALERT,ELECTIVE,362231.0,390000,30000,360000,18,9,9,N,0
4,4,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension1,12,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450
5,5,60.0,M,MARRIED,CAD-DVD,58,175,72,180,100,18,Diabetes2,10,48,1.9,AMBULANCE,ALERT,EMERGENCY,444876.0,254673,10000,244673,24,12,12,N,0
6,6,75.0,M,MARRIED,CAD-DVD,45,140,130,215,140,42,,12,29,1.0,AMBULANCE,ALERT,EMERGENCY,372357.0,499987,0,499987,31,9,22,N,0


In [7]:
rename!(raw_df, lowercase.(names(raw_df)));

Dropping the `sl_no` column, as it will not be used for any analysis or model building.

In [8]:
#if Set(["sl no"]) in names(raw_df){
#    raw_df.drop(['sl no'],axis=1, inplace=True)
#    }
    

In [9]:
if "sl_no" in names(raw_df)
    select!(raw_df, Not(["sl_no"]))
end

Unnamed: 0_level_0,age,gender,marital_status,key_complaints_code,body_weight,body_height,hr_pulse,bp_high,bp_low,rr,past_medical_history_code,hb,urea,creatinine,mode_of_arrival,state_at_the_time_of_arrival,type_of_admsn,total_cost_to_hospital,total_amount_billed_to_the_patient,concession,actual_receivable_amount,total_length_of_stay,length_of_stay_icu,length_of_stay_ward,implant_used,cost_of_implant
Unnamed: 0_level_1,Float64,String,String,String,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64,Int64,Float64,String,String,String,Float64,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64
1,58.0,M,MARRIED,other- heart,49,160,118,100,80,32,,11,33,0.8,AMBULANCE,ALERT,EMERGENCY,660293.0,474901,0,474901,25,12,13,Y,38000
2,59.0,M,MARRIED,CAD-DVD,41,155,78,70,50,28,,11,95,1.7,AMBULANCE,ALERT,EMERGENCY,809130.0,944819,96422,848397,41,20,21,Y,39690
3,82.0,M,MARRIED,CAD-TVD,47,164,100,110,80,20,Diabetes2,12,15,0.8,WALKED IN,ALERT,ELECTIVE,362231.0,390000,30000,360000,18,9,9,N,0
4,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension1,12,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450
5,60.0,M,MARRIED,CAD-DVD,58,175,72,180,100,18,Diabetes2,10,48,1.9,AMBULANCE,ALERT,EMERGENCY,444876.0,254673,10000,244673,24,12,12,N,0
6,75.0,M,MARRIED,CAD-DVD,45,140,130,215,140,42,,12,29,1.0,AMBULANCE,ALERT,EMERGENCY,372357.0,499987,0,499987,31,9,22,N,0
7,73.0,M,MARRIED,CAD-TVD,60,170,108,160,90,24,Diabetes2,15,31,1.6,WALKED IN,ALERT,ELECTIVE,887350.0,660504,504,660000,15,15,0,N,0
8,71.0,M,MARRIED,CAD-TVD,44,164,60,130,90,22,,10,37,1.5,WALKED IN,ALERT,EMERGENCY,389827.0,248580,0,248580,24,11,13,N,0
9,72.0,M,MARRIED,CAD-DVD,72,174,95,100,50,25,Diabetes2,10,32,1.2,AMBULANCE,ALERT,EMERGENCY,4.37529e5,691297,0,691297,26,9,17,N,0
10,61.0,M,MARRIED,CAD-TVD,77,175,66,140,90,22,,14,15,0.4,WALKED IN,ALERT,ELECTIVE,364222.0,247654,0,247654,20,4,16,N,0


In [10]:
names(raw_df)

26-element Array{String,1}:
 "age"
 "gender"
 "marital_status"
 "key_complaints_code"
 "body_weight"
 "body_height"
 "hr_pulse"
 "bp_high"
 "bp_low"
 "rr"
 "past_medical_history_code"
 "hb"
 "urea"
 "creatinine"
 "mode_of_arrival"
 "state_at_the_time_of_arrival"
 "type_of_admsn"
 "total_cost_to_hospital"
 "total_amount_billed_to_the_patient"
 "concession"
 "actual_receivable_amount"
 "total_length_of_stay"
 "length_of_stay_icu"
 "length_of_stay_ward"
 "implant_used"
 "cost_of_implant"

**Optional: To iterate over rows and columns of a dataframe.**

In [11]:
eachcol(raw_df);
eachrow(raw_df);
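
As a rough illustration of what these iterators yield, here is a sketch on a two-row toy data frame (the values are made up; only the column names mirror the hospital data). `eachcol` walks over columns, and `pairs(eachcol(df))` additionally gives each column's name; `eachrow` walks over rows whose fields can be read by name.

```julia
using DataFrames

# Toy data frame standing in for raw_df (values are illustrative only).
toy = DataFrame(age = [58.0, 59.0], gender = ["M", "F"])

# Column-wise iteration: name is a Symbol, col is the column vector.
for (name, col) in pairs(eachcol(toy))
    println(name, " has element type ", eltype(col))
end

# Row-wise iteration: each row exposes its values by column name.
for row in eachrow(toy)
    println(row.age, " / ", row.gender)
end
```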


### 2. Structure of the dataset



In [12]:
#raw_df.info()

Not very informative as it does not print the column names:

In [13]:
eltype.(eachcol(raw_df)) ## eltypes(df) is deprecated in newer DataFrames versions

26-element Array{DataType,1}:
 Float64
 String
 String
 String
 Int64
 Int64
 Int64
 Int64
 Int64
 Int64
 String
 Int64
 Int64
 Float64
 String
 String
 String
 Float64
 Int64
 Int64
 Int64
 Int64
 Int64
 Int64
 String
 Int64

'=>' is a pair operator in Julia
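
A minimal sketch of how `=>` behaves, using two made-up entries rather than the full column list: `a => b` builds a single `Pair`, the dotted form `.=>` pairs two collections element by element, and `Dict` accepts the resulting vector of pairs.

```julia
# Toy stand-ins for names(raw_df) and eltype.(eachcol(raw_df)).
col_names = ["age", "gender"]
col_types = [Float64, String]

p = "age" => Float64                  # a single Pair
pairs_vec = col_names .=> col_types   # broadcast: one Pair per column
type_lookup = Dict(pairs_vec)         # Dict built from a vector of pairs

println(p.first, " -> ", p.second)
println(type_lookup["gender"])
```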

In [14]:
Dict(names(raw_df) .=> eltype.(eachcol(raw_df)))

Dict{String,DataType} with 26 entries:
  "urea"                               => Int64
  "total_cost_to_hospital"             => Float64
  "length_of_stay_ward"                => Int64
  "concession"                         => Int64
  "age"                                => Float64
  "type_of_admsn"                      => String
  "hb"                                 => Int64
  "creatinine"                         => Float64
  "actual_receivable_amount"           => Int64
  "key_complaints_code"                => String
  "total_amount_billed_to_the_patient" => Int64
  "hr_pulse"                           => Int64
  "body_height"                        => Int64
  "past_medical_history_code"          => String
  "mode_of_arrival"                    => String
  "state_at_the_time_of_arrival"       => String
  "total_length_of_stay"               => Int64
  "marital_status"                     => String
  "cost_of_implant"                    => Int64
  "bp_high"                          

Alternatively, `schema()` from the MLJ package reports both the data type and the scientific type of each column. MLJ also provides useful utilities such as one-hot encoding.

In [15]:
schema(raw_df)

┌────────────────────────────────────┬─────────┬────────────┐
│ _.names                            │ _.types │ _.scitypes │
├────────────────────────────────────┼─────────┼────────────┤
│ age                                │ Float64 │ Continuous │
│ gender                             │ String  │ Textual    │
│ marital_status                     │ String  │ Textual    │
│ key_complaints_code                │ String  │ Textual    │
│ body_weight                        │ Int64   │ Count      │
│ body_height                        │ Int64   │ Count      │
│ hr_pulse                           │ Int64   │ Count      │
│ bp_high                            │ Int64   │ Count      │
│ bp_low                             │ Int64   │ Count      │
│ rr                                 │ Int64   │ Count      │
│ past_medical_history_code          │ String  │ Textual    │
│ hb                                 │ Int64   │ Count      │
│ urea                               │ Int6

Get the numeric features from the data and find the correlation among them.

In [16]:
#numerical_features = [x for x in raw_df.select_dtypes(include=[np.number])]
#numerical_features

In [17]:
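## keep the names of columns whose element type is numeric (possibly with missing values)
numerical_features = names(raw_df)[(<:).(eltype.(eachcol(raw_df)), Union{Number,Missing})]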

18-element Array{String,1}:
 "age"
 "body_weight"
 "body_height"
 "hr_pulse"
 "bp_high"
 "bp_low"
 "rr"
 "hb"
 "urea"
 "creatinine"
 "total_cost_to_hospital"
 "total_amount_billed_to_the_patient"
 "concession"
 "actual_receivable_amount"
 "total_length_of_stay"
 "length_of_stay_icu"
 "length_of_stay_ward"
 "cost_of_implant"

### Exercise

* **Build a correlation matrix between all the numeric features in the dataset.**

This is how it was done in python

In [18]:
#numerical_features_df = raw_df.select_dtypes(include=[np.number])
#numerical_features_df.corr()

In [19]:
## Write your code here
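
One possible starting point, sketched on a small made-up matrix rather than the hospital data: `cor` from the `Statistics` standard library returns the pairwise Pearson correlations between the columns of a matrix. For the data frame, something along the lines of `cor(Matrix(raw_df[:, numerical_features]))` should apply, assuming the selected columns contain no missing values.

```julia
using Statistics

# Toy numeric data: rows are observations, columns are features (assumed values).
X = [1.0 2.0 10.0;
     2.0 4.1  8.0;
     3.0 6.2  7.5;
     4.0 7.9  3.0]

C = cor(X)   # 3×3 matrix of pairwise Pearson correlations between columns
println(round.(C, digits = 2))
```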



Get categorical features from the data.

In [20]:
#categorical_features = [x for x in raw_df.select_dtypes(include=[np.object])]
#categorical_features

In [21]:
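## AbstractString also covers the inline string types that newer CSV.jl versions may produce
categorical_features = names(raw_df)[(<:).(eltype.(eachcol(raw_df)), Union{AbstractString,Missing})]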

8-element Array{String,1}:
 "gender"
 "marital_status"
 "key_complaints_code"
 "past_medical_history_code"
 "mode_of_arrival"
 "state_at_the_time_of_arrival"
 "type_of_admsn"
 "implant_used"

In [22]:
#raw_df.describe(include='all')

In [23]:
describe(raw_df, :all, cols=numerical_features)

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,nunique,nmissing,first,last,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Real,Float64,Float64,Float64,Real,Nothing,Nothing,Real,Real,DataType
1,age,31.6063,26.6156,0.83,6.0,21.0,58.0,88.0,,,58.0,30.0,Float64
2,body_weight,39.5521,22.9404,3.0,16.5,43.0,59.5,85.0,,,49.0,71.0,Int64
3,body_height,133.607,38.1152,19.0,110.5,151.0,162.0,185.0,,,160.0,180.0,Int64
4,hr_pulse,90.9141,19.3998,58.0,76.0,90.0,102.0,140.0,,,118.0,87.0,Int64
5,bp_high,113.767,23.228,70.0,100.0,110.0,130.0,215.0,,,100.0,130.0,Int64
6,bp_low,71.5337,15.7195,40.0,60.0,70.0,80.0,140.0,,,80.0,40.0,Int64
7,rr,23.227,3.77173,12.0,22.0,24.0,24.0,42.0,,,32.0,20.0,Int64
8,hb,13.2086,3.10009,8.0,11.0,13.0,14.5,26.0,,,11.0,13.0,Int64
9,urea,28.4724,17.9362,15.0,18.0,24.0,32.0,143.0,,,33.0,15.0,Int64
10,creatinine,0.718405,0.461912,0.1,0.3,0.6,1.0,2.5,,,0.8,0.8,Float64


In [24]:
describe(raw_df, :all, cols=categorical_features)

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,nunique,nmissing,first,last,eltype
Unnamed: 0_level_1,Symbol,Nothing,Nothing,String,Nothing,Nothing,Nothing,String,Int64,Nothing,String,String,DataType
1,gender,,,F,,,,M,2,,M,M,String
2,marital_status,,,MARRIED,,,,UNMARRIED,2,,MARRIED,MARRIED,String
3,key_complaints_code,,,ACHD,,,,other-tertalogy,13,,other- heart,RHD,String
4,past_medical_history_code,,,Diabetes1,,,,other,7,,,,String
5,mode_of_arrival,,,AMBULANCE,,,,WALKED IN,3,,AMBULANCE,WALKED IN,String
6,state_at_the_time_of_arrival,,,ALERT,,,,ALERT,1,,ALERT,ALERT,String
7,type_of_admsn,,,ELECTIVE,,,,EMERGENCY,2,,EMERGENCY,ELECTIVE,String
8,implant_used,,,N,,,,Y,2,,Y,Y,String


In [25]:
describe(raw_df, :min,:max,:nunique,:first,:last,:eltype, cols=categorical_features)

Unnamed: 0_level_0,variable,min,max,nunique,first,last,eltype
Unnamed: 0_level_1,Symbol,String,String,Int64,String,String,DataType
1,gender,F,M,2,M,M,String
2,marital_status,MARRIED,UNMARRIED,2,MARRIED,MARRIED,String
3,key_complaints_code,ACHD,other-tertalogy,13,other- heart,RHD,String
4,past_medical_history_code,Diabetes1,other,7,,,String
5,mode_of_arrival,AMBULANCE,WALKED IN,3,AMBULANCE,WALKED IN,String
6,state_at_the_time_of_arrival,ALERT,ALERT,1,ALERT,ALERT,String
7,type_of_admsn,ELECTIVE,EMERGENCY,2,EMERGENCY,ELECTIVE,String
8,implant_used,N,Y,2,Y,Y,String


### 3. Summarizing the dataset
Create a new data frame holding a cleaned copy of the raw data, so that the raw data stays intact for further manipulation if needed. In pandas, *dropna()* performs row-wise deletion of missing values (axis = 0 means row-wise, 1 means column-wise); the DataFrames.jl counterpart used here is *dropmissing()*, which drops every row that contains a missing value.
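
A small sketch of row-wise deletion with DataFrames.jl on a toy data frame (values are made up): `dropmissing(df)` returns a new data frame without the rows that contain any missing value, and passing a column restricts the check to that column.

```julia
using DataFrames

# Toy data frame with one missing value in each column.
toy = DataFrame(age = [58.0, missing, 82.0], gender = ["M", "F", missing])

clean = dropmissing(toy)             # drops rows 2 and 3; toy itself is unchanged
println(nrow(toy), " -> ", nrow(clean))

clean_age = dropmissing(toy, :age)   # only rows with a missing age are dropped
println(nrow(clean_age))
```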


In [26]:
#filter_df = raw_df.dropna()
#list(filter_df.columns )

In [27]:
filter_df = dropmissing(raw_df) ## returns a new data frame, so an extra copy() is not needed
first(filter_df, 6)

Unnamed: 0_level_0,age,gender,marital_status,key_complaints_code,body_weight,body_height,hr_pulse,bp_high,bp_low,rr,past_medical_history_code,hb,urea,creatinine,mode_of_arrival,state_at_the_time_of_arrival,type_of_admsn,total_cost_to_hospital,total_amount_billed_to_the_patient,concession,actual_receivable_amount,total_length_of_stay,length_of_stay_icu,length_of_stay_ward,implant_used,cost_of_implant
Unnamed: 0_level_1,Float64,String,String,String,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64,Int64,Float64,String,String,String,Float64,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64
1,58.0,M,MARRIED,other- heart,49,160,118,100,80,32,,11,33,0.8,AMBULANCE,ALERT,EMERGENCY,660293.0,474901,0,474901,25,12,13,Y,38000
2,59.0,M,MARRIED,CAD-DVD,41,155,78,70,50,28,,11,95,1.7,AMBULANCE,ALERT,EMERGENCY,809130.0,944819,96422,848397,41,20,21,Y,39690
3,82.0,M,MARRIED,CAD-TVD,47,164,100,110,80,20,Diabetes2,12,15,0.8,WALKED IN,ALERT,ELECTIVE,362231.0,390000,30000,360000,18,9,9,N,0
4,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension1,12,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450
5,60.0,M,MARRIED,CAD-DVD,58,175,72,180,100,18,Diabetes2,10,48,1.9,AMBULANCE,ALERT,EMERGENCY,444876.0,254673,10000,244673,24,12,12,N,0
6,75.0,M,MARRIED,CAD-DVD,45,140,130,215,140,42,,12,29,1.0,AMBULANCE,ALERT,EMERGENCY,372357.0,499987,0,499987,31,9,22,N,0


'=>' is a pair operator in Julia

In [28]:
Dict(names(filter_df) .=> eltype.(eachcol(filter_df)))

Dict{String,DataType} with 26 entries:
  "urea"                               => Int64
  "total_cost_to_hospital"             => Float64
  "length_of_stay_ward"                => Int64
  "concession"                         => Int64
  "age"                                => Float64
  "type_of_admsn"                      => String
  "hb"                                 => Int64
  "creatinine"                         => Float64
  "actual_receivable_amount"           => Int64
  "key_complaints_code"                => String
  "total_amount_billed_to_the_patient" => Int64
  "hr_pulse"                           => Int64
  "body_height"                        => Int64
  "past_medical_history_code"          => String
  "mode_of_arrival"                    => String
  "state_at_the_time_of_arrival"       => String
  "total_length_of_stay"               => Int64
  "marital_status"                     => String
  "cost_of_implant"                    => Int64
  "bp_high"                          

We will first start by printing the unique labels in categorical features

In [29]:
@show Set(filter_df[:,"gender"]);
unique(filter_df[:,"gender"])

Set(filter_df[:, "gender"]) = Set(["F", "M"])


2-element Array{String,1}:
 "M"
 "F"

In [30]:
#for f in categorical_features:
#    print("\nThe unique labels in {} is {}\n".format(f, filter_df[f].unique()))
#    print("The values in {} is \n{}\n".format(f,  filter_df[f].value_counts()))

The `@` precedes `printf` because `@printf` is a macro, not a function. As a macro, it can parse and interpret the format string at compile time and generate custom code for that specific format string.

More at: https://stackoverflow.com/questions/19783030/in-julia-why-is-printf-a-macro-instead-of-a-function
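
A short sketch of the two macros from the `Printf` standard library. The counts (110 of 163 rows) are taken from the gender tally in this notebook; the format strings are only illustrative.

```julia
using Printf

feature = "gender"
share = 110 / 163   # share of "M" rows

# The format string is a literal, so the macro can process it before the code runs.
@printf("%s: %.1f%% of rows are M\n", feature, 100 * share)

# @sprintf returns the formatted string instead of printing it.
msg = @sprintf("%-8s|%6.2f", feature, share)
println(msg)
```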

In [31]:
for f in categorical_features
    #print(repr(f)) ## to convert symbol to categorical name.
    unq = unique(filter_df[:, f]) ## Set(filter_df[:, f]) also works.
    val_cnt = StatsBase.countmap(filter_df[:, f])
    @printf("\nThe unique labels in %s is %s \n", f, unq)
    @printf("\nThe value counts in %s are %s \n", f, val_cnt)
end


The unique labels in gender is ["M", "F"] 

The value counts in gender are Dict("M" => 110,"F" => 53) 

The unique labels in marital_status is ["MARRIED", "UNMARRIED"] 

The value counts in marital_status are Dict("UNMARRIED" => 85,"MARRIED" => 78) 

The unique labels in key_complaints_code is ["other- heart", "CAD-DVD", "CAD-TVD", "RHD", "CAD-SVD", "other- respiratory", "ACHD", "other-tertalogy", "other-nervous", "PM-VSD", "OS-ASD", "CAD-VSD", "other-general"] 

The value counts in key_complaints_code are Dict("CAD-SVD" => 2,"CAD-VSD" => 1,"CAD-DVD" => 26,"CAD-TVD" => 22,"PM-VSD" => 4,"other-general" => 1,"other- heart" => 42,"OS-ASD" => 13,"ACHD" => 16,"other-nervous" => 1,"RHD" => 19,"other- respiratory" => 5,"other-tertalogy" => 11) 

The unique labels in past_medical_history_code is ["None", "Diabetes2", "hypertension1", "hypertension3", "hypertension2", "Diabetes1", "other"] 

The value counts in past_medical_history_code are Dict("None" => 105,"hypertension2" => 7,"other" => 14

Clubbing some of the feature labels together.

In [32]:
#filter_df['past_medical_history_code']=np.where(
#    (filter_df['past_medical_history_code'] =='hypertension1') |
#    (filter_df['past_medical_history_code'] =='hypertension2') |
#    (filter_df['past_medical_history_code'] =='hypertension3'), 
#    'hypertension', filter_df['past_medical_history_code'])

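`.` is the broadcasting (dot) syntax in Julia: `f.(a, b)` means "apply `f` elementwise to `a` and `b`". The same applies to operators, so `.==` and `.|` compare and combine arrays element by element.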

More at: https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting
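
The following sketch shows broadcasting on a four-element toy vector (the labels are illustrative, not the full column). It builds a Boolean mask with `.==` and `.|`, builds the same mask with a broadcast function call, and then uses `ifelse.` to relabel the matching elements.

```julia
# Toy labels standing in for past_medical_history_code.
codes = ["hypertension1", "Diabetes2", "hypertension3", "None"]

# Elementwise comparison combined with elementwise OR.
mask = (codes .== "hypertension1") .| (codes .== "hypertension3")

# Any function can be broadcast; the string argument is treated as a scalar.
is_htn = startswith.(codes, "hypertension")

# ifelse.(m, a, b) takes a where m is true and the matching element of b otherwise.
merged = ifelse.(is_htn, "hypertension", codes)

println(mask)
println(merged)
```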

In [33]:
filter_df[:,"past_medical_history_code"].= ifelse.(
        (filter_df[:,"past_medical_history_code"] .== "hypertension1") .|
        (filter_df[:,"past_medical_history_code"] .== "hypertension2") .| 
        (filter_df[:,"past_medical_history_code"] .== "hypertension3"),
    "hypertension", filter_df[:,"past_medical_history_code"]);

In [34]:
first(filter_df[filter_df.past_medical_history_code .== "hypertension", :], 6)

Unnamed: 0_level_0,age,gender,marital_status,key_complaints_code,body_weight,body_height,hr_pulse,bp_high,bp_low,rr,past_medical_history_code,hb,urea,creatinine,mode_of_arrival,state_at_the_time_of_arrival,type_of_admsn,total_cost_to_hospital,total_amount_billed_to_the_patient,concession,actual_receivable_amount,total_length_of_stay,length_of_stay_icu,length_of_stay_ward,implant_used,cost_of_implant
Unnamed: 0_level_1,Float64,String,String,String,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64,Int64,Float64,String,String,String,Float64,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64
1,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension,12,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450
2,61.0,M,MARRIED,CAD-DVD,64,170,99,140,80,24,hypertension,13,15,1.0,WALKED IN,ALERT,ELECTIVE,514524.0,282000,15000,267000,21,10,11,Y,39690
3,68.0,F,UNMARRIED,CAD-DVD,51,123,66,120,80,20,hypertension,13,21,0.7,AMBULANCE,ALERT,EMERGENCY,495969.0,161250,0,161250,16,16,0,N,0
4,78.0,F,MARRIED,CAD-DVD,70,154,63,150,90,20,hypertension,10,25,1.0,WALKED IN,ALERT,ELECTIVE,157763.0,180000,0,180000,9,4,3,N,0
5,59.0,F,MARRIED,RHD,47,150,60,130,90,24,hypertension,12,15,0.7,WALKED IN,ALERT,ELECTIVE,343984.0,89050,0,89050,17,5,12,Y,20800
6,88.0,M,MARRIED,CAD-TVD,62,162,92,140,100,22,hypertension,12,52,1.2,AMBULANCE,ALERT,EMERGENCY,305193.0,307256,6000,301256,16,9,7,N,0


In [35]:
countmap(filter_df.past_medical_history_code)

Dict{String,Int64} with 5 entries:
  "None"         => 105
  "other"        => 14
  "hypertension" => 26
  "Diabetes2"    => 9
  "Diabetes1"    => 9

In [36]:
#filter_df['past_medical_history_code']=np.where(
#    (filter_df['past_medical_history_code'] =='Diabetes1') |
#    (filter_df['past_medical_history_code'] =='Diabetes2'), 
#    'diabetes', filter_df['past_medical_history_code'])

In [37]:
filter_df[(filter_df.past_medical_history_code .=="Diabetes1") .| 
            (filter_df.past_medical_history_code .=="Diabetes2"),"past_medical_history_code"] .= "Diabetes";

In [38]:
countmap(filter_df.past_medical_history_code)

Dict{String,Int64} with 4 entries:
  "None"         => 105
  "other"        => 14
  "hypertension" => 26
  "Diabetes"     => 18

In [39]:
filter_df[(filter_df.key_complaints_code .=="other- respiratory") .| 
         (filter_df.key_complaints_code .=="PM-VSD") .|
        (filter_df.key_complaints_code .=="CAD-SVD") .|
        (filter_df.key_complaints_code .=="CAD-VSD") .|
        (filter_df.key_complaints_code .=="other-nervous") .|
        (filter_df.key_complaints_code .=="other-general")
        ,"key_complaints_code"] .= "others";

In [40]:
countmap(filter_df.key_complaints_code)

Dict{String,Int64} with 8 entries:
  "other-tertalogy" => 11
  "other- heart"    => 42
  "OS-ASD"          => 13
  "CAD-DVD"         => 26
  "RHD"             => 19
  "others"          => 14
  "ACHD"            => 16
  "CAD-TVD"         => 22

### Exercise:

* **Calculate the average across all the numeric features w.r.t categorical feature.**

In [41]:
#def group_by (categorical_features):
#    return filter_df.groupby(categorical_features).mean()

In [42]:
#group_by("past_medical_history_code")
#group_by("key_complaints_code")
#group_by("marital_status")

In [43]:
##Write your code here
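
One possible approach, sketched on a toy data frame with made-up values: `groupby` splits the rows by a categorical column, and `combine` with `cols .=> mean` computes the mean of each listed numeric column per group. The helper name `group_mean` is hypothetical and only mirrors the Python `group_by` above.

```julia
using DataFrames, Statistics

# Toy data frame; column names mirror the notebook, values are made up.
toy = DataFrame(gender = ["M", "F", "M", "F"],
                age    = [60.0, 30.0, 40.0, 50.0],
                rr     = [20, 24, 22, 26])

# Mean of every listed numeric column within each level of `col`.
group_mean(df, col, num_cols) = combine(groupby(df, col), num_cols .=> mean)

res = group_mean(toy, :gender, [:age, :rr])
println(res)
```

With the hospital data, a call such as `group_mean(filter_df, :gender, Symbol.(numerical_features))` would be the analogous step, assuming every listed column is still present in `filter_df`.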



**Calculating BMI**

In [44]:
#filter_df['bmi'] = filter_df.body_weight/(np.power((filter_df.body_height/100),2))
#filter_df['bmi']

In [45]:
filter_df[:,"bmi"] = filter_df.body_weight./(filter_df.body_height./100).^2;
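
BMI is weight in kilograms divided by the square of height in metres; since `body_height` is recorded in centimetres, it is divided by 100 first. As a quick check, the sketch below applies the same broadcast expression to the first two patients shown earlier (49 kg at 160 cm, and 41 kg at 155 cm).

```julia
# Weight in kg and height in cm for the first two rows of the data.
body_weight = [49, 41]
body_height = [160, 155]

# Same expression as the notebook cell: kg / (m^2), computed elementwise.
bmi = body_weight ./ (body_height ./ 100) .^ 2

println(round.(bmi, digits = 4))   # [19.1406, 17.0656]
```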

### Exercise: Visualizing the Data using Gadfly

* **Write a custom function to create bar plot to visualize the average of numeric features w.r.t each categorical feature. Say, average age w.r.t gender.**

This is how one might do it using seaborn in Python:

In [46]:
#filter_df[numerical_features].info()

In [47]:
#def bar_plot(xlabel,ylabel):
#    sn.barplot(x = xlabel, y = ylabel, data= filter_df)
#    plt.xlabel(xlabel, size = 14)
#    plt.ylabel(ylabel, size = 14)
#    #plt.grid(True)
#    x1,x2,y1,y2 = plt.axis()
#    plt.show()

In [48]:
#numerical_features_set = ['age','rr']
#categorical_features_set = ['gender','marital_status']

#for c in categorical_features_set:
#    for n in numerical_features_set:
#        bar_plot(c,n)

In [49]:
##Write your code here




## Model using sklearn:

Remove the response variable and the other unused columns from the feature list. To type the membership symbols in Julia:

* Type `\in` followed by TAB to get ∈
* Type `\notin` followed by TAB to get ∉

In [50]:
#using ScikitLearn

In [51]:
removed_features = ["body_weight","body_height",
                    "creatinine","state_at_the_time_of_arrival",
                    "total_amount_billed_to_the_patient","concession",
                    "actual_receivable_amount","total_length_of_stay",
                    "length_of_stay_icu","length_of_stay_ward",
                    "total_cost_to_hospital"]

11-element Array{String,1}:
 "body_weight"
 "body_height"
 "creatinine"
 "state_at_the_time_of_arrival"
 "total_amount_billed_to_the_patient"
 "concession"
 "actual_receivable_amount"
 "total_length_of_stay"
 "length_of_stay_icu"
 "length_of_stay_ward"
 "total_cost_to_hospital"

**Optional: Way to create a list of features the python way:**

In [52]:
[x for x in names(raw_df) if (x!="age") && (x!="gender")];

In [53]:
#X_features = [x for x in names(filter_df) if x not in removed_features] ##Python

X_features = [x for x ∈ names(filter_df) if x ∉ removed_features]

16-element Array{String,1}:
 "age"
 "gender"
 "marital_status"
 "key_complaints_code"
 "hr_pulse"
 "bp_high"
 "bp_low"
 "rr"
 "past_medical_history_code"
 "hb"
 "urea"
 "mode_of_arrival"
 "type_of_admsn"
 "implant_used"
 "cost_of_implant"
 "bmi"

In [54]:
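all(x -> x ∉ X_features, removed_features) ## is every removed feature absent from X_features?

true

In [55]:
any(x -> x ∈ X_features, removed_features) ## is any removed feature still in X_features?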

false
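
Note that `∈` and `∉` test whether a single value is an element of a collection. Applied to a whole vector, they ask whether the vector itself is an element, which is not the same as comparing the two lists item by item. A small sketch with made-up lists that deliberately overlap:

```julia
removed = ["body_weight", "total_cost_to_hospital"]
kept    = ["age", "bmi", "body_weight"]   # overlaps with `removed` on purpose

# Asks whether the vector `removed` is itself an element of `kept`.
println(removed ∉ kept)                      # true, even though the lists overlap

# Element-wise checks answer the intended question.
println(any(x -> x ∈ kept, removed))         # true: "body_weight" is in both
println(isempty(intersect(removed, kept)))   # false: the overlap is non-empty
```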

### Categorical Encoding using sklearn

In [56]:
## numeric columns among the selected features
X_numeric = [x for x in X_features if eltype(filter_df[:, x]) <: Union{Number,Missing}]

9-element Array{String,1}:
 "age"
 "hr_pulse"
 "bp_high"
 "bp_low"
 "rr"
 "hb"
 "urea"
 "cost_of_implant"
 "bmi"

In [57]:
## string-valued (categorical) columns among the selected features
X_categoric = [x for x in X_features if eltype(filter_df[:, x]) <: Union{AbstractString,Missing}]

7-element Array{String,1}:
 "gender"
 "marital_status"
 "key_complaints_code"
 "past_medical_history_code"
 "mode_of_arrival"
 "type_of_admsn"
 "implant_used"

In [58]:
##We will use LabelBinarizer

@sk_import preprocessing:(LabelBinarizer)#, StandardScaler, OneHotEncoder, LabelEncoder, MultiLabelBinarizer) 

PyObject <class 'sklearn.preprocessing._label.LabelBinarizer'>

`ScikitLearn.DataFrameMapper` can be used for dummy-variable coding. `DataFrameMapper` is not available until DataFrames is imported. We could apply a label binarizer to each column individually, but that is not efficient.

`LabelBinarizer` is a scikit-learn class that accepts categorical data as input and returns a matrix. Unlike `LabelEncoder`, which assigns a unique integer to each label, `LabelBinarizer` encodes the data as dummy variables, each indicating the presence or absence of a particular label.
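
To make the idea concrete, here is a plain-Julia sketch of dummy-variable coding on a toy vector. It illustrates the concept only and is not scikit-learn's implementation. One difference worth knowing: for a feature with exactly two labels, scikit-learn's `LabelBinarizer` returns a single 0/1 column rather than two.

```julia
labels  = ["M", "F", "M", "M"]
classes = sort(unique(labels))   # ["F", "M"]

# One column per class; entry (i, j) is 1.0 when labels[i] == classes[j].
# permutedims turns the class vector into a 1×2 row so that .== yields a 4×2 matrix.
onehot = Float64.(labels .== permutedims(classes))

println(classes)
println(onehot)
```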

One can refer to https://scikit-learn.org/stable/modules/preprocessing.html for different methods from sklearn python which can be used in Julia.

In [59]:
mapper = DataFrameMapper([ 
                        ( :key_complaints_code  , LabelBinarizer() )#,
                        #( :gender  , LabelBinarizer() )
                        ])

DataFrameMapper(Tuple[(:key_complaints_code, PyObject LabelBinarizer())], false, false, Array{Float64,2})

In [60]:
countmap(filter_df.key_complaints_code)

Dict{String,Int64} with 8 entries:
  "other-tertalogy" => 11
  "other- heart"    => 42
  "OS-ASD"          => 13
  "CAD-DVD"         => 26
  "RHD"             => 19
  "others"          => 14
  "ACHD"            => 16
  "CAD-TVD"         => 22

In [61]:
fit_transform!(mapper, copy(filter_df))

163×8 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 ⋮                        ⋮         
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

Although the documentation describes passing a list of columns, this does not seem to work at present, either because of a bug or because lists are not yet supported. The code below will not work:

In [62]:
#mapper = DataFrameMapper([ 
#                        ([ Symbol.(X_categoric) ] , LabelBinarizer()),
#                        ( [ Symbol.(X_numeric) ], nothing )]);

#### Workaround

In [63]:
cat_col = Symbol.(X_categoric)
cat_feature_defs = [(cat_col_name, LabelBinarizer()) for cat_col_name in cat_col]
cat_mapper = DataFrameMapper(cat_feature_defs)

DataFrameMapper(Tuple[(:gender, PyObject LabelBinarizer()), (:marital_status, PyObject LabelBinarizer()), (:key_complaints_code, PyObject LabelBinarizer()), (:past_medical_history_code, PyObject LabelBinarizer()), (:mode_of_arrival, PyObject LabelBinarizer()), (:type_of_admsn, PyObject LabelBinarizer()), (:implant_used, PyObject LabelBinarizer())], false, false, Array{Float64,2})

In [64]:
num_col = Symbol.(X_numeric)
num_feature_defs = [(num_col_name, nothing) for num_col_name in num_col]
num_mapper = DataFrameMapper(num_feature_defs)

DataFrameMapper(Tuple[(:age, nothing), (:hr_pulse, nothing), (:bp_high, nothing), (:bp_low, nothing), (:rr, nothing), (:hb, nothing), (:urea, nothing), (:cost_of_implant, nothing), (:bmi, nothing)], false, false, Array{Float64,2})

In [65]:
#fit_transform!(LabelBinarizer(), filter_df)
X1 = fit_transform!(cat_mapper, (filter_df));

In [66]:
X2 = fit_transform!(num_mapper, (filter_df));

In [67]:
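X3 = DataFrame(hcat(X1, X2), :auto) ## matrix to data frame with auto-generated names x1, x2, ...; older DataFrames versions used convert(DataFrame, hcat(X1, X2))
first(X3, 6)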

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,58.0,118.0,100.0,80.0,32.0,11.0,33.0,38000.0,19.1406
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,59.0,78.0,70.0,50.0,28.0,11.0,95.0,39690.0,17.0656
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,82.0,100.0,110.0,80.0,20.0,12.0,15.0,0.0,17.4747
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,46.0,122.0,110.0,80.0,24.0,12.0,74.0,89450.0,26.7299
5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,60.0,72.0,180.0,100.0,18.0,10.0,48.0,0.0,18.9388
6,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,75.0,130.0,215.0,140.0,42.0,12.0,29.0,0.0,22.9592
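
To make the encoded columns above easier to read, here is a minimal pure-Julia sketch of what a label binarizer does to a single column. The helper name `binarize_labels` and the toy vectors are hypothetical (they are not part of ScikitLearn.jl); the sketch assumes scikit-learn's convention of emitting one 0/1 column for a two-level feature and one column per level otherwise, with levels in sorted order.

```julia
# Hypothetical helper illustrating label binarization for one column.
function binarize_labels(col::AbstractVector)
    levels = sort(unique(col))            # sorted distinct labels
    if length(levels) == 2
        # Two levels: a single indicator column for the second level
        return reshape(Float64.(col .== levels[2]), :, 1), levels
    end
    # Otherwise: one indicator column per level
    return Float64[x == l for x in col, l in levels], levels
end

gender  = ["M", "F", "M", "M"]
arrival = ["AMBULANCE", "WALKED IN", "TRANSFERRED", "AMBULANCE"]

G, g_levels = binarize_labels(gender)     # 4×1 matrix
A, a_levels = binarize_labels(arrival)    # 4×3 matrix
@show size(G) g_levels
@show size(A) a_levels
```

This is why a two-level feature such as gender contributes one column to `X1`, while a multi-level feature contributes one column per level.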


### MLJ Package

MLJ (Machine Learning in Julia) is a toolbox written in Julia providing a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing over 150 machine learning models written in Julia and other languages. In particular MLJ wraps a large number of scikit-learn models.

https://alan-turing-institute.github.io/MLJ.jl/dev/list_of_supported_models/

To list the models supported by MLJ:

In [68]:
models()

166-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = AffinityPropagation, package_name = ScikitLearn, ... )
 (name = AgglomerativeClustering, package_name = ScikitLearn, ... )
 (name = BaggingClassifier, package_name = ScikitLearn, ... )
 (name = BaggingRegressor, package_name = ScikitLearn, ... )
 (name = BayesianLDA, package_name = M

Models for which code is already loaded can be found with:

In [69]:
localmodels()

14-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, 

To search for a model, pass a search string. All models whose name or docstring matches the string are shown:

In [70]:
models("regression")

36-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 (name = BaggingRegressor, package_name = ScikitLearn, ... )
 (name = BayesianRidgeRegressor, package_name = ScikitLearn, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = ElasticNetCVRegressor, package_name = ScikitLearn, ... )
 (name = ElasticNetRegressor, package_name = ScikitLearn, ... )
 (name = EvoTreeCount, package_name = EvoTrees, ... )
 (name = GradientBoostingRegressor, package_name = Sc

###  Categorical Encoding using MLJ 

By default, the scientific type of a string-valued categorical variable is `Textual`. We need to coerce it to a categorical scientific type such as `Multiclass` before applying one-hot encoding.

More on Scientific types and internal working at: https://alan-turing-institute.github.io/MLJ.jl/dev/mlj_cheatsheet/#Scitypes-and-coercion

In [71]:
schema(filter_df)

┌────────────────────────────────────┬─────────┬────────────┐
│ _.names                            │ _.types │ _.scitypes │
├────────────────────────────────────┼─────────┼────────────┤
│ age                                │ Float64 │ Continuous │
│ gender                             │ String  │ Textual    │
│ marital_status                     │ String  │ Textual    │
│ key_complaints_code                │ String  │ Textual    │
│ body_weight                        │ Int64   │ Count      │
│ body_height                        │ Int64   │ Count      │
│ hr_pulse                           │ Int64   │ Count      │
│ bp_high                            │ Int64   │ Count      │
│ bp_low                             │ Int64   │ Count      │
│ rr                                 │ Int64   │ Count      │
│ past_medical_history_code          │ String  │ Textual    │
│ hb                                 │ Int64   │ Count      │
│ urea                               │ Int6

To coerce a particular column to Continuous or Multiclass, we can write:

In [72]:
head(coerce!(copy(filter_df), :hb => Continuous, :gender => Multiclass))

Unnamed: 0_level_0,age,gender,marital_status,key_complaints_code,body_weight,body_height,hr_pulse,bp_high,bp_low,rr,past_medical_history_code,hb,urea,creatinine,mode_of_arrival,state_at_the_time_of_arrival,type_of_admsn,total_cost_to_hospital,total_amount_billed_to_the_patient,concession,actual_receivable_amount,total_length_of_stay,length_of_stay_icu,length_of_stay_ward,implant_used,cost_of_implant,bmi
Unnamed: 0_level_1,Float64,Cat…,String,String,Int64,Int64,Int64,Int64,Int64,Int64,String,Float64,Int64,Float64,String,String,String,Float64,Int64,Int64,Int64,Int64,Int64,Int64,String,Int64,Float64
1,58.0,M,MARRIED,other- heart,49,160,118,100,80,32,,11.0,33,0.8,AMBULANCE,ALERT,EMERGENCY,660293.0,474901,0,474901,25,12,13,Y,38000,19.1406
2,59.0,M,MARRIED,CAD-DVD,41,155,78,70,50,28,,11.0,95,1.7,AMBULANCE,ALERT,EMERGENCY,809130.0,944819,96422,848397,41,20,21,Y,39690,17.0656
3,82.0,M,MARRIED,CAD-TVD,47,164,100,110,80,20,Diabetes,12.0,15,0.8,WALKED IN,ALERT,ELECTIVE,362231.0,390000,30000,360000,18,9,9,N,0,17.4747
4,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension,12.0,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450,26.7299
5,60.0,M,MARRIED,CAD-DVD,58,175,72,180,100,18,Diabetes,10.0,48,1.9,AMBULANCE,ALERT,EMERGENCY,444876.0,254673,10000,244673,24,12,12,N,0,18.9388
6,75.0,M,MARRIED,CAD-DVD,45,140,130,215,140,42,,12.0,29,1.0,AMBULANCE,ALERT,EMERGENCY,372357.0,499987,0,499987,31,9,22,N,0,22.9592


Since we want to coerce all Textual columns to Multiclass, we can write:

In [73]:
head(coerce!(filter_df, Textual => Multiclass))

Unnamed: 0_level_0,age,gender,marital_status,key_complaints_code,body_weight,body_height,hr_pulse,bp_high,bp_low,rr,past_medical_history_code,hb,urea,creatinine,mode_of_arrival,state_at_the_time_of_arrival,type_of_admsn,total_cost_to_hospital,total_amount_billed_to_the_patient,concession,actual_receivable_amount,total_length_of_stay,length_of_stay_icu,length_of_stay_ward,implant_used,cost_of_implant,bmi
Unnamed: 0_level_1,Float64,Cat…,Cat…,Cat…,Int64,Int64,Int64,Int64,Int64,Int64,Cat…,Int64,Int64,Float64,Cat…,Cat…,Cat…,Float64,Int64,Int64,Int64,Int64,Int64,Int64,Cat…,Int64,Float64
1,58.0,M,MARRIED,other- heart,49,160,118,100,80,32,,11,33,0.8,AMBULANCE,ALERT,EMERGENCY,660293.0,474901,0,474901,25,12,13,Y,38000,19.1406
2,59.0,M,MARRIED,CAD-DVD,41,155,78,70,50,28,,11,95,1.7,AMBULANCE,ALERT,EMERGENCY,809130.0,944819,96422,848397,41,20,21,Y,39690,17.0656
3,82.0,M,MARRIED,CAD-TVD,47,164,100,110,80,20,Diabetes,12,15,0.8,WALKED IN,ALERT,ELECTIVE,362231.0,390000,30000,360000,18,9,9,N,0,17.4747
4,46.0,M,MARRIED,CAD-DVD,80,173,122,110,80,24,hypertension,12,74,1.5,AMBULANCE,ALERT,EMERGENCY,629990.0,324910,0,324910,14,13,1,Y,89450,26.7299
5,60.0,M,MARRIED,CAD-DVD,58,175,72,180,100,18,Diabetes,10,48,1.9,AMBULANCE,ALERT,EMERGENCY,444876.0,254673,10000,244673,24,12,12,N,0,18.9388
6,75.0,M,MARRIED,CAD-DVD,45,140,130,215,140,42,,12,29,1.0,AMBULANCE,ALERT,EMERGENCY,372357.0,499987,0,499987,31,9,22,N,0,22.9592


In [74]:
schema(filter_df)

┌────────────────────────────────────┬─────────────────────────────────┬───────────────┐
│ _.names                            │ _.types                         │ _.scitypes    │
├────────────────────────────────────┼─────────────────────────────────┼───────────────┤
│ age                                │ Float64                         │ Continuous    │
│ gender                             │ CategoricalValue{String,UInt32} │ Multiclass{2} │
│ marital_status                     │ CategoricalValue{String,UInt32} │ Multiclass{2} │
│ key_complaints_code                │ CategoricalValue{String,UInt32} │ Multiclass{8} │
│ body_weight                        │ Int64                           │ Count         │
│ body_height                        │ Int64                           │ Count         │
│ hr_pulse                           │ Int64                           │ Count         │
│ bp_high                            │ Int64                           │ Count     

### Exercise 

Why is hb still shown with scientific type 'Count' even though we just coerced it to Continuous?

Pandas dummy variable encoding is as follows:

In [75]:
#encoded_X_df = pd.get_dummies(filter_df[X_features], drop_first = True )
#encoded_X_df.head()

In MLJ, `transform` plays the role for unsupervised models that `predict` plays for supervised ones:
* For a supervised problem, we call `predict`
* For an unsupervised problem (such as the encoder below), we call `transform`

The function 'machine()' binds a model (i.e., a choice of algorithm + hyperparameters) to data. A machine is also the object storing learned parameters. Under the hood, calling fit! on a machine calls either MLJBase.fit or MLJBase.update, depending on the machine's internal state (as recorded in private fields old_model and old_rows). 

In [76]:
ohe = machine(MLJ.OneHotEncoder(drop_last=true), filter_df[:,X_features])
MLJ.fit!(ohe)
encoded_X_df = MLJ.transform(ohe, filter_df[:,X_features]);

┌ Info: Training Machine{OneHotEncoder,…} @269.
└ @ MLJBase /Users/Rahul/.julia/packages/MLJBase/hLtde/src/machines.jl:342
┌ Info: Spawning 1 sub-features to one-hot encode feature :gender.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 1 sub-features to one-hot encode feature :marital_status.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 7 sub-features to one-hot encode feature :key_complaints_code.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 3 sub-features to one-hot encode feature :past_medical_history_code.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 2 sub-features to one-hot encode feature :mode_of_arrival.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 1

In [77]:
head(encoded_X_df)

Unnamed: 0_level_0,age,gender__F,marital_status__MARRIED,key_complaints_code__ACHD,key_complaints_code__CAD-DVD,key_complaints_code__CAD-TVD,key_complaints_code__OS-ASD,key_complaints_code__RHD,key_complaints_code__other- heart,key_complaints_code__other-tertalogy,hr_pulse,bp_high,bp_low,rr,past_medical_history_code__Diabetes,past_medical_history_code__None,past_medical_history_code__hypertension,hb,urea,mode_of_arrival__AMBULANCE,mode_of_arrival__TRANSFERRED,type_of_admsn__ELECTIVE,implant_used__N,cost_of_implant,bmi
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Int64,Int64,Float64,Float64,Float64,Float64,Int64,Float64
1,58.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,118,100,80,32,0.0,1.0,0.0,11,33,1.0,0.0,0.0,0.0,38000,19.1406
2,59.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,78,70,50,28,0.0,1.0,0.0,11,95,1.0,0.0,0.0,0.0,39690,17.0656
3,82.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,100,110,80,20,1.0,0.0,0.0,12,15,0.0,0.0,1.0,1.0,0,17.4747
4,46.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,122,110,80,24,0.0,0.0,1.0,12,74,1.0,0.0,0.0,0.0,89450,26.7299
5,60.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,72,180,100,18,1.0,0.0,0.0,10,48,1.0,0.0,0.0,1.0,0,18.9388
6,75.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,130,215,140,42,0.0,1.0,0.0,12,29,1.0,0.0,0.0,1.0,0,22.9592
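
With `drop_last=true`, the indicator for the last level of each feature is dropped, so a feature with k levels yields k-1 columns (the log above reports 7 sub-features for the 8-level key_complaints_code). The pandas call shown earlier uses `drop_first`, which drops the first level instead. A minimal pure-Julia sketch of the drop-last variant follows; the helper name `onehot_drop_last` is hypothetical and the sketch assumes levels are taken in sorted order.

```julia
# Hypothetical helper: dummy-code one column, dropping the last level.
function onehot_drop_last(col::AbstractVector, name::AbstractString)
    levels = sort(unique(col))
    kept = levels[1:end-1]                       # last level becomes the reference
    M = Float64[x == l for x in col, l in kept]  # n × (k-1) indicator matrix
    colnames = ["$(name)__$(l)" for l in kept]
    return M, colnames
end

mode = ["AMBULANCE", "WALKED IN", "TRANSFERRED", "AMBULANCE"]
M, colnames = onehot_drop_last(mode, "mode_of_arrival")
@show colnames
@show M
```

A row whose value is the dropped level ("WALKED IN" here) is encoded as all zeros, which avoids perfectly collinear dummy columns in a linear model.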


In [78]:
X = Matrix(encoded_X_df);

In [79]:
Y = filter_df[:,"total_cost_to_hospital"];

### Train and test data split using Python

The train/test split can also be done with the **sklearn module**. If we use @sk_import to bring in the train_test_split function from the model_selection module, we get a warning message, because the native ScikitLearn package in Julia already defines train_test_split() in its CrossValidation module. It is therefore better to use it from that module.

MLJ provides the partition() function for the same purpose, but we are not using it here.

In [80]:
#@sk_import model_selection: (train_test_split)

In [81]:
using ScikitLearn.CrossValidation: train_test_split

In [82]:
#from sklearn.model_selection import train_test_split ##Python code 


X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 42);

In [83]:
@show size(X_train)
@show size(X_test)

size(X_train) = (114, 25)
size(X_test) = (49, 25)


(49, 25)
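
What a shuffled hold-out split does can be sketched in a few lines of plain Julia. The helper `simple_split` and the random demo data are hypothetical; the sketch assumes the test size is rounded up, and its seed will not reproduce the exact rows chosen by `train_test_split` with `random_state = 42`.

```julia
using Random

# Hypothetical helper: shuffle row indices, then hold out a test fraction.
function simple_split(X::AbstractMatrix, y::AbstractVector;
                      test_size::Real = 0.3, seed::Integer = 42)
    n = size(X, 1)
    n_test = ceil(Int, test_size * n)          # number of test rows, rounded up
    idx = shuffle(MersenneTwister(seed), 1:n)  # random permutation of the rows
    test_idx, train_idx = idx[1:n_test], idx[n_test+1:end]
    return X[train_idx, :], X[test_idx, :], y[train_idx], y[test_idx]
end

X_demo = rand(MersenneTwister(1), 163, 25)     # same shape as X in this notebook
y_demo = rand(MersenneTwister(2), 163)
Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo; test_size = 0.3)
@show size(Xtr) size(Xte)
```

With 163 rows and `test_size = 0.3`, this gives the same 114/49 split sizes as reported above.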

## Model Building: Using the **sklearn** 



In [84]:
# Create ridge regression object (Python equivalent, kept for reference)
#ridge_reg_model = linear_model.Ridge(alpha = 0.5) #alpha = 0 is same as simple regression with OLS

# Train the model using the training sets
#ridge_reg_model.fit(X_train, y_train)

In [85]:
@sk_import linear_model: (LinearRegression, Ridge)

PyObject <class 'sklearn.linear_model._ridge.Ridge'>

In [86]:
ridge_reg_model = ScikitLearn.fit!(Ridge(alpha=0.5), X_train, y_train)

PyObject Ridge(alpha=0.5)

In [87]:
typeof(ridge_reg_model)

PyCall.PyObject

Building the model is as simple as calling `fit!` on a `Ridge` object. The value of alpha used here is arbitrary, however; since we would like to select the best value of alpha, we will search for it with cross-validation later in this notebook.

In [88]:
# Make predictions using the testing set (Julia-side call)
y_pred = ScikitLearn.predict(ridge_reg_model,X_test);
# Equivalent call through the Python object's own method
y_pred = ridge_reg_model.predict(X_test)

49-element Array{Float64,1}:
 182938.13280732383
 267057.57298137917
 108246.87024747391
 247152.20805478108
 138815.85232697328
 295141.8682028086
 135561.86241886733
 114918.88293925491
 220052.15271643328
 243749.7868393778
 306549.8113897716
 134842.21828385277
 347912.5367159734
      ⋮
 263097.7393053608
 221663.56934151924
 299463.32962183334
 269325.76025941677
 248831.20634915543
 159590.84787834546
 106324.88541929678
  94723.9732604602
 277533.7495631008
 192908.39177949686
 358756.6583930644
 149731.88931181407

In [89]:
# The coefficients
println("Coefficients: \n", ridge_reg_model.coef_)
println("Intercept: \n", ridge_reg_model.intercept_)

Coefficients: 
[2212.538121660945, -14870.784967336791, -69628.98908251722, -12069.877841376046, 131778.78242220942, 109077.18546018447, 19921.428847709547, -67689.57893840299, 34568.0976313315, 38923.064440075534, 860.9539268966311, 427.9497900556872, -1268.5486553915246, 1727.2269513232438, 52840.47928948892, 32039.576283917217, -1498.4573042947532, 323.3507352814162, 285.4631588578064, -45260.391818217635, 363.2167750531941, -71903.63748240218, -110851.63876744456, 1.5643374914094357, -86.69072064626941]
Intercept: 
170020.87868965947
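
To see where such coefficients come from, ridge regression can be written in closed form: with centered features and target, the coefficients solve (XᵀX + αI)β = Xᵀy, and the intercept is recovered from the means so that it is not penalized. The sketch below is an illustration under those assumptions, not the solver scikit-learn uses internally; `ridge_fit`, `ridge_predict`, and the synthetic data are hypothetical names introduced here.

```julia
using LinearAlgebra, Statistics, Random

# Sketch of ridge regression with an unpenalized intercept (closed form).
function ridge_fit(X::AbstractMatrix, y::AbstractVector; alpha::Real = 0.5)
    x_mean = vec(mean(X, dims = 1))
    y_mean = mean(y)
    Xc = X .- x_mean'                  # center the features
    yc = y .- y_mean                   # center the target
    coef = (Xc' * Xc + alpha * I) \ (Xc' * yc)
    intercept = y_mean - dot(x_mean, coef)
    return coef, intercept
end

ridge_predict(X, coef, intercept) = X * coef .+ intercept

# Synthetic data with known coefficients [2, -1, 0.5] and intercept 4
rng = MersenneTwister(0)
X_demo = randn(rng, 200, 3)
y_demo = X_demo * [2.0, -1.0, 0.5] .+ 4.0 .+ 0.01 .* randn(rng, 200)

coef, intercept = ridge_fit(X_demo, y_demo; alpha = 0.5)
@show coef intercept
```

Setting `alpha = 0` in this sketch reduces it to ordinary least squares, which matches the remark in the commented Python code above.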

In [90]:
@which mean_squared_error

LoadError: "mean_squared_error" is not defined in module Main

In [91]:
#from sklearn.metrics import mean_squared_error, r2_score

@sk_import metrics: (mean_squared_error, r2_score)

PyObject <function r2_score at 0x7f91e3f61550>

In [92]:
# The mean squared error
@printf("The mean squared error is %d", mean_squared_error(y_test, y_pred))

The mean squared error is 7621825864

In [93]:
# R-squared (coefficient of determination): 1 is perfect prediction

@printf("The R-squared score is %.2f", r2_score(y_test, y_pred))

The R-squared score is 0.53
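
Both metrics are simple to compute by hand, which helps when interpreting them. The sketch below defines them directly; the function names `mse` and `r2` and the four-point toy vectors are hypothetical and are not the sklearn functions imported above.

```julia
using Statistics

# Mean squared error: average squared residual.
mse(y, yhat) = mean((y .- yhat) .^ 2)

# R-squared: 1 minus residual sum of squares over total sum of squares.
r2(y, yhat) = 1 - sum((y .- yhat) .^ 2) / sum((y .- mean(y)) .^ 2)

y_true = [3.0, -0.5, 2.0, 7.0]
y_hat  = [2.5,  0.0, 2.0, 8.0]
@show mse(y_true, y_hat)
@show r2(y_true, y_hat)
```

The mean squared error is in squared units of the target (squared cost here), which is why it is so large, whereas R-squared is unit-free.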

### Random Search with cross validation

To use RandomizedSearchCV, create a parameter grid from which candidate values are sampled during model building. The same grid is passed to GridSearchCV further down:

In [94]:
alpha = collect(0.1:0.1:10)

# Create the grid
random_grid = Dict("alpha" => alpha)
random_grid

Dict{String,Array{Float64,1}} with 1 entry:
  "alpha" => [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0  …  9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0]

### Model with Grid Search

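To score the search on a chosen metric, use `sklearn.metrics.SCORERS.keys()` to get the list of all the metric names and pass the relevant one as `scoring` in `RandomizedSearchCV` or `GridSearchCV`. (In recent scikit-learn releases `SCORERS` has been replaced by `sklearn.metrics.get_scorer_names()`.)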

In [95]:
#@sk_import metrics: (SCORERS)

#SCORERS

In [96]:
# Use the random grid to search for best hyperparameters
#from sklearn.model_selection import GridSearchCV

In [97]:
#using ScikitLearn.CrossValidation: GridSearchCV
@sk_import model_selection: (GridSearchCV)

└ @ ScikitLearn.Skcore /Users/Rahul/.julia/packages/ScikitLearn/ssekP/src/Skcore.jl:179


PyObject <class 'sklearn.model_selection._search.GridSearchCV'>

In [98]:
# Grid search over alpha, using 3-fold cross-validation

ridge_reg_model = Ridge()
ridge_best_model = GridSearchCV(estimator = ridge_reg_model, 
                                param_grid = random_grid, 
                                scoring = "r2",
                                cv = 3, verbose=0)
# Fit the grid search model
#ridge_best_model.fit(X_train, y_train)

ScikitLearn.fit!(ridge_best_model,X_train,y_train)

PyObject GridSearchCV(cv=3, estimator=Ridge(),
             param_grid={'alpha': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,
        1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,
        2.3,  2.4,  2.5,  2.6,  2.7,  2.8,  2.9,  3. ,  3.1,  3.2,  3.3,
        3.4,  3.5,  3.6,  3.7,  3.8,  3.9,  4. ,  4.1,  4.2,  4.3,  4.4,
        4.5,  4.6,  4.7,  4.8,  4.9,  5. ,  5.1,  5.2,  5.3,  5.4,  5.5,
        5.6,  5.7,  5.8,  5.9,  6. ,  6.1,  6.2,  6.3,  6.4,  6.5,  6.6,
        6.7,  6.8,  6.9,  7. ,  7.1,  7.2,  7.3,  7.4,  7.5,  7.6,  7.7,
        7.8,  7.9,  8. ,  8.1,  8.2,  8.3,  8.4,  8.5,  8.6,  8.7,  8.8,
        8.9,  9. ,  9.1,  9.2,  9.3,  9.4,  9.5,  9.6,  9.7,  9.8,  9.9,
       10. ])},
             scoring='r2')
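
The search above can be sketched in plain Julia: for every candidate alpha, fit on k-1 folds, score R-squared on the held-out fold, average over the folds, and keep the alpha with the best average. This is an illustration only. The helpers `ridge_closed_form`, `cv_score`, and `grid_search` are hypothetical, the folds are assumed to be contiguous and unshuffled, and the data are synthetic, so the selected alpha will not match the notebook's result.

```julia
using LinearAlgebra, Statistics, Random

# Closed-form ridge with an unpenalized intercept (sketch).
function ridge_closed_form(X, y, alpha)
    x_mean, y_mean = vec(mean(X, dims = 1)), mean(y)
    Xc, yc = X .- x_mean', y .- y_mean
    coef = (Xc' * Xc + alpha * I) \ (Xc' * yc)
    return coef, y_mean - dot(x_mean, coef)
end

r2(y, yhat) = 1 - sum((y .- yhat) .^ 2) / sum((y .- mean(y)) .^ 2)

# Mean validation R-squared of one alpha over k contiguous folds.
function cv_score(X, y, alpha; k = 3)
    n = size(X, 1)
    bounds = round.(Int, range(0, stop = n, length = k + 1))
    scores = map(1:k) do f
        val = (bounds[f] + 1):bounds[f + 1]      # held-out rows
        train = setdiff(1:n, val)                # remaining rows
        coef, intercept = ridge_closed_form(X[train, :], y[train], alpha)
        r2(y[val], X[val, :] * coef .+ intercept)
    end
    return mean(scores)
end

# Exhaustive search: score every alpha, keep the best one.
function grid_search(X, y, alphas; k = 3)
    scores = [cv_score(X, y, a; k = k) for a in alphas]
    return alphas[argmax(scores)], maximum(scores)
end

rng = MersenneTwister(42)
X_demo = randn(rng, 90, 5)
y_demo = X_demo * [3.0, 0.0, -2.0, 0.0, 1.0] .+ randn(rng, 90)
best_alpha, best_score = grid_search(X_demo, y_demo, collect(0.1:0.1:10))
@show best_alpha best_score
```

A grid search evaluates all 100 alphas; a randomized search would score only a random subset of them, trading some accuracy for speed.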

### Report the parameter

The best model has the following parameter, selected from the search grid:

In [99]:
ridge_best_model.best_params_

Dict{Any,Any} with 1 entry:
  "alpha" => 1.5

## Model Evaluation


### 1. The prediction on test data.

The prediction can be carried out in two equivalent ways: through the Julia function `ScikitLearn.predict`, or through the Python object's own `predict` method. Both calls are shown below and return the same values.

In [100]:
y_pred = ScikitLearn.predict(ridge_best_model,X_test);
y_pred = ridge_best_model.predict(X_test)

49-element Array{Float64,1}:
 176111.07181267103
 261363.78119535543
 105486.04023229908
 261956.7804636434
 143030.5329592488
 306177.78078664414
 141534.58314760696
 122197.40569525906
 208940.93064918497
 238011.7596422452
 307050.04226832214
 134140.48933373034
 348179.6428613549
      ⋮
 275076.1542552053
 218785.4416889535
 274258.5340234546
 262266.5196967971
 244105.48774713802
 164836.29872330086
 104397.43067207298
 102777.24456481021
 272307.24122067285
 189837.98429433335
 358149.0965153544
 144811.2085408224

In [101]:
@show mean_squared_error(y_test, y_pred)
@show r2_score(y_test, y_pred)

mean_squared_error(y_test, y_pred) = 6.448588927410615e9
r2_score(y_test, y_pred) = 0.6010198991261695


0.6010198991261695

## Deployment - Save model

Save the model using JLD and PyCallJLD (needed when the model was imported with @sk_import)

In [102]:
#from sklearn.externals import joblib
#import joblib
#import pickle

#joblib.dump( ridge_best_model, "ridge_best_model.joblib" )
#pickle.dump(ridge_best_model,open("ridge_best_model.pkl",'wb'))

In [103]:
JLD.save("ridge_best_model.jld", "model", ridge_best_model)
#@save "ridge_best_model1.JLD" ridge_best_model

## Use model on New Cases

We can load the model object for later use. Here we assume that X_test is new data on which we want to use the model.

In [104]:
new_model = JLD.load("ridge_best_model.jld", "model")    # Load it back

#@load "ridge_best_model1.JLD" ridge_best_model

PyObject GridSearchCV(cv=3, estimator=Ridge(),
             param_grid={'alpha': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,
        1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,
        2.3,  2.4,  2.5,  2.6,  2.7,  2.8,  2.9,  3. ,  3.1,  3.2,  3.3,
        3.4,  3.5,  3.6,  3.7,  3.8,  3.9,  4. ,  4.1,  4.2,  4.3,  4.4,
        4.5,  4.6,  4.7,  4.8,  4.9,  5. ,  5.1,  5.2,  5.3,  5.4,  5.5,
        5.6,  5.7,  5.8,  5.9,  6. ,  6.1,  6.2,  6.3,  6.4,  6.5,  6.6,
        6.7,  6.8,  6.9,  7. ,  7.1,  7.2,  7.3,  7.4,  7.5,  7.6,  7.7,
        7.8,  7.9,  8. ,  8.1,  8.2,  8.3,  8.4,  8.5,  8.6,  8.7,  8.8,
        8.9,  9. ,  9.1,  9.2,  9.3,  9.4,  9.5,  9.6,  9.7,  9.8,  9.9,
       10. ])},
             scoring='r2')
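
The same save-and-reload round trip can be illustrated for a plain Julia object with the standard-library `Serialization` module. This is a sketch under the assumption that the model is pure Julia data; it is not the JLD/PyCallJLD route used above and does not apply to a `PyObject`. The named tuple `model`, the helper `linear_predict`, and the file name are hypothetical.

```julia
using Serialization

# A plain Julia stand-in for a fitted linear model (hypothetical values).
model = (coef = [2.0, -1.0, 0.5], intercept = 4.0)

path = joinpath(mktempdir(), "ridge_model.jls")
serialize(path, model)            # write the object to disk
restored = deserialize(path)      # read it back

# Apply the restored model to new rows.
linear_predict(m, X) = X * m.coef .+ m.intercept
X_new = [1.0 0.0 2.0; 0.0 1.0 0.0]
@show linear_predict(restored, X_new)
```

As with the JLD file above, the restored object should give the same predictions as the original one.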

Predict on the test set:

In [105]:
new_model.estimator

PyObject Ridge()

In [106]:
new_model.predict( X_test )

49-element Array{Float64,1}:
 176111.07181267103
 261363.78119535543
 105486.04023229908
 261956.7804636434
 143030.5329592488
 306177.78078664414
 141534.58314760696
 122197.40569525906
 208940.93064918497
 238011.7596422452
 307050.04226832214
 134140.48933373034
 348179.6428613549
      ⋮
 275076.1542552053
 218785.4416889535
 274258.5340234546
 262266.5196967971
 244105.48774713802
 164836.29872330086
 104397.43067207298
 102777.24456481021
 272307.24122067285
 189837.98429433335
 358149.0965153544
 144811.2085408224

Model performance on the test set

In [107]:
new_model.score(X_test,y_test)

0.6010198991261695


#### End of Document

***
***
