This notebook is a simple starter: it walks through some basic EDA and baseline models that can help you crack the 0.42 mark on the leaderboard.

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from scipy import stats 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import collections

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



Here we are given a problem where we need to classify the length of stay of patients in the hospital from a set of features. As this is a multiclass classification problem, we may run into class imbalance, but we will try to tackle that later.

Let's begin by reading our train and test files.

In [None]:
train = pd.read_csv("../input/av-healthcare-analytics-ii/healthcare/train_data.csv")
train.head()

In [None]:
test = pd.read_csv("../input/av-healthcare-analytics-ii/healthcare/test_data.csv")
test.head()

In [None]:
train.shape, test.shape

In [None]:
train.info()

Looking at the dataset, we can see that it consists of both numerical and categorical variables, which need to be treated before they can be used to build our model.

### Basic EDA and dataset insights

In [None]:
plt.rcParams["figure.figsize"] = 12, 8
sum_ad_deposit = train.groupby("Stay").agg({"Admission_Deposit": "sum"})
sum_ad_deposit.plot(kind = "bar")
plt.title("Sum of Admission deposit")
plt.show()

We can clearly see that the summed Admission Deposits are very high for certain classes (11-20, 21-30, 31-40) and low for others. Let's check whether the class counts have something to do with this.

In [None]:
collections.Counter(train["Stay"])

Yes, the class counts drive this: the three classes above (11-20, 21-30, 31-40) have the most rows, so their summed admission deposits are also the largest.
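
To separate the effect of class size from the deposit itself, it can help to look at the count, sum and mean side by side. The sketch below uses a small made-up table (the values are not from the competition data) just to show the idea: a class with more rows gets a larger sum even when its mean is not the highest.

```python
import pandas as pd

# Toy data, made up for illustration: class "B" has more rows than "A" or "C"
toy = pd.DataFrame({
    "Stay": ["A", "B", "B", "B", "C"],
    "Admission_Deposit": [5000.0, 4000.0, 4500.0, 5000.0, 6000.0],
})

# count, sum and mean per class in one table
summary = toy.groupby("Stay")["Admission_Deposit"].agg(["count", "sum", "mean"])
print(summary)
```

On the real data the same `agg(["count", "sum", "mean"])` call on `train` would show whether any class actually has a higher typical deposit.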

Let's now check how Visitors with Patient relates to the Stay variable.

In [None]:
# pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.boxenplot(x=train["Stay"], y=train["Visitors with Patient"])

As expected, the number of visitors with the patient tends to increase as the stay gets longer.
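
A quick numeric check can back up what the plot suggests. This is only a sketch on a made-up table (the numbers are illustrative, not taken from the dataset): it computes the median number of visitors per stay class and checks that it does not decrease.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Stay": ["0-10", "0-10", "11-20", "11-20", "21-30", "21-30"],
    "Visitors with Patient": [2, 2, 3, 4, 4, 6],
})

# Median visitors per stay class; groupby sorts the class labels alphabetically
visitors_by_stay = toy.groupby("Stay")["Visitors with Patient"].median()
print(visitors_by_stay)
print("non-decreasing:", visitors_by_stay.is_monotonic_increasing)
```

Note that the alphabetical order of the labels only matches the order of stay length for labels like these; on the real data the classes may need to be ordered by hand.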

Let's try building a baseline model and then develop it further.

In [None]:
train_X = train.drop("Stay", axis = 1) #dropping the dependent variable for preprocessing
full_data = pd.concat([train_X, test], ignore_index= True)

Check if there are any NaN values in the dataset

In [None]:
full_data.isnull().sum()

As we can see, our dataset does have some NaN values, which we can treat by simply imputing the most frequent value (the mode) of each affected column. Though these columns are floats, I still consider them categories, hence the mode rather than the mean or median.
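
The choice of strategy matters for a category-like column. The toy example below (made-up grades, not the real column) compares `strategy="most_frequent"` with `strategy="median"` in `SimpleImputer`: the median can land on a value that is not one of the existing categories.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy category-like column stored as floats, with one missing value at the end
toy = pd.DataFrame({"grade": [1.0, 1.0, 1.0, 3.0, 4.0, 4.0, np.nan]})

# Value used to fill the missing row under each strategy
mode_fill = SimpleImputer(strategy="most_frequent").fit_transform(toy)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(toy)[-1, 0]

print("most_frequent ->", mode_fill)   # an existing grade
print("median        ->", median_fill) # halfway between 1 and 3, not an existing grade
```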

In [None]:
# strategy='most_frequent' imputes the mode of each column
simple_impute_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
full_data["Bed Grade"] = simple_impute_mode.fit_transform(full_data[["Bed Grade"]]).ravel()
full_data["City_Code_Patient"] = simple_impute_mode.fit_transform(full_data[["City_Code_Patient"]]).ravel()

In [None]:
full_data.isnull().sum()

Now that our dataset has no NaN values, we can concentrate on converting the "object" dtype columns to numbers. For this we will use LabelEncoder. We could also use other techniques such as one-hot encoding (OneHotEncoder, pd.get_dummies), but for now we will go with the basic option.
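
To make the difference between the two encodings concrete, here is a small sketch on a made-up column (the category names are placeholders). LabelEncoder maps each category to one integer in a single column, while `pd.get_dummies` creates one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column, made up for illustration
toy = pd.DataFrame({"Department": ["surgery", "radiotherapy", "surgery", "anesthesia"]})

# Label encoding: one integer per category, assigned in sorted order of the labels
label_encoded = LabelEncoder().fit_transform(toy["Department"])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["Department"], prefix="Department")

print(label_encoded)
print(one_hot)
```

Label encoding keeps the feature count unchanged, which is convenient for tree models; one-hot encoding avoids implying an order between categories at the cost of more columns.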

## LabelEncoding the categorical variables

In [None]:
Lab_enc = LabelEncoder()

for i in full_data.columns:
    if full_data[i].dtype == "object":
        full_data[i] = Lab_enc.fit_transform(full_data[i])

Let's also encode our dependent variable using LabelEncoder.

In [None]:
y = Lab_enc.fit_transform(train["Stay"])
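
Once the target is encoded, the model predicts integers rather than the original Stay labels. If a submission file is expected to contain the original labels (an assumption here), the fitted encoder can map predictions back with `inverse_transform`. A minimal sketch with made-up labels and stand-in predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy target labels, made up for illustration
stay_labels = ["0-10", "11-20", "21-30", "11-20"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(stay_labels)    # labels -> integers

# Stand-in for the output of model.predict(...)
fake_preds = np.array([2, 0, 1])
decoded = encoder.inverse_transform(fake_preds)  # integers -> original labels

print(encoded)
print(decoded)
```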

### Helper Functions

In [None]:
def metric(model, pred, y_valid ):
    if hasattr(model, 'oob_score_'): 
        return (accuracy_score(y_valid, pred)) * 100, model.oob_score_
    else:
        return (accuracy_score(y_valid, pred)) * 100

def get_sample(df,y, number):
    df = df.sample(number)
    return df, y[df.index]

def split(X, y, pct = 0.2):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=pct, stratify = y )
    return X_train, X_valid, y_train, y_valid


def feat_imp(model, cols):
    return pd.DataFrame({"Col_names": cols, "Importance": model.feature_importances_}).sort_values("Importance", ascending=False)

def plot_i(fi, x, y):
    return fi.plot(x=x, y=y, kind="barh", figsize=(12, 8))

def create_csv(preds):
    cols = ["case_id", "Stay"]
    sub = pd.DataFrame({"case_id": test["case_id"], "Stay": preds})
    sub = sub[cols]
    
    return sub.to_csv("submission.csv", index = False)


### Sampling the Data

As the dataset is of a decent size, it is better to experiment on data that gives us results in a matter of seconds rather than training on the entire dataset every time. So we will work on a sample of our dataset.


In this section we sample the dataset without thinking much about the balance of the dependent variable in our sample; later we will try to tackle this with stratified sampling. But for now let's keep it simple.
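
For reference, one possible way to draw a stratified sample is to reuse `train_test_split` with `stratify` and keep only the "train" part. The helper name `get_stratified_sample` and the toy data below are hypothetical and only illustrate the idea; this is not the sampling used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def get_stratified_sample(df, y, number, seed=0):
    """Hypothetical helper: draw `number` rows while keeping the class proportions of y."""
    sample_df, _, sample_y, _ = train_test_split(
        df, y, train_size=number, stratify=y, random_state=seed
    )
    return sample_df, sample_y

# Toy data, made up for illustration: 80% class 0, 20% class 1
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = np.array([0] * 80 + [1] * 20)

toy_sample_X, toy_sample_y = get_stratified_sample(toy_X, toy_y, 10)
print(toy_sample_X.shape, np.bincount(toy_sample_y))
```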

In [None]:
#using the helper functions above

sample_X, sample_y = get_sample(full_data[:train.shape[0]], y, 30000)
sample_X.shape, sample_y.shape

## Building our Model

We now have a sample of 30000 rows. We don't know whether it represents the full dataset properly, but it is large enough to check and experiment with.

Let's build our first model.

My go-to model for any machine learning problem is always a **RandomForest**, because it gives you a great baseline without much effort and good insights into your dataset that you can build on.

Before we build our model there is one more important step: setting aside a small portion of our dataset for validation. This helps us see how our model performs on unseen data.


    

In [None]:
x_train , x_valid, y_train, y_valid = split(sample_X, sample_y)

In [None]:
%%time
Rf = RandomForestClassifier(oob_score=True) 
model = Rf.fit(x_train, y_train)
preds = model.predict(x_valid)
print(metric(model, preds, y_valid))

#### Special Mention OOB-Score
***The second score you see is the "out-of-bag score", which I think is one of the best tools for checking how your model performs, not just on the validation data but also on a test set that the RandomForest algorithm creates for itself while building each estimator. Each tree is trained on a bootstrap sample of the rows, and the rows left out of that sample act as a test set for that tree. This helps a lot in understanding whether we are overfitting. Note that `oob_score_` is reported as a fraction between 0 and 1, while the first score is a percentage.***
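
The comparison between the two scores can be tried out on synthetic data. The sketch below uses `make_classification` as a stand-in dataset (an assumption, it has nothing to do with the hospital data) and prints the out-of-bag accuracy next to the accuracy on a held-out validation split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

oob_accuracy = rf.oob_score_            # estimated from rows each tree did not see
valid_accuracy = rf.score(X_va, y_va)   # accuracy on the held-out split

print(f"OOB accuracy:        {oob_accuracy:.3f}")
print(f"Validation accuracy: {valid_accuracy:.3f}")
```

If the two numbers are close, the OOB score is a reasonable proxy for validation performance; a large gap is worth investigating.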

Now that we have our baseline model, let's try tuning it using some of the algorithm's hyperparameters.


- Max_features: This is a very important hyperparameter which tells the algorithm what share of the features is considered at **every node** when looking for the best split. For example, if our dataset has 20 features and max_features is 0.5, each split chooses from a random subset of 10 features.
  Values which work for me: **1** , **0.5**, **"sqrt"** (note that in scikit-learn the integer 1 means a single feature per split, while the float 1.0 means all features)

- Min samples leaf: The minimum number of samples that must end up in a leaf node; a split that would leave fewer samples than this in either child is not made. Values which work for me: **1, 3, 5, 10, 25**

For now these two are sufficient, as we want a good baseline that can get us into the top 40%.
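
To see how scikit-learn interprets different `max_features` settings, we can fit tiny forests on synthetic data with 20 features and read the resolved value from a fitted tree's `max_features_` attribute. The data here is a synthetic stand-in, used only to illustrate the parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 20 features
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

considered = {}
for setting in (1, 0.5, "sqrt", 1.0):
    rf = RandomForestClassifier(n_estimators=5, max_features=setting, random_state=0)
    rf.fit(X_toy, y_toy)
    # max_features_ is the number of features each split actually draws from
    considered[str(setting)] = rf.estimators_[0].max_features_

print(considered)
```

The integer `1` and the float `1.0` behave very differently: the first restricts every split to a single random feature, the second lets every split see all features.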

In [None]:
%%time
Rf = RandomForestClassifier(n_estimators=160, max_features=0.5, min_samples_leaf= 5, oob_score=True) 
model = Rf.fit(x_train, y_train)
preds = model.predict(x_valid)
print(metric(model, preds, y_valid))

We see some improvement in our performance metric, and the OOB score improved as well, so we can say the tuning has improved our model to some extent. You can play further with the hyperparameters at this point, for example with a grid search, to try to get better results.
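
A grid search over the two hyperparameters discussed above could look roughly like this. It is a sketch on synthetic data with a deliberately tiny grid and forest so that it runs quickly; the grid values and dataset are placeholders, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Small illustrative grid over the two hyperparameters
param_grid = {
    "max_features": [0.5, "sqrt"],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```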

Let's now look at some insights from the model:

In [None]:
feat10 = feat_imp(model, sample_X.columns)
feat10[:10]

In [None]:
file4 = pd.read_csv("../input/av-healthcare-analytics-ii/healthcare/train_data_dictionary.csv")
file4

In [None]:
plot_i(feat10, "Col_names", "Importance")

From the above graph we can see that Admission_Deposit and Visitors with Patient are the most important variables. Surprisingly, "case_id" and "patientid" are also very important to the model.

But these are most likely identifier columns with mostly unique values, which can be removed. Let's experiment with the same model hyperparameters after removing these 2 columns and see how it affects our model performance.
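
Impurity-based feature importances tend to favour columns with many distinct values, which is one reason identifier columns can rank highly. A simple way to spot such columns is the share of unique values per column. The sketch below runs on a made-up table, and the 0.8 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5, 6],
    "patientid": [10, 11, 10, 12, 13, 14],
    "Department": ["a", "b", "a", "a", "b", "a"],
})

# Share of distinct values per column; values close to 1 suggest an identifier
unique_ratio = toy.nunique() / len(toy)
id_like = unique_ratio[unique_ratio > 0.8].index.tolist()  # 0.8 is an arbitrary cut-off

print(unique_ratio)
print(id_like)
```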

In [None]:
#Removing caseid and patient id from sample_x

sample_X = sample_X.drop(["case_id", "patientid"], axis = 1 )
sample_X.shape

As we have dropped the 2 columns, let's split the dataset again and build a model on it.

In [None]:
x_train , x_valid, y_train, y_valid = split(sample_X, sample_y)

In [None]:
%%time
Rf = RandomForestClassifier(n_estimators=160, max_features=0.5, min_samples_leaf= 5, oob_score=True) 
model = Rf.fit(x_train, y_train)
preds = model.predict(x_valid)
print(metric(model, preds, y_valid))

In [None]:
feat10 = feat_imp(model, sample_X.columns)
feat10[:12]

As we can see, removing those 2 columns did not have much impact on our model performance; the score dropped a bit, but the feature importances have changed to a great extent.

Let's improve further by trying different models and seeing how they perform on the dataset.

## LGBM 

Let's first use LGBM with all the default parameters to see how it performs on our sample dataset. I prefer specifying the categorical features to the model manually.

For more information about the algorithm, see this excellent explanation of how it works and of its hyperparameters:
https://towardsdatascience.com/understanding-lightgbm-parameters-and-how-to-tune-them-6764e20c6e5b
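
A hand-written list of column names is easy to get wrong, for example through a repeated or misspelled name. Before passing such a list to a model it can be worth checking it against the DataFrame. The helper `clean_feature_list` below is hypothetical and the toy columns are placeholders; it only uses plain pandas and does not depend on LightGBM.

```python
import pandas as pd

def clean_feature_list(names, columns):
    """Hypothetical helper: drop duplicate names (keeping order) and report names missing from `columns`."""
    unique = list(dict.fromkeys(names))            # de-duplicate, preserving order
    present = [n for n in unique if n in columns]
    missing = [n for n in unique if n not in columns]
    return present, missing

# Toy frame with only column names, made up for illustration
toy = pd.DataFrame(columns=["Hospital_code", "Department", "Age"])

present, missing = clean_feature_list(
    ["Hospital_code", "Hospital_code", "Department", "Ward_Type"], toy.columns
)
print("usable:", present)
print("not in the data:", missing)
```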

In [None]:
from lightgbm import LGBMClassifier
# 'Hospital_code' was listed three times; the duplicates are removed here
cat = ['Hospital_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code',
       'City_Code_Patient', 'Type of Admission', 'Severity of Illness', 'Age']


model_Lgm = LGBMClassifier(random_state=45)
model_Lgm.fit(x_train, y_train, categorical_feature=cat)
preds = model_Lgm.predict(x_valid)
print(metric(model_Lgm, preds, y_valid))

That looks great: we have beaten our previous Random Forest score by a nice margin. As we have a good score here, let's try the model on our complete dataset to see how it performs. We will also set some hyperparameters to see if we can improve the overall score, including the regularization parameter to control overfitting.

In [None]:
# training rows only; unlike sample_X, this still includes case_id and patientid
X = full_data[:train.shape[0]]
X.shape, y.shape

In [None]:
x_train, x_valid, y_train, y_valid = split(X, y)
x_train.shape, x_valid.shape, y_train.shape, y_valid.shape

In [None]:
%%time
model_Lgm = LGBMClassifier(n_estimators=160, num_leaves=32, max_depth=5, reg_lambda= 0.3, random_state=46, n_jobs = -1)
model_Lgm.fit(x_train, y_train, categorical_feature=cat)
preds = model_Lgm.predict(x_valid)
print(metric(model_Lgm, preds, y_valid))

Wow, this looks great: the model has worked really well on the entire dataset. Maybe we can tune the parameters further using grid search to obtain better results; we would try that on our sample dataset first.

Any feedback or suggestions would be highly appreciated.

Work In Progress ... 