# Introduction

Exploratory data analysis (EDA) is an essential step in solving any machine learning problem, as it reveals various aspects of the data, such as the features it contains and their nature and distribution. In this problem we will first look at the data and then form some hypotheses related to the problem. Finally, we will analyze the data to check whether our hypotheses hold, and then perform data cleaning and feature engineering to ease the training process.


In [None]:
#importing necessary libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from matplotlib import patches

In [None]:
#setting up path for the data directory and reading csv files
path = Path('../input/av-healthcare-analytics-ii/healthcare')
train = pd.read_csv(path/'train_data.csv')
data_info = pd.read_csv(path/'train_data_dictionary.csv', index_col = 'Column')
test = pd.read_csv(path/'test_data.csv')

In [None]:
data_info

We have a CSV file that describes the parameters present in our data. As we can see, there are 18 parameters in total in our dataset. In this problem we have to predict the patient's length of stay based on the other 17 features.

In [None]:
train.head()

Before we start our EDA, we will create a function which will give us the information about the features we pass into it. This is really helpful when you are dealing with a lot of features.

In [None]:
def get_info(parameter):
    print(data_info.loc[parameter]['Description'])

get_info('Hospital_region_code')

In [None]:
train['Stay'] = train['Stay'].apply(lambda x: x.replace('More than 100 Days', '100+'))

In [None]:
test.head()

In [None]:
print(f"shape of training data is {train.shape}.")
print(f'shape of testing data is {test.shape}.')
percent = (test.shape[0]/train.shape[0])*100
print(f"testing data is {percent:.2f}% of the training data.")

In [None]:
train.info()

In [None]:
train.nunique()

By looking at the number of unique values per feature, we can say that most of the data is categorical. Admission_Deposit is continuous, case_id and patientid appear to be numeric identifiers with many unique values, and the rest of the features are categorical.
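
The split described above can be sketched with a small helper that separates columns by how many unique values they hold. The helper name `split_by_cardinality`, the threshold of 20, and the toy frame are assumptions made for illustration; they are not part of the original notebook, and a high-cardinality column may still be an identifier rather than a continuous feature.

```python
import pandas as pd

def split_by_cardinality(df, max_unique=20):
    """Return (low-cardinality columns, remaining columns) based on nunique()."""
    n_unique = df.nunique()
    categorical = n_unique[n_unique <= max_unique].index.tolist()
    other = n_unique[n_unique > max_unique].index.tolist()
    return categorical, other

# Toy frame standing in for `train`; the values are made up.
toy_train = pd.DataFrame({
    "case_id": range(1, 41),
    "Department": ["gynecology", "surgery"] * 20,
    "Admission_Deposit": [4000.0 + 37 * i for i in range(40)],
})

categorical_cols, other_cols = split_by_cardinality(toy_train, max_unique=20)
print("likely categorical:", categorical_cols)
print("high cardinality:", other_cols)
```

On the real data the same call would be `split_by_cardinality(train)`, with the threshold adjusted after inspecting `train.nunique()`.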

In [None]:
train.describe()

# Things to explore:

1. Hospital and city codes may have very little impact on the stay of patients.
2. There may be a relation between the extra rooms available and the stay of the patient.
3. Department may have a strong relation with the stay, as patients in general departments tend to get discharged more quickly than those in departments such as surgery or TB & chest disease.
4. Distribution of ward types and bed grades.
5. Severity of illness may have a very strong relation with the stay.
6. Do more visitors visit the patient as the stay gets longer?
7. Does the recovery rate depend on age?
8. The admission deposit usually reflects how advanced the hospital is. How long do such hospitals keep their patients?

# Starting the fun stuff:

## Univariate Analysis

As most of the features are categorical, we will take a look at their distributions using bar charts. To save some time and some lines of code, we will write a `bar_chart` function that creates the bar chart for any feature we want to explore.

In [None]:
def bar_chart(parameter, figsize=(8,8)):
    # absolute counts and relative share of each category
    target_counts = train[parameter].value_counts()
    target_perc = target_counts.div(target_counts.sum(), axis=0)
    plt.figure(figsize=figsize)
    ax = sns.barplot(x=target_counts.index.values, y=target_counts.values, order=target_counts.index)
    plt.xticks(rotation=90)
    plt.xlabel(f'{parameter}', fontsize=12)
    plt.ylabel('# of occurrences', fontsize=12)

    # annotate each bar with its percentage of the total
    rects = ax.patches
    labels = np.round(target_perc.values*100, 2)
    for rect, label in zip(rects, labels):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, height + 5, f'{label}%', ha='center', va='bottom')

We'll first take a look at our target, i.e. the stay of the patient in the hospital.

In [None]:
bar_chart('Stay')

Unfortunately, the data is very imbalanced. Patients with a stay of more than 60 days make up only about 10% of the data, while stays of less than 40 days account for almost 80%. We will look at some techniques to tackle this imbalance problem during training.
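
One common way to handle this kind of imbalance is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea, not necessarily the technique used later for training; the helper name `balanced_class_weights` and the toy labels are assumptions.

```python
import pandas as pd

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    weights = len(labels) / (len(counts) * counts)
    return weights.to_dict()

# Toy target standing in for train['Stay']; the values are made up.
toy_stay = pd.Series(["0-10"] * 6 + ["11-20"] * 3 + ["100+"] * 1)

class_weights = balanced_class_weights(toy_stay)
for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

Rare classes receive larger weights, so a model that accepts per-class weights pays more attention to the long stays that are underrepresented.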

In [None]:
bar_chart('Department')

The bar chart shows that most of the patients were admitted to the gynecology department.

In [None]:
bar_chart('Available Extra Rooms in Hospital', figsize=(20, 4))

By looking at the chart, we can say that in general 2-4 extra rooms were available.

In [None]:
get_info('Ward_Type')
bar_chart('Ward_Type')

This shows that almost 97% of the patients were admitted to ward types 'R', 'S' and 'Q'.
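
A figure like this can be read off the chart, but it can also be computed directly. The following is a minimal sketch on made-up data; `top_k_share` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

def top_k_share(series, k):
    """Fraction of rows covered by the k most frequent values."""
    shares = series.value_counts(normalize=True)  # sorted by frequency, descending
    return shares.head(k).sum()

# Toy ward types standing in for train['Ward_Type']; the values are made up.
toy_wards = pd.Series(["R"] * 10 + ["S"] * 6 + ["Q"] * 3 + ["P"] * 1)

share = top_k_share(toy_wards, k=3)
print(f"top 3 ward types cover {share:.0%} of the rows")
```

On the real data the call would be `top_k_share(train['Ward_Type'], 3)`.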

In [None]:
bar_chart('Age')

This shows that about 90% of the patients were between 31 and 60 years old.

In [None]:
bar_chart('Type of Admission')

In [None]:
bar_chart('Severity of Illness')

In [None]:
bar_chart('Bed Grade')

In [None]:
bar_chart('Hospital_region_code')

In [None]:
plt.figure(figsize=(12,4))
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train['Admission_Deposit'], bins=50)

By looking at the distribution of the admission deposits, we can see that it is centered at roughly 5000 and is approximately bell-shaped, i.e. close to Gaussian.
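
A quick numerical check can complement the visual impression. The sketch below compares the mean with the median and reports the sample skewness; a mean close to the median and a skewness near zero are consistent with a symmetric, bell-like shape, although they do not prove normality. The helper name `describe_shape` and the toy values are assumptions.

```python
import pandas as pd

def describe_shape(values):
    """Summarize location and asymmetry of a numeric series."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),  # near 0 for a symmetric distribution
    }

# Toy deposits standing in for train['Admission_Deposit']; the values are made up.
toy_deposit = pd.Series([3000.0, 4000.0, 5000.0, 6000.0, 7000.0])

summary = describe_shape(toy_deposit)
print(summary)
```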

## Bivariate Analysis

Up to this point we have looked at the distributions of some of the important features. Now we will look at how these features affect the target variable, i.e. the stay of the patients in the hospital.

In [None]:
order = ['0-10', '11-20', '21-30', '31-40', 
         '41-50', '51-60', '61-70', '71-80', '81-90', '91-100', '100+']

In [None]:
sns.catplot(x="Stay", hue="Bed Grade", kind="count",
            palette="pastel", edgecolor=".6",
            data=train, height=5, aspect=2, order=order)
plt.xticks(rotation=90);

In [None]:
sns.catplot(x="Stay", hue="Severity of Illness", kind="count",
            palette="colorblind", edgecolor=".9",
            data=train, height=5, aspect=2, order=order)
plt.xticks(rotation=90);

In [None]:
sns.catplot(x='Stay', hue="Type of Admission", kind="count",
            palette="colorblind", edgecolor=".6",
            data=train, height=5, aspect=2, order=order)
plt.xticks(rotation=90);

In [None]:
sns.catplot(x='Stay', hue="Hospital_region_code", kind="count",
            palette="colorblind", edgecolor=".6",
            data=train, height=5, aspect=2, order=order)
plt.xticks(rotation=90);

In [None]:
plt.figure(figsize=(14,8))
sns.boxplot(y='Visitors with Patient',x='Stay', hue="Severity of Illness", order=order, data=train);

This shows that the number of visitors tends to increase with longer stays. However, we can see that there are lots of outliers present in the dataset.
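
The points drawn beyond the whiskers of a box plot follow the usual 1.5 × IQR rule, so the outliers can also be counted explicitly. This is a minimal sketch on made-up data; `iqr_outlier_mask` is a hypothetical helper, and whether such points should be removed or kept is a separate decision.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy visitor counts standing in for train['Visitors with Patient']; made up.
toy_visitors = pd.Series([2, 2, 3, 3, 4, 4, 3, 2, 20])

mask = iqr_outlier_mask(toy_visitors)
print(f"{mask.sum()} outlier(s):", toy_visitors[mask].tolist())
```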

In [None]:
plt.figure(figsize=(14,8))
sns.boxplot(x='Stay',y='Admission_Deposit', order=order, data=train);

The amount deposited doesn't have any strong relation with the stay of the patient in the hospital.
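
To put a number on this impression, one option is to compare the median deposit per stay bucket and to compute a rank correlation between the ordered stay buckets and the deposit. The sketch below uses made-up data and a shortened, hypothetical `toy_order`; the rank correlation is obtained as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Shortened bucket order and toy data; both are made up for illustration.
toy_order = ["0-10", "11-20", "21-30"]
toy = pd.DataFrame({
    "Stay": ["0-10", "0-10", "11-20", "11-20", "21-30", "21-30"],
    "Admission_Deposit": [4000.0, 6000.0, 5500.0, 4500.0, 5000.0, 5200.0],
})

# Treat Stay as an ordered categorical so buckets sort by length of stay.
toy["Stay"] = pd.Categorical(toy["Stay"], categories=toy_order, ordered=True)

# Median deposit per stay bucket.
medians = toy.groupby("Stay", observed=False)["Admission_Deposit"].median()
print(medians)

# Spearman rank correlation computed as Pearson correlation of the ranks.
rho = toy["Stay"].cat.codes.rank().corr(toy["Admission_Deposit"].rank())
print(f"rank correlation: {rho:.3f}")
```

Medians that barely change across buckets and a rank correlation close to zero would support the reading of the box plot.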

In [None]:
sns.catplot(x='Stay', hue="Ward_Facility_Code", kind="count",
            palette="colorblind", edgecolor=".6",
            data=train, height=5, aspect=2, order=order)
plt.xticks(rotation=90);

The ward facility code also does not appear to have a significant relationship with the stay of the patient.
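
Count plots can be hard to compare when the groups differ in size. A row-normalized cross-tabulation shows, for each ward facility code, the share of patients in each stay bucket; similar rows suggest little relationship. The data below is made up, and the variable names are assumptions.

```python
import pandas as pd

# Toy data standing in for the two columns of `train`; the values are made up.
toy = pd.DataFrame({
    "Ward_Facility_Code": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Stay": ["0-10", "0-10", "11-20", "11-20", "0-10", "0-10", "11-20", "11-20"],
})

# normalize="index" makes every row sum to 1, i.e. shares within each facility.
stay_shares = pd.crosstab(toy["Ward_Facility_Code"], toy["Stay"], normalize="index")
print(stay_shares)
```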