# Diabetes 130-US hospitals for years 1999-2008 Data Set

**Abstract:** This case has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes.

**Each phase of the process:**
1. [Business understanding](#Businessunderstanding)
    1. [Assess the Current Situation](#Assessthecurrentsituation)
        1. [Inventory of resources](#Inventory)
        2. [Requirements, assumptions and constraints](#Requirements)
        3. [Risks and contingencies](#Risks)
        4. [Terminology](#Terminology)
        5. [Costs and benefits](#CostBenefit)
    2. [What are the Desired Outputs](#Desiredoutputs)
    3. [What Questions Are We Trying to Answer?](#QA)
2. [Data Understanding](#Dataunderstanding)
    1. [Initial Data Report](#Datareport)
    2. [Describe Data](#Describedata)
    3. [Initial Data Exploration](#Exploredata) 
    4. [Verify Data Quality](#Verifydataquality)
        1. [Missing Data](#MissingData) 
        2. [Outliers](#Outliers) 
    5. [Data Quality Report](#Dataqualityreport)
3. [Data Preparation](#Datapreparation)
    1. [Select Your Data](#Selectyourdata)
    2. [Cleanse the Data](#Cleansethedata)
        1. [Label Encoding](#labelEncoding)
        2. [Drop Unnecessary Columns](#DropCols)
        3. [Altering Datatypes](#AlteringDatatypes)
        4. [Dealing With Zeros](#DealingZeros)
    3. [Construct Required Data](#Constructrequireddata)
    4. [Integrate Data](#Integratedata)
4. [Exploratory Data Analysis](#EDA)
5. [Modelling](#Modelling)
    1. [Modelling Technique](#ModellingTechnique)
    2. [Modelling Assumptions](#ModellingAssumptions)
    3. [Build Model](#BuildModel)
    4. [Assess Model](#AssessModel)
6. [Evaluation](#Evaluation)
7. [Deployment](#Deployment)

# 1. Stage One - Determine Business Objectives and Assess the Situation  <a class="anchor" id="Businessunderstanding"></a>

## 1.1 Assess the Current Situation<a class="anchor" id="Assessthecurrentsituation"></a>

There are currently no insights into hospital readmissions of diabetes patients.

### 1.1.1. Inventory of resources <a class="anchor" id="Inventory"></a>
List the resources available to the project including:
- Personnel: 1 "full stack" DS Coordinator
- Data: Diabetes 130-US hospitals for years 1999-2008 https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
- Computing resources: Intel i7 8-core @ 2.8 GHz, GPU GTX 1060 6 GB GDDR5, 16 GB RAM.
- Software: Linux, Python 3.7, Visual Studio, Jupyter Notebook


### 1.1.2. Requirements, assumptions and constraints <a class="anchor" id="Requirements"></a> 
Predict three categories of patient readmission:
* Less than 30: if the patient was readmitted in less than 30 days  
* More than 30: if the patient was readmitted in more than 30 days  
* No record: for no record of readmission  
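
As a minimal sketch of these three classes, the mapping below assumes the target column is named `readmitted` and uses the raw codes `<30`, `>30` and `NO`, as in the UCI release; this should be verified against the loaded file. `READMISSION_LABELS` and `readmission_category` are hypothetical names introduced only for illustration.

```python
# Hypothetical mapping from raw `readmitted` codes to the three classes above.
# Assumption: the raw codes are "<30", ">30" and "NO"; check df["readmitted"].unique().
READMISSION_LABELS = {
    "<30": "Less than 30",
    ">30": "More than 30",
    "NO": "No record",
}


def readmission_category(raw_value):
    """Return the readable class for one raw code, raising on unknown codes."""
    key = str(raw_value).strip().upper()
    if key not in READMISSION_LABELS:
        raise ValueError(f"Unexpected readmitted code: {raw_value!r}")
    return READMISSION_LABELS[key]


# Made-up sample values, not taken from the dataset.
sample = ["NO", ">30", "<30", "NO"]
print([readmission_category(v) for v in sample])
```

On a DataFrame, the same dictionary could be passed to `df["readmitted"].map(READMISSION_LABELS)`, provided the column name assumption holds.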


### 1.1.3. Risks and contingencies <a class="anchor" id="Risks"></a>
- Changes in current patient behaviour in contrast to the given dataset.  
I suggest updating the data. Because this prediction has a very high impact on patients' quality of life, we should not run online predictions on patients with a model trained on a dataset that is more than 10 years old. The cost of a false negative could be very high. TODO: CHECK

### 1.1.4. Terminology <a class="anchor" id="Terminology"></a>
- Not applicable at the moment.

### 1.1.5. Costs and benefits  <a class="anchor" id="CostBenefit"></a>
- This is being handled by the PM team.

## 1.2 What are the desired outputs of the project? <a class="anchor" id="Desiredoutputs"></a>


**Business success criteria**
- Reduce the cost of diabetes readmissions by 10% during the 12 months after the prediction model goes online, following the patient trial phase.

**Data mining success criteria**
- F-score above 90%
- Get the MVP in one day
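
The F-score criterion above does not say how the score is averaged across the three readmission classes. The sketch below assumes macro averaging (the unweighted mean of the per-class F1 scores, where F1 = 2PR / (P + R)); the function names and the sample labels are hypothetical and only illustrate the calculation.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Made-up labels and predictions, for illustration only.
y_true = ["<30", ">30", "NO", "NO", ">30", "<30"]
y_pred = ["<30", "NO", "NO", "NO", ">30", ">30"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}, meets 0.90 target: {score >= 0.90}")
```

If the classes turn out to be imbalanced, a weighted or per-class target may be more informative than a single macro score; that choice is left open here.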


**Produce project plan**
- https://github.com/wiflore/Diabetes-ML-Case/projects/1


## 1.3 What Questions Are We Trying To Answer? <a class="anchor" id="QA"></a>

- How can we identify which patients will come back within the next 30 days due to potentially poor diabetes treatment?

# 2. Stage Two - Data Understanding <a class="anchor" id="Dataunderstanding"></a>

## 2.1 Initial Data Report <a class="anchor" id="Datareport"></a>

In [None]:
# Import Libraries Required
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

In [None]:
#Data source: 
#Source Query location: 
path =  'F:/Projects/Data Science/Defaults/train_/train.csv'
# Read the CSV file; the first row is used as the column header (pandas default)
df = pd.read_csv(path, sep=',')

## 2.2 Describe Data <a class="anchor" id="Describedata"></a>

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.head(5)

## 2.3 Verify Data Quality <a class="anchor" id="Verifydataquality"></a>

### 2.3.1. Missing Data <a class="anchor" id="MissingData"></a>

In [None]:
df.isnull().sum()

In [None]:
def missing_values_table(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns

In [None]:
missing_values_table(df)

In [None]:
# Get the columns with > 50% missing
missing_df = missing_values_table(df)
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

In [None]:
# Drop the columns (columns= is required; drop() removes rows by label by default)
df = df.drop(columns=missing_columns)

### 2.3.2. Outliers <a class="anchor" id="Outliers"></a>
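
One common way to flag outliers in a numeric column is Tukey's IQR rule: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are marked for review. The sketch below is an assumption about how this section might be filled in, not the notebook's chosen method; `quantile`, `iqr_outliers` and the sample values are hypothetical.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up lengths of stay in days, for illustration only.
stays = [1, 2, 2, 3, 3, 3, 4, 4, 5, 14]
print(iqr_outliers(stays))
```

Flagged values are candidates for inspection rather than automatic removal, since long stays may be clinically meaningful.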

## 2.4 Initial Data Exploration  <a class="anchor" id="Exploredata"></a>

### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

In [None]:
def count_values_table(series):
        # Count each distinct value and express it as a share of all rows
        count_val = series.value_counts()
        count_val_percent = (100 * count_val / len(series)).round(1)
        # Name the columns at concat time: both Series share one name,
        # so renaming the columns 0 and 1 afterwards would have no effect
        count_val_table = pd.concat(
            [count_val, count_val_percent], axis=1,
            keys=['Count Values', '% of Total Values'])
        return count_val_table

In [None]:
# Histogram
def hist_chart(df, col):
        plt.style.use('fivethirtyeight')
        plt.hist(df[col].dropna(), edgecolor = 'k');
        plt.xlabel(col); plt.ylabel('Number of Entries'); 
        plt.title('Distribution of '+col);

In [None]:
# Placeholder column name carried over from the template; replace it with a column present in df
col = 'account_risk_band'
# Histogram & Results
hist_chart(df, col)
count_values_table(df[col])

### 2.4.2 Correlations  <a class="anchor" id="Correlations"></a>

In [None]:
# Seaborn makes it easy to draw a correlogram (pairwise correlation plot).
#sns.pairplot(df.dropna().drop(['x'], axis=1), hue='y', kind ='reg')

#plt.show()


In [None]:
#df_agg = df.drop(['x'], axis=1).groupby(['y']).sum()
df_agg = df.groupby(['y']).sum()

### Differencing

In [None]:
df_dif_agg = df_agg

In [None]:
#Differencing
#Specifically, a new series is constructed where the value at the current time step is calculated 
#as the difference between the original observation and the observation at the previous time step.
#value(t) = observation(t) - observation(t-1)
df_dif = df_dif_agg.diff()
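
The differencing rule in the comments above, value(t) = observation(t) - observation(t-1), can be checked by hand on a small list. This is a sketch with made-up numbers; the `difference` helper is hypothetical and only mirrors what `DataFrame.diff()` computes row by row.

```python
def difference(series):
    """Return first differences; the first entry has no predecessor, so it is None."""
    if not series:
        return []
    return [None] + [curr - prev for prev, curr in zip(series, series[1:])]


# Made-up observations, for illustration only.
observations = [10, 12, 15, 11]
print(difference(observations))
```

In pandas the first row of the result is `NaN` rather than `None`. Note also that differencing is only meaningful when the rows have a natural order, such as time steps.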

## 2.5 Data Quality Report <a class="anchor" id="Dataqualityreport"></a>

# 3. Stage Three - Data Preparation <a class="anchor" id="Datapreparation"></a>

## 3.1 Select Your Data <a class="anchor" id="Selectyourdata"></a>

In [None]:
# Assumes `test` is a DataFrame loaded the same way as `df`; it is not defined earlier in this notebook
X_train_regr = df.drop(['date_maint', 'account_open_date'], axis = 1)
X_train = df.drop(['target', 'date_maint', 'account_open_date'], axis = 1)
X_test = test.drop(['date_maint', 'account_open_date'], axis = 1)

## 3.2 Clean The Data <a class="anchor" id="Cleansethedata"></a>

### 3.2.1 Label Encoding <a class="anchor" id="labelEncoding"></a>

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# CAT_COLS must list the categorical column names; it is not defined earlier in this notebook
for col in CAT_COLS:
        encoder = LabelEncoder()
        X_train[col] = encoder.fit_transform(X_train[col].astype(str))
        X_test[col] = encoder.transform(X_test[col].astype(str))
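
One caveat with the loop above: `LabelEncoder.transform` raises a `ValueError` when the test data contains a category that was not present during `fit`. The sketch below shows one possible workaround, assuming it is acceptable to map unseen categories to a sentinel code of -1; `fit_label_map`, `encode` and the sample values are hypothetical and not part of the original notebook.

```python
def fit_label_map(values):
    """Assign consecutive integer codes to the sorted distinct training values."""
    distinct = sorted(set(str(v) for v in values))
    return {label: code for code, label in enumerate(distinct)}


def encode(values, label_map, unknown=-1):
    """Encode values with a fitted map; unseen categories get the sentinel code."""
    return [label_map.get(str(v), unknown) for v in values]


# Made-up category values, for illustration only.
train_gender = ["Female", "Male", "Female"]
test_gender = ["Male", "Unknown/Invalid"]

gender_map = fit_label_map(train_gender)
print(encode(train_gender, gender_map))
print(encode(test_gender, gender_map))
```

Whether a sentinel code is appropriate depends on the model; tree-based models usually tolerate it, while linear models may treat -1 as an ordered value.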

In [None]:
df["column"] = df["column"].astype('category')
df.dtypes

In [None]:
df["column"] = df["column"].cat.codes
df.head()

### 3.2.2 Drop Unnecessary Columns <a class="anchor" id="DropCols"></a>

In [None]:
del_col_list = ['col1', 'col2']

df = df.drop(del_col_list, axis=1)
df.head()

### 3.2.3 Altering Data Types <a class="anchor" id="AlteringDatatypes"></a>
Sometimes we may need to alter data types, including converting to or from the object datatype.

In [None]:
#df['date'] = pd.to_datetime(df['date'])

### 3.2.4 Dealing With Zeros <a class="anchor" id="DealingZeros"></a>

In [None]:
#cols = ['col1', 'col2']
#df[cols] = df[cols].replace(0, np.nan)

In [None]:
# dropping all the rows with na in the columns mentioned above in the list.

# df.dropna(subset=cols, inplace=True)


### 3.2.5 Dealing With Duplicates <a class="anchor" id="DealingDuplicates"></a>
Remove duplicate rows. **Note:** you may not want to do this; add or remove this step as required.

In [None]:
#df = df.drop_duplicates(keep='first')

## 3.3 Construct Required Data   <a class="anchor" id="Constructrequireddata"></a>


## 3.4 Integrate Data  <a class="anchor" id="Integratedata"></a>

### Construct Our Primary Data Set
Join data 

# 4. Stage Four - Exploratory Data Analysis <a class="anchor" id="EDA"></a>

# 5. Stage Five - Modelling <a class="anchor" id="Modelling"></a>
As the first step in modelling, you'll select the actual modelling technique that you'll be using. Although you may have already selected a tool during the business understanding phase, at this stage you'll be selecting the specific modelling technique e.g. decision-tree building with C5.0, or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.



## 5.1. Modelling technique <a class="anchor" id="ModellingTechnique"></a>
Document the actual modelling technique that is to be used.

Import Models below:

## 5.2. Modelling assumptions <a class="anchor" id="ModellingAssumptions"></a>

## 5.3. Build Model <a class="anchor" id="BuildModel"></a>


## 5.4. Assess Model <a class="anchor" id="AssessModel"></a>
