# Explanation of the project

Hello everybody, and welcome to my notebook.  
We are going to analyze the database called "Diabetes 130 US hospitals for years 1999-2008". First of all, what is this dataset?  
  
"The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.  
  
The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc." (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008).  
  
I have two objectives with this analysis:
1. First of all, I would like to set up a model that could predict if a patient will be readmitted within 30 days...
2. ... and I want this model to explain the drivers that lead to a readmission, or at least some indicators that can be used to detect such a readmission.   
  
Because of those two objectives, I am going to create "just" a Logistic Regression. The aim of this regression is to estimate the probability of an event given the available variables:
$$P(y=1\mid X) = \frac{e^{\beta X}}{1+e^{\beta X}}$$  
where $y=1$ represents the readmission within 30 days, $X=(x_0, x_1, ..., x_n)$ are the variables, and $\beta = (\beta_0, \beta_1,..., \beta_n)$ the parameters of the model.
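
To make the formula concrete, here is a minimal sketch that evaluates it for one patient. The coefficients and the feature values are made up for illustration (they are not estimates from this dataset), and `logistic_probability` is a hypothetical helper name.

```python
import numpy as np

def logistic_probability(beta, x):
    """P(y=1 | x) for a logistic model; x[0] is expected to be 1 (intercept)."""
    z = np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))  # algebraically the same as e^z / (1 + e^z)

# Hypothetical coefficients and one hypothetical patient (values are made up)
beta = np.array([-2.0, 0.5, 0.3])
x = np.array([1.0, 2.0, 0.0])  # intercept term, x_1, x_2

p = logistic_probability(beta, x)
print(round(p, 3))
```

Each coefficient shifts the log-odds linearly, which is what makes the model easy to interpret compared to a "black-box" approach.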

# Imports

First of all, let's import some libraries we are going to use within this project, and the database.

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
import pandas_profiling
from sklearn.model_selection import train_test_split
import statsmodels.api as sm 
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc, precision_score, recall_score
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from imblearn.over_sampling import SMOTE

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

path = "/kaggle/input/diabetes/diabetic_data.csv"
bdd = pd.read_csv(path)

Now, we are going to make a first, quick analysis of the existing features in order to better understand the data:

In [None]:
pandas_profiling.ProfileReport(bdd)

First things to notice:
1. A lot of information is embedded in this database. Some categorical features have many different possible values, so those variables will need to be transformed (especially diag_1, diag_2 and diag_3, which have more than 700 different values).
2. _weight_ , _payer\_code_ and _medical\_specialty_ have a lot of missing values, so we are going to remove those three features.
3. Some features, especially about drugs, are not well distributed (a lot of _No_ ).  
4. 101766 different values for _encounter_id_ , and only 71518 for _patient_nbr_ : some patients came several times (which is normal because we are tracking readmission), and we can identify those people.
  
Then, we have some cleaning to do before attacking the logistic regression.

First of all, we sort the database by _encounter_id_ . My intuition is that low values represent old admissions, and high values represent recent ones. This is a hypothesis I make.

In [None]:
bdd = bdd.sort_values("encounter_id").reset_index(drop=True)
bdd.head()

# Feature engineering

We create a column named _uno_ full of **1**, which will be used in this part.

In [None]:
bdd["uno"] = 1

## Qualitative features

### Response variable

We set at **True** the objective value if the patient is readmitted within 30 days, and **False** otherwise.

In [None]:
bdd["Objective"] = False
bdd.loc[bdd.readmitted == "<30", "Objective"] = True
bdd = bdd.drop(["readmitted"], axis = 1)

sns.countplot(data = bdd, x = "Objective")
plt.show()

### Other binary features

#### First admission or not

With the hypothesis made on the _encounter\_id_ feature, we check if the patient has already come in this period.

In [None]:
# True when the patient already has an earlier encounter in the sorted data
bdd["alreadyCame"] = bdd.duplicated("patient_nbr", keep="first")
bdd.loc[bdd.duplicated("patient_nbr", keep=False),:].sort_values(["patient_nbr", "encounter_id"]).head(20)

Looking at the ages validates our hypothesis (or at least it does not refute it). Indeed, in the first encounter of patient **5220**, this patient was **[60-70)**, and later this patient was **[70-80)**.

#### Normalization of features _change_ and _diabetesMed_

In [None]:
bdd.loc[bdd.change == "Ch", "change"] = True
bdd.loc[bdd.change == "No", "change"] = False

bdd.loc[bdd.diabetesMed == "Yes", "diabetesMed"] = True
bdd.loc[bdd.diabetesMed == "No", "diabetesMed"] = False

fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x ="change", hue = "Objective", ax = axs[0])
sns.countplot(data = bdd, x ="diabetesMed", hue = "Objective", ax = axs[1])
plt.show()

#### Tests A1C and GluSerum

For the two tests, A1C and GluSerum, we transform the categorical variable, which has 4 possible values, into 2 binary features:
- test norm: if it equals 1, the test has been done and the result is normal.
- test abnorm: if it equals 1, the test has been done and the result is not normal.
- if both equal 0, the test has not been done.

In [None]:
bdd["A1C"] = bdd["A1Cresult"]
bdd.loc[bdd.A1Cresult.isin([">7", ">8"]), "A1C"] = "Abnorm"
fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x = "A1Cresult", hue = "Objective", ax = axs[0])
sns.countplot(data = bdd, x = "A1C", hue = "Objective", ax = axs[1])
plt.show()

bdd = pd.get_dummies(data = bdd, columns = ["A1C"], prefix = "A1C", drop_first=False)
bdd = bdd.drop(["A1Cresult", "A1C_None"], axis = 1)

In [None]:
bdd["GluSerum"] = bdd["max_glu_serum"]
bdd.loc[bdd.max_glu_serum.isin([">200", ">300"]), "GluSerum"] = "Abnorm"
fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x = "max_glu_serum", hue = "Objective", ax = axs[0])
sns.countplot(data = bdd, x = "GluSerum", hue = "Objective", ax = axs[1])
plt.show()

bdd = pd.get_dummies(data = bdd, columns = ["GluSerum"], prefix = "GluSerum", drop_first=False)
bdd = bdd.drop(["max_glu_serum", "GluSerum_None"], axis = 1)

#### Gender

In [None]:
sns.countplot(data = bdd, x = "gender", hue = "Objective")
plt.show()

We normalize this feature and remove the three patients with "Unknown/Invalid" to have a single binary feature.

In [None]:
bdd["isFemale"] = False
bdd.loc[bdd.gender == "Female", "isFemale"] = True
bdd = bdd[bdd.gender != "Unknown/Invalid"]

We check whether the _Gender_ remains the same for each admission. If not, we set it to the most common value for that patient.

In [None]:
people_multiple_gender = bdd.loc[(bdd.duplicated("patient_nbr", keep=False)), ["patient_nbr", "gender", "uno"]].groupby(["patient_nbr", "gender"]).count().reset_index()
people_multiple_gender = people_multiple_gender[people_multiple_gender.duplicated("patient_nbr", keep=False)]

list_nb_to_drop = []

for nb in people_multiple_gender.patient_nbr.unique() :
    nbmin = 1
    suppr = False
    value = ""
    
    for sex in people_multiple_gender.loc[people_multiple_gender.patient_nbr == nb, "gender"] :
        if people_multiple_gender.loc[(people_multiple_gender.patient_nbr == nb) & (people_multiple_gender.gender == sex), "uno"].values[0] == nbmin :
            suppr = True
        else :
            suppr = False
            nbmin = people_multiple_gender.loc[(people_multiple_gender.patient_nbr == nb) & (people_multiple_gender.gender == sex), "uno"].values[0]
            value = sex
        
    
    if suppr :
        list_nb_to_drop.append(nb)
    else :
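        bdd.loc[(bdd.patient_nbr == nb), "gender"] = value
        # isFemale was computed before this correction, so keep it in sync
        bdd.loc[(bdd.patient_nbr == nb), "isFemale"] = (value == "Female")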

bdd = bdd[~bdd.patient_nbr.isin(list_nb_to_drop)]
bdd = bdd.drop("gender", axis = 1)
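
The "most common value" rule can also be written with a `groupby`. The sketch below is one possible formulation on made-up rows, not the notebook's own method: `majority_value` is a hypothetical helper, and the choice to flag patients whose most frequent values are tied (so that they can be dropped) is an assumption.

```python
import pandas as pd

def majority_value(df, key, col):
    """Return (majority value per key, list of keys whose top values are tied)."""
    counts = df.groupby([key, col]).size().rename("n").reset_index()
    # Keep, for each key, the value(s) reaching the highest count
    top = counts.groupby(key)["n"].transform("max")
    winners = counts[counts["n"] == top]
    # A key with several winners has no clear majority
    tied = winners[winners.duplicated(key, keep=False)][key].unique().tolist()
    majority = winners[~winners[key].isin(tied)].set_index(key)[col]
    return majority, tied

# Made-up encounters: patient 1 is mostly "Female", patient 2 is tied
toy = pd.DataFrame({
    "patient_nbr": [1, 1, 1, 2, 2, 3],
    "gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
})

majority, tied = majority_value(toy, "patient_nbr", "gender")
print(majority.to_dict(), tied)
```

The same helper would apply to any per-patient attribute that should be constant, such as _race_.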

#### Race

In [None]:
sns.countplot(data = bdd, x = "race", hue = "Objective")

We do the same check for the feature _Race_.

In [None]:
people_multiple_race = bdd.loc[(bdd.duplicated("patient_nbr", keep=False)), ["patient_nbr", "race", "uno"]].groupby(["patient_nbr", "race"]).count().reset_index()
people_multiple_race = people_multiple_race[people_multiple_race.duplicated("patient_nbr", keep=False)]

list_nb_to_drop = []

for nb in people_multiple_race.patient_nbr.unique() :
    
    list_race = list(people_multiple_race.loc[people_multiple_race.patient_nbr == nb, "race"].unique())
    if "?" in list_race :
        list_race.remove("?")
    
    if len(list_race) == 1 :
        #print(list_race[0])
        bdd.loc[(bdd.patient_nbr == nb), "race"] = list_race[0]
    else : 
        nbmin = 1
        suppr = False
        value = ""

        for rac in list_race :
            if people_multiple_race.loc[(people_multiple_race.patient_nbr == nb) & (people_multiple_race.race == rac), "uno"].values[0] == nbmin :
                suppr = True
            else :
                suppr = False
                nbmin = people_multiple_race.loc[(people_multiple_race.patient_nbr == nb) & (people_multiple_race.race == rac), "uno"].values[0]
                value = rac

    
        # Only relevant when several known races remain for this patient
        if suppr :
            list_nb_to_drop.append(nb)
        else :
            bdd.loc[(bdd.patient_nbr == nb), "race"] = value

bdd = bdd[~bdd.patient_nbr.isin(list_nb_to_drop)]

We then normalize this categorical feature into 5 binary features. If all are set at 0, it means that this variable is not available.

In [None]:
bdd.loc[bdd.race == "?", "race"] = "unavailable"

bdd = pd.get_dummies(data = bdd, columns = ["race"], prefix = "race", drop_first=False)
bdd = bdd.drop("race_unavailable", axis = 1)

#### Diagnoses

There are 3 diagnoses per encounter, with more than 700 possible values. So we tidy them up and normalize those diagnoses into 9 wide categories.

In [None]:
def norm_diag(bdd, diag) :
    if bdd[diag] == "?" :
        return "Unavailable"
    elif bdd[diag][0] == "E" :
        return "Other"
    elif bdd[diag][0] == "V" :
        return "Other"
    else :
        num = float(bdd[diag])
        
        if np.trunc(num) == 250 :
            return "Diabetes"
        elif num <= 139 :
            return "Other"
        elif num <= 279 :
            return "Neoplasms"
        elif num <= 389 :
            return "Other"
        elif num <= 459 :
            return "Circulatory"
        elif num <= 519 :
            return "Respiratory"
        elif num <= 579 :
            return "Digestive"
        elif num <= 629 :
            return "Genitourinary"
        elif num <= 679 :
            return "Other"
        elif num <= 709 :
            return "Neoplasms"
        elif num <= 739 :
            return "Musculoskeletal"
        elif num <= 759 :
            return "Other"
        elif num in [780, 781, 782, 783, 784] : 
            return "Neoplasms"
        elif num == 785 :
            return "Circulatory"
        elif num == 786 :
            return "Respiratory"
        elif num == 787 :
            return "Digestive"
        elif num == 788 :
            return "Genitourinary"
        elif num == 789 :
            return "Digestive"
        elif num in np.arange(790, 800) :
            return "Neoplasms"
        elif num >= 800 :
            return "Injury"
        else :
            return num

In [None]:
bdd["diag_1_norm"] = bdd.apply(norm_diag, axis=1, diag="diag_1")
bdd["diag_2_norm"] = bdd.apply(norm_diag, axis=1, diag="diag_2")
bdd["diag_3_norm"] = bdd.apply(norm_diag, axis=1, diag="diag_3")

list_diag = ['Circulatory', 'Neoplasms', 'Diabetes', 'Respiratory', 'Other', 'Injury', 'Musculoskeletal', 'Digestive', 'Genitourinary']

fig, axs = plt.subplots(3, 1, figsize = (15, 10))
sns.countplot(data = bdd, y = "diag_1_norm", hue = "Objective", ax = axs[0], order = list_diag)
sns.countplot(data = bdd, y = "diag_2_norm", hue = "Objective", ax = axs[1], order = list_diag)
sns.countplot(data = bdd, y = "diag_3_norm", hue = "Objective", ax = axs[2], order = list_diag)
plt.show()

But those three features are not usable as they are. Thus we transform those three categorical features into 9 binary features: for each category, we check whether it has been diagnosed during the admission.

In [None]:
def diag_atleast (bdd, val) :
    if (bdd["diag_1_norm"] == val) | (bdd["diag_2_norm"] == val) | (bdd["diag_3_norm"] == val) :
        return True
    else :
        return False

for val in list_diag :
    name_var = "diag_atleast_"+ val
    print(name_var)
    bdd[name_var] = bdd.apply(diag_atleast, axis = 1, val=val)
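
A row-wise `apply` can be slow on roughly 100,000 rows. A possible vectorized equivalent is sketched here on a small made-up frame (the toy rows are assumptions, not records from the dataset); it compares the three normalized columns at once.

```python
import pandas as pd

# Toy stand-in for the three normalized diagnosis columns (made-up rows)
toy = pd.DataFrame({
    "diag_1_norm": ["Diabetes", "Circulatory", "Other"],
    "diag_2_norm": ["Other", "Other", "Other"],
    "diag_3_norm": ["Circulatory", "Diabetes", "Unavailable"],
})
diag_cols = ["diag_1_norm", "diag_2_norm", "diag_3_norm"]

for val in ["Diabetes", "Circulatory", "Respiratory"]:
    # True when at least one of the three diagnoses falls in category `val`
    toy["diag_atleast_" + val] = toy[diag_cols].eq(val).any(axis=1)

print(toy.filter(like="diag_atleast_"))
```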

In [None]:
fig, axs = plt.subplots(3, 3, figsize = (15, 10))
sns.countplot(data = bdd, x = "diag_atleast_Circulatory", hue = "Objective", ax = axs[0,0])
sns.countplot(data = bdd, x = "diag_atleast_Neoplasms", hue = "Objective", ax = axs[0,1])
sns.countplot(data = bdd, x = "diag_atleast_Diabetes", hue = "Objective", ax = axs[0,2])
sns.countplot(data = bdd, x = "diag_atleast_Respiratory", hue = "Objective", ax = axs[1,0])
sns.countplot(data = bdd, x = "diag_atleast_Other", hue = "Objective", ax = axs[1,1])
sns.countplot(data = bdd, x = "diag_atleast_Injury", hue = "Objective", ax = axs[1,2])
sns.countplot(data = bdd, x = "diag_atleast_Musculoskeletal", hue = "Objective", ax = axs[2,0])
sns.countplot(data = bdd, x = "diag_atleast_Digestive", hue = "Objective", ax = axs[2,1])
sns.countplot(data = bdd, x = "diag_atleast_Genitourinary", hue = "Objective", ax = axs[2,2])
plt.show()

Maybe a single diagnosis is not enough to detect a future readmission. That is why we also check which pairs of diagnosis categories were recorded together.

In [None]:
list_diag_inter = list_diag.copy()

for diag in list_diag :
    list_diag_inter.remove(diag)
    
    for diag2 in list_diag_inter :
        name = "diag_" + diag + "_&_" + diag2
        bdd[name] = (bdd["diag_atleast_" + diag] & bdd["diag_atleast_" + diag2])
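
The pairing logic (each unordered pair exactly once) is what `itertools.combinations` produces. The following sketch uses a shortened, made-up category list and toy indicator columns to show the resulting interaction features; it is an alternative way to write the loop, under those assumptions.

```python
from itertools import combinations
import pandas as pd

categories = ["Circulatory", "Diabetes", "Respiratory"]  # shortened list for the example
toy = pd.DataFrame({
    "diag_atleast_Circulatory": [True, True, False],
    "diag_atleast_Diabetes": [True, False, False],
    "diag_atleast_Respiratory": [False, True, False],
})

pair_columns = []
for first, second in combinations(categories, 2):
    name = "diag_" + first + "_&_" + second
    # True only when both categories were diagnosed during the encounter
    toy[name] = toy["diag_atleast_" + first] & toy["diag_atleast_" + second]
    pair_columns.append(name)

print(pair_columns)
```

With 9 categories this yields 36 pair features, so the number of columns grows quickly.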

In [None]:
bdd = bdd.drop(["diag_1", "diag_2", "diag_3"], axis = 1)

#### Drugs

From the first analysis of the features, we noticed that many drugs are very rarely taken by the patients. The others are more common.

In [None]:
medoc = ['metformin', 'repaglinide', 'nateglinide',
       'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide',
       'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
       'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton',
       'insulin', 'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

medoc_rare = ["nateglinide", "chlorpropamide", "acetohexamide", "tolbutamide",
             "acarbose", "miglitol", "troglitazone", "tolazamide", "examide",
             "citoglipton", "glyburide-metformin", "glipizide-metformin",
             "glimepiride-pioglitazone", "metformin-rosiglitazone", "metformin-pioglitazone"]

medoc_usuels = [med for med in medoc if med not in medoc_rare]

Then, for each "common" drug, we check if the patient takes it. As for the diagnoses, we also determine which combinations of medications the patient is taking.

In [None]:
for med in medoc_usuels :
    name = "take_" + med
    bdd[name] = bdd[med].isin(["Down", "Steady", "Up"])
    
    
fig, axs = plt.subplots(3, 3, figsize = (15, 10))
sns.countplot(data = bdd, x = "take_metformin", hue = "Objective", ax = axs[0,0])
sns.countplot(data = bdd, x = "take_repaglinide", hue = "Objective", ax = axs[0,1])
sns.countplot(data = bdd, x = "take_glimepiride", hue = "Objective", ax = axs[0,2])
sns.countplot(data = bdd, x = "take_glipizide", hue = "Objective", ax = axs[1,0])
sns.countplot(data = bdd, x = "take_glyburide", hue = "Objective", ax = axs[1,1])
sns.countplot(data = bdd, x = "take_pioglitazone", hue = "Objective", ax = axs[1,2])
sns.countplot(data = bdd, x = "take_rosiglitazone", hue = "Objective", ax = axs[2,0])
sns.countplot(data = bdd, x = "take_insulin", hue = "Objective", ax = axs[2,1])
plt.show()

In [None]:
medoc_inter = medoc_usuels.copy()

for med in medoc_usuels :
    
    medoc_inter.remove(med)
    
    for med2 in medoc_inter :
        name = "take_" + med + "_&_" + med2
        bdd[name] = (bdd["take_" + med] & bdd["take_" + med2])

Finally, we count the number of "rare" drugs the patient is taking.

In [None]:
def nbMedocRare (bdd, listMedoc) :
    nb = 0
    for med in listMedoc :
        if bdd[med] != "No" :
            nb += 1
    return nb

bdd["nb_rare_medoc"] = bdd.apply(nbMedocRare, listMedoc = medoc_rare, axis = 1)
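
The same count can be obtained without a row-wise function. This is a small sketch on made-up rows, using only two of the "rare" drug columns for brevity.

```python
import pandas as pd

rare = ["acarbose", "miglitol"]  # two of the "rare" drugs, enough for the example
toy = pd.DataFrame({
    "acarbose": ["No", "Steady", "Up"],
    "miglitol": ["No", "No", "Down"],
})

# Count, per row, how many rare drugs have a status other than "No"
toy["nb_rare_medoc"] = toy[rare].ne("No").sum(axis=1)
print(toy["nb_rare_medoc"].tolist())
```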

#### Admissions and discharges

For _admission_type_id_, _admission_source_id_ and _discharge_disposition_id_, we reduce the number of possible values by regrouping them into $n$ wider categories. Then we create $n-1$ binary features.

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x = "admission_type_id", hue = "Objective", ax = axs[0])
axs[0].set_title("admission_type_id before transformation")

bdd['admission_type_id'] = bdd['admission_type_id'].replace([1, 2, 7], "emergency")
bdd['admission_type_id'] = bdd['admission_type_id'].replace([4, 5, 6, 8], "unavailable")
bdd['admission_type_id'] = bdd['admission_type_id'].replace(3, "elective")

sns.countplot(data = bdd, x = "admission_type_id", hue = "Objective", ax = axs[1])
axs[1].set_title("admission_type_id after transformation")

bdd = pd.get_dummies(data = bdd, columns = ["admission_type_id"], prefix="admission_type", drop_first=False)
bdd = bdd.drop(["admission_type_unavailable"], axis = 1)

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x = "admission_source_id", hue = "Objective", ax = axs[0])
axs[0].set_title("admission_source_id before transformation")

bdd['admission_source_id'] = bdd['admission_source_id'].replace([1, 2, 3], "referral")
bdd['admission_source_id'] = bdd['admission_source_id'].replace([4, 5, 6, 10, 22, 25], "transfert")
bdd['admission_source_id'] = bdd['admission_source_id'].replace([8, 14, 11, 13, 9, 15, 17, 20, 21], "unavailable")
bdd['admission_source_id'] = bdd['admission_source_id'].replace(7, "emergencyRoom")

sns.countplot(data = bdd, x = "admission_source_id", hue = "Objective", ax = axs[1])
axs[1].set_title("admission_source_id after transformation")

bdd = pd.get_dummies(data = bdd, columns = ["admission_source_id"], prefix="admission_source", drop_first=False)
bdd = bdd.drop(["admission_source_unavailable"], axis = 1)

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (10, 5))
sns.countplot(data = bdd, x = "discharge_disposition_id", hue = "Objective", ax = axs[0])
axs[0].set_title("discharge_disposition_id before transformation")

bdd['discharge_disposition_id'] = bdd['discharge_disposition_id'].replace([1, 6, 8, 9, 10], "home")
bdd['discharge_disposition_id'] = bdd['discharge_disposition_id'].replace([2, 3, 4, 5, 14, 22, 23, 24], "transfert")
bdd['discharge_disposition_id'] = bdd['discharge_disposition_id'].replace([18, 25, 26], "unavailable")
# id 10 is already mapped to "home" above, so it is not repeated here
bdd['discharge_disposition_id'] = bdd['discharge_disposition_id'].replace([7, 11, 13, 12, 15, 16, 17, 19, 20, 27, 28], "other")

sns.countplot(data = bdd, x = "discharge_disposition_id", hue = "Objective", ax = axs[1])
axs[1].set_title("discharge_disposition_id after transformation")

bdd = pd.get_dummies(data = bdd, columns = ["discharge_disposition_id"], prefix="discharge_type", drop_first=False)
bdd = bdd.drop(["discharge_type_unavailable"], axis = 1)

## Quantitative features

In [None]:
var_quanti = ["time_in_hospital", 'num_lab_procedures', 'num_procedures', 
              'num_medications', 'number_outpatient', 'number_emergency', 
              'number_inpatient','number_diagnoses', 'nb_rare_medoc']

We transform the qualitative feature _age_ into a quantitative one by taking the upper bound of each interval (an arbitrary choice).

In [None]:
def recup_age (bdd) :
    return int(bdd.age[-4::].replace('-', '').replace(')', ''))

bdd["age_num"] = bdd.apply(recup_age, axis = 1)
var_quanti.append("age_num")
bdd = bdd.drop("age", axis = 1)
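
To check what the upper-bound rule produces, here is a short sketch on three bracket strings in the same format as the _age_ column. It uses a regular expression instead of string slicing, which is an alternative formulation rather than the function above.

```python
import pandas as pd

# Same bracket format as the `age` column
ages = pd.Series(["[0-10)", "[70-80)", "[90-100)"])

# Capture the digits between "-" and ")", i.e. the upper bound of the interval
upper = ages.str.extract(r"-(\d+)\)", expand=False).astype(int)
print(upper.tolist())
```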

We count the number of medications the patient is taking, and the number of changes in their treatment...

In [None]:
def count_num_medoc(bdd) :
    nb = 0
    for med in medoc :
        if bdd[med] != "No" :
            nb += 1
    return nb

def count_num_medoc_chgmnt(bdd) :
    nb = 0
    for med in medoc :
        if (bdd[med] != "No") & (bdd[med] != "Steady") :
            nb += 1
    return nb

bdd["num_medo_arrived"] = bdd.apply(count_num_medoc, axis = 1)
bdd["num_medo_chgmnt"] = bdd.apply(count_num_medoc_chgmnt, axis = 1)

var_quanti.append("num_medo_arrived")
var_quanti.append("num_medo_chgmnt")

... and the proportion those changes represent.

In [None]:
bdd["proportion_chgmnt"] = bdd["num_medo_chgmnt"] / bdd["num_medo_arrived"]
bdd["proportion_chgmnt"] = bdd["proportion_chgmnt"].fillna(0)
var_quanti.append("proportion_chgmnt")

Now, let's have a first visualisation of those data:

In [None]:
fig, axs = plt.subplots(5, 3, figsize = (20, 20))

sns.countplot(data = bdd, y = "time_in_hospital", hue = "Objective", ax = axs[0, 0])
axs[0, 0].set_title('time_in_hospital')

sns.countplot(data = bdd, y = "num_lab_procedures", hue = "Objective", ax = axs[0, 1])
axs[0, 1].set_title('num_lab_procedures')

sns.countplot(data = bdd, y = "num_procedures", hue = "Objective", ax = axs[0, 2])
axs[0, 2].set_title('num_procedures')


sns.countplot(data = bdd, y = "num_medications", hue = "Objective", ax = axs[1, 0])
axs[1, 0].set_title('num_medications')

sns.countplot(data = bdd, y = "number_outpatient", hue = "Objective", ax = axs[1, 1])
axs[1, 1].set_title('number_outpatient')

sns.countplot(data = bdd, y = "number_emergency", hue = "Objective", ax = axs[1, 2])
axs[1, 2].set_title('number_emergency')


sns.countplot(data = bdd, y = "number_inpatient", hue = "Objective", ax = axs[2, 0])
axs[2, 0].set_title('number_inpatient')

sns.countplot(data = bdd, y = "number_diagnoses", hue = "Objective", ax = axs[2, 1])
axs[2, 1].set_title('number_diagnoses')

sns.countplot(data = bdd, y = "num_medo_arrived", hue = "Objective", ax = axs[2, 2])
axs[2, 2].set_title('num_medo_arrived')


sns.countplot(data = bdd, y = "num_medo_chgmnt", hue = "Objective", ax = axs[3, 0])
axs[3, 0].set_title('num_medo_chgmnt')

sns.countplot(data = bdd, y = "age_num", hue = "Objective", ax = axs[3, 1])
axs[3, 1].set_title('age_num')

sns.countplot(data = bdd, y = "proportion_chgmnt", hue = "Objective", ax = axs[3, 2])
axs[3, 2].set_title('proportion_chgmnt')

sns.countplot(data = bdd, y = "nb_rare_medoc", hue = "Objective", ax = axs[4, 0])
axs[4, 0].set_title('nb_rare_medoc')

plt.show()

**First observation**: There is **no obvious difference** between the patients readmitted within 30 days and the others in terms of the distribution of those variables.  
  
**Second observation**: For the feature _num_lab_procedures_, the data seems to be **truncated**. Is this normal? Is this a problem in the dataset? We don't know, but it will obviously affect the future model. So let's keep it in mind, and assume for now that this observation is normal and that there is no problem in the data.  
  
**Third observation**: For a lot of features, the data seems **skewed**. This can also affect the future model by over-representing some information. Let's check the skewness through the unbiased skew coefficient.

In [None]:
bdd[var_quanti].skew()

Two features are particularly skewed: *number_emergency* and *number_outpatient*. Thus we are going to apply a log transformation:  
$$\text{variable}_{\text{log}} = \ln(1+\text{variable})$$  
We add 1 inside the log because the variables can take the value 0.  
  
We don't transform nb_rare_medoc because it can take only 3 different values.

In [None]:
bdd["number_outpatient"] = np.log1p(bdd["number_outpatient"])
bdd["number_emergency"] = np.log1p(bdd["number_emergency"])
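
To illustrate the effect of this transformation, the sketch below applies `np.log1p` to a made-up count variable (many zeros and a few large values, not taken from the dataset) and compares the skew coefficient before and after.

```python
import numpy as np
import pandas as pd

# Made-up count variable with many zeros and a few large values (right-skewed)
counts = pd.Series([0] * 80 + [1] * 10 + [2] * 5 + [5, 8, 12, 20, 40])

skew_before = counts.skew()
skew_after = np.log1p(counts).skew()

print("skew before: {:.2f}, after: {:.2f}".format(skew_before, skew_after))
```

The transformed variable is still skewed because of the mass of zeros, but the long right tail is compressed.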

## Features removal

Now we remove all unnecessary data.

In [None]:
bdd = bdd.drop(["weight", "medical_specialty", 'encounter_id', 'patient_nbr', "payer_code", 'diag_1_norm', 'diag_2_norm', 'diag_3_norm'] + medoc, axis = 1)
bdd = bdd.drop("uno", axis = 1)

Now that we have this new dataset, let's try to analyze it!  
  
As explained earlier, the objective is not to obtain the best results through a "black-box" model. I would like to create a model which determines whether or not a patient is going to be readmitted within 30 days, and what can explain the readmission (or what can help us detect it).  
  
Thus we are going to build a first, and hopefully relevant, parametric model: **A LOGISTIC REGRESSION**!

# Creation of the model

## Before the model

First, we split our database into two: X with the variables, and y with the responses. We also add a column full of **1** to have an intercept for the regression.

In [None]:
X = bdd.drop(["Objective"], axis=1)
y = bdd["Objective"]
print('Original dataset shape {}'.format(Counter(y)))

In [None]:
X.insert(0, "Intercept", 1)

To create a relevant logistic regression, we must select relevant variables. To do so, we are going to create a model with all variables, and then remove the variable with the highest p-value. By iterating, we are going to remove all non-relevant features, until all of our remaining variables have a p-value < 0.05. The function below automates this process.

In [None]:
def reduction_variable_logit(X_train, y_train, showVarToDel=False) :
    ultime_model = False
    var_to_del = []

    while (ultime_model == False) :
        log_reg = sm.Logit(y_train, X_train.drop(var_to_del, axis = 1).astype(float)).fit(maxiter = 100, disp = False)

        max_pvalue = max(log_reg.pvalues)

        if max_pvalue < 0.05 :
            ultime_model = True
        else :
            varToDel = log_reg.pvalues.index[log_reg.pvalues == max(log_reg.pvalues)].values[0]
            if showVarToDel :
                print(varToDel + ", p-value = " + str(max(log_reg.pvalues)))
            var_to_del.append(varToDel)
    
    return log_reg, var_to_del

In [None]:
plt.figure(figsize = (20, 20))
sns.heatmap(abs(X.corr()), cmap="Greens")
plt.show()

We can see that the overall correlation between the variables is quite low, with some exceptions, especially between admission_type and admission_source, and between the number of medications and the proportion of changes in the medications. Even though correlated variables are usually avoided because they can hinder the convergence of the fitting algorithm, we keep all of the variables. If we observe convergence problems, we will remove some of them.

## Logistic regressions

### Naive first regression

Let's split the database into two sub-databases again: train (80% of the data), to create the model, and test (20% remaining) to evaluate it. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
log_reg, var_to_del = reduction_variable_logit(X_train, y_train, True)

In [None]:
log_reg.summary()

36 variables left, the model has converged, and the LLR p-value equals zero (meaning that the model at least does something)... It is a good start. Let's evaluate it on our test dataset.  
  
We evaluate, for each patient, the probability of coming back within 30 days. If the probability exceeds a threshold, we set the response to 1, otherwise we set it to 0.  
  
**Let's start with this good old Bayes' threshold : 0.5!**

In [None]:
threshold = 0.5

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))

**0.89 is GREAT, but the 0.01 is AWFUL!**  
  
Why do we have those results? Because we are facing an imbalanced problem: in our database, almost 90% of the labels are **FALSE**, and only about 10% are **TRUE**.   
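
To make the effect of the imbalance concrete, here is a toy sketch with made-up labels in roughly the same proportion (10% positives) and a "model" that always answers **FALSE**. The metrics are computed by hand with NumPy.

```python
import numpy as np

# Made-up labels with the same rough balance as the dataset: 10% positives
y_true = np.array([True] * 10 + [False] * 90)
y_pred = np.zeros_like(y_true)  # a "model" that always answers False

accuracy = np.mean(y_true == y_pred)
true_positives = np.sum(y_true & y_pred)
recall = true_positives / np.sum(y_true)

print("accuracy = {:.2f}, recall = {:.2f}".format(accuracy, recall))
```

Accuracy alone therefore says very little here, which is why precision and recall are reported alongside it.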
  
This model is smart: it sets (almost) all rows to **FALSE**, so it reaches close to 90% accuracy.  
Maybe if we change the threshold we are going to see better results?

In [None]:
yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

acc = []
prec = []
rec = []

for ts in np.arange(0.1, 1, 0.01) : 
    acc.append(accuracy_score(y_test, yhat>ts))
    prec.append(precision_score(y_test, yhat>ts))
    rec.append(recall_score(y_test, yhat>ts))
    
fig = plt.figure(figsize=(15,15))
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = acc, label = "accuracy")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = prec, label = "precision")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = rec, label = "recall")
plt.xlabel("Threshold")

plt.show()

... and the answer is no!  
But I won't give up. Let's try two things:  
- Undersampling: We reduce our dataset to have balanced categories
- Oversampling: We create artificial data

### Undersampling regression

Let's see this undersampling:

In [None]:
undersample = RandomUnderSampler(random_state=42)
new_X, new_y = undersample.fit_resample(X, y)

print('undersampled dataset shape {}'.format(Counter(new_y)))

Ok that's better. We are on a balanced problem... but we have lost a lot of information. Let's cross our fingers!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, new_y, test_size=0.2, random_state=42)
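
Note that the resampling here is applied to the whole dataset before the split, so the test set is balanced too and its metrics are not directly comparable with those of the first regression. A possible alternative ordering, sketched with plain pandas on made-up data (this is not the notebook's pipeline and not imblearn's implementation), splits first and rebalances only the training part.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up imbalanced data: 1000 rows, about 10% positives
data = pd.DataFrame({"x": rng.normal(size=1000)})
data["Objective"] = rng.random(1000) < 0.10

# 1) Split first (80% / 20%), so the test set keeps the real class balance
shuffled = data.sample(frac=1.0, random_state=42)
n_train = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# 2) Undersample the majority class in the training part only
n_pos = int(train["Objective"].sum())
train_neg = train[~train["Objective"]].sample(n=n_pos, random_state=42)
train_balanced = pd.concat([train[train["Objective"]], train_neg])

print(train_balanced["Objective"].mean(), test["Objective"].mean())
```

The same ordering matters even more for oversampling, where synthetic rows derived from test rows could otherwise end up in the training set.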

In [None]:
log_reg, var_to_del = reduction_variable_logit(X_train, y_train, showVarToDel=True)

In [None]:
log_reg.summary()

Ok, 19 variables left, not bad. Is it good?

In [None]:
threshold = 0.5

yhat = log_reg.predict(X_test.drop(var_to_del, axis=1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))

It is not "good", but it is more reasonable. And I still have hope: maybe if we don't blindly take the Bayes' threshold we can improve our results: 

In [None]:
yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

acc = []
prec = []
rec = []

for ts in np.arange(0.1, 1, 0.01) : 
    acc.append(accuracy_score(y_test, yhat>ts))
    prec.append(precision_score(y_test, yhat>ts))
    rec.append(recall_score(y_test, yhat>ts))
    
fig = plt.figure(figsize=(15,15))
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = acc, label = "accuracy")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = prec, label = "precision")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = rec, label = "recall")
plt.axvline(0.45, color = "purple", label="threshold=0.45")
plt.xlabel("Threshold")

plt.show()

In [None]:
threshold = 0.45

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))

Indeed, we have slightly increased the accuracy... and the recall! I think this is very important for this project, because we want to catch as many readmissions as possible.  
  
Now let's see if the oversampling gives better results:

### Oversampling regression



To do so we use the SMOTE method (Synthetic Minority Oversampling TEchnique). The idea is to create new data points from existing ones through interpolation. So it is not "just" duplication, as with a bootstrap.
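
The interpolation idea can be sketched in a few lines of NumPy. This is a deliberately simplified illustration on made-up points: it uses only the single nearest neighbour, whereas the actual SMOTE algorithm picks at random among several nearest neighbours. `synthetic_point` is a hypothetical helper, not an imblearn function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in a 2-feature space
minority = np.array([[1.0, 2.0], [1.5, 1.8], [3.0, 4.0], [2.8, 4.2]])

def synthetic_point(points, index, rng):
    """Interpolate between points[index] and its nearest neighbour."""
    distances = np.linalg.norm(points - points[index], axis=1)
    distances[index] = np.inf           # ignore the point itself
    neighbour = points[np.argmin(distances)]
    lam = rng.random()                  # random position on the segment, in [0, 1)
    return points[index] + lam * (neighbour - points[index])

new_point = synthetic_point(minority, 0, rng)
print(new_point)
```

The new point lies on the segment between two real minority points, so it is plausible but never an exact copy of an existing row.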

In [None]:
oversample = SMOTE(random_state=42)
new_X, new_y = oversample.fit_resample(X, y)
print('Oversampled dataset shape {}'.format(Counter(new_y)))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, new_y, test_size=0.2, random_state=42)

In [None]:
log_reg, var_to_del = reduction_variable_logit(X_train, y_train, showVarToDel=True)

In [None]:
log_reg.summary()

Oh... so we have 111 variables... and a Pseudo R-square of 61%... this is strange.

In [None]:
threshold = 0.5

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))

We obtain **AMAZING results!** Is this magic? or just an illusion?  
Let's see what happens if we play with the threshold.

In [None]:
yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

acc = []
prec = []
rec = []

for ts in np.arange(0.1, 1, 0.01) : 
    acc.append(accuracy_score(y_test, yhat>ts))
    prec.append(precision_score(y_test, yhat>ts))
    rec.append(recall_score(y_test, yhat>ts))
    
fig = plt.figure(figsize=(15,15))
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = acc, label = "accuracy")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = prec, label = "precision")
sns.lineplot(x = np.arange(0.1, 1, 0.01), y = rec, label = "recall")
plt.axvline(0.31, color = "purple")
plt.xlabel("Threshold")

plt.show()

In [None]:
threshold = 0.31

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))

Bayes is still right: a threshold of (around) 0.5 gives the best accuracy. But as the accuracy is quite stable, let's move to 0.31 to increase the recall.  
I still have some doubts, so we are going to mix under- and oversampling, and track some indicators.

### Oversampling & Undersampling regression

We are going to play with the sampling_strategy parameter of the undersampling method: we will move it from 0.13 (no undersampling) to 1 (full undersampling).
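
A float sampling_strategy is the target ratio minority / majority after resampling, so the number of majority rows that survive is roughly n_minority / ratio. The sketch below does this arithmetic on made-up class counts. `majority_kept` is a hypothetical helper, and the rounding is my own assumption, not necessarily what imblearn does internally.

```python
def majority_kept(n_minority, n_majority, ratio):
    """Majority rows kept so that n_minority / kept is about `ratio`."""
    kept = round(n_minority / ratio)
    return min(kept, n_majority)  # undersampling can only remove rows

# Made-up class counts, chosen so that the natural ratio is 0.13.
n_min, n_maj = 1300, 10000

for ratio in (0.13, 0.5, 1.0):
    print(ratio, majority_kept(n_min, n_maj, ratio))
```

With these counts, a ratio of 0.13 keeps all 10000 majority rows, 0.5 keeps 2600, and 1.0 keeps 1300 (perfectly balanced classes, so SMOTE has nothing left to add).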

For each iteration:

1. we split our dataset into a train part and a test part
2. we undersample, then oversample, the train part only (the idea is to check whether this oversampling really makes sense and could be applied to build a robust model for real-life use)
3. we collect the accuracy, precision, recall, F-measure, and the number of variables in the resulting model. We also compute the first three indicators on the train part.  
  
**ARE YOU EXCITED??**  
  
(i am)

In [None]:
oversample = SMOTE(random_state = 42)

acc = []
prec = []
rec = []
f_measure = []
nb_var = []

acc_train = []
prec_train = []
rec_train = []

rat = []

for ratio in tqdm(np.arange(0.13, 1, 0.05)) :
    try :
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
        
        undersample = RandomUnderSampler(sampling_strategy=ratio, random_state = 42)
        new_X, new_y = undersample.fit_resample(X_train, y_train)


        new_X, new_y = oversample.fit_resample(new_X, new_y)

        log_reg, var_to_del = reduction_variable_logit(new_X, new_y, showVarToDel=False)
        
        yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 
        yhat_train = log_reg.predict(new_X.drop(var_to_del, axis = 1).astype(float)) 

        threshold = 0.5
        
        myPrec = precision_score(y_test, yhat>threshold)
        myRecall = recall_score(y_test, yhat>threshold)
        rat.append(ratio)
        rec.append(myRecall)
        prec.append(myPrec)
        acc.append(accuracy_score(y_test, yhat>threshold))
        f_measure.append(2 * (myPrec * myRecall) / (myPrec + myRecall))
        nb_var.append(len(log_reg.pvalues))
        
        rec_train.append(recall_score(new_y, yhat_train>threshold))
        prec_train.append(precision_score(new_y, yhat_train>threshold))
        acc_train.append(accuracy_score(new_y, yhat_train>threshold))
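    except Exception as err :
        # Do not swallow KeyboardInterrupt; report which kind of error made this ratio fail
        print("ERROR : " + str(ratio) + " (" + type(err).__name__ + ")")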

First thing: many iterations raise an error because of a singular matrix, which causes a convergence problem.

In [None]:
fig = plt.figure(figsize=(15,15))
sns.lineplot(x = rat, y = acc, label = "accuracy", color = "blue")
sns.lineplot(x = rat, y = prec, label = "precision", color="orange")
sns.lineplot(x = rat, y = rec, label = "recall", color="green")
sns.lineplot(x = rat, y = f_measure, label = "f_measure", color="red")

sns.lineplot(x = rat, y = acc_train, label = "accuracy on train", style=True, dashes=[(3, 3)], color = "blue")
sns.lineplot(x = rat, y = prec_train, label = "precision on train", style=True, dashes=[(3, 3)], color = "orange")
sns.lineplot(x = rat, y = rec_train, label = "recall on train", style=True, dashes=[(3, 3)], color = "green")

plt.xlabel("Ratio Nb_People_Readmitted_Within_30_Days / Nb_People before undersampling : 0.13 = no undersampling ; 1=no oversampling")

plt.show()

In [None]:
fig = plt.figure(figsize=(15,15))
sns.lineplot(x = rat, y = nb_var)
plt.xlabel("Ratio Nb_People_Readmitted_Within_30_Days / Nb_People before undersampling : 0.13 = no undersampling ; 1=no oversampling")
plt.ylabel("Number of selected variables")
plt.show()

**NOOOOOOO!**  
  
The results above show what I feared: the amazing results we obtained with oversampling are not robust. In this case, building a logistic model with an oversampling method is not relevant, because we are not able to recover the readmissions :(  
  
In fact, creating data through SMOTE leads to a (quasi-)complete separation. It can be observed through the very low number of variables to be deleted. Thus the resulting dataset is not representative of reality. Note also that, in the previous section, SMOTE was applied before the train / test split, so the test set contained synthetic points interpolated from training observations, which probably inflated those scores.  
Let's go back to the undersampling method...

## Chosen logistic regression : on undersampled dataset

In [None]:
undersample = RandomUnderSampler(random_state=42)
new_X, new_y = undersample.fit_resample(X, y)
print('Undersampled dataset shape {}'.format(Counter(new_y)))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, new_y, test_size=0.2, random_state=42)

In [None]:
log_reg, var_to_del = reduction_variable_logit(X_train, y_train, showVarToDel=False)

In [None]:
log_reg.summary()

Let's have a deeper look at this model, with the confusion matrix and the calculation of the AUC:

In [None]:
threshold = 0.45

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, yhat>threshold)))
print("Precision is {0:.2f}".format(precision_score(y_test, yhat>threshold)))
print("Recall is {0:.2f}".format(recall_score(y_test, yhat>threshold)))


fig, axs = plt.subplots(1, 2, figsize = (20, 10))

sns.heatmap(confusion_matrix(y_test, yhat>threshold), 
            annot=True, 
            cmap="Blues", 
            xticklabels = ["Predicted False", "Predicted True"], 
            yticklabels = ["Real False", "Real True"], ax=axs[0])

# Use a separate name so that the scalar cut-off `threshold` is not overwritten
fpr, tpr, roc_thresholds = roc_curve(y_test, yhat)
roc_auc = auc(fpr, tpr)

# Draw on the second subplot explicitly instead of reassigning axs[1]
axs[1].plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
axs[1].plot([0, 1], [0, 1],'r--')  # diagonal = random classifier
axs[1].set_title('Receiver Operating Characteristic')
axs[1].legend(loc = 'lower right')
axs[1].set_xlim([0, 1])
axs[1].set_ylim([0, 1])
axs[1].set_ylabel('True Positive Rate')
axs[1].set_xlabel('False Positive Rate')

plt.show()

We can observe that there are still many False Positives (1000) and False Negatives (690). Furthermore, an AUC of 0.65 is not very good.  
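
One way to read the AUC: if we draw one readmitted and one non-readmitted patient at random, it is the probability that the model gives the higher score to the readmitted one (0.5 would be a coin flip). The sketch below computes it this way on made-up scores; `pairwise_auc` and the `auc_*` lists are hypothetical and only meant to illustrate the definition, the notebook itself uses scikit-learn's `roc_curve` and `auc`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count for half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only.
auc_y = [1, 1, 0, 0, 1, 0]
auc_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]

print("AUC = {0:.2f}".format(pairwise_auc(auc_y, auc_scores)))
```

Here 7.5 of the 9 (positive, negative) pairs are ranked correctly, which gives an AUC of about 0.83.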
  
Thus this model is obviously far from perfect. Since our objective is to detect readmissions, we can define two thresholds and split the two categories into three: **Not likely** to be readmitted, **Likely** to be readmitted, **Very likely** to be readmitted.

In [None]:
threshold_down = 0.45
threshold_up = 0.8

yhat = log_reg.predict(X_test.drop(var_to_del, axis = 1).astype(float)) 

response_df = pd.DataFrame(columns=["Real_label", "Proba", "Predicted_Label"])

response_df["Real_label"] = y_test
response_df["Proba"] = yhat
response_df.loc[response_df["Proba"] > threshold_up, "Predicted_Label"] = "Very_likey"
response_df.loc[response_df["Proba"] <= threshold_up, "Predicted_Label"] = "Likely"  # <= so that Proba == threshold_up also gets a label
response_df.loc[response_df["Proba"] < threshold_down, "Predicted_Label"] = "Not_likely"

response_df["uno"] = 1

table = pd.pivot_table(data=response_df[["Predicted_Label", "Real_label", "uno"]], index=["Predicted_Label"], columns = "Real_label", values="uno", aggfunc=np.sum)

table.loc[["Not_likely", "Likely", "Very_likey"], :].fillna(0).transpose()

In [None]:
plt.figure(figsize = (20, 10))

sns.heatmap(table.loc[["Not_likely", "Likely", "Very_likey"], :].fillna(0).apply(lambda x: x / float(x.sum()), axis=1).transpose(), 
            cmap="Blues", annot=True, fmt='.2%')
plt.show()

Even with those new labels, there are still many errors.  
This model being the best that we have, we will try to understand it.  

# Analysis of the model

First of all, we fit this model, with the selected variables, on the whole undersampled dataset, without the train / test split:

In [None]:
log_reg_tot = sm.Logit(new_y, new_X.drop(var_to_del, axis=1).astype(float)).fit()
log_reg_tot.summary()

Finally, as our objective is to explain the model, we check which features are the most influential by looking at the z evaluation (the tvalues attribute in the code below). This score measures how large a coefficient is relative to its uncertainty.  
$$z\_evaluation = \frac{coef}{std\_err}$$  
A coefficient that is large in absolute value has a bigger impact on the model, and a low standard error indicates that the coefficient is robustly estimated.
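
To illustrate the ratio above, the sketch below computes the z evaluation and the matching two-sided p-value under a standard normal distribution. The coefficients and standard errors are hypothetical values, not taken from the fitted model, and `z_and_pvalue` is a helper name introduced only for this example.

```python
import math

def z_and_pvalue(coef, std_err):
    """Return the z statistic coef / std_err and its two-sided normal p-value."""
    z = coef / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical (coefficient, standard error) pairs, for illustration only.
examples = {"feature_a": (0.30, 0.10), "feature_b": (-0.30, 0.25)}

for name, (coef, se) in examples.items():
    z, p = z_and_pvalue(coef, se)
    print("{0}: z={1:.2f}, p={2:.4f}".format(name, z, p))
```

Both features have the same coefficient in absolute value, but only the first one, with the smaller standard error, ends up with a p-value below 0.05.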

In [None]:
zipped_lists = zip(abs(log_reg_tot.tvalues[1::].values), log_reg_tot.tvalues[1::].index, log_reg_tot.params[1::].values)
zipped_lists = sorted(zipped_lists, reverse=True)

fig, axs = plt.subplots(1, 2, figsize = (20, 10))

sns.barplot(x = [element for element, _, _ in zipped_lists], 
            y = [element for _, element, _ in zipped_lists], 
            ax=axs[0])
axs[0].set_xlabel("Z-evaluation")

sns.barplot(x = [element for _, _, element in zipped_lists], 
            y = [element for _, element, _ in zipped_lists], ax=axs[1])
axs[1].set_xlabel("Coefficient")

plt.show()

What we understand from this model:
- **_number_inpatient_** is clearly the most relevant feature in terms of z-evaluation. The higher the number of inpatient visits in the year preceding the encounter, the higher the probability of readmission within 30 days.
- _alreadyCame_ is the second one, with a negative coefficient. It means that a first admission is more likely to lead to a quick readmission.
- The **discharge type** seems very important. Indeed, all 3 resulting features appear in the top 10. In this case, being discharged because of a transfer significantly increases the probability of being readmitted.
- Other quantitative features seem important, with a positive coefficient (meaning a positive correlation with the probability of being readmitted): _number_diagnoses_ , _number_emergency_ , _time_in_hospital_ , _number_medo_arrived_ and _proportion_chgmnt_.
- About the diagnoses, even if the Circulatory & Diabetes diagnoses seem to increase the probability, I don't think we can draw conclusions from this feature. Moreover, _diag_atleast_Digestive_ has a p-value > 0.05 when we fit the model on the full dataset, which increases my doubts about this feature.
- The same conclusion applies to the drugs: even if taking metformin or glimepiride seems to lower the probability, the risk increases with the number of medications, so I don't know if we can extract much from this feature.
- The race does not seem to be relevant either. _race_Caucasian_ and _race_AfricanAmerican_ have the same coefficient, and together they represent more than 95% of the studied population. This can be seen in the correlation heatmap below.

In [None]:
plt.figure(figsize = (20, 20))
sns.heatmap(abs(X.drop(["Intercept"] + var_to_del, axis = 1).corr()), cmap="Greens", annot=True)
plt.show()

This analysis of the database leads to the following conclusions:
- A predictive model to detect readmission within 30 days in advance is possible. Indeed, our model, even if it is not very precise, **is still better than nothing**, and gives insights into some causes of readmission, or factors that make it possible to detect a readmission.
- **The state of the database and the way I have exploited it are not enough to build a powerful parametric model.** Maybe more precise data or better feature engineering could lead to better results.