# Introduction

# Preparatory analysis

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('data/diabetic_data.csv')

In [3]:
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


## Counting missing values for each feature

### NaN values

In [4]:
data.shape

(101766, 50)

In [5]:
na_values = data.isna().sum(axis=0)

print(na_values[na_values > 0]/len(data) * 100)

max_glu_serum    94.746772
A1Cresult        83.277322
dtype: float64


max_glu_serum and A1Cresult are the only features with NaN values, missing in about 95% and 83% of the rows respectively. Both are categorical and so sparsely populated that their missing values cannot be imputed reliably, so we drop these columns.


In [6]:
data['max_glu_serum'].unique()

array([nan, '>300', 'Norm', '>200'], dtype=object)

In [None]:
data['A1Cresult'].unique()

In [None]:
data = data.drop(columns=['max_glu_serum', 'A1Cresult'])
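The drop above names the two columns by hand. As a minimal sketch of an alternative, the same decision can be driven by a missing-value threshold. The toy frame and the 0.5 cut-off below are assumptions chosen for illustration, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'max_glu_serum': [np.nan, np.nan, np.nan, '>300'],
    'A1Cresult': [np.nan, '>7', np.nan, np.nan],
    'gender': ['Female', 'Male', 'Female', 'Male'],
})

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Hypothetical cut-off: drop any column missing in more than half of the rows.
threshold = 0.5
to_drop = missing_frac[missing_frac > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(to_drop)
print(cleaned.columns.tolist())
```

With a threshold, the rule stays valid if the data file changes, at the cost of having to justify the cut-off.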

### Other missing values

Looking at the dataset, it is clear that for some features missing values are marked with a question mark ('?') rather than NaN. Let's count these missing values for each column.

In [None]:
for col in data.columns:
    # Number of rows where this column holds the '?' placeholder
    n_missing = (data[col] == '?').sum()
    if n_missing > 0:
        print(col, n_missing / len(data) * 100)
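The question-mark placeholders can also be converted to NaN while the file is read, so that a single `isna()` pass covers both kinds of missing value. The sketch below uses the `na_values` parameter of `pd.read_csv` on a small in-memory CSV; the rows are made up for illustration and are not taken from `diabetic_data.csv`.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (values are illustrative only).
raw = io.StringIO(
    "race,weight,diag_1\n"
    "Caucasian,?,250.83\n"
    "?,?,276\n"
    "AfricanAmerican,[75-100),?\n"
)

# na_values='?' tells pandas to parse every '?' cell as NaN.
toy = pd.read_csv(raw, na_values='?')

# Percentage of missing values per column, now including the former '?' cells.
missing_pct = toy.isna().mean() * 100
print(missing_pct[missing_pct > 0])
```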

From the results above, the feature 'weight' is available for only 3% of the observations, so this column can be dropped. In addition, 49% of observations have no value for medical_specialty and 39% have none for payer_code. Both are categorical variables, so their missing values cannot be replaced with the mean or a similar statistic; we drop them as well.

diag_1, diag_2 and diag_3 are diagnoses, so they cannot be inferred from the other columns. In a classification setting these features can clearly be useful. Suppose the primary diagnosis is correct: intuitively, a patient with a correct diagnosis is less likely to return to the hospital than a patient with an incorrect one. Rather than dropping the columns, it is wiser to drop the observations with a missing diagnosis.

In [None]:
data = data.drop(columns=['weight', 'payer_code', 'medical_specialty'])

In [None]:
# Flag rows where any of the three diagnosis columns holds the '?' placeholder
labels = (data['diag_1'] == '?') | (data['diag_2'] == '?') | (data['diag_3'] == '?')
# Keep only the rows with all three diagnoses present
data = data[~labels]

In [None]:
data.shape

Data is now ready for the analysis pipeline!

# Changing unique values for some columns

Is this safe for logistic regression?

In [None]:
data.loc[data['readmitted'] == 'NO', 'readmitted'] = 0
data.loc[data['readmitted'] == '>30', 'readmitted'] = 1
data.loc[data['readmitted'] == '<30', 'readmitted'] = 2

In [None]:
data.loc[data['change'] == 'No', 'change'] = 0
data.loc[data['change'] == 'Ch', 'change'] = 1

In [None]:
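# Select the 'diabetesMed' column explicitly; without it the whole row is overwritten
data.loc[data['diabetesMed'] == 'Yes', 'diabetesMed'] = 1
data.loc[data['diabetesMed'] == 'No', 'diabetesMed'] = 0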

In [None]:
len(data.columns)

In [None]:
# List the columns that contain the value 'Steady' (candidates for recoding)
data.columns[(data == 'Steady').any()]


In [None]:
data['readmitted'] = pd.to_numeric(data['readmitted'])
data['change'] = pd.to_numeric(data['change'])
data['diabetesMed'] = pd.to_numeric(data['diabetesMed'])
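The recoding above can be written more compactly with `Series.map` and an explicit dictionary, which also yields integer columns directly. The sketch below uses a toy frame with made-up rows. On the logistic regression question: an integer-coded input feature is treated by the model as a number, so the codes imply an order and equal spacing between levels. For that reason the sketch one-hot encodes a multi-level column with `pd.get_dummies`; picking `insulin` for this is an assumption made for illustration, not a step from the analysis.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'readmitted': ['NO', '>30', '<30', 'NO'],
    'change': ['No', 'Ch', 'Ch', 'No'],
    'diabetesMed': ['No', 'Yes', 'Yes', 'Yes'],
    'insulin': ['No', 'Up', 'Steady', 'Down'],
})

# Target and binary columns: explicit dictionaries, same codes as above.
toy['readmitted'] = toy['readmitted'].map({'NO': 0, '>30': 1, '<30': 2})
toy['change'] = toy['change'].map({'No': 0, 'Ch': 1})
toy['diabetesMed'] = toy['diabetesMed'].map({'No': 0, 'Yes': 1})

# Multi-level feature: one indicator column per level instead of integer codes.
encoded = pd.get_dummies(toy, columns=['insulin'], dtype=int)
print(encoded.dtypes)
```

Any value missing from a dictionary becomes NaN under `map`, which makes unexpected categories easy to spot.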

# Exploratory data analysis

In [None]:
freqs = data['readmitted'].value_counts(normalize=True).sort_index()
print(freqs)

In [None]:
freqs.plot.bar(figsize=(15, 6))

54% of patients did not return to the hospital, 35% returned after more than 30 days and the remaining 11% returned within 30 days. This suggests that the assessments made by the doctors on #num patients turned out to be correct and without flaws. ??

In [None]:
#data['readmitted'].plot.bar()

# How is gender distributed?

In [None]:
data['gender'].unique()

In [None]:
# Remove the rows whose gender is recorded as 'Unknown/Invalid'
data = data[data['gender'] != 'Unknown/Invalid']

In [None]:
data['gender'].value_counts(normalize=True)

## How is sex distributed among all patients who were not readmitted?

In [None]:
d = data[data['readmitted'] == 0][['gender', 'readmitted']]

(d.groupby('gender').value_counts()/len(d))

### 53% of the non-readmitted patients are female, while 43% are male.
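The same breakdown can be computed for every readmission class at once with `pd.crosstab`. This is a sketch on a toy frame with made-up rows, assuming `readmitted` has already been recoded to 0, 1 and 2 as earlier in the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'readmitted': [0, 0, 0, 1, 1, 2],
})

# Rows: readmission class; columns: gender.
# normalize='index' makes each row sum to 1.
shares = pd.crosstab(toy['readmitted'], toy['gender'], normalize='index')
print(shares)
```

Each row gives the share of each gender within one readmission class, so the classes can be compared side by side.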

Questions

Distribution of sex among readmissions

Mean time in hospital: statistical test

weight: number of missing and unique values

"race" present: unique values

Distribution of diabetesMed vs. readmitted

Distribution of race

Feature engineering

readmissions: 0, 1, 2
change: 0 -> 1
all the other columns

# Methods

# Results

# Conclusions