# Data Cleaning and Transformation
## Notebook 1

Group project by:

- Tinuke Durotolu
- Sara Ruini
- Susan Yousefi


<b> Data source: </b> https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008# 

<b> Reference: </b>
Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014. 

In [None]:
#libraries needed
import pandas as pd
import numpy as np
import os

## Dataset challenges
This dataset has been pre-cleaned. However, there are some modifications that we still need to perform.

## About the medicines
TODO: add a source for the drug descriptions below.

- Metformin is a biguanide that lowers blood sugar levels.
- Repaglinide and nateglinide belong to the meglitinide drug class and also help lower blood glucose levels.
- Tolbutamide, glimepiride and glipizide are sulfonylureas, which stimulate the pancreas to produce insulin.
- Pioglitazone is a thiazolidinedione, which increases the sensitivity of body cells to insulin.
- The other groups include gliptins, insulins and the remaining medications, which we have grouped as "others".

In the column of each medicine (e.g. 'metformin'), "Up" means that the dosage was increased, "Down" means that it was decreased, "Steady" means that the dosage did not change and "No" means that the medicine was not prescribed.


In [None]:
df = pd.read_csv("diabetic_data.csv", encoding="Latin-1")

## Visualising the datasets

First of all, we are going to look at the datatype of each column in the dataset and check whether there are any missing values.

Our aim is to remove unnecessary values and columns and remove any information that we consider redundant.

## Removing unnecessary values

We will display the info of the dataset (including the non-null count of each column), show the head and tail of the dataset, and show the relative frequency of each unique value per column.

In [None]:
df.info()

There are a total of 101766 data entries in this dataset.
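One caveat when reading the `df.info()` output: it only counts real `NaN` values. The filter on `race` later in this notebook suggests that at least some missing entries are stored as the string `"?"`, which `info()` treats as ordinary data. The sketch below uses a small toy frame (not the real dataset) to show how such placeholders could be counted per column; the toy values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data. The "?" placeholder is an
# assumption based on the race filter used later in this notebook.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "gender": ["Female", "Male", "Unknown/Invalid", "Female"],
    "time_in_hospital": [1, 3, 2, 5],
})

# Count "?" placeholders per column.
placeholder_counts = toy.eq("?").sum()

# Count real NaN values per column; these are what info() reports on.
nan_counts = toy.isna().sum()

print(placeholder_counts)
print(nan_counts)
```

Running the same two lines on `df` would show which columns carry placeholders that `dropna()` cannot see.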

Before analysing this dataframe further, we will drop unnecessary columns. The intention is to remove medicines that are not used in the UK as well as variables that are not needed for our model; the cell below drops only the latter.

In [None]:
to_drop = ['weight','medical_specialty', 'admission_type_id', 'discharge_disposition_id','max_glu_serum', 'admission_source_id', 'encounter_id','patient_nbr', 'diag_1', 'diag_2', 'diag_3', 'payer_code', 'A1Cresult']
df.drop(to_drop, inplace=True, axis=1)

In [None]:
#df.dropna()

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
df = df.dropna(axis=0, how="any")
df.info()

In [None]:
df.describe(include='all')

In [None]:
def value_counter(dataset):
    # Print the relative frequency of every unique value, column by column
    for column in dataset.columns:
        print("---- Column: " + column + " ----")
        print(dataset[column].value_counts(normalize=True, sort=True))
value_counter(df)

Some columns have many unique numeric values, which could be grouped into ranges. Let's display them in bins to make each category more readable. These columns are 'num_lab_procedures' and 'num_medications'.

In [None]:
cols_to_range = ['num_lab_procedures', 'num_medications']
for col in cols_to_range:
    unique_val_col = df[col].nunique()  # number of distinct values, as an int rather than a shape tuple
    print("Column: '" + col + "'. Number of unique values: " + str(unique_val_col) + ".")
    print(df[col].value_counts(bins=10, sort=True))
    print("-------")
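The cell above only displays binned counts. If the ranges were to be stored as a new categorical column, `pd.cut` is one way to do it. The following is a minimal sketch on toy values; the bin edges and labels are hypothetical and were chosen only for illustration, not taken from the analysis.

```python
import pandas as pd

# Toy values standing in for a column such as 'num_medications'.
toy = pd.Series([1, 4, 9, 12, 17, 25, 40], name="num_medications")

# Hypothetical bin edges and labels, for illustration only.
edges = [0, 8, 17, 81]
labels = ["1-8", "9-17", "18-81"]

# With the default right=True, each bin includes its upper edge:
# (0, 8], (8, 17], (17, 81].
ranges = pd.cut(toy, bins=edges, labels=labels)

counts = ranges.value_counts(sort=False)
print(counts)
```

Explicit edges keep the categories stable between runs, whereas `value_counts(bins=10)` derives its edges from the minimum and maximum of whatever data is present.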

## Data Modification

### Removing missing/unclear and null values.


In [None]:
df = df[df['gender']!='Unknown/Invalid']
df = df[df['race']!='?']

### Renaming columns

In [None]:
df.columns

The goal is to:

- Clarify what each column in the data represents;
- Conform to Python naming conventions (e.g. using snake_case).

In [None]:
# 'admission_type_id' and 'discharge_disposition_id' were dropped earlier,
# so they are not renamed here.
new_col_names = {
    "time_in_hospital" : "days_in_hospital",
    "num_procedures": "num_not_lab_procedures",
    "num_medications": "num_current_medications",
    "num_diagnoses": "num_existing_conditions",
    "change": "change_in_meds",
    "diabetesMed": "diabates_med_prescribed",
    "number_emergency": "num_previous_emergencies",
    "number_outpatient": "num_outpatient_appointments",
    "number_inpatient": "num_inpatient_overnight_stays"
    }
df = df.rename(columns=new_col_names)
df.columns
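By default, `DataFrame.rename` silently skips mapping keys that are not column names, so a typo or an already-dropped column goes unnoticed. The sketch below, on a toy frame with a made-up mapping, shows one way to list such keys before renaming.

```python
import pandas as pd

# Toy frame with only two of the columns named in the mapping.
toy = pd.DataFrame({"time_in_hospital": [1, 2], "change": ["No", "Ch"]})

mapping = {
    "time_in_hospital": "days_in_hospital",
    "change": "change_in_meds",
    "admission_type_id": "admission_type",  # deliberately absent from toy
}

# Keys that rename() would skip with its default errors="ignore".
missing = sorted(set(mapping) - set(toy.columns))
print("Not in the frame:", missing)

renamed = toy.rename(columns=mapping)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is an alternative that turns a missing key into a `KeyError` instead of a silent no-op.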

### Renaming values for each medicine

In [None]:
list_of_meds = ['metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

new_values = {"Up": "dosage_increased", "Down": "dosage_decreased","Steady":"no_change_dosage", "No":"not_prescribed"}
for medicine in list_of_meds:
    df[medicine] = df[medicine].replace(new_values)

for col in list_of_meds[:3]:
    print("----" + col + "----")
    print(df[col].value_counts(sort=True))
    print("-------")
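Printing the counts for the first three medicines is a spot check. A fuller check would confirm that no medicine column still contains a value outside the four new labels. The following sketch does this on a toy frame; the toy rows are invented, while `new_values` is copied from the cell above.

```python
import pandas as pd

new_values = {"Up": "dosage_increased", "Down": "dosage_decreased",
              "Steady": "no_change_dosage", "No": "not_prescribed"}

# Toy frame standing in for two of the medicine columns.
toy = pd.DataFrame({
    "metformin": ["No", "Steady", "Up", "No"],
    "insulin": ["Down", "No", "Steady", "Up"],
})

recoded = toy.replace(new_values)

# Collect, per column, any value that is not one of the four new labels.
expected = set(new_values.values())
unexpected = {}
for col in recoded.columns:
    leftovers = sorted(set(recoded[col]) - expected)
    if leftovers:
        unexpected[col] = leftovers

print(unexpected)  # an empty dict means every value was recoded
```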

In [None]:
# dropna() returns a new DataFrame, so assign the result to keep the change
df = df.dropna()

In [None]:
df.describe(include='all')

## Data Cleaning and Transformation conclusion
At first glance, we can notice that the majority of people being re-admitted are:
- Caucasian
- Female
- Aged 50+
- Taking between 9 and 17 medications

(To be finished.)

In [None]:
df.to_csv("cleaned_data_v5.csv")
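Note that `to_csv` writes the row index as an extra first column unless told otherwise, and after the row filtering above that index is no longer contiguous. The sketch below uses a toy frame and in-memory buffers (rather than files on disk) to compare the default with `index=False`. Whether the index should be kept depends on what the next notebook expects, so this is an option and not a required change.

```python
import io
import pandas as pd

toy = pd.DataFrame({"race": ["Caucasian", "Asian"], "days_in_hospital": [3, 5]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```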