# Diabetes Exploratory Data Analysis



__Author:__ Desiree Unselt <br>
__Date:__ 18JUL2024 <br>

__Dataset:__ Diabetes 130-US Hospitals dataset, covering patient encounters at 130 US hospitals from 1999 to 2008, hosted on the UC Irvine Machine Learning Repository.

__Objectives:__
  - Perform an exploratory data analysis (EDA) on the dataset to understand how it's structured and what it contains
  - Once completed:
    - Ask someone to review your code
    - Share your findings with the team
    - Commit your code to a local git repository
  - Think about how you could train a model to predict patient readmission, then try to build one.
    - The dataset has the following classes in the `readmitted` column:
        - 'NO' (was not readmitted)
        - '>30' (was readmitted after more than 30 days)
        - '<30' (was readmitted within 30 days)
    - How can you measure the performance of the model? How well is it doing?
    - What's the best model that you can develop?
    - How does it compare against other volunteers' models? Can you combine ideas from other people to make your model better?
  - Can you come up with a reasonable real-life application of these algorithms to the diabetes data? Are there any limitations of your approach?
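
For the performance question above, plain accuracy can look acceptable even when a model never predicts one of the classes, so per-class precision, recall, and macro-averaged F1 are worth checking as well. The sketch below computes them with the standard library only. The label lists are hypothetical toy values, not results from the dataset, and `per_class_scores` and `macro_f1` are illustrative helper names.

```python
from collections import Counter

# Class labels as they appear in the `readmitted` column
LABELS = ['NO', '>30', '<30']


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Hypothetical toy labels (not taken from the dataset)
y_true = ['NO', 'NO', 'NO', 'NO', '>30', '>30', '<30', '<30']
# A baseline that always predicts the most frequent class
y_majority = ['NO'] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print('Accuracy:', accuracy)
print('Macro F1:', round(macro_f1(y_true, y_majority), 3))
```

On these toy labels the always-'NO' baseline reaches 50% accuracy but a macro F1 of about 0.22, because it never identifies either readmission class. Any real model is worth comparing against a baseline like this.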

## Exploratory Data Analysis (EDA)

### Import Libraries, Set Working Directory, and Load Data

In [4]:
# Import libraries 
import os
import pandas as pd
import numpy as np

# Confirm working directory
print(os.getcwd())

# Change working directory
os.chdir('/data/project_data')

# Load diabetes data
df_diabetes = pd.read_csv('diabetic_data.csv')
df_diabetes.head()

/data/project_data


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


### Review Dataset Shape

In [6]:
# Check the number of rows and columns in the dataset using shape
print("Number of rows are: ",df_diabetes.shape[0])
print("Number of columns are: ",df_diabetes.shape[1])

Number of rows are:  101766
Number of columns are:  50


### Review Dataset Information

In [7]:
# Check column names, non-null counts, and dtypes using info
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [None]:
# Print column names
# print(df_diabetes.columns)

# print(df_diabetes.dtypes)
# print(df_diabetes['weight'].value_counts())
# Some weights are listed as '?'
# print(df_diabetes['readmitted'].value_counts())
# print(df_diabetes.isnull().sum())

# Determine which columns have at least 1 null value
# print(df_diabetes.columns[df_diabetes.isnull().any()])

# Determine which columns have at least 1 '?' value
# print(df_diabetes.columns[df_diabetes.eq('?').any()])

# List columns that are object data types
# print(df_diabetes.select_dtypes(include='object').columns)

# Provide basic statistics for numerical columns
# print(df_diabetes.describe())

# Provide basic statistics for categorical columns
# print(df_diabetes.describe(include='object'))

# List columns where unique is equal to 1
print(df_diabetes.columns[df_diabetes.nunique() == 1])


In [None]:
# Create a table containing mean, median, mode, and standard deviation for numerical columns
numerical_columns = df_diabetes.select_dtypes(include=np.number).columns
# Drop identifier columns
numerical_columns = numerical_columns.drop(['encounter_id', 'patient_nbr'])

# Collect one row per column, then build the DataFrame once
# (avoids repeatedly concatenating onto an empty DataFrame)
statistics_rows = []
for column in numerical_columns:
    statistics_rows.append({
        'Column': column,
        'Mean': df_diabetes[column].mean(),
        'Median': df_diabetes[column].median(),
        'Mode': df_diabetes[column].mode().iloc[0],  # first value if several modes tie
        'Standard Deviation': df_diabetes[column].std(),
    })
statistics_table = pd.DataFrame(statistics_rows, columns=['Column', 'Mean', 'Median', 'Mode', 'Standard Deviation'])

print(statistics_table)



In [None]:
# Create a table containing the mode for categorical columns
# Note: if this cell runs before '?' is converted to NaN, '?' can be reported as a mode
categorical_columns = df_diabetes.select_dtypes(include='object').columns

# Collect one row per column, then build the DataFrame once
mode_rows = [
    {'Column': column, 'Mode': df_diabetes[column].mode().iloc[0]}
    for column in categorical_columns
]
mode_table = pd.DataFrame(mode_rows, columns=['Column', 'Mode'])

print(mode_table)

In [None]:
# Convert '?' placeholders to null values
df_diabetes.replace('?', np.nan, inplace=True)

# Create a table with the number of missing values per column
# and the percentage of total rows they represent
missing_values = df_diabetes.isnull().sum()
missing_values_table = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Values': missing_values.values,
    'Percentage of Total Rows': missing_values.values / len(df_diabetes) * 100,
})

print(missing_values_table)
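
As an alternative to replacing `'?'` after loading, the placeholder can be mapped to NaN while the file is parsed, using the `na_values` argument of `pd.read_csv`. The sketch below shows this on a hypothetical three-row sample written in the style of `diabetic_data.csv`. The sample rows are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the same style as diabetic_data.csv (not real rows)
sample_csv = io.StringIO(
    "encounter_id,race,weight,readmitted\n"
    "1,Caucasian,?,NO\n"
    "2,?,?,>30\n"
    "3,AfricanAmerican,[75-100),<30\n"
)

# na_values='?' converts the placeholder to NaN during parsing
df_sample = pd.read_csv(sample_csv, na_values='?')

# Count missing values per column and express them as a percentage of rows
missing_counts = df_sample.isnull().sum()
missing_summary = pd.DataFrame({
    'Missing Values': missing_counts,
    'Percentage of Total Rows': (missing_counts / len(df_sample) * 100).round(1),
}).sort_values('Missing Values', ascending=False)

print(missing_summary)
```

Sorting by the missing count puts the sparsest columns at the top, which makes it easier to spot candidates for dropping or imputing.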