## Exploratory data analysis and data cleaning on diabetes data among a sample of Pima Indian female patients.

For my Codecademy certification course in
data science and machine learning engineering

Robert Hall 01.24.2024

Data found here, from the National Institute of Diabetes and Digestive and Kidney Diseases:
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data 

In [1]:
# import needed libraries
import numpy as np 
import pandas as pd 

Task 1:

"Load the data in a variable called 'diabetes_data' and print the first few rows"

In [2]:
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


Task 2:

"Do any of the columns in the data contain null (missing) values?"

In [3]:
diabetes_null = diabetes_data.isnull().sum()
print(diabetes_null)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


Task 3:

Calculate summary statistics on 'diabetes_data' using .describe()
[to determine whether any missing values are encoded as something other than None or NaN]

As the summary shows, the minimum values for Glucose, BloodPressure, SkinThickness, Insulin and BMI are all 0.0. These measurements should never be a flat 0 for a living patient, so I am assuming the zeros stand in for missing data.

In [4]:
print(diabetes_data.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age  
count  768.000000                768.000000  768.000000  
mean    31.992578                  0.471876   33.240885  
std      7.884160                  0.331329   11.760232  
min      0.000000                  0.078000   21.000000  
25%     27.300000        
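
A quick way to back up that reading is to count the zeros in each column directly. The sketch below is only an illustration: it rebuilds the five rows printed by head() earlier (the name `sample` is my own), so the counts describe that small sample and not the full 768-row dataset.

```python
import pandas as pd

# The five rows printed by diabetes_data.head(), limited to the columns
# where a zero is suspicious. In the notebook itself, the same two lines
# at the bottom would be run on diabetes_data instead of this sample.
sample = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# isnull() reports nothing, because the placeholders are ordinary zeros
null_counts = sample.isnull().sum()

# Comparing against 0 gives a boolean frame; summing it counts zeros per column
zero_counts = (sample == 0).sum()

print(null_counts)
print(zero_counts)
```

This is why Task 2 found no nulls even though the data clearly has gaps: isnull() only flags NaN and None, not placeholder values.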

Task 4:

Replace the 0.0 values in those five columns with np.nan

In [6]:
# columns where a 0 stands in for a missing measurement
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)
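
The replacement is limited to specific columns on purpose, because a 0 in Pregnancies is a legitimate count and should stay. A minimal sketch of that behaviour, again using rows printed by head() (the names `sample`, `cleaned` and `zero_means_missing` are assumptions made for this illustration):

```python
import numpy as np
import pandas as pd

# Three columns from the five rows printed by diabetes_data.head()
sample = pd.DataFrame({
    'Pregnancies': [6, 1, 8, 1, 0],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
})

# Only these columns use 0 as a placeholder; Pregnancies is left out
# because zero pregnancies is a valid value
zero_means_missing = ['SkinThickness', 'Insulin']

# Work on a copy so the original sample stays available for comparison
cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

print(cleaned)
```

Calling .replace(0, np.nan) on the whole DataFrame would also blank out the valid zero in Pregnancies (and the 0 class in a numeric Outcome column), which is why the column list matters.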

Task 5:

Check for null values in all columns again, now that the 0.0 placeholders should have been converted to np.nan

In [7]:
diabetes_null = diabetes_data.isnull().sum()
print(diabetes_null)

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
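
Raw counts are easier to judge as a share of the data. The sketch below redoes the arithmetic from the counts printed above and the 768 rows reported by .describe(); the names `null_counts`, `n_rows` and `pct_missing` are my own.

```python
import pandas as pd

# Null counts as printed above (columns with no nulls are omitted)
null_counts = pd.Series({
    'Glucose': 5,
    'BloodPressure': 35,
    'SkinThickness': 227,
    'Insulin': 374,
    'BMI': 11,
})

# Row count taken from the 'count' line of the describe() output
n_rows = 768

# Percentage of rows missing each measurement, rounded to one decimal place
pct_missing = (null_counts / n_rows * 100).round(1)

print(pct_missing.sort_values(ascending=False))
```

On the real DataFrame, `diabetes_data.isnull().mean() * 100` should give the same percentages in one step. Insulin is missing in nearly half of the rows, which matters for any later decision about dropping or imputing values.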


Task 6:

To get a better idea of why some data might be missing, print out all rows containing null values

In [8]:
print(diabetes_data[diabetes_data.isnull().any(axis=1)])

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6    148.0           72.0           35.0      NaN  33.6   
1              1     85.0           66.0           29.0      NaN  26.6   
2              8    183.0           64.0            NaN      NaN  23.3   
5              5    116.0           74.0            NaN      NaN  25.6   
7             10    115.0            NaN            NaN      NaN  35.3   
..           ...      ...            ...            ...      ...   ...   
761            9    170.0           74.0           31.0      NaN  44.0   
762            9     89.0           62.0            NaN      NaN  22.5   
764            2    122.0           70.0           27.0      NaN  36.8   
766            1    126.0           60.0            NaN      NaN  30.1   
767            1     93.0           70.0           31.0      NaN  30.4   

     DiabetesPedigreeFunction  Age Outcome  
0                       0.627   50       1  
1                    
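
Printing the rows is a start, but it can also help to count how many values are missing per row and whether the gaps tend to occur together. The sketch below rebuilds only the first five rows shown in the output above (indices 0, 1, 2, 5 and 7), so it says nothing about the full dataset; the variable names are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# First five rows of the Task 6 output, keeping their original index labels
rows = pd.DataFrame(
    {
        'Glucose': [148.0, 85.0, 183.0, 116.0, 115.0],
        'BloodPressure': [72.0, 66.0, 64.0, 74.0, np.nan],
        'SkinThickness': [35.0, 29.0, np.nan, np.nan, np.nan],
        'Insulin': [np.nan] * 5,
        'BMI': [33.6, 26.6, 23.3, 25.6, 35.3],
    },
    index=[0, 1, 2, 5, 7],
)

# Number of missing measurements in each row
missing_per_row = rows.isnull().sum(axis=1)

# Rows where SkinThickness and Insulin are missing together
both_missing = (rows['SkinThickness'].isnull() & rows['Insulin'].isnull()).sum()

print(missing_per_row)
print(both_missing)
```

If the same pattern held across the whole dataset, it would suggest that some measurements were skipped together for certain patients, but that would need to be checked on the full diabetes_data.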

Task 7:

Take a look at the datatypes for each feature in the dataset

In [9]:
print(diabetes_data.dtypes)

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object


Task 8:

The 'Outcome' column is an object type instead of the expected int64 type. Print out its unique values to get a clue as to why

In [10]:
print(diabetes_data.Outcome.unique())

['1' '0' 'O']
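
unique() shows which values exist but not how often or where. A small sketch of two follow-up checks, using a made-up Series (not rows from the dataset) that contains the same three string values:

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data.Outcome; the values are invented
# for illustration and only mimic the '1' / '0' / 'O' mix seen above
outcome = pd.Series(['1', '0', '1', 'O', '0'], name='Outcome')

# How many times each distinct value occurs
counts = outcome.value_counts()

# The entries (with their index labels) that are neither '0' nor '1'
unexpected = outcome[~outcome.isin(['0', '1'])]

print(counts)
print(unexpected)
```

Knowing how many rows are affected, and which ones, makes it easier to decide whether a simple replacement is safe.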


Task 9: 

The 'Outcome' column appears to be composed of binary 1s and 0s, except for the value 'O' (a capital letter O), which appears to be a typo for 0. Convert all instances of 'O' to '0'.

Then, change the datatype from object to int64 so that the values are treated as numbers rather than strings.

Then, print the unique feature values to verify that the modifications were successful.

In [12]:
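# swap the letter 'O' for the digit '0' while the column still holds strings
diabetes_data['Outcome'] = diabetes_data.Outcome.replace('O', '0')
# cast the cleaned strings to integers
diabetes_data['Outcome'] = diabetes_data.Outcome.astype('int64')
print(diabetes_data.Outcome.unique())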

[1 0]
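
Calling astype('int64') on a column that still contains a non-numeric string raises an error, so it can be useful to check for stragglers first. One possible approach, sketched on a made-up Series rather than the real column, is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN:

```python
import pandas as pd

# Hypothetical string column mimicking the uncleaned Outcome values
outcome = pd.Series(['1', '0', 'O', '1'], name='Outcome')

# Values that cannot be parsed as numbers become NaN instead of raising
as_numbers = pd.to_numeric(outcome, errors='coerce')

# Select the original strings that failed to parse
unparseable = outcome[as_numbers.isnull()]

# The same fix used in the notebook: replace the letter, then cast
fixed = outcome.replace('O', '0').astype('int64')

print(unparseable)
print(fixed.dtype)
```

This is an alternative check, not what the notebook does; the replace-then-astype approach above works fine once the only bad value is known to be 'O'.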
