Part 1: The 10-Point Data Inspection

Shape

In [3]:
import pandas as pd

df = pd.read_csv('diabetes.csv')
print(df.shape)

(768, 9)


Rows (observations): 768; Columns (features): 9
Each row represents one individual patient

Column Names

In [4]:
print(df.columns)

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='str')


Columns such as SkinThickness, Insulin, and especially DiabetesPedigreeFunction require further research to understand how they are measured and how they relate to diabetes risk.

Data Types

In [5]:
print(df.dtypes)

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object


All 9 columns are numeric. The integer columns are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age, and Outcome. The float columns are BMI and DiabetesPedigreeFunction. There are no object/string columns. Note that Outcome, although stored as an integer, is a binary categorical variable (0 or 1) rather than a quantity.
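One way to make the categorical nature of Outcome explicit is to convert it to the pandas category dtype. The sketch below is illustrative only: it uses a small hand-built frame (values copied from the first five rows of the dataset, with the name `sample` chosen here) instead of reading diabetes.csv.

```python
import pandas as pd

# Hypothetical mini-frame: three columns from the first five rows of the dataset.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Store the binary label as a category so it is treated as a class, not a number.
sample["Outcome"] = sample["Outcome"].astype("category")

# Selecting numeric columns now leaves Outcome out.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(sample.dtypes)
print(numeric_cols)
```

Whether to convert depends on the next step: many modeling libraries expect an integer label, so the conversion is mainly useful for summaries and plots.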

First Look

In [6]:
print(df.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


The values are numeric medical measurements, and Outcome is binary (0 or 1). However, some columns, such as Insulin and SkinThickness, contain 0 values that are not physically realistic. These zeros likely act as placeholders for missing data and would need to be cleaned before analysis.
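A quick way to see how many zeros each column holds is to compare the frame against 0 and sum the result. This is a minimal sketch on the five rows printed by `df.head()`; the variable names `head` and `zero_counts` are chosen here for illustration.

```python
import pandas as pd

# The measurement columns from the five rows printed by df.head().
head = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# (head == 0) is a boolean frame; summing counts the True values per column.
zero_counts = (head == 0).sum()
print(zero_counts)
```

Applied to the full DataFrame, the same expression would give the zero count for every column across all 768 rows.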

Last Look

In [8]:
print(df.tail())

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
763                     0.171   63        0  
764                     0.340   27        0  
765                     0.245   30        0  
766                     0.349   47        1  
767                     0.315   23        0  


The data ends cleanly. The last rows have the same columns and numeric types as the first rows and fall within similar value ranges. They also contain 0 values in columns such as Insulin and SkinThickness, which likely represent missing data.

Memory Usage

In [7]:
print(df.memory_usage())

Index                        132
Pregnancies                 6144
Glucose                     6144
BloodPressure               6144
SkinThickness               6144
Insulin                     6144
BMI                         6144
DiabetesPedigreeFunction    6144
Age                         6144
Outcome                     6144
dtype: int64


The dataset uses about 55 KB of memory (55,428 bytes: nine columns at 6,144 bytes each, plus 132 bytes for the index). This is a very small dataset by data science standards.
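The total can be checked by summing the per-column figures. The sketch below rebuilds the printed `df.memory_usage()` output as a Series (the name `usage` is chosen here) so that the arithmetic is reproducible without the CSV file.

```python
import pandas as pd

# Byte counts copied from the df.memory_usage() output.
usage = pd.Series({
    "Index": 132,
    "Pregnancies": 6144,
    "Glucose": 6144,
    "BloodPressure": 6144,
    "SkinThickness": 6144,
    "Insulin": 6144,
    "BMI": 6144,
    "DiabetesPedigreeFunction": 6144,
    "Age": 6144,
    "Outcome": 6144,
})

total_bytes = int(usage.sum())
total_kib = total_bytes / 1024  # kibibytes (1 KiB = 1024 bytes)

print(total_bytes, round(total_kib, 1))
```

On the real DataFrame, `df.memory_usage().sum()` returns the same total directly, and `df.memory_usage(deep=True)` additionally accounts for the contents of object columns, which this dataset does not have.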

Missing Values

In [10]:
print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


No column contains null values, so 0% of each column is formally missing. However, zeros appear to serve as placeholders for missing data in Glucose, BloodPressure, SkinThickness, Insulin, and BMI, since a value of 0 for any of these measurements is biologically impossible.
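If the zeros are treated as missing, they can be converted to NaN so that `isnull()` reports them. The following is a sketch under that assumption, using four columns from the first five rows of the dataset; the names `rows`, `zero_as_missing`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Four columns from the first five rows of the dataset.
rows = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero is a placeholder only in these measurement columns.
# A zero in Pregnancies is a valid count and is left untouched.
zero_as_missing = ["SkinThickness", "Insulin", "BMI"]

cleaned = rows.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Share of missing values per column, in percent.
missing_pct = cleaned.isnull().mean() * 100
print(missing_pct)
```

Restricting the replacement to a named list of columns matters here, because a blanket replacement would also erase legitimate zeros in Pregnancies and Outcome.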

Duplicates

In [11]:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
763    False
764    False
765    False
766    False
767    False
Length: 768, dtype: bool


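There are no duplicate rows. Every displayed value is False, although the printed output is truncated to the first and last five rows, so counting the True values is a more reliable check than reading the output by eye.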
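Because `df.duplicated()` returns one boolean per row, summing it gives the number of duplicate rows in a single figure. The sketch below uses a tiny made-up frame (`demo`, not taken from the dataset) with one deliberately repeated row to show what a nonzero count looks like.

```python
import pandas as pd

# Hypothetical frame: the third row repeats the first on purpose.
demo = pd.DataFrame({
    "Glucose": [148, 85, 148],
    "Age": [50, 31, 50],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(demo.duplicated().sum())
has_duplicates = bool(demo.duplicated().any())

print(n_duplicates, has_duplicates)
```

On the diabetes DataFrame, a result of 0 from `df.duplicated().sum()` would confirm that no row is repeated.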

Basic Statistics

In [12]:
print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

The age range is 21 to 81, the glucose range is 0 to 199, and the BMI range is 0 to 67.1. Minimum values of 0 for variables such as glucose and BMI are biologically impossible and likely represent missing data.

Unique Counts

In [13]:
print(df.nunique())

Pregnancies                  17
Glucose                     136
BloodPressure                47
SkinThickness                51
Insulin                     186
BMI                         248
DiabetesPedigreeFunction    517
Age                          52
Outcome                       2
dtype: int64


Outcome (2) and Pregnancies (17) have the fewest unique values: Outcome is binary (0 = no diabetes, 1 = diabetes), and Pregnancies is a discrete count. Columns with many unique values, such as Glucose, Insulin, BMI, DiabetesPedigreeFunction, Age, SkinThickness, and BloodPressure, are likely continuous measurements, even though most are recorded as whole numbers.
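The grouping by unique count can be sketched as a simple rule. The counts below are copied from the `df.nunique()` output; the `classify` helper and its cutoff of 20 are assumptions made for illustration, not a standard threshold.

```python
import pandas as pd

# Unique counts copied from the df.nunique() output.
unique_counts = pd.Series({
    "Pregnancies": 17,
    "Glucose": 136,
    "BloodPressure": 47,
    "SkinThickness": 51,
    "Insulin": 186,
    "BMI": 248,
    "DiabetesPedigreeFunction": 517,
    "Age": 52,
    "Outcome": 2,
})


def classify(n_unique, low_cardinality=20):
    """Label a column by its unique-value count (hypothetical cutoff)."""
    if n_unique == 2:
        return "binary"
    if n_unique <= low_cardinality:
        return "low-cardinality"
    return "many-valued"


labels = unique_counts.map(classify)
print(labels)
```

A rule like this is only a first pass: a count says nothing about meaning, so a column such as Age still needs a judgment call about whether to treat it as continuous.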