
## Exploratory Data Analysis

In this notebook you will get practical experience with common data exploration steps. We will use the "**Pima Indians Diabetes Database**", a well-known dataset often used in machine learning and statistical analysis projects.




> **This notebook will serve as an EDA report for this dataset. Your task is to examine results from each step and add a summary at the end of the notebook that addresses the given questions.**

# About the dataset

The dataset is available on [Kaggle](https://www.kaggle.com/), a popular platform for data science competitions and projects.

The dataset contains various attributes or features associated with individuals of Pima Indian heritage who have been diagnosed as either diabetic or non-diabetic. Each row represents one individual, and each column represents a specific attribute or feature.

The original objective of the dataset is to diagnostically **predict whether or not a patient has diabetes, based on certain diagnostic measurements** included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.



# EDA checklist

The first step in any data analysis task is to create different summaries of the data that help you understand the properties of the dataset. A comprehensive Exploratory Data Analysis should result in a cleaned and filtered dataset in which any errors in the data, format conversions and value scale adjustments have been addressed.


OK, let's start! This is the plan:
1. Load python packages
2. Load dataset by using Pandas
3. Check the data content and dimensions
4. Take a closer look at the statistical summary of the data to identify missing values
5. Devise a data cleaning and preprocessing strategy
6. Explore relationships between features (correlations)
7. Identify patterns and outliers, revise data cleaning and preprocessing as relevant
8. Summarize findings, decisions based on these and how they were achieved. Describe the resulting analysis-ready dataset.



# 1. Load Python packages

Import the basic Python data science packages/libraries.
This tutorial assumes Python version 3.6+.

In [2]:
# import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# 2. Load dataset by using Pandas

Option 1: If you run the code from Google Colab, load the data from the CBM101 Git repository using the URL in the code cell further below.

Option 2: If you run the notebook locally on your machine, you can load the dataset by giving the local path to the dataset file.

```
df = pd.read_csv('/your path to directory/diabetes.csv')
```



In [3]:
url = "https://raw.githubusercontent.com/thilinib/CBM101/main/assets/diabetes.csv"
df = pd.read_csv(url)

The .csv file has now been read into a pandas DataFrame, a Python data structure for tabular data.

# 3. Check the data content and dimensions

In [4]:
# check the shape of the data
df.shape

(768, 9)

In [5]:
# check how dataset looks like
df.head()

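|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |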


Another important method is *info()* which gives a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

In [None]:
# check the data types and general info
df.info()



> **Make a note to your EDA report summary**: Do you notice any need for data type conversions, e.g. encoding group categories as numbers? (Hint: do all columns contain numeric data?)
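
As a minimal sketch of how this check could be automated, the snippet below lists the columns whose dtype is not numeric and shows one possible integer encoding. The small `toy` DataFrame and its `Group` column are made up for illustration and are not part of the diabetes dataset.

```python
import pandas as pd

# Hypothetical toy data: 'Group' is a made-up categorical column.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Group": ["a", "b", "a"],
})

# Columns that are not numeric would need encoding before most ML algorithms.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One possible encoding: map each category to an integer code.
toy["Group_code"] = toy["Group"].astype("category").cat.codes
print(toy.dtypes)
```

On the real data, the same `select_dtypes` call can be applied to `df`; an empty list means every column is already numeric.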



The code below demonstrates how to perform a type conversion. For this feature it is not actually needed, so after the demonstration we will revert the type back to what it was originally.

In [None]:
#change Glucose Dtype
df.Glucose = df.Glucose.astype(object)


In [None]:
df.info()

In the info table, check the 'Dtype' column. For downstream machine learning tasks we expect to see int or float. You can see that Glucose is now of "object" type, so let's convert it back to numerical data.

In [None]:
df.Glucose = df.Glucose.astype(int)
df.info()

# 4. Statistical summary of data

In [None]:
# check the summary statistics
df.describe()

In the descriptive statistics we can see:


*   **Count**: The count column indicates the number of non-null values for each variable. If these values vary by column, this can suggest that there are missing values.
*   **Mean (Average)**: The mean represents the average value of each variable across all observations. For example, the mean number of pregnancies is approximately 3.85, the mean glucose level is around 120.89 mg/dL, and so forth.

*   **min:** Examine whether there are any unusually small values. E.g. consider, based on your biological knowledge, the data for glucose level, blood pressure, skin thickness, insulin level and BMI.
*   **max:** Examine whether there are any unusually large values.


*   **Standard Deviation (Std):** The standard deviation measures the dispersion or variability of values around the mean. A higher standard deviation indicates greater variability in the data. For instance, variables like insulin and blood pressure have relatively high standard deviations compared to others, suggesting wider variability in their values, which could be biological or could indicate technical issues with these measurements.

*   **Percentiles (25th, 50th, 75th):** These percentiles provide insights into the distribution and spread of the data. If the mean and the median (50th percentile) are close to each other, it suggests that the data is approximately symmetrically distributed and likely has low skewness. However, if the mean is considerably larger or smaller than the median, it indicates skewness in the data, as is the case for Insulin.
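
The mean-versus-median comparison described in the last bullet can be sketched as follows. The `toy` values are invented for illustration; on the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical values: one roughly symmetric column, one with a very large value.
toy = pd.DataFrame({
    "symmetric": [10, 20, 30, 40, 50],
    "skewed": [10, 12, 14, 16, 300],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # sample skewness; > 0 means a longer right tail
})
# A mean well above the median is a simple hint of right skew.
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```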




It is always good to check the above information feature by feature. In addition, you should also compare the features: do some features dominate others in magnitude?
If so, consider whether the data needs to be scaled.
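
If scaling turns out to be needed, one common option is standardization (z-scores). The sketch below uses plain pandas on made-up values. Whether and how to scale depends on the downstream method, so treat this as one possibility and not as a step this notebook prescribes.

```python
import pandas as pd

# Hypothetical features on very different scales.
toy = pd.DataFrame({
    "Insulin": [94.0, 168.0, 88.0, 543.0],
    "DiabetesPedigreeFunction": [0.167, 2.288, 0.248, 0.158],
})

# Compare value ranges to see whether one feature dominates in magnitude.
print(toy.max() - toy.min())

# Standardize: subtract the column mean and divide by the column std.
scaled = (toy - toy.mean()) / toy.std()
print(scaled.describe().loc[["mean", "std"]])
```

After standardization every column has mean 0 and standard deviation 1, so no single feature dominates purely because of its units.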



> **Make a note to your EDA summary**: Did you notice unusual/suspicious values? If yes, describe your observations. Also comment on the need to scale feature values.



# 5. Data cleaning and preprocessing

Most Machine Learning algorithms cannot work with missing values. The goal in this section is to learn how to detect and remove those observations (in this case data for a given individual) that include missing values, or alternatively impute (i.e. replace with some value) missing values.

In the code below we inspect two scenarios: i) value is missing in the original table and assigned "null" when read into Python, ii) zero value is used to indicate that no information was available. Additionally, it could also be worth checking for "na", "NA", or other ways that this could be encoded.
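
As a sketch of the additional check for other encodings, `pd.read_csv` accepts an `na_values` argument listing extra strings to treat as missing. The inline CSV text and the `"missing"` marker are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content with differently encoded missing values.
csv_text = "Glucose,BMI\n148,33.6\nNA,26.6\n183,missing\n"

# 'NA' is recognized by default; custom markers can be added via na_values.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(toy.isnull().sum())
```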

In [None]:
#check null values
df.isnull().values.any()

In [None]:
df.isnull().sum()

In [None]:
# missing values (zero values)
(df == 0).sum()

In [None]:
# missing values (zero values) percentage
(df == 0).mean()

We can see that there are some features with zero values. However, we should first distinguish real zeros from likely missing values, and then think about how to solve this issue.

> **Make a note to your EDA summary** Specify for each feature whether zero values are meaningful. E.g. consider variables like "Pregnancies" vs features such as "Glucose" or "BloodPressure". Summarize how many missing values are present to guide the choice between imputation or removal.

This choice depends on various factors including the distribution of the data and the specific objectives of the analysis.
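
One way to make implausible zeros visible to pandas' missing-value tools is to recode them as `NaN` in the columns where zero is not a meaningful value. The toy rows and the column list below are assumptions for illustration; which columns belong in the list is exactly what the note above asks you to decide.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the style of the diabetes table.
toy = pd.DataFrame({
    "Pregnancies": [6, 0, 8],      # zero is a valid count here
    "Glucose": [148, 0, 183],      # zero is not physiologically plausible
    "BloodPressure": [72, 66, 0],
})

# Assumed list of columns where 0 likely encodes "not measured".
zero_as_missing = ["Glucose", "BloodPressure"]

recoded = toy.copy()
recoded[zero_as_missing] = recoded[zero_as_missing].replace(0, np.nan)

# isnull() now counts the recoded zeros, while Pregnancies is untouched.
print(recoded.isnull().sum())
```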

We will show one strategy to solve this issue step by step below. First, let's identify observations with many missing values.

In [None]:
# check the shape after dropping the rows with zero values in three features
df[(df.BloodPressure != 0) & (df.SkinThickness != 0) & (df.Insulin != 0)].shape

In [None]:
# calculate the fraction of rows retained after dropping the rows with zero values in these features
df[(df.BloodPressure != 0) & (df.SkinThickness != 0) & (df.Insulin != 0)].shape[0] / df.shape[0]

> **Make a note to your EDA summary** What proportion of data is lost if data entries with suspected missing values are removed? You can report for the above selection and try also different alternative ways, based on what features you consider may have missing values.
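
To compare alternative selections, a small helper can report the fraction of rows that would be lost for any list of columns. The helper name `fraction_lost` and the toy rows are hypothetical; on the real data, pass `df` instead of `toy`.

```python
import pandas as pd

def fraction_lost(data, columns):
    """Return the fraction of rows with a zero in any of the given columns."""
    has_zero = (data[columns] == 0).any(axis=1)
    return has_zero.mean()

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 0, 0, 23],
    "Insulin": [0, 0, 0, 94],
})

print(fraction_lost(toy, ["BloodPressure"]))                              # 0.25
print(fraction_lost(toy, ["BloodPressure", "SkinThickness"]))             # 0.5
print(fraction_lost(toy, ["BloodPressure", "SkinThickness", "Insulin"]))  # 0.75
```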

We can check for duplicates and inspect a scatter matrix to help decide what to do with the zero values.

In [None]:
# check duplicates
df.duplicated().sum()

In [None]:
# check scatter matrix
pd.plotting.scatter_matrix(df, figsize=(15,15), range_padding=0.3, alpha=0.3);

On the scatter plots you can spot a "saturation" of zeros when many data points separate from the "data cloud" at the zero value. If there are several such observations, one strategy is to remove them, especially those with many zeros. Below we remove observations in which the value of SkinThickness or Insulin is zero.

> **Make a note to your EDA summary** What did you observe in the scatter plots? Comment also on your choice of what data to remove in case you adapt the code below.

In [None]:
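# let's drop the rows with zero values in SkinThickness or Insulin
# (.copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning)
df = df[(df.SkinThickness != 0) & (df.Insulin != 0)].copy()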

In [None]:
# check the amount of missing values (zero values) after dropping the rows
(df == 0).sum()

In [None]:
# check the percentage of missing values (zero values) after dropping the rows
(df == 0).mean()

> **Make a note to your EDA summary**: Does it look like we have removed most of them?

You can use the dataset as it is now or, if you wish, remove the remaining zero values in other features.

If removing these observations leads to a considerable loss of data points, imputation is an alternative. There are many ways to impute data. Below, we provide code that simply imputes the missing values with the mean of the feature. This is the default in the PyCaret package (but it only works if missing values are null and not encoded as zero).

The following is an example of zero-value imputation for Glucose (notice that it is applied to the dataset that already went through the removal step based on SkinThickness and Insulin):

In [None]:
# Calculate the mean of 'Glucose' excluding zero values
glucose_mean = df[df['Glucose'] != 0]['Glucose'].mean()

# Impute missing values (zero values) in 'Glucose' with the calculated mean
df['Glucose'] = df['Glucose'].replace(0, glucose_mean)

# Display the DataFrame with imputed values
print(df['Glucose'])

In [None]:
# missing values (zero values) percentage
(df == 0).mean()


You can see that there are no longer any zero values for Glucose.
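
The same idea can be extended to several columns at once. The sketch below recodes zeros as `NaN` and fills them with the column median instead of the mean. The median is swapped in here as an alternative that is less sensitive to extreme values; the toy data and the column list are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})
cols = ["Glucose", "BMI"]  # assumed columns where 0 means "missing"

imputed = toy.copy()
imputed[cols] = imputed[cols].replace(0, np.nan)
# The median is computed on the non-missing values only (NaN is skipped).
imputed[cols] = imputed[cols].fillna(imputed[cols].median())
print(imputed)
```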




Now let's visualize the features separately.

Visualizations are often used to intuitively understand the distribution of the data.

Histograms can be used to visualize the distribution of continuous data/features. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).

> **Make a note to your EDA summary**: Do you observe unusually low/high values?

In [None]:
#Plot all values on a frequency graph (showing how often they occur).
df.hist(bins=50, figsize=(10,15))
plt.show()

If a feature looks suspicious, you can sort its values and show how many times each value occurs in the data using the following code (exemplified by blood pressure, skin thickness and glucose level).

In [None]:
# lower limit of the diastolic blood pressure
df.BloodPressure.value_counts().sort_index(ascending=True)

> **Make a note to your EDA summary**: Are there several unusual values? If yes, you should consider how to handle them. Examine and include comment to your summary from all of the features that looked suspicious based on the histograms.

In [None]:
# lower limit of the skin thickness
df.SkinThickness.value_counts().sort_index(ascending=True)

In [None]:
# lower limit of the glucose
df.Glucose.value_counts().sort_index(ascending=True)



You can also re-plot the scatter matrix after cleaning the data (i.e. after the code above has removed observations with missing values in two specific features).

> **Make a note to your EDA summary**: Compare the plots before and after the data cleaning steps.

In [None]:
# check scatter matrix after dropping the rows with zero values
pd.plotting.scatter_matrix(df, figsize=(15,15), range_padding=0.3, alpha=0.3);

# 6. Exploring relationships between features (correlations)

Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the *corr()* method. Note that calculating correlations on very large datasets can be computationally expensive.

In [None]:
corr_matrix = df.corr()
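
For reference, Pearson's r between two columns is their covariance divided by the product of their standard deviations. The sketch below checks this against `Series.corr` on made-up numbers; `x` and `y` are hypothetical and unrelated to the diabetes data.

```python
import pandas as pd

# Hypothetical paired measurements.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson's r = cov(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)  # method="pearson" is the default
print(r_manual, r_pandas)
```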

Now let’s look first at how much each attribute correlates with the outcome feature:

In [None]:
corr_matrix["Outcome"].sort_values(ascending=False)

We can see that all features have positive correlations with the outcome variable. Next, we can visualize the pairwise correlations between all features using a heatmap.

In [None]:
# check the correlation matrix
sns.heatmap(df.corr(), annot=True);



Using the correlation matrix, we get a complete picture of the pairwise correlations among the variables and of their association with the outcome.

Here, we can see that the feature ‘glucose’ has a high correlation with the outcome, which is expected.

> **Make a note to your EDA summary:** Do any features correlate strongly with each other? If yes, you should consider dropping redundant features (e.g. columns "height in cm" and "height in inches" would be redundant).
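
A possible way to list strongly correlated feature pairs programmatically is sketched below. The toy columns mirror the height example in the note, and the 0.9 threshold is an assumed cut-off, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: height_in is just height_cm rescaled, so the two are redundant.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [70.0, 52.0, 81.0, 66.0, 75.0],
})
toy["height_in"] = toy["height_cm"] / 2.54

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once (diagonal excluded).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.9  # assumed cut-off; choose one that suits your analysis
pairs = upper.stack()
high = pairs[pairs > threshold]
print(high)
```

On the real data, replace `toy` with `df.drop(columns="Outcome")` to look only at correlations among the predictors.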

Another way to check for correlation between attributes is to use pandas’ scatter_matrix function (which we have already drawn). It plots every numerical attribute against every other numerical attribute.


# 7. Identify patterns and outliers

To look for patterns in the dataset, we can use visualizations such as scatter plots and histograms.

Let's see how we can plot value distribution of each feature for diabetics vs non-diabetics.

In [None]:
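bins = 12
features = df.columns.drop('Outcome')  # Outcome is the grouping variable, so it is not plotted itself
n_cols = 3
n_rows = int(np.ceil(len(features) / n_cols))
plt.figure(figsize=(15,18))

for i, feature in enumerate(features):
    plt.subplot(n_rows, n_cols, i+1)
    # density histogram per outcome group, with a KDE curve overlaid
    sns.histplot(df, x = feature, hue='Outcome', bins=bins, common_norm = False, stat='density')
    sns.kdeplot(x = df[feature], hue=df['Outcome'], common_norm = False)
plt.tight_layout()
plt.show()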

From the statistical summary you have seen that Insulin has a high standard deviation and widely spread quartile values. This can also be observed visually here. Should we consider the very high Insulin values as outliers?

For more insight, let's plot box plots. These can be used as a quick check of the value distribution of each feature; they show the interquartile range and flag candidate outliers.





In [None]:
df.boxplot(figsize=(15, 10))
plt.title('Box Plot for All Features')
plt.ylabel('Values')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

If there are many clear outliers in the features, an additional data cleaning strategy may be needed, in addition to the approach to handling zero values that we added to the analysis code earlier. This strategy could involve imputing outliers e.g. with the median of each respective feature. The benefit of an imputing strategy is that it allows for the retention of all data points while addressing the outliers. Can you think of any caveats?
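
As one possible sketch of this strategy, the snippet below flags values lying more than 1.5 × IQR outside the quartiles (the rule conventionally used for box-plot whiskers) and replaces them with the column median. The values are invented, and both the rule and the replacement value are choices you would need to justify for the real data.

```python
import pandas as pd

# Hypothetical insulin-like values with one extreme measurement.
values = pd.Series([80.0, 90.0, 100.0, 110.0, 120.0, 846.0], name="Insulin")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])

# Replace flagged values with the median of the full column.
cleaned = values.mask(is_outlier, values.median())
print(cleaned.tolist())
```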

After completing the preprocessing steps, it's a good practice to save the preprocessed dataset for future use in machine learning tasks.

In [None]:
df.to_csv('preprocessed_diabetes.csv', index=False)



# Exercise
**Summarize each step of the analysis, focusing on what was observed, what was decided based on it, and how it was accomplished.** Refer to the plan at the beginning and "Make a note" parts highlighted in the notebook to ensure you address all key points.



*   Statistical summary of data
*   Detecting missing values and outliers
*   Feature correlations


Describe the analysis-ready processed dataset (how many observations it contains and whether all features were kept for downstream analysis).











