# <font color='cyan'> Data Exploration with Python

### Dataset Overview: Type 2 Diabetes (T2D)

Type 2 Diabetes (T2D) is a chronic metabolic condition that affects how the body processes blood sugar (glucose). In Type 1 Diabetes the body produces little or no insulin; in T2D the body either does not produce enough insulin or its cells become resistant to it. Both lead to high blood sugar levels, which can cause serious health complications over time.

In [None]:
# Importing the necessary libraries for data exploration
import pandas as pd          # For data manipulation and analysis
import numpy as np           # For numerical computations and array operations
import matplotlib.pyplot as plt  # For creating static visualizations (plots)
import seaborn as sns        # For statistical data visualization, built on top of Matplotlib
from scipy.stats import pearsonr  # For statistical analysis (Pearson correlation)



In [None]:
data = pd.read_csv('/Users/hadarklimovski/Desktop/T2D_dataset_final.csv').set_index('PatientID')

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.shape

### **Distributions with histograms (matplotlib.pyplot)**

A histogram is a graphical representation of the distribution of a numerical variable. It shows how many data points fall within each of a series of intervals, called bins. The x-axis covers the range of values (e.g., age, height, or BMI), split into bins, and the y-axis shows the number of data points in each bin.

In [None]:

# Generate a histogram for BMI
plt.figure(figsize=(8, 6))
plt.hist(data['BMI'], bins=10, color='lightblue', edgecolor='black')  # Create histogram
plt.title('Distribution of BMI', size = 20)  # Title of the plot
plt.xlabel('BMI', size = 20)  # Label for the x-axis
plt.ylabel('Frequency', size = 20)  # Label for the y-axis
plt.show()  # Display the plot



In [None]:

# Generate a histogram for Blood_Glucose
plt.figure(figsize=(8, 6))
plt.hist(data['Blood_Glucose'], bins=15, color='lightpink', edgecolor='black')  # Create histogram
plt.title('Distribution of Blood_Glucose', size = 20)  # Title of the plot
plt.xlabel('Blood_Glucose', size = 20)  # Label for the x-axis
plt.ylabel('Frequency', size = 20)  # Label for the y-axis
plt.show()  # Display the plot
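
The counts behind the bars of a histogram can also be inspected directly. The sketch below is only an illustration: it assumes synthetic, normally distributed values (`bmi_values` is a hypothetical stand-in, not the T2D dataset) and uses `np.histogram`, which performs the same binning step that `plt.hist` applies before drawing.

```python
import numpy as np

# Synthetic BMI-like values (an assumption for illustration; not the T2D dataset)
rng = np.random.default_rng(seed=0)
bmi_values = rng.normal(loc=27, scale=4, size=200)

# np.histogram returns the number of points in each bin and the bin edges
counts, bin_edges = np.histogram(bmi_values, bins=10)

# Print one line per bin: its lower edge, upper edge, and count
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")

# Every value falls into exactly one bin, so the counts sum to the sample size
print("Total:", counts.sum())
```

Changing `bins` changes how coarse or fine the picture is, which is why the two plots above use different bin counts.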


### **Pearson correlation (SciPy)**

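Pearson correlation is a statistical measure that quantifies the linear relationship between two continuous variables. The coefficient ranges from -1 to 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. The accompanying p-value tests the null hypothesis that the true correlation is zero.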

In [None]:
from scipy.stats import pearsonr

correlation, p_value = pearsonr(data['BMI'], data['Cholesterol_Level'])
print(f'Correlation: {correlation}, P-value: {p_value}')


In [None]:
plt.scatter(data['BMI'], data['Cholesterol_Level'], alpha=0.5)
plt.title('BMI vs Cholesterol Level')
plt.xlabel('BMI')
plt.ylabel('Cholesterol_Level')
plt.show()
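
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand and compares it with `pearsonr`. It is a minimal illustration on synthetic data: `x`, `y`, `r_manual`, and `r_scipy` are hypothetical names, and nothing here comes from the T2D dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, partly related variables (an assumption for illustration)
rng = np.random.default_rng(seed=1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# Deviations of each value from its variable's mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# r = sum of products of deviations / sqrt(product of sums of squared deviations)
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same coefficient from SciPy, together with its p-value
r_scipy, p_demo = pearsonr(x, y)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_demo:.4g}")
```

The two values of r should agree up to floating-point rounding.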


### **Correlation Matrix**

A correlation matrix is a table of the correlation coefficients between every pair of variables in a dataset. Each cell displays the correlation between the variable in its row and the variable in its column. The values range from -1 to 1, indicating the direction and strength of the linear relationship; the diagonal is always 1 because each variable is perfectly correlated with itself.

In [None]:

# Calculate the correlation matrix
corr_matrix = data.corr(numeric_only=True)  # Skip any non-numeric columns (requires pandas 1.5+)

# Visualize the correlation matrix using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
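
The structure of a correlation matrix is easier to see on a small table. The sketch below builds a hypothetical three-column DataFrame (`toy`, with made-up `Age`, `BMI`, and `Noise` columns that are assumptions, not the T2D data) and checks two properties: the diagonal is 1 and the matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): BMI depends weakly on Age,
# while Noise is unrelated to both
rng = np.random.default_rng(seed=2)
n = 150
age = rng.uniform(20, 80, size=n)
toy = pd.DataFrame({
    "Age": age,
    "BMI": 22 + 0.05 * age + rng.normal(scale=3, size=n),
    "Noise": rng.normal(size=n),
})

# Pairwise Pearson correlations between all columns
toy_corr = toy.corr()
print(toy_corr.round(2))

# Each variable correlates perfectly with itself, so the diagonal is all ones
print("Diagonal is 1:", np.allclose(np.diag(toy_corr.to_numpy()), 1.0))

# corr(A, B) equals corr(B, A), so the matrix mirrors across the diagonal
print("Symmetric:", np.allclose(toy_corr.to_numpy(), toy_corr.to_numpy().T))
```

Because of the symmetry, only the cells on one side of the diagonal carry distinct information when reading a heatmap.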


### **Independent T-test (SciPy)**

An **Independent T-test** is used to determine whether there is a significant difference between the **means** of two independent groups. The groups are independent when no observation in one group is paired with, or influences, an observation in the other.

### Key Concepts

- **Null Hypothesis (H₀)**: The means of the two groups are equal (i.e., no significant difference).
- **Alternative Hypothesis (H₁)**: The means of the two groups are not equal (i.e., a significant difference).
  
### T-statistic:
The T-statistic quantifies how much the group means differ relative to the variability (spread) within the groups.

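### P-value:
- **P-value < 0.05**: Reject H₀; the difference between the group means is **statistically significant** at the 0.05 level.
- **P-value ≥ 0.05**: Fail to reject H₀; the data show **no statistically significant difference** between the groups (this does not prove the means are equal).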
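
The T-statistic described above can be computed by hand for a small example. The sketch below is a minimal illustration on synthetic data (`group_a` and `group_b` are hypothetical samples, not taken from the T2D dataset). It uses the pooled-variance (Student's) form, which matches the default `equal_var=True` setting of `ttest_ind`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two synthetic, independent groups (an assumption for illustration)
rng = np.random.default_rng(seed=3)
group_a = rng.normal(loc=100, scale=15, size=40)
group_b = rng.normal(loc=110, scale=15, size=50)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# T-statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# The same test from SciPy (equal_var=True is the default)
t_scipy, p_demo = ttest_ind(group_a, group_b)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_demo:.4g}")

# Decision at the conventional 0.05 significance level
alpha = 0.05
print("Reject H0" if p_demo < alpha else "Fail to reject H0")
```

If the two groups may have clearly different variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead of the pooled version.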




### Comparing Blood Glucose levels between **male** and **female** groups:

In [None]:
male_data = data[data['Gender (Male=0, Female=1)'] == 0]
female_data = data[data['Gender (Male=0, Female=1)'] == 1]

In [None]:
male_data.shape, female_data.shape

In [None]:
male_data_glu = male_data['Blood_Glucose']
female_data_glu = female_data['Blood_Glucose']

In [None]:
from scipy.stats import ttest_ind
# Perform the t-test
t_stat, p_value = ttest_ind(male_data_glu, female_data_glu)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}, P-value: {p_value}")



In [None]:

plt.figure(figsize=(6, 4))
sns.boxplot(
    x=data['Gender (Male=0, Female=1)'], 
    y=data['Blood_Glucose'], 
    width=0.3  
)
plt.title('Blood Glucose Comparison between Male and Female', fontsize=14)
plt.xlabel('Gender (Male=0, Female=1)', fontsize=12)
plt.ylabel('Blood Glucose', fontsize=12)
plt.tight_layout()
plt.show()


### Comparing Blood Glucose levels between **healthy** and **sick** groups:

In [None]:
sick_data = data[data['Disease Status (Healthy=0, T2D=1)'] == 1]
healthy_data = data[data['Disease Status (Healthy=0, T2D=1)'] == 0]

In [None]:
sick_data.shape, healthy_data.shape

In [None]:
sick_data_glu = sick_data['Blood_Glucose']
healthy_data_glu = healthy_data['Blood_Glucose']

In [None]:
# Perform the t-test
t_stat, p_value = ttest_ind(sick_data_glu, healthy_data_glu)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}, P-value: {p_value}")


In [None]:
# Map disease status labels for the x-axis
disease_labels = ['Healthy', 'T2D']

# Create the box plot with steel-blue boxes and black edges
plt.figure(figsize=(8, 6))
sns.boxplot(
    x=data['Disease Status (Healthy=0, T2D=1)'], 
    y=data['Blood_Glucose'], 
    width=0.5, 
    boxprops=dict(facecolor='steelblue', edgecolor='black'),
    medianprops=dict(color="black")
)

# Replace 0 and 1 with 'Healthy' and 'T2D'
plt.xticks(ticks=[0, 1], labels=disease_labels, fontsize=12)

# Add titles and labels
plt.title('Blood Glucose Comparison between Healthy and Diseased Patients', fontsize=14, fontweight='bold')
plt.xlabel('Disease Status', fontsize=12)
plt.ylabel('Blood Glucose Level', fontsize=12)

# Add a significance bracket labeled with the p-value above the boxes
x1, x2 = 0, 1  # Positions for the boxes
y, h, col = data['Blood_Glucose'].max() + 10, 10, 'black'  # Bracket base, bracket height, and color
plt.plot([x1, x1, x2, x2], [y, y + h, y + h, y], lw=1.5, color=col)
plt.text((x1 + x2) * .5, y + h + 2, f"p = {p_value:.3f}", ha='center', va='bottom', color=col, fontsize=12)

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


# <font color = pink> Class Exercise


1. Plot a histogram of glucose levels. What can you infer from it?

In [None]:
#write your code here

2. Check the Pearson correlation between age and sleep duration and plot it.

In [None]:
#write your code here

3. Perform a t-test to check whether BMI differs significantly between sick and healthy individuals, and draw a box plot. Is the difference significant?

In [None]:
#write your code here