# Ben Toaz
## Work at the bottom of the file

<div style="border: 1px solid #4CAF50; padding: 10px; border-radius: 5px;">
    <h2 style="color: #4CAF50;">Pre-Class Assignment (PCA) Instructions</h2>
</div>



This assignment is due by **midnight on Wednesday**. The goal is to explore the diabetes dataset in preparation for the **In-Class Assignment (ICA)** on Thursday. 

### Instructions:
1. **Explore the Dataset**: Take some time to understand the data, the key features, and how the target variable (disease progression) behaves.
   
2. **Prepare for Role Playing**: Imagine you work at a company, and you’ve been tasked with analyzing and presenting this dataset to stakeholders. Consider:
   - **Who is your audience**? (e.g., older generation, Gen-Z, millennials, or a mixed audience)
   - **Invent a setting**: Is this for a drug company, a health education seminar, or another industry?
   - **What are the plot twists**? Which trends or surprising insights will make your story more engaging?
   - **Best visualizations**: Based on your audience and setting, what are the most effective ways to visualize the data? Which plots from the EDA stand out, or what additional visualizations might you create?
   - **Delivery**: How would you structure your presentation to convey the story to your audience clearly and persuasively?

3. **Read the ICA Instructions**: Make sure you review the instructions for the in-class assignment on Thursday to understand how this pre-work will fit into the larger activity. Those instructions are included with this PCA.

4. **Submit a Summary**: By Wednesday night, submit a short summary (half a page) of your observations and ideas. This should include:
   - Key insights from the dataset
   - Ideas for your role-playing scenario (audience, setting, visualizations)
   - Your name!

This summary will count toward your ICA grade. Put all of your answers into a markdown cell at the bottom of this notebook and turn that in. 

### A Brief Introduction to Diabetes:
Diabetes is a chronic condition that affects how your body turns food into energy. It occurs when the pancreas doesn't produce enough insulin or the body can’t use insulin effectively. Over time, high blood sugar levels can lead to serious health complications like heart disease, vision loss, and kidney disease. Understanding the factors that contribute to diabetes progression can help inform treatment plans and preventative measures.

In this dataset, you’ll be exploring health-related variables such as BMI, blood pressure, and glucose levels to understand their relationship with the progression of diabetes over time.


In [1]:
# Import necessary libraries
# Note: trendline='ols' in the scatterplots below also requires statsmodels to be installed
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.datasets import load_diabetes

# Load the diabetes dataset and convert to a DataFrame
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

# s1 = cholesterol, s6 = blood glucose
selected_features = ['age', 'bmi', 'bp', 's1', 's6', 'target']  # Focus on these variables

# 1. Correlation Heatmap (Interactive)
# Round to two decimals so the cell annotations stay readable
correlation_matrix = diabetes_df[selected_features].corr().round(2).values
fig_heatmap = ff.create_annotated_heatmap(
    z=correlation_matrix,
    x=selected_features,
    y=selected_features,
    colorscale='Reds'
)
fig_heatmap.update_layout(
    title="Correlation Heatmap (Interactive)",
    xaxis_title="Features",
    yaxis_title="Features"
)
fig_heatmap.show()

# 2. Interactive Scatterplot of BMI vs Target with Regression Line
fig_scatter_bmi = px.scatter(diabetes_df, x='bmi', y='target', trendline='ols',
                             labels={'bmi':'BMI', 'target':'Disease Progression'},
                             title="Interactive Scatterplot of BMI vs Disease Progression")
fig_scatter_bmi.show()

# 3. Interactive Scatterplot of Blood Pressure vs Target with Regression Line
fig_scatter_bp = px.scatter(diabetes_df, x='bp', y='target', trendline='ols',
                            labels={'bp':'Blood Pressure', 'target':'Disease Progression'},
                            title="Interactive Scatterplot of Blood Pressure vs Disease Progression")
fig_scatter_bp.show()

# 4. Bin BMI and Blood Pressure for better categorical comparison in violin plots
diabetes_df['bmi_binned'] = pd.cut(diabetes_df['bmi'], bins=5)
diabetes_df['bp_binned'] = pd.cut(diabetes_df['bp'], bins=5)

# Convert the binned intervals to strings for compatibility with Plotly
diabetes_df['bmi_binned'] = diabetes_df['bmi_binned'].astype(str)
diabetes_df['bp_binned'] = diabetes_df['bp_binned'].astype(str)

# 5. Interactive Violin Plot for Binned BMI vs Target
fig_violin_bmi = px.violin(diabetes_df, x='bmi_binned', y='target', box=True, points='all',
                           labels={'bmi_binned':'BMI Binned', 'target':'Disease Progression'},
                           title="Interactive Violin Plot of Binned BMI vs Disease Progression")
fig_violin_bmi.show()

# 6. Interactive Violin Plot for Binned Blood Pressure vs Target
fig_violin_bp = px.violin(diabetes_df, x='bp_binned', y='target', box=True, points='all',
                          labels={'bp_binned':'Blood Pressure Binned', 'target':'Disease Progression'},
                          title="Interactive Violin Plot of Binned Blood Pressure vs Disease Progression")
fig_violin_bp.show()


# 7. Scatter Plot Matrix with Reduced Features (Ensure only numeric features are used)
numeric_features = ['age', 'bmi', 'bp', 's1', 'target', 's6']  # Focus on numeric features only

fig_splom = go.Figure(data=go.Splom(
    dimensions=[dict(label=col, values=diabetes_df[col]) for col in numeric_features],
    showupperhalf=False,  # Only show the lower half of the matrix
    diagonal_visible=False  # Hide diagonal subplots
))

fig_splom.update_layout(
    title="Scatter Plot Matrix of Selected Numeric Features",
    dragmode='select',
    width=800,
    height=800
)

fig_splom.show()

In [2]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         442 non-null    float64
 1   sex         442 non-null    float64
 2   bmi         442 non-null    float64
 3   bp          442 non-null    float64
 4   s1          442 non-null    float64
 5   s2          442 non-null    float64
 6   s3          442 non-null    float64
 7   s4          442 non-null    float64
 8   s5          442 non-null    float64
 9   s6          442 non-null    float64
 10  target      442 non-null    float64
 11  bmi_binned  442 non-null    object 
 12  bp_binned   442 non-null    object 
dtypes: float64(11), object(2)
memory usage: 45.0+ KB


In [3]:
diabetes_df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target | bmi_binned | bp_binned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 | (0.0141, 0.0662] | (-0.0146, 0.0343] |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 | (-0.0905, -0.0381] | (-0.0635, -0.0146] |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 | (0.0141, 0.0662] | (-0.0146, 0.0343] |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 | (-0.0381, 0.0141] | (-0.0635, -0.0146] |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 | (-0.0381, 0.0141] | (-0.0146, 0.0343] |


Next, carefully read the ICA instructions for Thursday. Invent a few scenarios ahead of time so that you and your subgroup can quickly settle on one you all agree on.


Here are some hints, but you and your subgroup should be creative!

### Subgroup Scenarios for Data Storytelling

Below are five distinct scenarios that you will consider in your subgroup. Each scenario presents a unique audience and setting, requiring you to adapt your data narrative, tone, and visualizations accordingly. Your goal is to tailor your presentation to fit the needs and expectations of the audience described in your assigned scenario.

##### Scenario 1: **Insulin Production Company – Presenting to New Employees**
- **Audience**: New employees at an insulin production company who are mainly curious about diabetes data but have limited knowledge.
- **Objective**: Explain the relationship between health factors (BMI, glucose, etc.) and diabetes progression in a simple and informative way.
- **Tone**: Educational and accessible, focusing on the basics.
- **Challenge**: Keeping it engaging for a less technical audience while delivering key insights about the product (insulin) and its impact on health.

##### Scenario 2: **Soft Drink Company – Presenting to a Traditional CEO**
- **Audience**: The CEO of a major soft drink company, who is resistant to change and prefers things the way they’ve always been.
- **Objective**: Persuade the CEO to consider the impact of sugary drinks on health and how it might relate to diabetes, potentially advocating for product reformulation or marketing shifts.
- **Tone**: Convincing but respectful, using hard-hitting data to support your case.
- **Challenge**: Overcoming resistance and biases, especially when the data suggests a negative impact of sugary drinks on diabetes progression.

##### Scenario 3: **Health Education Nonprofit – Presenting to a Diverse Audience**
- **Audience**: A mixed group at a health education seminar, including young adults, parents, and senior citizens.
- **Objective**: Raise awareness about the risk factors for diabetes and the importance of lifestyle changes (diet, exercise) to prevent or manage the condition.
- **Tone**: Clear, engaging, and actionable, with an emphasis on how each demographic can take steps to improve their health.
- **Challenge**: Addressing a diverse audience with varying levels of health literacy and interest in the topic.

##### Scenario 4: **Medical Research Conference – Presenting to Experts**
- **Audience**: Doctors, researchers, and healthcare professionals attending a medical research conference.
- **Objective**: Present the data on diabetes progression in relation to key health metrics like BMI and glucose, aiming to spark discussion on new treatment approaches or clinical trials.
- **Tone**: Technical and data-driven, with a focus on cutting-edge research and clinical applications.
- **Challenge**: Providing enough depth to keep experts engaged while ensuring clarity and focus.

##### Scenario 5: **Fitness Company – Presenting to a Group of Fitness Trainers**
- **Audience**: Fitness trainers and health coaches at a company focused on fitness and wellness.
- **Objective**: Highlight the role of exercise, diet, and physical fitness in managing and preventing diabetes. Connect diabetes data to fitness practices.
- **Tone**: Motivational and informative, focusing on how trainers can use this information to support their clients.
- **Challenge**: Translating medical data into practical, fitness-related advice that trainers can apply in their day-to-day work.


# Ben Toaz's Thoughts

2. **Prepare for Role Playing**: Imagine you work at a company, and you’ve been tasked with analyzing and presenting this dataset to stakeholders. Consider:
   - **Who is your audience**? (e.g., older generation, Gen-Z, millennials, or a mixed audience)
   - **Invent a setting**: Is this for a drug company, a health education seminar, or another industry?
   - **What are the plot twists**? Which trends or surprising insights will make your story more engaging?
   - **Best visualizations**: Based on your audience and setting, what are the most effective ways to visualize the data? Which plots from the EDA stand out, or what additional visualizations might you create?
   - **Delivery**: How would you structure your presentation to convey the story to your audience clearly and persuasively?


         Audience: Middle-Aged Business Executives

         Setting: Government Bioweapons Facility

         Twists: Those with a high BMI succumb to diabetes more easily, and diabetes is itself a predictor of numerous other diseases.

         Visuals: The correlation heatmap shows that BMI has the strongest relationship with the target (disease progression) among the selected features. It also shows only a low correlation with age, meaning that young and old people are susceptible at largely similar rates. Other traits of a target population (gender, income, location, etc.) should also be examined to find and exploit vulnerabilities.

         Delivery: You want power. You get it by building deadly weapons. These are the segments of the population that are most vulnerable to chemical and biological agents. Here's why.
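
The heatmap reading above can be checked numerically. This is a minimal sketch, assuming the same five selected features as the EDA cell. It ranks each feature by its Pearson correlation with the target. The names `df` and `target_corr` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = diabetes.frame  # ten features plus 'target'

selected_features = ['age', 'bmi', 'bp', 's1', 's6']

# Pearson correlation of each selected feature with the target, strongest first
target_corr = (df[selected_features]
               .corrwith(df['target'])
               .sort_values(ascending=False)
               .round(2))
print(target_corr)
```

Replacing `selected_features` with `diabetes.feature_names` would show whether any feature left out of the heatmap ranks close to BMI.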


4. **Submit a Summary**: By Wednesday night, submit a short summary (half a page) of your observations and ideas. This should include:
   - Key insights from the dataset
   - Ideas for your role-playing scenario (audience, setting, visualizations)
   - Your name!

         Among the selected features, BMI has the highest correlation with the target (disease progression), with blood pressure in second place. These two factors can be used to predict the severity of diabetes, perhaps not in isolation, but combined with other demographic information. Age is the only demographic variable among the selected features (the full dataset also includes sex), but adding nationality and income could greatly increase the predictability of the target. The BMI violin plot binning is interesting because the bins at the upper extreme hold far less data, and that data is fairly concentrated compared to the other bins. Few people in the dataset have a BMI past a certain threshold. Blood pressure is more evenly distributed across the bins. I wonder whether BMI would be as correlated with the target if its data were distributed like blood pressure's. A few pairs of features are more strongly correlated with each other than the rest, such as blood pressure and BMI, cholesterol and BMI, and age and blood pressure. Adding blood glucose level is interesting because it has a moderate correlation with all of the other selected features, the highest of which is with blood pressure.
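
The binning observation above can be explored with a short sketch. `pd.cut` with `bins=5` makes equal-width bins, so sparse regions of BMI end up in nearly empty bins, while `pd.qcut` makes bins with roughly equal counts. As one hedged way to probe whether the sparse upper tail drives the BMI correlation, the sketch also compares Pearson with a rank-based (Spearman) correlation, computed here by correlating ranks. The names `width_bins`, `count_bins`, `pearson`, and `spearman` are introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame

# Equal-width bins (what pd.cut with bins=5 produces): counts can be very uneven
width_bins = pd.cut(df['bmi'], bins=5)
print(width_bins.value_counts().sort_index())

# Equal-count bins: each bin holds roughly a fifth of the patients
count_bins = pd.qcut(df['bmi'], q=5)
print(count_bins.value_counts().sort_index())

# Spearman correlation is the Pearson correlation of the ranks,
# which makes it less sensitive to a few extreme BMI values
pearson = df['bmi'].corr(df['target'])
spearman = df['bmi'].rank().corr(df['target'].rank())
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If the two coefficients come out close, the relationship is probably not an artifact of the handful of patients in the highest BMI bins.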

         2nd Role Playing Scenario: Class Action Suit Against Nestle

         Audience: The Supreme Court (Learned Old People)

         Setting: Formal argument that aims to prove that large food conglomerate Nestle is partly responsible for causing diabetes in US citizens due to high cholesterol, fat, and glucose levels in their food. A similar dataset could be used by an expert witness to explain the factors that lead to the disease.

         Visuals: A dataset like this one could help establish the link between these food ingredients and the onset of diabetes. The correlation matrix is likely the most useful plot here. Once the causal link is solidified, the plaintiffs can turn to proving that Nestle was aware of this link and made unhealthy food anyway.



