---
layout : post
title : List Filtering
descriptions : List Filtering
courses : { csp: {week: 1} }
comments: true
sticky_rank: 1

---

# Popcorn Hack #1


In [1]:
def find_students_in_range(df, min_score, max_score):
    return df[(df['Score'] >= min_score) & (df['Score'] <= max_score)]
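
A quick way to check the function is to run it on a small table. The sketch below assumes a hypothetical DataFrame with `Name` and `Score` columns (the names and scores are made up for illustration). Both bounds are inclusive, so a score equal to `min_score` or `max_score` is kept.

```python
import pandas as pd

def find_students_in_range(df, min_score, max_score):
    # Keep rows whose Score lies in [min_score, max_score], bounds included
    return df[(df['Score'] >= min_score) & (df['Score'] <= max_score)]

# Hypothetical sample data, not from the original post
students = pd.DataFrame({
    'Name': ['Ava', 'Ben', 'Cam', 'Dee'],
    'Score': [95, 80, 72, 58],
})

in_range = find_students_in_range(students, 70, 80)
print(in_range)

# Series.between builds the same inclusive mask in one call
same_rows = students[students['Score'].between(70, 80)]
```

`Series.between` is an alternative spelling of the same filter; which one reads better is a matter of taste.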

# Popcorn Hack #2

In [2]:
def add_letter_grades(df):
    def get_letter(score):
        if score >= 90:
            return 'A'
        elif score >= 80:
            return 'B'
        elif score >= 70:
            return 'C'
        elif score >= 60:
            return 'D'
        else:
            return 'F'
    df['Letter'] = df['Score'].apply(get_letter)
    return df
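
As a usage sketch, the function can be tried on a few boundary scores. The sample scores are made up, and passing a copy is a choice made here so that the original table is not modified, since `add_letter_grades` adds the `Letter` column to whatever DataFrame it receives.

```python
import pandas as pd

def add_letter_grades(df):
    def get_letter(score):
        if score >= 90:
            return 'A'
        elif score >= 80:
            return 'B'
        elif score >= 70:
            return 'C'
        elif score >= 60:
            return 'D'
        else:
            return 'F'
    # Adds the column in place on the DataFrame that was passed in
    df['Letter'] = df['Score'].apply(get_letter)
    return df

# Hypothetical scores chosen to sit on and near the grade boundaries
scores = pd.DataFrame({'Score': [95, 80, 79.5, 60, 12]})
graded = add_letter_grades(scores.copy())  # copy keeps `scores` unchanged
print(graded)
```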


# Popcorn Hack #3


In [3]:
def find_mode(series):
    return series.mode().iloc[0]
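
One detail worth seeing in action: `Series.mode()` returns every value tied for most frequent, in sorted order, so `.iloc[0]` picks the smallest of them. A small sketch with made-up numbers:

```python
import pandas as pd

def find_mode(series):
    # Series.mode() returns all modes in sorted order; take the first one
    return series.mode().iloc[0]

single = find_mode(pd.Series([88, 92, 88, 75]))  # 88 appears twice
tied = find_mode(pd.Series([3, 1, 3, 1, 2]))     # 1 and 3 both appear twice
print(single, tied)
```

An empty Series has no mode, so `.iloc[0]` would raise an `IndexError` in that case.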

# Homework Hack 

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Rows with the highest and lowest glucose readings
highest_glucose = data[data['Glucose'] == data['Glucose'].max()]
lowest_glucose = data[data['Glucose'] == data['Glucose'].min()]


# Difference between the maximum and minimum glucose
glucose_diff = data['Glucose'].max() - data['Glucose'].min()

# Rows where glucose exceeds the average
average_glucose = data['Glucose'].mean()
above_average_glucose = data[data['Glucose'] > average_glucose]

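# Group by BMI category and calculate average glucose and age
# Define BMI categories. pd.cut uses right-closed bins by default, so the
# ranges are (0, 18.5], (18.5, 24.9], (24.9, 29.9], (29.9, max]; a BMI of
# exactly 0 falls outside the first bin and gets no category.
bins = [0, 18.5, 24.9, 29.9, data['BMI'].max()]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
data['BMI_Category'] = pd.cut(data['BMI'], bins=bins, labels=labels)

# Group by BMI category (observed=False keeps every category in the result)
grouped_data = data.groupby('BMI_Category', observed=False)[['Glucose', 'Age']].mean()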


# Is there a correlation between BMI and glucose levels?
# To explore the relationship, I calculated the correlation coefficient with pandas.
# A positive value suggests that as BMI increases, glucose levels tend to increase too.
# I also used a scatter plot to observe this trend visually.
correlation = data[['BMI', 'Glucose']].corr().iloc[0, 1]

#Visualization:
import matplotlib.pyplot as plt

plt.scatter(data['BMI'], data['Glucose'])
plt.title('Correlation between BMI and Glucose Levels')
plt.xlabel('BMI')
plt.ylabel('Glucose')
plt.show()

# Which age group has the highest average glucose level?
# I divided the individuals into age groups using pd.cut() and calculated the
# average glucose level for each group. This identifies the age range with the
# highest average glucose.
# Note: with right=False the bins are [20, 30), ..., [70, 80), so ages below 20
# or of 80 and above fall outside every bin and are left out of the averages.
age_groups = pd.cut(data['Age'], bins=[20, 30, 40, 50, 60, 70, 80], right=False)
age_group_avg_glucose = data.groupby(age_groups, observed=False)['Glucose'].mean()
highest_glucose_age_group = age_group_avg_glucose.idxmax()

# What percentage of individuals have glucose levels above 140 mg/dL?
# To understand how common dangerously high glucose levels are, I calculated the
# percentage of individuals with glucose readings above 140 mg/dL.
# This threshold can signal potential diabetes risk.
high_glucose_count = data[data['Glucose'] > 140].shape[0]
percentage_high_glucose = (high_glucose_count / len(data)) * 100

#Store the Data in a Table:
import sqlite3

conn = sqlite3.connect('diabetes_analysis.db')
data.to_sql('diabetes_data', conn, if_exists='replace', index=False)

# Find the average glucose and BMI for each age group.
# The ELSE branch labels every age outside 20-69 as '70+'.
age_group_query = """
SELECT Age_Group, AVG(Glucose) AS Avg_Glucose, AVG(BMI) AS Avg_BMI
FROM (
    SELECT *,
           CASE
               WHEN Age BETWEEN 20 AND 29 THEN '20-29'
               WHEN Age BETWEEN 30 AND 39 THEN '30-39'
               WHEN Age BETWEEN 40 AND 49 THEN '40-49'
               WHEN Age BETWEEN 50 AND 59 THEN '50-59'
               WHEN Age BETWEEN 60 AND 69 THEN '60-69'
               ELSE '70+'
           END AS Age_Group
    FROM diabetes_data
)
GROUP BY Age_Group;
"""
age_group_stats = pd.read_sql_query(age_group_query, conn)

# Identify individuals with glucose levels above 180 mg/dL and BMI over 30
high_risk_query = """
SELECT *
FROM diabetes_data
WHERE Glucose > 180 AND BMI > 30;
"""
high_risk = pd.read_sql_query(high_risk_query, conn)

# Group by outcome and calculate average age and BMI
outcome_query = """
SELECT Outcome, AVG(Age) AS Avg_Age, AVG(BMI) AS Avg_BMI
FROM diabetes_data
GROUP BY Outcome;
"""
outcome_stats = pd.read_sql_query(outcome_query, conn)
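
Bin edges in `pd.cut` decide which side of each boundary a value lands on, which matters for the age grouping in the cell above. A minimal sketch with made-up ages (assumed values, not taken from the dataset) shows that with `right=False` each bin includes its left edge and excludes its right edge, and that a value outside every bin becomes `NaN`:

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries
ages = pd.Series([20, 29, 30, 79, 80])
groups = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70, 80], right=False)
print(groups)

# 20 and 29 land in [20, 30); 30 lands in [30, 40); 79 lands in [70, 80);
# 80 is outside the last bin and is assigned NaN
outside = groups.isna()
```

Rows whose group is `NaN` are dropped by `groupby`, so they do not contribute to any average.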

Compare SQL Approach with Pandas Approach:

Advantages of SQL:

- Handles large datasets efficiently, because the database filters and aggregates rows before returning results.
- Uses a standardized query language that works across most relational databases.
- Optimized for joins and aggregations across multiple tables.

Advantages of Pandas:

- Intuitive and flexible data manipulation in ordinary Python code.
- Seamless integration with Python's data ecosystem, such as Matplotlib and Seaborn.
- Well suited to exploratory data analysis, with rich visualization support.

Considerations:

- For data too large to fit in memory, SQL may offer performance benefits.
- Pandas provides more flexibility for complex data transformations and is well suited to in-memory operations.
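
To make the comparison concrete, here is a small sketch that computes the same per-outcome averages both ways. It uses a hypothetical four-row table with the same column names as the dataset and an in-memory SQLite database, so the numbers are illustrative only.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature table; values are made up for illustration
sample = pd.DataFrame({
    'Outcome': [0, 0, 1, 1],
    'Age': [25, 35, 50, 60],
    'BMI': [22.0, 26.0, 31.0, 35.0],
})

# pandas: group and average in memory
pandas_result = (
    sample.groupby('Outcome')[['Age', 'BMI']]
    .mean()
    .rename(columns={'Age': 'Avg_Age', 'BMI': 'Avg_BMI'})
    .reset_index()
)

# SQL: let SQLite do the aggregation, then read the result back
conn = sqlite3.connect(':memory:')
sample.to_sql('diabetes_data', conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT Outcome, AVG(Age) AS Avg_Age, AVG(BMI) AS Avg_BMI
    FROM diabetes_data
    GROUP BY Outcome
    ORDER BY Outcome
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Both approaches return one row per outcome with the same averages; the difference is where the work happens (inside the database or inside the Python process).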

# Extra Credit

To earn extra credit, I visualized trends in my glucose dataset using Seaborn and Matplotlib. These graphs help highlight relationships between BMI, glucose levels, and diabetes outcomes. Before plotting, I cleaned the data to ensure values were realistic and removed outliers.
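
The cleaning code itself is not shown in this post. As one possible sketch, the step could look like the code below. It rests on two assumptions: that a value of 0 in `Glucose` or `BMI` marks a missing measurement, and that outliers are defined by the common 1.5 × IQR rule. The `clean` helper and the sample rows are hypothetical.

```python
import pandas as pd

def clean(df, columns=('Glucose', 'BMI')):
    """Drop rows with zero readings or IQR outliers in the given columns."""
    cols = list(columns)
    # Assumption: a reading of 0 means "not recorded", so drop those rows
    out = df[(df[cols] != 0).all(axis=1)]
    for col in cols:
        # Assumption: outliers are values beyond 1.5 * IQR from the quartiles
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out

# Hypothetical rows: two zero readings and one extreme glucose value
sample = pd.DataFrame({
    'Glucose': [0, 90, 100, 110, 120, 400],
    'BMI': [25.0, 0.0, 24.0, 26.0, 28.0, 27.0],
})
cleaned = clean(sample)
print(cleaned)
```

In this sample the two rows containing a 0 and the row with a glucose of 400 are removed, leaving three rows.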

In [None]:
# Setup and load data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Style settings
plt.style.use('ggplot')
sns.set_palette("pastel")

# Load dataset
data = pd.read_csv('diabetes.csv')  # same file as in the homework cell

# 1. Boxplot of Glucose by Diabetes Outcome

This boxplot compares glucose levels between individuals who tested positive or negative for diabetes. Higher glucose values are clearly associated with a positive diagnosis.

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data)
plt.title('Glucose Levels by Diabetes Outcome')
plt.xlabel('Diabetes Outcome (0 = No, 1 = Yes)')
plt.ylabel('Glucose Level (mg/dL)')
plt.show()

# 2. Scatter Plot: BMI vs. Glucose
This scatter plot shows how glucose levels tend to rise with increasing BMI, especially for individuals with diabetes (Outcome = 1). It helps reveal a correlation between body mass and blood sugar levels.


In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='BMI', y='Glucose', hue='Outcome', data=data)
plt.title('BMI vs Glucose Level by Outcome')
plt.xlabel('BMI')
plt.ylabel('Glucose Level (mg/dL)')
plt.legend(title='Diabetes Outcome')
plt.show()


# 3. Bar Chart: Average Glucose by Age Group
I grouped people by age and calculated the average glucose level in each range. This shows which age groups have the highest average glucose.

In [None]:
# Bins are [20, 30), ..., [70, 80); ages outside them get no group
data['Age_Group'] = pd.cut(data['Age'], bins=[20, 30, 40, 50, 60, 70, 80], right=False)

age_group_avg = data.groupby('Age_Group', observed=False)['Glucose'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(x='Age_Group', y='Glucose', data=age_group_avg)
plt.title('Average Glucose Level by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Glucose (mg/dL)')
plt.xticks(rotation=45)
plt.show()


# 4. Histogram: Distribution of Glucose Levels
This histogram shows how glucose levels are distributed across the dataset. A noticeable right skew suggests several high-glucose outliers.

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(data['Glucose'], bins=10, kde=True)
plt.title('Distribution of Glucose Levels')
plt.xlabel('Glucose (mg/dL)')
plt.ylabel('Frequency')
plt.show()
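
The skew visible in a histogram can also be checked numerically. As a small sketch with made-up values (not the diabetes data), `Series.skew()` returns a positive number when the long tail is on the right and a negative number when it is on the left:

```python
import pandas as pd

# Hypothetical values: one large value stretches the right tail
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 10])
# Mirror image of the series above, so the long tail is on the left
left_tailed = pd.Series([-10, -3, -3, -3, -2, -2, -1])

print(right_tailed.skew())
print(left_tailed.skew())
```

On the dataset used here, the equivalent check would be `data['Glucose'].skew()`.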