# Lab 5 Experiment B: AI-Assisted Data Visualization

**Student:** Springer Tebeau  
**Course:** DCDA 40833  
**Date:** February 19, 2026

## Overview
This notebook demonstrates using AI to generate Python code for data analysis and visualization. The task is to analyze how study habits correlate with exam performance in a synthetic dataset of 50 students.

### AI Prompt Used:
> "Create Python code that generates a sample dataset of 50 students with columns for study_hours (1-10), sleep_hours (4-9), and exam_score (0-100). Then create a scatter plot showing the relationship between study hours and exam scores, with points colored by sleep quality. Add a trend line and proper labels."

### Understanding Level:
- **Data Generation:** I understand how numpy.random is used to create realistic sample data
- **Pandas DataFrame:** I understand this is a table-like data structure, similar to Excel
- **Matplotlib:** I understand the basics of creating plots, but the advanced styling was AI-generated
- **Statistical Correlation:** I know what it means but asked AI to calculate it
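
For reference, the correlation coefficient in the last bullet can be computed directly. The following is a minimal sketch, not the notebook's own code: `pearson_r` is a hypothetical helper name and the sample values are made up for illustration. It divides the summed products of deviations from the mean by the product of the two spreads, then cross-checks the result against `np.corrcoef`. The first value returned by `stats.pearsonr` is the same statistic.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviation of each x from the mean of x
    dy = y - y.mean()  # deviation of each y from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative values only (not the notebook's dataset)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

r_manual = pearson_r(hours, scores)
r_numpy = float(np.corrcoef(hours, scores)[0, 1])  # NumPy's built-in equivalent
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

The result always falls between -1 and 1; values near 1 mean the points lie close to a rising straight line.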

In [None]:
# AI-Generated: This cell imports the required libraries (it does not install them)
# Understanding: I know these are the standard data science libraries:
# - numpy: numerical operations and random number generation
# - pandas: data manipulation and analysis (like Excel in Python)
# - matplotlib: creating visualizations
# - scipy.stats: statistical functions (used later for the Pearson correlation)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

print("Libraries imported successfully!")

In [None]:
# AI-Generated: Code to create synthetic student data
# Understanding: I understand the logic here:
# - np.random.seed(42) makes the random numbers reproducible (same results each time)
# - np.random.randint(low, high, n) creates n random whole numbers from low up to high - 1
# - The exam_score calculation adds a base of 50, weighted study and sleep hours, and random noise
# - The formula ensures higher study/sleep = higher scores (realistic pattern)
# I modified: The AI originally used uniform(0, 20) for noise, but I changed it to uniform(0, 15)
# to reduce the randomness and make the correlation clearer

np.random.seed(42)  # For reproducibility

# Generate sample data for 50 students
n_students = 50

data = {
    'student_id': range(1, n_students + 1),
    'study_hours': np.random.randint(1, 11, n_students),  # 1-10 hours
    'sleep_hours': np.random.randint(4, 10, n_students),  # 4-9 hours
}

# Calculate exam scores based on study and sleep (with some randomness)
# Formula: base score + (study impact) + (sleep impact) + random noise
data['exam_score'] = (
    50 +  # Base score
    (data['study_hours'] * 3) +  # More study = higher score
    (data['sleep_hours'] * 2) +  # More sleep = better performance
    np.random.uniform(0, 15, n_students)  # Random variation
)

# Make sure scores stay within 0-100 range
data['exam_score'] = np.clip(data['exam_score'], 0, 100)

# Create DataFrame
df = pd.DataFrame(data)

# Display first few rows
print("First 5 students:")
print(df.head())

# Display basic statistics
print("\nSummary Statistics:")
print(df.describe())
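
The bounds of the score formula can be checked with plain arithmetic. This sketch assumes the ranges and weights used in the cell above (study hours 1-10 weighted by 3, sleep hours 4-9 weighted by 2, noise from 0 up to 15); the variable names are illustrative.

```python
# Smallest and largest value each term of the formula can take
base = 50
study_min, study_max = 1 * 3, 10 * 3  # study_hours in 1-10, weight 3
sleep_min, sleep_max = 4 * 2, 9 * 2   # sleep_hours in 4-9, weight 2
noise_min, noise_max = 0, 15          # uniform(0, 15); the upper bound itself is excluded

lowest = base + study_min + sleep_min + noise_min
highest = base + study_max + sleep_max + noise_max
print(lowest, highest)  # 61 113
```

Under these assumptions the raw score never drops below 61, so only the upper limit of `np.clip` can take effect, and scores for students with many study and sleep hours may be capped at 100.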

In [None]:
# AI-Generated: This creates a categorical variable for sleep quality
# Understanding: I understand this uses pandas.cut() to bin continuous sleep hours
# into categories. The bins parameter defines the ranges, and labels names them.
# This makes it easier to color-code the visualization by sleep quality level.

# Create sleep quality categories
df['sleep_quality'] = pd.cut(
    df['sleep_hours'], 
    bins=[0, 5, 7, 10], 
    labels=['Low (≤5h)', 'Medium (6-7h)', 'High (8+h)']
)

print("Sleep quality distribution:")
print(df['sleep_quality'].value_counts())
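
A small sketch of how `pd.cut` assigns the bins used above. By default each bin includes its right edge, so `bins=[0, 5, 7, 10]` means (0, 5], (5, 7], and (7, 10]. The `hours` series is illustrative and covers every whole number of sleep hours the dataset can contain.

```python
import pandas as pd

hours = pd.Series([4, 5, 6, 7, 8, 9])
quality = pd.cut(
    hours,
    bins=[0, 5, 7, 10],  # right edges are inclusive by default
    labels=['Low (≤5h)', 'Medium (6-7h)', 'High (8+h)'],
)

# Show which label each hour value receives
for h, q in zip(hours, quality):
    print(int(h), '->', q)
```

Both 5 and 7 fall into the lower of their two neighboring bins, which matches the label text.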

In [None]:
# AI-Generated: Creates the main scatter plot visualization
# Understanding: I get the overall structure - create figure, plot points, add trend line
# Parts I understand fully:
# - plt.figure() creates a new plot with specified size
# - The for loop iterates through each sleep quality category
# - plt.scatter() plots points with different colors for each group
# - Labels, title, and legend are straightforward
#
# Parts I understand the concept but not implementation details:
# - np.polyfit(x, y, 1) fits a degree-1 (linear) trend line (I know what it does, not how)
# - np.poly1d() creates a function from the trend line (black box to me)
# - The correlation coefficient calculation using stats.pearsonr()
#
# I did not modify this code - it worked perfectly as generated

# Create the visualization
plt.figure(figsize=(12, 7))

# Define colors for each sleep quality category
colors = {'Low (≤5h)': '#ff6b6b', 'Medium (6-7h)': '#ffd93d', 'High (8+h)': '#6bcf7f'}

# Plot each sleep quality group with different colors
# (iterating over the colors dict keeps the legend in Low/Medium/High order)
for quality, color in colors.items():
    mask = df['sleep_quality'] == quality
    plt.scatter(
        df.loc[mask, 'study_hours'],
        df.loc[mask, 'exam_score'],
        c=color,
        label=quality,
        alpha=0.6,
        s=100,
        edgecolors='black',
        linewidth=0.5
    )

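# Add trend line (drawn between its two endpoints so the dashes do not overlap)
z = np.polyfit(df['study_hours'], df['exam_score'], 1)
p = np.poly1d(z)
x_line = np.array([df['study_hours'].min(), df['study_hours'].max()])
plt.plot(x_line, p(x_line), "r--", linewidth=2, label='Trend Line')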

# Calculate and display correlation
correlation = stats.pearsonr(df['study_hours'], df['exam_score'])[0]

# Labels and styling
plt.xlabel('Study Hours per Week', fontsize=12, fontweight='bold')
plt.ylabel('Exam Score', fontsize=12, fontweight='bold')
plt.title(f'Study Hours vs Exam Performance\n(Correlation: {correlation:.2f})', 
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3, linestyle='--')
plt.ylim(0, 105)
plt.xlim(0, 11)

# Save the figure
plt.tight_layout()
plt.savefig('study_hours_analysis.png', dpi=300, bbox_inches='tight')
print("Visualization saved as 'study_hours_analysis.png'")
plt.show()

print(f"\nKey Finding: There is a {correlation:.2f} correlation between study hours and exam scores.")
if correlation > 0.5:
    print("This is a STRONG positive relationship - students who study more tend to score higher in this synthetic data.")
elif correlation > 0.3:
    print("This is a MODERATE positive relationship.")
else:
    print("This is a WEAK relationship.")
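
To make the `np.polyfit` and `np.poly1d` steps less of a black box, here is a minimal sketch on made-up points that lie exactly on the line y = 2x + 1 (not the student data). `np.polyfit(x, y, 1)` returns the least-squares coefficients with the highest power first, so slope and then intercept, and `np.poly1d` wraps those coefficients in a callable.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # illustrative points exactly on y = 2x + 1

slope, intercept = np.polyfit(x, y, 1)  # degree 1: [slope, intercept]
line = np.poly1d([slope, intercept])    # line(v) evaluates slope * v + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"prediction at x=5: {line(5.0):.3f}")
```

With noisy data the fitted line no longer passes through every point; it is the line that minimizes the sum of squared vertical distances to the points.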

In [None]:
# AI-Generated: Creates a second visualization comparing sleep vs performance
# Understanding: I understand this follows the same pattern as above but with sleep_hours
# on the x-axis instead. I asked AI to generate this after seeing the first plot.

plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(df['sleep_hours'], df['exam_score'], c='#6bcf7f', alpha=0.6, s=100, 
            edgecolors='black', linewidth=0.5)

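# Trend line (drawn between its two endpoints so the dashes do not overlap)
z2 = np.polyfit(df['sleep_hours'], df['exam_score'], 1)
p2 = np.poly1d(z2)
x_line2 = np.array([df['sleep_hours'].min(), df['sleep_hours'].max()])
plt.plot(x_line2, p2(x_line2), "b--", linewidth=2, label='Trend Line')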

# Calculate correlation
correlation2 = stats.pearsonr(df['sleep_hours'], df['exam_score'])[0]

plt.xlabel('Sleep Hours per Night', fontsize=12, fontweight='bold')
plt.ylabel('Exam Score', fontsize=12, fontweight='bold')
plt.title(f'Sleep Hours vs Exam Performance\n(Correlation: {correlation2:.2f})', 
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3, linestyle='--')
plt.tight_layout()
plt.show()

print(f"\nSleep-Performance Correlation: {correlation2:.2f}")

## Reflection on AI-Assisted Coding

### What Worked Well:
- AI generated syntactically correct code on the first try
- The visualization was professional-looking without me knowing advanced matplotlib
- AI handled the statistical calculations (correlation) that I don't know how to do yet

### What I Modified:
- Reduced the random noise in exam scores from 20 to 15 to make the trend clearer
- Added the second sleep hours visualization (initial AI prompt only included study hours)
- Changed some color values to match my portfolio's color scheme

### What I Learned:
- How pandas DataFrames work and why they're useful for tabular data
- The concept of correlation coefficients (though I still don't know the math)
- How to create categorical variables from continuous data using pd.cut()
- The importance of setting a random seed for reproducibility
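
A minimal sketch of what setting the seed does, using illustrative variable names: calling `np.random.seed(42)` again restarts the same sequence, so the same draws come out again.

```python
import numpy as np

np.random.seed(42)
first = np.random.randint(1, 11, 5)   # five values between 1 and 10

np.random.seed(42)                    # resetting the seed restarts the sequence
second = np.random.randint(1, 11, 5)

third = np.random.randint(1, 11, 5)   # no reset: the sequence simply continues

print("first == second:", np.array_equal(first, second))
print("first:", first, "third:", third)
```

This is why the notebook produces the same 50 students on every run, provided the cells execute in the same order after the seed is set.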

### Honest Assessment:
I could NOT have written this code from scratch. However, I CAN now:
- Read and understand what each section does
- Modify the data inputs and labels
- Troubleshoot basic errors (I had a typo in 'exam_score' that I fixed myself)
- Explain to someone else what the code accomplishes and why it's structured this way

The AI acted as a teacher by providing working examples I could learn from.