# Lab 3 — Exploratory Data Analysis (Beginner) and Interactive Dashboard

This notebook introduces basic EDA techniques and shows how to create and save interactive visualizations with Plotly. It uses the Seaborn `tips` dataset, which is small (244 rows, 7 columns) and easy to work with.

Estimated time: 60-90 minutes.

## Learning Objectives
- Load and inspect a dataset
- Compute simple derived features (e.g., tip percentage)
- Create univariate and bivariate plots (histogram, boxplot, scatter)
- Produce multivariate views (pairplot, correlation heatmap)
- Build and save interactive Plotly charts and a minimal dashboard example

In [1]:
%pip install pandas numpy seaborn matplotlib plotly dash

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Plot styling
plt.style.use('seaborn-v0_8')
pd.options.display.max_columns = 50

print('Libraries imported')

### Method explanations:
- `sns.load_dataset('tips')`: loads a small example dataset included with Seaborn.
- `df.shape`: returns a `(rows, columns)` tuple.
- `df.head()`: shows the first five rows by default for a quick preview.
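As a quick illustration of these inspection calls, here is a minimal sketch on a small hand-made DataFrame. The `demo` frame and its values are made up for illustration and only stand in for `tips`.

```python
import pandas as pd

# Hypothetical stand-in for the tips data (values are illustrative only)
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'tip': [1.5, 3.0, 4.0, 6.0, 7.5, 9.0],
    'day': ['Thur', 'Fri', 'Sat', 'Sun', 'Sat', 'Sun'],
})

n_rows, n_cols = demo.shape   # shape is a (rows, columns) tuple
preview = demo.head()         # first five rows by default
first_two = demo.head(2)      # pass n to change how many rows are returned

print('Rows:', n_rows, 'Columns:', n_cols)
print(preview)
```

Because `head()` returns a new DataFrame, the preview can be stored, printed, or left as the last expression of a notebook cell to render it as a table.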

In [None]:
# Load dataset
tips = sns.load_dataset('tips')

# Quick overview
print('Dataset shape:', tips.shape)
tips.head()

### Data cleaning and feature engineering
We'll add a `tip_pct` column (tip divided by total bill, expressed as a percentage), which makes tips comparable across bills of different sizes.
We'll also check for missing values and data types.
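To see why a percentage is easier to compare than a raw amount, consider a $2 tip on a $10 bill (20%) and a $5 tip on a $50 bill (10%). The sketch below works through that arithmetic on made-up rows and shows how a missing value is counted; the `demo` frame is hypothetical and not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical bills: a small and a large bill, plus one row with a missing tip
demo = pd.DataFrame({
    'total_bill': [10.0, 50.0, 20.0],
    'tip': [2.0, 5.0, np.nan],   # the last tip is missing on purpose
})

# Tip as a percentage of the bill
demo['tip_pct'] = demo['tip'] / demo['total_bill'] * 100

# Count missing values per column; the NaN in `tip` propagates to `tip_pct`
missing = demo.isnull().sum()

print(demo)
print(missing)
```

The larger tip in dollars turns out to be the smaller tip in percentage terms, which is the comparison `tip_pct` is meant to support.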

In [None]:
# Create tip percentage column and check dataset info
tips = tips.copy()
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100
print('Missing values per column:')
print(tips.isnull().sum())

print('Data types:')
print(tips.dtypes)

### Univariate plots — distributions
We will look at the distribution of `tip_pct` using a histogram and a boxplot; the same code works for `total_bill` if you swap the column name. Histograms show how often values fall in each range, while boxplots highlight the median, the quartiles, and outliers.
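It can help to see the numbers that sit behind these two plots. The sketch below computes bin counts (what a histogram draws) and the median, quartiles, and 1.5 × IQR fences (the default whisker rule in Seaborn and Matplotlib boxplots) for a small made-up series. The values are hypothetical and chosen so that one point stands out.

```python
import numpy as np
import pandas as pd

# Hypothetical tip percentages with one unusually large value
tip_pct = pd.Series([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 45.0])

# Histogram: count how many values fall into each of 4 equal-width bins
counts, bin_edges = np.histogram(tip_pct, bins=4)

# Boxplot: median, quartiles, and the 1.5 * IQR rule for flagging outliers
median = tip_pct.median()
q1, q3 = tip_pct.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = tip_pct[(tip_pct < lower) | (tip_pct > upper)]

print('Bin counts:', counts)
print('Median:', median, 'Q1:', q1, 'Q3:', q3)
print('Outliers:', outliers.tolist())
```

Here most values land in the first bin and only the largest value falls outside the fences, so a boxplot would draw it as a separate point.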

In [None]:
# Histogram and boxplot for tip percentage
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.histplot(tips['tip_pct'], bins=20, kde=True, color='skyblue')
plt.title('Tip Percentage Distribution')

plt.subplot(1,2,2)
sns.boxplot(x=tips['tip_pct'], color='lightgreen')
plt.title('Tip Percentage (Boxplot)')
plt.tight_layout()
plt.show()

### Bivariate plots — relationships
Scatter plots help reveal relationships between two numerical variables. We'll plot `total_bill` against `tip` and overlay a linear regression line.
We'll also compute the Pearson correlation matrix for `total_bill`, `tip`, and `tip_pct`.
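For intuition about what the regression line and the correlation value represent, here is a minimal sketch on made-up data. It computes Pearson's r by hand and compares it with pandas, then fits a straight line by least squares with `np.polyfit`. The `demo` values are hypothetical; the fit is meant as the same kind of straight-line fit that a default regression overlay draws, not a reproduction of Seaborn's internals.

```python
import numpy as np
import pandas as pd

# Hypothetical bills and tips (illustrative values only)
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
    'tip': [2.0, 3.0, 5.0, 6.0, 9.0],
})
x, y = demo['total_bill'], demo['tip']

# Pearson r = covariance(x, y) / (std(x) * std(y)), using sample statistics
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)   # pandas uses Pearson correlation by default

# Least-squares line: tip = slope * total_bill + intercept
slope, intercept = np.polyfit(x, y, deg=1)

print('r (manual):', round(r_manual, 4), 'r (pandas):', round(r_pandas, 4))
print('slope:', round(slope, 4), 'intercept:', round(intercept, 4))
```

The slope reads as "extra tip per extra dollar of bill", while r only says how tightly the points follow a straight line.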

In [None]:
# Scatter plot with regression
plt.figure(figsize=(8,6))
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')
sns.regplot(data=tips, x='total_bill', y='tip', scatter=False, color='red')
plt.title('Total Bill vs Tip (with regression)')
plt.show()

# Correlation
corr = tips[['total_bill','tip','tip_pct']].corr()
print('Correlation matrix:')
print(corr)

### Multivariate view — pairplot and heatmap
Pairplots show pairwise relationships. Heatmaps visualize correlations. These are useful for spotting patterns across multiple variables.

In [None]:
# Pairplot (small subset for clarity)
sns.pairplot(tips[['total_bill','tip','tip_pct','size']], corner=True, plot_kws={'alpha':0.6})
plt.show()

# Correlation heatmap
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

### Interactive visualizations with Plotly
Plotly Express makes it easy to create interactive charts that support hover, zoom, and pan. We'll create an interactive scatter and a bar chart and save them to standalone HTML files.
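A bar chart of averages needs one row per category, so the data has to be aggregated before plotting. The sketch below shows that step on made-up values. The `demo` frame is hypothetical, and `observed=True` is an explicit choice here so that categories with no rows (Friday in this example) are left out of the result.

```python
import pandas as pd

# Hypothetical data: 'day' is categorical, and 'Fri' has no rows
demo = pd.DataFrame({
    'day': pd.Categorical(['Thur', 'Thur', 'Sat', 'Sat', 'Sun'],
                          categories=['Thur', 'Fri', 'Sat', 'Sun']),
    'tip_pct': [10.0, 20.0, 15.0, 25.0, 18.0],
})

# One average per day; observed=True keeps only days that occur in the data
bar_df = demo.groupby('day', as_index=False, observed=True)['tip_pct'].mean()

print(bar_df)
```

The result keeps the category order (Thur, Sat, Sun), which is also the order the bars would appear in when this frame is passed to a plotting function.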

In [None]:
# Interactive scatter with hover information
fig = px.scatter(tips, x='total_bill', y='tip', color='smoker', size='size',
                 hover_data=['day','time','tip_pct'], title='Interactive: Total Bill vs Tip')
fig.show()

# Save to HTML so it can be shared (create the output folder first if it is missing)
import os
os.makedirs('lab-3', exist_ok=True)
fig.write_html('lab-3/total_bill_vs_tip.html')

# Interactive bar (average tip_pct by day)
# 'day' is categorical; observed=True groups only the days present in the data
bar_df = tips.groupby('day', as_index=False, observed=True)['tip_pct'].mean()
fig2 = px.bar(bar_df, x='day', y='tip_pct', title='Average Tip Percentage by Day')
fig2.show()
fig2.write_html('lab-3/avg_tip_pct_by_day.html')

### Simple dashboard example
A full dashboard will be provided as a small Dash app (`app.py`) in the `lab-3` folder. The app includes a dropdown and a slider to filter the dataset.
From the `lab-3` folder, run it locally with `python app.py` and open `http://127.0.0.1:8050` in a browser.
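The filtering that a dropdown and a slider drive can be sketched as a plain function, independent of Dash. This is not the code in `app.py`: the helper name `filter_tips`, the choice of `day` for the dropdown, and a maximum `total_bill` for the slider are all assumptions made for illustration.

```python
import pandas as pd

def filter_tips(df, day=None, max_bill=None):
    """Return the rows that match the selected day and bill cap.

    Hypothetical helper: `day` mimics a dropdown value and `max_bill`
    mimics a slider value. Passing None for either skips that filter.
    """
    mask = pd.Series(True, index=df.index)
    if day is not None:
        mask &= df['day'] == day
    if max_bill is not None:
        mask &= df['total_bill'] <= max_bill
    return df[mask]

# Made-up rows standing in for the tips data
demo = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'total_bill': [12.0, 48.0, 20.0, 35.0],
    'tip': [2.0, 7.0, 3.0, 5.0],
})

sat_small = filter_tips(demo, day='Sat', max_bill=30)
print(sat_small)
```

In a dashboard, a callback would typically call a function like this with the current control values and rebuild the figure from the filtered rows.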

### Exercises (Beginner-friendly)
1. Using the `tip_pct` column created above, compute its mean and median. Plot a histogram and describe the shape of the distribution.
2. Create a scatter plot of `total_bill` vs `tip`, colored by `day`. Which day shows the highest tips?
3. Run the Dash app in `app.py`, change the filters, and describe what you observe.