# Data Visualization Exercises

## MCDA 5511

Apply what you've learned to analyze real datasets. These exercises require you to **choose** the right visualization and **think** about what story the data tells.

**Instructions:**
- Complete the code in each cell where you see `# YOUR CODE HERE`
- Hints are provided - try without them first!
- Multiple correct approaches exist for each problem
- Focus on clarity and insight, not just making a plot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np

# Style name as of matplotlib 3.6+; older releases call it 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')

# Load the datasets
sales = pd.read_csv('../data/sales_data.csv', parse_dates=['date'])
weather = pd.read_csv('../data/weather_data.csv', parse_dates=['date'])
experiment = pd.read_csv('../data/experiment_results.csv')

print('Datasets loaded!')

---
## Part 1: Sales Data Analysis

You're a data analyst at a company selling two products (Widget A, Widget B) across four regions.

In [None]:
print(sales.head(10))
print(f'\nDate range: {sales.date.min()} to {sales.date.max()}')
print(f'Regions: {sales.region.unique()}')
# Bracket access is required here: sales.product resolves to the DataFrame.product() method
print(f"Products: {sales['product'].unique()}")

### Exercise 1.1: Which region has the highest total revenue?

Create a visualization that answers this question clearly. The answer should be obvious at a glance.

<details>
<summary>Hint</summary>

- You'll need to aggregate the data first using `groupby()`
- Bar charts are good for comparing categories
- Sorting makes comparison easier

</details>
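
The aggregate-then-sort pattern from the hint can be sketched on made-up toy data. This is only an illustration, not the course dataset: the `toy` frame and its values are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "revenue": [120.0, 90.0, 80.0, 150.0, 60.0, 30.0],
})

# Collapse to one number per category, then sort so the ranking is visible
totals = toy.groupby("region")["revenue"].sum().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(totals.index, totals.values, color="steelblue")
ax.set_xlabel("Region")
ax.set_ylabel("Total revenue")
ax.set_title("Total revenue by region (toy data)")
fig.tight_layout()
```

Sorting before plotting puts the largest bar first, so the answer reads left to right without comparing bar heights across the chart.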

In [None]:
# YOUR CODE HERE


### Exercise 1.2: How does revenue trend over time for each region?

Create a line chart showing monthly revenue by region. Which region is growing fastest?

<details>
<summary>Hint</summary>

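- Group by both date AND region (if the rows are finer than monthly, roll the dates up to months first, e.g. with `pd.Grouper(freq='MS')`)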
- Line plots show trends; use `hue` or `color` to distinguish regions
- Markers help show actual data points

</details>

In [None]:
# YOUR CODE HERE


### Exercise 1.3: Compare profit margins between products

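Calculate profit (revenue - cost) for each row. Then visualize which product has better profit margins across regions. A margin is profit as a fraction of revenue, so consider dividing profit by revenue if you want a comparison that does not depend on sales volume.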

Choose an appropriate chart type that shows both the comparison and any regional variation.

<details>
<summary>Hint</summary>

- First create the profit column
- Grouped bar charts or box plots work well for comparing categories within categories
- The `hue` parameter lets you add a second grouping variable

</details>
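
A grouped bar chart with a second grouping variable might look like the sketch below. The toy numbers are hypothetical, and the `margin` column (profit divided by revenue) is an assumption about how you might define margin; plotting raw profit works the same way.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "region": ["North"] * 4 + ["South"] * 4,
    "product": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "revenue": [100.0, 200.0, 100.0, 200.0, 100.0, 200.0, 100.0, 200.0],
    "cost": [60.0, 120.0, 80.0, 160.0, 50.0, 100.0, 90.0, 180.0],
})

toy["profit"] = toy["revenue"] - toy["cost"]
# Assumed definition of margin: profit as a fraction of revenue
toy["margin"] = toy["profit"] / toy["revenue"]

fig, ax = plt.subplots(figsize=(6, 4))
# x gives the outer grouping, hue the inner one; bar height is the group mean
sns.barplot(data=toy, x="region", y="margin", hue="product", ax=ax)
ax.set_xlabel("Region")
ax.set_ylabel("Profit margin (profit / revenue)")
ax.set_title("Profit margin by product and region (toy data)")
fig.tight_layout()
```

Swapping `sns.barplot` for `sns.boxplot` with the same arguments shows the spread within each group instead of only the mean.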

In [None]:
# YOUR CODE HERE
sales['profit'] = sales['revenue'] - sales['cost']


---
## Part 2: Weather Data Analysis

You have 10 days of weather data from three Canadian cities.

In [None]:
print(weather.head(10))
print(f'\nCities: {weather.city.unique()}')

### Exercise 2.1: Temperature distributions by city

Create a visualization comparing the temperature distributions across the three cities. Which city has the most variable temperature? Which is warmest on average?

<details>
<summary>Hint</summary>

- Box plots and violin plots show distributions well
- Look at the spread (IQR) to judge variability
- The median line shows the "typical" value; compute the mean separately if you want the exact average

</details>
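
To make the link between box width and variability concrete, here is a small sketch on made-up data. The two city labels (`Inland`, `Coastal`) and all temperatures are hypothetical; the point is only how the IQR you read off the box matches a number you can compute.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (hypothetical): one wide and one narrow temperature distribution
toy = pd.DataFrame({
    "city": ["Inland"] * 11 + ["Coastal"] * 11,
    "temperature": np.concatenate([
        np.linspace(-10, 10, 11),  # wide spread
        np.linspace(3, 7, 11),     # narrow spread
    ]),
})

# IQR = 75th percentile minus 25th percentile, computed per city
grouped = toy.groupby("city")["temperature"]
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
medians = grouped.median()

fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=toy, x="city", y="temperature", ax=ax)
ax.set_xlabel("City")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Temperature distribution by city (toy data)")
fig.tight_layout()
```

Here the taller box belongs to the city with the larger IQR, even though the other city has the higher median, which is why the two questions in the exercise can have different answers.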

In [None]:
# YOUR CODE HERE


### Exercise 2.2: Is there a relationship between humidity and precipitation?

Create a scatter plot to explore whether higher humidity correlates with more precipitation. Color by city to see if the relationship differs by location.

<details>
<summary>Hint</summary>

- Scatter plots reveal relationships between two numeric variables
- Adding color by a category can reveal if patterns differ by group
- Trend lines help quantify the relationship

</details>
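
The scatter-plus-trend-line idea can be sketched as follows. The toy values and city labels are hypothetical, and the straight-line fit via `np.polyfit` is just one simple choice of trend line, not the only reasonable one.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "city": ["A"] * 4 + ["B"] * 4,
    "humidity": [40, 55, 70, 85, 45, 60, 75, 90],
    "precipitation": [0.0, 1.0, 2.5, 4.0, 0.5, 1.5, 3.5, 5.0],
})

# Least-squares straight line through all points (degree-1 polynomial)
slope, intercept = np.polyfit(toy["humidity"], toy["precipitation"], deg=1)
# Pearson correlation quantifies how close the points sit to a line
r = toy["humidity"].corr(toy["precipitation"])

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=toy, x="humidity", y="precipitation", hue="city", ax=ax)
xs = np.linspace(toy["humidity"].min(), toy["humidity"].max(), 50)
ax.plot(xs, slope * xs + intercept, color="black", linestyle="--", label="Linear fit")
ax.set_xlabel("Humidity (%)")
ax.set_ylabel("Precipitation (mm)")
ax.set_title(f"Humidity vs precipitation (toy data, r = {r:.2f})")
ax.legend()
fig.tight_layout()
```

Fitting one line per city instead (looping over `toy.groupby("city")`) is a natural extension if the colors suggest the relationship differs by location.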

In [None]:
# YOUR CODE HERE


### Exercise 2.3: Weather correlation heatmap

Create a correlation heatmap of all numeric weather variables. Which variables are most strongly correlated? Does this make meteorological sense?

<details>
<summary>Hint</summary>

- Use `.select_dtypes('number')` to get only numeric columns
- The `.corr()` method computes pairwise correlations
- Diverging colormaps (like RdBu) work well centered at 0

</details>
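
The three hints combine into a short pipeline, sketched here on hypothetical toy data. The column names and values are made up; fixing `vmin=-1`, `vmax=1`, and `center=0` is one way to keep the diverging colormap symmetric around zero.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (hypothetical values); "city" is non-numeric and must be excluded
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "humidity": [60.0, 58.0, 50.0, 45.0, 47.0, 40.0],
    "wind_speed": [10.0, 12.0, 9.0, 14.0, 11.0, 13.0],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes("number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    corr,
    annot=True, fmt=".2f",     # print the coefficient in each cell
    cmap="RdBu_r",             # diverging colormap
    vmin=-1, vmax=1, center=0, # symmetric scale so white means "no correlation"
    ax=ax,
)
ax.set_title("Correlation matrix (toy data)")
fig.tight_layout()
```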

In [None]:
# YOUR CODE HERE


---
## Part 3: Scientific Experiment Data

You have results from a drug trial testing three drugs (A, B, C) at different concentrations, plus control and placebo groups.

In [None]:
print(experiment)
print(f'\nTreatments: {experiment.treatment.unique()}')

### Exercise 3.1: Dose-response curves with error bars

Create a line plot showing mean_response vs concentration for each drug (A, B, C only - exclude Control and Placebo). Include error bars using the std_error column.

Which drug shows the strongest dose-response relationship?

<details>
<summary>Hint</summary>

- Filter the dataframe to include only the three drugs
- Matplotlib's `ax.errorbar()` supports the `yerr` parameter
- Loop through each drug to plot separate lines with different colors
- Don't forget `capsize` to make error bars visible

</details>
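
The filter-then-loop pattern with `ax.errorbar()` could look like this. Everything in the toy frame is hypothetical, including the `Drug X` and `Drug Y` labels, which are deliberately different from the real treatment names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "treatment": ["Drug X"] * 3 + ["Drug Y"] * 3 + ["Control"],
    "concentration": [0.5, 1.0, 2.0, 0.5, 1.0, 2.0, 0.0],
    "mean_response": [1.2, 2.1, 3.9, 1.0, 1.3, 1.6, 0.9],
    "std_error": [0.2, 0.3, 0.4, 0.2, 0.2, 0.3, 0.1],
})

# Keep only the treatments that have a dose series
toy_drugs = toy[toy["treatment"].isin(["Drug X", "Drug Y"])]

fig, ax = plt.subplots(figsize=(6, 4))
for name, grp in toy_drugs.groupby("treatment"):
    grp = grp.sort_values("concentration")  # lines must be drawn in dose order
    ax.errorbar(
        grp["concentration"], grp["mean_response"],
        yerr=grp["std_error"],  # symmetric error bars of +/- one standard error
        marker="o", capsize=4, label=name,
    )
ax.set_xlabel("Concentration")
ax.set_ylabel("Mean response")
ax.set_title("Dose-response curves (toy data)")
ax.legend()
fig.tight_layout()
```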

In [None]:
# YOUR CODE HERE
drugs = experiment[experiment['treatment'].isin(['Drug A', 'Drug B', 'Drug C'])]


### Exercise 3.2: Statistical significance visualization

The p_value column gives the p-value for each result; treat a result as statistically significant when p < 0.05.

Create a visualization that clearly shows which drug/concentration combinations are statistically significant. Consider using color, markers, or annotations to highlight significance.

<details>
<summary>Hint</summary>

- Create a boolean column for significance (p < 0.05)
- Options: different colors, different marker styles, or asterisk annotations
- A heatmap with drug vs concentration is another approach

</details>
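
As one of the options listed in the hint, asterisk annotations can be sketched like this. The toy data and drug labels are hypothetical, and the 0.05 threshold follows the exercise text.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "treatment": ["Drug X"] * 3 + ["Drug Y"] * 3,
    "concentration": [0.5, 1.0, 2.0, 0.5, 1.0, 2.0],
    "mean_response": [1.1, 1.8, 3.0, 1.0, 1.2, 1.9],
    "p_value": [0.40, 0.04, 0.001, 0.60, 0.20, 0.03],
})

# Boolean flag: True where the result passes the significance threshold
toy["significant"] = toy["p_value"] < 0.05

fig, ax = plt.subplots(figsize=(6, 4))
for name, grp in toy.groupby("treatment"):
    (line,) = ax.plot(grp["concentration"], grp["mean_response"], marker="o", label=name)
    # Put an asterisk just above each significant point, in the line's color
    sig = grp[grp["significant"]]
    for x, y in zip(sig["concentration"], sig["mean_response"]):
        ax.annotate(
            "*", (x, y),
            textcoords="offset points", xytext=(0, 6),
            ha="center", fontsize=14, color=line.get_color(),
        )
ax.set_xlabel("Concentration")
ax.set_ylabel("Mean response")
ax.set_title("Significant results marked with * (toy data, p < 0.05)")
ax.legend()
fig.tight_layout()
```

Stating the threshold in the title or a caption matters here, because an unexplained asterisk leaves the reader guessing what it marks.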

In [None]:
# YOUR CODE HERE


### Exercise 3.3: Compare all treatments at highest dose

Filter to only the highest concentration (2.0) for each drug, plus Control and Placebo. Create a bar chart with error bars comparing mean response across all treatments.

Add a horizontal line showing the Control baseline for reference.

<details>
<summary>Hint</summary>

- Filter for concentration == 2.0 OR concentration == 0.0
- `ax.bar()` accepts a `yerr` parameter for error bars
- `ax.axhline()` draws a horizontal reference line

</details>
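
A bar chart with error bars and a baseline might be put together as below. The toy frame is hypothetical, and it assumes the reference groups are stored with a concentration of 0.0, as the hint suggests.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "treatment": ["Drug X", "Drug X", "Drug Y", "Drug Y", "Control", "Placebo"],
    "concentration": [1.0, 2.0, 1.0, 2.0, 0.0, 0.0],
    "mean_response": [2.0, 3.5, 1.4, 1.8, 1.0, 1.1],
    "std_error": [0.3, 0.4, 0.2, 0.3, 0.1, 0.1],
})

# Highest dose for each drug plus the zero-concentration reference groups
subset = toy[toy["concentration"].isin([0.0, 2.0])]
control = subset.loc[subset["treatment"] == "Control", "mean_response"].iloc[0]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(
    subset["treatment"], subset["mean_response"],
    yerr=subset["std_error"], capsize=4, color="steelblue",
)
# Horizontal reference line at the Control mean
ax.axhline(control, color="gray", linestyle="--", label="Control baseline")
ax.set_xlabel("Treatment")
ax.set_ylabel("Mean response")
ax.set_title("Response at highest dose vs baseline (toy data)")
ax.legend()
fig.tight_layout()
```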

In [None]:
# YOUR CODE HERE


---
## Part 4: Interactive Dashboard (Plotly)

Create interactive versions of your analyses.

### Exercise 4.1: Interactive sales explorer

Create an interactive scatter plot of units_sold vs revenue, with:
- Color by region
- Symbol/marker by product
- Hover showing date, region, product, and profit (the `profit` column created in Exercise 1.3)

Save as `sales_explorer.html`

<details>
<summary>Hint</summary>

- `px.scatter()` has parameters for `color`, `symbol`, and `hover_data`
- Use `.write_html()` to save interactive plots

</details>

In [None]:
# YOUR CODE HERE


### Exercise 4.2: Animated weather comparison

Create an animated bar chart showing daily temperature for each city, with animation_frame as the date.

Fix the y-axis range so the animation is smooth.

<details>
<summary>Hint</summary>

- Convert date to string for the animation frame
- `px.bar()` supports `animation_frame` parameter
- Use `range_y` to fix the axis and prevent jumping

</details>

In [None]:
# YOUR CODE HERE


---
## Part 5: Challenge - Tell a Story

Choose ONE of the three datasets. Create a **single figure** (can have multiple subplots) that tells a compelling story about the data.

Requirements:
- Must include a clear title
- Must include axis labels
- Must be publication-quality (save at 300 DPI)
- Write 2-3 sentences explaining your insight below the figure

<details>
<summary>Hint</summary>

Think about what's interesting or surprising in the data. What would a manager, scientist, or journalist want to know?

For multi-panel figures, use `plt.subplots()` with multiple axes.

</details>
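
For the mechanics only (the story is yours to find), a two-panel figure saved at 300 DPI could be structured like this. All values, group labels, and the output file name are hypothetical, and the file goes to the temp directory purely for the sake of the sketch.

```python
import tempfile
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

# Toy series (hypothetical values, not from any course dataset)
months = np.arange(1, 7)
group_a = np.array([3, 4, 6, 7, 9, 12])
group_b = np.array([5, 5, 6, 6, 7, 7])

# One figure, two panels side by side
fig, (ax_trend, ax_total) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: how each group changes over time
ax_trend.plot(months, group_a, marker="o", label="Group A")
ax_trend.plot(months, group_b, marker="o", label="Group B")
ax_trend.set_xlabel("Month")
ax_trend.set_ylabel("Value")
ax_trend.set_title("Trend over time")
ax_trend.legend()

# Right panel: totals over the whole period
ax_total.bar(["Group A", "Group B"], [group_a.sum(), group_b.sum()], color="steelblue")
ax_total.set_xlabel("Group")
ax_total.set_ylabel("Total value")
ax_total.set_title("Total over six months")

fig.suptitle("Group A starts lower but finishes higher (toy data)")
fig.tight_layout()

# dpi=300 gives print resolution; bbox_inches="tight" trims surrounding whitespace
out_path = Path(tempfile.gettempdir()) / "toy_story_figure.png"
fig.savefig(out_path, dpi=300, bbox_inches="tight")
```

Phrasing the overall title as the finding rather than a description of the axes does much of the storytelling work.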

In [None]:
# YOUR CODE HERE


**Your insight:** (Write 2-3 sentences here)

