# Chapter 6: Data Visualization with Matplotlib

Welcome to data visualization! Being able to analyze data is crucial, but being able to effectively communicate your findings is just as important. **Matplotlib** is the foundational library for creating static, animated, and interactive visualizations in Python.

**Session Goals:**
* Understand the basic components of a Matplotlib plot: the **`Figure`** and the **`Axes`**.
* Learn the two main interfaces for plotting: the MATLAB-style and the object-oriented style.
* Create and customize fundamental plot types: **line plots**, **scatter plots**, and **histograms**.
* Learn how to label plots with titles, axis labels, and legends.

---
## Part 1: General Matplotlib Tips

Before we start plotting, let's cover some essential setup and tips for working with Matplotlib.

### Importing Matplotlib

The standard convention for importing Matplotlib's `pyplot` module is with the alias `plt`.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# This "magic" command tells Jupyter to display plots inline in the notebook.
%matplotlib inline

### Setting Styles

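Matplotlib's default style can sometimes look a bit dated. We can use `plt.style.use()` to apply a different aesthetic. For these examples, we'll use `'seaborn-v0_8-whitegrid'`, which provides a clean grid background. Since Matplotlib 3.6 the bundled seaborn styles carry a `seaborn-v0_8-` prefix; the older name `'seaborn-whitegrid'` is deprecated there and fails in recent releases.

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')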
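
Because style names differ between Matplotlib versions, it can help to check `plt.style.available` before choosing one. The following is a minimal sketch: the helper name `pick_style` and the list of candidate names are assumptions made for illustration and are not part of Matplotlib.

```python
import matplotlib.pyplot as plt


def pick_style(candidates, fallback="default"):
    """Return the first style name that this Matplotlib install provides."""
    for name in candidates:
        if name in plt.style.available:
            return name
    return fallback


# Try the newer style name first, then the older one.
style = pick_style(["seaborn-v0_8-whitegrid", "seaborn-whitegrid"])
plt.style.use(style)
print("Using style:", style)
```

If neither candidate is installed, the sketch falls back to Matplotlib's built-in `'default'` style rather than raising an error.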

### Two Interfaces: MATLAB-Style vs. Object-Oriented

Matplotlib has two distinct ways of creating plots, which can be confusing for beginners.

1.  **MATLAB-Style (State-Based):** Uses a single `plt` interface that keeps track of the "current" figure and axes. It's quick for simple plots but can be clunky for complex ones.
2.  **Object-Oriented:** Explicitly create `Figure` and `Axes` objects. This gives you much more control and is the recommended approach for complex plots. A common way to start is `fig, ax = plt.subplots()`.

**We will primarily use the object-oriented interface in this notebook as it is more powerful and explicit.**

In [None]:
# Shared data for both examples
x = np.linspace(0, 10, 100)

# MATLAB-style interface
plt.figure()
plt.subplot(2, 1, 1)  # (rows, columns, panel number)
plt.plot(x, np.sin(x))
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))
plt.show()

# Object-oriented interface (preferred)
fig, ax = plt.subplots(2)  # two rows; ax is an array of two Axes
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
plt.show()
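
To make the idea of a "current" figure and axes more concrete, the short sketch below checks which objects `plt` is tracking. It uses `plt.gcf()` ("get current figure") and `plt.gca()` ("get current axes"); the variable names are chosen only for this illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# pyplot remembers the most recently created figure and axes.
same_figure = plt.gcf() is fig
same_axes = plt.gca() is ax
print(same_figure, same_axes)

# As a result, both calls below draw on the same Axes object.
plt.plot([0, 1], [0, 1])  # state-based: targets the current axes
ax.plot([0, 1], [1, 0])   # object-oriented: targets ax explicitly
n_lines = len(ax.lines)
print(n_lines)
```

With several figures or subplots open, the state-based calls depend on which one was touched last, which is one reason the explicit style tends to be easier to follow.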

---
## Part 2: Simple Line Plots

The simplest plot visualizes a single function, $y=f(x)$.

In Matplotlib, the `Figure` (instance of `plt.Figure`) is the container for everything, and the `Axes` (instance of `plt.Axes`) is the bounding box with ticks and labels where we draw our data.

In [None]:
# Create a figure and an axes
fig, ax = plt.subplots()

# Prepare some data
x = np.linspace(0, 10, 1000)

# Plot the data on the axes
ax.plot(x, np.sin(x));
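
The relationship between the two objects can also be inspected directly. This is a small sketch, assuming a single-axes figure created with `plt.subplots()`; it imports the classes from `matplotlib.figure` and `matplotlib.axes` only for the type checks.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

fig, ax = plt.subplots()

is_figure = isinstance(fig, Figure)  # the top-level container
is_axes = isinstance(ax, Axes)       # the plotting area inside it
listed = fig.axes == [ax]            # the figure keeps a list of its axes
parent = ax.figure is fig            # each axes knows its parent figure

print(is_figure, is_axes, listed, parent)
```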

### Adjusting the Plot: Colors and Styles

You can customize the appearance of your plot using keyword arguments in the `.plot()` method.

* `color`: Can be a name (`'blue'`), a short code (`'g'`), a hex code (`'#FFDD44'`), etc.
* `linestyle`: Can be `'solid'`, `'dashed'`, `'dashdot'`, `'dotted'`, or shorthand (`'-'`, `'--'`, `'-.'`, `':'`).

In [None]:
fig, ax = plt.subplots()

ax.plot(x, x + 0, linestyle='solid', color='blue')
ax.plot(x, x + 1, linestyle='dashed', color='g')
ax.plot(x, x + 2, linestyle=':', color='red');
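
Color and line style can also be combined into a single format string passed as the third positional argument, such as `'--g'` for a dashed green line. The sketch below compares that shorthand with the explicit keyword form; `same_color` from `matplotlib.colors` is used only to confirm that both lines end up with the same color.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import same_color

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# '--g' combines a dashed line style ('--') with the color green ('g').
(line_fmt,) = ax.plot(x, x + 0, '--g')

# The equivalent call with explicit keyword arguments.
(line_kw,) = ax.plot(x, x + 1, linestyle='--', color='g')

print(line_fmt.get_linestyle(), line_kw.get_linestyle())
print(same_color(line_fmt.get_color(), line_kw.get_color()))
```

The shorthand is compact, while the keyword form is usually easier to read once more properties are involved.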

### Adjusting the Plot: Axes Limits

Matplotlib does a good job of setting default axis limits, but you can easily customize them with `.set_xlim()` and `.set_ylim()`, or by using the convenient `.set()` method on the axes object.

In [None]:
fig, ax = plt.subplots()

ax.plot(x, np.sin(x))

# Customize the limits and labels with the .set() method
ax.set(xlim=(0, 10), ylim=(-1.5, 1.5),
       xlabel='x',
       ylabel='sin(x)',
       title='A Simple Plot');
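
For comparison, here is a sketch of the same limits set with the individual methods, followed by reading them back with `get_xlim()` and `get_ylim()`. The final step, reversing the x-axis by passing the limits in descending order, is an extra illustration rather than something the plot above requires.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Set each limit with its own method instead of ax.set(...).
ax.set_xlim(0, 10)
ax.set_ylim(-1.5, 1.5)

# Read the limits back to confirm what the axes will show.
print(ax.get_xlim(), ax.get_ylim())

# Passing the limits in reverse order flips the axis direction.
ax.set_xlim(10, 0)
print(ax.xaxis_inverted())
```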

### Labeling Plots with Legends

When plotting multiple lines, a legend is essential to label each line. The easiest way is to add a `label` to each plot call and then call `ax.legend()`.

In [None]:
fig, ax = plt.subplots()

ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.axis('equal')  # Use equal scaling: one unit in x spans the same length as one unit in y

ax.legend();
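
`ax.legend()` accepts options for placement and layout, and it only lists lines that were given a `label`. The sketch below illustrates both points; the particular `loc`, `ncol`, and `frameon` values are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()

ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.plot(x, np.zeros_like(x), ':k')  # no label, so it is left out of the legend

# Place the legend in the upper right, in two columns, with a frame.
legend = ax.legend(loc='upper right', ncol=2, frameon=True)

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```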

#### ✏️ Try It Yourself: Line Plot

Create a single plot with two lines:
1.  A plot of $y = x^2$ for x values from -5 to 5. Make this a solid blue line.
2.  A plot of $y = x^3$ for the same x values. Make this a dashed red line.
3.  Add a title "Quadratic vs. Cubic" and labels for the x and y axes.
4.  Include a legend to identify which line is which.

In [None]:
# Write your code here

---
## Part 3: Simple Scatter Plots

A scatter plot represents individual points of data, rather than connecting them with lines. It's excellent for showing the relationship between two variables.

### Scatter Plots with `ax.scatter`
While you can create scatter plots with `ax.plot()` by providing a marker style (e.g., `'o'`), the `ax.scatter()` method is more powerful. It allows the properties of each individual point (like size, color, and transparency) to be controlled by other variables. This is useful for visualizing multidimensional data.

In [None]:
fig, ax = plt.subplots()

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

# Use 'c' for color and 's' for size to map data to visual properties
scatter = ax.scatter(x, y, c=colors, s=sizes, alpha=0.3,
                       cmap='viridis')

# Add a colorbar to show the color scale
fig.colorbar(scatter, label='Color Value');
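
To see the difference between the two approaches side by side, the following sketch draws the same points with `ax.plot()` and with `ax.scatter()`. The data, the seed, and the size range are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PathCollection
from matplotlib.lines import Line2D

rng = np.random.RandomState(1)
x = rng.randn(50)
y = rng.randn(50)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# ax.plot with a marker and no line: every point shares one style.
(points,) = ax_left.plot(x, y, 'o', color='black')
ax_left.set_title('ax.plot with markers')

# ax.scatter: the size (and other properties) can vary per point.
sizes = 20 + 200 * rng.rand(50)
collection = ax_right.scatter(x, y, s=sizes, alpha=0.5)
ax_right.set_title('ax.scatter')

print(type(points).__name__, type(collection).__name__)
```

`ax.plot()` returns a single line object for all points, while `ax.scatter()` returns a collection whose members can each be styled differently. For large datasets with uniform markers, `ax.plot()` is typically the lighter option.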

#### ✏️ Try It Yourself: Scatter Plot

Create a scatter plot of 50 random points where:
1.  The x-values are random numbers between 0 and 10.
2.  The y-values are random numbers between 0 and 10.
3.  The **color** of each point is determined by its y-value.
4.  The **size** of each point is 10 times its x-value.
5.  Include a colorbar.

In [None]:
# Write your code here

---
## Part 4: Histograms, Binnings, and Density

A histogram is a great first step in understanding the distribution of a single variable. It groups data into bins and shows the count of data points that fall into each bin.

In [None]:
fig, ax = plt.subplots()

data = np.random.randn(1000)

# The ax.hist() function has many options for customization
ax.hist(data, bins=30, alpha=0.5, color='steelblue', edgecolor='none');

ax.set(title='Simple Histogram',
       xlabel='Value',
       ylabel='Frequency');
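
`ax.hist()` also returns the numbers behind the picture: the per-bin counts, the bin edges, and the drawn patches. The sketch below, which uses a fixed seed chosen only for reproducibility, inspects those values and compares them with `np.histogram`, which computes the binning without drawing anything.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(42)
data = rng.randn(1000)

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(data, bins=30)

# 30 bins are delimited by 31 edges.
print(len(counts), len(edges))

# Every data point lands in exactly one bin.
total = int(counts.sum())
print(total)

# np.histogram yields the same counts and edges without plotting.
np_counts, np_edges = np.histogram(data, bins=30)
print(np.allclose(counts, np_counts), np.allclose(edges, np_edges))
```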

You can easily overlay multiple histograms to compare distributions.

In [None]:
fig, ax = plt.subplots()

x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

# Use a dictionary to define common styling
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)

ax.hist(x1, **kwargs, label='Group 1')
ax.hist(x2, **kwargs, label='Group 2')
ax.hist(x3, **kwargs, label='Group 3')
ax.legend();
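
When the groups being compared have different sample sizes, raw counts can be misleading. Passing `density=True` normalizes each histogram so that its total area is 1. The following is a minimal sketch with made-up sample sizes and distribution parameters.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
small = rng.normal(0, 1, 200)    # 200 samples
large = rng.normal(2, 1, 5000)   # 5000 samples

fig, ax = plt.subplots()
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40, density=True)

heights_small, edges_small, _ = ax.hist(small, label='200 samples', **kwargs)
heights_large, edges_large, _ = ax.hist(large, label='5000 samples', **kwargs)
ax.legend()

# With density=True, bar height times bin width sums to 1 for each group.
area_small = np.sum(heights_small * np.diff(edges_small))
area_large = np.sum(heights_large * np.diff(edges_large))
print(round(area_small, 6), round(area_large, 6))
```

Without normalization, the larger group would dominate the plot simply because it contains more points.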

---
## Final Capstone Exercises 🎬

These exercises will require you to use Pandas to analyze a dataset and Matplotlib to visualize your findings.

### Step 0: Create the Dataset

Run the cell below to create a sample `movies.csv` file. It contains data on a few movies, including their genre, release year, budget, and box office revenue. Budget and revenue are given in millions of USD.

In [None]:
%%writefile movies.csv
Title,Genre,Year,Budget,Revenue
Avatar,Sci-Fi,2009,237,2788
Titanic,Romance,1997,200,2187
The Avengers,Action,2012,220,1519
Jurassic World,Action,2015,150,1672
The Dark Knight,Action,2008,185,1005
Finding Nemo,Animation,2003,94,940
The Lion King,Animation,1994,45,968
Forrest Gump,Drama,1994,55,678
Inception,Sci-Fi,2010,160,828
The Matrix,Sci-Fi,1999,63,463

### Exercise 1: Analyze and Plot Movie Budgets

Your first task is to load the movie data and create a visualization related to movie budgets.

**Requirements:**
1.  Load `movies.csv` into a Pandas `DataFrame` called `movies_df`.
2.  Create a **bar chart** that shows the budget for each movie.
3.  Customize the plot:
    * Add a title: "Movie Budgets (in millions USD)".
    * Use the movie titles for the x-axis ticks.
    * Rotate the x-axis tick labels so they are readable (Hint: `plt.xticks(rotation=90)`, or `ax.tick_params(axis='x', labelrotation=90)` in the object-oriented style).
    * Label the y-axis "Budget".
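
Bar charts were not covered in the earlier parts, so here is a minimal sketch of `ax.bar()` with rotated tick labels. The category names and values are made up purely for illustration and are unrelated to the movie data.

```python
import matplotlib.pyplot as plt

# Made-up values purely for illustration.
labels = ['Category A', 'Category B', 'Category C']
values = [3, 7, 5]

fig, ax = plt.subplots()

# One bar per label; string labels become the x-axis ticks.
bars = ax.bar(labels, values, color='steelblue')

# Rotate the x tick labels in the object-oriented style.
ax.tick_params(axis='x', labelrotation=90)
ax.set(title='Example Bar Chart', ylabel='Value')

bar_heights = [bar.get_height() for bar in bars]
print(bar_heights)
```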

In [None]:
# Write your code for Exercise 1 here

### Exercise 2: Budget vs. Revenue Scatter Plot

Building on your work from Exercise 1, now you need to create a more complex visualization to explore the relationship between budget, revenue, and genre.

**Requirements:**
1.  Start with your `movies_df` from the previous exercise.
2.  Create a **scatter plot** of `Budget` (x-axis) vs. `Revenue` (y-axis).
3.  Enhance the scatter plot to include more dimensions of data:
    * **Color-code** the points based on their `Genre`.
    * Make the **size** of each point proportional to its `Year` (Hint: You may need to scale the year values to make the sizes look good, e.g., `(movies_df['Year'] - 1990) * 10`).
4.  Customize the plot:
    * Add a title: "Movie Budget vs. Revenue by Genre".
    * Add x and y axis labels.
    * Add a **legend** to explain the colors for each genre. (Hint: You may need to plot each genre separately in a loop to create a proper legend).
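
The loop-based legend hint can be sketched on a toy table. The DataFrame below, including its column names `group`, `x`, and `y`, is invented for this example and is not the movie dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data purely for illustration.
df = pd.DataFrame({
    'group': ['a', 'b', 'a', 'c', 'b', 'c'],
    'x': [1, 2, 3, 4, 5, 6],
    'y': [2, 1, 4, 3, 6, 5],
})

fig, ax = plt.subplots()

# One scatter call per category gives each category its own color
# (taken from the default color cycle) and its own legend entry.
for name, subset in df.groupby('group'):
    ax.scatter(subset['x'], subset['y'], label=name)

legend = ax.legend(title='group')
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

A single `ax.scatter()` call with a `c=` array colors the points but does not by itself produce one legend entry per category, which is why the loop is a common approach.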

In [None]:
# Write your code for Exercise 2 here