<img src='images/Practicum_AI_Logo.white_outline.svg' width=250 alt='Practicum AI logo'> <img src='https://github.com/PracticumAI/practicumai.github.io/blob/main/images/icons/practicumai_python.png?raw=true' align='right' width=50>

***

# Data Visualization with Matplotlib

Matplotlib (a portmanteau of "MATLAB", "plot" and "library") is one of the most widely used data visualization libraries in Python. It provides a flexible framework for creating static, animated, and interactive visualizations.

In this lesson, we're going to be exploring some USDA data on energy use in agriculture [[1](#Citations)]. **Check out [the description of the data](https://data.nal.usda.gov/dataset/data-chapter-5-energy-use-agriculture-us-agriculture-and-forestry-greenhouse-gas-inventory-1990-2018) before continuing.**

It is probably also a good idea to **open a tab with the [`matplotlib` documentation](https://matplotlib.org/stable/api/index.html).**

## Let's get plotting!

`matplotlib` needs to be imported before we can use it. Like other popular libraries such as pandas (`pd`) and NumPy (`np`), it has a standard abbreviation: its `pyplot` module is conventionally imported as `plt`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load and examine the data

Now, let's load the data. We'll be loading in "Data for Figure 5-2: Energy use in agriculture, by source, 1965–2018 in QBTU (quadrillion British thermal units)". We've included a copy of this CSV file in the data folder.

In [None]:
file_path = 'data/Figure5_2.csv'
energy_data = pd.read_csv(file_path)

In [None]:
# Take a look at the data
energy_data.head()

In [None]:
# Get some summary statistics
energy_data.describe()

## Plotting with `matplotlib`


A `matplotlib` plot is composed of several elements. 
- The `fig` variable (short for figure) is the overall window or page that everything is drawn on. A figure contains one or more axes, each usually stored in a variable named `ax`.
- The `ax` variable is the area on which the data are plotted with functions like `plot()` and `scatter()` and that can have ticks, labels, etc. associated with it.
- To create a figure and an axes, we use the `plt.subplots()` function. 
- In the code cell below we are using the `scatter()` function to create a scatter plot of Gasoline consumption vs. Year. Notice that inside the scatter function, we are specifying the `x` and `y` values to be plotted from the `energy_data` DataFrame we created above.
- After specifying the data to be plotted, we (optionally) set the x and y labels using the `set_xlabel()` and `set_ylabel()` functions. Finally, we use `plt.show()` to display the plot.

In [None]:
# Create a figure and an axes
fig, ax = plt.subplots()

# Plot Gasoline consumption vs. Year
ax.scatter(energy_data['Year'], energy_data['Gasoline'])

# Set the labels
ax.set_xlabel('Year')
ax.set_ylabel('Gasoline')

# Show the plot
plt.show()
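To illustrate the point that one figure can hold more than one axes, here is a minimal sketch that places two scatter plots side by side. The `demo_data` table and its values are hypothetical stand-ins (not the USDA data), used only so the cell runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data (not the USDA values)
demo_data = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Gasoline": [0.30, 0.28, 0.27, 0.25],
    "Diesel": [0.45, 0.47, 0.46, 0.50],
})

# One figure holding a grid of 1 row and 2 columns of axes
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# The first axes gets the Gasoline data and its own labels
axes[0].scatter(demo_data["Year"], demo_data["Gasoline"])
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Gasoline")

# The second axes gets the Diesel data and its own labels
axes[1].scatter(demo_data["Year"], demo_data["Diesel"])
axes[1].set_xlabel("Year")
axes[1].set_ylabel("Diesel")

plt.show()
```

When `plt.subplots()` is asked for more than one axes, it returns them as an array, so each one is reached by its index.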

### <img src='images/exercise_icon.svg' alt="Exercise icon" width=40 align=center> Exercise 1

> Make a plot similar to the Gasoline plot above but for one of the other energy sources.

In [None]:
# Add your code here


## Building your plots iteratively

Building plots with `matplotlib` can also be an iterative process. We start by creating a figure and axes, then we add data and customize the appearance. Using the code above as a starting point, let's modify it step-by-step. First we'll change the color of the points from the default blue to red.

In [None]:
fig, ax = plt.subplots()

ax.scatter(energy_data['Year'], energy_data['Gasoline'], color='red')
ax.set_xlabel('Year')
ax.set_ylabel('Gasoline')

plt.show()

Maybe we want to try a different `marker` argument. Below we'll change the `marker` from its default value to stars (`'*'`).

In [None]:
fig, ax = plt.subplots()

ax.scatter(energy_data['Year'], energy_data['Gasoline'], color='red', marker='*')
ax.set_xlabel('Year')
ax.set_ylabel('Gasoline')

plt.show()

Those stars are a bit small. Let's change their size (using the `s` parameter) from the default to 300.

In [None]:
fig, ax = plt.subplots()

ax.scatter(energy_data['Year'], energy_data['Gasoline'], color='red', marker='*', s=300)
ax.set_xlabel('Year')
ax.set_ylabel('Gasoline')

plt.show()
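For a sense of scale, `s` is measured in points squared (it sets the marker's area), and the usual default is 36. The sketch below draws three stars at different sizes. The `x`, `y`, and `sizes` values are hypothetical and only there for comparison.

```python
import matplotlib.pyplot as plt

# Hypothetical points, used only to compare marker sizes
x = [1, 2, 3]
y = [1, 1, 1]
sizes = [36, 150, 300]  # 36 is the usual default; 300 is the value used above

fig, ax = plt.subplots()

# Passing a list to s gives each point its own size
stars = ax.scatter(x, y, s=sizes, color="red", marker="*")

plt.show()
```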

And it would hardly be a plot without a trendline! We can add one using `numpy`'s `polyfit()` function to calculate the slope and intercept of the best-fit line, then use those values to plot the line on our scatter plot.

In [None]:
fig, ax = plt.subplots()

# Calculate the trendline
m, b = np.polyfit(energy_data["Year"], energy_data["Gasoline"], 1) # 1 indicates linear

# Plot the trendline, including an f-string with the line equation
ax.plot(
    energy_data["Year"],
    m * energy_data["Year"] + b,
    color="blue",
    label=f"Trendline: y={m:.2f}x+{b:.2f}",
)

ax.scatter(energy_data["Year"], energy_data["Gasoline"], color="red", marker="*", s=300)
ax.set_xlabel("Year")
ax.set_ylabel("Gasoline")

plt.show()

Something's still missing. We gave the trendline a label, but the label didn't show up on the graph. Adding a legend with the `ax.legend()` method displays it.

While we're at it, let's add a title to the plot using the `ax.set_title()` method.

In [None]:
fig, ax = plt.subplots()

# Calculate the trendline
m, b = np.polyfit(energy_data["Year"], energy_data["Gasoline"], 1)  # 1 indicates linear

# Plot the trendline, including an f-string with the line equation
ax.plot(
    energy_data["Year"],
    m * energy_data["Year"] + b,
    color="blue",
    label=f"Trendline: y={m:.2f}x+{b:.2f}",  # the .2f formats to 2 decimal places
)

ax.scatter(energy_data["Year"], energy_data["Gasoline"], color="red", marker="*", s=300)
ax.set_xlabel("Year")
ax.set_ylabel("Gasoline")
ax.set_title("Gasoline Consumption Over Time")
ax.legend()

plt.show()

Et voilà! A finished plot.
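To see what `np.polyfit()` hands back, it can help to try it on points where the answer is known in advance. The sketch below uses hypothetical points that lie exactly on the line y = 2x + 1, then builds the same kind of label used in the legend above.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Degree 1 fits a straight line; coefficients come back highest power first,
# so the slope is first and the intercept second
m, b = np.polyfit(x, y, 1)

# Format both values to two decimal places
label = f"Trendline: y={m:.2f}x+{b:.2f}"
print(label)  # Trendline: y=2.00x+1.00
```

If the intercept is negative, this label reads `x+-1.00`. Writing `{b:+.2f}` in place of `+{b:.2f}` lets Python print the sign itself.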

## Plotting Multiple Columns of Data

With `matplotlib`, we can plot multiple columns of data by calling a plotting function (e.g., `ax.plot()` or `ax.scatter()`) for each column we want to plot. It's often useful to loop through the columns. Let's put your new looping skills to work:

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

energy_sources = energy_data.columns.drop('Year')

for source in energy_sources:
    ax.plot(energy_data['Year'], energy_data[source], label=source)

ax.set_xlabel('Year')
ax.set_ylabel('Energy Consumption (QBTU)')
ax.set_title('Energy Use in Agriculture by Source')
ax.legend()

plt.show()

This is looking pretty good!
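The loop above depends on `energy_data.columns.drop('Year')` to list every column except the year. As a minimal sketch of what that call returns, here it is on a small hypothetical table (not the USDA values):

```python
import pandas as pd

# Hypothetical stand-in table (not the USDA values)
demo_data = pd.DataFrame({
    "Year": [2000, 2001],
    "Gasoline": [0.30, 0.28],
    "Diesel": [0.45, 0.47],
})

# drop() returns a new set of column labels without 'Year'
sources = demo_data.columns.drop("Year")
print(list(sources))  # ['Gasoline', 'Diesel']

# The DataFrame itself is unchanged and still has its 'Year' column
print(list(demo_data.columns))  # ['Year', 'Gasoline', 'Diesel']
```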

### <img src='images/exercise_icon.svg' alt="Exercise icon" width=40 align=center> Exercise 2

There are a few more things we could do. Use the [`matplotlib` documentation](https://matplotlib.org/stable/api/index.html) to take the above plot and:

* Plot the trend lines for each energy source using dotted lines.
* Add the trend line calculations to the legend.
* Bonus challenge: Add the year range to the title, pulling it from the data instead of hardcoding it (Hint: Recall the `.min()` and `.max()` methods!).

In [None]:
# Add your code here


## Making Box Plots

Box plots are a great way to visualize the distribution of a dataset. They show the median, quartiles, and potential outliers in the data. Here's how to create a box plot using `matplotlib`.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

energy_sources = energy_data.columns.drop('Year')

ax.boxplot(energy_data[energy_sources], tick_labels=energy_sources)

ax.set_xlabel("Energy Source")
ax.set_ylabel("Energy Consumption (QBTU)")
ax.set_title(f"Energy Use in Agriculture by Source from {energy_data['Year'].min()} to {energy_data['Year'].max()}")

plt.show()

Nice. If you are comfortable reading box plots, this visualization gives a quick comparison of how consumption is distributed for each energy source.
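If box plots are new to you, it may help to compute the pieces by hand. The sketch below uses a small hypothetical set of values and assumes matplotlib's default whisker rule, where points more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```python
import pandas as pd

# Hypothetical values, with one point far from the rest
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first quartile to the third quartile,
# with a line drawn at the median
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Default rule: points beyond 1.5 * IQR from the box count as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)  # 3.0 5.0 7.0
print(list(outliers))  # [30.0]
```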

And that's it! We encourage you to keep playing with the code and the **documentation** to get more familiar with `matplotlib` and data visualization in Python. As you get more practice with `for` loops, come back and revisit this lesson to see where a loop could replace repeated plotting code.

## Citations

[1]
(original article) Xiarchos, I.M. (2022). Chapter 5: Energy Use in Agriculture. In U.S. Agriculture and Forestry Greenhouse Gas Inventory: 1990–2018. Technical Bulletin No. 1957, United States Department of Agriculture, Office of the Chief Economist, Washington, DC. p. 177-181. January 2022. Hanson, W.L., S.J. Del Grosso, L. Gallagher, Eds.

(dataset) Xiarchos, Irene M. (2021). Data from: Chapter 5: Energy Use in Agriculture. U.S. Agriculture and Forestry Greenhouse Gas Inventory: 1990-2018. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1524410. Accessed 2023-11-30.