# Lab: Fossil Fuel Consumption

## 1. Introduction

In this lab, you will investigate monthly fossil fuel consumption in the United States from January 2001 to July 2020 using a pandas time series dataframe. Time series data consists of observations recorded over time, where the order of the observations matters because each observation depends on the ones before it. We want to know whether the COVID-19 lockdown changed how much fossil fuel people consumed, and how large the effect was from month to month. For example, people may have driven less during the lockdown and therefore used less fuel.

There is a bit of data manipulation involved, so if you need some help on filtering rows/columns in datasets, here are some helpful links: 
https://medium.com/python-in-plain-english/filtering-rows-and-columns-in-pandas-python-techniques-you-must-know-6cdfc32c614c 

https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/index.html 

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

There are also many resources and articles online if you need extra help. Please note that there are many different ways to subset pandas dataframes.
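As a quick warm-up, here is a minimal sketch of one common way to subset a dataframe: boolean masks combined with `.loc`. The dataframe `toy` and its columns (`code`, `period`, `amount`) are invented for illustration and are not part of the lab dataset.

```python
import pandas as pd

# Toy data, invented purely for illustration (not the EIA dataset).
toy = pd.DataFrame({
    "code": ["A", "B", "A", "B", "A"],
    "period": [1, 1, 2, 2, 3],
    "amount": [10.0, 20.0, 11.0, 21.0, 12.0],
})

# Build one boolean condition per requirement, then combine them with & (and).
# Each condition needs its own parentheses.
mask = (toy["code"] == "A") & (toy["period"] >= 2)

# .loc keeps the rows where the mask is True and only the listed columns.
subset = toy.loc[mask, ["period", "amount"]]
print(subset)
```

Notice that the rows in `subset` keep their original index labels rather than being renumbered from zero.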

**Question 1.1.** Import "fossil_fuel_consumption.csv" using the pandas read_csv() method and store the dataframe in a variable. The data is pulled directly from the U.S. Energy Information Administration (EIA) here: https://www.eia.gov/totalenergy/data/browser/index.php?tbl=T01.01#/?f=M&start=200001&end=202007&charted=11. 

First, print out the dataframe. Inspect the rows and columns. There should be 7704 rows and 6 columns. Below is a brief summary of the data contained in each column:

"MSN": Mnemonic Series Names, the name for the category of data. In this lab, we are interested in the rows with "FFTCBUS" which stands for "Total Fossil Fuel Consumption". There are other categories in this big dataset, such as total fossil fuel production, nuclear electric power production, but we can ignore those categories for the purpose of this lab.

"YYYYMM": The year and month when the value was recorded. YYYY is the 4-digit year while MM is the 2-digit month.

"Value": The recorded value for the category we're interested in, at that specific year and month.

"Column Order": An integer to represent which category the value belongs to, out of the 12 total categories.

"Description": A description of the category. 

"Unit": The unit for the values.

In [1]:
import pandas as pd
## your code here

**Question 1.2.** As an example, print out the row indexed at 5136. This is the row where the Total Fossil Fuels Consumption category starts. You should see that measurements start on "194913" with total fossil fuel consumption at 28.9884 quadrillion Btu. What does this mean, given that there is no 13th month? Here "13" stands for the entire year. The EIA did not record monthly total fossil fuel consumption for 1949; it recorded only the annual consumption for that year. So this row tells us that in 1949, the annual total fossil fuel consumption was 28.9884 quadrillion Btu.

Since we are investigating monthly consumption, we are not interested in dates that end in "13" because those stand for annual values. We will need to wrangle the dataset a bit to get the exact dates/values we're interested in. You will see that in this dataset, for the years that do include monthly data, an annual value will appear after December, denoted by a date ending in "13".
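To make the "13" convention concrete, here is a small sketch of how a YYYYMM code can be split into its year and month parts with integer arithmetic. The codes below are invented, and the sketch assumes the codes are stored as integers, which may or may not match how your dataframe was read in.

```python
import pandas as pd

# Invented codes in the same YYYYMM layout; integer dtype is an assumption.
codes = pd.Series([194913, 197301, 197312, 197313])

year = codes // 100    # integer division drops the last two digits
suffix = codes % 100   # the remainder keeps the last two digits

is_annual = suffix == 13   # True where the code marks an annual total
print(pd.DataFrame({"code": codes, "year": year, "suffix": suffix, "annual": is_annual}))
```

If the codes were stored as text instead, `codes.astype(str).str.endswith("13")` would give the same True/False pattern.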

In [2]:
## your code here

## 2. Data Manipulation

**Question 2.1.** For the purposes of this lab, we will only look at data from the category "FFTCBUS", starting from January 2001. Select all the rows that fulfill these two conditions and save them in another variable. Print out your new dataframe to check that you have the correct rows. You should have 254 rows and 6 columns. Note that the index (the label on each row) is carried over from the old dataframe.

In [7]:
## your code here

**Question 2.2.** Print out the first 26 rows of your new dataframe. Notice that you have rows with the dates "200113" and "200213". Remember that these two rows contain annual information and are thus not needed in our dataset. Can you figure out a way to drop every 13th row (all the rows that contain dates ending in "13") from this dataset? Hint: Look at the "Slicing ranges" section at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html. After dropping these rows, you should have 235 rows and 6 columns.
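If slicing with a step is new to you, the sketch below shows the general idea on a toy dataframe (invented for illustration, with every 3rd row selected rather than every 13th). Adapting the start position and step to the lab data is left to you.

```python
import pandas as pd

# Toy frame with a non-default index, invented for illustration.
toy = pd.DataFrame({"value": range(9)}, index=range(100, 109))

# Positional slice start:stop:step. Here: every 3rd row, starting at position 2.
every_third = toy.iloc[2::3]

# Dropping those rows by their index labels keeps everything else.
rest = toy.drop(every_third.index)

print(every_third.index.tolist())
print(rest.index.tolist())
```

`iloc` works on row positions, while `drop` works on index labels, which is why the labels of the sliced rows are passed to `drop`.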

In [8]:
## your code here

## 3. Creating a time series dataframe

**Question 3.1.** To make a time series dataframe, we need to convert the dates in "YYYYMM" into timestamps using pandas. You can do so by running code like:

In [9]:
## your_dataframe["YYYYMM"] = pd.to_datetime(your_dataframe["YYYYMM"], format='%Y%m')

Print out a snippet of your dataframe and examine the "YYYYMM" column. The dates should now be formatted like YYYY-MM-DD. Please ignore the days. A pandas timestamp always includes a day, and because our format string contains no day, to_datetime() sets it to the first of the month.
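The short sketch below illustrates this behavior on a few invented YYYYMM strings (not taken from the lab dataset). It also shows why the annual codes ending in "13" had to be removed first: they do not describe a valid month.

```python
import pandas as pd

# Invented year-month strings in YYYYMM form.
raw = pd.Series(["200101", "200102", "202007"])

stamps = pd.to_datetime(raw, format="%Y%m")
print(stamps)                       # each timestamp falls on day 1 of its month
print(stamps.dt.strftime("%Y-%m"))  # display-only text without the day

# A code such as "200113" is not a valid month, so it cannot be parsed.
try:
    pd.to_datetime(pd.Series(["200113"]), format="%Y%m")
except ValueError as err:
    print("could not parse:", err)
```

Note that `strftime` returns plain strings, so it is only useful for display; keep the timestamp column itself for the time series work that follows.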

In [10]:
## your code here

**Question 3.2.** Since we are working with a time series, it is helpful to set the index as the date instead of ordered numbers. Use the pandas dataframe's set_index() method to set the index to "YYYYMM". Now that our dataframe's index is a DatetimeIndex, we can extract the months and years into new columns. Make a "Month" and "Year" column in your dataframe. Hint: use your_dataframe.index.month and your_dataframe.index.year
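For reference, here is a minimal sketch of the same pattern on a toy dataframe. The column names `when` and `amount` are invented; your dataframe uses "YYYYMM" and "Value" instead.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "when": pd.to_datetime(["2019-11-01", "2019-12-01", "2020-01-01"]),
    "amount": [1.0, 2.0, 3.0],
})

# set_index returns a new dataframe, so assign the result back.
toy = toy.set_index("when")

# A DatetimeIndex exposes date parts as attributes.
toy["Month"] = toy.index.month
toy["Year"] = toy.index.year
print(toy)
```

Forgetting to assign the result of `set_index()` back to a variable is a common mistake: the original dataframe is left unchanged unless you do.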

In [11]:
## your code here

## 4. Plotting the time series

**Question 4.1.** Now we are ready to plot our time series. We will use Seaborn, a data visualization library built on matplotlib, to set the figure size and style. Plot the "Value" column of your dataframe using the dataframe's plot() method.

In [12]:
import seaborn as sns 
sns.set(rc={'figure.figsize':(16, 4)}) #setting the appropriate size of our plot

## your code here

Let's examine our plot closely. By April 2020, COVID-19 restrictions were in place across much of the US. Can you see the sharp dip in the plot during this month? Fossil fuel consumption drops steeply in April 2020. This is plausible because many restrictions, lockdowns, and quarantines were in force and people were cautious about catching COVID-19, so they likely drove and traveled less, lowering total fuel consumption.

However, do you also notice the natural dips and peaks in our plot that occur every year? For example, you can see that people tend to use more fuel during certain months like December and January. Think of a reason for this. These dips/peaks each year are a natural part of the data because people just tend to use more or less fuel during these certain months. This is called seasonality in time series analysis.
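One rough way to see seasonality in numbers is to average the values by calendar month. The sketch below does this on an invented three-year series with a repeating 12-month pattern; the values are made up and only illustrate the technique.

```python
import pandas as pd

# Invented monthly series covering three years: a repeating 12-value
# pattern plus a small offset that grows each year.
pattern = [9, 8, 7, 6, 6, 7, 8, 8, 6, 6, 7, 9]
index = pd.date_range("2001-01-01", periods=36, freq="MS")  # month-start dates
values = [pattern[i % 12] + 0.1 * (i // 12) for i in range(36)]
series = pd.Series(values, index=index)

# Group by calendar month (1-12) and average across the years.
monthly_mean = series.groupby(series.index.month).mean()
print(monthly_mean)
```

Months whose average sits well above or below the others are the seasonal peaks and dips.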

What we want to know is: is the big dip in April 2020 significantly different from the seasonal variation in the data? How significant is it? At first glance, the dip looks far larger than the typical seasonal variation. But how can we investigate this further?

**Question 4.2.** One way is to plot boxplots grouped by months! Plot 12 boxplots, one for each month, using sns.boxplot().

In [13]:
## your code here

**Question 4.3.** Look back at your dataframe. What are the total fossil fuel consumption values for April, May, June, and July 2020? Can you spot them on your boxplots? You should be able to see that these 4 data points show up as outliers on the boxplots, drawn as individual markers beyond the whiskers (diamonds in some seaborn versions). Can you say something about fuel consumption in April 2020 compared to consumption in April of previous years? Do the same for May, June, and July. Think about the seasonal variation and how this relates to the outliers you see.
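To see why a point is drawn as an outlier, it can help to compute the boxplot limits by hand. The sketch below applies the common 1.5 × IQR rule (to my knowledge the default whisker setting for seaborn boxplots) to invented values for a single calendar month; the numbers are not from the EIA data.

```python
import pandas as pd

# Invented values for one calendar month across several years.
april = pd.Series(
    [6.1, 6.3, 6.0, 6.2, 6.4, 6.2, 6.1, 6.3, 4.9],
    index=range(2012, 2021),
)

q1 = april.quantile(0.25)   # first quartile
q3 = april.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr      # points below this count as outliers
upper = q3 + 1.5 * iqr      # points above this count as outliers

outliers = april[(april < lower) | (april > upper)]
print(outliers)
```

Because the limits are computed separately for each month, a value is judged only against the same month in other years, which is how the boxplots account for seasonality.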

What does this plot say about people's willingness to self-isolate and respect COVID-19 lockdowns/quarantines as time passes from April to July 2020?

(write your answers here)