# EDA | U.S. Gasoline and Diesel Retail Prices 1995-2021

<div style="text-align:center">
    <img src="https://images.freeimages.com/images/large-previews/f25/oil-pumps-1180629.jpg" style="width:70%">
</div>

# Introduction

- The DataFrame contains *U.S. gasoline and diesel retail prices from 1995 to 2021*.
- Gasoline prices are split by grade (**Regular**, **Mid-Grade** and **Premium**) and by formulation (conventional or reformulated).
- The objective of this notebook is to compare how prices evolve across the gasoline grades, and between conventional and reformulated gasoline.

Some vocabulary before we begin.

**Conventional Gasoline**
:
*Finished motor gasoline not included in the oxygenated or reformulated gasoline categories. Excludes reformulated gasoline blendstock for oxygenate blending (RBOB) as well as other blendstock.*

**Reformulated Gasoline**
:
*Finished motor gasoline formulated for use in motor vehicles, the composition and properties of which meet the requirements of the reformulated gasoline regulations promulgated by the U.S. Environmental Protection Agency under Section 211(k) of the Clean Air Act. This category includes oxygenated fuels program reformulated gasoline (OPRG) but excludes reformulated gasoline blendstock for oxygenate blending (RBOB).*

> *Source* : [U.S. Energy Information Administration](https://www.eia.gov/dnav/pet/TblDefs/pet_move_pipe_tbldef2.asp#:~:text=Conventional%20Gasoline,as%20well%20as%20other%20blendstock.)

## Main package imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # statistical plots
import matplotlib.pyplot as plt # figure handling

# Kaggle file system setup
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load

In [None]:
data = pd.read_csv('../input/us-gasoline-and-diesel-retail-prices-19952021/PET_PRI_GND_DCUS_NUS_W.csv')

## Quick check

In [None]:
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
print(f"There are {data.isna().sum().sum()} missing values.")
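
A single total does not show where the gaps are. The sketch below counts missing values per column on a small invented frame; the column names and values are placeholders, and on the real data the same call would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Invented sample, for illustration only (not taken from the dataset)
sample = pd.DataFrame({
    'R2': [1.10, np.nan, 1.20],
    'P2': [np.nan, np.nan, 1.40],
})

# Missing values per column, largest first
missing_per_column = sample.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```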

In [None]:
# Convert Object type variable to datetime variable
data['Date'] = pd.to_datetime(data['Date'])
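
If the `Date` column is stored as month/day/year strings (an assumption about the CSV that this notebook does not verify), an explicit `format` removes any day/month ambiguity, and `errors='coerce'` turns unparseable values into `NaT` so they can be counted. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical sample; the MM/DD/YYYY format is an assumption about the real file
sample = pd.DataFrame({'Date': ['01/02/1995', '01/09/1995', 'not a date']})

# Explicit format avoids day/month ambiguity; invalid strings become NaT
sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%Y', errors='coerce')

# Count the rows that could not be parsed
n_invalid = sample['Date'].isna().sum()
print(sample['Date'].dtype, n_invalid)
```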

# Analysis

In [None]:
from datetime import datetime

# Plot dimensions
WIDTH   = 20
HEIGHT  = 6  

# Interesting milestones
crisis_2009 = {'date':datetime(year=2009, month=1, day=1), 'label':'Great Recession', 'color':'navy'}
covid_2019  = {'date':datetime(year=2019, month=12, day=31), 'label':'COVID-19', 'color':'olivedrab'}
sars_2003   = {'date':datetime(year=2003, month=1, day=1), 'label':'SARS', 'color':'olivedrab'}
oversupply1 = {'date':datetime(year=2015, month=1, day=1), 'label':'Decline 1', 'color':'teal'}
oversupply2 = {'date':datetime(year=2016, month=1, day=1), 'label':'Decline 2', 'color':'teal'}
september11 = {'date':datetime(year=2001, month=9, day=11), 'label':'September 11 attacks', 'color':'crimson'}
milestones  = [crisis_2009, covid_2019, oversupply1, oversupply2, september11, sars_2003]

def analysis(df, gasolines, legends, milestones=None, corr=False):
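    """
    Plot the price distribution and the price evolution over time for the given columns.

    @params : df         - DataFrame with a 'Date' column and one price column per gasoline code
    @params : gasolines  - list of price column names to plot
    @params : legends    - dict mapping each column name to a readable label
    @params : milestones - optional list of dicts with 'date', 'label' and 'color' keys
    @params : corr       - if True, also draw a joint plot of the first two columns
    """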
    # Create a dataframe with 2 columns 'grade' and 'price'
    df_cg         = pd.melt(df[gasolines])
    df_cg.columns = ['grade', 'price']
    df_cg         = df_cg.replace(legends)
    
    # Create the price distribution plot
    plt.figure(figsize=(WIDTH, HEIGHT))
    sns.kdeplot(data=df_cg, x='price', hue='grade')
    plt.title('Price distribution')
    plt.show()
    
    # Create the datetime line plot
    plt.figure(figsize=(WIDTH, HEIGHT))
    
    # Loop over the different gasoline grades
    for gasoline in gasolines:
        sns.lineplot(x=df['Date'], y=df[gasoline], label=legends[gasoline])
    
    # Plot the different milestones
    if milestones:
        
        minimum = df[gasolines].min().min()
        maximum = df[gasolines].max().max()
        
        for milestone in milestones:
            plt.plot([milestone['date'], milestone['date']], [minimum, maximum], color=milestone['color'], linestyle='dashed', ms=10)
            plt.text(milestone['date'], minimum, milestone['label'], color=milestone['color'], rotation=45)
        
    plt.ylabel('Price')
    plt.title('Temporal evolution of prices')
    plt.show()
    
    # Plot the correlation
    if corr:
        x_label = legends[gasolines[0]]
        y_label = legends[gasolines[1]]

        grid = sns.jointplot(
            data=df[gasolines].rename(columns=legends),
            x=x_label,
            y=y_label,
            kind="scatter",
            height=10,
            color='crimson',
            alpha=0.4,
            marginal_kws=dict(bins=15, alpha=.6)
        )

        # jointplot builds its own figure, so set the title and labels through the returned JointGrid
        grid.fig.suptitle(f'{x_label} vs {y_label}', y=1.02)
        grid.set_axis_labels(x_label, y_label)
        plt.show()
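
The first step of `analysis` reshapes the selected price columns from wide to long form, so that `hue='grade'` can split the distribution by grade. The sketch below shows that reshaping on invented prices. It uses `map` on the `grade` column instead of the `replace` call in the function, which is a substitution made here to keep the relabelling limited to one column.

```python
import pandas as pd

# Invented prices, for illustration only
wide = pd.DataFrame({'R2': [1.10, 1.12], 'P2': [1.30, 1.33]})
legends = {'R2': 'Regular', 'P2': 'Premium'}

# One row per (column, value) pair
long = pd.melt(wide, var_name='grade', value_name='price')

# Map the column codes to readable labels
long['grade'] = long['grade'].map(legends)
print(long)
```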

In [None]:
def generate_gasoline_type(x):
    # Column names for Regular, Mid-Grade and Premium with formulation code x
    return [f"{e}{x}" for e in ['R', 'M', 'P']]

conventionals = generate_gasoline_type(2)
reformulated  = generate_gasoline_type(3)

## Conventional Gasoline

In [None]:
analysis(data, conventionals, {"R2":"Regular", "M2":"Mid-Grade", "P2":"Premium"}, milestones=milestones)

## Reformulated Gasoline

In [None]:
analysis(data, reformulated, {"R3":"Regular", "M3":"Mid-Grade", "P3":"Premium"}, milestones=milestones)

## Reformulated vs Conventional Gasoline

In [None]:
analysis(data, ['A2', 'A3'], {"A2":"Conventional", "A3":"Reformulated"}, milestones=milestones, corr=True)

# Conclusion

- The price variables are **highly positively correlated**: they rise and fall together.
- Historical events have a **great influence** on price increases and drops.
- ...
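
The correlation claim can be put into numbers with `DataFrame.corr`. The sketch below uses synthetic series (a shared trend plus a fixed offset and small noise, all invented) as a stand-in for two price columns. On the real data the same call could be applied to a selection such as `data[['A2', 'A3']]`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two price columns; values are invented, not from the dataset
rng = np.random.default_rng(0)
base = np.linspace(1.0, 3.0, 200)
prices = pd.DataFrame({
    'Conventional': base + rng.normal(0, 0.02, 200),
    'Reformulated': base + 0.10 + rng.normal(0, 0.02, 200),
})

# Pearson correlation matrix; values close to 1 mean the series move together
corr_matrix = prices.corr()
print(corr_matrix.round(3))
```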