# Hi, I'm Rodrigo. Welcome to this Data Science Project.
__________________________________________________________________________

---


## The dataset we're exploring contains real US gas prices. The data comes from the U.S. Energy Information Administration (EIA).


---


### We are going to run an EDA and then build an ML model to forecast future US gas prices.
The next steps are below.



**Requirements**: Install pydytuesday to load the data.

In [1]:
!pip install pydytuesday

Collecting pydytuesday
  Downloading pydytuesday-0.1.2-py3-none-any.whl.metadata (7.1 kB)
Collecting pandas>=2.2.3 (from pydytuesday)
  Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Downloading pydytuesday-0.1.2-py3-none-any.whl (11 kB)
Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
Installing collected packages: pandas, pydytuesday
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependen

**Step 1**: Load the libraries

In [2]:
import pydytuesday
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

**Step 2**: Read the data

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-01/weekly_gas_prices.csv')

**Step 3**: Display the first 10 rows to get a first look at the data

In [4]:
df.head(10)

|   | date | fuel | grade | formulation | price |
|---|------|------|-------|-------------|-------|
| 0 | 1990-08-20 | gasoline | regular | all | 1.191 |
| 1 | 1990-08-20 | gasoline | regular | conventional | 1.191 |
| 2 | 1990-08-27 | gasoline | regular | all | 1.245 |
| 3 | 1990-08-27 | gasoline | regular | conventional | 1.245 |
| 4 | 1990-09-03 | gasoline | regular | all | 1.242 |
| 5 | 1990-09-03 | gasoline | regular | conventional | 1.242 |
| 6 | 1990-09-10 | gasoline | regular | all | 1.252 |
| 7 | 1990-09-10 | gasoline | regular | conventional | 1.252 |
| 8 | 1990-09-17 | gasoline | regular | all | 1.266 |
| 9 | 1990-09-17 | gasoline | regular | conventional | 1.266 |


**Step 4**: Check the data types and the shape of the data. I'm working with two data types (object and float64) and 22,360 entries with 5 columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22360 entries, 0 to 22359
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         22360 non-null  object 
 1   fuel         22360 non-null  object 
 2   grade        22360 non-null  object 
 3   formulation  19672 non-null  object 
 4   price        22360 non-null  float64
dtypes: float64(1), object(4)
memory usage: 873.6+ KB
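The same facts can also be pulled out programmatically instead of reading them off `info()`. The sketch below uses a small hand-built sample that mirrors the first rows shown earlier; `sample`, `n_rows`, `n_cols` and `dtype_counts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Small sample mirroring the first rows of the dataset (illustration only).
sample = pd.DataFrame({
    "date": ["1990-08-20", "1990-08-20", "1990-08-27"],
    "fuel": ["gasoline", "gasoline", "gasoline"],
    "grade": ["regular", "regular", "regular"],
    "formulation": ["all", "conventional", "all"],
    "price": [1.191, 1.191, 1.245],
})

n_rows, n_cols = sample.shape                # (rows, columns)
dtype_counts = sample.dtypes.value_counts()  # number of columns per dtype

print(n_rows, n_cols)
print(dtype_counts)
```

On the full DataFrame, `df.shape` and `df.dtypes.value_counts()` would give the row and column counts and the number of columns per dtype in the same way.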


**Step 6**: I convert `date` to a datetime dtype and derive three new integer columns from it: **day**, **month** and **year**. Then I drop the original `date` column. This will help me organize the data better.

In [6]:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

df.drop('date', axis=1, inplace=True)
df.head(10)

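|   | fuel | grade | formulation | price | year | month | day |
|---|------|-------|-------------|-------|------|-------|-----|
| 0 | gasoline | regular | all | 1.191 | 1990 | 8 | 20 |
| 1 | gasoline | regular | conventional | 1.191 | 1990 | 8 | 20 |
| 2 | gasoline | regular | all | 1.245 | 1990 | 8 | 27 |
| 3 | gasoline | regular | conventional | 1.245 | 1990 | 8 | 27 |
| 4 | gasoline | regular | all | 1.242 | 1990 | 9 | 3 |
| 5 | gasoline | regular | conventional | 1.242 | 1990 | 9 | 3 |
| 6 | gasoline | regular | all | 1.252 | 1990 | 9 | 10 |
| 7 | gasoline | regular | conventional | 1.252 | 1990 | 9 | 10 |
| 8 | gasoline | regular | all | 1.266 | 1990 | 9 | 17 |
| 9 | gasoline | regular | conventional | 1.266 | 1990 | 9 | 17 |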
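As an alternative to dropping `date`, the column could be kept next to the derived parts, which leaves time-based operations such as resampling available for the later modeling step. This is only a sketch of that option on three sample rows, not what the notebook does; `sample` and `monthly_mean` are illustrative names.

```python
import pandas as pd

# Three rows taken from the head of the dataset (illustration only).
sample = pd.DataFrame({
    "date": ["1990-08-20", "1990-08-27", "1990-09-03"],
    "price": [1.191, 1.245, 1.242],
})

# Same conversion and derived columns as in the notebook cell.
sample["date"] = pd.to_datetime(sample["date"])
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day"] = sample["date"].dt.day

# With `date` still present, prices can be averaged per calendar month.
monthly_mean = sample.set_index("date")["price"].resample("MS").mean()
print(monthly_mean)
```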


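**Step 7**: I'm checking for duplicates in the data. `drop_duplicates()` returns a new DataFrame and leaves `df` unchanged; the result still has 22,360 entries, so there are no duplicate rows.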

In [7]:
df.drop_duplicates().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22360 entries, 0 to 22359
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel         22360 non-null  object 
 1   grade        22360 non-null  object 
 2   formulation  19672 non-null  object 
 3   price        22360 non-null  float64
 4   year         22360 non-null  int32  
 5   month        22360 non-null  int32  
 6   day          22360 non-null  int32  
dtypes: float64(1), int32(3), object(3)
memory usage: 960.9+ KB
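A more direct way to see whether duplicates exist is to count them first. The sketch below uses a toy DataFrame with made-up values (the third price is invented for illustration); `toy`, `n_duplicates` and `deduped` are hypothetical names.

```python
import pandas as pd

# Toy data: the second row repeats the first one exactly.
toy = pd.DataFrame({
    "grade": ["regular", "regular", "premium"],
    "price": [1.191, 1.191, 1.350],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a copy; `toy` itself is not modified.
deduped = toy.drop_duplicates()

print(n_duplicates, len(toy), len(deduped))
```

To make the removal permanent, the result has to be assigned back, for example `df = df.drop_duplicates()`.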


**Step 8**: First I compute the number and percentage of null values per column.
Then I drop all rows with missing values to work only with the available data.

In [8]:
null_values = df.isnull().sum()
null_percentage = (df.isnull().sum() / len(df)) * 100

df_null = pd.DataFrame({
    'Null Values': null_values,
    'Null Percentage': null_percentage
})

print(df_null)

             Null Values  Null Percentage
fuel                   0         0.000000
grade                  0         0.000000
formulation         2688        12.021467
price                  0         0.000000
year                   0         0.000000
month                  0         0.000000
day                    0         0.000000


In [9]:
df_null = df.dropna()
df_null.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19672 entries, 0 to 22357
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel         19672 non-null  object 
 1   grade        19672 non-null  object 
 2   formulation  19672 non-null  object 
 3   price        19672 non-null  float64
 4   year         19672 non-null  int32  
 5   month        19672 non-null  int32  
 6   day          19672 non-null  int32  
dtypes: float64(1), int32(3), object(3)
memory usage: 999.0+ KB
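Note that the index above still runs from 0 to 22357 after the drop, because `dropna()` keeps the original row labels. If a contiguous index is preferred, one option is to restrict the drop to the column that actually has nulls and reset the index afterwards. The sketch below uses invented toy rows; `toy` and `clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data with one row missing its formulation (values are illustrative).
toy = pd.DataFrame({
    "fuel": ["gasoline", "diesel", "gasoline"],
    "formulation": ["all", np.nan, "reformulated"],
    "price": [1.191, 1.100, 1.300],
})

# Drop only rows where `formulation` is null, then renumber rows from 0.
clean = toy.dropna(subset=["formulation"]).reset_index(drop=True)
print(clean)
```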


**Step 9**: I'm going to inspect the unique values and counts of each variable, so I know what data I'm working with in the **next** steps.

In [10]:
df_null['fuel'].unique()

array(['gasoline'], dtype=object)

In [11]:
df_null['formulation'].unique()
df_null['formulation'].value_counts()

| formulation | count |
|-------------|-------|
| all | 6687 |
| conventional | 6601 |
| reformulated | 6384 |


In [12]:
# Unique formulations in the full DataFrame, with missing values removed first
formulation_available = df['formulation'].dropna().unique().tolist()
print(formulation_available)

['all', 'conventional', 'reformulated']
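When the goal is to see the missing values alongside the real categories rather than filter them out, `value_counts(dropna=False)` includes them in the tally. A small sketch on an invented Series (`toy` and `counts` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (illustration only).
toy = pd.Series(["all", "conventional", np.nan, "all"], name="formulation")

# dropna=False keeps the missing values as their own row in the result.
counts = toy.value_counts(dropna=False)
print(counts)
```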


In [13]:
df_null['grade'].unique()
df_null['grade'].value_counts()

| grade | count |
|-------|-------|
| regular | 5222 |
| all | 4874 |
| midgrade | 4788 |
| premium | 4788 |


In [14]:
years_available = df_null['year'].value_counts().sort_index()
print(years_available)

year
1990     32
1991    100
1992    104
1993    143
1994    201
1995    624
1996    636
1997    624
1998    624
1999    624
2000    624
2001    636
2002    624
2003    624
2004    624
2005    624
2006    624
2007    636
2008    624
2009    624
2010    624
2011    624
2012    636
2013    624
2014    624
2015    624
2016    624
2017    624
2018    636
2019    624
2020    624
2021    624
2022    624
2023    624
2024    636
2025    300
Name: count, dtype: int64


In [15]:
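# Note: this reuses the name df_null for a Series of price frequencies.
# It is computed from the full df (rows with a missing formulation included)
# and replaces the cleaned DataFrame created earlier.
df_null = df['price'].value_counts().sort_index()
print(df_null)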

price
0.885    1
0.891    1
0.899    1
0.900    1
0.907    2
        ..
5.858    1
5.955    1
5.964    1
6.033    1
6.064    1
Name: count, Length: 3807, dtype: int64


In [16]:
df_null.info()

<class 'pandas.core.series.Series'>
Index: 3807 entries, 0.885 to 6.064
Series name: count
Non-Null Count  Dtype
--------------  -----
3807 non-null   int64
dtypes: int64(1)
memory usage: 59.5 KB


In [17]:
df_null.describe()

|       | count |
|-------|-------|
| count | 3807.0 |
| mean | 5.873391 |
| std | 3.755989 |
| min | 1.0 |
| 25% | 3.0 |
| 50% | 5.0 |
| 75% | 8.0 |
| max | 24.0 |
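The `describe()` output above summarizes how often each distinct price occurs (3,807 distinct prices, each appearing between 1 and 24 times), not the prices themselves. If a summary of the prices is what is wanted, `describe()` can be called on the price column directly. A minimal sketch on five prices taken from the head of the dataset (`prices` and `summary` are illustrative names):

```python
import pandas as pd

# Five prices from the first rows of the dataset (illustration only).
prices = pd.Series([1.191, 1.191, 1.245, 1.245, 1.242], name="price")

# describe() on the values gives count, mean, std, min, quartiles and max.
summary = prices.describe()
print(summary)
```

On the notebook's data the equivalent call would be `describe()` on the `price` column of the cleaned DataFrame.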


**Step 10**: Initial analysis

I've cleaned the data. I decided to remove the rows with null values, and I created new date variables (year, month and day) that will help me with a more advanced EDA and the ML model.

**Data Meaning**:

*   **fuel**: The type of fuel reported (gasoline or diesel).
*   **grade**: The grade or specification of the fuel (for gasoline: all, regular, midgrade, or premium; for diesel: all, ultra_low_sulfur, low_sulfur).
*   **formulation**: The formulation of the gasoline (all, conventional, or reformulated). Only applies to gasoline.
*   **price**: The average U.S. retail price per gallon in U.S. dollars for that fuel type.
*   **day, month, year**: The day, month and year of the week-ending date for the reported fuel price.

**Key points**:

 **KEY**: I discovered that all the null values were in rows with the 'diesel' fuel type. Dropping those rows removed diesel from the data, so now I just have 'gasoline' as fuel type.

*   I have 19,672 entries with 7 columns.
*   I have 7 variables: fuel, grade, formulation, price, year, month and day.
*   I have data from 1990 to 2025.
*   I have four different grades.
*   I have three types of formulation.
*   Prices range from 0.885 to 6.064 USD per gallon.
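
The observation that all null values belong to the diesel rows can be checked by counting missing formulations per fuel type. The sketch below shows the idea on invented toy rows; `toy` and `nulls_by_fuel` are hypothetical names, and on the real data the same expression would be applied to the DataFrame before the rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy data: only the diesel rows lack a formulation (illustration only).
toy = pd.DataFrame({
    "fuel": ["gasoline", "gasoline", "diesel", "diesel"],
    "formulation": ["all", "conventional", np.nan, np.nan],
})

# Boolean mask of missing formulations, summed within each fuel type.
nulls_by_fuel = toy["formulation"].isna().groupby(toy["fuel"]).sum()
print(nulls_by_fuel)
```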


