# LOAD DATA

**Objective**: Load and adapt the provided data and the additional datasets.

Loaded data in the following order:
1. [Provided](#1.-Provided-data)
    - [Corn.parquet](#1.1-Main-dataset---Corn.parquet)
    - [Corn futures](#1.2-Corn-Futures)
    - [Chemicals](#1.3-Chemicals)
    - [Climate](#1.4-Climate)
2. [Additional](#2.-Additional-datasets)
    - [Agricultural](#2.1-Agricultural-info-in-US)
    - [Cropland in US](#2.2-Cropland-use-for-cops-in-all-US)
    - [Cropland by state](#2.3-Cropland-use-for-cops-in-each-state)
    - [Ethanol produced in US](#2.4-Ethanol-produced-US)

## Load packages

In [1]:
# Store the current working directory
import os
CURRENT_PATH = os.getcwd()

# Data manipulation
import pandas as pd
import csv
import xlrd
import openpyxl
from datetime import datetime
from ydata_profiling import ProfileReport

In [2]:
# Create the output folder if it does not exist yet
os.makedirs('../data/', exist_ok=True)

## 1. Provided data

### 1.1 Main dataset - Corn.parquet

In [3]:
corn_raw = pd.read_parquet('../exdata/provided/CORN.parquet/part-00000-79520c00-c34f-45a5-abf4-58866e63cb2f-c000.snappy.parquet')
#corn_raw.head()

#### Apply the filters and keep only the output feature

1. Filter **STATISTICCAT_DESC** to *AREA PLANTED*, *AREA HARVESTED* or *YIELD*
2. Filter **AGG_LEVEL_DESC** == *STATE*
3. Filter **SHORT_DESC** == *CORN - ACRES PLANTED*
4. Filter **REFERENCE_PERIOD_DESC** == *YEAR*

All four filters are applied in the next cell.

In [4]:
# Apply filters
filter1 = corn_raw[corn_raw['STATISTICCAT_DESC'].isin(['AREA PLANTED','AREA HARVESTED','YIELD'])]
filter2 = filter1[filter1['AGG_LEVEL_DESC'] == 'STATE']
filter3 = filter2[filter2['SHORT_DESC'] == 'CORN - ACRES PLANTED']
filter4 = filter3[filter3['REFERENCE_PERIOD_DESC'] == 'YEAR']
#filter4.head()
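
The four chained filters can also be written as one combined boolean mask. The sketch below is only an illustration: `apply_corn_filters` is a hypothetical helper, and the toy rows are made up rather than taken from CORN.parquet. It assumes the same column names as the cell above.

```python
import pandas as pd

def apply_corn_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four filters at once with a combined boolean mask."""
    mask = (
        df['STATISTICCAT_DESC'].isin(['AREA PLANTED', 'AREA HARVESTED', 'YIELD'])
        & (df['AGG_LEVEL_DESC'] == 'STATE')
        & (df['SHORT_DESC'] == 'CORN - ACRES PLANTED')
        & (df['REFERENCE_PERIOD_DESC'] == 'YEAR')
    )
    return df[mask]

# Toy rows for illustration only (not real data)
toy = pd.DataFrame({
    'STATISTICCAT_DESC': ['AREA PLANTED', 'AREA PLANTED', 'SALES'],
    'AGG_LEVEL_DESC': ['STATE', 'NATIONAL', 'STATE'],
    'SHORT_DESC': ['CORN - ACRES PLANTED', 'CORN - ACRES PLANTED',
                   'CORN - SALES, MEASURED IN $'],
    'REFERENCE_PERIOD_DESC': ['YEAR', 'YEAR', 'YEAR'],
    'VALUE': ['31,000', '90,000,000', '1,000'],
})
filtered = apply_corn_filters(toy)
```

Only the first toy row passes all four conditions. A single mask avoids the intermediate `filter1` to `filter3` frames, at the cost of not being able to inspect each step separately.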

Profile the filtered data to check whether any columns other than **'VALUE'**, **'STATISTICCAT_DESC'**, **'LOCATION_DESC'** and **'YEAR'** are relevant.

In [5]:
profile_filter4 = ProfileReport(filter4)
profile_filter4


In [6]:
# Get main DF from filters
maindf = filter4[['VALUE','STATISTICCAT_DESC', 'LOCATION_DESC', 'YEAR']].copy()

profile_maindf = ProfileReport(maindf)
profile_maindf


The profiling report shows that the **VALUE** and **LOCATION_DESC** columns need cleaning.

In [7]:
# Convert VALUE to a number: drop the last comma and everything after it,
# then remove the remaining commas and cast to float
maindf['VALUE'] = maindf['VALUE'].str.replace(r',[^,]*$', '', regex=True)
maindf['VALUE'] = maindf['VALUE'].str.replace(',', '').astype(float)

# Drop the aggregate 'OTHER STATES' rows, which do not map to a single state
maindf = maindf[maindf['LOCATION_DESC'] != 'OTHER STATES']
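
To make the effect of the two `str.replace` calls concrete, the sketch below runs them on a few made-up strings (these are not values from the dataset). It suggests that the first call strips the last thousands group, so a value with at least one comma ends up divided by 1000, while a value without a comma passes through unchanged. That second case may be worth checking against the real data.

```python
import pandas as pd

# Made-up examples of raw VALUE strings
raw = pd.Series(['3,950,000', '31,000', '500'])

# Step 1: drop the last comma and everything after it
step1 = raw.str.replace(r',[^,]*$', '', regex=True)

# Step 2: remove the remaining thousands separators and cast to float
cleaned = step1.str.replace(',', '', regex=False).astype(float)
```

Here `step1` holds `'3,950'`, `'31'` and `'500'`, and `cleaned` holds `3950.0`, `31.0` and `500.0`.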

In [8]:
# Pivot to use each STATISTICCAT_DESC as feature
main_output = maindf.pivot(index=['YEAR','LOCATION_DESC'], columns=['STATISTICCAT_DESC'], values='VALUE')
main_output.reset_index(inplace=True)

In [9]:
main_output

STATISTICCAT_DESC,YEAR,LOCATION_DESC,AREA PLANTED
0,1919,ARIZONA,31.0
1,1920,ARIZONA,29.0
2,1921,ARIZONA,35.0
3,1922,ARIZONA,39.0
4,1923,ARIZONA,33.0
...,...,...,...
4642,2022,VIRGINIA,480.0
4643,2022,WASHINGTON,140.0
4644,2022,WEST VIRGINIA,46.0
4645,2022,WISCONSIN,3950.0


#### Main dataset with additional features

Applied:
- FILTER 2: **AGG_LEVEL_DESC** == *STATE*
- FILTER 4: **REFERENCE_PERIOD_DESC** == *YEAR*
- And take all combinations of **STATISTICCAT_DESC** and **SHORT_DESC** features

In [10]:
filter2 = corn_raw[corn_raw['AGG_LEVEL_DESC'] == 'STATE']
filter4 = filter2[filter2['REFERENCE_PERIOD_DESC'] == 'YEAR']
main = filter4.copy()

# Keep the same columns as maindf, plus SHORT_DESC
main = main[['VALUE','STATISTICCAT_DESC', 'SHORT_DESC', 'LOCATION_DESC', 'YEAR']].copy()

In [11]:
# Convert VALUE to a number: drop the last comma and everything after it,
# then remove the remaining commas; non-numeric entries become NaN
main['VALUE'] = main['VALUE'].str.replace(r',[^,]*$', '', regex=True)
main['VALUE'] = main['VALUE'].str.replace(',', '')
main['VALUE'] = pd.to_numeric(main['VALUE'], errors='coerce')

# Drop the aggregate 'OTHER STATES' rows, which do not map to a single state
main = main[main['LOCATION_DESC'] != 'OTHER STATES']

In [12]:
# Pivot to use each SHORT_DESC as feature
main_pivot = main.pivot_table(index=['YEAR','LOCATION_DESC'], columns=['SHORT_DESC'], values='VALUE')

# Keep only rows where the target (CORN - ACRES PLANTED) is available
main_pivot.dropna(subset=['CORN - ACRES PLANTED'], inplace=True)

main_pivot.reset_index(inplace=True)
main_pivot = main_pivot.rename_axis(None, axis=1)

main_pivot

Unnamed: 0,YEAR,LOCATION_DESC,CORN - ACRES PLANTED,CORN - OPERATIONS WITH SALES,"CORN - SALES, MEASURED IN $","CORN - SALES, MEASURED IN PCT OF FARM OPERATIONS","CORN - SALES, MEASURED IN PCT OF FARM SALES","CORN, BIOTECH - AREA PLANTED, MEASURED IN PCT BY TYPE","CORN, BIOTECH, BT - AREA PLANTED, MEASURED IN PCT BY TYPE","CORN, BIOTECH, HERBICIDE RESISTANT - AREA PLANTED, MEASURED IN PCT BY TYPE",...,"CORN, SILAGE, ORGANIC - PRODUCTION, MEASURED IN TONS","CORN, SILAGE, ORGANIC - SALES IN ORGANIC MARKETS, MEASURED IN $","CORN, SILAGE, ORGANIC - SALES IN ORGANIC MARKETS, MEASURED IN TONS","CORN, SILAGE, ORGANIC - SALES, MEASURED IN $","CORN, SILAGE, ORGANIC - SALES, MEASURED IN TONS","CORN, TRADITIONAL OR INDIAN - ACRES HARVESTED","CORN, TRADITIONAL OR INDIAN - OPERATIONS WITH AREA HARVESTED","CORN, TRADITIONAL OR INDIAN - PRODUCTION, MEASURED IN LB","CORN, TRADITIONAL OR INDIAN, IRRIGATED - ACRES HARVESTED","CORN, TRADITIONAL OR INDIAN, IRRIGATED - OPERATIONS WITH AREA HARVESTED"
0,1919,ARIZONA,31.0,,,,,,,,...,,,,,,,,,,
1,1920,ARIZONA,29.0,,,,,,,,...,,,,,,,,,,
2,1921,ARIZONA,35.0,,,,,,,,...,,,,,,,,,,
3,1922,ARIZONA,39.0,,,,,,,,...,,,,,,,,,,
4,1923,ARIZONA,33.0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4642,2022,VIRGINIA,480.0,,,,,,,,...,,,,,,,,,,
4643,2022,WASHINGTON,140.0,,,,,,,,...,,,,,,,,,,
4644,2022,WEST VIRGINIA,46.0,,,,,,,,...,,,,,,,,,,
4645,2022,WISCONSIN,3950.0,,,,,91.0,3.0,11.0,...,,,,,,,,,,
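
The first pivot above uses `pivot`, while this one uses `pivot_table`. One practical difference is how duplicated (index, column) pairs are handled: `pivot` raises an error, whereas `pivot_table` aggregates them, using the mean by default. The toy frame below is made up to show that behaviour; whether the real data contains such duplicates is not verified here.

```python
import pandas as pd

# Toy data with a duplicated (YEAR, LOCATION_DESC, SHORT_DESC) combination
toy = pd.DataFrame({
    'YEAR': [2020, 2020, 2020],
    'LOCATION_DESC': ['IOWA', 'IOWA', 'IOWA'],
    'SHORT_DESC': ['CORN - ACRES PLANTED', 'CORN - ACRES PLANTED',
                   'CORN - OPERATIONS WITH SALES'],
    'VALUE': [100.0, 200.0, 7.0],
})

# pivot_table averages the two duplicated 'CORN - ACRES PLANTED' rows
table = toy.pivot_table(index=['YEAR', 'LOCATION_DESC'],
                        columns='SHORT_DESC', values='VALUE')

# pivot refuses duplicated (index, column) pairs
try:
    toy.pivot(index=['YEAR', 'LOCATION_DESC'], columns='SHORT_DESC', values='VALUE')
    pivot_failed = False
except ValueError:
    pivot_failed = True
```

If averaging duplicates is not the intended behaviour, passing an explicit `aggfunc` to `pivot_table` makes the choice visible.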


#### Save main datasets

In [13]:
main_output.to_pickle('../data/main_output.pkl')
main_pivot.to_pickle('../data/main.pkl')

### 1.2 Corn Futures

In [14]:
futures_raw = pd.read_csv('../exdata/provided/Corn_Futures.csv')

# Use the midpoint of the daily High and Low as the futures price
futures_raw = futures_raw.assign(Futures = (futures_raw['High'] + futures_raw['Low'])/2)

# Keep only the date and price, and rename Date to DATE
futures = futures_raw[['Date', 'Futures']].copy()
rename_cols = {'Date': 'DATE'}
futures.rename(columns=rename_cols, inplace=True)

# Store dates as YYYY-MM-DD strings
futures['DATE'] = pd.to_datetime(futures['DATE'], format='%m/%d/%Y')
futures['DATE'] = futures['DATE'].dt.strftime('%Y-%m-%d')

# Check the correlation of the mean/median/sd with state, per year
futures


Unnamed: 0,DATE,Futures
0,2022-11-04,681.875
1,2022-11-03,682.500
2,2022-11-02,688.625
3,2022-11-01,692.875
4,2022-10-31,692.125
...,...,...
2541,2012-11-13,724.875
2542,2012-11-12,719.250
2543,2012-11-09,744.250
2544,2012-11-08,740.500
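
The main dataset is yearly, while the futures are daily, so some aggregation is needed before the two can be compared. The sketch below shows one possible way to compute the yearly mean, median and standard deviation mentioned in the comment above. It is an assumption about a later step, not part of this notebook, and the toy prices are made up.

```python
import pandas as pd

# Toy daily prices for illustration only
toy_futures = pd.DataFrame({
    'DATE': ['2021-01-04', '2021-06-01', '2022-01-03', '2022-06-01'],
    'Futures': [480.0, 680.0, 590.0, 750.0],
})

# Derive YEAR from the YYYY-MM-DD string, then aggregate per year
toy_futures['YEAR'] = pd.to_datetime(toy_futures['DATE']).dt.year
futures_yearly = (
    toy_futures.groupby('YEAR')['Futures']
    .agg(['mean', 'median', 'std'])
    .reset_index()
)
```

The result has one row per year with the columns `YEAR`, `mean`, `median` and `std`, which could then be merged with the main dataset on `YEAR`.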


In [15]:
futures.to_pickle('../data/futures.pkl')

### 1.3 Chemicals

In [16]:
chem = pd.read_excel('../exdata/provided/WPU0652013A.xls', index_col=None, na_values=['NA'], usecols="A,B", skiprows=10)

# Rename the date and value columns
rename_cols = {'observation_date': 'DATE', 'WPU0652013A': 'chem'}
chem.rename(columns=rename_cols, inplace=True)

# Check the correlation of the mean/median/sd with state, per year
chem

Unnamed: 0,DATE,chem
0,2014-12-01,100.000
1,2015-01-01,101.800
2,2015-02-01,102.100
3,2015-03-01,100.100
4,2015-04-01,101.500
...,...,...
89,2022-05-01,216.727
90,2022-06-01,204.198
91,2022-07-01,180.314
92,2022-08-01,177.149


In [17]:
chem.to_pickle('../data/chem.pkl')

### 1.4 Climate

In [18]:
# Create dictionary with all subfolders and pathfiles

pathfiles_dict = {}
files_dict = {}

for folder_path, folders, files in os.walk('../exdata/provided/Climate/'):
    # Create a list to store filenames for the current subfolder
    subfolder_filenames = []
    for file in files:
        # Append the filename to the list
        subfolder_filenames.append(os.path.join(folder_path, file))
    # Store the list of filenames in the dictionary with the subfolder path as the key
    pathfiles_dict[folder_path] = subfolder_filenames

pathfiles_dict.pop('../exdata/provided/Climate/')

prefix_to_remove = '../exdata/provided/Climate/'
for key in pathfiles_dict:
    updated_key = key.replace(prefix_to_remove, '', 1)
    files_dict[updated_key] = pathfiles_dict[key]

In [19]:
# Get dataframe of all metrics and pivot it

metric = pd.DataFrame()

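def get_start_row(path):
    """Return the index of the header row, i.e. the row whose first cell is 'Date'."""
    with open(path, 'r', newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            # Skip empty rows, then check for the header marker
            if row and row[0] == 'Date':
                return i
    raise ValueError(f"No 'Date' header row found in {path}")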

for key in files_dict.keys():
    for file in files_dict[key]:
        if file.endswith('.csv'):  # Check if the file is a CSV file
            # Read state and metric info
            df = pd.read_csv(file, nrows=1, header=None)
            # Find in which row starts the values
            start_row = get_start_row(file)
            # Append the data to the main dataframe
            temp_df = pd.read_csv(file, skiprows = start_row, header=0)
            temp_df = temp_df.assign(State = df[0][0], Metric = df[1][0])
            metric = pd.concat([metric, temp_df], ignore_index=True)
            
# remove anomaly column and rename Date
metric = metric.drop('Anomaly', axis=1)
rename_cols = {'Date': 'DATE'}
metric.rename(columns=rename_cols, inplace=True)

# pivot to get climate df
climate = metric.pivot(index=['DATE','State'], columns='Metric', values='Value')
climate.reset_index(inplace=True)
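
The loading logic above relies on a specific file layout: state and metric in the first row, some metadata rows, then a header row starting with `Date`. The sketch below reproduces that logic on an in-memory sample so it can be checked without the real files. The sample content and the helper name `find_header_row` are made up, and the layout is an assumption inferred from the code above.

```python
import csv
import io
import pandas as pd

# Made-up file content mimicking the assumed layout
sample = (
    "Iowa,Precipitation,January\n"
    "Units: Inches\n"
    "Date,Value,Anomaly\n"
    "202101,1.10,0.20\n"
    "202102,0.90,-0.10\n"
)

def find_header_row(text: str) -> int:
    """Return the index of the first row whose first cell is 'Date'."""
    for i, row in enumerate(csv.reader(io.StringIO(text))):
        if row and row[0] == 'Date':
            return i
    raise ValueError("no 'Date' header row found")

start = find_header_row(sample)

# First row holds the state and metric names
meta = pd.read_csv(io.StringIO(sample), nrows=1, header=None)

# Data starts at the detected header row
frame = pd.read_csv(io.StringIO(sample), skiprows=start, header=0)
frame = frame.assign(State=meta[0][0], Metric=meta[1][0])

# With many files, collect the frames in a list and concatenate once
metric_sample = pd.concat([frame], ignore_index=True)
```

Collecting the per-file frames in a list and calling `pd.concat` once avoids copying the growing dataframe on every iteration.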

In [20]:
# Store dates as YYYY-MM strings (year and month only)
climate['DATE'] = pd.to_datetime(climate['DATE'], format='%Y%m')
climate['DATE'] = climate['DATE'].dt.strftime('%Y-%m')

# Check the correlation of the mean/median/sd with state, per year
climate = climate.rename_axis(None, axis=1)
climate

Unnamed: 0,DATE,State,Average Temperature,Cooling Degree Days,Heating Degree Days,Maximum Temperature,Minimum Temperature,Palmer Drought Severity Index (PDSI),Precipitation
0,1895-01,Alabama,43.1,5.0,716.0,52.7,33.4,0.78,7.52
1,1895-01,Arizona,40.4,0.0,508.0,49.0,31.8,1.67,2.78
2,1895-01,Arkansas,36.1,0.0,914.0,46.0,26.2,0.37,5.04
3,1895-01,California,40.5,0.0,654.0,47.4,33.6,2.23,9.25
4,1895-01,Colorado,21.6,0.0,1355.0,33.6,9.6,1.64,1.96
...,...,...,...,...,...,...,...,...,...
74360,2022-01,Virginia,32.9,0.0,994.0,42.8,23.1,-1.34,4.51
74361,2022-01,Washington,31.2,0.0,869.0,37.0,25.4,1.44,6.58
74362,2022-01,West Virginia,27.4,0.0,1155.0,37.0,17.8,0.72,4.86
74363,2022-01,Wisconsin,9.2,0.0,1622.0,19.6,-1.3,-1.46,0.50


In [21]:
climate.to_pickle('../data/climate.pkl')

## 2. Additional datasets

Links with more information about these datasets:
- Agricultural info in US:
- Cropland in US and by state:
- Ethanol produced in US:

### 2.1 Agricultural info in US

In [22]:
# Read exdata
agriculture = pd.read_excel('../exdata/additional/table01.xlsx', index_col=None, skiprows=2)
agriculture = agriculture.iloc[:72]

# Adapt colname YEAR
rename_cols = {'Year': 'YEAR'}
agriculture.rename(columns=rename_cols, inplace=True)

agriculture.to_pickle('../data/agriculture.pkl')
agriculture

Unnamed: 0,YEAR,Total agricultural output,Livestock and products output: Total 1/,Livestock and products output: Meat animals,Livestock and products output: Dairy,Livestock and products output: Poultry and eggs,Crops output: Total,Crops output: Food grains,Crops output: Feed crops,Crops output: Oil crops,...,Labor inputs: Hired labor,Labor inputs: Self-employed and unpaid family,Intermediate inputs: Total,Intermediate inputs: Feed and seed,Intermediate inputs: Energy,Intermediate inputs: Fertilizer and lime,Intermediate inputs: Pesticides,Intermediate inputs: Purchased services,Intermediate inputs: Other intermediate,Total factor productivity (TFP)
0,1948,0.362833,0.437903,0.562957,0.448529,0.129304,0.338529,0.524309,0.392067,0.110668,...,2.999098,4.570501,0.431257,0.473741,1.015513,0.299008,0.016234,0.433501,0.266192,0.383332
1,1949,0.357259,0.440938,0.582993,0.46862,0.148074,0.327655,0.450339,0.36051,0.10584,...,2.786415,4.546308,0.445904,0.490771,1.124234,0.289959,0.019592,0.432306,0.311152,0.372391
2,1950,0.348639,0.448577,0.597969,0.472827,0.158971,0.308353,0.41978,0.369673,0.107492,...,2.904928,4.236456,0.454706,0.48946,1.153244,0.37092,0.025088,0.442771,0.270106,0.361446
3,1951,0.364145,0.466605,0.630788,0.465135,0.170449,0.325324,0.414267,0.357164,0.112389,...,2.804684,4.066649,0.475389,0.512182,1.196978,0.36238,0.021484,0.481686,0.277572,0.37173
4,1952,0.375937,0.477123,0.655737,0.469577,0.173769,0.338618,0.531508,0.369027,0.112398,...,2.740418,3.976753,0.477837,0.507401,1.24953,0.376972,0.022599,0.494674,0.287195,0.382885
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2015,1,1,1.0,1,1,1,1.0,1,1,...,1,1,1.0,1.0,1,1,1,1,1,1
68,2016,1.030058,1.056603,1.052079,1.014423,1.019666,1.049156,1.128115,1.065863,1.082993,...,1.051175,0.990508,0.978174,1.005181,1.073448,0.772758,1.041096,0.9908,0.916261,1.039679
69,2017,1.023858,1.055334,1.096645,1.028846,1.044365,1.020214,0.86043,1.020404,1.125213,...,1.016105,0.929956,0.983741,1.052362,0.953896,0.7044,1.074263,0.995362,0.91745,1.044425
70,2018,1.006604,1.05318,1.079276,1.048077,1.069143,1.004979,0.961248,0.997175,1.123626,...,1.031205,0.900358,0.948811,0.979248,0.918758,0.721832,1.040349,0.9963,0.886237,1.052431


### 2.2 Cropland use for cops in all US 

In [23]:
# Read exdata
cropland_us = pd.read_excel('../exdata/additional/summary_Table_3_cropland_used_for_crops_19102022_update.xlsx', skiprows=1)
cropland_us = cropland_us.iloc[:113]

# Fill missing values with each column's median
cropland_us = cropland_us.fillna(cropland_us.median(numeric_only=True))

# Label the last row as 2022 and rename the year column to YEAR
cropland_us.at[cropland_us.index[-1], 'Year 1/'] = 2022
rename_cols = {'Year 1/': 'YEAR'}
cropland_us.rename(columns=rename_cols, inplace=True)

cropland_us.to_pickle('../data/cropland_us.pkl')
cropland_us

Unnamed: 0,YEAR,Total crops harvested (million acres) 2/,Double cropped (million acres) 3/,Cropland harvested (million acres) 4/,Crop failure (million acres),Cultivated summer fallow (million acres),Total cropland used for crops (million acres) 4/
0,1910,322.0,8.0,317.0,9.0,4.0,330.0
1,1911,322.0,8.0,322.0,10.0,5.0,337.0
2,1912,322.0,8.0,320.0,12.0,5.0,337.0
3,1913,322.0,8.0,324.0,11.0,5.0,340.0
4,1914,322.0,8.0,326.0,11.0,5.0,342.0
...,...,...,...,...,...,...,...
108,2018,317.0,6.0,311.0,11.0,16.0,338.0
109,2019,303.0,5.0,298.0,11.0,15.0,323.0
110,2020,310.0,6.0,304.0,11.0,14.0,329.0
111,2021,317.0,6.0,311.0,10.0,15.0,336.0


### 2.3 Cropland use for cops in each state

In [24]:
# Read exdata
cropland_state = pd.read_excel('../exdata/additional/Cropland_used_for_crops_19452012_by_state.xls', index_col=None, skiprows=2)
cropland_state = cropland_state.iloc[4:72]

# Filter empty rows
cropland_state = cropland_state.dropna()

# Use the region/state names as the index so only year columns remain
cropland_state.set_index('Regions and States', inplace=True)

# Add a column for every intermediate year, repeating the last available value
cropland_state.columns = cropland_state.columns.astype(str)
years = [int(col) for col in cropland_state.columns]
for idx, year in enumerate(range(min(years), max(years)+4)):
    if str(year) in cropland_state.columns:
        last = cropland_state.iloc[:, idx]
    if str(year) not in cropland_state.columns:
        cropland_state.insert(idx, year, last)

cropland_state = cropland_state.reset_index()

# Melt df
df_dropped = cropland_state.drop('Regions and States', axis=1)
cropland_state_melted = pd.melt(cropland_state, id_vars='Regions and States', value_vars=df_dropped,
                                                var_name='YEAR', value_name='VALUE')

cropland_state = cropland_state_melted.copy()
cropland_state.to_pickle('../data/cropland_state.pkl')
cropland_state

Unnamed: 0,Regions and States,YEAR,VALUE
0,Northeast,1945,20904
1,Maine,1945,1331
2,New Hampshire,1945,443
3,Vermont,1945,1171
4,Massachusetts,1945,589
...,...,...,...
4184,Nevada,2015,478.006
4185,Pacific,2015,17395.510277
4186,Washington,2015,5557.553
4187,Oregon,2015,3522.318277


### 2.4 Ethanol produced US

In [25]:
# Read exdata
ethanol = pd.read_excel('../exdata/additional/PET_PNP_OXY_A_EPOOXE_YOP_MBBLPD_A.xls', sheet_name = 'Data 1', index_col=None, skiprows=2)

# Adapt colname YEAR
rename_cols = {'Date': 'YEAR'}
ethanol.rename(columns=rename_cols, inplace=True)
ethanol['YEAR'] = pd.to_datetime(ethanol['YEAR'], format='%Y%m%d')
ethanol['YEAR'] = ethanol['YEAR'].dt.strftime('%Y')

# Fill missing values with each numeric column's median (YEAR is a string column)
ethanol = ethanol.fillna(ethanol.median(numeric_only=True))
#ethanol.head()

In [26]:
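# Map the regional values to states
df = ethanol.drop('U.S. Oxygenate Plant Production of Fuel Ethanol (Thousand Barrels per Day)', axis=1)
# Keep only the region name in each column label
df.columns = df.columns.str.replace(r' \(.+$', '', regex=True)

# Reshape to one row per (YEAR, region); all non-id columns are melted by default
df = pd.melt(df, id_vars='YEAR', var_name='region', value_name='value')
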

region_to_state_map = {'East Coast': ['Connecticut', 'Delaware', 'Florida', 'Georgia', 'Maine', 'Maryland',
                                     'Massachusetts', 'New Hampshire', 'New Jersey', 'New York', 'North Carolina',
                                     'Pennsylvania', 'Rhode Island', 'South Carolina', 'Virginia'],
                      'Midwest': ['Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota', 'Missouri',
                                  'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin'],
                      'Gulf Coast': ['Alabama', 'Louisiana', 'Mississippi', 'Texas'],
                      'Rocky Mountain': ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico',
                                         'Utah', 'Wyoming'],
                      'West Coast': ['Alaska', 'California', 'Oregon', 'Washington']}


# Build one row per (state, region, year) by repeating each regional value
# for every state in that region; regions missing from the map are skipped
rows = []
for _, row in df.iterrows():
    for state in region_to_state_map.get(row['region'], []):
        rows.append({'state': state, 'region': row['region'],
                     'ethanol': row['value'], 'YEAR': row['YEAR']})

# Create the result in one step instead of appending row by row
per_region_df = pd.DataFrame(rows, columns=['state', 'region', 'ethanol', 'YEAR'])

#per_region_df
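
The region-to-state expansion can also be done without an explicit Python loop, using `Series.map` and `DataFrame.explode`. The sketch below is only an illustration: the toy values and the shortened `region_map` are made up, and the column names follow the cell above.

```python
import pandas as pd

# Toy regional values (made up) and a shortened region-to-state map
regional = pd.DataFrame({
    'YEAR': ['2021', '2021'],
    'region': ['Gulf Coast', 'West Coast'],
    'value': [20.0, 7.0],
})
region_map = {
    'Gulf Coast': ['Alabama', 'Texas'],
    'West Coast': ['Oregon'],
}

# Attach the list of states to each row, then expand to one row per state
per_state = (
    regional.assign(state=regional['region'].map(region_map))
    .dropna(subset=['state'])  # drop regions that are not in the map
    .explode('state')
    .rename(columns={'value': 'ethanol'})
    [['state', 'region', 'ethanol', 'YEAR']]
    .reset_index(drop=True)
)
```

Each state inherits the value of its region, so the regional figure is repeated rather than split between states, as in the loop above.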


In [27]:
# Add the U.S. total back as a separate column
us_col = 'U.S. Oxygenate Plant Production of Fuel Ethanol (Thousand Barrels per Day)'
df_ethanol = ethanol[['YEAR', us_col]].rename(columns={us_col: 'U.S. ethanol'})

ethanol_merged = pd.merge(per_region_df, df_ethanol, on='YEAR', how='inner')
ethanol = ethanol_merged.copy()

ethanol.to_pickle('../data/ethanol.pkl')
ethanol

Unnamed: 0,state,region,ethanol,YEAR,U.S. ethanol
0,Connecticut,East Coast,19.0,1981,5
1,Delaware,East Coast,19.0,1981,5
2,Florida,East Coast,19.0,1981,5
3,Georgia,East Coast,19.0,1981,5
4,Maine,East Coast,19.0,1981,5
...,...,...,...,...,...
1801,Wyoming,Rocky Mountain,14.0,2022,1002
1802,Alaska,West Coast,7.0,2022,1002
1803,California,West Coast,7.0,2022,1002
1804,Oregon,West Coast,7.0,2022,1002
