# PREPROCESSING

This notebook preprocesses the raw data to produce the dataset that will be used in later steps.

Steps followed:
1. [Load data](#1.-Load-data)
2. [Preprocess climate data](#2.-Preprocess-climate-data)
3. [Preprocess main data](#3.-Preprocess-main-data)
4. [Merge features](#4.-Merge-features)
5. [Save data](#5.-Save-data)

## Load packages

In [38]:
# Store the current working directory
import os
CURRENT_PATH = os.getcwd()

# data manipulation
from pathlib import Path
import pandas as pd

In [39]:
data_path = Path('../../data/01-Preprocessing/')

## 1. Load data

In [40]:
# Loading required data to blend
main = pd.read_pickle(data_path/'main.pkl')
climate = pd.read_pickle(data_path/'climate.pkl')
cropland_state = pd.read_pickle(data_path/'cropland_state.pkl')

## 2. Preprocess climate data

Required before merging with main:
- Rename the date column to `YEAR`, as in the main data
- The data are monthly, so aggregate them per year/state, producing two new columns for each variable:
    - mean over the year
    - standard deviation over the year
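
The aggregation described above can be sketched on a toy table. The values and the single `Precipitation` variable are hypothetical, not the project's data. The sketch also shows that a year with a single monthly record has no defined sample standard deviation, which is likely why the 2022 rows in the output further down have empty `_std` columns.

```python
import pandas as pd

# Toy monthly records (hypothetical values, not the project's data)
monthly = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-02-01", "2020-03-01", "2021-01-01"],
    "State": ["Ohio", "Ohio", "Ohio", "Ohio"],
    "Precipitation": [2.0, 4.0, 6.0, 3.0],
})

# Derive the year from the monthly date
monthly["YEAR"] = pd.to_datetime(monthly["DATE"]).dt.year
monthly = monthly.drop(columns="DATE")

# One row per year/state with the mean and standard deviation of each variable
yearly = monthly.groupby(["YEAR", "State"]).agg(["mean", "std"])

# Flatten the (variable, statistic) MultiIndex into single-level names
yearly.columns = [f"{col}_{stat}" for col, stat in yearly.columns]
yearly = yearly.reset_index()
print(yearly)
```

Here 2021 has only one record, so `Precipitation_std` is NaN for that row.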

In [41]:
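# Parse the monthly dates and keep only the year, under the column name used in main
climate['DATE'] = pd.to_datetime(climate['DATE'])
climate = climate.rename(columns={'DATE': 'YEAR'})
climate['YEAR'] = climate['YEAR'].dt.year

# Aggregate the monthly values per year/state: mean and std of every variable
agg_df = climate.groupby(['YEAR','State']).agg(['mean', 'std'])
# Flatten the (variable, statistic) column MultiIndex into '<variable>_<statistic>'
new_columns = [f'{col}_{stat}' for col, stat in agg_df.columns]
agg_df.columns = new_columns

climate = agg_df.copy()
climate.reset_index(inplace=True)
climate = climate.rename_axis(None, axis=1)
# Match the state column name and upper-case spelling used in main
climate.rename(columns={'State': 'LOCATION_DESC'}, inplace=True)
climate = climate.apply(lambda x: x.str.upper() if x.dtype == 'object' else x)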

climate

Unnamed: 0,YEAR,LOCATION_DESC,Average Temperature_mean,Average Temperature_std,Cooling Degree Days_mean,Cooling Degree Days_std,Heating Degree Days_mean,Heating Degree Days_std,Maximum Temperature_mean,Maximum Temperature_std,Minimum Temperature_mean,Minimum Temperature_std,Palmer Drought Severity Index (PDSI)_mean,Palmer Drought Severity Index (PDSI)_std,Precipitation_mean,Precipitation_std
0,1895,ALABAMA,61.641667,15.202898,153.750000,184.686528,274.000000,303.660097,73.066667,15.096136,50.208333,15.393237,-0.325833,0.813080,4.200000,2.039947
1,1895,ARIZONA,58.491667,14.733602,206.250000,223.985034,192.333333,215.727662,71.408333,15.850406,45.550000,13.858670,1.408333,0.679971,0.950833,0.915478
2,1895,ARKANSAS,58.858333,17.049471,142.083333,179.073097,337.500000,363.964908,69.933333,17.121827,47.766667,17.078926,-0.200833,1.196688,3.781667,1.972999
3,1895,CALIFORNIA,56.458333,11.784771,42.833333,57.142300,290.583333,236.191547,68.416667,14.054105,44.475000,9.570801,0.868333,1.096348,1.906667,2.495794
4,1895,COLORADO,42.433333,17.281747,10.750000,18.091309,693.750000,500.796298,57.016667,18.314219,27.891667,16.279740,2.825833,0.685121,1.635000,0.731207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6237,2022,VIRGINIA,32.900000,,0.000000,,994.000000,,42.800000,,23.100000,,-1.340000,,4.510000,
6238,2022,WASHINGTON,31.200000,,0.000000,,869.000000,,37.000000,,25.400000,,1.440000,,6.580000,
6239,2022,WEST VIRGINIA,27.400000,,0.000000,,1155.000000,,37.000000,,17.800000,,0.720000,,4.860000,
6240,2022,WISCONSIN,9.200000,,0.000000,,1622.000000,,19.600000,,-1.300000,,-1.460000,,0.500000,


## 3. Preprocess main data 

Clean the data before merging:
- Keep only the columns with less than 20% NaNs, evaluated **per state**
- Impute the remaining NaNs with the column mean of that state

**Disclaimer**: because the columns were filtered per state, the combined dataset will still contain NaNs
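
To make the disclaimer concrete, here is a minimal sketch on hypothetical toy data with states `X` and `Y` and columns `A` and `B`. Column `B` is dropped for state `Y` only, so it reappears as NaN for `Y` once the per-state pieces are concatenated.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "A" has one gap in state X, "B" is fully missing in state Y
toy = pd.DataFrame({
    "LOCATION_DESC": ["X"] * 6 + ["Y"] * 2,
    "YEAR": [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001],
    "A": [1.0, 2.0, 3.0, np.nan, 4.0, 5.0, 7.0, 9.0],
    "B": [10.0] * 6 + [np.nan, np.nan],
})

parts = []
for state, state_df in toy.groupby("LOCATION_DESC"):
    # Keep only the columns with less than 20% missing values in this state
    missing_pct = state_df.isna().mean() * 100
    kept = state_df.loc[:, missing_pct[missing_pct < 20].index]
    # Impute what is left with the state's own column means
    kept = kept.fillna(kept.mean(numeric_only=True))
    parts.append(kept)

# Concatenation takes the union of columns, so dropped columns come back as NaN
combined = pd.concat(parts, ignore_index=True)
print(combined)
```

State `X` ends up complete (its single gap in `A` is imputed), while every `B` value for state `Y` is NaN in `combined`.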

In [42]:
# NaNs filter PER STATE
# Group by 'state' and filter columns with missing values < 20% for each state
filtered_dfs = {}
for state, state_df in main.groupby('LOCATION_DESC'):
    missing_percentages = state_df.isna().mean() * 100
    columns_to_keep = missing_percentages[missing_percentages < 20].index
    filtered_df = state_df.loc[:, columns_to_keep]
    # Fill missing values with column means
    filtered_df = filtered_df.fillna(filtered_df.mean(numeric_only=True))
    filtered_dfs[state] = filtered_df

# Concatenate the per-state DataFrames into one and sort it by 'YEAR'
main = pd.concat(filtered_dfs.values(), ignore_index=True)
main = main.sort_values('YEAR')
main = main.reset_index(drop=True)

main

Unnamed: 0,YEAR,LOCATION_DESC,CORN - ACRES PLANTED,"CORN, GRAIN - ACRES HARVESTED","CORN, GRAIN - PRODUCTION, MEASURED IN BU","CORN, GRAIN - YIELD, MEASURED IN BU / ACRE","CORN, SILAGE - ACRES HARVESTED","CORN, SILAGE - PRODUCTION, MEASURED IN TONS","CORN, SILAGE - YIELD, MEASURED IN TONS / ACRE","CORN, GRAIN - PRODUCTION, MEASURED IN $"
0,1919,ARIZONA,31.0,22.0,4.620000e+02,21.000000,6.000000,35.000000,5.900000,929.000000
1,1920,ARIZONA,29.0,21.0,2.730000e+02,13.000000,5.000000,32.000000,6.500000,420.000000
2,1921,ARIZONA,35.0,21.0,3.990000e+02,19.000000,8.000000,54.000000,6.700000,459.000000
3,1922,ARIZONA,39.0,23.0,3.910000e+02,17.000000,4.000000,26.000000,6.600000,461.000000
4,1923,ARIZONA,33.0,23.0,4.370000e+02,19.000000,4.000000,28.000000,7.000000,485.000000
...,...,...,...,...,...,...,...,...,...,...
4642,2022,ARIZONA,85.0,30.0,2.027508e+03,85.770874,16.360118,296.550026,17.886408,8436.572816
4643,2022,NEBRASKA,9600.0,9300.0,1.599600e+06,172.000000,276.201622,2955.949759,11.135417,
4644,2022,NEVADA,14.0,,,,9.534034,65.214162,15.683562,
4645,2022,MAINE,33.0,,,,24.263921,294.388543,13.940625,


## 4. Merge features

**Features are added only to rows that already exist in main, i.e. where the output is present**
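
A left merge is what enforces this. The sketch below uses hypothetical toy tables (`left` standing in for main, `right` for a feature table) and the `indicator` option of `pd.merge` to show which rows found a match.

```python
import pandas as pd

# Toy tables (hypothetical values)
left = pd.DataFrame({
    "YEAR": [2000, 2001],
    "LOCATION_DESC": ["OHIO", "OHIO"],
    "YIELD": [140.0, 150.0],
})
right = pd.DataFrame({
    "YEAR": [2001, 2002],
    "LOCATION_DESC": ["OHIO", "OHIO"],
    "Precipitation_mean": [3.1, 2.9],
})

# how="left" keeps every row of `left` and adds no rows that exist only in `right`
merged = pd.merge(left, right, on=["YEAR", "LOCATION_DESC"], how="left", indicator=True)
print(merged)
```

The 2002 row of `right` is discarded, and the 2000 row of `left` is kept with a NaN feature value, which is what the median fill in the following cells addresses.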

### 4.1 Climate on Main

In [43]:
# Add climate
data = pd.merge(main, climate, on=['YEAR', 'LOCATION_DESC'], how='left')

# Fill values missing from climate with the column median
for col in climate.columns:
    if col != 'YEAR' and col != 'LOCATION_DESC':
        median = data[col].median()
        # Assign back instead of fillna(inplace=True) on a column selection
        data[col] = data[col].fillna(median)

data

Unnamed: 0,YEAR,LOCATION_DESC,CORN - ACRES PLANTED,"CORN, GRAIN - ACRES HARVESTED","CORN, GRAIN - PRODUCTION, MEASURED IN BU","CORN, GRAIN - YIELD, MEASURED IN BU / ACRE","CORN, SILAGE - ACRES HARVESTED","CORN, SILAGE - PRODUCTION, MEASURED IN TONS","CORN, SILAGE - YIELD, MEASURED IN TONS / ACRE","CORN, GRAIN - PRODUCTION, MEASURED IN $",...,Heating Degree Days_mean,Heating Degree Days_std,Maximum Temperature_mean,Maximum Temperature_std,Minimum Temperature_mean,Minimum Temperature_std,Palmer Drought Severity Index (PDSI)_mean,Palmer Drought Severity Index (PDSI)_std,Precipitation_mean,Precipitation_std
0,1919,ARIZONA,31.0,22.0,4.620000e+02,21.000000,6.000000,35.000000,5.900000,929.000000,...,214.666667,229.288755,71.300000,16.231171,44.825000,15.267203,2.422500,1.430627,1.372500,1.194854
1,1920,ARIZONA,29.0,21.0,2.730000e+02,13.000000,5.000000,32.000000,6.500000,420.000000,...,211.000000,213.963038,71.208333,15.444002,44.375000,13.085184,2.357500,2.827209,1.069167,0.813896
2,1921,ARIZONA,35.0,21.0,3.990000e+02,19.000000,8.000000,54.000000,6.700000,459.000000,...,167.750000,182.168815,73.958333,13.132850,45.858333,13.304918,-0.372500,2.107459,1.136667,1.200578
3,1922,ARIZONA,39.0,23.0,3.910000e+02,17.000000,4.000000,26.000000,6.600000,461.000000,...,218.583333,229.390282,72.358333,16.656064,44.858333,14.897496,0.980833,1.182374,1.010833,0.682368
4,1923,ARIZONA,33.0,23.0,4.370000e+02,19.000000,4.000000,28.000000,7.000000,485.000000,...,200.500000,204.252740,71.116667,14.752000,44.600000,13.089760,0.172500,1.478286,1.266667,0.986881
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4642,2022,ARIZONA,85.0,30.0,2.027508e+03,85.770874,16.360118,296.550026,17.886408,8436.572816,...,444.000000,417.486527,57.200000,17.462243,30.300000,15.427858,-3.140000,1.184574,0.300000,1.490589
4643,2022,NEBRASKA,9600.0,9300.0,1.599600e+06,172.000000,276.201622,2955.949759,11.135417,,...,1265.000000,417.486527,39.500000,17.462243,11.700000,15.427858,-1.880000,1.184574,0.250000,1.490589
4644,2022,NEVADA,14.0,,,,9.534034,65.214162,15.683562,,...,663.000000,417.486527,45.800000,17.462243,22.400000,15.427858,-3.840000,1.184574,0.140000,1.490589
4645,2022,MAINE,33.0,,,,24.263921,294.388543,13.940625,,...,1536.000000,417.486527,21.900000,17.462243,-0.200000,15.427858,-0.860000,1.184574,2.430000,1.490589


### 4.2 Cropland on the result of 4.1

In [44]:
cropland_state.rename(columns={'Regions and States': 'LOCATION_DESC'}, inplace=True)
cropland_state['LOCATION_DESC'] = cropland_state['LOCATION_DESC'].str.upper()

# Add cropland
data = data.merge(cropland_state, on=['YEAR', 'LOCATION_DESC'], how='left')

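# Median fill; the loop runs over the climate columns, so the cropland column VALUE is not filled
for col in climate.columns:
    if col != 'YEAR' and col != 'LOCATION_DESC':
        median = data[col].median()
        # Assign back instead of fillna(inplace=True) on a column selection
        data[col] = data[col].fillna(median)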
        
data

Unnamed: 0,YEAR,LOCATION_DESC,CORN - ACRES PLANTED,"CORN, GRAIN - ACRES HARVESTED","CORN, GRAIN - PRODUCTION, MEASURED IN BU","CORN, GRAIN - YIELD, MEASURED IN BU / ACRE","CORN, SILAGE - ACRES HARVESTED","CORN, SILAGE - PRODUCTION, MEASURED IN TONS","CORN, SILAGE - YIELD, MEASURED IN TONS / ACRE","CORN, GRAIN - PRODUCTION, MEASURED IN $",...,Heating Degree Days_std,Maximum Temperature_mean,Maximum Temperature_std,Minimum Temperature_mean,Minimum Temperature_std,Palmer Drought Severity Index (PDSI)_mean,Palmer Drought Severity Index (PDSI)_std,Precipitation_mean,Precipitation_std,VALUE
0,1919,ARIZONA,31.0,22.0,4.620000e+02,21.000000,6.000000,35.000000,5.900000,929.000000,...,229.288755,71.300000,16.231171,44.825000,15.267203,2.422500,1.430627,1.372500,1.194854,
1,1920,ARIZONA,29.0,21.0,2.730000e+02,13.000000,5.000000,32.000000,6.500000,420.000000,...,213.963038,71.208333,15.444002,44.375000,13.085184,2.357500,2.827209,1.069167,0.813896,
2,1921,ARIZONA,35.0,21.0,3.990000e+02,19.000000,8.000000,54.000000,6.700000,459.000000,...,182.168815,73.958333,13.132850,45.858333,13.304918,-0.372500,2.107459,1.136667,1.200578,
3,1922,ARIZONA,39.0,23.0,3.910000e+02,17.000000,4.000000,26.000000,6.600000,461.000000,...,229.390282,72.358333,16.656064,44.858333,14.897496,0.980833,1.182374,1.010833,0.682368,
4,1923,ARIZONA,33.0,23.0,4.370000e+02,19.000000,4.000000,28.000000,7.000000,485.000000,...,204.252740,71.116667,14.752000,44.600000,13.089760,0.172500,1.478286,1.266667,0.986881,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4642,2022,ARIZONA,85.0,30.0,2.027508e+03,85.770874,16.360118,296.550026,17.886408,8436.572816,...,417.486527,57.200000,17.462243,30.300000,15.427858,-3.140000,1.184574,0.300000,1.490589,
4643,2022,NEBRASKA,9600.0,9300.0,1.599600e+06,172.000000,276.201622,2955.949759,11.135417,,...,417.486527,39.500000,17.462243,11.700000,15.427858,-1.880000,1.184574,0.250000,1.490589,
4644,2022,NEVADA,14.0,,,,9.534034,65.214162,15.683562,,...,417.486527,45.800000,17.462243,22.400000,15.427858,-3.840000,1.184574,0.140000,1.490589,
4645,2022,MAINE,33.0,,,,24.263921,294.388543,13.940625,,...,417.486527,21.900000,17.462243,-0.200000,15.427858,-0.860000,1.184574,2.430000,1.490589,
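
In the rows displayed above, `VALUE` is still empty, because the fill loop iterates over `climate.columns` rather than over the cropland columns. If the intent is to fill the cropland values with the median as well, a minimal sketch on hypothetical toy data could look like this. The names `merged`, `key_cols` and `cropland_cols` are illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy merged table (hypothetical values); VALUE plays the role of the cropland column
merged = pd.DataFrame({
    "YEAR": [2000, 2001, 2002, 2003],
    "LOCATION_DESC": ["OHIO"] * 4,
    "VALUE": [10.0, np.nan, 30.0, np.nan],
})

key_cols = ["YEAR", "LOCATION_DESC"]
# Columns contributed by the cropland table, excluding the merge keys
cropland_cols = [c for c in ["YEAR", "LOCATION_DESC", "VALUE"] if c not in key_cols]

for col in cropland_cols:
    # Assign the result back rather than calling fillna(inplace=True) on a selection
    merged[col] = merged[col].fillna(merged[col].median())

print(merged)
```

In the notebook itself, the list of columns would presumably come from `cropland_state.columns`.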


## 5. Save data

In [45]:
data.to_pickle(data_path/'data.pkl')