<a href="https://colab.research.google.com/github/stevegbrooks/commodify/blob/preprocessing/usda_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
import requests, zipfile, io
import pandas as pd
import numpy as np
out_path = "~/CIS550/commodify/data/"

## Process commodities data from USDA

In [19]:
zip_url = "https://apps.fas.usda.gov/psdonline/downloads/psd_alldata_csv.zip"

r = requests.get(zip_url)
r.raise_for_status()  # fail here on a bad download instead of leaving usda_data undefined
z = zipfile.ZipFile(io.BytesIO(r.content))
usda_data = pd.read_csv(z.open('psd_alldata.csv'))

usda_data.head(n=5)

KeyboardInterrupt: 
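The download cell reads a CSV straight out of a zip archive held in memory. The sketch below isolates that step in a small helper and runs it on a tiny stand-in archive, so it works without network access. The helper name `read_csv_from_zip_bytes` and the toy file `toy.csv` are illustrative and not part of the original notebook.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip_bytes(payload: bytes, member: str) -> pd.DataFrame:
    """Read one CSV member out of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle)


# Build a tiny stand-in archive (hypothetical data) so the sketch runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy.csv", "Country_Name,Value\nNetherlands,1.0\nFrance,2.0\n")

toy = read_csv_from_zip_bytes(buffer.getvalue(), "toy.csv")
print(toy)
```

With the real data, the same helper could be fed `r.content` after calling `r.raise_for_status()`, with `'psd_alldata.csv'` as the member name. Passing a `timeout` to `requests.get` may also help avoid a download that hangs indefinitely.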

### Deal with null values

In [3]:
usda_data.isnull().values.any()

True

It looks like only the `Country_Code` column has `NaN`. 
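One way to confirm which columns contain missing values is to count nulls per column. This is a minimal sketch on a toy frame (the rows are made up for illustration); on the real data the same two lines would be applied to `usda_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for usda_data with one missing country code.
toy = pd.DataFrame({
    "Country_Code": ["NL", np.nan, "FR"],
    "Country_Name": ["Netherlands", "Netherlands Antilles", "France"],
    "Value": [1.0, 0.0, 2.0],
})

# Count missing values per column, then keep only the columns that have any.
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(cols_with_nulls)
```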


In [4]:
usda_data[usda_data.isnull().any(axis=1)]

Unnamed: 0,Commodity_Code,Commodity_Description,Country_Code,Country_Name,Market_Year,Calendar_Year,Month,Attribute_ID,Attribute_Description,Unit_ID,Unit_Description,Value
716727,813100,"Meal, Soybean",,Netherlands Antilles,1976,2006,6,20,Beginning Stocks,8,(1000 MT),0.0
716728,813100,"Meal, Soybean",,Netherlands Antilles,1976,2006,6,7,Crush,8,(1000 MT),0.0
716729,813100,"Meal, Soybean",,Netherlands Antilles,1976,2006,6,125,Domestic Consumption,8,(1000 MT),1.0
716730,813100,"Meal, Soybean",,Netherlands Antilles,1976,2006,6,176,Ending Stocks,8,(1000 MT),0.0
716731,813100,"Meal, Soybean",,Netherlands Antilles,1976,2006,6,88,Exports,8,(1000 MT),0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1726314,612000,"Sugar, Centrifugal",,Netherlands Antilles,2021,2020,11,99,Refined Exp.(Raw Val),8,(1000 MT),0.0
1726315,612000,"Sugar, Centrifugal",,Netherlands Antilles,2021,2020,11,74,Refined Imp.(Raw Val),8,(1000 MT),0.0
1726316,612000,"Sugar, Centrifugal",,Netherlands Antilles,2021,2020,11,126,Total Disappearance,8,(1000 MT),0.0
1726317,612000,"Sugar, Centrifugal",,Netherlands Antilles,2021,2020,11,178,Total Distribution,8,(1000 MT),0.0


Let's check which values of `Country_Name` don't have a `Country_Code`.

In [5]:
usda_data[usda_data.isnull().any(axis=1)]["Country_Name"].unique()

array(['Netherlands Antilles'], dtype=object)
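A single entity with a missing code can sometimes be a parsing artifact rather than a real gap: by default `pd.read_csv` treats the literal string `NA` as a missing value. Whether the USDA file actually uses `NA` for this entity is an assumption that is not checked here. The sketch below only demonstrates the pandas behaviour on made-up rows.

```python
import io

import pandas as pd

# Hypothetical two-row CSV where one code is the literal string "NA".
raw = "Country_Code,Country_Name\nNL,Netherlands\nNA,Toyland\n"

# Default parsing: "NA" is in pandas' default list of missing-value markers.
default = pd.read_csv(io.StringIO(raw))

# Treat only empty fields as missing, so "NA" survives as a string.
literal = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default["Country_Code"].tolist())
print(literal["Country_Code"].tolist())
```

If the raw file turned out to contain such a code, re-reading it with these options would be an alternative to overwriting the value by hand.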

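Before it was dissolved in 2010, the Netherlands Antilles was a constituent country of the Kingdom of the Netherlands. Its islands are now either separate constituent countries (Curaçao, Sint Maarten) or special municipalities of the Netherlands (Bonaire, Sint Eustatius, Saba).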

We can just set the country code of Netherlands Antilles to the Netherlands' country code.

In [6]:
usda_data[usda_data["Country_Name"] == "Netherlands"]["Country_Code"].unique()

array(['NL'], dtype=object)

In [9]:
usda_data.loc[usda_data.Country_Name == "Netherlands Antilles", 'Country_Code'] = "NL"

Check that the assignment worked. This should return `False`.

In [10]:
usda_data.isnull().values.any()

False

## Reshape from long to wide

In [16]:
usda_pivot = usda_data.pivot(
    index = ["Commodity_Description", "Market_Year", "Month", "Country_Name"],
    columns = "Attribute_Description",
    values = "Value")
usda_pivot

NameError: name 'usda_data' is not defined
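To illustrate the long-to-wide reshape, here is a small sketch on made-up rows that use the same column names. It also checks for duplicate key and attribute combinations first, since `DataFrame.pivot` raises a `ValueError` when the index and column pairs are not unique. Passing a list as `index` assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy long-format data: one row per (keys, attribute) pair.
long_df = pd.DataFrame({
    "Commodity_Description": ["Corn", "Corn", "Corn", "Corn"],
    "Market_Year": [2020, 2020, 2020, 2020],
    "Month": [6, 6, 6, 6],
    "Country_Name": ["France", "France", "Spain", "Spain"],
    "Attribute_Description": ["Imports", "Exports", "Imports", "Exports"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

keys = ["Commodity_Description", "Market_Year", "Month", "Country_Name"]

# pivot requires each (keys, attribute) combination to appear at most once.
dupes = int(long_df.duplicated(subset=keys + ["Attribute_Description"]).sum())

# One column per attribute, then move the keys back into ordinary columns.
wide = long_df.pivot(index=keys, columns="Attribute_Description", values="Value").reset_index()
wide.columns.name = None  # drop the leftover "Attribute_Description" axis label

print(dupes)
print(wide)
```

If duplicates did exist, `pivot_table` with an explicit `aggfunc` would be one alternative, at the cost of silently aggregating rows.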

In [15]:
usda_pivot_reset = usda_pivot.reset_index(drop=False)
usda_pivot_reset

NameError: name 'usda_pivot' is not defined

In [14]:
cols_to_keep = ["Commodity_Description", "Market_Year", "Month", "Country_Name",
           "Beginning Stocks", "Ending Stocks", "Imports", "Exports", 
           "Area Harvested", "Yield", "Production", "Domestic Consumption"]

output = usda_pivot_reset[cols_to_keep]

output

NameError: name 'usda_pivot_reset' is not defined

The last step is to convert the `Country_Name` column to the matching political entity IDs.

In [13]:
pol_ent = pd.read_csv(out_path + "political_entity.csv")
pol_ent = pol_ent[pol_ent["is_country"] == 1]
# rename returns a new DataFrame, so the result has to be assigned
pol_ent = pol_ent.rename(columns={"name" : "Country_Name"})

# merge matches on the Country_Name column; join would match against pol_ent's index
output = output.merge(pol_ent, on = "Country_Name", how = "left")
output
#cols_to_keep = ["Commodity_Description", "Market_Year", "Month", "PE_ID",
#                "Beginning Stocks", "Ending Stocks", "Imports", "Exports", 
#                "Area Harvested", "Yield", "Production", "Domestic Consumption"]

#output[cols_to_keep].to_csv(out_path + "usda_data.csv", index = False)

#output

NameError: name 'output' is not defined
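The name-to-ID mapping can be sketched with a left merge plus `indicator=True`, which makes it easy to list country names that have no match in the political entity table. The frames below are made up for illustration. The `PE_ID` column is assumed from the commented-out `cols_to_keep` above, and the real `political_entity.csv` may be laid out differently.

```python
import pandas as pd

# Toy stand-ins for `output` and `pol_ent` (hypothetical rows and IDs).
output_toy = pd.DataFrame({
    "Country_Name": ["France", "Spain", "Atlantis"],
    "Imports": [1.0, 3.0, 5.0],
})
pol_ent_toy = pd.DataFrame({
    "PE_ID": [10, 20],
    "name": ["France", "Spain"],
    "is_country": [1, 1],
})

# Keep countries only, align the key column name, and drop unused columns.
lookup = (
    pol_ent_toy[pol_ent_toy["is_country"] == 1]
    .rename(columns={"name": "Country_Name"})[["PE_ID", "Country_Name"]]
)

# A left merge keeps every row of output_toy; _merge flags rows without a match.
merged = output_toy.merge(lookup, on="Country_Name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country_Name"].tolist()

print(merged)
print(unmatched)
```

Any names listed in `unmatched` would need to be reconciled (for example, spelling differences between the two sources) before writing the final CSV.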