## Drought and Crop Yield

#### Project Proposal:
Droughts have received growing attention in recent years, especially from weather specialists. With climate change being actively tracked, it is important to understand drought patterns and how they relate to crop yield, since that relationship bears on global food security in the coming years. Food demand is growing with the world population while weather patterns are changing drastically. Through this study, we plan to help farmers understand extreme changes in weather patterns and how they will affect crop yield by examining previous crop yield and loss patterns.


#### Proposed Project: 
We will examine the trend in the relationship between crop yield and drought patterns in North and South America, and try to identify which crops are less correlated with drought indicators. Through this project, we want to gain a better understanding of drought's impact on agriculture and, by extension, of how crops are currently affected by climate change.

#### Questions to answer:
* Which crops are less correlated with drought indicators? If some crops are less correlated, would they be ‘good-yield’ crops in areas that are more prone to drought?
* How has climate change affected crop yields over the last few years, and were there any extreme changes in US crop yield patterns?
* In addition, we want to understand how soil moisture data corresponds to drought patterns and how soil moisture affects crop yields.

#### Scope of Study:
* Location: We will study the relationship between crop yield and drought indicators across US states.
* Timeframe: Because of data limitations, we will focus on data from 2010 to 2020.

#### Limitations: 
Analyzing crop yield requires large amounts of data, such as soil moisture, temperature, and fertilizer use. For better analysis and performance, it is important to obtain as much data as possible.


In [92]:
import os
import re
import csv
import glob
import matplotlib
import numpy as np
import geopandas as gpd
import pandas as pd
import folium
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import altair as alt
from datetime import datetime
from functools import reduce
warnings.simplefilter("ignore")
%matplotlib inline
import ee

# set max df column display
pd.set_option('display.max_columns', 500)

### Primary Dataset

our crop dataset from

In [93]:
# get all the data together
def getframe(folderpath, axis=0):
    path = folderpath
    all_files = glob.glob(os.path.join(path, "*.csv"))
    df_from_each_file = (pd.read_csv(f) for f in all_files)
    concat_df   = pd.concat(df_from_each_file, axis=axis, ignore_index=True)
    return concat_df

crop_df = getframe('crops_datasets_raw')

In [94]:
crop_df.head(3)

Unnamed: 0,Program,Year,Period,Week Ending,Geo Level,State,State ANSI,Ag District,Ag District Code,County,County ANSI,Zip Code,Region,watershed_code,Watershed,Commodity,Data Item,Domain,Domain Category,Value,CV (%)
0,SURVEY,2020,MARKETING YEAR,,STATE,ALABAMA,1,,,,,,,0,,CORN,"CORN, GRAIN - PRICE RECEIVED, MEASURED IN $ / BU",TOTAL,NOT SPECIFIED,5.14,
1,SURVEY,2020,MARKETING YEAR,,STATE,ALABAMA,1,,,,,,,0,,COTTON,"COTTON, COTTONSEED - PRICE RECEIVED, MEASURED ...",TOTAL,NOT SPECIFIED,139,
2,SURVEY,2020,MARKETING YEAR,,STATE,ALABAMA,1,,,,,,,0,,COTTON,"COTTON, PIMA - PRICE RECEIVED, MEASURED IN $ / LB",TOTAL,NOT SPECIFIED,(NA),


### Secondary Dataset
#### Now let's get our drought data from the U.S. Drought Monitor website

Source: https://droughtmonitor.unl.edu/DmData/DataDownload/ComprehensiveStatistics.aspx 

The file is downloaded in CSV format from the website above using the following parameters:

* Start Date: 01/01/2010 and End Date: 12/31/2020
* Spatial Scale: State and choose all states
* Statistics Category: Reports at percent level one drought category at time

To understand the data better, we can use the data dictionary provided on the website:
*   None - No reported drought
*   D0 - Abnormally Dry
*   D1 - Moderate Drought
*   D2 - Severe Drought
*   D3 - Extreme Drought
*   D4 - Exceptional Drought

These drought categories are calculated from various data, with precipitation, soil moisture, and surface temperature being the main variables.
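The yearly averages shown later in this notebook have `None + D0` equal to 100 and `D0 >= D1 >= ... >= D4`, which suggests the percentages are cumulative (each level includes all worse levels). Here is a minimal sketch of turning such cumulative values into per-category shares. The single row reuses the rounded Alabama 2010 figures from the merged table, and the variable names are illustrative only.

```python
import pandas as pd

# One toy row of cumulative percentages (rounded Alabama 2010 averages).
cumulative = pd.DataFrame(
    {"None": [58.05], "D0": [41.95], "D1": [23.08], "D2": [9.37], "D3": [2.02], "D4": [0.00]}
)

levels = ["D0", "D1", "D2", "D3", "D4"]

# A category's own share is its cumulative value minus the next-worse level's value.
categorical = cumulative[levels].copy()
for current, worse in zip(levels[:-1], levels[1:]):
    categorical[current] = cumulative[current] - cumulative[worse]
categorical.insert(0, "None", cumulative["None"])

# Per-category shares should add up to roughly 100 percent.
total = categorical.sum(axis=1).iloc[0]
print(categorical.round(2))
print(round(total, 2))
```

If the downloaded file were categorical instead, the shares would already sum to 100 and this step would not be needed.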







In [95]:
drought_df_raw = pd.read_csv('secondary_dataset/drought_data.csv')

In [96]:
# get date to Datetime format
# find dict for state abbreviation to lookup key and value then merge with the crop file

us_state_to_abbrev = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO",
    "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN",
    "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI",
    "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH",
    "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", "North Carolina": "NC", "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK",
    "Oregon": "OR", "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN",
    "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY",
    "District of Columbia": "DC", "American Samoa": "AS", "Guam": "GU", "Northern Mariana Islands": "MP", "Puerto Rico": "PR",
     "United States Minor Outlying Islands": "UM", "U.S. Virgin Islands": "VI",
}
    
# invert the dictionary
abbrev_to_us_state = dict(map(reversed, us_state_to_abbrev.items()))

drought_df_raw['state'] = drought_df_raw['StateAbbreviation'].map(abbrev_to_us_state)
drought_df_raw['MapDate'] = pd.to_datetime(drought_df_raw['MapDate'], format='%Y%m%d') # %m is month; %M would mean minutes
drought_df_raw.set_index('MapDate', inplace=True)

In [97]:
drought_year_df = drought_df_raw.groupby([pd.Grouper(freq='Y'), 'state']).mean()
drought_year_df = drought_year_df.reset_index()
drought_year_df['Year'] = drought_year_df['MapDate'].dt.year
drought_year_df.rename(columns={'state':'State'}, inplace=True)
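The yearly grouping above depends on `MapDate` being parsed into the right calendar date. When parsing compact dates, the month directive is lowercase `%m`; uppercase `%M` means minutes. The sketch below uses only the standard library and a hypothetical `MapDate` value, assuming the column is laid out as `YYYYMMDD`.

```python
from datetime import datetime

stamp = "20200714"  # hypothetical MapDate value: 14 July 2020

as_month = datetime.strptime(stamp, "%Y%m%d")   # "07" is read as the month
as_minute = datetime.strptime(stamp, "%Y%M%d")  # "07" is read as minutes, month defaults to January

print(as_month)
print(as_minute)
```

The year is the same either way, so yearly averages are unaffected, but any monthly or seasonal analysis would silently place every record in January.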

#### Precipitation by State from 2010-2022
Source : https://www.ncdc.noaa.gov/cag/statewide/time-series

* First get the data from the source above and concat them together.
* The data that we get for this project is monthly data by state from year 2010 to year 2022

In [98]:
# read each file and change column value to each state
path = 'secondary_dataset/precipitation'
all_files = glob.glob(os.path.join(path, "*.csv"))
files_list = []
for file in all_files:
    df = pd.read_csv(file, skiprows=4)
    x = re.findall(r'[A-Z]\w+', file) # using regex to extract file state name and fill in column
    df['state'] = x[0]
    files_list.append(df)
pcp_df = pd.concat(files_list, axis=0, ignore_index=True)
pcp_df['Date'] = pd.to_datetime(pcp_df.Date, format='%Y%m')
pcp_df.set_index('Date', inplace=True)
# pcp_df

In [99]:
# pcp_df by year average value
pcp_year_df = pcp_df.groupby([pd.Grouper(freq='Y'), 'state']).mean()
pcp_year_df = pcp_year_df.reset_index()
pcp_year_df['Year'] =pcp_year_df['Date'].dt.year
pcp_year_df.rename(columns={'Value':'precip', 'state':'State'}, inplace=True)
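As an alternative to `pd.Grouper(freq='Y')`, the same yearly average can be sketched by grouping on the calendar year directly. This avoids the `'Y'` frequency alias, which newer pandas releases deprecate in favor of `'YE'`, and averages only the chosen column. The monthly values below are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly precipitation values for two states.
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["201001", "201002", "201101", "201001", "201002"], format="%Y%m"),
    "state": ["Alabama", "Alabama", "Alabama", "Texas", "Texas"],
    "Value": [4.0, 6.0, 3.0, 1.0, 2.0],
})

# Group by calendar year and state, then average the monthly values.
yearly = (
    monthly.assign(Year=monthly["Date"].dt.year)
    .groupby(["Year", "state"], as_index=False)["Value"]
    .mean()
    .rename(columns={"Value": "precip", "state": "State"})
)
print(yearly)
```

The result already carries an integer `Year` column, so it can be merged with the crop data on `['Year', 'State']` without a separate `reset_index` step.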

#### Land Temperature by State
Source : https://www.ncdc.noaa.gov/cag/statewide/time-series

* First get the data from the source above and concat them together.
* The data that we get for this project is monthly data by state from year 2010 to year 2022

In [100]:
# read the first line of each csv to collect the state name
path = 'secondary_dataset/temperature'
files = [os.path.join(path, f) for f in os.listdir(path)] # if os.path.isfile(os.path.join(path, f))
df_list = []
for f in files:
    with open(f, 'r') as file: # close the file handle once the header line is read
        firstline = file.readline()
    state = firstline.split(',')[0] # extract state from csv file
    df = pd.read_csv(f, skiprows=4) # read_csv to df 
    df['state'] = state
    df_list.append(df)
temp_df = pd.concat(df_list, ignore_index=True)
temp_df['Date'] = pd.to_datetime(temp_df['Date'], format='%Y%m') # %m is month; %M would mean minutes
temp_df.set_index('Date', inplace=True)


In [101]:
temp_year_df = temp_df.groupby([pd.Grouper(freq='Y'), 'state']).mean()
temp_year_df = temp_year_df.reset_index()
temp_year_df['Year'] =temp_year_df['Date'].dt.year
temp_year_df.rename(columns={'Value':'temp', 'state':'State'}, inplace=True)

#### Palmer Drought Severity Index (PDSI)
Source : https://www.ncdc.noaa.gov/cag/statewide/time-series

In [102]:
# read the first line of each csv to collect the state name
path = 'secondary_dataset/pdsi'
files = [os.path.join(path, f) for f in os.listdir(path)] # if os.path.isfile(os.path.join(path, f))
df_list = []
for f in files:
    with open(f, 'r') as file: # close the file handle once the header line is read
        firstline = file.readline()
    state = firstline.split(',')[0] # extract state from csv file
    df = pd.read_csv(f, skiprows=3) # read_csv to df 
    df['state'] = state
    df_list.append(df)
pdsi_df = pd.concat(df_list, axis=0, ignore_index=True)
pdsi_df['Date'] = pd.to_datetime(pdsi_df['Date'], format='%Y%m') # %m is month; %M would mean minutes
pdsi_df.set_index('Date', inplace=True)


In [103]:
pdsi_year_df = pdsi_df.groupby([pd.Grouper(freq='Y'), 'state']).mean()
pdsi_year_df = pdsi_year_df.reset_index()
pdsi_year_df['Year'] =pdsi_year_df['Date'].dt.year
pdsi_year_df.rename(columns={'Value':'pdsi', 'state':'State'}, inplace=True)
# pdsi_year_df

## **Data Cleansing and Manipulation** 

**Step 1**

* For our primary crop dataset, we first want to look at the top 10 commodities in the U.S. We need to view this data at the commodity level rather than the state level, so we rank commodities by filtering the data items to acres harvested.

**Step 2**

Create a bar graph for top 10 Commodities in altair with an interaction by year

In [104]:
def top_commodities():
    filter1 = crop_df.copy()

    filter1 = filter1.drop(columns = ['Program', 'Period','Week Ending','Geo Level','State ANSI','Ag District',
                                    'Ag District Code','County','County ANSI','Zip Code','Region', 'watershed_code',
                                    'Watershed','Domain','Domain Category', 'CV (%)'])

    filter1 = filter1[filter1["Commodity"].str.contains("CROP TOTALS") == False]
    filter1 = filter1[filter1["Commodity"].str.contains("RENT") == False]
    commodity = filter1['Commodity'].unique() # Check to Remove commodities that are totals

    filter1 = filter1[filter1["Data Item"].str.contains("ACRES HARVESTED") == True]
    filter1 = filter1[filter1["Data Item"].str.contains("EXCL ALFALFA") == False]
    data_items = filter1['Data Item'].unique() # Check to retain only acres harvested

    # Flag non-numeric placeholders such as "(NA)" so those rows can be dropped
    filter1['Value'] = filter1['Value'].str.replace(r'[()]', "to_remove", regex=True)

    # Split off the flagged rows; only the numeric rows are kept
    filter1_f = filter1[filter1["Value"].str.contains("to_remove") == True] ## Do not need
    filter1_t = filter1[filter1["Value"].str.contains("to_remove") == False].copy()

    # Convert Value column into an integer
    filter1_t['Value'] = filter1_t['Value'].str.replace(',', '')
    filter1_t['Value'] = filter1_t['Value'].astype(str).astype(int)
    filter1_t = filter1_t.reset_index(drop=True)

    # Group by commodity and average the acres harvested, reported in tens of thousands
    filter2 = filter1_t.groupby(['Commodity'])['Value'].mean().sort_values(ascending=False).reset_index()
    filter2 = filter2.rename(columns = {'Value':'Average Acres Harvested (0000s)'})
    filter2['Average Acres Harvested (0000s)'] = filter2['Average Acres Harvested (0000s)'].div(10000).round(2)

    return filter2


In [105]:
top_com = top_commodities()
com_filter = list(top_com['Commodity'][:5])
com_filter

['SOYBEANS', 'CORN', 'HAY & HAYLAGE', 'WHEAT', 'HAY']
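The ranking above hinges on turning the string-typed `Value` column into numbers. Below is a minimal sketch of that cleaning step using `pd.to_numeric` with `errors="coerce"`, run on a made-up sample. The `"(NA)"` placeholder matches the `crop_df` preview, while `"(D)"` and all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the string-typed Value column with placeholder codes.
raw = pd.DataFrame({
    "Commodity": ["CORN", "CORN", "HAY", "COTTON", "HAY"],
    "Value": ["259,000", "300,000", "1,560,000", "(NA)", "(D)"],
})

# Strip thousands separators, coerce anything non-numeric to NaN, then drop those rows.
cleaned = raw.assign(
    Value=pd.to_numeric(raw["Value"].str.replace(",", "", regex=False), errors="coerce")
).dropna(subset=["Value"])

# Average per commodity, largest first.
ranking = (
    cleaned.groupby("Commodity")["Value"].mean().sort_values(ascending=False).reset_index()
)
print(ranking)
```

Dropping the placeholder rows, rather than filling them with 0, keeps withheld or missing figures from dragging the averages down.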

In [106]:
alt.themes.enable('fivethirtyeight')

source = top_commodities()


bar = alt.Chart(source, title='Top 10 Commodities within 2010 - 2020').mark_bar().encode(
    x='Average Acres Harvested (0000s):Q',
    y=alt.Y('Commodity:N', sort='-x')
).transform_window(
    rank='rank(Average Acres Harvested (0000s))',
    sort=[alt.SortField('Average Acres Harvested (0000s)', order='descending')]
).transform_filter(
    (alt.datum.rank < 11)
)

bar


### Merging the Datasets

We also want to merge the crop data with all the other variables we have, and to make the result a GeoDataFrame for easy choropleth plots.

In [107]:
# state shape file df
geo_gdf = gpd.read_file('states_shapefile/cb_2018_us_state_500k.shx')
geo_gdf['NAME'] = geo_gdf['NAME'].str.upper()
geo_gdf.rename(columns={'NAME':'State'}, inplace=True)
geo_gdf.State = geo_gdf.State.str.capitalize()
geo_gdf = geo_gdf[['State', 'geometry']]
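Because every merge in this notebook joins on the `State` string, it is worth checking that all frames spell state names the same way. The sketch below compares `str.capitalize()` with `str.title()` on hypothetical data and uses `indicator=True` to list rows that found no match. The frames and values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames: upper-case names on one side, mixed-case names on the other.
crops = pd.DataFrame({"State": ["ALABAMA", "NEW YORK"], "Acre": [259000.0, 1000.0]})
drought = pd.DataFrame({"State": ["Alabama", "New York"], "D0": [41.9, 12.3]})

unmatched_by_method = {}
for method in ("capitalize", "title"):
    # capitalize() upper-cases only the first character; title() does so for every word.
    left = crops.assign(State=getattr(crops["State"].str, method)())
    merged = left.merge(drought, on="State", how="left", indicator=True)
    unmatched_by_method[method] = merged.loc[merged["_merge"] == "left_only", "State"].tolist()

print(unmatched_by_method)
```

With `capitalize()`, multi-word states such as "New york" only match frames formatted the same way. `title()` matches the usual spelling but turns "District of Columbia" into "District Of Columbia", so a check like this is useful whichever convention is chosen.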

#### Cleaning the Primary Crop Data into Acre, Value ($), and Weight Harvested

In [108]:
# filter the crop_df1 to only acres_harvested items and only interested columns
crop_df1 = crop_df.copy()
crop_df1 = crop_df1[(crop_df1['Data Item'].str.contains('ACRES HARVESTED') == True)&(crop_df1['Year'] >= 2010)&(crop_df1['Year'] <= 2020)\
                    &(~crop_df1['Commodity'].str.contains('FIELD CROP TOTALS'))]
crop_df1 = crop_df1[['Year', 'State', 'Commodity', 'Value']]
crop_df1.Value = crop_df1.Value.apply(lambda x : x.replace(',', ''))
crop_df1.Value = pd.to_numeric(crop_df1.Value, errors='coerce').fillna(0)

In [109]:
# group by crop and year, sum value
crop_df1 = crop_df1.groupby(['Year', 'State', 'Commodity']).sum()
crop_df1.reset_index(inplace=True)
crop_df1.State = crop_df1.State.str.capitalize()
crop_df1.rename(columns={'Value':'Acre'}, inplace=True)
crop_df1

Unnamed: 0,Year,State,Commodity,Acre
0,2010,Alabama,CORN,259000.0
1,2010,Alabama,COTTON,676000.0
2,2010,Alabama,HAY,1560000.0
3,2010,Alabama,OATS,11000.0
4,2010,Alabama,PEANUTS,185000.0
...,...,...,...,...
4376,2020,Wyoming,BEANS,115100.0
4377,2020,Wyoming,CORN,200000.0
4378,2020,Wyoming,HAY,7790000.0
4379,2020,Wyoming,SUGARBEETS,181600.0


In [110]:
# filter crop data item by production measured in $
crop_df2 = crop_df.copy()
crop_df2 = crop_df2[(crop_df2['Data Item'].str.contains(r'PRODUCTION, MEASURED IN \$') == True) &(crop_df2['Year'] >= 2010)&(crop_df2['Year'] <= 2020)\
                   &(~crop_df2['Commodity'].str.contains('CROP TOTALS'))] 
crop_df2 = crop_df2[['Year', 'State', 'Commodity', 'Value']]
crop_df2.Value = crop_df2.Value.apply(lambda x : x.replace(',', ''))
crop_df2.Value = pd.to_numeric(crop_df2.Value, errors='coerce').fillna(0)
crop_df2.rename(columns={'Value':'Value_$'}, inplace=True)
crop_df2 = crop_df2.groupby(['Year', 'State', 'Commodity']).sum()
crop_df2.reset_index(inplace=True)
crop_df2.State = crop_df2.State.str.capitalize()
crop_df2

Unnamed: 0,Year,State,Commodity,Value_$
0,2010,Alabama,CORN,145763000.0
1,2010,Alabama,COTTON,219194000.0
2,2010,Alabama,HAY,318240000.0
3,2010,Alabama,HAY & HAYLAGE,159120000.0
4,2010,Alabama,OATS,1188000.0
...,...,...,...,...
4462,2020,Wyoming,CORN,26023000.0
4463,2020,Wyoming,HAY,917370000.0
4464,2020,Wyoming,HAY & HAYLAGE,917370000.0
4465,2020,Wyoming,SUGARBEETS,43307000.0


In [137]:
crop_df3 = crop_df.copy()

# check which filter we need and which units must be converted into a single weight unit
crop_df3 = crop_df3[crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN')]
units = crop_df3['Data Item'].unique().tolist()
new_unit = set([unit.split('-')[1].strip() for unit in units])
new_unit

unit_list = ['PRODUCTION, MEASURED IN 480 LB BALES',
'PRODUCTION, MEASURED IN BU',
'PRODUCTION, MEASURED IN CWT',
'PRODUCTION, MEASURED IN GALLONS',
'PRODUCTION, MEASURED IN LB',
'PRODUCTION, MEASURED IN TONS']

# keep only the rows reported in tons; the commented-out line would keep every unit in unit_list
# crop_df3 = crop_df3[crop_df3['Data Item'].str.contains(('|').join(unit_list))]
crop_df3 = crop_df3[crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN TONS')]
crop_df3 = crop_df3[['Year', 'State', 'Commodity', 'Data Item', 'Value']]
crop_df3.Value = crop_df3.Value.apply(lambda x : x.replace(',', ''))
crop_df3.Value = pd.to_numeric(crop_df3.Value, errors='coerce').fillna(0)

# the conversions below only take effect if the unit_list filter above is re-enabled,
# because the active filter keeps rows measured in tons only
crop_df3['Value'] = np.where(crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN 480 LB BALES'), crop_df3['Value']*480/2000, crop_df3['Value']) # convert bales to pounds, then tons
crop_df3['Value'] = np.where(crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN BU'), crop_df3['Value']*0.021772, crop_df3['Value']) 
crop_df3['Value'] = np.where(crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN CWT'), crop_df3['Value']*0.056, crop_df3['Value'])
crop_df3['Value'] = np.where(crop_df3['Data Item'].str.contains('PRODUCTION, MEASURED IN GALLONS'), crop_df3['Value']*11.358/2000, crop_df3['Value']) # convert syrup gallons to pounds, then tons
crop_df3 = crop_df3[['Year', 'State', 'Commodity', 'Value']]
crop_df3.rename(columns={'Value':'Tons'}, inplace=True)
crop_df3.State = crop_df3.State.str.capitalize()
crop_df3 = crop_df3.groupby(['Year', 'State', 'Commodity']).mean().reset_index()
crop_df3

Unnamed: 0,Year,State,Commodity,Tons
0,2010,Alabama,CORN,1.350000e+05
1,2010,Alabama,COTTON,1.490000e+05
2,2010,Alabama,HAY,1.872000e+06
3,2010,Alaska,HAY,2.400000e+04
4,2010,Arizona,CORN,5.980000e+05
...,...,...,...,...
1928,2020,Wisconsin,HAY & HAYLAGE,4.828000e+06
1929,2020,Wisconsin,HAYLAGE,5.070667e+06
1930,2020,Wyoming,CORN,5.000000e+05
1931,2020,Wyoming,HAY,1.513000e+06
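If the other production units were brought back in, one way to sketch the conversion is a lookup of pounds per unit, with bushels handled per commodity because a bushel's weight depends on the crop. The helper `to_short_tons` and the sample rows are hypothetical. The sketch assumes "TONS" means US short tons (2000 lb), uses standard US bushel weights, and leaves units without a known weight as NaN.

```python
import pandas as pd

LB_PER_SHORT_TON = 2000

# Pounds per reported unit (assumption: TONS are US short tons, CWT is 100 lb).
LB_PER_UNIT = {"TONS": 2000, "LB": 1, "CWT": 100, "480 LB BALES": 480}
# Standard US bushel weights in pounds for a few grains.
LB_PER_BUSHEL = {"CORN": 56, "SOYBEANS": 60, "WHEAT": 60, "OATS": 32, "BARLEY": 48}

def to_short_tons(commodity, data_item, value):
    """Convert a reported production value to short tons, or NaN if the unit is unknown."""
    unit = data_item.split("MEASURED IN")[-1].strip()
    pounds = LB_PER_BUSHEL.get(commodity) if unit == "BU" else LB_PER_UNIT.get(unit)
    if pounds is None:
        return float("nan")
    return value * pounds / LB_PER_SHORT_TON

# Hypothetical rows in the style of the Data Item column.
sample = pd.DataFrame({
    "Commodity": ["CORN", "COTTON", "HAY", "MAPLE SYRUP"],
    "Data Item": [
        "CORN, GRAIN - PRODUCTION, MEASURED IN BU",
        "COTTON - PRODUCTION, MEASURED IN 480 LB BALES",
        "HAY - PRODUCTION, MEASURED IN TONS",
        "MAPLE SYRUP - PRODUCTION, MEASURED IN GALLONS",
    ],
    "Value": [1000, 100, 5, 10],
})
sample["Tons"] = [
    to_short_tons(c, d, v)
    for c, d, v in zip(sample["Commodity"], sample["Data Item"], sample["Value"])
]
print(sample[["Commodity", "Tons"]])
```

Keeping the factors in dictionaries makes each assumption visible in one place and avoids applying a single bushel factor to every grain.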


In [141]:
# merge the three crop frames: acres, value in $, and tons
all_crop_df = pd.merge(crop_df1, crop_df2, on=['Year', 'State', 'Commodity'], how='left')
all_crop_df_value = pd.merge(all_crop_df, crop_df3, on=['Year', 'State', 'Commodity'], how='left')
all_crop_df_value
# all_crop_df

Unnamed: 0,Year,State,Commodity,Acre,Value_$,Tons
0,2010,Alabama,CORN,259000.0,145763000.0,135000.0
1,2010,Alabama,COTTON,676000.0,219194000.0,149000.0
2,2010,Alabama,HAY,1560000.0,318240000.0,1872000.0
3,2010,Alabama,OATS,11000.0,1188000.0,
4,2010,Alabama,PEANUTS,185000.0,97643000.0,
...,...,...,...,...,...,...
4376,2020,Wyoming,BEANS,115100.0,15830000.0,
4377,2020,Wyoming,CORN,200000.0,26023000.0,500000.0
4378,2020,Wyoming,HAY,7790000.0,917370000.0,1513000.0
4379,2020,Wyoming,SUGARBEETS,181600.0,43307000.0,909000.0


In [143]:


# merge crop and all variables on year and state
dfs = [all_crop_df_value, drought_year_df, pdsi_year_df, pcp_year_df, temp_year_df]
df = reduce(lambda left,right : pd.merge(left, right, on=['Year', 'State'], how='left'), dfs).fillna(0)
df = df[['Year', 'State', 'Commodity', 'Value_$','Acre', 'Tons','None', 'D0', 'D1', 'D2', 'D3', 'D4', 'pdsi', 'precip', 'temp']]
df['Year'] = pd.to_datetime(df.Year, format='%Y')
df

Unnamed: 0,Year,State,Commodity,Value_$,Acre,Tons,None,D0,D1,D2,D3,D4,pdsi,precip,temp
0,2010-01-01,Alabama,CORN,145763000.0,259000.0,135000.0,58.051731,41.948269,23.075192,9.373269,2.024038,0.000000,-0.887500,3.932500,62.708333
1,2010-01-01,Alabama,COTTON,219194000.0,676000.0,149000.0,58.051731,41.948269,23.075192,9.373269,2.024038,0.000000,-0.887500,3.932500,62.708333
2,2010-01-01,Alabama,HAY,318240000.0,1560000.0,1872000.0,58.051731,41.948269,23.075192,9.373269,2.024038,0.000000,-0.887500,3.932500,62.708333
3,2010-01-01,Alabama,OATS,1188000.0,11000.0,0.0,58.051731,41.948269,23.075192,9.373269,2.024038,0.000000,-0.887500,3.932500,62.708333
4,2010-01-01,Alabama,PEANUTS,97643000.0,185000.0,0.0,58.051731,41.948269,23.075192,9.373269,2.024038,0.000000,-0.887500,3.932500,62.708333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4376,2020-01-01,Wyoming,BEANS,15830000.0,115100.0,0.0,41.825769,58.174231,42.290577,22.692500,8.695577,0.062308,-0.759167,0.976667,42.775000
4377,2020-01-01,Wyoming,CORN,26023000.0,200000.0,500000.0,41.825769,58.174231,42.290577,22.692500,8.695577,0.062308,-0.759167,0.976667,42.775000
4378,2020-01-01,Wyoming,HAY,917370000.0,7790000.0,1513000.0,41.825769,58.174231,42.290577,22.692500,8.695577,0.062308,-0.759167,0.976667,42.775000
4379,2020-01-01,Wyoming,SUGARBEETS,43307000.0,181600.0,909000.0,41.825769,58.174231,42.290577,22.692500,8.695577,0.062308,-0.759167,0.976667,42.775000


In [114]:
# create a geodataframe to work with geopandas or folium map
new_gdf = pd.merge(df, geo_gdf, on='State', how='left')
gdf = gpd.GeoDataFrame(new_gdf)
gdf

Unnamed: 0,Year,State,Commodity,Value_$,Acre,Tons,None,D0,D1,D2,D3,D4,pdsi,precip,temp,geometry
0,2010-01-01,Alabama,CORN,145763000.0,259000.0,625945.00,58.051731,41.948269,23.075192,9.373269,2.024038,0.0,-0.8875,3.9325,62.708333,"MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ..."
1,2010-01-01,Alabama,CORN,145763000.0,259000.0,135000.00,58.051731,41.948269,23.075192,9.373269,2.024038,0.0,-0.8875,3.9325,62.708333,"MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ..."
2,2010-01-01,Alabama,COTTON,219194000.0,676000.0,115200.00,58.051731,41.948269,23.075192,9.373269,2.024038,0.0,-0.8875,3.9325,62.708333,"MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ..."
3,2010-01-01,Alabama,COTTON,219194000.0,676000.0,149000.00,58.051731,41.948269,23.075192,9.373269,2.024038,0.0,-0.8875,3.9325,62.708333,"MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ..."
4,2010-01-01,Alabama,COTTON,219194000.0,676000.0,115200.00,58.051731,41.948269,23.075192,9.373269,2.024038,0.0,-0.8875,3.9325,62.708333,"MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20502,2020-01-01,Wyoming,SUGARBEETS,43307000.0,181600.0,921000.00,41.825769,58.174231,42.290577,22.6925,8.695577,0.062308,-0.759167,0.976667,42.775,"POLYGON ((-111.05456 45.00095, -111.04507 45.0..."
20503,2020-01-01,Wyoming,SUGARBEETS,43307000.0,181600.0,921000.00,41.825769,58.174231,42.290577,22.6925,8.695577,0.062308,-0.759167,0.976667,42.775,"POLYGON ((-111.05456 45.00095, -111.04507 45.0..."
20504,2020-01-01,Wyoming,SUGARBEETS,43307000.0,181600.0,921000.00,41.825769,58.174231,42.290577,22.6925,8.695577,0.062308,-0.759167,0.976667,42.775,"POLYGON ((-111.05456 45.00095, -111.04507 45.0..."
20505,2020-01-01,Wyoming,WHEAT,22042000.0,390000.0,50946.48,41.825769,58.174231,42.290577,22.6925,8.695577,0.062308,-0.759167,0.976667,42.775,"POLYGON ((-111.05456 45.00095, -111.04507 45.0..."


In [115]:
# crop_gdf.explore(column='Value', cmap='OrRd', legend=True, tooltip=['State', 'Data Item', 'Value'])

## Crop and Drought Data Analysis
<!-- What do we want to know from the analysis and how can we tell the story from those finding -->
The questions we want to answer are: How do crops respond to different drought variables? Which crops are more tolerant of drought than others? How would you re-cluster US states based on drought and climate data?

1. Let's first explore each drought variable in a chart.
2. Then let's explore the top 5 crops in the US with small-multiple choropleths over 5-year periods.
3. Crop correlation with each variable.
4. US state clustering by drought and climate variables; we can work on monthly data for each variable.
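Step 3 of the plan can be sketched as a per-commodity correlation between yield per acre and one drought variable. The numbers below are invented so that one crop falls sharply with `D2` and the other barely moves. The column names follow the merged `df`, while `yield_per_acre` is a derived column introduced here for illustration.

```python
import pandas as pd

# Hypothetical yearly records for two crops in one state.
toy = pd.DataFrame({
    "Commodity": ["CORN"] * 4 + ["HAY"] * 4,
    "Tons": [100, 92, 70, 40, 50, 52, 49, 51],
    "Acre": [10] * 8,
    "D2": [0, 10, 30, 60, 0, 10, 30, 60],
})
toy["yield_per_acre"] = toy["Tons"] / toy["Acre"]

# Pearson correlation between yield and drought coverage, computed per crop.
corr_by_crop = pd.Series({
    crop: group["yield_per_acre"].corr(group["D2"])
    for crop, group in toy.groupby("Commodity")
})
print(corr_by_crop.round(3))
```

A crop whose correlation stays near zero would be a candidate "less correlated" crop, although a short series of about ten yearly points per state gives only a rough signal.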

In [116]:
# Let's take a peek at the variable charts, and also plot crops by state
alt.data_transformers.disable_max_rows() # disable max row error

#base altair with df
base = alt.Chart(df)

def state_line(column):
    # one line per state showing the given column over time
    return base.mark_line().encode(
        x=alt.X('Year:T'),
        y=alt.Y(f'{column}:Q'),
        color=alt.Color('State:N'),
        tooltip=['State', column]
    )

precip = state_line('precip')
temp = state_line('temp')
D0, D1, D2, D3, D4 = (state_line(level) for level in ['D0', 'D1', 'D2', 'D3', 'D4'])


In [117]:
# let's see overall crop production by state
crop_year_df = df.groupby(['State', 'Year'])[['Acre', 'Value_$']].sum().reset_index() # select columns with a list

crop_acre = alt.Chart(crop_year_df).mark_line().encode(
    x=alt.X('Year:T'),
    y=alt.Y('Acre:Q'),
    color=alt.Color('State:N'),
    tooltip=['State', 'Acre']
)

crop_value = alt.Chart(crop_year_df).mark_line().encode(
    x=alt.X('Year:T'),
    y=alt.Y('Value_$:Q'),
    color=alt.Color('State:N'),
    tooltip=['State', 'Value_$']
)

((precip | temp) & (D0 | D1) & (D2 | D3) & (D4 | crop_acre) & crop_value )


In [118]:
# Let's see if we can average variables across states
# all the variable group only by year and average the results
avg_df = df.groupby('Year').mean(numeric_only=True).reset_index() # skip the State and Commodity text columns
corr_df = avg_df.select_dtypes('number').corr()
corr_df

Unnamed: 0,Value_$,Acre,Tons
Value_$,1.0,-0.087386,-0.043853
Acre,-0.087386,1.0,0.861489
Tons,-0.043853,0.861489,1.0


In [119]:
def corrmatrix(df, method='pearson'):
    # corr_df_stack = df.corr(method).stack().reset_index().rename(columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'})
    corr_df_stack = df.corr(method).reset_index().melt('index').rename(columns={'index': 'variable', 'variable': 'variable2', 'value': 'correlation'})
    corr_df_stack['correlation_label'] = corr_df_stack['correlation'].map('{:.2f}'.format).fillna(0)
    sort=df.columns.tolist() 
    # create altair matrix
    base = alt.Chart(corr_df_stack).encode(
        x=alt.X('variable:O', sort=sort, title=None, axis=alt.Axis(labelFontSize=16)),
        y=alt.Y('variable2:O', sort=sort, title=None, axis=alt.Axis(labelFontSize=16))
    )

    # text
    text = base.mark_text(size=14).encode(
        text='correlation_label:N',
        color=alt.condition(alt.datum.correlation > 0.5, alt.value('white'), alt.value('black'))
    )

    # heatmap
    hm = base.mark_rect().encode(
        color='correlation:Q'
    )

    corr_chart = (hm + text).properties(
        width=800,
        height=800
    )
    return corr_chart

In [120]:
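# corrmatrix computes the correlations itself, so pass the averaged values rather than corr_df
corrmatrix(avg_df.select_dtypes('number'))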

In [121]:
# # let's look at the relationship between variables of top 5 commodity and 
# gdf_pivot = gdf.pivot_table(index='Year', columns=['State', 'Commodity'])
# gdf_pivot