# How is US hazelnut production affected by inter-annual weather variation in the Willamette Valley of Oregon?

[Previous Notebook](cap2_NB02.ipynb)
# Notebook 3
If you are viewing this on GitHub, the links below may not work; if that happens, or if you hit other rendering issues, try the [alternate render](https://nbviewer.jupyter.org/github/sbBEM/cap2/blob/master/cap2_NB03.ipynb).

Table of contents
- [Importing](#Importing)
- [Cleaning](#Cleaning)
    - [Profiling](#Profiling)
    - [Data Defining](#DataDefining)
    - [Correcting Anomalies](#CorrectingAnomalies)
    - [Filling Empty Values](#FillingEmptyValues)

[Next Notebook](cap2_NB04.ipynb)

- [Transforming](cap2_NB04.ipynb#Transforming)
- [Visualizing](cap2_NB04.ipynb#Visualizing)
- [Modeling](cap2_NB04.ipynb#Modeling)
- [Evaluating](cap2_NB04.ipynb#Evaluating)
- [Concluding](cap2_NB04.ipynb#Concluding)

The previous notebook focused on cleaning the weather data. Here we will clean the two crop data files and merge them. 

## Importing <a name="Importing"></a>

Load Python modules.

In [1]:
import matplotlib as mpl
import numpy as np
import pandas as pd
#import google.cloud.bigquery as bq
import matplotlib.pyplot as plt
#from mpl_toolkits.basemap import Basemap
#from io import BytesIO
#from zipfile import ZipFile
#import requests
#from IPython.core.display import display, HTML
#import tabula
import pandas_profiling
from pandas_profiling.utils.cache import cache_file

Activate Jupyter extensions.

In [2]:
#%load_ext google.cloud.bigquery
%load_ext autoreload
%autoreload 2

Bring in the crop data saved by the previous notebooks.

In [3]:
#weatherdf = pd.read_pickle("../data/weather.pkl")
#our_stations = pd.read_pickle("../data/station.pkl")
cropdf = pd.read_pickle("../data/crop.pkl")
cropdf2 = pd.read_pickle("../data/crop2.pkl")
#region_points = pd.read_pickle("../data/region.pkl")

## Cleaning <a name="Cleaning"></a>

### Profiling <a name="Profiling"></a>

In [4]:
cropdf.describe()

| | 49 | Year | Utilized | per ton | Production | 50 |
|---|---|---|---|---|---|---|
| count | 50.0 | 82.0 | 82.0 | 82.0 | 82.0 | 32.0 |
| mean | 49.0 | 1967.5 | 13285.853659 | 603.426829 | 11198.890244 | 50.0 |
| std | 0.0 | 23.815261 | 12191.606802 | 378.75449 | 15566.805948 | 0.0 |
| min | 49.0 | 1927.0 | 60.0 | 200.0 | 19.0 | 50.0 |
| 25% | 49.0 | 1947.25 | 5387.5 | 344.5 | 1955.75 | 50.0 |
| 50% | 49.0 | 1967.5 | 9125.0 | 514.0 | 3765.5 | 50.0 |
| 75% | 49.0 | 1987.75 | 17875.0 | 785.25 | 15096.5 | 50.0 |
| max | 49.0 | 2008.0 | 49500.0 | 2240.0 | 75480.0 | 50.0 |


In [5]:
cropdf2.describe()

| | year | Area | Yield | Utilized | Price |
|---|---|---|---|---|---|
| count | 11 | 11 | 11 | 11 | 11 |
| unique | 11 | 10 | 11 | 10 | 11 |
| top | 2016 | 30000 | (NA) | 44000 | 2680 |
| freq | 1 | 2 | 1 | 2 | 1 |


We want to build data definitions that describe each feature in the two datasets, so that we can identify any issues that will require cleaning. We'll try the pandas-profiling module for this.

In [6]:
cropdfprofile = pandas_profiling.ProfileReport(cropdf, title='cropdf Profiling Report', explorative=True)
#displays the report directly in the notebook:
#cropdfprofile.to_widgets()
#due to a notebook size problem, generate as a separate HTML report instead: 
cropdfprofile.to_file("cropdfprofile.html")

Summarize dataset: 100%|██████████| 26/26 [00:04<00:00,  5.25it/s, Completed] 
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.34s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.87it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 320.00it/s]


[View Profile Report Here](cropdfprofile.html)

In [7]:
cropdf2profile = pandas_profiling.ProfileReport(cropdf2, title='cropdf2 Profiling Report', explorative=True)
#displays the report directly in the notebook:
#cropdf2profile.to_widgets()
#due to a notebook size problem, generate as a separate HTML report instead: 
cropdf2profile.to_file("cropdf2profile.html")

Summarize dataset: 100%|██████████| 19/19 [00:00<00:00, 37.39it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  4.68it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 563.67it/s]


[View Profile Report Here](cropdf2profile.html)

### Data Defining <a name="DataDefining"></a>
Data definitions for crop fields.

In [8]:
#Let's bring some organization to our list of crop fields. 
cfields = { }
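
The `cfields` dictionary above is still empty. As a rough sketch of one way it could be populated, the helper below records the dtype, missing-value count, and unique-value count for each column, plus a free-text description. The function name `build_field_definitions`, the toy DataFrame, and the description text are all assumptions made for illustration; they are not taken from the crop files.

```python
import pandas as pd


def build_field_definitions(df, descriptions=None):
    """Return {column: {...}} describing each column of df.

    `descriptions` is an optional {column: text} mapping; columns
    without an entry get a placeholder to fill in by hand.
    """
    descriptions = descriptions or {}
    definitions = {}
    for col in df.columns:
        series = df[col]
        definitions[col] = {
            "dtype": str(series.dtype),
            "n_missing": int(series.isna().sum()),
            "n_unique": int(series.nunique(dropna=True)),
            "description": descriptions.get(col, "TODO: describe"),
        }
    return definitions


# Toy stand-in for cropdf (made-up values, not the real data).
toy = pd.DataFrame({"Year": [1927, 1928, 1929], "Utilized": [60.0, None, 200.0]})

# The description below is an assumed meaning, to be checked against the source.
cfields = build_field_definitions(toy, {"Year": "Harvest year (assumed meaning)"})

for name, info in cfields.items():
    print(name, info)
```

In the notebook itself, the same call could be made on `cropdf` and `cropdf2` once units and meanings have been confirmed against the original crop reports.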

### Correcting Outliers/Anomalies <a name="CorrectingAnomalies"></a>
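
The `cropdf.describe()` output earlier in this notebook shows two columns, labelled 49 and 50, that each hold a single constant value (standard deviation 0). Assuming these are extraction artifacts rather than real measurements, one possible cleanup is to drop every column that never varies. The sketch below uses a hypothetical helper, `drop_constant_columns`, and a toy DataFrame with made-up values; whether the real columns should be dropped still needs to be confirmed against the source files.

```python
import pandas as pd


def drop_constant_columns(df):
    """Drop columns with at most one distinct non-missing value.

    Returns the cleaned frame and the list of dropped column labels.
    """
    constant = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]
    return df.drop(columns=constant), constant


# Toy stand-in for cropdf (made-up values, not the real data).
toy = pd.DataFrame({
    49: [49.0, 49.0, None],
    "Year": [1927, 1928, 1929],
    "Production": [19.0, 25.0, 31.0],
    50: [None, 50.0, 50.0],
})

cleaned, dropped = drop_constant_columns(toy)
print("dropped:", dropped)
print(cleaned)
```

Selecting by behaviour (no variation) rather than by label avoids having to know whether the labels are stored as integers or strings.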

### Filling Empty Values <a name="FillingEmptyValues"></a>
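
The `cropdf2.describe()` output reports `count`, `unique`, `top`, and `freq`, which suggests that every column is stored as text, and the `Yield` column contains the literal string `(NA)`. A minimal sketch of a first step, under those assumptions, is to turn the known placeholder into a real missing value and then convert the columns to numbers. The helper name `to_numeric_with_missing` and the toy values are hypothetical.

```python
import pandas as pd


def to_numeric_with_missing(df, columns, missing_tokens=("(NA)",)):
    """Convert text columns to numbers, treating known tokens as missing.

    Only the listed tokens become NaN; any other non-numeric text raises
    a ValueError so unexpected values are not silently discarded.
    """
    out = df.copy()
    for col in columns:
        # Keep values that are not placeholders; the rest become NaN.
        masked = out[col].where(~out[col].isin(missing_tokens))
        out[col] = pd.to_numeric(masked, errors="raise")
    return out


# Toy stand-in for cropdf2 (made-up values, not the real data).
toy = pd.DataFrame({
    "year": ["2014", "2015", "2016"],
    "Yield": ["1.2", "(NA)", "1.5"],
})

converted = to_numeric_with_missing(toy, ["year", "Yield"])
print(converted.dtypes)
print(converted.isna().sum())
```

How the resulting gaps should then be filled (left as missing, interpolated, or taken from another source) is a separate modeling decision that this sketch does not make.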

In [2]:
#verify files are < 100MB, due to a .ipynb size issue: 
!ls -lh

total 78552
-rw-r--r--@ 1 bem  staff    16M Apr 30 14:44 cap2_NB01.ipynb
-rw-r--r--@ 1 bem  staff   5.1M Apr 30 15:07 cap2_NB02.ipynb
-rw-r--r--@ 1 bem  staff    15K Apr 30 15:08 cap2_NB03.ipynb
-rw-r--r--  1 bem  staff   640K Apr 30 14:46 cropdf2profile.html
-rw-r--r--  1 bem  staff   1.3M Apr 30 14:46 cropdfprofile.html
-rw-r--r--  1 bem  staff    15M Apr 30 14:41 weatherdfprofile.html
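
The `!ls -lh` check above relies on reading the listing by eye. As a hedged alternative, the sketch below performs the same under-100 MB check in plain Python. The function name `oversized_files` is an assumption, and the limit is taken as 100 × 1024 × 1024 bytes.

```python
from pathlib import Path

# 100 MB threshold named in the cell comment (assumed to mean MiB).
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory=".", limit=LIMIT_BYTES):
    """Return sorted (name, size_in_bytes) pairs for files at or above the limit."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_size >= limit
    )


too_big = oversized_files(".")
print(too_big if too_big else "All files are under the limit.")
```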
