## Introduction

The primary dataset is the USDA/AMS/Market News/Specialty Crops Program [refrigerated truck volume data](https://agtransport.usda.gov/Truck/Refrigerated-Truck-Volumes/rfpn-7etz), which covers various staple fruits and vegetables within California. Data collection methods include Federal marketing orders, telephone interviews, faxes, emails, and access to other data sources. Each row of the dataset shows aggregated truck volume by commodity on a daily basis. The features include datetime-related columns, origin, destination, commodity, and volume.

This notebook (1/3) explores the refrigerated truck volume data set.

First, we install any additional packages that are not already installed. We follow Jake VanderPlas's [advice](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/) on installing packages from within notebooks.

In [1]:
import sys
!{sys.executable} -m pip install plotly-express



In [2]:
# import data science packages
import numpy as np
import pandas as pd

# import visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots

# import custom packages
import datasets
import utils

## Importing Truck Data

We take the liberty of modularizing some methods to reduce redundant code. One such example is the importing and cleaning of the refrigerated truck volume data set. Some of the cleaning steps include:
- Aggregating some rows due to unnecessary columns. Some rows for truck volume may be assigned to different commodity marketing seasons indicated by the `Season` column, but otherwise are complete duplicates. For our analysis, we disregard this fact because we are more interested in the `date` of the movement.
- Keeping only important columns. There are extra columns relating to the date.
- Removing rows with no useful information. In particular, rows with a value of 0 in the `10,000 LBS` column are removed. Because the column is encoded as an integer, truck volumes below the 10,000 lbs threshold are simply recorded as 0.
- Casting the data type of the `date` column to be a datetime type.
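
Because `datasets.load_truck_data` lives in a custom module that is not shown here, the following is only a minimal sketch of how the cleaning steps above could look in pandas. The function name `clean_truck_df`, the column subset, and the toy rows are assumptions for illustration. The sketch also assumes that aggregating the season duplicates means summing their volumes.

```python
import pandas as pd

def clean_truck_df(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the cleaning done inside datasets.load_truck_data."""
    volume_col = '10,000 LBS'
    key_cols = ['date', 'Region', 'Commodity']  # assumed subset of the real columns

    df = raw_df.copy()
    # cast the `date` column to a datetime type
    df['date'] = pd.to_datetime(df['date'])
    # keep only the columns of interest, which drops `Season`
    df = df[key_cols + [volume_col]]
    # rows that differed only by `Season` now share the same keys; sum their volumes
    df = df.groupby(key_cols, as_index=False)[volume_col].sum()
    # remove rows whose volume is 0 (below the 10,000 lbs threshold)
    return df[df[volume_col] > 0].reset_index(drop=True)

# toy rows, illustrative only
raw = pd.DataFrame({
    'date': ['2015-01-01', '2015-01-01', '2015-01-02'],
    'Region': ['California', 'California', 'Texas'],
    'Commodity': ['Lettuce', 'Lettuce', 'Onions'],
    'Season': [2014, 2015, 2015],
    '10,000 LBS': [3, 2, 0],
})
cleaned = clean_truck_df(raw)
print(cleaned)
```

In this toy example the two California rows collapse into one, and the zero-volume Texas row is dropped.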

In [3]:
truck_df = datasets.load_truck_data(start=2015, end=2022, trace=True)

The file 'data\Refrigerated_Truck_Volumes_2015.csv' contains 114836 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2016.csv' contains 115813 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2017.csv' contains 115112 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2018.csv' contains 116071 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2019.csv' contains 121176 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2020.csv' contains 129428 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2021.csv' contains 140029 rows and 14 columns.
The file 'data\Refrigerated_Truck_Volumes_2022.csv' contains 137708 rows and 14 columns.
The full data frame contains 990173 rows and 14 columns.
The cleaned data frame contains 856759 rows and 9 columns.


In [4]:
# check the number of unique values for each column
for col in truck_df.columns:
    print(f"The column `{col}` has {len(truck_df[col].unique())} unique values.")

The column `date` has 2922 unique values.
The column `Month` has 12 unique values.
The column `Year` has 8 unique values.
The column `Mode` has 2 unique values.
The column `Region` has 17 unique values.
The column `Origin` has 42 unique values.
The column `District` has 135 unique values.
The column `Commodity` has 133 unique values.
The column `10,000 LBS` has 2555 unique values.
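
As a side note, pandas can produce the same counts without indexing each column by hand. The sketch below uses a toy frame in place of `truck_df` (the values are illustrative only). `DataFrame.nunique` skips missing values by default, so `dropna=False` is passed to match `len(df[col].unique())`.

```python
import pandas as pd

# toy frame standing in for truck_df (illustrative values only)
toy_df = pd.DataFrame({
    'Region': ['California', 'Texas', 'California', None],
    'Mode': ['Truck', 'Truck', 'Truck', 'Truck'],
})

# dropna=False counts a missing value as its own unique value,
# which matches the behavior of len(toy_df[col].unique())
unique_counts = toy_df.nunique(dropna=False)

for col, count in unique_counts.items():
    print(f"The column `{col}` has {count} unique values.")
```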


We aggregate the data frame on a monthly basis for easier viewing. We could group by month directly, but we instead created a custom function that aggregates on any datetime frequency. Later on, we also aggregate on a weekly basis, which is only possible with our custom function.

In [5]:
# aggregate every month
monthly_truck_df = datasets.aggregate_truck_df(truck_df, rule='1M', trace=True)
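
The implementation of `datasets.aggregate_truck_df` is not shown in this notebook. As a rough sketch of aggregating on an arbitrary datetime frequency, one option is `pd.Grouper`. The function name `aggregate_by_frequency`, its parameters, and the toy data below are hypothetical.

```python
import pandas as pd

def aggregate_by_frequency(df, rule, group_cols, value_col='10,000 LBS'):
    """Hypothetical sketch: sum `value_col` per `rule` period within each group."""
    grouper = pd.Grouper(key='date', freq=rule)
    return df.groupby([grouper] + group_cols)[value_col].sum().reset_index()

# toy data, illustrative only
toy_df = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-05', '2015-01-20', '2015-02-03', '2015-01-07']),
    'Region': ['California', 'California', 'California', 'Texas'],
    '10,000 LBS': [4, 6, 5, 2],
})

weekly = aggregate_by_frequency(toy_df, 'W', ['Region'])    # weeks ending on Sunday
monthly = aggregate_by_frequency(toy_df, 'MS', ['Region'])  # labeled by month start
print(monthly)
```

The sketch uses `'W'` and `'MS'` (month start) rather than the notebook's `'1M'` (month end), because recent pandas releases deprecate the `'M'` alias in favor of `'ME'`.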

## Choropleth Map Visualization

A choropleth map is a visualization technique that shows how a value varies across regions (e.g., countries or US states). The choice of region type defines the geometric shapes and, thus, the boundaries between the regions. The value of interest is encoded onto a color scale. We use [Plotly Express](https://plotly.com/python/plotly-express/) to simplify creating the choropleth map.

In [6]:
# check the unique values for `Region`
monthly_truck_df['Region'].unique()

array(['Canada', 'Other', 'Arizona', 'California', 'Colorado', 'Florida',
       'Great Lakes', 'Mexico-Arizona', 'Mexico-California',
       'Mexico-NewMexico', 'Mexico-Texas', 'Midatlantic', 'New York',
       'PNW', 'Southeast', 'Texas', 'Indiana'], dtype=object)

Looking at the unique values for the `Region` column, we find that not all regions are US states. Plotting the remaining regions would require merging the geometric shapes of the US states into these new regions, which is out of scope for this project. Therefore, we only look at regions that correspond to exactly one US state.

In [7]:
# Plotly Express requires the state abbreviation code to enable the US states mapping type
monthly_truck_df.loc[:,'state_code'] = monthly_truck_df['Region'].apply(utils.map_to_state_code)

# for this visualization, we are going to change `date` back to string type as it produces errors
monthly_truck_df.loc[:,'date'] = monthly_truck_df['date'].dt.date.astype(str)
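
The helper `utils.map_to_state_code` is defined in a custom module that is not shown here. A minimal sketch of what such a mapping could look like is given below, assuming it returns a two-letter code for regions that are a single US state and a missing value otherwise. The dictionary `STATE_CODES` and the function body are hypothetical.

```python
# Hypothetical mapping: only regions that are exactly one US state get a code.
STATE_CODES = {
    'Arizona': 'AZ',
    'California': 'CA',
    'Colorado': 'CO',
    'Florida': 'FL',
    'Indiana': 'IN',
    'New York': 'NY',
    'Texas': 'TX',
}

def map_to_state_code(region):
    """Return the two-letter code for a single-state region, else None."""
    return STATE_CODES.get(region)

for region in ['California', 'PNW', 'Mexico-Texas']:
    print(region, '->', map_to_state_code(region))
```

Returning `None` for multi-state or non-US regions means those rows can later be dropped with `dropna()` before plotting.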

We will actually create a small multiples visualization of choropleth maps. Doing so allows us to observe trends and patterns in the refrigerated truck volumes over time at a single glance. According to the [documentation](https://plotly.com/python/subplots/), Plotly Express does not support customizable subplots. Fortunately, a workaround inspired by this [Stack Overflow](https://stackoverflow.com/questions/56727843/how-can-i-create-subplots-with-plotly-express) post is to:
1. create the figures
2. break apart each figure into its `data` attribute
3. re-assemble into its individual slot in the subplot grid
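
Step 3 depends on turning each figure's position in the list into a row and column of the subplot grid. The notebook does this with `np.unravel_index`; the sketch below shows the same row-by-row mapping in plain Python. The helper name `grid_position` is hypothetical, and the 8 x 6 grid size is taken from the cell that follows.

```python
def grid_position(index, cols):
    """Map a 0-based flat index to a 1-based (row, col) slot, filling row by row."""
    row, col = divmod(index, cols)
    return row + 1, col + 1

# positions for an 8 x 6 grid (48 slots)
positions = [grid_position(i, cols=6) for i in range(48)]
print(positions[0], positions[5], positions[6], positions[47])
```

Plotly subplot rows and columns are 1-based, which is why both values are shifted by one.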


In [8]:
# For this visualization, we just want to group by the region/state and see the total
# refrigerated truck volume over time.
df_to_vis = monthly_truck_df.groupby(['date', 'Region', 'state_code'])['10,000 LBS'].sum().reset_index()

# grab the unique dates, but only every 2nd (month)
unique_dates = df_to_vis['date'].unique()[::2]

# 1. create the figures
figures = [
    px.choropleth(
        df_to_vis[df_to_vis['date'] == date].dropna(),
        locations='state_code',
        locationmode='USA-states',
        scope='usa',
        projection='albers usa',
        color='10,000 LBS'
    ) 
    for date in unique_dates
]

# setup the subplot grid
rows, cols = 8, 6
fig = make_subplots(
    rows=rows, 
    cols=cols,
    horizontal_spacing=0.005,
    vertical_spacing=0.001,
    specs=[[{'type': 'choropleth'} for col in np.arange(cols)] for row in np.arange(rows)],
    
    subplot_titles = [
        (   # title is year if month is january else ...
            date[:4] if date[5:7]=='01' else (
                # need to hardcode one of the titles since we cannot get titles for all the months
                utils.month_number_to_name(date[5:7]) if date[:4]=='2015' else ('January' if date == '2016-03-31' else "")
            )
        ) 
        for date in unique_dates
    ],
)


for index, figure in enumerate(figures):
    
    # from the index of the grid, get the corresponding row and column
    row, col = [x[0]+1 for x in np.unravel_index([index], (rows,cols))]
    
    # 2. break apart each figure into its `data` attribute
    small_multiple = figure['data'][0]
    
    # 3. re-assemble into its individual slot in the subplot grid
    fig.add_trace(small_multiple, row=row, col=col)

# This method performs automatic zooming to the desired location being plotted.
# https://plotly.com/python/map-configuration/#automatic-zooming-or-bounds-fitting
fig.update_geos(fitbounds='locations')
fig.update_layout(
    height=800, 
    width=1000, 
    coloraxis=dict(colorscale='cividis'), # Choose color scales from: https://plotly.com/python/builtin-colorscales/
    paper_bgcolor='rgba(0,0,0,0)', # transparent background
)

# for each title, shift positions
for i in range(len(fig.layout.annotations)):
    
    annotation = fig.layout.annotations[i]
    
    # if the title is numeric, then shift to the left
    if annotation['text'].isnumeric():
        annotation.update(textangle=-90, xshift=-75, yshift=-60)
    # else if special case of January then shift to top left
    elif annotation['text'] == 'January':
        annotation.update(xshift=-132, yshift=78)

fig.show()

In the maps, bright yellow hues indicate a high total refrigerated truck volume. Two trends stand out immediately:
1. California consistently leads the other states in total refrigerated truck volume by a wide margin.
2. California's volume declines outside the summer months, indicating a seasonal trend.

Now that we have taken a good look at the refrigerated truck volume dataset, let's continue our downstream analysis by observing the bivariate relationships between refrigerated truck volumes, COVID-19 cases, and fruit prices.

## Record Dependencies

We list our dependencies at the end of this notebook.

In [1]:
import sys
!{sys.executable} -m pip install watermark



In [3]:
%load_ext watermark
%watermark -v -m -p numpy,pandas,matplotlib,plotly.express

Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.4.0

numpy         : 1.21.5
pandas        : 1.4.3
matplotlib    : 3.5.1
plotly.express: 0.4.1

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
CPU cores   : 12
Architecture: 64bit

