# Visualizing Avocado Economics

This notebook looks at the avocado price data in terms of demand and supply!

# Data

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
import statsmodels.formula.api as smf

In [None]:
df = pd.read_csv('../input/avocado-prices/avocado.csv',index_col=0)

# Date conversion
df['Date'] = pd.to_datetime(df['Date'])
df.head()

In [None]:
# Proportion of Conventional vs Organic
((df[df.region=='TotalUS']
 .pivot_table(index='type', values='Total Volume',
               aggfunc='sum', margins=True)
)/sum(df.loc[df.region=='TotalUS','Total Volume'])
).round(3)

- We focus on **conventional type** (97% of the `Total Volume` in the data).
- Also, a quick inspection shows that `Total Volume` = `4046` + `4225` + `4770` + `Total Bags`, where `Total Bags` = `Small Bags` + `Large Bags` + `XLarge Bags` (i.e., size classifications, for which we don't have corresponding prices). We will only examine `Total Volume`.
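
The column identity above can be checked programmatically. Below is a minimal sketch on a made-up two-row frame (the numbers and the helper name `volume_identity_holds` are illustrative assumptions, not taken from the dataset); on the real data the same helper would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column layout as the avocado data
toy = pd.DataFrame({
    '4046': [100.0, 250.0],
    '4225': [200.0, 300.0],
    '4770': [10.0, 5.0],
    'Small Bags': [40.0, 60.0],
    'Large Bags': [8.0, 12.0],
    'XLarge Bags': [2.0, 0.0],
})
toy['Total Bags'] = toy[['Small Bags', 'Large Bags', 'XLarge Bags']].sum(axis=1)
toy['Total Volume'] = toy[['4046', '4225', '4770', 'Total Bags']].sum(axis=1)

def volume_identity_holds(frame, rtol=1e-3):
    """Row-wise check that Total Volume = 4046 + 4225 + 4770 + Total Bags."""
    parts = frame[['4046', '4225', '4770', 'Total Bags']].sum(axis=1)
    return np.isclose(parts, frame['Total Volume'], rtol=rtol)

identity_ok = volume_identity_holds(toy)
print(identity_ok.all())
```

A relative tolerance is used because the published volumes are rounded, so an exact equality test would likely be too strict on the real file.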

In [None]:
# Focus on conventional type
df = df[df.type == 'conventional']

# Focus on Date, Region, Price, Quantity 
vars_keep = ['Date', 'Region', 'Price', 'Quantity']

df1 = df.rename(columns = {'AveragePrice':'Price',
                          'Total Volume':'Quantity', 
                          'region':'Region'})[vars_keep]

# Add Year, Month, and Year-Month
df1['Yr'] = df1['Date'].dt.year
df1['Mo'] = df1['Date'].dt.month
df1['YrMo'] = df1.Date.dt.strftime('%Y-%m')
df1

In [None]:
# Describe Date
df1.Date.describe()

Notes:
- Date: **2015-01-04** to **2018-03-25**. 

In [None]:
# Count Region
df1.Region.value_counts()

Notes:
- Likely aggregate areas (regions or states rather than cities): 
    *'TotalUS','West','California','Midsouth','Northeast','SouthCarolina',
    'SouthCentral','Southeast','GreatLakes','NorthernNewEngland','Plains'*

In [None]:
# Describe Price and Quantity
df1.describe()

Notes:
- Price: mean = $1.16, std = 0.26
- Quantity: mean = 1.65 million, std = 4.75 million

# National Demand

In [None]:
# Define National dataset
US = df1[df1.Region == 'TotalUS']
US = US.assign(Q_m = (US.Quantity/pow(10,6)).round(2))

sns.set_palette("RdBu_r")

# Colored by Year
plt.figure(figsize=(8,6))  
# x and y are passed by keyword (recent seaborn versions require this)
fig_US1 = sns.scatterplot(x='Q_m', y='Price', data=US, hue='Yr',
                          palette="RdBu_r")
fig_US1.set(xlabel='Quantity, Million');

In [None]:
# Plot by Year
# lmplot creates its own figure, so no plt.figure() call is needed
fig_US2 = sns.lmplot(x='Q_m', y='Price', data=US, col='Yr',
                     height=6, aspect=.5)
fig_US2.set(xlabel='Quantity, Million');

Notes:
- These plots show a downward-sloping line within each year, suggesting that **the price-quantity point moves along the demand curve.**
- **The supply can be taken as given in the short run** (i.e., the supply of avocados is likely determined by factors affecting agriculture and trade. It may be influenced by external shocks but cannot be adjusted in the short run in response to increases or decreases in market demand).
- **2015** was a **stable, baseline** year.
- **2016** and **2017** showed marked **price hikes during times of supply shortage**.
- **2018** seems to have returned to normal, but with **increased quantity compared to 2015**.
- The price hikes in 2016 and 2017 likely indicate **a change in demand (upward shift)**, which created a fairly **large group of consumers who would pay high premiums if they had to**. Those consumers did not exist in 2015 (i.e., the relative shortage of supply in 2015 did not lead to price hikes). In the 2018 data, we have no supply shortage, so it is difficult to tell whether this new group of avocado lovers still exists and would bid up prices in the case of a shortage.

In [None]:
# Color by Month
plt.figure(figsize=(8,6))  
sns.scatterplot(x='Q_m', y='Price', data=US, hue='Mo',
                palette=sns.color_palette("Paired"));

Notes:
- **Supply shortage** tends to occur in **September, October, and November**.
- **Supply surplus** tends to occur in **May and February**.
- These seasonal trends appear to be linked to the supply side (farming and international trade). 

In [None]:
# Aggregate to Year-Month level
# note: technically a quantity-weighted average should be used for the price. 
US_ym = US.groupby('YrMo').agg({'Price':'mean', 'Q_m':'sum'})
bins_p = np.array([0, 1.2, 1.6, 2])
US_ym = US_ym.assign(labels_p = pd.cut(US_ym.Price, bins_p))
US_ym[:5]
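
The aggregation comment above notes that a quantity-weighted average would technically be the right monthly price. Below is a minimal sketch of that alternative on made-up weekly numbers (the frame `weekly` and the column names `PQ`, `Price_w`, and `Price_simple` are illustrative assumptions, not part of the notebook's data).

```python
import pandas as pd

# Hypothetical weekly observations (illustrative numbers only)
weekly = pd.DataFrame({
    'YrMo':  ['2015-01', '2015-01', '2015-02', '2015-02'],
    'Price': [1.00, 1.20, 0.90, 1.10],
    'Q_m':   [30.0, 10.0, 20.0, 20.0],
})

# Numerator of the weighted mean: sum of price x quantity within each month
weekly['PQ'] = weekly['Price'] * weekly['Q_m']
monthly = weekly.groupby('YrMo').agg(PQ=('PQ', 'sum'), Q_m=('Q_m', 'sum'))
monthly['Price_w'] = monthly['PQ'] / monthly['Q_m']

# Simple (unweighted) mean for comparison
monthly['Price_simple'] = weekly.groupby('YrMo')['Price'].mean()
print(monthly[['Price_w', 'Price_simple']])
```

The two averages differ only when weekly quantities are uneven within a month, so the simple mean is a reasonable shortcut when volumes are fairly stable from week to week.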

In [None]:
import plotly.express as px

US_ani = px.scatter(US_ym.assign(size = 20).reset_index(), 
                    x="Q_m", y="Price",  animation_frame="YrMo", 
                    color = 'labels_p',  
                    size = 'size',
                    #mode='markers', marker=dict(size=20),
                    color_discrete_sequence=["blue",  "magenta", "red"],
                    range_x=[80,280], range_y=[.7,1.7])

US_ani.update_layout(
    title="Animation: Avocado Price and Quantity, US",
    xaxis_title="Quantity, million/month",
    showlegend=False)
US_ani.show()

Notes: with a somewhat arbitrary definition of a price hike (Price > $1.2), we can spot three episodes:
- Price hike 1: **2016-10** to **2016-11**
- Price hike 2: **2017-03** to **2017-05**
- Price hike 3: **2017-07** to **2017-10**
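
Episodes like these can also be extracted from a monthly price series rather than read off the animation. Below is a minimal sketch on made-up prices (the series `prices` and its values are illustrative assumptions; only the $1.2 threshold comes from the notes above).

```python
import pandas as pd

# Hypothetical monthly prices (illustrative, not the notebook's values)
prices = pd.Series(
    [1.05, 1.25, 1.30, 1.10, 1.15, 1.22, 1.00],
    index=['2016-09', '2016-10', '2016-11', '2016-12',
           '2017-01', '2017-02', '2017-03'],
    name='Price')

threshold = 1.2
is_hike = prices > threshold
# A new run id starts whenever the hike flag flips
episode_id = (is_hike != is_hike.shift()).cumsum()

# First and last month of each consecutive run above the threshold
episodes = [(grp.index[0], grp.index[-1])
            for _, grp in prices[is_hike].groupby(episode_id[is_hike])]
print(episodes)
```

On the notebook's data, the same steps would presumably be applied to `US_ym.Price`, whose index is already the `YrMo` string.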


Next, let's take a look at price and quantity along the time axis. 

In [None]:
# Define a relative quantity variable with 2015-01 as a baseline 
Q_base = US_ym.Q_m["2015-01"]
US_ym['Q_change'] = US_ym.Q_m/Q_base
US_ym[:5]

In [None]:
# Stack Price and Q_change variables
US_ym2 = US_ym.reset_index().melt(id_vars=['YrMo'], 
                                  value_vars=['Price', 'Q_change'])
US_ym2

In [None]:
sns.relplot(x="YrMo", y="value", hue="variable", kind="line",
            palette=["#e74c3c", "#3498db"],
            data=US_ym2, aspect=3);
plt.xticks(rotation=45);

Notes:
- **The overall supply is not smooth over time**; there are a number of monthly fluctuations.   
- Nationally, there are two months (**2016-10** and **2017-08**) that experienced a supply shortage compared to the baseline. Around the time of those shortages, prices rose.

# City-level Demands

In [None]:
# Exclude non-city areas
large_areas = ['TotalUS','West','California','Midsouth',
               'Northeast','SouthCarolina',
               'SouthCentral','Southeast','GreatLakes',
               'NorthernNewEngland','Plains']

cities_ym = (df1
             .assign(Q_m = (df1.Quantity/pow(10,6)).round(3))
             .query('not(Region in @large_areas)')
             .groupby(['YrMo','Region'])
             .agg({'Price':'mean', 'Q_m':'sum'})
            )

cities_ym

We will focus on the **top 10 cities** in terms of quantity.

In [None]:
# Top 10 cities in 2018-01 
top10 = (cities_ym.loc['2018-01']
         .sort_values('Q_m', ascending=False).iloc[:10])         
top10

In [None]:
top10_city_names = top10.index
top10_city_names

In [None]:
# Keep the data of the top 10 cities
cities_ym = (cities_ym
             .reset_index()
             .set_index("Region")
             .loc[top10_city_names]
            )

In [None]:
cities_ym

Let's start with plotting price and quantity by year for each city.  

In [None]:
# Put the year column back 
cities_ym['Yr'] = (cities_ym.YrMo
                   .str.slice(start=0, stop=4).astype(int))

In [None]:
# Plot by Year and Region
fig_Cities = sns.lmplot(x='Q_m', y='Price',
                        data=cities_ym.reset_index(),
                        col='Yr', row='Region',
                        height=2, aspect=2.5)
fig_Cities.set(xlabel='');
# The x-axis is "Quantity, Million"; the label is suppressed here for readability

Notes:
- Los Angeles consumes avocados a lot more than other cities. 
- Some cities show downward-sloping relationships in 2016 and 2017, suggesting price hikes with supply shortages. 

We will animate the data to visualize this.  

In [None]:
# Define 2015-01 price and quantity as the baseline 
cities_ym_base = (cities_ym[cities_ym.YrMo=="2015-01"]
                  .rename(columns={'Price':"BasePrice",
                                   "Q_m":"BaseQ_m"})
                  .drop(columns=['YrMo', 'Yr'])
                 )
cities_ym_base 

In [None]:
# Join the base price
cities_ym = cities_ym.join(cities_ym_base)           

# Define change in price and quantity as the ratio to the baseline
cities_ym['PriceChange'] = cities_ym.Price/cities_ym.BasePrice
cities_ym['QChange'] = cities_ym['Q_m']/ cities_ym.BaseQ_m
cities_ym

In [None]:
# Check stats on cities_ym 
(cities_ym[['Price','PriceChange', 'Q_m', 'QChange']]
 .groupby('Region')
 .agg(['mean','std','min','max'])
 .round(2).sort_values(('Price','mean'), ascending=False)
)

In [None]:
# Animate the city data! 
cities_ym = cities_ym.reset_index()
fig = px.scatter(cities_ym, x="QChange", y="Price",  
                 animation_frame="YrMo", hover_name="Region",
                 size = "Q_m", color="Region",
                range_x=[0.5,2.25], range_y=[.5,2.2]
                )

fig.show()

Timeline:
- **2015-06** through **2015-08**: Early sign of shortage in San Francisco with price rising to \\$1.57.
- **2016-07**: Sign of price hike. 
- **2016-08**: Shortage in Chicago pushing price to \\$1.65. 
- **2016-10**: Some shortage in cities like Chicago, San Francisco, and New York and prices rose over \\$1.86 in those cities. The price in Baltimore/Washington also rose to \\$1.73. 
- **2016-11**: Price peaked for 2016 at \\$1.91 in San Francisco, and the price remained high in some cities, e.g., \\$1.75 in New York, \\$1.66 in Baltimore/Washington, and \\$1.53 in Chicago. The supply was tight in San Francisco (merely 61% of the baseline in 2015-01), Chicago (67%), and many other major avocado-loving cities (about 80% in New York, Baltimore/Washington, Denver, Los Angeles, Houston, Dallas Fort Worth, Phoenix Tucson). 
- **2016-12**: Price came down with additional supply. 
- **2017-03**: Another sign of shortage: San Francisco (\\$1.74 and supply 77%) and Chicago (\\$1.64 and supply 86%).
- **2017-04** to **2017-06**: Supply shortage did not occur but prices remained high e.g., above \\$1.6 in San Francisco, Chicago, New York, and Baltimore/Washington.
- **2017-08**: Beginning of shortage. Phoenix Tucson, Chicago, and San Francisco getting only around 80% of their baseline supply. Price rose to \\$1.86 in San Francisco and \\$1.77 in Chicago. 
- **2017-09**: Supply shortage worsened and price peaked. Phoenix Tucson got 64% of its baseline supply.  Chicago, LA, San Francisco, and Denver got about 75-80% of their baseline supply. The peak price this time was \\$2.06 in Chicago, \\$1.77 in San Francisco, and \\$1.70 in Los Angeles.  
- **2017-10** to **2017-11**: The shortage continued and price remained high in Chicago.
- **2017-12**: Price came down in all cities with additional supply. 


Notes:
- Consumers in **Chicago, San Francisco, New York, and Baltimore/Washington** are **willing to pay price premiums** in times of supply shortage. 
- Consumers in **Phoenix Tucson** are **not willing to pay** high price premiums even in times of severe supply shortage. 
- **Chicago** and **San Francisco** are **most likely to experience supply shortage**. 

Let's see if we can show these movements in price and quantity along the time axis. 

In [None]:
# Stack Price and QChange variables
cities_ym2  = cities_ym.melt(id_vars=['YrMo', 'Region'],
                             value_vars=['Price', 'QChange'])
cities_ym2

In [None]:
# Plot price and quantity along the time axis in two subplots stacked vertically 
sns.relplot(x="YrMo", y="value", hue="Region", kind="line",
            palette=sns.color_palette("Paired", 10),
            data=cities_ym2, aspect=3, row="variable");
plt.xticks(rotation=45);

Notes:
- Across cities, price and quantity movements are generally similar.
- Some cities maintain higher prices than others throughout the year. 
- The cities with higher prices generally experience larger price hikes. 
- Price hikes in Los Angeles and Denver grew larger in 2017 compared to 2016. 

## Regression Analysis

Consider a regression of the form:

$p_{it} = \alpha_i + \beta_i\: Q_{it} + \gamma_t + \varepsilon_{it}$

where 
- $p_{it}$: price of city $i$ in year-month $t$
- $\alpha_i$: city-specific intercept
- $\beta_i$: city-specific slope for quantity $Q_{it}$
- $\gamma_t$: year-month effect common to all cities
- $\varepsilon_{it}$: error term
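
To make the city-specific slopes concrete, here is a minimal sketch on synthetic data with two hypothetical cities, `A` and `B`. It builds the design matrix by hand and solves it with `numpy.linalg.lstsq`; the time effects $\gamma_t$ are left out to keep it short, and all names and numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic panel: 2 hypothetical cities x 24 months with known coefficients
true_slope = {'A': -0.5, 'B': -0.2}
true_icpt = {'A': 2.0, 'B': 1.5}
rows = []
for city in ['A', 'B']:
    for t in range(24):
        q = rng.uniform(1, 3)
        p = true_icpt[city] + true_slope[city] * q + rng.normal(0, 0.01)
        rows.append({'Region': city, 't': t, 'Q_m': q, 'Price': p})
panel = pd.DataFrame(rows)

# Columns mirror Price ~ 0 + Region + Q_m + Q_m:Region
is_B = (panel['Region'] == 'B').astype(float).to_numpy()
q_m = panel['Q_m'].to_numpy()
X = np.column_stack([
    1 - is_B,      # intercept for city A
    is_B,          # intercept for city B
    q_m,           # baseline slope (city A)
    q_m * is_B,    # slope difference, B minus A
])
coef, *_ = np.linalg.lstsq(X, panel['Price'].to_numpy(), rcond=None)

slope_A = coef[2]
slope_B = coef[2] + coef[3]   # baseline slope plus interaction term
print(round(slope_A, 2), round(slope_B, 2))
```

The point of the sketch is the last step: with an interaction specification, a non-baseline city's slope is the baseline slope plus that city's interaction coefficient.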


In [None]:
# OLS estimation
rlt1 = smf.ols('Price ~ 0 + Q_m*Region + YrMo',
               data=cities_ym.reset_index())
rlt1.fit().summary()

**Durbin-Watson is far from 2, suggesting autocorrelation.**

Let's regress the residual on its lagged value. 

i.e., estimate a model: $\varepsilon_{it} = \rho\: \varepsilon_{it-1} + \nu_{it}$ 

where $\rho$ measures the extent of autocorrelation for AR(1).
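
The Durbin-Watson statistic and $\rho$ are closely linked: for a long series, $DW \approx 2(1 - \hat\rho)$, which is why a value far below 2 points to positive autocorrelation. Below is a minimal sketch on simulated AR(1) residuals (the seed, sample size, and true $\rho$ of 0.7 are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(42)
rho_true, n = 0.7, 5000

# Simulate AR(1) residuals: e_t = rho * e_{t-1} + nu_t
e = np.zeros(n)
nu = rng.normal(size=n)
for t in range(1, n):
    e[t] = rho_true * e[t - 1] + nu[t]

# Durbin-Watson statistic: squared first differences over squared levels
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# No-intercept regression of e_t on e_{t-1}
rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)

print(f"DW = {dw:.3f}, 2*(1 - rho_hat) = {2 * (1 - rho_hat):.3f}")
```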

In [None]:
# Create residual, take a 1-month lagged variable
cities_ym['res1']  = rlt1.fit().resid
cities_ym['res1_L1'] = cities_ym.groupby('Region').res1.shift(1)
cities_ym 

In [None]:
rlt2 = smf.ols('res1 ~ res1_L1',
               data=cities_ym)
rlt2.fit().summary()

This confirms that the residuals are serially correlated.

There is also a simple visual way to check for serial correlation:

In [None]:
# got this from: http://web.vu.lt/mif/a.buteikis/wp-content/uploads/PE_Book/4-8-Multiple-autocorrelation.html 
from statsmodels.graphics.tsaplots import plot_acf
#
res1   = rlt1.fit().resid
fig = plt.figure(num = 1, figsize = (10, 8))
_ = plot_acf(res1, lags = 10, zero = False, 
             ax = fig.add_subplot(111))
plt.show()

This suggests that the residuals may be autocorrelated up to lag three or so.

For simplicity, we assume first-order autocorrelation and estimate an AR(1) model here.
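
For intuition about what an AR(1) correction does, here is a minimal sketch of a Cochrane-Orcutt style iteration on synthetic data. It is a hand-rolled illustration under assumed parameters (true slope 2, true $\rho$ of 0.6), not the routine that statsmodels runs internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho_true = 2000, 0.6

# Synthetic data: y = 1 + 2*x + e, where e follows an AR(1) process
x = rng.normal(size=n)
nu = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho_true * e[t - 1] + nu[t]
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(n), x])

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta = ols(X, y)   # starting point: plain OLS
rho = 0.0
for _ in range(20):
    resid = y - X @ beta
    # Estimate rho from the lag-1 regression of the residuals
    rho_new = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
    # Quasi-difference the data (this drops the first observation)
    y_star = y[1:] - rho_new * y[:-1]
    X_star = X[1:] - rho_new * X[:-1]
    beta = ols(X_star, y_star)
    converged = abs(rho_new - rho) < 1e-6
    rho = rho_new
    if converged:
        break

print(f"rho = {rho:.3f}, intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
```

Note that this sketch treats the data as one continuous time series. In a stacked city panel, the lag at each city boundary pairs the last month of one city with the first month of the next, which is a simplification worth keeping in mind.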

In [None]:
import statsmodels.api as sm
# rho=1 sets the AR order to 1; the starting value of rho is 0,
# so iterate to estimate rho from the residuals
rlt3 = sm.GLSAR(rlt1.endog, rlt1.exog, rho=1)
rlt3_fit = rlt3.iterative_fit(maxiter=100)
rlt3_fit.summary()

In [None]:
# Correct the coefficient names in the table
coeff_names = rlt1.fit().summary2().tables[1].index
rlt3_coeff = rlt3_fit.summary2().tables[1].set_index(coeff_names)
rlt3_coeff 

Note:
- Compared to the OLS regression estimate, the AR(1) model seems to show smaller magnitudes of slope coefficients for the quantity variable. For example, the baseline slope (which is for Baltimore/Washington) is  -0.2433 in the OLS estimate and -0.1696 in the AR(1) model. 


In [None]:
time_trends = rlt3_coeff['Coef.'][rlt3_coeff.index.str.startswith("YrMo")] 
intercepts = rlt3_coeff['Coef.'][rlt3_coeff.index.str.startswith("Region")] 
slopes = rlt3_coeff['Coef.'][rlt3_coeff.index.str.startswith("Q_m")].copy()
# City-specific slope = baseline slope + interaction coefficient
slopes.iloc[1:] = slopes.iloc[0] + slopes.iloc[1:]

In [None]:
# Extract the time dummies and add a base intercept 
time_trends = pd.DataFrame(time_trends).reset_index()
time_trends['YrMo'] =  (pd
                        .to_datetime(
                            time_trends['index']
                            .str.replace("YrMo[T.", "", regex=False)
                            .str.replace("]", "", regex=False)
                        ).dt.strftime('%Y-%m'))
time_trends['Base_LosAngeles'] = intercepts["Region[LosAngeles]"] + time_trends['Coef.']
time_trends[:5]

In [None]:
figA = sns.relplot(x = "YrMo", y ="Base_LosAngeles", 
                   data = time_trends, kind='line', aspect=2.5);
plt.xticks(rotation=45);
plt.title('Fig A: Estimated Time Trends without Supply Change')
plt.ylabel('Base Price at Los Angeles, $');

In [None]:
# Extract city specific intercept and slope coefficients
Region = (intercepts
          .index.str.replace('Region[', '', regex=False)
          .str.replace(']', '', regex=False))
rlt3_cities = (pd.DataFrame({'intercept': intercepts.values,
                             'slope': slopes.values})
               .set_index(Region)
              )
rlt3_cities.index.name = "Region"
rlt3_cities

In [None]:
# Add cities' Quantity range
rlt3_cities = (rlt3_cities
               .join(
                   cities_ym.groupby('Region').Q_m.agg(['min','max']))
              )
# Time effect for 2018-03, extracted as a scalar
rlt3_cities['at_2018_03'] = (
    time_trends.loc[time_trends.YrMo == "2018-03", "Coef."].iloc[0]
)
rlt3_cities['low'] = (rlt3_cities.intercept + 
                      rlt3_cities.at_2018_03 + 
                      rlt3_cities.slope * rlt3_cities['min'])
rlt3_cities['high'] = (rlt3_cities.intercept + 
                       rlt3_cities.at_2018_03 + 
                       rlt3_cities.slope * rlt3_cities['max'])
rlt3_cities

In [None]:
# Stack min-low and max-high variables under variable names of Q_m and Price
rlt3_cities_PQ = pd.concat(
    [rlt3_cities[['min','low']].rename(
        columns={'min':"Q_m", "low":"Price"}),
     rlt3_cities[['max','high']].rename(
         columns={'max':"Q_m", "high":"Price"})
    ], axis=0)
rlt3_cities_PQ 

In [None]:
filled_markers = ('o', 'v', '^', '<', '>', 's', '*', 'D', 'P', 'X')
sns.relplot(x = "Q_m", y="Price", hue = "Region", kind="line", 
            style= "Region",
            markers= filled_markers,
            dashes=False, aspect = 2, 
            data = rlt3_cities_PQ.reset_index());
plt.title('Fig B: Estimated Demand Curve by City without Time Trends')
plt.ylabel('Base Price at 2018-03, $');
plt.xlabel('Quantity, million');

Notes:
- **Los Angeles** has **the largest demand** in terms of quantity. 
- **New York** has **the second largest demand** and **high prices with steep demand curves**.
- **San Francisco**, **Chicago**, and **Baltimore/Washington** have **high prices with steep demand curves**.
- **Denver** also has relatively **high prices with a steep demand curve**.
- **West Texas/New Mexico**, **Phoenix/Tucson**, **Houston**, and **Dallas Fort Worth** generally maintain **low prices**. 

In [None]:
# Recall the time trend figure
figA.fig

**How the figures relate to the regression:**

The two figures above visualize the regression coefficients from $p_{it} = \alpha_i + \beta_i \: Q_{it} + \gamma_t + \varepsilon_{it}$ into:
- Fig A (showing $\gamma_t$): **Common Time Trends across Cities** (using Los Angeles' intercept as a baseline)
- Fig B (showing $\alpha_i + \beta_i\: Q_{it}$): **City-specific Demand Curves** (using the intercept at time 2018-03)

The reason we need a baseline city or a baseline time period is that $\alpha_i$ and $\gamma_t$ are estimated relative to each other. To interpret the estimates of the $\alpha_i$'s, we must specify which $\gamma_t$ serves as the baseline. Likewise, to interpret the estimates of the $\gamma_t$'s, we must specify which $\alpha_i$ serves as the baseline.
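
A small numerical illustration of this point, using made-up values for three hypothetical cities and four periods: moving a constant from the time effects to the city intercepts leaves every fitted value unchanged, so only the sums $\alpha_i + \gamma_t$ are pinned down by the data.

```python
import numpy as np

# Hypothetical city intercepts (alpha) and time effects (gamma)
alpha = np.array([1.0, 1.4, 0.8])         # 3 cities
gamma = np.array([0.0, 0.1, -0.2, 0.3])   # 4 periods

# Intercept part of the fit: alpha_i + gamma_t for every (city, period) pair
fitted = alpha[:, None] + gamma[None, :]

# Shift a constant c from the time effects to the city intercepts
c = 0.25
fitted_shifted = (alpha + c)[:, None] + (gamma - c)[None, :]

print(np.allclose(fitted, fitted_shifted))
```

In the regression above, the formula interface resolves this by omitting one `YrMo` dummy, which amounts to fixing that period's effect at zero.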

### Thank you for reading!