# Market Price Model
## 1. Overview
The purpose of the market price model is to capture the spatial and temporal variability in the prices of key commodities at a local scale with high temporal resolution, in order to assess the consumption and expenditure impacts of price changes. Currently, the model generates monthly price predictions at the state level, and only for South Sudan.

## 2. Input Data
The market price model currently uses the following explanatory variables:
* The price of the commodity in the previous month -- source: [CLiMIS South Sudan](http://climis-southsudan.org/markets)
* The price of petrol (to approximate transport costs) -- source: [CLiMIS South Sudan](http://climis-southsudan.org/markets)
* Total rainfall in the previous month (to approximate road conditions; this will eventually be replaced with the output from the accessibility model) -- source: [Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)](http://chg.geog.ucsb.edu/data/chirps/)
* Net weight of crop production in the state in the previous year -- source: [CLiMIS South Sudan](http://climis-southsudan.org/crop)
* Number of fatalities from violent conflict in the previous month -- source: [ACLED](https://www.acleddata.com/curated-data-files/)

#### Data Cleaning 
All of the downloading and preprocessing of data occurs in the `data_cleaning.py` script and can be performed en masse by calling the `WriteCleanedPriceModelData` task. There are, however, several preprocessing steps worth detailing:

* Prices for food commodities and petrol are standardized to 2011 USD/kg using the units of measure provided by CLiMIS South Sudan and historical exchange rates from the [FEWS NET API](https://fdw.fews.net/api/exchangeratevalue/). 
* The food commodities currently modeled are: beans; groundnuts; maize; meat; milk; okra; rice; sorghum; sugar; vegetable oil; and wheat.
* Rainfall is downloaded as a raster with 3 arcminute resolution and aggregated to the state level by taking the mean value occurring within the state boundaries.
* Unlike the other variables, crop production is a yearly quantity of net cereal production. This value is applied to each month in that year for the corresponding state.
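
The last two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the code in `data_cleaning.py`: the rasters, the state IDs, and the column names (`state`, `month`, `year`, `crop_prod`) are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 rainfall raster and a matching raster of state IDs
# (0 = outside the country). Both are made-up illustration data.
rainfall = np.array([
    [10.0, 12.0, 30.0, 32.0],
    [11.0, 13.0, 31.0, 33.0],
    [0.0, 0.0, 50.0, 52.0],
    [0.0, 0.0, 51.0, 53.0],
])
state_id = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
])

# Zonal mean: average the rainfall cells that fall inside each state.
state_rain = {
    int(s): float(rainfall[state_id == s].mean())
    for s in np.unique(state_id)
    if s != 0
}

# Monthly panel (hypothetical states "A" and "B") and yearly crop production.
monthly = pd.DataFrame({
    "state": ["A"] * 3 + ["B"] * 3,
    "month": pd.to_datetime(["2016-11-01", "2016-12-01", "2017-01-01"] * 2),
})
yearly_crop = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "year": [2016, 2017, 2016, 2017],
    "crop_prod": [100.0, 120.0, 80.0, 90.0],
})

# Apply each year's value to every month of that year for the same state.
monthly["year"] = monthly["month"].dt.year
panel = monthly.merge(yearly_crop, on=["state", "year"], how="left")
print(state_rain)
print(panel)
```

A left merge keeps every state-month row even when a year has no production figure, so missing data show up as `NaN` rather than silently dropping months.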

#### Spatial and Temporal Bounds and Resolution
These quantities are computed for each state in South Sudan for every month from January 2013 through December 2017. It may be possible to run the model at the county level, as that is the level at which CLiMIS reports market prices, but the time series of price data are much more sparse and model performance suffers. This model is not currently generalizable beyond South Sudan, as it relies on price data that is specific to South Sudan. Once we have a data store with price data for multiple countries, the pipeline code will need to be revised to be able to flexibly query and retrieve data for only the desired geography.

## 3. Model
### Fixed Effects Regression
To capture both spatial and temporal variability within one model, we treat the data as panel data -- observations of multiple entities over multiple time periods -- and employ a Fixed Effects Estimation, sometimes called Panel Ordinary Least Squares (PanelOLS). Panel models are specified to estimate parameters of models of the general form:
$$y_{it}=x_{it}\beta+\alpha_i+\epsilon_{it}$$

where $i$ indexes the entity, $t$ indexes the time period, $\beta$ is the vector of parameters of interest, $\alpha_i$ contains the entity-specific components not generally captured in standard OLS, and $\epsilon_{it}$ are idiosyncratic errors uncorrelated with the covariates $x_{it}$ and $\alpha_i$. The Fixed Effects Estimator eliminates the unobserved but entity-specific components by imposing the restriction:
$$\sum_i{\alpha_i}=0$$

Conceptually, this is equivalent to adding a dummy variable for each entity.
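
The dummy-variable equivalence can be checked numerically. The sketch below uses simulated data and plain NumPy rather than `linearmodels`. It estimates the slope once by subtracting entity means (the "within" transformation) and once by ordinary least squares with one dummy per entity. The panel dimensions, true slope, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 4, 30
entity = np.repeat(np.arange(n_entities), n_periods)

# Simulated panel: one covariate, true slope 0.5, entity-specific intercepts.
alpha = np.array([1.0, -2.0, 0.5, 3.0])
x = rng.normal(size=entity.size) + alpha[entity]  # x is correlated with alpha
y = 0.5 * x + alpha[entity] + 0.1 * rng.normal(size=entity.size)


def demean(v):
    """Subtract each entity's own mean from its observations."""
    means = np.bincount(entity, weights=v) / np.bincount(entity)
    return v - means[entity]


# (a) Within estimator: OLS on entity-demeaned x and y.
x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# (b) Dummy-variable estimator: OLS on x plus one dummy column per entity.
dummies = (entity[:, None] == np.arange(n_entities)).astype(float)
design = np.column_stack([x, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_lsdv = coef[0]

print(beta_within, beta_lsdv)
```

The two slope estimates agree up to floating-point error, which is why the entity-specific terms never need to be estimated explicitly when only $\beta$ is of interest.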

An advantage of the Fixed Effects Estimator is that it allows us to fit a single model for an arbitrary number of geographic locations over an arbitrarily long time period. Additionally, it allows for easy interpretation of the effects of the covariates of interest and the uncertainty surrounding the estimated parameters. The model does, however, imply that the parameterization of the predictors of market prices, $\beta$, is time-invariant (i.e., that what determines the price of a commodity today will determine the price of that same commodity tomorrow). Another drawback is that a separate model must be trained for each commodity.

### Running in kiluigi
The price model pipeline uses an implementation of Fixed Effects regression from the `linearmodels` package. The fitting of the model is wrapped up inside the `TrainPriceModel` task, which takes as its only parameter a list of the start date and end date for testing; the first of these indicates the time up to which the model should be trained. (While somewhat clunky, this allows us to define a single `DateIntervalParameter` for the scenario.)
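
As a rough illustration of how a single date pair can define both the training and testing windows, the sketch below splits a made-up monthly panel on the first date. The boundary handling (training through the first date, testing from the following month through the second date) is an assumption inferred from the forecast output later in this notebook, not a description of the task's internals.

```python
from datetime import datetime

import pandas as pd

# Same form as the `time` parameter used in this notebook.
time = [datetime(2017, 6, 1), datetime(2017, 12, 1)]

# Made-up monthly panel for a single state.
months = pd.date_range("2017-01-01", "2017-12-01", freq="MS")
panel = pd.DataFrame({"price": range(len(months))}, index=months)

train_end, test_end = time

# Assumed convention: train up to and including the first date,
# test on the months after it, up to and including the second date.
train = panel[panel.index <= train_end]
test = panel[(panel.index > train_end) & (panel.index <= test_end)]

print(train.index.min(), train.index.max())
print(test.index.min(), test.index.max())
```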

In [None]:
import luigi
from models.market_price_model.tasks import TrainPriceModel
from datetime import datetime

train_task = TrainPriceModel(time=[datetime(2017,6,1), datetime(2017,12,1)])
luigi.build([train_task])

In [3]:
with train_task.output().open("r") as src:
    models = src.read()
sorghum_model = models["food_Sorghum"]
sorghum_model.fit()

DEBUG: Getting parent pipe


| | | | |
|---|---|---|---|
| Dep. Variable: | p_food_Sorghum | R-squared: | 0.6219 |
| Estimator: | PanelOLS | R-squared (Between): | 0.8963 |
| No. Observations: | 373 | R-squared (Within): | 0.6219 |
| Date: | Thu, Apr 18 2019 | R-squared (Overall): | 0.6500 |
| Time: | 14:26:48 | Log-likelihood | -493.42 |
| Cov. Estimator: | Unadjusted | | |
| | | F-statistic: | 118.43 |
| Entities: | 8 | P-value | 0.0000 |
| Avg Obs: | 46.625 | Distribution: | F(5,360) |
| Min Obs: | 40.000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 0.2438 | 0.1839 | 1.3255 | 0.1859 | -0.1179 | 0.6055 |
| fatalities | 0.0003 | 0.0004 | 0.6836 | 0.4946 | -0.0005 | 0.0010 |
| crop_prod | 9.287e-07 | 1.856e-06 | 0.5003 | 0.6172 | -2.722e-06 | 4.58e-06 |
| rainfall | 0.0048 | 0.0016 | 3.0648 | 0.0023 | 0.0017 | 0.0078 |
| p_petrol | 7.259e-05 | 0.0019 | 0.0387 | 0.9692 | -0.0036 | 0.0038 |
| p_food_Sorghum_t-1 | 0.7571 | 0.0335 | 22.581 | 0.0000 | 0.6912 | 0.8230 |
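
To make the coefficient table concrete, the sketch below combines the sorghum estimates into a single linear prediction. The covariate values are hypothetical, chosen only for illustration. The calculation ignores the state-specific effect $\alpha_i$, so it will not necessarily match what the pipeline's prediction task produces.

```python
# Coefficients copied from the sorghum summary table above.
coef = {
    "const": 0.2438,
    "fatalities": 0.0003,
    "crop_prod": 9.287e-07,
    "rainfall": 0.0048,
    "p_petrol": 7.259e-05,
    "p_food_Sorghum_t-1": 0.7571,
}

# Hypothetical covariate values for one state-month (illustration only).
x = {
    "const": 1.0,
    "fatalities": 10.0,
    "crop_prod": 50_000.0,
    "rainfall": 100.0,
    "p_petrol": 1.0,
    "p_food_Sorghum_t-1": 1.2,
}

# Each term's contribution is coefficient times covariate value;
# the prediction is their sum (entity effect omitted).
contributions = {name: coef[name] * x[name] for name in coef}
predicted_price = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>20}: {value:.4f}")
print(f"{'predicted price':>20}: {predicted_price:.4f}")
```

With these inputs the lagged price term dominates the sum, consistent with it being the only covariate besides rainfall whose estimate is clearly distinguishable from zero in the table.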


### Model Fit
As we are trying to capture both spatial and temporal variability, we can examine the $R^2$ value between states at the same point in time and the $R^2$ for one state across time.
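
One common way to think about the two quantities is sketched below on simulated data: the "within" $R^2$ measures fit on deviations from each entity's own mean (variation over time), while the "between" $R^2$ measures fit on the entity means themselves (variation across entities). This is a plain NumPy illustration under those definitions. The exact formulas used by `linearmodels` may differ in detail, and all data here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_entities, n_periods = 8, 40
entity = np.repeat(np.arange(n_entities), n_periods)

# Simulated panel with entity-specific intercepts and true slope 0.7.
alpha = rng.normal(scale=2.0, size=n_entities)
x = rng.normal(size=entity.size) + alpha[entity]
y = 0.7 * x + alpha[entity] + rng.normal(scale=0.5, size=entity.size)


def entity_means(v):
    """Mean of v for each entity."""
    return np.bincount(entity, weights=v) / np.bincount(entity)


def r2(actual, fitted):
    """1 - SSR/SST, with SST taken around the mean of `actual`."""
    ss_res = np.sum((actual - fitted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Split each series into entity means and deviations from them.
x_bar, y_bar = entity_means(x), entity_means(y)
x_w, y_w = x - x_bar[entity], y - y_bar[entity]

# Fixed-effects slope from the demeaned data.
beta = (x_w @ y_w) / (x_w @ x_w)

# Within: how well the slope explains movement over time inside an entity.
r2_within = r2(y_w, beta * x_w)

# Between: how well the same slope explains differences in entity means.
intercept = y_bar.mean() - beta * x_bar.mean()
r2_between = r2(y_bar, intercept + beta * x_bar)

print(round(r2_within, 2), round(r2_between, 2))
```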

In [4]:
import numpy as np
print("\t\t\tBetween\t\tWithin\t\tOverall")
for c, m in models.items():
    res = m.fit()
    print(f"{c}\t\t{np.round(res.rsquared_between, 2)}\t\t{np.round(res.rsquared_within, 2)}\t\t{np.round(res.rsquared_overall, 2)}")

			Between		Within		Overall
food_Rice		0.71		0.59		0.6
food_Veg Oil		0.85		0.57		0.58
food_Maize		0.9		0.63		0.65
food_Sorghum		0.9		0.62		0.65
food_Beans		0.85		0.67		0.68
food_Sugar		0.94		0.65		0.67
food_Meat		0.81		0.63		0.65
food_Okra		0.65		0.63		0.63
food_Groundnuts		0.57		0.64		0.64
food_Milk		0.6		0.65		0.64
food_Wheat		0.87		0.61		0.63


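As we can see, the model is generally better at capturing spatial variation than temporal variation: averaging the rounded values in the table gives an $R^2$ of about 0.79 for the former (between) and about 0.63 for the latter (within). Groundnuts and milk are the exceptions, with slightly higher within than between values.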

### Generating Predictions

We can also use set-aside test data to make price forecasts using the `PredictPrices` task, which takes the same `time` parameter to denote the range of dates for which it should make predictions, and a `geography` `GeoParameter` to denote the spatial extent of the area of interest (I'm using the default here to spare a long GeoJSON string). 

In [5]:
from models.market_price_model.tasks import PredictPrices
test_task = PredictPrices(time=[datetime(2017,6,1), datetime(2017,12,1)])
luigi.build([test_task])

INFO: 
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 2 complete ones were encountered:
    - 1 models.market_price_model.tasks.MergeData(...)
    - 1 models.market_price_model.tasks.TrainPriceModel(...)
* 1 ran successfully:
    - 1 models.market_price_model.tasks.PredictPrices(...)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====



True

In [6]:
with test_task.output().open("r") as src:
    preds = src.read()
print(preds["food_Sorghum"]["predictions"])

DEBUG: Getting parent pipe


Geography          Time      
Central Equatoria  2017-07-01    1.165778
                   2017-08-01    1.240110
                   2017-09-01    1.291767
                   2017-10-01    1.238173
                   2017-11-01    1.128379
                   2017-12-01    0.955044
Name: predictions, dtype: float64


### Grouping Commodities

*NOTE: the pipeline from here on out will likely be moved to the beginning of the demand model pipeline*

The downstream demand model requires the prices of commodity groups, such as "pulses and vegetables", "bread and cereals", or "milk, cheese and eggs". We are using the commodity groups established by WFP guidelines for determining food insecurity using the [Food Consumption Score (FCS)](https://documents.wfp.org/stellent/groups/public/documents/manual_guide_proced/wfp197216.pdf?_ga=2.53698400.917708111.1555622583-661089132.1555622583). In the `GroupCommodities` task, we assign each food item to a group (using a manually-curated dictionary stored in `mappings.py`) and take the mean price within each group. This is an unsophisticated and unrealistic method of aggregating prices from different commodities into a group index, and is more or less a placeholder. Ideally, the prices would be weighted by the consumption of those commodities in the region of interest.
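
A minimal sketch of both aggregation schemes, the unweighted mean described above and a consumption-weighted alternative, is given below. The mapping, prices, and consumption shares are all made up for illustration. The real mapping lives in `mappings.py`, and the weighted variant is not part of the current pipeline.

```python
import pandas as pd

# Hypothetical item-to-group mapping (stand-in for the one in `mappings.py`).
commodity_groups = {
    "Sorghum": "Bread and Cereals",
    "Maize": "Bread and Cereals",
    "Beans": "Pulses and vegetables",
    "Okra": "Pulses and vegetables",
}

# Made-up predicted prices, one column per commodity, one row per month.
prices = pd.DataFrame(
    {
        "Sorghum": [1.0, 1.2],
        "Maize": [2.0, 2.2],
        "Beans": [3.0, 3.0],
        "Okra": [1.0, 2.0],
    },
    index=pd.to_datetime(["2017-07-01", "2017-08-01"]),
)

# Placeholder approach: unweighted mean of the items in each group.
group_prices = prices.T.groupby(commodity_groups).mean().T

# Possible refinement: weight items by (made-up) consumption shares,
# then normalize by the total weight within each group.
weights = pd.Series({"Sorghum": 0.75, "Maize": 0.25, "Beans": 0.5, "Okra": 0.5})
weighted_sums = (prices * weights).T.groupby(commodity_groups).sum().T
weight_totals = weights.groupby(commodity_groups).sum()
weighted_prices = weighted_sums / weight_totals

print(group_prices)
print(weighted_prices)
```

When one item in a group is consumed far more than the others, the weighted index moves toward that item's price, which is the behavior the plain mean cannot capture.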

In [None]:
from models.market_price_model.tasks import GroupCommodities
group_task = GroupCommodities(time=[datetime(2017,6,1), datetime(2017,12,1)])
luigi.build([group_task])

In [8]:
with group_task.output().open("r") as src:
    group_prices = src.read()
group_prices

DEBUG: Getting parent pipe


| Geography | Time | Bread and Cereals | Oils and fats | Pulses and vegetables | Sugar, jam, honey, chocolate and candy | Meat | Milk, cheese and eggs | geometry |
|---|---|---|---|---|---|---|---|---|
| Central Equatoria | 2017-07-01 | 1.539291 | 2.599157 | 2.034372 | 1.863394 | 3.874569 | 1.090138 | POLYGON ((31.79243775500714 3.824261901399202,... |
| Central Equatoria | 2017-08-01 | 1.602704 | 2.589395 | 1.978994 | 1.937226 | 3.89853 | 1.033553 | POLYGON ((31.79243775500714 3.824261901399202,... |
| Central Equatoria | 2017-09-01 | 1.625632 | 2.573582 | 2.000924 | 2.025146 | 4.083939 | 1.034134 | POLYGON ((31.79243775500714 3.824261901399202,... |
| Central Equatoria | 2017-10-01 | 1.592789 | 2.588219 | 1.965299 | 1.979334 | 4.307981 | 1.116424 | POLYGON ((31.79243775500714 3.824261901399202,... |
| Central Equatoria | 2017-11-01 | 1.437425 | 2.390398 | 1.767129 | 1.789802 | 4.350026 | 1.019716 | POLYGON ((31.79243775500714 3.824261901399202,... |
| Central Equatoria | 2017-12-01 | 1.234678 | 2.131874 | 1.578765 | 1.565399 | 4.03959 | 0.948895 | POLYGON ((31.79243775500714 3.824261901399202,... |


### Rasterizing Outputs

The demand model takes as its input the prices paid by each household. Because our fundamental unit of analysis is a square-kilometer grid cell, we need to convert the vectors of prices above into rasters, then flatten those rasters to feed into the demand model. This is accomplished in the `RasterizePrices` task, which relies on `rasterio.features.rasterize`. The output is a dictionary keyed by date, where each entry is a data frame in which each column is the flattened array of prices for a commodity group over our area of interest.

In [None]:
from models.market_price_model.tasks import RasterizePrices
rast_task = RasterizePrices(time=[datetime(2017,6,1), datetime(2017,12,1)])
luigi.build([rast_task])

In [21]:
with rast_task.output().open("r") as src:
    price_surfaces = src.read()
print(list(price_surfaces.keys())[0])
list(price_surfaces.values())[0].head(n=20)

DEBUG: Getting parent pipe


2017-07-01 00:00:00


| | Bread and Cereals | Oils and fats | Pulses and vegetables | Sugar, jam, honey, chocolate and candy | Meat | Milk, cheese and eggs |
|---|---|---|---|---|---|---|
| 0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 1 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 2 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 3 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 4 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 5 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 6 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 7 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 8 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |
| 9 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 |


These tables are written out to an Excel workbook in the terminal `SavePriceModelOutput` task, with one sheet for each date in question. Note that since we are predicting prices at the state level, each cell within a state will have the same value. As spatial resolution increases, we will hopefully be able to capture more local variability.
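
The rasterize-and-flatten step can be illustrated without `rasterio` by starting from a grid of state IDs. The sketch below is a NumPy stand-in for what `rasterio.features.rasterize` does with polygon geometries. The grid, the state IDs, and the prices are made up. Reading the `-9999.0` values in the output above as a no-data fill for cells outside any state is an assumption.

```python
import numpy as np
import pandas as pd

NODATA = -9999.0  # assumed no-data fill value, matching the output above

# Hypothetical 3x4 grid of state IDs (0 = outside the area of interest).
state_grid = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
])

# Made-up group prices for one date, indexed by state ID.
state_prices = pd.DataFrame(
    {"Bread and Cereals": [1.5, 1.8], "Meat": [3.9, 4.2]},
    index=[1, 2],
)


def burn(values_by_state):
    """Write each state's value into every grid cell belonging to that state."""
    surface = np.full(state_grid.shape, NODATA)
    for sid, value in values_by_state.items():
        surface[state_grid == sid] = value
    return surface


# One flattened price surface per commodity group, as columns of a data frame.
flat = pd.DataFrame(
    {col: burn(state_prices[col]).ravel() for col in state_prices.columns}
)
print(flat)
```

Every cell inside a state carries the same price, which shows directly why state-level predictions yield piecewise-constant price surfaces.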