# ADDENDUM: Predicting The Trend of the Price of Wheat

This section of the project attempts to predict the trend in the price of wheat over the next five years using the `Random Forest` machine learning model.

First of all, let's define what the `Random Forest` regressor actually is. The `Random Forest` model can be regarded as an upgrade on the `Decision Tree` model: it is essentially an ensemble of decision trees. A `Decision Tree` regressor makes predictions using a binary tree structure. It operates by recursively splitting the data into subsets, choosing at each tree node the attribute that best separates the target values.

However, a single `Decision Tree` is prone to underfitting when it is too shallow, where it misses real patterns and predicts poorly even on the training data, and to overfitting when there are no bounds on its depth, where it fits the training data almost perfectly but predicts poorly on new data. This is where the `Random Forest` model comes in. It generates a prediction by averaging the predictions of its component trees. Compared to a single decision tree, it typically has far higher predictive accuracy and performs well with default values.
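To make the averaging step concrete, here is a minimal sketch on synthetic data (not the wheat dataset; the array names and the sine-shaped target are illustrative assumptions). It checks that a fitted `RandomForestRegressor` predicts the mean of its component trees' predictions, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, purely illustrative (not the wheat dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree makes its own prediction; the forest averages them.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_demo)
print(np.allclose(manual_average, forest_prediction))
```

If the check prints `True`, the forest's output is just the per-tree average, which is why individual trees' errors tend to cancel out.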

With this in mind, let us apply the theory to our wheat price dataset.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # to split the data into training and validation sets
from sklearn.metrics import mean_absolute_error  # to compute the mean absolute error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [8]:
df1 = pd.read_csv('PCU325311325311.csv')  # fertilizer
df2 = pd.read_csv('WTISPLC.csv')  # crude oil
df3 = pd.read_csv('WPUSI024011.csv')  # agricultural machinery
df4 = pd.read_csv('PWHEAMTUSDM.csv')  # WHEAT
df5 = pd.read_csv('Pesticide2003.csv')  # pesticide
df6 = pd.read_csv('seeds.csv')  # seeds

df_list = [df1, df2, df3, df4, df5, df6]
merged_df = df_list[0]

for df in df_list[1:]:  # inner-merge on the date column to keep only dates present in every series
    merged_df = pd.merge(merged_df, df, on='DATE', how='inner')

merged_df = merged_df.dropna(axis=0)  # drop any rows containing NaN values

merged_df.columns

Index(['DATE', 'PCU325311325311', 'WTISPLC', 'WPUSI024011', 'PWHEAMTUSDM',
       'PCU3253203253201', 'WPU02550304'],
      dtype='object')
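The effect of `how='inner'` in the merge above can be seen on a toy example. This is a minimal sketch with two hypothetical frames (`left`, `right`, `series_a` and `series_b` are made-up names, not the project's data): only the dates that appear in both frames survive the merge.

```python
import pandas as pd

# Two hypothetical monthly series with partially overlapping dates.
left = pd.DataFrame({
    "DATE": ["2003-01-01", "2003-02-01", "2003-03-01"],
    "series_a": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "DATE": ["2003-02-01", "2003-03-01", "2003-04-01"],
    "series_b": [20.0, 30.0, 40.0],
})

# how='inner' keeps only the dates present in both frames.
overlap = pd.merge(left, right, on="DATE", how="inner")
print(overlap)
```

Chaining such merges over several series, as the loop does, shrinks the result to the date range that all of them share.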

In machine learning, `y` conventionally denotes the target value, whilst `X` denotes the "features", the input variables used to predict it. In this case, `y` is the wheat price and `X` is the rest of the variables in the merged dataset *apart from* the target.

In [10]:
y = merged_df.PWHEAMTUSDM  # target

features = ['PCU325311325311', 'WTISPLC', 'WPUSI024011', 'PCU3253203253201', 'WPU02550304']

X = merged_df[features]

y.head(), X.head()

(0    118.157134
 1    115.775953
 2    133.484492
 3    130.649980
 4    130.056604
 Name: PWHEAMTUSDM, dtype: float64,
    PCU325311325311  WTISPLC  WPUSI024011  PCU3253203253201  WPU02550304
 0            177.6    30.72        166.2             100.0        120.6
 1            173.8    30.76        166.5             100.0        121.5
 2            175.8    31.59        168.0             100.0        121.6
 3            178.4    28.29        168.1             100.0        121.1
 4            180.7    30.33        168.1             100.4        124.5)

Now, before running the `Random Forest` regressor, we must recall that a random forest consists of decision trees. This means it is in our best interest to optimize and control the number of leaf nodes so that the `mean absolute error`, more commonly abbreviated as `mae`, is minimal. The `mae` is the average absolute difference between the actual values in a dataset and the corresponding predicted values.

$$
\text{mae} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
- $n$ is the number of data points.
- $y_{i}$ represents the actual values.
- $\hat{y}_i$ represents the predicted values.
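As a quick worked instance of the formula, the sketch below computes the `mae` by hand for three hypothetical actual/predicted pairs (the numbers are invented for easy arithmetic and are not taken from the wheat data).

```python
# Hypothetical actual and predicted values, chosen for easy arithmetic.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]

# Absolute errors: |100-110| = 10, |150-140| = 10, |200-230| = 30.
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors: (10 + 10 + 30) / 3.
mae_value = sum(absolute_errors) / len(absolute_errors)
print(mae_value)
```

Because the errors are averaged in absolute value, the result is in the same units as the target, which makes it easy to compare against the price level itself.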

In [13]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a decision tree with the given leaf-node cap and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)

for maxLeafNodes in range(5, 100, 5):  # arbitrary range with arbitrary step
    myMAE = mae(maxLeafNodes, train_X, val_X, train_y, val_y)
    print("Maximum leaf nodes: %d  \t\t MAE:  %f" %(maxLeafNodes, myMAE))

Maximum leaf nodes: 5  		 MAE:  28.021280
Maximum leaf nodes: 10  		 MAE:  23.516511
Maximum leaf nodes: 15  		 MAE:  21.407904
Maximum leaf nodes: 20  		 MAE:  21.059343
Maximum leaf nodes: 25  		 MAE:  20.015448
Maximum leaf nodes: 30  		 MAE:  20.823434
Maximum leaf nodes: 35  		 MAE:  22.145005
Maximum leaf nodes: 40  		 MAE:  20.538357
Maximum leaf nodes: 45  		 MAE:  20.236992
Maximum leaf nodes: 50  		 MAE:  19.783986
Maximum leaf nodes: 55  		 MAE:  20.399856
Maximum leaf nodes: 60  		 MAE:  20.040448
Maximum leaf nodes: 65  		 MAE:  20.091223
Maximum leaf nodes: 70  		 MAE:  19.916765
Maximum leaf nodes: 75  		 MAE:  19.846544
Maximum leaf nodes: 80  		 MAE:  19.892061
Maximum leaf nodes: 85  		 MAE:  20.135384
Maximum leaf nodes: 90  		 MAE:  20.116455
Maximum leaf nodes: 95  		 MAE:  19.934180


It can be seen that the maximum number of leaf nodes that yields the lowest `mae` is `50`. Thus, this is the value we shall pass into the `Random Forest` regressor.
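Rather than reading the minimum off the printed table by eye, the same choice can be made programmatically. A minimal sketch, using four of the (leaf nodes, MAE) pairs printed above; the names `mae_by_leaf_nodes` and `best_tree_size` are illustrative and do not appear in the notebook.

```python
# A few (max_leaf_nodes, MAE) pairs copied from the printed output.
mae_by_leaf_nodes = {
    25: 20.015448,
    50: 19.783986,
    75: 19.846544,
    95: 19.934180,
}

# Pick the key whose MAE is smallest.
best_tree_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_tree_size)
```

In the notebook itself, the dictionary could presumably be filled inside the search loop instead of being typed in by hand.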

In [16]:
bestTreeSize = 50

forestModel = RandomForestRegressor(random_state=1, max_leaf_nodes=bestTreeSize)
forestModel.fit(train_X, train_y)
wheatPredictions = forestModel.predict(val_X)

print(mean_absolute_error(val_y, wheatPredictions))

wheatPredictions

15.265313035013865


array([226.0551411 , 149.38400645, 193.34329415, 169.61499281,
       194.60195641, 205.030027  , 182.97882485, 138.91471611,
       254.48877724, 231.77208219, 294.8747441 , 168.85776009,
       201.56327408, 170.72712442, 142.50454427, 168.56282051,
       338.71527613, 272.89165702, 343.65565893, 283.79979447,
       168.68128052, 240.52735284, 141.03702303, 257.65873704,
       172.84364076, 183.78371617, 179.20828766, 299.84087502,
       300.99901937, 291.23099129, 284.27033762, 139.36115797,
       167.02887437, 286.51957068, 253.23517741, 135.25138239,
       125.87659617, 254.64037272, 138.69970016, 158.05258104,
       263.3007092 , 410.69276945, 283.97688186, 157.8322262 ,
       209.26343157, 186.05641414, 312.23377776, 256.44955948,
       201.19514039, 341.86227781, 286.31786826, 127.2015398 ,
       174.67387697, 127.41905477, 208.65684035, 138.93054271,
       407.93793371, 197.21257077, 189.05481136, 138.03765361,
       239.14492759, 247.94441134, 316.69485585])

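The array displayed above contains the predicted wheat prices for the 63 rows of the validation set, which is equivalent to five years and three months of monthly data. Note that `train_test_split` shuffles rows by default, so these rows are held-out months drawn from across the sample rather than a consecutive future period.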
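One point worth checking when reading these predictions as a forecast: `train_test_split` shuffles rows by default, so the validation rows are not necessarily the most recent months. If a chronological hold-out is wanted, passing `shuffle=False` is one option. The sketch below illustrates the difference on twelve hypothetical time-ordered observations (`X_time` and `y_time` are made-up arrays, not the wheat data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twelve hypothetical time-ordered observations, labelled 0..11.
X_time = np.arange(12).reshape(-1, 1)
y_time = np.arange(12)

# Default behaviour: rows are shuffled before splitting.
_, val_shuffled, _, _ = train_test_split(X_time, y_time, random_state=0)

# shuffle=False keeps the original order, so validation is the final block.
_, val_ordered, _, _ = train_test_split(X_time, y_time, shuffle=False)

print(val_shuffled.ravel())
print(val_ordered.ravel())
```

With the default 25% validation share, the ordered split holds out the last three observations, which mirrors evaluating a model on the most recent period.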

### do more economic analysis and adv/disadv later