In my [previous notebook][1], I provided some background on ammonium, performed exploratory data analysis (EDA), and used a linear regression to predict ammonium levels in a stream in Ukraine. Here, we're going to explore the usefulness of decision trees, random forests, and support vector regression as predictive models for the same dataset.

[1]: https://www.kaggle.com/jessicaleger/ammonium-predictions-in-river-water-model

In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Since the previous notebook included the EDA, I'll proceed to loading both the training and testing datasets, cleaning up and splitting the data before moving on to creating the models. 

In [None]:
ammonium_train=pd.read_csv('/kaggle/input/ammonium-prediction-in-river-water/train.csv')
ammonium_train.head()

In [None]:
ammonium_test=pd.read_csv('/kaggle/input/ammonium-prediction-in-river-water/test.csv')
ammonium_test.head()

In [None]:
# Drop the unused station columns by name; the features used below are '1' and '2'.
ammonium_train.drop(columns=['3','4','5','6','7'], inplace=True)
ammonium_train.head()

In [None]:
ammonium_train.dropna(inplace=True)
ammonium_train.count()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=ammonium_train[['1','2']]
y=ammonium_train['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

The first model we will assess is a simple regression tree:

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtree=DecisionTreeRegressor()

In [None]:
dtree.fit(X_train, y_train)

In [None]:
ammonium_train_predictions=dtree.predict(X_test)

In [None]:
from sklearn.metrics import r2_score, mean_squared_error as mse

In [None]:
print("the mean squared error for the decision tree is", mse(y_test, ammonium_train_predictions))
print("the r2 score for the decision tree is", r2_score(y_test, ammonium_train_predictions))
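The two metrics printed above are closely linked when they are computed on the same test set: $r^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$. The following is a minimal sketch of that relationship using synthetic numbers, not the ammonium data; the names `y_true`, `y_pred`, and `r2_from_mse` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-ins for y_test and a model's predictions (illustrative only).
rng = np.random.default_rng(0)
y_true = rng.normal(loc=1.0, scale=0.5, size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)

mse_value = mean_squared_error(y_true, y_pred)
r2_value = r2_score(y_true, y_pred)

# r2 = 1 - SS_res / SS_tot and MSE = SS_res / n, so r2 = 1 - MSE / Var(y_true).
# np.var divides by n by default, which is the variance needed here.
r2_from_mse = 1.0 - mse_value / np.var(y_true)

print("r2 from r2_score:", r2_value)
print("r2 rebuilt from MSE:", r2_from_mse)
```

One consequence is that, on a fixed test set, a model with a lower MSE must also have a higher $r^2$. If two models disagree on this, they were most likely scored on different data.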

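The decision tree has a lower MSE than the simple linear regression (line 29 in [this notebook][1]), which means its predictions are closer to the held-out test values on average. Because this MSE is computed on test data rather than training data, a lower value is not in itself a sign of overfitting; overfitting would show up as a large gap between training error and test error. This model also has a lower $r^2$ score, which means that it explains a smaller proportion of the variance. These two results are hard to reconcile if both models were scored on the same test set, since $r^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$ there, so the comparison with the baseline should be read with caution. Therefore, we can't say that this model performs better than the baseline model. Here is what the regression tree looks like: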

[1]: https://www.kaggle.com/jessicaleger/ammonium-predictions-in-river-water-model

In [None]:
from sklearn.tree import export_graphviz
import pydot

In [None]:
export_graphviz(dtree, out_file = 'tree.dot', rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
from IPython.display import Image
Image('tree.png')
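If Graphviz or `pydot` is not available, scikit-learn's `export_text` can print the same split rules as plain text. The sketch below fits a shallow tree to synthetic data; `X_demo`, `y_demo`, `small_tree`, and the `max_depth=2` setting are assumptions made for readability, not part of the model above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic two-feature data standing in for columns '1' and '2' (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(100, 2))
y_demo = 0.6 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# A shallow tree keeps the printed rules short enough to read.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text returns the tree's split rules and leaf values as a string.
rules = export_text(small_tree, feature_names=["1", "2"])
print(rules)
```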

Now let's try boosting the model:

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
dtree_boosted= AdaBoostRegressor(dtree)

In [None]:
dtree_boosted.fit(X_train, y_train)
ammonium_train_predictions_boost=dtree_boosted.predict(X_test)

In [None]:
print("the mean squared error for the boosted decision tree is", mse(y_test, ammonium_train_predictions_boost))
print("the r2 score for the boosted decision tree is", r2_score(y_test, ammonium_train_predictions_boost))

The MSE and $r^2$ appear to have improved with boosting. One concern is that as models increase in complexity, the risk of overfitting increases. Boosting is often fairly resistant to overfitting in practice, but it is not immune, particularly with fully grown base trees and noisy data. Since these metrics come from the held-out test set, the improvement is at least not an artifact of fitting the training data. While the MSE is lower on this data using decision tree models, the $r^2$ is still lower than that of the baseline model, meaning that the simple linear model appears to explain more of the variance.
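A direct way to check for overfitting is to compare each model's error on the training set with its error on the test set. The sketch below does this on synthetic data; the data, the `models` dictionary, and the `max_depth=3` variant are assumptions for illustration and were not run on the ammonium dataset.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data with two features (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 2, size=(300, 2))
y_demo = 0.6 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    "full-depth tree": DecisionTreeRegressor(random_state=42),
    "boosted full-depth trees": AdaBoostRegressor(
        DecisionTreeRegressor(random_state=42), random_state=42),
    "boosted depth-3 trees": AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=3, random_state=42), random_state=42),
}

# Store (train MSE, test MSE) per model; a large gap suggests overfitting.
errors = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    errors[name] = (
        mean_squared_error(ytr, model.predict(Xtr)),
        mean_squared_error(yte, model.predict(Xte)),
    )
    print(f"{name}: train MSE = {errors[name][0]:.4f}, test MSE = {errors[name][1]:.4f}")
```

The same loop could be pointed at `X_train`, `X_test`, `y_train`, and `y_test` from this notebook to see how large the gap is for `dtree` and `dtree_boosted`.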

Let's visualize one tree from the boosted decision tree:

In [None]:
sub_tree_5=dtree_boosted.estimators_[5]

export_graphviz(sub_tree_5, out_file = 'tree.dot', rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
from IPython.display import Image
Image('tree.png')

In [None]:
import matplotlib.pyplot as plt

Decision trees tend not to have very good predictive accuracy due to high variance. Random forests reduce this variance and should thus improve predictive accuracy. Let's see if this is the case for our data.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, y_train)

In [None]:
ammonium_train_pred_rf=rf.predict(X_test)

In [None]:
print("the mean squared error for the random forest regression is", mse(y_test, ammonium_train_pred_rf))
print("the r2 score for the random forest regression is", r2_score(y_test, ammonium_train_pred_rf))

The $r^2$ of the random forest is in fact higher than that of both the non-boosted and boosted regression trees, and its MSE is lower. Both metrics are measured on the test set, so together they indicate better predictive accuracy rather than overfitting. At this stage, the simple linear regression from the previous notebook still reports a higher $r^2$ than any of the models in this notebook, although its higher MSE makes it unclear whether it would actually generalize better.
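The variance reduction mentioned earlier comes from averaging: a `RandomForestRegressor` prediction is the mean of its individual trees' predictions. Here is a small sketch on synthetic data that checks this and prints how much the individual trees disagree; `forest`, `X_new`, and `per_tree` are illustrative names, and `n_estimators=50` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic two-feature data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(200, 2))
y_demo = 0.6 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Predict a few new points with every tree separately: shape (n_trees, n_points).
X_new = rng.uniform(0, 2, size=(5, 2))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
forest_pred = forest.predict(X_new)

# The forest prediction matches the average over its trees.
print("forest equals mean of trees:", np.allclose(per_tree.mean(axis=0), forest_pred))
# The spread across trees shows the variance that averaging smooths out.
print("std across trees per point:", per_tree.std(axis=0))
```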

Let's visualize one tree from the random forest:

In [None]:
sub_tree_2=rf.estimators_[2]

export_graphviz(sub_tree_2, out_file = 'tree.dot', rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
from IPython.display import Image
Image('tree.png')

Now we're going to boost the random forest. 

In [None]:
rf_boosted= AdaBoostRegressor(rf)

rf_boosted.fit(X_train, y_train)
ammonium_train_pred_rf_b=rf_boosted.predict(X_test)

In [None]:
print("the mean squared error for the boosted random forest is", mse(y_test, ammonium_train_pred_rf_b))
print("the r2 score for the boosted random forest is", r2_score(y_test, ammonium_train_pred_rf_b))

We're starting to see diminishing returns when we boost the random forest: the $r^2$ has gone down and the MSE has gone up. Both changes point the same way. The boosted forest predicts the test data less accurately than the random forest alone, and less of the variance is explained by the model. While boosting is considered a good tool, it can sometimes perform worse than a simpler model that uses less computing power. Let's visualize a tree from this model:

In [None]:
sub_tree_3=rf_boosted.estimators_[3].estimators_[3]

export_graphviz(sub_tree_3, out_file = 'tree.dot', rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
from IPython.display import Image
Image('tree.png')

Next, let's take a look at how support vector regression performs on this data:

In [None]:
from sklearn.svm import SVR

In [None]:
SVR_rbf=SVR(kernel='rbf')
SVR_rbf.fit(X_train, y_train)
SVR_pred=SVR_rbf.predict(X_test)

In [None]:
print("the mean squared error for SVR with rbf kernel is", mse(y_test, SVR_pred))
print("the r2 score for SVR with rbf kernel is", r2_score(y_test, SVR_pred))

In [None]:
SVR_lin=SVR(kernel='linear')
SVR_lin.fit(X_train, y_train)
SVR_lin_pred=SVR_lin.predict(X_test)

In [None]:
print("the mean squared error for SVR with linear kernel is", mse(y_test, SVR_lin_pred))
print("the r2 score for SVR with linear kernel is", r2_score(y_test, SVR_lin_pred))

In [None]:
SVR_poly=SVR(kernel='poly')
SVR_poly.fit(X_train, y_train)
SVR_poly_pred=SVR_poly.predict(X_test)

In [None]:
print("the mean squared error for SVR with polynomial kernel is", mse(y_test, SVR_poly_pred))
print("the r2 score for SVR with polynomial kernel is", r2_score(y_test, SVR_poly_pred))
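SVR is sensitive to the scale of its inputs, so a common practice is to standardize the features inside a pipeline before fitting. The sketch below shows that pattern for the three kernels on synthetic data with deliberately mismatched feature scales; whether scaling changes the results on the ammonium data is untested here, and all variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: the second feature is on a much larger scale (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 2, 300), rng.uniform(0, 200, 300)])
y_demo = 0.6 * X_demo[:, 0] + 0.003 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

scores = {}
for kernel in ("rbf", "linear", "poly"):
    # The scaler is fitted on the training data only, then applied to the test data.
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    scores[kernel] = (mean_squared_error(yte, pred), r2_score(yte, pred))
    print(f"{kernel}: MSE = {scores[kernel][0]:.4f}, r2 = {scores[kernel][1]:.4f}")
```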

In [None]:
import timeit

In [None]:
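# timeit.repeat returns the total time in seconds for `number` calls,
# so each trial is converted to milliseconds per call to match the column labels.
n_calls = 10
benchmark_results = pd.DataFrame(columns=["Code", "Trial 1 (ms)", "Trial 2 (ms)", "Trial 3 (ms)", "Mean (ms)"])
benchmark_codes = ['dtree', 'dtree_boosted','rf', 'rf_boosted', 'SVR_rbf', 'SVR_lin', 'SVR_poly']

for index, code in enumerate(benchmark_codes):
    # Pass the notebook namespace directly instead of building an import string.
    timings = timeit.repeat(f'{code}.predict(X_test)', globals=globals(), repeat=3, number=n_calls)
    results = [1000 * t / n_calls for t in timings]
    row = [code] + results
    row.append(sum(results)/len(results))
    benchmark_results.loc[index] = row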

benchmark_results.round(decimals=4)

Overall, the most suitable candidate model for predicting ammonium downstream for this area would be the random forest model. Support vector regression with a polynomial kernel had the second highest MSE after the baseline model (simple linear regression), and it also had the lowest $r^2$. The boosted forest was a step down from the random forest, and it also took the longest in the prediction benchmark. The random forest and SVR with a linear kernel performed similarly; however, the SVR produced its predictions more quickly.

The baseline model had the highest $r^2$ but also the highest MSE. A higher $r^2$ suggests that it explains the most variance, whereas a higher MSE means larger prediction errors, so these two figures pull in opposite directions and the baseline should be re-scored on the same train/test split before drawing a firm comparison. It is also the simplest model and would be the easiest to explain. One issue with using linear models on this data is that the data are not normally distributed, so the residuals may violate the normality and constant-variance assumptions behind ordinary least squares. Future work might include running a weighted least squares regression on the data, as this model is designed for heteroskedastic data.
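As a rough sketch of the weighted least squares idea, scikit-learn's `LinearRegression` accepts a `sample_weight` argument in `fit`. The example below uses synthetic heteroskedastic data and assumes the noise variance is proportional to `x**2`, which gives weights of `1 / x**2`. That variance structure is an assumption for illustration; on the ammonium data it would have to be estimated first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic heteroskedastic data: the noise grows with x (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=300)
y = 0.8 * x + rng.normal(scale=0.3 * x)
X = x.reshape(-1, 1)

# Assumed variance model: Var(noise) proportional to x**2, so weight = 1 / x**2.
weights = 1.0 / x**2

ols = LinearRegression().fit(X, y)
wls = LinearRegression().fit(X, y, sample_weight=weights)

print("OLS slope:", ols.coef_[0], "intercept:", ols.intercept_)
print("WLS slope:", wls.coef_[0], "intercept:", wls.intercept_)
```

Points with noisier targets receive less weight, so they pull less on the fitted line than they would under ordinary least squares.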