# Lake Bilancino
In this section, random forests are used to predict the lake level and flow rate. They outperform SVR on both targets, as the following tables show.

#### Lake Level Difference
|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Random Forests** | 0.025 ± 0.0037 | 0.042 ± 0.0091 | 0.50 ± 0.12 |
|       **SVR**      | 0.039 ± 0.0058 | 0.052 ± 0.0084 | 0.25 ± 0.32 |

---

#### Flow Rate
|      ML Method     |     MAE    |    RMSE    |   $R^2$      |
|:------------------:|:----------:|:----------:|:------------:|
| **Random Forests** | 1.3 ± 0.16 | 2.0 ± 0.38 |  0.64 ± 0.09 |
|       **SVR**      | 2.0 ± 0.29 | 3.6 ± 0.73 | -0.04 ± 0.06 |

These values are the result of randomly selecting 50 train/test data combinations and fitting a random forest and SVR model to each data set. It should be noted that the values presented are of the form $\mu \pm 2\sigma$, where $\mu$ is the mean, and $\sigma$ is the associated standard deviation.

The key data processing operations behind the values in the tables above are as follows. First, the rainfall columns are averaged into one feature, which is possible because rainfall is highly correlated across the different regions. Then, the lake level from the previous day is used as a feature in predicting the *difference* in lake level from day to day. This assumes it is trivial or inexpensive to measure the lake level each day, or even weekly. Finally, averaging the values of 7 consecutive rows into one reduces the number of zero values in the data enough to predict the weekly difference in lake level more reliably.

The resulting lake level model, while trained on weekly averages, can also be applied to daily lake level differences, so long as the rainfall regions are averaged, although the table below shows a much lower $R^2$ for that target on daily data. The resulting statistics for 10 random forest models trained on the weekly data (7 consecutive row averages) and tested on the original daily data are shown below.

#### RF Models Tested on Daily Data
|          Target           |      MAE      |      RMSE     |    $R^2$    |
|:-------------------------:|:-------------:|:-------------:|:-----------:|
| **Lake Level Difference** | 0.042 ± 0.001 | 0.098 ± 0.001 | 0.10 ± 0.01 |
|       **Flow Rate**       | 1.3 ± 0.18    | 2.0 ± 0.42    | 0.65 ± 0.11 |

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv')
df

---

## Bilancino Missing Data

In [None]:
print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', df.shape[0], '\n',
      35*'=')
df.isna().sum()/(df.shape[0]) *100

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(df.isnull(), cbar=False)
plt.show()

578 rows are missing *feature* information.  
After a quick check, we find that the rainfall measurements are missing only for the first 577 rows, rather than being spread throughout the data. We will drop this initial block of missing feature data (about a year and a half of daily records).

The 578th row begins to record rainfall data, but is missing temperature. We could try to impute that value, but it won't hurt to drop the row, as it is the last row with missing data in our dataset.

We can see that `Flow_Rate` is also missing some information. It turns out that those missing data points all fall within this same block, so we can safely remove rows `[0:578]`.
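
The "quick check" mentioned above is not shown in the notebook; one way it could be done is sketched below. The small `demo` frame and its values are made up for illustration and stand in for `df`. The idea is to find the first valid row of every column and confirm that no nulls remain from the latest of those rows onward.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df: nulls only in a leading block
demo = pd.DataFrame({
    'Rainfall': [np.nan, np.nan, np.nan, 0.0, 1.2, 0.0],
    'Temperature': [np.nan, np.nan, np.nan, np.nan, 9.5, 10.1],
    'Lake_Level': [249.4, 249.5, 249.5, 249.6, 249.6, 249.7],
})

# Index label of the first non-null entry in each column
first_valid = {col: demo[col].first_valid_index() for col in demo.columns}

# The leading block ends where the slowest column starts recording
start = max(first_valid.values())

# True only if nothing is missing from that row onward
leading_block_only = bool(demo.loc[start:].notna().all().all())

print(first_valid)
print(start, leading_block_only)
```

If `leading_block_only` came out `False` on the real data, the missing values would be scattered and dropping a single leading slice would not be enough.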

In [None]:
df.dropna(subset=['Rainfall_S_Piero','Rainfall_Mangona','Rainfall_Cavallina',
                     'Rainfall_Le_Croci','Temperature_Le_Croci'], inplace=True)

In [None]:
df.reset_index(drop=True,inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', df.shape[0], '\n',
      35*'=')
df.isna().sum()/(df.shape[0]) *100

All entries are now non-null.

### Feature Correlation
Now we should explore how correlated rainfall is in these different areas.

In [None]:
corr = df.select_dtypes(include='number').corr()  # Exclude the non-numeric Date column
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

As we might have expected, the rainfall features are all highly correlated with each other, so it makes sense to combine them into a single feature column. This could be an average or a sum.

We will use an average in this case. The reasoning is that averaging accepts a little extra bias in exchange for lower variance in the feature data, and it smooths out events where rainfall in one isolated region might carry *more* or *less* "weight" than other regions when determining our result.

---

In [None]:
df['Mean_Rainfall'] = df[['Rainfall_S_Piero','Rainfall_Mangona','Rainfall_S_Agata','Rainfall_Cavallina','Rainfall_Le_Croci']].mean(axis=1)
df.head()

Before we deconstruct the `Date`, let's build a combined graph that shows how the other features change over just the first 5 years, to see whether the cycles approximately line up or whether we can distinguish some sort of "phase shift".

In [None]:
years = df[:1825]

fig = plt.figure(figsize=(11,4))
ax1 = fig.add_axes([0,0,1,0.5])
ax2 = fig.add_axes([0,-0.7,1,0.5])
ax3 = fig.add_axes([0,-1.4,1,0.5])
ax4 = fig.add_axes([0,-2.1,1,0.5])

ax1.set_title('Lake Level')
ax1.set_xlabel('Time')
ax1.set_ylabel('meters')

ax2.set_title('Flow Rate')
ax2.set_xlabel('Time')
ax2.set_ylabel('m$^3$/s')

ax3.set_title('Temperature')
ax3.set_xlabel('Time')
ax3.set_ylabel('celsius')

ax4.set_title('Mean Region Rainfall')
ax4.set_xlabel('Time')
ax4.set_ylabel('mm')

ax1.tick_params(axis='x', bottom=False, labelbottom=False)
ax2.tick_params(axis='x', bottom=False, labelbottom=False)
ax3.tick_params(axis='x', bottom=False, labelbottom=False)
ax4.tick_params(axis='x', bottom=False, labelbottom=False)

ax1.plot(years['Date'], years['Lake_Level'], label='Lake Level', color='g')
ax2.plot(years['Date'], years['Flow_Rate'], label='Flow Rate', color='y')
ax3.plot(years['Date'], years['Temperature_Le_Croci'], label='Temperature', color='r')
ax4.plot(years['Date'], years['Mean_Rainfall'], label='Rainfall', color='b')

plt.show()

> In the figure above, we can see that `Lake Level` drops shortly after `Temperature` rises. `Rainfall`, however, does not appear to have a significant shift, presumably because evaporation affects water levels more slowly than precipitation does.

Let's clean up our data and break down `Date`, since the month of each date may give more insight into the cyclical nature of these features.
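
The "phase shift" read off the figure could also be estimated numerically. The sketch below is an illustration only: it uses synthetic sinusoids rather than the Bilancino data, and the 30-day delay is a made-up value. It correlates the lake level with the temperature delayed by each candidate lag and picks the lag with the most negative correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for five years of daily data (not the Bilancino values)
t = np.arange(5 * 365)
temperature = pd.Series(np.sin(2 * np.pi * t / 365))
# Hypothetical lake level that falls 30 days after temperature rises
lake_level = pd.Series(-np.sin(2 * np.pi * (t - 30) / 365))

# Correlate the level with the temperature delayed by each candidate lag
lags = range(0, 91)
lag_corr = pd.Series({lag: lake_level.corr(temperature.shift(lag)) for lag in lags})

# The most negative correlation marks the estimated delay in days
best_lag = int(lag_corr.idxmin())
print(best_lag, round(lag_corr.min(), 3))
```

On real data the correlation curve would be noisier, so the minimum should be read as a rough estimate rather than an exact delay.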

In [None]:
df['Month'] = df['Date'].apply(lambda x: x.split('/')[1]) # Extract the month field from the date string
months = pd.get_dummies(df['Month'], drop_first=True) # One-hot encoding of month

clean = pd.DataFrame()

clean['Mean_Rainfall'] = df['Mean_Rainfall']
clean['Temperature'] = df['Temperature_Le_Croci']
clean['Lake_Level'] = df['Lake_Level']
clean['Flow_Rate'] = df['Flow_Rate']
clean = pd.concat([clean,months],axis=1) # Add the months to the end
clean.head()
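
As an alternative to splitting the date string, the month could be obtained by parsing the dates. This is a sketch under the assumption that `Date` is stored as day/month/year text, which is what taking the second `/`-separated field as the month implies; the sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical dates in the day/month/year layout that the split('/') call assumes
dates = pd.Series(['03/06/2004', '15/11/2004', '01/01/2005'])

# String-based extraction, as in the cell above
month_from_split = dates.apply(lambda x: x.split('/')[1])

# Parsing-based extraction; raises an error if a date does not match the format
month_from_parse = pd.to_datetime(dates, format='%d/%m/%Y').dt.month

print(month_from_split.tolist())
print(month_from_parse.tolist())
```

Note that the parsed months are integers, so one-hot columns built from them would be named `2`, `3`, and so on, rather than the zero-padded strings `'02'`, `'03'` produced by the string split.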

---

## Bilancino Visualization
### Mean Rainfall Visualization
It is apparent that this relationship is highly non-linear. The distribution of the data is reminiscent of a problem suited to regression trees.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=df['Mean_Rainfall'], y=df['Lake_Level'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=df['Mean_Rainfall'], y=df['Flow_Rate'])
plt.show()

### Temperature Visualization
This data is also highly non-linear. We probably won't want to use lasso or ridge regression with these features/predictors.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=df['Temperature_Le_Croci'], y=df['Lake_Level'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=df['Temperature_Le_Croci'], y=df['Flow_Rate'])
plt.show()

### Month Visualization
It is apparent that the month contains some information about the level of the lake.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='Lake_Level',data=df)
plt.show()

No single month appears to contain enough non-zero flow-rate observations to move the typical value away from zero.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month', y='Flow_Rate', data=df)
plt.show()

We see that the same is true for the rainfall, but we also see that the month shares information with temperature.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month', y='Mean_Rainfall', data=df)
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month', y='Temperature_Le_Croci', data=df)
plt.show()

## Lake Level vs Flow Rate
It does appear that flow rate is dependent on lake level, as the flow rate doesn't appear to go over 10 until the lake level goes over 250. Lake level could be a good predictor of flow rate when used with other information.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=df['Lake_Level'], y=df['Flow_Rate'])
plt.show()

---

## Modeling
We briefly discuss the models we should consider for this problem so far.  
The fact that this is a non-linear regression problem with ~6000 samples suggests we can start by considering SVR or regression trees.

In this case, I want to start with SVR, as its performance without any optimization should make a good benchmark.

#### Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = clean.drop(['Lake_Level','Flow_Rate'],axis=1)
yLL = clean['Lake_Level']
yFR = clean['Flow_Rate']

In [None]:
XLL_train, XLL_test, yLL_train, yLL_test = train_test_split(X, yLL, test_size=0.3)
XFR_train, XFR_test, yFR_train, yFR_test = train_test_split(X, yFR, test_size=0.3)

### SVR

In [None]:
from sklearn.svm import SVR

In [None]:
LLmodel = SVR(kernel='rbf')
FRmodel = SVR(kernel='poly')

In [None]:
LLmodel.fit(XLL_train, yLL_train)
FRmodel.fit(XFR_train, yFR_train)

In [None]:
LLpredictions = LLmodel.predict(XLL_test)
FRpredictions = FRmodel.predict(XFR_test)

---

## Evaluation

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yLL_test<LLpredictions,'indigo','peru')

ax.scatter(x=yLL_test, y=LLpredictions, c=col)
ax.plot(yLL_test,yLL_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level')
ax.set_ylabel('Predicted Lake Level')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
from sklearn import metrics

In [None]:
print("Lake Level","\nMAE:\t", metrics.mean_absolute_error(yLL_test,LLpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yLL_test,LLpredictions)))
print(30*"=","\nFlow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Lake Level R^2:\t", metrics.r2_score(yLL_test,LLpredictions))
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

It is clear here that more feature engineering is needed to improve the predictive power of this model.

After thinking about the problem, it seems that it would be equally useful to simply predict the daily difference ($\pm$) in water level. This effectively removes the date or "seasonal" aspect from the lake level. Our model then only requires a rainfall amount and a temperature, and it returns an increase or a decrease in lake level. This relieves the temperature/month features of the burden of predicting the level of the lake itself, which should be readily available from the previous day.

For that same reason, it may be worthwhile to include the previous day's lake level as a feature. We've already seen in the visualization that lake level would be a good predictor of flow rate, so this would be a powerful addition to the feature set.
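
To make the reformulated target concrete, here is a minimal worked example on four made-up lake levels (not values from the dataset). It builds the previous-day feature with `shift`, takes the difference as the new response, and shows that adding a difference back to the previous level recovers the level itself.

```python
import numpy as np
import pandas as pd

# Made-up lake levels in meters for four consecutive days
level = pd.Series([251.14, 251.21, 251.20, 251.35])

# Previous day's level; the first day has no predecessor, so reuse its own value
previous = level.shift(periods=1, fill_value=level.iloc[0])

# New response: day-to-day change in level
diff = level - previous

# Adding a (perfectly predicted) difference to the previous level recovers the level
reconstructed = previous + diff

print(diff.round(2).tolist())
print(np.allclose(reconstructed, level))
```

In practice the difference comes from a model rather than from the data, so the reconstructed level inherits whatever error the difference prediction has.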

---

## Further feature engineering

In [None]:
clean.drop(['02','03','04','05','06','07','08','09','10','11','12'], axis=1, inplace=True)
clean.head()

In [None]:
reform = pd.DataFrame()

# Predictors
reform['Mean_Rainfall'] = df['Mean_Rainfall']
reform['Temperature'] = df['Temperature_Le_Croci']
reform['Previous_Level'] = df['Lake_Level'].shift(periods=1,fill_value=251.14) # Fill value is for 01/01/2004

# Response
reform['Lake_Level'] = df['Lake_Level'] # Our model won't directly predict this value anymore
reform['Diff'] = df['Lake_Level'].subtract(reform['Previous_Level'])
reform['Flow_Rate'] = df['Flow_Rate']

reform.head(10)

---

## More Visualizations
Now that we're looking to predict different values, we should visualize them to see if we have more machine learning techniques available.

In [None]:
corr = reform.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

### Mean Rainfall

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=reform['Mean_Rainfall'], y=reform['Diff'])
plt.show()

### Temperature

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=reform['Temperature'], y=reform['Diff'])
plt.show()

### Previous Level

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=reform['Previous_Level'], y=reform['Diff'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=reform['Previous_Level'], y=reform['Flow_Rate'])
plt.show()

---

## Modeling the Modified Target
### Train/Test Split

In [None]:
X = reform.drop(['Lake_Level','Diff','Flow_Rate'],axis=1)
yDiff = reform['Diff']
yFR = reform['Flow_Rate']

In [None]:
XDiff_train, XDiff_test, yDiff_train, yDiff_test = train_test_split(X, yDiff, test_size=0.3)
XFR_train, XFR_test, yFR_train, yFR_test = train_test_split(X, yFR, test_size=0.3)

---

### SVR

In [None]:
Diffmodel = SVR()
FRmodel = SVR()

In [None]:
Diffmodel.fit(XDiff_train, yDiff_train)

In [None]:
FRmodel.fit(XFR_train, yFR_train)

In [None]:
Diffpredictions = Diffmodel.predict(XDiff_test)
FRpredictions = FRmodel.predict(XFR_test)

---

## Evaluation

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yDiff_test<Diffpredictions,'indigo','peru')

ax.scatter(x=yDiff_test, y=Diffpredictions, c=col)
ax.plot(yDiff_test,yDiff_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level Difference')
ax.set_ylabel('Predicted Difference')
ax.set_title('Predicted and true values on the test set')

plt.show()

This may not seem impressive, and indeed it isn't once you realize that the model is effectively just guessing that the lake level changes by approximately zero meters from the previous day.

However, when we use the predicted difference as an adjustment to the previous day's lake level, the lake level predictions appear to improve dramatically. Hopefully it makes sense to the reader that this is an artificial improvement, arising simply from the way we've constructed our solution. Regardless, we add a visualization of what those lake level predictions now look like for reference.

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yDiff_test + XDiff_test['Previous_Level']<Diffpredictions + XDiff_test['Previous_Level'],
               'indigo','peru')

ax.scatter(x=yDiff_test + XDiff_test['Previous_Level'], y=Diffpredictions + XDiff_test['Previous_Level'], c=col)
ax.plot(yDiff_test+XDiff_test['Previous_Level'],
        yDiff_test+XDiff_test['Previous_Level'], color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level')
ax.set_ylabel('Predicted Lake Level')
ax.set_title('Predicted and true values on the test set')

plt.show()
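
The artificial nature of this improvement can be illustrated with a model that always predicts "no change". The sketch below uses a synthetic random-walk lake level (illustrative numbers only, not the Bilancino data) and a hand-written $R^2$ helper; it is not the notebook's SVR model.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, written out with NumPy."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic random-walk lake level (illustrative numbers only)
rng = np.random.default_rng(0)
level = 250.0 + np.cumsum(rng.normal(0.0, 0.05, size=2000))
previous = level[:-1]
true_level = level[1:]
true_diff = true_level - previous

# A "model" that always predicts no change from the previous day
pred_diff = np.zeros_like(true_diff)

r2_diff = r2(true_diff, pred_diff)
r2_level = r2(true_level, previous + pred_diff)
print(f"R^2 on difference: {r2_diff:.3f}")
print(f"R^2 on level:      {r2_level:.3f}")
```

The zero-change guess scores close to 1 on the level while explaining none of the variation in the difference, which is why the difference is the more honest target to evaluate.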

As for the flow rate, there is still not much improvement, as the model is still just guessing the most common flow rate $(0\ m^3/s)$ for each value.

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Difference in Lake Level","\nMAE:\t", metrics.mean_absolute_error(yDiff_test,Diffpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yDiff_test,Diffpredictions)))
print(30*"=","\nFlow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Difference in Lake Level R^2:\t", metrics.r2_score(yDiff_test,Diffpredictions))
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

---

## Improvement Ideas
It is apparent that the data is still heavily skewed toward zero-rainfall events. This causes the model to struggle to make predictions other than zero.

We can visualize how many days have a mean rainfall of zero to see how problematic this can be.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.histplot(data=reform, x='Mean_Rainfall', bins=30)
plt.show()

In [None]:
reform['Mean_Rainfall'].value_counts()

We will need to flatten this distribution so that it is a bit more even and the effect of outliers is more apparent in our models.

We can accomplish this by averaging out the days of zero rain by grouping the data into weeks.

In [None]:
reform['Mean_Rainfall'].groupby(reform.index // 7).mean()

Note the drop in the size of our data; we've divided it by 7.  
However, we can see that the rainfall feature distribution has been spread out by this process.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.histplot(reform['Mean_Rainfall'].groupby(reform.index // 7).mean(), bins=12)
plt.show()
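
The effect of weekly averaging on the share of zero values can be checked directly. The following sketch uses synthetic rainfall with roughly 70% dry days, a made-up proportion, and applies the same `index // 7` grouping used above.

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall: about 70% dry days (made-up distribution)
rng = np.random.default_rng(0)
n_days = 700
wet = rng.random(n_days) >= 0.7
rain = pd.Series(np.where(wet, rng.exponential(5.0, n_days), 0.0))

# Same grouping trick as above: integer-divide the row index by 7
weekly_rain = rain.groupby(rain.index // 7).mean()

# Share of exactly-zero values before and after aggregation
daily_zero_share = (rain == 0).mean()
weekly_zero_share = (weekly_rain == 0).mean()

print(len(rain), len(weekly_rain))
print(round(daily_zero_share, 2), round(weekly_zero_share, 2))
```

A week only averages to zero if all seven of its days are dry, so the share of zeros drops sharply, at the cost of having seven times fewer rows.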

## Weekly Refactor
Let's refactor the most recent dataframe into a weekly one.

In [None]:
weekly = pd.DataFrame()

# Predictors
weekly['Mean_Rainfall'] = reform['Mean_Rainfall'].groupby(reform.index // 7).mean()
weekly['Mean_Temperature'] = reform['Temperature'].groupby(reform.index // 7).mean()
weekly['Mean_Previous_Level'] = reform['Previous_Level'].groupby(reform.index // 7).mean()

# Response
weekly['Mean_Lake_Level'] = reform['Lake_Level'].groupby(reform.index // 7).mean()
weekly['Mean_Diff'] = reform['Diff'].groupby(reform.index // 7).mean()
weekly['Mean_Flow_Rate'] = reform['Flow_Rate'].groupby(reform.index // 7).mean()

weekly.head(10)

In [None]:
weekly.info()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Rainfall'], y=weekly['Mean_Diff'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Rainfall'], y=weekly['Mean_Flow_Rate'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Temperature'], y=weekly['Mean_Diff'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Temperature'], y=weekly['Mean_Flow_Rate'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Previous_Level'], y=weekly['Mean_Diff'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=weekly['Mean_Previous_Level'], y=weekly['Mean_Flow_Rate'])
plt.show()

In [None]:
corr = weekly.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

---

## Models Revisited
### SVR
To see if we have improved we will want to fit with SVR once more for a comparison to our previous models.

In [None]:
X = weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1)
yDiff = weekly['Mean_Diff']
yFR = weekly['Mean_Flow_Rate']

In [None]:
XDiff_train, XDiff_test, yDiff_train, yDiff_test = train_test_split(X, yDiff, test_size=0.3)
XFR_train, XFR_test, yFR_train, yFR_test = train_test_split(X, yFR, test_size=0.3)

In [None]:
Diffmodel_weekly = SVR(kernel='poly')
FRmodel_weekly = SVR(kernel='poly')

In [None]:
Diffmodel_weekly.fit(XDiff_train, yDiff_train)
FRmodel_weekly.fit(XFR_train, yFR_train)

In [None]:
Diffpredictions = Diffmodel_weekly.predict(XDiff_test)
FRpredictions = FRmodel_weekly.predict(XFR_test)

---

### Evaluation

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yDiff_test<Diffpredictions,'indigo','peru')

ax.scatter(x=yDiff_test, y=Diffpredictions, c=col)
ax.plot(yDiff_test,yDiff_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level Difference')
ax.set_ylabel('Predicted Difference')
ax.set_title('Predicted and true values on the test set')

plt.show()

This is an improvement over our previous model. Again, to emphasize what our prediction for lake level looks like based on this difference prediction, we show it below. However, we can no longer measure our model's performance by the lake level metric; instead, we measure it on the predicted difference.

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yDiff_test + XDiff_test['Mean_Previous_Level']<Diffpredictions + XDiff_test['Mean_Previous_Level'],
               'indigo','peru')

ax.scatter(x=yDiff_test + XDiff_test['Mean_Previous_Level'], y=Diffpredictions + XDiff_test['Mean_Previous_Level'], c=col)
ax.plot(yDiff_test+XDiff_test['Mean_Previous_Level'],
        yDiff_test+XDiff_test['Mean_Previous_Level'], color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level')
ax.set_ylabel('Predicted Lake Level')
ax.set_title('Predicted and true values on the test set')

plt.show()

Note that the flow rate has not improved. We will require a different regression method.

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Difference in Lake Level","\nMAE:\t", metrics.mean_absolute_error(yDiff_test,Diffpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yDiff_test,Diffpredictions)))
print(30*"=","\nFlow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Difference in Lake Level R^2:\t", metrics.r2_score(yDiff_test,Diffpredictions))
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

---

### Decision Trees
Due to the shape of the data related to the flow rate, we will try to model the data using regression trees. Hopefully we will see some improvement, which will then encourage us to seek optimization and improve the predictive power of decision trees.

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
FRmodel_weekly = DecisionTreeRegressor()

In [None]:
FRmodel_weekly.fit(XFR_train, yFR_train)

In [None]:
FRpredictions = FRmodel_weekly.predict(XFR_test)

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Flow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

Since a simple decision tree performed better on this target, let's try to improve performance by using **AdaBoost**.

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
booster = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                          n_estimators=2000, learning_rate=0.01)

In [None]:
booster.fit(XFR_train, yFR_train)

In [None]:
FRpredictions = booster.predict(XFR_test)

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Flow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
forest = RandomForestRegressor(n_estimators=500, max_features='sqrt')

In [None]:
forest.fit(XFR_train, yFR_train)

In [None]:
FRpredictions = forest.predict(XFR_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yFR_test<FRpredictions,'indigo','peru')

ax.scatter(x=yFR_test, y=FRpredictions, c=col)
ax.plot(yFR_test,yFR_test, color='r') # Line of accurate predictions
ax.set_xlabel('Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Flow Rate")
print("MAE:\t", metrics.mean_absolute_error(yFR_test,FRpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yFR_test,FRpredictions)))

In [None]:
print("Flow Rate R^2:\t", metrics.r2_score(yFR_test,FRpredictions))

### Lake Level with Random Forests

In [None]:
forest = RandomForestRegressor(n_estimators=500, max_features='sqrt')

In [None]:
forest.fit(XDiff_train, yDiff_train)

In [None]:
Diffpredictions = forest.predict(XDiff_test)

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(yDiff_test<Diffpredictions,'indigo','peru')

ax.scatter(x=yDiff_test, y=Diffpredictions, c=col)
ax.plot(yDiff_test,yDiff_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lake Level Difference')
ax.set_ylabel('Predicted Difference')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Difference in Lake Level","\nMAE:\t", metrics.mean_absolute_error(yDiff_test,Diffpredictions),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(yDiff_test,Diffpredictions)))

In [None]:
print("Difference in Lake Level R^2:\t", metrics.r2_score(yDiff_test,Diffpredictions))

---

## SVR vs Random Forest
We now want to show that random forests consistently outperform SVR.  
To do this, we compare the mean of the $\text{MAE}$, $\text{RMSE}$, and $R^2$ statistics over 50 models of each method.

In [None]:
def fit_method(X_train,Y_train,X_test,Y_test,method,**kwargs):
    # Fit the given model and return its MAE, RMSE and R squared values on the test set
    try:
        model_k = method(**kwargs)  # method is an estimator class
    except TypeError:
        model_k = method  # method is already an instance (e.g. a configured AdaBoostRegressor)
    model_k.fit(X_train,Y_train)
    predictions = model_k.predict(X_test)
    
    MAE = metrics.mean_absolute_error(Y_test,predictions)
    RMSE = np.sqrt(metrics.mean_squared_error(Y_test,predictions))
    R_squared = metrics.r2_score(Y_test,predictions)
    return MAE, RMSE, R_squared

In [None]:
def run_method(X, y, n=20, method=RandomForestRegressor, **kwargs):
    # Initialization variables
    MAE_list, RMSE_list, R_squared_list = [],[],[]

    for k in range(n):  # n random train/test splits (range(1,n) would run only n-1)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        tmp_result = fit_method(X_train, y_train, X_test, y_test, method, **kwargs)   #Store temp result 
        MAE_list.append(tmp_result[0])    #Append lists
        RMSE_list.append(tmp_result[1])
        R_squared_list.append(tmp_result[2])
    
    table = pd.DataFrame({'MAE': MAE_list, 'RMSE': RMSE_list, 'R_squared':R_squared_list})
    return table

In [None]:
def print_method_results(title):
    # Note: reads the module-level DataFrame `table` produced by run_method
    print(title,
          "\nMean MAE:", np.round(table['MAE'].mean(),3), "±" , np.round(2*table['MAE'].std(),4),
          "\nMean RMSE:", np.round(table['RMSE'].mean(),3), "±" , np.round(2*table['RMSE'].std(),4),
          "\nMean R2:", np.round(table['R_squared'].mean(),3), "±" , np.round(2*table['R_squared'].std(),4), 
          "\n")
    print(25*"=")
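
The repeated random splitting implemented by `run_method` can also be sketched with scikit-learn's `ShuffleSplit` and `cross_validate`. This is an alternative formulation, not the notebook's own procedure; the synthetic `X_demo`/`y_demo` data, the smaller forest, and the 10 splits are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic regression data standing in for the weekly features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# 10 independent random 80/20 splits
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_validate(
    RandomForestRegressor(n_estimators=50, max_features='sqrt', random_state=0),
    X_demo, y_demo, cv=cv,
    scoring=('neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2'),
)

# scikit-learn reports errors as negated scores, so flip the sign back
mae = -scores['test_neg_mean_absolute_error']
rmse = -scores['test_neg_root_mean_squared_error']
r2 = scores['test_r2']

# Report mean ± 2 standard deviations, as in the summary tables
for name, vals in [('MAE', mae), ('RMSE', rmse), ('R2', r2)]:
    print(f"{name}: {vals.mean():.3f} ± {2 * vals.std(ddof=1):.3f}")
```

Fixing `random_state` on the splitter makes the set of splits reproducible, which the hand-written loop does not do unless a seed is passed to `train_test_split`.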

### Lake Level Difference with Random Forests

In [None]:
table = run_method(weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1), weekly['Mean_Diff'],
                   n=50,method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Difference in Lake Level with Random Forests")

In [None]:
# Initialization variables
MAE_list, RMSE_list, R_squared_list = [],[],[]
X_test = reform.drop(['Lake_Level','Diff','Flow_Rate'],axis=1)
y_test = reform['Diff']

X_train = weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1)
y_train = weekly['Mean_Diff']
n = 10

# Fit 10 forests on the same weekly training data; they differ only in their random seeds
for k in range(n):
    tmp_result = fit_method(X_train, y_train, X_test, y_test, 
                            RandomForestRegressor,n_estimators=500, max_features='sqrt')   #Store temp result 
    MAE_list.append(tmp_result[0])    #Append lists
    RMSE_list.append(tmp_result[1])
    R_squared_list.append(tmp_result[2])
        
table = pd.DataFrame({'MAE': MAE_list, 'RMSE': RMSE_list, 'R_squared': R_squared_list})
print("Difference in Lake Level tested on Daily Data",
      "\nMean MAE:", np.round(table['MAE'].mean(),4), "±" , np.round(2*table['MAE'].std(),5),
      "\nMean RMSE:", np.round(table['RMSE'].mean(),4), "±" , np.round(2*table['RMSE'].std(),5),
      "\nMean R2:", np.round(table['R_squared'].mean(),4), "±" , np.round(2*table['R_squared'].std(),4))

### Flow Rate with Random Forests

In [None]:
table = run_method(weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1),weekly['Mean_Flow_Rate'],
                   n=50,method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Flow Rate with Random Forests")

In [None]:
# Initialization variables
MAE_list, RMSE_list, R_squared_list = [],[],[]
X_test = reform.drop(['Lake_Level','Diff','Flow_Rate'],axis=1)
y_test = reform['Flow_Rate']

X_train = weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1)
y_train = weekly['Mean_Flow_Rate']
n = 10
# Fit 10 forests on the same weekly training data; they differ only in their random seeds
for k in range(n):
    tmp_result = fit_method(X_train, y_train, X_test, y_test,
                           RandomForestRegressor,n_estimators=500, max_features='sqrt')   #Store temp result 
    MAE_list.append(tmp_result[0])    #Append lists
    RMSE_list.append(tmp_result[1])
    R_squared_list.append(tmp_result[2])
    
table = pd.DataFrame({'MAE': MAE_list, 'RMSE': RMSE_list, 'R_squared': R_squared_list})
print("Flow Rate tested on Daily Data",
      "\nMean MAE:", np.round(table['MAE'].mean(),3), "±" , np.round(2*table['MAE'].std(),4),
      "\nMean RMSE:", np.round(table['RMSE'].mean(),3), "±" , np.round(2*table['RMSE'].std(),4),
      "\nMean R2:", np.round(table['R_squared'].mean(),3), "±" , np.round(2*table['R_squared'].std(),4))

### Lake Level Difference with SVR

In [None]:
table = run_method(weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1),weekly['Mean_Diff'],
                   n=50,method=SVR,kernel='poly')
print_method_results("Difference in Lake Level with SVR")

### Flow Rate with SVR

In [None]:
table = run_method(weekly.drop(['Mean_Lake_Level','Mean_Diff','Mean_Flow_Rate'],axis=1),weekly['Mean_Flow_Rate'],
                   n=50,method=SVR,kernel='poly')
print_method_results("Flow Rate with SVR")

---

# Arno River

In this section random forests and regression trees with AdaBoost are used to predict the target value of `Hydrometry_Nave_di_Rosano`. The performance of the respective methods is as follows.

|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Random Forests** | 0.375 ± 0.0212 | 0.572 ± 0.0375 | 0.28 ± 0.07 |
|    **AdaBoost**    | 0.356 ± 0.0203 | 0.547 ± 0.0370 | 0.34 ± 0.07 |

These values are the result of randomly selecting 50 train/test data combinations and fitting a random forest and AdaBoost model to each data set. This is the same algorithm for generating the mean MAE, RMSE, and $R^2$ statistics as was used with the lake.
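
The procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own helper: `repeated_split_scores`, `X_demo`, and `y_demo` are hypothetical names, and the forest is kept small so the sketch runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def repeated_split_scores(X, y, make_model, n=50, test_size=0.2):
    """Hypothetical helper: refit on n random splits and collect test metrics."""
    rows = []
    for seed in range(n):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        pred = make_model().fit(X_tr, y_tr).predict(X_te)
        rows.append((mean_absolute_error(y_te, pred),
                     np.sqrt(mean_squared_error(y_te, pred)),
                     r2_score(y_te, pred)))
    scores = np.array(rows)
    # Mean and two sample standard deviations per metric (MAE, RMSE, R^2)
    return scores.mean(axis=0), 2 * scores.std(axis=0, ddof=1)


# Synthetic stand-in data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

mean_scores, two_sigma = repeated_split_scores(
    X_demo, y_demo,
    lambda: RandomForestRegressor(n_estimators=50, max_features='sqrt'),
    n=5)
for name, m, s in zip(('MAE', 'RMSE', 'R2'), mean_scores, two_sigma):
    print(f"{name}: {m:.3f} ± {s:.3f}")
```

Using `ddof=1` matches the default of `pandas.Series.std`, which is what the printed summaries in this notebook rely on.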

The key data processing step behind the values in the table above is averaging the rainfall columns into two groups. This was again motivated by the desire to lower the variance of the resulting models and by the high correlation among the regions within each group. No time aggregation was necessary on this data, as the volatility of the target is more apparent on the daily time scale for a river than for a lake. In any case, aggregation wasn't needed to get within half a meter in absolute error, and it would only reduce the training data size and increase bias for a small reduction in variance.

In [None]:
arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv')
arno

In [None]:
arno.describe().T

In [None]:
print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', arno.shape[0], '\n',
      35*'=')
arno.isna().sum()/(arno.isna().sum() + arno.notna().sum()) *100

Some regions clearly have more data than others.  
We have the percentages; now let's see *how* and *where* those missing values are spread.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.heatmap(arno.isnull(), cbar=False)
plt.show()

In [None]:
corr = arno.corr(numeric_only=True)  # skip the non-numeric Date column
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

The rainfall values in the first 6 regions are correlated with each other, and the rainfall values in the last 8 regions are correlated with each other. Furthermore, the response variable seems more responsive to those last 8 regions, as rainfall in those regions has a higher correlation with the response.

This means that it might be a good idea to average the 6 and 8 regions respectively and build a model off that. This will help reduce the amount of missing data, but it cannot be ignored that the amount of missing data is very large for the second group of rainfall.

The plan might be to first drop any row where there are more than 7 regions of missing rainfall data. That would look like this.

In [None]:
fig = plt.figure(figsize=(10,6))
sns.heatmap(arno.dropna(subset=['Rainfall_Le_Croci', 'Rainfall_Cavallina','Rainfall_S_Agata','Rainfall_Mangona',
                  'Rainfall_S_Piero','Rainfall_Vernio','Rainfall_Stia','Rainfall_Consuma',
                 'Rainfall_Incisa','Rainfall_Montevarchi','Rainfall_S_Savino','Rainfall_Laterina',
                 'Rainfall_Bibbiena','Rainfall_Camaldoli'],thresh=7).isnull(), cbar=False)
plt.show()

In [None]:
arno.dropna(subset=['Rainfall_Le_Croci', 'Rainfall_Cavallina','Rainfall_S_Agata','Rainfall_Mangona',
                  'Rainfall_S_Piero','Rainfall_Vernio','Rainfall_Stia','Rainfall_Consuma',
                 'Rainfall_Incisa','Rainfall_Montevarchi','Rainfall_S_Savino','Rainfall_Laterina',
                 'Rainfall_Bibbiena','Rainfall_Camaldoli'],thresh=7, inplace=True)
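
To make the meaning of `thresh` concrete, here is a small sketch on a toy frame. The column names `r1` to `r4` are made up for illustration; with `thresh=k`, `dropna` keeps a row only if at least `k` of the listed columns are non-null.

```python
import numpy as np
import pandas as pd

# Toy frame with 4 "rainfall" columns instead of the 14 in the Arno data
toy = pd.DataFrame({
    'r1': [1.0, np.nan, np.nan, 0.5],
    'r2': [0.0, np.nan, 2.0, 0.5],
    'r3': [np.nan, np.nan, 1.0, 0.5],
    'r4': [np.nan, 3.0, np.nan, 0.5],
})

# thresh=2 keeps a row only if at least 2 of the listed columns are non-null
kept = toy.dropna(subset=['r1', 'r2', 'r3', 'r4'], thresh=2)
print(kept.index.tolist())  # [0, 2, 3]
```

Row 1 has a single observed value, so it is the only row dropped. With 14 rainfall columns and `thresh=7`, the same rule removes rows with more than 7 missing regions.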

In [None]:
print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', arno.shape[0], '\n',
      35*'=')
arno.isna().sum()/(arno.isna().sum() + arno.notna().sum()) *100

The first 6 regions, which are all correlated, now have no missing data. This challenge is solved for that first group of regions. We managed to keep as much of the data as possible this way, preserving as much of the variability of the data as possible.

Now we have 6 out of 8 regions in the second group that are missing more than 60% of the data.

To make up for the missing information, as well as remove some of the "weight of importance" of some regions over others, we will take a daily rainfall average of the two correlated region groups.
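
The reason averaging helps with missing data is that a row-wise mean skips missing entries by default. A minimal sketch with made-up station names:

```python
import numpy as np
import pandas as pd

group = pd.DataFrame({
    'station_a': [2.0, np.nan, np.nan],
    'station_b': [4.0, 6.0, np.nan],
    'station_c': [np.nan, 2.0, np.nan],
})

# mean(axis=1) skips NaN by default, so each row averages the stations that reported
group_mean = group.mean(axis=1)
print(group_mean.tolist())  # [3.0, 4.0, nan]
```

The group average is only missing when every station in the group is missing on that day, which is why some rows may still need to be dropped afterwards.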

In [None]:
arno['Mean_Rainfall_Group1'] = arno[['Rainfall_Le_Croci','Rainfall_Cavallina','Rainfall_S_Agata',
                                 'Rainfall_Mangona','Rainfall_S_Piero','Rainfall_Vernio']].mean(axis=1)

In [None]:
arno['Mean_Rainfall_Group2'] = arno[['Rainfall_Stia','Rainfall_Consuma', 'Rainfall_Incisa','Rainfall_Montevarchi','Rainfall_S_Savino',
           'Rainfall_Laterina','Rainfall_Bibbiena','Rainfall_Camaldoli']].mean(axis=1)

In [None]:
arno.head()

In [None]:
arno.drop(['Rainfall_Le_Croci','Rainfall_Cavallina','Rainfall_S_Agata','Rainfall_Mangona',
                     'Rainfall_S_Piero','Rainfall_Vernio','Rainfall_Stia','Rainfall_Consuma', 'Rainfall_Incisa',
                     'Rainfall_Montevarchi','Rainfall_S_Savino','Rainfall_Laterina','Rainfall_Bibbiena',
                     'Rainfall_Camaldoli'], axis=1,inplace=True)

In [None]:
arno = arno[['Date','Mean_Rainfall_Group1','Mean_Rainfall_Group2','Temperature_Firenze','Hydrometry_Nave_di_Rosano']]
arno.head(3)

In [None]:
corr = arno.corr(numeric_only=True)  # skip the non-numeric Date column
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

In [None]:
print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', arno.shape[0], '\n',
      35*'=')
arno.isna().sum()/(arno.isna().sum() + arno.notna().sum()) *100

In [None]:
arno.dropna(inplace=True)

---

## Visualizations

One thing we will have to recognize before moving forward is that our data will be very "patchy."  
We tried to maintain the continuous time-series aspect of the data, but not all of it could be preserved in continuous chronological order. That is to say, we cannot use the "hydrometry value of the previous day" to inform our model, and the data may spike more than expected.
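
One way to quantify how patchy the remaining rows are is to count the gaps between consecutive dates. The sketch below uses hypothetical dates and assumes the day/month/year string format; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical dates in the day/month/year format assumed for the dataset
dates = pd.Series(['01/01/2020', '02/01/2020', '05/01/2020', '06/01/2020'])
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

# Difference in days between consecutive remaining rows; anything above 1 is a gap
step_days = parsed.diff().dt.days
n_gaps = int((step_days > 1).sum())
print(step_days.tolist(), n_gaps)  # [nan, 1.0, 3.0, 1.0] 1
```

Wherever the step is larger than one day, the previous row is not the previous day, so a lagged feature built with a simple shift would be misleading.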

In [None]:
fig = plt.figure(figsize=(11,4))
ax1 = fig.add_axes([0,0,1,0.5])
ax2 = fig.add_axes([0,-0.7,1,0.5])
ax3 = fig.add_axes([0,-1.4,1,0.5])
ax4 = fig.add_axes([0,-2.1,1,0.5])

ax1.set_title('Hydrometry Nave di Rosano')
ax1.set_xlabel('Time')
ax1.set_ylabel('m')

ax2.set_title('Mean Rainfall Group 1')
ax2.set_xlabel('Time')
ax2.set_ylabel('mm')

ax3.set_title('Mean Rainfall Group 2')
ax3.set_xlabel('Time')
ax3.set_ylabel('mm')

ax4.set_title('Temperature')
ax4.set_xlabel('Time')
ax4.set_ylabel('celsius')

ax1.tick_params(axis='x', bottom=False, labelbottom=False)
ax2.tick_params(axis='x', bottom=False, labelbottom=False)
ax3.tick_params(axis='x', bottom=False, labelbottom=False)
ax4.tick_params(axis='x', bottom=False, labelbottom=False)

ax1.plot(arno['Date'], arno['Hydrometry_Nave_di_Rosano'], label='Hydrometry Level', color='g')
ax2.plot(arno['Date'], arno['Mean_Rainfall_Group1'], label='Mean Rainfall Group 1', color='b')
ax3.plot(arno['Date'], arno['Mean_Rainfall_Group2'], label='Mean Rainfall Group 2', color='c')
ax4.plot(arno['Date'], arno['Temperature_Firenze'], label='Temperature', color='r')


plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Mean_Rainfall_Group1', y='Hydrometry_Nave_di_Rosano', data=arno)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(data=arno, x='Mean_Rainfall_Group1', bins=30)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Mean_Rainfall_Group2', y='Hydrometry_Nave_di_Rosano', data=arno)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(data=arno, x='Mean_Rainfall_Group2', bins=30)
plt.show()

In [None]:
# jointplot creates its own figure, so the size is set with `height` rather than plt.figure
sns.jointplot(x='Temperature_Firenze', y='Hydrometry_Nave_di_Rosano', data=arno, height=6)
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.histplot(data=arno, x='Hydrometry_Nave_di_Rosano', bins=30)
plt.show()

This data is again highly non-linear. The rainfall data values are heavily centered around 0, as with the lake. Random Forests handled that data well, so we can try that first. We can also add a model with AdaBoost to compete.

---
## Modeling
### Train/Test Split

In [None]:
X = arno.drop(['Date','Hydrometry_Nave_di_Rosano'],axis=1)
y = arno['Hydrometry_Nave_di_Rosano']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Random Forest

In [None]:
forest = RandomForestRegressor(n_estimators=500, max_features='sqrt')
forest.fit(X_train, y_train)

predictions_forest = forest.predict(X_test)

In [None]:
fig = plt.figure(figsize=(8,6))

ax = fig.add_axes([0,0,1,1])

col1 = np.where(y_test<predictions_forest,'indigo','peru')

ax.scatter(x=y_test, y=predictions_forest, c=col1)
ax.plot(y_test,y_test, color='r') # Line of accurate predictions
ax.set_xlabel('Hydrometry Nave di Rosano')
ax.set_ylabel('Predicted Hydrometry Nave di Rosano')
ax.set_title('Predicted and true values on the test set')

plt.show()

## AdaBoost

In [None]:
model_boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=500, learning_rate=0.01)
model_boost.fit(X_train,y_train)
predictions_boost = model_boost.predict(X_test)

In [None]:
fig = plt.figure(figsize=(8,6))

ax = fig.add_axes([0,0,1,1])

col1 = np.where(y_test<predictions_boost,'indigo','peru')

ax.scatter(x=y_test, y=predictions_boost, c=col1)
ax.plot(y_test,y_test, color='r') # Line of accurate predictions
ax.set_xlabel('Hydrometry Nave di Rosano')
ax.set_ylabel('Predicted Hydrometry Nave di Rosano')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Hydrometry Nave di Rosano")
print("FOREST MODEL",
      "\nMAE:\t", metrics.mean_absolute_error(y_test,predictions_forest),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(y_test,predictions_forest)),
      "\tR^2:\t", metrics.r2_score(y_test,predictions_forest))

print(65*"=")

print("\nBOOST MODEL",
      "\nMAE:\t", metrics.mean_absolute_error(y_test,predictions_boost),
      "\nRMSE:\t", np.sqrt(metrics.mean_squared_error(y_test,predictions_boost)),
      "\tR^2:\t", metrics.r2_score(y_test,predictions_boost))

---

## Boost vs Random Forests

In [None]:
table = run_method(arno.drop(['Date','Hydrometry_Nave_di_Rosano'],axis=1),
                   arno['Hydrometry_Nave_di_Rosano'], n=50,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Hydrometry Nave di Rosano with Random Forests")

In [None]:
table = run_method(arno.drop(['Date','Hydrometry_Nave_di_Rosano'],axis=1),
                   arno['Hydrometry_Nave_di_Rosano'], n=50,
                   method=AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=500, learning_rate=0.01))
print_method_results("Hydrometry Nave di Rosano with Boosted Regression Trees")

# Water Springs
Each water spring had a different amount of missing data. Handling missing data was more important for some data sets than others, as data sizes also varied across water springs. As with the models we've seen so far, the majority of the work went into data preprocessing, with the goal of improving the accuracy of the resulting models, their consistency, or both.

## Amiata Overview
The Amiata water spring has 4 targets to predict. Each target, one must assume, is located in a different region of the spring. Therefore, the multiple regions of rainfall are deliberately not combined into one or two features. Instead, we let each model choose its own features by employing machine learning methods that use only a subset of the total features.

Up until now, we've used random forests as a reliably appropriate method to start with. This is still the case for this data, except now we will compare random forests against decision trees. The following values were computed over 20 random fits of each respective ML method for each respective target as before.


#### Flow Rate Bugnano
|      ML Method     |       MAE      |       RMSE     |  $R^2$       |
|:------------------:|:--------------:|:--------------:|:------------:|
| **Decision Trees** | 0.010 ± 0.0052 | 0.038 ± 0.0209 | 0.91 ± 0.098 |
| **Random Forests** | 0.017 ± 0.0029 | 0.037 ± 0.0067 | 0.92 ± 0.028 |


#### Flow Rate Arbure
|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Decision Trees** | 0.102 ± 0.0534 | 0.380 ± 0.1856 | 0.86 ± 0.12 |
| **Random Forests** | 0.162 ± 0.0217 | 0.342 ± 0.0624 | 0.90 ± 0.04 |

#### Flow Rate Ermicciolo
|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Decision Trees** | 0.152 ± 0.0580 | 0.535 ± 0.2524 | 0.86 ± 0.13 |
| **Random Forests** | 0.215 ± 0.0473 | 0.459 ± 0.1251 | 0.90 ± 0.05 |

#### Flow Rate Alta
|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Decision Trees** | 0.433 ± 0.0664 | 0.795 ± 0.1239 | 0.85 ± 0.04 |
| **Random Forests** | 0.413 ± 0.0398 | 0.609 ± 0.0623 | 0.91 ± 0.02 |

## Madonna di Canneto Overview
This water spring contains only one rainfall feature, one temperature feature, and the date. The temperature carries some seasonal information regarding the flow rate, but since there are no other predictors in this dataset, a one-hot encoding of the month is added as a categorical feature.

The resulting statistics are calculated by fitting the three respective models to 20 random training/test set combinations of the original data. In the case of this spring, SVR was included to show how it predicts roughly the mean flow rate regardless of the input values. Random forests perform within one standard deviation of SVR, and decision trees perform worse than random forests on all three metrics. Modeling of this spring could use further improvement.

#### Flow Rate Madonna di Canneto
|      ML Method     |    MAE     |     RMSE   |  $R^2$       |
|:------------------:|:----------:|:----------:|:------------:|
| **Decision Trees** | 16.2 ± 4.0 | 28.7 ± 5.4 | -0.50 ± 0.52 |
| **Random Forests** | 14.4 ± 2.2 | 23.6 ± 3.1 | -0.03 ± 0.27 |
|       **SVR**      | 13.6 ± 2.4 | 23.5 ± 3.9 | -0.08 ± 0.06 |
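
An $R^2$ at or just below zero is what a constant prediction produces. The sketch below, on synthetic numbers that are not taken from the dataset, computes $R^2$ for a model that always predicts the training mean.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic targets, only for illustration
rng = np.random.default_rng(1)
y_train = rng.normal(loc=50.0, scale=5.0, size=200)
y_test = rng.normal(loc=50.0, scale=5.0, size=50)

# A "model" that ignores its inputs and always returns the training mean
constant_pred = np.full_like(y_test, y_train.mean())
r2_constant = r_squared(y_test, constant_pred)
print(round(r2_constant, 3))
```

The result is never positive: the test mean is the best possible constant and gives exactly 0, so any other constant gives a slightly negative value.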


## Lupa Overview
This spring is missing any temperature data, so the month was added as a categorical feature in the same way as with the Madonna di Canneto dataset.

The resulting statistics are calculated by fitting 20 random training/test set combinations of the original data to the two respective models.

#### Flow Rate Lupa
|      ML Method     |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Decision Trees** | 0.549 ± 0.2452 | 2.878 ± 2.7529 | 0.96 ± 0.09 |
| **Random Forests** | 0.480 ± 0.1589 | 1.996 ± 1.3736 | 0.98 ± 0.03 |

In [None]:
amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', amiata.shape[0], '\n',
      35*'=')
amiata.isna().sum()/(amiata.shape[0]) *100

In [None]:
lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', lupa.shape[0], '\n',
      35*'=')
lupa.isna().sum()/(lupa.shape[0]) *100

In [None]:
mdc = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', mdc.shape[0], '\n',
      35*'=')
mdc.isna().sum()/(mdc.shape[0]) *100

These bodies of water all have independent datasets and thus may require different algorithms for fitting.
It appears that water spring Lupa is perhaps ready for EDA, with no missing values. However, the lack of temperature data suggests we might want to add a month feature to our model to make up for the lost seasonal information.

Water spring Madonna di Canneto is missing half of its target values, which only leaves about 1700 supervised data points, and that's assuming we don't need to further reduce the data. We may require an algorithm such as SVR to make up for the potential lack of training data.

Water spring Amiata appears to have very sparse data. Missing data might be a serious issue in this dataset, so we will tackle that problem before moving on.

---

## Amiata Missing Data

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(amiata.isnull(), cbar=False)
plt.show()

The data seems to be mostly missing from the earliest dates. We could remove most of those rows with a simple threshold drop. However, we're interested in prediction, and prediction requires observed targets. So we will need to remove any row that is missing target data.
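
When dropping rows by target, the `how` argument of `dropna` decides how strict the filter is. A small sketch with made-up column names shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'feature': [1.0, 2.0, 3.0, 4.0],
    'target_a': [0.1, np.nan, np.nan, 0.4],
    'target_b': [0.2, 0.3, np.nan, 0.5],
})
targets = ['target_a', 'target_b']

# how='any' drops a row if at least one target is missing
rows_any = toy.dropna(subset=targets, how='any').index.tolist()
# how='all' drops a row only if every target is missing
rows_all = toy.dropna(subset=targets, how='all').index.tolist()
print(rows_any, rows_all)  # [0, 3] [0, 1, 3]
```

The two settings give the same result only when the targets are always missing together.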

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(amiata.dropna(subset=['Flow_Rate_Bugnano','Flow_Rate_Arbure',
                                  'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta']).isnull(), cbar=False)
plt.show()

In [None]:
# how='any' drops every row that is missing at least one of the four targets
amiata.dropna(subset=['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'],
              how='any', inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', amiata.shape[0], '\n',
      35*'=')
amiata.isna().sum()/(amiata.shape[0]) *100

We have near complete rainfall data. With 18% missing data and high rainfall correlation (see below) we can use an average from the other rainfall features to estimate the missing data. However, the question is whether or not to combine the rainfall data.

In this case, we are predicting 4 different flow rates, each with a different model. More importantly, each flow rate is affected differently by the rainfall of different regions. For example, `Flow_Rate_Bugnano` is negatively correlated with `Rainfall_Castel_del_Piano` but positively correlated with `Flow_Rate_Ermicciolo`, while `Rainfall_Vetta_Amiata` is positively correlated with `Rainfall_Castel_del_Piano` but negatively correlated with `Flow_Rate_Ermicciolo`.

So we will maintain the individual rainfall data of each region separate, and use some subset selection method or a model such as the lasso that has feature selection. This way, the flow rate of each region can attempt to find the features that best suit its prediction.

The same goes for Depth to Groundwater, except the motivations for doing so are more apparent.

Ultimately though, we can use other {rainfall / depth to ground water / temperature} data to predict a missing {rainfall / depth to ground water / temperature} data point.
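
The idea of filling a missing measurement from the other measurements of the same kind on the same day can be sketched on a toy frame. The column names are hypothetical; the row-wise mean used here is one simple choice of estimate.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'rain_a': [1.0, np.nan, 3.0],
    'rain_b': [3.0, 4.0, np.nan],
    'rain_c': [np.nan, 2.0, 5.0],
})

row_mean = toy.mean(axis=1)  # NaN-skipping mean of each row
filled = toy.copy()
for col in filled.columns:
    # fillna with a Series aligns on the index, so each gap gets its own row's mean
    filled[col] = filled[col].fillna(row_mean)
print(filled)
```

Each column keeps its observed values, and only the gaps are replaced, so the columns stay separate features for the models.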

In [None]:
corr = amiata.corr(numeric_only=True)  # skip the non-numeric Date column
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,8))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

In [None]:
a_rain_mean = amiata.loc[:,'Rainfall_Castel_del_Piano':'Rainfall_Vetta_Amiata'].mean(axis=1)

# Fill the gaps in each rainfall column with that row's mean over the rainfall columns.
# Assigning the result back avoids chained in-place fillna, which newer pandas may not apply.
for col in ['Rainfall_Castel_del_Piano','Rainfall_Abbadia_S_Salvatore','Rainfall_S_Fiora',
            'Rainfall_Laghetto_Verde','Rainfall_Vetta_Amiata']:
    amiata[col] = amiata[col].fillna(a_rain_mean)

fig = plt.figure(figsize=(6,3))
sns.heatmap(amiata.isnull(), cbar=False)
plt.show()

According to the correlation between the depth to groundwater columns, we *can* use the row-wise mean of the depth to groundwater to substitute missing values in any particular column. We *cannot* average them into one column, as each region might affect some models differently.

In [None]:
a_depth_mean = amiata.loc[:,
                          'Depth_to_Groundwater_S_Fiora_8':'Depth_to_Groundwater_David_Lazzaretti'
                         ].mean(axis=1)

# Fill the gaps in each column with the row-wise mean, assigning the result back to the frame.
for col in ['Depth_to_Groundwater_S_Fiora_8','Depth_to_Groundwater_S_Fiora_11bis',
            'Depth_to_Groundwater_David_Lazzaretti']:
    amiata[col] = amiata[col].fillna(a_depth_mean)

fig = plt.figure(figsize=(6,3))
sns.heatmap(amiata.isnull(), cbar=False)
plt.show()

Lastly, we do the same for the temperature, with the same argument as the rainfall and depth to groundwater.

In [None]:
a_temp_mean = amiata.loc[:,
                         'Temperature_Abbadia_S_Salvatore':'Temperature_Laghetto_Verde'
                        ].mean(axis=1)

# Fill the gaps in each column with the row-wise mean, assigning the result back to the frame.
for col in ['Temperature_Abbadia_S_Salvatore','Temperature_S_Fiora','Temperature_Laghetto_Verde']:
    amiata[col] = amiata[col].fillna(a_temp_mean)

fig = plt.figure(figsize=(6,3))
sns.heatmap(amiata.isnull(), cbar=False)
plt.show()

In [None]:
# how='any' drops every row that is missing at least one of the four targets
amiata.dropna(subset=['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'],
              how='any', inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', amiata.shape[0], '\n',
      35*'=')
amiata.isna().sum()/(amiata.shape[0]) *100

---

## Madonna di Canneto Missing Data

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(mdc.isnull(), cbar=False)
plt.show()

Unfortunately the data for this spring is missing a lot of target values. Again, we have to cut those and the data will no longer be sequential. We'll keep this in mind when choosing a model.

In [None]:
mdc.dropna(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', mdc.shape[0], '\n',
      35*'=')
mdc.isna().sum()/(mdc.shape[0]) *100

At this point, the temperature data is having to express a lot of seasonal information inherent in the flow rate. So in an effort to aid this feature, we will encode the month of each data point for our model.

In [None]:
mdc['Month'] = mdc['Date'].apply(lambda x: x.split('/')[1]) # Extract the month (second '/'-separated field)
months = pd.get_dummies(mdc['Month'], drop_first=True) # One-hot encoding of month

mdc = pd.concat([mdc,months],axis=1) # Add the months to the end
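
As a variant of the encoding above, the month can also be taken from a parsed datetime. The sketch uses hypothetical dates and assumes the day/month/year format; the `Month_` prefix is an illustrative choice, not what the notebook's columns are called.

```python
import pandas as pd

dates = pd.Series(['15/01/2019', '03/02/2019', '20/02/2019', '07/03/2019'])
month = pd.to_datetime(dates, format='%d/%m/%Y').dt.month

# drop_first=True drops the first category (month 1 here), which becomes the baseline
dummies = pd.get_dummies(month, prefix='Month', drop_first=True)
print(dummies.columns.tolist())  # ['Month_2', 'Month_3']
```

A row with all indicator columns equal to zero belongs to the dropped baseline month, so a full year of data yields 11 month columns rather than 12.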

---

## Lupa Missing Data

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(lupa.isnull(), cbar=False)
plt.show()

The spread of missing data here would be ideal for imputation of missing feature data.  
However, in an effort to not influence the resulting model, we will simply remove those missing data points.

In [None]:
lupa.dropna(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', lupa.shape[0], '\n',
      35*'=')
lupa.isna().sum()/(lupa.shape[0]) *100

Recall that Lupa doesn't have any temperature information. We can use the month to give some seasonal variation in the data.

In [None]:
lupa['Month'] = lupa['Date'].apply(lambda x: x.split('/')[1]) # Extract the month (second '/'-separated field)
months = pd.get_dummies(lupa['Month'], drop_first=True) # One-hot encoding of month

lupa = pd.concat([lupa,months],axis=1) # Add the months to the end

---

## Visualizations

### Amiata Features Vs Targets

In [None]:
sns.pairplot(data=amiata,
             x_vars=[
                 'Rainfall_Castel_del_Piano','Rainfall_Abbadia_S_Salvatore',
                 'Rainfall_S_Fiora','Rainfall_Laghetto_Verde','Rainfall_Vetta_Amiata'
             ],
            y_vars=[
                'Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
                'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'
            ])
plt.tight_layout()

In [None]:
sns.pairplot(data=amiata,
             x_vars=[
                 'Depth_to_Groundwater_S_Fiora_8','Depth_to_Groundwater_S_Fiora_11bis',
                 'Depth_to_Groundwater_David_Lazzaretti'
             ],
            y_vars=[
                'Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
                'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'
            ],
            aspect=1.5)
plt.tight_layout()

In [None]:
sns.pairplot(data=amiata,
             x_vars=[
                 'Temperature_Abbadia_S_Salvatore','Temperature_S_Fiora',
                 'Temperature_Laghetto_Verde'
             ],
            y_vars=[
                'Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
                'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'
            ],
            aspect=1.5)
plt.tight_layout()

### Madonna di Canneto Features Vs Target

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Rainfall_Settefrati', y='Flow_Rate_Madonna_di_Canneto', data=mdc)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Temperature_Settefrati', y='Flow_Rate_Madonna_di_Canneto', data=mdc)
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month',y='Flow_Rate_Madonna_di_Canneto',data=mdc)
plt.show()

### Lupa Features Vs Target

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Rainfall_Terni', y='Flow_Rate_Lupa', data=lupa)
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.boxplot(x='Month', y='Flow_Rate_Lupa', data=lupa)
plt.show()

So it seems like these springs, or at least this spring, flow very consistently throughout the year. The flow rate is generally contained between -125 and -75, with the exception of a few outliers in about 6 of the months.

Our model might have a difficult time predicting any value that is not just the training mean.

Let's model without any further speculation and see how well it does.

---

## Building the Models

### Amiata Modeling
Amiata has about 2000 data points and will require 4 resulting models. Because certain rainfall features seem to affect the flow rates differently, we will need to use a model that does some sort of feature selection. 

For this, I had in mind the elastic-net and regression tree methods. However, we should not expect elastic-net to perform well on data that is this non-linear, so we can skip to regression trees. Regression trees have some feature selection properties, which is ideal here: the rainfall of different regions is highly correlated, and we don't know in advance which regions are relevant.
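
The feature selection behavior of a regression tree can be inspected through its `feature_importances_` attribute. The sketch below uses synthetic data with one informative feature and a noisy, correlated copy of it; the variable names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
relevant = rng.uniform(0, 10, size=500)
correlated = relevant + rng.normal(scale=3.0, size=500)  # noisy copy of the first feature
X_demo = np.column_stack([relevant, correlated])
y_demo = np.sin(relevant) + rng.normal(scale=0.05, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
importances = tree.feature_importances_
print(importances.round(3))
```

The tree splits mostly on the informative feature even though the second one is correlated with it, which is the property we hope to exploit with the correlated rainfall regions.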

In [None]:
aX = amiata.drop(['Date','Flow_Rate_Bugnano','Flow_Rate_Arbure',
                  'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'],axis=1)
bugnano = amiata['Flow_Rate_Bugnano']
arbure = amiata['Flow_Rate_Arbure']
ermicciolo = amiata['Flow_Rate_Ermicciolo']
alta = amiata['Flow_Rate_Galleria_Alta']

X_bug_train, X_bug_test, bugnano_train, bugnano_test = train_test_split(aX, bugnano, test_size=0.2)
X_arb_train, X_arb_test, arbure_train, arbure_test = train_test_split(aX, arbure, test_size=0.2)
X_erm_train, X_erm_test, ermicciolo_train, ermicciolo_test = train_test_split(aX, ermicciolo, test_size=0.2)
X_alt_train, X_alt_test, alta_train, alta_test = train_test_split(aX, alta, test_size=0.2)

### Decision Trees on Amiata

In [None]:
tree_bugnano = DecisionTreeRegressor()
tree_arbure = DecisionTreeRegressor()
tree_ermicciolo = DecisionTreeRegressor()
tree_alta = DecisionTreeRegressor()

tree_bugnano.fit(X_bug_train, bugnano_train)
tree_arbure.fit(X_arb_train, arbure_train)
tree_ermicciolo.fit(X_erm_train, ermicciolo_train)
tree_alta.fit(X_alt_train, alta_train)

bug_predictions = tree_bugnano.predict(X_bug_test)
arb_predictions = tree_arbure.predict(X_arb_test)
erm_predictions = tree_ermicciolo.predict(X_erm_test)
alt_predictions = tree_alta.predict(X_alt_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(6,5))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])

col1 = np.where(bugnano_test<bug_predictions,'indigo','peru')
col2 = np.where(arbure_test<arb_predictions,'indigo','peru')

ax.scatter(x=bugnano_test, y=bug_predictions, c=col1)
ax.plot(bugnano_test,bugnano_test, color='r') # Line of accurate predictions
ax.set_xlabel('Bugnano Flow Rate')
ax.set_ylabel('Predicted Bugnano Flow Rate')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=arbure_test, y=arb_predictions, c=col2)
ax2.plot(arbure_test,arbure_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Arbure Flow Rate')
ax2.set_ylabel('Predicted Arbure Flow Rate')
ax2.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Predictions
fig = plt.figure(figsize=(6,5))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])

col1 = np.where(ermicciolo_test<erm_predictions,'indigo','peru')
col2 = np.where(alta_test<alt_predictions,'indigo','peru')

ax.scatter(x=ermicciolo_test, y=erm_predictions, c=col1)
ax.plot(ermicciolo_test,ermicciolo_test, color='r') # Line of accurate predictions
ax.set_xlabel('Ermicciolo Flow Rate')
ax.set_ylabel('Predicted Ermicciolo Flow Rate')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=alta_test, y=alt_predictions, c=col2)
ax2.plot(alta_test,alta_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Alta Flow Rate')
ax2.set_ylabel('Predicted Alta Flow Rate')
ax2.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluations
print("Bugnano Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(bugnano_test,bug_predictions)),
     "\tR^2:\t", metrics.r2_score(bugnano_test,bug_predictions))

print(65*"=","\nArbure Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(arbure_test,arb_predictions)),
     "\tR^2:\t", metrics.r2_score(arbure_test,arb_predictions))

print(65*"=","\nErmicciolo Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(ermicciolo_test,erm_predictions)),
     "\tR^2:\t", metrics.r2_score(ermicciolo_test,erm_predictions))

print(65*"=","\nAlta Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(alta_test,alt_predictions)),
     "\tR^2:\t", metrics.r2_score(alta_test,alt_predictions))

### Random Forest on Amiata

In [None]:
forest_bugnano = RandomForestRegressor(n_estimators=500, max_features='sqrt')
forest_arbure = RandomForestRegressor(n_estimators=500, max_features='sqrt')
forest_ermicciolo = RandomForestRegressor(n_estimators=500, max_features='sqrt')
forest_alta = RandomForestRegressor(n_estimators=500, max_features='sqrt')

forest_bugnano.fit(X_bug_train, bugnano_train)
forest_arbure.fit(X_arb_train, arbure_train)
forest_ermicciolo.fit(X_erm_train, ermicciolo_train)
forest_alta.fit(X_alt_train, alta_train)

bug_predictions = forest_bugnano.predict(X_bug_test)
arb_predictions = forest_arbure.predict(X_arb_test)
erm_predictions = forest_ermicciolo.predict(X_erm_test)
alt_predictions = forest_alta.predict(X_alt_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(6,5))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])

col1 = np.where(bugnano_test<bug_predictions,'indigo','peru')
col2 = np.where(arbure_test<arb_predictions,'indigo','peru')

ax.scatter(x=bugnano_test, y=bug_predictions, c=col1)
ax.plot(bugnano_test,bugnano_test, color='r') # Line of accurate predictions
ax.set_xlabel('Bugnano Flow Rate')
ax.set_ylabel('Predicted Bugnano Flow Rate')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=arbure_test, y=arb_predictions, c=col2)
ax2.plot(arbure_test,arbure_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Arbure Flow Rate')
ax2.set_ylabel('Predicted Arbure Flow Rate')
ax2.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Predictions
fig = plt.figure(figsize=(6,5))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])

col1 = np.where(ermicciolo_test<erm_predictions,'indigo','peru')
col2 = np.where(alta_test<alt_predictions,'indigo','peru')

ax.scatter(x=ermicciolo_test, y=erm_predictions, c=col1)
ax.plot(ermicciolo_test,ermicciolo_test, color='r') # Line of accurate predictions
ax.set_xlabel('Ermicciolo Flow Rate')
ax.set_ylabel('Predicted Ermicciolo Flow Rate')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=alta_test, y=alt_predictions, c=col2)
ax2.plot(alta_test,alta_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Alta Flow Rate')
ax2.set_ylabel('Predicted Alta Flow Rate')
ax2.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluations
print("Bugnano Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(bugnano_test,bug_predictions)),
     "\tR^2:\t", metrics.r2_score(bugnano_test,bug_predictions))

print(65*"=","\nArbure Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(arbure_test,arb_predictions)),
     "\tR^2:\t", metrics.r2_score(arbure_test,arb_predictions))

print(65*"=","\nErmicciolo Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(ermicciolo_test,erm_predictions)),
     "\tR^2:\t", metrics.r2_score(ermicciolo_test,erm_predictions))

print(65*"=","\nAlta Flow Rate\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(alta_test,alt_predictions)),
     "\tR^2:\t", metrics.r2_score(alta_test,alt_predictions))

## Madonna di Canneto Modeling

Madonna di Canneto has only 879 data points and just 3 predictors (Rainfall, Temperature, and Month).
With so little data, reasonable choices are decision trees or SVR with a radial basis function (RBF) kernel. The RBF kernel is motivated by the visualization, which suggests some circular symmetry in the temperature data. Note that the Month column is dropped from the predictors in the cell below.
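
RBF-kernel SVR is sensitive to the scale of its inputs, so a common pattern is to standardize the predictors inside a pipeline before fitting. The following is a minimal sketch of that pattern on synthetic data. The column names, the synthetic target, and the `toy_*` variables are assumptions made for illustration; this is not the Madonna di Canneto data, and the cells below do not scale their inputs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the spring data (hypothetical column names and values).
toy = pd.DataFrame({
    "Rainfall": rng.gamma(shape=1.0, scale=5.0, size=300),
    "Temperature": rng.normal(loc=12.0, scale=6.0, size=300),
})
toy_flow = 250 + 2.0 * toy["Temperature"] + rng.normal(scale=3.0, size=300)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy, toy_flow, test_size=0.1, random_state=0
)

# Standardize the predictors, then fit an RBF-kernel SVR on the scaled values.
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
svr_pipeline.fit(toy_X_train, toy_y_train)
toy_pred = svr_pipeline.predict(toy_X_test)
```

Because the scaler lives inside the pipeline, it is fit on the training split only and then reused unchanged on the test split.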

In [None]:
mdX = mdc.drop(['Date','Flow_Rate_Madonna_di_Canneto','Month'],axis=1)
mdy = mdc['Flow_Rate_Madonna_di_Canneto']

mdX_train, mdX_test, mdy_train, mdy_test = train_test_split(mdX, mdy, test_size=0.1)

In [None]:
md_model = SVR(kernel='rbf')
md_model.fit(mdX_train,mdy_train)
md_pre = md_model.predict(mdX_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(mdy_test<md_pre,'indigo','peru')

ax.scatter(x=mdy_test, y=md_pre, c=col)
ax.plot(mdy_test,mdy_test, color='r') # Line of accurate predictions
ax.set_xlabel('Madonna di Canneto Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Madonna di Canneto Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(mdy_test,md_pre)),
     "\tR^2:\t", metrics.r2_score(mdy_test,md_pre))

### Decision Trees on Madonna di Canneto

In [None]:
md_tree = DecisionTreeRegressor()
md_tree.fit(mdX_train,mdy_train)
md_tree_pre = md_tree.predict(mdX_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(mdy_test<md_tree_pre,'indigo','peru')

ax.scatter(x=mdy_test, y=md_tree_pre, c=col)
ax.plot(mdy_test,mdy_test, color='r') # Line of accurate predictions
ax.set_xlabel('Madonna di Canneto Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Madonna di Canneto Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(mdy_test,md_tree_pre)),
     "\tR^2:\t", metrics.r2_score(mdy_test,md_tree_pre))

### Random Forests on Madonna di Canneto

In [None]:
md_forest = RandomForestRegressor(n_estimators=500, max_features='sqrt',
                                  min_samples_leaf=2, min_samples_split=5)
md_forest.fit(mdX_train,mdy_train)
md_forest_pre = md_forest.predict(mdX_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(mdy_test<md_forest_pre,'indigo','peru')

ax.scatter(x=mdy_test, y=md_forest_pre, c=col)
ax.plot(mdy_test,mdy_test, color='r') # Line of accurate predictions
ax.set_xlabel('Madonna di Canneto Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Madonna di Canneto Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(mdy_test,md_forest_pre)),
     "\tR^2:\t", metrics.r2_score(mdy_test,md_forest_pre))

## Lupa Modeling

Because Lupa has only a rainfall feature and a month feature, but still a healthy amount of data, it is well suited to decision trees, so we use them here.

In [None]:
lupX = lupa.drop(['Date','Flow_Rate_Lupa'],axis=1)
lupy = lupa['Flow_Rate_Lupa']

lupX_train, lupX_test, lupy_train, lupy_test = train_test_split(lupX, lupy, test_size=0.2)

In [None]:
lup_tree = DecisionTreeRegressor()
lup_tree.fit(lupX_train,lupy_train)
lup_pre = lup_tree.predict(lupX_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(lupy_test<lup_pre,'indigo','peru')

ax.scatter(x=lupy_test, y=lup_pre, c=col)
ax.plot(lupy_test,lupy_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lupa Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Lupa Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(lupy_test,lup_pre)),
     "\tR^2:\t", metrics.r2_score(lupy_test,lup_pre))

### Random Forest on Lupa

In [None]:
lup_forest = RandomForestRegressor(n_estimators=100, max_features='sqrt')
lup_forest.fit(lupX_train,lupy_train)
lup_pre = lup_forest.predict(lupX_test)

In [None]:
# Predictions
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(lupy_test<lup_pre,'indigo','peru')

ax.scatter(x=lupy_test, y=lup_pre, c=col)
ax.plot(lupy_test,lupy_test, color='r') # Line of accurate predictions
ax.set_xlabel('Lupa Flow Rate')
ax.set_ylabel('Predicted Flow Rate')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Lupa Flow Rate\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(lupy_test,lup_pre)),
     "\tR^2:\t", metrics.r2_score(lupy_test,lup_pre))

---

## Decision Trees Vs Random Forests

In [None]:
table = run_method(aX, bugnano,method=DecisionTreeRegressor, n=20)
print_method_results("Bugnano Flow Rate with Decision trees")

table = run_method(aX, arbure,method=DecisionTreeRegressor, n=20)
print_method_results("Arbure Flow Rate with Decision trees")

table = run_method(aX, ermicciolo,method=DecisionTreeRegressor, n=20)
print_method_results("Ermicciolo Flow Rate with Decision trees")

table = run_method(aX, alta,method=DecisionTreeRegressor, n=20)
print_method_results("Alta Flow Rate with Decision trees")

In [None]:
table = run_method(aX, bugnano, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Bugnano Flow Rate with Random Forests")

table = run_method(aX, arbure, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Arbure Flow Rate with Random Forests")

table = run_method(aX, ermicciolo, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Ermicciolo Flow Rate with Random Forests")

table = run_method(aX, alta, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Alta Flow Rate with Random Forests")

In [None]:
table = run_method(mdX, mdy,n=20, method=DecisionTreeRegressor)
print_method_results("Madonna di Canneto Flow Rate with Decision trees")

table = run_method(mdX, mdy, n=20, method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Madonna di Canneto Flow Rate with Random forests")

table = run_method(mdX, mdy,n=20, method=SVR)
print_method_results("Madonna di Canneto Flow Rate with SVR")

In [None]:
table = run_method(lupX, lupy,n=20, method=DecisionTreeRegressor)
print_method_results("Lupa Flow Rate with Decision trees")

table = run_method(lupX, lupy, n=20, method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Lupa Flow Rate with Random forests")

# Aquifers

## Doganella Overview
This aquifer has 9 targets, each modeled with random forests. Each target was fit on 20 random train/test splits of the original data to produce the following statistics.

#### Random Forests 
|              Target              |       MAE      |       RMSE     | $R^2$       |
|:--------------------------------:|:--------------:|:--------------:|:-----------:|
| **Depth to Groundwater Pozzo 1** | 1.401 ± 0.3904 | 2.264 ± 0.7389 | 0.89 ± 0.07 |
| **Depth to Groundwater Pozzo 2** | 0.239 ± 0.0766 | 0.387 ± 0.1519 | 0.91 ± 0.06 |
| **Depth to Groundwater Pozzo 3** | 0.699 ± 0.2849 | 1.369 ± 0.8608 | 0.86 ± 0.14 |
| **Depth to Groundwater Pozzo 4** | 0.175 ± 0.0463 | 0.277 ± 0.0887 | 0.92 ± 0.06 |
| **Depth to Groundwater Pozzo 5** | 0.090 ± 0.0392 | 0.184 ± 0.0913 | 0.94 ± 0.05 |
| **Depth to Groundwater Pozzo 6** | 0.450 ± 0.1641 | 0.902 ± 0.3928 | 0.77 ± 0.17 |
| **Depth to Groundwater Pozzo 7** | 0.400 ± 0.0667 | 0.633 ± 0.1459 | 0.47 ± 0.12 |
| **Depth to Groundwater Pozzo 8** | 0.322 ± 0.0901 | 0.517 ± 0.2240 | 0.82 ± 0.12 |
| **Depth to Groundwater Pozzo 9** | 0.646 ± 0.1918 | 1.196 ± 0.5841 | 0.92 ± 0.07 |
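
Statistics like those in the table above can be produced by refitting a model on several random splits and summarizing the scores. The sketch below is one way to do that on synthetic data. The function `repeated_split_scores` is hypothetical and is not the notebook's `run_method` helper, whose definition is not shown in this section. The $\mu \pm 2\sigma$ reporting convention is assumed to match the earlier tables, and the `test_size` value is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def repeated_split_scores(X, y, n=20, test_size=0.2, **model_kwargs):
    """Fit a fresh random forest on n random splits; return (mean, 2*std) per metric."""
    scores = {"MAE": [], "RMSE": [], "R2": []}
    for seed in range(n):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = RandomForestRegressor(random_state=seed, **model_kwargs)
        pred = model.fit(X_tr, y_tr).predict(X_te)
        scores["MAE"].append(mean_absolute_error(y_te, pred))
        scores["RMSE"].append(np.sqrt(mean_squared_error(y_te, pred)))
        scores["R2"].append(r2_score(y_te, pred))
    return {name: (np.mean(vals), 2 * np.std(vals)) for name, vals in scores.items()}


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

summary = repeated_split_scores(X_demo, y_demo, n=5, n_estimators=50)
for name, (mu, two_sigma) in summary.items():
    print(f"{name}: {mu:.3f} ± {two_sigma:.3f}")
```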

The values in the table above depend on a few key data processing steps. Missing rainfall data is recovered where possible: when one of the two rainfall regions is missing a value, it is filled with the average of the two regions for that day. Forward fill is also used for some of the other missing rainfall days. Note, however, that the two regions are kept as separate columns, which is consistent with the strategy used on other water bodies with multiple targets.

The missing temperature data is handled in a new way for this case. Each missing temperature value is replaced by the average temperature for that month, computed from the historical temperature data.
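
The monthly-mean fill can be sketched on a toy frame as follows. This is a vectorized variant using `groupby(...).transform("mean")`, offered as an illustration rather than the notebook's own row-wise approach; the column values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; 'Month' mirrors the notebook's month column.
toy = pd.DataFrame({
    "Month": ["01", "01", "01", "02", "02"],
    "Temperature": [4.0, np.nan, 6.0, 8.0, np.nan],
})

# Mean temperature of each month, broadcast back to every row of that month.
monthly_mean = toy.groupby("Month")["Temperature"].transform("mean")

# Replace only the missing entries with their month's mean.
toy["Temperature"] = toy["Temperature"].fillna(monthly_mean)
print(toy)
```

Observed values are left untouched; only the gaps take the mean of the observed values in the same month.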

Any row missing volume data is dropped, which limits the available data to 421 data points. However, this was shown to be an improvement over dropping the volume columns.

Finally, the month of a particular data point is added to each row.

In [None]:
dog = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', dog.shape[0], '\n',
      35*'=')
dog.isna().sum()/(dog.shape[0]) *100

In [None]:
# Quick re-order of the columns to put the targets at the end
dog = dog[['Date','Rainfall_Monteporzio','Rainfall_Velletri','Temperature_Monteporzio','Temperature_Velletri',
           'Volume_Pozzo_1','Volume_Pozzo_2','Volume_Pozzo_3','Volume_Pozzo_4','Volume_Pozzo_5+6',
           'Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Depth_to_Groundwater_Pozzo_1',
           'Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
           'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_7',
           'Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9']]
dog.head(3)

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(dog.isnull(), cbar=False)
plt.show()

A lot of the Volume_**K** data is missing. This feature indicates the volume of water, expressed in cubic meters (**mc**), taken from the drinking water treatment plant **K**.

Let's see how the predictors are correlated so that we may strategize how to go about dealing with this missing data.

In [None]:
corr = dog.select_dtypes(include='number').corr() # Numeric columns only; 'Date' is a string
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

The two rainfall features are correlated with a coefficient of 0.82. The two temperature features are correlated with a coefficient of 0.99.

The strategy here will be to use the average of the two rainfall regions to fill in missing rainfall values. This keeps as much of the rainfall information as possible, at the cost of some variability between the regions. Note that this is possible because each target has approximately the same correlation with both regions, which has not been the case in some of the previous datasets with multiple targets.
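
One way to check that a target relates to both regions to a similar degree is to compare its correlation with each rainfall column. The sketch below does this on synthetic data; the column names `Rainfall_A`, `Rainfall_B`, and `Target` and all values are hypothetical and do not come from the Doganella file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: two correlated rainfall regions and one target.
shared = rng.gamma(shape=1.0, scale=4.0, size=n)
toy = pd.DataFrame({
    "Rainfall_A": shared + rng.normal(scale=1.0, size=n),
    "Rainfall_B": shared + rng.normal(scale=1.0, size=n),
})
toy["Target"] = -0.5 * shared + rng.normal(scale=2.0, size=n)

# Correlation of the target with each rainfall region.
target_corr = toy[["Rainfall_A", "Rainfall_B"]].corrwith(toy["Target"])
print(target_corr)

# Gap between the two coefficients; a small gap supports treating the regions alike.
print("difference:", abs(target_corr["Rainfall_A"] - target_corr["Rainfall_B"]))
```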

The strategy for the temperatures will be slightly different. We have two options.
1. We have lots of temperature data from previous years. We should be able to average a temperature for each month based on those previous years, and fill missing temperature information with this computed average for the month.
2. Remove the temperature feature entirely and add a month feature as a category.

The first option would preserve the most variability in the data.

The volume feature is mostly missing from the top of the data set, so we will have to remove the top as there isn't much to do about it. But before we do that, we need to determine the average temperature for each month.

In [None]:
dog['Month'] = dog['Date'].apply(lambda x: x.split('/')[1]) # Second '/'-separated field of Date, used as the month
temp_means = dog.groupby('Month')[['Temperature_Monteporzio','Temperature_Velletri']].mean() # Monthly temperature averages
temp_means

In [None]:
import math

def fillin(x, col='No value passed'):
    """Return x[col], or that month's mean temperature when x[col] is missing.

    x is a single row (a Series), as passed by DataFrame.apply(..., axis=1).
    """
    if math.isnan(x[col]):
        # Look up the mean temperature for this row's month
        return temp_means.loc[x['Month'], col]
    else:
        return x[col]

In [None]:
dog['Temperature_Monteporzio'] = dog[['Month','Temperature_Monteporzio']].apply(fillin,axis=1,col='Temperature_Monteporzio')
dog['Temperature_Velletri'] = dog[['Month','Temperature_Velletri']].apply(fillin,axis=1,col='Temperature_Velletri')

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(dog.dropna(subset=['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2',
                               'Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                               'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                               'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8',
                               'Depth_to_Groundwater_Pozzo_9']).isnull(), cbar=False)
plt.show()

In [None]:
dog.dropna(subset=['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2',
                   'Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                   'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                   'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8',
                   'Depth_to_Groundwater_Pozzo_9'],
          inplace=True)

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(dog.ffill(limit=2).isnull(), cbar=False) # Preview: forward-fill gaps of at most 2 consecutive rows
plt.show()

In [None]:
dog.ffill(limit=2, inplace=True) # Forward-fill gaps of at most 2 consecutive rows
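
The effect of the `limit` argument on a forward fill can be seen on a small series. This is a toy illustration with made-up values: a run of missing entries longer than the limit is only partly filled.

```python
import numpy as np
import pandas as pd

# Toy series with a run of four missing values (hypothetical data).
s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])

# Carry the last observation forward over at most two consecutive gaps.
filled = s.ffill(limit=2)
print(filled.tolist())
```

The first two gaps take the value 1.0, while the remaining two stay missing and must be handled by another strategy.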

The remaining missing data in one rainfall column can be filled with the value from the other rainfall column, provided that column is not also missing data for that day.

If we do take an average across the two columns, we find that there are only 14 days where no rainfall is recorded in either column.
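
A toy example shows why filling with the row-wise mean is equivalent to borrowing the other column's value. By default `DataFrame.mean` skips missing entries, so the mean of one observed value and one missing value is the observed value. The column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rainfall columns (hypothetical values).
toy = pd.DataFrame({
    "Rainfall_A": [0.0, np.nan, 3.0, np.nan],
    "Rainfall_B": [2.0, 5.0, np.nan, np.nan],
})

# Row-wise mean; NaN entries are skipped by default (skipna=True).
row_mean = toy.mean(axis=1)

# Fill each column's gaps with the row mean.
filled = toy.apply(lambda col: col.fillna(row_mean))
print(filled)
```

Only the last row, where both columns are missing, remains unfilled.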

In [None]:
rain_mean = dog.loc[:,'Rainfall_Monteporzio':'Rainfall_Velletri'].mean(axis=1)
rain_mean.value_counts(dropna=False).head(5)

Monteporzio has 75 missing values. With the `rain_mean` data, we can reduce that to 14.

In [None]:
dog['Rainfall_Monteporzio'].value_counts(dropna=False).head(3)

In [None]:
dog['Rainfall_Monteporzio'] = dog['Rainfall_Monteporzio'].fillna(rain_mean)
dog['Rainfall_Monteporzio'].value_counts(dropna=False).head(6)

Filling the Velletri column in the same way also reduces its missing rainfall values to 14.

This leaves 14 days on which no rainfall data exists in either column.

In [None]:
dog['Rainfall_Velletri'] = dog['Rainfall_Velletri'].fillna(rain_mean)
dog['Rainfall_Velletri'].value_counts(dropna=False).head(6)

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(dog.fillna(rain_mean).isnull(), cbar=False)
plt.show()

In [None]:
dog.fillna(rain_mean,inplace=True)

At this point, we might consider dropping the remaining 14 rows with no rainfall data.

But while we're at it, we might want to split our data into two sets.  

One where we keep the volume data and drop the remaining missing data.  
Another where we drop all volume columns.

In [None]:
dog_no_vol = dog.drop(labels=['Volume_Pozzo_1','Volume_Pozzo_2','Volume_Pozzo_3','Volume_Pozzo_4',
                           'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9'], axis=1)
dog_no_vol.dropna(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', dog_no_vol.shape[0], '\n',
      35*'=')
dog_no_vol.isna().sum()/(dog_no_vol.shape[0]) *100

In [None]:
dog.dropna(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', dog.shape[0], '\n',
      35*'=')
dog.isna().sum()/(dog.shape[0]) *100

---
## Doganella Visualizations
At this point, we can't do any more feature engineering without actually looking at the data.
So let's visualize it.

In [None]:
sns.pairplot(data=dog_no_vol,
             x_vars=[
                 'Rainfall_Monteporzio','Rainfall_Velletri',
                 'Temperature_Monteporzio','Temperature_Velletri', 'Month'
             ],
            y_vars=[
                'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9'
            ])
plt.tight_layout()

In [None]:
months = pd.get_dummies(dog_no_vol['Month'], drop_first=True) # One-hot encoding of month

dog_no_vol = pd.concat([dog_no_vol,months],axis=1) # Add the months to the end

There appears to be enough of a trend in the month feature to merit a one-hot encoding of the month.
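
The one-hot encoding step can be illustrated on a few rows. The month labels below are hypothetical strings of the kind produced by splitting a date string.

```python
import pandas as pd

# Toy month labels as strings (hypothetical rows).
toy = pd.DataFrame({"Month": ["01", "02", "03", "02"]})

# drop_first=True omits the first category ("01"), which becomes the implicit baseline.
month_dummies = pd.get_dummies(toy["Month"], drop_first=True)
toy = pd.concat([toy, month_dummies], axis=1)
print(toy)
```

A row from the dropped month has all indicator columns equal to zero (or `False`, depending on the pandas version).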

In [None]:
sns.pairplot(data=dog,
             x_vars=[
                 'Rainfall_Monteporzio','Rainfall_Velletri',
                 'Temperature_Monteporzio','Temperature_Velletri', 'Month'
                 
             ],
            y_vars=[
                'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9'
            ])
plt.tight_layout()

In [None]:
sns.pairplot(data=dog,
             x_vars=[
                 'Volume_Pozzo_1','Volume_Pozzo_2','Volume_Pozzo_3','Volume_Pozzo_4',
                 'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9'
                 
             ],
            y_vars=[
                'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9'
            ])
plt.tight_layout()

In [None]:
months = pd.get_dummies(dog['Month'], drop_first=True) # One-hot encoding of month

dog = pd.concat([dog,months],axis=1) # Add the months to the end

---

## Building the Models
Tree-based models again appear to be a good choice, given the similar structure seen in the data for other bodies of water; random forests are used here.

We will have to build two sets of 9 models: one model for each target in each of the two data sets.
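
Fitting one model per target can also be written as a loop over target names, with the fitted models kept in a dictionary. The sketch below uses synthetic data with three made-up targets. All names (`toy`, `targets`, `models`, `predictions`, `held_out`) are hypothetical, and the cells that follow instead define each model in its own variable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two predictors and three targets (hypothetical names and values).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(120, 2)), columns=["Rainfall", "Temperature"])
targets = ["Depth_1", "Depth_2", "Depth_3"]
for i, name in enumerate(targets):
    toy[name] = (i + 1) * toy["Rainfall"] + rng.normal(scale=0.1, size=120)

toy_X = toy.drop(columns=targets)

models, predictions, held_out = {}, {}, {}
for name in targets:
    # Each target gets its own train/test split and its own forest.
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy_X, toy[name], test_size=0.2, random_state=0
    )
    models[name] = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    predictions[name] = models[name].predict(X_te)
    held_out[name] = y_te
```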

In [None]:
nvX = dog_no_vol.drop(['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2',
                       'Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                       'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                       'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8',
                       'Depth_to_Groundwater_Pozzo_9','Date','Month'],axis=1)

nv_p1 = dog_no_vol['Depth_to_Groundwater_Pozzo_1']
nv_p2 = dog_no_vol['Depth_to_Groundwater_Pozzo_2']
nv_p3 = dog_no_vol['Depth_to_Groundwater_Pozzo_3']
nv_p4 = dog_no_vol['Depth_to_Groundwater_Pozzo_4']
nv_p5 = dog_no_vol['Depth_to_Groundwater_Pozzo_5']
nv_p6 = dog_no_vol['Depth_to_Groundwater_Pozzo_6']
nv_p7 = dog_no_vol['Depth_to_Groundwater_Pozzo_7']
nv_p8 = dog_no_vol['Depth_to_Groundwater_Pozzo_8']
nv_p9 = dog_no_vol['Depth_to_Groundwater_Pozzo_9']

nvX_p1_train, nvX_p1_test, nv_p1_train, nv_p1_test = train_test_split(nvX, nv_p1, test_size=0.2)
nvX_p2_train, nvX_p2_test, nv_p2_train, nv_p2_test = train_test_split(nvX, nv_p2, test_size=0.2)
nvX_p3_train, nvX_p3_test, nv_p3_train, nv_p3_test = train_test_split(nvX, nv_p3, test_size=0.2)
nvX_p4_train, nvX_p4_test, nv_p4_train, nv_p4_test = train_test_split(nvX, nv_p4, test_size=0.2)
nvX_p5_train, nvX_p5_test, nv_p5_train, nv_p5_test = train_test_split(nvX, nv_p5, test_size=0.2)
nvX_p6_train, nvX_p6_test, nv_p6_train, nv_p6_test = train_test_split(nvX, nv_p6, test_size=0.2)
nvX_p7_train, nvX_p7_test, nv_p7_train, nv_p7_test = train_test_split(nvX, nv_p7, test_size=0.2)
nvX_p8_train, nvX_p8_test, nv_p8_train, nv_p8_test = train_test_split(nvX, nv_p8, test_size=0.2)
nvX_p9_train, nvX_p9_test, nv_p9_train, nv_p9_test = train_test_split(nvX, nv_p9, test_size=0.2)

In [None]:
nv_p1_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p2_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p3_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p4_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p5_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p6_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p7_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p8_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
nv_p9_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)

In [None]:
nv_p1_model.fit(nvX_p1_train,nv_p1_train)
nv_p2_model.fit(nvX_p2_train,nv_p2_train)
nv_p3_model.fit(nvX_p3_train,nv_p3_train)
nv_p4_model.fit(nvX_p4_train,nv_p4_train)
nv_p5_model.fit(nvX_p5_train,nv_p5_train)
nv_p6_model.fit(nvX_p6_train,nv_p6_train)
nv_p7_model.fit(nvX_p7_train,nv_p7_train)
nv_p8_model.fit(nvX_p8_train,nv_p8_train)
nv_p9_model.fit(nvX_p9_train,nv_p9_train)

In [None]:
nv_predictions1 = nv_p1_model.predict(nvX_p1_test)
nv_predictions2 = nv_p2_model.predict(nvX_p2_test)
nv_predictions3 = nv_p3_model.predict(nvX_p3_test)
nv_predictions4 = nv_p4_model.predict(nvX_p4_test)
nv_predictions5 = nv_p5_model.predict(nvX_p5_test)
nv_predictions6 = nv_p6_model.predict(nvX_p6_test)
nv_predictions7 = nv_p7_model.predict(nvX_p7_test)
nv_predictions8 = nv_p8_model.predict(nvX_p8_test)
nv_predictions9 = nv_p9_model.predict(nvX_p9_test)

In [None]:
# Pozzo 1 - 4
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])
ax3 = fig.add_axes([0,-1.3,1,1])
ax4 = fig.add_axes([1.2,-1.3,1,1])

col1 = np.where(nv_p1_test<nv_predictions1,'indigo','peru')
col2 = np.where(nv_p2_test<nv_predictions2,'indigo','peru')
col3 = np.where(nv_p3_test<nv_predictions3,'indigo','peru')                      
col4 = np.where(nv_p4_test<nv_predictions4,'indigo','peru')                      

ax.scatter(x=nv_p1_test, y=nv_predictions1, c=col1)
ax.plot(nv_p1_test,nv_p1_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater Pozzo 1')
ax.set_ylabel('Predicted Depth to Groundwater Pozzo 1')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=nv_p2_test, y=nv_predictions2, c=col2)
ax2.plot(nv_p2_test,nv_p2_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater Pozzo 2')
ax2.set_ylabel('Predicted Depth to Groundwater Pozzo 2')
ax2.set_title('Predicted and true values on the test set')

ax3.scatter(x=nv_p3_test, y=nv_predictions3, c=col3)
ax3.plot(nv_p3_test,nv_p3_test, color='r') # Line of accurate predictions
ax3.set_xlabel('Depth to Groundwater Pozzo 3')
ax3.set_ylabel('Predicted Depth to Groundwater Pozzo 3')
ax3.set_title('Predicted and true values on the test set')

ax4.scatter(x=nv_p4_test, y=nv_predictions4, c=col4)
ax4.plot(nv_p4_test,nv_p4_test, color='r') # Line of accurate predictions
ax4.set_xlabel('Depth to Groundwater Pozzo 4')
ax4.set_ylabel('Predicted Depth to Groundwater Pozzo 4')
ax4.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Pozzo 5-9
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])
ax3 = fig.add_axes([0,-1.3,1,1])
ax4 = fig.add_axes([1.2,-1.3,1,1])
ax5 = fig.add_axes([0,-2.6,1,1])

col1 = np.where(nv_p5_test<nv_predictions5,'indigo','peru')
col2 = np.where(nv_p6_test<nv_predictions6,'indigo','peru')
col3 = np.where(nv_p7_test<nv_predictions7,'indigo','peru')
col4 = np.where(nv_p8_test<nv_predictions8,'indigo','peru')
col5 = np.where(nv_p9_test<nv_predictions9,'indigo','peru')

ax.scatter(x=nv_p5_test, y=nv_predictions5, c=col1)
ax.plot(nv_p5_test,nv_p5_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater Pozzo 5')
ax.set_ylabel('Predicted Depth to Groundwater Pozzo 5')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=nv_p6_test, y=nv_predictions6, c=col2)
ax2.plot(nv_p6_test,nv_p6_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater Pozzo 6')
ax2.set_ylabel('Predicted Depth to Groundwater Pozzo 6')
ax2.set_title('Predicted and true values on the test set')

ax3.scatter(x=nv_p7_test, y=nv_predictions7, c=col3)
ax3.plot(nv_p7_test,nv_p7_test, color='r') # Line of accurate predictions
ax3.set_xlabel('Depth to Groundwater Pozzo 7')
ax3.set_ylabel('Predicted Depth to Groundwater Pozzo 7')
ax3.set_title('Predicted and true values on the test set')

ax4.scatter(x=nv_p8_test, y=nv_predictions8, c=col4)
ax4.plot(nv_p8_test,nv_p8_test, color='r') # Line of accurate predictions
ax4.set_xlabel('Depth to Groundwater Pozzo 8')
ax4.set_ylabel('Predicted Depth to Groundwater Pozzo 8')
ax4.set_title('Predicted and true values on the test set')

ax5.scatter(x=nv_p9_test, y=nv_predictions9, c=col5)
ax5.plot(nv_p9_test,nv_p9_test, color='r') # Line of accurate predictions
ax5.set_xlabel('Depth to Groundwater Pozzo 9')
ax5.set_ylabel('Predicted Depth to Groundwater Pozzo 9')
ax5.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluation
print("Depth to Groundwater Pozzo 1\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p1_test,nv_predictions1)),
     "\tR^2:\t", metrics.r2_score(nv_p1_test,nv_predictions1))

print(65*"=","\nDepth to Groundwater Pozzo 2\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p2_test,nv_predictions2)),
     "\tR^2:\t", metrics.r2_score(nv_p2_test,nv_predictions2))

print(65*"=","\nDepth to Groundwater Pozzo 3\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p3_test,nv_predictions3)),
     "\tR^2:\t", metrics.r2_score(nv_p3_test,nv_predictions3))

print(65*"=","\nDepth to Groundwater Pozzo 4\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p4_test,nv_predictions4)),
     "\tR^2:\t", metrics.r2_score(nv_p4_test,nv_predictions4))

print(65*"=","\nDepth to Groundwater Pozzo 5\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p5_test,nv_predictions5)),
     "\tR^2:\t", metrics.r2_score(nv_p5_test,nv_predictions5))

print(65*"=","\nDepth to Groundwater Pozzo 6\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p6_test,nv_predictions6)),
     "\tR^2:\t", metrics.r2_score(nv_p6_test,nv_predictions6))

print(65*"=","\nDepth to Groundwater Pozzo 7\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p7_test,nv_predictions7)),
     "\tR^2:\t", metrics.r2_score(nv_p7_test,nv_predictions7))

print(65*"=","\nDepth to Groundwater Pozzo 8\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p8_test,nv_predictions8)),
     "\tR^2:\t", metrics.r2_score(nv_p8_test,nv_predictions8))

print(65*"=","\nDepth to Groundwater Pozzo 9\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(nv_p9_test,nv_predictions9)),
     "\tR^2:\t", metrics.r2_score(nv_p9_test,nv_predictions9))

## The Volume Data

In [None]:
X = dog.drop(['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
              'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
              'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9',
              'Date','Month'],axis=1)

p1 = dog['Depth_to_Groundwater_Pozzo_1']
p2 = dog['Depth_to_Groundwater_Pozzo_2']
p3 = dog['Depth_to_Groundwater_Pozzo_3']
p4 = dog['Depth_to_Groundwater_Pozzo_4']
p5 = dog['Depth_to_Groundwater_Pozzo_5']
p6 = dog['Depth_to_Groundwater_Pozzo_6']
p7 = dog['Depth_to_Groundwater_Pozzo_7']
p8 = dog['Depth_to_Groundwater_Pozzo_8']
p9 = dog['Depth_to_Groundwater_Pozzo_9']

X_p1_train, X_p1_test, p1_train, p1_test = train_test_split(X, p1, test_size=0.1)
X_p2_train, X_p2_test, p2_train, p2_test = train_test_split(X, p2, test_size=0.1)
X_p3_train, X_p3_test, p3_train, p3_test = train_test_split(X, p3, test_size=0.1)
X_p4_train, X_p4_test, p4_train, p4_test = train_test_split(X, p4, test_size=0.1)
X_p5_train, X_p5_test, p5_train, p5_test = train_test_split(X, p5, test_size=0.1)
X_p6_train, X_p6_test, p6_train, p6_test = train_test_split(X, p6, test_size=0.1)
X_p7_train, X_p7_test, p7_train, p7_test = train_test_split(X, p7, test_size=0.1)
X_p8_train, X_p8_test, p8_train, p8_test = train_test_split(X, p8, test_size=0.1)
X_p9_train, X_p9_test, p9_train, p9_test = train_test_split(X, p9, test_size=0.1)

In [None]:
p1_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p2_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p3_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p4_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p5_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p6_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p7_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p8_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')
p9_model = RandomForestRegressor(n_estimators=200, max_features='sqrt')

In [None]:
p1_model.fit(X_p1_train,p1_train)
p2_model.fit(X_p2_train,p2_train)
p3_model.fit(X_p3_train,p3_train)
p4_model.fit(X_p4_train,p4_train)
p5_model.fit(X_p5_train,p5_train)
p6_model.fit(X_p6_train,p6_train)
p7_model.fit(X_p7_train,p7_train)
p8_model.fit(X_p8_train,p8_train)
p9_model.fit(X_p9_train,p9_train)

In [None]:
predictions1 = p1_model.predict(X_p1_test)
predictions2 = p2_model.predict(X_p2_test)
predictions3 = p3_model.predict(X_p3_test)
predictions4 = p4_model.predict(X_p4_test)
predictions5 = p5_model.predict(X_p5_test)
predictions6 = p6_model.predict(X_p6_test)
predictions7 = p7_model.predict(X_p7_test)
predictions8 = p8_model.predict(X_p8_test)
predictions9 = p9_model.predict(X_p9_test)

In [None]:
# Pozzo 1 - 4
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])
ax3 = fig.add_axes([0,-1.3,1,1])
ax4 = fig.add_axes([1.2,-1.3,1,1])

col1 = np.where(p1_test<predictions1,'indigo','peru')
col2 = np.where(p2_test<predictions2,'indigo','peru')
col3 = np.where(p3_test<predictions3,'indigo','peru')                      
col4 = np.where(p4_test<predictions4,'indigo','peru')                      

ax.scatter(x=p1_test, y=predictions1, c=col1)
ax.plot(p1_test,p1_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater Pozzo 1')
ax.set_ylabel('Predicted Depth to Groundwater Pozzo 1')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=p2_test, y=predictions2, c=col2)
ax2.plot(p2_test,p2_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater Pozzo 2')
ax2.set_ylabel('Predicted Depth to Groundwater Pozzo 2')
ax2.set_title('Predicted and true values on the test set')

ax3.scatter(x=p3_test, y=predictions3, c=col3)
ax3.plot(p3_test,p3_test, color='r') # Line of accurate predictions
ax3.set_xlabel('Depth to Groundwater Pozzo 3')
ax3.set_ylabel('Predicted Depth to Groundwater Pozzo 3')
ax3.set_title('Predicted and true values on the test set')

ax4.scatter(x=p4_test, y=predictions4, c=col4)
ax4.plot(p4_test,p4_test, color='r') # Line of accurate predictions
ax4.set_xlabel('Depth to Groundwater Pozzo 4')
ax4.set_ylabel('Predicted Depth to Groundwater Pozzo 4')
ax4.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Pozzo 5-9
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])
ax3 = fig.add_axes([0,-1.3,1,1])
ax4 = fig.add_axes([1.2,-1.3,1,1])
ax5 = fig.add_axes([0,-2.6,1,1])

col1 = np.where(p5_test<predictions5,'indigo','peru')
col2 = np.where(p6_test<predictions6,'indigo','peru')
col3 = np.where(p7_test<predictions7,'indigo','peru')
col4 = np.where(p8_test<predictions8,'indigo','peru')
col5 = np.where(p9_test<predictions9,'indigo','peru')

ax.scatter(x=p5_test, y=predictions5, c=col1)
ax.plot(p5_test,p5_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater Pozzo 5')
ax.set_ylabel('Predicted Depth to Groundwater Pozzo 5')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=p6_test, y=predictions6, c=col2)
ax2.plot(p6_test,p6_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater Pozzo 6')
ax2.set_ylabel('Predicted Depth to Groundwater Pozzo 6')
ax2.set_title('Predicted and true values on the test set')

ax3.scatter(x=p7_test, y=predictions7, c=col3)
ax3.plot(p7_test,p7_test, color='r') # Line of accurate predictions
ax3.set_xlabel('Depth to Groundwater Pozzo 7')
ax3.set_ylabel('Predicted Depth to Groundwater Pozzo 7')
ax3.set_title('Predicted and true values on the test set')

ax4.scatter(x=p8_test, y=predictions8, c=col4)
ax4.plot(p8_test,p8_test, color='r') # Line of accurate predictions
ax4.set_xlabel('Depth to Groundwater Pozzo 8')
ax4.set_ylabel('Predicted Depth to Groundwater Pozzo 8')
ax4.set_title('Predicted and true values on the test set')

ax5.scatter(x=p9_test, y=predictions9, c=col5)
ax5.plot(p9_test,p9_test, color='r') # Line of accurate predictions
ax5.set_xlabel('Depth to Groundwater Pozzo 9')
ax5.set_ylabel('Predicted Depth to Groundwater Pozzo 9')
ax5.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluation
print("Depth to Groundwater Pozzo 1\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p1_test,predictions1)),
     "\tR^2:\t", metrics.r2_score(p1_test,predictions1))

print(65*"=","\nDepth to Groundwater Pozzo 2\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p2_test,predictions2)),
     "\tR^2:\t", metrics.r2_score(p2_test,predictions2))

print(65*"=","\nDepth to Groundwater Pozzo 3\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p3_test,predictions3)),
     "\tR^2:\t", metrics.r2_score(p3_test,predictions3))

print(65*"=","\nDepth to Groundwater Pozzo 4\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p4_test,predictions4)),
     "\tR^2:\t", metrics.r2_score(p4_test,predictions4))

print(65*"=","\nDepth to Groundwater Pozzo 5\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p5_test,predictions5)),
     "\tR^2:\t", metrics.r2_score(p5_test,predictions5))

print(65*"=","\nDepth to Groundwater Pozzo 6\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p6_test,predictions6)),
     "\tR^2:\t", metrics.r2_score(p6_test,predictions6))

print(65*"=","\nDepth to Groundwater Pozzo 7\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p7_test,predictions7)),
     "\tR^2:\t", metrics.r2_score(p7_test,predictions7))

print(65*"=","\nDepth to Groundwater Pozzo 8\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p8_test,predictions8)),
     "\tR^2:\t", metrics.r2_score(p8_test,predictions8))

print(65*"=","\nDepth to Groundwater Pozzo 9\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p9_test,predictions9)),
     "\tR^2:\t", metrics.r2_score(p9_test,predictions9))

The volume data appear to be very strong predictors of the depth to groundwater at each Pozzo. Restricting the data to the days on which volume was recorded costs hundreds of data points, yet the models that include the volume features still perform better.

---

## MAE, RMSE, and $R^2$
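
For reference, the three statistics reported in these sections are assumed here to follow the standard definitions (the ones implemented by scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score`), where $y_i$ is an observed test value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the observed test values, and $n$ the number of test points:

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

MAE and RMSE carry the units of the target (meters for depth to groundwater), while $R^2$ is unitless and becomes negative when a model does worse than always predicting $\bar{y}$.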

In [None]:
# Repeat the 20-split evaluation for each of the nine Pozzo targets.
for i, target in enumerate([p1, p2, p3, p4, p5, p6, p7, p8, p9], start=1):
    table = run_method(X, target, n=20,
                       method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
    print_method_results(f"Depth to Groundwater Pozzo {i}")

## Luco Overview
This aquifer has one target, which has been modeled with random forests. The model was fit on 20 random train/test splits of the original data to obtain the following statistics.

#### Depth to Groundwater Podere Casetta 
|     ML Method      |       MAE      |       RMSE     | $R^2$       |
|:------------------:|:--------------:|:--------------:|:-----------:|
| **Random Forests** | 0.031 ± 0.0077 | 0.046 ± 0.0109 | 0.98 ± 0.01 |
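
The repeated-split statistics can be sketched roughly as follows. This is a minimal illustration, not the notebook's own `run_method`: the helper name `repeated_split_scores`, the `test_size=0.2` default, the use of the population standard deviation, and the synthetic data are all assumptions, and the $\mu \pm 2\sigma$ format is assumed to match the convention stated for the Lake Bilancino tables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def repeated_split_scores(X, y, n=20, test_size=0.2, seed=0, **rf_kwargs):
    """Hypothetical helper: score a random forest on n random train/test splits.

    Returns {metric: (mean, 2 * standard deviation)} for MAE, RMSE and R^2.
    """
    scores = {"MAE": [], "RMSE": [], "R2": []}
    for i in range(n):
        # A different random_state per repetition gives a different split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + i)
        model = RandomForestRegressor(random_state=seed + i, **rf_kwargs)
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        scores["MAE"].append(mean_absolute_error(y_te, pred))
        scores["RMSE"].append(np.sqrt(mean_squared_error(y_te, pred)))
        scores["R2"].append(r2_score(y_te, pred))
    return {k: (float(np.mean(v)), 2 * float(np.std(v))) for k, v in scores.items()}


# Synthetic stand-in data (not the Luco measurements).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

summary = repeated_split_scores(X_demo, y_demo, n=5, n_estimators=50, max_features="sqrt")
for name, (mu, two_sigma) in summary.items():
    print(f"{name}: {mu:.3f} ± {two_sigma:.3f}")
```

Reporting the spread alongside the mean shows how sensitive each statistic is to which rows happen to land in the test set.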


The rainfall features in this dataset are averaged into one column, and the temperature features into another, to reduce the variance of the resulting model. Any rows still missing data after these two operations are dropped for both the targets and the features.

In [None]:
luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', luco.shape[0], '\n',
      35*'=')
luco.isna().sum()/(luco.shape[0]) *100

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(luco.isnull(), cbar=False)
plt.show()

Because we are trying to predict only one output, we can average rainfall data to make up for missing information.
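
Averaging can make up for missing information because a row-wise pandas mean skips `NaN` by default, so a day gets a value as long as at least one station reported. A small sketch on a toy frame (the station names `Rainfall_A` and `Rainfall_B` are hypothetical, not Luco columns):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical station names (not the Luco columns).
rain = pd.DataFrame({
    "Rainfall_A": [1.0, np.nan, 3.0, np.nan],
    "Rainfall_B": [3.0, 4.0, np.nan, np.nan],
})

# mean(axis=1) skips NaN by default, so a row is missing only if every station is missing.
rain["Mean_Rainfall"] = rain[["Rainfall_A", "Rainfall_B"]].mean(axis=1)
print(rain["Mean_Rainfall"].tolist())  # [2.0, 4.0, 3.0, nan]
```

Only the last row, where both stations are missing, stays empty after averaging.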

No temperature data is missing, but the temperature columns can still be averaged into a single column for simplification: local temperatures are likely to be correlated with one another, and they are unlikely to affect an aquifer through evaporation the way they would affect a lake.

A lot of information is missing from the Pozzo columns. In the Doganella aquifer data, it was worth losing years of data to include the volume data in the model, so we will attempt to keep that data here as well, although the target data is very spotty around those later dates.

In [None]:
corr = luco.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

In [None]:
luco['Mean_Rainfall'] = luco.loc[:,'Rainfall_Simignano':'Rainfall_Monteroni_Arbia_Biena'].mean(axis=1)

luco.drop(labels=['Rainfall_Simignano','Rainfall_Siena_Poggio_al_Vento','Rainfall_Mensano',
                  'Rainfall_Montalcinello','Rainfall_Monticiano_la_Pineta','Rainfall_Sovicille',
                  'Rainfall_Ponte_Orgia','Rainfall_Scorgiano','Rainfall_Pentolina',
                  'Rainfall_Monteroni_Arbia_Biena'], axis=1, inplace=True)

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(luco.isnull(), cbar=False)
plt.show()

Now we will remove any missing rows from the target data.
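
As a small illustration of what `dropna(subset=...)` does, the sketch below uses a toy frame with hypothetical column names: only rows whose target is missing are removed, and gaps in other columns are left for later steps.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names (not the Luco data).
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 4.0],
    "target":  [10.0, 20.0, np.nan, 40.0],
})

# Drop only the rows where the target itself is missing.
kept = df.dropna(subset=["target"])
print(len(kept))                           # 3
print(int(kept["feature"].isna().sum()))   # 1 (feature gaps are untouched)
```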

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(luco.dropna(subset=['Depth_to_Groundwater_Podere_Casetta']).isnull(), cbar=False)
plt.show()

In [None]:
luco.dropna(subset=['Depth_to_Groundwater_Podere_Casetta'], inplace=True)

In [None]:
corr = luco.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

Only a few hundred data points exist for the other depth variables, which we can use as predictors in this challenge. Pozzo 1 and 3 have a high negative correlation with the response variable, so keeping those columns is probably justified. Let's see exactly how many non-null values those columns contain.

Fortunately, those non-null depth to groundwater features overlap with the non-null volume data, so we don't have to concern ourselves as much with the volume data.

In [None]:
luco.dropna(subset=[
    'Depth_to_Groundwater_Pozzo_1',
    'Depth_to_Groundwater_Pozzo_3',
    'Depth_to_Groundwater_Pozzo_4'], inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', luco.shape[0], '\n',
      35*'=')
luco.isna().sum()/(luco.shape[0]) *100

445 data points should be enough to work with, considering the benefit from keeping the depth to groundwater columns.

We can now simplify the temperature data and begin visualization.

In [None]:
luco['Mean_Temperature'] = luco.loc[:,'Temperature_Siena_Poggio_al_Vento':'Temperature_Monteroni_Arbia_Biena'].mean(axis=1)

luco.drop(labels=['Temperature_Siena_Poggio_al_Vento','Temperature_Mensano','Temperature_Pentolina',
                  'Temperature_Monteroni_Arbia_Biena'], axis=1, inplace=True)

In [None]:
cols = [
    'Date','Mean_Rainfall','Mean_Temperature',
    'Volume_Pozzo_1','Volume_Pozzo_3','Volume_Pozzo_4',
    'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
    'Depth_to_Groundwater_Podere_Casetta'
       ]

luco = luco[cols]
luco.info()

In [None]:
corr = luco.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

An interesting observation: the individual temperatures each had a low correlation $( |\text{corr}| < 0.2 )$ with our target, but after averaging across regions, the correlation between temperature and our target is stronger.
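
One possible explanation (an assumption, not something verified on the Luco data) is that each station measures a shared regional signal plus its own independent noise, and averaging cancels part of that noise. The sketch below reproduces the effect on purely synthetic data; the noise scales and the number of stations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_stations = 2000, 4

# Synthetic data: one shared regional signal plus independent noise per station.
regional = rng.normal(size=n_days)
stations = regional[:, None] + rng.normal(scale=5.0, size=(n_days, n_stations))
target = regional + rng.normal(scale=0.5, size=n_days)

# Correlation of each station with the target, then of the station average.
individual = [np.corrcoef(stations[:, j], target)[0, 1] for j in range(n_stations)]
averaged = np.corrcoef(stations.mean(axis=1), target)[0, 1]

print("individual:", np.round(individual, 2))
print("averaged:  ", round(averaged, 2))
```

In this setup the averaged column correlates more strongly with the target than any single station does, which matches the pattern seen in the heatmap.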

---

## Luco Visualization

In [None]:
fig = plt.figure(figsize=(11,4))
ax1 = fig.add_axes([0,0,1,0.5])
ax2 = fig.add_axes([0,-0.7,1,0.5])
ax3 = fig.add_axes([0,-1.4,1,0.5])
ax4 = fig.add_axes([0,-2.1,1,0.5])

ax1.set_title('Depth to Groundwater Podere Casetta')
ax1.set_xlabel('Time')
ax1.set_ylabel('meters')

ax2.set_title('Mean Volume')
ax2.set_xlabel('Time')
ax2.set_ylabel('mc')

ax3.set_title('Temperature')
ax3.set_xlabel('Time')
ax3.set_ylabel('celsius')

ax4.set_title('Mean Region Rainfall')
ax4.set_xlabel('Time')
ax4.set_ylabel('mm')

ax1.tick_params(axis='x', bottom=False, labelbottom=False)
ax2.tick_params(axis='x', bottom=False, labelbottom=False)
ax3.tick_params(axis='x', bottom=False, labelbottom=False)
ax4.tick_params(axis='x', bottom=False, labelbottom=False)

ax1.plot(luco['Date'], luco['Depth_to_Groundwater_Podere_Casetta'], label='Depth to Groundwater', color='g')
ax2.plot(luco['Date'], luco[['Volume_Pozzo_1','Volume_Pozzo_3','Volume_Pozzo_4']].mean(axis=1), label='Volume', color='y')
ax3.plot(luco['Date'], luco['Mean_Temperature'], label='Temperature', color='r')
ax4.plot(luco['Date'], luco['Mean_Rainfall'], label='Rainfall', color='b')

plt.show()

In [None]:
fig = plt.figure(figsize=(11,4))
ax1 = fig.add_axes([0,0,1,0.5])
ax2 = fig.add_axes([0,-0.7,1,0.5])
ax3 = fig.add_axes([0,-1.4,1,0.5])
ax4 = fig.add_axes([0,-2.1,1,0.5])

ax1.set_title('Depth to Groundwater Podere Casetta')
ax1.set_xlabel('Time')
ax1.set_ylabel('meters')

ax2.set_title('Depth to Groundwater Pozzo 1')
ax2.set_xlabel('Time')
ax2.set_ylabel('meters')

ax3.set_title('Depth to Groundwater Pozzo 3')
ax3.set_xlabel('Time')
ax3.set_ylabel('meters')

ax4.set_title('Depth to Groundwater Pozzo 4')
ax4.set_xlabel('Time')
ax4.set_ylabel('meters')

ax1.tick_params(axis='x', bottom=False, labelbottom=False)
ax2.tick_params(axis='x', bottom=False, labelbottom=False)
ax3.tick_params(axis='x', bottom=False, labelbottom=False)
ax4.tick_params(axis='x', bottom=False, labelbottom=False)

ax1.plot(luco['Date'], luco['Depth_to_Groundwater_Podere_Casetta'], label='Depth to Groundwater', color='g')
ax2.plot(luco['Date'], luco['Depth_to_Groundwater_Pozzo_1'], label='Depth Pozzo 1', color='y')
ax3.plot(luco['Date'], luco['Depth_to_Groundwater_Pozzo_3'], label='Depth Pozzo 3', color='r')
ax4.plot(luco['Date'], luco['Depth_to_Groundwater_Pozzo_4'], label='Depth Pozzo 4', color='b')

plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Mean_Rainfall'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Mean_Temperature'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Volume_Pozzo_1'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Volume_Pozzo_3'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Volume_Pozzo_4'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Depth_to_Groundwater_Pozzo_1'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Depth_to_Groundwater_Pozzo_3'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

In [None]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x=luco['Depth_to_Groundwater_Pozzo_4'], y=luco['Depth_to_Groundwater_Podere_Casetta'])
plt.show()

The data again look like they would be fit well by decision trees, so we can start directly with a random forest.

---

## Modeling Luco

In [None]:
X = luco.drop(['Date','Depth_to_Groundwater_Podere_Casetta'], axis=1)
y = luco['Depth_to_Groundwater_Podere_Casetta']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=5)
model.fit(X_train,y_train)

In [None]:
predictions = model.predict(X_test)

In [None]:
fig = plt.figure(figsize=(10,6))

ax = fig.add_axes([0,0,1,1])

col = np.where(y_test<predictions,'indigo','peru')

ax.scatter(x=y_test, y=predictions, c=col)
ax.plot(y_test,y_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater Podere Casetta')
ax.set_ylabel('Predicted Depth to Groundwater')
ax.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
print("Depth to Groundwater Podere Casetta \n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(y_test,predictions)),
     "\tR^2:\t", metrics.r2_score(y_test,predictions))

---


## MAE, RMSE, and $R^2$

In [None]:
table = run_method(X, y, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Depth to Groundwater Podere Casetta")

## Auser Overview

This aquifer has five targets, each modeled with random forests. Each target was fit with 20 random train/test splits of the original data to achieve the following statistics.

#### Random Forests 
|             Target           |       MAE      |       RMSE     | $R^2$       |
|:----------------------------:|:--------------:|:--------------:|:-----------:|
| **Depth to Groundwater LT2** | 0.020 ± 0.0121 | 0.160 ± 0.2050 | 0.92 ± 0.13 |
| **Depth to Groundwater SAL** | 0.017 ± 0.0089 | 0.060 ± 0.0702 | 0.98 ± 0.02 |
| **Depth to Groundwater PAG** | 0.016 ± 0.0025 | 0.032 ± 0.0107 | 0.99 ± 0.01 |
| **Depth to Groundwater CoS** | 0.029 ± 0.0119 | 0.090 ± 0.1274 | 0.99 ± 0.01 |
| **Depth to Groundwater DIEC** | 0.016 ± 0.0045 | 0.041 ± 0.0243 | 0.99 ± 0.01 |

#### Procedure
The rainfall features in this dataset are averaged into one column, and the temperature features into another, to reduce the variance of the resulting model. After dropping rows with missing target data, only about 13 rows with missing values remained. Linear interpolation was used to fill those gaps.

In [None]:
auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', auser.shape[0], '\n',
      35*'=')
auser.isna().sum()/(auser.shape[0]) *100

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(auser.isnull(), cbar=False)
plt.show()

---

## Auser Missing Data

In [None]:
corr = auser.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

The rainfall columns are highly correlated with one another, as are the temperature columns, so each group can be averaged as before. This is only for simplification and to lower the variance of the resulting model, since little data is missing except from the targets. As usual, we simply remove rows with missing target data.

We can see that if we remove those data points, we have only a few days remaining with missing data.

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(auser.dropna(subset=['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL',
                                 'Depth_to_Groundwater_PAG','Depth_to_Groundwater_CoS',
                                 'Depth_to_Groundwater_DIEC']).isnull(), cbar=False)
plt.show()

In [None]:
auser.dropna(subset=['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL','Depth_to_Groundwater_PAG',
                     'Depth_to_Groundwater_CoS','Depth_to_Groundwater_DIEC'], inplace=True)

In [None]:
print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', auser.shape[0], '\n',
      35*'=')
auser.isna().sum()/(auser.shape[0]) *100

Let's look at the rows where the two hydrometry columns have missing data.

In [None]:
auser[auser['Hydrometry_Monte_S_Quirico'].isnull()]

In [None]:
auser[auser['Hydrometry_Piaggione'].isnull()]

There are two options for filling these gaps: linear interpolation or a simple back fill.
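
The difference between the two options is easiest to see on a short gap. A minimal sketch with a toy series (hypothetical values, not the Auser hydrometry data):

```python
import numpy as np
import pandas as pd

# Toy series with a two-day gap (hypothetical values, not the Auser hydrometry data).
level = pd.Series([1.0, np.nan, np.nan, 4.0])

linear = level.interpolate()   # default method is 'linear'
backfilled = level.bfill()     # copy the next valid value backwards

print(linear.tolist())      # [1.0, 2.0, 3.0, 4.0]
print(backfilled.tolist())  # [1.0, 4.0, 4.0, 4.0]
```

Linear interpolation spreads the change evenly across the gap, while a back fill assumes the value jumped immediately after the last valid reading.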

In [None]:
auser.interpolate(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', auser.shape[0], '\n',
      35*'=')
auser.isna().sum()/(auser.shape[0]) *100

In [None]:
auser['Mean_Rainfall'] = auser.loc[:,'Rainfall_Gallicano':'Rainfall_Fabbriche_di_Vallico'].mean(axis=1)
auser['Mean_Temperature'] = auser.loc[:,'Temperature_Orentano':'Temperature_Lucca_Orto_Botanico'].mean(axis=1)

cols = ['Mean_Rainfall','Mean_Temperature', 'Volume_POL','Volume_CC1','Volume_CC2','Volume_CSA','Volume_CSAL',
        'Hydrometry_Monte_S_Quirico','Hydrometry_Piaggione','Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL',
        'Depth_to_Groundwater_PAG','Depth_to_Groundwater_CoS','Depth_to_Groundwater_DIEC']

auser = auser[cols]

In [None]:
corr = auser.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

---

## Auser Visualization

In [None]:
sns.pairplot(data=auser,
             x_vars=[
                 'Mean_Rainfall','Mean_Temperature', 'Volume_POL','Volume_CC1','Volume_CC2',
                 'Volume_CSA','Volume_CSAL','Hydrometry_Monte_S_Quirico','Hydrometry_Piaggione'
             ],
            y_vars=[
                'Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL','Depth_to_Groundwater_PAG',
                'Depth_to_Groundwater_CoS','Depth_to_Groundwater_DIEC'
            ])
plt.tight_layout()

---

## Auser Modeling
### Random Forest

In [None]:
# Features: every column except the five targets. Leaving a target in X would leak it into its own model.
X = auser.drop(['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL','Depth_to_Groundwater_PAG',
                'Depth_to_Groundwater_CoS','Depth_to_Groundwater_DIEC'],axis=1)

LT2 = auser['Depth_to_Groundwater_LT2']
SAL = auser['Depth_to_Groundwater_SAL']
PAG = auser['Depth_to_Groundwater_PAG']
CoS = auser['Depth_to_Groundwater_CoS']
DIEC = auser['Depth_to_Groundwater_DIEC']

X_LT2_train, X_LT2_test, LT2_train, LT2_test = train_test_split(X, LT2, test_size=0.2)
X_SAL_train, X_SAL_test, SAL_train, SAL_test = train_test_split(X, SAL, test_size=0.2)
X_PAG_train, X_PAG_test, PAG_train, PAG_test = train_test_split(X, PAG, test_size=0.2)
X_CoS_train, X_CoS_test, CoS_train, CoS_test = train_test_split(X, CoS, test_size=0.2)
X_DIEC_train, X_DIEC_test, DIEC_train, DIEC_test = train_test_split(X, DIEC, test_size=0.2)

In [None]:
LT2_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
LT2_model.fit(X_LT2_train,LT2_train)

SAL_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
SAL_model.fit(X_SAL_train,SAL_train)

PAG_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
PAG_model.fit(X_PAG_train,PAG_train)

CoS_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
CoS_model.fit(X_CoS_train,CoS_train)

DIEC_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=10)
DIEC_model.fit(X_DIEC_train,DIEC_train)

In [None]:
LT2_predictions = LT2_model.predict(X_LT2_test)
SAL_predictions = SAL_model.predict(X_SAL_test)
PAG_predictions = PAG_model.predict(X_PAG_test)
CoS_predictions = CoS_model.predict(X_CoS_test)
DIEC_predictions = DIEC_model.predict(X_DIEC_test)

In [None]:
# Auser Results
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])
ax3 = fig.add_axes([0,-1.3,1,1])
ax4 = fig.add_axes([1.2,-1.3,1,1])
ax5 = fig.add_axes([0,-2.6,1,1])

col1 = np.where(LT2_test<LT2_predictions,'indigo','peru')
col2 = np.where(SAL_test<SAL_predictions,'indigo','peru')
col3 = np.where(PAG_test<PAG_predictions,'indigo','peru')
col4 = np.where(CoS_test<CoS_predictions,'indigo','peru')
col5 = np.where(DIEC_test<DIEC_predictions,'indigo','peru')

ax.scatter(x=LT2_test, y=LT2_predictions, c=col1)
ax.plot(LT2_test,LT2_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater LT2')
ax.set_ylabel('Predicted Depth to Groundwater LT2')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=SAL_test, y=SAL_predictions, c=col2)
ax2.plot(SAL_test,SAL_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater SAL')
ax2.set_ylabel('Predicted Depth to Groundwater SAL')
ax2.set_title('Predicted and true values on the test set')

ax3.scatter(x=PAG_test, y=PAG_predictions, c=col3)
ax3.plot(PAG_test,PAG_test, color='r') # Line of accurate predictions
ax3.set_xlabel('Depth to Groundwater PAG')
ax3.set_ylabel('Predicted Depth to Groundwater PAG')
ax3.set_title('Predicted and true values on the test set')

ax4.scatter(x=CoS_test, y=CoS_predictions, c=col4)
ax4.plot(CoS_test,CoS_test, color='r') # Line of accurate predictions
ax4.set_xlabel('Depth to Groundwater CoS')
ax4.set_ylabel('Predicted Depth to Groundwater CoS')
ax4.set_title('Predicted and true values on the test set')

ax5.scatter(x=DIEC_test, y=DIEC_predictions, c=col5)
ax5.plot(DIEC_test,DIEC_test, color='r') # Line of accurate predictions
ax5.set_xlabel('Depth to Groundwater DIEC')
ax5.set_ylabel('Predicted Depth to Groundwater DIEC')
ax5.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluation
print("Depth to Groundwater LT2\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(LT2_test,LT2_predictions)),
     "\tR^2:\t", metrics.r2_score(LT2_test,LT2_predictions))

print(65*"=","\nDepth to Groundwater SAL\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(SAL_test,SAL_predictions)),
     "\tR^2:\t", metrics.r2_score(SAL_test,SAL_predictions))

print(65*"=","\nDepth to Groundwater PAG\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(PAG_test,PAG_predictions)),
     "\tR^2:\t", metrics.r2_score(PAG_test,PAG_predictions))

print(65*"=","\nDepth to Groundwater CoS\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(CoS_test,CoS_predictions)),
     "\tR^2:\t", metrics.r2_score(CoS_test,CoS_predictions))

print(65*"=","\nDepth to Groundwater DIEC\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(DIEC_test,DIEC_predictions)),
     "\tR^2:\t", metrics.r2_score(DIEC_test,DIEC_predictions))

---

## MAE, RMSE, and $R^2$

In [None]:
# Repeat the 20-split evaluation for each of the five Auser targets.
for name, target in [('LT2', LT2), ('SAL', SAL), ('PAG', PAG), ('CoS', CoS), ('DIEC', DIEC)]:
    table = run_method(X, target, n=20,
                       method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
    print_method_results("Depth to Groundwater " + name)

## Petrignano Overview
This aquifer has two targets, each modeled with random forests. Each model was fit with 20 random train/test splits of the original data to achieve the following statistics.

#### Random Forests 
|             Target           |       MAE      |       RMSE     | $R^2$       |
|:----------------------------:|:--------------:|:--------------:|:-----------:|
| **Depth to Groundwater P24** | 1.761 ± 0.0954 | 2.357 ± 0.1257 | 0.40 ± 0.07 |
| **Depth to Groundwater P25** | 1.738 ± 0.0730 | 2.345 ± 0.0777 | 0.37 ± 0.07 |

#### Procedure
The temperature columns were averaged into a single feature, and all rows with missing data were then removed.

In [None]:
pet = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv')

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', pet.shape[0], '\n',
      35*'=')
pet.isna().sum()/(pet.shape[0]) *100

In [None]:
fig = plt.figure(figsize=(8,5))
sns.heatmap(pet.isnull(), cbar=False)
plt.show()

In [None]:
corr = pet.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

There is enough data here that we can simply remove the few rows with missing data.

In [None]:
pet['Mean_Temperature'] = pet.loc[:,'Temperature_Bastia_Umbra':'Temperature_Petrignano'].mean(axis=1)

cols = ['Date','Rainfall_Bastia_Umbra','Mean_Temperature','Volume_C10_Petrignano',
        'Hydrometry_Fiume_Chiascio_Petrignano','Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']

pet = pet[cols]

pet.dropna(inplace=True)

print('\t% Null Values by Column','\n',
      '\tDataFrame Size:', pet.shape[0], '\n',
      35*'=')
pet.isna().sum()/(pet.shape[0]) *100

In [None]:
corr = pet.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    fig,ax = plt.subplots(figsize=(12,10))
    ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap='viridis')

---

## Petrignano Visualization

In [None]:
sns.pairplot(data=pet,
             x_vars=[
                 'Rainfall_Bastia_Umbra','Mean_Temperature','Volume_C10_Petrignano',
                 'Hydrometry_Fiume_Chiascio_Petrignano'
             ],
            y_vars=[
                'Depth_to_Groundwater_P24','Depth_to_Groundwater_P25'
            ])
plt.tight_layout()

---


## Modeling Petrignano
### Random Forest

In [None]:
X = pet.drop(['Date','Depth_to_Groundwater_P24','Depth_to_Groundwater_P25'],axis=1)
p24 = pet['Depth_to_Groundwater_P24']
p25 = pet['Depth_to_Groundwater_P25']

X_p24_train, X_p24_test, p24_train, p24_test = train_test_split(X, p24, test_size=0.2)
X_p25_train, X_p25_test, p25_train, p25_test = train_test_split(X, p25, test_size=0.2)

In [None]:
p24_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=15)
p25_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', min_samples_split=15)

In [None]:
p24_model.fit(X_p24_train,p24_train)
p25_model.fit(X_p25_train,p25_train)

In [None]:
predictions1 = p24_model.predict(X_p24_test)
predictions2 = p25_model.predict(X_p25_test)

In [None]:
# Results
fig = plt.figure(figsize=(4,3))

ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([1.2,0,1,1])

col1 = np.where(p24_test<predictions1,'indigo','peru')
col2 = np.where(p25_test<predictions2,'indigo','peru')

ax.scatter(x=p24_test, y=predictions1, c=col1)
ax.plot(p24_test,p24_test, color='r') # Line of accurate predictions
ax.set_xlabel('Depth to Groundwater P24')
ax.set_ylabel('Predicted Depth to Groundwater P24')
ax.set_title('Predicted and true values on the test set')

ax2.scatter(x=p25_test, y=predictions2, c=col2)
ax2.plot(p25_test,p25_test, color='r') # Line of accurate predictions
ax2.set_xlabel('Depth to Groundwater P25')
ax2.set_ylabel('Predicted Depth to Groundwater P25')
ax2.set_title('Predicted and true values on the test set')

plt.show()

In [None]:
# Evaluation
print("Depth to Groundwater P24\n")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p24_test,predictions1)),
     "\tR^2:\t", metrics.r2_score(p24_test,predictions1))

print(65*"=","\nDepth to Groundwater P25\n ")
print("RMSE:\t", np.sqrt(metrics.mean_squared_error(p25_test,predictions2)),
     "\tR^2:\t", metrics.r2_score(p25_test,predictions2))

---

## MAE, RMSE, and $R^2$

In [None]:
table = run_method(X, p24, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Depth to Groundwater P24")

table = run_method(X, p25, n=20,
                   method=RandomForestRegressor,n_estimators=500, max_features='sqrt')
print_method_results("Depth to Groundwater P25")