# **Challenge**

The challenge is to determine how various factors influence the water availability of each presented waterbody, so that water availability can be ensured for each time interval of the year.

In this notebook, I will build one mathematical model to predict `Depth_to_Groundwater_P24` for Aquifer_Petrignano. A similar approach can be reused to predict the groundwater depths of the other water bodies.

Also, I'll try to keep this notebook short and informative.

This famous quote aligns with the objective of this challenge.

![](https://i.pinimg.com/originals/cd/49/f4/cd49f44a536a4c04c9eb225091df43b9.jpg)

# Data Viz and Preprocessing

In [None]:
import numpy as np 
import pandas as pd 

import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
df = pd.read_csv("../input/acea-water-prediction/Aquifer_Petrignano.csv")

In [None]:
df

**Features** ->
1. **Rainfall_Bastia_Umbra** -> quantity of rain falling 
1. **Temperature_Bastia_Umbra**, **Temperature_Petrignano** -> temperature at Bastia Umbra and at Petrignano
1. **Volume_C10_Petrignano** -> volume of water taken from the drinking water treatment plant 
1. **Hydrometry_Fiume_Chiascio_Petrignano** -> groundwater level 

Targets -> **Depth_to_Groundwater_P24** and **Depth_to_Groundwater_P25**


The Date column is provided in string format. Let's convert it to the `datetime64[ns]` data type as our first step.

In [None]:
# Parse the day/month/year strings into datetime64[ns]
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

Let's visualize the data.

In [None]:
f, ax = plt.subplots(nrows=7, ncols=1, figsize=(20, 45))

sns.lineplot(x=df.Date, y=df.Rainfall_Bastia_Umbra.fillna(np.inf), ax=ax[0])
ax[0].set_title('Rainfall', fontsize=16)
ax[0].set_ylabel(ylabel='Rainfall', fontsize=16)


sns.lineplot(x=df.Date, y=df.Temperature_Bastia_Umbra.fillna(np.inf), ax=ax[1])
ax[1].set_title('Temperature_Bastia_Umbra', fontsize=16)
ax[1].set_ylabel(ylabel='Temperature_Bastia_Umbra', fontsize=16)


sns.lineplot(x=df.Date, y=df.Temperature_Petrignano.fillna(np.inf), ax=ax[2])
ax[2].set_title('Temperature_Petrignano', fontsize=16)
ax[2].set_ylabel(ylabel='Temperature_Petrignano', fontsize=16)


sns.lineplot(x=df.Date, y=df.Volume_C10_Petrignano.fillna(np.inf), ax=ax[3])
ax[3].set_title('Volume', fontsize=16)
ax[3].set_ylabel(ylabel='Volume', fontsize=16)


sns.lineplot(x=df.Date, y=df.Hydrometry_Fiume_Chiascio_Petrignano.fillna(np.inf), ax=ax[4])
ax[4].set_title('Hydrometry', fontsize=16)
ax[4].set_ylabel(ylabel='Hydrometry', fontsize=16)


sns.lineplot(x=df.Date, y=df.Depth_to_Groundwater_P24.fillna(np.inf), ax=ax[5])
ax[5].set_title('Depth_to_Groundwater_P24', fontsize=16)
ax[5].set_ylabel(ylabel='Depth_to_Groundwater_P24', fontsize=16)

sns.lineplot(x=df.Date, y=df.Depth_to_Groundwater_P25.fillna(np.inf), ax=ax[6])
ax[6].set_title('Depth_to_Groundwater_P25', fontsize=16)
ax[6].set_ylabel(ylabel='Depth_to_Groundwater_P25', fontsize=16)


plt.show()

## Handling Missing Values

* Exploration shows that there are many missing values before 2009. This older data would not be very beneficial, so we drop it.
* All the columns except `Date` have missing values, as shown below, so we have to decide how to handle them.
* For imputation, filling missing values with the mean would be one option, but I find linear interpolation (the pandas `.interpolate` method) to be a better one, because each filled value carries some of the characteristics of its neighbouring values.

In [None]:
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)

In [None]:
df.isnull().sum()

In [None]:
df = df.interpolate(method ='linear', limit_direction ='forward')
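To make the difference between the two imputation options concrete, here is a small sketch on a made-up series (the values are purely illustrative and not taken from the dataset). Mean filling puts the same number in every gap, while linear interpolation follows the straight line between the neighbouring observations.

```python
import numpy as np
import pandas as pd

# Toy series with a two-step gap (illustrative values only)
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# Option 1: every gap receives the same global mean
mean_filled = s.fillna(s.mean())

# Option 2: gaps follow the straight line between the neighbours
linear_filled = s.interpolate(method="linear")

print(mean_filled.tolist())
print(linear_filled.tolist())
```

In the interpolated version the gap is filled with 12 and 14, which keeps the local trend between 10 and 16 intact.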

## Resampling the data

We downsample the data from daily values to weekly means, which reduces some of the noise.

In [None]:
df = df[['Date', 'Rainfall_Bastia_Umbra', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25', 'Temperature_Bastia_Umbra', 'Temperature_Petrignano', 'Volume_C10_Petrignano', 'Hydrometry_Fiume_Chiascio_Petrignano']].resample('7D', on='Date').mean().reset_index(drop=False)
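As a quick illustration of what `resample('7D', on='Date').mean()` does, here is a sketch on a toy frame (the column name `value` and the dates are made up). Fourteen daily rows collapse into two rows, each holding the mean of a 7-day window. Note that the windows start at the first date in the data, not at calendar week boundaries.

```python
import numpy as np
import pandas as pd

# Fourteen consecutive days with values 0..13 (toy data)
daily = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=14, freq="D"),
    "value": np.arange(14, dtype=float),
})

# Average each 7-day window and turn the Date index back into a column
weekly = daily.resample("7D", on="Date").mean().reset_index(drop=False)

print(weekly)
```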

## Time Features
We derive some calendar features from `Date` that can be beneficial for prediction.

In [None]:
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['day_of_year'] = df['Date'].dt.dayofyear
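# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
df['week_of_year'] = df['Date'].dt.isocalendar().week.astype(int)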
df['quarter'] = df['Date'].dt.quarter
df['season'] = df.month%12 // 3 + 1

df[['Date', 'year', 'month', 'day', 'day_of_year', 'week_of_year', 'quarter', 'season']].head()

In [None]:
df.head(2)
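The `season` expression `month % 12 // 3 + 1` is a bit cryptic, so here is a small sketch that spells out the mapping it produces for all twelve months. Reading 1 to 4 as winter, spring, summer and autumn (meteorological seasons in the northern hemisphere) is my interpretation of the numbers, not something stored in the data.

```python
import pandas as pd

months = pd.Series(range(1, 13))

# December wraps around to 0, so Dec/Jan/Feb fall into the same bucket
season = months % 12 // 3 + 1

mapping = dict(zip(months.tolist(), season.tolist()))
print(mapping)
```

December, January and February get 1, March to May get 2, June to August get 3, and September to November get 4.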

## Differencing Features
Differencing can be used to remove a series' dependence on time, the so-called temporal dependence. It can help stabilize the mean of a time series by removing changes in its level.

In [None]:
df['Rainfall_Bastia_Umbra_diff_1'] = df['Rainfall_Bastia_Umbra'].diff(periods = 1)
df['Temperature_Bastia_Umbra_diff_1'] = df['Temperature_Bastia_Umbra'].diff(periods = 1)
df['Temperature_Petrignano_diff_1'] = df['Temperature_Petrignano'].diff(periods = 1)
df['Volume_C10_Petrignano_diff_1'] = df['Volume_C10_Petrignano'].diff(periods = 1)
df['Hydrometry_Fiume_Chiascio_Petrignano_diff_1'] = df['Hydrometry_Fiume_Chiascio_Petrignano'].diff(periods = 1)
    
df['Rainfall_Bastia_Umbra_diff_2'] = df['Rainfall_Bastia_Umbra'].diff(periods = 2)
df['Temperature_Bastia_Umbra_diff_2'] = df['Temperature_Bastia_Umbra'].diff(periods = 2)
df['Temperature_Petrignano_diff_2'] = df['Temperature_Petrignano'].diff(periods = 2)
df['Volume_C10_Petrignano_diff_2'] = df['Volume_C10_Petrignano'].diff(periods = 2)
df['Hydrometry_Fiume_Chiascio_Petrignano_diff_2'] = df['Hydrometry_Fiume_Chiascio_Petrignano'].diff(periods = 2)
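Here is a small sketch of what the two `periods` settings compute, using a made-up series. One detail worth keeping in mind: `diff(periods=2)` is the difference to the value two steps back, which is not the same thing as second-order differencing (`diff().diff()`). The first rows of each result are `NaN` because there is no earlier value to subtract.

```python
import pandas as pd

# Toy series (squares), illustrative only
s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])

lag1 = s.diff(periods=1)        # s[t] - s[t-1]
lag2 = s.diff(periods=2)        # s[t] - s[t-2]
second_order = s.diff().diff()  # difference of the differences

print(pd.DataFrame({"lag1": lag1, "lag2": lag2, "second_order": second_order}))
```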

# Prediction using XGBoost

In [None]:
X = df.drop(['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Date'], axis=1)
y1 = df.Depth_to_Groundwater_P24
y2 = df.Depth_to_Groundwater_P25

I've used XGBoost with hyperparameter optimisation via `GridSearchCV`. `TimeSeriesSplit` is used for cross-validation instead of a random split such as the default `train_test_split`, because time series data has to be divided chronologically so that the model is never validated on data older than its training data.

The hyperparameters chosen are
* 'max_depth':range(1,10,2)
* 'min_child_weight':range(1,10,2)
* 'n_estimators' : [100,1000,10000]
* 'learning_rate' : [0.01,0.1]
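To get a feel for how expensive this grid is, here is a quick count of the combinations, written in plain Python. The variable names are mine, and `n_splits = 2` is assumed to match the cross-validation setting used for the search.

```python
import math

# Same grid as listed above
param_grid = {
    "max_depth": range(1, 10, 2),
    "min_child_weight": range(1, 10, 2),
    "n_estimators": [100, 1000, 10000],
    "learning_rate": [0.01, 0.1],
}

# Number of distinct parameter combinations in the grid
n_combinations = math.prod(len(values) for values in param_grid.values())

# Each combination is fitted once per cross-validation split
n_splits = 2  # assumption: matches the TimeSeriesSplit setting
n_fits = n_combinations * n_splits

print(n_combinations, n_fits)
```

Since a third of those fits use 10000 trees, the search can take a while to run.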

In [None]:
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import xgboost as xgb

model = xgb.XGBRegressor()
param_search = {'max_depth':range(1,10,2),'min_child_weight':range(1,10,2), 'n_estimators' : [100,1000,10000], 'learning_rate' : [0.01,0.1]}

tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y1)
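To show how `TimeSeriesSplit` divides the rows, here is a sketch on nine dummy samples (the array is a placeholder, only its length matters). Each training fold contains only rows that come before the corresponding test fold, and the training window grows from one split to the next.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Nine dummy rows standing in for nine consecutive weeks
X_dummy = np.arange(9).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=2)
splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv_demo.split(X_dummy)
]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```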

We can easily find the best parameters for our xgboost model.

In [None]:
gsearch.best_params_

In [None]:
xgb1 = xgb.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=27)

In [None]:
xgb1.fit(X,y1)

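To see how well the model is performing, I evaluate and visualise it on the last 80 data points. Note that `xgb1` was fitted on all rows, including these 80, so the scores below measure in-sample fit rather than accuracy on unseen data.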

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

y_val = y1[-80:]
X_val = X[-80:]

y_pred = xgb1.predict(X_val)

In [None]:
print(mean_absolute_error(y_val, y_pred))
print(math.sqrt(mean_squared_error(y_val, y_pred)))

The metric results seem quite satisfactory!
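For an out-of-sample estimate, the last rows would have to be withheld before calling `fit`, since `xgb1` above is trained on every row. Below is a minimal sketch of such a chronological hold-out on toy data, together with MAE and RMSE computed by hand. `chronological_split` is a hypothetical helper of mine, and the "predictions" are made-up numbers standing in for a model's output.

```python
import numpy as np
import pandas as pd

def chronological_split(X, y, n_val):
    """Hypothetical helper: keep the last n_val rows for validation."""
    return X.iloc[:-n_val], X.iloc[-n_val:], y.iloc[:-n_val], y.iloc[-n_val:]

# Toy data: ten time-ordered rows
X_toy = pd.DataFrame({"feature": np.arange(10, dtype=float)})
y_toy = pd.Series(np.arange(10, dtype=float) * 2.0)

X_tr, X_va, y_tr, y_va = chronological_split(X_toy, y_toy, n_val=3)

# Stand-in for predictions from a model fitted on (X_tr, y_tr) only
toy_pred = y_va.to_numpy() + np.array([1.0, -1.0, 2.0])

errors = y_va.to_numpy() - toy_pred
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

print(mae, rmse)
```

In this notebook the same idea would mean fitting on everything except the last 80 rows and predicting only those 80.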

Let's plot actual vs. predicted values for these 80 observations.

In [None]:
plt.figure(figsize=(16, 6))
sns.lineplot(x=df.Date[-80:], y=y_val, legend='brief', label= 'Actual')
sns.lineplot(x=df.Date[-80:], y=y_pred, legend='brief', label= 'Predicted')

The plot agrees with the MAE and RMSE scores above and shows that the predicted values are close to the actual values.

A similar procedure can be used to predict `Depth_to_Groundwater_P25`.

### Please UPVOTE if you find this notebook useful [source of motivation for me :)] and share your suggestions on further improvement.