In [None]:
%pip install linearmodels

In [None]:
# First, import the necessary packages.
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import missingno as msno
import plotly.express as px
import statsmodels.api as sm
from IPython.display import Image, display
from statsmodels.tsa.seasonal import seasonal_decompose
from linearmodels import PanelOLS
from linearmodels.panel import RandomEffects
from linearmodels.panel import PooledOLS
from linearmodels.panel import FamaMacBeth
from linearmodels.panel import compare

%matplotlib inline
plt.style.use('ggplot')
sns.set_style(style="darkgrid")

## 1. Data description

Full information about the data can be found in the overview section on Kaggle.
* <a href="https://www.kaggle.com/c/acea-water-prediction/data">Acea Smart Water Analytics</a>.
    
We have nine different datasets, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. The main task of this challenge is to analyze the given datasets broadly and to build models that explain the underlying patterns in the data and predict well on new datasets belonging to other waterbodies. <br>
There are four types of waterbodies: 
1. water springs (for which three datasets are provided),
2. a lake (for which one dataset is provided),
3. a river (for which one dataset is provided),
4. aquifers (for which four datasets are provided).

### 1.1 Aquifer

The following information is taken from Wikipedia:
<em>An aquifer is an underground layer of water-bearing permeable rock, rock fractures or unconsolidated materials (gravel, sand, or silt). Groundwater can be extracted using a water well. Aquifers occur from near-surface to deeper than 9,000 metres. Those closer to the surface are not only more likely to be used for water supply and irrigation, but are also more likely to be replenished by local rainfall. Overexploitation can lead to the exceeding of the practical sustained yield; i.e., more water is taken out than can be replenished.</em>

![Aquifer](Data/Aquifer1.png)

Many factors affect water levels in aquifers. Some changes are due to natural phenomena, and others are caused by human activities. According to [The Missouri Department of Natural Resources](https://dnr.mo.gov/geology/wrc/docs/WhyWaterLevelsChange.pdf), water levels mostly change for the following reasons:
1. <strong><em>The quantity of rain falling.</em></strong> The most significant water-level changes due to recharge generally occur during springtime of the year when precipitation is generally greatest and evaporation and plant usage rates are low.
2. <strong><em>Temperature.</em></strong> Most cities and towns use considerably more water during the summer months than other times of the year, when average temperature is high.
3. <strong><em>Atmospheric Pressure Changes.</em></strong> Changes in atmospheric pressure can also cause groundwater levels to fluctuate. Changes in barometric pressure will cause water levels in some wells penetrating confined or semi-confined aquifers to change. The relationship is inverse. An increase in air pressure will cause water level in the well to fall, and a decrease in air pressure will cause water-level in the well to rise. 
4. <strong><em>Aquifer Deformation.</em></strong> Water-level changes due to aquifer deformation are commonly due to either Earth tides, or earthquakes. Other external stresses caused by heavy trucks and trains can also cause groundwater fluctuations in some aquifers.

Our data contains no features for <strong><em>atmospheric pressure changes</em></strong> or <strong><em>aquifer deformation</em></strong>.

### 1.2 Water spring

The following information is taken from Wikipedia:
<em>A spring is a point at which water flows from an aquifer to the Earth's surface. It is a component of the hydrosphere.
Spring discharge, or resurgence, is determined by the spring's recharge basin. Factors that affect the recharge include the size of the area in which groundwater is captured, the amount of precipitation, the size of capture points, and the size of the spring outlet. Water may leak into the underground system from many sources including permeable earth, sinkholes, and losing streams. A spring is the result of an aquifer being filled to the point that the water overflows onto the land surface.</em>

![Spring](Data/Spring.jpg)

### 1.3 River

The following information is taken from Wikipedia:
<em>A river is a natural flowing watercourse, usually freshwater, flowing towards an ocean, sea, lake or another river. In some cases a river flows into the ground and becomes dry at the end of its course without reaching another body of water.</em>

There are many factors, both natural and human-induced, that cause rivers to continuously change (https://www.usgs.gov):<br>
Natural mechanisms
1. Sedimentation of lakes and wetlands
2. Runoff from rainfall and snowmelt
3. Evaporation from soil and surface-water bodies
4. Transpiration by vegetation
5. Ground-water discharge from aquifers
6. Ground-water recharge from surface-water bodies

Human-induced mechanisms
1. Surface-water withdrawals and transbasin diversions
2. River-flow regulation for hydropower and navigation
3. Construction, removal, and sedimentation of reservoirs and stormwater detention ponds
4. Stream channelization and levee construction
5. Drainage or restoration of wetlands
6. Land-use changes such as urbanization that alter rates of erosion, infiltration, overland flow, or evapotranspiration

## 2. Aquifer "Auser" 

### 2.1 Features

Description: This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. Indeed, the north subsystem is a water table (or unconfined) aquifer while the south subsystem is an artesian (or confined) groundwater.

The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector by the LT2 well.

An unconfined aquifer is an aquifer whose upper water surface (water table) is at atmospheric pressure, and thus is able to rise and fall. <strong><em>Such aquifers are usually closer to the Earth's surface than confined aquifers are, and as such are impacted by drought conditions sooner than confined aquifers.</em></strong>



So, according to the above information, <strong>Aquifer "Auser"</strong> should have a pronounced seasonal cycle, and factors such as rainfall and temperature should have a direct impact on its water levels.
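
One quick way to check a series for such a seasonal cycle is to average it by calendar month. The sketch below does this on a synthetic daily series, not on the Auser data; the helper name `monthly_climatology` and all numbers are made up for illustration.

```python
import numpy as np
import pandas as pd


def monthly_climatology(series: pd.Series) -> pd.Series:
    """Mean value per calendar month (1-12) of a date-indexed series."""
    return series.groupby(series.index.month).mean()


# Synthetic daily "depth to groundwater" with a yearly cycle peaking in mid-April.
dates = pd.date_range("2010-01-01", "2013-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
depth = pd.Series(-5.0 + np.cos(2 * np.pi * (day_of_year - 105) / 365.25), index=dates)

climatology = monthly_climatology(depth)
# Difference between the highest and lowest monthly mean: a rough seasonal amplitude.
seasonal_range = climatology.max() - climatology.min()

print(climatology.round(2))
print("Month with the highest mean level:", climatology.idxmax())
```

A large gap between the highest and lowest monthly means, relative to the overall spread of the series, points to a strong seasonal component.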

In [None]:
import os
print(os.listdir("../input"))

In [None]:
Aquifer_Auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv', sep=',')
Aquifer_Auser.info()

In [None]:
# Target (output) columns: depth to groundwater in the five wells
target_cols = ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_PAG', 'Depth_to_Groundwater_CoS', 'Depth_to_Groundwater_DIEC']
Feature_matrix = Aquifer_Auser.drop(columns=target_cols)
Output_matrix = Aquifer_Auser[target_cols]

In [None]:
# Percentage of missing values per column, rounded to the nearest integer
proportion_missing_values = np.rint(Aquifer_Auser.isnull().mean() * 100).astype(int)
ind = range(len(Aquifer_Auser.columns))

plt.figure(figsize=(17,10))
plt.bar(Aquifer_Auser.columns, proportion_missing_values, width=0.8, label='values', color='blue', edgecolor='blue')
plt.xticks(ind, Aquifer_Auser.columns)
plt.xlabel("Features")
plt.ylabel("% of missing values")
plt.title("Missing values")
# Call plt.ylim instead of assigning to it; leave headroom for the bar labels
plt.ylim(0, 105)

# Annotate each bar with its percentage
for index, data in enumerate(proportion_missing_values):
    plt.text(x=index, y=data+1, s=f"{data}%", fontdict=dict(fontsize=12))

# rotate axis labels
plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')

plt.show()

In [None]:
msno.matrix(Aquifer_Auser.replace("nan", np.nan))

From the charts above, it seems that the depth levels of the wells were not measured during the early years of observation.
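
The same observation can be checked numerically by computing, for each column, the share of missing values and the first date that has a measurement. This is a minimal sketch on a small made-up frame; the column names and values are illustrative and not taken from the Auser file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: the well depth starts two days late.
toy = pd.DataFrame(
    {
        "Rainfall_Example": [0.0, 1.2, 0.0, 3.4, 0.0],
        "Depth_to_Groundwater_Example": [np.nan, np.nan, -5.1, -5.0, -4.9],
    },
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
)

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# First date with a non-missing value, per column.
first_valid = toy.apply(lambda col: col.first_valid_index())

print(missing_pct)
print(first_valid)
```

Applied to `Aquifer_Auser` with a date index, `first_valid` would show when each well's record begins.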

In [None]:
# Show summary statistics for the numerical variables only
def group_numeric(pdf):
    numeric_cols = list(pdf.select_dtypes([np.int64,np.float64]).columns)
    display(pdf[numeric_cols].describe())
    
group_numeric(Aquifer_Auser)

In [None]:
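# Plot distributions of feature variables
def plot_dist(pdf):
    for i, col in enumerate(pdf.select_dtypes([np.int64,np.float64]).columns):
        plt.figure(i, figsize=(12,8))
        # sns.distplot is deprecated; histplot with a KDE overlay is the closest replacement
        sns.histplot(pdf[col], kde=True, stat="density")
        
plot_dist(Feature_matrix) 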

In [None]:
# Drop rows with missing values (copy so that later column assignments do not warn)
Aquifer_Auser_dop_missing = Aquifer_Auser.dropna().copy()
# Assumption: dates in the CSV are written as DD/MM/YYYY, hence dayfirst=True (check the file)
Aquifer_Auser_dop_missing['Date'] = pd.to_datetime(Aquifer_Auser_dop_missing['Date'], dayfirst=True)
# Use the date as the index
Aquifer_Auser_dop_missing = Aquifer_Auser_dop_missing.set_index("Date")
# Split into features and outputs (depth to groundwater in the five wells)
Feature_matrix = Aquifer_Auser_dop_missing[Aquifer_Auser_dop_missing.columns[~Aquifer_Auser_dop_missing.columns.isin(['Depth_to_Groundwater_LT2', "Depth_to_Groundwater_SAL", "Depth_to_Groundwater_PAG", "Depth_to_Groundwater_CoS", "Depth_to_Groundwater_DIEC"])]].copy()
Output_matrix = Aquifer_Auser_dop_missing[['Depth_to_Groundwater_LT2', "Depth_to_Groundwater_SAL", "Depth_to_Groundwater_PAG", "Depth_to_Groundwater_CoS", "Depth_to_Groundwater_DIEC"]].copy()

In [None]:
# Plot lines of feature variables
def plot_line(pdf):
    cols_plot = list(pdf.columns)
    axes = pdf[cols_plot].plot(linewidth=0.5, figsize=(14, 12), subplots=True)
    for ax in axes:
        ax.set_ylabel('')

In [None]:
for i in ['Rainfall', 'Temperature', "Volume", 'Hydrometry']:
    plot_line(Feature_matrix.filter(regex=i)) 

In [None]:
# Box plots of each variable, grouped by year
def plot_bar(pdf):
    fig, axes = plt.subplots(len(pdf.columns), 1, figsize=(17, 25), sharex=True, squeeze=False)
    axes = axes.ravel()  # always a 1-D array, even for a single column
    for name, ax in zip(list(pdf.columns), axes):
        sns.boxplot(data=pdf, x=pdf.index.year, y=name, ax=ax)
        ax.set_title(name)
        # Remove the automatic x-axis label from all but the bottom subplot
        if ax is not axes[-1]:
            ax.set_xlabel('')

In [None]:
for i in ['Rainfall', 'Temperature', "Volume", 'Hydrometry']:
    plot_bar(Feature_matrix.filter(regex=i)) 

In [None]:
corrMatrix = Feature_matrix.corr()
fig, ax = plt.subplots(figsize=(25,15)) 
sns.heatmap(corrMatrix, annot=True, square=True, center=0)
plt.show()

### 2.2 Output variables

In [None]:
# Plot distributions of output variables (plot_dist is defined in section 2.1)
plot_dist(Output_matrix) 

In [None]:
plot_line(Output_matrix) 

In [None]:
plot_bar(Output_matrix) 

In [None]:
# Box plots of each output variable, grouped by calendar month
fig, axes = plt.subplots(len(Output_matrix.columns), 1, figsize=(17, 25), sharex=True)
for name, ax in zip(list(Output_matrix.columns), axes):
    sns.boxplot(data=Output_matrix, x=Output_matrix.index.month, y=name, ax=ax)
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax is not axes[-1]:
        ax.set_xlabel('')

In [None]:
corrMatrix = Output_matrix.corr()
fig, ax = plt.subplots(figsize=(15,12)) 
sns.heatmap(corrMatrix, annot=True, square=True, center=0)
plt.show()

### 2.1 Linear modeling

In [None]:
# function for extracting year, month and day
def extract_time(pdf):
    pdf["Year"] = pdf.index.year
    pdf["Month"] = pdf.index.month
    pdf["Day"] = pdf.index.day
extract_time(Feature_matrix)
extract_time(Output_matrix)

In [None]:
Feature_matrix.set_index(['Year', 'Month'], inplace=True)
Output_matrix.set_index(['Year', 'Month'], inplace=True)

#### 2.1.1 Basic linear regression (OLS estimation)

The model is given by
$$ y_{i,t} = \beta_{0} +  \sum_{j=1}^{p} X_{i,j,t} \beta_j + \epsilon_{i,t}$$
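
Pooled OLS ignores the panel structure and estimates one coefficient vector by least squares over all observations. The NumPy sketch below illustrates this on simulated data; the coefficients, noise level and sample size are made up, and it is not the `PooledOLS` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_features = 500, 3
true_beta = np.array([1.5, -2.0, 0.5])  # slope coefficients (assumed for the simulation)
true_intercept = 4.0                    # beta_0 (assumed for the simulation)

X = rng.normal(size=(n_obs, n_features))
y = true_intercept + X @ true_beta + rng.normal(scale=0.1, size=n_obs)

# Add a column of ones so that beta_0 is estimated together with the slopes.
X_design = np.column_stack([np.ones(n_obs), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept:", beta_hat[0])
print("slopes:", beta_hat[1:])
```

The column of ones plays the same role as `sm.add_constant` in the cell below.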


In [None]:
# Explanatory variables (the features) plus a constant term
exog_vars = sm.add_constant(Feature_matrix)
output_vars = list(Output_matrix.columns)
pooled_res_output = []
# The last column of Output_matrix is the helper 'Day' column, so it is skipped
for i in output_vars[:-1]:
    model = PooledOLS(Output_matrix[i], exog_vars)
    pooled_res = model.fit()
    pooled_res_output.append(pooled_res)
    print(pooled_res)
    

#### 2.1.2 The random effects model

The model is given by
$$ y_{i,t} = \beta_{0} + \alpha_i + \sum_{j=1}^{p} X_{i,j,t} \beta_j + \epsilon_{i,t}$$
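
In this model $\alpha_i$ is a random intercept shared by all observations of entity $i$, so the errors within one entity are correlated. The sketch below simulates only the composite error $\alpha_i + \epsilon_{i,t}$ with made-up variances and recovers the share of variance due to $\alpha_i$. It illustrates the error structure, not the estimator used by `RandomEffects`.

```python
import numpy as np

rng = np.random.default_rng(2)

n_entities, n_periods = 2000, 10
sigma_alpha, sigma_eps = 1.0, 1.0  # assumed standard deviations

alpha = rng.normal(scale=sigma_alpha, size=(n_entities, 1))      # one draw per entity
eps = rng.normal(scale=sigma_eps, size=(n_entities, n_periods))  # one draw per observation
u = alpha + eps                                                  # composite error

# The average within-entity variance estimates sigma_eps^2.
within_var = u.var(axis=1, ddof=1).mean()
# The variance of the entity means estimates sigma_alpha^2 + sigma_eps^2 / T.
between_var = u.mean(axis=1).var(ddof=1)
sigma_alpha2_hat = between_var - within_var / n_periods

# Share of the error variance due to the entity effect (intraclass correlation).
icc_hat = sigma_alpha2_hat / (sigma_alpha2_hat + within_var)
icc_true = sigma_alpha**2 / (sigma_alpha**2 + sigma_eps**2)

print("estimated share:", icc_hat)
print("true share:", icc_true)
```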

In [None]:
# Explanatory variables (the features) plus a constant term
exog_vars = sm.add_constant(Feature_matrix)
output_vars = list(Output_matrix.columns)
RE_res_output = []
# The last column of Output_matrix is the helper 'Day' column, so it is skipped
for i in output_vars[:-1]:
    model = RandomEffects(Output_matrix[i], exog_vars)
    RE_res = model.fit()
    RE_res_output.append(RE_res)
    print(RE_res)

#### 2.1.3 The Fama-MacBeth estimator

The model is given by
$$ y_{i,t} = \sum_{j=1}^{p} X_{i,j,t} \beta_j + \epsilon_{i,t}$$
The Fama-MacBeth estimator is computed by performing $T$ regressions, one for each time period, using all available entity observations. Denote the estimate of the model parameters in period $t$ as $\hat{\beta}_{j,t}$. The reported estimator is then the average of these per-period estimates:
$$ \hat{\beta}_j = \frac{1}{T} \sum_{t=1}^{T} \hat{\beta}_{j,t} $$
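
The procedure can be sketched in NumPy as one cross-sectional regression per period, followed by an average of the per-period coefficients. The data are simulated and the coefficients are made up. The standard error shown is the simple one based on the spread of the per-period estimates, not the Bartlett-kernel covariance requested in the cell below.

```python
import numpy as np

rng = np.random.default_rng(1)

n_periods, n_entities = 40, 60
true_beta = np.array([0.5, 2.0])  # [intercept, slope], assumed for the simulation

per_period_betas = []
for t in range(n_periods):
    x = rng.normal(size=n_entities)
    X_t = np.column_stack([np.ones(n_entities), x])
    y_t = X_t @ true_beta + rng.normal(scale=0.5, size=n_entities)
    # Cross-sectional OLS using only the observations from period t.
    beta_t, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    per_period_betas.append(beta_t)

per_period_betas = np.array(per_period_betas)  # shape (T, number of coefficients)

# Fama-MacBeth point estimate: average of the per-period estimates.
beta_fm = per_period_betas.mean(axis=0)
# Simple standard error: standard deviation of the estimates divided by sqrt(T).
se_fm = per_period_betas.std(axis=0, ddof=1) / np.sqrt(n_periods)

print("estimate:", beta_fm)
print("standard error:", se_fm)
```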

In [None]:
# Explanatory variables (the features) plus a constant term
exog_vars = sm.add_constant(Feature_matrix)
output_vars = list(Output_matrix.columns)
FM_res_output = []
# The last column of Output_matrix is the helper 'Day' column, so it is skipped
for i in output_vars[:-1]:
    model = FamaMacBeth(Output_matrix[i], exog_vars)
    # Kernel (Bartlett) covariance for the standard errors
    FM_res = model.fit(cov_type='kernel', kernel='bartlett')
    FM_res_output.append(FM_res)
    print(FM_res)

#### 2.1.4 Comparison of Random Effects, OLS and Fama-MacBeth models

In [None]:
# Side-by-side comparison of the three models for each output variable
for i in range(len(FM_res_output)):
    print(compare({'FM': FM_res_output[i], 'RE': RE_res_output[i], 'Pooled': pooled_res_output[i]}))