<hr style="border:0.2px solid black"> </hr>

<figure>
  <IMG SRC="img/ntnu_logo.png" WIDTH=250 ALIGN="right">
</figure>

**<ins>Course:</ins>** TVM4174 - Hydroinformatics for Smart Water Systems

# <ins>Example:</ins> Pipe Rehabilitation dataset
    
*Developed by David B. Steffelbauer*

Version 1.0
    
<hr style="border:0.2px solid black"> </hr>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')

### Load the dataset

In [None]:
data = pd.read_csv('data/pipes.csv', index_col=0)
data

### Correlation Analysis

In [None]:
corr = data.corr()
corr
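
By default, `DataFrame.corr()` returns the Pearson correlation coefficient for every pair of numeric columns. As a minimal sketch of what a single entry of that matrix means, the coefficient can be computed by hand. The arrays `age` and `bursts_toy` are hypothetical toy values, not taken from `pipes.csv`, and `pearson` is a helper introduced here for illustration only.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    # covariance divided by the product of the standard deviations
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical toy values (assumption, not from the dataset)
age = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
bursts_toy = np.array([1.0, 1.0, 2.0, 4.0, 5.0])

r = pearson(age, bursts_toy)
print(round(r, 3))
```

The value lies between -1 and 1. It only measures linear association, so a low coefficient does not rule out a nonlinear relationship.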

Plotting the correlation matrix as a heatmap makes it faster to skim the data by visual inspection. The seaborn gallery shows how to do that [($\rightarrow$ link)](https://seaborn.pydata.org/examples/many_pairwise_correlations.html)

In [None]:
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, vmin=-1.0, vmax=1.0, center=0, mask=mask, cmap=cmap,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

In [None]:
sns.pairplot(data);

## Linear Regression

### Build a Regression model

Our task is to find out which factors are most influential in causing pipes to break. Hence, we use the number of pipe bursts as our output variable and the other features of the dataset as our inputs.

In [None]:
X = data.drop('bursts', axis=1)
y = data['bursts']
X

Additionally, we want to make forecasts with our model, so we use 95 % of the data for training and hold back the remaining 5 % to test whether the model can make forecasts on data it has not seen.

We will use scikit-learn to build our regression model and follow the typical scikit-learn procedure as explained in the lecture.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=123)

In [None]:
X_test
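
Conceptually, a random hold-out split only shuffles the row positions and sets a fraction of them aside. The following is a hand-rolled sketch of that idea, not scikit-learn's actual implementation. The helper `split_indices` and the sample count of 200 are assumptions made for illustration.

```python
import numpy as np

def split_indices(n_samples, test_size=0.05, seed=123):
    """Return (train_idx, test_idx) for a random hold-out split."""
    rng = np.random.default_rng(seed)       # seeded for reproducibility
    shuffled = rng.permutation(n_samples)   # random order of row positions
    n_test = int(round(test_size * n_samples))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset size (assumption)
train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

With a DataFrame, these positions could be applied through `X.iloc[train_idx]` and `X.iloc[test_idx]`. A fixed seed plays the same role as `random_state`: the split is random but repeatable.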

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()  # make an instance of the LinearRegression class
results = model.fit(X_train, y_train)

Let's first see the prediction accuracy on the training dataset:

In [None]:
y_predict = model.predict(X_train)

In [None]:
fig, ax = plt.subplots()
plt.plot(y_train, y_predict, 'ko', alpha=0.3)
plt.xlabel(r'bursts', fontsize=14)
plt.ylabel(r'predicted bursts', fontsize=14)

In [None]:
def MSE(a, b):
    """Mean squared error between two equally long arrays or Series."""
    return np.mean((a - b) ** 2)

In [None]:
MSE(y_predict, y_train)
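
An MSE value is hard to judge on its own. One common reference point is the error of a baseline that always predicts the mean of the observations. The ratio of the two errors gives the coefficient of determination $R^2$. The sketch below uses hypothetical toy numbers (`y_true`, `y_hat`) rather than the pipe data.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

# Hypothetical observed and predicted burst counts (assumption)
y_true = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_hat = np.array([0.5, 1.0, 1.5, 3.5, 3.5])

model_mse = mse(y_true, y_hat)
# Baseline: always predict the mean of the observations
baseline_mse = mse(y_true, np.full_like(y_true, y_true.mean()))

rmse = np.sqrt(model_mse)               # same unit as the target
r2 = 1.0 - model_mse / baseline_mse     # 1 = perfect, 0 = no better than the mean
print(model_mse, baseline_mse, rmse, r2)
```

An $R^2$ close to 1 means the model explains most of the variance. A value near 0 means it does little better than predicting the mean.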

We can also use the model to forecast the unknown test data:

In [None]:
y_forecast = model.predict(X_test)

In [None]:
MSE(y_forecast, y_test)

In [None]:
fig, ax = plt.subplots()
plt.plot(y_test, y_forecast, 'ko', alpha=0.7)
plt.xlabel(r'bursts', fontsize=14)
plt.ylabel(r'predicted bursts', fontsize=14)

So, what are the most influential factors? We can see that in the regression coefficients.

In [None]:
results.coef_

In [None]:
coefs = pd.Series(results.coef_, index=X.columns)
coefs

But be careful: the features in the original data have different magnitudes, so the raw coefficients cannot be compared directly with each other. We can circumvent this by scaling the data prior to fitting. The standard scaler standardizes each feature to zero mean and a standard deviation of 1.

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Keep the original index so that X_scaled stays aligned with y
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.05, random_state=123)

X_train
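
To see why scaling matters for reading the coefficients, here is a small sketch on synthetic data. The feature names `length_m` and `age_decades`, their distributions, and the helper `ols_coefs` are all assumptions for illustration. Standardization is done by hand as $(x - \mu) / \sigma$ with the population standard deviation, which is the convention `StandardScaler` follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical features on very different scales (assumption)
length_m = rng.normal(1000.0, 300.0, n)   # pipe length in metres
age_decades = rng.normal(4.0, 1.0, n)     # pipe age in decades
X_demo = np.column_stack([length_m, age_decades])
y_demo = 0.002 * length_m + 0.5 * age_decades + rng.normal(0.0, 0.1, n)

def ols_coefs(X, y):
    """Least-squares slopes; the intercept is fitted but not returned."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)                  # population std (ddof=0)
X_demo_scaled = (X_demo - mean) / std

raw = ols_coefs(X_demo, y_demo)
scaled = ols_coefs(X_demo_scaled, y_demo)
print("raw coefficients:   ", raw)
print("scaled coefficients:", scaled)
```

The raw coefficient of the length feature is tiny only because the feature is measured in large units. After standardization, each coefficient equals the raw one multiplied by the feature's standard deviation, so it describes the effect of a one-standard-deviation change and the features can be compared.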

In [None]:
model = LinearRegression()
results = model.fit(X_train, y_train)

In [None]:
# Label the coefficients with the feature names for readability
coefs = pd.Series(results.coef_, index=X.columns)
coefs

### Same in Statsmodels

In [None]:
import statsmodels.api as sm

# statsmodels does not fit an intercept unless a constant column is added;
# scikit-learn's LinearRegression does this automatically.
X_train = sm.add_constant(X_train)

model = sm.OLS(y_train, X_train)
results = model.fit()
print(results.summary())
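
The effect of the constant column can be illustrated with plain NumPy least squares. This is a sketch on hypothetical data that follows $y = 3 + 2x$ exactly. The names `x_demo` and `y_line` are introduced here for illustration.

```python
import numpy as np

# Hypothetical data following y = 3 + 2*x exactly (assumption)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_line = 3.0 + 2.0 * x_demo

# Without a constant column the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x_demo[:, None], y_line, rcond=None)

# With a column of ones the first coefficient acts as the intercept
A = np.column_stack([np.ones_like(x_demo), x_demo])
intercept, slope = np.linalg.lstsq(A, y_line, rcond=None)[0]

print("no constant:  ", slope_only)
print("with constant:", intercept, slope)
```

Without the constant, the slope is biased because it has to absorb the offset. With the column of ones, which `sm.add_constant` prepends under the name `const`, the intercept and slope are recovered.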

### Lasso regression

In [None]:
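from sklearn.linear_model import Lasso

# Note: the model is fitted on the unscaled features. The L1 penalty is
# scale-dependent, so the coefficients are not directly comparable across features.
lasso_reg = Lasso(alpha=0.05)
lasso_reg.fit(X, y)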

In [None]:
# Coefficients shrunk exactly to zero mark features that the Lasso dropped
l_coefs = pd.Series(lasso_reg.coef_, index=X.columns)
l_coefs

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X, y)
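
A regression tree is grown by repeatedly choosing the feature and threshold that split the data into two groups with the smallest squared error around their group means. That is the default `squared_error` criterion of `DecisionTreeRegressor`. The sketch below shows this search for a single feature. It illustrates the criterion and is not scikit-learn's implementation. The helper `best_split` and the toy arrays are assumptions.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the single split on x that minimises
    the summed squared error around the two group means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = np.unique(x)
    best_threshold, best_sse = None, np.inf
    # Candidate thresholds: midpoints between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= threshold], y[x > threshold]
        sse = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if sse < best_sse:
            best_threshold, best_sse = float(threshold), float(sse)
    return best_threshold, best_sse

# Hypothetical toy data: older pipes burst more often (assumption)
age_toy = np.array([5, 10, 15, 40, 45, 50])
bursts_toy = np.array([0, 1, 0, 4, 5, 4])

threshold, sse = best_split(age_toy, bursts_toy)
print(threshold, sse)
```

A full tree applies the same search over all features and then recurses into each group until a stopping rule such as `max_depth` is reached. Each leaf predicts the mean of the training targets that fall into it.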

In [None]:
from sklearn.tree import export_graphviz

# class_names only applies to classifiers, so it is omitted for a regressor
export_graphviz(tree_reg,
                out_file="pipes_tree.dot", feature_names=list(X.columns),
                rounded=True,
                filled=True
                )

In [None]:
!dot -Tpng pipes_tree.dot -o pipes_tree.png

<img src="pipes_tree.png">