# Injection-induced seismic events at The Geysers geothermal field: 2003-2016

The Geysers is the world's largest geothermal field, a complex of 18 geothermal power plants drawing steam from more than 350 wells, located in the Mayacamas Mountains approximately 72 miles north of San Francisco, California. The Geysers produced about 20% of California's renewable energy in 2019.

The first commercial geothermal power plant at The Geysers was put into operation in September 1960 to tap natural steam. In the late 1980s, however, the flow of steam across the geothermal field was found to have declined: the reservoir was not recharging quickly enough to meet the required steam supply. As a result, inefficient power plants were shut down.

The geothermal reservoir is now recharged by injecting recycled wastewater from the city of Santa Rosa and the Lake County sewage treatment plants. Eighteen million gallons of treated wastewater are supplied each day.

The injection of cold water into this hot geothermal reservoir induced seismic events. A dense seismic network was installed to monitor the induced seismicity from 2003 to 2016 (data available here: http://ncedc.org/egs/catalog-search.html).

Here, I have collected the injection data from 73 injection wells located in the northwest of the geothermal field (data available here: https://www.conservation.ca.gov/calgem/Pages/WellFinder.aspx) to investigate the relation between induced seismic events and water injection.


This is an ongoing project.

## ABSTRACT:
**Objectives:**

Predict the **monthly amount of seismic energy released** and the **monthly b-value**.

**Data:**  
A seismic catalogue and the injection data for 73 injection wells, which include:
- monthly amount of injected water,
- monthly average injection rate,
- number of days in the month on which water was injected.

**Model evaluation**

Several models are used (Lasso, linear, Ridge, ElasticNet, Random Forest and XGBoost). These models are scored using the explained variance (r2), and the model with the best score is selected for the final prediction.
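
The selection step can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the pipeline used later in the notebook: the feature matrix, the chronological 120/30 split, the closed-form ridge solver and the penalty `alpha=10.0` are all assumptions made for the example.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with an unpenalised intercept (alpha=0 is ordinary least squares)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic stand-in for the monthly features and target (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(scale=0.3, size=150)

# Chronological split: the last 30 months are held out for scoring
X_train, X_test, y_train, y_test = X[:120], X[120:], y[:120], y[120:]

# Candidate models, described here only by their ridge penalty
candidates = {"linear": 0.0, "ridge": 10.0}
scores = {}
for name, alpha in candidates.items():
    coef, intercept = fit_ridge(X_train, y_train, alpha)
    scores[name] = r2_score(y_test, X_test @ coef + intercept)

# Keep the model with the highest hold-out score
best_model = max(scores, key=scores.get)
print(scores, "->", best_model)
```

Holding out the most recent months, rather than a random subset, keeps the score closer to the situation where future months are predicted from past ones.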


**Results for predicting the monthly amount of seismic energy released**


In [None]:
from IPython.display import Image
print("Model's best score to predict monthly energy released")
Image("../input/results/Models scores Energy.png")

In [None]:
from IPython.display import Image
print ("Original data vs predicted")
Image("../input/results/Energy -  original vs predicted.png")

**Results for predicting the monthly b-value**

In [None]:
from IPython.display import Image
print("Model's best score to predict monthly b-value")
Image("../input/results/Models scores bvalues.png")

In [None]:
Image("../input/results/b values -  original vs predicted.png")

[1: SEISMIC DATA](#SEISMIC_DATA)

- [1.1: Load the data](#1.1)
- [1.2: Simple Statistical seismology](#1.2)
    - [1.2.1: Plot the ECDF of the Earthquake magnitudes](#1.2)
    - [1.2.2: Computing the b-value](#1.2.2) 
- [1.3: Seismic activity evolution from 2003 to 2016](#1.3)
    - [1.3.1: b-value evolution](#1.3)
    - [1.3.2: monthly seismic energy released](#1.3.2)
    - [1.3.3: Density maps of induced-earthquakes](#1.3.3)
    
[2: INJECTION DATA](#INJECTION_DATA)
- [2.1: Load and prepare the data](#2.1)

[3: INJECTION DATA VS INDUCED SEISMICITY](#INJ_VS_SEISM)
- [3.1: injection vs seismic energy](#3.1)
- [3.2: Injection vs b_value](#3.2)
- [3.3: lag periods](#3.3)

[4: DATA PREPARATION](#Data_Preparation)
- [4.1: Dependent variables:](#4.1)
    - [4.1.1: Check the asymmetry of the probability distribution](#4.1.1)
    - [4.1.2: Log transform skewed targets:](#4.1.2)
   
- [4.2: Independent variables](#4.2) 
    - [4.2.1: define functions used for data preparation](#4.2)
    - [4.2.2: Features extraction](#4.2.2)
    - [4.2.3: Injection data: scaling, lag version, and feature reduction](#4.2.3)
    
    
- [4.3: correlations between features and target variables](#4.3)
    - [4.3.1: function used to plot correlations](#4.3.1)
    - [4.3.2: prepare dataframe to calculate correlation coefficient](#4.3.2)
    - [4.3.3:  correlation between seismic energy and:](#4.3.3)
    - [4.3.4:  correlation between b value and:](#4.3.4)
    
 - [4.5: Outliers detection:](#4.5)
    
[5: MACHINE LEARNING: prediction monthly seismic energy](#ML)
- [5.1: Features selection and data split for linear models](#5.1)
- [5.2: functions used for machine learning](#5.2)
- [5.3: Linear model](#5.3)
    - [5.3.1: Feature selection with Lasso](#5.3)
    - [5.3.2: Linear regression](#5.3.2)
    - [5.3.3: Ridge model](#5.3.3)
    - [5.3.4: ElasticNet regression](#5.3.4)
    
    
- [5.4: Decision tree methods](#5.4)
    - [5.4.1: Random Forest Regressor](#5.4.1)
    - [5.4.2: xgboost regression](#5.4.2)
    
- [5.5: SUMMARY PREDICTION ENERGY](#5.5)

[6: MACHINE LEARNING: prediction monthly b-value](#ML2)
- [6.1: Features selection and data split for linear models](#6.1)
- [6.2: Linear model](#6.2)
    - [6.2.1: Feature selection with Lasso](#6.2)
    - [6.2.2: Linear regression](#6.2.2)
    - [6.2.3: Ridge model](#6.2.3)
    - [6.2.4: ElasticNet regression](#6.2.4)
    
- [6.3: Decision tree methods](#6.3)
    - [6.3.1: Random Forest Regressor](#6.3)
    - [6.3.2: xgboost](#6.3.2)
    
- [6.4: Summary -- prediction monthly b-value --](#6.4)

**Import libraries**

In [None]:
import numpy as np
import pandas as pd

# map creation
import cartopy.crs as ccrs
import cartopy
import cartopy.feature as cfeature
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

# data visualization 
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns

# stat on data
from scipy import stats
from scipy.stats import norm, skew

# feature reduction
from sklearn.decomposition import PCA

#---- Machine learning
# data preparation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# hyperparameter tuning
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score

# # Import necessary modules for neural network
# import keras
# from keras.layers import Dense, BatchNormalization
# from keras.models import Sequential
# from keras.callbacks import EarlyStopping, ModelCheckpoint, History

# Model evaluation
import math
from sklearn import metrics
from statsmodels.graphics.api import abline_plot

**Study area location**

In [None]:
plt.figure(figsize=(7,7))

# SHOW LOCATION OF THE GEYSER GEOTHERMAL FIELD
ax1 = plt.axes(projection=ccrs.PlateCarree())
ax1.set_extent([-123, -121, 37,39], crs=ccrs.PlateCarree())

# add color
ax1.add_feature(cfeature.OCEAN.with_scale('10m'))
ax1.add_feature(cfeature.LAND)
ax1.add_feature(cfeature.RIVERS)
ax1.coastlines()

# add grid
gl = ax1.gridlines(crs=ccrs.PlateCarree(), draw_labels=True, linewidth=2, color='gray', alpha=0.5, linestyle='--')
gl.xlabels_top = False
gl.ylabels_right = False
gl.xlocator = mticker.FixedLocator([-122.5, -121.5])
gl.ylocator = mticker.FixedLocator([38.5, 37.5,37])
gl.xformatter = LONGITUDE_FORMATTER
gl.yformatter = LATITUDE_FORMATTER
gl.xlabel_style = {'size': 13, 'color': 'gray', 'weight': 'bold'}
gl.ylabel_style = {'size': 13, 'color': 'gray', 'weight': 'bold'}

# San Francisco/Coordinates
ax1.scatter(x =-122.45, y=37.7, s=2000,c='black')
ax1.text(-122.3, 37.6, 'San Francisco', size=16)

# The Geysers/Coordinates
ax1.scatter(x =-122.801046, y=38.821042, s=2000,c='green')
ax1.text(-122.62, 38.8, 'The Geysers', size=16)
ax1.text(-122.68, 38.7, 'geothermal field', size=16)

# set title
ax1.set_title('Induced seismic events location',size=15)

plt.show()

## <a id="SEISMIC_DATA"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>1: SEISMIC DATA</center></h3>

<a id="1.1"></a>
## 1.1: Load the data

In [None]:
# step1: Load the seismic catalog 
catalogue = pd.read_csv(r'../input/water-injection-induced-seismic-events/seismic catalogue 2003 2016.csv')
print('There are {} induced seismic events from 2003 to 2016 in our study area.'.format(catalogue.shape[0]))
catalogue.head(2)

In [None]:
# set date as index
catalogue['date'] = pd.to_datetime(catalogue['date'])
catalogue = catalogue.set_index('date')
catalogue.tail(2)

Note that the last month contains only one day of data!

<a id="1.2"></a>
## 1.2: Simple Statistical seismology
### 1.2.1: Plot the ECDF of the Earthquake magnitudes

In [None]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x 
    x = np.sort(data)
    # y-data for the ECDF: y  The y data of the ECDF go from 1/n to 1 in equally spaced increments. 
    y = np.arange(1,n+1) / n
    
    return x, y

In [None]:
# define figure size
fig = plt.figure(figsize=(5,5))

# figure title
fig.suptitle('Empirical Cumulative Distribution Function', fontsize=18)

mags = catalogue['Ml']
ax1 = plt.plot(*ecdf(mags),marker='.',linestyle = 'none')
ax1 = plt.xlabel('magnitude')
ax1 = plt.ylabel('ECDF')
ax1 = plt.text(1.5, 0.2, 'max magnitude {}'.format(mags.max()),fontsize=14)
ax1 = plt.text(1.7, 0.3, 'Nb. events {}'.format(len(mags)),fontsize=14)

<a id="1.2.2"></a>
###  1.2.2: Computing the b-value 

In [None]:
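# define function to compute the b-value
def b_value(mags, mt):
    """Compute the b-value from the mean magnitude above the completeness threshold mt."""
    # Extract magnitudes above completeness threshold: m
    m = mags[mags >= mt]
    # Compute b-value: b
    # NOTE: this expression is the reciprocal of the usual Aki maximum-likelihood
    # estimate, b = log10(e) / (mean(m) - mt); both conventions give 1 when b = 1
    b = (np.mean(m)-mt)*np.log(10)
    return b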

In [None]:
# Because there are plenty of earthquakes above magnitude 1, 
# we can use mt = 1 as our completeness threshold.
mt = 1

# Compute b-value
b = b_value(mags, mt)

print('The b-value is {0:.2f}'.format(b))

The seismicity at The Geysers follows the Gutenberg-Richter law quite well, with a b-value around 1.
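
For comparison, the sketch below applies the Aki maximum-likelihood estimator, $b = \log_{10}(e) / (\bar{m} - m_t)$, together with a percentile bootstrap confidence interval. It runs on a synthetic catalogue, not on the Geysers data; the helper names `aki_b_value` and `bootstrap_ci`, the true value `b_true = 1.2` and the 500 bootstrap replicates are assumptions chosen for illustration. As far as I can tell, the expression used in `b_value` above is the reciprocal of this estimator, so the two only coincide when b is close to 1.

```python
import numpy as np

def aki_b_value(mags, mt):
    """Aki maximum-likelihood b-value for magnitudes at or above mt."""
    m = mags[mags >= mt]
    return np.log10(np.e) / (np.mean(m) - mt)

def bootstrap_ci(mags, mt, n_reps=500, level=95, seed=0):
    """Percentile bootstrap confidence interval for the b-value."""
    rng = np.random.default_rng(seed)
    m = mags[mags >= mt]
    # Resample the catalogue with replacement and re-estimate b each time
    reps = np.array([aki_b_value(rng.choice(m, size=len(m)), mt) for _ in range(n_reps)])
    half = (100 - level) / 2
    return np.percentile(reps, [half, 100 - half])

# Synthetic Gutenberg-Richter catalogue (hypothetical): magnitudes above mt
# are exponentially distributed with rate b * ln(10)
rng = np.random.default_rng(42)
b_true, mt = 1.2, 1.0
synthetic_mags = mt + rng.exponential(scale=1 / (b_true * np.log(10)), size=5000)

b_hat = aki_b_value(synthetic_mags, mt)
ci_low, ci_high = bootstrap_ci(synthetic_mags, mt)
# Same expression as in b_value(): the reciprocal of the Aki estimate
reciprocal = (np.mean(synthetic_mags) - mt) * np.log(10)

print('b = {:.2f}, 95% CI [{:.2f}, {:.2f}]'.format(b_hat, ci_low, ci_high))
print('reciprocal convention: {:.2f}'.format(reciprocal))
```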

<a id="1.3"></a>
## 1.3: Seismic activity evolution from 2003 to 2016 
### 1.3.1: b-value evolution

In [None]:
# create new column for month and year
catalogue['years_months'] = catalogue.index.to_period('M')

#  calculate b_value each month
years_months = catalogue['years_months'].unique()
list_b= []
for year_month in years_months:
    df = catalogue[catalogue['years_months']==year_month]
    mags = df['Ml']
    b = b_value(mags, mt)
    list_b.append(b)

df_b_value = pd.DataFrame({'years_months':years_months,'b_value':list_b})    
df_b_value.tail(2)

In [None]:
# drop last month (march 2016)
df_b_value = df_b_value[:-1]
# convert period to date
df_b_value['years_months'] = df_b_value['years_months'].values.astype('datetime64[M]')
# set 'years_months' as index
df_b_value = df_b_value.set_index('years_months')
df_b_value.head(3)

In [None]:
# define figure size
fig = plt.figure(figsize=(10,5))

# figure 
fig.suptitle('Evolution b-value', fontsize=18)
ax1 = plt.plot(df_b_value.index,df_b_value['b_value'],marker = 'o',color='red')
ax1 = plt.xlabel('year',size=14)
ax1 = plt.ylabel('b-value',size=14)
plt.show()

<a id="1.3.2"></a>
### 1.3.2: monthly seismic energy released

In [None]:
# calculate the amount of energy released  by an earthquake
catalogue['Energy'] = 10**(11.8 + 1.5*catalogue['Ml'])
catalogue.head(2)

# calculate the monthly amount of seismic energy releases
monthly_energy_released = catalogue.groupby(['years_months'])['Energy'].sum()
monthly_energy_released = pd.DataFrame(monthly_energy_released)

# convert period to date
monthly_energy_released = monthly_energy_released.reset_index()
monthly_energy_released['Date'] = monthly_energy_released['years_months'].values.astype('datetime64[M]')

# set date as index
monthly_energy_released = monthly_energy_released.set_index('Date')

# drop unwanted column
monthly_energy_released = monthly_energy_released.drop(columns=['years_months'])

# drop the last month (March 2016 contains only one day of data)
monthly_energy_released = monthly_energy_released[:-1]

monthly_energy_released.tail(2)
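
The energy relation used above, $\log_{10} E = 11.8 + 1.5\,M$, is usually quoted with $E$ in ergs, and it makes the monthly total very sensitive to the largest events. The toy example below is a sketch on a hypothetical four-event catalogue (the dates and magnitudes are invented) showing that one magnitude unit corresponds to a factor of about 31.6 in energy, and how a single Ml 3.0 event dominates its month.

```python
import pandas as pd

def seismic_energy(magnitude):
    """Energy from log10(E) = 11.8 + 1.5 * M (same relation as the cell above)."""
    return 10 ** (11.8 + 1.5 * magnitude)

# Hypothetical mini-catalogue: three small events and one larger one over two months
toy = pd.DataFrame(
    {'Ml': [1.0, 1.2, 3.0, 1.1]},
    index=pd.to_datetime(['2010-01-03', '2010-01-15', '2010-01-20', '2010-02-07']),
)
toy['Energy'] = seismic_energy(toy['Ml'])

# One magnitude unit corresponds to a factor 10**1.5 in energy
ratio = seismic_energy(3.0) / seismic_energy(2.0)

# Monthly totals, and the share of January's energy carried by the Ml 3.0 event
monthly = toy.groupby(toy.index.to_period('M'))['Energy'].sum()
share_largest = toy['Energy'].max() / monthly.iloc[0]
print(ratio, share_largest)
```

Because of this dominance, the monthly energy behaves more like a record of the largest event of the month than like a count of events.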

In [None]:
# define figure size
fig = plt.figure(figsize=(10,5))

# figure 
fig.suptitle('Monthly seismic energy released', fontsize=18)
ax1 = plt.plot(monthly_energy_released.index,monthly_energy_released['Energy'])
ax1 = plt.xlabel('year',size=14)
ax1 = plt.ylabel('Total energy per month',size=14)
plt.show()

<a id="1.3.3"></a>
### 1.3.3: Density maps of induced-earthquakes

In [None]:
# create nodes' coordinates (bin edges) along the longitude
x= np.linspace(-122.86, -122.74, 50)
# centers of the first and last bins: mean of the first two and of the last two nodes
x_min = np.mean(x[0:2])
x_max = np.mean(x[-2:])

# same along the latitude
y= np.linspace(38.76, 38.88, 50)
y_min = np.mean(y[0:2])
y_max = np.mean(y[-2:])

# calculate the center coordinate of each bin
x_mean = np.linspace(x_min, x_max, 49)
y_mean = np.linspace(y_min, y_max, 49)
# create list with coordinates
list_mean_coord = []
for longitude in x_mean:
    for latitude in y_mean:
        list_mean_coord.append([longitude,latitude])
        
df_mean_coord = pd.DataFrame(list_mean_coord,columns=['mean_long','mean_lat'])

In [None]:
import matplotlib.tri as tri

def plot_snapchot(df,year):
    # -----------------------
    # Interpolation on a grid
    # -----------------------
    # A contour plot of irregularly spaced data coordinates
    # via interpolation on a grid.
    xmin = df['mean_long'].min()
    xmax = df['mean_long'].max()
    ymin = df['mean_lat'].min()
    ymax = df['mean_lat'].max()

    npts = df.shape[0]
    ngridx = 50
    ngridy = 50

    # Create grid values first.
    X_int = np.linspace(xmin, xmax, ngridx)
    Y_int = np.linspace(ymin, ymax, ngridy)

    # Perform linear interpolation of the data (x,y)
    # on a grid defined by (xi,yi)
    triang = tri.Triangulation(df['mean_long'], df['mean_lat'])
    interpolator = tri.LinearTriInterpolator(triang, df['count'])
    Xi, Yi = np.meshgrid(X_int, Y_int)
    Z_int  = interpolator(Xi, Yi)

    # define figure size
#     fig, ax = plt.subplots(figsize=(6,4))
    
    ax = plt.contourf(X_int, Y_int, Z_int, levels=50, cmap="RdBu_r")
    plt.title('{}'.format(year), fontsize=14)
    x = [-122.84, -122.8,-122.76]
    plt.xticks(x)
    
    # fig.colorbar()
    plt.colorbar(ax)

    return ax

In [None]:
def count_meq(year):
    data = catalogue[catalogue['year']==year]
    # df with only coordinate
    data = data[['Longitude','Latitude']]
    data = data.reset_index()
    data = data.drop('date',axis=1)
    # bin the data into equally spaced groups
    x_cut = pd.cut(data.Longitude, np.linspace(-122.86, -122.74, 50), right=False)
    y_cut = pd.cut(data.Latitude, np.linspace(38.76, 38.88, 50), right=False)

    # group and count (observed=False keeps the empty bins, so the result has 49 x 49 rows)
    result = data.groupby([x_cut,y_cut], observed=False).count()

    # rename columns and flatten df
    result.columns = ['countx','count']
    result = result.reset_index()
    # select only count
    count = result[['count']]
    # fill NaN value with zero
    count = count.fillna(0)
    # append count
    df_result = pd.concat([df_mean_coord,count],axis=1)
    return df_result
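
The same per-bin counts can also be obtained with `np.histogram2d`. The sketch below is an alternative shown on hypothetical, uniformly drawn epicentres, not a replacement for `count_meq`. One difference to keep in mind: `np.histogram2d` includes the right edge in the last bin, whereas `pd.cut(..., right=False)` excludes it.

```python
import numpy as np

# Same 50 bin edges per axis as in count_meq above -> 49 x 49 cells
lon_edges = np.linspace(-122.86, -122.74, 50)
lat_edges = np.linspace(38.76, 38.88, 50)

# Hypothetical epicentres drawn uniformly inside the study area
rng = np.random.default_rng(1)
lon = rng.uniform(-122.86, -122.74, size=1000)
lat = rng.uniform(38.76, 38.88, size=1000)

# counts[i, j] is the number of events in longitude bin i and latitude bin j
counts, _, _ = np.histogram2d(lon, lat, bins=[lon_edges, lat_edges])

# Bin centres, equivalent to x_mean and y_mean computed earlier
lon_centres = 0.5 * (lon_edges[:-1] + lon_edges[1:])
lat_centres = 0.5 * (lat_edges[:-1] + lat_edges[1:])
print(counts.shape, int(counts.sum()))
```

Flattening with `counts.ravel()` walks through the latitudes for each longitude in turn, which is the same ordering as `df_mean_coord`.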

In [None]:
# extract year from index
catalogue['year'] = pd.DatetimeIndex(catalogue.index).year

# use function count number of seismic events per bins per year
df_result_2003 = count_meq(2003)
df_result_2004 = count_meq(2004)
df_result_2005 = count_meq(2005)
df_result_2006 = count_meq(2006)
df_result_2007 = count_meq(2007)
df_result_2008 = count_meq(2008)
df_result_2009 = count_meq(2009)
df_result_2010 = count_meq(2010)
df_result_2011 = count_meq(2011)
df_result_2012 = count_meq(2012)
df_result_2013 = count_meq(2013)
df_result_2014 = count_meq(2014)
df_result_2015 = count_meq(2015)
df_result_2016 = count_meq(2016)


In [None]:
fig = plt.figure(figsize=(15,15))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

plt.suptitle('Density maps (number of earthquakes per bin) for the year: ',y=0.92,size=16)

# one panel per year, laid out on a 5 x 3 grid
for position, year in enumerate(range(2003, 2017), start=1):
    fig.add_subplot(5, 3, position)
    plot_snapchot(count_meq(year), str(year))

## <a id="INJECTION_DATA"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>2: INJECTION DATA</center></h3>

<a id="2.1"></a>
## 2.1: Load and prepare the data

Initially, I downloaded 73 files, one per injection well. From each file I selected four columns:
- date (YY/MM),
- "Gross Injected (1000kg)", which represents the amount of water injected during the month,
- "Water Injection Rate (1000 kg/hr)", 
- "Days", which is the number of days in the month on which water was injected.

Then, three files were generated:
- "Total Gross Injected Water", with the "Gross Injected (1000kg)" for the 73 wells from 1969-05-01 to 2021-02-01
- "Total Water Injection Rate", with the "Water Injection Rate (1000 kg/hr)" for the 73 wells from 1969-05-01 to 2021-02-01
- "Total days_inj", with the "days" for the 73 wells from 1969-05-01 to 2021-02-01

In these three files, the column names are the API well numbers. An API number is a "unique, permanent, numeric identifier" assigned to each well drilled for oil and gas in the United States.


In [None]:
###-------- LOAD THE INJECTION DATA
# Load monthly amount of injected water: "Gross Injected (1000kg)"
df_GIW_per_well = pd.read_csv(r'../input/water-injection-induced-seismic-events/Total Gross Injected Water.csv',sep=';')
# Load monthly injection rate: "Water Injection Rate (1000 kg/hr)"
df_RIW_per_well = pd.read_csv(r'../input/water-injection-induced-seismic-events/Total Water Injection Rate.csv',sep=';')
# Load number of days per month on which water was injected
df_days_per_well = pd.read_csv(r'../input/water-injection-induced-seismic-events/Total days_inj.csv',sep=';')

# set date as index
df_GIW_per_well['Date'] = pd.to_datetime(df_GIW_per_well['Date'])
df_GIW_per_well = df_GIW_per_well.set_index('Date')

df_RIW_per_well['Date'] = pd.to_datetime(df_RIW_per_well['Date'])
df_RIW_per_well = df_RIW_per_well.set_index('Date')

df_days_per_well['Date'] = pd.to_datetime(df_days_per_well['Date'])
df_days_per_well = df_days_per_well.set_index('Date')

# select the period during which the induced seismic events are monitored 
df_GIW_per_well = df_GIW_per_well.loc['2003-05-01':'2016-02-01']
df_RIW_per_well = df_RIW_per_well.loc['2003-05-01':'2016-02-01']
df_days_per_well = df_days_per_well.loc['2003-05-01':'2016-02-01']

df_GIW_per_well.tail(2)

In [None]:
# plot the data
df_GIW_per_well.plot(subplots=True, figsize=(15,40))
plt.show()

In [None]:
df_RIW_per_well.plot(subplots=True, figsize=(15,40))
plt.show()

In [None]:
df_days_per_well.plot(subplots=True, figsize=(15,40))
plt.show()

We can see that many injection wells were not used during this period: they were either not yet drilled or already abandoned, so we can remove them.

In [None]:
# first we drop all the columns where 'Gross Injected (1000kg)' is 0 everywhere
print('Number of columns before dropping well not used for injection during the selected period: {}'.format(df_GIW_per_well.shape[1]))
df_GIW_per_well = df_GIW_per_well.loc[:, df_GIW_per_well.any()]
print('final number of wells used: {}'.format(df_GIW_per_well.shape[1]))
print('')

In [None]:
# then we make sure to keep the same columns in the two other df
columns_to_keep = df_GIW_per_well.columns
df_RIW_per_well = df_RIW_per_well.drop(columns=[col for col in df_RIW_per_well if col not in columns_to_keep])
df_days_per_well = df_days_per_well.drop(columns=[col for col in df_days_per_well if col not in columns_to_keep])

# add prefix
df_GIW_per_well = df_GIW_per_well.add_prefix('GIW_')
df_RIW_per_well = df_RIW_per_well.add_prefix('RIW_')
df_days_per_well = df_days_per_well.add_prefix('days_')

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(15,8),sharex=True)
# fig.subplots_adjust(hspace=0.4,wspace=0.3)

df_GIW_per_well.plot(ax=axes[0],legend=None)
axes[0].set_ylabel('GIW',size=14)
df_RIW_per_well.plot(ax=axes[1],legend=None)
axes[1].set_ylabel('RIW',size=14)
df_days_per_well.plot(ax=axes[2],legend=None)
axes[2].set_ylabel('days',size=14)
axes[2].set_xlabel('year',size=14)

plt.show()

We can observe a negative value for the water injection rate (RIW). This is not physically possible, so we replace it with 0.

In [None]:
# negative injection rates are not physical: set them to zero
df_RIW_per_well = df_RIW_per_well.clip(lower=0)
    
df_RIW_per_well.plot(legend=None)
plt.ylabel('injection rate (x1000kg/hr)')
plt.show()

print('Of the initial 73 injection wells, only {} were used for injection during this period'\
      .format(df_RIW_per_well.shape[1]))

## <a id="INJ_VS_SEISM"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>3: INJECTION DATA VS INDUCED SEISMICITY</center></h3>

In [None]:
# create empty df:
df_features_vs_sismicity = pd.DataFrame()

# add column with total amount of water injected per month in the area
total_vol_inj = df_GIW_per_well.sum(axis=1)
# add the monthly seismic energy released in this area
df_features_vs_sismicity = pd.concat([monthly_energy_released,df_b_value,total_vol_inj],axis=1)
df_features_vs_sismicity.columns = ['Energy','b_value','GIW_sum']

df_features_vs_sismicity.head(2)

<a id="3.1"></a>
### 3.1: injection vs seismic energy

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

plt.title('Seasonal evolutions of the amount of injected water and the seismic activity ',size=18, y=1.06) 
 

ax.plot(df_features_vs_sismicity.index, df_features_vs_sismicity["GIW_sum"],color='darkblue')
ax.set_xlabel('year',fontsize=15)
ax.set_ylabel('Total injected water (1000kg)', color='darkblue',fontsize=15)
ax.tick_params(axis='both', which='major', labelsize=14)
ax.tick_params('y', colors='darkblue')

ax2 = ax.twinx()  #specify that the two lines share the same x-axis
ax2.plot(df_features_vs_sismicity.index,df_features_vs_sismicity['Energy'],color='green')
ax2.set_ylabel('Energy seismic', color='green',fontsize=15)
ax2.tick_params('y', colors='green')
ax2.tick_params(axis='both', which='major', labelsize=14)

plt.show()


<a id="3.2"></a>
### 3.2: Injection vs b_value

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

plt.title('Seasonal evolutions of the amount of injected water and b_value ',size=18, y=1.06) 
 

ax.plot(df_features_vs_sismicity.index, df_features_vs_sismicity["GIW_sum"],color='darkblue')
ax.set_xlabel('year',fontsize=15)
ax.set_ylabel('Total injected water (1000kg)', color='darkblue',fontsize=15)
ax.tick_params(axis='both', which='major', labelsize=14)
ax.tick_params('y', colors='darkblue')

ax2 = ax.twinx()  #specify that the two lines share the same x-axis
ax2.plot(df_features_vs_sismicity.index,df_features_vs_sismicity['b_value'],color='red')
ax2.set_ylabel('b_value', color='red',fontsize=15)
ax2.tick_params('y', colors='red')
ax2.tick_params(axis='both', which='major', labelsize=14)

plt.show()


We can clearly observe a seasonality in the amount of injected water (more water is injected during the winter months and less during the summer months), in the amount of seismic energy released every month, and in the b_value. However, there seems to be a lag between the injection peak, the seismicity peak, and the b_value peak.

<a id="3.3"></a>
### 3.3: lag periods

In [None]:
# create dictionary to store the shifted versions of the injection data
dic_GIW_shift = {}
# create dictionaries to store the shortened versions of the seismic data
dic_shorter_catalogue={}
dic_shorter_b_value={}
dic_GIW_SEISMIC={}

list_coef_GIW_Energy = []
list_coef_GIW_b_value = []

for i in range (0,6):
    # Create a time-shifted version of the injection data and remove NaN values
    dic_GIW_shift['GIW_shift{}'.format(i)] = df_features_vs_sismicity["GIW_sum"].shift(i).dropna()
    # create a shorter version of the seismic data covering the same months as the lagged version
    # (shift(i) loses the first i months, so the first i months are dropped here as well;
    # this keeps the two series paired month by month, also when they are passed as plain arrays)
    if i == 0:
        dic_shorter_catalogue['shorter_catalogue{}'.format(i)] = df_features_vs_sismicity['Energy']
        dic_shorter_b_value['shorter_b_value{}'.format(i)] = df_features_vs_sismicity['b_value']
    else:
        dic_shorter_catalogue['shorter_catalogue{}'.format(i)] = df_features_vs_sismicity['Energy'].iloc[i:]
        dic_shorter_b_value['shorter_b_value{}'.format(i)] = df_features_vs_sismicity['b_value'].iloc[i:]
    # store time-shifted series with the associated seismic moment  
    dic_GIW_SEISMIC['{}'.format(i)] = pd.concat([dic_GIW_shift['GIW_shift{}'.format(i)],
                                                 dic_shorter_catalogue['shorter_catalogue{}'.format(i)],
                                                 dic_shorter_b_value['shorter_b_value{}'.format(i)]],
                                                 axis=1,join='inner')
   # calculate the correlation coefficient 
    coef_GIW_Energy = dic_GIW_shift['GIW_shift{}'.format(i)].corr(dic_shorter_catalogue['shorter_catalogue{}'.format(i)] )
    list_coef_GIW_Energy.append(coef_GIW_Energy)
    coef_GIW_b_value = dic_GIW_shift['GIW_shift{}'.format(i)].corr(dic_shorter_b_value['shorter_b_value{}'.format(i)] )
    list_coef_GIW_b_value.append(coef_GIW_b_value)

In [None]:
list_month = list(range (0,6))

fig = plt.figure(2,figsize=(16,10))

ax1 = fig.add_subplot(2,2,1)
# plot coef for the lagged version
ax1.plot(list_month,list_coef_GIW_Energy,marker='o',linestyle = '-')

# Label axes and show plot
ax1 = plt.xlabel('lagged version (month)',fontsize=15)
ax1 = plt.xticks(fontsize=14)
ax1 = plt.ylabel('correlation coefficient',fontsize=15)
ax1 = plt.yticks(fontsize=14)
ax1 = plt.title('correlation coefficient between seismic activity and \nseveral lagged versions of the injection data',fontsize=16,y=1.05)

ax2 = fig.add_subplot(2,2,2)
# plot coef for the lagged version and a linear-regression-line
ax2.scatter(dic_GIW_shift['GIW_shift2'],dic_shorter_catalogue['shorter_catalogue2'],marker='o')
m, b = np.polyfit(dic_GIW_shift['GIW_shift2'],dic_shorter_catalogue['shorter_catalogue2'], 1)
ax2 = plt.plot(dic_GIW_shift['GIW_shift2'],m*dic_GIW_shift['GIW_shift2']+b,color='k')
# add coef correlation at location x and y
ax2 = plt.text(2500000, 2e18, 'R=0.34',fontsize=15)
# # Label axes and show plot
ax2 = plt.xlabel('2 Months lagged version of injection data',fontsize=15)
ax2 = plt.xticks(fontsize=14)
ax2 = plt.ylabel('monthly seismic energy released',fontsize=15)
ax2 = plt.yticks(fontsize=14)
ax2 = plt.title('Injection data (2 months shifted) versus \n seismic activity',fontsize=16,y=1.05)


plt.show()    

In [None]:
list_month = list(range (0,6))

fig = plt.figure(2,figsize=(16,10))

ax1 = fig.add_subplot(2,2,1)
# plot coef for the lagged version
ax1.plot(list_month,list_coef_GIW_b_value,marker='o',linestyle = '-', color= 'red')

# Label axes and show plot
ax1 = plt.xlabel('lagged version (month)',fontsize=15)
ax1 = plt.xticks(fontsize=14)
ax1 = plt.ylabel('correlation coefficient',fontsize=15)
ax1 = plt.yticks(fontsize=14)
ax1 = plt.title('correlation coefficient between b_value and \nseveral lagged versions of the injection data',fontsize=16,y=1.05)

ax2 = fig.add_subplot(2,2,2)
# plot coef for the lagged version and a linear-regression-line
ax2.scatter(dic_GIW_shift['GIW_shift2'],dic_shorter_b_value['shorter_b_value2'],marker='o')
m, b = np.polyfit(dic_GIW_shift['GIW_shift2'],dic_shorter_b_value['shorter_b_value2'], 1)
ax2 = plt.plot(dic_GIW_shift['GIW_shift2'],m*dic_GIW_shift['GIW_shift2']+b,color='k')
# add coef correlation at location x and y
# ax2 = plt.text(2500000, 52, 'R=0.34',fontsize=15)
# # Label axes and show plot
ax2 = plt.xlabel('2 Months lagged version of injection data',fontsize=15)
ax2 = plt.xticks(fontsize=14)
ax2 = plt.ylabel('monthly b_value',fontsize=15)
ax2 = plt.yticks(fontsize=14)
ax2 = plt.title('Injection data (2 months shifted) versus \n b_value',fontsize=16,y=1.05)


plt.show()    

The highest correlation coefficient is found at a lag of 2 months, meaning that the seismicity peak occurs 2 months after the injection peak. The same holds for the b-value.
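
The lag search can be checked on a series where the answer is known in advance. The sketch below builds two synthetic monthly series (hypothetical data, with an imposed delay of two months and an arbitrary noise level) and recovers the lag with `Series.shift` and `Series.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical): seasonal injection, and a response
# that repeats the injection signal two months later, plus noise
rng = np.random.default_rng(3)
months = pd.date_range('2003-05-01', periods=154, freq='MS')
t = np.arange(len(months))
injection = pd.Series(1 + np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=len(t)), index=months)
response = pd.Series(1 + np.sin(2 * np.pi * (t - 2) / 12) + 0.1 * rng.normal(size=len(t)), index=months)

# Series.corr aligns on the index and ignores the NaN rows created by shift(),
# so each coefficient pairs injection at month t - lag with the response at month t
lag_corr = pd.Series({lag: injection.shift(lag).corr(response) for lag in range(6)})
best_lag = lag_corr.idxmax()
print(lag_corr.round(2))
print('best lag:', best_lag, 'months')
```

With a strongly seasonal signal, neighbouring lags are also highly correlated, so a peak in this curve indicates a delay but does not by itself demonstrate a causal link.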

## <a id="Data_Preparation"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>4: DATA PREPARATION</center></h3>

<a id="4.1"></a>
## 4.1: Dependent variables:
### 4.1.1: Check the asymmetry of the probability distribution 
#### b-value

In [None]:
def plot_distribution(df,variable):
    ax = sns.displot(data=df, x=variable, kde=True)

    # Get the fitted parameters used by the function
    (mu, sigma) = norm.fit(df[variable])
    print( '\n mu = {:.2E} and sigma = {:.2E}\n'.format(mu, sigma))

    #Now plot the distribution
    plt.legend([r'Normal dist. ($\mu=$ {:.2E} and $\sigma=$ {:.2E} )'.format(mu, sigma)],
                loc='best')
    plt.ylabel('Frequency')
    plt.title('{} distribution'.format(variable))

    #Get also the QQ-plot
    fig = plt.figure()
    res = stats.probplot(df[variable], plot=plt)
    plt.show()
    
    #skewness and kurtosis
    print("Skewness: %f" % df[variable].skew())
    print("Kurtosis: %f" % df[variable].kurt())

#### distribution b_value

In [None]:
plot_distribution(df_b_value,'b_value')

#### distribution of the monthly seismic energy released

In [None]:
plot_distribution(monthly_energy_released,'Energy')

Both target variables have:
- a distribution close to the normal distribution,
- a low skewness (a positive skewness means that the tail on the right side of the distribution is longer or fatter, and the mean and median are greater than the mode),
- a low kurtosis (a high kurtosis means that the data tend to have heavy tails, or outliers).

<a id="4.1.2"></a>
### 4.1.2: Log transform skewed targets:
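**Ideally, we want our skewness value to be around 0 and kurtosis less than 3.** Note that pandas `kurt()` returns the excess kurtosis, for which a normal distribution scores 0 rather than 3.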

In [None]:
# calculate log b_value
df_b_value['log_b_value'] = np.log1p(df_b_value['b_value'])
# calculate log Energy
monthly_energy_released['log_Energy'] = np.log1p(monthly_energy_released['Energy'])

#skewness and kurtosis
print("Skewness log_b_value: %f" % df_b_value['log_b_value'].skew())
print("Kurtosis log_b_value: %f" % df_b_value['log_b_value'].kurt())
print('')
print("Skewness log_Energy: %f" % monthly_energy_released['log_Energy'].skew())
print("Kurtosis log_Energy: %f" % monthly_energy_released['log_Energy'].kurt())
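
The effect of the transform is easier to see on data with a known shape. The sketch below uses a hypothetical log-normal target (the parameters `mean=40.0` and `sigma=1.0` are arbitrary) and compares skewness and excess kurtosis before and after `np.log1p`. It also shows the inverse transform, `np.expm1`, which is needed to bring predictions back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed target, log-normally distributed like a quantity
# spanning several orders of magnitude
rng = np.random.default_rng(7)
target = pd.Series(rng.lognormal(mean=40.0, sigma=1.0, size=500))
log_target = np.log1p(target)

# pandas reports *excess* kurtosis: a normal distribution gives about 0, not 3
summary = pd.DataFrame({
    'skewness': [target.skew(), log_target.skew()],
    'excess_kurtosis': [target.kurt(), log_target.kurt()],
}, index=['raw', 'log1p'])
print(summary)

# Predictions made on the log scale are mapped back with the inverse transform
recovered = np.expm1(log_target)
```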


In [None]:
# compute the correlation between the 2 target variables
monthly_energy_released['log_Energy'].corr(df_b_value['b_value'])

<a id="4.2"></a>
## 4.2: Independent variables  
### 4.2.1: define functions used for data preparation

#### Function for features extraction

In [None]:
def extract_features(df):
    """Summarise each row across its columns (axis=1) with basic statistics and quantiles."""
    df_features = pd.DataFrame()
    df_features["sum"] = df.sum(axis=1)
    df_features["max"] = df.max(axis=1)
    df_features["mean"] = df.mean(axis=1)
    df_features["std"] = df.std(axis=1)
    df_features["skew"] = df.skew(axis=1)
    df_features["kurtosis"] = df.kurtosis(axis=1)
    # 5th to 95th percentiles, in steps of 5
    for pct in range(5, 100, 5):
        df_features["{}pct".format(pct)] = df.quantile(pct / 100, axis=1)
    return df_features
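
To illustrate what these row-wise statistics look like, the sketch below applies the same pandas calls to a tiny made-up table with three months and four wells (well names and volumes are hypothetical). Each row is one month, so `axis=1` summarises across wells.

```python
import pandas as pd

# Hypothetical injection volumes: rows are months, columns are wells.
toy = pd.DataFrame(
    {"well_A": [10.0, 0.0, 5.0],
     "well_B": [20.0, 5.0, 5.0],
     "well_C": [30.0, 10.0, 5.0],
     "well_D": [40.0, 15.0, 5.0]},
    index=["2010-01", "2010-02", "2010-03"],
)

summary = pd.DataFrame({
    "sum": toy.sum(axis=1),              # total injected across wells
    "max": toy.max(axis=1),              # largest single-well value
    "mean": toy.mean(axis=1),
    "50pct": toy.quantile(0.5, axis=1),  # median across wells
})
print(summary)
```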

#### Function for adding lag versions

In [None]:
def create_lag(df):
    """Append copies of df lagged by 1 to 4 rows, prefixed S1_ to S4_."""
    # shift first, then prefix, so that each copy really holds the lagged values
    df_shift1 = df.shift(1).add_prefix('S1_')
    df_shift2 = df.shift(2).add_prefix('S2_')
    df_shift3 = df.shift(3).add_prefix('S3_')
    df_shift4 = df.shift(4).add_prefix('S4_')

    df_lags = pd.concat([df,df_shift1,df_shift2,df_shift3, df_shift4], axis=1)
    # the first 4 rows contain NaN after shifting and are dropped
    df_lags = df_lags.dropna() 
    return df_lags
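
As a small illustration of lagging with `DataFrame.shift`, the sketch below builds lag-1 and lag-2 copies of a made-up single-column table (column name and values are hypothetical). Shifting pushes values down the index, so the first rows of each lagged copy are NaN and `dropna()` removes the earliest months.

```python
import pandas as pd

# Hypothetical monthly series for one well.
toy = pd.DataFrame({"GIW": [1.0, 2.0, 3.0, 4.0, 5.0]},
                   index=["m0", "m1", "m2", "m3", "m4"])

# Shift first, then prefix, so each lagged copy keeps its shifted values.
lags = [toy.shift(k).add_prefix("S{}_".format(k)) for k in (1, 2)]
toy_lags = pd.concat([toy] + lags, axis=1).dropna()

print(toy_lags)
```

Row `m2` pairs the current value with the values of the two preceding months, so a target series has to drop its first rows as well to stay aligned.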

<a id="4.2.2"></a>
### 4.2.2: Feature extraction

In [None]:
# create new dataframe with new features
df_fearures_GIW = extract_features(df_GIW_per_well)
df_fearures_RIW = extract_features(df_RIW_per_well)
df_fearures_days = extract_features(df_days_per_well)
# add prefix
df_fearures_GIW = df_fearures_GIW.add_prefix('GIW_')
df_fearures_RIW = df_fearures_RIW.add_prefix('RIW_')
df_fearures_days =df_fearures_days.add_prefix('days_')
# concat df
df_features = pd.concat([df_fearures_GIW,df_fearures_RIW,df_fearures_days],axis=1)
df_features.shape

In [None]:
# add lag versions :
df_features_lag = create_lag(df_features)
df_features_lag.shape

In [None]:
# feature reduction
pca = PCA(n_components=0.99)
df_features_lag_reduced =  pd.DataFrame(pca.fit_transform(df_features_lag))
df_features_lag_reduced = df_features_lag_reduced.add_prefix('PCA_FEATURE_AXE_')
df_features_lag_reduced.shape

<a id="4.2.3"></a>
### 4.2.3: Injection data: scaling, lag versions, and feature reduction

In [None]:
df_GIW_per_well_lag = create_lag(df_GIW_per_well)
df_RIW_per_well_lag = create_lag(df_RIW_per_well)
df_days_per_well_lag = create_lag(df_days_per_well)
df_GIW_per_well_lag.shape

In [None]:
# scale the 3 dataframes with: GIW, RIW, days 
scaler = StandardScaler()
df_GIW_per_well_lag_scaled = pd.DataFrame(scaler.fit_transform(df_GIW_per_well_lag))
df_GIW_per_well_lag_scaled = df_GIW_per_well_lag_scaled.add_prefix('GIW_')
df_RIW_per_well_lag_scaled = pd.DataFrame(scaler.fit_transform(df_RIW_per_well_lag))
df_RIW_per_well_lag_scaled = df_RIW_per_well_lag_scaled.add_prefix('RIW_')
df_days_per_well_lag_scaled = pd.DataFrame(scaler.fit_transform(df_days_per_well_lag))
df_days_per_well_lag_scaled = df_days_per_well_lag_scaled.add_prefix('days_')

# concat the 3 df
df_injection_data_scaled = pd.concat([df_GIW_per_well_lag_scaled,
                                      df_RIW_per_well_lag_scaled,
                                      df_days_per_well_lag_scaled],axis=1)
df_injection_data_scaled.head(2)


In [None]:
# feature reduction
pca = PCA(n_components=0.99)
df_injection_data_scaled_reduced = pd.DataFrame(pca.fit_transform(df_injection_data_scaled))
df_injection_data_scaled_reduced = df_injection_data_scaled_reduced.add_prefix('PCA_inj_AXE_')
df_injection_data_scaled_reduced.shape


In [None]:
print(pca.explained_variance_ratio_)
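
Passing a float to `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on synthetic data (ten correlated columns generated from three hidden factors, purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))   # three hidden drivers
mixing = rng.normal(size=(3, 10))
# ten observed columns: mixtures of the drivers plus a little noise
X_toy = factors @ mixing + 0.01 * rng.normal(size=(200, 10))

X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=0.99)
X_toy_reduced = pca_toy.fit_transform(X_toy_scaled)

print("components kept:", pca_toy.n_components_)
print("cumulative variance:", pca_toy.explained_variance_ratio_.cumsum())
```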

<a id="4.3"></a>
## 4.3: correlations between features and target variables

In [None]:
# create list to select the features that correlate best with seismic energy
list_variable_with_high_corr_Ener=[]
# create list to select the features that correlate best with b-value
list_variable_with_high_corr_bval=[]


<a id="4.3.1"></a>
### 4.3.1: function used to plot correlations

In [None]:
def plot_correlation(df,var1,var2,color):
    """
    plot data and a linear regression model fit. 
    Parameters
    ----------
    df : dataframe
    var1: 'column_name' of the variable 1 in df
    var2: 'column_name' of the variable 2 in df
    color : color scatter points
    Returns
    -------
    None (the plot is drawn on the current axes)
    """
    # transform var1 and var2 into numpy array:
    xm = np.array(df[var1])
    ym = np.array(df[var2])
    # get regression line properties:
    slope, intercept, r_value, p_value, std_err = stats.linregress(xm, ym)
    # Plot linear regression with 95% confidence interval and the regression coefficient
    sns.regplot(x=var1,y=var2,data=df,fit_reg=True,color = color,
                line_kws={'label':"R={:.2f}".format(r_value),"color": "black"}) 
    # axes and title properties
    plt.xlabel(var1,fontsize=15)
    plt.ylabel(var2,fontsize=15)
    # plot legend
    plt.legend(prop={'size': 15})

<a id="4.3.2"></a>
### 4.3.2: prepare dataframe to calculate correlation coefficient

In [None]:
# resize targets to match the explanatory variables: dropna() after the 4-month shift
# removes the first 4 rows of the features, so the first 4 target rows are removed too
# (assumes targets and injection data cover the same months)
target_Ener = monthly_energy_released['log_Energy'][4:].reset_index()
target_Ener = target_Ener.drop(columns = ['Date'])

target_bval = df_b_value['log_b_value'][4:].reset_index()
target_bval = target_bval.drop(columns = ['years_months'])

# injection data vs Energy: 
df_injection_data_scaled_Ener = pd.concat([df_injection_data_scaled,target_Ener], axis=1)
df_pca_injection_data_scaled_Ener = pd.concat([df_injection_data_scaled_reduced,target_Ener], axis=1)

# injection data vs b-value:
df_injection_data_scaled_bval = pd.concat([df_injection_data_scaled,target_bval], axis=1)
df_pca_injection_data_scaled_bval = pd.concat([df_injection_data_scaled_reduced,target_bval], axis=1)

# features vs Energy:
df_features_lag2 = df_features_lag.reset_index().drop(columns = ['Date'],axis=1)
df_features_Ener = pd.concat([df_features_lag2,target_Ener],axis=1)
df_pca_features_Ener = pd.concat([df_features_lag_reduced,target_Ener], axis=1)

# features vs b-value:
df_features_bval = pd.concat([df_features_lag2,target_bval],axis=1)
df_pca_features_bval = pd.concat([df_features_lag_reduced,target_bval], axis=1)

# finally create one df with all the data
df_all_data = pd.concat([df_injection_data_scaled,df_injection_data_scaled_reduced,
                         df_features_lag2,df_features_lag_reduced,                         
                         target_Ener,target_bval], axis=1)
df_all_data = df_all_data.set_index(df_features_lag.index)

<a id="4.3.3"></a>
### 4.3.3:  correlation between seismic energy and:
#### Raw injection data :

In [None]:
def calcul_sort_save_corr(df,variable,threshold=0.35):
    # correlation of every column with `variable` (the last row, `variable` itself, is dropped)
    corr = pd.DataFrame(df.corr()[variable][:-1])
    # sort by absolute value, highest first
    corr = pd.DataFrame(abs(corr[variable]).sort_values(ascending=False))

    # keep the variables whose absolute correlation exceeds the threshold
    highest_corr = corr[corr[variable]>threshold]
    print(corr.head(10))
    return highest_corr
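
The ranking step can be illustrated on a small synthetic table (column names and values are invented for the example). The sketch ranks features by the absolute value of their Pearson correlation with the target and keeps those above a threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
target = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_pos": target + 0.1 * rng.normal(size=n),   # correlation close to +1
    "strong_neg": -target + 0.1 * rng.normal(size=n),  # correlation close to -1
    "noise": rng.normal(size=n),                       # correlation close to 0
    "target": target,
})

# Correlation of every feature with the target, ranked by magnitude.
corr_with_target = toy.corr()["target"].drop("target")
ranked = corr_with_target.abs().sort_values(ascending=False)
selected = ranked[ranked > 0.35]

print(ranked)
print(list(selected.index))
```

Taking the absolute value keeps strongly negative correlations, which are as informative for a linear model as positive ones.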

In [None]:
corr_injection_Ener = calcul_sort_save_corr(df_injection_data_scaled_Ener,'log_Energy')

In [None]:
# define figure size
fig = plt.figure(figsize=(18,6))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between raw injection data and seismic energy', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(2,2,1)
ax1 = plot_correlation(df_injection_data_scaled_Ener,'days_205','log_Energy','blue')
ax2 = fig.add_subplot(2,2,2)
ax2 = plot_correlation(df_injection_data_scaled_Ener,'RIW_222','log_Energy','green')
ax3 = fig.add_subplot(2,2,3)
ax3 = plot_correlation(df_injection_data_scaled_Ener,'GIW_58','log_Energy','red')
ax4 = fig.add_subplot(2,2,4)
ax4 = plot_correlation(df_injection_data_scaled_Ener,'GIW_10','log_Energy','black')

plt.show()

In [None]:
corr_pca_injection_Ener = calcul_sort_save_corr(df_pca_injection_data_scaled_Ener,'log_Energy',threshold=0.26)

In [None]:
# define figure size
fig = plt.figure(figsize=(18,6))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between PCA axes from raw injection data and seismic energy', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(2,2,1)
ax1 = plot_correlation(df_pca_injection_data_scaled_Ener,'PCA_inj_AXE_4','log_Energy','blue')
ax2 = fig.add_subplot(2,2,2)
ax2 = plot_correlation(df_pca_injection_data_scaled_Ener,'PCA_inj_AXE_1','log_Energy','green')
ax3 = fig.add_subplot(2,2,3)
ax3 = plot_correlation(df_pca_injection_data_scaled_Ener,'PCA_inj_AXE_6','log_Energy','red')
ax4 = fig.add_subplot(2,2,4)
ax4 = plot_correlation(df_pca_injection_data_scaled_Ener,'PCA_inj_AXE_21','log_Energy','black')

plt.show()

#### extracted features:

In [None]:
corr_features_Ener = calcul_sort_save_corr(df_features_Ener,'log_Energy')

In [None]:
# define figure size
fig = plt.figure(figsize=(18,9))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between extracted features and seismic energy', fontsize=18,y=0.95)

# subplot
ax1 = fig.add_subplot(3,2,1)
ax1 = plot_correlation(df_features_Ener,'S4_days_90pct','log_Energy','blue')
ax2 = fig.add_subplot(3,2,2)
ax2 = plot_correlation(df_features_Ener,'S4_days_std','log_Energy','green')
ax3 = fig.add_subplot(3,2,3)
ax3 = plot_correlation(df_features_Ener,'S4_days_kurtosis','log_Energy','red')
ax4 = fig.add_subplot(3,2,4)
ax4 = plot_correlation(df_features_Ener,'S4_days_skew','log_Energy','black')
ax5 = fig.add_subplot(3,2,5)
ax5 = plot_correlation(df_features_Ener,'S2_GIW_95pct','log_Energy','gold')
ax6 = fig.add_subplot(3,2,6)
ax6 = plot_correlation(df_features_Ener,'S1_GIW_95pct','log_Energy','darkblue')

plt.show()

#### features obtained with PCA

In [None]:
corr_pca_features_Ener = calcul_sort_save_corr(df_pca_features_Ener,'log_Energy')

In [None]:
# define figure size
fig = plt.figure(figsize=(18,3))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between PCA axes from extracted features and seismic energy', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(1,2,1)
ax1 = plot_correlation(df_pca_features_Ener,'PCA_FEATURE_AXE_0','log_Energy','blue')
ax2 = fig.add_subplot(1,2,2)
ax2 = plot_correlation(df_pca_features_Ener,'PCA_FEATURE_AXE_2','log_Energy','green')

plt.show()

<a id="4.3.4"></a>
### 4.3.4:  correlation between b value and:
#### raw injection data

In [None]:
corr_injection_bval = calcul_sort_save_corr(df_injection_data_scaled_bval,'log_b_value',threshold = 0.75)

In [None]:
# define figure size
fig = plt.figure(figsize=(18,9))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between raw injection data and b-value', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(3,2,1)
ax1 = plot_correlation(df_injection_data_scaled_bval,'days_162','log_b_value','blue')
ax2 = fig.add_subplot(3,2,2)
ax2 = plot_correlation(df_injection_data_scaled_bval,'days_210','log_b_value','green')
ax3 = fig.add_subplot(3,2,3)
ax3 = plot_correlation(df_injection_data_scaled_bval,'days_234','log_b_value','red')
ax4 = fig.add_subplot(3,2,4)
ax4 = plot_correlation(df_injection_data_scaled_bval,'days_186','log_b_value','black')
ax5 = fig.add_subplot(3,2,5)
ax5 = plot_correlation(df_injection_data_scaled_bval,'GIW_234','log_b_value','gold')
ax6 = fig.add_subplot(3,2,6)
ax6 = plot_correlation(df_injection_data_scaled_bval,'GIW_186','log_b_value','darkblue')

plt.show()

#### PCA axes from injection data :

In [None]:
corr_pca_injection_bval = calcul_sort_save_corr(df_pca_injection_data_scaled_bval,'log_b_value',threshold = 0.70)

In [None]:
# define figure size
fig = plt.figure(figsize=(18,6))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between PCA axes from raw injection data and b-value', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(2,2,1)
ax1 = plot_correlation(df_pca_injection_data_scaled_bval,'PCA_inj_AXE_0','log_b_value','blue')
ax2 = fig.add_subplot(2,2,2)
ax2 = plot_correlation(df_pca_injection_data_scaled_bval,'PCA_inj_AXE_2','log_b_value','green')
ax3 = fig.add_subplot(2,2,3)
ax3 = plot_correlation(df_pca_injection_data_scaled_bval,'PCA_inj_AXE_3','log_b_value','red')

plt.show()

#### Extracted features

In [None]:
corr_features_bval = calcul_sort_save_corr(df_features_bval,'log_b_value',threshold = 0.74)

In [None]:
# define figure size
fig = plt.figure(figsize=(18,9))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between extracted features and b-value', fontsize=18,y=.95)

# subplot
ax1 = fig.add_subplot(3,2,1)
ax1 = plot_correlation(df_features_bval,'S4_days_skew','log_b_value','blue')
ax2 = fig.add_subplot(3,2,2)
ax2 = plot_correlation(df_features_bval,'S3_days_skew','log_b_value','green')
ax3 = fig.add_subplot(3,2,3)
ax3 = plot_correlation(df_features_bval,'S4_days_kurtosis','log_b_value','red')
ax4 = fig.add_subplot(3,2,4)
ax4 = plot_correlation(df_features_bval,'S4_days_sum','log_b_value','black')
ax5 = fig.add_subplot(3,2,5)
ax5 = plot_correlation(df_features_bval,'S4_days_mean','log_b_value','gold')
ax6 = fig.add_subplot(3,2,6)
ax6 = plot_correlation(df_features_bval,'S3_days_sum','log_b_value','darkblue')

plt.show()

#### features obtained with PCA

In [None]:
corr_pca_features_bval = calcul_sort_save_corr(df_pca_features_bval,'log_b_value',threshold = 0.70)

In [None]:
# define figure size
fig = plt.figure(figsize=(18,3))
fig.subplots_adjust(hspace=0.4,wspace=0.3)

# figure title
fig.suptitle('Highest correlations found between PCA axes from extracted features and b-value', fontsize=18,y=1)

# subplot
ax1 = fig.add_subplot(1,2,1)
ax1 = plot_correlation(df_pca_features_bval,'PCA_FEATURE_AXE_3','log_b_value','blue')

<a id="4.5"></a>
## 4.5: Outliers detection:

In [None]:
print(df_features_Ener['log_Energy'].sort_values(ascending=True).head(3))

In [None]:
positon_outliers = [16]

# remove outlier in df with Energy
df_injection_data_scaled_Ener = df_injection_data_scaled_Ener.drop(df_injection_data_scaled_Ener.index[positon_outliers])
df_pca_injection_data_scaled_Ener = df_pca_injection_data_scaled_Ener.drop(df_pca_injection_data_scaled_Ener.index[positon_outliers])
df_features_Ener = df_features_Ener.drop(df_features_Ener.index[positon_outliers])
df_pca_features_Ener = df_pca_features_Ener.drop(df_pca_features_Ener.index[positon_outliers])

# remove outlier in df with b_value
df_injection_data_scaled_bval = df_injection_data_scaled_bval.drop(df_injection_data_scaled_bval.index[positon_outliers])
df_pca_injection_data_scaled_bval = df_pca_injection_data_scaled_bval.drop(df_pca_injection_data_scaled_bval.index[positon_outliers])
df_features_bval = df_features_bval.drop(df_features_bval.index[positon_outliers])
df_pca_features_bval = df_pca_features_bval.drop(df_pca_features_bval.index[positon_outliers])

# remove outlier in df with all the data
df_all_data = df_all_data.drop(df_all_data.index[positon_outliers])


In [None]:
# recalculate the correlation coefficients:
corr_injection_Ener = calcul_sort_save_corr(df_injection_data_scaled_Ener,'log_Energy',threshold=0.30)

In [None]:
corr_pca_injection_Ener = calcul_sort_save_corr(df_pca_injection_data_scaled_Ener,'log_Energy',threshold=0.28)

In [None]:
corr_features_Ener = calcul_sort_save_corr(df_features_Ener,'log_Energy',threshold=0.30)

In [None]:
corr_pca_features_Ener = calcul_sort_save_corr(df_pca_features_Ener,'log_Energy',threshold=0.26)

In [None]:
corr_injection_bval = calcul_sort_save_corr(df_injection_data_scaled_bval,'log_b_value',threshold=0.65)

In [None]:
corr_pca_injection_bval = calcul_sort_save_corr(df_pca_injection_data_scaled_bval,'log_b_value',threshold=0.65)

In [None]:
corr_features_bval = calcul_sort_save_corr(df_features_bval,'log_b_value',threshold=0.65)

In [None]:
corr_pca_features_bval = calcul_sort_save_corr(df_pca_features_bval,'log_b_value',threshold=0.65)

In [None]:
# concat df with meaningful features
meaningful_features_Energy = pd.concat([corr_injection_Ener,corr_pca_injection_Ener,
                                        corr_features_Ener,corr_pca_features_Ener],axis=0)
meaningful_features_Energy = pd.DataFrame(abs(meaningful_features_Energy['log_Energy']).sort_values(ascending=False)).dropna()
print(meaningful_features_Energy.shape)
meaningful_features_Energy.head(5)

In [None]:
# concat df with meaningful features
meaningful_features_bval = pd.concat([corr_injection_bval,corr_pca_injection_bval,
                                        corr_features_bval,corr_pca_features_bval],axis=0)
meaningful_features_bval = pd.DataFrame(abs(meaningful_features_bval['log_b_value']).sort_values(ascending=False)).dropna()
print(meaningful_features_bval.shape)
meaningful_features_bval.head(5)

## <a id="ML"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>5: MACHINE LEARNING: prediction of monthly seismic energy</center></h3>

<a id="5.1"></a>
## 5.1: Features selection and data split for linear models

In [None]:
# select only variables highly correlated with targets
X_Ener = df_all_data[[c for c in df_all_data.columns if c in meaningful_features_Energy.index]]
X_Ener.head(2)

In [None]:
# select only variables highly correlated with targets
X_bval = df_all_data[[c for c in df_all_data.columns if c in meaningful_features_bval.index]]
X_bval.head(3)

In [None]:
target_Ener = df_all_data['log_Energy'].values
target_bval = df_all_data['log_b_value'].values

In [None]:
X_train_Ener, X_test_Ener, y_train_Ener, y_test_Ener = train_test_split(X_Ener, target_Ener,test_size = .3, random_state=0)
X_train_bval, X_test_bval, y_train_bval, y_test_bval = train_test_split(X_bval, target_bval,test_size = .3, random_state=0)

<a id="5.2"></a>
## 5.2: functions used for machine learning
### function to evaluate model prediction

In [None]:
def get_best_score(grid):
    best_score = grid.best_score_
    print(best_score)    
    print(grid.best_params_)
    print(grid.best_estimator_)
    return best_score

def model_evaluation(y_test,prediction):
    r2 = round(metrics.r2_score(y_test, prediction), 2)
    # percentage error is relative to the true value, as stated in the printed label
    abs_perc_error = np.mean(np.abs((y_test-prediction)/y_test))
    mean_abs_err = metrics.mean_absolute_error(y_test, prediction)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, prediction))
    print("R2 (explained variance):",r2 )
    print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", abs_perc_error)
    print("Mean Absolute Error (Σ|y-pred|/n):", "{:,f}".format(mean_abs_err))
    print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,f}".format(rmse))
    ## residuals
#     prediction = prediction.reshape(len(prediction),1)
    residuals = y_test - prediction
    if abs(max(residuals)) > abs(min(residuals)):
        max_error = max(residuals)  
    else:
        max_error = min(residuals) 
    max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
    # max_true = y_test[max_idx]
    max_pred = prediction[max_idx]
    print("Max Error:", "{}".format(max_error))
    
    ## Plot predicted vs true
    fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(10,5))
    ax[0].scatter(prediction, y_test, color="black")
    abline_plot(intercept=0, slope=1, color="red", ax=ax[0])
    # ax[0].vlines(x=max_pred, ymin=max_true, ymax=max_true-max_error, color='red', linestyle='--', alpha=0.7, label="max error")
    ax[0].grid(True)
    ax[0].set(xlabel="Predicted", ylabel="True", title="Predicted vs True")
    ax[0].legend()

    ## Plot predicted vs residuals
    ax[1].scatter(prediction, residuals, color="red")
    ax[1].vlines(x=max_pred, ymin=0, ymax=max_error, color='black', linestyle='--', alpha=0.7, label="max error")
    ax[1].grid(True)
    ax[1].set(xlabel="Predicted", ylabel="Residuals", title="Predicted vs Residuals")
    ax[1].hlines(y=0, xmin=np.min(prediction), xmax=np.max(prediction))
    ax[1].legend()
    plt.show()

    print('The model explains {}% of the variance of the target variable.'.format(r2*100))
    print('On average, predictions have an error of {:,.2f}, or they’re wrong by {:,.2f}%.'.format(mean_abs_err,(abs_perc_error)*100)) 
#     print('the average difference between the predicted value and the actual value is {:,.2f}%: '.format(abs_perc_error*100))
    print('The biggest error on the test set was over {:,.2f}.'.format(max_error))
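
For reference, the four metrics printed above can be written out by hand. The sketch below uses made-up numbers and plain NumPy; the percentage error here divides by the true value, matching the formula in the printed label.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_pred = np.array([2.5, 3.5, 5.0, 8.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))                     # mean absolute error
mape = np.mean(np.abs(residuals / y_true))           # mean absolute percentage error
rmse = np.sqrt(np.mean(residuals ** 2))              # root mean squared error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
max_error = residuals[np.argmax(np.abs(residuals))]  # signed largest residual

print(mae, mape, rmse, r2, max_error)
```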


### Functions for plotting features' importance

In [None]:
def plot_nb_feature_vs_score(df):
    # PLOT RESULT:
    df.plot('number_feat', 'best_score')
    # Returns index of the maximum best_score
    index = df[['best_score']].idxmax() 
    # get the number of features used to have the best score 
    print(df['number_feat'][index])
    
def plot_feature_importance(importance,n,names,model_type):

    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance and keep the n most important
    fi_df = fi_df.sort_values(by=['feature_importance'], ascending=False).head(n)

    #Define size of bar plot
    plt.figure(figsize=(15,10))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
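
A plot-free sketch of picking the `n` most important features: sort the importances first, then keep the head of the table. Feature names and importances below are invented.

```python
import pandas as pd

# Hypothetical importances, e.g. from a fitted tree ensemble.
names = ["GIW_mean", "RIW_max", "days_sum", "GIW_std"]
importance = [0.10, 0.45, 0.05, 0.40]

fi_df = pd.DataFrame({"feature_names": names, "feature_importance": importance})
top2 = fi_df.sort_values(by="feature_importance", ascending=False).head(2)

print(top2)
```

Slicing before sorting would instead return the first `n` columns in their original order, whatever their importance.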

### Function for features selection

In [None]:
def select_model(model_ini,thres, X_train,X_test):   
    selection = SelectFromModel(model_ini, threshold=thres, prefit=True)
    n_features = selection.transform(X_train).shape[1]
    selected_vars = list(X_train.columns[selection.get_support()])
    X_train_selected = X_train[selected_vars]
    X_test_selected = X_test[selected_vars]
    return X_train_selected,X_test_selected

### Function to perform RandomizedSearchCV

In [None]:
def perform_grid_search(model,random_grid,X_train, y_train,thres):
    # randomized search: 300 parameter settings sampled from random_grid, 4-fold cross-validation
    random_search = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 300, cv = 4, 
                               verbose=2, random_state=12,scoring ='explained_variance')
    # Fit the random search model
    random_search.fit(X_train, y_train)
    # print output
    print('-'*60)
    print('Results from Grid Search with threshold = {}'.format(thres))
    print("Best score:",random_search.best_score_)
    print("with the following parameters :\n",random_search.best_params_)

<a id="5.3"></a>
## 5.3: Linear model
### 5.3.1: Feature selection with Lasso

In [None]:
lasso_score = []
number_feature = []
for i in range (1,X_Ener.shape[1]):
    # select the i features most correlated with the target
    columns = meaningful_features_Energy.index[:i]
    X = X_Ener[columns].values
    y = target_Ener
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
    
    lasso = Lasso(normalize=True)
    parameters = {'alpha': [1e-8,1e-5,1e-4, 1e-3,1e-2,0.5,1,2],
              'tol':[1e-6,1e-5,1e-4,1e-3,1e-2]}
    grid_lasso = GridSearchCV(lasso, parameters, cv=12, verbose=0, scoring = 'explained_variance')
    grid_lasso.fit(X_train, y_train)
    
    sc_lasso = get_best_score(grid_lasso)
    lasso_score.append(sc_lasso)
    number_feature.append(i)

result_lasso =  pd.DataFrame(zip(number_feature,lasso_score),columns = ['number_feat', 'best_score'])
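
The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2. On recent versions, one possible substitute is to scale inside a `Pipeline`, so that the scaling is refit on each cross-validation fold. The sketch below uses synthetic data and is not numerically equivalent to `normalize=True`, so the best `alpha` would need to be searched again.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: only the first two of five columns matter.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 5))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 0.1 * rng.normal(size=120)

pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
# "lasso" is the step name that make_pipeline derives from the class name
param_grid = {"lasso__alpha": [1e-4, 1e-3, 1e-2, 0.1, 1.0]}
grid_toy = GridSearchCV(pipe, param_grid, cv=4, scoring="explained_variance")
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_, round(grid_toy.best_score_, 3))
```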

In [None]:
plot_nb_feature_vs_score(result_lasso)

#### Run gridsearch with right number of features

In [None]:
# select columns that give best results
columns = meaningful_features_Energy.index[:22]
X = X_Ener[columns].values
y = target_Ener
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# Lasso regression model and gridsearch
grid_lasso = GridSearchCV(lasso, parameters, cv=12, verbose=1, scoring = 'explained_variance')
grid_lasso.fit(X_train, y_train)

sc_lasso = get_best_score(grid_lasso)

#### Run best fit Lasso

In [None]:
## best fit: 
lasso = Lasso(alpha= 0.01, normalize=True,tol = 1e-05)
## fit the model. 
lasso.fit(X_train, y_train)
## predict the target values for X_test
prediction = lasso.predict(X_test)

#add score to list
r2_Lasso = metrics.r2_score(y_test, prediction)
mse_Lasso = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="5.3.2"></a>
### 5.3.2: Linear regression
#### GridSearchCV() with number features = 22

In [None]:
# select columns that give best results
columns = meaningful_features_Energy.index[:22]
X = X_Ener[columns].values
y = target_Ener
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# linear regression model and gridsearch
linreg = LinearRegression()
parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True, False]}
grid_linear = GridSearchCV(linreg, parameters, cv=12, verbose=1 , scoring ='explained_variance')
grid_linear.fit(X_train, y_train)

sc_linear = get_best_score(grid_linear)

#### Best result with linear model

In [None]:
model_linreg = LinearRegression(copy_X= True, fit_intercept= True, normalize= False)
model_linreg.fit(X_train,y_train)
prediction = model_linreg.predict(X_test)

#add score to list
r2_linreg = metrics.r2_score(y_test, prediction)
mse_linreg = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="5.3.3"></a>
### 5.3.3: Ridge model
#### GridSearchCV() with number features = 22

In [None]:
# select columns that give best results
columns = meaningful_features_Energy.index[:22]
X = X_Ener[columns].values
y = target_Ener
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# Ridge regression model and gridsearch
ridge = Ridge(normalize=True)
parameters = {'alpha':[1e-6,1e-5,1e-4,1e-3,1e-2,0.1,0.5,1], 
              'tol':[1e-9,1e-7,1e-6,1e-5,1e-4]}
grid_ridge = GridSearchCV(ridge, parameters, cv=12, verbose=1, scoring = 'explained_variance')
grid_ridge.fit(X_train, y_train)

sc_ridge = get_best_score(grid_ridge)

#### Best result with Ridge

In [None]:
## best fit: 
ridge_bf = Ridge(alpha= 1, normalize=True,tol = 1e-09)
## fit the model. 
ridge_bf.fit(X_train, y_train)
## predict the target values for X_test
prediction = ridge_bf.predict(X_test)

#add score to list
r2_ridge = metrics.r2_score(y_test, prediction)
mse_ridge = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="5.3.4"></a>
### 5.3.4: ElasticNet regression
#### GridSearchCV() with number feature = 22

In [None]:
# select columns that give best results
columns = meaningful_features_Energy.index[:22]
X = X_Ener[columns].values
y = target_Ener
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# ElasticNet regression model and gridsearch
elastic_reg = ElasticNet()
parameters = {
    'alpha': [1e-6,1e-5, 1e-4, 5e-3,1e-3,5e-2, 1e-2, 5e-1, 1e-1,1.0],
    'l1_ratio': [1e-6,1e-5, 1e-4,5e-3, 1e-3,5e-2, 1e-2, 5e-1,1e-1,1.0],
    'tol': [1e-6,1e-5, 1e-4,1e-3]}
grid_elas = GridSearchCV(elastic_reg, parameters, cv=12, verbose=1, scoring = 'explained_variance')
grid_elas.fit(X_train, y_train)

sc_elas = get_best_score(grid_elas)

#### Best result with Elastic

In [None]:
## best fit: 
elastic = ElasticNet(alpha= 1.0,l1_ratio= 0.005,tol=1e-05)
## fit the model. 
elastic.fit(X_train, y_train)
## predict the target values for X_test
prediction = elastic.predict(X_test)

#add score to list
r2_elas = metrics.r2_score(y_test, prediction)
mse_elas = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="5.4"></a>
## 5.4: Decision tree methods 

In [None]:
# data selection for decision tree methods
df_all_data2 = df_all_data.drop(columns=['log_Energy','log_b_value'], axis =1)
X_train_Ener, X_test_Ener, y_train_Ener, y_test_Ener = train_test_split(df_all_data2, target_Ener,test_size = .3, random_state=0)
X_train_bval, X_test_bval, y_train_bval, y_test_bval = train_test_split(df_all_data2, target_bval,test_size = .3, random_state=0)


<a id="5.4.1"></a>
### 5.4.1: Random Forest Regressor
- First, we evaluate the importance of the features.
- Then, we create several models trained on different input features and perform a RandomizedSearchCV on each model. Each model uses 100 trees to speed up the calculation.
- Finally, we select the number of features that gave the highest score, initiate a new model and perform a grid search.
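
The three steps above can be sketched end to end on synthetic data. The data, the importance threshold, and the reduced parameter grid below are illustrative assumptions, chosen so the example runs in a few seconds; they are not the settings used for the Geysers data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data: 8 columns, only the first two drive the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 8))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 0.2 * rng.normal(size=150)

# Step 1: fit an initial forest and read the feature importances.
rf_ini = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Step 2: keep the columns whose importance exceeds a threshold.
selector = SelectFromModel(rf_ini, threshold=0.05, prefit=True)
mask = selector.get_support()
X_toy_selected = X_toy[:, mask]

# Step 3: randomized search over a small grid on the selected columns.
toy_grid = {"max_features": ["sqrt", "log2"],
            "min_samples_leaf": [1, 2, 5],
            "max_depth": [5, 10, None]}
search = RandomizedSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                            param_distributions=toy_grid, n_iter=5, cv=4,
                            random_state=12, scoring="explained_variance")
search.fit(X_toy_selected, y_toy)

print("selected columns:", np.flatnonzero(mask))
print(search.best_params_, round(search.best_score_, 3))
```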

#### feature importance

In [None]:
# initial random forest regressor model, fit it 
model_rf_ini = RandomForestRegressor(n_estimators= 100,random_state = 0)
model_rf_ini.fit(X_train_Ener, y_train_Ener)
# extract and plot important features
plot_feature_importance(model_rf_ini.feature_importances_,60,X_train_Ener.columns,'Random Forest Regressor: ')

#### feature selection and gridsearch

Here, we define the grid for the random search that is going to be used for each model.

In [None]:
# Create the random grid
random_grid_rf = {
                # measure quality of the split
                'criterion': ['mse', 'mae'],
                # Number of features to consider at every split 
               'max_features': ['auto', 'sqrt','log2'],
                # Maximum number of levels in tree
               'max_depth': [10,50,100,200,None],
                # Minimum number of samples required to split a node
               'min_samples_split': range(2,10,1),
                # Minimum number of samples required at each leaf node
               'min_samples_leaf': range(1,10,1),
                # Method of selecting samples for training each tree
               'bootstrap': [True, False],
                'n_estimators' : [100],
                'random_state': [0]}

In [None]:
# # first random search, with threshold = 0.0002
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0002, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform RandomizedSearchCV
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0002)

Results from Grid Search with threshold = 0.0002

Best score: **0.21704545681030946**

with the following parameters :

{'random_state': 0, 'n_estimators': 100, 'min_samples_split': 8, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 200, 'criterion': 'mse', 'bootstrap': True}

In [None]:
# # random search with threshold = 0.0004
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0004, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform RandomizedSearchCV
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0004)

Results from Grid Search with threshold = 0.0004

Best score: **0.2181101677673559**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 200, 'criterion': 'mae', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0006
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0006, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0006)

Results from Grid Search with threshold = 0.0006

Best score: **0.2249491184242004**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 100, 'criterion': 'mae', 'bootstrap': True}


In [None]:
# # do first random search with threshold = 0.0008
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0008, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0008)

Results from Grid Search with threshold = 0.0008

Best score: **0.2327851268229801**

with the following parameters :
 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'mse', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0009
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0009, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0009)

Results from Grid Search with threshold = 0.0009

Best score: **0.23715507366773927**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'mse', 'bootstrap': True}

In [None]:
# # # do first random search with threshold = 0.001
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.001, X_train_Ener, X_test_Ener)
# # # initiate model
# model_rf = RandomForestRegressor()
# # # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.001)

Results from Grid Search with threshold = 0.001

Best score: **0.2386073321504737**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 100, 'criterion': 'mae', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0011
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0011, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0011)

Results from Grid Search with threshold = 0.0011

Best score: **0.25370632639103463**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'mse', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0012
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0012, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0012)

Results from Grid Search with threshold = 0.0012

Best score: **0.254835044417125**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 50, 'criterion': 'mae', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0013
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0013, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0013)

Results from Grid Search with threshold = 0.0013

Best score: **0.26921830705419325**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 7, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 50, 'criterion': 'mae', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.0014
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0014, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0014)

Results from Grid Search with threshold = 0.0014

Best score: **0.22909710132256744**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 8, 'max_features': 'log2', 'max_depth': 10, 'criterion': 'mae', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.0015
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0015, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.0015)

Results from Grid Search with threshold = 0.0015

Best score: **0.25165823131968784**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_features': 'sqrt', 'max_depth': 200, 'criterion': 'mae', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.00175
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.00175, X_train_Ener, X_test_Ener)
# # initiate model
# model_rf = RandomForestRegressor()
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_Ener,0.00175)

Results from Grid Search with threshold = 0.00175

Best score: **0.24612827184727618**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 200, 'criterion': 'mse', 'bootstrap': True}

#### GridSearchCV with threshold = 0.0013 (best score)

In [None]:
# # do first random search with threshold = 0.0013
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0013, X_train_Ener, X_test_Ener)

# model_rf = RandomForestRegressor()

# param_grid = {
#     'bootstrap': [True],
#     'criterion': ['mae'],
#     'max_depth': range(45,55,1),
#     'max_features': ['log2'],
#     'min_samples_leaf': [1,2,3,4,5],
#     'min_samples_split': [5,6,7,8,9],
#     'n_estimators': [100],
#     'random_state': [0]
# }

# grid_search = GridSearchCV(estimator = model_rf, param_grid = param_grid, cv = 12, verbose = 2,scoring ='explained_variance')

# # Fit the grid search to the data
# grid_search.fit(X_train_selected, y_train_Ener)
# grid_search.best_params_

**Result gridsearch:**
{'bootstrap': True,
 'criterion': 'mae',
 'max_depth': 45,
 'max_features': 'log2',
 'min_samples_leaf': 2,
 'min_samples_split': 6,
 'n_estimators': 100,
 'random_state': 0}

In [None]:
# select the features with an importance above the best threshold (0.0013)
X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0013, X_train_Ener, X_test_Ener)

# bootstrap is passed as the boolean True, not the string 'True'
model_rf = RandomForestRegressor(bootstrap = True,criterion='mae',max_depth = 45,max_features = 'log2',
                                min_samples_leaf = 2, min_samples_split = 6, n_estimators = 100,random_state =  0)
model_rf.fit(X_train_selected,y_train_Ener)
prediction = model_rf.predict(X_test_selected)

#add score to list
r2_rf = metrics.r2_score(y_test_Ener, prediction)
mse_rf = metrics.mean_squared_error(y_test_Ener, prediction)

# evaluation model
model_evaluation(y_test_Ener,prediction)

The model overfits the data. To reduce overfitting, we can try tuning these parameters:

- `n_estimators`: averaging over more trees reduces the variance of the prediction, so the forest is less likely to overfit.
- `max_features`: the number of features considered when looking for the best split; it could be lowered.
- `max_depth`: lowering this parameter reduces the complexity of the learned trees, and so the risk of overfitting.
- `min_samples_leaf`: increasing it has a similar effect to lowering `max_depth`.
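
As a rough illustration of how one of these parameters acts, the sketch below compares the train and test R2 of a random forest for a few values of `min_samples_leaf`. It runs on synthetic noisy data, not on the injection features; `train_test_gap`, `X_demo` and `y_demo` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def train_test_gap(X, y, **rf_params):
    """Return (train R2, test R2) for a random forest fitted with rf_params."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestRegressor(random_state=0, **rf_params)
    model.fit(X_tr, y_tr)
    return r2_score(y_tr, model.predict(X_tr)), r2_score(y_te, model.predict(X_te))

# Synthetic data (assumption): one informative feature, 19 noise features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 20))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=150)

for leaf in (1, 5, 20):
    r2_train, r2_test = train_test_gap(X_demo, y_demo, n_estimators=50, min_samples_leaf=leaf)
    print(f"min_samples_leaf={leaf:2d}  train R2={r2_train:.2f}  test R2={r2_test:.2f}")
```

A large gap between the two scores is the symptom of overfitting discussed above; the same helper could be called with other `RandomForestRegressor` parameters such as `max_depth` or `max_features`.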


In [None]:
# same features (threshold = 0.0013) and same model, but with more trees (n_estimators = 3000)
X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0013, X_train_Ener, X_test_Ener)

# bootstrap is passed as the boolean True, not the string 'True'
model_rf = RandomForestRegressor(bootstrap = True,criterion='mae',max_depth = 45,max_features ='log2',
                                min_samples_leaf = 2, min_samples_split = 6, n_estimators = 3000,random_state =  0)
model_rf.fit(X_train_selected,y_train_Ener)
prediction = model_rf.predict(X_test_selected)

#add score to list
r2_rf = metrics.r2_score(y_test_Ener, prediction)
mse_rf = metrics.mean_squared_error(y_test_Ener, prediction)

print(r2_rf)

<a id="5.4.2"></a>
### 5.4.2: xgboost regression
As before, we:
- First, evaluate the importance of the features.
- Then, create several models trained on different sets of input features and perform a RandomizedSearchCV on each model.
- Finally, select the set of features that gave the highest score, initiate a new model and perform a grid search.
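
The feature-selection step relies on the `select_model` helper defined earlier in the notebook. A minimal sketch of what such a threshold-based selection could look like with scikit-learn's `SelectFromModel` is given below; `select_by_importance` is a hypothetical stand-in, not the notebook's helper, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

def select_by_importance(fitted_model, threshold, X_train, X_test):
    """Keep only the columns whose importance in fitted_model is >= threshold."""
    selector = SelectFromModel(fitted_model, threshold=threshold, prefit=True)
    return selector.transform(X_train), selector.transform(X_test)

# Synthetic data (assumption): only the first of ten columns drives the target.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 10))
X_te = rng.normal(size=(50, 10))
y_tr = 3.0 * X_tr[:, 0] + rng.normal(scale=0.1, size=200)

base = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = select_by_importance(base, 0.05, X_tr, X_te)
print(X_tr.shape, "->", X_tr_sel.shape)
```

Raising the threshold keeps fewer features, which is why each threshold gets its own hyperparameter search.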

#### feature importance

In [None]:
# Create the DMatrix: 
train_dmatrix = xgb.DMatrix(data=X_train_Ener, label=y_train_Ener)

# Instantiate the initial regressor model: model_gbm_ini
model_gbm_ini = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,eval_metric='rmse',seed=42)

# Fit the initial model to the data
model_gbm_ini.fit(X_train_Ener,y_train_Ener)

In [None]:
plot_feature_importance(model_gbm_ini.feature_importances_,60,X_train_Ener.columns,'xgboost regression: ')

In [None]:
# initialise the parameter distributions for the random search
random_grid_gbm = {
    'colsample_bytree': [0.6,0.7,0.8,0.9,1],
    'max_depth': range(1,11,1),
    'eta' : [0.01,0.05,0.1, 0.15, 0.2],
    'alpha': [0,0.3,0.6,0.9,1],
    'min_child_weight': [1],
    'scale_pos_weight' : [1],
    'n_estimators' :  [100],
     'seed' : [42]
    
}

In [None]:
def perform_grid_search_xgb(model,random_grid,X_train, y_train,thres):
    # The scikit-learn wrapper takes the arrays directly, so no DMatrix is needed here.
    # RandomizedSearchCV samples n_iter parameter combinations from random_grid.
    random_search = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 300, cv = 4, 
                               verbose=2, random_state=12,scoring ='explained_variance')
    # Fit the random search model
    random_search.fit(X_train, y_train)
    # print output
    print('-'*60)
    print('Results from Grid Search with threshold = {}'.format(thres))
    print("Best score:",random_search.best_score_)
    print("with the following parameters :\n",random_search.best_params_)
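
The searches in this section are scored with `explained_variance`, while the held-out predictions are scored with `r2_score`. The two metrics only agree when the residuals have zero mean, which the toy example below illustrates (the numbers are made up for the illustration).

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Toy values: the prediction has the right shape but a constant offset of +2.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 2.0

ev = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"explained variance = {ev:.2f}, R2 = {r2:.2f}")
# prints: explained variance = 1.00, R2 = -1.00
```

Explained variance ignores a systematic bias, whereas R2 penalises it, so a cross-validation score and a test R2 are not directly comparable when the model is biased.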

In [None]:
# # do first random search with threshold = 0.001
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.001, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.001)

Results from Grid Search with threshold = 0.001

Best score: **0.23901575756703272**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.05, 'colsample_bytree': 0.6, 'alpha': 0.6}

In [None]:
# # do first random search with threshold = 0.002
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.002, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.002)

Results from Grid Search with threshold = 0.002

Best score: **0.3211045305680925**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 6, 'eta': 0.05, 'colsample_bytree': 0.7, 'alpha': 0.6}

In [None]:
# # do first random search with threshold = 0.003
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.003, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.003)

Results from Grid Search with threshold = 0.003

Best score: **0.33907105956175704**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.15, 'colsample_bytree': 0.8, 'alpha': 0.6}

In [None]:
# # do first random search with threshold = 0.004
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.004, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.004)

Results from Grid Search with threshold = 0.004

Best score: **0.3554615103664501**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 8, 'eta': 0.05, 'colsample_bytree': 0.9, 'alpha': 0.3}

In [None]:
# # do first random search with threshold = 0.005
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.005, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.005)

Results from Grid Search with threshold = 0.005

Best score: **0.3678609550293852**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.2, 'colsample_bytree': 0.6, 'alpha': 0.3}

In [None]:
# # do first random search with threshold = 0.006
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.006, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.006)

Results from Grid Search with threshold = 0.006

Best score: **0.3678609550293852**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.2, 'colsample_bytree': 0.6, 'alpha': 0.3}

In [None]:
# # do first random search with threshold = 0.007
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.007, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.007)

Results from Grid Search with threshold = 0.007

Best score: **0.3717616957479021**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.15, 'colsample_bytree': 0.7, 'alpha': 0.6}

In [None]:
# # do first random search with threshold = 0.008
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.008, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.008)

Results from Grid Search with threshold = 0.008

Best score: **0.3577037794408484**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.1, 'colsample_bytree': 0.7, 'alpha': 0}

In [None]:
# # do first random search with threshold = 0.009
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.009, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the random search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_Ener,0.009)

Results from Grid Search with threshold = 0.009

Best score: **0.3547285700797372**

with the following parameters :

 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.05, 'colsample_bytree': 0.8, 'alpha': 0}

#### Run gridsearchCV with threshold = 0.007

In [None]:
# #### grid search on the features selected with threshold = 0.007
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.007, X_train_Ener, X_test_Ener)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# gbm_param_grid = {
#     'colsample_bytree': [0.65,0.67,0.69, 0.7,0.72,0.74],
#     'max_depth': range(1,4,1),   
#     'eta' : [0.13,0.14,0.15, 0.16, 0.17],
#     'alpha': [0.4,0.5,0.6,0.7,0.8],
#     'min_child_weight': [1,1.5],
#     'scale_pos_weight' : [1,1.5],
#     'n_estimators' :  [100],
#      'seed' : [42],
    
# }

# grid_search = GridSearchCV(estimator = model_gbm, param_grid = gbm_param_grid, cv = 4, verbose = 2,scoring ='explained_variance')

# # Fit the grid search to the data
# grid_search.fit(X_train_selected, y_train_Ener)
# grid_search.best_params_
# grid_search.best_score_

best score : **0.40611976681211237**

{'alpha': 0.4,
 'colsample_bytree': 0.67,
 'eta': 0.14,
 'max_depth': 2,
 'min_child_weight': 1.5,
 'n_estimators': 100,
 'scale_pos_weight': 1,
 'seed': 42}

#### Run best gbm

In [None]:
import xgboost as xgb

# select the features with an importance above the best threshold (0.007)
X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.007, X_train_Ener, X_test_Ener)

# Create the DMatrix: 
train_dmatrix = xgb.DMatrix(data=X_train_selected, label=y_train_Ener)

# the model is named model_gbm so that it does not shadow the xgb module
model_gbm  = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,min_child_weight= 1.5,
                        max_depth = 2, eta = 0.14,colsample_bytree = 0.67,
                        alpha = 0.4,seed=42,scale_pos_weight=1)
                        
# Fit the regressor to the training set
model_gbm.fit(X_train_selected,y_train_Ener)

## Predicting the target value
prediction = model_gbm.predict(X_test_selected)

# evaluation model
model_evaluation(y_test_Ener,prediction)

The model performs better on the training set than on the testing set, so we will try to lower:

- `colsample_bytree` (the ratio of features used),
- `subsample` (the ratio of the training instances used),
- `eta` (the learning rate of our GBM, i.e. how much we update our prediction with each successive tree),
    
and to increase:

- `gamma` (the minimum loss reduction required to make a further split),
- `min_child_weight` (the minimum sum of instance weight needed in a leaf)
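
To see the direction in which two of these parameters act, the sketch below compares train and test R2 for an aggressive and a conservative setting. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for xgboost (an assumption made to keep the example light), where `learning_rate` plays the role of `eta` and `subsample` has the same meaning. The data are synthetic and `boosting_scores` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def boosting_scores(X, y, **gbm_params):
    """Return (train R2, test R2) for a gradient boosting model fitted with gbm_params."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0, **gbm_params)
    model.fit(X_tr, y_tr)
    return r2_score(y_tr, model.predict(X_tr)), r2_score(y_te, model.predict(X_te))

# Synthetic data (assumption): one informative feature, 19 noise features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(150, 20))
y_toy = X_toy[:, 0] + rng.normal(scale=1.0, size=150)

for params in ({"learning_rate": 0.5, "subsample": 1.0},
               {"learning_rate": 0.01, "subsample": 0.5}):
    r2_train, r2_test = boosting_scores(X_toy, y_toy, **params)
    print(params, f"train R2={r2_train:.2f}", f"test R2={r2_test:.2f}")
```

Whether the conservative setting also improves the test score depends on the data, so each change should be checked against the held-out set, as done in the next cells.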

In [None]:
import xgboost as xgb
# lower 'colsample_bytree': 0.67 to 0.07
# the model is named model_gbm so that it does not shadow the xgb module
model_gbm  = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,min_child_weight= 1.5,
                        max_depth = 2, eta = 0.14,colsample_bytree = 0.07,
                        alpha = 0.4,seed=42,scale_pos_weight=1)
                        
# Fit the regressor to the training set
model_gbm.fit(X_train_selected,y_train_Ener)

## Predicting the target value
prediction = model_gbm.predict(X_test_selected)

# evaluation model
r2 = round(metrics.r2_score(y_test_Ener, prediction), 2)
print(r2)

R2 on the testing set changes from 0.8 to 0.18

In [None]:
# decrease subsample from 1 to 0.5 (colsample_bytree stays at 0.07)
import xgboost as xgb
# the model is named model_gbm so that it does not shadow the xgb module
model_gbm  = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,min_child_weight= 1.5,
                        max_depth = 2, eta = 0.14,colsample_bytree = 0.07,subsample=0.5,
                        alpha = 0.4,seed=42,scale_pos_weight=1)
                        
# Fit the regressor to the training set
model_gbm.fit(X_train_selected,y_train_Ener)

## Predicting the target value
prediction = model_gbm.predict(X_test_selected)

# evaluation model
r2 = round(metrics.r2_score(y_test_Ener, prediction), 2)
print(r2)

# evaluation model
model_evaluation(y_test_Ener,prediction)

# add score to list (scored against y_test_Ener, the split used for the prediction)
r2_xgb = metrics.r2_score(y_test_Ener, prediction)
mse_xgb = metrics.mean_squared_error(y_test_Ener, prediction)

Tuning the other parameters did not help.

<a id="5.5"></a>
## 5.5: Summary -- prediction of monthly seismic energy released --

In [None]:
# create list with the model score
list_score_r2 = [r2_linreg,r2_ridge,r2_Lasso,r2_elas,r2_rf,r2_xgb]
list_score_mse = [mse_linreg,mse_ridge,mse_Lasso,mse_elas,mse_rf,mse_xgb]
# create list with model name
list_regressors = ['linear','Ridge','Lasso','ElaNet','RF','xgboost']

# create dictionary and dataframe
dic_score = {'model': list_regressors,
            'score_R2':list_score_r2,
            'score_mse':list_score_mse,}

dic_score = pd.DataFrame(dic_score)
dic_score

In [None]:
# Plot the R2 and MSE scores of each model side by side
fig, (ax, ax2) = plt.subplots(1, 2, figsize=(15,5))
sns.pointplot(x = "model", y = "score_R2", data = dic_score, ax = ax) 
ax.set_ylabel('Score (R2)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(labelsize=14)

# add annotations one by one with a loop
for ind in dic_score.index: 
     ax.text(ind,dic_score['score_R2'][ind]+0.002,'{:.5f}'.format(dic_score['score_R2'][ind]),
             horizontalalignment='left', color='black', weight='semibold',fontsize=12)
        
sns.pointplot(x = "model", y = "score_mse", data = dic_score, ax = ax2) 
ax2.set_ylabel('Score (MSE)', size=20, labelpad=12.5)
ax2.set_xlabel('Model', size=20, labelpad=12.5)
ax2.tick_params(labelsize=14)

# add annotations one by one with a loop
for ind in dic_score.index: 
     ax2.text(ind,dic_score['score_mse'][ind]+0.002,'{:.5f}'.format(dic_score['score_mse'][ind]),
             horizontalalignment='left', color='black', weight='semibold',fontsize=12)
        
# one title for the whole figure
fig.suptitle("Models' scores", size=20)
plt.savefig('Models scores Energy2.png')


#### Visualize original data vs prediction with Ridge

In [None]:
# select columns that give best results
columns = meaningful_features_Energy.index[:22]
X = X_Ener[columns].values
y = target_Ener
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

## best fit: 
ridge_bf = Ridge(alpha= 1, normalize=True,tol = 1e-09)
## fit the model. 
ridge_bf.fit(X_train, y_train)
## Predicting the target value based on X_test (the prediction is on the log scale)
log_prediction = ridge_bf.predict(X_test)
# back-transform prediction and test target from the log scale
prediction =  np.exp(log_prediction)
original_data = np.exp(y_test)
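
Because the model is fitted on the log of the energy and the result is back-transformed with `np.exp`, a good score on the log scale does not guarantee an equally good score on the original scale. The sketch below illustrates this with made-up values (not the catalogue data).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical energies spanning several orders of magnitude (not real data).
energy_true = np.array([1e3, 1e4, 1e5, 1e6])
log_true = np.log(energy_true)
# Hypothetical log-scale predictions with modest errors.
log_pred = log_true + np.array([0.1, -0.1, 0.1, -0.5])

r2_log = r2_score(log_true, log_pred)
r2_energy = r2_score(energy_true, np.exp(log_pred))
print(f"R2 on the log scale: {r2_log:.3f}, R2 after np.exp: {r2_energy:.3f}")
```

After the back-transform, an error on the largest events dominates the score, which is worth keeping in mind when reading the plot of original versus predicted energy.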



In [None]:
# Plot the original data and the Ridge prediction
fig, axes = plt.subplots(1,figsize=(10,5))

plt.plot(original_data,'-ob',label='original data')
plt.plot(prediction,'-dr',label='prediction')
plt.ylabel('monthly seismic energy',size=16)
plt.legend()

plt.title("Original data vs predicted", size=20)

plt.savefig('Energy -  original vs predicted.png')

## <a id="ML2"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>6: MACHINE LEARNING: prediction of monthly b-value</center></h3>

<a id="6.1"></a>
## 6.1: Feature selection and data split for linear models

In [None]:
# select only variables highly correlated with targets
X_bval = df_all_data[[c for c in df_all_data.columns if c in meaningful_features_bval.index]]
# select target: log_b_values
target_bval = df_all_data['log_b_value'].values
# split the data
X_train_bval, X_test_bval, y_train_bval, y_test_bval = train_test_split(X_bval, target_bval,test_size = .3, random_state=0)

<a id="6.2"></a>
## 6.2: Linear model
### 6.2.1: Feature selection with Lasso

In [None]:
lasso_score = []
number_feature = []
for i in range(1,X_bval.shape[1]):
    # select the first i features of meaningful_features_bval
    columns = meaningful_features_bval.index[:i]
    X = X_bval[columns].values
    y = target_bval
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
    
    lasso = Lasso(normalize=True)
    parameters = {'alpha': [1e-8,1e-5,1e-4, 1e-3,1e-2,0.5,1,2],
              'tol':[1e-6,1e-5,1e-4,1e-3,1e-2]}
    grid_lasso = GridSearchCV(lasso, parameters, cv=12, verbose=0, scoring = 'explained_variance')
    grid_lasso.fit(X_train, y_train)
    
    sc_lasso = get_best_score(grid_lasso)
    lasso_score.append(sc_lasso)
    number_feature.append(i)

result_lasso =  pd.DataFrame(zip(number_feature,lasso_score),columns = ['number_feat', 'best_score'])

In [None]:
plot_nb_feature_vs_score(result_lasso)

#### Run gridsearch with right number of features

In [None]:
# # select columns that give best results
# columns = meaningful_features_bval.index[:62]
# X = X_bval[columns].values
# y = target_bval
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# # Lasso regression model and gridsearch
# grid_lasso = GridSearchCV(lasso, parameters, cv=12, verbose=1, scoring = 'explained_variance')
# grid_lasso.fit(X_train, y_train)

# sc_lasso = get_best_score(grid_lasso)

#### Run best fit Lasso

In [None]:
## best fit: 
lasso = Lasso(alpha= 0.0001, normalize=True,tol = 0.01)
## fit the model. 
lasso.fit(X_train, y_train)
## Predicting the target value based on X_test
prediction = lasso.predict(X_test)

#add score to list
r2_Lasso = metrics.r2_score(y_test, prediction)
mse_Lasso = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="6.2.2"></a>
### 6.2.2: Linear regression
#### GridSearchCV() with number of features = 62

In [None]:
# select columns that give best results
columns = meaningful_features_bval.index[:62]
X = X_bval[columns].values
y = target_bval
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# linear regression model and gridsearch
linreg = LinearRegression()
parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True, False]}
grid_linear = GridSearchCV(linreg, parameters, cv=12, verbose=1 , scoring ='explained_variance')
grid_linear.fit(X_train, y_train)

sc_linear = get_best_score(grid_linear)

#### Best result with linear model

In [None]:
model_linreg = LinearRegression(copy_X= True, fit_intercept= True, normalize= False)
model_linreg.fit(X_train,y_train)
prediction = model_linreg.predict(X_test)

#add score to list
r2_linreg = metrics.r2_score(y_test, prediction)
mse_linreg = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="6.2.3"></a>
### 6.2.3: Ridge model
#### GridSearchCV() with number of features = 62

In [None]:
# select columns that give best results
columns = meaningful_features_bval.index[:62]
X = X_bval[columns].values
y = target_bval
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# Ridge regression model and gridsearch
ridge = Ridge(normalize=True)
parameters = {'alpha':[1e-6,1e-5,1e-4,1e-3,1e-2,0.1,0.5,1], 
              'tol':[1e-9,1e-7,1e-6,1e-5,1e-4]}
grid_ridge = GridSearchCV(ridge, parameters, cv=12, verbose=1, scoring = 'explained_variance')
grid_ridge.fit(X_train, y_train)

sc_ridge = get_best_score(grid_ridge)

#### Best result with Ridge

In [None]:
## best fit: 
ridge_bf = Ridge(alpha= 0.1, normalize=True,tol = 1e-09)
## fit the model. 
ridge_bf.fit(X_train, y_train)
## Predicting the target value based on X_test
prediction = ridge_bf.predict(X_test)

#add score to list
r2_ridge = metrics.r2_score(y_test, prediction)
mse_ridge = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="6.2.4"></a>
### 6.2.4: ElasticNet regression
#### GridSearchCV() with number of features = 62

In [None]:
# # select columns that give best results
# columns = meaningful_features_bval.index[:62]
# X = X_bval[columns].values
# y = target_bval
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# # ElasticNet regression model and gridsearch
# elastic_reg = ElasticNet()
# parameters = {
#     'alpha': [1e-6,1e-5, 1e-4, 5e-3,1e-3,5e-2, 1e-2, 5e-1, 1e-1,1.0],
#     'l1_ratio': [1e-6,1e-5, 1e-4,5e-3, 1e-3,5e-2, 1e-2, 5e-1,1e-1,1.0],
#     'tol': [1e-6,1e-5, 1e-4,1e-3]}
# grid_elas = GridSearchCV(elastic_reg, parameters, cv=12, verbose=1, scoring = 'explained_variance')
# grid_elas.fit(X_train, y_train)

# sc_elas = get_best_score(grid_elas)

#### Best result with ElasticNet

In [None]:
## best fit: 
elastic = ElasticNet(alpha= 0.05,l1_ratio= 1e-6,tol=1e-06)
## fit the model. 
elastic.fit(X_train, y_train)
## Predicting the target value based on X_test
prediction = elastic.predict(X_test)

#add score to list
r2_elas = metrics.r2_score(y_test, prediction)
mse_elas = metrics.mean_squared_error(y_test, prediction)

# evaluation model
model_evaluation(y_test,prediction)

<a id="6.3"></a>
## 6.3: Decision tree methods 
### 6.3.1: Random Forest Regressor
- First, we evaluate the importance of the features.
- Then, we create several models trained on different sets of input features and perform a RandomizedSearchCV on each model.
- Finally, we select the set of features that gave the highest score, initiate a new model and perform a grid search.
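
The second step repeats the same cell once per threshold. As a sketch only, the sweep could be written as a single loop that collects the best score for each threshold. It uses scikit-learn directly instead of the notebook's `select_model` and `perform_grid_search` helpers, runs on synthetic data, and uses a deliberately small search (`n_iter=4`, `cv=3`) so that it finishes quickly; `sweep_thresholds` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import RandomizedSearchCV

def sweep_thresholds(base_model, thresholds, X_train, y_train, param_distributions,
                     n_iter=4, cv=3):
    """Run one randomized search per importance threshold and collect the results."""
    results = []
    for thr in thresholds:
        X_sel = SelectFromModel(base_model, threshold=thr, prefit=True).transform(X_train)
        if X_sel.shape[1] == 0:  # threshold too strict: no feature left
            continue
        search = RandomizedSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                                    param_distributions, n_iter=n_iter, cv=cv,
                                    scoring="explained_variance", random_state=12)
        search.fit(X_sel, y_train)
        results.append({"threshold": thr, "n_features": X_sel.shape[1],
                        "best_score": search.best_score_, "best_params": search.best_params_})
    return results

# Synthetic data (assumption): two informative columns out of ten.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(120, 10))
y_syn = 3.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=120)
base_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_syn, y_syn)

grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 2, 5]}
results = sweep_thresholds(base_rf, [0.01, 0.05], X_syn, y_syn, grid)
best = max(results, key=lambda r: r["best_score"])
print(best["threshold"], best["n_features"], round(best["best_score"], 3))
```

The threshold with the highest cross-validated score is then the one carried forward to the finer grid search.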

#### feature importance

In [None]:
# initialise the random forest regressor model and fit it
model_rf_ini = RandomForestRegressor(n_estimators= 100,random_state = 0)
model_rf_ini.fit(X_train_bval, y_train_bval)
# extract and plot important features
plot_feature_importance(model_rf_ini.feature_importances_,60,X_train_bval.columns,'Random Forest Regressor: ')

#### feature selection and gridsearch

In [None]:
# Create the random grid
random_grid_rf = {
                # measure quality of the split
                'criterion': ['mse', 'mae'],
                # Number of features to consider at every split  
               'max_features': ['auto', 'sqrt','log2'],
                # Maximum number of levels in tree
               'max_depth': [10,50,100,200,None],
                # Minimum number of samples required to split a node
               'min_samples_split': range(2,10,1),
                # Minimum number of samples required at each leaf node
               'min_samples_leaf': range(1,10,1),
                # Method of selecting samples for training each tree
               'bootstrap': [True, False],
                'n_estimators' : [100],
                'random_state': [0]}

In [None]:
# # do first random search with threshold = 0.002
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.002, X_train_bval, X_test_bval)
# # initiate model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.002)

Results from Grid Search with threshold = 0.002

Best score: **0.869887463593535**

with the following parameters :
 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 200, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.004
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.004, X_train_bval, X_test_bval)
# # initiate model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the random search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.004)

Results from Grid Search with threshold = 0.004

Best score: **0.8717680156094305**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.006
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.006, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.006)

Results from Grid Search with threshold = 0.006

Best score: **0.8723766584804427**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 200, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.008
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.008, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.008)

Results from Grid Search with threshold = 0.008

Best score: **0.8817541154750542**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 100, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.009
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.009, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.009)

Results from Grid Search with threshold = 0.009

Best score: **0.8611314245563841**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 100, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.01
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.01, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.01)

Results from Grid Search with threshold = 0.01

Best score: **0.8629705092133909**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 200, 'criterion': 'mse', 'bootstrap': True}

In [None]:
# # do first random search with threshold = 0.012
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.012, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.012)

Results from Grid Search with threshold = 0.012

Best score: **0.8681300519859287**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.014
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.014, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.014)

Results from Grid Search with threshold = 0.014

Best score: **0.857859455823619**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.016
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.016, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.016)

Results from Grid Search with threshold = 0.016

Best score: **0.8472248203819742**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'mse', 'bootstrap': False}

In [None]:
# # do first random search with threshold = 0.018
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.018, X_train_bval, X_test_bval)
# # initialize model
# model_rf = RandomForestRegressor(random_state = 0)
# # perform the grid search
# perform_grid_search(model_rf,random_grid_rf,X_train_selected, y_train_bval,0.018)

Results from Grid Search with threshold = 0.018

Best score: **0.8446134637581928**

with the following parameters :

 {'random_state': 0, 'n_estimators': 100, 'min_samples_split': 7, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 100, 'criterion': 'mse', 'bootstrap': True}
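
The ten searches above can be compared at a glance by collecting the reported best scores in one place. The short sketch below simply copies the scores listed in this section into a dictionary and picks the threshold with the highest one; the variable names are illustrative.

```python
# Best cross-validated scores reported above, keyed by importance threshold.
scores_by_threshold = {
    0.002: 0.869887463593535,
    0.004: 0.8717680156094305,
    0.006: 0.8723766584804427,
    0.008: 0.8817541154750542,
    0.009: 0.8611314245563841,
    0.01: 0.8629705092133909,
    0.012: 0.8681300519859287,
    0.014: 0.857859455823619,
    0.016: 0.8472248203819742,
    0.018: 0.8446134637581928,
}

# Threshold whose search produced the highest score.
best_threshold = max(scores_by_threshold, key=scores_by_threshold.get)

for threshold, score in sorted(scores_by_threshold.items()):
    marker = " <- best" if threshold == best_threshold else ""
    print(f"{threshold:<6} {score:.4f}{marker}")
```

The scores differ by only a few hundredths, so the ranking may be sensitive to the random seed and the cross-validation split.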

### GridSearchCV with threshold = 0.008 (best score)

In [None]:
random_grid_rf = {
                # measure quality of the split
                'criterion': ['mse', 'mae'],
                # Number of features to consider at every split  
               'max_features': ['auto', 'sqrt','log2'],
                # Maximum number of levels in tree
               'max_depth': [10,50,100,200,None],
                # Minimum number of samples required to split a node
               'min_samples_split': range(2,10,1),
                # Minimum number of samples required at each leaf node
               'min_samples_leaf': range(1,10,1),
                # Method of selecting samples for training each tree
               'bootstrap': [True, False],
                'n_estimators' : [100],
                'random_state': [0]}

In [None]:
# # do first random search with threshold = 0.0013
# X_train_selected, X_test_selected =  select_model(model_rf_ini,0.0013, X_train_bval, X_test_bval)

# model_rf = RandomForestRegressor()



# param_grid = {
#     'bootstrap': [True],
#     'criterion': ['mse','mae'],
#     'max_depth': range(75,155,5),
#     'max_features': ['log2'],
#     'min_samples_leaf': [1,2,3,4],
#     'min_samples_split': [2,3,4],
#     'n_estimators': [100],
#     'random_state': [0]
# }

# grid_search = GridSearchCV(estimator = model_rf, param_grid = param_grid, cv = 6
#                            , verbose = 2,scoring ='explained_variance')

# # Fit the grid search to the data
# grid_search.fit(X_train_selected, y_train_bval)
# grid_search.best_params_

Result of the first grid search:
`bootstrap`: True,`criterion`: 'mae', `max_depth`: 75, `max_features`: 'log2', `min_samples_leaf`: 1,
`min_samples_split`: 2,`n_estimators`: 100,`random_state`: 0

In [None]:
# select features with threshold = 0.008
X_train_selected, X_test_selected =  select_model(model_rf_ini,0.008, X_train_bval, X_test_bval)

# refit with the parameters returned by the grid search (bootstrap must be a boolean, not the string 'True')
model_rf = RandomForestRegressor(bootstrap = True,criterion='mae',max_depth = 75,max_features = 'log2',
                                min_samples_leaf = 1, min_samples_split = 2, n_estimators = 100,random_state =  0)
model_rf.fit(X_train_selected,y_train_bval)
prediction = model_rf.predict(X_test_selected)

#add score to list
r2_rf = metrics.r2_score(y_test_bval, prediction)
mse_rf = metrics.mean_squared_error(y_test_bval, prediction)

# evaluation model
model_evaluation(y_test_bval,prediction)
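
Note that the grid search is scored with `explained_variance`, while the test set is evaluated with `r2_score`. The two agree only when the residuals have zero mean. The plain-Python sketch below, with made-up values, shows how a constant bias leaves the explained variance untouched but lowers R2.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def explained_variance(y_true, y_pred):
    """1 - Var(residual) / Var(y_true): ignores a constant offset in the residuals."""
    residual = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residual) / variance(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: penalises a constant offset as well."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    m = mean(y_true)
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only: every prediction is 0.5 too high.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

ev_score = explained_variance(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(ev_score, r2_value)
```

A model selected on explained variance can therefore look slightly worse once it is scored with R2 on the test set.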

<a id="6.3.2"></a>
### 6.3.2: xgboost
#### Feature importance

In [None]:
import xgboost as xgb

# Create the DMatrix: 
train_dmatrix = xgb.DMatrix(data=X_train_bval, label=y_train_bval)

# Instantiate the initial regressor model: model_gbm_ini
model_gbm_ini = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,eval_metric='rmse',seed=42)

# Fit the initial model to the training data
model_gbm_ini.fit(X_train_bval,y_train_bval)

In [None]:
plot_feature_importance(model_gbm_ini.feature_importances_,60,X_train_bval.columns,'xgboost regression: ')

#### Feature selection

In [None]:
# # do first random search with threshold = 0.0001
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0001, X_train_bval, X_test_bval)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the grid search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_bval,0.0001)

Results from Grid Search with threshold = 0.0001

Best score: **0.8864982237714087**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.2, 'colsample_bytree': 0.8, 'alpha': 0}

In [None]:
# do first random search with threshold = 0.0002
X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0002, X_train_bval, X_test_bval)

# Instantiate the regressor: gbm
model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# perform the grid search
perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_bval,0.0002)

Results from Grid Search with threshold = 0.0002

Best score: **0.8875462686991474**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.15, 'colsample_bytree': 1, 'alpha': 0}

In [None]:
# # do first random search with threshold = 0.0003
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0003, X_train_bval, X_test_bval)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the grid search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_bval,0.0003)

Results from Grid Search with threshold = 0.0003

Best score: **0.891920367880493**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 1, 'eta': 0.2, 'colsample_bytree': 0.8, 'alpha': 0}

In [None]:
# do first random search with threshold = 0.0004
X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0004, X_train_bval, X_test_bval)

# Instantiate the regressor: gbm
model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# perform the grid search
perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_bval,0.0004)

Results from Grid Search with threshold = 0.0004

Best score: **0.8814383131990927**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 2, 'eta': 0.1, 'colsample_bytree': 0.7, 'alpha': 0}

In [None]:
# # do first random search with threshold = 0.0005
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0005, X_train_bval, X_test_bval)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# # perform the grid search
# perform_grid_search_xgb(model_gbm,random_grid_gbm,X_train_selected, y_train_bval,0.0005)

Results from Grid Search with threshold = 0.0005

Best score: **0.8773872353186494**

with the following parameters :
 {'seed': 42, 'scale_pos_weight': 1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 4, 'eta': 0.05, 'colsample_bytree': 0.6, 'alpha': 0}

#### Run gridsearch with threshold 0.0003 

In [None]:
# #### do first random search with threshold = 0.0003
# X_train_selected, X_test_selected =  select_model(model_gbm_ini,0.0003, X_train_bval, X_test_bval)

# # Instantiate the regressor: gbm
# model_gbm = xgb.XGBRegressor(objective='reg:squarederror')

# gbm_param_grid = {
#     'colsample_bytree': [0.75,0.77,0.79, 0.8,0.82,0.84],
#     'max_depth': range(1,3,1),   
#     'eta' : [0.16,0.18,0.2, 0.22, 0.24],
#     'alpha': [0,0.1,0.2,0.3],
#     'min_child_weight': [1],
# #     'scale_pos_weight' : [1],
#     'n_estimators' :  [100],
#      'seed' : [42],
    
# }

# grid_search = GridSearchCV(estimator = model_gbm, param_grid = gbm_param_grid, cv = 4, verbose = 2,scoring ='explained_variance')

# # Fit the grid search to the data
# grid_search.fit(X_train_selected, y_train_bval)
# grid_search.best_params_
# grid_search.best_score_

Result gridsearchCV:

BEST SCORE  = **0.891920367880493**, with:

{'alpha': 0,
 'colsample_bytree': 0.8,
 'eta': 0.2,
 'max_depth': 1,
 'min_child_weight': 1,
 'n_estimators': 100,
 'seed': 42}

In [None]:
import xgboost as xgb
# refit with the best parameters from the grid search
# (named model_xgb so that the estimator does not shadow the xgboost module)
model_xgb = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,min_child_weight= 1,
                        max_depth = 1, eta = 0.2,colsample_bytree = 0.8,
                        alpha = 0.0,seed=42,scale_pos_weight=1)
                        
# Fit the regressor to the training set
model_xgb.fit(X_train_selected,y_train_bval)

## Predicting the target value
prediction = model_xgb.predict(X_test_selected)

# evaluation model
r2 = round(metrics.r2_score(y_test_bval, prediction), 2)
print(r2)

The model performs better on the training set than on the test set, which suggests overfitting. To reduce it, we will try to lower:

- `colsample_bytree` (the ratio of features used),
- `subsample` (the ratio of the training instances used),
- `eta` (the learning rate of our GBM, i.e. how much we update our prediction with each successive tree).
    
and to increase:

- `gamma` (the minimum loss reduction required to make a further split),
- `min_child_weight` (the minimum sum of instance weight needed in a leaf)
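
To make the role of `eta` concrete, here is a toy gradient-boosting loop in plain Python with depth-1 stumps on a single feature. It is only an illustration of shrinkage under squared error, not xgboost's implementation (which also uses second-order gradients and regularization terms); the data and function names are invented for the example.

```python
def fit_stump(x, residual):
    """Best single-split regression stump (max_depth = 1) on a 1-D feature."""
    best = None
    for split in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residual) if xi < split]
        right = [r for xi, r in zip(x, residual) if xi >= split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi < split else right_mean


def boost(x, y, eta, n_rounds):
    """Each round fits a stump to the residuals and adds eta times its output."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        pred = [pi + eta * stump(xi) for xi, pi in zip(x, pred)]
    return pred


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Made-up, roughly linear data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]

baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
train_mse = {eta: mse(y, boost(x, y, eta, n_rounds=10)) for eta in (0.05, 0.2, 1.0)}
for eta, value in train_mse.items():
    print(eta, round(value, 4))
```

With a fixed number of trees, a smaller `eta` fits the training data less closely, which is the regularizing effect sought here.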

In [None]:
import xgboost as xgb
# lower eta, colsample_bytree and subsample to reduce overfitting
# (named model_xgb so that the estimator does not shadow the xgboost module)
model_xgb = xgb.XGBRegressor(objective='reg:squarederror',n_estimators = 100,min_child_weight= 1,
                        max_depth = 1, eta = 0.1,colsample_bytree = 0.6,
                        alpha = 0.0,seed=42,scale_pos_weight=1,subsample = 0.8)
                        
# Fit the regressor to the training set
model_xgb.fit(X_train_selected,y_train_bval)

## Predicting the target value
prediction = model_xgb.predict(X_test_selected)

#add score to list
r2_xgb = metrics.r2_score(y_test_bval, prediction)
mse_xgb = metrics.mean_squared_error(y_test_bval, prediction)

# evaluation model
model_evaluation(y_test_bval,prediction)

<a id="6.4"></a>
## 6.4: Summary -- prediction of the monthly b-value --

In [None]:
# create list with the model score
list_score_r2 = [r2_linreg,r2_ridge,r2_Lasso,r2_elas,r2_rf,r2_xgb]
list_score_mse = [mse_linreg,mse_ridge,mse_Lasso,mse_elas,mse_rf,mse_xgb]
# create list with model name
list_regressors = ['linear','Ridge','Lasso','ElaNet','RF','xgboost']

# create dictionary and dataframe
dic_score = {'model': list_regressors,
            'score_R2':list_score_r2,
            'score_mse':list_score_mse,}

dic_score_bval = pd.DataFrame(dic_score)
dic_score_bval

In [None]:
# Plot the R2 and MSE scores of each model side by side
fig, (ax, ax2) = plt.subplots(1,2,figsize=(15,5))
sns.pointplot(x = "model", y = "score_R2", data = dic_score_bval,color='green',ax=ax) 
ax.set_ylabel('Score (R2)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(labelsize=14)

# annotate each point with its score
for ind in dic_score_bval.index: 
     ax.text(ind,dic_score_bval['score_R2'][ind]+0.002,'{:.5f}'.format(dic_score_bval['score_R2'][ind]),
             horizontalalignment='left', color='black', weight='semibold',fontsize=12)
        
sns.pointplot(x = "model", y = "score_mse", data = dic_score_bval,color='green',ax=ax2) 
ax2.set_ylabel('Score (MSE)', size=20, labelpad=12.5)
ax2.set_xlabel('Model', size=20, labelpad=12.5)
ax2.tick_params(labelsize=14)

# annotate each point with its score
for ind in dic_score_bval.index: 
     ax2.text(ind,dic_score_bval['score_mse'][ind]+5e-5,'{:.5f}'.format(dic_score_bval['score_mse'][ind]),
             horizontalalignment='left', color='black', weight='semibold',fontsize=12)
        
# one title for the whole figure rather than for the last subplot only
fig.suptitle("Models' scores", size=20)
plt.savefig('Models scores bvalues.png')


### Visualize prediction made with Ridge

In [None]:
# select columns that give best results
columns = meaningful_features_bval.index[:62]
X = X_bval[columns].values
y = target_bval
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

## best fit: 
ridge_bf = Ridge(alpha= 0.1, normalize=True,tol = 1e-09)
## fit the model. 
ridge_bf.fit(X_train, y_train)
## Predict on the test set; the model works on the log of the b-value
log_prediction = ridge_bf.predict(X_test)
## Convert predictions and targets back to the original scale
prediction =  np.exp(log_prediction)
original_data = np.exp(y_test)
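
Because the predictions are made on a log scale and then exponentiated, an additive error in log space becomes a multiplicative error on the b-value itself. The sketch below uses hypothetical b-values and errors (not taken from the catalogue) to show the round trip and why the MSE on the two scales is not directly comparable.

```python
import math

# Hypothetical b-values and log-scale prediction errors, for illustration only.
b_true = [1.0, 1.2, 0.9]
log_errors = [0.05, -0.02, 0.03]

log_true = [math.log(b) for b in b_true]
log_pred = [lt + err for lt, err in zip(log_true, log_errors)]

# Back-transform to the original scale, as np.exp does in the cell above.
b_pred = [math.exp(lp) for lp in log_pred]


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


mse_log = mse(log_true, log_pred)
mse_original = mse(b_true, b_pred)

# Each ratio equals exp(log error): the error is multiplicative after exp.
ratios = [p / t for p, t in zip(b_pred, b_true)]
print(mse_log, mse_original, ratios)
```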

In [None]:
# Plot the original b-values against the Ridge predictions
fig, axes = plt.subplots(1,figsize=(10,5))
      
# plot by position so that both curves share the same x-axis
plt.plot(np.asarray(original_data),'-ob',label='original data')
plt.plot(prediction,'-d',color='green',label='prediction')
plt.ylabel('monthly b-values',size=16)
plt.legend()

plt.title("Original data vs predicted", size=20)

plt.savefig('b values -  original vs predicted.png')