# How long does an outage last?

* Title: 15 Years Of Power Outages

* Objective: Predict how long a power outage lasts

* Kaggle link: https://www.kaggle.com/autunno/15-years-of-power-outages

* Inspired by: __[my EDA on this dataset](https://www.kaggle.com/autunno/eda-15-years-of-power-outage)__

This notebook performs an EDA (Exploratory Data Analysis) on 15 years of power outages to find out how the features relate to each other, and prepares the ground for a Machine Learning model that predicts the duration of an outage.

# How this notebook is organized

1. [Data pre-processing](#1.-Data-pre-processing)
2. [Feature engineering and selection](#2.-Feature-engineering-and-selection)
3. [Data analysis](#3.-Data-analysis)
4. [Data preparation](#4.-Data-preparation)
5. [First model](#5.-First-model)
6. [More models!](#6.-More-models!)
7. [Back to the drawing board](#7.-Back-to-the-drawing-board)

# 1. Data pre-processing

We start by importing all the libraries we're going to use:

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor

We now need to import our data:

In [5]:
# Allow us to see all columns when printing
pd.set_option('display.max_columns', 1000)

# Disable chained assignment warning message
pd.options.mode.chained_assignment = None 

# Read the dataset and print the first 5 rows
dataset = pd.read_csv('../input/Grid_Disruption_00_14_standardized - Grid_Disruption_00_14_standardized.csv')
dataset.head()

In [6]:
print("Number of entries: " + str(len(dataset.index)))

A quick peek at the data shows us that empty values are defined as "Unknown", which means we should treat them as NULL. To decide what to do with each value, we must first analyze how many empty values each column has:

### Analyzing the numeric values

We can see that many columns contain "Unknown", which needs to be cleaned. We need to take special care with our numerical columns. Year is most likely clean, but I expect "Demand Loss (MW)" and "Number of Customers Affected" to be more troublesome:

In [7]:
len(pd.to_numeric(dataset['Year'], 'coerce').dropna().astype(int))

"Year" seems to be perfectly filled, we don't need to worry about it.

In [8]:
len(pd.to_numeric(dataset['Demand Loss (MW)'], 'coerce').dropna().astype(int))

Over 700 rows are not numeric on 'Demand Loss (MW)'. That is quite a lot of missing values (almost 50%), so we'll have to decide whether we want to keep this column or not.

In [9]:
len(pd.to_numeric(dataset['Number of Customers Affected'], 'coerce').dropna().astype(int))

As we can see above, this column has too many non-numeric rows (only 222 are correctly filled), so it may be best to simply drop it. The same goes for 'Demand Loss (MW)': I've already analyzed these columns in __[my EDA](https://www.kaggle.com/autunno/eda-15-years-of-power-outage)__, which showed that this data is not in good enough shape to be used. Besides the usual culprits ("NaN", "Unknown", "None"), we also have some strange choices, such as using "Approx. " and " - " to indicate a possible range of values.
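The counting done in the cells above can be wrapped in a small helper. This is only a sketch: the function name `count_non_numeric` and the sample values are made up for illustration, loosely mimicking the kinds of entries described above.

```python
import pandas as pd

def count_non_numeric(series: pd.Series) -> int:
    """Return how many entries cannot be parsed as a number."""
    parsed = pd.to_numeric(series, errors="coerce")
    return int(parsed.isna().sum())

# Toy column (invented values, not taken from the dataset)
sample = pd.Series(["300", "Unknown", "Approx. 200", "150 - 250", None, "42"])
print(count_non_numeric(sample))
```

Running it on each numeric column gives the number of rows that would be lost if that column were kept and cleaned with dropna().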

### Analyzing the date values

We should also take a look at our date columns, to check if they are all in good shape:

In [10]:
len(dataset[pd.isnull(pd.to_datetime(dataset['Date Event Began'], 'coerce'))])

In [11]:
len(dataset[pd.isnull(pd.to_datetime(dataset['Date of Restoration'], 'coerce'))])

We can see from the above snippets that 'Date of Restoration' has quite a few invalid dates (events described as ongoing, unknown or simply entered incorrectly). We'll have to clean it up along with the other data. We should do the same check for the time attributes:

In [12]:
len(dataset[pd.isnull(pd.to_datetime(dataset['Time Event Began'], 'coerce'))])

In [13]:
len(dataset[pd.isnull(pd.to_datetime(dataset['Time of Restoration'], 'coerce'))])

There are several wrongly inputted time variables (both on start and restoration). These wrong values should all be replaced with np.nan (not a number) so that the rows can be dropped with dropna().
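As a minimal sketch of that idea, assuming a toy frame with made-up values (only the column names come from the dataset), unparseable entries can be turned into NaN and then dropped like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "Date Event Began": ["1/2/2014", "3/4/2014", "not a date"],
    "Time Event Began": ["7:46 AM", "Ongoing", "1:00 PM"],
})

cleaned = toy.copy()
for column in ["Date Event Began", "Time Event Began"]:
    parsed = pd.to_datetime(cleaned[column], errors="coerce")
    # Keep the original string where parsing succeeded, NaN elsewhere
    cleaned[column] = cleaned[column].where(parsed.notna(), np.nan)

# Rows with at least one invalid date or time are removed
cleaned = cleaned.dropna()
print(cleaned)
```

This notebook takes a slightly different route further on (it collects the offending indexes and drops them), but the outcome is the same kind of filtering.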

### So, what now?

We have two very badly formed numeric columns: number of customers affected and demand loss. We can't possibly keep number of customers affected, as that would leave us with too small a dataset. However, we could keep demand loss if it makes sense for our model. It all depends on what we want from the data. We have a few options:
 * Build a regression model to predict demand loss.
 * Build a regression to predict the duration of an outage. 
 * Build a classifier to predict what part of the day (e.g. morning, afternoon, night, etc) or month of the year outages are more likely to occur.
 * Find out which respondents solve outages faster.
 * Only use data which has number of customers affected known, and find the relation between this data and the other features.
 
In this kernel, I've chosen to build a regressor to predict the duration of an outage, which means that demand loss is actually undesirable. The reason is that demand loss is only known after the fact, so we would never be able to use it to predict the duration of a new outage. The only things we need to know are: Duration, event description, NERC Region, Geographic Areas and Respondent.

### Removing undesirable attributes

Having decided what path to take, we can remove unnecessary attributes. Even though we won't use Year on our model, let's keep it for now so that we can use it on our data analysis further on:

In [14]:
dataset = dataset[dataset.columns.difference(['Demand Loss (MW)', 'Number of Customers Affected', 'Event Description'])]

### Removing empty and noisy values

With that in mind, we can continue with our data pre-processing and replace 'Unknown' and 'Ongoing' with NaN on all other columns, so that we can have a better idea of how many empty values we have:

In [15]:
dataset.replace('Unknown', np.nan, inplace=True)
dataset.replace('Ongoing', np.nan, inplace=True)

In [16]:
dataset.isnull().any()

Many columns have empty values. Let's now check how bad it is:

In [17]:
print("Total number of rows: " + str(len(dataset.index)))
print("Number of empty values:")
for column in dataset.columns:
    print(" * " + column + ": " + str(dataset[column].isnull().sum()))

Only a few rows still have empty (NaN) values, so we can just remove them:

In [18]:
dataset = dataset.dropna()

We can now check if our data is properly cleaned:

In [19]:
print("Total number of rows: " + str(len(dataset.index)))
print("Number of empty values:")
for column in dataset.columns:
    print(" * " + column + ": " + str(dataset[column].isnull().sum()))

It apparently is. However, there may still be some dates or times that are not empty (and so were not dropped) but are still not well-formed. Let's take a last look at them:

In [20]:
wrong_date_began = dataset[pd.isnull(pd.to_datetime(dataset['Date Event Began'], 'coerce'))].index
print(wrong_date_began)

In [21]:
wrong_date_restoration = dataset[pd.isnull(pd.to_datetime(dataset['Date of Restoration'], 'coerce'))].index
print(wrong_date_restoration)

In [22]:
wrong_time_began = dataset[pd.isnull(pd.to_datetime(dataset['Time Event Began'], 'coerce'))].index
print(wrong_time_began)

In [23]:
wrong_time_restoration = dataset[pd.isnull(pd.to_datetime(dataset['Time of Restoration'], 'coerce'))].index
print(wrong_time_restoration)

There are still a few wrong dates, and we can easily deal with them by directly removing their indexes:

In [24]:
# append all wrong dates into a single array, turn into a set to remove duplicated indexes and drop them from the feature map
wrong_dates = set(wrong_date_began.append(wrong_date_restoration).append(wrong_time_began).append(wrong_time_restoration))
dataset = dataset.drop(wrong_dates)
print(len(dataset))

The final size of our dataset is 1593; we didn't lose much data in our cleaning process, which is nice. Let's take a last look at our data before proceeding:

In [25]:
# reset_index returns a new dataframe, so the result must be assigned back
dataset = dataset.reset_index(drop=True)
dataset.head()

# 2. Feature engineering and selection

We have three main issues to tackle: 
 * Create a duration attribute, from Date Event Began, Date of Restoration, Time Event Began and Time of Restoration.
 * Create a one hot encoded set of attributes to represent the event tags, since many rows have more than a single value.
 * Either deal with the location (which varies between state, city, county, region), or drop it.

### Getting the duration (in minutes) of an outage

As mentioned in the previous section, I've chosen to build a regressor to predict the duration of an outage, so we'll convert our timestamps into a single attribute containing the duration in minutes. We'll start by putting our dates and times together:

In [26]:
date_start = pd.to_datetime(dataset['Date Event Began'] + ' ' + dataset['Time Event Began'])
date_end = pd.to_datetime(dataset['Date of Restoration'] + ' ' + dataset['Time of Restoration'])

In [27]:
date_end.head()

With that done, we can now calculate the difference in minutes, and then add this new attribute to our dataset, removing the old time columns (let's keep the dates for now, we may want to use it on data analysis):

In [28]:
# Create the attribute 'Duration in Minutes', which is the difference between the end and start date of the event.
dataset['Duration in Minutes'] = (date_end - date_start).dt.total_seconds() / 60
dataset = dataset[dataset.columns.difference(['Time Event Began', 'Time of Restoration'])]

dataset.head()

Before proceeding, we should remove any zero or negative durations, since those most likely mean input errors:

In [29]:
dataset = dataset[(dataset['Duration in Minutes'] > 0)]

A few more entries were removed due to bad duration. Technically, this operation belongs in the data pre-processing section, but it couldn't be done until we had merged the columns. We now also need to take a look at our max, to make sure we don't have any crazy values:

In [30]:
dataset.loc[dataset['Duration in Minutes'].idxmax()]

The maximum duration we have is 188978 minutes, which means 131 days! That is obviously a mistake (otherwise, poor souls). Before proceeding, it would be wise to count how many long outages we have; let's define 4320 minutes as the limit, which is exactly 3 days (3 * 24 * 60):

In [31]:
print('Number of outages that lasted over 3 days: ' + str(len(dataset[(dataset['Duration in Minutes'] > 4320)])))

There are 276 occurrences, which is a fairly big hit to our dataset. However, it's for the best, as it's a tall task to deal with these outliers while still scoring a good prediction for shorter outages.
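To make the threshold arithmetic and the resulting split concrete, here is a small sketch on made-up timestamps (the names `THRESHOLD_MINUTES`, `short` and `long_` are illustrative, not part of the notebook):

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60                 # 1440
THRESHOLD_MINUTES = 3 * MINUTES_PER_DAY   # 4320, the 3-day limit

# Invented start/end timestamps for three outages
start = pd.to_datetime(pd.Series(["2014-01-01 08:00"] * 3))
end = pd.to_datetime(pd.Series(["2014-01-01 09:30", "2014-01-04 08:00", "2014-01-10 08:00"]))

# Same formula as 'Duration in Minutes'
duration = (end - start).dt.total_seconds() / 60

short = duration[duration < THRESHOLD_MINUTES]
long_ = duration[duration >= THRESHOLD_MINUTES]
print(short.tolist(), long_.tolist())
```

Note that an outage lasting exactly 4320 minutes lands in the long group with this split.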

In [32]:
long_outages = dataset[(dataset['Duration in Minutes'] >= 4320)]
dataset = dataset[(dataset['Duration in Minutes'] < 4320)]

We also kept an additional dataframe called long_outages, which contains the really long outages (3 days or more). We most likely won't use this data on a model, but it may provide us with some interesting analysis.

### Splitting the events into several attributes

The main issue with the 'Tags' feature is that it may contain several causes into a single entry (separated by commas), as seen below:

In [33]:
# Select the column by name rather than by position, which depends on the column order
dataset['Tags'].head()

There are a few ways to deal with this:
 1. Accept this attribute as is, considering that the combination of these events is a unique categorical value.
 2. Rank the most important events, and replace the full string with it if they are present
 3. Build a set list with all event types, and create a binary column for each to map all reasons for an outage
 
In this kernel, we're going to explore the mysteries behind door number 3, as it's not a good idea at this point to assume which events are more important (2) and the first option is just lazy.
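As a rough illustration of what option 3 produces, here is a sketch on invented tag strings. It is not the approach used in this notebook (which builds the columns with a substring match); it splits on commas, trims whitespace and matches whole tags instead:

```python
import pandas as pd

# Made-up examples of comma separated tags
tags = pd.Series(["severe weather, storm", "vandalism", "unknown", "storm,  wind"], name="Tags")

flags = (
    tags.str.split(",")      # one list of tags per row
        .explode()           # one row per (outage, tag) pair, index is repeated
        .str.strip()         # trim leading/trailing whitespace
        .pipe(pd.get_dummies)
        .groupby(level=0)    # collapse back to one row per outage
        .max()
        .astype(int)
)

# Drop one category ('unknown') to avoid the dummy variable trap
flags = flags.drop(columns="unknown")
print(flags)
```

Each outage ends up with one 0/1 column per event type, no matter how many tags it had.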

If we always had a fixed number of tags, this could simply be addressed by the following code:

In [34]:
test_split = dataset['Tags'].str.split(',', expand=True)
test_split.head()

Since that's not the case, we're better off creating a unique set with all occurrences and manually creating our one-hot-encoded attributes:

In [35]:
tags = dataset.Tags.str.split(',').tolist()
unique_events = set(x.lstrip() for lst in tags for x in lst)
print(unique_events)

Note that .lstrip() was used in the conversion to trim leading whitespace. One interesting thing to note here is that we can remove 'unknown', and not because it's a bad feature to have. On the contrary, it may actually be relevant, as we don't always know the reason for an outage when it happens. We can remove it because of the dummy variable trap:

In [36]:
tags = dataset['Tags']
unique_events.remove('unknown')

# Create one binary (0/1) feature per event type
for event in unique_events:
    # True where the tag string contains the event name, converted to 0/1
    dataset[event] = tags.str.contains(event).astype(int)
    
dataset = dataset[dataset.columns.difference(['Tags'])]   
dataset.head()

### Geographic Areas

Before deciding how to deal with this attribute, we must first take a look at all the unique values it has:

In [37]:
areas = dataset['Geographic Areas'].unique()
print('Amount of unique areas: ', len(areas))

In [38]:
print('Unique areas: ' + str(areas))

That is too much variety to handle for such a small dataset. Taking a look at it, we can see that the data has no pattern or rule. A few disparity examples:
  * North Part of the Island (What Island??)
  * Entergy System (Not a city/state)
  * Georgia and Alabama (Two states)
  * Southern, Central and Northern New Jersey (Portions of a single state)
  * City of Los Angeles (A single city)
  * Primary Dade County Florida (A single county)
  
That leaves us with two options:
  1. Replace these values with State only
  2. Ignore this column altogether, and use the NERC region instead
  
Going with option (1) is a tough road to take, mainly because not all entries have the state information and a lot of data would be lost for no good reason, since the NERC region should be enough information for us to build an interesting model. That being said, we're going to remove this column:

In [39]:
dataset = dataset[dataset.columns.difference(['Geographic Areas'])]

### Respondent & NERC Region

These columns look pretty simple and could be dealt with by one hot encoding them:

In [40]:
nerc_regions = dataset['NERC Region'].unique()
print('Unique NERC Regions (' + str(len(nerc_regions)) + '):')
print(nerc_regions)

In [41]:
respondent = dataset['Respondent'].unique()
print('Unique Respondents: ' + str(len(respondent)) )

The respondent column seems to be a bit messy. There doesn't seem to be a pattern, and the same value is represented differently in many places. It may be too hard to properly clean it up through code, so it's better to simply disregard it. 

NERC Region on the other hand seems fine and can be used *almost* as is. There are a few cases of two values clustered in a single value separated by comma (e.g. RFC, SERC), which we want to deal with after we create our one hot encoded feature map:

In [42]:
# Apply one hot encoding to NERC Region
nerc_region = pd.get_dummies(dataset['NERC Region'], drop_first=True)
print(nerc_region.columns)

# Append the NERC Region OneHotEncoded attributes to the feature map
dataset = pd.concat([dataset, nerc_region], axis=1)

# Remove the Respondent attribute (NERC Region is kept for the plots in the analysis section)
dataset = dataset[dataset.columns.difference(['Respondent'])]

With our feature map created, we can now deal with the comma separated values:

In [43]:
# Deal with 'RFC, SERC' attribute
dataset['RFC'] = np.where(dataset['RFC, SERC'] + dataset['RFC'] > 0, 1, 0)
dataset['SERC'] = np.where(dataset['RFC, SERC'] + dataset['SERC'] > 0, 1, 0)

# Deal with 'NPCC, RFC' attribute
dataset['NPCC'] = np.where(dataset['NPCC, RFC'] + dataset['NPCC'] > 0, 1, 0)
dataset['RFC'] = np.where(dataset['NPCC, RFC'] + dataset['RFC'] > 0, 1, 0)

# Deal with 'FRCC, SERC' attribute
dataset['FRCC'] = np.where(dataset['FRCC, SERC'] + dataset['FRCC'] > 0, 1, 0)
dataset['SERC'] = np.where(dataset['FRCC, SERC'] + dataset['SERC'] > 0, 1, 0)

# Remove the old (and now unneeded) attributes
dataset = dataset[dataset.columns.difference(['RFC, SERC', 'NPCC, RFC', 'FRCC, SERC'])]
dataset.head()

And that's it, we're done! For reference, the two following images show the new and old NERC regions respectively, which we need to keep in mind since we're dealing with a dataset that has both (it changed after 2010):

New NERC Regions            |  Old NERC Regions
:-------------------------:|:-------------------------:
![](https://www.eia.gov/electricity/data/eia411/images/nerc_2010.jpg)  |  ![](https://www.eia.gov/electricity/data/eia411/images/nerc_new.jpg)

Ideally, we would separate it into two datasets (one before 2010, and another one after), but we could end up with too little data that way for the newer outages.

The data is now ready for analysis and modeling! We ended up only keeping the OneHotEncoded NERC Regions and the cause of the events. That should be fine for what we're trying to achieve, since the other data would not be promptly available to help us determine how long of an outage we're looking at.
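The merging of comma separated NERC regions done above can also be expressed in a single step. As a minimal sketch on made-up values, and assuming the separator is always a comma followed by a space (as in 'RFC, SERC'), pandas' `Series.str.get_dummies` splits each entry and builds one 0/1 column per region. Unlike the `drop_first=True` call above, it keeps every region:

```python
import pandas as pd

# Invented sample of NERC Region values, including two combined entries
regions = pd.Series(["RFC", "RFC, SERC", "WECC", "FRCC, SERC"], name="NERC Region")

# Assumption: combined regions are always separated by ', '
region_flags = regions.str.get_dummies(sep=", ")
print(region_flags)
```

A row such as 'RFC, SERC' gets a 1 in both the RFC and SERC columns, so no extra clean-up of combined columns is needed.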

# 3. Data analysis

Data analysis is usually done before feature engineering, since it helps you identify what you want to use and what needs to change, and gives you a better taste of the problem. However, since my previous __[kernel](https://www.kaggle.com/autunno/eda-15-years-of-power-outage)__ already covered some basic EDA, we jumped into this problem with a good idea of what we can do with this dataset and what we want to achieve. Also, our data was a bit too untamed to extract useful information from it, and doing some feature engineering helps with that.

With our dataset properly cleaned, we can now take a look and see how it's distributed (and how the columns relate to each other). A few interesting plots come to mind:

### NERC Regions

In [44]:
dim = (15, 10)
fig, ax = plt.subplots(figsize=dim)
tag_plot = sns.countplot(x="NERC Region", ax=ax, data=dataset)

for item in tag_plot.get_xticklabels():
    item.set_rotation(45)

### Outage duration per year

In [45]:
dim = (15, 10)
fig, ax = plt.subplots(figsize=dim)
demand_plot = sns.boxplot(x="Year", y="Duration in Minutes", ax=ax, data=dataset)

for item in demand_plot.get_xticklabels():
    item.set_rotation(45)

### Outage duration per NERC region

In [46]:
dim = (15, 10)
fig, ax = plt.subplots(figsize=dim)
demand_plot = sns.boxplot(x="NERC Region", y="Duration in Minutes", ax=ax, data=dataset)

for item in demand_plot.get_xticklabels():
    item.set_rotation(45)

### Events per year

In [47]:
dim = (15, 10)
fig, ax = plt.subplots(figsize=dim)
sns.countplot(x="Year", ax=ax, data=dataset)

### Long lasting outages

In the previous section, we separated our dataset in two: one with outages that lasted 3 days or more, another with the shorter ones. The long-outage subset is much smaller and some of its entries might even just be typing errors, but it's worth taking a look at what this data brings:

In [48]:
dim = (15, 10)
fig, ax = plt.subplots(figsize=dim)
demand_plot = sns.boxplot(x="Year", y="Duration in Minutes", ax=ax, data=long_outages)

for item in demand_plot.get_xticklabels():
    item.set_rotation(45)

In [49]:
dim = (20, 10)
fig, ax = plt.subplots(figsize=dim)
tag_plot = sns.countplot(x="NERC Region", ax=ax, data=long_outages)

for item in tag_plot.get_xticklabels():
    item.set_rotation(45)

In [50]:
dim = (20, 10)
fig, ax = plt.subplots(figsize=dim)
tag_plot = sns.countplot(x="Tags", ax=ax, data=long_outages)

for item in tag_plot.get_xticklabels():
    item.set_rotation(45)

### Feature correlation heatmap

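Before computing the correlation heatmap, we'd better restrict the data to its numeric columns, since the remaining text columns (the dates and NERC Region) can't be correlated:

In [51]:
# Correlation matrix over the numeric (and boolean) columns only
corrmat = dataset.select_dtypes(include=['number', 'bool']).corr()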
f, ax = plt.subplots(figsize=(15, 13))
sns.heatmap(corrmat, vmax=.8, square=True);

### Analysis

Short duration outages (less than 3 days):
  * NERC Regions: We can see that most of the outage occurrences are in the RFC, WECC and SERC regions, which are part of the old NERC model. This is understandable, since they are big regions which are now described in many smaller pieces (WECC, for instance, is now 9 other regions). Unfortunately, the only way to transform the old regions into the new ones would be to use the geographic location, which is in terrible shape and won't be of much help. Half of the NERC regions are barely affected by outages.
  * Duration per year: There are some outliers in 2009, 2013 and 2014; however, they are not that absurd and can be kept. This plot shows us a trend that, in general, things have been improving after they got worse in 2002 (2000 and 2001 were very good years, apparently). It's interesting to note that there's not always a strong relation between number of outages and duration, since 2011 had a low total duration but the most cases in the last 15 years.
  * The feature heatmap gave us a few interesting nuggets of wisdom:
    * The PR NERC region is often affected by voltage reduction
    * Physical events are strongly related to vandalism (who would have thought!)
    * Severe weather is usually related to storms
    * Earthquakes are fairly common in the HECO NERC region 

As for the long lasting outages:
  * 2013 is the biggest outlier in the whole dataset. According to the tags column, it's due to cyber vandalism, which should not take 131 days to recover from (otherwise, this would be highly troubling)
  * Even though 2014 was a good year in general, it was pretty bad from a long lasting outages perspective.
  * RFC is the most troublesome region in this regard, not much different from our analysis on the main dataset.
  * The biggest reasons for long lasting outages are severe weather, fuel supply and vandalism.
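
The pairs read off the heatmap above can also be listed programmatically. The helper below is only a sketch (the name `top_correlated_pairs` and the toy flags are made up); it ranks column pairs by the absolute value of their correlation:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Invented 0/1 flags, not taken from the dataset
toy = pd.DataFrame({
    "storm": [1, 1, 0, 0, 1, 0],
    "severe weather": [1, 1, 0, 0, 1, 0],
    "vandalism": [0, 0, 1, 1, 0, 1],
    "earthquake": [0, 1, 0, 1, 0, 0],
})
result = top_correlated_pairs(toy, n=3)
print(result)
```

Applied to the numeric columns of the real feature map, this would give a ranked list to compare against what the heatmap suggests visually.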

# 4. Data preparation

With all that said and done, we now need to get our feature map and output array:

In [52]:
# Select the target by name rather than by position, which depends on the column order
output = dataset['Duration in Minutes']
features = dataset[dataset.columns.difference(['Duration in Minutes','Date Event Began', 'Date of Restoration', 'Year', 'NERC Region'])]

Since we identified that storm and severe weather are too alike, we can combine them:

In [53]:
features['severe weather'] = np.where(features['storm'] + features['severe weather'] > 0, 1, 0)
features = features[features.columns.difference(['storm'])]

The same goes for vandalism and physical events:

In [54]:
features['physical'] = np.where(features['vandalism'] + features['physical'] > 0, 1, 0)
# Drop 'vandalism', which is now merged into 'physical'
features = features[features.columns.difference(['vandalism'])]

Let's take a quick look to make sure we did everything correctly:

In [55]:
features.head()

In [56]:
output.head()

In [57]:
print('Features size: ' + str(len(features)) + ' - Output size: ' + str(len(output)))

Finally, let's split the dataset into train and test:

In [58]:
features_train, features_test, duration_train, duration_test = train_test_split(features, output, test_size = 0.3, random_state = 0)

And that's it, we're done. Let's build some models!

# 5. First model

We came, we saw and now we need to conquer. Let's start by running a grid search on a kernel SVR regressor. This will hopefully tell us two things: 
 1. Is the relationship between our features and the duration linear?
 2. Is there any hope of finding a good relationship between NERC Region and event type with outage duration? In other words, should we even bother with building models for this dataset?

Normally we would need to normalize the data in order to properly run Kernel SVM, but since all our features are *onehotencoded*, we can skip this step. We start by defining a function to build our grid search models:

In [59]:
# Run grid search, get the prediction array and print the MSE, R2 and best combination
def fit_and_pred_grid_classifier(regressor, param_grid, X_train, X_test, y_train, y_test, folds=5):
    # Apply cross-validated grid search; with no explicit scoring, the regressor's default score (R2) is used
    grid_search = GridSearchCV(estimator = regressor, param_grid = param_grid, cv = folds, n_jobs = -1, verbose = 0)
    grid_search.fit(X_train, y_train)
    best_score = grid_search.best_score_
    best_parameters = grid_search.best_params_

    # Get the prediction array
    grid_search_pred = grid_search.predict(X_test)

    # Print the MSE, R2 score and best parameter combination
    print("MSE: " + str(mean_squared_error(y_test, grid_search_pred))) 
    print("R2: " + str(r2_score(y_test, grid_search_pred))) 
    print("Best parameter combination: " + str(best_parameters)) 
    
    return grid_search_pred

And finally, we can build our first regressor:

In [60]:
regressor = SVR(kernel='rbf', C=10, gamma=0.1)
param_grid = [
    {
        'C': [400, 450, 500, 550, 600], 
        'kernel': ['linear'],
        'epsilon': [550, 600, 650, 700, 750],
    }, 
    {
        'C': [12, 13, 14, 15, 20], 
        'kernel': ['rbf', 'sigmoid'], 
        'gamma': [0.3, 0.4, 0.5, 1],
        'epsilon': [10, 100, 500, 750, 1000],
    },
]
# Build and fit the grid search SVR model
pred = fit_and_pred_grid_classifier(regressor, param_grid, features_train, features_test, duration_train, duration_test)

YIKES! Pretty bad MSE and R2 numbers after doing so much feature engineering and analysis is worrisome. Let's take a look at what we're getting:

In [None]:
# One x position per test sample, instead of a hard-coded test set size
X_grid = np.arange(1, len(duration_test) + 1)

dim = (20, 10)
fig, ax = plt.subplots(figsize=dim)
ax.set_xticks([])
plot_1 = sns.swarmplot(x=X_grid, y=duration_test, color='green')
plot_2 = sns.swarmplot(x=X_grid, y=pred, color='red')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

for item in plot_1.get_xticklabels():
    item.set_visible(False)
    
for item in plot_2.get_xticklabels():
    item.set_visible(False)

We can see that our model is often not predicting well, there were very few cases where it got it *almost* right. So, what now? Now we make a few more sad attempts of modeling this dataset.
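To put the MSE and R2 numbers in context, it helps to compare them with a trivial baseline that always predicts the mean duration of the training set. The sketch below uses plain NumPy and made-up durations; the helper names `mse` and `r2` are illustrative, and the R2 formula is the usual 1 - SS_res / SS_tot (scikit-learn's DummyRegressor offers the same kind of baseline):

```python
import numpy as np

def mse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Invented durations in minutes
y_train = np.array([30.0, 90.0, 600.0, 2000.0])
y_test = np.array([60.0, 120.0, 300.0, 1440.0])

# Baseline: always predict the training mean
baseline_pred = np.full_like(y_test, y_train.mean())
print("Baseline MSE:", mse(y_test, baseline_pred))
print("Baseline R2:", r2(y_test, baseline_pred))
```

A constant prediction can never score an R2 above 0, so a model whose R2 is close to 0 is doing little better than guessing the average duration.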

# 6. More models!

As we've seen from the past section, things are rather dire. As it stands, it seems that our data is not fit to predict the duration of an outage only based on NERC region and event type. It could be that, after all, outages are unpredictable. However, we'll not give up without a fight, and so let's try a few more models before hitting the shower. Will we succeed? Will we fail? We will probably fail, but have some fun while doing so:

### Decision tree

In [None]:
regressor = DecisionTreeRegressor()
param_grid = {
    'max_depth': [ 50, 75, 85, 90, 95, 100, 105, 110],
    'max_features': [10, 13, 14, 15, 16, 17, 18, 19, 20],
    'min_samples_leaf': [10, 11, 12, 13, 14, 15],
    'min_samples_split': [5, 6, 7, 8, 9, 10]
}
pred = fit_and_pred_grid_classifier(regressor, param_grid, features_train, features_test, duration_train, duration_test)

### Bayesian ridge

In [None]:
regressor = BayesianRidge()
param_grid = [
    {
        'alpha_1': [125, 150, 175, 200, 225, 250],
        'alpha_2': [0.00000001, 0.0000001, 0.000001],
        'lambda_1': [0.00000001, 0.0000001, 0.000001],
        'lambda_2': [125, 150, 175, 200, 225, 250]
    }, 
]
pred = fit_and_pred_grid_classifier(regressor, param_grid, features_train, features_test, duration_train, duration_test)

### KNN

In [None]:
regressor = KNeighborsRegressor()
param_grid = {
    'n_neighbors': [1,2,4,5,10,20],
    'weights': ['distance', 'uniform'],
    'algorithm': ['ball_tree', 'kd_tree', 'brute'],
    'metric': ['minkowski','euclidean','manhattan'], 
    'p': [1, 2]
}
pred = fit_and_pred_grid_classifier(regressor, param_grid, features_train, features_test, duration_train, duration_test)

# 7. Back to the drawing board

Alas, it was not meant to be. We could probably build some complex ensembles and try advanced techniques, but it would not help much. When a simple model can't even achieve an R2 above 0.3, it's most likely a data issue, not a modeling one.

That being said, our exploratory data analysis was quite successful as we were able to extract good information out of the data. It could be improved further, however, by splitting the outage duration in more windows (e.g. from 3 days to a week, from one week to two weeks, from two weeks to a month, etc), separating the data by year (or, at least, by NERC epochs), etc.
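The duration windows suggested above could be built with `pd.cut`. This is a sketch under assumptions: the bin edges and labels are illustrative, a month is taken as 30 days, and the durations are made up:

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60

# Left-closed bins: [0, 3 days), [3 days, 1 week), and so on
bins = [0, 3 * MINUTES_PER_DAY, 7 * MINUTES_PER_DAY, 14 * MINUTES_PER_DAY,
        30 * MINUTES_PER_DAY, float("inf")]
labels = ["under 3 days", "3 days to 1 week", "1 to 2 weeks",
          "2 weeks to 1 month", "over 1 month"]

# Invented durations in minutes
durations = pd.Series([90, 4320, 9000, 15000, 25000, 188978], name="Duration in Minutes")
windows = pd.cut(durations, bins=bins, labels=labels, right=False)
print(windows.value_counts(sort=False))
```

The resulting categorical column could then be used as the x axis of the boxplots and countplots from the analysis section.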

The last question remaining is: can we get any meaningful models out of this dataset? Probably. There are a few things that could be done in order to help create a meaningful model:
 1. Fix the geographic location so that it becomes usable (much more useful than a NERC region), although that should take quite a lot of effort.
 2. Simplify the feature map by cutting down the event columns, keeping only the most common ones.
 3. Split the dataset by year (as we've seen on EDA, every year is very different from another, there's no consistency)
 4. Build a model for outliers only, since long lasting outages are more dangerous and thus more interesting to prepare for.
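
As an example of item 2, the event columns could be filtered by how often they occur. The helper name `keep_common_flags`, the threshold and the toy flags below are all hypothetical:

```python
import pandas as pd

def keep_common_flags(flags: pd.DataFrame, min_count: int) -> pd.DataFrame:
    """Keep only the 0/1 columns that are set in at least min_count rows."""
    counts = flags.sum(axis=0)
    return flags.loc[:, counts >= min_count]

# Invented event flags, one row per outage
toy_flags = pd.DataFrame({
    "severe weather": [1, 1, 1, 0, 1],
    "vandalism": [0, 0, 1, 1, 0],
    "earthquake": [0, 0, 0, 0, 1],
})
common = keep_common_flags(toy_flags, min_count=2)
print(list(common.columns))
```

The right threshold would have to be chosen by looking at the actual counts per event type.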