# Power Outages
This project uses data on major power outages in the continental U.S. from January 2000 to July 2016. Here, a major power outage is defined as one that impacted at least 50,000 customers or caused an unplanned firm load loss of at least 300 MW. Refer to the GitHub repository for more [info](https://github.com/srpatel2000/Power-Outage-EDA).

The data is downloadable [here](https://engineering.purdue.edu/LASCI/research-data/outages/outagerisks) from Purdue University's website.

A data dictionary is available at this [article](https://www.sciencedirect.com/science/article/pii/S2352340918307182) under *Table 1. Variable descriptions*.

![image.png](attachment:image.png)

# Summary of Findings

### Project Goals
*First half of project*
1. Assess the quality of these datasets via exploratory data analysis
2. Assess the mechanism of missingness for some relevant portion of the dataset
3. Ask/answer a question about the dataset using a hypothesis test

*Second half of project*
4. Clearly state and frame a prediction problem (classification or regression); choose and justify an objective (e.g. accuracy vs f1-score)
5. Train a "baseline" model with a generic set of features created for different kinds of data (e.g. ordinal encoding, one-hot encoding).
6. Engineer at least two new features from the data that improve the baseline model
7. Create an sklearn ML-pipeline; do a search for the best model and parameters using the pipeline.
8. Do an inference analysis on the results (i.e. does my model perform better on attribute X vs Y?)

### Cleaning and EDA
First, we cleaned the dataset. On import, the column names were wrong and the rows did not correctly represent the data generating process (DGP). To fix this, we took the relevant header names (located at row index 4 of the imported frame) and used them as the column names. We also dropped the 'variables' column that showed up on import, since it was irrelevant to our analysis. Lastly, we set the observation number as the index so that rows are quicker to look up.

After reformatting the table to reflect the DGP more faithfully, we combined the start date and start time into a single datetime column, and did the same for the end (restoration) date and time for consistency. We added these two columns to the dataframe so that each outage carries a precise start and end timestamp.
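The date and time merge described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper `combine_date_time` is hypothetical, the column names `OUTAGE.START.DATE` and `OUTAGE.START.TIME` are assumptions about the raw sheet, and the rows are made-up toy values.

```python
import pandas as pd


def combine_date_time(df, date_col, time_col):
    """Return one datetime Series built from a date column and a time column.

    Hypothetical helper: missing or unparseable parts become NaT.
    """
    # Keep only the calendar date (midnight) from the date column.
    dates = pd.to_datetime(df[date_col], errors="coerce").dt.normalize()
    # Parse "HH:MM:SS" values as offsets from midnight.
    times = pd.to_timedelta(df[time_col].astype(str), errors="coerce")
    return dates + times


# Toy rows; the column names are assumed, the values are invented.
toy = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
})
toy["OUTAGE.START"] = combine_date_time(toy, "OUTAGE.START.DATE", "OUTAGE.START.TIME")
print(toy["OUTAGE.START"])
```

Applying the same helper to the restoration date and time columns would give the matching end timestamp.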

We also converted the dtypes of the columns that did not contain null values, so that the table reflects the DGP accurately and later analysis runs more smoothly.

We then began our exploratory data analysis (EDA), exploring the goals above through graphs. More specifically, we performed univariate, bivariate, and aggregation analysis.

In the univariate analysis, we mainly focused on where and when major power outages occur. We found that the main cause of power outages between 2000 and 2016 was severe weather, followed by intentional attacks. We also found that most outages occur in states with higher populations. This may be because states with higher populations tend to have more power plants and are therefore prone to more outages. Consistent with this, the most affected region is the Northeast, where many states (e.g. New York) have a high population density.

After visualizing this data, we wanted to bring the causes of the outages into our analysis. In the bivariate analysis we focused on how different outage causes affected rural versus urban populations. We created a spaghetti plot to see how the counts of different causes fluctuated by year. Severe weather was the main cause of outages up until 2011. In 2011 there were 269 outages overall, and intentional attacks took over as the biggest cause. After some research, we can attribute the large number of outages to the 2011 Southwest Blackout Event, which was the result of 23 distinct events that occurred on 5 separate power grids in a span of 11 minutes. More information can be found here: https://www.nerc.com/pa/rrm/ea/Pages/September-2011-Southwest-Blackout-Event.aspx. This led us to wonder: how differently are urban and rural areas affected by outages caused by intentional attacks? We drew a boxplot showing how urban and rural areas were affected by different causes. Intentional attacks affected both at the same rate, but severe weather did not.
This led us to our hypothesis testing question: are rural areas more prone to severe weather outages than urban areas? More information is in the Hypothesis Test section of this summary.
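The yearly counts behind a spaghetti plot like the one described above can be sketched with a cross-tabulation. The column name `CAUSE.CATEGORY` is an assumption, and the rows are invented toy data rather than the project's results.

```python
import pandas as pd

# Toy rows; "CAUSE.CATEGORY" is an assumed column name and the values are invented.
toy = pd.DataFrame({
    "YEAR": [2010, 2010, 2011, 2011, 2011],
    "CAUSE.CATEGORY": [
        "severe weather", "severe weather",
        "intentional attack", "intentional attack", "severe weather",
    ],
})

# One row per year, one column per cause, cells hold outage counts.
counts_by_year = pd.crosstab(toy["YEAR"], toy["CAUSE.CATEGORY"])

# The leading cause in each year is the column with the largest count.
top_cause = counts_by_year.idxmax(axis=1)
print(counts_by_year)
print(top_cause)
```

Calling `counts_by_year.plot()` would draw one line per cause, which is the spaghetti-plot layout.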

We also did some aggregation analysis by checking how many customers overall were affected by each cause. We did this to see why certain causes may affect rural areas more often.
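A per-cause aggregation of this kind might look like the sketch below. `CUSTOMERS.AFFECTED` appears in the dataset as described in this report; `CAUSE.CATEGORY` is an assumed column name and the rows are invented.

```python
import pandas as pd

# Toy rows; "CAUSE.CATEGORY" is an assumed column name and the values are invented.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": [
        "severe weather", "intentional attack",
        "severe weather", "equipment failure",
    ],
    "CUSTOMERS.AFFECTED": [120000, 0, 80000, None],
})

# Total customers affected and number of non-missing records per cause.
totals = (
    toy.groupby("CAUSE.CATEGORY")["CUSTOMERS.AFFECTED"]
    .agg(total="sum", outages="count")
    .sort_values("total", ascending=False)
)
print(totals)
```

Note that `sum` skips missing values, so a cause whose counts are all missing shows a total of 0 rather than NaN.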

We separated the graphs that relate to our hypothesis test from the other visualizations we made while investigating our questions. We won't go into detail about the findings in the latter, but we plan to use those graphs for further analysis in the future, so feel free to look through them yourself.

## First half of project (Hypothesis Testing)

### Assessment of Missingness

Hypothesis for MAR Permutation Tests:
- Null hypothesis: The missingness of "OUTAGE.DURATION" is not dependent on the compared column's data.
- Alt hypothesis: The missingness of "OUTAGE.DURATION" is dependent on the compared column's data.

From our analysis of missingness, we believe that our data is not NMAR (not missing at random). Data is classified as NMAR if the missingness of a column can be credited to information that is not given within the dataset. To determine whether our dataset contains NMAR data, we conducted permutation tests on all columns that contained missing data. For every column with nontrivial missingness, at least one permutation test returned a p-value less than 0.05, which allowed us to reject the null hypothesis.

The column with nontrivial missingness that we chose to analyze was 'OUTAGE.DURATION'. This column has 1476 non-null values out of 1534 possible entries. To determine what the missingness of 'OUTAGE.DURATION' depends on, we conducted a permutation test with the KS statistic. The KS test is used to assess whether two samples come from the same continuous distribution. For each compared column, we created two samples: one containing the column's values where OUTAGE.DURATION is null and the other containing its values where OUTAGE.DURATION is not null. After generating samples through 1000 simulations, we compared them to our observed statistic, which was the `ks_2samp` statistic of the two original samples.

At a significance level of 0.05, the test on 'TOTAL.SALES' rejected the null hypothesis with a p-value of 0.0, while the test on 'CUSTOMERS.AFFECTED' failed to reject it with a p-value of 0.28. Rejecting the null hypothesis indicates that the two distributions are not similar; failing to reject it indicates that they are similar. This result makes sense, as there are many factors other than outage duration that will affect the missingness of customers affected. However, outage duration can directly affect total sales, which indicates total electricity consumption in the U.S. state.
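The procedure above can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's code: the helper name `missingness_ks_permutation` is hypothetical, the data is synthetic (random normal values under real column names), and missingness is deliberately tied to one column so the test has something to detect.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def missingness_ks_permutation(df, missing_col, other_col, n_sim=1000, seed=0):
    """Permutation test: does missingness of `missing_col` depend on `other_col`?

    Hypothetical helper. Returns the observed KS statistic and the p-value.
    """
    rng = np.random.default_rng(seed)
    data = df[[missing_col, other_col]].dropna(subset=[other_col])
    is_missing = data[missing_col].isna().to_numpy()
    values = data[other_col].to_numpy(dtype=float)

    # Observed statistic: KS distance between the "missing" and "present" groups.
    observed = ks_2samp(values[is_missing], values[~is_missing]).statistic

    simulated = np.empty(n_sim)
    for i in range(n_sim):
        # Shuffle the missingness labels to break any link with `other_col`.
        shuffled = rng.permutation(is_missing)
        simulated[i] = ks_2samp(values[shuffled], values[~shuffled]).statistic

    # Share of shuffles at least as extreme as what was observed.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Synthetic data: the values are random, not taken from the outage dataset.
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "TOTAL.SALES": rng.normal(size=n),
    "CUSTOMERS.AFFECTED": rng.normal(size=n),
    "OUTAGE.DURATION": rng.normal(size=n),
})
# Make OUTAGE.DURATION missing only where TOTAL.SALES is large.
toy.loc[toy["TOTAL.SALES"] > 1, "OUTAGE.DURATION"] = np.nan

obs_sales, p_sales = missingness_ks_permutation(toy, "OUTAGE.DURATION", "TOTAL.SALES", n_sim=200)
obs_cust, p_cust = missingness_ks_permutation(toy, "OUTAGE.DURATION", "CUSTOMERS.AFFECTED", n_sim=200)
print(p_sales, p_cust)
```

Because the shuffles keep the number of missing rows fixed, each simulated split has the same group sizes as the observed one.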

### Hypothesis Test
As stated above, our question was: are rural areas more prone to severe weather outages than urban areas? We came up with this question because we observed that rural areas had a higher rate of weather-related outages than urban areas. This may be due to reasons such as rural areas having fewer facilities to protect their power plants from large weather disasters.

When testing this we maintained these hypotheses:
- Null: There is no difference in the amount that severe weather affects rural vs urban populations.
- Alternative: There is a difference in the amount that severe weather affects rural vs urban populations. (We do not know the reason for any such difference.)

We used the difference in medians between the two populations as our test statistic.

We performed 10,000 trials. With a p-value of 0.0 and a significance level of 0.05, we reject the null hypothesis. We conclude that, in our dataset, there is a statistically significant difference in the amount that severe weather affects rural vs urban populations.
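A difference-in-medians permutation test of this shape can be sketched as below. The helper `median_diff_permutation` is hypothetical, the test is two-sided to match the alternative hypothesis stated above, and the two groups are synthetic numbers, since this summary does not say exactly which measured quantity was compared.

```python
import numpy as np


def median_diff_permutation(values, is_rural, n_trials=10_000, seed=0):
    """Two-sided permutation test on the difference in group medians.

    Hypothetical helper. Returns the observed difference and the p-value.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_rural = np.asarray(is_rural, dtype=bool)

    def stat(labels):
        # Median of the "rural" group minus median of the "urban" group.
        return np.median(values[labels]) - np.median(values[~labels])

    observed = stat(is_rural)
    # Shuffle the group labels to simulate the null of "no difference".
    simulated = np.array([stat(rng.permutation(is_rural)) for _ in range(n_trials)])
    p_value = float(np.mean(np.abs(simulated) >= abs(observed)))
    return observed, p_value


# Synthetic groups with clearly different centers (invented numbers).
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 1, size=40), rng.normal(5, 1, size=40)])
is_rural = np.array([True] * 40 + [False] * 40)

observed, p_value = median_diff_permutation(values, is_rural, n_trials=2000)
print(observed, p_value)
```

A reported p-value of 0.0 means that none of the shuffled differences was as extreme as the observed one, not that the probability is exactly zero.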

## Second half of project (Modeling)

### Baseline Model
For the baseline model, we chose ['YEAR', 'POSTAL.CODE', 'CUSTOMERS.AFFECTED', 'POPPCT_URBAN'] as our initial columns because we deemed them the most relevant for predicting the cause of an outage. For example, certain years may be correlated with specific causes. We also chose POPPCT_URBAN because our previous project, where we conducted a hypothesis test on the impact of severe weather on rural versus urban areas, showed a strong relationship between the cause and whether an area is rural or urban.

In the baseline pipeline, we preprocessed the data by first using the Simple Imputer, which imputed a null string for all categorical features and 0 for all numerical features. We then proceeded to one-hot encode our categorical features. Our chosen classifier was a Decision Tree Classifier.
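A pipeline matching this description could be assembled as in the sketch below. Several details are assumptions: which of the four columns count as categorical, the literal placeholder used for the "null string" (here `"NULL"`), and the toy rows and labels, which are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Assumed split of the baseline columns into categorical and numerical.
categorical = ["POSTAL.CODE"]
numeric = ["YEAR", "CUSTOMERS.AFFECTED", "POPPCT_URBAN"]

preprocess = ColumnTransformer([
    # Categorical: fill missing values with a placeholder string, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="NULL")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    # Numerical: fill missing values with 0.
    ("num", SimpleImputer(strategy="constant", fill_value=0), numeric),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Invented toy rows and labels, only to show the pipeline fitting end to end.
X = pd.DataFrame({
    "YEAR": [2003, 2005, 2011, 2011, 2014, 2016],
    "POSTAL.CODE": ["CA", "TX", "WA", np.nan, "NY", "CA"],
    "CUSTOMERS.AFFECTED": [70000, np.nan, 0, 0, 150000, 52000],
    "POPPCT_URBAN": [90.0, 85.0, 80.0, 80.0, 88.0, 90.0],
})
y = pd.Series([
    "severe weather", "severe weather", "intentional attack",
    "intentional attack", "severe weather", "equipment failure",
])

baseline.fit(X, y)
preds = baseline.predict(X)
print(preds)
```

Setting `handle_unknown="ignore"` is a design choice for this sketch: it lets the encoder skip state codes at prediction time that were absent from the training rows.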

### Final Model

We decided to engineer six more features to improve our model. First, we used the start and restoration date and time to generate columns with the day of the week on which the outage started and on which it was restored. We then used the same columns to extract the hour at which an outage started and was restored, as well as how long the outage lasted. Finally, using POPDEN, we determined whether a state is more rural-dominated or urban-dominated.
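The five time-based features can be sketched as follows. The helper `add_time_features`, the input column names, and the output column names are all hypothetical, and the single row is invented. The POPDEN-based rural/urban flag is left out because this summary does not state the rule used to derive it.

```python
import pandas as pd


def add_time_features(df, start_col, end_col):
    """Add day-of-week, hour, and duration features (hypothetical helper)."""
    out = df.copy()
    # Monday is 0 and Sunday is 6.
    out["START.DAYOFWEEK"] = out[start_col].dt.dayofweek
    out["RESTORATION.DAYOFWEEK"] = out[end_col].dt.dayofweek
    out["START.HOUR"] = out[start_col].dt.hour
    out["RESTORATION.HOUR"] = out[end_col].dt.hour
    # Duration in hours between start and restoration.
    out["DURATION.HOURS"] = (out[end_col] - out[start_col]).dt.total_seconds() / 3600
    return out


# One invented outage; the column names are assumptions.
toy = pd.DataFrame({
    "OUTAGE.START": pd.to_datetime(["2011-07-01 17:00:00"]),
    "OUTAGE.RESTORATION": pd.to_datetime(["2011-07-03 20:00:00"]),
})
features = add_time_features(toy, "OUTAGE.START", "OUTAGE.RESTORATION")
print(features.iloc[0])
```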

After testing Decision Tree, Random Forest, and KNN classifiers, we found that Decision Tree performed the best after optimizing the parameters. In addition, we added PCA to our pipeline to handle our highly correlated columns.
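A parameter search over a pipeline that includes PCA might look like the sketch below. The parameter grid, the scaling step, and the synthetic data from `make_classification` are assumptions for illustration; the project's actual grid and features are not listed in this summary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data with some redundant (correlated) columns.
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=3, n_redundant=2,
    n_classes=3, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),   # PCA is sensitive to feature scale
    ("pca", PCA()),                # compress correlated columns
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical grid: step name, double underscore, parameter name.
param_grid = {
    "pca__n_components": [2, 4],
    "tree__max_depth": [2, 4, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping the `"tree"` step for `RandomForestClassifier` or `KNeighborsClassifier`, with a matching grid, would allow the same search to compare the other two classifiers.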

### Fairness Evaluation
For the fairness evaluation, we chose to evaluate whether the model performed better when a larger group of customers was affected. We chose to do an accuracy parity evaluation since we had a multiclass classification model and found that accuracy was one of the easier ways to evaluate model fairness.

We split our data based on whether a large group of people was affected: an outage counts as affecting a large group if 50,000 or more people were affected, and a small group otherwise. We chose the cutoff at 50,000 because we wanted our accuracy parity evaluation to determine whether our model performed equally well across all groups. Splitting at 50,000 divides our dataset essentially in half, so our model should classify equally well across these two groups.

In order to perform this evaluation we performed a permutation test with these hypotheses:

Null hypothesis: The classifications of major power outages are "the same" when a small and large number of people are affected.

Alternative hypothesis: The classifications of major power outages are NOT "the same" when a small and large number of people are affected.

We set a significance level of 0.05. Essentially, we'd reject the null hypothesis for a p value less than .05 and we'd fail to reject the null hypothesis if it's greater than .05.

Since we got a p-value of 1.0, we fail to reject the null hypothesis. We therefore have no evidence that the classifications of major power outages differ between outages that affect a small and a large number of people.
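An accuracy parity permutation test of this kind can be sketched as below. The helper `accuracy_parity_permutation` is hypothetical and the labels, predictions, and customer counts are invented. The toy model is deliberately perfect in both groups, which is one situation that yields a p-value of 1.0.

```python
import numpy as np


def accuracy_parity_permutation(y_true, y_pred, is_large, n_trials=1000, seed=0):
    """Permutation test on the absolute accuracy gap between two groups.

    Hypothetical helper. Returns the observed gap and the p-value.
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(y_true) == np.asarray(y_pred)
    is_large = np.asarray(is_large, dtype=bool)

    def stat(labels):
        # Accuracy in the "large" group minus accuracy in the "small" group.
        return abs(correct[labels].mean() - correct[~labels].mean())

    observed = stat(is_large)
    # Shuffle group membership to simulate "group does not matter".
    simulated = np.array([stat(rng.permutation(is_large)) for _ in range(n_trials)])
    return observed, float(np.mean(simulated >= observed))


# Invented labels, predictions, and customer counts.
y_true = ["severe weather", "intentional attack", "severe weather",
          "equipment failure", "severe weather", "intentional attack"]
y_pred = list(y_true)  # a model that is right every time
customers = np.array([120000, 0, 80000, 30000, 52000, 1000])
is_large = customers >= 50_000  # same cutoff as in the text

observed_gap, p_value = accuracy_parity_permutation(y_true, y_pred, is_large)
print(observed_gap, p_value)
```

When the observed gap is 0, every shuffled gap is at least as large, so the p-value is 1.0 by construction.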

# Code

### Import Statements

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import folium
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer
from sklearn.decomposition import PCA
#from util import tree_to_code
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score

### Cleaning and EDA

#### _Cleaning_

Initial Reading of Excel Dataset:

outages = pd.read_excel('outage.xlsx')  # load the raw Excel sheet into a DataFrame

new_columns = outages.iloc[4]  # the relevant header names sit at row index 4
filtered_outages = outages[6:].copy()  # keep the data rows; copy so later edits leave the raw frame untouched
filtered_outages.columns = new_columns  # replace the placeholder column names with the relevant ones

filtered_outages = filtered_outages.drop('variables', axis=1)  # drop the irrelevant 'variables' column

filtered_outages = filtered_outages.set_index('OBS', drop=True)  # use the observation number as the index

filtered_outages.columns.name = None  # clear the leftover name on the column axis

filtered_outages.head(2)  # preview the cleaned dataframe
