# Hypothesis testing
**Name(s)**: Jiaying Chen, Minh Hoang

**Website Link**: (your website link)


**Note**: Run this command to install wordcloud:
`!pip install wordcloud`

In [1]:
!pip install wordcloud



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from wordcloud import WordCloud


import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

# Step 1: Introduction

### Loading data

In [2]:
# Drop first 5 rows of metadata
raw = pd.read_csv('outage.csv', header=None).iloc[5:]

# Extract and clean column names from the first actual row (index 5 in the original file)
cols = raw.iloc[0, 1:].tolist()
# Drop the first column (contains "variables") and the header row itself
raw = raw.iloc[1:, 1:].copy()

# Assign cleaned column names
raw.columns = cols

# Reset index so we are working with the correct row numbers.
raw.reset_index(drop=True, inplace=True)

# Finally, drop the first remaining row (the units row in this file), which is not an observation
raw = raw.iloc[1:, :]
raw.head(2)

Unnamed: 0,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,...,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
1,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,"Friday, July 1, 2011",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407
2,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,"Sunday, May 11, 2014",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407


### Renaming columns

In [3]:
new_cols = [
    'obs', 'year', 'month', 'state', 'postal_code', 'nerc_region',
    'climate_region', 'anomaly_level', 'climate_cat',
    'start_date', 'start_time', 'restore_date',
    'restore_time', 'cause_cat', 'cause_detail',
    'hurricane_names', 'duration', 'demand_loss_mw',
    'customers_affected', 'res_price', 'com_price', 'ind_price',
    'total_price', 'res_sales', 'com_sales', 'ind_sales', 'total_sales',
    'res_pct', 'com_pct', 'ind_pct', 'res_customers',
    'com_customers', 'ind_customers', 'total_customers', 'res_cust_pct',
    'com_cust_pct', 'ind_cust_pct', 'pc_realgsp_state', 'pc_realgsp_usa',
    'pc_realgsp_rel', 'pc_realgsp_change', 'util_realgsp', 'total_realgsp',
    'util_contri', 'pi_util_of_usa', 'population', 'pop_pct_urban',
    'pop_pct_uc', 'popden_urban', 'popden_uc', 'popden_rural',
    'area_pct_urban', 'area_pct_uc', 'pct_land', 'pct_water_tot',
    'pct_water_inland'
]

raw.columns = new_cols
raw.head(2)


Unnamed: 0,obs,year,month,state,postal_code,nerc_region,climate_region,anomaly_level,climate_cat,start_date,...,pop_pct_urban,pop_pct_uc,popden_urban,popden_uc,popden_rural,area_pct_urban,area_pct_uc,pct_land,pct_water_tot,pct_water_inland
1,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,"Friday, July 1, 2011",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407
2,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,"Sunday, May 11, 2014",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407


In [4]:
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 1 to 1534
Data columns (total 56 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   obs                 1534 non-null   object
 1   year                1534 non-null   object
 2   month               1525 non-null   object
 3   state               1534 non-null   object
 4   postal_code         1534 non-null   object
 5   nerc_region         1534 non-null   object
 6   climate_region      1528 non-null   object
 7   anomaly_level       1525 non-null   object
 8   climate_cat         1525 non-null   object
 9   start_date          1525 non-null   object
 10  start_time          1525 non-null   object
 11  restore_date        1476 non-null   object
 12  restore_time        1476 non-null   object
 13  cause_cat           1534 non-null   object
 14  cause_detail        1063 non-null   object
 15  hurricane_names     72 non-null     object
 16  duration            1476

## Introduction and Question Identification
(⚠️ means we still need more data, ✅ means workable with existing data)

**Currently, we are considering three problem statements to explore:**

1. ⚠️ Assessment of Infrastructure Resilience (Based on whether or not `cause_detail` has enough information.)
Analyze how different regions' infrastructure characteristics (e.g., overhead vs. underground lines, maintenance investments) correlate with outage frequency and duration. This can inform infrastructure improvement strategies.
2. ✅ Temporal Trends and Climate Change Correlation (doable with existing data, but kinda boring) 
Examine how the frequency and causes of outages have evolved over time and assess potential correlations with climate change indicators. This can provide insights into how changing climate patterns impact power reliability.
3. ⚠️ Policy and Emergency Response Evaluation (We don’t have explicit timestamps or markers indicating policy changes or emergency response dates.)
Evaluate the effectiveness of policies and emergency responses by analyzing outage data before and after the implementation of specific measures. This can guide future policy development and emergency planning.

**Ideas that are just "ok":**

⭕️ 1. Predictive Modeling of Outage Risks <u> ***(but it’s kinda boring, i bet people have done it already; not creative)*** </u>

Utilize machine learning techniques to predict the likelihood of major power outages based on factors such as weather patterns, infrastructure characteristics, and economic indicators. This can aid in proactive maintenance and resource allocation.

⭕️ 2. Socioeconomic Impact Analysis <u> ***(has been done numerous times already !!!!!!! Check out examples, two of them already exist. I know you like it, Im sry!)*** </u>

Investigate the relationship between socioeconomic factors (e.g., income levels, urbanization) and the frequency or duration of power outages. This can highlight areas where outages disproportionately affect vulnerable populations.



### Exploring idea 1
Check whether `cause_detail` gives us any good information about the infrastructure.

In [5]:
raw['cause_detail'].unique()

array([nan, 'vandalism', 'heavy wind', 'thunderstorm', 'winter storm',
       'tornadoes', 'sabotage', 'hailstorm', 'uncontrolled loss',
       'winter', 'wind storm', 'computer hardware', 'public appeal',
       'storm', ' Coal', ' Natural Gas', 'hurricanes', 'wind/rain',
       'snow/ice storm', 'snow/ice ', 'transmission interruption',
       'flooding', 'transformer outage', 'generator trip',
       'relaying malfunction', 'transmission trip', 'lightning',
       'switching', 'shed load', 'line fault', 'breaker trip', 'wildfire',
       ' Hydro', 'majorsystem interruption', 'voltage reduction',
       'transmission', 'Coal', 'substation', 'heatwave',
       'distribution interruption', 'wind', 'suspicious activity',
       'feeder shutdown', '100 MW loadshed', 'plant trip', 'fog', 'Hydro',
       'earthquake', 'HVSubstation interruption', 'cables', 'Petroleum',
       'thunderstorm; islanding', 'failure'], dtype=object)

- `cause_detail` does contain some infrastructure-related failure types, like:
**transformer outage, generator trip, relaying malfunction, breaker trip, line fault, substation, transmission interruption, distribution interruption, cables, HVSubstation interruption, plant trip**, etc.
- They will allow us to indirectly infer infrastructure issues, but there is no explicit infrastructure metadata: we don’t have direct info on overhead vs. underground lines, age of equipment, maintenance budgets, or investments.

- External datasets to consider: 
    - **(I FW THIS ONE HEAVY) [EIA Reports on Utility Investments](https://www.eia.gov/todayinenergy/detail.php?id=48136)** 
        - Provides financial and operational data related to maintenance and upgrades.
        - Can help explain or correlate investment levels with outage frequency/duration.
        - Good source for explaining patterns seen in outage data.
    - [Mapping the Depths: Underground Power Distribution (arXiv study, paper only)](https://arxiv.org/abs/2402.06668)
        - Unique dataset that quantifies underground vs overhead lines by utility.
        - Can provide a strong predictor variable about infrastructure type (underground = more resilient).
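
As a rough illustration of the indirect inference mentioned above, the sketch below flags `cause_detail` values that mention an infrastructure component. The keyword list and the helper names `normalize_cause` and `is_infra_cause` are hypothetical, chosen only from the values printed in the cell above; they are not part of the dataset or of our pipeline.

```python
import re

# Hypothetical keyword list, taken from the infrastructure-related values listed above.
INFRA_KEYWORDS = (
    "transformer", "generator", "relaying", "breaker", "line fault",
    "substation", "transmission", "distribution", "cables", "plant trip",
)

def normalize_cause(detail):
    """Lowercase, trim, and collapse whitespace; return None for missing values."""
    if detail is None or detail != detail:  # NaN is the only value not equal to itself
        return None
    return re.sub(r"\s+", " ", str(detail).strip().lower())

def is_infra_cause(detail):
    """True if the normalized cause_detail mentions one of the keywords above."""
    text = normalize_cause(detail)
    return text is not None and any(keyword in text for keyword in INFRA_KEYWORDS)
```

On the real data this could be applied as `raw['cause_detail'].map(is_infra_cause)`. Substring matching is deliberately loose, so values such as `'HVSubstation interruption'` are caught, but it may also need tuning if new cause strings appear.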

### Exploring EIA reports on utility investment 
-> <u> **[Annual Electric Power Industry Report, Form EIA-861 detailed data files](https://www.eia.gov/electricity/data/eia861/)**</u> <br>
-> <u> **[A Guide to EIA Electric Power Data](https://www.eia.gov/electricity/data/guide/pdf/guide.pdf)** page 9/18  </u> 
> **Retail Sales by Electric Utilities and Power Marketers (Form EIA-861, Annual Electric Power Industry Report)**
>
> Data Collected by Form EIA-861  
> The Form EIA-861, Annual Electric Power Industry Report collects annual data from a census of all utilities that sell electricity to end-use customers in the 50 states, the District of Columbia, Puerto Rico, American Samoa, the American Virgin Islands, Guam, and the Northern Mariana Islands. These surveys collect information on sales to ultimate customers by utilities and power marketers, energy efficiency programs, distributed generating capacity, and related data elements.  
>
> The data collected include several items:  
> - **Service territory by state and county**  
> - **Sales revenue to ultimate customers**  
> - **Revenue and customer count**  
> - **Source and disposition of electricity**  
> - **Advanced metering**  
> - **Demand response and energy efficiency programs**  
> - **Dynamic pricing**  
> - **Capacity and other information related to net metering**  
> - **Non-net metered distributed generating units**  
> - **Distribution system characteristics and reliability**


Out of the above, the good columns to look for are `Distribution System Characteristics and Reliability`, `Service Territory by State and County`, `Advanced Metering Infrastructure (AMI)`, `Demand Response and Energy Efficiency Programs`, and `Revenue and Customer Count by Utility`:
-  `Distribution System Characteristics and Reliability`: Directly speaks to infrastructure resilience. 
-  `Service Territory by State and County`: Needed for merging 
- `Advanced Metering Infrastructure (AMI)`: AMI often correlates with modernization efforts and may reflect better outage response times. Can assess: Compare outage duration/frequency in regions with vs. without AMI.
- `Demand Response and Energy Efficiency Programs`: May suggest proactive infrastructure investment or mitigation strategies. Can assess: "Do regions with stronger demand response programs show fewer or shorter outages?"
- `Revenue and Customer Count by Utility`: Can assess possible correlations between revenue and investment in resilience.

### Final decision on problem statement 

<u> **Assessment of Infrastructure Resilience.   
Analyze how different regions' infrastructure maintenance investments correlate with outage frequency and duration. This can inform infrastructure improvement strategies.** </u>



# Step 2: Data Cleaning and Exploratory Data Analysis

In [6]:
raw.head(2)

Unnamed: 0,obs,year,month,state,postal_code,nerc_region,climate_region,anomaly_level,climate_cat,start_date,...,pop_pct_urban,pop_pct_uc,popden_urban,popden_uc,popden_rural,area_pct_urban,area_pct_uc,pct_land,pct_water_tot,pct_water_inland
1,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,"Friday, July 1, 2011",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407
2,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,"Sunday, May 11, 2014",...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5926658691451,8.40733413085488,5.47874298334407


## Data cleaning

### Dropping unused features

Since we will be focusing on assessing infrastructure resilience by state based on investment and other economic factors, we drop the date/time, price, sales, customer-share, and land/water columns, along with most of the GSP columns, among others.

In [7]:
dropped = [
    'obs', 'start_date', 'start_time', 'restore_date', 'restore_time',
    'res_price', 'com_price', 'ind_price',
    'res_sales', 'com_sales', 'ind_sales',
    'res_pct', 'com_pct', 'ind_pct',
    'res_customers', 'com_customers', 'ind_customers',
    'res_cust_pct', 'com_cust_pct', 'ind_cust_pct',
    'pct_land', 'pct_water_tot', 'pct_water_inland',
    'hurricane_names', 'state', 'pc_realgsp_usa', 
    'pc_realgsp_rel', 'pc_realgsp_change', 'util_realgsp',
    'total_realgsp', 'util_contri', 'nerc_region',
    'demand_loss_mw', 'customers_affected',
    'total_price', 'total_sales', 'total_customers',   
]
raw_dropped = raw.drop(columns = dropped)
raw_dropped = raw_dropped.rename(columns = {'postal_code': 'state'})

# Drop rows with missing keys: first convert any "NaN" strings to np.nan, then drop rows missing year, month, or state.
raw_dropped = raw_dropped.replace("NaN", np.nan)
raw_dropped = raw_dropped.dropna(subset=['year', 'month', 'state']) 
raw_dropped.head(2)

Unnamed: 0,year,month,state,climate_region,anomaly_level,climate_cat,cause_cat,cause_detail,duration,pc_realgsp_state,pi_util_of_usa,population,pop_pct_urban,pop_pct_uc,popden_urban,popden_uc,popden_rural,area_pct_urban,area_pct_uc
1,2011,7,MN,East North Central,-0.3,normal,severe weather,,3060,51268,2.2,5348119,73.27,15.28,2279,1700.5,18.2,2.14,0.6
2,2014,5,MN,East North Central,-0.1,normal,intentional attack,vandalism,1,53499,2.2,5457125,73.27,15.28,2279,1700.5,18.2,2.14,0.6


### Feature engineering

In [8]:
# Make duration numeric, standardize cause_detail text 
raw_dropped['duration'] = pd.to_numeric(raw_dropped['duration'], errors='coerce')
raw_dropped['cause_detail'] = raw_dropped['cause_detail'].dropna().astype(str).str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'\s+', ' ', regex=True)


"""
New columns for: 
Total outages that month in that state
Average outage duration that month in that state
"""
outages_count = (raw_dropped.groupby(['year', 'month', 'state']).size().reset_index(name='monthly_outage_count_bystate'))
avg_duration = (raw_dropped.groupby(['year', 'month', 'state'])['duration'].mean().reset_index(name='monthy_avg_duration_bystate'))
by_month_year = pd.merge(outages_count, avg_duration, on = ['year', 'month', 'state'], how = 'outer')
outage = pd.merge(raw_dropped, by_month_year, on = ['year', 'month', 'state'], how = 'left')

# Create new column for each year each month 
outage['year_month'] = outage['year'].astype(str) + '_' + outage['month'].astype(str)

outage.head(2)

Unnamed: 0,year,month,state,climate_region,anomaly_level,climate_cat,cause_cat,cause_detail,duration,pc_realgsp_state,...,pop_pct_urban,pop_pct_uc,popden_urban,popden_uc,popden_rural,area_pct_urban,area_pct_uc,monthly_outage_count_bystate,monthy_avg_duration_bystate,year_month
0,2011,7,MN,East North Central,-0.3,normal,severe weather,,3060.0,51268,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,2,1530.0,2011_7
1,2014,5,MN,East North Central,-0.1,normal,intentional attack,vandalism,1.0,53499,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,1,1.0,2014_5
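
The two group-level columns built above with `groupby` plus `merge` can also be attached in place with `groupby(...).transform(...)`. The sketch below shows this on a small toy frame; the toy values are made up for illustration, and only the column names match the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_dropped; the values are illustrative only.
toy = pd.DataFrame({
    'year': ['2011', '2011', '2014'],
    'month': ['7', '7', '5'],
    'state': ['MN', 'MN', 'MN'],
    'duration': [3060.0, np.nan, 1.0],
})

keys = ['year', 'month', 'state']
grouped = toy.groupby(keys)['duration']

# 'size' counts every row in the group, including rows with a missing duration.
toy['monthly_outage_count_bystate'] = grouped.transform('size')
# 'mean' skips missing durations, matching the behaviour of .mean() above.
toy['monthy_avg_duration_bystate'] = grouped.transform('mean')
```

Because `transform` returns a result aligned with the original rows, no merge keys are needed. Rows whose group keys are missing would receive NaN, so this assumes the keys were already cleaned as in the cell above.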


# Process external data source 
1. Using the link below, download all zip files from 2000 - 2016 (do NOT use the reformatted files!):
https://www.eia.gov/electricity/data/eia861/
2. From each zip, take file 2 and convert it to CSV. Keep the state column and the three Total columns (thousand dollars, megawatthours, customer count), add the year, then group each year's dataframe by state and sum each column.

Total revenue is very large; we might want to normalize it (for example, per customer).

In [None]:
frames = []

for year in range(2000, 2017):
    # The 2000 file stores the state in a different column than the later files.
    state_col = 'Unnamed: 3' if year == 2000 else 'Unnamed: 6'
    df = pd.read_csv(f'{year}.csv')
    # The last three columns are the totals: revenue (thousand dollars), megawatthours, customer count.
    revenue, megawatts, counts = df.columns[-3:]
    df = df.iloc[2:][[state_col, revenue, megawatts, counts]].rename(
        columns={state_col: 'state', revenue: 'total_revenue', megawatts: 'megawatthours', counts: 'count'}
    )
    # Strip thousands separators and treat '.' or '' as missing before converting to float.
    for col in ['total_revenue', 'megawatthours', 'count']:
        df[col] = df[col].str.replace(',', '', regex=False).replace(['.', ''], np.nan).astype(float)
    df['total_revenue'] *= 1000  # thousand dollars -> dollars
    df = df.groupby('state', as_index=False).sum()
    df['year'] = year
    frames.append(df)

external = pd.concat(frames, ignore_index=True)

In [None]:
external['year_state'] = external['year'].astype(str) + external['state'].astype(str)
external.head()

In [None]:
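# raw['duration'] is still a string column, so convert it to numbers before summing.
state_duration = (raw.assign(duration=pd.to_numeric(raw['duration'], errors='coerce'))
                  .groupby(['year', 'postal_code'])['duration'].sum().reset_index()
                  .rename(columns = {'postal_code': 'state'}))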
state_duration['year_state'] = state_duration['year'].astype(str) + state_duration['state'].astype(str)
final_df = external.merge(state_duration, on = 'year_state', how = 'left')[['total_revenue', 'megawatthours', 'count', 'year_state', 'duration']].fillna(0)
final_df['year'] = final_df['year_state'].str[:4]
final_df['state'] = final_df['year_state'].str[4:]
final_df = final_df.drop(columns = 'year_state')
final_df.to_csv('raw_external_combined.csv')

### Resulting external dataframe
We combined the dataframes from all years 2000 - 2016 and kept the total columns.
 

In [10]:
util_invest = pd.read_csv('external.csv')
util_invest.drop(columns=['Unnamed: 0'], inplace=True)
util_invest.rename(columns={'count':'customer_count'}, inplace=True)
util_invest['year'] = util_invest['year'].astype(str)
util_invest.head(2)

Unnamed: 0,state,total_revenue,megawatthours,customer_count,year
0,AK,535246000.0,5309970.0,273530.0,2000
1,AL,4687257000.0,83524220.0,2262753.0,2000


In [11]:
outage= outage.merge(util_invest, how='left', on=['year', 'state'])
outage.head(2)

Unnamed: 0,year,month,state,climate_region,anomaly_level,climate_cat,cause_cat,cause_detail,duration,pc_realgsp_state,...,popden_uc,popden_rural,area_pct_urban,area_pct_uc,monthly_outage_count_bystate,monthy_avg_duration_bystate,year_month,total_revenue,megawatthours,customer_count
0,2011,7,MN,East North Central,-0.3,normal,severe weather,,3060.0,51268,...,1700.5,18.2,2.14,0.6,2,1530.0,2011_7,5928515000.0,68532708.0,2595696.0
1,2014,5,MN,East North Central,-0.1,normal,intentional attack,vandalism,1.0,53499,...,1700.5,18.2,2.14,0.6,1,1.0,2014_5,6540932000.0,68719367.0,2640737.0


## Univariate Analysis



### Countplot of climate regions

In [12]:
climate_region_count = outage['climate_region'].value_counts().reset_index()
climate_region_count.columns = ['climate_region', 'count']
climate_region_cntplot = px.bar(climate_region_count, x='count', y='climate_region',
            title='Outage Count by Climate Region',
            labels={'climate_region': 'Climate Region', 'count': 'Number of Outages'},
            color='count',  
            #color_discrete_sequence=px.colors.qualitative.Pastel  
             )

climate_region_cntplot.show()


### Log-scaled duration distribution

In [13]:
outage['duration']

0       3060.0
1          1.0
2       3000.0
3       2550.0
4       1740.0
         ...  
1520       NaN
1521     220.0
1522     720.0
1523      59.0
1524     181.0
Name: duration, Length: 1525, dtype: float64

In [14]:
duration = outage['duration']
duration = pd.to_numeric(duration, errors='coerce').dropna()
duration = duration[duration > 0]  # log can't handle zero or negative

log_duration = np.log10(duration)
df = pd.DataFrame({'log_duration': log_duration})

# Create bins manually
bin_edges = np.linspace(log_duration.min(), log_duration.max(), 30)
df['bin'] = pd.cut(df['log_duration'], bins=bin_edges, include_lowest=True)
# Calculate counts per bin
bin_counts = df['bin'].value_counts().sort_index()
bin_centers = [interval.mid for interval in bin_counts.index]


hist_df = pd.DataFrame({
    'bin_center': bin_centers,
    'count': bin_counts.values
})
total_outages = df['log_duration'].notna().sum()
hist_df['percent'] = (hist_df['count'] / total_outages) * 100


# Make the minutes human readable 
def human_readable(mins):
    if mins >= (7 * 1440): 
        return f"{round(mins / (7 * 1440))} weeks"
    elif mins >= 1440:
        return f"{round(mins / 1440)}d"
    elif mins >= 60:
        return f"{round(mins / 60)}h"
    else:
        return f"{int(round(mins))} min"

hist_df['bin_label'] = [
    f"{human_readable(10**interval.left)} –{human_readable(10**interval.right)}"
    for interval in bin_counts.index
]

# Tick labels in minutes: invert the log10 transform (not used by the bar chart below)
tick_vals = hist_df['bin_center']
tick_text = [f"{int(round(10**x))}" for x in tick_vals]

fig = px.bar(
    hist_df,
    x='bin_label',
    y='percent',
    color='bin_center',
    color_continuous_scale=px.colors.sequential.Blackbody,
    labels={'bin_label': 'Duration Range', 'percent': 'Percent of Outages'},
    title='Log Duration of Outages (Percentage)'
)

fig.update_layout(
    plot_bgcolor="#d0d5e6",
    xaxis_title='Duration (minutes, approx)',
    xaxis_tickangle=-45
)

fig.update_traces(
    hovertemplate=
        'Duration Range: %{x}<br>' +
        'Percent of Outages: %{y:.2f}%<extra></extra>'
)

fig.show()



#### Cause detail wordcloud 

In [15]:
raw.cause_detail
text = ' '.join(
    raw['cause_detail']
    .dropna()
    .astype(str)
    .str.lower()
    .str.strip()
    .str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
    .str.replace(r'\s+', ' ', regex=True)
)
wordcloud = WordCloud(
    # colormap='gnuplot2',
    font_path='Impact',
    colormap='cool',
    width=800,
    height=400,
    background_color='white',
    collocations=False,  # Count single words only; do not add two-word collocations (bigrams)
    random_state=42 
).generate(text)
wc_array = np.array(wordcloud.to_image())

fig = px.imshow(wc_array)
fig.update_layout(
    title="Word Cloud of Causes of Outages",
    xaxis=dict(showticklabels=False),
    yaxis=dict(showticklabels=False),
    margin=dict(l=10, r=10, t=40, b=10)
)
fig.show()

## Bivariate Analysis

In [16]:
fig = px.box(
    outage,
    x='state',
    y='anomaly_level',
    color='state',  # Optional: colors each box by state
    title='Anomaly Levels by State',
)

fig.update_layout(
    xaxis_title='State',
    yaxis_title='Anomaly Level',
    title_font=dict(size=30),
    xaxis_tickangle=90,
    font=dict(size=12),
    height=400,
    width=800,
    showlegend=False  # You can turn this on if coloring by state is meaningful
)

fig.show()


In [17]:

# Aggregate counts by climate_region and cause_cat
agg_df = outage.groupby(['climate_region', 'cause_cat']).size().reset_index(name='outage_count')
agg_df = agg_df.sort_values(['climate_region', 'cause_cat'])

ordered_regions = ['Northeast', 'West North Central', 'Southwest', 'Northwest', 'East North Central', 'Southeast', 'Central', 'West', 'South']

regions = agg_df['climate_region'].unique()
colors = ['#33a8c7ff', '#52e3e1ff', '#a0e426ff', '#fdf148ff', '#ffab00ff',  '#f77976ff', '#f050aeff', '#d883ffff', '#9336fdff' ]
colors2 = ["#ef476f","#f78c6b","#ffd166","#83d483","#06d6a0","#0cb0a9","#118ab2","#0c637f","#073b4c"]
colors3 = ["#5d9cec","#4fc1e9","#48cfad","#a0d468","#ffce54","#fc6e51","#ed5565","#ac92ec","#ec87c0"]
colors4= ["#ff7073","#ea9e8d","#dbb3b1","#ffe085","#fed35d","#96e6b3","#73d3c9","#8cd9f8","#a0b7cf"]
color_map = {region: colors[i % len(colors)] for i, region in enumerate(ordered_regions)}

fig = px.sunburst(
    agg_df,
    path=['climate_region', 'cause_cat'],  # hierarchy: inner ring is climate_region, outer is cause_cat
    values='outage_count',
    color='climate_region',
    color_discrete_map=color_map,
    #color='outage_count',  
    #color_continuous_scale=px.colors.sequential.Plasma,
    title='Outage Counts by Climate Region and Cause Category'
)
fig.update_layout(width=600, height=600)
fig.update_traces(insidetextorientation='radial')  # or 'tangential' or 'auto'
fig.show()

In [18]:
fig1 = px.density_heatmap(
    outage,
    x='month',
    y='state',
    z='monthly_outage_count_bystate',
    histfunc='avg',  # average monthly outage count by state-month
    color_continuous_scale='Plasma',
    labels={'month': 'Month', 'state': 'State', 'monthly_outage_count_bystate': 'Avg Outage Count'},
    title='Average Monthly Outage Count by State and Month'
)

fig1.update_layout(
    yaxis={'categoryorder':'total ascending'},  # Sort states by total outages ascending
    xaxis=dict(dtick=1),  # show all month ticks
    height=800,
    plot_bgcolor='#f0f0f0'
)

fig1.show()


In [19]:

df_month = (
    outage.dropna(subset=['duration'])
    .assign(year_month=lambda df: pd.to_datetime(df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2)))
    .groupby('year_month', as_index=False)['duration']
    .mean()
    .rename(columns={'year_month': 'time'})
)
df_month['type'] = 'Year-Month'

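# Prepare year data (convert the year to a timestamp so both series share one time axis)
df_year = (
    outage.dropna(subset=['duration'])
    .groupby('year', as_index=False)['duration']
    .mean()
    .rename(columns={'year': 'time'})
)
df_year['time'] = pd.to_datetime(df_year['time'].astype(str), format='%Y')
df_year['type'] = 'Year'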

# Combine
df_combined = pd.concat([df_month, df_year], ignore_index=True)

scatter = px.scatter(
    outage.dropna(subset=['duration']),
    x='year_month',
    y='duration',
    opacity=0.3,
    labels={'year_month': 'Year-Month', 'duration': 'Duration (min)'},
    title='Outage Durations: Individual Points + Monthly Average'
)

# Plot
fig = px.line(
    df_combined,
    x='time',
    y='duration',
    color='type',
    labels={'time': 'Time', 'duration': 'Avg Outage Duration (minutes)', 'type': 'Aggregation'},
    title='Average Outage Duration: Year-Month vs Year',
)

fig.update_layout(height=600)
fig.show()

In [20]:
outage.columns

Index(['year', 'month', 'state', 'climate_region', 'anomaly_level',
       'climate_cat', 'cause_cat', 'cause_detail', 'duration',
       'pc_realgsp_state', 'pi_util_of_usa', 'population', 'pop_pct_urban',
       'pop_pct_uc', 'popden_urban', 'popden_uc', 'popden_rural',
       'area_pct_urban', 'area_pct_uc', 'monthly_outage_count_bystate',
       'monthy_avg_duration_bystate', 'year_month', 'total_revenue',
       'megawatthours', 'customer_count'],
      dtype='object')

# Step 3: Assessment of Missingness
1. Inspect how much data is missing in each column, as a percentage. (We renamed the dataframe to `outage` so that there is a distinction between the raw dataset, used for missingness analysis later, and the cleaned dataset we will actually use.)
2. Decide on how to fill in missingness (drop? mean imputation? probability imputation?)
3. Feature engineer however you think is fit. 
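
To make the three options in step 2 concrete, here is a minimal sketch on a toy duration column. The values are made up, and the seed and variable names are assumptions for illustration; which option fits our data still depends on the missingness analysis below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Toy duration column in minutes; the values are illustrative only.
duration = pd.Series([3060.0, 1.0, np.nan, 2550.0, np.nan, 1740.0])

# Option 1: drop the rows with a missing duration.
dropped = duration.dropna()

# Option 2: mean imputation (keeps the mean, but shrinks the variance).
mean_imputed = duration.fillna(duration.mean())

# Option 3: probabilistic imputation, drawing each fill value from the observed values.
observed = duration.dropna().to_numpy()
prob_imputed = duration.copy()
missing = prob_imputed.isna()
prob_imputed[missing] = rng.choice(observed, size=missing.sum())
```

Mean imputation is simplest but piles the filled values onto one point, while sampling from the observed values keeps the shape of the distribution at the cost of adding randomness.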

In [21]:
(outage.isna().sum(axis = 0) / outage.shape[0]).sort_values(ascending= False)

cause_detail                    0.308197
duration                        0.032131
monthy_avg_duration_bystate     0.014426
popden_rural                    0.006557
popden_uc                       0.006557
climate_region                  0.003279
year                            0.000000
popden_urban                    0.000000
megawatthours                   0.000000
total_revenue                   0.000000
year_month                      0.000000
monthly_outage_count_bystate    0.000000
area_pct_uc                     0.000000
area_pct_urban                  0.000000
pop_pct_urban                   0.000000
pop_pct_uc                      0.000000
month                           0.000000
population                      0.000000
pi_util_of_usa                  0.000000
pc_realgsp_state                0.000000
cause_cat                       0.000000
climate_cat                     0.000000
anomaly_level                   0.000000
state                           0.000000
customer_count  

### Columns with notably large amounts of missing data (in the raw dataset):  
demand_loss_mw  
customers_affected  
cause_detail  
hurricane_names  


#### Assessing missingness for duration 

In [22]:
duration_nmar = outage.copy()['duration'].value_counts().reset_index()
duration_nmar['duration'] = duration_nmar['duration'].astype(int)
duration_nmar.sort_values(by = 'duration', ascending = False)

Unnamed: 0,duration,count
620,108653,1
679,78377,1
216,60480,1
368,49427,1
174,49320,2
...,...,...
125,4,2
61,3,3
16,2,6
0,1,97



## NMAR Assessments
### Addressing Missingness of Cause Detail
The missingness of the **cause_detail** column is most likely NMAR, since the details might reveal sensitive information about an individual. Usually, when an outage occurs, we can expect a recording or failsafe that indicates what broke. That said, if the outage occurs in an under-resourced area, there is a much higher chance that the cause is not recorded at all due to a lack of infrastructure.

### Addressing Missingness of Duration
On the other hand, we have reason to believe that **duration** is not NMAR. One could argue that a duration that is too low or too high might not be recorded, but we think that reasoning is flawed: the **duration** column has 78 entries with a duration of 0, and since duration is non-negative and the maximum recorded duration is rather high, values at both extremes do get recorded. This suggests the missingness is not determined by the value within the column itself. If anything, a very long outage gives more reason to record it for record keeping. So it is likely not NMAR.

<!-- We believe it may be MAR because the duration of an outage can vary based on how the population is concentrated, the cause detail (typhoon or natural disasters more likely to cause much more damage to the infrastructure making it less likely that the equipment to collect the data may be tampered with), month is there is a pattern of months that specifically has many outages that l -->

### Addressing Missingness of Climate Category and Region
It doesn't make much sense to label this as NMAR because all geographical regions in the country are labeled with their respective category and region, so it is unlikely that an outage would happen in an area that cannot be classified as one of them. Thus, it is unlikely that the missingness depends on the value of the entry within the column itself.

### Addressing Missingness of popden_uc, popden_rural
There is no reason to believe that these columns are NMAR since the population level, urban or not, is highly dependent on the state and its population, both of which are features in our data. 
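
One way to check a claim like this is a permutation test on the missingness indicator: compare the distribution of `state` among rows where the column is missing against rows where it is present. The sketch below uses total variation distance (TVD) as the statistic. The helper name `missingness_perm_test`, the toy data, and the default parameters are all assumptions for illustration, not results from our dataset.

```python
import numpy as np
import pandas as pd

def missingness_perm_test(df, missing_col, by_col, n_permutations=1000, seed=0):
    """Return (observed TVD, p-value) for whether missingness of missing_col depends on by_col."""
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    groups = df[by_col].to_numpy()

    def tvd(flags):
        # Distribution of by_col among missing rows vs. among present rows.
        miss = pd.Series(groups[flags]).value_counts(normalize=True)
        pres = pd.Series(groups[~flags]).value_counts(normalize=True)
        return miss.sub(pres, fill_value=0).abs().sum() / 2

    observed = tvd(is_missing)
    # Shuffling the missingness flags breaks any link between missingness and by_col.
    sims = np.array([tvd(rng.permutation(is_missing)) for _ in range(n_permutations)])
    return observed, float(np.mean(sims >= observed))

# Toy data, illustrative only: popden_uc is missing for every row of one state.
toy = pd.DataFrame({
    'state': ['AK'] * 5 + ['MN'] * 10 + ['CA'] * 10,
    'popden_uc': [np.nan] * 5 + [1.0] * 20,
})
toy_tvd, toy_p = missingness_perm_test(toy, 'popden_uc', 'state', n_permutations=500)
```

A small p-value here would be consistent with missingness that depends on `state` (MAR), though it cannot rule out NMAR on its own.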

### Addressing Missingness of anomaly_level
Gonna ask Tauhidur

### Addressing Missingness in Month
Gonna ask Tauhidur

# MAR Testing
It is intriguing to analyze whether the cause_detail 


# Step 4: Hypothesis testing
Recall our problem statement:   
**Assessment of Infrastructure Resilience (Based on whether or not cause_detail has enough information.) Analyze how different regions' infrastructure <u>*maintenance investments*</u> correlate with outage frequency and duration. This can inform infrastructure improvement strategies, based on each state's yearly utility spending.**

### Requirements: 
1. Clearly state a pair of hypotheses and perform a hypothesis test or permutation test that is not related to missingness. 
2. Clearly state your null and alternative hypotheses, your choice of test statistic and significance level, the resulting p-value, and your conclusion. Justify why these choices are good choices for answering the question you are trying to answer.

Optional: Embed a visualization related to your hypothesis test in your website.

Tip: When writing your conclusions to the statistical tests in this project, never use language that implies an absolute conclusion; since we are performing statistical tests and not randomized controlled trials, we cannot prove that either hypothesis is 100% true or false.

In [23]:
util_invest.head(2)

Unnamed: 0,state,total_revenue,megawatthours,customer_count,year
0,AK,535246000.0,5309970.0,273530.0,2000
1,AL,4687257000.0,83524220.0,2262753.0,2000


In [24]:
outage[['state','year','duration', 'total_revenue', 'megawatthours','customer_count']].head(2)

Unnamed: 0,state,year,duration,total_revenue,megawatthours,customer_count
0,MN,2011,3060.0,5928515000.0,68532708.0,2595696.0
1,MN,2014,1.0,6540932000.0,68719367.0,2640737.0


### 1. Creating a Single “Spending” Feature for Each State-Year
We will create a heuristic feature that combines normalized revenue, megawatthours, and customer count.
1. Normalized Revenue per Customer: how much money each customer generates — a proxy for investment per capita.
2. Normalized Energy Usage per Customer: how much energy each customer consumes — a proxy for infrastructure demand.

Scaling matters: Without normalization, one metric could dominate due to its magnitude.

Per-customer makes it equitable: Larger states won't automatically look better just because they're bigger.

Directly interpretable: States with higher infra scores are funding and supplying more power per user.



In [25]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
outage_ht = outage.copy()[['state','year','duration', 'total_revenue', 'megawatthours','customer_count']]

outage_ht['rev_per_cust'] = outage_ht['total_revenue'] / outage_ht['customer_count']
outage_ht['mwh_per_cust'] = outage_ht['megawatthours'] / outage_ht['customer_count']

outage_ht[['rev_cust_norm', 'mwh_cust_norm']] = scaler.fit_transform(
    outage_ht[['rev_per_cust', 'mwh_per_cust']]
)

outage_ht['infra_score'] = (outage_ht['rev_cust_norm'] + outage_ht['mwh_cust_norm']) / 2
outage_ht.head(2)

Unnamed: 0,state,year,duration,total_revenue,megawatthours,customer_count,rev_per_cust,mwh_per_cust,rev_cust_norm,mwh_cust_norm,infra_score
0,MN,2011,3060.0,5928515000.0,68532708.0,2595696.0,2283.978825,26.40244,0.248254,0.204431,0.226342
1,MN,2014,1.0,6540932000.0,68719367.0,2640737.0,2476.934318,26.022799,0.28202,0.198277,0.240149


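Null (H₀): High-spending states have outage durations equal to or longer than those of low-spending states.   
Alternative (H₁): High-spending states have shorter outage durations than low-spending states.   
Test statistic: mean outage duration of the low-spending group minus mean outage duration of the high-spending group.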

In [26]:
threshold = outage_ht['infra_score'].median()
outage_ht['spending_group'] = outage_ht['infra_score'].apply(lambda x: 'high' if x >= threshold else 'low')

df = outage_ht.dropna(subset=['duration'])
high_mean = df[df['spending_group'] == 'high']['duration'].astype(int).mean()
low_mean = df[df['spending_group'] == 'low']['duration'].astype(int).mean()
observed_diff = low_mean - high_mean  # Expect positive if high spenders are better

In [27]:
fig = px.histogram(
    outage_ht,
    x='infra_score',
    color='spending_group',
    barmode='overlay',  # draw the two groups on top of each other
    nbins=100,  # or tune this as needed
    title='Distribution of Infrastructure Score by Spending Group',
    #labels={'col2': 'Your Value', 'col1': 'Category'}
    color_discrete_map={
        'high': "#9168e3",  # purple
        'low': "#2dc9d2"   # teal
    },
)

fig.update_layout(
    plot_bgcolor="#f4f4f4",
    bargap=0.05
)

fig.show()

In [28]:

n_permutations = 5000
perm_diffs = []

for _ in range(n_permutations):
    shuffled = np.random.permutation(df['spending_group'].values)  # shuffle group labels
    df_shuffled = df.copy()
    df_shuffled['shuffled_group'] = shuffled  # assign shuffled group

    high = df_shuffled[df_shuffled['shuffled_group'] == 'high']['duration'].astype(float)
    low = df_shuffled[df_shuffled['shuffled_group'] == 'low']['duration'].astype(float)

    diff = np.mean(low) - np.mean(high)
    perm_diffs.append(diff)

p_value = np.mean(np.array(perm_diffs) >= observed_diff)


In [29]:
p_value

np.float64(0.0084)
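
Since the loop above draws from an unseeded random generator, the p-value will vary slightly between runs. The sketch below is a seeded variant of the same one-sided permutation test, written as a standalone function. The function name `permutation_p_value`, the seed, and the toy data are assumptions for illustration; it is not the code that produced the value above.

```python
import numpy as np

def permutation_p_value(durations, is_high, n_permutations=5000, seed=42):
    """One-sided test: share of shuffles where mean(low) - mean(high) >= the observed difference."""
    durations = np.asarray(durations, dtype=float)
    is_high = np.asarray(is_high, dtype=bool)
    rng = np.random.default_rng(seed)

    def diff(labels):
        return durations[~labels].mean() - durations[labels].mean()

    observed = diff(is_high)
    # Shuffle the group labels while keeping the durations fixed.
    sims = np.array([diff(rng.permutation(is_high)) for _ in range(n_permutations)])
    return observed, float(np.mean(sims >= observed))

# Toy data, illustrative only: the low-spending group has clearly longer durations.
toy_durations = [10, 20, 30, 40, 50, 500, 600, 700, 800, 900]
toy_is_high = [True] * 5 + [False] * 5
toy_obs, toy_p = permutation_p_value(toy_durations, toy_is_high, n_permutations=2000)
```

On our data, this would be called with `df['duration']` and `df['spending_group'] == 'high'`. Working on NumPy arrays also avoids copying the dataframe on every iteration.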