# Severity of storms 

### 1.1 Business problem


Our project will focus on analyzing and predicting the severity of tornadoes across various regions in America in terms of property damage. To analyze this, we will take the last 10 years of tornado data to represent an approximate measure of it. To calculate this, we will establish a threshold for measuring the severity using the amount of property damage caused by tornadoes. We will predict the property damage of tornadoes and use a scale of low, medium, and high to represent the severity.  
In addition, we will also compare attitudes across each of the 4 seasons - summer, spring, winter, fall, and see if there are any trends present across the 10 year span that shows when tornadoes are most common and damaging. The trends will help us be more accurate in predicting the property damage caused by the tornadoes  per region during different seasons/times of the year.


### 1.2 Business understanding

### 1.3 Datasets


Our dataset details instances of severe weather across a 10 year period. We will be looking at Hurricanes across this time period. We are given the locations that the hurricanes are in, the category of hurricane, fatalities, property damage, and the length of the storm


### 1.4 Proposed analytics solution

 How we get to the target variable -- severity index
The severity index will be calculated by… 


In [60]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Read all data into one single dataframe
df_all_data = pd.read_csv('./dataset/storm_event_details_2010.csv')

for i in range(2011,2021):
    df_temp = pd.read_csv(f'./dataset/storm_event_details_{i}.csv')
    df_all_data = df_all_data.append(df_temp, ignore_index=True)


  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)
  df_all_data = df_all_data.append(df_temp, ignore_index=True)


In [61]:
## Remove unused columns and format continuous columns

df_hur = df_all_data[df_all_data['EVENT_TYPE']=='Tornado']
df_hur = df_hur.drop(columns=['TOR_OTHER_WFO', 'END_YEARMONTH', 'EVENT_TYPE', 'END_DATE_TIME',
                                           'TOR_OTHER_CZ_STATE','TOR_OTHER_CZ_FIPS','TOR_OTHER_CZ_NAME','DATA_SOURCE','EPISODE_NARRATIVE',
                                            'EVENT_NARRATIVE','WFO','SOURCE','CZ_TIMEZONE','BEGIN_AZIMUTH','END_AZIMUTH','BEGIN_LAT',
                                            'END_LAT','BEGIN_LON','END_LON','STATE_FIPS','BEGIN_RANGE','END_RANGE','DAMAGE_CROPS',
                                            'BEGIN_TIME','END_TIME','BEGIN_LOCATION','END_LOCATION','FLOOD_CAUSE','MAGNITUDE_TYPE',
                                            'MAGNITUDE','CZ_FIPS','CZ_TYPE','CZ_NAME','CATEGORY'])
cols = ['INJURIES_INDIRECT', 'INJURIES_DIRECT', 'DEATHS_INDIRECT', 'DEATHS_DIRECT']
df_hur = df_hur.assign(HARM_TOTAL=df_hur[cols].sum(1)).drop(cols,1)
df_hur['TOR_AREA'] = df_hur['TOR_LENGTH']*df_hur['TOR_WIDTH']
df_hur = df_hur.drop(columns=['TOR_LENGTH', 'TOR_WIDTH'])
df_hur = df_hur.dropna()


  df_hur = df_hur.assign(HARM_TOTAL=df_hur[cols].sum(1)).drop(cols,1)


In [62]:
## Format DAMAGE_PROPERTY column to be float instead of object

dmg = pd.DataFrame(df_hur['DAMAGE_PROPERTY'])

print(dmg)
for index, row in dmg.iterrows():
    val = row['DAMAGE_PROPERTY']
    if val[-1:] == 'B':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1000000
        # print(float(row['DAMAGE_PROPERTY'][:-1])*1000000)
    elif val[-1:] == 'M':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1000
        # print(float(row['DAMAGE_PROPERTY'][:-1])*1000)
    elif val[-1:] == 'K':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1
        # print(float(row['DAMAGE_PROPERTY'][:-1]))

df_hur['DAMAGE_PROPERTY'] = dmg

       DAMAGE_PROPERTY
75               5.00K
304              1.50M
617             10.00K
731            750.00K
732             10.00K
...                ...
688471           0.00K
688472         250.00K
688495          60.00K
688497           0.00K
688560           0.00K

[12945 rows x 1 columns]


In [63]:
## Sort by priority variable and find data split percentages

df_hur = df_hur.sort_values('DAMAGE_PROPERTY', ascending=False)
df_hur.info()
df_hur = df_hur.loc[df_hur['TOR_F_SCALE']!='EFU']
print(df_hur['TOR_F_SCALE'].value_counts()/len(df_hur) * 100)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14988 entries, 100619 to 688592
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   BEGIN_YEARMONTH  14988 non-null  int64  
 1   BEGIN_DAY        14988 non-null  int64  
 2   END_DAY          14988 non-null  int64  
 3   EPISODE_ID       14988 non-null  int64  
 4   EVENT_ID         14988 non-null  int64  
 5   STATE            14988 non-null  object 
 6   YEAR             14988 non-null  int64  
 7   MONTH_NAME       14988 non-null  object 
 8   BEGIN_DATE_TIME  14988 non-null  object 
 9   DAMAGE_PROPERTY  12945 non-null  object 
 10  TOR_F_SCALE      14988 non-null  object 
 11  HARM_TOTAL       14988 non-null  int64  
 12  TOR_AREA         14988 non-null  float64
Int64Index: 12945 entries, 100619 to 688560
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   BEGIN_YEARMONTH  1294

In [64]:
df_hur['MONTH_NAME'].value_counts()/len(df_hur) * 100

April        21.677932
May          20.548708
June         12.413519
July          7.093439
March         6.727634
August        6.067594
October       5.137177
November      4.580517
January       4.413519
February      4.159046
September     3.618290
December      3.562624
Name: MONTH_NAME, dtype: float64

In [65]:
## Split data into strata and sample proportinally (stratified sampling)

sampled_df = df_hur.groupby('TOR_F_SCALE', group_keys=False).apply(lambda x: x.sample(frac=0.1))
sampled_df
# sampled_df['MONTH_NAME'].value_counts()/len(df_hur) * 100

Unnamed: 0,BEGIN_YEARMONTH,BEGIN_DAY,END_DAY,EPISODE_ID,EVENT_ID,STATE,YEAR,MONTH_NAME,BEGIN_DATE_TIME,DAMAGE_PROPERTY,TOR_F_SCALE,HARM_TOTAL,TOR_AREA,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
672194,202008,19,19,149961,904133,NEW JERSEY,2020,August,19-AUG-20 08:57:00,,EF0,0,84.00,,,,,,,,,,,,,,
93457,201104,25,25,51239,303923,TEXAS,2011,April,25-APR-11 23:30:00,1.00K,EF0,0,2.00,,,,,,,,,,,,,,
305008,201404,2,2,84224,508482,KANSAS,2014,April,02-APR-14 19:23:00,,EF0,0,0.20,,,,,,,,,,,,,,
241133,201301,30,30,70457,428685,TENNESSEE,2013,January,30-JAN-13 03:10:00,70.00K,EF0,0,174.75,,,,,,,,,,,,,,
589534,201904,14,14,134696,811342,OHIO,2019,April,14-APR-19 16:17:00,10.00K,EF0,0,30.00,54412.0,201009,15.0,15,43020,250744.0,KANSAS,2010.0,September,15-SEP-10 17:28:00,0.0,EF0,0.0,30.0
93458,201104,28,28,51045,305053,NORTH CAROLINA,2011,April,28-APR-11 15:10:00,33.0,EF0,0,314.50,,,,,,,,,,,,,,
59662,201006,16,16,39661,247446,SOUTH DAKOTA,2010,June,16-JUN-10 17:53:00,0.0,EF0,0,1.50,,,,,,,,,,,,,,
617967,201912,16,16,144955,870503,MISSISSIPPI,2019,December,16-DEC-19 15:45:00,10.0,EF0,0,156.60,,,,,,,,,,,,,,
332528,201506,14,14,95142,571626,NEBRASKA,2015,June,14-JUN-15 11:25:00,0.0,EF0,0,26.00,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,,,,,,,,,,,,,,
