# Severity of storms 

### 1.1 Business problem


Our project will focus on analyzing and predicting the severity of tornadoes across various regions in America in terms of property damage. To analyze this, we will take the last 10 years of tornado data to represent an approximate measure of it. To calculate this, we will establish a threshold for measuring the severity using the amount of property damage caused by tornadoes. We will predict the property damage of tornadoes and use a scale of low, medium, and high to represent the severity.  
In addition, we will also compare attitudes across each of the 4 seasons - summer, spring, winter, fall, and see if there are any trends present across the 10 year span that shows when tornadoes are most common and damaging. The trends will help us be more accurate in predicting the property damage caused by the tornadoes per region during different seasons/times of the year. 


### 1.2 Business understanding

### 1.3 Datasets


Our dataset details instances of severe weather across a 10 year period. The dataset shows us the date/year, the state the tornado occured in, deaths/injuries, and the property damage. We will be using this dataset to predict the property damage caused by future tornadoes in region acros the US. Bonus: We will predict deaths/injuries if we have more time. 


### 1.4 Proposed analytics solution

 How we get to the target variable -- severity index
The severity index will be calculated by… 


In [9]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Read all data into one single dataframe
df_all_data = pd.read_csv('./dataset/storm_event_details_2010.csv')

for i in range(2011,2021):
    df_temp = pd.read_csv(f'./dataset/storm_event_details_{i}.csv')
    df_all_data = df_all_data.append(df_temp, ignore_index=True)


FileNotFoundError: [Errno 2] No such file or directory: './dataset/storm_event_details_2010.csv'

In [None]:
## Remove unused columns and format continuous columns

df_hur = df_all_data[df_all_data['EVENT_TYPE']=='Tornado']
df_hur = df_hur.drop(columns=['TOR_OTHER_WFO', 'END_YEARMONTH', 'EVENT_TYPE', 'END_DATE_TIME', 'BEGIN_YEARMONTH', 'BEGIN_DAY', 
                                            'END_DAY', 'EPISODE_ID', 'EVENT_ID',
                                           'TOR_OTHER_CZ_STATE','TOR_OTHER_CZ_FIPS','TOR_OTHER_CZ_NAME','DATA_SOURCE','EPISODE_NARRATIVE',
                                            'EVENT_NARRATIVE','WFO','SOURCE','CZ_TIMEZONE','BEGIN_AZIMUTH','END_AZIMUTH','BEGIN_LAT',
                                            'END_LAT','BEGIN_LON','END_LON','STATE_FIPS','BEGIN_RANGE','END_RANGE','DAMAGE_CROPS',
                                            'BEGIN_TIME','END_TIME','BEGIN_LOCATION','END_LOCATION','FLOOD_CAUSE','MAGNITUDE_TYPE',
                                            'MAGNITUDE','CZ_FIPS','CZ_TYPE','CZ_NAME','CATEGORY'])
cols = ['INJURIES_INDIRECT', 'INJURIES_DIRECT', 'DEATHS_INDIRECT', 'DEATHS_DIRECT']
df_hur = df_hur.assign(HARM_TOTAL=df_hur[cols].sum(1)).drop(cols,1)
df_hur['TOR_AREA'] = df_hur['TOR_LENGTH']*df_hur['TOR_WIDTH']
df_hur = df_hur.drop(columns=['TOR_LENGTH', 'TOR_WIDTH'])
df_hur = df_hur.dropna()


NameError: name 'df_all_data' is not defined

In [None]:
## Format DAMAGE_PROPERTY column to be float instead of object

dmg = pd.DataFrame(df_hur['DAMAGE_PROPERTY'])

print(dmg)
for index, row in dmg.iterrows():
    val = row['DAMAGE_PROPERTY']
    if val[-1:] == 'B':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1000000000
    elif val[-1:] == 'M':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1000000
    elif val[-1:] == 'K':
        row['DAMAGE_PROPERTY'] = float(val[:-1])*1000

df_hur['DAMAGE_PROPERTY'] = dmg

       DAMAGE_PROPERTY
75               5.00K
304              1.50M
617             10.00K
731            750.00K
732             10.00K
...                ...
688471           0.00K
688472         250.00K
688495          60.00K
688497           0.00K
688560           0.00K

[12945 rows x 1 columns]


In [None]:
## Sort by priority variable and find data split percentages

df_hur = df_hur.sort_values('DAMAGE_PROPERTY', ascending=False)
df_hur.info()
df_hur = df_hur.loc[df_hur['TOR_F_SCALE']!='EFU']
print(df_hur['TOR_F_SCALE'].value_counts()/len(df_hur) * 100)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12945 entries, 100619 to 688560
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATE            12945 non-null  object 
 1   YEAR             12945 non-null  int64  
 2   MONTH_NAME       12945 non-null  object 
 3   BEGIN_DATE_TIME  12945 non-null  object 
 4   DAMAGE_PROPERTY  12945 non-null  object 
 5   TOR_F_SCALE      12945 non-null  object 
 6   HARM_TOTAL       12945 non-null  int64  
 7   TOR_AREA         12945 non-null  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 910.2+ KB
EF0    51.117296
EF1    35.737575
EF2     9.852883
EF3     2.576541
EF4     0.628231
EF5     0.087475
Name: TOR_F_SCALE, dtype: float64


In [None]:
df_hur['MONTH_NAME'].value_counts()/len(df_hur) * 100

April        21.677932
May          20.548708
June         12.413519
July          7.093439
March         6.727634
August        6.067594
October       5.137177
November      4.580517
January       4.413519
February      4.159046
September     3.618290
December      3.562624
Name: MONTH_NAME, dtype: float64

In [None]:
## Split data into strata and sample proportinally (stratified sampling)

sampled_df = df_hur.groupby('TOR_F_SCALE', group_keys=False).apply(lambda x: x.sample(frac=0.1))
sampled_df

NameError: name 'df_hur' is not defined

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.bar(sampled_df['TOR_F_SCALE'], sampled_df['DAMAGE_PROPERTY'])
plt.xlabel('Tornado Scale')
plt.ylabel('Damage Property')
plt.show()
plt.savefig('tornado_scale_damage_property.png')

NameError: name 'sampled_df' is not defined

<Figure size 720x720 with 0 Axes>