# Natural Disasters Datasets Exploration

## Preliminary Wrangling

> This document explores our natural disaster datasets, which contain data from x to y, so that we can look for patterns and hidden behaviours in natural disasters (earthquakes, volcanoes, and tsunamis).


In [1]:
# import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import requests
import glob

%matplotlib inline

> Our goal is to explore the behaviour of these disasters and to find explanations for the unexpected ones. We also want to find relations between the properties of each disaster (e.g. the relation between magnitude and focal depth) and to try to predict some aspects of earthquakes.

### What is the structure of your dataset?

> The main dataset of this project is the earthquake dataset, which has 22 features. We are interested in six of them: time, latitude, longitude, depth, mag, and magType. Latitude, longitude, depth, and mag are numerical, time is a timestamp, and magType is categorical.

### What is/are the main feature(s) of interest in your dataset?

> The most important features are the magnitude and focal depth.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Other features will help us, such as longitude, latitude, and timestamp. We will also calculate further features from the data we have, such as the number of aftershocks.

In [None]:
# Download one CSV file per month for 2015-2021 from the USGS event service.
# Each window ends on the 1st of the following month at 23:59:59, so consecutive
# files overlap by one day; duplicate rows are dropped later in the notebook.
base = 'https://earthquake.usgs.gov/fdsnws/event/1/query.csv'
for i in range(7):
    year = 15 + i
    for month in range(1, 13):
        # December's window ends on January 1st of the next year
        end_year, end_month = (year + 1, 1) if month == 12 else (year, month + 1)
        url = ('{b}?starttime=20{y}-{s:02d}-01%2000:00:00'
               '&endtime=20{y2}-{e:02d}-01%2023:59:59&orderby=time').format(
                   b=base, y=year, s=month, y2=end_year, e=end_month)
        r = requests.get(url, allow_redirects=True)  # follow redirects to get the content
        name = '20{}-{}-0.csv'.format(year, month)
        with open(name, 'wb') as f:
            f.write(r.content)
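The monthly windows used by the download cell can also be generated from calendar dates, which avoids manual zero-padding and the special case for December. The following is a minimal sketch, not the notebook's own method: `month_windows` and `build_url` are hypothetical helper names, the endpoint and the `starttime`, `endtime`, and `orderby` parameters are the ones already used above, and it assumes the service reads a date without a time component as midnight UTC, so that consecutive windows meet at midnight instead of overlapping by a day. It only builds the URLs and does not download anything.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = 'https://earthquake.usgs.gov/fdsnws/event/1/query.csv'


def month_windows(first_year, last_year):
    """Yield (start, end) dates per calendar month; end is the 1st of the next month."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            start = date(year, month, 1)
            end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
            yield start, end


def build_url(start, end):
    """Build the query URL for one window (hypothetical helper)."""
    params = {'starttime': start.isoformat(),
              'endtime': end.isoformat(),
              'orderby': 'time'}
    return BASE_URL + '?' + urlencode(params)


windows = list(month_windows(2015, 2021))
print(len(windows))             # number of monthly windows
print(build_url(*windows[0]))   # URL for January 2015
```

Each URL could then be passed to `requests.get` as in the cell above, ideally followed by `r.raise_for_status()` so that a failed request is not silently written to disk as a CSV file.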

In [3]:
# Re-download selected months as two half-month files (suffix -0 and -1).
# Each entry maps a two-digit year to (months, end day of the first window).
base = 'https://earthquake.usgs.gov/fdsnws/event/1/query.csv'
windows = {18: (range(6, 8), 15), 19: (range(7, 9), 1), 20: ([5], 1)}

for i, (months, first_end) in windows.items():
    for j in months:
        # (start day, end day, file suffix) for the two requests per month
        for start_day, end_day, part in [('01', '{:02d}'.format(first_end), 0), ('16', '31', 1)]:
            url = '{b}?starttime=20{y}-{m}-{sd}%2000:00:00&endtime=20{y}-{m}-{ed}%2023:59:59&orderby=time'.format(
                b=base, y=i, m=j, sd=start_day, ed=end_day)
            r = requests.get(url, allow_redirects=True)  # follow redirects to get the content
            name = '20{}-{}-{}.csv'.format(i, j, part)
            with open(name, 'wb') as f:
                f.write(r.content)

In [49]:
#Load all csv files in a dataframe
#path = r'C:\Users\pc\graduation-project-natural-disasters-main\Project Template'
path = r'E:\Coding\GP\graduation-project-natural-disasters\Project Template'
all_files = glob.glob(path + "/*.csv")
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002439 entries, 0 to 1002438
Data columns (total 22 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   time             1002439 non-null  object 
 1   latitude         1002439 non-null  float64
 2   longitude        1002439 non-null  float64
 3   depth            1002438 non-null  float64
 4   mag              1001552 non-null  float64
 5   magType          1001550 non-null  object 
 6   nst              607473 non-null   float64
 7   gap              733466 non-null   float64
 8   dmin             698514 non-null   float64
 9   rms              1002082 non-null  float64
 10  net              1002439 non-null  object 
 11  id               1002439 non-null  object 
 12  updated          1002439 non-null  object 
 13  place            1002439 non-null  object 
 14  type             1002439 non-null  object 
 15  horizontalError  635107 non-null   float64
 16  depthError       1
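Since only a few of the 22 columns are kept, one option is to select them while reading the files instead of dropping the rest afterwards. The sketch below illustrates this with the `usecols` parameter of `pd.read_csv`. It writes two tiny stand-in CSV files with made-up rows into a temporary folder so that it runs on its own; the folder, file names, and values are assumptions for illustration, not the project data.

```python
import glob
import os
import tempfile

import pandas as pd

# Columns to keep (same set the notebook retains after its drop step)
wanted = ['time', 'latitude', 'longitude', 'depth', 'mag', 'id', 'place']

# Build two small stand-in CSV files with made-up rows
tmp_dir = tempfile.mkdtemp()
for n in range(2):
    pd.DataFrame({
        'time': ['2015-01-0{}T00:00:00.000Z'.format(n + 1)],
        'latitude': [10.0 + n], 'longitude': [20.0 + n],
        'depth': [5.0], 'mag': [1.5], 'magType': ['ml'],
        'id': ['demo{}'.format(n)], 'place': ['1km N of Demo, Nowhere'],
    }).to_csv(os.path.join(tmp_dir, 'part-{}.csv'.format(n)), index=False)

# Read only the wanted columns from every file and stack the results
files = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))
demo = pd.concat([pd.read_csv(f, usecols=wanted) for f in files], ignore_index=True)
print(demo.columns.tolist())
```

Reading fewer columns reduces memory use during the concatenation, which may matter with around a million rows.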

In [50]:
#Delete the unwanted columns from the data
frame.drop(['magType', 'nst', 'gap', 'dmin', 'rms', 'net', 'updated', 'type', 'horizontalError', 'depthError', 'magError', 'magNst',
        'status', 'locationSource', 'magSource'], axis=1, inplace = True)

In [51]:
frame['time'] = pd.to_datetime(frame['time'])
#frame['magType'] = frame['magType'].astype(str)
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002439 entries, 0 to 1002438
Data columns (total 7 columns):
 #   Column     Non-Null Count    Dtype              
---  ------     --------------    -----              
 0   time       1002439 non-null  datetime64[ns, UTC]
 1   latitude   1002439 non-null  float64            
 2   longitude  1002439 non-null  float64            
 3   depth      1002438 non-null  float64            
 4   mag        1001552 non-null  float64            
 5   id         1002439 non-null  object             
 6   place      1002439 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(4), object(2)
memory usage: 53.5+ MB


# These two lines need the magType column, so they only work while it is still in the frame
frame['magType'] = frame['magType'].str.capitalize()

frame.magType.unique()

In [53]:
frame[['location','country']]= frame['place'].str.split(', ', n=1, expand=True)
frame.drop(['place', 'location'], axis=1, inplace = True)
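To see what the split on `', '` produces, the following sketch applies the same `str.split` call to three made-up `place` strings (they are illustrative, not taken from the data). It also shows `str.rsplit`, which splits at the last separator instead of the first; this is offered as a possible alternative, not as what the notebook does.

```python
import pandas as pd

# Made-up place strings: a typical one, one without a comma, one with two commas
place = pd.Series(['10km N of Demo Town, Alaska', 'Nevada', '5km S of A, B, CA'])

left = place.str.split(', ', n=1, expand=True)    # same call as in the cell above
right = place.str.rsplit(', ', n=1, expand=True)  # splits at the last ", " instead

print(left)
print(right)
```

A string without a comma leaves the second column empty, which is one way missing values can appear in `country`. With two commas, `split` keeps everything after the first comma, while `rsplit` keeps only the final part.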

In [43]:
frame['day'] = frame.time.dt.day
frame['month'] = frame.time.dt.month
frame['year'] = frame.time.dt.year
frame['timestamp'] = frame.time.values.astype(np.int64)
frame.drop(['time'], axis=1, inplace = True)
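The `timestamp` column holds integers whose magnitude in the sample further down suggests nanoseconds since the Unix epoch. The sketch below shows the conversion in both directions on a single time string. The string is chosen for illustration, and the resolution is pinned to nanoseconds explicitly, on the assumption that the default resolution may differ between pandas versions.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(pd.Series(['2018-04-29T16:20:53.653Z']), utc=True)

# Drop the timezone (values stay in UTC) and pin the resolution to nanoseconds
# so the integer unit does not depend on the pandas version
naive_ns = times.dt.tz_convert(None).astype('datetime64[ns]')
timestamp = naive_ns.astype(np.int64)

# Convert the integers back to timezone-aware datetimes
restored = pd.to_datetime(timestamp, unit='ns', utc=True)
print(timestamp.iloc[0], restored.iloc[0])
```

Keeping the unit in mind matters when computing time differences: dividing a difference of two such integers by `1e9` gives seconds.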

In [44]:
duplicateDFRow = frame[frame.duplicated(['id'])]
duplicateDFRow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30825 entries, 32066 to 1002438
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   30825 non-null  float64
 1   longitude  30825 non-null  float64
 2   depth      30825 non-null  float64
 3   mag        30804 non-null  float64
 4   id         30825 non-null  object 
 5   country    29800 non-null  object 
 6   day        30825 non-null  int64  
 7   month      30825 non-null  int64  
 8   year       30825 non-null  int64  
 9   timestamp  30825 non-null  int64  
dtypes: float64(4), int64(4), object(2)
memory usage: 2.6+ MB
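Checking duplicates on the `id` column and dropping duplicates on whole rows are not the same operation in general. The sketch below uses a made-up four-row frame to show the difference between `duplicated(['id'])`, `duplicated()`, and `drop_duplicates` with and without `subset`.

```python
import pandas as pd

demo = pd.DataFrame({
    'id': ['a1', 'a1', 'b2', 'b2'],
    'mag': [1.0, 1.0, 2.0, 2.1],   # the second b2 row differs in mag
})

by_id = demo.duplicated(['id']).sum()   # rows that repeat an earlier id
full_row = demo.duplicated().sum()      # rows that repeat an earlier row exactly

dedup_rows = demo.drop_duplicates(keep='first')                 # whole-row comparison
dedup_ids = demo.drop_duplicates(subset=['id'], keep='first')   # one row per id

print(by_id, full_row, len(dedup_rows), len(dedup_ids))
```

In the run recorded in this notebook the two counts appear to coincide (1,002,439 rows before and 971,614 after, a difference of 30,825, the same as the number of repeated ids), but passing `subset=['id']` would make that intent explicit.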


In [55]:
frame.drop_duplicates(keep='first',inplace=True)

In [56]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 971614 entries, 0 to 1002127
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype              
---  ------     --------------   -----              
 0   time       971614 non-null  datetime64[ns, UTC]
 1   latitude   971614 non-null  float64            
 2   longitude  971614 non-null  float64            
 3   depth      971613 non-null  float64            
 4   mag        970748 non-null  float64            
 5   id         971614 non-null  object             
 6   country    934338 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(4), object(2)
memory usage: 59.3+ MB


In [57]:
df1 = frame.dropna()
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 934006 entries, 0 to 1002127
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype              
---  ------     --------------   -----              
 0   time       934006 non-null  datetime64[ns, UTC]
 1   latitude   934006 non-null  float64            
 2   longitude  934006 non-null  float64            
 3   depth      934006 non-null  float64            
 4   mag        934006 non-null  float64            
 5   id         934006 non-null  object             
 6   country    934006 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(4), object(2)
memory usage: 57.0+ MB
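Before dropping incomplete rows it can help to see which columns are responsible. The sketch below, on a made-up frame, counts missing values per column and the number of rows with at least one missing value; the column names mirror the notebook, the values do not.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'depth': [10.0, np.nan, 5.0, 7.0],
    'mag': [1.2, 2.0, np.nan, 0.8],
    'country': ['Alaska', None, 'CA', None],
})

missing_per_column = demo.isna().sum()                 # missing values in each column
rows_with_any_missing = demo.isna().any(axis=1).sum()  # rows dropna() would remove
cleaned = demo.dropna()

print(missing_per_column)
print(rows_with_any_missing, len(cleaned))
```

The per-column counts can add up to more than the number of dropped rows, because one row may be missing several values. The outputs above show the same pattern: the missing counts for `depth`, `mag`, and `country` sum to 38,143, while `dropna()` removes 37,608 rows.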


In [21]:
frame.dtypes

latitude     float64
longitude    float64
depth        float64
mag          float64
id            object
country       object
day            int64
month          int64
year           int64
timestamp      int64
dtype: object

In [22]:
frame.sample(25)

| | latitude | longitude | depth | mag | id | country | day | month | year | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 479990 | 63.3924 | -149.5899 | 101.7 | 1.1 | ak0185h2lmwl | Alaska | 29 | 4 | 2018 | 1525018853653000000 |
| 81417 | 39.484 | -123.1135 | 2.339 | 1.43 | nc71091904 | California | 16 | 5 | 2015 | 1431746527910000000 |
| 305962 | 38.4084 | -118.7311 | 1.3 | 1.2 | nn00580213 | | 27 | 2 | 2017 | 1488190660601000000 |
| 741076 | 60.2678 | -143.1538 | 12.0 | 2.4 | ak019c31hgv8 | Alaska | 20 | 9 | 2019 | 1568957292118000000 |
| 895908 | 33.9645 | -116.678667 | 9.04 | 1.11 | ci38611538 | CA | 27 | 7 | 2020 | 1595869987270000000 |
| 256922 | 53.7633 | -167.0923 | 11.7 | 1.9 | ak016burwm7p | Alaska | 14 | 9 | 2016 | 1473831983895000000 |
| 377546 | 33.2585 | -116.359833 | 12.64 | 0.68 | ci37985904 | CA | 27 | 8 | 2017 | 1503875186360000000 |
| 680837 | 52.041667 | -175.986333 | 12.85 | 0.12 | av70827854 | Alaska | 8 | 4 | 2019 | 1554760150620000000 |
| 770719 | 19.422 | -155.247833 | 0.53 | 2.77 | hv72206366 | Hawaii | 29 | 10 | 2020 | 1603969473790000000 |
| 629271 | 54.0462 | -165.5404 | 73.8 | 1.9 | ak019epai8f3 | Alaska | 16 | 11 | 2019 | 1573878339837000000 |


## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!