INTRODUCTION
------------

In this notebook, we'll explore significant earthquake activity over the last 50 years, using data collected around the planet by the National Earthquake Information Center (NEIC). An earthquake counts as "significant" if its magnitude exceeds 5.4 (as noted in the description of the dataset). For more information about the data we're going to use, you can visit the [US Geological Survey][1].


  [1]: https://www.kaggle.com/usgs

1.- DATA PREPARATION
--------------------

(Some of the packages we're going to use define objects with the same name, which raises warning messages when they are loaded. I'll wrap each call in suppressWarnings and suppressMessages to silence them.)

In [None]:
suppressWarnings(suppressMessages(library(forecast)))
suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(astsa)))
suppressWarnings(suppressMessages(library(maps)))
suppressWarnings(suppressMessages(library(plyr)))
suppressWarnings(suppressMessages(library(fpp)))
suppressWarnings(suppressMessages(library(lubridate)))

In [None]:
database <- fread("../input/database.csv", stringsAsFactors = TRUE) #Read the data
database$Date <- as.Date(database$Date, format="%d/%m/%Y") #Parse the dates; strings that don't match the format become NA
database <- database[Type=="Earthquake"] #Keep only the earthquakes
database <- database[,c("ID","Date","Time","Latitude","Longitude","Magnitude")] #Keep only the columns we need
database <- database[complete.cases(database[,2]),] #Drop the rows with NA in Date
database$Year <- format(database$Date, "%Y") #Date is already parsed; we'll need the year later on
summary(database)

After preprocessing, the data is cleaner and easier to work with. We've dropped 14,211 records in which the Date was NA and therefore unusable, which leaves 9,201 complete earthquake records.
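The number of dropped rows depends entirely on how the `Date` strings are parsed, so it can be worth checking the parse on a handful of values first. Here is a minimal sketch in base R, assuming the same `%d/%m/%Y` format as above. The strings in `raw_dates` are invented for illustration and are not taken from the dataset:

```r
# Hypothetical sample strings, for illustration only
raw_dates <- c("02/01/1965", "13/01/1965", "01/13/1965", "1975-02-23T02:58:41.000Z")

# as.Date() returns NA for any string that does not match the given format
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Count the failures and list the strings that could not be parsed
n_failed <- sum(is.na(parsed))
failed <- raw_dates[is.na(parsed)]

print(n_failed)
print(failed)
```

Running the same two lines on the real `Date` column before calling `complete.cases` would show which strings are about to be discarded.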

2.- EXPLORATORY DATA ANALYSIS
-----------------------------

 - **NUMBER AND MAGNITUDE OF EARTHQUAKES AROUND THE WORLD**

Let's start by looking at the global seismic activity.

In [None]:
world_map <- ggplot(database) + borders("world", colour="black", fill="gray50") #Base layer: world outline
print(world_map + geom_point(aes(x=Longitude, y=Latitude, color=Magnitude), shape=18) + #Refer to columns by name inside aes()
        scale_color_gradient(low="blue", high="red") +
        theme(legend.position = "top")+
        ggtitle("Earthquakes by Magnitude")+labs(caption="jhervas"))

The world map of earthquake activity closely resembles the map of the [earth's tectonic plates][1]. As we might expect, the places closest to the plate boundaries are the zones with the highest activity.

It is worth noting that most of the earthquakes have a magnitude close to 6, and very few exceed a magnitude of 7 over the whole period. I want to take a look at the distribution of magnitude across the tectonic activity:



  [1]: http://kidspressmagazine.com/wp-content/uploads/2014/04/dreamstimeextralarge_30353174-copy.jpg

 - **DISTRIBUTION OF THE MAGNITUDE ACROSS EARTHQUAKES**

In [None]:
ggplot(database, aes(Magnitude))+
  geom_area(aes(y = after_stat(count)), stat = "bin", fill = "blue")+ #A constant fill goes outside aes(), so no legend is created
  labs(title="Earthquakes",caption="jhervas")

As we suspected, most of the earthquakes have a magnitude below 6, and almost none exceed a magnitude of 8. At this point, we could wonder whether this distribution has remained constant over the last 50 years. Let's find out:

In [None]:
magnitudes_over_years <- ddply(database, .(Year), summarize, Mean_Magnitude=mean(Magnitude)) #Mean magnitude per year

Magnitudes <- ts(magnitudes_over_years$Mean_Magnitude, #Pass the numeric column, not a one-column data frame
                  start=1965, #min(database$Date, na.rm=TRUE)
                  end=2016, #max(database$Date, na.rm=TRUE),
                  frequency=1)
plot(Magnitudes)

At first glance, the mean magnitude of the earthquakes appears to have changed a lot during this time, especially during the 1970s. Nonetheless, we have to notice that the yearly mean varies by less than 0.25 points over the whole record. Let's take a look at the number of earthquakes over the years:
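As an aside, one way to put a number on that range is to subtract the smallest yearly mean from the largest. Here is a small sketch in base R on invented values. The `toy` data frame is hypothetical and is not the earthquake data, and `aggregate()` stands in for `ddply()` only to keep the example free of dependencies:

```r
# Hypothetical magnitudes, for illustration only
toy <- data.frame(
  Year = c("1965", "1965", "1966", "1966", "1967"),
  Magnitude = c(5.5, 6.5, 5.8, 6.0, 6.1)
)

# Mean magnitude per year (one row per year)
yearly_mean <- aggregate(Magnitude ~ Year, data = toy, FUN = mean)

# Spread between the highest and the lowest yearly mean
spread <- diff(range(yearly_mean$Magnitude))

print(yearly_mean)
print(spread)
```

Applied to `magnitudes_over_years$Mean_Magnitude`, the same `diff(range(...))` call would give the range discussed above.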

 - **NUMBER OF EARTHQUAKES ALONG THE LAST 50 YEARS**

In [None]:
Earthquakes <- ts(as.vector(table(database$Year)), #Number of records per year, as a plain vector
                           start=1965, #min(database$Date, na.rm=TRUE)
                           end=2016, #max(database$Date, na.rm=TRUE),
                           frequency=1)
plot(Earthquakes)

Very interesting. The seismic activity has fluctuated a lot, especially during the last 5 years. In 2011 we had a peak (with more than 350 earthquakes, the highest count in the last 50 years!) followed by a sudden fall the next year.
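Peaks and drops like these can also be located programmatically rather than read off the plot. Here is a minimal sketch in base R, assuming a yearly `ts` object. The `counts` values are invented for illustration and are not the real yearly totals:

```r
# Hypothetical yearly counts, for illustration only
counts <- ts(c(120, 150, 90, 200, 110), start = 2008, frequency = 1)

# time() gives the year attached to each observation
peak_year <- time(counts)[which.max(counts)]

# diff() gives the change from one year to the next
yearly_change <- diff(counts)
largest_drop_year <- time(yearly_change)[which.min(yearly_change)]

print(peak_year)
print(largest_drop_year)
```

The same calls on `Earthquakes` would return the year with the highest count and the year of the steepest fall.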

In [None]:
#First differences (year-over-year changes) of both series
diff_Earthquakes <- diff(Earthquakes)
diff_Magnitudes <- diff(Magnitudes)
par(mfrow=c(2,1)) #Stack the two plots vertically
plot(diff_Earthquakes)
plot(diff_Magnitudes)

In [None]:
#Ljung-Box test: the null hypothesis is no autocorrelation up to the given lag
Box.test(diff_Earthquakes, lag=20, type="Ljung-Box")
Box.test(diff_Magnitudes, lag=20, type="Ljung-Box")

In [None]:
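#Augmented Dickey-Fuller test: the null hypothesis is a unit root (non-stationary series)
adf.test(diff_Earthquakes, alternative ="stationary")
adf.test(diff_Magnitudes, alternative ="stationary")

#KPSS test: the null hypothesis is stationarity, the reverse of the ADF test
kpss.test(diff_Earthquakes)
kpss.test(diff_Magnitudes)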

In [None]:
acf2(diff_Earthquakes)

In [None]:
acf2(diff_Magnitudes)

In [None]:
sarima(Earthquakes, 1, 1, 1)

In [None]:
sarima(Earthquakes, 2, 2, 2)

In [None]:
sarima(Magnitudes, 1, 1, 1)

In [None]:
par(mfrow=c(2,1))
#Forecast both series 5 years ahead with an ARIMA(1,1,1) model
sarima.for(Earthquakes, n.ahead=5, 1, 1, 1)
sarima.for(Magnitudes, n.ahead=5, 1, 1, 1)