<a id="top"></a>
# Sometimes they don't make it down in one piece

Every aircraft made by man (using paper or metal) eventually comes down, sometimes not in one piece. In this kernel I explore and analyze plane crashes through history.

## Contents
1. [Geocoding Crash Locations](#1)
1. [Plot of all the Crash locations](#2)
1. [Distribution of crashes through years](#3)
1. [Maps](#maps)
    1. [Crash Locations (till 1940)](#4-0)
    1. [Crash Locations (1940 - 1960)](#4-1940)
    1. [Crash Locations (1960 - 1980)](#4-1960)
    1. [Crash Locations (1980 - 2000)](#4-1980)
    1. [Crash Locations (2000 - 2020)](#4-2000)
1. [Word Clouds](#word_clouds)
    1. [Route During Crash](#5)
    1. [Models that Crash a lot](#6)
    1. [Most occurring words in crash summaries](#7)
1. [Distribution of fatalities in crash](#8)

In [None]:
import numpy as np 
import pandas as pd

import os
from os import path
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
print(os.listdir("../input"))
import urllib.request  # urlopen lives in the urllib.request submodule

In [None]:
# Loading the data
data_d = pd.read_csv('../input/plane-crash/planecrashinfo_20181121001952.csv')
data_d.head()

<a id="1"></a>
## Geocoding Crash Locations
This dataset has a feature called ```location```, which denotes the location of the crash. I had to geocode these locations, and I used [HERE](https://developer.here.com/) to do so. The code can be found [here](https://pastebin.com/WPUEvfQg).

You can also use OpenStreetMap; the code for that can be found [here](https://pastebin.com/qAPhASxx).
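Whichever service is used, the geocoding loop itself tends to look the same: look up each distinct location once and cache the result. The sketch below is not the code behind the links above; `geocode_locations` and `fake_geocoder` are hypothetical names, and the stand-in geocoder exists only so the example runs offline. It also assumes that unknown locations use the same `?` placeholder as other columns in this dataset.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_locations(
    locations: Iterable[str],
    geocode_fn: Callable[[str], Optional[Coordinates]],
) -> Dict[str, Optional[Coordinates]]:
    """Geocode each distinct location once; return {location: (lat, lon) or None}."""
    results: Dict[str, Optional[Coordinates]] = {}
    for raw in locations:
        name = raw.strip()
        # Skip blanks, the assumed '?' placeholder, and names already looked up.
        if not name or name == '?' or name in results:
            continue
        results[name] = geocode_fn(name)
    return results


# Stand-in geocoder (hypothetical) so the sketch runs without network access.
# In practice this would wrap a call to HERE, OpenStreetMap, or another service.
def fake_geocoder(name: str) -> Optional[Coordinates]:
    known = {'Paris, France': (48.8566, 2.3522)}
    return known.get(name)


coords = geocode_locations(
    ['Paris, France', 'Paris, France', '?', 'Nowhere'], fake_geocoder
)
print(coords)
```

Passing the geocoder in as a function keeps the caching logic independent of the service, so swapping providers only means swapping `geocode_fn`.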

[back to top](#top)

In [None]:
# Loading geocoded data
data_geocoded = pd.read_csv('../input/air-crash-geocoded-20/geocoded_locations_new.csv')
data_geocoded.head()

<a id="2"></a>
## Plot of all Crash Locations in the dataset
[back to top](#top)

In [None]:
py.init_notebook_mode(connected=True)


data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'],
        lat = data_geocoded['Latitude'],
        text = data_geocoded['Locations'],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

It seems that every continent has its fair share of crashes. There are almost no crashes in Antarctica and Greenland, which is understandable.

<a id="3"></a>
## Distribution of the number of crashes through the years
[back to top](#top)

In [None]:
fig = ff.create_distplot([data_geocoded['year']], ['Crashes'])
fig['layout'].update(title='Distplot of Crashes through years', xaxis=dict(title='Year'))
py.iplot(fig, filename='Basic Distplot')

We see very few crashes near 1920. This is because commercial airlines were first introduced in [1915](https://en.wikipedia.org/wiki/Airliner#History). Since then, they have grown in popularity, which led to more metallic birds over our heads and more crashes.

From somewhere around the 1990s we see a drop in the number of crashes. This could be due to improvements in technology (or maybe the dataset has not recorded every crash ¯\\_(ツ)_/¯  ).
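One way to check the trend without reading it off the density curve is to count crashes per decade. This is a minimal sketch with made-up sample years; `crashes_per_decade` is a hypothetical helper, and it assumes the `year` column contains no missing values.

```python
from collections import Counter


def crashes_per_decade(years):
    """Return {decade_start: number_of_crashes}, ordered by decade."""
    counts = Counter((int(y) // 10) * 10 for y in years)
    return dict(sorted(counts.items()))


# Made-up years for illustration only, not taken from the dataset.
sample_years = [1921, 1925, 1947, 1972, 1978, 1979, 1994, 2003]
decade_counts = crashes_per_decade(sample_years)
print(decade_counts)  # {1920: 2, 1940: 1, 1970: 3, 1990: 1, 2000: 1}
```

On the real data this would be called as `crashes_per_decade(data_geocoded['year'])`.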

<a id="maps"></a>
## Crashes grouped by years
[back to top](#top)

<a id='4-0'></a>
### till 1940

In [None]:
data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'][data_geocoded['year']<1940],
        lat = data_geocoded['Latitude'][data_geocoded['year']<1940],
        text = data_geocoded['Locations'][data_geocoded['year']<1940],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes Before 1940<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

<a id="4-1940"></a>
### 1940-1960

In [None]:
data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'][(data_geocoded['year']>=1940).values & (data_geocoded['year']<1960).values],
        lat = data_geocoded['Latitude'][(data_geocoded['year']>=1940).values & (data_geocoded['year']<1960).values],
        text = data_geocoded['Locations'][(data_geocoded['year']>=1940).values & (data_geocoded['year']<1960).values],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes 1940-1960<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

<a id="4-1960"></a>
### 1960-1980

In [None]:
data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'][(data_geocoded['year']>=1960).values & (data_geocoded['year']<1980).values],
        lat = data_geocoded['Latitude'][(data_geocoded['year']>=1960).values & (data_geocoded['year']<1980).values],
        text = data_geocoded['Locations'][(data_geocoded['year']>=1960).values & (data_geocoded['year']<1980).values],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes 1960-1980<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

<a id="4-1980"></a>
### 1980-2000

In [None]:
data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'][(data_geocoded['year']>=1980).values & (data_geocoded['year']<2000).values],
        lat = data_geocoded['Latitude'][(data_geocoded['year']>=1980).values & (data_geocoded['year']<2000).values],
        text = data_geocoded['Locations'][(data_geocoded['year']>=1980).values & (data_geocoded['year']<2000).values],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes 1980-2000<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

<a id="4-2000"></a>
### 2000-2020

In [None]:
data = [ dict(
        type = 'scattergeo',
        lon = data_geocoded['Longitude'][(data_geocoded['year']>=2000).values & (data_geocoded['year']<2020).values],
        lat = data_geocoded['Latitude'][(data_geocoded['year']>=2000).values & (data_geocoded['year']<2020).values],
        text = data_geocoded['Locations'][(data_geocoded['year']>=2000).values & (data_geocoded['year']<2020).values],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.6,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
            color = 'red',
        ))]

layout = dict(
        title = 'Plane crashes 2000-2020<br>(Hover for crash locations)',
        geo = dict(
            showland = True,
            landcolor = "rgb(255, 255, 255)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='historic-crashes' )

We can see that the crash sites are initially concentrated in North America and Europe. They then start to spread to other parts of the globe.
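The map cells above differ only in their year range, so the trace could be built by one small helper. The sketch below assumes the column names used in this kernel (`year`, `Longitude`, `Latitude`, `Locations`); `period_trace` is a hypothetical name and the demo frame is made up.

```python
import pandas as pd


def period_trace(df, start=None, end=None):
    """Build a scattergeo trace dict for crashes with start <= year < end."""
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= df['year'] >= start
    if end is not None:
        mask &= df['year'] < end
    subset = df[mask]
    return dict(
        type='scattergeo',
        lon=subset['Longitude'],
        lat=subset['Latitude'],
        text=subset['Locations'],
        mode='markers',
        marker=dict(
            size=8, opacity=0.6, symbol='circle', color='red',
            line=dict(width=1, color='rgb(102, 102, 102)'),
        ),
    )


# Tiny made-up frame with the same column names as data_geocoded.
demo = pd.DataFrame({
    'year': [1935, 1950, 1975, 2005],
    'Longitude': [2.35, -74.0, 139.7, 77.2],
    'Latitude': [48.85, 40.7, 35.7, 28.6],
    'Locations': ['Paris', 'New York', 'Tokyo', 'Delhi'],
})
trace = period_trace(demo, 1940, 1980)
print(list(trace['text']))  # ['New York', 'Tokyo']
```

With this in place, each map would reduce to `data = [period_trace(data_geocoded, 1940, 1960)]` followed by the same layout and `py.iplot` call.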

<a id="word_clouds"></a>
## Word Clouds

<a id="5"></a>
### Routes that have a lot of crashes
[back to top](#top)

In [None]:
text = ' '.join(data_d['route'][data_d['route']!='?'].values)
import urllib.request  # urlopen lives in the urllib.request submodule
file = urllib.request.urlopen('https://i.pinimg.com/564x/f8/98/f2/f898f2b1d68f0218f7dbc2a459a60bb0.jpg')
img = Image.open(file)
alice_mask = np.array(img)
stopwords = set(STOPWORDS)
wc = WordCloud(background_color="white", max_words=200, mask=alice_mask,
               stopwords=stopwords, contour_width=0)

# generate word cloud
wc.generate(text.lower())
plt.figure(figsize = (10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

We see that many of the planes that crashed had New York, Alaska (AK) or Paris in their route. Planes on training flights also crashed a lot.
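A word cloud splits multi-word place names such as "New York" into separate words, so counting whole stops can be a useful cross-check. This is a sketch only: it assumes route stops are separated by `' - '`, which should be verified against the `route` column, and `count_route_stops` is a hypothetical helper run here on made-up routes.

```python
from collections import Counter


def count_route_stops(routes, separator=' - '):
    """Count how often each stop appears across routes, skipping unknown ('?') ones."""
    counts = Counter()
    for route in routes:
        if route == '?':
            continue
        for stop in route.split(separator):
            stop = stop.strip()
            if stop:
                counts[stop] += 1
    return counts


# Made-up routes for illustration only.
sample_routes = ['Paris - London', 'New York - Paris', 'Training', '?']
stop_counts = count_route_stops(sample_routes)
print(stop_counts.most_common(1))  # [('Paris', 2)]
```

On the real data this would be called as `count_route_stops(data_d['route'])`.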

<a id="6"></a>
### Model of planes that crash a lot
[back to top](#top)

In [None]:
text = ' '.join(data_d['ac_type'][data_d['ac_type']!='?'].values)
file = urllib.request.urlopen('https://i.pinimg.com/564x/f8/98/f2/f898f2b1d68f0218f7dbc2a459a60bb0.jpg')
img = Image.open(file)
alice_mask = np.array(img)
stopwords = set(STOPWORDS)
wc = WordCloud(background_color="white", max_words=200, mask=alice_mask,
               stopwords=stopwords, contour_width=0)

# generate word cloud
wc.generate(text.lower())
plt.figure(figsize = (10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

Douglas DCs and Boeings seem to be crashing a lot.

It turns out that the Douglas DC is the most produced airliner/transport aircraft. (Source: [Wikipedia](https://en.wikipedia.org/wiki/List_of_most-produced_aircraft))

<a id="7"></a>
### Most occurring words in crash summaries
[back to top](#top)

In [None]:
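# Join the known summaries with spaces so that the last word of one summary
# is not glued to the first word of the next.
text = ' '.join(i for i in data_d['summary'] if i != '?')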

file = urllib.request.urlopen('https://i.pinimg.com/564x/f8/98/f2/f898f2b1d68f0218f7dbc2a459a60bb0.jpg')
img = Image.open(file)
alice_mask = np.array(img)
stopwords = set(STOPWORDS)
stopwords.add("aircraft")
stopwords.add("crashed")
stopwords.add("airplane")
stopwords.add("plane")
stopwords.add("helicopter")
wc = WordCloud(background_color="white", max_words=200, mask=alice_mask,
               stopwords=stopwords, contour_width=0)

# generate word cloud
wc.generate(text.lower())
plt.figure(figsize = (10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

The word cloud above highlights the terms that recur most often in crash summaries, which gives a good sense of the most common causes of crashes.
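To see the actual counts behind a cloud like this, the most frequent words can be tallied directly. The sketch below uses a simple regular-expression tokenizer, which is a simplification and not the exact tokenization that `WordCloud` applies; `top_words` is a hypothetical helper and the sample summaries are made up.

```python
import re
from collections import Counter


def top_words(texts, stopwords, n=10):
    """Return the n most frequent words across texts, ignoring stopwords and '?'."""
    counts = Counter()
    for text in texts:
        if text == '?':
            continue
        for word in re.findall(r"[a-z']+", text.lower()):
            if len(word) > 1 and word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Made-up summaries for illustration only.
sample = [
    'Crashed into a mountain in fog.',
    'Engine failure after takeoff.',
    'Crashed in fog.',
    '?',
]
stop = {'a', 'in', 'into', 'after', 'crashed'}
most_frequent = top_words(sample, stop, n=1)
print(most_frequent)  # [('fog', 2)]
```

On the real data, `top_words(data_d['summary'], stopwords)` would reuse the stopword set built in the cell above.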

<a id="8"></a>
### Distribution of number of fatalities during crash
[back to top](#top)

In [None]:
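# Keep the leading number of each fatalities entry, skipping unknown ('?')
# values. Only crashes with at most 100 fatalities are plotted.
fatalities = []
for i in data_geocoded['fatalities']:
    n = i.split()[0]
    if n != '?' and int(n) <= 100:
        fatalities.append(int(n))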

fig = ff.create_distplot([fatalities], ['Fatalities'])
fig['layout'].update(title='Distplot of Fatalities', xaxis=dict(
        title='No. of Fatalities',
        titlefont=dict(
            color='#7f7f7f'
        )
    ))
py.iplot(fig, filename='Basic Distplot')

Most of the time, only 1, 2 or 3 people were killed in a crash (that's a relatively good thing). Note that the plot only covers crashes with at most 100 fatalities.
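The "1, 2 or 3" observation can be put into a number by computing the share of crashes at or below a given fatality count. This is a minimal sketch: `parse_fatalities` and `share_at_most` are hypothetical helpers, the sample values are made up, and the trailing breakdown in the first sample is only an assumption about what the field may contain after the leading number.

```python
def parse_fatalities(value):
    """Return the leading integer of a fatalities field, or None if unknown."""
    parts = str(value).split()
    if not parts or not parts[0].isdigit():
        return None
    return int(parts[0])


def share_at_most(values, limit):
    """Fraction of crashes with a known fatality count that is <= limit."""
    known = [n for n in map(parse_fatalities, values) if n is not None]
    if not known:
        return 0.0
    return sum(n <= limit for n in known) / len(known)


# Made-up values for illustration only.
sample = ['2   (passengers:1  crew:1)', '?', '150', '3', '1']
share = share_at_most(sample, 3)
print(share)  # 0.75
```

Unlike the plot, this does not drop crashes with more than 100 fatalities, so the share is taken over all crashes with a known count.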