#### What is the input dataset?
- I will use the cleaned dataset produced by the initial EDA.

#### What is the purpose of this notebook?
- The number and nature of accidents in the aviation industry do not obviously lean toward any particular location or type.
- When I started on this topic, I was interested in finding patterns in the accidents. The univariate and bivariate analyses provide interesting insights into past accidents. Reading through the accident causes, I realized that the accidents could be viewed from another angle beyond the provided categorical features. The Narrative field is a textual description of each incident, and I want to add another dimension by categorizing incidents based on the words in that description.

#### What is the expected output of this notebook?
- I plan to achieve the following by the end of this notebook:
    - Find frequently used words in the narratives using NLP (CountVectorizer).
    - Remove stop words iteratively using NLTK.
    - Define categories for the words I think reflect common causes of accidents, e.g. pilot error, weather, hijack and many more that I hope to find during the course of this project.
    - Map the frequent words found to these common causes of accidents.
    - Use the results of this notebook to classify the incidents or to find clusters.
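
The mapping step in the plan above could look roughly like the sketch below. The `CAUSE_KEYWORDS` dictionary and the `tag_causes` helper are hypothetical names introduced only for illustration, and the phrases are placeholders rather than results from this dataset; the real lists would come from the frequent words found later in the notebook.

```python
# Hypothetical cause -> phrase lookup; the phrases are illustrative assumptions.
CAUSE_KEYWORDS = {
    "pilot error": ["pilot error"],
    "weather": ["weather", "fog", "thunderstorm"],
    "hijack": ["hijack"],
    "mechanical": ["engine failure", "landing gear"],
    "fire": ["caught fire"],
}


def tag_causes(narrative, cause_keywords=CAUSE_KEYWORDS):
    """Return every cause whose phrases appear in the narrative (substring match)."""
    text = narrative.lower()
    return [
        cause
        for cause, phrases in cause_keywords.items()
        if any(phrase in text for phrase in phrases)
    ]


example = "The aircraft caught fire after an engine failure in heavy fog."
print(tag_causes(example))  # ['weather', 'mechanical', 'fire']
```

Plain substring matching is the simplest option and will also match longer words (for example "hijack" inside "hijacked"); a stricter version could match on whole tokens instead.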

In [52]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
path = './dataset/01_aviation_cleaned.csv'

In [4]:
data = pd.read_csv(path)

## Count Vectorize the Text

In [44]:
# Import library 

from sklearn.feature_extraction.text import CountVectorizer

In [45]:
# Collect all the incident narratives into a list

data_text = [summary for summary in data['Narrative']]

In [79]:
# Use CountVectorizer to get the most used words in the narratives
# Analyzer - analyze by words, not characters
# Tokenizer - None (use the default)
# Preprocessor - None (use the default)
# Stop words - use the built-in English stop words initially
# Maximum features - 5000
# Ngram range - (1, 1), i.e. single words only
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, lowercase=True, strip_accents="unicode",
                             preprocessor=None, stop_words='english', max_features=5000, ngram_range=(1, 1))

In [80]:
# Fit the text to the vectorizer model

data_text_features = vectorizer.fit_transform(data_text)

# Get the words and their counts from the fitted model
# (get_feature_names_out() requires scikit-learn >= 1.0; get_feature_names() was removed in 1.2)

data_text_words = pd.DataFrame(data_text_features.toarray(), columns=vectorizer.get_feature_names_out())

In [81]:
list(data_text_words.columns)

['00',
 '000',
 '01',
 '02',
 '03',
 '04',
 '04r',
 '05',
 '050',
 '05l',
 '05r',
 '06',
 '06r',
 '07',
 '08',
 '080',
 '09',
 '09r',
 '10',
 '100',
 '1000',
 '10000',
 '102',
 '103',
 '104',
 '1049',
 '105',
 '106',
 '107',
 '108',
 '109',
 '10deg',
 '10l',
 '11',
 '110',
 '1100',
 '111',
 '111f',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '118a',
 '119',
 '11f',
 '11s',
 '12',
 '120',
 '1200',
 '12000',
 '121',
 '122',
 '123',
 '124',
 '124c',
 '125',
 '128',
 '129',
 '12bk',
 '13',
 '130',
 '1300',
 '130b',
 '130e',
 '130h',
 '130j',
 '131',
 '133',
 '134',
 '135',
 '135a',
 '139',
 '13l',
 '14',
 '140',
 '1400',
 '141',
 '1420',
 '145',
 '14500',
 '146',
 '148',
 '15',
 '150',
 '1500',
 '153',
 '154',
 '157',
 '158',
 '159',
 '15deg',
 '16',
 '160',
 '1600',
 '163',
 '165',
 '166',
 '168',
 '16r',
 '17',
 '170',
 '1700',
 '172',
 '175',
 '1750',
 '17l',
 '18',
 '180',
 '1800',
 '180deg',
 '185',
 '188',
 '19',
 '190',
 '1900',
 '1900c',
 '1900d',
 '191',
 '1936',
 '19

In [58]:
from nltk.corpus import stopwords

In [104]:
# Extend the NLTK English stop words with the numeric and alphanumeric tokens found in the vocabulary above
stop = stopwords.words('english')
stop += ['00', '000', '01', '02', '03', '04', '04r', '05', '050', '05l', '05r', '06', '06r', '07', '08', '080', '09', '09r', '10', '100', '1000', '10000', '102', '103', '104', '1049', '105', '106', '107', '108', '109', '10deg', '10l', '11', '110', '1100', '111', '111f', '112', '113', '114', '115', '116', '117', '118', '118a', '119', '11f', '11s', '12', '120', '1200', '12000', '121', '361', '376','122', '123', '124', '124c', '125', '128', '129', '12bk', '13', '130', '1300', '130b', '130e', '130h', '130j', '131', '133', '134', '135', '135a', '139', '13l', '14', '140', '1400', '141', '1420', '145', '14500', '146', '148', '15', '150', '1500', '153', '154', '157', '158', '159', '15deg', '16', '160', '1600', '163', '165', '166', '168', '16r', '17', '170', '1700', '172', '175', '1750', '17l', '18', '180', '1800', '180deg', '185', '188', '19', '190', '1900', '1900c', '1900d', '191', '1936', '1938', '1940', '1941', '1942', '1943', '1944', '1945', '1948', '1949', '1950', '1951', '1953', '1954', '1955', '1957', '1963', '1964', '1966', '1968', '1969', '1972', '1973', '1974', '1978', '1979', '1980', '1981', '1982', '1983', '1985', '1986', '1988', '1990', '1991', '1992', '1993', '1994', '1997', '1998', '1999', '19r', '1a', '1a10', '1st', '20', '200', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '200c', '201', '2011', '2013', '2015', '202', '203', '2050', '207', '208', '208b', '20deg', '21', '210', '211', '212', '215', '22', '220', '2200', '222', '225', '227', '228', '22l', '23', '230', '2300', '232', '235', '24', '240', '2400', '241', '242', '24b', '24l', '24r', '24rv', '25', '250', '2500', '25l', '26', '260', '265', '266', '26b', '26l', '27', '270', '2700', '27l', '27r', '28', '280', '2800', '285', '28l', '28r', '29', '290', '29l', '2a', '2r', '2t', '30', '300', '3000', '304', '306', '30deg', '31', '310', '31242', '31243', '314', '315', '31l', '31r', '32', '320', '3200', '320o', '322', '325', '328', '32l', '32r', '33', '330', '3300', '330o', '336', '338', 
'33l', '34', '340', '340b', '340o', '345', '35', '350', '3500', '350o', '35a', '35l', '36', '360', '3600', '37', '38', '39', '390', '3a', '3c', '3deg', '3m', '3mge', '3t', '40', '400', '4000', '41', '410', '410uvp', '42', '420', '4200', '43', '430', '436', '44', '440', '45', '450', '4500', '45deg', '46', '46d', '47', '470', '4700', '47a', '47b', '47d', '47s', '48', '49', '50', '500', '5000', '501', '504', '50832', '50deg', '51', '52', '525', '53', '54', '5400', '55', '550', '5500', '56', '57', '58', '580', '59', '5a', '5h', '5n', '5nm', '5y', '60', '600', '6000', '601', '61', '610', '62', '620', '63', '631', '64', '640', '65', '650', '6500', '66', '67', '68', '6a', '6e', '70', '700', '7000', '707', '71', '710', '72', '720', '725', '727', '73', '737', '74', '747', '748', '75', '750', '7500', '756', '757', '76', '767', '767s', '76md', '770', '777', '78', '78505', '7nm', '80', '800', '8000', '801', '81', '811', '812', '82', '8235', '8241', '82a', '83', '84', '85', '850', '8500', '86', '87', '88', '8807', '89', '8q', '8th', '90', '900', '9000', '90deg', '91', '910', '92', '93', '931', '94', '95', '96', '97', '98', '99', '990', '9k', '9n', '9q', 'a100', 'a300', 'a310', 'a319', 'a320', 'a321', 'a330', 'a6m2', 'aa', 'aaa', 'aab', 'aaf', 'ab', 'ababa']
stop += ['08r', '1009', '101', '1011', '104b', '144', '21st', '221', '22285', '231', '23l', '300yds', '301', '3100', '318', '319', '31de', '321', '321b', '324', '3277', '33666', '33r', '380', '3800', '395', '39a', '3mce', '400a', '402q', '40563', '40deg', '415', '42e', '42w', '4300', '451', '460', '701', '702', '712', '730', '740', '75a', '76td', '7700', '775', '780', '78270', '78462', '785', '797', '800xp', '823', '841', '8as', '8r', '901', '919', '920', '926', '980', '9j', '9l', '9m']
stop += ['675','680','6800','6b']
# stop.append("need")
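
As an alternative to typing the numeric tokens by hand, they could be derived from the fitted vocabulary with a regular expression. This is only a sketch under assumptions: `digit_tokens` is a hypothetical helper, and the toy vocabulary stands in for the real feature names. It would not catch the purely alphabetic entries in the hand-made list (such as 'aa' or 'ababa'), which would still need to be added manually.

```python
import re


def digit_tokens(vocabulary):
    """Return the vocabulary entries that contain at least one digit, sorted."""
    return sorted(token for token in vocabulary if re.search(r"\d", token))


# Toy stand-in for the vectorizer's feature names (assumption for illustration).
toy_vocabulary = ["00", "04r", "10deg", "a320", "aircraft", "runway", "1936"]

numeric_stop = digit_tokens(toy_vocabulary)
print(numeric_stop)  # ['00', '04r', '10deg', '1936', 'a320']
```

In the notebook, the same helper could be applied to the unigram vocabulary and the result appended to `stop`.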

In [105]:
# Use CountVectorizer to get the most used word pairs in the narratives
# Analyzer - analyze by words, not characters
# Tokenizer - None (use the default)
# Preprocessor - None (use the default)
# Stop words - the extended stop word list built above
# Maximum features - 5000
# Ngram range - (2, 2), i.e. bigrams only
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, lowercase=True, strip_accents="unicode",
                             preprocessor=None, stop_words=stop, max_features=5000, ngram_range=(2, 2))

In [106]:
# Fit the text to the vectorizer model

data_text_features = vectorizer.fit_transform(data_text)

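# Get the bigrams and their counts from the fitted model
# (get_feature_names_out() requires scikit-learn >= 1.0; get_feature_names() was removed in 1.2)

data_text_words = pd.DataFrame(data_text_features.toarray(), columns=vectorizer.get_feature_names_out())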

In [111]:
data_text_words.sum().nlargest(1000)

damaged beyond               619
beyond repair                617
substantial damage           400
landing gear                 342
sustained substantial        337
crew members                 325
caught fire                  324
en route                     290
international airport        290
forced landing               256
first officer                230
aircraft crashed             205
flight crew                  195
passenger plane              193
engine failure               186
came rest                    185
transport plane              185
left wing                    184
landing accident             182
short runway                 180
approach runway              178
emergency landing            169
shortly takeoff              161
main gear                    157
end runway                   156
co pilot                     155
air force                    153
right wing                   146
training flight              132
side runway                  128
duration h
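
Bigrams such as `came rest` and `shortly takeoff` in the output above look odd because scikit-learn removes stop words before it forms n-grams, so "came to rest" collapses to "came rest". The sketch below imitates that behaviour in plain Python. It is an illustration, not scikit-learn's implementation: `bigram_counts`, the toy narratives and the toy stop list are assumptions, only the token pattern is scikit-learn's documented default, and accent stripping is left out.

```python
import re
from collections import Counter

# scikit-learn's default token_pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bigram_counts(texts, stop_words):
    """Count bigrams after lowercasing and dropping stop words."""
    stop_words = set(stop_words)
    counts = Counter()
    for text in texts:
        tokens = [
            token
            for token in TOKEN_PATTERN.findall(text.lower())
            if token not in stop_words
        ]
        # Pair each remaining token with its successor.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


toy_narratives = [
    "The aircraft came to rest short of the runway.",
    "The plane came to rest in a field shortly after takeoff.",
]
toy_stop = {"the", "to", "of", "in", "after"}

counts = bigram_counts(toy_narratives, toy_stop)
print(counts["came rest"])        # 2
print(counts["shortly takeoff"])  # 1
```

When reading the bigram table, it helps to keep in mind that a stop word may have been dropped between the two words shown.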

In [None]:
# Summing the sparse matrix directly is far more memory-efficient than
# building a dense DataFrame first: it only needs one int per column,
# since each column sum is that term's total count.

def get_freq_words(sparse_counts, columns):
    # sparse_counts is a sparse matrix, so sum() returns a 'matrix' datatype ...
    #   which we then convert into a 1-D ndarray for sorting
    word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)

    # argsort() returns smallest first, so we reverse the result
    largest_count_indices = word_counts.argsort()[::-1]

    # pretty-print the results! Remember to always ask whether they make sense ...
    freq_words = pd.Series(word_counts[largest_count_indices], 
                           index=columns[largest_count_indices])

    return freq_words


# Apply it to the bigram counts fitted above
columns = np.array(vectorizer.get_feature_names_out())
freq_words = get_freq_words(data_text_features, columns)
freq_words[:20]


# If common English words dominate this list, they need to be filtered out
# by passing stop words to the vectorizer.

In [None]:
cvt = CountVectorizer(stop_words=stop, lowercase=True, strip_accents="unicode", ngram_range=(1,2))
# Count unigrams and bigrams in the incident narratives
X_all = cvt.fit_transform(data_text)
columns = np.array(cvt.get_feature_names_out())

freq_words = get_freq_words(X_all, columns)
freq_words[:20]