# Task: Analysis of Disengagement Descriptions
The goal of this notebook is to explore 2019 autonomous vehicle disengagement reports and find trends and causes. The notebook features n-grams and aggregation.

Thanks to:
* Art124 https://www.kaggle.com/art12400 for both the dataset and the task

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The two datasets cover December 1, 2018 to November 30, 2019. The first-time filers dataset covers companies that received their autonomous vehicle permit during this cycle; the other dataset contains reports from companies whose permits predate this timeframe.

In [None]:
reports = pd.read_csv('/kaggle/input/2019-autonomous-vehicle-disengagement-reports/2019AutonomousVehicleDisengagementReports.csv')
reports_ftf = pd.read_csv('/kaggle/input/2019-autonomous-vehicle-disengagement-reports/2018-19_AutonomousVehicleDisengagementReports(firsttimefilers).csv')

In [None]:
reports.iloc[322,-1]

In [None]:
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import re

In [None]:
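# Join all descriptions into one long string to perform n-gram operations on.
# Joining with a space keeps the last word of one entry from fusing with the first word of the next.
super_string = ' '.join(reports['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'].dropna().astype(str))
super_string = re.sub(r'[^\w\s]', '', super_string)  # strip punctuation
super_string = super_string.split()  # split on any whitespace and drop empty tokens
stop_words = set(nltk.corpus.stopwords.words('english'))
# The stopword list is lowercase, so compare case-insensitively
super_string = [word for word in super_string if word.lower() not in stop_words]
len(super_string)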

# n-Grams
N-grams are a tool in NLP for finding sets of commonly co-occurring words, where n refers to the number of consecutive words. For instance, 'Machine', 'Learning', 'Rules' would be a three-word n-gram, or trigram. Notice how the top seven bigrams overlap and have the same count; this is because there are multiple identical entries. When we view skip-grams (where words don't need to occur immediately next to one another), we start seeing more non-overlapping counts.
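To make the terms concrete before applying them to the real data, here is a minimal pure-Python sketch of n-grams and skip-grams on a made-up four-word description. The helper functions `ngrams` and `skipgrams` are written here for illustration only (they are not the NLTK implementations used in the cells below), and the token list is invented.

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """Return n-token tuples in original order, allowing up to k skipped tokens in total."""
    grams = []
    for i in range(len(tokens) - n + 1):
        head = tokens[i]
        window = tokens[i + 1:i + n + k]  # candidates for the remaining n-1 slots
        for rest in combinations(window, n - 1):
            grams.append((head, *rest))
    return grams

# Invented example description, already tokenized
tokens = ['driver', 'took', 'manual', 'control']

bigrams = ngrams(tokens, 2)                 # only adjacent pairs
skip_bigrams = skipgrams(tokens, n=2, k=1)  # pairs with at most one word skipped

# The same description filed three times: every bigram ends up with the same count
entries = [tokens] * 3
counts = Counter(gram for entry in entries for gram in ngrams(entry, 2))

print(bigrams)
print(skip_bigrams)
print(counts.most_common())
```

Every bigram from a repeated entry is counted once per repetition, which is why overlapping bigrams from a frequently duplicated description share a single count.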

In [None]:
word_fd = nltk.FreqDist(super_string)
monogram_fd = nltk.FreqDist(nltk.ngrams(super_string,1))
monogram_fd.most_common(9)

In [None]:
bigram_fd = nltk.FreqDist(nltk.bigrams(super_string))
bigram_fd.most_common(10)

In [None]:

# 6-grams get their own name so the bigram counts above are not overwritten
sixgram_fd = nltk.FreqDist(nltk.ngrams(super_string, 6))
sixgram_fd.most_common(10)

In [None]:
skipgram_fd = nltk.FreqDist(nltk.skipgrams(super_string, n=4, k=3))

skipgram_fd.most_common(9)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# Value counts
For a simple approach, below are the 15 most common disengagement descriptions, followed by a graph of the distribution of the top 30. The most common entry, 'Safety Driver proactive disengagement', has over 1700 incidents, more than twice the next highest entry, and accounts for roughly one fifth of all entries.

In [None]:
reports['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'].value_counts().iloc[:15].index.tolist()

In [None]:
figure(num=None, figsize=(24, 22))
sns.countplot(y='DESCRIPTION OF FACTS CAUSING DISENGAGEMENT', palette='plasma', data=reports,
              order=reports['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'].value_counts().iloc[:30].index)
sns.despine(bottom=True, left=True)
plt.xticks(rotation=90);
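Comparisons such as "more than twice the next entry" and "one fifth of all entries" can be checked numerically instead of being read off the chart. The sketch below uses a small made-up Series in place of the description column; the labels and counts are invented for illustration and are not taken from the reports.

```python
import pandas as pd

# Toy stand-in for the description column; the values and counts are made up.
descriptions = pd.Series(
    ['proactive disengagement'] * 4
    + ['software discrepancy'] * 2
    + ['planning error'] * 2
    + ['perception issue', 'hardware fault']
)

counts = descriptions.value_counts()                # sorted from most to least common
shares = descriptions.value_counts(normalize=True)  # same order, as fractions of all rows

top_share = shares.iloc[0]                       # fraction of rows held by the top entry
top_to_second = counts.iloc[0] / counts.iloc[1]  # how many times larger the top entry is

print(f"top entry: {counts.index[0]!r}")
print(f"share of all rows: {top_share:.0%}, ratio to runner-up: {top_to_second:.1f}x")
```

On the real data, the same two numbers could be computed by swapping the toy Series for `reports['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT']`.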

For the first-time filers dataset we have a very different set of answers. The most common issue is 'Software Discrepancy', more than four times as common as the next highest entry; however, this issue accounts for less than a tenth of the whole dataset.

In [None]:
figure(num=None, figsize=(20, 12))
sns.countplot(y='DESCRIPTION OF FACTS CAUSING DISENGAGEMENT', palette='plasma', data=reports_ftf,
              order=reports_ftf['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'].value_counts().iloc[:30].index)
sns.despine(bottom=True, left=True)
plt.xticks(rotation=90);

# Uniformity in reports
As we can see below, the rows that share the leading disengagement description are almost identical in every other column (with the obvious exception of date).

In [None]:
reports.loc[reports['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'] == 'Safety Driver proactive disengagement.'].describe()

In [None]:
reports_ftf.loc[reports_ftf['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'] == 'Software Discrepancy'].describe()
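One way to check this uniformity more directly is to count the distinct values per column within the filtered rows. This is a minimal sketch on a toy frame: the `Manufacturer` and `DATE` column names and all values are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Toy stand-in for the reports table; every column name except the description
# column is hypothetical, and all values are made up.
toy = pd.DataFrame({
    'DESCRIPTION OF FACTS CAUSING DISENGAGEMENT': ['A', 'A', 'A', 'B'],
    'Manufacturer': ['X', 'X', 'X', 'Y'],
    'DATE': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-01-05'],
})

# Keep only the rows that share one description
subset = toy.loc[toy['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'] == 'A']

distinct_per_column = subset.nunique()  # number of distinct values in each column
constant_columns = distinct_per_column[distinct_per_column == 1].index.tolist()

print(distinct_per_column)
print(constant_columns)
```

Columns with a single distinct value are the ones that are identical across the subset; in this toy frame only the date column varies.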


This year I've challenged myself to complete one task on Kaggle per week in order to develop a larger data science portfolio. If you found this notebook useful or interesting, please give it an upvote. I'm always open to constructive feedback. If you have any questions, comments, or concerns, or if you would like to collaborate on a future task of the week, feel free to leave a comment here or message me directly. For past tasks of the week (TOTW), check out my GitHub page for this ongoing project: https://github.com/Neil-Kloper/Weekly-Kaggle-Task/wiki