<a href="https://colab.research.google.com/github/saidileep-knv/GMC_CRASH_PREDICTIONS/blob/master/GMC_Crash_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The data consists of 2,375 complaints about specific GMC vehicles submitted to the National Highway Traffic Safety Administration (NHTSA).

The data dictionary is as follows:

nthsa_id: A unique number for each complaint
Year: The car's model year - 2003 through 2011
make: The make of the car - Chevrolet, Pontiac, Saturn
model: The car model - Cobalt, G5, HHR, ION, SKY, SOLSTICE
description: The actual complaint in text format
crashed: A binary attribute - 'N' for no and 'Y' for yes
abs: Anti-lock Brake System - 'N' for no and 'Y' for yes
mileage: The miles on the car at the time of the accident - 0 to 200,000


Objective: To build a model that predicts whether the car was involved in a crash using the complaint and automobile characteristics.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
import os
os.chdir("./gdrive/My Drive/Colab Notebooks")

In [0]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
!pip install newspaper3k
!pip install newsapi-python
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


True

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
import AdvancedAnalytics
from AdvancedAnalytics import ReplaceImputeEncode
from AdvancedAnalytics import logreg
from sklearn.linear_model import LogisticRegression
from AdvancedAnalytics import DecisionTree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from pydotplus.graphviz import graph_from_dot_data
import graphviz

import pandas as pd
import numpy as np
import string
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

Helper Function: my_analyzer(s), called by the sklearn Count and TF-IDF Vectorizers

The following helper function customizes the parsing, POS tagging, stop-word removal, and stemming steps needed for text analysis. These steps use the NLTK package and are customized to remove certain words and symbols and to handle synonyms.

In [0]:
def my_analyzer(s):
    ##Synonym List
    syns = {'veh':'vehicle', 'car':'vehicle', 'chev':'chevrolet',
           'chevy':'chevrolet', 'air bag':'airbag', "n't":'not',
           'seat belt':'seatbelt','to30':'to 30', 'wont':'would not',
           'cant':'can not', 'cannot':'can not', 'couldnt':'could not',
           'shouldnt':'should not', 'wouldnt':'would not',
           'starightforward':'straight forward'}
    
    ##Preprocess String s
    s = s.lower()
    ##Replace special characters with spaces
    s = s.replace('_',' ')
    s = s.replace('-',' ')
    s = s.replace(',','. ')
    ##Replace not contraction with 'not'
    s = s.replace("'nt",' not')
    s = s.replace("n't",' not')
    ##Tokenize
    tokens = word_tokenize(s)
    tokens = [word for word in tokens if ('*' not in word) and
             ("''"!=word) and ("``"!=word) and (word!='description')
             and (word!='dtype') and (word!='object') and (word!="'s")]
    
    ##Map synonyms
    for i in range(len(tokens)):
        if tokens[i] in syns:
            tokens[i] = syns[tokens[i]]
    
    ##Remove Stop Words
    punctuation = list(string.punctuation)+['..','...']
    pronouns = ['i', 'he', 'she', 'it', 'him', 'they', 'we', 'us', 'them']
    others = ["'d", "co", "ed", "put", "say", "get", "can", "become",
              "los", "sta", "la", "use", "iii", "else"]
    stop = stopwords.words('english')+punctuation+pronouns+others
    filtered_terms = [word for word in tokens if (word not in stop) and
                     (len(word)>1) and (not word.replace('.','',1).isnumeric())
                     and (not word.replace("'",'',2).isnumeric())]
    
    ##Lemmatize terms that have a WordNet POS; stem all others
    ##Lemmatization requires a POS tag, so tag the terms first
    tagged_words = pos_tag(filtered_terms, lang='eng')
    stemmer = SnowballStemmer("english")
    wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
    wnl = WordNetLemmatizer()
    stemmed_tokens = []
    for term, tag in tagged_words:
        ##Map the first letter of the Penn Treebank tag to a WordNet POS
        wn_pos = wn_tags.get(tag[0])
        if wn_pos is not None:
            stemmed_tokens.append(wnl.lemmatize(term, pos=wn_pos))
        else:
            ##No WordNet POS for this tag: fall back to the Snowball stemmer
            stemmed_tokens.append(stemmer.stem(term))
    return stemmed_tokens
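
The synonym and filtering steps described above can be tried in isolation. The following is a minimal sketch in plain Python, not the notebook's analyzer: it uses a whitespace split instead of NLTK's tokenizer, a made-up sentence, and a small subset of the synonym table. The helper names map_synonyms and is_number are hypothetical.

```python
# Subset of the synonym table from my_analyzer (illustration only)
syns = {'veh': 'vehicle', 'car': 'vehicle', 'chevy': 'chevrolet',
        'air bag': 'airbag'}

def map_synonyms(tokens, table):
    """Replace each token that is a key in the table with its synonym."""
    return [table.get(tok, tok) for tok in tokens]

def is_number(word):
    """True for tokens such as '35' or '35.5' (at most one decimal point)."""
    return word.replace('.', '', 1).isnumeric()

# Made-up complaint text, tokenized with a plain whitespace split
tokens = "the chevy car lost power at 35.5 mph and the air bag failed".split()

mapped = map_synonyms(tokens, syns)
kept = [t for t in mapped if len(t) > 1 and not is_number(t)]

print(mapped)
print(kept)
```

Because the lookup is done one token at a time, a key that contains a space (such as 'air bag') can never match a single token. In this sketch 'air' and 'bag' stay separate, and the same holds for the multi-word keys in my_analyzer.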

In [0]:
def display_topics(lda, terms, n_terms=15):
    for topic_idx, topic in enumerate(lda):
        if topic_idx>8:
            break
        message = "Topic #%d: " %(topic_idx+1)
        print(message)
        abs_topic = abs(topic)
        topic_terms_sorted = [[terms[i], topic[i]] 
                              for i in abs_topic.argsort()[:-n_terms -1:-1]]
        k = 5
        n = int(n_terms/k)
        m = n_terms-k*n
        for j in range(n):
            l = k*j
            message = ""
            for i in range(k):
                if topic_terms_sorted[i+l][1]>0:
                    word = "+"+topic_terms_sorted[i+l][0]
                else:
                    word = "-"+topic_terms_sorted[i+l][0]
                message += '{:<15s}'.format(word)
            print(message)
        if m>0:
            l = k*n
            message = ""
            for i in range(m):
                if topic_terms_sorted[i+l][1]>0:
                    word = "+"+topic_terms_sorted[i+l][0]
                else:
                    word = "-"+topic_terms_sorted[i+l][0]
                message += '{:<15s}'.format(word)
            print(message)
        print("")
    return

Read the Data File:

The maximum column width in pandas is increased so that long complaint text is not truncated when it is displayed.

In [0]:
##Increase the display column width so long text columns are not truncated
pd.set_option('display.max_colwidth', 32000)

##Read NHTSA comments
df = pd.read_excel("GMC_Complaints.xlsx")

In [9]:
df.head(4)
df.dtypes

Unnamed: 0,nthsa_id,Year,make,model,description,crashed,abs,mileage
0,10022578,2003,SATURN,ION,WHILE TRAVELING ON THE HIGHWAY AND WITHOUT PRIOR WARNING SEAT BELT RETRACTOR FELL APART. *AK THE BOLT THAT CONNECTS THE WEBBING TO THE FLOOR WAS NOT FULLY SCREWED IN AT THE PLANT. THE BOLT BACKED OUT AND THE LOWER PORTION OF THE SEATBELT WEBBING BECAME UNATTACHED. THIS IS NOT A BUCKLE ISSUE OR A RETRACTOR ISSUE. MANUFACTURING DEFECT FROM THE PLANT BECAUSE THE BOLT WAS NOT FULLY TORQUED. DEALER FIXED BY TIGHTENING THE BOLT. CW,N,N,
1,10040419,2003,SATURN,ION,"WHILE DRIVING TRANSMISSION DOES NOT ENGAGE PROPERLY, CAUSING VEHICLE TO STALL. *AK",N,N,
2,10042851,2003,SATURN,ION,"IN A PANIC SITUATION, THE OWNER WAS UNABLE TO LOCATE THE HORN BUTTON DUE TO THE SIZE AND LOCATION. THIS CAUSED A DISTRACTION, DUE TO THE CONSUMER HAVING TO TAKE HER EYES OFF THE ROAD AND LOOK ON THE STEERING WHEEL TO LOCATE THE HORN BUTTON.*AK THE CONSUMER NOTED THE DRIVER'S HEAD REST WAS TILTED TOO FAR FORWARD, THE PROBLEM WAS EVENTUALLY CORRECTED. *JB *NM",N,N,500.0
3,10049638,2003,SATURN,ION,"THE TWO SATURN 2003 IONS I HAVE DRIVEN (INCLUDING MY CURRENT VEHICLE) HAVE A TRANSMISSION PROBLEM WHERE, WHEN ENGAGING IN THIRD GEAR, THE TRANSMISSION WILL ""FREEWHEEL"" FOR SEVERAL SECONDS BEFORE ENGAGING WITH A LURCH. THE PROBLEM CAN BE CREATED BY ACCELERATING AROUND A CORNER WHILE THE TRANSMISSION IS SHIFTING FROM SECOND TO THIRD. THE CARS WERE MANUFACTURED IN MARCH/APRIL, 2003. *AK",N,Y,10600.0


nthsa_id         int64
Year             int64
make            object
model           object
description     object
crashed         object
abs             object
mileage        float64
dtype: object

In [0]:
##Set up program constants
n_comments = len(df['description']) #Number of complaints
m_features = None                   #Max number of terms kept (None = no limit)
s_words = 'english'                 #Stop words dictionary
comments = df['description']        #Complaint text
n_topics = 9                        #Number of topic clusters to extract
max_iter = 10                       #Maximum number of iterations
max_df = 0.5                        #Max proportion of docs/reviews allowed for a term

Tokenization, POS Tagging, Stopwords Removal and Stemming

In [11]:
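##Create word frequency by Review Matrix using Custom Analyzer
##Note: ngram_range only applies when analyzer is not callable, so it has no effect here
cv = CountVectorizer(max_df=0.95, min_df =2, max_features=m_features,
                    analyzer=my_analyzer, ngram_range=(1,2))
tf = cv.fit_transform(comments)
##get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2
terms = cv.get_feature_names()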
term_sums = tf.sum(axis=0)
term_counts = []
for i in range(len(terms)):
    term_counts.append([terms[i], term_sums[0,i]])
def sortSecond(e):
    return e[1]
term_counts.sort(key=sortSecond, reverse=True)
print("\nTerms with Highest Frequency:")
for i in range(10):
    print('{:<15s}{:>5d}'.format(term_counts[i][0], term_counts[i][1]))
print("")


Terms with Highest Frequency:
vehicle         6996
steer           2924
contact         2604
power           2131
failure         1745
drive           1670
problem         1466
chevrolet       1422
turn            1256
recall          1239



Create TFIDF Matrix
    TFIDF is created by transforming the term frequency matrix tf
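
As a small worked illustration of that transformation, the sketch below recomputes TF-IDF by hand for a made-up count matrix. It assumes scikit-learn's default smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, and no row normalization (norm=None). The terms, counts, and variable names are invented for the example.

```python
import math

# Made-up term counts: 3 documents (rows) x 3 terms (columns)
toy_terms = ['vehicle', 'steer', 'brake']
toy_tf = [
    [2, 1, 0],
    [1, 0, 0],
    [3, 0, 1],
]
n_docs = len(toy_tf)

# Document frequency: number of documents that contain each term
doc_freq = [sum(1 for row in toy_tf if row[j] > 0)
            for j in range(len(toy_terms))]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Unnormalized TF-IDF: each count is multiplied by its term's IDF weight
toy_tfidf = [[count * weight for count, weight in zip(row, idf)]
             for row in toy_tf]

for term, weight in zip(toy_terms, idf):
    print('{:<10s}{:>6.3f}'.format(term, weight))
```

A term that appears in every document gets an IDF weight of exactly 1, so its TF-IDF score equals its raw count. Rarer terms receive larger weights.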

In [12]:
##Modify tf - term_frequencies to TF/IDF matrix from the data
print("Conducting Term/Frequency Matrix usinf TF-IDF")
tfidf_vect = TfidfTransformer(norm=None, use_idf=True)
tf = tfidf_vect.fit_transform(tf)

term_idf_sums = tf.sum(axis=0)
term_idf_scores = []
for i in range(len(terms)):
    term_idf_scores.append([terms[i], term_idf_sums[0,i]])
print("The term/Frequency Matrix has", tf.shape[0], "rows, and ", tf.shape[1], "columns.")
print("The Term list has", len(terms), " terms.")
term_idf_scores.sort(key=sortSecond, reverse=True)
print("\nTerms with highest TF-IDF Scores:")
for i in range(10):
    print('{:<15s}{:>8.2f}'.format(term_idf_scores[i][0],term_idf_scores[i][1]))

Conducting Term/Frequency Matrix usinf TF-IDF
The term/Frequency Matrix has 2734 rows, and  3276 columns.
The Term list has 3276  terms.

Terms with highest TF-IDF Scores:
vehicle         8162.39
contact         5371.28
steer           5114.56
power           4081.67
failure         3553.85
problem         3208.86
drive           2958.99
recall          2788.68
turn            2765.90
go              2757.42


Singular Value Decomposition

Use SVD to decompose the TFIDF matrix tf. This is called Latent Semantic Analysis - LSA.
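
For reference, the decomposition can be written as follows. The notation is not from the notebook: X is the TF-IDF matrix with one row per complaint, and k is the number of retained components (n_topics in the code). In scikit-learn's TruncatedSVD, fit_transform returns the document scores and components_ holds the term loadings for each component.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\underbrace{U_k \Sigma_k = X V_k}_{\texttt{uv.fit\_transform(tf)}}, \qquad
\underbrace{V_k^{\top}}_{\texttt{uv.components\_}}
```

Note that the notebook's variable U therefore holds the scaled scores U_k Σ_k (one row per complaint, one column per topic), not the unit-norm left singular vectors alone.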

In [13]:
##SVD is synonymous with LSA in sklearn
uv = TruncatedSVD(n_components=n_topics, algorithm='arpack',
                 tol=0, random_state=9999)
U = uv.fit_transform(tf)

#Display the topic selections
print("\n*******Generated Topics*******")
display_topics(uv.components_, terms, n_terms=15)


*******Generated Topics*******
Topic #1: 
+vehicle       +steer         +contact       +power         +problem       
+would         +go            +recall        +failure       +drive         
+turn          +time          +start         +chevrolet     +gm            

Topic #2: 
-contact       -failure       -mileage       -state         -own           
+problem       -manufacturer  +go            -fuel          -repair        
-current       +start         -campaign      +fix           -mph           

Topic #3: 
-steer         -power         +fuel          +ignition      +key           
+start         +pump          -drive         +switch        +leak          
-go            +saturn        -turn          -wheel         +smell         

Topic #4: 
-power         -steer         +brake         +front         +tire          
+air           +side          -fuel          +bag           +door          
+vehicle       +deploy        +driver        -recall        +hit           

Topic #5

Add Topic Scores to Dataframe
The matrix U contains the SVD calculations that can be used to assign each document to a topic group.

The code below examines the matrix U and assigns topic groups to each document, then augments the original dataframe with document topic group and U.
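
The assignment rule is simple: each document goes to the component on which its score has the largest absolute value. A minimal sketch of that rule on made-up scores follows. The helper assign_topic and the toy numbers are hypothetical and are not taken from the data.

```python
from collections import Counter

def assign_topic(scores):
    """Return the 0-based index of the score with the largest absolute value.

    Ties go to the lowest index, which matches a strict '>' comparison.
    """
    return max(range(len(scores)), key=lambda j: abs(scores[j]))

# Made-up score rows: one row per document, one column per topic
toy_scores = [
    [2.76, 1.05, 0.10],
    [0.20, -3.10, 1.00],
    [0.50, 0.40, -0.90],
    [1.50, -1.50, 0.30],   # tie between topics 0 and 1
]

toy_topics = [assign_topic(row) for row in toy_scores]
toy_topic_counts = Counter(toy_topics)

print(toy_topics)
print(dict(toy_topic_counts))
```

Because the absolute value is used, a document with a strongly negative score on a component is still assigned to that component.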

In [14]:
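##Store topic group for each doc in topics[]
##A doc is assigned to the topic with the largest absolute score
topics = [0]*n_comments
topic_counts = [0]*(n_topics+1)
for i in range(n_comments):
    max_score = abs(U[i][0])  #Named max_score to avoid shadowing the built-in max()
    topics[i] = 0
    for j in range(n_topics):
        x = abs(U[i][j])
        if x > max_score:
            max_score = x
            topics[i] = j
    topic_counts[topics[i]] += 1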

print('{:<6s}{:>8s}{:>8s}'.format("TOPIC", "COMMENTS", "PERCENT"))
for i in range(n_topics):
    print('{:>3d}{:>10d}{:>8.1%}'.format((i+1), topic_counts[i], topic_counts[i]/n_comments))
    
##Create comment_scores[] and assign the topic groups
comment_scores=[]
for i in range(n_comments):
    u = [0]*(n_topics+1)
    u[0] = topics[i]
    for j in range(n_topics):
        u[j+1] = U[i][j]
    comment_scores.append(u)

##Augment Dataframe with topic group information
cols = ['topic']
for i in range(n_topics):
    s = "T"+str(i+1)
    cols.append(s)
df_topics = pd.DataFrame.from_records(comment_scores, columns = cols)
df = df.join(df_topics)

TOPIC COMMENTS PERCENT
  1      1777   65.0%
  2       417   15.3%
  3       103    3.8%
  4       156    5.7%
  5       119    4.4%
  6        51    1.9%
  7        37    1.4%
  8        48    1.8%
  9        26    1.0%


    Logistic Regression

In [15]:
df.head(1)

Unnamed: 0,nthsa_id,Year,make,model,description,crashed,abs,mileage,topic,T1,T2,T3,T4,T5,T6,T7,T8,T9
0,10022578,2003,SATURN,ION,WHILE TRAVELING ON THE HIGHWAY AND WITHOUT PRIOR WARNING SEAT BELT RETRACTOR FELL APART. *AK THE BOLT THAT CONNECTS THE WEBBING TO THE FLOOR WAS NOT FULLY SCREWED IN AT THE PLANT. THE BOLT BACKED OUT AND THE LOWER PORTION OF THE SEATBELT WEBBING BECAME UNATTACHED. THIS IS NOT A BUCKLE ISSUE OR A RETRACTOR ISSUE. MANUFACTURING DEFECT FROM THE PLANT BECAUSE THE BOLT WAS NOT FULLY TORQUED. DEALER FIXED BY TIGHTENING THE BOLT. CW,N,N,,0,2.762473,1.051344,0.099453,1.183402,0.748171,0.751461,-0.820782,0.096164,-1.021086


Attribute Map for Preprocessing Data:

The following attribute map describes the data features.
The attribute 'crashed' is the target.
The attribute 'description' contains the text (the driver's complaint).

In [0]:
attribute_map = {
    'nthsa_id':['Z',(0,1e+12),[0,0]],
    'Year':['N',(2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011),[0,0]],
    'make':['N',('CHEVROLET', 'PONTIAC', 'SATURN'),[0,0]],
    'model':['N',('COBALT', 'G5', 'HHR', 'ION', 'SKY', 'SOLSTICE'),[0,0]],
    'description':['Z',(''),[0,0]],
    'crashed':['B',('N', 'Y'),[0,0]],
    'abs':['B',('N', 'Y'),[0,0]],
    'mileage':['I',(1,200000),[0,0]],
    'topic':['N',(0,1,2,3,4,5,6,7,8),[0,0]],
    'T1':['I',(-1e+8,1e+8),[0,0]],
    'T2':['I',(-1e+8,1e+8),[0,0]],
    'T3':['I',(-1e+8,1e+8),[0,0]],
    'T4':['I',(-1e+8,1e+8),[0,0]],
    'T5':['I',(-1e+8,1e+8),[0,0]],
    'T6':['I',(-1e+8,1e+8),[0,0]],
    'T7':['I',(-1e+8,1e+8),[0,0]],
    'T8':['I',(-1e+8,1e+8),[0,0]],
    'T9':['I',(-1e+8,1e+8),[0,0]]
}

The first entry for each attribute gives its type: 'I' for interval, 'B' for binary, 'N' for nominal, and 'Z' for attributes excluded from the model.
The topic attribute is the text topic cluster number. The attributes T1-T9 are each document's scores on the nine topic clusters.
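
To make the map concrete, here is a minimal sketch of how one entry could be used to count missing values and outliers for an interval attribute. The helper count_missing_and_outliers is hypothetical and is not the AdvancedAnalytics implementation. Treating the tuple as inclusive (low, high) bounds is an assumption, and the sample values are made up.

```python
import math

# Hypothetical helper: NOT part of AdvancedAnalytics, for illustration only
def count_missing_and_outliers(values, spec):
    """Count missing and out-of-range values for an interval ('I') attribute.

    Assumes spec = [type_code, (low, high), ...] with inclusive bounds.
    """
    type_code, (low, high) = spec[0], spec[1]
    if type_code != 'I':
        raise ValueError("this sketch only handles interval attributes")
    missing = 0
    outliers = 0
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            missing += 1
        elif not (low <= v <= high):
            outliers += 1
    return missing, outliers

# Same layout as the 'mileage' entry in attribute_map
mileage_spec = ['I', (1, 200000), [0, 0]]

# Made-up mileage values: two missing, two outside the allowed range
sample_mileage = [500.0, 10600.0, float('nan'), None, 0.0, 250000.0]

print(count_missing_and_outliers(sample_mileage, mileage_spec))
```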

In [17]:
target = 'crashed'
##Drop data with missing values for target
drops = []
for i in range(df.shape[0]):
    if pd.isnull(df['crashed'][i]):
        drops.append(i)
df = df.drop(drops)
df = df.reset_index()

encoding = 'one-hot'
scale = None ##Interval Scaling
rie = ReplaceImputeEncode(data_map=attribute_map, nominal_encoding=encoding,
                         interval_scale=scale, drop = True, display = True)
encoded_df = rie.fit_transform(df)

varlist = [target, 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
X = encoded_df.drop(varlist, axis = 1)
y = encoded_df[target]
np_y = np.ravel(y) #Convert dataframe to flat array
col = rie.col
for i in range(len(varlist)):
    col.remove(varlist[i])

lr = LogisticRegression(C=1e+16, tol=1e-16)
lr = lr.fit(X, np_y)

logreg.display_coef(lr, X.shape[1], 2, col)
logreg.display_binary_metrics(lr, X, y)



********** Data Preprocessing ***********
Features Dictionary Contains:
10 Interval, 
2 Binary, 
4 Nominal, and 
3 Excluded Attribute(s).

Data contains 2734 observations & 19 columns.


Attribute Counts
.................. Missing  Outliers
nthsa_id.....         0         0
Year.........         0         0
make.........         0         0
model........         0         0
description..         0         0
crashed......         0         0
abs..........        18         0
mileage......       419        15
topic........         0         0
T1...........         0         0
T2...........         0         0
T3...........         0         0
T4...........         0         0
T5...........         0         0
T6...........         0         0
T7...........         0         0
T8...........         0         0
T9...........         0         0

Coefficients:
Intercept..        -1.2817
mileage....        -0.0000
abs........         0.4754
Year2003...        -0.4768
Year2004...        -0.4



    Decision Tree

In [18]:
scale = None
rie = ReplaceImputeEncode(data_map=attribute_map, nominal_encoding=encoding,
                         interval_scale=scale, drop=False, display=True)
encoded_df = rie.fit_transform(df)
varlist = [target, 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
X = encoded_df.drop(varlist, axis = 1)
y = encoded_df[target]
np_y = np.ravel(y)
col = rie.col
for i in range(len(varlist)):
    col.remove(varlist[i])

dtc = DecisionTreeClassifier(max_depth=7, min_samples_split=5,
                            min_samples_leaf=5)
dtc = dtc.fit(X, np_y)
DecisionTree.display_importance(dtc, col, plot=False)
DecisionTree.display_binary_metrics(dtc, X, y)


********** Data Preprocessing ***********
Features Dictionary Contains:
10 Interval, 
2 Binary, 
4 Nominal, and 
3 Excluded Attribute(s).

Data contains 2734 observations & 19 columns.


Attribute Counts
.................. Missing  Outliers
nthsa_id.....         0         0
Year.........         0         0
make.........         0         0
model........         0         0
description..         0         0
crashed......         0         0
abs..........        18         0
mileage......       419        15
topic........         0         0
T1...........         0         0
T2...........         0         0
T3...........         0         0
T4...........         0         0
T5...........         0         0
T6...........         0         0
T7...........         0         0
T8...........         0         0
T9...........         0         0

FEATURE.... IMPORTANCE
topic3.....   0.4870
mileage....   0.1513
make2:SATU.   0.0896
model2:HHR.   0.0408
model5:SOLS   0.0367
topic1.....   0.0