# Homework 4: Modeling Text Data

### Team Member 1:
* UNI:  WL2522
* Name: Wilson Lui

### Team Member 2 [optional]:
* UNI:  
* Name:


You can find the data here: https://data.boston.gov/dataset/vision-zero-entry

# Task1 - Data Cleaning  [10 points]

Load the data, visualize the class distribution. Clean up the target labels. Some categories have been arbitrarily split and need to be consolidated. 

First, rows with no comments are removed, along with duplicate rows that have the same combination of request type and comment. Rows whose comments are 12 characters or fewer are also removed, since comments that short are rarely informative.


Then, embedded image tags are removed from the categories.
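
As a minimal sketch of this step, a regular expression can strip the tags from a label. The pattern below assumes the tags look like HTML `<img ...>` elements, possibly followed by an `&nbsp;` entity; the sample label and the name `strip_image_tags` are made up for illustration and the actual labels in the dataset may differ.

```python
import re

# Assumption: embedded tags are HTML <img ...> elements, optionally followed
# by a non-breaking-space entity. Adjust the pattern to the real labels.
TAG_PATTERN = re.compile(r"<img[^>]*>|&nbsp;", flags=re.IGNORECASE)


def strip_image_tags(label):
    """Remove <img> tags and collapse any leftover whitespace in a label."""
    without_tags = TAG_PATTERN.sub(" ", label)
    return " ".join(without_tags.split())


# Hypothetical raw labels: one with an embedded tag, one already clean
raw_labels = [
    '<img src="icons/speed.png" width="20">&nbsp;people speed',
    "people speed",
]
cleaned = [strip_image_tags(label) for label in raw_labels]
print(cleaned)
```

Applied to a pandas column, the same function could be passed to `Series.map`, which avoids relying on the sorted position of the tagged labels.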

With the embedded image tags removed, the following pairs of categories are consolidated because each pair expresses the same meaning in different wording:


1. "bike facilities don't exist or need improvement" / "there are no bike facilities or they need maintenance"

2. "sidewalks/ramps don't exist or need improvement" / "there are no sidewalks or they need maintenance"

3. "the wait for the "Walk" signal is too long" / "people have to wait too long for the "Walk" signal"

4. "the roadway surface needs improvement" / "the roadway surface needs maintenance"

5. "it’s hard to see / low visibility" / "it’s hard for people to see each other"

6. "it's too far / too many lanes to cross" / "people have to cross too many lanes / too far"

7. "people are not given enough time to cross the street" / "there's not enough time to cross the street"

The following categories are very similar to each other, but not quite the same, and thus were not combined during this preprocessing step:


>"people don't yield while going straight" / "people don't yield while turning" /  "people run red lights / stop signs"


Whenever categories are consolidated, I merge the category with fewer data points into the one with more data points.
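
The merge rule can be sketched on a plain list of labels as follows. The helper name `merge_smaller_into_larger` and the toy labels are illustrative only; ties keep the first label, which is an arbitrary choice made for this sketch.

```python
from collections import Counter


def merge_smaller_into_larger(labels, label_a, label_b):
    """Relabel the rarer of two categories as the more common one."""
    counts = Counter(labels)
    # On a tie, label_a is kept (an arbitrary choice for this sketch)
    if counts[label_a] >= counts[label_b]:
        keep, drop = label_a, label_b
    else:
        keep, drop = label_b, label_a
    return [keep if label == drop else label for label in labels]


# Toy example: the "maintenance" wording is rarer, so it is folded in
labels = [
    "the roadway surface needs improvement",
    "the roadway surface needs improvement",
    "the roadway surface needs maintenance",
    "people speed",
]
merged = merge_smaller_into_larger(
    labels,
    "the roadway surface needs improvement",
    "the roadway surface needs maintenance",
)
print(Counter(merged))
```

Looking the two labels up by name, rather than by their position in a sorted list of counts, keeps the merge correct even if the set of categories changes.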


In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
import re

from scipy.sparse import hstack
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report, f1_score, adjusted_rand_score
from sklearn.preprocessing import Normalizer

from sklearn.pipeline import make_pipeline, make_union

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation

data = pd.read_csv('Vision_Zero_Entry.csv')


In [None]:
#Visualize the distribution of the categories


print(data['REQUESTTYPE'].unique())
plt.barh(range(len(data['REQUESTTYPE'].unique())), list(Counter(data['REQUESTTYPE']).values()), align='center')
plt.yticks(range(len(data['REQUESTTYPE'].unique())), list(Counter(data['REQUESTTYPE']).keys()))


In [None]:
#Remove rows with no comments
#Remove duplicate rows that have the same category and comment
#Remove rows with comments that are 12 characters or shorter
#Create a list of categories
#Consolidate the categories that have embedded image tags into corresponding categories


data = data[data['COMMENTS'].notnull()]
data = data.drop_duplicates(subset=['REQUESTTYPE', 'COMMENTS'])
data = data[data['COMMENTS'].str.len() > 12]


print(len(data['REQUESTTYPE'].unique()))
categories = data['REQUESTTYPE'].unique()
categories = categories.tolist()
categories.sort()

for a in range(len(categories)): print(categories[a])
    
#categories[0:7] are the labels with embedded image tags, which sort first
data['REQUESTTYPE'].replace(to_replace=categories[0:7], value=[
    "there's not enough time to cross the street",
    'the wait for the "Walk" signal is too long',
    'people speed',
    'it’s hard to see / low visibility',
    "sidewalks/ramps don't exist or need improvement",
    "the roadway surface needs improvement",
    "of something that is not listed here"], inplace=True)

print(data['REQUESTTYPE'].unique())
print(len(data['REQUESTTYPE'].unique()))


In [None]:
#Visualize the distribution of the categories after the first round of category consolidation


print(data['REQUESTTYPE'].unique())
plt.barh(range(len(data['REQUESTTYPE'].unique())), list(Counter(data['REQUESTTYPE']).values()), align='center')
plt.yticks(range(len(data['REQUESTTYPE'].unique())), list(Counter(data['REQUESTTYPE']).keys()))


In [None]:
#Count how many occurrences of each category occur in the dataset
#Create a list with tuples containing each category and its number of occurrences


count = Counter(data['REQUESTTYPE'])
count = count.items()
count = list(count)
count.sort()
print(count)

print(count[14][0], count[19][0])
print(count[0][0], count[18][0])
print(count[11][0], count[17][0])
print(count[15][0], count[16][0])
print(count[2][0], count[3][0])
print(count[1][0], count[10][0])
print(count[5][0], count[20][0])


In [None]:
#For each category pair, check which category appears less often
#Replace that category with the other category


#bike facilities don't exist or need improvement / there are no bike facilities or they need maintenance


if count[0][1] > count[18][1]: 
    data['REQUESTTYPE'].replace(to_replace=count[18][0], value=count[0][0], inplace=True)

elif count[18][1] > count[0][1]:
    data['REQUESTTYPE'].replace(to_replace=count[0][0], value=count[18][0], inplace=True)

    
#sidewalks/ramps don't exist or need improvement / there are no sidewalks or they need maintenance


if count[14][1] > count[19][1]:
    data['REQUESTTYPE'].replace(to_replace=count[19][0], value=count[14][0], inplace=True)
    
elif count[19][1] > count[14][1]:
    data['REQUESTTYPE'].replace(to_replace=count[14][0], value=count[19][0], inplace=True)
    
    
#people have to wait too long for the "Walk" signal / the wait for the "Walk" signal is too long


if count[11][1] > count[17][1]:
    data['REQUESTTYPE'].replace(to_replace=count[17][0], value=count[11][0], inplace=True)
    
elif count[17][1] > count[11][1]:
    data['REQUESTTYPE'].replace(to_replace=count[11][0], value=count[17][0], inplace=True)
    

#the roadway surface needs improvement / the roadway surface needs maintenance


if count[15][1] > count[16][1]:
    data['REQUESTTYPE'].replace(to_replace=count[16][0], value=count[15][0], inplace=True)
    
elif count[16][1] > count[15][1]:
    data['REQUESTTYPE'].replace(to_replace=count[15][0], value=count[16][0], inplace=True)
    
    
#it’s hard to see / low visibility / it’s hard for people to see each other


if count[2][1] > count[3][1]:
    data['REQUESTTYPE'].replace(to_replace=count[3][0], value=count[2][0], inplace=True)
    
elif count[3][1] > count[2][1]:
    data['REQUESTTYPE'].replace(to_replace=count[2][0], value=count[3][0], inplace=True)
    
    
#it's too far / too many lanes to cross / people have to cross too many lanes / too far


if count[1][1] > count[10][1]:
    data['REQUESTTYPE'].replace(to_replace=count[10][0], value=count[1][0], inplace=True)
    
elif count[10][1] > count[1][1]:
    data['REQUESTTYPE'].replace(to_replace=count[1][0], value=count[10][0], inplace=True)

    
#people are not given enough time to cross the street / there's not enough time to cross the street


if count[5][1] > count[20][1]:
    data['REQUESTTYPE'].replace(to_replace=count[20][0], value=count[5][0], inplace=True)
    
elif count[20][1] > count[5][1]:
    data['REQUESTTYPE'].replace(to_replace=count[5][0], value=count[20][0], inplace=True)

print(Counter(data['REQUESTTYPE']))
print(len(data['REQUESTTYPE'].unique()))


In [None]:
#Make a copy of the dataset
#Separate the features from the response variable

consolidated_data = data.copy()

target = data['REQUESTTYPE']
comments = data['COMMENTS']


In [None]:
#Visualize the distribution of the consolidated categories


print(target.unique())
plt.barh(range(len(target.unique())), list(Counter(target).values()), align='center')
plt.yticks(range(len(target.unique())), list(Counter(target).keys()))


# Task2 - Model 1 [10 points]

Run a baseline multi-class classification model using a bag-of-word approach, report macro f1-score (should be above .5) and visualize the confusion matrix. Can you interpret the mistakes made by the model? 

The original selection of 28 categories has now been consolidated into the following 14:


1. "bike facilities don't exist or need improvement"

2. "of something that is not listed here"

3. "people don't yield while going straight"

4. "it’s hard to see / low visibility"

5. "people don't yield while turning"

6. "the wait for the "Walk" signal is too long"

7. "sidewalks/ramps don't exist or need improvement"

8. "people cross away from the crosswalks"

9. "people double park their vehicles"

10. "people speed"

11. "people run red lights / stop signs"

12. "it's too far / too many lanes to cross"

13. "there's not enough time to cross the street"

14. "the roadway surface needs improvement"
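
Because the 14 classes are unevenly sized, the macro F1 score requested by the task weights every class equally regardless of how many comments it has. The following is a small hand-rolled sketch of the metric, written only to show what it computes (the notebook itself relies on scikit-learn's `f1_macro` scoring); the toy labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: always predicting the majority class
y_true = ["people speed", "people speed", "people speed", "people double park their vehicles"]
y_pred = ["people speed"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case the accuracy is 0.75 while the macro F1 is much lower, because the minority class contributes an F1 of 0 to the average.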
 
 

In [None]:
#Split the dataset into training and test sets
#Vectorize the comments


comments_train, comments_test, target_train, target_test = train_test_split(comments,
                                                                            target, stratify=target,
                                                                           random_state=3)

vect = CountVectorizer()

X_train = vect.fit_transform(comments_train)
X_test = vect.transform(comments_test)

#X_train_scaled = reg.fit_transform(X_train)
#X_test_scaled = reg.transform(X_test)


In [None]:
#Score a baseline logistic regression model using F1-macro score


baseline = np.mean(cross_val_score(LogisticRegression(random_state=3), X_train, target_train,
                                   cv=5, scoring='f1_macro', n_jobs=8))
print('Baseline F1 Macro score:', baseline)

assert baseline > 0.5


In [None]:
#Display the confusion matrix and classification report


lr = LogisticRegression(random_state=3)
lr.fit(X_train, target_train)

np.set_printoptions(linewidth=100)

baseline_preds = cross_val_predict(lr, X_train, target_train, cv=5, n_jobs=8)

print(confusion_matrix(target_train, baseline_preds))
print(classification_report(target_train, baseline_preds))
      

From the confusion matrix and classification report, the model appears to struggle in the following ways:

1. A large number of data points are incorrectly classified as "of something that is not listed here", and many data points that belong to that category are incorrectly classified as other categories. Browsing through the comments in "of something that is not listed here", many seem to belong to categories that already exist. Without having looked at the interface through which these complaints were collected, I would guess that "of something that is not listed here" is the default option and that many people don't bother changing it before submitting their complaints.

2. The similarities between comments in the group "people don't yield while going straight", "people don't yield while turning", and "people run red lights / stop signs" are confusing the model. In the confusion matrix, it appears that a portion of the data points in each category are being mistakenly classified as other categories in this group.

3. Some data points in the "people speed" and "people run red lights / stop signs" categories are being mistakenly categorized as the other category. Similar to #1, many other data points belonging to other categories are being mistakenly categorized as one of these two categories.

4. Some data points in the group "bike facilities don't exist or need improvement", "sidewalks/ramps don't exist or need improvement", and "the roadway surface needs improvement" categories are being mistakenly categorized as other categories in this group. This seems to be due to the fact that complaints about bike facilities include those about poor road conditions or faded lane marking paint in bike lanes, which would overlap with comments about sidewalks and roadway conditions.

5. To a lesser extent than in #1, many data points are being mistakenly classified as "bike facilities don't exist or need improvement", most likely due to complaints about the bike facilities mentioning poor road conditions or danger due to traffic.
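
Confusions like the ones listed above can also be ranked directly from the true and predicted labels instead of being read off the printed matrix. This is a minimal sketch with made-up labels; `most_confused_pairs` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_n=3):
    """Return the most frequent (true label, predicted label) mistakes."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_n)


# Toy labels standing in for target_train and the cross-validated predictions
y_true = ["yield straight", "yield straight", "yield straight", "yield turning", "yield turning"]
y_pred = ["yield turning", "yield turning", "yield straight", "yield turning", "yield straight"]

for (true_label, predicted_label), n in most_confused_pairs(y_true, y_pred):
    print(n, "x", true_label, "->", predicted_label)
```

With the notebook's variables, the call would be `most_confused_pairs(target_train, baseline_preds)`.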


# Task3 - Model 2 [30 points]

Improve the model using more complex text features, including n-grams, character n-grams and possibly domain-specific features.

I began by removing stop words using CountVectorizer, since I saw that it reduced the number of features while achieving the same accuracy. I then performed grid searches using the following models to see if I could get better performance:


1. Multinomial Naive Bayes
2. Logistic Regression
3. Random Forest
4. SVM


I searched over the following parameter grid for CountVectorizer():


1. 'ngram_range': [(1, 1), (1, 2), (1, 5), (1, 7), (2, 3), (2, 5), (3, 8), (5, 5)]
2. 'analyzer': ['word', 'char', 'char_wb'],
3. 'min_df': [1, 2, 3],
4. 'normalizer': [None, Normalizer()]
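
To make the `analyzer` and `ngram_range` options above more concrete, here is a simplified sketch of character n-grams that respect word boundaries, in the spirit of `analyzer='char_wb'`. It pads each word with a space and slides a window over it. This is an approximation for illustration: scikit-learn's own implementation treats words shorter than `n` slightly differently.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken inside space-padded words (simplified)."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # Windows never span two words; too-short words yield nothing
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("No bike", 3, 3))
```

Unlike whole-word features, n-grams such as `"bik"` are shared by "bike", "bikes" and "biking", which is one plausible reason character analyzers did well in the grid search.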

Based on the results, I decided to continue with Naive Bayes and Logistic Regression since they performed the best, with similar cross-validation scores using the optimal parameters.


The optimal CountVectorizer() parameters to use with a Naive Bayes model were:


>{'normalizer': None, 'min_df': 3, 'analyzer': 'char_wb', 'ngram_range': (2, 5)}


The optimal CountVectorizer() parameters to use with a Logistic Regression model were:


>{'normalizer': None, 'ngram_range': (3, 8), 'min_df': 2, 'analyzer': 'char'}


Next, I applied tf-idf rescaling with each of these two models and found that it greatly reduced the F1 macro score compared with the raw counts from CountVectorizer().
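
For reference, the rescaling itself can be sketched in a few lines. The version below assumes whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the defaults of `TfidfVectorizer`; the two documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(["cars speed here", "cars run red lights"])
print(rows[0])
```

Terms that appear in every document ("cars") are down-weighted relative to rarer ones ("speed"), and each row is rescaled to unit length, so the absolute counts are no longer available to the classifier.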


I also tried adding a feature indicating the length of the original comment string and found that it either improved or worsened the model by a negligible amount. Therefore I decided not to include that feature in my model.


Finally, I incorporated a lemmatization function from the spaCy package to use as a custom tokenizer for CountVectorizer(). This change slightly increased my F1 macro score.



In [None]:
#Remove stop words when vectorizing the dataset


stop = CountVectorizer(stop_words='english')

X_train = stop.fit_transform(comments_train)
X_test = stop.transform(comments_test)

lr.fit(X_train, target_train)
stop_score = cross_val_score(lr, X_train, target_train, cv=5, scoring='f1_macro', n_jobs=8)

print('Stop Word Removal Scores:', stop_score)
print('Stop Word Removal Mean Score:', np.mean(stop_score))


Best score achieved through GridSearchCV with a Multinomial Naive Bayes model:
    

>0.525103133804


Best parameters:


>{'normalizer': None, 'min_df': 3, 'analyzer': 'char_wb', 'ngram_range': (2, 5)}



In [None]:
#Grid search using a naive Bayes model to find the best model parameters


params = {'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 5), (1, 7),
                                (2, 3), (2, 5), (3, 8), (5, 5)],
             'countvectorizer__analyzer': ['word', 'char', 'char_wb'],
           'countvectorizer__min_df': [1, 2, 3],
           'normalizer': [None, Normalizer()]
}

nb_grid = GridSearchCV(make_pipeline(CountVectorizer(stop_words='english'),
                                  Normalizer(), MultinomialNB()) ,
                    param_grid=params, cv=5, scoring='f1_macro', n_jobs=8)

nb_grid.fit(comments_train, target_train)
print(nb_grid.best_score_)
print(nb_grid.best_params_)


This section has been commented out to prevent Travis-CI from timing out
------------------------------------------------------------------------

Best score achieved through GridSearchCV with a logistic regression model:


>0.544554548341


Best parameters:


>{'normalizer': None, 'ngram_range': (3, 8), 'min_df': 2, 'analyzer': 'char', 'C': 0.1}




In [None]:
#Grid search using a logistic regression model to find the best model parameters


#lr_params = {'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 5), (1, 7),
#                                (2, 3), (2, 5), (3, 8), (5, 5)],
#             'countvectorizer__analyzer': ['word', 'char', 'char_wb'],
#           'countvectorizer__min_df': [1, 2, 3],
#           'normalizer': [None, Normalizer()],
#          'logisticregression__C': [100, 10, 1, 0.1, 0.01]
#        }

#lr_grid = GridSearchCV(make_pipeline(CountVectorizer(stop_words='english'),
#                                    Normalizer(), LogisticRegression()), 
#                       param_grid=lr_params, cv=5, scoring='f1_macro', n_jobs=8, verbose=2)

#lr_grid.fit(comments_train, target_train)
#print(lr_grid.best_score_)
#print(lr_grid.best_params_)


This section has been commented out to prevent Travis-CI from timing out
------------------------------------------------------------------------

Best score achieved through GridSearchCV with a random forest model:


>0.505797024798


Best parameters:


>{'normalizer': None, 'n_estimators': 200, 'ngram_range': (5, 5), 'analyzer': 'char', 'min_df': 3}



In [None]:
#Grid search using a random forest model to find the best model parameters


#rf_params = {'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 5), (1, 7),
#                                (2, 3), (2, 5), (3, 8), (5, 5)],
#             'countvectorizer__analyzer': ['word', 'char', 'char_wb'],
#           'countvectorizer__min_df': [1, 2, 3],
#           'normalizer': [None, Normalizer()],
#             'randomforestclassifier__n_estimators': [50, 100, 150, 200],
#        }

#rf_grid = GridSearchCV(make_pipeline(CountVectorizer(stop_words='english'),
#                                    Normalizer(), RandomForestClassifier()),
#                      param_grid=rf_params, cv=5, scoring='f1_macro', n_jobs=8, verbose=3)

#rf_grid.fit(comments_train, target_train)
#print(rf_grid.best_score_)
#print(rf_grid.best_params_)


In [None]:
#Grid search using an SVM model to find the best model parameters


#sv_params = {'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 5), (1, 7),
#                                (2, 3), (2, 5), (3, 8), (5, 5)],
#             'countvectorizer__analyzer': ['word', 'char', 'char_wb'],
#           'countvectorizer__min_df': [1, 2, 3],
#           'normalizer': [None, Normalizer()],
#          'svc__C': [100, 10, 1, 0.1, 0.01]
#        }

#sv_grid = GridSearchCV(make_pipeline(CountVectorizer(stop_words='english'),
#                                     Normalizer(), SVC()),
#                                    param_grid=sv_params, cv=5, scoring='f1_macro', verbose=2)

#sv_grid.fit(comments_train, target_train)
#print(sv_grid.best_score_)
#print(sv_grid.best_params_)


In [None]:
#Compare tf-idf rescaling with CountVectorizer using the best Logistic Regression model parameters
#{'normalizer': None, 'ngram_range': (3, 8), 'min_df': 2, 'analyzer': 'char', 'C': 0.1}

tf_lr_pipe = make_pipeline(TfidfVectorizer(stop_words='english', ngram_range=(3, 8),
                                           min_df=2, analyzer='char'),LogisticRegression(C=0.1))

cv_lr_pipe = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(3, 8), 
                                           min_df=2, analyzer='char'), LogisticRegression(C=0.1))

tf_lr_pipe.fit(comments_train, target_train)
cv_lr_pipe.fit(comments_train, target_train)

tf_score = cross_val_score(tf_lr_pipe, comments_train, target_train, cv=5, scoring='f1_macro', n_jobs=8)
cv_score = cross_val_score(cv_lr_pipe, comments_train, target_train, cv=5, scoring='f1_macro', n_jobs=8)

print('tfidf scores:', tf_score)
print('tfidf mean score', np.mean(tf_score))
print('CountVectorizer scores', cv_score)
print('CountVectorizer mean score', np.mean(cv_score))


In [None]:
#Compare tf-idf rescaling with CountVectorizer using the best Naive Bayes model parameters
#{'normalizer': None, 'min_df': 3, 'analyzer': 'char_wb', 'ngram_range': (2, 5)}


tf_nb_pipe = make_pipeline(TfidfVectorizer(stop_words='english', ngram_range=(2, 5), 
                                           min_df=3, analyzer='char_wb'), MultinomialNB())

cv_nb_pipe = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(2, 5),
                                          min_df=3, analyzer='char_wb'), MultinomialNB())

print('tfidf:',
      cross_val_score(tf_nb_pipe, comments_train, target_train,
                      cv=5, scoring='f1_macro', n_jobs=8))
print('CountVectorizer',
     cross_val_score(cv_nb_pipe, comments_train, target_train,
                     cv=5, scoring='f1_macro', n_jobs=8))


In [None]:
#Add feature that indicates the length of the original comment string
#Evaluate this model using Naive Bayes


nb_cv = CountVectorizer(stop_words='english', ngram_range=(2, 5),
                       min_df=3, analyzer='char_wb')

train_len = comments_train.str.len()
#Reshape to a single column without hard-coding the number of training rows
train_len = train_len.values.reshape(-1, 1)


comments_len = nb_cv.fit_transform(comments_train)
comments_len = hstack((comments_len, train_len))

nb_score = cross_val_score(MultinomialNB(), comments_len, target_train, cv=5,
                           scoring='f1_macro', n_jobs=8)

print('nb scores:', nb_score)
print('nb mean score:', np.mean(nb_score))


In [None]:
#Evaluate this model using Logistic Regression


lr_cv = CountVectorizer(stop_words='english', ngram_range=(3, 8),
                       min_df=2, analyzer='char')

comments_len = lr_cv.fit_transform(comments_train)
comments_len = hstack((comments_len, train_len))

lr_score = cross_val_score(LogisticRegression(C=0.1), comments_len, target_train, cv=5,
                           scoring='f1_macro', n_jobs=8)

print('LR scores:', lr_score)      
print('LR mean score:', np.mean(lr_score))     


In [None]:
#Implement a custom tokenizer that uses spaCy to perform lemmatization on the comments first


regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en_default')

old_tokenizer = en_nlp.tokenizer

en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(
regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document, entity=False, parse=False)
    
    return [token.lemma_ for token in doc_spacy]


In [None]:
#Evaluate this model using Logistic Regression


lr_lemma_pipe = make_pipeline(CountVectorizer(tokenizer=custom_tokenizer, stop_words='english',
                            ngram_range=(3, 8), min_df=2, analyzer='char'), LogisticRegression(C=0.1))

lr_lemma_pipe.fit(comments_train, target_train)
lr_lemma_score = cross_val_score(lr_lemma_pipe, comments_train, target_train, cv=5,
                                scoring='f1_macro')

print('LR Lemmatization Scores:', lr_lemma_score)
print('LR Lemmatization Mean Scores:', np.mean(lr_lemma_score))


In [None]:
#Evaluate this model using Naive Bayes


nb_lemma_pipe = make_pipeline(CountVectorizer(tokenizer=custom_tokenizer, stop_words='english',
                                              ngram_range=(2, 5), min_df=3, analyzer='char_wb'),
                             MultinomialNB())

nb_lemma_pipe.fit(comments_train, target_train)
nb_lemma_score = cross_val_score(nb_lemma_pipe, comments_train, target_train, cv=5,
                                scoring='f1_macro')

print('NB Lemmatization Scores:', nb_lemma_score)
print('NB Lemmatization Mean Scores:', np.mean(nb_lemma_score))


# Task4 - Visualize Results [10 points]

Visualize results of the tuned model (classification results, confusion matrix, important features, example mistakes).

Since the Logistic Regression model consistently performed better than the Naive Bayes model, I decided to use that along with lemmatization going forward.


Though the accuracy is slightly improved, this model is still making classification mistakes that are similar in nature to the ones made by the baseline model.



In [None]:
#Score the tuned model on the training set
#Print the confusion matrix and classification report


preds = cross_val_predict(lr_lemma_pipe, comments_train, target_train, cv=5)


print(confusion_matrix(target_train, preds))
print(classification_report(target_train, preds))


In [None]:
#Plot the most important character n-grams for each category


feature_names = lr_lemma_pipe.named_steps['countvectorizer'].get_feature_names()

def plot_important_features(coef, feature_names, top_n=20, ax=None):
    if ax is None:
        ax = plt.gca()
    
    inds = np.argsort(coef)
    low = inds[:top_n]
    high = inds[-top_n:]
    important = np.hstack([low, high])
    myrange = range(len(important))

    ax.bar(myrange, coef[important])
    ax.set_xticks(myrange)
    ax.set_xticklabels(feature_names[important], rotation=90, ha='right')

n_classes = len(lr_lemma_pipe.classes_)

fig, axes = plt.subplots(n_classes, figsize=(10, 25))

for ax, coef, label in zip(axes.ravel(),
                           lr_lemma_pipe.named_steps['logisticregression'].coef_,
                          lr_lemma_pipe.classes_):
    #print(ax, coef, label)


    ax.set_title(label)
    plot_important_features(coef, np.array(feature_names), top_n=20, ax=ax)
    
plt.tight_layout()
    

In [None]:
#Filter out the misclassified comments and show some examples


misclassified = np.stack((np.array(comments_train), np.array(target_train), preds), axis=-1)
misclassified = misclassified[misclassified[:, 1] != misclassified[:, 2]]


for complaint in range(0, 2000, 400):
    print('\n', misclassified[complaint, 0], '\n',
          '\n', 'true category:', misclassified[complaint, 1],
          '\n', 'misclassified as:', misclassified[complaint, 2])


# Task5 - Clustering [10 points]

Apply LDA, NMF and K-Means to the whole dataset. Can you find clusters or topics that match well with some of the ground truth labels? Use ARI to compare the methods and visualize topics and clusters.
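
Since the comparison below relies on the adjusted Rand index (ARI), here is a small sketch of how it is computed from two labelings, using the standard pair-counting formula. It is meant only to show what `adjusted_rand_score` measures; the function name and toy labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with swapped cluster ids: perfect agreement
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Grouping that cuts across the true labels: worse than chance
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The score ignores the cluster ids themselves, which is why it can compare K-Means or topic assignments against the REQUESTTYPE labels without any matching step.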

In [None]:
#Perform a grid search to find the most sensible number of clusters for K-Means clustering
#Calculate the ARI for each iteration
#Print the first 5 comments in each cluster



cv = CountVectorizer(min_df=2, ngram_range=(1, 4), analyzer='word', stop_words='english')
vec_comments = cv.fit_transform(comments)
feature_names = cv.get_feature_names()

km_ari = np.zeros((20,))

for n_cluster in range(20):
    km = KMeans(n_clusters=n_cluster+3, n_jobs=8, random_state=3)
    vec_comments = cv.fit_transform(comments)
    km.fit(vec_comments)
    km_preds = km.predict(vec_comments)
    
    km_ari[n_cluster] = adjusted_rand_score(target, km_preds)
    
    print('ARI:', km_ari[n_cluster])
    print(n_cluster+3, 'bin count', np.bincount(km_preds))
    km_clusters = np.stack((np.array(comments), km_preds), axis=-1)

    for cluster in range(n_cluster+3):
        for comment in range(5):
            try:
                print(cluster, '\n', km_clusters[km_clusters[:, 1] == cluster][comment, 0], '\n')
            #Clusters with fewer than 5 comments run out of rows to print
            except IndexError: continue

print(km_ari)


I first performed K-Means clustering with 14 clusters on the entire dataset. I noticed that there were 5 clusters that contained only 1 data point each, so I kept reducing the number of clusters until all such clusters were gone. This was achieved when I performed K-Means clustering with only 8 clusters. I interpreted the resulting clusters in the following ways:


1. This cluster seems to be about cars not yielding or pedestrians jaywalking and how such situations could result in accidents.
2. 

In [None]:


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
cv = CountVectorizer(min_df=2, ngram_range=(1, 4), analyzer='word', stop_words='english')
vec_comments = cv.fit_transform(comments)
feature_names = cv.get_feature_names()

lda_ari = np.zeros((20,))

for topics in range(20):
    lda = LatentDirichletAllocation(n_topics=topics+3, learning_method='batch', n_jobs=8, random_state=3)
    lda.fit(vec_comments)
    lda_clusters = (np.argmax(lda.transform(vec_comments), axis=1))
    ari = adjusted_rand_score(target, lda_clusters)
    lda_ari[topics] = ari
    print('number of topics:', topics+3)
    print_top_words(lda, feature_names, 20)

print(lda_ari)


In [None]:
#6, 7, 9, 10, 12, 15 topics

# Task6 - Model 3 [30 points]

Improve the class definition for REQUESTTYPE by using the results of the clustering and results of the previous classification model. Re-assign labels using either the results of clustering or using keywords that you found during data exploration. The labels must be semantically meaningful.
The data has a large “other” category. Apply the topic modeling and clustering techniques to this subset of the data to find possible splits of this class.
Report accuracy using macro average f1 score (should be above .53) 
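
One possible starting point for the keyword-based relabeling described above is sketched here. The keyword lists are hypothetical placeholders, not keywords found in the data; in practice they would come from the top words of the clusters and topics fitted on the "other" subset.

```python
OTHER = "of something that is not listed here"

# Hypothetical keyword rules: (target label, substrings that trigger it)
KEYWORD_RULES = [
    ("people double park their vehicles", ("double park", "double-park")),
    ("people speed", ("speeding", "too fast")),
]


def relabel_other(comment, label):
    """Move a comment out of the 'other' class if a keyword rule matches."""
    if label != OTHER:
        return label  # only the catch-all class is re-assigned
    text = comment.lower()
    for new_label, keywords in KEYWORD_RULES:
        if any(keyword in text for keyword in keywords):
            return new_label
    return label


examples = [
    ("Cars are always double parked here", OTHER),
    ("We need a bench at this stop", OTHER),
]
for comment, label in examples:
    print(relabel_other(comment, label))
```

Rules are checked in order, so the first matching label wins; comments that match nothing stay in the "other" class.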


In [None]:
# Add your code for task 6 here. You may use multiple cells. 



# Extra Credit [Up to +20 points]

Use a word embedding representation like word2vec for step 3 and or step 6. 
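
A common way to turn word vectors into a fixed-length comment representation is to average them. The sketch below uses a tiny made-up vector table in place of real word2vec embeddings; words missing from the table are skipped.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of the known tokens (zeros if none are known)."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {
    "cars": [1.0, 0.0],
    "speed": [0.0, 1.0],
}

doc_vector = average_embedding("cars speed downhill".split(), toy_vectors, dim=2)
print(doc_vector)
```

The resulting dense vectors could be fed to the same classifiers used in Task 3 in place of the CountVectorizer output.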

In [None]:
# Add your code for extra credit here. You may use multiple cells. 

