In [1]:
import pandas as pd
import numpy as np

__Data__

In [4]:
#the data with the clusters obtained by hierarchical clustering
df_hier_clust=pd.read_csv("data/df_hier_clust.csv")

__Clusters__

Using hierarchical clustering with a dissimilarity measure as the metric, I identified 18 clusters of uneven sizes at a relatively low dissimilarity threshold (1, the maximum being 4).

Three of them account for 74% of the cases, which is why the rest of this notebook focuses on them.
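As a minimal sketch of how such a coverage figure can be checked, the snippet below computes the share of rows in each cluster and the combined share of the three largest. The `toy` frame and its counts are made up for illustration and only reuse the `clusters_max_dist_1` column name; they are not the real data.

```python
import pandas as pd

# Toy stand-in for df_hier_clust (hypothetical counts): only the cluster id column matters here.
toy = pd.DataFrame(
    {"clusters_max_dist_1": [11] * 8 + [12] * 4 + [17] * 3 + [3] * 2 + [5] * 2 + [7] * 1}
)

# Share of rows falling in each cluster, largest cluster first.
shares = toy["clusters_max_dist_1"].value_counts(normalize=True)

# Combined share of the three largest clusters.
top3_share = shares.head(3).sum()

print(shares.head(3))
print("top 3 clusters cover {:.0%} of the rows".format(top3_share))
```

On the real data, the same two lines applied to `df_hier_clust` should give the per-cluster shares quoted in the list below.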

In [21]:
# clusters_max_dist_1 holds the id of each of the 18 clusters
groups = df_hier_clust.groupby("clusters_max_dist_1")

For reference, here are my clusters:

In [22]:
for cluster in df_hier_clust["clusters_max_dist_1"].value_counts().index:
    print "cluster {}".format(cluster)
    print groups.get_group(cluster)[["flee", "threat_level", "weapon_groups"]].describe(), "\n"

cluster 11
               flee threat_level weapon_groups
count           860          860           860
unique            1            1             9
top     Not fleeing       attack           gun
freq            860          860           629 

cluster 12
               flee threat_level weapon_groups
count           378          378           378
unique            1            1             9
top     Not fleeing        other         knife
freq            378          378           159 

cluster 17
       flee threat_level weapon_groups
count   270          274           274
unique    3            1             1
top     Car       attack           gun
freq    122          274           274 



- 1) The first group (cluster 11 above, C11) consists of cases where no chase was involved at any point and the people shot were recorded as showing the highest threat level, "attack". __It accounts for at least 40% of the dataset__. Within this cluster, 73% had guns and 10% had knives.


- 2) The second group (C12) consists of cases where no chase was involved at any point and the people shot did not show the highest threat level. __It accounts for at least 20% of the dataset__. Within this cluster, 42% had knives, 19% had guns, 11% had other non-blunt weapons, and 9% were unarmed.


- 3) The third group (C17) consists of cases where the people shot did show the highest threat level, and all had guns. __It accounts for at least 14% of the dataset__. Within this cluster, equal proportions of 45% each were trying to flee by car or on foot, and the remaining cases (10%) have missing values for that variable. There is no case in this cluster where it was confirmed that the person did not try to flee from police at any point.
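The within-cluster percentages above can be derived along the following lines. This is only a sketch on a made-up `toy` frame that reuses the notebook's column names; `within_cluster_shares` is a hypothetical helper, not something defined elsewhere in this notebook. Since the `count` row of `describe()` leaves out missing values, the helper passes `dropna=False` so that the share of missing entries shows up explicitly.

```python
import pandas as pd

# Toy data with the notebook's column names; the values are invented for illustration.
toy = pd.DataFrame({
    "clusters_max_dist_1": [11] * 10 + [17] * 10,
    "weapon_groups": ["gun"] * 7 + ["knife"] * 1 + ["other"] * 2 + ["gun"] * 10,
    "flee": ["Not fleeing"] * 10 + ["Car"] * 4 + ["Foot"] * 4 + [None] * 2,
})


def within_cluster_shares(df, cluster_id, column):
    """Share of each category of `column` inside one cluster, missing values included."""
    members = df.loc[df["clusters_max_dist_1"] == cluster_id, column]
    return members.value_counts(normalize=True, dropna=False)


weapon_shares_c11 = within_cluster_shares(toy, 11, "weapon_groups")
flee_shares_c17 = within_cluster_shares(toy, 17, "flee")

print(weapon_shares_c11)
print(flee_shares_c17)
```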

__Natural Language Processing Examination of the descriptions (Word Count)__

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [33]:
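# Helper definitions, so the analysis can be repeated on different sets of rows

# Words to drop because they carry no information: generic police terms and people's names
one_ngram_drop_list = ["police","officers","officer","sheriff", "shot", "deputy", "deputies", "county"] # generic single-word (1-gram) terms
drop_list = []
for x in list(df_hier_clust.name.unique()):
    # Note: name parts keep their capitalisation, while the vectorizer below lowercases
    # every token, so a name is only dropped if it is already lowercase here
    splitted_names = x.split(" ")
    drop_list.extend(splitted_names)
drop_list.extend(one_ngram_drop_list)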

# Return the 10 words with the highest mean Tf-Idf score for the given descriptions
def tfidf_top_10(description_entries, ngram_range_=(1, 1)):
    
    tvec = TfidfVectorizer(stop_words='english', lowercase=True, ngram_range=ngram_range_, binary=True)
    
    tvec.fit(df_hier_clust.description)
    
    tvec_df = pd.DataFrame(tvec.transform(description_entries).todense(), columns=tvec.get_feature_names())
    
    # /!\ Most values are zero, so taking the mean pushes out the words that occur
    # less often in the cluster. As a result, the results barely differ from those
    # of a binary CountVectorizer. Tf-Idf was kept for the slight improvement it provides.
    mean_tvec_df = tvec_df.mean()

    # Completely exclude words that carry no information. errors='ignore' is needed
    # because many of the names in drop_list are not actual labels in the index.
    result = mean_tvec_df.drop(drop_list, errors='ignore')

    return result.sort_values(ascending=False)[:10]  # top 10 words by mean Tf-Idf
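To illustrate the remark about the mean of mostly-zero scores, here is a small sketch comparing the ranking given by the mean binary Tf-Idf with the one given by a binary `CountVectorizer` (that is, the share of documents containing each word). The corpus, the choice of `cluster_docs`, and the helpers `feature_names` and `mean_scores` are all assumptions made up for this example; it shows the mechanism on toy sentences and says nothing about the real descriptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus standing in for the description column.
corpus = [
    "man pointed gun at officers",
    "man with knife refused to drop knife",
    "suspect fled in car after chase",
    "man pointed gun during domestic call",
]
cluster_docs = [corpus[0], corpus[3]]  # pretend these two rows form one cluster


def feature_names(vectorizer):
    # Newer scikit-learn exposes get_feature_names_out; older releases only have get_feature_names.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


def mean_scores(vectorizer, fit_docs, docs):
    """Fit on the whole corpus, then average each word's score over the cluster's documents."""
    vectorizer.fit(fit_docs)
    matrix = vectorizer.transform(docs).toarray()
    scores = pd.Series(matrix.mean(axis=0), index=feature_names(vectorizer))
    return scores.sort_values(ascending=False)


tfidf_rank = mean_scores(TfidfVectorizer(stop_words="english", binary=True), corpus, cluster_docs)
count_rank = mean_scores(CountVectorizer(stop_words="english", binary=True), corpus, cluster_docs)

print(tfidf_rank.head(3))
print(count_rank.head(3))
```

In this toy case both rankings put the same three words on top: a word absent from a document contributes zero to either mean, so words shared by most of the cluster's documents dominate regardless of the idf weighting.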
    
     


In [34]:
print "Top 10 TfIdfVectorizer \n"

table = pd.DataFrame()

for cluster in df_hier_clust["clusters_max_dist_1"].value_counts().head(3).index:
    
    description_text = groups.get_group(cluster).description
    
    top_10 = tfidf_top_10(description_text)  # computed once per cluster, reused for both columns
    table["C{} - word".format(cluster)] = top_10.index.values
    table["C{} - freq".format(cluster)] = top_10.values
    
print table

Top 10 TfIdfVectorizer 

   C11 - word  C11 - freq   C12 - word  C12 - freq C17 - word  C17 - freq
0         gun    0.044486        knife    0.062814      chase    0.054117
1     pointed    0.036381         said    0.051612       fled    0.047907
2        home    0.030952         drop    0.034495        gun    0.042234
3      called    0.026533      refused    0.032105        car    0.041566
4        said    0.025789          man    0.030544        led    0.038218
5         man    0.024950       called    0.029567    vehicle    0.037920
6  responding    0.023268  disturbance    0.028110       foot    0.036378
7      report    0.022886      holding    0.028067    gunfire    0.035066
8    domestic    0.021702    responded    0.027853  exchanged    0.033604
9   responded    0.021649       report    0.027753       stop    0.032277


This basic NLP examination of the descriptions in each cluster is, for C12 and C17, clearly consistent with the cluster profiles above. The C11 results do not contradict its profile, but they mainly seem to point to further circumstances (for example "home", "called", "domestic").