In this notebook, we perform exploratory text analysis on the notes in Durham's calls-for-service (CFS) data.  Some directions to explore:

 - Word frequencies for an initial look
 - Figure out the pattern (if any) behind the dispatcher scripts, including priority, chief complaint, questions/answers, and subject/vehicle descriptions

Some observations from reading remarks:
 - "x ft from STREET NAME" indicates directed patrol at that location
 - there is a structure for transports that includes the distance traveled, but officers don't appear to fill it out consistently
 - 10-codes appear as "10-72" or just "72"
 - CAD-generated messages are prefixed by a bracketed descriptor -- usually "[EPD]".  Messages left by the units are preceded by (UNIT NUMBER).
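
The observations above suggest that note lines can be roughly sorted by their prefix. Below is a minimal sketch of that idea, assuming each note line is a plain string. The pattern names and the `classify_note` helper are hypothetical and only encode the conventions listed above; bare 10-codes such as "72" are deliberately not matched because they are ambiguous with other numbers.

```python
import re

# Hypothetical patterns inferred from the observations above; adjust to the real data.
CAD_PREFIX = re.compile(r"^\[(?P<source>[^\]]+)\]\s*")   # e.g. "[EPD] ..."
UNIT_PREFIX = re.compile(r"^\((?P<unit>[^)]+)\)\s*")     # e.g. "(C411) ..."
DIRECTED_PATROL = re.compile(r"^(?P<feet>\d+(?:\.\d+)?) ft from (?P<street>.+)$")
TEN_CODE = re.compile(r"\b10-(\d{1,2})\b")               # explicit "10-72" form only


def classify_note(line):
    """Return (kind, tag, remainder) for a single note line."""
    match = CAD_PREFIX.match(line)
    if match:
        return "cad", match.group("source"), line[match.end():]
    match = UNIT_PREFIX.match(line)
    if match:
        return "unit", match.group("unit"), line[match.end():]
    match = DIRECTED_PATROL.match(line)
    if match:
        return "directed_patrol", match.group("street"), match.group("feet")
    return "other", None, line
```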

To look at word frequencies, we have to combine all the words into a single corpus.

In [3]:
import nltk
import dataset
from sqlalchemy import create_engine
import pandas as pd
from IPython.display import display
from collections import defaultdict
import numpy as np

db_uri = "postgresql://jnance:@localhost:5432/cfs"

engine = create_engine(db_uri)
db = dataset.connect(db_uri)



In [2]:
notes = []

for row in db.query("SELECT body FROM note;"):
    if row['body']:
        notes += row['body'].lower().split()
    
txt = nltk.Text(notes)
fd = nltk.FreqDist(txt)



Here we can see the most frequent words in the corpus.  Many of these will probably become domain stopwords. (Note: the NLTK stopwords corpus must be downloaded first, e.g. with `nltk.download('stopwords')`.)

In [12]:
print("========= TOP 100 WORDS =========")
ignore = set(nltk.corpus.stopwords.words('english'))
# Keep the 100 most frequent tokens that are not stopwords
top = [word for word, _ in fd.most_common() if word not in ignore][:100]
for num, word in enumerate(top, start=1):
    print(str(num) + ". " + word)
print("\n")

1. caller
2. [epd]
3. -
4. questions:
5. ***
6. scene.
7. is:
8. incident
9. vehicle
10. response:
11. chief
12. complaint:
13. statement:
14. 1.
15. 2.
16. 3.
17. 4.
18. 5.
19. 6.
20. info
21. /
22. 7.
23. one
24. code:
25. person
26. .
27. description
28. known
29. involved.
30. involves
31. dispatch
32. male
33. 8.
34. suspect
35. call
36. 9.
37. weapons
38. adv
39. 1.suspect:
40. back
41. 10.
42. location
43. race:
44. priority
45. gender:
46. 11.
47. color:
48. age:
49. danger.
50. clothing:
51. alarm
52. party
53. cb
54. happened
55. blk
56. business/resident/owner
57. suspect`s
58. reported
59. known.
60. 12.
61. phone
62. number
63. traffic
64. left
65. suspect/person
66. 2
67. progress.
68. area.
69. involved
70. female
71. victim
72. calling
73. suspicious
74. black
75. pd
76. name
77. responsible
78. n/a
79. make:
80. medical
81. mentioned.
82. area
83. needs
84. door
85. 13.
86. s/he
87. body:
88. [fire]
89. 2nd
90. victim.
91. disturbance
92. see
93. ft
94. disturbance.
95

Many of these words are part of the call scripts transcribed by the CAD system, describing the caller's answers to specific questions asked by the dispatcher.  We'll probably need to add the script words to a stopword list, but we can keep the descriptive words that go along with them.
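
The list above also contains near-duplicates that differ only in trailing punctuation ("known" vs. "known.", "disturbance" vs. "disturbance."). One option, sketched here as an assumption and not as the approach used in this notebook, is to normalize tokens before counting and filtering. The `normalize` and `top_words` helpers are hypothetical names.

```python
from collections import Counter


def normalize(token):
    """Lowercase a token and strip trailing sentence punctuation."""
    return token.lower().rstrip(".,:;")


def top_words(tokens, stopwords, n=100):
    """Count normalized tokens, skipping empty strings and stopwords."""
    counts = Counter(
        tok for tok in map(normalize, tokens) if tok and tok not in stopwords
    )
    return counts.most_common(n)
```

Note that stripping ":" also merges script labels such as "questions:" with the plain word, so the domain stopword list would need to hold the normalized forms.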

In [14]:
dpd_stopwords = set(nltk.corpus.stopwords.words('english'))

domain_stopwords = set(['caller', '[epd]', '-', 'questions:', 'scene.', '***', 'is:', 'incident', 'response:', 'chief',
                       'complaint:', 'statement:', 'info', 'description', 'known', 'involved', 'involves', 'dispatch',
                       '1.', '2.', '3.', '4.', '5.', '6.', '7.', '8.', '9.', '10.', '11.', '12.', '13.', '/', 'code:',
                       '.', 'call', 'suspect', 'adv', '1.suspect:', 'location', 'race:', 'priority', 'gender:',
                       'color:', 'age:', 'clothing:', 'cb', 'suspect`s', 'reported', 'known.', 'suspect/person',
                       'progress.', 'area.', 'pd', 'name', 'n/a', 'make:', 'area', 's/he',
                       '[fire]', 'model:'])

dpd_stopwords.update(domain_stopwords)

Trying again with domain stopwords added:

In [16]:
print("========= TOP 100 WORDS =========")
ignore = dpd_stopwords
# Keep the 100 most frequent tokens that are not stopwords
top = [word for word, _ in fd.most_common() if word not in ignore][:100]
for num, word in enumerate(top, start=1):
    print(str(num) + ". " + word)
print("\n")

1. vehicle
2. one
3. person
4. involved.
5. male
6. weapons
7. back
8. danger.
9. alarm
10. party
11. happened
12. blk
13. business/resident/owner
14. phone
15. number
16. traffic
17. left
18. 2
19. female
20. victim
21. calling
22. suspicious
23. black
24. responsible
25. medical
26. mentioned.
27. needs
28. door
29. body:
30. 2nd
31. victim.
32. disturbance
33. see
34. ft
35. disturbance.
36. white
37. veh
38. line
39. open
40. theft
41. attention.
42. past:
43. aborted
44. leave
45. alarms
46. reportedly
47. caller.
48. 14.
49. drugs
50. monitoring
51. company.
52. two
53. activation
54. still
55. keyholder/owner
56. blocking
57. emergency
58. car
59. alarm.
60. law
61. incident.
62. contacted.
63. property
64. st
65. someone
66. front
67. going
68. advised
69. house
70. vm
71. suspect/person/vehicle
72. 1
73. 4
74. vehicles
75. blue
76. burglary/intrusion
77. tty
78. unk
79. no:
80. alcohol
81. 3
82. officer
83. vehicle.
84. people
85. 3rd
86. might
87. in.
88. wants
89. ago):
90. 

# Gang-related
## Data gathering
We did a segment for Tampa on gang-related calls.  In this case, we have the benefit of a "curated" data set, as the incidents have a flag indicating whether a given call is gang-related.  Let's see if we can classify calls as gang-related or not based on the call notes.

First, we'll get the info for each call.  Some of this may be useful in the classifier.

In [24]:
call_info = pd.read_sql("SELECT call.*, gang_related FROM call,"
                        "incident WHERE call.incident_id = incident.incident_id AND EXISTS "
                        "(SELECT 1 FROM note where call_id = call.call_id) ORDER BY call_id;",
                        engine)
call_info.shape

(23699, 40)

Check out the columns to see which appear to be the most informative -- we'll drop any that are likely unnecessary or that won't be available at the start of a call.

In [25]:
print(call_info.columns)

Index(['call_id', 'month_received', 'week_received', 'dow_received', 'hour_received', 'case_id', 'call_source_id', 'primary_unit_id', 'first_dispatched_id', 'reporting_unit_id', 'street_num', 'street_name', 'city_id', 'zip', 'crossroad1', 'crossroad2', 'geox', 'geoy', 'beat', 'district', 'sector', 'business', 'nature_id', 'priority', 'report_only', 'cancelled', 'time_received', 'time_routed', 'time_finished', 'first_unit_dispatch', 'first_unit_enroute', 'first_unit_arrive', 'first_unit_transport', 'last_unit_clear', 'time_closed', 'close_code_id', 'close_comments', 'incident_id', 'year_received', 'gang_related'], dtype='object')


In [26]:
call_info['street'] = call_info['street_num'] + ' ' + call_info['street_name']

drop_cols = ['case_id', 'reporting_unit_id', 'geox', 'geoy', 'report_only', 'cancelled', 'time_received',
             'time_routed', 'time_finished', 'first_unit_dispatch', 'first_unit_enroute', 'first_unit_arrive',
             'first_unit_transport', 'last_unit_clear', 'time_closed', 'close_code_id', 'close_comments',
             'street_num', 'street_name']

call_info = call_info.drop(drop_cols, axis=1)

In [27]:
call_info.shape

(23699, 23)

In [28]:
"""separate_notes = defaultdict(list)

for row in db.query("SELECT * from note;"):
    separate_notes[row['call_id']].append(row['body'])

note_bodies = []
call_ids = []
    
for call_id in separate_notes.keys():
    call_ids.append(call_id)
    note_bodies.append('\n'.join(separate_notes[call_id]))
    
notes = pd.DataFrame({'note': note_bodies}, index=call_ids)
notes.index.name = 'call_id'
notes.head()"""

notes = pd.read_sql("SELECT note.call_id, body FROM note INNER JOIN call ON note.call_id=call.call_id "
                    "WHERE call.incident_id IS NOT NULL ORDER BY call_id;", con=engine)
notes['body'] = notes['body'].map(lambda x: '' if x is None else x)
note_grp = notes.groupby('call_id', sort=False).agg(lambda x: '\n'.join(x.tolist()))
note_grp.shape

(23699, 1)

In [30]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import wordnet as wn

stopwords = set(stopwords.words('english'))

#Get a vector of counts to use for synsetting and stemming

vec = CountVectorizer(stop_words=stopwords, min_df=0.005, max_df=0.99)
data = vec.fit_transform(note_grp['body'].tolist()).toarray()
data_names = vec.get_feature_names()

group = pd.DataFrame(data, columns=data_names)
group.shape

(23699, 1064)

In [31]:
synset_df = pd.DataFrame()


# Make DataFrame with synsets occurring >= 250 times
for column in group.columns:
    for synset in wn.synsets(column):
        try:
            synset_df[synset.name()] += group[column]
        except KeyError:
            synset_df[synset.name()] = group[column]

for column in synset_df.columns:
    if sum(synset_df[column]) < 250:
        synset_df.drop(column, axis=1, inplace=True)
        
synset_df.shape

(23699, 4212)

In [32]:
# Make DataFrame with most common stems
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

stemmer_df = pd.DataFrame()
for column in group.columns:
    try:
        stemmer_df[stemmer.stem(column)] += group[column]
    except KeyError:
        stemmer_df[stemmer.stem(column)] = group[column]
        
for column in stemmer_df.columns:
    if sum(stemmer_df[column]) < 250:
        stemmer_df.drop(column, axis=1, inplace=True)
        
for column in stemmer_df.columns:
    new_column = column + '_stem'
    stemmer_df.rename(columns={column: new_column}, inplace=True)
    
print(stemmer_df.shape)
    
synset_and_stems = pd.concat([synset_df, stemmer_df], axis=1, join='inner')

(23699, 660)


In [37]:
# Get another vector of counts to use by itself as a feature

vec = CountVectorizer(stop_words=stopwords, min_df=0.05, max_df=0.99)
data = vec.fit_transform(note_grp['body'].tolist()).toarray()
data_names = vec.get_feature_names()

group = pd.DataFrame(data, columns=data_names)
group.shape

(23699, 166)

In [38]:
final_data = pd.concat([call_info, group, synset_and_stems], axis=1, join='inner')

In [39]:
final_data.to_csv('../csv_data/gang_related.csv', encoding='utf-8', index=False)

In [40]:
final_data.shape

(23699, 5061)

## Classification

In [4]:
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

In [5]:
data = pd.read_csv('../csv_data/gang_related.csv', encoding='utf-8')



In [6]:
data['gang_related'].value_counts(dropna=False)

False    21003
NaN       2103
True       593
dtype: int64

In [7]:
data = data[pd.notnull(data['gang_related'])]
data.shape

(21596, 5061)

In [8]:
data['gang_related'] = data['gang_related'].astype(int)

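# 'street' replaces the 'street_num'/'street_name' columns dropped during data gathering
for str_col in ('street', 'crossroad1', 'crossroad2', 'beat', 'district', 'sector', 'business', 'priority'):
    le = LabelEncoder()
    # Assign the encoded array directly; wrapping it in a new pd.Series would
    # misalign with the filtered (non-contiguous) index of `data`
    data[str_col] = le.fit_transform(data[str_col].astype(str))
data.fillna(0, inplace=True)
data_features = data.drop('gang_related', axis=1)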
data_features_values = data_features.values
feature_labels = list(data_features.columns.values)



#Subset training
data_train = data_features.loc[4250:]
data_train_features_values = data_train.values

my_normalizer=MinMaxScaler()
my_normalizer.fit(data_train_features_values)
data_train_features_values = my_normalizer.transform(data_train_features_values)

#Subset target
data_train_target = data.loc[4250:]['gang_related']
data_train_target_values = data_train_target.values.astype('float32')
data_train_features_values = data_train_features_values.astype('float32')

#Subset validation
# Note: .loc slices by label and includes both endpoints, so label 4250 falls in both subsets
data_valid = data_features.loc[0:4250]
data_valid_features_values = data_valid.values

data_valid_features_values = my_normalizer.transform(data_valid_features_values)

data_valid_features_values = data_valid_features_values.astype('float32')
data_valid_target = data.loc[0:4250]['gang_related']
data_valid_target_values = data_valid_target.values.astype('float32')

In [37]:
clf = LogisticRegressionCV(penalty='l1', cv=5, solver='liblinear', class_weight= {1.: 100000, 0.: 1})
clf.fit(data_train_features_values, data_train_target_values)

LogisticRegressionCV(Cs=10, class_weight={0.0: 1, 1.0: 100000}, cv=5,
           dual=False, fit_intercept=True, intercept_scaling=1.0,
           max_iter=100, multi_class='ovr', n_jobs=1, penalty='l1',
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0)

In [38]:
testing=clf.predict(data_valid_features_values)

In [39]:
clf.coef_

array([[-1.76105582, -0.5982973 , -0.16759817, ..., -0.91387284,
        -0.49378518, -2.22101367]])

In [40]:
predicted_train = clf.predict(data_train_features_values)

predicted_valid = clf.predict(data_valid_features_values)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print('-'*21)
print("Classification Report")
print('-'*21)
print(classification_report(data_train_target_values, predicted_train))
print('-'*16)
print("Confusion Matrix")
print('-'*16)
print(confusion_matrix(data_train_target, predicted_train))

print('-'*21)
print("Classification Report")
print('-'*21)
print(classification_report(data_valid_target, predicted_valid))
print('-'*16)
print("Confusion Matrix")
print('-'*16)
print(confusion_matrix(data_valid_target, predicted_valid))

selected_features=clf.coef_[0]
selected_features_labels={}
count=0
for x in selected_features:
    if x != 0:
        selected_features_labels[feature_labels[count]]=x
    count+=1
print("Features: " + str(len(selected_features_labels)))

---------------------
Classification Report
---------------------
             precision    recall  f1-score   support

        0.0       0.98      0.70      0.82     17248
        1.0       0.03      0.36      0.06       452

avg / total       0.95      0.69      0.80     17700

----------------
Confusion Matrix
----------------
[[12068  5180]
 [  288   164]]
---------------------
Classification Report
---------------------
             precision    recall  f1-score   support

          0       0.98      0.62      0.76      3756
          1       0.06      0.62      0.10       141

avg / total       0.94      0.62      0.73      3897

----------------
Confusion Matrix
----------------
[[2322 1434]
 [  54   87]]
Features: 5060
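
To make the reports above easier to read, the positive-class precision and recall can be recomputed from the printed confusion matrices. This is a small worked check, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels); `precision_recall` is a hypothetical helper.

```python
def precision_recall(confusion, positive=1):
    """Precision and recall for one class of a square confusion matrix."""
    tp = confusion[positive][positive]
    # Column `positive`: everything predicted positive
    predicted_pos = sum(row[positive] for row in confusion)
    # Row `positive`: everything that is truly positive
    actual_pos = sum(confusion[positive])
    return tp / predicted_pos, tp / actual_pos


# Matrices copied from the output above
train_cm = [[12068, 5180], [288, 164]]
valid_cm = [[2322, 1434], [54, 87]]

train_precision, train_recall = precision_recall(train_cm)
valid_precision, valid_recall = precision_recall(valid_cm)
```

On the validation set, 87 of the 1521 calls predicted as gang-related are actually gang-related, so most of the flagged calls are false positives even though 87 of the 141 true cases are recovered.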


# Comment Sentiment

One possible area of interest would be to use the comments to identify calls where a negative interaction takes place between an officer and a citizen (henceforth "Officer-Citizen Conflict", or OCC).  With the current tension between police and citizens across the US, using the call comments to identify these kinds of calls could be useful to the DPD.  We can take two approaches to identifying these calls: using a regex to find words in close proximity (such as "OFFICER" and "RUDE" in the same sentence), or performing sentiment analysis on the comments.  We'll try both the ANEW and AFINN dictionaries.
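
The regex approach mentioned above could look something like the following sketch. The word lists are placeholders chosen for illustration, not terms validated against the notes, and `flag_occ` is a hypothetical helper. A "sentence" here is anything between sentence punctuation or newlines.

```python
import re

SENTENCE_SPLIT = re.compile(r"[.!?\n]+")
# Hypothetical term lists; these would need tuning against real comments
ACTOR = re.compile(r"\b(officer|ofc|police)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(rude|unprofessional|harass\w*|disrespect\w*)\b", re.IGNORECASE)


def flag_occ(note):
    """Return the sentences that mention both an officer term and a negative term."""
    return [
        sentence.strip()
        for sentence in SENTENCE_SPLIT.split(note)
        if ACTOR.search(sentence) and NEGATIVE.search(sentence)
    ]
```

Because both terms must fall in the same sentence, a note like "officer on scene. caller was rude" is not flagged.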

In [2]:
import nltk.stem.porter as stemmer
import csv
import re
import math
from pprint import pprint

Set up the sentiment dictionaries

In [3]:
class ANEWScorer():
    ''' Sentiment scoring using the ANEW sentiment dictionary '''

    def __init__(self,anew_file_path,score_single_words=True):
        ''' If score_single_words is False, then the scorer will return 
            None for phrases submitted where fewer than 2 words match
            with the word lists '''
        self.anew_dict_words, self.anew_dict_stems = self.__loadANEW(anew_file_path)
        self.score_single  = score_single_words
        self.punctuation   = re.compile(r'[-.?!,“”":;#()]')
        self.porterstemmer = stemmer.PorterStemmer()

    def score(self,phrase,values="all"):
        ''' Return a tuple of scores based on the ANEW sentiment dictionary
            IF the "values" parameter is 'all', this is what you will get
            ((valence-mean,valence-SD),(arousal-mean,arousal-SD),(dominance-mean,dominance-SD))
            IF the "values" parameter specifies either 'valence' 'arousal' or 'dominance',
            you will only get the single relevant tuple.
        '''
        # The NLTK Tokenizer might introduce problems here, depending
        # on what you are trying to tokenize. Therefore, the default
        # is to split on blankspace after removing punctuation
        tokens = [self.punctuation.sub("",word) for word in phrase.split()]

        if len(tokens) > 0 and (self.score_single is True or 
                                    (self.score_single is False and 
                                        len(tokens) > 1)):
            # This will return None if there is nothing to score
            # or if we aren't scoring single words and there is only one word
            valence   = []
            arousal   = []
            dominance = []

            for word in tokens:
                if word in self.anew_dict_words:
                    # Push in scores
                    valence.append((  self.anew_dict_words[word]['valence-mean'],  self.anew_dict_words[word]['valence-SD'] ))
                    arousal.append((  self.anew_dict_words[word]['arousal-mean'],  self.anew_dict_words[word]['arousal-SD'] ))
                    dominance.append((self.anew_dict_words[word]['dominance-mean'],self.anew_dict_words[word]['dominance-SD'] ))
                else: 
                    word_stem = self.porterstemmer.stem(word)
                    if word_stem in self.anew_dict_stems:
                        valence.append((  self.anew_dict_stems[word_stem]['valence-mean'],  self.anew_dict_stems[word_stem]['valence-SD'] ))
                        arousal.append((  self.anew_dict_stems[word_stem]['arousal-mean'],  self.anew_dict_stems[word_stem]['arousal-SD'] ))
                        dominance.append((self.anew_dict_stems[word_stem]['dominance-mean'],self.anew_dict_stems[word_stem]['dominance-SD'] ))

            # Initialize the tuples to return
            fin_valence    = (0,0)
            fin_arousal    = (0,0)
            fin_dominance  = (0,0)

            if len(valence) > 0:
                if len(valence) == 1:
                    fin_valence = (valence[0][0],valence[0][1])
                else:
                    fin_valence = self.__weightedAverage(valence)

                if len(arousal) == 1:
                    fin_arousal = (arousal[0][0],arousal[0][1])
                else:
                    fin_arousal = self.__weightedAverage(arousal)

                if len(dominance) == 1:
                    fin_dominance = (dominance[0][0],dominance[0][1])
                else:
                    fin_dominance = self.__weightedAverage(dominance)
        else:
            # If we are here it is because we either have no tokens
            # or did not meet the criteria for scoring the tokens
            return

        # Only return values if there are values to return
        # None will be "returned" if there is nothing to score
        if len(valence) > 0:
            if values == "all":
                return (fin_valence,fin_arousal,fin_dominance)
            elif values == "valence":
                return fin_valence
            elif values == "arousal":
                return fin_arousal
            elif values == "dominance":
                return fin_dominance

    def __small_p(self,stdev_i):
        ''' Helper function for calculating the maximum value on a normal curve given the standard deviation '''
        p = 1/(stdev_i * math.sqrt(2 * math.pi))
        return p

    def __weightedAverage(self,score_list):
        ''' Calculates the weighted average based on incoming pairs of averages and standard deviations '''
        mean_sd  = self.__meanSD(score_list)
        sum_p_i  = sum([self.__small_p(i[1]) for i in score_list])
        big_v_mu = sum([(self.__small_p(i[1])/sum_p_i * i[0]) for i in score_list])

        return big_v_mu, mean_sd
        
    def __meanSD(self,score_list):
        ''' Generates the average standard deviation given different standard deviations 
            Gnarly math courtesy of: http://www.csc.ncsu.edu/faculty/healey/maa-14/text/index.html
        '''
        bigM   =       (1/float(len(score_list))) * sum([i[0] for i in score_list])
        bigV   = math.sqrt(((1/float(len(score_list))) * sum([(i[1]**2 + bigM**2) for i in score_list])) - bigM**2)
        return bigV

    def __loadANEW(self,anew_file_path):
        ''' Private method for loading the ANEW dict '''
        word_dict = {}
        stem_dict = {}

        # Use a context manager so the file is closed once loading finishes
        with open(anew_file_path, "r") as f:
            f.readline() # Move the cursor past the header

            for line in f:
                values = line.replace("\n","").split(",")
                measures = {}
                measures['word-no']        = int(  values[2])
                measures['valence-mean']   = float(values[3])
                measures['valence-SD']     = float(values[4])
                measures['arousal-mean']   = float(values[5])
                measures['arousal-SD']     = float(values[6])
                measures['dominance-mean'] = float(values[7])
                measures['dominance-SD']   = float(values[8])
                measures['word-freq']      = float(values[9])

                word_dict[values[0]] = measures.copy()
                stem_dict[values[1]] = measures.copy()

        return word_dict, stem_dict
    
scorer = ANEWScorer("../text_analysis/anew.csv")
    
# AFINN-111 is tab-separated: word<TAB>integer score
with open("../text_analysis/AFINN-111.txt") as f:
    afinn = {word: int(score) for word, score in (line.split('\t') for line in f)}
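
For reference, the combination performed by `__small_p`, `__weightedAverage` and `__meanSD` can be written out as follows. This is read directly from the code in the cell above, not from the cited source; $\mu_i$ and $\sigma_i$ denote the mean and SD of the $i$-th matched word and $n$ the number of matches.

```latex
p_i = \frac{1}{\sigma_i \sqrt{2\pi}}, \qquad
\mu = \sum_{i=1}^{n} \frac{p_i}{\sum_{j=1}^{n} p_j}\,\mu_i, \qquad
M = \frac{1}{n}\sum_{i=1}^{n} \mu_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\sigma_i^2 + M^2\right) - M^2}
```

Since $M^2$ is constant inside the sum, the two $M^2$ terms cancel, and the SD as coded reduces to $\sqrt{\frac{1}{n}\sum_i \sigma_i^2}$, the root mean square of the individual SDs. If the referenced formula uses $\mu_i^2$ inside the sum instead of $M^2$, the code would differ from it, which may be worth checking against the source.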

Gather all the notes.  Scoring each comment line separately doesn't help with ANEW, because individual words are weighted too heavily, so we look at each call's comments as a whole.  We also need to exclude any CAD-generated notes, which start with a bracket, e.g. "[EPD] Chief Complaint: ..."
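
The same filtering and grouping can be sketched in plain Python, which may help when testing without a database. This assumes the notes arrive as `(call_id, body)` pairs; `notes_by_call` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict


def notes_by_call(rows):
    """Join the non-CAD note bodies for each call into one string."""
    grouped = defaultdict(list)
    for call_id, body in rows:
        # Skip missing bodies and CAD-generated notes such as "[EPD] ..."
        if body is None or body.startswith("["):
            continue
        grouped[call_id].append(body)
    return {call_id: "\n".join(bodies) for call_id, bodies in grouped.items()}
```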

In [6]:
q = "SELECT string_agg(body, '\n') AS body FROM note WHERE body NOT LIKE '[%' GROUP BY call_id;"

all_notes = [r['body'] for r in db.query(q) if r['body'] is not None]

Perform the ANEW scoring; it scores on 3 axes, so we'll have to sort on different combinations of them to see what works best.
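
Since each score is a tuple of three `(mean, SD)` pairs or `None`, sorting by a single axis is easier with a small helper. This is a minimal sketch assuming the `(note, score)` pair layout returned by the scorer with `values="all"`; `AXES` and `rank_by_axis` are hypothetical names. Unscored notes are dropped instead of being pushed to the end with a sentinel value.

```python
AXES = {"valence": 0, "arousal": 1, "dominance": 2}


def rank_by_axis(notes_and_scores, axis, reverse=False, n=10):
    """Sort scored notes by the mean of one ANEW axis, ignoring unscored notes."""
    idx = AXES[axis]
    scored = [(note, score) for note, score in notes_and_scores if score is not None]
    return sorted(scored, key=lambda pair: pair[1][idx][0], reverse=reverse)[:n]
```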

In [45]:
anew_scored_comments = [scorer.score(n) for n in all_notes]

In [46]:
notes_and_scores = list(zip(all_notes, anew_scored_comments))
pprint(notes_and_scores[:5])

[('272.09 ft from 1610 GUESS RD', None),
 ('DO NOT STOP AT CALLERS ADDRESS CK AREA\n'
  'CALLER SAID SHE WAS GOING TO LAY DOWN AND HUNG UP ON ME\n'
  'gave 10-14\n'
  'maybe from lynn  no complainant to see or contact\n'
  'NO FURTHER INFO',
  None),
 ('Linked Events 2014-000005(567) to 2014-000006(568)', None),
 ('Linked Events 2014-000005(567) to 2014-000006(568)\n'
  'another caller advised same per charles ph #919 637 4750 no complainant '
  'to see',
  None),
 ('gave 10-14', None)]


In [47]:
# Valence axis -- we want to see the lowest sentiment
pprint(sorted(notes_and_scores, key=lambda x: x[1][0][0] if x[1] is not None else 100)[:10])

[('(DUKEPD) notified\n'
  'FM IS OFF HER MEDS AND HAS NOT SLEEP  IN FOUR DAYS AND APPEARS TO BE IN '
  'A PARNOID STATE\n'
  '(B212) NOTIFY DUKE ETA 10 MIN-NON COMBATIVE W/1 FEM\n'
  'Robin Davis 11/18/60. advised voices in her head were telling her to '
  'commit suicide. She advised she has insomnia and has been drinking beer.',
  ((1.25, 0.69), (5.73, 3.14), (3.58, 3.02))),
 ('(C411) 43 A PEOKO\n'
  'Advised that her boyfriend raped her younger sister around 0300 this '
  'morning\n'
  'PHONE LOST SIGNAL FIRST TIME, ON CB SOMEONE GAVE THIS ADDRESS THEN HUNG '
  'UP\n'
  '(C411) CONCORD/PEOKO\n'
  'caller now reporting a rape that occurred last night. 919-667-1745, '
  'Rachael Roberts. Caller hung up again\n'
  'ON SECOND CB, STRAIGHT TO VM\n'
  '(C411) 1 72\n'
  'SOUNDED LIKE SOMEONE HOLLARING IN THE BACKGROUND\n'
  'Event spawned for DPD Event ID:2014034849\n'
  'Event spawned for DPD Event ID:2014035085',
  ((1.25, 0.91), (6.81, 3.1700000000000004), (2.97, 2.94))),
 ('(C411) CONC

In [48]:
# Arousal axis -- could be highest or lowest
pprint(sorted(notes_and_scores, reverse=True, key=lambda x: x[1][1][0] if x[1] is not None else -100)[:10])
print("-" * 20)
pprint(sorted(notes_and_scores, key=lambda x: x[1][1][0] if x[1] is not None else 100)[:10])

[('COMPLAINANT REFUSED FOR PD TO SEE HIM AND THAT HE WOULD GO TO SUBSTATION '
  '3 TO SEE OFFICER\n'
  'LS SEEN MAKING A TURN INTO MLK FROM SHANNON RD\n'
  'DRIVER WAS A WHITE/BLACK MALE MIX WEARING AN OLD WHITE SHIRT, SHORTS.\n'
  'ENTERED CALL IN BOLO\n'
  'ref poss road rage...\n'
  'BLACK HAUNDAI ACCENT, ON SHANNON RD APPROACHING MLK',
  ((2.41, 1.86), (8.17, 1.4), (5.68, 3.01))),
 ('reference road rage', ((2.41, 1.86), (8.17, 1.4), (5.68, 3.01))),
 ('innediate hu\n'
  'on cb\n'
  '15 501 s into durham ....road rage near target....white nissan '
  'yte31??...no compl',
  ((2.41, 1.86), (8.17, 1.4), (5.68, 3.01))),
 ('NOT A VALID NUMBER FOR CALLBACK\n'
  'open line, children on the phone, excited to see grandpa',
  ((7.5, 2.2), (7.67, 1.91), (6.18, 2.17))),
 ('EHISTORY SHOWS 2822 W MAIN ST\n'
  'TRYING TO INTERP NOW\n'
  'GOT VOICEMAIL ON CALL BACK\n'
  '{B114} ATL\n'
  '{B122} ATL\n'
  'LL INTERP TRYING TO CALL NUMBER BACK NOW\n'
  'LEFT MESSAGE\n'
  'WAITING FOR LL REP TO COME ON 

In [49]:
# Dominance axis -- could be highest or lowest
pprint(sorted(notes_and_scores, reverse=True, key=lambda x: x[1][2][0] if x[1] is not None else -100)[:10])
print("-" * 20)
pprint(sorted(notes_and_scores, key=lambda x: x[1][2][0] if x[1] is not None else 100)[:10])

[('this is in ref. to a break in that just occurred at 3923 king charles rd',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('caller is on scott king rd near the entrance to the park\n'
  'Caller back on the line for an eta...advised PD would be there asap\n'
  '(A422) DSO WILL HANDLE\n'
  'holding auth A422',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('see the caller...he will be @ the  buyquick on s alston by NCCU...across '
  'from the burger king',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('shawn king advised no emergency',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('by kings grocery store\n'
  'Linked Events 2014-121222(844) to 2014-121221(843)',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('CALLER IS WAITING IN FRONT OF TOPS CHINA IN A GREEN NISSAN SENTRA\n'
  'busn  - china king',
  ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('busn is china king', ((7.26, 1.67), (5.51, 2.77), (7.38, 2.1))),
 ('in the woods behind burger king\n'
  '(C413) CLR SIG20, 2 10-72\

In [50]:
del anew_scored_comments
del notes_and_scores

Unfortunately, these results aren't useful.  The scores mostly indicate whether a single strongly scored word (like 'rage' or 'love') is present in the notes, not the tone of the interaction.  We'll try the AFINN dictionary instead.
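
One way to reduce the influence of a single matching word would be to require a minimum number of lexicon hits before scoring a note. The sketch below illustrates that idea with a toy lexicon; the word values are illustrative placeholders, not the real ANEW entries, and `mean_valence` is a hypothetical helper that ignores the SD weighting used by `ANEWScorer`.

```python
def mean_valence(text, lexicon, min_hits=3):
    """Average the lexicon values of matched words, or None if too few match."""
    hits = [lexicon[word] for word in text.lower().split() if word in lexicon]
    if len(hits) < min_hits:
        return None
    return sum(hits) / len(hits)


# Toy values for illustration only
toy_lexicon = {"king": 7.0, "rage": 2.0, "rude": 3.0}
```

With this threshold, a note such as "busn is china king" is left unscored instead of inheriting the score of "king".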

In [10]:
notes_and_scores = [(n, sum(map(lambda word: afinn.get(word, 0), n.lower().split()))) for n in all_notes]
pprint(sorted(notes_and_scores, key=lambda x: x[1])[:10])
print("-" * 20)
pprint(sorted(notes_and_scores, key=lambda x: x[1], reverse=True)[:10])

[('--------------------------COPIED MESSAGE--------------------  Received: '
  '11/17/2014 11:59:33 PM  From: SYSTEM (SYSTEM)  Message:   TO: DHC2    '
  '-046530 20141117 23:59:33 [ 2C029B2F54 ]   FROM: DMVISSNM        '
  '20141117 23:59:32                             N.C. DRIVER LICENSE '
  'SYSTEM  RESPONSE BASED UPON:  NAME: SCOTT,VALLI                     '
  'CITY:   COUNTY:       BIRTH DATE:            AGE:     RACE:    SEX:    '
  'PAGES:    7  ATTENTION:       IMAGE: Y                             '
  'DRIVER ISSUANCE RESPONSE  NAME: SCOTT VALLI ADETRIA                  '
  'ADDRESS:  109 BENNETT CT  CITY:   DURHAM                  STATE: NC  '
  'ZIP: 277011401  TOTAL POINTS:    0  DOB:07-14-1985 HEIGHT: 5 FT. 06 '
  'IN.   SEX: F  EYES: BRO  HAIR: BRO  RACE: B  PRIMARY LICENSE NO:      '
  '23362703  SECONDARY LICENSE NO:               NON-RESIDENT MILITARY: N  '
  'ORG. ISS.DT:          OS DL NO:                       OS '
  'STATE:           *** DRIVER LICENSE STATUS: CLS 

These results just favor the longer notes, since each score is an unnormalized sum over words.  It looks like the sentiment analysis is going to be too general to help us identify OCC calls.
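
A possible next step, not tried in this notebook, would be to normalize the AFINN sum by the number of tokens so that long notes are not favored automatically. The sketch below uses a toy lexicon with placeholder values; `afinn_sum` and `afinn_mean` are hypothetical helpers.

```python
def afinn_sum(text, lexicon):
    """Unnormalized score: sum of lexicon values over whitespace tokens."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())


def afinn_mean(text, lexicon):
    """Length-normalized score: the sum divided by the number of tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0) for token in tokens) / len(tokens)


# Toy values for illustration only
toy_afinn = {"rude": -2, "bad": -3}
```

Under the sum, a short complaint and a long administrative message containing the same word score identically; under the mean, the short complaint ranks as more negative.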