In this notebook, we'll be performing exploratory text analysis on the notes from Durham's CFS data.  Some ideas for places to take this:

 - Word frequencies for an initial look
 - Figure out the pattern (if any) behind the dispatcher scripts, including priority, chief complaint, questions/answers, and subject/vehicle descriptions

Some observations from reading remarks:
 - "x ft from STREET NAME" indicates directed patrol at that location
 - there is a structure for transports that includes the distance traveled with a transport, but the officers don't appear to fill it out consistently
 - 10 codes are seen as "10-72" or just "72"
 - CAD-generated messages are prefixed by a bracketed descriptor -- usually "[EPD]".  Messages left by the units are preceded by (UNIT NUMBER).
 - 

To look at word frequencies, we have to combine all the words into a single corpus.

In [1]:
import nltk
import dataset
from sqlalchemy import create_engine
import pandas as pd
from IPython.display import display
from collections import defaultdict

db_uri = "postgresql://jnance:@localhost:5432/cfs"

engine = create_engine(db_uri)
db = dataset.connect(db_uri)

  (attype, name))
  (attype, name))
  (attype, name))


In [2]:
notes = []

for row in db.query("SELECT body FROM note;"):
    if row['body']:
        notes += row['body'].lower().split()
    
txt = nltk.Text(notes)
fd = nltk.FreqDist(txt)

  (attype, name))
  (attype, name))


Here we can see the most frequent words in the corpus.  Many of these will probably become domain stopwords. (note: must download the nltk stopwords corpus using `nltk.download()` before this will work)

In [12]:
print("========= TOP 100 WORDS =========")
l = fd.most_common()
ndx = 0
num = 1
ignore = nltk.corpus.stopwords.words('english')
while (num <= 100 and ndx < len(l)):
    if l[ndx][0] not in ignore:
        print(str(num) + ". " + l[ndx][0])
        num += 1
    ndx += 1
print("\n")

1. caller
2. [epd]
3. -
4. questions:
5. ***
6. scene.
7. is:
8. incident
9. vehicle
10. response:
11. chief
12. complaint:
13. statement:
14. 1.
15. 2.
16. 3.
17. 4.
18. 5.
19. 6.
20. info
21. /
22. 7.
23. one
24. code:
25. person
26. .
27. description
28. known
29. involved.
30. involves
31. dispatch
32. male
33. 8.
34. suspect
35. call
36. 9.
37. weapons
38. adv
39. 1.suspect:
40. back
41. 10.
42. location
43. race:
44. priority
45. gender:
46. 11.
47. color:
48. age:
49. danger.
50. clothing:
51. alarm
52. party
53. cb
54. happened
55. blk
56. business/resident/owner
57. suspect`s
58. reported
59. known.
60. 12.
61. phone
62. number
63. traffic
64. left
65. suspect/person
66. 2
67. progress.
68. area.
69. involved
70. female
71. victim
72. calling
73. suspicious
74. black
75. pd
76. name
77. responsible
78. n/a
79. make:
80. medical
81. mentioned.
82. area
83. needs
84. door
85. 13.
86. s/he
87. body:
88. [fire]
89. 2nd
90. victim.
91. disturbance
92. see
93. ft
94. disturbance.
95

Many of these words are part of the call scripts transcribed by the CAD system, describing the caller's answers to specific questions asked by the dispatcher.  We'll probably need to add the script words to a stopword list, but we can keep the descriptive words that go along with them.

In [14]:
dpd_stopwords = set(nltk.corpus.stopwords.words('english'))

domain_stopwords = set(['caller', '[epd]', '-', 'questions:', 'scene.', '***', 'is:', 'incident', 'response:', 'chief',
                       'complaint:', 'statement:', 'info', 'description', 'known', 'involved', 'involves', 'dispatch',
                       '1.', '2.', '3.', '4.', '5.', '6.', '7.', '8.', '9.', '10.', '11.', '12.', '13.', '/', 'code:',
                       '.', 'call', 'suspect', 'adv', '1.suspect:', 'location', 'race:', 'priority', 'gender:',
                       'color:', 'age:', 'clothing:', 'cb', 'suspect`s', 'reported', 'known.', 'suspect/person',
                       'progress.', 'area.', 'pd', 'name', 'n/a', 'make:', 'area', 's/he',
                       '[fire]', 'model:'])

dpd_stopwords.update(domain_stopwords)

Trying again with domain stopwords added:

In [16]:
print("========= TOP 100 WORDS =========")
l = fd.most_common()
ndx = 0
num = 1
ignore = dpd_stopwords
while (num <= 100 and ndx < len(l)):
    if l[ndx][0] not in ignore:
        print(str(num) + ". " + l[ndx][0])
        num += 1
    ndx += 1
print("\n")

1. vehicle
2. one
3. person
4. involved.
5. male
6. weapons
7. back
8. danger.
9. alarm
10. party
11. happened
12. blk
13. business/resident/owner
14. phone
15. number
16. traffic
17. left
18. 2
19. female
20. victim
21. calling
22. suspicious
23. black
24. responsible
25. medical
26. mentioned.
27. needs
28. door
29. body:
30. 2nd
31. victim.
32. disturbance
33. see
34. ft
35. disturbance.
36. white
37. veh
38. line
39. open
40. theft
41. attention.
42. past:
43. aborted
44. leave
45. alarms
46. reportedly
47. caller.
48. 14.
49. drugs
50. monitoring
51. company.
52. two
53. activation
54. still
55. keyholder/owner
56. blocking
57. emergency
58. car
59. alarm.
60. law
61. incident.
62. contacted.
63. property
64. st
65. someone
66. front
67. going
68. advised
69. house
70. vm
71. suspect/person/vehicle
72. 1
73. 4
74. vehicles
75. blue
76. burglary/intrusion
77. tty
78. unk
79. no:
80. alcohol
81. 3
82. officer
83. vehicle.
84. people
85. 3rd
86. might
87. in.
88. wants
89. ago):
90. 

#Gang-related
##Data gathering
We did a segment for Tampa on gang-related calls.  In this case, we have the benefit of a "curated" data set, as the incidents have a flag indicating whether a given call is gang-related.  Let's see if we can classify calls as gang-related or not based on the call notes.

First, we'll get the info for each call.  Some of this may be useful in the classifier.

In [16]:
call_info = pd.read_sql("SELECT call.*, gang_related FROM call, "
                        "incident WHERE call.incident_id = incident.incident_id ORDER BY call_id;",
                        engine)
call_info.shape

(25418, 39)

Check out the columns to see which appear to be the most informative -- we'll drop any ones that are likely unnecessary or won't be accessible at the start of a call.

In [17]:
print(call_info.columns)

Index(['call_id', 'month_received', 'week_received', 'dow_received', 'hour_received', 'case_id', 'call_source_id', 'primary_unit_id', 'first_dispatched_id', 'reporting_unit_id', 'street_num', 'street_name', 'city_id', 'zip', 'crossroad1', 'crossroad2', 'geox', 'geoy', 'beat', 'district', 'sector', 'business', 'nature_id', 'priority', 'report_only', 'cancelled', 'time_received', 'time_routed', 'time_finished', 'first_unit_dispatch', 'first_unit_enroute', 'first_unit_arrive', 'first_unit_transport', 'last_unit_clear', 'time_closed', 'close_code_id', 'close_comments', 'incident_id', 'gang_related'], dtype='object')


In [18]:
drop_cols = ['case_id', 'reporting_unit_id', 'geox', 'geoy', 'report_only', 'cancelled', 'time_received',
             'time_routed', 'time_finished', 'first_unit_dispatch', 'first_unit_enroute', 'first_unit_arrive',
             'first_unit_transport', 'last_unit_clear', 'time_closed', 'close_code_id', 'close_comments']
for col in drop_cols:
    call_info = call_info.drop(col, axis=1)

In [19]:
call_info.shape

(25418, 22)

In [20]:
"""separate_notes = defaultdict(list)

for row in db.query("SELECT * from note;"):
    separate_notes[row['call_id']].append(row['body'])

note_bodies = []
call_ids = []
    
for call_id in separate_notes.keys():
    call_ids.append(call_id)
    note_bodies.append('\n'.join(separate_notes[call_id]))
    
notes = pd.DataFrame({'note': note_bodies}, index=call_ids)
notes.index.name = 'call_id'
notes.head()"""

notes = pd.read_sql("SELECT note.call_id, body FROM note INNER JOIN call ON note.call_id=call.call_id "
                    "WHERE call.incident_id IS NOT NULL ORDER BY call_id;", con=engine)
notes['body'] = notes['body'].map(lambda x: '' if x is None else x)
note_grp = notes.groupby('call_id', sort=False).agg(lambda x: '\n'.join(x.tolist()))
note_grp.shape

(23699, 1)

In [21]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import wordnet as wn

stopwords = set(stopwords.words('english'))

vec = CountVectorizer(stop_words=stopwords, min_df=0.005, max_df=0.99)
data = vec.fit_transform(note_grp['body'].tolist()).toarray()
data_names = vec.get_feature_names()

group = pd.DataFrame(data, columns=data_names)
group.shape()

Unnamed: 0,00,10,100,1000,104d01,106d04,11,110b02,110d02,111b01,...,x1,x2,yard,year,yelling,yellow,yesterday,yo,yoa,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
synset_df = pd.DataFrame()


# Make DataFrame with synsets occuring >= 150 times
for column in group.columns:
    for synset in wn.synsets(column):
        try:
            synset_df[synset.name()] += group[column]
        except KeyError:
            synset_df[synset.name()] = group[column]

for column in synset_df.columns:
    if sum(synset_df[column]) < 250:
        synset_df.drop(column, axis=1, inplace=True)
        
synset_df.shape()

Unnamed: 0,ten.n.01,ten.s.01,hundred.n.01,hundred.s.01,thousand.n.01,thousand.s.01,eleven.n.01,eleven.s.01,twelve.n.01,twelve.s.01,...,young.n.04,young.n.05,young.n.06,young.n.07,young.n.08,young.n.09,young.a.01,youthful.s.01,young.s.04,unseasoned.s.03
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# Make DataFrame with most common stems
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

stemmer_df = pd.DataFrame()
for column in group.columns:
    try:
        stemmer_df[stemmer.stem(column)] += group[column]
    except KeyError:
        stemmer_df[stemmer.stem(column)] = group[column]
        
for column in stemmer_df.columns:
    if sum(stemmer_df[column]) < 250:
        stemmer_df.drop(column, axis=1, inplace=True)
        
for column in stemmer_df.columns:
    new_column = column + '_stem'
    stemmer_df.rename(columns={column: new_column}, inplace=True)
    
synset_and_stems = pd.concat([synset_df, stemmer_df], axis=1, join='inner')

In [None]:
vec = CountVectorizer(stop_words=stopwords, min_df=0.05, max_df=0.99)
data = vec.fit_transform(note_grp['body'].tolist()).toarray()
data_names = vec.get_feature_names()

group = pd.DataFrame(data, columns=data_names)
group.head()

In [24]:
final_data = pd.concat([call_info, group, synset_and_stems], axis=1, join='inner')

In [25]:
final_data.to_csv('../csv_data/gang_related.csv', encoding='utf-8', index=False)

In [26]:
final_data.head()

Unnamed: 0,call_id,month_received,week_received,dow_received,hour_received,call_source_id,primary_unit_id,first_dispatched_id,street_num,street_name,...,x1_stem,x2_stem,yard_stem,year_stem,yell_stem,yellow_stem,yesterday_stem,yo_stem,yoa_stem,young_stem
0,2014000003,1,1,2,0,8,162,162,1610,GUESS RD,...,0,0,0,0,0,0,0,0,0,0
1,2014000015,1,1,2,0,3,599,599,115,COMMERCE ST,...,0,1,0,0,0,0,0,1,0,0
2,2014000042,1,1,2,0,12,165,165,531,S ROXBORO ST,...,0,0,0,0,0,0,0,0,0,0
3,2014000083,1,1,2,0,3,9,9,4600,UNIVERSITY DR,...,0,0,0,0,0,0,0,0,0,0
4,2014000102,1,1,2,1,8,27,311,4230,UNIVERSITY DR/MARTIN LUTHER KING JR PKWY ON RAMP,...,0,0,0,0,0,0,0,0,0,0


##Classification

In [2]:
from sklearn import metrics
from sklearn import cross_validation
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

In [13]:
data = pd.read_csv('../csv_data/gang_related.csv', encoding='utf-8')

  data = self._reader.read(nrows)


In [14]:
data['gang_related'].value_counts()

False    20877
True       609
dtype: int64

In [10]:
data = data[data['gang_related'] is not None]
data['gang_related'] = data['gang_related'].astype(int)

for str_col in ('street_name', 'crossroad1', 'crossroad2', 'beat', 'district', 'sector', 'business', 'priority'):
    le = LabelEncoder()
    #le.fit(data[str_col].tolist())
    data[str_col] = pd.Series(le.fit_transform(data[str_col].tolist()))
data_features = data.drop('gang_related', 1)
data_features_values = data_features.values
feature_labels = list(data_features.columns.values)

#Subset training
data_train = data_features.loc[4250:]
data_train_features_values = data_train.values

#Subset target
data_train_target = data.loc[4250:]['gang_related']
data_train_target_values = data_train_target.values #.astype('<U32')
data_train_features_values = data_train_features_values #.astype('<U32')

#Subset validation
data_valid = data_features.loc[0:4250]
data_valid_features_values = data_valid.values

data_valid_features_values = data_valid_features_values #.astype('float32')
data_valid_target = data.loc[0:4250]['gang_related']
data_valid_target_values = data_valid_target.values #.astype('float32')

KeyError: True

In [6]:
clf = LogisticRegressionCV(method='l1', cv=5, solver='liblinear')
clf.fit(data_train_features_values, data_train_target_values)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', refit=True,
           scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [7]:
testing=clf.predict(data_valid_features_values)

In [8]:
validation = clf.score(data_valid_features_values, data_valid_target)

In [9]:
validation

0.96683133380381092