# Allocating accidents to roads - snowkuma

This Notebook will allocate accidents to the appropriate road segment. This will allow you to find accident hotspots based on accident rates relative to traffic density.

First, some boilerplate imports.

In [1]:
# Import the required libraries

import pymongo
import datetime
import collections

import pandas as pd
import scipy.stats

In [2]:
# Open a connection to the Mongo server, open the accidents database and name the accidents, labels and roads collections
client = pymongo.MongoClient('mongodb://localhost:27351/')

db = client.accidents
accidents = db.accidents
labels = db.labels
roads = db.roads

Bugfix: there's an index on the `roads` collection that shouldn't be there. This will remove the index if it exists.

In [3]:
if 'Road_1_loc_2dsphere' in roads.index_information():
    roads.drop_index('Road_1_loc_2dsphere')

In [4]:
# Load the expanded names of keys and human-readable codes into memory
expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]

You'd expect the number of accidents to vary proportionally with the number of vehicles on a road, and the number of vehicle journeys to vary with population. This means that simply looking at the number of accidents in a location or region doesn't tell us much about the risk of using that road.
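
To make the idea of a rate concrete, here is a minimal sketch of one possible measure: accidents per million vehicle-kilometres. It uses figures shown later in this Notebook for census point 16854. The function name is hypothetical. The sketch also assumes that `FdAll_MV` is an average daily flow of all motor vehicles, that `LenNet` is the section length in kilometres, and that the accident count covers one year.

```python
def accidents_per_million_vehicle_km(accident_count, daily_flow, length_km, days=365):
    """Accidents per million vehicle-kilometres travelled on a road section.

    Assumes daily_flow is an average number of vehicles per day and that
    accident_count covers the same period as `days`.
    """
    vehicle_km = daily_flow * days * length_km
    return accident_count / vehicle_km * 1_000_000


# Figures for census point 16854 as they appear later in this Notebook:
# nearby_accident_count = 11, FdAll_MV = 17673.0, LenNet = 1.8
rate = accidents_per_million_vehicle_km(accident_count=11,
                                        daily_flow=17673.0,
                                        length_km=1.8)
print(round(rate, 3))
```

Two road sections with the same accident count can have very different rates if one carries far more traffic, which is why the raw count alone says little about risk.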

To start work on the 'danger' of each road section, we need to identify which accidents took place on it. 

The simple approach is to associate each accident with the nearest road census point. However, this has two problems. One is that the accident may be on a minor road that isn't associated with the traffic census point. The other is that the road on which the accident occurs may not be the one with the closest road census point (for instance, if the accident occurs near a junction).

Therefore, we need to annotate each road census document with the accidents that occur on that road section. To avoid too much duplication, we'll just annotate the `road` document with the relevant accident indexes. If we want more details about the accidents, we can look them up directly from the `accidents` collection.

### Activity 1
Convert the road information in the `accident` document into the standard name as used in the `road` document.

You're starting with information like this:

In [5]:
pd.DataFrame(list(accidents.find({}, ['1st_Road_Class', '1st_Road_Number', '2nd_Road_Class', '2nd_Road_Number'], 
                                 limit=5)),
             columns=['1st_Road_Class', '1st_Road_Number', '2nd_Road_Class', '2nd_Road_Number'])

| | 1st_Road_Class | 1st_Road_Number | 2nd_Road_Class | 2nd_Road_Number |
|---|---|---|---|---|
| 0 | 3 | 308 | 5 | 0 |
| 1 | 4 | 412 | 6 | 0 |
| 2 | 3 | 3220 | 6 | 0 |
| 3 | 5 | 0 | 6 | 0 |
| 4 | 4 | 325 | 6 | 0 |


... and this ...

In [6]:
sorted((code, label_of[label, code]) for label, code in label_of 
 if label == '2nd_Road_Class')

[(0, 'Not at junction or within 20 metres'),
 (1, 'Motorway'),
 (2, 'A(M)'),
 (3, 'A'),
 (4, 'B'),
 (5, 'C'),
 (6, 'Unclassified')]

and you have to convert it to the conventional road names used in the `roads` collection, like this:

In [7]:
pd.DataFrame(list(roads.find({}, ['Road', 'RCat'], limit=5)),
                             columns=['Road', 'RCat'])

| | Road | RCat |
|---|---|---|
| 0 | A5 | TR |
| 1 | A5 | TR |
| 2 | A40 | TR |
| 3 | A40 | TR |
| 4 | A40 | TR |


In [8]:
sorted((code, label_of[label, code]) for label, code in label_of 
 if label == 'RCat')

[('BR', 'Rural B road'),
 ('BU', 'Urban B road'),
 ('CR', 'Rural C road'),
 ('CU', 'Urban C road'),
 ('PM', 'Principal motorway'),
 ('PR', 'Rural principal road'),
 ('PU', 'Urban principal road'),
 ('TM', 'Trunk motorway'),
 ('TR', 'Rural trunk road'),
 ('TU', 'Urban trunk road'),
 ('UR', 'Rural U road'),
 ('UU', 'Urban U road')]

Pay particular attention to A(M) roads: they're not handled consistently in the data.

The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [12]:
# Solution from 15.4

def normalise_road_name(road_class, road_number):
    """Convert an accident's road class and number into a `roads`-style name.

    Returns None for unnumbered roads and for classes that have no road name.
    """
    if road_number == 0:
        return None
    if road_class == 1:  # Motorway
        return 'M' + str(road_number)
    elif road_class == 2:  # A(M): only the A1(M) keeps its brackets here
        if road_number == 1:
            return 'A1(M)'
        else:
            return 'A' + str(road_number) + 'M'
    elif 3 <= road_class <= 5:  # A, B and C roads share the same class labels
        return label_of[('2nd_Road_Class', road_class)] + str(road_number)
    else:  # Unclassified, or not at a junction
        return None

In [13]:
# test the function works ok
[(normalise_road_name(a['1st_Road_Class'], a['1st_Road_Number']),
 normalise_road_name(a['2nd_Road_Class'], a['2nd_Road_Number']))
for a in accidents.find(limit=20)]

[('A308', None),
 ('B412', None),
 ('A3220', None),
 (None, None),
 ('B325', None),
 ('A308', 'A3220'),
 ('A3216', 'A4'),
 ('B450', None),
 (None, None),
 (None, None),
 ('A315', None),
 ('A315', 'A3220'),
 ('A402', 'A4206'),
 ('B415', 'B450'),
 ('B450', 'B412'),
 ('A3217', None),
 (None, None),
 ('A3220', 'A3220'),
 ('B316', None),
 ('A4204', None)]

### Activity 2
When given an accident, find the closest road census document for the same road. Return `None` if there isn't one for this road.

Note that the relevant road could be either the accident's first or second recorded road.

Also note that you'll need to add

```
from bson.son import SON
```

and use the direct MongoDB command `geoNear` to find the nearest road segment to a given point. 

```
road_result = db.command(SON([('geoNear', 'roads'), 
                ('near', <the given point>),
                ('spherical', True),
                ('query', <document for additional features on road segment>),
                ('limit', 1)]))
```
For example, the cell below will pick an arbitrary accident, then find the trunk motorway segment (`RCat` of `TM`) nearest to it.

In [14]:
from bson.son import SON

a = accidents.find_one()
print(a['Accident_Index'], a['loc'])

nearest_road_result = db.command(SON([('geoNear', 'roads'), 
                ('near', a['loc']),                      
                ('spherical', True),
                ('query', {'RCat': 'TM'}),
                ('limit', 1)]))

print(nearest_road_result['results'][0]['obj']['CP'], 
      nearest_road_result['results'][0]['obj']['ONS LA Name'],
      nearest_road_result['results'][0]['obj']['Road'])

201201BS70001 {'coordinates': [-0.169101, 51.493429], 'type': 'Point'}
47892 Hounslow M4
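
The `geoNear` command was removed in MongoDB 4.2, so the cell above needs an older server. On newer servers the equivalent is the `$geoNear` aggregation stage. The sketch below only builds the pipeline, so it runs without a database. The function name `nearest_road_pipeline` and the choice of `dis` as the distance field are assumptions for illustration, not part of the original solution.

```python
def nearest_road_pipeline(point, query, limit=1):
    """Build an aggregation pipeline for the road documents nearest to `point`.

    `point` is a GeoJSON point and `query` is an ordinary MongoDB filter.
    The `$geoNear` stage has to be the first stage in the pipeline.
    """
    return [
        {'$geoNear': {
            'near': point,
            'distanceField': 'dis',  # each result gains a 'dis' field
            'spherical': True,
            'query': query,
        }},
        {'$limit': limit},
    ]


point = {'type': 'Point', 'coordinates': [-0.169101, 51.493429]}
pipeline = nearest_road_pipeline(point, {'RCat': 'TM'})
print(pipeline)

# With a live connection and a geospatial index on the roads collection:
# nearest = list(db.roads.aggregate(pipeline))
```

Note that the aggregation returns the road documents directly, with the distance stored in the `dis` field, rather than wrapping them as `results[i]['obj']`. Code written against the `geoNear` command output would need adjusting accordingly.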


The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [15]:
def road_for_accident(accident):
    first_road_name = normalise_road_name(accident['1st_Road_Class'], accident['1st_Road_Number'])
    second_road_name = normalise_road_name(accident['2nd_Road_Class'], accident['2nd_Road_Number'])
    
    if first_road_name:
        first_road_result = db.command(SON([('geoNear', 'roads'),
                                           ('near', accident['loc']),
                                           ('spherical', True),
                                           ('query', {'Road': first_road_name}),
                                           ('limit', 1)]))
    else:
        first_road_result = {'results': []}
        
    if second_road_name:
        second_road_result = db.command(SON([('geoNear', 'roads'),
                                           ('near', accident['loc']),
                                           ('spherical', True),
                                           ('query', {'Road': second_road_name}),
                                           ('limit', 1)]))
    else:
        second_road_result = {'results': []}
    
    all_results = first_road_result['results'] + second_road_result['results']
    sorted_results = sorted(all_results, key=lambda r: r['dis'])
    
    if sorted_results:
        nearest_road = sorted_results[0]['obj']
        return nearest_road
    else:
        return None
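
The final step of `road_for_accident` merges the hits for both roads and keeps the closest. The sketch below isolates that step so it can be tried without a database. The helper name `nearest_of` is hypothetical, and the `dis` values are invented for illustration; only the result shape (`dis` and `obj` keys) follows the `geoNear` output used above.

```python
def nearest_of(*result_lists):
    """Return the 'obj' of the closest hit across several result lists, or None."""
    hits = [hit for results in result_lists for hit in results]
    best = min(hits, key=lambda hit: hit['dis'], default=None)
    return best['obj'] if best is not None else None


# Invented distances; road names and census points taken from this Notebook.
first = [{'dis': 0.00004, 'obj': {'Road': 'A308', 'CP': 18268}}]
second = [{'dis': 0.00009, 'obj': {'Road': 'A3220', 'CP': 57668}}]

print(nearest_of(first, second))
print(nearest_of([], []))
```

Using `min` with `default=None` avoids sorting the whole list when only the closest hit is needed, and handles the case where neither road has a census point.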


In [17]:
# test it works
road_for_accident(accidents.find_one())

{'A-Junction': 'A3220',
 'AADFYear': 2012,
 'B-Junction': 'A4',
 'CP': 16854,
 'Fd2WMV': 1083.0,
 'FdAll_MV': 17673.0,
 'FdBUS': 1114.0,
 'FdCar': 12867.0,
 'FdHGV': 687.0,
 'FdHGVA3': 2.0,
 'FdHGVA5': 13.0,
 'FdHGVA6': 3.0,
 'FdHGVR2': 492.0,
 'FdHGVR3': 109.0,
 'FdHGVR4': 67.0,
 'FdLGV': 1922.0,
 'FdPC': 1232.0,
 'Latitude': 51.491641149098996,
 'LenNet': 1.8,
 'LenNet_miles': 1.11,
 'Longitude': -0.17207541184187886,
 'ONS GOR Name': 'London',
 'ONS LA Name': 'Kensington and Chelsea',
 'RCat': 'PU',
 'Road': 'A308',
 'S Ref E': 527000,
 'S Ref N': 178550,
 '_id': ObjectId('533ed2c589f6f9ee18baf4d9'),
 'loc': {'coordinates': [-0.17207541184187886, 51.491641149098996],
  'type': 'Point'}}

In [18]:
# test it works a lot
res = []
for a in accidents.find(limit=20):
    rfa = road_for_accident(a)
    if rfa:
        res.append((rfa['Road'], rfa['CP'],
                    normalise_road_name(a['1st_Road_Class'], a['1st_Road_Number']),
                    normalise_road_name(a['2nd_Road_Class'], a['2nd_Road_Number'])))
    else:
        res.append((None, None,
                    normalise_road_name(a['1st_Road_Class'], a['1st_Road_Number']),
                    normalise_road_name(a['2nd_Road_Class'], a['2nd_Road_Number'])))
res

[('A308', 16854, 'A308', None),
 (None, None, 'B412', None),
 ('A3220', 57668, 'A3220', None),
 (None, None, None, None),
 (None, None, 'B325', None),
 ('A308', 18268, 'A308', 'A3220'),
 ('A4', 46120, 'A3216', 'A4'),
 (None, None, 'B450', None),
 (None, None, None, None),
 (None, None, None, None),
 ('A315', 38590, 'A315', None),
 ('A315', 18366, 'A315', 'A3220'),
 ('A402', 16403, 'A402', 'A4206'),
 (None, None, 'B415', 'B450'),
 (None, None, 'B450', 'B412'),
 ('A3217', 37707, 'A3217', None),
 (None, None, None, None),
 ('A3220', 73640, 'A3220', 'A3220'),
 (None, None, 'B316', None),
 ('A4204', 27734, 'A4204', None)]

### Activity 3
For each accident, add its `Accident_Index` to the nearest relevant road census point. Each road census point should have a **list** of accident indexes if there are any accidents nearby.

Note that this could take **several hours** to run to completion, so you should test it works on a small set of accident records first. 

Notes:
1. Use the `$push` update operator to add to the list of accident indexes
2. Make sure this process is [idempotent](https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning): you don't want accident indexes added again if you run the procedure again (do a bulk `$unset` before you start).
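
To see why the reset makes the procedure idempotent, here is a minimal in-memory sketch that uses plain dictionaries in place of the `roads` collection. The function name `allocate` and the `roads_by_cp` structure are hypothetical; each step is commented with the MongoDB operator it stands in for.

```python
import copy


def allocate(roads_by_cp, assignments):
    """Reset cached accident lists, then add each accident to its road."""
    for road in roads_by_cp.values():
        road.pop('nearby_accidents', None)   # stands in for $unset
        road['nearby_accident_count'] = 0    # stands in for $set
    for accident_index, cp in assignments:
        road = roads_by_cp[cp]
        road.setdefault('nearby_accidents', []).append(accident_index)  # $push
        road['nearby_accident_count'] += 1   # $inc
    return roads_by_cp


roads_by_cp = {16854: {'CP': 16854}, 57668: {'CP': 57668}}
assignments = [('201201BS70001', 16854), ('201201BS70066', 16854)]

first_run = copy.deepcopy(allocate(roads_by_cp, assignments))
second_run = allocate(roads_by_cp, assignments)
print(first_run == second_run)
```

Without the reset loop, the second run would append both accident indexes again and double the count.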

The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [19]:
# First remove all existing cached accident indexes
roads.update_many({}, {'$unset': {'nearby_accidents': True},
                      '$set': {'nearby_accident_count': 0}})

<pymongo.results.UpdateResult at 0x7fdc30fd1990>

In [22]:
# Include the accident indexes in the road documents.
# Also maintain a count of how many accidents there are for each record.
for a in accidents.find():
    rfa = road_for_accident(a)
    if rfa and a['Accident_Index'] not in rfa.get('nearby_accidents', []):
        roads.update_one({'_id': rfa['_id']},
                         {'$push': {'nearby_accidents': a['Accident_Index']},
                          '$inc': {'nearby_accident_count': 1}})

In [23]:
list(roads.find({'CP': 16854}))

[{'A-Junction': 'A3220',
  'AADFYear': 2012,
  'B-Junction': 'A4',
  'CP': 16854,
  'Fd2WMV': 1083.0,
  'FdAll_MV': 17673.0,
  'FdBUS': 1114.0,
  'FdCar': 12867.0,
  'FdHGV': 687.0,
  'FdHGVA3': 2.0,
  'FdHGVA5': 13.0,
  'FdHGVA6': 3.0,
  'FdHGVR2': 492.0,
  'FdHGVR3': 109.0,
  'FdHGVR4': 67.0,
  'FdLGV': 1922.0,
  'FdPC': 1232.0,
  'Latitude': 51.491641149098996,
  'LenNet': 1.8,
  'LenNet_miles': 1.11,
  'Longitude': -0.17207541184187886,
  'ONS GOR Name': 'London',
  'ONS LA Name': 'Kensington and Chelsea',
  'RCat': 'PU',
  'Road': 'A308',
  'S Ref E': 527000,
  'S Ref N': 178550,
  '_id': ObjectId('533ed2c589f6f9ee18baf4d9'),
  'loc': {'coordinates': [-0.17207541184187886, 51.491641149098996],
   'type': 'Point'},
  'nearby_accident_count': 11,
  'nearby_accidents': ['201201BS70001',
   '201201BS70066',
   '201201BS70142',
   '201201BS70210',
   '201201BS70469',
   '201201BS70512',
   '201201BS70535',
   '201201BS70598',
   '201201BS70620',
   '201201BS70624',
   '201201TA00930']}

## Summary
This Notebook has worked through how to relate datasets to give different perspectives on the same phenomenon. The way accidents vary across space, time and conditions is interesting, but it gives only a partial picture without an understanding of the background rate of unremarkable journeys. However, different datasets are unlikely ever to align perfectly, and will need some manipulation before the lessons from one can be applied to another. 

The next Notebook looks at what new insights we can gain from the combination of these two datasets.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `15.5 Investigating accident rates`.
