# Allocating accidents to roads

This Notebook allocates each accident to the appropriate road segment, so that you can find accident hotspots based on accident rates relative to traffic density.

First, some boilerplate imports.

In [None]:
# Import the required libraries

import pymongo
import datetime
import collections

import pandas as pd
import scipy.stats

In [None]:
# Open a connection to the Mongo server, open the accidents database and name the collections of accidents, labels and roads
client = pymongo.MongoClient('mongodb://localhost:27351/')

db = client.accidents
accidents = db.accidents
labels = db.labels
roads = db.roads

In [None]:
# Load the expanded names of keys and human-readable codes into memory
expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]

You'd expect the number of accidents to vary proportionally with the number of vehicles on a road, and the number of vehicle journeys to vary with population. This means that simply looking at the number of accidents in a location or region doesn't tell us much about the risk of using that road.
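To make the point concrete, here is a minimal sketch using invented figures for two hypothetical road sections (the numbers, the section names and the helper `accidents_per_thousand_vehicles` are illustrative assumptions, not values from the dataset). It shows how the section with fewer accidents can still have the higher rate once traffic volume is taken into account.

```python
# Invented figures for two hypothetical road sections (NOT from the dataset).
sections = {
    'busy section': {'accidents': 30, 'daily_vehicles': 60000},
    'quiet section': {'accidents': 10, 'daily_vehicles': 5000},
}


def accidents_per_thousand_vehicles(accidents, daily_vehicles):
    """A crude rate: accidents per thousand vehicles using the road each day."""
    return 1000 * accidents / daily_vehicles


rates = {name: accidents_per_thousand_vehicles(**figures)
         for name, figures in sections.items()}

for name, rate in rates.items():
    print(f'{name}: {rate:.1f} accidents per thousand daily vehicles')
```

The busy section has three times as many accidents, but the quiet section has four times the rate.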

To start work on the 'danger' of each road section, we need to identify which accidents took place on it. 

The simple approach is to associate each accident with the nearest road census point. However, this has two problems. One is that the accident may be on a minor road that isn't associated with the traffic census point. The other is that the road on which the accident occurs may not be the one with the closest road census point (for instance, if the accident occurs near a junction).

Therefore, we need to annotate each road census document with the accidents that occur on that road section. To avoid too much duplication, we'll just annotate the `road` document with the relevant accident indexes. If we want more details about the accidents, we can look them up directly from the `accidents` collection.

### Activity 1
Convert the road information in the `accident` document into the standard name as used in the `road` document.

You're starting with information like this:

In [None]:
pd.DataFrame(list(accidents.find({}, ['1st_Road_Class', '1st_Road_Number', '2nd_Road_Class', '2nd_Road_Number'], 
                                 limit=5)),
             columns=['1st_Road_Class', '1st_Road_Number', '2nd_Road_Class', '2nd_Road_Number'])

... and this ...

In [None]:
sorted((code, label_of[label, code]) for label, code in label_of 
 if label == '2nd_Road_Class')

and you have to convert it to the 'typical' road numbers like this:

In [None]:
pd.DataFrame(list(roads.find({}, ['Road', 'RCat'], limit=5)),
                             columns=['Road', 'RCat'])

In [None]:
sorted((code, label_of[label, code]) for label, code in label_of 
 if label == 'RCat')

Pay particular attention to A(M) roads: they're not handled consistently in the data.
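
As a starting point only, the sketch below shows one possible way to build a road name from a class code and a road number. The prefix table and the helper name `road_name` are assumptions made for illustration: check the class codes against `label_of` before relying on them. The sketch makes no attempt to resolve the inconsistently recorded A(M) roads, which you will still need to handle yourself.

```python
# Assumed class codes (verify against label_of): 1 motorway, 2 A(M), 3 A, 4 B.
ASSUMED_PREFIXES = {1: 'M', 2: 'A', 3: 'A', 4: 'B'}


def road_name(road_class, road_number):
    """Return a name such as 'M1' or 'A1(M)', or None if no name can be built."""
    prefix = ASSUMED_PREFIXES.get(road_class)
    # Unknown classes and missing, zero or negative road numbers give no name.
    if prefix is None or not road_number or road_number < 0:
        return None
    name = prefix + str(road_number)
    if road_class == 2:
        # A(M) roads carry an '(M)' suffix in this sketch.
        name += '(M)'
    return name


print(road_name(1, 1), road_name(2, 1), road_name(3, 38), road_name(6, 0))
```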

The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

### Activity 2
When given an accident, find the closest road census document for the same road. Return `None` if there isn't one for this road.

Note that the relevant road could be either the accident's first or second recorded road.

Also note that you'll need to add

```
from bson.son import SON
```

and use the direct MongoDB command `geoNear` to find the nearest road segment to a given point. 

```
road_result = db.command(SON([('geoNear', 'roads'), 
                ('near', <the given point>),
                ('spherical', True),
                ('query', <document for additional features on road segment>),
                ('limit', 1)]))
```

For example, the cell below will pick an arbitrary accident, then find the motorway segment nearest to it.

In [None]:
from bson.son import SON

a = accidents.find_one()
print(a['Accident_Index'], a['loc'])

nearest_road_result = db.command(SON([('geoNear', 'roads'), 
                ('near', a['loc']),                      
                ('spherical', True),
                ('query', {'RCat': 'TM'}),
                ('limit', 1)]))

print(nearest_road_result['results'][0]['obj']['CP'], 
      nearest_road_result['results'][0]['obj']['ONS LA Name'],
      nearest_road_result['results'][0]['obj']['Road'])
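
The `geoNear` database command was removed in MongoDB 4.2, so the cell above only works against older servers. If you are running a newer server, the same search can be expressed with the `$geoNear` aggregation stage. The following is a minimal sketch, assuming the `roads` collection has a geospatial index on its location field; the helper name `nearest_road_pipeline`, the output field name `distance` and the example coordinates are all illustrative assumptions.

```python
def nearest_road_pipeline(point, road_query):
    """Build an aggregation pipeline that returns the single nearest road.

    `point` is the location to search from (for example an accident's `loc`);
    `road_query` restricts which road documents are considered.
    """
    return [
        # $geoNear must be the first stage of the pipeline.
        {'$geoNear': {
            'near': point,
            'distanceField': 'distance',  # assumed name for the computed distance
            'spherical': True,
            'query': road_query,
        }},
        # Keep only the closest match.
        {'$limit': 1},
    ]


# Illustrative point only; in the Notebook you would pass a['loc'] instead.
pipeline = nearest_road_pipeline(
    {'type': 'Point', 'coordinates': [-0.75, 52.04]},
    {'RCat': 'TM'})
print(pipeline)

# With a live connection, the pipeline would be run like this:
#     nearest = list(db.roads.aggregate(pipeline))
```

Each result is the road document itself with the extra `distance` field added, rather than the `results[0]['obj']` structure returned by the old command.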

The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

### Activity 3
For each accident, add its `Accident_Index` to the nearest relevant road census point. Each road census point should have a **list** of accident indexes if there are any accidents nearby.

Note that this could take **several hours** to run to completion, so you should test that it works on a small set of road records first. 

Notes:
1. Use the `$push` update operator to add to the list of accident indexes.
2. Make sure this process is [idempotent](https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning): rerunning the procedure must not add the same accident indexes a second time (do a bulk `$unset` before you start).
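
The two notes above can be sketched as a pair of small helpers. This is only an outline under stated assumptions: the field name `accident_indexes` and the helper names are hypothetical, and the functions accept any object that offers PyMongo's `update_many` and `update_one` methods (such as the `roads` collection).

```python
def reset_accident_lists(roads, field='accident_indexes'):
    """Remove the accident list from every road document so a rerun starts clean."""
    return roads.update_many({}, {'$unset': {field: ''}})


def attach_accident(roads, road_id, accident_index, field='accident_indexes'):
    """Append one accident index to the list held by a single road document."""
    return roads.update_one({'_id': road_id},
                            {'$push': {field: accident_index}})
```

Call `reset_accident_lists(roads)` once before the main loop, then `attach_accident(roads, ...)` for each accident that has a matching road. An alternative design is to replace `$push` with `$addToSet`, which does not add a value that is already in the list.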

The solution is in the [`15.4solutions`](15.4solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## Summary
This Notebook has worked through how to relate datasets to give different perspectives on the same phenomenon. Differing occurrences of accidents across space, time, and conditions are interesting, but can only give a partial picture if we don't have an understanding of the background rate of unremarkable journeys. However, different datasets are unlikely to ever align perfectly and will require some manipulation to allow the lessons from one set to be applied to another. 

The next Notebook looks at what new insights we can gain from the combination of these two datasets.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `15.5 Investigating accident rates`.
