# Roads

This Notebook will show you the roads dataset and integrate it with the accidents data.

First, some boilerplate imports.

In [None]:
# Import the required libraries

import pymongo
import datetime
import collections

import pandas as pd
import scipy.stats

import matplotlib as mpl
import matplotlib.pyplot as plt  # plt is used by the plotting cells later in this Notebook

mpl.rcParams['figure.figsize'] = (15, 15) # Reset the base size of figures so they're large enough to be useful.

import folium

In [None]:
# Open a connection to the Mongo server, open the accidents database and name the collections of accidents, labels and roads
client = pymongo.MongoClient('mongodb://localhost:27351/')

db = client.accidents
accidents = db.accidents
labels = db.labels
roads = db.roads

In [None]:
# Load the expanded names of keys and human-readable codes into memory
expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]
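
The lookup built above is keyed by `(label, code)` tuples. Because it is a `defaultdict(str)`, an unknown code returns an empty string rather than raising a `KeyError`. The following is a minimal sketch of that behaviour. It uses made-up label documents: the `Weather` label and all the descriptions are illustrative placeholders, not the contents of the `labels` collection.

```python
import collections

# Hypothetical stand-ins for documents in the labels collection.
sample_label_docs = [
    {'label': 'RCat', 'codes': {'PU': 'example urban category',
                                'PR': 'example rural category'}},
    {'label': 'Weather', 'codes': {'1': 'example code one'}},
]

label_of = collections.defaultdict(str)
for l in sample_label_docs:
    for c in l['codes']:
        try:
            # Document keys are always strings, so numeric codes are converted to ints.
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError:
            # Non-numeric codes (such as 'PU') are kept as strings.
            label_of[l['label'], c] = l['codes'][c]

print(label_of['RCat', 'PU'])
print(label_of['Weather', 1])
print(repr(label_of['RCat', 'XX']))  # unknown code: empty string, no KeyError
```

One thing to be aware of: looking up a missing key in a `defaultdict` also inserts that key, so any later loop over `label_of` will include it with an empty label.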

In [None]:
def results_to_table(results, index_name, column_name, results_name, 
                     fillna=None,
                     relabel_index=False, relabel_columns=False,
                     index_label=None, column_label=None):
    
    # Move items in dicts-of-dicts to the top level.
    def flatten(d):
        new_d = {}
        for k in d:
            if isinstance(d[k], dict):
                new_d.update(flatten(d[k]))
            else:
                new_d[k] = d[k]
        return new_d

    df = pd.DataFrame([flatten(r) for r in results])
    df = df.pivot(index=index_name, columns=column_name, values=results_name)
    
    # Optionally, fiddle with names and labels to make the DataFrame pretty.
    if fillna is not None:
        df.fillna(fillna, inplace=True)
    if relabel_columns:
        df.columns = [label_of[column_name, c] for c in df.columns]
    if relabel_index:
        df.index = [label_of[index_name, r] for r in df.index]
    if column_label:
        df.columns.name = column_label
    else:
        df.columns.name = column_name
    if index_label:
        df.index.name = index_label
    else:
        df.index.name = index_name
    return df
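
The `flatten` helper inside `results_to_table` matters when an aggregation groups on more than one key, because the `_id` of each result is then itself a dict. The sketch below repeats the helper on its own and applies it to a made-up result (the district name and count are placeholders).

```python
def flatten(d):
    """Move items in dicts-of-dicts to the top level."""
    new_d = {}
    for k in d:
        if isinstance(d[k], dict):
            new_d.update(flatten(d[k]))
        else:
            new_d[k] = d[k]
    return new_d

# Hypothetical result, shaped like the output of a $group stage with a compound _id.
result = {'_id': {'RCat': 'PU', 'ONS LA Name': 'Example District'}, 'count': 12}

flat = flatten(result)
print(flat)
```

After flattening, `RCat`, `ONS LA Name` and `count` are all top-level keys, so each can be named as the index, columns or values of the pivot. Note that a nested key with the same name as a top-level key would overwrite it.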

## Looking at roads

What's in a 'road' document?

In [None]:
roads.find_one()

It's a section of road with totals of different vehicle types that passed along that section. Road sections have two ends, either junctions or region boundaries. The `Fd...` keys are the number of vehicles of a particular class that passed this point (in the forward direction, but there's no 'reverse' direction specified).

What do the codes mean?

In [None]:
expanded_name['FdAll_MV']

In [None]:
expanded_name['FdHGVA6']

What are the road categories?

In [None]:
[(c, label_of['RCat', c]) for k, c in label_of if k == 'RCat']

Note that not every road segment has a location. We'll have to bear that in mind when doing geographic analysis of the roads dataset.

In [None]:
# Cursor.count() was removed in PyMongo 4; count_documents() is the replacement.
roads.count_documents({'loc': {'$exists': False}})

In [None]:
roads.count_documents({'loc': {'$exists': True}})

## Plotting some road points

To start with, let's just plot some road segments on the map to see where they are. We'll reuse the map-making procedures from Notebook 15.1.

In [None]:
def add_accidents_markers(the_map, query, number_of_sides=5, fill_color='#769d96', limit=0,
                     radius=5, rotation=54):
    for a in accidents.find(query, 
                            ['loc.coordinates'],
                            limit=limit):
        folium.RegularPolygonMarker(location=[a['loc']['coordinates'][1], a['loc']['coordinates'][0]], 
                     number_of_sides=number_of_sides, radius=radius, rotation=rotation,
                                   fill_color=fill_color).add_to(the_map)  

In [None]:
def add_roads_markers(the_map, query, number_of_sides=5, fill_color='#769d96', limit=0,
                     radius=5, rotation=54):
    for r in roads.find(query, 
                        ['loc.coordinates'],
                       limit=limit):
        folium.RegularPolygonMarker(location=[r['loc']['coordinates'][1], r['loc']['coordinates'][0]], 
                     number_of_sides=number_of_sides, radius=radius, rotation=rotation,
                                   fill_color=fill_color).add_to(the_map)    

In [None]:
m = folium.Map([55, -3], zoom_start=6)    

add_roads_markers(m, {'loc': {'$exists': True}}, limit=1000)
m

This shows that the road data covers Britain but includes nothing in Ireland.

## Milton Keynes
Let's zoom in a bit on Milton Keynes, the home of the Open University. This polygon defines the area we're interested in.

In [None]:
milton_keynes = {'type': 'Polygon',
                               'coordinates': [[[-0.869719, 52.066547], 
                                                [-0.651709, 52.066547], 
                                                [-0.651709, 51.997161], 
                                                [-0.869719, 51.997161],
                                                [-0.869719, 52.066547]
                                                ]]}

min_mk_lat = min(p[1] for p in milton_keynes['coordinates'][0])
max_mk_lat = max(p[1] for p in milton_keynes['coordinates'][0])
min_mk_lon = min(p[0] for p in milton_keynes['coordinates'][0])
max_mk_lon = max(p[0] for p in milton_keynes['coordinates'][0])

mk_centre = [min_mk_lat + (max_mk_lat - min_mk_lat) / 2, min_mk_lon + (max_mk_lon - min_mk_lon) / 2]

mk_region_query = {'loc': {'$geoWithin': {'$geometry': milton_keynes}}}
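
GeoJSON stores each point as `[longitude, latitude]`, whereas folium expects `[latitude, longitude]`, which is why the indices are swapped when computing `mk_centre` and when placing markers. The polygon's first and last points are identical because a GeoJSON ring must be closed. As a minimal sketch, the centre calculation can be wrapped in a helper (the name `bbox_centre` is an assumption for illustration, not part of the Notebook):

```python
def bbox_centre(polygon):
    """Return [lat, lon] for the centre of a GeoJSON polygon's bounding box."""
    ring = polygon['coordinates'][0]   # outer ring: a list of [lon, lat] pairs
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return [(min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2]

milton_keynes = {'type': 'Polygon',
                 'coordinates': [[[-0.869719, 52.066547],
                                  [-0.651709, 52.066547],
                                  [-0.651709, 51.997161],
                                  [-0.869719, 51.997161],
                                  [-0.869719, 52.066547]]]}

centre = bbox_centre(milton_keynes)
print(centre)
```

This gives the centre of the bounding box, which coincides with the centre of the region here only because the polygon is a rectangle.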

In [None]:
# Cursor.count() was removed in PyMongo 4; count_documents() is the replacement.
roads.count_documents(mk_region_query)

In [None]:
[r for r in roads.find(mk_region_query, 
                       {'FdAll_MV':1, 'Road':1, 'RCat':1, 'LenNet':1, '_id':0})]

In [None]:
mk_region_query

In [None]:
m = folium.Map(mk_centre, zoom_start=12)    
add_accidents_markers(m, mk_region_query, fill_color='#ff0000', number_of_sides=6, radius=4)
add_roads_markers(m, mk_region_query, fill_color='#0000ff', number_of_sides=4, radius=10)
m

This clearly shows that not all the roads have traffic flow data. 

# Exploring the roads data
Let's have a look at some of the numbers associated with the traffic flow data. We'll load the data into a DataFrame and make some graphs.

In [None]:
mpl.rcParams['figure.figsize'] = (8, 8)

How many of each type of road section are there, and how long are they?

In [None]:
pipeline = [{'$group': {'_id': '$RCat',
                        'length': {'$avg': '$LenNet'},
                        'count': {'$sum': 1}}}]
results = list(roads.aggregate(pipeline))
results
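
To make clear what the `$group` stage computes, here is a rough pure-Python equivalent run over a few made-up road documents (the `LenNet` values are invented for illustration). It is a sketch of the logic only, not of how MongoDB executes the pipeline.

```python
import collections

# Hypothetical road documents; only the fields used by the pipeline are shown.
sample_roads = [
    {'RCat': 'PU', 'LenNet': 1.0},
    {'RCat': 'PU', 'LenNet': 2.0},
    {'RCat': 'PR', 'LenNet': 6.0},
]

totals = collections.defaultdict(lambda: {'length_sum': 0.0, 'count': 0})
for doc in sample_roads:
    group = totals[doc['RCat']]            # '_id': '$RCat'
    group['length_sum'] += doc['LenNet']   # running total needed for '$avg'
    group['count'] += 1                    # 'count': {'$sum': 1}

sample_results = [{'_id': rcat,
                   'length': g['length_sum'] / g['count'],  # mean section length
                   'count': g['count']}
                  for rcat, g in totals.items()]
print(sample_results)
```

Each output document has one `_id` per road category, the mean section length for that category and the number of sections in it.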

In [None]:
road_lens_df = pd.DataFrame(results)
road_lens_df.set_index('_id', inplace=True)
road_lens_df.index.name = 'RCat'
road_lens_df['category'] = [label_of['RCat', c] for c in road_lens_df.index]
road_lens_df

In [None]:
plt.scatter(road_lens_df['count'], 
            road_lens_df['length']
            )
plt.xlabel('Count')
plt.ylabel('Average length')
for r in road_lens_df.iterrows():
    plt.annotate(r[1]['category'], xy=(r[1]['count'], r[1]['length']),
                xytext=(10, 5), textcoords = 'offset points')
plt.show()

Unsurprisingly, rural road sections are longer than urban road sections. There are more "principal" than "trunk" road sections, probably because "trunk" roads are designated major routes.

But what are the principal motorways?

In [None]:
roads.distinct('Road', {'RCat': 'PM'})

The average lengths shown so far don't tell us about the distribution of lengths of different roads.

In [None]:
road_lengths_df = pd.DataFrame(list(roads.find({}, ['RCat', 'LenNet'])))
road_lengths_df.describe()

In [None]:
road_lengths_df['LenNet'].hist()

Most road sections are very short, with a few that are longer. 

Is there a difference between rural and urban sections?

In [None]:
isUrban = road_lengths_df.apply(lambda r: r['RCat'][1] == 'U', axis=1)
isRural = road_lengths_df.apply(lambda r: r['RCat'][1] == 'R', axis=1)

In [None]:
road_lengths_df[isUrban]['LenNet'].describe()

In [None]:
road_lengths_df[isRural]['LenNet'].describe()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

road_lengths_df[isUrban]['LenNet'].hist(ax=ax, alpha=0.5, color='red')
road_lengths_df[isRural]['LenNet'].hist(ax=ax, alpha=0.5, color='green')

This shows that, while urban and rural road sections have similarly shaped length distributions (many short sections and a few longer ones), rural sections tend to be much longer than urban ones.

## Looking at districts
Which districts have the most roads, and the longest roads?

We'll use a pipeline to find the data and plot it on a scatter plot.

In [None]:
pipeline = [{'$group': {'_id': '$ONS LA Name',
                        'length': {'$sum': '$LenNet'},
                        'count': {'$sum': 1}}}]
results = list(roads.aggregate(pipeline))
results

In [None]:
ons_lens_df = pd.DataFrame(results)
ons_lens_df.set_index('_id', inplace=True)
ons_lens_df.index.name = 'LA'
ons_lens_df

In [None]:
plt.scatter(ons_lens_df['count'], 
            ons_lens_df['length']
            )
plt.xlabel('Count')
plt.ylabel('Total length')
plt.show()

What are those two outliers (the greatest total length and the most road sections)?

In [None]:
# Which district has the most road sections?
ons_lens_df.loc[ons_lens_df['count'].idxmax()]

In [None]:
# Which district has the longest total of road sections?
ons_lens_df.loc[ons_lens_df['length'].idxmax()]

### Activity 1
Which districts have the most, and longest, road networks, when split between rural and urban?

Generate data that shows the number of road segments, and total length of road segments, grouped by both district and whether the road is rural or urban. Create scatter plots to show districts by rural road count vs urban road count, and rural road length vs urban road length. 

Comment on your findings.

**Hint**: You can tell if a road is rural or urban from the second character of the road category code, `R` or `U`. If you're using an aggregation pipeline to find the data, use `'class': {'$substr': ['$RCat', 1, 1]}` inside a `$project` stage to pick out the appropriate character.
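
The `$substr` operator takes `[string, start, length]` with a zero-based start, so `['$RCat', 1, 1]` picks out one character beginning at the second position. A small Python analogue may help check your understanding. The helper name `substr` is made up for this sketch, and `PU` and `PR` are assumed codes that follow the pattern described in the hint.

```python
def substr(s, start, length):
    """Python analogue of {'$substr': [s, start, length]} for ASCII strings."""
    return s[start:start + length]

for rcat in ['PU', 'PR', 'PM']:
    print(rcat, substr(rcat, 1, 1))
```

Note that a motorway code such as `PM` yields `M`, so it falls into neither the rural nor the urban class.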

The solution is in the [`15.3solutions`](15.3solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## Traffic volume distributions
What can we tell about how heavily used different roads are?

In [None]:
traffic_volume_df = pd.DataFrame(list(roads.find({}, ['Road', 'RCat', 'LenNet', 'FdAll_MV'])))
traffic_volume_df.describe()

In [None]:
traffic_volume_df['FdAll_MV'].hist(bins=20)

In [None]:
plt.scatter(traffic_volume_df['FdAll_MV'], 
            traffic_volume_df['LenNet']
            )
plt.xlabel('Volume')
plt.ylabel('Length')
plt.show()

In [None]:
isUrban = traffic_volume_df.apply(lambda r: r['RCat'][1] == 'U', axis=1)
isRural = traffic_volume_df.apply(lambda r: r['RCat'][1] == 'R', axis=1)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

traffic_volume_df[isUrban]['FdAll_MV'].hist(ax=ax, alpha=0.5, color='red', bins=20)
traffic_volume_df[isRural]['FdAll_MV'].hist(ax=ax, alpha=0.5, color='green', bins=20)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

rurals = traffic_volume_df[isRural].sample(300)
urbans = traffic_volume_df[isUrban].sample(300)

ax.scatter(rurals['FdAll_MV'], rurals['LenNet'],
           color='green', alpha=0.3
            )
ax.scatter(urbans['FdAll_MV'], urbans['LenNet'],
           color='red', alpha=0.3
            )

plt.xlabel('Volume')
plt.ylabel('Length')
plt.show()

This shows the different patterns of road use in rural and urban areas: urban road sections tend to carry higher volumes on shorter segments than rural ones.

In [None]:
# Which road segment has the highest traffic?
# idxmax() returns an index label, so look it up with .loc rather than .iloc.
traffic_volume_df.loc[traffic_volume_df['FdAll_MV'].idxmax()]['Road']

In [None]:
# What are the busiest road sections?
traffic_volume_df.sort_values(by='FdAll_MV', ascending=False).head(10)

### Activity 2
Do different types of roads have different mixes of traffic? For each road category, find the average daily flow when averaged across all road segments of that category. Place the results in a DataFrame and plot them as a bar chart.

Investigate whether the mix of vehicle types is different on different road types. Use a suitable statistical test to determine if the differences you see are significant (you may want to refer back to Notebook 14.3).

Use just the total HGV counts, not the counts for each type of goods vehicle.

The solution is in the [`15.3solutions`](15.3solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `15.4 Allocating accidents to roads`.