# Accidents over time
Given a sharded MongoDB cluster, we can now use it to store and query the entire accidents database, from 2009–12 inclusive. 

Note that this database only just fits in the memory of the VM. Before you start the activities in this Notebook, make sure you have no other running Notebooks. 

If any of the queries takes more than a few minutes to complete, it's probably because one of the shard servers has run out of memory and failed. Rerun the first three cells and try again. 

If you get more than a couple of failures, reboot the whole VM (_not_ suspend) and try again.

Stop the single-server Mongo instance and start the shard cluster. Check the output of the cell below: it may well report failures the first time you run it. If so, just run the cell again until it works.

In [1]:
!sudo service mongod stop
!sudo /etc/mongo-shards-down
!sudo /etc/mongo-shards-up

/etc/mongo-shards-down: line 4: /vagrant/logs/mongocluster_*pid: No such file or directory
Killing process  found in /vagrant/logs/mongocluster_*pid
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
Wait a mo to check processes are down...
...ok
Starting config server...
about to fork child process, waiting until server is ready for connections.
forked process: 4334
child process started successfully, parent exiting
Sleep for 5...
...done
Configuring config replica set
MongoDB shell version v3.4.4
connecting to: mongodb://127.0.0.1:57050/
MongoDB server version: 3.4.4
{
	"_id" : "c1",
	"members" : [
		{
			"_id" : 0,
			"host" : "localhost:57050"
		}
	]
}
{
	"info" : "try querying local.system.replset to see current configuration",
	"ok" : 0,
	"errmsg" : "already initialized",
	"code" : 23,
	"codeName" : "AlreadyInitialized"
}
bye
Sleep for 5...
...done
2018-03-12T16:36:47.819+0000 W SHARDING [main] Running a sharded cluster with fewer than 3 

In [2]:
# Import the required libraries

import collections
import datetime
import matplotlib as mpl

import pandas as pd
import scipy.stats

import folium
import uuid

import pymongo

In [3]:
# Open a connection to the Mongo server, open the accidents database and name the collections of accidents and labels

# Note the different port number for this cluster
client = pymongo.MongoClient('mongodb://localhost:27017/')

db = client.accidents
accidents = db.accidents
labels = db.labels
roads = db.roads

## Rerun cells above
If a query fails, rerun the cells above to restart the Mongo shard cluster and reopen the connection.

In [4]:
# Load the expanded names of keys and human-readable codes into memory

expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]
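To see what the second loop does without needing the database, here is a minimal sketch that runs the same logic over two stand-in label documents. Both documents are assumptions made up for illustration (`Example_Field` in particular is hypothetical); only their shape, a `label` plus a `codes` subdocument, is taken from the loop above. Field names in a MongoDB document are always strings, which is presumably why numeric codes are converted back with `int()` and anything non-numeric is kept as a string.

```python
import collections

# Stand-in label documents (illustrative only, not read from the database)
sample_labels = [
    {'label': 'Accident_Severity',
     'codes': {'1': 'Fatal', '2': 'Serious', '3': 'Slight'}},
    {'label': 'Example_Field',          # hypothetical field with a non-numeric code
     'codes': {'A': 'Example value'}},
]

label_of = collections.defaultdict(str)
for l in sample_labels:
    for c in l['codes']:
        try:
            # Numeric codes are stored as string keys; look them up as ints
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError:
            # Non-numeric codes stay as strings
            label_of[l['label'], c] = l['codes'][c]

print(label_of['Accident_Severity', 3])      # Slight
print(label_of['Example_Field', 'A'])        # Example value
print(repr(label_of['Accident_Severity', 99]))  # '' for an unknown code
```

Because `label_of` is a `defaultdict(str)`, looking up an unknown code returns an empty string rather than raising a `KeyError`.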

The small_accidents database only included data for 2012. The full database includes data from 2009–12. How much data is that?

In [5]:
accidents.find().count()

615013

In [6]:
roads.find().count()

103529

We can summarise the data with an aggregation pipeline that will show the number of accidents each month over the four years.

In [7]:
pipeline = [
    {'$project': {'month': {'$month': '$Datetime'},
                  'year': {'$year': '$Datetime'}}},
    {'$group': {'_id': {'month': '$month', 'year': '$year'},
                'count': {'$sum': 1}}},
    {'$sort': {'_id': 1}}
]
results = list(accidents.aggregate(pipeline))
results

[{'_id': {'month': 1, 'year': 2009}, 'count': 13417},
 {'_id': {'month': 1, 'year': 2010}, 'count': 10637},
 {'_id': {'month': 1, 'year': 2011}, 'count': 11761},
 {'_id': {'month': 1, 'year': 2012}, 'count': 11836},
 {'_id': {'month': 2, 'year': 2009}, 'count': 10950},
 {'_id': {'month': 2, 'year': 2010}, 'count': 11724},
 {'_id': {'month': 2, 'year': 2011}, 'count': 11150},
 {'_id': {'month': 2, 'year': 2012}, 'count': 10863},
 {'_id': {'month': 3, 'year': 2009}, 'count': 13202},
 {'_id': {'month': 3, 'year': 2010}, 'count': 13165},
 {'_id': {'month': 3, 'year': 2011}, 'count': 12432},
 {'_id': {'month': 3, 'year': 2012}, 'count': 12171},
 {'_id': {'month': 4, 'year': 2009}, 'count': 12715},
 {'_id': {'month': 4, 'year': 2010}, 'count': 12248},
 {'_id': {'month': 4, 'year': 2011}, 'count': 12342},
 {'_id': {'month': 4, 'year': 2012}, 'count': 10820},
 {'_id': {'month': 5, 'year': 2009}, 'count': 13811},
 {'_id': {'month': 5, 'year': 2010}, 'count': 13220},
 {'_id': {'month': 5, 'year'
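Note that sorting on `_id` orders the results by month first and then by year, because `month` is the first field in the `_id` subdocument. If a chronological list is wanted, one option is to re-sort on the Python side. The sketch below does this for a few rows copied from the output above; `sample_results` and `chronological` are names introduced here for illustration.

```python
# A few rows from the aggregation output above, in the order MongoDB returned them
sample_results = [
    {'_id': {'month': 1, 'year': 2009}, 'count': 13417},
    {'_id': {'month': 1, 'year': 2010}, 'count': 10637},
    {'_id': {'month': 2, 'year': 2009}, 'count': 10950},
    {'_id': {'month': 2, 'year': 2010}, 'count': 11724},
]

def chronological(rows):
    """Return the rows sorted by (year, month) instead of (month, year)."""
    return sorted(rows, key=lambda r: (r['_id']['year'], r['_id']['month']))

ordered = chronological(sample_results)
[(r['_id']['year'], r['_id']['month']) for r in ordered]
```

With the full `results` list, `chronological(results)` would give the months in calendar order from January 2009 onwards.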

We can put that in a *pandas* Series and plot the number of accidents over time. The results arrive sorted by month and then year, so we key the Series by date and sort its index to put the data items in date order.

In [None]:
accidents_by_month_ss = pd.Series({datetime.datetime(m['_id']['year'], m['_id']['month'], 1): 
                                m['count'] for m in results}).sort_index()
# A hack to change the dates to the end of the month
accidents_by_month_ss.index = accidents_by_month_ss.index.to_period('M').to_timestamp('M')
accidents_by_month_ss

In [None]:
accidents_by_month_ss.plot()

That looks like a significant drop in the number of accidents, though the plot could be deceptive as the *y*-axis doesn't go to zero. Let's plot that again showing zero on the *y*-axis.

In [None]:
accidents_by_month_ss.plot(ylim=(0, accidents_by_month_ss.max() * 1.1))

Still a significant decrease in the number of accidents. Is this because people were driving less?
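Before turning to traffic, it can help to put a rough number on the decrease. One simple approach, sketched below, is to compare the same calendar month in 2009 and 2012, which sidesteps seasonal variation. The counts are copied from the January to April rows of the aggregation output above and hard-coded so that the sketch runs without a database connection; `pct_change` is a helper name introduced here.

```python
# Counts copied from the aggregation output: {month: (2009 count, 2012 count)}
counts_2009_vs_2012 = {
    1: (13417, 11836),
    2: (10950, 10863),
    3: (13202, 12171),
    4: (12715, 10820),
}

def pct_change(before, after):
    """Percentage change from before to after (negative means a fall)."""
    return 100 * (after - before) / before

changes = {month: pct_change(before, after)
           for month, (before, after) in counts_2009_vs_2012.items()}

for month, change in sorted(changes.items()):
    print('Month {:2d}: {:+.1f}%'.format(month, change))
```

For these four months the change ranges from about -0.8% (February) to about -14.9% (April), so the size of the drop varies a good deal from month to month.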

To see whether traffic fell too, let's summarise the road data. Note that the road data only records traffic for each year, not for each month.

## Activity 1
Use an aggregation pipeline to find the total volume of traffic, grouped by year.

Is the fall in the number of accidents since 2009 explained by a fall in traffic?

The solution is in the [`16.1solutions`](16.1solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## Activity 2: Proportions of accidents at severity levels over time
Are cars getting safer? In other words, are there proportionally more slight accidents than serious or fatal, and more serious accidents than fatal?

Use an aggregation pipeline to find the number of accidents of each severity for each year. Use an appropriate statistical test to see if the proportions of accidents at each severity are significantly different over time.

The solution is in the [`16.1solutions`](16.1solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## Cleanup

In [None]:
!sudo /etc/mongo-shards-down
!sudo service mongod start

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `16.2 Python map-reduce`.