# Regression on subgroups
This Notebook looks in more detail at some factors that affect the relationship between the number of casualties and the number of vehicles in an accident. It does this by splitting the accidents dataset into two groups, accidents that involve bus-like vehicles and accidents that don't, and comparing the regression results for the two groups.

In [None]:
# Import the required libraries

import pymongo
import datetime
import collections

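import numpy as np
import pandas as pd
import scipy.stats

# matplotlib is needed for the scatter plot later in this Notebook
import matplotlib.pyplot as plt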

In [None]:
# Open a connection to the Mongo server, open the accidents database and name the collections of accidents and labels
client = pymongo.MongoClient('mongodb://localhost:27351/')

db = client.accidents
accidents = db.accidents
labels = db.labels

In [None]:
# Load the expanded names of keys and human-readable codes into memory

expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]

## Pearson's *r*
Let's look again at the whole set.

In [None]:
# Build a DataFrame, one row for each accident
cas_veh_unrolled_df = pd.DataFrame(list(accidents.find({}, ['Number_of_Casualties', 'Number_of_Vehicles'])))

# Count the number of accidents for each combination of casualties and vehicles
cas_veh_df = pd.crosstab(cas_veh_unrolled_df['Number_of_Casualties'],
                         cas_veh_unrolled_df['Number_of_Vehicles'])
# Reshape
cas_veh_long_df = cas_veh_df.stack().reset_index()
cas_veh_long_df

In [None]:
regressionline = scipy.stats.linregress(cas_veh_unrolled_df['Number_of_Casualties'],
                                       cas_veh_unrolled_df['Number_of_Vehicles'])

# The regression line is of the form y = m x + b
m = regressionline[0]
b = regressionline[1]
(m, b)

In [None]:
plt.scatter(cas_veh_long_df['Number_of_Casualties'], 
            cas_veh_long_df['Number_of_Vehicles'],
            s=np.sqrt(cas_veh_long_df[0])*1.5,
            alpha=0.5
            )

x = np.linspace(0, 30, 20)
plt.plot(x, m*x + b)

plt.xlabel('Number of casualties')
plt.ylabel('Number of vehicles')
plt.show()

The `pearsonr` function calculates Pearson's correlation coefficient, *r*. Recall that values near +1 show good positive correlation, values near -1 show good negative correlation, and values near 0 show no particular correlation. The `scipy` function also returns a second value, the *p* value of the result.
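As a quick illustration of what `pearsonr` returns, here is a minimal sketch on a small, made-up dataset (the values in `toy_x` and `toy_y` are invented for this example and are not taken from the accidents data). It also checks that the `rvalue` reported by `linregress` is the same coefficient, so squaring either one gives *R*².

```python
import scipy.stats

# Made-up values, purely for illustration
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

# pearsonr returns the correlation coefficient and its p value
r_pos, p_pos = scipy.stats.pearsonr(toy_x, toy_y)

# Negating y flips the sign of the coefficient but not its size
r_neg, p_neg = scipy.stats.pearsonr(toy_x, [-v for v in toy_y])

# linregress reports the same coefficient as rvalue
toy_fit = scipy.stats.linregress(toy_x, toy_y)

print('r (positive case):', r_pos)
print('r (negative case):', r_neg)
print('linregress rvalue:', toy_fit.rvalue)
print('R squared:', r_pos ** 2)
```

The sign of the coefficient gives the direction of the relationship. The squared value is never negative, which is why the +1 and -1 interpretation applies to *r* rather than to *R*².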

In [None]:
scipy.stats.pearsonr(cas_veh_unrolled_df['Number_of_Casualties'], 
                     cas_veh_unrolled_df['Number_of_Vehicles'])

This result shows a small, positive correlation with a very small *p* value. In other words, the correlation is weak, but the result is statistically significant. This means we can reject the null hypothesis that the number of casualties in an accident is unrelated to the number of vehicles.
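A weak correlation can still be statistically significant when the sample is large. The sketch below uses synthetic data (random numbers generated with an arbitrary seed, not the accidents data) to show the effect: the same weak relationship gives a tiny *p* value with many observations, but typically a much larger one with only a few.

```python
import numpy as np
import scipy.stats

# Synthetic data: y depends only weakly on x
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)

# Large sample: weak correlation, very small p value
r, p = scipy.stats.pearsonr(x, y)
print(f'n = {n}: r = {r:.3f}, p = {p:.3g}')

# Small sample drawn from the same relationship
r_small, p_small = scipy.stats.pearsonr(x[:50], y[:50])
print(f'n = 50: r = {r_small:.3f}, p = {p_small:.3g}')
```

The *p* value describes how confident we can be that some relationship exists. It says nothing about how strong that relationship is.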

Looking at the data, this seems to arise because most accidents result in very few casualties, and the accidents with the most casualties involve few vehicles.

Let's look at why the accidents with most casualties seem to only involve a few vehicles. 

### Activity 1
Investigate the types of vehicles that are involved in the largest accidents (i.e. those accidents with more than 8 casualties). What types of vehicles appear in the multiple-casualty accidents?

For each accident with more than 8 casualties, list the vehicles involved. For each vehicle, print the type of vehicle and the number of casualties in it, in a format similar to this:

```
Acc index 201201GD10531; 9 casualties, 3 vehicles
	Car: 1 casualties
	Car: 5 casualties
	Bus: 3 casualties
Acc index 201201KF60687; 11 casualties, 3 vehicles
	Taxi/Private: 2 casualties
	Taxi/Private: 4 casualties
	Taxi/Private: 5 casualties
```

**Hint**

Because you'll want to look at the vehicles involved in each accident, it's easier to keep the data in standard Python data structures, unchanged from the result of a pymongo `find()`.

The solution is in the [`14.4solutions`](14.4solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

### Activity 2
Separate the accidents into those involving bus-like vehicles and those that don't, and calculate the regression and correlation results for each subgroup.
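One general pattern for comparing regressions across subgroups is sketched below on synthetic data. This is not the solution to the activity: the `has_bus` column is a hypothetical flag invented for this example, and the numbers are randomly generated rather than taken from the accidents database.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic data; 'has_bus' is a hypothetical flag, not a field in the dataset
rng = np.random.default_rng(42)
n = 500
toy_df = pd.DataFrame({
    'has_bus': rng.random(n) < 0.2,
    'Number_of_Vehicles': rng.integers(1, 5, size=n),
})
noise = rng.poisson(1, size=n)

# By construction: casualties track vehicles only in the non-bus group
toy_df['Number_of_Casualties'] = np.where(
    toy_df['has_bus'], 6 + noise, toy_df['Number_of_Vehicles'] + noise)

# Fit a separate regression line for each subgroup
subgroup_fits = {}
for has_bus, group in toy_df.groupby('has_bus'):
    fit = scipy.stats.linregress(group['Number_of_Casualties'],
                                 group['Number_of_Vehicles'])
    subgroup_fits[bool(has_bus)] = (fit.slope, fit.intercept,
                                    fit.rvalue, fit.pvalue)

for has_bus, (m, b, r, p) in subgroup_fits.items():
    print(f'has_bus={has_bus}: m={m:.3f}, b={b:.3f}, r={r:.3f}, p={p:.3g}')
```

With the real data, the grouping flag would have to be derived from the vehicle types recorded for each accident.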

The solution is in the [`14.4solutions`](14.4solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 14 Notebooks. It's time to move on to Part 15.