# Not stand-alone
Unlike most "solution" notebooks, this notebook isn't stand-alone: running the MongoDB cluster means you shouldn't have several notebooks running at once. To see the results of these solutions, copy the code into the [`16.3 Accident analysis map-reduce`](16.1 Accident analysis map-reduce.ipynb) notebook.

# Activity 1

This closely follows the example using local authorities: the mapper emits the number of casualties keyed by police force and year, and the reducer sums them.
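
To make the data flow concrete, here is a minimal pure-Python sketch of the same map-reduce pattern. It is not how MongoDB runs the job: the sample documents, the `year` field and the `map_reduce` helper are all hypothetical stand-ins. It also shows why the reducer returns a value of the same shape as the values it receives: MongoDB may call the reducer again on partial results, so re-reducing has to give the same answer.

```python
from collections import defaultdict

# Hypothetical sample documents; the force codes and counts are made up.
docs = [
    {"Police_Force": 43, "year": 2011, "Number_of_Casualties": 2},
    {"Police_Force": 43, "year": 2011, "Number_of_Casualties": 1},
    {"Police_Force": 43, "year": 2012, "Number_of_Casualties": 3},
    {"Police_Force": 1, "year": 2011, "Number_of_Casualties": 1},
]

def mapper(doc):
    """Yield one (key, value) pair per document, like emit() in the JavaScript mapper."""
    yield ((doc["Police_Force"], doc["year"]),
           {"Number_of_Casualties": doc["Number_of_Casualties"]})

def reducer(key, values):
    """Sum the emitted values; the result has the same shape as each input value."""
    return {"Number_of_Casualties": sum(v["Number_of_Casualties"] for v in values)}

def map_reduce(docs):
    """Group the emitted values by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

totals = map_reduce(docs)

# Re-reducing a partial result alongside a raw emitted value gives the same total.
partial = reducer(None, [{"Number_of_Casualties": 2}])
rereduced = reducer(None, [partial, {"Number_of_Casualties": 1}])
```

Here `totals` maps each `(police force, year)` pair to its summed casualties, which is the role the `_id` and `value` fields play in the MongoDB results collection.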

In [None]:
mapper = Code("""
    function () {
        emit({police_force : this.Police_Force, year : this.Datetime.getFullYear()}, 
            {Number_of_Casualties: this.Number_of_Casualties});
    }
""")

In [None]:
reducer = Code("""
    function(key, emits) {
        var total = {Number_of_Casualties : 0}
        for (var i in emits) {
            total.Number_of_Casualties += emits[i].Number_of_Casualties;
        }
        return total;
    }
""")

In [None]:
result = accidents.map_reduce(mapper, reducer, 'myresults')
result

In [None]:
[r for r in result.find(limit=5)]

In [None]:
casualties_by_pf_year_long_df = pd.DataFrame([
    {'District name': label_of['Police_Force', r['_id']['police_force']],
     'District code': r['_id']['police_force'],
     'Year': datetime.datetime(int(r['_id']['year']), 12, 31),
     'Number_of_Casualties': r['value']['Number_of_Casualties']}
    for r in result.find()])
casualties_by_pf_year_long_df

In [None]:
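# Keyword arguments keep this working on pandas versions that no longer accept positional arguments to pivot()
casualties_by_pf_year_df = casualties_by_pf_year_long_df.pivot(index='Year', columns='District name', values='Number_of_Casualties')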
casualties_by_pf_year_df

In [None]:
casualties_by_pf_year_df['Thames Valley'].plot()

In [None]:
casualties_by_pf_year_df['Thames Valley'].plot(ylim=(0, casualties_by_pf_year_df['Thames Valley'].max() * 1.1))

# Activity 2

In [None]:
mapper = Code("""
    function () {
        emit({month : this.Datetime.getMonth(), year : this.Datetime.getFullYear()}, 
            {Number_of_Accidents: 1});
    }
""")

In [None]:
reducer = Code("""
    function(key, emits) {
        var total = {Number_of_Accidents : 0}
        for (var i in emits) {
            total.Number_of_Accidents += emits[i].Number_of_Accidents;
        }
        return total;
    }
""")

In [None]:
result = accidents.map_reduce(mapper, reducer, 'myresults')
result

In [None]:
result.find().count()

In [None]:
[r for r in result.find(limit=5)]

In [None]:
accidents_by_month_mr_ss = pd.Series({datetime.datetime(int(m['_id']['year']), 
                                      int(m['_id']['month']+1), 1): 
                    m['value']['Number_of_Accidents'] for m in result.find()})
# A hack to change the dates to the end of the month
accidents_by_month_mr_ss.index = accidents_by_month_mr_ss.index.to_period('M').to_timestamp('M')
accidents_by_month_mr_ss

In [None]:
pd.DataFrame({'aggregation': accidents_by_month_ap_ss, 'map-reduce': accidents_by_month_mr_ss}).plot()

Compare the two results and see if they're the same.
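
A caveat on the check that follows: subtracting two pandas Series aligns them on their index and yields `NaN` for any label present in only one of them, and `sum()` skips `NaN` by default, so a total of zero does not by itself prove the two results match (positive and negative differences can also cancel). A minimal sketch of a stricter comparison, using plain dicts with made-up counts as stand-ins for the two Series; `compare_counts` is a hypothetical helper:

```python
# Hypothetical monthly counts keyed by (year, month); the values are made up.
aggregation = {(2012, 1): 100, (2012, 2): 90, (2012, 3): 110}
map_reduce = {(2012, 1): 100, (2012, 2): 95, (2012, 4): 80}

def compare_counts(a, b):
    """Return the keys only in a, the keys only in b, and the shared keys whose values differ."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = {k: (a[k], b[k]) for k in sorted(set(a) & set(b)) if a[k] != b[k]}
    return only_a, only_b, differing

only_a, only_b, differing = compare_counts(aggregation, map_reduce)
```

The two results agree exactly only if all three return values are empty.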

In [None]:
(accidents_by_month_ap_ss - accidents_by_month_mr_ss).sum()

In [None]:
(accidents_by_month_ap_ss - accidents_by_month_mr_ss).plot()

# Activity 3

In [None]:
mapper = Code("""
    function () {
        for (var i in this.Vehicles) {
            emit({age_band : this.Vehicles[i].Age_Band_of_Driver, vehicle_type : this.Vehicles[i].Vehicle_Type}, 
                {count: 1});
        }
    }
""")

In [None]:
reducer = Code("""
    function(key, emits) {
        var total = {count : 0}
        for (var i in emits) {
            total.count += emits[i].count;
        }
        return total;
    }
""")

In [None]:
result = accidents.map_reduce(mapper, reducer, 'myresults')
result

In [None]:
[r for r in result.find(limit=20)]

In [None]:
drivers_by_age_vtype_long_df = pd.DataFrame([
        {'Age_Band_of_Driver': r['_id']['age_band'],
         'Vehicle_Type': r['_id']['vehicle_type'],
         'Number_of_Drivers': r['value']['count']}
        for r in result.find()
    ])
drivers_by_age_vtype_long_df

In [None]:
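# Keyword arguments keep this working on pandas versions that no longer accept positional arguments to pivot()
drivers_by_age_vtype_df = drivers_by_age_vtype_long_df.pivot(index='Age_Band_of_Driver', 
                                                             columns='Vehicle_Type', 
                                                             values='Number_of_Drivers')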
drivers_by_age_vtype_df.index = [label_of['Age_Band_of_Driver', i] 
                                                 for i in drivers_by_age_vtype_df.index]
drivers_by_age_vtype_df.columns = [label_of['Vehicle_Type', c] 
                                     for c in drivers_by_age_vtype_df.columns]
drivers_by_age_vtype_df.fillna(0, inplace=True)
drivers_by_age_vtype_df 

In [None]:
ax = drivers_by_age_vtype_df.plot(kind='bar')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

In [None]:
ax = drivers_by_age_vtype_df.loc[['26 - 35', '36 - 45', '46 - 55']].plot(kind='bar')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

In [None]:
ax = drivers_by_age_vtype_df['Car'].plot(kind='bar')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

Exclude rows and columns that contain cells with very small counts: with so few observations, those cells can't support meaningful results.

Start by dropping drivers aged 15 and under.

In [None]:
small_dav_df = drivers_by_age_vtype_df.iloc[4:]
small_dav_df

Then keep only the columns where every cell has more than 5 items.

In [None]:
small_dav_df = small_dav_df.loc[:, (small_dav_df > 5).all(axis=0) ]
small_dav_df

In [None]:
expected_dav_df = pd.DataFrame({c: {r: (small_dav_df[c].sum() * 
                                                small_dav_df.loc[r].sum() / 
                                                small_dav_df.sum().sum() )
                  for r in small_dav_df[c].index} 
              for c in small_dav_df}
)
expected_dav_df

In [None]:
scipy.stats.chisquare(small_dav_df, expected_dav_df, axis=None)

This is an extremely small _p_ value, so we can reject the null hypothesis that age band of driver and vehicle type are independent.
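
For reference, the expected counts and the test statistic can be sketched in plain Python. The 2×2 table below is made up, and `expected_counts` and `chi_squared` are hypothetical helper names. One point to be aware of: `scipy.stats.chisquare` with `axis=None` uses k − 1 degrees of freedom by default, where k is the number of cells, whereas a test of independence on a table with r rows and c columns conventionally uses (r − 1)(c − 1). `scipy.stats.chi2_contingency` computes the expected counts and uses the latter.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Return the chi-squared statistic and the degrees of freedom for an r x c table."""
    expected = expected_counts(observed)
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Hypothetical 2 x 2 table of counts (rows: age bands, columns: vehicle types).
observed = [[10, 20],
            [30, 40]]
expected = expected_counts(observed)
statistic, dof = chi_squared(observed)
```

The p value is then the upper tail of the chi-squared distribution with `dof` degrees of freedom at `statistic`.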

# Activity 4

In [None]:
mapper = Code("""
    function () {
        for (var i in this.Vehicles) {
            if (this.Vehicles[i].Vehicle_Type >= 2 && this.Vehicles[i].Vehicle_Type <= 5) {
                emit({age_band : this.Vehicles[i].Age_Band_of_Driver, 
                        vehicle_type : this.Vehicles[i].Vehicle_Type,
                        sex : this.Vehicles[i].Sex_of_Driver}, 
                    {count : 1});
            }
        }
    }
""")

In [None]:
reducer = Code("""
    function(key, emits) {
        var total = {count : 0}
        for (var i in emits) {
            total.count += emits[i].count;
        }
        return total;
    }
""")

In [None]:
result = accidents.map_reduce(mapper, reducer, 'myresults',
                              query = {'Datetime': {"$gte": datetime.datetime(2011, 1, 1)}})
result

In [None]:
[r for r in result.find(limit=5)]

In [None]:
riders_by_age_vtype_sex_long_df = pd.DataFrame([
        {'Age_Band': r['_id']['age_band'],
         'Vehicle_Type': r['_id']['vehicle_type'],
         'Sex': r['_id']['sex'],
         'n': r['value']['count']}
        for r in result.find()
    ])
riders_by_age_vtype_sex_long_df

In [None]:
riders_by_age_vtype_sex_df = riders_by_age_vtype_sex_long_df.pivot_table(
    columns=['Age_Band', 'Sex'],
    index='Vehicle_Type',
    values='n')
riders_by_age_vtype_sex_df

In [None]:
riders_by_age_vtype_sex_df.index = [label_of['Vehicle_Type', i] 
                                    for i in riders_by_age_vtype_sex_df.index]
riders_by_age_vtype_sex_df

In [None]:
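# set_levels() returns a new index; assigning it back avoids the inplace argument, which newer pandas versions no longer accept
riders_by_age_vtype_sex_df.columns = riders_by_age_vtype_sex_df.columns.set_levels(
    [label_of['Age_Band_of_Driver', a] 
     for a in riders_by_age_vtype_sex_df.columns.levels[0]], level=0)
riders_by_age_vtype_sex_df.columns = riders_by_age_vtype_sex_df.columns.set_levels(
    [label_of['Sex_of_Driver', a] 
     for a in riders_by_age_vtype_sex_df.columns.levels[1]], level=1)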
riders_by_age_vtype_sex_df

In [None]:
riders_avs_df = riders_by_age_vtype_sex_df.loc[:, (slice('16 - 20','56 - 65'), ['Male','Female'])]
riders_avs_df

Plot the data for men and women separately. Note the different numbers of accidents.

In [None]:
ax = riders_avs_df.T.xs('Female', level='Sex').plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

In [None]:
ax = riders_avs_df.T.xs('Male', level='Sex').plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

It certainly looks like there's an excess of accidents involving men in their 40s on powerful motorbikes.
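
Because the age bands contain different numbers of riders, one way to check this impression is to convert the counts to shares within each age band, so that the mix of motorcycles can be compared directly. A minimal sketch with made-up numbers; the labels, counts and the `shares_by_group` helper are all hypothetical:

```python
# Hypothetical counts: {age band: {vehicle type: number of riders}}; values are made up.
counts = {
    "26 - 35": {"Motorcycle 125cc and under": 30, "Motorcycle over 500cc": 70},
    "46 - 55": {"Motorcycle 125cc and under": 10, "Motorcycle over 500cc": 90},
}

def shares_by_group(counts):
    """Convert each group's counts to fractions of that group's total."""
    shares = {}
    for group, by_type in counts.items():
        total = sum(by_type.values())
        shares[group] = {vtype: n / total for vtype, n in by_type.items()}
    return shares

shares = shares_by_group(counts)
```

Each inner dict of `shares` sums to 1, so a larger share for one vehicle type in one age band reflects a different mix rather than just more riders overall.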

In [None]:
expected_riders_avs_df = pd.DataFrame({c: {r: (riders_avs_df[c].sum() * 
                                                riders_avs_df.loc[r].sum() / 
                                                riders_avs_df.sum().sum() )
                  for r in riders_avs_df[c].index} 
              for c in riders_avs_df}
)
expected_riders_avs_df.columns.names = ['Age band', 'Sex']
expected_riders_avs_df.plot()
expected_riders_avs_df

In [None]:
riders_avs_f_df = riders_avs_df.xs('Female', level='Sex', axis=1)
expected_riders_avs_f_df = pd.DataFrame({c: {r: (riders_avs_f_df[c].sum() * 
                                                riders_avs_f_df.loc[r].sum() / 
                                                riders_avs_f_df.sum().sum() )
                  for r in riders_avs_f_df[c].index} 
              for c in riders_avs_f_df}
)
expected_riders_avs_f_df.plot()
expected_riders_avs_f_df

In [None]:
riders_avs_m_df = riders_avs_df.xs('Male', level='Sex', axis=1)
expected_riders_avs_m_df = pd.DataFrame({c: {r: (riders_avs_m_df[c].sum() * 
                                                riders_avs_m_df.loc[r].sum() / 
                                                riders_avs_m_df.sum().sum() )
                  for r in riders_avs_m_df[c].index} 
              for c in riders_avs_m_df}
)
expected_riders_avs_m_df.plot()
expected_riders_avs_m_df

In [None]:
scipy.stats.chisquare(riders_avs_df, expected_riders_avs_df, axis=None)

In [None]:
scipy.stats.chisquare(riders_avs_f_df, expected_riders_avs_f_df, axis=None)

In [None]:
scipy.stats.chisquare(riders_avs_m_df, expected_riders_avs_m_df, axis=None)

In all three cases (all riders, women only and men only), the mix of motorcycles ridden changes significantly with the age of the rider.