
# Exploring some data with grouping, cross tabs and pivot tables
## Solution Notebook
This is a set of worked solutions to the exercise at the end of the `04.5 Split-apply-combine with SQL and pandas` Notebook. 

Remember there will be several different ways of achieving the required results - but the results should be the same given the same salesbook dataset.


## The exercise starts with
In the `data` folder there is a `salesbook.csv` file.  

It's a fairly boring sales ledger showing for each date (watch the format!) the location of the sales team member, the sales person's name, their sales team, and what was sold (how many and at what unit price).


In [None]:
import numpy as np  # needed for the np.sum aggregations used in the pivot tables below
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
!ls data

## Your tasks 

Load the DataFrame from the `CSV` file.

You might want to clean up the data a bit. One suggestion is to split the month/day/year, or create functions to do it. (All sales are in 2015 so we could drop that info and we're not interested in the day of the month of the sale, but we are going to be interested in monthly sales - so we want to retain the month values.)

You might also want to add a sale amount column (units \* item cost) to save recalculating each time it's required, as we will be using this value later.

Given the above data, and the ability to group datasets, write code to:

- a) Show a count of the number of sales records for each District.
- b) Show a count of the number of sales records for each Team in each District, including the Team and District margin totals.
- c) Show the total sales value for each Team in each District summed over the year.
- d) Show the total sales value for each Team Member in each District over the year, showing the District and Team member margin totals. (Remember you need the team name and salesperson name to identify each person uniquely.)
- e) Show a bar chart of the number of sales each month. 
- f) Show a bar chart of the total sales each month.
- g) Show a scatter plot showing the Item Cost v. the number of Units in each record.
- h) Add a Season column to the DataFrame. For each sale record, the value for Season will be derived from the month: (11,12,1) are Winter, (2,3,4) are Spring, (5,6,7) are Summer, (8,9,10) are Autumn. From the sales in each Season calculate the number, average, maximum, minimum and total sale amount over the season (that is, from all the sales records grouped by season report the number of records, and the average, maximum, minimum and total sales amounts).


In [None]:
# Start by reading in the CSV file.
salesbook_df = pd.read_csv('data/salesbook.csv')
salesbook_df.head()

In [None]:
# Now add the SaleAmount column.
salesbook_df['SaleAmount'] = salesbook_df['Units'] * salesbook_df['Item Cost']
salesbook_df.head()

In [None]:
# Now add the Month column.

# pd.to_datetime() parses the OrderDate strings into a Series of datetime values,
# and the .dt accessor then extracts the month from every value in one step,
# so there is no need to loop over the rows.

# The result is a Series of month numbers, which is assigned
# to a new 'Month' column of salesbook_df.
salesbook_df['Month'] = pd.to_datetime(salesbook_df['OrderDate']).dt.month
salesbook_df.head(5)
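
If the parsed months ever look wrong, the date format can be stated explicitly rather than inferred. The following is a minimal sketch on made-up rows, assuming the `OrderDate` strings are month/day/four-digit-year; the real file may use a different layout, so check it before copying the format string.

```python
import pandas as pd

# Hypothetical sample rows; the real OrderDate strings may be formatted differently.
sample_df = pd.DataFrame({'OrderDate': ['1/6/2015', '2/9/2015', '12/21/2015']})

# An explicit format removes any day/month ambiguity when parsing.
parsed = pd.to_datetime(sample_df['OrderDate'], format='%m/%d/%Y')

# The .dt accessor exposes datetime fields for the whole Series at once.
sample_df['Month'] = parsed.dt.month
print(sample_df)
```

With an explicit format, a string that does not match raises an error instead of being silently misread.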


- a) Show a count of the number of sales records for each District.

In [None]:
# Counting over one column is a simple group by and count:
salesbook_df.groupby(['District']).count()

In [None]:
# To make it look a bit tidier, isolate only the District and one other column:
salesbook_df[['District','Sales']].groupby(['District']).count()
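
One thing to watch with `count()` is that it counts non-null values in each column, not rows. A small sketch on invented data (the district names and values here are hypothetical) shows how `size()` differs when a column has missing values:

```python
import pandas as pd

# Made-up rows for illustration only; two Units values are deliberately missing.
sample_df = pd.DataFrame({
    'District': ['North', 'South', 'North', 'East', 'North'],
    'Units': [3, None, 5, 2, None],
})

# size() counts the rows in each group, whatever they contain.
row_counts = sample_df.groupby('District').size()

# count() only counts the non-null values in the chosen column.
non_null_units = sample_df.groupby('District')['Units'].count()

print(row_counts)
print(non_null_units)
```

If the column used for counting has no missing values, the two approaches give the same numbers.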

- b) Show a count of the number of sales records for each Team in each District, including the Team and District margin totals.

In [None]:
# Counting over two distinct columns is easier as a crosstab:
pd.crosstab(salesbook_df['Team'], salesbook_df['District'], margins=True)
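
To see what the crosstab is doing, it can help to compare it with the equivalent group-and-count. This is a sketch on invented rows (team and district names are hypothetical), not the salesbook data:

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'A'],
    'District': ['North', 'South', 'North', 'North', 'North'],
})

# crosstab counts co-occurrences; margins=True adds the 'All' row and column.
table = pd.crosstab(sample_df['Team'], sample_df['District'], margins=True)

# The same counts (without margins) via split-apply-combine:
# group on both columns, count rows, then pivot District into columns.
by_group = sample_df.groupby(['Team', 'District']).size().unstack(fill_value=0)

print(table)
print(by_group)
```

The crosstab saves the `unstack()` step and provides the margin totals for free.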

- c) Show the total sales value for each Team in each District summed over the year.

In [None]:
# Since we're doing more than counting we need a pivot table;
# but, we only want the sum function to apply to the SaleAmount column.
# So, we strip that and the Team and District columns out of the full Salesbook DataFrame.
salesbook_df[['Team','District','SaleAmount']].pivot_table(index=['Team'], 
                                                           columns=['District'],
                                                           aggfunc=np.sum)


- d) Show the total sales value for each Team Member in each District over the year, showing the District and Team member margin totals. (Remember you need the team name and salesperson name to identify each person uniquely.)

In [None]:
# Here we need the Sales column (the salesperson's name), and we need to form a two-level index from Team and Sales.
salesbook_df[['Team','Sales','District','SaleAmount']].pivot_table(index=['Team','Sales'], 
                                                        columns=['District'],
                                                        aggfunc=np.sum,
                                                        margins=True)
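
An alternative to selecting the columns first is to name the column to aggregate with the `values` parameter of `pivot_table`. The sketch below uses invented rows (names and amounts are hypothetical) and adds `fill_value=0` so that combinations with no sales show as zero:

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    'Team': ['A', 'A', 'A', 'B'],
    'Sales': ['Ann', 'Ann', 'Bob', 'Ann'],
    'District': ['North', 'South', 'North', 'North'],
    'SaleAmount': [10, 20, 5, 7],
})

# values= picks the column to aggregate; margins=True adds the 'All' totals.
table = sample_df.pivot_table(values='SaleAmount',
                              index=['Team', 'Sales'],
                              columns='District',
                              aggfunc='sum',
                              fill_value=0,
                              margins=True)
print(table)
```

Note that the two people called Ann stay separate because Team is part of the index.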

- e) Show a bar chart of the number of sales each month. 

In [None]:
# First get a count of the sales records in each month.
# If we put it into a simple two-column table then plot() can work out the x,y axes itself.
MonthBySaleCount_df = salesbook_df[['Month', 'OrderDate']].groupby(['Month']).count()

# Now plot the bar chart. It is a bit easier to read with Month on the y axis.
MonthBySaleCount_df.plot.barh()
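
A month with no sales records will not appear in the grouped result at all, so it would be missing from the chart. If that matters, one option is to reindex the counts over all twelve months. A sketch on invented month values:

```python
import pandas as pd

# Made-up month numbers for illustration only; months 3 and 5-12 have no records.
sample_df = pd.DataFrame({'Month': [1, 1, 2, 4, 4, 4]})

# Count the rows per month, then make sure every month 1..12 is present,
# filling the months without any records with a count of zero.
monthly_counts = (sample_df.groupby('Month')
                           .size()
                           .reindex(range(1, 13), fill_value=0))
print(monthly_counts)
```

Calling `monthly_counts.plot.barh()` would then draw a bar (possibly of zero length) for every month.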

In [None]:
# Not asked for in the question, but
# we could use a crosstab and a stacked bar chart to show the District 
# contribution to the monthly totals.
pd.crosstab(salesbook_df['Month'], salesbook_df['District']).plot.barh(stacked=True)

- f) Show a bar chart of the total sales each month.

In [None]:
# We can repeat the recipe for the count of sales example earlier, 
# adjusting for the SaleAmount and Sum functions.
salesbook_df[['Month', 'SaleAmount']].groupby(['Month']).sum().plot.barh()


In [None]:
# And a similar sleight of hand, this time with pivot tables, can give us the
# stacked bar chart to show the District contribution to the monthly sales totals.
salesbook_df[['Month', 'District', 'SaleAmount']
         ].pivot_table(index=['Month'], 
                       columns=['District'], 
                       aggfunc=np.sum
                      ).plot.barh(stacked=True)

- g) Show a scatter plot showing the Item Cost vs the number of Units in each record.

In [None]:
salesbook_df.plot.scatter(x='Item Cost', y='Units')

- h) Add a Season column to the DataFrame. For each sale record, the value for Season will be derived from the month: (11,12,1) are Winter, (2,3,4) are Spring, (5,6,7) are Summer, (8,9,10) are Autumn. From the sales in each Season calculate the number, average, maximum, minimum and total sale amount over the season (that is, from all the sales records grouped by season report the number of records, and the average, maximum, minimum and total sales amounts).


In [None]:
# First a simple function to convert months to seasons
def month_to_season(month):
    season='Winter'
    if month in [2,3,4]:
        season='Spring'
    elif month in [5,6,7]:
        season='Summer'
    elif month in [8,9,10]:
        season='Autumn'
    return season
# Now add a column to salesbook_df, using the Series apply() method to apply
# the new function to each value in the Month column:
salesbook_df['Season'] = salesbook_df['Month'].apply(month_to_season)
# Check to see it had the desired effect:
salesbook_df.head()
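
Another way to derive the season is a lookup dictionary combined with `Series.map()`. The sketch below uses the season definitions from the task; the names `SEASON_MONTHS` and `MONTH_TO_SEASON` are made up for this example. One difference from the function above is that an unexpected month value becomes `NaN` rather than defaulting to Winter.

```python
import pandas as pd

# Season definitions as given in the task.
SEASON_MONTHS = {
    'Winter': (11, 12, 1),
    'Spring': (2, 3, 4),
    'Summer': (5, 6, 7),
    'Autumn': (8, 9, 10),
}

# Invert to a month -> season lookup.
MONTH_TO_SEASON = {month: season
                   for season, months in SEASON_MONTHS.items()
                   for month in months}

# Made-up month values; 13 is deliberately invalid.
months = pd.Series([1, 4, 7, 10, 12, 13])

# map() looks each value up in the dictionary; missing keys give NaN.
seasons = months.map(MONTH_TO_SEASON)
print(seasons)
```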

In [None]:
# Now apply the functions to the grouped season rows -
# there's only one column of interest, SaleAmount, so let's focus on that.
salesbook_df[['Season','SaleAmount']].groupby(['Season']).agg(['count','mean','min','max','sum'])
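
If more descriptive column headings are wanted in the summary, pandas supports named aggregation, where each keyword becomes an output column name. A sketch on invented rows (the output names such as `number` and `total` are arbitrary choices for this example):

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    'Season': ['Winter', 'Winter', 'Spring'],
    'SaleAmount': [10.0, 30.0, 5.0],
})

# Each keyword names an output column; each value names the aggregation to apply.
summary = sample_df.groupby('Season')['SaleAmount'].agg(
    number='count',
    average='mean',
    minimum='min',
    maximum='max',
    total='sum',
)
print(summary)
```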

## What next?


Why not take time to share some of your solutions with the other students on the module? Use OpenStudio to showcase a Notebook with some of your solutions to the above exercises. While you're there, have a look at the work posted by other students. If the techniques they've used aren't familiar to you, ask them to describe how their code works. Don't be afraid to let them know when you can see improvements to their code: good coders learn from others.

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `04.6 Introducing Regular Expressions`.