## Sales-data-analysis

ym's rendition of TalkPython's "Excel to Python Course: ch8-case-study"


### Objective
Given last year's sales data, and a commission budget of $1mil, we want to look at how we could structure the commission payout for the following year.


### Data Sources
- 'customer_master.xlsx' contains 3 sets of data, namely customers, transactions, and sales
- https://github.com/talkpython/excel-to-python-course/tree/master/code/ch8-case-study

### Changes
- 16-12-2021 : Started project

In [1]:
import pandas as pd
import xlsxwriter
from pathlib import Path
from datetime import datetime

### File Locations

In [2]:
today = datetime.today()
src_file = Path.cwd() / "data" / "raw" / "customer_master.xlsx"
output_file = Path.cwd() / "data" / "processed" / "customer_processed.xlsx"
report = Path.cwd() / "reports" / "report_v1.xlsx"

In [3]:
df_trx = pd.read_excel(src_file, sheet_name = "transactions")
df_cust = pd.read_excel(src_file, sheet_name = "customers", dtype={'zip_code':str})
df_salesagt = pd.read_excel(src_file, sheet_name = "sales")

url = "https://github.com/cphalpert/census-regions/blob/master/us%20census%20bureau%20regions%20and%20divisions.csv?raw=True"
regions = pd.read_csv(url, usecols=[1,2])

### (1) Transactions

In [4]:
df_trx

Unnamed: 0,cust_num,sku,qty,list_price,discount_rate,invoice_price,invoice_num,invoice_date_time,invoice_total
0,LA6029,SW200,4,20000,0.24,15200.0,98105,2019-12-13 14:11:43.828,60800.0
1,EB0265,PS501,4,30000,0.10,27000.0,58436,2019-06-05 23:12:47.344,108000.0
2,EE4079,SW500,1,20000,0.36,12800.0,85825,2019-09-12 03:23:24.309,12800.0
3,YR6861,ACC5144,4,400,0.12,352.0,46422,2019-10-10 15:02:54.590,1408.0
4,WL5283,SW200,1,20000,0.17,16600.0,34838,2019-08-03 11:32:29.245,16600.0
...,...,...,...,...,...,...,...,...,...
1995,XJ1430,SPB1,1,5000,0.19,4050.0,11706,2019-05-09 15:09:09.614,4050.0
1996,AI9833,SW500,3,20000,0.24,15200.0,38703,2019-11-10 03:55:57.038,45600.0
1997,WL5283,SW200,2,20000,0.40,12000.0,48217,2019-10-18 06:00:39.492,24000.0
1998,SM6748,ACC9011,18,400,0.38,248.0,66811,2019-07-24 05:07:14.352,4464.0


In [5]:
df_trx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   cust_num           2000 non-null   object        
 1   sku                2000 non-null   object        
 2   qty                2000 non-null   int64         
 3   list_price         2000 non-null   int64         
 4   discount_rate      2000 non-null   float64       
 5   invoice_price      2000 non-null   float64       
 6   invoice_num        2000 non-null   int64         
 7   invoice_date_time  2000 non-null   datetime64[ns]
 8   invoice_total      2000 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(3), object(2)
memory usage: 140.8+ KB


In [6]:
prev_yr_total_sales = df_trx['invoice_total'].sum()
print(f'Total sales for previous year: ${prev_yr_total_sales:,.0f}')

Total sales for previous year: $126,493,662


In [7]:
total_com = 1_000_000
avg_com = total_com / prev_yr_total_sales
print(f'Average commission rate based on previous year\'s sale: {avg_com:.2%}')

Average commission rate based on previous year's sale: 0.79%


In [8]:
df_trx.describe()

Unnamed: 0,qty,list_price,discount_rate,invoice_price,invoice_num,invoice_total
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,5.336,15407.7,0.202315,12279.542,50350.6705,63246.831
std,6.072524,9907.746587,0.098452,8098.502539,28755.571742,94703.387591
min,1.0,400.0,0.0,200.0,43.0,240.0
25%,2.0,5000.0,0.13,3950.0,25356.25,12400.0
50%,3.0,20000.0,0.2,14800.0,50824.0,32000.0
75%,4.0,20000.0,0.27,17600.0,75442.0,66450.0
max,24.0,30000.0,0.56,30000.0,99990.0,705600.0


In [9]:
df_trx.describe(include=object)

Unnamed: 0,cust_num,sku
count,2000,2000
unique,50,12
top,VK4512,SW200
freq,53,338


In [10]:
agg_cols = {'invoice_num' : 'count', 'discount_rate':'mean','qty':'sum', 'invoice_total':'sum'}

In [11]:
df_trx.groupby('sku').agg(agg_cols).reset_index()

Unnamed: 0,sku,invoice_num,discount_rate,qty,invoice_total
0,ACC0001,73,0.207671,377,118828.0
1,ACC5144,71,0.185493,469,151912.0
2,ACC8222,73,0.197123,443,139876.0
3,ACC9011,84,0.209167,427,133996.0
4,PS403,145,0.199172,662,16130700.0
5,PS501,149,0.201812,784,18609900.0
6,SPA1,137,0.195693,809,3274350.0
7,SPB1,137,0.193796,802,3238100.0
8,SPBC2,141,0.209787,708,2793600.0
9,SW121,320,0.203,1631,25856400.0


##### Based on the 'transactions' data, we made the below observations:
- Total sales for the previous year was \$126,493,662.
- Average commission rate based on the commission budget of $1mil is 0.79%.
- There were a total of 2,000 transactions, involving 50 unique customers and 12 different products.
- Some products were more popular than the rest: SW200 was the most frequently sold SKU, appearing in 338 of the 2,000 transactions.
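
To put numbers on product popularity, one option is to rank SKUs by invoice count and by revenue. The sketch below is only an illustration: `toy_trx` is a hypothetical frame with made-up values that borrows the column names of `df_trx`, and `sku_rank` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns of df_trx (values are made up).
toy_trx = pd.DataFrame({
    "sku": ["SW200", "SW200", "PS501", "ACC5144", "SW200", "PS501"],
    "invoice_num": [1, 2, 3, 4, 5, 6],
    "invoice_total": [60800.0, 16600.0, 108000.0, 1408.0, 24000.0, 54000.0],
})

# Named aggregation: count invoices and sum revenue per SKU,
# then sort so the most frequently sold SKU comes first.
sku_rank = (
    toy_trx.groupby("sku")
    .agg(n_invoices=("invoice_num", "count"), revenue=("invoice_total", "sum"))
    .sort_values("n_invoices", ascending=False)
)

# Share of total revenue contributed by each SKU.
sku_rank["revenue_share"] = sku_rank["revenue"] / sku_rank["revenue"].sum()
print(sku_rank)
```

Sorting by `revenue` instead of `n_invoices` would rank products by value rather than by frequency, and the two rankings need not agree.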

### (2) Customers

In [12]:
df_cust

Unnamed: 0,company_name,channel,zip_code,city,state,account_num,total_sales
0,Universal Technology Vision,retail,22910,Charlottesville,VA,AH5590,1257912
1,East Design Hill,retail,66546,Wakarusa,KS,OL0453,1158564
2,Studio Pacific Galaxy,retail,79698,Abilene,TX,YR6861,1663488
3,Galaxy Building,retail,85275,Mesa,AZ,AS3124,1193560
4,Resource Innovation Future,retail,97013,Canby,OR,DK1362,958040
5,Internet Hill Systems,retail,74360,Picher,OK,KK6153,970886
6,Pacific Hill Application,retail,49862,Munising,MI,MS1866,1271136
7,Net Electronic,retail,42631,Marshes Siding,KY,WA1826,1101414
8,Software Bell Technology,retail,45342,Miamisburg,OH,XJ1430,942044
9,Innovation Net,retail,20390,Washington,DC,NS1312,1010872


In [13]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  50 non-null     object
 1   channel       50 non-null     object
 2   zip_code      50 non-null     object
 3   city          50 non-null     object
 4   state         50 non-null     object
 5   account_num   50 non-null     object
 6   total_sales   50 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 2.9+ KB


In [14]:
df_cust.describe().style.format('{:,.0f}')

Unnamed: 0,total_sales
count,50
mean,2529873
std,2482702
min,746216
25%,1115702
50%,1328859
75%,1705738
max,9121596


In [15]:
df_cust.describe(include=object)

Unnamed: 0,company_name,channel,zip_code,city,state,account_num
count,50,50,50,50,50,50
unique,50,3,50,48,31,50
top,Universal Technology Vision,retail,22910,Dawson,VA,AH5590
freq,1,38,1,2,4,1


Here we see that there are a total of 50 companies in the dataset, all bearing different zip codes and located across 48 cities in 31 states. There are 3 sales channels; the most common is retail, which is used by 38 companies.

In [16]:
df_cust.groupby('state').agg({'company_name' : 'count', 'total_sales':'sum'}).reset_index().sort_values(by = ['total_sales'],ascending = False).style.format({'total_sales':'{:,.0f}'})

Unnamed: 0,state,company_name,total_sales
18,MO,3,10716216
21,NE,2,10360328
10,KS,3,9118012
20,NC,2,8177040
8,ID,1,7853376
9,IL,1,6958500
17,MN,1,6833484
12,LA,1,6557928
7,IA,1,6548220
26,PA,1,6222564


In [17]:
avg_sales_per_state = df_cust['total_sales'].sum()/df_cust['state'].nunique()
print(f'The average sales per state is ${avg_sales_per_state:,.2f}')

The average sales per state is $4,080,440.71


MO and NE brought in the most sales of any state, each more than 2.5 times the average sales per state.
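
The comparison against the per-state average can also be done programmatically. This is a minimal sketch on a hypothetical frame `toy_cust` with made-up state codes and values; the threshold of 2 is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical customers (made-up state codes and sales, not the real dataset).
toy_cust = pd.DataFrame({
    "state": ["AA", "AA", "BB", "CC", "DD"],
    "total_sales": [9_000_000, 1_000_000, 1_500_000, 2_500_000, 2_000_000],
})

# Total sales per state, and the average of those per-state totals
# (equivalent to overall sales divided by the number of distinct states).
state_sales = toy_cust.groupby("state")["total_sales"].sum()
avg_per_state = state_sales.mean()

# Ratio of each state's sales to the average, largest first.
ratio = (state_sales / avg_per_state).sort_values(ascending=False)

# States whose sales exceed twice the average (illustrative threshold).
outliers = ratio[ratio > 2]
print(outliers)
```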

In [18]:
df_cust.sort_values(by='total_sales', ascending=False).head()

Unnamed: 0,company_name,channel,zip_code,city,state,account_num,total_sales
34,Signal Hill Bell,reseller,68631,Creston,NE,HC3828,9121596
31,West Max Hardware,reseller,64738,Collins,MO,MH1146,8807964
28,South Speed East,reseller,83856,Priest River,ID,KI8637,7853376
32,Telecom North Resource,reseller,28170,Wadesboro,NC,RJ3363,7008576
21,Solutions North,reseller,62520,Dawson,IL,QZ1799,6958500


This is probably due to the fact that the top 2 customers are located in these 2 states. Notice also that the top 5 customers are all using the reseller channel.

In [19]:
channels = df_cust.groupby('channel').agg({'company_name' : 'count', 'total_sales':'sum'}).reset_index().sort_values(by = ['total_sales'],ascending = False)
channels

Unnamed: 0,channel,company_name,total_sales
1,reseller,10,72708276
2,retail,38,46782774
0,partner,2,7002612


In [20]:
channels['company_name'] = channels['company_name'] / channels['company_name'].sum()
channels['total_sales'] = channels['total_sales'] / channels['total_sales'].sum()
channels

Unnamed: 0,channel,company_name,total_sales
1,reseller,0.2,0.574798
2,retail,0.76,0.369843
0,partner,0.04,0.055359


Even though most companies (76%) use the retail channel, the reseller channel brought in 57% of total sales.
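
Another way to read the same figures is average sales per company in each channel. The sketch below re-enters the channel totals from the table above by hand; `channel_totals` and `avg_sales_per_company` are names introduced here for illustration.

```python
import pandas as pd

# Company counts and sales totals copied from the channel summary above.
channel_totals = pd.DataFrame({
    "channel": ["reseller", "retail", "partner"],
    "company_name": [10, 38, 2],
    "total_sales": [72_708_276, 46_782_774, 7_002_612],
}).set_index("channel")

# Average sales per company within each channel.
channel_totals["avg_sales_per_company"] = (
    channel_totals["total_sales"] / channel_totals["company_name"]
)
print(channel_totals["avg_sales_per_company"].round(0))
```

On these figures, an average reseller customer accounts for several times the sales of an average retail customer, which is why the smaller channel dominates total sales.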

Let's see if the sales channel has any correlation with the states the companies are in.

In [21]:
pd.pivot_table(df_cust, 
               index = "state", 
               columns = "channel", 
               values = 'total_sales', 
               aggfunc = 'count', 
               margins = True, 
               fill_value =0)

channel,partner,reseller,retail,All
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AL,0,0,1,1
AZ,0,0,1,1
CA,0,0,2,2
CT,0,0,2,2
DC,1,0,1,2
FL,1,0,1,2
GA,0,0,1,1
IA,0,1,0,1
ID,0,1,0,1
IL,0,1,0,1


In [22]:
pd.crosstab(df_cust['state'],
            df_cust['channel'],
            values = df_cust['total_sales'], 
            aggfunc = 'sum',
            normalize = 'index')

channel,partner,reseller,retail
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL,0.0,0.0,1.0
AZ,0.0,0.0,1.0
CA,0.0,0.0,1.0
CT,0.0,0.0,1.0
DC,0.790636,0.0,0.209364
FL,0.753398,0.0,0.246602
GA,0.0,0.0,1.0
IA,0.0,1.0,0.0
ID,0.0,1.0,0.0
IL,0.0,1.0,0.0


The 10 companies that use the reseller channel are spread out across 10 different states. There does not seem to be any conclusive correlation between the state and the choice of sales channel.

##### The observations made from the 'customers' data can be summarised as below:
- There are 50 companies, located across 48 cities in 31 states.
- There are 3 sales channels, namely retail, reseller, and partner.
- While 76% of companies go through retail, the reseller channel brought in 57% of total sales volume.
- All of the top 5 companies (in terms of total sales) use the reseller channel.
- The top 2 companies are located in NE and MO respectively, which are also the top 2 states in terms of total sales volume.


### (3) Sales agents

In [23]:
df_salesagt

Unnamed: 0,first_name,last_name,region,tenure
0,Shannon,Muniz,NorthEast,5.6
1,Leonard,Malcolm,West,3.8
2,Mona,Sutton,Midwest,5.4
3,Mickey,Tyner,South,0.7


In [24]:
df_salesagt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   first_name  4 non-null      object 
 1   last_name   4 non-null      object 
 2   region      4 non-null      object 
 3   tenure      4 non-null      float64
dtypes: float64(1), object(3)
memory usage: 256.0+ bytes


There are a total of 4 sales agents, each covering a different region. We would like to look into the amount of sales each of them brought in.

As this table does not tell us which states belong to which region, we need an external lookup table that maps each state to its region. This data can be found here: https://github.com/cphalpert/census-regions/blob/master/us%20census%20bureau%20regions%20and%20divisions.csv

We will add this input to the top of the file, together with the rest of the inputs.

In [25]:
regions

Unnamed: 0,State Code,Region
0,AK,West
1,AL,South
2,AR,South
3,AZ,West
4,CA,West
5,CO,West
6,CT,Northeast
7,DC,South
8,DE,South
9,FL,South


Let's rename the columns in the region table and also change the spelling of Northeast to be consistent with the dataset we are working with.

In [26]:
regions = regions.rename(columns={'State Code':'state', 'Region':'region'})
regions['region'] = regions['region'].str.replace("Northeast","NorthEast")
regions

Unnamed: 0,state,region
0,AK,West
1,AL,South
2,AR,South
3,AZ,West
4,CA,West
5,CO,West
6,CT,NorthEast
7,DC,South
8,DE,South
9,FL,South
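
A spelling mismatch like this one would make a later merge silently fail to match rows, so it can be worth checking the labels on both sides first. The following is a minimal sketch on hypothetical miniature tables; `agents`, `lookup`, `before` and `after` are names introduced here.

```python
import pandas as pd

# Hypothetical miniature versions of the sales-agent and region tables.
agents = pd.DataFrame({"region": ["NorthEast", "West", "Midwest", "South"]})
lookup = pd.DataFrame({
    "state": ["CT", "CA", "KS", "VA"],
    "region": ["Northeast", "West", "Midwest", "South"],
})

# Labels used by the lookup table that no sales agent covers.
before = set(lookup["region"]) - set(agents["region"])

# Map the mismatched spelling onto the one used by the sales-agent table.
lookup["region"] = lookup["region"].replace({"Northeast": "NorthEast"})

# After the fix, every lookup label should have a matching agent region.
after = set(lookup["region"]) - set(agents["region"])
print(before, after)
```

Passing a dict to `Series.replace` swaps whole values only, whereas `str.replace` substitutes substrings; for a label column either approach works here.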


Now we can link each customer to their sales agent.

In [27]:
temp = pd.merge(df_cust, regions, on='state')  # join on the shared 'state' column
temp

Unnamed: 0,company_name,channel,zip_code,city,state,account_num,total_sales,region
0,Universal Technology Vision,retail,22910,Charlottesville,VA,AH5590,1257912,South
1,Contract Electronics Industries,retail,24153,Salem,VA,GG0303,1035050,South
2,Star Interactive,retail,22153,Springfield,VA,UM2244,1541486,South
3,Vision People Solutions,retail,24557,Gretna,VA,WL5283,1299450,South
4,East Design Hill,retail,66546,Wakarusa,KS,OL0453,1158564,Midwest
5,Hardware Adventure Universal,retail,67118,Norwich,KS,GA3939,1163380,Midwest
6,Solutions Universal,reseller,66212,Shawnee Mission,KS,SA4443,6796068,Midwest
7,Studio Pacific Galaxy,retail,79698,Abilene,TX,YR6861,1663488,South
8,Virtual Vision Data,retail,77501,Pasadena,TX,YA6348,1440886,South
9,Galaxy Building,retail,85275,Mesa,AZ,AS3124,1193560,West


In [28]:
cust_salesagt = pd.merge(temp, df_salesagt, on='region', how='left')  # one agent per region

In [29]:
cust_salesagt

Unnamed: 0,company_name,channel,zip_code,city,state,account_num,total_sales,region,first_name,last_name,tenure
0,Universal Technology Vision,retail,22910,Charlottesville,VA,AH5590,1257912,South,Mickey,Tyner,0.7
1,Contract Electronics Industries,retail,24153,Salem,VA,GG0303,1035050,South,Mickey,Tyner,0.7
2,Star Interactive,retail,22153,Springfield,VA,UM2244,1541486,South,Mickey,Tyner,0.7
3,Vision People Solutions,retail,24557,Gretna,VA,WL5283,1299450,South,Mickey,Tyner,0.7
4,East Design Hill,retail,66546,Wakarusa,KS,OL0453,1158564,Midwest,Mona,Sutton,5.4
5,Hardware Adventure Universal,retail,67118,Norwich,KS,GA3939,1163380,Midwest,Mona,Sutton,5.4
6,Solutions Universal,reseller,66212,Shawnee Mission,KS,SA4443,6796068,Midwest,Mona,Sutton,5.4
7,Studio Pacific Galaxy,retail,79698,Abilene,TX,YR6861,1663488,South,Mickey,Tyner,0.7
8,Virtual Vision Data,retail,77501,Pasadena,TX,YA6348,1440886,South,Mickey,Tyner,0.7
9,Galaxy Building,retail,85275,Mesa,AZ,AS3124,1193560,West,Leonard,Malcolm,3.8


Let's bring in the transaction data as well.

In [30]:
detailed_trx = pd.merge(df_trx, cust_salesagt, left_on = "cust_num", right_on = "account_num", how="left")
detailed_trx

Unnamed: 0,cust_num,sku,qty,list_price,discount_rate,invoice_price,invoice_num,invoice_date_time,invoice_total,company_name,channel,zip_code,city,state,account_num,total_sales,region,first_name,last_name,tenure
0,LA6029,SW200,4,20000,0.24,15200.0,98105,2019-12-13 14:11:43.828,60800.0,Bell Frontier Resource,retail,95172,San Jose,CA,LA6029,1719822,West,Leonard,Malcolm,3.8
1,EB0265,PS501,4,30000,0.10,27000.0,58436,2019-06-05 23:12:47.344,108000.0,Speed Resource Vision,retail,64074,Napoleon,MO,EB0265,746216,Midwest,Mona,Sutton,5.4
2,EE4079,SW500,1,20000,0.36,12800.0,85825,2019-09-12 03:23:24.309,12800.0,Venture Construction,retail,06016,Broad Brook,CT,EE4079,1559544,NorthEast,Shannon,Muniz,5.6
3,YR6861,ACC5144,4,400,0.12,352.0,46422,2019-10-10 15:02:54.590,1408.0,Studio Pacific Galaxy,retail,79698,Abilene,TX,YR6861,1663488,South,Mickey,Tyner,0.7
4,WL5283,SW200,1,20000,0.17,16600.0,34838,2019-08-03 11:32:29.245,16600.0,Vision People Solutions,retail,24557,Gretna,VA,WL5283,1299450,South,Mickey,Tyner,0.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,XJ1430,SPB1,1,5000,0.19,4050.0,11706,2019-05-09 15:09:09.614,4050.0,Software Bell Technology,retail,45342,Miamisburg,OH,XJ1430,942044,Midwest,Mona,Sutton,5.4
1996,AI9833,SW500,3,20000,0.24,15200.0,38703,2019-11-10 03:55:57.038,45600.0,Resource Adventure Internet,retail,49752,Kinross,MI,AI9833,1580248,Midwest,Mona,Sutton,5.4
1997,WL5283,SW200,2,20000,0.40,12000.0,48217,2019-10-18 06:00:39.492,24000.0,Vision People Solutions,retail,24557,Gretna,VA,WL5283,1299450,South,Mickey,Tyner,0.7
1998,SM6748,ACC9011,18,400,0.38,248.0,66811,2019-07-24 05:07:14.352,4464.0,Advanced Alpha Federated,reseller,56023,Delavan,MN,SM6748,6833484,Midwest,Mona,Sutton,5.4


Let's clean up this table by removing the unnecessary columns.

In [31]:
final_data = detailed_trx.iloc[:,[1,2,8,9,10,13,16,17,18,19]].copy()
final_data

Unnamed: 0,sku,qty,invoice_total,company_name,channel,state,region,first_name,last_name,tenure
0,SW200,4,60800.0,Bell Frontier Resource,retail,CA,West,Leonard,Malcolm,3.8
1,PS501,4,108000.0,Speed Resource Vision,retail,MO,Midwest,Mona,Sutton,5.4
2,SW500,1,12800.0,Venture Construction,retail,CT,NorthEast,Shannon,Muniz,5.6
3,ACC5144,4,1408.0,Studio Pacific Galaxy,retail,TX,South,Mickey,Tyner,0.7
4,SW200,1,16600.0,Vision People Solutions,retail,VA,South,Mickey,Tyner,0.7
...,...,...,...,...,...,...,...,...,...,...
1995,SPB1,1,4050.0,Software Bell Technology,retail,OH,Midwest,Mona,Sutton,5.4
1996,SW500,3,45600.0,Resource Adventure Internet,retail,MI,Midwest,Mona,Sutton,5.4
1997,SW200,2,24000.0,Vision People Solutions,retail,VA,South,Mickey,Tyner,0.7
1998,ACC9011,18,4464.0,Advanced Alpha Federated,reseller,MN,Midwest,Mona,Sutton,5.4
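
Positional selection with `iloc` depends on the column order staying fixed. A possible alternative, sketched here on a hypothetical one-row frame `toy`, is to select the same columns by name.

```python
import pandas as pd

# Hypothetical one-row frame with a subset of the merged table's columns.
toy = pd.DataFrame({
    "cust_num": ["LA6029"],
    "sku": ["SW200"],
    "qty": [4],
    "invoice_total": [60800.0],
    "zip_code": ["95172"],
    "state": ["CA"],
})

# Selecting by name keeps working even if columns are reordered upstream.
keep = ["sku", "qty", "invoice_total", "state"]
by_name = toy.loc[:, keep].copy()

# The positional equivalent for this particular column order.
by_position = toy.iloc[:, [1, 2, 3, 5]].copy()

print(by_name.equals(by_position))
```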


In [32]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sku            2000 non-null   object 
 1   qty            2000 non-null   int64  
 2   invoice_total  2000 non-null   float64
 3   company_name   2000 non-null   object 
 4   channel        2000 non-null   object 
 5   state          2000 non-null   object 
 6   region         2000 non-null   object 
 7   first_name     2000 non-null   object 
 8   last_name      2000 non-null   object 
 9   tenure         2000 non-null   float64
dtypes: float64(2), int64(1), object(7)
memory usage: 171.9+ KB


Let's briefly check that the datasets were merged correctly by confirming that the total sales still tallies with the initial amount we had.

In [33]:
final_data['invoice_total'].sum() == prev_yr_total_sales

True
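
Because the transactions sit on the left side of a left merge, a matching total mainly shows that no rows were duplicated. A slightly broader check, sketched below on hypothetical frames `left` and `right`, also compares row counts, uses a tolerance for the float comparison, and confirms that every transaction found a customer.

```python
import math
import pandas as pd

# Hypothetical transactions and customer lookup (values are made up).
left = pd.DataFrame({
    "cust_num": ["A1", "A1", "B2"],
    "invoice_total": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({"account_num": ["A1", "B2"], "region": ["South", "West"]})

merged = pd.merge(
    left, right, left_on="cust_num", right_on="account_num", how="left"
)

# 1) A left merge on a unique key should not add or drop rows.
same_rows = len(merged) == len(left)

# 2) Compare float totals with a tolerance instead of exact equality.
same_total = math.isclose(
    merged["invoice_total"].sum(), left["invoice_total"].sum(), rel_tol=1e-9
)

# 3) Every transaction should have picked up a region from the lookup.
all_matched = bool(merged["region"].notna().all())

print(same_rows, same_total, all_matched)
```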

Earlier, we did not find any meaningful relationship between sales channel and the states the companies are located in. Let's see if there is any pattern in sales channel vs region, or sales agent.

In [34]:
agent_vs_channel_1 = pd.pivot_table(final_data, 
                                  index = ['last_name', 'region'],
                                  columns = 'channel',
                                  values = 'company_name',
                                  aggfunc = 'nunique',
                                  margins = True,
                                  fill_value = 0)
agent_vs_channel_1

Unnamed: 0_level_0,channel,partner,reseller,retail,All
last_name,region,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Malcolm,West,0,1,4,5
Muniz,NorthEast,0,1,5,6
Sutton,Midwest,0,6,9,15
Tyner,South,2,2,20,24
All,,2,10,38,50


Malcolm and Muniz appear to have notably fewer companies within their coverage than Sutton and Tyner, both overall and in the retail channel.

Tyner, who's in charge of the Southern region, appears to have significantly more companies using the retail channel compared to the other sales agents. He is also the only agent who has sales coming in via the partner channel.

Sutton, on the other hand, has significantly more customers in the reseller channel compared to the other sales agents.

There are 2 possibilities:
- Agent expertise: Tyner has expertise in dealing with partner and retail customers, while Sutton's forte is targeting resellers
- Geographical preference: The companies in the South favour the retail channel, or the companies in the Midwest favour the reseller channel

With the limitations of the data we currently have, we are unable to identify which possibility is more likely. We could, however, try to reorganise the sales agents' areas of coverage and observe the changes in the following year's sales data.

Let's also look at the amount of sales generated via each channel by each sales agent.

In [35]:
total_sales_by_agent_and_ch1 = pd.pivot_table(final_data, 
                                           index = 'last_name', 
                                           columns = 'channel',
                                           values = 'invoice_total',
                                           aggfunc = 'sum',
                                           margins = True,
                                           fill_value = 0).style.format('{:,.0f}')
total_sales_by_agent_and_ch1

channel,partner,reseller,retail,All
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Malcolm,0,7853376,5056396,12909772
Muniz,0,6222564,6627800,12850364
Sutton,0,45065832,10874362,55940194
Tyner,7002612,13566504,24224216,44793332
All,7002612,72708276,46782774,126493662


Comparing this to the previous table, we see that there is generally a positive correlation between the number of companies and the total sales volume.

Next, let's see if there is any correlation between the sales agents' tenure with the company and the amount of sales brought in.

In [36]:
final_data.groupby(['last_name']).agg({'invoice_total':'sum', 'tenure':'last'}).style.format({'invoice_total':'{:,.0f}', 'tenure':'{:.1f}'})

Unnamed: 0_level_0,invoice_total,tenure
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Malcolm,12909772,3.8
Muniz,12850364,5.6
Sutton,55940194,5.4
Tyner,44793332,0.7


Now let's see how much each agent would have made, assuming a flat commission rate of 0.79%.

In [37]:
final_data['comm_rate'] = avg_com
final_data['comm'] = round(final_data['invoice_total'] * avg_com,2)
final_data

Unnamed: 0,sku,qty,invoice_total,company_name,channel,state,region,first_name,last_name,tenure,comm_rate,comm
0,SW200,4,60800.0,Bell Frontier Resource,retail,CA,West,Leonard,Malcolm,3.8,0.007906,480.66
1,PS501,4,108000.0,Speed Resource Vision,retail,MO,Midwest,Mona,Sutton,5.4,0.007906,853.80
2,SW500,1,12800.0,Venture Construction,retail,CT,NorthEast,Shannon,Muniz,5.6,0.007906,101.19
3,ACC5144,4,1408.0,Studio Pacific Galaxy,retail,TX,South,Mickey,Tyner,0.7,0.007906,11.13
4,SW200,1,16600.0,Vision People Solutions,retail,VA,South,Mickey,Tyner,0.7,0.007906,131.23
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,SPB1,1,4050.0,Software Bell Technology,retail,OH,Midwest,Mona,Sutton,5.4,0.007906,32.02
1996,SW500,3,45600.0,Resource Adventure Internet,retail,MI,Midwest,Mona,Sutton,5.4,0.007906,360.49
1997,SW200,2,24000.0,Vision People Solutions,retail,VA,South,Mickey,Tyner,0.7,0.007906,189.73
1998,ACC9011,18,4464.0,Advanced Alpha Federated,reseller,MN,Midwest,Mona,Sutton,5.4,0.007906,35.29


In [38]:
agent_comm1 = final_data.groupby(['last_name']).agg({'comm':'sum'}).style.format('{:,.2f}')
agent_comm1

Unnamed: 0_level_0,comm
last_name,Unnamed: 1_level_1
Malcolm,102058.73
Muniz,101589.0
Sutton,442236.99
Tyner,354115.33


We see that, assuming a flat commission rate of 0.79%, Malcolm and Muniz would make significantly less than Sutton and Tyner.
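
As a worked check of the flat-rate scenario, the sketch below applies the budget-derived rate to each agent's total sales from the earlier table. It works with unrounded values, so the figures may differ slightly from the notebook's, which rounds each transaction's commission to cents. `agent_sales`, `budget`, `rate` and `payout` are names introduced here.

```python
# Per-agent sales totals copied from the table above.
agent_sales = {
    "Malcolm": 12_909_772,
    "Muniz": 12_850_364,
    "Sutton": 55_940_194,
    "Tyner": 44_793_332,
}
budget = 1_000_000

# Flat rate that spends the whole budget on last year's sales.
rate = budget / sum(agent_sales.values())

# Each agent's payout is simply their sales multiplied by the flat rate.
payout = {name: sales * rate for name, sales in agent_sales.items()}

for name, amount in payout.items():
    print(f"{name}: {amount:,.2f} ({amount / budget:.1%} of budget)")
```

Under a flat rate, each agent's share of the budget equals their share of total sales, so any imbalance in territory size carries straight through to commissions.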

##### Based on the sales agents' data, the following observations can be made:
- All 4 sales agents have sales via the reseller and retail channels.
- Malcolm and Muniz have fewer companies within their coverage than Sutton and Tyner, both overall and in the retail channel.
- The top sales agent, Sutton, has a much larger reseller volume compared to everyone else.
- Tyner, on the other hand, appears to have significantly more companies using the retail channel compared to the other sales agents.
- Only Tyner has sales coming in via the partner channel.
- Sutton also happens to be one of the longest-serving sales agents, having been with the company for 5.4 years. Apart from that, there appears to be little correlation between tenure and total sales volume for the rest of the agents.
- Assuming a flat commission rate of 0.79%, Malcolm and Muniz would make significantly less in commission than Sutton and Tyner.






### Next steps:

- We will consider applying different commission rates to the different sales channels, based on the effort required to manage these clients.
- We will take a closer look at the number of companies and sales volumes based on geographical distribution, and consider reorganising the sales agents' coverage areas.
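
To make the first next step concrete, here is one possible way to derive channel-specific rates that still add up to the budget. The effort weights are purely hypothetical placeholders, not values from the analysis; the channel totals are copied from the earlier pivot table.

```python
# Channel sales totals copied from the pivot table above.
channel_sales = {
    "reseller": 72_708_276,
    "retail": 46_782_774,
    "partner": 7_002_612,
}
budget = 1_000_000

# HYPOTHETICAL effort weights: a higher weight means more effort per dollar sold.
effort_weight = {"reseller": 1.0, "retail": 1.5, "partner": 0.5}

# Scale a base rate so that the weighted payouts sum exactly to the budget.
weighted_sales = sum(channel_sales[ch] * effort_weight[ch] for ch in channel_sales)
base_rate = budget / weighted_sales

# Each channel's rate is the base rate multiplied by its weight.
channel_rate = {ch: base_rate * effort_weight[ch] for ch in channel_sales}
payout = {ch: channel_sales[ch] * channel_rate[ch] for ch in channel_sales}

for ch in channel_sales:
    print(f"{ch}: rate {channel_rate[ch]:.3%}, payout {payout[ch]:,.0f}")
```

Changing the weights shifts commission between channels while keeping the total at the budget, which makes it easy to compare alternative schemes side by side.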

##### Let's save this final data table into a new Excel file.

In [39]:
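# The context manager saves and closes the workbook on exit.
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
    final_data.to_excel(writer, index=False, sheet_name="final_data")

    workbook = writer.book
    worksheet = writer.sheets['final_data']
    # Column C (invoice_total): thousands separator with two decimals.
    num_format = workbook.add_format({'num_format': '#,##0.00'})
    worksheet.set_column('C:C', 14, num_format)
    # Column D (company_name): widen so names are not cut off.
    worksheet.set_column('D:D', 28)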

##### We will also save the various tables into the report_v1 file for ease of performing comparisons later on.

In [40]:
# The context manager saves and closes the workbook on exit.
with pd.ExcelWriter(report, engine='xlsxwriter') as writer:
    # Two pivot tables side by side on one sheet, commissions on another.
    agent_vs_channel_1.to_excel(writer, index=True, sheet_name="agents_channels")
    total_sales_by_agent_and_ch1.to_excel(writer, index=True, sheet_name="agents_channels", startcol=9)
    agent_comm1.to_excel(writer, index=True, sheet_name="agents_comms")

    workbook = writer.book

    # Whole-number format for the sales pivot (columns K to N).
    num_format = workbook.add_format({'num_format': '#,##0'})
    worksheet = writer.sheets["agents_channels"]
    worksheet.set_column('K:N', 11, num_format)

    # Two-decimal format for the commission column.
    num_format = workbook.add_format({'num_format': '#,##0.00'})
    worksheet = writer.sheets["agents_comms"]
    worksheet.set_column('B:B', 10, num_format)