##### Importieren der notwendigen Bibliotheken:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

##### Lesen der Verkaufsdaten

In [2]:
sales_df = pd.read_csv("sales_clean.csv")
sales_df.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
0,1,5,2015-07-31,5263,555,1,1,1,0,0,0
1,2,5,2015-07-31,6064,625,1,1,1,0,0,0
2,3,5,2015-07-31,8314,821,1,1,1,0,0,0
3,4,5,2015-07-31,13995,1498,1,1,1,0,0,0
4,5,5,2015-07-31,4822,559,1,1,1,0,0,0


##### Erstellung neuer Metriken

Wir werden die Daten von 942 aggregieren, um individuelle Leistungsberichte für jede der Filialen zu erstellen. Die Spalten Date und DayOfWeek werden wir für diese Anlyse nicht mehr brauchen.

In [3]:
sales_df.drop(['Date', 'DayOfWeek'], axis=1, inplace=True)

Und nun gruppieren wir die Daten auf `Store` Ebene

In [4]:
global_sales_df = sales_df.groupby('Store').sum()
global_sales_df.head()

Unnamed: 0_level_0,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,3716854,440523,781,360,193,17,6,4
2,3883858,457855,784,360,167,15,6,4
3,5408261,584310,779,360,170,19,6,4
4,7556507,1036254,784,360,173,14,6,4
5,3642818,418588,779,360,172,21,6,4


Wir erstellen drei neue Variablen: `SalesPerDay`, `CustomersDay` und `SalesPerCustomer`.

In [5]:
global_sales_df['SalesPerDay'] = global_sales_df['Sales'] / global_sales_df['Open']
global_sales_df['CustomersPerDay'] = global_sales_df['Customers'] / global_sales_df['Open']
global_sales_df['SalesPerCustomer'] = global_sales_df['Sales'] / global_sales_df['Customers']

global_sales_df.head()

Unnamed: 0_level_0,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas,SalesPerDay,CustomersPerDay,SalesPerCustomer
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,3716854,440523,781,360,193,17,6,4,4759.096031,564.049936,8.437366
2,3883858,457855,784,360,167,15,6,4,4953.90051,583.998724,8.482725
3,5408261,584310,779,360,170,19,6,4,6942.568678,750.077022,9.255808
4,7556507,1036254,784,360,173,14,6,4,9638.401786,1321.752551,7.292138
5,3642818,418588,779,360,172,21,6,4,4676.274711,537.34018,8.702634


##### Verknüpfung mit den Daten der Filialen:

In [29]:
stores_df = pd.read_csv("stores_clean.csv")
sales_stores_df = pd.merge(stores_df, global_sales_df, how = 'inner', on = 'Store')
sales_stores_df.set_index('Store', inplace=True)
sales_stores_df.head()

Unnamed: 0_level_0,StoreType,Assortment,CompetitionDistance,Promo2,PromoInterval,CompetitionOpenSince,Promo2Since,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas,SalesPerDay,CustomersPerDay,SalesPerCustomer
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,c,a,1270.0,0,,2008-09-01,,3716854,440523,781,360,193,17,6,4,4759.096031,564.049936,8.437366
2,a,a,570.0,1,"Jan,Apr,Jul,Oct",2007-11-01,2010-03-29,3883858,457855,784,360,167,15,6,4,4953.90051,583.998724,8.482725
3,a,a,14130.0,1,"Jan,Apr,Jul,Oct",2006-12-01,2011-04-04,5408261,584310,779,360,170,19,6,4,6942.568678,750.077022,9.255808
4,c,c,620.0,0,,2009-09-01,,7556507,1036254,784,360,173,14,6,4,9638.401786,1321.752551,7.292138
5,a,a,29910.0,0,,2015-04-01,,3642818,418588,779,360,172,21,6,4,4676.274711,537.34018,8.702634


Betrachten wir die deskriptiven Statistiken für die Variablen, die direkt mit dem Gesamtumsatz zusammenhängen.

In [31]:
sales_vars = ['Sales', 'SalesPerDay', 'CustomersPerDay', 'SalesPerCustomer']
sales_stores_df[sales_vars].describe().apply(lambda x: x.apply('{0:.5f}'.format))

Unnamed: 0,Sales,SalesPerDay,CustomersPerDay,SalesPerCustomer
count,1115.0,1115.0,1115.0,1115.0
mean,5267426.56771,6934.20845,754.51016,9.64376
std,1951304.48397,2383.91105,353.34441,1.98686
min,2114322.0,2703.73657,240.1831,3.5137
25%,3949377.0,5322.29997,541.46869,8.13186
50%,4990259.0,6589.94847,678.66752,9.46406
75%,6084147.5,7964.20064,866.2033,10.98117
max,19516842.0,21757.48342,3403.4586,16.16264


In der ersten Analyse hatten wir schon festgestellt, dass einige Filialen oft weit überdurchschnittliche Gewinne erzielen.  
Dies sind die Filialen, deren Gesamtumsatz mehr als 3 Standardabweichungen über dem Durchschnitt liegt.

In [37]:
sales_mean = sales_stores_df['Sales'].mean()
sales_std = sales_stores_df['Sales'].std()
high_sales = sales_mean + 3*sales_std

outlier_stores = sales_stores_df[sales_stores_df['Sales'] > high_sales].sort_values(by='Sales', ascending=False).index
sales_stores_df.iloc[outlier_stores]

Unnamed: 0_level_0,StoreType,Assortment,CompetitionDistance,Promo2,PromoInterval,CompetitionOpenSince,Promo2Since,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas,SalesPerDay,CustomersPerDay,SalesPerCustomer
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
263,a,c,1140.0,1,"Jan,Apr,Jul,Oct",2013-05-01,2014-10-06,2306075,221342,622,286,124,21,6,2,3707.516077,355.855305,10.418606
818,d,a,490.0,1,"Mar,Jun,Sept,Dec",2010-02-01,2010-08-30,4772357,496141,781,360,193,17,6,4,6110.572343,635.263764,9.618953
563,a,a,700.0,1,"Jan,Apr,Jul,Oct",2015-03-01,2014-03-10,4278470,545236,776,360,170,19,6,4,5513.492268,702.623711,7.847006
1115,d,c,5350.0,1,"Mar,Jun,Sept,Dec",2010-02-01,2012-05-28,4922229,337884,781,360,193,17,6,4,6302.46991,432.629962,14.567807
252,d,c,22330.0,1,"Feb,May,Aug,Nov",2010-02-01,2010-02-01,8269484,630702,779,360,170,19,6,4,10615.512195,809.630295,13.111555
514,c,c,1200.0,1,"Jan,Apr,Jul,Oct",2012-07-01,2012-07-02,3580238,375667,622,286,124,21,6,2,5756.009646,603.966238,9.53035
789,a,c,9770.0,0,,2003-07-01,,2626269,364576,777,360,155,22,6,4,3380.011583,469.209781,7.203626
734,a,a,220.0,1,"Mar,Jun,Sept,Dec",2010-02-01,2013-09-09,4658828,440954,779,360,170,19,6,4,5980.523748,566.051348,10.565338
384,a,c,130.0,1,"Jan,Apr,Jul,Oct",2010-02-01,2011-04-04,6937572,781105,782,360,167,15,6,4,8871.575448,998.855499,8.881741
757,a,c,3450.0,0,,2010-02-01,,4832775,416360,777,360,155,22,6,4,6219.787645,535.855856,11.607203
