### Weekly aggregation

As usual, we start by importing the required libraries and reading the data into a DataFrame.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [12]:
sales_df = pd.read_csv("../data/sales_clean.csv")
sales_df.head()

,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
0,1,5,2015-07-31,5263,555,1,1,1,0,0,0
1,2,5,2015-07-31,6064,625,1,1,1,0,0,0
2,3,5,2015-07-31,8314,821,1,1,1,0,0,0
3,4,5,2015-07-31,13995,1498,1,1,1,0,0,0
4,5,5,2015-07-31,4822,559,1,1,1,0,0,0


Since we will aggregate the data at the weekly level, we no longer need `DayOfWeek`:

In [13]:
sales_df.drop('DayOfWeek', axis=1, inplace=True)

We convert `Date` to datetime and make it the index:

In [14]:
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
sales_df.set_index('Date', inplace=True)
sales_df.sort_index(inplace=True)
sales_df.head()

Date,Store,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
2013-01-01,1115,0,0,0,0,1,1,0,0
2013-01-01,379,0,0,0,0,1,1,0,0
2013-01-01,378,0,0,0,0,1,1,0,0
2013-01-01,377,0,0,0,0,1,1,0,0
2013-01-01,376,0,0,0,0,1,1,0,0


Now we can retrieve a DataFrame for each date:

In [15]:
sales_df.loc['2013-01-01']

Date,Store,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
2013-01-01,1115,0,0,0,0,1,1,0,0
2013-01-01,379,0,0,0,0,1,1,0,0
2013-01-01,378,0,0,0,0,1,1,0,0
2013-01-01,377,0,0,0,0,1,1,0,0
2013-01-01,376,0,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...
2013-01-01,742,0,0,0,0,1,1,0,0
2013-01-01,743,0,0,0,0,1,1,0,0
2013-01-01,744,0,0,0,0,1,1,0,0
2013-01-01,745,0,0,0,0,1,1,0,0
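
Selecting with a date string works because the index is a `DatetimeIndex`; when several rows share a date, `.loc` returns all of them. A minimal sketch on a made-up frame (the values are invented for illustration and are not taken from `sales_clean.csv`):

```python
import pandas as pd

# Toy data: two stores observed on two dates (values are invented).
toy_df = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-01", "2013-01-01", "2013-01-02"],
    "Store": [1, 1, 2, 2],
    "Sales": [100, 0, 0, 250],
})

# Same steps as above: parse the dates, index by them, sort chronologically.
toy_df["Date"] = pd.to_datetime(toy_df["Date"])
toy_df = toy_df.set_index("Date").sort_index()

# A date string selects every row that shares that date.
first_day = toy_df.loc["2013-01-01"]
print(first_day)
```

Here `first_day` holds one row per store, which mirrors the per-date frame shown above.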


Let's aggregate the data weekly:

In [16]:
# The plain offset 'W' would give weeks anchored on Sunday (W-SUN);
# 'W-MON' closes each weekly bin on Monday (Tuesday through Monday).
week_sales_df = sales_df.groupby([pd.Grouper(freq='W-MON'), 'Store']).sum()
week_sales_df.head()

# With code like this we can check that it has worked fine:
#foo = sales_df.loc['2013-01-01':'2013-01-07']
#foo[foo['Store'] == 1]['Sales'].sum()

Date,Store,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
2013-01-07,1,26516,3285,5,1,7,1,0,0
2013-01-07,2,22182,2866,5,1,4,1,0,0
2013-01-07,3,35564,3820,5,1,4,1,0,0
2013-01-07,4,48928,6985,5,1,4,1,0,0
2013-01-07,5,20742,2520,5,1,2,1,0,0
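
To see which days land in each `W-MON` bin, the following sketch aggregates a made-up daily series of ones, so each weekly sum is simply the number of days in that bin. It relies on the pandas defaults for weekly frequencies (bins closed and labelled on the right); the toy data is an assumption for illustration only.

```python
import pandas as pd

# Toy data: one observation per day, 2013-01-01 (Tuesday) to 2013-01-14 (Monday).
days = pd.date_range("2013-01-01", "2013-01-14", freq="D")
toy_df = pd.DataFrame({"Store": 1, "Sales": 1}, index=days)
toy_df.index.name = "Date"

# Same grouping as above: weeks anchored on Monday, per store.
weekly = toy_df.groupby([pd.Grouper(freq="W-MON"), "Store"]).sum()

# Each label is the Monday that closes a Tuesday-to-Monday bin.
first_bin = weekly.loc[(pd.Timestamp("2013-01-07"), 1), "Sales"]
second_bin = weekly.loc[(pd.Timestamp("2013-01-14"), 1), "Sales"]
print(first_bin, second_bin)
```

Both bins contain seven days: 1 to 7 January and 8 to 14 January. The Monday that labels a bin is therefore its last day, not its first.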


In [17]:
week_sales_df = week_sales_df.reset_index()

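I want `Date` to show the Monday at the start of the week rather than the one at the end. With `W-MON` the label is already a Monday (the last day of each Tuesday-to-Monday bin), so I shift it back by seven days: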

In [18]:
week_sales_df['Date'] = week_sales_df['Date'] - pd.Timedelta(days=7)
week_sales_df.head()

,Date,Store,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
0,2012-12-31,1,26516,3285,5,1,7,1,0,0
1,2012-12-31,2,22182,2866,5,1,4,1,0,0
2,2012-12-31,3,35564,3820,5,1,4,1,0,0
3,2012-12-31,4,48928,6985,5,1,4,1,0,0
4,2012-12-31,5,20742,2520,5,1,2,1,0,0
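
An alternative that avoids the manual shift is to let pandas close and label each bin on the left. This is not what the notebook does: it yields Monday-to-Sunday weeks labelled by their first day, so the weekly sums differ from the Tuesday-to-Monday bins used here. A sketch on made-up data, assuming the `closed` and `label` parameters of `pd.Grouper`:

```python
import pandas as pd

# Toy data: one observation per day, 2013-01-01 (Tuesday) to 2013-01-14 (Monday).
days = pd.date_range("2013-01-01", "2013-01-14", freq="D")
toy_df = pd.DataFrame({"Store": 1, "Sales": 1}, index=days)
toy_df.index.name = "Date"

# closed="left", label="left": each bin runs Monday..Sunday
# and is labelled by the Monday that opens it.
grouper = pd.Grouper(freq="W-MON", closed="left", label="left")
weekly_left = toy_df.groupby([grouper, "Store"]).sum().reset_index()
print(weekly_left)
```

In this toy series the bin labelled 2012-12-31 contains only six days, because the data starts on a Tuesday, and the bin labelled 2013-01-14 contains a single day.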


In [19]:
# Create a time series with the number of observations per week
obs_by_date = week_sales_df.groupby('Date').size()

total_weeks = 0
for obs in obs_by_date.unique():
    # Number of weeks that have exactly `obs` observations
    obs_size = obs_by_date[obs_by_date == obs].size
    total_weeks += obs_size
    print("There are {} weeks with {} data points.".format(obs_size, obs))

print("And there are {} weeks in total.".format(total_weeks))

There are 109 weeks with 1115 data points.
There are 26 weeks with 935 data points.
And there are 135 weeks in total.
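
The same tally can be obtained without an explicit loop using `Series.value_counts`, which counts how often each group size occurs. A sketch on a made-up frame (dates and stores are invented for illustration):

```python
import pandas as pd

# Toy weekly data: two weeks with three stores, one week with two stores.
toy_weekly = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-31", "2012-12-31", "2012-12-31",
                            "2013-01-07", "2013-01-07",
                            "2013-01-14", "2013-01-14", "2013-01-14"]),
    "Store": [1, 2, 3, 1, 2, 1, 2, 3],
})

# Observations per week, then how many weeks share each observation count.
obs_by_date = toy_weekly.groupby("Date").size()
weeks_per_count = obs_by_date.value_counts()

for n_obs, n_weeks in weeks_per_count.items():
    print("There are {} weeks with {} data points.".format(n_weeks, n_obs))
print("And there are {} weeks in total.".format(weeks_per_count.sum()))
```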


### Joining with the store data

We read the store data and combine it with the sales data:

In [20]:
stores_df = pd.read_csv("../data/stores_clean.csv")
sales_stores_df = pd.merge(stores_df, week_sales_df, how = 'inner', on = 'Store')
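
If each store is expected to appear exactly once in the store table, the merge can be asked to verify that through the `validate` parameter of `pd.merge`. The sketch below uses made-up frames; the column `StoreType` is hypothetical and merely stands in for the real store attributes.

```python
import pandas as pd

# Toy frames: one row per store on the left, several weeks per store on the right.
toy_stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})
toy_weekly = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-31", "2013-01-07", "2012-12-31"]),
    "Store": [1, 1, 2],
    "Sales": [26516, 30493, 22182],
})

# validate="one_to_many" raises pd.errors.MergeError
# if a store appears more than once in toy_stores.
merged = pd.merge(toy_stores, toy_weekly, how="inner", on="Store",
                  validate="one_to_many")

# With unique stores on the left and every store present,
# the inner merge keeps every weekly row.
print(len(merged) == len(toy_weekly))
```

A row count that matches the weekly frame after the merge suggests that no store was dropped or duplicated by the join.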

We set the index to the date again:

In [21]:
sales_stores_df.set_index('Date', inplace=True)
sales_stores_df.head()

Date,Store,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment,CompetitionDistance,COBefore2005-11-16,COBetween2005-11-16_2010-03-01,COAfter2010-03-01,...,P2Between2011-04-04_2013-02-04,P2After2013-02-04,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
2012-12-31,1,0,0,1,0,0,1270.0,0,1,0,...,0,0,26516,3285,5,1,7,1,0,0
2013-01-07,1,0,0,1,0,0,1270.0,0,1,0,...,0,0,30493,3749,6,4,4,0,0,0
2013-01-14,1,0,0,1,0,0,1270.0,0,1,0,...,0,0,26655,3408,6,1,0,0,0,0
2013-01-21,1,0,0,1,0,0,1270.0,0,1,0,...,0,0,31732,3804,6,4,0,0,0,0
2013-01-28,1,0,0,1,0,0,1270.0,0,1,0,...,0,0,31670,3774,6,1,0,0,0,0


In [22]:
sales_stores_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 145845 entries, 2012-12-31 to 2015-07-27
Data columns (total 25 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Store                           145845 non-null  int64  
 1   StoreType_a                     145845 non-null  int64  
 2   StoreType_b                     145845 non-null  int64  
 3   StoreType_c                     145845 non-null  int64  
 4   StoreType_d                     145845 non-null  int64  
 5   Assortment                      145845 non-null  int64  
 6   CompetitionDistance             145845 non-null  float64
 7   COBefore2005-11-16              145845 non-null  int64  
 8   COBetween2005-11-16_2010-03-01  145845 non-null  int64  
 9   COAfter2010-03-01               145845 non-null  int64  
 10  Promo2                          145845 non-null  int64  
 11  P2Jan                           145845 non-null  int64  
 12  

In [23]:
sales_stores_df.describe()

,Store,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment,CompetitionDistance,COBefore2005-11-16,COBetween2005-11-16_2010-03-01,COAfter2010-03-01,...,P2Between2011-04-04_2013-02-04,P2After2013-02-04,Sales,Customers,Open,Promo,SchoolHoliday,PublicHoliday,Easter,Christmas
count,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,...,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0,145845.0
mean,558.42393,0.542261,0.015558,0.134499,0.307683,0.935041,5425.089993,0.345853,0.325702,0.328445,...,0.156282,0.154815,40270.016956,4415.933045,5.789653,2.660907,1.245987,0.138915,0.045871,0.028112
std,321.909204,0.498213,0.123757,0.341189,0.461536,0.993798,7705.211913,0.475647,0.468638,0.469649,...,0.363124,0.36173,15672.653504,2326.007467,0.601661,1.611517,2.026739,0.357901,0.299396,0.235445
min,1.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,280.0,0.0,0.0,0.0,0.0,0.0,710.0,0.0,0.0,0.0,...,0.0,0.0,30053.0,3088.0,6.0,1.0,0.0,0.0,0.0,0.0
50%,558.0,1.0,0.0,0.0,0.0,0.0,2330.0,0.0,0.0,0.0,...,0.0,0.0,37617.0,3903.0,6.0,4.0,0.0,0.0,0.0,0.0
75%,838.0,1.0,0.0,0.0,1.0,2.0,6880.0,1.0,1.0,1.0,...,0.0,0.0,46879.0,5058.0,6.0,4.0,1.0,0.0,0.0,0.0
max,1115.0,1.0,1.0,1.0,1.0,2.0,75860.0,1.0,1.0,1.0,...,1.0,1.0,205663.0,30030.0,7.0,5.0,7.0,2.0,2.0,2.0
