In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
from fastai.basics import *

# Rossmann

## Data preparation / Feature engineering

Set `PATH` to the path `~/data/rossmann/`. Create a list of table names, with one entry for each CSV that you'll be loading: 
- train
- store
- store_states
- state_names
- googletrend
- weather
- test

For each csv, read it in using pandas (with `low_memory=False`), and assign it to a variable corresponding with its name. Print out the lengths of the `train` and `test` tables.

In [4]:
PATH=Path('~/data/rossmann/')
table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']
tables = [pd.read_csv(PATH/f'{fname}.csv', low_memory=False) for fname in table_names]
train, store, store_states, state_names, googletrend, weather, test = tables
len(train),len(test)

FileNotFoundError: [Errno 2] File b'/home/paperspace/data/rossmann/train.csv' does not exist: b'/home/paperspace/data/rossmann/train.csv'

Turn the `StateHoliday` column into a boolean indicating whether or not the day was a holiday.

In [None]:
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

Print out the head of the dataframe.

In [None]:
train.head()

Create a function `join_df` that joins two dataframes together. It should take the following arguments:
- left (the df on the left)
- right (the df on the right)
- left_on (the left table join key)
- right_on (the right table join key, defaulting to None; if nothing passed, default to the same as the left join key)
- suffix (default to '_y'; a suffix to give to duplicate columns)

In [6]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
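As a quick sanity check, here is a minimal sketch of how `join_df` behaves on two small made-up tables (`sales` and `info` are hypothetical, not part of the Rossmann data). It shows the duplicate-column suffix and how unmatched rows surface as nulls on the right side.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    # Same left-join helper as above, repeated so this sketch runs on its own.
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Hypothetical toy tables: store 3 has no match on the right side.
sales = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [100, 200, 300]})
info = pd.DataFrame({'Store': [1, 2], 'Sales': [0, 0], 'State': ['NW', 'BY']})

out = join_df(sales, info, 'Store')
print(out)

# The right table's duplicate 'Sales' column gets the '_y' suffix,
# and the unmatched store shows up as a null in 'State'.
n_unmatched = out.State.isnull().sum()
print(n_unmatched)
```

The left table's column names are kept unchanged because the first suffix is the empty string.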

Join the weather and state names tables together, and reassign them to the variable `weather`.

In [7]:
weather = join_df(weather, state_names, "file", "StateName")

Show the first few rows of the weather df.

In [8]:
weather.head()

Unnamed: 0,file,Date,Max_TemperatureC,Mean_TemperatureC,Min_TemperatureC,Dew_PointC,MeanDew_PointC,Min_DewpointC,Max_Humidity,Mean_Humidity,...,Min_VisibilitykM,Max_Wind_SpeedKm_h,Mean_Wind_SpeedKm_h,Max_Gust_SpeedKm_h,Precipitationmm,CloudCover,Events,WindDirDegrees,StateName,State
0,NordrheinWestfalen,2013-01-01,8,4,2,7,5,1,94,87,...,4.0,39,26,58.0,5.08,6.0,Rain,215,NordrheinWestfalen,NW
1,NordrheinWestfalen,2013-01-02,7,4,1,5,3,2,93,85,...,10.0,24,16,,0.0,6.0,Rain,225,NordrheinWestfalen,NW
2,NordrheinWestfalen,2013-01-03,11,8,6,10,8,4,100,93,...,2.0,26,21,,1.02,7.0,Rain,240,NordrheinWestfalen,NW
3,NordrheinWestfalen,2013-01-04,9,9,8,9,9,8,100,94,...,2.0,23,14,,0.25,7.0,Rain,263,NordrheinWestfalen,NW
4,NordrheinWestfalen,2013-01-05,8,8,7,8,7,6,100,94,...,3.0,16,10,,0.0,7.0,Rain,268,NordrheinWestfalen,NW


In the `googletrend` table, set the `Date` variable to the first date in the hyphen-separated date string in the `week` field. Set the `State` field to the third element in the underscore-separated string from the `file` field. In all rows where `State == 'NI'`, set it to `HB,NI` instead, which is how it's referred to throughout the rest of the data.

In [9]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'
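To see what the two splits produce, here is a small sketch on made-up rows. The string layouts in `week` and `file` are assumed to mirror the real table; the frame `gt` is hypothetical.

```python
import pandas as pd

# Hypothetical rows; the layouts of 'file' and 'week' are assumptions.
gt = pd.DataFrame({
    'file': ['Rossmann_DE_SN', 'Rossmann_DE_NI'],
    'week': ['2012-12-02 - 2012-12-08', '2012-12-09 - 2012-12-15'],
})

# expand=True returns one column per piece; [0] / [2] pick the piece we want.
gt['Date'] = gt.week.str.split(' - ', expand=True)[0]
gt['State'] = gt.file.str.split('_', expand=True)[2]

# Rename the state code so it matches the rest of the data.
gt.loc[gt.State == 'NI', 'State'] = 'HB,NI'
print(gt[['Date', 'State']])
```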

Write a function `add_datepart` that takes a date field and adds a bunch of numeric columns containing information about the date. It should take the following arguments:
- df (the dataframe you'll be modifying)
- fldname (the date field you'll be splitting into new columns)
- drop (whether or not to drop the old date field; defaults to True)
- time (whether or not to add time fields -- Hour, Minute, Second; defaults to False)

Remember the edge cases around the dtype of the field. Specifically, if it's of type DatetimeTZDtype, cast it instead to np.datetime64. If it's not a subtype of datetime64 already, infer it (see `pd.to_datetime`).

In [5]:
def add_datepart(df, fldname, drop=True, time=False):
    "Helper function that adds columns relevant to a date."
    fld = df[fldname]
    fld_dtype = fld.dtype
    if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        fld_dtype = np.datetime64

    if not np.issubdtype(fld_dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)
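The following sketch shows, on two made-up dates, the kind of values `add_datepart` derives through the `.dt` accessor. It is not a replacement for the function above. For `Elapsed` it uses a timedelta division rather than `astype(np.int64) // 10 ** 9`, since the latter assumes nanosecond resolution.

```python
import pandas as pd

# Two hypothetical dates: a year start and a quarter end.
dates = pd.to_datetime(pd.Series(['2013-01-01', '2013-03-31']))

parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Dayofweek': dates.dt.dayofweek,          # Monday=0 ... Sunday=6
    'Is_month_end': dates.dt.is_month_end,
    'Is_quarter_end': dates.dt.is_quarter_end,
})

# Seconds since the Unix epoch, independent of the datetime resolution.
parts['Elapsed'] = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(parts)
```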

Use `add_datepart` to add date fields to the weather, googletrend, train and test tables.

In [11]:
add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)

Print out the head of the weather table.

In [12]:
weather.head()

Unnamed: 0,file,Date,Max_TemperatureC,Mean_TemperatureC,Min_TemperatureC,Dew_PointC,MeanDew_PointC,Min_DewpointC,Max_Humidity,Mean_Humidity,...,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start,Elapsed
0,NordrheinWestfalen,2013-01-01,8,4,2,7,5,1,94,87,...,1,1,1,False,True,False,True,False,True,1356998400
1,NordrheinWestfalen,2013-01-02,7,4,1,5,3,2,93,85,...,2,2,2,False,False,False,False,False,False,1357084800
2,NordrheinWestfalen,2013-01-03,11,8,6,10,8,4,100,93,...,3,3,3,False,False,False,False,False,False,1357171200
3,NordrheinWestfalen,2013-01-04,9,9,8,9,9,8,100,94,...,4,4,4,False,False,False,False,False,False,1357257600
4,NordrheinWestfalen,2013-01-05,8,8,7,8,7,6,100,94,...,5,5,5,False,False,False,False,False,False,1357344000


In the `googletrend` table, the `file` column has an entry `Rossmann_DE` that represents the whole of Germany. Break that out into its own separate table, since it will be joined on the date fields alone rather than on both the date fields and `State`.

In [13]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now let's do a bunch of joins to build our entire dataset! Remember after each one to check if the right-side data is null. This is the benefit of left-joining; it's easy to debug by checking for null rows. Let's start by joining `store` and `store_states` in a new table called `store`.

In [14]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

0
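Checking a right-side column for nulls works when that column has no missing values of its own. As an alternative sketch (not what the notebook does), pandas' `indicator=True` option on `merge` flags unmatched rows directly. The frames `left_toy` and `right_toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical tables: store 2 is missing from the right side.
left_toy = pd.DataFrame({'Store': [1, 2, 3]})
right_toy = pd.DataFrame({'Store': [1, 3], 'State': ['NW', 'BY']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = left_toy.merge(right_toy, how='left', on='Store', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```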

Next let's join `train` and `store` in a table called `joined`. Do the same for `test` and `store` in a table called `joined_test`.

In [15]:
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])

(0, 0)

Next join `joined` and `googletrend` on the columns `["State", "Year", "Week"]`. Again, do the same for the test data.

In [16]:
joined = join_df(joined, googletrend, ["State","Year", "Week"])
joined_test = join_df(joined_test, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()]),len(joined_test[joined_test.trend.isnull()])

(0, 0)

Join `joined` and `trend_de` on `["Year", "Week"]` with suffix `_DE`. Same for test.

In [17]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]),len(joined_test[joined_test.trend_DE.isnull()])

(0, 0)

Join `joined` and `weather` on `["State", "Date"]`. Same for test.

In [18]:
joined = join_df(joined, weather, ["State","Date"])
joined_test = join_df(joined_test, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()]),len(joined_test[joined_test.Mean_TemperatureC.isnull()])

(0, 0)

Now for every column in both `joined` and `joined_test`, check to see if it has the `_y` suffix, and if so, drop it. Warning: a data frame can have duplicate column names, but calling `df.drop` will drop _all_ instances with the passed-in column name! This could lead to calling drop a second time on a column that no longer exists!

In [20]:
for df in (joined, joined_test):
    for c in df.columns:
        if c.endswith('_y'):
            if c in df.columns: df.drop(c, inplace=True, axis=1)
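One way to sidestep the duplicate-name pitfall is to select the columns to keep with a boolean mask instead of dropping by name. This is a sketch on a made-up frame (`df_dup` is hypothetical), offered as an alternative to the loop above.

```python
import pandas as pd

# Hypothetical frame where the '_y' column name appears twice.
df_dup = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b_y', 'b_y', 'c'])

# Keep every column whose name does not end in '_y'.
# The mask is positional, so duplicated names are handled in one pass.
cleaned = df_dup.loc[:, ~df_dup.columns.str.endswith('_y')]
print(list(cleaned.columns))
```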

For the columns `CompetitionOpenSinceYear`, `CompetitionOpenSinceMonth`, `Promo2SinceYear`, and `Promo2SinceWeek`, replace `NA` values with the following values (respectively):
- 1900
- 1
- 1900
- 1

In [21]:
for df in (joined,joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

Create a new field `CompetitionOpenSince` that combines `CompetitionOpenSinceYear` and `CompetitionOpenSinceMonth` into a specific date (the 15th of that month). Then create a new field `CompetitionDaysOpen` that subtracts `CompetitionOpenSince` from `Date`.

In [22]:
for df in (joined,joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, 
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days
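Here is a small sketch of the same two steps on made-up rows (`toy_comp` is hypothetical). It shows `pd.to_datetime` assembling dates from year/month/day parts, and that the 1900 placeholder year yields a very large day count, which is why it needs special handling afterwards.

```python
import pandas as pd

# Hypothetical rows: one real opening date, one 1900 placeholder.
toy_comp = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-31', '2015-07-31']),
    'CompetitionOpenSinceYear': [2008, 1900],
    'CompetitionOpenSinceMonth': [9, 1],
})

# to_datetime can build dates from a mapping of year/month/day components.
toy_comp['CompetitionOpenSince'] = pd.to_datetime(dict(
    year=toy_comp.CompetitionOpenSinceYear,
    month=toy_comp.CompetitionOpenSinceMonth,
    day=15))

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
toy_comp['CompetitionDaysOpen'] = (toy_comp.Date - toy_comp.CompetitionOpenSince).dt.days
print(toy_comp)
```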

For `CompetitionDaysOpen`, replace values where `CompetitionDaysOpen < 0` with 0, and cases where `CompetitionOpenSinceYear < 1990` with 0.

In [23]:
for df in (joined,joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

Add a `CompetitionMonthsOpen` field, capping it at 24 months (2 years) to limit the number of unique categories.

In [24]:
for df in (joined,joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()

array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])
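The same cap can also be written with `Series.clip`. A minimal sketch on made-up day counts (`days_open` is hypothetical):

```python
import pandas as pd

# Hypothetical day counts.
days_open = pd.Series([0, 45, 400, 2510])

# Integer-divide by 30 to approximate months, then cap at 24.
months_open = (days_open // 30).clip(upper=24)
print(months_open.tolist())
```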

Same process for Promo dates. You may need to install the `isoweek` package first.

In [28]:
# If needed, uncomment:
# ! pip install isoweek

Collecting isoweek
  Downloading https://files.pythonhosted.org/packages/c2/d4/fe7e2637975c476734fcbf53776e650a29680194eb0dd21dbdc020ca92de/isoweek-1.3.3-py2.py3-none-any.whl
Installing collected packages: isoweek
Successfully installed isoweek-1.3.3


Use the `isoweek` package to turn `Promo2Since` into a specific date: the Monday of the week specified in the column. Compute a field `Promo2Days` that subtracts the `Promo2Since` date from the current date.

In [29]:
from isoweek import Week
for df in (joined,joined_test):
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days
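If installing `isoweek` is not an option, the Monday of an ISO week can also be obtained from the standard library. This is an alternative sketch, assuming Python 3.8+ (for `date.fromisocalendar`); `toy_promo` is a made-up frame.

```python
from datetime import date
import pandas as pd

# Hypothetical year/week pairs.
toy_promo = pd.DataFrame({'Promo2SinceYear': [2013, 2010],
                          'Promo2SinceWeek': [1, 13]})

# fromisocalendar(year, week, weekday): weekday 1 is Monday.
toy_promo['Promo2Since'] = pd.to_datetime([
    date.fromisocalendar(int(y), int(w), 1)
    for y, w in zip(toy_promo.Promo2SinceYear, toy_promo.Promo2SinceWeek)
])
print(toy_promo)
```

Note that ISO week 1 of a year can start in the previous calendar year, as the first row shows.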

Perform the following modifications on both the train and test set:
- For cases where `Promo2Days` is negative or `Promo2SinceYear` is before 1990, set `Promo2Days` to 0
- Create `Promo2Weeks` by integer-dividing `Promo2Days` by 7
- For cases where `Promo2Weeks` is negative, set `Promo2Weeks` to 0
- For cases where `Promo2Weeks` is above 25, set `Promo2Weeks` to 25

Print the number of unique values for `Promo2Weeks` in training and test df's.

In [30]:
for df in (joined,joined_test):
    df.loc[df.Promo2Days<0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear<1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"]//7
    df.loc[df.Promo2Weeks<0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks>25, "Promo2Weeks"] = 25
    print(len(df.Promo2Weeks.unique()))

Pickle `joined` to `PATH/'joined'` and `joined_test` to `PATH/'joined_test'`.

In [31]:
joined.to_pickle(PATH/'joined')
joined_test.to_pickle(PATH/'joined_test')

## Durations

Write a function `get_elapsed` that takes arguments `fld` (a boolean field) and `pre` (a prefix to be prepended to `fld` to name a new column), and adds a column `pre+fld` holding the difference in days between the current date and the last date on which `fld` was true. Note that the function reads and modifies the global `df`, which must already be sorted by `Store` and `Date`.

In [32]:
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []

    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1))
    df[pre+fld] = res
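For intuition about what the loop computes, here is a vectorized sketch of "days since the last event, per store" on a made-up frame. It is an alternative formulation using `where` plus a grouped forward-fill, not the notebook's implementation, and `toy_el` is hypothetical.

```python
import pandas as pd

# Hypothetical data: two stores, with Promo true on 2015-01-02 in each.
toy_el = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03',
                            '2015-01-04', '2015-01-01', '2015-01-02']),
    'Promo': [False, True, False, False, False, True],
}).sort_values(['Store', 'Date'])

# Keep the date only where the flag is set, then carry it forward within each store.
last_event = toy_el.Date.where(toy_el.Promo).groupby(toy_el.Store).ffill()

# Rows before a store's first event have no last date, so they come out as NaN.
toy_el['AfterPromo'] = (toy_el.Date - last_event).dt.days
print(toy_el)
```

Sorting by date descending before the same steps would give the days until the next event (as negative numbers).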

We'll be applying this to a subset of columns:

Create a variable `columns` containing the strings: 
- Date
- Store
- Promo
- StateHoliday
- SchoolHoliday

These will be the fields on which we'll be computing elapsed days since/until.

In [33]:
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

Create one big dataframe with both the train and test sets called `df`.

In [34]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([train[columns], test[columns]])

Sort by `Store` and `Date` ascending, and use `get_elapsed` to get the days since the last `SchoolHoliday` on each day. Reorder by `Store` ascending and `Date` descending to get the days _until_ the next `SchoolHoliday`.

In [35]:
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Do the same for `StateHoliday`.

In [36]:
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Do the same for `Promo`.

In [37]:
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Set the index on `df` to `Date`.

In [38]:
df = df.set_index("Date")

Reassign `columns` to `['SchoolHoliday', 'StateHoliday', 'Promo']`.

In [39]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

For columns `Before/AfterSchoolHoliday`, `Before/AfterStateHoliday`, and `Before/AfterPromo`, fill null values with 0.

In [40]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

Create a dataframe `bwd` that gets 7-day backward-rolling sums of the columns in `columns`, grouped by `Store`.

In [48]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

Create a dataframe `fwd` that gets 7-day forward-rolling sums of the columns in `columns`, grouped by `Store`.

In [49]:
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()
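The sketch below illustrates the backward and forward windows on a made-up single-store frame, using a window of 3 for readability (`toy_roll` is hypothetical). Note that an integer window counts rows, not calendar days, so it matches "N days" only when each store has exactly one row per day.

```python
import pandas as pd

# Hypothetical data: one store, four consecutive days.
toy_roll = pd.DataFrame({
    'Store': [1, 1, 1, 1],
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04']),
    'Promo': [1, 0, 1, 1],
}).set_index('Date')

# Backward: dates ascending, so each window covers the current row and earlier ones.
bwd_toy = (toy_roll.sort_index()
           .groupby('Store')['Promo'].rolling(3, min_periods=1).sum())

# Forward: dates descending, so each window covers the current row and later dates.
fwd_toy = (toy_roll.sort_index(ascending=False)
           .groupby('Store')['Promo'].rolling(3, min_periods=1).sum())

print(bwd_toy)
print(fwd_toy)
```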

Show the head of `bwd`.

In [51]:
bwd.head()

Store,Date,Store,SchoolHoliday,StateHoliday,Promo
1,2013-01-01,1.0,1.0,1.0,0.0
1,2013-01-02,2.0,2.0,1.0,0.0
1,2013-01-03,3.0,3.0,1.0,0.0
1,2013-01-04,4.0,4.0,1.0,0.0
1,2013-01-05,5.0,5.0,1.0,0.0


Show the head of `fwd`.

In [52]:
fwd.head()

Store,Date,Store,SchoolHoliday,StateHoliday,Promo
1,2015-09-17,1.0,0.0,0.0,1.0
1,2015-09-16,2.0,0.0,0.0,2.0
1,2015-09-15,3.0,0.0,0.0,3.0
1,2015-09-14,4.0,0.0,0.0,4.0
1,2015-09-13,5.0,0.0,0.0,4.0


Drop the `Store` column from `fwd` and `bwd` inplace, and reset the index inplace on each.

In [53]:
bwd.drop('Store', axis=1, inplace=True)
bwd.reset_index(inplace=True)

In [54]:
bwd.head()

Unnamed: 0,Store,Date,SchoolHoliday,StateHoliday,Promo
0,1,2013-01-01,1.0,1.0,0.0
1,1,2013-01-02,2.0,1.0,0.0
2,1,2013-01-03,3.0,1.0,0.0
3,1,2013-01-04,4.0,1.0,0.0
4,1,2013-01-05,5.0,1.0,0.0


In [55]:
fwd.drop('Store', axis=1, inplace=True)
fwd.reset_index(inplace=True)

Reset the index on `df`.

In [56]:
df.reset_index(inplace=True)

Merge `df` with `bwd` and `fwd`.

In [57]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

Drop `columns` from df inplace -- we don't need them anymore, since we've captured their information in columns with types more suitable for machine learning.

In [58]:
df.drop(columns, axis=1, inplace=True)

Print out the head of `df`.

In [59]:
df.head()

Unnamed: 0,Date,Store,AfterSchoolHoliday,BeforeSchoolHoliday,AfterStateHoliday,BeforeStateHoliday,AfterPromo,BeforePromo,SchoolHoliday_bw,StateHoliday_bw,Promo_bw,SchoolHoliday_fw,StateHoliday_fw,Promo_fw
0,2015-09-17,1,13,0,105,0,0,0,0.0,0.0,4.0,0.0,0.0,1.0
1,2015-09-16,1,12,0,104,0,0,0,0.0,0.0,3.0,0.0,0.0,2.0
2,2015-09-15,1,11,0,103,0,0,0,0.0,0.0,2.0,0.0,0.0,3.0
3,2015-09-14,1,10,0,102,0,0,0,0.0,0.0,1.0,0.0,0.0,4.0
4,2015-09-13,1,9,0,101,0,9,-1,0.0,0.0,0.0,0.0,0.0,4.0


Pickle `df` to `PATH/'df'`.

In [60]:
df.to_pickle(PATH/'df')

Cast the `Date` column to a datetime column.

In [61]:
df["Date"] = pd.to_datetime(df.Date)

In [62]:
df.columns

Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

In [63]:
joined = pd.read_pickle(PATH/'joined')
joined_test = pd.read_pickle(PATH/'joined_test')

Join `joined` with `df` on `['Store', 'Date']`.

In [64]:
joined = join_df(joined, df, ['Store', 'Date'])

In [65]:
joined_test = join_df(joined_test, df, ['Store', 'Date'])

This is not necessarily the best idea, but the authors removed all examples for which sales were equal to zero. If you're trying to stay true to what the authors did, do that now.

In [66]:
joined = joined[joined.Sales!=0]

Reset the indices, and pickle train and test to `train_clean` and `test_clean`.

In [67]:
joined.reset_index(inplace=True)
joined_test.reset_index(inplace=True)

In [68]:
joined.to_pickle(PATH/'train_clean')
joined_test.to_pickle(PATH/'test_clean')