In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
from fastai.basics import *

In [2]:
from pathlib import Path

In [8]:
import pandas as pd

# Rossmann

## Data preparation / Feature engineering

Set `PATH` to the directory that holds the Rossmann CSVs (`~/.fastai/data/rossmann/` in the cell below). Create a list of table names, with one entry for each CSV that you'll be loading:
- train
- store
- store_states
- state_names
- googletrend
- weather
- test

For each CSV, read it in using pandas (with `low_memory=False`), and assign it to a variable of the same name. Print out the lengths of the `train` and `test` tables.

In [11]:
PATH = Path("~/.fastai/data/rossmann/").expanduser()  # expand "~" to the home directory

In [68]:
csvs = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']

In [69]:
tables = [pd.read_csv(f"{PATH}/{csv}.csv", low_memory=False) for csv in csvs]

In [70]:
train, store, store_states, state_names, googletrend, weather, test = tables

Turn the `StateHoliday` and `SchoolHoliday` columns into booleans indicating whether or not the day was a holiday.

In [72]:
train['StateHoliday'] = train['StateHoliday'] != '0'
test['StateHoliday'] = test['StateHoliday'] != '0'
train['SchoolHoliday'] = train['SchoolHoliday'] != 0
test['SchoolHoliday'] = test['SchoolHoliday'] != 0

In [73]:
train['StateHoliday'].value_counts()

False    986159
True      31050
Name: StateHoliday, dtype: int64

In [74]:
train['SchoolHoliday'].value_counts()

False    835488
True     181721
Name: SchoolHoliday, dtype: int64

Print out the head of the dataframe.

In [75]:
train.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,False,True
1,2,5,2015-07-31,6064,625,1,1,False,True
2,3,5,2015-07-31,8314,821,1,1,False,True
3,4,5,2015-07-31,13995,1498,1,1,False,True
4,5,5,2015-07-31,4822,559,1,1,False,True


Create a function `join_df` that joins two dataframes together. It should take the following arguments:
- left (the df on the left)
- right (the df on the right)
- left_on (the left table join key)
- right_on (the right table join key, defaulting to None; if nothing passed, default to the same as the left join key)
- suffix (default to '_y'; a suffix to give to duplicate columns)

In [None]:
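def join_df(left, right, left_on, right_on=None, suffix='_y'):
    # Default the right-hand key to the left-hand key.
    if right_on is None: right_on = left_on
    # Left join; only duplicate columns coming from `right` get the suffix.
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=('', suffix))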

Join the weather and state names tables together, and reassign them to the variable `weather`.

Show the first few rows of the weather df.

In the `googletrend` table, set the `Date` variable to the first date in the hyphen-separated date string in the `week` field. Set the `State` field to the third element in the underscore-separated string from the `file` field. In all rows where `State == NI`, make it instead equal `HB,NI`, which is how it's referred to throughout the rest of the data.
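
A minimal sketch of these three steps on a toy frame. The two sample rows are invented for illustration, and the `" - "` separator assumes the `week` strings look like `2012-12-02 - 2012-12-08`.

```python
import pandas as pd

# Toy stand-in for the real googletrend table (values are made up).
googletrend = pd.DataFrame({
    "file": ["Rossmann_DE_SN", "Rossmann_DE_NI"],
    "week": ["2012-12-02 - 2012-12-08", "2012-12-09 - 2012-12-15"],
})

# First date of the "start - end" range.
googletrend["Date"] = googletrend["week"].str.split(" - ", expand=True)[0]
# Third underscore-separated token of `file` is the state code.
googletrend["State"] = googletrend["file"].str.split("_", expand=True)[2]
# Use the same state code as the other tables.
googletrend.loc[googletrend["State"] == "NI", "State"] = "HB,NI"
```

At this point `Date` is still a string; it is parsed into a datetime later, when the date parts are added.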

Write a function `add_datepart` that takes a date field and adds a bunch of numeric columns containing information about the date. It should take the following arguments:
- df (the dataframe you'll be modifying)
- fldname (the date field you'll be splitting into new columns)
- drop (whether or not to drop the old date field; defaults to True)
- time (whether or not to add time fields -- Hour, Minute, Second; defaults to False)

Remember the edge cases around the dtype of the field. Specifically, if it's of type `DatetimeTZDtype`, cast it instead to `np.datetime64`. If it's not a subtype of `datetime64` already, infer it (see `pd.to_datetime`).
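
One possible shape for `add_datepart`, as a sketch rather than a reference implementation. The exact set of generated columns, the prefix rule (strip a trailing `Date` from the field name) and the `Elapsed` column are assumptions. Timezone-aware columns are handled here by converting to UTC and dropping the timezone, which is one way to satisfy the dtype edge case.

```python
import re
import pandas as pd

def add_datepart(df, fldname, drop=True, time=False):
    """Add date-derived columns for `fldname` to `df` in place."""
    fld = df[fldname]
    # Infer datetimes if the column is not already datetime-like.
    if not pd.api.types.is_datetime64_any_dtype(fld):
        fld = pd.to_datetime(fld)
    # Timezone-aware values: convert to UTC and drop the timezone.
    if isinstance(fld.dtype, pd.DatetimeTZDtype):
        fld = fld.dt.tz_convert(None)
    df[fldname] = fld
    # "Date" -> "", "SaleDate" -> "Sale": prefix for the new columns.
    prefix = re.sub("[Dd]ate$", "", fldname)
    attrs = ["Year", "Month", "Day", "Dayofweek", "Dayofyear",
             "Is_month_end", "Is_month_start", "Is_quarter_end",
             "Is_quarter_start", "Is_year_end", "Is_year_start"]
    if time:
        attrs += ["Hour", "Minute", "Second"]
    for attr in attrs:
        df[prefix + attr] = getattr(fld.dt, attr.lower())
    # ISO week number (Series.dt.week was removed in pandas 2.0).
    df[prefix + "Week"] = fld.dt.isocalendar().week
    # Whole seconds since the Unix epoch.
    df[prefix + "Elapsed"] = (fld - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    if drop:
        df.drop(columns=fldname, inplace=True)

# Usage on a toy frame (dates invented for illustration).
demo = pd.DataFrame({"Date": ["2015-07-31", "2015-08-01"]})
add_datepart(demo, "Date", drop=False)
```

Passing `drop=False` keeps the original `Date` column, which the later joins on `Date` rely on.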

Use `add_datepart` to add date fields to the weather, googletrend, train and test tables.

Print out the head of the weather table.

In the `googletrend` table, the `file` column has an entry `Rossmann_DE` that represents the whole of Germany; we'll want to break that out into its own separate table, `trend_de`, since we'll need to join it on the date fields alone rather than on both the date fields and `State`.

Now let's do a bunch of joins to build our entire dataset! Remember after each one to check if the right-side data is null. This is the benefit of left-joining; it's easy to debug by checking for null rows. Let's start by joining `store` and `store_states` in a new table called `store`.
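
The null check after a left join can be sketched as follows. The two toy tables are invented, and `join_df` is restated here only so the cell runs on its own.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Toy tables (invented values); store 3 has no matching state row.
store = pd.DataFrame({"Store": [1, 2, 3], "StoreType": ["a", "c", "a"]})
store_states = pd.DataFrame({"Store": [1, 2], "State": ["HE", "TH"]})

store = join_df(store, store_states, "Store")
# Rows where the right-hand column is null had no match in the join.
n_missing = store["State"].isnull().sum()
print(n_missing)
```

A non-zero count means some left-hand keys had no partner on the right, which is worth investigating before moving on to the next join.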

Next let's join `train` and `store` in a table called `joined`. Do the same for `test` and `store` in a table called `joined_test`.

Next join `joined` and `googletrend` on the columns `["State", "Year", "Week"]`. Again, do the same for the test data.

Join `joined` and `trend_de` on `["Year", "Week"]` with suffix `_DE`. Same for test.

Join `joined` and `weather` on `["State", "Date"]`. Same for test.

Now for every column in both `joined` and `joined_test`, check to see if it has the `_y` suffix, and if so, drop it. Warning: a data frame can have duplicate column names, but calling `df.drop` will drop _all_ instances with the passed-in column name! This could lead to calling drop a second time on a column that no longer exists!
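
A small sketch of a drop loop that tolerates duplicate column names. The toy frame and its column names are made up; the membership check is what prevents a second `drop` on a name that is already gone.

```python
import pandas as pd

# Toy frame with a duplicated "_y" column name (values invented).
joined = pd.DataFrame([[1, 2, 3, 4]],
                      columns=["Store", "State_y", "State_y", "Sales"])

for c in joined.columns:
    # Skip names that an earlier drop already removed.
    if c.endswith("_y") and c in joined.columns:
        joined.drop(c, inplace=True, axis=1)
```

The loop iterates over the column index as it was before any drop, so the second `State_y` is still visited; the `c in joined.columns` test makes that visit a no-op.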

For the columns `CompetitionOpenSinceYear`, `CompetitionOpenSinceMonth`, `Promo2SinceYear`, and `Promo2SinceWeek`, replace `NA` values with the following values (respectively):
- 1900
- 1
- 1900
- 1
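
A sketch of the fill step on a toy frame, shown for three of the four columns; the remaining one follows the same pattern. The cast to `int32` is an assumption, added because these fields are later used as integer years and months.

```python
import numpy as np
import pandas as pd

# Column -> replacement for missing values.
fill_values = {
    "CompetitionOpenSinceYear": 1900,
    "CompetitionOpenSinceMonth": 1,
    "Promo2SinceYear": 1900,
}

# Toy rows (invented); NaN marks a missing value.
joined = pd.DataFrame({
    "CompetitionOpenSinceYear": [2008.0, np.nan],
    "CompetitionOpenSinceMonth": [9.0, np.nan],
    "Promo2SinceYear": [np.nan, 2010.0],
})

for col, value in fill_values.items():
    joined[col] = joined[col].fillna(value).astype("int32")
```

The same loop would be run on `joined_test`.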

Create a new field `CompetitionOpenSince` that converts `CompetitionOpenSinceYear` and `CompetitionOpenSinceMonth` and maps them to a specific date. Then create a new field `CompetitionDaysOpen` that subtracts `CompetitionOpenSince` from `Date`. 

For `CompetitionDaysOpen`, replace values where `CompetitionDaysOpen < 0` with 0, and cases where `CompetitionOpenSinceYear < 1990` with 0.

Add a `CompetitionMonthsOpen` field, capping it at 24 months (2 years) to limit the number of unique categories.
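
The competition fields can be sketched like this on toy rows. Two details are assumptions: the day of the month is pinned to 15 because only year and month are known, and a month is approximated as 30 days.

```python
import pandas as pd

# Toy rows (invented): long-open, filled-in default, future opening, recent opening.
joined = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-31"] * 4),
    "CompetitionOpenSinceYear": [2008, 1900, 2015, 2015],
    "CompetitionOpenSinceMonth": [9, 1, 9, 6],
})

# Build a date from year and month; day 15 is an arbitrary mid-month choice.
joined["CompetitionOpenSince"] = pd.to_datetime(pd.DataFrame({
    "year": joined["CompetitionOpenSinceYear"],
    "month": joined["CompetitionOpenSinceMonth"],
    "day": 15,
}))
joined["CompetitionDaysOpen"] = (joined["Date"] - joined["CompetitionOpenSince"]).dt.days

# Competitor not open yet, or opening year was a filled-in placeholder.
joined.loc[joined["CompetitionDaysOpen"] < 0, "CompetitionDaysOpen"] = 0
joined.loc[joined["CompetitionOpenSinceYear"] < 1990, "CompetitionDaysOpen"] = 0

# Approximate months, capped at 24.
joined["CompetitionMonthsOpen"] = (joined["CompetitionDaysOpen"] // 30).clip(upper=24)
```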

Same process for Promo dates. You may need to install the `isoweek` package first.

In [28]:
# If needed, uncomment:
# ! pip install isoweek


Use the `isoweek` package to turn `Promo2Since` to a specific date -- the Monday of the week specified in the column. Compute a field `Promo2Days` that subtracts the `Promo2Since` date from the current date.

Perform the following modifications on both the train and test set:
- For cases where `Promo2Days` is negative or `Promo2SinceYear` is before 1990, set `Promo2Days` to 0
- Create `Promo2Weeks` from `Promo2Days`
- For cases where `Promo2Weeks` is negative, set `Promo2Weeks` to 0
- For cases where `Promo2Weeks` is above 25, set `Promo2Weeks` to 25
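
A sketch of the Promo2 steps on toy rows. To keep the cell runnable without an extra install, it swaps in the standard library's `date.fromisocalendar` (Python 3.8+) for `isoweek`; with `isoweek` installed, `Week(year, week).monday()` gives the same Monday. The column name `Promo2SinceWeek` and the integer division by 7 are assumptions.

```python
from datetime import date
import pandas as pd

# Toy rows (invented): recent promo, filled-in default, long-running promo.
joined = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-31"] * 3),
    "Promo2SinceYear": [2015, 1900, 2014],
    "Promo2SinceWeek": [27, 1, 1],
})

# Monday (ISO weekday 1) of the given ISO year and week.
mondays = [date.fromisocalendar(int(y), int(w), 1)
           for y, w in zip(joined["Promo2SinceYear"], joined["Promo2SinceWeek"])]
joined["Promo2Since"] = pd.to_datetime(mondays)
joined["Promo2Days"] = (joined["Date"] - joined["Promo2Since"]).dt.days

# Promo not started yet, or start year was a filled-in placeholder.
joined.loc[joined["Promo2Days"] < 0, "Promo2Days"] = 0
joined.loc[joined["Promo2SinceYear"] < 1990, "Promo2Days"] = 0

# Whole weeks, clamped to the range 0..25.
joined["Promo2Weeks"] = (joined["Promo2Days"] // 7).clip(lower=0, upper=25)
```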

Print the number of unique values for `Promo2Weeks` in training and test df's.

Pickle `joined` to `PATH/'joined'` and `joined_test` to `PATH/'joined_test'`.

## Durations

Write a function `get_elapsed` that takes arguments `fld` (a boolean field) and `pre` (a prefix to be prepended to `fld` to name a new column representing the days until/since the event in `fld`), and adds a column `pre+fld` representing the date-diff (in days) between the current date and the last date `fld` was true.

We'll be applying this to a subset of columns:

Create a variable `columns` containing the strings: 
- Date
- Store
- Promo
- StateHoliday
- SchoolHoliday

These will be the fields on which we'll be computing elapsed days since/until.

Create one big dataframe with both the train and test sets called `df`.

Sort by `Store` and `Date` ascending, and use `get_elapsed` to get the days since the last `SchoolHoliday` on each day. Reorder by `Store` ascending and `Date` descending to get the days _until_ the next `SchoolHoliday`.
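
A sketch of `get_elapsed` together with the two sorted passes, on a toy single-store frame. Passing `df` explicitly is a deviation from the `fld`/`pre` signature described above, made so the cell does not depend on a global; the sample dates are invented.

```python
import numpy as np
import pandas as pd

def get_elapsed(df, fld, pre):
    """Add column pre+fld: days between each row's Date and the last row where fld was true."""
    last_date = np.datetime64("NaT")
    last_store = None
    res = []
    for store, flag, d in zip(df["Store"].values, df[fld].values, df["Date"].values):
        # Reset the tracker whenever a new store starts.
        if store != last_store:
            last_date = np.datetime64("NaT")
            last_store = store
        if flag:
            last_date = d
        # NaT (no event seen yet for this store) becomes NaN.
        res.append((d - last_date) / np.timedelta64(1, "D"))
    df[pre + fld] = res

df = pd.DataFrame({
    "Store": [1] * 5,
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03",
                            "2015-01-04", "2015-01-05"]),
    "SchoolHoliday": [False, True, False, False, True],
})

# Ascending dates: days since the last holiday.
df = df.sort_values(["Store", "Date"])
get_elapsed(df, "SchoolHoliday", "After")

# Descending dates: days until the next holiday (zero or negative).
df = df.sort_values(["Store", "Date"], ascending=[True, False])
get_elapsed(df, "SchoolHoliday", "Before")
```

In the descending pass the "last" event seen is later in time, so the differences come out zero or negative; rows with no event seen yet stay null until they are filled in a later step.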

Do the same for `StateHoliday`.

Do the same for `Promo`.

Set the index on `df` to `Date`.

Reassign `columns` to `['SchoolHoliday', 'StateHoliday', 'Promo']`.

For columns `Before/AfterSchoolHoliday`, `Before/AfterStateHoliday`, and `Before/AfterPromo`, fill null values with 0.

Create a dataframe `bwd` that gets 7-day backward-rolling sums of the columns in `columns`, grouped by `Store`.

Create a dataframe `fwd` that gets 7-day forward-rolling sums of the columns in `columns`, grouped by `Store`.
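
The two rolling sums can be sketched as below on a toy single-store frame indexed by `Date`. `min_periods=1` is an assumption (it avoids nulls at the edges), and the toy `Promo` column is an integer; boolean columns may need `.astype(int)` first.

```python
import pandas as pd

columns = ["Promo"]

# Toy data (invented): one store, ten consecutive days, promo on every day.
df = pd.DataFrame({
    "Store": [1] * 10,
    "Date": pd.date_range("2015-01-01", periods=10),
    "Promo": [1] * 10,
}).set_index("Date")

# Backward window: each day plus the six days before it.
bwd = (df[["Store"] + columns].sort_index()
       .groupby("Store")[columns].rolling(7, min_periods=1).sum())

# Forward window: reverse the date order so the window looks ahead.
fwd = (df[["Store"] + columns].sort_index(ascending=False)
       .groupby("Store")[columns].rolling(7, min_periods=1).sum())
```

In this sketch `Store` ends up in the result's index alongside `Date` rather than as a regular column; whether it also appears as a column depends on the pandas version and on how the columns are selected.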

Show the head of `bwd`.

Show the head of `fwd`.

Drop the `Store` column from `fwd` and `bwd` inplace, and reset the index inplace on each.

Reset the index on `df`.

Merge `df` with `bwd` and `fwd`.

Drop `columns` from df inplace -- we don't need them anymore, since we've captured their information in columns with types more suitable for machine learning.

Print out the head of `df`.

Pickle `df` to `PATH/'df'`.

Cast the `Date` column to a datetime column.

Join `joined` with `df` on `['Store', 'Date']`.

This is not necessarily the best idea, but the authors removed all examples for which sales were equal to zero. If you're trying to stay true to what the authors did, do that now.

Reset the indices, and pickle train and test to `train_clean` and `test_clean`.