In [51]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [8]:
from fastai.tabular import *

# Rossmann

## Data preparation

To create the feature-engineered `train_clean` and `test_clean` from the Kaggle competition data, run `rossman_data_clean.ipynb`. One important step there handles the time-series aspect of the data by expanding each date into its component features:

```python
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
```
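
To get a feel for what this kind of step produces, here is a minimal pandas sketch. It is a simplified stand-in, not fastai's implementation: the helper name `expand_date` and the particular set of output columns are assumptions made for illustration.

```python
import pandas as pd

def expand_date(df, col):
    """Hypothetical helper: add a few calendar features derived from df[col]."""
    out = df.copy()
    dates = pd.to_datetime(out[col])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["Day"] = dates.dt.day
    out["Dayofweek"] = dates.dt.dayofweek          # Monday == 0
    out["Is_month_end"] = dates.dt.is_month_end
    # Whole seconds since the Unix epoch: one steadily increasing number per date
    out["Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

toy = pd.DataFrame({"Date": ["2015-07-31", "2015-08-01"], "Sales": [5263, 6064]})
expanded = expand_date(toy, "Date")
print(expanded.T)
```

The original `Date` column is kept here, which is the analogue of passing `drop=False`.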

In [9]:
Config().data_path()

PosixPath('/home/paperspace/.fastai/data')

Set the variable `path` to `Config().data_path()/'rossmann'`.

Set `train_df` to the pickled dataset at `path/'train_clean'`.

Show a transposed output of `train_df`.

Set `n` to the length of `train_df`, and print it out.
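
If you have not worked with pickled dataframes before, the round trip looks roughly like this. The sketch uses a temporary directory and a three-row frame as stand-ins for the real data path and `train_clean`; the `demo_` names are made up so they don't clash with your own variables.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real data directory (a temporary folder, purely illustrative)
demo_path = Path(tempfile.mkdtemp()) / "rossmann"
demo_path.mkdir(parents=True)

# Write a tiny pickled frame so there is something to read back
pd.DataFrame({"Store": [1, 2, 3], "Sales": [5263, 6064, 8314]}).to_pickle(demo_path / "train_clean")

demo_df = pd.read_pickle(demo_path / "train_clean")
print(demo_df.head().T)   # transposed: one row per column, easier to scan for wide frames
demo_n = len(demo_df)
print(demo_n)
```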

### Experimenting with a sample

Set `idx` to the first 2000 elements of a random permutation of `range(n)`.

Sort `idx`.

Set `small_train_df` to the rows of `train_df` at the first 1000 positions in `idx`.

Set `small_test_df` to the rows of `train_df` at the remaining positions in `idx` (from the 1000th on).
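
The same sampling pattern on a made-up frame, as a sketch. The `toy_` names and the seeded generator are assumptions added for reproducibility; they are not part of the exercise.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"Store": np.arange(5000), "Sales": np.arange(5000) * 2})
toy_n = len(toy_df)

rng = np.random.default_rng(0)             # seeded only to make the sketch reproducible
toy_idx = rng.permutation(toy_n)[:2000]    # 2000 distinct row positions
toy_idx.sort()                             # in-place sort restores the original row order

toy_small_train = toy_df.iloc[toy_idx[:1000]]
toy_small_test = toy_df.iloc[toy_idx[1000:]]
print(len(toy_small_train), len(toy_small_test))
```

Because `toy_idx` is sorted before it is sliced, the first half always holds the lower row positions.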

Set `small_cont_vars` to `['CompetitionDistance', 'Mean_Humidity']`

Set `small_cat_vars` to `['Store', 'DayOfWeek', 'PromoInterval']`.

Make `small_train_df` and `small_test_df` have only those variables.

Show `small_train_df`'s first few rows.

Same with `small_test_df`.

Instantiate a `Categorify` with `small_cat_vars` and `small_cont_vars` and call it `categorify`. Call `categorify` on `small_train_df`. Do it again on `small_test_df` with `test=True`.

Show the first few rows of `small_test_df`.

Show the categories from `small_train_df.PromoInterval`.

Show the `PromoInterval` column of `small_train_df`.

Show the first 5 categorical codes from `small_train_df['PromoInterval']`.
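
The idea behind the train/test distinction can be sketched in plain pandas: categories are learned from the training frame and then reused on the test frame. This is an illustration under that assumption, not the fastai code, and the `cat_` frames are invented data.

```python
import pandas as pd

cat_train = pd.DataFrame({"PromoInterval": ["Jan,Apr,Jul,Oct", None, "Feb,May,Aug,Nov", None]})
cat_test = pd.DataFrame({"PromoInterval": ["Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec", None]})

# Learn the categories from the training frame only
cat_train["PromoInterval"] = cat_train["PromoInterval"].astype("category").cat.as_ordered()
learned = cat_train["PromoInterval"].cat.categories

# Reuse those categories on the test frame; values never seen in training become missing
cat_test["PromoInterval"] = pd.Categorical(cat_test["PromoInterval"], categories=learned, ordered=True)

print(list(learned))
print(cat_train["PromoInterval"].cat.codes.tolist())
print(cat_test["PromoInterval"].cat.codes.tolist())
```

Missing values and unseen categories both get the code `-1`, which is worth keeping in mind when the codes are later used as embedding indices.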

Instantiate a `FillMissing` object in the same way, and call it `fill_missing`. Call it like a function applied to `small_train_df` and `small_test_df`.

A new column, `CompetitionDistance_na`, now exists on the train dataframe. Show the rows where it's `True`.
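
A rough pandas equivalent of what happened, assuming a median fill strategy and an added indicator column. The `fm_` frames are invented data.

```python
import numpy as np
import pandas as pd

fm_train = pd.DataFrame({"CompetitionDistance": [100.0, np.nan, 300.0, 500.0]})
fm_test = pd.DataFrame({"CompetitionDistance": [np.nan, 50.0]})

col = "CompetitionDistance"
filler = fm_train[col].median()              # the statistic comes from the training frame only

for frame in (fm_train, fm_test):
    frame[col + "_na"] = frame[col].isna()   # remember which rows were missing
    frame[col] = frame[col].fillna(filler)

print(fm_train[fm_train[col + "_na"]])
```

The indicator column lets the model learn that "missing" may itself carry information, rather than hiding it behind the filled value.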

### Preparing full data set

Read the full `train_clean` into `train_df`.

Read the full `test_clean` into `test_df`.

Print out `len(train_df)` and `len(test_df)`.

Set our `procs` to `FillMissing`, `Categorify`, and `Normalize`.
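
The first two steps work as in the sample experiments above. For the third, here is a sketch of normalization, assuming plain mean and standard-deviation scaling with statistics taken from the training frame. The `nz_` frames are invented data.

```python
import pandas as pd

nz_train = pd.DataFrame({"Mean_Humidity": [40.0, 60.0, 80.0]})
nz_test = pd.DataFrame({"Mean_Humidity": [60.0, 100.0]})

# Compute the statistics once, on the training data
mean = nz_train["Mean_Humidity"].mean()
std = nz_train["Mean_Humidity"].std()

# Apply the same shift and scale to both frames
nz_train["Mean_Humidity"] = (nz_train["Mean_Humidity"] - mean) / std
nz_test["Mean_Humidity"] = (nz_test["Mean_Humidity"] - mean) / std

print(nz_train["Mean_Humidity"].tolist(), nz_test["Mean_Humidity"].tolist())
```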

Use the following variable lists:
```
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']
```

Set `dep_var` to `'Sales'`.

Set `df` to a copy of `train_df` restricted to the columns in `cat_vars`, `cont_vars`, `dep_var`, and `Date`.

Show the `min` and `max` `Date`.

Show the length of `test_df`.

Create a variable `cut` that marks the cutoff index: all rows before it will form the validation set. The validation set should be about as large as the test set, and it should consist only of complete days that don't also appear in the training set. To get it, look up the date of the row at position `len(test_df)` in the training set, then take the _last_ index in the training set that shares that date.

Set `valid_idx` to the range of values up to `cut`.
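
Here is the cutoff logic on a nine-row toy frame. It assumes the rows are ordered newest day first, so the leading rows are the most recent ones; the `cv_` names and the pretend test length are made up for the sketch.

```python
import pandas as pd

# Toy stand-in: 3 stores per day, newest day first (an assumption about row order)
dates = pd.to_datetime(["2015-07-31"] * 3 + ["2015-07-30"] * 3 + ["2015-07-29"] * 3)
cv_train = pd.DataFrame({"Date": dates, "Sales": range(9)})
cv_test_len = 4                                   # pretend len(test_df) == 4

boundary_date = cv_train["Date"][cv_test_len]     # date of the row at position len(test_df)
cv_cut = cv_train["Date"][cv_train["Date"] == boundary_date].index.max()
cv_valid_idx = range(cv_cut)

print(boundary_date.date(), cv_cut, list(cv_valid_idx))
```

Note that `range(cv_cut)` stops one short of `cv_cut`, so the very last row of the boundary day is left out of the validation indices. Whether that single row matters is a judgment call.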

Show the first few rows of `df[dep_var]`.

Create a data bunch from the `df` that we've created. Hints:
- This will start with a `TabularList`
- You can use `split_by_idx` to specify the validation indices
- You can label from the df using `label_from_df`. Remember to specify `label_cls=FloatList` to mark this as a regression problem, and to specify that `log=True`.
- Use `add_test` to add a test set
- Call `databunch()` to turn the `TabularList` into a `DataBunch`.
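
A note on `log=True`: the labels the model sees are the logarithm of `Sales`, so anything it predicts is on the log scale and has to be exponentiated before you read it as sales. A plain-Python sketch of that round trip, with made-up numbers and no fastai involved:

```python
import math

sales = [5263.0, 6064.0, 8314.0]
log_labels = [math.log(s) for s in sales]        # what the model is trained to predict
recovered = [math.exp(v) for v in log_labels]    # undo the transform on the way out

print(log_labels)
print(recovered)
```

Working in log space also means that an error of a fixed size corresponds to a fixed ratio between prediction and target, which suits a percentage-based metric.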

## Model

Set `max_log_y` to the log of the maximum of the `Sales` column increased by 20%. Set `y_range` to a 2-element torch tensor whose first element is `0` and whose second element is `max_log_y`, and create it with `device=defaults.device`. What does this last part do?
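
One common way to use such a range is to squash the network's final activation through a scaled sigmoid, so predictions can never leave the interval. The sketch below shows that idea in plain Python; the `squash` helper and the toy sales figures are assumptions for illustration, not the learner's actual code.

```python
import math

toy_sales = [5263, 6064, 41551]
toy_max_log_y = math.log(max(toy_sales) * 1.2)   # log of the maximum, with 20% headroom
toy_y_range = (0.0, toy_max_log_y)

def squash(raw, y_range):
    """Map an unbounded activation into (low, high) with a scaled sigmoid."""
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))

print(toy_y_range)
print(squash(-5.0, toy_y_range), squash(0.0, toy_y_range), squash(5.0, toy_y_range))
```

The 20% headroom matters because a sigmoid only approaches its upper bound asymptotically; without it, the largest sales values would be hard to reach.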

Create a learner using the databunch you created earlier, with layers `1000,500`, `ps=[0.001, 0.01]`, `emb_drop = 0.04`, with the `y_range` you created above. Your metrics should be `exp_rmspe`.
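
To see what the metric measures, here is a plain-Python sketch of a root mean squared percentage error computed after undoing the log transform. The function name `exp_rmspe_sketch` and the inputs are made up; it works on lists rather than tensors.

```python
import math

def exp_rmspe_sketch(log_preds, log_targets):
    """Root mean squared percentage error, computed after undoing the log transform."""
    preds = [math.exp(p) for p in log_preds]
    targets = [math.exp(t) for t in log_targets]
    pct_errors = [(t - p) / t for p, t in zip(preds, targets)]
    return math.sqrt(sum(e * e for e in pct_errors) / len(pct_errors))

# One prediction 10% too low, one 10% too high
score = exp_rmspe_sketch([math.log(90.0), math.log(220.0)],
                         [math.log(100.0), math.log(200.0)])
print(score)
```

Because each error is divided by its target, a miss of 10 on a store selling 100 counts as much as a miss of 20 on a store selling 200.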

Show the model object.

Show the len of `cont_names` in the `train_ds`.

Use the learning rate finder.

Show the learning rate plot. The ideal starting spot should be somewhere between `1e-3` and `1e-2`.

Fit one cycle with 5 epochs at the learning rate you found, with a weight decay (`wd`) of 0.2.

Save the learner as `'1'`.

Show the losses, skipping the first 7000 batches processed.

Load `'1'`.

In [65]:
learn.load('1');

Fit another cycle with 5 epochs and learning rate 3e-4.

In [1]:
learn.fit_one_cycle(5, 3e-4)

Fit another cycle with 5 epochs and lr = 3e-4.

In [None]:
learn.fit_one_cycle(5, 3e-4)

10th place in the competition was 0.108. How low can you get? 

Get the predictions for the test set and store them in `test_preds`. Set `test_df['Sales']` to the exponentiated values of `test_preds[0].data`, converted to a numpy array, and then take `.T[0]`. Why did you have to use `test_preds[0].data`? Why `.T[0]`? Replace `test_df[["Id", "Sales"]]` with an int-ified version of itself. Write those two columns, without an index, to the CSV file `rossmann_submission.csv`.
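
If the shapes are hard to picture, this numpy and pandas sketch may help. It assumes the predictions arrive as an `(n, 1)` column of log-sales; the numbers and the `sub_df` frame are invented, and the CSV is rendered to a string instead of a file.

```python
import numpy as np
import pandas as pd

# Assumption: predictions arrive as an (n, 1) column of log-sales
log_preds = np.log(np.array([[5263.4], [6064.9], [8314.2]]))
sub_df = pd.DataFrame({"Id": [1, 2, 3]})

print(log_preds.shape, log_preds.T.shape, log_preds.T[0].shape)

sub_df["Sales"] = np.exp(log_preds).T[0]         # one value per row, back on the sales scale
sub_df[["Id", "Sales"]] = sub_df[["Id", "Sales"]].astype("int")

csv_text = sub_df[["Id", "Sales"]].to_csv(index=False)
print(csv_text)
```

Casting to `int` truncates rather than rounds, so round first if that difference matters for your score.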