# CHAPTER 6
## 6.1 Benchmark: Split Data into Training and Test Sets
Now that we have a convenient way to make recommendations, we still need to make an informed choice as to which of `bestPy`'s algorithms we should pick and how we should set its parameters to achieve the highest possible fidelity in our recommendations.

The only way of telling how well we are doing with our recommendations is to see how well we can predict _future_ purchases from _past_ purchases. This means that, instead of using _all_ of our data to _train_ an algorithm, we have to hold out the _last_ few purchases of each user and only use the rest. Then, we can _test_ the recommendations produced by our algorithm against what customers actually did buy next.
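To make the idea concrete, here is a minimal sketch in plain Python that does not use `bestPy` at all. The function `hold_out_last()` and the toy records are hypothetical and only illustrate the principle of putting aside each customer's most recent purchases.

```python
from collections import defaultdict


def hold_out_last(records, n):
    """Split (timestamp, customer, article) records per customer.

    The last `n` purchases of each customer become test data, the rest
    training data. Customers with fewer than `n` purchases are skipped.
    """
    # Sorting the tuples orders them by timestamp first.
    history = defaultdict(list)
    for timestamp, customer, article in sorted(records):
        history[customer].append(article)

    train, test = {}, {}
    for customer, articles in history.items():
        if len(articles) < n:
            continue  # not enough purchases to test this customer
        cut = len(articles) - n
        train[customer] = articles[:cut]
        test[customer] = articles[cut:]
    return train, test


# Toy data, purely illustrative.
records = [
    (3, 'ann', 'lamp'), (1, 'ann', 'desk'), (2, 'ann', 'chair'),
    (1, 'bob', 'desk'), (5, 'bob', 'rug'),
    (4, 'cat', 'lamp'),
]
train, test = hold_out_last(records, 2)
print(train)
print(test)
```

Customers left with an empty training history (like `bob` here) have to be treated as new customers, and those with too few purchases (like `cat`) drop out of the test altogether.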

To conveniently perform this split of our data into training and test sets, `bestPy` offers the advanced data class `TrainTest`.


### Preliminaries
We only need this because the examples folder is a subdirectory of the `bestPy` package.

In [1]:
import sys
sys.path.append('../..')

### Imports and logging
No algorithm or recommender is needed for now as we are focusing solely on the data structure `TrainTest`, which is, naturally, accessible through the sub-package `bestPy.datastructures`.

In [2]:
from bestPy import write_log_to
from bestPy.datastructures import TrainTest  # Data class for train/test splits

logfile = 'logfile.txt'
write_log_to(logfile, 20)

### Read `TrainTest` data
Reading in `TrainTest` data works in much the same way as reading in `Transactions` data. Again, two data sources are available: a PostgreSQL database and a CSV file. For the former, we again need a fully configured `PostgreSQLparams` instance (let's call it `database`) before we can read in the data with:

```
data = TrainTest.from_postgreSQL(database)
```

Reading from a CSV file then works like so:

In [3]:
file = 'examples_data.csv'
data = TrainTest.from_csv(file)

__NOTE__: There is only one difference from reading `Transactions` data. The `from_csv()` class method has an additional argument `fmt`. If it is _not_ given, then the timestamps in the CSV file are assumed to be _UNIX timestamps since epoch_, i.e., integer numbers.

If, on the other hand, it _is_ given, then it must be a valid _format string_ specifying the format in which the timestamps are written in the CSV file. To tell `bestPy` that, for example, the timestamps in your CSV file look like
```
2012-03-09 16:18:02
```
_i.e._, year-month-day hour:minute:second, you would have to set `fmt` to the string:
```
'%Y-%m-%d %H:%M:%S'
```

With the [documentation](https://docs.python.org/3/library/datetime.html) of the `datetime` package, it should be easy to assemble the correct format string for just about any way a timestamp could possibly be written.
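A quick way to check a candidate format string before handing it to `from_csv()` is to try it on a sample timestamp with the standard library. This sketch assumes that `fmt` follows the `strptime()` conventions described in the linked documentation; how `bestPy` parses timestamps internally is not shown here.

```python
from datetime import datetime, timezone

fmt = '%Y-%m-%d %H:%M:%S'
stamp = '2012-03-09 16:18:02'

# strptime() raises a ValueError if the string does not match the format.
parsed = datetime.strptime(stamp, fmt)
print(parsed)

# The same instant as an integer UNIX timestamp, assuming (for this
# illustration only) that the time in the file is given in UTC.
unix_seconds = int(parsed.replace(tzinfo=timezone.utc).timestamp())
print(unix_seconds)
```

If the format string does not fit the sample, the `ValueError` message points to the mismatch, which is usually quicker than debugging a failed import of the whole file.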

### Initial attributes of `TrainTest` data objects
Inspecting the new `data` object with Tab completion reveals several attributes that we already know from `Transactions` data. Notably, these are:

In [4]:
print(data.number_of_corrupted_records)
print(data.number_of_transactions)

2
100000


There is also an additional attribute that tells us the maximum number of purchases we can possibly hold out as test data for each customer.

In [5]:
data.max_hold_out

983

### Splitting the data into training and test sets
Also present is a method called `split()`, which does exactly what you think it should. It has two arguments, `hold_out` and `only_new`. The former tells `bestPy` how many unique purchases to hold out (_i.e._, put aside) for each customer. Consequently, customers who bought fewer than `hold_out` articles cannot be tested at all, and customers who bought exactly `hold_out` articles will be treated as _new_ customers in testing.

The second argument, `only_new`, tells `bestPy` whether only new articles will be recommended in the benchmark run or whether recommendations will also include articles that customers bought before. If `True`, then all previous buys of any of the `hold_out` last unique items need to be deleted from the training data for each customer. Let's try.

In [6]:
data.split(4, False)
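
To illustrate what the two arguments mean for a single customer, here is a toy sketch in plain Python. The function `split_customer()` is hypothetical and not part of `bestPy`. In particular, how the training data is trimmed when `only_new` is `False` (here: only the final purchase of each held-out article is removed) is an assumption made for illustration.

```python
def split_customer(purchases, hold_out, only_new):
    """Split one customer's chronologically ordered article IDs.

    Returns (train, test), or None if the customer bought fewer than
    `hold_out` distinct articles and therefore cannot be tested.
    """
    # Walk backwards to find the `hold_out` most recently bought
    # distinct articles and the position of their final purchase.
    last_index = {}
    for index in range(len(purchases) - 1, -1, -1):
        article = purchases[index]
        if article not in last_index:
            last_index[article] = index
            if len(last_index) == hold_out:
                break
    if len(last_index) < hold_out:
        return None

    test = set(last_index)
    if only_new:
        # Delete every earlier buy of a held-out article as well.
        train = [a for a in purchases if a not in test]
    else:
        # Assumption: drop only the final purchase of each held-out article.
        final = set(last_index.values())
        train = [a for i, a in enumerate(purchases) if i not in final]
    return train, test


history = ['A', 'B', 'A', 'C', 'B']  # hypothetical article IDs
print(split_customer(history, 2, True))
print(split_customer(history, 2, False))
print(split_customer(history, 4, False))
```

With `hold_out=3`, this customer has exactly as many distinct articles as are held out, so setting `only_new=True` leaves an empty training history, which corresponds to a _new_ customer in testing.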

### Attributes of split `TrainTest` data objects
Inspecting the `TrainTest` data object with Tab completion again reveals two more attributes that magically appeared, `train` and `test`. The former is an instance of `Transactions` with all the attributes we already know.

In [7]:
print(type(data.train))
print(data.train.user.count)

<class 'bestPy.datastructures.transactions.transactions.Transactions'>
2141


So we have 2141 customers that bought 4 items or more and whose next 4 purchases can therefore be compared to our recommendations. I suggest you make it a habit to check that you have a decent number of customers left in your training set.

__NOTE__: Should you, for some reason, choose to hold out `max_hold_out` purchases, you might well end up with a single customer in your training set and, therefore, obtain spurious benchmark results.

The `test` attribute of a split `TrainTest` instance is a new, auxiliary data type with a very simple structure. Its `data` attribute contains the test data in the form of a Python _dictionary_ with customer IDs as keys and the article IDs of their `hold_out` last unique purchases as values.

In [8]:
data.test.data

{'1': {'AP082EL37CPUALID-1762',
  'CL225EL36CTRALID-1863',
  'GR087ME45LMG-7753',
  'NE739EL06ORLANID-27491'},
 '12': {'AP082EL37CPUALID-1762',
  'MA130HL15VTSANID-32883',
  'NE739EL06ORLANID-27491',
  'PH789EL86AANALID-13'},
 '11': {'BL152EL82CRXALID-1817',
  'CA189EL29AGOALID-170',
  'LE629EL54ANHALID-345',
  'SA848EL83DOYALID-2416'},
 '10': {'CO142HB25OYKANID-27672',
  'GR087ME41QJI-11057',
  'GR087ME54QIV-11044',
  'NE739EL06ORLANID-27491'},
 '7': {'AC016EL58BKFALID-941',
  'AP082EL03BMIALID-996',
  'AS100EL41BOSALID-1058',
  'OL756EL65HDYALID-4834'},
 '13': {'AP082EL37CPUALID-1762',
  'AS100EL41BOSALID-1058',
  'BL152EL82CRXALID-1817',
  'PI794EL32ENZALID-3067'},
 '17': {'CA189EL29AGOALID-170',
  'LA407HL45ZWGANID-35839',
  'MA130HL30AQBANID-36435',
  'MO717EL52ARFALID-447'},
 '19': {'AC016EL58BKFALID-941',
  'AD029EL42BKVALID-957',
  'AP082EL01CFQALID-1498',
  'DO274EL91APSALID-408'},
 '14': {'AP082EL36CPVALID-1763',
  'CO228EL88FBFALID-3411',
  'NI743EL41IBYALID-5458',
  'NO749E

Its attributes `hold_out` and `only_new` simply reflect the respective arguments from the last call to the `split()` method and, unsurprisingly, `number_of_cases` yields the number of test cases.

In [9]:
print(data.test.hold_out)
print(data.test.only_new)
print(data.test.number_of_cases)

4
False
2873
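
A dictionary that maps customer IDs to sets of held-out article IDs is easy to compare against recommendations. The following sketch shows one simple way of doing so. The function `hit_rate()`, the choice of metric, and the toy data are all hypothetical and only assume the dictionary shape described above, not any particular `bestPy` functionality.

```python
def hit_rate(test_data, recommendations):
    """Fraction of test customers for whom at least one recommended
    article is among their held-out purchases."""
    if not test_data:
        return 0.0
    hits = 0
    for customer, bought in test_data.items():
        recommended = recommendations.get(customer, ())
        if bought.intersection(recommended):
            hits += 1
    return hits / len(test_data)


# Toy data in the same shape as the test dictionary: customer ID -> set of IDs.
test_data = {'1': {'A', 'B'}, '7': {'C', 'D'}, '12': {'E', 'F'}}
recommendations = {'1': ['B', 'X'], '7': ['Y', 'Z']}

print(hit_rate(test_data, recommendations))
```

Only customer `'1'` receives a recommendation that matches a held-out purchase, so one out of three test cases counts as a hit.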


__NOTE__: Should you wish to split the same data again, but this time with different settings, there is no need to read it in again. Just call `split()` once more.

In [10]:
data.split(6, True)

print(data.train.user.count)
print(data.test.number_of_cases)

904
1289


And that's it for the equally powerful and convenient `TrainTest` data structure.