# Data science candidate exercise

Thank you for working on the Formation Data Science Exercise! The purpose of this exercise is to allow you to showcase your Data Science skill set in the following areas:
 * Transforming data
 * Training and tuning of model parameters
 * Evaluation of model performance and selecting the optimal model 
 
In addition to these areas, we’re interested in seeing how you use code to explore a new data set, examine model performance, and develop and implement models in our code base.


As you work on the exercise, keep in mind that we value reusable code that your teammates can jump into quickly. Please comment frequently, and let us know which portions of your code you would carry into more formal development to showcase your coding style and best practices.

We also encourage you to use markdown cells to explain your thought process and observations as you move through the exercise. Feel free to use the software and toolbox of your own preference. 


### Timing and Questions
We understand that your time is valuable. While you are welcome to spend as much time as you like on the exercise, please do not feel obligated to spend more than 2-4 hours.
 
* Questions are encouraged:
    * Please feel welcome to contact your TAKT recruiter with any questions for Formation that would help clarify what is expected, or with questions about the data or other items for our data science team.

# Problem Statement

In this exercise, we would like to build collaborative filtering models for recommending product items. Imagine a fast food chain releases a new mobile app that allows its customers to place orders before they walk into the store. There are several opportunities for the app to show recommendations: When a customer first taps on the "order" page, we may recommend the first item to be added to the basket (e.g. a burger). After that, items that pair well with the existing basket could be recommended. For example, if there is a burger already in the order basket, the app may want to recommend fries and/or drinks, rather than recommending another burger.


## Input data
We provide the artificial transaction history data in `trx_data.csv`. Each row represents a past order. It has two columns, `customerId` and `products` (separated by ","). The `products` column contains the IDs of the 1 to 10 products purchased in that order, separated by "|".

Here is an example of the transaction records. Customer 0 purchased 1 item in the first log entry, and customer 1 purchased 10 items in the second log entry (some are duplicated).

`customerId,products
0,20
1,2|2|23|68|68|111|29|86|107|152
2,111|107|29|11|11|11|33|23
1,164|227
3,2|2
6,144|144|55|267
7,136|204|261
3,79|8|8|48
9,102|2|2|297
10,84|77|286|259
11,25|127|127
0,18|183|288|171|289
11,79|8|8|38
1,2|2|20|20|20
7,251|143
`
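The row format described above can be read with a few lines of Python. This is a minimal sketch, not a required approach: `parse_transactions` is a hypothetical helper name, it assumes product IDs are integers (as they are in the example), and it inlines a few example rows so that it runs without `trx_data.csv`.

```python
import csv
import io
from collections import Counter


def parse_transactions(lines):
    """Yield (customer_id, Counter of product_id -> quantity) per order, in file order."""
    reader = csv.DictReader(lines)
    for row in reader:
        # Assumption: product IDs are integers, as in the example rows.
        products = [int(p) for p in row["products"].split("|")]
        yield int(row["customerId"]), Counter(products)


# The first rows of the example above, inlined for a self-contained run.
sample = io.StringIO(
    "customerId,products\n"
    "0,20\n"
    "1,2|2|23|68|68|111|29|86|107|152\n"
    "2,111|107|29|11|11|11|33|23\n"
)
orders = list(parse_transactions(sample))
print(orders[1])  # customer 1 with per-product quantities
```

Keeping each basket as a `Counter` preserves duplicate purchases (product 2 appears twice in customer 1's order) while discarding the within-basket ordering.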

***Things to be aware of about the data.***
* The `trx_data.csv` file is a log of user purchases, so:
    * Users might be found on multiple lines in the csv with different basket items attached.
    * The products "basket" is not listed in any particular order.
        * For example, the following data points can be considered equivalent:
            * `6,144|144|55|267` 
            * `6,267|144|55|144`
    * **However** the order of the basket logs is temporal, so in the above example log:
        * `1,2|2|23|68|68|111|29|86|107|152` was a basket of earlier purchases by user `1` than
        * `1,164|227`
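Because the log is temporal across rows, one possible way to use it is a "leave last basket out" split, where each customer's most recent basket is held back for evaluation. The sketch below is only one option among many; `leave_last_basket_out` is a hypothetical name, and the choice to keep single-basket customers entirely in training is an assumption.

```python
from collections import defaultdict


def leave_last_basket_out(orders):
    """Split chronologically ordered (customer_id, basket) pairs.

    Each customer's final basket goes to the holdout set. Customers with a
    single basket stay entirely in the training set (an assumption).
    """
    by_customer = defaultdict(list)
    for customer_id, basket in orders:  # file order is temporal order
        by_customer[customer_id].append(basket)

    train, holdout = [], {}
    for customer_id, baskets in by_customer.items():
        if len(baskets) > 1:
            holdout[customer_id] = baskets[-1]
            baskets = baskets[:-1]
        train.extend((customer_id, b) for b in baskets)
    return train, holdout


orders = [
    (1, [2, 2, 23, 68, 68, 111, 29, 86, 107, 152]),
    (1, [164, 227]),
    (3, [2, 2]),
    (3, [79, 8, 8, 48]),
    (0, [20]),
]
train, holdout = leave_last_basket_out(orders)

# Within a basket, order carries no meaning, so compare baskets after sorting.
same_basket = sorted([144, 144, 55, 267]) == sorted([267, 144, 55, 144])
```

Note that `train` ends up grouped by customer rather than in the original global order, which is fine for building a user-item matrix but would need adjusting for time-aware models.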
        

## Models
A collaborative filtering model can be built from a user-item matrix of ratings.
For this exercise, we ask you to build **ONE** of the following two recommender models. 
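To make the idea of a user-item matrix concrete, here is a small sketch that aggregates purchase counts into a dense matrix. The helper name `build_user_item_matrix` is hypothetical, and using raw purchase counts as the matrix entries is an assumption; how to define the ratings is left to you.

```python
from collections import Counter


def build_user_item_matrix(orders):
    """Aggregate (customer_id, basket) pairs into a dense count matrix.

    Returns (matrix, users, items): rows follow `users`, columns follow `items`.
    """
    counts = Counter()
    for customer_id, basket in orders:
        for product_id in basket:
            counts[(customer_id, product_id)] += 1

    users = sorted({u for u, _ in counts})
    items = sorted({i for _, i in counts})
    user_index = {u: r for r, u in enumerate(users)}
    item_index = {i: c for c, i in enumerate(items)}

    matrix = [[0] * len(items) for _ in users]
    for (u, i), n in counts.items():
        matrix[user_index[u]][item_index[i]] = n
    return matrix, users, items


orders = [(0, [20]), (1, [2, 2, 20]), (1, [20, 20])]
matrix, users, items = build_user_item_matrix(orders)
print(users, items, matrix)
```

A nested list is used here only to keep the example dependency-free. On the full data set, most entries will be zero, so a sparse representation is likely a better fit.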

### Option 1:
* Build a Model that recommends to the user the "first item" they may want to place into their "basket"
  * Input: user - customer ID
  * Returns: ranked list of items (product IDs), that the user is most likely to want to put in his/her (empty) "basket"

### Option 2:
* Build a Model for recommending a "second item" after the first item has already been added to the basket.
  * Input: (user - customer ID, item - product ID)
  * Returns: ranked list of items (product IDs), that the user is most likely to want to add to his/her "basket", which already contains the first item
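To illustrate the Option 2 interface (not to suggest a solution), the sketch below ranks products by how often they were bought together with the first item. It is a non-personalized co-occurrence baseline rather than a full collaborative filtering model; `build_cooccurrence` and `recommend_second_item` are hypothetical names, and the tie-break by product ID is an arbitrary assumption.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count, for each product, how many baskets it shares with every other product."""
    cooccurrence = defaultdict(Counter)
    for basket in baskets:
        # Use the set of products so duplicates within a basket count once.
        for a, b in permutations(set(basket), 2):
            cooccurrence[a][b] += 1
    return cooccurrence


def recommend_second_item(customer_id, first_item, cooccurrence, k=10):
    """Rank products by co-purchase count with first_item.

    customer_id is accepted to match the Option 2 interface but is ignored
    here, so this baseline is not personalized.
    """
    ranked = sorted(cooccurrence[first_item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:k]]


baskets = [[2, 2, 23, 68], [2, 23], [23, 111], [2, 68]]
cooccurrence = build_cooccurrence(baskets)
print(recommend_second_item(1, 2, cooccurrence))  # [23, 68]
```

A baseline like this can also serve as a reference point when judging whether a personalized model adds value.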




## Tasks

Some of the things we will be looking at as we review your submission include:

### 1. Data transformation
How are you converting the raw data as you prepare a model? Please feel welcome to define and transform the ratings in the way you think would be optimal for generating the best possible product recommendations.
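As one example of what such a transformation could look like, the sketch below turns a single user's purchase counts into ratings between 0 and 1. The log damping and per-user scaling are assumptions chosen for illustration, and `counts_to_ratings` is a hypothetical name; other definitions may work better for this data.

```python
import math


def counts_to_ratings(user_counts):
    """Map {product_id: purchase_count} for one user to ratings in (0, 1].

    Counts are damped with log(1 + n) so heavy repeat purchases do not
    dominate, then scaled by the user's largest value. Assumes a non-empty
    dict with positive counts.
    """
    damped = {p: math.log1p(n) for p, n in user_counts.items()}
    top = max(damped.values())
    return {p: value / top for p, value in damped.items()}


ratings = counts_to_ratings({2: 4, 20: 3, 23: 1})
print(ratings)
```

Scaling per user makes ratings comparable between frequent and occasional customers, at the cost of discarding how much each customer buys overall.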

### 2. Model selection and development
What algorithms did you consider using to develop the collaborative filtering model? Why did you select the algorithm you ended up using? How did you implement the selected model?

### 3. Validate and evaluate the model performance
During the training process, what steps did you use for validating and evaluating the model?
* What holdout practices did you use?
* What performance metrics did you use for validation and testing?
* How did you choose to present your results?
* What actions did you take or improvements did you make based on your results?

### 4. Apply model to test datasets
##### Option 1 Model
In `recommend_1.csv`, we provide a list of customer IDs. If you select option 1, use this data to generate a CSV file that lists the top 10 recommendations for each of the customers. Note that the recommended products should be ordered by user preference, with the most preferred item first.

Sample output:

`customerId,recommendedProducts
1,0|1|2|3|4|5|6|7|8|9
2,8|3|1|2|4|7|9|10|11|13
3,20|21|22|23|24|25|26|27|28|29
...
`

##### Option 2 Model
In `recommend_2.csv`, we provide a list of customer IDs and their first basket item. If you select option 2, use this data to generate a CSV file that lists the top 10 pairing recommendations for each of the customer-basket combinations. Note that the recommended products should be ordered by user preference, with the most preferred item first.

Sample output:

`customerId,itemId,recommendedProducts
1,0,1|2|3|4|5|6|7|8|9|10
1,1,0|2|3|4|5|6|7|8|9|10
2,8,20|21|22|23|24|25|26|27|28|29
...
`
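Either of the output formats above can be written with the standard `csv` module. The sketch below is one way to do it; `write_recommendations` is a hypothetical helper, and it writes the header without spaces after the commas because the format check in the Submission section compares column names exactly.

```python
import csv
import io


def write_recommendations(out, rows, with_item=False):
    """Write rows of (customerId, products) or (customerId, itemId, products).

    `products` is a list of exactly 10 product IDs, most preferred first.
    """
    writer = csv.writer(out, lineterminator="\n")
    if with_item:
        writer.writerow(["customerId", "itemId", "recommendedProducts"])
    else:
        writer.writerow(["customerId", "recommendedProducts"])
    for *keys, products in rows:
        if len(products) != 10:
            raise ValueError("expected exactly 10 recommended products")
        writer.writerow([*keys, "|".join(str(p) for p in products)])


# Option 1 style output, written to memory instead of a file for illustration.
buffer = io.StringIO()
write_recommendations(buffer, [(1, list(range(10))), (2, [8, 3, 1, 2, 4, 7, 9, 10, 11, 13])])
print(buffer.getvalue())
```

For Option 2, pass three-element rows and `with_item=True`; to write to disk, replace the buffer with `open(path, "w", newline="")`.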


## Notes on the business use case for evaluation
* The goal of this modeling project is to recommend to the user a list of items that they are most likely to purchase (option 1) or add to their existing basket (option 2).  
* As you are selecting metrics, please keep in mind that 
 1. the primary goal is to recommend as many items as possible in your list that they may be inclined to purchase/add, and 
 2. the secondary goal is that the items are ordered by the user's inclination (the more inclined they are, the higher up in your list of 10 recommendations).
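As an illustration of how these two goals might map onto metrics, the sketch below computes precision@k (how many of the listed items are relevant) and NDCG@k (whether relevant items appear near the top). These particular metrics are an assumption, not a requirement of the exercise. Relevance is treated as binary, and the recommendation list is assumed to contain no duplicates.

```python
import math


def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that appear in `relevant`."""
    hits = sum(1 for p in recommended[:k] if p in relevant)
    return hits / k


def ndcg_at_k(recommended, relevant, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, p in enumerate(recommended[:k])
        if p in relevant
    )
    # Ideal ordering: all relevant items (up to k) placed at the top.
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


recommended = [5, 8, 2, 9, 1, 3, 7, 4, 6, 0]
relevant = {8, 1, 42}  # e.g. items in a held-out basket
print(precision_at_k(recommended, relevant))  # 0.2
print(ndcg_at_k(recommended, relevant))
```

Precision@k speaks to the primary goal, while NDCG@k rewards lists that place the purchased items higher, which speaks to the secondary goal.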

## Notes on implementation options

For implementation of your modeling solution(s), please feel free to use any method that you could take to production if you were developing it as a Formation employee. 

For example:
* If you find an open source project licensed under e.g. the Apache 2.0 license or another license that permits commercial use, you are welcome to use it for your code, or
* If you feel like coding up a solution from scratch (assuming you have the time) you are welcome to do that as well.


In [1]:
### Candidate code here


In [None]:
### Candidate code here


In [None]:
### Candidate code here


In [None]:
### Candidate code here


# Submission

When you have completed your exercise, please email the following to your recruiter:
1. This raw `.ipynb` file, with the stdout included under your cells
1. An HTML export of the `.ipynb` file, to ensure that no output is lost in submitting the raw `.ipynb`
    * [File] > [Download as] > [HTML (.html)]
1. The output CSV file, with format passing the below sanity check


In [None]:
### The following is used to validate your submission format
FILE_NAME = ""  # Put your filename here

with open(FILE_NAME) as submission_file:
    # Check the columns in the header
    header = submission_file.readline()
    columns = header.strip().split(",")
    if columns == ["customerId", "recommendedProducts"]:
        # The header matches the requirement of Option 1
        numColumns = 2
    elif columns == ["customerId", "itemId", "recommendedProducts"]:
        # The header matches the requirement of Option 2
        numColumns = 3
    else:
        raise Exception("The header does not match the required format!")

    # Check the data rows
    numLines = 0
    for (i, l) in enumerate(submission_file):
        fields = l.strip().split(",")
        if len(fields) != numColumns:
            raise Exception("Line %d does not have %d comma-separated fields!" % (i + 1, numColumns))
        if len(fields[-1].split("|")) != 10:
            raise Exception("Line %d does not have 10 recommended products!" % (i + 1))
        numLines = i + 1

# Exactly 1000 data rows are expected after the header
if numLines != 1000:
    raise Exception("The submission file has %d lines (1000 expected)" % numLines)

print("Your submission file has passed the format check!")
    
