<img src="./images/logo.png" alt="Alt Text" width="700">


# DaisyRec Tutorial: Evaluating a New Algorithm

This tutorial shows how to add a new model to DaisyRec and evaluate its test metrics.
It is still under construction, so please report any bugs you find!

## Steps 

Say you want to re-implement the Neural Collaborative Filtering algorithm designed by He et al. (2017) (arXiv link: [arxiv.org/abs/1708.05031](https://arxiv.org/abs/1708.05031)). The following steps walk through the implementation.

First, **create a shortened name string**. For this implementation, we use "neumf". Remember this string; it is reused in the file names and model lookup in the steps that follow.



### Step 1 - Adding default hyperparameter configurations
 
1. Go to the folder `daisy/assets`.

2. Create a YAML config file named after the model's short name. In this case, the new file is `daisy/assets/neumf.yaml`.

3. Inside the YAML file, enter **all** the hyperparameters as keys, with their default values as the values. In this case, we have:

```
    # Hyperparameters
    factors: 24
    num_layers: 2
    dropout: 0.5
    lr: 0.001
    epochs: 30
    reg_1: 0.001
    reg_2: 0.001
    GMF_model: ~
    MLP_model: ~

    # Model name
    model_name: NeuMF
```

You may specify your custom MLP or GMF model in the .yaml file, or put `~` to use our implementation. Note that these values are **defaults only**; they can be tuned, or overridden with different values through command-line arguments.
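The way defaults and command-line overrides interact can be sketched as follows. This is a minimal illustration under stated assumptions, not DaisyRec's actual config loader: the `DEFAULTS` dict stands in for the parsed contents of `neumf.yaml`, and the `merge_config` helper and the flag names derived from the keys are hypothetical.

```python
import argparse

# Stand-in for the parsed contents of daisy/assets/neumf.yaml.
# Assumption: the real loader reads the file; here the values are inlined.
DEFAULTS = {
    "factors": 24,
    "num_layers": 2,
    "dropout": 0.5,
    "lr": 0.001,
    "epochs": 30,
    "reg_1": 0.001,
    "reg_2": 0.001,
    "GMF_model": None,  # YAML "~" parses to None
    "MLP_model": None,
    "model_name": "NeuMF",
}


def merge_config(defaults, argv):
    """Return defaults overridden by any flags present in argv (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    for key, value in defaults.items():
        # Keys whose default is None are parsed as plain strings.
        arg_type = str if value is None else type(value)
        parser.add_argument(f"--{key}", type=arg_type, default=None)
    args = vars(parser.parse_args(argv))
    # Keep only the flags that were actually supplied.
    overrides = {k: v for k, v in args.items() if v is not None}
    return {**defaults, **overrides}


config = merge_config(DEFAULTS, ["--lr", "0.01", "--factors", "64"])
```

Here `config["lr"]` and `config["factors"]` take the command-line values, while every other key keeps its default.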

### Step 2 - Adding the model .py file

1. Create the new model Python file, following the naming convention of appending "Recommender" to the model name, i.e., `NeuMFRecommender.py`.

2. Put the file into the `daisy/model` folder, in the `accuracy` or `diversity` subfolder, depending on whether your model focuses on improving the accuracy or the diversity of recommendations. In this case, since our model is accuracy-based, the path of our model code relative to the repository root is `daisy/model/accuracy/NeuMFRecommender.py`.


3. Inside the file, define your model class briefly; we will go into more detail later. Give the class the same name as `model_name` in your YAML file. Your model should usually be a child class of `GeneralRecommender`, imported from `daisy/model/AbstractRecommender.py`. Add a class variable `tunable_param_names`, a list containing the names of the model's tunable hyperparameters. In this case, it would be:

```
from daisy.model.AbstractRecommender import GeneralRecommender

class NeuMF(GeneralRecommender):
    '''
    NeuMF Recommender class; it can be separated into two parts: GMF and MLP
    '''
    tunable_param_names = ['num_ng', 'factors', 'num_layers', 'dropout', 'lr', 'batch_size', 'reg_1', 'reg_2']
    def __init__(self, config):
        super(NeuMF, self).__init__(config)
        self.config = config
```
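A quick sanity check is to confirm that every name in `tunable_param_names` has a value in the config. The sketch below assumes a stub `GeneralRecommender` in place of the real base class, and `missing_tunable_params` is a hypothetical helper, not part of DaisyRec.

```python
class GeneralRecommender:
    """Stub standing in for daisy.model.AbstractRecommender.GeneralRecommender."""

    tunable_param_names = []

    def __init__(self, config):
        self.config = config


class NeuMF(GeneralRecommender):
    tunable_param_names = ['num_ng', 'factors', 'num_layers', 'dropout',
                           'lr', 'batch_size', 'reg_1', 'reg_2']


def missing_tunable_params(model_cls, config):
    """List tunable names that have no entry in config (hypothetical helper)."""
    return [name for name in model_cls.tunable_param_names if name not in config]


# Only the numeric keys from the example neumf.yaml in Step 1.
yaml_defaults = {'factors': 24, 'num_layers': 2, 'dropout': 0.5, 'lr': 0.001,
                 'epochs': 30, 'reg_1': 0.001, 'reg_2': 0.001}
missing = missing_tunable_params(NeuMF, yaml_defaults)
```

With only the example YAML keys, `num_ng` and `batch_size` are reported as missing, so in this sketch they would have to be supplied from somewhere other than `neumf.yaml`.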

4. Now, import the model in `daisy/model/Models.py`, which is used for importing models into other files. Inside the `RecommenderModel()` function, you will see a large if-else block that matches each model name to its model class import. For this case, we add the following `elif` block:

```
    elif algo_name == 'neumf':
        from daisy.model.accuracy.NeuMFRecommender import NeuMF
        return NeuMF
```
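The lookup pattern described in this step can be sketched in isolation. Everything here is illustrative: the stub classes replace the real imports, `recommender_model` is a hypothetical stand-in for `RecommenderModel()`, the `'mostpop'` name is a placeholder for an existing entry, and the error raised for unknown names is an assumption about how such a function might behave.

```python
class MostPop:
    """Stub for an already-registered model class (placeholder)."""


class NeuMF:
    """Stub for the newly added model class."""


def recommender_model(algo_name):
    """Map a short algorithm name to its model class (illustrative stand-in)."""
    if algo_name == 'mostpop':
        return MostPop
    elif algo_name == 'neumf':
        # In the tutorial's version the import statement sits inside this
        # branch, so the module is only loaded when this model is selected.
        return NeuMF
    else:
        # Assumption: unknown names are rejected; the real function may differ.
        raise NotImplementedError(f"unknown algo_name: {algo_name!r}")


model_cls = recommender_model('neumf')
```

Note that the function returns the class itself, not an instance; the caller builds the model by calling the class with the config.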

### Step 3 - Loading data into the model

The dataset is loaded in test.py and tune.py using:

```
# Train/test split
splitter = TestSplitter(config)
train_index, test_index = splitter.split(df)
train_set, test_set = df.iloc[train_index, :].copy(), df.iloc[test_index, :].copy()
```

`train_set` is a portion of the full data, stored as a pandas DataFrame with columns for user IDs, item IDs, ratings and timestamps. User IDs and item IDs are numbered starting from 0 and do not necessarily appear in order. For example:

| User ID | Item ID | Rating | Timestamp |
|---------|---------|--------|-----------|
|   133 |   2023 |   4.5  |  1624165321  |
|   0 |   345 |   3.8  |  1624165487  |
|   210 |   293 |   5.0  |  1624165632  |

Some datasets have explicit feedback (i.e., ratings range from 0.0/5 to 5.0/5), whereas others only have implicit feedback (i.e., only the existence of an interaction is captured, and the rating is hard-set to 1.0/5). Inspect the dataset you want before use.

For negative sampling, we will use `BasicNegtiveSampler`. If you need special processing, feel free to create your own custom sampler, or explore `AEDataset` and `SkipGramNegativeSampler` to see whether these classes already perform the processing you are looking for.

Most methods (especially neural methods) need the data converted into a PyTorch DataLoader. In this case, we do:

```
sampler = BasicNegativeSampler(train_set, config)
train_samples = sampler.sampling() # This returns a numpy array or pandas df
train_dataset = BasicDataset(train_samples) # Converts pd.df/np.ndarray to simple pytorch Dataset
train_loader = get_dataloader(
    train_dataset, 
    batch_size=config['batch_size'], 
    shuffle=True, 
    num_workers=4) # Convert torch Dataset to torch DataLoader
```

Now, `train_loader` is the data loader that will be passed to `model.fit()`, which is explained shortly.
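To make the idea of negative sampling concrete, here is a simplified sketch of uniform negative sampling in plain Python. It is not the implementation used by DaisyRec's sampler classes: the function name `sample_negatives`, the meaning of `num_ng` as "negatives per observed interaction", and the `(user, item, label)` output layout are all assumptions made for illustration.

```python
import random


def sample_negatives(interactions, num_items, num_ng, seed=0):
    """For each observed (user, item) pair, emit it with label 1 plus
    up to num_ng items the user has not interacted with, labelled 0."""
    rng = random.Random(seed)

    # Collect the set of items each user has interacted with.
    seen = {}
    for user, item in interactions:
        seen.setdefault(user, set()).add(item)

    samples = []
    for user, item in interactions:
        samples.append((user, item, 1))
        # Items this user has not interacted with.
        candidates = [i for i in range(num_items) if i not in seen[user]]
        # Cannot draw more negatives than there are unseen items.
        for neg in rng.sample(candidates, min(num_ng, len(candidates))):
            samples.append((user, neg, 0))
    return samples


interactions = [(0, 1), (0, 3), (1, 2)]
train_samples = sample_negatives(interactions, num_items=6, num_ng=2)
```

Each of the three observed interactions yields one positive row and two negative rows, so `train_samples` has nine rows in total.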


### Step 4 - Incorporating into tune.py and test.py

We now need to put the model name into the model builder in both tune.py and test.py. 

1. In each file, search "if config['algo_name'].lower() in" in your code editor. You should come across the following code block:
    
    <img src='images/new_algo/testpy.png' alt="image in test.py" width="700"/>
    

2. This code block basically:
    - Builds the model that you want using `RecommenderModel()`
    - Pre-processes the raw `train_set` data into the format that your model needs. This includes performing negative sampling and converting the raw pandas DataFrame or numpy array into a torch DataLoader, `train_loader`
    - Fits your model to the training data using `model.fit(train_loader)`


3. Notice that each algorithm's short name sits in a list corresponding to the settings needed to build and fit that model. As shown in the blue circle, we add our model name 'neumf' to the list for the data processing that our model needs. If you need custom processing, feel free to create your own `elif` block.

### Step 5 - Understanding `AbstractRecommender` and `GeneralRecommender` 

Use the `self.config` object for all the configuration/hyperparameter information the model needs. Use `fit()` to train the model and store what it learns inside the model object. We will use `model.rank()` later to test the model on top-n ranking (or for "forward propagation" in neural methods). The following example shows how this is done in `SndMostPop.fit()`:

```
    def fit(self, training_df: DataFrame) -> np.ndarray:
        '''
        Ranks each item in the training dataframe by its popularity (number of user-item interactions)
        and returns the resulting 1-D collection of per-item counts
        '''
        items_column = training_df[self.config['IID_NAME']]  # config['IID_NAME'] usually returns 'item'
        self.item_counts_series = items_column.value_counts()

        return self.item_counts_series
```


### Step 6 - Designing the `model.rank()`

Unlike `fit()`, `rank()` accepts only a torch DataLoader as its argument, so it is important to follow this function signature:

```
    def rank(self, test_loader: torch.utils.data.DataLoader) -> np.ndarray:
```

### The function input: `test_loader`

As for the dataset contained in `test_loader`, remember that the data comes in batches (default size 128). Each batch is two-dimensional, with 128 arrays (rows), one per user. Each array contains 1. **the user ID** and 2. **another array of negatively-sampled items, called the candidate set** (i.e., items the user has no interaction data for, which need to be ranked so that a recommendation can be made).

So each batch holds 128 arrays of the form:

[ user-id, [... 1000 candidate_item_ids ...]]
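For intuition, the sketch below ranks each user's candidates with a toy popularity model. It is not DaisyRec code: `ToyMostPop` and its `topk` argument are hypothetical, and plain Python lists stand in for both the torch DataLoader input and the NumPy output so that the example runs without extra dependencies.

```python
from collections import Counter


class ToyMostPop:
    """Toy popularity ranker (hypothetical; illustrates the data layout only)."""

    def __init__(self, topk=3):
        self.topk = topk
        self.item_counts = Counter()

    def fit(self, train_items):
        # Count how many interactions each item received in training.
        self.item_counts = Counter(train_items)

    def rank(self, test_loader):
        """test_loader yields batches; each batch is a list of
        [user_id, candidate_item_ids] rows."""
        ranked = []
        for batch in test_loader:
            for user_id, candidates in batch:
                # Sort by descending count; sorted() is stable, so ties
                # keep their original candidate order.
                ordered = sorted(candidates, key=lambda i: -self.item_counts[i])
                ranked.append(ordered[:self.topk])
        return ranked


model = ToyMostPop(topk=2)
model.fit([5, 5, 5, 7, 7, 9])

# One batch with two users, each with three candidate items.
test_loader = [[[0, [9, 7, 5]], [1, [8, 9, 7]]]]
top_items = model.rank(test_loader)
```

The result holds one row per user, in the same order as the users appear in the batches, with each row listing that user's highest-scoring candidates first.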

### The function output: `t`







