<img src="./images/logo.png" alt="Alt Text" width="700">


# DaisyRec tutorial: evaluating a new algorithm

This tutorial shows how to add a new model to DaisyRec and evaluate its test metrics.
It is still under construction, so please report any bugs you find!

## Steps 

Say you want to create a new recommender algorithm called Second Most Popular, which takes the top n+1 most popular items and drops the single most popular one, leaving n recommendations.
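
To make the idea concrete, here is a minimal plain-Python sketch of the Second Most Popular rule, independent of DaisyRec. The function name `second_most_popular` and the sample interaction list are made up for illustration; the actual model class is built in the steps below.

```python
from collections import Counter


def second_most_popular(item_ids, n):
    """Return n items: the top n+1 by interaction count, minus the most popular one."""
    counts = Counter(item_ids)
    # most_common(n + 1) yields (item, count) pairs, highest count first
    top = [item for item, _ in counts.most_common(n + 1)]
    return top[1:]  # drop the single most popular item


# Hypothetical interaction log: one entry per user-item interaction
interactions = [7, 7, 7, 3, 3, 5, 5, 5, 5, 9]
print(second_most_popular(interactions, n=2))
```

In this sample, item 5 has the most interactions and is dropped, so the result is `[7, 3]`.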
 
First, **create a shortened name string**, e.g. "sndmostpop". Remember this string: it names the config file in Step 1 and identifies the model in Step 3.



### Step 1 - adding default hyperparameter configurations
 
1. Go to folder 'daisy/assets'

2. Create a YAML config file for the model, named after the shortened name string. In this case the new file would be 'daisy/assets/sndmostpop.yaml'.

3. Inside the YAML file, enter all the **default** hyperparameters in YAML format. Since this algorithm has no hyperparameters, the file is left empty (but it still needs to be created).

![Alt Text](./images/new_algo/new_yaml.png)

### Step 2 - adding the model .py file

1. Create the new model Python file, following the naming convention of appending "Recommender" to the name, e.g., SecondMostPopRecommender.py

2. Put the file into the daisy/model folder, in the /accuracy or /diversity subfolder, depending on whether your model focuses on increasing the accuracy or the diversity of recommendations.

![Alt Text](./images/new_algo/new_py.png)

3. Inside the file, define your model class briefly; we will go into more detail later. Define preliminary `fit()` and `rank()` methods, and ignore the function signatures for now. Your model should be a child class of GeneralRecommender. In this case, it would be:


    ```
    from daisy.model.AbstractRecommender import GeneralRecommender
    from pandas import DataFrame
    import numpy as np  # needed for the np.ndarray type hints below
    import torch
    
    class SndMostPop(GeneralRecommender):
        '''
        Model recommends user the second most popular items in the list
        '''
        tunable_param_names = [] # this should be a list of all hyperparameter names
        def __init__(self, config):
            super(SndMostPop, self).__init__(config)
            self.config = config

        def fit(self, training_data: DataFrame) -> np.ndarray:
            pass
        
        def rank(self, test_loader: torch.utils.data.DataLoader ) -> np.ndarray:
            pass
    ```

4. Now, import the Model in daisy/model/Models.py, which is used for importing models into other files:

![Alt](./images/new_algo/modelspy.png)

### Step 3 - incorporating into tune.py and test.py

We now need to put the model name into the model builder in both tune.py and test.py. 

1. In each file, search "if config['algo_name'].lower() in" in your code editor. You should come across the following code block:

    ![image.png](./images/new_algo/testpy.png)

2. This code block basically:
    - Builds the model that you want
    - Pre-processes the raw training data (called `train_set`, a Pandas DataFrame) into the format that your model needs
    - Fits your model to the training data using model.fit()


3. Notice that each algorithm's shorthand name sits in a list that corresponds to the settings needed to build and fit that model. As shown in the blue circle, add your model name to the list that matches the training-data type your model needs for fitting (see Step 4 for choosing which one).
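
The dispatch pattern described in this step can be sketched roughly as follows. This is not the actual code from tune.py or test.py: the list names, the `'some_neural_model'` entry, and the `training_input_kind` helper are hypothetical, and only the `config['algo_name'].lower() in ...` check mirrors the search string from item 1.

```python
# Hypothetical groupings; the real lists live in tune.py and test.py.
DATAFRAME_ALGOS = ['sndmostpop']          # models fitted directly on train_set
DATALOADER_ALGOS = ['some_neural_model']  # models fitted on a DataLoader


def training_input_kind(config):
    """Report which kind of training input the configured algorithm expects."""
    name = config['algo_name'].lower()
    if name in DATAFRAME_ALGOS:
        return 'dataframe'
    if name in DATALOADER_ALGOS:
        return 'dataloader'
    raise NotImplementedError(f"unknown algorithm: {name}")


print(training_input_kind({'algo_name': 'SndMostPop'}))
```

Because the name is lower-cased before the lookup, the shortened name string should be added to the list in lower case.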

### Step 4 - Which data type do I need for my model to work?



As you can see from the code blocks above, they mainly take the `train_set`, pre-process it, and convert it into the data format that the model needs.

`train_set` is a Pandas DataFrame with columns for user IDs, item IDs, ratings and timestamps. User IDs and item IDs are numbered from 0, and the rows are not necessarily sorted by ID. For example:

| User ID | Item ID | Rating | Timestamp |
|---------|---------|--------|-----------|
|   1 |   2023 |   4.5  |  1624165321  |
|   0 |   4 |   3.8  |  1624165487  |
|   2 |   293 |   5.0  |  1624165632  |

Some datasets have explicit feedback (i.e., ratings range from 0.0/5 to 5.0/5), whereas others have only implicit feedback (i.e., only the existence of an interaction is recorded, and the rating is hard-set to 1.0/5). Inspect the dataset before you use it.
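
One quick way to inspect a dataset is to check whether the rating column holds a single constant value. The sketch below uses plain Python and a hypothetical helper named `feedback_kind`; with a real `train_set` you would pass in the values of its rating column instead of the hard-coded lists.

```python
def feedback_kind(ratings):
    """Classify a rating column as 'implicit' (one constant value) or 'explicit'."""
    distinct = set(ratings)
    if not distinct:
        raise ValueError("no ratings to inspect")
    return 'implicit' if len(distinct) == 1 else 'explicit'


print(feedback_kind([4.5, 3.8, 5.0]))  # ratings from the example table
print(feedback_kind([1.0, 1.0, 1.0]))  # every interaction hard-set to 1.0
```

This is only a heuristic: an explicit-feedback dataset in which every user happened to give the same rating would also be reported as implicit.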

If you need special processing, explore `AEDataset`, `BasicNegtiveSampler` and `SkipGramNegativeSampler` to see whether these classes perform the processing you are looking for.

Most of the methods (especially neural methods) need the data converted into a PyTorch DataLoader. In this case, we do:

```
    sampler = MySampler(train_set, config)
    train_samples = sampler.sampling() # This returns a numpy array or pandas df
    train_dataset = BasicDataset(train_samples) # Converts pd.df/np.ndarray to simple pytorch Dataset
    train_loader = get_dataloader(
        train_dataset, 
        batch_size=config['batch_size'], 
        shuffle=True, 
        num_workers=4) # Convert to DataLoader
    model.fit(train_loader)
```

### Step 5 - Designing the `model.fit()` 

Use the `self.config` object for all the configuration and hyperparameter information the model needs. `fit()` trains the model and stores what it learns inside the model object. `model.rank()` is used later to test the model on top-n ranking (or for "forward propagation" in neural methods). The following example shows how this is done in `SndMostPop.fit()`:

```
    def fit(self, training_df: DataFrame) -> Series:  # needs `from pandas import Series`
        '''
        Counts how many user-item interactions each item has in the training dataframe.
        Stores the counts, sorted from most to least popular, and returns them as a pandas Series.
        '''
        items_column = training_df[self.config['IID_NAME']]  # config['IID_NAME'] usually returns 'item'
        self.item_counts_series = items_column.value_counts()

        return self.item_counts_series
```


### Step 6 - Designing the `model.rank()`

Unlike `fit()`, `rank()` takes only a torch DataLoader as its argument, so it is important to follow this function signature:

```
    def rank(self, test_loader: torch.utils.data.DataLoader) -> np.ndarray:
```

### The function input: `test_loader`

The data in `test_loader` comes in batches (default size 128). Each batch is two-dimensional: it has 128 rows, one per user. Each row contains 1. **the user ID** and 2. **an array of negatively-sampled items, called the candidate set** (i.e., items the user has no interaction data for, which must be scored and ranked for recommendation to that user).

So we will have 128 arrays of:

[ user-id, [... 1000 candidate_item_ids ...]]
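
To illustrate this batch layout, the sketch below ranks the candidates of each user by training popularity while skipping the globally most popular item. It uses plain Python lists in place of a DataLoader batch, and `rank_batch`, `item_counts` and `topk` are hypothetical names. The return format (one top-k list of item IDs per user) is an assumption made for illustration only, not the output that `rank()` is required to produce.

```python
from collections import Counter


def rank_batch(batch, item_counts, topk):
    """batch: list of (user_id, candidate_item_ids) rows.

    Returns one list of up to topk item IDs per row (assumed format).
    """
    # The single most popular item overall is never recommended
    most_popular = max(item_counts, key=item_counts.get) if item_counts else None
    ranked = []
    for _user_id, candidates in batch:
        eligible = [i for i in candidates if i != most_popular]
        # Items unseen in training get a count of 0 and sink to the bottom
        eligible.sort(key=lambda i: item_counts.get(i, 0), reverse=True)
        ranked.append(eligible[:topk])
    return ranked


# Hypothetical popularity counts, as fit() might have stored them
item_counts = Counter([5, 5, 5, 5, 7, 7, 7, 3, 3, 9])
# Two rows of a tiny batch: [user-id, [candidate_item_ids]]
batch = [(0, [3, 5, 7, 9]), (1, [9, 3, 11])]
print(rank_batch(batch, item_counts, topk=2))
```

A real batch would hold 128 such rows with 1000 candidates each, as described above; the tiny batch here only keeps the example readable.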

### The function output: `t`







