## Training Workflows

### Introduction

- Here are some common task blocks:
  - Extract data and process it
  - Train a model
  - Predict on a test set
  - Save results in a database


 - The data must first be prepared, typically via [ETL](https://en.wikipedia.org/wiki/Extract%2C_transform%2C_load) (extract, transform, load) or ELT jobs.
 - Training and making predictions (i.e., inference) require appropriate compute resources.
 - Reading and writing data imply access to an external service (such as an application database) or storage (such as AWS S3).
   - When we do data science work on a local machine, we typically read data in simple, manual ways (from disk or from a database) and write our results back to disk. This is not sufficient in a production setting or a business context, where automation, reliability, maintenance and resource requirements, and documentation become paramount.


 - In all settings, the aforementioned jobs may need to run periodically to pull the latest data from logging systems or data lakes and to make the freshest predictions.

 - Some examples of pipelines are:
  - persistent model pipelines: the model update is decoupled from updating predictions. For instance, the model is updated monthly, but predictions are made in real time or daily/hourly.
  - transient model pipelines: the model update is tightly coupled with predictions. This helps ensure that the model does not lose prediction accuracy when the data distribution changes (e.g., varies over time).
  - ...
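
 - To make the distinction concrete, here is a minimal sketch that counts how often each kind of pipeline retrains when predictions are made daily. The function names and the 60-day/30-day numbers are illustrative assumptions, not part of the course code.

```python
def plan_runs(n_days, retrain_every):
    """Return (day, retrained) pairs for a pipeline that predicts once per day.

    retrain_every=1 models a transient pipeline (retrain before every
    prediction run); a larger value models a persistent pipeline.
    """
    schedule = []
    for day in range(n_days):
        retrained = day % retrain_every == 0
        schedule.append((day, retrained))
    return schedule


def count_retrains(schedule):
    """Count the days on which the model was retrained."""
    return sum(1 for _, retrained in schedule if retrained)


transient = plan_runs(n_days=60, retrain_every=1)
persistent = plan_runs(n_days=60, retrain_every=30)

print("transient retrains:", count_retrains(transient))    # 60
print("persistent retrains:", count_retrains(persistent))  # 2
```

 - Both pipelines produce 60 prediction runs; the difference is only in how much training compute is spent along the way.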

#### Transient Pipeline

 - We will build a transient pipeline in which the model is rebuilt from scratch every time predictions are generated.
   - This needs compute resources (GPUs if it's a neural network), which may affect the cost-benefit analysis in a business context.

 - Our sub-tasks in the above transient pipeline are as follows:
  - get the training data
    - We have used surpriselib's `Dataset` class to get the MovieLens 100k/1m dataset versions.
  - train the model
    - We have trained a recommendation engine for movies with PyTorch (we have seen this model before) on the MovieLens data.
  - use the model for predictions
    - We have a script that reads a model from disk and makes predictions for a given set of users.
  - save the results in an external resource. In particular, we will try out [BigQuery](https://cloud.google.com/bigquery/).
    - This is new, and we illustrate a way to do this below.
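
 - The four sub-tasks above can be sketched as a chain of functions. This is only a skeleton under stated assumptions: every function here is a hypothetical stand-in (the "model" is a per-item mean rating, not the matrix factorization model, and the "database" is a Python list), meant to show how data flows from one block to the next.

```python
def fetch_data():
    # Stand-in for loading MovieLens ratings: (user, item, rating) triples.
    return [("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0)]


def train(ratings):
    # Stand-in "model": the mean rating of each item.
    totals = {}
    for _, item, rating in ratings:
        total, count = totals.get(item, (0.0, 0))
        totals[item] = (total + rating, count + 1)
    return {item: total / count for item, (total, count) in totals.items()}


def predict(model, users, n=1):
    # Recommend the same top-n items (by mean rating) to every user.
    top_items = sorted(model, key=model.get, reverse=True)[:n]
    return [{"uid": uid, "recommended": item}
            for uid in users for item in top_items]


def save(rows, sink):
    # Stand-in for the external write (BigQuery in this section).
    sink.extend(rows)
    return len(rows)


def run_transient_pipeline(users, sink):
    ratings = fetch_data()
    model = train(ratings)  # rebuilt from scratch on every run
    rows = predict(model, users)
    return save(rows, sink)


sink = []
n_saved = run_transient_pipeline(["u1", "u2"], sink)
print(n_saved, sink)
```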

 - In the script below, we perform the last two steps together:

```python
from recommend_pytorch_train import MF
from recommend_pytorch_inf import get_top_n
import torch
import pandas as pd
import surprise
import datetime
import time
from google.oauth2 import service_account
import pandas_gbq


def get_model_from_disk():
    start_time = time.time()

    # data preload
    data = surprise.Dataset.load_builtin('ml-1m')
    trainset = data.build_full_trainset()
    testset = trainset.build_anti_testset()
    movies_df = pd.read_csv('../data/ml-1m/movies.dat',
                            sep="::", header=None, engine='python')
    movies_df.columns = ['iid', 'name', 'genre']
    movies_df.set_index('iid', inplace=True)

    # model preload
    k = 100  # latent dimension
    c_bias = 1e-6
    c_vector = 1e-6
    model = MF(trainset.n_users, trainset.n_items,
               k=k, c_bias=c_bias, c_vector=c_vector)
    model.load_state_dict(torch.load(
        '../data/models/recommendation_model_pytorch.pkl'))  # TODO: prevent overwriting
    model.eval()

    print('Model and data preloading completed in ', time.time()-start_time)

    return model, testset, trainset, movies_df


def get_predictions(model, testset, trainset, movies_df):
    # save the recommended items for a given set of users
    sample_users = list(set([x[0] for x in testset]))[:4]

    df_list = []
    for uid in sample_users:
        recommended = get_top_n(model, testset, trainset, uid, movies_df, n=10)
        df_list.append(pd.DataFrame(data={'uid': [uid]*len(recommended),
                                          'recommended': [x[1] for x in recommended]},
                                    columns=['uid', 'recommended']))

    df = pd.concat(df_list, sort=False)
    df['pred_time'] = str(datetime.datetime.now())
    return df


def upload_to_bigquery(df):
    # Send predictions to BigQuery.
    # Requires a credential file in the current working directory.
    table_id = "movie_recommendation_service.predicted_movies"
    project_id = "authentic-realm-276822"
    credentials = service_account.Credentials.from_service_account_file('model-user.json')
    pandas_gbq.to_gbq(df, table_id, project_id=project_id,
                      if_exists='replace', credentials=credentials)


if __name__ == '__main__':
    model, testset, trainset, movies_df = get_model_from_disk()
    df = get_predictions(model, testset, trainset, movies_df)
    print(df)
    upload_to_bigquery(df)
```


 - To get the above code running, we install one additional package, `pandas-gbq` (imported as `pandas_gbq`; see [https://pandas-gbq.readthedocs.io/en/latest/](https://pandas-gbq.readthedocs.io/en/latest/)), to upload our predictions to Google's managed BigQuery service, which can act like an application database.

```bash
conda install pandas-gbq --channel conda-forge
```



 - (Aside) To quickly check whether you are authenticated, execute the following commands in the terminal (don't forget to set the environment variable beforehand using `export GOOGLE_APPLICATION_CREDENTIALS=/Users/theja/model-user.json`):

```bash
(datasci-dev) ttmac:pipelines theja$ gcloud auth list
   Credentialed Accounts
ACTIVE  ACCOUNT
*       *****@gmail.com

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

(datasci-dev) ttmac:pipelines theja$ gcloud config list project
[core]
project = authentic-realm-276822

Your active configuration is: [default]
```



 - (Aside) You may need to downgrade a package using the command `conda install google-cloud-core==1.3.0` if you see errors such as

```python
AttributeError: 'ClientOptions' object has no attribute 'scopes'
```

 - Once we run all cells of the notebook, we have essentially pushed a pandas dataframe of predictions to Google BigQuery.




 - We should be able to see these predictions in the Google Cloud console, so let's open the console homepage.

![bq1](images/bq1.png)


 - From the console homepage, navigating to BigQuery lands us on the following page.

![bq2](images/bq2.png)


 - We are not interested in the SQL editor at the moment. At the bottom right, we can see our project.

![bq3](images/bq3.png)



 - Expand the project in the left column to get to the `movie_recommendation_service` dataset and then to the `predicted_movies` table. The schema is shown by default.

![bq4](images/bq4.png)

 - Switching from the Schema tab to the Preview tab shows that the upload was successful.

![bq5](images/bq5.png)



 - Let's rerun the script from the command line. For simplicity, I am assuming that `model-user.json` is in the current directory. If it's inconvenient to place it there, we can set the environment variable `GOOGLE_APPLICATION_CREDENTIALS`.
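
 - The upload script reads `model-user.json` explicitly, so relying on the environment variable instead would mean passing a resolved path to it. A minimal sketch of that lookup follows, using only the standard library. The helper names `resolve_credentials_path` and `looks_like_service_account_key` are hypothetical; the only fact assumed about the key file is that a service account key is a JSON object whose `"type"` field is `"service_account"`.

```python
import json
import os
from pathlib import Path


def resolve_credentials_path(local_name="model-user.json", cwd=None, environ=None):
    """Return the key file to use, or None if none is found.

    Prefers a file in the working directory, then falls back to the path
    stored in GOOGLE_APPLICATION_CREDENTIALS.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    environ = os.environ if environ is None else environ
    local = cwd / local_name
    if local.is_file():
        return local
    env_value = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if env_value and Path(env_value).is_file():
        return Path(env_value)
    return None


def looks_like_service_account_key(path):
    """Cheap sanity check: valid JSON whose "type" is "service_account"."""
    try:
        with open(path) as f:
            info = json.load(f)
    except (OSError, ValueError):
        return False
    return isinstance(info, dict) and info.get("type") == "service_account"


path = resolve_credentials_path()
print("credentials file:", path)
```

 - The returned path could then be handed to `service_account.Credentials.from_service_account_file` in place of the hard-coded file name.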

 - Going back to the BigQuery interface, the only thing that has changed is the timestamp when the predictions were generated (previewed results may not retrieve the same user-ids).

 ![bq6](images/bq6.png)
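
 - Since the `pred_time` column is what distinguishes one run from the next, it can help to make it unambiguous. The script stores `str(datetime.datetime.now())`, which is a local time without a timezone. As an alternative (not what the script above does), here is a small sketch that produces a timezone-aware UTC timestamp; the function name `prediction_timestamp` is an assumption for illustration.

```python
from datetime import datetime, timezone


def prediction_timestamp(now=None):
    """Return an ISO 8601 UTC timestamp string for tagging a batch of predictions."""
    if now is None:
        now = datetime.now(timezone.utc)
    else:
        now = now.astimezone(timezone.utc)
    return now.isoformat(timespec="seconds")


fixed = datetime(2020, 5, 20, 12, 30, 0, tzinfo=timezone.utc)
print(prediction_timestamp(fixed))  # 2020-05-20T12:30:00+00:00
```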

 - This table can also be queried from a notebook using the following snippet:

```python
import os

import pandas as pd
from google.cloud import bigquery

# Point the client library at the service account key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './model-user.json'
client = bigquery.Client()
sql = "select * from movie_recommendation_service.predicted_movies"
df = client.query(sql).to_dataframe()
df.head()
```

 - If the client library cannot locate the credentials, the call fails with an error such as

```python
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
```

#### Remark

 - While we did all four blocks in a couple of scripts, it makes sense to break the whole operation into four explicit blocks (fetch data, train, predict, send predictions to a database service). This way, a block can retry its execution if it fails, or be handled manually by a team member. In particular, this retry can be automated using scheduling tools. One such tool is `cron`, which is the subject of the next section.

 - For simplicity, we will combine the inference block and the block that sends predictions to the database into a single script and container for this transient model pipeline. Our next tool will allow us to run it automatically on a periodic schedule.
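
 - The retry idea from the first remark can be illustrated in-process, before any scheduler is involved. The sketch below is an assumption-laden example, not part of the course scripts: `run_with_retry` and `FlakyUpload` are hypothetical names, and the flaky step simply simulates an upload that fails twice before succeeding.

```python
import time


def run_with_retry(step, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). Re-raises the last exception if every
    attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step(), attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay_seconds)


class FlakyUpload:
    """Hypothetical block that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("temporary failure")
        return "uploaded"


result, attempts = run_with_retry(FlakyUpload(), max_attempts=5)
print(result, attempts)  # uploaded 3
```

 - Keeping each block behind a wrapper like this means a failed upload can be retried without retraining the model or refetching the data.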







