# Sagemaker

### Necessary imports

In [2]:
import boto3
import sagemaker

import numpy as np
import io
import pandas as pd
from sklearn.model_selection import train_test_split

### Load data

Read the CSV data **from an S3 bucket into a DataFrame**.

In [3]:
df = pd.read_csv(
    's3://la-liga-final-project/datasets/players_full_data_v2.csv'
)

In [4]:
df.head()

Unnamed: 0,weekNumber,totalPoints,id,name,position,team_id,team_shortName,cum_totalPoints,cumavg_totalPoints,curr_match_as_local,curr_match_opponent_id
0,1,1,1000,Edgar Paul Akouokou,Centrocampista,5,BET,0.0,0.0,True,7
1,2,2,1000,Edgar Paul Akouokou,Centrocampista,5,BET,1.0,1.0,False,33
2,3,0,1000,Edgar Paul Akouokou,Centrocampista,5,BET,3.0,1.5,True,13
3,5,0,1000,Edgar Paul Akouokou,Centrocampista,5,BET,3.0,1.0,True,20
4,6,2,1000,Edgar Paul Akouokou,Centrocampista,5,BET,3.0,0.75,True,28


In [90]:
df = df.drop(columns=['weekNumber', 'name', 'team_shortName', 'position'])

In [91]:
# Keep only players with at least 10 rows, since we need enough
# observations per player for training and testing.
rows_per_id = df.groupby('id')['id'].transform('size')
df = df[rows_per_id >= 10]
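
The same filtering idea can be checked on a small example. This is a minimal sketch on hypothetical toy data (the ids, values and the threshold of 3 are made up for illustration), counting rows per `id` and keeping only the ids that reach the threshold:

```python
import pandas as pd

# Hypothetical toy data: id 1 appears three times, id 2 only once.
toy = pd.DataFrame({'id': [1, 1, 2, 1], 'totalPoints': [3, 0, 5, 2]})

min_rows = 3  # the notebook uses 10; 3 keeps the example small

# Number of rows for each id, aligned with the original rows.
rows_per_id = toy.groupby('id')['id'].transform('size')
filtered = toy[rows_per_id >= min_rows]

print(filtered)
```

Only the rows of id 1 remain, because id 2 has fewer than `min_rows` rows.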

In [92]:
# target
target_var = 'totalPoints'

# move target column to first
target_col = df.pop(target_var)
df.insert(0, target_var, target_col)

df.head()

Unnamed: 0,totalPoints,id,team_id,cum_totalPoints,cumavg_totalPoints,curr_match_as_local,curr_match_opponent_id
0,1,1000,5,0.0,0.0,True,7
1,2,1000,5,1.0,1.0,False,33
2,0,1000,5,3.0,1.5,True,13
3,0,1000,5,3.0,1.0,True,20
4,2,1000,5,3.0,0.75,True,28


### Train / Val / Test split

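We split the data into train (80%), validation (10%) and test (10%) sets, stratifying by player `id` so that each player is represented in roughly the same proportion in every split.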

In [93]:
train, testval = train_test_split(df, train_size=0.8, random_state=1200, stratify=df[['id']])
val, test = train_test_split(testval, train_size=0.5, random_state=1200, stratify=testval[['id']])

In [94]:
train.shape, val.shape, test.shape

((6764, 7), (845, 7), (846, 7))
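
To see how the two-step split produces the 80/10/10 proportions, here is a minimal sketch on hypothetical toy data (5 made-up players with 20 rows each). The first call keeps 80% for training, and the second call halves the remaining 20%:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 players with 20 rows each (100 rows in total).
toy = pd.DataFrame({
    'id': [player for player in range(5) for _ in range(20)],
    'x': range(100),
})

# Step 1: 80% train, 20% left over, keeping the share of each id.
toy_train, toy_rest = train_test_split(
    toy, train_size=0.8, random_state=0, stratify=toy['id']
)
# Step 2: split the remaining 20% in half, giving 10% validation and 10% test.
toy_val, toy_test = train_test_split(
    toy_rest, train_size=0.5, random_state=0, stratify=toy_rest['id']
)

print(toy_train.shape, toy_val.shape, toy_test.shape)
```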

To work with Sagemaker, we need to upload our data to S3. The function below does that in two steps:

- It calls `.to_csv` as if it were writing to a file, but passes an `io.StringIO` object instead of a file path. The `StringIO` acts as an in-memory text buffer that holds the CSV contents.
- It writes to S3, as in the example we saw in Lambda functions, now passing the contents of the `StringIO` as the body of the object.

In [95]:
s3 = boto3.resource('s3')

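def upload_to_s3(df, bucket, filename):
    # Serialize the DataFrame to CSV in memory, without header or index,
    # so that the first column of each row is the target.
    placeholder = io.StringIO()
    df.to_csv(placeholder, header=False, index=False)
    # Upload the CSV text as the body of the S3 object.
    s3_object = s3.Object(bucket, filename)
    s3_object.put(Body=placeholder.getvalue())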
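
The in-memory step can be tried without touching S3. This is a minimal sketch on a hypothetical two-row DataFrame, showing the text that would be sent as the object body:

```python
import io

import pandas as pd

# Hypothetical toy data, with the target in the first column.
toy = pd.DataFrame({'totalPoints': [1, 2], 'id': [1000, 1000]})

# Write the CSV into an in-memory text buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, header=False, index=False)

# getvalue() returns everything written to the buffer as a single string.
csv_text = buffer.getvalue()
print(csv_text)
```

Each line of `csv_text` holds one row, with no header line and no index column.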
    

After defining this, we proceed to the upload of the train and validation split. 

In [97]:
upload_to_s3(train, 'la-liga-final-project', 'output/train.csv')

In [98]:
upload_to_s3(val, 'la-liga-final-project', 'output/val.csv')

(Go to the S3 console and verify that these files are there.)

## Setting up the model

Now we set up our model.

We use the `Estimator` class from the `sagemaker.estimator` module, which creates the **environment** in which training jobs for a model run.

We specify: 

- A container name (Sagemaker works with containers; this code points to a pre-existing container that holds everything needed to run xgboost).
- A role name (the training job needs a role to have sufficient permissions, similarly to what we saw in Lambda functions). Remember that we created this role when starting the notebook server. 
- The number of instances for training (we use 1 but could use more in large jobs, to scale). 
- The type of instance (we select one that's included in the Sagemaker Free Tier). 
- The output path, where the model and other info will be written.
- The hyperparameters of the algorithm (number of training rounds and loss function).
- The current session (the SDK uses it to make the AWS API calls).

(Remember to check the [pricing info](https://aws.amazon.com/sagemaker/pricing/#:~:text=amount%20of%20usage.-,Amazon%20SageMaker%20Free%20Tier,-Amazon%20SageMaker%20is) for more details on the Sagemaker Free Tier)

In [99]:
region_name = boto3.Session().region_name
role = sagemaker.get_execution_role()


container = sagemaker.image_uris.retrieve('xgboost', 
                                          region_name,
                                          version='0.90-1')


output_location = 's3://la-liga-final-project/sagemaker/'

# For a list of possible xgboost parameters, see
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters
hyperparams = {
    'num_round': '20',
    'objective': 'reg:squarederror'
}

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.t3.medium',
    output_path=output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


Now we have to create what Sagemaker calls "channels". We specify where the data is and in which format, and collect the channels in a dictionary:

In [100]:
from sagemaker.inputs import TrainingInput

train_channel = TrainingInput(
    's3://la-liga-final-project/output/train.csv',
    content_type='text/csv'
)
val_channel = TrainingInput(
    's3://la-liga-final-project/output/val.csv',
    content_type='text/csv'
)

channels_for_training = {
    'train': train_channel,
    'validation': val_channel
}


We are ready to train. 

When we execute the cell below, a training job will be launched. The training job is a "managed service" independent from this notebook. So if you go to the Sagemaker console and click on "Training jobs", you will find it there. 

Note: at this point you could also launch the training job from the console, although it would ask you to enter all the information above manually.


In [101]:
estimator.fit(inputs=channels_for_training, logs=False)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-03-11-14-17-58-037



2023-03-11 14:17:58 Starting - Starting the training job...........
2023-03-11 14:18:54 Starting - Preparing the instances for training................
2023-03-11 14:20:23 Downloading - Downloading input data....
2023-03-11 14:20:48 Training - Downloading the training image......
2023-03-11 14:21:23 Training - Training image download completed. Training in progress....
2023-03-11 14:21:44 Uploading - Uploading generated training model..
2023-03-11 14:22:00 Completed - Training job completed


We can print the job name, which is the name that appears in the console.

In [102]:
estimator.latest_training_job.name

'sagemaker-xgboost-2023-03-11-14-17-58-037'
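
With the job name, the status can also be queried from code instead of the console. This is a minimal sketch, assuming a boto3 Sagemaker client and its `describe_training_job` call. The helper `training_job_status` and the stand-in class `FakeSagemakerClient` are hypothetical names introduced here so that the example runs without AWS credentials:

```python
def training_job_status(sm_client, job_name):
    """Return the status string reported for a training job."""
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    return response['TrainingJobStatus']


class FakeSagemakerClient:
    """Hypothetical offline stand-in that mimics the one call used above."""

    def describe_training_job(self, TrainingJobName):
        # A real response contains many more fields than these two.
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': 'Completed'}


status = training_job_status(FakeSagemakerClient(), 'my-training-job')
print(status)

# In the notebook, the real client would be used instead:
# import boto3
# training_job_status(boto3.client('sagemaker'), estimator.latest_training_job.name)
```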

Finally, we can also get some metrics of the training job here. 

In [103]:
metrics = sagemaker.analytics.TrainingJobAnalytics(
    estimator.latest_training_job.name,
    metric_names=['train:rmse', 'validation:rmse']
)

In [104]:
metrics.dataframe()

Unnamed: 0,timestamp,metric_name,value
0,0.0,train:rmse,2.52256
1,0.0,validation:rmse,2.97158
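
Both metrics are root mean squared errors, so they are expressed in the same units as `totalPoints`. As a reminder of what the number means, here is a minimal sketch of the computation, assuming the standard RMSE definition and using hypothetical toy values (not taken from the training job):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical points and predictions: two exact hits and two misses by 2 points.
toy_rmse = rmse([1, 2, 0, 0], [1, 0, 0, 2])
print(toy_rmse)
```

Here the squared errors are 0, 4, 0 and 4, so the mean is 2 and the RMSE is the square root of 2.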
