## What is SageMaker  
(from their website) Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.  

SageMaker has features for every step of the ML pipeline, such as Autopilot for AutoML and Data Wrangler for GUI-based preprocessing.  

**In this notebook I used only the Hyperparameter Tuning feature**

The feature engineering steps come from this kernel: [https://www.kaggle.com/subinium/how-to-use-pycaret-with-feature-engineering](https://www.kaggle.com/subinium/how-to-use-pycaret-with-feature-engineering)  
For more detail on the feature engineering steps, please check that kernel, it's great.  

Here I'll focus only on the Hyperparameter Optimization part

The training setup and the idea of pseudo-labelling the test data come from this kernel: [https://www.kaggle.com/alexryzhkov/n3-tps-april-21-lightautoml-starter](https://www.kaggle.com/alexryzhkov/n3-tps-april-21-lightautoml-starter)  

Thanks a lot to **subinium** and **alexryzhkov** for sharing.

## Important

I didn't run the SageMaker cells here because they require my AWS credentials.  
I ran them in a SageMaker Jupyter notebook instead, where nothing needed configuring and all the libraries were installed out of the box.  
**The main point is to explain how to use SageMaker tuning jobs.**

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
from category_encoders.cat_boost import CatBoostEncoder

from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')
pseud_label = pd.read_csv('/kaggle/input/best-sub/best_sub.csv')

In [None]:
test['Survived'] = pseud_label['Survived']
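
As a rough illustration of the pseudo-labelling idea used above, the toy sketch below attaches predicted labels to an unlabelled test set and stacks it under the labelled training set. The tiny frames and the `pseudo` predictions are made up for the example; they only stand in for `train`, `test` and `best_sub.csv`.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values)
train_toy = pd.DataFrame({"Pclass": [1, 3, 2], "Survived": [1, 0, 1]})
test_toy = pd.DataFrame({"Pclass": [3, 1]})

# Hypothetical predictions taken from an earlier submission
pseudo = pd.DataFrame({"Survived": [0, 1]})

# Attach the predicted labels to the test rows (pseudo-labels)
test_labeled = test_toy.copy()
test_labeled["Survived"] = pseudo["Survived"].values

# Train and pseudo-labelled test rows can now be used together
combined = pd.concat([train_toy, test_labeled], ignore_index=True)
print(combined)
```

The pseudo-labels are only as good as the submission they come from, so any score computed on rows labelled this way is optimistic.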

In [None]:
def converter(x):
    c, n = '', ''
    x = str(x).replace('.', '').replace('/','').replace(' ', '')
    for i in x:
        if i.isnumeric():
            n += i
        else :
            c += i 
    if n != '':
        return c, int(n)
    return c, np.nan
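
To make the behaviour of `converter` concrete, here is a small self-contained sketch that repeats the function and runs it on a few sample strings. The sample values are made up for illustration and are not taken from the competition data.

```python
import numpy as np

def converter(x):
    """Split a ticket/cabin string into (non-digit characters, digits as int)."""
    c, n = '', ''
    x = str(x).replace('.', '').replace('/', '').replace(' ', '')
    for ch in x:
        if ch.isnumeric():
            n += ch
        else:
            c += ch
    return (c, int(n)) if n != '' else (c, np.nan)

# Hypothetical sample inputs, including a missing value
examples = ['A/5 21171', 'PC 17599', 'C85', '113803', np.nan]
results = {str(e): converter(e) for e in examples}
for raw, parsed in results.items():
    print(f'{raw!r:>12} -> {parsed}')
```

Two side effects are worth knowing: digits inside the prefix are merged into the number (`'A/5 21171'` gives `('A', 521171)`), and a missing value comes back with the string `'nan'` as its type because of the `str(x)` call.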

In [None]:
def create_extra_features(data):
    data['Ticket_type'] = data['Ticket'].map(lambda x: converter(x)[0])
    data['Ticket_number'] = data['Ticket'].map(lambda x: converter(x)[1])
    
    data['Cabin_type'] = data['Cabin'].map(lambda x: converter(x)[0])
    data['Cabin_number'] = data['Cabin'].map(lambda x: converter(x)[1])
    data['Name1'] = data['Name'].map(lambda x: x.split(', ')[0])    
    data['Name2'] = data['Name'].map(lambda x: x.split(', ')[1])
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['isAlone'] = data['FamilySize'].apply(lambda x : 1 if x == 1 else 0)
    
    # Sex
    data['Sex'] = data['Sex'].map({'male':0, 'female':1})
    
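    # Age: fill missing values with the median age of the passenger's class
    # (medians come from train; the mask must come from the frame being filled)
    age_map = train[['Age', 'Pclass']].dropna().groupby('Pclass').median().to_dict()['Age']
    missing_age = data['Age'].isnull()
    data.loc[missing_age, 'Age'] = data.loc[missing_age, 'Pclass'].map(age_map)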

    # Embarked
    data['Embarked'] = data['Embarked'].fillna('X')
    return data

train = create_extra_features(train)
test = create_extra_features(test)

In [None]:
ce = CatBoostEncoder()

column_name = ['Ticket_type', 'Embarked', 'Cabin_type', 'Name1', 'Name2']
train[column_name] = ce.fit_transform(train[column_name], train['Survived'])
test[column_name] = ce.transform(test[column_name])

In [None]:
train.drop(['PassengerId','Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['PassengerId','Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [None]:
all_data = pd.concat([train, test]).reset_index(drop = True)
all_data = all_data.fillna(all_data.median())
# Rows after the first len(train) are the (pseudo-labelled) test rows
_, test_data = all_data[:len(train)], all_data[len(train):]


In [None]:
all_data.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(all_data.drop('Survived', axis=1), all_data.Survived, test_size=0.2, random_state=0, stratify=all_data.Survived)

## SageMaker  

I ran the following inside a SageMaker Jupyter notebook, so I didn't need to install any of these libraries.

```python
from sagemaker.xgboost.estimator import XGBoost
from sagemaker import image_uris
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker import get_execution_role
import boto3

region = boto3.Session().region_name
session = sagemaker.Session()
role = get_execution_role()
bucket = session.default_bucket()
```

In [None]:
# SageMaker's built-in XGBoost assumes the first column of the CSV is the target
# We also provide both a train set and a validation set
# Prepare datasets
df_train = X_train.copy()
df_valid = X_test.copy()

df_train['Survived'] = y_train
df_valid['Survived'] = y_test

cols = df_train.columns.tolist()
cols.insert(0, cols.pop(cols.index('Survived')))

df_train = df_train.reindex(columns= cols)
df_valid = df_valid.reindex(columns= cols)
df_train.head()


### Sending data to S3

I'll use the SageMaker session to upload the data to S3 so that SageMaker can read it from there.
```python
prefix = 'sagemaker/tps-titanic/xgboost'
train_file = 'df_train.csv'
df_train.to_csv(train_file, index=False, header=False)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

# Despite the "test" naming, this file holds the validation split
test_file = 'df_valid.csv'
df_valid.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)
```
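
Since the CSV layout matters here (target first, no header row, no index), a quick local check before uploading can save a failed job. The sketch below uses a made-up two-row frame as a stand-in for `df_train` and writes to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Toy stand-in for df_train: features first, target last (hypothetical values)
df = pd.DataFrame({'Pclass': [1, 3], 'Age': [38.0, 26.0], 'Survived': [1, 0]})

# Move the target to the first position
cols = ['Survived'] + [c for c in df.columns if c != 'Survived']
df = df[cols]

# Write without index and header, as done for the S3 upload
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

first_line = csv_text.splitlines()[0]
print(first_line)  # target value, then the features
```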

## Set up hyperparameter tuning
Note: with the settings below, the hyperparameter tuning job can take about 30 minutes to complete.

Now that we have prepared the dataset, we are ready to train models. Before we do that, note that algorithms have settings called "hyperparameters" that can dramatically affect the performance of the trained models. For example, the XGBoost algorithm has dozens of hyperparameters, and we need to pick the right values for them to achieve the desired training results. Because the best hyperparameter setting also depends on the dataset, it is almost impossible to pick it without searching for it, and a good search algorithm can find it in an automated and effective way.

We will use SageMaker hyperparameter tuning to automate the search. Specifically, we specify a range (or, for categorical hyperparameters, a list of possible values) for each hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate the results of those training jobs against a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we set a budget (the maximum number of training jobs), and the tuning job completes once that many training jobs have been executed.
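
The budgeted search loop can be sketched locally in a few lines. Note that this is plain random search on a made-up objective, not the Bayesian strategy that SageMaker uses by default, so unlike SageMaker it does not learn from previous results; `toy_objective` is a hypothetical stand-in for the validation score of one training job.

```python
import random

def toy_objective(params):
    """Hypothetical stand-in for the validation score of one training job."""
    return 1.0 - (params['eta'] - 0.1) ** 2 - 0.001 * abs(params['max_depth'] - 6)

def random_search(ranges, max_jobs, seed=0):
    """Try max_jobs random settings and return the best one plus the history."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_jobs):
        params = {
            'eta': rng.uniform(*ranges['eta']),           # continuous range
            'max_depth': rng.randint(*ranges['max_depth']),  # integer range
        }
        history.append((toy_objective(params), params))
    # Compare on the score only, so ties never compare the dicts
    best_score, best_params = max(history, key=lambda item: item[0])
    return best_score, best_params, history

best_score, best_params, history = random_search(
    {'eta': (0.01, 0.3), 'max_depth': (1, 20)}, max_jobs=20
)
print(best_score, best_params)
```

The `max_jobs` argument plays the role of the tuning budget: the loop stops after that many evaluations, whatever the scores look like.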

In this example, we use the SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs that the tuning job will launch by instantiating an estimator, which includes:

* The container image for the algorithm (XGBoost)
* Configuration for the output of the training jobs
* The values of static algorithm hyperparameters (those not specified are given default values)
* The type and number of instances to use for the training jobs

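[https://github.com/aws/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/xgboost_direct_marketing/hpo_xgboost_direct_marketing_sagemaker_python_sdk.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/xgboost_direct_marketing/hpo_xgboost_direct_marketing_sagemaker_python_sdk.ipynb)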

### initialize hyperparameters
```python
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"100"}

output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'titanic-xgb-built-in-algo')

xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, 'latest') 

### construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters, 
                                          role=role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          sagemaker_session=session, 
                                          output_path=output_path)
objective_metric_name = 'validation:auc'
```

### Setting up Hyperparameter Ranges

### Adding Hyperparameter Tuning
```python
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {
    'max_depth': IntegerParameter(1, 20, scaling_type="Linear"),
    'eta': ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),
    'min_child_weight': IntegerParameter(1,10, scaling_type="Linear"),
    'gamma': ContinuousParameter(0.01, 0.5, scaling_type="Logarithmic"),
    'colsample_bytree': ContinuousParameter(0.8, 0.95, scaling_type="Logarithmic"),
    'subsample': ContinuousParameter(0.6, 0.9, scaling_type="Logarithmic")
}
```
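
To get a feel for what `scaling_type="Logarithmic"` changes compared with `"Linear"`, the sketch below samples the `eta` range both ways. It only illustrates the general idea of searching uniformly in log space; how SageMaker actually picks points depends on its search strategy, and the helper names are made up for this example.

```python
import math
import random

def sample_linear(low, high, rng):
    """Uniform over [low, high]."""
    return rng.uniform(low, high)

def sample_log(low, high, rng):
    """Uniform over [log(low), log(high)], mapped back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
linear = [sample_linear(0.01, 0.3, rng) for _ in range(10000)]
log = [sample_log(0.01, 0.3, rng) for _ in range(10000)]

# Share of samples that fall in the small-value region eta < 0.05
share_linear = sum(v < 0.05 for v in linear) / len(linear)
share_log = sum(v < 0.05 for v in log) / len(log)
print(f'linear: {share_linear:.2f}, log: {share_log:.2f}')
```

Log scaling spends far more of the budget on small values, which suits parameters like `eta` that span more than an order of magnitude. For narrow ranges such as `colsample_bytree` in 0.8 to 0.95, the two scalings behave almost the same.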

### Creating the Input that SageMaker can work with (from s3) and Launching

### Creating the inputs
```python
content_type = "csv"

train_input = TrainingInput(train_data_s3_path, content_type=content_type)
validation_input = TrainingInput(test_data_s3_path, content_type=content_type)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=10
)
```

```python
%%time
tuner.fit({'train': train_input, 'validation': validation_input}, include_cls_metadata=False)
```

### Analysing Results and best parameters

### Analysing results
```python
from pprint import pprint
sage_client = boto3.Session().client('sagemaker')
tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)
if tuning_job_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")
    
```

### Output from this step:  

Best model found so far:  
{'CreationTime': datetime.datetime(2021, 4, 10, 12, 55, tzinfo=tzlocal()),  
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc',  
                                                 'Value': 0.9596620202064514},  
 'ObjectiveStatus': 'Succeeded',  
 'TrainingEndTime': datetime.datetime(2021, 4, 10, 12, 58, 21, tzinfo=tzlocal()),  
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:475414269301:training-job/xgboost-210410-1254-006-a1aa1de8',  
 'TrainingJobName': '**xgboost**-210410-1254-006-a1aa1de8',  
 'TrainingJobStatus': 'Completed',  
 'TrainingStartTime': datetime.datetime(2021, 4, 10, 12, 57, 19, tzinfo=tzlocal()),  
 'TunedHyperParameters': **{'colsample_bytree': '0.8777596567529345',  
                          'eta': '0.08153822251994182',  
                          'gamma': '0.03797776943376543',  
                          'max_depth': '10',  
                          'min_child_weight': '13',  
                          'subsample': '0.8543920787438437'}**}  
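
The tuned values come back as strings, so they need converting before they can be reused, for example to train a local XGBoost model. A minimal sketch, assuming a response shaped like the output above; `tuned_params_to_numbers` is a hypothetical helper, and the dictionary is a trimmed copy of that output.

```python
def tuned_params_to_numbers(best_job):
    """Convert the string-valued TunedHyperParameters into int or float."""
    parsed = {}
    for name, value in best_job['TunedHyperParameters'].items():
        try:
            parsed[name] = int(value)
        except ValueError:
            # Not an integer literal, so treat it as a float
            parsed[name] = float(value)
    return parsed

# Trimmed copy of the BestTrainingJob output shown above
best_job = {
    'FinalHyperParameterTuningJobObjectiveMetric': {
        'MetricName': 'validation:auc',
        'Value': 0.9596620202064514,
    },
    'TunedHyperParameters': {
        'colsample_bytree': '0.8777596567529345',
        'eta': '0.08153822251994182',
        'gamma': '0.03797776943376543',
        'max_depth': '10',
        'min_child_weight': '13',
        'subsample': '0.8543920787438437',
    },
}

best_params = tuned_params_to_numbers(best_job)
print(best_params)
```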


### Creating a predictor (a deployed server with this model ready)

```python
%%time
from sagemaker.serializers import CSVSerializer
xgb_predictor = tuner.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m5.4xlarge',
    serializer = CSVSerializer())
```

### Predict
```python
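from sklearn.metrics import accuracy_score

# Score the held-out validation features so the predictions line up with y_test
predictions = xgb_predictor.predict(X_test.values).decode('utf-8')
predictions = np.fromstring(predictions, sep=',')
predictions = (predictions > 0.5).astype(int)
accuracy_score(y_test, predictions) #0.89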
```
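
The parsing and thresholding step can be tried without an endpoint. The sketch below runs on a made-up response payload; `parse_scores` and `to_labels` are hypothetical helpers, and accepting both commas and newlines as separators is an assumption, since the separator may depend on the container version.

```python
import numpy as np

def parse_scores(payload):
    """Parse a CSV-style response (bytes or str) into a 1-D float array."""
    text = payload.decode('utf-8') if isinstance(payload, bytes) else payload
    # Treat newlines like commas so both layouts are handled
    tokens = text.replace('\n', ',').split(',')
    return np.array([float(t) for t in tokens if t.strip()])

def to_labels(scores, threshold=0.5):
    """Turn probabilities into 0/1 labels; scores equal to the threshold map to 0."""
    return (scores > threshold).astype(int)

# Made-up payload standing in for xgb_predictor.predict(...)
fake_response = b'0.12,0.91,0.50\n0.73'
labels = to_labels(parse_scores(fake_response))
print(labels)
```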

### Deleting the endpoint (deployed server with the model) to stop incurring charges.

```python
sage_client.delete_endpoint(EndpointName=xgb_predictor.endpoint_name)
```

## Prediction results

In [None]:
my_sub = pd.read_csv('/kaggle/input/mysubsagemakertunning/tunningjob_submission.csv')

In [None]:
sample_submission['Survived'] = my_sub['Survived']
sample_submission.to_csv('tunningjob_submission.csv', index=False)