# NOTEBOOK NAME FROM CODA TASK

Copy paste from Coda to fill the information

## Objective(s)


## Metadata 

* Metadata parameters are available here: 
* Task type:
  * 
* Users:
  * 
* Watchers:
  * 
* Estimated Log points:
  * One being a simple task, 15 a very difficult one
  *  
* Task tag
  *  
* Toggl Tag
  * 
* Instance [AWS]
  *   
  
## Input Cloud Storage [AWS]

If the input is a link from the internet, save it to the cloud first.

### Tables [AWS]

1. Batch 1:
  * Select Provider: 
  * Select table(s): 
    * Select only tables created from the same notebook, else copy/paste selection to add new input tables
    * If table(s) does not exist, add them: Add New Table
    * Information:
      * Region: 
        * Name: 
        * Code: 
      * Database: 
      * Notebook construction file: 
    
## Destination Output/Delivery

* AWS
  1. Athena: 
      * Region: 
      * Database: 
      * Tables (Add name new table): 

## Things to know (Steps, Attention points or new flow of information)

### Sources of information  (meeting notes, Documentation, Query, URL)


# S3 Architecture

The S3 architecture for the algorithm is as follows:

```
├── ALGORITHM
│   ├── PYTHON_SCRIPTS
│   └── YYYYMMDD
│       ├── ALGO_NAME
│       │   ├── LOGS
│       │   └── MODEL
│       ├── DATA
│       │   ├── PREDICT
│       │   │   ├── RAW
│       │   │   └── TRANSFORM
│       │   └── TRAIN
│       │       ├── RAW
│       │       └── TRANSFORM
│       └── EVALUATION
```

where `YYYYMMDD` is the date on which the model was trained or retrained. This folder has three children: `ALGO_NAME`, `DATA` and `EVALUATION`. Rename the `ALGO_NAME` folder after the algorithm used (e.g. `XGBOOST`, `RNN`).

- `YYYYMMDD`
  - `DATA`
    - `TRAIN`: contains the data used to train the model
      - `RAW`: Training raw data before preprocessing
      - `TRANSFORM`: Training data after preprocessing
      - Sagemaker referenced functions: [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html?highlight=SKLearnProcessor#sagemaker.sklearn.processing.SKLearnProcessor) or [ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html?highlight=ScriptProcessor#sagemaker.processing.ScriptProcessor) for preprocessing and [SKLearn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html?highlight=SKLearn#sagemaker.sklearn.estimator.SKLearn) for training
    - `PREDICT`: contains the data to be predicted
      - `RAW`: Prediction raw data before preprocessing
      - `TRANSFORM`: Prediction data after preprocessing
      - Sagemaker referenced function: [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html?highlight=SKLearnProcessor#sagemaker.sklearn.processing.SKLearnProcessor) or [ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html?highlight=ScriptProcessor#sagemaker.processing.ScriptProcessor) for prediction
  - `ALGO_NAME` : Name of the algorithm used to train and predict the data
    - `LOGS`: Logs generated by Sagemaker during the training
    - `MODEL`: `tar` file with the model's object
  - `EVALUATION`: contains the model evaluation results
- `PYTHON_SCRIPTS`: Contains the preprocessing, training and evaluating scripts
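
The layout above can be turned into a small helper so that every cell builds its S3 URIs the same way. This is only a sketch: the function name `algorithm_prefixes`, the dictionary keys and the example date are hypothetical, and the bucket name is the one used later in this notebook.

```python
from datetime import date

BUCKET = "creditsafedata"  # bucket name used elsewhere in this notebook


def algorithm_prefixes(run_date: str, algo_name: str, bucket: str = BUCKET) -> dict:
    """Return the S3 URI of every folder in the layout for one training run."""
    root = f"s3://{bucket}/ALGORITHM"
    run = f"{root}/{run_date}"
    return {
        "scripts": f"{root}/PYTHON_SCRIPTS/",
        "logs": f"{run}/{algo_name}/LOGS/",
        "model": f"{run}/{algo_name}/MODEL/",
        "train_raw": f"{run}/DATA/TRAIN/RAW/",
        "train_transform": f"{run}/DATA/TRAIN/TRANSFORM/",
        "predict_raw": f"{run}/DATA/PREDICT/RAW/",
        "predict_transform": f"{run}/DATA/PREDICT/TRANSFORM/",
        "evaluation": f"{run}/EVALUATION/",
    }


# Illustrative run date only; use the date of the actual training run.
prefixes = algorithm_prefixes(date(2020, 6, 1).strftime("%Y%m%d"), "XGBOOST")
print(prefixes["train_raw"])
print(prefixes["model"])
```

Keeping the folder names in one place avoids typos when the same prefix is needed for both a processing job and a training job.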

# Create preprocessing class 


In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
#from sagemaker.sklearn.processing import SKLearnProcessor
# ScriptProcessor is defined in sagemaker.processing
from sagemaker.processing import ScriptProcessor

region = boto3.session.Session().region_name

role = get_execution_role()

### use this code when sagemaker managed image
#sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
#                                     role=role,
#                                     instance_type='ml.m5.xlarge',
#                                     instance_count=1)

In [None]:
processing_repository_uri = '869881768412.dkr.ecr.eu-west-2.amazonaws.com/sagemaker-xgboost-container:latest'
sklearn_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

In [None]:
#!sudo -u ec2-user -i conda install -c conda-forge jupytext --y 

In [None]:
#!sudo -u ec2-user -i jupyter nbextension install --py jupytext

The custom role is available here:

In [None]:
import re
# Keep the role name, i.e. everything after the last "/" of the role ARN
matches = re.search(r'([^/]+$)', role).group()
url = 'https://console.aws.amazon.com/iam/home?region={0}#/roles/{1}'.format(region, matches)
print(url)

We load our custom library to load the data from Athena

In [None]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_s3 import service_s3

path_cred = "s3://creditsafedata/CREDENTIALS/thomas_credentials.csv"
bucket = 'creditsafedata'

In [None]:
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = region)
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = bucket, verbose = False)

In [None]:
#pandas_setting = True
#if pandas_setting:
#    pd.set_option('display.max_columns', None)
#    pd.set_option('display.max_colwidth', None)

# Steps

This notebook performs the following analysis steps:


- Write preprocessing pipeline
- Load the table to S3
- Train the model on XGBOOST
- Evaluate the model

## Write preprocessing pipeline

Write the preprocessing pipeline with scikit-learn. Refer to the [official AWS tutorial](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb) to learn how to build the Python script that SageMaker calls to run the job.

Make sure the script is versioned in GitHub.

### Preprocessing steps:

1. Create a `make_column_transformer` 
    - `make_pipeline` 
        - `RobustScaler`
    - `make_pipeline` 
        - `OneHotEncoder` 
        
The script is available in Github, [02_Data_analysis/01_model_training](https://github.com/Optimum-Finance/creditsafePrediction/blob/master/02_Data_analysis/01_model_training/preprocessing.py) and called by the `run` function from [ALGORITHM/PYTHON_SCRIPTS](https://s3.console.aws.amazon.com/s3/buckets/creditsafedata/ALGORITHM/PYTHON_SCRIPTS/?region=eu-west-2&tab=overview)
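
The preprocessing steps listed above name the `make_column_transformer` and `make_pipeline` helpers, while the script itself uses the equivalent `ColumnTransformer` and `Pipeline` classes. The following is a minimal sketch of the helper form on a toy frame. The column names and values are hypothetical and only illustrate the shape of the output; `sparse_threshold=0` is set here so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data: one continuous and one categorical feature.
toy = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0, 100.0],
    "sector": ["retail", "tech", "retail", "tech"],
})
numeric_cols = ["amount"]
categorical_cols = ["sector"]

preprocessor = make_column_transformer(
    # Scale continuous features with the median and interquartile range.
    (make_pipeline(RobustScaler()), numeric_cols),
    # One-hot encode categories; unseen categories become all-zero rows.
    (make_pipeline(OneHotEncoder(handle_unknown="ignore")), categorical_cols),
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(toy)
print(features.shape)  # one scaled column plus two one-hot columns
```

`RobustScaler` is less sensitive to outliers such as the `100.0` value than a mean and variance scaler would be, which is presumably why it was chosen for the continuous features.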

In [None]:
%%writefile preprocessing.py

#import argparse
import os
import warnings

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import RobustScaler, OneHotEncoder, LabelEncoder

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

if __name__=='__main__':
    
    # Prepare paths in Sagemaker instance
    ## Input 
    input_data_path = os.path.join('/opt/ml/processing/input', 'X_TRAIN_INVOICE_FINANCE.csv')
    ## Output preprocessing job 
    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_features.csv')
    
    #train_labels_output_path = os.path.join('/opt/ml/processing/train', 'train_labels.csv')
    
    # Open data
    df = pd.read_csv(input_data_path)
    
    X_train = df.drop(columns = ['status'])
    #y_train = df[['status']]
    
    # Build preprocessing jobs
    feat_obj = (X_train
            .dtypes
            .loc[lambda x : 
                 (x =='object') 
                &(x.index != 'row_id')
                ]
            .index
            .to_list()
           )
    feat_cont = (X_train
            .dtypes
            .loc[lambda x : (x =='float64')  & (x.index != 'row_id')]
            .index
            .to_list()
           )
    
    numeric_transformer = Pipeline(steps=[
        ('scaler', RobustScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, feat_cont),
            ('cat', categorical_transformer, feat_obj)]
    )
    
    pip_preprocessor = Pipeline(steps=[('preprocessor', preprocessor)])
    
    ### Encode the label as integers (LabelEncoder expects a 1-D array)
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(df['status'])
    
    # Call the job
    train_features = pip_preprocessor.fit_transform(X_train)
    
    # ColumnTransformer may return a sparse matrix when one-hot columns
    # dominate; np.column_stack needs a dense array
    if hasattr(train_features, 'toarray'):
        train_features = train_features.toarray()
    
    # Save the training data to Sagemaker
    ### XGBoost requires the label as the first column
    (pd.DataFrame(
        np.column_stack((y_train,train_features))
    ).to_csv(train_features_output_path,
             header=False,
             index=False)
    )
    
    ### Temporary solution: the training set is reused as the test set
    test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_label_output_path = os.path.join('/opt/ml/processing/test', 'test_labels.csv')
    
    # Save features and labels separately for the evaluation job
    pd.DataFrame(train_features).to_csv(test_features_output_path,
                                        header=False,
                                        index=False)
    
    pd.DataFrame(y_train).to_csv(test_label_output_path, header=False, index=False)
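
The script writes the training file with the encoded label in the first column, no header and no index, which is the layout the script's own comment says XGBoost requires. The sketch below reproduces that layout with the standard library only, so the expected file content is easy to inspect. The helper name `write_label_first` and the sample values are hypothetical.

```python
import csv
import io


def write_label_first(labels, features, handle):
    """Write rows as label, f1, f2, ... with no header and no index."""
    if len(labels) != len(features):
        raise ValueError("labels and features must have the same number of rows")
    writer = csv.writer(handle, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label, *row])


# Two illustrative rows: an integer label followed by two features.
buffer = io.StringIO()
write_label_first([0, 1], [[0.5, 1.0], [-0.25, 0.0]], buffer)
print(buffer.getvalue())
```

The test files, by contrast, keep features and labels in separate CSVs because the evaluation script loads them independently.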

In [None]:
s3.upload_file(
    file_to_upload ='preprocessing.py',
    destination_in_s3 = 'ALGORITHM/PYTHON_SCRIPTS')

## Test the pipeline

In [None]:
import pandas as pd
from sagemaker.processing import ProcessingInput, ProcessingOutput

In [None]:
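from datetime import date

# Assumption: `today` is the YYYYMMDD run folder and the raw file name matches
# the one read by preprocessing.py; adjust both to the run being processed
today = date.today().strftime('%Y%m%d')
filename_x_train = 'X_TRAIN_INVOICE_FINANCE.csv'

input_data = 's3://creditsafedata/ALGORITHM/{0}/DATA/TRAIN/RAW/{1}'.format(today, filename_x_train)
destination_data_train = 's3://creditsafedata/ALGORITHM/{0}/DATA/TRAIN/TRANSFORM/'.format(today)
destination_data_test = 's3://creditsafedata/ALGORITHM/{0}/DATA/EVALUATION/TRANSFORM/'.format(today)
process_py = "s3://creditsafedata/ALGORITHM/PYTHON_SCRIPTS/preprocessing.py"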

In [None]:
sklearn_processor.run(code=process_py,
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train',
                                               destination = destination_data_train),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test',
                                               destination =destination_data_test)
                              ]
                     )
preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
output_config

In [None]:
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
preprocessed_training_data
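
The loop above reads the S3 location of one named output from the job description. A small helper makes the lookup reusable for both the preprocessing and the evaluation jobs and fails loudly when the name is missing. This is a sketch: `find_output_uri` is a hypothetical name, and the sample description is hand-written and trimmed to the keys the lookup reads, with `YYYYMMDD` left as a placeholder.

```python
def find_output_uri(job_description: dict, output_name: str) -> str:
    """Return the S3 URI of the processing output called `output_name`."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    for output in outputs:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"no processing output named {output_name!r}")


# Hand-written stand-in for the dictionary returned by `describe()`.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {
                    "S3Uri": "s3://creditsafedata/ALGORITHM/YYYYMMDD/DATA/TRAIN/TRANSFORM/"
                },
            },
            {
                "OutputName": "test_data",
                "S3Output": {
                    "S3Uri": "s3://creditsafedata/ALGORITHM/YYYYMMDD/DATA/EVALUATION/TRANSFORM/"
                },
            },
        ]
    }
}

train_uri = find_output_uri(sample_description, "train_data")
print(train_uri)
```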

## Evaluate the model

In [None]:
%%writefile evaluation.py

import os
#from sklearn.linear_model import LogisticRegression
#os.system('pip install joblib')
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import xgboost
#from sklearn.externals import joblib
import json, tarfile, joblib, pickle

def convert(o):
    # json.dumps cannot serialise numpy integers; cast them to int
    if isinstance(o, np.int64):
        return int(o)
    raise TypeError

if __name__=="__main__":
    
    model_path = os.path.join('/opt/ml/processing/model', 'model.tar.gz')
    
    print('Extracting model from path: {}'.format(model_path))
    
    with tarfile.open(model_path) as tar:
        tar.extractall(path='.')
    # Load the pickled booster and close the file handle afterwards
    with open("xgboost-model", "rb") as model_file:
        model = pickle.load(model_file)
    
    test_features_data = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_data = os.path.join('/opt/ml/processing/test', 'test_labels.csv')
                                    
    X_test = xgboost.DMatrix('{}?format=csv'.format(test_features_data))
    y_test = pd.read_csv(test_labels_data, header=None)
    predictions = np.round(model.predict(X_test),0)
        
    conf_mat =confusion_matrix(y_test, predictions)
    
    dic_metrics = {
        "score": [
            {
                "confusion_matrix": dict(enumerate(conf_mat.flatten(), 1)),
                "classification_report": classification_report(
                    y_test,
                    predictions,
                    target_names=["Not User", "User"],
                    output_dict=True,
                ),
            }
        ],
    }
    evaluation_output_path = os.path.join('/opt/ml/processing/evaluation', 'evaluation.json')
    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(dic_metrics,default=convert))
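
`evaluation.py` stores the confusion matrix as `dict(enumerate(conf_mat.flatten(), 1))`, so the report contains keys 1 to 4 rather than a 2x2 table. scikit-learn puts the true labels on the rows and the predicted labels on the columns, so for binary labels 0 and 1 the flattened order is true negatives, false positives, false negatives, true positives. The sketch below rebuilds that mapping in plain Python; the function name and the sample labels are hypothetical.

```python
import json


def binary_confusion_counts(y_true, y_pred):
    """Count outcomes in the row-major order of a 2x2 confusion matrix."""
    matrix = [[0, 0], [0, 0]]  # rows: true label, columns: predicted label
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1
    flat = [cell for row in matrix for cell in row]
    return dict(enumerate(flat, 1))


# Meaning of each key in the flattened dictionary.
KEY_NAMES = {1: "true_negative", 2: "false_positive",
             3: "false_negative", 4: "true_positive"}

counts = binary_confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
named = {KEY_NAMES[key]: value for key, value in counts.items()}
print(named)

# json.dumps turns the integer keys into strings, so the report is read
# back with keys "1" to "4".
round_trip = json.loads(json.dumps(counts))
print(sorted(round_trip))
```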

In [None]:
s3.upload_file(
    file_to_upload ='evaluation.py',
    destination_in_s3 = 'ALGORITHM/PYTHON_SCRIPTS')

In [None]:
evaluate_py = "s3://creditsafedata/ALGORITHM/PYTHON_SCRIPTS/evaluation.py"
destination_evaluation = 's3://creditsafedata/ALGORITHM/{}/EVALUATION/'.format(today)
preprocessed_testing_data = 's3://creditsafedata/ALGORITHM/{}/DATA/EVALUATION/TRANSFORM/'.format(today)

In [None]:
import json
from sagemaker.s3 import S3Downloader

sklearn_processor.run(code=evaluate_py,
                      inputs=[ProcessingInput(
                                  source=model_data_s3_uri,
                                  destination='/opt/ml/processing/model'),
                              ProcessingInput(
                                  source=preprocessed_testing_data,
                                  destination='/opt/ml/processing/test')],
                      outputs=[ProcessingOutput(output_name='evaluation',
                                  source='/opt/ml/processing/evaluation',
                                               destination = destination_evaluation)]
                     )                    
evaluation_job_description = sklearn_processor.jobs[-1].describe()

Now retrieve the file `evaluation.json` from Amazon S3, which contains the evaluation report.

In [None]:
evaluation_output_config = evaluation_job_description['ProcessingOutputConfig']
for output in evaluation_output_config['Outputs']:
    if output['OutputName'] == 'evaluation':
        evaluation_s3_uri = '{}evaluation.json'.format(output['S3Output']['S3Uri']) 
        break
evaluation_s3_uri

In [None]:
evaluation_output = S3Downloader.read_file(evaluation_s3_uri)
evaluation_output_dict = json.loads(evaluation_output)
print(json.dumps(evaluation_output_dict, sort_keys=True, indent=4))