## Autopilot example


- Kaggle housing price data: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
- Customer churn example: https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot_customer_churn.ipynb (the notebook this one is based on)
- CA housing price example: https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot_california_housing.ipynb



### Notes

- The [tasks](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-datasets-problem-types.html) that Autopilot supports are regression and classification on tabular data.
- Time series data is supported through tsfresh. See the [blog post](https://aws.amazon.com/ko/blogs/machine-learning/amazon-sagemaker-autopilot-now-supports-time-series-data/) for an example.


In [None]:
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sm_client = boto3.Session().client(service_name="sagemaker", region_name=region)
bucket = sagemaker_session.default_bucket()
print(bucket)


In [None]:
import pandas as pd
import numpy as np

In [None]:
train_s3_path = f"s3://{bucket}/lowcode-sm/hp/train.csv"
train_df = pd.read_csv(train_s3_path)
train_df

In [None]:
# test_s3_path = f"s3://{bucket}/lowcode-sm/hp/test.csv"
# test_df = pd.read_csv(test_s3_path)
# test_df

In [None]:
pred_target = "SalePrice"

output_path = f"s3://{bucket}/lowcode-sm/logs/hp-output/"


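### Things to consider

- When passing the job parameters, the example passes only the train directory (which contains just train_data.csv). Besides passing a prefix, you can also point at a specific file.
- `S3DataSource` reference, which documents the manifest file format reproduced below: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html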

```
[
  {"prefix": "s3://customer_bucket/some/prefix/"},
  "relative/path/to/custdata-1",
  "relative/path/custdata-2",
  ...
  "relative/path/custdata-N"
]
```
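As a minimal sketch of the manifest approach, the helpers below build a manifest body in the format shown above and an `InputDataConfig` entry whose `S3DataType` is `ManifestFile`. The helper names `build_manifest` and `manifest_input_config`, the bucket name, and the object keys are hypothetical and not part of the original notebook. Uploading the manifest to S3 is left out.

```python
import json


def build_manifest(prefix, relative_paths):
    """Return a manifest JSON string: one prefix entry followed by relative keys."""
    if not prefix.endswith("/"):
        prefix += "/"
    return json.dumps([{"prefix": prefix}, *relative_paths])


def manifest_input_config(manifest_s3_uri, target):
    """Build an InputDataConfig entry that points at an already uploaded manifest."""
    return [
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "ManifestFile",
                    "S3Uri": manifest_s3_uri,
                }
            },
            "TargetAttributeName": target,
        }
    ]


# Hypothetical bucket and keys, for illustration only.
manifest_body = build_manifest("s3://my-bucket/lowcode-sm/hp", ["train.csv"])
config = manifest_input_config("s3://my-bucket/lowcode-sm/hp/manifest.json", "SalePrice")
print(manifest_body)
```

For a single training file, the `S3Prefix` configuration used in this notebook is simpler. A manifest is mainly useful when the job should read a chosen subset of files under one prefix.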


In [None]:
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_path,
            }
        },
        "TargetAttributeName": pred_target,
    }
]

output_data_config = {"S3OutputPath": output_path}


In [None]:
from time import gmtime, strftime, sleep

timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())

auto_ml_job_name = "hp-autopilot-" + timestamp_suffix
print("AutoMLJobName: " + auto_ml_job_name)

sm_client.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig={"CompletionCriteria": {"MaxCandidates": 5}},
    RoleArn=role,
)


To change the objective metric, you must also specify the problem type.
  - See the AutoML job API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job.html#

```
sm_client.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    ProblemType="Regression",  # required when changing the metric; otherwise inferred automatically
    AutoMLJobObjective={"MetricName": "RMSE"},  # the default for regression is MSE
    AutoMLJobConfig={"CompletionCriteria": {"MaxCandidates": 5}},
    RoleArn=role,
)
```

Note that the Kaggle competition for this dataset is not scored on plain RMSE. It uses `RMSE(log(pred), log(real))`, which differs slightly.
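To make the difference concrete, here is a small sketch of the Kaggle-style score using only the standard library. The function name `rmse_of_logs` and the sample prices are made up for illustration. It assumes every value is strictly positive.

```python
import math


def rmse_of_logs(predicted, actual):
    """RMSE between log(predicted) and log(actual); all values must be positive."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log(p) - math.log(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices, for illustration only.
score = rmse_of_logs([200000.0, 150000.0], [210000.0, 140000.0])
print(score)
```

Because the logarithm is taken first, the score reflects relative error, so a miss on a cheap house counts about as much as a proportionally equal miss on an expensive one. The Autopilot RMSE objective is computed on the raw `SalePrice` values, so the two numbers are not directly comparable.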

In [None]:
print("JobStatus - Secondary Status")
print("------------------------------")


describe_response = sm_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    sleep(30)  # wait before polling again, so the loop exits as soon as the job is done
    describe_response = sm_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
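The loop above keeps polling until the job reaches a terminal state. As a sketch of a bounded variant, the helper below gives up after a timeout. The name `wait_for_job` and its parameters are hypothetical. It takes the describe call as a function, so it can be tried without an AWS connection.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until AutoMLJobStatus is terminal or the timeout elapses."""
    waited = 0
    while True:
        response = describe()
        status = response["AutoMLJobStatus"]
        print(status + " - " + response.get("AutoMLJobSecondaryStatus", ""))
        if status in TERMINAL_STATES:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `wait_for_job(lambda: sm_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name))`, with the timeout set to match the expected job duration.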

### Download the notebooks

You can download the Jupyter notebooks that Autopilot generated for the job and review which steps it ran.

In [None]:
# print(describe_response)
print(describe_response["AutoMLJobArtifacts"]["CandidateDefinitionNotebookLocation"])
print(describe_response["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"])

candidate_nbk = describe_response["AutoMLJobArtifacts"]["CandidateDefinitionNotebookLocation"]
data_explore_nbk = describe_response["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"]

In [None]:
def split_s3_path(s3_path):
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

s3_bucket, candidate_nbk_key = split_s3_path(candidate_nbk)
_, data_explore_nbk_key = split_s3_path(data_explore_nbk)

print(s3_bucket, candidate_nbk_key, data_explore_nbk_key)

sagemaker_session.download_data(path="./autopilot-sample", bucket=s3_bucket, key_prefix=candidate_nbk_key)

sagemaker_session.download_data(path="./autopilot-sample", bucket=s3_bucket, key_prefix=data_explore_nbk_key)
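The string-based `split_s3_path` above works for well-formed URIs. As an alternative sketch, the standard library's `urllib.parse.urlparse` can do the same split and reject inputs that are not S3 URIs. The name `parse_s3_uri` and the sample URI are hypothetical.

```python
from urllib.parse import urlparse


def parse_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key), validating the scheme."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, for illustration only.
bucket_name, key = parse_s3_uri("s3://my-bucket/some/prefix/notebook.ipynb")
print(bucket_name, key)
```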

In [None]:
best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]
print(best_candidate)
print("\n")
print("CandidateName: " + best_candidate_name)
print(
    "FinalAutoMLJobObjectiveMetricName: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "FinalAutoMLJobObjectiveMetricValue: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)