# <font color='#8735fb'> **RAPIDS Single-GPU Workflow - XGBoost @ Airline Delays** </font> 

<img src='https://raw.githubusercontent.com/rapidsai/cloud-ml-examples/main/aws/img/airline_dataset.png' width='1250px'>

> **1. Mount S3 Dataset**

> **2. Data Ingestion**

> **3. ETL**
-> handle missing -> split

> **4. Train Classifier**
-> XGBoost

> **5. Inference**
-> FIL

In [1]:
import time
import cudf 
import xgboost
import joblib
# older cuML releases exposed this under cuml.preprocessing.model_selection
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
import glob

### <font color='#8735fb'> **Mount S3 Dataset** </font>

In [2]:
!wget https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/2_year_2020.tar.gz
!tar xvzf 2_year_2020.tar.gz

--2020-10-21 14:49:34--  https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/2_year_2020.tar.gz
Resolving sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.253.89
Connecting to sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com)|52.218.253.89|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98072481 (94M) [application/x-gzip]
Saving to: ‘2_year_2020.tar.gz’


2020-10-21 14:49:37 (29.4 MB/s) - ‘2_year_2020.tar.gz’ saved [98072481/98072481]

2_year_2020/
2_year_2020/part.13.parquet
2_year_2020/part.4.parquet
2_year_2020/part.17.parquet
2_year_2020/part.10.parquet
2_year_2020/part.7.parquet
2_year_2020/part.5.parquet
2_year_2020/part.11.parquet
2_year_2020/part.0.parquet
2_year_2020/part.6.parquet
2_year_2020/part.3.parquet
2_year_2020/part.8.parquet
2_year_2020/part.9.parquet
2_year_2020/part.16.parquet
2_ye

### <font color='#8735fb'> **Ingest Parquet Data** </font>

At the heart of our analysis is the domestic carrier on-time reporting data that the U.S. Bureau of Transportation Statistics has kept for decades.

This rich source of data allows us to scale: in this notebook (ML_100.ipynb) we use only 1 GPU and 1 year of data, while in the next notebook (ML200.ipynb) we'll use 10 years of data and multiple GPUs.

> **Dataset**: [US.DoT - Reporting Carrier On-Time Performance, 1987-Present](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* locations and distance  ( `Origin`, `Dest`, `Distance` )
* airline / carrier ( `Reporting_Airline` )
* scheduled departure and arrival times ( `CRSDepTime` and `CRSArrTime` )
* actual departure and arrival times ( `DepTime` and `ArrTime` )
* difference between scheduled & actual times ( `ArrDelay` and `DepDelay` )
* binary encoded version of late, aka our target variable ( `ArrDel15` )
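The target is a thresholded view of the arrival delay: in the BTS field definitions, the `...Del15` indicators are 1 when the delay is 15 minutes or more. A minimal pure-Python sketch of that encoding follows. The helper name `delayed_15` and the sample values are illustrative assumptions, not part of the dataset tooling.

```python
# Illustrative only: `delayed_15` is a hypothetical helper, not a cuDF or BTS API.
DELAY_THRESHOLD_MINUTES = 15.0


def delayed_15(delay_minutes):
    """Return 1 if the delay is at least 15 minutes, else 0."""
    return 1 if delay_minutes >= DELAY_THRESHOLD_MINUTES else 0


# Sample delays in minutes; negative values mean the flight was early.
sample_delays = [31.0, -3.0, 40.0, 16.0, 3.0]
sample_labels = [delayed_15(d) for d in sample_delays]
print(sample_labels)
```

Because the label is derived directly from the delay, the raw `ArrDelay` column is not among the features loaded below; including it would leak the answer to the classifier.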

In [3]:
airline_feature_columns = [ 'Year', 'Quarter', 'Month', 'DayOfWeek', 
                            'Flight_Number_Reporting_Airline', 'DOT_ID_Reporting_Airline',
                            'OriginCityMarketID', 'DestCityMarketID',
                            'DepTime', 'DepDelay', 'DepDel15', 'ArrDel15',
                            'AirTime', 'Distance' ]

airline_label_column = 'ArrDel15'

In [4]:
file_list = glob.glob( '2_year_2020/*.parquet' )
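`glob.glob` returns paths in arbitrary, filesystem-dependent order, which in turn can affect the row order of the loaded data. If a reproducible order is wanted, one option is to sort by the numeric part index. The sketch below assumes the `part.<N>.parquet` naming seen in the archive listing; `part_index` is a hypothetical helper, and the hard-coded list stands in for the real `glob` result.

```python
import re


def part_index(path):
    """Extract N from a path ending in 'part.N.parquet' (hypothetical helper)."""
    match = re.search(r'part\.(\d+)\.parquet$', path)
    if match is None:
        raise ValueError(f'unexpected file name: {path}')
    return int(match.group(1))


# Paths as they might come back from glob.glob, in arbitrary order.
unsorted_files = [
    '2_year_2020/part.13.parquet',
    '2_year_2020/part.4.parquet',
    '2_year_2020/part.10.parquet',
    '2_year_2020/part.0.parquet',
]

# A plain sorted() would put part.10 before part.4, so sort on the integer index.
sorted_files = sorted(unsorted_files, key=part_index)
print(sorted_files)
```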

In [5]:
%%time
data = cudf.read_parquet( file_list, 
                          columns = airline_feature_columns)

CPU times: user 1.22 s, sys: 638 ms, total: 1.86 s
Wall time: 2.11 s


In [6]:
data

| | Year | Quarter | Month | DayOfWeek | Flight_Number_Reporting_Airline | DOT_ID_Reporting_Airline | OriginCityMarketID | DestCityMarketID | DepTime | DepDelay | DepDel15 | ArrDel15 | AirTime | Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019.0 | 3.0 | 8.0 | 4.0 | 31.0 | 19790.0 | 30397.0 | 30194.0 | 2025.0 | 31.0 | 1 | 1 | 102.0 | 731.0 |
| 1 | 2019.0 | 3.0 | 8.0 | 4.0 | 32.0 | 19790.0 | 30194.0 | 30397.0 | 1706.0 | -3.0 | 0 | 0 | 102.0 | 731.0 |
| 2 | 2019.0 | 3.0 | 8.0 | 4.0 | 54.0 | 19790.0 | 31453.0 | 30397.0 | 1829.0 | 40.0 | 1 | 1 | 102.0 | 689.0 |
| 3 | 2019.0 | 3.0 | 8.0 | 4.0 | 68.0 | 19790.0 | 34057.0 | 34614.0 | 1306.0 | -4.0 | 0 | 0 | 94.0 | 630.0 |
| 4 | 2019.0 | 3.0 | 8.0 | 4.0 | 69.0 | 19790.0 | 34614.0 | 34057.0 | 820.0 | -9.0 | 0 | 0 | 84.0 | 630.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9969606 | 2019.0 | 3.0 | 7.0 | 1.0 | 2980.0 | 20368.0 | 30647.0 | 33667.0 | 1409.0 | -20.0 | 0 | 0 | 65.0 | 435.0 |
| 9969607 | 2019.0 | 3.0 | 7.0 | 6.0 | 2198.0 | 20368.0 | 31066.0 | 31136.0 | 1525.0 | -5.0 | 0 | 0 | 86.0 | 646.0 |
| 9969608 | 2019.0 | 3.0 | 7.0 | 1.0 | 2319.0 | 20368.0 | 33342.0 | 34761.0 | 1356.0 | 3.0 | 0 | 0 | 144.0 | 1045.0 |
| 9969609 | 2019.0 | 3.0 | 7.0 | 6.0 | 2995.0 | 20368.0 | 30647.0 | 34685.0 | 2015.0 | -22.0 | 0 | 0 | 98.0 | 641.0 |


### <font color='#8735fb'> **Handle Missing** </font>

In [7]:
%%time
data = data.dropna()

CPU times: user 3.21 ms, sys: 15.5 ms, total: 18.7 ms
Wall time: 19.2 ms
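With default arguments, `dropna()` removes every row that contains at least one missing value. The pure-Python sketch below mimics that row-wise rule on a list of dictionaries. It only illustrates the semantics: `drop_incomplete_rows` is a hypothetical name, not a cuDF function, and the sample rows are made up.

```python
def drop_incomplete_rows(rows):
    """Keep only rows in which no field is missing (None)."""
    return [row for row in rows
            if all(value is not None for value in row.values())]


# Made-up sample rows; None marks a missing (null) entry.
rows = [
    {'DepDelay': 31.0, 'AirTime': 102.0, 'ArrDel15': 1},
    {'DepDelay': None, 'AirTime': 94.0, 'ArrDel15': 0},
    {'DepDelay': -9.0, 'AirTime': None, 'ArrDel15': 0},
    {'DepDelay': -3.0, 'AirTime': 102.0, 'ArrDel15': 0},
]

clean_rows = drop_incomplete_rows(rows)
print(len(rows), '->', len(clean_rows))
```

Dropping rows is the simplest policy; it trades some data for a dataset with no missing entries in any column.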


### <font color='#8735fb'> **Split** </font>

In [8]:
label_column = airline_label_column

train, test = train_test_split( data, random_state = 0 ) 

# build X [ features ], y [ labels ] for the train and test subsets
y_train = train[label_column]
X_train = train.drop(label_column, axis = 1)
y_test = test[label_column]
X_test = test.drop(label_column, axis = 1)
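A seeded shuffle split assigns every row to exactly one subset and reproduces the same partition on each run with the same seed. The sketch below shows that idea in plain Python. It is not cuML's implementation: `shuffle_split` and `test_fraction` are hypothetical names, and the 0.25 fraction is an arbitrary choice for illustration.

```python
import random


def shuffle_split(rows, test_fraction, seed):
    """Shuffle row positions with a seeded RNG, then cut into train/test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(100))  # stand-in for 100 flight records
train_rows, test_rows = shuffle_split(rows, test_fraction=0.25, seed=0)

# Same seed, same partition.
again_train, again_test = shuffle_split(rows, test_fraction=0.25, seed=0)
print(len(train_rows), len(test_rows), train_rows == again_train)
```

Fixing the seed (as `random_state = 0` does above) makes accuracy numbers comparable across runs, since the model is always evaluated on the same held-out rows.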

In [9]:
X_train

| | Year | Quarter | Month | DayOfWeek | Flight_Number_Reporting_Airline | DOT_ID_Reporting_Airline | OriginCityMarketID | DestCityMarketID | DepTime | DepDelay | DepDel15 | AirTime | Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6438450 | 2019.0 | 4.0 | 11.0 | 7.0 | 4023.0 | 20366.0 | 31453.0 | 30279.0 | 1832.0 | -3.0 | 0 | 87.0 | 517.0 |
| 3848903 | 2020.0 | 1.0 | 1.0 | 3.0 | 590.0 | 19393.0 | 30423.0 | 33495.0 | 1526.0 | 16.0 | 1 | 56.0 | 444.0 |
| 5541659 | 2019.0 | 1.0 | 3.0 | 3.0 | 5647.0 | 20397.0 | 30255.0 | 31057.0 | 2100.0 | 108.0 | 1 | 59.0 | 333.0 |
| 9449330 | 2019.0 | 3.0 | 7.0 | 4.0 | 710.0 | 19977.0 | 30325.0 | 33570.0 | 1550.0 | -5.0 | 0 | 116.0 | 853.0 |
| 280702 | 2019.0 | 3.0 | 8.0 | 2.0 | 2092.0 | 19977.0 | 30852.0 | 31453.0 | 613.0 | -2.0 | 0 | 156.0 | 1235.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1958842 | 2019.0 | 2.0 | 5.0 | 5.0 | 333.0 | 19805.0 | 30194.0 | 30397.0 | 1420.0 | -2.0 | 0 | 89.0 | 731.0 |
| 2897995 | 2019.0 | 2.0 | 6.0 | 3.0 | 5699.0 | 20304.0 | 30325.0 | 30285.0 | 1156.0 | 1.0 | 0 | 42.0 | 250.0 |
| 6749668 | 2019.0 | 4.0 | 11.0 | 6.0 | 1918.0 | 19805.0 | 31057.0 | 33667.0 | 943.0 | 23.0 | 1 | 44.0 | 290.0 |
| 5792341 | 2019.0 | 2.0 | 4.0 | 1.0 | 1387.0 | 19393.0 | 32337.0 | 31454.0 | 1843.0 | -7.0 | 0 | 105.0 | 829.0 |


### <font color='#8735fb'> **Train/Fit** </font>

In [10]:
model_params = {            
    'max_depth' : 10,
    'num_boost_round': 300,
    'learning_rate': .25,
    'gamma': 0,
    'lambda': 1,
    'random_state' : 0,
    'verbosity' : 0,
    'seed': 0,   
    'objective' : 'binary:logistic',
    'tree_method': 'gpu_hist'
} 
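Note that `num_boost_round` is an argument of `xgboost.train` rather than a booster parameter, so the dictionary above carries it only as a convenient place to store the value. One way to keep the two apart, sketched here in plain Python with the hypothetical names `booster_params` and `num_rounds`, is to pop it out of a copy before training.

```python
model_params = {
    'max_depth': 10,
    'num_boost_round': 300,
    'learning_rate': 0.25,
    'gamma': 0,
    'lambda': 1,
    'random_state': 0,
    'verbosity': 0,
    'seed': 0,
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist',
}

# Copy first so the original dictionary stays intact for logging or reuse.
booster_params = dict(model_params)
num_rounds = booster_params.pop('num_boost_round')

print(num_rounds)
print(sorted(booster_params))
# The two pieces would then be passed separately, e.g.
# xgboost.train(booster_params, dtrain, num_boost_round=num_rounds)
```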

In [11]:
%%time
dtrain = xgboost.DMatrix( X_train, y_train)
trained_model = xgboost.train( model_params, dtrain, 
                                num_boost_round = model_params['num_boost_round'] )

CPU times: user 15.4 s, sys: 368 ms, total: 15.7 s
Wall time: 15.8 s


### <font color='#8735fb'> **Predict & Score** </font>

### <font color='#8735fb'> **XGBoost Native Predict & Score** </font>

In [12]:
threshold = 0.5
dtest = xgboost.DMatrix( X_test )

In [13]:
%%time
predictions = trained_model.predict( dtest)
predictions = (predictions > threshold ) * 1.0
score = accuracy_score ( y_test.astype('float32'),
                         predictions.astype('float32') )

print(f'score = {score}')

score = 0.9511467814445496
CPU times: user 85 ms, sys: 26.1 ms, total: 111 ms
Wall time: 113 ms
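With the `binary:logistic` objective the model outputs probabilities, so scoring involves two steps: thresholding each probability into a 0/1 label, then computing the fraction of labels that match the ground truth. The plain-Python sketch below walks through both steps on made-up values; `to_labels` and `accuracy` are hypothetical helpers, not the cuML functions used above.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1.0 if it is strictly above the threshold."""
    return [1.0 if p > threshold else 0.0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('label lists must have the same length')
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up probabilities and ground-truth labels for five flights.
probabilities = [0.91, 0.12, 0.50, 0.73, 0.08]
true_labels = [1.0, 0.0, 1.0, 0.0, 0.0]

predicted = to_labels(probabilities)
score = accuracy(true_labels, predicted)
print(predicted, score)
```

Because the comparison is strict (`>`), a probability of exactly 0.5 is labeled as on time, matching the `predictions > threshold` expression in the cell above.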


In [14]:
model_filename = 'trained-model.xgb'
trained_model.save_model( model_filename )

### <font color='#8735fb'> **ForestInference Predict & Score** </font>

In [15]:
from cuml import ForestInference

In [16]:
reloaded_model = ForestInference.load( model_filename )

In [17]:
%%time 
fil_predictions = reloaded_model.predict( X_test )
fil_predictions = ( fil_predictions > threshold ) * 1.0
score = accuracy_score ( y_test.astype('float32'),
                         fil_predictions.astype('float32') )
print(f'fil score = {score}')

fil score = 0.9511467814445496
CPU times: user 108 ms, sys: 18.1 ms, total: 126 ms
Wall time: 149 ms


### <font color='#8735fb'> **Additional References** </font>