## 2. End-to-end example: predicting taxi tips in New York

<div class="alert alert-block alert-info">
<ul>
    <li><b>Classification task</b>: predict whether a trip will result in a tip greater than 20% of the fare.</li>
    <li><b>Data</b>: the June 2021 <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City Taxi & Limousine Commission's Trip Record Data</a>. This dataset contains over 2 million samples of yellow cab rides.</li>
    <li><b>Model</b>: train a simple <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> model, an ensemble of gradient-boosted decision trees.</li>
</ul>
</div>

### How we use Ray AI Libraries for this task

|Ray AI Library | Use Case|
|:--|:--|
|**Ray Data**|Ingest and transform raw data; perform batch inference by mapping the checkpointed model to batches of data.|
|**Ray Train**|Use `Trainer` to scale XGBoost model training.|
|**Ray Tune**|Use `Tuner` for hyperparameter search.|
|**Ray Serve**|Deploy the model for online inference.|

### Inspecting the features of the NYC taxi dataset

* **`passenger_count`**
    * Float holding a whole-number value: the number of passengers.
* **`trip_distance`** 
    * Float representing trip distance in miles.
* **`fare_amount`**
    * Float representing the metered time-and-distance fare in dollars; taxes, tolls, surcharges, and tips are recorded in separate TLC fields.
* **`trip_duration`**
    * Integer representing the trip's duration in seconds.
* **`hour`**
    * Hour of the day in which the trip started.
    * Integer in the range `[0, 23]`.
* **`day_of_week`**
    * Integer in the range `[1, 7]`.
* **`is_big_tip`**
    * The label we want to predict.
    * Whether the tip amount was greater than 20% of the fare.
    * Boolean `[True, False]`.
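
To make the feature definitions above concrete, here is a minimal sketch of how one raw trip record could be turned into these columns. It uses only the Python standard library so that it runs anywhere. Several things here are assumptions rather than part of this notebook: the input keys follow the TLC column names (`tpep_pickup_datetime`, `tpep_dropoff_datetime`, `tip_amount`), the helper `derive_features` is hypothetical, the tip threshold is taken relative to `fare_amount`, and `day_of_week` counts Monday as 1.

```python
from datetime import datetime

# "Greater than 20%" from the task definition; assumed to be relative to fare_amount.
BIG_TIP_THRESHOLD = 0.20


def derive_features(trip: dict) -> dict:
    """Hypothetical helper: build one feature row from a raw trip record.

    Assumes TLC-style keys: tpep_pickup_datetime, tpep_dropoff_datetime,
    passenger_count, trip_distance, fare_amount, tip_amount.
    """
    pickup = datetime.fromisoformat(trip["tpep_pickup_datetime"])
    dropoff = datetime.fromisoformat(trip["tpep_dropoff_datetime"])
    fare = float(trip["fare_amount"])
    return {
        "passenger_count": float(trip["passenger_count"]),
        "trip_distance": float(trip["trip_distance"]),
        "fare_amount": fare,
        # Elapsed time between pickup and dropoff, in whole seconds.
        "trip_duration": int((dropoff - pickup).total_seconds()),
        # Hour of day in which the trip started, 0 through 23.
        "hour": pickup.hour,
        # Assumption: 1 = Monday ... 7 = Sunday (ISO weekday numbering).
        "day_of_week": pickup.isoweekday(),
        # The label: True when the tip exceeds 20% of the fare.
        "is_big_tip": float(trip["tip_amount"]) > BIG_TIP_THRESHOLD * fare,
    }


# A made-up trip, used only to exercise the helper.
example = derive_features({
    "tpep_pickup_datetime": "2021-06-01 08:15:00",
    "tpep_dropoff_datetime": "2021-06-01 08:27:30",
    "passenger_count": 1,
    "trip_distance": 2.4,
    "fare_amount": 10.0,
    "tip_amount": 2.5,
})
print(example)
```

On a dataset with millions of rows, the same logic would more likely be written as a vectorized transformation over batches than as a per-row function; the per-row form is used here only because it is easier to read.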