# Introduction

The goal of this project is to predict the duration of a given taxi trip. We consider several features: trip distance, pickup and drop-off location, month and time of day, and number of passengers. Understanding which factors drive trip duration is useful to both riders and taxi companies.

We will be analyzing two machine learning models and comparing their results to find an optimal model. The models used are:
- [Random Forest Regression](#random-forest-regression)
- [Gradient Boosting Model (XGBoost)](#gradient-boosting-model-xgboost)

# Packages & Imports
The cell below lists all the libraries we plan on importing for use in this project.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The Data
*Data provided by Kaggle (2016)

The dataset contains the following fields:
- **id**: unique identifier for each trip
- **vendor_id**: a code indicating the provider associated with the trip recorded
- **pickup_datetime**: date and time when the meter was engaged
- **dropoff_datetime**: date and time when the meter was disengaged
- **passenger_count**: number of passengers in the vehicle
- **pickup_longitude**: longitude where meter was engaged
- **pickup_latitude**: latitude where meter was engaged
- **dropoff_longitude**: longitude where the meter was disengaged
- **dropoff_latitude**: latitude where the meter was disengaged
- **store_and_fwd_flag**: whether the trip was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server
- **trip_duration**: duration of the trip in seconds
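
As a minimal sketch of how these fields might be loaded and sanity-checked with pandas, the helper below reads a CSV, parses the two datetime columns, and confirms that every listed column is present. The function name `load_trips` and the single sample row are made up for illustration (the row is not taken from the real dataset, and the `N` flag value is an assumption). In practice you would pass the path of the downloaded Kaggle file instead of the in-memory sample.

```python
import io
import pandas as pd

# Column names as listed in the field descriptions above
EXPECTED_COLUMNS = [
    "id", "vendor_id", "pickup_datetime", "dropoff_datetime",
    "passenger_count", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude", "store_and_fwd_flag",
    "trip_duration",
]

def load_trips(source) -> pd.DataFrame:
    """Read trips from a CSV path or file-like object and check the columns."""
    df = pd.read_csv(source, parse_dates=["pickup_datetime", "dropoff_datetime"])
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df

# Made-up sample row for illustration only (not real data)
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "id0000001,1,2016-03-14 17:24:55,2016-03-14 17:32:30,1,"
    "-73.98,40.76,-73.96,40.77,N,455\n"
)
trips = load_trips(sample_csv)

# trip_duration is in seconds, so it should roughly agree with dropoff - pickup
elapsed = (trips["dropoff_datetime"] - trips["pickup_datetime"]).dt.total_seconds()
```

Comparing `elapsed` against `trip_duration` is one inexpensive way to spot inconsistent rows before any modeling.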


# Cleaning the Data

- Extract the hour from the pickup time. We don't need the exact minutes and seconds, just an approximate hour. (Maybe round up or down at the 30-minute mark?)
- Single out day (?)
- Single out month
- Group pickup into origin
- Group dropoff into destination
- Get Distance
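
The time-related steps above could be sketched as follows. This is only one possible reading of the notes: it assumes "day" means day of the week, rounds the hour at the 30-minute mark, and uses hypothetical column names (`pickup_hour`, `pickup_day`, `pickup_month`) and a hypothetical helper `add_time_features`. The grouping of pickup and drop-off coordinates is not covered here.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hour, day-of-week, and month columns added."""
    out = df.copy()
    pickup = pd.to_datetime(out["pickup_datetime"])

    # Nearest hour: minutes >= 30 round up, and hour 24 wraps around to 0
    out["pickup_hour"] = (pickup.dt.hour + (pickup.dt.minute >= 30)) % 24
    # Assumption: "day" is the day of the week (Monday = 0, Sunday = 6)
    out["pickup_day"] = pickup.dt.dayofweek
    out["pickup_month"] = pickup.dt.month
    return out

# Two made-up timestamps for illustration
example = pd.DataFrame(
    {"pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 23:45:00"]}
)
features = add_time_features(example)
```

Note that rounding 23:45 up yields hour 0 while the day and month still refer to the original timestamp. That is acceptable if the hour is used purely as a time-of-day feature.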

# Getting the Manhattan Distance

Much of New York City, Manhattan in particular, is laid out on a street grid, so the Manhattan distance gives a reasonable approximation of the distance a taxi travels between the pickup point and the drop-off point.

### The Manhattan Distance (in miles) is as follows:

Let two points be:

- \( (\phi_1, \lambda_1) \): latitude and longitude of the origin (pickup) point.
- \( (\phi_2, \lambda_2) \): latitude and longitude of the destination (drop-off) point.

Approximation for Conversion to Miles:
- 1 degree of latitude ≈ 69.0 miles.
- 1 degree of longitude ≈ 69.0 × \( \cos(\phi_{\text{avg}}) \) miles, where \( \phi_{\text{avg}} = (\phi_1 + \phi_2)/2 \) is the average latitude in degrees.

Formula:

$$
d_{\text{Manhattan}} \approx 69.0 \, |\phi_2 - \phi_1| + 69.0 \cos\!\left(\frac{\phi_1 + \phi_2}{2}\right) |\lambda_2 - \lambda_1|
$$

Where: 
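- \( \phi_1, \phi_2 \) are latitudes in degrees
- \( \lambda_1, \lambda_2 \) are longitudes in degrees
- the cosine argument \( (\phi_1 + \phi_2)/2 \) is an angle in degrees, so it must be converted to radians before calling a function such as `np.cos`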

In [2]:
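def manhattan_distance(lat1, lon1, lat2, lon2):
    """Approximate Manhattan distance in miles between two (lat, lon) points in degrees."""
    # Average latitude, used to scale degrees of longitude to miles
    avg_lat = (lat1 + lat2) / 2.0
    # Approximate conversion factors (miles per degree)
    lat_miles = 69.0
    lon_miles = 69.0 * np.cos(np.radians(avg_lat))

    # North-south and east-west legs of the trip, in miles
    d_lat = abs(lat2 - lat1) * lat_miles
    d_lon = abs(lon2 - lon1) * lon_miles

    return d_lat + d_lon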
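
To show how the formula might be applied to whole columns at once, here is a self-contained sketch of a vectorized variant. The name `manhattan_distance_miles`, the constant `MILES_PER_DEGREE`, the `distance_miles` column, and the coordinates in the small frame are all illustrative assumptions. The arithmetic is the same 69-miles-per-degree approximation described above.

```python
import numpy as np
import pandas as pd

MILES_PER_DEGREE = 69.0  # same approximation as in the formula above

def manhattan_distance_miles(lat1, lon1, lat2, lon2):
    """Vectorized Manhattan distance in miles; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = (
        np.asarray(v, dtype=float) for v in (lat1, lon1, lat2, lon2)
    )
    avg_lat = (lat1 + lat2) / 2.0
    d_lat = np.abs(lat2 - lat1) * MILES_PER_DEGREE
    # Longitude degrees shrink with latitude; np.cos expects radians
    d_lon = np.abs(lon2 - lon1) * MILES_PER_DEGREE * np.cos(np.radians(avg_lat))
    return d_lat + d_lon

# Made-up coordinates for illustration only
trips = pd.DataFrame({
    "pickup_latitude": [40.76, 40.70],
    "pickup_longitude": [-73.98, -74.01],
    "dropoff_latitude": [40.77, 40.75],
    "dropoff_longitude": [-73.96, -73.99],
})
trips["distance_miles"] = manhattan_distance_miles(
    trips["pickup_latitude"], trips["pickup_longitude"],
    trips["dropoff_latitude"], trips["dropoff_longitude"],
)
```

Operating on arrays avoids a Python-level loop over rows, which starts to matter once the dataset is large.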

# Random Forest Regression

# Gradient Boosting Model (XGBoost)

# Results