### Prequels/sequels

- **ChaiEDA sessions: ChaiEDA: NYC Taxi Trip Duration (data-prep)** | [Extended Dataset](https://www.kaggle.com/neomatrix369/nyc-taxi-trip-duration-extended)
- [ChaiEDA sessions: ChaiEDA: NYC Taxi Trip Duration - analysis](https://www.kaggle.com/neomatrix369/chaieda-nyc-taxi-trip-duration-analysis)

## Installing and importing libraries and packages

In [None]:
!pip install swifter

In [None]:
%%bash
rm -f dtype_diet.py
wget https://raw.githubusercontent.com/ianozsvald/dtype_diet/master/dtype_diet.py
ls -lash *.py

In [None]:
import dtype_diet

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Loading datasets

In [None]:
DATA_FOLDER='/kaggle/input/nyc-taxi-trip-duration/'
training_dataset = pd.read_csv(f'{DATA_FOLDER}/train.zip')
test_dataset = pd.read_csv(f'{DATA_FOLDER}/test.zip')

In [None]:
DATASET_UPLOAD_FOLDER='/kaggle/working/upload'
EXTENDED_DATA_FOLDER='/kaggle/input/nyc-taxi-trip-duration-extended'

In [None]:
%%bash
UPLOAD_FOLDER=/kaggle/working/upload
mkdir -p ${UPLOAD_FOLDER}
cp /kaggle/input/nyc-taxi-trip-duration/*.zip ${UPLOAD_FOLDER} || true
cp /kaggle/input/chaieda-nyc-taxi-trip-duration-data-prep/*.csv ${UPLOAD_FOLDER} || true

In [None]:
%%time
# dtype_diet.report_on_dataframe(training_dataset)
training_dataset.info(memory_usage='deep')
training_dataset['vendor_id'] = training_dataset['vendor_id'].astype('int8')
training_dataset['passenger_count'] = training_dataset['passenger_count'].astype('int8')
training_dataset['store_and_fwd_flag'] = training_dataset['store_and_fwd_flag'].astype('category')
training_dataset['trip_duration'] = training_dataset['trip_duration'].astype('int32')
training_dataset.info(memory_usage='deep')

In [None]:
training_dataset

In [None]:
%%time
# dtype_diet.report_on_dataframe(test_dataset)
test_dataset.info(memory_usage='deep')
test_dataset['vendor_id'] = test_dataset['vendor_id'].astype('int8')
test_dataset['passenger_count'] = test_dataset['passenger_count'].astype('int8')
test_dataset['store_and_fwd_flag'] = test_dataset['store_and_fwd_flag'].astype('category')
test_dataset.info(memory_usage='deep')

In [None]:
test_dataset

In [None]:
print(training_dataset.columns)
print(test_dataset.columns)

In [None]:
train_test_filename = 'train_test_extended.csv'
extended_dataset_name = f'{EXTENDED_DATA_FOLDER}/{train_test_filename}'
found_extended_dataset = os.path.exists(extended_dataset_name)
if found_extended_dataset:
    print(f'Found {extended_dataset_name}, reusing existing one')
    combined_dataset = pd.read_csv(extended_dataset_name)
else:
    print(f'Did not find {extended_dataset_name}, will generate one starting here')
    combined_dataset = pd.concat([training_dataset, test_dataset])
    combined_dataset = combined_dataset.reset_index(drop=True)
combined_dataset

In [None]:
%%time
dtype_diet.report_on_dataframe(combined_dataset)
combined_dataset.info(memory_usage='deep')

## Additional data

Loading additional data that cover information about the districts and neighbourhoods in New York City with the help of the [NYC Airbnb dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data).

In [None]:
ADDITIONAL_DATA_FOLDER='/kaggle/input/new-york-city-airbnb-open-data/'
more_data_dataset = pd.read_csv(f'{ADDITIONAL_DATA_FOLDER}/AB_NYC_2019.csv')
more_data_dataset = more_data_dataset.drop(columns=['id', 'host_id', 'host_name', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 
                                                    'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'])

more_data_dataset = more_data_dataset.rename(columns={'neighbourhood_group': 'district'})

In [None]:
%%time
# dtype_diet.report_on_dataframe(more_data_dataset)
more_data_dataset.info(memory_usage='deep')
more_data_dataset['district'] = more_data_dataset['district'].astype('category')
more_data_dataset['neighbourhood'] = more_data_dataset['neighbourhood'].astype('category')
more_data_dataset.info(memory_usage='deep')

#### Convert Latitude/Longitude

It's easier to manage (compare, sort, etc.) multiple fields when they can be merged into a single, meaningful value. Here we combine the latitude and longitude of a point into a single number, the _geonumber_.

In [None]:
# https://stackoverflow.com/questions/8285599/is-there-a-formula-to-change-a-latitude-and-longitude-into-a-single-number
# Alternative calculation: (lat * 1e7 << 16) & 0xffff0000 | lng * 1e7 & 0x0000ffff
def geonumber(lat: float, lng: float):
    return ((lat + 90) * 180) + lng
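
As a small worked example of the formula above, the sketch below computes the geonumber for a point near Times Square and for a second point chosen to show a caveat: the formula is a linear mix of latitude and longitude, so different points can end up with (almost exactly) the same number. The coordinates are illustrative assumptions, not values taken from the dataset.

```python
import math

def geonumber(lat: float, lng: float) -> float:
    # Same formula as in the notebook cell above
    return ((lat + 90) * 180) + lng

# Illustrative coordinates near Times Square (assumed values, not from the dataset)
times_square = geonumber(40.7580, -73.9855)

# Shifting latitude by +0.01 adds 1.8 to the geonumber; shifting longitude
# by -1.8 cancels that out, so a different point gets practically the same number
other_point = geonumber(40.7680, -75.7855)

print(times_square, other_point)
print(math.isclose(times_square, other_point, abs_tol=1e-6))
```

This is why a geonumber match is best read as an approximate hint about location rather than an exact identifier.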

In [None]:
%%time
more_data_dataset['name'] = more_data_dataset['name'].fillna('<Unknown>')
more_data_dataset = more_data_dataset.sort_values(by = ['latitude','longitude'])
more_data_dataset['geonumber'] = np.vectorize(geonumber)(more_data_dataset['latitude'], more_data_dataset['longitude'])
more_data_dataset['name'] = more_data_dataset['name'].apply(lambda x: x.title())

In [None]:
more_data_dataset

In [None]:
more_data_dataset.to_csv(f'{DATASET_UPLOAD_FOLDER}/nyc_additional_info.csv', index=False)

### Generate District and Neighbourhood from Latitude and Longitude

Using the additional dataset, we map the latitude and longitude of each pickup and dropoff point to the corresponding district and neighbourhood. This is done slightly differently: we convert the latitude and longitude of a point into a single number called _geonumber_. This makes the matching during the mapping process easier and a bit more efficient.
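
The matching idea can be sketched on a toy table: select the reference rows whose geonumber falls inside a window around the target and take the middle one. The table contents, the `lookup` helper and the offset below are made up for illustration; the notebook's own functions follow in the next cells.

```python
import pandas as pd

# Toy reference table (made-up values) with one geonumber per known location
reference = pd.DataFrame({
    'geonumber': [100.0, 101.0, 102.0, 250.0],
    'district': ['A', 'A', 'B', 'C'],
    'neighbourhood': ['a1', 'a2', 'b1', 'c1'],
})

def lookup(target: float, offset: float):
    """Return [district, neighbourhood] of the middle row inside the window."""
    window = reference[(reference['geonumber'] >= target - offset) &
                       (reference['geonumber'] <= target + offset)]
    if window.empty:
        return ['<Unknown>', '<Unknown>']
    middle = window.iloc[len(window) // 2]
    return [middle['district'], middle['neighbourhood']]

print(lookup(101.2, 1.5))  # window holds 100, 101 and 102; the middle row is 101
print(lookup(500.0, 1.5))  # nothing falls inside the window
```

The width of the window decides how many reference rows compete for a match, so it is worth checking against the spread of geonumbers in the reference data.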

In [None]:
%%time
combined_dataset['pickup_geonumber'] = np.vectorize(geonumber)(combined_dataset['pickup_latitude'], combined_dataset['pickup_longitude'])
combined_dataset['dropoff_geonumber'] = np.vectorize(geonumber)(combined_dataset['dropoff_latitude'], combined_dataset['dropoff_longitude'])

In [None]:
import tempfile
import numexpr
from tqdm.auto import tqdm
from joblib import Memory, Parallel, delayed 
import math

memory = Memory('/kaggle/working/', compress=9, verbose=0)

OFFSET = 200

def filter_more_data_dataset(filter, target_geonumber: float):
    # Rows of the additional dataset whose geonumber falls inside the search window
    matches = more_data_dataset[filter]
    if matches.shape[0] > 0:
        # Take the middle match; index the filtered rows, not the full dataset
        mid_point = matches.shape[0] // 2
        return [matches.iloc[mid_point]['district'], matches.iloc[mid_point]['neighbourhood']]
    
    # The dataset is sorted by latitude/longitude, not by geonumber, so use min/max
    min_geonumber = more_data_dataset['geonumber'].min()
    max_geonumber = more_data_dataset['geonumber'].max()
    if (target_geonumber < min_geonumber) or (target_geonumber > max_geonumber):
        return ["Outside NYC", "Outside NYC"]

    return ["<Unknown>", "<Unknown>"]
    
@memory.cache()
def get_district_neighbourhood_info(target_geonumber: float) -> (str, str):
    if math.isnan(target_geonumber):
        return ["<Unknown>", "<Unknown>"]
    
    start_geonumber = target_geonumber - OFFSET
    end_geonumber = target_geonumber + OFFSET

    expression = f'(geonumber >= {start_geonumber}) and (geonumber <= {end_geonumber})'
    lat_long_filter = more_data_dataset.eval(expression)
    
    return filter_more_data_dataset(lat_long_filter, target_geonumber)


def process_pickup_dropoff_info(pickup_geonumber: float, dropoff_geonumber: float):
    pickup_results = get_district_neighbourhood_info(pickup_geonumber)
    dropoff_results = get_district_neighbourhood_info(dropoff_geonumber)
    
    return pickup_results + dropoff_results

In [None]:
import swifter
import gc

def apply_pickup_dropoff_info(dataset):
    pickup_geonumber = dataset['pickup_geonumber']
    dropoff_geonumber = dataset['dropoff_geonumber']
    return process_pickup_dropoff_info(pickup_geonumber, dropoff_geonumber)

In [None]:
def initialise(dataset: pd.DataFrame, new_field: str) -> pd.DataFrame:
    if new_field not in dataset.columns:
        dataset[new_field] = '<Unknown>'
    return dataset

def set_datatype(dataset: pd.DataFrame, new_field: str, type_name: str = 'category') -> pd.DataFrame:
    if new_field in dataset.columns:
        dataset[new_field] = dataset[new_field].astype(type_name)
    return dataset

def update_dataset(dataset: pd.DataFrame, start: int, end: int, results: list) -> pd.DataFrame:
    pickup_districts = []
    pickup_neighbourhoods = []
    dropoff_districts = []
    dropoff_neighbourhoods = []
    for each in results:
        pickup_districts.append(each[0])
        pickup_neighbourhoods.append(each[1])
        dropoff_districts.append(each[2])
        dropoff_neighbourhoods.append(each[3])
    
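    # Assign by position in one indexing step; chained indexing such as
    # dataset.iloc[start:end][column] = ... may write to a temporary copy
    dataset.iloc[start:end, dataset.columns.get_loc('pickup_district')] = pickup_districts
    dataset.iloc[start:end, dataset.columns.get_loc('pickup_neighbourhood')] = pickup_neighbourhoods
    dataset.iloc[start:end, dataset.columns.get_loc('dropoff_district')] = dropoff_districts
    dataset.iloc[start:end, dataset.columns.get_loc('dropoff_neighbourhood')] = dropoff_neighbourhoods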
    print(f'Saved rows between {start} and {end} of the dataset.')
  
    del pickup_districts, pickup_neighbourhoods, dropoff_districts, dropoff_neighbourhoods
    
    return dataset

Create a filter that selects the rows whose locations have not been mapped yet (any of the four fields is still `<Unknown>`). It keeps the pipeline continuous across runs and separates the locations that were mapped successfully from those that failed to map.

In [None]:
unknown_filter = (combined_dataset.pickup_district == '<Unknown>') | (combined_dataset.pickup_neighbourhood == '<Unknown>') | \
                 (combined_dataset.dropoff_district == '<Unknown>') | (combined_dataset.dropoff_neighbourhood == '<Unknown>')

In [None]:
combined_dataset[unknown_filter]

Generate a mapping between the geonumbers in the combined dataset and the additional dataset, using either joblib's `Parallel` or Swifter to process the rows in parallel. The unmapped rows are split into smaller batches, and each batch is written back to the dataset as soon as its processing finishes. The filter restricts the work to the rows that still need mapping.
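
The batching can be sketched in isolation as follows. `batch_bounds` is a hypothetical helper, not part of the notebook; it mirrors the rule used in the cell below (split into roughly 20 batches once there are more than 10,000 rows, otherwise use a single batch) and additionally clamps the last batch to the number of rows.

```python
def batch_bounds(n_rows: int, n_batches: int = 20, threshold: int = 10_000):
    """Yield (start, end) row positions that together cover n_rows rows."""
    if n_rows == 0:
        return  # nothing to map
    batch_size = n_rows // n_batches if n_rows > threshold else n_rows
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# Small numbers so the result is easy to read
print(list(batch_bounds(25, n_batches=3, threshold=10)))
print(list(batch_bounds(5)))
```

Because the batch size comes from floor division, a short extra batch can appear at the end when the row count is not an exact multiple of the batch size.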

In [None]:
%%time
force_regenerate = True
if force_regenerate or (not found_extended_dataset):
    batch_size = combined_dataset[unknown_filter].shape[0] // 20       \
                 if combined_dataset[unknown_filter].shape[0] > 10_000 \
                 else combined_dataset[unknown_filter].shape[0]
    if batch_size == 0:
        print('No data needs mapping.')
    else:
        combined_dataset = initialise(combined_dataset, 'pickup_district')
        combined_dataset = initialise(combined_dataset, 'pickup_neighbourhood')
        combined_dataset = initialise(combined_dataset, 'dropoff_district')
        combined_dataset = initialise(combined_dataset, 'dropoff_neighbourhood')

        combined_dataset = set_datatype(combined_dataset, 'pickup_district', 'str')
        combined_dataset = set_datatype(combined_dataset, 'pickup_neighbourhood', 'str')
        combined_dataset = set_datatype(combined_dataset, 'dropoff_district', 'str')
        combined_dataset = set_datatype(combined_dataset, 'dropoff_neighbourhood', 'str')

        processing_method = 'swifter' # 'default' ==> slow on this dataframe, 'swifter' ==> maybe a faster option

        for index in tqdm(range(0, combined_dataset[unknown_filter].shape[0], batch_size), \
                                desc=f'Pickup/dropoff (batchsize: {batch_size})'):
            start = index
            end = index + batch_size

            if processing_method == 'swifter':
                results = combined_dataset[unknown_filter][start:end].swifter \
                        .set_dask_scheduler(scheduler="processes") \
                        .progress_bar(enable=True, desc=f'Processing: {start} to {end}') \
                        .allow_dask_on_strings(enable=True) \
                        .apply(apply_pickup_dropoff_info, axis=1)
            else:
                pickup_geonumbers = combined_dataset[unknown_filter][start:end]['pickup_geonumber'].values
                dropoff_geonumbers = combined_dataset[unknown_filter][start:end]['dropoff_geonumber'].values

                # Pass each pickup/dropoff pair (not the whole arrays) to the worker
                results = Parallel(n_jobs=-1)(
                    delayed(process_pickup_dropoff_info)(
                        pickup_geonumber, dropoff_geonumber
                    ) for pickup_geonumber, dropoff_geonumber in \
                              tqdm(zip(pickup_geonumbers, dropoff_geonumbers))
                )
                del pickup_geonumbers, dropoff_geonumbers

            combined_dataset[unknown_filter] = update_dataset(combined_dataset.loc[unknown_filter], start, end, results)
            del results
            gc.collect()

    combined_dataset = set_datatype(combined_dataset, 'pickup_district', 'category')
    combined_dataset = set_datatype(combined_dataset, 'pickup_neighbourhood', 'category')
    combined_dataset = set_datatype(combined_dataset, 'dropoff_district', 'category')
    combined_dataset = set_datatype(combined_dataset, 'dropoff_neighbourhood', 'category')

In [None]:
for field in ['pickup_district', 'pickup_neighbourhood', 'dropoff_district', 'dropoff_neighbourhood']:
    print(f'{field}\n{combined_dataset[field].value_counts()}\n')

### Generate features from date/time related fields

The pickup and dropoff date/time fields contain a number of time-related features (month, year, season, etc.) which can help us understand the behaviour of the ride/client and also the demand for and usage of taxis throughout the year(s).
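
As a minimal sketch of the idea, several of these features can be read straight off a datetime column with the pandas `.dt` accessor. The three timestamps are illustrative values and the column names are assumptions; the notebook builds its own, more detailed labels in the cells that follow.

```python
import pandas as pd

# Illustrative pickup timestamps in the same 'YYYY-MM-DD HH:MM:SS' format
pickups = pd.Series(pd.to_datetime([
    '2016-03-14 17:24:55',
    '2016-06-12 00:43:35',
    '2016-01-19 11:35:24',
]))

features = pd.DataFrame({
    'hour': pickups.dt.hour,
    'day_name': pickups.dt.day_name(),
    'month': pickups.dt.month,
    # dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
    'is_weekend': pickups.dt.dayofweek >= 5,
})
print(features)
```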

In [None]:
# The country's meteorological department follows the international standard of four seasons with some local adjustments: 
# - winter (December - February)
# - spring (March - May)
# - summer (June - August) 
# - fall (September - November)

date_to_season_mapping = {'1. Winter': [12, 2], '2. Spring': [3, 5], '3. Summer': [6, 8], '4. Fall': [9, 11]}

def date_to_season(dates):
    results = []
    date_values = pd.DatetimeIndex(dates).month.values
    
    for month in date_values:
        result = 'None'
        for each_season in date_to_season_mapping:
            start, end = date_to_season_mapping[each_season]
            if ((start < end) and (start <= month <= end)) or \
               ((start > end) and ((month >= start) or (month <= end))):
                result = each_season
                break

        results.append(result)
    return results

In [None]:
month_no_to_name_mapping = [
    '01. Jan', '02. Feb', '03. Mar', '04. Apr', '05. May', '06. Jun', '07. Jul', 
    '08. Aug', '09. Sep', '10. Oct', '11. Nov', '12. Dec'
]

def date_to_month_name(dates):
    month_values = pd.DatetimeIndex(dates).month.values
    results = []
    for month in month_values:
        result = month_no_to_name_mapping[month - 1]
        results.append(result)
    return results

def weekday_or_weekend(dates):
    results = []
    for date_value in pd.DatetimeIndex(dates.values):
        weekno = date_value.weekday()
        result = "Weekday" if weekno < 5 else "Weekend"
        results.append(result)
    return results

In [None]:
import holidays
holidays_usa = holidays.USA()

def regular_day_or_holiday(dates):
    results = []
    for date_value in pd.DatetimeIndex(dates.values):
        result = "Holiday (or Festival)" if date_value.date() in holidays_usa else "Regular day"
        results.append(result)
    return results

#### Day period and time ranges

- morning: 06:00-11:59
- afternoon: 12:00-17:59
- evening: 18:00-23:59 (with activity)
- night (sleep time): 00:00-05:59 (w/o activity)

Thanks [Mindy](https://www.kaggle.com/mindyng) for your help and also confirming the above.
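
The day periods listed above can also be assigned in one vectorised step with `pd.cut`. This is a sketch of an alternative to the loop-based function in the next cell, not the notebook's own method; the bin edges assume whole hours from 0 to 23, and the labels reuse the names from `date_to_day_period_mapping`.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# Bins are right-inclusive: (-1, 5], (5, 11], (11, 17], (17, 23]
day_period = pd.cut(
    hours,
    bins=[-1, 5, 11, 17, 23],
    labels=['4. Night', '1. Morning', '2. Afternoon', '3. Evening'],
)
print(day_period.astype(str).tolist())
```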

In [None]:
date_to_day_period_mapping = {'1. Morning': [6, 11], '2. Afternoon': [12, 17], 
                              '3. Evening': [18, 23], '4. Night': [0, 5]}
def date_to_day_period(datetimes):
    results = []
    datetime_values = datetimes.values
    for datetime in datetime_values:
        _, time_of_day = datetime.split(' ')
        hour, _, _ = time_of_day.split(':')
        hour = int(hour)
        result = 'None'
        for each_day_period in date_to_day_period_mapping:
            start, end = date_to_day_period_mapping[each_day_period]
            if ((start < end) and (start <= hour <= end)) or \
               ((start > end) and ((hour >= start) or (hour <= end))):
                result = each_day_period
                break

        results.append(result)
    return results

In [None]:
%%time
force_regenerate = True
if force_regenerate or (not found_extended_dataset):
    combined_dataset['pickup_hour'] = pd.DatetimeIndex(combined_dataset['pickup_datetime']).hour
    combined_dataset['day_period'] = date_to_day_period(combined_dataset['pickup_datetime'])
    combined_dataset['day_name'] = pd.DatetimeIndex(combined_dataset['pickup_datetime']).day_name()
    daynames_with_index = {
        'Monday': '1. Monday', 'Tuesday': '2. Tuesday', 'Wednesday': '3. Wednesday', 'Thursday': '4. Thursday',
        'Friday': '5. Friday', 'Saturday': '6. Saturday', 'Sunday': '7. Sunday'
    }
    combined_dataset['day_name'] = combined_dataset['day_name'].replace(daynames_with_index)
    combined_dataset['month'] = date_to_month_name(combined_dataset['pickup_datetime'])
    combined_dataset['financial_quarter'] = combined_dataset['month'] 
    month_to_quarter = {
        '01. Jan': 4, '02. Feb': 4, '03. Mar': 4, '04. Apr': 1, '05. May': 1, '06. Jun': 1, '07. Jul': 2, 
        '08. Aug': 2, '09. Sep': 2, '10. Oct': 3, '11. Nov': 3, '12. Dec': 3
    }
    combined_dataset['financial_quarter'] = combined_dataset['financial_quarter'].replace(month_to_quarter)
    combined_dataset['year'] = pd.DatetimeIndex(combined_dataset['pickup_datetime']).year
    combined_dataset['season'] = date_to_season(combined_dataset['pickup_datetime'])
    combined_dataset['weekday_or_weekend'] = weekday_or_weekend(combined_dataset['pickup_datetime'])
    combined_dataset['regular_day_or_holiday'] = regular_day_or_holiday(combined_dataset['pickup_datetime'])

In [None]:
%%time
combined_dataset.info(memory_usage='deep')
combined_dataset['pickup_hour'] = combined_dataset['pickup_hour'].astype('category')
combined_dataset['day_period'] = combined_dataset['day_period'].astype('category')
combined_dataset['day_name'] = combined_dataset['day_name'].astype('category')
combined_dataset['month'] = combined_dataset['month'].astype('category')
combined_dataset['financial_quarter'] = combined_dataset['financial_quarter'].astype('category')
combined_dataset['year'] = combined_dataset['year'].astype('category')
combined_dataset['season'] = combined_dataset['season'].astype('category')
combined_dataset['weekday_or_weekend'] = combined_dataset['weekday_or_weekend'].astype('category')
combined_dataset['regular_day_or_holiday'] = combined_dataset['regular_day_or_holiday'].astype('category')
combined_dataset.info(memory_usage='deep')

In [None]:
combined_dataset

### Save the generated fields in the new datasets

In [None]:
%%time
train_filter = ~combined_dataset.trip_duration.isna()
train_extended_dataset = combined_dataset[train_filter]
train_extended_dataset.to_csv(f'{DATASET_UPLOAD_FOLDER}/train_extended.csv', index=False)

In [None]:
train_extended_dataset

In [None]:
%%time
test_filter = combined_dataset.trip_duration.isna()
test_extended_dataset = combined_dataset[test_filter]
test_extended_dataset.to_csv(f'{DATASET_UPLOAD_FOLDER}/test_extended.csv', index=False)

In [None]:
test_extended_dataset

In [None]:
%%time
combined_dataset.to_csv(f'{DATASET_UPLOAD_FOLDER}/{train_test_filename}', index=False)

## Uploading newly created/updated csv to your Kaggle Dataset

Set up your local environment with your Kaggle login details (`KAGGLE_KEY` and `KAGGLE_USERNAME`).

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

import os
os.environ['KAGGLE_KEY'] = user_secrets.get_secret("KAGGLE_KEY")
os.environ['KAGGLE_USERNAME'] = user_secrets.get_secret("KAGGLE_USERNAME")

Using the `kaggle` Python client, log in to your account from within the kernel.

In [None]:
import kaggle
kaggle.api.authenticate()

Get the metadata for the dataset you have already created manually. It's best to create the dataset manually and upload the initial csv file(s) into it, to avoid subsequent issues with updating the dataset (as seen during my own end-to-end cycle).

Before saving the metadata as a JSON file, add or update two keys, `id` and `id_no`, with the respective details as shown below.

In [None]:
OWNER_SLUG='neomatrix369'
DATASET_SLUG='nyc-taxi-trip-duration-extended'
dataset_metadata = kaggle.api.metadata_get(OWNER_SLUG, DATASET_SLUG)
dataset_metadata['id'] = dataset_metadata["ownerUser"] + "/" + dataset_metadata['datasetSlug']
dataset_metadata['id_no'] = dataset_metadata['datasetId']
import json
with open(f'{DATASET_UPLOAD_FOLDER}/dataset-metadata.json', 'w') as file:
    json.dump(dataset_metadata, file, indent=4)

Finally, call the `dataset_create_version()` API and pass it the folder that holds the metadata file along with the .csv and .fth file(s) that you would like to upload into your existing Dataset (as a new version).

In [None]:
%%time
# !kaggle datasets version -m "Updating datasets" -p /kaggle/working/upload
kaggle.api.dataset_create_version(DATASET_UPLOAD_FOLDER, 'Updating datasets')

### Cleanup (joblib cache)

In [None]:
!rm -fr /kaggle/working/joblib
