### Prequels/sequels

- [ChaiEDA sessions: ChaiEDA: NYC Taxi Trip Duration (data-prep)](https://www.kaggle.com/neomatrix369/chaieda-nyc-taxi-trip-duration-data-prep) | [Extended Dataset](https://www.kaggle.com/neomatrix369/nyc-taxi-trip-duration-extended)
- **ChaiEDA sessions: ChaiEDA: NYC Taxi Trip Duration - analysis**


<a id='ToC'></a>

----------

# Table of contents
- [Note](#note)
- [Summary](#summary)
- [Import libraries/packages](#import)
- [Loading datasets](#loading-datasets)
- Simple visualisations based on
  - [Location-based fields](#simple-location-based-visualisations)
    - Most/Least Popular Pickup District
    - Most/Least Popular Pickup Neighbourhood
    - Most/Least Popular Dropoff District
    - Most/Least Popular Dropoff Neighbourhood
    - Most/Least Popular Pickup/Dropoff District (side-by-side)
    - Most/Least Popular Pickup/Dropoff Neighbourhood
    - Most/Least Popular Pickup/Dropoff District pairs
    - Most/Least Popular Pickup/Dropoff Neighbourhood pairs
  - [Time-based fields](#simple-time-based-visualisations)
    - Pickup hour
    - Day period
    - Day name (day of the week)
    - Month
    - Quarter (financial)
    - Year
    - Season
    - Weekday or weekend
    - Regular day or Holiday
- Understanding the feature interactions
  - [Correlation matrix](#correlation-matrix-feature-interactions)
  - [The most and least correlated feature pairs](#feature-pairs-feature-interactions)
  - [Correlated feature trees (groups)](#feature-tree-feature-interactions)
- [Conclusions](#conclusions)

<a id='note'></a>

----------

## Note

#### There are a number of great notebooks/kernels based on this dataset, and one that stands out among the others is the [NYC Taxi EDA - Update: The fast & the curious](https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious) from [Martin Henze](https://www.kaggle.com/headsortails), also known as [headsortails](https://www.kaggle.com/headsortails). His notebook/kernel covers a lot of ground around this topic. A number of analyses and visualisations can be found there.

#### In addition to this, there are a [number of notebooks/kernels](https://www.kaggle.com/c/nyc-taxi-trip-duration/notebooks) linked to the dataset, which cover additional analyses and visualisations, some overlapping with the others and some unique.

#### The current notebook/kernel contains a few such overlapping analyses and visualisations, but there are others here that are not covered elsewhere, as the aspects and angles covered here arise from different methods of analysis. Please feel free to comment, correct and suggest new methods to improve the overall content and quality of this notebook/kernel.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='summary'></a>

----------
>
> ## Summary
>
> - The dataset has very few missing values, although the observations cover only two vendors (two taxi companies) and only six months of a single year (Fall data is missing)
> - Analysis using location-based data features from the data-points:
>   - **Manhattan** and **Queens** are the most popular pickup and dropoff districts
>   - The most popular districts overshadow the rest of the districts by as much as 20x
>   - There is much less taxi traffic between the other districts
>   - Airport visits and returns are also much smaller in number compared to intra-city usage
>   - **Upper West Side**, **Harlem** and **East Harlem** are the most popular pickup and dropoff neighbourhoods
>   - The most popular neighbourhoods overshadow the rest of the neighbourhoods by as much as 10x
>   - Some of the neighbourhoods are not visited as much as the top 10-20 neighbourhoods, maybe due to distance, cost of the journey, little or no work activity, or being unsafe to travel to
> - Analysis using time-based data features from the data-points:
>   - Apart from 4-5 hours late at night into the early hours of the morning, taxis are in demand throughout the 24-hour period
>   - Looking closer, demand drops only during the evening; for the rest of the day and night, taxis are in continuous use in the city
>   - There is a rise/fall pattern of usage during the week: usage drops on **Saturday**, **Sunday** and **Monday**, then gradually rises from **Tuesday** to **Friday** before the cycle repeats
>   - From a monthly perspective (limited to six months only), we can see a slow rise from **January** to **March** and then a slow decline from **March** into **June**
>   - No year-on-year analysis is possible as the data is limited to one year only, i.e. 2016
>   - From a seasonal point of view (limited to three seasons only, as Fall data is missing), we see high usage in **Spring** and **Winter** and relatively low usage during the **Summer**
>   - **Weekday** usage exceeds **Weekend** usage by a factor of 2x
>   - During Holidays and Festivals the usage is negligible compared to regular-day usage, lower by a factor of roughly 20x
> - **We can see how taxi usage in a city like New York is very much location- and time-based, and how its usage is more or less predictable on the basis of these factors (among others)**

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='import'></a>

----------

## Import libraries/packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Import library and dataset
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import HTML, display

sns.set(style="whitegrid", font_scale=1.75)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")

import math

# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
    
%matplotlib inline

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
def plot_histogram(title: str = '<No title specified>', 
                   field_names=None, xlabel_title='<Not specified>', ylabel_title='<Not specified>', 
                   xticks_label_rotation: int = None,
                   sort_values: bool = None, bins=10, width: float = 20.0, height: float = 5.0):
    """Plot a histogram of the `field_names` column(s) of the global `train_text_combined`."""
    plt.rcParams['figure.figsize'] = [width, height]
    if xticks_label_rotation:
        plt.xticks(rotation=xticks_label_rotation)
    plt.suptitle(title, fontsize="xx-large", fontstyle="normal")
    if sort_values:
        # Sort the values in ascending order before binning
        ax = train_text_combined[field_names].sort_values(ascending=sort_values).hist(bins=bins)
    else:
        ax = train_text_combined[field_names].hist(bins=bins)
    ax.set_xlabel(xlabel_title)
    ax.set_ylabel(ylabel_title)

def plot_value_counts(title: str = '<No title specified>', 
                      field_names=None, xlabel_title='<Not specified>', ylabel_title='<Not specified>',
                      kind='barh', value_filter_func=None,
                      width: float = 20.0, height: float = 5.0):
    """Plot how often each value (or combination of values) of `field_names` occurs.

    `value_filter_func`, if given, receives the counts and returns a boolean mask of the entries to plot.
    """
    plt.rcParams['figure.figsize'] = [width, height]
    value_counts = train_text_combined[field_names].value_counts()
    value_counts = value_counts.sort_values(ascending=True)
    if value_filter_func:
        value_counts = value_counts[value_filter_func(value_counts)]
    ax = value_counts.plot(kind=kind, title=title)

    ax.set_xlabel(xlabel_title)
    ax.set_ylabel(ylabel_title)

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='loading-datasets'></a>

----------

## Loading datasets

In [None]:
%%time
train_text_combined = pd.read_csv('/kaggle/input/nyc-taxi-trip-duration-extended/train_test_extended.csv')
print('train_test_extended.csv dataset')
print("Column count:", len(train_text_combined.columns))

In [None]:
%%time
train_text_combined.info(memory_usage='deep') 
train_text_combined = train_text_combined.drop(columns=['id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude','pickup_latitude',
                                                       'dropoff_longitude', 'dropoff_latitude', 'pickup_geonumber', 'dropoff_geonumber'])
train_text_combined['store_and_fwd_flag'] = train_text_combined['store_and_fwd_flag'].astype('category')
train_text_combined['pickup_district'] = train_text_combined['pickup_district'].astype('category')
train_text_combined['pickup_neighbourhood'] = train_text_combined['pickup_neighbourhood'].astype('category')
train_text_combined['dropoff_district'] = train_text_combined['dropoff_district'].astype('category')
train_text_combined['dropoff_neighbourhood'] = train_text_combined['dropoff_neighbourhood'].astype('category')
train_text_combined['pickup_hour'] = train_text_combined['pickup_hour'].astype('category')
train_text_combined['day_period'] = train_text_combined['day_period'].astype('category')
train_text_combined['day_name'] = train_text_combined['day_name'].astype('category')
train_text_combined['year'] = train_text_combined['year'].astype('int16')
train_text_combined['month'] = train_text_combined['month'].astype('category')
train_text_combined['financial_quarter'] = 'Q' + train_text_combined['financial_quarter'].apply(str)
train_text_combined['financial_quarter'] = train_text_combined['financial_quarter'].astype('category')
train_text_combined['season'] = train_text_combined['season'].astype('category')
train_text_combined['weekday_or_weekend'] = train_text_combined['weekday_or_weekend'].astype('category')
train_text_combined['regular_day_or_holiday'] = train_text_combined['regular_day_or_holiday'].astype('category')

train_text_combined['pickup_dropoff_district'] = train_text_combined['pickup_district'].astype(str) + ' → ' + train_text_combined['dropoff_district'].astype(str)
train_text_combined['pickup_dropoff_district'] = train_text_combined['pickup_dropoff_district'].astype('category')

train_text_combined['pickup_dropoff_neighbourhood'] = train_text_combined['pickup_neighbourhood'].astype(str) \
           + " (" + train_text_combined['pickup_district'].astype(str) + ") → " \
           + train_text_combined['dropoff_neighbourhood'].astype(str) \
           + " (" + train_text_combined['dropoff_district'].astype(str) + ")"
train_text_combined['pickup_dropoff_neighbourhood'] = train_text_combined['pickup_dropoff_neighbourhood'].astype('category')

train_text_combined.info(memory_usage='deep')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">`diet type` is super useful for analysing and condensing bulky datasets: it recommends changes to the data types, as can be seen from the above.
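
To get a feel for why the `category` casts above shrink the dataframe, here is a minimal sketch on a made-up column (the values and row count are hypothetical, not taken from the dataset). It only compares the memory reported by pandas before and after the cast.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality text column such as pickup_district.
districts = pd.Series(
    ["Manhattan", "Queens", "Brooklyn", "Bronx"] * 25_000, name="pickup_district"
)

# deep=True also counts the memory held by the strings themselves.
bytes_as_text = districts.memory_usage(deep=True)
bytes_as_category = districts.astype("category").memory_usage(deep=True)

print(f"text:     {bytes_as_text:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

The saving comes from storing each distinct label once and keeping only a small integer code per row, so it is largest for columns with few distinct values.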

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='simple-location-based-visualisations'></a>

----------

## Simple visualisations based on Location-based fields

### Most/Least Popular Pickup District

In [None]:
plot_histogram("Most/Least Popular Pickup District", 'pickup_district', 
               xlabel_title='NYC Districts',ylabel_title='Trips made')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We can see how popular **Brooklyn** and **Manhattan** are compared to the other districts, e.g. **Staten Island** and **Queens**, when it comes to pickup locations in **New York City**. So most people either live and/or work in these two districts, or there is more activity in these two areas. One reading is that more people travel to **Manhattan** for work while relatively more people live in **Brooklyn**, although the data alone cannot confirm this.

### Most/Least Popular Pickup Neighbourhood

In [None]:
plot_value_counts("Most Popular Pickup Neighbourhood", 
                  xlabel_title='Trips made', ylabel_title='NYC Neighbourhood (District)',
                  field_names=['pickup_neighbourhood', 'pickup_district'], 
                  value_filter_func=lambda x: x > 10_000,
                  width=15.0, height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The top 5-10 neighbourhoods overshadow the rest of the neighbourhoods put together. It is as if almost all workplaces and homes are located here, or most related activity happens in these specific neighbourhoods rather than in the less popular ones. Note that the chart above only includes neighbourhoods with more than **10,000 trips** in the six-month period.
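
The cutoff used above can be seen in isolation in the small sketch below. The trips are invented for illustration and the threshold is scaled down to 2; the idea is the same as `value_filter_func=lambda x: x > 10_000`, where the function receives the counts and returns a boolean mask.

```python
import pandas as pd

# Hypothetical trips; the notebook counts rows of train_text_combined instead.
trips = pd.DataFrame({
    "pickup_neighbourhood": ["Harlem"] * 5 + ["Upper West Side"] * 3 + ["Tribeca"],
})

# Count trips per neighbourhood, smallest first (as plot_value_counts does).
counts = trips["pickup_neighbourhood"].value_counts().sort_values(ascending=True)

# Keep only the neighbourhoods whose count passes the cutoff.
value_filter_func = lambda x: x > 2
popular = counts[value_filter_func(counts)]
print(popular)
```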

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

In [None]:
plot_value_counts("Least Popular Pickup Neighbourhood", 
                  xlabel_title='Trips made', ylabel_title='NYC Neighbourhood (District)',
                  field_names=['pickup_neighbourhood', 'pickup_district'], 
                  value_filter_func=lambda x: x < 1_000,
                  width=15.0, height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The pickup points above are among the least popular in the whole city: the chart only includes neighbourhoods with fewer than **1,000 trips** in the six-month period. The pattern is similar here: the top 5-10 of the least used pickup locations overshadow the rest of the list by a very big margin. Even a busy city like **New York** still has very quiet, rarely used locations.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Most/Least Popular Dropoff District

In [None]:
plot_histogram("Most/Least Popular Dropoff District", 'dropoff_district',
                ylabel_title='Trips made', xlabel_title='NYC Districts',)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The ranking of dropoff districts, from most to least popular, is very similar to that of the pickup districts.

### Most/Least Popular Dropoff Neighbourhood

In [None]:
plot_value_counts("Most Popular Dropoff Neighbourhood", 
                  xlabel_title='Trips made', ylabel_title='NYC Neighbourhood (District)',
                  field_names=['dropoff_neighbourhood', 'dropoff_district'], 
                  value_filter_func=lambda x: x > 10_000,
                  width=15.0, height=15.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Similarly, the ranking of dropoff neighbourhoods, from most to least popular, is very similar to that of the pickup neighbourhoods.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

In [None]:
plot_value_counts("Least Popular Dropoff Neighbourhood", 
                  xlabel_title='Trips made', ylabel_title='NYC Neighbourhood (District)',
                  field_names=['dropoff_neighbourhood', 'dropoff_district'], 
                  value_filter_func=lambda x: x < 1_000,
                  width=15.0, height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The same goes for the least popular dropoff neighbourhoods compared to the least popular pickup neighbourhoods.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Most/Least Popular Pickup/Dropoff District (side-by-side)

In [None]:
plt.rcParams['figure.figsize'] = [25.0, 10.0]
plt.subplot(1, 2, 1)
plot_value_counts("Pickup district", 'pickup_district', 
                  xlabel_title='Trips made', ylabel_title='NYC District',
                  width=25.0, height=10.0)
plt.subplot(1, 2, 2)
plot_value_counts("Dropoff district", 'dropoff_district', 
                  xlabel_title='Trips made', ylabel_title='',
                  width=25.0, height=10.0)
plt.suptitle("Most/Least Popular Pickup/Dropoff District", fontsize="xx-large", fontstyle="normal")

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">A side-by-side view only confirms our observations above.

### Most/Least Popular Pickup/Dropoff Neighbourhood

In [None]:
cutoff_count = 20_000
value_filter_func=lambda x: x > cutoff_count
plt.subplot(1, 2, 1)
plot_value_counts("Pickup neighbourhood", 'pickup_neighbourhood', 
                  xlabel_title='Trips made', ylabel_title='NYC Neighbourhood',
                  value_filter_func=value_filter_func,
                  width=30.0, height=35.0)
plt.subplot(1, 2, 2)
plot_value_counts("Dropoff neighbourhood", 'dropoff_neighbourhood', 
                  xlabel_title='Trips made', ylabel_title='',
                  value_filter_func=value_filter_func,
                  width=30.0, height=35.0)
plt.suptitle("Most/Least Popular Pickup/Dropoff Neighbourhood", fontsize="xx-large", fontstyle="normal")

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Ditto: a side-by-side view only confirms our observations above. We are also beginning to notice differences between the pickup and dropoff lists: from most to least popular, the two lists taper off in different shapes. We will see more of this in the visualisations to come.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Most/Least Popular Pickup/Dropoff District pairs

In [None]:
plot_value_counts("Most/Least Popular Pickup/Dropoff District pairs", 'pickup_dropoff_district', 
                  xlabel_title='Trips made', ylabel_title='Pickup -> Dropoff Districts',
                  value_filter_func=lambda x: x < 1_000,
                  height=10.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Now when we take pickup-dropoff district pairs and graph them, we can see which routes between two districts are the most and least popular. We can see that most people do not go to or return from **Brooklyn**, **Staten Island**, or places outside **New York City**. Most of the traffic is between **Manhattan** and **Queens**, and between **Manhattan** and other locations, while the other districts are used far less in comparison. Hardly anyone travels from **Queens** to any part of **New York** other than **Manhattan**: **Queens** is a popular pickup and dropoff point for trips to or from **Manhattan**, but one of the least popular for trips to or from any other district. Many more such conclusions can be drawn if we look closer.
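
Another way to look at district pairs, as an alternative to the concatenated `pickup → dropoff` label used above, is a two-way table of counts. The sketch below uses `pd.crosstab` on a handful of invented trips, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical trips, just enough to show the shape of the table.
trips = pd.DataFrame({
    "pickup_district":  ["Manhattan", "Manhattan", "Queens", "Manhattan", "Brooklyn"],
    "dropoff_district": ["Manhattan", "Queens", "Manhattan", "Manhattan", "Brooklyn"],
})

# Rows are pickup districts, columns are dropoff districts, cells are trip counts.
pair_table = pd.crosstab(trips["pickup_district"], trips["dropoff_district"])
print(pair_table)
```

Pairs that never occur show up as zeros in the table, which makes quiet routes easy to spot.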

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Most/Least Popular Pickup/Dropoff Neighbourhood pairs

In [None]:
plot_value_counts("Most/Least Popular Pickup/Dropoff Neighbourhood pairs", 'pickup_dropoff_neighbourhood', 
                  xlabel_title='Trips made', ylabel_title='Pickup -> Dropoff Neighbourhoods',
                  value_filter_func=lambda x: x > 10_000,
                  width=15.0, height=30.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It's interesting to see how many trips stay within the same neighbourhood, and how **Upper West Side**, **Harlem** and **East Harlem** (among other locations) are the most popular pickup/dropoff locations. We also see that trips from the most popular pickup locations go to other popular locations and even to many less popular ones.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

In [None]:
percentiles = [value/100 for value in range(10, 100, 10)] + [0.25, 0.75]

In [None]:
train_text_combined['pickup_dropoff_neighbourhood'] = train_text_combined['pickup_neighbourhood'].astype(str) + ' → ' + train_text_combined['dropoff_neighbourhood'].astype(str)
value_counts_dropoff = train_text_combined['pickup_dropoff_neighbourhood'].value_counts()
cutoff_filter = value_counts_dropoff <= 10_000
print('Descriptive statistics of least popular pickup-dropoff points by neighbourhood')
value_counts_dropoff[cutoff_filter].describe(percentiles=percentiles)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We can see the breakdown of the least used pickup/dropoff location pairs. It's striking that only about 50% of them are used regularly while the rest are used very seldom. Maybe these are less inhabited or less safe areas that many do not bother to travel to. They may also be harder to get to, or there may be other reasons that cannot be seen in the dataset.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='simple-time-based-visualisations'></a>

----------

## Simple visualisations based on Time-based fields

### Pickup hour

In [None]:
plot_histogram("Most/Least Popular Pickup hour", 'pickup_hour', 
               ylabel_title='Trips made', xlabel_title='Hour of the day (0 to 23)',
               bins=20, height=5.5)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">There is a clear cycle in the usage pattern: the morning-to-night hours are the busiest, while usage is lowest from midnight until 5-6am, when it starts to pick up again.
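
Because the 24 hourly values are spread over `bins=20` in the histogram above, some bars may combine two adjacent hours. A per-hour count avoids that. The sketch below is one way to do it, using a few invented pickup hours rather than the real column.

```python
import pandas as pd

# Hypothetical pickup hours (0 to 23).
pickup_hour = pd.Series([0, 8, 8, 9, 18, 18, 18, 23], name="pickup_hour")

# One count per hour of the day; hours with no trips are filled in as 0.
trips_per_hour = pickup_hour.value_counts().reindex(range(24), fill_value=0)
print(trips_per_hour)
```

The result has exactly one entry per hour, in clock order, so it can be drawn with `trips_per_hour.plot(kind='bar')` without any binning.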

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Day period

In [None]:
plot_histogram("Most/Least Popular Day period (busy period)", 'day_period', 
               ylabel_title='Trips made', xlabel_title='Day period',
               sort_values=True, xticks_label_rotation=45, 
               bins=8, height=5.5)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We already saw this trend in the graph above: the daytime through to the late evening hours sees the most usage and, interestingly, usage drops at night. The evening usage could be for various reasons, e.g. returning home, shopping or going out. Hence it is worth finding out the reasons for usage across the day.

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;"> Just to understand the meaning of the various day periods, here is a breakdown: **Morning**: 6AM to 11:59AM, **Afternoon**: 12 noon to 5:59PM, 
**Evening/Night**: 6PM to 12 midnight (with activity), **Night/Sleep time**: 12 midnight to 5:59AM (without activity).
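
The breakdown above can be written down as a small function. This is only a sketch: the function name `day_period_for_hour` is hypothetical, and the label strings are copied from the text above and may not match the exact values stored in the `day_period` column.

```python
def day_period_for_hour(hour: int) -> str:
    """Map an hour of the day (0 to 23) to the day period described above."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    if 6 <= hour < 12:
        return "Morning"            # 6AM to 11:59AM
    if 12 <= hour < 18:
        return "Afternoon"          # 12 noon to 5:59PM
    if hour >= 18:
        return "Evening/Night"      # 6PM to midnight
    return "Night/Sleep time"       # midnight to 5:59AM


for example_hour in (0, 6, 12, 18, 23):
    print(example_hour, day_period_for_hour(example_hour))
```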

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Day name (day of the week)

In [None]:
plot_histogram("Most/Least Popular day of the week (busy period)", 'day_name', 
               ylabel_title='Trips made', xlabel_title='Day of the week',
               xticks_label_rotation=45, sort_values=True,
               bins=20, height=8.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Again here we see some sort of a cycle, **Thursday** and **Friday** seem to be busier than **Saturday** and **Sunday**. Also we can see a slow rise from **Monday** all the way into **Friday** and then a slow decline into **Saturday**, **Sunday** and **Monday**.
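
A weekly cycle is easier to read when the days appear in calendar order rather than alphabetical or count order. The sketch below shows one way to get that with a plain `reindex`; the day names are invented sample data, and `DAY_ORDER` is a helper name introduced here for illustration.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical day names, one per trip.
day_name = pd.Series(["Friday", "Monday", "Friday", "Sunday", "Tuesday"], name="day_name")

# Count trips per day, then put the counts in calendar order (missing days become 0).
trips_per_day = day_name.value_counts().reindex(DAY_ORDER, fill_value=0)
print(trips_per_day)
```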

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Month

In [None]:
plot_histogram("Most/Least Popular month (busy period)", 'month', 
               ylabel_title='Trips made', xlabel_title='Month of the year',               
               xticks_label_rotation=45, sort_values=True,
               bins=12, height=8.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We only have six months' worth of data, but March was the busiest month among them: usage rises from January into March and then slowly tapers off towards June.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Quarter (financial)

In [None]:
plot_value_counts("Most/Least Popular financial quarter (busy period)", 'financial_quarter', 
                  xlabel_title='Trips made', ylabel_title='Financial Quarter (Q1 -> Q4)',
                  width=12.5, height=5.5)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">As we noticed in the previous section (Month), we only have six months of data, which span two quarters. We do not see much of a difference between the two quarters here; the month-by-month view in the previous section makes the differences a bit more visible.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Year

In [None]:
plot_value_counts("Most/Least Popular year", 'year', 
                  xlabel_title='Trips made', ylabel_title='Calendar Year',
                  height=4.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Nothing can be concluded from this as we only have been given data for 2016.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Season

In [None]:
plot_histogram("Most/Least Popular season (busy period)", 'season', 
               ylabel_title='Trips made', xlabel_title='Season',
               xticks_label_rotation=45, sort_values=True,
               bins=6, width=15.0, height=10.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">As we have no data for Fall, we can't conclude much, but it appears that Spring is the busiest of the three seasons, Summer the least busy and Winter about halfway between them. It's possible that people are on vacation during the Summer and that tourists seldom use taxis, leading to much lower usage. It is also possible that taxi drivers themselves are on vacation, leading to fewer rides and less data collected. Bear in mind too that, with data running only from January to June, Summer is covered by fewer days than Spring or Winter, which alone would lower its total.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Weekday or weekend

In [None]:
plot_histogram("Most/Least Popular weekday type", 'weekday_or_weekend', 
               ylabel_title='Trips made', xlabel_title='Weekday kind',
               bins=4, width=5.0, height=8.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Clearly Weekdays are busier than Weekends by a factor of 2x. Presumably people stay at home at weekends and only come into the city for purposes such as shopping, or do not use taxis when they go out.
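
One thing to keep in mind when comparing totals: a week has five weekdays and only two weekend days, so weekday totals would be about 2.5x higher even if every day were equally busy. Dividing by the number of days gives a fairer comparison. The sketch below shows the idea on invented trips; the `pickup_date` column and the `weekday`/`weekend` labels are assumptions made for this example, not the notebook's actual fields.

```python
import pandas as pd

# Hypothetical trips: 3 each on Mon 4 and Tue 5 January 2016, 4 on Sat 9 January 2016.
trips = pd.DataFrame({
    "pickup_date": pd.to_datetime(
        ["2016-01-04"] * 3 + ["2016-01-05"] * 3 + ["2016-01-09"] * 4
    ),
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
is_weekend = trips["pickup_date"].dt.dayofweek >= 5
trips["weekday_or_weekend"] = is_weekend.map({True: "weekend", False: "weekday"})

# Total trips and number of distinct calendar days per group.
summary = trips.groupby("weekday_or_weekend")["pickup_date"].agg(
    total_trips="count", distinct_days="nunique"
)
summary["trips_per_day"] = summary["total_trips"] / summary["distinct_days"]
print(summary)
```

In this toy data the weekday total is higher, yet the single weekend day is busier per day, which is the kind of effect the normalisation is meant to reveal.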

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Regular day or Holiday

In [None]:
plot_histogram("Most/Least popular day type: Regular day or holiday", 'regular_day_or_holiday', 
               ylabel_title='Trips made', xlabel_title='Day kind',
               bins=4, width=5.0, height=8.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">So Bank Holidays and/or Festival days again mean that people stay at home rather than come to the city, or do not use taxis to get there even if they visit the city for shopping.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Understanding the feature interactions

<a id='correlation-matrix-feature-interactions'></a>

----------

### Correlation matrix

In [None]:
# Add an integer-coded `<column>_id` companion for every categorical column
for column in list(train_text_combined.columns):
    if isinstance(train_text_combined[column].dtype, pd.CategoricalDtype):
        train_text_combined[f'{column}_id'] = train_text_combined[column].cat.codes

In [None]:
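# Leave out 'year' (the data only covers 2016) and keep the numeric columns,
# since correlation is only defined for numbers
selected_columns = [column for column in train_text_combined.columns if column != 'year']
train_text_combined[selected_columns].select_dtypes(include='number').corr()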

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='feature-pairs-feature-interactions'></a>

----------

### Most and least correlated feature pairs

In [None]:
from itertools import combinations

def most_correlated_pairs(dataframe, threshold=0.05):
    """Return feature pairs whose absolute correlation is at least `threshold`, strongest first."""
    # Only numeric columns can be correlated; text/category columns are skipped
    corr_matrix = dataframe.select_dtypes(include='number').corr()
    pair_names = []
    values = []
    # combinations() yields each unordered pair of distinct columns exactly once
    for row_index, col_index in combinations(corr_matrix.columns, 2):
        pair_names.append(f'{row_index} v/s {col_index}')
        values.append(corr_matrix.loc[row_index, col_index])

    correlation_pairs = pd.DataFrame({
        'pair_name': pair_names,
        'value': values,
    })
    correlation_pairs['abs_value'] = correlation_pairs['value'].abs()
    correlation_pairs = correlation_pairs.sort_values(by='abs_value', ascending=False).reset_index(drop=True)
    return correlation_pairs[correlation_pairs.abs_value >= threshold]

In [None]:
train_text_combined_correlated_pairs_dataframe = most_correlated_pairs(train_text_combined, threshold=0.05)
train_text_combined_correlated_pairs_dataframe

<a id='feature-tree-feature-interactions'></a>

----------

### Correlated feature trees (groups)

In [None]:
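def correlated_tree(dataframe, threshold=0.05):
    """Map each feature to the features whose correlation with it exceeds `threshold`."""
    # Only numeric columns can be correlated; text/category columns are skipped
    corr_matrix = dataframe.select_dtypes(include='number').corr()
    indexes = corr_matrix.columns
    nodes = {}
    for row_index in indexes:
        for col_index in indexes:
            value = corr_matrix.loc[row_index, col_index]
            # Only positive correlations above the threshold are kept
            if (str(row_index) != str(col_index)) and (threshold < value):
                value_as_str = f'{col_index} ({str(abs(round(value, 3)))})'
                nodes.setdefault(row_index, []).append(value_as_str)

    # Sort the tree by feature name
    return dict(sorted(nodes.items(), key=lambda item: item[0]))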

In [None]:
train_text_combined_tree = correlated_tree(train_text_combined, threshold=0.07)
for each_node in train_text_combined_tree:
    print(each_node)
    for each in train_text_combined_tree[each_node]:
        print(f'└─ {each }')
    print()

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The correlation matrix and the most and least correlated feature pairs show us the pairs of features most worth considering (and the ones to ignore) in further analysis. Of course, some features may be highly collinear because they depend on each other or because one feature is redundant, being derived from others. For example, `month_id` and `season_id` are quite dependent on each other for the obvious reason that the season of the year follows from the calendar month. So further analysis is required. The least correlated features also tell us which features have little or no linear relationship with each other and so may have little impact on our analysis or the end result. Further analysis using this information will tell us more.

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Correlated feature trees (groups) tell us which groups of features (clusters) can potentially be clubbed together for further analysis, although each cluster in the above list should be analysed individually before being considered further.
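
One caveat when reading correlations on the `_id` columns: for text labels, `cat.codes` numbers the categories in sorted (alphabetical) order by default, so the sign and size of a Pearson correlation depend on that arbitrary order. The sketch below illustrates this with a made-up month/season table; the season assignment per month is an assumption for the example only.

```python
import pandas as pd

# Hypothetical month-to-season mapping for six months of data.
frame = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "season": ["Winter", "Winter", "Spring", "Spring", "Spring", "Summer"],
})

# Default encoding: categories are sorted alphabetically (Spring=0, Summer=1, Winter=2).
season_alphabetical = frame["season"].astype("category")
frame["season_id"] = season_alphabetical.cat.codes
r_alphabetical = frame["month"].corr(frame["season_id"])

# Same labels, but encoded in calendar order (Winter=0, Spring=1, Summer=2).
season_calendar = pd.Categorical(
    frame["season"], categories=["Winter", "Spring", "Summer"], ordered=True
)
r_calendar = frame["month"].corr(pd.Series(season_calendar.codes))

print("alphabetical codes:", round(r_alphabetical, 3))
print("calendar codes:    ", round(r_calendar, 3))
```

The two encodings describe exactly the same data, yet the correlation flips sign, so for unordered categories the values in the tables above are best read as a rough hint rather than a measurement.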

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='conclusions'></a>

----------

## Conclusions

<i><p style="font-size:22px; background-color: #FFE1D7; border: 2px solid black; margin: 20px; padding: 20px;">The dataset has a good variety of detail, even though it is imbalanced in places. We can see how some locations are hotspots regardless of time, while other locations are very much temporal and change with the hours, days, weeks, months and seasons. Our data is limited to 2016, and only six months of it. The additional location-based and time-based fields have certainly helped us see the taxi trips in a different light than we could without them. All of the graphs show how usage is skewed towards certain locations, time factors and their combinations.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>