# Data Description

## Overview

Every year, there are over 85 million taxi rides in New York City, crisscrossing the city with distinctive yellow cabs. Our project seeks to explore the key factors correlated with taxi ridership in NYC. We drew on and combined two sources of data:
1. Detailed trip-level data on daily NYC taxi ridership from January to August 2019, obtained from the New York City Taxi and Limousine Commission (TLC) website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The data came as 8 monthly datasets of yellow taxi ridership, each of them large (7+ million observations, 700+ MB).
2. Daily NYC weather data (from Central Park) for January to August 2019, scraped from the National Weather Service website (https://w2.weather.gov/climate/xmacis.php?wfo=okx).

Our data was preprocessed by drawing a random sample from each of the 8 monthly taxi datasets, concatenating the 8 samples into one dataset, dropping illogical entries (e.g. negative taxi fares) and null values, and then joining the result to the NYC weather dataset, as further explained below.


## What are the observations (rows) and the attributes (columns)?


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv("nyc_taxis_weather_jantoaug19s.csv")
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113852 entries, 0 to 113851
Data columns (total 29 columns):
pickup_datetime          113852 non-null object
dropoff_datetime         113852 non-null object
PULocationID             113852 non-null int64
DOLocationID             113852 non-null int64
RatecodeID               113852 non-null float64
congestion_surcharge     113852 non-null float64
extra                    113852 non-null float64
fare_amount              113852 non-null float64
improvement_surcharge    113852 non-null float64
mta_tax                  113852 non-null float64
passenger_count          113852 non-null float64
payment_type             113852 non-null float64
store_and_fwd_flag       113852 non-null object
tip_amount               113852 non-null float64
tolls_amount             113852 non-null float64
total_amount             113852 non-null float64
trip_distance            113852 non-null float64
pickup_dayofweek         113852 non-null int64
trip_duration_mi



Each observation (row) is a single trip taken in a yellow taxi in NYC.

In [3]:
list(df.columns)

['pickup_datetime',
 'dropoff_datetime',
 'PULocationID',
 'DOLocationID',
 'RatecodeID',
 'congestion_surcharge',
 'extra',
 'fare_amount',
 'improvement_surcharge',
 'mta_tax',
 'passenger_count',
 'payment_type',
 'store_and_fwd_flag',
 'tip_amount',
 'tolls_amount',
 'total_amount',
 'trip_distance',
 'pickup_dayofweek',
 'trip_duration_mins',
 'date',
 'maxtemp',
 'mintemp',
 'avetemp',
 'departuretemp',
 'hdd',
 'cdd',
 'precipitation',
 'newsnow',
 'snowdepth']

The attributes provide detailed information on various aspects of a taxi ride. These include:

- **pickup_datetime/dropoff_datetime:** The date and time when the taximeter is engaged/disengaged in a taxi ride respectively
- **PULocationID/DOLocationID:** The Taxi Zone (one of the 263 zones listed on the TLC website) in which the taximeter was engaged/disengaged
- **RatecodeID:** The final rate code in effect at the end of the trip.
 - 1 = Standard rate
 - 2 = JFK
 - 3 = Newark
 - 4 = Nassau or Westchester
 - 5 = Negotiated fare
 - 6 = Group ride
- **congestion_surcharge:** \\$2.50 surcharge for non-shared trips in taxicabs
- **extra:** Miscellaneous extras and surcharges. Currently, this only includes the \\$0.50 and \\$1 rush hour and overnight charges
- **fare_amount:** The time-and-distance fare calculated by the meter
- **improvement_surcharge:** \\$0.30 improvement surcharge assessed on trips at the flag drop
- **mta_tax:** \\$0.50 MTA tax that is automatically triggered based on the metered rate in use.
- **passenger_count:** Number of passengers in a taxi
- **payment_type:** A numeric code signifying how the passenger paid for the trip.
 - 1 = Credit card
 - 2 = Cash
 - 3 = No charge
 - 4 = Dispute
 - 5 = Unknown
 - 6 = Voided trip
- **store_and_fwd_flag:** Indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
- **tip_amount:** Amount of credit card tips. Cash tips are not included.
- **total_amount:** The total amount charged to passengers. Does not include cash tips.
- **trip_distance:** The elapsed trip distance in miles reported by the taximeter.
- **pickup_dayofweek** _(column created by us)_ **:** The day of the week of pick-up, coded from Monday = 0 to Sunday = 6
- **trip_duration_mins** _(column created by us)_ **:** The duration of a trip, computed by finding the time difference between `pickup_datetime` and `dropoff_datetime`
- **date:** The date of a taxi ride (used by us to verify a successful join with the weather dataset)
- **maxtemp:** The daily maximum temperature recorded at Central Park (degrees F)
- **mintemp:** The daily minimum temperature recorded at Central Park (degrees F)
- **avetemp:** The daily average temperature recorded at Central Park (degrees F)
- **departuretemp:** The average temperature departure from normal (degrees F)
- **hdd:** Heating degree days recorded at Central Park (base 65)
- **cdd:** Cooling degree days recorded at Central Park (base 65)
- **precipitation:** The daily precipitation recorded at Central Park (inches)
- **newsnow:** The daily new snowfall recorded at Central Park (inches)
- **snowdepth:** The daily snow depth recorded at Central Park (inches)
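
The coded attributes above can be turned into readable labels before plotting or grouping. The following is a minimal sketch, assuming the code-to-label mappings listed for `RatecodeID` and `payment_type`; the small `df` here is made-up toy data, and the `ratecode_label` and `payment_label` column names are hypothetical.

```python
import pandas as pd

# Code-to-label mappings, copied from the attribute descriptions above.
RATECODE_LABELS = {
    1: "Standard rate",
    2: "JFK",
    3: "Newark",
    4: "Nassau or Westchester",
    5: "Negotiated fare",
    6: "Group ride",
}
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# Toy rows standing in for the real dataset (both columns are stored as float64).
df = pd.DataFrame({"RatecodeID": [1.0, 2.0, 5.0], "payment_type": [1.0, 2.0, 1.0]})

# Cast the float codes to int, then look each one up in its mapping.
df["ratecode_label"] = df["RatecodeID"].astype(int).map(RATECODE_LABELS)
df["payment_label"] = df["payment_type"].astype(int).map(PAYMENT_LABELS)
```

Any code that does not appear in a mapping becomes `NaN` in the label column, which makes unexpected codes easy to spot.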

## Why was this dataset created?

This dataset was created to analyze the potential impact of various factors on taxi ridership in New York City, with particular attention given to weather, time of day, payment method, and location. It also allows for the exploration of potential correlations between different attributes, and seeks to answer questions such as "How are duration of ride and time of day related?", "Are there trends in popular pick-up locations in the late-night hours compared to the morning?", and "What is the correlation, if any, between payment method and pick-up location?" Ultimately, this dataset was created to learn more about the trends and patterns that affect the vast network of taxis, as well as taxi riders, across New York City.

This dataset draws on information from two other datasets: one published by the NYC Taxi and Limousine Commission and one published by the National Weather Service. The taxi dataset was created for reasons similar to ours: to observe changing trends in taxi ridership across the city and, hopefully, to inform decisions in the industry and improve the overall taxi system. The weather dataset was created for reasons entirely separate from taxi riding and is instead part of a much larger data collection conducted across the country for the purposes of predicting trends in weather and climate, from daily temperatures to natural disasters.

## Who funded the creation of the dataset?

This dataset did not receive any external funding and was independently compiled from external data sources by students at Cornell University. The raw taxi dataset was collected by various technology providers funded and authorized by the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The weather dataset was collected from sources at the National Climatic Data Center, a part of the National Oceanic and Atmospheric Administration, which is funded by the United States Department of Commerce.

## What processes might have influenced what data was observed and recorded and what was not?

### Taxi Dataset

In the taxi dataset, the type of data observed and recorded is influenced by concerns of usefulness, measurability and privacy. 

Certain attributes (such as `fare_amount`, `PULocationID` and `trip_distance`) are useful as they help TLC assess the health and profitability of the industry it is regulating. TLC may act on those attributes to draft new policies (e.g. enforcing the fare rate in taxis, buying taxi models that are more fuel-efficient for short trips). On the other hand, an attribute like "

Other attributes (such as `store_and_fwd_flag`) are collected just because they are measurable. While the significance of knowing whether a trip record was held in vehicle memory is unclear, the ease of measuring this attribute with taxi sensors likely led to its inclusion in the taxi dataset. The lack of measurability is also why certain useful attributes (such as "satisfaction rate of customers" or "fatigue level of drivers") are not included in the dataset&mdash;they are simply not measurable with current sensors in taxis.

Privacy concerns limit both what attributes can be observed and the granularity of observations. For instance, due to privacy laws, we cannot record attributes like "passenger age" or "passenger occupation", no matter how useful/measurable those attributes might be. Also, we can only observe pick-up/drop-off location data as coded into general zones (LocationIDs), rather than in a more granular form such as "pick-up/drop-off address". 

### Weather Dataset

In the weather dataset, the type of data observed and recorded is influenced by comparability and measurability.

Given that weather data is often aggregated and averaged by region (e.g. NYC), and compared among regions (e.g. NYC vs Boston), the type of data observed is determined by its suitability for comparison. Due to this focus on comparability, the attributes observed (e.g. `maxtemp`, `mintemp`, `precipitation`) are standard across weather stations in the US, and they are measured in standard units (e.g. degrees Fahrenheit, inches).

Measurability is a key factor as niche equipment is needed to observe various aspects of weather precisely. In Central Park, weather data is collected by an automated weather station on Belvedere Castle. The attributes that are observed are based on what can be reliably measured by an automated system with little human oversight. 

## What preprocessing was done, and how did the data come to be in the form that you are using?

### Step 1: Data Collection and Concatenation

- The 8 monthly taxi datasets were downloaded from the TLC website, constituting 57,080,500 taxi trips and 5.2GB of data in all. For each month, a random sample of 0.2% of observations was drawn, adding up to around 14,200 taxi trips per month. Each random sample was screened using _Series.unique()_ to ensure that there was an even distribution of days in the month.
- The 8 samples were concatenated into a single taxi dataset ("taxis_jantoaug19s.csv") that contained 114,161 taxi trips from January to August 2019. 
- The NYC (Central Park) weather dataset from January to August 2019 was scraped from the National Weather Service website and saved as "nyc_weather_jantoaug19.csv".
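
The sampling and concatenation in this step might look roughly like the sketch below. It is an illustration under stated assumptions, not the exact code we ran: the `sample_month` helper, the `seed` value and the two toy monthly tables are hypothetical, and the default `frac` of 0.002 is simply the ratio of 114,161 sampled trips to 57,080,500 total trips.

```python
import pandas as pd

def sample_month(df_month: pd.DataFrame, frac: float = 0.002, seed: int = 0) -> pd.DataFrame:
    """Draw a simple random sample of rows from one monthly trip table."""
    return df_month.sample(frac=frac, random_state=seed)

# Toy stand-ins for two monthly files (the real ones hold 7+ million rows each).
jan = pd.DataFrame({"pickup_datetime": pd.date_range("2019-01-01", periods=5000, freq="8min")})
feb = pd.DataFrame({"pickup_datetime": pd.date_range("2019-02-01", periods=5000, freq="8min")})

# Sample each month separately, then stack the samples into one table.
samples = [sample_month(month) for month in (jan, feb)]
taxis = pd.concat(samples, ignore_index=True)

# Screen one sample: which days of the month does it contain?
days_covered = samples[0]["pickup_datetime"].dt.day.unique()
```

Fixing `random_state` makes the sample reproducible, which helps when the monthly files are too large to keep re-reading.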

### Step 2: Data Cleaning

- Irregularities in "taxis_jantoaug19s.csv" were detected and fixed. In particular, 7 entries with `pickup_datetime` or `dropoff_datetime` corresponding to non-2019 years were deleted, and 189 entries with negative `fare_amount`, `total_amount` or `congestion_surcharge` values were deleted.
- The `VendorID` attribute was dropped due to its limited relevance. 
- 118 null values across 5 attributes&mdash;`RatecodeID`, `VendorID`, `passenger_count`, `payment_type` and `store_and_fwd_flag`&mdash;were dropped.
- The `pickup_dayofweek` attribute was created by applying _Series.dt.dayofweek_ to `pickup_datetime`.
- The `trip_duration_mins` attribute was created by finding the difference between `dropoff_datetime` and `pickup_datetime`, before dividing that by _np.timedelta64(1,'m')_.
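
The cleaning steps above can be sketched as follows. This is a minimal illustration on made-up rows rather than our original notebook code; the toy `taxis` table and the intermediate names (`in_2019`, `non_negative`, `clean`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows: one valid trip, one non-2019 trip, one negative fare, one null value.
taxis = pd.DataFrame({
    "pickup_datetime": ["2019-03-01 08:00:00", "2008-12-31 23:50:00",
                        "2019-03-02 09:15:00", "2019-03-03 10:00:00"],
    "dropoff_datetime": ["2019-03-01 08:20:00", "2009-01-01 00:05:00",
                         "2019-03-02 09:45:00", "2019-03-03 10:12:00"],
    "fare_amount": [12.5, 9.0, -4.0, 8.0],
    "total_amount": [15.3, 11.8, -4.8, 9.8],
    "congestion_surcharge": [2.5, 0.0, 0.0, 2.5],
    "passenger_count": [1.0, 2.0, 1.0, np.nan],
    "VendorID": [1, 2, 1, 2],
})

for col in ("pickup_datetime", "dropoff_datetime"):
    taxis[col] = pd.to_datetime(taxis[col])

# Keep trips that start and end in 2019.
in_2019 = (taxis["pickup_datetime"].dt.year == 2019) & (taxis["dropoff_datetime"].dt.year == 2019)

# Keep trips with non-negative amounts (NaN compares as False, so it is excluded too).
amount_cols = ["fare_amount", "total_amount", "congestion_surcharge"]
non_negative = (taxis[amount_cols] >= 0).all(axis=1)

# Drop VendorID, then drop any row that still contains a null value.
clean = taxis[in_2019 & non_negative].drop(columns="VendorID").dropna().copy()

# Derived columns: Monday = 0 ... Sunday = 6, and trip duration in minutes.
clean["pickup_dayofweek"] = clean["pickup_datetime"].dt.dayofweek
clean["trip_duration_mins"] = (
    clean["dropoff_datetime"] - clean["pickup_datetime"]
) / np.timedelta64(1, "m")
```

Only the first toy row survives: it is a Friday pick-up (`pickup_dayofweek` of 4) lasting 20 minutes.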

### Step 3: Data Integration

- The taxi and weather datasets were combined by left-joining "taxis_jantoaug19s.csv" to "nyc_weather_jantoaug19.csv" on `pd.to_datetime(df_taxis["pickup_datetime"]).dt.date` = `pd.to_datetime(df_weather['date']).dt.date`.
- The `pickup_datetime` and `date` fields in the joined dataset were checked to ensure a successful join. The joined dataset, containing 113,852 taxi rides, was saved as "nyc_taxis_weather_jantoaug19s.csv".
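
A left join of this kind, together with the check on `pickup_datetime` and `date`, could be sketched as below. The rows are made up for illustration (the weather values are not real observations), and the `join_date` helper column is hypothetical; `validate` and `indicator` are optional `merge` arguments used here only to make the check explicit.

```python
import pandas as pd

# Toy taxi trips and toy daily weather rows (illustrative values only).
taxis = pd.DataFrame({
    "pickup_datetime": ["2019-01-01 00:30:00", "2019-01-01 18:05:00", "2019-01-02 07:45:00"],
    "fare_amount": [9.5, 14.0, 7.0],
})
weather = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-02"],
    "maxtemp": [58, 40],
    "precipitation": [0.06, 0.0],
})

# Build matching calendar-date keys on both sides.
taxis["join_date"] = pd.to_datetime(taxis["pickup_datetime"]).dt.date
weather["date"] = pd.to_datetime(weather["date"]).dt.date

# Left join: every trip is kept and picks up the weather of its pick-up date.
joined = taxis.merge(
    weather,
    how="left",
    left_on="join_date",
    right_on="date",
    validate="many_to_one",  # raises if a date appears twice in the weather table
    indicator=True,          # adds a _merge column recording where each row matched
)

# Checks: no trip is left without weather, and the dates line up.
unmatched = int((joined["_merge"] != "both").sum())
dates_agree = bool((pd.to_datetime(joined["pickup_datetime"]).dt.date == joined["date"]).all())
```

A left join never drops trips, so the row count of `joined` should equal that of `taxis`; a nonzero `unmatched` would point to dates missing from the weather table.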

## If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?



No personal information or data was collected from individuals in the creation of the weather dataset or the taxi dataset. Although the taxi dataset indirectly involves people as taxi riders, who may not have been made aware that information about their ride would be recorded, the dataset has nothing to do with the individuals as people and focuses instead on information pertaining to the taxi ride itself, as collected and recorded by taxi companies. 

## Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box). 


# Potential Problems with Dataset

The potential problems that we foresee are minor irregularities in the data (e.g. an impossibly large `trip_distance` for a certain `fare_amount`, or an `extra` being levied for a non-rush-hour time). More obvious irregularities, such as negative fare amounts and non-2019 trips, have already been detected and removed, along with any null values in the data. Based on our preprocessing efforts thus far, the incidence of obvious irregularities in the data is low (~300 glaring irregularities for 114,000+ entries, or roughly a 0.3% rate). They all appear to be errors in data entry (perhaps due to human error or bugs in electronic data transmission) rather than systemic issues in data collection.
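
One way to surface the subtler irregularities mentioned above is to flag trips whose distance is implausible for their duration or fare. The sketch below is only a starting point for Exploratory Data Analysis: the rows are made up, and both thresholds (`MAX_SPEED_MPH` and `MAX_MILES_PER_DOLLAR`) are assumptions chosen for illustration, not values taken from TLC rules.

```python
import numpy as np
import pandas as pd

# Toy trips: the second is far too long for its time and fare, the last has zero duration.
trips = pd.DataFrame({
    "trip_distance": [2.1, 45.0, 0.0, 3.4],
    "trip_duration_mins": [12.0, 5.0, 8.0, 0.0],
    "fare_amount": [10.0, 6.5, 7.0, 13.0],
})

MAX_SPEED_MPH = 80.0         # assumed upper bound on average trip speed
MAX_MILES_PER_DOLLAR = 1.0   # assumed upper bound on distance covered per fare dollar

# Replace zeros with NaN before dividing so the ratios do not become infinite.
hours = (trips["trip_duration_mins"] / 60).replace(0, np.nan)
speed_mph = trips["trip_distance"] / hours
miles_per_dollar = trips["trip_distance"] / trips["fare_amount"].replace(0, np.nan)

flags = pd.DataFrame({
    "nonpositive_duration": trips["trip_duration_mins"] <= 0,
    "too_fast": speed_mph > MAX_SPEED_MPH,
    "far_for_fare": miles_per_dollar > MAX_MILES_PER_DOLLAR,
})

# A trip is a candidate for manual review if any check fires.
trips["suspect"] = flags.any(axis=1)
```

Flagged rows are candidates for inspection rather than automatic deletion, since some unusual trips (such as negotiated fares) may be legitimate.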

Hence, we are optimistic that we will not encounter major issues while processing and analyzing our dataset. However, we will maintain a healthy sense of scepticism while performing Exploratory Data Analysis, and investigate/resolve any potential errors in the data that we encounter. 