# Understanding Hired Rides in NYC

_[Project prompt](https://docs.google.com/document/d/1VERPjEZcC1XSs4-02aM-DbkNr_yaJVbFjLJxaYQswqA/edit#)_

_This scaffolding notebook may be used to help set up your final project. It's **totally optional** whether you make use of this or not._

_If you do use this notebook, everything provided is optional as well - you may remove or add prose and code as you wish._

_Anything in italics (prose) or comments (in code) is meant to provide you with guidance. **Remove the italic lines and provided comments** before submitting the project, if you choose to use this scaffolding. We don't need the guidance when grading._

_**All code below should be considered "pseudo-code" - not functional by itself, and only a suggestion at the approach.**_

## Requirements

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project._

* Code clarity: make sure the code conforms to:
    * [ ] [PEP 8](https://peps.python.org/pep-0008/) - You might find [this resource](https://realpython.com/python-pep8/) helpful as well as [this](https://github.com/dnanhkhoa/nb_black) or [this](https://jupyterlab-code-formatter.readthedocs.io/en/latest/) tool
    * [ ] [PEP 257](https://peps.python.org/pep-0257/)
    * [ ] Break each task down into logical functions
* The following files are submitted for the project (see the project's GDoc for more details):
    * [ ] `README.md`
    * [ ] `requirements.txt`
    * [ ] `.gitignore`
    * [ ] `schema.sql`
    * [ ] 6 query files (using the `.sql` extension), appropriately named for the purpose of the query
    * [x] Jupyter Notebook containing the project (this file!)
* [x] You can edit this cell and add an `x` inside the `[ ]` like this task to denote a completed task

## Project Setup

In [1]:
# all import statements needed for the project, for example:

import math
import bs4
import matplotlib.pyplot as plt
import pandas as pd
import requests
import sqlalchemy as db
import os
import pyarrow.parquet as pq
import datetime
import re
from bs4 import BeautifulSoup
import geopandas as gpd

In [2]:
# any constants you might need, for example:

TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
# add other constants to refer to any local data, e.g. uber & weather
UBER_CSV = "uber_rides_sample.csv"

NEW_YORK_BOX_COORDS = ((40.560445, -74.242330), (40.908524, -73.717047))

DATABASE_URL = "sqlite:///project.db"
DATABASE_SCHEMA_FILE = "schema.sql"
QUERY_DIRECTORY = "queries"

## Part 1: Data Preprocessing

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Define a function that calculates the distance between two coordinates in kilometers that **only uses the `math` module** from the standard library.
* [ ] Taxi data:
    * [ ] Use the `re` module, and the packages `requests`, BeautifulSoup (`bs4`), and (optionally) `pandas` to programmatically download the required CSV files & load into memory.
    * You may need to do this one file at a time - download, clean, sample. You can cache the sampling by saving it as a CSV file (and thereby freeing up memory on your computer) before moving onto the next file. 
* [ ] Weather & Uber data:
    * [ ] Download the data manually from the link provided in the project doc.
* [ ] All data:
    * [ ] Load the data using `pandas`
    * [ ] Clean the data, including:
        * Remove unnecessary columns
        * Remove invalid data points (take a moment to consider what's invalid)
        * Normalize column names
        * (Taxi & Uber data) Remove trips that start and/or end outside the designated [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)
    * [ ] (Taxi data) Sample the data so that you have roughly the same amount of data points over the given date range for both Taxi data and Uber data.
* [ ] Weather data:
    * [ ] Split into two `pandas` DataFrames: one for required hourly data, and one for the required daily data.
    * [ ] You may find that the weather data you need later on does not exist at the frequency needed (daily vs hourly). You may calculate/generate samples from one to populate the other. Just document what you’re doing so we can follow along. 

### Calculating distance
_**TODO:** Write some prose that tells the reader what you're about to do here._
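
One common way to meet the `math`-only requirement is the haversine formula, which gives the great-circle distance between two latitude/longitude points. The sketch below is only a suggestion: it assumes a mean Earth radius of 6371 km and coordinates in decimal degrees, and the name `calculate_distance` is a placeholder rather than something the project prescribes.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def calculate_distance(from_coord, to_coord):
    """Return the great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, from_coord)
    lat2, lon2 = map(math.radians, to_coord)
    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    # Haversine of the central angle between the two points.
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Calling `calculate_distance(*NEW_YORK_BOX_COORDS)` would give the diagonal of the bounding box, which can serve as a rough upper bound when looking for implausible trip distances.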

### Processing Taxi Data

_**TODO:** Write some prose that tells the reader what you're about to do here._
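
The checklist asks for the `re` module; one possible use is to keep only the monthly files that fall inside the period analysed in Part 3 (01-2009 through 06-2015). The sketch below assumes file names follow the `yellow_tripdata_YYYY-MM.parquet` pattern, and `filter_taxi_urls` is a hypothetical helper, not part of the provided scaffolding.

```python
import re

# Assumed file-name pattern: yellow_tripdata_YYYY-MM.parquet
TAXI_FILE_PATTERN = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})\.parquet")


def filter_taxi_urls(urls, start=(2009, 1), end=(2015, 6)):
    """Keep URLs whose (year, month) falls within [start, end], inclusive."""
    selected = []
    for url in urls:
        match = TAXI_FILE_PATTERN.search(url.strip())
        if match is None:
            continue  # not a yellow taxi Parquet link
        year_month = (int(match.group(1)), int(match.group(2)))
        # Tuples compare element by element, so this is a date-range check.
        if start <= year_month <= end:
            selected.append(url.strip())
    return selected
```

Filtering the scraped links this way before downloading avoids fetching months that are never queried.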

In [52]:
def find_taxi_pq_urls():
    """Scrape the TLC trip-record page and return the yellow taxi Parquet URLs."""
    response = requests.get(TAXI_URL)
    if response.status_code != 200:
        print("Failure")
        return []  # return an empty list so callers can still iterate
    print("Success")
    results_page = BeautifulSoup(response.content, 'lxml')
    lists = results_page.find('div', {"class": "faq-v1"})

    trip_datalist = []
    # Collect every link titled "Yellow Taxi Trip Records" inside the answer blocks.
    for item in lists.find_all('div', {"class": "faq-answers"}):
        for link in item.find_all('a', {"title": "Yellow Taxi Trip Records"}):
            trip_datalist.append(link.get("href"))
    return trip_datalist

In [53]:
find_taxi_pq_urls()

Success


['https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-04.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-05.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-06.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-07.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-08.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-09.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet',
 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.pa

In [54]:
cd

C:\Users\29311


In [62]:
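shapefile = gpd.read_file("Desktop/CU/E4501/taxi_zones/taxi_zones.shp")

# Reproject the GeoDataFrame to a geographic CRS (EPSG:4326) so that
# centroid coordinates are expressed as longitude/latitude in degrees.
shapefile = shapefile.to_crs(4326)

shapefile["lon"] = shapefile["geometry"].apply(lambda m: m.centroid.x)
shapefile["lat"] = shapefile["geometry"].apply(lambda m: m.centroid.y)

# Overwrite the LocationID of rows 56, 103 and 104 directly on the
# GeoDataFrame so that the lookups below see the corrected IDs.
shapefile.at[56, "LocationID"] = 57
shapefile.at[103, "LocationID"] = 104
shapefile.at[104, "LocationID"] = 105
shapefile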

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,geometry,lon,lat
0,1,0.116357,0.000782,Newark Airport,1,EWR,"POLYGON ((-74.18445 40.69500, -74.18449 40.695...",-74.174000,40.691831
1,2,0.433470,0.004866,Jamaica Bay,2,Queens,"MULTIPOLYGON (((-73.82338 40.63899, -73.82277 ...",-73.831299,40.616745
2,3,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,"POLYGON ((-73.84793 40.87134, -73.84725 40.870...",-73.847422,40.864474
3,4,0.043567,0.000112,Alphabet City,4,Manhattan,"POLYGON ((-73.97177 40.72582, -73.97179 40.725...",-73.976968,40.723752
4,5,0.092146,0.000498,Arden Heights,5,Staten Island,"POLYGON ((-74.17422 40.56257, -74.17349 40.562...",-74.188484,40.552659
...,...,...,...,...,...,...,...,...,...
258,259,0.126750,0.000395,Woodlawn/Wakefield,259,Bronx,"POLYGON ((-73.85107 40.91037, -73.85207 40.909...",-73.852215,40.897932
259,260,0.133514,0.000422,Woodside,260,Queens,"POLYGON ((-73.90175 40.76078, -73.90147 40.759...",-73.906306,40.744235
260,261,0.027120,0.000034,World Trade Center,261,Manhattan,"POLYGON ((-74.01333 40.70503, -74.01327 40.704...",-74.013023,40.709139
261,262,0.049064,0.000122,Yorkville East,262,Manhattan,"MULTIPOLYGON (((-73.94383 40.78286, -73.94376 ...",-73.946510,40.775932


In [63]:
def download(url: str, dest_folder: str):
    """Stream url into dest_folder and return (file_path, filename)."""
    os.makedirs(dest_folder, exist_ok=True)  # create folder if it does not exist

    filename = url.split('/')[-1].replace(" ", "_")  # be careful with file names
    file_path = os.path.join(dest_folder, filename)

    r = requests.get(url, stream=True)
    if r.ok:
        print("saving to", os.path.abspath(file_path))
        # The file is flushed and closed when the with-block exits.
        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 8):
                if chunk:
                    f.write(chunk)
    else:  # HTTP status code 4XX/5XX
        print("Download failed: status code {}\n{}".format(r.status_code, r.text))
    return file_path, filename

In [64]:
def get_lat_lon_from_ID(locID, shapefile):
    """Return the (lat, lon) centroid of a taxi zone, or (None, None) if the ID is unknown."""
    match = shapefile[shapefile["LocationID"] == locID]
    if match.empty:
        # Unknown IDs no longer raise an IndexError; they yield missing coordinates.
        return None, None
    return match["lat"].values[0], match["lon"].values[0]

In [65]:
def set_lat_lon(origin_data, filename):
    """Add pickup/drop-off centroid coordinates to files that carry location IDs."""
    origin_data = pd.DataFrame(origin_data)
    if "PULocationID" not in origin_data.columns:
        return origin_data  # files without location IDs are returned unchanged
    try:
        origin_data["PU_lat"] = origin_data["PULocationID"].apply(lambda x: get_lat_lon_from_ID(x, shapefile)[0])
        origin_data["PU_lon"] = origin_data["PULocationID"].apply(lambda x: get_lat_lon_from_ID(x, shapefile)[1])
        origin_data["DO_lat"] = origin_data["DOLocationID"].apply(lambda x: get_lat_lon_from_ID(x, shapefile)[0])
        origin_data["DO_lon"] = origin_data["DOLocationID"].apply(lambda x: get_lat_lon_from_ID(x, shapefile)[1])
    except Exception:
        # Fall back to 0 so that clean_data drops these rows as out of bounds.
        origin_data["PU_lat"] = origin_data["PU_lon"] = origin_data["DO_lat"] = origin_data["DO_lon"] = 0
    return origin_data

In [66]:
def clean_data(renamed_data):
    """Drop trips that start or end outside NEW_YORK_BOX_COORDS."""
    origin_data = renamed_data.reset_index()
    if "PU_lat" not in origin_data.columns:
        raise KeyError("PU_lat column is missing; coordinates must be set before cleaning")
    (min_lat, min_lon), (max_lat, max_lon) = NEW_YORK_BOX_COORDS
    # between() is inclusive and False for NaN, so rows without coordinates are dropped too.
    inside = (
        origin_data["PU_lat"].between(min_lat, max_lat)
        & origin_data["PU_lon"].between(min_lon, max_lon)
        & origin_data["DO_lat"].between(min_lat, max_lat)
        & origin_data["DO_lon"].between(min_lon, max_lon)
    )
    for i, row in origin_data[~inside].iterrows():
        print("OUT", i, row["PU_lat"], row["PU_lon"])
        print("remove", i)
    return origin_data.drop(origin_data.index[~inside], axis=0)

In [67]:
def rename_datetime(cleaned_data):
    if "vendor_name" in cleaned_data.columns:
        cleaned_data = cleaned_data.rename({'vendor_name': 'VendorID','Passenger_Count':'passenger_count','Trip_Distance':'trip_distance','Rate_Code':'RatecodeID','store_and_forward':'store_and_fwd_flag','Tip_Amt':'tip_amount','Fare_Amt':'fare_amount','Total_Amt':'total_amount','Trip_Pickup_DateTime': 'tpep_pickup_datetime', 'Trip_Dropoff_DateTime': 'tpep_dropoff_datetime','Start_Lat':'PU_lat','Start_Lon':'PU_lon','End_Lat':'DO_lat','End_Lon':'DO_lon'}, axis='columns')
    if "vendor_id" in cleaned_data.columns:
        cleaned_data = cleaned_data.rename({'vendor_id': 'VendorID','rate_code':'RatecodeID','pickup_datetime': 'tpep_pickup_datetime', 'dropoff_datetime': 'tpep_dropoff_datetime','pickup_latitude':'PU_lat','pickup_longitude':'PU_lon','dropoff_latitude':'DO_lat','dropoff_longitude':'DO_lon'}, axis='columns')
    return cleaned_data    


In [81]:
def set_months(renamed_data):
    """Add year, month, day and hour columns derived from the pickup time."""
    # Parse the column once instead of calling strptime four times per row.
    pickup = pd.to_datetime(renamed_data["tpep_pickup_datetime"])
    renamed_data["year"] = pickup.dt.year
    renamed_data["month"] = pickup.dt.month
    renamed_data["day"] = pickup.dt.day
    renamed_data["hour"] = pickup.dt.hour

    return renamed_data

In [82]:
#def drop_columns(df):
#    df = df.drop('column_name', axis=1)

In [83]:
def select_columns(taxi_data):
    """Rename columns to the final schema and keep only the ones needed later."""
    taxi_data = taxi_data.rename({'tpep_pickup_datetime': 'pickup_datetime', 'tpep_dropoff_datetime': 'dropoff_datetime','PU_lat':'pickup_latitude','PU_lon':'pickup_longitude','DO_lat':'dropoff_latitude','DO_lon':'dropoff_longitude','trip_distance':'distance','tip_amount':'tip'}, axis='columns')
    final = taxi_data[['year','month','hour','pickup_datetime','dropoff_datetime','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude','distance','tip','passenger_count']]
    # Return a fresh frame rather than resetting the index of a slice in place.
    final = final.reset_index(drop=True)
    print(final)
    return final

In [84]:
def get_and_clean_taxi_data():
    """Download, sample, clean and combine every monthly yellow taxi file."""
    all_taxi_dataframes = []
    all_pq_urls = find_taxi_pq_urls()
    for pq_url in all_pq_urls:
        print(pq_url)
        file_path, filename = download(pq_url, "OriginalTaxiData")
        print(file_path, filename)
        # Each file covers one month; keep a random sample of 1250 trips.
        monthly_data = pd.read_parquet(file_path, engine='auto').sample(n=1250)
        lat_lon_data = set_lat_lon(monthly_data, filename)

        renamed_data = rename_datetime(lat_lon_data)
        cleaned_data = clean_data(renamed_data)  # data with lat and lon and within the box

        clean_month_taxi_data = set_months(cleaned_data)

        # Cache the cleaned sample for this month as a CSV file.
        selected_data = select_columns(clean_month_taxi_data)
        selected_data.to_csv("Downloads/1208/AllTaxi" + filename + ".csv")
        all_taxi_dataframes.append(clean_month_taxi_data)
    taxi_data = pd.concat(all_taxi_dataframes)
    return select_columns(taxi_data)


In [85]:
taxi_data=get_and_clean_taxi_data()

Success
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
saving to C:\Users\29311\OriginalTaxiData\yellow_tripdata_2022-01.parquet
OriginalTaxiData\yellow_tripdata_2022-01.parquet yellow_tripdata_2022-01.parquet
OUT 21 nan nan
remove 21
OUT 58 nan nan
remove 58
OUT 84 40.74033744175701 -73.99045782354732
remove 84
OUT 133 40.762252755319366 -73.98984464313301
remove 133
OUT 376 nan nan
remove 376
OUT 427 nan nan
remove 427
OUT 472 nan nan
remove 472
OUT 512 nan nan
remove 512
OUT 554 nan nan
remove 554
OUT 603 40.7147325069394 -73.98302455833488
remove 603
OUT 645 40.64698489239515 -73.78653298334973
remove 645
OUT 658 40.63790012347376 -73.9609682490637
remove 658
OUT 664 40.64698489239515 -73.78653298334973
remove 664
OUT 695 nan nan
remove 695
OUT 879 40.69454235438129 -73.83092407449995
remove 879
OUT 883 nan nan
remove 883
OUT 916 nan nan
remove 916
OUT 951 40.74991407790216 -73.97044256869235
remove 951
OUT 954 nan nan
remove 954
OUT 1036 nan nan
re

KeyboardInterrupt: 

In [79]:
taxi_data#.reset_index()
taxi_data.to_csv('Downloads/1208/------ALLLLLLLTaxi------.csv')

### Processing Uber Data

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def load_and_clean_uber_data(csv_file):
    raise NotImplementedError()

In [None]:
def get_uber_data():
    uber_dataframe = load_and_clean_uber_data(UBER_CSV)
    add_distance_column(uber_dataframe)
    return uber_dataframe

### Processing Weather Data

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def clean_month_weather_data_hourly(csv_file):
    raise NotImplementedError()

In [None]:
def clean_month_weather_data_daily(csv_file):
    raise NotImplementedError()

In [None]:
def load_and_clean_weather_data():
    hourly_dataframes = []
    daily_dataframes = []
    
    # add some way to find all weather CSV files
    # or just add the name/paths manually
    weather_csv_files = ["TODO"]
    
    for csv_file in weather_csv_files:
        hourly_dataframe = clean_month_weather_data_hourly(csv_file)
        daily_dataframe = clean_month_weather_data_daily(csv_file)
        hourly_dataframes.append(hourly_dataframe)
        daily_dataframes.append(daily_dataframe)
        
    # create two dataframes with hourly & daily data from every month
    hourly_data = pd.concat(hourly_dataframes)
    daily_data = pd.concat(daily_dataframes)
    
    return hourly_data, daily_data

### Process All Data

_This is where you can actually execute all the required functions._

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
taxi_data = get_and_clean_taxi_data()
uber_data = get_uber_data()
hourly_weather_data, daily_weather_data = load_and_clean_weather_data()

## Part 2: Storing Cleaned Data

_Write some prose that tells the reader what you're about to do here._

In [None]:
engine = db.create_engine(DATABASE_URL)

In [None]:
# if using SQL (as opposed to SQLAlchemy), define the commands 
# to create your 4 tables/dataframes
HOURLY_WEATHER_SCHEMA = """
TODO
"""

DAILY_WEATHER_SCHEMA = """
TODO
"""

TAXI_TRIPS_SCHEMA = """
TODO
"""

UBER_TRIPS_SCHEMA = """
TODO
"""

In [None]:
# create that required schema.sql file
with open(DATABASE_SCHEMA_FILE, "w") as f:
    f.write(HOURLY_WEATHER_SCHEMA)
    f.write(DAILY_WEATHER_SCHEMA)
    f.write(TAXI_TRIPS_SCHEMA)
    f.write(UBER_TRIPS_SCHEMA)

In [None]:
# create the tables with the schema files
with engine.connect() as connection:
    pass
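
One way to fill in the cell above is to run the whole schema script against the SQLite database. The sketch below uses the standard library's `sqlite3` module instead of the SQLAlchemy engine, because `executescript` accepts several `CREATE TABLE` statements in one call. The table and column names are hypothetical, and an in-memory database stands in for `project.db`; in the notebook the script text would be read from `DATABASE_SCHEMA_FILE`.

```python
import sqlite3

# Hypothetical schema; the real column list depends on your cleaned dataframes.
EXAMPLE_SCHEMA = """
CREATE TABLE IF NOT EXISTS taxi_trips (
    id INTEGER PRIMARY KEY,
    pickup_datetime TEXT,
    distance REAL,
    tip REAL
);
CREATE TABLE IF NOT EXISTS hourly_weather (
    id INTEGER PRIMARY KEY,
    observed_at TEXT,
    precipitation REAL,
    wind_speed REAL
);
"""


def create_tables(connection, schema_sql):
    """Run every statement in schema_sql and return the resulting table names."""
    connection.executescript(schema_sql)
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


connection = sqlite3.connect(":memory:")  # in-memory database for illustration
tables = create_tables(connection, EXAMPLE_SCHEMA)
connection.close()
```

Using `CREATE TABLE IF NOT EXISTS` keeps the cell safe to re-run, which is convenient when the notebook is executed top to bottom more than once.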

### Add Data to Database

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def write_dataframes_to_table(table_to_df_dict):
    raise NotImplementedError()

In [None]:
map_table_name_to_dataframe = {
    "taxi_trips": taxi_data,
    "uber_trips": uber_data,
    "hourly_weather": hourly_weather_data,
    "daily_weather": daily_weather_data,
}

In [None]:
write_dataframes_to_table(map_table_name_to_dataframe)

## Part 3: Understanding the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] For 01-2009 through 06-2015, what hour of the day was the most popular to take a yellow taxi? The result should have 24 bins.
* [ ] For the same time frame, what day of the week was the most popular to take an Uber? The result should have 7 bins.
* [ ] What is the 95th percentile of distance traveled for all hired trips during July 2013?
* [ ] What were the top 10 days with the highest number of hired rides for 2009, and what was the average distance for each day?
* [ ] Which 10 days in 2014 were the windiest, and how many hired trips were made on those days?
* [ ] During Hurricane Sandy in NYC (Oct 29-30, 2012) and the week leading up to it, how many trips were taken each hour, and for each hour, how much precipitation did NYC receive and what was the sustained wind speed?
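
SQLite has no built-in percentile function, so the 95th-percentile question needs a workaround. A minimal sketch of one option, the nearest-rank method, is shown here: sort the distances and skip `ceil(0.95 * n) - 1` rows. The `trips` table, its columns, and the sample rows are all hypothetical and stand in for the combined taxi and Uber data.

```python
import sqlite3

# Hypothetical table: one row per hired trip (taxi and Uber combined).
PERCENTILE_QUERY = """
SELECT distance
FROM trips
WHERE pickup_datetime >= '2013-07-01' AND pickup_datetime < '2013-08-01'
ORDER BY distance
LIMIT 1
OFFSET (
    SELECT (COUNT(*) * 95 + 99) / 100 - 1  -- integer form of ceil(0.95 * n) - 1
    FROM trips
    WHERE pickup_datetime >= '2013-07-01' AND pickup_datetime < '2013-08-01'
)
"""

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE trips (pickup_datetime TEXT, distance REAL)")
# 100 July trips with distances 1..100 km, plus one August trip that is ignored.
rows = [("2013-07-15 08:00:00", float(km)) for km in range(1, 101)]
rows.append(("2013-08-01 09:00:00", 500.0))
connection.executemany("INSERT INTO trips VALUES (?, ?)", rows)
percentile_95 = connection.execute(PERCENTILE_QUERY).fetchone()[0]
connection.close()
```

The date filter relies on timestamps being stored as ISO 8601 text, so that string comparison matches chronological order. Other percentile definitions (for example, with interpolation) can give slightly different values.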

In [None]:
def write_query_to_file(query, outfile):
    raise NotImplementedError()
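
One possible implementation of this stub is sketched below. It assumes the query files belong in the folder named by `QUERY_DIRECTORY`; the extra `directory` parameter and the returned path are additions for illustration, not part of the original signature.

```python
import os

QUERY_DIRECTORY = "queries"  # same value as the constant defined in Project Setup


def write_query_to_file(query, outfile, directory=QUERY_DIRECTORY):
    """Write a SQL query string to directory/outfile and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder on first use
    path = os.path.join(directory, outfile)
    with open(path, "w", encoding="utf-8") as f:
        # Strip the blank lines left by triple-quoted strings; end with one newline.
        f.write(query.strip() + "\n")
    return path
```

With the default argument, `write_query_to_file(QUERY_N, "some_descriptive_name.sql")` would write to `queries/some_descriptive_name.sql`.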

### Query N

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each query_

In [None]:
QUERY_N = """
TODO
"""

In [None]:
with engine.connect() as connection:
    print(connection.execute(db.text(QUERY_N)).fetchall())

In [None]:
write_query_to_file(QUERY_N, "some_descriptive_name.sql")

## Part 4: Visualizing the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Create an appropriate visualization for the first query/question in part 3
* [ ] Create a visualization that shows the average distance traveled per month (regardless of year - so group by each month). Include the 90% confidence interval around the mean in the visualization
* [ ] Define three lat/long coordinate boxes around the three major New York airports: LGA, JFK, and EWR (you can use bboxfinder to help). Create a visualization that compares what day of the week was most popular for drop offs for each airport.
* [ ] Create a heatmap of all hired trips over a map of the area. Consider using KeplerGL or another library that helps generate geospatial visualizations.
* [ ] Create a scatter plot that compares tip amount versus distance.
* [ ] Create another scatter plot that compares tip amount versus precipitation amount.

_Be sure these cells are executed so that the visualizations are rendered when the notebook is submitted._
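
For the 90% confidence interval around the monthly mean distance, one simple option is the normal approximation: mean ± z · s / √n, with z ≈ 1.645 for a two-sided 90% interval. The sketch below assumes each month has enough trips for that approximation to be reasonable; `mean_confidence_interval` is a hypothetical helper name.

```python
import math
import statistics

Z_90 = 1.645  # two-sided 90% critical value of the standard normal distribution


def mean_confidence_interval(values, z=Z_90):
    """Return (mean, lower, upper) using the normal approximation."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    margin = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - margin, mean + margin
```

The half-width `upper - mean` can be passed as the `yerr` argument of `axes.bar` or `axes.errorbar` to draw the interval on the plot.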

### Visualization N

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each visualization._

_The example below makes use of the `matplotlib` library. There are other libraries, including the built-in plotting of `pandas`, kepler for geospatial data representation, `seaborn`, and others._

In [None]:
# use a more descriptive name for your function
def plot_visual_n(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))
    
    values = "..."  # use the dataframe to pull out values needed to plot
    
    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other 
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style 
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")
    
    plt.show()

In [None]:
def get_data_for_visual_n():
    # Query SQL database for the data needed.
    # You can put the data queried into a pandas dataframe, if you wish
    raise NotImplementedError()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_n(some_dataframe)