# Understanding Hired Rides in NYC

_[Project prompt](https://docs.google.com/document/d/1VERPjEZcC1XSs4-02aM-DbkNr_yaJVbFjLJxaYQswqA/edit#)_

_This scaffolding notebook may be used to help set up your final project. It's **totally optional** whether you make use of this or not._

_If you do use this notebook, everything provided is optional as well - you may remove or add prose and code as you wish._

_Anything in italics (prose) or comments (in code) is meant to provide you with guidance. **Remove the italic lines and provided comments** before submitting the project, if you choose to use this scaffolding. We don't need the guidance when grading._

_**All code below should be considered "pseudo-code" - not functional by itself, and only a suggestion of the approach.**_

## Requirements

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project._

* Code clarity: make sure the code conforms to:
    * [ ] [PEP 8](https://peps.python.org/pep-0008/) - You might find [this resource](https://realpython.com/python-pep8/) helpful as well as [this](https://github.com/dnanhkhoa/nb_black) or [this](https://jupyterlab-code-formatter.readthedocs.io/en/latest/) tool
    * [ ] [PEP 257](https://peps.python.org/pep-0257/)
    * [ ] Break each task down into logical functions
* The following files are submitted for the project (see the project's GDoc for more details):
    * [ ] `README.md`
    * [ ] `requirements.txt`
    * [ ] `.gitignore`
    * [ ] `schema.sql`
    * [ ] 6 query files (using the `.sql` extension), appropriately named for the purpose of the query
    * [x] Jupyter Notebook containing the project (this file!)
* [x] You can edit this cell and add an `x` inside the `[ ]` like this task to denote a completed task

## Project Setup

In [1]:
# all import statements needed for the project, for example:

import math
import bs4
import requests
import sqlalchemy as db

import pandas as pd 
import numpy as np
import datetime as dt
import statsmodels.api as sm #statistics
from statsmodels.tsa.stattools import adfuller #ADF test
import matplotlib as mpl #plotting
import matplotlib.pyplot as plt
mpl.rcParams['font.family']='serif'
plt.style.use('seaborn') 

perc=[0.01,0.05,0.25,0.5,0.75,0.9,0.95,0.99]
def isid(data,variables): #check whether the given columns contain duplicates
    dup=data.duplicated(variables,keep=False)
    if True in dup.values:
        print(str(variables)+" Do NOT uniquely identify this dataset") 
    else:
        print(str(variables)+" uniquely identify this dataset")
import os #working directory
os.chdir('/Users/yw/Desktop/4501 Project') 

In [2]:
# any general notebook setup, like log formatting

In [3]:
# any constants you might need, for example:

TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
# add other constants to refer to any local data, e.g. uber & weather
UBER_CSV = "uber_rides_sample.csv"

NEW_YORK_BOX_COORDS = ((40.560445, -74.242330), (40.908524, -73.717047))

DATABASE_URL = "sqlite:///project.db"
DATABASE_SCHEMA_FILE = "schema.sql"
QUERY_DIRECTORY = "queries"

## Part 1: Data Preprocessing

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Define a function that calculates the distance between two coordinates in kilometers that **only uses the `math` module** from the standard library.
* [ ] Taxi data:
    * [ ] Use the `re` module, and the packages `requests`, BeautifulSoup (`bs4`), and (optionally) `pandas` to programmatically download the required CSV files & load into memory.
    * You may need to do this one file at a time - download, clean, sample. You can cache the sampling by saving it as a CSV file (and thereby freeing up memory on your computer) before moving onto the next file. 
* [ ] Weather & Uber data:
    * [ ] Download the data manually in the link provided in the project doc.
* [ ] All data:
    * [ ] Load the data using `pandas`
    * [ ] Clean the data, including:
        * Remove unnecessary columns
        * Remove invalid data points (take a moment to consider what's invalid)
        * Normalize column names
        * (Taxi & Uber data) Remove trips that start and/or end outside the designated [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)
    * [ ] (Taxi data) Sample the data so that you have roughly the same amount of data points over the given date range for both Taxi data and Uber data.
* [ ] Weather data:
    * [ ] Split into two `pandas` DataFrames: one for required hourly data, and one for the required daily data.
    * [ ] You may find that the weather data you need later on does not exist at the frequency needed (daily vs hourly). You may calculate/generate samples from one to populate the other. Just document what you’re doing so we can follow along. 

### Calculating distance
_**TODO:** Write some prose that tells the reader what you're about to do here._

$\text{distance} = 2r \arcsin \bigg( \sqrt{ \sin^2 \big( \frac{\phi_2-\phi_1}{2} \big) + \cos\phi_1 \cdot \cos\phi_2 \cdot \sin^2 \big( \frac{\lambda_2 - \lambda_1}{2} \big) } \bigg)$

$\phi: \text{latitude of points};\ \lambda: \text{longitude of points};\ r: \text{radius of sphere}$

def calculate_distance(from_coord, to_coord):
    """Return the haversine distance in km between two (longitude, latitude) pairs given in degrees."""
    r=6373 #Earth radius in km
    
    #use radians rather than degrees
    longitude1=math.radians(from_coord[0])
    latitude1=math.radians(from_coord[1])
    longitude2=math.radians(to_coord[0])
    latitude2=math.radians(to_coord[1])
    
    part1=(math.sin((latitude2-latitude1)/2))**2
    part2=math.cos(latitude1)*math.cos(latitude2)*(math.sin((longitude2-longitude1)/2))**2
    distance=2*r*math.asin(math.sqrt(part1+part2))
    return distance
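
As a quick sanity check on the haversine formula, the sketch below restates it with only the `math` module and evaluates one case that can be verified by hand: two points on the same meridian, one degree of latitude apart, should be about r·π/180 ≈ 111.2 km apart when r = 6373 km. The name `haversine_km` and the sample points are illustrative assumptions, not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6373  # same radius value the notebook uses


def haversine_km(from_coord, to_coord):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = (math.radians(v) for v in from_coord)
    lon2, lat2 = (math.radians(v) for v in to_coord)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude along the same meridian (hypothetical sample points).
one_degree = haversine_km((-74.0, 40.0), (-74.0, 41.0))
print(round(one_degree, 2))
```

Checks like this one (a known distance, a zero distance for identical points, symmetry in the two arguments) are cheap ways to catch a swapped latitude/longitude before the function is applied to every trip.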

In [4]:
def calculate_and_add_distance(data,length): #input: dataframe with pickup/dropoff longitude and latitude columns and a 0..length-1 index
    r=6373 #Earth radius in km
    
    #use radians rather than degrees
    for i in range(length):
        data.loc[i,'picklong']=math.radians(data.loc[i,'pickup_longitude'])
        data.loc[i,'picklat']=math.radians(data.loc[i,'pickup_latitude'])
        data.loc[i,'droplong']=math.radians(data.loc[i,'dropoff_longitude'])
        data.loc[i,'droplat']=math.radians(data.loc[i,'dropoff_latitude'])
    
        data.loc[i,'distance']=2*r*math.asin(
            math.sqrt((math.sin((data.loc[i,'droplat']-data.loc[i,'picklat'])/2))**2
                  +math.cos(data.loc[i,'picklat'])*math.cos(data.loc[i,'droplat'])
                      *(math.sin((data.loc[i,'droplong']-data.loc[i,'picklong'])/2))**2))
    del data['picklong'],data['picklat'],data['droplong'],data['droplat']
    return data
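
The row-by-row `.loc` loop above can be slow on a few hundred thousand rows. One possible alternative, sketched here under the assumption that using `numpy` is acceptable for the bulk column computation (the project checklist restricts only the standalone distance function to `math`), computes the same haversine expression on whole columns at once. The function name `add_distance_vectorized` is hypothetical; the coordinates are taken from two rows of the Uber sample.

```python
import numpy as np
import pandas as pd


def add_distance_vectorized(data, radius_km=6373):
    """Return a copy of `data` with a haversine `distance` column in km."""
    lon1 = np.radians(data["pickup_longitude"])
    lat1 = np.radians(data["pickup_latitude"])
    lon2 = np.radians(data["dropoff_longitude"])
    lat2 = np.radians(data["dropoff_latitude"])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    result = data.copy()
    result["distance"] = 2 * radius_km * np.arcsin(np.sqrt(a))
    return result


# Coordinates from two rows of the Uber sample data.
trips = pd.DataFrame({
    "pickup_longitude": [-73.999817, -73.994355],
    "pickup_latitude": [40.738354, 40.728225],
    "dropoff_longitude": [-73.999512, -73.994710],
    "dropoff_latitude": [40.723217, 40.750325],
})
trips = add_distance_vectorized(trips)
print(trips["distance"])
```

Because it does not index rows by position, this variant does not depend on the dataframe having a clean 0..n-1 index.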

### Processing Taxi Data

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [5]:
def find_taxi_csv_urls():
    raise NotImplementedError()
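
One way `find_taxi_csv_urls` might be filled in is to collect every link on the downloaded page and keep those whose file name matches a pattern. The sketch below is dependency-free: it uses the standard library's `html.parser` where the project checklist names BeautifulSoup, so the `LinkCollector` class is a stand-in for a `bs4` link search. The file-name pattern (`yellow_tripdata_YYYY-MM`), the 2009-2015 year range, the helper name `extract_taxi_urls`, and the sample HTML are all assumptions for illustration; the real page would be fetched with `requests` from `TAXI_URL`.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href.strip())


# Assumed file-name pattern: yellow_tripdata_YYYY-MM with any extension.
YELLOW_TAXI_PATTERN = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})\.\w+$")


def extract_taxi_urls(html, first_year=2009, last_year=2015):
    """Return yellow-taxi file links whose year falls within the given range."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for link in parser.links:
        match = YELLOW_TAXI_PATTERN.search(link)
        if match and first_year <= int(match.group(1)) <= last_year:
            urls.append(link)
    return urls


# Hypothetical HTML fragment standing in for the downloaded page.
sample_html = """
<a href="https://example.com/trip-data/yellow_tripdata_2015-01.csv">January</a>
<a href="https://example.com/trip-data/green_tripdata_2015-01.csv">January</a>
<a href="https://example.com/trip-data/yellow_tripdata_2016-01.csv">January</a>
"""
print(extract_taxi_urls(sample_html))
```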

In [6]:
def get_and_clean_month_taxi_data(url):
    raise NotImplementedError()

In [7]:
def get_and_clean_taxi_data():
    all_taxi_dataframes = []
    
    all_csv_urls = find_taxi_csv_urls()
    for csv_url in all_csv_urls:
        # maybe: first try to see if you've downloaded this exact
        # file already and saved it before trying again
        dataframe = get_and_clean_month_taxi_data(csv_url)
        dataframe = calculate_and_add_distance(dataframe, len(dataframe))
        # maybe: if the file hasn't been saved, save it so you can
        # avoid re-downloading it if you re-run the function
        
        all_taxi_dataframes.append(dataframe)
        
    # create one gigantic dataframe with data from every month needed
    taxi_data = pd.concat(all_taxi_dataframes)
    return taxi_data

### Processing Uber Data

1. Load the data with pandas.
2. Clean the data:
    - Remove unnecessary columns
    - Remove invalid data points (missing data / out-of-range values)
    - Normalize column names
    - Remove trips that start and/or end outside the designated coordinate box, whose corners in (longitude, latitude) order are (-74.242330, 40.560445), (-73.717047, 40.560445), (-74.242330, 40.908524), and (-73.717047, 40.908524)
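
The coordinate-box rule in the last step can be sketched as a small predicate built on the `NEW_YORK_BOX_COORDS` constant, so the corner values live in one place. The helper names `is_inside_box` and `trip_is_inside_box` are hypothetical, and the "outside" trip is a made-up example.

```python
# Same corners as NEW_YORK_BOX_COORDS: (south-west), (north-east) as (lat, lon).
NEW_YORK_BOX_COORDS = ((40.560445, -74.242330), (40.908524, -73.717047))


def is_inside_box(latitude, longitude, box=NEW_YORK_BOX_COORDS):
    """Return True if the point lies inside the box, borders included."""
    (min_lat, min_lon), (max_lat, max_lon) = box
    return min_lat <= latitude <= max_lat and min_lon <= longitude <= max_lon


def trip_is_inside_box(trip, box=NEW_YORK_BOX_COORDS):
    """Keep a trip only if both its pickup and its drop-off are in the box."""
    return (is_inside_box(trip["pickup_latitude"], trip["pickup_longitude"], box)
            and is_inside_box(trip["dropoff_latitude"], trip["dropoff_longitude"], box))


inside = {"pickup_latitude": 40.738354, "pickup_longitude": -73.999817,
          "dropoff_latitude": 40.723217, "dropoff_longitude": -73.999512}
outside = dict(inside, dropoff_latitude=0.0)  # hypothetical bad drop-off
print(trip_is_inside_box(inside), trip_is_inside_box(outside))
```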

In [8]:
uber=pd.read_csv('uber_rides_sample.csv')
uber.info() #missing data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         200000 non-null  int64  
 1   key                200000 non-null  object 
 2   fare_amount        200000 non-null  float64
 3   pickup_datetime    200000 non-null  object 
 4   pickup_longitude   200000 non-null  float64
 5   pickup_latitude    200000 non-null  float64
 6   dropoff_longitude  199999 non-null  float64
 7   dropoff_latitude   199999 non-null  float64
 8   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB


In [260]:
uber

Unnamed: 0.1,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.994710,40.750325,1
2,44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.740770,-73.962565,40.772647,1
3,25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5
...,...,...,...,...,...,...,...,...,...
199995,42598914,2012-10-28 10:49:00.00000053,3.0,2012-10-28 10:49:00 UTC,-73.987042,40.739367,-73.986525,40.740297,1
199996,16382965,2014-03-14 01:09:00.0000008,7.5,2014-03-14 01:09:00 UTC,-73.984722,40.736837,-74.006672,40.739620,1
199997,27804658,2009-06-29 00:42:00.00000078,30.9,2009-06-29 00:42:00 UTC,-73.986017,40.756487,-73.858957,40.692588,2
199998,20259894,2015-05-20 14:56:25.0000004,14.5,2015-05-20 14:56:25 UTC,-73.997124,40.725452,-73.983215,40.695415,1


In [9]:
uber.describe(perc) 
#problem: longitude and latitude range
#passenger_count outliers
#fare_amount 499 needs further investigation

Unnamed: 0.1,Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,200000.0,200000.0,200000.0,200000.0,199999.0,199999.0,200000.0
mean,27712500.0,11.359955,-72.527638,39.935885,-72.525292,39.92389,1.684535
std,16013820.0,9.901776,11.437787,7.720539,13.117408,6.794829,1.385997
min,1.0,-52.0,-1340.64841,-74.015515,-3356.6663,-881.985513,0.0
1%,553911.5,3.3,-74.014402,0.0,-74.015288,0.0,1.0
5%,2723455.0,4.1,-74.006838,40.701801,-74.00746,40.68641,1.0
25%,13825350.0,6.0,-73.992065,40.734796,-73.991407,40.733823,1.0
50%,27745500.0,8.5,-73.981823,40.752592,-73.980093,40.753042,1.0
75%,41555300.0,12.5,-73.967154,40.767158,-73.963658,40.768001,2.0
90%,49892570.0,20.5,-73.950785,40.779855,-73.945389,40.78268,4.0


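#invalid data (coordinates outside the valid longitude/latitude range) - to be dropped
uber[ (uber.pickup_longitude<-180) | (uber.pickup_longitude>180) | \
     (uber.pickup_latitude<-90) | (uber.pickup_latitude>90) | \
     (uber.dropoff_longitude<-180) | (uber.dropoff_longitude>180) | \
     (uber.dropoff_latitude<-90)  | (uber.dropoff_latitude>90)]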

In [10]:
uber[uber.duplicated()] #no duplicated data
uber[uber.dropoff_longitude.isnull()] 

Unnamed: 0.1,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
87946,32736015,2013-07-02 03:51:57.0000001,24.1,2013-07-02 03:51:57 UTC,-73.950581,40.779692,,,0


In [11]:
uber.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         200000 non-null  int64  
 1   key                200000 non-null  object 
 2   fare_amount        200000 non-null  float64
 3   pickup_datetime    200000 non-null  object 
 4   pickup_longitude   200000 non-null  float64
 5   pickup_latitude    200000 non-null  float64
 6   dropoff_longitude  199999 non-null  float64
 7   dropoff_latitude   199999 non-null  float64
 8   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB


In [12]:
def load_and_clean_uber_data(uber):
    uber=uber.iloc[:,2:].copy() #drop unnecessary columns: the first two (row id and key)
    uber.pickup_datetime=pd.to_datetime(uber.pickup_datetime) #proper data type
    uber['pickup_datetime']=uber['pickup_datetime'].dt.tz_localize(None) #drop the UTC timezone marker
    
    uber=uber[uber.dropoff_longitude.notnull()].copy()  #drop rows with missing drop-off coordinates
    uber['date']=pd.to_datetime(uber['pickup_datetime'].dt.date) #add y-m-d time
    uber['week']=uber['date'].dt.dayofweek+1 #day of week: 1=Monday, ..., 7=Sunday
    uber=uber.sort_values('pickup_datetime').reset_index(drop=True) #sort
    #uber=uber[uber['fare_amount']>200]#drop outliers, 99% = 53.3
    uber=uber[uber['passenger_count']<7] #drop outlier: passenger_count=208
    #Remove trips that start and/or end outside the designated coordinate box 
    # (40.560445, -74.242330) and (40.908524, -73.717047)
    uber=uber[(uber.pickup_longitude>=-74.242330) & (uber.pickup_longitude<=-73.717047) \
         & (uber.pickup_latitude >=40.560445 ) & (uber.pickup_latitude <= 40.908524)]
    uber=uber[(uber.dropoff_longitude>=-74.242330) & (uber.dropoff_longitude<=-73.717047) \
         & (uber.dropoff_latitude >=40.560445 ) & (uber.dropoff_latitude <= 40.908524)]
    uber=uber.reset_index(drop=True)
    return uber

In [13]:
def get_uber_data():
    uber_dataframe = load_and_clean_uber_data(uber)
    uber_dataframe = calculate_and_add_distance(uber_dataframe,len(uber_dataframe))
    return uber_dataframe

In [14]:
uber_data=get_uber_data()

In [15]:
uber_data.to_csv('uber_test.csv')

### Processing Weather Data

1. Load the data.
2. Separate the daily and hourly data, keeping only the necessary columns.
3. Clean the data:
    * Convert dates to the datetime type
    * Group by date
    * Handle inconsistent precipitation values (`T` and values with an `s` suffix)
4. Use the hourly data to fill in missing daily data.
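
The precipitation rule in step 3 can be sketched as a per-value parser. This is a minimal illustration, assuming that `T` marks a trace amount treated as zero and that entries containing an `s` (such as `1.2s`) are discarded rather than repaired; the function name `parse_precipitation` and the fallback for other unparseable strings are assumptions, not part of the notebook.

```python
import math


def parse_precipitation(raw):
    """Convert one raw precipitation entry to a float, or None if unusable."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None  # missing reading
    text = str(raw).strip()
    if text == "T":
        return 0.0  # trace amount, treated as zero
    if "s" in text:
        return None  # flagged value such as "1.2s", dropped rather than guessed
    try:
        return float(text)
    except ValueError:
        return None  # any other unparseable entry (assumed handling)


print([parse_precipitation(v) for v in ["0.05", "T", "1.2s", None]])
```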

In [373]:
weather09=pd.read_csv('2009_weather.csv',low_memory=False)
weather10=pd.read_csv('2010_weather.csv',low_memory=False)
weather11=pd.read_csv('2011_weather.csv',low_memory=False)
weather12=pd.read_csv('2012_weather.csv',low_memory=False)
weather13=pd.read_csv('2013_weather.csv',low_memory=False)
weather14=pd.read_csv('2014_weather.csv',low_memory=False)
weather15=pd.read_csv('2015_weather.csv',low_memory=False)

from datetime import datetime
def clean_weather_data_daily(dataframe):
    df=pd.merge(dataframe['DATE'],dataframe['DailyPrecipitation'],left_index=True,right_index=True)
    df1=pd.merge(dataframe['DailyAverageWindSpeed'],dataframe['DailyPeakWindSpeed'],left_index=True,right_index=True)
    df2=pd.merge(df,df1,left_index=True,right_index=True)
    
    df2=df2.rename(columns=str.lower)
    df2['date']=pd.to_datetime(df2.date)
    df2['t']=df2['date'].dt.date
    df2['t']=pd.to_datetime(df2.t)
    
    res=pd.DataFrame(df2.groupby('t')['dailyaveragewindspeed'].last())
    res['dailypeakwindspeed']=pd.DataFrame(df2.groupby('t')['dailypeakwindspeed'].last())
    res['dailyprecipitation']=pd.DataFrame(df2.groupby('t')['dailyprecipitation'].last())
    return res

In [295]:
from datetime import datetime
def clean_weather_data_daily(dataframe):
    df=pd.merge(dataframe['DATE'],dataframe['DailyPrecipitation'],left_index=True,right_index=True)
    df1=pd.merge(dataframe['DailyAverageWindSpeed'],dataframe['DailyPeakWindSpeed'],left_index=True,right_index=True)
    df2=pd.merge(df,df1,left_index=True,right_index=True)
    
    df2=df2.rename(columns=str.lower)
    df2['date']=pd.to_datetime(df2.date)
    df2['date']=df2['date'].dt.date
    df2['date']=pd.to_datetime(df2.date)
    
    res=pd.DataFrame(df2.groupby('date')['dailyaveragewindspeed'].last())
    res['dailypeakwindspeed']=pd.DataFrame(df2.groupby('date')['dailypeakwindspeed'].last())
    res['dailyprecipitation']=pd.DataFrame(df2.groupby('date')['dailyprecipitation'].last())
    return res

In [296]:
daily09=clean_weather_data_daily(weather09)
daily10=clean_weather_data_daily(weather10)
daily11=clean_weather_data_daily(weather11)
daily12=clean_weather_data_daily(weather12)
daily13=clean_weather_data_daily(weather13)
daily14=clean_weather_data_daily(weather14)
daily15=clean_weather_data_daily(weather15)

In [297]:
daily15[daily15['dailypeakwindspeed']>2000]

date,dailyaveragewindspeed,dailypeakwindspeed,dailyprecipitation
2015-11-28,,2237.0,0.02
2015-11-29,,2237.0,0.0


In [374]:
def clean_weather_data_hourly(dataframe):
    
    df=pd.merge(dataframe['DATE'],dataframe['HourlyWindGustSpeed'],left_index=True,right_index=True)
    df1=pd.merge(dataframe['HourlyWindSpeed'],dataframe['DailySustainedWindSpeed'],left_index=True,right_index=True)
    df2=pd.merge(df,df1,left_index=True,right_index=True)
    df2['HourlyPrecipitation']=dataframe['HourlyPrecipitation']
    
    df2.loc[df2['HourlyPrecipitation']=='T','HourlyPrecipitation']=0 #replace trace precipitation 'T' with 0
    df2=df2.drop(df2[df2['HourlyPrecipitation'].str.contains(pat='s')==True].index) #drop malformed values such as '1.2s'
    df2['HourlyPrecipitation']=df2['HourlyPrecipitation'].astype(float) #convert to float so hourly values can be summed into daily precipitation
    
    df2=df2.rename(columns=str.lower)
    df2['date']=pd.to_datetime(df2.date)
    return df2

In [375]:
hour09=clean_weather_data_hourly(weather09)
hour10=clean_weather_data_hourly(weather10)
hour11=clean_weather_data_hourly(weather11)
hour12=clean_weather_data_hourly(weather12)
hour13=clean_weather_data_hourly(weather13)
hour14=clean_weather_data_hourly(weather14)
hour15=clean_weather_data_hourly(weather15)

In [377]:
#fill gaps in the daily data with values generated from the hourly data
def adddata_daily(changedf,adddf):
    
    adddf['t']=adddf['date'].dt.date
    adddf['t']=pd.to_datetime(adddf.t)
    
    changedf['averagehourlywindspeed']=adddf.groupby('t')['hourlywindspeed'].mean()
    changedf['peakhourlywindspeed']=adddf.groupby('t')['hourlywindspeed'].max()
    ##not sure whether wind gust is needed
    changedf['averagehourlywindgustspeed']=adddf.groupby('t')['hourlywindgustspeed'].mean()
    changedf['peakhourlywindgustspeed']=adddf.groupby('t')['hourlywindgustspeed'].max()
    changedf['sumhourlyprecipitation']=adddf[adddf['hourlyprecipitation'].isnull()==False].groupby('t')['hourlyprecipitation'].sum()
    
    changedf.loc[changedf['dailyaveragewindspeed'].isnull()==True,'dailyaveragewindspeed']=changedf['averagehourlywindspeed']
    changedf.loc[changedf['dailypeakwindspeed'].isnull()==True,'dailypeakwindspeed']=changedf['peakhourlywindspeed']
    changedf.loc[changedf['dailyprecipitation'].isnull()==True,'dailyprecipitation']=changedf['sumhourlyprecipitation']
    changedf=changedf.reset_index()
    return changedf.iloc[:,0:4]
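
A more compact way to express the same gap-filling idea is `Series.fillna` with a second Series, which pandas aligns on the index. The sketch below is a minimal illustration with small invented frames; the column names mirror the ones used above, but the values are made up and the approach is a suggestion, not the notebook's method.

```python
import pandas as pd

# Invented hourly readings for two days.
hourly = pd.DataFrame({
    "date": pd.to_datetime(["2009-01-01 00:51", "2009-01-01 12:51", "2009-01-02 06:51"]),
    "hourlywindspeed": [10.0, 14.0, 6.0],
    "hourlyprecipitation": [0.1, 0.2, 0.0],
})

# Invented daily summary with gaps on the first day.
daily = pd.DataFrame(
    {"dailyaveragewindspeed": [float("nan"), 7.0],
     "dailyprecipitation": [float("nan"), 0.0]},
    index=pd.to_datetime(["2009-01-01", "2009-01-02"]),
)

# Group hourly rows by calendar day (time of day set to midnight).
by_day = hourly.groupby(hourly["date"].dt.normalize())

# fillna with a Series only touches missing entries and aligns on the date index.
daily["dailyaveragewindspeed"] = daily["dailyaveragewindspeed"].fillna(
    by_day["hourlywindspeed"].mean())
daily["dailyprecipitation"] = daily["dailyprecipitation"].fillna(
    by_day["hourlyprecipitation"].sum())
print(daily)
```

Values that were already present in the daily frame are left untouched, so the daily source stays authoritative wherever it exists.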

In [378]:
daily09=adddata_daily(daily09,hour09)
daily10=adddata_daily(daily10,hour10)
daily11=adddata_daily(daily11,hour11)
daily12=adddata_daily(daily12,hour12)
daily13=adddata_daily(daily13,hour13)
daily14=adddata_daily(daily14,hour14)
daily15=adddata_daily(daily15,hour15)

How can hourly data be generated from the daily data?

Some missing values still remain. Should they be dropped?

In [381]:
daily=pd.concat([daily09,daily10,daily11,daily12,daily13,daily14,daily15])
hour=pd.concat([hour09,hour10,hour11,hour12,hour13,hour14,hour15])
del hour['t']
hour=hour.reset_index(drop=True)
daily=daily.reset_index(drop=True)

In [382]:
daily

Unnamed: 0,date,dailyaveragewindspeed,dailypeakwindspeed,dailyprecipitation
0,2009-01-01,11.041667,18.0,
1,2009-01-02,6.806452,16.0,0.0
2,2009-01-03,9.875000,15.0,0.0
3,2009-01-04,7.370370,10.0,
4,2009-01-05,6.925926,11.0,0.0
...,...,...,...,...
2546,2015-12-27,5.700000,26.0,0.12
2547,2015-12-28,8.300000,28.0,0.03
2548,2015-12-29,7.000000,24.0,0.45
2549,2015-12-30,4.100000,13.0,0.19


In [25]:
def load_and_clean_weather_data():
    hourly_dataframes = []
    daily_dataframes = []
    
    # weather CSV files, one per year (2009-2015)
    weather_csv_files = [f"{year}_weather.csv" for year in range(2009, 2016)]
    
    for csv_file in weather_csv_files:
        weather = pd.read_csv(csv_file, low_memory=False)
        hourly_dataframe = clean_weather_data_hourly(weather)
        daily_dataframe = clean_weather_data_daily(weather)
        hourly_dataframes.append(hourly_dataframe)
        daily_dataframes.append(daily_dataframe)
        
    # create two dataframes with hourly & daily data from every year
    hourly_data = pd.concat(hourly_dataframes)
    daily_data = pd.concat(daily_dataframes)
    
    return hourly_data, daily_data

In [26]:
hour.to_csv('hour.csv')

In [27]:
daily.to_csv('daily.csv')

### Process All Data

_This is where you can actually execute all the required functions._

_**TODO:** Write some prose that tells the reader what you're about to do here._

taxi_data = get_and_clean_taxi_data()
uber_data = get_uber_data()
hourly_weather_data, daily_weather_data = load_and_clean_weather_data()