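## Field descriptions

- id: unique identifier for each trip
- vendor_id: code indicating the provider associated with the trip record
- pickup_datetime: pickup date and time
- dropoff_datetime: drop-off date and time
- passenger_count: number of passengers in the vehicle (value entered by the driver)
- pickup_longitude: pickup longitude
- pickup_latitude: pickup latitude
- dropoff_longitude: drop-off longitude
- dropoff_latitude: drop-off latitude
- store_and_fwd_flag: indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server; Y = store and forward, N = not a store-and-forward trip
- trip_duration: duration of the trip in seconds; this is the target variable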

# Import common libraries


In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from datetime import datetime
import math

import os
from pathlib import Path
print(os.listdir("../input"))


# 1 Reading the data

In [None]:
df_train = pd.read_csv('../input/train.zip', compression='zip')
df_test = pd.read_csv('../input/test.zip', compression='zip')

In [None]:
df_train.head()

In [None]:
# check memory usage
print('Memory usage, MiB: {:.2f}\n'.format(df_train.memory_usage().sum()/2**20))

# overall df info
print('---------------- DataFrame Info -----------------')
df_train.info()  # info() prints directly and returns None, so it needs no print()

# 2 Data visualization

#### 2.1 Check for N/A values

In [None]:
print(df_train.isnull().sum())

#### 2.2 Check for Outliers

In [None]:
print('--------------Coordinate Outliers-------------------')
# lower bound = smallest value over pickup and drop-off, upper bound = largest
print('Latitude : {} to {}'.format(
    min(df_train.pickup_latitude.min(), df_train.dropoff_latitude.min()),
    max(df_train.pickup_latitude.max(), df_train.dropoff_latitude.max())
))
print('Longitude : {} to {}'.format(
    min(df_train.pickup_longitude.min(), df_train.dropoff_longitude.min()),
    max(df_train.pickup_longitude.max(), df_train.dropoff_longitude.max())
))
print('')
print('------------------Time Outliers---------------------')
print('Trip duration in seconds: {} to {}'.format(
    df_train.trip_duration.min(), df_train.trip_duration.max()))

print('')
print('------------------Date Outliers---------------------')
print('Datetime range: {} to {}'.format(df_train.pickup_datetime.min(), 
                                        df_train.dropoff_datetime.max()))
print('')
print('----------------Passengers Outliers------------------')
print('Passengers: {} to {}'.format(df_train.passenger_count.min(), 
                                        df_train.passenger_count.max()))
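
Once the ranges above expose implausible values, one option is to drop those rows before training. The sketch below illustrates this on a made-up frame. The bounding box and the duration cap are assumed thresholds, not values taken from this notebook, and `filter_outliers` is a hypothetical helper.

```python
import pandas as pd

# Assumed thresholds for illustration only; tune them against the printed ranges.
LAT_RANGE = (40.5, 41.0)      # rough latitude band around New York City
LON_RANGE = (-74.3, -73.6)    # rough longitude band around New York City
MAX_DURATION_S = 6 * 3600     # treat trips longer than 6 hours as outliers


def filter_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose coordinates, duration and passenger count look plausible."""
    mask = (
        df["pickup_latitude"].between(*LAT_RANGE)
        & df["dropoff_latitude"].between(*LAT_RANGE)
        & df["pickup_longitude"].between(*LON_RANGE)
        & df["dropoff_longitude"].between(*LON_RANGE)
        & df["trip_duration"].between(1, MAX_DURATION_S)
        & (df["passenger_count"] > 0)
    )
    return df.loc[mask].reset_index(drop=True)


# Made-up rows: the second starts far outside the box, the third lasts about 23 days.
toy = pd.DataFrame({
    "pickup_latitude":   [40.75, 51.88, 40.76],
    "pickup_longitude":  [-73.98, -73.98, -73.97],
    "dropoff_latitude":  [40.77, 40.76, 40.78],
    "dropoff_longitude": [-73.96, -73.97, -73.95],
    "trip_duration":     [600, 900, 2_000_000],
    "passenger_count":   [1, 2, 1],
})
clean = filter_outliers(toy)
print(len(toy), "->", len(clean))
```

Only the training set should be filtered this way; every test row still needs a prediction.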

#### 2.3 Check for duplicate values

In [None]:
print('duplicates IDs: {}'.format(len(df_train) - len(df_train.drop_duplicates(subset='id'))))

Use the haversine distance to compute the great-circle distance between the pickup and drop-off coordinates.

In [None]:
def haversine(lat1, lon1, lat2, lon2):
    R = 6371000  # mean Earth radius in meters (approximate)
    phi1, phi2 = math.radians(lat1), math.radians(lat2) 
    dphi       = math.radians(lat2 - lat1)
    dlambda    = math.radians(lon2 - lon1)
    
    a = math.sin(dphi/2)**2 + \
        math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
    
    return 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))

In [None]:
df_train['distance'] = df_train.apply(lambda row: 
                                      haversine(row['pickup_latitude'], 
                                                row['pickup_longitude'], 
                                                row['dropoff_latitude'], 
                                                row['dropoff_longitude']), axis=1)
df_test['distance']  = df_test.apply(lambda row: 
                                     haversine(row['pickup_latitude'], 
                                               row['pickup_longitude'], 
                                               row['dropoff_latitude'], 
                                               row['dropoff_longitude']), axis=1)
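
Row-wise `apply` calls the Python function once per row, which can be slow on a frame of this size. A vectorized variant of the same formula is sketched below. `haversine_np` and `EARTH_RADIUS_M` are names introduced here for illustration, and the radius is the usual approximate mean value.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius in meters


def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars, lists, arrays or Series."""
    lat1, lon1, lat2, lon2 = (
        np.asarray(v, dtype=float) for v in (lat1, lon1, lat2, lon2)
    )
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlambda = np.radians(lon2 - lon1)

    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Two made-up trips: one degree of latitude apart, and a zero-length trip.
d = haversine_np([40.0, 0.0], [-74.0, 0.0], [41.0, 0.0], [-74.0, 0.0])
print(d)
```

With the notebook's frames this would be called as `haversine_np(df_train['pickup_latitude'], df_train['pickup_longitude'], df_train['dropoff_latitude'], df_train['dropoff_longitude'])`, and the result can be assigned straight to a column.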

In [None]:
df_train.head()

In [None]:
sns.set(rc={'figure.figsize':(15,10)})
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(df_train['distance'])

# 3 Data processing

In [None]:
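plt.figure(figsize=(8,5))
# histplot replaces the deprecated distplot; a fixed bin count keeps extreme outliers from creating a huge number of bins
sns.histplot(df_train['trip_duration'], bins=50, kde=True).set_title("Distribution of Trip Duration")
plt.xlabel("Trip Duration")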

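When the target is numeric and heavily right-skewed, taking its logarithm brings the distribution closer to normal, which usually improves accuracy.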

In [None]:
df_train['trip_duration'] = np.log(df_train['trip_duration'].values)
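
The transform has to be undone with `np.exp` before predictions are reported in seconds. The sketch below checks that round trip on made-up durations and adds an RMSLE helper. It assumes the leaderboard metric is RMSLE, which this notebook does not state, so treat `rmsle` as an illustrative helper rather than the official scorer.

```python
import numpy as np

durations = np.array([60.0, 600.0, 3600.0, 86400.0])  # made-up durations in seconds

log_target = np.log(durations)   # what the cell above stores in trip_duration
restored = np.exp(log_target)    # inverse transform, needed before submitting


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; log1p keeps zero durations valid."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


print(np.allclose(restored, durations))
print(rmsle(durations, durations * 2))  # every prediction is twice the true value
```

Because the metric compares logarithms, a model trained with squared error on the log target optimizes something very close to it. `np.log1p` with `np.expm1` is a common alternative pair when zero durations are possible.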

In [None]:
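plt.figure(figsize=(8,5))
# same plot after the log transform; histplot replaces the deprecated distplot
sns.histplot(df_train['trip_duration'], bins=50, kde=True).set_title("Distribution of log Trip Duration")
plt.xlabel("log(Trip Duration)")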

Extract features from the pickup date and time.

In [None]:
df_train['pickup_datetime'] = pd.to_datetime(df_train['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
df_test['pickup_datetime'] = pd.to_datetime(df_test['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')

In [None]:
# Derive the same calendar features from the pickup timestamp for train and test
for df in (df_train, df_test):
    df['hour'] = df['pickup_datetime'].dt.hour
    # .dt.week was removed in pandas 2.0; isocalendar().week is its replacement
    df['week'] = df['pickup_datetime'].dt.isocalendar().week.astype('int64')
    df['weekday'] = df['pickup_datetime'].dt.weekday
    df['month'] = df['pickup_datetime'].dt.month
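
To see what these accessors return, here is a small sketch on two made-up timestamps. It uses `dt.isocalendar().week`, which newer pandas versions provide in place of `dt.week`.

```python
import pandas as pd

# Two made-up pickup times: a Monday afternoon and just after midnight on New Year's Day.
ts = pd.to_datetime(
    pd.Series(["2016-03-14 17:24:55", "2016-01-01 00:05:00"]),
    format="%Y-%m-%d %H:%M:%S",
)

features = pd.DataFrame({
    "hour": ts.dt.hour,
    "week": ts.dt.isocalendar().week.astype("int64"),  # ISO 8601 week number
    "weekday": ts.dt.weekday,                          # Monday = 0 ... Sunday = 6
    "month": ts.dt.month,
})
print(features)
```

Note that 1 January 2016 falls in ISO week 53 of the previous year, so `week` is not always monotonic within a calendar year.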

In [None]:
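cat_vars = ['store_and_fwd_flag']
for col in cat_vars:
    # Reuse the training categories so train and test get identical integer codes
    categories = df_train[col].astype('category').cat.categories
    df_train[col] = pd.Categorical(df_train[col], categories=categories).codes
    df_test[col] = pd.Categorical(df_test[col], categories=categories).codes
df_test.head()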
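
Category codes are assigned from the values present in each column, so train and test only agree when they share one category list. A toy illustration with made-up flags:

```python
import pandas as pd

train_flag = pd.Series(["N", "Y", "N"])
test_flag = pd.Series(["Y", "Y"])  # "N" never appears in this made-up test column

# Independent encoding: the codes depend on which values happen to be present.
independent = test_flag.astype("category").cat.codes.tolist()

# Shared encoding: fix the category list up front so N -> 0 and Y -> 1 everywhere.
flag_dtype = pd.CategoricalDtype(categories=["N", "Y"])
shared_train = train_flag.astype(flag_dtype).cat.codes.tolist()
shared_test = test_flag.astype(flag_dtype).cat.codes.tolist()

print(independent, shared_train, shared_test)
```

In the independent case "Y" is encoded as 0 in the test column while it is 1 in the training column, which would silently flip the feature's meaning for the model.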

# 4 Feature engineering

In [None]:
y_train = df_train["trip_duration"]
X_train = df_train[["vendor_id", "store_and_fwd_flag","passenger_count",
                    "pickup_longitude", "pickup_latitude", "distance", 
                    "dropoff_longitude","dropoff_latitude", 
                    "hour", "week", "weekday", "month" ]]

# 5 Building machine learning models

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
m = Ridge()

cross_val_score(m, X_train, y_train, cv=5)
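
For a regressor, `cross_val_score` reports R² by default. To get an error on the same scale as the log target, a scoring string can be passed instead. The sketch below uses synthetic data rather than the taxi features, so the numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Default scoring for regressors is the R^2 coefficient of determination.
r2 = cross_val_score(Ridge(), X, y, cv=5)

# scikit-learn negates error metrics so that higher is always better.
neg_rmse = cross_val_score(Ridge(), X, y, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse = -neg_rmse

print(r2.mean(), rmse.mean())
```

When the target is the log of the duration, this RMSE is an error in log units and is therefore close in spirit to an RMSLE on the original scale.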

In [None]:
%%time
from sklearn.linear_model import Ridge
m = Ridge()
m.fit(X_train, y_train)

X_test = df_test[X_train.columns]  # same feature columns, in the same order, as X_train
prediction = m.predict(X_test)
prediction

In [None]:
from lightgbm import LGBMRegressor
m = LGBMRegressor(n_estimators=500)
m.fit(X_train, y_train)

X_test = df_test[X_train.columns]  # same feature columns, in the same order, as X_train
prediction = m.predict(X_test)
prediction

# 6 Making predictions

In [None]:
submit = pd.read_csv('../input/sample_submission.zip', compression='zip')
# the model was trained on log(trip_duration), so convert predictions back to seconds
submit['trip_duration'] = np.exp(prediction)
submit.to_csv('submission.csv', index=False)
submit.head()
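
The same step can be wrapped in a small function that also checks the row count before writing. This is only a sketch: `make_submission` is a hypothetical helper, the ids are made up, and the `id` column name is assumed from the field list rather than read from the sample file.

```python
import numpy as np
import pandas as pd


def make_submission(ids, log_pred):
    """Hypothetical helper: map log-scale predictions back to seconds, one row per id."""
    log_pred = np.asarray(log_pred, dtype=float)
    if len(ids) != len(log_pred):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": list(ids), "trip_duration": np.exp(log_pred)})


# Made-up ids and predictions, given on the log scale as the model would produce them.
sub = make_submission(["id_a", "id_b"], np.log([300.0, 1200.0]))
print(sub)
```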

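# 7 Homework

1. Read the sklearn documentation and try submitting with other models;
2. Study the data, construct new features, and add them to model training;
    - https://www.kaggle.com/jeffreycbw/nyc-taxi-trip-public-0-37399-private-0-37206
    - https://www.kaggle.com/mnds18/nyc-taxi-eda-mrig
3. Compare the accuracy obtained with and without the log transform on the target;
4. Reach a score better than 0.370 (lower is better, so the score should be below 0.370).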
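
As a possible starting point for item 2, one candidate feature is the direction of travel between pickup and drop-off. The sketch below computes the initial compass bearing. `bearing_deg` is a name introduced here, and the feature is a suggestion, not something taken from the linked kernels.

```python
import numpy as np


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees within [0, 360)."""
    lat1, lon1, lat2, lon2 = (
        np.asarray(v, dtype=float) for v in (lat1, lon1, lat2, lon2)
    )
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dlambda = np.radians(lon2 - lon1)

    y = np.sin(dlambda) * np.cos(phi2)
    x = np.cos(phi1) * np.sin(phi2) - np.sin(phi1) * np.cos(phi2) * np.cos(dlambda)
    return (np.degrees(np.arctan2(y, x)) + 360.0) % 360.0


north = bearing_deg(0.0, 0.0, 1.0, 0.0)  # heading due north
east = bearing_deg(0.0, 0.0, 0.0, 1.0)   # heading due east along the equator
print(north, east)
```

The function accepts whole columns, so it could be applied to the pickup and drop-off coordinates of `df_train` and `df_test` in the same way as the distance feature.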