# 1) Define Problem

# New York Taxi Fare Prediction: 

We are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations. A basic estimate based only on the distance between the two points results in an RMSE of $5-$8, depending on the model used. Our challenge is to do better than this using machine learning techniques!
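To make the "distance only" baseline concrete, here is a minimal sketch of a great-circle (haversine) distance and a flat-fee-plus-rate fare estimate. The function names, the base fee and the per-kilometre rate are illustrative assumptions, not values taken from the dataset or from any official fare schedule.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

def baseline_fare(distance_km, base=2.5, per_km=1.6):
    """Hypothetical baseline: flat fee plus a per-km rate (both numbers are made up)."""
    return base + per_km * distance_km

# Roughly Times Square to JFK airport (approximate coordinates)
d = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
print(round(float(d), 1), round(float(baseline_fare(d)), 2))
```

Any model built later in the notebook should beat the hold-out RMSE of an estimate like this one to justify its extra complexity.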



# 2) Specify input and output

# Data Field:

1)**ID**
key - Unique string identifying each row in both the training and test sets. It consists of pickup_datetime plus a unique integer, but this does not matter; it should just be used as a unique ID field. Required in your submission CSV. Not necessarily needed in the training set, but it could be useful for simulating a 'submission file' while doing cross-validation within the training set.


# Features


**pickup_datetime** - timestamp value indicating when the taxi ride started.


**pickup_longitude** - float for longitude coordinate of where the taxi ride started.


**pickup_latitude** - float for latitude coordinate of where the taxi ride started.


**dropoff_longitude** - float for longitude coordinate of where the taxi ride ended.


**dropoff_latitude** - float for latitude coordinate of where the taxi ride ended.


**passenger_count** - integer indicating the number of passengers in the taxi ride.


# Target


**fare_amount** - float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.

# 3) Select Framework(libraries)

In [None]:
import os
import numpy as np#linear algebra   
import pandas as pd #data preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train = pd.read_csv('../input/train.csv', nrows = 100000, parse_dates=["pickup_datetime"])  # the full file has about 55M rows; only the first 100,000 are loaded

In [None]:
test = pd.read_csv('../input/test.csv')   #10k rows 

In [None]:
train.head()  # first 5 records of train

# 4) EDA (Exploratory Data Analysis)

#  Data collection

In [None]:
train.describe() 

In [None]:
train.columns

In [None]:
train.info()

train has 8 columns in total: 5 float64, 1 int, 1 object, and 1 datetime64.

# Data Preprocessing & Data cleaning

In [None]:
print(train.isnull().sum())  # check whether any null values are present


In [None]:
print('Old size: %d' % len(train))
train = train.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(train))
# With a much larger sample (e.g. 20 million rows), NaN values do appear.

In [None]:
print(train.isnull().sum())


In [None]:
sns.distplot(train['fare_amount']);

About 95% of the 'fare_amount' values lie between 0 and 50.

In [None]:
train.loc[train['fare_amount']<0].shape

There are 9 records with a negative fare; we will remove these records from the data.

Many rows have a latitude or longitude of 0. Check how many such rows there are:

In [None]:
train[(train.pickup_latitude==0) | (train.pickup_longitude==0) | (train.dropoff_latitude==0) | (train.dropoff_longitude==0)].shape

The run recorded here reported 1918 such rows in train.
Just from looking at the data, we can see that it is not 100% clean and
some entries will contribute to higher error rates.
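When several comparisons are combined into one pandas mask, each comparison needs its own parentheses, because `|` binds more tightly than `==` in Python. The sketch below uses a small made-up frame (not the competition data) to show the parenthesised form and an equivalent column-wise form that is harder to get wrong.

```python
import pandas as pd

# Toy data: two of the four rows contain a zero coordinate
df = pd.DataFrame({
    "pickup_latitude": [40.7, 0.0, 40.8, 40.6],
    "pickup_longitude": [-73.9, -73.9, 0.0, -74.0],
})

# Each comparison is wrapped in parentheses before being combined with |
zero_mask = (df["pickup_latitude"] == 0) | (df["pickup_longitude"] == 0)
n_zero = int(zero_mask.sum())

# Equivalent: compare all coordinate columns at once, then reduce per row
coord_cols = ["pickup_latitude", "pickup_longitude"]
n_zero_alt = int((df[coord_cols] == 0).any(axis=1).sum())

print(n_zero, n_zero_alt)
```

The column-wise form scales to all four coordinate columns by extending `coord_cols`.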

In [None]:
sns.distplot(train['passenger_count'])

In [None]:
train.describe()

In [None]:
# Clean up the train dataset to eliminate out-of-range values
train = train[train['fare_amount'] > 0]
train = train[train['pickup_longitude'] < -72]
train = train[(train['pickup_latitude'] > 40) & (train['pickup_latitude'] < 44)]
train = train[train['dropoff_longitude'] < -72]
train = train[(train['dropoff_latitude'] > 40) & (train['dropoff_latitude'] < 44)]
train = train[(train['passenger_count'] > 0) & (train['passenger_count'] < 10)]

Now we can see there are no obvious inconsistencies in the data.

In [None]:
train.describe()

#  Same operations performed on 'test'

In [None]:
test.head()  # first 5 record of test 

In [None]:
test.describe()

In [None]:
test.info()

In [None]:
print(test.isnull().sum())

In [None]:
test[(test.pickup_latitude==0) | (test.pickup_longitude==0) | (test.dropoff_latitude==0) | (test.dropoff_longitude==0)].shape


In [None]:
print(test.isnull().sum())


In [None]:
print('Old size: %d' % len(test))
test = test.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(test))

In [None]:
# Clean up the test dataset to eliminate out-of-range values.
# Every mask is built from `test` itself (not `train`) so the row index lines up.
test = test[test['pickup_longitude'] < -72]
test = test[(test['pickup_latitude'] > 40) & (test['pickup_latitude'] < 44)]
test = test[test['dropoff_longitude'] < -72]
test = test[(test['dropoff_latitude'] > 40) & (test['dropoff_latitude'] < 44)]
test = test[(test['passenger_count'] > 0) & (test['passenger_count'] < 10)]
test.head()

Both datasets are now cleaned.

#  Transforming Feature

In [None]:
# Make sure pickup_datetime is a datetime column in both frames
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])


Convert the datetime variable into separate columns: hour, week, month, and year.

In [None]:
combine = [train,test]
for dataset in combine:
    dataset['pickup_datetime'] = pd.to_datetime(dataset['pickup_datetime'])
    dataset['hour'] = dataset.pickup_datetime.dt.hour
    dataset['week'] = dataset.pickup_datetime.dt.isocalendar().week.astype(int)  # .dt.week was removed in pandas 2.0
    dataset['month'] = dataset.pickup_datetime.dt.month
    dataset['year'] = dataset.pickup_datetime.dt.year

    
train.head()

In [None]:
test.head()
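The datetime expansion can be checked on a couple of hand-picked timestamps. This is a minimal sketch on made-up rows; the `day_of_week` column is an optional extra that the cells above do not compute.

```python
import pandas as pd

rides = pd.DataFrame({
    "pickup_datetime": ["2015-01-01 00:30:00", "2012-07-04 18:05:00"],
})
ts = pd.to_datetime(rides["pickup_datetime"])

rides["hour"] = ts.dt.hour
rides["week"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
rides["month"] = ts.dt.month
rides["year"] = ts.dt.year
rides["day_of_week"] = ts.dt.dayofweek                # Monday=0 ... Sunday=6

print(rides)
```

Note that the ISO week belongs to the ISO year, so dates at the very start or end of a calendar year can report week 52, 53 or 1 of a neighbouring year.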

In [None]:
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train) 
train.head(1)

In [None]:
# Apply the same travel-vector features to the test set
# (add_travel_vector_features is defined in the previous cell).

add_travel_vector_features(test) 
test.head(1)

In [None]:
# Remove unnecessary columns that are not required for modeling.
train = train.drop(['pickup_datetime', 'key'],axis = 1) 
#train.info()

In [None]:
test.drop(['pickup_datetime'], axis = 1, inplace = True)

In [None]:
#Let's prepare the test set
x_pred = test.drop('key', axis=1)

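In [None]:
# 'key' stays in `test` because the submission cell reads test['key'];
# the model inputs live in x_pred, which already excludes it.
x_pred.head(1)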

#  Feature encoding

In [None]:
y= train['fare_amount']
x = train.drop(['fare_amount'], axis=1)

#  Hold-out Validation Split

In [None]:
from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2)
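A single 80/20 split gives one noisy estimate of the error. The sketch below shows k-fold cross-validation written out with NumPy on synthetic data, so every step is visible. The helper name `kfold_rmse` and the synthetic fare formula are assumptions made for illustration; in practice scikit-learn's `cross_val_score` does the same job.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Mean hold-out RMSE of an ordinary least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    Xb = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    scores = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, *_ = np.linalg.lstsq(Xb[trn], y[trn], rcond=None)
        resid = y[val] - Xb[val] @ coef
        scores.append(float(np.sqrt(np.mean(resid ** 2))))
    return float(np.mean(scores)), scores

# Synthetic stand-in for (x, y): fare = 2.5 + 3 * distance + unit-variance noise
rng = np.random.default_rng(42)
dist = rng.uniform(0.5, 20, size=500)
fare = 2.5 + 3.0 * dist + rng.normal(0, 1.0, size=500)

mean_rmse, fold_rmse = kfold_rmse(dist.reshape(-1, 1), fare, k=5)
print(mean_rmse, fold_rmse)
```

The spread of the per-fold scores gives a rough sense of how much a single hold-out score can vary.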

#  i)Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as MSE  # alias kept for the cells below

In [None]:
linmodel = LinearRegression()
linmodel.fit(Xtrain, ytrain)
print(linmodel.score(Xtest,ytest))

In [None]:
#Prediction on the hold-out split (Xtest), not on the training data
linmodel_pred = linmodel.predict(Xtest)
r21 = r2_score(ytest, linmodel_pred)
mse1 = mean_squared_error(ytest,linmodel_pred)
rmse1 = np.sqrt(mse1)
print(r21)
print(mse1)
print("RMSE : % f" %(rmse1))
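For reference, the two metrics printed above can be written out by hand. This is a minimal sketch with made-up numbers; the helper names `rmse` and `r2` are illustrative, and scikit-learn's `mean_squared_error` and `r2_score` remain the practical choice.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of the error, in dollars here."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [10.0, 12.0, 8.0, 15.0]
y_hat = [11.0, 11.0, 9.0, 14.0]   # every prediction is off by exactly $1
print(rmse(y_true, y_hat), r2(y_true, y_hat))
```

RMSE is the number to compare against the $5-$8 distance baseline, since it is expressed in the same units as the fare.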

In [None]:
#Prediction on test Data
linmodel_pred=linmodel.predict(x_pred)

In [None]:
linmodel_pred

#  Ridge Regression


In [None]:
from sklearn.linear_model import Ridge

In [None]:
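from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# `normalize=` was removed in scikit-learn 1.2; scale in a pipeline instead (alpha is not directly comparable)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.005))
ridge.fit(Xtrain, ytrain)
print(ridge.score(Xtest,ytest))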

In [None]:
#Prediction on test Data
ridge_pred=ridge.predict(x_pred)

In [None]:
ridge_pred

# Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

In [None]:
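from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# `normalize=` was removed in scikit-learn 1.2; scale in a pipeline instead (alpha is not directly comparable)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.0004))
lasso.fit(Xtrain, ytrain)
print(lasso.score(Xtest,ytest))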

In [None]:
#Prediction on test Data
lasso_pred=lasso.predict(x_pred)

In [None]:
lasso_pred

# Gradient Descent

In [None]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(Xtrain)
y_std = sc_y.fit_transform(ytrain.values.reshape(-1,1)).flatten()

In [None]:
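alpha = 0.1                        # learning rate for the averaged gradient
w_ = np.zeros(1 + X_std.shape[1])  # w_[0] is the intercept
cost_ = []
n_ = 100

for i in range(n_):
    y_pred = np.dot(X_std, w_[1:]) + w_[0]
    errors = (y_std - y_pred)

    # Average the gradient over the rows: with tens of thousands of rows,
    # the summed gradient and a fixed step size can diverge.
    w_[1:] += alpha * X_std.T.dot(errors) / len(y_std)
    w_[0] += alpha * errors.mean()

    cost = (errors**2).sum() / 2.0
    cost_.append(cost)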

In [None]:
plt.figure(figsize=(10,8))
plt.plot(range(1, n_ + 1), cost_);
plt.ylabel('SSE');
plt.xlabel('Epoch');

In [None]:
w_
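One way to sanity-check a hand-written gradient descent loop is to compare its weights with the closed-form least-squares solution on the same standardised data. The sketch below does this on synthetic data; the step size, the iteration count and the use of an averaged gradient are assumptions chosen so the loop converges on this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(0, 0.1, size=200)

# Standardise features and target, as StandardScaler would
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()

w = np.zeros(1 + X_std.shape[1])  # w[0] is the intercept
lr = 0.1                          # step size for the mean gradient (assumed)
sse = []
for _ in range(500):
    errors = y_std - (X_std @ w[1:] + w[0])
    w[1:] += lr * X_std.T @ errors / len(y_std)
    w[0] += lr * errors.mean()
    sse.append(float((errors ** 2).sum() / 2.0))

# Closed-form least squares on the same data for comparison
A = np.column_stack([np.ones(len(y_std)), X_std])
w_exact, *_ = np.linalg.lstsq(A, y_std, rcond=None)

print(np.round(w, 4), np.round(w_exact, 4))
```

If the two weight vectors disagree, or the SSE curve rises instead of falling, the learning rate is usually the first thing to revisit.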

# Support Vector Regressor

In [None]:
from sklearn.svm import SVR

In [None]:
svr = SVR(kernel='linear')
svr.fit(Xtrain, ytrain)
print(svr.score(Xtest,ytest))

In [None]:
#Prediction on test Data
svr_pred=svr.predict(x_pred)

In [None]:
svr_pred

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(Xtrain, ytrain)
print(rfr.score(Xtest,ytest))

In [None]:
#Prediction on test Data
rfr_pred=rfr.predict(x_pred)

In [None]:
rfr_pred

# iii) XG-BOOST Model


In [None]:
import xgboost as xg

In [None]:
train_dmatrix = xg.DMatrix(data = Xtrain, label = ytrain) 
test_dmatrix = xg.DMatrix(data = Xtest, label = ytest)    #For prediction

DMatrix is an optimized data structure created by the XGBoost authors. It contributes to the package's performance and efficiency gains.

In [None]:
param = {"booster":"gblinear", "objective":"reg:squarederror"}  # 'reg:squarederror' is the current name for 'reg:linear'
 

In [None]:
# Use a separate name so the trained booster does not shadow the `xg` module
xg_model = xg.train(params = param, dtrain = train_dmatrix, num_boost_round = 10000)


In [None]:
xg_pred= xg_model.predict(test_dmatrix)

In [None]:
print(r2_score(ytest, xg_pred))

# Parameter Tuning (xgboost cv)

In [None]:
params = {
    # parameters that we are going to tune
    'max_depth': 8,            # result of tuning with cv
    'eta': .03,                # result of tuning with cv
    'subsample': 1,            # result of tuning with cv
    'colsample_bytree': 0.8,   # result of tuning with cv
    # other parameters
    'objective': 'reg:squarederror',  # current name for 'reg:linear'
    'eval_metric': 'rmse',
    'verbosity': 0,            # replaces the older 'silent' flag
}

In [None]:
#Block of code used for hypertuning parameters. Adapt to each round of parameter tuning.
import xgboost as xgb
CV=False
if CV:
    dtrain = xgb.DMatrix(x,label=y)  # features only; `train` still contains fare_amount
    gridsearch_params = [
        eta
        for eta in np.arange(.04, 0.12, .02)
    ]

    # Define initial best params and RMSE
    min_rmse = float("Inf")
    best_params = None
    for (eta) in gridsearch_params:
        print("CV with eta={} ".format(
                                 eta))

        # Update our parameters
        params['eta'] = eta

        # Run CV
        cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round=1000,
            nfold=3,
            metrics={'rmse'},
            early_stopping_rounds=10
        )

        # Update best RMSE
        mean_rmse = cv_results['test-rmse-mean'].min()
        boost_rounds = cv_results['test-rmse-mean'].argmin()
        print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
        if mean_rmse < min_rmse:
            min_rmse = mean_rmse
            best_params = (eta)

    print("Best params: {}, RMSE: {}".format(best_params, min_rmse))
else:
    #Print final params to use for the model
    params['verbosity'] = 1 #Turn on output
    print(params)

In [None]:
from sklearn.model_selection import train_test_split
import xgboost as xgb

In [None]:
def XGBmodel(Xtrain,Xtest,ytrain,ytest):
    matrix_train = xgb.DMatrix(Xtrain,label=ytrain)
    matrix_test = xgb.DMatrix(Xtest,label=ytest)
    # Stop early if the hold-out RMSE has not improved for 20 rounds
    model = xgb.train(params=params,
                      dtrain=matrix_train,
                      num_boost_round=200,
                      early_stopping_rounds=20,
                      evals=[(matrix_test,'test')])
    return model

model=XGBmodel(Xtrain,Xtest,ytrain,ytest)

In [None]:
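import xgboost as xgb
# `ntree_limit` / `best_ntree_limit` are gone in recent XGBoost; use iteration_range with best_iteration
xgbcv_pred= model.predict(xgb.DMatrix(x_pred), iteration_range=(0, model.best_iteration + 1))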

In [None]:
xgbcv_pred

In [None]:
linmodel_pred, rfr_pred, xgbcv_pred


In [None]:
# Assign weights. More precise models get a higher weight.
linmodel_weight = 1
rfr_weight = 3
xgbcv_weight = 1
prediction = (linmodel_pred * linmodel_weight + rfr_pred * rfr_weight + xgbcv_pred * xgbcv_weight) / (linmodel_weight + rfr_weight + xgbcv_weight)
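The weighted blend is a weighted mean across models, which `np.average` computes directly. Below is a minimal sketch with made-up predictions for two rides; the 1/3/1 weights mirror the ones used above, and the numbers themselves are placeholders.

```python
import numpy as np

# Rows: models, columns: rides (placeholder values)
preds = np.array([
    [10.0, 20.0],   # e.g. linear model
    [12.0, 18.0],   # e.g. random forest
    [11.0, 22.0],   # e.g. XGBoost
])
weights = [1, 3, 1]

# Weighted mean over the model axis; the weights are normalised internally
blend = np.average(preds, axis=0, weights=weights)
print(blend)
```

Weights chosen by hand are a judgment call; comparing the blend's hold-out RMSE with each single model's is a simple way to check whether blending helps.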


In [None]:
prediction


# 6)Submission


In [None]:
# Add to submission
submission = pd.DataFrame({
        "key": test['key'],
        "fare_amount": prediction.round(2)
})

submission.to_csv('sub_fare.csv',index=False)


In [None]:
submission

# 7)Conclusion

I have tried to cover every part of the machine learning process using a variety of Python packages. I know there are still some problems, so I hope to get your feedback to improve it.