**This notebook predicts the fare of a cab ride from the journey details: pickup and drop-off locations, journey date and time, and passenger details. Other factors, such as whether the journey took place on a weekday or a weekend, or during the day or at night, can also influence the fare. These factors are extracted from the given details, and the prediction is made with a linear regression model. The notebook also examines how fares vary with each factor and uses several hypothesis tests to identify which factors contribute most to the prediction.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mlt
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
import seaborn as sns
from random import randrange, uniform
from sklearn import preprocessing

**•	From the journey date and time, a feature is extracted indicating whether the journey took place in the morning, afternoon, evening, or night.
•	From the same field, the year, month, weekday, and day of the month are also extracted (the day of the month is dropped again before modelling).**
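
The extraction described above can also be sketched in vectorized form. This is a minimal illustration on made-up timestamps, not the notebook's own implementation: the `toy` frame is hypothetical, and `np.select` is used here in place of the row-wise `daytime` function, with the same hour boundaries.

```python
import numpy as np
import pandas as pd

# Toy pickup timestamps (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"pickup_datetime": pd.to_datetime([
    "2012-06-15 05:30:00", "2012-06-16 09:00:00",
    "2012-06-17 15:45:00", "2012-06-18 23:10:00"])})

# Calendar features through the .dt accessor
toy["year"] = toy["pickup_datetime"].dt.year
toy["month"] = toy["pickup_datetime"].dt.month
toy["weekday"] = toy["pickup_datetime"].dt.weekday  # Monday=0 ... Sunday=6
hour = toy["pickup_datetime"].dt.hour

# Same boundaries as daytime(): anything outside 7-22 falls back to "night"
conditions = [
    (hour > 6) & (hour <= 12),
    (hour > 12) & (hour <= 17),
    (hour > 17) & (hour <= 22),
]
toy["daytime"] = np.select(conditions, ["morning", "afternoon", "evening"],
                           default="night")
print(toy)
```

The vectorized form avoids calling a Python function once per row, which matters mainly on large frames.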

In [1]:
def daytime (row):
    if (row['hour'] <= 6) or (row['hour'] > 22):
        return ("night")
    elif (row['hour'] > 6) and (row['hour'] <= 12):
        return ("morning")
    elif (row['hour'] > 12) and (row['hour'] <= 17):
        return ("afternoon")
    elif (row['hour'] > 17) and (row['hour'] <= 22):
        return ("evening")

    
def add_time_features(df):
    df['year'] = df['pickup_datetime'].apply(lambda x: x.year)
    df['month'] = df['pickup_datetime'].apply(lambda x: x.month)
    df['day'] = df['pickup_datetime'].apply(lambda x: x.day)
    df['hour'] = df['pickup_datetime'].apply(lambda x: x.hour)
    df['weekday'] = df['pickup_datetime'].apply(lambda x: x.weekday())
    df['pickup_datetime'] =  df['pickup_datetime'].apply(lambda x: str(x))
    df['daytime'] = df.apply (lambda x: daytime(x), axis=1)
    df = df.drop('pickup_datetime', axis=1)
    df=df.drop('hour',axis=1)
    df=df.drop('day',axis=1)
    return df

In [1]:
df = pd.read_csv("../input/cabfare/Data.csv")

In [1]:
df.info()

In [1]:
df['pickup_datetime'] =  pd.to_datetime(df['pickup_datetime'], format='%Y-%m-%d %H:%M:%S %Z',errors='coerce')

In [1]:
df.isnull().sum()

In [1]:
df= add_time_features(df)

In [1]:
df["year"] = df["year"].astype(object)
df["month"] = df["month"].astype(object)
df["weekday"] = df["weekday"].astype(object)

**From the pickup and drop-off latitude and longitude, the distance between the two points is computed with the Python library Geopy, using both its great-circle and geodesic distance functions. Geopy is best known for locating the coordinates of addresses, cities, countries, and landmarks through third-party geocoders and other data sources; here only its distance module is used.**
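
To make the distance feature more concrete, the great-circle distance can be sketched with the haversine formula. This is only an illustration under stated assumptions: the function name `haversine_miles` and the radius constant are hypothetical, the Earth is treated as a sphere, and the result will differ slightly from Geopy's geodesic distance, which uses an ellipsoidal model.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical model)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator
one_degree = haversine_miles(0.0, 0.0, 0.0, 1.0)
print(round(float(one_degree), 2))
```

Because the function only uses NumPy operations, it also accepts whole columns (for example the four coordinate columns of a frame) and returns an array of distances.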

In [1]:
from geopy.distance import geodesic
from geopy.distance import great_circle
df['great_circle']=df.apply(lambda x: great_circle((x['pickup_latitude'],x['pickup_longitude']), (x['dropoff_latitude'],   x['dropoff_longitude'])).miles, axis=1)
df['geodesic']=df.apply(lambda x: geodesic((x['pickup_latitude'],x['pickup_longitude']), (x['dropoff_latitude'],   x['dropoff_longitude'])).miles, axis=1)

In [1]:
df.info()

## Exploratory Data Analysis

##### Cab Fare vs. Year

In [1]:
def time_analysis(df):
    return pd.DataFrame({"FareAverage":np.mean(df.fare_amount),"Count":np.size(df.fare_amount),"FareSum":sum(df.fare_amount)},index=["Time"] )

In [1]:
df_yearly=df.groupby('year').apply(time_analysis).reset_index()
sns.catplot(x="year", y="FareAverage", kind="bar", data=df_yearly,color="c",palette="dark",height=3, aspect=1.5)
sns.catplot(x="year", y="Count", kind="bar", data=df_yearly,color="g",palette="dark",height=3, aspect=1.5)
sns.catplot(x="year", y="FareSum", kind="bar", data=df_yearly,color="m",palette="dark",height=3, aspect=1.5)

**The average cab fare is more or less the same from year to year, but the total number of rides varies. The revenue generated varies with it: it increased initially, then decreased, and fluctuated again later. No clear pattern emerges from the yearly analysis.**
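
The three quantities compared above (average fare, ride count, total revenue) can also be obtained in one step with named aggregation. The sketch below uses a made-up `toy` frame; only the column names `year` and `fare_amount` and the output names follow the notebook.

```python
import pandas as pd

# Hypothetical fares for illustration only
toy = pd.DataFrame({
    "year": [2012, 2012, 2013],
    "fare_amount": [10.0, 20.0, 12.0],
})

# One row per year with the same three columns that time_analysis produces
summary = (
    toy.groupby("year")["fare_amount"]
       .agg(FareAverage="mean", Count="size", FareSum="sum")
       .reset_index()
)
print(summary)
```

The same call works for `month`, `weekday`, or `daytime` by changing the grouping column.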

##### Cab Fare vs month

In [1]:
df_monthly=df.groupby('month').apply(time_analysis).reset_index()
sns.catplot(x="month", y="FareAverage", kind="bar", data=df_monthly,color="c",palette="dark",height=3, aspect=1.5)
sns.catplot(x="month", y="Count", kind="bar", data=df_monthly,color="g",palette="dark",height=3, aspect=1.5)
sns.catplot(x="month", y="FareSum", kind="bar", data=df_monthly,color="m",palette="dark",height=3, aspect=1.5)

**The average fare is more or less constant across months. However, the number of rides, and correspondingly the revenue generated, is highest in June. One possible reason is that educational institutes mostly start their academic sessions around that time; another is that people tend to prefer a cab ride in the extreme heat of June. Rides and total revenue are also higher in the last months of the year than in the first months, possibly because those are festival months.**

##### Cab Fare vs Weekday

In [1]:
df_weekly=df.groupby('weekday').apply(time_analysis).reset_index()
sns.catplot(x="weekday", y="FareAverage", kind="bar", data=df_weekly,color="c",palette="dark",height=3, aspect=1.5)
sns.catplot(x="weekday", y="Count", kind="bar", data=df_weekly,color="g",palette="dark",height=3, aspect=1.5)
sns.catplot(x="weekday", y="FareSum", kind="bar", data=df_weekly,color="m",palette="dark",height=3, aspect=1.5)

**Again, the average fare is more or less the same across the week. The number of rides, and correspondingly the revenue generated, is slightly higher on weekdays than on weekends, most likely because of travel to offices and educational institutes.**

##### Cab Fare vs Daytime

In [1]:
df_daily=df.groupby('daytime').apply(time_analysis).reset_index()
sns.catplot(x="daytime", y="FareAverage", kind="bar", data=df_daily,color="c",palette="dark",height=3, aspect=1)
sns.catplot(x="daytime", y="Count", kind="bar", data=df_daily,color="g",palette="dark",height=3, aspect=1)
sns.catplot(x="daytime", y="FareSum", kind="bar", data=df_daily,color="m",palette="dark",height=3, aspect=1)

**Cab rides are much more frequent in the morning, when people are generally going to their workplaces, and in the evening, when people are travelling back from offices, going out for dinner or movies, or meeting up after college or office. The average cab fare is also higher at night, possibly due to night fare supplement charges.**

##### Cab fare vs Passenger count, distance, pick up and drop off latitude and longitude.

In [1]:
df1=df[['passenger_count',"pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude", 'great_circle',"geodesic","fare_amount"]]
sns.pairplot(df1)

**The pair plot shows that the fare amount increases with distance. For the other variables, further analysis is needed.**

###### Because passenger count has only a few unique values, it is treated as a categorical variable

In [1]:
df["passenger_count"] = df["passenger_count"].astype(object)

## Correlation Analysis

Pairs of continuous independent variables are checked to see whether they move together. If two of them are highly correlated, one should be removed, because keeping both can bias the model. Each continuous independent variable is also checked against the continuous target variable: here a high correlation is desirable, since the two should move together.

H0: Two variables are independent

H1: Two variables are not independent

• If the p-value is less than 0.05, the null hypothesis is rejected and the two variables are considered dependent.

• If the p-value is greater than 0.05, the null hypothesis cannot be rejected and the two variables are treated as independent.
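
The decision rule can be illustrated with `scipy.stats.pearsonr` on synthetic data. Everything below is made up for illustration: `distance`, `fare`, and `noise` are hypothetical variables, and the linear relation between fare and distance is an assumption of the example, not a result from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: fare depends linearly on distance, noise does not
distance = rng.uniform(0.5, 10, 200)
fare = 2.5 + 2.0 * distance + rng.normal(0, 1, 200)
noise = rng.normal(0, 1, 200)

alpha = 0.05
for name, other in [("fare", fare), ("noise", noise)]:
    r, p = stats.pearsonr(distance, other)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"distance vs {name}: r = {r:.3f}, p = {p:.3g} -> {decision}")

r_dep, p_dep = stats.pearsonr(distance, fare)
```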

In [1]:
ncol=["great_circle","geodesic","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","fare_amount"]

In [1]:
plt.figure(figsize=(10,10))
_ = sns.heatmap(df[ncol].corr(), square=True, cmap='RdYlGn',linewidths=1,linecolor='w',annot=True)
plt.title('Correlation matrix ')
plt.show()

"pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude" are not that correlated with fare amount. Hence, they are dropped.
Great circle and geodesic are highly correlated with each other, so great circle is dropped.
Geodesic and fare amount are highly correlated with each other, and the p-value of this correlation is calculated below. The p-values for the other pairs can be calculated in the same way.

In [1]:
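import scipy.stats as stats
g = sns.jointplot(x='fare_amount', y='geodesic', data=df, kind='reg')
# JointGrid.annotate is no longer available in recent seaborn releases,
# so Pearson's r and its p-value are computed directly and drawn on the joint axes
pair = df[['fare_amount', 'geodesic']].dropna()
r, p = stats.pearsonr(pair['fare_amount'], pair['geodesic'])
g.ax_joint.annotate(f"pearsonr = {r:.2f}; p = {p:.2g}", xy=(0.05, 0.95), xycoords='axes fraction')
plt.show()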

Fare amount and geodesic are highly correlated and the reported p-value is 0, so H0 is rejected: the two variables are dependent, which is a necessary condition for linear regression.

In [1]:
df=df.drop(["great_circle","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude"],axis=1)

In [1]:
df.info()

## Chi-square test of Independence for Categorical Variables/Features

The analysis here is similar to the correlation test, but for the categorical variables.

Hypothesis testing:

Null Hypothesis: The two variables are independent.

Alternate Hypothesis: The two variables are not independent.

If the p-value is less than 0.01, the null hypothesis is rejected and the two variables are considered dependent.
If the p-value is greater than 0.01, the null hypothesis cannot be rejected and the two variables are treated as independent.
Alpha is set to 0.01 here because most of the variables in the data are categorical, and it would be unfair to remove them on the basis of weak dependencies with others.
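
One way to run an independence test of this kind on a single pair of categorical variables is to build a contingency table with `pd.crosstab` and pass it to `scipy.stats.chi2_contingency`. This is a sketch on made-up data, not the procedure used in the cells below; the `toy` frame and its category labels are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical rides: weekday trips are mostly in the morning, weekend trips at night
toy = pd.DataFrame({
    "week_part": ["weekday"] * 50 + ["weekend"] * 50,
    "daytime": ["morning"] * 40 + ["night"] * 10 + ["morning"] * 10 + ["night"] * 40,
})

# Observed counts for every combination of the two categories
table = pd.crosstab(toy["week_part"], toy["daytime"])
chi2_stat, p_value, dof, expected = chi2_contingency(table)

alpha = 0.01
print(table)
print("p-value:", p_value, "-> dependent" if p_value < alpha else "-> independent")
```

The contingency-table form tests exactly the hypotheses stated above, since it compares the observed counts with the counts expected under independence.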

In [1]:
# Import label encoder 
colnames = list(df.columns)
from sklearn import preprocessing 

# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 
  
for col in colnames:
    if df[col].dtype==object:
        df[col]= label_encoder.fit_transform(df[col])

In [1]:
cat_var=["passenger_count","year","month","weekday","daytime"] 
catdf=df[cat_var]

In [1]:
from sklearn.feature_selection import chi2

# Compare each categorical column against all the columns to its right
for i in range(len(cat_var) - 1):
    X = catdf.iloc[:, i + 1:]
    y = catdf.iloc[:, i]
    chi_scores = chi2(X, y)
    p_values = pd.Series(chi_scores[1], index=X.columns)
    print("for", cat_var[i])
    print(p_values)
    # Show only the pairs that are significant at alpha = 0.01
    print(p_values[p_values < 0.01])

The analysis shows that year, month, and weekday are dependent on other variables (p-value below 0.01), so H0 is rejected for those pairs and the three variables are dropped.

In [1]:
df=df.drop(["year","month","weekday"],axis=1)

## ANOVA test

ANOVA compares the groups within a categorical variable. It checks whether the means of the different groups are the same, but it does not identify which mean is different.

Hypothesis testing:

Null Hypothesis: The means of all categories in a variable are the same.

Alternate Hypothesis: The mean of at least one category in a variable is different.

If the p-value is less than 0.05, the null hypothesis is rejected.
If the p-value is greater than 0.05, the null hypothesis cannot be rejected.
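
For a single categorical variable, the same test can be sketched with `scipy.stats.f_oneway`. The fares below are invented for illustration, and the group names only mirror the daytime categories used in this notebook.

```python
from scipy.stats import f_oneway

# Hypothetical fares per daytime group (night is visibly more expensive)
morning = [10, 11, 9, 10, 12]
evening = [10, 9, 11, 10, 11]
night = [15, 16, 14, 17, 15]

f_stat, p_value = f_oneway(morning, evening, night)

alpha = 0.05
if p_value < alpha:
    print(f"F = {f_stat:.2f}, p = {p_value:.3g}: at least one group mean differs")
else:
    print(f"F = {f_stat:.2f}, p = {p_value:.3g}: no evidence that the means differ")
```

As noted above, a significant result does not say which group differs; a post hoc comparison would be needed for that.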


In [1]:
df.info()

In [1]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('fare_amount ~ C(passenger_count)+C(daytime)',data=df).fit()
aov_table = sm.stats.anova_lm(model)
aov_table

In [1]:
probanova=list(aov_table["PR(>F)"])
for i in range(0,3):
    if probanova[i]>0.05:
        print(i)

The mean fare is not the same across the categories of either variable: the p-values are less than 0.05, so H0 is rejected.

## VIF Test

This test checks whether any multicollinearity remains in the data after the statistical tests above. VIF is always greater than or equal to 1.

If VIF is 1 --- not correlated with any of the other variables.

If VIF is between 1 and 5 --- moderately correlated.

If VIF is above 5 --- highly correlated.

•	VIF measures the strength of the correlation between the independent variables. It is computed by taking one variable and regressing it against all the other variables.

•	The VIF score of an independent variable represents how well the variable is explained by the other independent variables: VIF = 1 / (1 - R^2), where R^2 comes from that regression.

The closer the R^2 value is to 1, the higher the VIF and the stronger the multicollinearity of that variable with the other independent variables.
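
The relation between R^2 and VIF can be checked by hand with NumPy. This is a minimal sketch, not the statsmodels routine used in the next cell: `vif_manual` is a hypothetical helper that adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two can differ when no constant column is present.

```python
import numpy as np

def vif_manual(X):
    """VIF of each column: regress it on the others (plus intercept), then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                # unrelated to a
c = a + 0.1 * rng.normal(size=500)      # nearly a copy of a

vifs = vif_manual(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

In this synthetic example the two nearly identical columns receive a large VIF, while the unrelated column stays close to 1.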

In [1]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return(vif)

In [1]:
df1=df.drop(["fare_amount"],axis=1)
calc_vif(df1)

**None of the remaining variables have high multicollinearity**

In [1]:
df["passenger_count"] = df["passenger_count"].astype(object)
df["daytime"] = df["daytime"].astype(object)

## Converting passenger count and daytime to dummy variables
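
As a small illustration of what the conversion does, the sketch below applies `pd.get_dummies` with `drop_first=True` to a made-up frame. The `toy` data is hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "geodesic": [1.0, 2.0, 3.0, 4.0],
    "daytime": ["morning", "night", "evening", "morning"],
})

# Object columns are expanded into indicator columns; the first category
# ("evening" here, in sorted order) is dropped and acts as the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one category per variable avoids perfectly collinear indicator columns in the regression.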

In [1]:
df = pd.get_dummies(df, drop_first=True)
df.info()

## Multiple regression model

In [1]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
x = df.drop('fare_amount',axis=1).values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
X = pd.DataFrame(x_scaled)
y = df['fare_amount'].values

In [1]:
model = sm.OLS(y,X).fit()
model.summary()

1) The R squared value indicates that 97.3 percent of the variance in the dependent variable is explained by the independent variables collectively, so the model does a good job of explaining the changes in the dependent variable. Because the model is fitted without an intercept term, statsmodels reports the uncentered R squared, which tends to be higher than the usual centered value. Adjusted R squared is the same as R squared, which suggests that no predictor is inflating R squared without adding explanatory power.

2) H0: The variables carry no information about the target variable (b = 0).
H1: The variables carry information about the target variable (b != 0).
The F-statistic is exceptionally large and its p-value is less than 0.05, so H0 is rejected: the variables have a linear relationship with the target variable and carry information about it (b != 0).

3) The maximised value of the log-likelihood function is -22083. The likelihood measures how probable the observed data are under the process described by the model; the fit maximises this probability.

4) Omnibus is a test of the skewness and kurtosis of the residuals. Its value is relatively high and its probability is relatively low, indicating that the residuals are not normally distributed.

5) The skew value is also not close to 0, which confirms the result above.

6) The DW value suggests positive autocorrelation: an error of a given sign tends to be followed by an error of the same sign. For example, positive errors are usually followed by positive errors, and negative errors by negative errors.

7) The kurtosis of a normal distribution is 3.0. Here it is close to 5, which is consistent with the other results.

8) The JB value is large and its probability is 0, indicating that the errors are not normally distributed.

9) In linear regression, the condition number of the moment matrix can be used as a diagnostic for multicollinearity. A relatively small number (<30) is required, and that is the case here.
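
Several of the diagnostics listed above can be reproduced by hand from the residuals, which may help when reading the summary table. The sketch below is an illustration on synthetic data with NumPy and SciPy only: the variables `distance` and `fare` are made up, and the model includes an intercept, unlike the fit in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Synthetic fares that depend linearly on distance
distance = rng.uniform(0.5, 10, n)
fare = 2.5 + 2.0 * distance + rng.normal(0, 1, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), distance])
beta, *_ = np.linalg.lstsq(X, fare, rcond=None)
resid = fare - X @ beta

r_squared = 1 - resid.var() / fare.var()                        # centered R squared
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)  # near 2: no autocorrelation
skewness = stats.skew(resid)                                     # near 0 for normal residuals
kurt = stats.kurtosis(resid, fisher=False)                       # near 3 for normal residuals
jb_stat, jb_p = stats.jarque_bera(resid)                         # small p: non-normal residuals

print(f"R^2 = {r_squared:.3f}, DW = {durbin_watson:.2f}")
print(f"skew = {skewness:.2f}, kurtosis = {kurt:.2f}, JB p = {jb_p:.3f}")
```

A Durbin-Watson value well below 2 points to positive autocorrelation and a value well above 2 to negative autocorrelation.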


## Recommendations

## (Ways to deal with Non Normal Residual Distribution and positive autocorrelation)

1.	Outliers should not be removed just because they make the distribution of the residuals non-normal. The case with the high residual can be examined to see whether there is a problem with it (the easiest case is a data entry error).

2.	Assuming there is no good reason to remove that observation, the regression can be run with and without it to see whether the parameter estimates differ much; if not, the observation can be left in, with a note that removing it made little difference.

3.	If it makes a big difference, the OLS model itself may be the wrong choice for this data set, and alternative models may be needed. Options include robust regression, which deals with outliers, quantile regression, or any other regression model that makes no assumptions about the distribution of the residuals.

4.	Some key explanatory variables might have been left out, causing some signal to leak into the residuals in the form of autocorrelation. If one residual can be used to predict the next, there is predictive information that the predictors do not capture. Typically, this situation involves time-ordered observations. For example, if a residual is more likely to be followed by another residual of the same sign, adjacent residuals are positively correlated. A variable that captures the relevant time-related information can be included, or a time series analysis can be used.

5.	The response variable can be transformed to make the distribution of the random errors approximately normal. The model is then fitted, and the predicted values are transformed back into the original units using the inverse of the transformation applied to the response variable.
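
The last recommendation can be sketched with scikit-learn's `TransformedTargetRegressor`, which applies a transformation to the response before fitting and the inverse transformation to the predictions. The data below is synthetic, and the choice of `np.log1p` / `np.expm1` is an assumption for illustration; whether a log transform suits the cab fare data would have to be checked on the actual residuals.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, right-skewed fares: linear in distance on the log scale
distance = rng.uniform(0.5, 10, size=(300, 1))
fare = np.exp(1.0 + 0.2 * distance[:, 0] + rng.normal(0, 0.1, 300))

# Fit on log1p(fare); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(distance, fare)

predicted_fare = model.predict(distance[:5])  # already in original fare units
print(predicted_fare)
```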
