**Data Preprocessing and Feature Engineering (Exploratory Data Analysis)**

1. Studying the feature statistics
2. Impute missing values (with mean, median, mode)
3. Aggregation
4. Sampling
5. Dimensionality reduction (PCA)
6. Feature subset selection
7. Feature creation
8. Discretization and binarization (with Gini Index / Entropy)
9. Variable transformation and binning

**1. Studying the feature statistics**

In [None]:
import matplotlib.pyplot as plt # data visualisation
import seaborn as sb # data visualisation
import pandas as pd # dataframes
import math # math formulae

In [None]:
# importing training data
df = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows = 1_000_000)
df.head()

In [None]:
# removing 'key' column
df = df.drop(columns = ['key'])
df.head()

In [None]:
# dimensions of dataset
df.shape

In [None]:
# checking for duplicates
duplicate_rows = df[df.duplicated()]
duplicate_rows.shape

In [None]:
# data type of features and target
df.dtypes

In [None]:
# statistical data for numerical features and target
df.describe()

Notes:
* Negative/zero fares present
* Zero-passenger trips present
* Outliers present -> 208 passengers
* Invalid coordinates present -> valid latitudes lie in [-90, 90], valid longitudes in [-180, 180]
* New York coordinates -> lat = (40.2940, 45.0042), lon = (-79.4554, -71.4725)
* https://www.netstate.com/states/geography/ny_geography.htm
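
The bounding-box check noted above can also be written as a single boolean mask. This is a minimal sketch on a made-up three-row frame: the `in_bounds` helper and the sample rows are illustrative assumptions, and only the bounds themselves come from this notebook.

```python
import pandas as pd

# Bounding box used in this notebook (longitudes are negative, i.e. west).
LAT_MIN, LAT_MAX = 40.2940, 45.0042
LON_MIN, LON_MAX = -79.4554, -71.4725

def in_bounds(frame, prefix):
    """Boolean mask: True where the <prefix>_latitude/longitude pair lies inside the box."""
    lat_ok = frame[f'{prefix}_latitude'].between(LAT_MIN, LAT_MAX)
    lon_ok = frame[f'{prefix}_longitude'].between(LON_MIN, LON_MAX)
    return lat_ok & lon_ok

# Hypothetical rows: a valid trip, a zeroed pickup, and an impossible dropoff latitude.
sample = pd.DataFrame({
    'pickup_longitude':  [-73.98, 0.0, -73.95],
    'pickup_latitude':   [40.75, 0.0, 40.78],
    'dropoff_longitude': [-73.99, -73.97, -73.96],
    'dropoff_latitude':  [40.73, 40.76, 401.2],
})

# Keep only rows whose pickup and dropoff both fall inside the box.
filtered = sample[in_bounds(sample, 'pickup') & in_bounds(sample, 'dropoff')]
print(filtered.shape)
```

`Series.between` is inclusive at both ends by default, which matches the `<=` / `>=` comparisons used in this notebook.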

In [None]:
# removing trips whose pickup or dropoff lies outside the New York bounds
df = df[df['pickup_longitude'] <= -71.4725]
df = df[df['pickup_longitude'] >= -79.4554]

df = df[df['pickup_latitude'] <= 45.0042]
df = df[df['pickup_latitude'] >= 40.2940]

df = df[df['dropoff_longitude'] <= -71.4725]
df = df[df['dropoff_longitude'] >= -79.4554]

df = df[df['dropoff_latitude'] <= 45.0042]
df = df[df['dropoff_latitude'] >= 40.2940]

df.shape

In [None]:
# removing trips with zero/negative fares
df = df[df['fare_amount'] > 0]
df.shape

In [None]:
# removing trips with zero passengers
df = df[df['passenger_count'] > 0]
df.shape

In [None]:
# checking statistical data again
df.describe()

In [None]:
# changing 'pickup_datetime' to datetime data type
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], format = '%Y-%m-%d %H:%M:%S %Z')
df.dtypes

**2. Impute missing values**

In [None]:
# checking for null values
df.isnull().sum()

Notes:
* Fill in any null values with the mean
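
Mean imputation could look roughly like the sketch below. The small frame and its values are hypothetical, chosen only so that one cell is missing. It is one possible approach, not necessarily the one applied to this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing dropoff coordinate.
sample = pd.DataFrame({
    'fare_amount': [4.5, 16.9, 5.7],
    'dropoff_longitude': [-73.84, np.nan, -73.99],
})

# Count nulls per column, then fill each numeric column with its own mean.
print(sample.isnull().sum())
imputed = sample.fillna(sample.mean(numeric_only=True))
print(imputed)
```

Passing a Series of column means to `fillna` fills each column with its own mean, so columns without nulls are left unchanged.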

**3. Aggregation**

To group by:
* Year
* Month
* Hour
* Number of Passengers
* Distance

In [None]:
# sorting df by 'pickup_datetime'
df = df.sort_values('pickup_datetime')
df

In [None]:
# obtaining year, month and hour attributes from 'pickup_datetime'
df['year'] = df['pickup_datetime'].dt.strftime('%Y')
df['month'] = df['pickup_datetime'].dt.strftime('%m')
df['hour'] = df['pickup_datetime'].dt.strftime('%H')
df

In [None]:
# converting year, month and hour attributes from strings to numeric type
df[['year', 'month', 'hour']] = df[['year', 'month', 'hour']].apply(pd.to_numeric)
df.dtypes

In [None]:
# calculating trip distance using haversine formula
def haversine(start_lon, start_lat, end_lon, end_lat):
    earth_radius = 6371
    start_lon, start_lat, end_lon, end_lat = map(math.radians, [start_lon, start_lat, end_lon, end_lat])
    lat_diff = end_lat - start_lat
    lon_diff = end_lon - start_lon
    
    # haversine term: uses the cosine of both the start and the end latitude
    a = pow(math.sin(lat_diff/2), 2) + math.cos(start_lat) * math.cos(end_lat) * pow(math.sin(lon_diff/2), 2)
    c = 2 * math.asin(math.sqrt(a))
    dist = earth_radius * c
    
    return dist
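
For a dataframe of this size, a vectorised version of the haversine formula may be worth considering. The sketch below uses NumPy. The name `haversine_np` and the `earth_radius` default argument are assumptions made for illustration, while the formula itself is the standard haversine distance.

```python
import numpy as np

def haversine_np(start_lon, start_lat, end_lon, end_lat, earth_radius=6371.0):
    """Great-circle distance in kilometres; accepts scalars or equal-length arrays in degrees."""
    start_lon, start_lat, end_lon, end_lat = map(
        np.radians, (start_lon, start_lat, end_lon, end_lat)
    )
    lat_diff = end_lat - start_lat
    lon_diff = end_lon - start_lon
    a = (np.sin(lat_diff / 2) ** 2
         + np.cos(start_lat) * np.cos(end_lat) * np.sin(lon_diff / 2) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Sanity checks: identical points are 0 km apart; one degree of latitude is roughly 111 km.
same_point = haversine_np(-73.98, 40.75, -73.98, 40.75)
one_degree = haversine_np(np.array([-73.98]), np.array([40.0]),
                          np.array([-73.98]), np.array([41.0]))
print(same_point, one_degree)
```

With the column names used in this notebook, it could be called on whole columns at once, e.g. `haversine_np(df['pickup_longitude'], df['pickup_latitude'], df['dropoff_longitude'], df['dropoff_latitude'])`.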

In [None]:
# adding distance column to dataframe
# (zipping the columns avoids a slow per-row iloc lookup)
df['distance in kilometres'] = [
    haversine(plon, plat, dlon, dlat)
    for plon, plat, dlon, dlat in zip(df['pickup_longitude'], df['pickup_latitude'],
                                      df['dropoff_longitude'], df['dropoff_latitude'])
]
df

In [None]:
df.describe()

In [None]:
# trips with zero distances
zero_dist = df[df['distance in kilometres'] == 0]
zero_dist.shape

In [None]:
# removing zero distance trips
df = df[df['distance in kilometres'] > 0]
df.describe()

In [None]:
# relationship between distance in kilometres and fare_amount
sb.relplot(data = df, x = 'distance in kilometres', y = 'fare_amount')

In [None]:
# frequency of fare_amount
df['fare_amount'].plot.hist(bins = 100, figsize=(8,2))

In [None]:
# mean by year
yearly_mean = df.groupby(['year']).mean()
yearly_mean

In [None]:
# taking x labels from the groupby index keeps them aligned with the means
sb.barplot(x = yearly_mean.index, y = yearly_mean['fare_amount'])

In [None]:
# mean by month
monthly_mean = df.groupby('month').mean()
monthly_mean

In [None]:
# taking x labels from the groupby index keeps them aligned with the means
sb.barplot(x = monthly_mean.index, y = monthly_mean['fare_amount'])

In [None]:
# mean by hour
hourly_mean = df.groupby('hour').mean()
hourly_mean

In [None]:
# taking x labels from the groupby index keeps them aligned with the means
sb.barplot(x = hourly_mean.index, y = hourly_mean['fare_amount'])

Notes:
* Gradual increase in mean fare amount from 2009 to 2015
* Mean fare amount increases sharply from 01:00 to 05:00
* The month in which the trip was taken does not seem to impact the fare amount

In [None]:
# mean by number of passengers
pass_mean = df.groupby(['passenger_count']).mean()
pass_mean

In [None]:
# unique() returns values in order of appearance, which need not match the sorted groupby index
sb.barplot(x = pass_mean.index, y = pass_mean['fare_amount'])

Notes:
* The number of passengers does not seem to impact the fare amount

**6. Feature Subset Selection**

* Approaches:
1. Filter -> Pearson coefficient to measure correlation between features
2. Wrapper -> Forward selection and backward elimination
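
To make the wrapper idea concrete, here is a hand-rolled sketch of greedy forward selection on synthetic data. It is a simplification under stated assumptions: it scores candidates by in-sample R² from a least-squares fit rather than by cross-validation, and the helper names (`r_squared`, `forward_select`) and the synthetic features are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1 - residual.var() / y.var()

def forward_select(X, y, names):
    """Greedy forward selection: at each step add the feature that raises R^2 the most."""
    remaining, chosen, history = list(range(X.shape[1])), [], []
    while remaining:
        scores = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        history.append((names[best], scores[best]))
    return history

# Synthetic data (hypothetical): the target depends strongly on the first
# column, weakly on the second, and not at all on the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

for name, score in forward_select(X, y, ['distance', 'year', 'noise']):
    print(name, round(score, 3))
```

Backward elimination follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.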

In [None]:
# filter approach using Pearson correlation (numeric columns only)
pearson_corr = df.select_dtypes('number').corr()
plt.figure(figsize = (10,5))
sb.heatmap(data = pearson_corr, cmap = "Reds", annot = True)

In [None]:
# listing correlations of features with target
correlations = abs(pearson_corr['fare_amount'])
correlations

**Notes:**
* Distance has the highest correlation with the target, with a Pearson coefficient of 0.795

In [None]:
df

In [None]:
# wrapper method: forward selection
# estimator: LinearRegression
# cross-validation: 5-fold
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = df.iloc[:,6:]
y = df.iloc[:,0]
linear = LinearRegression()
sfs = SFS(linear, k_features = 'best', forward = True, floating = False, verbose = 0, cv = 5)
sfs = sfs.fit(X,y)

In [None]:
# Best feature at each step
sfs.subsets_

In [None]:
# name of top features
sfs.k_feature_names_

In [None]:
# cross-validation score
sfs.k_score_

**Notes**
* Distance, year and passenger count were identified as the best features, in this order
* Cross-validation score: 0.581

In [None]:
# wrapper method: backward elimination
# estimator: LinearRegression
# cross-validation: 5-fold
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = df.iloc[:,6:]
y = df.iloc[:,0]
linear = LinearRegression()
sfs = SFS(linear, k_features = 'best', forward = False, floating = False, verbose = 0, cv = 5)
sfs = sfs.fit(X,y)

In [None]:
# Best feature at each step
sfs.subsets_

In [None]:
# name of top features
sfs.k_feature_names_

In [None]:
# cross-validation score
sfs.k_score_

**Notes**
* Distance, year and passenger count were again identified as the best features, in this order
* Cross-validation score: 0.581

**7. Feature Creation**