# **Project Name**    - Real Estate Industry Project

##### **Group - 31**

# **Project Summary -**

The NYC taxi trip duration dataset, released by the NYC Taxi and Limousine Commission, records the trip duration of taxi rides within New York City for the first half of 2016. The objective of this project is to predict the duration of taxi trips in NYC from features such as pickup and dropoff coordinates, pickup and dropoff datetime information, and other recorded factors. The project involves several steps: pre-processing the data, exploring the relationships between variables, building regression models, and choosing the best model after evaluating performance using relevant metrics and charts.

The initial data preprocessing involved converting datetime features into more meaningful variables such as the day of the week, day of the month, month of the year, and hour of the day. This allows for a better understanding of how the trip duration varies across different time periods, and also creates new features that can capture temporal (daily, weekly and monthly) relationships with the trip duration. Additionally, outliers in trip duration are handled by trimming extreme values, and pickup and dropoff coordinates are filtered to lie within (or in the near proximity of) NYC boundaries.
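The preprocessing described above could look roughly like the following sketch. The column names (`pickup_datetime`, `pickup_longitude`, `pickup_latitude`, `trip_duration`), the sample rows, the bounding box, and the duration thresholds are all illustrative assumptions, not the values used in the project.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions for illustration.
trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "pickup_longitude": [-73.982, -73.980, -60.000],
    "pickup_latitude": [40.768, 40.739, 40.764],
    "trip_duration": [455, 663, 2124],  # seconds
})

def add_time_features(df):
    """Derive calendar features from the pickup timestamp."""
    out = df.copy()
    ts = pd.to_datetime(out["pickup_datetime"])
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["day_of_month"] = ts.dt.day
    out["month"] = ts.dt.month
    return out

def filter_to_bounds(df, lon=(-74.3, -73.6), lat=(40.4, 41.0)):
    """Keep rows whose pickup point lies inside an illustrative bounding box."""
    mask = df["pickup_longitude"].between(*lon) & df["pickup_latitude"].between(*lat)
    return df[mask]

def trim_duration(df, low=60, high=6 * 3600):
    """Drop trips shorter than `low` or longer than `high` seconds (assumed cut-offs)."""
    return df[df["trip_duration"].between(low, high)]

features = trim_duration(filter_to_bounds(add_time_features(trips)))
print(features[["hour", "day_of_week", "day_of_month", "month", "trip_duration"]])
```

The third sample row has a longitude far outside the box, so it is dropped by `filter_to_bounds`; the same filter would need to be repeated for the dropoff coordinates.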

To select the most relevant features, separate datasets are created from different combinations of a few features of interest, namely passenger count, store_and_fwd_flag, and holidays. Each of these datasets is then fitted to the regression models in a separate iteration, which allows their impact on the predictive power of the models to be compared.
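One way to enumerate such feature combinations is sketched below. It generates every subset of the three optional features; whether the project tried all of them or only a selection is not stated, and the `BASE` feature list is a hypothetical placeholder.

```python
from itertools import combinations

# The three optional features named above.
OPTIONAL = ["passenger_count", "store_and_fwd_flag", "holiday"]
# Hypothetical base features that every dataset variant keeps.
BASE = ["hour", "day_of_week", "month", "distance_km"]

def feature_sets(base, optional):
    """Return one feature list per subset of the optional features."""
    sets = []
    for k in range(len(optional) + 1):
        for combo in combinations(optional, k):
            sets.append(base + list(combo))
    return sets

all_sets = feature_sets(BASE, OPTIONAL)
for columns in all_sets:
    print(columns)
```

Each list in `all_sets` can then be used to select columns for one modelling iteration.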

On these features, extensive Exploratory Data Analysis (EDA) is conducted using various visualization techniques, including but not limited to scatter plots, histograms, and box plots, to uncover patterns between variables. A univariate analysis is carried out for each continuous and categorical variable. Bivariate relationships between the same features and the dependent variable are also studied.

Further pre-processing steps involved transforming continuous variables using appropriate transformations so that the residuals are approximately normally distributed, and applying standard scaling to ensure consistent scaling across features. A few hypothesis tests are also carried out, including tests of whether the target variable depends on some features of interest.
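As a minimal sketch of these two steps, the snippet below applies a log transform followed by standard scaling. The choice of `log1p` is an assumption (the text only says "appropriate transformations"), and the duration values are made up.

```python
import numpy as np

# Made-up, right-skewed trip durations in seconds.
durations = np.array([120.0, 455.0, 663.0, 2124.0, 5400.0])

# Assumed transformation: log(1 + x) compresses the long right tail.
log_durations = np.log1p(durations)

def standardize(x):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    return (x - x.mean()) / x.std()

scaled = standardize(log_durations)
print(scaled.round(3))
```

Standard scaling subtracts the mean and divides by the standard deviation, so every scaled feature ends up on a comparable range regardless of its original units.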

The scaled dataset is split into train and test datasets based on an appropriate test ratio. Seven different regression models are then fitted to this dataset, roughly in order of increasing complexity: Linear Regression, Lasso regularized linear model, Ridge regularized linear model, Polynomial regression, Light gradient-boosting machine, Decision Trees, and XGBoost. The two best models are then stacked together to test for further improvement.
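The split-then-fit workflow can be illustrated with a NumPy-only sketch that uses synthetic data and an ordinary least-squares baseline. The helper `split_train_test`, the 0.2 test ratio, and the data are assumptions for illustration; the project's actual models would come from libraries such as scikit-learn, LightGBM, and XGBoost.

```python
import numpy as np

# Synthetic data: three features and a noisy linear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

def split_train_test(X, y, test_ratio=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

X_train, X_test, y_train, y_test = split_train_test(X, y)

# Baseline model: ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Evaluate on the held-out rows only.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
pred = A_test @ coef
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print(f"test RMSE: {rmse:.3f}")
```

Fitting on the training rows and scoring on the held-out rows is what makes the comparison between models a measure of generalization rather than of fit.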

The R<sup>2</sup> score, adjusted R<sup>2</sup> and RMSE are selected as the model evaluation metrics: R<sup>2</sup> measures the share of variance explained, adjusted R<sup>2</sup> penalizes additional predictors, and RMSE expresses the error in the units of the target. Based on these metrics alone, XGBoost was concluded to be the best performing model, with a best R<sup>2</sup> score of 0.813 and LightGBM a close second. The stacking model built from these two models did not improve on the results. Also, the best combination of the three features of interest is to include holidays while excluding passenger count and store_and_fwd_flag, which are deemed redundant.
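For reference, the three metrics can be computed from their standard definitions as in the sketch below. The function name `regression_metrics` and the sample values are illustrative, not taken from the project.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return (R^2, adjusted R^2, RMSE) for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes the number of predictors relative to the sample size.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = math.sqrt(ss_res / n)
    return r2, adj_r2, rmse

r2, adj_r2, rmse = regression_metrics(
    [3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5], n_features=1
)
print(r2, adj_r2, rmse)
```

Adjusted R<sup>2</sup> is never larger than R<sup>2</sup>, which makes it the fairer of the two when comparing dataset variants with different numbers of features.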

Visualization played a crucial role in understanding and presenting the results. Predictions for each model are visualized using scatter plots, line plots, and density plots, allowing for a comparison between the true and predicted values. The ELI5 library is then utilized to provide explainability for the XGBoost model, enabling a better understanding of the feature importances and their contributions to the predictions.

Future scope for this project includes several avenues for improvement and expansion. First, additional features related to traffic patterns or special events could be incorporated to enhance the predictive power of the models. Also, fine-tuning the hyperparameters of the models and conducting cross-validation experiments could help achieve better generalization and robustness; both were avoided in this project because of the high computational cost of complex models like XGBoost.

In conclusion, this project successfully developed a regression model using XGBoost to predict the duration of NYC taxi trips. Through thorough data preprocessing, exploratory analysis, and model evaluation, valuable insights were gained into the factors influencing the trip duration. The project also lays a foundation for future improvements and expansions that could enhance the accuracy and applicability of the models in predicting taxi trip durations.


# **GitHub Link -**

https://github.com/MAST30034-Applied-Data-Science/real-estate-industry-project-open-source-industry-project-31

# **Problem Statement**

**BUSINESS PROBLEM STATEMENT**

The New York City (NYC) taxi service is a critical transportation system for millions of residents and visitors. Efficient prediction of the duration of taxi trips is necessary to significantly enhance the overall taxi service experience and enable effective planning and resource allocation.

The NYC taxi trip duration dataset is a dataset released by the NYC [Taxi and Limousine Commission](https://www.nyc.gov/site/tlc/index.page), which includes several features such as pickup time, dropoff time, and pickup coordinates as possible predictors of taxi trip duration. The aim of this project is to accurately predict the trip duration using a regression model. Apart from utilising the above predictors, several other external factors such as holidays, month of the year, and hour of the day shall be utilised to improve the predictive power.

By developing a reliable algorithm that predicts the trip duration beforehand, several stakeholders can benefit. For taxi service providers, it enables better resource management, allowing them to allocate taxis more efficiently and reduce customer wait times. Additionally, improved trip duration predictions can help taxi drivers plan their routes and schedules more effectively, leading to increased productivity. Hence, this project aims to enhance the efficiency of the taxi service, improve customer satisfaction, and contribute to the overall transportation ecosystem of the city.

# **Let's Begin !**

## **1. Data Exploration & Retrieval**

**Main Dataset**

**Internal**:
The main dataset is extracted from `domain.com.au`, which records the rental properties and their corresponding features, including `Address`, `Prices`, `Number of Bedroom`, `Number of Bathroom`, `Number of Parking`, and `Unit Type`. 

**External**:
In addition to the property data, official statistics published by the Australian Bureau of Statistics (ABS) on a wide range of economic, social, population and environmental matters of importance to Australia are used to provide more detail in support of the analysis of rental properties.

**Supporting External Dataset**

1. PTV Train Station Data
2. PTV Tram Stop Data
3. School Location Data
4. Landmarks Data
5. Crime & Offence Data
6. Moving Annual Rent By Suburb (Quarterly) Data

## **2. Extract, Transform, Load**

**Property Dataset**

It is found that
*   There are **10125 rows and 6 columns**, one of which is **Prices**, the dependent variable to be predicted
*   There are **906 duplicated rows** in the dataset, which we removed

Then we
*   Extract the numerical values from `Prices` column
*   Remove the rows with missing `Prices` values
*   Clean the `Address` column
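The cleaning steps above can be sketched with the standard `re` module. The helpers `parse_price` and `clean_address` are hypothetical names, and the sketch assumes every price is quoted as a weekly dollar amount, which may not hold for every listing.

```python
import re

# Matches the first dollar amount, e.g. "$450", "$470.00" or "$1,200".
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")

def parse_price(text):
    """Return the first dollar amount in a listing string as a float, or None."""
    match = PRICE_RE.search(text or "")
    if match is None:
        return None  # no numeric price, e.g. "Contact agent"
    return float(match.group(1).replace(",", ""))

def clean_address(text):
    """Replace embedded line breaks (real or literal '\\n') and collapse whitespace."""
    return re.sub(r"\s+", " ", text.replace("\\n", " ")).strip()

print(parse_price("$450 Per Week"))
print(parse_price("$470.00 per week pw"))
print(parse_price("Contact agent"))
print(clean_address("39 Durham Crescent, \nHOPPERS CROSSING VIC 3029"))
```

Rows for which `parse_price` returns `None` correspond to the missing `Prices` values that are removed in the step above.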


**ABS Dataset**
*   Extract only relevant features


## **3. Exploratory Data Analysis**

After joining all the internal and external datasets together, we did preliminary analysis on the property, property prices, and property unit type.

In [30]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Define the path to your PNG image file
image_path = '../plots/house.png'

# Load and display the image using Matplotlib
img = mpimg.imread(image_path)
plt.imshow(img)
plt.axis('off')  # Optional: Turn off axes if not needed
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: '../plots/house.png'

### Import Libraries

In [1]:
import pandas as pd
import os

In [2]:
# Work from the raw dataset directory (note: this changes the working
# directory for every later cell, so relative paths resolve from here)
os.chdir('../data/raw/dataset')

# List all the files in the directory ending with ".csv"
csv_files = [f for f in os.listdir() if f.endswith('.csv')]
print(len(csv_files))

# Read each CSV file into its own DataFrame
dfs = [pd.read_csv(csv) for csv in csv_files]

# Concatenate all the DataFrames in the list into one DataFrame
final_df = pd.concat(dfs, ignore_index=True)
final_df

507


| Unnamed: 0 | Address | Prices | Bedroom | Bathroom | Parking | Type |
|---|---|---|---|---|---|---|
| 0 | 39 Durham Crescent, \nHOPPERS CROSSING VIC 3029 | $450 Per Week | 3 | 2 | 2 | House |
| 1 | 7/1 Mabel Street, \nIVANHOE VIC 3079 | $450 Per Week | 2 | 1 | 1 | Townhouse |
| 2 | 2/59 Green Street, \nIVANHOE VIC 3079 | $450 per week | 2 | 1 | 1 | Apartment / Unit / Flat |
| 3 | 104/15 Ivanhoe Parade, \nIVANHOE VIC 3079 | $450 per week | 1 | 1 | 1 | Apartment / Unit / Flat |
| 4 | 4/64 Lorraine Crescent, \nJACANA VIC 3047 | $450 | 2 | 1 | 1 | Townhouse |
| ... | ... | ... | ... | ... | ... | ... |
| 10120 | 30 Scott Street, \nBELMONT VIC 3216 | $470.00 | 3 | 1 | 2 | House |
| 10121 | 6 Stringybark Court, \nBERWICK VIC 3806 | $470 per week | 3 | 2 |  | House |
| 10122 | 104 Canterbury Jetty Road, \nBLAIRGOWRIE VIC 3942 | $470.00 per week pw | 3 | 1 | 2 | House |
| 10123 | 14 Northerly Drive, \nBONNIE BROOK VIC 3335 | $470 | 4 | 2 | 2 | House |


In [5]:
sum(final_df.duplicated())

906

In [6]:
final_df.isnull().sum()

Address     0
Prices      0
Bedroom     0
Bathroom    0
Parking     0
Type        0
dtype: int64

On a first look at the dataset, it is found that
*   There are **10125 rows and 6 columns**, one of which is **Prices**, the dependent variable to be predicted
*   There are **906 duplicated rows** in the dataset
*   There are **no missing values** in the dataset
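The duplicate check reported above, together with the removal of those rows, can be sketched on a small made-up frame. The sample rows are hypothetical; on the real data the same calls would be applied to `final_df`.

```python
import pandas as pd

# Hypothetical listings with one exact duplicate row.
listings = pd.DataFrame({
    "Address": ["1 A St", "1 A St", "2 B Rd"],
    "Prices": ["$450 per week", "$450 per week", "$470"],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(listings.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = listings.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Note that `isnull()` only detects true missing values, so placeholder strings in a column such as `Prices` would not be counted as missing.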

## **2. Understanding Your Variables**