import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('/content/sample_data/uber.csv')

# 1. Pre-process the dataset
# Handle missing values (if any)
data.dropna(inplace=True)

# Split the data into features (X) and target (y)
X = data[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']]
y = data['fare_amount']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Initialize and train Random Forest Regression model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# 5. Evaluate the Models and Compare Their Scores
# Make predictions using both models
lr_predictions = lr_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)

# Calculate R-squared (R2) and Root Mean Squared Error (RMSE) for both models
lr_r2 = r2_score(y_test, lr_predictions)
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_predictions))
rf_r2 = r2_score(y_test, rf_predictions)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

print("Linear Regression R2:", lr_r2)
print("Linear Regression RMSE:", lr_rmse)
print("Random Forest Regression R2:", rf_r2)
print("Random Forest Regression RMSE:", rf_rmse)

# Visualize model predictions against actual fares
plt.figure(figsize=(10, 6))
plt.scatter(y_test, lr_predictions, label='Linear Regression', alpha=0.5)
plt.scatter(y_test, rf_predictions, label='Random Forest Regression', alpha=0.5)
plt.xlabel('Actual Fare Amount')
plt.ylabel('Predicted Fare Amount')
plt.legend()
plt.title('Model Predictions vs. Actual Fares')
plt.show()


Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.
In a machine learning project, the data we come across is not always clean and well
formatted. Before doing any operation on the data, it must be cleaned and put into a consistent
format. Data preprocessing is the task that handles this.
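
As a rough sketch of what cleaning can look like for ride data of this kind, the snippet below drops rows with missing values and then filters out rows with implausible values. The toy DataFrame and the validity rules (latitude within [-90, 90], longitude within [-180, 180], fare above zero) are assumptions made for illustration; they are not checks taken from the notebook code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few of the columns used above.
raw = pd.DataFrame({
    "fare_amount": [7.5, np.nan, -3.0, 12.0, 9.0],
    "pickup_latitude": [40.73, 40.75, 40.76, 95.0, 40.71],
    "pickup_longitude": [-73.99, -73.98, -73.97, -73.96, -74.00],
})

# Step 1: drop rows that contain any missing value.
clean = raw.dropna()

# Step 2: keep only rows with plausible values (assumed rules, see lead-in).
valid = (
    clean["pickup_latitude"].between(-90, 90)
    & clean["pickup_longitude"].between(-180, 180)
    & (clean["fare_amount"] > 0)
)
clean = clean[valid].reset_index(drop=True)

print(len(raw), "rows before cleaning,", len(clean), "rows after")
```

Which rules are appropriate depends on the dataset; the point is only that missing values are one of several kinds of bad rows worth checking for.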

Linear Regression:
Linear regression is one of the simplest and most popular machine learning algorithms. It is a statistical
method used for predictive analysis. Linear regression makes predictions for continuous (real-valued)
numeric variables such as sales, salary, age, and product price.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or
more independent variables (x), hence the name linear regression. Because the relationship is linear,
the model finds how the value of the dependent variable changes according to the value
of the independent variable.
The linear regression model provides a sloped straight line (a hyperplane when there are several
independent variables) representing the relationship between the variables.
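
To make the "sloped straight line" concrete, here is a minimal sketch of ordinary least squares for a single independent variable, written without any library. The function name fit_line and the distance/fare numbers are hypothetical; the notebook above uses four coordinate columns with scikit-learn's LinearRegression rather than this one-variable form.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: trip distance (x) and fare (y), chosen to lie exactly on y = 2x + 3.
distance_km = [1.0, 2.0, 3.0, 4.0]
fare = [5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(distance_km, fare)
print("slope:", slope, "intercept:", intercept)
```

The slope says how much the dependent variable changes for a one-unit change in the independent variable, and the intercept is the predicted value when the independent variable is zero.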


Random Forest Regression Models:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on the
concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and combines them: by majority vote for classification, or by averaging
for regression, as with the RandomForestRegressor used above.
A greater number of trees generally leads to higher accuracy and reduces the risk of overfitting,
although the gains level off as more trees are added.
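
The combining step can be sketched in a few lines. The per-tree outputs below are made-up numbers standing in for what five already-trained trees might predict for one input; the sketch covers only how a forest aggregates those outputs, not how the trees are grown.

```python
from statistics import mean, mode

# Hypothetical fare predictions from five trees for the same trip (regression).
tree_fares = [10.5, 11.0, 9.5, 10.0, 11.5]
forest_fare = mean(tree_fares)  # regression: average the trees' outputs

# Hypothetical class labels from five trees for the same trip (classification).
tree_labels = ["short", "long", "short", "short", "long"]
forest_label = mode(tree_labels)  # classification: take the majority vote

print("forest fare prediction:", forest_fare)
print("forest label prediction:", forest_label)
```

Because individual trees err in different directions, averaging or voting tends to cancel part of their errors, which is the motivation for using many trees instead of one.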



Mean Squared Error:
The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures
the average of the squared errors, i.e. the average squared difference between the estimated values
and the true values. It is a risk function, corresponding to the expected value of the squared error
loss. It is always non-negative, and values closer to zero are better. The MSE is the second
moment of the error (about the origin) and thus incorporates both the variance of the estimator
and its bias.
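
The definition can be checked by hand with a small sketch. The helper name mse_manual and the four actual/predicted fares are hypothetical; the notebook itself relies on scikit-learn's mean_squared_error and takes the square root to get the RMSE.

```python
import math


def mse_manual(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical fares: errors are 1, -2, 0 and 3, so the squared errors are 1, 4, 0 and 9.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 10.0, 8.0, 18.0]

mse = mse_manual(actual, predicted)  # (1 + 4 + 0 + 9) / 4
rmse = math.sqrt(mse)                # same units as the fare itself

print("MSE:", mse)
print("RMSE:", rmse)
```

Squaring makes every term non-negative and penalizes large errors more heavily than small ones; the RMSE is often reported alongside the MSE because it is expressed in the same units as the target.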
