
## 🎯 **Assignment Title**

> Predict the price of an Uber ride from a given pickup point to the agreed drop-off location.

---

## 🧠 **Theory Overview (What to Say in Viva)**

This assignment focuses on **Supervised Machine Learning (Regression)**: we predict a **continuous value**, the **fare amount**, using features like **distance**, **date-time**, and **passenger count**.

### ✳️ Concepts Involved

| Concept                      | What It Means |
| ---------------------------- | ------------- |
| **Data Preprocessing**       | Cleaning and transforming raw data (removing missing values, fixing invalid lat/lon, extracting date-time info) |
| **Outliers**                 | Extreme data points that distort model performance; usually visualized with **boxplots** and removed with the **IQR method** |
| **Correlation**              | Shows how two features relate; e.g. distance ↔ fare (positive correlation) |
| **Feature Engineering**      | Creating new features (like “distance” using the Haversine formula) |
| **Linear Regression**        | Predicts values assuming a linear relation between inputs and output |
| **Random Forest Regression** | Ensemble of decision trees; handles non-linear relationships better |
| **Evaluation Metrics**       | R² (goodness of fit) and RMSE (error magnitude) |

---

## 🧾 **Steps in Your Code (Typical Notebook Flow)**

Let’s go step by step, as your notebook (`B1.ipynb`) likely does.

---

### **1️⃣ Importing Required Libraries**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
```

📘 *Exam Tip:* Be ready to explain why each library is used:

* `numpy` → numerical operations (the Haversine computation, `np.sqrt` for RMSE)
* `pandas` → data handling
* `seaborn/matplotlib` → visualization
* `sklearn` → modeling and evaluation

---

### **2️⃣ Loading and Inspecting Dataset**

```python
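df = pd.read_csv("uber.csv")
print(df.head())      # first five rows
df.info()             # column dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns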
```

**What to mention:**

* Dataset contains columns like: `pickup_datetime`, `fare_amount`, `pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, `dropoff_longitude`, and `passenger_count`.
* `fare_amount` is the **target variable**.

---

### **3️⃣ Data Cleaning**

**Tasks:**

* Remove missing values:

  ```python
  df.dropna(inplace=True)
  ```
* Remove invalid fare values (`fare_amount <= 0`)
* Keep valid passenger counts (`1 <= passenger_count <= 6`)
* Remove invalid coordinates:

  ```python
  df = df[(df.pickup_latitude.between(-90, 90)) &
          (df.pickup_longitude.between(-180, 180)) &
          (df.dropoff_latitude.between(-90, 90)) &
          (df.dropoff_longitude.between(-180, 180))]
  ```
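
The fare and passenger-count rules in the list above are stated without code. A minimal sketch of how they could be applied, run here on a tiny invented frame (the sample values and the names `sample`, `valid`, and `cleaned` are illustrative assumptions; only the column names come from the dataset):

```python
import pandas as pd

# Tiny invented sample standing in for uber.csv (values are illustrative only)
sample = pd.DataFrame({
    "fare_amount": [7.5, -3.0, 0.0, 12.0, 9.0],
    "passenger_count": [1, 2, 1, 0, 6],
})

# Keep strictly positive fares and passenger counts from 1 to 6 inclusive
valid = (sample["fare_amount"] > 0) & sample["passenger_count"].between(1, 6)
cleaned = sample[valid].reset_index(drop=True)
print(cleaned)
```

In the notebook itself, the same boolean mask would be built on `df` rather than on `sample`.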

---

### **4️⃣ Feature Engineering: Extracting Date & Time Features**

```python
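df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
# errors='coerce' turns unparseable timestamps into NaT, so drop those rows
df = df.dropna(subset=['pickup_datetime'])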
df['hour'] = df.pickup_datetime.dt.hour
df['day'] = df.pickup_datetime.dt.day
df['month'] = df.pickup_datetime.dt.month
df['year'] = df.pickup_datetime.dt.year
df['dayofweek'] = df.pickup_datetime.dt.dayofweek
```

**Purpose:**
These new features can capture **peak hours** or **seasonal effects** on fare.
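
To see what the `.dt` accessors return and how `errors='coerce'` behaves, here is a small sketch on two invented strings. The sample values, the `utc=True` argument, and the names `raw`, `parsed`, and `features` are assumptions made for this illustration, not part of the notebook.

```python
import pandas as pd

# Two invented inputs: one valid timestamp and one unparseable string
raw = pd.Series(["2015-05-07 19:52:06 UTC", "not a date"])

# Invalid strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce", utc=True)
parsed = parsed.dropna()  # drop the NaT row before extracting features

features = pd.DataFrame({
    "hour": parsed.dt.hour,
    "dayofweek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    "month": parsed.dt.month,
})
print(features)
```

7 May 2015 was a Thursday, so `dayofweek` is 3 for the surviving row.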

---

### **5️⃣ Compute Distance (Haversine Formula)**

```python
def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; works element-wise on pandas Series."""
    R = 6371  # mean Earth radius (km)
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

df['distance_km'] = haversine(df['pickup_latitude'], df['pickup_longitude'],
                              df['dropoff_latitude'], df['dropoff_longitude'])
```
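
A quick sanity check can be useful before trusting the new column. The sketch below repeats the function so that it runs on its own and checks two cases with known answers: one degree of longitude along the equator should be about 6371 × π/180 ≈ 111.19 km, and identical points should give zero. The test coordinates are chosen for illustration only.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    R = 6371  # mean Earth radius (km)
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return R * 2 * np.arcsin(np.sqrt(a))

# One degree of longitude on the equator
one_degree = haversine(0.0, 0.0, 0.0, 1.0)

# Same pickup and drop-off point
same_point = haversine(40.7, -74.0, 40.7, -74.0)

print(round(float(one_degree), 2), float(same_point))
```

Rows where `distance_km` comes out as exactly 0 have identical pickup and drop-off coordinates, which may be worth inspecting during cleaning.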

---

### **6️⃣ Outlier Detection & Removal**

**Boxplot Visualization:**

```python
sns.boxplot(x=df['fare_amount'])
plt.show()
```

**IQR Method:**

```python
Q1 = df['fare_amount'].quantile(0.25)
Q3 = df['fare_amount'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['fare_amount'] >= Q1 - 1.5*IQR) & (df['fare_amount'] <= Q3 + 1.5*IQR)]
```
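
If the examiner asks for the same treatment on another column (for example `distance_km`), the IQR rule can be wrapped in a helper. This is a sketch only: `iqr_filter` is a hypothetical name, and the fare values are invented to show one obvious outlier being dropped.

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Return the rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

# Invented fares: six ordinary values and one extreme value
fares = pd.DataFrame({"fare_amount": [5, 6, 7, 8, 9, 10, 200]})

# Q1 = 6.5, Q3 = 9.5, IQR = 3, so the accepted range is 2.0 to 14.0
kept = iqr_filter(fares, "fare_amount")
print(kept)
```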

---

### **7️⃣ Correlation Check**

```python
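# numeric_only=True skips text columns, which df.corr() cannot handle in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()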
```

Expected:
Strong positive correlation between **distance_km** and **fare_amount**.

---

### **8️⃣ Splitting Data**

```python
X = df[['distance_km', 'passenger_count']]
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

---

### **9️⃣ Model 1: Linear Regression**

```python
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Linear Regression R²:", r2_score(y_test, y_pred_lr))
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
```

---

### **🔟 Model 2: Random Forest Regression**

```python
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest R²:", r2_score(y_test, y_pred_rf))
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
```

Expected:
Random Forest typically performs better (higher R², lower RMSE), though the size of the gap depends on the features used and how the data was cleaned.

---

### **🔍 1️⃣1️⃣ Visualization: Actual vs Predicted**

```python
plt.scatter(y_test, y_pred_rf, alpha=0.5)
plt.xlabel("Actual Fare")
plt.ylabel("Predicted Fare")
plt.title("Random Forest - Actual vs Predicted")
plt.show()
```

---

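## 🧮 **Expected Results**

| Model             | R²          | RMSE (in fare units) |
| ----------------- | ----------- | -------------------- |
| Linear Regression | 0.70 – 0.75 | ~5.0                 |
| Random Forest     | 0.85 – 0.90 | ~3.0                 |

These ranges are indicative; your values will depend on the cleaning steps, the features in `X`, and the train/test split.

✅ **Conclusion:**
Random Forest Regression usually gives the lower error here because it can model non-linear relationships between the features and the fare.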

---

## 💬 **Viva / Oral Questions and How to Answer**

| Question                                                | Short, Confident Answer |
| ------------------------------------------------------- | ----------------------- |
| What is Regression?                                     | A supervised ML technique used to predict continuous values. |
| What is R² score?                                       | The proportion of variance in the target that the model explains. Closer to 1 is better. |
| What is RMSE?                                           | Root Mean Squared Error: the square root of the mean squared prediction error, in the same units as the fare. Lower is better. |
| Why Haversine formula?                                  | To calculate the great-circle distance between two latitude/longitude points. |
| Why remove outliers?                                    | Extreme values can distort the fitted model, especially Linear Regression, and inflate the error metrics. |
| Difference between Linear Regression and Random Forest? | Linear is simple but assumes a linear relation; RF handles complex non-linear data. |
| Why use `train_test_split`?                             | To test the model’s performance on unseen data. |
| Which model is better?                                  | Usually Random Forest: higher R² and lower RMSE on the test set. |
| What is Feature Engineering?                            | Creating new informative variables from existing ones. |
---

## ⚙️ **Possible Exam Variations (Be Ready for These Changes)**

| Modification Asked             | What to Do |
| ------------------------------ | ---------- |
| “Add another feature”          | Add `hour` or `dayofweek` to `X` |
| “Try different test_size”      | Set `test_size=0.3` in `train_test_split` |
| “Show model accuracy visually” | Add a scatter plot (actual vs predicted) or a residual plot |
| “Add feature scaling”          | Apply `StandardScaler` before fitting the model, fitted on the training data only |
| “Use only distance feature”    | Change to `X = df[['distance_km']]` |
| “Explain overfitting”          | Model performs well on training data but poorly on test data |
| “Explain ensemble learning”    | Combining multiple models (as Random Forest combines many trees) to improve performance |
---

## 📄 **Conclusion**

In this assignment, we learned how to:

* Preprocess data and remove outliers
* Compute real-world distances from latitude/longitude
* Apply **Linear Regression** and **Random Forest Regression**
* Evaluate models with **R²** and **RMSE**
* Recognize the importance of **data quality and feature engineering**

