**Group 1: Jay Capozzoli, Sufyan Haroon, Noah Severin**



**Objective**
In this assignment, you will work with the "used car" dataset, applying data cleansing, linear regression, and K-Nearest Neighbors (KNN) regression to predict car prices.

**Part 1: Data Cleansing**

Tasks:

1. Fix Data Issues
    
    Correct the typo: rename the column "milage" to "mileage".
2. Convert Numerical Variables
      
    Ensure "mileage" and "model_year" are treated as numerical variables.
3. Encode Categorical Variables
    
    Apply one-hot encoding to the categorical columns:    
      1. "fuel_type"
      2. "clean_title"
      3. "engine"
      4. "transmission"
      5. "ext_col" (exterior color)
      6. "int_col" (interior color)
4. Remove Anomalies
    
    Identify and handle any anomalies or outliers in the dataset.
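
Tasks 1 to 3 can be sketched on a toy frame before touching the real file. This is a minimal illustration only: the two rows and the `toy` name are made up, and the raw string formats ("51,000 mi.", "$10,300") are an assumption about how the CSV stores these fields.

```python
import pandas as pd

# Toy rows in the assumed raw format; not taken from the real dataset.
toy = pd.DataFrame({
    "milage": ["51,000 mi.", "34,742 mi."],
    "price": ["$10,300", "$38,005"],
    "fuel_type": ["Gasoline", "Hybrid"],
})

# Task 1: fix the column-name typo.
toy = toy.rename(columns={"milage": "mileage"})

# Task 2: strip formatting characters, then cast to int.
# regex=False treats "$" and "." as literal characters on every pandas version.
toy["mileage"] = (
    toy["mileage"]
    .str.replace(",", "", regex=False)
    .str.replace(" mi.", "", regex=False)
    .astype(int)
)
toy["price"] = (
    toy["price"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype(int)
)

# Task 3: one-hot encode; dtype=int yields 0/1 columns rather than booleans.
toy = pd.get_dummies(toy, columns=["fuel_type"], dtype=int)
print(toy)
```

Passing `regex=False` matters because the default for `str.replace` changed between pandas versions, and "$" or "." would otherwise be read as regex metacharacters on older releases.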

In [None]:
import pandas as pd
import numpy as np

# Load dataset
used_car_df = pd.read_csv('used_cars.csv')

# Fix column name (Correcting typo)
used_car_df.rename(columns={"milage": "mileage"}, inplace=True)

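# Convert numerical columns (regex=False so "." and "$" are matched literally on any pandas version)
used_car_df["mileage"] = used_car_df["mileage"].str.replace(',', '', regex=False).str.replace(' mi.', '', regex=False).astype(int)
used_car_df["price"] = used_car_df["price"].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(int)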
used_car_df["model_year"] = used_car_df["model_year"].astype(int)

# Define categorical columns to encode explicitly
categorical_cols = ["fuel_type", "clean_title", "engine", "transmission", "ext_col", "int_col"]

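# Apply one-hot encoding to the listed categorical columns.
# dtype=int yields 0/1 columns, which count as numeric during feature selection.
# Note: get_dummies skips NaN by default, and drop_first=True removes the first level of each column.
used_car_df = pd.get_dummies(used_car_df, columns=categorical_cols, drop_first=True, dtype=int)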

# Remove anomalies: keep rows with 0 < price < 300,000 and 0 < mileage < 250,000
used_car_df = used_car_df[(used_car_df['price'] > 0) & (used_car_df['price'] < 300000)]
used_car_df = used_car_df[(used_car_df['mileage'] > 0) & (used_car_df['mileage'] < 250000)]

print(f"Dataset size after preprocessing: {used_car_df.shape[0]} rows")
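
The fixed cut-offs above are one way to handle anomalies. As an alternative, a data-driven rule such as the interquartile range (IQR) fence could be used. The sketch below is not what this notebook does: `iqr_mask` is a hypothetical helper, the multiplier 1.5 is the conventional default, and the prices are made-up values.

```python
import pandas as pd

def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)

# Made-up prices with one extreme value.
prices = pd.Series([9000, 12000, 15000, 18000, 21000, 2_500_000])
kept = prices[iqr_mask(prices)]
print(kept.tolist())
```

An IQR fence adapts to the distribution of each column, whereas fixed thresholds are easier to explain and reproduce. Either choice should be reported alongside the number of rows removed.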

**Part 2: Linear Regression Model**

Tasks:
1. Feature Selection
  1. Use "mileage", "model_year", and all one-hot encoded features as independent variables.
  2. The dependent variable is "price".

2. Train-Test Split
  1. Split the dataset into 80% training and 20% testing.

3. Train Linear Regression Model
  1. Fit a linear regression model using the training data.

4. Evaluate the Model
  1. Calculate and report the Root Mean Squared Error (RMSE) on the test data.

5. Visualization
  1. Plot a scatter chart with:

    A. X-axis: "mileage"

    B. Y-axis: "price"

    C. Differentiate actual vs. predicted prices.
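
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices. A minimal sketch of that definition, using a hypothetical `rmse` helper and made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, for illustration only.
actual = [10000, 20000, 30000]
predicted = [12000, 19000, 27000]
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

Because RMSE is expressed in the same units as the target, it can be read directly as a typical prediction error in dollars.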

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Feature Selection: exclude "price" and keep only numeric or boolean columns
# (mileage, model_year, and the one-hot encoded features)
X = used_car_df.drop(columns=["price"]).select_dtypes(include=[np.number, "bool"])
y = used_car_df["price"]

# Train-Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and Evaluate the Model
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Linear Regression Root Mean Squared Error (RMSE): {rmse:.2f}")

# Visualization (Actual vs. Predicted Prices for Linear Regression)
plt.figure(figsize=(10, 6))

# Actual Prices
plt.scatter(X_test["mileage"], y_test, color="blue", label="Actual Prices", alpha=0.5, edgecolors='k')

# Predicted Prices
plt.scatter(X_test["mileage"], y_pred, color="red", label="Predicted Prices", alpha=0.5, edgecolors='k')

plt.xlabel("Mileage")
plt.ylabel("Price")
plt.title("Actual vs Predicted Prices (Linear Regression)")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

**Part 3: K-Nearest Neighbors (KNN) Regression**

Tasks:
1. Train KNN Model
  1. Use the same independent variables as in Part 2.
  2. Choose an appropriate value for k (e.g., 3, 5, or 7).

2. Evaluate the Model
  1. Calculate and report the RMSE on the test data.

3. Visualization
  1. Plot a scatter chart with:

    A. X-axis: "mileage"

    B. Y-axis: "price"
    
    C. Differentiate actual vs. predicted prices.
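
One way to choose k rather than fixing it by hand is cross-validation over the candidate values. The sketch below runs on synthetic data, not the used car dataset. The names `X_demo`, `y_demo`, `pipe`, and `search` are hypothetical, and the price formula is invented purely to give the search something to fit.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for mileage and model_year (assumption, not real data).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 200_000, 200),
    rng.integers(2000, 2024, 200),
])
y_demo = (
    60_000
    - 0.2 * X_demo[:, 0]
    + 500 * (X_demo[:, 1] - 2000)
    + rng.normal(0, 2_000, 200)
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsregressor__n_neighbors": [3, 5, 7]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(f"Best k: {best_k}, CV RMSE: {-search.best_score_:.2f}")
```

The scorer is negated because scikit-learn maximizes scores, so the sign is flipped when reporting the cross-validated RMSE.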

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Feature scaling for KNN: distances depend on feature scale, so standardize (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN Model (k = 7 here; 3 and 5 are the other suggested values)
k = 7
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train_scaled, y_train)

# Predict and Evaluate the Model
y_pred_knn = knn.predict(X_test_scaled)
rmse_knn = np.sqrt(mean_squared_error(y_test, y_pred_knn))
print(f"KNN Root Mean Squared Error (RMSE): {rmse_knn:.2f}")

# Visualization (Actual vs. Predicted Prices for KNN)
plt.figure(figsize=(10, 6))

# Actual Prices
plt.scatter(X_test["mileage"], y_test, color="blue", label="Actual Prices", alpha=0.5, edgecolors='k')

# Predicted Prices
plt.scatter(X_test["mileage"], y_pred_knn, color="green", label="Predicted Prices (KNN)", alpha=0.5, edgecolors='k')

plt.xlabel("Mileage")
plt.ylabel("Price")
plt.title(f"Actual vs Predicted Prices (KNN, k={k})")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
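
To put the two RMSE values in context, it can help to compare them against a trivial baseline that always predicts the mean training price. The sketch below uses synthetic data with a single mileage-like feature. All names (`X_demo`, `models`, `results`) are hypothetical and the numbers are not results from the used car dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: price falls linearly with mileage, plus noise (assumption).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 200_000, size=(300, 1))
y_demo = 50_000 - 0.2 * X_demo[:, 0] + rng.normal(0, 3_000, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "knn (k=7)": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=7)),
}

# Fit each model on the same split and record its test RMSE.
results = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    preds = estimator.predict(X_te)
    results[name] = float(np.sqrt(mean_squared_error(y_te, preds)))
    print(f"{name}: RMSE = {results[name]:.2f}")
```

A model whose RMSE is not clearly below the baseline has learned little from the features, which is worth checking before comparing linear regression and KNN against each other.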