**Linear Regression (Supervised Learning - Regression)**

**Task:** Predict a continuous value (e.g., house prices).

**Dataset:** California Housing Dataset (built into scikit-learn).

In [5]:
# Install scikit-learn
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.





In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing  # To load the California housing dataset
from sklearn.model_selection import train_test_split  # To split the dataset into training and testing sets
from sklearn.linear_model import LinearRegression  # To create and train a Linear Regression model
from sklearn.metrics import mean_squared_error  # To evaluate the model's performance
import numpy as np  # To handle numerical operations

# Step 1: Load the California Housing dataset
data = fetch_california_housing()  # Fetches the dataset, which contains house prices and related features

# Step 2: Extract features (X) and target variable (y)
X = data.data  # Features (e.g., population, median income, house age, etc.)
y = data.target  # Target variable (median house value in $100,000s)

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# `test_size=0.2` means 20% of the data is used for testing, 80% for training
# `random_state=42` ensures reproducibility, so the same data split happens every time the code runs

# Step 4: Create and train a Linear Regression model
model = LinearRegression()  # Initializes the Linear Regression model
model.fit(X_train, y_train)  # Trains the model using the training data (X_train, y_train)

# Step 5: Make predictions on the test data
y_pred = model.predict(X_test)  # Uses the trained model to predict house prices for the test set

# Step 6: Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)  # Calculates the average squared difference between actual and predicted values
print("Mean Squared Error (MSE):", mse)  # Prints the error value

# model.predict already returns a NumPy array, so this conversion is redundant;
# it does not affect overflow behavior and is kept only to make the type explicit
y_pred = np.array(y_pred)


Mean Squared Error (MSE): 0.5558915986952426
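
The MSE above is in squared target units, which makes it hard to read directly. A minimal sketch of converting it to a root mean squared error (RMSE), assuming the target is expressed in $100,000s as stated in the code comments; the variable names `rmse` and `rmse_dollars` are illustrative and not part of the original cell:

```python
import math

# MSE reported by the cell above (target assumed to be in units of $100,000)
mse = 0.5558915986952426

# The square root brings the error back to the target's own units
rmse = math.sqrt(mse)

# Convert to dollars under the $100,000 scaling assumption
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.4f} (target units)")
print(f"RMSE: about ${rmse_dollars:,.0f}")
```

Read this way, a typical prediction on the test set is off by roughly three quarters of a target unit, which is easier to judge than the squared figure.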


**Task:**
1. Predict Car Prices using a Custom Dataset

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split  # To split the dataset into training and testing sets
from sklearn.linear_model import LinearRegression  # To create and train a Linear Regression model
from sklearn.metrics import mean_squared_error, r2_score  # To evaluate the model's performance
from sklearn.impute import SimpleImputer  # To handle missing values
import numpy as np

# Step 1: Load the dataset
df = pd.read_csv("D:\\CUBICLES\\car-sale-advertisements\\car_ad.csv", encoding='ISO-8859-1')  # Replace with your file path
# print(df.head())
# df.info()

# Step 2: Extract features (X) and target variable (y)
X = df[['year', 'mileage', 'engV']]
y = df['price']

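# Handle missing values by imputing with the column mean
# Note: the imputer is fitted on all rows here; fitting it on X_train only would keep test data out of preprocessing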
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create and train a Linear Regression model
model = LinearRegression()  
model.fit(X_train, y_train)  

# Step 5: Make predictions on the test data
y_pred = model.predict(X_test) 

# Step 6: Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)  # Calculates the average squared difference between actual and predicted values
print("Mean Squared Error (MSE):", mse)

r2 = r2_score(y_test, y_pred)  # Proportion of the variance in price explained by the model
print("R-squared (R2):", r2)

print(y_pred)  # model.predict already returns a NumPy array, so no conversion is needed

Mean Squared Error (MSE): 557158768.5378634
R-squared (R2): 0.14461985527651156
[11877.47512055 18549.59601949 26632.39529023 ... 15720.65725138
 19436.20200534 15336.03759624]
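
To make the two metrics above easier to interpret, here is a small sketch that computes R-squared by hand from its definition (1 minus residual sum of squares over total sum of squares) and converts the reported MSE to an RMSE in price units. The helper name `r2_by_hand` and the toy values are hypothetical and are not taken from `car_ad.csv`:

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, the same definition r2_score uses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
perfect = r2_by_hand(y_true, y_true)        # predictions match exactly
baseline = r2_by_hand(y_true, [25.0] * 4)   # always predict the mean

print("R2 of perfect predictions:", perfect)
print("R2 of mean-only predictions:", baseline)

# RMSE for the car price model, from the MSE printed above
rmse = np.sqrt(557158768.5378634)
print("RMSE in price units:", round(float(rmse), 2))
```

On this scale, the reported R-squared of about 0.145 sits much closer to the mean-only baseline than to a perfect fit, which suggests that `year`, `mileage`, and `engV` alone capture only a small share of the price variation under a linear model.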


In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score  # To evaluate the model's performance
from sklearn.impute import SimpleImputer  # To handle missing values
from sklearn.preprocessing import StandardScaler  # To standardize features to zero mean and unit variance
import os

# Step 1: Load the dataset
dataset_path = os.path.join("D:\\CUBICLES", "car-sale-advertisements", "car_ad.csv")

if os.path.exists(dataset_path):
    df = pd.read_csv(dataset_path, encoding='ISO-8859-1')
    # print(df.head())
    # df.info()

    # Step 2: Extract features (X) and target variable (y)
    X = df[['year', 'mileage', 'engV']]
    y = df['price']

    # Handle missing values by imputing with the mean
    imputer = SimpleImputer(strategy='mean')
    X = imputer.fit_transform(X)

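    # Scale the features (this does not change ordinary least squares predictions,
    # but it puts the coefficients on a comparable scale)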
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Step 3: Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Step 4: Create and train a Linear Regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Step 5: Make predictions on the test data
    y_pred = model.predict(X_test)

    # Step 6: Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)  # Proportion of the variance in price explained by the model
    print("Mean Squared Error (MSE):", mse)
    print("R-squared (R2):", r2)

else:
    print(f"Error: Dataset not found at {dataset_path}")

Mean Squared Error (MSE): 557158768.5378633
R-squared (R2): 0.14461985527651178
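
The MSE and R-squared here match the unscaled run up to floating-point noise. That is expected: ordinary least squares with an intercept gives the same predictions when each feature is shifted and rescaled, because the coefficients absorb the change. The following sketch illustrates this with NumPy only, on synthetic data standing in for `year`, `mileage`, and `engV`; the data, the coefficients, and the helper `ols_fit_predict` are assumptions made for the demonstration:

```python
import numpy as np

# Synthetic stand-ins for year, mileage and engine volume (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(loc=[2008.0, 150.0, 2.0], scale=[7.0, 80.0, 1.0], size=(500, 3))
y = 900.0 * X[:, 0] - 40.0 * X[:, 1] + 3000.0 * X[:, 2] + rng.normal(scale=5000.0, size=500)


def ols_fit_predict(features, target):
    """Fit least squares with an intercept and return in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Standardize each column the way StandardScaler does (population std, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_fit_predict(X, y)
pred_scaled = ols_fit_predict(X_scaled, y)

mse_raw = np.mean((y - pred_raw) ** 2)
mse_scaled = np.mean((y - pred_scaled) ** 2)

print("MSE without scaling:", mse_raw)
print("MSE with scaling:   ", mse_scaled)
print("Same predictions:   ", np.allclose(pred_raw, pred_scaled))
```

Scaling still matters for models that are sensitive to feature magnitude, such as regularized regression or distance-based methods, so keeping the step is reasonable if the model may change later.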


**Task:**

2. Salary Prediction based on Experience