
# **MULTIPLE LINEAR REGRESSION**

### **ASSIGNMENT TASKS:**

Your task is to perform a multiple linear regression analysis to predict the price of a Toyota Corolla
based on the given attributes.

### **Dataset Description:**

**The dataset consists of the following variables:**

**Age:** Age in years

**KM:** Accumulated Kilometers on odometer

**FuelType:**  Fuel Type (Petrol, Diesel, CNG)

**HP:** Horse Power

**Automatic:** Automatic  (Yes=1, No=0)

**CC:** Cylinder Volume in cubic centimeters

**Doors:** Number of doors

**Weight:** Weight in Kilograms

**Quarterly_Tax:**

**Price:** Offer Price in EUROs

### **TASKS:**

--> 1. Perform exploratory data analysis (EDA) to gain insights into the dataset. Provide visualizations
and summary statistics of the variables. Preprocess the data before applying MLR.



*   **IMPORT NECESSARY LIBRARIES:**



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
df = pd.read_csv('ToyotaCorolla - MLR.csv')

# Display dataset information and first few rows
df.info()
df.head()


*   **SUMMARY STATISTICS:**



In [None]:
# Summary statistics for numerical variables
print(df.describe())



*   **DATA VISUALIZATIONS:**


(A) HISTOGRAMS OF NUMERIC FEATURES:

In [None]:
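# Histograms for numerical columns (any numeric dtype, not only int64)
numeric_cols = df.select_dtypes(include='number').columns
fig, axes = plt.subplots(4, 3, figsize=(15, 12))
axes = axes.ravel()

for i, col in enumerate(numeric_cols):
    sns.histplot(df[col], bins=30, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')

# Hide any subplot in the 4x3 grid that has no column assigned to it
for ax in axes[len(numeric_cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()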

(B)  SCATTER PLOTS:

In [None]:
# Scatter plots
fig, axes = plt.subplots(3, 2, figsize=(12, 12))

sns.scatterplot(x=df["Age_08_04"], y=df["Price"], ax=axes[0, 0])
axes[0, 0].set_title("Price vs Age")

sns.scatterplot(x=df["KM"], y=df["Price"], ax=axes[0, 1])
axes[0, 1].set_title("Price vs KM")

sns.scatterplot(x=df["HP"], y=df["Price"], ax=axes[1, 0])
axes[1, 0].set_title("Price vs Horse Power")

sns.scatterplot(x=df["cc"], y=df["Price"], ax=axes[1, 1])
axes[1, 1].set_title("Price vs CC")

sns.scatterplot(x=df["Weight"], y=df["Price"], ax=axes[2, 0])
axes[2, 0].set_title("Price vs Weight")

sns.boxplot(x=df["Fuel_Type"], y=df["Price"], ax=axes[2, 1])
axes[2, 1].set_title("Price vs Fuel Type")

plt.tight_layout()
plt.show()

(C) CORRELATION HEATMAP:

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

* **DATA PREPROCESSING OF MULTIPLE LINEAR REGRESSION:**

(A)  Encode Categorical Variable (Fuel_Type):

In [None]:
# Convert categorical column
df = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True)

 (B) Remove Unnecessary Columns:

In [None]:
# Drop any non-essential columns
df.drop(columns=['Cylinders'], inplace=True)

 (C) Check for Multicollinearity (Variance Inflation Factor - VIF):

In [None]:
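import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Define independent variables (excluding Price)
X = df.drop(columns=['Price'])

# Keep numeric and boolean columns (get_dummies may produce booleans) and cast to float
X_numeric = X.select_dtypes(include=['number', 'bool']).astype(float)

# Add an intercept column: variance_inflation_factor does not add one itself,
# and without it the VIF values are computed for an uncentered model
X_const = sm.add_constant(X_numeric, has_constant='add')

# Calculate VIF for every feature (index 0 is the intercept, so skip it)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_numeric.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]

print(vif_data)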

--> 2. Split the dataset into training and testing sets (e.g., 80% training, 20% testing).

* **Import Required Libraries:**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

* **Define Independent (X) and Dependent (y) Variables:**

In [None]:
# Define independent variables (excluding Price)
X = df.drop(columns=['Price'])

# Define dependent variable (Price)
y = df['Price']

* **Split Data into Training (80%) and Testing (20%) Sets:**

In [None]:
# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the datasets
print("Training Set Shape:", X_train.shape, y_train.shape)
print("Testing Set Shape:", X_test.shape, y_test.shape)

--> 3. Build a multiple linear regression model using the training dataset. Interpret the coefficients of
the model. Build a minimum of 3 different models.

* **Import Required Libraries:**

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
df = pd.read_csv('ToyotaCorolla - MLR.csv')

# Repeat the preprocessing from Task 1 on the reloaded data:
# drop 'Cylinders' and convert 'Fuel_Type' into dummy variables
df = df.drop(columns=['Cylinders'])
df = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True)

# Define independent (X) and dependent (y) variables
X = df.drop(columns=['Price'])  # Features
y = df['Price']  # Target variable

# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model 1:- Using All Features:

In [None]:
# Initialize Linear Regression Model
model1 = LinearRegression()
model1.fit(X_train, y_train)

# Predictions
y_pred1 = model1.predict(X_test)

# Model Evaluation
print("Model 1 - Using All Features")
print("R2 Score:", r2_score(y_test, y_pred1))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred1))

# Display Model Coefficients
coefficients1 = pd.DataFrame({'Feature': X.columns, 'Coefficient': model1.coef_})
print(coefficients1)

Model 2:- Using Selected Features (Age, KM, HP, Weight):

In [None]:
# Selecting specific features
selected_features = ['Age_08_04', 'KM', 'HP', 'Weight']
X_train2, X_test2 = X_train[selected_features], X_test[selected_features]

# Train model
model2 = LinearRegression()
model2.fit(X_train2, y_train)

# Predictions
y_pred2 = model2.predict(X_test2)

# Model Evaluation
print("\nModel 2 - Using Selected Features (Age, KM, HP, Weight)")
print("R² Score:", r2_score(y_test, y_pred2))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred2))

# Display Model Coefficients
coefficients2 = pd.DataFrame({'Feature': selected_features, 'Coefficient': model2.coef_})
print(coefficients2)

Model 3:- Using Only Engine and Transmission Features (CC, HP, Automatic, Weight):

In [None]:
# Selecting another set of features
selected_features3 = ['cc', 'HP', 'Automatic', 'Weight']
X_train3, X_test3 = X_train[selected_features3], X_test[selected_features3]

# Train model
model3 = LinearRegression()
model3.fit(X_train3, y_train)

# Predictions
y_pred3 = model3.predict(X_test3)

# Model Evaluation
print("\nModel 3 - Engine and Transmission Features (CC, HP, Automatic, Weight)")
print("R² Score:", r2_score(y_test, y_pred3))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred3))

# Display Model Coefficients
coefficients3 = pd.DataFrame({'Feature': selected_features3, 'Coefficient': model3.coef_})
print(coefficients3)

Step 3:-   Interpretation of Coefficients:


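   •	Positive coefficients (e.g., HP, Weight) indicate that, with the other features held constant, a one-unit increase in the feature is associated with a higher predicted price.

   •	Negative coefficients (e.g., Age, KM) indicate that, with the other features held constant, a one-unit increase in the feature is associated with a lower predicted price.

   •	Automatic Transmission (Binary: 1 = Yes, 0 = No): the coefficient is the estimated price difference between an automatic and a manual car with otherwise identical features. Its sign has to be read from the fitted model.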
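The "held constant" reading of a coefficient can be checked directly. The sketch below uses synthetic data, not the Toyota dataset; the variable names, value ranges, and the generating formula are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical data: price falls with age and mileage
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(1, 80, n)
km = rng.uniform(0, 200_000, n)
price = 20_000 - 120 * age - 0.02 * km + rng.normal(0, 300, n)

X_demo = np.column_stack([age, km])
model = LinearRegression().fit(X_demo, price)

# Two cars that differ only by one unit of age (same KM)
base = np.array([[40.0, 60_000.0]])
older = base + np.array([[1.0, 0.0]])
delta = model.predict(older)[0] - model.predict(base)[0]

# The change in predicted price equals the age coefficient
print("Age coefficient:", model.coef_[0])
print("Change in prediction for +1 age:", delta)
```

The two printed numbers agree, which is what "a one-unit increase, other features held constant" means for a linear model.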

Step 4:- Compare Model Performance:

In [None]:
print("\nModel Performance Comparison:")
print(f"Model 1 (All Features) R2: {r2_score(y_test, y_pred1):.4f}")
print(f"Model 2 (Age, KM, HP, Weight) R2: {r2_score(y_test, y_pred2):.4f}")
print(f"Model 3 (CC, HP, Auto, Weight) R2: {r2_score(y_test, y_pred3):.4f}")

--> 4. Evaluate the performance of the model using appropriate evaluation metrics on the testing
dataset.

**Evaluation Metrics**

1.	R² Score – Proportion of the variance in the target that the model explains (1 is a perfect fit; 0 is no better than always predicting the mean).

2.	Mean Squared Error (MSE) – Average squared difference between actual and predicted values.

3.	Mean Absolute Error (MAE) – Average absolute difference between actual and predicted values.

4.	Root Mean Squared Error (RMSE) – Square root of the MSE, expressed in the same units as the target (here, euros).
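As a small worked example, the four metrics can be computed by hand with NumPy. The four actual and predicted values below are made up for illustration only.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 17.0, 18.0])

residuals = y_true - y_pred              # [-1, 1, -2, 2]

mae = np.mean(np.abs(residuals))         # (1 + 1 + 2 + 2) / 4 = 1.5
mse = np.mean(residuals ** 2)            # (1 + 1 + 4 + 4) / 4 = 2.5
rmse = np.sqrt(mse)                      # sqrt(2.5), about 1.58

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)                   # 10.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 56.75
r2 = 1 - ss_res / ss_tot                          # about 0.824

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.4f}")
```

These formulas give the same values as `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics`.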

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Define a function to evaluate model performance
def evaluate_model(model_name, y_test, y_pred):
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mse)

    print(f"\nModel Performance: {model_name}")
    print(f"R2 Score: {r2:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

# Evaluate Model 1 (All Features)
evaluate_model("Model 1 (All Features)", y_test, y_pred1)

# Evaluate Model 2 (Selected Features: Age, KM, HP, Weight)
evaluate_model("Model 2 (Age, KM, HP, Weight)", y_test, y_pred2)

# Evaluate Model 3 (Engine & Transmission: CC, HP, Automatic, Weight)
evaluate_model("Model 3 (CC, HP, Auto, Weight)", y_test, y_pred3)

--> 5. Apply Lasso and Ridge methods on the model

* **Import Necessary Libraries:**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Standardize the dataset (important for Lasso & Ridge)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

* **Apply Ridge Regression:**

In [None]:
# Initialize Ridge Regression Model with alpha=1 (Regularization Strength)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_ridge = ridge_model.predict(X_test_scaled)

# Evaluate Ridge Model
evaluate_model("Ridge Regression", y_test, y_pred_ridge)

* **Apply Lasso Regression:**

In [None]:
# Initialize Lasso Regression Model with alpha=0.1 (Regularization Strength)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_lasso = lasso_model.predict(X_test_scaled)

# Evaluate Lasso Model
evaluate_model("Lasso Regression", y_test, y_pred_lasso)

 * **Compare Ridge and Lasso Performance:**

In [None]:
print("\nModel Performance Comparison:")
print(f"Ridge Regression R2: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Lasso Regression R2: {r2_score(y_test, y_pred_lasso):.4f}")
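The `alpha` values above are fixed by hand. One possible refinement, sketched here on synthetic data rather than the Toyota dataset, is to let cross-validation pick `alpha` from a grid with `RidgeCV` and `LassoCV`. The grid, the data, and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: only the first two of five features matter
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate regularization strengths, spaced evenly on a log scale
alphas = np.logspace(-3, 2, 20)

# Scaling is placed inside the pipeline so it is fitted together with the model
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X_demo, y_demo)
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X_demo, y_demo)

best_ridge_alpha = ridge_cv[-1].alpha_
best_lasso_alpha = lasso_cv[-1].alpha_
print("Best Ridge alpha:", best_ridge_alpha)
print("Best Lasso alpha:", best_lasso_alpha)
```

On the real data the same pattern would be fitted on `X_train` and `y_train` and then evaluated once on the test set.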

### **INTERVIEW QUESTIONS:**

--> **1. What are Normalization & Standardization and how are they helpful?**

Normalization and Standardization are two key techniques in data preprocessing used to scale numerical data. They help improve the performance of machine learning algorithms by ensuring that features are on a similar scale.

* **NORMALIZATION:**

Normalization is the process of scaling data to a fixed range, usually [0,1] or [-1,1]. It ensures that all features contribute equally to the model, preventing larger values from dominating smaller ones.


**When to Use?**

•	When data is not normally distributed (not Gaussian).

•	When working with neural networks and distance-based algorithms (like KNN).

•	When features measured on different scales need to be compared directly.

* **STANDARDIZATION:**

Standardization transforms data so that it has a mean of 0 and a standard deviation of 1. It changes the scale of the data but not the shape of its distribution, so skewed data remains skewed after standardization.

**When to Use?**

•	When data is approximately normally distributed (helpful, but not a requirement).

•	When using models like Linear Regression, Logistic Regression, Support Vector Machines (SVM), and PCA.

•	When handling features with different units of measurement (e.g., height in cm and weight in kg).


**Why is Normalization and Standardization Helpful?**

* **Improves Model Performance:** Prevents features with large numeric ranges from dominating the learning process.

* **Speeds Up Training:** Optimizers like gradient descent work faster when features are on the same scale.

* **Enhances Accuracy:** In distance-based and regularized models, keeps every feature on a comparable footing so that no feature is favored only because of its units.

* **Reduces Computational Cost:** Iterative solvers need fewer steps to converge when features are scaled properly.
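A minimal sketch of both techniques on one made-up column, using scikit-learn's `MinMaxScaler` and `StandardScaler`. The five values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature column (scikit-learn expects a 2-D array)
values = np.array([[1000.0], [2000.0], [3000.0], [4000.0], [5000.0]])

# Normalization: (x - min) / (max - min), result lies in [0, 1]
normalized = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std, result has mean 0 and std 1
standardized = StandardScaler().fit_transform(values)

print("Normalized:  ", normalized.ravel())
print("Standardized:", standardized.ravel().round(3))
```

In both outputs the spacing between points is unchanged relative to the original values; only the scale differs.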


--> **2. What techniques can be used to address multicollinearity in multiple linear regression?**

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to determine the individual effect of each predictor. This leads to unstable coefficient estimates with large standard errors, so small changes in the data can change the coefficients substantially.


1. **Remove Highly Correlated Predictors:**


•	Use the correlation matrix or Variance Inflation Factor (VIF) to identify highly correlated variables.

•	Remove one of the correlated variables if it provides redundant information.
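To make the VIF idea concrete, the sketch below computes it from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The data is synthetic and the helper `vif` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical features: x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on all other columns and use 1 / (1 - R²)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

The two near-duplicate features get very large VIFs while the unrelated one stays close to 1, so dropping either `x1` or `x2` would be the natural next step.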


2. **Principal Component Analysis (PCA):**

•	PCA reduces dimensionality by transforming correlated features into uncorrelated principal components.

•	Instead of using original variables, use the principal components as predictors in the model.
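One way this could look in scikit-learn is a pipeline that scales the features, projects them onto principal components, and fits a linear regression on those components. The data below is synthetic and the choice of two components is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical features: x1 and x2 are highly correlated
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 2.0 * x1 + x3 + rng.normal(scale=0.5, size=n)

# Scale -> project onto 2 principal components -> linear regression
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X_demo, y_demo)

# The components that feed the regression are uncorrelated with each other
components = pcr[:-1].transform(X_demo)
component_corr = np.corrcoef(components, rowvar=False)
r2_pcr = pcr.score(X_demo, y_demo)

print("Correlation between components:", round(component_corr[0, 1], 6))
print("R2 of the PCA-based model:", round(r2_pcr, 4))
```

The trade-off is interpretability: the regression coefficients now refer to components, not to the original variables.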


3. **Ridge Regression (L2 Regularization):**

•	Ridge regression penalizes large coefficients, which reduces the impact of multicollinearity.

•	Unlike standard regression, it does not eliminate variables but shrinks their coefficients.

4. **Lasso Regression (L1 Regularization):**

•	Lasso regression can shrink some coefficients to exactly zero, effectively performing variable selection.

•	Among a group of highly correlated variables it tends to keep one and drop the others, although which one it keeps can be arbitrary.
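The difference between the two penalties shows up in the fitted coefficients. This is a small sketch on synthetic data; the `alpha` values and the data-generating formula are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic, hypothetical data: only the first two of five features affect y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge shrinks coefficients but leaves them nonzero;
# Lasso sets the coefficients of the irrelevant features to exactly zero
print("Ridge coefficients:", ridge.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```

Note that Lasso also shrinks the coefficients it keeps, so a large `alpha` trades some accuracy for a sparser model.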


5. **Combine or Create New Features:**

•	If two variables are highly correlated, consider combining them into a single feature (e.g., sum, average, or ratio).

•	Example: Instead of using Height and Weight separately, use BMI (Weight/Height²).

6. **Increase Sample Size:**

•	If possible, collect more data. Larger datasets can reduce the impact of multicollinearity.

7. **Use Domain Knowledge:**

•	Select variables based on business logic rather than just statistical correlation.

•	If two variables measure the same concept, keep the one that is more interpretable.