# 📌 **Boston Housing Dataset - Student Notebook**

## 📝 Instructions for Students
Welcome to this hands-on **Simple Linear Regression** notebook! This notebook is designed to help you master **data preprocessing, exploratory data analysis, model building, and performance improvement** step by step.

### ✅ How to Use This Notebook:
1. **No Solution Code Is Provided** – You need to write the code for each task yourself.
2. **Work Through Each Section in Order** – The notebook is structured **from beginner to master level**.
3. **Understand the Dataset** – Read the dataset details before jumping into questions.
4. **Solve Each Question** – Each level contains **tasks** that will build your understanding.
5. **Use External Resources if Needed** – Feel free to refer to documentation and tutorials.
6. **Experiment & Learn** – Try different approaches to gain a deeper understanding.

---

## 📂 Download the Dataset
👉 **[Click Here to Download Boston Housing Dataset](https://www.kaggle.com/code/sandy5290/linear-regression-lasso-ridge-elasticnet/input)**  
Save the dataset to your working directory before proceeding.

---

## 📊 Dataset Overview
The **Boston Housing Dataset** contains **506 observations** with **13 features** and **1 target variable (`MEDV`)**, the median value of owner-occupied homes in thousands of dollars. The dataset is widely used for **regression tasks**.



---

# 🟢 **Beginner Level: Data Preprocessing**

## 🎯 Introduction
This section focuses on **data preprocessing**, an essential step in machine learning. **Properly cleaning and preparing data ensures better model accuracy and reliability.**  

📌 **Use Pandas and NumPy for data handling, and Seaborn for the plots.**  

---

## 📌 **Beginner-Level Tasks**
### **1️⃣ Load the dataset and display the first 5 rows.**  
### **2️⃣ Generate summary statistics and check for missing values.**  
### **3️⃣ Plot histograms for numerical features and detect skewness.**  
### **4️⃣ Identify outliers in `CRIM`, `LSTAT`, and `NOX` using boxplots.**  
### **5️⃣ Handle missing values appropriately.**  
### **6️⃣ Remove outliers using the IQR method.**  
### **7️⃣ Compute the correlation matrix and visualize it using a heatmap.**  
### **8️⃣ Apply Min-Max Scaling and Standardization on numerical features.**  
### **9️⃣ Encode the categorical variable `CHAS` (already stored as a 0/1 indicator) using Label Encoding and One-Hot Encoding.**  
### **🔟 Split the dataset into 80% training and 20% testing sets.**  
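
For tasks 6 and 8, the arithmetic behind the IQR rule, Min-Max Scaling, and Standardization can be sketched in plain Python. This is only an illustration on a made-up list of numbers, not a solution for the dataset: the helper names (`iqr_bounds`, `remove_outliers`, `min_max_scale`, `standardize`), the fence multiplier `k=1.5`, and the sample values are all assumptions. In the notebook itself you would apply the same ideas to DataFrame columns.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    # method="inclusive" treats min/max as the 0th/100th percentile and
    # interpolates linearly between neighbouring values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Rescale values to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical sample, not taken from the dataset.
sample = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0]
cleaned = remove_outliers(sample)
scaled = min_max_scale(cleaned)
zscores = standardize(cleaned)

print("fences:", iqr_bounds(sample))
print("cleaned:", cleaned)
print("min-max:", scaled)
print("z-scores:", zscores)
```

Note the order: outliers are removed before scaling, because a single extreme value would otherwise compress every other Min-Max value toward 0.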

✅ **Once completed, move on to the Intermediate Level: Exploratory Data Analysis.**



---

# 🟡 **Intermediate Level: Exploratory Data Analysis (EDA)**

## 🎯 Introduction
EDA helps in **understanding data trends, relationships, and patterns** before training a model.  

📌 **Use Pandas, Seaborn, and Matplotlib for visualizations.**  

---

## 📌 **Intermediate-Level Tasks**
### **1️⃣ Visualize the distribution of the target variable (`MEDV`) and interpret its skewness.**  
### **2️⃣ Identify the top 3 features most correlated with `MEDV`.**  
### **3️⃣ Create scatter plots between `MEDV` and the top 3 correlated features.**  
### **4️⃣ Perform a pairplot visualization for all numerical features.**  
### **5️⃣ Analyze the effect of `CHAS` on `MEDV` using a boxplot.**  
### **6️⃣ Compute the Variance Inflation Factor (VIF) to check for multicollinearity.**  
### **7️⃣ Remove highly correlated independent variables based on VIF values.**  
### **8️⃣ Detect influential data points using Cook’s Distance or Leverage Score.**  
### **9️⃣ Rank features based on statistical significance tests.**  
### **🔟 Summarize key insights from your EDA.**  
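
For tasks 2 and 6, the sketch below shows what "most correlated" and VIF mean numerically. It is a hand-rolled illustration under stated assumptions: the feature names (`f1` to `f4`), the values, and the helper names are hypothetical, and `vif_two_predictors` covers only the special case of exactly two predictors, where the VIF reduces to 1 / (1 − r²). With more predictors, each VIF needs the R² from regressing one predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(features, target, top=3):
    """Return the `top` (name, r) pairs with the largest |r| against target."""
    scores = {name: pearson_r(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values, not taken from the dataset.
target = [10, 12, 15, 19, 24, 30]
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [30, 26, 21, 17, 10, 6],
    "f3": [3, 1, 4, 1, 5, 9],
    "f4": [2, 7, 1, 8, 2, 8],
}

top3 = rank_by_correlation(features, target)
for name, r in top3:
    print(f"{name}: r = {r:+.3f}")
print("VIF(f1, f2):", vif_two_predictors(features["f1"], features["f2"]))
```

Ranking by the absolute value matters: a strongly negative correlation is as informative for a linear model as a strongly positive one.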

✅ **Once completed, move on to the Advanced Level: Model Building & Evaluation.**



---

# 🔵 **Advanced Level: Model Building & Evaluation**

## 🎯 Introduction
This section focuses on **training and evaluating a Simple Linear Regression model**.  

📌 **Use Scikit-Learn for model training and evaluation.**  

---

## 📌 **Advanced-Level Tasks**
### **1️⃣ Implement a Simple Linear Regression model using `sklearn` and train it on the dataset.**  
### **2️⃣ Print the regression coefficients (`intercept` and `slope`) and interpret them.**  
### **3️⃣ Make predictions on the test set and display the first 10 predicted vs. actual values.**  
### **4️⃣ Evaluate model performance using R² Score, MSE, RMSE, and MAE.**  
### **5️⃣ Create a residual plot to check for homoscedasticity.**  
### **6️⃣ Plot the regression line on a scatter plot of `MEDV` vs. the most important predictor.**  
### **7️⃣ Check for heteroscedasticity using statistical tests.**  
### **8️⃣ Perform k-fold cross-validation (k=5) and compare results with the original model.**  
### **9️⃣ Fit a multiple linear regression model and compare it with simple regression.**  
### **🔟 Summarize the model’s strengths and weaknesses.**  
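
For tasks 2 and 4, it can help to know the formulas that the library computes for you. The sketch below fits a one-predictor least-squares line and computes R², MSE, RMSE, and MAE by hand. It is a cross-check under stated assumptions, not a replacement for the `sklearn` model the tasks ask for: the function names and the train/test values are hypothetical.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(x, intercept, slope):
    """Apply the fitted line to a list of predictor values."""
    return [intercept + slope * a for a in x]


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, RMSE and MAE for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"r2": r2, "mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Hypothetical train/test values, not taken from the dataset.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
x_test = [7.0, 8.0]
y_test = [15.2, 16.8]

intercept, slope = fit_simple_ols(x_train, y_train)
metrics = regression_metrics(y_test, predict(x_test, intercept, slope))
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(metrics)
```

The slope is the expected change in the target for a one-unit increase in the predictor, and the intercept is the predicted target when the predictor is zero. RMSE and MAE are in the same units as the target, which makes them easier to interpret than MSE.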

✅ **Once completed, move on to the Master Level: Improving Model Performance.**



---

# 🔴 **Master Level: Improving Model Performance**

## 🎯 Introduction
The goal is to **refine the Simple Linear Regression model** and improve its accuracy.  

📌 **Use Scikit-Learn, NumPy, and Statsmodels for optimization.**  

---

## 📌 **Master-Level Tasks**
### **1️⃣ Identify and remove influential points using Cook’s Distance and analyze the impact on the model.**  
### **2️⃣ Compare the original model with a transformed feature model (log or polynomial) and assess performance improvement.**  
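
For task 1, the sketch below computes Cook's Distance by hand for a one-predictor model, so you can see how it combines a point's residual with its leverage. It rests on several assumptions: the leverage formula used here holds only for simple regression with an intercept, the cutoff `4 / n` is a common rule of thumb rather than a fixed standard, and the data and function names are hypothetical. In the notebook you would normally obtain these values from Statsmodels.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distance(x, y):
    """Cook's Distance of every observation in a one-predictor OLS fit."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    intercept, slope = fit_simple_ols(x, y)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    distances = []
    for a, e in zip(x, residuals):
        h = 1.0 / n + (a - mean_x) ** 2 / sxx  # leverage of this point
        distances.append((e ** 2 / (p * s2)) * h / (1.0 - h) ** 2)
    return distances


# Hypothetical data: the last point is far from the others in x
# and does not follow their trend.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 10.0]

distances = cooks_distance(x, y)
threshold = 4 / len(x)  # rule-of-thumb cutoff (an assumption)
flagged = [i for i, d in enumerate(distances) if d > threshold]

x_kept = [a for i, a in enumerate(x) if i not in flagged]
y_kept = [b for i, b in enumerate(y) if i not in flagged]

print("flagged indices:", flagged)
print("fit before removal:", fit_simple_ols(x, y))
print("fit after removal: ", fit_simple_ols(x_kept, y_kept))
```

Comparing the two fits is the "impact" part of the task: a point is influential when removing it changes the coefficients noticeably, which is not the same as merely having an unusual predictor value.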



✅ **Congratulations! You have completed all levels of this Simple Linear Regression notebook.**  
📌 **Remember, refining models is an ongoing process, and small improvements can significantly impact predictions.**