## Section 1: Dataset Overview and Loading

### Task 1.1: Load the Ames Housing Dataset
- Use the `fetch_openml` function from `sklearn.datasets` to load the Ames Housing dataset. Make sure to convert it to a pandas DataFrame for easy manipulation.

> ref: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html
  
  **Note:** The target variable is `SalePrice`.

In [2]:
# Example code for loading the Ames Housing dataset from OpenML
from sklearn.datasets import fetch_openml
import pandas as pd

# Load dataset
data = fetch_openml(data_id=42165, as_frame=True) 
df_housing = pd.DataFrame(data.data, columns=data.feature_names)
df_housing['SalePrice'] = data.target  
df_housing.head() 

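|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |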


---

## Section 2: Exploratory Data Analysis (EDA)

### Task 2.1: Summary Statistics
- Calculate key summary statistics (mean, median, variance, standard deviation) for all numerical features in the dataset.
- Discuss the importance of each statistical measure in understanding the dataset.

In [23]:
# describe() only covers numerical columns; after .T each feature is a row
summary_stats = df_housing.describe().loc[['mean', 'std', '50%']].T
# '50%' (the median) is now a column label, so rename it via `columns`
summary_stats = summary_stats.rename(columns={'50%': 'median'})
summary_stats['variance'] = df_housing.select_dtypes(include='number').var()

print(summary_stats)

# Equivalent direct calls:
# df_housing.median(numeric_only=True)
# df_housing.var(numeric_only=True)

                        mean           std    median      variance
Id                730.500000    421.610009     730.5  1.777550e+05
MSSubClass         56.897260     42.300571      50.0  1.789338e+03
LotFrontage        70.049958     24.284752      69.0  5.897492e+02
LotArea         10516.828082   9981.264932    9478.5  9.962565e+07
OverallQual         6.099315      1.382997       6.0  1.912679e+00
OverallCond         5.575342      1.112799       5.0  1.238322e+00
YearBuilt        1971.267808     30.202904    1973.0  9.122154e+02
YearRemodAdd     1984.865753     20.645407    1994.0  4.262328e+02
MasVnrArea        103.685262    181.066207       0.0  3.278497e+04
BsmtFinSF1        443.639726    456.098091     383.5  2.080255e+05
BsmtFinSF2         46.549315    161.319273       0.0  2.602391e+04
BsmtUnfSF         567.240411    441.866955     477.5  1.952464e+05
TotalBsmtSF      1057.429452    438.705324     991.5  1.924624e+05
1stFlrSF         1162.626712    386.587738    1087.0  1.494501e+05

### Task 2.2: Data Visualization
- Create visualizations to explore relationships between features. This should include:
  - Histograms for each feature to observe the distribution.
  - Pair plots to examine relationships between features.
  - A correlation matrix to analyze correlations between numerical variables and their effect on the target (`SalePrice`).
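
The correlation step can be sketched as follows. This is a minimal illustration on a small made-up frame (`toy` stands in for `df_housing`; its values are not results from the dataset), ranking features by how strongly they correlate with `SalePrice`:

```python
import pandas as pd

# Toy stand-in for df_housing; values are made up for illustration only.
toy = pd.DataFrame({
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198, 1362],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "YrSold":      [2008, 2007, 2008, 2006, 2008, 2009],
    "SalePrice":   [208500, 181500, 223500, 140000, 250000, 143000],
})

# Pairwise Pearson correlations between numerical columns.
corr = toy.corr(numeric_only=True)

# Rank features by the absolute strength of their correlation with the target.
target_corr = (
    corr["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the full frame, `df_housing.hist()`, `seaborn.pairplot`, and `seaborn.heatmap(corr)` are common ways to draw the histograms, pair plots, and correlation matrix listed above. Pair plots are usually easier to read when restricted to a handful of columns.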

### Task 2.3: Handle Categorical Variables
- Visualize and analyze categorical variables such as `Neighborhood`, `ExterQual`, and `HouseStyle`.
- Understand the distribution of these categorical features.
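
One possible starting point for the categorical analysis is to count category frequencies and compare the target across categories. The sketch below uses a made-up frame (`toy` and its values are illustrative assumptions, not dataset results):

```python
import pandas as pd

# Toy stand-in for df_housing; categories and prices are illustrative only.
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "CollgCr"],
    "ExterQual":    ["Gd", "TA", "Gd", "TA", "Gd", "TA"],
    "SalePrice":    [208500, 181500, 223500, 140000, 250000, 143000],
})

# Frequency of each category, counting missing values as their own group.
neighborhood_counts = toy["Neighborhood"].value_counts(dropna=False)

# Median target per category shows how the category relates to SalePrice.
median_price = (
    toy.groupby("ExterQual")["SalePrice"].median().sort_values(ascending=False)
)
print(neighborhood_counts)
print(median_price)
```

The same two summaries can be plotted as bar charts (for example with `neighborhood_counts.plot(kind="bar")`) to make rare categories easy to spot.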

### Task 2.4: Document Key Insights
- Document key insights from the EDA, such as:
  - Highly correlated features.
  - Potential outliers.
  - Visible patterns or trends.
- Identify any preliminary issues that may affect model performance (e.g., outliers, missing values, categorical variables).

---

## Section 3: Training a Baseline Regression Model

### Task 3.1: Convert Categorical Features to Numerical
- Use encoding techniques such as `One-Hot Encoding` or `Label Encoding` to convert categorical features into numerical values. 
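
The two encoding styles can be sketched as below on a made-up frame. The column values and the quality ordering are assumptions for illustration. Note that scikit-learn's `LabelEncoder` is documented as intended for target labels, so this sketch uses `OrdinalEncoder`, its counterpart for feature columns:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for a few df_housing columns (illustrative values).
toy = pd.DataFrame({
    "MSZoning":  ["RL", "RM", "RL", "FV"],
    "ExterQual": ["Gd", "TA", "Ex", "TA"],
    "LotArea":   [8450, 9600, 11250, 9550],
})

# One-hot encoding: one 0/1 column per category, no implied ordering.
one_hot = pd.get_dummies(toy, columns=["MSZoning"], dtype=int)

# Ordinal (label-style) encoding: one integer per category.
# An explicit order keeps the integers meaningful for quality grades.
quality_order = [["Fa", "TA", "Gd", "Ex"]]  # assumed worst-to-best order
encoder = OrdinalEncoder(categories=quality_order)
ordinal = toy.copy()
ordinal["ExterQual"] = encoder.fit_transform(toy[["ExterQual"]]).ravel()

print(one_hot.columns.tolist())
print(ordinal["ExterQual"].tolist())
```

One-hot encoding avoids inventing an order for nominal features such as `MSZoning`, at the cost of more columns. Integer codes are compact but only make sense to a linear model when the categories really are ordered.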

### Task 3.2: Split the Dataset
- Split the dataset into training and testing sets using an 80/20 split.
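
A minimal sketch of the 80/20 split with `train_test_split`, using toy arrays in place of the encoded features and target (the `random_state` value is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target standing in for the encoded dataset.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```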

### Task 3.3: Train a Gradient Descent-Based Regression Model
- Train a regression model (such as a Linear Regression model using Gradient Descent) on the raw dataset.
- Record the baseline performance using appropriate regression evaluation metrics (e.g., Mean Squared Error (MSE), Root Mean Squared Error (RMSE)).
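
To make the gradient descent step concrete, here is a sketch of batch gradient descent for linear regression written in NumPy. The data is synthetic and already on a unit scale, and the helper name `fit_linear_gd`, the learning rate, and the iteration count are illustrative assumptions rather than prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unit-scale features and a noisy linear target (toy data).
X = rng.normal(size=(200, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + 1.0 + rng.normal(scale=0.1, size=200)


def fit_linear_gd(X, y, lr=0.1, n_iters=500):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y
        # Gradients of MSE = mean(residual**2) with respect to w and b.
        grad_w = 2.0 / n_samples * (X.T @ residual)
        grad_b = 2.0 * residual.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_linear_gd(X, y)
pred = X @ w + b
mse = np.mean((pred - y) ** 2)
rmse = np.sqrt(mse)
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}")
```

On raw Ames features, where columns such as `LotArea` are in the thousands, a learning rate like this would likely diverge, so a much smaller step size (or scaled inputs) would be needed. `sklearn.linear_model.SGDRegressor` is a library alternative that trains a linear model with stochastic gradient descent.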

### Task 3.4: Document Model Performance
- Document the performance of the baseline model. 
- Explain the significance of the evaluation metrics you have chosen.
- Explain which encoding technique best suited the model.

---

## Section 4: Data Preprocessing Techniques

### Task 4.1: Handling Missing Data
- For this step, use the best version of the preprocessed dataset from the previous steps.
- The dataset contains missing values in various columns. Apply the following techniques for handling them:
  - Removing rows with missing values.
  - Imputation using `mean`, `median`, and `mode`.
- Retrain the model after handling missing data and compare the performance with the baseline model.
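
The options above can be sketched as follows on a made-up numerical frame (the values are illustrative). `SimpleImputer` calls the mode strategy `most_frequent`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical columns with gaps (illustrative values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 80.0],
    "MasVnrArea":  [196.0, 0.0, 162.0, np.nan, 0.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill missing values with a per-column statistic.
imputed = {}
for strategy in ("mean", "median", "most_frequent"):  # most_frequent = mode
    imputer = SimpleImputer(strategy=strategy)
    imputed[strategy] = pd.DataFrame(
        imputer.fit_transform(toy), columns=toy.columns
    )

print(len(toy), "rows before dropna,", len(dropped), "after")
print(imputed["median"])
```

When comparing models, it is generally safer to fit the imputer on the training split only and then call `transform` on the test split, so that test-set statistics do not leak into training.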

#### Findings:
- Compare the model performance before and after handling missing data.
- Which imputation technique led to the best performance? Why?
- Document the effects of each method on model accuracy.

### Task 4.2: Data Normalization and Standardization
- Apply the following scaling techniques to numerical features:
  - `Min-Max Scaling`.
  - `Z-score Standardization` (mean = 0, variance = 1).
- Retrain the model on normalized and standardized datasets, and compare the performance against the baseline.
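
Both scalers can be sketched as below. The arrays are toy values on deliberately different scales, standing in for the numerical columns of the training and test splits:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numerical features on very different scales (e.g. area vs. quality).
X_train = np.array([[8450.0, 7.0], [9600.0, 6.0], [11250.0, 7.0], [14260.0, 8.0]])
X_test = np.array([[9550.0, 7.0]])

# Min-max scaling maps each training column onto [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)

# Z-score standardization gives each training column mean 0 and variance 1.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)

# Reuse the training statistics for the test split to avoid leakage.
X_test_mm = minmax.transform(X_test)
X_test_std = standard.transform(X_test)

print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

Scaling matters particularly for gradient descent, because features with very different ranges force a small learning rate and slow convergence.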

#### Findings:
- Compare the model performance before and after applying normalization and standardization.
- How do normalization and standardization impact model performance?
- Discuss which method improved model performance the most and why.

### Task 4.3: Outlier Detection and Removal
- Detect and potentially remove outliers in both numerical and categorical data using:
  - `Z-score` method for numerical features.
  - Frequency analysis for categorical features (e.g., rare categories in `Neighborhood`). [*OPTIONAL*]
- Retrain the model after removing outliers and evaluate its performance.
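
A sketch of the Z-score filter on made-up data with one obviously extreme `GrLivArea` value. The threshold of 3 is a common convention, assumed here rather than required:

```python
import pandas as pd

# Toy data: the last row has an extreme living area (illustrative values).
toy = pd.DataFrame({
    "GrLivArea": [1500, 1600, 1550, 1450, 1700, 1650,
                  1580, 1520, 1490, 1610, 1570, 5600],
    "SalePrice": [180000, 190000, 185000, 175000, 200000, 195000,
                  188000, 182000, 178000, 191000, 187000, 160000],
})

numeric = toy.select_dtypes(include="number")

# Z-score: distance from the column mean in units of standard deviation.
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Keep rows whose numerical features are all within the threshold.
threshold = 3.0
mask = (z_scores.abs() < threshold).all(axis=1)
cleaned = toy[mask]

print(f"removed {len(toy) - len(cleaned)} of {len(toy)} rows")
```

For the optional categorical part, `value_counts(normalize=True)` gives category frequencies, which can be compared against a chosen cutoff to flag rare categories.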

#### Findings:
- How did the removal of outliers affect model performance?
- Did outlier removal improve or degrade the model’s accuracy? Provide insights.

### Task 4.4: Feature Engineering
- Apply feature engineering techniques such as:
  - Creating polynomial features (e.g., `quadratic` or `cubic` terms) or applying `logarithmic transformations`.
  - Interaction features (e.g., combining features like `GrLivArea` and `OverallQual`).
- Retrain the model using the engineered features and compare the performance.

#### Findings:
- Did feature engineering improve the model's performance?
- Which feature transformations had the most significant effect on model accuracy?
- Justify why certain features may have contributed to better performance.

---

## Section 5: Regularization Techniques

### Task 5.1: Apply Ridge Regression ($L2$)
- Train a Ridge regression model on the preprocessed dataset (after applying necessary scaling, imputation, and encoding).
- Use cross-validation to find the optimal value of the regularization parameter (`alpha`).
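
One way to do this is with `RidgeCV`, which evaluates a grid of `alpha` values by cross-validation. The sketch below uses synthetic data, and the grid `np.logspace(-3, 3, 13)` and `cv=5` are assumed choices rather than recommended settings:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=120)

# Candidate regularization strengths on a log scale (an assumed grid).
alphas = np.logspace(-3, 3, 13)

# RidgeCV picks the alpha with the best cross-validated score.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge.fit(X, y)

best_alpha = ridge.named_steps["ridgecv"].alpha_
print(f"best alpha: {best_alpha:.3g}")
```

If the selected `alpha` sits at either end of the grid, widening the grid in that direction is a reasonable next step.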

#### Findings:
- Compare the performance of Ridge regression with the baseline linear regression model.
- How did Ridge regularization affect the model’s performance in terms of reducing overfitting and improving generalization?

### Task 5.2: Apply Lasso Regression ($L1$)
- Train a Lasso regression model on the preprocessed dataset.
- Use cross-validation to find the optimal value of the regularization parameter (`alpha`).
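
A comparable sketch for Lasso uses `LassoCV`, which builds its own `alpha` path and selects a value by cross-validation. The data is synthetic, with only the first two features driving the target. Whether the remaining coefficients land exactly on zero depends on the selected `alpha`, so the code inspects them rather than assuming it:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two toy features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X, y)

model = lasso.named_steps["lassocv"]
coefs = model.coef_

# Indices of features whose coefficients were shrunk exactly to zero.
eliminated = np.flatnonzero(coefs == 0.0)

print(f"best alpha: {model.alpha_:.4f}")
print("coefficients:", np.round(coefs, 3))
print("indices shrunk to zero:", eliminated)
```

On the real dataset, mapping `eliminated` back to column names (for example via the encoded frame's `columns`) shows which features Lasso dropped.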
  
#### Findings:
- Compare the performance of Lasso regression with both the baseline model and the Ridge regression model.
- Did Lasso shrink any coefficients to zero? If so, which features were eliminated, and why might they be considered less important?
- Analyze how Lasso improved the model's generalization and whether feature selection helped.


### Task 5.3: Comparison of Ridge and Lasso
- Compare the performance of Ridge and Lasso regression models.
- Discuss the trade-offs between the two approaches in terms of:
  - Feature selection (sparsity in Lasso vs. retaining all features in Ridge).
  - Impact on overfitting and model interpretability.
  - Which model performed better on the Ames Housing dataset, and why?

---

## Section 6: Summary and Conclusion

### Task 6.1: Summary of Findings
- Summarize your findings from each preprocessing technique (handling missing data, normalization, standardization, outlier removal, feature engineering).
- Clearly compare the performance of the regression model after each preprocessing step to the baseline model.

### Task 6.2: Conclusion
- Reflect on the importance of data preprocessing in machine learning.
- Which preprocessing techniques had the most significant impact on model performance, and why?
- Discuss how these insights might be applied to real-world machine learning tasks.