# DATA 3550_Fall 2023-Midterm Project 

* Submission Deadline: Saturday, 10/28/2023 at 5 pm.

# DATA 3550_Fall 2023-Midterm Project Submission Guidelines

## Scenario

Imagine that you are part of a data science team hired by ABC Housing LLC, a Murfreesboro company that specializes in buying and selling houses. Your role is to develop an accurate prediction model for the "SalePrice" of houses. Working with your team, you will need to convey your findings to a manager who is not familiar with statistics or data science. The ultimate aim of your project report is to provide insights that will guide the company's real estate decisions. The company is also interested in identifying the five most important factors influencing house prices. At the end of the project, your team will deliver a concise presentation highlighting your findings.

## Jupyter Notebook Formatting Guidelines

It is very important that your Jupyter Notebook is formatted correctly with markdown, comments, and code that works. You may lose a lot of points if the formatting guidelines are not followed or the results are not adequately explained.
Please do the following for each section:

* Include a title as a Heading 2
* Include a brief summary of the section
* Include your code and make sure it is executable and correct, include comments with the code.
* At the end of the section, include a brief summary of the results.

## PowerPoint Presentation Formatting Guidelines

Your PowerPoint presentation should cover the parts listed below, each corresponding to the section of these guidelines given in parentheses. The presentation structure should follow:

1. **Exploratory Data Analysis (Section 2)**
2. **Feature Engineering (Section 3)**
3. **Data Imputation (Section 4)**
4. **Train/Test Split and Scaling (Section 5)**
5. **Multiple Regression (Section 6)**
6. **LASSO Regression (Section 7)**
7. **Ridge Regression (Section 8)**
8. **Kernel Ridge Regression (Section 9)**
9. **Model Comparison (Section 10)**
10. **Conclusion (Section 10)**

Each section should be clear, concise, and visually engaging. Ensure that your presentation effectively communicates the key points, methods used, results obtained, and any insights or conclusions drawn. Feel free to incorporate relevant graphs, charts, and examples to enhance the clarity of your content.

**Note: The total number of slides should be between 13 and 15, including the title slide.**


## How to turn it in:

**You will turn in the notebook file and also upload the presentation slides.**

* Your Jupyter notebook file must be named Data3550MidPro_LastnameFirstInitial.ipynb. For example, the file name will be Data3550MidPro_RimalR.ipynb if I submit the Midterm Project.
* You are to turn in your Jupyter notebook file only. No data files and no folders.
* It is assumed that you created your Jupyter notebook in a folder named Data3550MidPro_student and that this folder contains a Dataset folder. File paths in your notebook should therefore be relative to the notebook's folder, for example 'Dataset/DF_AH.csv'.
* Additionally, save the presentation as a PowerPoint file with the name Data3550MidPro_GroupName.ppt. For instance, if your group name is LASSO, the file should be named Data3550MidPro_LASSO.ppt.

**Please make sure to adhere to these guidelines when submitting your materials for the Midterm Project.**

## 1. Import required packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error,mean_absolute_percentage_error

from numpy import arange

from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score


from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

pd.set_option('display.max_columns',200) #allows for up to 200 columns to be displayed when viewing a dataframe
pd.set_option('display.max_rows',100)
plt.style.use('seaborn-v0_8') # overrides 'fivethirtyeight' set above; the 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6

# trick to widen the screen
from IPython.display import display, HTML

#Widens the code landscape 
display(HTML("<style>.container { width:95% !important; }</style>"))

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

## 2. Import the dataset DF_AH.csv and Exploratory Data Analysis (15 points)
* The dataset DF_AH.csv was exported from the data preprocessing notebook after we removed the outliers and did some feature engineering. This dataset contains 92 variables, including the target variable.
* Perform exploratory data analysis to gain better insights into the data.
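
A first pass at the exploratory analysis could look like the sketch below. It is only a starting point, not the required analysis. So that it runs anywhere, it builds a small toy frame in place of DF_AH.csv; every column name other than `SalePrice` is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Toy stand-in so the sketch runs anywhere. In the project you would instead use:
# DF_AH = pd.read_csv('Dataset/DF_AH.csv')
rng = np.random.default_rng(42)
n = 200
DF_AH = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),                      # hypothetical column
    "OverallQual": rng.integers(1, 11, n),                      # hypothetical column
    "Foundation": rng.choice(["PConc", "CBlock", "Slab"], n),   # hypothetical column
})
DF_AH["SalePrice"] = (
    100 * DF_AH["GrLivArea"] + 10_000 * DF_AH["OverallQual"] + rng.normal(0, 20_000, n)
)

# Size, dtypes, and non-null counts
print(DF_AH.shape)
DF_AH.info()

# Summary statistics for the numeric columns
print(DF_AH.describe())

# Missing values per column, largest first
missing_counts = DF_AH.isna().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# Correlation of each numeric feature with the target, strongest first
correlations = (
    DF_AH.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Histograms of `SalePrice` and scatter plots of the most correlated features against it are natural next steps.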

## 3. Perform additional Feature Engineering (10 points)

1) Identify the number of unique foundation types

2) Create the dummy variables for the foundation type

3) Identify all the unique neighbourhoods

4) Create the dummy variables for the neighbourhood

5) Look at DF_AH.info() and drop all the variables that have a Non-Null count of less than 2500.
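
The five steps above can be sketched as follows. This is a minimal illustration on a six-row toy frame: the column names `Foundation`, `Neighborhood`, and `PoolQC` are assumptions about the dataset, and the non-null threshold is scaled down from 2500 so the toy example has something to drop.

```python
import numpy as np
import pandas as pd

# Toy stand-in for DF_AH; column names are assumed, not taken from the real file
DF_AH = pd.DataFrame({
    "Foundation": ["PConc", "CBlock", "PConc", "Slab", "CBlock", "PConc"],
    "Neighborhood": ["NAmes", "OldTown", "NAmes", "Edwards", "OldTown", "NAmes"],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan, np.nan, np.nan],  # mostly missing
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Steps 1 and 3: how many categories there are, and which ones
print(DF_AH["Foundation"].nunique(), DF_AH["Foundation"].unique())
print(DF_AH["Neighborhood"].nunique(), DF_AH["Neighborhood"].unique())

# Steps 2 and 4: dummy columns; drop_first removes one level per variable
# so the dummies are not perfectly collinear with the intercept
DF_AH = pd.get_dummies(
    DF_AH, columns=["Foundation", "Neighborhood"], drop_first=True, dtype=int
)

# Step 5: keep only the columns with enough non-null values
MIN_NON_NULL = 5  # 2500 in the assignment; scaled down for the 6-row toy frame
non_null_counts = DF_AH.notna().sum()
DF_AH = DF_AH.loc[:, non_null_counts >= MIN_NON_NULL]
print(DF_AH.columns.tolist())
```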


## 4. Impute the data (10 points)

1) Check whether each of the variables has missing values

2) Impute the missing values using an appropriate method and explain why you chose that method.

3) Look at the data and, for each remaining categorical variable, either create dummy variables or drop the variable. Explain your reasoning.
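
One possible imputation strategy is sketched below: the median for numeric columns (robust to skewed values) and the most frequent category for categorical columns. This is only one reasonable choice and you still need to justify whichever method you use. The column names `LotFrontage` and `GarageType` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for DF_AH with some missing values; column names are hypothetical
DF_AH = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan, 85.0],
    "GarageType": ["Attchd", "Detchd", None, "Attchd", "Attchd", None],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Step 1: number of missing values per column
missing_counts = DF_AH.isna().sum()
print(missing_counts[missing_counts > 0])

# Step 2: median for numeric columns, most frequent value for categorical columns
numeric_cols = DF_AH.select_dtypes(include="number").columns
categorical_cols = DF_AH.select_dtypes(exclude="number").columns

DF_AH[numeric_cols] = DF_AH[numeric_cols].fillna(DF_AH[numeric_cols].median())
for col in categorical_cols:
    DF_AH[col] = DF_AH[col].fillna(DF_AH[col].mode().iloc[0])

# Step 3: turn the remaining categorical columns into dummy variables
DF_AH = pd.get_dummies(DF_AH, columns=list(categorical_cols), drop_first=True, dtype=int)
print(DF_AH)
```

A stricter workflow would compute the medians and modes on the training split only and reuse them on the test split, which avoids leaking test information into the imputation.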


## 5. Create the train/test data and scaling (5 points)

1) Split the data into training and test sets, with 80 percent of the data used for training. You may use the following code:

* `X = DF_AH.drop('SalePrice', axis = 1)  # keep features only for X`
* `y = DF_AH['SalePrice']  # keep target variable only for y`
* `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)`

2) Standardize the data using the standard scaler. You will then build the regression models using the scaled data.
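
The split and scaling steps can be sketched as below, assuming scikit-learn's `StandardScaler` is the intended standard scaler. Synthetic data from `make_regression` stands in for the real features. The scaler is fitted on the training set only and then applied to the test set, so no test information leaks into the scaling.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target of DF_AH
features, target = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X = pd.DataFrame(features, columns=[f"x{i}" for i in range(5)])
y = pd.Series(target, name="SalePrice")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training data only, then apply the same transformation to both sets
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.mean().round(3))  # approximately 0 for every column
```

Wrapping the scaled arrays back into DataFrames keeps the column names, which helps later when reading p-values and coefficients.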

## 6. Build the Multiple Regression Model (10 points)

(a) Build a multiple regression model using backward elimination, following these steps:

(1) Build the initial model.

(2) Review the p-values.

    - If any p-value is greater than 0.05, eliminate the variable with the highest p-value and go to step 3.
    - **If all p-values are less than 0.05, your model is complete.**
    
(3) Build a new model without the eliminated independent variable.

(4) Go back to step 2.

(b) Write the regression equation with SalePrice as the dependent (target) variable and all the statistically significant predictors as independent variables.

(c) Because we use the validation set approach here, predict the y values for the test data and report the following measures of accuracy:

     - Root Mean Squared Error
     - Mean Absolute Percentage Error
     - R squared

Please interpret these values.

Note: To keep the Jupyter notebook readable, please include only the initial model and the final model for the multiple regression using backward elimination.

## 7. Build the LASSO Regression Model (10 points)

(a) Build a LASSO regression model using all the variables that you used initially in your multiple linear regression model. You are advised to use grid search CV (5 fold) on the training data to find the best value of lambda (called `alpha` in scikit-learn). Then report the following measures on the test dataset:

     - Root Mean Squared Error
     - Mean Absolute Percentage Error
     - R squared

How do you interpret these values?

(b) Identify the variables with nonzero regression coefficients. Are they the same as the variables that are statistically significant in the multiple linear regression model?


(c) Write the regression equation with SalePrice as the dependent (target) variable and all the statistically significant predictors as independent variables.
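
A minimal sketch of the LASSO workflow follows, assuming scikit-learn's `Lasso` and `GridSearchCV`. The toy data and the alpha grid are illustrative choices, not recommended values; in the project you would reuse the scaled training and test sets from Section 5.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: only x0, x1, and x2 actually influence the target
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"x{i}" for i in range(10)])
y = pd.Series(
    1000 + 40 * X["x0"] - 25 * X["x1"] + 15 * X["x2"] + rng.normal(scale=5.0, size=300),
    name="SalePrice",
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5-fold grid search over the penalty strength (lambda is `alpha` in scikit-learn)
grid = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-3, 1, 30)},  # illustrative grid
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train_scaled, y_train)
best_lasso = grid.best_estimator_
print("Best alpha:", grid.best_params_["alpha"])

# Test-set metrics
y_pred = best_lasso.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}  MAPE={mape:.3%}  R2={r2:.3f}")

# Variables that LASSO kept (nonzero coefficients)
coefficients = pd.Series(best_lasso.coef_, index=X.columns)
selected = coefficients[coefficients != 0]
print(selected)
```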

## 8. Build the Ridge Regression Model (10 points)

(a) Build the Ridge regression model using all the variables that you used initially in your multiple linear regression model, together with the best lambda obtained by grid search CV (5 fold) on the training data.
Please report the following measures on the test dataset:

     - Root Mean Squared Error
     - Mean Absolute Percentage Error
     - R squared

How do you interpret these values?

(b) Identify the regression coefficients of all variables. What differences do you notice between the LASSO and the Ridge regression coefficients?

(c) Write the regression equation with SalePrice as the dependent (target) variable and all the predictors as independent variables. Note: Since some coefficients may be very small, you may round the coefficients to two decimal places and use only the nonzero coefficients.
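
The Ridge steps, including one way to assemble the regression equation from rounded coefficients, can be sketched as below. The alpha grid and toy data are illustrative assumptions. Because the predictors are standardized, the coefficients in the printed equation are per standard deviation of each predictor.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the scaled housing features
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"x{i}" for i in range(10)])
y = pd.Series(
    1000 + 40 * X["x0"] - 25 * X["x1"] + 15 * X["x2"] + rng.normal(scale=5.0, size=300),
    name="SalePrice",
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5-fold grid search for the penalty strength
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 30)},  # illustrative grid
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train_scaled, y_train)
best_ridge = grid.best_estimator_

# Test-set metrics
y_pred = best_ridge.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}  MAPE={mape:.3%}  R2={r2:.3f}")

# Ridge shrinks coefficients toward zero but rarely sets them exactly to zero
coefficients = pd.Series(best_ridge.coef_, index=X.columns).round(2)
nonzero = coefficients[coefficients != 0]

# Regression equation from the rounded, nonzero coefficients
terms = " ".join(
    f"{'+' if c >= 0 else '-'} {abs(c):.2f}*{name}" for name, c in nonzero.items()
)
equation = f"SalePrice = {best_ridge.intercept_:.2f} {terms}"
print(equation)
```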
 

## 9. Build the Kernel Ridge Regression Model (10 points)

(a) Build the Kernel Ridge regression model using all the variables that you used initially in your multiple linear regression model, together with the best values of the tuning parameters obtained by grid search CV (5 fold) on the training data.
Please report the following measures on the test dataset:

     - Root Mean Squared Error
     - Mean Absolute Percentage Error
     - R squared

How do you interpret these values?

(b) Conduct an error analysis. What differences do you notice compared with the previous models?

(c) Write the regression equation with SalePrice as the dependent (target) variable and all the predictors as independent variables, if possible.
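
A possible Kernel Ridge setup is sketched below. It assumes scikit-learn's `KernelRidge`, which fits no intercept, so the sketch wraps it in `TransformedTargetRegressor` to standardize the target during fitting; that wrapper is a design choice of this sketch, not a requirement of the assignment. The parameter grid and toy data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with a nonlinear term so that a kernel can help
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=["x0", "x1", "x2"])
y = pd.Series(
    1000 + 40 * X["x0"] + 20 * X["x1"] ** 2 + rng.normal(scale=5.0, size=400),
    name="SalePrice",
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KernelRidge has no intercept term, so standardize the target inside the model
model = TransformedTargetRegressor(regressor=KernelRidge(), transformer=StandardScaler())

# Tuning parameters: penalty strength, kernel type, and kernel width (illustrative grid;
# gamma is ignored by the linear kernel)
param_grid = {
    "regressor__alpha": [0.01, 0.1, 1.0],
    "regressor__kernel": ["linear", "rbf"],
    "regressor__gamma": [0.01, 0.05, 0.1],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X_train_scaled, y_train)
best_model = grid.best_estimator_
print("Best parameters:", grid.best_params_)

# Test-set metrics
y_pred = best_model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}  MAPE={mape:.3%}  R2={r2:.3f}")
```

With a nonlinear kernel, the fitted model is a weighted sum of kernel evaluations against the training points (the weights are in `best_model.regressor_.dual_coef_`), so there is no compact equation in the original predictors.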
 

## 10. Model Comparison and Conclusion (20 points)

 (a) Create visualizations of the residuals from the multiple regression, LASSO regression, Ridge regression, and Kernel Ridge regression models. You can create several visuals that are useful for gaining insights into the residuals of each model, separately or together. Explain what you observe from these graphs.
 
 (b) Study the performance scores obtained from all four models and explain which model the company should choose for deployment and why. Support your reasoning and decision with appropriate graphs and tables and a clear explanation.
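
One way to put the four models side by side is sketched below: a metrics table sorted by RMSE and a grid of residual plots. The hyperparameters are placeholders (use your tuned values), `LinearRegression` stands in for the backward-elimination model, and the toy data replaces the housing data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import TransformedTargetRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the scaled housing features
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"x{i}" for i in range(6)])
y = pd.Series(
    1000 + 40 * X["x0"] - 25 * X["x1"] + 15 * X["x2"] + rng.normal(scale=5.0, size=300),
    name="SalePrice",
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Placeholder hyperparameters; replace them with the values found by grid search
models = {
    "Multiple regression": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "Kernel Ridge": TransformedTargetRegressor(
        regressor=KernelRidge(kernel="rbf", alpha=0.1, gamma=0.05),
        transformer=StandardScaler(),
    ),
}

rows, predictions = [], {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    predictions[name] = y_pred
    rows.append({
        "Model": name,
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "MAPE": mean_absolute_percentage_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
    })

# Comparison table, best (lowest) RMSE first
comparison = pd.DataFrame(rows).set_index("Model").sort_values("RMSE")
print(comparison.round(3))

# Residuals against predictions, one panel per model
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
for ax, (name, y_pred) in zip(axes.ravel(), predictions.items()):
    ax.scatter(y_pred, y_test - y_pred, s=10)
    ax.axhline(0, color="black", linewidth=1)
    ax.set_title(name)
    ax.set_xlabel("Predicted SalePrice")
    ax.set_ylabel("Residual")
fig.tight_layout()
```

Residuals scattered evenly around zero with no visible pattern suggest a reasonable fit; curvature or a funnel shape points to nonlinearity or non-constant variance.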