# DA Homework 2


## Instructions

1. Homework assignments require individual attention and effort to be of any benefit. You may ask for help from the instructor and TA. However, you should individually complete the assignments.


2. For your homework submission, please make sure your notebook contains both your Python code and ALL the outputs. Submit the HTML file on Canvas.


3. At the end of the notebook, you also need to write **an executive summary** that describes your main findings and conclusions to me (so you can assume analytics knowledge). The summary should not exceed 1 page of text. 

## 1. The dataset

The Internet has revolutionized the real estate industry. Realtors now list houses and their prices on the web, and estimates of house and condominium prices have become widely available, even for units not on the market.

Zillow (www.zillow.com) is the most popular online real estate information site in the United States. It gets much of the data for its “Zestimates” of home values directly from publicly available city housing data, used to estimate property values for tax assessment. A simple approach would be a naive, model-less method—just use the assessed values as determined by the city. Those values, however, do not necessarily include all properties, and they might not include changes warranted by remodeling, additions, etc. Moreover, the assessment methods used by cities may not be transparent or always reflect true market values. However, the city property data can be used as a starting point to build a model, to which additional data (such as that collected by large realtors) can be added later.

In the homework assignments, we will look at how Boston property assessment data, available from the city of Boston, might be used to predict home values. The data in **`WestRoxbury.csv`** includes information on single family owner-occupied homes in West Roxbury, a neighborhood in southwest Boston, MA, in 2014. The data includes values for various predictor variables, and for an outcome—assessed home value (“total value”). 

The dataset includes 5802 homes and 14 variables. A description of each variable is provided below.

| Variable | Description |
| --- | --- |
| TOTAL VALUE | Total assessed value for property, in thousands of USD |
| TAX | Tax bill amount based on total assessed value multiplied by the tax rate, in USD |
| LOT SQ FT | Total lot size of parcel in square feet |
| YR BUILT | Year the property was built |
| GROSS AREA | Gross floor area |
| LIVING AREA | Total living area for residential properties, in square feet |
| FLOORS | Number of floors |
| ROOMS | Total number of rooms |
| BEDROOMS | Total number of bedrooms |
| FULL BATH | Total number of full baths |
| HALF BATH | Total number of half baths |
| KITCHEN | Total number of kitchens |
| FIREPLACE | Total number of fireplaces |
| REMODEL | When the house was remodeled (Recent/Old/None) |

## 2. Load data

In [None]:
# Import the Python libraries
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

import warnings
# Suppress only specific types of warnings
warnings.filterwarnings('ignore', category=FutureWarning)  # Suppress future deprecation warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)  # Suppress deprecation warnings

# Apply the default seaborn theme with a larger default figure size
sns.set(rc={'figure.figsize': (12, 8)})

Load the West Roxbury data set. 
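The variable table lists names with spaces (for example `TOTAL VALUE`), while the later sections refer to `TOTAL_VALUE`, so the headers probably need cleaning after loading. The sketch below shows one way to do that. It reads a few made-up rows from an in-memory string so that it runs without the file; the row values, the exact header spelling, and the name `housing_df` are assumptions, not taken from `WestRoxbury.csv`.

```python
import io

import pandas as pd

# Made-up rows standing in for WestRoxbury.csv (values and header spacing are assumptions)
csv_text = """TOTAL VALUE ,TAX,LOT SQ FT ,YR BUILT,GROSS AREA
344.2,4330,9965,1880,2436
412.6,5190,6590,1945,3108
330.1,4152,7500,1890,2294
"""

# With the real file this call would be pd.read_csv("WestRoxbury.csv")
housing_df = pd.read_csv(io.StringIO(csv_text))

# Trim stray whitespace, then replace inner spaces with underscores
housing_df.columns = [name.strip().replace(" ", "_") for name in housing_df.columns]

print(housing_df.columns.tolist())
print(housing_df.head())
```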

## 3. Data Preparation

### 3.1 Find the shape of the data

Determine the shape of the data frame. It should have 5802 rows and 14 columns.

### 3.2 Remove old houses

Now filter the data to exclude all houses built before 1900, i.e., keep only houses built in 1900 or later.
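A minimal sketch of this kind of row filter with a boolean mask, using a tiny made-up data frame (the values and the names `housing_df` and `recent_df` are illustrative assumptions). Note that a house built exactly in 1900 is kept, since only houses built before 1900 are excluded.

```python
import pandas as pd

# Made-up rows; the real data frame would come from WestRoxbury.csv
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [344.2, 412.6, 330.1, 498.6],
    "YR_BUILT": [1880, 1945, 1900, 1935],
})

# Keep homes built in 1900 or later; .copy() gives an independent frame
# so that new columns can be added later without pandas warnings
recent_df = housing_df[housing_df["YR_BUILT"] >= 1900].copy()

# Compare the shapes before and after filtering
print(housing_df.shape, "->", recent_df.shape)
```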

In [None]:
# exclude houses that were built before 1900



In [None]:
# Verify the shape of the data frame



### 3.3 Calculate the age of the house (`AGE`) from `YR_BUILT`

Calculate the `AGE` from the `YR_BUILT`, e.g., `AGE` = 2020 - `YR_BUILT`
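The subtraction is vectorized in pandas, so no loop is needed. A short sketch on made-up years (the data frame and the constant name `REFERENCE_YEAR` are assumptions for illustration):

```python
import pandas as pd

# Made-up construction years
housing_df = pd.DataFrame({"YR_BUILT": [1945, 1900, 1990]})

REFERENCE_YEAR = 2020  # the reference year from the example formula

# Column arithmetic applies to every row at once
housing_df["AGE"] = REFERENCE_YEAR - housing_df["YR_BUILT"]

print(housing_df)
```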

In [None]:
# create the AGE column 



### 3.4 Transform `REMODEL` 

Convert the categorical variable `REMODEL` to a binary indicator `REMODEL_RECENT` that equals 1 if a house was recently remodeled and 0 otherwise.
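One possible way to build such an indicator is an equality comparison cast to integers, sketched here on made-up values. Depending on the pandas version, `read_csv` may read the literal text `None` as a missing value; the comparison returns `False` for missing entries, so those rows still map to 0 under this approach.

```python
import pandas as pd

# Made-up REMODEL values covering the three categories
housing_df = pd.DataFrame({"REMODEL": ["None", "Recent", "Old", "Recent"]})

# True where the house was recently remodeled, then cast True/False to 1/0
housing_df["REMODEL_RECENT"] = (housing_df["REMODEL"] == "Recent").astype(int)

print(housing_df)
```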

In [None]:
# convert 'REMODEL' to binary indicator 'REMODEL_RECENT'



### 3.5 Summarize all variables

Obtain summary statistics of all numerical variables in the data frame.
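As a sketch on made-up rows: when a data frame mixes numeric and text columns, `DataFrame.describe()` by default summarizes only the numeric ones, which matches what is asked here.

```python
import pandas as pd

# Made-up rows with three numeric columns and one text column
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [344.2, 412.6, 330.1, 498.6],
    "GROSS_AREA": [2436, 3108, 2294, 5032],
    "AGE": [75, 120, 30, 85],
    "REMODEL": ["None", "Recent", "Old", "Recent"],
})

# count, mean, std, min, quartiles, and max for each numeric column
summary = housing_df.describe()

print(summary.round(1))
```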

In [None]:
# get summary statistics of all numeric columns



## 4. Data Visualizations

### 4.1 Explore `TOTAL_VALUE` and `GROSS_AREA`

`TOTAL_VALUE` and `GROSS_AREA`: create a scatterplot with a linear fitted line
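A scatterplot with a fitted line can be drawn in one call with `seaborn.regplot`. The sketch below uses synthetic data generated from an assumed linear relationship, so the numbers say nothing about the real West Roxbury homes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: an assumed linear relationship plus noise
rng = np.random.default_rng(0)
gross_area = rng.uniform(1500, 5000, size=200)
total_value = 50 + 0.1 * gross_area + rng.normal(0, 30, size=200)
housing_df = pd.DataFrame({"GROSS_AREA": gross_area, "TOTAL_VALUE": total_value})

# regplot draws the points, the least-squares line, and a confidence band
ax = sns.regplot(
    data=housing_df,
    x="GROSS_AREA",
    y="TOTAL_VALUE",
    scatter_kws={"alpha": 0.4},
    line_kws={"color": "red"},
)
ax.set_title("TOTAL_VALUE vs. GROSS_AREA (synthetic data)")
```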

`TOTAL_VALUE` and `GROSS_AREA`: calculate the Pearson correlation coefficient
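One way to obtain the coefficient together with a p-value is `scipy.stats.pearsonr`, sketched here on synthetic data (the generating relationship is an assumption). The same call applies to any other pair of numeric columns.

```python
import numpy as np
from scipy import stats

# Synthetic data: an assumed positive linear relationship plus noise
rng = np.random.default_rng(0)
gross_area = rng.uniform(1500, 5000, size=200)
total_value = 50 + 0.1 * gross_area + rng.normal(0, 30, size=200)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(gross_area, total_value)

print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```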

### 4.2 Explore `TOTAL_VALUE` and `AGE`

`TOTAL_VALUE` and `AGE`: create a scatterplot with a linear fitted line 

`TOTAL_VALUE` and `AGE`: calculate the Pearson correlation coefficient

## 5. Simple Linear Regression

### 5.1 Regress `TOTAL_VALUE` on `GROSS_AREA`

### 5.2 Visualize the regression line

### 5.3 Check the distribution of residuals

Plot a histogram of the normalized residuals.

### 5.4 Interpretation of Results

Interpret your regression results. Make sure to discuss:
1. Overall fit of the model
2. Regression equation
3. Significance of individual variables and regression coefficients
4. Whether there are any violations of model assumptions
5. Whether there are any influential observations

<font color='blue'> Write your answer here (double click to open the cell): </font>



 

## 6. Multiple Linear Regression

### 6.1 Regress `TOTAL_VALUE` on `GROSS_AREA`, `AGE`, and `REMODEL_RECENT`

### 6.2 Compute VIF values

### 6.3 Visualize the partial regression plots

Plot partial regression plots to visualize the effect and significance of each individual predictor separately.

### 6.4 Interpretation of Results

Interpret your regression results. Make sure to discuss:
1. Overall fit of the model
2. Regression equation
3. Significance of individual variables and regression coefficients
4. Whether there are any violations of model assumptions
5. Whether there are any influential observations
6. Whether there is any multicollinearity issue

<font color='blue'> Write your answer here (double click to open the cell): </font>



 

## 7. Executive Summary

Report your observations and findings in the executive summary. The summary should not exceed one page.

<font color='blue'> Summary of findings (double click to open the cell): </font>





