**Author(s):**
1. Author (author@emory.edu)

**Objectives**: This <ins>assessment</ins> aims at
 1. Demonstrating that you understand the difference between unordered categorical, ordered categorical, and continuous variables, and that you know how to tune hyperparameters when training a model;
 2. Demonstrating familiarity with *GitHub* to retrieve and submit computer code.

**Instructions**:
 1. Please write down your Python code and <ins>execute</ins> it in the cell below each question.
 2. Please make sure that your final code <ins>runs</ins> before submitting it.
 
**Data Source**: The `hprice3` data set from the `wooldridge` library, which provides the data sets for [Introductory Econometrics: A Modern Approach](https://pypi.org/project/wooldridge/) by Jeffrey Wooldridge.

**Data Description**:

```
## Obs:   321

##  1. year                     1978, 1981
##  2. age                      age of house
##  3. agesq                    age^2
##  4. nbh                      neighborhood, 0 to 6
##  5. cbd                      dist. to central bus. dstrct, feet
##  6. inst                     dist. to interstate, feet
##  7. linst                    log(inst)
##  8. price                    selling price
##  9. rooms                    # rooms in house
## 10. area                     square footage of house
## 11. land                     square footage lot
## 12. baths                    # bathrooms
## 13. dist                     dist. from house to incin., feet
## 14. ldist                    log(dist)
## 15. lprice                   log(price)
## 16. y81                      =1 if year = 1981
## 17. larea                    log(area)
## 18. lland                    log(land)
## 19. linstsq                  linst^2
```

1. [5 points] Load the `hprice3` data set from the `wooldridge` Python library. 🛑 If you load this data set from another source there will be a 1-point penalty.

In [None]:
import pandas as pd
import wooldridge as woo
hprice3 = woo.dataWoo('hprice3').copy()

2. [20 points] Plot the estimated joint density of $\log(price)$ and $\log(land)$ using 'rule-of-thumb' bandwidths. Please provide informative labels.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from statsmodels.nonparametric.kernel_density import KDEMultivariate as KDE

# Create surface plot data
x = np.linspace(hprice3.lprice.min(),hprice3.lprice.max(), 100)
y = np.linspace(hprice3.lland.min(),hprice3.lland.max(), 100)
X, Y = np.meshgrid(x, y)
dens = KDE(data=hprice3[['lprice','lland']],var_type='cc',bw='normal_reference')
Z = np.array([dens.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
# fig.gca(projection='3d') no longer works in recent Matplotlib releases
ax = fig.add_subplot(projection='3d')
ax.set_xlabel(r'$\log(price)$')
ax.set_ylabel(r'$\log(land)$')
ax.set_zlabel(r'$\hat{f}(lprice,lland)$')
# Set the viewing angle; the camera distance stays at its default
ax.view_init(elev=20, azim=-40)
surface = ax.plot_surface(X, Y, Z)
plt.show()
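
For intuition about the 'rule-of-thumb' bandwidths, the sketch below computes a normal-reference bandwidth by hand. It assumes the formula $h_j = 1.06\,\hat{\sigma}_j\,n^{-1/(4+q)}$, where $q$ is the number of variables, which appears to be what `bw='normal_reference'` uses in `statsmodels`. The helper `normal_reference_bw` and the toy data are hypothetical and are not part of the assignment.

```python
import statistics


def normal_reference_bw(columns):
    """Rule-of-thumb bandwidth per column: 1.06 * sd * n**(-1 / (4 + q)).

    `columns` is a list of equal-length lists, one per variable.
    Assumption: the population standard deviation (ddof=0) is used.
    """
    q = len(columns)        # number of variables
    n = len(columns[0])     # number of observations
    scale = n ** (-1.0 / (4 + q))
    return [1.06 * statistics.pstdev(col) * scale for col in columns]


# Toy stand-ins for log(price) and log(land); these are not hprice3 values.
toy_lprice = [11.2, 11.5, 11.1, 11.9, 11.4, 11.7, 11.3, 11.6]
toy_lland = [9.8, 10.4, 10.1, 10.9, 10.0, 10.6, 9.9, 10.3]

bandwidths = normal_reference_bw([toy_lprice, toy_lland])
print(bandwidths)
```

Each bandwidth scales with the spread of its own variable and shrinks slowly as the sample grows.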

3. You are interested in fitting a model for `lprice` based on `year`, `age`, `nbh`, `linst`, `rooms`, `larea`, `lland`, and `baths`, i.e.,

$$
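\mathbb{E}[\texttt{lprice}\mid\texttt{year},\texttt{age},\texttt{nbh},\texttt{linst},\texttt{rooms},\texttt{larea},\texttt{lland},\texttt{baths}]=m(\texttt{year},\texttt{age},\texttt{nbh},\texttt{linst},\texttt{rooms},\texttt{larea},\texttt{lland},\texttt{baths}).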
$$

   (a) [3 points] After treating `age` as a continuous feature, what _other_ features are also _continuous_?

**Answer:** `linst`, `larea`, and `lland`.

   (b) [3 points] What features are _ordered_ categorical variables?

**Answer:** `year`, `rooms`, and `baths`.

   (c) [1 point] What feature is an _unordered_ categorical variable?

**Answer:** `nbh`.

4. [20 points] Create `X_train`, `X_test`, `y_train`, and `y_test` as previously done, holding out 20% of your data set as the test set and setting `random_state=42`.

In [None]:
tmp = ["year", "age", "nbh", "linst", "rooms","larea", "lland", "baths"]
X = hprice3[tmp]
y = hprice3["lprice"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
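
As a quick sanity check on the split, the sketch below works out how many of the 321 observations should land in each set. It assumes that `train_test_split` rounds the test count up and assigns the remainder to the training set, which is the behavior to expect when only a fractional `test_size` is given.

```python
import math

n_obs = 321        # observations in hprice3, per the data description
test_size = 0.20

# Assumption: the test count is rounded up and the training set takes the rest.
n_test = math.ceil(test_size * n_obs)
n_train = n_obs - n_test

print(n_train, n_test)  # 256 65
```

Comparing these counts with `X_train.shape[0]` and `X_test.shape[0]` is a cheap way to confirm the split was created as intended.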

5. [20 points] Fit a _local linear nonparametric regression_ to your training data set with bandwidths equal to 0.5 for `year`, `nbh`, `linst`, `lland`, and `baths`. Set the bandwidths for `age`, `rooms`, and `larea` equal to 20, 0.6, and 0.22, respectively. Then print the $R^2$ (as defined in the lecture notes). 🛑 The $R^2$ of this fit should be equal to `0.9501017679028126`; otherwise there will be a 5-point penalty.
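
The variable types and bandwidths have to be passed in the same order as the columns of `X_train`. The sketch below is one way to line them up; the dictionaries `kind` and `bandwidth` are hypothetical helpers, and the classification mirrors the `var_type` string used in the solution cell (`'c'` continuous, `'o'` ordered, `'u'` unordered).

```python
# Column order used when building X in question 4.
features = ["year", "age", "nbh", "linst", "rooms", "larea", "lland", "baths"]

# Hypothetical lookup tables keyed by feature name.
kind = {"year": "o", "age": "c", "nbh": "u", "linst": "c",
        "rooms": "o", "larea": "c", "lland": "c", "baths": "o"}
bandwidth = {"year": 0.5, "nbh": 0.5, "linst": 0.5, "lland": 0.5, "baths": 0.5,
             "age": 20, "rooms": 0.6, "larea": 0.22}

# Build both arguments in column order so they cannot drift apart.
var_type = "".join(kind[f] for f in features)
bw = [bandwidth[f] for f in features]

print(var_type)  # ocucocco
print(bw)        # [0.5, 20, 0.5, 0.5, 0.6, 0.22, 0.5, 0.5]
```

Deriving both from the feature list avoids a silent mismatch if the column order ever changes.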

In [None]:
# Fit a local linear nonparametric regression with the bandwidths specified above
from statsmodels.nonparametric.kernel_regression import KernelReg as KREG
reg_ll = KREG(endog=y_train,exog=X_train,reg_type='ll',var_type = 'ocucocco',bw=np.array([0.5,20,0.5,0.5,0.6,0.22,0.5,0.5]))
print(reg_ll.r_squared())
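
The lecture-notes definition of $R^2$ is not reproduced here. The sketch below assumes the definition documented for `KernelReg.r_squared`, namely $R^2 = \frac{\left[\sum_i (Y_i-\bar{Y})(\hat{Y}_i-\bar{Y})\right]^2}{\sum_i (Y_i-\bar{Y})^2 \sum_i (\hat{Y}_i-\bar{Y})^2}$; the function name `kernel_r_squared` and the toy values are hypothetical.

```python
def kernel_r_squared(y, y_hat):
    """Squared cross-product of centered y and y_hat over the product of their sums of squares.

    Both series are centered at the mean of the observed y (an assumption
    taken from the formula quoted above).
    """
    y_bar = sum(y) / len(y)
    cross = sum((yi - y_bar) * (fi - y_bar) for yi, fi in zip(y, y_hat))
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_fit = sum((fi - y_bar) ** 2 for fi in y_hat)
    return cross ** 2 / (ss_y * ss_fit)


toy_y = [1.0, 2.0, 3.0, 4.0]
perfect = kernel_r_squared(toy_y, toy_y)                 # fitted values equal y
noisy = kernel_r_squared(toy_y, [1.1, 1.9, 3.2, 3.8])    # small fitting errors

print(perfect, noisy)
```

A perfect fit gives 1, and any other fit gives a value between 0 and 1.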

6. [20 points] Using the `mean_squared_error` function from the `sklearn.metrics` module, print the test Mean Squared Error. 🛑 The test MSE should be equal to `0.08341205788860644`; otherwise there will be a 5-point penalty.

In [None]:
# fit() returns the fitted conditional means (ahat) and the marginal effects (bhat)
ahat, bhat = reg_ll.fit(X_test)
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test,ahat))          # Calculate the test MSE
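
For reference, the test MSE is the average squared gap between observed and predicted `lprice`. The sketch below computes it by hand on toy numbers; the helper `mse` is hypothetical and only illustrates the quantity that `mean_squared_error` reports under its default settings.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy log-price values; these are not taken from hprice3.
toy_true = [11.0, 11.5, 12.0]
toy_pred = [11.5, 11.5, 11.0]

toy_mse = mse(toy_true, toy_pred)   # (0.25 + 0 + 1) / 3
print(toy_mse)
```

Because the target is `lprice`, the MSE is measured in squared log units rather than squared dollars.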

7. [8 points] What is the estimated _price_ of a house that is 3 years old in 1981, has 3 rooms and 2 baths, and is located in neighborhood 0, with `linst=10`, `larea=8`, and `lland=10`? 🛑 The estimated price is `127208.82088656`; otherwise there will be an 8-point penalty.

In [None]:
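# Feature order: year, age, nbh, linst, rooms, larea, lland, baths
x = np.array([1981, 3, 0, 10, 3, 8, 10, 2])
ahat, bhat = reg_ll.fit(x)
np.exp(ahat)  # convert the fitted log(price) back to a price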
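
The model is fit to `lprice`, so its prediction is on the log scale and has to be exponentiated to obtain a price. The sketch below illustrates the back-transformation using the price quoted in the question; the variable names are hypothetical and no model is fit here.

```python
import math

stated_price = 127208.82088656            # value quoted in question 7
log_prediction = math.log(stated_price)   # fitted log(price) implied by that value

# Exponentiating the log-scale prediction recovers the price.
recovered_price = math.exp(log_prediction)
print(log_prediction, recovered_price)
```

A fitted value of roughly 11.75 on the log scale therefore corresponds to the quoted price.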