## Valuation Tool

### Imports

In [46]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np


### Importing Data

In [47]:
boston_dataset = load_boston()
data=pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

In [48]:
features=data.drop(['INDUS','AGE'],axis=1)
features.head()

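| | CRIM | ZN | CHAS | NOX | RM | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 0.0 | 0.538 | 6.575 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 0.0 | 0.469 | 6.421 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 0.0 | 0.469 | 7.185 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 0.0 | 0.458 | 6.998 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 0.0 | 0.458 | 7.147 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |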


In [49]:
# Work with log prices: the model below is fitted on the log of the target

log_prices=np.log(boston_dataset.target)
# log_prices is one-dimensional, so wrap it in a DataFrame to get two dimensions
target=pd.DataFrame(log_prices,columns=['PRICE'])
target.shape # now two-dimensional

(506, 1)
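
As a small illustration of the step above, the sketch below uses made-up prices (not values from the dataset) to show the log transform, the reshape from a 1-D array to a single-column 2-D array, and how `np.exp` undoes the transform. The variable names are hypothetical.

```python
import numpy as np

# Hypothetical prices in $000s, for illustration only
prices = np.array([24.0, 21.6, 34.7])

log_prices = np.log(prices)              # 1-D array, shape (3,)
log_column = log_prices.reshape(-1, 1)   # 2-D column, shape (3, 1)

# np.exp is the inverse of np.log, so we recover the original prices
recovered = np.exp(log_column).ravel()
```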

We won't customize every feature: values such as the crime rate per capita or the acres of industrial land in a particular area are hard for a user to know, so we're going to make some assumptions.

In other words, for the property we're looking at, every feature we don't set explicitly takes the average for all of Boston, at least for now.


In [50]:
# Column indices into property_stats, stored for later use

CRIM_IDX=0
ZN_IDX=1
CHAS_IDX=2
RM_IDX=4
PTRATIO_IDX=8

#property_stats=np.ndarray(shape=(1,11)) #empty, now we add values
#property_stats[0][CRIM_IDX]=features['CRIM'].mean()
#property_stats[0][ZN_IDX]=features['ZN'].mean()
#property_stats[0][CHAS_IDX]=features['CHAS'].mean()
#property_stats[0][RM_IDX]=features['RM'].mean()
#property_stats[0][PTRATIO_IDX]=features['PTRATIO'].mean()

In [51]:
type(features.mean())

pandas.core.series.Series

In [52]:
type(features.mean().values)

numpy.ndarray

Calculate the mean value for all the features at the same time.

In [53]:
property_stats=features.mean().values.reshape(1,11)
property_stats

array([[3.61352356e+00, 1.13636364e+01, 6.91699605e-02, 5.54695059e-01,
        6.28463439e+00, 3.79504269e+00, 9.54940711e+00, 4.08237154e+02,
        1.84555336e+01, 3.56674032e+02, 1.26530632e+01]])
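
The same idea on a tiny, made-up feature matrix: take the column means, shape them into a single row, and override individual entries for a specific property. This is only a sketch; the array `X` and the names `template` and `custom` are hypothetical. Using `reshape(1, -1)` lets NumPy infer the number of columns instead of hard-coding 11.

```python
import numpy as np

# Hypothetical feature matrix: 3 properties x 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Column means as a single row (one sample, all features)
template = X.mean(axis=0).reshape(1, -1)

# Override one feature on a copy so the template of averages stays intact
custom = template.copy()
custom[0][1] = 25.0
```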

Next we estimate the theta values (the regression coefficients) for our model and compute the root mean squared error, which we'll use to build the prediction interval.

The `fit` method takes two inputs, the features and our target, and calculates all the theta values in the background.
To get the predicted (fitted) values, we call `regr.predict(features)`.

So, using the thetas from our model, we calculate a predicted value for every row of our features DataFrame.

In [54]:
regr=LinearRegression().fit(features,target) # estimates all thetas
fitted_vals=regr.predict(features) # predictions for every row

# Calculate MSE and RMSE

MSE = mean_squared_error(target,fitted_vals)
RMSE = np.sqrt(MSE)
# RMSE is in log units (log of prices in $000s)
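
To make the MSE and RMSE step concrete, the sketch below computes both by hand on a few made-up actual and fitted log prices. The numbers are hypothetical and only illustrate the arithmetic: the mean of the squared residuals, then its square root.

```python
import numpy as np

# Hypothetical actual and fitted log prices, for illustration only
actual = np.array([3.0, 3.2, 2.9, 3.4])
fitted = np.array([3.1, 3.1, 3.0, 3.3])

residuals = actual - fitted
mse = np.mean(residuals ** 2)   # mean of the squared residuals
rmse = np.sqrt(mse)             # back in the same (log) units as the target
```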


Now we'll create a Python function which estimates the log house price for a specific property.

First we get the log estimate using our dataset; afterwards, in step 2, we convert this output into today's dollar values.

Our function takes only four arguments: `nr_rooms`, `students_per_classroom`, `next_to_river` and `high_confidence`.

In [55]:
def get_log_estimate(nr_rooms,
                    students_per_classroom,
                    next_to_river=False,
                    high_confidence=True):
    
    # Configure property
    property_stats[0][RM_IDX]=nr_rooms
    property_stats[0][PTRATIO_IDX]=students_per_classroom

    if next_to_river:
        property_stats[0][CHAS_IDX]=1
    else:
        property_stats[0][CHAS_IDX]=0

    # Make prediction
    log_estimate = regr.predict(property_stats)[0][0]

    # Calculate range
    if high_confidence: # 95% interval: two standard deviations (two sigma)
        upper_bound = log_estimate + 2*RMSE
        lower_bound = log_estimate - 2*RMSE
        interval = 95
    else:    # 68% interval: only one standard deviation (one sigma)
        upper_bound = log_estimate + RMSE
        lower_bound = log_estimate - RMSE
        interval = 68
        
    return log_estimate, upper_bound, lower_bound, interval
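
The range calculation can be looked at on its own. The sketch below isolates it in a hypothetical helper, `log_interval`, which is not part of the notebook. The 68% and 95% figures assume the residuals are roughly normally distributed, so one RMSE either side covers about 68% of outcomes and two RMSE about 95%.

```python
def log_interval(log_estimate, rmse, high_confidence=True):
    """Return (lower, upper, coverage) around a log-price estimate.

    Assumes roughly normal residuals: +/- 1 RMSE is about 68%,
    +/- 2 RMSE is about 95%.
    """
    width = 2 * rmse if high_confidence else rmse
    coverage = 95 if high_confidence else 68
    return log_estimate - width, log_estimate + width, coverage


# Hypothetical inputs, not taken from the fitted model
lower, upper, coverage = log_interval(3.0, 0.2)
narrow = log_interval(3.0, 0.2, high_confidence=False)
```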


The house values in the dataset are in dollars at 1970 prices, so we must account for inflation and adjust the estimates that our model produces.

The question is how to get a more realistic price out of our little valuation tool.
To do this, we first calculate the `median` of the original (non-log) prices in `boston_dataset.target`, then compare it with the `median` value of houses in Boston today from the *Zillow* home values page, and finally use the ratio of the two as a scale factor for our model.
https://www.zillow.com/boston-ma/home-values/

In [56]:
np.median(boston_dataset.target)

21.2

In [57]:
ZILLOW_MEDIAN_PRICE = 739.18 # median price in $000s, as of April 2022

SCALE_FACTOR=ZILLOW_MEDIAN_PRICE/np.median(boston_dataset.target)
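
As a worked example of the arithmetic, using the two medians shown above (both in $000s). The dataset-era price of 30 is a made-up value for illustration.

```python
dataset_median = 21.2    # median of boston_dataset.target, in $000s
zillow_median = 739.18   # present-day median used above, in $000s

scale_factor = zillow_median / dataset_median
print(round(scale_factor, 2))  # 34.87

# A hypothetical dataset-era price of 30 ($000s), scaled to today's prices
scaled = 30 * scale_factor
```

Every price from the dataset era is multiplied by the same factor, so the model's relative differences between properties are preserved.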


In [58]:
def get_dollar_estimate(rm, ptratio,chas=False,large_range=True):
    """Estimate the price of a property in Boston.
    
    Keyword arguments:
    rm -- number of rooms in the property
    ptratio -- number of students per teacher for the schools in the area
    chas -- True if the property is next to the river, False otherwise
    large_range -- True for 95% prediction interval, False for 68% interval
    """
    
    
    if rm<1 or ptratio<1 or ptratio>50:
        return 'That is unrealistic. Try again'
    
    
    
    log_est,upper,lower,conf = get_log_estimate(nr_rooms=rm,
                                                students_per_classroom=ptratio, 
                                                next_to_river=chas,
                                                high_confidence=large_range )

    # Convert to today's dollars
    dollar_est=np.exp(log_est)*1000*SCALE_FACTOR
    upper_bound_prices=np.exp(upper)*1000*SCALE_FACTOR
    lower_bound_prices=np.exp(lower)*1000*SCALE_FACTOR

    # Round the dollar values to nearest thousand

    round_est=np.around(dollar_est,-3)
    round_upper=np.around(upper_bound_prices,-3)
    round_lower=np.around(lower_bound_prices,-3)

    print(f'The estimated property value is: {round_est}')
    print(f'At {conf}% confidence the valuation range is:')
    print(f'USD {round_lower} at the lower end to USD {round_upper} at the high end.')
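
The conversion inside the function has three parts: exponentiate to undo the log, multiply by 1000 to go from $000s to dollars, and apply the scale factor. The sketch below shows this with the standard library only. The helper `to_dollars` and the log estimate of 3.0 are hypothetical.

```python
import math

scale_factor = 739.18 / 21.2   # ratio of the two medians used above


def to_dollars(log_value):
    # exp() undoes the log, x1000 converts $000s to dollars,
    # and round(..., -3) rounds to the nearest thousand
    return round(math.exp(log_value) * 1000 * scale_factor, -3)


# Hypothetical log estimate, for illustration only
estimate = to_dollars(3.0)
```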


Now let's test our model.

In [59]:
get_dollar_estimate(rm=2,ptratio=15,chas=True)

The estimated property value is: 617000.0
At 95% confidence the valuation range is:
USD 424000.0 at the lower end to USD 898000.0 at the high end.
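
Note that the printed range is not symmetric around the estimate. The bounds are symmetric in log space (plus or minus two RMSE), which becomes a multiplicative factor after exponentiating. The sketch below checks this against the printed values; because those values are rounded to the nearest thousand, the implied RMSE it backs out is only approximate.

```python
import math

# Values printed by the call above
estimate, lower, upper = 617000.0, 424000.0, 898000.0

# Symmetric bounds in log space become ratios in dollar space
up_ratio = upper / estimate     # roughly 1.455
down_ratio = estimate / lower   # roughly 1.455

# The 95% range is +/- 2 * RMSE in log space, so halve the log of the ratio
implied_rmse = math.log(up_ratio) / 2
```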




### Importing boston_valuation module 

Our last step is to create a Python module containing everything we developed above, so that we can import it and calculate house prices.

In [61]:
import boston_valuation as val

val.get_dollar_estimate(rm=4,ptratio=16,chas=True)

The estimated property value is: 562000.0
At 95% confidence the valuation range is:
USD 386000.0 at the lower end to USD 818000.0 at the high end.


