# This project taught me a great deal, although the results were insignificant

## Predict potential profit in real-estate flips

The idea was to train a neural network to predict a listing's price from the features available on DuProprio. Changing individual feature values in the input could then show which material choices and renovations add the most value, and so maximize profit. This assumes that sellers assess their properties fairly. An alternative approach would be to use government records of previous sale prices, or to keep following listings over time.
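
The what-if idea can be sketched as follows. This is only an illustration and not the model used in this project: `predict_price` is a hypothetical stand-in for a trained network, and the feature names and dollar weights are invented for the example.

```python
# Hypothetical stand-in for a trained model: a fixed linear pricing rule.
# The weights below are invented for illustration only.
WEIGHTS = {"living_area_sqft": 150.0, "bathrooms": 12000.0, "hardwood": 8000.0}
BASE_PRICE = 100000.0


def predict_price(listing):
    """Return a predicted asking price for a dict of feature values."""
    return BASE_PRICE + sum(WEIGHTS[name] * listing.get(name, 0) for name in WEIGHTS)


def renovation_uplift(listing, changes):
    """Predicted price change if `changes` were applied to `listing`."""
    renovated = {**listing, **changes}
    return predict_price(renovated) - predict_price(listing)


listing = {"living_area_sqft": 1200, "bathrooms": 1, "hardwood": 0}
uplift = renovation_uplift(listing, {"bathrooms": 2, "hardwood": 1})
print(uplift)
```

Comparing the uplift with the renovation cost would give the estimated profit of a flip, provided the model's predictions can be trusted.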

## Machine learning 
An artificial neural network and an autoencoder were both tried. The problem may have been better approached with XGBoost or a random forest regressor.


## Data
The per-room list data from DuProprio was unpacked and one-hot encoded. To keep the data from becoming too wide, I limit the columns to the more widely held, and therefore more general, categories. "County" is used because I am unsure of the proper term, although it is referred to as "municipalities" in the Centris statistics. Google geocoordinates were added, and various other processing steps are done, directly on the database.
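
A dependency-free sketch of that encoding step, assuming each listing carries a plain list of feature strings. The function name `one_hot_lists` and the `min_count` parameter are hypothetical; the notebook itself works on a pandas frame and later keeps columns that are set in more than 50 listings.

```python
from collections import Counter


def one_hot_lists(rows, min_count=2):
    """One-hot encode list-valued cells, keeping items seen in >= min_count rows."""
    counts = Counter(item for row in rows for item in set(row))
    kept = sorted(item for item, n in counts.items() if n >= min_count)
    encoded = [{item: int(item in row) for item in kept} for row in rows]
    return kept, encoded


# Toy listings; the feature names are taken from the column list in this notebook.
rows = [
    ["Fireplace", "Shed", "Bidet"],
    ["Fireplace", "Dishwasher"],
    ["Shed", "Fireplace"],
]
kept, encoded = one_hot_lists(rows, min_count=2)
print(kept)        # columns that survive the frequency filter
print(encoded[1])  # encoding of the second listing
```

Filtering on frequency before encoding keeps rare items such as a single "Bidet" from adding a column that is almost entirely zeros.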

I crafted features that felt relevant to how I would assess a worthwhile purchase. Based on previous research, I felt that basic market statistics, such as filtering by standard deviations from the mean, and features built on comparables within a given distance were enough to see what was possible.
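
The comparables idea can be sketched without geopandas. This is an illustration under stated assumptions: the helper names, the 5 km radius and the two-sigma outlier cut are all hypothetical, and the coordinates and prices are invented. The notebook itself measures distance in raw degrees and includes the listing among its own comparables, whereas this sketch uses kilometres and excludes it.

```python
import math
import statistics


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def comparable_stats(target, listings, radius_km=5.0, max_sigma=2.0):
    """Mean/median asking price of nearby listings, ignoring price outliers."""
    nearby = [
        other["price"] for other in listings
        if other is not target
        and haversine_km(target["lat"], target["lng"], other["lat"], other["lng"]) <= radius_km
    ]
    if len(nearby) < 3:
        return None  # too few comparables to say anything
    mean, std = statistics.mean(nearby), statistics.pstdev(nearby)
    kept = [p for p in nearby if std == 0 or abs(p - mean) <= max_sigma * std]
    return {"mean": statistics.mean(kept), "median": statistics.median(kept), "n": len(kept)}


# Invented listings: three near the target, one several hundred km away.
target = {"lat": 45.50, "lng": -73.57, "price": 350000}
listings = [
    target,
    {"lat": 45.51, "lng": -73.57, "price": 400000},
    {"lat": 45.50, "lng": -73.58, "price": 420000},
    {"lat": 45.49, "lng": -73.56, "price": 380000},
    {"lat": 46.81, "lng": -71.21, "price": 300000},
]
stats = comparable_stats(target, listings)
print(stats)
```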

## Notes on input & output
I tried various modifications to the network architectures, at times trying to recreate the entire listing with the autoencoder and at others predicting the ask price from the listing input. No combination produced extraordinary results, especially when one-hot encoded features were included.

## Challenges
- Processing the dataframe required me to venture into Cython and try other options such as nimbus. Otherwise it was impossible.

- My pipeline is very poorly designed. I tried to avoid sklearn, but some inheritance and further exploration may be necessary in the future.


## Bugs/Notes
There is a bug that makes a call to `drop_duplicates()` necessary.



In [1]:
import pymongo
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import json
import sklearn.preprocessing as pre
from sklearn.impute import SimpleImputer
from pandarallel import pandarallel
import swifter  # registers the .swifter accessor used in unpack_taxes


In [2]:
%load_ext Cython

In [3]:
%%cython

cimport numpy as np
# from collections cimport defaultdict
import pandas as pd

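cpdef unpack_list(np.ndarray col):
    """Flag every item found in the list-valued cells of `col` with a 1."""
    cdef int length = col.shape[0]
    cdef int i
    cdef int q

    # Accumulate into one dict; the earlier pd.concat result was never
    # assigned, so only the first list cell was kept.
    out = {}
    for i in range(length):
        cell = col[i]
        if isinstance(cell, list):
            for q in range(len(cell)):
                out[cell[q]] = 1
    # As before, return None when no list-valued cell was seen
    return pd.Series(out) if out else None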

In [118]:
pandarallel.initialize()
class ProcessData:
    
    def __init__(self, cursor):
        frame = pd.DataFrame(cursor)
        # drop price upon request
        price = frame.loc[:, 'price']
        price = price.apply(lambda x: x if isinstance(x, int) else None)
        frame['price'] = price
        frame = frame[frame.price > 0]
        self.frame = frame
        self.scaler = None

        
    def feature_phase(self, frame=None):
        if frame is None:
            frame = self.frame
            
        frame = gpd.GeoDataFrame(frame, geometry=gpd.points_from_xy(frame['geo_index'].apply(lambda x: x['lng']),
                                                       frame['geo_index'].apply(lambda x: x['lat'])))

        def find_comparables(row):
            # compare within 0.03 degrees (raw lat/lng units; the original note puts this at 6.78 km)
            comparables = frame[frame.distance(row.geometry) < 0.03]

            if comparables.shape[0] > 3:
                comparables = comparables.dropna(subset=['Property taxes', 'School taxes'], axis=0)
                property_taxes = comparables['Property taxes'].apply(lambda x: x['monthly'] if not isinstance(x, float) else None)
                school_taxes = comparables['School taxes'].apply(lambda x: x['monthly'] if not isinstance(x, float) else None)
                total_tax = property_taxes + school_taxes
                # add mean comparison values
                row['compare_total_taxes_mean'] = round(total_tax.mean(), 2)
                row['compare_ask_mean'] = round(comparables.price.mean(), 2)
                row['compare_lot_mean'] = round(comparables['Lot dimensions'].mean(),2)
                row['compare_living_mean'] = round(comparables['Living space area (basement exclu)'].mean(), 2)
                # add median comparison values
                row['compare_total_taxes_med'] = round(total_tax.median(), 2)
                row['compare_ask_med'] = round(comparables.price.median(), 2)
                row['compare_lot_med'] = round(comparables['Lot dimensions'].median(),2)
                row['compare_living_med'] = round(comparables['Living space area (basement exclu)'].median(), 2)
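                # add distance to most expensive house
                # (idxmax returns an index label, so look it up with .loc, not .iloc)
                try:
                    priciest = frame.loc[comparables.price.idxmax()]
                    row['distance_highest_price'] = row.geometry.distance(priciest['geometry'])
                except Exception as e:
                    print(e)
                    row['distance_highest_price'] = None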
                
                # try to add correlation of values to price
            else:
                return row
            return row
        
        frame = frame.parallel_apply(find_comparables, axis=1)
        
        if frame is self.frame:
            self.frame = frame
        
        return frame
        
        
    def drop_phase(self, frame=None):
        if frame is None:
            frame = self.frame
            
        # drop by percent of missing values
        pct = 0.20
        missing_pct = frame.isnull().sum() > (frame.shape[0] * pct)
        frame = frame.loc[:, ~missing_pct]
        
        # drop if living space or lot dimensions are missing
        drop_cols = ['Living space area (basement exclu)', 'Lot dimensions']
        for col in drop_cols:
            frame = frame[frame[col] > 0]
            
        # drop columns that are not used to model
        drop_cols = ['_id', 'link', 'description', 'results', 'listing_num', 'address', 'geo_index']
        frame = frame.drop(columns=drop_cols)
        
        if frame is self.frame:
            self.frame = frame
            
        return frame
            
    
    def unpack_taxes(self, frame=None):
        if frame is None:
            frame = self.frame
        taxes = [x for x in frame if 'taxes' in x]
        for tax in taxes:
            frame[tax] = frame[tax].swifter.apply(lambda x: x['monthly'] if isinstance(x, dict) else None)
        
        if frame is self.frame:
            self.frame = frame        
        
        return frame
    
    def impute_missing(self, frame=None):
        if frame is None:
            frame = self.frame
            
        x = frame.select_dtypes(exclude=['object', 'geometry', '<M8[ns]'])
        scaler = pre.MinMaxScaler()
        a = scaler.fit_transform(x.apply(lambda i: i.fillna(frame[i.name].mean())))
        a = pd.DataFrame(a, columns=x.columns, index=x.index)
        frame = frame.select_dtypes(include=['object', 'geometry', '<M8[ns]']).join(a)
        a = None
        x = None
        
        if frame is self.frame:
            self.frame = frame
            self.scaler = scaler
        
        return frame, scaler
    
    def one_hot(self, frame=None):
        if frame is None:
            frame = self.frame
            
        # identify column names
        
        categ_frame = frame.select_dtypes(include=['object'])
        categ_cols = categ_frame.columns
        to_transform = categ_frame
        trans_cols = to_transform.columns
        # transform
        
        trans = to_transform.parallel_apply(lambda x: unpack_list(x.to_numpy()), axis=1)
#         to_transform.join(categ_frame[['last_modified', 'geo_index']])
        frame = frame.join(trans)
        frame = frame.drop(columns=categ_cols)
        frame = frame.fillna(0)
        frame = frame.drop_duplicates()
        
        if frame is self.frame:
            self.frame = frame
            
        return frame
    
    def centris_stats(self, frame=None, scaler=None):
        if frame is None and scaler is None:
            frame = self.frame
            scaler = self.scaler
            
        centris_stats = client['properties']['LiquidityPremium']
        market_stats = pd.DataFrame(list(centris_stats.find({})))
        market_stats = market_stats[['_id', 'name', 'Single-family', 'Population (2016)', 'Total residential']]
        market_sales_pct = market_stats['Single-family'].apply(lambda x: x['quarter']['sales']['percent'])
        market_sales_num = market_stats['Single-family'].apply(lambda x: x['quarter']['sales']['num'])
        market_selling_pct = market_stats['Single-family'].apply(lambda x: x['quarter']['avg_selling_time_days']['percent'])
        market_selling_num = market_stats['Single-family'].apply(lambda x: x['quarter']['avg_selling_time_days']['num'])


        stats_centris = pd.concat((market_sales_pct, market_sales_num, market_selling_pct, market_selling_num, market_stats.name), axis=1)

        stats_centris['county'] = stats_centris.apply(lambda x: x['name'].split(',')[0], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Ahuntsic / Cartierville' if 'Ahuntsic' in x['county'] else x['county'], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Venise-En-Quebec' if 'Venise' in x['county'] else x['county'], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Beauport' if 'Québec (Beauport)' in x['county'] else x['county'], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Stoneham' if 'Stoneham' in x['county'] else x['county'], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Cap-Rouge' if 'Cap-Rouge' in x['county'] else x['county'], axis=1)
        stats_centris['county'] = stats_centris.apply(lambda x: 'Charlesbourg' if 'Charlesbourg' in x['county'] else x['county'], axis=1)

        stats_centris['county'] = stats_centris.apply(lambda x: "-".join([x if x != 'Saint' else 'St' for x in x.county.split('-')  ]) if 'Saint' in x['county'] else x['county'], axis=1)
        stats_centris.columns = ['market_sales_pct', 'market_sales_num', 'market_selling_pct', 'market_selling_num', 'name', 'county']
        # Convert astericks to nan
        names = stats_centris[['name', 'county']]
        stats_centris = stats_centris.iloc[:, :-2].apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
        stats_centris = stats_centris.join(names)
        
        # scale the values
#         a = scaler.fit_transform(x)
#         a = pd.DataFrame(a, columns=x.columns, index=x.index)

        frame = (frame.reset_index()
                 .set_index('county')
                 .join(stats_centris
                 .set_index('county'))
                 .reset_index()
                 .set_index('index') )
        
        if frame is self.frame:
            self.frame = frame    
        
        return frame

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [119]:
client = pymongo.MongoClient('mongodb://localhost:27017/')
mdf = client['DuProprio']['ListingDetails']
cursor = mdf.find({'results': {'$exists': True}})
# df = pd.DataFrame(cursor)

In [120]:
test = ProcessData(cursor)

In [121]:
featured = test.feature_phase()

single positional indexer is out-of-bounds
single positional indexer is out-of-bounds


  return np.nanmean(a, axis, out=out, keepdims=keepdims)


In [122]:
tax = test.unpack_taxes(featured)


In [123]:
centris = test.centris_stats(tax)

In [137]:
abc.cov()['price'].sort_values(ascending=False)

price                                 0.006122
compare_ask_mean                      0.004914
compare_ask_med                       0.004633
Bathrooms                             0.002007
compare_living_med                    0.001964
School taxes                          0.001891
Half baths                            0.001629
Property taxes                        0.000146
compare_lot_mean                      0.000022
Lot dimensions                        0.000013
compare_lot_med                       0.000013
Living space area (basement exclu)    0.000013
compare_living_mean                   0.000012
Name: price, dtype: float64
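
The covariances above are small because the columns were min-max scaled, so their magnitudes depend on each column's spread and are hard to compare. A correlation coefficient removes that dependence. Below is a minimal sketch with invented values; `pearson` is a hypothetical helper, and in pandas the equivalent call would be `DataFrame.corr()`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: rescaling a column changes its covariance with price
# but leaves the correlation untouched.
price = [0.10, 0.20, 0.40, 0.80]
area = [0.001, 0.002, 0.004, 0.008]  # tiny range, as for a column squeezed by one huge outlier
area_rescaled = [a * 100 for a in area]

print(pearson(price, area), pearson(price, area_rescaled))
```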

In [340]:
def county_encode(row):
    row[str(row['county'])] = 1
    return row
county_info = abc.county.to_frame().parallel_apply(county_encode, axis=1)

In [342]:
county_info = county_info.fillna(0)

In [133]:
dropped = test.drop_phase(centris)

In [325]:
# test.unpack_taxes()
for c in ab.columns: print(c)

geometry
last_modified
Bathrooms
Half baths
Living space area (basement exclu)
Lot dimensions
Property taxes
School taxes
compare_ask_mean
compare_ask_med
compare_living_mean
compare_living_med
compare_lot_mean
compare_lot_med
price
A/C
Air exchanger
Alarm system
B/I Microwave
Bath and shower
Blinds
Ceiling fixtures
Central air
Central vacuum
Ceramic Shower
Claw Foot Bathtub
Cold room
Concrete
Crawl basement
Crawl space
Dishwasher
Dryer
Fireplace
Freezer
Fridge
Furnace
Furnished
Low (6 feet or under)
No backyard neighbors
None
Partially finished
Potential income
Preserved wood foundation
Purification field
Residential area
Separate Shower
Separate entrance
Septic tank
Shed
Soaker bath
Step-up bath
Stone
Stove
Therapeutic bath
Thermo-masseur bath tub
Totally finished
Two sinks
Unfinished
Ventilator
Walk-in closet
Washer
Water softener
Well
Whirlpool Bath Tub
Window coverings
Bidet
Brick
California shutters
Carpet
Cedar wardrobe
Ceramic
Dehumidifier
Generator
Greenhouse
Half bath on the 

In [745]:
test.frame.to_pickle('unpack_taxes_DuProprio.pkl')

In [135]:
abc, scaler = test.impute_missing(dropped)

In [138]:
ab = test.one_hot(abc)

In [18]:
ab = ab.drop_duplicates()
ab.to_pickle('process.pkl')

In [385]:
ab1 = ab.select_dtypes(exclude='object').rename_axis(index="").drop(columns=['geometry', 'last_modified', 'None']).drop_duplicates()

In [386]:
ab1 = ab1.join(county_info.reset_index().drop_duplicates().rename_axis(index=''))

In [459]:
pd.DataFrame(scaler.inverse_transform(ab.iloc[:,2:15]), columns =ab.iloc[:,2:15].columns).describe()

Unnamed: 0,Bathrooms,Half baths,Living space area (basement exclu),Lot dimensions,Property taxes,School taxes,compare_ask_mean,compare_ask_med,compare_living_mean,compare_living_med,compare_lot_mean,compare_lot_med,price
count,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0,23200.0
mean,443263.7,290153.2,41263.48,40567.26,57017.85,228368.2,926857.7,862500.0,42126.87,708302.3,57569.58,47377.0,427627.3
std,163040.5,226917.9,17565.53,17783.09,61509.44,113828.5,310193.9,301619.1,24224.17,239432.2,53107.07,26399.17,212810.6
min,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0
25%,261500.0,40000.0,40744.77,40049.95,51289.91,159157.1,698654.1,653384.6,41259.72,521956.6,44050.32,43279.24,285000.0
50%,483000.0,419714.3,41008.59,40069.83,54583.92,215826.4,847106.7,769337.6,41681.64,689459.4,46273.3,44127.65,384500.0
75%,483000.0,419714.3,41367.04,40162.39,57833.42,256993.4,1102420.0,1048585.0,42395.4,839249.0,58810.08,49022.85,525000.0
max,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0,2698000.0


In [416]:
columns = ab1.iloc[:,-1097:]
col_index = []
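for i in range(columns.shape[1]):
    try:
        # keep one-hot columns that are set in more than 50 listings
        if columns.iloc[:, i].value_counts()[1] > 50:
            col_index.append(i)
    except KeyError:
        # columns with no value equal to 1 are kept as well
        col_index.append(i)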
        
column_names = columns.iloc[:, col_index].columns
drop_cols = [x for x in columns.columns if x not in column_names]

# ab1.iloc[:,-1097:]

In [421]:
aaaaa = aaaaa.dropna()

In [417]:
aaaaa = ab1.drop(columns=drop_cols)

In [357]:
for i in ab1.columns: print(i)

Bathrooms
Half baths
Living space area (basement exclu)
Lot dimensions
Property taxes
School taxes
compare_ask_mean
compare_ask_med
compare_living_mean
compare_living_med
compare_lot_mean
compare_lot_med
price
A/C
Air exchanger
Alarm system
B/I Microwave
Bath and shower
Blinds
Ceiling fixtures
Central air
Central vacuum
Ceramic Shower
Claw Foot Bathtub
Cold room
Concrete
Crawl basement
Crawl space
Dishwasher
Dryer
Fireplace
Freezer
Fridge
Furnace
Furnished
Low (6 feet or under)
No backyard neighbors
Partially finished
Potential income
Preserved wood foundation
Purification field
Residential area
Separate Shower
Separate entrance
Septic tank
Shed
Soaker bath
Step-up bath
Stone
Stove
Therapeutic bath
Thermo-masseur bath tub
Totally finished
Two sinks
Unfinished
Ventilator
Walk-in closet
Washer
Water softener
Well
Whirlpool Bath Tub
Window coverings
Bidet
Brick
California shutters
Carpet
Cedar wardrobe
Ceramic
Dehumidifier
Generator
Greenhouse
Half bath on the ground floor
Hardwood
Hot tu

In [383]:
aaaaa.dropna()

Unnamed: 0,Bathrooms,Half baths,Living space area (basement exclu),Lot dimensions,Property taxes,School taxes,compare_ask_mean,compare_ask_med,compare_living_mean,compare_living_med,...,St-Hermenegilde,St-Jean-De-Brebeuf,St-Joseph-De-Sorel,St-Leonard-De-Portneuf,St-Léonard-D'Aston,St-Luc-De-Vincennes,St-Mathieu-d'Harricana,St-Pamphile,St-Pierre-De-La-Riviere-Du-Sud,St-Prosper
,,,,,,,,,,,,,,,,,,,,,
2,0.250000,0.000000,0.000487,0.000619,0.007117,0.050017,0.295148,0.273347,0.000999,0.244341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.166667,0.000000,0.000487,0.000016,0.003566,0.066217,0.288810,0.264017,0.000447,0.172989,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.166667,0.142857,0.000367,0.000047,0.003052,0.039522,0.256127,0.189210,0.000398,0.135320,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.166667,0.142857,0.000853,0.000172,0.005226,0.066217,0.534931,0.385080,0.001142,0.564641,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.083333,0.000000,0.000292,0.003276,0.005750,0.066217,0.295148,0.273347,0.000999,0.244341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8243,0.166667,0.142857,0.000527,0.000028,0.007545,0.051996,0.295148,0.273347,0.000999,0.244341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8322,0.166667,0.000000,0.000317,0.000020,0.006905,0.060228,0.295148,0.273347,0.000999,0.244341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8323,0.166667,0.142857,0.000372,0.000013,0.010503,0.130081,0.295148,0.273347,0.000999,0.244341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [422]:
x = aaaaa.drop(columns=['price', 'county'])
y = aaaaa.price


# from sklearn.impute import SimpleImputer
# imp = SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
# scaler = pre.MinMaxScaler()
# a = x.apply(lambda x: scaler.fit_transform(imp.fit_transform(x.to_numpy()))
# x = pd.DataFrame(a, columns= x.columns)
            
            
# imped = imp.fit_transform(x)
x_train = x[:-10]
y_train = y[:-10]
x_test = x[-10:]
y_test = y[-10:]

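# NOTE: `new_frame` is not defined anywhere in this notebook, so the lines
# below only run if it is supplied beforehand. They also rebind x, y and scaler.
x = new_frame.drop(columns=['price', 'geometry', np.nan]).select_dtypes(exclude=['object', '<M8[ns]'])
y = new_frame.price

scaler = pre.MinMaxScaler()
# fit on the whole 2-D array; the closing parenthesis was missing before
a = scaler.fit_transform(x.to_numpy())
x = pd.DataFrame(a, columns=x.columns, index=x.index)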
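
Holding out only the last 10 rows makes the test score depend on row order and on very few samples. A shuffled hold-out is one alternative. The sketch below uses only the standard library; the function name, the 20% fraction and the seed are assumptions for illustration and not what this notebook ran.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) after shuffling row positions reproducibly."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_rows * test_fraction)))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

With a DataFrame, the positions can be applied as `x.iloc[train_idx]` and `x.iloc[test_idx]`.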

In [423]:
import tensorflow as tf
# Use tensorflow.keras throughout; mixing it with the standalone keras package can break model building
from tensorflow import keras
from tensorflow.keras import layers, metrics, regularizers
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, Input
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import plot_model



# x_train = x[:-50]
# y_train = y[:-50]
# x_test = x[-50:]
# y_test = y[-50:]

def basic_model_3(x_size, y_size):
    t_model = Sequential()
    t_model.add(Dense(x_size, activation="tanh", kernel_initializer='normal', input_shape=(x_size,)))
    t_model.add(Dropout(0.5))
    t_model.add(Dense(x_size//2, activation="relu", kernel_initializer='normal', 
        kernel_regularizer=regularizers.l1(0.01), bias_regularizer=regularizers.l1(0.01)))
    t_model.add(Dropout(0.3))
    t_model.add(Dense(512, activation="relu", kernel_initializer='normal', 
        kernel_regularizer=regularizers.l1_l2(0.01), bias_regularizer=regularizers.l1_l2(0.01)))
    t_model.add(Dropout(0.1))
    t_model.add(Dense(10, activation="relu", kernel_initializer='normal'))
    t_model.add(Dropout(0.0))
    t_model.add(Dense(y_size))
    met = tf.keras.metrics.MeanAbsolutePercentageError()
    t_model.compile( 
        loss='mean_absolute_error',
        optimizer='adadelta',
        metrics=[metrics.mae, met])
    return t_model

def autoencoder(x_size, y_size):
    
    input_array = Input(shape=(x_size,))
    x = Dense(x_size, activation='relu')(input_array)
    x = Dense(x_size//2, activation='relu')(x)
    x = Dense(x_size//3, activation='relu')(x)
    
    x = Dense(x_size//4, activation='relu', kernel_regularizer=regularizers.l1_l2(0.01), bias_regularizer=regularizers.l1_l2(0.01),activity_regularizer=regularizers.l1_l2(0.01), name='encoded')(x)

    y = Dense(20, activation='relu', name='code')(x)

    decoded = Dense(x_size//3, activation='relu')(y)
    decoded = Dense(x_size//2, activation='relu')(decoded)

    z = Dense(y_size, activation='relu', name='output')(decoded)
    model = Model(input_array, z)
#     opt = tf.keras.optimizers.Adam(lr=0.01)
    met = tf.keras.metrics.MeanAbsolutePercentageError()
    model.compile(
        loss='mean_absolute_error',
        optimizer='adadelta', #nadam
        metrics=[metrics.mae, met])
    return model

model = basic_model_3(x_train.shape[1], 1)
model2 = autoencoder(x_train.shape[1], 1)

In [97]:
x_train.shape[0]

7457

In [371]:
model2.summary()

Model: "model_33"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_35 (InputLayer)        (None, 1004)              0         
_________________________________________________________________
dense_394 (Dense)            (None, 1004)              1009020   
_________________________________________________________________
dense_395 (Dense)            (None, 502)               504510    
_________________________________________________________________
dense_396 (Dense)            (None, 334)               168002    
_________________________________________________________________
encoded (Dense)              (None, 251)               84085     
_________________________________________________________________
code (Dense)                 (None, 20)                5040      
_________________________________________________________________
dense_397 (Dense)            (None, 334)               701

In [435]:
epochs = 1000
batch_size = 20
keras_callbacks = [
    # ModelCheckpoint('/tmp/keras_checkpoints/model.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss', save_best_only=True, verbose=2)
    ModelCheckpoint('./temp/model.{epoch:02d}.hdf5', monitor='mean_absolute_error', save_best_only=True, verbose=0),
    # TensorBoard(log_dir='/tmp/keras_logs/model_3', histogram_freq=0, write_graph=True, write_images=True, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None),
    EarlyStopping(monitor='mean_absolute_error', patience=40, verbose=0),
    TensorBoard(log_dir='/Users/mathewzaharopoulos/dev/rebuild_realestate/temp/new_struct_c7')
]

In [436]:


history = model.fit(x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
    verbose=0, # Change it to 2, if wished to observe execution
#     validation_split = 0.05,
    callbacks=keras_callbacks)


valid_score = model.evaluate(x_test, y_test, verbose=0)


In [437]:
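# NOTE: inverse_transform also adds the minimum price back, so this figure is the
# error in dollars plus that minimum; valid_score[1] / scale1.scale_ gives the error alone.
scale1.inverse_transform(np.array(valid_score[1]).reshape(-1,1))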

array([[126311.12924218]])
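
A small worked example of converting a scaled error back to dollars, written without sklearn. The helper names are hypothetical and `scaled_mae` is an invented value; the price range is the minimum and maximum shown by the `describe()` output in this notebook.

```python
def minmax_params(data_min, data_max):
    """scale_ and min_ as defined by a [0, 1] min-max scaler."""
    scale = 1.0 / (data_max - data_min)
    return scale, -data_min * scale


def inverse_transform(value, scale, offset):
    """Map a scaled value back to the original units (adds data_min back)."""
    return (value - offset) / scale


def error_to_original_units(scaled_error, scale):
    """An error is a difference, so only the scale applies, not the offset."""
    return scaled_error / scale


# Price range taken from the describe() output in this notebook.
scale, offset = minmax_params(40000.0, 2698000.0)
scaled_mae = 0.03  # invented value, for illustration only

with_offset = inverse_transform(scaled_mae, scale, offset)
in_dollars = error_to_original_units(scaled_mae, scale)
print(with_offset, in_dollars)
```

The two results differ by exactly the minimum price, which is the amount by which an inverse-transformed error is overstated.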

In [164]:
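# NOTE: scale1 is the same object as scaler, so this also rewrites scaler in place.
# Every later scaler.inverse_transform call then applies the last column's
# (assumed to be price) min_ and scale_ to all columns.
scale1 = scaler
scale1.min_, scale1.scale_ = scale1.min_[-1], scale1.scale_[-1]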

In [432]:
history2 = model2.fit(x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
    verbose=0, # Change it to 2, if wished to observe execution
#     validation_split = 0.2,
    callbacks=keras_callbacks)

valid_score2 = model2.evaluate(x_test, y_test, verbose=0)

In [438]:
valid_score2
valid_score2 = model2.evaluate(x_test, y_test, verbose=0)
scale1.inverse_transform(np.array(valid_score2[1]).reshape(-1,1))

array([[128215.06140381]])

In [232]:
# error = model2.predict(x_test.join(y_test).iloc[1].to_frame().T)[-1][-1] - x_test.join(y_test).iloc[1].to_frame().T.price
# scale1.inverse_transform(error.to_numpy().reshape(-1,1))

In [439]:
data = x_test.join(y_test).iloc[1].to_frame().T
pd.DataFrame(scaler.inverse_transform(data), columns=data.columns)

Unnamed: 0,Bathrooms,Half baths,Living space area (basement exclu),Lot dimensions,Property taxes,School taxes,compare_ask_mean,compare_ask_med,compare_living_mean,compare_living_med,...,St-Jacques-Le-Mineur,St-Juste-Du-Lac,St-Malo,St-Marc-sur-Richelieu,St-Ours,St-Prosper-De-Dorchester,Ste-Anne-De-Bellevue,Val-Barrette,Verdun,price
0,261500.0,40000.0,40453.087628,40358.100999,47409.282702,95314.391952,824503.548391,766556.451436,42655.889294,689459.354134,...,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,350000.0


In [662]:
from sklearn.impute import SimpleImputer
# the remaining arguments were defaults; `verbose` is no longer accepted by recent scikit-learn releases
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imped = imp.fit_transform(x)
x_train = imped[:-10]
y_train = y[:-10]
x_test = imped[-10:]
y_test = y[-10:]

In [652]:
valid_score2 # 20% validation

[9512818708.48, 75238.6328125]

In [664]:
valid_score2 # mean

[10157221888.0, 70634.6640625]

In [659]:
valid_score2 # most frequent

[11036537856.0, 76819.625]
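
The scores above compare imputation strategies. What "mean" and "most frequent" do to a single column can be sketched as below; `impute` is a hypothetical helper written for illustration, and the column values are invented.

```python
from collections import Counter
from statistics import mean


def impute(values, strategy="mean"):
    """Replace None entries using the mean or the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


bathrooms = [1, 2, 2, None, 4]  # invented toy column
by_mean = impute(bathrooms, "mean")
by_mode = impute(bathrooms, "most_frequent")
print(by_mean, by_mode)
```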

In [324]:
def centris():
    centris_stats = client['properties']['LiquidityPremium']
    market_stats = pd.DataFrame(list(centris_stats.find({})))
    market_stats = market_stats[['_id', 'name', 'Single-family', 'Population (2016)', 'Total residential']]
    market_sales_pct = market_stats['Single-family'].apply(lambda x: x['quarter']['sales']['percent'])
    market_sales_num = market_stats['Single-family'].apply(lambda x: x['quarter']['sales']['num'])
    market_selling_pct = market_stats['Single-family'].apply(lambda x: x['quarter']['avg_selling_time_days']['percent'])
    market_selling_num = market_stats['Single-family'].apply(lambda x: x['quarter']['avg_selling_time_days']['num'])

    
    stats_centris = pd.concat((market_sales_pct, market_sales_num, market_selling_pct, market_selling_num, market_stats.name), axis=1)

    stats_centris['county'] = stats_centris.apply(lambda x: x['name'].split(',')[0], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Ahuntsic / Cartierville' if 'Ahuntsic' in x['county'] else x['county'], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Venise-En-Quebec' if 'Venise' in x['county'] else x['county'], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Beauport' if 'Québec (Beauport)' in x['county'] else x['county'], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Stoneham' if 'Stoneham' in x['county'] else x['county'], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Cap-Rouge' if 'Cap-Rouge' in x['county'] else x['county'], axis=1)
    stats_centris['county'] = stats_centris.apply(lambda x: 'Charlesbourg' if 'Charlesbourg' in x['county'] else x['county'], axis=1)

    stats_centris['county'] = stats_centris.apply(lambda x: "-".join([x if x != 'Saint' else 'St' for x in x.county.split('-')  ]) if 'Saint' in x['county'] else x['county'], axis=1)
    stats_centris.columns = ['market_sales_pct', 'market_sales_num', 'market_selling_pct', 'market_selling_num', 'name', 'county']
        
    return stats_centris

In [325]:
cent = centris()
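
The chain of `apply` calls that renames counties can be collected into one plain function. The sketch below mirrors the same rules in the same order; the name `normalise_county` is hypothetical and the input strings in the example are invented, not taken from the Centris data.

```python
def normalise_county(name):
    """Mirror the renaming rules applied to Centris names in this notebook."""
    county = name.split(",")[0]
    # Substring rules, checked in the same order as the chained apply calls
    rules = [
        ("Ahuntsic", "Ahuntsic / Cartierville"),
        ("Venise", "Venise-En-Quebec"),
        ("Québec (Beauport)", "Beauport"),
        ("Stoneham", "Stoneham"),
        ("Cap-Rouge", "Cap-Rouge"),
        ("Charlesbourg", "Charlesbourg"),
    ]
    for needle, replacement in rules:
        if needle in county:
            county = replacement
    # Abbreviate hyphen-separated parts that are exactly "Saint"
    return "-".join("St" if part == "Saint" else part for part in county.split("-"))


# Invented example inputs
print(normalise_county("Saint-Prosper, Some Region"))
print(normalise_county("Montréal (Ahuntsic-Cartierville), Some Region"))
```

With pandas this could be applied as `stats_centris['name'].map(normalise_county)`. Note that, as in the original rules, a part spelled "Sainte" is left unchanged.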