# Property appraisal ML project.
## Phase 2: NLP processing the 'Public Remarks' Section

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

from scipy.spatial import distance
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

In [2]:
#reading one file
df = pd.read_csv('F20 P1.csv', index_col=None, header=0)

# reading part of the files from Detached June 2020-2021 folder

# data_files = ['F20 P1.csv', 'F20 P2.csv', 'F30 P1.csv', 'F30 P2.csv', 'F50 P1.csv', 'F50 P2.csv']
# data_list = []

# for filename in data_files:
#     df_current = pd.read_csv(filename, index_col=None, header=0)
#     data_list.append(df_current)

# df = pd.concat(data_list, axis=0, ignore_index=True)

# deleting yellow columns
df.drop(['Status', 'For Tax Year', 'Gross Taxes', 'Original Price', 'List Price', 'GST Incl'], axis = 1, inplace = True)

# # size of our dataset
# print('Our dataset has', len(df), 'data lines and', len(df.columns.tolist()), 'features:')
# print('\n')
# print(df.columns.tolist())

In [3]:
# dropping the columns with more than 90% NAs
moreThan = []

for feature in df:
    if df[feature].isna().sum() / df.shape[0] > 0.9:
        moreThan.append(feature)
        print("Dropping the feature:", feature)
df.drop(moreThan, axis = 1, inplace = True)

if moreThan == []:
    print('No features dropped.')
print('\n')

# dropping the columns that are not insightful: Days On Market, Public Remarks
# df1.drop(['Sold Date', 'Public Remarks'], axis=1, inplace = True)
# df.drop(['Public Remarks'], axis=1, inplace = True)

columns_names = df.columns.tolist()

print("Features left:")
print(columns_names)
print('\n')
print("Now we have", len(columns_names), "features and their types:")

# types of our columns
pd.DataFrame(df.dtypes, columns=['DataTypes'])

No features dropped.


Features left:
['Address', 'S/A', 'Price', 'Sold Date', 'Days On Market', 'Age', 'Area', 'Total Bedrooms', 'Total Baths', 'Lot Sz (Sq.Ft.)', 'Floor Area -Grand Total', 'Driveway Finish', 'Floor Area - Unfinished', 'Foundation', 'Floor Area Fin - Basement', 'Zoning', 'Parking Places - Covered', '# Rms', 'No. Floor Levels', 'Frontage - Feet', 'Depth', 'Type', 'Public Remarks']


Now we have 23 features and their types:


Unnamed: 0,DataTypes
Address,object
S/A,object
Price,object
Sold Date,object
Days On Market,int64
Age,int64
Area,object
Total Bedrooms,int64
Total Baths,int64
Lot Sz (Sq.Ft.),object


In [4]:
# hereafter we're working only with the "Public Remarks" column

In [5]:
nlp_column = df['Public Remarks'].copy()

nlp_column

0      Investor's alert. 3 bedroom tenanted home with...
1      WHY RENT? Apartment size, 1 bedroom, modern, e...
2      INVESTORS and FIRST TIME HOME BUYERS ALERT! 2 ...
3      **LARGE 8255 sqft LOT****PERFECT FOR INVESTORS...
4      Tastefully renovated 2 bed 1bath house with de...
                             ...                        
553    6,500 SF of executive living. Exquisitely buil...
554    2.07 Acre Site Great Development Potential Lan...
555    Magnificently New Luxury home by SOOD DEVELOPM...
556    LOCATION, LOCATION!! Hobby Farm in South Pt. K...
557    Location! Location! Location! Port Kells futur...
Name: Public Remarks, Length: 558, dtype: object

In [8]:
nlp_column = df['Public Remarks'].copy()

# a bit of cleaning: filling with NaN's where not available, changing some words

nlp_column = nlp_column.fillna('0')
nlp_column_prep_1 = nlp_column.str.replace('&',' and ')
nlp_column_prep_1 = nlp_column_prep_1.str.replace('%',' percent')
nlp_column_prep_1 = nlp_column_prep_1.str.replace('*','')


# replace all the digits with corresponding words: 5 -> five
import re
import num2words

# this removes thousands delimiter. so 6,550 will be 6500
nlp_column_prep_1 = [re.sub(r'(\d+),(\d+)', r'\1\2', paragraph) for paragraph in nlp_column_prep_1]

# # # these two lines change 4'' or 4 '' to 4 inch
nlp_column_prep_1 = [re.sub(r'(\d+) \'\'', r' \1 inch ', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'(\d+)\'\'', r' \1 inch ', paragraph) for paragraph in nlp_column_prep_1]                           
                       
# these two lines change 3' or 3 ' to 3 feet
nlp_column_prep_1 = [re.sub(r'(\d+)\'', r' \1 foot', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'(\d+) \'', r' \1 foot', paragraph) for paragraph in nlp_column_prep_1]

# this line changes feetx/feet x to feet by; inchx/inch x to inch by; ft. x to foor by
nlp_column_prep_1 = [paragraph.replace('footx', 'feet by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('foot x', 'foot by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('inchx', 'inch by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('inch x', 'inch by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('ft. x', 'foot by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('ft.', 'foot') for paragraph in nlp_column_prep_1]

# change $number.00 to number dollars
nlp_column_prep_1 = [re.sub(r'\$(\d+)\.(\d+)', r'\1 dollars ', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'\$(\d+)', r'\1 dollars ', paragraph) for paragraph in nlp_column_prep_1]

# here I MANUALLY change some of the non correct abbreviation for sqft and hwy
nlp_column_prep_1 = [paragraph.replace('sg ft', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('s.f.', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('square-foot', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('sq ft', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('sq.ft', 'sqft') for paragraph in nlp_column_prep_1]

nlp_column_prep_1 = [paragraph.replace('hwy', 'highway') for paragraph in nlp_column_prep_1]

# this line changes all the numbers to their words
nlp_column_prep_1 = [re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), paragraph) for paragraph in nlp_column]

# tokenize these comments
import nltk
from nltk.tokenize import word_tokenize
tokenizer = nltk.tokenize.WordPunctTokenizer()
preprocess = lambda text: ' '.join(tokenizer.tokenize(text.lower()))

nlp_column_prep_2 = [preprocess(paragraph) for paragraph in nlp_column_prep_1]

nlp_column_prep_2[12]

'mountain / river views and a beautiful sized south facing seven , zero sqft private backyard make this a home you will not want to miss . this well cared for fourbd twobath home is headache - free with a new roof , new h / w tank , and new furnace . the main living area boasts a one hundred and eighty - nine sqft wrap - around balcony that is perfect for enjoying those sunny days and the basement boasts a walkout onebd suite with shared laundry . located in one of the nicest neighborhoods of bolivar heights and is the perfect building lot in the future . single garage , school nearby , quick access to patullo bridge . near surrey central mall . the best - priced starter home on the market and a solid investment for a future custom home !'

In [4]:
# let's define a function to get the say 3 most similar paragraphs to the given one

def most_similar(doc_id, similarity_matrix, matrix):
    print (f'Similar Documents using {matrix}:')
    if matrix=='Cosine Similarity':
        similar_ix=np.argsort(similarity_matrix[doc_id])[::-1][:4]
    elif matrix=='Euclidean Distance':
        similar_ix=np.argsort(similarity_matrix[doc_id])[:4]
    for ix in similar_ix:
        if ix==doc_id:
            continue
        print('\n')
        print ({nlp_column[ix]})
        print (f'{matrix} Score : {similarity_matrix[doc_id][ix]}')
    print('\n')
    print('Similar paragraph indexes:', similar_ix[1:])
    print('\n')

# Gensim Doc2Vec model

Now we build a Doc2Vec model which is one of the best NLP tools that gives opportunity to get the similarities between texts (exactly between paragraphs!).

In [7]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_data = [TaggedDocument(words=word_tokenize(paragraph), tags=[i]) for i, paragraph in enumerate(nlp_column_prep_2)]

vect_len = 50

model_d2v = Doc2Vec(vector_size=vect_len,alpha=0.025, min_count=1)
  
model_d2v.build_vocab(tagged_data)

for epoch in range(vect_len):
    model_d2v.train(tagged_data,
                total_examples=model_d2v.corpus_count,
                epochs=model_d2v.epochs)

In [8]:
paragraph_embeddings=np.zeros((np.shape(nlp_column_prep_2)[0],vect_len))

for i in range(len(paragraph_embeddings)):
    paragraph_embeddings[i]=model_d2v.dv[i]

In [10]:
pairwise_similarities = cosine_similarity(paragraph_embeddings)
pairwise_differences = euclidean_distances(paragraph_embeddings)

In [11]:
# idx = np.random.randint(len(nlp_column))
idx = 143
print("We are interested in this paragraph with index:", idx)
print(nlp_column[idx])
print('\n')

most_similar(idx, pairwise_similarities, 'Cosine Similarity')
most_similar(idx, pairwise_differences, 'Euclidean Distance')

We are interested in this paragraph with index: 143
Welcome to this gorgeous 4 Bedroom, 2 Washroom Renovated Rancher with back lane access sitting on a big 7380 sqft lot. Detached garage at rear along with a separate oversized shed for extra storage. Central and convenient location, close to both levels of school and Guildford mall, very quiet street. Private fenced yard for kids to play or summer fun. Easy access to Vancouver and Highway #1. First Showing Saturday 2-4pm open house.


Similar Documents using Cosine Similarity:


{"Great Potential: The big lot (9660 sqft) is eligible to be subdivided into two lots, buyer's agent to verify with the city hall. Most trees beside house are approved to cut down. The property is located in a quiet and convenient neighborhood. Complete renovation and upgraded appliances have been done in 2017, with 3 bedrooms and 2 full baths up and 3 bedrooms and 2 full baths down. potential 2 Basement suites are great mortgage helper. Call/text for the showi

# HuggingFace snetence embedding generator model

Now we build paragraph embeddings using one of the HuggingFace transformers.

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v1')
# model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')

In [11]:
paragraph_embeddings_HF = model.encode(nlp_column_prep_2)

In [12]:
print("We have ", paragraph_embeddings_HF.shape[0], "public remarks")
print("Each of them is transformed to", paragraph_embeddings_HF.shape[1],"shape numeric vector.")

print("An important and very useful feauter is that these embedding vectors are normed:")
print("Min value of one of them:",paragraph_embeddings_HF[3].min())
print("Max value of the above vector:",paragraph_embeddings_HF[3].max())
print("Dot product, i.e. the length of it:", np.dot(paragraph_embeddings_HF[3],paragraph_embeddings_HF[3]))

We have  558 public remarks
Each of them is transformed to 384 shape numeric vector.
An important and very useful feauter is that these embedding vectors are normed:
Min value of one of them: -0.17637709
Max value of the above vector: 0.15290494
Dot product, i.e. the length of it: 1.0


In [13]:
pairwise_similarities = cosine_similarity(paragraph_embeddings_HF)

# idx = 461
idx = 12
print("We are interested in this paragraph with index:", idx)
print(nlp_column[idx])
print('\n')

most_similar(idx, pairwise_similarities, 'Cosine Similarity')

We are interested in this paragraph with index: 12
Mountain/River views and a beautiful sized SOUTH FACING 7,000 sqft private backyard make this a home you will not want to miss. This well cared for 4bd 2bath home is headache-free with a NEW roof, NEW H/W tank, and NEW furnace. The main living area boasts a 189 sqft wrap-around balcony that is perfect for enjoying those sunny days and the basement boasts a WALKOUT 1bd suite with shared laundry. Located in one of the nicest neighborhoods of Bolivar Heights and is the perfect building lot in the future. Single garage, school nearby, quick access to Patullo Bridge. Near Surrey Central Mall. The best-priced starter home on the market and a solid investment for a future custom home!


Similar Documents using Cosine Similarity:


{'Amazing views! Enjoy mountain & river view from this area. lovely 3 bedroom & 1 bath upstairs, living, dining room & kitchen. Downstairs has unauthorized bright 2 br rental suite with separate entrance. Basement h

In [14]:
# let's take another random passage

pairwise_similarities = cosine_similarity(paragraph_embeddings_HF)

idx = np.random.randint(len(nlp_column))
print("We are interested in this paragraph with index:", idx)
print(nlp_column[idx])
print('\n')

most_similar(idx, pairwise_similarities, 'Cosine Similarity')

We are interested in this paragraph with index: 221
Long Water Frontage,  potential revenue from water front location and fully usable land. On this peaceful island, safer for growing ideally precious plants! Very flat and   Nice setting with gorgeous view of mountain, just short walk to the free ferry crossing even when you are the only passenger (please check ferry schedule for details). Easy access to TransCanada highway one, Highway 15 and Highway 17, just minutes away from Guildford Town Center and Surrey City Center, Pacific Academy, Fraser Heights Secondary, and many more. Extremely low property tax and easy maintenance for this valued land near Vancouver, a dream place for weekend, investment, self sufficient life style by growing your own vegetable, livestocks,  etc....however, easy access to urban amenities  if needed.


Similar Documents using Cosine Similarity:


{'Oh my views!! Fraser River, Pattullo Bridge, New Westminster, and the mountains. Private corner lot with 60x13

In fact, it is obvious that this processing of public remarks is better than the others developed before. So, we'll work with this in the next steps of project.

### Let's make predictions of the prices of properties just using their Public Remarks (this is possible!)

In [15]:
# I'm just using the same ML model developed in 1. baseline sulution file

y = df['Price'].str.replace('$','').str.replace(',','').astype(int)

In [14]:
# These functions take as inputs the real and predicted values 

# 1. % of values are predicted for <= 2, 3, 5, 10, 20 % of accuracy
def results_score(real_values, predictions):
    percentage_list = [2, 3, 5, 10, 20]
    
    for percentage in percentage_list:
        diff_list = []
        diff_list = np.abs((np.array(real_values) - np.round(predictions,1)))/np.array(real_values) * 100
        print(np.round(np.shape(np.where(np.round(diff_list,2) <= percentage))[1] / np.shape(real_values)[0] * 100, 1), '% of predited values has <=', percentage, '% accuracy.')
        
# 2. Mean absolute percentage error
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# 3. Median absolute percentage error
def median_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.median(np.abs((y_true - y_pred) / y_true)) * 100

In [19]:
# now we make an ensemble learning using RandomForest, DecisionTree and XBG regressors and Ridge regression below that
# one might use more regressors but this will not significantly improve the results. even it might generate worst results

from sklearn.linear_model import Ridge, LassoCV
import xgboost as xgb
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(paragraph_embeddings_HF, y, test_size=0.2, random_state=42)
regr = 0

### Model
estimators = [
#     ('lr1', LassoCV()),
    ('lr2', RandomForestRegressor(n_estimators=7, random_state = 42, n_jobs=-1)),
    ('lr3', xgb.XGBRegressor(n_estimators=70, learning_rate=0.1, gamma=0, subsample=0.75, colsample_bytree=1)),
    ('lr4', DecisionTreeRegressor(random_state=0))
    ]
regr = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge()
)

regr.fit(X_train, y_train)
print('model fitted')

model fitted


In [20]:
# now we show the results:
print('Results for test dataset:')
results_score(y_test, np.round(regr.predict(X_test),1))

print('\n')
print('Results for train dataset (to check if we have overfitting or the results are comparable with those on test dataset):')
results_score(y_train, np.round(regr.predict(X_train),1))

Results for test dataset:
9.8 % of predited values has <= 2 % accuracy.
12.5 % of predited values has <= 3 % accuracy.
19.6 % of predited values has <= 5 % accuracy.
33.9 % of predited values has <= 10 % accuracy.
54.5 % of predited values has <= 20 % accuracy.


Results for train dataset (to check if we have overfitting or the results are comparable with those on test dataset):
40.4 % of predited values has <= 2 % accuracy.
56.3 % of predited values has <= 3 % accuracy.
78.5 % of predited values has <= 5 % accuracy.
97.8 % of predited values has <= 10 % accuracy.
99.6 % of predited values has <= 20 % accuracy.


In [43]:
# giving moreless the same results as the doc2vec embeddings, but the HFones are normed and this will be very useful.

# Take baseline solution, add these embeddings and train a ML model

In [6]:
df1 = df.copy()

# df1.drop(['Public Remarks'], axis=1, inplace = True)
df1.drop(['Address'], axis=1, inplace = True)

df1['Price'] = df1['Price'].str.replace('$','').str.replace(',','').astype(int)

df1['Sold Date'] = df1['Sold Date'].str.split("/")
df1['Sold Year'] = df1['Sold Date'].str[2]
df1['Sold Month'] = df1['Sold Date'].str[0]
df1.drop(['Sold Date'], axis=1, inplace = True)

df1['Age'] = df1['Age'].fillna(-1).astype(int)
df1.loc[df1['Age'] > 150, 'Age'] = 150

df1['Lot Sz (Sq.Ft.)'] = df1['Lot Sz (Sq.Ft.)'].str.replace(',', '').astype(float)
# df1.drop(df1[df1['Lot Sz (Sq.Ft.)'] > 50000].index, inplace = True)
# df1.drop(df1[df1['Lot Sz (Sq.Ft.)'] < 100].index, inplace = True)

df1['Floor Area -Grand Total'] = df1['Floor Area -Grand Total'].str.replace(',', '').astype(int)

# df1['Driveway Finish'] = df1['Driveway Finish'].astype(str)

df1['Floor Area - Unfinished'] = df1['Floor Area - Unfinished'].str.replace(',', '').astype(int)

df1['Foundation'] = df1['Foundation'].astype(str)

df1['Floor Area Fin - Basement'] = df1['Floor Area Fin - Basement'].str.replace(',', '').astype(float)

df1['Zoning'] = df1['Zoning'].astype(str)
df1['Zoning'] = df1['Zoning'].str.replace('A1', 'A-1')
df1['Zoning'] = df1['Zoning'].str.replace('A2', 'A-2')
df1['Zoning'] = df1['Zoning'].str.replace('1ACRER', 'RA')
df1['Zoning'] = df1['Zoning'].str.replace('1 AR', 'RA')
df1['Zoning'] = df1['Zoning'].str.replace('RF13', 'RF-13')
df1['Zoning'] = df1['Zoning'].str.replace('RHG', 'RH-G')
df1['Zoning'] = df1['Zoning'].str.replace('RS-1', 'RS1')
df1['Zoning'] = df1['Zoning'].str.replace('SING/F', 'SING')

df1['Parking Places - Covered'] = df1['Parking Places - Covered'].fillna(-1) ### or -1

df1.loc[df1['No. Floor Levels'] > 10, 'No. Floor Levels'] = -1

df1['Frontage - Feet'] = df1['Frontage - Feet'].str.replace(',', '').astype(float)
df1['Frontage - Feet'] = df1['Frontage - Feet'].fillna(-1) ### or -1

df1 = df1.drop(['Depth'], axis=1)

df1['new feature 1'] = (df1['Total Bedrooms'] + df1['# Rms'])

df1['Total Baths'] = df1['Total Baths'].astype(str)
df1['# Rms'] = df1['# Rms'].astype(str)

In [7]:
# delete outliers

first_quartile = df1.quantile(q=0.25)
third_quartile = df1.quantile(q=0.75)
IQR = third_quartile - first_quartile

outliers = df1[(df1 > (third_quartile + 1.5 * IQR)) | (df1 < (first_quartile - 1.5 * IQR))].count(axis=1)
outliers.sort_values(axis=0, ascending=False, inplace=True)

outliers = outliers.head(np.int(np.ceil(df.shape[0]/20)))
df1.drop(outliers.index, inplace=True)

df1.shape

(530, 23)

In [8]:
nlp_column = df1['Public Remarks'].copy()
df1.drop(['Public Remarks'], axis=1, inplace = True)

# a bit of cleaning: filling with NaN's where not available, changing some words

nlp_column = nlp_column.fillna('0')
nlp_column_prep_1 = nlp_column.str.replace('&',' and ')
nlp_column_prep_1 = nlp_column_prep_1.str.replace('%',' percent')
nlp_column_prep_1 = nlp_column_prep_1.str.replace('*','')


# replace all the digits with corresponding words: 5 -> five
import re
import num2words

# this removes thousands delimiter. so 6,550 will be 6500
nlp_column_prep_1 = [re.sub(r'(\d+),(\d+)', r'\1\2', paragraph) for paragraph in nlp_column_prep_1]

# # # these two lines change 4'' or 4 '' to 4 inch
nlp_column_prep_1 = [re.sub(r'(\d+) \'\'', r' \1 inch ', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'(\d+)\'\'', r' \1 inch ', paragraph) for paragraph in nlp_column_prep_1]                           
                       
# these two lines change 3' or 3 ' to 3 feet
nlp_column_prep_1 = [re.sub(r'(\d+)\'', r' \1 foot', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'(\d+) \'', r' \1 foot', paragraph) for paragraph in nlp_column_prep_1]

# this line changes feetx/feet x to feet by; inchx/inch x to inch by; ft. x to foor by
nlp_column_prep_1 = [paragraph.replace('footx', 'feet by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('foot x', 'foot by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('inchx', 'inch by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('inch x', 'inch by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('ft. x', 'foot by') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('ft.', 'foot') for paragraph in nlp_column_prep_1]

# change $number.00 to number dollars
nlp_column_prep_1 = [re.sub(r'\$(\d+)\.(\d+)', r'\1 dollars ', paragraph) for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [re.sub(r'\$(\d+)', r'\1 dollars ', paragraph) for paragraph in nlp_column_prep_1]

# here I MANUALLY change some of the non correct abbreviation for sqft and hwy
nlp_column_prep_1 = [paragraph.replace('sg ft', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('s.f.', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('square-foot', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('sq ft', 'sqft') for paragraph in nlp_column_prep_1]
nlp_column_prep_1 = [paragraph.replace('sq.ft', 'sqft') for paragraph in nlp_column_prep_1]

nlp_column_prep_1 = [paragraph.replace('hwy', 'highway') for paragraph in nlp_column_prep_1]

# this line changes all the numbers to their words
nlp_column_prep_1 = [re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), paragraph) for paragraph in nlp_column]

# tokenize these comments
import nltk
from nltk.tokenize import word_tokenize
tokenizer = nltk.tokenize.WordPunctTokenizer()
preprocess = lambda text: ' '.join(tokenizer.tokenize(text.lower()))

nlp_column_prep_2 = [preprocess(paragraph) for paragraph in nlp_column_prep_1]

In [9]:
paragraph_embeddings_HF = model.encode(nlp_column_prep_2)

In [10]:
from sklearn.decomposition import PCA

paragraph_embeddings_transformed=0

pca = PCA(n_components=15)
paragraph_embeddings_transformed = pca.fit_transform(paragraph_embeddings_HF)

paragraph_embeddings_transformed.shape

(530, 15)

In [11]:
y, X = 0, 0

y = np.array(df1['Price'])

# one-hot-encoding categorical features
X1 = pd.get_dummies(df1[[ 'Total Baths', '# Rms',  'S/A',  'Area', 'Driveway Finish', 'Foundation', 'Type', 'Zoning', 'Sold Year', 'Sold Month']])

X2 = df1.drop(['Total Baths', '# Rms', 'S/A', 'Area', 'Driveway Finish', 'Foundation', 'Type', 'Zoning', 'Sold Year', 'Sold Month', 'Price'], axis = 1)
X2.fillna(-1)

# generating a big preprocessed dataset including information from the text
X = pd.concat([X1, X2], axis = 1)
X = X.reset_index()
del X['index']
X = pd.concat([X, pd.DataFrame(paragraph_embeddings_transformed)], axis=1)

X.shape

(530, 130)

In [12]:
from sklearn.linear_model import Ridge, LassoCV
import xgboost as xgb
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state = 42)
regr = 0

### Model
estimators = [
#     ('lr1', LassoCV()),
    ('lr2', RandomForestRegressor(n_estimators=7, random_state = 42, n_jobs=-1)),
    ('lr3', xgb.XGBRegressor(n_estimators=73, learning_rate=0.1, gamma=0, subsample=0.75, colsample_bytree=1, random_state = 42)),
#     ('lr4', DecisionTreeRegressor(random_state=0))
    ]
regr = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge()
)

regr.fit(X_train, y_train)
print('model fitted')

y_test_pred = np.round(regr.predict(X_test))
y_train_pred = np.round(regr.predict(X_train))



model fitted




In [15]:
# now metrics:

print('Results for test dataset:')
results_score(y_test, y_test_pred)

print('\n')
print('Results for train dataset (to check if we have overfitting or the results are comparable with those on test dataset):')
results_score(y_train, y_train_pred)

Results for test dataset:
23.6 % of predited values has <= 2 % accuracy.
32.1 % of predited values has <= 3 % accuracy.
48.1 % of predited values has <= 5 % accuracy.
72.6 % of predited values has <= 10 % accuracy.
91.5 % of predited values has <= 20 % accuracy.


Results for train dataset (to check if we have overfitting or the results are comparable with those on test dataset):
74.5 % of predited values has <= 2 % accuracy.
87.7 % of predited values has <= 3 % accuracy.
97.6 % of predited values has <= 5 % accuracy.
100.0 % of predited values has <= 10 % accuracy.
100.0 % of predited values has <= 20 % accuracy.


In [142]:
# Mean absolute percentage error

print('Mean absolute percentage error, results on test dataset:')
print(mean_absolute_percentage_error(y_test, y_test_pred))

print('\n')
print('Mean absolute percentage error, results on train dataset:')
mean_absolute_percentage_error(y_train, y_train_pred)

Mean absolute percentage error, results on test dataset:
7.92398007958533


Mean absolute percentage error, results on train dataset:


1.0998728494015029

In [143]:
# Median absolute percentage error

print('Median absolute percentage error, results on test dataset:')
print(median_absolute_percentage_error(y_test, y_test_pred))

print('\n')
print('Median absolute percentage error, results on train dataset:')
median_absolute_percentage_error(y_train, y_train_pred)

Median absolute percentage error, results on test dataset:
5.271095406360423


Median absolute percentage error, results on train dataset:


0.9801123595505618

# Insights 1: about paragraph embeddings using NLP techincs

What here we do: take all the paragraphs, train a machine-learned NLP model which is able to represent paragraphs in terms of (in this case) 50-dimensional vector which is in fact giving the context of the paragraph. The model is called Doc2Vec, it is being trained on our Public Remarks and it is expected that it is able to catch similarity between texts (in fact one of the main tools used to check similarity between texts).
One big advantage of Doc2Vec model is that it is not very sensitive to bad non-correct words, noisy and not cleaned data. 

Similarity scores presented above are not very representative as this will be used jointly with the similarity ranking based on features of properties.

#### Update from Dec 22 after adding predictions of property prices using ONLY their Public Remarks

As we see this works, we are able to make not bad predictions ONLY using the information from Public Remarks, without knowing number of bedrooms, etc.etc.etc. This means, that the embeddings generated via NLP are giving some information.
I've tried the same tactics using also previously developed fasttext embeddings from 3. NLP added to baseline solution file as well as playing with the Doc2Vec embeddings (this file) vector lenghts, training epochs, etc. 

#### Update from Feb 22 after adding HuggingFace predictions and training ML model from baseline solution taking also NLP 

HuggingFace embeddings are much more better than Doc2Vec ones, give much better results and algorithms are able to train on them. And, even gives better results (some few procents of better predictions).