# Feature Selection, Training, Query, Prediction

In the first section of the notebook, we test several regression types: linear, polynomial (degree 2), and linear with SGD. After these sequential trial-and-error tests, plain linear regression gives the best R2 score and MSE. Including more features improves the score; the test for the k-best features is kept here.
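
The kind of comparison described above can be sketched as follows. This is only an illustration on synthetic data from `make_regression`, not the notebook's original experiment; the helper name `compare_models`, the synthetic dataset, and the choice to scale features before SGD are all assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the rental data (assumption: not the real dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "polynomial (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    # SGD is sensitive to feature scale, so it is paired with a scaler here.
    "linear - SGD": make_pipeline(StandardScaler(), SGDRegressor(random_state=0)),
}


def compare_models(models, X, y):
    """Return {name: (mean R2, mean MSE)} from a shuffled 5-fold cross-validation."""
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv,
                                scoring=("r2", "neg_mean_squared_error"))
        results[name] = (scores["test_r2"].mean(),
                         -scores["test_neg_mean_squared_error"].mean())
    return results


results = compare_models(candidates, X_demo, y_demo)
for name, (r2, mse) in results.items():
    print(f"{name}: R2={r2:.3f}, MSE={mse:.1f}")
```

Using the same `KFold` object for every candidate keeps the folds identical, so the scores are directly comparable.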

The next section covers training the model and building the widgets: a query entry is designed and created so that it matches the format of a normal training/test `X` variable. The widgets were created following the `ipywidgets.widgets` documentation on readthedocs.io.

In [389]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import itertools
import random
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, SGDRegressor
import sklearn.metrics as sm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score,explained_variance_score
from sklearn.preprocessing import PolynomialFeatures
import ipywidgets as widgets



In [390]:
dataset = pd.read_csv("data_ready.csv")
data = dataset

Encodings

In [362]:
# import features from data_prep script
# type_encoding = {"Apartment":1, "HotelApartments":2,"Compound":3,"Villa":4,"Townhouse":5,"Penthouse":7,"Duplex":6}
location_encoding = {'Al Sakhama': 1, 'The Pearl': 2, 'West Bay': 3, 'Musheireb': 4, 'Old Airport Road': 5, 'Al Duhail': 6, 'Mesaimeer': 7, 'Umm Salal Mohammad': 8, 'Al Aziziyah': 9, 'Al Mansoura': 10, 'Al Sadd': 11, 'Al Beshairiya Street': 12, 'Muaither Area': 13, 'Al Ghanim': 14, 'Umm Salal Ali': 15, 'Umm Ghuwailina': 16, 'Al Gharrafa': 17, 'Salata': 18, 'Al Maamoura': 19, 'Najma': 20, 'Al Markhiya': 21, 'Jeliah': 22, 'West Bay Lagoon': 23, 'Ain Khaled': 24, 'Corniche Road': 25, 'Madinat Khalifa': 26, 'Doha Al Jadeed': 27, 'Al Dafna': 28, 'Fereej Bin Mahmoud': 29, 'Al Waab': 30, 'Al Muntazah': 31, 'Al Thumama': 32, 'Abu Hamour': 33, 'Al Nasr': 34, 'Salwa Road': 35, 'Izghawa': 36, 'Al Asmakh': 37, 'Fereej Abdul Aziz': 38, 'Al Rawda': 39, 'Fereej Bin Omran': 40, 'Al Jasra': 41, 'Al Jebailat': 42, 'Diplomats Area': 43, 'Msheireb Downtown Doha': 44, 'Rawdat Al Khail': 45, 'Al Hilal': 46, 'Onaiza': 47, 'Al Messila': 48, 'Al Nuaija': 49, 'Mughalina': 50, 'B-Ring Road': 51, 'Al Rayyan': 52, 'Al Mirqab': 53, 'Industrial Area': 54, 'Al Hitmi': 55, 'New Doha': 56, 'Airport Area': 57, 'AlMuraikh': 58, 'D-Ring': 59, 'Barwa City': 60, 'C-Ring': 61, 'Fereej Kulaib': 62, 'Rawdat Al Matar': 63, 'Al Soudan': 64, 'Business District': 65, 'Al Luqta': 66, 'Umm Al Seneem': 67, 'Fereej Al Ali': 68, 'Onaiza 65': 69, 'Al Asiri': 70, 'Hazm Al Markhiya': 71, 'Al Sailiya': 72}

Necessary lists for widget creation

In [392]:
# variables for frontend
locations = ['Al Sakhama', 'The Pearl', 'West Bay', 'Musheireb', 'Old Airport Road', 'Al Duhail', 'Mesaimeer', 'Umm Salal Mohammad', 'Al Aziziyah', 'Al Mansoura', 'Al Sadd', 'Al Beshairiya Street', 'Muaither Area', 'Al Ghanim', 'Umm Salal Ali', 'Umm Ghuwailina', 'Al Gharrafa', 'Salata', 'Al Maamoura', 'Najma', 'Al Markhiya', 'Jeliah', 'West Bay Lagoon', 'Ain Khaled', 'Corniche Road', 'Madinat Khalifa', 'Doha Al Jadeed', 'Al Dafna', 'Fereej Bin Mahmoud', 'Al Waab', 'Al Muntazah', 'Al Thumama', 'Abu Hamour', 'Al Nasr', 'Salwa Road', 'Izghawa', 'Al Asmakh', 'Fereej Abdul Aziz', 'Al Rawda', 'Fereej Bin Omran', 'Al Jasra', 'Al Jebailat', 'Diplomats Area', 'Msheireb Downtown Doha', 'Rawdat Al Khail', 'Al Hilal', 'Onaiza', 'Al Messila', 'Al Nuaija', 'Mughalina', 'B-Ring Road', 'Al Rayyan', 'Al Mirqab', 'Industrial Area', 'Al Hitmi', 'New Doha', 'Airport Area', 'AlMuraikh', 'D-Ring', 'Barwa City', 'C-Ring', 'Fereej Kulaib', 'Rawdat Al Matar', 'Al Soudan', 'Business District', 'Al Luqta', 'Umm Al Seneem', 'Fereej Al Ali', 'Onaiza 65', 'Al Asiri', 'Hazm Al Markhiya', 'Al Sailiya']
furnished_types = ['Furnished', 'Unfurnished', 'Partly furnished']
amenity_opts = ['Kitchen Appliances', 'Pets Allowed', 'Security', 'Balcony', 'Built in Wardrobes', 'Central A/C', 'Shared Gym', 'Shared Pool', 'Shared Spa', 'View of Water', 'Covered Parking', "Children's Play Area", 'Concierge', 'Walk-in Closet', 'Maid Service', 'Private Jacuzzi', 'View of Landmark', 'Study', "Children's Pool", 'Lobby in Building', 'Networked', 'Maids Room', 'Private Gym', 'Private Pool', 'Private Garden', 'Barbecue Area']
rent_types = ['Apartment', 'Compound', 'Duplex', 'HotelApartments', 'Penthouse', 'Townhouse','Villa']
bathrooms = [1,2,3,4,5,6,7,8,9]
bedrooms = [1,2,3,4,5,6,7,8,9,10]
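
The `locations` list appears to repeat the keys of `location_encoding` in the same order. One way to keep the two in sync is to derive the list from the dictionary. The sketch below uses a shortened copy of the encoding; `demo_encoding` and `demo_decoding` are hypothetical names introduced only for illustration.

```python
# Shortened copy of the location encoding, for illustration only.
demo_encoding = {'Al Sakhama': 1, 'The Pearl': 2, 'West Bay': 3, 'Musheireb': 4}

# Dictionaries preserve insertion order (Python 3.7+), so the keys can feed the dropdown directly.
demo_locations = list(demo_encoding)

# Reverse lookup, useful for turning an encoded value back into a label.
demo_decoding = {code: name for name, code in demo_encoding.items()}

print(demo_locations)
print(demo_decoding[2])
```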

In [393]:
data.info()
data.drop(columns = ['Unnamed: 0'],inplace = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6538 entries, 0 to 6537
Data columns (total 41 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Unnamed: 0            6538 non-null   int64
 1   Price                 6538 non-null   int64
 2   Area(sqm)             6538 non-null   int64
 3   NoBedrooms            6538 non-null   int64
 4   NoBathrooms           6538 non-null   int64
 5   Furnishing            6538 non-null   int64
 6   Location              6538 non-null   int64
 7   Total Amenities       6538 non-null   int64
 8   Apartment             6538 non-null   int64
 9   Compound              6538 non-null   int64
 10  Duplex                6538 non-null   int64
 11  HotelApartments       6538 non-null   int64
 12  Penthouse             6538 non-null   int64
 13  Townhouse             6538 non-null   int64
 14  Villa                 6538 non-null   int64
 15  Balcony               6538 non-null   int64
 16  Barbec

In [394]:
data.head(15)

Unnamed: 0,Price,Area(sqm),NoBedrooms,NoBathrooms,Furnishing,Location,Total Amenities,Apartment,Compound,Duplex,...,Private Jacuzzi,Private Pool,Security,Shared Gym,Shared Pool,Shared Spa,Study,View of Landmark,View of Water,Walk-in Closet
0,2500,36,1,1,3,1,3,1,0,0,...,0,0,1,0,0,0,0,0,0,0
1,8000,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
2,8100,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
3,6000,40,1,1,3,3,4,1,0,0,...,0,0,0,1,1,0,0,0,0,0
4,4200,60,1,1,3,4,4,0,0,0,...,0,0,1,1,0,1,0,0,0,0
5,7900,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
6,8000,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
7,6500,80,1,1,3,2,8,1,0,0,...,0,0,1,1,1,0,0,0,0,0
8,7900,60,1,1,3,2,11,1,0,0,...,0,0,1,1,1,0,0,0,1,0
9,5000,60,1,1,2,2,5,1,0,0,...,0,0,1,1,0,0,0,0,0,0


Here we run a 5-fold test for the k-best feature selection. After a number of tests, we determined that:

- All features should be included
- Scaling has no effect: the scores are the same with or without it
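
The second finding is what one would expect for ordinary least squares with an intercept: standardizing each feature is absorbed by the fitted coefficients, so the predictions do not change. A minimal check on synthetic data (an assumption, not the rental dataset) might look like this. Note that this does not carry over to SGD or regularized models, which do depend on feature scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and prices.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

# Fit on the raw features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Fit on standardized features (scaler fitted on the training split only).
scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

r2_raw = r2_score(y_test, raw_pred)
r2_scaled = r2_score(y_test, scaled_pred)
print(r2_raw, r2_scaled)
```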

In [395]:
y = data['Price']
X = data.drop(columns=['Price'])

y = y.to_numpy()
X = X.to_numpy()


def evaluate(X, y):
    """Print the mean R2 of a linear regression over a shuffled 5-fold split."""
    r2_scores = []
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Scaling was tried and left disabled, since it did not change the scores.
        # scaler = StandardScaler()
        # X_train = scaler.fit_transform(X_train)
        # X_test = scaler.transform(X_test)

        reg_model = LinearRegression().fit(X_train, y_train)
        y_test_pred = reg_model.predict(X_test)
        r2_scores.append(r2_score(y_test, y_test_pred))

    print(np.mean(r2_scores))


for k in [1, 5, 10, 15, 20, 25, 30, 35, 40]:
    estimator = LinearRegression()
    selector = RFE(estimator, n_features_to_select=k, step=1)
    selector = selector.fit(X, y)

    # support_ is a boolean mask: True marks the features RFE kept.
    XX = X[:, selector.support_]

    evaluate(XX, y)
    print(k)

0.1703939533171157
1
0.22862965086583625
5
0.5342259874163895
10
0.5894740818793351
15
0.5997460389401327
20
0.600504054544141
25
0.6018300333683124
30
0.6126559768742801
35
0.6984294478735669
40
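
In the loop above, `RFE` is fitted on the full dataset before the folds are split, so the held-out rows also influence which features are kept. A possible alternative, sketched here on synthetic data, is to place the selector and the regressor in a `Pipeline` so that the selection is refitted on each training fold. The step names `"select"` and `"regress"`, the values of `k`, and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 12 features, of which 6 actually drive the target.
X_demo, y_demo = make_regression(n_samples=300, n_features=12, n_informative=6,
                                 noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2_by_k = {}
for k in [1, 3, 6, 12]:
    pipe = Pipeline([
        ("select", RFE(LinearRegression(), n_features_to_select=k, step=1)),
        ("regress", LinearRegression()),
    ])
    # cross_val_score refits the whole pipeline, selection included, on each training fold.
    mean_r2_by_k[k] = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2").mean()

for k, r2 in mean_r2_by_k.items():
    print(k, round(r2, 3))
```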


Training the model. As established earlier, linear regression gave the best results, so it is the model used here.

In [398]:
y = data['Price']
X = data.drop(columns=['Price'])
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X)

# poly = PolynomialFeatures(degree=2)
# X = poly.fit_transform(X) 

# poly.fit(X, y) 

estimator = LinearRegression()  
estimator.fit(X, y)

LinearRegression()

In [399]:
# widgets
wide = {'description_width': 'initial'}  # let long descriptions take the width they need

types_w = widgets.Dropdown(options=rent_types, value='Apartment',
                           description="Please choose the type of establishment: ", style=wide)

nobeds = widgets.IntSlider(min=1, max=10, step=1, description="Number of Bedrooms: ", style=wide)

nobaths = widgets.IntSlider(min=1, max=9, step=1, description="Number of Bathrooms: ", style=wide)

area = widgets.IntText(description='Area of establishment, in sqm: ', style=wide)

furnishings = widgets.RadioButtons(options=furnished_types, description='Furnishment: ')

lct = widgets.Dropdown(options=locations, value='The Pearl', description='Location:')

am_select = widgets.SelectMultiple(options=amenity_opts, value=[], description="Select amenities",
                                   layout=widgets.Layout(width='350px', height='350px'), style=wide)





In [400]:
data

Unnamed: 0,Price,Area(sqm),NoBedrooms,NoBathrooms,Furnishing,Location,Total Amenities,Apartment,Compound,Duplex,...,Private Jacuzzi,Private Pool,Security,Shared Gym,Shared Pool,Shared Spa,Study,View of Landmark,View of Water,Walk-in Closet
0,2500,36,1,1,3,1,3,1,0,0,...,0,0,1,0,0,0,0,0,0,0
1,8000,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
2,8100,60,1,1,3,2,10,1,0,0,...,0,0,1,1,1,1,0,0,1,0
3,6000,40,1,1,3,3,4,1,0,0,...,0,0,0,1,1,0,0,0,0,0
4,4200,60,1,1,3,4,4,0,0,0,...,0,0,1,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6533,14000,313,9,8,2,32,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6534,13000,500,9,8,2,52,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6535,20000,600,10,8,2,66,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6536,14000,400,10,6,2,72,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The prediction is computed after converting the widget inputs into a feature row that matches the training columns.

In [401]:
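# creating the features entry
def pred():
    # One-hot encode the property type and flag each selected amenity.
    row = {t: int(t == types_w.value) for t in rent_types}
    row.update({am: int(am in am_select.value) for am in amenity_opts})

    # Furnishing codes: Unfurnished = 1, Partly furnished = 2, Furnished = 3.
    furn_codes = {"Unfurnished": 1, "Partly furnished": 2, "Furnished": 3}
    row.update({
        "Area(sqm)": area.value,
        "NoBedrooms": nobeds.value,
        "NoBathrooms": nobaths.value,
        "Furnishing": furn_codes[furnishings.value],
        "Location": location_encoding[lct.value],
        "Total Amenities": len(am_select.value),
    })

    # amenity_opts is not in the same order as the amenity columns in `data`,
    # so values are matched by column name rather than by position.
    # This assumes the type and amenity labels equal the column names in `data`.
    feature_cols = data.columns.drop("Price")
    x_query = pd.DataFrame([row])[feature_cols]
    y_pred = estimator.predict(x_query)
    return int(y_pred[0])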
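
Because the model was trained on named columns, a query row has to present its values in the same column order as the training `X`. The sketch below shows name-based alignment on a hypothetical, shortened column layout; `train_cols` and `build_query` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical, shortened training layout (the real X has many more columns).
train_cols = ["Area(sqm)", "NoBedrooms", "Balcony", "Security", "Study"]


def build_query(values, columns):
    """Return a one-row DataFrame ordered like `columns`; a missing name raises KeyError."""
    return pd.DataFrame([values])[list(columns)]


# Values supplied in an arbitrary order, as they might come from the widgets.
query = build_query(
    {"Security": 1, "Study": 0, "Balcony": 1, "NoBedrooms": 2, "Area(sqm)": 60},
    train_cols,
)
print(query)
```

Selecting by name fails loudly when a label and a column name disagree, which is usually preferable to silently feeding a value to the wrong coefficient.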

In [402]:
submit = widgets.Button(description = "Check Prices")
out = widgets.Output()
def sub(b):
    with out:
        print('Evaluation:', pred())
submit.on_click(sub)

### Use Ctrl + Click for multiple amenities

In [404]:
display(types_w, nobeds, nobaths, area, furnishings, lct, am_select,submit,out)

Dropdown(description='Please choose the type of establishment: ', options=('Apartment', 'Compound', 'Duplex', …

IntSlider(value=1, description='Number of Bedrooms: ', max=10, min=1, style=SliderStyle(description_width='ini…

IntSlider(value=1, description='Number of Bathrooms: ', max=9, min=1, style=SliderStyle(description_width='ini…

IntText(value=60, description='Area of establishment, in sqm: ', style=DescriptionStyle(description_width='ini…

RadioButtons(description='Furnishment: ', options=('Furnished', 'Unfurnished', 'Partly furnished'), value='Fur…

Dropdown(description='Location:', index=1, options=('Al Sakhama', 'The Pearl', 'West Bay', 'Musheireb', 'Old A…

SelectMultiple(description='Select amenities', index=(3, 4, 5, 6, 7), layout=Layout(height='350px', width='350…

Button(description='Check Prices', style=ButtonStyle())

Output(outputs=({'output_type': 'stream', 'text': 'Evaluation: 2919\nEvaluation: 4992\nEvaluation: 4165\nEvalu…