<a href="https://colab.research.google.com/github/shandrayu/mining-massive-databases/blob/main/notebooks/homework_pyspark_yuliia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining massive databases homework

## Colab setup


### Clone repo

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# !wget -q https://raw.githubusercontent.com/tsunrise/colab-github/main/colab_github.py
# import colab_github
# colab_github.github_auth(persistent_key=True)

In [3]:
!git clone git@github.com:shandrayu/mining-massive-databases.git

Cloning into 'mining-massive-databases'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


In [4]:
!ls && ls mining-massive-databases

drive	     spark-3.5.0-bin-hadoop3	  spark-3.5.0-bin-hadoop3.tgz.1  stopwords-en.txt
sample_data  spark-3.5.0-bin-hadoop3.tgz  spark-3.5.0-bin-hadoop3.tgz.2
ls: cannot access 'mining-massive-databases': No such file or directory


### PySpark

In [5]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [6]:
!wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

--2023-11-14 13:37:13--  https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.95.219, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400395283 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.0-bin-hadoop3.tgz.3’


2023-11-14 13:37:28 (26.5 MB/s) - ‘spark-3.5.0-bin-hadoop3.tgz.3’ saved [400395283/400395283]



In [7]:
!tar xzvf spark-3.5.0-bin-hadoop3.tgz > /dev/null


In [8]:
!pip install -q findspark

In [9]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [10]:
import findspark
findspark.init()

In [11]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, avg, when
import pandas as pd

In [12]:
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

## Barcelona dataset recommendation system

The goal of this task is to recommend similar apartments (items) based on input query (apartment description).

Tasks:

- Convert text feature with TF-IDF to vector of features
- Grid search for parameters
- Grid search for number of features
- Calculate ground truth
- Choose metrics. Explain choice

- Add to report:
  - Accuracy for 4 different set of parameters
  - Computation time for tuning (grid search) procedure
  - Machine characteristics

## Complete pipeline

1. Download and preprocess - spark
2. Convert text feature with TF-IDF to vector of features - spark
3. Convert to pardas dataframe - pandas
4. Caculate ground truth - make hashable (like dict?)
5. LSH - spotify
6. Metric - from skiti learn for ranking
7. Grid search skikit learn




In [13]:
## C

## Download and preprocess data

In [14]:
from pyspark import SparkFiles
from pyspark.sql.functions import substring
from pyspark.sql.functions import split
from pyspark.sql import functions as F
import pyspark.sql.types as T

# listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/data/listings.csv.gz" # full data
listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv" # short data

def load_file_to_spark(url):
  spark.sparkContext.addFile(url)
  filename = url.split("/")[-1]
  df = spark.read.csv("file://" + SparkFiles.get(filename), header=True, multiLine=True, escape='\"', inferSchema=True)
  return df

def bucket_rating(arr):
  """
  Preprocess rating into 6 buckets:
  - 5
  - 4.5
  - 4
  - 3.5
  - 3
  - 2
  """
  if arr and len(arr) >= 2:
    if isinstance(arr[1], str) and arr[1].startswith("★"):
      try:
        num = float(arr[1][1:])

        if num >= 4.8:
          arr[1] = "5"
        elif num < 4.8 and num >= 4.5:
          arr[1] = "4.5"
        elif num < 4.5 and num >= 4:
          arr[1] = "4"
        elif num < 4 and num >= 3.5:
          arr[1] = "3.5"
        elif num < 3.5 and num >= 3.0:
          arr[1] = "3"
        else:
          arr[1] = "2"
      except:
        # New listitng, no rating. OK for categorical classification
        arr[1] = "0"
  return arr

def preprocess_listing_database(df):
  df_preprocessed = df.withColumn("name_tokens", split("name", "\\ · "))
  df_preprocessed = df_preprocessed.withColumn("split_tokens", bucket_rating_udf("name_tokens"))
  df_preprocessed = df_preprocessed.withColumn("rating_bucket", col("split_tokens")[1].cast('float'))
  return df_preprocessed


bucket_rating_udf = F.udf(bucket_rating, T.ArrayType(T.StringType()))
listings_df = load_file_to_spark(listings_url)
listings_df = preprocess_listing_database(listings_df)
listings_df.select("name", "name_tokens", "split_tokens", "rating_bucket").show(15, False)

+--------------------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------------------+-------------+
|name                                                                      |name_tokens                                                             |split_tokens                                                          |rating_bucket|
+--------------------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------------------+-------------+
|Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths          |[Rental unit in Barcelona, ★4.30, 3 bedrooms, 6 beds, 2 baths]          |[Rental unit in Barcelona, 4, 3 bedrooms, 6 beds, 2 baths]            |4.0          |
|Rental unit in Sant Adria de Besos · ★4.77 · 3 bedrooms · 4

## Convert text feature with TF-IDF to vector of features

In [15]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.functions import col, concat, slice, size

def convert_to_tokens(df, input_column_name, additional_stop_words, stop_words_exceptions):
    tokenizer = Tokenizer(inputCol=input_column_name, outputCol="tokens")
    with_tokens = tokenizer.transform(df)

    stopwords_filename = "stopwords-en.txt"
    if not os. path. exists(stopwords_filename):
      !wget https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt


    stop_words = []
    with open(stopwords_filename) as file:
        for line in file:
          word = line.rstrip()
          if word not in stop_words_exceptions:
            stop_words.append(word)

    stop_words.extend(additional_stop_words)
    remover = StopWordsRemover(stopWords=stop_words)
    remover.setInputCol("tokens")
    remover.setOutputCol("clean_tokens")
    clean_tokens = remover.transform(with_tokens)

    return clean_tokens


def tf_idf(df, clean_tokens_column_name, num_features):
    # Perform TFIDF
    hashing_tf = HashingTF(inputCol=clean_tokens_column_name, outputCol="raw_features", numFeatures=num_features)
    featurized_data = hashing_tf.transform(df)

    idf = IDF(inputCol="raw_features", outputCol="vector_space")
    idf_model = idf.fit(featurized_data)
    results = idf_model.transform(featurized_data)

    return results

def convert_title_to_features(df, num_features):
    df_with_title = df.withColumn("title", col("split_tokens")[0])
    clean_tokens = convert_to_tokens(df=df_with_title, input_column_name="title", additional_stop_words=['·', '★', '1', 'in'], stop_words_exceptions={"home"})
    # Extend clean tokens with hand-crafted split tokens
    clean_tokens = clean_tokens.withColumn("clean_tokens", concat(col("clean_tokens"), slice(col("split_tokens"), 2, size(col("split_tokens")) - 1)))
    # TODO: why??? why "home" is in stop words???
    results = tf_idf(df=clean_tokens, clean_tokens_column_name="clean_tokens", num_features=num_features)
    return results

num_features = 15
df_with_features = convert_title_to_features(df=listings_df, num_features=num_features)
df_with_features.select("title", "name", "split_tokens", "clean_tokens", "raw_features", "vector_space").show(20, False)

+----------------------------------+--------------------------------------------------------------------------+----------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|title                             |name                                                                      |split_tokens                                                          |clean_tokens                                                        |raw_features                                         |vector_space                                                                                                                                                      |
+----------------------------------+--------------------------

## LSH

### Implement

In [16]:
from pyspark.ml.feature import BucketedRandomProjectionLSH

def run_lsh(df, input_column, output_column, bucket_length, num_hash_tables):
    """
    Parameter meaning:
     - bucket_length - the length of each hash bucket, a larger bucket lowers the false negative rate
     - num_hash_tables - number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance
    """
    brp = BucketedRandomProjectionLSH()
    brp.setInputCol(input_column)
    brp.setOutputCol(output_column)

    brp.setSeed(123456)
    brp.setBucketLength(bucket_length)
    brp.setNumHashTables(num_hash_tables)

    model = brp.fit(df)
    model.setOutputCol(output_column)

    return model

model = run_lsh(df=df_with_features, input_column="vector_space", output_column="hashes", bucket_length=1.0, num_hash_tables=1)

### Query

#### first

In [17]:
query = df_with_features.first()

print(f"Query name: {query.asDict()['name']}")
print(f"Query ID: {query.asDict()['id']}")

# Parameters
# number of neighbour points which are taken for the comparison
num_neighbours = 50

reccomendations = model.approxNearestNeighbors(df_with_features, query.vector_space, num_neighbours, distCol="distance")
reccomendations.select("id", "name", "clean_tokens", "distance").show(num_neighbours, False)

Query name: Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths
Query ID: 18674
+--------+----------------------------------------------------------------+-----------------------------------------------------------+--------+
|id      |name                                                            |clean_tokens                                               |distance|
+--------+----------------------------------------------------------------+-----------------------------------------------------------+--------+
|7663286 |Rental unit in Barcelona · ★4.64 · 3 bedrooms · 6 beds · 2 baths|[rental, unit, barcelona, 4.5, 3 bedrooms, 6 beds, 2 baths]|0.0     |
|1814378 |Rental unit in Barcelona · ★4.20 · 3 bedrooms · 6 beds · 2 baths|[rental, unit, barcelona, 4, 3 bedrooms, 6 beds, 2 baths]  |0.0     |
|112538  |Rental unit in Barcelona · ★4.38 · 3 bedrooms · 6 beds · 2 baths|[rental, unit, barcelona, 4, 3 bedrooms, 6 beds, 2 baths]  |0.0     |
|22475001|Rental unit in Barcelona · 

In [18]:
import numpy as np

reccomended_ids = np.array(reccomendations.select("id").collect())

#### least

In [19]:
rating_threshold = 3.1
query_3 = df_with_features.filter(df_with_features["rating_bucket"] < rating_threshold).head()
print(f"Query name: {query_3.asDict()['name']}")
print(f"Query ID: {query_3.asDict()['id']}")

reccomendations = model.approxNearestNeighbors(df_with_features, query_3.vector_space, num_neighbours, distCol="distance")
reccomendations.select("id", "name", "clean_tokens", "distance").show(num_neighbours, False)

Query name: Rental unit in Barcelona · ★2.50 · 1 bedroom · 3 beds · 1 bath
Query ID: 562020
+------------------+--------------------------------------------------------------------+---------------------------------------------------------+---------------------+
|id                |name                                                                |clean_tokens                                             |distance             |
+------------------+--------------------------------------------------------------------+---------------------------------------------------------+---------------------+
|562020            |Rental unit in Barcelona · ★2.50 · 1 bedroom · 3 beds · 1 bath      |[rental, unit, barcelona, 2, 1 bedroom, 3 beds, 1 bath]  |0.0                  |
|548085            |Rental unit in Barcelona · 1 bedroom · 3 beds · 1 bath              |[rental, unit, barcelona, 1 bedroom, 3 beds, 1 bath]     |0.0031564109601523436|
|1978145           |Rental unit in Barcelona · 1 bedroom ·

## Download data to Pandas

In [20]:
import numpy as np

df_pandas = df_with_features.select("id", "name", "rating_bucket", "raw_features", "vector_space").toPandas()
df_pandas["vector_space"] = df_pandas["vector_space"].apply(lambda x : np.array(x.toArray()))
df_pandas["vector_space"] = df_pandas["vector_space"].to_numpy()
df_pandas.head()

Unnamed: 0,id,name,rating_bucket,raw_features,vector_space
0,18674,Rental unit in Barcelona · ★4.30 · 3 bedrooms ...,4.0,"(0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.36706427154116883, 0.13..."
1,23197,Rental unit in Sant Adria de Besos · ★4.77 · 3...,4.5,"(0.0, 1.0, 0.0, 0.0, 2.0, 1.0, 1.0, 2.0, 0.0, ...","[0.0, 2.1353226355725248, 0.0, 0.0, 0.24470951..."
2,32711,Rental unit in Barcelona · ★4.46 · 2 bedrooms ...,4.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.1223547571803896, 0.131..."
3,171646,Rental unit in Barcelona · ★4.81 · 2 bedrooms ...,5.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.1223547571803896, 0.131..."
4,171816,Rental unit in Barcelona · ★4.71 · 2 bedrooms ...,4.5,"(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.1223547571803896, 0.131..."


## LHS Spotify (Annoy)


In [21]:
!pip install annoy




In [22]:
# Restart runtime if package is not found after installation
from annoy import AnnoyIndex

def build_tree(df, metric, num_features, num_trees):
    """
    Build index tree

    Note that it will allocate memory for max(i)+1 items, where i index
    Index corresprond to index in dataframe. Access the row d2.loc[index]
    """
    tree_index = AnnoyIndex(num_features, metric)

    for index, vector in zip(df_subset.index, df_subset["vector_space"]):

        tree_index.add_item(index, vector)

    return tree_index

# TODO: fix SettingWithCopyWarning
def get_ids_by_idx(df, idxs):
    return df.loc[idxs]["id"].to_numpy()

def calculate_neighbours(df, metric, num_features, num_trees):
    tree_index = build_tree(df_subset, metric=metrics[0], num_features=num_features, num_trees=num_trees)
    tree_index.build(num_trees)
    df["reccomendations"] = df.apply(lambda row : get_ids_by_idx(df, tree_index.get_nns_by_vector(row["vector_space"], num_neighbours)), axis = 1)
    return tree_index

sumsample_size = 1000
df_subset = df_pandas[:sumsample_size]

# Parameters
metrics = ["angular", "euclidean", "manhattan", "hamming", "dot"]
# num_features = 15
num_trees = 10
num_neighbours = 50

tree_idx = calculate_neighbours(df_subset, metrics[1], num_features, num_trees)

df_subset[["id", "name", "reccomendations"]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["reccomendations"] = df.apply(lambda row : get_ids_by_idx(df, tree_index.get_nns_by_vector(row["vector_space"], num_neighbours)), axis = 1)


Unnamed: 0,id,name,reccomendations
0,18674,Rental unit in Barcelona · ★4.30 · 3 bedrooms ...,"[18674, 112538, 117010, 138055, 162091, 385049..."
1,23197,Rental unit in Sant Adria de Besos · ★4.77 · 3...,"[23197, 275350, 1490933, 188267, 343866, 49699..."
2,32711,Rental unit in Barcelona · ★4.46 · 2 bedrooms ...,"[32711, 135513, 384155, 1061343, 1199594, 1242..."
3,171646,Rental unit in Barcelona · ★4.81 · 2 bedrooms ...,"[171646, 399772, 508212, 594235, 959089, 98891..."
4,171816,Rental unit in Barcelona · ★4.71 · 2 bedrooms ...,"[171816, 46153, 209673, 75003, 443222, 335861,..."
...,...,...,...
995,2685897,Rental unit in Barcelona · ★4.82 · 2 bedrooms ...,"[1330353, 1810923, 2685897, 537176, 598329, 74..."
996,2417803,Rental unit in Barcelona · ★4.86 · 3 bedrooms ...,"[290535, 757149, 877884, 776038, 1448933, 1868..."
997,2424905,Rental unit in Barcelona · ★4.37 · 1 bedroom ·...,"[97529, 273229, 320506, 343861, 360863, 569610..."
998,2689106,Home in Barcelona · ★4.67 · 4 bedrooms · 1 bed...,"[659469, 2689106, 97529, 273229, 320506, 34386..."


### Query something

In [23]:
query = df_subset.loc[3]
print(f"Query: \nid: {query['id']}, rating {query['rating_bucket']}: {query['name']}")
neighbour_idxs = tree_idx.get_nns_by_vector(query.vector_space, num_neighbours)

df_subset[["id", "rating_bucket", "name"]].loc[neighbour_idxs]

Query: 
id: 171646, rating 5.0: Rental unit in Barcelona · ★4.81 · 2 bedrooms · 2 beds · 1 bath


Unnamed: 0,id,rating_bucket,name
3,171646,5.0,Rental unit in Barcelona · ★4.81 · 2 bedrooms ...
77,399772,5.0,Rental unit in Barcelona · ★4.86 · 2 bedrooms ...
159,508212,5.0,Rental unit in Barcelona · ★4.87 · 2 bedrooms ...
200,594235,5.0,Rental unit in Barcelona · ★4.88 · 2 bedrooms ...
444,959089,5.0,Rental unit in Barcelona · ★4.97 · 2 bedrooms ...
448,988917,5.0,Rental unit in Barcelona · ★4.89 · 2 bedrooms ...
539,1211499,5.0,Rental unit in Barcelona · ★4.94 · 2 bedrooms ...
679,1397748,5.0,Rental unit in Barcelona · ★4.90 · 2 bedrooms ...
770,1744379,5.0,Rental unit in Barcelona · ★4.83 · 2 bedrooms ...
872,2152440,5.0,Rental unit in Barcelona · ★4.94 · 2 bedrooms ...


## Nearest neighbours ground truth (for all)


In [24]:
from sklearn.neighbors import KDTree

class GrountTruthCalculator:
    def __init__(self, vectors, ids, num_features, leaf_size=30, metric="euclidean"):
        self.kdt = KDTree(vectors, leaf_size, metric)
        self.ids = ids

    def get(self, query_id, query_vector, num_neighbours):
        gt_neighbours_idx = self.kdt.query(query_vector.reshape(1, -1), k=num_neighbours, return_distance=False)
        gt_real_ids = self.ids[gt_neighbours_idx.flatten()]
        gt_ids_without_query_id = np.delete(gt_real_ids, np.where(gt_real_ids == query_id))

        return gt_ids_without_query_id

print(KDTree.valid_metrics)

df = df_subset
vectors = np.stack(df_subset["vector_space"], axis=0)
ids = np.stack(df_subset["id"], axis=0)
gt_calculator = GrountTruthCalculator(vectors=vectors, ids=ids, num_features=50)
# TODO: fix warning
df["gt_ids"] = df.apply(lambda row : gt_calculator.get(row["id"], row["vector_space"], num_neighbours), axis = 1)


query = df_subset.loc[3]
print(f"Query: \nid: {query['id']}, rating {query['rating_bucket']}: {query['name']}")
gt_neighbour_ids = query["gt_ids"]

df.loc[df["id"].isin(gt_neighbour_ids)][["id", "rating_bucket", "name"]]

['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']
Query: 
id: 171646, rating 5.0: Rental unit in Barcelona · ★4.81 · 2 bedrooms · 2 beds · 1 bath


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["gt_ids"] = df.apply(lambda row : gt_calculator.get(row["id"], row["vector_space"], num_neighbours), axis = 1)


Unnamed: 0,id,rating_bucket,name
28,223096,5.0,Loft in Barcelona · ★4.82 · 2 bedrooms · 2 bed...
38,241614,5.0,Rental unit in Barcelona · ★4.82 · 2 bedrooms ...
77,399772,5.0,Rental unit in Barcelona · ★4.86 · 2 bedrooms ...
83,283231,5.0,Rental unit in Barcelona · ★4.93 · 1 bedroom ·...
87,424294,5.0,Rental unit in Barcelona · ★5.0 · 1 bedroom · ...
151,502436,5.0,Rental unit in Barcelona · ★5.0 · 2 bedrooms ·...
153,503765,5.0,Rental unit in Barcelona · ★4.92 · 2 bedrooms ...
159,508212,5.0,Rental unit in Barcelona · ★4.87 · 2 bedrooms ...
197,590503,5.0,Rental unit in Barcelona · ★4.87 · 2 bedrooms ...
200,594235,5.0,Rental unit in Barcelona · ★4.88 · 2 bedrooms ...


## Metrics for dataset

In [25]:
df["metric_detected"] = df.apply(lambda row : np.sum(np.isin(row.reccomendations, row.gt_ids)) / num_neighbours, axis = 1)
df["metric_missed"] = 1 - df["metric_detected"]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["metric_detected"] = df.apply(lambda row : np.sum(np.isin(row.reccomendations, row.gt_ids)) / num_neighbours, axis = 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["metric_missed"] = 1 - df["metric_detected"]


In [26]:
threshold = 0.7
good_reccomendations = df.loc[df["metric_detected"] > threshold][["id", "name", "reccomendations", "metric_detected"]]
good_reccomendations

Unnamed: 0,id,name,reccomendations,metric_detected
1,23197,Rental unit in Sant Adria de Besos · ★4.77 · 3...,"[23197, 275350, 1490933, 188267, 343866, 49699...",0.88
4,171816,Rental unit in Barcelona · ★4.71 · 2 bedrooms ...,"[171816, 46153, 209673, 75003, 443222, 335861,...",0.98
5,34981,Rental unit in Barcelona · ★4.53 · 4 bedrooms ...,"[34981, 931437, 1613942, 644534, 876101, 19895...",0.74
6,36763,Rental unit in Barcelona · ★4.79 · 1 bedroom ·...,"[36763, 214970, 83283, 95719, 256677, 136670, ...",0.92
7,174382,Rental unit in Barcelona · ★4.39 · 3 bedrooms ...,"[174382, 615381, 2355423, 275346, 2376967, 965...",0.98
...,...,...,...,...
995,2685897,Rental unit in Barcelona · ★4.82 · 2 bedrooms ...,"[1330353, 1810923, 2685897, 537176, 598329, 74...",0.96
996,2417803,Rental unit in Barcelona · ★4.86 · 3 bedrooms ...,"[290535, 757149, 877884, 776038, 1448933, 1868...",0.76
997,2424905,Rental unit in Barcelona · ★4.37 · 1 bedroom ·...,"[97529, 273229, 320506, 343861, 360863, 569610...",0.98
998,2689106,Home in Barcelona · ★4.67 · 4 bedrooms · 1 bed...,"[659469, 2689106, 97529, 273229, 320506, 34386...",0.98


In [27]:
accuracy = good_reccomendations.shape[0] /sumsample_size
accuracy

0.817

## Grid search for parameters

## Nearest neighbours ground truth

In [28]:
from sklearn.neighbors import KDTree

def convert_spark_df_to_pandas(spark_df):
    pandas_df = spark_df.select("id", "name", "vector_space").toPandas()
    vectors = pandas_df["vector_space"].apply(lambda x : np.array(x.toArray()))
    vectors = vectors.to_numpy()
    vectors = np.stack(vectors, axis=0)
    return pandas_df, vectors

def get_gt(query: np.ndarray, query_id: int, num_neighbours: int, real_ids: np.ndarray, vectors_pandas: np.ndarray):
    kdt = KDTree(vectors_pandas, leaf_size=30, metric="euclidean")
    gt_neighbours_idx = kdt.query(query, k=num_neighbours, return_distance=False)
    gt_real_ids = real_ids[gt_neighbours_idx.flatten()]
    gt_ids_without_query_id = np.delete(gt_real_ids, np.where(gt_real_ids == query_id))

    return gt_ids_without_query_id

pandas_df, vectors_pandas = convert_spark_df_to_pandas(df_with_features)
ID = 23197
query = df_with_features.where(df_with_features.id == ID).first()
query_pandas = np.array(query.asDict()["vector_space"].toArray()).reshape(1, -1)
query_id = query.asDict()['id']

num_neighbours= 6
gt_ids = get_gt(query_pandas, query_id, num_neighbours, pandas_df["id"].to_numpy(), vectors_pandas)
print(gt_ids)


[          18918165 967279668909991276           32300579
           31910915 811538092074095964]


In [29]:
from pyspark.sql.functions import udf, array

def calculate_gt(df, num_neighbours):
    def get_gt_one_row(query_id, query_vector):
        """
        WARNING: function uses global variables.
        TODO: Is there a way to pass pandas dataframe to UDF?
        """
        query_pandas = np.array(query_vector.toArray()).reshape(1, -1)
        gt_ids = get_gt(query_pandas, query_id, num_neighbours, pandas_df["id"].to_numpy(), vectors_pandas)
        return [int(n) for n in gt_ids]

    pandas_df, vectors_pandas = convert_spark_df_to_pandas(df_with_features)
    get_gt_one_row_udf = F.udf(get_gt_one_row, T.ArrayType(T.LongType()))

    df_with_gt = df.withColumn("ground_truth", get_gt_one_row_udf("id", "vector_space"))
    return df_with_gt

num_neighbours = 6
newgt = calculate_gt(df=df_with_features, num_neighbours=num_neighbours)
newgt.select("id", "name", "ground_truth").show(10, False)

+------+--------------------------------------------------------------------------+----------------------------------------------------------------------+
|id    |name                                                                      |ground_truth                                                          |
+------+--------------------------------------------------------------------------+----------------------------------------------------------------------+
|18674 |Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths          |[162091, 385049, 117010, 112538, 138055]                              |
|23197 |Rental unit in Sant Adria de Besos · ★4.77 · 3 bedrooms · 4 beds · 2 baths|[18918165, 967279668909991276, 32300579, 31910915, 811538092074095964]|
|32711 |Rental unit in Barcelona · ★4.46 · 2 bedrooms · 3 beds · 1.5 baths        |[1242470, 1199594, 384155, 135513, 1061343]                           |
|171646|Rental unit in Barcelona · ★4.81 · 2 bedrooms · 2 beds · 1 bat

In [30]:
newgt.where(df_with_features.id == ID).select("id", "name", "ground_truth").show(1, False)

+-----+--------------------------------------------------------------------------+----------------------------------------------------------------------+
|id   |name                                                                      |ground_truth                                                          |
+-----+--------------------------------------------------------------------------+----------------------------------------------------------------------+
|23197|Rental unit in Sant Adria de Besos · ★4.77 · 3 bedrooms · 4 beds · 2 baths|[18918165, 967279668909991276, 32300579, 31910915, 811538092074095964]|
+-----+--------------------------------------------------------------------------+----------------------------------------------------------------------+



In [31]:
IDs = list(gt_ids) #[967279668909991276, 18918165]
newgt.where(df_with_features.id.isin(IDs)).select("id", "name", "ground_truth").show(len(IDs), False)

+------------------+-----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+
|id                |name                                                                                     |ground_truth                                                                 |
+------------------+-----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+
|18918165          |Rental unit in Sant Adria de Besos · ★4.49 · 3 bedrooms · 6 beds · 2 baths               |[32300579, 811538092074095964, 899131976222097946, 30944649, 23197]          |
|31910915          |Bed and breakfast in Santa Coloma de Gramenet · ★4.94 · 1 bedroom · 1 bed · 1 shared bath|[967279668909991276, 631760606413807974, 23197, 50804944, 641020247716930473]|
|32300579          |Hostel in Barcelona · 1 bedroom · 1

In [32]:
real_ids = pandas_df["id"].to_numpy()
print(real_ids[-10:])

[971845785379403969 971878671238422092 971879970019192691
 972199746143686425 972213525609560702 972414398498867957
 972517151783308573 972525401820905576 972527438522389761
 972529324055119972]


## Grid search skikit learn

In [33]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np


param_grid = {
    # number of features for TF-IDF
    "num_features": [20, 30],#[5, 10, 20, 30, 50],
    # the length of each hash bucket, a larger bucket lowers the false negative rate.
    "bucket_length": [2.0],#[1.0, 2.0, 3.0],
    # number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.
    "num_hash_tables": [1],#[1, 3, 7, 10],
    "num_neighbours": [5, 15, 50, 100]
}

lsh_estimator = LshEstimator()
grid_search = GridSearchCV(lsh_estimator, param_grid)
best_parameters = grid_search.fit(listings_df)
# score must be somewhere here????
# score = grid_search.score()

print('Best parameter:', best_parameters.best_params_)
print('Best score:', best_parameters.best_score_)


NameError: ignored

## Grid serach for LSH parameters

In [None]:
# from sklearn.base import BaseEstimator
# from sklearn.utils.estimator_checks import check_estimator
# from pyspark import keyword_only
# from pyspark.ml.evaluation import Evaluator
# from pyspark.ml import Estimator
# from pyspark.sql.functions import lit
# from pyspark.ml.param import Param



# class LshEstimator(Estimator, Evaluator):
#     def __init__(self, num_neighbours=5, num_features=10, bucket_length=1.0, num_hash_tables=1):
#         super().__init__()
#         self.num_neighbours = num_neighbours
#         self.num_features = num_features
#         self.bucket_length = bucket_length
#         self.num_hash_tables = num_hash_tables

#         self.num_neighbours_param = Param(self, "num_neighbours", "num_neighbours", 5)
#         self.num_features_param = Param(self, "num_features", "num_features", 10)
#         self.bucket_length_param = Param(self, "bucket_length", "bucket_length", 1.0)
#         self.num_hash_tables_param = Param(self, "num_hash_tables", "num_nnum_hash_tableseighbours", 1)

#     def get_num_neighbours(self):
#         return self.getOrDefault(self.num_neighbours)

#     # Required in Spark >= 3.0
#     def setInputCol(self, value):
#         """
#         Sets the value of :py:attr:`inputCol`.
#         """
#         return self._set(inputCol=value)

#     # Required in Spark >= 3.0
#     def setPredictionCol(self, value):
#         """
#         Sets the value of :py:attr:`predictionCol`.
#         """
#         return self._set(predictionCol=value)

#     @keyword_only
#     def setParams(self, inputCol=None, predictionCol=None, centeredThreshold=1.0):
#         kwargs = self._input_kwargs
#         return self._set(**kwargs)

#     def fit(self, X, y=None):
#         self.df = convert_title_to_features(df=X, num_features=self.num_features)
#         self.model = run_lsh(df=self.df, input_column="vector_space", output_column="hashes", bucket_length=self.bucket_length, num_hash_tables=self.num_hash_tables)
#         self.df = calculate_gt(df=self.df, num_neighbours=self.num_neighbours)

#     def _fit(self, dataset):
#         input_column_name = "vector_space"#self.getInputCol()

#         self.df = convert_title_to_features(df=dataset, num_features=self.num_features)
#         # TODO: move somewhere, this is overdoing. The only dependency is a number of neighbours for GT!!
#         # If it would not be a parameter, then it would be possible to calculate GT outside fit function
#         # Also, it looks horrible - we are creating the hidden state that shall be transferred between the function calls
#         # TODO: create a separate loop for num_neighbours parameter
#         # it will be even cheaper to run it 3 times in the begginning and reference only precalculated columns....
#         self.df = calculate_gt(df=self.df, num_neighbours=self.num_neighbours)
#         self.model = run_lsh(df=self.df, input_column=input_column_name, output_column="hashes", bucket_length=self.bucket_length, num_hash_tables=self.num_hash_tables)
#         return model

#     def calculate_gt(df, num_neighbours):
#         def get_gt_one_row(query_id, query_vector):
#             """
#             WARNING: function uses global variables.
#             TODO: Is there a way to pass pandas dataframe to UDF?
#             """
#             query_pandas = np.array(query_vector.toArray()).reshape(1, -1)
#             gt_ids = get_gt(query_pandas, query_id, num_neighbours, pandas_df["id"].to_numpy(), vectors_pandas)
#             return [int(n) for n in gt_ids]

#         pandas_df, vectors_pandas = convert_spark_df_to_pandas(results)
#         get_gt_one_row_udf = F.udf(get_gt_one_row, T.ArrayType(T.LongType()))

#         df_with_gt = df.withColumn("ground_truth", get_gt_one_row_udf("id", "vector_space"))
#         return df_with_gt

#     def _evaluate(self, dataset) -> float:
#         def evaluate_one_row(reccomendations, ground_truth):
#             recommendations = np.array(recommendations.toArray().tolist())
#             ground_truth = np.array(ground_truth.toArray().tolist())
#             return np.sum(np.isin(reccomendations, ground_truth), axis=0)

#         def get_reccomendations_one_row(query_id, query_vector):
#             reccomendations = model.approxNearestNeighbors(dataset, query_vector, num_neighbours, distCol="distance")
#             reccomended_ids = np.array(reccomendations.select("id").collect())
#             query_id = query.asDict()['id']
#             reccomended_ids = np.delete(reccomended_ids, np.where(reccomended_ids == query_id))
#             return reccomended_ids

#         # Define UDFs
#         get_reccomendations_one_row_udf = F.udf(evaluate_one_row, T.ArrayType(T.LongType()))
#         evaluate_one_row_udf = udf(evaluate_one_row, T.IntegerType())

#         # Calculate score
#         model = self.model
#         dataset.withColumn("reccomendations", get_reccomendations_one_row("id", "vector_space"))
#         dataset.withColumn("score", evaluate_one_row_udf("reccomendations", "ground_truth"))
#         average_score = dataset.select(avg("score")).collect()[0][0]
#         return average_score

#     def predict(self, X):
#         # TODO: will the code work if I pass several queries at a time? - no. it tries to create join! with itself and cannot unlock
#         query = X
#         reccomendations = self.model.approxNearestNeighbors(self.df, query.vector_space, num_neighbours, distCol="distance")
#         reccomended_ids = np.array(reccomendations.select("id").collect())
#         # query_id = query.asDict()['id']
#         # reccomended_ids = np.delete(reccomended_ids, np.where(reccomended_ids == query_id))
#         output_col_name = self.getOutputCol()
#         self.df.withColumn(output_col_name, model.approxNearestNeighbors(self.df, query.vector_space, num_neighbours, distCol="distance").select())
#         # return reccomended_ids

#     def score(self, X, y=None):
#         # ground_truth_k_neighbours = y[:, :self.num_neighbours] # check dimentions
#         reccomensations = self.predict(self, X)
#         score = np.sum(np.isin(reccomensations, self.df_with_ground_truth["ground_truth"]), axis=0)
#         return score

    # def predict_score(self, X, y=None):
    #     self.df.withColumn("score", )

class LshEstimator():
    def __init__(self, num_neighbours=5, num_features=10, bucket_length=1.0, num_hash_tables=1):
        self.num_neighbours = num_neighbours
        self.num_features = num_features
        self.bucket_length = bucket_length
        self.num_hash_tables = num_hash_tables


    def fit(self, X, y=None):
        self.df = convert_title_to_features(df=X, num_features=self.num_features)
        self.model = run_lsh(df=self.df, input_column="vector_space", output_column="hashes", bucket_length=self.bucket_length, num_hash_tables=self.num_hash_tables)
        self.df = calculate_gt(df=self.df, num_neighbours=self.num_neighbours)

    def predict(self, X, y=None):
        # TODO: will the code work if I pass several queries at a time?
        query = X
        reccomendations = model.approxNearestNeighbors(self.df, query.vector_space, num_neighbours, distCol="distance")
        # reccomended_ids = np.array(reccomendations.select("id").collect())
        # query_id = query.asDict()['id']
        # reccomended_ids = np.delete(reccomended_ids, np.where(reccomended_ids == query_id))

        return reccomendations #sreccomended_ids

    def score(self, X, y=None):
        # ground_truth_k_neighbours = y[:, :self.num_neighbours] # check dimentions
        reccomensations = self.predict(self, X)
        score = np.sum(np.isin(reccomensations, self.df_with_ground_truth["ground_truth"]), axis=0)
        return score

lsh = LshEstimator()
lsh.fit(listings_df)
# display(lsh._fit(listings_df))
r = lsh.predict(query)
r.
# display(lsh._evaluate(listings_df))

In [None]:
r.select("id").show(15, False)

In [None]:
from pyspark.ml.evaluation import RankingEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import Evaluator


train, test = listings_df.randomSplit([0.8, 0.2], seed=42)

lsh = Lsh()

param_grid = ParamGridBuilder().addGrid(lsh.num_features_param, [20, 30]) \
.addGrid(lsh.bucket_length_param, [2.0]) \
.addGrid(lsh.num_hash_tables_param, [1]) \
.addGrid(lsh.num_neighbours_param, [5, 15, 50, 100]) \
.build()

evaluator = Lsh()

crossval = CrossValidator(estimator=lsh,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3)

cvModel = crossval.fit(train)



In [None]:

from sklearn.utils.estimator_checks import check_estimator
# from sklearn.utils
check_estimator(LshEstimator())



## Grid search for number of features

Todo:
- check other options of grid search for pyspark, skikit learn seems incompatible
Options: pyspark.ml.tuning