<a href="https://colab.research.google.com/github/shandrayu/mining-massive-databases/blob/main/notebooks/homework_pyspark_yuliia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining massive databases homework

## Colab setup


### Clone repo

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# !wget -q https://raw.githubusercontent.com/tsunrise/colab-github/main/colab_github.py
# import colab_github
# colab_github.github_auth(persistent_key=True)

In [3]:
!git clone git@github.com:shandrayu/mining-massive-databases.git

Cloning into 'mining-massive-databases'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


In [4]:
!ls && ls mining-massive-databases

drive  sample_data
ls: cannot access 'mining-massive-databases': No such file or directory


### PySpark

In [5]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [6]:
!wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

--2023-11-03 12:51:24--  https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.95.219, 2a01:4f8:10a:201a::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400395283 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.0-bin-hadoop3.tgz’


2023-11-03 12:51:38 (26.7 MB/s) - ‘spark-3.5.0-bin-hadoop3.tgz’ saved [400395283/400395283]



In [7]:
!tar xzvf spark-3.5.0-bin-hadoop3.tgz > /dev/null


In [8]:
!pip install -q findspark

In [9]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [10]:
import findspark
findspark.init()

In [11]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, avg, when
import pandas as pd

In [12]:
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

## Barcelona dataset recommendation system

The goal of this task is to recommend similar apartments (items) based on an input query (an apartment description).

Tasks:

- Convert the text feature to a feature vector with TF-IDF
- Grid search over the LSH parameters
- Grid search over the number of features
- Calculate the ground truth
- Choose metrics and explain the choice

- Add to report:
  - Accuracy for 4 different sets of parameters
  - Computation time for tuning (grid search) procedure
  - Machine characteristics
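
One candidate for the "choose metrics" task is recall@k: the fraction of the exact k nearest neighbours that the approximate search also returns. The sketch below assumes neighbour ids are already collected into plain Python lists, ordered by distance. The function name `recall_at_k` and the example ids are hypothetical and do not come from the dataset.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the k exact nearest neighbours that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


# Hypothetical ids, for illustration only
exact = [101, 102, 103, 104]
approx = [101, 103, 999, 104]
score = recall_at_k(approx, exact, k=4)  # 3 of the 4 exact neighbours were found
```

Recall@k ignores the order inside the top-k, which may be acceptable here because many listings share identical token sets and therefore tie at the same distance.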

## Download and preprocess data

In [13]:
from pyspark import SparkFiles
from pyspark.sql.functions import substring
from pyspark.sql.functions import split
from pyspark.sql import functions as F
import pyspark.sql.types as T

# listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/data/listings.csv.gz" # full data
listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv" # short data

def load_file_to_spark(url):
  spark.sparkContext.addFile(url)
  filename = url.split("/")[-1]
  df = spark.read.csv("file://" + SparkFiles.get(filename), header=True, multiLine=True, escape='\"', inferSchema=True)
  return df

def bucket_rating(arr):
  """
  Replace the rating token (second element, e.g. "★4.30") with one of 6 buckets:
  - 5    (rating >= 4.8)
  - 4.5  (4.5 <= rating < 4.8)
  - 4    (4.0 <= rating < 4.5)
  - 3.5  (3.5 <= rating < 4.0)
  - 3    (3.0 <= rating < 3.5)
  - 2    (rating < 3.0)
  """
  if arr and len(arr) >= 2:
    if isinstance(arr[1], str) and arr[1].startswith("★"):
      try:
        num = float(arr[1][1:])

        # Branches are checked from the highest threshold down,
        # so each one only needs its lower bound.
        if num >= 4.8:
          arr[1] = "5"
        elif num >= 4.5:
          arr[1] = "4.5"
        elif num >= 4:
          arr[1] = "4"
        elif num >= 3.5:
          arr[1] = "3.5"
        elif num >= 3.0:
          arr[1] = "3"
        else:
          arr[1] = "2"
      except ValueError:
        # Remove the token: most probably some kind of parsing error
        arr.pop(1)
  return arr

def preprocess_listing_names(df):
  df_preprocessed = df.withColumn("name_tokens", split("name", "\\ · "))
  df_preprocessed = df_preprocessed.withColumn("clean_tokens", bucket_rating_udf("name_tokens"))
  return df_preprocessed

bucket_rating_udf = F.udf(bucket_rating, T.ArrayType(T.StringType()))

listings_df = load_file_to_spark(listings_url)
print(f"Dataset size {listings_df.count()}, number of columns {len(listings_df.columns)}")

listings_df = preprocess_listing_names(listings_df)
listings_df.select("name", "price", "name_tokens", "clean_tokens").show(15, False)

Dataset size 18086, number of columns 18
+--------------------------------------------------------------------------+-----+------------------------------------------------------------------------+----------------------------------------------------------------------+
|name                                                                      |price|name_tokens                                                             |clean_tokens                                                          |
+--------------------------------------------------------------------------+-----+------------------------------------------------------------------------+----------------------------------------------------------------------+
|Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths          |202  |[Rental unit in Barcelona, ★4.30, 3 bedrooms, 6 beds, 2 baths]          |[Rental unit in Barcelona, 4, 3 bedrooms, 6 beds, 2 baths]            |
|Rental unit in Sant Adria de Besos · ★4.77 · 3 bed

In [14]:
listings_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: integer (nullable = true)
 |-- last_review: date (nullable = true)
 |-- reviews_per_month: double (nullable = true)
 |-- calculated_host_listings_count: integer (nullable = true)
 |-- availability_365: integer (nullable = true)
 |-- number_of_reviews_ltm: integer (nullable = true)
 |-- license: string (nullable = true)
 |-- name_tokens: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- clean_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
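
The rating thresholds in `bucket_rating` can be checked outside Spark. The sketch below restates the same cut-offs as a lookup table. `RATING_BUCKETS`, `rating_to_bucket`, and the example tokens other than `★4.30` are hypothetical. Unlike the UDF, which drops an unparseable rating token from the array, this helper returns `None` for anything that is not a parseable rating.

```python
# Thresholds mirror the cut-offs in bucket_rating: (lower bound, bucket label).
RATING_BUCKETS = [(4.8, "5"), (4.5, "4.5"), (4.0, "4"), (3.5, "3.5"), (3.0, "3")]


def rating_to_bucket(token):
    """Map a token such as '★4.30' to its bucket label; return None if it is not a rating."""
    if not isinstance(token, str) or not token.startswith("★"):
        return None
    try:
        rating = float(token[1:])
    except ValueError:
        return None
    for lower_bound, label in RATING_BUCKETS:
        if rating >= lower_bound:
            return label
    return "2"  # everything below 3.0 shares one bucket


# Illustrative tokens; only "★4.30" appears in the output shown above
tokens = ["★4.30", "★4.77", "★4.95", "★2.10", "★abc", "3 bedrooms"]
examples = {t: rating_to_bucket(t) for t in tokens}
```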



## Convert text feature with TF-IDF to vector of features

In [15]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StopWordsRemover

def tf_idf(df, input_column_name, num_features, additional_stop_words, tokenize=True, remove_stop_words=True):
    if tokenize:
      tokenizer = Tokenizer(inputCol=input_column_name, outputCol="tokens")
      with_tokens = tokenizer.transform(df)
    else:
      with_tokens = df

    if remove_stop_words:
      stopwords_filename = "stopwords-en.txt"
      if not os.path.exists(stopwords_filename):
        !wget https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt


      stop_words = []
      with open(stopwords_filename) as file:
          for line in file:
            stop_words.append(line.rstrip())

      stop_words.extend(additional_stop_words)
      remover = StopWordsRemover(stopWords=stop_words)
      remover.setInputCol("tokens")
      remover.setOutputCol("clean_tokens")
      clean_tokens = remover.transform(with_tokens)
    else:
      clean_tokens = with_tokens

    # Perform TFIDF
    hashing_tf = HashingTF(inputCol="clean_tokens", outputCol="raw_features", numFeatures=num_features)
    featurized_data = hashing_tf.transform(clean_tokens)

    idf = IDF(inputCol="raw_features", outputCol="vector_space")
    idf_model = idf.fit(featurized_data)
    results = idf_model.transform(featurized_data)

    return results

results = tf_idf(listings_df, "clean_tokens", 50, ['·', '★', '1', 'in', '<br', '/>', '<b>', '</b>'], tokenize=False, remove_stop_words=False)
results.select("name", "price", "name_tokens", "clean_tokens", "raw_features", "vector_space").show(20, False)

+--------------------------------------------------------------------------+-----+------------------------------------------------------------------------+----------------------------------------------------------------------+-------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|name                                                                      |price|name_tokens                                                             |clean_tokens                                                          |raw_features                               |vector_space                                                                                                         |
+--------------------------------------------------------------------------+-----+------------------------------------------------------------------------+-------------------------------------------------------------------
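
To make the `raw_features` and `vector_space` columns easier to read, here is a pure-Python sketch of the two steps: hashing term frequencies into a fixed number of slots, then scaling by the smoothed inverse document frequency log((m + 1) / (df + 1)) that Spark's `IDF` documents. Two assumptions: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, so the indices will not match Spark's, and the second example document is made up.

```python
import math
import zlib


def hashing_tf(tokens, num_features):
    """Count tokens into a fixed-size vector by hashing each token to an index."""
    vec = [0.0] * num_features
    for tok in tokens:
        # Stand-in hash (assumption): Spark's HashingTF uses MurmurHash3 instead.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def idf_weights(tf_vectors):
    """Smoothed IDF per index: log((m + 1) / (df + 1)), with m the number of documents."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n)]
    return [math.log((m + 1) / (d + 1)) for d in df]


docs = [
    ["Rental unit in Barcelona", "4", "3 bedrooms", "6 beds", "2 baths"],
    ["Rental unit in Barcelona", "4.5", "1 bedroom", "1 bed", "1 bath"],  # made-up listing
]
tf = [hashing_tf(d, num_features=50) for d in docs]
idf = idf_weights(tf)
tfidf = [[t * w for t, w in zip(v, idf)] for v in tf]
```

A token that occurs in every document gets an IDF of log(1) = 0 and so contributes nothing to the distance between listings. With only 50 slots, unrelated tokens can also collide in the same index, which is one reason to grid-search the number of features.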

### LSH

In [16]:
from pyspark.ml.feature import BucketedRandomProjectionLSH

brp = BucketedRandomProjectionLSH()
brp.setInputCol("vector_space")
brp.setOutputCol("hashes")

# Parameters
# Length of each hash bucket; a larger bucket lowers the false negative rate.
bucket_length = 1.0
# Number of hash tables; more tables lower the false negative rate, fewer tables improve running performance.
num_hash_tables = 1

brp.setSeed(12345)
brp.setBucketLength(bucket_length)
brp.setNumHashTables(num_hash_tables)

model = brp.fit(results)
model.setOutputCol("hashes")

BucketedRandomProjectionLSHModel: uid=BucketedRandomProjectionLSH_9bb88ca24558, numHashTables=1
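
For intuition about what `bucket_length` controls: Spark's documentation gives the hash of a vector x under bucketed random projection as floor(x · v / r), where v is a random unit vector and r is the bucket length. The pure-Python sketch below implements that formula only. `make_projection`, `brp_hash`, and the example vectors are hypothetical, and Spark generates its random vectors differently, so the bucket ids will not match the fitted model.

```python
import math
import random


def make_projection(dim, seed):
    """Draw a random direction and normalise it to unit length."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def brp_hash(x, projection, bucket_length):
    """Project x onto the random direction and return the index of the bucket it falls in."""
    dot = sum(a * b for a, b in zip(x, projection))
    return math.floor(dot / bucket_length)


projection = make_projection(dim=3, seed=12345)
a = [1.0, 0.0, 2.0]
b = [1.0, 0.1, 2.0]    # close to a
far = [10.0, -5.0, 7.0]
hashes = {
    "a": brp_hash(a, projection, bucket_length=1.0),
    "b": brp_hash(b, projection, bucket_length=1.0),
    "far": brp_hash(far, projection, bucket_length=1.0),
}
```

Because the projection has unit length, two vectors at Euclidean distance d project to values at most d apart, so nearby vectors land in the same or an adjacent bucket. A larger `bucket_length` makes a shared bucket more likely, at the cost of more candidates to compare.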

In [17]:
# TODO: select low-rated apartments for the query; there should be fewer of them

# from pyspark.sql.types import BooleanType

# def rating_less_than(arr, threshold):
#   if arr and len(arr) >= 2:
#     if isinstance(arr[1], str):
#       try:
#         num = float(arr[1][1:])

#         return num < threshold
#       except:
#         False
#   return False

# rating_less_than_udf = F.udf(rating_less_than, T.ArrayType(T.StringType(), T.FloatType))
# # second_element_less_than_udf = F.udf(second_element_less_than, BooleanType())

# rating_threshold = 3.1
# query_candidates = results.withColumn("less_than_3", rating_less_than_udf("clean_tokens", rating_threshold))
# query_candidates.show()


In [18]:
query = results.first()
num_neighbours = 100
print(query.asDict()["name"])
print(query.asDict()["id"])

reccomendations = model.approxNearestNeighbors(results, query.vector_space, num_neighbours, distCol="distance")
reccomendations.select("id", "name", "clean_tokens", "distance").show(num_neighbours, False)

Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths
18674
+------------------+----------------------------------------------------------------+------------------------------------------------------------+------------------+
|id                |name                                                            |clean_tokens                                                |distance          |
+------------------+----------------------------------------------------------------+------------------------------------------------------------+------------------+
|4929694           |Rental unit in Barcelona · ★4.08 · 3 bedrooms · 6 beds · 2 baths|[Rental unit in Barcelona, 4, 3 bedrooms, 6 beds, 2 baths]  |0.0               |
|112538            |Rental unit in Barcelona · ★4.38 · 3 bedrooms · 6 beds · 2 baths|[Rental unit in Barcelona, 4, 3 bedrooms, 6 beds, 2 baths]  |0.0               |
|138055            |Rental unit in Barcelona · ★4.49 · 3 bedrooms · 6 beds · 2 baths|[Rental unit i

## Nearest neighbours ground truth