<a href="https://colab.research.google.com/github/shandrayu/mining-massive-databases/blob/main/notebooks/homework_pyspark_yuliia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining massive databases homework

## Colab setup


### Clone repo

In [None]:
!wget -q https://raw.githubusercontent.com/tsunrise/colab-github/main/colab_github.py
import colab_github
colab_github.github_auth(persistent_key=True)

In [None]:
!git clone git@github.com:shandrayu/mining-massive-databases.git

Cloning into 'mining-massive-databases'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 20 (delta 5), reused 13 (delta 2), pack-reused 0
Receiving objects: 100% (20/20), 19.23 KiB | 6.41 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [None]:
!ls && ls mining-massive-databases

colab_github.py  drive	mining-massive-databases  __pycache__  sample_data
homework_report.md  notebooks  project_report.md  README.md


### PySpark

In [None]:
# Uncomment if you did not clone the repo
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
!wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

--2023-11-02 10:17:57--  https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400395283 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.0-bin-hadoop3.tgz’


2023-11-02 10:18:10 (32.5 MB/s) - ‘spark-3.5.0-bin-hadoop3.tgz’ saved [400395283/400395283]



In [None]:
!tar xzf spark-3.5.0-bin-hadoop3.tgz


In [None]:
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, avg, when
import pandas as pd

In [None]:
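# Build (or reuse) a local session; getOrCreate() keeps this cell safe to re-run
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
spark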

## Barcelona dataset recommendation system

The goal of this task is to recommend similar apartments (items) for an input query (an apartment description).
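
As a rough illustration of what "recommend similar apartments" can mean once descriptions are turned into vectors, the sketch below ranks items by cosine similarity to a query. Cosine similarity is an assumption here (the notebook does not fix a similarity measure at this point), and `cosine_similarity`, `top_k_similar`, and the toy vectors are hypothetical names and data used only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {index: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def top_k_similar(query, items, k):
    """Return the ids of the k items most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical toy vectors, not taken from the Barcelona dataset
items = {
    "a": {0: 1.0, 1: 1.0},
    "b": {0: 1.0},
    "c": {2: 1.0},
}
query = {0: 1.0, 1: 1.0}
ranked = top_k_similar(query, items, k=2)
print(ranked)
```

An exhaustive ranking like this compares the query with every item, so it is only practical for small inputs, but it can serve as an exact reference when judging an approximate method.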

Tasks:

- Convert the text feature to a feature vector with TF-IDF
- Grid search over the parameters
- Grid search over the number of features
- Calculate the ground truth
- Choose metrics and explain the choice
- Add to the report:
  - Accuracy for 4 different sets of parameters
  - Computation time for the tuning (grid search) procedure
  - Machine characteristics
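
The grid search and timing items above could be organised roughly as in the following sketch. It is a minimal outline in plain Python, not the procedure used later in the notebook: `timed_grid_search`, `toy_score`, and the parameter name `num_hash_tables` are hypothetical, and the scoring function is a stand-in for whatever accuracy metric is chosen.

```python
import itertools
import time

def timed_grid_search(param_grid, evaluate):
    """Try every combination in param_grid.

    Returns (best_params, best_score, elapsed_seconds), where a higher score is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    start = time.perf_counter()
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    elapsed = time.perf_counter() - start
    return best_params, best_score, elapsed

# Hypothetical stand-in for a real evaluation against the ground truth
def toy_score(num_features, num_hash_tables):
    return -abs(num_features - 64) - abs(num_hash_tables - 3)

grid = {"num_features": [16, 64, 256], "num_hash_tables": [1, 3, 5]}
best_params, best_score, elapsed = timed_grid_search(grid, toy_score)
print(best_params, best_score, f"{elapsed:.4f} s")
```

Measuring the wall-clock time around the whole loop gives the "computation time for tuning" figure directly; per-combination timings could be collected the same way if the report needs them.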

## Download data

In [26]:
from pyspark import SparkFiles

listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv"
reviews_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/reviews.csv"

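def load_file_to_spark(url):
    """Download a CSV through SparkFiles and read it into a DataFrame."""
    spark.sparkContext.addFile(url)
    filename = url.split("/")[-1]
    # multiLine and escape handle quoted fields that contain newlines or quotes
    df = spark.read.csv("file://" + SparkFiles.get(filename), header=True, multiLine=True, escape='"', inferSchema=True)
    return df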

listings_df = load_file_to_spark(listings_url)
print(f"Dataset size {listings_df.count()}, number of columns {len(listings_df.columns)}")
listings_df.show(15)

reviews_df = load_file_to_spark(reviews_url)
print(f"Dataset size {reviews_df.count()}, number of columns {len(reviews_df.columns)}")
reviews_df.show(15)

Dataset size 18086, number of columns 18
+------+--------------------+-------+--------------------+-------------------+--------------------+-----------------+-----------------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-----------+
|    id|                name|host_id|           host_name|neighbourhood_group|       neighbourhood|         latitude|        longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|    license|
+------+--------------------+-------+--------------------+-------------------+--------------------+-----------------+-----------------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-----------+
| 18674|Rental unit in Ba...|  71615|    Mireia An

In [None]:
listings_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: integer (nullable = true)
 |-- last_review: date (nullable = true)
 |-- reviews_per_month: double (nullable = true)
 |-- calculated_host_listings_count: integer (nullable = true)
 |-- availability_365: integer (nullable = true)
 |-- number_of_reviews_ltm: integer (nullable = true)
 |-- license: string (nullable = true)



## Convert the text feature to a feature vector with TF-IDF
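
To make the idea concrete before the Spark pipeline in the next cell, here is a small plain-Python sketch of hashed term frequencies followed by IDF weighting. It uses the smoothed formula `log((N + 1) / (df + 1))`, which is the one Spark's `IDF` documents. The hash function is an assumption: `zlib.crc32` stands in for the MurmurHash3 that Spark's `HashingTF` uses, so bucket indices will not match Spark's. All function names and the toy documents are hypothetical.

```python
import math
import zlib
from collections import Counter

def hashed_tf(tokens, num_features):
    """Map tokens to bucket counts: {bucket_index: term_frequency}."""
    # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3 instead
    return dict(Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens))

def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for vec in tf_vectors for bucket in vec)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}

def tf_idf_vectors(docs, num_features):
    """Return one sparse {bucket: tf * idf} vector per tokenized document."""
    tf_vectors = [hashed_tf(doc, num_features) for doc in docs]
    idf = idf_weights(tf_vectors)
    return [{b: tf * idf[b] for b, tf in vec.items()} for vec in tf_vectors]

# Hypothetical tokenized listing names
docs = [
    ["rental", "unit", "barcelona"],
    ["loft", "barcelona"],
    ["rental", "unit", "gracia"],
]
vectors = tf_idf_vectors(docs, num_features=1024)
for vec in vectors:
    print(vec)
```

With a small `num_features`, different tokens are more likely to land in the same bucket, which is one reason the number of features is worth including in the grid search.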

In [27]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StopWordsRemover

def tf_idf(df, input_column_name, num_features, additional_stop_words):
    # Tokenize
    tokenizer = Tokenizer(inputCol=input_column_name, outputCol="tokens")
    with_tokens = tokenizer.transform(df)

    # Remove stop words (download the English list on first use)
    stopwords_filename = "stopwords-en.txt"
    if not os.path.exists(stopwords_filename):
        !wget https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt

    # One stop word per line
    with open(stopwords_filename, encoding="utf-8") as file:
        stop_words = [line.rstrip() for line in file]

    stop_words.extend(additional_stop_words)
    remover = StopWordsRemover(stopWords=stop_words)
    remover.setInputCol("tokens")
    remover.setOutputCol("clean_tokens")
    clean_tokens = remover.transform(with_tokens)

    # Perform TFIDF
    hashing_tf = HashingTF(inputCol="clean_tokens", outputCol="raw_features", numFeatures=num_features)
    featurized_data = hashing_tf.transform(clean_tokens)

    idf = IDF(inputCol="raw_features", outputCol="vector_space")
    idf_model = idf.fit(featurized_data)
    results = idf_model.transform(featurized_data)

    return results

results = tf_idf(listings_df, "name", 15, ['·', '★', '1', 'in'])
results.show(20, False)

+------+--------------------------------------------------------------------------+-------+-------------------------------+-------------------+--------------------------------------------+-----------------+-----------------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+------------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id    |name                                                                      |host_id|host_name                      |neighbourhood_group|n