<a href="https://colab.research.google.com/github/shandrayu/mining-massive-databases/blob/main/notebooks/homework_pyspark_yuliia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining massive databases homework

## Colab setup


In [5]:
!pip install ipython-autotime

%load_ext autotime

time: 413 µs (started: 2023-11-16 12:42:11 +00:00)


### PySpark

In [4]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

time: 10.9 s (started: 2023-11-16 09:27:16 +00:00)


In [5]:
!wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

--2023-11-16 09:27:27--  https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.95.219, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400395283 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.0-bin-hadoop3.tgz’


2023-11-16 09:27:48 (19.3 MB/s) - ‘spark-3.5.0-bin-hadoop3.tgz’ saved [400395283/400395283]

time: 20.4 s (started: 2023-11-16 09:27:27 +00:00)


In [6]:
!tar xzvf spark-3.5.0-bin-hadoop3.tgz > /dev/null


time: 5.52 s (started: 2023-11-16 09:27:48 +00:00)


In [1]:
!pip install -q findspark

In [1]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [2]:
import findspark
findspark.init()

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, avg, when
import pandas as pd

In [4]:
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

Tasks:

- Convert text feature with TF-IDF to vector of features
- Grid search for parameters
- Grid search for number of features
- Calculate ground truth
- Choose metrics. Explain choice

- Add to report:
  - Accuracy for 4 different sets of parameters
  - Computation time for tuning (grid search) procedure
  - Machine characteristics

## Complete pipeline

1. Download and preprocess - Spark
2. Convert text feature with TF-IDF to vector of features - Spark
3. Convert to pandas dataframe - pandas
4. Calculate ground truth - make hashable (like dict?)
5. LSH - Spotify (Annoy)
6. Metric - from scikit-learn for ranking
7. Manual grid search






## Download and preprocess data

## Barcelona dataset

The goal of this task is to recommend similar apartments (items) based on an input query (an apartment description).

In [5]:
from pyspark import SparkFiles
from pyspark.sql.functions import substring
from pyspark.sql.functions import split
from pyspark.sql import functions as F
import pyspark.sql.types as T

# listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/data/listings.csv.gz" # full data
listings_url = "http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv" # short data

def load_file_to_spark(url):
  spark.sparkContext.addFile(url)
  filename = url.split("/")[-1]
  df = spark.read.csv("file://" + SparkFiles.get(filename), header=True, multiLine=True, escape='\"', inferSchema=True)
  return df

def bucket_rating(arr):
  """
  Replace the rating token (second element, e.g. "★4.30") with a bucket label:
  - "5":   rating >= 4.8
  - "4.5": 4.5 <= rating < 4.8
  - "4":   4.0 <= rating < 4.5
  - "3.5": 3.5 <= rating < 4.0
  - "3":   3.0 <= rating < 3.5
  - "2":   rating < 3.0
  - "0":   no numeric rating (new listing)
  """
  if arr and len(arr) >= 2:
    if isinstance(arr[1], str) and arr[1].startswith("★"):
      try:
        num = float(arr[1][1:])

        # Each branch is reached only if the previous lower bounds failed
        if num >= 4.8:
          arr[1] = "5"
        elif num >= 4.5:
          arr[1] = "4.5"
        elif num >= 4:
          arr[1] = "4"
        elif num >= 3.5:
          arr[1] = "3.5"
        elif num >= 3.0:
          arr[1] = "3"
        else:
          arr[1] = "2"
      except ValueError:
        # New listing, no rating. OK for categorical classification
        arr[1] = "0"
  return arr

def preprocess_listing_database(df):
  # The pattern is a regex; " · " matches the separator without an invalid "\ " escape
  df_preprocessed = df.withColumn("name_tokens", split("name", " · "))
  df_preprocessed = df_preprocessed.withColumn("split_tokens", bucket_rating_udf("name_tokens"))
  df_preprocessed = df_preprocessed.withColumn("rating_bucket", col("split_tokens")[1].cast('float'))
  return df_preprocessed


bucket_rating_udf = F.udf(bucket_rating, T.ArrayType(T.StringType()))
listings_df = load_file_to_spark(listings_url)
listings_df = preprocess_listing_database(listings_df)
listings_df.select("name", "name_tokens", "split_tokens", "rating_bucket").show(5, False)

+--------------------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------------------+-------------+
|name                                                                      |name_tokens                                                             |split_tokens                                                          |rating_bucket|
+--------------------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------------------+-------------+
|Rental unit in Barcelona · ★4.30 · 3 bedrooms · 6 beds · 2 baths          |[Rental unit in Barcelona, ★4.30, 3 bedrooms, 6 beds, 2 baths]          |[Rental unit in Barcelona, 4, 3 bedrooms, 6 beds, 2 baths]            |4.0          |
|Rental unit in Sant Adria de Besos · ★4.77 · 3 bedrooms · 4
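
The bucketing rule applied to the rating token can also be written as a small threshold table. This is a standalone sketch for illustration, not the notebook's UDF: the helper name `rating_to_bucket`, the `BUCKETS` list, and the `"★New"` sample token are assumptions.

```python
# Lower bounds mirror bucket_rating: the first bound that the rating reaches wins.
BUCKETS = [(4.8, "5"), (4.5, "4.5"), (4.0, "4"), (3.5, "3.5"), (3.0, "3")]

def rating_to_bucket(token):
    """Map a token such as '★4.30' to a bucket label (hypothetical helper)."""
    if not token.startswith("★"):
        return token  # not a rating token, leave it unchanged
    try:
        rating = float(token[1:])
    except ValueError:
        return "0"  # no numeric rating, e.g. a new listing
    for lower_bound, label in BUCKETS:
        if rating >= lower_bound:
            return label
    return "2"  # everything below 3.0

for token in ["★4.30", "★4.77", "★4.95", "★New"]:
    print(token, "->", rating_to_bucket(token))
```

Keeping the thresholds in one list makes it easier to change the bucket boundaries without touching the control flow.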

## Wikipedia dataset

The goal of this task is to find similar requests in the Ukrainian Wikipedia, using the requests for September 2023.

In [6]:
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import *

df_wiki = spark.read.csv('/content/wikipedia-2023_09.csv', header=True)
df_wiki = df_wiki.select("*").withColumn("id", monotonically_increasing_id())

df_wiki = df_wiki.withColumn("title", col("Page")).drop("Page").drop("Views").drop("Editors").drop("Edits")

In [7]:
df_wiki.show(10)

+---+--------------------+
| id|               title|
+---+--------------------+
|  0|Умєров Рустем Енв...|
|  1|             Ukr.net|
|  2|             Україна|
|  3|Кадиров Рамзан Ах...|
|  4|    Нагірний Карабах|
|  5|             YouTube|
|  6|       Олекса Довбуш|
|  7| Перша світова війна|
|  8|   Українська абетка|
|  9|    Десять заповідей|
+---+--------------------+
only showing top 10 rows



## Convert text feature with TF-IDF to vector of features

In [8]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StopWordsRemover


# spark_feature_extractor.py
def convert_to_tokens(df, input_column_name, stopwords_filename, additional_stop_words, stop_words_exceptions,
                      data):
    if data == "wiki":
      # remove punctuation (used only for Wikipedia)
      df = df.withColumn('title', translate('title', '!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~«»', ''))
      df = df.where(length(col("title")) > 1)
    tokenizer = Tokenizer(inputCol=input_column_name, outputCol="tokens")
    with_tokens = tokenizer.transform(df)

    if not os.path.exists(stopwords_filename) and 'en' in stopwords_filename:
      !wget https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt

    stop_words = []
    with open(stopwords_filename) as file:
        for line in file:
          word = line.rstrip()
          if stop_words_exceptions:
            if word not in stop_words_exceptions:
              stop_words.append(word)
          else:
            stop_words.append(word)

    if additional_stop_words:
      stop_words.extend(additional_stop_words)

    remover = StopWordsRemover(stopWords=stop_words)
    remover.setInputCol("tokens")
    remover.setOutputCol("clean_tokens")
    clean_tokens = remover.transform(with_tokens)

    return clean_tokens


def tf_idf(df, clean_tokens_column_name, num_features):
    # Perform TF-IDF
    hashing_tf = HashingTF(inputCol=clean_tokens_column_name, outputCol="raw_features", numFeatures=num_features)
    featurized_data = hashing_tf.transform(df)

    idf = IDF(inputCol="raw_features", outputCol="vector_space")
    idf_model = idf.fit(featurized_data)
    results = idf_model.transform(featurized_data)

    return results

def convert_title_to_features(df, num_features, data):
    if data == 'barcelona':
      df = df.withColumn("title", col("split_tokens")[0])
      stopwords_filename = "stopwords-en.txt"
      additional_stop_words=['·', '★', '1', 'in']
      stop_words_exceptions={"home"}
    elif data == "wiki":
      stopwords_filename = "stopwords_ua.txt"
      additional_stop_words=['і']
      stop_words_exceptions=None

    clean_tokens = convert_to_tokens(df=df, input_column_name="title",
                                     stopwords_filename=stopwords_filename,
                                     additional_stop_words=additional_stop_words,
                                     stop_words_exceptions=stop_words_exceptions,
                                     data=data)
    if data == 'barcelona':
      # Extend clean tokens with hand-crafted split tokens
      clean_tokens = clean_tokens.withColumn("clean_tokens", concat(col("clean_tokens"), slice(col("split_tokens"), 2, size(col("split_tokens")) - 1)))

    results = tf_idf(df=clean_tokens, clean_tokens_column_name="clean_tokens", num_features=num_features)
    return results

In [51]:
# num_features = 15
# df_with_features = convert_title_to_features(df=df_wiki, num_features=num_features,
#                                              data="wiki")
# df_with_features.select("title", "tokens", "clean_tokens", "raw_features", "vector_space").show(10, False)
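
To make the HashingTF + IDF step concrete, here is a pure-Python sketch of the same idea on toy token lists. It is an illustration under assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, the sample documents are invented, and the IDF weight follows the smoothed form log((m + 1) / (df + 1)) described in the Spark ML documentation for `IDF`.

```python
import math
import zlib

def hashing_tf(tokens, num_features):
    """Term-frequency vector: each token is hashed into one of num_features buckets."""
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash (assumption); Spark uses a different hash function
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] += 1.0
    return vector

def fit_idf(tf_vectors):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)), where m is the number of documents."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    doc_freq = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(num_features)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]

def tf_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return [tf * weight for tf, weight in zip(tf_vector, idf)]

# Invented sample documents, already tokenized
docs = [["rental", "unit", "barcelona"], ["rental", "unit", "3 bedrooms"], ["loft", "barcelona"]]
tf_vectors = [hashing_tf(doc, num_features=15) for doc in docs]
idf = fit_idf(tf_vectors)
features = [tf_idf(v, idf) for v in tf_vectors]
print(features[0])
```

With few buckets, distinct tokens are more likely to collide in the same position, which is the trade-off that `num_features` controls.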

## Download data to Pandas

In [49]:
# import numpy as np

# df_pandas = df_with_features.select("id", "title", "raw_features", "vector_space").toPandas()
# df_pandas["vector_space"] = df_pandas["vector_space"].apply(lambda x : np.array(x.toArray()))
# df_pandas["vector_space"] = df_pandas["vector_space"].to_numpy()
# df_pandas.head()

## LSH Spotify (Annoy)


In [16]:
!pip install annoy



In [9]:
# Restart runtime if package is not found after installation
from annoy import AnnoyIndex

# ranking.py
def build_tree(df, metric, num_features, num_trees):
    """
    Create an Annoy index and add every feature vector to it.
    The index is not built here; the caller runs build(num_trees).

    Note that Annoy allocates memory for max(i)+1 items, where i is the item index.
    The item index corresponds to the dataframe index, so a row is accessed with df.loc[index].

    Assumptions: features are stored in column "vector_space" of df
    """
    tree_index = AnnoyIndex(num_features, metric)

    for index, vector in zip(df.index, df["vector_space"]):
        tree_index.add_item(index, vector)

    return tree_index


def get_ids_by_idx(df, idxs):
    return df.loc[idxs]["id"].to_numpy()


def calculate_neighbours(df, metric, num_features, num_trees, num_neighbours):
    """
    Add column "reccomendations" to df with the calculated nearest neighbours
    Returns: df and the model (tree index)
    Assumptions: features are stored in column "vector_space" of df
    """
    tree_index = build_tree(df, metric=metric, num_features=num_features, num_trees=num_trees)
    tree_index.build(num_trees)
    df["reccomendations"] = df.apply(lambda row : get_ids_by_idx(df, tree_index.get_nns_by_vector(row["vector_space"],
                                                                                                  n=num_neighbours)), axis = 1)
    return df, tree_index
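
The `metric` argument decides how the index measures closeness between two feature vectors. The sketch below spells out three of the distances in plain Python so they can be compared on small inputs. It is an illustration, not Annoy's code: the angular form sqrt(2 * (1 - cos(u, v))) follows the Annoy README, while the handling of zero vectors is a convention chosen for this sketch.

```python
import math

def euclidean(u, v):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def angular(u, v):
    """sqrt(2 * (1 - cos(u, v))): the Euclidean distance between the normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.sqrt(2.0)  # convention for this sketch only (assumption)
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding noise
    return math.sqrt(2.0 * (1.0 - cosine))

# Same direction, different length
u, v = [1.0, 0.0], [2.0, 0.0]
print(euclidean(u, v), manhattan(u, v), angular(u, v))
```

The angular distance ignores vector length, so two vectors that point in the same direction count as identical. Since the ground truth in this notebook is computed with the Euclidean metric, this may be one reason an angular index agrees with it less often.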

In [48]:
# # # pipeline.py
# subsample_size = 1000
# df_subset = df_pandas[:subsample_size].copy()

# # Parameters
# metrics = ["angular", "euclidean", "manhattan", "hamming", "dot"]
# # num_features = 15
# num_trees = 10
# num_neighbours = 5

# df_subset, tree_idx = calculate_neighbours(df_subset, metrics[1], num_features, num_trees,
#                                            num_neighbours)

# df_subset[["id", "title", "reccomendations"]]

### Query something

In [47]:
# query = df_subset.loc[5]
# print(f"Query: \nid: {query['id']}: {query['title']}")
# neighbour_idxs = tree_idx.get_nns_by_vector(query.vector_space, num_neighbours)

# df_subset[["id", "title"]].loc[neighbour_idxs]

## Nearest neighbours ground truth (for all)


In [10]:
import numpy as np
from sklearn.neighbors import KDTree

class GroundTruthCalculator:
    def __init__(self, vectors, ids, num_features, leaf_size=30, metric="euclidean"):
        # num_features is unused here: KDTree takes the dimensionality from vectors
        self.kdt = KDTree(vectors, leaf_size=leaf_size, metric=metric)
        self.ids = ids

    def get(self, query_id, query_vector, num_neighbours):
        # Exact nearest neighbours of the query vector
        gt_neighbours_idx = self.kdt.query(query_vector.reshape(1, -1), k=num_neighbours, return_distance=False)
        gt_real_ids = self.ids[gt_neighbours_idx.flatten()]
        # Drop the query itself from its own neighbour list
        gt_ids_without_query_id = np.delete(gt_real_ids, np.where(gt_real_ids == query_id))

        return gt_ids_without_query_id

def calculate_ground_truth(df, num_features, num_neighbours):
    """
    Add column "gt_ids" to df with ground truth labels
    Returns: df with the added column

    Assumptions:
     - features are stored in column "vector_space" of df
     - listing id is stored in column "id" of df
    """
    vectors = np.stack(df["vector_space"], axis=0)
    ids = np.stack(df["id"], axis=0)
    gt_calculator = GroundTruthCalculator(vectors=vectors, ids=ids, num_features=num_features)
    df["gt_ids"] = df.apply(lambda row : gt_calculator.get(row["id"], row["vector_space"], num_neighbours), axis = 1)

    return df

In [44]:
# # calculate once in the initialization phase. For num_features = max_num_features
# df_subset = calculate_ground_truth(df_subset, num_features=num_features, num_neighbours=num_neighbours)

In [45]:
# query = df_subset.loc[5]
# print(f"Query: \nid: {query['id']}: {query['title']}")
# gt_neighbour_ids = query["gt_ids"]

# df_subset.loc[df_subset["id"].isin(gt_neighbour_ids)][["id", "title"]]
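
On small inputs, the KDTree ground truth can be cross-checked with a brute-force search. The sketch below is a plain-Python stand-in under assumptions: the function name `exact_neighbours` and the toy vectors are invented, and it uses the Euclidean distance only.

```python
import math

def exact_neighbours(vectors, ids, query_id, query_vector, num_neighbours):
    """Ids of the num_neighbours closest vectors (Euclidean), with the query id dropped."""
    # Rank every row by its distance to the query; sorted() is stable for ties
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query_vector))
    nearest_ids = [ids[i] for i in ranked[:num_neighbours]]
    return [item_id for item_id in nearest_ids if item_id != query_id]

# Toy data (invented): four 2-D vectors with their ids
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
ids = [10, 11, 12, 13]
print(exact_neighbours(vectors, ids, query_id=10, query_vector=[0.0, 0.0], num_neighbours=3))
```

As in the KDTree version, the query is looked up together with its neighbours and then removed, so a request for 3 neighbours yields 2 ids when the query is part of the data.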

## Metrics for dataset

The accuracy is calculated in two steps:
1. "metric_accuracy": for each title, the fraction of its num_neighbours recommendations from LSH (prediction) that also appear in the KDTree result (ground truth);
2. "accuracy": the fraction of titles whose "metric_accuracy" is above the threshold.
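
The two steps can be traced on a tiny example. This is a plain-Python sketch with invented ids; the function names are chosen for illustration and are not the notebook's `get_accuracy`.

```python
def metric_accuracy(recommendations, gt_ids, num_neighbours):
    """Share of the num_neighbours slots filled by ids that also appear in the ground truth."""
    gt = set(gt_ids)
    return sum(1 for item_id in recommendations if item_id in gt) / num_neighbours

def accuracy(rows, num_neighbours, threshold):
    """Fraction of rows whose metric_accuracy is strictly above the threshold."""
    scores = [metric_accuracy(recs, gt, num_neighbours) for recs, gt in rows]
    return sum(1 for score in scores if score > threshold) / len(scores)

# Each row: (recommendations, ground-truth ids); all ids are invented
rows = [
    ([1, 2, 3, 4, 5], [2, 3, 4, 5]),  # 4 of 5 slots match -> 0.8
    ([1, 2, 3, 9, 8], [2, 3, 4, 5]),  # 2 of 5 slots match -> 0.4
]
print(accuracy(rows, num_neighbours=5, threshold=0.7))
```

Only the first row clears the 0.7 threshold, so the overall accuracy is 0.5. Note that when the query's own id is among its recommendations but has been removed from the ground truth, a row can score at most (k - 1) / k, for example 0.8 for k = 5.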

In [11]:
def get_accuracy(df, num_neighbours, ranking_quality_threshold):
    """
    Add column "metric_accuracy" to df with the per-row accuracy
    Returns: share of rows with "metric_accuracy" above ranking_quality_threshold

    Assumptions:
     - recommendations are stored in column "reccomendations" of df
     - ground truth is stored in column "gt_ids" of df
    """
    df["metric_accuracy"] = df.apply(lambda row : np.sum(np.isin(row.reccomendations, row.gt_ids)) / num_neighbours, axis = 1)
    good_reccomendations = df.loc[df["metric_accuracy"] > ranking_quality_threshold]
    accuracy = good_reccomendations.shape[0] / df.shape[0]
    return accuracy

In [43]:
# get_accuracy(df_subset, num_neighbours, 0.7)

## Complete pipeline

In [19]:
import numpy as np
# Restart runtime if package is not found after installation
from annoy import AnnoyIndex

# Metrics
RANKING_QUALITY_THRESHOLD = 0.7

def pipeline(input_df, num_features, metric, num_trees, num_neighbours, subsample_size,
             data):
    # FeatureExtractor - TF-IDF + custom preprocessing
    df_with_features = convert_title_to_features(df=input_df, num_features=num_features,
                                                 data=data)

    # Download selected columns to pandas
    df_pandas = df_with_features.select("id", "title", "raw_features", "vector_space").toPandas()
    df_pandas["vector_space"] = df_pandas["vector_space"].apply(lambda x : np.array(x.toArray()))
    df_pandas["vector_space"] = df_pandas["vector_space"].to_numpy()


    # Get subsample of data
    df_subset = df_pandas[:subsample_size].copy()

    df_subset = calculate_ground_truth(df_subset, num_features=num_features, num_neighbours=num_neighbours)

    # Ranking - Annoy Spotify
    df_subset, _ = calculate_neighbours(df_subset, metric, num_features, num_trees, num_neighbours=num_neighbours)

    # # Scoring
    score = get_accuracy(df_subset, num_neighbours=num_neighbours,
                         ranking_quality_threshold=RANKING_QUALITY_THRESHOLD)
    # print(f'score {score}')
    return np.round(score, 3)

In [18]:
pipeline(input_df=df_wiki, num_features=15, metric="euclidean", num_trees=10, num_neighbours=50,
         subsample_size=1000, data="wiki")

0.96

In [20]:
pipeline(input_df=listings_df, num_features=15, metric="euclidean", num_trees=10, num_neighbours=50,
         subsample_size=1000, data='barcelona')

0.958

## Grid search for parameters

### Wikipedia data

In [14]:
import time

param_grid = {
    # Feature extractor
    "num_features": [5, 15, 30],
    # Ranking
    "metrics": ["angular", "euclidean", "manhattan"],
    "num_trees": [5, 10, 30],
    "num_neighbours": [5, 10, 50],
    # Sample size
    "subsample_size": [200, 500, 1000]
}

def customGridSearch(param_grid):
  results = []

  for num_features in param_grid['num_features']:
    for metric in param_grid['metrics']:
      for num_trees in param_grid["num_trees"]:
        for num_neighbours in param_grid["num_neighbours"]:
          for subsample_size in param_grid['subsample_size']:
            time_0 = time.time()
            # pipeline() requires the dataset name to pick the preprocessing branch
            accuracy = pipeline(input_df=df_wiki,
                                num_features=num_features,
                                metric=metric,
                                num_trees=num_trees,
                                num_neighbours=num_neighbours,
                                subsample_size=subsample_size,
                                data="wiki")
            time_1 = time.time()
            results.append({
                "num_features": num_features,
                "metric": metric,
                "num_trees": num_trees,
                "num_neighbours": num_neighbours,
                "subsample_size": subsample_size,
                "accuracy": accuracy,
                "time_for_pipeline": np.round(time_1 - time_0, 4)
            })

  return pd.DataFrame(results)

time_0 = time.time()
df_results = customGridSearch(param_grid=param_grid)
time_1 = time.time()
print(f'Time for GridSearch: {np.round(time_1 - time_0, 4)}')
df_results.to_csv('results_gridsearch_wikipedia.csv', index=False)

Time for GridSearch: 310.2179


The Wikipedia dataset is small (only 988 titles), so the full data can be used for parameter tuning. We still include "subsample_size" in the grid to see whether the accuracy increases or decreases with a smaller subsample.
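
The nested loops in `customGridSearch` enumerate the Cartesian product of the parameter lists. A minimal sketch of the same enumeration with `itertools.product` is shown below; the helper name `iter_grid` is an assumption, and the pipeline call is left out so the block runs on its own.

```python
import itertools

param_grid = {
    "num_features": [5, 15, 30],
    "metrics": ["angular", "euclidean", "manhattan"],
    "num_trees": [5, 10, 30],
    "num_neighbours": [5, 10, 50],
    "subsample_size": [200, 500, 1000],
}

def iter_grid(grid):
    """Yield one dict per parameter combination, in the same order as nested for-loops."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

combinations = list(iter_grid(param_grid))
print(len(combinations))  # 3 ** 5 = 243 pipeline runs
print(combinations[134])  # position 134 in loop order
```

Because the order matches the nested loops, position 134 corresponds to row 134 of the results table: 15 features, euclidean, 30 trees, 50 neighbours, subsample size 1000.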

In [20]:
df_results[(df_results['accuracy'] == df_results['accuracy'].max())]

| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 42 | 5 | euclidean | 10 | 50 | 200 | 1.0 | 0.8869 |
| 51 | 5 | euclidean | 30 | 50 | 200 | 1.0 | 1.2157 |
| 123 | 15 | euclidean | 10 | 50 | 200 | 1.0 | 0.8226 |
| 132 | 15 | euclidean | 30 | 50 | 200 | 1.0 | 1.0305 |
| 133 | 15 | euclidean | 30 | 50 | 500 | 1.0 | 1.4357 |
| 204 | 30 | euclidean | 10 | 50 | 200 | 1.0 | 0.7812 |
| 213 | 30 | euclidean | 30 | 50 | 200 | 1.0 | 1.2137 |
| 214 | 30 | euclidean | 30 | 50 | 500 | 1.0 | 1.3662 |
| 231 | 30 | manhattan | 10 | 50 | 200 | 1.0 | 0.9138 |
| 240 | 30 | manhattan | 30 | 50 | 200 | 1.0 | 0.8283 |


In [19]:
df_results[df_results['subsample_size'] == 1000].sort_values(by=['accuracy'], ascending=False)

| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 134 | 15 | euclidean | 30 | 50 | 1000 | 0.999 | 2.1558 |
| 215 | 30 | euclidean | 30 | 50 | 1000 | 0.991 | 2.2193 |
| 53 | 5 | euclidean | 30 | 50 | 1000 | 0.978 | 2.5332 |
| 125 | 15 | euclidean | 10 | 50 | 1000 | 0.960 | 1.5688 |
| 242 | 30 | manhattan | 30 | 50 | 1000 | 0.947 | 2.1182 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 86 | 15 | angular | 5 | 10 | 1000 | 0.349 | 1.8323 |
| 5 | 5 | angular | 5 | 10 | 1000 | 0.342 | 1.7044 |
| 83 | 15 | angular | 5 | 5 | 1000 | 0.339 | 1.8160 |
| 2 | 5 | angular | 5 | 5 | 1000 | 0.328 | 1.9026 |


The results show that, with all other parameters equal, a smaller subsample tends to give higher accuracy. There are 10 combinations with an accuracy of 1.0, and all of them use a subsample size smaller than the full data size.
To compare like with like, we fix the full data size (subsample_size = 1000) and pick the best parameters there.

**Optimal parameters for the Wikipedia dataset**

In [22]:
df_results.loc[134]

num_features                15
metric               euclidean
num_trees                   30
num_neighbours              50
subsample_size            1000
accuracy                 0.999
time_for_pipeline       2.1558
Name: 134, dtype: object

**Time for GridSearch (Wiki): 5.2 min**

### Barcelona data

In [21]:
import time

param_grid = {
    # Feature extractor
    "num_features": [5, 15, 30],
    # Ranking
    "metrics": ["angular", "euclidean", "manhattan"],
    "num_trees": [5, 10, 30],
    "num_neighbours": [5, 10, 50],
    # Sample size
    "subsample_size": [500, 1000]
}

def customGridSearch(param_grid):
  results = []

  for num_features in param_grid['num_features']:
    for metric in param_grid['metrics']:
      for num_trees in param_grid["num_trees"]:
        for num_neighbours in param_grid["num_neighbours"]:
          for subsample_size in param_grid['subsample_size']:
            time_0 = time.time()
            accuracy = pipeline(input_df=listings_df,
                                num_features=num_features,
                                metric=metric,
                                num_trees=num_trees,
                                num_neighbours=num_neighbours,
                                subsample_size=subsample_size,
                                data="barcelona")
            time_1 = time.time()
            results.append({
                "num_features": num_features,
                "metric": metric,
                "num_trees": num_trees,
                "num_neighbours": num_neighbours,
                "subsample_size": subsample_size,
                "accuracy": accuracy,
                "time_for_pipeline": np.round(time_1 - time_0, 4)
            })

  return pd.DataFrame(results)

time_0 = time.time()
df_results = customGridSearch(param_grid=param_grid)
time_1 = time.time()
print(f'Time for GridSearch: {np.round(time_1 - time_0, 4)}')
df_results.to_csv('results_gridsearch_barcelona.csv', index=False)

Time for GridSearch: 859.8582


In [22]:
df_results[(df_results['accuracy'] == df_results['accuracy'].max())]

| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 34 | 5 | euclidean | 30 | 50 | 500 | 1.0 | 3.7975 |
| 46 | 5 | manhattan | 10 | 50 | 500 | 1.0 | 3.7335 |
| 52 | 5 | manhattan | 30 | 50 | 500 | 1.0 | 6.5077 |
| 82 | 15 | euclidean | 10 | 50 | 500 | 1.0 | 3.9406 |
| 88 | 15 | euclidean | 30 | 50 | 500 | 1.0 | 4.3548 |
| 142 | 30 | euclidean | 30 | 50 | 500 | 1.0 | 6.7781 |
| 143 | 30 | euclidean | 30 | 50 | 1000 | 1.0 | 7.517 |


In [23]:
df_results[df_results['subsample_size'] == 1000].sort_values(by=['accuracy'], ascending=False)

| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 143 | 30 | euclidean | 30 | 50 | 1000 | 1.000 | 7.5170 |
| 89 | 15 | euclidean | 30 | 50 | 1000 | 0.999 | 6.8460 |
| 29 | 5 | euclidean | 10 | 50 | 1000 | 0.988 | 6.1690 |
| 35 | 5 | euclidean | 30 | 50 | 1000 | 0.988 | 5.1124 |
| 53 | 5 | manhattan | 30 | 50 | 1000 | 0.979 | 4.0779 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 25 | 5 | euclidean | 10 | 5 | 1000 | 0.521 | 5.6884 |
| 43 | 5 | manhattan | 10 | 5 | 1000 | 0.498 | 4.7963 |
| 19 | 5 | euclidean | 5 | 5 | 1000 | 0.398 | 4.6312 |
| 1 | 5 | angular | 5 | 5 | 1000 | 0.360 | 4.7152 |


The subsample size has the same effect here as in the Wikipedia dataset, so we again use the larger subsample size, 1000.
As with the Wikipedia data, most of the top accuracies are reached with num_trees = 30 and num_neighbours = 50.

**Optimal parameters for the Barcelona dataset**

In [25]:
df_results.loc[143]

num_features                30
metric               euclidean
num_trees                   30
num_neighbours              50
subsample_size            1000
accuracy                   1.0
time_for_pipeline        7.517
Name: 143, dtype: object

**Time for GridSearch (Barcelona): 14.3 min**

# Apply the best Barcelona parameters to the Wikipedia dataset

In [26]:
df_results_wiki = pd.read_csv('/content/results_gridsearch_wikipedia.csv')
df_results_barcelona = pd.read_csv('/content/results_gridsearch_barcelona.csv')

## Example 1

Let's compare the two accuracies for the Wikipedia data: \
1 - accuracy for the best parameters tuned on the Barcelona dataset \
2 - accuracy for the best parameters tuned on the Wikipedia dataset

In [56]:
print('Wikipedia data with BARCELONA BEST parameters')
df_results_wiki[(df_results_wiki['num_features']==30) &
                (df_results_wiki['num_trees']==30) &
                (df_results_wiki['num_neighbours']==50) &
                (df_results_wiki['subsample_size']==1000) &
                (df_results_wiki['metric']=='euclidean')]

Wikipedia data with BARCELONA BEST parameters


| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 215 | 30 | euclidean | 30 | 50 | 1000 | 0.991 | 2.2193 |


In [57]:
print('Wikipedia data with WIKIPEDIA BEST parameters')
df_results_wiki[(df_results_wiki['num_features']==15) &
                (df_results_wiki['num_trees']==30) &
                (df_results_wiki['num_neighbours']==50) &
                (df_results_wiki['subsample_size']==1000) &
                (df_results_wiki['metric']=='euclidean')]

Wikipedia data with WIKIPEDIA BEST parameters


| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 134 | 15 | euclidean | 30 | 50 | 1000 | 0.999 | 2.1558 |


The accuracy with the Barcelona parameters is lower (0.991 vs. 0.999), but the difference is insignificant.

So let's look at another example.

## Example 2

In [54]:
print('Barcelona GOOD parameters')
df_results_barcelona[(df_results_barcelona['num_features']==15) &
                (df_results_barcelona['num_trees']==30) &
                (df_results_barcelona['num_neighbours']==10) &
                (df_results_barcelona['subsample_size']==1000) &
                (df_results_barcelona['metric']=='euclidean')]

Barcelona GOOD parameters


| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 87 | 15 | euclidean | 30 | 10 | 1000 | 0.943 | 4.0774 |


In [55]:
print('Wikipedia data with Barcelona GOOD parameters')
df_results_wiki[(df_results_wiki['num_features']==15) &
                (df_results_wiki['num_trees']==30) &
                (df_results_wiki['num_neighbours']==10) &
                (df_results_wiki['subsample_size']==1000) &
                (df_results_wiki['metric']=='euclidean')]

Wikipedia data with Barcelona GOOD parameters


| index | num_features | metric | num_trees | num_neighbours | subsample_size | accuracy | time_for_pipeline |
|---|---|---|---|---|---|---|---|
| 131 | 15 | euclidean | 30 | 10 | 1000 | 0.789 | 1.9276 |


**This compares parameters that give 94% accuracy on the Barcelona dataset but only 79% accuracy on the Wikipedia dataset. \
This example shows more clearly that parameters suitable for the Barcelona dataset may not be suitable for the Wikipedia dataset.**
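
The comparison can be summarized by joining the two result tables on their shared parameters. The sketch below does this with plain dictionaries; the accuracy values are copied from the tables in this notebook (subsample_size = 1000, metric = euclidean), while the helper name `accuracy_gaps` and the tuple key layout are assumptions made for illustration.

```python
# Key: (num_features, num_trees, num_neighbours); values are accuracies from the result tables
wiki = {(30, 30, 50): 0.991, (15, 30, 50): 0.999, (15, 30, 10): 0.789}
barcelona = {(30, 30, 50): 1.000, (15, 30, 50): 0.999, (15, 30, 10): 0.943}

def accuracy_gaps(first, second):
    """Per shared parameter set: accuracy in `first` minus accuracy in `second`."""
    return {
        params: round(first[params] - second[params], 3)
        for params in first
        if params in second
    }

gaps = accuracy_gaps(barcelona, wiki)
# Largest gap first
for params, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(params, gap)
```

The gap is close to zero for the two settings with 50 neighbours and grows to about 0.15 for the setting with 10 neighbours, which is the case discussed in Example 2.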