#### Names of people in the group

Please write the names of the people in your group in the next cell.

Name of person A Vegard Vaeng Bernhardsen

Name of person B None

In [0]:
# We need to install 'ipython_unittest' to run unittests in a Jupyter notebook
!pip install -q ipython_unittest



In [0]:
# Loading modules that we need
from collections import Counter
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import array, arrays_zip, avg, col, collect_list, collect_set, desc, explode, expr, lit, monotonically_increasing_id, regexp_replace, size, split, sqrt
import math
from itertools import combinations




In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

#### Subtask 1: implementing two functions
Implement these two functions:
1. 'compute_pearsons_r' that receives a DataFrame and two column names and returns the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between values of two columns;
2. 'make_tag_graph' that receives the DataFrame containing the records related to 'questions' and returns a DataFrame with two columns 'u' and 'v'; row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.

Please note that you should implement the 'compute_pearsons_r' yourself, so you should not use the 'DataFrame.stat.corr' method. Nevertheless, you can use 'DataFrame.stat.corr' to verify the correctness of your implementation.
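
For sanity-checking on small inputs, the same coefficient can be computed in plain Python. The sketch below is only a reference for comparison, not the Spark implementation the task asks for; the name `pearsons_r_reference` and the sample lists are made up for illustration. It divides the sum of products of deviations from the means by the product of the square roots of the summed squared deviations.

```python
import math
from typing import Sequence


def pearsons_r_reference(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r for two equally long sequences (hypothetical helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    denom_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    denom_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if denom_x == 0 or denom_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return numerator / (denom_x * denom_y)


# A perfectly linear relationship should give a value very close to 1.0,
# and a perfectly decreasing one a value very close to -1.0
print(pearsons_r_reference([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearsons_r_reference([1, 2, 3, 4], [8, 6, 4, 2]))
```

Because of floating-point rounding, compare results with a tolerance (as the tests in this notebook do) rather than with `==`.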

In [0]:
def compute_pearsons_r(df: "a DataFrame", col1: "name of column A", col2: "name of column B") -> float:
    # Calculate the means
    mean_x, mean_y = df.select(avg(col(col1)).alias('mean_x'), avg(col(col2)).alias('mean_y')).first()

    # Calculate the components of the Pearson correlation coefficient formula
    numerator = df.select(((col(col1) - mean_x) * (col(col2) - mean_y)).alias('product')).groupBy().sum().collect()[0][0]
    denominator_x = math.sqrt(df.select(((col(col1) - mean_x) ** 2).alias('squared_diff_x')).groupBy().sum().collect()[0][0])
    denominator_y = math.sqrt(df.select(((col(col2) - mean_y) ** 2).alias('squared_diff_y')).groupBy().sum().collect()[0][0])

    r = numerator / (denominator_x * denominator_y)
    return r    

In [0]:
def make_tag_graph(df):
    # Explode tags into separate rows
    df_exploded = df.withColumn("Tag", explode(split(expr("regexp_replace(Tags, '^<|>$', '')"), "><")))
    
    # Convert Tags column to an array
    df_exploded = df_exploded.withColumn("Tags", split(col("Tags"), "><"))
    
    # Group by question ID and collect all tags into an array
    grouped_df = df_exploded.groupby("Id").agg(expr("collect_list(Tag) as Tags"))
    
    # Generate all combinations of tags within each group
    pairs_df = grouped_df.selectExpr("Id", "Tags").rdd.flatMap(lambda row: 
        [(row.Tags[i], row.Tags[j]) for i in range(len(row.Tags)) for j in range(len(row.Tags))]) \
        .toDF(["u", "v"])
    
    # Drop the self-pairs (u == v) produced when i == j
    pairs_df = pairs_df.filter("u != v")
    
    # Keep only the questions with a single tag and represent each one as a self-pair (tag, tag)
    single_tag_pairs = df_exploded.filter(expr("size(Tags) == 1")).select("Tag", "Tag").toDF("u", "v")
    
    all_pairs_df = pairs_df.union(single_tag_pairs)
    
    return all_pairs_df
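
To see what the pair generation does for a single question, here is a minimal plain-Python sketch. The helper names `parse_tags` and `tag_pairs` are hypothetical, and the sketch assumes that tags are stored as a string like `<a><b>` and that a question never repeats a tag. It leaves out the (tag, tag) rows that the function above adds for single-tag questions.

```python
from itertools import permutations
from typing import List, Tuple


def parse_tags(raw: str) -> List[str]:
    """Split a raw tag string such as '<a><b>' into ['a', 'b']."""
    stripped = raw.strip()
    if stripped.startswith("<"):
        stripped = stripped[1:]
    if stripped.endswith(">"):
        stripped = stripped[:-1]
    return [tag for tag in stripped.split("><") if tag]


def tag_pairs(raw: str) -> List[Tuple[str, str]]:
    """All ordered pairs (u, v) of different tags on one question."""
    return list(permutations(parse_tags(raw), 2))


# Three tags give 3 * 2 = 6 ordered pairs
print(tag_pairs("<machine-learning><bigdata><libsvm>"))
```

Each unordered pair shows up twice, once as (u, v) and once as (v, u), which matches the rows displayed in the test output of this notebook.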


#### Subtask 2: implementing three functions
Implement these three functions:
1. 'get_nodes' that, given the result from execution of 'make_tag_graph', returns a DataFrame with one column named 'id' that includes the tags that have appeared in the tag graph;
2. 'get_edges' that, given the result from execution of 'make_tag_graph', returns a DataFrame with two columns 'src' and 'dst' where 'src' is the source node and 'dst' is the destination node.


Note that the term 'tag graph' in this context refers to the DataFrame returned by executing 'make_tag_graph'. Furthermore, 'src' and 'dst' are distinct, so 'src' != 'dst'.

In [0]:
def get_nodes(df: DataFrame) -> DataFrame:
    # Extract unique tags from both columns and remove duplicates
    unique_tags_u = df.select(col("u").alias("id")).distinct()
    unique_tags_v = df.select(col("v").alias("id")).distinct()
    
    # Union the two sets of tags and remove duplicates
    nodes = unique_tags_u.union(unique_tags_v).distinct()
    
    return nodes

def get_edges(df: DataFrame) -> DataFrame:
    # Filter out edges where u = v and rename columns
    filtered_edges = df.filter(df['u'] != df['v']) \
                       .select(col('u').alias('src'), col('v').alias('dst'))
    return filtered_edges
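
The relationship between nodes and edges can be illustrated without Spark. This is a small sketch on a hand-made list of pairs; the names `nodes_from_pairs`, `edges_from_pairs` and `sample` are assumptions for illustration only.

```python
from typing import Iterable, List, Set, Tuple


def nodes_from_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Every tag that occurs in either position of a pair."""
    nodes: Set[str] = set()
    for u, v in pairs:
        nodes.add(u)
        nodes.add(v)
    return nodes


def edges_from_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pairs with different endpoints, kept as (src, dst); self-pairs are dropped."""
    return [(u, v) for u, v in pairs if u != v]


sample = [
    ("education", "open-source"),
    ("open-source", "education"),
    ("python", "python"),  # self-pair, as produced for a single-tag question
]
print(sorted(nodes_from_pairs(sample)))
print(edges_from_pairs(sample))
```

A tag that only ever appears in a self-pair is still a node, but it contributes no edge.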



In [0]:
# Loading 'ipython_unittest' so we can use '%%unittest_main' magic command
%load_ext ipython_unittest

The ipython_unittest extension is already loaded. To reload it, use:
  %reload_ext ipython_unittest


#### Subtask 3: validating the implementation by running the tests

Run the cell below and make sure that all the tests run successfully.

In [0]:
%%unittest_main
class TestTask3(unittest.TestCase):
  
  error_threshold = 0.03
  
  def test_corr1(self):
    # Pearson correlation coefficient between 'user reputation' and 'upvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "UpVotes")
    self.assertLessEqual(abs(result-0.5218138310114108), self.error_threshold)
    print(result)
  
  def test_corr2(self):
    # Pearson correlation coefficient between 'user reputation' and 'downvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "DownVotes")
    self.assertLessEqual(abs(result-0.1473558141546844), self.error_threshold)
    print(result)

  def test_corr3(self):
    # Pearson correlation coefficient between 'question score' and the 'number of answers' it received
    result = compute_pearsons_r(posts_df[posts_df["PostTypeId"] == 1], "Score", "AnswerCount")
    self.assertLessEqual(abs(result-0.47855272641249674), self.error_threshold)
    print(result)
    
  def test_make_tag_graph(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    self.assertIsInstance(result, DataFrame)
    
    column_names = Counter(map(str.lower, ['u', 'v']))
    self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)), "Missing column(s) or column name mismatch")
    
    display(result)
    
    self.assertEqual(result.count(), 228830)
    
  def test_get_nodes(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    self.assertEqual(n.count(), 638)
    n.show()

  def test_get_edges(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    e = get_edges(result)
    
    column_names = Counter(map(str.lower, ['src', 'dst']))
    self.assertCountEqual(column_names, Counter(map(str.lower, e.columns)), "Missing column(s) or column name mismatch")
    
    self.assertEqual(e.count(), 225290)
    e.show()
    



0.5218136972418368
0.14735577823040416
0.47530986340467524
+----------------+----------------+
|             src|             dst|
+----------------+----------------+
|       education|     open-source|
|     open-source|       education|
|     data-mining|     definitions|
|     definitions|     data-mining|
|machine-learning|         bigdata|
|machine-learning|          libsvm|
|         bigdata|machine-learning|
|         bigdata|          libsvm|
|          libsvm|machine-learning|
|          libsvm|         bigdata|
|         bigdata|     scalability|
|         bigdata|      efficiency|
|         bigdata|     performance|
|     scalability|         bigdata|
|     scalability|      efficiency|
|     scalability|     performance|
|      efficiency|         bigdata|
|      efficiency|     scalability|
|      efficiency|     performance|
|     performance|         bigdata|
+----------------+----------------+
only showing top 20 rows

+--------------------+
|                  id|
+----

u,v
education,open-source
open-source,education
data-mining,definitions
definitions,data-mining
machine-learning,bigdata
machine-learning,libsvm
bigdata,machine-learning
bigdata,libsvm
libsvm,machine-learning
libsvm,bigdata


Success

......
----------------------------------------------------------------------
Ran 6 tests in 34.913s

OK
Out[97]: <unittest.runner.TextTestResult run=6 errors=0 failures=0>

#### Subtask 4: answering questions about Spark-related concepts

Please write a short description for the terms below---one to two short paragraphs for each term. Don't copy-paste; instead, write your own understanding.

1. What do the terms 'User-Defined Functions (UDFs)', 'Data Locality', 'Bucketing', 'Distributed Filesystem' mean in the context of Spark?

Write your descriptions in the next cell.

Your answers...

User-Defined Functions (UDFs) allow users to extend Spark's built-in features by writing custom functions that operate on the data in Spark DataFrames or Datasets. UDFs are useful for applying custom transformations or computations not covered by Spark's built-in function library. They can be written in languages supported by Spark, such as Python or Java, and can be registered for use directly within Spark SQL queries. However, while they offer flexibility, UDFs can carry a performance penalty compared to native functions, because they cannot take full advantage of Spark's internal optimizations.
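
As a minimal sketch of the idea, the snippet below wraps an ordinary Python function as a UDF. The names `tag_count` and `with_tag_count`, and the default column names, are assumptions chosen for illustration; `udf` and `IntegerType` are standard PySpark.

```python
def tag_count(raw_tags):
    """Count the tags in a raw string such as '<a><b>'; None or '' counts as 0."""
    if not raw_tags:
        return 0
    return raw_tags.count("<")


def with_tag_count(df, tags_col="Tags", out_col="TagCount"):
    """Return df with an extra integer column computed by a Python UDF."""
    # Imported here so that tag_count can be tried without a Spark installation
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    tag_count_udf = udf(tag_count, IntegerType())
    return df.withColumn(out_col, tag_count_udf(col(tags_col)))


# The wrapped function is still plain Python and can be checked on its own
print(tag_count("<machine-learning><bigdata><libsvm>"))
```

In a notebook with a running Spark session, something like `with_tag_count(posts_df)` would add the column. The same count could probably be expressed with built-in functions such as `size` and `split`, which Spark can optimize, so a UDF is best kept for logic the built-ins do not cover.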


Data Locality in Spark refers to the placement of data in relation to the compute resources that execute tasks on that data. Spark aims to minimize network traffic by scheduling tasks on nodes close to the data they need, which improves performance. Data locality has multiple levels, ranging from process-local and node-local to no locality (data is fetched from remote nodes). Optimizing for data locality can significantly lower execution times by reducing the amount of data that has to be transmitted.

Bucketing is a data organization technique in Spark that improves the performance of certain types of queries. The data is divided into a fixed number of buckets based on the value of one or more columns. Bucketing can reduce data shuffling and optimize join operations between large datasets. When the data is bucketed, Spark knows which bucket to look into when performing a join or aggregation on the bucketed columns. This minimizes the need to shuffle data across the cluster. Bucketing is beneficial for large tables that are frequently joined, or for improving the performance of aggregation queries.
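
The idea that equal keys always land in the same bucket can be shown with a toy example in plain Python. This is only an illustration of the principle: in Spark itself, bucketing is requested with `DataFrameWriter.bucketBy` together with `saveAsTable`, and Spark uses its own hash function. The CRC32 hash, the helper names and the sample rows below are stand-ins.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_of(key: str, num_buckets: int) -> int:
    """Deterministic bucket index for a key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows: Iterable[Tuple[str, str]], num_buckets: int) -> Dict[int, List[Tuple[str, str]]]:
    """Group (key, value) rows by the bucket index of their key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return dict(buckets)


users = [("u1", "Ada"), ("u2", "Bo"), ("u3", "Cy")]
posts = [("u1", "post-a"), ("u3", "post-b"), ("u1", "post-c")]

user_buckets = bucketize(users, 4)
post_buckets = bucketize(posts, 4)

# Join bucket by bucket: rows with the same key can only meet in the same bucket index
joined = []
for b in sorted(set(user_buckets) & set(post_buckets)):
    names = dict(user_buckets[b])
    joined.extend((key, names[key], post) for key, post in post_buckets[b] if key in names)

print(sorted(joined))
```

Since both tables are bucketed on the same key with the same number of buckets, bucket i of one table only ever needs to be compared with bucket i of the other, which is why no row has to move between buckets during the join.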

A Distributed Filesystem, in the context of Spark, is a storage system designed to store data across multiple nodes in a cluster. This provides high availability and fault tolerance. Distributed filesystems allow Spark to efficiently process large datasets that are spread across multiple servers, which enables parallel processing and resilience against node failures.