#### Names of people in the group

Thomas Bjerke

Trym Grande

In [0]:
# We need to install 'ipython_unittest' to run unittests in a Jupyter notebook
!pip install -q ipython_unittest



In [0]:
# Loading modules that we need
from pyspark.sql.dataframe import DataFrame
from collections import Counter
from pyspark.sql.functions import desc
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math


In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_dir) -> DataFrame:
    return spark.read.parquet(table_dir)

base_dir = "dbfs:/FileStore/dataframes"
users_df = load_df(f"{base_dir}/users")
posts_df = load_df(f"{base_dir}/posts")
badges_df = load_df(f"{base_dir}/badges")
comments_df = load_df(f"{base_dir}/comments")

print(users_df, posts_df, badges_df, comments_df, sep='\n')

DataFrame[Id: int, Reputation: int, CreationDate: timestamp, DisplayName: string, LastAccessDate: timestamp, AboutMe: string, Views: int, UpVotes: int, DownVotes: int]
DataFrame[Id: int, ParentId: int, PostTypeId: int, CreationDate: timestamp, Score: int, ViewCount: int, Body: string, OwnerUserId: int, LastActivityDate: timestamp, Title: string, Tags: string, AnswerCount: int, CommentCount: int, FavoriteCount: int, CloseDate: timestamp]
DataFrame[UserId: int, Name: string, Date: timestamp, Class: int]
DataFrame[PostId: int, Score: int, Text: string, CreationDate: timestamp, UserId: int]


In [0]:
# A helper function that replaces NaN entries (from null values) with 0, in place, before a column is used
def clean_array(arr):
  for i in range(len(arr)):
    if math.isnan(arr[i]):
      arr[i] = 0
  return arr

#### Subtask 1: implementing two functions
Implement these two functions:
1. 'compute_pearsons_r' that receives a DataFrame and two column names and returns the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the values of the two columns;
2. 'make_tag_graph' that receives a DataFrame containing the records related to 'questions' and returns a DataFrame with two columns 'u' and 'v'; row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.

Please note that you should implement the 'compute_pearsons_r' yourself, so you should not use the 'DataFrame.stat.corr' method. Nevertheless, you can use 'DataFrame.stat.corr' to verify the correctness of your implementation.
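
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The following is a minimal sketch on plain Python lists, not the implementation used in this notebook; the name `pearson_r_sketch` and the sample data are made up for illustration.

```python
import math

def pearson_r_sketch(xs, ys):
    """Pearson's r for two equally long sequences of numbers (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up sample data
xs = [1, 2, 3, 4, 5]
print(pearson_r_sketch(xs, [2, 4, 6, 8, 10]))  # perfectly linear, increasing: 1.0
print(pearson_r_sketch(xs, [10, 8, 6, 4, 2]))  # perfectly linear, decreasing: -1.0
```

On a Spark DataFrame, 'DataFrame.stat.corr' can serve as the cross-check mentioned above.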

In [0]:


def pearson_correlation(independent, dependent):
    # covariance
    independent_mean = _arithmetic_mean(independent)
    dependent_mean = _arithmetic_mean(dependent)
    products_mean = _mean_of_products(independent, dependent)
    covariance = products_mean - (independent_mean * dependent_mean)
    
    # standard deviations of independent values
    independent_standard_deviation = _standard_deviation(independent)

    # standard deviations of dependent values
    dependent_standard_deviation = _standard_deviation(dependent)

    # Pearson Correlation Coefficient
    rho = covariance / (independent_standard_deviation * dependent_standard_deviation)

    return rho


def _arithmetic_mean(data):
    """
    Total / count: the everyday meaning of "average"
    """
    total = 0
    for i in data:
        total += i
    return total / len(data)


def _mean_of_products(data1, data2):
    """
    The mean of the products of the corresponding values of bivariate data
    """
    total = 0
    for i in range(0, len(data1)):
        total += (data1[i] * data2[i])
    return total / len(data1)


def _standard_deviation(data):
    """
    A measure of how individual values typically differ from the mean of the data.
    The square root of the variance.
    """
    squares = []
    for i in data:
        squares.append(i ** 2)

    mean_of_squares = _arithmetic_mean(squares)
    mean_of_data = _arithmetic_mean(data)
    square_of_mean = mean_of_data ** 2
    variance = mean_of_squares - square_of_mean
    std_dev = math.sqrt(variance)

    return std_dev

      
def compute_pearsons_r(df: DataFrame, col1: str, col2: str) -> float:
  """
  Receives a DataFrame and two column names and returns the Pearson correlation coefficient between values of two columns
  """
  df_pd = df.toPandas()
  # Cast to float so that squaring large integers cannot overflow, then replace NaN with 0
  x_simple = clean_array(np.array(df_pd[col1], dtype=float))
  y_simple = clean_array(np.array(df_pd[col2], dtype=float))
  rho = pearson_correlation(x_simple, y_simple)
  return rho

2. 'make_tag_graph' that receives a DataFrame containing the records related to 'questions' and returns a DataFrame with two columns 'u' and 'v'; row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.
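
The pairing step can be sketched on a single tag string with the standard library. This is an illustration under assumptions, not the notebook's implementation: `tag_pairs_sketch` is a hypothetical name, and unlike the 'combine' helper in the next cell, it returns no pair at all for a question with a single tag.

```python
from itertools import permutations

def tag_pairs_sketch(tags: str):
    """Ordered pairs of distinct tags from a string like "<a><b><c>" (hypothetical helper)."""
    tag_list = tags[1:-1].split("><")
    return list(permutations(tag_list, 2))

pairs = tag_pairs_sketch("<machine-learning><education><open-source>")
print(len(pairs))  # 3 tags give 3 * 2 = 6 ordered pairs
print(pairs[0])    # ('machine-learning', 'education')
```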

In [0]:
from pyspark.sql import Row

def combine(tag_list: str):
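  """
  Input format: "<machine-learning><education><open-source>"
  Output format: every ordered pair of distinct tags, e.g. [["machine-learning", "education"], ["machine-learning", "open-source"], ["education", "machine-learning"], ["education", "open-source"], ["open-source", "machine-learning"], ["open-source", "education"]]
  A single-tag input yields one self-pair, e.g. [["databases", "databases"]]
  """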
  # put tags into string array
  tag_list = tag_list[1:-1].split('><')
  
  # enumerate all combinations
  combined_tag_list = []
  for i, tag1 in enumerate(tag_list):
    for j, tag2 in enumerate(tag_list):
      # only pair a tag with itself if it is the only tag of the question
      if len(tag_list) != 1 and i == j:
        continue
      combined_tag_list.append([tag1, tag2])
  
  return combined_tag_list

def make_tag_graph(df: DataFrame) -> DataFrame:
    """
    Receives the DataFrame containing the records related to 'questions' and returns a DataFrame with two columns 'u' and 'v'; row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.
    """
    # get tags column from dataframe and convert to list using flatMap
    tags_list = (df.select('Tags').rdd.flatMap(lambda x: x).collect())

    # convert each list into list of combinations
    data = []
    for tag_list in tags_list:
      combinations = combine(tag_list)
      for combination in combinations:
        data.append(combination)

    # create new dataframe with the new modified data
    columns = ['u', 'v']
    result_df = spark.createDataFrame(data, columns)

    return result_df  

In [0]:
# Importing GraphFrames graph library
from graphframes import *

#### Subtask 2: implementing three functions
Implement these three functions:
1. 'get_nodes' that, given the result of executing 'make_tag_graph', returns a DataFrame with one column named 'id' that includes the tags that have appeared in the tag graph;
2. 'get_edges' that, given the result of executing 'make_tag_graph', returns a DataFrame with two columns 'src' and 'dst' where 'src' is the source node and 'dst' is the destination node;
3. 'compute_pagerank' that receives a GraphFrames graph object and computes the PageRank for the nodes in the graph and returns the result as a DataFrame with two columns named 'id' and 'pagerank'; the rows in the resulting DataFrame should be sorted by the values of the 'pagerank' column.

Note that the term 'tag graph' in this context refers to the DataFrame returned by executing 'make_tag_graph'. Furthermore, 'src' and 'dst' are distinct, so 'src' != 'dst'.
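
To make the three steps concrete, here is a minimal pure-Python sketch on a made-up tag graph: distinct tags become nodes, self-pairs are dropped from the edges, and PageRank is approximated by power iteration. The helper `pagerank_sketch`, its defaults, and the sample data are assumptions for illustration; this is not how GraphFrames computes its result, and the scores in the notebook's output are on a different scale, so only the ordering is comparable.

```python
def pagerank_sketch(nodes, edges, reset_probability=0.15, iterations=20):
    """Plain power-iteration PageRank; ranks sum to 1 (hypothetical helper)."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node first receives its share of the "random jump"
        new_ranks = {node: reset_probability / n for node in nodes}
        for node in nodes:
            # Nodes without outgoing edges spread their rank over all nodes
            targets = out_links[node] or nodes
            share = (1 - reset_probability) * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Made-up tag graph in the (u, v) format of 'make_tag_graph'
tag_graph = [("python", "pandas"), ("pandas", "python"),
             ("python", "numpy"), ("numpy", "python"),
             ("r", "r")]  # self-pair from a single-tag question

nodes = sorted({u for u, _ in tag_graph})         # distinct tags -> 'id'
edges = [(u, v) for u, v in tag_graph if u != v]  # no self-pairs -> 'src', 'dst'
ranks = pagerank_sketch(nodes, edges)
for tag, rank in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(tag, round(rank, 3))
```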

In [0]:
def get_nodes(df: DataFrame) -> DataFrame:
  """
  Input: DataFrame of the tag graph
  Returns a DataFrame with one column named 'id' that includes the tags that have appeared in the tag graph
  """
  # collect the 'u' column to the driver as a flat list of tags
  tags = (df.select('u').rdd.flatMap(lambda x: x).collect())

  # keep each tag once, in first-seen order
  data = [[tag] for tag in dict.fromkeys(tags)]

  # create new dataframe with the new modified data
  columns = ['id']
  result_df = spark.createDataFrame(data, columns)
  return result_df


def get_edges(df: DataFrame) -> DataFrame:
  """
  Input: DataFrame of the tag graph
  Returns a DataFrame with two columns 'src' and 'dst' where 'src' is the source node and 'dst' is the destination node.
  """
  # get u and v column from dataframe and convert to list
  src = (df.select('u').rdd.flatMap(lambda x: x).collect())
  dst = (df.select('v').rdd.flatMap(lambda x: x).collect())
  
  # only keep edges between distinct nodes (drops the self-pairs of single-tag questions)
  data = []
  for u, v in zip(src, dst):
    if u != v:
      data.append([u, v])
  
  # create new dataframe with the new modified data
  columns = ['src', 'dst']
  result_df = spark.createDataFrame(data, columns)
  return result_df


def compute_pagerank(graph) -> DataFrame:
  """
  Input: a GraphFrames graph
  Computes the PageRank for the nodes in the graph and returns the result as a DataFrame with two columns named 'id' and 'pagerank'. The rows in the resulting DataFrame are sorted by the values of the 'pagerank' column in descending order.
  """
  pagerank = graph.pageRank(resetProbability=0.01, maxIter=20)
  pagerank = pagerank.vertices.select("id", "pagerank")
  pagerank_sorted = pagerank.sort("pagerank", ascending=False)
  pagerank_sorted.show()
  return pagerank_sorted

In [0]:
# Loading 'ipython_unittest' so we can use '%%unittest_main' magic command
%load_ext ipython_unittest

The ipython_unittest extension is already loaded. To reload it, use:
  %reload_ext ipython_unittest


#### Subtask 3: validating the implementation by running the tests

Run the cell below and make sure that all the tests run successfully.

In [0]:
%%unittest_main
class TestTask3(unittest.TestCase):
  
  error_threshold = 0.03
  
  def test_corr1(self):
    # Pearson correlation coefficient between 'user reputation' and 'upvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "UpVotes")
    self.assertLessEqual(abs(result-0.5218138310114108), self.error_threshold)
    print(result)
  
  def test_corr2(self):
    # Pearson correlation coefficient between 'user reputation' and 'downvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "DownVotes")
    self.assertLessEqual(abs(result-0.1473558141546844), self.error_threshold)
    print(result)

  def test_corr3(self):
    # Pearson correlation coefficient between 'question score' and the 'number of answers' it received
    result = compute_pearsons_r(posts_df[posts_df["PostTypeId"] == 1], "Score", "AnswerCount")
    self.assertLessEqual(abs(result-0.47855272641249674), self.error_threshold)
    print(result)
    
  def test_make_tag_graph(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    self.assertIsInstance(result, DataFrame)
    column_names = Counter(map(str.lower, ['u', 'v']))
    self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)), "Missing column(s) or column name mismatch")
    display(result)
    
    self.assertEqual(result.count(), 228830)
  
  def test_get_nodes(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    self.assertEqual(n.count(), 638)
    n.show()

  def test_get_edges(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    e = get_edges(result)
    
    column_names = Counter(map(str.lower, ['src', 'dst']))
    self.assertCountEqual(column_names, Counter(map(str.lower, e.columns)), "Missing column(s) or column name mismatch")
    
    self.assertEqual(e.count(), 225290)
    e.show()

  def test_compute_pagerank(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    e = get_edges(result)
    g = GraphFrame(n, e)
    ranks = compute_pagerank(g)
    self.assertEqual(ranks.first()[0], 'machine-learning')
    ranks.show()

+-------------------+------------------+
|                 id|          pagerank|
+-------------------+------------------+
|   machine-learning| 65.06445513354025|
|             python|39.023855012958336|
|      deep-learning|30.549353367012298|
|     neural-network|27.246934955987715|
|     classification|19.055424476620466|
|              keras|18.421465374723937|
|       scikit-learn|13.619594750776328|
|         tensorflow|13.611645023685123|
|                nlp|13.057848947784063|
|        time-series| 9.635532199634097|
|                cnn| 8.856429613348013|
|         regression| 8.839562616414858|
|            dataset| 8.058252685638138|
|        data-mining| 7.827399356930459|
|               lstm| 7.487786736747454|
|                  r| 7.468252379232548|
|predictive-modeling| 7.415435637130367|
|         clustering| 7.293056968062318|
|             pandas| 6.661418879160326|
|         statistics| 6.150440201838776|
+-------------------+------------------+
only showing top

u,v
machine-learning,machine-learning
education,open-source
open-source,education
data-mining,definitions
definitions,data-mining
databases,databases
machine-learning,bigdata
machine-learning,libsvm
bigdata,machine-learning
bigdata,libsvm


#### Subtask 4: answering questions about Spark-related concepts

Please write a short description for the terms below---one to two short paragraphs for each term. Don't copy-paste; instead, write your own understanding.

1. What do the terms 'User-Defined Functions (UDFs)', 'Data Locality', 'Bucketing', 'Distributed Filesystem' mean in the context of Spark?

Write your descriptions in the next cell.

# Spark-related terms
## User-Defined Functions (UDFs)
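User-Defined Functions let the user extend Spark with custom column-level logic. As with function declarations in other programming languages, the user defines their own input-to-output relationship; Spark then wraps the function as a UDF and applies it to DataFrame columns row by row, like a built-in function. A UDF can also be registered under a name and used in Spark SQL queries. Spark treats UDFs as deterministic by default, so a function with a random component should be explicitly marked with 'asNondeterministic()'; otherwise the optimizer may eliminate duplicate invocations or invoke the function more often than it appears in the query.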
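
As a minimal sketch of the idea, assuming the 'Tags' column format used earlier in this notebook: `count_tags` is an ordinary Python function, and `with_tag_count` wraps it as a UDF. Both names and the 'TagCount' column are hypothetical and chosen only for illustration.

```python
def count_tags(tags):
    """Number of tags in a string like "<python><pandas>"; empty or None counts as 0."""
    if not tags:
        return 0
    return len(tags[1:-1].split("><"))

def with_tag_count(df):
    """Add a 'TagCount' column (hypothetical name) by applying count_tags as a UDF."""
    # Imported here so the plain function above can be tried without Spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    count_tags_udf = udf(count_tags, IntegerType())
    return df.withColumn("TagCount", count_tags_udf(df["Tags"]))

print(count_tags("<machine-learning><education><open-source>"))  # 3
```

In this notebook, `with_tag_count(posts_df).select("Tags", "TagCount").show(5)` would apply it; `spark.udf.register` makes such a function callable by name from Spark SQL.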

## Data Locality
Data locality is a general concept in computer science that also applies to Spark. Spark is a distributed system, which means it consists of several nodes, often different computers, so communication between the nodes incurs network latency. Spark therefore tries to run the code as close to the data as possible in order to avoid sending data back and forth over the network. Usually, data is larger than code, so the code is sent to the node that holds the data rather than the other way around. Locality comes in several levels. "Process local" is the best case: the data is in the same executor process as the running code. "Node local" means that the data is on the same node but outside the running process, e.g. the same computer but a different executor. "Rack local" means that the code and the data are on different nodes in the same rack, so the data has to cross a switch and a network cable in between.
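
The levels can be illustrated with a toy classifier. This is a sketch under assumptions, not Spark's scheduler: the `Location` type, the `locality_level` function, and the sample locations are made up; only the level names follow Spark's terminology.

```python
from typing import NamedTuple

class Location(NamedTuple):
    """Where a piece of data or a task slot lives (toy model, not a Spark class)."""
    rack: str
    node: str
    process: str

def locality_level(data: Location, task: Location) -> str:
    """Classify how close a task would run to its data, best level first."""
    if data == task:
        return "PROCESS_LOCAL"
    if (data.rack, data.node) == (task.rack, task.node):
        return "NODE_LOCAL"
    if data.rack == task.rack:
        return "RACK_LOCAL"
    return "ANY"

data = Location("rack1", "node1", "executor1")
print(locality_level(data, Location("rack1", "node1", "executor1")))  # PROCESS_LOCAL
print(locality_level(data, Location("rack1", "node1", "executor2")))  # NODE_LOCAL
print(locality_level(data, Location("rack1", "node2", "executor3")))  # RACK_LOCAL
print(locality_level(data, Location("rack2", "node5", "executor9")))  # ANY
```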

## Bucketing
Bucketing is an optimization technique for Spark SQL tables. Spark makes use of bucketed tables by default, but the user has to create them explicitly when writing a table. It is a pre-processing technique: rows are assigned to a fixed number of buckets based on a hash of one or more columns, and the table is stored in that layout. Joins and aggregations on the bucketing columns then depend less on "shuffling" data around, allowing for better runtime. Bucketing does, however, take time to compute, meaning it is only beneficial if the tables are used more than once. Specifically, it is time efficient if the total runtime saved during all future executions exceeds the time it takes to do the bucketing, or if time spent during pre-processing is less important than time spent during runtime.
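
The core idea, that equal keys always land in the same bucket, can be sketched in plain Python. The assumptions here are loud ones: CRC32 stands in for the hash Spark actually uses, and the bucket count, the sample keys, and the table name 'posts_bucketed' are made up. `write_bucketed` shows how a bucketed table could be written with the DataFrame writer, but it is not called here.

```python
import zlib

NUM_BUCKETS = 4  # made-up bucket count

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Bucket number for a key; CRC32 is a stand-in, not the hash Spark uses."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

# Made-up join keys from two tables
post_owner_ids = [11, 42, 42, 7]
user_ids = [7, 11, 42]

# Group the post rows by bucket, as a bucketed table would store them
post_buckets = {}
for owner_id in post_owner_ids:
    post_buckets.setdefault(bucket_of(owner_id), []).append(owner_id)

# A user row only needs to look at the post bucket with the same number,
# so matching rows can be joined bucket by bucket without a shuffle.
for user_id in user_ids:
    bucket = bucket_of(user_id)
    matches = [o for o in post_buckets.get(bucket, []) if o == user_id]
    print(user_id, "-> bucket", bucket, "matching post rows:", matches)

def write_bucketed(df, table_name="posts_bucketed"):
    """Sketch of the Spark side; table name and bucket count are made up."""
    (df.write
       .bucketBy(NUM_BUCKETS, "OwnerUserId")
       .sortBy("OwnerUserId")
       .mode("overwrite")
       .saveAsTable(table_name))
```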

## Distributed Filesystem
Spark does not come with its own storage layer; it reads from and writes to a distributed file system, commonly the Hadoop Distributed File System (HDFS). This type of file system has many benefits compared to a regular file system. The main one is being able to store large files spread across many nodes in a network: a file is split into blocks that are distributed over the cluster, so a single file can be much larger than any one disk. Another benefit is replication, where each block is copied to several nodes. Keeping replicas with both high and low data locality allows for storage that is both fast and fault tolerant, because data can be read quickly from a nearby replica, and if an accident destroys an entire rack, the data is still available on the node(s) with lower locality.
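
Block splitting and replication can be sketched with a toy placement function. This is an illustration under assumptions: `place_blocks`, the node names, and the round-robin placement are made up (HDFS uses a rack-aware policy instead), while the 128 MB block size and three replicas mirror common HDFS defaults.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification; it needs replication <= len(nodes).
    """
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement

MB = 1024 * 1024
placement = place_blocks(file_size=300 * MB, block_size=128 * MB,
                         nodes=["node1", "node2", "node3", "node4"])
for block, replicas in placement.items():
    print(block, replicas)
```

With these made-up numbers, the 300 MB file becomes three blocks, and losing any single node still leaves two replicas of every block.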