#### Names of people in the group

Please write the names of the people in your group in the next cell.

Name of person A: Vegard Vaeng Bernhardsen

Name of person B: None

In [0]:
# Loading modules that we need
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Add your imports below this line
from pyspark.sql import DataFrame  # used in the return annotation of load_df

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame
def load_df(table_name: str) -> DataFrame:
    """Load the Parquet table stored at the DBFS path `table_name`."""
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

# Uncomment if you need
# comments_df = load_df("/user/hive/warehouse/comments")
# badges_df = load_df("/user/hive/warehouse/badges")

#### The Problem: Mining the Interests of Experts

In [0]:
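# Answers are posts with PostTypeId == 2; questions have PostTypeId == 1
answers_df = posts_df.filter(posts_df.PostTypeId == 2)

# Attach each answer to the user who wrote it
user_answers_df = answers_df.join(users_df, answers_df.OwnerUserId == users_df.Id)

# Keep only the question id and its tag string, renamed to avoid column clashes in the join below
questions_df = posts_df.filter(posts_df.PostTypeId == 1).selectExpr("Id as QuestionId", "Tags as QuestionTags")

# Join to get the tags of the questions each user has answered
user_tags_df = user_answers_df.join(questions_df, user_answers_df.ParentId == questions_df.QuestionId)

# One row per (answer, tag). The split assumes tags are separated by "><"; if the raw string
# is wrapped as "<a><b>", the first and last tags keep a stray bracket.
user_tags_df = user_tags_df.withColumn("ExplodedTag", explode(split(col("QuestionTags"), "><")))

# Interest diversity: distinct tags answered per user, normalized by the total number of unique tags (638)
user_interest_diversity = user_tags_df.groupBy("OwnerUserId").agg(
    (countDistinct("ExplodedTag") / 638).alias("InterestDiversity")
)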



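# Attach the diversity score to each user; the inner join drops users without any answers
final_df = users_df.join(user_interest_diversity, users_df.Id == user_interest_diversity.OwnerUserId)

# Calculate the Pearson correlation coefficient between reputation and interest diversity
correlation = final_df.stat.corr("Reputation", "InterestDiversity")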
print(f"Pearson Correlation Coefficient: {correlation}")


Pearson Correlation Coefficient: 0.7534664108553488
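
The diversity metric computed above can be illustrated on a few made-up rows without Spark. The sketch below is a plain-Python stand-in, not the notebook's Spark pipeline: the helpers `parse_tags` and `interest_diversity` and the sample data are hypothetical, and it assumes tag strings are wrapped in angle brackets, as in `<a><b>`. Under that assumption it strips the outer brackets before splitting on `><`, so that `<python` and `python` are not counted as two different tags.

```python
from collections import defaultdict


def parse_tags(tag_string):
    """Split a tag string such as "<python><spark>" into ["python", "spark"]."""
    if not tag_string:
        return []
    # Assumption: tags are wrapped in angle brackets. Strip the outer pair so
    # the first and last tags do not keep a stray "<" or ">".
    stripped = tag_string.strip()
    if stripped.startswith("<"):
        stripped = stripped[1:]
    if stripped.endswith(">"):
        stripped = stripped[:-1]
    return [tag for tag in stripped.split("><") if tag]


def interest_diversity(answered, total_unique_tags):
    """Map each user id to (distinct tags answered) / (total unique tags)."""
    tags_per_user = defaultdict(set)
    for user_id, tag_string in answered:
        tags_per_user[user_id].update(parse_tags(tag_string))
    return {user: len(tags) / total_unique_tags for user, tags in tags_per_user.items()}


# Made-up (user id, tags of the answered question) pairs
sample = [
    (1, "<python><pandas>"),
    (1, "<python><spark>"),
    (2, "<statistics>"),
]
diversity = interest_diversity(sample, total_unique_tags=4)
print(diversity)  # {1: 0.75, 2: 0.25}
```

User 1 answered questions covering three distinct tags out of four, user 2 only one, which mirrors what `countDistinct("ExplodedTag") / 638` does per `OwnerUserId`.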


In [0]:
print("""Given the Pearson correlation coefficient of 0.753, I think there is a strong positive linear relationship between a user's expertise level, indicated by their reputation, and their diversity of interest, measured by the number of distinct tags on the questions they have answered. I think this means that users with a high reputation, or experts, tend to have general interests rather than specific ones.""")

Given the Pearson correlation coefficient of 0.753, I think there is a strong positive linear relationship between a user's expertise level, indicated by their reputation, and their diversity of interest, measured by the number of distinct tags on the questions they have answered. I think this means that users with a high reputation, or experts, tend to have general interests rather than specific ones.
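
For reference, a Pearson coefficient is the covariance of two columns divided by the product of their standard deviations, so it ranges from -1 to 1 and only captures linear association. The following is a minimal pure-Python sketch of that formula; the `pearson` helper and the numbers are illustrative assumptions, not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variance cancel, so plain sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up reputation and interest-diversity values for four users
reputation = [10, 50, 200, 1000]
interest = [0.01, 0.02, 0.05, 0.20]
r = pearson(reputation, interest)
print(f"Pearson Correlation Coefficient: {r}")
```

A value close to 1 means the points lie near an increasing straight line; it says nothing about which variable drives the other.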
