#### Names of people in the group

Thomas Bjerke

Trym Grande

In [0]:
from pyspark.sql import DataFrame  # used in the type annotations below
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

In [0]:
# Helper that loads a table (stored in Parquet format) from DBFS as a Spark DataFrame
def load_df(table_name: str) -> DataFrame:
    return spark.read.parquet(table_name)
  
users_df = load_df("dbfs:/FileStore/dataframes/users")
posts_df = load_df("dbfs:/FileStore/dataframes/posts")
comments_df = load_df("dbfs:/FileStore/dataframes/comments")
badges_df = load_df("dbfs:/FileStore/dataframes/badges")

#### The problem: mining the interests of experts

The primary role of a question-and-answer platform such as Stack Exchange is to connect two types of people: people who have questions in areas such as computer science or data science, and knowledgeable people who can answer those questions reliably. Let's call the first category 'knowledge seekers' and the second 'expert users', or 'experts' for short.

Here we want to use our data to answer a question about the diversity of topics that experts are interested in: do expert users only answer questions in a specific set of topics, or do their interests span a wide variety of topics?

To answer the above question, we will compute the correlation between a user's expertise level and the diversity of topics of questions they have answered. The first step is to define two variables (or measures); first for 'user expertise level' and then for 'user interest diversity'. Then we will use the Pearson correlation coefficient to measure the linear correlation between the two variables. We define the variables as:

   - VariableA (the measure of user expertise level). We will use the 'Reputation' column from the 'users' table, which according to Stack Exchange's documentation "is a rough measurement of how much the community trusts you; it is earned by convincing your peers that you know what you're talking about", as an indicator of a user's expertise level on the platform.

   - VariableB (the measure of user interest diversity). We measure the diversity of a user's interests as the number of distinct tags associated with the questions the user has answered, divided by the total number of unique tags, which is 638.
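
As an illustration of the VariableB definition, here is a minimal plain-Python sketch on made-up data. The function name `interest_diversity` and the example tags are hypothetical and not part of the dataset; only the constant 638 comes from the text above.

```python
TOTAL_TAG_COUNT = 638  # total number of unique tags, as stated in the definition


def interest_diversity(answered_question_tags, total_tags=TOTAL_TAG_COUNT):
    """Return the fraction of all tags that appear on a user's answered questions.

    answered_question_tags: iterable of tag lists, one list per answered question.
    """
    distinct = set()
    for tags in answered_question_tags:
        distinct.update(tags)  # a set keeps each tag only once
    return len(distinct) / total_tags


# Hypothetical user who answered three questions (tags are invented).
example = [["python", "pandas"], ["python", "nlp"], ["nlp"]]
score = interest_diversity(example)  # 3 distinct tags out of 638
print(score)
```

A user who has answered no tagged questions gets a score of 0, and a user whose answers cover every tag would get 1.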

Compute the Pearson correlation coefficient between VariableA and VariableB, and based on the result you've got, answer the following question: 

     Do expert users have specific interests or do they have general interests?

Please explain your reasoning on how you reached your answer.

You should use Apache Spark API for your implementation. You can use the Spark implementation of the Pearson correlation coefficient.
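
For reference, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below spells that out in plain Python on invented numbers; it is only meant to make the formula concrete, and the notebook itself relies on Spark's built-in implementation. The helper name `pearson` and the sample values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values, not taken from the dataset.
reputation = [1, 101, 215, 1102, 2782]
diversity = [0.001, 0.003, 0.005, 0.010, 0.030]
r = pearson(reputation, diversity)
print(round(r, 3))
```

The coefficient lies between -1 and 1: values near 1 mean the two variables increase together along a line, values near -1 mean one decreases as the other increases, and values near 0 mean there is no linear relationship.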

### Calculate variable A

In [0]:
variable_a = users_df.select('Id', 'Reputation')
variable_a.show()

+---+----------+
| Id|Reputation|
+---+----------+
| -1|         1|
|  1|       101|
|  2|       101|
|  3|       101|
|  4|       101|
|  5|       215|
|  6|       101|
|  7|       101|
|  8|       101|
|  9|      1102|
| 10|       101|
| 11|       213|
| 12|       101|
| 14|      2782|
| 15|       101|
| 16|         1|
| 17|       236|
| 18|       101|
| 19|       101|
| 20|       101|
+---+----------+
only showing top 20 rows



In [0]:
def run_query(query: str, df1: DataFrame, df2: DataFrame, df3: DataFrame = None) -> DataFrame:
    """Register the DataFrames as temp views df1, df2 (and optionally df3), then run the SQL query."""
    df1.createOrReplaceTempView("df1")
    df2.createOrReplaceTempView("df2")
    if df3 is not None:
        df3.createOrReplaceTempView("df3")
    return spark.sql(query)

In [0]:
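# join users (df1) to comments (df2) to posts (df3):
# one row per comment, holding the commenting user's Id and the tags of the commented post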
query = """
SELECT df1.Id, df3.tags FROM 
df1 JOIN df2 ON df1.Id = df2.UserId
JOIN df3 on df2.PostId = df3.Id
ORDER BY df1.Id
"""

In [0]:
user_tags_df = run_query(query, users_df, comments_df, posts_df)

### Calculate variable B

In [0]:
def user_interest_wideness(user_tags_df: DataFrame) -> DataFrame:
  """
  Param user_tags_df:
    DataFrame with a user 'Id' column and a 'tags' column, sorted by 'Id'
  Returns a new DataFrame with one row per distinct user id, holding the number of
  distinct tags for that user divided by the total number of tags
  """
  TOTAL_TAG_NUMBER = 638

  # collect (Id, tags) pairs to the driver in a single pass;
  # the loop below relies on rows of the same user being adjacent (ORDER BY in the query)
  rows = user_tags_df.select('Id', 'tags').collect()

  # aggregate into a list of distinct tags for each distinct user
  id_pointer = None
  data = [] # [user_id, tags]
  for user_id, tags in rows:
    if tags is None: # skip posts without tags
      continue
    tags = tags[1:-1].split('><') # '<a><b>' -> ['a', 'b']
    if user_id != id_pointer: # new user
      id_pointer = user_id
      data.append([user_id, []])

    # append tags not yet seen for the current user
    data[-1][1].extend([tag for tag in tags if tag not in data[-1][1]])

  # replace each tag list with its length divided by the total number of tags
  for entry in data:
    entry[1] = len(entry[1]) / TOTAL_TAG_NUMBER

  # create a new dataframe from the aggregated data
  columns = ['UserId', 'TagBroadness']
  result_df = spark.createDataFrame(data, columns)
  return result_df

variable_b = user_interest_wideness(user_tags_df)
variable_b.show()

+------+--------------------+
|UserId|        TagBroadness|
+------+--------------------+
|     6|0.004702194357366771|
|    11|0.001567398119122257|
|    14|  0.0109717868338558|
|    17|  0.0109717868338558|
|    21| 0.24921630094043887|
|    24|0.006269592476489028|
|    26|0.012539184952978056|
|    31|0.004702194357366771|
|    34|0.004702194357366771|
|    36|0.009404388714733543|
|    41|0.003134796238244514|
|    51|0.009404388714733543|
|    53|0.003134796238244514|
|    59|0.004702194357366771|
|    62|0.014106583072100314|
|    66|0.003134796238244514|
|    70|0.004702194357366771|
|    75| 0.05329153605015674|
|    77|0.009404388714733543|
|    82|0.009404388714733543|
+------+--------------------+
only showing top 20 rows
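
The aggregation above depends on the rows arriving sorted by user id. As a point of comparison, the following plain-Python sketch groups tags per user with a dictionary of sets, which does not depend on row order. It assumes, based on the slicing in `user_interest_wideness`, that tags are stored as a string of the form `<a><b>`; the helper names and the sample rows are hypothetical.

```python
from collections import defaultdict


def parse_tags(raw):
    """Split a string like '<a><b>' into ['a', 'b']; None or '' gives []."""
    if not raw:
        return []
    return raw[1:-1].split("><")


def tag_broadness(rows, total_tags=638):
    """Map each user id to (number of distinct tags) / total_tags.

    rows: iterable of (user_id, raw_tag_string) pairs in any order.
    Users whose rows carry no tags at all are left out of the result.
    """
    tags_by_user = defaultdict(set)
    for user_id, raw in rows:
        tags_by_user[user_id].update(parse_tags(raw))
    return {uid: len(tags) / total_tags
            for uid, tags in tags_by_user.items() if tags}


# Invented rows, deliberately not sorted by user id.
rows = [(6, "<python><pandas>"), (11, "<nlp>"), (6, "<python><nlp>"), (11, None)]
result = tag_broadness(rows)
print(result)
```

Using a set per user also removes the need for the `tag not in ...` list scan, since duplicate tags are dropped on insertion.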



### Calculate Pearson correlation using variables A and B

In [0]:
# join variable a and b on user id into the same dataframe
query = "SELECT * FROM df1 JOIN df2 ON df1.Id = df2.UserId"
df = run_query(query, variable_a, variable_b)

# calculate pearson coefficient
pearson = df.corr("Reputation", "TagBroadness", "pearson")
print(pearson)

0.7113864805661866


### Results
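The Pearson correlation coefficient between reputation and tag broadness is about 0.71, which indicates a strong positive linear relationship: users with higher reputation tend to be associated with a larger share of the available tags. This suggests that expert users have *general* rather than specific interests.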