For each user whose ID appears in the column userId, count the number of friends he or she has in common with every other user whose ID appears in that column.

Print the 49 pairs of users with the largest number of common friends, ordered first by the common friends count in descending order, then by the id of user1, and finally by the id of user2. The format is as follows: "count user1 user2".

Example:

234 54719 767867

120 54719 767866

97 50787 327676

To solve this task, use the algorithm described in the last video of lesson 1. The overall plan could look like this:

* create a new column “friend” by exploding the column “friends” (like in the demo iPython notebook)
* group the resulting dataframe by the column “friend” (like in the demo iPython notebook)
* create a column “users” by collecting together all users with the same id in the column “friend” (like in the demo iPython notebook)
* sort the elements in the column “users” with the function sort_array
* keep only the rows that have more than 1 element in the column “users”
* for each row, emit all possible ordered pairs of users from the column “users”, with user1 preceding user2 in the sorted array (tip: write a user defined function for this)
* count the number of times each pair has appeared
* with the help of a window function (like in the demo iPython notebook), select the 49 pairs of users who have the largest number of common friends

The sample dataset is located at /data/graphDFSample.

In [2]:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.enableHiveSupport().master("local").getOrCreate()

In [3]:
graphPath = "/data/graphDFSample"

In [4]:
from pyspark.sql.functions import explode, collect_list, size, col, row_number, sort_array, udf, count, desc
from pyspark.sql import Window
from pyspark.sql.types import IntegerType, ArrayType
import itertools

def get_all_pairs(list_):
    # All 2-element combinations, kept in the order of the input list.
    return [list(el) for el in itertools.combinations(list_, 2)]

# UDF returning an array of [user1, user2] arrays.
get_all_pairs_UDF = udf(get_all_pairs, ArrayType(ArrayType(IntegerType())))

# Invert the graph: for each friend, collect the users who list that friend,
# and keep only friends shared by more than one user.
reversedGraph = sparkSession.read.parquet(graphPath) \
    .withColumn("friend", explode("friends")) \
    .groupBy("friend") \
    .agg(collect_list("user").alias("users")) \
    .filter(size(col("users")) > 1)


In [5]:
# Sort each users array so every emitted pair is in ascending id order, then explode the pairs.
reversedGraph1 = reversedGraph.select("friend", sort_array("users").alias("users")) \
    .withColumn("user_pair", explode(get_all_pairs_UDF("users"))).drop("users") \
    .withColumn("user1", col("user_pair")[0]) \
    .withColumn("user2", col("user_pair")[1]).drop("user_pair")

# Each row is one common friend of (user1, user2); count the rows per pair.
friends_count = reversedGraph1.groupBy(["user1", "user2"]) \
    .agg(count("friend").alias("friends_count"))


root
 |-- friend: integer (nullable = true)
 |-- users: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- users_size: integer (nullable = false)
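The role of sort_array in the pair-emission step can be checked in plain Python. This is a small sketch under assumed inputs: `users_per_friend` is a hypothetical pair of "users" arrays, the second deliberately unsorted, since collect_list does not guarantee any order of the collected elements.

```python
from collections import Counter
from itertools import combinations

def get_all_pairs(users):
    # Same logic as the UDF body: all 2-element combinations, in input order.
    return [list(pair) for pair in combinations(users, 2)]

# Hypothetical "users" arrays collected for two different friends.
users_per_friend = [[1, 2, 3], [3, 1]]

# Without sorting, the same two users can show up as (1, 3) and as (3, 1).
unsorted_counts = Counter(
    tuple(p) for users in users_per_friend for p in get_all_pairs(users)
)

# With sorting, both occurrences land on the single key (1, 3).
sorted_counts = Counter(
    tuple(p) for users in users_per_friend for p in get_all_pairs(sorted(users))
)
```

In the unsorted case the common friends of users 1 and 3 are split across two keys, so neither key holds the true count; sorting before generating combinations merges them.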



In [6]:
window = Window.orderBy(col("friends_count").desc(), col("user1"), col("user2"))
# row_number starts at 1, so "< 50" keeps the first 49 pairs.
top50 = friends_count.withColumn("row_number", row_number().over(window)) \
    .filter(col("row_number") < 50) \
    .select(col("friends_count"), col("user1"), col("user2")) \
    .orderBy(col("friends_count").desc(), col("user1"), col("user2")) \
    .collect()
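The required ordering (count descending, then user1 and user2 ascending) can be illustrated without Spark. The sketch below is not the window-function approach itself but a plain-Python equivalent of its result; `pair_counts` holds made-up numbers and `top_pairs` is a hypothetical helper.

```python
# Hypothetical (user1, user2) -> common friends count.
pair_counts = {(1, 2): 5, (1, 3): 7, (2, 3): 5, (1, 4): 7, (3, 4): 2}

def top_pairs(counts, limit):
    """Return up to `limit` rows (count, user1, user2) in the required order."""
    rows = [(n, user1, user2) for (user1, user2), n in counts.items()]
    # Negate the count for descending order; ids stay ascending as tie-breakers.
    rows.sort(key=lambda row: (-row[0], row[1], row[2]))
    return rows[:limit]

for row in top_pairs(pair_counts, 3):
    print('%s %s %s' % row)
```

Taking the first `limit` rows of the sorted list corresponds to keeping row numbers 1 through `limit`, so a limit of 49 matches the `row_number < 50` filter. The tie-breakers matter at the cut-off: when several pairs share the same count, they decide which of them make the list.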

In [7]:
for val in top50:
    print('%s %s %s' % val)

9606655 244
62922315 241
1288836 240
36402159 239
36079654 239
40342046 235
24319760 234
34854364 234
45353567 233
28229916 231
16364918 230
52511791 229
549319 227
5137947 227
65079230 227
17636074 226
49067109 225
53106903 225
6570168 223
44621704 223
34850500 223
27338193 222
32810368 222
25606717 222
34201873 220
6147442 219
62386165 219
45239367 219
32821462 218
30234171 218
63649194 217
53826156 217
13813472 217
26158314 217
17679500 217
14394422 216
7153815 216
13062446 216
36039499 216
64373911 216
12890141 215
20291955 215
36757249 214
64856469 214
40043869 213
34071175 212
11768267 211
38750752 211
3295906 211
