Problem Recap:

Goal: Recommend pages to each user based on what their friends have liked, excluding pages the user has already liked.
PySpark Implementation:
Identify Friends' Likes: Get the pages liked by each user's friends.
Exclude User's Own Likes: Remove pages the user has already liked from the recommendations.
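
Before moving to Spark, the two steps can be sketched in plain Python as a reference to check the DataFrame results against. This is only an illustration on the notebook's sample data; the names likes_by_user, friend_pages and recommendations are made up for this sketch and are not used elsewhere in the notebook.

```python
# Plain-Python sketch of the two steps (no Spark required).
# Same sample data as the notebook's friends_data and likes_data.
friends = [(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)]
likes = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "B"), (3, "C"), (4, "B")]

# Index likes by user so each friend's pages can be looked up directly.
likes_by_user = {}
for user_id, page_id in likes:
    likes_by_user.setdefault(user_id, set()).add(page_id)

# Step 1: collect every page liked by at least one friend of each user.
friend_pages = {}
for user_id, friend_id in friends:
    friend_pages.setdefault(user_id, set()).update(
        likes_by_user.get(friend_id, set())
    )

# Step 2: drop the pages the user already likes.
recommendations = sorted(
    (user_id, page_id)
    for user_id, pages in friend_pages.items()
    for page_id in pages - likes_by_user.get(user_id, set())
)

print(recommendations)
# [(2, 'B'), (2, 'C'), (3, 'A'), (4, 'A'), (4, 'C')]
```

User 1 gets no recommendations because every page liked by users 2, 3 and 4 is already among user 1's own likes.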

In [0]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("Friends and Likes").getOrCreate()

# Friends data and schema
friends_data = [(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)]
friend_schema = "user_id int , friend_id int"
friends_df = spark.createDataFrame(data=friends_data, schema=friend_schema)

# Likes data and schema
likes_data = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "B"), (3, "C"), (4, "B")]
like_schema = "user_id int , page_id string"
likes_df = spark.createDataFrame(data=likes_data, schema=like_schema)

# Display the DataFrames (display() is available in Databricks notebooks; elsewhere use .show())
friends_df.display()
likes_df.display()

user_id,friend_id
1,2
1,3
1,4
2,1
3,1
3,4
4,1
4,3


user_id,page_id
1,A
1,B
1,C
2,A
3,B
3,C
4,B


In [0]:
from pyspark.sql.functions import col

# Step 1: Join friends with likes on friend_id to find the pages liked by friends.
# Aliases keep the two user_id columns unambiguous.
friend_likes_df = (
    friends_df.alias("f")
    .join(likes_df.alias("l"), col("f.friend_id") == col("l.user_id"), "inner")
    .select(col("f.user_id"), col("l.page_id"))
)

# Step 2: Remove the pages that the user has already liked.
# A left_anti join keeps only the rows of friend_likes_df that have no matching
# (user_id, page_id) pair in likes_df. Joining on column names avoids ambiguous
# column references, since both DataFrames are derived from likes_df.
recommendations_df = friend_likes_df.join(
    likes_df, on=["user_id", "page_id"], how="left_anti"
)

# Show the final recommended pages for each user.
# distinct() removes duplicates that arise when several friends like the same page.
recommendations_df.distinct().display()

user_id,page_id
2,C
3,A
2,B
4,A
4,C


In [0]:
# Register DataFrames as SQL temporary views
friends_df.createOrReplaceTempView("friends")
likes_df.createOrReplaceTempView("likes")

In [0]:
# Spark SQL query with DISTINCT to ensure unique records
result_df = spark.sql(
    """
    SELECT DISTINCT f.user_id, l.page_id
    FROM friends f
    JOIN likes l
    ON f.friend_id = l.user_id
    WHERE (f.user_id, l.page_id) NOT IN (
        SELECT user_id, page_id
        FROM likes
    )
"""
)

# Show the unique result
result_df.display()

user_id,page_id
2,C
3,A
2,B
4,A
4,C


Explanation:
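Join Friends with Likes: We perform an inner join between friends_df and likes_df on friend_id = user_id to find the pages that each user's friends have liked. The result has two columns: user_id (the user) and page_id (a page liked by one of that user's friends). A pair appears once per friend who liked the page, which is why the final result is passed through distinct().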
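
To see why the duplicates need removing, the join can be mimicked in plain Python on the same sample data. This is a rough sketch of the join's row-level behaviour, not of how Spark executes it; friend_likes is a name chosen for this illustration.

```python
# Same sample data as the notebook's friends_data and likes_data.
friends = [(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)]
likes = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "B"), (3, "C"), (4, "B")]

# Inner join on friends.friend_id == likes.user_id, keeping (user_id, page_id).
friend_likes = [
    (user_id, page_id)
    for user_id, friend_id in friends
    for liker_id, page_id in likes
    if friend_id == liker_id
]

print(len(friend_likes))             # 16 joined rows
print(len(set(friend_likes)))        # 12 distinct (user_id, page_id) pairs
print(friend_likes.count((3, "B")))  # 2: friends 1 and 4 both like page B
```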

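Exclude User's Own Likes: We then perform a left_anti join between this DataFrame and likes_df, matching on both user_id and page_id. A left_anti join keeps only the left-side rows that have no match on the right, so every (user_id, page_id) pair the user has already liked is dropped and only new pages remain.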
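
In SQL, an anti-join is commonly written with NOT EXISTS rather than NOT IN, which also sidesteps the surprising behaviour of NOT IN when the subquery contains NULLs. The sketch below runs that variant against Python's built-in sqlite3 module as a stand-in, so it can be tried without a Spark session. The table alias own and the ORDER BY clause are additions for this illustration, and the assumption that the same query text works unchanged in spark.sql should be verified on your Spark version.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark temp views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
conn.execute("CREATE TABLE likes (user_id INTEGER, page_id TEXT)")
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    [(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)],
)
conn.executemany(
    "INSERT INTO likes VALUES (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "B"), (3, "C"), (4, "B")],
)

# Anti-join expressed as a correlated NOT EXISTS subquery.
query = """
    SELECT DISTINCT f.user_id, l.page_id
    FROM friends f
    JOIN likes l
      ON f.friend_id = l.user_id
    WHERE NOT EXISTS (
        SELECT 1
        FROM likes own
        WHERE own.user_id = f.user_id
          AND own.page_id = l.page_id
    )
    ORDER BY f.user_id, l.page_id
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
# [(2, 'B'), (2, 'C'), (3, 'A'), (4, 'A'), (4, 'C')]
```

The rows match the DataFrame and Spark SQL outputs above, apart from ordering, which distinct() and DISTINCT do not guarantee.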