### Run an iterative 5-core

We apply an iterative 5-core to the user–item graph (users and books as nodes, ratings as edges). In each iteration, we first remove users with fewer than 5 ratings, then remove books with fewer than 5 ratings, as well as books whose average score is extreme (at least 9.5 or at most 1.0) while resting on at most 8 ratings. Because books with fewer than 5 ratings are already dropped, this second rule only affects books with 5 to 8 ratings. This “peeling” process runs for at most 5 iterations and stops early once the row count no longer changes; here it converged on the fourth pass, going from 65,056 to 64,419 rows. The resulting `ratings_core` DataFrame is a denser, more reliable subgraph in which both users and books have enough interactions to support meaningful collaborative filtering.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:

spark = SparkSession.builder.appName("kcore-stage").getOrCreate()

ratings_merged_for_core = spark.read.parquet("export_core/ratings_merged_for_core.parquet")
ratings_merged_for_core.printSchema()
print("rows:", ratings_merged_for_core.count())




root
 |-- user_id: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- book_title: string (nullable = true)
 |-- book_author: string (nullable = true)
 |-- year_of_publication: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Language: string (nullable = true)
 |-- Category: string (nullable = true)

rows: 65056


In [3]:
df = ratings_merged_for_core.select("user_id", "isbn", "rating")

max_iter = 5

for i in range(max_iter):
    print(f"\n======= Iteration {i} =======")
    rows_before = df.count()
    print("Rows at start:", rows_before)

    user_stats = df.groupBy("user_id").agg(
        F.count("*").alias("num_ratings")
    )
    users_keep = user_stats.filter(
        F.col("num_ratings") >= 5
    ).select("user_id")

    df_users = df.join(users_keep, on="user_id", how="inner")

    book_stats = df_users.groupBy("isbn").agg(
        F.count("*").alias("num_ratings"),
        F.avg("rating").alias("avg_rating")
    )

    suspicious_books = (
        (F.col("num_ratings") <= 8) &
        ((F.col("avg_rating") >= 9.5) | (F.col("avg_rating") <= 1.0))
    )

    books_keep = book_stats.filter(
        (F.col("num_ratings") >= 5) &   
        ~suspicious_books               
    ).select("isbn")

    df_new = df_users.join(books_keep, on="isbn", how="inner")

    rows_after = df_new.count()
    print("Rows after full iteration:", rows_after)

    if rows_after == rows_before:
        print("Converged, stop.")
        df = df_new
        break

    df = df_new

ratings_core = df




Rows at start: 65056
Rows after full iteration: 64455

Rows at start: 64455
Rows after full iteration: 64427

Rows at start: 64427
Rows after full iteration: 64419

Rows at start: 64419
Rows after full iteration: 64419
Converged, stop.
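
To see why a single filtering pass is not enough, here is a minimal plain-Python sketch of the same peeling idea on a toy graph. It is only an illustration: the function name `iterative_k_core`, the toy edges, and the choice of `k=2` are assumptions made for this example, and the extreme-average filter from the Spark cell is left out.

```python
from collections import Counter

def iterative_k_core(edges, k, max_iter=5):
    """Peel (user, item) edges until every remaining user and item
    has at least k edges, or until max_iter passes have run."""
    edges = list(edges)
    for _ in range(max_iter):
        size_before = len(edges)

        # Step 1: drop users with fewer than k ratings.
        user_deg = Counter(user for user, _ in edges)
        edges = [(u, b) for u, b in edges if user_deg[u] >= k]

        # Step 2: drop books with fewer than k ratings among the remaining users.
        book_deg = Counter(book for _, book in edges)
        edges = [(u, b) for u, b in edges if book_deg[b] >= k]

        # No edge removed in this pass: the core is stable.
        if len(edges) == size_before:
            break
    return edges

# Hypothetical toy graph, peeled with k=2 instead of 5 to keep it small.
toy_edges = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "b"), ("u3", "c"),
]
core = iterative_k_core(toy_edges, k=2)
print(core)
```

In the first pass, book `c` is dropped because it has a single rating, which leaves `u3` with only one rating. Only the second pass removes `u3`, so the stable core contains just `u1` and `u2` on books `a` and `b`.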


### Second overlap check

Although we already removed isolated users once before the k-core step, k-core pruning changes the topology of the user–item graph: some books and users are dropped, and a few users may lose all of their previously shared items and become isolated again. Therefore, after obtaining `ratings_core`, we enforce the “at least one overlapping book” condition a second time: a user is kept only if at least one of their books was also rated by another user. In this run the check removes nothing (64,419 rows before and after).


In [4]:
book_user_counts = ratings_core.groupBy("isbn") \
    .agg(F.countDistinct("user_id").alias("num_users_for_book"))

overlap_books = book_user_counts.filter(
    F.col("num_users_for_book") >= 2
).select("isbn")

ratings_on_overlap_books = ratings_core.join(
    overlap_books,
    on="isbn",
    how="inner"
)

overlap_users = ratings_on_overlap_books.select("user_id").distinct()

ratings_final = ratings_core.join(
    overlap_users,
    on="user_id",
    how="inner"
)

print("Final rows after enforcing overlap:", ratings_final.count())
print("Final distinct users:", ratings_final.select("user_id").distinct().count())
print("Final distinct books:", ratings_final.select("isbn").distinct().count())


Final rows after enforcing overlap: 64419
Final distinct users: 3776
Final distinct books: 1148
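
The overlap rule can also be written as a small plain-Python sketch, which makes the two-step logic (find shared books, then keep their users) easy to test by hand. The function name `enforce_overlap` and the sample ratings are hypothetical and only serve this illustration.

```python
from collections import defaultdict

def enforce_overlap(ratings):
    """Keep only ratings from users who share at least one book with
    another user. `ratings` is a list of (user_id, isbn, rating)."""
    users_per_book = defaultdict(set)
    for user_id, isbn, _ in ratings:
        users_per_book[isbn].add(user_id)

    # Books rated by two or more distinct users.
    overlap_books = {b for b, users in users_per_book.items() if len(users) >= 2}

    # Users with at least one rating on such a book.
    overlap_users = {u for u, b, _ in ratings if b in overlap_books}

    # Keep every rating of those users, including ratings on unshared books.
    return [r for r in ratings if r[0] in overlap_users]

# Hypothetical sample: u3 shares no book with anyone and is removed.
sample = [
    ("u1", "a", 8.0),
    ("u2", "a", 7.0),
    ("u2", "b", 9.0),
    ("u3", "c", 5.0),
]
kept = enforce_overlap(sample)
print(kept)
```

If each (`user_id`, `isbn`) pair occurs at most once (an assumption the notebook does not state), every book in a converged 5-core already has at least five distinct users, which would explain why the row count is unchanged here.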


9. Split the data into a ratings table and a book metadata table

We keep two separate core tables: `ratings_clean`, which stores user–book interactions (`user_id`, `isbn`, `rating`), and
`books_clean`, which stores book-level metadata.
This separation avoids duplicating metadata for every rating, makes the rating matrix cleaner for collaborative filtering, and allows us to update book information independently from the ratings.

In [5]:
ratings_clean = ratings_final.select(
    "user_id",
    "isbn",
    "rating"
)

print("Final ratings_clean rows:", ratings_clean.count())

isbn_in_ratings = ratings_clean.select("isbn").distinct()

books_to_clean = ratings_merged_for_core.join(
    isbn_in_ratings,
    on="isbn",
    how="inner"
)

books_before_cleaning = books_to_clean.select(
    "isbn",
    "book_title",
    "book_author",
    "year_of_publication",
    "publisher",
    "Summary",
    "Language",
    "Category"
).dropDuplicates(["isbn"])

print("Final books_clean rows (distinct books):", books_before_cleaning.count())

Final ratings_clean rows: 64419
Final books_clean rows (distinct books): 1148
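
As a rough illustration of the split, the sketch below separates denormalized rows into a ratings list and a one-row-per-book metadata list, then checks that every rated `isbn` has exactly one metadata row. The helper `split_tables` and the sample rows are hypothetical; only the column names come from the schema printed earlier.

```python
def split_tables(rows, rating_cols, book_cols, key="isbn"):
    """Split denormalized rows (dicts) into a ratings list and a
    one-row-per-book metadata list."""
    ratings = [{c: row[c] for c in rating_cols} for row in rows]

    books = {}
    for row in rows:
        # Keep the first metadata row seen for each key.
        books.setdefault(row[key], {c: row[c] for c in book_cols})
    return ratings, list(books.values())

# Hypothetical sample rows.
rows = [
    {"user_id": "u1", "isbn": "a", "rating": 8.0, "book_title": "Book A"},
    {"user_id": "u2", "isbn": "a", "rating": 6.0, "book_title": "Book A"},
    {"user_id": "u1", "isbn": "b", "rating": 9.0, "book_title": "Book B"},
]
ratings, books = split_tables(
    rows,
    rating_cols=["user_id", "isbn", "rating"],
    book_cols=["isbn", "book_title"],
)

# Every rated isbn should have exactly one metadata row.
rated = {r["isbn"] for r in ratings}
described = [b["isbn"] for b in books]
assert rated == set(described) and len(described) == len(set(described))
```

One difference worth keeping in mind: this sketch keeps the first row per `isbn`, whereas Spark's `dropDuplicates(["isbn"])` does not guarantee which duplicate survives. That only matters if the metadata differs between rows of the same book.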


9. Clean the book metadata

In [6]:
def is_missing_or_unknown(col):
    return (
        F.col(col).isNull() |
        (F.trim(F.col(col)) == "") |
        (F.lower(F.trim(F.col(col))).isin("unknown", "n/a", "na", "null"))
    )

# Clean book_title and book_author
books_clean = books_before_cleaning \
    .withColumn(
        "book_title",
        F.when(is_missing_or_unknown("book_title"),
               F.lit("Unknown Title"))
         .otherwise(F.trim(F.col("book_title")))
    ) \
    .withColumn(
        "book_author",
        F.when(is_missing_or_unknown("book_author"),
               F.lit("Unknown Author"))
         .otherwise(F.trim(F.col("book_author")))
    )

# Clean year_of_publication
year_str = F.trim(F.col("year_of_publication").cast("string"))

books_clean = books_clean.withColumn(
    "year_of_publication",
    F.when(
        year_str.rlike(r"^[0-9]{4}$"),         
        year_str.cast("int")
    ).otherwise(F.lit(None).cast("int"))       
)

# Clean publisher
books_clean = books_clean.withColumn(
    "publisher",
    F.when(is_missing_or_unknown("publisher"),
           F.lit("Unknown Publisher"))
     .otherwise(F.trim(F.col("publisher")))
)

# Clean Summary
books_clean = books_clean.withColumn(
    "Summary",
    F.when(is_missing_or_unknown("Summary"),
           F.lit("No Summary"))
     .otherwise(F.trim(F.col("Summary")))
)

# Clean Language
books_clean = books_clean.withColumn(
    "Language",
    F.when(is_missing_or_unknown("Language"),
           F.lit("Unknown Language"))
     .otherwise(F.lower(F.trim(F.col("Language"))))
)

# Clean Category
books_clean = books_clean.withColumn(
    "Category",
    F.when(is_missing_or_unknown("Category"),
           F.lit("Unknown Category"))
     .otherwise(F.trim(F.col("Category")))
)

print("Final books_clean rows:", books_clean.count())
books_clean.show(5, truncate=False)

Final books_clean rows: 1148
+---------+-------------------------------------------------------------+----------------+-------------------+------------------------+-------+--------+--------+
|isbn     |book_title                                                   |book_author     |year_of_publication|publisher               |Summary|Language|Category|
+---------+-------------------------------------------------------------+----------------+-------------------+------------------------+-------+--------+--------+
|000649840|Angelas Ashes                                                |Frank Mccourt   |1994               |Harpercollins Uk        |9      |9       |9       |
|006000438|The Death of Vishnu: A Novel                                 |Manil Suri      |2002               |Perennial               |9      |9       |9       |
|006017143|The Night Listener                                           |Armistead Maupin|2000               |HarperCollins Publishers|9      |9       |9       |
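
The cleaning rules from the cell above can be mirrored in plain Python for quick spot checks on individual values. This is a sketch, not a drop-in replacement: `clean_text` and `clean_year` are hypothetical helper names, and Python's `str.strip()` removes all surrounding whitespace, which may differ slightly from Spark's `trim` on unusual input.

```python
import re

# Placeholder strings treated as missing, as in the Spark helper.
PLACEHOLDERS = {"unknown", "n/a", "na", "null"}

def is_missing_or_unknown(value):
    """True for None, blank strings, and known placeholder strings."""
    if value is None:
        return True
    text = str(value).strip()
    return text == "" or text.lower() in PLACEHOLDERS

def clean_text(value, default):
    """Return the trimmed value, or `default` when the value is missing."""
    return default if is_missing_or_unknown(value) else str(value).strip()

def clean_year(value):
    """Return the year as an int if it is exactly four digits, else None."""
    text = "" if value is None else str(value).strip()
    return int(text) if re.fullmatch(r"[0-9]{4}", text) else None

print(clean_text("  N/A ", "Unknown Publisher"))
print(clean_text(" Perennial ", "Unknown Publisher"))
print(clean_year(" 2002 "), clean_year("0"), clean_year(None))
```

Note that the four-digit rule performs no range check, so any four-digit string is accepted as a year, including implausible values.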

### Exporting the cleaned data

In [None]:
pdf_core_ratings = ratings_clean.toPandas()
pdf_core_books = books_clean.toPandas()

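import os
import sys

if getattr(sys, "frozen", False):
    # Running as a frozen executable (e.g. built with PyInstaller)
    base_dir = os.path.dirname(sys.executable)
elif "__file__" in globals():
    # Running as a plain Python script
    base_dir = os.path.dirname(os.path.abspath(__file__))
else:
    # Running in a notebook, where __file__ is not defined
    base_dir = os.getcwd()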

out_dir = os.path.join(base_dir, "export_core")
os.makedirs(out_dir, exist_ok=True)

output_path_ratings = os.path.join(out_dir, "pdf_core_ratings.parquet")
output_path_books = os.path.join(out_dir, "pdf_core_books.parquet")

print("Saving ratings to:", output_path_ratings)
print("Saving books to:", output_path_books)

pdf_core_ratings.to_parquet(output_path_ratings, index=False)
pdf_core_books.to_parquet(output_path_books, index=False)


# print("Saving ratings_clean (Spark) to: D:/projet_esilv/Mining/export_core/ratings_clean_spark")
# print("Saving books_clean (Spark) to: D:/projet_esilv/Mining/export_core/books_clean_spark")
# 
# ratings_clean.write.mode("overwrite").parquet("export_core/ratings_clean_spark")
# books_clean.write.mode("overwrite").parquet("export_core/books_clean_spark")

# base_dir = r"D:/projet_esilv/Mining/export_core"
# 
# spark_path_ratings = f"file:///{base_dir}/ratings_clean_spark"
# spark_path_books   = f"file:///{base_dir}/books_clean_spark"
# 
# print("Saving ratings_clean to:", spark_path_ratings)
# 
# ratings_clean.write.mode("overwrite").parquet(spark_path_ratings)
# books_clean.write.mode("overwrite").parquet(spark_path_books)

Saving ratings to: D:\projet_esilv\Mining\export_core\pdf_core_ratings.parquet
Saving books to: D:\projet_esilv\Mining\export_core\pdf_core_books.parquet
