
Alvaro Viejo (100451677), Rodrigo Oliver (100451788), Héctor Tienda (100432519)

# Cleaning the spoilers dataset

Now that we have a database that maps each book to an approximate genre, we can use that information to build the final dataframe for our topic modeling tasks.


In [1]:
from sqlalchemy import create_engine
from getpass import getpass
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy import Table, Column, Integer, String

DATABASE_USER = "librarian"
DATABASE_PASSWD = getpass(f"Enter Database password for user {DATABASE_USER}:")

engine = create_engine(
    f"mysql+mysqldb://{DATABASE_USER}:{DATABASE_PASSWD}@127.0.0.1:3306/books",
    echo=False,
    future=True,
)

Base = declarative_base()

class BookIdtoGenre(Base):
    __tablename__ = "id_genres"

    id = Column(Integer, primary_key=True)
    genre = Column(String(30))

    def __repr__(self):
        return f"{self.id!r} - {self.genre!r}"

We will iterate over all the lines in the file `goodreads_reviews_spoiler.json.gz` to build a dataframe consisting of reviews with genre and rating information.

In [2]:
import os

# Store directory information
BASE_DIR = os.getcwd()
DATA_DIR = "data"
PATH_DATA_DIR = os.path.join(BASE_DIR, DATA_DIR)

In order to create a dataframe easily, we will use a dataclass to represent each review.

In [3]:
from dataclasses import dataclass

@dataclass
class Review:
    review_text: str
    rating: int
    book_genre: str

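To make the link between the dataclass and the dataframe concrete, here is a minimal sketch of how dataclass instances map onto rows and columns. The sample reviews are invented for illustration and are not taken from the dataset; the `Review` class simply repeats the definition above so that the snippet runs on its own.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Review:
    review_text: str
    rating: int
    book_genre: str


# Hypothetical sample reviews, invented for illustration only
sample_reviews = [
    Review("The ending surprised me.", 4, "fantasy"),
    Review("I saw the twist coming.", 2, "mystery"),
]

# Field names become the column names, in declaration order
column_names = [field.name for field in fields(Review)]

# Each instance becomes one row: a plain dict keyed by field name
rows = [asdict(review) for review in sample_reviews]
```

A list of such instances can be handed to `pd.DataFrame` directly, which is what the notebook does further down, so the explicit `asdict` step is only there to show the row structure.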
We first have to process every line in the file so that we can later build a dataset with balanced classes for training.

In [4]:
from typing import Optional, List, Tuple
from tqdm.notebook import tqdm
from sqlalchemy import select
from sqlalchemy.exc import NoResultFound
import pandas as pd
import gzip
import json

def _cleanReviewSentences(sentences: List[Tuple[int, str]]) -> str:
    """Function that turns the review body into a single string"""
    return " ".join(sentence for _, sentence in sentences)


def getSpoilerData(
    file_name: str, data_dir: str = PATH_DATA_DIR
) -> List[Review]:
    """Function that iterates over a file in search of reviews that contain spoilers"""
    file_path = os.path.join(data_dir, file_name)

    # NOTE: I hate being forced to use append, may someone 
    # suggest a better way?
    reviews: List[Review] = []

    # A single session is reused for every lookup instead of opening one per line
    with gzip.open(file_path) as file, Session(engine) as session:
        # Load the file in a lazy manner (line by line)
        for line in tqdm(file):
            json_line = json.loads(line)

            # if the line is not a spoiler, skip it
            if not bool(json_line["has_spoiler"]):
                continue

            # get the book genre from the database;
            # if no genre is found, skip the review
            query = select(BookIdtoGenre).where(
                BookIdtoGenre.id == json_line["book_id"]
            )
            try:
                genre = session.scalars(query).one().genre
            except NoResultFound:
                continue

            # if the lookup succeeded, create the review object
            rating = json_line["rating"]
            review_text = _cleanReviewSentences(json_line["review_sentences"])
            review = Review(review_text, rating, genre)

            reviews.append(review)
    
    return reviews
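The comment in `getSpoilerData` asks for an alternative to calling `append` in a loop. One possible option, sketched below, is a generator that yields one review at a time and leaves it to the caller to collect the results. The sketch also assumes that the genre mapping fits in memory as a plain dictionary, which replaces the per-line database query; that is an assumption made here, not something the notebook does. The names `iter_spoiler_reviews`, `genre_by_id` and the sample records are hypothetical, and the records are invented stand-ins for the real file.

```python
import gzip
import io
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_spoiler_reviews(
    lines: Iterable[bytes], genre_by_id: Dict[int, str]
) -> Iterator[Tuple[str, int, str]]:
    """Yield (review_text, rating, genre) for each spoiler review with a known genre."""
    for line in lines:
        record = json.loads(line)
        # Skip reviews that are not flagged as spoilers
        if not record["has_spoiler"]:
            continue
        # Skip reviews whose book has no genre in the mapping
        genre = genre_by_id.get(int(record["book_id"]))
        if genre is None:
            continue
        # Join the (flag, sentence) pairs into a single string
        text = " ".join(sentence for _, sentence in record["review_sentences"])
        yield text, record["rating"], genre


# Hypothetical in-memory data standing in for the real file and the id_genres table
sample_records = [
    {"book_id": "1", "has_spoiler": True, "rating": 5,
     "review_sentences": [[0, "Great book."], [1, "The hero dies."]]},
    {"book_id": "1", "has_spoiler": False, "rating": 3,
     "review_sentences": [[0, "No spoilers here."]]},
    {"book_id": "2", "has_spoiler": True, "rating": 1,
     "review_sentences": [[1, "The butler did it."]]},
]
raw = "\n".join(json.dumps(record) for record in sample_records).encode("utf-8")
compressed = gzip.compress(raw)
genre_by_id = {1: "fantasy"}  # book 2 has no genre, so its review is skipped

# gzip.open also accepts a file object, so the sketch needs no file on disk
with gzip.open(io.BytesIO(compressed)) as file:
    reviews = list(iter_spoiler_reviews(file, genre_by_id))
```

Because the generator never holds more than one review itself, the caller decides whether to materialize a list, feed a dataframe, or write rows out incrementally. The trade-off is that the whole genre table must be loaded up front, which only pays off when it is small compared with the number of reviews.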

With the necessary code in place, we generate the reviews and create a dataframe.

In [5]:
reviews_spoiler = getSpoilerData("goodreads_reviews_spoiler.json.gz")


In [6]:
reviews_spoiler_df = pd.DataFrame(reviews_spoiler)

After creating the dataframe, we will persist the results to disk, both as a CSV file and as a table in our database.

In [7]:
reviews_spoiler_df.to_csv(os.path.join(DATA_DIR, "reviews_spoiler.csv"))

In [8]:
reviews_spoiler_df.to_sql("spoiler_reviews", con=engine)

81642
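The balanced classes mentioned earlier are not built in this section. As a rough sketch of what that later step could look like, the snippet below downsamples every genre to the size of the rarest one. This is only one possible balancing strategy and is an assumption here; the function name `downsample_per_genre`, the fixed seed and the sample rows are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def downsample_per_genre(
    rows: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Keep the same number of (genre, text) rows for every genre."""
    by_genre: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for row in rows:
        by_genre[row[0]].append(row)

    # The rarest genre sets the size of every class
    target = min(len(group) for group in by_genre.values())

    # A seeded generator keeps the selection reproducible
    rng = random.Random(seed)
    balanced: List[Tuple[str, str]] = []
    for genre in sorted(by_genre):
        # Sample without replacement within each genre
        balanced.extend(rng.sample(by_genre[genre], target))
    return balanced


# Hypothetical rows, invented for illustration
rows = [("fantasy", f"review {i}") for i in range(5)] + [
    ("mystery", f"review {i}") for i in range(2)
]
balanced = downsample_per_genre(rows)
```

The function expects at least one row. Downsampling discards data from the larger genres, so it is worth checking how small the rarest genre is before relying on it.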