# Part II STUFF

Our dataset is a subset of the "Amazon Reviews" dataset collected in 2023 by McAuley Lab:

Link to dataset: https://amazon-reviews-2023.github.io/

We are not interested in all 571M+ reviews; we look specifically at the "Movies_and_TV" subset. Each subset is divided into two .csv files: a review file and a metadata file.

The review file contains, as the name suggests, the reviews themselves. With zero cleaning, the combined dataset is around 7 GB. Large portions of it are not relevant to our purposes, so we shrink it to obtain a much cleaner and more potent dataset for creating the co-reviewer graph.

Firstly, an overview of the columns in each dataset:

| Reviews             | Description                                                                 |
|--------------------|-----------------------------------------------------------------------------|
| `rating`           | Star rating given by the reviewer (e.g., 1 to 5)                            |
| `title`            | Title or headline of the review                                             |
| `text`             | The main body text of the review                                            |
| `images`           | Image URLs or attachments included with the review (if any)                 |
| `asin`             | Amazon Standard Identification Number for the specific product              |
| `parent_asin`      | Group identifier to cluster product variants                                |
| `user_id`          | Anonymized unique identifier for the reviewer                               |
| `timestamp`        | Time the review was posted (UNIX or ISO format)                             |
| `helpful_vote`     | Number of people who found the review helpful                               |
| `verified_purchase`| Boolean indicating whether the purchase was verified by Amazon             |

| Metadata           | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| `main_category`  | The top-level category the product belongs to (e.g., "Movies & TV")         |
| `title`          | The main product title (often the movie/show title)                         |
| `subtitle`       | Optional subtitle for the product (e.g., edition/version info)              |
| `average_rating` | The average customer rating for the product                                 |
| `rating_number`  | Total number of ratings the product has received                            |
| `features`       | A list of product features (e.g., language, format)                         |
| `description`    | A longer description or summary of the product                              |
| `price`          | The listed price of the product                                              |
| `images`         | A list of image URLs associated with the product                            |
| `videos`         | Video media links (e.g., trailers or previews), if available                |
| `store`          | The Amazon store/subcategory under which the product is listed              |
| `categories`     | List of categories/tags assigned to the product (e.g., genre, theme)        |
| `details`        | Additional technical or marketing metadata (format, region, etc.)           |
| `parent_asin`    | A group-level identifier for versions of the same product                   |


Filtering on a couple of these columns, namely `helpful_vote` and `verified_purchase`, drastically reduces the size of the dataset. We removed reviews without a helpful vote because they are probably low quality or irrelevant. Similarly, if the purchase can't be verified, we can't be certain that the review comes from a real user. Reviews with fewer than 10 words were also deemed to be of lesser quality; the goal is to find descriptive reviews to facilitate language processing.

Lastly, we removed products that had fewer than 15 total reviews. A large portion of the dataset consists of products with minimal engagement; these would needlessly bloat the dataset, make it noisier, and not facilitate community detection well. The reduction at each step is summarized in the table:

| Step                                                  | Description                                                  | Count / Shape          |
|-------------------------------------------------------|--------------------------------------------------------------|------------------------|
| **Initial raw dataset**                               | Total number of reviews loaded                               | 17,328,314             |
|                                                       | Initial dataframe shape                                      | (17,328,314, 10)       |
|                                                       | Unique users (`user_id`)                                     | 6,503,429              |
|                                                       | Unique `parent_asin` values                                  | 747,764                |
| **After filtering reviews with `helpful_vote ≥ 1`**   | Removed unhelpful or unused reviews                          | (4,325,435, 10)        |
| **After removing short reviews (< 10 words)**         | Kept only meaningful reviews                                 | (3,795,557, 10)        |
| **After keeping only `verified_purchase == True`**    | Removed potentially fake/unreliable reviews                  | (2,428,871, 10)        |
| **Total words in cleaned dataset**                    | Word count of all remaining reviews                          | 196,353,406            |
| **After removing products with < 15 reviews**         | Ensured statistical validity of products                     | (1,341,856, 10)        |
|                                                       | Unique ASINs after filtering                                 | 34,333                 |
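
The filtering steps summarized in the table above can be sketched with pandas as follows. This is a minimal illustration, not the code used to produce the counts: the function name `filter_reviews` is hypothetical, words are counted by splitting on whitespace (the tokenization actually used is not stated), and the per-product review count is assumed to be grouped by `parent_asin`.

```python
import pandas as pd


def filter_reviews(
    reviews: pd.DataFrame,
    min_helpful: int = 1,
    min_words: int = 10,
    min_reviews_per_product: int = 15,
) -> pd.DataFrame:
    """Apply the row filters in the same order as the summary table."""
    # 1. Keep reviews with at least one helpful vote.
    df = reviews[reviews["helpful_vote"] >= min_helpful]

    # 2. Keep reviews with enough words (assumption: whitespace-delimited words).
    word_counts = df["text"].fillna("").str.split().str.len()
    df = df[word_counts >= min_words]

    # 3. Keep verified purchases (assumes the column already has boolean dtype).
    df = df[df["verified_purchase"]]

    # 4. Keep products with enough remaining reviews
    #    (assumption: products are identified by `parent_asin`).
    reviews_per_product = df["parent_asin"].map(df["parent_asin"].value_counts())
    return df[reviews_per_product >= min_reviews_per_product].reset_index(drop=True)


# Toy data; the product threshold is lowered to 2 so the example stays small.
long_text = "This film kept me engaged from the first scene to the last"
toy = pd.DataFrame(
    {
        "parent_asin": ["A", "A", "A", "B", "B", "B"],
        "helpful_vote": [3, 1, 0, 2, 5, 4],
        "text": [long_text, long_text, long_text, "Great movie", long_text, long_text],
        "verified_purchase": [True, True, True, True, False, True],
    }
)
filtered = filter_reviews(toy, min_reviews_per_product=2)
print(filtered["parent_asin"].tolist())  # ['A', 'A']
```

In the toy data, product B loses one review to each of the word-count and verification filters, leaving it with a single review, so the final step removes it entirely. This is the same cascading effect that shrinks the number of unique products in the table.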

Thus we are left with 34,333 unique movies/shows, covered by 1,341,856 reviews (rows). Next, we need to decide how many features to use. As can be seen from the feature tables above, many of the columns hold redundant information we won't need. After the initial removal of rows, the metadata .csv is matched to the reviews by ASIN.

Then, we removed non-essential or redundant columns such as `verified_purchase`, `subtitle`, `images_x`, `features`, `images_y`, `videos`, `store`, `details`, `bought_together`, and `author`. After this cleanup, we're left with the following columns in our final merged dataset: `rating`, `review_title`, `text`, `asin`, `parent_asin`, `user_id`, `timestamp`, `helpful_vote`, `main_category`, `movie_title`, `average_rating`, `rating_number`, `description`, `price`, and `categories`.
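
As a rough sketch of the merge and column cleanup, the snippet below joins the two tables and drops the listed columns. Several details are assumptions rather than facts from our pipeline: the join key is taken to be `parent_asin`, the join is an inner join, and the `_x`/`_y` names are assumed to come from the default suffixes pandas adds to overlapping columns (`title`, `images`). The function name `merge_with_metadata` is hypothetical.

```python
import pandas as pd

# Columns listed above as non-essential or redundant.
DROP_COLUMNS = [
    "verified_purchase", "subtitle", "images_x", "features", "images_y",
    "videos", "store", "details", "bought_together", "author",
]


def merge_with_metadata(reviews: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Join reviews with product metadata and drop unneeded columns."""
    # Assumption: inner join on `parent_asin`. Overlapping column names get the
    # default pandas suffixes `_x` (reviews) and `_y` (metadata).
    merged = reviews.merge(metadata, on="parent_asin", how="inner")
    merged = merged.rename(
        columns={"title_x": "review_title", "title_y": "movie_title"}
    )
    # errors="ignore" skips columns that are absent from a given input.
    return merged.drop(columns=DROP_COLUMNS, errors="ignore")


# Toy inputs with only a handful of the real columns.
toy_reviews = pd.DataFrame(
    {
        "rating": [5.0, 2.0],
        "title": ["Loved it", "Not for me"],
        "images": ["", ""],
        "parent_asin": ["A", "Z"],
        "user_id": ["u1", "u2"],
        "verified_purchase": [True, True],
    }
)
toy_metadata = pd.DataFrame(
    {
        "title": ["Some Movie"],
        "images": [""],
        "store": ["Some Store"],
        "price": [9.99],
        "parent_asin": ["A"],
    }
)
merged = merge_with_metadata(toy_reviews, toy_metadata)
print(sorted(merged.columns))
```

With an inner join, the review for product `Z` is dropped because it has no metadata row; a left join would keep it with missing metadata values instead.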

We also added sentiment scores for each review using VADER from NLTK.

The final dataset, including the sentiment scores, has the shape (1341856, 17) and takes up ~0.75 GB.

In [2]:
# Code from dataset loader down to merging datasets:::

Next up is our graph. We use what we call a "Co-Reviews" graph: an edge is formed when two products have been reviewed by the same person. Each user who has reviewed both movie X and movie Y adds +1 to the weight of the edge between them. In this way, we create a graph where each node is a movie/show (essentially an ASIN product code). Each product has at least 15 reviews attached to it, and this is where the text analysis comes into play.

Our graph therefore connects movies that the same communities enjoy watching, which is one of the goals of this project. The immediate issue with this approach is that a few people are very prolific, whereas the vast majority write few reviews. A small number of individuals can therefore balloon the edge count and make the graph noisy. To counter this, we prune all edges with weight less than 2, so that at least two people must have reviewed both ASINs of a pair for their edge to survive; a product stays in the final graph only if it keeps at least one such edge.

This leaves us with a node count of 20,711 and an edge count of 91,728.
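
The construction and pruning described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code behind the reported counts: the function name `build_co_review_edges` is hypothetical, a user's repeated reviews of the same product are assumed to count once, and products left without any surviving edge are assumed to be dropped from the node set.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_review_edges(reviews, min_weight=2):
    """Build weighted co-review edges from (user_id, product_id) pairs."""
    # Collect the set of products each user has reviewed
    # (assumption: duplicate reviews of one product count once).
    products_by_user = defaultdict(set)
    for user_id, product_id in reviews:
        products_by_user[user_id].add(product_id)

    # Every pair of products reviewed by the same user gains +1 weight.
    # Sorting gives each undirected pair a single canonical key.
    weights = Counter()
    for products in products_by_user.values():
        for pair in combinations(sorted(products), 2):
            weights[pair] += 1

    # Prune edges supported by fewer than `min_weight` users.
    edges = {pair: w for pair, w in weights.items() if w >= min_weight}
    # Assumption: only products that keep at least one edge remain as nodes.
    nodes = {product for pair in edges for product in pair}
    return nodes, edges


toy_reviews = [
    ("u1", "X"), ("u1", "Y"), ("u1", "Z"),
    ("u2", "X"), ("u2", "Y"),
    ("u3", "Z"), ("u3", "W"),
]
nodes, edges = build_co_review_edges(toy_reviews)
print(sorted(nodes))  # ['X', 'Y']
print(edges)          # {('X', 'Y'): 2}
```

A user with k reviewed products contributes k(k-1)/2 pairs, which is why a handful of prolific reviewers can dominate the unpruned edge count. In the toy data, only the X-Y pair is shared by two users, so the other three edges and the products Z and W are pruned.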


In [3]:
# Code that shows how the graph is created:::

# PART III STUFF