# Exploring Amazon dataset
Link: https://amazon-reviews-2023.github.io/

In [1]:
import json
import pandas as pd
from tqdm import tqdm
import networkx as nx
from itertools import combinations, chain
from collections import Counter, defaultdict
import netwulf
from joblib import Parallel, delayed

Starting with "Movies and TV" 

## Extracting the dataset 

Loading the jsonl file and identifying the relevant columns

In [2]:
file_review = "Movies_and_TV.jsonl"
with open(file_review, 'r') as fp:
    for line in fp:
        print(json.loads(line.strip()))
        break

{'rating': 5.0, 'title': 'Five Stars', 'text': "Amazon, please buy the show! I'm hooked!", 'images': [], 'asin': 'B013488XFS', 'parent_asin': 'B013488XFS', 'user_id': 'AGGZ357AO26RQZVRLGU4D4N52DZQ', 'timestamp': 1440385637000, 'helpful_vote': 0, 'verified_purchase': True}


Function taken from https://github.com/hyp1231/AmazonReviews2023/blob/main/benchmark_scripts/kcore_filtering.py and edited to fit our needs

In [3]:
def load_ratings(file):
    inters = []
    with open(file, 'r') as fp:
        for line in tqdm(fp, desc='Load ratings'):
            try:
                dp = json.loads(line.strip())
                item, user, rating, time, text, verified = dp['parent_asin'], dp['user_id'], dp['rating'], dp['timestamp'], dp['text'], dp["verified_purchase"]
                inters.append((item, user, float(rating), int(time), text, bool(verified)))
            except ValueError:
                print(line)
    return inters

In [4]:
file_review = "Movies_and_TV.jsonl"
df_list = load_ratings(file_review)

Load ratings: 17328314it [01:19, 218841.00it/s]
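As an alternative to the line-by-line loader, the same columns could be read in chunks with pandas. This is a minimal sketch, not the method used in this notebook: `load_ratings_chunked` and `COLUMNS` are hypothetical names. It assumes `dtype=False` and `convert_dates=False` are wanted so that ASINs consisting only of digits stay strings and `timestamp` stays an integer.

```python
import pandas as pd

# Columns kept from each review record (same as in load_ratings)
COLUMNS = ["parent_asin", "user_id", "rating", "timestamp", "text", "verified_purchase"]

def load_ratings_chunked(path, chunksize=100_000):
    """Read a JSON Lines file in chunks and keep only the needed columns."""
    chunks = []
    # lines=True parses one JSON object per line; chunksize yields DataFrames piece by piece.
    # dtype=False and convert_dates=False switch off pandas' automatic type guessing.
    with pd.read_json(path, lines=True, chunksize=chunksize,
                      dtype=False, convert_dates=False) as reader:
        for chunk in reader:
            chunks.append(chunk[COLUMNS])
    return pd.concat(chunks, ignore_index=True)
```

Unlike `load_ratings`, this sketch does not skip malformed lines; a parsing error would stop the read.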


In [5]:
df_review = pd.DataFrame(df_list,columns=["parent_asin","user_id","rating","timestamp","text", "verified_purchase"])
df_review.head()

Unnamed: 0,parent_asin,user_id,rating,timestamp,text,verified_purchase
0,B013488XFS,AGGZ357AO26RQZVRLGU4D4N52DZQ,5.0,1440385637000,"Amazon, please buy the show! I'm hooked!",True
1,B00CB6VTDS,AGKASBHYZPGTEPO6LWZPVJWB2BVA,5.0,1461100610000,My Kiddos LOVE this show!!,True
2,B096Z8Z3R6,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,3.0,1646271834582,Annabella Sciorra did her character justice wi...,True
3,B09M14D9FZ,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,4.0,1645937761864,...there should be more of a range of characte...,False
4,B001H1SVZC,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,5.0,1590639227074,"...isn't always how you expect it to be, but w...",True


In [6]:
df_review.shape

(17328314, 6)

Keep only reviews from verified purchases

In [7]:
df_review_cleaned = df_review[df_review["verified_purchase"] == True]
df_review_cleaned.shape

(13757796, 6)

Filter based on length of review (at least 10 words)

In [8]:
df_review_cleaned = df_review_cleaned[df_review_cleaned['text'].apply(lambda x: len(str(x).split()) >= 10)]
df_review_cleaned.shape

(8006447, 6)

**After basic filtering there are 8,006,447 reviews in the dataset**
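The two review filters could also be written with vectorized string methods instead of `apply`. The sketch below runs on a toy frame; `filter_reviews` and `toy` are hypothetical names, and missing text is treated as an empty review, which is an assumption.

```python
import pandas as pd

def filter_reviews(df, min_words=10):
    """Keep verified purchases whose text has at least `min_words` whitespace-separated words."""
    # Missing text becomes an empty string, i.e. zero words
    word_counts = df["text"].fillna("").astype(str).str.split().str.len()
    mask = df["verified_purchase"].astype(bool) & (word_counts >= min_words)
    return df[mask]

toy = pd.DataFrame({
    "text": ["Five Stars", "one two three four five six seven eight nine ten", None],
    "verified_purchase": [True, True, True],
})
kept = filter_reviews(toy)  # only the ten-word review remains
```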

Doing the same with the movie metadata

In [11]:
file_product = "meta_Movies_and_TV.jsonl"
with open(file_product, 'r') as fp:
    for line in fp:
        print(json.loads(line.strip()))
        break


{'main_category': 'Prime Video', 'title': 'Glee', 'subtitle': 'UnentitledUnentitled', 'average_rating': 4.7, 'rating_number': 2004, 'features': ['IMDb 6.8', '2013', '22 episodes', 'X-Ray', 'TV-14'], 'description': ['Entering its fourth season, this year the members of New Directions compete amongst themselves to be the "new Rachel" and hold auditions to find new students. Meanwhile, the graduating class leaves the comforts of McKinley where Rachel struggles to please her demanding NYADA teacher (Kate Hudson) and Kurt second-guesses his decision to stay in Lima. Four newcomers also join the musical comedy.'], 'price': 22.39, 'images': [{'360w': 'https://images-na.ssl-images-amazon.com/images/S/pv-target-images/8251ee0b9f888d262cd817a5f1aee0b29ffed56a4535af898b827292f881e169._RI_SX360_FMwebp_.jpg', '480w': 'https://images-na.ssl-images-amazon.com/images/S/pv-target-images/8251ee0b9f888d262cd817a5f1aee0b29ffed56a4535af898b827292f881e169._RI_SX480_FMwebp_.jpg', '720w': 'https://images-na.s

In [12]:
def load_meta(file):
    inters = []
    with open(file, 'r') as fp:
        for line in tqdm(fp, desc='Load ratings'):
            try:
                dp = json.loads(line.strip())
                item, category, avg_rating, n_ratings, title = dp['parent_asin'], dp['main_category'], dp['average_rating'], dp['rating_number'], dp['title'] # changed to relevant types
                if avg_rating is None or n_ratings is None: # Removing all items without ratings
                    continue
                inters.append((item, category, float(avg_rating), int(n_ratings), title))
            except ValueError:
                print(line)
    return inters

In [13]:
file_product = "meta_Movies_and_TV.jsonl"
df_list2 = load_meta(file_product)

Load ratings: 748224it [00:09, 78892.29it/s]


In [14]:
df_product = pd.DataFrame(df_list2,columns=["parent_asin","main_category","average_rating","rating_number","title"])
df_product.head()

Unnamed: 0,parent_asin,main_category,average_rating,rating_number,title
0,B00ABWKL3I,Prime Video,4.7,2004,Glee
1,B09WDLJ4HP,Prime Video,3.0,6,One Perfect Wedding
2,B00AHN851G,Movies & TV,5.0,7,How to Make Animatronic Characters - Organic M...
3,B01G9ILXXE,Prime Video,4.3,35,Ode to Joy: Beethoven's Symphony No. 9
4,B009SIYXDA,Prime Video,4.7,360,Ben 10: Alien Force (Classic)


In [15]:
df_product.shape

(747978, 5)

Filtering, so that the main category is only Prime Video or Movies & TV

In [16]:
df_product_cleaned =  df_product[(df_product["main_category"] == "Prime Video") | (df_product["main_category"] == "Movies & TV")]
df_product_cleaned.shape

(708825, 5)

Filtering on the number of ratings, so that every movie has at least 10 ratings

In [17]:
rating_threshold = 10
df_product_cleaned = df_product_cleaned[df_product_cleaned["rating_number"] >= rating_threshold]
df_product_cleaned.shape

(550252, 5)

Merging the 2 dataframes <br> They are merged on parent_asin, using an inner join because we are only interested in the network that contains both reviews with text and movies that passed the metadata filters above

In [18]:
df_merged = df_review_cleaned.merge(df_product_cleaned, on="parent_asin", how='inner')
df_merged.head()

Unnamed: 0,parent_asin,user_id,rating,timestamp,text,verified_purchase,main_category,average_rating,rating_number,title
0,B096Z8Z3R6,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,3.0,1646271834582,Annabella Sciorra did her character justice wi...,True,Prime Video,3.9,182,
1,B001H1SVZC,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,5.0,1590639227074,"...isn't always how you expect it to be, but w...",True,Prime Video,4.5,389,
2,B06WVW16WY,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,5.0,1586999747540,As you learn about the very unique characters ...,True,Prime Video,4.8,1966,
3,B07RXM26FG,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,5.0,1569734232700,Our family loved the film. We have kids and th...,True,Prime Video,4.5,57962,
4,B0002J58ME,AGXVBIUFLFGMVLATYXHJYL4A5Q7Q,5.0,1146713492000,This DVD was GREAT! I am a stay at home mom w...,True,Movies & TV,4.6,793,10 Minute Solution: Pilates


In [19]:
df_merged.shape

(7747968, 10)
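To see how many rows an inner join drops on each side, one option is a diagnostic outer merge with `indicator=True` on a sample of the data. The frames `reviews` and `products` below are toy stand-ins, not the notebook's data.

```python
import pandas as pd

reviews = pd.DataFrame({
    "parent_asin": ["A", "A", "B", "C"],
    "user_id": ["u1", "u2", "u1", "u3"],
})
products = pd.DataFrame({
    "parent_asin": ["A", "B", "D"],
    "title": ["t_a", "t_b", "t_d"],
})

# validate="many_to_one" raises if parent_asin is duplicated in the product table
outer = reviews.merge(products, on="parent_asin", how="outer",
                      indicator=True, validate="many_to_one")

# "both" = kept by an inner join, "left_only" = review without product,
# "right_only" = product without review
counts = outer["_merge"].value_counts()

# Rows an inner join would keep
inner = outer[outer["_merge"] == "both"].drop(columns="_merge")
```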

Number of unique movies

In [20]:
df_merged["parent_asin"].nunique()

442558

Number of unique reviewers

In [21]:
df_merged["user_id"].nunique()

3552587

## Testing functions

In [22]:
df_merged[df_merged["user_id"] == "AG2L7H23R5LLKDKLBEF2Q3L2MVDA"]

Unnamed: 0,parent_asin,user_id,rating,timestamp,text,verified_purchase,main_category,average_rating,rating_number,title
0,B096Z8Z3R6,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,3.0,1646271834582,Annabella Sciorra did her character justice wi...,True,Prime Video,3.9,182,
1,B001H1SVZC,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,5.0,1590639227074,"...isn't always how you expect it to be, but w...",True,Prime Video,4.5,389,
2,B06WVW16WY,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,5.0,1586999747540,As you learn about the very unique characters ...,True,Prime Video,4.8,1966,


In [23]:
df_merged.groupby("parent_asin")["user_id"].count() # so this is the count of user ids for each movie

parent_asin
000014357X    5
0000143588    1
0001527665    1
0005019281    1
0005059836    2
             ..
B0CBNSVH5L    1
B0CBNV6JYF    4
B0CBNVT5YQ    1
B0CC6R1NTK    1
B0CGGVXKHK    1
Name: user_id, Length: 442558, dtype: int64

In [24]:
groups = df_merged.groupby("parent_asin")["user_id"].apply(list)
groups

parent_asin
000014357X    [AE4YKFLVAOESWGLVISB7HXVRI75Q, AHI5XCQKDUCXXMP...
0000143588                       [AEJ4GSF2IFH4ZVW623BBIGGAUSNQ]
0001527665                       [AF3TSDF4K4I54N7RGKADC4CBQCHQ]
0005019281                       [AGHFYEGKOI4NGDPWHU7VLMBCTVKQ]
0005059836    [AGGVVP35Y2ULXHVBAOO7HCM2Z3VQ, AGXG46F5YQF2VYK...
                                    ...                        
B0CBNSVH5L                       [AHZRARNCAZ3XPR6ZL6PTAY4UMVEQ]
B0CBNV6JYF    [AEXGPWXEFTZFLMV454BOLHNEOYKQ, AGLZJHGY4SFYZA3...
B0CBNVT5YQ                       [AGMBBZFVKNCBVHKB7VCXQBGEDTKQ]
B0CC6R1NTK                       [AEVS5GNGVPRJSXTCBJ7YESUY7RKQ]
B0CGGVXKHK                       [AGSJ4R3YLXZTNSHJK6HQXUF4CTGA]
Name: user_id, Length: 442558, dtype: object

## Creating graph

Creating groups based on reviewer, which gives a list of movies/TV titles per reviewer. Each list is then sorted, and lists with fewer than 2 items are removed
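The grouping described above can be sketched on a toy frame as follows. The names `toy` and `per_reviewer` are hypothetical, and dropping duplicate (reviewer, item) pairs first is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "parent_asin": ["B", "A", "A", "C", "A", "C"],
})

per_reviewer = (
    toy.drop_duplicates(["user_id", "parent_asin"])   # one row per (reviewer, item)
    .groupby("user_id")["parent_asin"]
    .apply(lambda s: sorted(s.tolist()))              # sorted list of items per reviewer
    .loc[lambda x: x.str.len() >= 2]                  # drop reviewers with fewer than 2 items
)
```

Here `u2` is removed because that reviewer has only one item, so no item pair can be formed.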

In [25]:
# grouped = (
#     df_merged.groupby('parent_asin')['user_id']
#     .apply(lambda g: sorted(g.tolist()))
#     .loc[lambda x: x.str.len() >= 2]
# )

In [26]:
# grouped, grouped.shape

In [None]:
# 1. Keep unique (reviewer, item) pairs
df_graph = df_merged[['user_id', 'parent_asin']].dropna().drop_duplicates()
# 2. Create bidirectional mappings
print("Creating mappings...")
# ASIN to reviewers
asin_to_reviewers = defaultdict(set)
# Reviewer to ASINs
reviewer_to_asins = defaultdict(set)

for _, row in tqdm(df_graph.iterrows(), total=len(df_graph)):
    asin_to_reviewers[row['parent_asin']].add(row['user_id'])
    reviewer_to_asins[row['user_id']].add(row['parent_asin'])

# 3. Create graph and add nodes
G = nx.Graph()
G.add_nodes_from(asin_to_reviewers.keys())

# 4. Parallel edge computation
def compute_edges_for_asin(asin, asin_to_reviewers, all_asins):
    edges = []
    # Get all other ASINs that share at least one reviewer
    co_reviewed_asins = set()
    for reviewer in asin_to_reviewers[asin]:
        co_reviewed_asins.update(reviewer_to_asins[reviewer])
    
    # Remove self and already processed pairs
    co_reviewed_asins.discard(asin)
    co_reviewed_asins = [a for a in co_reviewed_asins if all_asins.index(a) > all_asins.index(asin)]
    
    for other_asin in co_reviewed_asins:
        weight = len(asin_to_reviewers[asin] & asin_to_reviewers[other_asin])
        if weight > 0:
            edges.append((asin, other_asin, weight))
    return edges

print("Computing edges in parallel...")
all_asins = list(asin_to_reviewers.keys())
results = Parallel(n_jobs=-1, prefer="processes")(
    delayed(compute_edges_for_asin)(asin, asin_to_reviewers, all_asins)
    for asin in tqdm(all_asins)
)

# 5. Add edges to graph
print("Building graph...")
for edges in tqdm(results):
    for asin1, asin2, weight in edges:
        G.add_edge(asin1, asin2, weight=weight)

print(f"Graph constructed with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")


Creating mappings...


100%|██████████| 7674902/7674902 [03:23<00:00, 37695.06it/s]


Computing edges in parallel...


  0%|          | 306/442558 [11:20<303:37:45,  2.47s/it] 
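One likely contributor to the slow progress above is `all_asins.index(...)`, which scans a list of 442,558 items twice for every candidate neighbour. The sketch below orders each pair by comparing the ASIN strings directly, which is a constant-time check. It is a serial toy version under that assumption, not a drop-in replacement for the parallel cell; `build_edges` and `toy_pairs` are hypothetical names.

```python
from collections import defaultdict

def build_edges(pairs):
    """Return {(asin_a, asin_b): shared_reviewers} with asin_a < asin_b."""
    asin_to_reviewers = defaultdict(set)
    reviewer_to_asins = defaultdict(set)
    for user, asin in pairs:
        asin_to_reviewers[asin].add(user)
        reviewer_to_asins[user].add(asin)

    edges = {}
    for asin, reviewers in asin_to_reviewers.items():
        # Candidate neighbours: every item sharing at least one reviewer
        candidates = set().union(*(reviewer_to_asins[r] for r in reviewers))
        for other in candidates:
            # String comparison keeps each unordered pair once and skips self-pairs,
            # without the linear scans of list.index
            if other > asin:
                edges[(asin, other)] = len(reviewers & asin_to_reviewers[other])
    return edges

toy_pairs = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C"), ("u3", "C")]
edges = build_edges(toy_pairs)
```

The resulting dictionary can be passed to networkx as `G.add_weighted_edges_from((a, b, w) for (a, b), w in edges.items())`.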

In [None]:
# all_pairs = chain.from_iterable(combinations(movies, 2) for movies in grouped)
# coauthor_count = Counter(all_pairs)

MemoryError: 
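A different way to count shared reviewers, without materializing every pair in a Python `Counter`, is a sparse matrix product: with a binary reviewer-by-item matrix B, the entry (i, j) of BᵀB is the number of reviewers that items i and j have in common. This is a sketch on the assumption that SciPy is available; `co_review_edges` is a hypothetical helper, and the result for the full dataset may still be large.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def co_review_edges(df, min_weight=1):
    """Return (item_a, item_b, weight) rows where weight = number of shared reviewers."""
    pairs = df[["user_id", "parent_asin"]].drop_duplicates()
    user_codes, users = pd.factorize(pairs["user_id"])
    item_codes, items = pd.factorize(pairs["parent_asin"])
    items = np.asarray(items)

    # Binary reviewer-by-item incidence matrix
    B = sparse.csr_matrix(
        (np.ones(len(pairs), dtype=np.int32), (user_codes, item_codes)),
        shape=(len(users), len(items)),
    )
    # Strict upper triangle: each unordered pair once, no self-loops
    co = sparse.triu(B.T @ B, k=1).tocoo()
    keep = co.data >= min_weight
    return pd.DataFrame({
        "item_a": items[co.row[keep]],
        "item_b": items[co.col[keep]],
        "weight": co.data[keep],
    })

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "parent_asin": ["A", "B", "A", "B", "C", "C"],
})
edge_table = co_review_edges(toy)
```

Raising `min_weight` keeps only item pairs with several shared reviewers, which is one way to shrink the edge list before building the graph.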

In [None]:
Graph = nx.Graph()
Graph.add_weighted_edges_from(edgelist)

In [None]:
Graph.number_of_nodes(), Graph.number_of_edges()

(686859, 187974782)

In [None]:
# Find number of isolated nodes
num_isolated = len(list(nx.isolates(Graph)))
print(f"Number of isolated nodes in the network is {num_isolated}.")

Number of isolated nodes in the network is 0.


In [None]:
import numpy as np
from scipy import stats

In [None]:
# Compute the average, median, mode, minimum, and maximum degree of the nodes

# Get the degrees of all the nodes
degrees = [degree for node, degree in Graph.degree()]

avg_deg = np.mean(degrees)
med_deg = np.median(degrees)
mode_deg = stats.mode(degrees, keepdims=True)[0][0] # keepdims=True to get the mode as an array
min_deg = min(degrees)
max_deg = max(degrees)

print(f"Average degree of the nodes is {round(avg_deg, 1)}.")
print(f"Median degree of the nodes is {med_deg}.")
print(f"Mode degree of the nodes is {mode_deg}.")
print(f"Minimum degree of the nodes is {min_deg}.")
print(f"Maximum degree of the nodes is {max_deg}.")

Average degree of the nodes is 547.3.
Median degree of the nodes is 64.0.
Mode degree of the nodes is 1.
Minimum degree of the nodes is 1.
Maximum degree of the nodes is 63055.
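The mean degree (547.3) is far above the median (64.0), which points to a strongly right-skewed degree distribution. Because the edges carry weights, the weighted degree (strength) may be worth reporting alongside the plain degree. The sketch below uses a toy graph; `degree_summary` is a hypothetical helper.

```python
import networkx as nx
import numpy as np

def degree_summary(graph):
    """Summarize plain and weighted degree for a weighted, undirected graph."""
    degrees = np.array([d for _, d in graph.degree()])
    # degree(weight="weight") sums the edge weights around each node
    strengths = np.array([s for _, s in graph.degree(weight="weight")])
    values, counts = np.unique(degrees, return_counts=True)
    return {
        "mean_degree": float(degrees.mean()),
        "median_degree": float(np.median(degrees)),
        "mode_degree": int(values[counts.argmax()]),
        "mean_strength": float(strengths.mean()),
    }

toy = nx.Graph()
toy.add_weighted_edges_from([("A", "B", 2), ("A", "C", 1), ("A", "D", 5)])
summary = degree_summary(toy)
```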


In [None]:
netwulf.visualize(Graph)

KeyboardInterrupt: 

In [None]:
netwulf.visualize(
    Graph,
    config={
        'colorPalette': 'category10',  # Use distinct colors
        'zoom': 0.6,
        'nodeCharge': -50,
        'gravity': 0.25
    })

KeyboardInterrupt: 
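With 686,859 nodes and 187,974,782 edges, the full graph is probably too large for an interactive layout. One option is to visualize a reduced graph instead. The sketch below keeps only edges above a weight threshold and then the highest-degree nodes; `backbone`, `min_weight` and `max_nodes` are hypothetical names, and the values shown are illustrative.

```python
import networkx as nx

def backbone(graph, min_weight=2, max_nodes=1000):
    """Keep edges with weight >= min_weight, then the max_nodes highest-degree nodes."""
    strong = nx.Graph()
    strong.add_weighted_edges_from(
        (u, v, w)
        for u, v, w in graph.edges(data="weight", default=1)
        if w >= min_weight
    )
    # Rank nodes by degree in the thresholded graph
    top = sorted(strong.degree(), key=lambda nd: nd[1], reverse=True)[:max_nodes]
    return strong.subgraph([n for n, _ in top]).copy()

toy = nx.Graph()
toy.add_weighted_edges_from([("A", "B", 3), ("A", "C", 1), ("B", "D", 2), ("C", "D", 1)])
small = backbone(toy, min_weight=2, max_nodes=3)
# netwulf.visualize(small)  # run interactively on the reduced graph
```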

## Sentiment analysis

Printing a few example reviews for the presentation

In [11]:
for idx, item in enumerate(df_review_cleaned["text"]):
    print(item)
    print("---")
    if idx > 10:
        break

Annabella Sciorra did her character justice with her portrayal of a mentally ill, depressed and traumatized individual who projects much of her inner wounds onto others. The challenges she faces with her father were sensitively portrayed and resonate with understanding and love. The ending really isn't an ending, though and feels like it was abandoned with not enough of a closure but other than that, its a decent movie to sit through if you're the type of person who likes to people-watch or analyze the actions of others. Has an independent-movie feel which is also somewhat comforting.
---
...isn't always how you expect it to be, but when its there, you know. That is what this movie is all about. Deep struggles within a broken home and a mother with an addiction and a best friend whom nobody else seems to understand but him. There's loss, there's triumph. There's a bit of magic with a psychic neighbor who makes quite an impact on the boy. This is one definitely worth your time.
---
As y

In [12]:
from nltk.sentiment import SentimentIntensityAnalyzer

# SentimentIntensityAnalyzer needs the VADER lexicon: run nltk.download('vader_lexicon') once
text_test = "One of the best movies out there! It's even better watching it now than when I was a young child. Loved it then and love it now."
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(text_test)

{'neg': 0.0, 'neu': 0.575, 'pos': 0.425, 'compound': 0.9476}
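To score many reviews, the compound value can be mapped to a label. The sketch below uses the commonly cited VADER cut-offs of ±0.05, which is an assumption rather than something fixed in this notebook. `compound_to_label` and `score_reviews` are hypothetical helpers; `score_reviews` accepts any analyzer object with a `polarity_scores` method.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def score_reviews(texts, analyzer):
    """Return one dict per text with the analyzer's scores plus a label."""
    rows = []
    for text in texts:
        scores = analyzer.polarity_scores(str(text))
        rows.append({**scores, "label": compound_to_label(scores["compound"])})
    return rows

# Usage with NLTK (assumes the lexicon was fetched with nltk.download("vader_lexicon")):
# from nltk.sentiment import SentimentIntensityAnalyzer
# rows = score_reviews(df_review_cleaned["text"].head(100), SentimentIntensityAnalyzer())
```

With the compound score of 0.9476 shown above, `compound_to_label` returns `"positive"`.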