In [2]:
import pandas as pd
import networkx as nx

# URL pointing to the u.data file
data_url = 'https://raw.githubusercontent.com/sadia-perveen/Data620/main/u.data'

# Read the data directly from the URL
data = pd.read_csv(data_url, sep='\t', names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

# Construct the bipartite graph
B = nx.Graph()
# Add user nodes, prefixed with 'u' so they cannot collide with movie IDs
B.add_nodes_from((f'u{user}' for user in data['user_id'].unique()), bipartite=0)

# Add movie nodes, prefixed with 'm' so they cannot collide with user IDs
B.add_nodes_from((f'm{movie}' for movie in data['movie_id'].unique()), bipartite=1)

# Add one edge per rating, weighted by the rating value
for row in data.itertuples(index=False):
    B.add_edge(f'u{row.user_id}', f'm{row.movie_id}', weight=row.rating)

# Apply the island method: keep only edges with a rating of 4 or higher
threshold = 4
B_sub = B.copy()
low_edges = [(u, v) for u, v, d in B.edges(data=True) if d['weight'] < threshold]
B_sub.remove_edges_from(low_edges)

# Identify distinct groups (connected components) in the thresholded graph.
# Nodes left isolated by the threshold (components of size 1) are skipped.
groups = [B_sub.subgraph(c).copy() for c in nx.connected_components(B_sub) if len(c) > 1]

# Output the results
print(f"Total groups after thresholding: {len(groups)}")
for i, group in enumerate(groups, 1):
    print(f"Group {i} has {len(group.nodes())} nodes.")

# Analyzing the distinct groups
for i, group in enumerate(groups, 1):
    users = [n for n in group.nodes if group.nodes[n]['bipartite'] == 0]
    movies = [n for n in group.nodes if group.nodes[n]['bipartite'] == 1]
    print(f"Group {i} consists of {len(users)} users and {len(movies)} movies.")


Total groups after thresholding: 1
Group 1 has 2389 nodes.
Group 1 consists of 942 users and 1447 movies.


### Inference
After keeping only ratings of 4 or higher, the MovieLens graph still forms a single connected group of 2389 nodes: 942 users and 1447 movies. Users or movies with no rating of 4 or higher become isolated nodes and are dropped by the `len(c) > 1` filter, so they are not counted here. Within the group, every user can be reached from every other user through a chain of highly rated movies, which means that strong preferences are not confined to isolated pockets but overlap across the user base. Connectivity is a weak condition, though: it does not imply that any two users rated the same movie highly, only that a path of shared favorites links them. This overlap is what collaborative filtering relies on, so the structure is encouraging for recommendation purposes. Because a threshold of 4 leaves one large group rather than several islands, finding finer clusters of taste would require a stricter threshold or further analysis within this group.
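
To see how the choice of threshold affects the number of groups, the thresholding step can be repeated over a range of cut-offs. The sketch below is a minimal illustration on a small, made-up ratings table, not on `u.data`. The `toy` DataFrame and the `island_summary` helper are hypothetical names introduced for this example. It reports, for each threshold, how many groups of size greater than 1 remain and how many nodes are left isolated.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy ratings (NOT taken from u.data); same column names as above.
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3, 4],
    'movie_id': [10, 11, 10, 12, 12, 13, 13],
    'rating':   [5, 4, 5, 3, 4, 5, 2],
})

def island_summary(ratings, threshold):
    """Return (groups larger than one node, isolated nodes) at a rating threshold."""
    df = ratings.assign(
        user='u' + ratings['user_id'].astype(str),
        movie='m' + ratings['movie_id'].astype(str),
    )
    G = nx.Graph()
    # Add every node first so users and movies that lose all edges still count.
    G.add_nodes_from(df['user'].unique(), bipartite=0)
    G.add_nodes_from(df['movie'].unique(), bipartite=1)
    # Keep only edges whose rating meets the threshold.
    kept = df[df['rating'] >= threshold]
    G.add_weighted_edges_from(
        kept[['user', 'movie', 'rating']].itertuples(index=False, name=None)
    )
    sizes = [len(c) for c in nx.connected_components(G)]
    return sum(s > 1 for s in sizes), sum(s == 1 for s in sizes)

# Sweep every possible rating value as the threshold.
summary = {t: island_summary(toy, t) for t in range(1, 6)}
for t, (n_groups, n_isolated) in summary.items():
    print(f"threshold {t}: {n_groups} group(s), {n_isolated} isolated node(s)")
```

On the real data, passing `data` instead of `toy` would show whether a threshold of 5 splits the single large group into several islands. That result is not computed here.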