# Exploratory analysis on weighted directed user graph

1. Read the input file that maps each user to all the movies that user has rated.
2. Create a weighted directed graph G(V,E).
3. V is the set of user_ids; an edge exists between two users if they have rated at least one movie in common.
4. Weight of E(i->j) = (number of movies rated by both i and j) / (number of movies rated by user i).
5. Find active users using PageRank.
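The edge construction in steps 1 to 4 can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code that produced `wd_um_graph.txt`: the `user_movies` list, its toy contents, and the helper name `build_edges` are all hypothetical, and each user's movie list is assumed to contain no duplicates.

```r
# Hypothetical toy input: each user maps to the movies they rated.
user_movies <- list(
  u1 = c("m1", "m2", "m3", "m4"),
  u2 = c("m1", "m2"),
  u3 = c("m5")
)

# Build one directed edge i -> j for every ordered pair of distinct users
# that share at least one rated movie.
build_edges <- function(user_movies) {
  users <- names(user_movies)
  pairs <- expand.grid(node_1 = users, node_2 = users, stringsAsFactors = FALSE)
  pairs <- pairs[pairs$node_1 != pairs$node_2, ]

  # Number of movies rated by both i and j.
  shared <- unname(mapply(
    function(i, j) length(intersect(user_movies[[i]], user_movies[[j]])),
    pairs$node_1, pairs$node_2
  ))

  # Weight of i -> j: shared movies divided by the number of movies rated by i.
  rated_by_i <- as.numeric(lengths(user_movies)[pairs$node_1])
  pairs$weight <- shared / rated_by_i

  edges <- pairs[shared > 0, ]
  rownames(edges) <- NULL
  edges
}

edges <- build_edges(user_movies)
print(edges)
```

In this toy input, u2's two movies are both rated by u1, so the edge u2->u1 gets weight 2/2, while u1->u2 gets weight 2/4. User u3 shares no movie with anyone and therefore has no edges.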

Each edge i->j of the weighted directed graph G(V,E) carries the fraction of user i's rated movies that user j has also rated. A higher weight on i->j therefore indicates that most of the movies rated by i were also rated by j; because the denominators differ, the weights on i->j and j->i are generally not equal. However, we can't infer anything about the users' preferences from this graph, since it doesn't take into account the ratings they gave to the movies they reviewed. So we look instead for users whose rated movies overlap heavily with those of their neighbours; such users have a higher PageRank and can be categorised as "active".
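To make the link between edge weights and PageRank concrete, here is a small power-iteration sketch of weighted PageRank in base R. It is an illustration only and is not igraph's implementation; the function name `weighted_pagerank`, the toy edge list, and the uniform handling of users without outgoing edges are assumptions made for this example.

```r
# Hypothetical toy edge list in the same (node_1, node_2, weight) layout.
toy_edges <- data.frame(
  node_1 = c("u2", "u3", "u1", "u1"),
  node_2 = c("u1", "u1", "u2", "u3"),
  weight = c(1.0, 1.0, 0.5, 0.25),
  stringsAsFactors = FALSE
)

# Weighted PageRank by power iteration (illustrative sketch).
weighted_pagerank <- function(edges, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  nodes <- sort(unique(c(edges$node_1, edges$node_2)))
  n <- length(nodes)

  # W[i, j] holds the weight of the edge i -> j.
  W <- matrix(0, n, n, dimnames = list(nodes, nodes))
  W[cbind(edges$node_1, edges$node_2)] <- edges$weight

  # Row-normalise so each user splits their score in proportion to edge weight.
  # Assumption: users with no outgoing weight spread their score uniformly.
  out <- rowSums(W)
  P <- W / ifelse(out > 0, out, 1)
  P[out == 0, ] <- 1 / n

  r <- rep(1 / n, n)
  for (k in seq_len(max_iter)) {
    r_next <- (1 - damping) / n + damping * as.vector(r %*% P)
    delta <- sum(abs(r_next - r))
    r <- r_next
    if (delta < tol) break
  }
  setNames(r, nodes)
}

toy_pr <- weighted_pagerank(toy_edges)
print(sort(toy_pr, decreasing = TRUE))
```

In this toy graph u1 receives heavily weighted edges from both other users and ends up with the highest score, and u2 outranks u3 because u1 passes on more of its score along the heavier edge u1->u2.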

In [1]:
#import required libraries
library(doMC)
library(Kmisc)
library(igraph)
library(data.table)
registerDoMC(8)

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel

Attaching package: 'igraph'

The following object is masked from 'package:Kmisc':

    tree

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union


Attaching package: 'data.table'

The following object is masked from 'package:Kmisc':

    transpose



In [2]:
#set path
setwd("C:/Users/VishankBhatia/Desktop/FellowshipAI/MovieRecomSys")

#reading WD graph data
data_list <-read.table(file="./FINAL/data/wd_um_graph.txt", header = FALSE, sep = "\t", quote="", dec = ".")

#building WD user movie graph
colnames(data_list) = c("node_1", "node_2", "weight")
wd_graph <- graph.data.frame(data_list,directed = TRUE)

#graph properties
cat("The number of nodes: ", vcount(wd_graph))
cat("\nThe number of edges: ", ecount(wd_graph))

#computing pagerank for vertices (the "weight" edge attribute is used as edge weights by default)
wd_pagerank = page.rank(wd_graph, directed = TRUE, damping = 0.85)

The number of nodes:  943
The number of edges:  858220

In [3]:
#top users having highest pagerank
top_users = sort(wd_pagerank$vector, decreasing = T, index.return = T)

#top active users
top_user_id = matrix(names(top_users$x[1:943]), nrow=943, ncol=1)

#pagerank of top active users
top_user_pg = matrix(as.numeric(top_users$x[1:943]), nrow=943, ncol=1)

#top active user_id - pagerank
topactive_user = cbind(top_user_id,top_user_pg)

In [4]:
#top10
top10_user_id = matrix(names(top_users$x[1:10]), nrow=10, ncol=1)
top10_user_pg = matrix(as.numeric(top_users$x[1:10]), nrow=10, ncol=1)
topactive10_user = cbind(top10_user_id,top10_user_pg)

#top100
top100_user_id = matrix(names(top_users$x[1:100]), nrow=100, ncol=1)
top100_user_pg = matrix(as.numeric(top_users$x[1:100]), nrow=100, ncol=1)
topactive100_user = cbind(top100_user_id,top100_user_pg)

#top300
top300_user_id = matrix(names(top_users$x[1:300]), nrow=300, ncol=1)
top300_user_pg = matrix(as.numeric(top_users$x[1:300]), nrow=300, ncol=1)
topactive300_user = cbind(top300_user_id,top300_user_pg)

#top500
top500_user_id = matrix(names(top_users$x[1:500]), nrow=500, ncol=1)
top500_user_pg = matrix(as.numeric(top_users$x[1:500]), nrow=500, ncol=1)
topactive500_user = cbind(top500_user_id,top500_user_pg)

In [5]:
#write the top active user information to file
write.table(topactive_user, file="./FINAL/data/top943userpg.txt", row.names=FALSE, col.names=FALSE, sep = "\t")
write.table(topactive100_user, file="./FINAL/data/top100userpg.txt", row.names=FALSE, col.names=FALSE, sep = "\t")
write.table(topactive300_user, file="./FINAL/data/top300userpg.txt", row.names=FALSE, col.names=FALSE, sep = "\t")
write.table(topactive500_user, file="./FINAL/data/top500userpg.txt", row.names=FALSE, col.names=FALSE, sep = "\t")

In [6]:
#degree (incoming plus outgoing edges, self-loops ignored) of every user, in PageRank order
fil = degree(wd_graph, v = top_user_id, mode = "all", loops = FALSE, normalized = FALSE)

In [7]:
#induced subgraph on the top 10 users by PageRank, built once and reused
top10_graph = induced_subgraph(wd_graph, top10_user_id)
ndColor = rep("lightblue", vcount(top10_graph))
ndSize = rep(0, vcount(top10_graph))
name = paste("./FINAL/top10",".png",sep="")
png(name)
plot(top10_graph, vertex.size=ndSize, vertex.color=ndColor, asp = 9/16, edge.color = "grey", layout=layout.fruchterman.reingold)
dev.off()

PageRank is not simply proportional to a node's degree, but in this graph it tends to grow with the number and weight of a user's incoming edges: a user scores highly when many other users have a large share of their rated movies in common with that user. A higher PageRank therefore indicates greater commonality between the movies rated by the user and those rated by others. We'll further use a subset of these users' PageRank values as additional features when predicting movie ratings for those users.
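One possible way to attach PageRank scores to rating records as a feature is sketched below in base R. The `ratings` table, its column names, and the score values are hypothetical placeholders, not data or results from this notebook; in practice the scores would come from `wd_pagerank$vector`.

```r
# Hypothetical ratings table; column names are assumptions for illustration.
ratings <- data.frame(
  user_id  = c("1", "1", "2", "3"),
  movie_id = c("10", "20", "10", "30"),
  rating   = c(4, 3, 5, 2),
  stringsAsFactors = FALSE
)

# Hypothetical PageRank scores keyed by user id (placeholder values).
pagerank_scores <- c("1" = 0.5, "2" = 0.3, "3" = 0.2)

# Turn the named vector into a lookup table with one row per user.
user_features <- data.frame(
  user_id  = names(pagerank_scores),
  pagerank = as.numeric(pagerank_scores),
  stringsAsFactors = FALSE
)

# Left join: every rating row keeps its user and gains that user's PageRank.
features <- merge(ratings, user_features, by = "user_id", all.x = TRUE)
print(features)
```

A left join (`all.x = TRUE`) keeps rating rows for users outside the selected subset; their `pagerank` column is then `NA` and needs separate handling before modelling.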