# Rating Profiles
## Network Analysis - University of Helsinki
* Teemu Koivisto (teemu.koivisto@helsinki.fi)
* Tomáš Vopat (tomas.vopat@helsinki.fi)

Dataset: http://www.occamslab.com/petricek/data/

# Project description

The project plan was to visualize and interpret the profile ratings of dating site users. The data consisted of a directed network with a total of 135 000 nodes and 17 million edges, where each node (user) also has a categorical gender variable (Female, Male, Unknown). Each rating was given on a scale of 1-10, and the dataset's two files, the ratings and the gender list, were 245 MB and 2 MB in size respectively.

Our initial goal, as written in our project plan, was to understand how the profile ratings behave; in this context, what effect gender has on the ratings a user gives or receives. We were also interested in whether the number of ratings a user receives (popularity) correlates with how well that user is rated, and whether giving ratings leads to better received ratings. Another interesting question was whether LGBT (lesbian, gay, bisexual, and transgender) users could be identified from the ratio of genders they have rated.

All the analysis was done in a Jupyter notebook with Python, and some of the visualizations were made with Gephi. We also used Spark for running graph algorithms, but without proper resources we could not use it to its full potential.

# Analysis

First, we spent a lot of time preprocessing and formatting the data. While the original data wasn't large per se, grouping it into specific subsets required adding many features to a single dataframe. For this we used Pandas with painstakingly hand-crafted logic.

After building this large dataframe of features, we split it into groups based on gender and other features, which we then visualized as pie charts and histograms. **and some small plottings for communities eg eigen somethings**.

The very large number of edges made deeper graph analysis quite difficult, as the algorithms could only be run on a tiny fraction of the data. Any subset we used would therefore probably not be representative of the dataset as a whole, and its results would be unreliable. Just visualizing the network was quite tedious, and the fact that the network was directed probably made it even slower than an undirected graph would have been. We found empirically that a network of 15k - 50k edges was the largest we could visualize with e.g. Gephi.
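
One way to obtain a Gephi-sized subset is to draw a uniform random sample of edges and write it out as an edge list. The sketch below is only an illustration under assumptions: the function names `sample_edges_for_gephi` and `edges_to_csv`, the fixed seed, and the toy edges are hypothetical, and the `Source`/`Target`/`Weight` header follows the column names that Gephi's spreadsheet import is commonly used with. Such a sample is unlikely to be representative of the full network, for the reasons given above.

```python
import csv
import io
import random

def sample_edges_for_gephi(edges, k, seed=0):
    """Return k edges drawn uniformly at random (all edges if fewer than k exist)."""
    edges = list(edges)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(edges, min(k, len(edges)))

def edges_to_csv(edges):
    """Serialize (source, target, weight) triples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()

# toy stand-in for the (user_id, rating_profile, rating_weight) rows
toy_edges = [(u, u + 1, (u % 10) + 1) for u in range(1, 101)]
sample = sample_edges_for_gephi(toy_edges, k=20)
csv_text = edges_to_csv(sample)
```

With the real data, the rows of the ratings dataframe could be passed in place of `toy_edges`, with `k` somewhere in the 15k - 50k range mentioned above.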

## Preprocessing

The data itself was quite clean and had no missing values. The only peculiarities were the "Unknown" value among the gender categories and the fact that some ratings were for non-existent profiles.

## Overview

The users (nodes) were quite evenly split between females and males, with users of unknown gender making up about 1/10.

**pie chart of nodes with gender**

Looking at the edges, we can see that males and females also have an equal number of outgoing edges.

**pie chart of edges with gender**

Going deeper, we can split the ratings into three groups with the following thresholds: 7-10 for a **positive** rating, 4-6 for **neutral** and 1-3 for **negative**. About half of the ratings are positive, and the other two classes take roughly a quarter each. The mean of all ratings is **ratings mean**.
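
The code cells of this notebook use integer cut-offs for the three classes: 1-3 is negative, 4-6 is neutral and 7-10 is positive. A minimal sketch of that rule in plain Python follows; the function name `rating_class` and the toy ratings are made up for illustration and are not part of the original analysis.

```python
from collections import Counter

def rating_class(rating):
    """Map a 1-10 rating to 'negative' (1-3), 'neutral' (4-6) or 'positive' (7-10)."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating <= 3:
        return "negative"
    if rating <= 6:
        return "neutral"
    return "positive"

toy_ratings = [1, 3, 4, 6, 7, 10, 9, 5]  # made-up values
class_counts = Counter(rating_class(r) for r in toy_ratings)
# share of each class among all ratings
class_shares = {name: count / len(toy_ratings) for name, count in class_counts.items()}
```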

Yet these ratings do not fall evenly on the individual nodes: some users have a disproportionately large number of positive or negative ratings. The rating distributions of the different genders show that the genders do not give out ratings evenly either. A noticeable anomaly is the high ratings given to some males, which show up as the highest bar in the histogram:

**histograms of gender rating distributions**

Taking the mean rating of each user and plotting these means as a histogram shows that they are approximately normally distributed, as the Central Limit Theorem would suggest.

**histograms of gender rating mean distributions**

One can then start to wonder what the underlying cause of these fluctuations in the data is. Perhaps users with great pictures of their faces get better ratings? Maybe there has to be more than one great picture? Or perhaps they have written interesting descriptions of themselves, or are in some other way more interesting than the average user.

This, however, we cannot infer from the data; we can only guess at the true drivers of the given ratings. What we can do is check whether there is a correlation between popularity (a large number of incoming edges) and the average rating received. This would at least suggest that being popular is an indicator of being well rated on a dating site.
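
Such a check could look roughly like the sketch below, which computes a Pearson correlation with NumPy. It is an illustration rather than a result of this project: the helper `pearson_correlation` and the toy values are hypothetical, and with the real data the `inRating_count` and `inRating_mean` columns built in the preprocessing section would be passed instead. Because the rating counts are heavily skewed, the sketch also correlates against the logarithm of the counts, which is a modelling choice and not something the original analysis states.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation of two equally long sequences, ignoring pairs with NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # e.g. users without received ratings have no mean
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

# made-up toy values: number of ratings received and mean rating received
toy_in_count = [10, 20, 40, 80, 160, 3]
toy_in_mean = [4.0, 5.0, 6.0, 7.0, 8.0, float("nan")]

r_raw = pearson_correlation(toy_in_count, toy_in_mean)
r_log = pearson_correlation(np.log(toy_in_count), toy_in_mean)
```

A correlation close to zero would indicate that popularity and average rating are largely unrelated; a clearly positive value would only show association, not that popularity causes better ratings.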

Another interesting question that came to mind was whether there is a way to subset and visualize LGBT (lesbian, gay, bisexual, and transgender) users based on which genders they have rated more.

## Degree distribution

Although the graph had a large number of edges, they were distributed very unevenly. The majority of users (nodes) had only a few dozen ratings (weighted edges), yet a small portion of users (so-called super nodes) had as many as 35 000 outgoing edges. This disparity in node degrees was an interesting phenomenon, so we set out to explore whether the degrees follow a power-law distribution, as has often been observed in graphs.

**picture**

Using a standard Maximum Likelihood Estimator (MLE) to estimate the parameters of the power-law degree distribution, we obtained the values **constant=x, scale=y**. Plotting the fitted probability density function against the histogram of the dataset's degree distribution (in log scale) suggests that the degree distribution follows a power law, although the parameter estimates were a bit off and a mixture model might be more suitable. (We could not get that plot to work.)

**picture of the degree distribution with PDF of fitted MLE power-law distribution model** 
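
For reference, the usual continuous power law `p(x) ~ x^(-alpha)` for `x >= x_min` has a closed-form maximum likelihood estimate, `alpha = 1 + n / sum(ln(x_i / x_min))`. The sketch below implements that formula in plain Python. It is not the fit used in this notebook: the function name `fit_power_law_alpha`, the choice of `x_min` and the toy degrees are assumptions made for illustration. Note also that `scipy.stats.powerlaw` models a different family (density `a * x^(a-1)` on a bounded interval), so its shape parameter is not directly comparable to `alpha`.

```python
import math

def fit_power_law_alpha(values, x_min=1.0):
    """MLE of the exponent alpha for a continuous power law on x >= x_min."""
    tail = [x for x in values if x >= x_min]  # values below x_min are ignored
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need values above x_min to estimate alpha")
    alpha = 1.0 + len(tail) / log_sum
    # approximate standard error of the estimate
    sigma = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, sigma

toy_degrees = [1, 1, 2, 2, 3, 4, 8, 16, 64]  # made-up degrees
alpha_hat, alpha_err = fit_power_law_alpha(toy_degrees, x_min=1.0)
```

The estimate is sensitive to `x_min`, so in practice several cut-offs would need to be compared before trusting the exponent.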

## Lesbian and gay users

Looking at the data, we found a smallish subset of users who had given more than 50% of their ratings to their own gender. The issue with this method is that it seems quite normal for any user to rate anyone else, which makes determining a user's sexuality quite difficult.

Still, using a threshold of 50% we could pick out a subset of users who are likely to be at least somewhat ambiguous in their sexuality. Splitting them into two categories by their gender, Male and Female, we denote them as the "Gay" and "Lesbian" subsets respectively.

For these two groups we found... something.

**pictures**

## Community analysis

A complete community analysis of the users was not feasible, as we lacked the resources to run the algorithms on the complete dataset. Instead, we used some subsets of the data to visualize portions of the network and how they are connected.

## Imports

In [None]:
# imports
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize  # needed for scipy.optimize.minimize below; also binds the name scipy
from scipy.stats import powerlaw
from networkx.algorithms import community
%matplotlib inline

## 1) Data Preprocessing

### Reading Data

In [None]:
# loading files
PATH_RATINGS = "data/ratings.csv"
PATH_GENDER = "data/gender.csv"

ratings = pd.read_csv(PATH_RATINGS, names=['user_id', 'profile_id', 'rating'])
gender = pd.read_csv(PATH_GENDER, names=['user_id', 'gender'])

# merging datasets
df = pd.merge(ratings, gender, how = "left", on = "user_id")
df = pd.merge(df, gender, how = "left", left_on = "profile_id", right_on = "user_id").drop("user_id_y", axis = 1)
df.columns = ["user_id", "rating_profile", "rating_weight", "user_gender", "rating_gender"]
df = df[["user_id", "user_gender", "rating_profile", "rating_gender", "rating_weight"]]

# converting gender
df["user_gender"] = df["user_gender"].astype("category")
df["rating_gender"] = df["rating_gender"].astype("category")
print("user gender vals:", df["user_gender"].unique())
print("rating gender vals:", df["rating_gender"].unique())

df.info()  # info() prints its summary itself and returns None
print(df.nunique())
display(df)

In [None]:
# There are 135,359 profiles in the dataset, but there are 168,791 unique ids of rated profiles. 
# There are ratings of accounts that do not exist...
# Dropping that...

max_id = df["user_id"].max()
df = df[df["rating_profile"] <= max_id]

### Aggregating Data

In [None]:
# statistics about INCOMING ratings
group_profile = df.groupby(by = ["rating_profile"])
agg_profile = group_profile.agg({"rating_weight": "mean", "user_id": "nunique"})
agg_profile = agg_profile.reset_index()
agg_profile.columns = ["user_id", "inRating_mean", "inRating_count"]

group_profile_gender = df.groupby(by = ["rating_profile", "user_gender"])
agg_profile_gender = group_profile_gender.agg({"rating_weight": "mean", "user_id": "nunique"})
agg_profile_gender = agg_profile_gender.unstack()
agg_profile_gender.columns = ["inRating_mean_female", "inRating_mean_male", "inRating_mean_unknown", "inRating_count_female", "inRating_count_male", "inRating_count_unknown"]
agg_profile_gender = agg_profile_gender.reset_index()
agg_profile_gender = agg_profile_gender.rename({"rating_profile": "user_id"}, axis = 1)

# statistics about OUTGOING ratings
group_user = df.groupby(by = ["user_id"])
agg_user = group_user.agg({"rating_weight": "mean", "rating_profile": "nunique"})
agg_user = agg_user.reset_index()
agg_user.columns = ["user_id", "outRating_mean", "outRating_count"]

group_user_gender = df.groupby(by = ["user_id", "rating_gender"])
agg_user_gender = group_user_gender.agg({"rating_weight": "mean", "rating_profile": "nunique"})
agg_user_gender = agg_user_gender.unstack()
agg_user_gender.columns = ["outRating_mean_female", "outRating_mean_male", "outRating_mean_unknown", "outRating_count_female", "outRating_count_male", "outRating_count_unknown"]
agg_user_gender = agg_user_gender.reset_index()

# merging aggregations
df_detailed = pd.merge(agg_user, agg_profile, how = "left", on = "user_id")
df_detailed = pd.merge(df_detailed, agg_user_gender, how = "left", on = "user_id")
df_detailed = pd.merge(df_detailed, agg_profile_gender, how = "left", on = "user_id")
df_detailed = pd.merge(df_detailed, df[["user_id", "user_gender"]].drop_duplicates(), how = "left", on = "user_id", validate = "one_to_one")

print(df_detailed["user_id"].nunique())
display(df_detailed)

In [None]:
# filling NaN
cols = [
    "inRating_count",
    "inRating_count_female", 
    "inRating_count_male",
    "inRating_count_unknown",
    "outRating_count",
    "outRating_count_female", 
    "outRating_count_male", 
    "outRating_count_unknown"]
for col in cols:
    df_detailed[col] = df_detailed[col].fillna(0)
display(df_detailed)

In [None]:
# reordering columns + creating concise dataset

df_detailed = df_detailed[[
    "user_id", 
    "user_gender", 
    "outRating_count",
    "outRating_count_male",
    "outRating_count_female",
    "outRating_count_unknown",
    "outRating_mean",
    "outRating_mean_male",
    "outRating_mean_female",
    "outRating_mean_unknown",
    "inRating_count",
    "inRating_count_male",
    "inRating_count_female",
    "inRating_count_unknown",
    "inRating_mean",
    "inRating_mean_male",
    "inRating_mean_female",
    "inRating_mean_unknown"]]

df_concise = df_detailed[[
    "user_id", 
    "user_gender", 
    "inRating_count", 
    "outRating_count", 
    "inRating_mean", 
    "outRating_mean"]]

display(df_concise)

### Rate of rating males

In [None]:
# male ratings ratio
df_detailed["outRating_male_ratio"] = df_detailed["outRating_count_male"] / df_detailed["outRating_count"]
df_detailed["inRating_male_ratio"] = df_detailed["inRating_count_male"] / df_detailed["inRating_count"]
df_detailed[["user_id", "user_gender", "outRating_count", "outRating_male_ratio", "inRating_count", "inRating_male_ratio"]]
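
Later cells filter on a boolean `user_straight` column that none of the cells shown here creates. A minimal sketch of how such a flag could be derived from the rating counts is given below, assuming the 50% threshold mentioned in the project description. The function `add_user_straight`, the helper column `same_gender_ratio` and the toy dataframe are hypothetical, and treating users of unknown gender (or with no given ratings) as "straight" is an assumption made only so that the flag is always defined.

```python
import numpy as np
import pandas as pd

def add_user_straight(frame, threshold=0.5):
    """Return a copy of `frame` with a hypothetical boolean `user_straight` column."""
    out = frame.copy()
    total = out["outRating_count"].replace(0, np.nan)  # avoid dividing by zero
    male_ratio = out["outRating_count_male"] / total
    female_ratio = out["outRating_count_female"] / total
    # share of a user's given ratings that went to their own gender
    out["same_gender_ratio"] = np.where(
        out["user_gender"] == "M", male_ratio,
        np.where(out["user_gender"] == "F", female_ratio, np.nan))
    # comparisons with NaN are False, so unknown gender / no ratings end up as True
    out["user_straight"] = ~(out["same_gender_ratio"] > threshold)
    return out

# made-up toy rows using the column names from the aggregation above
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "user_gender": ["M", "M", "F", "U"],
    "outRating_count": [10, 10, 4, 5],
    "outRating_count_male": [8, 1, 3, 2],
    "outRating_count_female": [2, 9, 1, 3],
})
result = add_user_straight(toy)
```

Applied to the real data, `df_detailed = add_user_straight(df_detailed)` would have to run before the datasets are saved so that the column is available after reloading.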

### NetworkX Graph (nodes with undefined gender removed)

In [None]:
# original graph
G = nx.from_pandas_edgelist(df, source = "user_id", target = "rating_profile", edge_attr = "rating_weight", create_using=nx.DiGraph())

# setting gender to nodes
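# index by user_id explicitly: pd.Series(data = <Series>, index = ...) would realign the values by label instead of pairing them row by row
node_gender = df_concise.set_index("user_id")["user_gender"].to_dict()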
nx.set_node_attributes(G, node_gender, name = "gender")

# removing nodes with undefined gender
undefined_nodes = [x for x, y in G.nodes(data = "gender") if y == "U"]
G.remove_nodes_from(undefined_nodes)
G.remove_nodes_from(list(nx.isolates(G)))

print("nodes:", len(G.nodes)) # 119 847
print("edges:", len(G.edges)) # 9 398 037

### Graphs with negative/neutral/positive ratings

In [None]:
# selecting negative / neutral / positive edges
threshold_neutral = 3
threshold_positive = 6

edges_negative = [(u, v) for u, v, e in G.edges(data = "rating_weight") if e <= threshold_neutral]
edges_neutral = [(u, v) for u, v, e in G.edges(data = "rating_weight") if (e > threshold_neutral) and (e <= threshold_positive)]
edges_positive = [(u, v) for u, v, e in G.edges(data = "rating_weight") if (e > threshold_positive)]
print("edges:\n\t# negative = {}\n\t# neutral = {}\n\t# positive = {}".format(len(edges_negative), len(edges_neutral), len(edges_positive)))
# negative = 2171215
# neutral = 2451787
# positive = 4775035

In [None]:
# graph with negative ratings
g_negative = G.copy()
g_negative.remove_edges_from(edges_neutral)
g_negative.remove_edges_from(edges_positive)
g_negative.remove_nodes_from(list(nx.isolates(g_negative)))
print("nodes:", len(g_negative.nodes)) # 117 333
print("edges:", len(g_negative.edges)) # 2 171 215

In [None]:
# graph with neutral ratings
g_neutral = G.copy()
g_neutral.remove_edges_from(edges_negative)
g_neutral.remove_edges_from(edges_positive)
g_neutral.remove_nodes_from(list(nx.isolates(g_neutral)))
print("nodes:", len(g_neutral.nodes)) # 119 229
print("edges:", len(g_neutral.edges)) # 2 451 787

In [None]:
# graph with positive ratings
g_positive = G.copy()
g_positive.remove_edges_from(edges_negative)
g_positive.remove_edges_from(edges_neutral)
g_positive.remove_nodes_from(list(nx.isolates(g_positive)))
print("nodes:", len(g_positive.nodes)) # 119 793
print("edges:", len(g_positive.edges)) # 4 775 035

### Reciprocal graph (full)

In [None]:
# full graph
# print("reciprocity:", nx.reciprocity(G)) # 0.020955633127806995

g_reciprocal = G.to_undirected(reciprocal = True)
g_reciprocal.remove_nodes_from(list(nx.isolates(g_reciprocal)))
print("nodes:", len(g_reciprocal.nodes)) # 31 300
print("edges:", len(g_reciprocal.edges)) # 100 213
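
The cell above keeps only pairs of users who rated each other (`to_undirected(reciprocal = True)`). To make that idea concrete, the sketch below does the same thing on a plain list of directed edges without NetworkX. The function names and the toy edges are made up for illustration; the reciprocity value is computed as the share of directed edges whose reverse edge also exists, which is the definition `nx.reciprocity` uses for a whole graph.

```python
def mutual_pairs(edges):
    """Return the set of unordered pairs {u, v} that rated each other."""
    edge_set = set(edges)
    return {frozenset((u, v)) for u, v in edge_set if u != v and (v, u) in edge_set}

def reciprocity(edges):
    """Share of directed edges whose reverse edge is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    reciprocated = sum(1 for u, v in edge_set if u != v and (v, u) in edge_set)
    return reciprocated / len(edge_set)

# made-up directed ratings: (rater, rated profile)
toy_edges = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 3), (2, 4)]
pairs = mutual_pairs(toy_edges)
share = reciprocity(toy_edges)
```

Each mutual pair corresponds to one undirected edge of the reciprocal graph, so `len(pairs)` plays the role of the edge count printed above.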

### Reciprocal negative/neutral/positive graphs

In [None]:
#print("negative reciprocity:", nx.reciprocity(g_negative)) # 0.00729913896136495
g_reciprocal_negative = g_negative.to_undirected(reciprocal = True)
g_reciprocal_negative.remove_nodes_from(list(nx.isolates(g_reciprocal_negative)))
print("negative nodes:", len(g_reciprocal_negative.nodes)) # 7 191
print("negative edges:", len(g_reciprocal_negative.edges)) # 7 924

In [None]:
#print("neutral reciprocity:", nx.reciprocity(g_neutral)) # 0.0026225769204257956
g_reciprocal_neutral = g_neutral.to_undirected(reciprocal = True)
g_reciprocal_neutral.remove_nodes_from(list(nx.isolates(g_reciprocal_neutral)))
print("neutral nodes:", len(g_reciprocal_neutral.nodes)) # 3 632
print("neutral edges:", len(g_reciprocal_neutral.edges)) # 3 215

In [None]:
#print("positive reciprocity:", nx.reciprocity(g_positive)) # 0.030263233672632766
g_reciprocal_positive = g_positive.to_undirected(reciprocal = True)
g_reciprocal_positive.remove_nodes_from(list(nx.isolates(g_reciprocal_positive)))
print("positive nodes:", len(g_reciprocal_positive.nodes)) # 23 376
print("positive edges:", len(g_reciprocal_positive.edges)) # 72 254

### Reciprocal only males / only females

In [None]:
# only males
g_reciprocal_males = g_reciprocal.copy()
females = [x for x, y in g_reciprocal.nodes(data = "gender") if (y == "F")]
g_reciprocal_males.remove_nodes_from(females)
g_reciprocal_males.remove_nodes_from(list(nx.isolates(g_reciprocal_males)))
print("male nodes:", len(g_reciprocal_males.nodes)) # 11 447
print("male edges:", len(g_reciprocal_males.edges)) # 24 751

In [None]:
# only females
g_reciprocal_females = g_reciprocal.copy()
males = [x for x, y in g_reciprocal.nodes(data = "gender") if (y == "M")]
g_reciprocal_females.remove_nodes_from(males)
g_reciprocal_females.remove_nodes_from(list(nx.isolates(g_reciprocal_females)))
print("female nodes:", len(g_reciprocal_females.nodes)) # 11 608
print("female edges:", len(g_reciprocal_females.edges)) # 25 553

### Saving datasets/graphs

In [None]:
# datasets
dataset_path = "data/datasets/"

df.to_csv(dataset_path + "df.csv", index = False)
df_detailed.to_csv(dataset_path + "df_detailed.csv", index = False)
df_concise.to_csv(dataset_path + "df_concise.csv", index = False)

In [None]:
# graphs
graph_path = "data/graphs/"

nx.write_gpickle(G, graph_path + "full.gpickle")
nx.write_gpickle(g_negative, graph_path + "negative.gpickle")
nx.write_gpickle(g_neutral, graph_path + "neutral.gpickle")
nx.write_gpickle(g_positive, graph_path + "positive.gpickle")

nx.write_gpickle(g_reciprocal, graph_path + "reciprocal.gpickle")
nx.write_gpickle(g_reciprocal_negative, graph_path + "reciprocal_negative.gpickle")
nx.write_gpickle(g_reciprocal_neutral, graph_path + "reciprocal_neutral.gpickle")
nx.write_gpickle(g_reciprocal_positive, graph_path + "reciprocal_positive.gpickle")

nx.write_gpickle(g_reciprocal_males, graph_path + "reciprocal_males.gpickle")
nx.write_gpickle(g_reciprocal_females, graph_path + "reciprocal_females.gpickle")
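
`nx.write_gpickle` and `nx.read_gpickle` were deprecated in NetworkX 2.6 and removed in NetworkX 3.0, so the cells in this section need NetworkX 2.x as written. On newer versions the standard `pickle` module can be used directly, as in the sketch below. The helper names `save_graph` and `load_graph` are hypothetical, and a plain adjacency dict stands in for a graph so that the example has no third-party dependency; a NetworkX graph object can be passed in the same way.

```python
import os
import pickle
import tempfile

def save_graph(graph, path):
    """Pickle any graph-like object (a NetworkX graph included) to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(graph, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_graph(path):
    """Load an object previously written by save_graph (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# plain adjacency dict as a dependency-free stand-in for a graph
toy_graph = {1: {2: {"rating_weight": 8}}, 2: {1: {"rating_weight": 3}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.gpickle")
    save_graph(toy_graph, toy_path)
    restored = load_graph(toy_path)
```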

## 2) Data Analysis

### Loading datasets/graphs

In [None]:
# loading datasets from files
dataset_path = "data/datasets/"

df = pd.read_csv(dataset_path + "df.csv")
df_detailed = pd.read_csv(dataset_path + "df_detailed.csv")
df_concise = pd.read_csv(dataset_path + "df_concise.csv")

In [None]:
# loading graphs from files
graph_path = "data/graphs/"

G = nx.read_gpickle(graph_path + "full.gpickle")
g_negative = nx.read_gpickle(graph_path + "negative.gpickle")
g_neutral = nx.read_gpickle(graph_path + "neutral.gpickle")
g_positive = nx.read_gpickle(graph_path + "positive.gpickle")

g_reciprocal = nx.read_gpickle(graph_path + "reciprocal.gpickle")
g_reciprocal_negative = nx.read_gpickle(graph_path + "reciprocal_negative.gpickle")
g_reciprocal_neutral = nx.read_gpickle(graph_path + "reciprocal_neutral.gpickle")
g_reciprocal_positive = nx.read_gpickle(graph_path + "reciprocal_positive.gpickle")

g_reciprocal_males = nx.read_gpickle(graph_path + "reciprocal_males.gpickle")
g_reciprocal_females = nx.read_gpickle(graph_path + "reciprocal_females.gpickle")

In [None]:
print("Total nodes:", df_concise["user_id"].nunique())
print("\tfemales:", df_concise[df_concise["user_gender"] == "F"].shape[0])
print("\tmales:", df_concise[df_concise["user_gender"] == "M"].shape[0])
print("\tunknown:", df_concise[df_concise["user_gender"] == "U"].shape[0])
print('Total edges:', len(df))

### Gender proportion in the dataset

In [None]:
gender_portion = df_concise.groupby("user_gender")["user_id"].nunique()
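# NOTE: expects a boolean "user_straight" column in df_detailed, which the preprocessing cells above do not create
gender_stats = df_detailed[["user_id", "user_gender", "user_straight"]].groupby(["user_gender", "user_straight"])["user_id"].nunique()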
print(gender_stats)

plt.figure(figsize=(15, 10))
plt.suptitle("Users' gender")

plt.subplot(121)
plt.pie(
    gender_portion,
    labels = ["Female", "Male", "Unknown"],
    autopct = "%1.1f%%", 
    colors = ["lightcoral", "royalblue", "grey"])

plt.subplot(122)
plt.pie(
    gender_stats, 
    labels = ["F (lesbian)", "F (straight)", "M (gay)", "M (straight)", "U"], 
    autopct = "%1.1f%%", 
    colors = ["lightpink", "lightcoral", "lightsteelblue","royalblue", "grey"])

plt.show()

### Positive/Negative ratings

In [None]:
count_negative = len(df[df["rating_weight"] < 4])
count_neutral = len(df[(df["rating_weight"] >= 4) & (df["rating_weight"] < 7)])
count_positive = len(df[(df["rating_weight"] >= 7)])

plt.figure(figsize = (5,5))
plt.pie(
    [count_negative, count_neutral, count_positive],
    labels = ["negative", "neutral", "positive"],
    colors = ["lightcoral", "gray", "royalblue"],
    autopct = "%1.1f%%")
plt.show()

### Number of Ratings

In [None]:
inRating = df_concise["inRating_count"]
outRating = df_concise["outRating_count"]

# number of ratings (general)
plt.figure(figsize = (15, 5))
plt.suptitle("Number of ratings (log-scaled)")

ax1 = plt.subplot(121)
plt.hist(df_concise["inRating_count"], bins = 25, log = True)
plt.title("Received: mean = {:.2f}, std = {:.2f}".format(inRating.mean(), inRating.std()))
plt.xlabel("Number of ratings")
plt.ylabel("Number of users")

plt.subplot(122, sharey = ax1)
plt.hist(df_concise["outRating_count"], bins = 25, log = True)
plt.title("Given: mean = {:.2f}, std = {:.2f}".format(outRating.mean(), outRating.std()))
plt.xlabel("Number of ratings")

plt.show()

# number of ratings (by gender)
plt.figure(figsize = (15,15))
plt.suptitle("Number of ratings by gender (log-scaled)")

idx = 1
ax1 = None
for i in ["M", "F", "U"]:
    inRating = df_concise[df_concise["user_gender"] == i]["inRating_count"]
    outRating = df_concise[df_concise["user_gender"] == i]["outRating_count"]
    
    if ax1 is None:
        ax1 = plt.subplot(3, 2, idx)
    else:
        plt.subplot(3, 2, idx, sharey = ax1)
    plt.hist(inRating, bins = 25, log = True)
    plt.title("Received ({}): mean = {:.2f}, std = {:.2f}".format(i, inRating.mean(), inRating.std()))
    plt.xlabel("Number of ratings")
    plt.ylabel("Number of users")

    plt.subplot(3, 2, idx + 1, sharey = ax1)
    plt.hist(outRating, bins = 25, log = True)
    plt.title("Given ({}): mean = {:.2f}, std = {:.2f}".format(i, outRating.mean(), outRating.std()))
    plt.xlabel("Number of ratings")
    plt.ylabel("Number of users")
    idx += 2

plt.show()

### Mean of Ratings

In [None]:
inRating = df_concise[df_concise["inRating_mean"].notna()]["inRating_mean"]
outRating = df_concise[df_concise["outRating_mean"].notna()]["outRating_mean"]

plt.figure(figsize = (15, 5))
plt.suptitle("Mean of ratings")

ax1 = plt.subplot(121)
plt.hist(inRating, bins = 25)
plt.title("Received: mean = {:.2f}, std = {:.2f}".format(inRating.mean(), inRating.std()))
plt.xlabel("Mean of the rating")
plt.ylabel("Number of accounts")

plt.subplot(122, sharey = ax1)
plt.hist(outRating, bins = 25)
plt.title("Given: mean = {:.2f}, std = {:.2f}".format(outRating.mean(), outRating.std()))
plt.xlabel("Mean of the rating")
plt.ylabel("Number of accounts")

plt.show()

# separated by gender
plt.figure(figsize = (15,15))
plt.suptitle("Mean of ratings by gender")

idx = 1
ax1 = None
for i in ["M", "F", "U"]:
    inRating = df_concise[(df_concise["user_gender"] == i) & (df_concise["inRating_mean"].notna())]["inRating_mean"]
    outRating = df_concise[(df_concise["user_gender"] == i) & (df_concise["outRating_mean"].notna())]["outRating_mean"]
    
    if ax1 is None:
        ax1 = plt.subplot(3, 2, idx)
    else:
        plt.subplot(3, 2, idx, sharey = ax1)
    plt.hist(inRating, bins = 25)
    plt.title("Received ({}): mean = {:.2f}, std = {:.2f}".format(i, inRating.mean(), inRating.std()))
    plt.xlabel("Mean of the rating")
    plt.ylabel("Number of accounts")

    plt.subplot(3, 2, idx + 1, sharey = ax1)
    plt.hist(outRating, bins = 25)
    plt.title("Given ({}): mean = {:.2f}, std = {:.2f}".format(i, outRating.mean(), outRating.std()))
    plt.xlabel("Mean of the rating")
    plt.ylabel("Number of users")
    
    idx += 2

plt.show()

### Cumulative Distribution of Ratings

In [None]:
inRating = df_concise["inRating_count"]
outRating = df_concise["outRating_count"]

plt.figure(figsize = (15, 5))
plt.suptitle("Cumulative distribution of Ratings")

plt.subplot(121)
plt.title("Given ratings")
plt.hist(outRating, cumulative = -1, histtype = "stepfilled", log = True, bins = 25)

plt.subplot(122)
plt.title("Received ratings")
plt.hist(inRating, cumulative = -1, histtype = "stepfilled", log = True, bins = 25)

plt.show()

# cumulative distribution of ratings (by gender)
plt.figure(figsize = (15,15))
plt.suptitle("Cumulative distribution of ratings by gender (log-scaled)")

idx = 1
ax1 = None
for i in ["M", "F", "U"]:
    inRating = df_concise[df_concise["user_gender"] == i]["inRating_count"]
    outRating = df_concise[df_concise["user_gender"] == i]["outRating_count"]
    
    if ax1 is None:
        ax1 = plt.subplot(3, 2, idx)
    else:
        plt.subplot(3, 2, idx, sharey = ax1)
    plt.hist(inRating, cumulative = -1, histtype = "stepfilled", log = True, bins = 25)
    plt.title("Received ({})".format(i))
    plt.xlabel("Number of ratings")
    plt.ylabel("Number of users")

    plt.subplot(3, 2, idx + 1, sharey = ax1)
    plt.hist(outRating, cumulative = -1, histtype = "stepfilled", log = True, bins = 25)
    plt.title("Given ({})".format(i))
    plt.xlabel("Number of ratings")
    plt.ylabel("Number of users")
    idx += 2

plt.show()
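
The `cumulative = -1` histograms above approximate, with 25 bins, how many users have at least a given number of ratings. The same quantity can be computed exactly without binning, as in the sketch below. The function name `empirical_ccdf` and the toy counts are assumptions for illustration; with the real data the `inRating_count` or `outRating_count` column would be passed in.

```python
import numpy as np

def empirical_ccdf(values):
    """Return (xs, share) where share[i] is the fraction of values >= xs[i]."""
    values = np.asarray(values)
    xs, counts = np.unique(values, return_counts=True)
    # number of observations >= x: cumulative sum of the counts taken from the right
    at_least = counts[::-1].cumsum()[::-1]
    return xs, at_least / values.size

toy_counts = [1, 1, 2, 5]  # made-up rating counts
xs, ccdf = empirical_ccdf(toy_counts)
```

Plotting `xs` against `ccdf` on log-log axes (for example with `plt.loglog`) gives an unbinned version of the curves above; a roughly straight line there would be consistent with a power-law tail.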

### Power-law approximation

In [None]:
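# Another way of fitting the MLE parameters: minimize the negative log-likelihood directly

data = df_concise["inRating_count"]

def ltarget(theta):
    # negative log-likelihood of scipy.stats.powerlaw with (shape, loc, scale) = theta
    d = -np.sum(powerlaw.logpdf(data + 0.1, theta[0], theta[1], theta[2]))
    if d == np.inf or np.isnan(d):
        # large finite penalty so the optimizer moves away from invalid parameters
        return 10000000
    return d

# It seems that the scale (third parameter) is fitted really poorly and it only has to be higher than 30k
# initial guess for (shape, loc, scale)
prior = np.array([0.2, -1, 40000])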

theta_opt = scipy.optimize.minimize(ltarget, prior)

print('found values', theta_opt.x)

In [None]:
t2 = powerlaw.fit(data) # always [0.25556190626454967, -5.3087461247103575e-25, 41315.39602489741]

t = np.linspace(data.min() + 1, data.max(), 100)

t1 = theta_opt.x # eg [1.06399195e-01, -9.90930233e-01, 4.00000000e+04]

plt.plot(t, powerlaw.logpdf(t, t1[0], t1[1], t1[2]), color="g", label="scipy.optimize.minimize")
plt.plot(t, powerlaw.logpdf(t, t2[0], t2[1], t2[2]), color="r", label="powerlaw.fit")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize = (15, 5))
plt.suptitle("Power-law approximation")

plt.subplot(121)
plt.title("Received ratings")
mle_tuple = powerlaw.fit(df_concise["inRating_count"])
print("Received:\n\talpha = {:.2f}\n\tscale = {:.2f}".format(mle_tuple[0], mle_tuple[2]))
r = powerlaw.rvs(mle_tuple[0], scale = mle_tuple[2], size = 1000)
plt.hist(r, histtype = 'stepfilled', log = True, bins = 25)

plt.subplot(122)
plt.title("Given ratings")
mle_tuple = powerlaw.fit(df_concise["outRating_count"])
print("Given:\n\talpha = {:.2f}\n\tscale = {:.2f}".format(mle_tuple[0], mle_tuple[2]))
r = powerlaw.rvs(mle_tuple[0], scale = mle_tuple[2], size = 1000)
plt.hist(r, histtype = 'stepfilled', log = True, bins = 25)


plt.show()

In [None]:
# Given ratings
def get_title(name, series):
    # plot title with the mean and standard deviation of the given ratings
    mu = np.round(series.mean(), 2)
    sd = np.round(series.std(), 2)
    return "{} - mu = {}, sd = {}".format(name, mu, sd)

data = [
    df["rating_weight"],
    df[df["user_gender"] == "F"]["rating_weight"],
    df[df["user_gender"] == "M"]["rating_weight"],
    df[df["user_gender"] == "U"]["rating_weight"]
]

titles = [
    "All",
    "Females",
    "Males",
    "Unknown"
]

plt.figure(figsize = (15, 10))
for i in range(len(data)):
    plt.subplot(2, 2, i + 1)
    plt.hist(data[i], bins = 10)
    plt.title(get_title(titles[i], data[i]))

plt.show()

## 3) Small Community

### Obtaining small community

In [None]:
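# obtaining small community: male users flagged as not straight
# NOTE: relies on the "user_straight" column, which the preprocessing cells above do not create
df_small = df_detailed[(df_detailed["user_gender"] == "M") & (df_detailed["user_straight"] == False)]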
df_small_edges = df[df["user_id"].isin(df_small["user_id"])]

print("nodes:", df_small.shape[0])
print("edges:", df_small_edges.shape[0])

#df_small[["user_id", "user_gender", "user_straight"]].to_csv("data/gender_small.csv", sep = ",", index = False)
#df_small_edges.to_csv("data/ratings_small.csv", sep = ",", index = False)

### Cumulative Degree Distribution

In [None]:
plt.figure(figsize = (15, 5))
plt.suptitle("Cumulative Distribution of Degrees")

plt.subplot(121)
plt.title("In-degrees")
plt.hist(df_small["inRating_count"], cumulative = -1, bins = 25, log = True)

plt.subplot(122)
plt.title("Out-degrees")
plt.hist(df_small["outRating_count"], cumulative = -1, bins = 25, log = True)

plt.show()

### Average Clustering Coefficient

In [None]:
H = nx.from_pandas_edgelist(df_small_edges, source = "user_id", target = "rating_profile", edge_attr = "user_gender", create_using=nx.DiGraph())
print("average clustering:", nx.average_clustering(H))

### Centralities

In [None]:
# centralities
degc = nx.degree_centrality(H)
#betw = nx.betweenness_centrality(H)
clos = nx.closeness_centrality(H)
eig = nx.eigenvector_centrality_numpy(H)

centrality = pd.DataFrame({"degree": degc, "closeness": clos, "eigenvector": eig})
sns.pairplot(centrality)

### Connected Components

In [None]:
print("Is weakly connected:", nx.is_weakly_connected(H))
components = list(nx.strongly_connected_components(H))

print("# of nodes:", len(H.nodes))
print("# of strongly connected components:", len(components))

## Reciprocal Graph with Positive Ratings (males)

In [None]:
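# NOTE: G3 and df_rec_male are not defined anywhere in this notebook and presumably come from an earlier session.
# g_reciprocal_males is the closest graph built above (an assumption, not a confirmed equivalent).
curr_g = G3
curr_df = df_rec_male
curr_df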

In [None]:
# removing isolated nodes
isolates = list(nx.isolates(curr_g))
print("# of isolated nodes:", len(isolates))
curr_g.remove_nodes_from(isolates)

In [None]:
print("nodes:", len(curr_g.nodes))
print("edges:", len(curr_g.edges))

In [None]:
# degree distribution
plt.figure(figsize = (15, 5))
degrees = [d for n, d in curr_g.degree()]
plt.hist(degrees, cumulative = -1, bins = 25, log = True)
plt.title("Degree distribution (log-scaled)")
plt.show()

In [None]:
# connected components
conn_components = list(nx.connected_components(curr_g))
print("# of connected components:", len(conn_components))

# component sizes, largest first
comp_size = sorted((len(c) for c in conn_components), reverse = True)
print("sorted sizes of components:", comp_size[:10])

### GraphLab

In [None]:
from graphlab import SFrame, SGraph
from graphlab import pagerank

In [None]:
data = SFrame.read_csv("data/datasets/df.csv")  # path where df.csv was saved in the preprocessing section
g = SGraph().add_edges(data, src_field = "user_id", dst_field = "rating_profile")

In [None]:
# pagerank
pr = pagerank.create(g, threshold = 0.0001, max_iterations = 20)
pr.summary()
df_pagerank = pr.pagerank.to_dataframe()  # call the method; without () only the bound method is stored

In [None]:
display(df_pagerank)