# SNA on Kernel votes or How to hack ranks with friends

To get the bronze medal for a Kernel a user must have 5 votes on one kernel.

To get to the Expert tier in Kernels a user must have 5 bronze medals.

*In other words, if you have 4 friends, each of you can create 5 empty notebooks and upvote each other's notebooks to make you and your friends Experts in the Kernels section.*

If only I had 5 friends... Just kidding. But honestly, this is an interesting "feature". A group of 20 people could make each other Masters in Kernels in ~10 minutes. And such a trend could easily turn Kaggle into YouTube / Instagram, full of "vote for me and I'll vote for you" comments.

Let's explore the voting data to find out whether there have been such precedents. To do this, we need to find clusters or communities of people who vote for each other at a high rate. We will build a social graph with nodes corresponding to users and edges corresponding to votes. Then we will search for "k-cores" - maximal subgraphs in which every node is connected to at least k other nodes of the same subgraph. Will we discover "kernel friends"? Let's find out...
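To make the k-core idea concrete, here is a minimal sketch on a toy graph (the node labels are made up for illustration and are not real user ids). Five users who are all linked to each other have 4 neighbours each, so they survive as the 4-core, while a loosely attached sixth user is pruned.

```python
import networkx as nx

# Toy graph: users 1..5 are all connected to each other (a 5-clique),
# plus a hypothetical user 6 who is linked to user 1 only.
toy = nx.complete_graph(range(1, 6))
toy.add_edge(1, 6)

# k_core keeps the maximal subgraph in which every node has degree >= k.
core = nx.k_core(toy, k=4)
core_nodes = sorted(core.nodes())
print(core_nodes)
```

Note that a k-core only guarantees a minimum degree; it does not have to be a fully connected clique.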

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import networkx as nx
import seaborn as sns

## Load the data

### Necessary columns

#### Scripts

- `Id` - first version script id

- `AuthorUserId` - author's id


#### Script versions

- `Id` - script version id

- `ScriptId` - first version script id


#### Script votes

 - `ScriptVersionId` - id of the script version that received the vote

 - `UserId` - who voted for this script

In [None]:
scripts = pd.read_csv("../input/Scripts.csv")
script_versions = pd.read_csv("../input/ScriptVersions.csv")
script_votes = pd.read_csv("../input/ScriptVotes.csv")

In [None]:
vers_votes = pd.merge(script_versions, script_votes, left_on = "Id", right_on = "ScriptVersionId")
vers_votes.head()

In [None]:
scripts_full = pd.merge(scripts, vers_votes, left_on = "Id", right_on = "ScriptId")
scripts_full.head()
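The two merges above walk from a vote to a script version and then to the script's author. A minimal sketch on made-up toy tables (all ids below are invented, and the `toy_` names are only for illustration) shows the resulting author/voter rows:

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are made up.
toy_scripts = pd.DataFrame({"Id": [10, 20], "AuthorUserId": [1, 2]})
toy_versions = pd.DataFrame({"Id": [100, 101, 200], "ScriptId": [10, 10, 20]})
toy_votes = pd.DataFrame({"ScriptVersionId": [100, 101, 200], "UserId": [2, 3, 1]})

# vote -> script version -> script: every row ends up with an author and a voter
toy_vers_votes = toy_versions.merge(toy_votes, left_on="Id", right_on="ScriptVersionId")
toy_full = toy_scripts.merge(toy_vers_votes, left_on="Id", right_on="ScriptId")

toy_edges = toy_full[["AuthorUserId", "UserId"]]
print(toy_edges)
```

Because both inputs of the second merge have an `Id` column, pandas keeps them apart as `Id_x` and `Id_y`; only `AuthorUserId` and `UserId` are needed afterwards.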

## Select only connections among users

In [None]:
edges = scripts_full[["AuthorUserId", "UserId"]]
edges

## Check the distribution of the number of votes cast per user

In [None]:
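# Number of votes cast by each user (size() counts the rows in each group)
edges_group = edges.groupby("UserId").size()
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(edges_group, kde=True)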

In [None]:
edges_group

## Remove authors who have cast fewer than 5 votes themselves, because 5 votes is the threshold for the bronze medal

In [None]:
edges_group.index[edges_group >= 5]

In [None]:
edges_filter = edges.loc[edges.AuthorUserId.isin(edges_group.index[edges_group >= 5]), :]
edges_filter

## Now to the most interesting part: find users who voted for each other at least 5 times
In order to do this, we need to find k-cores in which every user has 5 outgoing and 5 incoming edges.

### Find duplicated rows

In [None]:
# Count how many times each voter upvoted each author;
# reset_index(name=...) gives the same three columns in old and new pandas
edges_dup = edges_filter.groupby(['AuthorUserId', 'UserId']).size().reset_index(name='Size')
edges_dup

In [None]:
edges_dup = edges_dup.loc[edges_dup.Size >= 5, :]
edges_dup
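The counting and thresholding steps can be sketched on a toy edge list. The data is made up, and the threshold is lowered to 3 only to keep the example short:

```python
import pandas as pd

# Toy (author, voter) rows, one row per vote; values are made up.
toy_edges = pd.DataFrame({
    "AuthorUserId": [1, 1, 1, 2, 2],
    "UserId":       [2, 2, 2, 1, 3],
})

# One row per (author, voter) pair with the number of votes in `Size`
toy_counts = (
    toy_edges.groupby(["AuthorUserId", "UserId"])
    .size()
    .reset_index(name="Size")
)

# Keep only the pairs at or above the (toy) threshold
toy_frequent = toy_counts[toy_counts.Size >= 3]
print(toy_frequent)
```

Each remaining row is directed: it says that one voter repeatedly upvoted one author, not that the two voted for each other.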

Interestingly enough, there are only 119 (author, voter) pairs in which the voter upvoted the author's kernels at least 5 times. Our next step is to find clusters among these pairs and figure out our "persons of interest".

### Build the graph

In [None]:
# Undirected graph: one edge per (author, voter) pair with at least 5 votes
G = nx.Graph()
G.add_edges_from(zip(edges_dup.AuthorUserId, edges_dup.UserId))
# Remove self-loops, i.e. users who voted for themselves
# (nx.selfloop_edges replaces the Graph.selfloop_edges method dropped in newer networkx)
G.remove_edges_from(list(nx.selfloop_edges(G)))
pos = nx.spring_layout(G)
nx.draw(G, pos=pos)
nx.draw_networkx_labels(G, pos, font_size=7)

### Find users who voted for each other using the k-core discovery algorithm

In [None]:
G = nx.Graph()
G.add_edges_from(zip(edges_dup.AuthorUserId, edges_dup.UserId))
# Remove self-loops, i.e. users who voted for themselves;
# k_core does not accept graphs with self-loops
G.remove_edges_from(list(nx.selfloop_edges(G)))
# In a group of 5 users who all vote for each other, everyone has 4 neighbours, hence k = 4
G_5cores = nx.k_core(G, 4)
pos = nx.spring_layout(G_5cores)
nx.draw(G_5cores, pos=pos)
nx.draw_networkx_labels(G_5cores, pos, font_size=7)

### Find users who voted for each other without graphs

In [None]:
# Each row of edges_dup is a directed (author, voter) pair with at least 5 votes
pair_set = set(zip(edges_dup.AuthorUserId, edges_dup.UserId))

# A pair is mutual when the reverse direction is present too;
# a < b lists each mutual pair once and skips self-votes
interesting = sorted((a, b) for a, b in pair_set if a < b and (b, a) in pair_set)
interesting
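An undirected graph forgets who voted for whom, so an edge in it does not prove that the votes went both ways. As a rough sketch of one way to keep the direction, assuming the pairs have already passed the vote-count filter, a directed graph can be reduced to its reciprocal edges. The pairs below are made up:

```python
import networkx as nx

# Hypothetical (author, voter) pairs that passed the ">= 5 votes" filter.
toy_pairs = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 5)]

# Direct each edge from the voter to the author.
D = nx.DiGraph()
D.add_edges_from((voter, author) for author, voter in toy_pairs)
D.remove_edges_from(list(nx.selfloop_edges(D)))

# reciprocal=True keeps an undirected edge only if both directions exist.
mutual = D.to_undirected(reciprocal=True)
mutual_pairs = sorted(tuple(sorted(edge)) for edge in mutual.edges())
print(mutual_pairs)
```

Running `nx.k_core` on `mutual` instead of on the plain undirected graph would restrict the search to users whose votes were returned.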

## Conclusion

Finally, we were able to discover a very suspicious group of users who voted for each other at least 5 times. I'm a little disappointed that there is only one such group. But are those people really suspicious? We need to look at the timestamps of their votes to understand whether this is a hack or just a coincidence. I don't want to bother you with such a trivial task, so please just believe me that they are all in the clear and are nice Kagglers.
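For the curious, a timestamp check could look roughly like the sketch below. It is not the analysis that was actually run: the `VoteDate` column name is an assumption about the votes table, the 10-minute window is an arbitrary choice, and all values are made up.

```python
import pandas as pd

# Toy votes; `VoteDate` is an assumed column name and the values are made up.
toy_votes = pd.DataFrame({
    "AuthorUserId": [1, 1, 1, 2, 2, 2],
    "UserId":       [2, 2, 2, 1, 1, 1],
    "VoteDate": pd.to_datetime([
        "2018-01-01 10:00", "2018-01-01 10:02", "2018-01-01 10:05",
        "2018-01-03 09:00", "2018-02-10 09:00", "2018-04-20 09:00",
    ]),
})

# Time between the first and the last vote of each (author, voter) pair
stats = toy_votes.groupby(["AuthorUserId", "UserId"])["VoteDate"].agg(["min", "max"])
stats["Span"] = stats["max"] - stats["min"]

# Pairs whose votes all landed within a short window look like a burst
burst = stats[stats["Span"] < pd.Timedelta(minutes=10)]
print(burst)
```

Votes spread over weeks or months are more likely to be genuine appreciation than a coordinated exchange.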

But what awaits us in the future..? You never know.