# GitHub repositories and users recommendations by embeddings

## Problem Statement

Currently, GitHub has two possibilities to explore users and repositories:
1. Direct search by search term leveraging names and tags.
2. Recommender system under 'Explore' tab which gives suggestions to a user based on his usage of service.

However, there is no possibility to perform a search of connected entities. E.g., find repositories or users highly related to each other.

## Goal of the Project

The goal of this project is to build GitHub repository search/recommender system, which would allow exploring connected repositories and people, by leveraging the underlying graph structure of the repositories database.

## Implemented ML solution

It was decided to build graph nodes embeddings (`repo2vec` and `user2vec`) for the entire GitHub database using [PyTorch-BigGraph (PBG)](https://github.com/facebookresearch/PyTorch-BigGraph). On top of the embeddings representation, we have built query tool with the ranking engine.

# Pipeline

In [1]:
from resources.utils import *

DUMP_DATE = '2019-06'

## Dataset: preprocessing

The goal of this stage is to process the GitHub dataset from http://ghtorrent.org/ project into directed graph. <br>
The source data consists from .csv files which represent tabular SQL data. Each file is one table in source database. <br>
Total size of the dataset is 100GB. The size of the tables used for graph building is ~7GB.

*Database schema:*
![image.png](http://ghtorrent.org/files/schema.png)

The following tables are used for the graph representation:
* Followers
* Watchers
* Project Members

##### Graph structure 
Following nodes are defined as points of interest for this project:
* Users
* Projects

Types of graph edges:
* Follows
* Watches
* Is member of

Possible edges:
* User => follow => User
* User => follow => Project
* User => watch => User
* User => watch => Project
* User => is member => Project

### Download data (http://ghtorrent.org/)

Run `db_download.sh` script (at terminal) to download dump

Extract relationships tables (CSV): `followers.csv`, `watchers.csv`, `project_members.csv`

and metadata: `users.csv`, `projects.csv`

to `data/mysql-%Y-%m-%d` folder (or change directories code dependencies at notebooks)

In [2]:
!tar -C data/ -xvzf data/mysql-{DUMP_DATE}-01.tar.gz mysql-{DUMP_DATE}-01/project_members.csv mysql-{DUMP_DATE}-01/followers.csv mysql-{DUMP_DATE}-01/watchers.csv mysql-{DUMP_DATE}-01/projects.csv mysql-{DUMP_DATE}-01/users.csv

mysql-2019-06-01/followers.csv
mysql-2019-06-01/watchers.csv
mysql-2019-06-01/projects.csv
mysql-2019-06-01/project_members.csv
mysql-2019-06-01/users.csv


### Process SQL structured data into edges

In [2]:
from pyspark import SparkContext, SparkConf, SQLContext

# constants
DATA_FOLDER = f"./data/mysql-{DUMP_DATE}-01/"
FOLLOWERS_PATH = DATA_FOLDER + "followers.csv"
WATCHERS_PATH = DATA_FOLDER + "watchers.csv"
PROJECT_MEMBERS_PATH = DATA_FOLDER + "project_members.csv"

# relations constants
IS_MEMBER_OF = "is_member_of"
FOLLOWS = "follows"
WATCHES = "watches"

In [None]:
def addIdentifiers(row, id_1, id_2):
    '''
    Adds identifier for each id to distinguish them
    '''
    row[0] = id_1 + row[0]
    row[1] = id_2 + row[1]
    return row

def swapColumns(row):
    temp = row[0]
    row[0] = row[1]
    row[1] = temp
    return row

def addRelation(row, relation):
    temp = row[1]
    row[1] = relation
    row.append(temp)
    return row

In [4]:
conf = SparkConf()
sc = SparkContext(conf=conf)

Processing of members table

In [3]:
members_rdd = sc.textFile(PROJECT_MEMBERS_PATH).map(lambda x: x.split(",")[:-2])

members_rdd = members_rdd.map(lambda x: addIdentifiers(x, "repo_id_", "user_id_")).map(swapColumns).map(lambda x: addRelation(x,IS_MEMBER_OF))
members_rdd.take(5)

[['user_id_1', 'is_member_of', 'repo_id_1'],
 ['user_id_2', 'is_member_of', 'repo_id_1'],
 ['user_id_4', 'is_member_of', 'repo_id_1'],
 ['user_id_24', 'is_member_of', 'repo_id_3'],
 ['user_id_5465', 'is_member_of', 'repo_id_3']]

Processing of followers table

In [4]:
followers_rdd = sc.textFile(FOLLOWERS_PATH).map(lambda x: x.split(",")[:-1]) \
    .map(lambda x: addIdentifiers(x, "user_id_", "user_id_")) \
    .map(swapColumns) \
    .map(lambda x: addRelation(x, FOLLOWS))

followers_rdd.take(5)

[['user_id_2', 'follows', 'user_id_1'],
 ['user_id_4', 'follows', 'user_id_1'],
 ['user_id_17896', 'follows', 'user_id_1'],
 ['user_id_21523', 'follows', 'user_id_1'],
 ['user_id_29121', 'follows', 'user_id_1']]

Processing of watchers table

In [5]:
watchers_rdd = sc.textFile(WATCHERS_PATH).map(lambda x: x.split(",")[:-1]) \
    .map(lambda x: addIdentifiers(x, "repo_id_", "user_id_")) \
    .map(swapColumns) \
    .map(lambda x: addRelation(x, WATCHES))

watchers_rdd.take(5)

[['user_id_1', 'watches', 'repo_id_1'],
 ['user_id_2', 'watches', 'repo_id_1'],
 ['user_id_4', 'watches', 'repo_id_1'],
 ['user_id_6', 'watches', 'repo_id_1'],
 ['user_id_7', 'watches', 'repo_id_1']]

### Split into train, validation and test datasets

Merge into one entity

In [None]:
rdd = members_rdd.union(followers_rdd).union(watchers_rdd)

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ['from_entity_id', 'relation', 'to_entity_id'])
df.coalesce(1).write.option("delimiter", "\t").format('com.databricks.spark.csv').options(header='true').save(DATA_FOLDER + 'graph_edges')
shuffledDF = df.randomSplit([0.1, 0.1, 0.8])
val_set = shuffledDF[0]
test_set = shuffledDF[1]
train_set = shuffledDF[2]

Save into files

In [None]:
val_set.coalesce(1).write.option("delimiter", "\t").format('com.databricks.spark.csv').options(header='true').save(DATA_FOLDER + 'graph_edges/validation')
test_set.coalesce(1).write.option("delimiter", "\t").format('com.databricks.spark.csv').options(header='true').save(DATA_FOLDER + 'graph_edges/test')
train_set.coalesce(1).write.option("delimiter", "\t").format('com.databricks.spark.csv').options(header='true').save(DATA_FOLDER + 'graph_edges/train')

### Store metadata

Metainfo we store in MongoDB

In [None]:
def convert_dt(sdt):
    try:
        return pd.datetime.strptime(sdt, '%Y-%m-%d %H:%M:%S')
    except Exception as e:
        print(e)

#### Users

In [5]:
users_df = pd.read_csv(f'{DATA_FOLDER}users.csv',
                       header=None,
                       names=['id', 'login', 'company', 'created_at', 'type', 'fake', 'deleted', 'long', 'lat',
                              'country_code', 'state', 'city', 'location'],
#                        parse_dates=[3],
#                        quoting=1,
#                        quotechar='"',
#                        encoding='utf-8',
                       low_memory=False,
                       error_bad_lines=False).replace({'\\N': None}).set_index('id')
print(users_df.shape)
users_df['created_at'] = users_df['created_at'].map(convert_dt)
users_df = users_df[~users_df['created_at'].isna()]
users_df['fake'] = users_df['fake'].replace({1: True, 0: False, '1': True, '0': False}).astype(bool)
users_df['deleted'] = users_df['deleted'].replace({1: True, 0: False, '1': True, '0': False}).astype(bool)
users_df['long'] = users_df['long'].astype(float)
users_df['lat'] = users_df['lat'].astype(float)
users_df.shape

(30600249, 12)

In [3]:
mc['github']['users'].create_index('login')

'login_1'

In [6]:
to_insert = []

for uid, row in tn(users_df.iterrows(), total=users_df.shape[0]):
    to_insert.append({'_id': uid, **row.to_dict()})
    if len(to_insert) >= 10_000:
        mc['github']['users'].insert_many(to_insert)
        to_insert = []
if len(to_insert):
    mc['github']['users'].insert_many(to_insert)
mc['github']['users'].estimated_document_count()

30600249

Users example
![Users](./resources/images/Users.png)

#### Projects (repos)

In [None]:
for field in ['name', 'owner_id', 'url']:
    mc['github']['projects'].create_index(field)

In [7]:
try:
    stop = False
    skip_count = 0
    for i in range(1000):
        print(i)
        repos_df = pd.read_csv(f'{DATA_FOLDER}projects.csv',
                               skiprows=10_000_000*i,
                               nrows=10_000_000,
                               header=None,
                               na_values=['\\N'],
                               names=["id", "url", "owner_id", "name", "description", "language",
                                      "created_at", "forked_from", "deleted", "updated_at", "?"],
                               error_bad_lines=False,
                               low_memory=False,
                              ).set_index('id')
        c = repos_df.shape[0]
        if c < 10_000_000:
            stop = True
        repos_df['created_at'] = repos_df['created_at'].map(convert_dt)
        repos_df = repos_df[~repos_df['created_at'].isna()]
        repos_df['updated_at'] = repos_df['updated_at'].map(convert_dt)
        repos_df = repos_df[~repos_df['updated_at'].isna()]
        skip_count += (c - repos_df.shape[0])
        repos_df.index = repos_df.index.astype(int)
        repos_df['url'] = repos_df['url'].astype(str).map(lambda x: x[29:])
        repos_df['deleted'] = repos_df['deleted'].replace({1: True, 0: False, 1: True, 0: False}).astype(bool)
        repos_df['owner_id'] = repos_df['owner_id'].astype(int)
        repos_df['forked_from'] = repos_df['forked_from'].astype(float)
        repos_df['?'] = repos_df['?'].astype(float)

        to_insert = []

        for rid, row in tn(repos_df.iterrows(), total=repos_df.shape[0]):
            to_insert.append({'_id': rid, **row.to_dict()})
            if len(to_insert) >= 10_000:
                mc['github']['projects'].insert_many(to_insert)
                to_insert = []
        if len(to_insert):
            mc['github']['project'].insert_many(to_insert)
        if stop:
            break
    m = f"[projects]: stored {mc['github']['projects'].estimated_document_count()} projects, {skip_count} skiped"
    tgn(m)
except Exception as e:
    m = f'[projects]: error - {e}'
    tgn(m)
m

'[projects]: stored 116010000 projects, 135185 skiped'

Repos example
![Projects](./resources/images/Projects.png)

## Training

![Partitions](./resources/images/Partitions.png)

*Figure 1.* The PBG partitioning scheme for large graphs. **Left:** nodes are divided into $P$ partitions that are sized to fit in memory. Edges are divided into buckets based on the partition of their source and destination nodes. In distributed mode, multiple buckets with non-overlapping partitions can be executed in parallel (red squares). **Center:** Entity types with small cardinality do not have to be partitioned; if all entity types used for tail nodes are unpartitioned, then edges can be divided into $P$ buckets based only on source node partitions. **Right:** the ‘inside-out’ bucket order guarantees that buckets have at least one previously-trained embedding partition. Empirically, this ordering produces better embeddings than other alternatives (or random)

![Architecture](./resources/images/Architecture.png)

*Figure 2.* Memory-efficient batched negative sampling. Embeddings are fetched for the $B$ source and destination entities in a batch of edges, as well as $B$ uniformly-sampled source and destination entities. Each chunk of $B_n/2$ edges is corrupted with all source or destination entities in its chunk, as well as the corresponding chunk of the uniform embeddings, resulting in $B_n$ negative examples per positive edge. The negative scores are computed via a batch matrix multiply.

Lerer, Adam, et al. "PyTorch-BigGraph: A Large-scale Graph Embedding System." arXiv preprint arXiv:1903.12287 (2019).
https://arxiv.org/pdf/1903.12287.pdf

Clear dir from possible previous training (or archive configs)

In [14]:
for TEMPLATE in ['entity_count_all_*.txt', 'graph-*_partitioned/*', 'dictionary.json', 'dynamic_rel_count.txt']:
    !rm -f {DATA_FOLDER}graph_edges/{TEMPLATE}

See `graph_config.py` config file for details (`torchbiggraph_train -h` for help)

Let's exclude 'long tail' where entity (repo/user) has less than {ENTITY_MIN_COUNT} relations

In [None]:
ENTITY_MIN_COUNT = 30

Creating needed partitions

In [None]:
%time !torchbiggraph_import_from_tsv --lhs-col=0 --rel-col=1 --rhs-col=2 --entity-min-count={ENTITY_MIN_COUNT} graph_config.py {DATA_FOLDER}graph_edges/graph-*.csv > logs/formating.log

In [21]:
!head -n 20 logs/formating.log

Looking up relation types in the edge files...
- Found 4 relation types
- Removing the ones with fewer than 1 occurrences...
- Left with 4 relation types
- Shuffling them...
Searching for the entities in the edge files...
Entity type all:
- Found 27406591 entities
- Removing the ones with fewer than 10 occurrences...
- Left with 3366929 entities
- Shuffling them...
Preparing entity path ./data/mysql-2019-06-01/graph_edges:
- Writing count of entity type all and partition 0
- Writing count of entity type all and partition 1
- Writing count of entity type all and partition 2
- Writing count of entity type all and partition 3
- Writing count of dynamic relations
Preparing edge path ./data/mysql-2019-06-01/graph_edges/graph-test_partitioned, out of the edges found in ./data/mysql-2019-06-01/graph_edges/graph-test.csv
- Edges will be partitioned in 4 x 4 buckets.
- Processed 100000 edges so far...


In [14]:
for TEMPLATE in ['checkpoint_version.txt', 'embeddings_all_*.h5', 'model.v*.h5']:
    !rm -f {DATA_FOLDER}graph_edges/{TEMPLATE}

Actually, training

In [None]:
%time !torchbiggraph_train graph_config.py -p edge_paths={DATA_FOLDER}graph_edges/graph-train_partitioned > logs/train.log

In [66]:
!head logs/train.log
print('-'*100)
!tail logs/train.log

2019-07-21 03:07:40  Loading entity counts...
2019-07-21 03:07:40  Creating workers...
2019-07-21 03:07:40  Initializing global model...
2019-07-21 03:07:40  Starting epoch 1 / 10 edge path 1 / 1 edge chunk 1 / 1
2019-07-21 03:07:40  edge_path= ./data/mysql-2019-06-01/graph_edges/graph-train_partitioned
2019-07-21 03:07:40  Swapping partitioned embeddings None ( 3 , 3 )
2019-07-21 03:07:40  Loading entities
2019-07-21 03:08:50  ( 3 , 3 ): bucket 1 / 16 : Processed 7652791 edges in 68.59 s ( 0.11 M/sec ); io: 0.51 s ( 1013.96 MB/sec )
2019-07-21 03:08:50  ( 3 , 3 ): loss:  7.348 , violators_lhs:  24.8202 , violators_rhs:  19.5466 , count:  7652791
2019-07-21 03:08:50  Swapping partitioned embeddings ( 3 , 3 ) ( 2 , 2 )
----------------------------------------------------------------------------------------------------
2019-07-21 06:00:18  ( 0 , 0 ): bucket 16 / 16 : Processed 7613224 edges in 66.93 s ( 0.11 M/sec ); io: 1.05 s ( 493.80 MB/sec )
2019-07-21 06:00:18  ( 0 , 0 ): loss:  3.1

## Evaluating

Evaluating on train

In [3]:
!torchbiggraph_eval graph_config.py -p edge_paths={DATA_FOLDER}graph_edges/graph-train_partitioned > logs/eval_train.log

In [4]:
!head logs/eval_train.log
print('-'*100)
!tail logs/eval_train.log

2019-08-14 18:09:29  Starting edge path 1 / 1 (./data/mysql-2019-06-01/graph_edges/graph-train_partitioned)
2019-08-14 18:09:42  ( 0 , 0 ): Processed 6383098 edges in 12 s (0.52M/sec); load time: 0.18 s
2019-08-14 18:09:42  Stats for edge path 1 / 1, bucket ( 0 , 0 ): pos_rank:  6.76915 , mrr:  0.480395 , r1:  0.321494 , r10:  0.813052 , r50:  0.991382 , auc:  0.967479 , count:  6383098
2019-08-14 18:09:54  ( 0 , 1 ): Processed 6584603 edges in 12 s (0.53M/sec); load time: 0.15 s
2019-08-14 18:09:54  Stats for edge path 1 / 1, bucket ( 0 , 1 ): pos_rank:  6.66269 , mrr:  0.48533 , r1:  0.327922 , r10:  0.81517 , r50:  0.992204 , auc:  0.968301 , count:  6584603
2019-08-14 18:10:07  ( 0 , 2 ): Processed 6469912 edges in 12 s (0.53M/sec); load time: 0.18 s
2019-08-14 18:10:07  Stats for edge path 1 / 1, bucket ( 0 , 2 ): pos_rank:  6.52836 , mrr:  0.4888 , r1:  0.33052 , r10:  0.820306 , r50:  0.992694 , auc:  0.968771 , count:  6469912
2019-08-14 18:10:19  ( 0 , 3 ): Processed 6550501 e

Evaluating on validation

In [5]:
!torchbiggraph_eval graph_config.py -p edge_paths={DATA_FOLDER}graph_edges/graph-val_partitioned > logs/eval_val.log

In [6]:
!head logs/eval_val.log
print('-'*100)
!tail logs/eval_val.log

2019-08-14 18:12:50  Starting edge path 1 / 1 (./data/mysql-2019-06-01/graph_edges/graph-val_partitioned)
2019-08-14 18:12:51  ( 0 , 0 ): Processed 799176 edges in 1.7 s (0.48M/sec); load time: 0.12 s
2019-08-14 18:12:51  Stats for edge path 1 / 1, bucket ( 0 , 0 ): pos_rank:  9.97506 , mrr:  0.388542 , r1:  0.236046 , r10:  0.720194 , r50:  0.97284 , auc:  0.943282 , count:  799176
2019-08-14 18:12:53  ( 0 , 1 ): Processed 822664 edges in 1.6 s (0.52M/sec); load time: 0.1 s
2019-08-14 18:12:53  Stats for edge path 1 / 1, bucket ( 0 , 1 ): pos_rank:  9.79906 , mrr:  0.39167 , r1:  0.23917 , r10:  0.723014 , r50:  0.974304 , auc:  0.944365 , count:  822664
2019-08-14 18:12:55  ( 0 , 2 ): Processed 810187 edges in 1.6 s (0.52M/sec); load time: 0.084 s
2019-08-14 18:12:55  Stats for edge path 1 / 1, bucket ( 0 , 2 ): pos_rank:  9.51358 , mrr:  0.397661 , r1:  0.243526 , r10:  0.731363 , r50:  0.975832 , auc:  0.946319 , count:  810187
2019-08-14 18:12:56  ( 0 , 3 ): Processed 819618 edges

## Prepare tensorboard

In [None]:
# !torchbiggraph_export_to_tsv --dict {DATA_FOLDER}/graph_edges/dictionary.json \
# --checkpoint {DATA_FOLDER}/graph_edges/ --out {DATA_FOLDER}/graph_edges/embeddings.tsv

Slow ↑ method

In [6]:
import json
import h5py
import os

!mkdir -p tb
!mkdir -p tb/embeddings

def dropCategories(df, col, threshold):
    drop_list = df[col].value_counts()[threshold:].index.to_list()
    df[col].cat.remove_categories(drop_list, inplace=True)
    return df

For faster results let's see only on repos with 1K+ stars and users with 100+ followers/following

In [3]:
good_rids = pd.read_pickle('./tb/embeddings_old/repos_19k_gte1k.pkl').index.tolist()
len(good_rids)

18670

In [4]:
good_uids = pd.read_pickle('./tb/embeddings_old/users_following_29k_gte100.pkl').index.tolist()
good_uids += pd.read_pickle('./tb/embeddings_old/users_followers_24k_gte100.pkl').index.tolist()
good_uids = list(set(good_uids))
len(good_uids)

47666

Read metadata

In [7]:
repos_cur = mc['github']['projects'].find({"_id": {"$in": good_rids}}, ['url', 'language', 'created_at', 'updated_at'])
repos_df = pd.DataFrame(list(repos_cur)).set_index('_id')[['url', 'language', 'created_at', 'updated_at']]
repos_df.language = repos_df.language.astype("category")
repos_df = dropCategories(repos_df, "language", 42)
repos_df.sample(5)

Unnamed: 0_level_0,url,language,created_at,updated_at
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2355834,behave/behave,Python,2011-10-25 11:02:35,2019-02-25 21:52:24
9907713,gilbitron/Raneto,JavaScript,2014-05-30 10:24:18,2019-02-27 04:35:21
69826495,apollographql/apollo-server,TypeScript,2016-04-21 09:26:01,2019-02-26 15:46:21
89912090,Tencent/LKImageKit,Objective-C,2018-01-03 02:22:30,2019-02-27 06:54:36
36676484,electron/devtron,JavaScript,2016-02-12 22:57:24,2019-02-26 08:50:46


In [8]:
ufields = ['login', 'type', 'fake', 'location', 'deleted', 'country_code', 'created_at']
users_cur = mc['github']['users'].find({"_id": {"$in": good_uids}}, ufields)
users_df = pd.DataFrame(list(users_cur)).set_index('_id')[ufields]
users_df.sample(5)

Unnamed: 0_level_0,login,type,fake,location,deleted,country_code,created_at
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
333093,dolaameng,USR,False,Singapore,False,sg,2008-11-26 05:26:45
41562368,0x7214FF,USR,False,,False,,2011-08-31 01:31:09
569407,alekseybobkov,USR,False,,False,,2010-11-15 01:07:49
35798413,alexholdenmiller,USR,False,New York,False,us,2011-10-20 17:46:47
20472,dscape,USR,False,"London, United Kingdom",False,gb,2008-04-26 16:56:53


In [9]:
path = f'{DATA_FOLDER}/graph_edges'
files = sorted([os.path.join(path, f) for f in os.listdir(path) if f.find("embeddings_all") != -1])
files = files[0:2] + files[8:] + files[2:8]
files

['./data/mysql-2019-06-01//graph_edges/embeddings_all_0.v10.h5',
 './data/mysql-2019-06-01//graph_edges/embeddings_all_1.v10.h5',
 './data/mysql-2019-06-01//graph_edges/embeddings_all_2.v10.h5',
 './data/mysql-2019-06-01//graph_edges/embeddings_all_3.v10.h5']

Next actions need some RAM!

In [10]:
with open(os.path.join(path, "dictionary.json"), "rt") as tf:
    dictionary = json.load(tf)
print(dictionary['relations'])
all_entities = dictionary['entities']['all']
len(all_entities)

['relation', 'watches', 'follows', 'is_member_of']


1529391

In [11]:
pd.Series(all_entities).map(lambda x: x.split('_id_')[0]).value_counts()

user    1091930
repo     437461
dtype: int64

Read embeddings

In [12]:
embeddings=[]
for file in tn(files):
    with h5py.File(file, "r") as hf:
        embedding = hf["embeddings"][...]
    embeddings.append(embedding)
embeddings = np.concatenate(embeddings, axis=0)
embeddings.shape

HBox(children=(IntProgress(value=0, max=4), HTML(value='')))




(1529391, 100)

In [13]:
emb_df = pd.Series({i: e for i, e in tn(zip(all_entities, embeddings))}).to_frame(name='embeddings')
emb_df.shape

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




(1529391, 1)

Combine all metainfo with embeddings

In [14]:
repos_emb_df = pd.Series({int(i.split('_')[-1]): e for i, e in emb_df.loc[[f'repo_id_{i}' for i in repos_df.index]]['embeddings'].iteritems()}).to_frame(name='embeddings')
repos_df = repos_df.join(repos_emb_df)
repos_df.sample(5)

Unnamed: 0_level_0,url,language,created_at,updated_at,embeddings
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
14471725,justjavac/flarum,JavaScript,2014-12-24 14:01:38,2019-02-27 05:07:21,"[0.20021674, -0.035547603, 0.14687921, 0.09368..."
266651,joewalnes/reconnecting-websocket,JavaScript,2012-01-27 20:28:51,2019-02-27 07:47:27,"[0.8540273, 0.00038430316, 0.2725552, 0.238947..."
28281221,up-for-grabs/up-for-grabs.net,JavaScript,2013-11-20 22:44:13,2019-02-26 12:30:42,"[0.13455452, -0.08811931, -0.020408232, 0.1868..."
5828347,HabitRPG/habitrpg,JavaScript,2012-06-06 22:49:48,2016-02-18 09:17:21,"[0.6335141, -0.049198274, -0.040939253, 0.0998..."
32907987,pybind/pybind11,C++,2015-07-05 19:46:48,2019-02-26 10:47:07,"[0.31675598, -0.11396529, 0.041127887, -0.0468..."


In [15]:
users_emb_df = pd.Series({int(i.split('_')[-1]): e for i, e in emb_df.loc[[f'user_id_{i}' for i in users_df.index]]['embeddings'].iteritems()}).to_frame(name='embeddings')
users_df = users_df.join(users_emb_df)
users_df.sample(5)

Unnamed: 0_level_0,login,type,fake,location,deleted,country_code,created_at,embeddings
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
12539209,NathAston,USR,False,London,False,gb,2016-03-14 11:17:47,"[-0.43741047, -0.10959671, 0.17184402, 0.16799..."
14376,jsjohnst,USR,False,New York City,False,us,2008-04-07 02:20:43,"[-0.77265024, 0.03190932, 0.13513942, -0.00035..."
31888096,AeronStory,USR,False,,True,,2016-10-18 15:04:57,"[-0.97922194, 0.38436, 0.2280939, 0.09458058, ..."
39277352,higgsfield,USR,False,,False,,2016-12-06 23:12:52,"[0.5044715, -0.032197986, 0.0016144309, 0.0136..."
13376482,Ljzn,USR,False,home,False,us,2016-05-20 16:57:18,"[-0.77141976, 0.25560012, -0.005199214, -0.012..."


Save for TensorBoard

In [52]:
repos_df.to_pickle("./tb/embeddings/repos_19k_gte1k.pkl")

In [30]:
users_df.to_pickle("./tb/embeddings/users_48k_gte100.pkl")

If trained with `global = True` - embeddings for different types could be comparable (same hyperspace)

In [None]:
# repos_df.index = repos_df.index.map(lambda x: f'r{x}')
# users_df.index = users_df.index.map(lambda x: f'u{x}')
# combined_df = pd.concat([repos_df, users_df])
# combined_df.to_pickle("./tb/embeddings/repos_19k_gte1k_users_48k_gte100.pkl")

↑ embeddings for repos and users are far away from each other - no sense in viewing into combined multidimentional space

(reason - no `global = True` in config for better independent representations)

In [None]:
!rm -rf ./tb/embeddings/.ipynb_checkpoints

Now, `cd tb` & `run2.sh` at terminal with pre-installed docker or read `tb/README.md`

Then, open `HOSTNAME:PORT`, where `HOSTNAME` cane from machine, where the last script ran and `PORT` = `8002` (by default, but configurable)

## Building Annoy (library to do nearest neighbor search)

In [7]:
from annoy import AnnoyIndex

### Repos

Read dataframe

In [8]:
repos_df = pd.read_pickle("./tb/embeddings/repos_19k_gte1k.pkl")
repos_df.shape

(18670, 5)

Dimension

In [9]:
dim = len(repos_df['embeddings'].iloc[0])
dim

100

Mapping for index

In [10]:
repos_ann = AnnoyIndex(dim, 'angular')  # Length of item vector that will be indexed
repos_mapping = {}
for i, (repo_id, e) in enumerate(repos_df['embeddings'].iteritems()):
    repos_ann.add_item(i, list(e))
    repos_mapping[repo_id] = i

Build index

In [11]:
n_trees = 1_000

%time repos_ann.build(n_trees)

CPU times: user 38.7 s, sys: 25.3 ms, total: 38.8 s
Wall time: 38.8 s


True

Save index

In [12]:
repos_ann.save('./data/repos.ann')
pd.to_pickle(repos_mapping, './data/repos_mapping.pkl')

### Users

Read dataframe

In [13]:
users_df = pd.read_pickle("./tb/embeddings/users_48k_gte100.pkl")
users_df.shape

(47666, 8)

Dimension

In [14]:
dim = len(users_df['embeddings'].iloc[0])
dim

100

Mapping for index

In [15]:
users_ann = AnnoyIndex(dim, 'angular')
users_mapping = {}
for i, (user_id, e) in enumerate(users_df['embeddings'].iteritems()):
    users_ann.add_item(i, list(e))
    users_mapping[user_id] = i

Build index

In [16]:
n_trees = 1_000

%time users_ann.build(n_trees)

CPU times: user 1min 54s, sys: 147 ms, total: 1min 54s
Wall time: 1min 54s


True

Save index

In [17]:
users_ann.save('./data/users.ann')
pd.to_pickle(users_mapping, './data/users_mapping.pkl')

## Search for simmilar (nearest neighbors) items

### Repos

Read dataframe

In [18]:
repos_df = pd.read_pickle("./tb/embeddings/repos_19k_gte1k.pkl")
repos_df.shape

(18670, 5)

Dimension

In [19]:
dim = len(repos_df['embeddings'].iloc[0])
dim

100

Read index & mapping

In [20]:
repos_ann = AnnoyIndex(dim, 'angular')
repos_ann.load('./data/repos.ann') # super fast, will just mmap the file
repos_mapping = pd.read_pickle('./data/repos_mapping.pkl')
repos_mapping_rev = {v: k for k, v in repos_mapping.items()}
len(repos_mapping)

18670

In [21]:
def get_nns_by_repo_id(repo_id, n_nn=10):
    res, dist = repos_ann.get_nns_by_item(repos_mapping[repo_id], n_nn, include_distances=True) # will find the n nearest neighbors
    res = [repos_mapping_rev[r] for r in res]
    res = repos_df.loc[res]
    res['distance'] = [round(d, 3) for d in dist]
    return res

Type interesting query for repos search (don't forget about unpopularity limitation)

In [22]:
repo_query = 'apache/spark'

repos_df[repos_df['url'].str.lower().str.contains(repo_query)]

Unnamed: 0_level_0,url,language,created_at,updated_at,embeddings
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
8196280,apache/spark,Scala,2014-02-25 08:00:08,2019-02-26 23:57:18,"[0.5067623, 0.013571289, -0.07234435, -0.00304..."


Choose needed one to search nearest neighbors

In [23]:
repo_id = 8196280

get_nns_by_repo_id(repo_id)

Unnamed: 0_level_0,url,language,created_at,updated_at,embeddings,distance
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
8196280,apache/spark,Scala,2014-02-25 08:00:08,2019-02-26 23:57:18,"[0.5067623, 0.013571289, -0.07234435, -0.00304...",0.0
45297,apache/kafka,Java,2011-08-15 18:06:16,2019-02-27 04:05:00,"[0.69850904, 0.042590182, 0.0013614872, -0.006...",0.422
11610797,apache/storm,Java,2013-11-05 08:00:14,2019-02-27 04:45:34,"[0.59941286, 0.078122355, 0.035178997, -0.0098...",0.445
14807988,apache/flink,Java,2014-06-07 07:00:10,2019-02-27 04:11:52,"[0.4457919, 0.11843297, -0.097657256, 0.076890...",0.46
39068,apache/hadoop,Java,2009-03-27 14:41:53,2019-02-26 11:50:35,"[0.45037118, 0.10151473, 0.007142633, 0.081586...",0.491
13773,apache/hbase,Java,2014-05-23 07:00:07,2019-02-27 00:06:19,"[0.58324295, 0.006161812, -0.060786575, -0.005...",0.522
10305482,databricks/learning-spark,Java,2014-06-16 04:47:54,2019-02-26 10:02:28,"[0.27490905, -0.076953545, -0.19611663, -0.061...",0.548
1366147,neo4j/neo4j,Java,2012-11-12 08:46:15,2019-02-27 02:33:20,"[0.46941158, -0.06264764, 0.00042413553, -0.08...",0.564
10045,akka/akka,Scala,2009-02-16 12:51:54,2019-02-26 10:48:39,"[0.7051094, -0.10073524, 0.014133396, -0.09673...",0.566
4873,apache/lucene-solr,Java,2016-01-23 08:00:06,2019-02-27 03:04:07,"[0.53500164, 0.0141313, -0.02256584, -0.024738...",0.571


All Big Data, distributed tools

### Users

Read dataframe

In [24]:
users_df = pd.read_pickle("./tb/embeddings/users_48k_gte100.pkl")
users_df.shape

(47666, 8)

Dimension

In [25]:
dim = len(users_df['embeddings'].iloc[0])
dim

100

Read index & mapping

In [26]:
users_ann = AnnoyIndex(dim, 'angular')
users_ann.load('./data/users.ann') # super fast, will just mmap the file
users_mapping = pd.read_pickle('./data/users_mapping.pkl')
users_mapping_rev = {v: k for k, v in users_mapping.items()}
len(users_mapping)

47666

In [27]:
def get_nns_by_user_id(user_id, n_nn=10):
    res, dist = users_ann.get_nns_by_item(users_mapping[user_id], n_nn, include_distances=True) # will find the n nearest neighbors
    res = [users_mapping_rev[r] for r in res]
    res = users_df.loc[res]
    res['distance'] = [round(d, 3) for d in dist]
    return res

Type interesting query for users search (don't forget about unpopularity limitation)

In [28]:
user_query = 'wiki'

users_df[users_df['login'].str.lower().str.contains(user_query)]

Unnamed: 0_level_0,login,type,fake,location,deleted,country_code,created_at,embeddings
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1652246,wikibook,USR,False,,False,,2013-02-22 06:53:02,"[-0.12921746, -0.4415784, -0.40390605, 0.12089..."
4016654,wikimatze,USR,False,Berlin,False,de,2010-05-04 14:46:36,"[-0.3107245, 0.10202297, 0.2773474, -0.2349214..."


Choose needed one to search nearest neighbors

In [29]:
user_id = 1652246

get_nns_by_user_id(user_id)

Unnamed: 0_level_0,login,type,fake,location,deleted,country_code,created_at,embeddings,distance
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1652246,wikibook,USR,False,,False,,2013-02-22 06:53:02,"[-0.12921746, -0.4415784, -0.40390605, 0.12089...",0.0
3617048,gilbutITbook,USR,False,서울특별시 마포구 서교동 467-9,False,,2014-03-19 05:48:20,"[0.1376717, -0.29175922, -0.17365614, -0.05276...",0.839
286093,javajigi,USR,False,,False,,2010-12-13 00:06:41,"[0.064416006, -0.28612432, -0.28292352, 0.0340...",0.853
182762,dalinaum,USR,False,The Peach Blossom Spring,False,,2009-10-28 01:13:40,"[-0.86923313, -0.008363037, -0.31862542, 0.166...",0.873
884777,ihoneymon,USR,False,"Seoul, South Korea",False,kr,2011-06-26 12:56:10,"[-0.037841484, -0.36508644, -0.052486323, 0.08...",0.894
5712011,arahansa,USR,False,"Incheon, South Korea",False,,2014-01-18 12:54:37,"[-0.5806951, -0.10983159, 0.040602323, -0.3568...",0.909
13295853,seoul-opengov,USR,False,"seoul, korea",False,kr,2016-04-25 09:22:59,"[-0.26374817, -0.39799082, -0.021314356, -0.03...",0.912
68757,jongman,USR,False,,False,,2010-03-22 05:25:55,"[-0.29828942, -0.373523, 0.040735707, 0.066990...",0.915
68445,Sangwook,USR,False,seoul,False,kr,2009-09-05 05:40:53,"[-0.6826483, -0.31995863, -0.1767753, 0.093110...",0.937
347837,msbaek,USR,False,"Seoul, KOREA",False,kr,2008-08-27 04:49:51,"[-0.3363302, -0.46947053, -0.050381757, -0.089...",0.938


Korean "cluster"

If we go for their profiles - some similar stuff could be found there

It is essential to remember about absolute distance, not just TOP N, because some entities have neighbors close to them, while others could have not