# Author-Topic Modeling for GitHub Repository Descriptions
In this tutorial, you can learn how we conduct author-topic modeling on our GitHub repository dataset.

The author-topic model is an extension of LDA that lets us learn topic representations for the authors associated with a set of documents. In our case, the "documents" are the repositories' descriptions, and the "authors" are the owners of the repositories.

We collected GitHub repositories that were created between 2017-01-31 and 2018-03-31 and have more than 40 stars. The total number of repositories is 12366. We only used the 11145 repositories whose descriptions are written in English.

### Import packages
We used Gensim to conduct author-topic modeling. Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. We used the langdetect language-detection library to detect the language of each document. We also used NLTK to tokenize and stem the documents, and the stop_words package to get rid of the stop words.

In [2]:
%matplotlib inline
from pymongo import MongoClient
import pymongo
import pandas as pd
from pandas.io.json import json_normalize
import numpy as np
import matplotlib.pyplot as plt
import json
import nltk
from langdetect import detect
from langdetect import detect_langs
from langdetect import DetectorFactory
from nltk.tokenize import RegexpTokenizer  
from stop_words import get_stop_words  
from nltk.stem.porter import PorterStemmer  
from gensim import corpora, models  
from gensim import utils
import gensim 
import os, re
from random import shuffle



### Loading data
We collected the data from the GitHub API and stored it in a MongoDB Atlas database. PyMongo is the package we used to connect to our database.

In [3]:
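# Read the credentials from the environment instead of hard-coding them.
# MONGO_USERNAME and MONGO_PASSWORD are placeholder variable names.
mongo_username = os.environ['MONGO_USERNAME']
mongo_password = os.environ['MONGO_PASSWORD']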
mongo_url = 'mongodb://%s:%s@cluster0-shard-00-01-i6gcp.mongodb.net:27017/admin' % (
            mongo_username, mongo_password)
client = MongoClient(mongo_url,ssl=True,replicaSet='Cluster0-shard-0',authSource='admin')
info = client.server_info()  # Forces a call.
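
If the username or password contains characters such as `@` or `/`, the connection URL has to be percent-escaped before it is handed to `MongoClient`. The following is a minimal sketch of that step using only the standard library; the helper name `build_mongo_url`, the host `example.mongodb.net`, and the dummy credentials are assumptions made for illustration and are not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_mongo_url(username, password, host, port=27017, database="admin"):
    """Return a MongoDB connection URL with percent-escaped credentials."""
    return "mongodb://%s:%s@%s:%d/%s" % (
        quote_plus(username), quote_plus(password), host, port, database)


# Dummy values for the demo; real credentials would come from the
# environment or a secrets store rather than from the notebook itself.
demo_url = build_mongo_url("demo user", "p@ss/word", "example.mongodb.net")
print(demo_url)
```

The resulting string can be passed to `MongoClient` in the same way as `mongo_url` above.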

We select the descriptions and load them into a pandas DataFrame by running the following lines.

In [4]:
db = client.gitdbPro
repos = db.repos
print(repos.count_documents({}))  # Collection.count() was removed in PyMongo 4
descriptions = repos.distinct('description')
df = pd.DataFrame(descriptions)
df.head()

12366


|   | 0 |
|---|---|
| 0 | How to be low-level programmer |
| 1 | Import OpenStreetMap data into Unreal Engine 4 |
| 2 | react-native template to target multiple platf... |
| 3 | Connect your App to Multiple Messaging Channel... |
| 4 | Proto Actor - Ultra fast distributed actors fo... |


In [56]:
len(descriptions)
author_des = repos.find({},{'id':1,'description':1,'owner.login':1})
df_author_des = json_normalize(list(author_des))
df_author_des.head()

|   | _id | description | id | owner.login |
|---|---|---|---|---|
| 0 | 5abe9b96c44bb82d0c83b395 | How to be low-level programmer | 77788381 | gurugio |
| 1 | 5abe9b96c44bb82d0c83b396 | Import OpenStreetMap data into Unreal Engine 4 | 77765042 | ue4plugins |
| 2 | 5abe9b96c44bb82d0c83b397 | react-native template to target multiple platf... | 77784093 | react-everywhere |
| 3 | 5abe9b96c44bb82d0c83b398 | Connect your App to Multiple Messaging Channel... | 77797132 | broidHQ |
| 4 | 5abe9b96c44bb82d0c83b399 | Proto Actor - Ultra fast distributed actors fo... | 77786107 | AsynkronIT |
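
The `owner.login` column comes from flattening the nested `owner` sub-document of each record. As a rough illustration of what that flattening does, here is a plain-Python sketch; the `flatten` helper is hypothetical and is not how pandas implements `json_normalize`, it only reproduces the dotted column names on a toy record.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents such as {"owner": {"login": ...}}.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Toy record shaped like the documents returned by the query above.
record = {"id": 77788381,
          "description": "How to be low-level programmer",
          "owner": {"login": "gurugio"}}
flat_record = flatten(record)
print(flat_record)
```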


### Pre-processing and Data Cleaning
The GitHub descriptions are written in more than 10 different languages. Some repositories have no description, or use only images in their descriptions. So before training the model we need to filter the descriptions by language and get rid of all empty ones. Note that the cell below drops only the descriptions that langdetect identifies as `zh-cn`, plus those it cannot analyze at all (each of these prints "No features in text.").

In [57]:
# Seed langdetect so that the detection results are reproducible.
DetectorFactory.seed = 0
# Note: temp is an alias of df_author_des, so the drops below modify it too.
temp = df_author_des
for index,row in temp.iterrows():
    try:
        if detect(str(row['description'])) == 'zh-cn':
            temp.drop(index, inplace=True)
    except Exception as e:
        temp.drop(index, inplace=True)
        print(str(e))
len(temp)


No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.
No features in text.


11655
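
The cell above drops rows from the DataFrame while iterating over it. An alternative is to collect the entries to keep in a new list. The sketch below shows that pattern with a pluggable detector; `filter_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in so that the example runs without langdetect installed. In the notebook, `langdetect.detect` would be passed instead.

```python
def filter_by_language(descriptions, detect_fn, keep="en"):
    """Return (index, text) pairs whose detected language equals `keep`.

    Entries for which `detect_fn` raises are skipped, mirroring the
    try/except in the cell above.
    """
    kept = []
    for index, text in enumerate(descriptions):
        try:
            if detect_fn(str(text)) == keep:
                kept.append((index, text))
        except Exception:
            continue  # e.g. empty or image-only descriptions
    return kept


def toy_detect(text):
    """Stand-in detector for the demo only: pure-ASCII text counts as English."""
    if not text.strip():
        raise ValueError("No features in text.")
    return "en" if all(ord(ch) < 128 for ch in text) else "other"


samples = ["How to be low-level programmer", "", "\u793a\u4f8b\u9879\u76ee"]
kept = filter_by_language(samples, toy_detect)
print(kept)
```

Unlike the cell above, this variant keeps only the entries detected as English, which may remove more rows than the `zh-cn` check does.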

In [58]:
temp.head()

|   | _id | description | id | owner.login |
|---|---|---|---|---|
| 0 | 5abe9b96c44bb82d0c83b395 | How to be low-level programmer | 77788381 | gurugio |
| 1 | 5abe9b96c44bb82d0c83b396 | Import OpenStreetMap data into Unreal Engine 4 | 77765042 | ue4plugins |
| 2 | 5abe9b96c44bb82d0c83b397 | react-native template to target multiple platf... | 77784093 | react-everywhere |
| 3 | 5abe9b96c44bb82d0c83b398 | Connect your App to Multiple Messaging Channel... | 77797132 | broidHQ |
| 4 | 5abe9b96c44bb82d0c83b399 | Proto Actor - Ultra fast distributed actors fo... | 77786107 | AsynkronIT |


In [96]:
for index,row in temp.iterrows():
    try:
        if row['description'] is None:
            temp.drop(index, inplace=True)
    except Exception as e:
        temp.drop(index, inplace=True)
        print(str(e))
len(temp)

11145

In [170]:
temp = temp.reset_index(drop=True)

Construct a mapping from author names to document IDs.

In [172]:
# Get all author names and their corresponding document IDs.
author2doc = dict()
for index,row in temp.iterrows():
    if not author2doc.get(row['owner.login']):
        # This is a new author.
        author2doc[row['owner.login']] = []
    author2doc[row['owner.login']].append(index)
# Test
author2doc['gurugio']
#len(author2doc)

[0]
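
The same mapping can be built a little more compactly with `collections.defaultdict`. This is a sketch on toy rows, not the notebook's data; the third row is made up so that one author owns two documents.

```python
from collections import defaultdict

# Toy (owner, description) rows; the position in the list is the document ID.
rows = [("gurugio", "How to be low-level programmer"),
        ("ue4plugins", "Import OpenStreetMap data into Unreal Engine 4"),
        ("gurugio", "a second, made-up description")]

author2doc_demo = defaultdict(list)
for doc_id, (owner, _description) in enumerate(rows):
    author2doc_demo[owner].append(doc_id)

# Convert back to a plain dict, the same type as author2doc above.
author2doc_demo = dict(author2doc_demo)
print(author2doc_demo)
```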

In [174]:
# Get all document texts and their corresponding IDs.
des2doc = dict()
for index,row in temp.iterrows():
    des2doc[index] = row['description']
# Test
des2doc[0]

'How to be low-level programmer'

The text will be pre-processed using the following steps:
1. Tokenize text (after lowercasing it).
2. Remove stopwords.
3. Stem the remaining tokens.
4. Remove all punctuation and numbers.
5. Add frequent bigrams.
6. Remove frequent and rare words.
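
The first steps of the list above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `STOP_WORDS` is a tiny hand-picked list rather than the one returned by `get_stop_words('en')`, and stemming with NLTK's `PorterStemmer`, which the notebook's cell performs, is left out so that the example has no dependencies.

```python
import re

# Tiny illustrative stop-word list (an assumption for this demo).
STOP_WORDS = {"a", "an", "the", "to", "for", "of", "your", "into", "be", "how"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on word characters, drop stop words and non-alphabetic tokens."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    # str.isalpha() rejects tokens that contain digits or underscores.
    return [t for t in tokens if t.isalpha()]


tokens_demo = preprocess("Import OpenStreetMap data into Unreal Engine 4")
print(tokens_demo)
```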


In [175]:
# Pre-processing
texts = []
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer  
p_stemmer = PorterStemmer()

# loop through document list
for description in des2doc.values():
    if description is not None:
        raw = description.lower()
        # clean and tokenize document string
        tokens = tokenizer.tokenize(raw)

        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

        # add tokens to list
        texts.append(stemmed_tokens)


# Get rid of numbers
no_number_texts = []
for i in texts:
    j = [item for item in i if item.isalpha()]
    no_number_texts.append(j)
#print(no_number_texts)
len(texts)
#len(no_number_texts)

11145

In [176]:
docs = no_number_texts
len(docs)

11145

Below, we use a Gensim model to add bigrams. Note that this serves a purpose similar to named entity recognition, that is, finding adjacent words that have some particular significance.

In [177]:
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
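
To make the idea concrete, here is a deliberately simplified sketch that appends a `w1_w2` token for every adjacent pair that occurs often enough. It is an assumption-laden stand-in: Gensim's `Phrases` also applies a collocation score with a threshold, which this version ignores, and `add_frequent_bigrams` is a hypothetical helper name.

```python
from collections import Counter


def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    return [doc + ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
            for doc in docs]


toy_docs = [["deep", "learn", "framework"],
            ["deep", "learn", "tutori"],
            ["command", "line", "tool"]]
with_bigrams = add_frequent_bigrams(toy_docs)
for doc in with_bigrams:
    print(doc)
```

As in the cell above, the original unigrams are kept and the bigram is added as an extra token.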



Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (those that appear in more than 50% of the documents) and rare words (those that appear in fewer than 20 documents).

In [178]:
# Create a dictionary representation of the documents, and filter out frequent and rare words.
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]  # This sort of "initializes" dictionary.id2token.
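
Both thresholds are expressed in terms of document frequency, that is, the number of documents a token appears in. The sketch below reproduces that filtering rule on a toy corpus; `filter_by_document_frequency` is a hypothetical helper, and it ignores the `keep_n` cap that `filter_extremes` also applies.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and in at most a fraction `no_above` of all docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [["use", "android", "app"],
            ["use", "android", "librari"],
            ["use", "swift", "app"],
            ["use", "rust"]]
kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept_tokens))
```

Here `use` is dropped because it appears in every document, while `librari`, `swift` and `rust` are dropped because they appear in only one.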

We compute the bag-of-words representation of the documents, which is the vectorized input that the author-topic model expects.

In [179]:
# Vectorize data.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
len(corpus)
#print(corpus)
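# Shuffle a copy and split train/test. corpus itself must keep its order,
# because author2doc refers to documents by their position in corpus.
shuffled_corpus = list(corpus)
shuffle(shuffled_corpus)
train_corpus, test_corpus = shuffled_corpus[:7000], shuffled_corpus[7000:]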
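
A bag-of-words vector is simply a list of `(token_id, count)` pairs per document. The following sketch shows the idea in plain Python; `build_vocab` and `doc_to_bow` are hypothetical helpers, and the integer IDs they assign will generally differ from the ones Gensim's `Dictionary` assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer ID to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from `vocab`."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [["deep", "learn", "deep"], ["learn", "python"]]
vocab = build_vocab(toy_docs)
bows = [doc_to_bow(doc, vocab) for doc in toy_docs]
print(bows)
```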

Let's inspect the dimensionality of our data.

In [180]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))


Number of authors: 8260
Number of unique tokens: 746
Number of documents: 11145


### Train model
We train the author-topic model on the data prepared in the previous sections.
The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (id2word) and number of topics (num_topics), the author-topic model requires either an author to document ID mapping (author2doc), or the reverse (doc2author).
Below, we also set a few optional parameters (this can be skipped for now):
1. The number of passes over the dataset (passes); more passes help the optimization converge.
2. The number of iterations over each document (iterations), which is related to the above.
3. The mini-batch size (chunksize), primarily to speed up training.
4. Bound evaluation is turned off (eval_every=0), as it takes a long time to compute.
5. The random state (random_state) of the random number generator, to make these experiments reproducible.
Constructing the model with a corpus also trains it.
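
The two mappings mentioned above carry the same information in opposite directions. As a small illustration, the sketch below inverts an author-to-documents mapping into a document-to-authors mapping; `invert_author2doc` is a hypothetical helper written for this example, and the toy mapping is made up.

```python
def invert_author2doc(author2doc):
    """Build a doc ID -> list of authors mapping from an author -> doc IDs mapping."""
    doc2author = {}
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            doc2author.setdefault(doc_id, []).append(author)
    return doc2author


toy_author2doc = {"gurugio": [0, 2], "ue4plugins": [1]}
toy_doc2author = invert_author2doc(toy_author2doc)
print(toy_doc2author)
```

Each repository in this dataset has a single owner, so every list in the inverted mapping would contain exactly one author.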

In [192]:
# Train AuthorTopicModel
from gensim.models import AuthorTopicModel
model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \
                iterations=50, random_state=1)

We tried to improve the model by training it with different random initializations. We evaluate the topic coherence of each model using the top_topics method and store every model together with its score in model_list, so that the model with the highest topic coherence can be picked.

In [193]:
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                    author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                    eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))
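
Picking the best entry from a list of `(model, coherence)` pairs takes a single call to `max`. The sketch below uses strings in place of trained models and made-up coherence values, so it only demonstrates the selection step.

```python
# Stand-ins for (model, topic_coherence) pairs; the strings replace real models
# and the coherence values are made up for the demo.
model_list_demo = [("model_seed_0", -105.2),
                   ("model_seed_1", -98.7),
                   ("model_seed_2", -101.4)]

# Higher coherence is better, so take the pair with the maximum score.
best_model, best_tc = max(model_list_demo, key=lambda pair: pair[1])
print(best_model, best_tc)
```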

We can save the model to local disk so that we don't need to repeat the training if we exit the current process, and load it from local disk again later.

In [223]:
# Save model.
model.save('/tmp/model.atmodel')


In [195]:
# Load model.
model = AuthorTopicModel.load('/tmp/model.atmodel')

### Explore author-topic representation
Now that we have trained a model, we can start exploring the authors and the topics.
First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic.

In [196]:
model.show_topic(0)

[('base', 0.069723341902688682),
 ('support', 0.031294860416073808),
 ('platform', 0.02745445997944659),
 ('applic', 0.022327567661310291),
 ('librari', 0.020965424192181656),
 ('client', 0.019475310770711209),
 ('rust', 0.01784696014218002),
 ('spring', 0.017613247066807187),
 ('tutori', 0.017609409968538228),
 ('framework', 0.016970801074474896)]

We make a function to help us print the top topics of a particular author easily. It prints the topics associated with the author in ascending order of probability.

In [197]:
#Let's print the top topics of some authors. First, we make a function to help us do this more easily.
from pprint import pprint

def show_author(name):
    print('\n%s' % name)
    #print('Docs:', model.author2doc[name])
    print('Topics:')
    pprint([(topic[0], model.show_topic(topic[0])) for topic in sorted(model[name], key=lambda x:x[1])])

### Plotting the authors
Now we're going to produce the plot below, which looks a bit like a Pacific archipelago. The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.

We take all the author-topic distributions (stored in model.state.gamma) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE.

t-SNE is a method that attempts to reduce the dimensionality of a dataset while preserving the local neighborhood structure of the points. That means that if two authors are close together in the plot below, then their topic distributions are similar. Distances between points that are far apart are less reliable and should not be over-interpreted.
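
Similarity between two authors can also be checked directly on their topic distributions, without going through the 2D embedding. The sketch below uses the Hellinger distance as one possible measure; the choice of measure, the helper names, and the three made-up distributions are all assumptions for illustration.

```python
import math


def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a dense list of length `num_topics`."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


# Made-up topic distributions for three hypothetical authors.
author_a = to_dense([(2, 0.6), (4, 0.4)], num_topics=10)
author_b = to_dense([(2, 0.5), (4, 0.5)], num_topics=10)
author_c = to_dense([(7, 1.0)], num_topics=10)
print(hellinger(author_a, author_b) < hellinger(author_a, author_c))
```

Authors with a small distance to each other should tend to land close together in the plot.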

In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the smallest_author value if you do not want to view authors with few documents.


In [198]:
# Plotting Authors
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0  # Ignore authors with fewer documents than this.
authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(model.state.gamma[authors, :])  # Result stored in tsne.embedding_

We are now ready to display the plot using Bokeh. Please be patient, since it may take some time.

In [199]:
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
output_notebook()

Now we have the plot. As you can see, there are two noticeably large nodes in the plot. Hover over a node to see detailed information about it. The size of a node represents the number of documents the author created. The distance between two nodes reflects how similar the authors' topics are.

The biggest node in the top middle represents "fossasia", an open source community based in Asia with members from all continents. If you move the mouse inside the node, you can see that it overlaps with other nodes, such as "Microsoft" and "facebookresearch".

In [200]:
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [model.id2author[a] for a in authors]

# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]

source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            author_names=author_names,
            author_sizes=author_sizes,
            radii=radii,
        )
    )

# Add author names and sizes to mouse-over info.
hover = HoverTool(
        tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
        ]
    )

p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)

In [232]:
# Show author-topics
#sort(model['Microsoft'])
li = model['Microsoft']
sorted(li,key=lambda x: x[1])


[(8, 0.27661388079860771), (2, 0.3412281114309188), (4, 0.37763043171492894)]
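
To list the dominant topic first instead, the same pairs can be sorted in descending order. The sketch below reuses the three values printed above as literal data.

```python
# (topic_id, probability) pairs, copied from the output above.
author_topics = [(8, 0.27661388079860771),
                 (2, 0.3412281114309188),
                 (4, 0.37763043171492894)]

# Sort by probability, highest first.
ranked = sorted(author_topics, key=lambda pair: pair[1], reverse=True)
ranked_ids = [topic_id for topic_id, _prob in ranked]
print(ranked_ids)
```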

We can take a look at the topics associated with "Microsoft" to get a sense of which topics the authors in the biggest cluster are concerned with.

In [234]:
show_author('Microsoft')


Microsoft
Topics:
[(8,
  [('vue', 0.056159631707267101),
   ('http', 0.043609570727786248),
   ('design', 0.02717124923563383),
   ('js', 0.022487580662235863),
   ('gener', 0.021438158131224818),
   ('vue_js', 0.020461106852293683),
   ('com', 0.019888843017457691),
   ('servic', 0.019287640568173753),
   ('compon', 0.018405963063071625),
   ('ui', 0.017081426994044292)]),
 (2,
  [('android', 0.060251939684616611),
   ('github', 0.027886408191987837),
   ('develop', 0.027105797308662135),
   ('use', 0.025676000044923),
   ('app', 0.020510901833230611),
   ('easi', 0.019846111048806084),
   ('manag', 0.01937548834842244),
   ('googl', 0.017617379733289817),
   ('file', 0.016559120344661848),
   ('librari', 0.014753070190101908)]),
 (4,
  [('network', 0.047570019384068934),
   ('imag', 0.04577618985568991),
   ('object', 0.028135630807080637),
   ('neural', 0.025144879691775288),
   ('window', 0.024869468008436913),
   ('video', 0.02167246286283606),
   ('use', 0.021158593930090286),
 

In [235]:
show_author('facebookresearch')


facebookresearch
Topics:
[(5,
  [('learn', 0.087000349398529547),
   ('python', 0.059455558933786165),
   ('deep', 0.032648626572385826),
   ('machin', 0.031443332185353959),
   ('deep_learn', 0.025438198056312752),
   ('machin_learn', 0.021768696185710493),
   ('use', 0.020157543055956781),
   ('kubernet', 0.0180829352841291),
   ('base', 0.016905498679334933),
   ('code', 0.015152944531592358)]),
 (1,
  [('io', 0.078001174566502532),
   ('swift', 0.058385605651483971),
   ('app', 0.028919176723807661),
   ('server', 0.026496736456531587),
   ('line', 0.023654323708450196),
   ('command', 0.023547776062139314),
   ('api', 0.018621261902416933),
   ('net', 0.018055608443376392),
   ('command_line', 0.01747232653562928),
   ('simpl', 0.016824675910870605)]),
 (9,
  [('tool', 0.064624168726951842),
   ('sourc', 0.050343560601393202),
   ('open', 0.041458813070464659),
   ('open_sourc', 0.032186866228856741),
   ('c', 0.029068175574069377),
   ('written', 0.025659547564784394),
   ('libr

In [219]:
show_author('fossasia')


fossasia
Topics:
[(1,
  [('io', 0.078001174566502532),
   ('swift', 0.058385605651483971),
   ('app', 0.028919176723807661),
   ('server', 0.026496736456531587),
   ('line', 0.023654323708450196),
   ('command', 0.023547776062139314),
   ('api', 0.018621261902416933),
   ('net', 0.018055608443376392),
   ('command_line', 0.01747232653562928),
   ('simpl', 0.016824675910870605)]),
 (2,
  [('android', 0.060251939684616611),
   ('github', 0.027886408191987837),
   ('develop', 0.027105797308662135),
   ('use', 0.025676000044923),
   ('app', 0.020510901833230611),
   ('easi', 0.019846111048806084),
   ('manag', 0.01937548834842244),
   ('googl', 0.017617379733289817),
   ('file', 0.016559120344661848),
   ('librari', 0.014753070190101908)]),
 (3,
  [('js', 0.047193286097207549),
   ('use', 0.037467665799009162),
   ('node', 0.028260300365184961),
   ('gener', 0.026528205499654015),
   ('s', 0.020581330958195344),
   ('framework', 0.019204071338976281),
   ('css', 0.018726219751528512),
   

In [222]:
len(model.author2doc['Microsoft'])

42

In [224]:
show_author('google')


google
Topics:
[(0,
  [('base', 0.069723341902688682),
   ('support', 0.031294860416073808),
   ('platform', 0.02745445997944659),
   ('applic', 0.022327567661310291),
   ('librari', 0.020965424192181656),
   ('client', 0.019475310770711209),
   ('rust', 0.01784696014218002),
   ('spring', 0.017613247066807187),
   ('tutori', 0.017609409968538228),
   ('framework', 0.016970801074474896)]),
 (1,
  [('io', 0.078001174566502532),
   ('swift', 0.058385605651483971),
   ('app', 0.028919176723807661),
   ('server', 0.026496736456531587),
   ('line', 0.023654323708450196),
   ('command', 0.023547776062139314),
   ('api', 0.018621261902416933),
   ('net', 0.018055608443376392),
   ('command_line', 0.01747232653562928),
   ('simpl', 0.016824675910870605)]),
 (4,
  [('network', 0.047570019384068934),
   ('imag', 0.04577618985568991),
   ('object', 0.028135630807080637),
   ('neural', 0.025144879691775288),
   ('window', 0.024869468008436913),
   ('video', 0.02167246286283606),
   ('use', 0.0211

In [225]:
show_author('alibaba')


alibaba
Topics:
[(0,
  [('base', 0.069723341902688682),
   ('support', 0.031294860416073808),
   ('platform', 0.02745445997944659),
   ('applic', 0.022327567661310291),
   ('librari', 0.020965424192181656),
   ('client', 0.019475310770711209),
   ('rust', 0.01784696014218002),
   ('spring', 0.017613247066807187),
   ('tutori', 0.017609409968538228),
   ('framework', 0.016970801074474896)]),
 (2,
  [('android', 0.060251939684616611),
   ('github', 0.027886408191987837),
   ('develop', 0.027105797308662135),
   ('use', 0.025676000044923),
   ('app', 0.020510901833230611),
   ('easi', 0.019846111048806084),
   ('manag', 0.01937548834842244),
   ('googl', 0.017617379733289817),
   ('file', 0.016559120344661848),
   ('librari', 0.014753070190101908)]),
 (4,
  [('network', 0.047570019384068934),
   ('imag', 0.04577618985568991),
   ('object', 0.028135630807080637),
   ('neural', 0.025144879691775288),
   ('window', 0.024869468008436913),
   ('video', 0.02167246286283606),
   ('use', 0.02115