Deduplicating PGA with apollo post #242
Conversation
I think it will be much faster to rewrite the text while preserving all the content and ideas. I will place my version on top of yours and ask you to review it.
date: 2018-09-15
title: "Deduplicating PGA with Apollo"
image: /post/deduplicating_pga_with_apollo/smile2.png
description: "We describe how we ran Apollo on PGA, in order to find communities of duplicate files."
PGA -> Public Git Archive
in order to find groups of fuzzy duplicate files.
---

We [announced](announcing-pga.md) at the beginning of this summer the release of
[announced](../announcing-pga)
starred repositories. Now, we'll describe how we tried to deduplicate
part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
at massive scale
We [announced](announcing-pga.md) at the beginning of this summer the release of
`Public Git Archive`, a dataset containing 3TB of files from GitHub's most
starred repositories. Now, we'll describe how we tried to deduplicate
deduplicate it (no "part" - we will explain this later)
part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
which -> who
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
in 4 languages
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
on GitHub
leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
PGA has fewer files in the HEAD revision, and we
i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
not want to give our readers a *DéjàVu* by repeating the same analysis. so, we aimed
at something different: in order detect not only copy-pasting between files, but also
in order to detect
Sure, no problem if you think it will be less time consuming. I'm not gonna be able to do much today (dentist and moving in my flat), so unless you get around to doing it today I'll push the final graph and its description tonight.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
Just added the last graph, put in third place given its size. Also reviewed the first paragraph, looks good to me minus a typo I corrected.
@r0mainK Is it possible to replace all JPG plots with PNG?
Sure, np, will do it this evening. I'll just screenshot the images.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@r0mainK Can you please add labels to images, e.g.
@vmarkovtsev sure, the graphs only or also the plots and pie charts?
All the graphics - some people look only at the images and the captions.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
@r0mainK what is "Cliques count"? A clique is a fully connected part of the graph and we definitely could not find them because it is an NP-hard problem. Did you mean edges?
Sorry, question closed, I have read the next paragraph :) Changing to buckets.
Actually, they are indeed cliques...
@r0mainK Did you weight the features? Cannot find anywhere that you mention it.
Staging updated. Is it possible to increase the fonts in the bar charts, histograms and the ratio of distinct filenames plot?
@vmarkovtsev no I didn't, but I think I can do it the same way I did for the average feature count (it might be better to replace it rather than just add it). It will take a day or two though.
According to @EgorBu, weighting gives a big accuracy boost. It also suggests the section about how we labeled our pairs and used them to optimize, which @EgorBu promised to write in https://github.com/src-d/backlog/issues/1253
@vmarkovtsev oh okay, I didn't understand the question - I thought you were asking if I'd counted the average weights for each feature and per language, which I just did. However, although @EgorBu's results could be mentioned when we talk about the lack of metrics, they couldn't really be used: he tuned for only one language, whereas we're using 5 here, and on a restrained number of files. As shown below, not only do the average counts vary, the average weighted counts should also, given the average weights:
Anyway, should I replace the counts with the previous weighted counts? It might take some rewriting of the paragraph. Or I could just append the results.
I see. Is it hard to rehash the files with
In theory, it will take us less than 4 days without much human participation. Unfortunately, this means the numbers will change. If that's too much, no worries.
@r0mainK I dumped the current state, staging updated. There is only one thing which bothers me: shouldn't we measure the similar file names ratio (very smart idea btw!) in the detected communities instead of the connected components?
Also, what was the timeout for detecting communities in the 4 largest CCs?
Awesome. No need to show file names in CCs, just communities is enough. Regarding the weights, indeed we are not using all the feature types we coded - that's what hyperoptimization found. There are still small weights assigned; we ignore them because they do not carry new information.
@m09 I believe your feedback will be valuable here.
This post is very well written and interesting. In the current form, the two biggest drawbacks I see are:
- Lack of theory (there is virtually no theory given on the approach, so when I read it for the first time, remembering only partially what LSH was, it felt like I was looking at cool pictures to look at cool pictures; it was hard to extract meaning from the post).
- Too long. To be more precise, as is it would be almost ok (a bit too long already for me), but if we add the necessary theory then it just becomes monstrous.

To be more explicit about the lack of theory: there are only links given for MinHash, LSH, WalkTrap, FuzzyWuzzy, and those are about the only algorithms used in the approach, which means no theory is given. It's kinda fine if we expect the reader to actually read all the given links, but then we really expect a super dedicated reader.
My suggestion is therefore to split this blog post in a series of 3 blog posts, with almost no refactoring from the current state, but only some copy paste + quick contextualization of theory from other blog posts/sources.
We [announced](../announcing-pga) the release of `Public Git Archive`,
a dataset with 3TB of Git data from the most starred repositories on GitHub, this summer.
I know it's more french-y than english-y, but "This summer, we announced..." sounds better to me than having it at the end of the sentence.
Now it's time to tell how we tried to deduplicate files in the latest revision
of the repositories in PGA using our research project for code deduplication, [src-d/apollo](https://github.com/src-d/apollo). Before diving deep, let's quickly see why we created it.
To the best of our knowledge, the only efforts to detect code clones at massive scale have been
by the authors of [DéjàVu project](http://mondego.ics.uci.edu/projects/dejavu/) by Lopes et.al., who
have been made by Lopes et. al., the authors of ...
1. Extract [bags of features](http://www.cs.unc.edu/~lazebnik/spring09/lec18_bag_of_features.pdf) from each file and apply [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to keep only the most relevant items (feature extraction step).
2. Hash those bags and produce the global pairwise similarity graph of the files
(Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) step).
s/(Locality/[Locality/
Hash those bags and produce the global pairwise similarity graph of the files [Locality Sensitive Hashing]
to
Produce the global pairwise similarity graph (using [Locality…])
to emphasize the idea over the details of implementation
done
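To make the hashing step concrete, here is a pure-Python toy sketch of MinHash plus LSH banding. The bag contents, signature length, and band count below are invented for illustration; apollo's real pipeline runs a weighted MinHash variant on Spark, not this code.

```python
# Toy MinHash + LSH banding: files whose bags collide in at least one
# band become candidate duplicate pairs (step 2 of the pipeline).
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (number of "hash permutations")
BANDS = 16     # LSH bands; 64 / 16 = 4 rows per band

def minhash(features, num_perm=NUM_PERM):
    """Return a MinHash signature for a bag (set) of string features."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for f in features))
    return sig

def lsh_buckets(signatures, bands=BANDS):
    """Group file ids whose signatures collide in at least one band."""
    rows = NUM_PERM // bands
    buckets = defaultdict(set)
    for file_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(file_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

bags = {  # invented toy bags; identical bags are guaranteed to collide
    "a.py": {"id_foo", "id_bar", "lit_42", "graphlet_If"},
    "b.py": {"id_foo", "id_bar", "lit_42", "graphlet_If"},
    "c.py": {"id_other", "lit_hello", "graphlet_Call"},
}
sigs = {k: minhash(v) for k, v in bags.items()}
dupes = lsh_buckets(sigs)
```

The more two bags overlap, the more likely their signatures agree on a whole band, which is what turns pairwise similarity search into bucket lookups.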
Percent of files in each language in our corpus.
{{% /caption %}}

We observed that over 65% of distinct features were **literals** due high variance,
due to high variance
{{% caption src="/post/deduplicating_pga_with_apollo/hist.png" %}}
Log-log histograms of the number of distinct file names in CCs, at the 80% threshold (left)
and the 95% threshold (right).
{{% /caption %}}
the image is not clickable to enlarge it and the basic font is too small/too blurry.
increased font
Even though we did not differentiate files by programming language,
almost no CCs had multi-language files. The only exception is
very large CCs at 95% threshold. Java, Ruby and Python had similar
s/is very large/was a very large/
Could also benefit from some explanations.
done, I expanded at the end of the section as there was stuff to say on this
there are too many files and the duplication criteria are subjective.

We manually labelled 2,000 pairs of Java files as almost the same, similar or
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
The paragraph above says that this is subjective and so clusters cannot be reviewed and here it's objective enough to be annotable.
I would remove the comment in the paragraph above.
why is there a 🐈?
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
tool kindly developed by our Applications team. We sampled them in a special way
to cover all possible labels, the labeling process was tricky and funny and we will
certainly describe it in our next blog post. We further ran hyperparameter
in a future blog post*
(6% of all files), mixed with [eBay's developer program projects](https://github.com/eBayDeveloper/eBay_APICall_CodeSamples)
(also 6%). The orange community is devoted to [YaviJava](https://github.com/yavijava/yavijava)
(4.5%), and the others represent a wide range of projects from Facebook, Paypal, Apache, etc.
There must be something deep in common for those codebases and
tofix
which needs fixing ?
The paragraph ends mid-sentence.
aaaah yeah sorry thought you were talking about the links my bad
It's on me for not being informative! Sorry :)
anyway will let @vmarkovtsev complete, since he wrote this ^^
I have a nasty feeling that I wanted to write smth smart there but somehow erased the ending :(
This needs to be fixed before publication
I have a nasty feeling that I fixed it once again and removed for the second time. Damn.
@m09 This is the biggest difference between a paper and a blog post: the vast majority of the people who read the latter could not care less about the theory; they want to be entertained, not taught. The point about each formula in a book reducing the audience by half is true. So our assumption about the "cutting edge" posts is that those who are interested will read deeper, and those who are not (90%) will ack the pics, remember the keywords and put a plus on Reddit. Splitting blog posts into parts is usually a bad idea unless there is more than one topic in the series. @r0mainK Do you want to fix the review suggestions yourself or delegate it to me? There are also points which I cannot fix myself.
@vmarkovtsev no, don't worry, I've just been unable to work on it as much with school, and the cluster crashed a couple of times so I couldn't work as fast as I wanted. I think by tomorrow night, or possibly tonight, I should have updated everything - sorry for the delay. With the new hyperparams there are some changes, so you might have to do a last review once I push though.
@vmarkovtsev I don't fully agree on the role of blogs. To me they are also a nice place to popularize (papers are not). But since popularizing would require some rewriting, I agree that we should keep this one as-is.
@vmarkovtsev pushed everything, didn't do anything related to this comment:
I think it's ready for a final review/rewrite on your side. If you want I can add this paragraph in the evening, but I'm in class all day so can't do it now.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@vmarkovtsev fuck, I forgot to pull your commits and amended -> force pushed my changes, I think it squashed yours :/
@r0mainK no worries, as long as they are not erased
I meant erased, was hoping you still have them?
@r0mainK No panic, in this case everything has been preserved :)
Second pass done. Staging updated. @campoy please review this monster - Romain spent 6 months on this.
optimization with [hyperopt](https://github.com/hyperopt/hyperopt) to determine
the feature weights, the optimal threshold and the rest of the variables.
But the found hyperparameters are biased by our ML team's opinions
and are not necessary the best for everybody. They are also specific to
necessarily
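To illustrate what that optimization does, here is a hedged sketch with invented toy labels; plain random search stands in for hyperopt's actual TPE sampler, but the objective - accuracy of a weighted similarity score against labelled pairs - has the same shape.

```python
# Toy hyperparameter search: find per-feature weights and a similarity
# threshold that best reproduce human duplicate/non-duplicate labels.
# The labelled pairs and feature names below are invented for the example.
import random

random.seed(0)
labelled = [  # (per-feature-type similarity, 1 = duplicate)
    ({"identifiers": 0.9, "literals": 0.8}, 1),
    ({"identifiers": 0.2, "literals": 0.9}, 0),
    ({"identifiers": 0.85, "literals": 0.4}, 1),
    ({"identifiers": 0.3, "literals": 0.1}, 0),
]

def accuracy(w_id, w_lit, threshold):
    """Fraction of labelled pairs classified correctly by the weighted score."""
    correct = 0
    for feats, label in labelled:
        score = w_id * feats["identifiers"] + w_lit * feats["literals"]
        correct += int((score >= threshold) == bool(label))
    return correct / len(labelled)

# Random search over (w_id, w_lit, threshold) in [0, 1)^3.
best = max(
    ((random.random(), random.random(), random.random()) for _ in range(500)),
    key=lambda params: accuracy(*params))
```

On this toy data, down-weighting literals separates the classes: for instance `accuracy(1.0, 0.0, 0.5)` is a perfect 1.0, which mirrors the post's point that the tuned weights matter.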
cc @vcoisne
@vcoisne scheduled for October 4th. Check out our Blog & Press schedule for the dates blog posts will be published.
@vcoisne I mean this needs a review from the Devrel team...
There's a couple comments, but in general LGTM
- 54 million, and we did not want to give our readers a *DéjàVu* by repeating the same analysis.
So we aimed at something different: not only copy-paste between files, but also
involuntary rewrites of the same abstractions. Thus we extracted and used semantic features
from [Unified Abstract Syntax Trees](https://docs.sourced.tech/babelfish/uast/uast-specification).
Universal, not Unified
HEAD commits in `PGA` contain 54.5 million files spread across 181,481 projects.
In order to extract semantic features from the files, we had
to limit ourselves to the programming languages with a functional [Babelfish driver](https://docs.sourced.tech/babelfish/languages). This meant ≈26% of files,
those written in Python, Java, JavaScript, Ruby or Go.
What about PHP and Bash? They're also beta
Not in spring 2018. Added the note.
- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
the types of a UAST node? what are types in this context?
also s/it's/its/ or maybe s/it's/their/
- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
- **children**, a feature being composed of the node's type and it's [quantized](https://en.wikipedia.org/wiki/Quantization_(signal_processing))
s/it's/its/
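As a rough illustration of the graphlet feature, here is a toy sketch: each internal node contributes one feature built from its type and the types of its children. The dict-based node format and the type names are invented for this example; apollo walks real Babelfish UAST nodes.

```python
# Toy graphlet extraction from a nested-dict "AST".
def graphlets(node):
    """Yield one 'type>child types' feature per internal node."""
    children = node.get("children", [])
    if children:
        yield node["type"] + ">" + ",".join(c["type"] for c in children)
    for child in children:
        yield from graphlets(child)

# Invented tree roughly matching `def f(x): if x: return x`.
uast = {"type": "FunctionDef", "children": [
    {"type": "Arguments", "children": []},
    {"type": "If", "children": [
        {"type": "Compare", "children": []},
        {"type": "Return", "children": []},
    ]},
]}
bag = list(graphlets(uast))
# bag == ["FunctionDef>Arguments,If", "If>Compare,Return"]
```

Because the features ignore identifier spellings, two files with differently named variables but the same control-flow shape still share graphlets.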
- Running Spark on [Siva files](https://github.com/src-d/go-siva)
through [jgit-spark-connector](https://github.com/src-d/jgit-spark-connector)
led to massive amount of temporary data during feature extraction on the worker
"to a massive amount" or "to massive amounts"
{{% caption src="/post/deduplicating_pga_with_apollo/drop_opt.png" %}}
Average ratio of distinct file names per file in communities depending on
the minimum number of files, at 80% (left)and 95% (right) thresholds.
missing space right after (left)
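The metric in that caption can be shown with a small sketch (the paths below are invented): a ratio near 0 means most files in a community share one file name, classic duplication; a ratio near 1 means every file is named differently.

```python
# Toy "distinct file names per file" ratio for a group of paths.
def distinct_name_ratio(paths):
    """len(unique base names) / len(paths) for one community or CC."""
    names = [p.rsplit("/", 1)[-1] for p in paths]
    return len(set(names)) / len(names)

community = ["a/setup.py", "b/setup.py", "c/setup.py", "d/utils.py"]
ratio = distinct_name_ratio(community)  # 2 distinct names / 4 files = 0.5
```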
|-------------|-------------|--------------------------|----------------|-------------------|-----------|
| 2344 | 869,611 | 584 | 1058 | 3 | 80% |

This CC does not contain artificial vertices - buckets - because as we wrote in section "Landing"
Make "Landing"
a link to the section, I think you can point to it by using [Landing](#landing)
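For readers wondering how the CCs themselves come out of the pairwise similarity graph, a standard union-find pass over the edge list is enough; the node ids below are toy values, not apollo's actual Spark implementation.

```python
# Toy connected-components over a similarity-graph edge list via union-find.
def connected_components(edges):
    parent = {}

    def find(x):
        """Find x's root, with path halving for speed."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components
    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())

ccs = connected_components([(1, 2), (2, 3), (4, 5)])
# two components: {1, 2, 3} and {4, 5}
```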
This is one of the 4 CCs which count together for 42% of all of the
Java files in [GoogleAds java-lib](https://github.com/googleads/googleads-java-lib).
While the other three CCs generally correspond to the single project, our is more
s/our/ours/
Published as https://blog.sourced.tech/post/deduplicating_pga_with_apollo/ @r0mainK Congratulations! Your internship is now officially complete :-)
@vmarkovtsev sorry for the delay, here is a first version of the post.
Still missing:
add an additional doc for apollo to describe how to use the models: PR'ed, and it might be worth doing the same for the BOW model in sourced-ml