Deduplicating PGA with apollo post #242
Conversation
I think it will be much faster to rewrite the text while preserving all the content and ideas. I will place my version on top of yours and ask you to review it.
date: 2018-09-15
title: "Deduplicating PGA with Apollo"
image: /post/deduplicating_pga_with_apollo/smile2.png
description: "We describe how we ran Apollo on PGA, in order to find communities of duplicate files."
PGA -> Public Git Archive
in order to find groups of fuzzy duplicate files.
---

We [announced](announcing-pga.md) at the beginning of this summer the release of
[announced](../announcing-pga)
starred repositories. Now, we'll describe how we tried to deduplicate
part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
at massive scale
We [announced](announcing-pga.md) at the beginning of this summer the release of
`Public Git Archive`, a dataset containing 3TB of files from GitHub's most
starred repositories. Now, we'll describe how we tried to deduplicate
deduplicate it (no "part" - we will explain this later)
part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
which -> who
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
in 4 languages
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
on GitHub
leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
PGA has fewer files in the HEAD revision, and we
i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
not want to give our readers a *DéjàVu* by repeating the same analysis. so, we aimed
at something different: in order detect not only copy-pasting between files, but also
in order to detect
Sure, no problem if you think it will be less time consuming. I'm not gonna be able to do much today (dentist and moving in my flat), so unless you get around to doing it today I'll push the final graph and its description tonight.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
Just added the last graph, put in third place given its size. Also reviewed the first paragraph, looks good to me minus a typo I corrected.
@r0mainK Is it possible to replace all JPG plots with PNG?
Sure, np, will do it this evening. I'll just screenshot the images.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@r0mainK Can you please add labels to images, e.g.
@vmarkovtsev sure, the graphs only or also the plots and pie charts?
All the graphics - some people look only at the images and the captions.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
@r0mainK what is "Cliques count"? A clique is a fully connected part of the graph and we definitely could not find them because it is an NP-hard problem. Did you mean edges?
Sorry, question closed, I have read the next paragraph :) Changing to buckets.
Actually, they are indeed cliques...
@r0mainK Did you weight the features? Cannot find anywhere that you mention it.
Staging updated. Is it possible to increase the fonts in the bar charts, histograms and the ratio of distinct filenames plot?
@vmarkovtsev no I didn't, but I think I can do it the same way I did for the average feature count (it might be better to replace it rather than just add it). It will take a day or two though.
According to @EgorBu, weighting gives a big accuracy boost. It also suggests the section about how we labeled our pairs and used them to optimize, which @EgorBu promised to write in https://github.com/src-d/backlog/issues/1253
@vmarkovtsev oh okay, I didn't understand the question - I thought you were asking if I'd counted the average weights for each feature and per language, which I just did. However, although @EgorBu's results could be mentioned when we talk about the lack of metrics, they couldn't really be used: he tuned for only one language, whereas we're using 5 here, and on a restrained number of files. As shown below, not only do the average counts vary, the average weighted counts should also, given the average weights:
Anyway, should I replace the counts with the previous weighted counts? It might take some rewriting of the paragraph. Or I could just append the results.
I see. Is it hard to rehash the files with
In theory, it will take us less than 4 days without much human participation. Unfortunately, this means the numbers will change. If that's too much, no worries.
@r0mainK I dumped the current state, staging updated. There is only one thing which bothers me: shouldn't we measure the similar file names ratio (very smart idea btw!) in the detected communities instead of the connected components?
Also, what was the timeout for detecting communities in the 4 largest CCs?
Awesome. No need to show file names in CCs, just communities is enough. Regarding the weights, indeed we are not using all the feature types we coded - that's what hyperoptimization found. There are still small weights assigned; we ignore them because they do not carry new information.
@m09 I believe your feedback will be valuable here.
This post is very well written and interesting. In the current form, the two biggest drawbacks I see are:
- Lack of theory (there is virtually no theory given on the approach, so when I read it for the first time, remembering only partially what LSH was, it felt like I was looking at cool pictures to look at cool pictures; it was hard to extract meaning from the post).
- Too long. To be more precise, as is it would be almost ok (a bit too long already for me), but if we add the necessary theory then it just becomes monstrous.

To be more explicit about the lack of theory: there are only links given for MinHash, LSH, WalkTrap, FuzzyWuzzy, and those are about the only algorithms used in the approach, which means no theory is given. It's kinda fine if we expect the reader to actually read all the given links, but then we really expect a super dedicated reader.
My suggestion is therefore to split this blog post in a series of 3 blog posts, with almost no refactoring from the current state, but only some copy paste + quick contextualization of theory from other blog posts/sources.
We [announced](../announcing-pga) the release of `Public Git Archive`,
a dataset with 3TB of Git data from the most starred repositories on GitHub, this summer.
I know it's more french-y than english-y, but "This summer, we announced..." sounds better to me than having it at the end of the sentence.
Now it's time to tell how we tried to deduplicate files in the latest revision
of the repositories in PGA using our research project for code deduplication, [src-d/apollo](https://github.com/src-d/apollo). Before diving deep, let's quickly see why we created it.
To the best of our knowledge, the only efforts to detect code clones at massive scale have been
by the authors of [DéjàVu project](http://mondego.ics.uci.edu/projects/dejavu/) by Lopes et.al., who
have been made by Lopes et. al., the authors of ...
1. Extract [bags of features](http://www.cs.unc.edu/~lazebnik/spring09/lec18_bag_of_features.pdf) from each file and apply [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to keep only the most relevant items (feature extraction step).
2. Hash those bags and produce the global pairwise similarity graph of the files
(Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) step).
s/(Locality/[Locality/
Hash those bags and produce the global pairwise similarity graph of the files [Locality Sensitive Hashing]
to
Produce the global pairwise similarity graph (using [Locality…])
to emphasize the idea over the details of implementation
done
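To make the hashing step concrete, here is a pure-Python toy sketch of MinHash plus LSH banding. The bag contents, signature length, and band count below are invented for illustration; apollo's real pipeline runs a weighted MinHash variant on Spark, not this code.

```python
# Toy MinHash + LSH banding: files whose bags collide in at least one
# band become candidate duplicate pairs (step 2 of the pipeline).
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (number of "hash permutations")
BANDS = 16     # LSH bands; 64 / 16 = 4 rows per band

def minhash(features, num_perm=NUM_PERM):
    """Return a MinHash signature for a bag (set) of string features."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for f in features))
    return sig

def lsh_buckets(signatures, bands=BANDS):
    """Group file ids whose signatures collide in at least one band."""
    rows = NUM_PERM // bands
    buckets = defaultdict(set)
    for file_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(file_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

bags = {  # invented toy bags; identical bags are guaranteed to collide
    "a.py": {"id_foo", "id_bar", "lit_42", "graphlet_If"},
    "b.py": {"id_foo", "id_bar", "lit_42", "graphlet_If"},
    "c.py": {"id_other", "lit_hello", "graphlet_Call"},
}
sigs = {k: minhash(v) for k, v in bags.items()}
dupes = lsh_buckets(sigs)
```

The more two bags overlap, the more likely their signatures agree on a whole band, which is what turns pairwise similarity search into bucket lookups.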
Percent of files in each language in our corpus.
{{% /caption %}}

We observed that over 65% of distinct features were **literals** due high variance,
due to high variance
{{% caption src="/post/deduplicating_pga_with_apollo/hist.png" %}}
Log-log histograms of the number of distinct file names in CCs, at the 80% threshold (left)
and the 95% threshold (right).
{{% /caption %}}
the image is not clickable to enlarge it and the basic font is too small/too blurry.
increased font
Even though we did not differentiate files by programming language,
almost no CCs had multi-language files. The only exception is
very large CCs at 95% threshold. Java, Ruby and Python had similar
s/is very large/was a very large/
Could also benefit from some explanations.
done, I expanded at the end of the section as there was stuff to say on this
there are too many files and the duplication criteria are subjective.

We manually labelled 2,000 pairs of Java files as almost the same, similar or
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
The paragraph above says that this is subjective and so clusters cannot be reviewed and here it's objective enough to be annotable.
I would remove the comment in the paragraph above.
why is there a 🐈?
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
tool kindly developed by our Applications team. We sampled them in a special way
to cover all possible labels, the labeling process was tricky and funny and we will
certainly describe it in our next blog post. We further ran hyperparameter
in a future blog post*
(6% of all files), mixed with [eBay's developer program projects](https://github.com/eBayDeveloper/eBay_APICall_CodeSamples)
(also 6%). The orange community is devoted to [YaviJava](https://github.com/yavijava/yavijava)
(4.5%), and the others represent a wide range of projects from Facebook, Paypal, Apache, etc.
There must be something deep in common for those codebases and
tofix
which needs fixing ?
The paragraph ends mid-sentence.
aaaah yeah sorry thought you were talking about the links my bad
It's on me for not being informative! Sorry :)
anyway will let @vmarkovtsev complete, since he wrote this ^^
I have a nasty feeling that I wanted to write smth smart there but somehow erased the ending :(
This needs to be fixed before publication
I have a nasty feeling that I fixed it once again and removed for the second time. Damn.
@m09 This is the biggest difference between a paper and a blog post: the vast majority of the people who read the latter could not care less about the theory; they want to be entertained, not taught. The point about each formula in a book reducing the audience by half is true. So our assumption about the "cutting edge" posts is that those who are interested will read deeper, and those who are not (90%) will ack the pics, remember the keywords and put a plus on Reddit. Splitting blog posts into parts is usually a bad idea unless there is more than one topic in the series. @r0mainK Do you want to fix the review suggestions yourself or delegate it to me? There are also points which I cannot fix myself.
@vmarkovtsev no, don't worry, I've just been unable to work on it as much with school, and the cluster crashed a couple of times so I couldn't work as fast as I wanted. I think by tomorrow night, or possibly tonight, I should have updated everything - sorry for the delay. With the new hyperparams there are some changes, so you might have to do a last review once I push though.
@vmarkovtsev I don't fully agree on the role of blogs. To me they are also a nice place to popularize (papers are not). But since popularizing would require some rewriting, I agree that we should keep this one as-is.
@vmarkovtsev pushed everything, didn't do anything related to this comment:
I think it's ready for a final review/rewrite on your side. If you want I can add this paragraph in the evening, but I'm in class all day so can't do it now.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@vmarkovtsev fuck, I forgot to pull your commits and amended -> force pushed my changes, I think it squashed yours :/
@r0mainK no worries, as long as they are not erased
I meant erased, was hoping you still have them?
@r0mainK No panic, in this case everything has been preserved :)
Second pass done. Staging updated. @campoy please review this monster - Romain spent 6 months on this.
optimization with [hyperopt](https://github.com/hyperopt/hyperopt) to determine
the feature weights, the optimal threshold and the rest of the variables.
But the found hyperparameters are biased by our ML team's opinions
and are not necessary the best for everybody. They are also specific to
necessarily
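To illustrate what that optimization does, here is a hedged sketch with invented toy labels; plain random search stands in for hyperopt's actual TPE sampler, but the objective - accuracy of a weighted similarity score against labelled pairs - has the same shape.

```python
# Toy hyperparameter search: find per-feature weights and a similarity
# threshold that best reproduce human duplicate/non-duplicate labels.
# The labelled pairs and feature names below are invented for the example.
import random

random.seed(0)
labelled = [  # (per-feature-type similarity, 1 = duplicate)
    ({"identifiers": 0.9, "literals": 0.8}, 1),
    ({"identifiers": 0.2, "literals": 0.9}, 0),
    ({"identifiers": 0.85, "literals": 0.4}, 1),
    ({"identifiers": 0.3, "literals": 0.1}, 0),
]

def accuracy(w_id, w_lit, threshold):
    """Fraction of labelled pairs classified correctly by the weighted score."""
    correct = 0
    for feats, label in labelled:
        score = w_id * feats["identifiers"] + w_lit * feats["literals"]
        correct += int((score >= threshold) == bool(label))
    return correct / len(labelled)

# Random search over (w_id, w_lit, threshold) in [0, 1)^3.
best = max(
    ((random.random(), random.random(), random.random()) for _ in range(500)),
    key=lambda params: accuracy(*params))
```

On this toy data, down-weighting literals separates the classes: for instance `accuracy(1.0, 0.0, 0.5)` is a perfect 1.0, which mirrors the post's point that the tuned weights matter.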
cc @vcoisne
@vcoisne scheduled for October 4th. Check out our Blog & Press schedule for the dates blog posts will be published.
@vcoisne I mean this needs a review from the Devrel team...
There's a couple comments, but in general LGTM
- 54 million, and we did not want to give our readers a *DéjàVu* by repeating the same analysis.
So we aimed at something different: not only copy-paste between files, but also
involuntary rewrites of the same abstractions. Thus we extracted and used semantic features
from [Unified Abstract Syntax Trees](https://docs.sourced.tech/babelfish/uast/uast-specification).
Universal, not Unified
HEAD commits in `PGA` contain 54.5 million files spread across 181,481 projects.
In order to extract semantic features from the files, we had
to limit ourselves to the programming languages with a functional [Babelfish driver](https://docs.sourced.tech/babelfish/languages). This meant ≈26% of files,
those written in Python, Java, JavaScript, Ruby or Go.
What about PHP and Bash? They're also beta
Not in spring 2018. Added the note.
- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
the types of a UAST node? what are types in this context?
also s/it's/its/ or maybe s/it's/their/
- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
- **children**, a feature being composed of the node's type and it's [quantized](https://en.wikipedia.org/wiki/Quantization_(signal_processing))
s/it's/its/
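As a rough illustration of the graphlet feature, here is a toy sketch: each internal node contributes one feature built from its type and the types of its children. The dict-based node format and the type names are invented for this example; apollo walks real Babelfish UAST nodes.

```python
# Toy graphlet extraction from a nested-dict "AST".
def graphlets(node):
    """Yield one 'type>child types' feature per internal node."""
    children = node.get("children", [])
    if children:
        yield node["type"] + ">" + ",".join(c["type"] for c in children)
    for child in children:
        yield from graphlets(child)

# Invented tree roughly matching `def f(x): if x: return x`.
uast = {"type": "FunctionDef", "children": [
    {"type": "Arguments", "children": []},
    {"type": "If", "children": [
        {"type": "Compare", "children": []},
        {"type": "Return", "children": []},
    ]},
]}
bag = list(graphlets(uast))
# bag == ["FunctionDef>Arguments,If", "If>Compare,Return"]
```

Because the features ignore identifier spellings, two files with differently named variables but the same control-flow shape still share graphlets.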
- Running Spark on [Siva files](https://github.com/src-d/go-siva)
through [jgit-spark-connector](https://github.com/src-d/jgit-spark-connector)
led to massive amount of temporary data during feature extraction on the worker
"to a massive amount" or "to massive amounts"
{{% caption src="/post/deduplicating_pga_with_apollo/drop_opt.png" %}}
Average ratio of distinct file names per file in communities depending on
the minimum number of files, at 80% (left)and 95% (right) thresholds.
missing space right after (left)
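The metric in that caption can be shown with a small sketch (the paths below are invented): a ratio near 0 means most files in a community share one file name, classic duplication; a ratio near 1 means every file is named differently.

```python
# Toy "distinct file names per file" ratio for a group of paths.
def distinct_name_ratio(paths):
    """len(unique base names) / len(paths) for one community or CC."""
    names = [p.rsplit("/", 1)[-1] for p in paths]
    return len(set(names)) / len(names)

community = ["a/setup.py", "b/setup.py", "c/setup.py", "d/utils.py"]
ratio = distinct_name_ratio(community)  # 2 distinct names / 4 files = 0.5
```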
|-------------|-------------|--------------------------|----------------|-------------------|-----------|
| 2344 | 869,611 | 584 | 1058 | 3 | 80% |

This CC does not contain artificial vertices - buckets - because as we wrote in section "Landing"
Make "Landing"
a link to the section, I think you can point to it by using [Landing](#landing)
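For readers wondering how the CCs themselves come out of the pairwise similarity graph, a standard union-find pass over the edge list is enough; the node ids below are toy values, not apollo's actual Spark implementation.

```python
# Toy connected-components over a similarity-graph edge list via union-find.
def connected_components(edges):
    parent = {}

    def find(x):
        """Find x's root, with path halving for speed."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components
    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())

ccs = connected_components([(1, 2), (2, 3), (4, 5)])
# two components: {1, 2, 3} and {4, 5}
```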
This is one of the 4 CCs which count together for 42% of all of the
Java files in [GoogleAds java-lib](https://github.com/googleads/googleads-java-lib).
While the other three CCs generally correspond to the single project, our is more
s/our/ours/
Published as https://blog.sourced.tech/post/deduplicating_pga_with_apollo/ @r0mainK Congratulations! Your internship is now officially complete :-)
@vmarkovtsev sorry for the delay, here is a first version of the post.
Still missing:
add an additional doc for apollo to describe how to use the models: PR'ed, and it might be worth doing the same for the BOW model in sourced-ml