
Deduplicating PGA with apollo post #242

Merged
merged 20 commits into from
Oct 4, 2018

Conversation

r0mainK
Contributor

@r0mainK r0mainK commented Aug 30, 2018

@vmarkovtsev sorry for the delay, here is a first version of the post

Still missing:

  • waiting for the community detection on the 4 largest graphs, obtained for the 80% threshold, to end -> all 4 have over 100k nodes (see in blog post)
  • add one last graph so Ruby is represented, and maybe a sixth
  • add links to the Models (waiting to push CCs and CMDs, bags are already up)
  • add an additional doc for apollo to describe how to use the models: PRed, and it might be worth doing the same for the BOW model in sourced-ml

Contributor

@vmarkovtsev vmarkovtsev left a comment

I think it will be much faster to rewrite the text while preserving all the content and ideas. I will place my version on top of yours and ask you to review it.

date: 2018-09-15
title: "Deduplicating PGA with Apollo"
image: /post/deduplicating_pga_with_apollo/smile2.png
description: "We describe how we ran Apollo on PGA, in order to find communities of duplicate files."
Contributor

PGA -> Public Git Archive
in order to find groups of fuzzy duplicate files.

---


We [announced](announcing-pga.md) at the beginning of this summer the release of
Contributor

[announced](../announcing-pga)

starred repositories. Now, we'll describe how we tried to deduplicate
part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
Contributor

at massive scale


We [announced](announcing-pga.md) at the beginning of this summer the release of
`Public Git Archive`, a dataset containing 3TB of files from GitHub's most
starred repositories. Now, we'll describe how we tried to deduplicate
Contributor

deduplicate it (no "part" - we will explain this later)

part of it using our advanced code deduplicator from hell, [src-d/apollo](https://github.com/src-d/apollo).
Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
Contributor

which -> who

Before diving into how we did it, let's quickly look at why. To the best of our
knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
Contributor

in 4 languages

knowledge, the only efforts to detect code clones on a massive scale have been
from the authors of [DéjàVu](http://mondego.ics.uci.edu/projects/dejavu/), which
leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
Contributor

on GitHub

leveraged a huge corpus of over 428 million files from 4 languages to create a
map of code clones in GitHub. To do so, they relied on syntactical features
i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
Contributor

PGA has fewer files in the HEAD revision, and we

i.e. identifiers (`my_list`, `your_list`, ...) and literals (`if`, `for`, ...),
to compute file similarity. Unfortunately PGA is not as big a corpus, and we did
not want to give our readers a *DéjàVu* by repeating the same analysis. so, we aimed
at something different: in order detect not only copy-pasting between files, but also
Contributor

in order to detect

@r0mainK
Contributor Author

r0mainK commented Sep 4, 2018

Sure, no problem if you think it will be less time consuming. I'm not gonna be able to do much today (dentist and moving into my flat), so unless you get around to doing it today I'll push the final graph and its description tonight.


r0mainK and others added 2 commits September 5, 2018 15:00
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@r0mainK
Contributor Author

r0mainK commented Sep 5, 2018

Just added the last graph, put it in third place given its size.

Also reviewed the first paragraph, looks good to me minus a typo I corrected.

@vmarkovtsev
Contributor

@r0mainK Is it possible to replace all JPG plots with PNG?

@r0mainK
Contributor Author

r0mainK commented Sep 5, 2018

Sure, np, will do it this evening, I'll just screenshot the images.
EDIT: done

Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@vmarkovtsev
Contributor

@r0mainK Can you please add labels to images, e.g.

{{% caption src="/post/difftree/names.png" %}}
An example of a Git tree with some names in their nodes. The names of the nodes are shown between double quotes.
{{% /caption %}}

@r0mainK
Contributor Author

r0mainK commented Sep 6, 2018

@vmarkovtsev sure, the graphs only or also the plots and pie charts?

@vmarkovtsev
Contributor

vmarkovtsev commented Sep 6, 2018

All the graphics - some people look only at the images and the captions.

r0mainK and others added 2 commits September 6, 2018 15:51
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
@vmarkovtsev
Contributor

vmarkovtsev commented Sep 6, 2018

@r0mainK what is "Cliques count"? A clique is a fully connected part of the graph and we definitely could not find them because it is an NP-hard problem. Did you mean edges?

@vmarkovtsev
Contributor

Sorry, question closed, I have read the next paragraph :) Changing to buckets.

@vmarkovtsev
Contributor

Actually, they are indeed cliques...
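
For context: every file that lands in the same hash bucket is declared pairwise similar to the others in that bucket, so each bucket does induce a clique in the similarity graph. A minimal sketch of that expansion, with made-up bucket contents (not apollo's actual code):

```python
from itertools import combinations

# Hypothetical bucket contents: each LSH band hash maps to the files that share it.
buckets = {
    "band-0:3f2a": ["a.py", "b.py", "c.py"],
    "band-1:9c41": ["b.py", "d.py"],
}

# Every pair of files sharing a bucket gets an edge, so each bucket
# contributes a clique to the pairwise similarity graph.
edges = set()
for files in buckets.values():
    edges.update(combinations(sorted(files), 2))

print(sorted(edges))
# [('a.py', 'b.py'), ('a.py', 'c.py'), ('b.py', 'c.py'), ('b.py', 'd.py')]
```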

@vmarkovtsev
Contributor

vmarkovtsev commented Sep 6, 2018

@r0mainK Did you weight the features? Cannot find anywhere that you mention it.

@vmarkovtsev
Contributor

Staging updated. Is it possible to increase the fonts in the bar charts, histograms and the ratio of distinct filenames plot?

@r0mainK
Contributor Author

r0mainK commented Sep 6, 2018

@vmarkovtsev no I didn't, but I think I can do it the same way I did for the average feature count (might be better to replace it rather than just add it). It will take a day or two though.
For the fonts of the pie chart and histograms I'll do it now, but the ratio of distinct filenames will take some time, since it scales quadratically with the number of filenames per CC.

@vmarkovtsev
Contributor

According to @EgorBu, weighting gives a big accuracy boost. It also suggests a section about how we labeled our pairs and used them to optimize, which @EgorBu promised to write in https://github.com/src-d/backlog/issues/1253

@r0mainK
Contributor Author

r0mainK commented Sep 7, 2018

@vmarkovtsev oh okay, I didn't understand the question - I thought you were asking if I'd counted the average weights for each feature and per lang, which I just did. So if your question is whether I used weights during the hashing for each feature kind, no I didn't - I used weights equal to 1. If you're asking whether the bags of features were weighted - yes, naturally.

However, although @EgorBu's results could be mentioned when we talk about the lack of metrics, they couldn't really be used, as he tuned for only one language, whereas we're using 5 here - and on a restricted number of files - and as shown below, not only do the average counts vary, the average weighted counts should too, given the average weights:

| | identifiers | literals | graphlets | children | uast2seq | node2vec |
|---|---|---|---|---|---|---|
| Average weight cross-language | 6.42 | 6.94 | 4.98 | 3.93 | 4.78 | 7.67 |
| Average weight for Python files | 6.78 | 7.28 | 5.44 | 4.63 | 5.26 | 8.91 |
| Average weight for Java files | 6.35 | 7.12 | 4.48 | 3.46 | 4.23 | 6.64 |
| Average weight for Javascript files | 6.34 | 6.71 | 4.83 | 3.36 | 4.67 | 7.25 |
| Average weight for Ruby files | 5.14 | 7.01 | 6.20 | 5.08 | 6.27 | 10.81 |
| Average weight for Go files | 6.68 | 7.35 | 5.36 | 4.64 | 4.88 | 8.29 |

Anyway, should I replace the counts with the weighted counts? It might take some rewriting of the paragraph, or I could just append the results.

@vmarkovtsev
Contributor

I see. Is it hard to rehash the files with

  • identifiers weighted to 1
  • literals weighted to 1.25
  • graphlets weighted to 2.5
  • the rest weighted to 0?

In theory, it will take us less than 4 days without much human participation. Unfortunately, this means the numbers will change.

If that's too much, no worries.
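
A minimal sketch of what that reweighting amounts to, assuming a hypothetical bag layout where each key is prefixed by its feature type; apollo's real pipeline applies the weights inside its Spark feature extraction, so this is only an illustration:

```python
# Toy sketch of the proposed per-feature-type reweighting, not apollo's actual code.
# Assume each bag maps "<feature_type>.<token>" -> TF-IDF value (hypothetical key scheme).
TYPE_WEIGHTS = {
    "identifiers": 1.0,
    "literals": 1.25,
    "graphlets": 2.5,
    # children, uast2seq, node2vec, ... -> 0, i.e. dropped before hashing
}

def reweight(bag):
    out = {}
    for key, value in bag.items():
        feature_type = key.split(".", 1)[0]
        weight = TYPE_WEIGHTS.get(feature_type, 0.0)
        if weight:
            out[key] = value * weight
    return out

print(reweight({"identifiers.my_list": 0.4, "literals.42": 0.2, "children.if_3": 0.7}))
# {'identifiers.my_list': 0.4, 'literals.42': 0.25}
```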

@vmarkovtsev
Contributor

@r0mainK I dumped the current state, staging updated.

There is only one thing which bothers me: shouldn't we measure the similar file names ratio (very smart idea btw!) in the detected communities instead of the connected components?
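
A minimal sketch of that file-name metric, assuming plain file paths grouped per community (or per connected component); the exact computation used for the post's plots may differ:

```python
# Ratio of distinct file names per file in a group of similar files.
from statistics import mean

def distinct_name_ratio(file_paths):
    names = [path.rsplit("/", 1)[-1] for path in file_paths]
    return len(set(names)) / len(names)

# Hypothetical groups: could be connected components or detected communities.
communities = {
    "community-1": ["repo1/setup.py", "repo2/setup.py", "repo3/setup.py"],
    "community-2": ["repo1/a.py", "repo1/b.py"],
}
ratios = {name: distinct_name_ratio(files) for name, files in communities.items()}
print(ratios, mean(ratios.values()))
# {'community-1': 0.333..., 'community-2': 1.0} 0.666...
```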

@vmarkovtsev
Contributor

Also what was the timeout for detecting communities in the 4 largest CCs?

@r0mainK
Contributor Author

r0mainK commented Sep 7, 2018

@vmarkovtsev

  • hash and cc shouldn't take much time, cmd might take a bit more time depending on the CCs, but it's feasible. However, are you sure we replace and not just append? It means we will not be using node2vec, children, uast2seq - which is too bad. Also, appending might be a bit much, but we'd be able to compare naive vs pseudo-optimized - anyway, launching it now, the cluster is not being used atm

  • thanks ^^ yeah, I was kind of thinking about that when doing the Ruby graph; the thing is it might start to be a lot of different plots if we have 4 types of CCs / CMDs. It gives nearly the same result for 95% btw, and probably for 80% too (gonna go eat now):

(image attached)

  • timeout was a day or two for walktrap, less for infomap and more for fastgreedy
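
For readers unfamiliar with those algorithms: walktrap, infomap and fastgreedy are all available in python-igraph. A rough sketch of running them on one small weighted connected component (the real runs were on graphs with up to hundreds of thousands of nodes, hence the timeouts):

```python
import igraph as ig

# Hypothetical similarity edges within one CC: (file_i, file_j, similarity).
edges = [(0, 1, 0.93), (1, 2, 0.88), (0, 2, 0.91), (2, 3, 0.81)]
g = ig.Graph(n=4, edges=[(a, b) for a, b, _ in edges])
g.es["weight"] = [w for _, _, w in edges]

walktrap = g.community_walktrap(weights="weight").as_clustering()
fastgreedy = g.community_fastgreedy(weights="weight").as_clustering()
infomap = g.community_infomap(edge_weights="weight")

print(walktrap.membership, fastgreedy.membership, infomap.membership)
```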

@vmarkovtsev
Contributor

Awesome. No need to show file names in CCs, just communities is enough.

Regarding the weights, indeed we are not using all the feature types we coded - that's what hyperoptimization found. There are still small weights assigned; we ignore them because they do not carry new information.

@vmarkovtsev vmarkovtsev changed the title [WIP] Deduplicating PGA with apollo post Deduplicating PGA with apollo post Sep 10, 2018
@vmarkovtsev
Contributor

@m09 I believe your feedback will be valuable here.

Contributor

@m09 m09 left a comment

This post is very well written and interesting. In the current form the two biggest drawbacks I see are:

  1. Lack of theory (there is virtually no theory given on the approach, so when I read it for the first time, remembering only partially what LSH was, it felt like I was looking at cool pictures for the sake of cool pictures; it was hard to extract meaning from the post).

  2. Too long. To be more precise, as is it would be almost ok (a bit too long already for me) but if we add the necessary theory then it just becomes monstrous.

To be more explicit about the lack of theory, there are only links given for MinHash, LSH, WalkTrap, FuzzyWuzzy and those are about the only algorithms used in the approach, which means no theory is given. It's kinda fine if we expect the reader to actually read all the given links but then we really expect a super dedicated reader.

My suggestion is therefore to split this blog post into a series of 3 blog posts, with almost no refactoring from the current state but only some copy-paste + quick contextualization of theory from other blog posts/sources.



We [announced](../announcing-pga) the release of `Public Git Archive`,
a dataset with 3TB of Git data from the most starred repositories on GitHub, this summer.
Contributor

I know it's more french-y than english-y, but "This summer, we announced..." sounds better to me than having it at the end of the sentence.

Now it's time to tell how we tried to deduplicate files in the latest revision
of the repositories in PGA using our research project for code deduplication, [src-d/apollo](https://github.com/src-d/apollo). Before diving deep, let's quickly see why we created it.
To the best of our knowledge, the only efforts to detect code clones at massive scale have been
by the authors of [DéjàVu project](http://mondego.ics.uci.edu/projects/dejavu/) by Lopes et.al., who
Contributor

have been made by Lopes et. al., the authors of ...


1. Extract [bags of features](http://www.cs.unc.edu/~lazebnik/spring09/lec18_bag_of_features.pdf) from each file and apply [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to keep only the most relevant items (feature extraction step).
2. Hash those bags and produce the global pairwise similarity graph of the files
(Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) step).
Contributor

s/(Locality/[Locality/

Contributor

Hash those bags and produce the global pairwise similarity graph of the files [Locality Sensitive Hashing]

to

Produce the global pairwise similarity graph (using [Locality…])

to emphasize the idea over the details of implementation

Contributor Author

done
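
As an aside for readers, a much-simplified sketch of the hash-and-query step from the excerpt above, using the datasketch library on plain token sets; apollo's own pipeline hashes TF-IDF-weighted bags on Spark, so this only illustrates the idea:

```python
from datasketch import MinHash, MinHashLSH

# Hypothetical token sets standing in for per-file bags of features.
files = {
    "a.py": ["def", "main", "print", "hello"],
    "b.py": ["def", "main", "print", "hello", "world"],
    "c.py": ["class", "Foo", "bar"],
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
minhashes = {}
for name, tokens in files.items():
    m = MinHash(num_perm=128)
    for token in tokens:
        m.update(token.encode("utf8"))
    minhashes[name] = m
    lsh.insert(name, m)

# Candidate pairs: files whose estimated Jaccard similarity exceeds the threshold.
for name, m in minhashes.items():
    print(name, [other for other in lsh.query(m) if other != name])
```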

Percent of files in each language in our corpus.
{{% /caption %}}

We observed that over 65% of distinct features were **literals** due high variance,
Contributor

due to high variance

content/post/deduplicating_pga_with_apollo.md
{{% caption src="/post/deduplicating_pga_with_apollo/hist.png" %}}
Log-log histograms of the number of distinct file names in CCs, at the 80% threshold (left)
and the 95% threshold (right).
{{% /caption %}}
Contributor

the image is not clickable to enlarge it and the basic font is too small/too blurry.

Contributor Author

increased font


Even though we did not differentiate files by programming language,
almost no CCs had multi-language files. The only exception is
very large CCs at 95% threshold. Java, Ruby and Python had similar
Contributor

s/is very large/was a very large/
Could also benefit from some explanations.

Contributor Author

done, I expanded at the end of the section as there was stuff to say on this

there are too many files and the duplication criteria are subjective.

We manually labelled 2,000 pairs of Java files as almost the same, similar or
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
Contributor

The paragraph above says that this is subjective and so clusters cannot be reviewed and here it's objective enough to be annotable.

I would remove the comment in the paragraph above.

Contributor

why is there a 🐈?

Contributor

(image attached)

different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
tool kindly developed by our Applications team. We sampled them in a special way
to cover all possible labels, the labeling process was tricky and funny and we will
certainly describe it in our next blog post. We further ran hyperparameter
Contributor

in a future blog post*

(6% of all files), mixed with [eBay's developer program projects](https://github.com/eBayDeveloper/eBay_APICall_CodeSamples)
(also 6%). The orange community is devoted to [YaviJava](https://github.com/yavijava/yavijava)
(4.5%), and the others represent a wide range of projects from Facebook, Paypal, Apache, etc.
There must be something deep in common for those codebases and
Contributor

tofix

Contributor Author

which needs fixing?

Contributor

The paragraph ends mid-sentence.

Contributor Author

aaaah yeah sorry, I thought you were talking about the links, my bad

Contributor

It's on me for not being informative! Sorry :)

Contributor Author

anyway, I will let @vmarkovtsev complete it, since he wrote this ^^

Contributor

I have a nasty feeling that I wanted to write smth smart there but somehow erased the ending :(

Contributor

This needs to be fixed before publication

Contributor

I have a nasty feeling that I fixed it once again and removed it for the second time. Damn.

@vmarkovtsev
Contributor

vmarkovtsev commented Sep 16, 2018

@m09 This is the biggest difference between a paper and a blog post: the bitter majority of the people who read the latter could not care less about the theory, and they want to be entertained, not taught. The point about each formula in a book reducing the audience by half is true. So our assumption about the "cutting edge" posts is that those who are interested will read deeper, those who are not (90%) will ack the pics, remember the keywords and put a plus on Reddit.

Splitting blog posts into parts is usually a bad idea unless there is more than one topic in the series.
We could write a post specifically about the theory, but there is one big problem: there is nobody to write it. Romain is out, and I have other important posts to write. If you know other, better links to the fundamentals - let's add them.

@r0mainK Do you want to fix the review suggestions yourself or delegate it to me? There are also points which I cannot fix myself.

@r0mainK
Contributor Author

r0mainK commented Sep 16, 2018

@vmarkovtsev no don't worry, I've just been unable to work on it as much with school, and the cluster crashed a couple times so I couldn't work as fast as I wanted. I think by tomorrow night or possibly tonight I should have updated everything hopefully, sorry for the delay. With the new hyperparams there are some changes so you might have to do a last review once I push though.

@m09
Contributor

m09 commented Sep 16, 2018

@vmarkovtsev I don't fully agree on the role of blogs. To me they are also a nice place to popularize (papers are not). But since popularizing would require some rewriting, I agree that we should keep this one as-is.

@r0mainK
Contributor Author

r0mainK commented Sep 17, 2018

@vmarkovtsev pushed everything; I didn't do anything related to this comment:

Lacks an introduction of the techno used. In the following paragraph we hear plenty of problems about techno we were not introduced to (like k8s, siva, and jgit-spark-connector). Giving a link doesn't replace telling the reader why they're used here in a few words, since most readers won't follow those links.

I think it's ready for a final review/rewrite on your side. If you want, I can add this paragraph in the evening, but I'm in class all day so I can't do it now.

Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@r0mainK
Contributor Author

r0mainK commented Sep 17, 2018

@vmarkovtsev fuck, I forgot to pull your commits and amended -> force-pushed my changes, I think it squashed yours :/

@vmarkovtsev
Contributor

@r0mainK no worries, as long as they are not erased

@r0mainK
Contributor Author

r0mainK commented Sep 17, 2018

I meant erased, I was hoping you still have them?

@vmarkovtsev
Contributor

@r0mainK No panic, in this case everything has been preserved :)

@vmarkovtsev
Contributor

Second pass done. Staging updated. @campoy please review this monster - Romain spent 6 months on this.

optimization with [hyperopt](https://github.com/hyperopt/hyperopt) to determine
the feature weights, the optimal threshold and the rest of the variables.
But the found hyperparameters are biased by our ML team's opinions
and are not necessary the best for everybody. They are also specific to
Contributor

necessarily
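
A schematic sketch of what the hyperopt search in the excerpt above looks like; the search space here is made up and the objective is a stub, since the real one scores weight/threshold combinations against the labelled file pairs:

```python
from hyperopt import fmin, tpe, hp

# Hypothetical search space over per-feature-type weights and the similarity threshold.
space = {
    "identifiers": hp.uniform("identifiers", 0.0, 3.0),
    "literals": hp.uniform("literals", 0.0, 3.0),
    "graphlets": hp.uniform("graphlets", 0.0, 3.0),
    "threshold": hp.uniform("threshold", 0.7, 0.95),
}

def objective(params):
    # Dummy loss standing in for "1 - accuracy on the labelled pairs".
    return abs(params["threshold"] - 0.8) + abs(params["graphlets"] - 2.5)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print(best)
```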

@vmarkovtsev
Contributor

cc @vcoisne

@vmarkovtsev
Contributor

@campoy @vcoisne Friendly 1 week ping

@vcoisne
Contributor

vcoisne commented Sep 26, 2018

@vcoisne scheduled for October 4th. Check out our Blog & Press schedule for the dates blog posts will be published.

@vmarkovtsev
Contributor

@vcoisne I mean this needs a review from the Devrel team...

Contributor

@campoy campoy left a comment

There are a couple of comments, but in general LGTM

- 54 million, and we did not want to give our readers a *DéjàVu* by repeating the same analysis.
So we aimed at something different: not only copy-paste between files, but also
involuntary rewrites of the same abstractions. Thus we extracted and used semantic features
from [Unified Abstract Syntax Trees](https://docs.sourced.tech/babelfish/uast/uast-specification).
Contributor

Universal, not Unified

HEAD commits in `PGA` contain 54.5 million files spread across 181,481 projects.
In order to extract semantic features from the files, we had
to limit ourselves to the programming languages with a functional [Babelfish driver](https://docs.sourced.tech/babelfish/languages). This meant ≈26% of files,
those written in Python, Java, JavaScript, Ruby or Go.
Contributor

What about PHP and Bash? They're also beta

Contributor

Not in spring 2018. Added the note.


- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
Contributor

the types of a UAST node? what are types in this context?

also s/it's/its/ or maybe s/it's/their/

- **identifiers**, such as variable or function names.
- **literals**, e.g. integer or string constant values.
- **graphlets**, a feature being composed of the types of UAST node and it's children.
- **children**, a feature being composed of the node's type and it's [quantized](https://en.wikipedia.org/wiki/Quantization_(signal_processing))
Contributor

s/it's/its/
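
As a side note for readers, a toy illustration of the graphlet feature described in this excerpt, over a made-up tree class; the real extractor walks Babelfish UAST nodes, not this structure:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    type: str
    children: list = field(default_factory=list)

def graphlets(node):
    # One feature per node: its own type followed by the types of its children.
    yield (node.type, tuple(child.type for child in node.children))
    for child in node.children:
        yield from graphlets(child)

tree = Node("FunctionDef", [Node("Arguments"), Node("Return", [Node("Name")])])
print(list(graphlets(tree)))
# [('FunctionDef', ('Arguments', 'Return')), ('Arguments', ()),
#  ('Return', ('Name',)), ('Name', ())]
```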


- Running Spark on [Siva files](https://github.com/src-d/go-siva)
through [jgit-spark-connector](https://github.com/src-d/jgit-spark-connector)
led to massive amount of temporary data during feature extraction on the worker
Contributor

"to a massive amount" or "to massive amounts"

there are too many files and the duplication criteria are subjective.

We manually labelled 2,000 pairs of Java files as almost the same, similar or
different using [🐈 src-d/code-annotation](https://github.com/src-d/code-annotation)
Contributor

why is there a 🐈?


{{% caption src="/post/deduplicating_pga_with_apollo/drop_opt.png" %}}
Average ratio of distinct file names per file in communities depending on
the minimum number of files, at 80% (left)and 95% (right) thresholds.
Contributor

missing space right after (left)

|-------------|-------------|--------------------------|----------------|-------------------|-----------|
| 2344 | 869,611 | 584 | 1058 | 3 | 80% |

This CC does not contain artificial vertices - buckets - because as we wrote in section "Landing"
Contributor

Make "Landing" a link to the section, I think you can point to it by using [Landing](#landing)


This is one of the 4 CCs which count together for 42% of all of the
Java files in [GoogleAds java-lib](https://github.com/googleads/googleads-java-lib).
While the other three CCs generally correspond to the single project, our is more
Contributor

s/our/ours/

(6% of all files), mixed with [eBay's developer program projects](https://github.com/eBayDeveloper/eBay_APICall_CodeSamples)
(also 6%). The orange community is devoted to [YaviJava](https://github.com/yavijava/yavijava)
(4.5%), and the others represent a wide range of projects from Facebook, Paypal, Apache, etc.
There must be something deep in common for those codebases and
Contributor

This needs to be fixed before publication

@vmarkovtsev vmarkovtsev merged commit ead4a48 into src-d:master Oct 4, 2018
@vmarkovtsev
Contributor

Published as https://blog.sourced.tech/post/deduplicating_pga_with_apollo/

@r0mainK Congratulations! Your internship is now officially complete :-)
