Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go-license-detector #195

Merged
merged 12 commits into from
May 1, 2018
Merged

go-license-detector #195

merged 12 commits into from
May 1, 2018

Conversation

vmarkovtsev
Copy link
Contributor

@vmarkovtsev vmarkovtsev commented Mar 31, 2018

TODO:

  • Add logo
  • Add implementation flesh
  • Add funny observations skin
  • Add PGA stats (friendly ping @eiso I believe in your super system)

Deployed as

@vmarkovtsev
Copy link
Contributor Author

I am running the license analysis in a shitty shell script, will finish by today. @eiso no need to run it with go-engine.

@vmarkovtsev vmarkovtsev requested a review from campoy April 6, 2018 16:59
@eiso
Copy link
Member

eiso commented Apr 6, 2018

@vmarkovtsev I am bowing my head in shame...I will make up for this.

Copy link
Contributor

@campoy campoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice post.
I'm afraid it starts to be a bit long, though.
Maybe one of the TODO sections could become a follow-up post?

curious about the licenses distribution. GitHub already detects licenses by leveraging
[benbalter/licensee](https://github.com/benbalter/licensee) Ruby library and the easy solution was
to query GitHub API. However, we were not satisfied with
it's detection quality: many projects which actually contain the license file in a non-standard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its/


The goals were defined from the very beginning:

1. Favor detection rate to the classification accuracy (target data mining instead of compliance).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "detection rate" mean in this context?
Is it recall, speed at prediction time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the clarification below it's a bit more clear, but I wonder whether that's "detection rate" at all.

What about "Favor false positives over false negatives"? Basically saying you'd rather get a non existing license that to miss existing ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

4. Comply with SPDX [licenses list](https://github.com/spdx/license-list-data) and
[detection guidelines](https://spdx.org/spdx-license-list/matching-guidelines).

(1) means that we should rather label a project with a bit inaccurate license than miss it's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its/ (please use grammarly or some other corrector)

license completely. The open source compliance departments will not be satisfied with this choice,
as they need the opposite: the missed projects are manually studied. (2) restricts from using a
scripting language such as Python or Ruby, and we chose Go for our implementation. (3) paves
the way to many technical tricks, hacks and heuristics and enables any hardcore and complex code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is (3) related to those technical tricks and hacks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you've got a task to tune some metric as much as possible, this leads to the imbalance in other matters. Wearing a software engineer hat here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: compilers. Adding thousands of lines of really hardcore code to optimize code a bit better and win one percent in the benchmarks (true story).

|[boyter/lc](https://github.com/boyter/lc)| 88% \\(\\quad(\\frac{797}{902})\\) | 548 |

The total number of repositories in the dataset is 958, however, only 902 contain any pointer to
the license - we looked though each of them. The rest are mainly "awesome lists" and Chinese projects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/though/through/

also, what's the problem with Chinese projects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chinese projects do not have a license. This is a purely empirical conclusion after looking through tens of them.


There may be also directories which are named like a license file, and we need to look inside.
A few projects contained symbolic links to the actual license texts, and we need to resolve them.
One project even has the license file name as the only content of `LICENSE` and we treat such files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this sentence. Do you mean it has the license file inside of a directory named LICENSE
or that a file named LICENSE exists but the only content is the path to another license description?

Please, clarify in the text.

dramatically. Thus we should first render markup to HTML and then extract plain text content from
HTML. go-license-detector currently supports Markdown through
[russross/blackfriday](https://github.com/russross/blackfriday) and ReST through
[hhatto/gorst](https://github.com/hhatto/gorst). HTMl tags are stripped with `golang.org/x/net/html`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTMl -> HTML

accuracy, and we leverage it. However, our goal is data mining, so we can normalize aggressively.
We designed a three-level normalization pipeline. The first one is SPDX with some other rules
which do not affect the detection accuracy. The second one removes punctuation and lines with
copyright information. We apparently lose some data but our detection more robust to random
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"our detection is more"

```
{{% /scroll-panel %}}

Normalized-1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indicate what kind of normalization has been done for each step.

thereof that is intentionally submitted to licensor for inclusion in the work
by the or by an individual or legal entity authorized to submit
on behalf of the for the purposes of this definition
submitted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to the comment seems like a TODO?

@vmarkovtsev vmarkovtsev changed the title [WIP] go-license-detector go-license-detector Apr 11, 2018
@vmarkovtsev
Copy link
Contributor Author

@campoy Done, deployed at staging, removed WIP, ready to go.

@@ -71,7 +71,7 @@ The practical implications of this are:
in lexicographic order enumerates them the same way as the `tree` command
line tool:

```text
```nohighlight
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing this file? seems completely unrelated to the current blog post
if you want to change it, please send a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -24,7 +24,7 @@ code or who is to blame for a bug }:-).

For example, let's blame `src/bufio/bufio.go` from the standard Go Distribution:

```text
```nohighlight
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -120,24 +120,35 @@ figure > figcaption {
}
}

.text-monospaced {
.text-monospaced {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing this?
If needed send a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -275,8 +275,6 @@ phidang/first-django-blog

The complete dataset is published on [data.world](https://data.world/vmarkovtsev/github-duplicate-repositories).

<script async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_CHTML"></script>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Please send a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id="tspan4195"
x="53.268764"
y="39.352489"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:9.375px;font-family:'Berthold Akzidenz Grotesk';-inkscape-font-specification:'Berthold Akzidenz Grotesk Bold';fill:#ffffff;fill-opacity:1">WTF</tspan></text>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WTF -> WTH
What the fuck vs What the heck

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://www.wtfpl.net/

Will the dramatic effect be lost?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, is that the license we're using?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe

similarity detection which we used multiple times in the past. Although 400 items is clearly
not large scale at all, it still makes sense to employ LSH because of the O(1) complexity
guarantee. We saw that it works reasonably well in practice and introduces a small overhead,
say 20MB of memory for the hashes and the vocabulary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/say/around/

[Weighted MinHash](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf)
- again, battle-tested in the past, e.g. in [Apollo](https://github.com/src-d/apollo) or
[bags deduplication](https://blog.sourced.tech/post/minhashcuda/).
After careful tuning of false positive vs. false negative. vs. performance, we decided to set the Jaccard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After careful tuning of false positives, false negatives, and performance, we ...

fmt.Sprintf("^(|.*[-_. ])(%s)(|[-_. ].*)$",
strings.Join(licenseFileNames, "|")))
)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll keep it this way in the blog post, but I'll probably try to replace it later on in the code


Original ([home-assistant/home-assistant](https://raw.githubusercontent.com/home-assistant/home-assistant/dev/LICENSE.md)):
{{% scroll-panel height="400" %}}
```markdown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider using the new code shortcode to embed the code, rather than having it right here.

Google's licenseclassifier and Ben Boyter's `lc` on the
[reference 1k dataset](https://github.com/src-d/go-license-detector/blob/master/licensedb/dataset.zip):

|Detector|Detection rate|Time to scan, sec|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider creating a 2D diagram of these metrics, where you compare accuracy and speed.
This would show how much better we perform while being faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once I have time I add this

@vmarkovtsev
Copy link
Contributor Author

@campoy @dpordomingo {{% code %}} is rendered incorrectly inside {{% scroll %}} please fix.

https://blog-staging.srcd.run/post/gld/

@campoy
Copy link
Contributor

campoy commented Apr 12, 2018

Hey @vmarkovtsev, once #202 has been merged you should be able to simply replace

{{% scroll-panel height="400" %}}
{{% code "/post/gld/norm/1.txt" markdown %}}
{{% /scroll-panel %}}

with

{{% code src="/post/gld/norm/1.txt" lang="markdown" height="400" %}}

@vmarkovtsev
Copy link
Contributor Author

@campoy Done with the scroll, deployed at staging https://blog-staging.srcd.run/post/gld/

Copy link
Contributor

@dpordomingo dpordomingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow!!!
you did an awesome work in the post! Thanks also for pushing it into staging, to make the review easier
https://blog-staging.srcd.run/post/gld/

it LGTM... but I have only one concern (sorry because of that, but...)

182,000 Git repositories belonging to most popular projects on GitHub. It's index file contains the licenses
detected by go-license-detector. The following pie chart summarizes the license usage in PGA:

<svg id="pga-licenses"></svg>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed on using {% codepen %} snippet for runnable code, to be hosted on codepen, as done for lapjv post
I think @marnovo can provide the credentials, but for testing you can try with your our account ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

David, the code is needed to draw the svg (d3). I have two options: export the svg and use it as an image or use codepen but without any borders, titles, etc. Is it possible with codepen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho, id does not worth to use two scripts just to render a static image as we're doing now.
I see two points of view:

  • (a) the script is useful and it could be used by the reader → we could share it → codepen is the better way to do it,
  • (b) the solely purpose of the scripts is to render an image → then let's host it, and get rid of the scripts

If you ask me, I think that the /post/gld/chart.js script you developed can be reused, so it could be interesting for the reader, so I'd vote for (a).

@eiso
Copy link
Member

eiso commented Apr 25, 2018

@vmarkovtsev do you want my review here or is it sufficient already?

Copy link
Contributor

@dpordomingo dpordomingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed my red cross in this PR, and there is no reason for that strong opinion.
So please, just consider my last comment #195 (comment) and proceed as you feel better (because you already did a 👍 great job with the content itself; as usual 🥂 )

@vmarkovtsev
Copy link
Contributor Author

vmarkovtsev commented Apr 25, 2018

@eiso This is going to be a third round - Francesc has made two passes already :) Seems that the most of the problems have been addressed by now.

thanks @dpordomingo for your kind words! I will export the svg but retain the script.

Copy link
Member

@eiso eiso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great post!


1. Favor false positives over false negatives (target data mining instead of compliance).
2. Perform fast.
3. Detect as many licenses as possible on the [hand-collected dataset of 1,000 top-starred repositories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it more clear that this 1k repos dataset was manually checked and labelled.

After careful tuning of false positives, false negatives, and performance, we decided to set the Jaccard
similarity threshold for our algorithm to 75%, and the hash length to 154 samples.
Since we discard the text structure by treating sequences as sets, we further calculate the Levenshtein
distance to the matched database records in order to determine the precise confidence value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a sentence that explains that you're only calculating the Levenshtein distance on matches from weighted minhash.

distance to the matched database records in order to determine the precise confidence value.

We look at the `README` file if the analyzed project does not contain a license file. This happens
in more than 7% of the cases in the 1k dataset and 66% in Public Git Archive (182,000 repositories).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public Git Archive should link to its page.

Since we had to manually look through hundreds of most-starred projects on GitHub, we noticed
a few funny trends. Many Chinese repositories isolated from the other communities,
awesome list expansion and others. Again, I should devote a separate post to those,
they are funny and also help to understand the picture of open source popularity better.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the offtopic section out since you don't really explain what is going on. Better to be a separate post for another day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased and left the section because I doubt I will find time in the nearest future.

[go-license-detector](https://github.com/src-d/go-license-detector) is a powerful
tool to detect the license of an open source project. It finds considerably
more matches than the others including the one used by GitHub. Detecting licenses
is much fun because of the many details and corner cases. Thanks to go-license-detector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/much fun/a lot of fun/

@vmarkovtsev
Copy link
Contributor Author

@campoy @eiso I fixed all the review comments.

Regarding the 2D plot, I have no time for this unfortunately. Please merge and don't torture me :) The rest can be easily fixed by subsequent edits - this is GitHub.

@eiso
Copy link
Member

eiso commented Apr 26, 2018

No worries @vmarkovtsev, better to have things done well than fast, but yes, you have my LGTM to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants