-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
go-license-detector #195
go-license-detector #195
Conversation
a6eb9b6
to
23e9dc8
Compare
I am running the license analysis in a |
@vmarkovtsev I am bowing my head in shame...I will make up for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice post.
I'm afraid it starts to be a bit long, though.
Maybe one of the TODO
sections could become a follow-up post?
content/post/gld.md
Outdated
curious about the licenses distribution. GitHub already detects licenses by leveraging | ||
[benbalter/licensee](https://github.com/benbalter/licensee) Ruby library and the easy solution was | ||
to query GitHub API. However, we were not satisfied with | ||
it's detection quality: many projects which actually contain the license file in a non-standard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/it's/its/
content/post/gld.md
Outdated
|
||
The goals were defined from the very beginning: | ||
|
||
1. Favor detection rate to the classification accuracy (target data mining instead of compliance). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "detection rate" mean in this context?
Is it recall, speed at prediction time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the clarification below it's a bit more clear, but I wonder whether that's "detection rate" at all.
What about "Favor false positives over false negatives"? Basically saying you'd rather get a non existing license that to miss existing ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
content/post/gld.md
Outdated
4. Comply with SPDX [licenses list](https://github.com/spdx/license-list-data) and | ||
[detection guidelines](https://spdx.org/spdx-license-list/matching-guidelines). | ||
|
||
(1) means that we should rather label a project with a bit inaccurate license than miss it's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/it's/its/ (please use grammarly or some other corrector)
content/post/gld.md
Outdated
license completely. The open source compliance departments will not be satisfied with this choice, | ||
as they need the opposite: the missed projects are manually studied. (2) restricts from using a | ||
scripting language such as Python or Ruby, and we chose Go for our implementation. (3) paves | ||
the way to many technical tricks, hacks and heuristics and enables any hardcore and complex code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is (3) related to those technical tricks and hacks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you've got a task to tune some metric as much as possible, this leads to the imbalance in other matters. Wearing a software engineer hat here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example: compilers. Adding thousands of lines of really hardcore code to optimize code a bit better and win one percent in the benchmarks (true story).
content/post/gld.md
Outdated
|[boyter/lc](https://github.com/boyter/lc)| 88% \\(\\quad(\\frac{797}{902})\\) | 548 | | ||
|
||
The total number of repositories in the dataset is 958, however, only 902 contain any pointer to | ||
the license - we looked though each of them. The rest are mainly "awesome lists" and Chinese projects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/though/through/
also, what's the problem with Chinese projects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chinese projects do not have a license. This is a purely empirical conclusion after looking through tens of them.
content/post/gld.md
Outdated
|
||
There may be also directories which are named like a license file, and we need to look inside. | ||
A few projects contained symbolic links to the actual license texts, and we need to resolve them. | ||
One project even has the license file name as the only content of `LICENSE` and we treat such files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this sentence. Do you mean it has the license file inside of a directory named LICENSE
or that a file named LICENSE
exists but the only content is the path to another license description?
Please, clarify in the text.
content/post/gld.md
Outdated
dramatically. Thus we should first render markup to HTML and then extract plain text content from | ||
HTML. go-license-detector currently supports Markdown through | ||
[russross/blackfriday](https://github.com/russross/blackfriday) and ReST through | ||
[hhatto/gorst](https://github.com/hhatto/gorst). HTMl tags are stripped with `golang.org/x/net/html` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTMl -> HTML
content/post/gld.md
Outdated
accuracy, and we leverage it. However, our goal is data mining, so we can normalize aggressively. | ||
We designed a three-level normalization pipeline. The first one is SPDX with some other rules | ||
which do not affect the detection accuracy. The second one removes punctuation and lines with | ||
copyright information. We apparently lose some data but our detection more robust to random |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"our detection is more"
content/post/gld.md
Outdated
``` | ||
{{% /scroll-panel %}} | ||
|
||
Normalized-1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indicate what kind of normalization has been done for each step.
content/post/gld.md
Outdated
thereof that is intentionally submitted to licensor for inclusion in the work | ||
by the or by an individual or legal entity authorized to submit | ||
on behalf of the for the purposes of this definition | ||
submitted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link to the comment
seems like a TODO?
@campoy Done, deployed at staging, removed WIP, ready to go. |
content/post/difftree.md
Outdated
@@ -71,7 +71,7 @@ The practical implications of this are: | |||
in lexicographic order enumerates them the same way as the `tree` command | |||
line tool: | |||
|
|||
```text | |||
```nohighlight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing this file? seems completely unrelated to the current blog post
if you want to change it, please send a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content/post/meetup-blame.md
Outdated
@@ -24,7 +24,7 @@ code or who is to blame for a bug }:-). | |||
|
|||
For example, let's blame `src/bufio/bufio.go` from the standard Go Distribution: | |||
|
|||
```text | |||
```nohighlight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/css/entry.css
Outdated
@@ -120,24 +120,35 @@ figure > figcaption { | |||
} | |||
} | |||
|
|||
.text-monospaced { | |||
.text-monospaced { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing this?
If needed send a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content/post/minhashcuda.md
Outdated
@@ -275,8 +275,6 @@ phidang/first-django-blog | |||
|
|||
The complete dataset is published on [data.world](https://data.world/vmarkovtsev/github-duplicate-repositories). | |||
|
|||
<script async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_CHTML"></script> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Please send a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id="tspan4195" | ||
x="53.268764" | ||
y="39.352489" | ||
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:9.375px;font-family:'Berthold Akzidenz Grotesk';-inkscape-font-specification:'Berthold Akzidenz Grotesk Bold';fill:#ffffff;fill-opacity:1">WTF</tspan></text> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WTF -> WTH
What the fuck vs What the heck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the dramatic effect be lost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, is that the license we're using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe
content/post/gld.md
Outdated
similarity detection which we used multiple times in the past. Although 400 items is clearly | ||
not large scale at all, it still makes sense to employ LSH because of the O(1) complexity | ||
guarantee. We saw that it works reasonably well in practice and introduces a small overhead, | ||
say 20MB of memory for the hashes and the vocabulary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/say/around/
content/post/gld.md
Outdated
[Weighted MinHash](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf) | ||
- again, battle-tested in the past, e.g. in [Apollo](https://github.com/src-d/apollo) or | ||
[bags deduplication](https://blog.sourced.tech/post/minhashcuda/). | ||
After careful tuning of false positive vs. false negative. vs. performance, we decided to set the Jaccard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After careful tuning of false positives, false negatives, and performance, we ...
fmt.Sprintf("^(|.*[-_. ])(%s)(|[-_. ].*)$", | ||
strings.Join(licenseFileNames, "|"))) | ||
) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll keep it this way in the blog post, but I'll probably try to replace it later on in the code
content/post/gld.md
Outdated
|
||
Original ([home-assistant/home-assistant](https://raw.githubusercontent.com/home-assistant/home-assistant/dev/LICENSE.md)): | ||
{{% scroll-panel height="400" %}} | ||
```markdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider using the new code
shortcode to embed the code, rather than having it right here.
Google's licenseclassifier and Ben Boyter's `lc` on the | ||
[reference 1k dataset](https://github.com/src-d/go-license-detector/blob/master/licensedb/dataset.zip): | ||
|
||
|Detector|Detection rate|Time to scan, sec| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider creating a 2D diagram of these metrics, where you compare accuracy and speed.
This would show how much better we perform while being faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once I have time I add this
@campoy @dpordomingo |
Hey @vmarkovtsev, once #202 has been merged you should be able to simply replace {{% scroll-panel height="400" %}}
{{% code "/post/gld/norm/1.txt" markdown %}}
{{% /scroll-panel %}} with {{% code src="/post/gld/norm/1.txt" lang="markdown" height="400" %}} |
@campoy Done with the scroll, deployed at staging https://blog-staging.srcd.run/post/gld/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow!!!
you did an awesome work in the post! Thanks also for pushing it into staging, to make the review easier
https://blog-staging.srcd.run/post/gld/
it LGTM... but I have only one concern (sorry because of that, but...)
content/post/gld.md
Outdated
182,000 Git repositories belonging to most popular projects on GitHub. It's index file contains the licenses | ||
detected by go-license-detector. The following pie chart summarizes the license usage in PGA: | ||
|
||
<svg id="pga-licenses"></svg> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We agreed on using {% codepen %} snippet for runnable code, to be hosted on codepen, as done for lapjv post
I think @marnovo can provide the credentials, but for testing you can try with your our account ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
David, the code is needed to draw the svg (d3). I have two options: export the svg and use it as an image or use codepen but without any borders, titles, etc. Is it possible with codepen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imho, id does not worth to use two scripts just to render a static image as we're doing now.
I see two points of view:
- (a) the script is useful and it could be used by the reader → we could share it → codepen is the better way to do it,
- (b) the solely purpose of the scripts is to render an image → then let's host it, and get rid of the scripts
If you ask me, I think that the /post/gld/chart.js
script you developed can be reused, so it could be interesting for the reader, so I'd vote for (a).
@vmarkovtsev do you want my review here or is it sufficient already? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed my red cross in this PR, and there is no reason for that strong opinion.
So please, just consider my last comment #195 (comment) and proceed as you feel better (because you already did a 👍 great job with the content itself; as usual 🥂 )
@eiso This is going to be a third round - Francesc has made two passes already :) Seems that the most of the problems have been addressed by now. thanks @dpordomingo for your kind words! I will export the svg but retain the script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great post!
content/post/gld.md
Outdated
|
||
1. Favor false positives over false negatives (target data mining instead of compliance). | ||
2. Perform fast. | ||
3. Detect as many licenses as possible on the [hand-collected dataset of 1,000 top-starred repositories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make it more clear that this 1k repos dataset was manually checked and labelled.
content/post/gld.md
Outdated
After careful tuning of false positives, false negatives, and performance, we decided to set the Jaccard | ||
similarity threshold for our algorithm to 75%, and the hash length to 154 samples. | ||
Since we discard the text structure by treating sequences as sets, we further calculate the Levenshtein | ||
distance to the matched database records in order to determine the precise confidence value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a sentence that explains that you're only calculating the Levenshtein distance on matches from weighted minhash.
content/post/gld.md
Outdated
distance to the matched database records in order to determine the precise confidence value. | ||
|
||
We look at the `README` file if the analyzed project does not contain a license file. This happens | ||
in more than 7% of the cases in the 1k dataset and 66% in Public Git Archive (182,000 repositories). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public Git Archive should link to its page.
content/post/gld.md
Outdated
Since we had to manually look through hundreds of most-starred projects on GitHub, we noticed | ||
a few funny trends. Many Chinese repositories isolated from the other communities, | ||
awesome list expansion and others. Again, I should devote a separate post to those, | ||
they are funny and also help to understand the picture of open source popularity better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the offtopic section out since you don't really explain what is going on. Better to be a separate post for another day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased and left the section because I doubt I will find time in the nearest future.
content/post/gld.md
Outdated
[go-license-detector](https://github.com/src-d/go-license-detector) is a powerful | ||
tool to detect the license of an open source project. It finds considerably | ||
more matches than the others including the one used by GitHub. Detecting licenses | ||
is much fun because of the many details and corner cases. Thanks to go-license-detector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/much fun/a lot of fun/
No worries @vmarkovtsev, better to have things done well than fast, but yes, you have my LGTM to merge. |
TODO:
Deployed as