go-license-detector #195

vmarkovtsev · 2018-03-31T19:52:18Z

TODO:

Add logo
Add implementation flesh
Add funny observations skin
Add PGA stats (friendly ping @eiso I believe in your super system)

Deployed as

vmarkovtsev · 2018-04-06T16:58:41Z

I am running the license analysis in a ~~shitty~~ shell script, will finish by today. @eiso no need to run it with go-engine.

eiso · 2018-04-06T18:49:57Z

@vmarkovtsev I am bowing my head in shame...I will make up for this.

campoy

Nice post.
I'm afraid it starts to be a bit long, though.
Maybe one of the TODO sections could become a follow-up post?

campoy · 2018-04-09T19:46:48Z

content/post/gld.md

+curious about the licenses distribution. GitHub already detects licenses by leveraging
+[benbalter/licensee](https://github.com/benbalter/licensee) Ruby library and the easy solution was
+to query GitHub API. However, we were not satisfied with
+it's detection quality: many projects which actually contain the license file in a non-standard


s/it's/its/

campoy · 2018-04-09T19:48:24Z

content/post/gld.md

+
+The goals were defined from the very beginning:
+
+1. Favor detection rate to the classification accuracy (target data mining instead of compliance).


What does "detection rate" mean in this context?
Is it recall, speed at prediction time?

After reading the clarification below it's a bit more clear, but I wonder whether that's "detection rate" at all.

What about "Favor false positives over false negatives"? Basically saying you'd rather report a nonexistent license than miss an existing one?

campoy · 2018-04-09T19:49:22Z

content/post/gld.md

+4. Comply with SPDX [licenses list](https://github.com/spdx/license-list-data) and
+[detection guidelines](https://spdx.org/spdx-license-list/matching-guidelines).
+
+(1) means that we should rather label a project with a bit inaccurate license than miss it's


s/it's/its/ (please use grammarly or some other corrector)

campoy · 2018-04-09T19:52:37Z

content/post/gld.md

+license completely. The open source compliance departments will not be satisfied with this choice,
+as they need the opposite: the missed projects are manually studied. (2) restricts from using a
+scripting language such as Python or Ruby, and we chose Go for our implementation. (3) paves
+the way to many technical tricks, hacks and heuristics and enables any hardcore and complex code.


how is (3) related to those technical tricks and hacks?

When your task is to tune some metric as much as possible, other concerns get out of balance. Wearing a software engineer hat here.

Example: compilers. Adding thousands of lines of really hardcore code to optimize code a bit better and win one percent in the benchmarks (true story).

campoy · 2018-04-09T21:00:04Z

content/post/gld.md

+|[boyter/lc](https://github.com/boyter/lc)| 88% \\(\\quad(\\frac{797}{902})\\) | 548 |
+
+The total number of repositories in the dataset is 958, however, only 902 contain any pointer to
+the license - we looked though each of them. The rest are mainly "awesome lists" and Chinese projects


s/though/through/

also, what's the problem with Chinese projects?

Chinese projects do not have a license. This is a purely empirical conclusion after looking through tens of them.

campoy · 2018-04-09T21:17:33Z

content/post/gld.md

+
+There may be also directories which are named like a license file, and we need to look inside.
+A few projects contained symbolic links to the actual license texts, and we need to resolve them.
+One project even has the license file name as the only content of `LICENSE` and we treat such files


I don't understand this sentence. Do you mean it has the license file inside of a directory named LICENSE
or that a file named LICENSE exists but the only content is the path to another license description?

Please, clarify in the text.

campoy · 2018-04-09T21:19:02Z

content/post/gld.md

+dramatically. Thus we should first render markup to HTML and then extract plain text content from
+HTML. go-license-detector currently supports Markdown through
+[russross/blackfriday](https://github.com/russross/blackfriday) and ReST through
+[hhatto/gorst](https://github.com/hhatto/gorst). HTMl tags are stripped with `golang.org/x/net/html`


HTMl -> HTML

campoy · 2018-04-09T21:19:53Z

content/post/gld.md

+accuracy, and we leverage it. However, our goal is data mining, so we can normalize aggressively.
+We designed a three-level normalization pipeline. The first one is SPDX with some other rules
+which do not affect the detection accuracy. The second one removes punctuation and lines with
+copyright information. We apparently lose some data but our detection more robust to random


"our detection is more"
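
For readers of this thread, here is a minimal sketch of what the second normalization level described in the quoted paragraph might look like. It only illustrates "remove punctuation and lines with copyright information". The patterns `copyrightLineRe` and `punctuationRe` and the function `normalizeSecondLevel` are hypothetical names invented for this example and are not taken from go-license-detector.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// copyrightLineRe is a hypothetical pattern for lines that carry copyright information.
var copyrightLineRe = regexp.MustCompile(`(?i)^\s*(copyright|\(c\)|©)`)

// punctuationRe matches runs of characters that are not letters, digits or whitespace.
var punctuationRe = regexp.MustCompile(`[^\p{L}\p{N}\s]+`)

// normalizeSecondLevel drops copyright lines, replaces punctuation with spaces,
// collapses whitespace and skips lines that end up empty.
func normalizeSecondLevel(text string) string {
	var kept []string
	for _, line := range strings.Split(text, "\n") {
		if copyrightLineRe.MatchString(line) {
			continue
		}
		line = punctuationRe.ReplaceAllString(line, " ")
		line = strings.Join(strings.Fields(line), " ")
		if line != "" {
			kept = append(kept, line)
		}
	}
	return strings.Join(kept, "\n")
}

func main() {
	text := "MIT License\n\nCopyright (c) 2018 Jane Doe\n\nPermission is hereby granted, free of charge, to any person."
	fmt.Println(normalizeSecondLevel(text))
}
```

Dropping the copyright line makes the text independent of the holder's name and year, which is the robustness the paragraph refers to; the cost is that this information is lost.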

campoy · 2018-04-09T21:21:56Z

content/post/gld.md

+```
+{{% /scroll-panel %}}
+
+Normalized-1:


Indicate what kind of normalization has been done for each step.

campoy · 2018-04-09T21:22:51Z

content/post/gld.md

+thereof that is intentionally submitted to licensor for inclusion in the work
+by the or by an individual or legal entity authorized to submit
+on behalf of the for the purposes of this definition
+submitted 


link to the comment seems like a TODO?

vmarkovtsev · 2018-04-11T14:14:31Z

@campoy Done, deployed at staging, removed WIP, ready to go.

campoy · 2018-04-11T20:28:54Z

content/post/difftree.md

@@ -71,7 +71,7 @@ The practical implications of this are:
    in lexicographic order enumerates them the same way as the `tree` command
    line tool:

-    ```text
+    ```nohighlight


why are you changing this file? seems completely unrelated to the current blog post
if you want to change it, please send a different PR

campoy · 2018-04-11T20:29:05Z

content/post/meetup-blame.md

@@ -24,7 +24,7 @@ code or who is to blame for a bug }:-).

 For example, let's blame `src/bufio/bufio.go` from the standard Go Distribution:

-```text
+```nohighlight


same as above

campoy · 2018-04-11T20:29:49Z

src/css/entry.css

@@ -120,24 +120,35 @@ figure > figcaption {
    }
 }

-.text-monospaced  {
+.text-monospaced {


why are you changing this?
If needed send a different PR

campoy · 2018-04-11T20:31:57Z

content/post/minhashcuda.md

@@ -275,8 +275,6 @@ phidang/first-django-blog

 The complete dataset is published on [data.world](https://data.world/vmarkovtsev/github-duplicate-repositories).

-<script async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_CHTML"></script>


Why? Please send a different PR

campoy · 2018-04-11T20:33:23Z

static/post/gld/wtfpl.svg

+       id="tspan4195"
+       x="53.268764"
+       y="39.352489"
+       style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:9.375px;font-family:'Berthold Akzidenz Grotesk';-inkscape-font-specification:'Berthold Akzidenz Grotesk Bold';fill:#ffffff;fill-opacity:1">WTF</tspan></text>


WTF -> WTH
What the fuck vs What the heck

http://www.wtfpl.net/

Will the dramatic effect be lost?

wait, is that the license we're using?

campoy · 2018-04-11T21:42:00Z

content/post/gld.md

+similarity detection which we used multiple times in the past. Although 400 items is clearly
+not large scale at all, it still makes sense to employ LSH because of the O(1) complexity
+guarantee. We saw that it works reasonably well in practice and introduces a small overhead,
+say 20MB of memory for the hashes and the vocabulary.


s/say/around/

campoy · 2018-04-11T21:42:43Z

content/post/gld.md

+[Weighted MinHash](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf)
+- again, battle-tested in the past, e.g. in [Apollo](https://github.com/src-d/apollo) or
+[bags deduplication](https://blog.sourced.tech/post/minhashcuda/).
+After careful tuning of false positive vs. false negative. vs. performance, we decided to set the Jaccard


After careful tuning of false positives, false negatives, and performance, we ...

campoy · 2018-04-11T21:43:55Z

content/post/gld.md

+		fmt.Sprintf("^(|.*[-_. ])(%s)(|[-_. ].*)$",
+			strings.Join(licenseFileNames, "|")))
+)
+```


we'll keep it this way in the blog post, but I'll probably try to replace it later on in the code
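
To make the behavior of the quoted pattern easier to review, here is a small self-contained sketch. The format string is copied from the diff above; the contents of `licenseFileNames`, the variable name `licenseFileRe`, the helper `isLicenseFileName` and the lower-casing step are assumptions made for this example, since the thread does not show them.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// licenseFileNames is a hypothetical, shortened list of name patterns;
// the list used in the post is not shown in this thread.
var licenseFileNames = []string{"li[cs]en[cs]e(s?)", "copying"}

// licenseFileRe accepts a candidate name either on its own or separated
// from the rest of the file name by '-', '_', '.' or a space.
var licenseFileRe = regexp.MustCompile(
	fmt.Sprintf("^(|.*[-_. ])(%s)(|[-_. ].*)$",
		strings.Join(licenseFileNames, "|")))

// isLicenseFileName lower-cases the name before matching
// (an assumption of this sketch, not confirmed by the thread).
func isLicenseFileName(name string) bool {
	return licenseFileRe.MatchString(strings.ToLower(name))
}

func main() {
	names := []string{"LICENSE", "LICENSE.md", "MIT-LICENSE.txt", "COPYING", "relicensed.go", "main.go"}
	for _, name := range names {
		fmt.Printf("%-16s %v\n", name, isLicenseFileName(name))
	}
}
```

The empty alternatives `(|...)` let the name stand alone, while the delimiter classes prevent matches inside longer words such as `relicensed.go`.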

campoy · 2018-04-11T21:45:52Z

content/post/gld.md

+
+Original ([home-assistant/home-assistant](https://raw.githubusercontent.com/home-assistant/home-assistant/dev/LICENSE.md)):
+{{% scroll-panel height="400" %}}
+```markdown


I'd consider using the new code shortcode to embed the code, rather than having it right here.

campoy · 2018-04-11T21:49:20Z

content/post/gld.md

+Google's licenseclassifier and Ben Boyter's `lc` on the
+[reference 1k dataset](https://github.com/src-d/go-license-detector/blob/master/licensedb/dataset.zip):
+
+|Detector|Detection rate|Time to scan, sec|


I would consider creating a 2D diagram of these metrics, where you compare accuracy and speed.
This would show how much better we perform while being faster.

Once I have time, I will add this.

vmarkovtsev · 2018-04-12T06:48:35Z

@campoy @dpordomingo {{% code %}} is rendered incorrectly inside {{% scroll %}} please fix.

https://blog-staging.srcd.run/post/gld/

campoy · 2018-04-12T18:22:29Z

Hey @vmarkovtsev, once #202 has been merged you should be able to simply replace

{{% scroll-panel height="400" %}}
{{% code "/post/gld/norm/1.txt" markdown %}}
{{% /scroll-panel %}}

with

{{% code src="/post/gld/norm/1.txt" lang="markdown" height="400" %}}

vmarkovtsev · 2018-04-13T08:46:14Z

@campoy Done with the scroll, deployed at staging https://blog-staging.srcd.run/post/gld/

dpordomingo

Wow!!!
You did awesome work on the post! Thanks also for pushing it to staging to make the review easier:
https://blog-staging.srcd.run/post/gld/

It LGTM... but I have one concern (sorry about that, but...)

dpordomingo · 2018-04-13T16:53:40Z

content/post/gld.md

+182,000 Git repositories belonging to most popular projects on GitHub. It's index file contains the licenses
+detected by go-license-detector. The following pie chart summarizes the license usage in PGA:
+
+<svg id="pga-licenses"></svg>


We agreed on using the {% codepen %} snippet for runnable code, hosted on codepen, as was done for the lapjv post.
I think @marnovo can provide the credentials, but for testing you can try with your own account ;)

David, the code is needed to draw the svg (d3). I have two options: export the svg and use it as an image or use codepen but without any borders, titles, etc. Is it possible with codepen?

IMHO, it is not worth using two scripts just to render a static image, as we're doing now.
I see two points of view:

(a) the script is useful and could be used by the reader → we could share it → codepen is the better way to do it,

(b) the sole purpose of the scripts is to render an image → then let's host the image and get rid of the scripts

If you ask me, I think that the /post/gld/chart.js script you developed can be reused, so it could be interesting for the reader, so I'd vote for (a).

eiso · 2018-04-25T11:53:31Z

@vmarkovtsev do you want my review here or is it sufficient already?

dpordomingo

I reconsidered my red cross on this PR, and there is no reason for such a strong objection.
So please, just consider my last comment #195 (comment) and proceed as you see fit (because you already did a 👍 great job with the content itself; as usual 🥂 )

vmarkovtsev · 2018-04-25T15:54:12Z

@eiso This is going to be a third round - Francesc has made two passes already :) It seems that most of the problems have been addressed by now.

thanks @dpordomingo for your kind words! I will export the svg but retain the script.

eiso

Great post!

eiso · 2018-04-26T09:42:56Z

content/post/gld.md

+
+1. Favor false positives over false negatives (target data mining instead of compliance).
+2. Perform fast.
+3. Detect as many licenses as possible on the [hand-collected dataset of 1,000 top-starred repositories


I would make it more clear that this 1k repos dataset was manually checked and labelled.

eiso · 2018-04-26T09:45:45Z

content/post/gld.md

+After careful tuning of false positives, false negatives, and performance, we decided to set the Jaccard
+similarity threshold for our algorithm to 75%, and the hash length to 154 samples.
+Since we discard the text structure by treating sequences as sets, we further calculate the Levenshtein
+distance to the matched database records in order to determine the precise confidence value.


I would add a sentence that explains that you're only calculating the Levenshtein distance on matches from weighted minhash.
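
The two-stage flow under discussion can be sketched as follows. This is a simplified illustration under stated assumptions, not the detector's implementation: exact Jaccard similarity over token sets stands in for Weighted MinHash, the Levenshtein distance is computed over tokens rather than characters, and the confidence formula `1 - distance/longest` is invented for the example. Only the 75% threshold comes from the quoted text. The key point is that the Levenshtein step runs only for candidates that pass the set-similarity filter.

```go
package main

import (
	"fmt"
	"strings"
)

// jaccardThreshold is the 75% similarity threshold quoted in the post.
const jaccardThreshold = 0.75

// tokenSet returns the set of whitespace-separated tokens in s.
func tokenSet(s string) map[string]bool {
	set := map[string]bool{}
	for _, t := range strings.Fields(s) {
		set[t] = true
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| for the token sets of a and b.
func jaccard(a, b string) float64 {
	setA, setB := tokenSet(a), tokenSet(b)
	inter := 0
	for t := range setA {
		if setB[t] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// levenshtein returns the edit distance between two token sequences.
func levenshtein(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, free when the tokens are equal
			if a[i-1] != b[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := cur[j-1] + 1; d < best { // insertion
				best = d
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

// confidence runs the order-sensitive Levenshtein step only when the
// order-insensitive Jaccard filter has accepted the candidate.
func confidence(text, reference string) (float64, bool) {
	if jaccard(text, reference) < jaccardThreshold {
		return 0, false
	}
	a, b := strings.Fields(text), strings.Fields(reference)
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	return 1 - float64(levenshtein(a, b))/float64(longest), true
}

func main() {
	reference := "permission is hereby granted free of charge to any person"
	candidate := "permission is hereby granted free of charge to every person"
	if conf, ok := confidence(candidate, reference); ok {
		fmt.Printf("matched, confidence %.2f\n", conf)
	} else {
		fmt.Println("below the Jaccard threshold, Levenshtein step skipped")
	}
}
```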

eiso · 2018-04-26T09:46:27Z

content/post/gld.md

+distance to the matched database records in order to determine the precise confidence value.
+
+We look at the `README` file if the analyzed project does not contain a license file. This happens
+in more than 7% of the cases in the 1k dataset and 66% in Public Git Archive (182,000 repositories).


Public Git Archive should link to its page.

eiso · 2018-04-26T09:50:13Z

content/post/gld.md

+Since we had to manually look through hundreds of most-starred projects on GitHub, we noticed
+a few funny trends. Many Chinese repositories isolated from the other communities,
+awesome list expansion and others. Again, I should devote a separate post to those,
+they are funny and also help to understand the picture of open source popularity better.


I would keep the offtopic section out since you don't really explain what is going on. Better to be a separate post for another day.

Rephrased and left the section because I doubt I will find time in the near future.

eiso · 2018-04-26T09:51:11Z

content/post/gld.md

+[go-license-detector](https://github.com/src-d/go-license-detector) is a powerful
+tool to detect the license of an open source project. It finds considerably
+more matches than the others including the one used by GitHub. Detecting licenses
+is much fun because of the many details and corner cases. Thanks to go-license-detector


s/much fun/a lot of fun/

vmarkovtsev · 2018-04-26T12:54:49Z

@campoy @eiso I fixed all the review comments.

Regarding the 2D plot, I have no time for this unfortunately. Please merge and don't torture me :) The rest can be easily fixed by subsequent edits - this is GitHub.

eiso · 2018-04-26T13:32:32Z

No worries @vmarkovtsev, better to have things done well than fast, but yes, you have my LGTM to merge.

vmarkovtsev force-pushed the gld branch 6 times, most recently from a6eb9b6 to 23e9dc8, April 1, 2018 19:25

vmarkovtsev requested a review from campoy April 6, 2018 16:59

campoy reviewed Apr 9, 2018

vmarkovtsev changed the title ~~[WIP] go-license-detector~~ go-license-detector Apr 11, 2018

vmarkovtsev force-pushed the gld branch from 9fe27ec to e63c889, April 11, 2018 14:13

vmarkovtsev force-pushed the gld branch from 9295936 to b5459cb, April 11, 2018 14:27

campoy suggested changes Apr 11, 2018

campoy mentioned this pull request Apr 12, 2018

Make {{% code %}} work inside of {{% scroll %}} #201

Closed

vmarkovtsev and others added 9 commits April 13, 2018 10:37

Add the post about go-license-detector

8a1b09d

Add the comparison with boyter/lc

11cfcaf

Apply the new optimized timing result

ca05705

Add PGA stats

d8ca2ac

Fix the review comments

75b6332

Add the conclusion

1f7be28

Bump the date

f3ef473

Fix the second round of reviews

b89dc8a

Remove the outer scroll for text examples

9df3558

vmarkovtsev force-pushed the gld branch from 79d0705 to 9df3558, April 13, 2018 08:38

dpordomingo suggested changes Apr 13, 2018

dpordomingo approved these changes Apr 25, 2018

eiso requested changes Apr 26, 2018

vmarkovtsev added 2 commits April 26, 2018 14:50

Add the static SVG image of PGA licenses

5453d20

Fix Eiso's review

e58cdef

Update the gld time

81f4d8d

campoy merged commit 91249e5 into src-d:master May 1, 2018

vmarkovtsev mentioned this pull request Jun 12, 2018

Add co-occurrence matrix values check to id2vec_preproc src-d/ml#262

Merged

