[REVIEW] S3 scraper fix #843

Merged
merged 8 commits into master from s3-scraper-fix on Apr 30, 2013

2 participants

@cstamas
Sonatype member

The S3 scraper is broken when the bucket being listed returns a "paged" response (which happens when the repository has over 1000 unique files in it). The paging logic was building broken URLs, so only the 1st page was ever fetched and used to build the prefixes file. More about it in the linked issue.

Related issue:
https://issues.sonatype.org/browse/NEXUS-5698

CI
https://bamboo.zion.sonatype.com/browse/NXO-OSSF4-1
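
For reference, here is a minimal sketch (not the actual Nexus scraper code) of what building the follow-up page request involves: S3 signals a truncated listing via IsTruncated, and the next request has to repeat the bucket URL with a marker query parameter. The class and method names below are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only (not the Nexus scraper code): given one S3
// ListBucketResult "page", decide whether another page exists and build
// the URL that fetches it. S3 caps each response at 1000 keys and signals
// continuation via <IsTruncated>true</IsTruncated>; the follow-up request
// repeats the bucket URL with ?marker=<last key of this page>.
public class S3PagingSketch
{
    static String nextPageUrl( String bucketRootUrl, String pageXml )
        throws Exception
    {
        Document page = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse( new InputSource( new StringReader( pageXml ) ) );

        NodeList truncated = page.getElementsByTagName( "IsTruncated" );
        if ( truncated.getLength() == 0
            || !Boolean.parseBoolean( truncated.item( 0 ).getTextContent() ) )
        {
            return null; // last page, nothing more to fetch
        }
        // without a delimiter S3 omits <NextMarker>, so the last <Key> of
        // this page becomes the marker of the next request (in real code
        // the marker value should also be URL-encoded)
        NodeList keys = page.getElementsByTagName( "Key" );
        String lastKey = keys.item( keys.getLength() - 1 ).getTextContent();
        return bucketRootUrl + "?marker=" + lastKey;
    }
}
```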

@ifedorenko

-1

I need a small standalone example project and exact steps to reproduce the problem without this pull applied.

For example, I see no change in the proxy repository prefix file with or without this pull for the http://repository.springsource.com/maven/bundles/external/ repository mentioned in NEXUS-5696. In both cases the prefix file looks like

# Prefix file generated by Sonatype Nexus
# Do not edit, changes will be overwritten!
/com/jetbrains
/com/informix
/com/mchange
/com/jamonapi
/com/espertech
/com/google
/com/lowagie
/com/mockobjects
/com/caucho
/com/jamesmurty
/com/jcraft
/com/db4o
/com/atlassian
/com/adobe
/com/fasterxml
/com/experlog
/com/h2database
/com/ibm
/com/beust
/com/googlecode
/ch/qos
/com/dumbster
/com/bea 
@cstamas
Sonatype member

@ifedorenko I think something went wrong with your testing, as in my case (virgin boot) the list is 152 (+2 comment) lines long.

My guess is that you reused a working directory from an "old" (broken) Nexus (with a prefixes.txt already scraped by the broken S3 scraper), and hence at boot the WL was "detected", but its update would only happen at the next tick (24h). To really verify this, either delete the prefixes.txt file and reboot, or perform a forced update by clicking the "Update now" button.

Here is the prefix file you should have after figuring out what went wrong:

# Prefix file generated by Sonatype Nexus
# Do not edit, changes will be overwritten!
/org/openid4java
/net/java
/org/jasig
/javax/el
/org/cyberneko
/org/openxri
/javax/media
/org/jruby
/org/acegisecurity
/org/testng
/javax/jcr
/com/ibm
/org/yaml
/org/fusesource
/robots.txt
/org/junit
/org/antlr
/org/htmlparser
/com/google
/edu/yale
/org/textmining
/javax/jdo
/org/doomdark
/org/slf4j
/com/caucho
/org/nanocontainer
/org/eclipse
/org/objenesis
/net/homeip
/javax/transaction
/org/relaxng
/com/fasterxml
/org/jmock
/com/dumbster
/com/thoughtworks
/com/oracle
/com/jamonapi
/nu/xom
/org/ajax4jsf
/javax/activation
/xpp3/com.springsource.xpp3
/org/aopalliance
/com/rabbitmq
/javax/webbeans
/org/gnu
/org/richfaces
/javax/jms
/javax/jws
/org/samba
/de/berlios
/org/apache
/javax/ejb
/org/powermock
/org/jvnet
/com/jetbrains
/org/springframework
/org/drools
/javax/annotation
/javax/enterprise
/com/mockobjects
/org/hsqldb
/org/coconut
/org/joda
/edu/emory
/org/aspectj
/org/logicalcobwebs
/com/adobe
/com/opensymphony
/org/jolokia
/org/ognl
/com/beust
/ch/qos
/edu/umd
/com/informix
/com/mchange
/org/xmlpull
/net/noderunner
/org/picocontainer
/org/freemarker
/javax/xml
/javax/inject
/javax/mail
/javax/rules
/org/mockito
/com/atlassian
/org/codehaus
/com/sun
/org/mozilla
/javax/resource
/org/igniterealtime
/javax/ws
/org/bouncycastle
/org/dom4j
/org/custommonkey
/org/openqa
/org/easymock
/com/bea
/org/postgresql
/com/espertech
/org/w3
/com/sqljet
/org/w3c
/edu/oswego
/spy/com.springsource.net.spy.memcached
/com/sleepycat
/org/beanshell
/org/hamcrest
/org/python
/com/experlog
/org/jets3t
/com/h2database
/org/dbunit
/org/pdfbox
/javax/persistence
/org/osgi
/javax/portlet
/org/jboss
/org/jgroups
/org/jasypt
/com/lowagie
/org/mortbay
/org/directwebremoting
/com/jamesmurty
/org/fontbox
/org/tuckey
/com/oreilly
/org/rifers
/org/dojotoolkit
/net/jcip
/com/svnkit
/org/opensaml
/org/erlang
/org/jdom
/org/jaxen
/org/objectweb
/javax/validation
/org/json
/org/jfree
/com/jcraft
/org/hibernate
/com/db4o
/javax/servlet
/javax/faces
/org/sat4j
/javax/wsdl
/com/googlecode
/javax/sql
/org/tanukisoftware
/com/mysql
/net/sourceforge
/org/jsoup
/org/hyperic
@ifedorenko

Ok, I can confirm the pull works for the repository mentioned in NEXUS-5696. I think I had problems with the browser cache, but I am not 100% sure.

I still don't understand how the code works. What happens if the bucket has 5k keys, for example?

@cstamas
Sonatype member

@ifedorenko Amazon S3 caps the response "page" at a maximum of 1000 entries. This is a hard constraint, no client can change it, so it can be taken for granted. If a bucket contains 1k+ keys, the listing will be "paged" (that's the only way to list them all). In your example, it would be 5 HTTP requests to retrieve the 5k entries.

But now that you mention this, I think the code should be reworked from recursive to sequential, as in that example 5 jsoup Document instances would be held in memory at once.

Still, I think this pull should be merged in, as it fixes the bug, and then another one should be made that refactors the diveIn method and makes its execution sequential (looped) instead of recursive.
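
Roughly what that follow-up refactoring could look like (a sketch under assumptions, not the actual diveIn code; fetchPage and extractKeys are hypothetical helpers, and nextPageUrl is the illustrative method sketched in the description above): a plain loop walks the pages one at a time, so only one page's parsed data is alive at any point instead of one per recursion level.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: looped (non-recursive) walk over S3 listing pages.
// fetchPage(url) and extractKeys(xml) are hypothetical stand-ins for the
// HTTP GET and the <Key> extraction; S3PagingSketch.nextPageUrl(...) is the
// illustrative method sketched earlier and returns null on the last page.
public class LoopedBucketWalkSketch
{
    static List<String> listAllKeys( String bucketRootUrl ) throws Exception
    {
        List<String> keys = new ArrayList<String>();
        String url = bucketRootUrl;
        while ( url != null )
        {
            String pageXml = fetchPage( url );       // one page of max 1000 keys
            keys.addAll( extractKeys( pageXml ) );   // harvest this page's keys
            url = S3PagingSketch.nextPageUrl( bucketRootUrl, pageXml );
            // pageXml (and anything parsed from it) becomes garbage here, so
            // e.g. 5000 keys arrive over 5 sequential requests without ever
            // holding 5 parsed documents at once
        }
        return keys;
    }

    // hypothetical helpers, bodies omitted in this sketch
    static String fetchPage( String url ) throws Exception
    {
        throw new UnsupportedOperationException( "sketch only" );
    }

    static List<String> extractKeys( String pageXml )
    {
        throw new UnsupportedOperationException( "sketch only" );
    }
}
```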

@cstamas
Sonatype member

The diveIn method has been made non-recursive, to make it more heap-friendly.

@cstamas
Sonatype member

CI
https://bamboo.zion.sonatype.com/browse/NXO-OSSF4-3

The job had one unrelated failure (Search related); the added test passed (it is a UT).

@ifedorenko

+1

I looked for recursion but could not find any. I do see iteration and can see how the code is expected to work now :-)

@cstamas cstamas merged commit 1a148aa into master Apr 30, 2013
@cstamas cstamas deleted the s3-scraper-fix branch Apr 30, 2013