
March data dump #881

Closed
klop opened this issue Feb 16, 2016 · 131 comments

@klop

klop commented Feb 16, 2016

Is there any way to get a dump of 6.86 matches only? All I could find were the 500k Dec 2015 and 3.5M dumps.

@albertcui
Member

We'll probably do another dump in March.


@howardchung
Member

Maybe with skill data this time!

@howardchung howardchung added this to the 2016-4 milestone Feb 16, 2016
@klop
Author

klop commented Feb 16, 2016

A dump with skill data would be awesome.

@howardchung howardchung changed the title Data dump by patch March data dump Feb 19, 2016
@howardchung howardchung modified the milestones: 2016-4, 2016-3 Feb 25, 2016
@howardchung
Member

do we want to make this a quarterly or semiannual thing?

@albertcui
Member

Pushing back because we're doing the import right now.

@albertcui albertcui modified the milestones: 2016-4, 2016-3 Mar 27, 2016
@onelivesleft
Contributor

Posting to say this would be good quarterly (unless you get the BigQuery thing updating live). Will you post a blog post when the next dump happens?

@howardchung
Member

If it were up to me I'd probably do semiannual but if @albertcui wants to do it quarterly I won't say no (he's the one having to export/upload the data anyway).

Regarding future dumps:
I think at some point after we complete the import we will do a massive pg_dump (this would produce a PostgreSQL-specific dump) with every match ever played (~1.2 billion matches, mostly unparsed). This will also aid us in doing a data migration if we need to move our match data somewhere else (possibly because of Google getting too expensive). Then we can do periodic "addendum" dumps to keep updated records exported. It is up to @albertcui if he wants to continue doing the more generic JSON dumps as well.

We could possibly also get away with not keeping snapshots in Google (that would save nearly $100 a month).

ETA for import is 10-15 days.
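(A rough sketch of what such a full pg_dump could look like, assuming the database is named yasp as in the psql prompt later in this thread; the table list, flags, and output path are illustrative, not the exact command we'd run:)

# custom-format dump (-Fc) of just the match tables, restorable later with pg_restore
pg_dump -Fc --table=matches --table=player_matches --table=match_skill \
    --file=/var/lib/postgresql/data/pgdata/yasp_full.dump yasp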

@onelivesleft
Contributor

That'd be great: I'd love to be able to query a db about matches (like the official api, but not limited to the last x hundred games). If I have to download a massive file first that's not really a problem.

I take it opening up an api of your own would have too high a bandwidth overhead?

@howardchung
Member

Yeah, APIs are expensive to operate.

@mikkelam

mikkelam commented Apr 7, 2016

I'm very interested in using the MMR data for machine learning. Is it included in this data dump? I suspect one can estimate a player's MMR with very high accuracy.

@howardchung
Member

@albertcui are you planning to dump player_ratings? Or perhaps export a "snapshot" of current MMR data?

@paulodfreitas

I think it would be nice if the dumps were somewhat synchronized with the Majors. That way they could be released at known intervals, roughly aligned with the big updates.

@howardchung
Member

howardchung commented Apr 19, 2016

The import is done. I've been talking with @albertcui about doing a full dump this time (with every match ever played).

We'd dump matches, player_matches, and match_skill as CSV. Users would have to join the data themselves.
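(For anyone wondering what joining the data yourself would look like: a minimal sketch after loading the CSVs back into a database, assuming all three tables share a match_id column and that the other column names below exist, which is an assumption about the schema:)

-- one row per player per match, enriched with match-level fields
SELECT m.match_id, m.duration, m.radiant_win, pm.account_id, pm.hero_id, ms.skill
FROM matches m
JOIN player_matches pm ON pm.match_id = m.match_id
LEFT JOIN match_skill ms ON ms.match_id = m.match_id;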

@onelivesleft
Contributor

Sounds good

@howardchung
Member

@albertcui I put sample queries in the OP. You may want to try them locally on your devbox first to make sure they work properly.

@howardchung howardchung modified the milestones: 2016-5, 2016-4 Apr 24, 2016
@albertcui
Member

albertcui commented Apr 27, 2016

yasp=# COPY matches TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches.gz' CSV HEADER;
COPY 1191768403
yasp=# COPY match_skill to PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/match_sill.gz' CSV HEADER;
COPY 132447335

matches.gz is 146 GB. Currently exporting player_matches.
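(For downloaders, a sketch of pulling one of these gzipped CSVs back into Postgres, mirroring the export above; the target table must already exist with a matching schema, and COPY ... FROM PROGRAM requires superuser rights:)

yasp=# COPY matches FROM PROGRAM 'gunzip -c /path/to/matches.gz' CSV HEADER;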

@waprin

waprin commented Jan 4, 2017

Just got too frustrated and wanted to take a break from this, especially since I was traveling. Getting back home tomorrow, going to build a new PC with a bigger disk, will loop back around on this and learn more about torrents sometime this month. I might even try to host my own torrent tracker, might be a good learning experience.

@howardchung
Member

If you just put the blobs on Google Cloud Storage and shared the download links, would you be able to pay for the download bandwidth/storage on your personal account? Or should we wait until we get a torrent working before making a blog post/public announcement?

@rossengeorgiev

I know you guys haven't finished dealing with the original dump, but any chance of a fresh one? Like the last month or so. It would be really useful given the dramatic changes in 7.00.

@howardchung
Member

Unfortunately, the old code we used for dumps doesn't work anymore since the move to Cassandra. No telling when we'll be able to get a new migration script working.

I think @waprin wants to eventually get something set up where match data is directly streamed to BigQuery. If we get that working then it would probably be the best place to obtain fresh data dumps.
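(Until something like that exists, a hedged sketch of how a CSV dump could be batch-loaded into BigQuery for querying with the bq CLI; the opendota.matches dataset/table name here is hypothetical:)

# load a CSV dump into a BigQuery table, inferring the schema from the file
bq load --source_format=CSV --autodetect --skip_leading_rows=1 opendota.matches ./matches.csv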

@rossengeorgiev

I'm unfamiliar with Cassandra, but there seems to be a CAPTURE command that would export the results of queries. I couldn't find any details about its performance. Maybe that could do it?

I really like the idea of streaming data to BigQuery, but it doesn't seem to be happening any time soon. The original issue was about just a slice of data. I'm looking for the same thing a year later, and it seems to be even further from happening.
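(For reference, a rough sketch of the cqlsh side; the yasp.matches keyspace/table below is assumed rather than taken from the real schema. COPY ... TO is cqlsh's bulk CSV export, while CAPTURE only redirects console output to a file:)

cqlsh> COPY yasp.matches TO '/tmp/matches_export.csv' WITH HEADER = true;
cqlsh> CAPTURE '/tmp/query_output.txt';
cqlsh> SELECT * FROM yasp.matches LIMIT 10;
cqlsh> CAPTURE OFF;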

@howardchung
Member

howardchung commented Jan 26, 2017 via email

@waprin

waprin commented Jan 26, 2017

I just ordered a new HDD so I can just seed from home as soon as it arrives. Torrent does seem like the best option.

Would like to stream matches directly from the API into BigQuery, so I'll look into that.

@rossengeorgiev

I've recently scraped all Dota matches for January and I'm making them available as a torrent. It's 33 million matches, excluding the Dark Moon ones.

http://static.rgp.io/dota2_matches_jan2017.torrent

@howardchung howardchung modified the milestones: Backlog, 2016-12 Feb 15, 2017
@7596ff
Contributor

7596ff commented Mar 1, 2017

Downloaded and created torrent files from #881 (comment).

Files:

@7596ff
Contributor

7596ff commented Mar 7, 2017

@albertcui, I believe my links should be good to go as long as you upload the files to academictorrents. It should resolve this error in my client:
[screenshot: torrent client error]

@jvanhees

Is there anyone seeding the files from @bippum, and is there some sample data available? I've got plenty of space and a home server that can seed 24/7 at 200 Mbit/s, but I first need to download the data :). I will leave the torrents provided above running for now, hoping that someone can share them. If there are other torrents available, please let me know.

@howardchung
Member

The OP has small sample datasets.

@albertcui can you please upload the torrents to academic torrents?

@albertcui
Member

albertcui commented Mar 18, 2017

I've uploaded matches + match_skill. It won't let me upload player_matches:

"Sorry, the piece length is too small. The torrent file must be less than 2MB. Increase your piece length to lower the file size" :(

For reference, all the torrents are in this collection: http://academictorrents.com/collection/opendota-formerly-yasp-data-dumps

@7596ff
Contributor

7596ff commented Mar 18, 2017

I'll attempt to create the torrent again within the day.

@albertcui
Member

Thanks, sorry for the delay. Did uploading the other ones fix the error?

@7596ff
Contributor

7596ff commented Mar 18, 2017

Yes it did.

@jvanhees

Great, thanks guys, I'm currently downloading the files and will continue to seed them :). Good work!

@7596ff
Contributor

7596ff commented Mar 20, 2017

Glad to hear you are able to download them OK. I switched torrent clients (from Transmission to Deluge) after having difficulty getting Transmission to do anything, let alone upload. Now I wake up to see that match_skill and matches are seeding! I updated the player_matches link with a 4 MiB piece torrent; @albertcui, could you upload that one to academictorrents? Thanks.

If this one doesn't work I can try creating it again with 8 MiB pieces.
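(A sketch of how that can be done with mktorrent, where -l sets the piece length as a power of two, so -l 22 gives 4 MiB pieces and -l 23 gives 8 MiB; the tracker URL and output name are placeholders:)

# larger pieces mean fewer piece hashes, keeping the .torrent itself under academictorrents' 2 MB limit
mktorrent -l 22 -a http://tracker.example.com/announce -o player_matches.torrent player_matches.gz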

@howardchung
Member

Awesome, once they're all up we can write a release blog post and then we can finally close this! :)

@albertcui
Member

@howardchung
Member

Blog post published:
http://blog.opendota.com/2017/03/24/datadump2/

@viniciusmr

viniciusmr commented Apr 5, 2018

Hi there guys!
Are the files available anywhere else?
(Or maybe someone who has the files wants to/can join the swarm?)

I'm currently downloading the "OpenDota - All Matches from March 2016 - Matches" torrent
(matches.gz, 155.94 GB).
However, its availability is less than 1 (0.781), which means that even if I leave it downloading forever I won't be able to finish, because there are missing pieces in the swarm
(and there is actually only one seeder =/).

@7596ff
Contributor

7596ff commented Apr 5, 2018

Hi, I'm currently seeding all 3 files with 100% completion, so it should complete eventually.

[screenshot: torrent client showing all three files at 100%, seeding]

@pranavchintala

Hello, would anyone be willing to seed player_matches.gz? Haven't been able to find a seeder for a week now and could really use this data for a project! Thanks in advance!

@7596ff
Contributor

7596ff commented Sep 19, 2018

The copy I had got corrupted during a transfer between hard drives. I no longer have the original files from the Amazon Cloud Drive location, and they aren't obtainable either. Sorry for the inconvenience.

@pranavchintala

Alright no problem, thanks for the response!
Would anybody else have even a subset of this data available? Perhaps something larger than the 4GB samples above would do the trick.

@hanisaf

hanisaf commented Aug 17, 2020

I wonder if anyone can still seed matches.gz and player_matches.gz? I'm interested in the data for a research project. Thanks.
