March data dump #881

klop · 2016-02-16T18:11:05Z

Is there any way to get a dump of 6.86 matches only? All I could find were the 500k Dec 2015 and 3.5M dumps.

Large dataset (torrent seeded by @waprin ) and posted to academic torrents
- Contains every match played to March 2016
- Skill data for most of the matches after May 2015
Small dataset (created from large files by @waprin , these are 4GB samples of the large data set files)
https://storage.googleapis.com/dota-match-dumps/matches_small.csv
https://storage.googleapis.com/dota-match-dumps/player_matches_small.csv
https://storage.googleapis.com/dota-match-dumps/match_skill.csv

albertcui · 2016-02-16T18:14:23Z

We'll probably do another dump in March.

On Tue, Feb 16, 2016 at 1:11 PM, klop notifications@github.com wrote:

Is there any way to get a dump of 6.86 matches only? All I could find were
the 500k Dec 2015 and 3.5M dumps.

—
Reply to this email directly or view it on GitHub
#881.

howardchung · 2016-02-16T18:30:33Z

Maybe with skill data this time!
On Feb 16, 2016 10:14 AM, "Albert Cui" notifications@github.com wrote:

We'll probably do another dump in March.

On Tue, Feb 16, 2016 at 1:11 PM, klop notifications@github.com wrote:

Is there any way to get a dump of 6.86 matches only? All I could find
were
the 500k Dec 2015 and 3.5M dumps.

—
Reply to this email directly or view it on GitHub
#881.

—
Reply to this email directly or view it on GitHub
#881 (comment).

klop · 2016-02-16T19:38:39Z

With skill data would be awesome.

howardchung · 2016-03-01T03:25:13Z

do we want to make this a quarterly or semiannual thing?

albertcui · 2016-03-27T15:24:45Z

Pushing back because we're doing import right now.

onelivesleft · 2016-03-30T14:04:15Z

Posting to say this would be good quarterly (unless you get the BigQuery thing updating live). Will you post a blog post when the next dump happens?

howardchung · 2016-03-31T07:22:58Z

If it were up to me I'd probably do semiannual but if @albertcui wants to do it quarterly I won't say no (he's the one having to export/upload the data anyway).

Regarding future dumps:
I think at some point after we complete the import we will do a massive pg_dump (this would produce a PostgreSQL-specific dump) with every match ever played (~1.2 billion matches, mostly unparsed). This will also aid us in doing a data migration if we need to move our match data somewhere else (possibly because of Google getting too expensive). Then we can do periodic "addendum" dumps to keep updated records exported. It is up to @albertcui if he wants to continue doing the more generic JSON dumps as well.

We could possibly also get away with not keeping snapshots in Google as well (that would save nearly $100 a month).

ETA for import is 10-15 days.

onelivesleft · 2016-03-31T12:13:59Z

That'd be great: I'd love to be able to query a db about matches (like the official api, but not limited to the last x hundred games). If I have to download a massive file first that's not really a problem.

I take it opening up an api of your own would have too high a bandwidth overhead?

howardchung · 2016-03-31T16:33:14Z

Yeah, APIs are expensive to operate.

mikkelam · 2016-04-07T13:52:27Z

I'm very interested in using the MMR data for machine learning, is this added in this data dump? I suspect one can estimate a players MMR up to a very high accuracy.

howardchung · 2016-04-07T16:38:55Z

@albertcui are you planning to dump player_ratings? Or perhaps export a "snapshot" of current MMR data?

paulodfreitas · 2016-04-12T14:31:41Z

I think would be nice if dumps are somewhat synchronized with Majors. This way they could be release in "know" intervals and somewhat related with big updates.

howardchung · 2016-04-19T08:41:36Z

import is done. Been talking with @albertcui about doing a full dump this time (with every match ever played).

We'd dump matches, player_matches, and match_skill as CSV. Users would have to join the data themselves.

onelivesleft · 2016-04-19T08:44:17Z

Sounds good

howardchung · 2016-04-20T05:33:49Z

@albertcui I put sample queries in the OP. You may want to try them locally on your devbox first to make sure they work properly.

albertcui · 2016-04-27T00:31:35Z

yasp=# COPY matches TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches.gz' CSV HEADER;
COPY 1191768403
yasp=# COPY match_skill to PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/match_sill.gz' CSV HEADER;
COPY 132447335

matches.gz is 146 GB. Currently exporting player_matches.

waprin · 2017-01-04T19:16:41Z

Just got too frustrated and wanted to take a break from this, especially since I was traveling. Getting back home tomorrow, going to build a new PC with a bigger disk, will loop back around on this and learn more about torrents sometime this month. I might even try to host my own torrent tracker, might be a good learning experience.

howardchung · 2017-01-14T07:18:22Z

If you just put the blobs on Google Cloud Storage and shared the download links, would you be able to pay for the download bandwidth/storage on your personal account? Or should we wait until we get a torrent working before making a blog post/public announcement?

rossengeorgiev · 2017-01-26T13:32:29Z

I know you guys haven't finish dealing with the original dump, but any chance for a fresh one? Like the last month or something. It would be really useful given the dramatic changes of 700.

howardchung · 2017-01-26T13:34:57Z

unfortunately the old code we used for dumps doesn't work anymore since the move to Cassandra. No telling when we'll be able to get a new migration script working.

I think @waprin wants to eventually get something set up where match data is directly streamed to BigQuery. If we get that working then it would probably be the best place to obtain fresh data dumps.

rossengeorgiev · 2017-01-26T15:07:15Z

I'm unfamiliar with Cassandra, but there seems to be CAPTURE command that would export the result of queries. Couldn't find any details about performance. Maybe that could do it?

I really like the idea of streaming data to BigQuery, but does seem to be happening any time soon. The original issues was about just a slice of data. I'm looking for the same thing a year later and it seems to be even further from happening.

howardchung · 2017-01-26T17:34:44Z

Depending on how many matches you need, you can use the API to fetch the match data for randomly selected matches in the time window you want.

…

On Thu, Jan 26, 2017 at 7:07 AM, Rossen Georgiev ***@***.***> wrote: I'm unfamiliar with Cassandra, but there seems to be CAPTURE command that would export the result of queries. Couldn't find any details about performance. Maybe that could do it? I really like the idea of streaming data to BigQuery, but does seem to be happening any time soon. The original issues was about just a slice of data. I'm looking for the same thing a year later and it seems to be even further from happening. — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#881 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC_UOKx2R0dRSLrT0QDA3fIdnFeusA4Oks5rWLawgaJpZM4HbZFE> .

waprin · 2017-01-26T21:54:07Z

I just ordered a new HDD so I can just seed from home as soon as it arrives. Torrent does seem like the best option.

Would like to stream matches directly from API into Bigquery so will look into that.

rossengeorgiev · 2017-02-11T17:48:18Z

I've recently scrapped all dota matches for January and I'm making it available as a torrent. It's 33mil matches without the dark moon ones.

http://static.rgp.io/dota2_matches_jan2017.torrent

7596ff · 2017-03-01T21:31:37Z

Downloaded and created torrent files from #881 (comment).

Files:

match_skill.gz 1.60 GB (1,643 pieces at 1 MiB/piece)
player_matches.gz 505.4 GB (129,382 pieces at 4MiB/piece)
matches.gz 145.2 GB (74,359 pieces at 2 MiB/piece)

7596ff · 2017-03-07T17:23:45Z

@albertcui, I believe my links should be good to go as long as you upload the files to academictorrents. It should resolve this error in my client:

jvanhees · 2017-03-14T00:09:54Z

Is there anyone seeding the files from @bippum , and is there some sample data available? I've got plenty of space and a home server that can seed 24/7 at 200mbit/s, but I first need to download the data :) . I will leave the torrents provided above running for now, hoping that someone can share them. If there are other torrents available, please let me know.

howardchung · 2017-03-14T01:56:37Z

The OP has small sample datasets.

@albertcui can you please upload the torrents to academic torrents?

albertcui · 2017-03-18T17:45:46Z

I've uploaded matches + match_skill. It won't let me update player_matches:

"Sorry, the piece length is too small. The torrent file must be less than 2MB. Increase your piece length to lower the file size" :(

For reference, all the torrents are in this collection: http://academictorrents.com/collection/opendota-formerly-yasp-data-dumps

7596ff · 2017-03-18T17:49:36Z

I'll attempt to create the torrent again within the day.

albertcui · 2017-03-18T17:55:12Z

Thanks, sorry for the delay. Did uploading the other ones fix the error?

7596ff · 2017-03-18T18:21:15Z

Yes it did.

jvanhees · 2017-03-20T08:53:33Z

Great, thanks guys, I'm currently downloading the files and will continue to seed them :). Good work!

7596ff · 2017-03-20T12:14:54Z

Glad to hear you are able to download them OK. I switched torrent clients (from Transmission to Deluge) after having difficulty getting Transmission to do anything, let alone upload. Now I wake up to see that match_skill and matches are seeding! I updated the player_matches link with a 4 MiB piece torrent, @albertcui if you could upload that one to academictorrents? Thanks.

If this one doesn't work I can try creating it again with 8 MiB pieces.

howardchung · 2017-03-20T16:34:50Z

Awesome, once they're all up we can write a release blog post and then we can finally close this! :)

albertcui · 2017-03-20T21:55:57Z

I've uploaded it here: http://academictorrents.com/details/1a0c5736bb54610ad00a45306df2b33628301409

howardchung · 2017-03-27T02:49:34Z

Blog post published:
http://blog.opendota.com/2017/03/24/datadump2/

viniciusmr · 2018-04-05T16:36:41Z

Hi there guys!
Isn't there anywhere else where the files are available?
(Or maybe someone with the file that wants/can join the torrent stream?)

I'm currently downloading the "OpenDota - All Matches from March 2016 - Matches"
(matches.gz , 155.94GB)
However the torrent availability is less than 1 (0.781) which means even if I leave this downloading forever, I won't be able to finish the download, cause there are missing pieces in the stream.
(and actually there is only one seeder =/ )

7596ff · 2018-04-05T18:57:26Z

Hi, I'm currently seeding all 3 files with 100% completion, so it should complete eventually.

pranavchintala · 2018-09-19T13:41:30Z

Hello, would anyone be willing to seed player_matches.gz? Haven't been able to find a seeder for a week now and could really use this data for a project! Thanks in advance!

7596ff · 2018-09-19T13:48:00Z

The copy I had got corrupted during a transfer between hard drives. I no longer have the original files from the amazon cloud drive location, and they aren't obtainable either. Sorry for the inconvenience.

pranavchintala · 2018-09-19T14:58:17Z

Alright no problem, thanks for the response!
Would anybody else have even a subset of this data available? Perhaps something larger than the 4GB samples above would do the trick.

hanisaf · 2020-08-17T12:59:02Z

I wonder if anyone can still seed matches.gz and player_matches.gz? I'm interested in the data for a research project. Thanks.

howardchung assigned albertcui Feb 16, 2016

howardchung added this to the 2016-4 milestone Feb 16, 2016

howardchung added the data label Feb 19, 2016

howardchung changed the title ~~Data dump by patch~~ March data dump Feb 19, 2016

howardchung mentioned this issue Feb 19, 2016

Skill brackets data not included in the Data Dump? #860

Closed

howardchung modified the milestones: 2016-4, 2016-3 Feb 25, 2016

albertcui mentioned this issue Mar 18, 2016

Dota 2 YASP dataset loaded in BigQuery #924

Closed

albertcui modified the milestones: 2016-4, 2016-3 Mar 27, 2016

howardchung modified the milestones: 2016-5, 2016-4 Apr 24, 2016

howardchung modified the milestones: Backlog, 2016-12 Feb 15, 2017

howardchung closed this as completed Mar 27, 2017

March data dump #881

March data dump #881

Comments

klop commented Feb 16, 2016 • edited by howardchung Loading

albertcui commented Feb 16, 2016

howardchung commented Feb 16, 2016

klop commented Feb 16, 2016

howardchung commented Mar 1, 2016

albertcui commented Mar 27, 2016

onelivesleft commented Mar 30, 2016

howardchung commented Mar 31, 2016

onelivesleft commented Mar 31, 2016

howardchung commented Mar 31, 2016

mikkelam commented Apr 7, 2016

howardchung commented Apr 7, 2016

paulodfreitas commented Apr 12, 2016

howardchung commented Apr 19, 2016 • edited Loading

onelivesleft commented Apr 19, 2016

howardchung commented Apr 20, 2016

albertcui commented Apr 27, 2016 • edited Loading

waprin commented Jan 4, 2017

howardchung commented Jan 14, 2017

rossengeorgiev commented Jan 26, 2017

howardchung commented Jan 26, 2017

rossengeorgiev commented Jan 26, 2017

howardchung commented Jan 26, 2017 via email

waprin commented Jan 26, 2017

rossengeorgiev commented Feb 11, 2017

7596ff commented Mar 1, 2017 • edited Loading

7596ff commented Mar 7, 2017

jvanhees commented Mar 14, 2017

howardchung commented Mar 14, 2017

albertcui commented Mar 18, 2017 • edited Loading

7596ff commented Mar 18, 2017

albertcui commented Mar 18, 2017

7596ff commented Mar 18, 2017

jvanhees commented Mar 20, 2017

7596ff commented Mar 20, 2017 • edited Loading

howardchung commented Mar 20, 2017

albertcui commented Mar 20, 2017

howardchung commented Mar 27, 2017

viniciusmr commented Apr 5, 2018 • edited Loading

7596ff commented Apr 5, 2018

pranavchintala commented Sep 19, 2018

7596ff commented Sep 19, 2018

pranavchintala commented Sep 19, 2018

hanisaf commented Aug 17, 2020

klop commented Feb 16, 2016 •

edited by howardchung

Loading

howardchung commented Apr 19, 2016 •

edited

Loading

albertcui commented Apr 27, 2016 •

edited

Loading

7596ff commented Mar 1, 2017 •

edited

Loading

albertcui commented Mar 18, 2017 •

edited

Loading

7596ff commented Mar 20, 2017 •

edited

Loading

viniciusmr commented Apr 5, 2018 •

edited

Loading