XML Error pulling player stats #123

Closed
jacobmh1177 opened this issue Mar 30, 2019 · 8 comments · Fixed by #124

Comments

@jacobmh1177

The xml bug described in #121 appears to be fixed for the game overview but is still present for player stats. I've built from master and get the following result when trying to pull player stats from a completed game:

>>> games = mlb.combine_games( mlb.games(2019, 3, 29))
>>> print(games[0])
Tigers (0) at Blue Jays (6)
>>> mlb.player_stats(games[0].game_id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/mlbgame/mlbgame/__init__.py", line 214, in player_stats
    data = mlbgame.stats.player_stats(game_id)
  File "/home/ec2-user/mlbgame/mlbgame/stats.py", line 75, in player_stats
    raw_box_score_tree = etree.parse(raw_box_score).getroot()
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "http://gd2.mlb.com/components/game/mlb/year_2019/month_03/day_29/gid_2019_03_29_detmlb_tormlb_1/rawboxscore.xml", line 1
lxml.etree.XMLSyntaxError: Space required after the Public Identifier, line 1, column 55
>>> 
@mrgator85

Seems MLB changed the API on this. There is no data provided at gd2.mlb.com/components/game/mlb/year_2019/month_03/day_31/gid_2019_03_31_pitmlb_cinmlb_1/rawboxscore.xml

Not sure what was different in the old api between boxscore and rawboxscore, but maybe they combined them?

Not sure how to fix without knowledge of the data in rawboxscore vs boxscore

@mmergola

mmergola commented Apr 2, 2019

I actually just solved this same issue, so here's what I did (sorry if there's a cleaner way to show it on GitHub, this is my first real contribution).

In mlbgame folder:

  1. data.py - change line 79/80 from 'rawboxscore.xml' to 'boxscore.xml'

Originally, it looked like this:
return urlopen(GAME_URL.format(year, month, day, game_id, 'rawboxscore.xml'))

Make it this:
return urlopen(GAME_URL.format(year, month, day, game_id, 'boxscore.xml'))

  2. stats.py - in lines 72 and 75, add the number sign to comment them out

Originally, they looked like this:
raw_box_score = mlbgame.data.get_raw_box_score(game_id)
raw_box_score_tree = etree.parse(raw_box_score).getroot()

Make them like these:
#raw_box_score = mlbgame.data.get_raw_box_score(game_id)
#raw_box_score_tree = etree.parse(raw_box_score).getroot()

I think those are the only two changes needed but, if you get an error, comb through the files again and see if anything else is pointing to rawboxscore. I read in one of the notes that MLB used to have two box scores and these scripts pulled both of them. It looks like MLB condensed it to one and the script was trying to still grab both. Rather than just delete that part of the code, I commented it out so I wasn't breaking too much.
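For the "comb through the files" step, a small search script can list every remaining reference. This is only a sketch: `find_references` is a hypothetical helper, and the `mlbgame` path is assumed to point at a local checkout of the package.

```python
from pathlib import Path


def find_references(root, needle="rawboxscore"):
    """Yield (path, line_number, line) for each .py line under root containing needle."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, line_number, line.strip()


if __name__ == "__main__":
    # "mlbgame" is an assumed path to a local checkout of the package.
    for path, line_number, line in find_references("mlbgame"):
        print(f"{path}:{line_number}: {line}")
```

Note that the default needle matches the file name `rawboxscore.xml` but not identifiers spelled with underscores, such as `get_raw_box_score`; pass `needle="raw_box_score"` to catch those as well.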

Good luck!

@jtonzi

jtonzi commented Apr 2, 2019

@mmergola You're right! But you only need to make the change you reference in data.py (I think).

Thanks for posting this!

@trevor-viljoen
Contributor

The rawboxscore functionality is still necessary for backward compatibility with games from the past. I would advise against commenting it out or removing it.

@Bill-M123

@mmergola: Thx much! @jtonzi : I had success with just changing data.py as well. I did not change stats.py.

@trevor-viljoen: Does making the single change prevent the backward compatibility issue? Simple testing on my part indicated that it worked for 2017 games, but there may be more advanced functionality than I am using.

Again, many thanks.

@trevor-viljoen
Contributor

trevor-viljoen commented Apr 4, 2019

@Bill-M123 boxscore and rawboxscore contained different data in the past. Commenting out the rawboxscore code would make that data inaccessible for old games. A try/except block would be better.
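A minimal sketch of the try/except approach suggested here, not the fix that was actually merged for this issue. It uses the standard library's `xml.etree.ElementTree` so that it runs on its own; mlbgame itself parses with lxml, where the corresponding exception is `lxml.etree.XMLSyntaxError` (as in the traceback above). The names `parse_optional_xml` and `load_box_scores`, and the opener callables, are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def parse_optional_xml(open_source):
    """Return the parsed root element, or None if the source is missing or malformed.

    open_source is a hypothetical zero-argument callable that returns a
    file-like object, for example a wrapper around urlopen.
    """
    try:
        return ET.parse(open_source()).getroot()
    except (ET.ParseError, OSError):
        # Covers malformed XML and fetch failures (urllib's URLError is an OSError).
        return None


def load_box_scores(open_box_score, open_raw_box_score):
    # boxscore.xml is required, so its errors are allowed to propagate.
    box_score_root = ET.parse(open_box_score()).getroot()
    # rawboxscore.xml is optional: usable for older games, unavailable for newer ones.
    raw_box_score_root = parse_optional_xml(open_raw_box_score)
    return box_score_root, raw_box_score_root


# Stand-ins for the real network fetches, for illustration only.
def open_box_score():
    return io.BytesIO(b"<boxscore/>")


def open_raw_box_score():
    return io.BytesIO(b"<unclosed>")  # malformed on purpose


box_root, raw_root = load_box_scores(open_box_score, open_raw_box_score)
print(box_root.tag, raw_root)
```

Callers then check whether the raw root is `None` before reading the extra stats, so old games keep their rawboxscore data while new games still load.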

@Bill-M123

@trevor-viljoen Thx. I was just about to comment out my question. For my application, that means that all additional batting stats go away since rawboxscore is removed. You answered this in another thread (#89), and I didn't catch the implication until this morning when I was working on the prediction side of my problem...

For me, the backward compatibility is actually a problem. If I use any additional info from previous years for training and it isn't available for predictions, that is problematic. I'll take a careful look at what is missing (which looked like "not much" on a quick look) and see if I can derive it, but my default plan is to just use the info available in boxscore/player_stats.

Best

@mmergola

mmergola commented Apr 4, 2019

Now that I'm understanding it all better, what everyone said above is correct (that it's necessary, but can survive with a try/except). I just used the link from trevor-viljoen to update my mlbstats.py script and changed the lines back to what they were for mlbdata.py.

Basically, what I said in my first comment is flawed and only works if you don't care about prior years. If you do care about prior years, use trevor-viljoen's solution.

panzarino added a commit that referenced this issue Apr 4, 2019
Fix issue #123 when rawboxscore.xml is unavailable
7 participants