-
Notifications
You must be signed in to change notification settings - Fork 45.4k
Make boosted_trees Garden-official #4377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@yk5 Could you help to review the code? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, but mostly LGTM. I assume you have tested and can run all of the scripts still?
names=["c%02d" % i for i in range(29)] # label + 28 features. | ||
).as_matrix() | ||
finally: | ||
os.remove(temp_filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tf.gfile.Remove, for consistency
FLAGS, unparsed = parse_args() | ||
tf.app.run(argv=[sys.argv[0]] + unparsed) | ||
def define_data_download_flags(): | ||
"""Add flags specifying data download arguments.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to ourselves: we should consider having a flags_core fn specifically for download module flags, as I think we now have several separate data_dir definitions. No need to solve here though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! @robieta Maybe we should add one in utils/flags?
import sys | ||
|
||
# pylint: disable=g-bad-import-order | ||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import order seems wrong. Numpy should be below, and we need an enable= statement as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got lint errors in Kokoro checking if numpy goes after absl. :(
names=['c%02d' % i for i in range(29)] # label + 28 features. | ||
).as_matrix() | ||
tf.logging.info("Data processing... taking multiple minutes...") | ||
with gzip.open(temp_filename, "rb") as csv_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my learning, pandas supports reading from .gz directly. Do we prefer to use explicitly gzip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out! It's strange then, as when I tested the original code, I got the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
That's why we explicitly gzip it here. Any idea on the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe related to pandas version, but as gzip works, I think this change is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which version are you using? I use 0.22.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. the same 0.22.0. Do you get errors when running locally in virtualenv? or in travis or whatever?
FYI, I'm using Linux with virtualenv (python 2.7.13 numpy 1.14.3).
I ran it just now and confirmed pd.read_csv() reads and processes the csv.gz file properly..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I see the problem. I ran it with python3. When I test it with python2, it works well as yours. So I will just keep gzip explicitly for py2 and py3 compatibility. Thanks a lot! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for the fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Looks good to me.
Hi All,
I update the boosted_trees code to make it more garden-style: