yhliang2018
Contributor

Hi All,

I have updated the boosted_trees code to make it more garden-style:

  1. Add official flags in data_download.py, and fix a minor bug
  2. Add benchmark logger in train_higgs.py
  3. Replace single quotes with double quotes

@yhliang2018 yhliang2018 requested review from a team and karmel as code owners May 25, 2018 23:39
@yhliang2018
Contributor Author

@yk5 Could you help review the code? Thanks!

Contributor

@karmel karmel left a comment

A few minor comments, but mostly LGTM. I assume you have tested and can run all of the scripts still?

names=["c%02d" % i for i in range(29)] # label + 28 features.
).as_matrix()
finally:
os.remove(temp_filename)
Contributor

tf.gfile.Remove, for consistency
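
The cleanup pattern under discussion can be sketched as follows. This is a minimal, dependency-free illustration that uses `os.remove`; in the TensorFlow 1.x code under review, the reviewer's suggestion would replace that call with `tf.gfile.Remove(temp_filename)`. The helper name `process_and_cleanup` is hypothetical and not part of the PR.

```python
import os
import tempfile


def process_and_cleanup(payload):
  """Writes payload to a temp file, reads it back, and always deletes the file.

  Hypothetical helper for illustration only; not part of the PR.
  """
  fd, temp_filename = tempfile.mkstemp()
  os.close(fd)
  try:
    with open(temp_filename, "wb") as f:
      f.write(payload)
    with open(temp_filename, "rb") as f:
      data = f.read()
  finally:
    # The finally clause runs even if processing raises, so the temp file
    # never leaks. In the reviewed code, tf.gfile.Remove(temp_filename)
    # would go here instead, for consistency with other tf.gfile calls.
    os.remove(temp_filename)
  return data, temp_filename
```

Returning the filename is only there so a caller can confirm that the file is gone afterwards.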

FLAGS, unparsed = parse_args()
tf.app.run(argv=[sys.argv[0]] + unparsed)
def define_data_download_flags():
"""Add flags specifying data download arguments."""
Contributor

Note to ourselves: we should consider having a flags_core fn specifically for download module flags, as I think we now have several separate data_dir definitions. No need to solve here though.

Contributor Author

Good point! @robieta Maybe we should add one in utils/flags?
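
One way to picture the shared helper proposed here is a single function that registers `data_dir` once, which each download script then calls. The sketch below uses the standard-library `argparse` purely to stay self-contained; the PR itself moves to the repository's official flags, so a real version would presumably live alongside those. The names `add_download_flags` and `parse_download_args`, and the default `/tmp/data`, are assumptions made for illustration.

```python
import argparse


def add_download_flags(parser, default_data_dir="/tmp/data"):
  """Registers the data_dir flag shared by download scripts (hypothetical)."""
  parser.add_argument(
      "--data_dir", type=str, default=default_data_dir,
      help="Directory to download and store the dataset.")
  return parser


def parse_download_args(argv, default_data_dir="/tmp/data"):
  """Parses download flags and returns (known_flags, unparsed_args)."""
  parser = argparse.ArgumentParser()
  add_download_flags(parser, default_data_dir)
  # parse_known_args leaves unrecognized flags for the caller to handle.
  return parser.parse_known_args(argv)
```

Each script could pass its own default directory while sharing one definition of the flag name and help text.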

import sys

# pylint: disable=g-bad-import-order
import numpy as np
Contributor

This import order seems wrong. Numpy should be below, and we need an enable= statement as well, right?

Contributor Author

I got lint errors from the Kokoro check when numpy goes after absl. :(

names=['c%02d' % i for i in range(29)] # label + 28 features.
).as_matrix()
tf.logging.info("Data processing... taking multiple minutes...")
with gzip.open(temp_filename, "rb") as csv_file:
Contributor

Just for my learning: pandas supports reading from .gz directly. Do we prefer to use gzip explicitly?

Contributor Author

Thanks for pointing it out! It's strange then, as when I tested the original code, I got the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

That's why we open the file with gzip explicitly here. Any idea what causes the issue?

Contributor

Maybe it is related to the pandas version, but as gzip works, I think this change is fine.

Contributor Author

Which version are you using? I use 0.22.0.

Contributor

Hmm, the same: 0.22.0. Do you get the errors when running locally in a virtualenv, or in Travis or somewhere else?
FYI, I'm using Linux with virtualenv (python 2.7.13, numpy 1.14.3).
I ran it just now and confirmed that pd.read_csv() reads and processes the csv.gz file properly.

Contributor Author

Aha, I see the problem. I ran it with python3. When I test it with python2, it works as it does for you. So I will keep the explicit gzip call for py2 and py3 compatibility. Thanks a lot! :)

Contributor

I see. Thanks for the fix!
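
The approach settled on in this thread, opening the archive with `gzip` and handing the parser an already-decompressed file object instead of relying on the parser to detect compression from the filename, can be sketched as below. To stay runnable without third-party packages, this sketch targets Python 3 and uses the standard-library `csv` module in place of pandas; in the PR the file object from `gzip.open(temp_filename, "rb")` is passed to `pd.read_csv`. The function name `read_gzipped_csv` is hypothetical.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(filename, num_columns=29):
  """Reads a .csv.gz file into float rows, decompressing it explicitly.

  Hypothetical helper for illustration only; not part of the PR.
  """
  # Same column naming as the reviewed code: label + 28 features.
  names = ["c%02d" % i for i in range(num_columns)]
  rows = []
  # Text mode ("rt") because the csv module expects str on Python 3;
  # pandas, as used in the PR, accepts the binary ("rb") handle instead.
  with gzip.open(filename, "rt", newline="") as csv_file:
    for record in csv.reader(csv_file):
      rows.append([float(value) for value in record])
  return names, rows
```

Because decompression happens before the parser sees any data, the parser's own compression handling never comes into play.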

Contributor

@yk5 yk5 left a comment

Thank you!

Looks good to me.

@yhliang2018 yhliang2018 merged commit 191d99a into master May 29, 2018
@yhliang2018 yhliang2018 deleted the feat/boosted_tree branch May 29, 2018 22:21