Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise plot_out_of_core_classification.py #12694

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

paragmoteria
Copy link

In this program, Sir Eustache Diemert used first 1000 samples to measure accuracy.
I put my efforts to extend the same program, to separate Train & Test Datasets as per guideline mentioned in README.txt file listed in Reuters-21578 datasets as provided by the UCI ML repository. Test Datasets used to measure accuracy.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

In this program, Sir Eustache Diemert used first 1000 samples to measure accuracy. 
I put my efforts to extend the same program, to separate Train & Test Datasets as per guideline mentioned in README.txt file listed in Reuters-21578 datasets as provided by the UCI ML repository. Test Datasets used to measure accuracy.
@paragmoteria
Copy link
Author

Suggest me if required to enhance my skill

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good idea. Comment on the standard train/test split in the docstring of stream_reuters_documents.

Then please apply PEP8 (with the flake8 tool for instance) to fix some cosmetic issues.

examples/applications/plot_out_of_core_classification.py Outdated Show resolved Hide resolved
@paragmoteria paragmoteria changed the title Update plot_plot_out_of_core_classification.py Revise plot_plot_out_of_core_classification.py Nov 29, 2018
@paragmoteria
Copy link
Author

paragmoteria commented Nov 29, 2018 via email

Copy link
Contributor

@eamanu eamanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look good


def get_minibatch(doc_iter_train, size, pos_class=positive_class):
"""Extract a minibatch of examples, return a tuple X_text, y.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO would be great write the Parameters and returns in the format used on sklearn docs. I mean:

"""Extract a minibatch of examples

Parameters
---------------
...

Return
---------
.....

"""

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still cosmetic nitpicks for now.

examples/applications/plot_out_of_core_classification.py Outdated Show resolved Hide resolved
@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

def handle_starttag(self, tag, attrs):
method = 'start_' + tag
for attr in attrs:
if attr[0] == 'lewissplit':
self.LEWisSplit = attr[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this capitalisation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per guidance to split train / test datasets, "LEWISSPLIT" is attribute that achieve the same. So, my purpose to express this variable in this manner is, Learners are easily identify this attribute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LEWisSplit is not a conventional attribute name. Is lewis_split appropriate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's appropriate

@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

def handle_starttag(self, tag, attrs):
method = 'start_' + tag
for attr in attrs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely this only applies to a single tag name, not all, and can be handled in the appropriate handle_* method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping.


"""

DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
download_url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change things like this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's true. I must learn to improve my skill.
Thanks for your guidance.

@@ -140,20 +147,22 @@ def end_d(self):
self.topic_d = ""


def stream_reuters_documents(data_path=None):
def stream_reuters_documents(data_path=None, train_test="TRAIN"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps rename train_test to subset

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subset is appropriate, nice!

data_stream_train = stream_reuters_documents(train_test="TRAIN")

# Test Datasets
data_stream_test = stream_reuters_documents(train_test="TEST")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now realised that we are doing two passes through the steam which kind of defeats the purpose. Either the test set is a prefix or we collect it while passing it through the stream

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comment?


X_TextTest, y_test = zip(*data_test)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, camel case does not belong here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greetings!!

Your previous comment (First Comment) was, please use underscores rather than camelCase.

So, to preserve the equality, I do the same.

Thanking you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but you still have camel case here

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like help from someone else to complete this? I like the idea of reusing standard train-test splits, but we need to maintain the code quality as well.

from sklearn.datasets import get_data_home
from sklearn.externals.six.moves import html_parser
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer support Python 2, so please just use html.parser

@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

def handle_starttag(self, tag, attrs):
method = 'start_' + tag
for attr in attrs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping.

if not len(data_train):
return np.asarray([], dtype=int), np.asarray([], dtype=int)

X_text_test, y_train = zip(*data_train)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by the mix of the words "test" and "train" here

total_vect_time = 0.0

# Main loop : iterate on mini-batches of examples
for i, (X_train_text, y_train) in enumerate(minibatch_iterators):
for i, (X_TrainText, y_train) in enumerate(minibatch_iterators):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, no camel case.

@paragmoteria
Copy link
Author

paragmoteria commented Jan 8, 2019 via email

@paragmoteria paragmoteria changed the title Revise plot_plot_out_of_core_classification.py Revise plot_out_of_core_classification.py Jan 19, 2019
Base automatically changed from master to main January 22, 2021 10:50
@ogrisel
Copy link
Member

ogrisel commented Feb 25, 2021

@paragmoteria could you please address the comments of the reviewers and push the required changes to the branch of your PR accordingly?

If you do not understand what is required please ask specific questions so that we can help you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants