Revise plot_out_of_core_classification.py #12694
In this example, Sir Eustache Diemert used the first 1000 samples to measure accuracy. I have extended the example to separate the train and test datasets, following the guideline in the README.txt file of the Reuters-21578 dataset as provided by the UCI ML repository. The test dataset is now used to measure accuracy.
Please suggest any improvements that would help me enhance my skills.
This looks like a good idea. Comment on the standard train/test split in the docstring of stream_reuters_documents.
Then please apply PEP8 (with the flake8 tool for instance) to fix some cosmetic issues.
Dear Joel,
Greetings!!
Thanks for your assistance.
As per your guidance, I have revised my code to follow PEP8, checked with flake8.
Thanks & Regards,
Parag
In examples/applications/plot_out_of_core_classification.py:
> tick = time.time()
-X_test_text, y_test = get_minibatch(data_stream, 1000)
+X_text_test, y_test = get_testData(data_stream_test, positive_class)
Please use underscores rather than camelCase.
Looks good.
+def get_minibatch(doc_iter_train, size, pos_class=positive_class):
+    """Extract a minibatch of examples, return a tuple X_text, y.
IMO it would be great to write the parameters and return values in the format used in the sklearn docs. I mean:
"""Extract a minibatch of examples

Parameters
----------
...

Returns
-------
...
"""
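As a rough sketch of what that could look like for get_minibatch, the docstring below follows the numpydoc layout used across the sklearn docs. The parameter descriptions are my reading of the diff, the default topic "acq" is an assumption (the PR passes positive_class instead), and plain tuples are returned rather than NumPy arrays only to keep the sketch dependency-free:

```python
import itertools


def get_minibatch(doc_iter, size, pos_class="acq"):
    """Extract a minibatch of examples.

    Parameters
    ----------
    doc_iter : iterator of dict
        Documents with 'title', 'body' and 'topics' keys.
    size : int
        Number of documents to draw, counted before documents with no
        topics assigned are excluded.
    pos_class : str, default="acq"
        Topic treated as the positive class (assumed default).

    Returns
    -------
    X_text : tuple of str
        Title and body of each retained document.
    y : tuple of bool
        True where ``pos_class`` is among the document's topics.
    """
    data = [
        ("{title}\n\n{body}".format(**doc), pos_class in doc["topics"])
        for doc in itertools.islice(doc_iter, size)
        if doc["topics"]
    ]
    if not data:
        return (), ()
    X_text, y = zip(*data)
    return X_text, y
```

Using one neutral pair of names (X_text, y) inside the function also avoids mixing "train" and "test" in a helper that serves both.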
Thanks
still cosmetic nitpicks for now.
@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

     def handle_starttag(self, tag, attrs):
         method = 'start_' + tag
+        for attr in attrs:
+            if attr[0] == 'lewissplit':
+                self.LEWisSplit = attr[1]
why this capitalisation?
As per the guidance on splitting the train and test datasets, "LEWISSPLIT" is the attribute that achieves this. I wrote the variable this way so that learners can easily identify the attribute.
LEWisSplit is not a conventional attribute name. Is lewis_split appropriate?
Yes, it's appropriate
@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

     def handle_starttag(self, tag, attrs):
         method = 'start_' + tag
+        for attr in attrs:
Surely this only applies to a single tag name, not all, and can be handled in the appropriate handle_* method
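To illustrate the suggestion, here is a minimal sketch that reads the attribute only when the REUTERS tag opens, instead of looping over the attributes of every tag. SplitParser and its splits list are hypothetical names for this cut-down illustration; the real parser in the example does much more, and I am assuming the per-tag hook is named start_reuters, following the 'start_' + tag dispatch shown in the diff:

```python
from html.parser import HTMLParser


class SplitParser(HTMLParser):
    """Hypothetical, cut-down parser that records only LEWISSPLIT values."""

    def __init__(self):
        super().__init__()
        self.lewis_split = None
        self.splits = []

    def handle_starttag(self, tag, attrs):
        # Dispatch to start_<tag> when it exists; other tags are ignored.
        getattr(self, "start_" + tag, lambda attrs: None)(attrs)

    def start_reuters(self, attrs):
        # HTMLParser lowercases attribute names, so LEWISSPLIT arrives
        # as 'lewissplit'; the value keeps its original case.
        self.lewis_split = dict(attrs).get("lewissplit")
        self.splits.append(self.lewis_split)


parser = SplitParser()
parser.feed('<REUTERS LEWISSPLIT="TRAIN">first</REUTERS>'
            '<REUTERS LEWISSPLIT="TEST">second</REUTERS>')
print(parser.splits)  # ['TRAIN', 'TEST']
```

This keeps handle_starttag generic and puts the tag-specific logic in one place, under a conventional snake_case attribute name.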
Ping.
 """

-DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
+download_url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
Please don't change things like this
Yes, that's true. I need to improve my skills.
Thanks for your guidance.
@@ -140,20 +147,22 @@ def end_d(self):
         self.topic_d = ""


-def stream_reuters_documents(data_path=None):
+def stream_reuters_documents(data_path=None, train_test="TRAIN"):
Perhaps rename train_test to subset
subset is appropriate, nice!
data_stream_train = stream_reuters_documents(train_test="TRAIN")

# Test Datasets
data_stream_test = stream_reuters_documents(train_test="TEST")
I've now realised that we are doing two passes through the stream, which kind of defeats the purpose. Either the test set is a prefix, or we collect it while passing through the stream.
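The second option, collecting the test set during a single pass, could be sketched roughly as below. It assumes each document dict carries its split label under a hypothetical 'lewis_split' key, and split_stream is an illustrative name rather than anything in the example today:

```python
def split_stream(doc_iter, test_docs):
    """Yield training documents; append test documents to ``test_docs``.

    Documents whose label is neither "TRAIN" nor "TEST" are skipped.
    """
    for doc in doc_iter:
        if doc["lewis_split"] == "TEST":
            test_docs.append(doc)
        elif doc["lewis_split"] == "TRAIN":
            yield doc


docs = [
    {"lewis_split": "TRAIN", "id": 1},
    {"lewis_split": "TEST", "id": 2},
    {"lewis_split": "NOT-USED", "id": 3},
    {"lewis_split": "TRAIN", "id": 4},
]
held_out = []
train_ids = [doc["id"] for doc in split_stream(iter(docs), held_out)]
```

One caveat: held_out is complete only once the training stream is exhausted, so accuracy measured during training would see only the test documents encountered so far.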
No comment?
X_TextTest, y_test = zip(*data_test)
please, camel case does not belong here.
Greetings!
Your first comment asked me to use underscores rather than camelCase.
So, for consistency, I did the same here.
Thank you.
yes, but you still have camel case here
Would you like help from someone else to complete this? I like the idea of reusing standard train-test splits, but we need to maintain the code quality as well.
 from sklearn.datasets import get_data_home
+from sklearn.externals.six.moves import html_parser
We no longer support Python 2, so please just use html.parser
@@ -70,6 +72,9 @@ def __init__(self, encoding='latin-1'):

     def handle_starttag(self, tag, attrs):
         method = 'start_' + tag
+        for attr in attrs:
Ping.
+    if not len(data_train):
+        return np.asarray([], dtype=int), np.asarray([], dtype=int)
+
+    X_text_test, y_train = zip(*data_train)
I'm confused by the mix of the words "test" and "train" here
 total_vect_time = 0.0

 # Main loop : iterate on mini-batches of examples
-for i, (X_train_text, y_train) in enumerate(minibatch_iterators):
+for i, (X_TrainText, y_train) in enumerate(minibatch_iterators):
please, no camel case.
Dear Sir,
Greetings of the day!!
Thanks for your valuable guidance.
Yes, I need help to deliver quality code, so that I can learn more about
writing quality code.
Thanks,
Parag
@paragmoteria could you please address the reviewers' comments and push the required changes to the branch of your PR accordingly? If you do not understand what is required, please ask specific questions so that we can help you.