Add new datasets for text classification. #557

Merged

Conversation

zhangguanheng66
Contributor

Add a few supervised learning datasets, including
- AG_NEWS
- SogouNews
- DBpedia
- YelpReviewPolarity
- YelpReviewFull
- YahooAnswers
- AmazonReviewPolarity
- AmazonReviewFull

def preprocess(raw_folder, processed_folder, dataset_name):
    """Preprocess the csv files."""

def text_normalize(src_filepath, tgt_filepath):
Contributor

I think you can split this out as well. It seems useful in its own right. But instead of taking a filepath, it can take a string and return a string (or a list of strings).
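For illustration, a string-in/string-out version along those lines might look like the sketch below; the specific rules (lowercasing and padding punctuation, roughly in the fastText style) are assumptions for the example, not the code in this PR.

import re

def text_normalize(line):
    # Illustrative normalization: lowercase and pad punctuation with spaces,
    # roughly in the style of fastText's preprocessing script.
    line = line.lower()
    line = re.sub(r"([.!?,'/()])", r" \1 ", line)
    return re.sub(r"\s+", " ", line).strip()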

Contributor Author

Sure thing. Will do it.

@zhangguanheng66 changed the title from "Add new datasets for text classification." to "[WIP] Add new datasets for text classification." Jul 14, 2019
return line


def preprocess(raw_folder, processed_folder, dataset_name):
Contributor

import os

for filename in os.listdir('dirname'):
    path = os.path.join('dirname', filename)
    with open(path, 'r') as a, open(path + '.processed', 'w') as b:
        b.writelines(text_normalize(line) for line in a)

Contributor

I'd delete this and expect the user to know how to do this

Contributor

Maybe make it an example / add it to the tutorial

Contributor Author
@zhangguanheng66 Jul 22, 2019

I think we could keep it as an internal function.
I would like to support one-line data loading for the supervised learning datasets, so we have to minimize the work on the user's side.
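As a sketch, one-line loading from the user's side could look like the following; the exact entry point and keyword names here are illustrative assumptions, not necessarily the merged API.

from torchtext.datasets import text_classification

# Hypothetical one-liner: download, extract and preprocess AG_NEWS,
# returning ready-to-use train/test datasets.
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=2)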

print('Dataset %s downloaded.' % dataset_name)


def text_normalize(line):
Contributor

Maybe "basic_normalization_english" or such? Is there a reference here (maybe within a paper)? Can be immediately moved into c++. It's very useful in general and we can indicate that.

Contributor Author
@zhangguanheng66 Jul 22, 2019

The supervised learning paper didn't mention how they did the pre-processing; fastText just does this pre-processing in its bash script.
I think we could talk with some NLP people and implement a basic pre-processor in C++.

from torchtext.data.iterator import generate_iterators


def download(url, raw_folder, dataset_name):
Contributor

extract_archive(
    download_from_url(torchtext.datasets.supervised.urls['AG_NEWS']),  # returns a path in the cache
    "path/to/AG_NEWS/folder"
)

Contributor

I don't think this necessarily needs to be its own function.

Contributor Author

Minor change: put download and extract_archive together.

return examples


def iters(train_examples, test_examples, fields, sort_key,
Contributor

Let's add batching separately if we see that we need it.
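If batching does get added later, a minimal user-side sketch could be a plain DataLoader with a padding collate function. The example below assumes each example is a (label, token-id tensor) pair and that train_dataset comes from a loading call like the one sketched above; all names are illustrative.

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    # Assumes each example is a (label, token_id_tensor) pair.
    labels = torch.tensor([label for label, _ in batch])
    texts = pad_sequence([text for _, text in batch], batch_first=True)
    return labels, texts

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                          collate_fn=collate_batch)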

@zhangguanheng66 force-pushed the new_supervised_learning_dataset branch from c8f1e9d to 8fa87d8 on July 24, 2019 14:25
@cpuhrsch merged commit 844f403 into pytorch:master Jul 24, 2019
@zhangguanheng66 changed the title from "[WIP] Add new datasets for text classification." to "Add new datasets for text classification." Jul 24, 2019
@zhangguanheng66 deleted the new_supervised_learning_dataset branch November 25, 2019 15:29