Add nagoya-university-conversation-corpus #168

abcdefg-1234567 · 2023-04-08T00:58:01Z

Fix GH-48

I would like to confirm the following.

Is it OK to add 'sentence_order_number', which is not in the raw data?
Which sentence should we use as the description (metadata)?

kou · 2023-04-08T07:10:43Z

Is it OK to add 'sentence_order_number', which is not in the raw data?

In general, it's OK if it's needed.
Could you explain about sentence_order_number?

Which sentence should we use as the description (metadata)?

Could you show candidate sentences?

abcdefg-1234567 · 2023-04-08T07:53:14Z

I thought that we could not determine the order of sentences from a dataset created from the raw data alone.
So I thought about adding this parameter to determine the order.

How about using the following sentences, which are at the top of the page (https://mmsrv.ninjal.ac.jp/nucc/)? I could not find an English description.

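『名大会話コーパス』は，科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」（平成13年度～15年度　研究代表者　大曽美恵子）の一環として作成された，129会話，合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。現在は国立国語研究所に移管され，文字化テキストなどを公開しています。

[English translation: The Nagoya University Conversation Corpus (名大会話コーパス) was created as part of the Grant-in-Aid for Scientific Research (B)(2) project "A Study of Collocations Using Electronic Corpora toward Compiling a Japanese Learner's Dictionary" (fiscal years 2001 to 2003, principal investigator: 大曽美恵子). It is a corpus of transcribed casual conversations between native Japanese speakers, comprising 129 conversations and about 100 hours in total. It has since been transferred to the National Institute for Japanese Language and Linguistics (国立国語研究所), which publishes the transcripts and related materials.]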

kou · 2023-04-09T01:21:51Z

I thought that we could not determine the order of sentences from a dataset created from the raw data alone. So I thought about adding this parameter to determine the order.

Thanks. I also took a look at the implementation.
It seems that we can get the information by data.sentences.each_with_index. So I think that we don't need the additional information explicitly.
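
A minimal sketch of the point above, assuming each record exposes `sentences` as an ordinary Array. The `SampleSentence` and `SampleRecord` structs and their values are hypothetical stand-ins, not the library's classes. It suggests that the position is already available from `each_with_index`, so a stored order field may be redundant:

```ruby
# Hypothetical stand-ins for the dataset's record types.
SampleSentence = Struct.new(:participant_id, :content)
SampleRecord = Struct.new(:name, :sentences)

data = SampleRecord.new("sample", [
  SampleSentence.new("F107", "first"),
  SampleSentence.new("F023", "second"),
])

# The index supplied by each_with_index plays the role that
# sentence_order_number would have played.
ordered = data.sentences.each_with_index.map do |sentence, index|
  [index, sentence.participant_id, sentence.content]
end
```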

How about using the following sentences, which are at the top of the page (https://mmsrv.ninjal.ac.jp/nucc/)? I could not find an English description.

『名大会話コーパス』は，科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」（平成13年度～15年度　研究代表者　大曽美恵子）の一環として作成された，129会話，合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。現在は国立国語研究所に移管され，文字化テキストなどを公開しています。

How about translating "『名大会話コーパス』は，129会話，合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。" and using it? It seems that we can omit "科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」（平成13年度～15年度　研究代表者　大曽美恵子）の一環として作成された" and "現在は国立国語研究所に移管され，文字化テキストなどを公開しています。" from the dataset description. Users can find them at the dataset URL.

abcdefg-1234567 · 2023-04-09T04:00:26Z

Thanks. I also took a look at the implementation.
It seems that we can get the information by data.sentences.each_with_index. So I think that we don't need the additional information explicitly.

I understand that we don't need this parameter, so I will remove it. Thank you.

How about translating "『名大会話コーパス』は，129会話，合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。" and using it? It seems that we can omit "科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」（平成13年度～15年度　研究代表者　大曽美恵子）の一環として作成された" and "現在は国立国語研究所に移管され，文字化テキストなどを公開しています。" from the dataset description. Users can find them at the dataset URL.

I think your idea is appropriate too. Thank you.

kou · 2023-04-20T08:17:25Z

example/nagoya-university-conversation-corpus.rb

+      sentence.participant_id,
+      sentence.content


Suggested change

-      sentence.participant_id,
-      sentence.content
+      sentence.participant_id,
+      sentence.content

Thanks for pointing that out.

kou · 2023-04-20T08:18:42Z

lib/datasets/nagoya-university-conversation-corpus.rb

+      data_path = cache_dir_path + 'nucc.zip'
+      data_url = 'https://mmsrv.ninjal.ac.jp/nucc/nucc.zip'
+      download(data_path, data_url)
+      zip_file = Zip::File.open(data_path)


Could you use ZipExtractor?
Do we need to improve ZipExtractor?

I used ZipExtractor. Thank you for letting me know.

kou · 2023-04-20T08:19:34Z

lib/datasets/nagoya-university-conversation-corpus.rb

+
+      text_file.get_input_stream.each do |input|
+        input.each_line(chomp: true) do |line|
+          line = line.force_encoding('utf-8')


Suggested change

-          line = line.force_encoding('utf-8')
+          line.force_encoding('utf-8')

I mistakenly thought that this method was not destructive. Thank you.
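
A short sketch of why the reassignment is unnecessary: `String#force_encoding` changes the receiver in place and returns that same object. The binary-tagged sample string is an assumption made for illustration.

```ruby
# Bytes of a header line tagged as binary, as they might arrive from a raw stream.
line = "＠データ".b

# force_encoding mutates the receiver and returns it, so no reassignment is needed.
returned = line.force_encoding("utf-8")

same_object = returned.equal?(line)
encoding_after = line.encoding
```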

kou · 2023-04-20T08:19:51Z

lib/datasets/nagoya-university-conversation-corpus.rb

+      text_file.get_input_stream.each do |input|
+        input.each_line(chomp: true) do |line|
+          line = line.force_encoding('utf-8')
+          if line.include?('＠データ')


Can we use start_with? instead of include? here?

I did not know this method. We can use it here. Thank you.
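
A small sketch of the difference, using made-up sample lines (the utterance line in particular is hypothetical): `include?` matches the marker anywhere in the line, while `start_with?` matches only header lines that begin with it.

```ruby
MARKER = "＠データ"

header_line = "＠データ１（約３５分）"
utterance_line = "F107：＠データって何？" # hypothetical utterance mentioning the marker

header_by_prefix = header_line.start_with?(MARKER)
utterance_by_substring = utterance_line.include?(MARKER)  # false positive
utterance_by_prefix = utterance_line.start_with?(MARKER)
```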

kou · 2023-04-20T08:21:46Z

lib/datasets/nagoya-university-conversation-corpus.rb

+          elsif line.include?('＠参加者') && !line.include?('参加者の関係')
+            participant = Participant.new
+            temp_id, temp_profiles = line.split('：')
+            participant.id = temp_id[4..]
+            participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')
+
+            participants << participant
+          elsif line.include?('＠参加者の関係')


How about swapping these conditions to simplify?

elsif line.include?("＠参加者の関係")
  # ...
elsif line.include?("＠参加者")
  # ...

I did not notice this, and I think this idea is nice. Thank you.
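
A sketch of the reordering idea, assuming a hypothetical `classify` helper and made-up sample lines: checking the longer marker first means the shorter one no longer needs a negated second condition.

```ruby
# Hypothetical helper: the more specific marker is tested first,
# so the "＠参加者" branch needs no extra exclusion.
def classify(line)
  if line.start_with?("＠参加者の関係")
    :relationship
  elsif line.start_with?("＠参加者")
    :participant
  else
    :other
  end
end

relationship = classify("＠参加者の関係：友人")
participant = classify("＠参加者F107：女性３０代後半、愛知県幸田町出身、愛知県幸田町在住")
other = classify("F107：はい。")
```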

kou · 2023-04-20T08:22:26Z

lib/datasets/nagoya-university-conversation-corpus.rb

+            data.place = line[4..]
+          elsif line.include?('＠参加者') && !line.include?('参加者の関係')
+            participant = Participant.new
+            temp_id, temp_profiles = line.split('：')


Could you specify the number of items explicitly to avoid splitting too much?

Suggested change

-            temp_id, temp_profiles = line.split('：')
+            temp_id, temp_profiles = line.split('：', 2)

I did not know this argument. Thanks.
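
A sketch of what the limit argument changes, using a hypothetical line that contains a second '：': without the limit, the two-variable assignment would silently drop everything after the second separator.

```ruby
# Hypothetical line with a second '：' inside the profile part.
line = "＠参加者F107：女性：３０代"

unlimited = line.split("：")
limited = line.split("：", 2)

# With two targets, only the limited form keeps the full remainder.
_id, profiles = limited
```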

abcdefg-1234567 · 2023-04-22T11:31:45Z

@kou
Thank you for your comment above.
Could you please check again when it is convenient for you?

kou · 2023-04-22T20:26:09Z

lib/datasets/nagoya-university-conversation-corpus.rb

+            data.name = line[1..]
+          elsif line.start_with?('＠収集年月日')
+            # mixed cases with and without'：'
+            data.date = line[6..].delete('：')


Should we use delete_prefix here?

Suggested change

-            data.date = line[6..].delete('：')
+            data.date = line[6..].delete_prefix('：')

Thank you for telling me this.
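
A sketch contrasting the two methods on made-up values: `String#delete` removes every occurrence of the character, whereas `String#delete_prefix` removes only a leading one and leaves the string unchanged when the prefix is absent. The sample lines are illustrative assumptions.

```ruby
with_colon = "＠収集年月日：２００１年１０月１６日"
without_colon = "＠収集年月日２００１年１０月１６日"

# "＠収集年月日" is 6 characters long, so [6..] keeps the rest of the line.
date_a = with_colon[6..].delete_prefix("：")
date_b = without_colon[6..].delete_prefix("：")

# Hypothetical value with an inner '：' to show the difference.
inner = "：１０：３０"
deleted_everywhere = inner.delete("：")
deleted_prefix_only = inner.delete_prefix("：")
```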

kou · 2023-04-22T20:27:55Z

lib/datasets/nagoya-university-conversation-corpus.rb

+            temp_id, temp_profiles = line.split('：', 2)
+            participant.id = temp_id[4..]
+            participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')


Can we simplify this?

Suggested change

-            temp_id, temp_profiles = line.split('：', 2)
-            participant.id = temp_id[4..]
-            participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')
+            participant.id, profiles = line[4..].split('：', 2)
+            participant.attribute, participant.birthplace, participant.residence = profiles.split('、', 3)

Thanks for pointing that out.
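
A self-contained sketch of the simplified form. The `Participant` struct below is an assumption modelled on the fields used in the diff, and the sample line is illustrative:

```ruby
# Assumed shape of Participant, based on the attributes assigned in the diff.
Participant = Struct.new(:id, :attribute, :birthplace, :residence)

line = "＠参加者F107：女性３０代後半、愛知県幸田町出身、愛知県幸田町在住"

participant = Participant.new
# Drop the 4-character "＠参加者" marker, then split once into id and profiles.
participant.id, profiles = line[4..].split("：", 2)
# Split the profiles into at most three fields.
participant.attribute, participant.birthplace, participant.residence =
  profiles.split("、", 3)
```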

kou · 2023-04-22T20:33:20Z

test/test-nagoya-university-conversation-corpus.rb

+    @dataset = Datasets::NagoyaUniversityConversationCorpus.new
+  end
+
+  test('#each_sentences') do


Could you use the following style?

sub_test_case("#each") do
  test("#sentences") do
    # ...
  end

  test("#participants") do
    # ...
  end

  test("others") do
    # ...
  end
end

We use #XXX for instance method names because the notation is widely used in Ruby documentation.
So we don't want to use #XXX for something that is not an instance method, such as #each_sentences.

Thanks for the explanation. I understand.

kou · 2023-04-22T20:35:01Z

test/test-nagoya-university-conversation-corpus.rb

+    first_sentences = @dataset.each.to_a[0].sentences.to_a
+    last_sentences = @dataset.each.to_a[-1].sentences.to_a


How about defining a variable to avoid multiple parsing?

Suggested change

-    first_sentences = @dataset.each.to_a[0].sentences.to_a
-    last_sentences = @dataset.each.to_a[-1].sentences.to_a
+    records = @dataset.each.to_a
+    first_sentences = records[0].sentences.to_a
+    last_sentences = records[-1].sentences.to_a

And can we remove .to_a for sentences?

Suggested change

-    first_sentences = @dataset.each.to_a[0].sentences.to_a
-    last_sentences = @dataset.each.to_a[-1].sentences.to_a
+    records = @dataset.each.to_a
+    first_sentences = records[0].sentences
+    last_sentences = records[-1].sentences

Thanks for pointing that out.
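
A sketch of the cost being avoided, using a hypothetical `CountingDataset` whose `each` counts how often it runs (a stand-in for re-parsing the corpus): calling `each.to_a` twice iterates twice, while keeping the array in a variable iterates once.

```ruby
# Hypothetical dataset that records how many times it is fully iterated.
class CountingDataset
  include Enumerable

  attr_reader :parse_count

  def initialize
    @parse_count = 0
  end

  def each
    return to_enum(__method__) unless block_given?

    @parse_count += 1
    yield "first record"
    yield "last record"
  end
end

repeated = CountingDataset.new
repeated.each.to_a[0]
repeated.each.to_a[-1]

cached = CountingDataset.new
records = cached.each.to_a
first_record = records[0]
last_record = records[-1]
```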

kou · 2023-04-22T20:36:03Z

lib/datasets/nagoya-university-conversation-corpus.rb

+            participants << participant
+          elsif line.start_with?('％ｃｏｍ')
+            data.note = line.split('：', 2)[1]
+          elsif line.start_with?('＠ＥＮＤ')


Suggested change

-          elsif line.start_with?('＠ＥＮＤ')
+          elsif line == '＠ＥＮＤ'

Thanks for pointing that out.

kou · 2023-04-22T20:40:26Z

lib/datasets/nagoya-university-conversation-corpus.rb

+          elsif line.start_with?('＠ＥＮＤ')
+            sentence = Sentence.new
+            sentence.participant_id = nil
+            sentence.content = '＠ＥＮＤ'


How about adding Sentence#end? that returns true only when sentence.participant_id.nil? and sentence.content.nil?, instead of setting the content to '＠ＥＮＤ'?

Suggested change

-            sentence.content = '＠ＥＮＤ'
+            sentence.content = nil

with

Sentence = Struct.new(:participant_id, :content) do
  def end?
    participant_id.nil? and content.nil?
  end
end

I did not know that kind of expression, and I think this interface is good. Thank you.
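
A runnable sketch of the suggested struct with a usage example. The sample sentence values are made up for illustration:

```ruby
# Struct.new accepts a block, so helper methods can be defined inline.
Sentence = Struct.new(:participant_id, :content) do
  # True only for the end marker, which carries neither a speaker nor content.
  def end?
    participant_id.nil? and content.nil?
  end
end

end_marker = Sentence.new(nil, nil)
utterance = Sentence.new("F107", "はい。")

marker_is_end = end_marker.end?
utterance_is_end = utterance.end?
```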

kou · 2023-04-22T20:46:05Z

lib/datasets/nagoya-university-conversation-corpus.rb

+      zip_file = Zip::File.open(data_path)
+      zip_file.each do |entry|
+        next unless entry.file?
+        ZipExtractor.new(data_path).extract_file(entry.name) do |input_stream|


Ah, we need to add ZipExtractor#extract_files for this use case:

class ZipExtractor
  def extract_files
    Zip::File.open(@path) do |zip_file|
      zip_file.each do |entry|
        next unless entry.file?

        entry.get_input_stream do |input|
          yield(input)
        end
      end
    end
  end
end

Then we can use one ZipExtractor object for this:

extractor = ZipExtractor.new(data_path)
extractor.extract_files do |input_stream|
  yield(input_stream)
end

I will use this method. Thank you.

abcdefg-1234567 · 2023-05-07T23:51:04Z

@kou
Will you please check this pull request again?

kou

It almost looks good to me!
Could you check my comment?

kou · 2023-05-08T00:30:36Z

test/test-nagoya-university-conversation-corpus.rb

+      assert_equal([
+                     129,
+                     [
+                       'データ１（約３５分）',


Should we keep データ?
It seems that all names have the データ prefix.

@kou
I think that the 'データ' string does not hold any particular information, so it can be removed.
May I remove this string?

Yes, please.

@kou
I removed 'データ'. Please check again at your convenience.

kou · 2023-05-08T12:06:50Z

Thanks!

abcdefg-1234567 · 2023-05-08T22:41:28Z

Thank you!

abcdefg-1234567 marked this pull request as ready for review April 20, 2023 00:35

abcdefg-1234567 added 4 commits April 20, 2023 09:57

add nagoya-university-conversation-corpus

26fb267

remove sentence_order_number

e7032e1

add description and the test

a919273

fix example

5a5e441

kou reviewed Apr 20, 2023


abcdefg-1234567 force-pushed the features/nagoya-university-conversation-corpus branch from 54a6cfe to 5a5e441 Compare April 22, 2023 08:48

fix in responce to review

a44e29f

kou reviewed Apr 22, 2023


fix in responce to review

9129993

kou approved these changes May 8, 2023


fix in responce to review

5309bb1

kou merged commit 706b288 into red-data-tools:master May 8, 2023
9 checks passed
