Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nagoya-university-conversation-corpus #168

Conversation

abcdefg-1234567
Copy link
Contributor

@abcdefg-1234567 abcdefg-1234567 commented Apr 8, 2023

Fix GH-48

I would like to confirm the following.

  • Is it ok adding 'sentence_order_number' which is not in raw data?
  • Which sentence should we use as the description(metadata)?

@kou
Copy link
Member

kou commented Apr 8, 2023

  • Is it ok adding 'sentence_order_number' which is not in raw data?

In general, it's OK if it's needed.
Could you explain about sentence_order_number?

  • Which sentence should we use as the description(metadata)?

Could you show candidate sentences?

@abcdefg-1234567
Copy link
Contributor Author

I thought that we can not determine the order of sentences from dataset created by only raw data.
So, I thought about adding this parameter to determine the order.

How about use following sentences which is in page top(https://mmsrv.ninjal.ac.jp/nucc/)? I could not find English description.

『名大会話コーパス』は,科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」 (平成13年度~15年度 研究代表者 大曽美恵子)の一環として作成された,129会話, 合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。 現在は国立国語研究所に移管され,文字化テキストなどを公開しています。

@kou
Copy link
Member

kou commented Apr 9, 2023

I thought that we can not determine the order of sentences from dataset created by only raw data. So, I thought about adding this parameter to determine the order.

Thanks. I also took a look at the implementation.
It seems that we can get the information by data.sentences.each_with_index. So I think that we don't need the additional information explicitly.

How about use following sentences which is in page top(https://mmsrv.ninjal.ac.jp/nucc/)? I could not find English description.

『名大会話コーパス』は,科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」 (平成13年度~15年度 研究代表者 大曽美恵子)の一環として作成された,129会話, 合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。 現在は国立国語研究所に移管され,文字化テキストなどを公開しています。

How about translating "『名大会話コーパス』は,129会話, 合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。 " and using it? It seems that we can omit "科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」 (平成13年度~15年度 研究代表者 大曽美恵子)の一環として作成された" and "現在は国立国語研究所に移管され,文字化テキストなどを公開しています。" from dataset description. Users can find them at the dataset URL.

@abcdefg-1234567
Copy link
Contributor Author

Thanks. I also took a look at the implementation.
It seems that we can get the information by data.sentences.each_with_index. So I think that we don't need the additional information explicitly.

I understand that we don't need this parameter.
And, I will remove this parameter. Thank you.

How about translating "『名大会話コーパス』は,129会話, 合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。 " and using it? It seems that we can omit "科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」 (平成13年度~15年度 研究代表者 大曽美恵子)の一環として作成された" and "現在は国立国語研究所に移管され,文字化テキストなどを公開しています。" from dataset description. Users can find them at the dataset URL.

I think your idea is appropriate too. Thank you.

@abcdefg-1234567 abcdefg-1234567 marked this pull request as ready for review April 20, 2023 00:35
Comment on lines 10 to 11
sentence.participant_id,
sentence.content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sentence.participant_id,
sentence.content
sentence.participant_id,
sentence.content

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out.

data_path = cache_dir_path + 'nucc.zip'
data_url = 'https://mmsrv.ninjal.ac.jp/nucc/nucc.zip'
download(data_path, data_url)
zip_file = Zip::File.open(data_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ZipExtractor?
Do we need to improve ZipExtractor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used ZipExtractor. Thank you for letting me know.


text_file.get_input_stream.each do |input|
input.each_line(chomp: true) do |line|
line = line.force_encoding('utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
line = line.force_encoding('utf-8')
line.force_encoding('utf-8')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought mistakely that this method is not destructive. Thank you.

text_file.get_input_stream.each do |input|
input.each_line(chomp: true) do |line|
line = line.force_encoding('utf-8')
if line.include?('@データ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use start_with? instead of include? here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know this method, and we can use this. Thank you.

Comment on lines 76 to 83
elsif line.include?('@参加者') && !line.include?('参加者の関係')
participant = Participant.new
temp_id, temp_profiles = line.split(':')
participant.id = temp_id[4..]
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')

participants << participant
elsif line.include?('@参加者の関係')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about swap these conditions to simplify?

elsif line.include?("@参加者の関係")
  # ...
elsif line.include?("@参加者")
  # ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice this and I think this idea is nice. Thank you.

data.place = line[4..]
elsif line.include?('@参加者') && !line.include?('参加者の関係')
participant = Participant.new
temp_id, temp_profiles = line.split(':')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify "how many items" explicitly to avoid too much split?

Suggested change
temp_id, temp_profiles = line.split(':')
temp_id, temp_profiles = line.split(':', 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know this argument. Thanks.

@abcdefg-1234567 abcdefg-1234567 force-pushed the features/nagoya-university-conversation-corpus branch from 54a6cfe to 5a5e441 Compare April 22, 2023 08:48
@abcdefg-1234567
Copy link
Contributor Author

@kou
Thank you for your comment above.
Could you please check again when it is convenient for you?

data.name = line[1..]
elsif line.start_with?('@収集年月日')
# mixed cases with and without':'
data.date = line[6..].delete(':')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use delete_prefix here?

Suggested change
data.date = line[6..].delete(':')
data.date = line[6..].delete_prefix(':')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for telling me this.

Comment on lines 84 to 86
temp_id, temp_profiles = line.split(':', 2)
participant.id = temp_id[4..]
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify this?

Suggested change
temp_id, temp_profiles = line.split(':', 2)
participant.id = temp_id[4..]
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、')
participant.id, profiles = line[4..].split(':', 2)
participant.attribute, participant.birthplace, participant.residence = profiles.split('、', 3)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out.

@dataset = Datasets::NagoyaUniversityConversationCorpus.new
end

test('#each_sentences') do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use the following style?

sub_test_case("#each") do
  test("#sentences") do
    # ...
  end

  test("#participants") do
    # ...
  end

  test("others") do
     # ...
  end
end

We use #XXX for instance method name because the notation is widely used in Ruby documents.
So we don't want to use #XXX for non-instance method such as #each_sentences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I understand.

Comment on lines 7 to 8
first_sentences = @dataset.each.to_a[0].sentences.to_a
last_sentences = @dataset.each.to_a[-1].sentences.to_a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about defining a variable to avoid multiple parsing?

Suggested change
first_sentences = @dataset.each.to_a[0].sentences.to_a
last_sentences = @dataset.each.to_a[-1].sentences.to_a
records = @dataset.each.to_a
first_sentences = records[0].sentences.to_a
last_sentences = records[-1].sentences.to_a

And can we remove .to_a for sentences?

Suggested change
first_sentences = @dataset.each.to_a[0].sentences.to_a
last_sentences = @dataset.each.to_a[-1].sentences.to_a
records = @dataset.each.to_a
first_sentences = records[0].sentences
last_sentences = records[-1].sentences

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out.

participants << participant
elsif line.start_with?('%com')
data.note = line.split(':', 2)[1]
elsif line.start_with?('@END')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elsif line.start_with?('@END')
elsif line == '@END'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out.

elsif line.start_with?('@END')
sentence = Sentence.new
sentence.participant_id = nil
sentence.content = '@END'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding Sentence#end? that returns true only when sentence.participant_id.nil? and sentence.content.nil? instead of setting '@END' content?

Suggested change
sentence.content = '@END'
sentence.content = nil

with

Sentence = Struct.new(:participant_id, :content) do
  def end?
    participant_id.nil? and content.nil?
  end
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know such expression. And I think this UI is good. Thank you.

zip_file = Zip::File.open(data_path)
zip_file.each do |entry|
next unless entry.file?
ZipExtractor.new(data_path).extract_file(entry.name) do |input_stream|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we need to add ZipExtractor#extract_files for this use case:

class ZipExtractor
  def extract_files
    Zip::File.open(@path) do |zip_file|
      zip_file.each do |entry|
        next unless entry.file?

        entry.get_input_stream do |input|
          yield(input)
        end
      end
    end
  end
end

Then we can use one ZipExtractor object for this:

extractor = ZipExtractor.new(data_path)
extractor.extract_files do |input_stream|
  yield(input_stream)
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use this method. Thank you.

@abcdefg-1234567
Copy link
Contributor Author

@kou
Will you please check this pull request again?

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It almost looks good to me!
Could you check my comment?

assert_equal([
129,
[
'データ1(約35分)',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep データ?
It seems that all name has データ prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
I think that the 'データ' string does not hold any particular information, so it can be removed.
May I remove this string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
I removed 'データ'. Please check again at your convenience.

@kou kou merged commit 706b288 into red-data-tools:master May 8, 2023
9 checks passed
@kou
Copy link
Member

kou commented May 8, 2023

Thanks!

@abcdefg-1234567
Copy link
Contributor Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Nagoya University Conversation Corpus
2 participants