New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add nagoya-university-conversation-corpus #168
Add nagoya-university-conversation-corpus #168
Conversation
In general, it's OK if it's needed.
Could you show candidate sentences? |
I thought that we can not determine the order of sentences from dataset created by only raw data. How about use following sentences which is in page top(https://mmsrv.ninjal.ac.jp/nucc/)? I could not find English description.
|
Thanks. I also took a look at the implementation.
How about translating "『名大会話コーパス』は,129会話, 合計約100時間の日本語母語話者同士の雑談を文字化したコーパスです。 " and using it? It seems that we can omit "科学研究費基盤研究(B)(2)「日本語学習辞書編纂に向けた電子化コーパス利用によるコロケーション研究」 (平成13年度~15年度 研究代表者 大曽美恵子)の一環として作成された" and "現在は国立国語研究所に移管され,文字化テキストなどを公開しています。" from dataset description. Users can find them at the dataset URL. |
I understand that we don't need this parameter.
I think your idea is appropriate too. Thank you. |
sentence.participant_id, | ||
sentence.content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sentence.participant_id, | |
sentence.content | |
sentence.participant_id, | |
sentence.content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out.
data_path = cache_dir_path + 'nucc.zip' | ||
data_url = 'https://mmsrv.ninjal.ac.jp/nucc/nucc.zip' | ||
download(data_path, data_url) | ||
zip_file = Zip::File.open(data_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use ZipExtractor
?
Do we need to improve ZipExtractor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used ZipExtractor. Thank you for letting me know.
|
||
text_file.get_input_stream.each do |input| | ||
input.each_line(chomp: true) do |line| | ||
line = line.force_encoding('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line = line.force_encoding('utf-8') | |
line.force_encoding('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought mistakely that this method is not destructive. Thank you.
text_file.get_input_stream.each do |input| | ||
input.each_line(chomp: true) do |line| | ||
line = line.force_encoding('utf-8') | ||
if line.include?('@データ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use start_with?
instead of include?
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know this method, and we can use this. Thank you.
elsif line.include?('@参加者') && !line.include?('参加者の関係') | ||
participant = Participant.new | ||
temp_id, temp_profiles = line.split(':') | ||
participant.id = temp_id[4..] | ||
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、') | ||
|
||
participants << participant | ||
elsif line.include?('@参加者の関係') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about swap these conditions to simplify?
elsif line.include?("@参加者の関係")
# ...
elsif line.include?("@参加者")
# ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not notice this and I think this idea is nice. Thank you.
data.place = line[4..] | ||
elsif line.include?('@参加者') && !line.include?('参加者の関係') | ||
participant = Participant.new | ||
temp_id, temp_profiles = line.split(':') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you specify "how many items" explicitly to avoid too much split?
temp_id, temp_profiles = line.split(':') | |
temp_id, temp_profiles = line.split(':', 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know this argument. Thanks.
54a6cfe
to
5a5e441
Compare
@kou |
data.name = line[1..] | ||
elsif line.start_with?('@収集年月日') | ||
# mixed cases with and without':' | ||
data.date = line[6..].delete(':') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use delete_prefix
here?
data.date = line[6..].delete(':') | |
data.date = line[6..].delete_prefix(':') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for telling me this.
temp_id, temp_profiles = line.split(':', 2) | ||
participant.id = temp_id[4..] | ||
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this?
temp_id, temp_profiles = line.split(':', 2) | |
participant.id = temp_id[4..] | |
participant.attribute, participant.birthplace, participant.residence = temp_profiles.split('、') | |
participant.id, profiles = line[4..].split(':', 2) | |
participant.attribute, participant.birthplace, participant.residence = profiles.split('、', 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out.
@dataset = Datasets::NagoyaUniversityConversationCorpus.new | ||
end | ||
|
||
test('#each_sentences') do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use the following style?
sub_test_case("#each") do
test("#sentences") do
# ...
end
test("#participants") do
# ...
end
test("others") do
# ...
end
end
We use #XXX
for instance method name because the notation is widely used in Ruby documents.
So we don't want to use #XXX
for non-instance method such as #each_sentences
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. I understand.
first_sentences = @dataset.each.to_a[0].sentences.to_a | ||
last_sentences = @dataset.each.to_a[-1].sentences.to_a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about defining a variable to avoid multiple parsing?
first_sentences = @dataset.each.to_a[0].sentences.to_a | |
last_sentences = @dataset.each.to_a[-1].sentences.to_a | |
records = @dataset.each.to_a | |
first_sentences = records[0].sentences.to_a | |
last_sentences = records[-1].sentences.to_a |
And can we remove .to_a
for sentences
?
first_sentences = @dataset.each.to_a[0].sentences.to_a | |
last_sentences = @dataset.each.to_a[-1].sentences.to_a | |
records = @dataset.each.to_a | |
first_sentences = records[0].sentences | |
last_sentences = records[-1].sentences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out.
participants << participant | ||
elsif line.start_with?('%com') | ||
data.note = line.split(':', 2)[1] | ||
elsif line.start_with?('@END') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elsif line.start_with?('@END') | |
elsif line == '@END' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out.
elsif line.start_with?('@END') | ||
sentence = Sentence.new | ||
sentence.participant_id = nil | ||
sentence.content = '@END' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding Sentence#end?
that returns true
only when sentence.participant_id.nil? and sentence.content.nil?
instead of setting '@END'
content?
sentence.content = '@END' | |
sentence.content = nil |
with
Sentence = Struct.new(:participant_id, :content) do
def end?
participant_id.nil? and content.nil?
end
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know such expression. And I think this UI is good. Thank you.
zip_file = Zip::File.open(data_path) | ||
zip_file.each do |entry| | ||
next unless entry.file? | ||
ZipExtractor.new(data_path).extract_file(entry.name) do |input_stream| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we need to add ZipExtractor#extract_files
for this use case:
class ZipExtractor
def extract_files
Zip::File.open(@path) do |zip_file|
zip_file.each do |entry|
next unless entry.file?
entry.get_input_stream do |input|
yield(input)
end
end
end
end
end
Then we can use one ZipExtractor
object for this:
extractor = ZipExtractor.new(data_path)
extractor.extract_files do |input_stream|
yield(input_stream)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will use this method. Thank you.
@kou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It almost looks good to me!
Could you check my comment?
assert_equal([ | ||
129, | ||
[ | ||
'データ1(約35分)', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep データ
?
It seems that all name
has データ
prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou
I think that the 'データ' string does not hold any particular information, so it can be removed.
May I remove this string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou
I removed 'データ'. Please check again at your convenience.
Thanks! |
Thank you! |
Fix GH-48
I would like to confirm the following.