Character choice in word entries (詞組用字問題) #28

Closed
laubonghaudoi opened this issue Apr 10, 2020 · 17 comments
Labels: discussion (Open to discussion | 大家一齊傾), help wanted (Extra attention is needed)

@laubonghaudoi
Member

laubonghaudoi commented Apr 10, 2020

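Our code table currently has over a hundred thousand words annotated with Jyutping. Some of these entries are cases of "the same word written with different Chinese characters". For example: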

aa3 je4	亞爺
aa3 je4	阿爺
saa1 bou1 aang1 caang1	沙煲甖罉
saa1 bou1 aang1 caang1	沙煲罌罉
saa1 bou1 aang1 caang1	砂煲甖罉
saa1 bou1 aang1 caang1	砂煲罌罉
saa1 bou1 aang1 caang1	砂煲罌𦉘	

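I think these entries need a clean-up pass to settle on one unified written form. Otherwise, typing one syllable sequence produces several candidates, which confuses users and also hinders the standardization of the written Cantonese system.

So I just ran a script that uses "completely identical syllables" as the criterion and extracted all such entries into this txt file, close to nine thousand lines in total. The first column is the syllables; after the comma come the Chinese-character words that have those syllables, separated by spaces.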
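A minimal sketch of this kind of grouping, not the actual script used for the file. It assumes each input line is `<jyutping><TAB><word>`, as in the examples in this comment; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_homophones(lines):
    """Group words by their full romanization string.

    Assumes each line is "<jyutping><TAB><word>" (an assumption based on
    the examples in this issue, not on the real table format).
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        jyutping, word = line.split("\t", 1)
        word = word.strip()
        if word not in groups[jyutping]:
            groups[jyutping].append(word)
    # Keep only romanizations shared by two or more distinct words.
    return {jp: words for jp, words in groups.items() if len(words) > 1}


def format_groups(groups):
    """One line per group: syllables, a comma, then space-separated words."""
    return [f"{jp},{' '.join(words)}" for jp, words in sorted(groups.items())]


# Hypothetical sample rows for illustration only.
sample = [
    "aa3 je4\t亞爺",
    "aa3 je4\t阿爺",
    "bei6 jim4\t避嫌",
    "bei6 jim4\t鼻炎",
    "nei5 hou2\t你好",
]
for row in format_groups(group_homophones(sample)):
    print(row)
```

Grouping on the whole romanization string catches variant spellings of one word, but it cannot tell them apart from unrelated homophones, so both kinds end up in the output.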

homophones.txt

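Because identical syllables were the only criterion, words that are "same sound, different meaning" got included as well, like the entries below: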

bat1 sik1	不息	不惜	不識	不適	
bei6 jim4	避嫌	鼻炎	

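So I propose that we first find people to remove the second kind of entry from this file, that is, the "same sound, different meaning" entries. Then we hold a meeting to discuss the standard written form of the "same word, different characters" entries and decide, for example, whether to write 「阿婆」 or 「亞婆」. Finally we clean up the code table according to that decision. What does everyone think?

laubonghaudoi added the help wanted (Extra attention is needed) and discussion (Open to discussion | 大家一齊傾) labels Apr 10, 2020
laubonghaudoi self-assigned this Apr 10, 2020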
@teristam

teristam commented May 1, 2020

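I think this is a very good proposal, and it would help a lot in establishing a written standard for Cantonese. For the first category, I think we could try to work with professors who specialize in Cantonese (e.g. Ben Sir?). Academia may already have a standard character set that we could simply follow.

As for the second category, is there a convenient way for people to help? For example, an Excel sheet where everyone fills in cells?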

@laubonghaudoi
Member Author

laubonghaudoi commented May 1, 2020

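@teristam Thank you very much for your help. Our working group discussed this at the last meeting. The current plan is to compile these words first, then open a new task on Amazon Mechanical Turk or a similar platform, put the entries awaiting annotation there, and ask volunteers to annotate them when they have time (unpaid; we have no money to give, so this runs purely on goodwill). Once annotation is done, we will meet again to discuss the standard written forms, and then clean up the entries accordingly.

We have not heard of any such material in academia. We only know of character pronunciation tables, not word lists. If one exists, where can it be found? I don't know whether Ben Sir is aware of our project; it would be best of all if academia could help.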

@ngokchaoho

ngokchaoho commented May 2, 2020

https://apps.itsc.cuhk.edu.hk/hanyu/Page/Cover.aspx
There is a word list here. It has only one entry for saa bou ngaang caang, though it says the word has no written form.

@laubonghaudoi
Member Author

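@teristam I tried Amazon Mechanical Turk and found that it does not support characters in the Unicode extension blocks, so we have to fall back on a primitive method like Excel. The sheet is here.

@ngokchaoho Thanks a lot. Is this list available as a csv file? A lookup-only interface is not much use.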

@ngokchaoho

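It has no CSV, and the footer says copyright is reserved. Last night I wrote a crawler in a private repo, but since the site claims copyright, I don't know whether it can be used here.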

@tanxpyox
Collaborator

tanxpyox commented May 3, 2020

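> https://apps.itsc.cuhk.edu.hk/hanyu/Page/Cover.aspx
> There is a word list here. It has only one entry for saa bou ngaang caang, though it says the word has no written form.

[image]

It actually gives saa¹ bou¹ aang¹ daang¹ (Cantonese has no syllable daang at all)... I wonder what its inclusion rules are?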

@laubonghaudoi
Member Author

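In short, we need a complete database in csv, txt or a similar format. A search box with only a lookup function is not much use, because there is no way to import it into our code table.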

@ngokchaoho

ngokchaoho commented May 3, 2020

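>> https://apps.itsc.cuhk.edu.hk/hanyu/Page/Cover.aspx
>> There is a word list here. It has only one entry for saa bou ngaang caang, though it says the word has no written form.
>
> [image]
>
> It actually gives saa¹ bou¹ aang¹ daang¹ (Cantonese has no syllable daang at all)... I wonder what its inclusion rules are?

There is an entry for saa bou naang caang that begins with 砂. That said, saa¹ bou¹ aang¹ daang¹ does look wrong; I don't know whether it is a typo on their side.

Their project description reads: "Therefore, this project collects student essays, local newspapers, dialect dictionaries and other materials, and tries to identify the differences between commonly used local dialect words and common Putonghua words. The dialect words shown here are a small part of the material collected by this project; visitors may try to identify the common Chinese words that correspond to the Cantonese words."

So it probably cannot help us: first because of copyright, second because it contains errors.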

@laubonghaudoi
Member Author

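@ngokchaoho Errors are normal; no dictionary can be perfect. The main problem right now is that we need its complete database, i.e. something like a csv or Excel file. Only then can we compare it against our existing code table data and see which entries are usable and which are not, which can serve as reference and which can be added.

We have already started assigning people to that Excel task. If you can find people to help, they are very welcome.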
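A rough sketch of the comparison a complete export would make possible, assuming both sources can be read as `(word, jyutping)` pairs. The function name, the bucket names and the sample rows are all hypothetical and are not taken from either dataset.

```python
def compare_word_lists(table_entries, external_entries):
    """Compare two collections of (word, jyutping) pairs.

    Returns three sorted lists: entries both sources agree on, words the
    external list has that the table lacks, and words present in both
    but with a different romanization.
    """
    table = set(table_entries)
    external = set(external_entries)
    table_words = {word for word, _ in table}
    return {
        "agree": sorted(table & external),
        "new_words": sorted(e for e in external if e[0] not in table_words),
        "reading_differs": sorted(
            e for e in external - table if e[0] in table_words
        ),
    }


# Made-up rows for illustration only.
table = [("阿爺", "aa3 je4"), ("砂煲罌罉", "saa1 bou1 aang1 caang1")]
external = [
    ("阿爺", "aa3 je4"),
    ("砂煲罌罉", "saa1 bou1 aang1 daang1"),
    ("鼻炎", "bei6 jim4"),
]
report = compare_word_lists(table, external)
for bucket, rows in report.items():
    print(bucket, rows)
```

The `reading_differs` bucket is the one that needs human review, since a mismatch may be an error in either source.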

@teristam

teristam commented May 3, 2020

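>>> https://apps.itsc.cuhk.edu.hk/hanyu/Page/Cover.aspx
>>> There is a word list here. It has only one entry for saa bou ngaang caang, though it says the word has no written form.
>>
>> [image]
>> It actually gives saa¹ bou¹ aang¹ daang¹ (Cantonese has no syllable daang at all)... I wonder what its inclusion rules are?
>
> There is an entry for saa bou naang caang that begins with 砂. That said, saa¹ bou¹ aang¹ daang¹ does look wrong; I don't know whether it is a typo on their side.
>
> Their project description reads: "Therefore, this project collects student essays, local newspapers, dialect dictionaries and other materials, and tries to identify the differences between commonly used local dialect words and common Putonghua words. The dialect words shown here are a small part of the material collected by this project; visitors may try to identify the common Chinese words that correspond to the Cantonese words."
>
> So it probably cannot help us: first because of copyright, second because it contains errors.

I believe a bare word list does not raise copyright issues, and we do not even need its Chinese equivalents. The errors can be corrected by hand; being able to detect a large share automatically already lightens the work a lot.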

@chaaklau
Collaborator

chaaklau commented May 3, 2020

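> I believe a bare word list does not raise copyright issues, and we do not even need its Chinese equivalents. The errors can be corrected by hand; being able to detect a large share automatically already lightens the work a lot.

A few years ago a group of friends and I compiled the entries, romanization and other data from this CUHK list:
Google Spreadsheet

There were many errors in the formatting and the romanization, which we fixed along the way.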

@teristam

teristam commented May 3, 2020

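Hi everyone. I compared the word groups in the spreadsheet against 「教育部重編國語字典」 and 「漢語大詞典」. The main goal was to find the type-2 word groups, i.e. "same sound, different meaning".

To count as a type-2 word group, a row must meet all of the following conditions:
1. Every word in the row can be found in the dictionary.
2. Their definitions are not the same.
3. The corresponding 「也作」 ("also written as") and 「見」 ("see") items in the dictionary entries, where present, do not match. Words not found in the 教育部 dictionary are looked up again in 漢語大詞典.
4. The words are three characters or shorter (because I found that four-character words with the same sound but different meanings are very rare).

After filtering by program, about 1500 words met these conditions. I skimmed the result and it looks broadly right, but I did not check it line by line. The result is attached; I hope it is useful.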
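The four conditions above could be sketched roughly as follows. This is not the program that produced the attached file. It assumes the two dictionaries have already been loaded into plain dicts mapping a word to a definition string and a set of cross-referenced variants (a hypothetical structure). It also reads condition 3 as "neither word is listed as a variant of the other", which is one interpretation. The sample glosses are placeholders, not real dictionary text.

```python
def is_type_two(words, primary, fallback, max_len=3):
    """Return True if a homophone row looks like "same sound, different meaning".

    primary / fallback: hypothetical dicts of
    word -> {"definition": str, "variants": set of str}.
    """
    entries = []
    for word in words:
        if len(word) > max_len:  # condition 4: three characters or fewer
            return False
        entry = primary.get(word)
        if entry is None:  # not in the first dictionary: try the second
            entry = fallback.get(word)
        if entry is None:  # condition 1: every word must be found
            return False
        entries.append((word, entry))
    for i, (w1, e1) in enumerate(entries):
        for w2, e2 in entries[i + 1:]:
            if e1["definition"] == e2["definition"]:  # condition 2
                return False
            if w2 in e1["variants"] or w1 in e2["variants"]:  # condition 3
                return False
    return True


# Placeholder glosses for illustration only.
primary = {
    "避嫌": {"definition": "avoid suspicion", "variants": set()},
    "鼻炎": {"definition": "rhinitis", "variants": set()},
    "阿爺": {"definition": "grandfather", "variants": {"亞爺"}},
}
fallback = {"亞爺": {"definition": "grandfather", "variants": {"阿爺"}}}

print(is_type_two(["避嫌", "鼻炎"], primary, fallback))
print(is_type_two(["亞爺", "阿爺"], primary, fallback))
```

A filter like this is conservative: any row with a word missing from both dictionaries is left for manual review rather than classified.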

export.xlsx

@laubonghaudoi
Member Author

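It just occurred to me that the work on this issue can pause for a while, because #50 mentions that the word lists of 4 more dictionaries have not been added yet. I plan to add all of those word lists first, then recompute the homophone list, and then get people to clean it up in one go.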

@laubonghaudoi
Member Author

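I have just recomputed the homophones from the latest version of our word list. The file is here:
homophones.txt
This time we can do a dedicated clean-up pass.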

@tanxpyox
Collaborator

tanxpyox commented Jun 1, 2020

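> I compared the word groups in the spreadsheet against 「教育部重編國語字典」 and 「漢語大詞典」. The main goal was to find the type-2 word groups, i.e. "same sound, different meaning".

@teristam Would you have time to run the same filtering method over the data in the spreadsheet below? 擇言 and the others added some new data earlier, so it has to be annotated again from the start. Even the records that are already annotated could be cross-checked once more, in case someone mistyped something or we overlooked it. 🙏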

https://docs.google.com/spreadsheets/d/1mMHoenbyaXaDgMyr4mJr_3XdtaLEvVkx1mfjEwPIxg4/edit#gid=442925576

@hfhchan
Collaborator

hfhchan commented Jun 3, 2020

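(Addendum: if you want to look things up at https://apps.itsc.cuhk.edu.hk/hanyu/Page/Cover.aspx but cannot see the characters, you can first download the font file, rename it canton.ttf, and then upload it to https://fontdrop.info/).
EC9E:
[image]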

tanxpyox mentioned this issue Jun 8, 2020
rime deleted a comment from tiujejauci Jun 8, 2020
@laubonghaudoi
Member Author

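Through our collaborative improvement work with 搜狗 (Sogou), our code table has now basically resolved the character problems of entries without syllables and of words with multiple readings. What remains is the character-choice problem raised in this issue, which needs a separate dedicated project. So I am closing this issue for now, and we can discuss how to clean up the table's character usage from scratch when there is a chance later.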
