Text Cleaning

Yimin Jing edited this page Feb 21, 2022 · 7 revisions

delete_escape_character

import takin

zh_text = "中国是一个美丽的地方\n请告诉我你在哪儿。\n我一定会去找你\t在我的怀里\t在你的眼里"
en_text = "Today is sunday\nwe are happy\nwe are fun."
print(takin.delete_escape_character(zh_text, lang="zh", add_punc=False))
print(takin.delete_escape_character(zh_text, lang="zh", add_punc=True))
print(takin.delete_escape_character(en_text, lang="en", add_punc=False))
print(takin.delete_escape_character(en_text, lang="en", add_punc=True))

>>> 中国是一个美丽的地方请告诉我你在哪儿我一定会去找你在我的怀里在你的眼里
>>> 中国是一个美丽的地方请告诉我你在哪儿我一定会去找你在我的怀里在你的眼里
>>> Today is sundaywe are happywe are fun.
>>> Today is sunday. we are happy. we are fun.
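
To make the English behaviour above easier to reason about, here is a minimal sketch that reproduces the two English outputs with the standard `re` module. It is not takin's implementation: the function name `strip_escape_characters` is hypothetical, and joining the pieces with ". " when `add_punc=True` is an assumption inferred only from the sample output. It does not attempt to model the `lang="zh"` case.

```python
import re


def strip_escape_characters(text: str, add_punc: bool = False) -> str:
    """Hypothetical stand-in for illustration; not takin's own code."""
    # Split on runs of newline, tab, or carriage-return characters
    # and drop any empty pieces left at the edges.
    pieces = [p for p in re.split(r"[\n\t\r]+", text) if p]
    # Assumption: add_punc joins the pieces with ". "; otherwise
    # the pieces are concatenated directly.
    return (". " if add_punc else "").join(pieces)


en_text = "Today is sunday\nwe are happy\nwe are fun."
print(strip_escape_characters(en_text))
print(strip_escape_characters(en_text, add_punc=True))
```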

delete_extra_whitespace

zh_text = "我 们  都非   常快 乐   。 "
en_text = "Takin  ,    is very   useful  .    "
print(takin.delete_extra_whitespace(zh_text, lang="zh"))
print(takin.delete_extra_whitespace(en_text, lang="en"))

>>> 我们都非常快乐
>>> Takin, is very useful.

delete_digit

text = "980.152%的人都没有来,1/120的孩子失去了饮水,20.34块蛋糕,100个人,97%的老人"
print(takin.delete_digit(text))

>>> 的人都没有来的孩子失去了饮水块蛋糕个人的老人

delete_punctuation

text1 = "Long;:,.\"??!''·!?;,。:“”、‘’《》[╔ˊ〉〈–η●®·•-~#/*&$|★▶><\^@+[=]()(){%_}?…]"
text2 = "this day is a friday. We are 3.123, 90%....   3. 中国, 3/2=5, 我峨%嵋你+"
print(takin.delete_punctuation(text1))
print(takin.delete_punctuation(text2))

>>> Long
>>> this day is a friday We are 3.123 90%   3. 中国 3/2=5 我峨嵋你

delete_letter

text = "今天的MoonCake真的非常nice啊!"
print(takin.delete_letter(text))

>>> 今天的真的非常啊

delete_chinese

text = "This is another 胜利victory!"
print(takin.delete_chinese(text))

>>> This is another victory!
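
For readers without takin installed, the same result on this input can be approximated with a single regular expression. This is only a sketch: the name `strip_cjk` is hypothetical, and the character range is an assumption that covers the basic CJK Unified Ideographs block only. Which ranges takin itself removes is not stated on this page.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
# Extension blocks and CJK punctuation are deliberately left untouched.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def strip_cjk(text: str) -> str:
    """Hypothetical helper: remove basic CJK ideographs from text."""
    return CJK_PATTERN.sub("", text)


print(strip_cjk("This is another 胜利victory!"))
```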

delete_bracket

text = "机器阅读理解(MRC),【旨在】教机器理解人类语言(language){热爱学习}[hah]<hahgag>"
print(takin.delete_bracket(text))

>>> 机器阅读理解教机器理解人类语言

delete_series_number

text = "1.努力工作;2. 用心学习 (2).用心学习;(3)锻炼身体;4).热爱家庭快乐;6)学习, 7)、(一)、集中学习 (十五)高度集中 (一百二十三)"
print(takin.delete_series_number(text))

>>> 努力工作用心学习 用心学习锻炼身体热爱家庭快乐学习, 集中学习 高度集中

delete_repeated_punc

text = "what's up????????????????...。。《《《"
print(takin.delete_repeated_punc(text))

>>> what's up?.。《
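
The collapsing shown above can be sketched with a backreference in the standard `re` module. The name `collapse_repeated_punc` is hypothetical, and treating "punctuation" as any character that is neither a word character nor whitespace is an assumption; takin may use a different character set.

```python
import re

# A non-word, non-space character followed by one or more copies of itself.
REPEATED_PUNC = re.compile(r"([^\w\s])\1+")


def collapse_repeated_punc(text: str) -> str:
    """Hypothetical helper: reduce each run of one repeated symbol to a single copy."""
    return REPEATED_PUNC.sub(r"\1", text)


print(collapse_repeated_punc("what's up????????????????...。。《《《"))
```

Only runs of the same character are collapsed, so a mixed sequence such as `?!` is left unchanged.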