Tools for NLP using Python
This repository handles file I/O and string cleaning/parsing.
Install:
pip install nlp2
Before using:
from nlp2 import *
Arguments
path(String)
: directory to scan; every folder under this path is yielded
Returns
path(String)(generator)
: paths of the folders under the given path
Examples
for i in get_folders_from_dir('./corpus/'):
    print(i)
'./corpus/kdd'
'./corpus/nycd'
Arguments
path(String)
: directory to scan; every file under this path is yielded
Returns
path(String)(generator)
: paths of the files under the given path
Examples
for i in get_files_from_dir('./data/'):
    print(i)
'./data/kdd.txt'
'./data/nycd.txt'
Arguments
path(String)
: directory to scan; every line of every file under this path is yielded
Returns
line(String)(generator)
: lines of the files under the given path
Examples
for i in read_dir_files_into_lines('./data/'):
    print(i)
'file1 sent1'
'file1 sent2'
...
'file2 sent1'
...
Arguments
path(String)
: directory to scan; every line of every file under this path is read
Returns
line(String)(generator)
: lines of the files under the given path
Examples
i = list(read_dir_files_into_lines('./data/'))
print(i)
['file1 sent1','file1 sent2'...'file2 sent1'...]
Arguments
path(String)
: input file path; the file's content is yielded line by line
Returns
line(String)(generator)
: lines of the file at the given path
Examples
for i in read_dir_files_into_lines('./data/kdd.txt'):
    print(i)
'sent1'
'sent2'
...
Arguments
path(String)
: input file path; the file's content is read line by line
Returns
line(String)(generator)
: lines of the file at the given path
Examples
i = list(read_dir_files_into_lines('./data/kdd.txt'))
print(i)
['sent1','sent2'...]
It will replace the directory if it already exists, or create a new one.
Arguments
dirPath(String)
: dir location
Examples
create_new_dir_always('./data/')
It will create the directory if it does not exist.
Arguments
dirPath(String)
: dir location that you want to make sure exists
Returns
path(String)
: dir location that is guaranteed to exist
Examples
i = get_dir_with_notexist_create('./data/kdd')
print(i)
'./data/kdd'
Arguments
path(String)
: file location
Returns
result(Boolean)
: whether the file exists; True if it does
Examples
i = is_file_exist('./data/kdd.txt')
print(i)
True
Arguments
path(String)
: dir location
Returns
result(Boolean)
: whether the directory exists; True if it does
Examples
i = is_dir_exist('./data/kdd')
print(i)
False
Arguments
url(String)
: download link
save_dir(String)
: save location
Returns
result(String)
: path of the downloaded file
Examples
i = download_file('https://raw.githubusercontent.com/voidful/voidful_blog/master/assets/post_src/nninmath_3/img1','./data/')
print(i)
./data/img1
Arguments
filepath(String)
: csv file path
Returns
list
: csv rows
Examples
i = read_csv('./data/kdd.csv')
print(i)
["sent","hi"]
Arguments
csv_rows(list)
: list of csv rows
loc(String)
: write location / file path
Examples
i = write_csv(["sent","hi"],'./data/kdd.csv')
Arguments
filepath(String)
: json file path
Returns
json
: json object
Examples
i = read_json('./data/kdd.json')
print(i)
{"sent":"hi"}
Arguments
json_str(String)
: json content as a string
loc(String)
: write location / file path
Returns
path(String)
: written file path
Examples
i = write_json('{"sent":"hi"}', './data/kdd.json')
print(i)
'./data/kdd.json'
Removes http links from the context.
Arguments
string(String)
: a string that may contain http links
Returns
result(String)
: string without any http link
Examples
y = remove_httplink("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗")
print(y)
今天天氣 晴朗
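The exact pattern nlp2 uses is not shown here; the following is only a rough sketch of the idea, assuming a regex that strips http/https followed by ASCII URL characters (remove_httplink_sketch is illustrative, not a library function):
import re

def remove_httplink_sketch(text):
    # Match http/https plus typical ASCII URL characters only, so CJK text
    # glued onto the end of a link (as in the example above) is preserved.
    return re.sub(r'https?://[A-Za-z0-9./?=&%#_-]+', ' ', text).strip()

print(remove_httplink_sketch("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗"))
# 今天天氣 晴朗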
Removes html elements from the context.
Arguments
string(String)
: a string that may contain html elements
Returns
result(String)
: string without any html element
Examples
y = clean_htmlelement('<div class=""><p>Phraseg - 一言:新詞發現工具包</p></div>')
print(y)
Phraseg - 一言:新詞發現工具包
Removes unused tags from the context.
Arguments
string(String)
: a string that may contain unused tags
Returns
result(String)
: string without any unused tag
Examples
y = clean_unused_tag("[quote]<br>\n無聊得過此帖?!:smile_42: [/quote]<br>\n<br>\n<br>\n認同。<br>\n<br>\n改洋名,只是一個字號。")
print(y)
無聊得過此帖?!
認同。
改洋名,只是一個字號。
Applies all cleaning methods to the context:
clean_unused_tag / clean_htmlelement / clean_httplink
Arguments
string(String)
: a string that may contain some garbage
Returns
result(String)
: clean string
Examples
y = clean_all('[i]234282[/i] <div class=""><p>Phraseg - 一言:新詞發現工具包http://news.IN1802020028.htm今天天氣http://news.we028.晴朗</p></div>')
print(y)
Phraseg - 一言:新詞發現工具包 今天天氣 晴朗
Turns an array of lines into an array of sentences.
It splits each line on any punctuation.
Arguments
lines(String Array)
: lines array
Returns
sentences(String Array)
: all lines split on punctuation
Examples
y = split_lines_by_punc(["你好啊.hello,me"])
print(y)
['你好啊', 'hello', 'me']
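A minimal sketch of the idea, assuming a simple regex split on common ASCII and CJK punctuation (the punctuation set nlp2 actually uses may be larger):
import re

def split_lines_by_punc_sketch(lines):
    # Split every line on runs of punctuation and drop the empty pieces.
    sentences = []
    for line in lines:
        sentences.extend(p for p in re.split(r'[.,!?;:、,。!?;:]+', line) if p)
    return sentences

print(split_lines_by_punc_sketch(["你好啊.hello,me"]))
# ['你好啊', 'hello', 'me']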
It will split a sentence into as many n-grams as possible.
Arguments
sentence(String)
: a string with no punctuation
Returns
ngrams(String Array)
: ngrams array
Examples
split_sentence_to_ngram("加州旅館")
['加','加州',"加州旅","加州旅館","州","州旅","州旅館","旅","旅館","館"]
It will split a sentence into as many n-grams as possible, grouped by their different start points.
Arguments
sentence(String)
: a string with no punctuation
Returns
ngrams(Array)
: 2D array of n-grams grouped by start position
Examples
split_sentence_to_ngram_in_part("加州旅館")
[['加','加州',"加州旅","加州旅館"],["州","州旅","州旅館"],["旅","旅館"],["館"]]
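Both n-gram helpers enumerate every substring; a minimal sketch of that enumeration (ngrams_sketch is illustrative only, not a library function):
def ngrams_sketch(sentence):
    # Group substrings by start position, growing one character at a time;
    # the flat list matches split_sentence_to_ngram, the grouped list matches
    # split_sentence_to_ngram_in_part.
    flat, parts = [], []
    for start in range(len(sentence)):
        row = [sentence[start:end] for end in range(start + 1, len(sentence) + 1)]
        parts.append(row)
        flat.extend(row)
    return flat, parts

flat, parts = ngrams_sketch("加州旅館")
print(flat)   # ['加', '加州', '加州旅', '加州旅館', '州', '州旅', '州旅館', '旅', '旅館', '館']
print(parts)  # [['加', '加州', '加州旅', '加州旅館'], ['州', '州旅', '州旅館'], ['旅', '旅館'], ['館']]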
It will try to find all possible ways to segment the sentence.
Arguments
sentence(String)
: input sentence
Returns
seg list(String Array)
: all segmentations in an array
Examples
split_text_in_all_ways("加州旅館")
['加 州 旅 館', '加 州 旅館', '加 州旅 館', '加 州旅館', '加州 旅館', '加州旅 館', '加州旅館']
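A brute-force sketch of the enumeration: decide independently for each of the n-1 gaps whether to cut, which gives 2^(n-1) candidates for an n-character input (the library's output above may apply additional filtering, so counts can differ):
from itertools import product

def all_segmentations_sketch(sentence):
    # For each gap between characters choose cut / no cut, then join the
    # resulting pieces with spaces.
    results = []
    for cuts in product([False, True], repeat=len(sentence) - 1):
        pieces, piece = [], sentence[0]
        for ch, cut in zip(sentence[1:], cuts):
            if cut:
                pieces.append(piece)
                piece = ch
            else:
                piece += ch
        pieces.append(piece)
        results.append(' '.join(pieces))
    return results

print(all_segmentations_sketch("加州旅館"))
# ['加 州 旅 館', ..., '加州旅館'] (8 candidates in total)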
Used to split sentences in different kinds of languages.
Arguments
sentence(String)
: input sentence
merge_non_eng(boolean,optional)
: whether to keep non-English text merged instead of splitting it into characters
Returns
segment array(String Array)
: word array
Examples
split_sentence_to_array('你好 are u 可以',merge_non_eng = True)
['你好', 'are', 'u', '可以']
split_sentence_to_array('你好 are u 可以')
['你', '好', 'are', 'u', '可', '以']
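The real implementation is not shown here; a minimal regex-based sketch of the behaviour in the two examples above (split_mixed_sketch is illustrative only):
import re

def split_mixed_sketch(sentence, merge_non_eng=False):
    # English/number runs stay whole; other runs (e.g. CJK) are kept merged
    # when merge_non_eng is True, otherwise emitted character by character.
    tokens = []
    for run in re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9\s]+', sentence):
        if merge_non_eng or re.fullmatch(r'[A-Za-z0-9]+', run):
            tokens.append(run)
        else:
            tokens.extend(run)
    return tokens

print(split_mixed_sketch('你好 are u 可以', merge_non_eng=True))  # ['你好', 'are', 'u', '可以']
print(split_mixed_sketch('你好 are u 可以'))                      # ['你', '好', 'are', 'u', '可', '以']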
Arguments
words_array(String Array)
: input array
Returns
sentence(String)
: output sentence
Examples
join_words_to_sentence(['你好', 'are', "可以"])
你好are可以
Splits a passage into chunks of a particular size.
If part of a sentence exceeds the chunk size, the whole sentence is still put into that chunk.
Arguments
passage(String)
: input passage
num_of_paragraphs(int)
: number of characters in one chunk
Returns
chunk array(String Array)
: passage split into chunks
Examples
passage_into_chunk("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n",10)
['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']
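A minimal sketch reproducing the example above: lines are accumulated until the chunk reaches the requested size, so a line that overshoots the limit still stays whole inside its chunk. How the real function treats a short trailing remainder is not documented here; this sketch simply drops it, which matches the example:
def passage_into_chunk_sketch(passage, num_of_paragraphs):
    # Accumulate whole lines; emit the chunk as soon as it reaches the limit.
    chunks, current = [], ''
    for line in passage.splitlines(keepends=True):
        current += line
        if len(current) >= num_of_paragraphs:
            chunks.append(current)
            current = ''
    return chunks

print(passage_into_chunk_sketch("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n", 10))
# ['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']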
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text is all English or not
Examples
is_all_english("1SGD")
True
is_all_english("1SG哦")
False
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text contains a number or not
Examples
is_contain_number("1SGD")
True
is_contain_number("SG哦")
False
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text contains English or not
Examples
is_contain_english("1SGD")
True
is_contain_english("123哦")
False
Arguments
str(String)
: input text
list(String list)
: input list of strings
Returns
result(Boolean)
: whether the text appears as part of any list item
Examples
is_list_contain_string("a", ['a', 'dcd'])
True
is_list_contain_string("a", ['abcd', 'dcd'])
True
is_list_contain_string("a", ['bdc', 'dcd'])
False
Arguments
string(String)
: input string to convert to half-width
Returns
(String)
: the half-width string
Examples
full2half(",,")
,,
Arguments
text(String)
: input string to convert to full-width
Returns
(String)
: the full-width string
Examples
half2full(",,")
,,
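Full-width forms U+FF01 through U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and the ideographic space U+3000 maps to a normal space; a minimal sketch of both directions based on that offset (the library may cover additional punctuation):
def full2half_sketch(text):
    # Shift full-width characters down by 0xFEE0; map U+3000 to a space.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def half2full_sketch(text):
    # Inverse mapping: printable ASCII shifted up by 0xFEE0, space to U+3000.
    return ''.join(chr(0x3000) if ch == ' '
                   else chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E
                   else ch
                   for ch in text)

print(full2half_sketch(",,"))  # ,,
print(half2full_sketch(",,"))   # ,,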
Vectorization implemented following the paper:
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
Average pooling.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_aver(pretrain_wordvec, size, jieba.lcut(context))
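Conceptually, average pooling is just the mean of the word vectors; a rough numpy sketch under the assumption that out-of-vocabulary words are skipped (the library's own OOV handling may differ):
import numpy as np

def doc2vec_aver_sketch(pretrained_emb, emb_size, context):
    # Mean of the available word vectors; zeros if nothing is in vocabulary.
    vectors = [pretrained_emb[word] for word in context if word in pretrained_emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(emb_size)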
Max pooling in each dimension.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_max(pretrain_wordvec, size, jieba.lcut(context))
Concatenates the average pooling and max pooling results.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_concat(pretrain_wordvec, size, jieba.lcut(context))
Average pooling over sliding windows, followed by max pooling.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
windows(int)
: size of the sliding window
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_hier(pretrain_wordvec, size, jieba.lcut(context))
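A rough numpy sketch of the hierarchical pooling idea from the paper: average the vectors inside each sliding window, then take the element-wise max over all window averages (OOV handling and short-input behaviour are assumptions here):
import numpy as np

def doc2vec_hier_sketch(pretrained_emb, emb_size, context, windows=3):
    # Window-wise average pooling followed by element-wise max pooling.
    vectors = [pretrained_emb[w] for w in context if w in pretrained_emb]
    if not vectors:
        return np.zeros(emb_size)
    vectors = np.asarray(vectors)
    windows = min(windows, len(vectors))
    window_means = [vectors[i:i + windows].mean(axis=0)
                    for i in range(len(vectors) - windows + 1)]
    return np.max(window_means, axis=0)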
Calculates the cosine similarity between two vectors.
Arguments
vector1(list)
: first input vector
vector2(list)
: second input vector
Returns
cos similarity(float)
: similarity of the two vectors
Examples
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
input1 = nlp2.doc2vec_concat(pretrain_wordvec, size, "DC")
input2 = nlp2.doc2vec_concat(pretrain_wordvec, size, "漫威")
nlp2.cosine_similarity(input1, input2)
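For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms; a small numpy sketch:
import numpy as np

def cosine_similarity_sketch(vec1, vec2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    v1, v2 = np.asarray(vec1, dtype=float), np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0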
Arguments
length(int)
: length of the random string
Returns
randstr(String)
: random string of the given length, drawn from "0123456789ABCDEF"
Examples
random_string(10)
D6857CE0F4
Arguments
length(int)
: length of the random string
Returns
randstr(String)
: random string whose size is length plus the timestamp length (10)
Examples
random_string_with_timestamp(1)
1435474326D
Random value drawn from a range given in array form.
int,float : [min,max]
string : [candidate1,candidate2...]
Arguments
range(array)
: range in array form
Returns
random result (depends on input)
: a random value matching the input condition
Examples
# for string
y = random_value_in_array_form(["SGD","ADAM","XDA"])
print(y)
'ADAM'
# for int
y = random_value_in_array_form([1,12])
print(y)
4
# for float
y = random_value_in_array_form([0.01,1.00])
print(y)
0.34
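A rough sketch of the dispatch the description implies: two numbers are read as a [min, max] range (randint for ints, uniform for floats), anything else as a list of candidates. Rounding floats to two decimals is only an assumption matching the example output:
import random

def random_value_in_array_form_sketch(value_range):
    # Numeric pair -> range; otherwise pick one of the candidates.
    if (len(value_range) == 2
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value_range)):
        low, high = value_range
        if isinstance(low, float) or isinstance(high, float):
            return round(random.uniform(low, high), 2)
        return random.randint(low, high)
    return random.choice(value_range)

print(random_value_in_array_form_sketch(["SGD", "ADAM", "XDA"]))  # e.g. 'ADAM'
print(random_value_in_array_form_sketch([1, 12]))                 # e.g. 4
print(random_value_in_array_form_sketch([0.01, 1.00]))            # e.g. 0.34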