Tools for NLP using Python
This repository handles file I/O and string cleaning/parsing.
Install:
pip install nlp2
Before using:
from nlp2 import *
Arguments
path(String)
: directory to scan; every folder under this path is yielded
Returns
path(String)(generator)
: paths of the folders under the given path
Examples
for i in get_folders_from_dir('./corpus/'):
    print(i)
'./corpus/kdd'
'./corpus/nycd'
Arguments
path(String)
: directory to scan; every file under this path is yielded
Returns
path(String)(generator)
: paths of the files under the given path
Examples
for i in get_files_from_dir('./data/'):
    print(i)
'./data/kdd.txt'
'./data/nycd.txt'
Arguments
path(String)
: directory to scan; every line of every file under this path is yielded
Returns
line(String)(generator)
: lines of the files under the given path
Examples
for i in read_dir_files_into_lines('./data/'):
    print(i)
'file1 sent1'
'file1 sent2'
...
'file2 sent1'
...
Arguments
path(String)
: directory to scan; every line of every file under this path is read
Returns
line(String)(generator)
: lines of the files under the given path
Examples
i = list(read_dir_files_into_lines('./data/'))
print(i)
['file1 sent1','file1 sent2'...'file2 sent1'...]
Arguments
path(String)
: input file path; the file's content is yielded line by line
Returns
line(String)(generator)
: lines of the file at the given path
Examples
for i in read_dir_files_into_lines('./data/kdd.txt'):
    print(i)
'sent1'
'sent2'
...
Arguments
path(String)
: input file path; the file's content is read line by line
Returns
line(String)(generator)
: lines of the file at the given path
Examples
i = list(read_dir_files_into_lines('./data/kdd.txt'))
print(i)
['sent1','sent2'...]
It will replace the directory if it already exists, or create a new one.
Arguments
dirPath(String)
: dir location
Examples
create_new_dir_always('./data/')
It will create the directory if it does not exist.
Arguments
dirPath(String)
: dir location that you want to make sure exists
Returns
path(String)
: dir location that is guaranteed to exist
Examples
i = get_dir_with_notexist_create('./data/kdd')
print(i)
'./data/kdd'
Arguments
path(String)
: file location
Returns
result(Boolean)
: whether the file exists; True if it does
Examples
i = is_file_exist('./data/kdd.txt')
print(i)
True
Arguments
path(String)
: dir location
Returns
result(Boolean)
: whether the directory exists; True if it does
Examples
i = is_dir_exist('./data/kdd')
print(i)
False
Arguments
url(String)
: download link
save_dir(String)
: save location
Returns
result(String)
: path of the downloaded file
Examples
i = download_file('https://raw.githubusercontent.com/voidful/voidful_blog/master/assets/post_src/nninmath_3/img1','./data/')
print(i)
./data/img1
Arguments
filepath(String)
: csv file path
Returns
list
: csv rows
Examples
i = read_csv('./data/kdd.csv')
print(i)
["sent","hi"]
Arguments
csv_rows(list)
: list of csv rows
loc(String)
: write location / file path
Examples
i = write_csv(["sent","hi"],'./data/kdd.csv')
Arguments
filepath(String)
: json file path
Returns
json
: json object
Examples
i = read_json('./data/kdd.json')
print(i)
{"sent":"hi"}
Arguments
json_str(String)
: json content as a string
loc(String)
: write location / file path
Returns
path(String)
: written file path
Examples
i = write_json('{"sent":"hi"}', './data/kdd.json')
print(i)
'./data/kdd.json'
Removes http links from the context.
Arguments
string(String)
: a string that may contain http links
Returns
result(String)
: string without any http link
Examples
y = remove_httplink("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗")
print(y)
今天天氣 晴朗
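The exact pattern nlp2 uses is not shown here; the following is only a rough sketch of the idea, assuming a regex that strips http/https followed by ASCII URL characters (remove_httplink_sketch is illustrative, not a library function):
import re

def remove_httplink_sketch(text):
    # Match http/https plus typical ASCII URL characters only, so CJK text
    # glued onto the end of a link (as in the example above) is preserved.
    return re.sub(r'https?://[A-Za-z0-9./?=&%#_-]+', ' ', text).strip()

print(remove_httplink_sketch("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗"))
# 今天天氣 晴朗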
Removes html elements from the context.
Arguments
string(String)
: a string that may contain html elements
Returns
result(String)
: string without any html element
Examples
y = clean_htmlelement('<div class=""><p>Phraseg - 一言:新詞發現工具包</p></div>')
print(y)
Phraseg - 一言:新詞發現工具包
Removes unused tags from the context.
Arguments
string(String)
: a string that may contain unused tags
Returns
result(String)
: string without any unused tag
Examples
y = clean_unused_tag("[quote]<br>\n無聊得過此帖?!:smile_42: [/quote]<br>\n<br>\n<br>\n認同。<br>\n<br>\n改洋名,只是一個字號。")
print(y)
無聊得過此帖?!
認同。
改洋名,只是一個字號。
Applies all cleaning methods to the context:
clean_unused_tag / clean_htmlelement / clean_httplink
Arguments
string(String)
: a string that may contain some garbage
Returns
result(String)
: clean string
Examples
y = clean_all('[i]234282[/i] <div class=""><p>Phraseg - 一言:新詞發現工具包http://news.IN1802020028.htm今天天氣http://news.we028.晴朗</p></div>')
print(y)
Phraseg - 一言:新詞發現工具包 今天天氣 晴朗
Turns an array of lines into an array of sentences.
It splits each line on any punctuation.
Arguments
lines(String Array)
: lines array
Returns
sentences(String Array)
: all lines split on punctuation
Examples
y = split_lines_by_punc(["你好啊.hello,me"])
print(y)
['你好啊', 'hello', 'me']
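A minimal sketch of the idea, assuming a simple regex split on common ASCII and CJK punctuation (the punctuation set nlp2 actually uses may be larger):
import re

def split_lines_by_punc_sketch(lines):
    # Split every line on runs of punctuation and drop the empty pieces.
    sentences = []
    for line in lines:
        sentences.extend(p for p in re.split(r'[.,!?;:、,。!?;:]+', line) if p)
    return sentences

print(split_lines_by_punc_sketch(["你好啊.hello,me"]))
# ['你好啊', 'hello', 'me']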
It will split a sentence into as many n-grams as possible.
Arguments
sentence(String)
: a string with no punctuation
Returns
ngrams(String Array)
: ngrams array
Examples
split_sentence_to_ngram("加州旅館")
['加','加州',"加州旅","加州旅館","州","州旅","州旅館","旅","旅館","館"]
It will split a sentence into as many n-grams as possible, grouped by their different start points.
Arguments
sentence(String)
: a string with no punctuation
Returns
ngrams(Array)
: 2D array of n-grams grouped by start position
Examples
split_sentence_to_ngram_in_part("加州旅館")
[['加','加州',"加州旅","加州旅館"],["州","州旅","州旅館"],["旅","旅館"],["館"]]
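Both n-gram helpers enumerate every substring; a minimal sketch of that enumeration (ngrams_sketch is illustrative only, not a library function):
def ngrams_sketch(sentence):
    # Group substrings by start position, growing one character at a time;
    # the flat list matches split_sentence_to_ngram, the grouped list matches
    # split_sentence_to_ngram_in_part.
    flat, parts = [], []
    for start in range(len(sentence)):
        row = [sentence[start:end] for end in range(start + 1, len(sentence) + 1)]
        parts.append(row)
        flat.extend(row)
    return flat, parts

flat, parts = ngrams_sketch("加州旅館")
print(flat)   # ['加', '加州', '加州旅', '加州旅館', '州', '州旅', '州旅館', '旅', '旅館', '館']
print(parts)  # [['加', '加州', '加州旅', '加州旅館'], ['州', '州旅', '州旅館'], ['旅', '旅館'], ['館']]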
It will try to find all possible ways to segment the sentence.
Arguments
sentence(String)
: input sentence
Returns
seg list(String Array)
: all segmentations in an array
Examples
split_text_in_all_ways("加州旅館")
['加 州 旅 館', '加 州 旅館', '加 州旅 館', '加 州旅館', '加州 旅館', '加州旅 館', '加州旅館']
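A brute-force sketch of the enumeration: decide independently for each of the n-1 gaps whether to cut, which gives 2^(n-1) candidates for an n-character input (the library's output above may apply additional filtering, so counts can differ):
from itertools import product

def all_segmentations_sketch(sentence):
    # For each gap between characters choose cut / no cut, then join the
    # resulting pieces with spaces.
    results = []
    for cuts in product([False, True], repeat=len(sentence) - 1):
        pieces, piece = [], sentence[0]
        for ch, cut in zip(sentence[1:], cuts):
            if cut:
                pieces.append(piece)
                piece = ch
            else:
                piece += ch
        pieces.append(piece)
        results.append(' '.join(pieces))
    return results

print(all_segmentations_sketch("加州旅館"))
# ['加 州 旅 館', ..., '加州旅館'] (8 candidates in total)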
Used to split sentences in different kinds of languages.
Arguments
sentence(String)
: input sentence
merge_non_eng(boolean,optional)
: whether to keep non-English text merged instead of splitting it into characters
Returns
segment array(String Array)
: word array
Examples
split_sentence_to_array('你好 are u 可以',merge_non_eng = True)
['你好', 'are', 'u', '可以']
split_sentence_to_array('你好 are u 可以')
['你', '好', 'are', 'u', '可', '以']
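The real implementation is not shown here; a minimal regex-based sketch of the behaviour in the two examples above (split_mixed_sketch is illustrative only):
import re

def split_mixed_sketch(sentence, merge_non_eng=False):
    # English/number runs stay whole; other runs (e.g. CJK) are kept merged
    # when merge_non_eng is True, otherwise emitted character by character.
    tokens = []
    for run in re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9\s]+', sentence):
        if merge_non_eng or re.fullmatch(r'[A-Za-z0-9]+', run):
            tokens.append(run)
        else:
            tokens.extend(run)
    return tokens

print(split_mixed_sketch('你好 are u 可以', merge_non_eng=True))  # ['你好', 'are', 'u', '可以']
print(split_mixed_sketch('你好 are u 可以'))                      # ['你', '好', 'are', 'u', '可', '以']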
Arguments
words_array(String Array)
: input array
Returns
sentence(String)
: output sentence
Examples
join_words_to_sentence(['你好', 'are', "可以"])
你好are可以
Splits a passage into chunks of a particular size.
If part of a sentence exceeds the chunk size, the whole sentence is still put into that chunk.
Arguments
passage(String)
: input passage
num_of_paragraphs(int)
: number of characters in one chunk
Returns
chunk array(String Array)
: passage split into chunks
Examples
passage_into_chunk("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n",10)
['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']
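A minimal sketch reproducing the example above: lines are accumulated until the chunk reaches the requested size, so a line that overshoots the limit still stays whole inside its chunk. How the real function treats a short trailing remainder is not documented here; this sketch simply drops it, which matches the example:
def passage_into_chunk_sketch(passage, num_of_paragraphs):
    # Accumulate whole lines; emit the chunk as soon as it reaches the limit.
    chunks, current = [], ''
    for line in passage.splitlines(keepends=True):
        current += line
        if len(current) >= num_of_paragraphs:
            chunks.append(current)
            current = ''
    return chunks

print(passage_into_chunk_sketch("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n", 10))
# ['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']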
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text is all English or not
Examples
is_all_english("1SGD")
True
is_all_english("1SG哦")
False
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text contains a number or not
Examples
is_contain_number("1SGD")
True
is_contain_number("SG哦")
False
Arguments
text(String)
: input text
Returns
result(Boolean)
: whether the text contains English or not
Examples
is_contain_english("1SGD")
True
is_contain_english("123哦")
False
Arguments
str(String)
: input text
list(String list)
: input list of strings
Returns
result(Boolean)
: whether the text appears as part of any list item
Examples
is_list_contain_string("a", ['a', 'dcd'])
True
is_list_contain_string("a", ['abcd', 'dcd'])
True
is_list_contain_string("a", ['bdc', 'dcd'])
False
Arguments
string(String)
: input string to convert to half-width
Returns
(String)
: the half-width string
Examples
full2half(",,")
,,
Arguments
text(String)
: input string to convert to full-width
Returns
(String)
: the full-width string
Examples
half2full(",,")
,,
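Full-width forms U+FF01 through U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and the ideographic space U+3000 maps to a normal space; a minimal sketch of both directions based on that offset (the library may cover additional punctuation):
def full2half_sketch(text):
    # Shift full-width characters down by 0xFEE0; map U+3000 to a space.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def half2full_sketch(text):
    # Inverse mapping: printable ASCII shifted up by 0xFEE0, space to U+3000.
    return ''.join(chr(0x3000) if ch == ' '
                   else chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E
                   else ch
                   for ch in text)

print(full2half_sketch(",,"))  # ,,
print(half2full_sketch(",,"))   # ,,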
Vectorization implemented following the paper:
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
Average pooling.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_aver(pretrain_wordvec, size, jieba.lcut(context))
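Conceptually, average pooling is just the mean of the word vectors; a rough numpy sketch under the assumption that out-of-vocabulary words are skipped (the library's own OOV handling may differ):
import numpy as np

def doc2vec_aver_sketch(pretrained_emb, emb_size, context):
    # Mean of the available word vectors; zeros if nothing is in vocabulary.
    vectors = [pretrained_emb[word] for word in context if word in pretrained_emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(emb_size)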
Max pooling in each dimension.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_max(pretrain_wordvec, size, jieba.lcut(context))
Concatenates the average pooling and max pooling results.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_concat(pretrain_wordvec, size, jieba.lcut(context))
Average pooling over sliding windows, followed by max pooling.
Arguments
pretrained_emb(object)
: pre-trained word embedding that can return a vector in the form pretrained_emb['word']
emb_size(int)
: size of the pre-trained word embedding
context(list)
: input doc as a list - each item of the list must be able to get a vector from pretrained_emb, e.g. pretrained_emb[context[0]]
windows(int)
: size of the sliding window
Returns
document vector(list)
: vectorized context
Examples
import jieba
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
nlp2.doc2vec_hier(pretrain_wordvec, size, jieba.lcut(context))
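A rough numpy sketch of the hierarchical pooling idea from the paper: average the vectors inside each sliding window, then take the element-wise max over all window averages (OOV handling and short-input behaviour are assumptions here):
import numpy as np

def doc2vec_hier_sketch(pretrained_emb, emb_size, context, windows=3):
    # Window-wise average pooling followed by element-wise max pooling.
    vectors = [pretrained_emb[w] for w in context if w in pretrained_emb]
    if not vectors:
        return np.zeros(emb_size)
    vectors = np.asarray(vectors)
    windows = min(windows, len(vectors))
    window_means = [vectors[i:i + windows].mean(axis=0)
                    for i in range(len(vectors) - windows + 1)]
    return np.max(window_means, axis=0)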
Calculates the cosine similarity between two vectors.
Arguments
vector1(list)
: first input vector
vector2(list)
: second input vector
Returns
cos similarity(float)
: similarity of the two vectors
Examples
import nlp2
from gensim.models import KeyedVectors
pretrain_wordvec = KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
input1 = nlp2.doc2vec_concat(pretrain_wordvec, size, "DC")
input2 = nlp2.doc2vec_concat(pretrain_wordvec, size, "漫威")
nlp2.cosine_similarity(input1, input2)
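For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms; a small numpy sketch:
import numpy as np

def cosine_similarity_sketch(vec1, vec2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    v1, v2 = np.asarray(vec1, dtype=float), np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0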
Arguments
length(int)
: length of the random string
Returns
randstr(String)
: random string of the given length, drawn from "0123456789ABCDEF"
Examples
random_string(10)
D6857CE0F4
Arguments
length(int)
: length of the random string
Returns
randstr(String)
: random string whose size is length plus the timestamp length (10)
Examples
random_string_with_timestamp(1)
1435474326D
Random value drawn from a range given in array form.
int,float : [min,max]
string : [candidate1,candidate2...]
Arguments
range(array)
: range in array form
Returns
random result (depends on input)
: a random value matching the input condition
Examples
# for string
y = random_value_in_array_form(["SGD","ADAM","XDA"])
print(y)
'ADAM'
# for int
y = random_value_in_array_form([1,12])
print(y)
4
# for float
y = random_value_in_array_form([0.01,1.00])
print(y)
0.34
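A rough sketch of the dispatch the description implies: two numbers are read as a [min, max] range (randint for ints, uniform for floats), anything else as a list of candidates. Rounding floats to two decimals is only an assumption matching the example output:
import random

def random_value_in_array_form_sketch(value_range):
    # Numeric pair -> range; otherwise pick one of the candidates.
    if (len(value_range) == 2
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value_range)):
        low, high = value_range
        if isinstance(low, float) or isinstance(high, float):
            return round(random.uniform(low, high), 2)
        return random.randint(low, high)
    return random.choice(value_range)

print(random_value_in_array_form_sketch(["SGD", "ADAM", "XDA"]))  # e.g. 'ADAM'
print(random_value_in_array_form_sketch([1, 12]))                 # e.g. 4
print(random_value_in_array_form_sketch([0.01, 1.00]))            # e.g. 0.34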