BASPRO: A balanced script producer for speech corpus collection based on the genetic algorithm


Paper link: arxiv
How to use BASPRO? (tutorial in Chinese)


Data collection

Collect your own text data. The final recording script will be composed of sentences drawn from the data you collect.

See raw_data.txt for an example of an original input file.

#raw_data.txt

第一句話
換行後是下一句
每句的長度有可能不同
也有可能包含非中文字

(The example lines read: "The first sentence", "The next sentence comes after a line break", "The length of each sentence may vary", "It may also contain non-Chinese characters".)

Data Processing

Step 1 General Filtering

  • Keep only sentences that consist of exactly ten Chinese characters (a sketch of this filtering appears at the end of this step)
  • Automatically add indexes to all sentences
import preprocessing
preprocessing.general_filter(input_file_path)

e.g.
preprocessing.general_filter("/path/to/raw_data.txt") 

  • The processed file will be saved as result_s1.txt

input format: sentence
output format: Idx#sentence
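
A minimal sketch of the kind of filtering this step performs (the regex, the Idx prefix, and the function name are illustrative assumptions; the actual implementation is preprocessing.general_filter):

import re

def general_filter_sketch(input_path, output_path="result_s1.txt"):
    # keep only sentences made of exactly ten Chinese (CJK) characters
    kept = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if re.fullmatch(r"[\u4e00-\u9fff]{10}", sentence):
                kept.append(sentence)
    # prepend an index to every surviving sentence
    with open(output_path, "w", encoding="utf-8") as f:
        for i, sentence in enumerate(kept):
            f.write("Idx{}#{}\n".format(i, sentence))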


Step 2 Segmentation and POS tagging

Segmentation:

preprocessing.jieba_seg("/path/to/result_s1.txt")

. output files will be saved as {input_file_name}_jieba.txt
input format: Idx#sentence
output format: Idx#sentence#word segmentation
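
For reference, the underlying jieba call looks roughly like this (a sketch; the delimiter used to join the segmented words in the output file is an assumption, the authoritative formatting is in preprocessing.jieba_seg):

import jieba

sentence = "每句的長度有可能不同"
words = jieba.lcut(sentence)                       # list of segmented words
print("Idx0#" + sentence + "#" + " ".join(words))  # Idx#sentence#word segmentation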

Segmentation & POS tagging:

(1) ckiptagger

preprocessing.ckip_seg("/path/to/result_s1.txt")

. After downloading the ckiptagger model data, you will find a directory named data, which contains the models used by ckiptagger. You need to change the ckip_model_path in preprocessing.ckip_seg() to the path of the data directory. Alternatively, put the data directory in the same directory as preprocessing.py and rename the model directory to ckip_data.
. output files will be saved as {input_file_name}_ckip.txt
input format: Idx#sentence
output format: Idx#sentence#word segmentation#pos tags
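
For reference, the underlying ckiptagger calls look roughly like this (a sketch; preprocessing.ckip_seg handles the file I/O and the output formatting):

from ckiptagger import WS, POS

ckip_model_path = "./ckip_data"      # or the path to the downloaded data directory
ws = WS(ckip_model_path)             # word segmentation model
pos = POS(ckip_model_path)           # POS tagging model

sentences = ["每句的長度有可能不同"]
word_lists = ws(sentences)           # e.g. [['每句', '的', '長度', '有', '可能', '不同']]
pos_lists = pos(word_lists)          # POS tags aligned with the segmented words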

(2) ddparser

Additional requirement (conversion between Traditional and Simplified Chinese): opencc
Install:

pip install LAC
pip install ddparser
pip install opencc

Run segmentation:

preprocessing.ddparser_seg("/path/to/result_s1.txt")

. output files will be saved as {input_file_name}_ddparser.txt
input format: Idx#sentence
output format: Idx#sentence#word segmentation#pos tags

(Note: the ddparser cannot run on Mac M1)
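
preprocessing.ddparser_seg wraps these toolkits and the Traditional-to-Simplified conversion for you; the underlying LAC call for joint segmentation and POS tagging looks roughly like this (a sketch; LAC expects Simplified Chinese input, which is why opencc is listed as a requirement):

from LAC import LAC

lac = LAC(mode='lac')                          # joint segmentation + POS tagging
words, tags = lac.run("每句的长度有可能不同")    # Simplified Chinese input
print(words, tags)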

Step 3 Sensitive Word Filtering

  • Remove sentences that contain sensitive words (see sensitive_word_list.txt for details)
preprocessing.sensitive_filter("/path/to/inputfile","/path/to/sensitive/word/list", save_rm=True)
e.g.
preprocessing.sensitive_filter("./result_s1_ckip.txt","./sensitive_word_list.txt", save_rm=True)
# output file will be saved as result_s1_ckip_s3.txt

input & output format: Idx#sentence#word segmentation#{pos tags}

Step 4 POS Filtering

(1) Remove sentences that contain words longer than five characters (most Chinese words are shorter than five characters)
(2) Remove sentences that contain duplicate words
(3) Remove sentences that begin with or contain certain words

Removal criteria

first character:      '二','三','四','五','六','七','八','九','十','也','於','為','還','都','連','並','而','但','再'
first two characters: '而且','於是','就是','無論','因為','其中','為了','儘管','所以','甚至','還是','在於','如果','因此','可能','例如'
in the middle:        '吧','嗎','呢','啊','啥'

(4) Select candidate sentences based on POS tags (a sketch of how these criteria can be applied appears at the end of this step)

POS removal criteria

toolkit  | include                           | start          | end
ckip     | 'Nb','Nc','FW'                    | 'DE','SHI','T' | 'Caa','Cab','Cba','Cbb','P','T'
ddparser | 'LOC','ORG','TIME','PER','w','nz' | 'p','u','c'    | 'xc','u'

ckip POS tag: https://github.com/ckiplab/ckiptagger/wiki/POS-Tags
ddparser POS tag: https://github.com/baidu/lac

# use ckip results only
preprocessing.pos_seg_filter(input_path_ckip="./result_s1_ckip_s3.txt", save_rm=True)
# or use ddparser results only
preprocessing.pos_seg_filter(input_path_ddparser="./result_s1_ddparser_s3.txt", save_rm=True)
# or use both ckip and ddparser results
preprocessing.pos_seg_filter(input_path_ckip="./result_s1_ckip_s3.txt", input_path_ddparser="./result_s1_ddparser_s3.txt", save_rm=True)

. output files will be saved as {input_file_name}_{pos tag}_s4.txt
. if save_rm=True, the removed sentences will be recorded in {input_file_name}_s4_rm.txt
. edit ckip_filter or ddparser_filter function in preprocessing.py if you want to use other criteria
input & output format: Idx#sentence#word segmentation#pos tags
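
A rough sketch of how the POS criteria above could be applied to one ckip-tagged sentence, reading the include / start / end columns as "any tag", "first tag", and "last tag" (an assumption; the authoritative logic is in the ckip_filter and ddparser_filter functions of preprocessing.py):

CKIP_INCLUDE = {'Nb', 'Nc', 'FW'}
CKIP_START = {'DE', 'SHI', 'T'}
CKIP_END = {'Caa', 'Cab', 'Cba', 'Cbb', 'P', 'T'}

def keep_sentence(pos_tags):
    # pos_tags: the ckip POS tags of one sentence, in order
    if any(tag in CKIP_INCLUDE for tag in pos_tags):
        return False
    if pos_tags and pos_tags[0] in CKIP_START:
        return False
    if pos_tags and pos_tags[-1] in CKIP_END:
        return False
    return True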


Step 5 Perplexity Filtering

  • Step 5-0: Install pytorch-transformers
pip install pytorch-transformers
  • Step 5-1: Calculate perplexity scores
preprocessing.get_perplexity("/path/to/inputfile")
e.g.
preprocessing.get_perplexity("./result_s1_ckip_s3_ckipddp_s4.txt") # if you use cpu
#or
preprocessing.get_perplexity("./result_s1_ckip_s3_ckipddp_s4.txt", 'cuda:0') #if you use gpu

Note: using a GPU is recommended
. output files will be saved as {file_name}_per.txt
input format: Idx#sentence#{word segmentation#pos tags}
output format: Idx#sentence{#word segmentation#pos tags}#perplexity_score
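
The model that preprocessing.get_perplexity actually loads is defined in preprocessing.py; the sketch below only illustrates how a sentence-level score can be derived from a language-model loss with pytorch-transformers (the "gpt2" checkpoint is a placeholder assumption):

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(sentence, device="cpu"):
    input_ids = torch.tensor([tokenizer.encode(sentence)]).to(device)
    model.to(device)
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids)[0]   # average token cross-entropy
    return loss.item()                                  # exponentiate for a true perplexity value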

  • Step 5-2: Remove sentences that have high perplexity
preprocessing.perplexity_filter(input_file_path, save_rm=True, th=4.0)

. th: threshold of perplexity filtering, default value is 4.0
. output files will be saved as {input_file_name}_s5.txt
input format: Idx#sentence{#word segmentation#pos tags}#perplexity_score
output format: Idx#sentence{#word segmentation#pos tags}#perplexity_score


Step 6 Intelligibility Filtering

Select candidate sentences based on the intelligibility scores. The following figure shows how the intelligibility score is calculated.

[Figure: how the intelligibility score is calculated from the TTS and ASR results]

Step 6-1: Text to Speech

Toolkit       | Quality   | Processing time for 10 samples | Supports Taiwanese accent | Speaker gender
Gtts          | Excellent | ~20s                           | V                         | Female
Paddle Speech | Good      | ~92s                           | X                         | Male & Female
ttskit        | OK        | ~22s                           | X                         | Male & Female
zhtts         | OK?       | ~15s                           | X                         | Female
text2speech.tts_gtts(input_file_path, save_info=True, convert_format=True) #Set convert_format=False if you want to use the original format
text2speech.tts_paddle(input_file_path, save_info=True)
text2speech.tts_ttskit(input_file_path, save_info=True)
text2speech.tts_zhtts(input_file_path, save_info=True)

PaddleSpeech, ttskit, and zhtts cannot run on Mac M1.

Gtts:
. The original Gtts output format might cause problems when loaded in Python.
. Load the Gtts output waveform using librosa.load("/wav/file/path"). Install ffmpeg if you encounter audioread.exceptions.NoBackendError when using librosa:

conda install -c conda-forge ffmpeg

. save_info=True will save the mapping between waveform file index and content in ttx_info_{toolkit}.txt
. output waveforms will be saved in the {file_name}_{toolkit} directory
. this step might take a long time to finish

input format: Idx#sentence{#word segmentation#pos tags#perplexity_score}
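
For reference, generating one utterance with Gtts and loading it back looks roughly like this (a sketch; text2speech.tts_gtts handles the file naming, the info file, and the optional format conversion for you):

from gtts import gTTS
import librosa

text = "第一句話的範例有十字"
tts = gTTS(text, lang="zh-TW")      # Mandarin voice with a Taiwanese accent
tts.save("Idx1.mp3")                # gTTS natively writes mp3

waveform, sr = librosa.load("Idx1.mp3", sr=16000)   # needs ffmpeg as an audio backend for mp3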

Step 6-2: Calculate the intelligibility scores based on ASR results

preprocessing.calculate_asr_and_intell(input_file_path, wav_dir_path, auto_corr=True)

. install cn2an and opencc if you set auto_corr=True; auto_corr=True automatically converts Arabic numerals in the ASR prediction to Chinese numerals
. this step will take a long time to finish
. the indexes in input_file_path should match the file names in the wav directory
e.g.

* input_file.txt
Idx1#第一句話的範例有十字#...#...#...
Idx2#有沒有包含斷詞不影響#...#...#...

* wav_dir
├── Idx1.wav (TTS results of the sentence "第一句話的範例有十字")
├── Idx2.wav (TTS results of the sentence "有沒有包含斷詞不影響")
│   ...
└── IdxN.wav

. output file will be saved as {input_file_name}_asr.txt
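
The exact definition of the intelligibility score is given by calculate_asr_and_intell and the figure above; conceptually it compares the ASR transcript of the synthesized audio with the original sentence. A minimal sketch of one common way to compare the two, character-level edit distance (whether BASPRO uses this exact measure is an assumption):

def char_edit_distance(reference, hypothesis):
    # Levenshtein distance between two character sequences
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]

print(char_edit_distance("有沒有包含斷詞不影響", "有沒有包含斷詞不影響"))  # 0 -> the ASR output matches the reference exactly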

Step 6-3: Select sentences based on the intelligibility scores

preprocessing.intelligibility_filter(input_file_path, save_rm=True, th=1.0)

. th: threshold of intelligibility filtering, default value is 1.0
. output files will be saved as candidate_sentences.txt
input format: Idx#sentence{#word segmentation#pos tags#perplexity_score#asr_prediction_result}#intelligibility_score
output format: Idx#sentence{#word segmentation#pos tags#perplexity_score#asr_prediction_result}#intelligibility_score

Script composing

Step 1: Calculate statistics for text corpus

Step 1-0: install pypinyin

pip install pypinyin

Step 1-1: calculate the statistics

preprocessing.calculate_statistics("/path/to/raw_data.txt")

input format: sentence
output: four dictionaries saved as pickle files:
(1) gt_syllable.pickle
(2) gt_syllable_with_tone.pickle
(3) gt_initial.pickle
(4) gt_final.pickle
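
calculate_statistics builds these dictionaries from pinyin annotations; the underlying pypinyin calls look roughly like this (a sketch; how the counts are accumulated into the pickle files is handled inside preprocessing.py):

from pypinyin import lazy_pinyin, Style

sentence = "第一句話的範例有十字"
syllables = lazy_pinyin(sentence)                          # syllables without tone
syllables_tone = lazy_pinyin(sentence, style=Style.TONE3)  # syllables with tone numbers
initials = lazy_pinyin(sentence, style=Style.INITIALS)     # syllable initials
finals = lazy_pinyin(sentence, style=Style.FINALS)         # syllable finals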

Step 2: Prepare the data for sampling

preprocessing.prepare_data("gt_syllable_with_tone.pickle", "candidate_sentences.txt")  
 

input:
(1) gt_syllable.pickle or gt_syllable_with_tone.pickle
(2) a candidate sentence list (e.g. candidate_sentences.txt) with format: Idx#sentence#word segmentation{#...}

output:
(1) gt_syllable_distribution.npy # real-world syllable distribution, dimension: (number_of_syllables, 1)
(2) gt_syllables_key.pickle # record the mapping of syllables
(3) idx_syllables.npy # record the mapping of sentences and syllables
(4) idx_content.npy # record the content
(5) idx_oriidx.npy # record the mapping of original index and new index

(1), (3), (4) are inputs for sampling

GA-based Sampling

Adjust the hyperparameters

#in sampling.py file

num_of_set = 20          # number of sets in the corpus
num_of_sen_in_set = 20   # number of sentences in a set
population_size = 10000  # initial population size of the GA
iteration = 500          # number of iterations for the GA

truth_syllable = np.load('gt_syllable_distribution.npy')  # load the results of Step 2 (Prepare the data for sampling)
idx_syllable = np.load('idx_syllables.npy')               # load the results of Step 2 (Prepare the data for sampling)
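
How sampling.py scores a chromosome is defined in the repository; as a rough illustration of the idea, the fitness of a chromosome (a list of candidate-sentence indexes) can be taken as the closeness between the syllable distribution of the selected sentences and the real-world distribution (a sketch; the actual distance measure used by BASPRO may differ):

import numpy as np

def fitness(chromosome, idx_syllables, truth_syllable):
    # sum the syllable counts of the selected sentences and normalize
    corpus_dist = idx_syllables[chromosome].sum(axis=0).astype(float)
    corpus_dist /= corpus_dist.sum()
    truth = truth_syllable.flatten().astype(float)
    truth /= truth.sum()
    # smaller total absolute difference = distribution closer to the real world
    return -np.abs(corpus_dist - truth).sum()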


Generate a new corpus

python sampling.py --outputdir output

output:
(1) best_chro.npy # the best chromosome (the index list of the sampled sentences)
(2) corpus.txt # the content of best_chro
(3) log.txt # the training history
(4) final_chro.npy # the best chromosome at the end of sampling, usually the same as best_chro.npy

Post-processing (Optional)

The post-processing step provides a GA-based and a greedy-based method for replacing some sentences in the corpus. The greedy-based method is better when only a few sentences need to be replaced, while the GA-based method is more suitable when many sentences in the corpus need replacement.

Step 1: record the unwanted sentences

Write the "index_in_sentence_candidates" in the excluded_idx.txt files.

For example, after reading the sentences in corpus.txt, you want to replace "一起搭多元計程車回家" (index_in_sentence_candidates=4993) with another sentence. Create a text file (e.g. excluded_idx.txt) and record the index_in_sentence_candidates.

# corpus.txt (output of the sampling step)

index_in_sentence_candidates:set_idx:sentence_idx:content
4993:0:0:一起搭多元計程車回家
4290:0:1:比其他種類的草莓還甜
1899:0:2:他是七十一歲的老先生

# excluded_idx.txt (one index per line)
4993

Step 2 (option 1): GA-based replacement

Run sampling.py again. The sentences listed in excluded_idx.txt will be replaced by other sentences.

python sampling.py --initial_dir output --excluded excluded_idx.txt

Step 2 (option 2): Greedy-based replacement

The sentences listed in excluded_idx.txt will be replaced by other sentences.

python greedy.py --initial_dir output --excluded excluded_idx.txt

Working on recording scripts for languages other than Mandarin Chinese

(1) Collect a large text dataset you think can represent the real-world condition or the domain you want.
(2) Design your own data processing method and extract the candidate sentences from the dataset you collected.
(3) Calculate the phoneme, syllable, or other characteristics distribution of the dataset you collected.
(4) Create gt_syllable_distribution.npy (the distribution you want to learn), idx_syllables.npy (the syllable distribution of each candidate sentence), and idx_content.npy (the content of each candidate sentence) in the following format:

. Example:

Suppose a language contains only three words, AA, BB, and CC, whose corresponding phonemes are a, b, and c. Assuming the crawled articles are:

AA AA BB CC CC CC CC 
CC AA BB AA AA CC CC CC CC CC

Then, the phonemes of the articles are

a a b c c c c
c a b a a c c c c c

The statistics of the real-world (ground-truth) condition:

gt_syllables = {"a":5,"b":2,"c":10} 
gt_syllables.keys() = ["a","b","c"]  

gt_syllables_distribution = [5, 2, 10]

Assuming the candidate sentences are:

idx_3#AA BB BB
idx_5#BB CC CC  

Then,

idx_syllables = [[1,2,0],  #syllable distribution of the 1st sentence in the candidate sentences file  
                 [0,1,2]]  #syllable distribution of the 2nd sentence in the candidate sentences file 

idx_content = ["AA BB BB",  
               "BB CC CC"]  
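
As a concrete illustration, the three arrays for this toy example could be written out with numpy as follows (a sketch; any dtypes or shapes expected by sampling.py beyond what is documented above are assumptions):

import numpy as np

np.save("gt_syllable_distribution.npy", np.array([[5], [2], [10]]))   # shape: (number_of_syllables, 1)
np.save("idx_syllables.npy", np.array([[1, 2, 0],
                                        [0, 1, 2]]))                  # one row per candidate sentence
np.save("idx_content.npy", np.array(["AA BB BB", "BB CC CC"]))        # content of each candidate sentence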

(5) Change the hyperparameters in sampling.py and run sampling.py
