Use JointBERT to train the condition extraction task
data_loader
tableQA data provider; yields data tables together with their questions and training data.
findMaxSubString
returns the longest common substring of two texts.
sentence_t3_gen
extracts condition triples (length-3 tuples) from the question and training data. Because the original training data contains some labeling noise, findMaxSubString is used to cope with it.
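A hedged sketch of the longest-common-substring alignment; the real findMaxSubString may differ in details:

```python
def find_max_substring(a: str, b: str) -> str:
    """Return the longest common substring of two texts, via dynamic
    programming over suffix match lengths.  Used here to align noisy
    condition labels with the question text."""
    best = ""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > len(best):
                    best = a[i - dp[i][j]:i]
    return best
```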
explode_q_cond
explodes the samples produced by sentence_t3_gen.
all_t3_iter
yields all condition-extraction training data.
q_t3_writer
serializes the all_t3_iter data to local disk.
labeling
transforms the condition-extraction data into JointBERT's NER format.
dump_dfs_to_dir
saves the labeling results in a JointBERT-friendly format.
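The NER-format transform can be sketched as character-level BIO tagging; the tag name and character granularity here are assumptions:

```python
def to_bio_tags(question: str, cond_value: str, tag: str = "COND"):
    """Mark the span of a condition value inside the question with
    BIO tags, character by character, as slot-filling input."""
    tags = ["O"] * len(question)
    start = question.find(cond_value)
    if cond_value and start >= 0:
        tags[start] = f"B-{tag}"
        for i in range(start + 1, start + len(cond_value)):
            tags[i] = f"I-{tag}"
    return tags
```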
JointProcessor_DROP_SOME
the JointProcessor from JointBERT.
Trainer_DROP_SOME
the Trainer from JointBERT.
Notebook to explore a good rule for classifying which aggregate keyword to use
agg_classifier_data_loader
data loader for all useful features and the agg labels.
standlize_agg_column
keeps identical agg labels from the agg-label list and removes multi-agg-label samples.
transform_stats_to_df
produces a token-based summary dataframe with evidence scores across the different agg categories.
kws_dict_after_dict
looks up the dataframe produced by transform_stats_to_df to find keywords for the different agg categories. This dict has a high recall (about 99.8%), covering nearly all situations.
kws_tuple_key_dict
a key-value-swapped form of kws_dict_after_dict. Used to generate permutations of ordering rules for the rule-strategy backtest.
different_rule_product_list
generates the product of samples from kws_tuple_key_dict that share an agg number. Combined with itertools permutations, this collection finally produces all sample points (like all strategies generated in a vectorbt strategy space).
kw_matcher_func_format
format string used to generate the definition of a single ordering-rule labeling function (one sample point, i.e. one strategy in a finance-style strategy backtest).
one_rule_if_else_generate
uses different_rule_product_list and permutations to actually generate all ordering-rule strategy parameters.
kw_matcher_func_list
all labeling functions generated by one_rule_if_else_generate, i.e. the names of all samples in the strategy space.
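The rule-generation step can be sketched with itertools; kws_dict below is an illustrative stand-in, the real keyword dictionaries are derived from the data:

```python
from itertools import permutations, product

# Hypothetical keyword dict: each agg category maps to candidate
# keyword tuples (the real one is derived from kws_dict_after_dict).
kws_dict = {
    "MAX": [("最高", "最大")],
    "MIN": [("最低", "最小")],
    "AVG": [("平均",)],
}

def all_rule_strategies(kws_dict):
    """Yield every ordering-rule strategy: one keyword tuple per agg
    category (product), times every if/else test order (permutations)."""
    cats = list(kws_dict)
    for choice in product(*(kws_dict[c] for c in cats)):
        for order in permutations(zip(cats, choice)):
            yield order

strategies = list(all_rule_strategies(kws_dict))  # 1 choice x 3! orders
```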
produce_L_train_iter
because kw_matcher_func_list has length 2880 and there are about 40000 training rows, the feature matrix is large. This iterator drives PandasParallelLFApplier (backed by dask) to apply the labeling functions in parallel.
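A minimal sketch of the chunked apply, with plain pandas standing in for the dask-backed PandasParallelLFApplier; the applier callable and chunk size are assumptions:

```python
import pandas as pd

def produce_l_train_iter(df, apply_chunk, chunk_size=8):
    """Yield the label matrix piece by piece: slice the frame so each
    call to the (parallel) applier only materialises a small part of
    the ~40000 x 2880 matrix at a time."""
    for start in range(0, len(df), chunk_size):
        yield apply_chunk(df.iloc[start:start + chunk_size])

# Toy run: an identity "applier" just returns each slice.
df = pd.DataFrame({"question": [f"q{i}" for i in range(20)]})
chunk_row_counts = [len(part) for part in produce_l_train_iter(df, lambda c: c)]
```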
reconstruct_data
combines the results of produce_L_train_iter in a sparse format (many labeling-function predictions are zero).
acc_score_s
measures balanced_accuracy_score across the columns of L_train to choose some gold labeling functions (labeling functions with relatively good scores, i.e. good rule strategies).
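The per-column scoring step can be sketched as follows; the abstain value and matrix shapes are assumptions:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def acc_score_per_lf(L_train, y_gold, abstain=-1):
    """Score each labeling-function column against the gold labels,
    ignoring abstains, so the best rule strategies can be kept."""
    scores = []
    for j in range(L_train.shape[1]):
        col = L_train[:, j]
        mask = col != abstain
        if not mask.any():
            scores.append(0.0)
            continue
        scores.append(balanced_accuracy_score(y_gold[mask], col[mask]))
    return np.array(scores)

# Toy label matrix: column 0 is perfect, column 1 abstains once.
L_train = np.array([[1, -1], [0, 1], [1, 1], [0, 0]])
y_gold = np.array([1, 0, 1, 0])
scores = acc_score_per_lf(L_train, y_gold)
```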
gold_t5_list
reverse-maps the gold strategies (a strategy subset that makes the snorkel label model converge) back to their parameters. A t5 is a length-5 tuple, ordered so that earlier elements take precedence and are returned first in kw_matcher_func_format.
gold_rule_trie
replaces the dict trie commonly used in NLP with a tuple trie, giving a measurement-friendly format for a macro summary over the different t5s in gold_t5_list (like a well-organized strategy space for visualization).
count_all_sub_lines
counts how many branches the current node of gold_rule_trie has.
count_all_sub_nodes
counts how many sub-nodes the current node of gold_rule_trie has.
depth
depth of the current node in gold_rule_trie.
count_distinct_sub_nodes
distinct-count version of count_all_sub_nodes.
full_sub_tree_cnt
the number of nodes the worst-case rule trie would have if it were generated at the current node with the same depth and sub-nodes. The "worst" tree is the one whose leaves are the permutations of this tree's nodes, stepped up level by level. A node with a high full_sub_tree_cnt and a low count_all_sub_nodes makes a good sub-tree (branch), because balanced_accuracy_score "clips" such a sub-tree better.
node_stats_summary
generates a summary for a single node using the functions above.
sub_tree_stats_summary
applies node_stats_summary to every first-level node of gold_rule_trie_dict. Its summary works like the results table of a strategy backtest in vectorbt: a simple comparison over it gives advice about the different ordering rules. Following the discussion in the notebook, the final rule lives in single_label_func.
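Over a plain nested-dict trie (an assumed stand-in for the tuple trie), the node statistics can be sketched as:

```python
def count_all_sub_lines(trie: dict) -> int:
    """Number of branches directly under the current node."""
    return len(trie)

def count_all_sub_nodes(trie: dict) -> int:
    """Total number of nodes below the current node."""
    return sum(1 + count_all_sub_nodes(child) for child in trie.values())

def depth(trie: dict) -> int:
    """Depth of the subtree rooted at this node (a leaf has depth 0)."""
    return 1 + max(map(depth, trie.values())) if trie else 0

# Toy trie: root -> a -> {b, c}; root -> d.
trie = {"a": {"b": {}, "c": {}}, "d": {}}
```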
Gives the conclusion for a single table given a question as input. When importing this module in another script or notebook, you can simply add to or override some *_kws dictionaries or pattern_list to fit your own context. The usage of this script is shown in tableqa-single-valid.ipynb.
*_kws dictionary
global dictionaries used by the aggregate classifier.
pattern_list
global list used to match a table column as a datetime column.
predict_single_sent
retrieves the raw condition-extraction conclusion in JointBERT format.
recurrent_extract
applies predict_single_sent recursively to extract the conditions, the residual components (question with conditions removed), and the connective string (an "and"/"or" lookup string) from the question.
rec_more_time
applies recurrent_extract recursively to extract conditions from the question exhaustively.
find_min_common_strings
finds a subset of common strings with a non-greedy filter.
time_template_extractor
matches pattern_list elements against one column of table data.
justify_column_as_datetime
applies time_template_extractor over the dataframe.
justify_column_as_datetime_reduce
merges the conclusions for the different pattern_list elements in map-reduce fashion.
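A sketch of the map-reduce merge, with an illustrative two-pattern pattern_list; the real list and the per-column match criterion are assumptions:

```python
from functools import reduce

import pandas as pd

# Illustrative datetime patterns; the real pattern_list is richer.
pattern_list = [r"\d{4}年", r"\d{4}-\d{2}-\d{2}"]

def column_matches(series, pattern):
    """True when every non-null cell of the column matches the pattern."""
    s = series.dropna().astype(str)
    return bool(len(s) and s.str.contains(pattern, regex=True).all())

def justify_column_as_datetime_reduce(df):
    """Map each pattern over all columns, then reduce the per-pattern
    verdicts with a logical OR."""
    per_pattern = [
        {c: column_matches(df[c], p) for c in df.columns} for p in pattern_list
    ]
    return reduce(lambda a, b: {c: a[c] or b[c] for c in a}, per_pattern)

df = pd.DataFrame({"日期": ["2020年", "2021年"], "均价": ["15000", "16000"]})
verdict = justify_column_as_datetime_reduce(df)
```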
choose_question_column
decides which column the question asks about, first checking whether the question concerns a datetime column (using justify_column_as_datetime_reduce).
single_label_func
when the question contains aggregate keywords (i.e. it is a question about MAX, MIN, AVG, COUNT, or SUM), decides which aggregate keyword to use. Its construction comes from agg-classifier.ipynb, using a snorkel labeling model and parameter-selection methods from strategy backtesting (as in vectorbt).
simple_special_func
decides whether the question is a special question.
"Special" means that the SQL query corresponding to the question uses an aggregate word (MAX, MIN, AVG, COUNT, SUM).
simple_total_label_func
decides the aggregate keyword used in the SQL query (first runs simple_special_func; if the question is special, uses single_label_func to pick the aggregate keyword).
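A hedged sketch of the special/aggregate decision; AGG_KWS below is illustrative, the real keywords and their test order are the ones tuned by the backtest:

```python
# Illustrative keyword dict; the real keywords and their test order
# come from the agg-classifier backtest.
AGG_KWS = {
    "MAX": ("最大", "最高"),
    "MIN": ("最小", "最低"),
    "AVG": ("平均",),
    "SUM": ("总共", "一共"),
}

def simple_special_func(question: str) -> bool:
    """A question is 'special' when any aggregate keyword appears in it."""
    return any(kw in question for kws in AGG_KWS.values() for kw in kws)

def single_label_func(question: str) -> str:
    """Return the first aggregate category whose keyword matches,
    following the fixed ordering rule."""
    for agg, kws in AGG_KWS.items():
        if any(kw in question for kw in kws):
            return agg
    return ""

def simple_total_label_func(question: str) -> str:
    return single_label_func(question) if simple_special_func(question) else ""
```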
split_by_cond
extracts the conditions and the connective string from the question and its residual components (conditions removed); uses fit_kw to retrieve the connective string ("and"/"or").
filter_total_conds
filters the conditions extracted by split_by_cond with the help of the table's column datatypes.
augment_kw_in_question
appends, via configuration, condition keywords that JointBERT cannot extract.
For example:
JointBERT can extract ("城市", ==, "宁波") from "城市是宁波的一手房成交均价是多少?" ("What is the average transaction price of new homes in the city of Ningbo?") but cannot extract ("城市", ==, "宁波") from "宁波的一手房成交均价是多少?" ("What is the average transaction price of new homes in Ningbo?").
So this function first determines which columns of the table are category columns, then checks whether any category value of those columns appears as a condition in the question, and adds it to "question_kw_conds".
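A sketch of the augmentation, assuming "category column" means a low-cardinality column; the cardinality threshold is an assumption:

```python
import pandas as pd

def augment_kw_in_question(question, df, max_card=10):
    """Treat low-cardinality columns as category columns and add a
    (column, "==", value) condition for every category value that
    appears verbatim in the question."""
    conds = []
    for col in df.columns:
        vals = df[col].dropna().astype(str).unique()
        if len(vals) > max_card:
            continue  # not a category column
        conds.extend((col, "==", v) for v in vals if v and v in question)
    return conds

df = pd.DataFrame({"城市": ["宁波", "上海"], "均价": [25000, 42000]})
conds = augment_kw_in_question("宁波的一手房成交均价是多少?", df)
```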
choose_question_column_by_rm_conds
same as choose_question_column, but removes all conditions first.
choose_res_by_kws
an alternative method (parallel to choose_question_column) that chooses the question column using the components nearest to the question words.
cat6_to_45_column_type
simple_total_label_func above treats "COUNT" and "SUM" as identical. This function distinguishes them by the question column's datatype: category columns use "COUNT", numerical columns use "SUM".
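The COUNT/SUM disambiguation can be sketched with a pandas dtype check; the function signature is an assumption:

```python
import pandas as pd

def cat6_to_45_column_type(agg_pred, df, question_column):
    """Disambiguate COUNT vs SUM by the question column's datatype:
    numerical columns sum values, category columns count rows."""
    if agg_pred not in ("COUNT", "SUM"):
        return agg_pred
    if pd.api.types.is_numeric_dtype(df[question_column]):
        return "SUM"
    return "COUNT"

df = pd.DataFrame({"城市": ["宁波", "上海"], "均价": [25000.0, 42000.0]})
```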
full_before_cat_decomp
applies all of the above functions to the question to produce the final tableQA prediction.
When only_req_columns is set to True, returns only the predictions that are actually needed:
"question": the user's input question
"total_conds_filtered": all extracted conditions
"conn_pred": the connective string ("and"/"or") among the conditions
"question_column": the column the question asks about
"agg_pred": the aggregate operator on the question column
Because many components are extracted above, the final aggregate operator takes the maximum value over them, and the final question column is retrieved in a particular order (from the accurate residual components down to the question-word-nearest prediction on the full question).
Constructs a finance dictionary from the open-source knowledge graph at ownthink.
retrieve_entities_data
fetches entity information via the ownthink API and saves it locally.
read_fkg
reads the finance entity data from local disk via json load (as Python objects).
finance_tag_filter
decides whether an entity is a finance entity by the tags in the ownthink response data.
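A sketch of the tag check; the response shape and the FINANCE_TAGS set are assumptions about the ownthink payload:

```python
# Illustrative tag whitelist; the real finance tag set is larger.
FINANCE_TAGS = {"金融", "股票", "基金", "经济"}

def finance_tag_filter(entity: dict) -> bool:
    """An entity counts as financial when any of its tags is in the
    finance tag set (assumed ownthink-like response shape)."""
    tags = entity.get("data", {}).get("tag", [])
    return any(t in FINANCE_TAGS for t in tags)
```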
Uses the finance dictionary and a BERTopic model to build a profile of each table.
template_extractor
extracts all Chinese components from a table.
extract_text_info_from_tables_json
extracts all text components from a table (including the header).
eval_df
evals the dataframe's collection columns into Python objects.
Uses the profile built by tableqa_search.py to filter the finance tables out of the table collection.
retrieve_finance_tag_df
builds a finance-tag dataframe from the JSON-format finance dictionaries.
filter_high_evidence_func
filters out the high-evidence part of the dataframe: rows with a title, plus rows without a title whose header contains finance keywords (need_words).
low_evidence_finance_df
the snorkel labeling-model conclusion that predicts which samples (tables) are finance tables (the labeling functions if_top_topic_id and if_max_topic_id come from the BERTopic labels in tableqa_search.py; if_other_topic is an observed rule on titles; the if_tag_in_entities_* functions use finance_tag_df with different tag groups).
After this voting (the labeling model's fitted prediction) and some rule-based filtering, filter_high_evidence_func produces a relatively clean finance-table subset, and this subset's profile is saved locally.
Uses the finance tables' profiles and tableQA_single_table.py to perform finance databaseQA search on the finance database.
retrieve_table_in_pd_format
instantiates a dataframe from the finance database, adding the header onto the raw data and adding alias columns via the ori_alias_mapping config (this helps the table perform better with aliases of the question column).
search_tables_on_db_iter
uses sqlite_utils' search function on the sqlite FTS5-registered meta_table (without loss of generality, meta_table in this notebook always points to the finance description table produced by tableqa_finance_unsupervised.py) to retrieve all tables related to the keywords, in dataframe format via retrieve_table_in_pd_format.
search_question
searches and sorts the table names related to the input question and meta_table by the bm25 measure.
get_table
gets a table object (by its name attribute) from a collection of table objects initialized with sqlite_utils' Table class.
extract_special_header_string
uses the meta_table header column to produce Chinese and English finance-specific header keywords.
This function helps zh_sp_tks and en_sp_tks build a mapping between commonly asked columns of the question, the conditions, and the question column.
produce_token_dict
uses sp_tks and sp to build the two-level *_sp_dict prepared by extract_special_header_string.
calculate_sim_score_on_s
uses zh_sp_dict and en_sp_dict to guess the header column that best matches the question column and conditions. Calculates a bm25 score as the similarity measure between question and table.
percentile_sort
a composite re-sort over the different scores produced by the bm25 above (question-table similarity) and the tableQA-conclusion validation measures defined in sort_search_tableqa_df's sort_order_list and sort_func_list. It uses numpy's percentile to build a robust quality measure over the different scores (this partitions the quality scores into a staircase of statistical levels that the search conclusions can walk down).
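The percentile staircase can be sketched with numpy; the thresholds and the way scores are combined are assumptions:

```python
import numpy as np

def percentile_stairs(scores, qs=(25, 50, 75)):
    """Replace each raw score by the number of percentile thresholds
    it clears, turning heterogeneous score columns into comparable
    small integers (the 'statistic stairs')."""
    scores = np.asarray(scores, dtype=float)
    cuts = np.percentile(scores, qs)
    return np.array([int((s >= cuts).sum()) for s in scores])

def percentile_sort(score_columns):
    """Composite resort: sum the stair levels across the score
    columns and argsort descending."""
    total = sum(percentile_stairs(c) for c in score_columns)
    return np.argsort(-total, kind="stable")

order = percentile_sort([[1, 5, 3, 9]])
```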
sort_search_tableqa_df
adds the tableQA-conclusion validation measures via sort_order_list and sort_func_list, plus the bm25 score between the question and all_text_str_elements as search_score, along with the many other scores defined by the functions above.
t5_order controls the lexicographic order inside qa_score_tuple, which is useful in the conclusion of run_sql_search (in that function, qa_score_tuple is more useful than percentile_sort, because run_sql_search has actually removed some invalid search conclusions by running the SQL queries).
single_question_search
runs question QA over a collection of tables (databaseQA). Applying percentile_sort to the conclusion of this function lets you inspect the performance of databaseQA. The columns the user should care about are:
"question_column",
"total_conds_filtered",
"agg_pred",
"conn_pred",
"score",
"name"
The first four columns are the tableQA conclusion; columns matching the "score" pattern are search-score columns; and "name" refers to the table name in the database (the user can check the table structure with the get_table function).
run_sql_query
runs a SQL query over the conclusion of single_question_search by initializing a sqlite table.
run_sql_search
adds the run_sql_query conclusions to single_question_search's output and removes invalid samples, where "invalid" means a query that matches no records (via a SELECT COUNT(*) statement) on the table.
Prefer "qa_score_tuple" as the sort score.
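A minimal sketch of the invalid-query filter over an in-memory sqlite table; the candidate shape and column names are assumptions:

```python
import sqlite3

def run_sql_query(conn, count_sql):
    """Run the SELECT COUNT(*) form of a predicted query and return
    the number of matching records."""
    return conn.execute(count_sql).fetchone()[0]

def run_sql_search(conn, candidates):
    """Keep only the search candidates whose predicted query actually
    matches records in the table."""
    return [c for c in candidates if run_sql_query(conn, c["count_sql"]) > 0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("宁波", 25000.0), ("上海", 42000.0)])
candidates = [
    {"name": "t", "count_sql": "SELECT COUNT(*) FROM t WHERE city = '宁波'"},
    {"name": "t", "count_sql": "SELECT COUNT(*) FROM t WHERE city = '北京'"},
]
kept = run_sql_search(conn, candidates)
```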