tableQA_api_documentation.md

tableQA-Chinese API Documentation


tableQA components


      Use JointBERT to train the condition-extraction task.

data_loader
            tableQA data provider; yields data tables together with their questions and training data.

findMaxSubString
            returns the longest common substring of two texts.
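
findMaxSubString is essentially a longest-common-substring routine; a minimal sketch of the idea (not the project's actual implementation) might look like this:

```python
def find_max_substring(a: str, b: str) -> str:
    """Return the longest common substring of a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For example, find_max_substring("宁波的房价", "城市是宁波") returns "宁波", which is how a noisy condition-value label can be aligned back to the question text.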

sentence_t3_gen
            extracts condition triples (length-3 tuples) from the question and training data. Because the original training data contains some labeling noise, findMaxSubString is used to deal with it.

explode_q_cond
            explodes the samples produced by sentence_t3_gen.

all_t3_iter
            yields all condition-extraction training data.

q_t3_writer
            serializes the data from all_t3_iter to local storage.

labeling
            transforms condition-extraction data into JointBERT's NER format.

dump_dfs_to_dir
            saves the labeling results in a JointBERT-friendly directory format.
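
To illustrate the JointBERT NER format, a character-level BIO tagging of a condition value might look like the sketch below; the B-VALUE/I-VALUE label names are illustrative assumptions, not the project's actual tag set:

```python
def to_bio(question: str, cond_value: str):
    """Tag each character of `question` with B-/I-VALUE over the
    condition-value span and O elsewhere (char-level slot labels,
    as commonly used for Chinese text)."""
    tags = ["O"] * len(question)
    start = question.find(cond_value)
    if start != -1:
        tags[start] = "B-VALUE"
        for k in range(start + 1, start + len(cond_value)):
            tags[k] = "I-VALUE"
    return list(question), tags

tokens, tags = to_bio("城市是宁波", "宁波")
```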

JointProcessor_DROP_SOME
            a variant of JointBERT's JointProcessor.

Trainer_DROP_SOME
            a variant of JointBERT's Trainer.


      Notebook to explore a good rule for classifying which aggregate keyword to use.

agg_classifier_data_loader
            data loader for all the useful features and aggregate labels.

standlize_agg_column
            standardizes the agg-label column: keeps samples whose agg-label list agrees on a single label and removes multi-agg-label samples.

transform_stats_to_df
            produces a token-based summary dataframe with evidence scores across the different agg categories.

kws_dict_after_dict
            looks up the dataframe produced by transform_stats_to_df to find keywords for each agg category. This dict has a high recall (about 99.8%), covering nearly all situations.

kws_tuple_key_dict
            a key/value-swapped form of kws_dict_after_dict. Use it to generate permutations of ordering rules for a rule-strategy backtest.

different_rule_product_list
            generates the product of samples from kws_tuple_key_dict that share the same agg number; combined with itertools.permutations, this collection finally produces all sample points (like the full strategy space generated in vectorbt).

kw_matcher_func_format
            a function-string template used to generate the definition of a single ordering-rule labeling function (one sample point, i.e. one strategy in a finance-style backtest).

one_rule_if_else_generate
            uses different_rule_product_list and permutations to actually generate all ordering-rule strategy parameters.

kw_matcher_func_list
            all labeling functions generated by one_rule_if_else_generate, i.e. the actual sample names in the strategy space.

produce_L_train_iter
            because kw_matcher_func_list has length 2880 and the training data has about 40000 rows, the resulting feature matrix is large. An iterator over PandasParallelLFApplier (backed by dask) is used to apply the labeling functions in parallel.
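
The actual parallel apply is delegated to snorkel's PandasParallelLFApplier; conceptually, the chunked application of labeling functions looks roughly like this plain-pandas sketch (the labeling functions here are illustrative):

```python
import numpy as np
import pandas as pd

def apply_lfs_in_chunks(df, lfs, chunk_size=10000):
    """Apply a list of labeling functions row-wise, one chunk at a time,
    so the (n_rows x n_lfs) label matrix is built piece by piece."""
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # each LF maps a row to an int label (-1 = abstain, snorkel convention)
        part = np.array([[lf(row) for lf in lfs] for _, row in chunk.iterrows()])
        parts.append(part)
    return np.vstack(parts)

# illustrative LFs: emit a label if a keyword appears in the question
lf_max = lambda r: 1 if "最大" in r["question"] else -1
lf_sum = lambda r: 2 if "总" in r["question"] else -1

df = pd.DataFrame({"question": ["最大值是多少", "总共多少", "均价"]})
L = apply_lfs_in_chunks(df, [lf_max, lf_sum], chunk_size=2)
```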

reconstruct_data
            combines the results of produce_L_train_iter in a sparse format (many labeling-function predictions are zero).

acc_score_s
            measures balanced_accuracy_score over the different columns of L_train to choose some gold labeling functions (labeling functions with relatively good scores, i.e. good rule strategies).

gold_t5_list
            reverse-maps the gold strategies (a strategy subset that makes the snorkel label model converge) back to parameters, where each t5 is a length-5 tuple ordered so that earlier elements have higher priority to return first in kw_matcher_func_format.

gold_rule_trie
            reworks the DictTrie commonly used in NLP into a TupleTrie, giving a measurement-friendly format for a macro summary over the different t5 tuples in gold_t5_list (like a well-organized strategy space for visualization).

count_all_sub_lines
            counts how many branches the current node has in gold_rule_trie.

count_all_sub_nodes
            counts how many sub-nodes the current node has in gold_rule_trie.

depth
            the depth of the current node in gold_rule_trie.

count_distinct_sub_nodes
            the distinct-count version of count_all_sub_nodes.

full_sub_tree_cnt
            answers: if we generated a worst-case rule trie at the current node, with the same depth and sub-nodes as the current node, how many nodes would that worst-case tree have? Here “worst” means the tree whose leaves are the permutations of the nodes the tree contains, growing level by level. A node with a high full_sub_tree_cnt and a low count_all_sub_nodes makes a good sub-tree (branch), because balanced_accuracy_score “clips” such a sub-tree better.

node_stats_summary
            generates a summary for a single node using the functions above.

sub_tree_stats_summary
            applies node_stats_summary to every first-level node in gold_rule_trie_dict. The summary this produces acts like the result table of a strategy backtest in vectorbt: a simple comparison over it suggests which ordering rules to prefer. Following the discussion in the notebook, the final rule lives in simple_label_func.
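
A toy sketch of a tuple trie with the node statistics described above (the class-free, nested-dict representation here is illustrative, not the project's TupleTrie):

```python
def build_trie(tuples):
    """Build a nested-dict trie from a list of tuples."""
    root = {}
    for t in tuples:
        node = root
        for elem in t:
            node = node.setdefault(elem, {})
    return root

def count_sub_nodes(node):
    """Total number of descendant nodes below `node`."""
    return sum(1 + count_sub_nodes(child) for child in node.values())

def depth(node):
    """Height of the subtree rooted at `node`."""
    if not node:
        return 0
    return 1 + max(depth(child) for child in node.values())

trie = build_trie([("MAX", "MIN", "AVG"), ("MAX", "MIN", "SUM"), ("COUNT",)])
```

Here count_sub_nodes(trie) counts all descendants and depth(trie) gives the trie's height; statistics like these are the raw material for a node_stats_summary-style report.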


      Produces the conclusion for a single table given a question as input. When imported as a module in another script or notebook, you can simply add to or overwrite the *_kws dictionaries or pattern_list to fit your own context. Usage of this script is shown in tableqa-single-valid.ipynb.

*_kws dictionary
            global dictionaries used by the aggregate classifier.

pattern_list
            global list of patterns used to recognize a table column as a datetime column.

predict_single_sent
            retrieves the raw condition-extraction result in JointBERT format.

recurrent_extract
            applies predict_single_sent recursively to extract the conditions, the residual components (with conditions removed), and the connective string (an 'and'/'or' lookup string) from the question.

rec_more_time
            applies recurrent_extract recursively to obtain exhaustive condition extractions from the question.

find_min_common_strings
            finds a subset of common strings using a non-greedy filter.

time_template_extractor
            matches pattern_list elements against one column of the table data.

justify_column_as_datetime
            applies time_template_extractor over a dataframe.

justify_column_as_datetime_reduce
            merges the results of the different pattern_list elements in a map-reduce manner.
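
The datetime-column check can be pictured as a per-pattern test merged over pattern_list, roughly as in this sketch (the two regex patterns are illustrative, not the project's actual pattern_list):

```python
import re
import pandas as pd

# illustrative patterns; the project's pattern_list may differ
pattern_list = [r"\d{4}-\d{1,2}-\d{1,2}", r"\d{4}年\d{1,2}月"]

def column_matches_pattern(series: pd.Series, pattern: str) -> bool:
    """True if every non-null cell in the column matches the pattern."""
    vals = series.dropna().astype(str)
    return len(vals) > 0 and vals.str.match(pattern).all()

def is_datetime_column(series: pd.Series) -> bool:
    """Merge the per-pattern verdicts: any pattern covering the column wins."""
    return any(column_matches_pattern(series, p) for p in pattern_list)

df = pd.DataFrame({"日期": ["2020-01-01", "2020-02-01"], "城市": ["宁波", "上海"]})
```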

choose_question_column
            decides which column the question asks about, first checking whether the question concerns a datetime column (via justify_column_as_datetime_reduce).

single_label_func
            when the question contains aggregate keywords (i.e. is about MAX, MIN, AVG, COUNT, or SUM), decides which aggregate keyword to use. The construction of this function comes from agg-classifier.ipynb, using a snorkel labeling model and parameter-selection methods in the style of a strategy backtest (as in vectorbt).

simple_special_func
            decides whether the question is a “special” question, meaning that the corresponding SQL query contains aggregate words (MAX, MIN, AVG, COUNT, SUM).

simple_total_label_func
            decides the aggregate keyword used in the SQL query (first applies simple_special_func; if the question is special, uses single_label_func to decide which aggregate keyword to use).
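
The two-stage decision (simple_special_func, then single_label_func) can be sketched as follows; the keyword table here is a tiny illustrative stand-in for the project's real *_kws dictionaries:

```python
# illustrative keyword groups; the real dictionaries are much richer
AGG_KWS = {"MAX": ["最大", "最高"], "MIN": ["最小", "最低"],
           "AVG": ["平均", "均价"], "COUNT/SUM": ["总", "一共", "多少个"]}

def is_special(question: str) -> bool:
    """Stage 1: does the question need any aggregate at all?"""
    return any(kw in question for kws in AGG_KWS.values() for kw in kws)

def pick_agg(question: str) -> str:
    """Stage 2: the first matching keyword group wins (one ordering rule)."""
    for agg, kws in AGG_KWS.items():
        if any(kw in question for kw in kws):
            return agg
    return ""

def total_label(question: str) -> str:
    """Combine both stages, like simple_total_label_func conceptually does."""
    return pick_agg(question) if is_special(question) else ""
```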

split_by_cond
            extracts the conditions and connective string from the question and the residual components (with conditions removed); uses fit_kw to retrieve the connective string ('and'/'or').

filter_total_conds
            filters the conditions extracted by split_by_cond using the datatypes of the table's columns.

augment_kw_in_question
            appends, via configuration, condition keywords that JointBERT cannot extract. For example, JointBERT can extract (“城市”, ==, “宁波”) ((“city”, ==, “Ningbo”)) from “城市是宁波的一手房成交均价是多少?” (“What is the average transaction price of new homes in the city of Ningbo?”) but not from “宁波的一手房成交均价是多少?” (the same question without the explicit word “city”). So this function first determines which table columns are category columns, then checks whether any category value from those columns appears as a condition in the question, and appends it to “question_kw_conds”.
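
A rough sketch of this augmentation idea: scan the table's low-cardinality (category-like) columns for values that appear verbatim in the question (the function name and cardinality threshold are assumptions for illustration):

```python
import pandas as pd

def augment_category_conds(question: str, df: pd.DataFrame, max_card=20):
    """For each low-cardinality (category-like) column, add a condition
    (column, "==", value) whenever one of its values appears verbatim
    in the question."""
    conds = []
    for col in df.columns:
        values = df[col].dropna().astype(str).unique()
        if len(values) > max_card:
            continue  # treat high-cardinality columns as non-categorical
        for v in values:
            if v and v in question:
                conds.append((col, "==", v))
    return conds

table = pd.DataFrame({"城市": ["宁波", "上海"], "均价": [21000, 54000]})
conds = augment_category_conds("宁波的一手房成交均价是多少?", table)
```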

choose_question_column_by_rm_conds
            same as choose_question_column, but removes all conditions first.

choose_res_by_kws
            an alternative method (parallel to choose_question_column) that chooses the question column using the components nearest to the question words.

cat6_to_45_column_type
            simple_total_label_func above treats “COUNT” and “SUM” as identical; this function distinguishes them by the question-column datatype: category columns use “COUNT”, numerical columns use “SUM”.
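
A sketch of this COUNT/SUM disambiguation by column dtype (the function name is illustrative):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def count_or_sum(df: pd.DataFrame, question_column: str) -> str:
    """Resolve the merged COUNT/SUM label: numerical columns get SUM,
    category (non-numeric) columns get COUNT."""
    return "SUM" if is_numeric_dtype(df[question_column]) else "COUNT"

df = pd.DataFrame({"城市": ["宁波", "上海"], "均价": [21000, 54000]})
```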

full_before_cat_decomp
            applies all of the above functions to a question to obtain the final tableQA prediction. When only_req_columns is set to True, it returns only the truly needed prediction fields:

“question”: the user's input question
“total_conds_filtered”: all extracted conditions
“conn_pred”: the connective string (“and”/“or”) among conditions
“question_column”: the column the question asks about
“agg_pred”: the aggregate operator on the question column

            Because many components are extracted above, the final aggregate operator takes the maximum value over them, and the final question column is retrieved in a particular order (from the accurate residual components down to the nearest-to-question-word prediction on the full question).


databaseQA components


      Construct a finance dictionary from the open-source ownthink knowledge graph.

retrieve_entities_data
            fetches entity information via the ownthink API and saves it locally.

read_fkg
            reads the finance-entity data from local storage via JSON loading (as Python objects).

finance_tag_filter
            decides whether an entity is a finance entity based on the tags in the ownthink response data.


      Use the finance dictionary and a BERTopic model to build profiles for tables.

template_extractor
            extracts all Chinese components from a table.

extract_text_info_from_tables_json
            extracts all text components from a table (including the header).

eval_df
            evaluates the dataframe's collection columns as Python objects.


      Use the profiles built by tableqa_search.py to filter finance tables out of the table collection.

retrieve_finance_tag_df
            builds a finance-tag dataframe from the JSON-format finance dictionaries.

filter_high_evidence_func
            filters out the high-evidence part of the dataframe: tables that have a title, plus tables without a title whose header contains finance keywords (need_words).

low_evidence_finance_df
            snorkel labeling-function conclusions that predict samples (tables) as finance tables. The labeling functions if_top_topic_id and if_max_topic_id come from the BERTopic labels produced by tableqa_search.py; if_other_topic is an observed rule on the title; the if_tag_in_entities_* functions use finance_tag_df grouped by tag. After this voting (a fitted labeling-model prediction) and some rule-based filtering, filter_high_evidence_func produces a relatively clean subset of finance tables, and this subset's profile is saved locally.


      Use the finance tables' profiles and tableQA_single_table.py to perform finance databaseQA search over the finance database.

retrieve_table_in_pd_format
            instantiates a dataframe from the finance database, attaching the header to the raw data and adding alias columns via the ori_alias_mapping config (this helps the table perform better when the question column is referred to by an alias).

search_tables_on_db_iter
            uses sqlite_utils' search function on the FTS5-registered meta_table (without loss of generality, meta_table in this notebook always points to the finance description table produced by tableqa_finance_unsupervised.py) to retrieve all tables related to the keywords, in dataframe format via retrieve_table_in_pd_format.

search_question
            searches for and sorts the table names related to the input question and meta_table, using a BM25 measurement.
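
In the project the BM25 ranking comes from sqlite FTS5; for intuition, the Okapi BM25 score it is based on can be computed directly, as in this self-contained sketch (character tokens stand in for a real Chinese tokenizer):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter()
    for d in docs_tokens:
        doc_freq.update(set(d))  # document frequency per token
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

# rank two table descriptions against a question
docs = [list("宁波 房价 均价"), list("上海 股票 行情")]
scores = bm25_scores(list("宁波 均价"), docs)
```

Here the first table description shares far more tokens with the question and therefore receives the higher score.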

get_table
            gets a table object (by its name attribute) from a collection of table objects initialized with sqlite_utils' Table class.

extract_special_header_string
            uses the meta_table header column to produce Chinese and English finance-specific header keywords. This function helps zh_sp_tks and en_sp_tks build a mapping for commonly asked columns among the question, the conditions, and the question column.

produce_token_dict
            uses sp_tks and sp to build a two-level *_sp_dict, as prepared by extract_special_header_string.

calculate_sim_score_on_s
            uses zh_sp_dict and en_sp_dict to guess the best-matching header column for the question column and conditions, and calculates a BM25 score as the similarity measurement between the question and the table.

percentile_sort
            a composite re-sort over the different scores produced by the BM25 measure above (question-table similarity) and the tableQA-conclusion validation measurements defined in sort_search_tableqa_df's sort_order_list and sort_func_list. It uses numpy's percentile to turn the quality measurements over the different scores into a staircase of statistical levels that search results can walk down.
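
The percentile staircase idea can be sketched with numpy directly; the function name and the quartile cut points here are illustrative:

```python
import numpy as np

def percentile_stairs(scores, qs=(25, 50, 75)):
    """Map raw scores to discrete quality levels ("stairs") using
    np.percentile, so heterogeneous scores become comparable ranks."""
    cuts = np.percentile(scores, qs)
    return np.searchsorted(cuts, scores, side="right")

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.95])
levels = percentile_stairs(scores)
```

Scores landing in the same percentile band share a level, so results from very different score distributions can be compared and re-sorted together.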

sort_search_tableqa_df
            adds the tableQA-conclusion validation measurements defined by sort_order_list and sort_func_list, the BM25 score between the question and all_text_str_elements as search_score, and many other scores defined by the functions above. t5_order controls the lexicographic order within qa_score_tuple, which is useful for the output of run_sql_search (in that function qa_score_tuple is more useful than percentile_sort, because run_sql_search has already removed some invalid search results by actually running the SQL queries).

single_question_search
            runs question QA over a collection of tables (databaseQA). Applying percentile_sort to this function's output shows the databaseQA performance. The columns a user should care about are:

"question_column",
"total_conds_filtered",
"agg_pred",
"conn_pred",
"score",
"name"

            The first four columns are the tableQA conclusion; columns matching the “score” pattern are search-score columns; and “name” refers to the table name in the database. Users can check a table's structure with the get_table function.

run_sql_query
            runs the SQL query on the output of single_question_search by initializing a sqlite table.

run_sql_search
            adds the run_sql_query results to the single_question_search output and removes invalid samples, where “invalid” means a query that returns no records (via a select count(*) statement) on the table. Prefer “qa_score_tuple” as the sort score here.