Paper内容主要被明确划分为了3部分：
1. Title：论文标题。
2. Abstract：论文摘要。
3. Main：论文正文，包括Introduction、Methodology、Conclusion等内容。

主要使用SciPDF Parser进行解析

# 一、paper部分

## 1. 解析pdf文件

1. 运行SciPDF解析环境
```shell
# 公司GPU服务器
bash /data/temp/paper_parsers/promote-openreview/scipdf_parser/serve_grobid.sh
```

2. 使用SciPDF解析文件夹中的PDF文件
```shell
python /data/temp/paper_parsers/promote-openreview/pdf_parser/scipdf_parser.py --dir_path <论文PDF所在文件夹>
```

3. 解析后得到的json文件将存放在同目录下的“scipdf_parser_results文件夹”中
```shell
/data/temp/paper_parsers/promote-openreview/pdf_parser/scipdf_parser_results
```

In [1]:
# 指定未经处理的paper文件和未经处理的review文件的路径
scipdf_path = r"/Users/theon/Desktop/CV_project/julyacademic_gpt/process_data/已解析但待处理的数据(5000篇)/papers"
review_jsonl_path = r"/Users/theon/Desktop/CV_project/julyacademic_gpt/process_data/已解析但待处理的数据(5000篇)/reviews/notes.jsonl"

## 2. 读取paper json文件

In [2]:
# 指定必要库
import json
from pathlib import Path
import pandas as pd
import warnings
from tqdm import tqdm
import re

warnings.filterwarnings("ignore")

In [3]:
# 读取解析出的json文件并组成dataframe
#df_sci = pd.DataFrame()
#for json_file in tqdm(Path(scipdf_path).glob("*.json")):
#    with open(json_file, "r") as fp:
#        dic = json.load(fp)
#        dic["forum"] = json_file.stem
#    df_sci = df_sci.append(dic, ignore_index=True)

In [4]:
data_list = []
for json_file in tqdm(Path(scipdf_path).glob("*.json")):
    with open(json_file, "r") as fp:
        dic = json.load(fp)
        dic["forum"] = json_file.stem
        data_list.append(dic)

df_sci = pd.DataFrame(data_list)

5000it [00:01, 3442.25it/s]


In [5]:
df_sci.columns

Index(['title', 'authors', 'pub_date', 'abstract', 'sections', 'references',
       'figures', 'formulas', 'doi', 'forum'],
      dtype='object')

## 3. 处理main（正文）

In [6]:
# 处理可疑换行符，将可疑换行符处理成\n
# 将\n\n\n以及\n\n转为\n
# 去掉各种疑似\\n
regex_3n = re.compile(r"\n\n\n", flags=(re.S))
regex_spn = re.compile(r"\n \n", flags=(re.S))
regex_sln = re.compile(r"\\n", flags=(re.S))
regex_2n = re.compile(r"\n\n", flags=(re.S))

regexs_n = {
    "regex_3n": regex_3n,
    "regex_spn": regex_spn,
    "regex_sln": regex_sln,
    "regex_2n": regex_2n
}
def replace_1n(content, regexs=regexs_n):
    content = re.sub(regexs["regex_3n"], "\n", content)
    content = re.sub(regexs["regex_spn"], "\n", content)
    content = re.sub(regexs["regex_sln"], "\n", content)
    content = re.sub(regexs["regex_2n"], "\n", content)
    content = re.sub(regexs["regex_spn"], "\n", content)
    content = re.sub(regexs["regex_2n"], "\n", content)
    return content

In [7]:
# 将sections中的多个元素处理成单个文本，即main（正文）
# PAPER {heading1}:\n{text1}\nPAPER_{heading2}:\n{text2}...
# 同时调用上方的replace_1n函数针对main处理多换行符、可疑换行符为单换行符

def sctions2str(sections):
    p_strings = []
    for section in sections:
        p_heading = section["heading"]
        p_text = section["text"]
        p_string = "PAPER {}".format(p_heading) + ":\n" + p_text
        p_strings.append(p_string)
    p_strings = "\n".join(p_strings)
    p_strings = replace_1n(p_strings)
    return p_strings

df_sci["main"] = df_sci["sections"].apply(lambda x: sctions2str(x))
df_sci["main"]

0       PAPER Introduction:\nDeep neural networks (DNN...
1       PAPER INTRODUCTION:\nDeep neural networks (DNN...
2       PAPER INTRODUCTION:\nDefining efficient optimi...
3       PAPER INTRODUCTION:\nSimply minimizing the sta...
4       PAPER INTRODUCTION:\nSupervised learning algor...
                              ...                        
4995    PAPER Introduction:\nCurious free-play has bee...
4996    PAPER INTRODUCTION:\nAutomatic detection of so...
4997    PAPER Introduction:\nMachine learning is the p...
4998    PAPER INTRODUCTION:\n"All samples are equal, b...
4999    PAPER INTRODUCTION:\nLearning discrete represe...
Name: main, Length: 5000, dtype: object

In [8]:
# 统计main单词数
df_sci["main_count"] = df_sci["main"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_sci["main_count"].head()

0     2450
1    15049
2     6876
3     7712
4     5523
Name: main_count, dtype: int64

In [9]:
# 查看单词数排序
# 发现存在单词数为0的情况，也有单词数高达140K的情况
# 有必要进行一定的极端值处理操作
df_sci["main_count"].sort_values()

4800         0
1703        95
3017       112
914        122
2081       134
         ...  
2871     70799
2597     79886
225      93561
976     103113
1055    138736
Name: main_count, Length: 5000, dtype: int64

In [10]:
# 观察main单词数的分位数情况
# 可知140K单词数已属于极端情况、260的单词数也过短
df_sci["main_count"].quantile(q=[0.001, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 0.999, 1])

0.001       151.982
0.100      2128.000
0.250      3942.000
0.500      5343.000
0.750      6846.250
0.900      9557.600
0.950     12321.000
0.990     22538.570
0.999     65616.188
1.000    138736.000
Name: main_count, dtype: float64

In [11]:
# 去除正文长度大于99分位数以及小于10分位数的项（去极端值）
df_sci_handle = df_sci[(df_sci["main_count"] < df_sci["main_count"].quantile(q=0.99)) & (df_sci["main_count"] > df_sci["main_count"].quantile(q=0.1))] 
print("{} ==> {}".format(df_sci.shape[0], df_sci_handle.shape[0]))

5000 ==> 4448


## 4. 处理title

In [12]:
# 针对title处理多换行符、可疑换行符为单换行符
df_sci_handle["title_new"] = df_sci_handle["title"].apply(lambda x: replace_1n(x))
df_sci_handle["title_new"]

0       Defending against Model Stealing via Verifying...
1       EXPLAINING REPRESENTATION BOTTLENECKS OF CONVO...
2       EFFICIENT WASSERSTEIN NATURAL GRADIENTS FOR RE...
3       Under review as a conference paper at ICLR 202...
4       SOME CONSIDERATIONS ON LEARNING TO EXPLORE VIA...
                              ...                        
4994    Why a Naive Way to Combine Symbolic and Latent...
4995    Curious Exploration via Structured World Model...
4997                       Self-Referential Meta Learning
4998            LEARNING FROM SAMPLES OF VARIABLE QUALITY
4999    SELF-SUPERVISED LEARNING OF DISCRETE SPEECH RE...
Name: title_new, Length: 4448, dtype: object

In [13]:
# 统计title单词数
df_sci_handle["title_new_count"] = df_sci_handle["title_new"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_sci_handle["title_new_count"]

0        9
1        7
2        7
3       18
4        9
        ..
4994    15
4995    10
4997     3
4998     6
4999     6
Name: title_new_count, Length: 4448, dtype: int64

In [14]:
# 查看单词数排序
# 发现存在单词数为0的情况，也有单词数高达148的情况，看上去似乎不太合乎常理
# 有必要进行一定的极端值处理操作
# 后续可以从review侧获取相应的title信息来填补
df_sci_handle["title_new_count"].sort_values()

2965     0
2142     0
3541     0
726      0
2124     0
        ..
3240    25
3101    26
1050    26
1790    28
4768    61
Name: title_new_count, Length: 4448, dtype: int64

## 5. 处理abstract

In [15]:
# 针对abstract处理多换行符、可疑换行符为单换行符
df_sci_handle["abstract_new"] = df_sci_handle["abstract"].apply(lambda x: replace_1n(x))
df_sci_handle["abstract_new"]

0       Well-trained models are valuable intellectual ...
1       In this paper, we prove representation bottlen...
2       A novel optimization approach is proposed for ...
3       A lot of theoretical and empirical evidence sh...
4       We consider the problem of exploration in meta...
                              ...                        
4994    We compare a rule-based approach for knowledge...
4995    It has been a long-standing dream to design ar...
4997    Meta Learning automates the search for learnin...
4998    Training labels are expensive to obtain and ma...
4999    We propose vq-wav2vec to learn discrete repres...
Name: abstract_new, Length: 4448, dtype: object

In [16]:
# 统计abstract单词数
df_sci_handle["abstract_new_count"] = df_sci_handle["abstract_new"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_sci_handle["abstract_new_count"]

0       150
1       163
2        80
3       164
4        55
       ... 
4994    100
4995    191
4997    121
4998    147
4999     79
Name: abstract_new_count, Length: 4448, dtype: int64

In [17]:
# 查看单词数排序
# 发现存在单词数为0的情况
# 有必要进行一定的极端值处理操作
# 后续可以从review侧获取相应的abstract信息来填补
df_sci_handle["abstract_new_count"].sort_values()

274        0
1896       0
1753       0
1257       0
3030       0
        ... 
31      1142
3936    1240
2069    1384
4497    1916
1202    2067
Name: abstract_new_count, Length: 4448, dtype: int64

In [18]:
# 总览列名
df_sci_handle.columns

Index(['title', 'authors', 'pub_date', 'abstract', 'sections', 'references',
       'figures', 'formulas', 'doi', 'forum', 'main', 'main_count',
       'title_new', 'title_new_count', 'abstract_new', 'abstract_new_count'],
      dtype='object')

# 二、review部分

由 https://gitee.com/remixa/promote-openreview/blob/master/utils/openreview_crawler.py 爬取得到，并存储为jsonl格式。

## 1. 读取review jsonl文件

In [19]:
# 同 https://gitee.com/remixa/promote-openreview/blob/master/utils/openreview_processor.py
# 定义读取器，将jsonl格式的review读取成dataframe格式
import jsonlines
import pandas as pd

class openreview_proccessor():
    def __init__(self, jsonl_path):
        self.df = self._load_jsonl_to_dataframe(jsonl_path)
        self.df_sub = pd.DataFrame()
    
    def _load_jsonl_to_dataframe(self, jsonl_path):
        msg_list = []
        with open(jsonl_path, 'r', encoding='utf-8') as file:
            for line_dict in jsonlines.Reader(file):
                msg_dict = {}
                for k, v in line_dict['basic_dict'].items():
                    msg_dict['b_' + k] = v
                msg_list.append(msg_dict)
                for review_msg in line_dict["reviews_msg"]:
                    msg_dict_copy = msg_dict.copy()
                    pure_review_msg = {
                        'r_id': review_msg.get('id', None),
                        'r_number': review_msg.get('number', None),
                        'r_replyto': review_msg.get('replyto', None),
                        'r_invitation': review_msg.get('invitation', None),
                        'r_signatures': ','.join(review_msg['signatures']) if review_msg.get('signatures', None) else None,
                        'r_readers': review_msg.get('readers', None),
                        'r_nonreaders': review_msg.get('nonreaders', None),
                        'r_writers': review_msg.get('writers', None)  
                    }
                    pure_content_msg = {}
                    pure_content_msg['c_content'] = review_msg['content']
                    for k, v in review_msg['content'].items(): 
                        pure_content_msg['c_' + k] = v
                    pure_review_msg.update(pure_content_msg)
                    msg_dict_copy.update(pure_review_msg)
                    msg_list.append(msg_dict_copy)
        dataframe = pd.DataFrame(msg_list)
        dataframe['c_final_decision'] = self._fill_decision(dataframe)
        return dataframe
    
    def _fill_decision2(self, dataframe):
        return dataframe['c_decision'].fillna('Unknown').map(lambda x: 'Accepted' if x.startswith('Accept') else 
                                                          'Rejected' if x.startswith('Reject') else x)
    def _fill_decision(self, dataframe):
        return dataframe['c_decision'].map(lambda x: x if pd.isnull(x) else
                                           'Accepted' if 'accept' in x.lower() else 
                                            'Rejected' if 'reject' in x.lower() else "Unknown")
    
    def get_sub(self, mode=None):
        # 仅带有review的df
        df_sub = self.df.dropna(subset=self.df.filter(regex='^(?!b_*)').columns, how='all')
        if mode == 'decision':
            # review类型仅为decision的df
            df_sub = df_sub[df_sub['r_invitation'].str.contains('Decision')]
        elif mode == 'other':
            # review类型仅为非decision的df
            df_sub = df_sub[~df_sub['r_invitation'].str.contains('Decision')]
        elif mode == 'accepted':
            # decision中被采纳的df
            df_sub = df_sub[df_sub['c_final_decision'].isin(['Accepted'])]
        elif mode == 'rejected':
            # decision中未被采纳的df
            df_sub = df_sub[df_sub['c_final_decision'].isin(['Rejected'])]

        self.df_sub = df_sub
        return
    
    def get_total_shape(self):
        return self.df.shape
    
    def get_sub_shape(self):
        return self.df_sub.shape

In [20]:
# 读取review
orp = openreview_proccessor(review_jsonl_path)
orp.get_sub()
# 原数据中存在部分数据是无review的情况，仅取带review的数据
df_reviews = orp.df_sub
df_reviews.head()

Unnamed: 0,b_forum,b_title,b_url,b_pub_date,b_abstract,b_TL;DR,b_authors,b_keywords,b_venue,b_venue_id,...,c_overall_recommendation,c_preliminary_rating,c_suggested_changes,c_reason_for_not_giving_a_higher_recommendation,c_reason_for_not_giving_a_lower_recommendation,c_comments_and_feedback_to_the_authors,c_consent_to_archive,c_Reviewer expertise,c_Review (Strengths/Weaknesses),c_final_decision
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,
4,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,Rejected
6,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,https://openreview.net/forum?id=rye7IMbAZ,--,"In inductive transfer learning, fine-tuning pr...","In inductive transfer learning, fine-tuning pr...","Xuhong LI,Yves GRANDVALET,Franck DAVOINE","transfer Learning,convolutional networks,fine-...",--,--,...,,,,,,,,,,


## 2. 处理review文本（c_content）

In [21]:
# c_content字段即为dict格式的正文
df_reviews["c_content"]

1        {'title': 'There is a major technical flaw in ...
2        {'title': 'Interesting unsupervised structure ...
3        {'title': 'Promising method, inconclusive resu...
4        {'decision': 'Reject', 'title': 'ICLR 2018 Con...
6        {'title': 'well written, needs more comparison...
                               ...                        
25462    {'title': 'Review of paper 1', 'review': 'This...
25464    {'title': 'Risk-based ring vaccination appears...
25465    {'title': 'This paper proposed a risk-based ri...
25466    {'title': 'Review of the paper on risk-based r...
25467    {'title': 'Review of risk-based ring vaccinati...
Name: c_content, Length: 20486, dtype: object

### I. 去除author回复

In [22]:
# review中夹杂着论文author自己的回复，属于无用信息需要去除
# 去除review中的author回复
df_reviews_handle = df_reviews[~df_reviews["r_signatures"].str.contains("Author")]
print("{} ==> {}".format(df_reviews.shape[0], df_reviews_handle.shape[0]))

20486 ==> 18456


### II. 去除低频子标题

In [23]:
# 一些出现次数较少的标题可能会作为噪声存在，影响模型训练
# 统计review内容中子标题出现的分位数
# 大概10%的标题出现了5.8次，20%的标题出现了54.4次，...
df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum().quantile(q=[0.1, 0.2, 0.25, 0.5, 0.75, 0.9, 0.995, 0.999,1])

0.100        5.600
0.200       16.000
0.250       17.000
0.500       53.000
0.750      259.500
0.900     3368.800
0.995     9933.910
0.999    11874.782
1.000    12360.000
dtype: float64

In [24]:
# 查看出现次数少于0.2分位数（即54.4次）的子标题
df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum()[(df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum() < df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum().quantile(q=0.2))].index

Index(['c_withdrawal confirmation',
       'c_[Optional] Respond to feedback request by the authors',
       'c_metaReview', 'c_recommendation_for_accepted_papers',
       'c_overall evaluation', 'c_reviewer's confidence', 'c_significance',
       'c_scholarship', 'c_Bio_Award', 'c_reviews_visibility', 'c_novelty',
       'c_pdf', 'c_zip_file', 'c_strengths_weaknesses', 'c_expertise',
       'c_useful', 'c_writing', 'c_superlative', 'c_Sufficiently Alt',
       'c_Conflicts', 'c_Review Inclusion', 'c_reviewtext',
       'c_problemstatement', 'c_litreview', 'c_accessibility', 'c_results',
       'c_reviewerconfidence', 'c_groundsforrejection', 'c_consent_to_archive',
       'c_Reviewer expertise', 'c_Review (Strengths/Weaknesses)'],
      dtype='object')

In [25]:
# 查看大于0.2分位数的相对高频子标题

df_reviews_rest_subtitle = df_reviews_handle.filter(regex='^(?!c_conten*)')[df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum()[(df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum() > df_reviews_handle.filter(regex='^c_*').filter(regex='^(?!c_conten*)').notnull().sum().quantile(q=0.2))].index]
df_reviews_rest_subtitle.columns

Index(['c_title', 'c_rating', 'c_review', 'c_confidence', 'c_decision',
       'c_comment', 'c_metareview', 'c_recommendation',
       'c_experience_assessment',
       'c_review_assessment:_thoroughness_in_paper_reading',
       ...
       'c_submission_track', 'c_overall_rating', 'c_recommended_decision',
       'c_overall_recommendation', 'c_preliminary_rating',
       'c_suggested_changes',
       'c_reason_for_not_giving_a_higher_recommendation',
       'c_reason_for_not_giving_a_lower_recommendation',
       'c_comments_and_feedback_to_the_authors', 'c_final_decision'],
      dtype='object', length=127)

In [26]:
# 去除列名中的“c_”
# "c_"是初期数据认知阶段用以区分基本信息（b_）、review信息(r_)、review正文（c_）时所添加的
# 现需要根据相对高频的列名来从review正文c_content中保留高频子标题及其内容，剔除低频子标题及其内容

rm_regex = re.compile("^c_(.*)")
def remove_c_(col, rm_regex=rm_regex):
    res_col = re.search(rm_regex, col)
    return res_col.group(1)

# 获取需要保留的高频列名，并去除列名中的“c_”
rest_subtitle = df_reviews_rest_subtitle.columns
rest_subtitle = [remove_c_(rst) for rst in rest_subtitle]
rest_subtitle

['title',
 'rating',
 'review',
 'confidence',
 'decision',
 'comment',
 'metareview',
 'recommendation',
 'experience_assessment',
 'review_assessment:_thoroughness_in_paper_reading',
 'review_assessment:_checking_correctness_of_experiments',
 'review_assessment:_checking_correctness_of_derivations_and_theory',
 'summary_of_the_paper',
 'main_review',
 'summary_of_the_review',
 'correctness',
 'technical_novelty_and_significance',
 'empirical_novelty_and_significance',
 'flag_for_ethics_review',
 'details_of_ethics_concerns',
 'Q1 Summary and contributions',
 'Q2 Assessment of the paper',
 'Q2(1) Originality/Novelty',
 'Q2(2) Significance/Impact',
 'Q2(3) Correctness/Technical quality',
 'Q2(4) Quality of experiments (Optional)',
 'Q2(5) Reproducibility',
 'Q2(6) Clarity of writing',
 'Q3 Main strengths',
 'Q4 Main weakness',
 'Q5 Detailed comments to the authors',
 'Q6 Overall score',
 'Q7 Justification for your score',
 'Q8 Confidence in your score',
 'Q9 Complying with reviewing in

In [27]:
# 基于上述余留子标题，处理c_content内的文本
# 1. 去除上方提及的低频子标题（少于0.2分位数）
# 2. 采用\n拼接各个子标题
# 3. 得到的文本格式为“REVIEW {子标题1}:\n{子标题1相关内容}\nREVIEW {子标题2}:\n{子标题2相关内容}”

def content_dict2str(content, rest=rest_subtitle):
    strings = []
    final_string = ""
    for k, v in content.items():
        # 剔除非保留项
        if k not in rest:
            continue
        else:
            sub_string = "REVIEW {}:\n{}".format(k, v)
            strings.append(sub_string)
        final_string = "\n".join(strings)
    return final_string

df_reviews_handle["c_content_str"] = df_reviews_handle["c_content"].apply(lambda x: content_dict2str(x))
df_reviews_handle["c_content_str"]

1        REVIEW title:\nThere is a major technical flaw...
2        REVIEW title:\nInteresting unsupervised struct...
3        REVIEW title:\nPromising method, inconclusive ...
4        REVIEW decision:\nReject\nREVIEW title:\nICLR ...
6        REVIEW title:\nwell written, needs more compar...
                               ...                        
25462    REVIEW title:\nReview of paper 1\nREVIEW revie...
25464    REVIEW title:\nRisk-based ring vaccination app...
25465    REVIEW title:\nThis paper proposed a risk-base...
25466    REVIEW title:\nReview of the paper on risk-bas...
25467    REVIEW title:\nReview of risk-based ring vacci...
Name: c_content_str, Length: 18456, dtype: object

### III. 去除reference

In [28]:
# review本身文本量并不算大，但部分review会列举reference，导致reference占了review内容中较大的一部分
# 且reference信息量都比较有限，因此需要将其从review文本中识别出来并剔除
# 需要留意的是，不能单纯通过识别到“reference”这个词就断定是在列举参考文献
# 因为确实存在只是提及了“reference”这个词的情况，并非是在列举参考文献
# 需要撰写正则表达式对这种情况加以区分

# 以\nreference作为开头，以19XX.或20XX.作为结尾，这是列举一大段参考文献的范式，
# 并且匹配模式设定为匹配换行符和忽略大小写
# flags定义2种匹配模式，(re.S|re.I)分别指支持“匹配换行符（即匹配所有行，无需换行符隔开）”和“忽略大小写”
regex = re.compile("\n(reference)(.*)(19|20\d+).*?\.", flags=(re.S|re.I))

# 检查出列举有参考摘要的review
def check_ref(content, regex=regex):
    res = re.search(regex, content)
    return res

df_reviews_handle[df_reviews_handle["c_content_str"].apply(lambda x: True if check_ref(x) else False)]["c_content_str"]

39       REVIEW title:\nVery interesting work and the p...
215      REVIEW title:\nRelevant work, but not executed...
343      REVIEW title:\nInteresting paper but lacking i...
369      REVIEW title:\nReview\nREVIEW rating:\n7: Good...
398      REVIEW title:\nInteresting direction for explo...
                               ...                        
23958    REVIEW confidence:\n5: You are absolutely cert...
24027    REVIEW confidence:\n3: You are fairly confiden...
24183    REVIEW confidence:\n2: You are willing to defe...
24254    REVIEW title:\nInteresting geometric-based met...
25103    REVIEW title:\nEasy to follow, reproducible wo...
Name: c_content_str, Length: 247, dtype: object

In [29]:
# 任取其中一条来查看，可见其倒数第9行至倒数第3行确实是在列举参考文献
print(df_reviews_handle["c_content_str"].loc[39])

REVIEW title:
Very interesting work and the proposed approach is well explained. The experimental section could be improved.
REVIEW rating:
8: Top 50% of accepted papers, clear accept
REVIEW review:
Summary:
The manuscript extends the Neural Expectation Maximization framework by integrating an interaction function that allows asymmetric pairwise effects between objects. The network is demonstrated to learn compositional object representations which group together pixels, optimizing a predictive coding objective. The effectiveness of the approach is demonstrated on bouncing balls sequences and gameplay videos from Space Invaders. The proposed R-NEM model generalizes

Review:
Very interesting work and the proposed approach is well explained. The experimental section could be improved.
I have a few questions/comments:
1) Some limitations could have been discussed, e.g. how would the model perform on sequences involving more complicated deformations of objects than in the Space Invaders ex

In [30]:
# 去除review中的reference
df_reviews_handle["c_content_str"] = df_reviews_handle["c_content_str"].apply(lambda x: re.sub(regex, "", x))
df_reviews_handle.head()

Unnamed: 0,b_forum,b_title,b_url,b_pub_date,b_abstract,b_TL;DR,b_authors,b_keywords,b_venue,b_venue_id,...,c_preliminary_rating,c_suggested_changes,c_reason_for_not_giving_a_higher_recommendation,c_reason_for_not_giving_a_lower_recommendation,c_comments_and_feedback_to_the_authors,c_consent_to_archive,c_Reviewer expertise,c_Review (Strengths/Weaknesses),c_final_decision,c_content_str
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,REVIEW title:\nThere is a major technical flaw...
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,REVIEW title:\nInteresting unsupervised struct...
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,,"REVIEW title:\nPromising method, inconclusive ..."
4,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,https://openreview.net/forum?id=ryjw_eAaZ,--,We introduce an unsupervised structure learnin...,A principled approach for structure learning o...,"Raanan Y. Yehezkel Rohekar,Guy Koren,Shami Nis...","unsupervised learning,structure learning,deep ...",--,--,...,,,,,,,,,Rejected,REVIEW decision:\nReject\nREVIEW title:\nICLR ...
6,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,https://openreview.net/forum?id=rye7IMbAZ,--,"In inductive transfer learning, fine-tuning pr...","In inductive transfer learning, fine-tuning pr...","Xuhong LI,Yves GRANDVALET,Franck DAVOINE","transfer Learning,convolutional networks,fine-...",--,--,...,,,,,,,,,,"REVIEW title:\nwell written, needs more compar..."


In [31]:
# 再次查看之前的那条review，可见参考文献部分被剔除了
print(df_reviews_handle["c_content_str"].loc[39])

REVIEW title:
Very interesting work and the proposed approach is well explained. The experimental section could be improved.
REVIEW rating:
8: Top 50% of accepted papers, clear accept
REVIEW review:
Summary:
The manuscript extends the Neural Expectation Maximization framework by integrating an interaction function that allows asymmetric pairwise effects between objects. The network is demonstrated to learn compositional object representations which group together pixels, optimizing a predictive coding objective. The effectiveness of the approach is demonstrated on bouncing balls sequences and gameplay videos from Space Invaders. The proposed R-NEM model generalizes

Review:
Very interesting work and the proposed approach is well explained. The experimental section could be improved.
I have a few questions/comments:
1) Some limitations could have been discussed, e.g. how would the model perform on sequences involving more complicated deformations of objects than in the Space Invaders ex

### IV. 将可疑换行符转换为单换行符

In [32]:
# 将\n\n\n以及\n\n转为\n
# 去掉各种疑似\\n
regex_3n = re.compile(r"\n\n\n", flags=(re.S))
regex_spn = re.compile(r"\n \n", flags=(re.S))
regex_sln = re.compile(r"\\n", flags=(re.S))
regex_2n = re.compile(r"\n\n", flags=(re.S))

regexs_n = {
    "regex_3n": regex_3n,
    "regex_spn": regex_spn,
    "regex_sln": regex_sln,
    "regex_2n": regex_2n
}
def replace_1n(content, regexs=regexs_n):
    content = re.sub(regexs["regex_3n"], "\n", content)
    content = re.sub(regexs["regex_spn"], "\n", content)
    content = re.sub(regexs["regex_sln"], "\n", content)
    content = re.sub(regexs["regex_2n"], "\n", content)
    content = re.sub(regexs["regex_spn"], "\n", content)
    content = re.sub(regexs["regex_2n"], "\n", content)
    return content

df_reviews_handle["c_content_str"] = df_reviews_handle["c_content_str"].apply(lambda x: replace_1n(x))
df_reviews_handle["c_content_str"].iloc[0]

"REVIEW title:\nThere is a major technical flaw in this paper. And some experiment settings are not convincing.\nREVIEW rating:\n4: Ok but not good enough - rejection\nREVIEW review:\nThe paper proposes an unsupervised structure learning method for deep neural networks. It first constructs a fully visible DAG by learning from data, and decomposes variables into autonomous sets. Then latent variables are introduced and stochastic inverse is generated. Later a deep neural network structure is constructed based on the discriminative graph. Both the problem considered in the paper and the proposed method look interesting. The resulting structure seems nice.\nHowever, the reviewer indeed finds a major technical flaw in the paper. The foundation of the proposed method is on preserving the conditional dependencies in graph G. And each step mentioned in the paper, as it claims, can preserve all the conditional dependencies. However, in section 2.2, it seems that the stochastic inverse cannot. 

### V. 为review划分大体类别

In [33]:
# review类别大体分为3类：Reviewer、Area_Chair、Program_Chair
# 根据r_signatures_class将review大体划分为该三大类
# 后续可根据这3类信息对单条paper的多个review顺序进行编排
df_reviews_handle["r_signatures_class"] = ["Reviewer"] * df_reviews_handle.shape[0]
df_reviews_handle.loc[df_reviews_handle["r_signatures"].str.contains("(A|a)rea"), "r_signatures_class"] = "Area_Chair"
df_reviews_handle.loc[df_reviews_handle["r_signatures"].str.contains("Program*"), "r_signatures_class"] = "Program_Chair"
df_reviews_handle["r_signatures_class"].unique()

array(['Reviewer', 'Program_Chair', 'Area_Chair'], dtype=object)

### VI. 取出后续训练真正会使用到的字段

In [34]:
# 查看现有字段，存在诸多冗余项
df_reviews_handle.columns

Index(['b_forum', 'b_title', 'b_url', 'b_pub_date', 'b_abstract', 'b_TL;DR',
       'b_authors', 'b_keywords', 'b_venue', 'b_venue_id',
       ...
       'c_suggested_changes',
       'c_reason_for_not_giving_a_higher_recommendation',
       'c_reason_for_not_giving_a_lower_recommendation',
       'c_comments_and_feedback_to_the_authors', 'c_consent_to_archive',
       'c_Reviewer expertise', 'c_Review (Strengths/Weaknesses)',
       'c_final_decision', 'c_content_str', 'r_signatures_class'],
      dtype='object', length=193)

In [35]:
# 取出真正可能需要使用到的字段
# b_forum作为id用于与paper进行匹配
# b_title为和b_abstract为论文的信息，后续可用于填补上述paper的title和abstract缺失的情况
# c_content_str为review正文
# r_signatures_class为review的大类用于编排多轮对话中review的顺序，Reviewer靠前、Area_Chair为次、Program_Chair靠后
# c_title是review title，用以进一步判断review是否应当放后面，因为通常起决定性的review包含有字符串“Decision”
df_reviews_msg = df_reviews_handle[["b_forum", "b_title", "b_abstract", "c_content_str", "r_signatures_class", "c_title"]]
df_reviews_msg

Unnamed: 0,b_forum,b_title,b_abstract,c_content_str,r_signatures_class,c_title
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nThere is a major technical flaw...,Reviewer,There is a major technical flaw in this paper....
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nInteresting unsupervised struct...,Reviewer,Interesting unsupervised structure learning al...
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,"REVIEW title:\nPromising method, inconclusive ...",Reviewer,"Promising method, inconclusive results"
4,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW decision:\nReject\nREVIEW title:\nICLR ...,Program_Chair,ICLR 2018 Conference Acceptance Decision
6,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,"In inductive transfer learning, fine-tuning pr...","REVIEW title:\nwell written, needs more compar...",Reviewer,"well written, needs more comparisons/analysis"
...,...,...,...,...,...,...
25462,qkDCSV-RMt,Spectral Clustering Identifies High-risk Opioi...,National opioid prescribing guidelines and rel...,REVIEW title:\nReview of paper 1\nREVIEW revie...,Reviewer,Review of paper 1
25464,N0qlvDjnEv,Risk-Based Ring Vaccination: A Strategy for Pa...,"Throughout an infectious disease crisis, resou...",REVIEW title:\nRisk-based ring vaccination app...,Reviewer,Risk-based ring vaccination appears promising....
25465,N0qlvDjnEv,Risk-Based Ring Vaccination: A Strategy for Pa...,"Throughout an infectious disease crisis, resou...",REVIEW title:\nThis paper proposed a risk-base...,Reviewer,This paper proposed a risk-based ring vaccinat...
25466,N0qlvDjnEv,Risk-Based Ring Vaccination: A Strategy for Pa...,"Throughout an infectious disease crisis, resou...",REVIEW title:\nReview of the paper on risk-bas...,Reviewer,Review of the paper on risk-based rink vaccina...


In [36]:
# 数据中仍存在缺失值与双横杠“--”，需要将它们处理成空字符串
# 先将“--”处理成缺失值，再统一将缺失值替换为空字符串
df_reviews_msg = df_reviews_msg.replace("--", pd.NA).fillna("")
df_reviews_msg.head()

Unnamed: 0,b_forum,b_title,b_abstract,c_content_str,r_signatures_class,c_title
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nThere is a major technical flaw...,Reviewer,There is a major technical flaw in this paper....
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nInteresting unsupervised struct...,Reviewer,Interesting unsupervised structure learning al...
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,"REVIEW title:\nPromising method, inconclusive ...",Reviewer,"Promising method, inconclusive results"
4,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW decision:\nReject\nREVIEW title:\nICLR ...,Program_Chair,ICLR 2018 Conference Acceptance Decision
6,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,"In inductive transfer learning, fine-tuning pr...","REVIEW title:\nwell written, needs more compar...",Reviewer,"well written, needs more comparisons/analysis"


### VII. 编排单条paper的多个review的顺序

In [37]:
# 编排顺序
# 根据r_signatures_class，Reviewer靠前，Area_Chair其次，Program_Chair最后
# 如果有数个Program_Chair，则c_title带Decision的就放最后，否则就随机

class sorter:
    def __init__(self):
        self.forum_dict = {}
        
    def sort_review(self, row):
        self.forum_dict[row["b_forum"]] = self.forum_dict.get(row["b_forum"], 
                                                              {"reviewer_count": 0, 
                                                             "area_chair_count": 50, 
                                                             "program_chair_basic_count": 100, 
                                                             "program_chair_decision_count": 150})
        if row["r_signatures_class"] == "Reviewer":
            self.forum_dict[row["b_forum"]]["reviewer_count"] = self.forum_dict[row["b_forum"]]["reviewer_count"] + 1
            return self.forum_dict[row["b_forum"]]["reviewer_count"]
        elif row["r_signatures_class"] == "Area_Chair":
            self.forum_dict[row["b_forum"]]["area_chair_count"] = self.forum_dict[row["b_forum"]]["area_chair_count"] + 1
            return self.forum_dict[row["b_forum"]]["area_chair_count"]
        else:
            if "Decision" in row["c_title"]:
                self.forum_dict[row["b_forum"]]["program_chair_decision_count"] = self.forum_dict[row["b_forum"]]["program_chair_decision_count"] + 1
                return self.forum_dict[row["b_forum"]]["program_chair_decision_count"]
            else:
                self.forum_dict[row["b_forum"]]["program_chair_basic_count"] = self.forum_dict[row["b_forum"]]["program_chair_basic_count"] + 1
                return self.forum_dict[row["b_forum"]]["program_chair_basic_count"]
srt = sorter()
df_reviews_msg["r_order"] = df_reviews_msg.apply(lambda x: srt.sort_review(x), axis=1)

df_reviews_msg.head()

# 后续如果要做多轮对话的话，根据r_order的大小顺序由小至大进行编排即可

Unnamed: 0,b_forum,b_title,b_abstract,c_content_str,r_signatures_class,c_title,r_order
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nThere is a major technical flaw...,Reviewer,There is a major technical flaw in this paper....,1
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nInteresting unsupervised struct...,Reviewer,Interesting unsupervised structure learning al...,2
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,"REVIEW title:\nPromising method, inconclusive ...",Reviewer,"Promising method, inconclusive results",3
4,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW decision:\nReject\nREVIEW title:\nICLR ...,Program_Chair,ICLR 2018 Conference Acceptance Decision,151
6,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,"In inductive transfer learning, fine-tuning pr...","REVIEW title:\nwell written, needs more compar...",Reviewer,"well written, needs more comparisons/analysis",1


### VIII. 检查review侧提供的paper title信息

In [38]:
# 检查review侧提供的paper信息（title、abstract）是否异常
# 后续将使用review侧提供的paper信息来填补paper侧的缺失信息

# 将可疑换行符替换为单换行符
df_reviews_msg["b_title_new"] = df_reviews_msg["b_title"].apply(lambda x: replace_1n(x))
df_reviews_msg["b_title_new"].head()

1    Unsupervised Deep Structure Learning by Recurs...
2    Unsupervised Deep Structure Learning by Recurs...
3    Unsupervised Deep Structure Learning by Recurs...
4    Unsupervised Deep Structure Learning by Recurs...
6     Explicit Induction Bias for Transfer Learning...
Name: b_title_new, dtype: object

In [39]:
# 统计title单词数
# 发现存在单词数为1的情况
# 实际上是正常现象，有的论文确实就是以技术名称为论文标题，所以是有可能仅有1个单词的
df_reviews_msg["b_title_count"] = df_reviews_msg["b_title_new"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_reviews_msg["b_title_count"].sort_values()

223       1
224       1
225       1
226       1
227       1
         ..
11546    23
23751    25
23750    25
23753    25
23752    25
Name: b_title_count, Length: 18456, dtype: int64

In [40]:
# 查看分位数情况
# 整体比较正常，暂未发现无内容和异常情况
df_reviews_msg["b_title_count"].quantile(q=[0.1,0.25,0.5,0.75,0.9,0.99,1])

0.10     5.0
0.25     6.0
0.50     8.0
0.75    10.0
0.90    12.0
0.99    15.0
1.00    25.0
Name: b_title_count, dtype: float64

### IX. 检查review侧提供的paper abstract信息

In [41]:
# 检查review侧提供的paper信息（title、abstract）是否异常
# 后续将使用review侧提供的paper信息来填补paper侧的缺失信息

# 将可疑换行符替换为单换行符
df_reviews_msg["b_abstract_new"] = df_reviews_msg["b_abstract"].apply(lambda x: replace_1n(x))
df_reviews_msg["b_abstract_new"].head()

1    We introduce an unsupervised structure learnin...
2    We introduce an unsupervised structure learnin...
3    We introduce an unsupervised structure learnin...
4    We introduce an unsupervised structure learnin...
6    In inductive transfer learning, fine-tuning pr...
Name: b_abstract_new, dtype: object

In [42]:
# 统计abstarct单词数
# 发现存在单词数为0的情况
# 后续可以与paper侧互补，若实在不存在abstarct则将该条paper剔除
df_reviews_msg["b_abstract_count"] = df_reviews_msg["b_abstract_new"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_reviews_msg["b_abstract_count"].sort_values()

19216     33
1985      37
1983      37
1982      37
1984      37
        ... 
11374    455
11371    455
19785    552
19787    552
19786    552
Name: b_abstract_count, Length: 18456, dtype: int64

In [43]:
# 查看分位数情况
# 整体比较正常
df_reviews_msg["b_abstract_count"].quantile(q=[0.01, 0.1,0.25,0.5,0.75,0.9,0.99,1])

0.01     72.0
0.10    116.0
0.25    142.0
0.50    169.0
0.75    199.0
0.90    229.0
0.99    304.0
1.00    552.0
Name: b_abstract_count, dtype: float64

# 三、整合paper与review

In [44]:
# 去除无用变量，释放内存占用
del df_sci, df_reviews, df_reviews_rest_subtitle, df_reviews_handle

## 1. paper侧与review侧信息（title、abstract）互补

### I. 将paper和review根据forum进行merge

In [45]:
# 查看paper的字段
df_sci_handle.columns

Index(['title', 'authors', 'pub_date', 'abstract', 'sections', 'references',
       'figures', 'formulas', 'doi', 'forum', 'main', 'main_count',
       'title_new', 'title_new_count', 'abstract_new', 'abstract_new_count'],
      dtype='object')

In [46]:
# 查看review的字段
df_reviews_msg.columns

Index(['b_forum', 'b_title', 'b_abstract', 'c_content_str',
       'r_signatures_class', 'c_title', 'r_order', 'b_title_new',
       'b_title_count', 'b_abstract_new', 'b_abstract_count'],
      dtype='object')

In [47]:
df_merge = pd.merge(df_reviews_msg[["b_forum","b_title_new","b_abstract_new","c_content_str",
                             "r_signatures_class","r_order","b_title_count",
                             "b_abstract_count"]],
                            df_sci_handle[["forum","title_new","abstract_new","main"]],
                            how="left",
                            left_on="b_forum",
                            right_on="forum")
df_merge.head()

Unnamed: 0,b_forum,b_title_new,b_abstract_new,c_content_str,r_signatures_class,r_order,b_title_count,b_abstract_count,forum,title_new,abstract_new,main
0,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nThere is a major technical flaw...,Reviewer,1,8,230,ryjw_eAaZ,UNSUPERVISED DEEP STRUCTURE LEARNING BY RECURS...,We introduce an unsupervised structure learnin...,"PAPER INTRODUCTION:\nOver the last decade, dee..."
1,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW title:\nInteresting unsupervised struct...,Reviewer,2,8,230,ryjw_eAaZ,UNSUPERVISED DEEP STRUCTURE LEARNING BY RECURS...,We introduce an unsupervised structure learnin...,"PAPER INTRODUCTION:\nOver the last decade, dee..."
2,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,"REVIEW title:\nPromising method, inconclusive ...",Reviewer,3,8,230,ryjw_eAaZ,UNSUPERVISED DEEP STRUCTURE LEARNING BY RECURS...,We introduce an unsupervised structure learnin...,"PAPER INTRODUCTION:\nOver the last decade, dee..."
3,ryjw_eAaZ,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,REVIEW decision:\nReject\nREVIEW title:\nICLR ...,Program_Chair,151,8,230,ryjw_eAaZ,UNSUPERVISED DEEP STRUCTURE LEARNING BY RECURS...,We introduce an unsupervised structure learnin...,"PAPER INTRODUCTION:\nOver the last decade, dee..."
4,rye7IMbAZ,Explicit Induction Bias for Transfer Learning...,"In inductive transfer learning, fine-tuning pr...","REVIEW title:\nwell written, needs more compar...",Reviewer,1,9,138,rye7IMbAZ,EXPLICIT INDUCTION BIAS FOR TRANSFER LEARN-ING...,"In inductive transfer learning, fine-tuning pr...",PAPER INTRODUCTION:\nIt is now well known that...


### II. 信息互补

基于上述分析结果，可制定如下信息互补规则：
1. title选取：因为review侧的title较为正常，以review侧的title为准。
2. abstract选取：以review侧的abstract为准，如果review侧提供的abstract单词数大于总体的99分位数，那么可以查阅paper侧提供的abstract，如果paper侧存在abstract且字数小于review侧abstract，则选取paper侧abstract作为最终abstract，否则仍取review侧abstract。

In [48]:
# title互补
df_merge["final_title"] = df_merge["b_title_new"]

In [49]:
# abstract互补
abs99 = df_merge["b_abstract_count"].quantile(q=0.99)

def find_abs(row, th=abs99):
    if row["b_abstract_count"] > th:
        if row["abstract_new"] and pd.notnull(row["abstract_new"]):
            return row["abstract_new"]
        else:
            return row["b_abstract_new"]
    elif row["b_abstract_count"] == 0:
        if row["abstract_new"] and pd.notnull(row["abstract_new"]):
            return row["abstract_new"]
        else:
            return ""
    else:
        return row["b_abstract_new"]

df_merge["final_abstract"] = df_merge.apply(lambda x: find_abs(x), axis=1)
df_merge["final_abstract"]

0        We introduce an unsupervised structure learnin...
1        We introduce an unsupervised structure learnin...
2        We introduce an unsupervised structure learnin...
3        We introduce an unsupervised structure learnin...
4        In inductive transfer learning, fine-tuning pr...
                               ...                        
18451    National opioid prescribing guidelines and rel...
18452    Throughout an infectious disease crisis, resou...
18453    Throughout an infectious disease crisis, resou...
18454    Throughout an infectious disease crisis, resou...
18455    Throughout an infectious disease crisis, resou...
Name: final_abstract, Length: 18456, dtype: object

In [50]:
# 查看互补后abstract的单词数，发现仍存在无信息项（单词数为0的情况）
df_merge["final_abstract"].map(lambda x: len(x.split()) if pd.notnull(x) else 0).sort_values()

14382     33
1500      37
1498      37
1497      37
1499      37
        ... 
8413     458
10754    458
10753    458
8416     458
8415     458
Name: final_abstract, Length: 18456, dtype: int64

### III. 去除无信息项

In [51]:
# 去除无信息项
df_merge_drop = df_merge.dropna(subset=["final_title","final_abstract","main","c_content_str"], how="any")[["forum","r_order","c_content_str","final_title","final_abstract","main"]]
print("{} ==> {}".format(df_merge.shape[0], df_merge_drop.shape[0]))

18456 ==> 17881


## 2. 由补信息后的title、abstract、main拼接得到paper内容

In [52]:
# 拼接final_title、final_abstract、main

def concat_paper(row):
    paper_text = "PAPER Title:\n{}\nPAPER Abstract:\n{}\n{}".format(row["final_title"], row["final_abstract"], row["main"])
    paper_text = replace_1n(paper_text)
    return paper_text

df_merge_drop["paper"] = df_merge_drop.apply(lambda x: concat_paper(x), axis=1)
print(df_merge_drop["paper"].iloc[0])

PAPER Title:
Unsupervised Deep Structure Learning by Recursive Dependency Analysis
PAPER Abstract:
We introduce an unsupervised structure learning algorithm for deep, feed-forward, neural networks. We propose a new interpretation for depth and inter-layer connectivity where a hierarchy of independencies in the input distribution is encoded in the network structure. This results in structures allowing neurons to connect to neurons in any deeper layer skipping intermediate layers. Moreover, neurons in deeper layers encode low-order (small condition sets) independencies and have a wide scope of the input, whereas neurons in the first layers encode higher-order (larger condition sets) independencies and have a narrower scope. Thus, the depth of the network is automatically determined---equal to the maximal order of independence in the input distribution, which is the recursion-depth of the algorithm. The proposed algorithm constructs two main graphical models: 1) a generative latent graph 

In [53]:
# 统计paper单词数
df_merge_drop["paper_count"] = df_merge_drop["paper"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_merge_drop["paper_count"].quantile(q=[0.1, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 0.995, 0.999, 1])

0.100     4169.0
0.250     4946.0
0.500     5943.0
0.750     7345.0
0.900    10141.0
0.950    12372.0
0.990    17302.0
0.995    19136.0
0.999    21957.0
1.000    22339.0
Name: paper_count, dtype: float64

In [54]:
# 统计review单词数
df_merge_drop["review_count"] = df_merge_drop["c_content_str"].map(lambda x: len(x.split()) if pd.notnull(x) else 0)
df_merge_drop["review_count"].quantile(q=[0.1, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 0.995, 0.999, 1])

0.100      72.00
0.250     205.00
0.500     392.00
0.750     579.00
0.900     794.00
0.950     958.00
0.990    1375.20
0.995    1573.60
0.999    2090.12
1.000    4150.00
Name: review_count, dtype: float64

In [55]:
# 统计paper+review单词数
(df_merge_drop["paper_count"] + df_merge_drop["review_count"]).quantile(q=[0.1, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1])

0.10     4518.0
0.25     5347.0
0.50     6379.0
0.75     7838.0
0.90    10648.0
0.95    12875.0
0.99    17821.2
1.00    23254.0
dtype: float64

## 四、分别保存paper与review

In [57]:
# 重命名
df_merge_drop.rename(columns={"c_content_str": "review"}, inplace=True)
# 仅保留有用字段
df_merge_drop = df_merge_drop[["forum","r_order","paper","review","paper_count","review_count","final_title","final_abstract","main"]]

### I. 保存paper

In [58]:
# 取出paper
df_paper_final = df_merge_drop[["forum", "paper", "paper_count","final_title", "final_abstract", "main"]].drop_duplicates(keep="first")
# 重命名
df_paper_final.rename(columns={"paper_count": "paper_word_count", "final_title": "title", "final_abstract": "abstract"}, inplace=True)
df_paper_final

Unnamed: 0,forum,paper,paper_word_count,title,abstract,main
0,ryjw_eAaZ,PAPER Title:\nUnsupervised Deep Structure Lear...,6025,Unsupervised Deep Structure Learning by Recurs...,We introduce an unsupervised structure learnin...,"PAPER INTRODUCTION:\nOver the last decade, dee..."
4,rye7IMbAZ,PAPER Title:\n Explicit Induction Bias for Tra...,5630,Explicit Induction Bias for Transfer Learning...,"In inductive transfer learning, fine-tuning pr...",PAPER INTRODUCTION:\nIt is now well known that...
8,ryZElGZ0Z,PAPER Title:\nDiscovery of Predictive Represen...,7365,Discovery of Predictive Representations With a...,The ability of an agent to {\em discover} its ...,PAPER INTRODUCTION:\nThe idea that an agent's ...
12,ryZ3KCy0W,PAPER Title:\nLink Weight Prediction with Node...,3432,Link Weight Prediction with Node Embeddings,Application of deep learning has been successf...,PAPER INTRODUCTION:\nDeep learning has outperf...
16,ryUlhzWCZ,PAPER Title:\nTRUNCATED HORIZON POLICY SEARCH:...,7205,TRUNCATED HORIZON POLICY SEARCH: COMBINING REI...,"In this paper, we propose to combine imitation...",PAPER INTRODUCTION:\nReinforcement Learning (R...
...,...,...,...,...,...,...
18439,el1iPvMqqTY,PAPER Title:\nSample-Efficient Mapspace Optimi...,3471,Sample-Efficient Mapspace Optimization for DNN...,Achieving high performance for machine learnin...,PAPER I. INTRODUCTION:\nDomain-specific hardwa...
18443,DZvNrRNas6z,PAPER Title:\nAsynchronous Decentralized Feder...,3123,Asynchronous Decentralized Federated Lifelong ...,Federated learning is a recent development in ...,"PAPER INTRODUCTION:\nMedical imaging, MRI (Mag..."
18444,u9zVZTg_Ky,PAPER Title:\nPhysics-informed neural networks...,4857,Physics-informed neural networks integrating c...,Modelling and predicting the behaviour of infe...,PAPER INTRODUCTION:\nThe emergence of severe a...
18448,qkDCSV-RMt,PAPER Title:\nSpectral Clustering Identifies H...,4408,Spectral Clustering Identifies High-risk Opioi...,National opioid prescribing guidelines and rel...,PAPER INTRODUCTION:\nNational prescribing guid...


In [59]:
df_paper_final.shape

(3837, 6)

In [60]:
# 保存paper
df_paper_final.to_csv("./processed_paper_final.csv", index=False)
del df_paper_final

### II. 保存review

In [61]:
# 取出review
df_review_final = df_merge_drop[["forum", "r_order", "review", "review_count"]]
# 重命名
df_review_final.rename(columns={"review_count": "review_word_count"}, inplace=True)

In [62]:
# 将r_order替换成由1开始的基本顺序计数
r_orders = []
df_review_final_order = pd.DataFrame()

for idx, fm in tqdm(df_review_final.groupby("forum")):
    for order, rv in enumerate(fm.sort_values(by="r_order", ascending=True)["r_order"]):
        r_orders.append(order + 1)
    df_review_final_order = pd.concat([df_review_final_order, fm])

  0%|          | 0/3837 [00:00<?, ?it/s]

100%|██████████| 3837/3837 [00:00<00:00, 4412.02it/s]


In [63]:
df_review_final_order["r_order"] = r_orders
df_review_final_order.head()

Unnamed: 0,forum,r_order,review,review_word_count
18049,-0sywUv8ryL,1,REVIEW title:\nInteresting problem but needs a...,1475
18050,-0sywUv8ryL,2,"REVIEW title:\nReview for ""SoK: Virtualization...",584
18051,-0sywUv8ryL,3,REVIEW title:\nLack of technical depth and org...,271
18052,-0sywUv8ryL,4,REVIEW title:\nReview: Virtualization Classifi...,456
18053,-0sywUv8ryL,5,REVIEW metareview:\n**Meta-review Security are...,188


In [64]:
df_review_final_order.shape

(17881, 4)

In [65]:
# 保存review
df_review_final_order.to_csv("./processed_review_final.csv", index=False)
del df_review_final_order

## 五、组织数据格式

### I. 单轮形式

In [66]:
# 单轮组织格式展示
template_single_show = "User: Please reivew this paper or give some sugguestion.\n\nAssistant: Ok, please provide detailed infomation or provide paper to review.\n\nUser: This is the paper:\n{paper}\n\nAssistant: This is the review:\n{reivew}"
print(template_single_show)

User: Please reivew this paper or give some sugguestion.

Assistant: Ok, please provide detailed infomation or provide paper to review.

User: This is the paper:
{paper}

Assistant: This is the review:
{reivew}


In [67]:
# 进行单轮转换
template_single = "User: Please reivew this paper or give some sugguestion.\n\nAssistant: Ok, please provide detailed infomation or provide paper to review.\n\nUser: This is the paper:\n{}\n\nAssistant: This is the review:\n{}"

def make_single(row, template=template_single):
    return template_single.format(row["paper"], row["review"])

df_merge_drop["final_text_single"] = df_merge_drop.apply(lambda x: make_single(x), axis=1)
print(df_merge_drop["final_text_single"].iloc[0])

User: Please reivew this paper or give some sugguestion.

Assistant: Ok, please provide detailed infomation or provide paper to review.

User: This is the paper:
PAPER Title:
Unsupervised Deep Structure Learning by Recursive Dependency Analysis
PAPER Abstract:
We introduce an unsupervised structure learning algorithm for deep, feed-forward, neural networks. We propose a new interpretation for depth and inter-layer connectivity where a hierarchy of independencies in the input distribution is encoded in the network structure. This results in structures allowing neurons to connect to neurons in any deeper layer skipping intermediate layers. Moreover, neurons in deeper layers encode low-order (small condition sets) independencies and have a wide scope of the input, whereas neurons in the first layers encode higher-order (larger condition sets) independencies and have a narrower scope. Thus, the depth of the network is automatically determined---equal to the maximal order of independence in

In [68]:
# 保存为jsonl文件，每个元素为单条数据
import jsonlines
from tqdm import tqdm

single_jsonl_path = "single_data.jsonl"
for final_text_single in tqdm(df_merge_drop["final_text_single"]):
    dict_text_single = {"text": final_text_single}
    with jsonlines.open(single_jsonl_path, mode="a") as file_single:
        file_single.write(dict_text_single)

100%|██████████| 17881/17881 [00:03<00:00, 5903.41it/s]


### II. 多轮形式

In [69]:
# 多轮组织格式展示
template_multiple_show = "User: Please reivew this paper or give some sugguestion.\n\nAssistant: Ok, please provide detailed infomation or provide paper to review.\n\nUser: This is the paper:\n{paper}\n\nAssistant: This is the review:\n{reivew1}\n\nUser: Any more?\n\nAssistant:\n{review2}\n\nUser: Any more?\n\nAssistant:\n{review3}"
print(template_multiple_show)

User: Please reivew this paper or give some sugguestion.

Assistant: Ok, please provide detailed infomation or provide paper to review.

User: This is the paper:
{paper}

Assistant: This is the review:
{reivew1}

User: Any more?

Assistant:
{review2}

User: Any more?

Assistant:
{review3}


In [70]:
# 保存为jsonl文件，每个元素为单条数据

from tqdm import tqdm

template_multiple_head = "User: Please reivew this paper or give some sugguestion.\n\nAssistant: Ok, please provide detailed infomation or provide paper to review."
template_multiple_first = "User: This is the paper:\n{}\n\nAssistant: This is the review:\n{}"
template_multiple_round = "User: Any more?\n\nAssistant: This is another review:\n{}"


multiple_jsonl_path = "multiple_data.jsonl"
for idx, fm in tqdm(df_merge_drop.groupby("forum")):
    thepaper = fm["paper"].iloc[0]
    template_multiple_list = []
    for order, rv in enumerate(fm.sort_values(by="r_order", ascending=True)["review"]):
        if order == 0:
            template_multiple_list.append(template_multiple_head)
            template_multiple_list.append(template_multiple_first.format(thepaper, rv))
        else:
            template_multiple_list.append(template_multiple_round.format(rv))
    final_text_multiple = "\n\n".join(template_multiple_list)
    if final_text_multiple:
        dict_text_multiple = {"text": final_text_multiple}
        with jsonlines.open(multiple_jsonl_path, mode="a") as file_multiple:
            file_multiple.write(dict_text_multiple)

100%|██████████| 3837/3837 [00:01<00:00, 3105.90it/s]
