# Preprocessing for PubMed Dataset

## 1. Load Dataset

In [68]:
# load by pandas
import pandas as pd
train_df = pd.read_json('../datasets/pubmed-dataset-copy/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-copy/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-copy/test.json', lines=True)
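
The `lines=True` argument tells pandas that every line of the file is a separate JSON object (the JSON Lines layout). As a rough sketch of what that layout looks like, the snippet below parses two made-up records with the standard `json` module. The record contents are placeholders, not real dataset rows.

```python
import json

# Two made-up records in JSON Lines form: one JSON object per line.
raw = (
    '{"article_id": "A1", "sections": [["s1 .", "s2 ."]]}\n'
    '{"article_id": "A2", "sections": [["s3 ."], ["s4 ."]]}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))            # 2
print(records[1]["sections"])  # [['s3 .'], ['s4 .']]
```

Each record keeps the nested shape described in the table below: `sections` is a list of sections, and each section is a list of sentences.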

* **PubMed statistics:**
  | Split | Samples | 
  | --- | --- |
  | Train | 119924 | 
  | Val | 6633 |
  | Test | 6658 |

* **A sample from the original (vanilla) dataset**
  * The `labels` column appears to be intended for a classification task. It is `None` for every sample in the original dataset, so no processing was applied to it.
  
  | article_id | article_text | abstract_text | labels | section_names | sections |
  | --- | --- | --- | --- | --- | --- |
  | str | List[str] | List[str] | None | List[str] | List[List[str]] |
  | 'PMC3872579' | split by sentence<br>(appears to be split into sentences; join them before use) | ['<BOS> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </EOS>', <br> '<BOS> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </EOS>', <br> '<BOS> however , there were no significant changes among boys or total population . </EOS>', <br> '<BOS> the mean of all anthropometric indices changed significantly after intervention both among girls and boys as well as in total population . </EOS>', <br> "<BOS> the pre- and post - test education assessment in both groups showed that the student 's average knowledge score has been significantly increased from 12.5  3.2 to 16.8  4.3 ( p < 0.0001).conclusion : this study demonstrates the potential success and scalability of school feeding programs in iran . </EOS>", <br> '<BOS> community nutrition intervention based on the advocacy process model is effective on reducing the prevalence of underweight specifically among female school aged children . </EOS>'] <br> The splitting method is unclear; the text needs to be joined and re-split | None | ['INTRODUCTION', <br>'MATERIALS AND METHODS', <br>'Participants'Instruments', <br>'Procedure', <br>'First step', <br>'Second step', <br>'Third step', <br>'Forth step', <br>'Interventions', <br>'Fifth step (assessment)', <br>'Data analysis', <br>'RESULTS', <br>'DISCUSSION', <br>'CONCLUSION'] <br><br>Note that METHODS covers everything from Participants'Instruments through Data analysis | [<br>section[seq1, seq2, ...], <br>section[seq1, seq2, ...], <br>...] |
  | Format after reformatting | --- | --- | --- | --- | --- |
  | str | List[str] | List[str] | None | List[str] | List[str] |

* **Final sample statistics after preprocessing**
  | Split | samples | 
  | --- | --- |
  | Train | 24793 | 
  | Val | 1398 |
  | Test | 1429 |


## 2. Dataset Preprocessing

### 2.1 Keyword Table
The summaries (abstracts) in the PubMed dataset are not split into multiple parts. Fortunately, the dataset carries section structure, so we can segment it manually by keyword matching.
> Reference: [GovReport](https://arxiv.org/pdf/2104.02112v2.pdf), see page 18

In [69]:
segmentation_keyword_table = {
    # Introduction and Literature
    'part_1': ['introduction', 'case', 'objectives', 'purposes', 
               'objective', 'purpose', 'background', 'literature',
               'aim', 'aims'],
    
    # Methods
    'part_2': ['material and methods',
               'materials and methods', 'methods', 'techniques', 'methodology',
               'materials', 'research design', 'study design'],
    
    # Results
    'part_3': ['result', 'results', 'experiments', 'observations'],
    
    # Discussion and Conclusion
    'part_4': ['discussion', 'limitation', 'conclusions', 
               'conclusion', 'concluding', 'comment', 'comments', 
               'summary', 'concluding remarks'],
}
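
As a minimal sketch of how this table can be applied, the helper below reports which parts a section name falls into. The function name `match_parts` is hypothetical and not part of the notebook; the case-insensitive substring rule mirrors the matching used later in Section 2.2.3.

```python
# Keyword table copied from the cell above.
segmentation_keyword_table = {
    'part_1': ['introduction', 'case', 'objectives', 'purposes', 'objective',
               'purpose', 'background', 'literature', 'aim', 'aims'],
    'part_2': ['material and methods', 'materials and methods', 'methods',
               'techniques', 'methodology', 'materials', 'research design',
               'study design'],
    'part_3': ['result', 'results', 'experiments', 'observations'],
    'part_4': ['discussion', 'limitation', 'conclusions', 'conclusion',
               'concluding', 'comment', 'comments', 'summary',
               'concluding remarks'],
}

def match_parts(section_name, table=segmentation_keyword_table):
    """Return every part whose keywords occur in the section name (hypothetical helper)."""
    name = section_name.lower()
    return [part for part, keywords in table.items()
            if any(keyword in name for keyword in keywords)]

print(match_parts('MATERIALS AND METHODS'))   # ['part_2']
print(match_parts('Results and discussion'))  # ['part_3', 'part_4']
print(match_parts('Statistical analyses'))    # []
```

Because the rule is a substring test, one section name can land in several parts, and a subsection such as "Statistical analyses" matches none of them.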

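### 2.2 Sections

#### 2.2.1 Concatenate the sentences of each section

Each row holds a complete `sections` value made up of multiple sections, and each section contains multiple sentences that need to be merged into one string.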

In [70]:
train_df

Unnamed: 0,article_id,article_text,abstract_text,labels,section_names,sections
0,PMC3872579,[a recent systematic analysis showed that in 2...,[<S> background : the present study was carrie...,,"[INTRODUCTION, MATERIALS AND METHODS, Particip...",[[a recent systematic analysis showed that in ...
1,PMC3770628,[it occurs in more than 50% of patients and ma...,[<S> backgroundanemia in patients with cancer ...,,"[Introduction, Patients and methods, Study des...",[[it occurs in more than 50% of patients and m...
2,PMC5330001,"[tardive dystonia ( td ) , a rarer side effect...",[<S> tardive dystonia ( td ) is a serious side...,,"[INTRODUCTION, CASE REPORT, DISCUSSION, Declar...","[[tardive dystonia ( td ) , a rarer side effec..."
3,PMC4386667,"[lepidoptera include agricultural pests that ,...",[<S> many lepidopteran insects are agricultura...,,"[1. Introduction, 2. Insect Immunity, 3. Signa...",[[lepidoptera include agricultural pests that ...
4,PMC4307954,[syncope is caused by transient diffuse cerebr...,[<S> we present an unusual case of recurrent c...,,"[Introduction, Case report, Discussion, Confli...",[[syncope is caused by transient diffuse cereb...
...,...,...,...,...,...,...
119919,PMC3502213,[eukaryotic cells depend on vesicle - mediated...,[<S> long - distance trafficking of membranous...,,"[Introduction, Motor-Dependent Transport of Ra...",[[eukaryotic cells depend on vesicle - mediate...
119920,PMC3198562,[as regards the selection criteria of the post...,[<S> aims and objectives : to study the stress...,,"[INTRODUCTION, MATERIALS AND METHODS, Modeling...",[[fiber post systems are routinely used in res...
119921,PMC4436536,[in most of the peer review publications in th...,[<S> abstractbackgroundthe objective of this s...,,"[Introduction, Methods, Results, Discussion, L...",[[in most of the peer review publications in t...
119922,PMC4251613,[the reveal registry is a longitudinal registr...,[<S> background : patients with pulmonary arte...,,"[TRIAL REGISTRY:, Materials and Methods, REVEA...","[[], [the reveal registry is a longitudinal re..."


In [71]:
train_df['sections'][0]

[['a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries .',
  'in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively .',
  'the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% .',
  'anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight .',
  'snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states',
  'there are also some reports regarding school feeding programs in developi

In [72]:
def sec_join(df):
    # Iterate over every row (sample)
    sample_list = []
    for sample in df['sections']:
        # Join the sentences of each section in the sample into one string
        sections = []
        for section in sample:
            sections.append(" ".join(section))

        sample_list.append(sections)

    return sample_list

In [73]:
for dataset in [train_df, val_df, test_df]:
    dataset['sections'] = sec_join(dataset)
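
To make the effect of the join concrete, here is a small sketch on a made-up sample. The helper name `join_sections` and the toy sentences are illustrative only.

```python
def join_sections(sample):
    """Join the sentences of every section in one sample into a single string."""
    return [" ".join(section) for section in sample]

# One made-up sample with three sections; the second section is empty.
sample = [["first sentence .", "second sentence ."], [], ["only sentence ."]]
joined = join_sections(sample)

print(joined)  # ['first sentence . second sentence .', '', 'only sentence .']
```

An empty section becomes an empty string, which is why some rows in the output further down start with an empty entry.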

In [74]:
train_df.iloc[0]['sections']

["a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states there are also some reports regarding school feeding programs in developing countries . in vietnam 

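#### 2.2.2 Remove empty rows (including None)

* Drop the `labels` column; it is entirely empty in the original dataset.

In [75]:
# First drop the labels column, which is empty in the original dataset
for dataset in [train_df, val_df, test_df]:
    dataset.drop(columns=['labels'], inplace=True)

* Then drop rows that contain null values, as a safeguard.

In [83]:
# Drop rows that contain null values
for dataset in [train_df, val_df, test_df]:
    dataset.dropna(inplace=True)

* Filter out rows whose article text is empty.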

In [103]:
train_df = train_df[train_df['article_text'].apply(lambda x: x != [""])]
val_df = val_df[val_df['article_text'].apply(lambda x: x != [""])]
test_df = test_df[test_df['article_text'].apply(lambda x: x != [""])]
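
The order of these three steps matters: `dropna` removes any row containing a null, so if the all-null `labels` column were still present, every row would be dropped. The sketch below replays the steps on a made-up three-row frame; the values are placeholders.

```python
import pandas as pd

# Made-up frame: one valid row, one row with an empty article, one null row.
df = pd.DataFrame({
    "article_text": [["some text ."], [""], None],
    "labels": [None, None, None],
})

df = df.drop(columns=["labels"])  # the column carries no information
df = df.dropna()                  # removes the row whose article_text is None
df = df[df["article_text"].apply(lambda x: x != [""])]  # removes the [""] row

print(len(df))  # 1
```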

In [95]:
# Only the train split appears to contain empty article_text entries
train_df

Unnamed: 0,article_id,article_text,abstract_text,section_names,sections
0,PMC3872579,[a recent systematic analysis showed that in 2...,[<S> background : the present study was carrie...,"[INTRODUCTION, MATERIALS AND METHODS, Particip...",[a recent systematic analysis showed that in 2...
1,PMC3770628,[it occurs in more than 50% of patients and ma...,[<S> backgroundanemia in patients with cancer ...,"[Introduction, Patients and methods, Study des...",[it occurs in more than 50% of patients and ma...
2,PMC5330001,"[tardive dystonia ( td ) , a rarer side effect...",[<S> tardive dystonia ( td ) is a serious side...,"[INTRODUCTION, CASE REPORT, DISCUSSION, Declar...","[tardive dystonia ( td ) , a rarer side effect..."
3,PMC4386667,"[lepidoptera include agricultural pests that ,...",[<S> many lepidopteran insects are agricultura...,"[1. Introduction, 2. Insect Immunity, 3. Signa...","[lepidoptera include agricultural pests that ,..."
4,PMC4307954,[syncope is caused by transient diffuse cerebr...,[<S> we present an unusual case of recurrent c...,"[Introduction, Case report, Discussion, Confli...",[syncope is caused by transient diffuse cerebr...
...,...,...,...,...,...
119919,PMC3502213,[eukaryotic cells depend on vesicle - mediated...,[<S> long - distance trafficking of membranous...,"[Introduction, Motor-Dependent Transport of Ra...",[eukaryotic cells depend on vesicle - mediated...
119920,PMC3198562,[as regards the selection criteria of the post...,[<S> aims and objectives : to study the stress...,"[INTRODUCTION, MATERIALS AND METHODS, Modeling...",[fiber post systems are routinely used in rest...
119921,PMC4436536,[in most of the peer review publications in th...,[<S> abstractbackgroundthe objective of this s...,"[Introduction, Methods, Results, Discussion, L...",[in most of the peer review publications in th...
119922,PMC4251613,[the reveal registry is a longitudinal registr...,[<S> background : patients with pulmonary arte...,"[TRIAL REGISTRY:, Materials and Methods, REVEA...","[, the reveal registry is a longitudinal regis..."


#### 2.2.3 Merge sections according to the keyword table

* First, lowercase each entry in `section_names`.
* Then collect every section whose name matches the keyword table.

In [96]:
train_df['sections'][0]

["a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states there are also some reports regarding school feeding programs in developing countries . in vietnam 

In [97]:
train_df['abstract_text'][0]

['<S> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </S>',
 '<S> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </S>',
 '<S> however , there were no significant changes among boys or total population . </S>',
 '<S> 

In [98]:
import re
from tqdm import tqdm
def keyword_matching_and_re_section(dataset):
    for idx, sample in tqdm(dataset.iterrows()):
        # One bucket per part (part_1 .. part_4), reset for each sample
        sections = [[] for _ in range(4)]

        for sec_name, sec in zip(sample['section_names'], sample['sections']):
            for part_id, keywords in enumerate(segmentation_keyword_table.values()):
                # Case-insensitive substring match of any keyword against the section name
                if any(keyword.lower() in sec_name.lower() for keyword in keywords):
                    sections[part_id].append(sec)

        # Merge the sections collected for each part into a single string
        sections = [" ".join(part) for part in sections]

        # Use .at instead of chained indexing so the assignment reaches the DataFrame
        dataset.at[idx, 'sections'] = sections

In [99]:
for dataset in [train_df, val_df, test_df]:
    keyword_matching_and_re_section(dataset)

117112it [02:55, 668.83it/s]
6633it [00:01, 3651.25it/s]
6658it [00:01, 3663.66it/s]
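
The regrouping can be sketched on a single made-up sample without pandas. The table here is a shortened, illustrative version of the notebook's keyword table, and `regroup` is a hypothetical helper name.

```python
# Shortened, illustrative keyword table (the notebook's table has more entries).
table = {
    "part_1": ["introduction", "background"],
    "part_2": ["methods", "materials"],
    "part_3": ["result"],
    "part_4": ["discussion", "conclusion"],
}

def regroup(section_names, sections, table=table):
    """Collect section texts into one string per part, by keyword match on the name."""
    parts = [[] for _ in table]
    for name, text in zip(section_names, sections):
        for part_id, keywords in enumerate(table.values()):
            if any(keyword in name.lower() for keyword in keywords):
                parts[part_id].append(text)
    return [" ".join(part) for part in parts]

names = ["Introduction", "Materials and methods", "Participants",
         "Results", "Discussion", "Conclusion"]
texts = ["intro .", "methods .", "participants .",
         "results .", "discussion .", "conclusion ."]

regrouped = regroup(names, texts)
print(regrouped)
# ['intro .', 'methods .', 'results .', 'discussion . conclusion .']
```

Note that the "Participants" text is discarded: a section whose name matches no keyword is not carried over into any part.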


* Drop any row whose processed `sections` still lacks the complete four-part structure.

In [100]:
train_df = train_df[train_df['sections'].apply(lambda x: all(s != '' for s in x))]
val_df = val_df[val_df['sections'].apply(lambda x: all(s != '' for s in x))]
test_df = test_df[test_df['sections'].apply(lambda x: all(s != '' for s in x))]
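
The filter keeps a row only when none of its four merged parts is empty. A minimal sketch of that predicate on plain lists follows; `has_all_parts` is a hypothetical name and the samples are made up.

```python
def has_all_parts(parts):
    """True when every merged part contains some text."""
    return all(part != "" for part in parts)

samples = [
    ["intro .", "methods .", "results .", "discussion ."],  # complete
    ["intro .", "", "results .", "discussion ."],           # no methods text
]

kept = [parts for parts in samples if has_all_parts(parts)]
print(len(kept))  # 1
```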

In [115]:
train_df.iloc[51235]['sections']

["aptamers are short single - stranded oligomers made up of dna , rna , or peptides that are capable of binding a target ligand ( proteins , small molecules , or even living cells ) with high affinity . they are also known as artificial antibodies because in addition to binding with high affinity , they also bind with high specificity . aptamers have several advantages over antibodies , including ease and low cost of production which does not involve animals . aptamers are less immunogenic than antibodies and are already being used as therapeutic agents in humans . nucleic acid aptamers , unlike antibodies , can be selected for and used under nonphysiological conditions , such as high - salt conditions and varying ph . also , nucleic acid aptamers are able to undergo specific conformational changes that antibodies can not . for example , nucleic acid aptamer binding can be  turned off  by the addition of the complementary strand . additionally , nucleic acid aptamers can undergo a conf

* After this step, 51552/2860/2984 samples remain in the train/val/test splits, respectively.

* Save an intermediate checkpoint.

In [121]:
train_df.to_json('../datasets/pubmed-dataset-processed/train.json', orient='records', lines=True)
val_df.to_json('../datasets/pubmed-dataset-processed/val.json', orient='records', lines=True)
test_df.to_json('../datasets/pubmed-dataset-processed/test.json', orient='records', lines=True)

In [129]:
index = 457

In [130]:
train_df.iloc[index]['section_names']

['Introduction',
 'Materials and methods',
 'Participants',
 'Missing data',
 'Statistical analyses',
 'Results',
 'Internal consistency of the MASI',
 'Confirmatory factor analysis of the MASI',
 'Item reduction procedure',
 'Rasch analysis of the MASI-R',
 'Discussion',
 'Conclusion',
 'Supplementary material']

In [131]:
train_df.iloc[index]['sections']

['it is well known that job stress is responsible for poor work performance , high absenteeism , less work productivity , and several diseases.15 recently , results from thirteen independent cohort studies in europe indicated that job strain is responsible for coronary heart disease , as are lifestyle and orthodox risk factors.6 the population s attributable risk for job strain was 3.4% , which is lower than that for smoking habits ( 36% ) , abdominal obesity ( 20% ) , and physical inactivity ( 12% ) . however , the interheart study7 showed that work stress doubled the risk of coronary heart disease . in finland , the increase in job strain was associated with an increase in the risk of requiring a disability pension . moreover , as regarding men , a positive association between cardiovascular diseases and an increased risk of disability pension was found.8 mental health caused disabling conditions in 9% of the population in the uk , and estimations for the 20012002 prevalence of self 

In [132]:
train_df.iloc[index]['abstract_text']

['<S> introduction and objectivesa multidimensional self - report questionnaire to evaluate job - related stress factors is presented . the questionnaire , </S>',
 '<S> called maugeri stress index  reduced form ( masi - r ) , aims to assess the impact of job strain on a team or on a single worker by considering four domains : wellness , resilience , perception of social support , and reactions to stressful situations.material and methodsthe reliability of a first longer version ( 47 items ) of the questionnaire was evaluated by an internal consistency analysis and a confirmatory factor analysis . </S>',
 '<S> an item reduction procedure was implemented to obtain a short form of the instrument , and the psychometric properties of the resulting instrument were evaluated using the rasch measurement model.resultsa total of 14 items from the initial pool were deleted because they were not productive for measurement . </S>',
 '<S> the analysis of internal consistency led to the exclusion of 

#### 2.2.4 Split the abstract
* Important: part of the abstract text needs an additional word-segmentation pass.

* Load the dataset.

In [None]:
train_df = pd.read_json('../datasets/pubmed-dataset-processed/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-processed/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-processed/test.json', lines=True)

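1. Each abstract consists of multiple sentences that begin with `<S>` and end with `</S>`. These tags must be removed because BART adds its own start and end tokens automatically.
2. How the original dataset split the sentences is unclear, so the text has to be re-split. Start by merging all sentences into one string.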

In [134]:
test_sample = train_df.iloc[0]['abstract_text']
test_sample

['<S> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </S>',
 '<S> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </S>',
 '<S> however , there were no significant changes among boys or total population . </S>',
 '<S> 

In [135]:
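def remove_tag(dataset):
    for idx, sample in dataset.iterrows():
        # Strip the <S> and </S> tags from every sentence, then merge the sentences into one string
        abstract_text = " ".join([text.replace('<S>', '').replace('</S>', '').strip() for text in sample['abstract_text']])
        # Use .at instead of chained indexing so the assignment reaches the DataFrame
        dataset.at[idx, 'abstract_text'] = abstract_text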

In [136]:
for dataset in [train_df, val_df, test_df]:
    remove_tag(dataset)
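
The tag removal and merge can be checked on a made-up abstract without pandas. The helper name `strip_tags` is illustrative only.

```python
def strip_tags(sentences):
    """Remove the <S> and </S> markers from each sentence and merge the result."""
    cleaned = [s.replace("<S>", "").replace("</S>", "").strip() for s in sentences]
    return " ".join(cleaned)

abstract = ["<S> background : first sentence . </S>",
            "<S> second sentence . </S>"]

print(strip_tags(abstract))
# background : first sentence . second sentence .
```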

In [144]:
train_df.iloc[457]['abstract_text']

'introduction and objectivesa multidimensional self - report questionnaire to evaluate job - related stress factors is presented . the questionnaire , called maugeri stress index  reduced form ( masi - r ) , aims to assess the impact of job strain on a team or on a single worker by considering four domains : wellness , resilience , perception of social support , and reactions to stressful situations.material and methodsthe reliability of a first longer version ( 47 items ) of the questionnaire was evaluated by an internal consistency analysis and a confirmatory factor analysis . an item reduction procedure was implemented to obtain a short form of the instrument , and the psychometric properties of the resulting instrument were evaluated using the rasch measurement model.resultsa total of 14 items from the initial pool were deleted because they were not productive for measurement . the analysis of internal consistency led to the exclusion of eight items , while the analysis performed u

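3. Split all sentences, using the period as the delimiter.
---
* Note: this split triggers on every `.` character, so a decimal number such as 37.8 is also split into `37.` and `8`.
  * Because cases such as `reported.conclusionintravenous` exist, NLTK cannot be used directly for sentence splitting. Split on `.` with a regular expression first, then tokenize with `nltk.word_tokenize` (because `wordninja` removes punctuation).
  * The `wordninja` package splits words that have been run together.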
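
The splitting behaviour described above can be reproduced with the standard `re` module. The first pattern is the one used in the notebook. The second adds a negative lookahead so that a period followed by a digit is left alone; it is an assumption for illustration, not something the notebook uses, and it would also skip a genuine sentence boundary that happens to be followed by a digit.

```python
import re

# Made-up text with a decimal number and a fused heading ("c.resultsa").
text = "temperature was 37.8 c.resultsa total"

# Pattern used in the notebook: split after every "." and keep the ".".
naive = re.split(r'(?<=\.)', text)
print(naive)    # ['temperature was 37.', '8 c.', 'resultsa total']

# Hypothetical variant: do not split when the next character is a digit.
guarded = re.split(r'(?<=\.)(?!\d)', text)
print(guarded)  # ['temperature was 37.8 c.', 'resultsa total']
```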

In [145]:
import nltk
import re
import wordninja
from tqdm import tqdm
nltk.download("punkt", quiet=True)

def sentences_split(dataset):
    for idx, row in tqdm(dataset.iterrows()):
        # Split after every "." while keeping the "." attached to the preceding text
        abstract_text = re.split(r'(?<=\.)', row['abstract_text'])
        # Drop empty strings and trim surrounding whitespace
        abstract_text = [text.strip() for text in abstract_text if text]
        
        sentences = []
        for sent in abstract_text:
            words = nltk.word_tokenize(sent)
            
            # Apply wordninja only to purely alphabetic tokens so punctuation and numbers are preserved
            split_words = [wordninja.split(word) if word.isalpha() else [word] for word in words]
            
            # Flatten the nested list
            flat_split_words = [item for sublist in split_words for item in sublist]

            # Rebuild the sentence: the first token, the last token, and
            # punctuation-only tokens are attached without a leading space
            sent = ''
            for i, word in enumerate(flat_split_words):
                if i == 0 or i == len(flat_split_words) - 1 or re.match(r'^\W+$', word):
                    sent += word
                else:
                    sent += f' {word}'
            sentences.append(sent)
            
        dataset.at[idx, 'abstract_text'] = sentences

ModuleNotFoundError: No module named 'wordninja'

In [283]:
for dataset in [train_df_copy_copy, val_df_copy_copy, test_df_copy_copy]:
    sentences_split(dataset)

51489it [06:45, 127.13it/s]
2858it [00:22, 128.62it/s]
2982it [00:22, 130.51it/s]


In [284]:
test_df_copy_copy.iloc[0]['abstract_text']

["research on the implications of anxiety in parkinson 's disease( pd) has been neglected despite its prevalence in nearly 50% of patients and its negative impact on quality of life.",
 'previous reports have noted that neuro psychiatric symptoms impair cognitive performance in pd patients; however, to date, no study has directly compared pd patients with and without anxiety to examine the impact of anxiety on cognitive impairments in pd.',
 'this study compared cognitive performance across 50 pd participants with and without anxiety( 17 pda+; 33 pda), who underwent neurological and neuro psychological assessment.',
 'group performance was compared across the following cognitive domains: simple attention/ vi suo motor processing speed, executive function( e.',
 'g.',
 ', set- shifting), working memory, language, and memory/ new verbal learning.',
 'results showed that pda+ performed significantly worse on the digit span forward and backward test and part b of the trail making task( t m
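
The spacing in this output, for example `disease( pd)`, follows from the re-joining rule in `sentences_split`: punctuation-only tokens are attached to the preceding token, and the next word then gets a leading space. The sketch below isolates that rule on a hand-written token list, so it needs neither `nltk` nor `wordninja`; `join_tokens` is a hypothetical helper name.

```python
import re

def join_tokens(tokens):
    """Re-join tokens with the same spacing rule as sentences_split."""
    sent = ''
    for i, word in enumerate(tokens):
        # First token, last token, and punctuation-only tokens get no leading space.
        if i == 0 or i == len(tokens) - 1 or re.match(r'^\W+$', word):
            sent += word
        else:
            sent += f' {word}'
    return sent

tokens = ['this', 'study', ',', 'however', '(', 'pd', ')', '.']
print(join_tokens(tokens))  # this study, however( pd).
```

The rule also attaches the last token without a space even when it is a word, so a fragment that does not end in punctuation has its final two words fused.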

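4. Match each sentence against the keyword table: if the word at the start of a sentence appears in the table, assign the whole segment to the corresponding part.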

In [285]:
train_df_split =  train_df_copy_copy.copy()
val_df_split = val_df_copy_copy.copy()
test_df_split = test_df_copy_copy.copy()

In [286]:
train_df_split.to_json('../datasets/pubmed-dataset-processed-abstract/train.json', orient='records', lines=True)
val_df_split.to_json('../datasets/pubmed-dataset-processed-abstract/val.json', orient='records', lines=True)
test_df_split.to_json('../datasets/pubmed-dataset-processed-abstract/test.json', orient='records', lines=True)

In [478]:
train_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/test.json', lines=True)

In [479]:

segmentation_keyword_table = {
    # Introduction and Literature
    'part_1': ['introduction', 'case', 'objectives', 'purposes', 
               'objective', 'purpose', 'background', 'literature',
               'aim', 'aims'],
    
    # Methods
    'part_2': ['material and methods',
               'materials and methods', 'methods', 'techniques', 'methodology',
               'materials', 'research design', 'study design'],
    
    # Results
    'part_3': ['result', 'results', 'experiments', 'observations'],
    
    # Discussion and Conclusion
    'part_4': ['discussion', 'limitation', 'conclusions', 
               'conclusion', 'concluding', 'comment', 'comments', 
               'summary', 'concluding remarks'],
    
}

In [480]:
import re
from tqdm import tqdm

def keyword_matching_and_re_abstract(dataset):
    for idx, sample in tqdm(dataset.iterrows()):
        # Reset for each sample
        abstract_parts = [[] for _ in range(4)]
        current_part = 0  # Start in part_1

        for sentence in sample['abstract_text']:
            # Extract the first word of the sentence
            if sentence != "":
                match = re.search(r'\b(\S+)\b', sentence)
                if match:
                    first_word = match.group(1).lower()
                else:
                    # No word was found in the sentence
                    print(f'No word matched in {sentence}')
                    first_word = ""
            else:
                # The sentence is an empty string
                print('Empty abstract_text')
                first_word = ""

            # Switch parts when the first word contains a keyword from the table
            for part_id, keywords in enumerate(segmentation_keyword_table.values()):
                if any(keyword.lower() in first_word for keyword in keywords):
                    current_part = part_id

            # Append the sentence to the current part
            abstract_parts[current_part].append(sentence)

        # Join the sentences of each part into one string
        abstract_parts = [" ".join(part) for part in abstract_parts]

        # Use .at instead of chained indexing so the assignment reaches the DataFrame
        dataset.at[idx, 'abstract_text'] = abstract_parts


In [None]:
for dataset in [train_df, val_df, test_df]:
    keyword_matching_and_re_abstract(dataset)
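
The assignment logic can be traced on one made-up abstract. The table below is a shortened, illustrative version of the notebook's keyword table, and `assign_parts` is a hypothetical helper that follows the same first-word rule.

```python
import re

# Shortened, illustrative keyword table (the notebook's table has more entries).
table = {
    "part_1": ["objective", "background", "aim"],
    "part_2": ["methods", "materials"],
    "part_3": ["result"],
    "part_4": ["conclusion", "discussion"],
}

def assign_parts(sentences, table=table):
    """Assign each sentence to a part; a keyword in the first word switches parts."""
    parts = [[] for _ in table]
    current = 0  # start in part_1
    for sentence in sentences:
        match = re.search(r'\b(\S+)\b', sentence)
        first_word = match.group(1).lower() if match else ""
        for part_id, keywords in enumerate(table.values()):
            if any(keyword in first_word for keyword in keywords):
                current = part_id
        parts[current].append(sentence)
    return [" ".join(part) for part in parts]

abstract = [
    "objectives: iron drops are prescribed.",
    "they may soften enamel.",
    "materials and methods: forty teeth were used.",
    "results: hardness decreased.",
    "conclusions: use with care.",
]

for part in assign_parts(abstract):
    print(part)
# objectives: iron drops are prescribed. they may soften enamel.
# materials and methods: forty teeth were used.
# results: hardness decreased.
# conclusions: use with care.
```

A sentence that starts without a keyword stays in the current part, so an abstract with no heading words at all ends up entirely in part_1 and is removed by the four-part filter.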

* Drop samples that do not follow the four-part structure.

In [None]:
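# Reassign each split explicitly: rebinding a loop variable would leave the original DataFrames unfiltered
train_df = train_df[train_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]
val_df = val_df[val_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]
test_df = test_df[test_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]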

In [492]:
test_df

Unnamed: 0,article_id,article_text,abstract_text,section_names,sections
1,PMC4086000,[ohss is a serious complication of ovulation i...,[objective: to evaluate the efficacy and safet...,"[Introduction, Materials and Methods, Results,...",[ohss is a serious complication of ovulation i...
2,PMC4414990,[type 1 diabetes ( t1d ) results from the dest...,[objective( s): pen to x if yl line is an immu...,"[Introduction, Materials and Methods, Drug and...",[type 1 diabetes ( t1d ) results from the dest...
3,PMC5094872,[determinar a presena de anticorpos ige especf...,[abstract objective: to determine the presence...,"[Objetivo:, Mtodos:, Resultados:, Concluses:, ...",[\n staphylococcus aureus is a gram - positive...
4,PMC4262794,[development of human societies and industrial...,[background and objective: anxiety and depress...,"[INTRODUCTION , METHODS , ETHICAL CONSIDERATIO...",[development of human societies and industrial...
6,PMC4841868,[medical tourism is illustrated as occurrence ...,"[background: role of information source, perce...","[Introduction, Literature Review, Internationa...",[medical tourism is illustrated as occurrence ...
...,...,...,...,...,...
2970,PMC5029114,[the incidence of animal bites in the world is...,"[background: despite the progress made, animal...","[INTRODUCTION, MATERIALS AND METHODS, RESULTS,...",[the incidence of animal bites in the world is...
2975,PMC3427834,[minimally invasive surgery was first describe...,[purpose this study aimed to comparatively eva...,"[INTRODUCTION, MATERIALS AND METHODS, RESULTS,...",[minimally invasive surgery was first describe...
2976,PMC3155855,[ensuring the highest quality of health care f...,[background: we recently indicated that patien...,"[Introduction, Methods, Quality of care, Predi...",[ensuring the highest quality of health care f...
2977,PMC4325121,"[if left untreated , ureteropelvic junction ob...",[purpose to evaluate changes in differential r...,"[INTRODUCTION, MATERIALS AND METHODS, RESULTS,...","[if left untreated , ureteropelvic junction ob..."


* After this step, 24793/1398/1429 samples remain in the train/val/test splits, respectively.

In [502]:
train_df.iloc[5567]['abstract_text']

['objectives: iron and multivitamin drops are being frequently prescribed in children less than 2 years of age. due to their low ph levels, these drops may lead to the softening of enamel and accelerate the destructive process. the aim of the present study was to investigate the enamel micro hardness of primary teeth after exposing them to iron and multivitamin drops.',
 'materials and methods: forty healthy anterior teeth were randomly divided into four groups of 10 samples each. samples were exposed to two iron drops of khar azmi( iran) and iron or m( uk) and two multivitamin drops of shah daro u( iran) and euro vit( germany) for 5 min. the surface micro hardness was measured before and after exposure and data processing was done using statistical paired t- test and analysis of variance( a nova) test. the surface structure of the teeth was examined by scanning electron microscope( sem).',
 'results: in all groups, micro hardness was decreased, but it was not significant in euro vit m

## 3. Save Processed Dataset

In [503]:
train_df.to_json('../datasets/pubmed-dataset-processed-final/train.json', orient='records', lines=True)
val_df.to_json('../datasets/pubmed-dataset-processed-final/val.json', orient='records', lines=True)
test_df.to_json('../datasets/pubmed-dataset-processed-final/test.json', orient='records', lines=True)

In [15]:
import pandas as pd
processed_train_df = pd.read_json('../datasets/pubmed-dataset-processed-final/train.json', lines=True)
processed_val_df = pd.read_json('../datasets/pubmed-dataset-processed-final/val.json', lines=True)
processed_test_df = pd.read_json('../datasets/pubmed-dataset-processed-final/test.json', lines=True)

In [22]:
processed_train_df.iloc[5567]['article_id']

'PMC4697239'

In [16]:
processed_train_df.iloc[5567]['sections']

['a large variety of products could be suggested for the treatment of iron deficiency in children and since the absorption of fe ( ii ) is more effective than that of other oral iron products , its consumption is more common . it is usually prescribed to prevent iron deficiency in children from 6 months to 2 years of age . its consumption may cause black discoloration on primary tooth , and many parents think that a kind of decay has been formed after giving iron drops to their children ; this may be the reason why they limit the consumption of this essential element by their children . this dental discoloration is one of the reasons parents and children refer to dental offices . demineralization of enamel has a clinical importance and many studies investigated the differences between the susceptibility of the enamel in primary and permanent teeth to erosion . in some studies [ amaechi et al . and hunter et al . ] , more susceptibility to enamel erosion was observed in primary teeth ( 

In [17]:
processed_train_df.iloc[5667]['abstract_text']

['aim. this paper presents a simple, versatile in vitro methodology that enables indirect quant if i cation of shrinkage and expansion stresses under clinically relevant conditions without the need for a dedicated instrument.',
 'methods. for shrinkage effects, resulting cusp deformation of aluminum blocks with mod type cavity, filled with novel filling compositions and commercial cements, has been measured using a bench- top micrometer and a linear variable differential transformer( lv dt, a displacement transducer) based instrument.',
 'results. the results demonstrated the validity of the proposed simple methodology. the technique was successfully used in longer- term measurements of shrinkage and expansion stress for several dental compositions.',
 'conclusions. in contrast to in situ techniques where a measuring instrument is dedicated to the sample and its data collection, the proposed simple methodology allows for transfer of the samples to the environment of choice for storage 

In [18]:
train_df = pd.read_json('../datasets/pubmed-dataset-copy/train.json', lines=True)

In [38]:
# select the article 'PMC4697239'
sample = train_df[train_df['article_id'] == 'PMC4697239']
sample['abstract_text'].to_list()

[['<S> objectives : iron and multivitamin drops are being frequently prescribed in children less than 2 years of age . due to their low ph levels , these drops may lead to the softening of enamel and accelerate the destructive process . the aim of </S>',
  '<S> the present study was to investigate the enamel microhardness of primary teeth after exposing them to iron and multivitamin drops.materials and methods : forty healthy anterior teeth were randomly divided into four groups of 10 samples each . </S>',
  '<S> samples were exposed to two iron drops of kharazmi ( iran ) and ironorm ( uk ) and two multivitamin drops of shahdarou ( iran ) and eurovit ( germany ) for 5 min . </S>',
  '<S> the surface microhardness was measured before and after exposure and data processing was done using statistical paired t - test and analysis of variance ( anova ) test . </S>',
  '<S> the surface structure of the teeth was examined by scanning electron microscope ( sem).results : in all groups , microh
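
The raw abstract above still carries `<S>` / `</S>` sentence markers, while the processed version does not. The notebook cells that do this cleaning are not part of this section, so the block below is only a rough illustration of removing the markers with a regular expression. `strip_sentence_markers` is a hypothetical helper, not the notebook's actual preprocessing function, and it does not reproduce the other changes (such as re-splitting by section label).

```python
import re

# Matches an opening <S> or closing </S> marker plus adjacent whitespace.
_MARKER = re.compile(r"\s*</?S>\s*")

def strip_sentence_markers(sentences):
    """Remove <S>/</S> markers from each string and drop empty results."""
    cleaned = (_MARKER.sub(" ", s).strip() for s in sentences)
    return [s for s in cleaned if s]

# Shortened, made-up fragments in the same format as the raw abstract.
raw = ["<S> the aim of </S>", "<S> the present study . </S>"]
print(strip_sentence_markers(raw))
```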

# 4. Dataset Statistics

In [None]:
import nltk
from tqdm import tqdm
nltk.download("punkt", quiet=True)

* Avg token size of Sections

In [35]:
# calculate the avg token size of dataset['sections']:
def avg_token_size(dataset):
    token_size = 0
    for idx, sample in tqdm(dataset.iterrows()):
        for sec in sample['sections']:
            token_size += len(nltk.word_tokenize(sec))
    return token_size / len(dataset)

In [37]:
for dataset in [processed_train_df, processed_val_df, processed_test_df]:
    print(avg_token_size(dataset))

24793it [04:40, 88.27it/s]
2739.152543056508
1398it [00:15, 88.62it/s]
2752.1630901287554
1429it [00:16, 88.98it/s]
2732.187543736879

| Split | Avg. tokens per sample (sections) |
| --- | --- |
| Train | 2739 |
| Val | 2752 |
| Test | 2732 |

* Avg token size of Abstract

In [39]:
# calculate the avg token size of dataset['abstract_text']:
def avg_token_size_abstract(dataset):
    token_size = 0
    for idx, sample in tqdm(dataset.iterrows()):
        # `abstract_part` avoids shadowing the built-in abs()
        for abstract_part in sample['abstract_text']:
            token_size += len(nltk.word_tokenize(abstract_part))
    return token_size / len(dataset)

In [40]:
for dataset in [processed_train_df, processed_val_df, processed_test_df]:
    print(avg_token_size_abstract(dataset))

24793it [00:43, 571.31it/s]
298.68567740894605
1398it [00:02, 563.99it/s]
300.18741058655223
1429it [00:02, 557.62it/s]
302.94891532540237

| Split | Avg. tokens per sample (abstract_text) |
| --- | --- |
| Train | 299 |
| Val | 300 |
| Test | 303 |
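
The two counting functions above differ only in the column they read. A minimal sketch of one shared helper follows; it takes plain dict records (for example from `df.to_dict('records')`) and a tokenizer callable. The name `avg_tokens_per_sample` and the whitespace default tokenizer are assumptions made to keep the sketch dependency-free, so its counts will differ from those of `nltk.word_tokenize`.

```python
def avg_tokens_per_sample(records, column, tokenize=str.split):
    """Average number of tokens per record, summed over the strings in `column`."""
    records = list(records)
    if not records:
        raise ValueError("records is empty")
    total = sum(len(tokenize(text)) for rec in records for text in rec[column])
    return total / len(records)

# Toy records (made up); whitespace tokenization is only a stand-in for NLTK.
toy = [
    {"sections": ["a b c", "d e"], "abstract_text": ["a b"]},
    {"sections": ["f"], "abstract_text": ["c d e f"]},
]
print(avg_tokens_per_sample(toy, "sections"))       # (5 + 1) / 2
print(avg_tokens_per_sample(toy, "abstract_text"))  # (2 + 4) / 2
```

With the notebook's data, a call would look like `avg_tokens_per_sample(processed_train_df.to_dict('records'), 'sections', nltk.word_tokenize)`.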