# Preprocessing for PubMed Dataset

## 1. Load Dataset

In [3]:
# load by pandas
import pandas as pd
train_df = pd.read_json('../datasets/pubmed-dataset-copy/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-copy/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-copy/test.json', lines=True)

* **Pubmed statistic:**
  | Split | samples | 
  | --- | --- |
  | Train | 119924 | 
  | Val | 6633 |
  | Test | 6658 |

* **a sample of vanilla dataset**
  * labels seems to used for classification task, the labels of vanilla datasets are all None, i didn't do any processing.
  
  | article_id | article_text | abstract_text | labels | section_names | sections |
  | --- | --- | --- | --- | --- | --- |
  | str | List[str] | List[str] | None | List[str] | List[List[str]] |
  | 'PMC3872579' | split by sentence<br>(似乎按照句子进行分割了, 如果要使用需要进行join) | ['<BOS> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </EOS>', <br> '<BOS> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </EOS>', <br> '<BOS> however , there were no significant changes among boys or total population . </EOS>', <br> '<BOS> the mean of all anthropometric indices changed significantly after intervention both among girls and boys as well as in total population . </EOS>', <br> "<BOS> the pre- and post - test education assessment in both groups showed that the student 's average knowledge score has been significantly increased from 12.5  3.2 to 16.8  4.3 ( p < 0.0001).conclusion : this study demonstrates the potential success and scalability of school feeding programs in iran . </EOS>", <br> '<BOS> community nutrition intervention based on the advocacy process model is effective on reducing the prevalence of underweight specifically among female school aged children . </EOS>'] <br> 分割方式不明, 需要进行join并重新分割 | None | ['INTRODUCTION', <br>'MATERIALS AND METHODS', <br>'Participants'Instruments', <br>'Procedure', <br>'First step', <br>'Second step', <br>'Third step', <br>'Forth step', <br>'Interventions', <br>'Fifth step (assessment)', <br>'Data analysis', <br>'RESULTS', <br>'DISCUSSION', <br>'CONCLUSION'] <br><br>请注意METHODS包含了从Participants'Instruments到Data analysis的部分 | [<br>section[seq1, seq2, ...], <br>section[seq1, seq2, ...], <br>...] |
  | reformat之后的格式 | --- | --- | --- | --- | --- |
  | str | List[str] | list[str] | None | List[str] | List[str]|

* **Final sample statistic after preprocessing**
  | Split | samples | 
  | --- | --- |
  | Train | 24843 | 
  | Val | 1399 |
  | Test | 1431 |


## 2. Dataset Preprocessing

### 2.1 Keyword Table
The summary(abstract) of pubmed dataset have not been split into multiple pieces, but luckily, it has section strutures so we could mannually process it by keyword matching.
> reference: [GovReport](https://arxiv.org/pdf/2104.02112v2.pdf)
> see page 18

In [4]:
segmentation_keyword_table = {
    # Introduction and Literature
    'part_1': ['introduction', 'case', 'objectives', 'purposes', 
               'objective', 'purpose', 'background', 'literature',
               'aim', 'aims'],
    
    # Methods
    'part_2': ['material and methods',
               'materials and methods', 'methods', 'techniques', 'methodology',
               'materials', 'research design', 'study design'],
    
    # Results
    'part_3': ['result', 'results', 'experiments', 'observations'],
    
    # Discussion and Conlusion
    'part_4': ['discussion', 'limitation', 'conclusions', 
               'conclusion', 'concluding', 'comment', 'comments', 
               'summary', 'concluding remarks'],
}

## 2.2 Sections

### 2.2.1 Concat sentences into one setence <br>(每一行是一个完整的sections, 由多个section组成, 其中每个section有多个句子, 需要进行合并)

In [5]:
train_df

Unnamed: 0,article_id,article_text,abstract_text,labels,section_names,sections
0,PMC3872579,[a recent systematic analysis showed that in 2...,[<S> background : the present study was carrie...,,"[INTRODUCTION, MATERIALS AND METHODS, Particip...",[[a recent systematic analysis showed that in ...
1,PMC3770628,[it occurs in more than 50% of patients and ma...,[<S> backgroundanemia in patients with cancer ...,,"[Introduction, Patients and methods, Study des...",[[it occurs in more than 50% of patients and m...
2,PMC5330001,"[tardive dystonia ( td ) , a rarer side effect...",[<S> tardive dystonia ( td ) is a serious side...,,"[INTRODUCTION, CASE REPORT, DISCUSSION, Declar...","[[tardive dystonia ( td ) , a rarer side effec..."
3,PMC4386667,"[lepidoptera include agricultural pests that ,...",[<S> many lepidopteran insects are agricultura...,,"[1. Introduction, 2. Insect Immunity, 3. Signa...",[[lepidoptera include agricultural pests that ...
4,PMC4307954,[syncope is caused by transient diffuse cerebr...,[<S> we present an unusual case of recurrent c...,,"[Introduction, Case report, Discussion, Confli...",[[syncope is caused by transient diffuse cereb...
...,...,...,...,...,...,...
119919,PMC3502213,[eukaryotic cells depend on vesicle - mediated...,[<S> long - distance trafficking of membranous...,,"[Introduction, Motor-Dependent Transport of Ra...",[[eukaryotic cells depend on vesicle - mediate...
119920,PMC3198562,[as regards the selection criteria of the post...,[<S> aims and objectives : to study the stress...,,"[INTRODUCTION, MATERIALS AND METHODS, Modeling...",[[fiber post systems are routinely used in res...
119921,PMC4436536,[in most of the peer review publications in th...,[<S> abstractbackgroundthe objective of this s...,,"[Introduction, Methods, Results, Discussion, L...",[[in most of the peer review publications in t...
119922,PMC4251613,[the reveal registry is a longitudinal registr...,[<S> background : patients with pulmonary arte...,,"[TRIAL REGISTRY:, Materials and Methods, REVEA...","[[], [the reveal registry is a longitudinal re..."


In [6]:
train_df['sections'][0]

[['a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries .',
  'in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively .',
  'the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% .',
  'anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight .',
  'snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states',
  'there are also some reports regarding school feeding programs in developi

In [7]:
def sec_join(df):
    # 每一行
    sample_list = []
    for sample in df['sections']:
        # 每一个sample的每一个section
        sections = []
        for section in sample: 
            temp = " ".join(section)
            sections.append(temp)
        
        sample_list.append(sections)
    
    return sample_list

In [8]:
for dataset in [train_df, val_df, test_df]:
    dataset['sections'] = sec_join(dataset)

In [9]:
train_df.iloc[0]['sections']

["a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states there are also some reports regarding school feeding programs in developing countries . in vietnam 

### 2.2.2 删除空行（包括None）

* 删除labels列, 原数据集里就是全空值

In [10]:
# 首先需要删除labels列, 原数据集里面的labels列是空的
for dataset in [train_df, val_df, test_df]:
    dataset.drop(columns=['labels'], inplace=True)

* 再删除有空值的行,确保没有问题

In [11]:
# 删除有空值的行
for dataset in [train_df, val_df, test_df]:
    dataset.dropna(inplace=True)

* 过滤掉文章本身为空的行

In [12]:
train_df = train_df[train_df['article_text'].apply(lambda x: x != [""])]
val_df = val_df[val_df['article_text'].apply(lambda x: x != [""])]
test_df = test_df[test_df['article_text'].apply(lambda x: x != [""])]

In [13]:
# 似乎只有train数据集里有空的article_text
train_df

Unnamed: 0,article_id,article_text,abstract_text,section_names,sections
0,PMC3872579,[a recent systematic analysis showed that in 2...,[<S> background : the present study was carrie...,"[INTRODUCTION, MATERIALS AND METHODS, Particip...",[a recent systematic analysis showed that in 2...
1,PMC3770628,[it occurs in more than 50% of patients and ma...,[<S> backgroundanemia in patients with cancer ...,"[Introduction, Patients and methods, Study des...",[it occurs in more than 50% of patients and ma...
2,PMC5330001,"[tardive dystonia ( td ) , a rarer side effect...",[<S> tardive dystonia ( td ) is a serious side...,"[INTRODUCTION, CASE REPORT, DISCUSSION, Declar...","[tardive dystonia ( td ) , a rarer side effect..."
3,PMC4386667,"[lepidoptera include agricultural pests that ,...",[<S> many lepidopteran insects are agricultura...,"[1. Introduction, 2. Insect Immunity, 3. Signa...","[lepidoptera include agricultural pests that ,..."
4,PMC4307954,[syncope is caused by transient diffuse cerebr...,[<S> we present an unusual case of recurrent c...,"[Introduction, Case report, Discussion, Confli...",[syncope is caused by transient diffuse cerebr...
...,...,...,...,...,...
119919,PMC3502213,[eukaryotic cells depend on vesicle - mediated...,[<S> long - distance trafficking of membranous...,"[Introduction, Motor-Dependent Transport of Ra...",[eukaryotic cells depend on vesicle - mediated...
119920,PMC3198562,[as regards the selection criteria of the post...,[<S> aims and objectives : to study the stress...,"[INTRODUCTION, MATERIALS AND METHODS, Modeling...",[fiber post systems are routinely used in rest...
119921,PMC4436536,[in most of the peer review publications in th...,[<S> abstractbackgroundthe objective of this s...,"[Introduction, Methods, Results, Discussion, L...",[in most of the peer review publications in th...
119922,PMC4251613,[the reveal registry is a longitudinal registr...,[<S> background : patients with pulmonary arte...,"[TRIAL REGISTRY:, Materials and Methods, REVEA...","[, the reveal registry is a longitudinal regis..."


### 2.2.3 根据关键字列表合并sections

* 首先需要取section_names的每一个句子的小写
* 获取到section_names的中和keyword tabel相匹配的每一个section

In [14]:
train_df['sections'][0]

["a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states there are also some reports regarding school feeding programs in developing countries . in vietnam 

In [15]:
train_df['abstract_text'][0]

['<S> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </S>',
 '<S> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </S>',
 '<S> however , there were no significant changes among boys or total population . </S>',
 '<S> 

In [16]:
import re
from tqdm import tqdm
def keyword_matching_and_re_section(dataset):
    for idx, sample in tqdm(dataset.iterrows()):
        # reset for each sample
        sections = [
            [], # part_1
            [], # part_2
            [], # part_3
            [], # part_4
        ]
        section_names = [[] for _ in range(4)]
        
        for sec_name, sec in zip(sample['section_names'], sample['sections']):
            for id, values in enumerate(segmentation_keyword_table.values()):
                if any(keyword.lower() in sec_name.lower() for keyword in values):
                    sections[id].append(sec)
                    
        for id, value in enumerate(sections):
            sections[id] = " ".join(sections[id])

        dataset['sections'].loc[idx] = sections

In [17]:
for dataset in [train_df, val_df, test_df]:
    keyword_matching_and_re_section(dataset)

0it [00:00, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['sections'].loc[idx] = sections
117112it [02:54, 672.59it/s]
6633it [00:01, 3584.99it/s]
6658it [00:01, 3573.40it/s]


* 如果处理完后的sections仍然不包含完整的四段式结构, 就删除那一行

In [18]:
train_df = train_df[train_df['sections'].apply(lambda x: all(s != '' for s in x))]
val_df = val_df[val_df['sections'].apply(lambda x: all(s != '' for s in x))]
test_df = test_df[test_df['sections'].apply(lambda x: all(s != '' for s in x))]

In [19]:
train_df.iloc[51235]['sections']

["aptamers are short single - stranded oligomers made up of dna , rna , or peptides that are capable of binding a target ligand ( proteins , small molecules , or even living cells ) with high affinity . they are also known as artificial antibodies because in addition to binding with high affinity , they also bind with high specificity . aptamers have several advantages over antibodies , including ease and low cost of production which does not involve animals . aptamers are less immunogenic than antibodies and are already being used as therapeutic agents in humans . nucleic acid aptamers , unlike antibodies , can be selected for and used under nonphysiological conditions , such as high - salt conditions and varying ph . also , nucleic acid aptamers are able to undergo specific conformational changes that antibodies can not . for example , nucleic acid aptamer binding can be  turned off  by the addition of the complementary strand . additionally , nucleic acid aptamers can undergo a conf

* 处理之后分别剩下了51552/2860/2984个样本

* 暂时进行一次保存

In [23]:
train_df.to_json(f'../datasets/pubmed-dataset-processed/train.json', orient='records', lines=True) # orient
val_df.to_json(f'../datasets/pubmed-dataset-processed/val.json', orient='records', lines=True)
test_df.to_json(f'../datasets/pubmed-dataset-processed/test.json', orient='records', lines=True)

In [24]:
index = 457

In [25]:
train_df.iloc[index]['section_names']

['Introduction',
 'Materials and methods',
 'Participants',
 'Missing data',
 'Statistical analyses',
 'Results',
 'Internal consistency of the MASI',
 'Confirmatory factor analysis of the MASI',
 'Item reduction procedure',
 'Rasch analysis of the MASI-R',
 'Discussion',
 'Conclusion',
 'Supplementary material']

In [26]:
train_df.iloc[index]['sections']

['it is well known that job stress is responsible for poor work performance , high absenteeism , less work productivity , and several diseases.15 recently , results from thirteen independent cohort studies in europe indicated that job strain is responsible for coronary heart disease , as are lifestyle and orthodox risk factors.6 the population s attributable risk for job strain was 3.4% , which is lower than that for smoking habits ( 36% ) , abdominal obesity ( 20% ) , and physical inactivity ( 12% ) . however , the interheart study7 showed that work stress doubled the risk of coronary heart disease . in finland , the increase in job strain was associated with an increase in the risk of requiring a disability pension . moreover , as regarding men , a positive association between cardiovascular diseases and an increased risk of disability pension was found.8 mental health caused disabling conditions in 9% of the population in the uk , and estimations for the 20012002 prevalence of self 

In [27]:
train_df.iloc[index]['abstract_text']

['<S> introduction and objectivesa multidimensional self - report questionnaire to evaluate job - related stress factors is presented . the questionnaire , </S>',
 '<S> called maugeri stress index  reduced form ( masi - r ) , aims to assess the impact of job strain on a team or on a single worker by considering four domains : wellness , resilience , perception of social support , and reactions to stressful situations.material and methodsthe reliability of a first longer version ( 47 items ) of the questionnaire was evaluated by an internal consistency analysis and a confirmatory factor analysis . </S>',
 '<S> an item reduction procedure was implemented to obtain a short form of the instrument , and the psychometric properties of the resulting instrument were evaluated using the rasch measurement model.resultsa total of 14 items from the initial pool were deleted because they were not productive for measurement . </S>',
 '<S> the analysis of internal consistency led to the exclusion of 

### 2.2.4 对abstract进行分割
* 特别注意：abstract有一部分文本需要再进行分词操作！！！！！！！

* 加载数据集

In [9]:
train_df = pd.read_json('../datasets/pubmed-dataset-processed/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-processed/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-processed/test.json', lines=True)

1. 摘要部分由多个句子组成，每个句子开头有S，结束时有/S，需要删除，因为bart会自动添加启示和终止符.
2. 原数据集的句子切分方式不明，所以需要重新划分，先将所有句子合并

In [10]:
test_sample = train_df.iloc[0]['abstract_text']
test_sample

['<S> background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . </S>',
 '<S> the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . </S>',
 '<S> however , there were no significant changes among boys or total population . </S>',
 '<S> 

In [11]:
def remove_tag(dataset):
    for idx, sample in dataset.iterrows():
        # reset for each sample
        abstract_text = " ".join([text.replace('<S>', '').replace('</S>', '').strip() for text in sample['abstract_text']])
        dataset['abstract_text'].loc[idx] = abstract_text

In [12]:
for dataset in [train_df, val_df, test_df]:
    remove_tag(dataset)

In [13]:
train_df.iloc[457]['abstract_text']

'introduction and objectivesa multidimensional self - report questionnaire to evaluate job - related stress factors is presented . the questionnaire , called maugeri stress index  reduced form ( masi - r ) , aims to assess the impact of job strain on a team or on a single worker by considering four domains : wellness , resilience , perception of social support , and reactions to stressful situations.material and methodsthe reliability of a first longer version ( 47 items ) of the questionnaire was evaluated by an internal consistency analysis and a confirmatory factor analysis . an item reduction procedure was implemented to obtain a short form of the instrument , and the psychometric properties of the resulting instrument were evaluated using the rasch measurement model.resultsa total of 14 items from the initial pool were deleted because they were not productive for measurement . the analysis of internal consistency led to the exclusion of eight items , while the analysis performed u

3. 将所有句子以句号为分隔符进行切分
---
* 注意：这个分割是只要遇见"."这个符号就分割，意味着37.8这样的数学标记也会被分割成37.和8
  * 因为`reported.conclusionintravenous`这样的情况存在，没办法直接用nltk, 先用.进行正则表达式的分割，再用nltk.word_tokenize分割(因为`wordninja`会删除标点符号)
  * 使用`wordninja`这个包，会对粘连在一起的词进行切分

In [14]:
import nltk
import re
import wordninja
from tqdm import tqdm
nltk.download("punkt", quiet=True)

def sentences_split(dataset):
    for idx, sample in tqdm(dataset.iterrows()):
        # reset for each sample
        # 以.作为分割符，但是保留.
        abstract_text = re.split(r'(?<=\.)', sample['abstract_text'])
        # 移除列表中的空字符串
        abstract_text = [text.strip() for text in abstract_text if text]
        
        sample = []
        for sent in abstract_text:
            words = nltk.word_tokenize(sent)
            
            # 对每个单词应用wordninja，保留原始的标点符号
            split_words = [wordninja.split(word) if word.isalpha() else [word] for word in words]
            
            # 将嵌套的列表展平
            flat_split_words = [item for sublist in split_words for item in sublist]

            # 检查是否为标点符号，如果是就不用空格做连接
            sent = '' # .join(flat_split_words)
            for i, word in enumerate(flat_split_words):
                if i == 0 or i == len(flat_split_words) - 1 or re.match(r'^\W+$', word):
                    sent += word
                else:
                    sent += f' {word}'
            sample.append(sent)
            
        dataset.at[idx, 'abstract_text'] = sample

In [15]:
for dataset in [train_df, val_df, test_df]:
    sentences_split(dataset)

51552it [08:53, 96.69it/s] 
2860it [00:29, 96.01it/s] 
2984it [00:30, 96.32it/s] 


In [16]:
test_df.iloc[0]['abstract_text']

["research on the implications of anxiety in parkinson 's disease( pd) has been neglected despite its prevalence in nearly 50% of patients and its negative impact on quality of life.",
 'previous reports have noted that neuro psychiatric symptoms impair cognitive performance in pd patients; however, to date, no study has directly compared pd patients with and without anxiety to examine the impact of anxiety on cognitive impairments in pd.',
 'this study compared cognitive performance across 50 pd participants with and without anxiety( 17 pda+; 33 pda), who underwent neurological and neuro psychological assessment.',
 'group performance was compared across the following cognitive domains: simple attention/ vi suo motor processing speed, executive function( e.',
 'g.',
 ', set- shifting), working memory, language, and memory/ new verbal learning.',
 'results showed that pda+ performed significantly worse on the digit span forward and backward test and part b of the trail making task( t m

4. 对每一个句子进行关键字表匹配，如果开头出现的单词在匹配表里，就将整段划分到对于的part

In [17]:
train_df.to_json(f'../datasets/pubmed-dataset-processed-abstract/train.json', orient='records', lines=True) # orient
val_df.to_json(f'../datasets/pubmed-dataset-processed-abstract/val.json', orient='records', lines=True)
test_df.to_json(f'../datasets/pubmed-dataset-processed-abstract/test.json', orient='records', lines=True)

In [18]:
import pandas as pd
train_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/train.json', lines=True)
val_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/val.json', lines=True)
test_df = pd.read_json('../datasets/pubmed-dataset-processed-abstract/test.json', lines=True)

In [19]:
train_df.iloc[0]['abstract_text']

['background: the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school- aged children in shiraz, iran.',
 'materials and methods: this case- control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls( 7- 13 years old) based on advocacy approach in shiraz, iran.',
 'the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention.',
 'for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post- intervention were statistically compared.',
 'results: the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls( p= 0.',
 '02).',
 'however, there were no significant changes among boys or total population.',
 'the mean of all an thro po me

In [20]:
segmentation_keyword_table = {
    # Introduction and Literature
    'part_1': ['introduction', 'case', 'objectives', 'purposes', 
               'objective', 'purpose', 'background', 'literature',
               'aim', 'aims'],
    
    # Methods
    'part_2': ['material and methods',
               'materials and methods', 'methods', 'techniques', 'methodology',
               'materials', 'research design', 'study design'],
    
    # Results
    'part_3': ['result', 'results', 'experiments', 'observations'],
    
    # Discussion and Conlusion
    'part_4': ['discussion', 'limitation', 'conclusions', 
               'conclusion', 'concluding', 'comment', 'comments', 
               'summary', 'concluding remarks'],
}

In [22]:
import re
from tqdm import tqdm

def keyword_matching_and_re_abstract(dataset):
    for idx, sample in tqdm(dataset.iterrows()):
        # Reset for each sample
        abstract_parts = [[] for _ in range(4)]
        current_part = 0  # Initialize to part_1

        for abs in sample['abstract_text']:
            # Splitting the abstract_text into 4 parts
            if abs != "":
                match = re.search(r'\b(\S+)\b', abs)
                if match:
                    first_word = match.group(1).lower()
                else:
                    # 处理没有匹配到单词的情况
                    # print(f'No word matched in {abs}')
                    first_word = ""
            else:
                # 处理空字符串的情况
                # print('Empty abstract_text')
                first_word = ""

            # Check if the current sentence contains the keyword for the next part
            for id, values in enumerate(segmentation_keyword_table.values()):
                if any(keyword.lower() in first_word for keyword in values):
                    # Move to the next part
                    current_part = id

            # Append the current sentence to the corresponding part
            abstract_parts[current_part].append(abs)

        # Joining the parts
        for id, value in enumerate(abstract_parts):
            abstract_parts[id] = " ".join(abstract_parts[id])

        dataset['abstract_text'].loc[idx] = abstract_parts
        # print(f'{dataset["abstract_text"].iloc[idx]=}')

In [23]:
for dataset in [train_df, val_df, test_df]:
    keyword_matching_and_re_abstract(dataset)

51552it [00:29, 1724.63it/s]
2860it [00:00, 2908.47it/s]
2984it [00:01, 2912.18it/s]


* 删除不符合4段式结构的sample

In [28]:
train_df = train_df[train_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]
val_df = val_df[val_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]
test_df = test_df[test_df['abstract_text'].apply(lambda x: all(s != '' for s in x))]

* 处理后分别剩下 24843/ 1399/ 1431条样本

In [32]:
train_df.iloc[5567]['abstract_text']

['introduction tumors lack normal drainage of secreted fluids and consequently build up tumor interstitial fluid( tif). unlike other bodily fluids, tif likely contains a high proportion of tumor- specific proteins with potential as biomarkers.',
 'methods here, we evaluated a novel technique using a unique ultra filtration catheter for in situ collection of tif and used it to generate the first catalog of tif proteins from a head and neck s quam o us cell carcinoma( hn s cc). to maximize pro teo mic coverage, tif was immuno depleted for high abundance proteins and digested with try ps in, and peptides were fraction a ted in three dimensions prior to mass spectrometry.',
 'results we identified 525 proteins with high confidence. the hn s cc tif pro teo me was distinct compared to pro teo mes of other bodily fluids. it contained a relatively high proportion of proteins annotated by gene ontology as extracellular compared to other secreted fluid and cellular pro teo mes, indicating minima

# 3. Save Processed Dataset

In [34]:
train_df.to_json(f'../datasets/pubmed-dataset-processed-final/train.json', orient='records', lines=True)
val_df.to_json(f'../datasets/pubmed-dataset-processed-final/val.json', orient='records', lines=True)
test_df.to_json(f'../datasets/pubmed-dataset-processed-final/test.json', orient='records', lines=True)

In [35]:
import pandas as pd
processed_train_df = pd.read_json('../datasets/pubmed-dataset-processed-final/train.json', lines=True)
processed_val_df = pd.read_json('../datasets/pubmed-dataset-processed-final/val.json', lines=True)
processed_test_df = pd.read_json('../datasets/pubmed-dataset-processed-final/test.json', lines=True)

In [36]:
processed_train_df.iloc[5567]['article_id']

'PMC2937136'

In [37]:
processed_train_df.iloc[5567]['sections']

['squamous cell carcinoma ( scc ) is a common type of cancer originating in epithelial cells . survival rates of sccs of the upper respiratory tract 5  years post - diagnosis are about 50% . however , early detection and treatment results in substantially better prognosis . currently , the most accurate diagnostic test for all sccs is a histological examination of a tissue biopsy . a less invasive test that still maintains sensitivity and selectivity , from a readily available bodily fluid like plasma or saliva , would be greatly beneficial . one potential test would be based on specific differential protein expression , or protein biomarkers of scc . a promising source of possible biomarkers of scc is tumor interstitial fluid ( tif ) . the production of tif results from buildup of fluid derived from cell secretions due to a lack of a fully formed vascular and lymphatic system within the tumor . because this fluid is directly associated with the tumor , it may be a rich source of tumor

In [38]:
processed_train_df.iloc[5667]['abstract_text']

['aims: to investigate the effect of the storage period on the accuracy of recently developed el as tom eric materials.',
 'methods: simultaneous impressions of a steel die were taken using a poly ether( i: imp re gum soft heavy and light body, 3 m esp e) and vinyl poly silo x a ne( p: perfect im blue velvet and flex i- velvet, j. morita). the trays were loaded with the heavy- bodied impression materials while the light- bodied impression materials were simultaneously spread on the steel die. the impressions were poured after 2 hours, 24 hours, and 7 days. impressions were stored at approximately 55% relative humidity and room temperature. ten replicas were produced for each experimental condition( n=60). accuracy of the stone dies was assessed with a depth- measuring microscope. the difference in height between the surface of the stone die and a standard metallic ring was recorded in micrometers at four demarcated points, by two independent examiners. dx at a were submitted to two- wa

In [18]:
train_df = pd.read_json('../datasets/pubmed-dataset-copy/train.json', lines=True)

In [38]:
# 选择'PMC4697239'这篇文章
sample = train_df[train_df['article_id'] == 'PMC4697239']
sample['abstract_text'].to_list()

[['<S> objectives : iron and multivitamin drops are being frequently prescribed in children less than 2 years of age . due to their low ph levels , these drops may lead to the softening of enamel and accelerate the destructive process . the aim of </S>',
  '<S> the present study was to investigate the enamel microhardness of primary teeth after exposing them to iron and multivitamin drops.materials and methods : forty healthy anterior teeth were randomly divided into four groups of 10 samples each . </S>',
  '<S> samples were exposed to two iron drops of kharazmi ( iran ) and ironorm ( uk ) and two multivitamin drops of shahdarou ( iran ) and eurovit ( germany ) for 5 min . </S>',
  '<S> the surface microhardness was measured before and after exposure and data processing was done using statistical paired t - test and analysis of variance ( anova ) test . </S>',
  '<S> the surface structure of the teeth was examined by scanning electron microscope ( sem).results : in all groups , microh

# 4.Dataset Statistic

In [39]:
import nltk
from tqdm import tqdm
nltk.download("punkt", quiet=True)

True

* Avg token size of Sections

In [40]:
# calculate the avg token size of dataset['sections']:
def avg_token_size(dataset):
    token_size = 0
    for idx, sample in tqdm(dataset.iterrows()):
        for sec in sample['sections']:
            token_size += len(nltk.word_tokenize(sec))
    return token_size / len(dataset)

In [41]:
for dataset in [processed_train_df, processed_val_df, processed_test_df]:
    print(avg_token_size(dataset))

24843it [09:25, 43.89it/s]


2736.260194018436


1399it [00:33, 42.28it/s]


2751.263759828449


1431it [00:33, 43.26it/s]

2729.0041928721175





| avg_token_size | sections |
| --- | --- |
| train | 2740 |
| eval | 2752 |
| test | 2732 |

* Avg token size of Abstract

In [39]:
# calculate the avg token size of dataset['abstract_text']:
def avg_token_size_abstract(dataset):
    token_size = 0
    for idx, sample in tqdm(dataset.iterrows()):
        for abs in sample['abstract_text']:
            token_size += len(nltk.word_tokenize(abs))
    return token_size / len(dataset)

In [40]:
for dataset in [processed_train_df, processed_val_df, processed_test_df]:
    print(avg_token_size_abstract(dataset))

0it [00:00, ?it/s]

24793it [00:43, 571.31it/s]


298.68567740894605


1398it [00:02, 563.99it/s]


300.18741058655223


1429it [00:02, 557.62it/s]

302.94891532540237





| avg_token_size | abstract_text |
| --- | --- |
| train | 299 |
| eval | 300 |
| test | 303 |