# Troubleshooting the single-token warning

The dygiepp modeling code is not designed to handle single-token sentences, and warns the user when it encounters one. At first this only happened on a few documents, but now I'm seeing it on almost every document, with the warning:

```
UserWarning: Document PMID31030265_abstract_noPunct has a sentence with a single token or no tokens. This may break the modeling code.
```

In this notebook I'm going to explore what these sentences are, because they should be rare, and are not.

In [13]:
import jsonlines
from collections import defaultdict
import random

In [5]:
data_path = '../data/first_manuscript_data/dygiepp/prepped_data/noPunct_dygiepp_formatted_data_None.jsonl'

In [6]:
results = []
with jsonlines.open(data_path) as reader:
    for obj in reader:
        results.append(obj)

In [15]:
docs_with_zerolen_sents = defaultdict(list)  # Key is doc_key, value is the list of sentences with 0 or 1 tokens
for doc in results:
    for sentence in doc["sentences"]:
        if len(sentence) <= 1:
            docs_with_zerolen_sents[doc["doc_key"]].append(sentence)

In [16]:
len(docs_with_zerolen_sents)

404

There are 404 (lol) docs with sentences that have one or zero tokens. Let's look at examples of these:

In [17]:
# random.sample needs a sequence; passing a dict view raises TypeError on Python 3.11+
five_random_docs = random.sample(list(docs_with_zerolen_sents.items()), 5)

In [24]:
for doc_key, sentences in five_random_docs:
    print(doc_key, sentences)

PMID25603890_abstract_noPunct [['MATERIALMETHODS'], ['RESULTS'], ['CONCLUSION']]
PMID24898702_abstract_noPunct [['RESULTS']]
PMID29491632_abstract_noPunct [['OBJECTIVE'], ['CONCLUSION'], ['SUMMARY']]
PMID32034777_abstract_noPunct [['CONCLUSION']]
PMID26667233_abstract_noPunct [['CONCLUSIONS']]


OK, so that's hopeful: they're probably all like that. Let's sample a few more to see:

In [21]:
# random.sample needs a sequence; passing a dict view raises TypeError on Python 3.11+
twenty_random_docs = random.sample(list(docs_with_zerolen_sents.items()), 20)

In [23]:
for doc_key, sentences in twenty_random_docs:
    print(doc_key, sentences)

PMID31906847_abstract_noPunct [['RESULTS'], ['CONCLUSIONS']]
PMID12223724_abstract_noPunct [['cv']]
PMID23705659_abstract_noPunct [['CONCLUSIONS']]
PMID23528052_abstract_noPunct [['·'], ['·']]
PMID32164536_abstract_noPunct [['CONCLUSION']]
PMID21439092_abstract_noPunct [['RESULTS']]
PMID31557246_abstract_noPunct [['PURPOSE'], ['METHODS']]
PMID24240305_abstract_noPunct [['Dwarf5']]
PMID25903559_abstract_noPunct [['CONCLUSION']]
PMID18047439_abstract_noPunct [['DESIGN'], ['RESULTS'], ['CONCLUSIONS']]
PMID24947472_abstract_noPunct [['bud'], ['RESULTS']]
PMID29619087_abstract_noPunct [['CONCLUSIONS']]
PMID31747891_abstract_noPunct [['RESULTS'], ['CONCLUSIONS']]
PMID21073469_abstract_noPunct [['•']]
PMID23343093_abstract_noPunct [['RESULTS']]
PMID29304736_abstract_noPunct [['RESULTS']]
PMID25435617_abstract_noPunct [['RESULTS']]
PMID29801464_abstract_noPunct [['RESULTS']]
PMID20704658_abstract_noPunct [['•']]
PMID32034777_abstract_noPunct [['CONCLUSION']]

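Random samples only show a slice of the 404 docs. A full tally of the single tokens would show how many are section headers and how many are something else. Here is a minimal sketch of that count, run on toy documents in the same shape as the loaded data; the function name `count_short_sentence_tokens` and the `<EMPTY>` placeholder are my own, not part of dygiepp:

```python
from collections import Counter


def count_short_sentence_tokens(docs):
    """Tally the token found in every sentence with at most one token."""
    counts = Counter()
    for doc in docs:
        for sentence in doc["sentences"]:
            if len(sentence) <= 1:
                # Empty sentences are tallied under a placeholder key
                counts[sentence[0] if sentence else "<EMPTY>"] += 1
    return counts


# Toy documents (hypothetical), shaped like the entries in `results`
toy_docs = [
    {"doc_key": "a", "sentences": [["RESULTS"], ["Growth", "was", "reduced"]]},
    {"doc_key": "b", "sentences": [["RESULTS"], ["•"], []]},
]
token_counts = count_short_sentence_tokens(toy_docs)
print(token_counts.most_common(3))
```

In the notebook, `count_short_sentence_tokens(results).most_common(20)` would list the most frequent offenders across the whole dataset.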

Some of these are weird. Let's look at the whole abstracts for those:

#### PMID12223724_abstract_noPunct

The effect of isoprenoid growth regulators on avocado Persea americana Mill. cv Hass fruit growth and mesocarp 3hydroxy3methylglutaryl coenzyme A reductase HMGR activity was investigated during the course of fruit ontogeny. Both normal and smallfruit phenotypes were used to probe the interaction between the end products of isoprenoid biosynthesis and the activity of HMGR in the metabolic control of avocado fruit growth. Kinetic analysis of the changes in both cell number and size revealed that growth was limited by cell number in phenotypically small fruit. In small fruit a 70 reduction in microsomal HMGR activity was associated with an increased mesocarp abscisic acid ABA concentration. Application of mevastatin a competitive inhibitor of HMGR reduced the growth of normal fruit and increased mesocarp ABA concentration. These effects were reversed by cotreatment of fruit with mevalonic acid lactone isopentenyladenine or N2chloro4pyridylNphenylurea but were not significantly affected by either gibberellic acid or stigmasterol. However stigmasterol appeared to partially restore fruit growth when coinjected with mevastatin in either phase II or III of fruit growth. In vivo application of ABA reduced fruit growth and mesocarp HMGR activity and accelerated fruit abscission effects that were reversed by cotreatment with isopentenyladenine. Together these observations indicate that ABA accumulation downregulates mesocarp HMGR activity and fruit growth and that in situ cytokinin biosynthesis modulates these effects during phase I of fruit ontogeny whereas both cytokinins and sterols seem to perform this function during the later phases.


The literature gives the variety name for the avocado in question as *Persea americana* Mill. cv. “Hass”. The second period is missing here, so it's unclear to me why `cv` was tokenized as its own sentence, but in general abbreviations like this are a clear source of error.

#### PMID24240305_abstract_noPunct

Highresolution growth measurements were conducted using a linear variable displacement transformer in conjunction with a temperatureprogrammed meristemcooling collar. Chilling and rewarming profiles were determined for a range of Gramineae in the presence and absence of varying concentrations of gibberellic acid GA3. In wheat Triticum aestivum L. seedlings the growthconstraining temperature Pe was progressively lowered by increasing GA3 concentration with a difference of4.8°C between controls and material treated with 104 M GA3. Dwarf5 maize Zea mays L. seedlings had a higher Pe than tall segregates and the difference was markedly reduced by exposure to a saturating concentration of GA3. A similar effect was observed with Tanginbozu dwarf rice Oryza sativa L.. The growth ratetemperature responses of Rht3 gibberellininsensitive dwarf wheat seedlings were unaffected by GA3 and the Pe values for these segregates were around 5° C higher than for normals. Slender s1 barley Hordeum vulgare L. genotypes had Pe values of7° C compared with 4° C for wildtype material and did not show positive hysteresis for growth rate during the rewarming phase. These studies indicate that GA3 modifies the thermal sensitivity of meristem function in Gramineae in a manner which enhances lowtemperature growth.


It's also unclear to me why `Dwarf5` was tokenized as its own sentence here.

#### PMID20704658_abstract_noPunct

• Two cDNAs encoding allene oxide cyclases PpAOC1 PpAOC2 key enzymes in the formation of jasmonic acid JA and its precursor 9S13S12oxophytodienoic acid cisOPDA were isolated from the moss Physcomitrella patens. • Recombinant PpAOC1 and PpAOC2 show substrate specificity against the allene oxide derived from 13hydroperoxy linolenic acid 13HPOTE PpAOC2 also shows substrate specificity against the allene oxide derived from 12hydroperoxy arachidonic acid 12HPETE. • In protonema and gametophores the occurrence of cisOPDA but neither JA nor the isoleucine conjugate of JA nor that of cisOPDA was detected. • Targeted knockout mutants for PpAOC1 and for PpAOC2 were generated while double mutants could not be obtained. The ΔPpAOC1 and ΔPpAOC2 mutants showed reduced fertility aberrant sporophyte morphology and interrupted sporogenesis.


Bullet points *facepalm*

## Conclusion

In most cases, the single-token sentences could be removed without consequence in a second pre-processing step, run after the dygiepp formatting script, since most of them are section headers from structured abstracts. That step would also catch the occasional mis-tokenization, though, and drop real content. So if the modeling code works normally on all the other sentences in the same doc, I'd almost be inclined to leave them in, and based on the wording in the documentation I believe that is the case.
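If the removal route is taken, the second pre-processing step could look something like the sketch below. It assumes the docs are unlabeled, with only `doc_key`, `sentences` and similar fields. Docs with token-indexed annotations would need their offsets shifted, which this does not attempt. The function name `drop_short_sentences` and the set of annotation field names it checks are my assumptions:

```python
def drop_short_sentences(doc, min_tokens=2):
    """Return a copy of doc without sentences shorter than min_tokens.

    Assumes the doc carries no token-indexed annotations; those would
    need their offsets shifted, which this sketch does not do.
    """
    # Assumed names of annotation fields that index into the token stream
    annotation_keys = {"ner", "relations", "clusters", "events"}
    present = annotation_keys & doc.keys()
    if present:
        raise ValueError(
            f"{doc['doc_key']} has annotations {sorted(present)}; "
            "token offsets would need shifting"
        )
    cleaned = dict(doc)  # shallow copy so the input doc is left untouched
    cleaned["sentences"] = [s for s in doc["sentences"] if len(s) >= min_tokens]
    return cleaned


# Toy document (hypothetical), shaped like the entries in `results`
toy_doc = {
    "doc_key": "toy_abstract",
    "sentences": [["RESULTS"], ["Growth", "was", "reduced"], [], ["•"],
                  ["ABA", "accumulated"]],
}
cleaned_doc = drop_short_sentences(toy_doc)
print(cleaned_doc["sentences"])
```

Note that this would also drop mis-tokenized fragments such as `cv` or `Dwarf5`, which is the trade-off described above.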