<a href="https://www.nvidia.com/dli"> <img src="../images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 1.0 Explore the Data

In this notebook, you'll explore the datasets and annotations for the projects in this section of the course.  The purpose of taking a close look at the data is to provide a clear picture of the inputs for the models, as well as provide insight into how you might structure your own datasets for future projects.

**[1.1 Corpus Annotated Data](#1.1-Corpus-Annotated-Data)<br>**
**[1.2 Text Classification Dataset](#1.2-Text-Classification-Dataset)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[1.2.1 Exercise: Explore the Test Set](#1.2.1-Exercise:-Explore-the-Test-Set)<br>
**[1.3 NER Dataset](#1.3-NER-Dataset)<br>**

# 1.1 Corpus Annotated Data

The [NCBI-disease corpus](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) is a set of 793 PubMed abstracts, annotated by 14 annotators. The annotations take the form of HTML-style tags inserted into the abstract text according to clearly defined rules.  The annotations identify named diseases, and can be used to fine-tune a language model to identify disease mentions in future abstracts, *whether those diseases were part of the original training set or not*.  

Here's an example of what an annotated abstract from the corpus looks like:

```html
10021369	Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor .	The <category="Modifier">adenomatous polyposis coli ( APC ) tumour</category>-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In <category="Modifier">colon carcinoma</category> cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - <category="Modifier">colon carcinoma</category> cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and <category="SpecificDisease">cancer</category> .
```

In this example, we see the following tags within the abstract:
```html
<category="Modifier">adenomatous polyposis coli tumour</category>
<category="Modifier">adenomatous polyposis coli ( APC ) tumour</category>
<category="Modifier">colon carcinoma</category>
<category="Modifier">colon carcinoma</category>
<category="SpecificDisease">cancer</category>
```
For our purposes, we will consider any identified category (such as "Modifier", "SpecificDisease", and a few others) to generally be a "disease".  If you want to see more examples, you can explore the text of the corpus using the file browser to the left, or open files directly: 

* [data/NCBI_corpus/NCBI_corpus_training.txt](data/NCBI_corpus/NCBI_corpus_training.txt)
* [data/NCBI_corpus/NCBI_corpus_testing.txt](data/NCBI_corpus/NCBI_corpus_testing.txt)
* [data/NCBI_corpus/NCBI_corpus_development.txt](data/NCBI_corpus/NCBI_corpus_development.txt)
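
If you would rather pull the tags out programmatically than read them by eye, a regular expression is enough for a first look. The following is a minimal sketch, not part of the course code: `TAG_PATTERN` and `extract_mentions` are hypothetical names, and the sample string is a shortened excerpt of the annotated abstract shown earlier.

```python
import re
from collections import Counter

# Matches the corpus's HTML-style annotation: <category="Label">mention</category>
TAG_PATTERN = re.compile(r'<category="([^"]+)">(.*?)</category>')

def extract_mentions(annotated_text):
    """Return a list of (category, mention) pairs found in annotated text."""
    return TAG_PATTERN.findall(annotated_text)

# Title and final sentence of the example abstract (shortened for illustration).
sample = (
    'Identification of APC2, a homologue of the '
    '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor . '
    'APC and APC2 may therefore have comparable functions in development and '
    '<category="SpecificDisease">cancer</category> .'
)

mentions = extract_mentions(sample)
category_counts = Counter(category for category, _ in mentions)
print(mentions)
print(category_counts)
```

The same function could be applied line by line to one of the corpus files listed above to tally how often each category appears.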

We have two datasets derived from this corpus:  a text classification dataset and a named entity recognition (NER) dataset.  The text classification dataset assigns each abstract to one of three broad disease groupings.  We'll use this simple split to demonstrate the NLP text classification task.   The NER dataset labels individual words as diseases.  This dataset will be used for the NLP NER task.  

# 1.2 Text Classification Dataset

The text classification task seeks to categorize text according to its content.  Examples of applications for text classification include sentiment analysis (two classes) and topic labeling (multiple classes).  To understand what kind of dataset we need, we first need to decide what question we want to ask.

### Sentiment Analysis
For example, if we are analyzing movie reviews, our question might be:<br>
**Given a movie-review sentence, is the sentiment positive or negative?**<br>
In such an analysis, we need to look at sentences, and we only have two classes: "positive" and "negative".  Each sentence in the training set must be labeled as one or the other. Sentiment analysis is widely used by businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback.

### Multi-Class Analysis
For our project, we'll ask a different question:<br>
**Given a medical disease abstract, is the abstract about cancer, a neurological disorder, or something else?**<br>
For our use case, we are looking at entire abstracts, not just sentences, and we have identified three classes: "cancer", "neurological", and "other".  As a naive approach for the purposes of this lab, each abstract is labeled according to which of these three categories its annotated diseases fall into.  The data is stored in `.tsv` format.  This is similar to the common `.csv` comma-delimited format, but uses tabs to delimit columns instead.  Execute the following cell to see a list of `.tsv` files for the 3-class datasets for text classification.

In [1]:
TC_DATA_DIR = '/dli/task/data/NCBI_tc-3/'
!ls -lh $TC_DATA_DIR

total 1.4M
-rw-r--r-- 1 702112 10513 133K Jul 21  2020 dev.tsv
-rw-r--r-- 1 702112 10513  15K Jul 21  2020 test.tsv
-rw-r--r-- 1 702112 10513 1.2M Jul 21  2020 train.tsv
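
To make the tab-delimited layout concrete, here is a small sketch using Python's standard `csv` module. The two-row `sample_tsv` string is hypothetical and only mimics a `sentence`/`label` layout; it is not taken from the files listed above.

```python
import csv
import io

# Hypothetical two-row sample in a tab-delimited layout (not real corpus data).
sample_tsv = "sentence\tlabel\nfirst abstract , with a comma\t0\nsecond abstract\t2\n"

# delimiter="\t" tells the reader to split columns on tabs instead of commas.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)
for row in rows:
    print(row["label"], "|", row["sentence"])
```

Because columns are separated by tabs, the commas that appear throughout the abstract text do not need any special handling.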


In JupyterLab, you can explore the files and data using the file explorer at the left.  For the notebooks, we'll use [_pandas_](https://pandas.pydata.org/docs/user_guide/index.html) to import and view the data, which will be a useful way to import the data for the models.  

We can import the data into a _pandas_ DataFrame object using the `pd.read_csv()` function, specifying the tab as a delimiter.  The `.head()` method displays the first 5 rows of data.  Each row includes the raw abstract text and a label.  The labels for the three categories of "cancer", "neurological", and "other" are the values 0, 1, and 2 respectively.

In [2]:
import pandas as pd
# Show full column contents; None replaces the deprecated -1 (pandas 1.0 and later)
pd.options.display.max_colwidth = None

In [3]:
train_df = pd.read_csv(TC_DATA_DIR + 'train.tsv', sep='\t')
train_df.head()

Unnamed: 0,sentence,label
0,"Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor . The adenomatous polyposis coli ( APC ) tumour-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and cancer .",0
1,"A common MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer . The frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . The risk of colorectal cancer was significantly higher ( p < 0 . 01 ) in males than females ( 0 . 
63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers .",0
2,"Age of onset in Huntington disease : sex specific influence of apolipoprotein E genotype and normal CAG repeat length . Age of onset ( AO ) of Huntington disease ( HD) is known to be correlated with the length of an expanded CAG repeat in the HD gene . Apolipoprotein E ( APOE ) genotype , in turn , is known to influence AO in Alzheimer disease , rendering the APOE gene a likely candidate to affect AO in other neurological diseases too . We therefore determined APOE genotype and normal CAG repeat length in the HD gene for 138 HD patients who were previously analysed with respect to CAG repeat length . Genotyping for APOE was performed blind to clinical information . In addition to highlighting the effect of the normal repeat length upon AO in maternally inherited HD and in male patients , we show that the APOE epsilon2epsilon3 genotype is associated with significantly earlier AO in males than in females . Such a sex difference in AO was not apparent for any of the other APOE genotypes . Our findings suggest that subtle differences in the course of the neurodegeneration in HD may allow interacting genes to exert gender specific effects upon AO.",1
3,"Familial deficiency of the seventh component of complement associated with recurrent bacteremic infections due to Neisseria . The serum of a 29-year old woman with a recent episode of disseminated gonococcal infection and a history of meningococcal meningitis and arthritis as a child was found to lack serum hemolytic complement activity . The seventh component of complement ( C7 ) was not detected by functional or immunochemical assays , whereas other components were normal by hemolytic and immunochemical assessment . Her fresh serum lacked complement-mediated bactericidal activity against Neisseria gonorrhoeae , but the addition of fresh normal serum or purified C7 restored bactericidal activity as well as hemolytic activity . The absence of functional C7 activity could not be accounted for on the basis of an inhibitor . Opsonization and generation of chemotactic activity functioned normally . Complete absence of C7 was also found in one sibling who had the clinical syndrome of meningococcal meningitis and arthritis as a child and in this siblings clinically well eight-year-old son . HLA histocompatibility typing of the family members did not demonstrate evidence for genetic linkage of C7 deficiency with the major histocompatibility loci . This report represents the first cases of C7 deficiency associated with infectious complications and suggests that bactericidal activity may be important in host defense against bacteremic neisseria infections .",2
4,"Increased incidence of cancer in patients with cartilage-hair hypoplasia . OBJECTIVE Previous reports have suggested an increased risk of cancer among patients with cartilage-hair hypoplasia (CHH) . This study was carried out to further evaluate this risk among patients with CHH and their first-degree relatives . STUDY DESIGN One hundred twenty-two patients with CHH were identified through 2 countrywide epidemiologic surveys in 1974 and in 1986 . Their parents and nonaffected siblings were identified through the Population Register Center . This cohort underwent follow-up for cancer incidence through the Finnish Cancer Registry to the end of 1995 . RESULTS A statistically significant excess risk of cancer was seen among the patients with CHH ( standardized incidence ratio 6 . 9 , 95 % confidence interval 2 . 3 to 16 ) , which was mainly attributable to non-Hodgkins lymphoma ( standardized incidence ratio 90 , 95 % confidence interval 18 to 264 ) . In addition , a significant excess risk of basal cell carcinoma was seen ( standardized incidence ratio 35 , 95 % confidence interval 7 . 2 to 102 ) . The cancer incidence among the siblings or the parents did not differ from the average cancer incidence in the Finnish population . CONCLUSIONS This study confirms an increased risk of cancer, especially non-Hodgkins lymphoma , probably attributable to defective immunity , among patients with CHH .",0
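
Numeric labels are easier to read once they are mapped back to their category names. The sketch below assumes the 0/1/2 encoding described above; `LABEL_NAMES` and `demo_df` are hypothetical names, and the sentences are placeholders rather than corpus text.

```python
import pandas as pd

# Label ids as described above: 0 = cancer, 1 = neurological, 2 = other.
LABEL_NAMES = {0: "cancer", 1: "neurological", 2: "other"}

# Hypothetical stand-in for train_df; the sentences are placeholders.
demo_df = pd.DataFrame({
    "sentence": ["abstract a", "abstract b", "abstract c", "abstract d"],
    "label": [0, 1, 2, 0],
})

# Translate each numeric label to its name, then count rows per class.
demo_df["label_name"] = demo_df["label"].map(LABEL_NAMES)
class_counts = demo_df["label_name"].value_counts()
print(class_counts)
```

Running `train_df["label"].map(LABEL_NAMES).value_counts()` in the notebook would give the class balance of the real training set, which is worth checking before training a classifier.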


## 1.2.1 Exercise: Explore the Test Set
Try the same thing for the test set in the next cell.  

In [4]:
# Change the FIXME lines to view the test set.
test_df =  pd.read_csv(TC_DATA_DIR + 'test.tsv', sep='\t')
test_df.head() #FIXME

Unnamed: 0,sentence
0,"Clustering of missense mutations in the ataxia-telangiectasia gene in a sporadic T-cell leukaemia. Ataxia-telangiectasia ( A-T ) is a recessive multi-system disorder caused by mutations in the ATM gene at 11q22-q23 ( ref . 3 ) . The risk of cancer , especially lymphoid neoplasias , is substantially elevated in A-T patients and has long been associated with chromosomal instability . By analysing tumour DNA from patients with sporadic T-cell prolymphocytic leukaemia ( T-PLL ) , a rare clonal malignancy with similarities to a mature T-cell leukaemia seen in A-T , we demonstrate a high frequency of ATM mutations in T-PLL . In marked contrast to the ATM mutation pattern in A-T , the most frequent nucleotide changes in this leukaemia were missense mutations . These clustered in the region corresponding to the kinase domain , which is highly conserved in ATM-related proteins in mouse , yeast and Drosophila . The resulting amino-acid substitutions are predicted to interfere with ATP binding or substrate recognition . Two of seventeen mutated T-PLL samples had a previously reported A-T allele . In contrast , no mutations were detected in the p53 gene , suggesting that this tumour suppressor is not frequently altered in this leukaemia . Occasional missense mutations in ATM were also found in tumour DNA from patients with B-cell non-Hodgkins lymphomas ( B-NHL ) and a B-NHL cell line . The evidence of a significant proportion of loss-of-function mutations and a complete absence of the normal copy of ATM in the majority of mutated tumours establishes somatic inactivation of this gene in the pathogenesis of sporadic T-PLL and suggests that ATM acts as a tumour suppressor . As constitutional DNA was not available , a putative hereditary predisposition to T-PLL will require further investigation . ."
1,"Myotonic dystrophy protein kinase is involved in the modulation of the Ca2+ homeostasis in skeletal muscle cells. Myotonic dystrophy ( DM ) , the most prevalent muscular disorder in adults , is caused by ( CTG ) n-repeat expansion in a gene encoding a protein kinase ( DM protein kinase ; DMPK ) and involves changes in cytoarchitecture and ion homeostasis . To obtain clues to the normal biological role of DMPK in cellular ion homeostasis , we have compared the resting [ Ca2 + ] i , the amplitude and shape of depolarization-induced Ca2 + transients , and the content of ATP-driven ion pumps in cultured skeletal muscle cells of wild-type and DMPK [ - / - ] knockout mice . In vitro-differentiated DMPK [ - / - ] myotubes exhibit a higher resting [ Ca2 + ] i than do wild-type myotubes because of an altered open probability of voltage-dependent l-type Ca2 + and Na + channels . The mutant myotubes exhibit smaller and slower Ca2 + responses upon triggering by acetylcholine or high external K + . In addition , we observed that these Ca2 + transients partially result from an influx of extracellular Ca2 + through the l-type Ca2 + channel . Neither the content nor the activity of Na + / K + ATPase and sarcoplasmic reticulum Ca2 + -ATPase are affected by DMPK absence . In conclusion , our data suggest that DMPK is involved in modulating the initial events of excitation-contraction coupling in skeletal muscle . ."
2,"Constitutional RB1-gene mutations in patients with isolated unilateral retinoblastoma. In most patients with isolated unilateral retinoblastoma , tumor development is initiated by somatic inactivation of both alleles of the RB1 gene . However , some of these patients can transmit retinoblastoma predisposition to their offspring . To determine the frequency and nature of constitutional RB1-gene mutations in patients with isolated unilateral retinoblastoma , we analyzed DNA from peripheral blood and from tumor tissue . The analysis of tumors from 54 ( 71 % ) of 76 informative patients showed loss of constitutional heterozygosity ( LOH ) at intragenic loci . Three of 13 uninformative patients had constitutional deletions . For 39 randomly selected tumors , SSCP , hetero-duplex analysis , sequencing , and Southern blot analysis were used to identify mutations . Mutations were detected in 21 ( 91 % ) of 23 tumors with LOH . In 6 ( 38 % ) of 16 tumors without LOH , one mutation was detected , and in 9 ( 56 % ) of the tumors without LOH , both mutations were found . Thus , a total of 45 mutations were identified in tumors of 36 patients . Thirty-nine of the mutations-including 34 small mutations , 2 large structural alterations , and hypermethylation in 3 tumors-were not detected in the corresponding peripheral blood DNA . In 6 ( 17 % ) of the 36 patients , a mutation was detected in constitutional DNA , and 1 of these mutations is known to be associated with reduced expressivity . The presence of a constitutional mutation was not associated with an early age at treatment . In 1 patient , somatic mosaicism was demonstrated by molecular analysis of DNA and RNA from peripheral blood . In 2 patients without a detectable mutation in peripheral blood , mosaicism was suggested because 1 of the patients showed multifocal tumors and the other later developed bilateral retinoblastoma . 
In conclusion , our results emphasize that the manifestation and transmissibility of retinoblastoma depend on the nature of the first mutation , its time in development , and the number and types of cells that are affected . ."
3,"Hereditary deficiency of the fifth component of complement in man. I. Clinical, immunochemical, and family studies. The first recognized human kindred with hereditary deficiency of the fifth component of complement ( C5 ) is described . The proband , a 20-year-old black female with systemic lupus erythematosus since age 11 , lacked serum hemolytic complement activity , even during remission . C5 was undetectable in her serum by both immunodiffusion and hemolytic assays . Other complement components were normal during remission of lupus , but C1 , C4 , C2 , and C3 levels fell during exacerbations . A younger half-sister , who had no underlying disease , was also found to lack immunochemically detectable C5 . By hemolytic assay , she exhibited 1-2 % of the normal serum C5 level and normal concentrations of other complement components . C5 levels of other family members were either normal or approximately half-normal , consistent with autosomal codominant inheritance of the gene determining C5 deficiency . Normal hemolytic titers were restored to both homozygous C5-deficient ( C5D ) sera by addition of highly purified human C5 . In specific C5 titrations , however , it was noted that when limited amounts of C5 were assayed in the presence of low dilutions of either C5D serum , curving rather than linear dose-response plots were consistently obtained , suggesting some inhibitory effect . Further studies suggested that low dilutions of C5D serum contain a factor ( or factors ) interfering at some step in the hemolytic assay of C5 , rather than a true C5 inhibitor or inactivator . Of clinical interest are ( a ) the documentation of membranous glomerulonephritis , vasculitis , and arthritis in an individual lacking C5 ( and its biologic functions ) , and ( b ) a remarkable propensity to bacterial infections in the proband , even during periods of low-dose or alternate-day corticosteroid therapy . 
Other observations indicate that the C5D state is compatible with normal coagulation function and the capacity to mount a neutrophilic leukocytosis during pyogenic infection . ."
4,"Susceptibility to ankylosing spondylitis in twins: the role of genes, HLA, and the environment. OBJECTIVE To determine the relative effects of genetic and environmental factors in susceptibility to ankylosing spondylitis ( AS ) . METHODS Twins with AS were identified from the Royal National Hospital for Rheumatic Diseases database . Clinical and radiographic examinations were performed to establish diagnoses , and disease severity was assessed using a combination of validated scoring systems . HLA typing for HLA-B27 , HLA-B60 , and HLA-DR1 was performed by polymerase chain reaction with sequence-specific primers , and zygosity was assessed using microsatellite markers . Genetic and environmental variance components were assessed with the program Mx , using data from this and previous studies of twins with AS . RESULTS Six of 8 monozygotic ( MZ ) twin pairs were disease concordant , compared with 4 of 15 B27-positive dizygotic ( DZ ) twin pairs ( 27 % ) and 4 of 32 DZ twin pairs overall ( 12 . 5 % ) . Nonsignificant increases in similarity with regard to age at disease onset and all of the disease severity scores assessed were noted in disease-concordant MZ twins compared with concordant DZ twins . HLA-B27 and B60 were associated with the disease in probands , and the rate of disease concordance was significantly increased among DZ twin pairs in which the co-twin was positive for both B27 and DR1 . Additive genetic effects were estimated to contribute 97 % of the population variance . CONCLUSION Susceptibility to AS is largely genetically determined , and the environmental trigger for the disease is probably ubiquitous . HLA-B27 accounts for a minority of the overall genetic susceptibility to AS ."


You should see different abstracts and no labels at all.  The test samples will be used in our final inference test, and are therefore "unknown" to us beforehand.  We'll need to add placeholder values, which will be ignored, in a label column.
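
One way to add such a placeholder column with _pandas_ is sketched below. This is an illustration only: `demo_test_df` is a hypothetical stand-in for `test_df`, and the choice of `0` as the dummy value is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical stand-in for test_df, which has a "sentence" column and no labels.
demo_test_df = pd.DataFrame(
    {"sentence": ["unlabeled abstract a", "unlabeled abstract b"]}
)

# Assumption: 0 serves as the dummy value. It carries no meaning here,
# because the placeholder is ignored at inference time.
PLACEHOLDER_LABEL = 0
demo_test_df["label"] = PLACEHOLDER_LABEL

print(demo_test_df)
```

Assigning a scalar to a new column broadcasts it to every row, so the test data ends up with the same two-column shape as the training data.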

# 1.3 NER Dataset

For the NER task, we'll ask a new question:<br>
**Given sentences from medical abstracts, what diseases are mentioned?**<br>
In this case, our data input is sentences from the abstracts, and our labels are the precise locations of the named disease entities.  Take a look at the information provided for the dataset.

In [None]:
NER_DATA_DIR = '/dli/task/data/NCBI_ner-3/'
!ls -lh $NER_DATA_DIR

The NER task requires two files: the text sentences, and the labels.  Run the next two cells to see a sample of the two files.

In [None]:
!head $NER_DATA_DIR/text_train.txt

In [None]:
!head $NER_DATA_DIR/labels_train.txt

### IOB Tagging
We can see that the abstract has been broken into sentences.  Each sentence is then further parsed into words with labels that correspond to the original HTML-style tags in the corpus. 

The sentences and labels in the NER dataset map to each other with _inside, outside, beginning (IOB)_ tagging. Anything separated by white space is a word, including punctuation.  For the first sentence we have the following mapping:

```text
Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .
O              O  O    O O O         O  O   B           I         I    I      O          O  
```
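
The mapping above can be read back into entity strings by walking the two lines in parallel. The following is a minimal sketch under the assumption that the tags are exactly `B`, `I`, and `O`, separated by whitespace as shown; `iob_to_entities` is a hypothetical helper, not a NeMo function.

```python
def iob_to_entities(sentence, labels):
    """Pair whitespace-separated tokens with IOB tags and collect tagged spans."""
    tokens = sentence.split()
    tags = labels.split()
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one tag")

    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new entity starts here
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the entity that is open
            current.append(token)
        else:                          # "O" (or a stray "I") closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:                        # entity that runs to the end of the sentence
        entities.append(" ".join(current))
    return entities

text_line = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
label_line = "O O O O O O O O B I I I O O"
print(iob_to_entities(text_line, label_line))
```

The length check matters in practice: a sentence and its label line must contain the same number of whitespace-separated items, or the alignment is lost.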

Recall the original corpus tags:
```html
Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor .
```
The beginning word of the tagged text, "adenomatous", is now IOB-tagged with a <span style="font-family:verdana;font-size:110%;">B</span> (beginning) tag, the remaining words of the disease mention, "polyposis coli tumour", are tagged with <span style="font-family:verdana;font-size:110%;">I</span> (inside) tags, and everything else is tagged as <span style="font-family:verdana;font-size:110%;">O</span> (outside).
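
The conversion from corpus tags to IOB tags can be sketched in a few lines. This is an illustrative reconstruction, not the script that produced the dataset: `tagged_text_to_iob` is a hypothetical helper, it treats every category as a disease, and it assumes the text is already split into whitespace-separated tokens (so punctuation such as `,` stands alone).

```python
import re

# Captures the text inside <category="...">...</category>; the label itself is ignored.
TAG_PATTERN = re.compile(r'<category="[^"]+">(.*?)</category>')

def tagged_text_to_iob(annotated_text):
    """Convert corpus-style tagged text into parallel token and IOB tag lists."""
    tokens, tags = [], []
    position = 0
    for match in TAG_PATTERN.finditer(annotated_text):
        # Text before the tag is outside any entity.
        outside = annotated_text[position:match.start()].split()
        tokens.extend(outside)
        tags.extend(["O"] * len(outside))
        # The first token inside the tag gets B, the rest get I.
        inside = match.group(1).split()
        if inside:
            tokens.extend(inside)
            tags.extend(["B"] + ["I"] * (len(inside) - 1))
        position = match.end()
    # Anything after the last tag is outside as well.
    trailing = annotated_text[position:].split()
    tokens.extend(trailing)
    tags.extend(["O"] * len(trailing))
    return tokens, tags

annotated = (
    'Identification of APC2 , a homologue of the '
    '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor .'
)
tokens, tags = tagged_text_to_iob(annotated)
print(" ".join(tokens))
print(" ".join(tags))
```

A production pipeline would also need sentence splitting and punctuation tokenization, which this sketch leaves out.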

<h2 style="color:green;">Congratulations!</h2>

You've explored the datasets for both the text classification and NER tasks and learned:
* Text classification training data has labels mapping categories to text content
* NER training data maps words to tags, such as I, O, B (inside, outside, beginning) to identify entities

Next, we'll take a brief look at some of the NVIDIA NeMo toolkit features and how to use NeMo to set up and run our NLP tasks.<br>

Move on to [2.0 Getting Started with the NeMo Toolkit](020_ExploreNeMo.ipynb).

