In [None]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Lesson 0 &ndash; Introduction

In this part, we will explore how to use machine learning skills to analyze text information.

## Learning Objectives
1. Preparing text data
    - reading a text file
    - retrieving abstracts from PubMed using PMID information
    - building a dataframe
2. Text data wrangling & processing
    - tokenization & lemmatization
    - stemming
    - stopword removal
3. Exploratory analysis of text data
    - word frequencies by uni-, bi-, and trigrams
    - frequency distribution
4. Parts of Speech Tagging
    - understanding language syntax and structure
    - shallow parsing or chunking
    - constituency parsing
    - dependency parsing
    - named entity recognition
5. Feature representation
    - bag-of-words 
    - TF-IDF
6. Predictive analysis of text data
    - logistic regression
    - naive Bayes

## Description of Dataset
The data is a publicly available, annotated dataset that was previously curated manually by systematic reviewers.  
*Sources: Cohen AM et al. Reducing Workload in Systematic Review Preparation Using Automated Citation Classification.* JAMIA *2006;13(2):206-219. Downloaded from https://dmice.ohsu.edu/cohenaa/systematic-drug-class-review-data.html on March 26, 2020*

Systematic review decisions for abstracts and articles are included for these fifteen drug review topics:  
- ACE inhibitors (ACEIs)  
- Attention-deficit hyperactivity disorder (ADHD)  
- Antihistamines
- Atypical antipsychotics  
- Beta blockers (BBs)
- Calcium channel blockers (CCBs)  
- Estrogens  
- NSAIDs  
- Opioids  
- Oral hypoglycemics (OHGs)  
- Proton pump inhibitors (PPIs)
- Skeletal muscle relaxants  
- Statins  
- Triptans  
- Urinary incontinence  

The data file has five columns: topic, EndNote ID, PubMed ID (PMID), abstract triage status, and article triage status.  
Because of the limited computational capacity of a Python library module we will use later, I cut the data down to 10,000 articles, covering only the first nine topics (ACEIs to Opioids).

The original file is in tab-separated value (.tsv) format. We can convert it into a txt file using Excel.

The converted txt file looks like this. (I converted the tsv file to a txt file for you.)
  
  
![txtfile.png](attachment:txtfile.png)

### Warning!!!
<font color=red>Your work will not be saved in Jupyter Notebook. You should download your work after working on it.</font>

In [None]:
import os

path = os.getcwd()
os.chdir(path)

# Lesson 1 &ndash; Data Preprocessing

We are interested in extracting the topic, the PMID, and the exclusion/inclusion decision, which correspond to the first, third, and fourth columns of the txt file.  
  
Take a close look at each line. Each line begins with a topic, which is a string, followed by a tab and then an EndNote ID, which is an integer. Another tab is followed by the PMID, also an integer, and then come the abstract triage decision and the full-article triage decision, each represented by a single uppercase character and separated by a tab. We can express each line in a *pythonic way* as below:  
  
`"topics" + "\t" + "EndNote ID" + "\t" + "PMID" + "\t" + "abstract triage decision" + "\t" + "full-article triage decision"`  
  
First, read the file line by line.  
Each line in the file has five elements delimited by tabs.

We will extract topics (1st element), PMIDs (3rd element), and abstract triage decisions (4th element).  
  
We will build a dictionary, 'data', whose keys are topics and whose values are sub-dictionaries that map each PMID to its triage decision.

In [None]:
data = {}

# 'with' closes the file automatically, even if an error occurs while reading.
with open('epc-ir_clean_10k.txt') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue # skip blank lines
        line_elements = line.split('\t') # split a line into five elements.
        topic = line_elements[0] # 'Topic' is the first element in each line.
        pmid = line_elements[2] # 'PMID' is the third element in each line.
        decision = line_elements[3] # 'abstract triage decision' is the fourth element in each line
        # data[topic] is a sub-dictionary that maps each PMID to its decision (see below)
        data.setdefault(topic,{}).setdefault(pmid,decision)

print(data)

The `setdefault(key, default)` method returns the value stored under `key` if the key is already in the dictionary; if not, it inserts `key` with the value `default` and returns `default`.  
In our code, `data.setdefault(topic,{})` returns the sub-dictionary for `topic`, creating an empty one the first time a topic appears. The second call, `.setdefault(pmid,decision)`, stores the decision under the PMID unless that PMID has already been recorded for the topic.  
  
As you can see, the article '10024335' was used for two reviews, "ACEInhibitors" and "Statins." Because the records are grouped by topic, the same PMID can appear under more than one topic, so no information is lost when an article was screened for several reviews.  
  
Now we have a data dictionary whose keys are topics and whose values are sub-dictionaries that map PMIDs to triage decisions.    
With this dictionary, we can easily look up the PMIDs and decisions for any topic.  
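
The behavior of the chained `setdefault()` calls can be tried on a few toy records. This is only a sketch: the topic names, PMIDs, and decisions below are made up for illustration and are not taken from the dataset.

```python
# Toy records in (topic, PMID, decision) form; all values are invented for illustration.
records = [
    ("TopicA", "111", "I"),
    ("TopicA", "222", "E"),
    ("TopicB", "111", "E"),   # same PMID under a different topic
    ("TopicA", "111", "E"),   # repeated (topic, PMID) pair: the first decision is kept
]

nested = {}
for topic, pmid, decision in records:
    # The first setdefault returns the topic's sub-dictionary (creating it if needed);
    # the second stores the decision only if the PMID is not recorded for that topic yet.
    nested.setdefault(topic, {}).setdefault(pmid, decision)

print(nested)
# {'TopicA': {'111': 'I', '222': 'E'}, 'TopicB': {'111': 'E'}}
```

Note that PMID '111' is kept under both topics, while the repeated ('TopicA', '111') record does not overwrite the first decision.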

Next, we want to build a pandas DataFrame that will have columns for topic, PMID, triage decision, and abstract text. 

To use a DataFrame, we first have to import the `pandas` library.

In [None]:
import pandas as pd
from pandas import DataFrame

Each row of the DataFrame will hold one article. First, we will create an empty DataFrame with three columns: topic, PMID, and triage decision.  

In [None]:
col_names = ['topic','pmid','decision']
df = pd.DataFrame(columns = col_names)
df.head() # df is an empty DataFrame with three columns

To fill 'df,' collect one dictionary per row, each mapping a column name to a value, and pass the whole list to `pd.DataFrame()`. Building the rows first is also much faster than adding them one at a time with `DataFrame.append()`, which was removed in pandas 2.0.

In [None]:
rows = []

# iterate over every topic, i.e., 'ACEInhibitors',...,'Opioids'
for topic in data: 
    
    # extract the dictionary contained in each topic, i.e., {'10024335': 'E', '10027665': 'E',...}
    dict_by_topic = data[topic] 
    
    # iterate over PMIDs (keys) in the sub-dictionary, i.e., '10024335',...
    for pmid in dict_by_topic: 
        
        # make a new dictionary for this PMID. 
        # Each dictionary will become one row of df.
        rows.append({'topic':topic, 'pmid':pmid, 'decision':dict_by_topic[pmid]})

# build the DataFrame in one step; this replaces the empty df created above.
df = pd.DataFrame(rows, columns=col_names)

In [None]:
df.head(10)

Now, we will add a new column, 'abstract,' to contain the texts of abstracts.    
To do this, we have to retrieve PubMed abstracts using the BioPython library.

------
## Getting abstracts from a list of PMIDs

credit: https://stackoverflow.com/questions/47559098/is-there-any-way-to-get-abstracts-for-a-given-list-of-pubmed-ids

#### BioPython library
Using the BioPython library, you can retrieve the abstracts for a given list of PMIDs. You pass the comma-joined list of PubMed IDs to `Entrez.efetch`, which performs a single URL lookup; from the result we build a **dictionary** whose keys are PMIDs and whose values are abstract strings. 

If this is the first time you are using the BioPython library, you have to install it with the following code. You do not have to install BioPython again if you have already installed it.

````python
pip install biopython
````

Then import Entrez from the Bio library.

In [None]:
from Bio import Entrez

To make use of Entrez, we first have to have a list of PMIDs. 

In [None]:
# make an empty list that will hold PMIDs
pmids_list = []

# iterate over the key of the 'data' dictionary (i.e., topics) we built before.
for topic in data:
    
    # iterate over the PMIDs that are contained in the 'topics' sub-dictionaries.
    for pmid in data[topic]:
        
        # The PMIDs in the 'data' dictionaries are strings. 
        # We convert them to integers so that they can serve as integer keys later on.
        pmid = int(pmid)
        
        # append any new PMIDs to 'pmids_list'
        if pmid not in pmids_list:
            pmids_list.append(pmid)

Check the number of PMIDs.

In [None]:
len(pmids_list)

There are 8914 unique PMIDs in the data.  
  
Now, we have a list of PMIDs, 'pmids_list.' We will pass the 'pmids_list' to Entrez module.  
(The following code will require a couple of minutes to complete running)
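
If a single request for thousands of PMIDs turns out to be slow or fails, one option is to send the IDs in smaller batches. The sketch below shows only the batching step and makes no network calls. The helper `chunked`, the batch size, and the example IDs are assumptions made for this illustration; they are not part of BioPython. Each resulting string could be passed as the `id` argument of `Entrez.efetch`.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

example_pmids = [101, 102, 103, 104, 105]   # made-up IDs, not real PMIDs

# One comma-joined ID string per batch.
id_strings = [','.join(map(str, batch)) for batch in chunked(example_pmids, 2)]
print(id_strings)   # ['101,102', '103,104', '105']
```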

In [None]:
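abstract_dict = {}
without_abstract = []

# NCBI asks Entrez users to identify themselves.
# The address below is a placeholder; replace it with your own.
Entrez.email = "your.name@example.com"

handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids_list)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
handle.close()

for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        # Structured abstracts come in several sections (Background, Methods, ...).
        # Join all of them instead of keeping only the first one.
        abstract = ' '.join(str(part) for part in article['Abstract']['AbstractText'])
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)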

The keys of the 'abstract_dict' dictionary are PMIDs, and the values are the text of abstracts corresponding to each PMID. Take a look at the abstract text of the PMID 8041685.

In [None]:
abstract_dict[8041685]

Now, we will attach the abstract texts to a new column of our DataFrame 'df.'  

In [None]:
# Build the column that will contain the texts of abstracts. 
# The column is a pandas Series with one entry per row of df.
# The PMIDs in df are strings, while the keys of 'abstract_dict' are integers,
# so each PMID is converted with int() before the lookup.
# There are some articles whose abstracts are not provided in PubMed. 
# In that case, we fill in the string 'NaN' in the abstract text column.
abstract_col = pd.Series(
    [abstract_dict.get(int(pmid), 'NaN') for pmid in df['pmid']],
    index=df.index,
)
        

Make sure that each of the 10,000 articles has an entry in 'abstract_col.'

In [None]:
len(abstract_col)

Attach 'abstract_col,' a pandas Series, to DataFrame 'df.'

In [None]:
df.insert(3, "abstract", abstract_col)

Now, we have a complete dataset that contains the information about topics, PMIDs, abstracts, and triage decisions.

In [None]:
df.head(20)

As you may notice, the article with PMID 10069777 does not have abstract text. To verify this, visit pubmed.gov and search for the article using the query '10069777[pmid]'.  
  
  
![pubmed_pmid.png](attachment:pubmed_pmid.png)

## Navigating DataFrame

You can locate rows and columns with specific values using the `loc` indexer.  
Let's find the row corresponding to the PMID 8041685.

In [None]:
df.loc[df['pmid']=='8041685']

We can find rows whose topic is 'BetaBlockers' and triage decision is 'include' but their abstract information is not available.  

In [None]:
df.loc[(df['topic']=='BetaBlockers') & (df['decision']=='I') & (df['abstract']=='NaN')]


When you take a close look at the data, you will notice that there are some articles whose 'decision' codes are digits rather than letters.

In [None]:
df.loc[(df['topic']=='BetaBlockers') & (df['decision']=='5')].head()

According to the description of the dataset (https://dmice.ohsu.edu/cohenaa/systematic-drug-class-review-data.html), the integer codes indicate reasons for exclusion. Therefore, we can recode those integer codes as 'E' in the 'decision' column.  
  
The `replace(to_replace, value)` method replaces every value that matches `to_replace` with `value`.
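
Before applying the replacement to the real column, it can help to try it on a tiny Series. The codes below are made up for this sketch; only the `replace()` call itself mirrors what we do with 'df'.

```python
import pandas as pd

# Made-up decision codes: 'I' = include, 'E' = exclude, digits = reasons for exclusion.
codes = pd.Series(['I', 'E', '3', '5', 'I'])

# Every value found in the first argument is replaced by the second argument.
recoded = codes.replace(['3', '5'], 'E')

print(recoded.tolist())   # ['I', 'E', 'E', 'E', 'I']
```

`replace()` returns a new Series and leaves `codes` unchanged, which is why the result has to be assigned back to the column.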

In [None]:
df['decision'] = df['decision'].replace(['1','2','3','4','5','6','7','8','9'],'E')

df.head()

We now have a neatly formatted dataset of articles, and you can quickly check the number of articles per topic with the following code.

In [None]:
df.topic.value_counts()

# Lesson 2 &ndash; Text Data Preprocessing

Credit: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

### To play the lecture video, run the cell below.
<font color=blue>We will use this video for submitting your answers to the in-class quiz and exercises.</font>

In [None]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://edpuzzle.com/embed/media/5ec0addd917ba83efc3920b4" frameborder="0" allowfullscreen></iframe>')

Now that we have created a dataset that holds the texts and the topics, we need to preprocess our text data before we convert it to something useful (i.e., numbers) for the machine learning model.  

**Note**: You don't need to learn all the details of the code written here. You can just use the code in this lesson for future reference.

The raw texts in the 'abstract' column need to be cleaned to represent each word in the text correctly. 

To do this, first download and install the Natural Language Toolkit (NLTK).  
NLTK is a commonly used tool in Python to conduct the text analysis. It is an open source library in Python, and has support for most NLP tasks. It also provides access to numerous text corpora.  
  
To install nltk, a `pip install nltk` or a `conda install nltk` should suffice.

````python
pip install nltk
````

And then import the 'nltk' library.

In [None]:
import nltk

--------------
## Removing Special Characters

Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.  
  
We need to import 're' module in order to use regexes.

In [None]:
import re

In [None]:
def remove_special_characters(text, remove_digits=False):
    r'''
    Remove every character that is not a letter, a digit, or whitespace.

    A caret at the start of a bracket expression means 'not.'
    If 'remove_digits' is False (the default), the pattern [^a-zA-Z0-9\s] matches any
    character other than a letter ([a-zA-Z]), a digit ([0-9]), or whitespace (\s),
    so digits are kept.
    If 'remove_digits' is True, the pattern [^a-zA-Z\s] also matches digits,
    so the function removes numbers as well.
    '''
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [None]:
remove_special_characters("Well this was fun! What do you think? 123#@", 
                          remove_digits=True)

As you can see here, the 'remove_special_characters' function removed the exclamation mark, the question mark, and the symbols '#' and '@'. Then why did it also remove 123? It is because we turned on the 'remove_digits' parameter. When 'remove_digits' is True, the pattern treats digits as special characters and removes them too.  
  
If we turn off the 'remove_digits' parameter, which is the default, we can see that the digits survive.

In [None]:
remove_special_characters("Well this was fun! What do you think? 123#@", 
                          remove_digits=False)
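
When writing character classes like these, the case of each range endpoint matters. The check below is a small sketch with a made-up sample string: it compares the range `A-Z` with `A-z`, which also covers the six ASCII characters that sit between 'Z' and 'a' (`[`, `\`, `]`, `^`, `_`, and the backtick).

```python
import re

sample = "pH_7 [dose] ^ 10mg!"

# [^a-zA-Z\s] removes everything except letters and whitespace.
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample)

# [^a-zA-z\s] looks almost the same, but the range A-z also keeps [ ] ^ _ and friends.
too_permissive = re.sub(r'[^a-zA-z\s]', '', sample)

print(repr(letters_only))     # 'pH dose  mg'
print(repr(too_permissive))   # 'pH_ [dose] ^ mg'
```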

-------------
## Removing Stop Words

Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually words that end up having the maximum frequency if you do a simple term or word frequency in a corpus. Typically, these can be articles, conjunctions, prepositions and so on. Some examples of stop words are ***a, an, the, and*** the like.  
  
There is no universal stop word list, but we will use a standard English language stopwords list from nltk.
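
To see why stop words dominate simple frequency counts, here is a minimal sketch using only the standard library. The sentence is made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# A made-up sentence; any ordinary English text shows the same pattern.
text = ("the patients in the study were given the drug and "
        "the control group was given a placebo in the same period")

counts = Counter(text.split())

# The most frequent token is an article, not a content word.
print(counts.most_common(1))   # [('the', 5)]
```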

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

Let's look at the list of stop words from nltk.

In [None]:
nltk.download('stopwords')
print(stopwords.words('english'))

Note that NLTK's stopword list contains contractions like "you'll," as well as negative expressions like "not" and "hasn't."  
  
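In the current analysis, we will not remove negative expressions like 'no' and 'not' from the original texts because they carry important meaning.  
The 'remove_stopwords' function defined below takes an 'is_lower_case' parameter. When it is **True**, the function assumes the text is already in lowercase and compares each token as is, so a stopword that contains an uppercase character (such as 'The') is not removed. When it is **False** (the default), each token is lowercased before the comparison.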

In [None]:
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no') # we will not remove 'no' from texts
stopword_list.remove('not') # we will not remove 'not' from texts

In [None]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [None]:
remove_stopwords("The, and, if are StopWords, computer is not")

In [None]:
remove_stopwords("The, and, if are StopWords, computer is not", is_lower_case=True)

# Exercise

Consider a few product reviews that are already annotated:  
  
1. The product is really very good. – POSITIVE  
2. The product seems to be good. – POSITIVE  
3. Good product. I really liked it. – POSITIVE  
4. The product is not good. – NEGATIVE  
5. I didn’t like the product. – NEGATIVE  
  
Remove stop words in each review using the code provided above and see what happens to the review comments. Critically argue whether stop word removal improves model performance in any context. Submit your code to the text box embedded in the lecture video.

--------------
## Stemming

### To play the lecture video, run the cell below.
<font color=blue>We will use this video for submitting your answers to the in-class quiz and exercises.</font>

In [None]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://edpuzzle.com/embed/media/5ec0aeb22f4e7c3f03a29d60" frameborder="0" allowfullscreen></iframe>')

Word stems are also known as the base form of a word, and we can create new words by attaching affixes to them in a process known as inflection. The goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. The stemming process maps all inflectional forms of a word to the same root form. Consider the word ***compute***. You can add affixes to it and form new words like ***computes, computed***, and ***computing***. In this case, the base word ***comput*** is the word stem.  
  
  
The reverse process of obtaining the base form of a word from its inflected form is known as **stemming**. Stemming helps us in standardizing words to their base or root stem, irrespective of their inflections, which helps many applications like classifying or clustering text, and even in information retrieval.  
  
Let’s see the popular Porter stemmer in action now.  
Porter stemmer, or Porter's algorithm, is a rule-based suffix stripping algorithm. For example, the word ***duplicatable*** is stemmed by the following steps:
1. ***duplicatable*** to ***duplicat*** by a rule from step 4.
2. ***duplicat*** to ***duplicate*** by a rule from step 1b1.
3. ***duplicate*** to ***duplic*** by a rule from step 3.
4. [Stop]  
  
Porter stemmer is known for its simplicity and speed.  
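
To get a feel for what suffix stripping means before using the real algorithm, here is a deliberately naive sketch. It is not the Porter algorithm: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for this illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "e", "s"), min_stem=3):
    """Strip the first matching suffix, provided at least `min_stem` letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ["compute", "computes", "computed", "computing"]:
    print(word, "->", naive_stem(word))
# compute -> comput
# computes -> comput
# computed -> comput
# computing -> comput
```

A fixed suffix list like this breaks down quickly on real text, which is why rule-based stemmers such as Porter's apply their rules in several ordered steps with conditions on the remaining stem.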

In [None]:
def simple_stemmer(text):
    ps = nltk.stem.PorterStemmer()
    # text.split() splits the text into a list of individual words,
    # ps.stem() stems each word, and ' '.join() joins the stemmed words with whitespace.
    text = ' '.join([ps.stem(word) for word in text.split()]) 
    return text

In [None]:
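nltk.download('punkt')    # tokenizer models, needed later by nltk.word_tokenize
nltk.download('wordnet')  # WordNet data, needed later by the lemmatizer
simple_stemmer("My system keeps crashing! his was crashed yesterday, ours crashes daily")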

---------------
## Lemmatization

Lemmatization is very similar to stemming: we remove word affixes to get to the base form of a word. However, the base form in lemmatization is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), whereas the root stem may not be. Thus the root word, also known as the lemma, will always be present in the dictionary.  
For example, “am, are, is” will be lemmatized into “be”; “car, car’s, cars’, cars” into “car”. Because the different forms of a word are merged into one lemma, the counts for that word are pooled rather than split across its forms.  
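
The idea of mapping each word form to a dictionary entry can be sketched with a hand-written lookup table. This is only an illustration: the table `LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for a real lexical resource such as WordNet, which we use in the cells that follow.

```python
# A tiny hand-written lookup table standing in for a real dictionary.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car"}

def toy_lemmatize(tokens):
    """Replace each token by its lemma if the table knows it; otherwise keep the token."""
    return [LEMMAS.get(token.lower(), token) for token in tokens]

print(toy_lemmatize(["The", "cars", "are", "fast"]))   # ['The', 'car', 'be', 'fast']
```

Unlike a stem, every output here is a valid dictionary word, at the cost of needing a dictionary in the first place.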

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    s = " " # separator (a single space) used later to join the lemmatized words
    t_l = [] # create an empty list that will hold the lemmatized words
    t_w = nltk.word_tokenize(text) # tokenize the text
    # assign the list of tokenized words into t_w.
    for w in t_w:
        # “pos” is a part-of-speech parameter and “v” means verb. 
        # We will lemmatize verbs only. 
        l_w = wordnet_lemmatizer.lemmatize(w, pos="v")
        # append l_w to the list t_l
        t_l.append(l_w)
    # join the tokens to make a complete sentence
    text = s.join(t_l)
    return text

In [None]:
lemmatize_text("My system keeps crashing! his was crashed yesterday, ours crashes daily")

---------------
## Building a Text Normalizer

Let’s now bring everything we learnt together and chain these operations to build a text normalizer to pre-process text data.

In [None]:
def normalize_corpus(corpus, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
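
To follow what the regular-expression steps do to a single document, here is a simplified sketch of that part of the pipeline. It leaves out lowercasing, lemmatization, and stop word removal so that no NLTK data is needed, and it adds a final `strip()` for readability. The sample text is made up for illustration.

```python
import re

doc = "Blood pressure (BP)\r\nwas measured.\n\nResults: improved!"

doc = re.sub(r'[\r\n]+', ' ', doc)            # turn each run of line breaks into one space
doc = re.sub(r'([{.(-)!}])', r' \1 ', doc)    # pad selected punctuation with spaces
doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc)      # drop the remaining special characters
doc = re.sub(' +', ' ', doc).strip()          # squeeze repeated spaces, trim the ends

print(doc)   # Blood pressure BP was measured Results improved
```

Padding the punctuation before removing it keeps neighboring words from being glued together once the punctuation is gone.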

Now let's put this function into action! We will create a new column, 'clean_abs', to hold the pre-processed text of each abstract.

In [None]:
df['clean_abs'] = normalize_corpus(df['abstract'])
norm_corpus = list(df['clean_abs'])

In [None]:
df.head()

In the 'clean_abs' column, all the text is in lowercase and there are also no punctuation marks, no stopwords, and no contractions. Our text cleaning has worked like a charm.  

Let's take a look at the first row of 'abstract' and 'clean_abs' columns and transform them in a dictionary form.

In [None]:
# to_dict() function transforms row dataframe into a dictionary
df.iloc[0][['abstract','clean_abs']].to_dict()

In [None]:
df.iloc[5679][['abstract','clean_abs']].to_dict()

Save the DataFrame we built in CSV format for future use.

````python
df.to_csv('text_mining-df.csv')
````

# Exercise  
  
Using the code provided in the Jupyter Notebook, 1) remove special characters (including numbers) and stopwords from the following paragraph, and 2) lemmatize it. Submit your code to the text box embedded in the lecture video above.  
  
“We measured the serum lipid profile, together with plasma fibrinogen and serum lipoprotein(a) (Lp[a]), glucose, bilirubin, and albumin levels in 491 patients (310 men) who were referred for the management of primary dyslipidemia. All these variables have been shown to predict vascular events. The patients were not taking lipid-lowering drugs; hypertension was present in 156 (31.7%) of them. Of the hypertensive patients, 52 (33%) were not receiving any treatment to control their blood pressure. Lipid-hostile antihypertensive drugs were associated with a significantly higher fibrinogen concentration when compared with untreated hypertensives or those taking lipid-neutral/lipid-friendly drugs (median values: 383, 353, and 336 mg/dL, respectively; P < .01). Lipid-neutral/lipid-friendly antihypertensive drugs were associated with lower Lp(a) levels when compared with untreated hypertensives (median values: 22 and 45 mg/dL, respectively; P < .05). The serum bilirubin level was significantly lower in the untreated hypertensives when compared with normotensives or the treated hypertensives. There were no significant differences in lipids, glucose, or albumin among the groups of hypertensives or normotensives. The influence of antihypertensive drugs on additional cardiovascular risk factors should be considered when selecting medication to reduce blood pressure.”