# <center>HW 4: Text preprocessing</center>

In [1]:
import nltk, re, json, string
from sklearn.preprocessing import normalize
from sklearn.metrics import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np  
import pandas as pd
from nltk.corpus import stopwords
from spacy.tokenizer import Tokenizer
import spacy

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Q1: Regular Expression (2 points)

Suppose you have scraped the text shown below from an online source. You'd like to extract data from it using regular expressions.

Define an **extract** function which:
- Takes a piece of text (in the format shown below) as an input
- Extracts data into a list of tuples using regular expression, e.g.  `[('BTC-USD','56,212.15','-58.16','-0.10%'), ('ETH-USD',  ...), ...]`
- Returns the list of tuples

In [2]:
text='''Symbol   Last Price  Change   % Change   Note
                  BTC-USD  56,212.15   -58.16   -0.10%   Bitcoin
                  ETH-USD  1,787.79    -53.63   -2.91%   Ether
                  BNB-USD  1,101,290.51      +51.81    +2.04%   Binance
                  USDT-USD 1.0003      -0.0004  -0.04%   Tether
                  ADA-USD  1.1187      -0.0528  -4.51%   Cardano
      '''
text

'Symbol   Last Price  Change   % Change   Note\n                  BTC-USD  56,212.15   -58.16   -0.10%   Bitcoin\n                  ETH-USD  1,787.79    -53.63   -2.91%   Ether\n                  BNB-USD  1,101,290.51      +51.81    +2.04%   Binance\n                  USDT-USD 1.0003      -0.0004  -0.04%   Tether\n                  ADA-USD  1.1187      -0.0528  -4.51%   Cardano\n      '

In [3]:
# Define the function

def extract(text):
    text = re.sub(" +"," ", text)
    regexPattern = re.compile(r"([A-Z]{2,4}[-][A-Z]{3}) (\b[\d,]+\.?\d*\b) ([\+\-]\b[\d,]+\.?\d*\b) ([\+\-]\b[\d,]+\.?\d*\b\%)")
    return regexPattern.findall(text)

In [4]:
# Test the function

extract(text)

[('BTC-USD', '56,212.15', '-58.16', '-0.10%'),
 ('ETH-USD', '1,787.79', '-53.63', '-2.91%'),
 ('BNB-USD', '1,101,290.51', '+51.81', '+2.04%'),
 ('USDT-USD', '1.0003', '-0.0004', '-0.04%'),
 ('ADA-USD', '1.1187', '-0.0528', '-4.51%')]
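
As an alternative sketch, the same rows can be matched without first collapsing repeated spaces, by allowing `\s+` between fields and anchoring each row with `re.MULTILINE`. The names `ROW` and `extract_rows` and the shortened `sample` text are hypothetical and used only for illustration; this is one possible pattern, not the required solution.

```python
import re

# Hypothetical pattern: one row = symbol, price, signed change, signed percent
ROW = re.compile(
    r"^\s*([A-Z]+-[A-Z]+)\s+"        # symbol, e.g. BTC-USD
    r"(\d[\d,]*(?:\.\d+)?)\s+"       # last price, thousands separators allowed
    r"([+-]\d[\d,]*(?:\.\d+)?)\s+"   # change, with an explicit sign
    r"([+-]\d+(?:\.\d+)?%)",         # percent change, with sign and trailing %
    re.MULTILINE,
)

def extract_rows(text):
    """Return one (symbol, price, change, pct_change) tuple per data row."""
    return ROW.findall(text)

# Shortened sample in the same layout as the scraped text (illustration only)
sample = """Symbol   Last Price  Change   % Change   Note
    BTC-USD  56,212.15   -58.16   -0.10%   Bitcoin
    USDT-USD 1.0003      -0.0004  -0.04%   Tether
"""

rows = extract_rows(sample)
print(rows)
```

The header line is skipped because `Symbol` does not fit the uppercase `XXX-XXX` shape, so no separate header handling is needed in this sketch.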

## Q2: Collocation (3 points)

Define a function `top_collocation(doc, K)` to find the top-K collocations that match specific patterns in a document, as follows:
  - Take a document (i.e. `doc`) and `K` as inputs
  - Find collocations as follows:
    - Tokenize the document and find the POS tag of each token (hint: you can use the NLTK word tokenizer or the spaCy tokenizer).
    - Create bigrams from the tokens with POS tags.
    - Keep only bigrams matching the following patterns:
       - `Adj + Noun`: e.g. linear function
       - `Noun + Noun`: e.g. regression coefficient
    - Get the frequency of each bigram (hint: you can use `nltk.FreqDist`)
  - Return the top K collocations by frequency
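
The filtering and counting steps above can be sketched on a small hand-tagged token list, so that no tagger or model download is needed. The `tagged` sample, its POS labels, and the name `top_tagged_bigrams` are hypothetical; `collections.Counter` stands in here for `nltk.FreqDist`, which offers a similar `most_common` method.

```python
from collections import Counter

# Hypothetical hand-tagged tokens: (text, coarse POS tag).
# In practice a POS tagger would supply these pairs.
tagged = [
    ("public", "ADJ"), ("health", "NOUN"), ("officials", "NOUN"),
    ("monitor", "VERB"), ("public", "ADJ"), ("health", "NOUN"),
    ("risk", "NOUN"),
]

def top_tagged_bigrams(tagged_tokens, k):
    """Count adjacent pairs matching Adj + Noun or Noun + Noun; return the top k."""
    counts = Counter()
    # zip pairs each token with its immediate successor
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in {"ADJ", "NOUN"} and t2 == "NOUN":
            counts[(w1.lower(), w2.lower())] += 1
    return counts.most_common(k)

print(top_tagged_bigrams(tagged, 1))  # [(('public', 'health'), 2)]
```

Checking the tag pattern on every pair, rather than only the first time a pair is seen, keeps a pair from being counted when its tags do not match.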

In [5]:
# Define the function


def top_collocation(doc, K):
    # Collapse runs of line breaks into single spaces
    doc = re.sub("\n+", " ", doc)
    nlp = spacy.load('en_core_web_sm')
    # Stop words are dropped before pairing, so a bigram may span a removed stop word
    tokens = [t for t in nlp(doc) if not t.is_stop]
    collocationDict = {}
    for i in range(1, len(tokens)):
        pairItemsText = (tokens[i-1].text, tokens[i].text)
        pairItemsTag = (tokens[i-1].pos_, tokens[i].pos_)
        if pairItemsText in collocationDict:
            collocationDict[pairItemsText] += 1
        # Keep Adj + Noun and Noun + Noun patterns (proper nouns count as nouns)
        elif pairItemsTag[0] in ['ADJ', 'NOUN','PROPN'] and pairItemsTag[1] in ['NOUN', 'PROPN']:
            collocationDict[pairItemsText] = 1

    # Sort by frequency, highest first, and keep the top K
    return sorted(collocationDict.items(), key=lambda x: x[1], reverse=True)[:K]

In [6]:
with open("qa.json", "r") as f:
    data = json.load(f)
article = data["context"]

top_collocation(article, 10)

[(('public', 'health'), 14),
 (('community', 'spread'), 9),
 (('United', 'States'), 8),
 (('risk', 'exposure'), 5),
 (('spread', 'COVID-19'), 4),
 (('higher', 'risk'), 4),
 (('COVID-19', 'illness'), 4),
 (('spread', 'virus'), 4),
 (('elevated', 'risk'), 4),
 (('health', 'threat'), 3)]

## Q3: Question and Answering (QA) System (5 points)

Develop a QA system which allows you to search for answers in an article. For example, the file `qa.json` contains a research article. This article can answer a number of questions about COVID-19. You will design a solution to automatically search for answers to these questions in this article.

`qa.json` is taken from https://github.com/deepset-ai/COVID-QA. This file contains a few questions, and answers to these questions have been located in the article. Let's define a QA system and check if your system can locate the right answers.

The following script helps you understand `qa.json`:

In [7]:
# Retrieve the article

with open("qa.json", "r") as f:
    data = json.load(f)
article = data["context"]

# A long article. Just print the first 200 characters
print(article[0:200])

CDC Summary 21 MAR 2020,
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/summary.html

This is a rapidly evolving situation and CDC will provide updated information and guidance as it becomes 


In [8]:
# Retrieve all the questions and answers
qas = data["qas"]

# show the first question-answer pair. Note the answer starts at the 6117th character
print(qas[0])

# get all questions
qs = [item["question"] for item in qas]
qs

{'question': 'What age group has the highest rate of severe outcomes?', 'id': 236, 'answers': [{'text': 'people 85 years and older', 'answer_start': 6117}], 'is_impossible': False}


['What age group has the highest rate of severe outcomes?',
 'How is COVID-19 spread?',
 'How many states in the U.S. have reported cases of COVID-19?',
 'When did the White House launch the "15 Days to Slow the Spread" program?',
 'What should mildly-ill patients do?',
 'What type of virus is SARS-CoV-2?',
 'What viruses are similar to the COVID-19 coronavirus?',
 'What are the phases of a pandemic?',
 'At which phase does the peak of the pandemic occur?',
 'People with which medical conditions have a higher rate of severe illness?',
 'What kind of test can diagnose COVID-19?',
 'In what species did the COVID-19 virus likely originate?',
 'What risk factors should be considered in addition to clinical symptoms?']

Next, follow the instructions below step by step to develop the QA system.

### Q3.1. Tokenizer

Define a function `tokenize(doc)`  as follows:
   - Take a piece of text (i.e. variable `doc`) as an input
   - Split the input text into unigrams
   - Clean up tokens as follows:
       - Lemmatize all unigrams
       - Remove all stop words
       - Remove all punctuation
       - Convert all unigrams to lowercase
       - Remove empty unigrams
   - Return the list of unigrams after all the processing. (Hint: you can use the spaCy package for this task. To test if a token is a stop word or punctuation, check https://spacy.io/api/token#attributes)

In [9]:
# Define the function


def tokenize(doc):
    # Parse the text with spaCy, then keep the lowercased lemma of every token
    # that is not a stop word, not punctuation, and not empty
    nlp = spacy.load('en_core_web_sm')
    tokens = [token.lemma_.lower() for token in nlp(doc)
              if not (token.is_stop or token.is_punct or token.text == '')]
    return tokens

In [10]:
doc = 'Older people and people of all ages with severe chronic medical conditions — \
like heart disease, lung disease and diabetes, \
for example — seem to be at higher risk of developing serious COVID-19 illness.'

print(tokenize(doc))

['old', 'people', 'people', 'age', 'severe', 'chronic', 'medical', 'condition', 'like', 'heart', 'disease', 'lung', 'disease', 'diabetes', 'example', 'high', 'risk', 'develop', 'covid-19', 'illness']


### Q3.2. Compute TF-IDF Matrix

Define a function `compute_tfidf(docs)` as follows: 

- Take `docs`, a list of documents (e.g. a list of questions) as an input
- Tokenize each document in `docs` using the `tokenize` function defined in Q3.1. 
- Calculate tf_idf weights as shown in lecture notes (Hint: you can reuse the last code segment in NLP Lecture Notes (II))
- Return a smoothed normalized `tf_idf` array. (The result may differ a little bit depending on the tokenize function and packages you use.)
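
As a worked sketch of the weighting, the smoothed formula used in this notebook, `idf = log((N + 1) / (df + 1)) + 1`, can be applied to a tiny term-count matrix with NumPy alone. The function name `smoothed_tfidf` and the toy `counts` matrix are hypothetical, and the sketch assumes every document contains at least one term (otherwise the row sums would be zero).

```python
import numpy as np

def smoothed_tfidf(counts):
    """Row-normalized TF-IDF from a (documents x terms) count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: each count divided by the document length
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Document frequency: number of documents containing each term
    df = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency
    idf = np.log((n_docs + 1) / (df + 1)) + 1
    weighted = tf * idf
    # L2-normalize each row so every document vector has unit length
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Toy counts: 3 documents, 4 terms; term 1 occurs in two documents
counts = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
tfidf = smoothed_tfidf(counts)
print(np.round(tfidf, 2))
```

In the second row, the term shared by two documents gets the lower weight (about 0.61) and the term unique to that document gets the higher one (about 0.8), which is the effect the IDF factor is meant to produce.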

In [11]:
# Define the function
def get_doc_tokens(doc):
    tokens = tokenize(doc)
    tokensCount = {token: tokens.count(token) for token in set(tokens)}
    return tokensCount

def get_doc_term_matrix(docs):
    docs_token = {idx: get_doc_tokens(doc) for idx, doc in enumerate(docs)}
    dtm = pd.DataFrame.from_dict(docs_token, orient="index")
    dtm = dtm.fillna(0)
    dtm = dtm.sort_index(axis=0)
    return dtm

def get_tf(docs):
    tf = get_doc_term_matrix(docs)
    doc_len = tf.sum(axis=1)
    tf = np.divide(tf.T, doc_len).T
    return tf
    
def get_idf(docs):
    df = get_tf(docs)
    df = np.where(df>0, 1, 0)
    # Smoothed IDF: add 1 to both the document count and the document frequency
    idf = np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1
    return idf

def compute_tfidf(docs):
    tf = get_tf(docs)
    idf = get_idf(docs)
    tf_idf = normalize(tf*idf)
    return tf_idf

In [12]:
# Test the function using three questions

np.set_printoptions(precision=2)
compute_tfidf(qs[0:3])

array([[0.41, 0.41, 0.41, 0.41, 0.41, 0.41, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.61, 0.8 , 0.  , 0.  , 0.  ,
        0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.36, 0.  , 0.47, 0.47, 0.47,
        0.47]])

### Q3.3. Put Everything Together

Define a function `find_solutions(qs, article)` as follows: 

- Take two inputs:
    - `qs`: a list of questions (i.e. strings)
    - `article`: a document which may contain answers to the questions
- Segment the article into sentences (i.e. `sents`). You will locate the sentence which can answer a question.
- Concatenate the questions (`qs`) and sentences (`sents`) into a single list (i.e. `qs + sents`)
- Call the function `compute_tfidf` defined in Q3.2 with `qs + sents` to get a `TF-IDF` matrix. (Note, now `qs` and `sents` are converted to TF-IDF vectors in the same dimension. As a result, you can measure their similarities.) 
- Split the `TF-IDF` matrix into two sub matrices, one corresponding to `qs` and the other for `sents`. 
- Next, calculate the pairwise cosine similarity between the `qs` and `sents`. With $m$ questions and $n$ sentences, you should get an $m \times n$ matrix. (Hint: you can use `sklearn.metrics.pairwise_distances` to calculate pairwise distances between two matrices; cosine similarity is 1 minus cosine distance.)
- Finally, the answer to each question is the sentence which has the `maximum similarity` to it. 
- Print out each question and its matched answer. Check if your QA system is able to find the right answer. (Depending on the packages you use, your answer might be a bit different from mine.)
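
The matching step can be sketched with plain NumPy: once the rows have unit length, cosine similarity reduces to a matrix product, and the best sentence for each question is the column with the largest value. The function name `best_sentence_indices` and the toy vectors are hypothetical; the sketch assumes no all-zero rows.

```python
import numpy as np

def best_sentence_indices(q_vecs, s_vecs):
    """Return the index of the most similar sentence for each question,
    together with the full (m x n) cosine similarity matrix."""
    q = np.asarray(q_vecs, dtype=float)
    s = np.asarray(s_vecs, dtype=float)
    # Scale each row to unit length so the dot product equals the cosine
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sim = q @ s.T
    return sim.argmax(axis=1), sim

# Toy vectors: 2 "questions" and 3 "sentences" in a 2-term vocabulary
q_vecs = [[1.0, 0.0], [0.0, 1.0]]
s_vecs = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]

best, sim = best_sentence_indices(q_vecs, s_vecs)
print(best)  # [1 0]
```

Vector length does not matter here: the sentence `[3.0, 0.0]` matches the first question perfectly even though it is three times longer, because only the direction is compared.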

In [19]:
def find_solutions(qs, article):
    # Segment the article into sentences with spaCy
    nlp = spacy.load('en_core_web_sm')
    sents = [s.text.strip() for s in nlp(article).sents]

    # Vectorize questions and sentences together so they share one vocabulary
    tfIdf = compute_tfidf(qs+sents)
    qsVector = tfIdf[:len(qs)]
    sentsVector = tfIdf[len(qs):]

    # Cosine similarity = 1 - cosine distance
    sim = 1-pairwise_distances(qsVector, sentsVector, metric='cosine')
    solutions = []
    for i in range(len(qs)):
        # The matched answer is the sentence with the highest similarity
        sentInd = np.argmax(sim[i])
        solutions.append(["Question: %s"%qs[i], "Answer: %s"%sents[sentInd]])

    return solutions

In [20]:
# Test the system
np.set_printoptions(precision=10)
find_solutions(qs, article)

[['Question: What age group has the highest rate of severe outcomes?',
  'Answer: Reported illnesses have ranged from very mild (including some with no reported symptoms) to severe, including illness resulting in death.'],
 ['Question: How is COVID-19 spread?',
  'Answer: If you have been in China or another affected area or have been exposed to someone sick with COVID-19 in the last 14 days, you will face some limitations on your movement and activity.'],
 ['Question: How many states in the U.S. have reported cases of COVID-19?',
  'Answer: All 50 states have reported cases of COVID-19 to CDC.'],
 ['Question: When did the White House launch the "15 Days to Slow the Spread" program?',
  'Answer: CDC Recommends\nEveryone can do their part to help us respond to this emerging public health threat:\nOn March 16, the White House announced a program called “15 Days to Slow the Spread,”pdf iconexternal icon which is a nationwide effort to slow the spread of COVID-19 through the implementation

In [26]:
print(qs[1])
token = tokenize(article)
en = spacy.load('en_core_web_sm')

print([s for s in en(article).sents][len(qs)])

How is COVID-19 spread?

U.S. COVID-19 cases include:
Imported cases in travelers
Cases among close contacts of a known case
Community-acquired cases where the source of the infection is unknown.
