# HW3: Natural Language Processing

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description.
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**. 
    - Never ever copy others' work (even with minor modification, e.g. changing variable names)
    - If you generate code using large language models (although this is not encouraged), make sure to adapt the generated code so that it meets all requirements and is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

## Q1: Extract data using regular expression (2 points)
Suppose you have scraped the text shown below from an online source (https://finance.yahoo.com/). Write `a single regular expression` to convert the text into a list of tuples `(Symbol, Last Price, Change, % Change)` as shown below.


In [7]:
import pandas as pd
import nltk
from sklearn.metrics import pairwise_distances
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import normalize
import re
import json
import pprint as pp
import spacy
from collections import defaultdict

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [8]:
text='''BTC-USD

Bitcoin USD
	63,473.52	+1,498.26	+2.42%
ETH-USD

Ethereum USD
	3,471.85	+39.03	+1.14%
USDT-USD

Tether USDt USD
	1.00	-0.00	-0.01%
BNB-USD

BNB USD
	414.75	+4.12	+1.00%
SOL-USD

Solana USD
	128.86	-1.31	-1.01%'''



In [9]:
pattern = r'([A-Z]+-USD)\n\n[^\n]*\n\t([0-9,]+\.[0-9]+)\t([\+-][0-9,]+\.[0-9]+)\t([\+-][0-9]+\.[0-9]+%)'
matches = re.findall(pattern, text)
matches

[('BTC-USD', '63,473.52', '+1,498.26', '+2.42%'),
 ('ETH-USD', '3,471.85', '+39.03', '+1.14%'),
 ('USDT-USD', '1.00', '-0.00', '-0.01%'),
 ('BNB-USD', '414.75', '+4.12', '+1.00%'),
 ('SOL-USD', '128.86', '-1.31', '-1.01%')]
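
To make the structure of the pattern above easier to read, the same expression can be written in verbose mode with one comment per capture group. This is only a sketch: the name `verbose_pattern` and the two-row `sample` string are illustrative assumptions, not part of the assignment.

```python
import re

# Hypothetical two-row sample in the same layout as the scraped text.
sample = (
    "BTC-USD\n\nBitcoin USD\n\t63,473.52\t+1,498.26\t+2.42%\n"
    "SOL-USD\n\nSolana USD\n\t128.86\t-1.31\t-1.01%"
)

# re.VERBOSE ignores literal whitespace in the pattern and allows comments.
verbose_pattern = re.compile(r"""
    ([A-Z]+-USD)              # symbol such as BTC-USD
    \n\n[^\n]*\n              # blank line, then the full-name line (not captured)
    \t([0-9,]+\.[0-9]+)       # last price, possibly with thousands separators
    \t([+-][0-9,]+\.[0-9]+)   # signed change
    \t([+-][0-9]+\.[0-9]+%)   # signed percentage change
""", re.VERBOSE)

rows = verbose_pattern.findall(sample)
print(rows)
```

Because the pattern has four capture groups, `findall` returns one 4-tuple per matched row, which is the list-of-tuples shape the question asks for.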

In [63]:
result = re.findall(pattern, text)

pp.pprint(result)

[('BTC-USD', '63,473.52', '+1,498.26', '+2.42%'),
 ('ETH-USD', '3,471.85', '+39.03', '+1.14%'),
 ('USDT-USD', '1.00', '-0.00', '-0.01%'),
 ('BNB-USD', '414.75', '+4.12', '+1.00%'),
 ('SOL-USD', '128.86', '-1.31', '-1.01%')]


## Q2: Develop a QA system (8 points)


Objective: Find the sentence in an article that best answers a question. A dataset has been provided. Please follow the instructions below carefully to develop this system.

In [10]:
# Use a context manager so the file is closed after loading
with open("qa.json", "r") as f:
    data = json.load(f)
data[5]

{'text': 'In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise". It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks. It was the #1 single of 1995 for all genres, and was a global hit, as it reached #1 in the United States, United Kingdom, Ireland, France, Germany, Italy, Sweden, Austria, Netherlands, Norway, Switzerland, Australia, and New Zealand. The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta\'s Paradise", titled "Amish Paradise". At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.  Originally "Gangsta\'s Paradise" was not meant to be included on one of Coolio\'s studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track. The title track sampled the chorus and music

In [11]:
# randomly select one article to test your code

idx = 5

text = data[idx]["text"]
text

qs = [item["question"] for item in  data[idx]['qa']]
qs

ans =[item["answer"] for item in  data[idx]['qa']]
ans


'In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise". It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks. It was the #1 single of 1995 for all genres, and was a global hit, as it reached #1 in the United States, United Kingdom, Ireland, France, Germany, Italy, Sweden, Austria, Netherlands, Norway, Switzerland, Australia, and New Zealand. The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta\'s Paradise", titled "Amish Paradise". At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.  Originally "Gangsta\'s Paradise" was not meant to be included on one of Coolio\'s studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track. The title track sampled the chorus and music of the s

["What was the relationship between Coolio and Gangsta's parapdise?",
 'WHen was the song released?',
 'Which record label release the song?',
 'Did the song have a high sales?',
 'Did he wind any award?',
 'Which other names were mention n the song?',
 'What was their contribution to the song?',
 'Which other song did he make?']

['Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise',
 'In 1995,',
 'RIAA.',
 'It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks.',
 'At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.',
 'Too Hot" with J.T. Taylor of Kool & the Gang doing the chorus.',
 'J.T. Taylor of Kool & the Gang doing the chorus.',
 "Sumpin' New"]

### **Q2.1.** Tokenize function (3 points)

Define a function `tokenize(doc, lemmatized = True, remove_stopword = True)`  as follows: 

   - Take three parameters: 
       - `doc`: an input string (e.g. a question)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized). 
       - `remove_stopword`: an optional boolean parameter to remove stop words. The default value is True (i.e. remove stop words). 
   - First split the text into sentences.
   - Split each sentence into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `remove_stopword` is turned on, remove all stop words.
   - Convert all unigrams to lower case and remove punctuation and empty tokens
   - Count the frequency of each word in each sentence and save the result into a dictionary (see sample output)
   - Return the resulting **sentences** and **dictionary** after all the processing. 
   
   
(Hint: you can use the spaCy package for this task. For reference, check https://spacy.io/api/token#attributes)

In [12]:
# Step 1: split the text into sentences
print(text + '\n' +' '.join(qs))
sentences = re.split(r'(?<=[.!?])\s+', text + '\n' +' '.join(qs))

In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta's Paradise". It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks. It was the #1 single of 1995 for all genres, and was a global hit, as it reached #1 in the United States, United Kingdom, Ireland, France, Germany, Italy, Sweden, Austria, Netherlands, Norway, Switzerland, Australia, and New Zealand. The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta's Paradise", titled "Amish Paradise". At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.  Originally "Gangsta's Paradise" was not meant to be included on one of Coolio's studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track. The title track sampled the chorus and music of the song "

In [13]:
nlp = spacy.load('en_core_web_sm')
# Use spaCy's built-in English stop-word list from the loaded pipeline
stop_words = nlp.Defaults.stop_words

def tokenize(doc, lemmatized=True, remove_stopword=True):
    # Split the document into sentences
    sentences = [sent.text for sent in nlp(doc).sents]

    # Initialize vocabulary dictionary
    vocab = defaultdict(lambda: defaultdict(int))

    # Process each sentence
    for i, sentence in enumerate(sentences):
        # Tokenize the sentence into unigrams
        tokens = []
        for token in nlp(sentence):

            # Stop words removal
            if remove_stopword and token.text.lower() in stop_words:
                continue

            # Lemmatization
            if lemmatized:
                token_text = token.lemma_.lower()
            else:
                token_text = token.text.lower()

            # Lowercasing and removal of punctuations and empty tokens
            if not token.is_punct and token_text.strip():
                tokens.append(token_text)

        # Count word frequency in the current sentence
        word_frequency = defaultdict(int)
        for token in tokens:
            word_frequency[token] += 1

        # Save word frequency to vocabulary dictionary
        vocab[i] = word_frequency

    return dict(vocab), sentences

#### Lemmatized=True, remove_stopword=True

In [44]:
print("1.lemmatized=True, remove_stopword=True\n")

# concatenate questions to the text and tokenize together
vocab, sents = tokenize(text + '\n' +' '.join(qs), lemmatized=True, remove_stopword=True)
pp.pprint(vocab)
pp.pprint(sents)

1.lemmatized=True, remove_stopword=True

{0: defaultdict(<class 'int'>,
                {'1995': 1,
                 'coolio': 1,
                 'dangerous': 1,
                 'feature': 1,
                 'gangsta': 1,
                 'lv': 1,
                 'minds': 1,
                 'movie': 1,
                 'paradise': 1,
                 'r&b': 1,
                 'singer': 1,
                 'song': 1,
                 'title': 1}),
 1: defaultdict(<class 'int'>,
                {'1': 1,
                 '100': 1,
                 '3': 1,
                 'billboard': 1,
                 'hot': 1,
                 'rap': 1,
                 'reach': 1,
                 'song': 1,
                 'successful': 1,
                 'time': 1,
                 'week': 1}),
 2: defaultdict(<class 'int'>,
                {'1': 2,
                 '1995': 1,
                 'australia': 1,
                 'austria': 1,
                 'france': 1,
                 'genre': 1,
                 'germany': 1,
  

#### Lemmatized=True, remove_stopword=False

In [9]:
# Test another configuration

print("2.lemmatized=True, remove_stopword=False\n")
vocab, sents = tokenize(text + '\n' +' '.join(qs), lemmatized=True, remove_stopword=False)
pp.pprint(vocab)

2.lemmatized=True, remove_stopword=False

{0: defaultdict(<class 'int'>,
                {"'s": 1,
                 '1995': 1,
                 'a': 1,
                 'coolio': 1,
                 'dangerous': 1,
                 'feature': 1,
                 'for': 1,
                 'gangsta': 1,
                 'in': 1,
                 'lv': 1,
                 'make': 1,
                 'minds': 1,
                 'movie': 1,
                 'paradise': 1,
                 'r&b': 1,
                 'singer': 1,
                 'song': 1,
                 'the': 1,
                 'title': 1}),
 1: defaultdict(<class 'int'>,
                {'1': 1,
                 '100': 1,
                 '3': 1,
                 'all': 1,
                 'become': 1,
                 'billboard': 1,
                 'for': 1,
                 'hot': 1,
                 'it': 1,
                 'most': 1,
                 'of': 2,
                 'on': 1,
                 'one': 1,
                 'rap': 1,
             

### **Q2.2.** Compute TF-IDF (1 point)

Define a function `compute_tfidf(vocab)` as follows: 

- Take the dictionary returned in Q2.1 as an input.
- Calculate tf_idf weights as shown in lecture notes (Hint: feel free to reuse the code segment in NLP Lecture Notes (II))
- Return the smoothed normalized `tf_idf` array and the words corresponding to the columns of the tfidf array.
 

In [15]:
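def compute_tfidf(vocab):
    # Build the document-term matrix: one row per sentence, one column per word
    dtm = pd.DataFrame.from_dict(vocab, orient="index")
    dtm = dtm.fillna(0)
    dtm = dtm.sort_index(axis=0)

    # Term frequency: word counts divided by sentence length
    tf = dtm.values
    doc_len = tf.sum(axis=1, keepdims=True)
    tf = np.divide(tf, doc_len)

    # Document frequency indicator: 1 where a word occurs in a sentence
    df = np.where(tf > 0, 1, 0)

    # Smoothed IDF, then L2-normalize each row as the task requires
    smoothed_idf = np.log(np.divide(len(vocab) + 1, np.sum(df, axis=0) + 1)) + 1
    smoothed_tf_idf = normalize(tf * smoothed_idf)

    words = list(dtm)

    return smoothed_tf_idf, words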

In [54]:
tfidf, words = compute_tfidf(vocab)

# show shape of tfidf matrix
tfidf.shape

(21, 150)
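
As a small worked illustration of the smoothed, normalized TF-IDF weighting described in Q2.2, the sketch below applies it to a three-sentence toy dictionary using NumPy only. The dictionary `toy_vocab` and all variable names are assumptions made up for this example; the smoothing formula is the one used in the cell above, `log((N + 1) / (df + 1)) + 1`.

```python
import numpy as np

# Hypothetical per-sentence word counts, shaped like the `vocab` dictionary from Q2.1.
toy_vocab = {
    0: {"song": 2, "rap": 1},
    1: {"song": 1, "award": 1},
    2: {"award": 1, "grammy": 1},
}

# Fixed column order so the matrix columns can be mapped back to words.
words = sorted({w for counts in toy_vocab.values() for w in counts})
counts = np.array(
    [[toy_vocab[i].get(w, 0) for w in words] for i in sorted(toy_vocab)],
    dtype=float,
)

tf = counts / counts.sum(axis=1, keepdims=True)       # term frequency per sentence
df = (counts > 0).sum(axis=0)                         # sentences containing each word
idf = np.log((len(toy_vocab) + 1) / (df + 1)) + 1     # smoothed inverse document frequency
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(words)
print(tfidf.round(3))
```

Words that occur in fewer sentences (here `grammy` and `rap`) receive a larger IDF than words shared across sentences (`song`, `award`), and every row has unit length after normalization.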

### **Q2.3.** Put everything together to match questions and answers. (4 points)


Define a function `Match(text, questions, lemmatized = True, remove_stopword = True, topK = 3)` as follows: 
- Take five inputs:
   - `text`: a paragraph
   - `questions`: a list of questions
   - `lemmatized, remove_stopword`: same as defined in Q2.1
   - `topK`: the number of top-ranked answers to return for each question
- Tokenize the concatenated text and questions using the `tokenize` function as defined in Q2.1.
- Calculate the smoothed normalized tf_idf matrix for the concatenated text
- Split the tf_idf matrix into sub-matrices for the text and questions respectively
- For each question, find the top-K sentences that may answer it based on the TF-IDF similarities between the question and the sentences
- Return the matched top-K sentences of each question.


**Analysis (1 point)**


TF-IDF similarity may fail to find the correct answer to some questions. Based on your analysis, answer the following questions:
- What kinds of questions cannot be answered correctly by this method?
- What are possible solutions to these issues? Discuss your ideas; you do not have to implement them.

In [52]:
def Match(text, questions, lemmatized = True, remove_stopword = True, topK = 3):
    # Tokenize the text and the questions together so they share one vocabulary
    vocab, sents = tokenize(text + '\n' + ' '.join(questions),
                            lemmatized=lemmatized, remove_stopword=remove_stopword)
    tfidf, words = compute_tfidf(vocab)

    # Rows for the text sentences come first, followed by the question rows
    sentences = [sent.text for sent in nlp(text).sents]
    tfidf_text = tfidf[0:len(sentences)]
    tfidf_question = tfidf[len(sentences):]

    # Cosine similarity between every question and every text sentence
    similarities = 1 - pairwise_distances(tfidf_question, tfidf_text, metric = 'cosine')

    # For each question, keep the topK most similar sentences
    answers = []
    for i in range(len(tfidf_question)):
        topk_index = np.argsort(similarities[i])[::-1][0:topK]
        answers.append([sentences[index] for index in topk_index])

    return answers
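
The ranking step inside `Match` can be illustrated in isolation. The sketch below assumes small hand-made vectors in place of real TF-IDF rows; the helper name `top_k_matches` and the toy arrays are hypothetical and only show how cosine similarity plus `argsort` selects the top-K sentence indices for each question.

```python
import numpy as np

def top_k_matches(question_vecs, sentence_vecs, k=3):
    """Return, for each question row, the indices of the k most similar sentence rows."""
    # Normalize rows so the dot product equals cosine similarity.
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    sims = q @ s.T
    # Sort by descending similarity and keep the first k columns.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy stand-ins for the sentence and question sub-matrices.
sentence_vecs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [1.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0]])
question_vecs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 0.2, 1.0]])

top = top_k_matches(question_vecs, sentence_vecs, k=2)
print(top)
```

The first question points in the same direction as sentence 0 and partly overlaps sentence 2, so those two indices are returned first; the second question is closest to sentence 3.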

In [53]:
answers = Match(text, qs, lemmatized = True, remove_stopword = True)
for q, a1, a2 in zip(qs, answers, ans):
    print(f'Q:\t{q}\n\nA:\t{a1}\n\nCorrect:\t{a2}\n')

Q:	What was the relationship between Coolio and Gangsta's parapdise?

A:	['Originally "Gangsta\'s Paradise" was not meant to be included on one of Coolio\'s studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track.', 'In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise".', 'The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta\'s Paradise", titled "Amish Paradise".']

Correct:	Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta's Paradise

Q:	WHen was the song released?

A:	["The album Gangsta's Paradise was released in 1995 and was certified 2X Platinum by the RIAA.", 'In 1996, Coolio had another top 40 hit with the song "It\'s All the Way Live (Now)" from the soundtrack to the movie Eddie.', 'It would become one of the most succe

TF-IDF similarity performs poorly on questions that share few words with their answer sentence or that depend on context. It scores a question against a sentence purely by the terms they have in common, weighting each term by its frequency and rarity in the corpus, and it ignores the context in which terms appear. It also does not capture the semantic meaning of words: synonyms and paraphrases are treated as unrelated terms, and a pronoun such as "he" in "Did he wind any award?" cannot be linked to "Coolio". As a result, it cannot effectively capture the relationships between words or phrases in the document.

One possible solution is to use word embeddings such as Word2Vec or GloVe, which capture semantic similarity between words and phrases and therefore allow more accurate similarity calculations. Another is to combine TF-IDF with other similarity measures or machine learning techniques to leverage the strengths of each method and improve overall performance.

## **Q3.** (Bonus 2 points)


Implement a function `match_by_wv(text, questions, lemmatized = True, remove_stopword = True, topK = 3)` to find topK answers to a question by the similarity of word vectors.
- For each key word in the question, find the best matched word in a candidate answer by the cosine similarity between the word vectors
- Calculate the match score of the answer as the mean of the cosine similarities of the best match words
- Return the answers with the topK largest match score.


Hint: feel free to use pretrained word vectors.


In [None]:
def match_by_wv(text, questions, 
                lemmatized = True, 
                remove_stopword = True, 
                topK = 3):
    
    answers = None
    
    # add your code
    
    return answers
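
The scoring rule described in Q3 can be sketched on its own before it is wired into `match_by_wv`. The example below is not a full solution: it assumes tiny hand-made 2-d vectors in `toy_vectors` instead of pretrained word vectors, and the helper names `cosine`, `match_score`, and `rank_answers` are hypothetical. It shows only the "best match per question word, then mean" score and the top-K ranking.

```python
import numpy as np

# Hypothetical 2-d "word vectors" for illustration only; real vectors would
# come from a pretrained model.
toy_vectors = {
    "award":  np.array([1.0, 0.1]),
    "grammy": np.array([0.9, 0.2]),
    "win":    np.array([0.8, 0.6]),
    "song":   np.array([0.1, 1.0]),
    "album":  np.array([0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_score(question_words, answer_words, vectors):
    """Mean over question words of the best cosine similarity to any answer word."""
    best = []
    for q in question_words:
        if q not in vectors:
            continue  # skip words without a vector
        sims = [cosine(vectors[q], vectors[a]) for a in answer_words if a in vectors]
        if sims:
            best.append(max(sims))
    return float(np.mean(best)) if best else 0.0

def rank_answers(question_words, candidates, vectors, topK=3):
    """Return (candidate index, score) pairs for the topK highest-scoring candidates."""
    scores = [match_score(question_words, c, vectors) for c in candidates]
    order = np.argsort(scores)[::-1][:topK]
    return [(int(i), scores[i]) for i in order]

# Two tokenized candidate answers; the second mentions an award-like word.
candidates = [["song", "album"], ["grammy", "win"]]
ranking = rank_answers(["award"], candidates, toy_vectors, topK=2)
print(ranking)
```

With these toy vectors, the question word "award" is closest to "grammy", so the second candidate ranks first even though the two share no exact word, which is the behavior TF-IDF matching cannot provide.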