IE 7374 Assignment4

Weihua Pan

In [96]:
import numpy as np
import pandas as pd

# For web scraping
import requests
from bs4 import BeautifulSoup

# extract text from pdf
import fitz

from pprint import pprint

# data preprocessing
import spacy
nlp = spacy.load('en_core_web_md')

# for calculate cos similarity
from scipy import spatial

# openai api 
import os
from openai import OpenAI
MODEL = "gpt-3.5-turbo"
client = OpenAI()

# for counting token
import tiktoken

# dataframe process pandas
import pandas as pd

# Part 1: Data Collection and Preparation

## Task 1.1: 
Collect relevant data about COVID-19 from Wikipedia and select PDF files containing scientific articles, guidelines, or reports about COVID-19.

In [97]:
def scrap_wiki(url,remove_sup=True):
    """given wiki url, scrap main content and title

    Args:
        url (string): url for wiki
        remove_sup (bool, optional): remove superscript reference like [123] in the page . Defaults to True.

    Returns:
        String: text output 
    """
    
    # get URL
    page = requests.get(url)
    
    # scrape webpage
    soup = BeautifulSoup(page.content, 'html.parser')

    # store wikipedia article
    text = "Article Title: "

    # Get the title of the article
    title = soup.find('h1', id='firstHeading').text
    text += title + "\n\n"

    # Get Wikipedia's main content
    main_content = soup.find('div', id="mw-content-text").find('div', class_="mw-parser-output")

    # Remove all <sup> elements to exclude references or superscript texts
    if remove_sup:
        for sup in main_content.find_all('sup'):
            sup.decompose()
        
    # Include headings and paragraphs
    for element in main_content.descendants:
        if element.name in ['h2', 'h3', 'h4', 'h5', 'h6']:
            heading = element.text.strip()
            text += '\n' + heading + '\n\n'
        elif element.name == 'p':
            paragraph = element.text.strip()
            text += paragraph + '\n'
            
    return text

In [98]:
def extract_text_pdf(path):
    """using PyMuPDF to extract text from pdf

    Args:
        path (string): file path for the pdf
    """     
    # Open the PDF file
    pdf_document = fitz.open(path)

    # Initialize a variable to hold the extracted text
    pdf_text = ''

    # Iterate over each page in the PDF
    for page_num in range(len(pdf_document)):
        # Get the page
        page = pdf_document.load_page(page_num)
        
        # Extract text from the page and add it to the pdf_text variable
        pdf_text += page.get_text() + '\n'

    # Close the document
    pdf_document.close()

    # pdf_text now contains all the text extracted from the PDF
    return pdf_text

In [99]:
page = requests.get("https://en.wikipedia.org/wiki/COVID-19")

# scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')

# store wikipedia article
text = "Article Title: "

# Get the title of the article
title = soup.find('h1', id='firstHeading').text
text += title + "\n\n"

# get Wikipedia's main content 
main_content = soup.find('div', id="mw-content-text").find('div',class_="mw-parser-output")

# Remove all <sup> elements to exclude references or superscript texts
for sup in main_content.find_all('sup'):
    sup.decompose()

## Task 1.2: 
Extract text from the collected sources. For Wikipedia, use web scraping techniques. For PDF files, use PDF parsing libraries like PyMuPDF or PyPDF2 in Python.

In [100]:
wiki_url = ["https://en.wikipedia.org/wiki/COVID-19",
            "https://en.wikipedia.org/wiki/COVID-19_pandemic",]


pdf_path = ["pdf/COVID-19.pdf"]

In [101]:
# store the scrape text from wiki to .txt files
for i,url in enumerate(wiki_url):
    with open(f"wiki{i}.txt",mode='w') as text_file:
        text_file.write(scrap_wiki(url))

In [102]:
# extract pdf txt and store in local txt file
pdf_txt = extract_text_pdf('pdf/COVID-19.pdf')
with open(f'pdf1.txt','w') as file:
    file.write(pdf_txt)

In [103]:
def read_txt(file_path):
    ''' read a txt file by given file path
    '''
    file = open(file_path, "r")
    content = file.read()
    file.close()
    return content


## Task 1.3: 
Pre-process the extracted text to clean and prepare it for embedding. This includes tasks like removing special characters, stopwords, and stemming or lemmatization.

In [104]:
def preprocess_data(text):
    """
    Preprocess the given text using spaCy: tokenize into sentences,
    remove stopwords, and apply lemmatization.
    
    Parameters:
    - text (str): The input text to preprocess.
    
    Returns:
    - List of lists: A list of sentences, where each sentence is a list of lemmatized tokens.
    """
    # Process the text with the loaded model
    doc = nlp(text.lower().replace('\n',''))
    
    # Tokenize the text into sentences and remove stopwords,
    # then lemmatize the tokens
    sentences = []
    for sent in doc.sents:
        # Filter out stopwords and punctuation, then lemmatize the rest
        tokens = [token.lemma_ for token in sent if not token.is_stop and not token.is_punct]
        sentences.append(' '.join(tokens))
    
    return sentences


In [105]:
def num_tokens(text: str, model: str = MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

In [106]:
# read all txt file
wiki0 = read_txt('wiki0.txt')
wiki1 = read_txt('wiki1.txt')
pdf1 = read_txt('pdf1.txt')

In [107]:
# concatenate all document into one list and preprocess them
docs = [wiki0,wiki1,pdf1]
sentences = [preprocess_data(doc) for doc in docs]

# Part 2: Understanding GPT Embeddings

## Task 2.1:
 Study and summarize how GPT and similar transformer models generate embeddings, focusing on representing words and sentences in a high-dimensional space.

1. Word2Vec: Use neural network to find the embedding vector. Usually can be trained by 2 approach CBOW, skip-gram. 
* Advantage: Dense embedding vector compare with TF-IDF embedding. Fast and easy to train.
* Drawback: Static embedding (same word have one embedding vector), not positional encoding, contextless embedding. Therefore this embedding is not suited for transformer model

2. Contextual embedding:
    1. initialize token embedding: GPT use sub-word tokenizer, and use BPE(Byte pair encoding)
    2. use positional encoding
* Advantage: improve Contextual Understanding. able to handle Polysemy. And can do transfer learning (fine-tuning)
* Disadvantage: require more computational resource. High complexity, train with more resource and time. Overfiting in small dataset.

## Task 2.2: 
Use a pre-trained GPT model (like GPT-2 or GPT-3) from libraries such as Hugging Face’s Transformers to generate embeddings for the pre-processed text.

In [108]:
def get_embedding(text, model="text-embedding-3-small"):
   '''Get embedding by GPT model
   '''
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding


In [109]:
df = pd.DataFrame(columns=['Data source','text','embedding'])
i = 0
for doc_index,doc in enumerate(sentences):
    for sent in doc:
        df.loc[i] = doc_index , sent, get_embedding(sent)
        i+=1

In [110]:
df

Unnamed: 0,Data source,text,embedding
0,0,article title covid-19coronavirus,"[0.0013947351835668087, -0.05097736418247223, ..."
1,0,disease 2019 covid-19 contagious disease cause...,"[0.001791783724911511, -0.04623255878686905, -..."
2,0,know case identify wuhan china december 2019,"[-0.05436304956674576, -0.04535301774740219, 0..."
3,0,disease quickly spread worldwide result covid-19,"[-0.005425873212516308, -0.03965173289179802, ..."
4,0,pandemic.the symptom covid‑19 variable include...,"[-0.020816093310713768, 0.0005623253528028727,..."
...,...,...,...
844,2,learn,"[-0.007470429874956608, -0.03658827394247055, ..."
845,2, ask health care provider,"[0.006137406919151545, -0.0304762814193964, 0...."
846,2, local state health department,"[-0.04278329387307167, -0.013517333194613457, ..."
847,2, visit website food drug administration fda c...,"[0.004667202942073345, -0.04801260679960251, -..."


# Part 3: Building the Question-Answering System|

## Task 3.1: 
Implement a search mechanism to find the most relevant text passages to a given query. This can involve calculating similarity scores between the query and document embeddings.

In [111]:
def search(query : str,
           df : pd.DataFrame,
           top_n : int,
           MODEL : str = "text-embedding-3-small",
           relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y)):
    """search top n similar sentences with the query

    Args:
        query (str): Question
        df (pd.DataFrame): dataframe of sentences and its embedding vector
        top_n (int): number of sentences you want to get
        MODEL (str): name of GPT model
        relatedness_fn (_type_, optional): _description_. Defaults to lambdax.
    """
    
    # get query embedding
    query_embedding_response = client.embeddings.create(
        model=MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    
    # calculate cos similarity between query and sentences
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    
    # sort by similarity score
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [112]:
search("How many people died in US?",
       df,
       10)

(('official death count typically include people die test positive',
  'datum estimate true number death covid-19 worldwide include range 18.2 33.5 \xa0 million ≈27.4 \xa0 million 18 november 2023 economist 18.5 \xa0 million 1 april 2023 institute health metric evaluation ≈18.2 \xa0 million early death 1 january 2020 31 december 2021 comprehensive international study',
  'estimate reporting 22.62 total report covid-19 mortality 2020',
  '30 october worldwide daily death toll 424 low 385 death report 12 march 2020',
  '19 june 2020 country report millionth case nearly 49,000 report death',
  '6 march report total worldwide death count surpass 6 \xa0 million people',
  '10 july new york city population 8.4 \xa0 million 23,377 individual 18,758 confirm 4,619 probable die covid‑19 0.3 population',
  'base johns hopkins university statistic global death case ratio 1.02 6,881,955/676,609,955 10 march 2023',
  'case report north american country saint kitts nevis confirm case 25 march north a

The sentences with high cos similarity are considered as relevant in embedding space.

## Task 3.2:
 Design a method to extract or generate answers from the selected passages. This could involve fine-tuning a pre-trained model on a question-answering dataset or using heuristic methods to select portions of text as answers.

In [113]:
def construct_query(query:str,
                    df: pd.DataFrame,
                    max_token:int,
                    MODEL= "text-embedding-3-small",):
    """ Search relevant content about the question, create query with content for GPT
    """
    text, score = search(query,df,max_token-150,MODEL)
    
    introduction = 'Use the below information on the COVID-19 to answer the subsequent question. If the answer cannot be found in the information, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    
    # iterate top_n sentences given by search(), adding sentences to the query if not exceeding the max_token
    for sent in text:
        next_sent = f"\n\nSentences on Wikipedia:\n{sent}\n"
        if num_tokens(message + next_sent + question,model=MODEL) > max_token:
            break
        else:
            message += next_sent
    
    return message + question

# Part 4: Evaluation and Testing

## Task 4.1:
Create a test set of questions about COVID-19 that can be answered using the collected data.

In [114]:
questions = ["How many people died in US cause by COVID-19?",
            "How many people died in the world cause by COVID-19?",
            "When did the first case of COVID-19 found?",
            "Where did the first case of COVID-19 found?",
            "How long does it take to recover from COVID 19?",
            "What did the first vaccine for COVID-19 come out?",
            "Which company made the first vaccine?",
            "Which company made the first vaccine for COVID19?",
            "How many people around the world are infected with COVID 19",
            "Where do COVID-19 come from?",
            "What is the fatality rate of COVID 19",
            "When is the first person died in US cause by COVID-19?",
            ]

In [115]:
def ask(query,
        df = df,
        MODEL= "gpt-3.5-turbo",
        token_budget = 2000,
        print_message=False):
        ''' construct query by search relevant content and pass to GPT model.
        return GPT's response
        '''
        # get the query with content
        query_with_content = construct_query(query,df,2000)
        if print_message:
                print(query_with_content)
        
        # construct messages for GPT model and receive response.
        messages = [
                {'role':'system','content':'You answer questions about the COVID-19.'},
                {"role": "user", "content": query_with_content},
        ]
        response = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                temperature=0
        )
        response_message = response.choices[0].message.content
        return query_with_content, response_message
        

In [116]:
query_content = []
GPT_answer = []

# put the query_content and GPT's response to the list for troubleshooting
for question in questions:
    query, answer = ask(question)
    query_content.append(query)
    GPT_answer.append(answer)

In [117]:
# manually collecting answer
True_answer =["Per Our World in Data, 103,436,829 confirmed cases have been reported in the United States with 1,177,223 deaths, the most of any country,",
             "estimated 18.2m - 33.5m",
             "December 2019",
             "Wuhan, China",
             "1 week",
             "Pfizer tested on NOV,2020 and get approved by England on DEC,2020. Base on the event of timeline. This can be inferred.",
             "Pfizer. Can be inferred.",
             "Pfizer. Can be inferred.",
             "As of 14 April 2022, over 500 million cases were confirmed globally.",
             "Some evidences shows it come from bats",
             "1.02%",
             "February 6, 2020."]

## Task 4.2: 
Evaluate the system’s performance by measuring the accuracy of its answers against a set of reference answers. Consider metrics like F1 score, precision, and recall.

In [118]:
# put everything in one dataframe
QandA = pd.DataFrame(data= {"question": questions,
                            "query_with_content": query_content,
                            "Actual_answer": True_answer,
                            "GPT_answer": GPT_answer
                            })

In [119]:
QandA

Unnamed: 0,question,query_with_content,Actual_answer,GPT_answer
0,How many people died in US cause by COVID-19?,Use the below information on the COVID-19 to a...,"Per Our World in Data, 103,436,829 confirmed c...",I could not find an answer.
1,How many people died in the world cause by COV...,Use the below information on the COVID-19 to a...,estimated 18.2m - 33.5m,The estimated number of deaths worldwide direc...
2,When did the first case of COVID-19 found?,Use the below information on the COVID-19 to a...,December 2019,The first case of COVID-19 was found in Decemb...
3,Where did the first case of COVID-19 found?,Use the below information on the COVID-19 to a...,"Wuhan, China","The first case of COVID-19 was found in Wuhan,..."
4,How long does it take to recover from COVID 19?,Use the below information on the COVID-19 to a...,1 week,"mild cases typically recover within a week, wh..."
5,What did the first vaccine for COVID-19 come out?,Use the below information on the COVID-19 to a...,"Pfizer tested on NOV,2020 and get approved by ...",The first COVID-19 vaccine came out in Decembe...
6,Which company made the first vaccine?,Use the below information on the COVID-19 to a...,Pfizer. Can be inferred.,I could not find an answer.
7,Which company made the first vaccine for COVID19?,Use the below information on the COVID-19 to a...,Pfizer. Can be inferred.,I could not find an answer.
8,How many people around the world are infected ...,Use the below information on the COVID-19 to a...,"As of 14 April 2022, over 500 million cases we...",I could not find an answer.
9,Where do COVID-19 come from?,Use the below information on the COVID-19 to a...,Some evidences shows it come from bats,COVID-19 is believed to have originated from a...


In [120]:
for q, a1, a2 in zip(questions,GPT_answer, True_answer):
    print(f"Question: {q}")
    print(f"GPT_answer: {a1}")
    print(f"True_answer: {a2}\n")

Question: How many people died in US cause by COVID-19?
GPT_answer: I could not find an answer.
True_answer: Per Our World in Data, 103,436,829 confirmed cases have been reported in the United States with 1,177,223 deaths, the most of any country,

Question: How many people died in the world cause by COVID-19?
GPT_answer: The estimated number of deaths worldwide directly caused by COVID-19 ranges from 18.2 to 33.5 million, with an approximate figure of 27.4 million as of November 18, 2023.
True_answer: estimated 18.2m - 33.5m

Question: When did the first case of COVID-19 found?
GPT_answer: The first case of COVID-19 was found in December 2019.
True_answer: December 2019

Question: Where did the first case of COVID-19 found?
GPT_answer: The first case of COVID-19 was found in Wuhan, China in December 2019.
True_answer: Wuhan, China

Question: How long does it take to recover from COVID 19?
GPT_answer: mild cases typically recover within a week, while severe or critical cases may take s

1. GPT cannot find the answer, a sentence mentions 1,177,223 deaths in the United America
2. GPT retrieve the info successfully
3. Correct
4. Correct
5. Correct
6. Correct
7. False
8. False
9. False
10. Correct
11. False, GPT summaries how to calculate the fatality rate but doesn't provide the exact number.
12. Correct 

In [121]:
# use boolean to substitute the answer to calculate the accuracy
True_answer_encode = [True] * len(questions)
GPT_answer_encode = [False,True,True,True,True,True,False,False,False,True,False,True]
QandA['True_answer_encode'] = True_answer_encode
QandA['GPT_answer_encode'] = GPT_answer_encode

In [122]:
from sklearn.metrics import classification_report

print(classification_report(QandA['True_answer_encode'],QandA['GPT_answer_encode']))

              precision    recall  f1-score   support

       False       0.00      0.00      0.00         0
        True       1.00      0.58      0.74        12

    accuracy                           0.58        12
   macro avg       0.50      0.29      0.37        12
weighted avg       1.00      0.58      0.74        12



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The accuracy is 58%. F1 score, precision, and recall is not making sense here since we don't false answer in actual.

# Part 5: Reflection and Improvement

## Task 5.1: 
Analyze the system's strengths and weaknesses, identifying areas where it performs well and struggles.

<h4>Advantage: </h4>

* You can ask questions that happened after the model is trained.
* The search function can find all the relevant passage from embedding vector space. Therefore, we don't need to provide all the information that introduce noise and reduce the number of tokens.


<h4>Disadvantage:</h4>

* lack of ability to infer the answer. For example, the model is able to know when the first vaccine comes out, but it cannot provide the company that made the vaccine.
* It is hard to find out whether GPT use its own knowledge base.

## Task 5.2: 
Propose improvements or alternative approaches to enhance the system's accuracy or efficiency.

* This can be improved by introduce knowledge graphs to incorporate reasoning.
* I should also provide the content source, so I can quickly find where GPT found the information.