<a href="https://colab.research.google.com/github/zackives/upenn-cis5450-hw/blob/main/4_Module_1_Part_2_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing in the Transformers and LLMs Era

As recently as five years ago, machine learning techniques for natural language and the Web were extremely brittle.  They still are not perfect, but thanks to large language models (LLMs) and transformers they are often "good enough" to do real work.  In this notebook we'll try some of these tools: NLTK, Hugging Face transformers, spaCy, and LangChain with Azure OpenAI.

In [None]:
#TODO: use the Azure OpenAI key from Ed Discussion (not the OpenAI one!)
%set_env AZURE_OPENAI_API_KEY=%TODO

In [None]:
!pip install nltk

In [None]:
!pip install langchain langchain-core langchain-community langchain-openai
!pip install chromadb transformers

## Documents as Vectors

Let's split a paragraph into sentences and words, and create a very simple document vector from the word counts.  We'll use a tokenizer from a package called `nltk`.

In [None]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:

paragraph = '''A large language model (LLM) is a language model characterized by
               its large size. Its size is enabled by AI accelerators, which are
               able to process vast amounts of text data, mostly scraped from the Internet.'''

sentences = nltk.sent_tokenize(paragraph)

from nltk.tokenize import word_tokenize

# Accumulate all words, all sentences
all_words = []
for sent in sentences:
  words = word_tokenize(sent)
  all_words.extend([word.lower() for word in words if word.isalpha()])


# Reorder the words in lexicographical order
all_words.sort()
print (all_words)

In [None]:
# Simple function to create a dictionary of word / count
def create_word_count_dict(sorted_list_of_words):
  word_count_dict = {}
  current_word = None
  current_count = 0
  for word in sorted_list_of_words:
    if word != current_word:
      if current_word is not None:
        word_count_dict[current_word] = current_count
      current_word = word
      current_count = 1
    else:
      current_count += 1
  if current_word is not None:
    word_count_dict[current_word] = current_count
  return word_count_dict

print (create_word_count_dict(all_words))
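
The dictionary above maps each word to its count. As a minimal sketch of how that becomes a document vector, the standard library's `collections.Counter` can do the counting, and a sorted vocabulary fixes which position in the vector belongs to which word. The `sample_words` list is a hypothetical stand-in for `all_words`.

```python
from collections import Counter

# Hypothetical stand-in for the `all_words` list built above.
sample_words = ["a", "language", "large", "language", "model", "model", "is", "a"]

# Counter tallies each word in one pass; no pre-sorting is required.
counts = Counter(sample_words)

# Fix an ordering of the vocabulary so that every position in the vector
# always refers to the same word.
vocabulary = sorted(counts)

# The document vector holds the count of each vocabulary word, in order.
doc_vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(doc_vector)
```

Two documents encoded against the same vocabulary can then be compared position by position, which is the basic idea behind bag-of-words representations.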

NLTK allows us to do a lot more, especially based on linguistic cues.  However, let's now switch to some tools that use embeddings and transformers to do our tasks.

## Sentiment Analysis from a Model on HuggingFace

To do sentiment analysis, we'll use a transformer model called *DistilBERT*. DistilBERT, "fine-tuned" on a sentiment analysis task, does a fairly good job of capturing the sentiment of words and sentences. Note that we will be loading the model onto our Colab machine from a model hosting site called Hugging Face.

In [None]:
import os

import pandas as pd

In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")

### Beware Biases on Words from Training on Text

Beware that seemingly neutral statements may be assigned a sentiment, because the terms in them appeared mostly in positive or negative comments in the training text.  For example, young people are often reported to view iPhones in a much more favorable light than Android phones. Perhaps that explains the results below?

In [None]:
sentiment_pipeline('They bought an Android phone')

In [None]:
sentiment_pipeline('They bought an iPhone')

Nonetheless, for the most part transformer-based sentiment analysis works quite well.  Let's see it over product reviews.  Note this is quite expensive computationally!

### Sentiment for a DB of Product Reviews

In [None]:
reviews_df = pd.read_csv('https://storage.googleapis.com/penn-cis5450/GrammarandProductReviews.csv')

In [None]:
snacks_df = reviews_df[reviews_df['categories'].apply(lambda x: 'Snacks,' in x)]

snacks_df

In [None]:
reviews_text_df = snacks_df[['manufacturer','manufacturerNumber','name','reviews.text']].copy()

reviews_text_df

In [None]:
reviews_text_df.dtypes

In [None]:
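# truncation=True cuts off reviews that exceed the model's maximum input length
reviews_text_df['sentiment'] = reviews_text_df['reviews.text'].apply(lambda text: sentiment_pipeline(text, truncation=True))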

reviews_text_df

In [None]:
reviews_text_df['label'] = reviews_text_df['sentiment'].apply(lambda x:x[0]['label'])
reviews_text_df['score'] = reviews_text_df.apply(lambda x:x['sentiment'][0]['score'] if x['label'] == 'POSITIVE' else -x['sentiment'][0]['score'], axis=1)
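
The two lines above turn each prediction into a signed score: the model's confidence if the label is `POSITIVE`, and the negated confidence otherwise. Here is a small sketch of that mapping on its own, using hypothetical predictions in the list-of-dicts shape that the pipeline returns for a single input. The helper name `signed_score` is an assumption made for illustration.

```python
def signed_score(prediction):
    """Map a [{'label': ..., 'score': ...}] prediction to a value in [-1, 1]."""
    top = prediction[0]
    return top["score"] if top["label"] == "POSITIVE" else -top["score"]

# Hypothetical predictions, not real model output.
examples = [
    [{"label": "POSITIVE", "score": 0.98}],
    [{"label": "NEGATIVE", "score": 0.75}],
]

for prediction in examples:
    print(prediction[0]["label"], signed_score(prediction))
```

With this encoding, averaging scores per product places strongly negative products near -1 and strongly positive ones near +1.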


In [None]:
reviews_text_df

In [None]:
reviews_text_df[['manufacturer','manufacturerNumber','name','score']].groupby(
    by=['manufacturer','name','manufacturerNumber']).mean().sort_values(by='score')

In [None]:
reviews_text_df.describe()

## Named Entity Recognition with spaCy

What is a sentence or paragraph talking about?  Knowing the nouns may allow us to understand what's going on, or learn about entities.

For this task, a popular library is *spaCy*. Again, we can install it on our host machine. Installing it will probably require you to restart your kernel; after the restart, you can resume execution from this cell onwards.

In [None]:
!pip install spacy[transformers]
!pip install -U spacy-experimental
!pip install -U spacy-transformers

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
import spacy
from spacy import displacy

In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
text = '''
After standing down from a first attempt Thursday night, SpaceX teams at Cape
Canaveral Space Force Station are now on track to launch a Falcon 9 rocket
carrying 22 Starlink internet satellites at 11:38 p.m. EDT from Launch Complex 40.

An additional launch opportunity for the Starlink 6-16 mission is set for 12:07
a.m. EDT. Saturday. Otherwise, two backup opportunities are available Saturday night,
at 11:13 p.m. and 11:38 p.m. EDT.'''

displacy.render(nlp(text), style='ent', jupyter=True)

In [None]:
displacy.render(nlp(text), style='dep', jupyter=True, options={'compact': True, 'space': 70})

Here are the different entity types in spaCy (from https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218):

```
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including “%”.
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
```

In [None]:
import pandas as pd

words = []
for word in nlp(text).ents:
  words.append({'word': word.text, 'type': word.label_})

pd.DataFrame(words)

### Named Entity Recognition

Let's see how we do, focusing only on "people, places, and things"...

In [None]:
for ent in nlp(text).ents:
  if ent.label_ in ['ORG', 'PERSON', 'PRODUCT', 'NORP', 'FAC', 'GPE']:
    print(ent.text, ent.label_)


... Actually it's not *that* great when you look at the labels.  "Cape Canaveral Space Force Station" should be a FAC, SpaceX should be an ORG, Falcon should be a PRODUCT, etc.

## Zero-Shot Learning

Here we'll use a package called `langchain` to send a question to the GPT Large Language Model.  "Zero-shot learning" simply asks the LLM a question based on what it knows, without giving it any examples of what you expect.

In [None]:
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import PromptTemplate
import os

In [None]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

In [None]:
endpoint = "https://zives-cis5450-openai.openai.azure.com/"
model_name = "gpt-4.1-mini"
deployment = "gpt-4.1-mini"

subscription_key = str(os.getenv('AZURE_OPENAI_API_KEY'))
api_version = "2024-12-01-preview"

llm = AzureChatOpenAI(
    api_version=api_version,
    deployment_name=deployment,
    azure_endpoint=endpoint,
    api_key=subscription_key
)
llm_chain = prompt | llm

In [None]:
question = "What are the main topics of a big data course?"

response = llm_chain.invoke({"question": question})

for sentence in response.content.split('\n'):
  print (sentence)

## Relation Extraction via Azure OpenAI

Relation extraction involves taking text and trying to populate a schema.  Sometimes one must do this via "few-shot" learning (provide a few examples) but for simpler cases zero-shot learning (with the schema) may be adequate.
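
To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch that only builds the prompt text and does not call any model. The schema wording, the helper `build_prompt`, and the people and films named are all made up for illustration.

```python
# A hypothetical target schema, described in plain text for the prompt.
schema = "Return JSON with the fields: actor, character, movie."

# Hypothetical worked examples; few-shot prompting simply prepends these.
examples = [
    ("Jane Doe played Captain Smith in 'Sample Film'.",
     '{"actor": "Jane Doe", "character": "Captain Smith", "movie": "Sample Film"}'),
]

def build_prompt(text, schema, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [schema]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)

text = "John Roe starred as Dr. Example in 'Another Sample'."
zero_shot_prompt = build_prompt(text, schema)
few_shot_prompt = build_prompt(text, schema, examples)

print(zero_shot_prompt)
print("-----")
print(few_shot_prompt)
```

The only difference between the two prompts is the block of worked examples; the schema and the final question are identical.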

Here's an example that uses text copied from an Internet Movie Database (IMDB) poll on the best movie characters.


In [None]:
# Input from IMDB poll on best movie characters, https://www.imdb.com/poll/gBcmBMHGh4k/results?ref_=po_sr

In [None]:
!wget https://storage.googleapis.com/penn-cis5450/imdb-poll.html

We're going to slightly simplify the document, so it costs less to have GPT process it. This step isn't strictly necessary if you have infinite money.

In [None]:
from bs4 import BeautifulSoup

def remove_javascript_from_html(html_content):
    """
    Parses an HTML document and removes all <script> tags and their content.

    Args:
        html_content (str): The HTML document as a string.

    Returns:
        str: The HTML document with all JavaScript removed.
    """
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all <script> tags and remove them
    for script_tag in soup.find_all('script'):
        script_tag.decompose()

    return str(soup)

try:
    with open("imdb-poll.html", "r") as f_in:
        html_from_file = f_in.read()

    html_without_js_from_file = remove_javascript_from_html(html_from_file)

    with open("poll.html", "w") as f_out:
        f_out.write(html_without_js_from_file)

except FileNotFoundError:
    print("Error: imdb-poll.html not found.")
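
To get a feel for how much this kind of cleanup can save, here is a rough, dependency-free sketch. It uses a regular expression as a stand-in for the BeautifulSoup approach above (regexes are not a robust way to parse HTML in general), and the 4-characters-per-token ratio is only a coarse assumption, not a real tokenizer. The sample page and both helper names are hypothetical.

```python
import re

def strip_scripts(html):
    """Remove <script>...</script> blocks with a regex (illustration only)."""
    return re.sub(r"<script\b.*?</script\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)

def rough_token_estimate(text, chars_per_token=4):
    """Very rough estimate; the characters-per-token ratio is an assumption."""
    return len(text) // chars_per_token

# Hypothetical miniature page standing in for imdb-poll.html.
sample_html = (
    "<html><body><p>Poll results</p>"
    "<script>var tracking = 'x'.repeat(1000); console.log(tracking);</script>"
    "</body></html>"
)

cleaned = strip_scripts(sample_html)
print(cleaned)
print(rough_token_estimate(sample_html), "->", rough_token_estimate(cleaned))
```

On real pages, scripts often account for a large share of the characters, so removing them can noticeably reduce the amount of text sent to the model.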

In [None]:
from typing import List, Optional
from pydantic import Field, BaseModel

class Movie(BaseModel):
    ranked: int = Field(description="The rank of the character")
    actor: str = Field(description="The actor in the movie")
    character: str = Field(description="The character in the movie")
    votes: int = Field(description="The number of votes")
    movie: str = Field(description="The name of the movie")


class Document(BaseModel):
    actors: List[Movie] = Field(..., description="List of movie actors and characters")

# Input from IMDB poll on best movie characters, https://www.imdb.com/poll/gBcmBMHGh4k/results?ref_=po_sr
with open('poll.html','rt') as inp:
  input_data = inp.read()

structured_llm = llm.with_structured_output(Document)
results = structured_llm.invoke("You are an extraction algorithm. Please extract every possible instance of movie character information.\n\n" + input_data)

print(results)
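
Conceptually, structured output asks the model for data that matches the schema and then turns the reply into typed objects. The sketch below imitates that last step using only the standard library: `dataclasses` stand in for the pydantic models, and a hand-written JSON string stands in for the model's reply. The values are placeholders, not real poll results, and the names `MovieRecord` and `parse_reply` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class MovieRecord:
    ranked: int
    actor: str
    character: str
    votes: int
    movie: str

# Made-up JSON standing in for a model reply (placeholder values only).
reply = """
{"actors": [
  {"ranked": 1, "actor": "Example Actor", "character": "Example Character",
   "votes": 10, "movie": "Example Movie"},
  {"ranked": 2, "actor": "Another Actor", "character": "Another Character",
   "votes": 7, "movie": "Another Movie"}
]}
"""

def parse_reply(raw: str) -> List[MovieRecord]:
    """Parse the JSON and build one record per entry.

    MovieRecord(**item) raises TypeError if a field is missing or unexpected,
    a crude version of the validation that pydantic performs.
    """
    payload = json.loads(raw)
    return [MovieRecord(**item) for item in payload["actors"]]

records = parse_reply(reply)
for record in records:
    print(record)
```

Unlike pydantic, plain dataclasses do not check or coerce field types, so this sketch only catches missing or extra fields.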

In [None]:
results_df = pd.DataFrame([character.model_dump() for character in results.actors])
results_df

## Exercise

Take the list of Penn CIS courses and extract the information into a DataFrame!

In [None]:
!wget https://storage.googleapis.com/penn-cis5450/cis-catalog.html

Define a class specifying the schema to extract. It should include the fields `course`, `name`, `prerequisites`, `units`, `description`, and `frequency`.

In [None]:
# TODO: use "structured output" to map the text
# to a series of nested objects (by defining classes with properties).
# These are:
#  A Document has a list of Courses
#  A Course has a course of type CourseNumber, as well as the fields above.
#    Prerequisites should be a list of CourseNumber as well.
#  A CourseNumber has a degree program and a number.
from typing import List, Optional
from pydantic import Field, BaseModel

class CourseNumber(BaseModel):
  # TODO

class Course(BaseModel):
  course: CourseNumber = Field(description="The course number")
  # TODO

class Document(BaseModel):
    courses: # TODO

with open('cis-catalog.html','rt') as inp:
  input_data = inp.read()

structured_llm = llm.with_structured_output(Document)
results = structured_llm.invoke("You are an extraction algorithm. Please extract every possible instance of course information.\n\n" + input_data)

print(results)

In [None]:
results_df = pd.DataFrame([course.model_dump() for course in results.courses])
results_df


In [None]:
# This is just to catch simple mistakes

if 'name' not in results_df.columns or 'units' not in results_df.columns:
  print('Please revise your schema according to the spec')

In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
!pip3 install penngrader-client

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [None]:
%set_env HW_ID=cis5450_25f_HW9

In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

In [None]:
grader.grade('extracted_courses', results_df)