<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/Module_1_Part_2_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing in the Transformers and LLMs Era

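As recently as 5 years ago, machine learning techniques for natural language and the Web were extremely brittle.  They still are not perfect, but they are often "good enough" to do real work, thanks to large language models (LLMs) and transformers.  In this notebook we'll try some tools: NLTK for tokenization, a Hugging Face model for sentiment analysis, spaCy for named entity recognition, and LangChain with OpenAI models for zero-shot prompting and relation extraction.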

In [None]:
%set_env OPENAI_API_KEY=#TODO - add from Ed Discussion

In [2]:
!pip install llama-index
!pip install llama-index-llms-langchain
!pip install llama-index-llms-openai
!pip install langchain
!pip install langchain-community
!pip install langchain-openai
!pip install openai



In [3]:
!pip install nltk
!pip install chromadb


## Documents as Vectors

Let's parse a paragraph and create a very simple document vector.  We'll use a parser from a package called `nltk`.

In [4]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:

paragraph = '''A large language model (LLM) is a language model characterized by
               its large size. Its size is enabled by AI accelerators, which are
               able to process vast amounts of text data, mostly scraped from the Internet.'''

sentences = nltk.sent_tokenize(paragraph)

from nltk.tokenize import word_tokenize

# Accumulate all words, all sentences
all_words = []
for sent in sentences:
  words = word_tokenize(sent)
  all_words.extend([word.lower() for word in words if word.isalpha()])


# Reorder the words in lexicographical order
all_words.sort()
print (all_words)

['a', 'a', 'able', 'accelerators', 'ai', 'amounts', 'are', 'by', 'by', 'characterized', 'data', 'enabled', 'from', 'internet', 'is', 'is', 'its', 'its', 'language', 'language', 'large', 'large', 'llm', 'model', 'model', 'mostly', 'of', 'process', 'scraped', 'size', 'size', 'text', 'the', 'to', 'vast', 'which']


In [6]:
# Simple function to create a dictionary of word / count.
# Assumes the input list is sorted, so that equal words are adjacent.
def create_word_count_dict(sorted_list_of_words):
  word_count_dict = {}
  current_word = None
  current_count = 0
  for word in sorted_list_of_words:
    if word != current_word:
      if current_word is not None:
        word_count_dict[current_word] = current_count
      current_word = word
      current_count = 1
    else:
      current_count += 1
  if current_word is not None:
    word_count_dict[current_word] = current_count
  return word_count_dict

print (create_word_count_dict(all_words))

{'a': 2, 'able': 1, 'accelerators': 1, 'ai': 1, 'amounts': 1, 'are': 1, 'by': 2, 'characterized': 1, 'data': 1, 'enabled': 1, 'from': 1, 'internet': 1, 'is': 2, 'its': 2, 'language': 2, 'large': 2, 'llm': 1, 'model': 2, 'mostly': 1, 'of': 1, 'process': 1, 'scraped': 1, 'size': 2, 'text': 1, 'the': 1, 'to': 1, 'vast': 1, 'which': 1}
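The hand-written counting loop above can also be expressed with the standard library's `collections.Counter`. The sketch below is one possible equivalent, not the notebook's own method: it counts words and then lays the counts out in a fixed vocabulary order, which is what makes the result a "vector". The names `bag_of_words`, `to_dense_vector`, and `toy_words` are illustrative only.

```python
from collections import Counter


def bag_of_words(words):
    """Count how often each word occurs (a sparse document vector)."""
    return Counter(words)


def to_dense_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order; absent words get 0."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical toy input, already lower-cased and filtered like all_words above
toy_words = ['a', 'large', 'language', 'model', 'is', 'a', 'language', 'model']

counts = bag_of_words(toy_words)
vocabulary = sorted(counts)  # one dimension per distinct word

print(vocabulary)                            # ['a', 'is', 'language', 'large', 'model']
print(to_dense_vector(counts, vocabulary))   # [2, 1, 2, 1, 2]
```

Unlike the loop above, `Counter` does not need the input to be sorted. Two documents become comparable once both are projected onto the same vocabulary.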


NLTK allows us to do a lot more, especially based on linguistic cues.  However, let's now switch to some tools that use embeddings and transformers to do our tasks.

## Sentiment Analysis from a Model on HuggingFace

To do sentiment analysis, we'll use a transformer model called *DistilBERT*. DistilBERT, "fine-tuned" on a sentiment analysis task, does a fairly good job of capturing the sentiment of words and sentences. Note we will be loading the model onto our Colab machine from a model hosting site called Hugging Face.

In [7]:
import os

import pandas as pd

In [8]:
!pip install -q transformers
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Beware Biases on Words from Training on Text

Beware that seemingly neutral statements may end up showing sentiment, because the terms themselves appeared mostly in positive or negative comments in the training text.  For example, it has been reported that young people view iPhones in a much more favorable light than Android phones. If the training data reflects that, perhaps that's why we see the results below?

In [9]:
sentiment_pipeline('They bought an Android phone')

[{'label': 'NEGATIVE', 'score': 0.566922128200531}]

In [10]:
sentiment_pipeline('They bought an iPhone')

[{'label': 'POSITIVE', 'score': 0.9648463726043701}]
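The two calls above differ only in the product name, which suggests a simple way to probe for this kind of bias: fill one slot of a fixed sentence template with different terms and compare the labels. The helper below is a minimal sketch of that idea. `probe_template` is a hypothetical name, and `classifier` is assumed to be any callable that returns output shaped like the pipeline results shown above.

```python
def probe_template(classifier, template, terms):
    """Fill the {term} slot of a template with each term and classify the result.

    `classifier` is any callable mapping a string to a list such as
    [{'label': 'POSITIVE', 'score': 0.96}], the shape shown in the outputs above.
    """
    rows = []
    for term in terms:
        sentence = template.format(term=term)
        top = classifier(sentence)[0]  # the single top prediction
        rows.append({
            'term': term,
            'sentence': sentence,
            'label': top['label'],
            'score': top['score'],
        })
    return rows


# Usage in this notebook (assumes sentiment_pipeline from the cell above):
# probe_template(sentiment_pipeline, 'They bought {term}',
#                ['an Android phone', 'an iPhone'])
```

Keeping everything but the probed term identical makes it easier to attribute a difference in label to the term itself rather than to the surrounding wording.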

Nonetheless, for the most part transformer-based sentiment analysis works quite well.  Let's see it over product reviews.  Note this is quite expensive computationally, since the model runs once per review!

### Sentiment for a DB of Product Reviews

In [11]:
reviews_df = pd.read_csv('https://storage.googleapis.com/penn-cis5450/GrammarandProductReviews.csv')

In [12]:
snacks_df = reviews_df[reviews_df['categories'].apply(lambda x: 'Snacks,' in x)]

snacks_df

Unnamed: 0,id,brand,categories,dateAdded,dateUpdated,ean,keys,manufacturer,manufacturerNumber,name,...,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,upc
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",2017-07-25T05:16:03Z,2018-02-05T11:27:45Z,73416000391,lundbergorganiccinnamontoastricecakes/b000fvzw...,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,...,100209113.0,,5,https://www.walmart.com/reviews/product/29775278,Good flavor. This review was collected as part...,Good,,,Dorothy W,73416000391
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",2017-07-25T05:16:03Z,2018-02-05T11:27:45Z,73416000391,lundbergorganiccinnamontoastricecakes/b000fvzw...,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,...,100209113.0,,5,https://www.walmart.com/reviews/product/29775278,Good flavor.,Good,,,Dorothy W,73416000391
1056,AV1YlENIglJLPUi8IHsX,KIND,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-07-19T02:01:37Z,2018-02-05T11:26:49Z,6.02652E+11,"602652184024,kind/15027059,darkchocolatechunkg...",Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,...,104821113.0,0.0,1,https://www.walmart.com/reviews/product/34202687,"Buyer beware, these taste like 55, nothing eve...",definetaly not a granola bar,,,walmartian,6.02652E+11
1057,AV1YlENIglJLPUi8IHsX,KIND,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-07-19T02:01:37Z,2018-02-05T11:26:49Z,6.02652E+11,"602652184024,kind/15027059,darkchocolatechunkg...",Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,...,33383690.0,0.0,2,https://www.walmart.com/reviews/product/34202687,"Not being a Kind Bar aficionado, I didn't know...",They were okay,,,LaurieB4041,6.02652E+11
1058,AV1YlENIglJLPUi8IHsX,KIND,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-07-19T02:01:37Z,2018-02-05T11:26:49Z,6.02652E+11,"602652184024,kind/15027059,darkchocolatechunkg...",Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,...,109642927.0,0.0,3,https://www.walmart.com/reviews/product/34202687,They're so hard and dry. They fall into a thou...,They're just okay,,,,6.02652E+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68470,AVq5UzjHU2_QcyX9O584,Nutri-Grain,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-03-10T17:45:23Z,2018-02-05T11:28:54Z,38000355004,"nutrigraincerealbarsmixedberry8ct/b000aydhaq,0...",Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",...,73846963.0,0.0,5,https://www.walmart.com/reviews/product/108186...,Love the taste and ease for quick breakfast or...,Great taste!,,,AlwysLooknf4deals,38000355004490700000000
68471,AVq5UzjHU2_QcyX9O584,Nutri-Grain,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-03-10T17:45:23Z,2018-02-05T11:28:54Z,38000355004,"nutrigraincerealbarsmixedberry8ct/b000aydhaq,0...",Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",...,71615117.0,0.0,5,https://www.walmart.com/reviews/product/108186...,"Love, love them...eat them everyday!!",Love this...addicted to them!!,,,LauraB,38000355004490700000000
68472,AVq5UzjHU2_QcyX9O584,Nutri-Grain,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-03-10T17:45:23Z,2018-02-05T11:28:54Z,38000355004,"nutrigraincerealbarsmixedberry8ct/b000aydhaq,0...",Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",...,63440859.0,0.0,5,https://www.walmart.com/reviews/product/108186...,Pretty good and not too dry like some bars.,Breakfast bar,,,MissT,38000355004490700000000
68473,AVq5UzjHU2_QcyX9O584,Nutri-Grain,"Food,Packaged Foods,Snacks,Cereal Bars and Gra...",2017-03-10T17:45:23Z,2018-02-05T11:28:54Z,38000355004,"nutrigraincerealbarsmixedberry8ct/b000aydhaq,0...",Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",...,72022622.0,0.0,5,https://www.walmart.com/reviews/product/108186...,My favorite,My go to snack,,,Nicole,38000355004490700000000


In [13]:
reviews_text_df = snacks_df[['manufacturer','manufacturerNumber','name','reviews.text']].copy()

reviews_text_df

Unnamed: 0,manufacturer,manufacturerNumber,name,reviews.text
1,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor. This review was collected as part...
2,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor.
1056,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Buyer beware, these taste like 55, nothing eve..."
1057,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Not being a Kind Bar aficionado, I didn't know..."
1058,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,They're so hard and dry. They fall into a thou...
...,...,...,...,...
68470,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Love the taste and ease for quick breakfast or...
68471,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.","Love, love them...eat them everyday!!"
68472,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Pretty good and not too dry like some bars.
68473,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",My favorite


In [14]:
reviews_text_df.dtypes

Unnamed: 0,0
manufacturer,object
manufacturerNumber,object
name,object
reviews.text,object


In [15]:
reviews_text_df['sentiment'] = reviews_text_df['reviews.text'].apply(sentiment_pipeline)

reviews_text_df

Unnamed: 0,manufacturer,manufacturerNumber,name,reviews.text,sentiment
1,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor. This review was collected as part...,"[{'label': 'POSITIVE', 'score': 0.999740898609..."
2,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor.,"[{'label': 'POSITIVE', 'score': 0.999867796897..."
1056,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Buyer beware, these taste like 55, nothing eve...","[{'label': 'NEGATIVE', 'score': 0.999345004558..."
1057,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Not being a Kind Bar aficionado, I didn't know...","[{'label': 'NEGATIVE', 'score': 0.999023318290..."
1058,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,They're so hard and dry. They fall into a thou...,"[{'label': 'NEGATIVE', 'score': 0.998832166194..."
...,...,...,...,...,...
68470,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Love the taste and ease for quick breakfast or...,"[{'label': 'POSITIVE', 'score': 0.999820411205..."
68471,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.","Love, love them...eat them everyday!!","[{'label': 'POSITIVE', 'score': 0.999883174896..."
68472,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Pretty good and not too dry like some bars.,"[{'label': 'POSITIVE', 'score': 0.999789297580..."
68473,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",My favorite,"[{'label': 'POSITIVE', 'score': 0.999523520469..."


In [16]:
reviews_text_df['label'] = reviews_text_df['sentiment'].apply(lambda x:x[0]['label'])
reviews_text_df['score'] = reviews_text_df.apply(lambda x:x['sentiment'][0]['score'] if x['label'] == 'POSITIVE' else -x['sentiment'][0]['score'], axis=1)


In [17]:
reviews_text_df

Unnamed: 0,manufacturer,manufacturerNumber,name,reviews.text,sentiment,label,score
1,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor. This review was collected as part...,"[{'label': 'POSITIVE', 'score': 0.999740898609...",POSITIVE,0.999741
2,Lundberg,574764,Lundberg Organic Cinnamon Toast Rice Cakes,Good flavor.,"[{'label': 'POSITIVE', 'score': 0.999867796897...",POSITIVE,0.999868
1056,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Buyer beware, these taste like 55, nothing eve...","[{'label': 'NEGATIVE', 'score': 0.999345004558...",NEGATIVE,-0.999345
1057,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,"Not being a Kind Bar aficionado, I didn't know...","[{'label': 'NEGATIVE', 'score': 0.999023318290...",NEGATIVE,-0.999023
1058,Kind Fruit & Nut Bars,15027059,Kind Dark Chocolate Chunk Gluten Free Granola ...,They're so hard and dry. They fall into a thou...,"[{'label': 'NEGATIVE', 'score': 0.998832166194...",NEGATIVE,-0.998832
...,...,...,...,...,...,...,...
68470,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Love the taste and ease for quick breakfast or...,"[{'label': 'POSITIVE', 'score': 0.999820411205...",POSITIVE,0.999820
68471,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.","Love, love them...eat them everyday!!","[{'label': 'POSITIVE', 'score': 0.999883174896...",POSITIVE,0.999883
68472,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",Pretty good and not too dry like some bars.,"[{'label': 'POSITIVE', 'score': 0.999789297580...",POSITIVE,0.999789
68473,Nutri-Grain,12992472,"Nutrigrain Cereal Bars, Mixed Berry, 8 Ct.",My favorite,"[{'label': 'POSITIVE', 'score': 0.999523520469...",POSITIVE,0.999524
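The `score` column above folds the label into the sign of the confidence, so that one number can be averaged per product. As a standalone sketch (the helper name `signed_score` is illustrative, and the two inputs are the pipeline outputs shown earlier for the phone sentences):

```python
def signed_score(result):
    """Map a pipeline result to [-1, 1].

    POSITIVE keeps its confidence; any other label (here NEGATIVE)
    gets the confidence negated.
    """
    top = result[0]
    return top['score'] if top['label'] == 'POSITIVE' else -top['score']


print(signed_score([{'label': 'POSITIVE', 'score': 0.9648463726043701}]))  # 0.9648463726043701
print(signed_score([{'label': 'NEGATIVE', 'score': 0.566922128200531}]))   # -0.566922128200531
```

With a two-label model the winning label's confidence should be at least 0.5, so individual signed scores cluster toward the two ends of the range. Only averages over several reviews land near zero.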


In [18]:
reviews_text_df[['manufacturer','manufacturerNumber','name','score']].groupby(
    by=['manufacturer','name','manufacturerNumber']).mean().sort_values(by='score')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,score
manufacturer,name,manufacturerNumber,Unnamed: 3_level_1
Ortega,Ortega Thick & Chunky Mild Salsa,00G6ICL6V5KH315,-0.981686
"Frito-Lay, Inc.",Simply Ruffles Sea Salted Reduced Fat* Potato Chips - 8oz,13327531,-0.893294
Lay's,Lay's Salt & Vinegar Flavored Potato Chips,2840003425,-0.332673
Snyder's of Hanover,Snyder's Of Hanover Chocolate Covered Pretzels Dark Chocolate Mini Dips,14991105,0.001811
VOORTMAN COOKIES LIMITED,Voortman Sugar Free Fudge Chocolate Chip Cookies,47079669,0.036092
Maple Grove Farms,"Maple Grove Farms Of Vermont Fat Free Dressing, Cranberry Balsamic",57201452,0.037448
Kellogg Sales Co,Chips Deluxe Soft 'n Chewy Cookies,44086,0.197914
Kellogg Sales Co.,Keebler Soft Batch Chocolate Chip Cookies,54086,0.230834
Knouse Foods Inc,"Musselman Apple Sauce, Cinnamon, 48oz",FCASC6000MUS45,0.254215
Stacy's,Stacy's Simply Naked Bagel Chips,14931211,0.330322


In [19]:
reviews_text_df.describe()

Unnamed: 0,score
count,1064.0
mean,0.632364
std,0.753635
min,-0.999803
25%,0.984014
50%,0.999603
75%,0.999851
max,0.999892


## Named Entity Recognition with spaCy

What is a sentence or paragraph talking about?  Knowing the nouns may allow us to understand what's going on, or learn about entities.

For this task, a popular library is *spaCy*, which provides pretrained pipelines such as `en_core_web_lg`. Again, we can install it on our host machine.

In [20]:
!pip install spacy[transformers]
!pip install -U spacy-experimental
!pip install -U spacy-transformers



In [21]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 2.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
import spacy
from spacy import displacy

In [23]:
nlp = spacy.load('en_core_web_lg')

In [24]:
text = '''
After standing down from a first attempt Thursday night, SpaceX teams at Cape
Canaveral Space Force Station are now on track to launch a Falcon 9 rocket
carrying 22 Starlink internet satellites at 11:38 p.m. EDT from Launch Complex 40.

An additional launch opportunity for the Starlink 6-16 mission is set for 12:07
a.m. EDT. Saturday. Otherwise, two backup opportunities are available Saturday night,
at 11:13 p.m. and 11:38 p.m. EDT.'''

displacy.render(nlp(text), style='ent', jupyter=True)

In [25]:
displacy.render(nlp(text), style='dep', jupyter=True, options={'compact': True, 'space': 70})

Here are the entity types (labels) that spaCy can assign (from https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218):

```
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including “%”.
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
```

In [26]:
words = []
for word in nlp(text).ents:
  words.append({'word': word.text, 'type': word.label_})

pd.DataFrame(words)

Unnamed: 0,word,type
0,first,ORDINAL
1,Thursday,DATE
2,night,TIME
3,SpaceX,PRODUCT
4,Cape,GPE
5,Canaveral Space Force Station,FAC
6,Falcon,ORG
7,9,CARDINAL
8,22,CARDINAL
9,Starlink,ORG


### Named Entity Recognition

Let's see how we do, focusing only on "people, places, and things"...

In [32]:
for ent in nlp(text).ents:
  if ent.label_ in ['ORG', 'PERSON', 'PRODUCT', 'NORP', 'FAC', 'GPE']:
    print(ent.text, ent.label_)


SpaceX PRODUCT
Cape GPE
Canaveral Space Force Station FAC
Falcon ORG
Starlink ORG
Launch Complex 40 PRODUCT
Starlink ORG
6-16 PRODUCT


... Actually it's not *that* great when you look at the labels.  "Cape Canaveral Space Force Station" should be a single FAC (the model split off "Cape" as a GPE), SpaceX should be an ORG, Falcon should be a PRODUCT, etc.
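To make that comparison concrete, the sketch below checks the printed entities against the three corrections named above. It is only a hand-made spot check, not a full gold annotation: `predicted` is copied from the output above, and `expected` holds only the three entities the text mentions.

```python
# Entities as printed by the filtering loop above
predicted = [
    ('SpaceX', 'PRODUCT'),
    ('Cape', 'GPE'),
    ('Canaveral Space Force Station', 'FAC'),
    ('Falcon', 'ORG'),
    ('Starlink', 'ORG'),
    ('Launch Complex 40', 'PRODUCT'),
    ('Starlink', 'ORG'),
    ('6-16', 'PRODUCT'),
]

# Only the three corrections named in the text; not a complete annotation
expected = [
    ('Cape Canaveral Space Force Station', 'FAC'),
    ('SpaceX', 'ORG'),
    ('Falcon', 'PRODUCT'),
]

predicted_set = set(predicted)
for span, label in expected:
    status = 'ok' if (span, label) in predicted_set else 'missed'
    print(f'{span!r} as {label}: {status}')

# An entity counts only if both the span text and the label match exactly
hits = sum(pair in predicted_set for pair in expected)
print(f'{hits} of {len(expected)} expected entities matched exactly')
```

Exact matching is strict: the station's label is right, but the span is missing "Cape", so it still counts as a miss.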

## Zero-Shot Learning

Here we'll use a package called `langchain` to send a question to an OpenAI chat model (`gpt-3.5-turbo`).  "Zero-shot learning" simply asks the LLM a question based on what it knows, without giving it any examples of what you expect.

In [51]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain

In [52]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
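The template above has a single placeholder, `{question}`. As a rough mental model (a plain-Python sketch, not LangChain's actual implementation), filling it in behaves like `str.format`. The helper name `fill_prompt` is hypothetical.

```python
template = """Question: {question}

Answer: Let's think step by step."""


def fill_prompt(template, **variables):
    """Substitute named placeholders, roughly what a prompt template does."""
    return template.format(**variables)


prompt_text = fill_prompt(
    template,
    question="What are the main topics of a big data course?",
)
print(prompt_text)
```

The resulting string is what gets sent to the model. Because it contains no worked examples, only the question and a nudge to reason step by step, this is a zero-shot prompt.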

In [53]:
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [54]:
question = "What are the main topics of a big data course?"

answer = llm_chain.run(question)

for sentence in answer.split('\n'):
  print (sentence)

1. Introduction to Big Data: This topic covers the basics of big data, including what it is, why it is important, and how it is different from traditional data analysis.

2. Data Collection and Storage: This topic focuses on the various methods of collecting and storing large amounts of data, including data warehouses, data lakes, and cloud storage solutions.

3. Data Processing and Analysis: This topic covers the tools and techniques used to process and analyze big data, including data mining, machine learning, and data visualization.

4. Data Management and Governance: This topic explores the challenges of managing and governing big data, including data quality, data security, and compliance with regulations such as GDPR.

5. Real-world Applications of Big Data: This topic looks at how big data is being used in various industries, such as healthcare, finance, and marketing, to drive business insights and decision-making.

6. Ethical and Social Implications of Big Data: This topic exa

## Relation Extraction via OpenAI

Relation extraction involves taking text and trying to populate a schema.  Sometimes one must do this via "few-shot" learning (provide a few examples) but for simpler cases zero-shot learning (with the schema) may be adequate.

Here's an example using text copied from an Internet Movie Database (IMDb) poll.


In [78]:
from langchain.chains import create_extraction_chain
from langchain_core.pydantic_v1 import BaseModel, Field


# Schema
schema = {
    "properties": {
        "name": {"type": "string"},
        "ranked": {"type": "integer"},
        "votes": {"type": "integer"},
        "movie": {"type": "string"}
    },
    "required": ["name", "ranked", "votes", "movie"],
}

# Input from IMDB poll on best movie characters, https://www.imdb.com/poll/gBcmBMHGh4k/results?ref_=po_sr
inp = """
Results of 10,205 votes:
1.
Heath Ledger and Martin Ballantyne in The Dark Knight (2008)
2,988
The Joker #1 on Who Is The Nastiest Villain
2.
Harrison Ford in Indiana Jones and the Temple of Doom (1984)
743
Indiana Jones #1 on Steven Spielberg Leads
3.
James Earl Jones and David Prowse in Star Wars: Episode V - The Empire Strikes Back (1980)
565
Darth Vader #1 on Movie wo/man in a mask
4.
"The Godfather" Marlon Brando 1971 Paramount
560
Vito Corleone #1 on Movie Character Wisdom
5.
Clint Eastwood in The Good, the Bad and the Ugly (1966)
465
The Man With No Name #1 on Favourite Nameless Character
6.
Jodie Foster and Anthony Hopkins in The Silence of the Lambs (1991)
441
Dr. Hannibal Lecter #1 on Movie Villains Played by Brits and the Irish
7.
Javier Bardem in No Country for Old Men (2007)
341
Anton Chigurh #1 on The most likely villain to win at a staring contest ...
8.
Christian Bale in Batman Begins (2005)
320
Batman #1 on Classic Clothing - Movie Heroes!
9.
Malcolm McDowell in A Clockwork Orange (1971)
261
Alex #1 on Most Charming Sci-Fi Anti-Heroes
10.
Hugh Jackman is Logan/Wolverine
245
Wolverine #1 on Favourite Character Made Of Flesh & Metal
11.
Frank Oz and Yoda in Star Wars: Episode V - The Empire Strikes Back (1980)
243
Yoda #1 on Greatest Mentor
12.
Russell Crowe in Gladiator (2000)
234
Maximus #1 on The most inspiring Hero from a 'Best Picture' is ...
13.
Michael J. Fox in Back to the Future Part II (1989)
227
Marty McFly #1 on I think I just saw... me!
14.
Uma Thurman in Kill Bill: Vol. 1 (2003)
215
The Bride #1 on Weapon Wielding Women ( Nothing as boring as a Handgun)
15.
Sigourney Weaver in Alien (1979)
 212
Ellen Ripley #1 on The best female kick-ass characters of Fantasy/Sci-Fi
16.
Iron Man (2008)
 184
Iron Man #1 on Glow-in-the-Dark Characters
17.
Leonardo DiCaprio and Danièle Watts in Django Unchained (2012)
 177
Calvin Candie #1 on The Hero plays the Villain
18.
Keanu Reeves in The Matrix Reloaded (2003)
 176
Neo #1 on Chosen ones (Part 1)
19.
Arnold Schwarzenegger in The Terminator (1984)
 176
The Terminator #1 on Which one of these sunglasses-sporting movie characters looks the coolest ?
20.
Jennifer Lawrence in The Hunger Games: Catching Fire (2013)
 157
Katniss Everdeen #1 on You're in a Hunger Games/Battle Royale, who is your partner ?
21.
Chloë Grace Moretz in Kick-Ass 2 (2013)
 155
Hit-Girl #1 on Action Acrobatic Film Femmes
22.
Orlando Bloom in The Lord of the Rings: The Two Towers (2002)
 150
Legolas #1 on Bow and arrow characters
23.
Scarlett Johansson in Avengers: Age of Ultron (2015)
 115
Black Widow #1 on Colourful Characters
24.
Ben Burtt in WALL·E (2008)
 111
WALL. E #1 on Favourite Pixar's Lead Character
25.
Ming-Na Wen and Soon-Tek Oh in Mulan (1998)
 109
Mulan #1 on Most Inspirational Disney Princess
26.
Leslie Nielsen in The Naked Gun: From the Files of Police Squad! (1988)
 106
Lt. Frank Drebin #1 on To be Frank, my favorite cinematic character is ...
27.
Jodie Foster in The Silence of the Lambs (1991)
 101
Clarice Sterling #1 on Most Memorable Character to win Oscar Best Actress in past 24 Years
28.
Ben Affleck in Gone Girl (2014)
 82
The Villain in Gone Girl #1 on The Best Movie Villains Of 2014
29.
Kate Beckinsale in Underworld: Awakening (2012)
 79
Selene #1 on Ladies in Leather
30.
Milla Jovovich, Ian Holm, and Charlie Creed-Miles in The Fifth Element (1997)
 75
Leeloo #1 on Sexiest Movie Alien
31.
Mélanie Laurent in Inglourious Basterds (2009)
 66
Shosanna #1 on Who is the most attractive Tarantino lady?
32.
Vin Diesel in Fast Five (2011)
 61
Dominic Toretto #1 on Best Car Driver Of All Time
33.
George Clooney in Ocean's Thirteen (2007)
 34
Danny Ocean #1 on You're Planning a Heist
34.
Samuel L. Jackson in Captain America: The Winter Soldier (2014)
 31
Nick Fury #1 on What's with the patch ?"""

# Run chain
llm = ChatOpenAI(temperature=0, model="gpt-4o")
chain = create_extraction_chain(schema, llm)
chain.run(inp)

[{'name': 'Heath Ledger and Martin Ballantyne',
  'ranked': 1,
  'votes': 2988,
  'movie': 'The Dark Knight'},
 {'name': 'Harrison Ford',
  'ranked': 2,
  'votes': 743,
  'movie': 'Indiana Jones and the Temple of Doom'},
 {'name': 'James Earl Jones and David Prowse',
  'ranked': 3,
  'votes': 565,
  'movie': 'Star Wars: Episode V - The Empire Strikes Back'},
 {'name': 'Marlon Brando',
  'ranked': 4,
  'votes': 560,
  'movie': 'The Godfather'},
 {'name': 'Clint Eastwood',
  'ranked': 5,
  'votes': 465,
  'movie': 'The Good, the Bad and the Ugly'},
 {'name': 'Jodie Foster and Anthony Hopkins',
  'ranked': 6,
  'votes': 441,
  'movie': 'The Silence of the Lambs'},
 {'name': 'Javier Bardem',
  'ranked': 7,
  'votes': 341,
  'movie': 'No Country for Old Men'},
 {'name': 'Christian Bale',
  'ranked': 8,
  'votes': 320,
  'movie': 'Batman Begins'},
 {'name': 'Malcolm McDowell',
  'ranked': 9,
  'votes': 261,
  'movie': 'A Clockwork Orange'},
 {'name': 'Hugh Jackman',
  'ranked': 10,
  'votes'

In [83]:
from typing import List, Optional

class Movie(BaseModel):
    ranked: int = Field(description="The rank of the character")
    character: str = Field(description="The character in the movie")
    votes: int = Field(description="The number of votes")
    movie: str = Field(description="The name of the movie")


class Document(BaseModel):
    characters: List[Movie] = Field(..., description="List of movie characters")

structured_llm = llm.with_structured_output(Document)
results = structured_llm.invoke("You are an extraction algorithm. Please extract every movie character entry (rank, character, votes, movie) from the poll results below.\n\n" + inp)

In [84]:
results_df = pd.DataFrame([character.dict() for character in results.characters])
results_df

Unnamed: 0,ranked,character,votes,movie
0,1,The Joker,2988,The Dark Knight
1,2,Indiana Jones,743,Indiana Jones and the Temple of Doom
2,3,Darth Vader,565,Star Wars: Episode V - The Empire Strikes Back
3,4,Vito Corleone,560,The Godfather
4,5,The Man With No Name,465,"The Good, the Bad and the Ugly"
5,6,Dr. Hannibal Lecter,441,The Silence of the Lambs
6,7,Anton Chigurh,341,No Country for Old Men
7,8,Batman,320,Batman Begins
8,9,Alex,261,A Clockwork Orange
9,10,Wolverine,245,Logan/Wolverine
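Since an LLM produced these rows, it may be worth sanity-checking them before relying on them. The sketch below is one possible check, under the assumption that a poll ranking should have consecutive ranks and non-increasing vote counts. The function name `check_ranking` is hypothetical, and `rows` is copied by hand from the table above.

```python
# First ten extracted rows, copied from the DataFrame output above
rows = [
    {'ranked': 1, 'character': 'The Joker', 'votes': 2988},
    {'ranked': 2, 'character': 'Indiana Jones', 'votes': 743},
    {'ranked': 3, 'character': 'Darth Vader', 'votes': 565},
    {'ranked': 4, 'character': 'Vito Corleone', 'votes': 560},
    {'ranked': 5, 'character': 'The Man With No Name', 'votes': 465},
    {'ranked': 6, 'character': 'Dr. Hannibal Lecter', 'votes': 441},
    {'ranked': 7, 'character': 'Anton Chigurh', 'votes': 341},
    {'ranked': 8, 'character': 'Batman', 'votes': 320},
    {'ranked': 9, 'character': 'Alex', 'votes': 261},
    {'ranked': 10, 'character': 'Wolverine', 'votes': 245},
]


def check_ranking(rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Ranks should run 1, 2, 3, ... with no gaps or repeats
    for position, row in enumerate(rows, start=1):
        if row['ranked'] != position:
            problems.append(f"row {position}: rank {row['ranked']} is out of sequence")
    # A lower-ranked entry should not have more votes than the one before it
    for previous, current in zip(rows, rows[1:]):
        if current['votes'] > previous['votes']:
            problems.append(
                f"{current['character']} has more votes than {previous['character']}"
            )
    return problems


print(check_ranking(rows))  # [] for the rows above
```

Checks like these do not prove the extraction is right, but they can catch dropped rows or misread numbers cheaply.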


## Exercise

Take the list of CIS 19xx courses, inserted below, and extract the information into a DataFrame!

In [85]:
text = '''
CIS 1901 C++ Programming

This course will provide an introduction to programming in C++ and is intended for students who are already experienced with programming in C and in object-oriented languages such as Java. C++ provides programmers with a greater level of control over machine resources and is commonly used in situations where low level access or performance are important. This course will cover the features and abstractions that C++ provides to write code that is both safe and performant. This course recommends students to have completed CIS 1200 and CIS 2400.

Not Offered Every Year

0-0.5 Course Units

CIS 1902 Python Programming

Python is an elegant, concise, and powerful language that is useful for tasks large and small. Python has quickly become a popular language for getting things done efficiently in many in all domains: scripting, systems programming, research tools, and web development. This course will provide an introduction to this modern high-level language using hands-on experience through programming assignments and a collaborative final application development project.

Not Offered Every Year

Prerequisite: CIS 1200

0-0.5 Course Units

CIS 1903 Go Programming

Go is an open source programming language created by Google designed for speed, efficiency and infrastructure. While Go is particularly proficient at concurrent systems programming, it has a variety of uses and has been gaining popularity in a variety of fields, including graphics, mobile applications and machine learning. Go is simple, fast and is continuing to rapidly grow in industry. In this course, we will cover what makes Go so unique and apply it to practical, real world situations. Topics covered will include concurrency and parallelism, goroutines and channels, web scraping, and other popular industry Go applications.

Not Offered Every Year

Prerequisite: CIS 1100

0-0.5 Course Units

CIS 1904 Introduction to Haskell Programming

Haskell is a high-level, purely functional programming language with a strong static type system and elegant mathematical underpinnings. It is being increasingly used in industry by organizations such as Facebook, AT&T, and NASA, along with several financial firms. We will explore the joys of functional programming, using Haskell as a vehicle. The aim of the course will be to allow you to use Haskell to easily and conveniently write practical programs. Evaluation will be based on regular homework assignments and class participation.

Not Offered Every Year

Prerequisite: CIS 1200

0-0.5 Course Units

CIS 1905 Rust Programming

Rust is a new, practical, community-developed systems programming language that "runs blazingly fast, prevents almost all crashes, and eliminates data ra (rust-lang.org). Rust derives from a rich history of languages to create a multi-paradigm (imperative/functional), low-level language that focuses on high-performance, zero-cost safety guarantee in concurrent programs. It has begun to gain traction in industry, showing a recognized need for a new low-level systems language. In this course, we will cover what makes Rust so unique and apply it to practical systems programming problems. Topics covered will include traits and generics; memory safety (move semantics, borrowing, and lifetimes); Rust's rich macro system; closures; and concurrency. Evaluation is based on regular homework assignments as well as a final project and class participation. Prerequisite: CIS 1200 Recommended additional prerequisite: CIS 2400 or exposure to C or C++

Not Offered Every Year

Prerequisite: CIS 1200

0-0.5 Course Units

CIS 1911 Using and Understanding Unix and Linux

Unix, in its many forms, runs much of the world's computer infrastructure, from cable modems and cell phones to the giant clusters that power Google and Amazon. This half-credit course provides a thorough introduction to Unix and Linux. Topics will range from critical basic skills such as examining and editing files, compiling programs and writing shell scripts, to higher level topics such as the architecture of Unix and its programming model. The material learned is applicable to many classes, including CIS 2400, CIS 3310, CIS 3410, CIS 3710, and CIS 3800.

Not Offered Every Year

Prerequisite: CIS 1100

0-0.5 Course Units

CIS 1912 DevOps

DevOps is the breaking down of the wall between Developers and Operations to allow more frequent and reliable feature deployments. Through a variety of automation-focused techniques, DevOps has the power to radically improve and streamline processes that in the past were manual and susceptible to human error. In this course we will take a practical, hands-on look at DevOps and dive into some of the main tools of DevOps: automated testing, containerization, reproducibility, continuous integration, and continuous deployment. Throughout the semester we build toward an end-to-end pipeline that takes a webserver, packages it, and then deploys it to the cloud in a reliable and quickly-reproducible manner utilizing industry-leading technologies like Kubernetes and Docker. Evaluation is based on homework assignments and a final group project.

Not Offered Every Year

0-0.5 Course Units

CIS 1921 Solving Hard Problems in Practice

What does Sudoku have in common with debugging, scheduling exams, and routing shipments? All of these problems are provably hard -- no one has a fast algorithm to solve them. But in reality, people are quickly solving these problems on a huge scale with clever systems and heuristics! In this course, we'll explore how researchers and organizations like Microsoft, Google, and NASA are solving these hard problems, and we'll get to use some of the tools they've built!

Not Offered Every Year

Prerequisite: CIS 1210

0-0.5 Course Units

CIS 1951 iOS Programming

This project-oriented course is centered around application development on current iOS mobile platforms. The first half of the course will involve fundamentals of mobile app development, where students learn about mobile app lifecycles, event-based programming, efficient resource management, and how to interact with the range of sensors available on modern mobile devices. In the second half of the course, students work in teams to conceptualize and develop a significant mobile application. Creativity and originality are highly encouraged! Prerequisite: CIS 1200 or previous programming experience.

Not Offered Every Year

Prerequisite: CIS 1200

0-0.5 Course Units

CIS 1952 Android Programming

This project-oriented course is centered around application development on current Android mobile platforms. The first half of the course will involve fundamentals of mobile app development, where students learn about mobile app lifecycles, event-based programming, efficient resource management, and how to interact with the range of sensors available on modern mobile devices. In the second half of the course, students work in teams to conceptualize and develop a significant mobile application. Creativity and originality are highly encouraged! Prerequisite: CIS 1200 or previous programming experience.

Not Offered Every Year

0-0.5 Course Units

CIS 1961 Ruby on Rails Web Development

This course will teach the fundamentals of developing web applications using Ruby on Rails, a rapid-development web framework developed by Basecamp, and adopted by companies like Airbnb, GitHub, Bloomberg, CrunchBase, and Shopify. The first part of the course will focus on Ruby, the language that powers Rails. Along the way, students will also pick up essential skills such as git, bash, HTML and CSS. The second part will focus on Rails, the web framework and will include all topics required to develop and deploy production-ready modern web applications with Rails. Throughout the course, students will be working on a web application project of their own choosing. Upon completion of the course, this application will be deployed and made accessible to the public.

Not Offered Every Year

Prerequisite: CIS 1200

0-0.5 Course Units

CIS 1962 JavaScript Programming

This course provides an introduction to modern web development frameworks, techniques, and practices used to deliver robust client side applications on the web. The emphasis will be on developing JavaScript programs that run in the browser. Topics covered include the JavaScript language, web browser internals, the Document Object Model (DOM), HTML5, client-side app architecture and compile-to-JS languages like (Coffeescript, TypeScript, etc.). This course is most useful for students who have some programming and web development experience and want to develop moderate JavaScript skills to be able to build complex, interactive applications in the browser.

Not Offered Every Year

0-0.5 Course Units

CIS 1990 Special Topics

This course will be used for 'pilot versions' of new CIS courses of this type that the department is planning to offer. A given course will be offered as CIS 1990 at most twice; after this, it will be assigned a permanent course number.

0-0.5 Course Units

'''

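Define a class specifying the schema to extract. It should include the fields `name`, `prerequisites`, `units`, `description`, and `frequency`. The cell after the TODO expects the extraction output in a variable named `result`, with a `courses` list whose items support `.dict()`.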
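
A minimal sketch of one way such a schema could look, assuming Pydantic models (the later cell reads `result.courses` and calls `course.dict()`, which suggests this shape). The class names `Course` and `CourseList`, the field types, and the choice to make `prerequisites` and `frequency` optional are assumptions for illustration, not the graded specification; several catalog entries above omit one of those two lines.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Course(BaseModel):
    """One catalog entry. Field types are assumptions, not the graded spec."""

    name: str = Field(description="Course number and title")
    prerequisites: Optional[str] = Field(
        default=None, description="Prerequisite text, if the entry lists any"
    )
    units: str = Field(description="Course units as written in the catalog")
    description: str = Field(description="The descriptive paragraph")
    frequency: Optional[str] = Field(
        default=None, description="How often the course is offered, if stated"
    )


class CourseList(BaseModel):
    """Hypothetical wrapper so an extraction result exposes a `.courses` list."""

    courses: List[Course]


# Hand-built instance, only to show the shape; in the notebook the
# instance would come from the LLM extraction step instead.
example = CourseList(
    courses=[
        Course(
            name="CIS 1912 DevOps",
            units="0-0.5 Course Units",
            description="A practical, hands-on look at DevOps.",
            frequency="Not Offered Every Year",
        )
    ]
)

# Same conversion the notebook applies before building the DataFrame.
rows = [course.dict() for course in example.courses]
```

Each row is a plain dictionary keyed by the five field names, so it can be passed straight to `pd.DataFrame`. On Pydantic 2, `.dict()` still works but is deprecated in favor of `model_dump()`.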

In [None]:
# TODO

In [None]:
results_df = pd.DataFrame([course.dict() for course in result.courses])
results_df


In [111]:
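# This is just to catch simple mistakes: every field from the spec should be a column

required = ['name', 'prerequisites', 'units', 'description', 'frequency']
missing = [col for col in required if col not in results_df.columns]
if missing:
  print(f'Please revise your schema according to the spec; missing fields: {missing}')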

In [89]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [90]:
!pip3 install penngrader-client

Collecting penngrader-client
  Downloading penngrader_client-0.5.2-py3-none-any.whl.metadata (15 kB)
Collecting dill (from penngrader-client)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Downloading penngrader_client-0.5.2-py3-none-any.whl (10 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   116.3/116.3 kB 4.2 MB/s eta 0:00:00
Installing collected packages: dill, penngrader-client
Successfully installed dill-0.3.8 penngrader-client-0.5.2


In [91]:
# PLEASE ENSURE YOUR PENN ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 99999999  # YOUR PENN ID GOES HERE AS AN INTEGER

In [92]:
%set_env HW_ID=cis2450_fall24_HW9

env: HW_ID=cis2450_fall24_HW9


In [93]:
import os
from penngrader.grader import PennGrader

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


In [110]:
grader.grade('extracted_courses', results_df)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.
