(**Click the icon below to open this notebook in Colab**)

[![Open InColab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/machine-learning-for-actuarial-science/blob/main/2025-spring/week15/notebook/demo.ipynb)

# Introduction to NLP

## Preprocessing

In [1]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/xiangshiyin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/xiangshiyin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/xiangshiyin/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
%%sh

ls -l /Users/xiangshiyin/nltk_data

total 0
drwxr-xr-x  6 xiangshiyin  staff  192 Apr 13 23:47 [34mcorpora[m[m
drwxr-xr-x  6 xiangshiyin  staff  192 Apr 12 17:38 [34mtokenizers[m[m


In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords


data = "This is a simple example to demonstrate removing stopwords using NLTK."
stopWords = set(stopwords.words('english'))

In [7]:
len(stopWords)

198

In [8]:
stopwords.words('english')[:20]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been']

In [9]:
tokenized_data = word_tokenize(data)

In [10]:
print(f"Original text: {data}")
print(f"Tokenized text: {"|".join(tokenized_data)}")

Original text: This is a simple example to demonstrate removing stopwords using NLTK.
Tokenized text: This|is|a|simple|example|to|demonstrate|removing|stopwords|using|NLTK|.


In [11]:
data2 = "This is a simple example to demonstrate removing stopwords using NLTK. I'll be using Python to do the data processing. Please check the output."
tokenized_data2 = word_tokenize(data2)
print(f"Original text: {data2}")
print(f"Tokenized text: {"|".join(tokenized_data2)}")

Original text: This is a simple example to demonstrate removing stopwords using NLTK. I'll be using Python to do the data processing. Please check the output.
Tokenized text: This|is|a|simple|example|to|demonstrate|removing|stopwords|using|NLTK|.|I|'ll|be|using|Python|to|do|the|data|processing|.|Please|check|the|output|.


In [12]:
filtered_tokenized_data = [
    w
    for w in tokenized_data
    if w not in stopWords
]
print(f"After removing stopwords: {filtered_tokenized_data}")

After removing stopwords: ['This', 'simple', 'example', 'demonstrate', 'removing', 'stopwords', 'using', 'NLTK', '.']


In [13]:
print(f"Original text: {data}")
print(f"Tokenized text: {"|".join(tokenized_data)}")
print(f"After removing stopwords: {"|".join(filtered_tokenized_data)}")

Original text: This is a simple example to demonstrate removing stopwords using NLTK.
Tokenized text: This|is|a|simple|example|to|demonstrate|removing|stopwords|using|NLTK|.
After removing stopwords: This|simple|example|demonstrate|removing|stopwords|using|NLTK|.


## Feature Extraction

### Bag of Words

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Sample dataset
texts = [
    "I love this product product product",         # positive
    "This is amazing",             # positive
    "Very happy with the result",  # positive
    "I hate this",                 # negative
    "Worst experience ever",       # negative
    "Not satisfied at all"         # negative
]

labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# 2. Convert text to bag-of-words vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 3. Show feature names
print("Feature Names (Vocabulary):")
print(vectorizer.get_feature_names_out())


Feature Names (Vocabulary):
['all' 'amazing' 'at' 'ever' 'experience' 'happy' 'hate' 'is' 'love' 'not'
 'product' 'result' 'satisfied' 'the' 'this' 'very' 'with' 'worst']


In [21]:
X.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]])

In [22]:
X.shape

(6, 18)

### TF-IDF

In [23]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import nltk

# Download NLTK movie_reviews data
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Prepare dataset
docs = []
labels = []

for fileid in movie_reviews.fileids():
    docs.append(movie_reviews.raw(fileid))
    labels.append(movie_reviews.categories(fileid)[0])  # 'pos' or 'neg'

# Convert labels to binary format
y = [1 if label == 'pos' else 0 for label in labels]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(docs, y, test_size=0.2, random_state=42)

# Vectorize using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/xiangshiyin/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Accuracy: 0.8275
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.81      0.82       199
           1       0.82      0.84      0.83       201

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400



In [25]:
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
# feature_names[:20]

['00' '000' '0009f' ... 'zwigoff' 'zycie' 'zzzzzzz']


In [26]:
feature_names[:20]

array(['00', '000', '0009f', '007', '00s', '03', '04', '05', '05425',
       '10', '100', '1000', '10000', '100m', '101', '102', '103', '104',
       '105', '106'], dtype=object)

In [27]:
len(feature_names)

35940

In [29]:
docs[:1]

['plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience memb

In [None]:
import pandas as pd

# Choose a sample document from the test set
sample_idx = 0
sample_vector = X_test_tfidf[sample_idx]

# Convert sparse vector to dense and create DataFrame
df_features = pd.DataFrame(
    data=sample_vector.toarray()[0],
    index=feature_names,
    columns=["tfidf"]
)

# Filter non-zero features and sort
df_nonzero = df_features[df_features.tfidf > 0].sort_values(by="tfidf", ascending=False)

# Show top 15 features by TF-IDF weight
print("\nTop TF-IDF features in sample test document:")
print(df_nonzero.head(15))

### Word2Vec

#### Hand-craft implementation

In [30]:
import numpy as np
import re
import random

# Sample corpus
corpus = "The quick brown fox jumps over the lazy dog"

# Preprocessing: Tokenization and vocabulary building
tokens = re.findall(r'\b\w+\b', corpus.lower())
vocab = set(tokens)
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
vocab_size = len(vocab)

In [31]:
vocab_size

8

In [32]:
word_to_idx

{'lazy': 0,
 'jumps': 1,
 'the': 2,
 'over': 3,
 'dog': 4,
 'fox': 5,
 'brown': 6,
 'quick': 7}

In [33]:
idx_to_word

{0: 'lazy',
 1: 'jumps',
 2: 'the',
 3: 'over',
 4: 'dog',
 5: 'fox',
 6: 'brown',
 7: 'quick'}

In [34]:
tokens

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [35]:
# Generate training data
def generate_training_data(tokens, window_size):
    training_data = []
    for idx, target_word in enumerate(tokens):
        target_idx = word_to_idx[target_word]
        context_range = list(range(max(0, idx - window_size), idx)) + \
                        list(range(idx + 1, min(len(tokens), idx + window_size + 1)))
        for context_idx in context_range:
            context_word = tokens[context_idx]
            context_word_idx = word_to_idx[context_word]
            training_data.append((target_idx, context_word_idx))
    return training_data

window_size = 2
training_data = generate_training_data(tokens, window_size)


In [36]:
# Inspect the training data
print(f"Corpus: {corpus}")
print([
    (idx_to_word[t[0]], idx_to_word[t[1]])
    for t in training_data
])

Corpus: The quick brown fox jumps over the lazy dog
[('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over'), ('jumps', 'brown'), ('jumps', 'fox'), ('jumps', 'over'), ('jumps', 'the'), ('over', 'fox'), ('over', 'jumps'), ('over', 'the'), ('over', 'lazy'), ('the', 'jumps'), ('the', 'over'), ('the', 'lazy'), ('the', 'dog'), ('lazy', 'over'), ('lazy', 'the'), ('lazy', 'dog'), ('dog', 'the'), ('dog', 'lazy')]


In [37]:
# Initialize parameters
embedding_dim = 10
W1 = np.random.randn(vocab_size, embedding_dim)
W2 = np.random.randn(embedding_dim, vocab_size)

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Training parameters
epochs = 1000
learning_rate = 0.01
num_negative_samples = 2

# Training loop
for epoch in range(epochs):
    loss = 0
    for target_idx, context_idx in training_data:
        # Positive sample
        h = W1[target_idx]
        u = np.dot(h, W2[:, context_idx])
        pred = sigmoid(u)
        error = pred - 1
        loss += -np.log(pred + 1e-7)
        # Gradients
        grad_W2 = error * h
        grad_W1 = error * W2[:, context_idx]
        # Update weights
        W2[:, context_idx] -= learning_rate * grad_W2
        W1[target_idx] -= learning_rate * grad_W1

        # Negative sampling
        negative_samples = random.sample([i for i in range(vocab_size) if i != context_idx], num_negative_samples)
        for neg_idx in negative_samples:
            u_neg = np.dot(h, W2[:, neg_idx])
            pred_neg = sigmoid(u_neg)
            error_neg = pred_neg
            loss += -np.log(1 - pred_neg + 1e-7)
            # Gradients
            grad_W2_neg = error_neg * h
            grad_W1_neg = error_neg * W2[:, neg_idx]
            # Update weights
            W2[:, neg_idx] -= learning_rate * grad_W2_neg
            W1[target_idx] -= learning_rate * grad_W1_neg
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1}, Loss: {loss:.4f}")

Epoch 100, Loss: 45.3056
Epoch 200, Loss: 43.4334
Epoch 300, Loss: 44.7213
Epoch 400, Loss: 43.2369
Epoch 500, Loss: 37.0080
Epoch 600, Loss: 45.4319
Epoch 700, Loss: 39.8265
Epoch 800, Loss: 40.0149
Epoch 900, Loss: 39.4416
Epoch 1000, Loss: 37.7593


In [38]:
# Retrieve word embeddings
word_embeddings = W1

# Example: Find similar words
def find_similar(word, top_n=3):
    if word not in word_to_idx:
        print(f"'{word}' not in vocabulary.")
        return
    idx = word_to_idx[word]
    vec = word_embeddings[idx]
    similarities = []
    for i in range(vocab_size):
        if i == idx:
            continue
        sim = np.dot(vec, word_embeddings[i]) / (np.linalg.norm(vec) * np.linalg.norm(word_embeddings[i]))
        similarities.append((idx_to_word[i], sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    for word, sim in similarities[:top_n]:
        print(f"{word}: {sim:.4f}")

# Test the model
print("\nWords similar to 'fox':")
find_similar('fox')


Words similar to 'fox':
brown: 0.3096
the: 0.2008
jumps: 0.1507


#### With `Gensim`

In [39]:
import gensim
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["i", "love", "natural", "language", "processing"],
    ["word2vec", "is", "a", "technique", "for", "natural", "language", "processing"],
    ["the", "dog", "is", "lazy", "but", "the", "brown", "fox", "is", "quick"]
]


In [40]:
# Initialize and train the model
model = Word2Vec(
    sentences,
    vector_size=100,  # Dimensionality of the word vectors
    window=5,         # Maximum distance between the current and predicted word
    min_count=1,      # Ignores all words with total frequency lower than this
    workers=4,        # Use these many worker threads to train the model
    sg=1              # 1 for Skip-gram; 0 for CBOW
)

In [41]:
# Find most similar words
similar_words = model.wv.most_similar("fox", topn=3)
print(similar_words)

[('over', 0.16699621081352234), ('brown', 0.1388736069202423), ('quick', 0.13150502741336823)]


In [42]:
# Compute similarity between two words
similarity = model.wv.similarity("dog", "fox")
print(f"Similarity between 'dog' and 'fox': {similarity:.4f}")

Similarity between 'dog' and 'fox': -0.1052


In [43]:
sample = """
Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
neighbors. The Dursleys had a small son called Dudley and in their
opinion there was no finer boy anywhere.


The Dursleys had everything they wanted, but they also had a secret, and
their greatest fear was that somebody would discover it. They didn't
think they could bear it if anyone found out about the Potters. Mrs.
Potter was Mrs. Dursley's sister, but they hadn't met for several years;
in fact, Mrs. Dursley pretended she didn't have a sister, because her
sister and her good-for-nothing husband were as unDursleyish as it was
possible to be. The Dursleys shuddered to think what the neighbors would
say if the Potters arrived in the street. The Dursleys knew that the
Potters had a small son, too, but they had never even seen him. This boy
was another good reason for keeping the Potters away; they didn't want
Dudley mixing with a child like that.


When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story
starts, there was nothing about the cloudy sky outside to suggest that
strange and mysterious things would soon be happening all over the
country. Mr. Dursley hummed as he picked out his most boring tie for
work, and Mrs. Dursley gossiped away happily as she wrestled a screaming
Dudley into his high chair.
"""

sentences = [
    gensim.utils.simple_preprocess(sentence)
    for sentence in sample.split("\n\n")
]

In [44]:
model = Word2Vec(
    sentences,
    vector_size=100,  # Dimensionality of the word vectors
    window=5,         # Maximum distance between the current and predicted word
    min_count=1,      # Ignores all words with total frequency lower than this
    workers=4,        # Use these many worker threads to train the model
    sg=1              # 1 for Skip-gram; 0 for CBOW
)

In [45]:
# Find most similar words
similar_words = model.wv.most_similar("potter", topn=3)
print(similar_words)

[('dursley', 0.2640940845012665), ('on', 0.2523151636123657), ('starts', 0.24110931158065796)]


In [46]:
similar_words = model.wv.most_similar("somebody", topn=3)
print(similar_words)

[('want', 0.2976841926574707), ('suggest', 0.2530568242073059), ('she', 0.23463937640190125)]


In [47]:
similar_words = model.wv.most_similar("sister", topn=3)
print(similar_words)

[('which', 0.21421189606189728), ('large', 0.19235055148601532), ('arrived', 0.19228415191173553)]


### `GLoVE`

- https://nlp.stanford.edu/projects/glove/
- https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models

In [48]:
import gensim.downloader

In [49]:
# All available models in gensim-data
for model in gensim.downloader.info()['models'].keys():
    print(model)

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis


In [50]:
glove_vectors = gensim.downloader.load('glove-twitter-25')

In [51]:
glove_vectors.most_similar('twitter', topn=20)

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863),
 ('bio', 0.8740679621696472),
 ('skype', 0.8711126446723938),
 ('youtube', 0.8707534074783325),
 ('spam', 0.8684024214744568),
 ('tumblr', 0.8668119311332703),
 ('ex', 0.8645952939987183),
 ('ask', 0.8644779920578003),
 ('dm', 0.8439710736274719),
 ('insta', 0.8426101207733154),
 ('post', 0.8411487340927124)]

In [52]:
glove_vectors.most_similar('president', topn=20)

[('barack', 0.9471943378448486),
 ('obama', 0.9400959014892578),
 ('clinton', 0.9378828406333923),
 ('former', 0.9294927716255188),
 ('minister', 0.9137527346611023),
 ('romney', 0.9051568508148193),
 ('pope', 0.9035295248031616),
 ('senator', 0.8977937698364258),
 ('kerry', 0.8958768844604492),
 ('hillary', 0.893998920917511),
 ('potus', 0.8929537534713745),
 ('bill', 0.8894913196563721),
 ('says', 0.8860845565795898),
 ('candidate', 0.8827615976333618),
 ('justice', 0.882584273815155),
 ('gov', 0.8798654675483704),
 ('leader', 0.8764994144439697),
 ('labour', 0.8748924732208252),
 ('claims', 0.8715708255767822),
 ('reagan', 0.8713157176971436)]

In [53]:
glove_vectors.most_similar('usa', topn=20)

[('china', 0.8639861941337585),
 ('local', 0.8453366756439209),
 ('capital', 0.8419684767723083),
 ('base', 0.8405494689941406),
 ('a', 0.8330598473548889),
 ('fox', 0.8293148279190063),
 ('sub', 0.8233862519264221),
 ('america', 0.8221979737281799),
 ('union', 0.8131723403930664),
 ('media', 0.8116157650947571),
 ('club', 0.8025693297386169),
 ('pro', 0.8019703030586243),
 ('central', 0.8011414408683777),
 ('uk', 0.7979711294174194),
 ('red', 0.794592559337616),
 ('dc', 0.7931137681007385),
 ('top', 0.791972279548645),
 ('rock', 0.7895259261131287),
 ('no', 0.786898672580719),
 ('york', 0.7861546874046326)]

In [54]:
glove_vectors.get_vector('king')

array([-0.74501 , -0.11992 ,  0.37329 ,  0.36847 , -0.4472  , -0.2288  ,
        0.70118 ,  0.82872 ,  0.39486 , -0.58347 ,  0.41488 ,  0.37074 ,
       -3.6906  , -0.20101 ,  0.11472 , -0.34661 ,  0.36208 ,  0.095679,
       -0.01765 ,  0.68498 , -0.049013,  0.54049 , -0.21005 , -0.65397 ,
        0.64556 ], dtype=float32)

In [55]:
king = glove_vectors.get_vector('king')
queen = glove_vectors.get_vector('queen')
man = glove_vectors.get_vector('man')
woman = glove_vectors.get_vector('woman')

res = king - man + woman

In [56]:
res

array([-2.04041   , -0.06222999,  0.07362202,  1.1453301 ,  0.3944    ,
       -0.72558004,  1.8081    ,  0.11692998,  0.79493   , -0.66673994,
        0.99141   ,  0.54456997, -3.1602    , -0.00692701, -0.68719   ,
       -0.71597195, -0.13448006, -0.077546  ,  1.42316   ,  1.05583   ,
        0.720557  , -0.25537002, -0.4989    , -2.1607199 , -0.56942   ],
      dtype=float32)

In [57]:
res.shape

(25,)

In [60]:
# calculate the cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(res.reshape(1, -1), queen.reshape(1, -1))
print(f"Similarity between queen and res: {similarity[0][0]}")

Similarity between queen and res: 0.7530912756919861


In [63]:
# calculate the cosine similarity of two vectors following the linear algebra formula
import numpy as np

def cosine_similarity2(v1, v2):
    return np.dot(v1, v2.T) / (np.linalg.norm(v1) * np.linalg.norm(v2))


In [64]:
similarity2 = cosine_similarity2(res.reshape(1, -1), queen.reshape(1, -1))
print(f"Similarity between queen and res: {similarity2[0][0]}")

Similarity between queen and res: 0.7530912756919861


# Prompt Engineering

## Quick Example

In [65]:
import openai
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

python-dotenv could not parse statement starting at line 1


In [66]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message["content"]

In [67]:
prompt = "what is the capital of France?"
response = get_completion(prompt)
print(response)

The capital of France is Paris.


In [68]:
prompt = "If there are 3 apples and you take away 2, how many in total?"
response = get_completion(prompt)
print(response)

There would be 1 apple remaining.


In [69]:
text = f"""
Cooking ma po tofu is easy. First, you need to buy some tofu. Then you need to heat some oil in a pan.
After that, you need to add the tofu to the pan. Then you need to cook the tofu. After that, you need 
to add some seasoning to the tofu. Some people might first cook some ground beef and then add the tofu.
And that's it! You have cooked some delicious tofu. Enjoy!
"""

prompt = f"""
You will be provided with text delimited by triple quotes. If the content contains a sequence of instructions,
re-write those instructions in the following format:

Step 1 - ...
Step 2 - ...
...
Step N - ...
If the content does not contain a sequence of instructions, then simply write \"No steps provided.\"
\"\"\"{text}\"\"\"
"""

response = get_completion(prompt)
print("Completion for Text-to-Step transformation:")
print(response)


Completion for Text-to-Step transformation:

Step 1 - Buy some tofu.
Step 2 - Heat some oil in a pan.
Step 3 - Add the tofu to the pan.
Step 4 - Cook the tofu.
Step 5 - Add seasoning to the tofu.
Step 6 - Optional: Cook some ground beef and then add it to the tofu.
Step 7 - Enjoy your delicious tofu.


## Tokens

In [70]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [71]:
tokenizer.encode('tiktoken is great!')

[83, 1609, 5963, 374, 2294, 0]

In [72]:
def num_tokens_from_string(string: str, model_name: str = "gpt-3.5-turbo") -> int:
    """Returns the number of tokens in a text string."""
    tokenizer = tiktoken.encoding_for_model(model_name)
    num_tokens = len(tokenizer.encode(string))
    return num_tokens

In [73]:
num_tokens_from_string(prompt)

157

In [74]:
num_tokens_from_string(response)

76

In [75]:
# turn tokens into text
tokenizer.decode([83, 1609, 5963, 374, 2294, 0])

'tiktoken is great!'

## More Examples

### Avoid prompt injection

### Format

In [76]:
text = f"""
Cooking ma po tofu is easy. First, you need to buy some tofu. Then you need to heat some oil in a pan.
After that, you need to add the tofu to the pan. Then you need to cook the tofu. After that, you need 
to add some seasoning to the tofu. Some people might first cook some ground beef and then add the tofu.
And that's it! You have cooked some delicious tofu. Enjoy!
"""

prompt = f"""
You will be provided with text delimited by triple quotes. If the content contains a sequence of instructions,
re-write those instructions in the following format:

Step 1 - ...
Step 2 - ...
...
Step N - ...
If the content does not contain a sequence of instructions, then simply write \"No steps provided.\"
Please provide the response in JSON format with the following keys:
step_numbers, steps
\"\"\"{text}\"\"\"
"""

response = get_completion(prompt)
print("Completion for Text-to-Step transformation:")
print(response)

Completion for Text-to-Step transformation:
{
    "step_numbers": ["Step 1", "Step 2", "Step 3", "Step 4", "Step 5", "Step 6"],
    "steps": [
        "Buy some tofu.",
        "Heat some oil in a pan.",
        "Add the tofu to the pan.",
        "Cook the tofu.",
        "Add some seasoning to the tofu.",
        "Cook ground beef and then add the tofu."
    ]
}


### Check if condition is met

### Control the length

In [77]:
text = f"""
Cooking ma po tofu is easy. First, you need to buy some tofu. Then you need to heat some oil in a pan.
After that, you need to add the tofu to the pan. Then you need to cook the tofu. After that, you need 
to add some seasoning to the tofu. Some people might first cook some ground beef and then add the tofu.
And that's it! You have cooked some delicious tofu. Enjoy!
"""

prompt = f"""
You will be provided with text delimited by triple quotes. If the content contains a sequence of instructions,
Summarize the information about cooking ma po tofu, make sure the summarization is less than 20 tokens.

If the content does not contain a sequence of instructions, then simply write \"No steps provided.\"
Please provide the response in JSON format with the following keys:
step_numbers, steps
\"\"\"{text}\"\"\"
"""

response = get_completion(prompt)
print("Completion for Text-to-Step transformation:")
print(response)

Completion for Text-to-Step transformation:
{
    "step_numbers": "6",
    "steps": "Buy tofu, heat oil, cook tofu, add seasoning, cook ground beef (optional), enjoy"
}


### Few-shot prompting

In [78]:
prompt = """
Please answer questions in a consistent style.

Q: How can I become a kungfu master?
A: Empty your mind, be formless. Shapeless, like water. If you put water into a cup, it becomes the cup. You put water into a bottle and it becomes the bottle. You put it in a teapot, it becomes the teapot. Now, water can flow or it can crash. Be water, my friend.
Q: How can I become a good leader?
"""

response = get_completion(prompt)
print(response)


A: To become a good leader, you must lead by example, inspire others, communicate effectively, make decisions with confidence, and always strive to improve yourself and those around you. It's important to listen to feedback, be open-minded, and show empathy towards others. Remember, a good leader is someone who empowers and motivates their team to achieve success.


### A math problem (coursera example)

In [79]:
prompt = """
Determine if the student's solution is correct or not.

Question:
I'm building a solar power installation and I need
 help working out the financials. 
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost 
me a flat $100k per year, and an additional $10 / square
foot
What is the total cost for the first year of operations 
as a function of the number of square feet.

Student's Solution:
Let x be the size of the installation in square feet.
Costs:
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000
"""

response = get_completion(prompt)
print(response)

The student's solution is incorrect. The total cost for the first year of operations should be calculated as follows:

Total cost = Land cost + Solar panel cost + Maintenance cost
Total cost = 100x + 250x + (100,000 + 10x)
Total cost = 350x + 100,000

Therefore, the correct total cost for the first year of operations as a function of the number of square feet is 350x + 100,000.


In [None]:
len(response)

### Hallucinations

In [80]:
prompt = "Tell me about the architecture Xiangshi Yin"

response = get_completion(prompt)
print(response)

Xiangshi Yin is a traditional Chinese architectural style that originated in the Song Dynasty (960-1279) and reached its peak during the Ming (1368-1644) and Qing (1644-1912) dynasties. It is characterized by its intricate wooden structures, curved eaves, and elaborate decorations.

One of the key features of Xiangshi Yin architecture is the use of dougong, a unique structural element that consists of interlocking wooden brackets that support the roof. Dougong allows for large, open interior spaces without the need for columns or other supports, giving Xiangshi Yin buildings a sense of lightness and elegance.

Xiangshi Yin buildings are typically constructed using traditional Chinese building techniques, such as mortise and tenon joints, and feature intricate carvings and paintings on the walls, doors, and ceilings. The roofs are often adorned with colorful glazed tiles and ornate dragon motifs, symbolizing power and protection.

Xiangshi Yin architecture is often found in temples, pal

## Interative Solution

In [None]:
prompt = "Tell me about the self-attention mechanism in transformers."

## Summarize

In [None]:
review = """
First year changing from Milorganite.....Have already spread over 3 acres once, waiting on my next shipment. This stuff spreads super easy and I am glad they increased the size of bags. We only had one issue and it is when my daughter had her basketball team over, it is good for PGF and bad.....maybe. I decided to use the spreading of fertilizer as conditioning drill, which seemed like a win/win. All the girls did well until it got to KOBI. Kobi started sprinting with the spreader and I was attempting to get her to slow down. By the time I got to her to explain why I needed an even spread on my beautiful lawn, she slipped in a giant St. Bernard turd. Unfortunately she was wearing her basketball shoes for some reason so they got ruined....but that is not the worst part. When she fell she went forward and landed face first in the spreader with all the fertilizer. Which would not have been a problem but Kobi is the most out of shape person on the team and when she began sprinting with the spreader she immediately broke out in a sweat. The PGF fertilizer stuck to her face and when she looked up it looked like a young version of the bearded lady from the carnival. After I quit laughing, I attempted to help her wash off her face. Well actually I asked her to pick up all the fertilizer she spilled first because this stuff is not cheap. Then we washed off her face but she screamed and screamed. I thought it was because of her prepubescent acne face. However, I now believe it was because PGF started to work instantly, with the moisture on her face. That was late February and her mother just called me in April and stated her daughter is growing a full beard and mustache. It was unfortunate because we had a basketball tournament and they would not let her play because they did not believe she was a girl. The hair on her face is thick and rich, which makes me think this is a great product. But you might want to keep it away from the Kobi's of the world.
"""


## Inferring
- Sentiment (positive/negative)
- Identify types of emotions
- Identify the subject of the text
- Identify the entities (product and company)
- Multiple tasks at once

In [None]:
prompt = f"""
Help me identify the entities and relations present in the following product review:
Review:
```
{review}
```
"""

response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
Help me identify the entities (product, company, etc.) and relations present in the following product review:
Review:
```
{review}
```
"""

response = get_completion(prompt)
print(response)

## Transform

### Translation

In [None]:
prompt = """
Translate the following English text to Chinese:

Hi, I would like to order a ma po tofu.
"""


### Tone transformation

### Spellchecks (example from coursera) 

In [None]:
text = [ 
  "The girl with the black and white puppies have a ball.",  # The girl has a ball.
  "Yolanda has her notebook.", # ok
  "Its going to be a long day. Does the car need it’s oil changed?",  # Homonyms
  "Their goes my freedom. There going to bring they’re suitcases.",  # Homonyms
  "Your going to need you’re notebook.",  # Homonyms
  "That medicine effects my ability to sleep. Have you heard of the butterfly affect?", # Homonyms
  "This phrase is to cherck chatGPT for speling abilitty"  # spelling
]

### Reply to Customer Emails

In [None]:
prompt = f"""
You are a customer service AI assistant.
Your task is to send an email reply to a valued customer.
Given the customer email delimited by ```, \
Generate a reply to thank the customer for their review.
If the sentiment is positive or neutral, thank them for \
their review.
If the sentiment is negative, apologize and suggest that \
they can reach out to customer service. 
Make sure to use specific details from the review.
Write in a concise and professional tone.
Sign the email as `AI customer agent`.
Customer review: ```{review}```
Review sentiment: {sentiment}
"""

## Chatbot

In [None]:
messages =  [  
{'role':'system', 'content':'You are an assistant that speaks like Shakespeare.'},    
{'role':'user', 'content':'tell me a joke'},   
{'role':'assistant', 'content':'Why did the chicken cross the road'},   
{'role':'user', 'content':'I don\'t know'}  ]

In [None]:
x = input("Tell me a joke: ")

In [None]:
print(x)