<a href="https://colab.research.google.com/github/shiraeisenberg/genomics_and_play_notebooks/blob/main/contrastive_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Playing with contrastive search

In [1]:
!pip install torch
!pip install "transformers==4.24.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.24.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0


In [4]:
# Greedy search: at every step, deterministically pick the token to which the LM assigns the highest probability.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('HuggingFace company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

output = model.generate(input_ids, max_length=128)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
HuggingFace company is a new company that is trying to make it easier for people to find and share their favorite photos.

The company is currently in beta testing and is looking for beta testers to help test the service.

The company is looking for people to help test the service and provide feedback.

The beta testing is currently open to the public and will run until the end of the month.

The beta testing is free and will run until the end of the month.

The company is looking for people to help test the service and provide feedback.

The beta testing is free and will run
----------------------------------------------------------------------------------------------------
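
The repeated sentences in the output above are typical of greedy search. As a rough illustration of why, here is a minimal pure-Python sketch of greedy decoding over a toy next-token table. `TOY_LM`, `greedy_decode` and every probability in it are hypothetical and invented for this example; they have nothing to do with GPT-2's actual distribution or with how `generate` is implemented.

```python
# Hypothetical next-token distributions, keyed by the previous token.
TOY_LM = {
    "the": {"company": 0.6, "service": 0.4},
    "company": {"is": 0.7, "tests": 0.3},
    "is": {"testing": 0.55, "new": 0.45},
    "testing": {"the": 0.9, "<eos>": 0.1},
    "service": {"<eos>": 1.0},
    "tests": {"<eos>": 1.0},
    "new": {"<eos>": 1.0},
}


def greedy_decode(start, max_new_tokens):
    """Always append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = TOY_LM[tokens[-1]]
        next_token = max(dist, key=dist.get)  # argmax, no randomness
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


greedy_tokens = greedy_decode("the", max_new_tokens=8)
print(" ".join(greedy_tokens))
# prints: the company is testing the company is testing the
```

Because the argmax is deterministic, the toy sequence falls into a loop as soon as it revisits a token it has already seen, which mirrors the "The beta testing is free and will run ..." cycle above.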


In [8]:
# Nucleus (top-p) sampling: a stochastic method that samples the next token from the smallest set of
# most-probable tokens whose cumulative probability is >= p.
import torch
tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('Deepmind company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=128, top_p=0.95, top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Deepmind company is a fundamental invention.

... is "tree-savvy": Innovators, wherever they are, include creativity in their arsenal (and a lot of it).

... heists algorithmically and systematically.

... its efforts are overseen by a charismatic leader, who was thus the first AI research manager.

These are issues I'm very well familiar with, but as such (and because the site and related materials are free of charge and accessible all the time) I think others need to come to understand this. Bostrom on Turing Machines

Carlos Betancourt is a professor at
----------------------------------------------------------------------------------------------------
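
To make the top-p cutoff concrete, here is a small pure-Python sketch of the filtering step on a made-up distribution. The helper `top_p_filter`, the toy vocabulary and its probabilities are assumptions for illustration only; this is not the code path `generate` uses internally.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p,
    then renormalise the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
toy_probs = {"a": 0.5, "the": 0.3, "new": 0.15, "tree": 0.04, "heist": 0.01}
nucleus = top_p_filter(toy_probs, top_p=0.9)

# Sample one token from the truncated distribution with a fixed seed.
rng = random.Random(0)
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sorted(nucleus), sampled)
```

With `top_p=0.9` the nucleus here is `a`, `the` and `new`; the low-probability tail is dropped. At `top_p=0.95`, as in the cell above, more of the tail stays eligible, which is one reason the sampled text can drift off topic.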


In [11]:
# Contrastive search: among the top-k candidates, trades off the probability the LM assigns to a token
# (to keep the text coherent) against its similarity to the previous context (to avoid model degeneration).
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = 'gpt2-large'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
model.eval()

# prepare the prefix
prefix_text = r'DeepMind Company is'
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

# generate the result with contrastive search
output = model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=512)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leader in artificial intelligence (AI). We have a long history of working with companies such as Google, Facebook, Amazon, and Microsoft to build products that improve people's lives, and today we are excited to announce that DeepMind's AlphaGo program has won the game of Go, becoming the first program to defeat a professional Go player.

The victory is a testament to the power of deep learning, and to the incredible work of our research team, which has been at the forefront of AI research for the past five years. AlphaGo is one of the most advanced Go programs ever created, and its performance is an important step towards the goal of human-level AI.

"This is the culmination of a decade of hard work," said Andy Ng, co-founder and CTO of DeepMind. "We are thrilled to have achieved this milestone and look forward to continuing to develop AI that can be used 

In [12]:
# Contrastive search: among the top-k candidates, trades off the probability the LM assigns to a token
# (to keep the text coherent) against its similarity to the previous context (to avoid model degeneration).
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = 'gpt2-large'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
model.eval()

# prepare the prefix
prefix_text = r'Huggingface Company is'
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

# generate the result with contrastive search
output = model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=512)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')

Output:
----------------------------------------------------------------------------------------------------
Huggingface Company is a company that makes a lot of things.

I have no idea what they make, but it's pretty cool to see them in the game.

This is an image of the company's logo:

And here's a close up of the nameplate:

The company has a few products that are listed on their website, such as this product:

This looks like a lot of fun, right? Well, if you're a fan of Star Wars, you're in luck, because they're offering a limited edition set of Darth Vader figurines for $20. The set comes with a set of six figures, and is limited to 1,000 pieces.

If you want to check out more information about the company, you can head over to their website.

You can follow me on Twitter, add me on Google+, and pick up a copy of my sci-fi novel, The Last Exodus, and its sequel, The Exiled Earthborn. Both books are now in print and available in ebook and audiobook versions. I also have a Tumblr 
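
Both contrastive runs above use `penalty_alpha=0.6` and `top_k=4`. As a hedged sketch of what those two knobs control, the snippet below scores four hypothetical candidates with the contrastive objective: `(1 - alpha) * model probability - alpha * max cosine similarity to the context's hidden states`. The 2-d "hidden states", the probabilities and the helper names `cosine` and `contrastive_pick` are made up for illustration; real hidden states come from the model, and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(candidates, context_states, alpha):
    """candidates: list of (token, probability, hidden_state) for the top-k tokens."""
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # Degeneration penalty: similarity to the closest previous token.
        penalty = max(cosine(state, h) for h in context_states)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Hypothetical hidden states of the tokens generated so far.
context_states = [[1.0, 0.0], [0.8, 0.6]]
# Hypothetical top-4 candidates (top_k=4).
candidates = [
    ("testing", 0.50, [1.0, 0.0]),   # most likely, but identical to a context state
    ("research", 0.30, [0.0, 1.0]),
    ("a", 0.15, [0.6, 0.8]),
    ("the", 0.05, [0.8, 0.6]),
]

greedy_choice = max(candidates, key=lambda c: c[1])[0]
contrastive_choice = contrastive_pick(candidates, context_states, alpha=0.6)
print(greedy_choice, contrastive_choice)
# prints: testing research
```

Greedy search would take `testing` on probability alone, while the penalty term steers contrastive search to `research`, a less likely token that is less similar to what has already been generated. Setting `alpha=0` removes the penalty, and the pick then matches the greedy choice.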