### NER (Named Entity Recognition): Using token classification to classify entities from natural language

In [1]:
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification, pipeline

In [2]:
# https://huggingface.co/savasy/bert-base-turkish-ner-cased

custom_module = 'savasy/bert-base-turkish-ner-cased'

turkish_ner_tokenizer = AutoTokenizer.from_pretrained(custom_module)
turkish_ner_model = AutoModelForTokenClassification.from_pretrained(custom_module)

In [3]:
sequence = "Merhaba! Benim adım Sinan. San Francisco'dan geliyorum"  # "Hi! I'm Sinan. I come from San Francisco"

ner = pipeline('ner', model=turkish_ner_model, tokenizer=turkish_ner_tokenizer)
ner(sequence)

[{'entity': 'B-PER',
  'score': 0.724247,
  'index': 5,
  'word': 'Sinan',
  'start': 20,
  'end': 25},
 {'entity': 'B-LOC',
  'score': 0.99879956,
  'index': 7,
  'word': 'San',
  'start': 27,
  'end': 30},
 {'entity': 'I-LOC',
  'score': 0.99770975,
  'index': 8,
  'word': 'Francisco',
  'start': 31,
  'end': 40}]
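
The pipeline output above is token-level: "San" and "Francisco" come back as separate `B-LOC` / `I-LOC` entries. As a rough sketch of how those tags combine into whole entities, the helper below merges them using the character offsets. The function name `merge_bio_entities` and the choice to report the minimum score per span are assumptions made for illustration, not part of the library; the `pipeline` call also accepts an `aggregation_strategy` argument that performs a similar grouping for you.

```python
def merge_bio_entities(tokens, text):
    """Merge token-level B-/I- predictions into whole entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        continues = bool(spans) and prefix == "I" and spans[-1]["label"] == label
        if continues:
            # An I- tag with the same label extends the currently open span.
            spans[-1]["end"] = tok["end"]
            spans[-1]["scores"].append(tok["score"])
        else:
            # A B- tag (or a stray I- tag) opens a new span.
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})
    # Slice the original string so multi-token entities keep their spacing.
    return [{"label": s["label"],
             "text": text[s["start"]:s["end"]],
             "score": min(s["scores"])}  # assumption: weakest token score
            for s in spans]


sequence = "Merhaba! Benim adım Sinan. San Francisco'dan geliyorum"
# Token-level predictions copied from the pipeline output above.
token_predictions = [
    {"entity": "B-PER", "score": 0.724247, "word": "Sinan", "start": 20, "end": 25},
    {"entity": "B-LOC", "score": 0.99879956, "word": "San", "start": 27, "end": 30},
    {"entity": "I-LOC", "score": 0.99770975, "word": "Francisco", "start": 31, "end": 40},
]

merged = merge_bio_entities(token_predictions, sequence)
for entity in merged:
    print(entity["label"], "->", entity["text"])
```

Slicing the original string by offsets, rather than joining the `word` fields, avoids having to guess where spaces or sub-word pieces belong.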

### Speech Transcription

In [4]:
import librosa
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

local_file, sampling_rate = librosa.load('../data/sample.wav', sr=16000)  # Resample to 16 kHz, the sampling rate the model was trained on

input_values = processor(local_file, return_tensors="pt", sampling_rate=sampling_rate).input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])

transcription

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'WHAT A WONDERFUL CLASS'
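
The `argmax` + `decode` step above is greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. Here is a toy sketch of that collapse in plain Python. The vocabulary, the blank id of 0, and the `|` word delimiter are assumptions chosen for illustration; the real mapping lives in the processor's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the CTC collapse rule."""
    # 1) Merge runs of identical ids, 2) drop the blank token.
    collapsed = [i for i, _ in groupby(frame_ids) if i != blank_id]
    # 3) Map ids to characters and turn word delimiters into spaces.
    chars = "".join(id_to_char[i] for i in collapsed)
    return " ".join(chars.replace(word_delimiter, " ").split())


# Hypothetical toy vocabulary (not the model's real one).
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}

# One id per audio frame. The blank (0) between the two runs of 4
# is what lets the double "L" survive the repeat-merging step.
frame_ids = [2, 2, 3, 0, 4, 4, 0, 4, 5, 5, 1, 0]

decoded = ctc_greedy_collapse(frame_ids, toy_vocab)
print(decoded)
```

Without the blank token, a genuine double letter would be indistinguishable from one letter held across several frames.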

## Image Captioning does not do THAT much better than ours :) 
https://huggingface.co/spaces/flax-community/image-captioning
<img title="a title" alt="Attention" src="../images/ViT.png">

## Text to Image. Truly Horrifying :)
https://huggingface.co/spaces/flax-community/dalle-mini
<img title="a title" alt="Attention" src="../images/texttoimage.png">

## Python code completion

[Hugging Face repo here](https://huggingface.co/Sentdex/GPyT)

In [5]:
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Sentdex/GPyT")
model = AutoModelForCausalLM.from_pretrained("Sentdex/GPyT")

In [6]:
input_code = """import pandas as pd
import numpy as np

df = pd"""  # I'd expect a read_csv here

converted = input_code.replace("\n", "<N>")
tokenized = tokenizer.encode(converted, return_tensors='pt')
resp = model.generate(tokenized, max_length=tokenized.shape[1] + 10)  # 10 more tokens

decoded = tokenizer.decode(resp[0])
reformatted = decoded.replace("<N>","\n")

print(reformatted)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


import pandas as pd
import numpy as np

df = pd.read_csv('data/data/data
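
The cell above swaps newlines for `<N>` before generation and swaps them back afterwards, and the decoded text repeats the whole prompt. A small sketch of helpers that wrap that round trip and keep only what the model added to the last line follows. The helper names are made up for illustration, and the prefix check assumes the decoded text reproduces the prompt exactly, which may not hold for every tokenizer.

```python
NEWLINE_TOKEN = "<N>"  # newline placeholder used in the cell above


def to_model_format(code):
    """Replace real newlines with the placeholder token."""
    return code.replace("\n", NEWLINE_TOKEN)


def from_model_format(text):
    """Restore real newlines in decoded model output."""
    return text.replace(NEWLINE_TOKEN, "\n")


def first_completed_line(prompt, generated):
    """Return only the text appended to the prompt's last line."""
    if not generated.startswith(prompt):
        return ""  # assumption: give up if the prompt is not echoed verbatim
    continuation = generated[len(prompt):]
    return continuation.split("\n", 1)[0]


prompt = "import pandas as pd\nimport numpy as np\n\ndf = pd"
# Stand-in for the decoded model output shown above.
decoded = to_model_format(prompt) + ".read_csv('data/data/data"

completion = first_completed_line(prompt, from_model_format(decoded))
print(completion)
```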


## Article Headline Generation

[Hugging Face repo here](https://huggingface.co/Michau/t5-base-en-generate-headline)

In [7]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")
tokenizer = T5Tokenizer.from_pretrained("Michau/t5-base-en-generate-headline")
model = model.to(device)

In [8]:
article = '''
Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady Melania Trump tested positive for COVID-19. Officials said the Trumps' 14-year-old son Barron tested negative as did First Family and Senior Advisors Jared Kushner and Ivanka Trump.
Trump took to social media, posting at 12:54 am local time (0454 UTC) on Twitter, "Tonight, [Melania] and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!" Yesterday afternoon Marine One landed on the White House's South Lawn flying Trump to Walter Reed National Military Medical Center (WRNMMC) in Bethesda, Maryland.
Reports said both were showing "mild symptoms". Senior administration officials were tested as people were informed of the positive test. Senior advisor Hope Hicks had tested positive on Thursday.
Presidential physician Sean Conley issued a statement saying Trump has been given zinc, vitamin D, Pepcid and a daily Aspirin. Conley also gave a single dose of the experimental polyclonal antibodies drug from Regeneron Pharmaceuticals.
According to official statements, Trump, now operating from the WRNMMC, is to continue performing his duties as president during a 14-day quarantine. In the event of Trump becoming incapacitated, Vice President Mike Pence could take over the duties of president via the 25th Amendment of the US Constitution. The Pence family all tested negative as of yesterday and there were no changes regarding Pence's campaign events.
'''

text =  "headline: " + article

encoding = tokenizer.encode_plus(text, return_tensors = "pt")
input_ids = encoding["input_ids"].to(device)
attention_masks = encoding["attention_mask"].to(device)

beam_outputs = model.generate(
    input_ids = input_ids,
    attention_mask = attention_masks,
    max_length = 64,
    num_beams = 3,
    early_stopping = True,
)

result = tokenizer.decode(beam_outputs[0])
print(result)

<pad> Trump and First Lady Melania Test Positive for COVID-19</s>


In [9]:
# load env variables from a .env file | pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()

True

In [10]:
import requests
from random import sample
import os

NYTIMES_KEY = os.environ.get('NYTIMES_KEY')

results =  requests.get(f'https://api.nytimes.com/svc/mostpopular/v2/viewed/1.json?api-key={NYTIMES_KEY}').json()['results']

for result in results[:5]:
    print(result['title'], result['url'])

article = results[0]

article['title'], article['url']

Noma, Rated the World’s Best Restaurant, Is Closing Its Doors https://www.nytimes.com/2023/01/09/dining/noma-closing-rene-redzepi.html
Biden Lawyers Found Classified Material at His Former Office https://www.nytimes.com/2023/01/09/us/politics/biden-classified-documents.html
A Lecturer Showed a Painting of the Prophet Muhammad. She Lost Her Job. https://www.nytimes.com/2023/01/08/us/hamline-university-islam-prophet-muhammad.html
House Narrowly Approves Rules Amid Concerns About McCarthy’s Concessions https://www.nytimes.com/2023/01/09/us/politics/house-rules-republicans-mccarthy.html
Has Prince Harry’s Confessional Tour Run Its Course? https://www.nytimes.com/2023/01/09/books/prince-harry-book-royal-family.html


('Noma, Rated the World’s Best Restaurant, Is Closing Its Doors',
 'https://www.nytimes.com/2023/01/09/dining/noma-closing-rene-redzepi.html')
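
The request above puts the key straight into an f-string, so a missing `NYTIMES_KEY` silently becomes `api-key=None`. As a hedged sketch, the helpers below build the same URL with an explicit check and pull `(title, url)` pairs out of a parsed response. The function names are hypothetical, and the sample payload only mimics the two fields this notebook reads.

```python
from urllib.parse import urlencode

# Same endpoint as the cell above; {period} is the "1" in ".../viewed/1.json".
BASE_URL = "https://api.nytimes.com/svc/mostpopular/v2/viewed/{period}.json"


def build_most_viewed_url(api_key, period=1):
    """Build the request URL, failing early if the key is missing."""
    if not api_key:
        raise ValueError("NYTIMES_KEY is not set; check your .env file")
    return BASE_URL.format(period=period) + "?" + urlencode({"api-key": api_key})


def top_titles(payload, limit=5):
    """Extract (title, url) pairs from a parsed JSON response."""
    return [(item["title"], item["url"])
            for item in payload.get("results", [])[:limit]]


# Hand-written payload shaped like the response used above (two fields only).
sample_payload = {
    "results": [
        {"title": "Noma, Rated the World’s Best Restaurant, Is Closing Its Doors",
         "url": "https://www.nytimes.com/2023/01/09/dining/noma-closing-rene-redzepi.html"},
        {"title": "Biden Lawyers Found Classified Material at His Former Office",
         "url": "https://www.nytimes.com/2023/01/09/us/politics/biden-classified-documents.html"},
    ]
}

url = build_most_viewed_url("demo-key")  # "demo-key" is a placeholder
print(url)
for title, link in top_titles(sample_payload):
    print(title, link)
```

In the notebook you would pass the URL to `requests.get(url, timeout=10)` and call `raise_for_status()` on the response before `.json()`, so an invalid key or rate limit fails with a clear HTTP error rather than a `KeyError` on `'results'`.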

In [11]:
# Body text pasted by hand from the third story listed above (results[2], the Hamline University article)
article_body = """Erika López Prater, an adjunct professor at Hamline University, said she knew many Muslims have deeply held religious beliefs that prohibit depictions of the Prophet Muhammad. So last semester for a global art history class, she took many precautions before showing a 14th-century painting of Islam’s founder.

In the syllabus, she warned that images of holy figures, including the Prophet Muhammad and the Buddha, would be shown in the course. She asked students to contact her with any concerns, and she said no one did.

In class, she prepped students, telling them that in a few minutes, the painting would be displayed, in case anyone wanted to leave.

Then Dr. López Prater showed the image — and lost her teaching gig.

Officials at Hamline, a small, private university in St. Paul, Minn., with about 1,800 undergraduates, had tried to douse what they feared would become a runaway fire. Instead they ended up with what they had tried to avoid: a national controversy, which pitted advocates of academic liberty and free speech against Muslims who believe that showing the image of Prophet Muhammad is always sacrilegious.

After Dr. López Prater showed the image, a senior in the class complained to the administration. Other Muslim students, not in the course, supported the student, saying the class was an attack on their religion. They demanded that officials take action."""

In [12]:
text =  "headline: " + article_body

encoding = tokenizer.encode_plus(text, return_tensors = "pt")
input_ids = encoding["input_ids"].to(device)  # keep inputs on the same device as the model
attention_masks = encoding["attention_mask"].to(device)

beam_outputs = model.generate(
    input_ids = input_ids,
    attention_mask = attention_masks,
    max_length = input_ids.shape[1] + 15,
    num_beams = 5,
    early_stopping = True,
)

result = tokenizer.decode(beam_outputs[0], skip_special_tokens=True)

In [13]:
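# Compare against the title of the article the body was taken from (results[2]), not results[0]
print("Original Title\n==============\n", results[2]['title'], "\n\nGenerated Title\n==============\n", result)

Original Title
==============
 A Lecturer Showed a Painting of the Prophet Muhammad. She Lost Her Job. 

Generated Title
==============
 Hamline University's Art History Professor Loses Teaching Job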
