### NER (Named Entity Recognition): Using token classification to label entities in natural language

In [1]:
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification, pipeline


In [14]:
# https://huggingface.co/savasy/bert-base-turkish-ner-cased

custom_module = 'savasy/bert-base-turkish-ner-cased'

turkish_ner_tokenizer = AutoTokenizer.from_pretrained(custom_module)
turkish_ner_model = AutoModelForTokenClassification.from_pretrained(custom_module)

In [3]:
sequence = "Merhaba! Benim adım Sinan. San Francisco'dan geliyorum"  # "Hi! I'm Sinan. I come from San Francisco"

ner = pipeline('ner', model=turkish_ner_model, tokenizer=turkish_ner_tokenizer)
ner(sequence)

[{'entity': 'B-PER',
  'score': 0.72424716,
  'index': 5,
  'word': 'Sinan',
  'start': 20,
  'end': 25},
 {'entity': 'B-LOC',
  'score': 0.99879956,
  'index': 7,
  'word': 'San',
  'start': 27,
  'end': 30},
 {'entity': 'I-LOC',
  'score': 0.9977098,
  'index': 8,
  'word': 'Francisco',
  'start': 31,
  'end': 40}]
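
The raw output tags each token separately, so "San Francisco" comes back as a `B-LOC` token followed by an `I-LOC` token. Here is a minimal sketch of stitching those pieces back into whole entities, using only the `start`/`end` offsets shown above. `merge_bio_entities` is a hypothetical helper, not part of `transformers`, and keeping the lowest token score as the entity score is an arbitrary choice made for illustration.

```python
def merge_bio_entities(tokens, text):
    """Merge B-/I- tagged tokens into whole entity spans (hypothetical helper)."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:
            continue  # skip untagged ("O") tokens if they are present
        # An I- tag continues the previous span only when the labels match.
        if prefix == "I" and spans and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
            spans[-1]["scores"].append(tok["score"])
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})
    return [{"label": s["label"],
             "text": text[s["start"]:s["end"]],
             "score": min(s["scores"])}  # assumption: report the weakest token score
            for s in spans]


# Sample data copied from the cell output above.
sequence = "Merhaba! Benim adım Sinan. San Francisco'dan geliyorum"
ner_output = [
    {"entity": "B-PER", "score": 0.72424716, "index": 5, "word": "Sinan", "start": 20, "end": 25},
    {"entity": "B-LOC", "score": 0.99879956, "index": 7, "word": "San", "start": 27, "end": 30},
    {"entity": "I-LOC", "score": 0.9977098, "index": 8, "word": "Francisco", "start": 31, "end": 40},
]

for entity in merge_bio_entities(ner_output, sequence):
    print(entity["label"], entity["text"])
# PER Sinan
# LOC San Francisco
```

Recent versions of `transformers` can also do this grouping inside the pipeline through the `aggregation_strategy` argument (for example `"simple"`); check that your installed version supports it before relying on it.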

### Speech Transcription

In [4]:
import librosa
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

local_file, sampling_rate = librosa.load('../data/sample.wav', sr=16000)  # Resample to 16 kHz, the rate the model was trained on

input_values = processor(local_file, return_tensors="pt", sampling_rate=sampling_rate).input_values
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])

transcription

'WHAT A WONDERFUL CLASS'
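
`processor.decode` hides one step worth seeing: the model emits one prediction per audio frame, so the argmax ids contain repeats and blank tokens that have to be collapsed before they read as text. Below is a rough sketch of that greedy CTC collapse with a made-up six-symbol vocabulary. The real tokenizer's vocabulary, blank id, and word delimiter differ, and `ctc_greedy_collapse` is a hypothetical name.

```python
# Toy vocabulary for illustration only; id 0 plays the role of the CTC blank.
TOY_VOCAB = {0: "<blank>", 1: " ", 2: "H", 3: "E", 4: "L", 5: "O"}


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Drop consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# One id per frame, as torch.argmax(logits, dim=-1) would give for one clip.
frame_ids = [2, 2, 0, 3, 0, 4, 4, 0, 4, 5, 5, 0]
text = "".join(TOY_VOCAB[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # HELLO
```

The blank between the two runs of `4` is what lets the double "L" survive; without it the repeats would merge into one letter.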

## Image Captioning doesn't do THAT much better than ours :)
https://huggingface.co/spaces/flax-community/image-captioning
<img title="a title" alt="Attention" src="../images/ViT.png">

## Text to Image. Truly Horrifying :)
https://huggingface.co/spaces/flax-community/dalle-mini
<img title="a title" alt="Attention" src="../images/texttoimage.png">

## Python code completion

[Hugging Face repo here](https://huggingface.co/Sentdex/GPyT)

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM  # AutoModelWithLMHead is deprecated

tokenizer = AutoTokenizer.from_pretrained("Sentdex/GPyT")
model = AutoModelForCausalLM.from_pretrained("Sentdex/GPyT")

In [13]:
input_code = """import pandas as pd
import numpy as np

df = pd"""  # I'd expect a read_csv here

converted = input_code.replace("\n", "<N>")  # newlines are fed to the model as "<N>"
tokenized = tokenizer.encode(converted, return_tensors='pt')
resp = model.generate(tokenized, num_beams=3, max_length=tokenized.shape[1] + 10)  # generate() takes num_beams, not beams

decoded = tokenizer.decode(resp[0])
reformatted = decoded.replace("<N>","\n")

print(reformatted)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


import pandas as pd
import numpy as np

df = pd.read_csv('data/data/data
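
The printed result repeats the prompt before the new tokens. Here is a small sketch for pulling out only the generated part, working on plain strings. `split_completion` is a hypothetical helper, and it assumes the decoded text starts with the `<N>`-converted prompt, which may not hold if the tokenizer adds special tokens. The `decoded` string below is reconstructed from the output shown above rather than produced by running the model.

```python
NEWLINE_TOKEN = "<N>"  # placeholder the notebook substitutes for "\n"


def split_completion(prompt, decoded):
    """Return only the text the model added after the prompt (hypothetical helper)."""
    converted_prompt = prompt.replace("\n", NEWLINE_TOKEN)
    if decoded.startswith(converted_prompt):
        completion = decoded[len(converted_prompt):]
    else:
        completion = decoded  # fall back to the full text if the prefix does not match
    return completion.replace(NEWLINE_TOKEN, "\n")


prompt = "import pandas as pd\nimport numpy as np\n\ndf = pd"
decoded = "import pandas as pd<N>import numpy as np<N><N>df = pd.read_csv('data/data/data"

print(split_completion(prompt, decoded))  # .read_csv('data/data/data
```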


## Article Headline Generation

[Hugging Face repo here](https://huggingface.co/Michau/t5-base-en-generate-headline)

In [16]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")
tokenizer = T5Tokenizer.from_pretrained("Michau/t5-base-en-generate-headline")
model = model.to(device)

In [17]:
article = '''
Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady Melania Trump tested positive for COVID-19. Officials said the Trumps' 14-year-old son Barron tested negative as did First Family and Senior Advisors Jared Kushner and Ivanka Trump.
Trump took to social media, posting at 12:54 am local time (0454 UTC) on Twitter, "Tonight, [Melania] and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!" Yesterday afternoon Marine One landed on the White House's South Lawn flying Trump to Walter Reed National Military Medical Center (WRNMMC) in Bethesda, Maryland.
Reports said both were showing "mild symptoms". Senior administration officials were tested as people were informed of the positive test. Senior advisor Hope Hicks had tested positive on Thursday.
Presidential physician Sean Conley issued a statement saying Trump has been given zinc, vitamin D, Pepcid and a daily Aspirin. Conley also gave a single dose of the experimental polyclonal antibodies drug from Regeneron Pharmaceuticals.
According to official statements, Trump, now operating from the WRNMMC, is to continue performing his duties as president during a 14-day quarantine. In the event of Trump becoming incapacitated, Vice President Mike Pence could take over the duties of president via the 25th Amendment of the US Constitution. The Pence family all tested negative as of yesterday and there were no changes regarding Pence's campaign events.
'''

text =  "headline: " + article

encoding = tokenizer.encode_plus(text, return_tensors = "pt")
input_ids = encoding["input_ids"].to(device)
attention_masks = encoding["attention_mask"].to(device)

beam_outputs = model.generate(
    input_ids = input_ids,
    attention_mask = attention_masks,
    max_length = 64,
    num_beams = 3,
    early_stopping = True,
)

result = tokenizer.decode(beam_outputs[0])
print(result)

<pad> Trump and First Lady Melania Test Positive for COVID-19</s>


In [32]:
import os
import requests
from random import sample

NYTIMES_KEY = os.environ['NYTIMES_KEY']  # read the API key from the environment instead of hard-coding it

results =  requests.get(f'https://api.nytimes.com/svc/mostpopular/v2/viewed/1.json?api-key={NYTIMES_KEY}').json()['results']

for result in results[:5]:
    print(result['title'], result['url'])

article = results[2]

article['title'], article['url']

2 Men Convicted of Killing Malcolm X Will Be Exonerated After Decades https://www.nytimes.com/2021/11/17/nyregion/malcolm-x-killing-exonerated.html
House, Mostly Along Party Lines, Censures Gosar for Violent Video https://www.nytimes.com/2021/11/17/us/politics/paul-gosar-video.html
Does It Matter if I Eat the Stickers on Fruits and Vegetables? https://www.nytimes.com/2021/11/16/well/eat/stickers-fruits-vegetables.html
Adele, Dressed for Power https://www.nytimes.com/2021/11/15/style/adele-oprah-white-pantsuit.html
Democrats Shouldn’t Panic. They Should Go Into Shock. https://www.nytimes.com/2021/11/17/opinion/democrats-midterms-biden.html


('Does It Matter if I Eat the Stickers on Fruits and Vegetables?',
 'https://www.nytimes.com/2021/11/16/well/eat/stickers-fruits-vegetables.html')

In [33]:
from bs4 import BeautifulSoup

article_body = BeautifulSoup(requests.get(article['url']).text, 'html.parser').find('section', {'name': 'articleBody'}).text

In [34]:
text =  "headline: " + article_body

encoding = tokenizer.encode_plus(text, return_tensors = "pt")
input_ids = encoding["input_ids"].to(device)  # the model was moved to `device` earlier
attention_masks = encoding["attention_mask"].to(device)

beam_outputs = model.generate(
    input_ids = input_ids,
    attention_mask = attention_masks,
    max_length = 64,
    num_beams = 3,
    early_stopping = True,
)

result = tokenizer.decode(beam_outputs[0], skip_special_tokens=True)

In [35]:
print("Original Title\n==============\n", article['title'], "\n\nGenerated Title\n==============\n", result)

Original Title
==============
 Does It Matter if I Eat the Stickers on Fruits and Vegetables? 

Generated Title
==============
 Is There Any Health Hazard in Eating Produce Stickers?
