<a href="https://colab.research.google.com/github/richardOlson/nlp__tranformers/blob/main/Long_text_classify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
# doing some of the imports 
import tensorflow as tf
import numpy as np

In [27]:
! pip install transformers -q

In [28]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch
from typing import Union

In [29]:
# making the model
model = BertForSequenceClassification.from_pretrained("ProsusAI/finbert")
tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")

In [30]:
text = """
I would like to get your all  thoughts on the bond yield increase this week.  I am not worried about the market downturn but the sudden increase in yields. On 2/16 the 10 year bonds yields increased by almost  9 percent and on 2/19 the yield increased by almost 5 percent.

Key Points from the CNBC Article:

* **The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.**
* **Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic.**
* **However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.**

The recent rise in bond yields and U.S. inflation expectations has some investors wary that a repeat of the 2013 “taper tantrum” could be on the horizon.

The benchmark U.S. 10-year Treasury note climbed above 1.3% for the first time since February 2020 earlier this week, while the 30-year bond also hit its highest level for a year. Yields move inversely to bond prices.

Yields tend to rise in lockstep with inflation expectations, which have reached their highest levels in a decade in the U.S., powered by increased prospects of a large fiscal stimulus package, progress on vaccine rollouts and pent-up consumer demand.

The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.

Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic. The Fed and others have maintained supportive tones in recent policy meetings, vowing to keep financial conditions loose as the global economy looks to emerge from the Covid-19 pandemic.

However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.

With central bank support removed, bonds usually fall in price which sends yields higher. This can also spill over into stock markets as higher interest rates means more debt servicing for firms, causing traders to reassess the investing environment.

“The supportive stance from policymakers will likely remain in place until the vaccines have paved a way to some return to normality,” said Shane Balkham, chief investment officer at Beaufort Investment, in a research note this week.

“However, there will be a risk of another ‘taper tantrum’ similar to the one we witnessed in 2013, and this is our main focus for 2021,” Balkham projected, should policymakers begin to unwind this stimulus.

Long-term bond yields in Japan and Europe followed U.S. Treasurys higher toward the end of the week as bondholders shifted their portfolios.

“The fear is that these assets are priced to perfection when the ECB and Fed might eventually taper,” said Sebastien Galy, senior macro strategist at Nordea Asset Management, in a research note entitled “Little taper tantrum.”

“The odds of tapering are helped in the United States by better retail sales after four months of disappointment and the expectation of large issuance from the $1.9 trillion fiscal package.”

Galy suggested the Fed would likely extend the duration on its asset purchases, moderating the upward momentum in inflation.

“Equity markets have reacted negatively to higher yield as it offers an alternative to the dividend yield and a higher discount to long-term cash flows, making them focus more on medium-term growth such as cyclicals” he said. Cyclicals are stocks whose performance tends to align with economic cycles.

Galy expects this process to be more marked in the second half of the year when economic growth picks up, increasing the potential for tapering.

## Tapering in the U.S., but not Europe

Allianz CEO Oliver Bäte told CNBC on Friday that there was a geographical divergence in how the German insurer is thinking about the prospect of interest rate hikes.

“One is Europe, where we continue to have financial repression, where the ECB continues to buy up to the max in order to minimize spreads between the north and the south — the strong balance sheets and the weak ones — and at some point somebody will have to pay the price for that, but in the short term I don’t see any spike in interest rates,” Bäte said, adding that the situation is different stateside.

“Because of the massive programs that have happened, the stimulus that is happening, the dollar being the world’s reserve currency, there is clearly a trend to stoke inflation and it is going to come. Again, I don’t know when and how, but the interest rates have been steepening and they should be steepening further.”

## Rising yields a ‘normal feature’

However, not all analysts are convinced that the rise in bond yields is material for markets. In a note Friday, Barclays Head of European Equity Strategy Emmanuel Cau suggested that rising bond yields were overdue, as they had been lagging the improving macroeconomic outlook for the second half of 2021, and said they were a “normal feature” of economic recovery.

“With the key drivers of inflation pointing up, the prospect of even more fiscal stimulus in the U.S. and pent up demand propelled by high excess savings, it seems right for bond yields to catch-up with other more advanced reflation trades,” Cau said, adding that central banks remain “firmly on hold” given the balance of risks.

He argued that the steepening yield curve is “typical at the early stages of the cycle,” and that so long as vaccine rollouts are successful, growth continues to tick upward and central banks remain cautious, reflationary moves across asset classes look “justified” and equities should be able to withstand higher rates.

“Of course, after the strong move of the last few weeks, equities could mark a pause as many sectors that have rallied with yields look overbought, like commodities and banks,” Cau said.

“But at this stage, we think rising yields are more a confirmation of the equity bull market than a threat, so dips should continue to be bought.”
"""

In [31]:
tokens = tokenizer.encode_plus(text=text, add_special_tokens=False)
type(tokens)

Token indices sequence length is longer than the specified maximum sequence length for this model (1345 > 512). Running this sequence through the model will result in indexing errors


transformers.tokenization_utils_base.BatchEncoding

In [16]:
len(tokens["input_ids"])

1345

In [23]:
# showing how we can take a slice of the 
# tokens and then we will send it into the model
# we will loop through the data and will slice of 
# 512 to get the logits from the model
# if the size is smaller than 512 then we will add padding
def makeSlices(x, add_special_tokens=True):
  """
  Function that is used to make the slices
  
  ARGS:

  x:  can be either a string of text or also be the tokens that has been returned from the tokenizer

  add_special_tokens:  Bool that when true will add the tokens to each of the slices.  When false the slices will 
    not include the begin "CLS" token and the end "SEP" tokens

  """
  
  tokens = None
  # check to see if it is a string or not
  if isinstance(x, str):
    
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")
    tokens = tokenizer.encode_plus(text=x, add_special_tokens=False)
  else:
    tokens = x
  # making it so that has the correct sequence len with and without the special tokens  
  if add_special_tokens:
    seq_len_using = 510
  else:
    seq_len_using = 512

  seq_len = len(tokens["input_ids"])
  
  begin = 0
  end = begin + seq_len_using
  windows = []

  while True:
    input_id = tokens["input_ids"][begin:end]
    atten_mask = tokens["attention_mask"][begin:end]

    
    # doing the padding if it is needed
    if len(input_id) < seq_len_using:
      
      input_id = input_id + [tokenizer.pad_token_id] * (seq_len_using - len(input_id))
      atten_mask = atten_mask + [tokenizer.pad_token_id] * (seq_len_using - len(input_id))
      
    if add_special_tokens:
      # adding the sep and the cls token to the from and the back
      input_id = [tokenizer.cls_token_id] + input_id + [tokenizer.sep_token_id]
      atten_mask = [tokenizer.cls_token_id] + atten_mask + [tokenizer.sep_token_id]
    
    windows.append({"input_ids": input_id, "attention_mask": atten_mask})
    # moving the window
    
    if end >= seq_len:
      break
    begin = end
    end = end + seq_len_using if end + seq_len_using < len(tokens["input_ids"]) else len(tokens["input_ids"])

  return windows

  



In [24]:
new_text = """
  PHOENIX (AP) — An Arizona man who stormed the U.S. Capitol on Jan. 6 while wearing a costume of Captain Moroni from the Book of Mormon and narrating the melee in videos for his mother has been arrested, authorities said.

Nathan Wayne Entrekin, 48, told the FBI that former President Donald Trump inspired him to drive more than 2,000 miles to Washington for the rally on Jan. 6, according to court documents filed this week.

Entrekin documented his movements in and outside the building in cellphone videos in which he addressed his mother, who wasn't at the Capitol, authorities said.

Entrekin said in the videos that he was dressed in the Roman gladiator costume to portray Captain Moroni, a figure from the Book of Mormon who sought to defend his people from another group that wanted to overthrow democracy and install a king, court records say.

"I made it Mom. I made it to the top. Mom, look, I made it to the top, to the top here. Look at all the patriots here," he said in one video. "I'm here for Trump. Four more years, Donald Trump! Our rightful president!"

"I don't think you want to be here, Mom," Entrekin said in another video from inside the ransacked Senate Parliamentarian's office. "I mean you do want to be here, but in spirit."

While Entrekin claimed that he was herded into the building by the crowd, the FBI said security video shows Entrekin didn't appear to be pushed into the Capitol. And when he left the building and reentered it, investigators said Entrekin didn't appear to be pushed forced in against his will, according to court records.

Court records didn't list a lawyer for Entrekin, and he doesn't have a listed phone number.

Entrekin, who was arrested Thursday in Cottonwood on two misdemeanor charges, is among more than 500 people charged with federal crimes in the Jan. 6 attack.

At least eighteen people have pleaded guilty, including two members of the far-right Oath Keepers militia group who admitted to conspiring with other extremists to block the certification of President Joe Biden's victory.
"""

In [25]:
# running the two texts, one as a tokenized alread and the other as text
windows = makeSlices(tokens)
second_set_of_windows = makeSlices(new_text)

UnboundLocalError: ignored

In [None]:
windows = makeSlices(tokens)

In the padding section
The length of the input_id is  325
After padding the len of the input_id is now  510


In [None]:
len(windows)

3

In [None]:
type(windows[0]["input_ids"])

list

In [None]:
# showing the lenghts of the windows
counter = 1
print(f"Each of the windows is of type {type(windows[0])}")
for w in windows:
  
  theLen = len(w["input_ids"])
  print(f"The length of the window number {counter} values is {theLen}")
  counter += 1

Each of the windows is of type <class 'dict'>
The length of the window number 1 values is 512
The length of the window number 2 values is 512
The length of the window number 3 values is 512


In [None]:
type(tokens)

transformers.tokenization_utils_base.BatchEncoding

In [None]:
# making the function that will get the sentiment
def sentiment(x: Union[str, transformers.tokenization_utils_base.BatchEncoding ]):
  if !isinstance(x, transformers.tokenization_utils_base.BatchEncoding):
    # in here we will need to tokenize the text
    x = tokenizer.encode_plus(x, add_special_tokens=False)

  # getting the input_ids and the attenstion_mask
  attenstion_mask = x["attention_mask"]
  input_ids = x["input_ids"]

  # doing the prediction with the input_ids
  model.bert.pr