# **Install Transformers**

In [None]:
!pip install transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers[sentencepiece]
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

# **Output wrapper**

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
    <style>
      pre{
        white-space:pre-wrap;
      }
    </style>
  '''))
get_ipython().events.register('pre_run_cell',set_css)

# **Read input file**

In [None]:
# Read the whole transcript; the context manager closes the file afterwards.
with open('/content/drive/MyDrive/my_work/sample.transcript.txt', 'r') as file:
  fileContent = file.read().strip(' ')
fileContent

"Yeah. Yeah, sure. It kinda does make sense, doesn't it, because when we get into the end of meeting we're kind of talking about action and design as opposed to background. Everything I have is kinda background. Mm-hmm. Uh that sounds. Sure. Okay. Sure. Yeah, cool. Why don't I get that? Hmm. Okay. Okay. Um alright so c is it function F_ eight? Hmm. Come on. I think it's working. Okay great s so let me just start this. Okay great. So um uh s move on. Uh-huh oh where'd it all go? It's not good. Okay lemme just see where I can find it. This looks more like it. I think I just opened up the template. Sorry about that. Okay alright so let's have a look here. Okay so this was the method that um I've taken. Uh basically what I wanna do here, before we get into it uh too far, is I want to show you all the background information I have that I think we need to acknowledge if we want this to be successful. And uh and then sorta g go through some of the way that I've dealt with that information, an

In [None]:
len(fileContent)

39719

# **Load the Model and Tokenizer**

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "sshleifer/distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


# **Model statistics**

In [None]:
print(tokenizer.model_max_length)
print(tokenizer.max_len_single_sentence)
print(tokenizer.num_special_tokens_to_add())

1024
1022
2
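The three numbers above are related: the space left for actual text is the model's maximum length minus the special tokens the tokenizer adds around a single sequence. A minimal sketch of that arithmetic, using the values printed above (the helper name `content_budget` is hypothetical and not part of `transformers`):

```python
def content_budget(model_max_length: int, num_special_tokens: int) -> int:
    """Tokens left for text once the special tokens are accounted for."""
    return model_max_length - num_special_tokens

# Values taken from the cell output above.
budget = content_budget(model_max_length=1024, num_special_tokens=2)
print(budget)
```

This is the limit the chunks have to respect before the tokenizer wraps them in special tokens.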


# **Convert file content to sentences**

In [None]:
import nltk 
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(fileContent)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
len(sentences)

633

In [None]:
max([len(tokenizer.tokenize(sentence)) for sentence in sentences])

93

# **Create the chunks**
Group the sentences into chunks so that no chunk exceeds the tokenizer's maximum single-sentence length (`tokenizer.max_len_single_sentence`, 1022 tokens here).

In [None]:
length = 0   # number of tokens in the current chunk
chunk = ""   # text of the current chunk
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence))

  if length + sentence_length <= tokenizer.max_len_single_sentence:
    # The sentence still fits: add it to the current chunk.
    chunk += sentence + " "
    length += sentence_length
  else:
    # The chunk is full: save it and start a new one with this sentence.
    chunks.append(chunk.strip())
    chunk = sentence + " "
    length = sentence_length

# Save the last, partially filled chunk.
chunks.append(chunk.strip())
len(chunks)

10
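The same greedy packing can be written as a reusable function. The sketch below is an illustration rather than the notebook's own code: `chunk_sentences` and `word_count` are hypothetical names, and a whitespace word counter stands in for the real tokenizer so that the example runs without downloading a model.

```python
from typing import Callable, List

def chunk_sentences(sentences: List[str],
                    count_tokens: Callable[[str], int],
                    max_tokens: int) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def word_count(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())

demo = ["one two three.", "four five.", "six seven eight nine.", "ten."]
demo_chunks = chunk_sentences(demo, word_count, max_tokens=5)
print(demo_chunks)
```

In the notebook, the counter would be `lambda s: len(tokenizer.tokenize(s))` and the limit `tokenizer.max_len_single_sentence`. Note that a single sentence longer than the limit still ends up as its own oversized chunk; with a longest sentence of 93 tokens, that does not happen here.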

# **Some checks**

In [None]:
[len(tokenizer.tokenize(c)) for c in chunks]

[1014, 984, 960, 1003, 1020, 988, 996, 998, 1017, 576]

In [None]:
[len(tokenizer(c).input_ids) for c in chunks]

[1016, 986, 962, 1005, 1022, 990, 998, 1000, 1019, 578]

In [None]:
sum([len(tokenizer.tokenize(c)) for c in chunks] )

9556

In [None]:
len(tokenizer.tokenize(fileContent))

Token indices sequence length is longer than the specified maximum sequence length for this model (9559 > 1024). Running this sequence through the model will result in indexing errors


9559

In [None]:
print(len(chunks))

10
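The checks above can be summarized with a little arithmetic on the printed numbers. This sketch only reuses the values from the cell outputs; the variable names are illustrative.

```python
# Token counts per chunk, copied from the outputs above.
token_counts = [1014, 984, 960, 1003, 1020, 988, 996, 998, 1017, 576]
input_id_counts = [1016, 986, 962, 1005, 1022, 990, 998, 1000, 1019, 578]
MODEL_MAX_LENGTH = 1024

# Each encoded chunk gains the two special tokens added by the tokenizer.
overhead = [ids - toks for toks, ids in zip(token_counts, input_id_counts)]

# Every encoded chunk must fit within the model's maximum length.
fits = all(ids <= MODEL_MAX_LENGTH for ids in input_id_counts)

# Compare the chunked total with the 9559 tokens of the full text.
total = sum(token_counts)
difference = 9559 - total
print(overhead, fits, total, difference)
```

Every chunk carries exactly two extra tokens and stays within 1024. The chunked total falls three tokens short of the full-text count, presumably because tokenizing chunk by chunk differs slightly from tokenizing the whole text at once, for example around whitespace at chunk boundaries.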


# **Get the inputs**

In [None]:
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# **Output**

In [None]:
for encoded in inputs:
  output = model.generate(encoded.input_ids, attention_mask=encoded.attention_mask, max_new_tokens=150)
  # skip_special_tokens=True drops <s> and </s> from the decoded text.
  show = tokenizer.decode(output[0], skip_special_tokens=True)
  print(show)
  print()

 Market research shows that TV remote control has a fancy look and feel, not a functional look or or feel, the number one thing that was found was that television remote control was not functional. Number two was that it be innovative without a adding unnecessary functional bits to it, and third priority is that it has to be user friendly while still having technology.

 Style is number one thing in the in the market of who we're selling to. Innovative design technology's also a must in that it's seen it'd be seen to be uh cutting edge, but ease of use t has to be insured throughout. And then at the end there are vibrant natural colours.

 We need to have something that unifies a lot of the different concepts, and if we think that what we are w our number one marketing motive is the look and feel. We are leaning quite a bit to the side of being low-tech, rubber buttons plastic frame, it's almost like we're reproducing the remote control that's out there. We're gonna need to put in a re

**Although the model used in this code is abstractive, the summaries it produces here are largely extractive.
For more abstractive results, you can try Google's PEGASUS model, which is pre-trained specifically for abstractive text summarization and reported state-of-the-art results when it was released.**
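A minimal sketch of how a PEGASUS checkpoint could be swapped in, assuming the `google/pegasus-cnn_dailymail` checkpoint from the Hugging Face Hub (the checkpoint choice and the helper name `summarize_with_pegasus` are assumptions, not part of the original notebook). The import sits inside the function so that defining it does not load anything; calling it downloads the model.

```python
def summarize_with_pegasus(text: str,
                           model_name: str = "google/pegasus-cnn_dailymail",
                           max_new_tokens: int = 150) -> str:
    """Summarize one chunk of text with a PEGASUS checkpoint (sketch)."""
    # Imported lazily: the model is only downloaded when the function is called.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # truncation=True guards against chunks longer than this model's limit.
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    output = model.generate(encoded.input_ids,
                            attention_mask=encoded.attention_mask,
                            max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For repeated use, loading the tokenizer and model once outside the function would avoid reloading them for every chunk. The chunk size may also need to be recomputed, since the limit depends on the checkpoint's own tokenizer.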