### Preprocessing
Before sending text to the API, preprocess it: remove unnecessary whitespace, correct spelling errors, and shorten long sentences. Each of these steps can reduce the number of tokens consumed while retaining the core information.
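
As a rough illustration of the whitespace step, the sketch below collapses runs of whitespace and compares a crude size estimate before and after. The helper names `normalize_whitespace` and `rough_token_estimate` are hypothetical, and the four-characters-per-token ratio is only a rule-of-thumb assumption; the tokenizer of your target model will give different counts.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size estimate; the 4-characters-per-token ratio is an assumption."""
    return round(len(text) / chars_per_token)


raw = "Machine   learning \n\n focuses on   programs\tthat learn   from data.  "
clean = normalize_whitespace(raw)

print(clean)
print(rough_token_estimate(raw), "->", rough_token_estimate(clean))
```

For real budgeting, replace the estimate with a count from the tokenizer that matches the model you are calling.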

### Text summarization
If your input text is long, consider condensing it with a text summarization algorithm or service before sending it to the API. This helps preserve the most important information while staying within the token limit.
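
To make the idea concrete without any model download, here is a minimal extractive sketch that uses only the standard library: it scores each sentence by the average frequency of its words and keeps the top few in their original order. The function name `extractive_summary`, the regex-based sentence split, and the "ignore words of three letters or fewer" rule are all simplifying assumptions, not what the rest of this section uses.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # it will mis-split abbreviations such as "e.g.").
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    # Word frequencies over the whole text, ignoring very short words.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Pick the top-scoring sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


sample = (
    "Machine learning lets systems learn from data. "
    "The weather was pleasant yesterday. "
    "Learning systems improve as more data becomes available. "
    "Lunch was served at noon."
)
print(extractive_summary(sample, max_sentences=2))
```

An extractive approach like this only selects existing sentences; an abstractive model, such as the one loaded later in this section, generates new wording instead.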

In [34]:
import nltk
import re
from transformers import pipeline
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
import tensorflow as tf


AttributeError: module 'keras' has no attribute 'utils'

In [23]:
# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/williank/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/williank/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
# Text preprocessing function
def preprocess_text(text):
    # Collapse runs of whitespace into single spaces and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    # Lowercase the text
    text = text.lower()

    # Tokenize sentences and words
    sentences = nltk.sent_tokenize(text)
    words = [nltk.word_tokenize(sentence) for sentence in sentences]

    # Remove stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))
    filtered_words = [[word for word in word_list if word not in stopwords] for word_list in words]

    # Reconstruct the sentences
    preprocessed_sentences = [' '.join(word_list) for word_list in filtered_words]
    preprocessed_text = ' '.join(preprocessed_sentences)

    return preprocessed_text

In [25]:
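# Text summarization function
def summarize_text(text, model_name="sshleifer/distilbart-cnn-12-6", framework="tf"):
    """Return an abstractive summary of `text` (requires TensorFlow with the default arguments)."""
    # TFAutoModelForSeq2SeqLM is the TensorFlow model class, so TensorFlow must be installed
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Build the summarization pipeline and generate deterministically (no sampling)
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework=framework)
    summary = summarizer(text, max_length=100, min_length=25, do_sample=False)
    return summary[0]['summary_text']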

In [26]:
# Example text
text = """
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
"""

In [27]:
# Preprocess the text
preprocessed_text = preprocess_text(text)
print("Preprocessed text:")
print(preprocessed_text)

Preprocessed text:
machine learning application artificial intelligence ( ai ) provides systems ability automatically learn improve experience without explicitly programmed . machine learning focuses development computer programs access data use learn . process learning begins observations data , examples , direct experience , instruction , order look patterns data make better decisions future based examples provide . primary aim allow computers learn automatically without human intervention assistance adjust actions accordingly .


In [29]:
# Summarize the text
summary = summarize_text(preprocessed_text)
print("\nSummarized text:")
print(summary)

ImportError: 
TFAutoModelForSeq2SeqLM requires the TensorFlow library but it was not found in your environment. Checkout the instructions on the
installation page: https://www.tensorflow.org/install and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.
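
The `ImportError` above is raised because `TFAutoModelForSeq2SeqLM` is the TensorFlow model class and TensorFlow is not available in this environment. One possible workaround, sketched below, is to pass only the model name to `pipeline` and let it load the model with whichever backend is installed. This assumes PyTorch is installed and that the installed `transformers` release still provides the `"summarization"` pipeline task; the function name `summarize_text_any_backend` is hypothetical, and the sketch is untested in this notebook's environment.

```python
def summarize_text_any_backend(text, model_name="sshleifer/distilbart-cnn-12-6"):
    """Summarize `text` without referencing the TensorFlow-specific model class."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")

    # Imported lazily so that defining the function does not require the library.
    from transformers import pipeline

    # With only a model name given, pipeline() loads the tokenizer and model itself
    # (assumption: a supported backend such as PyTorch is installed).
    summarizer = pipeline("summarization", model=model_name)
    result = summarizer(text, max_length=100, min_length=25, do_sample=False)
    return result[0]["summary_text"]
```

It may also be worth calling this on the original `text` rather than `preprocessed_text`: the summarization model was trained on ordinary prose, so lowercased text with stopwords removed may produce a weaker summary.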
