<a href="https://colab.research.google.com/github/richlysakowski/natural-language-processing/blob/master/tutorials/04_GoogleNews_Cleaner_Splitter_Classification_Aggregator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Obsei Tutorial 04
## This example shows the following Obsei workflow
 1. Observe: Search for and fetch news articles via Google News
 2. Cleaner: Clean the article text properly
 3. Analyze: Classify the article text by splitting it into small chunks, then computing the final inference with the configured aggregation function

## Install Obsei from PyPI, then perform these steps
- Select a GPU runtime type for faster computation
- Restart the runtime after installation

In [1]:
!pip install obsei[all]
!pip install trafilatura

Collecting obsei[all]
  Downloading obsei-0.0.15-py3-none-any.whl.metadata (44 kB)
Collecting dateparser>=1.2.0 (from obsei[all])
  Downloading dateparser-1.2.0-py2.py3-none-any.whl.metadata (28 kB)
Collecting mmh3>=4.0.1 (from obsei[all])
  Downloading mmh3-5.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting pydantic-settings>=2.1.0 (from obsei[all])
  Downloading pydantic_settings-2.6.1-py3-none-any.whl.metadata (3.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings>=2.1.0->obsei[all])
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting presidio-analyzer>=2.2.351 (from obsei[all])
  Downloading presidio_analyzer-2.2.355-py3-none-any.whl.m

In [7]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Configure Google News Observer

In [9]:
from obsei.source.google_news_source import GoogleNewsConfig, GoogleNewsSource

source_config = GoogleNewsConfig(
    query="bitcoin",
    max_results=10,
    fetch_article=True,
    lookup_period="1d",
)

source = GoogleNewsSource()

## Configure TextCleaner as a Pre-Processor to clean the article text
These cleaning functions run serially, in the order listed

In [10]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
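# Import only the cleaning functions used below instead of a wildcard import
from obsei.preprocessor.text_cleaning_function import (
    DecodeUnicode,
    RemovePunctuation,
    RemoveSpecialChars,
    RemoveStopWords,
    RemoveWhiteSpaceAndEmptyToken,
    ToLowerCase,
)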

text_cleaner_config = TextCleanerConfig(
    cleaning_functions = [
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveStopWords(),
        RemoveWhiteSpaceAndEmptyToken(),
    ]
)

text_cleaner = TextCleaner()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
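
To illustrate what running cleaning functions serially means, here is a minimal sketch in plain Python. It does not use Obsei: `to_lower`, `remove_punctuation`, `remove_empty`, `remove_stop_words`, and `run_serially` are hypothetical stand-ins, and the stop-word list is a tiny made-up sample. The point is only that each step receives the output of the previous one, so the order of the list matters.

```python
import string
from typing import Callable, List

Tokens = List[str]

# Hypothetical stand-ins for cleaning steps; these are NOT Obsei's classes.
def to_lower(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]

def remove_punctuation(tokens: Tokens) -> Tokens:
    # Delete every ASCII punctuation character from each token.
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in tokens]

def remove_empty(tokens: Tokens) -> Tokens:
    # Drop tokens that are empty or whitespace-only after earlier steps.
    return [t.strip() for t in tokens if t.strip()]

def remove_stop_words(tokens: Tokens) -> Tokens:
    stop_words = {"the", "a", "an", "is", "as", "on", "of"}  # illustrative sample only
    return [t for t in tokens if t not in stop_words]

def run_serially(text: str, steps: List[Callable[[Tokens], Tokens]]) -> str:
    # Whitespace tokenization is an assumption made for this sketch.
    tokens = text.split()
    for step in steps:
        tokens = step(tokens)  # each step consumes the previous step's output
    return " ".join(tokens)

cleaned = run_serially(
    "Bitcoin nears $100,000 as investors bet on crypto-friendly policies!",
    [to_lower, remove_punctuation, remove_empty, remove_stop_words],
)
print(cleaned)
```

Lowercasing before stop-word removal is what lets a capitalized "As" or "On" match the lowercase stop-word list; swapping those two steps would leave such tokens in place.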


## Configure Classification Analyzer

- List the candidate categories in `labels`
- Set `TextSplitterConfig` with a suitable `max_split_length` and `split_stride`
- Set `InferenceAggregatorConfig` with the required `aggregate_function`; two are currently supported (average and most frequent class)
- `ClassificationMaxCategories` needs a `score_threshold`, the minimum probability a class must reach to be taken into consideration

**Note**: To try a different model, select one from https://huggingface.co/models?pipeline_tag=zero-shot-classification

In [11]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer
from obsei.postprocessor.inference_aggregator import InferenceAggregatorConfig
from obsei.postprocessor.inference_aggregator_function import ClassificationMaxCategories
from obsei.preprocessor.text_splitter import TextSplitterConfig

analyzer_config = ClassificationAnalyzerConfig(
    labels=["buy", "sell", "going up", "going down"],
    use_splitter_and_aggregator=True,
    splitter_config=TextSplitterConfig(
        max_split_length=300,
        split_stride=3,
    ),
    aggregator_config=InferenceAggregatorConfig(
        aggregate_function=ClassificationMaxCategories(
            score_threshold=0.3,
        )
    ),
)

text_analyzer = ZeroShotClassificationAnalyzer(
    model_name_or_path="typeform/mobilebert-uncased-mnli",
    device="auto",
)
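
The split-then-aggregate idea configured above can be sketched in plain Python. This is an illustration under stated assumptions, not Obsei's implementation: `split_text` and `aggregate_max_categories` are hypothetical helpers, the splitter is assumed to cut character windows that overlap by `split_stride`, and the per-chunk scores are made-up numbers standing in for model output. The returned keys mirror the `category_count` and `max_scores` fields that appear in the printed result later in this notebook.

```python
from collections import Counter
from typing import Dict, List

def split_text(text: str, max_split_length: int, split_stride: int) -> List[str]:
    """Cut text into windows of at most max_split_length characters, where
    consecutive windows overlap by split_stride characters (assumed semantics)."""
    if not 0 <= split_stride < max_split_length:
        raise ValueError("split_stride must be >= 0 and < max_split_length")
    step = max_split_length - split_stride
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_split_length])
        if start + max_split_length >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

def aggregate_max_categories(
    chunk_scores: List[Dict[str, float]], score_threshold: float
) -> Dict[str, Dict]:
    """For every label, count the chunks where it reaches the threshold and
    keep the highest score it received in any of those chunks."""
    category_count: Counter = Counter()
    max_scores: Dict[str, float] = {}
    for scores in chunk_scores:
        for label, score in scores.items():
            if score >= score_threshold:
                category_count[label] += 1
                max_scores[label] = max(score, max_scores.get(label, 0.0))
    return {"category_count": dict(category_count), "max_scores": max_scores}

chunks = split_text("abcdefghij", max_split_length=4, split_stride=1)
print(chunks)

# Made-up per-chunk scores standing in for the classifier's output.
fake_scores = [
    {"going up": 0.82, "buy": 0.25},
    {"going up": 0.40, "sell": 0.35},
]
summary = aggregate_max_categories(fake_scores, score_threshold=0.3)
print(summary)
```

With a threshold of 0.3, the "buy" score of 0.25 is ignored entirely, which is why a label can be missing from the aggregated result even though it was one of the configured `labels`.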

## Search and fetch news articles

In [12]:
source_response_list = source.lookup(source_config)



'<' not supported between instances of 'float' and 'datetime.datetime'


## Preprocess the text to clean it

In [13]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)

## Analyze articles to perform classification
**Note**: This is a compute-heavy step

In [14]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Print Result

In [15]:
for analyzer_response in analyzer_response_list:
  print(vars(analyzer_response))

{'segmented_data': {'aggregator_data': {'category_count': {'going up': 1, 'positive': 1, 'negative': 1}, 'max_scores': {'going up': 0.8165131211280823, 'positive': 0.6288111209869385, 'negative': 0.42871829867362976}, 'aggregator_name': 'ClassificationMaxCategories'}}, 'meta': {'title': 'Bitcoin nears $100,000 as investors bet on crypto-friendly Trump policies', 'desc': None, 'date': '3 hours ago', 'datetime': datetime.datetime(2024, 11, 22, 3, 10, 4, 175132), 'link': 'https://news.google.com/read/CBMitwFBVV95cUxOMFMwQjh2RDV5eUxTSVJKNFJxQ2Y3YVI0N05YNXA0VUF1X25wUXd2aWpLQUVzcGtsQXkyZlJQaWVnLVNnX0xUOGQyeE1uM3NXeFVSd1FHQk54TGJvQ1VTMDZkQ1FOY3hmZ0pTVDFGdXl0eGZHcWRBd2lwUTdfMURZNmZOdnRiY3VlcXQ3cmV2MU1idzUyNWxKbVhOMUdzY19vOUVrMFhCbDBHdHQ3R2J3Nk8wQktaajjSAbwBQVVfeXFMUEJhNURDUVhlcnRNazZOSk1nQ0JPRk5HOERCdnVmRjF3b1lwWTA0WXhwNkM1ZTl4SXdIYUVsTWtfeEdJb0RtSVpzS2VmaG14VG5kNzMzc2lxRG1GcjVsRVdXTFRQWTNKOWl5RFNWRVg1X0xQLXptWGhyNWpwbXdyaVEySHItSHZjbFpiNzJqSnJEN0JPR0hGd1NwY0VyZEIwVmNzeGh1QklTTFpzQVpyMmpYSjBNa
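
Each printed response nests the aggregated scores under `segmented_data` → `aggregator_data`. As a minimal sketch of how one might pull out the strongest category, the hypothetical helper `top_category` below reads the `max_scores` dictionary; the sample values are copied from the output above, and the helper is an assumption for illustration, not part of Obsei.

```python
from typing import Any, Dict, Optional, Tuple

def top_category(segmented_data: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return the (label, score) pair with the highest max score,
    or None when no label passed the score threshold."""
    max_scores = segmented_data.get("aggregator_data", {}).get("max_scores", {})
    if not max_scores:
        return None
    label = max(max_scores, key=max_scores.get)
    return label, max_scores[label]

# Sample copied from the printed result above.
sample_segmented_data = {
    "aggregator_data": {
        "category_count": {"going up": 1, "positive": 1, "negative": 1},
        "max_scores": {
            "going up": 0.8165131211280823,
            "positive": 0.6288111209869385,
            "negative": 0.42871829867362976,
        },
        "aggregator_name": "ClassificationMaxCategories",
    }
}

best = top_category(sample_segmented_data)
print(best)
```

In the notebook itself, the same helper could be applied to `analyzer_response.segmented_data` inside the loop above, assuming the responses carry the structure shown in the printed output.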

In [17]:
# Print the full text of each fetched article.
# Assumes source_response_list was populated by source.lookup(source_config) above.
# TextPayload stores the article text in `processed_text`; it has no `text`
# attribute, so `response.text` raises AttributeError.

for response in source_response_list:
    print(response.processed_text)