# ChunkeyBERT #
This notebook compares the output of ChunkeyBERT with that of KeyBERT on a small number of long English Wikipedia documents. No formal benchmarks have been run yet; contributions in this regard are welcome.

In [1]:
from typing import List, Callable

import datasets
from IPython.display import clear_output
from datasets import load_dataset
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sentence_transformers import SentenceTransformer

from chunkey_bert.model import ChunkeyBert

clear_output()

In [2]:
dataset: datasets.Dataset = load_dataset("wikipedia", "20220301.en", split="train").take(5)

In [3]:
sentence_model: SentenceTransformer = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2")
keybert: KeyBERT = KeyBERT(model=sentence_model)
keyphrase_vectorizer: KeyphraseCountVectorizer = KeyphraseCountVectorizer(spacy_pipeline="en_core_web_trf")

Create a ChunkeyBert instance by wrapping the KeyBERT model, and define a crude chunker that splits the text on blank lines (paragraph breaks) and drops paragraphs of 25 characters or fewer, such as section headings.

In [4]:
chunkey_bert: ChunkeyBert = ChunkeyBert(keybert=keybert)


def chunker(text: str) -> List[str]:
    # Split on blank lines and keep only paragraphs longer than 25 characters.
    return [t for t in text.split("\n\n") if len(t) > 25]
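
To see what the paragraph chunker does on a small input, here is a minimal self-contained sketch. The sample text and the name `split_paragraphs` are made up for illustration; the blank-line split and the 25-character threshold mirror the chunker defined above.

```python
from typing import List


def split_paragraphs(text: str, min_chars: int = 25) -> List[str]:
    """Split on blank lines and keep paragraphs longer than min_chars characters."""
    return [p for p in text.split("\n\n") if len(p) > min_chars]


# Hypothetical sample: a short section heading followed by two body paragraphs.
sample = (
    "History\n\n"
    "Anarchism is a political philosophy and movement.\n\n"
    "Humans lived in societies without formal hierarchies."
)

chunks = split_paragraphs(sample)
for chunk in chunks:
    print(repr(chunk))
```

The heading `History` is shorter than the threshold, so only the two body paragraphs remain as chunks.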

In [5]:
text: str = dataset[0]["text"]
text

'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.\n\nHumans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished 

Inspect the chunks.

In [6]:
chunker(text)

['Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.',
 "Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourishe

In [7]:
keybert.extract_keywords(docs=text, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=10)

[('political anarchism', 0.7281),
 ('anarchism', 0.7173),
 ('anarchist movement', 0.7089),
 ('modern anarchism', 0.7053),
 ('anarchist philosophy', 0.7028),
 ('anarchistic ideas', 0.7021),
 ('libertarian anarchism', 0.7005),
 ('anarchist society', 0.6991),
 ('anarchist', 0.6954),
 ('anarchist tendencies', 0.6947)]

In [8]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=3
)

[[('anarchist', 0.82442033),
  ('anarchism', 0.787518),
  ('anarchists', 0.7699559),
  ('philosophical anarchism', 0.4317795),
  ('anarchist movement', 0.414533),
  ('anarchist theory', 0.35548595),
  ('contemporary anarchists', 0.28468305),
  ('many anarchists', 0.2819548),
  ('anarchist thought', 0.28096262),
  ('prominent anarchists', 0.28039744)]]

In [9]:
text = dataset[1]["text"]
text

'Autism is a neurodevelopmental disorder characterized by difficulties with social interaction and communication, and by restricted and repetitive behavior. Parents often notice signs during the first three years of their child\'s life. These signs often develop gradually, though some autistic children experience regression in their communication and social skills after reaching developmental milestones at a normal pace.\n\nAutism is associated with a combination of genetic and environmental factors. Risk factors during pregnancy include certain infections, such as rubella, toxins including valproic acid, alcohol, cocaine, pesticides, lead, and air pollution, fetal growth restriction, and autoimmune diseases. Controversies surround other proposed environmental causes; for example, the vaccine hypothesis, which has been disproven. Autism affects information processing in the brain and how nerve cells and their synapses connect and organize; how this occurs is not well understood. The Di

In [10]:
keybert.extract_keywords(docs=text, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=10)

[('autism risk', 0.6733),
 ('early infantile autism', 0.6638),
 ('autism spectrum disorders', 0.6598),
 ('infantile autism', 0.6586),
 ('autism spectrum', 0.6558),
 ('childhood autism', 0.6495),
 ('autism', 0.6477),
 ('autism spectrum disorder', 0.6177),
 ('autism diagnoses', 0.6168),
 ('autism research', 0.6158)]

In [11]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=3
)

[[('autism', 0.7128061),
  ('autistic children', 0.40841734),
  ('autistic people', 0.35323882),
  ('autism spectrum', 0.3338979),
  ('autistic individuals', 0.33198574),
  ('autism spectrum disorders', 0.3031995),
  ('autistic symptoms', 0.2833338),
  ('asperger syndrome', 0.24426702),
  ('intellectual disability', 0.23897676),
  ('autism research', 0.20369457)]]

In [12]:
text = dataset[2]["text"]
text

'Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, to 1, corresponding to a body that reflects all incident radiation.\n\nSurface albedo is defined as the ratio of radiosity Je to the irradiance Ee (flux per unit area) received by a surface. The proportion reflected is not only determined by properties of the surface itself, but also by the spectral and angular distribution of solar radiation reaching the Earth\'s surface. These factors vary with atmospheric composition, geographic location, and time (see position of the Sun). While bi-hemispherical reflectance is calculated for a single angle of incidence (i.e., for a given position of the Sun), albedo is the directional integration of reflectance over all solar angles in a given period. The temporal resolution may range from seconds (as obtained from flux measurements) to daily, mon

In [13]:
keybert.extract_keywords(docs=text, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=10)

[('spectral albedo', 0.7367),
 ('term albedo', 0.7314),
 ('surface albedo', 0.7278),
 ('optical albedos', 0.7081),
 ('astronomical albedo', 0.7043),
 ('albedo', 0.7015),
 ('geometric albedo', 0.6956),
 ('common optical albedos', 0.6917),
 ('actual albedo', 0.6906),
 ('overall albedo', 0.6898)]

In [14]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=3
)

[[('albedo', 0.72697675),
  ('albedos', 0.43500245),
  ('radar albedo', 0.38149863),
  ('low albedo', 0.30122527),
  ('reflectivity', 0.2842312),
  ('climate', 0.27145365),
  ('surface albedo', 0.24144487),
  ('high albedo', 0.24080944),
  ('albedo effect', 0.23963155),
  ('albedo effects', 0.23868895)]]

In [15]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=2
)

[[('albedo', 0.7542199),
  ('albedos', 0.39941892),
  ('radar albedo', 0.32681236),
  ('surface albedo', 0.26756504),
  ('albedo effect', 0.26608932),
  ('albedo effects', 0.2654092),
  ('low albedo', 0.2651981),
  ('term albedo', 0.26230577),
  ('oc radar albedo', 0.25035024),
  ('reflectivity', 0.24573831)]]
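
Cells 14 and 15 differ only in `top_n` (3 versus 2). One rough way to quantify how much the two keyword lists agree is a set overlap. The sketch below copies the keywords from the two outputs above (scores omitted); `jaccard` is a helper introduced here for illustration and is not part of ChunkeyBERT or KeyBERT.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Keywords copied from the top_n=3 output (cell 14).
top_n_3: List[str] = [
    "albedo", "albedos", "radar albedo", "low albedo", "reflectivity",
    "climate", "surface albedo", "high albedo", "albedo effect", "albedo effects",
]
# Keywords copied from the top_n=2 output (cell 15).
top_n_2: List[str] = [
    "albedo", "albedos", "radar albedo", "surface albedo", "albedo effect",
    "albedo effects", "low albedo", "term albedo", "oc radar albedo", "reflectivity",
]

shared = set(top_n_3) & set(top_n_2)
only_3 = set(top_n_3) - set(top_n_2)
only_2 = set(top_n_2) - set(top_n_3)
similarity = jaccard(set(top_n_3), set(top_n_2))

print(f"shared: {len(shared)}, only top_n=3: {sorted(only_3)}, only top_n=2: {sorted(only_2)}")
print(f"Jaccard similarity: {similarity:.3f}")
```

This ignores ranking and scores, so it only indicates which keywords survive the change in `top_n`, not how their order shifts.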

In [16]:
text = dataset[3]["text"]
text

'A, or a, is the first letter and the first vowel of the modern English alphabet and the ISO basic Latin alphabet. Its name in English is a (pronounced ), plural aes. It is similar in shape to the Ancient Greek letter alpha, from which it derives. The uppercase version consists of the two slanting sides of a triangle, crossed in the middle by a horizontal bar. The lowercase version can be written in two forms: the double-storey a and single-storey ɑ. The latter is commonly used in handwriting and fonts based on it, especially fonts intended to be read by children, and is also found in italic type.\n\nIn the English grammar, "a", and its variant "an", are indefinite articles.\n\nHistory\n\nThe earliest certain ancestor of "A" is aleph (also written \'aleph), the first letter of the Phoenician alphabet, which consisted entirely of consonants (for that reason, it is also called an abjad to distinguish it from a true alphabet). In turn, the ancestor of aleph may have been a pictogram of an

In [17]:
keybert.extract_keywords(docs=text, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=10)

[('greek alphabet', 0.5973),
 ('modern english alphabet', 0.5886),
 ('semitic letter aleph', 0.5676),
 ('ancient greek letter alpha', 0.5527),
 ('phoenician alphabet', 0.5472),
 ('vowel letter', 0.5432),
 ('greek letter alpha', 0.5422),
 ('phonetic alphabet symbols', 0.5385),
 ('basic latin alphabet', 0.5325),
 ('latin alphabet', 0.5302)]

In [18]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=3
)

[[('greek alphabet', 0.68532836),
  ('latin alphabet', 0.67784774),
  ('greek letter alpha', 0.6687489),
  ('open front unrounded vowel', 0.63306737),
  ('english grammar', 0.6221442),
  ('alphabet', 0.43506628),
  ('letters', 0.43405268),
  ('latin letter alpha', 0.42377108),
  ('letter', 0.4222855),
  ('latin alpha', 0.4220736)]]

In [19]:
text = dataset[4]["text"]
text

'Alabama () is a state in the Southeastern region of the United States, bordered by Tennessee to the north; Georgia to the east; Florida and the Gulf of Mexico to the south; and Mississippi to the west. Alabama is the 30th largest by area and the 24th-most populous of the U.S. states. With a total of  of inland waterways, Alabama has among the most of any state.\n\nAlabama is nicknamed the Yellowhammer State, after the state bird. Alabama is also known as the "Heart of Dixie" and the "Cotton State". The state tree is the longleaf pine, and the state flower is the camellia. Alabama\'s capital is Montgomery, and its largest city by population and area is Huntsville. Its oldest city is Mobile, founded by French colonists in 1702 as the capital of French Louisiana. Greater Birmingham is Alabama\'s largest metropolitan area and its economic center.\n\nOriginally home to many native tribes, present-day Alabama was a Spanish territory beginning in the sixteenth century until the French acquir

In [20]:
keybert.extract_keywords(docs=text, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=10)

[('alabama state capitol', 0.6394),
 ('alabama language', 0.6353),
 ('alabama territory', 0.6234),
 ('alabama cities', 0.6169),
 ('alabama', 0.6161),
 ('alabama english', 0.5877),
 ('alabama lineage', 0.5732),
 ('alabama state politics', 0.5717),
 ('alabama jurisdictions', 0.57),
 ('alabama state tax', 0.5646)]

In [21]:
chunkey_bert.extract_keywords(
    docs=text, num_keywords=10, chunker=chunker, vectorizer=keyphrase_vectorizer, nr_candidates=20, top_n=3
)

[[('alabama', 0.64232457),
  ('alabama legislature', 0.31014132),
  ('mississippi', 0.20785388),
  ('georgia', 0.20528163),
  ('confederate states', 0.20451856),
  ('huntsville', 0.20169012),
  ('birmingham', 0.20006083),
  ('redistricting', 0.19923021),
  ('supreme court', 0.1989378),
  ('south alabama', 0.16532314)]]