# Dataset: BBCSport

All rights, including copyright, in the content of the original articles are owned by the BBC.

The dataset consists of 737 sports news articles from the BBC Sport website, published in 2004-2005 and covering five topical areas.

#### Class Labels: 5 (athletics, cricket, football, rugby, tennis)

In [1]:
import zipfile

# Google Drive locations of the extraction target and the downloaded archive
extract = '/content/drive/MyDrive/Personal Project/Natural Langauge Processing /bbcsport'
path = '/content/drive/MyDrive/Personal Project/Natural Langauge Processing /bbcsport-fulltext.zip'

# Unpack the full-text archive into the target folder
with zipfile.ZipFile(path, 'r') as archive:
  archive.extractall(extract)

print('Unzipping Complete')

In [2]:
import os
from pathlib import Path

newpath = '/content/drive/MyDrive/Personal Project/Natural Langauge Processing /bbcsport/bbcsport'

data_dir = Path(newpath)

texts = []
labels = []

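for label in os.listdir(data_dir):
  category_dir = data_dir / label
  if category_dir.is_dir():
    for file_path in category_dir.glob("*.txt"):
      # Prefer UTF-8: decoding UTF-8 bytes as latin-1 garbles characters such as the pound sign.
      # Fall back to latin-1 for any file that is not valid UTF-8.
      try:
        text = file_path.read_text(encoding='utf-8').strip()
      except UnicodeDecodeError:
        text = file_path.read_text(encoding='latin-1').strip()
      texts.append(text)
      labels.append(label)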

print(f"Loaded {len(texts)} documents.")
print("Sample label:", set(labels))

Loaded 737 documents.
Sample label: {'football', 'athletics', 'rugby', 'cricket', 'tennis'}
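
Before modelling, it can help to check how the documents are spread across the five classes. The sketch below counts labels with `collections.Counter`; the short `labels` list is a hypothetical stand-in for the list built by the loading loop, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for the `labels` list built by the loading loop
labels = ["football", "football", "tennis", "rugby", "football", "cricket"]

label_counts = Counter(labels)
for label, count in label_counts.most_common():
    share = count / len(labels)
    print(f"{label:<10} {count:>3}  ({share:.1%})")
```

A strongly imbalanced class distribution would be one reason for larger classes to dominate the discovered topics.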


In [3]:
print(texts[1])

Reyes tricked into Real admission

Jose Antonio Reyes has added to speculation linking him with a move from Arsenal to Real Madrid after falling victim to a radio prank.

The Spaniard believed he was talking to Real Madrid sporting director Emilio Butragueno when he allegedly berated his team-mates as "bad people". "I wish I was playing for Real Madrid," the 21-year-old told Cadena Cope. "Hopefully it could happen. I love the way Madrid play. I'm not happy with the way things are." The striker joined the Gunners from Seville for £17m at the start of 2004, but it has frequently been reported that he is homesick. He began the season in superb form but has struggled to maintain his high standards as Arsenal have gradually lost the Premiership initiative to Manchester United and Chelsea. "If I'm not (playing for Real) I'm going to have to carry on playing with some bad people," he added.

"I'm sure there are none in the Real dressing room. "I'm happy Madrid is interested in me because it 

## Topic Modelling

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [5]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

tm = tfidf.fit_transform(texts)

In [6]:
tm

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 89578 stored elements and shape (737, 7585)>
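
The summary above is enough to work out how sparse the TF-IDF matrix is. The following sketch only does arithmetic on the three numbers reported there (stored elements, rows, columns); the variable names are illustrative.

```python
# Values copied from the sparse-matrix summary above
n_docs, n_terms = 737, 7585
stored = 89578

# Fraction of cells that hold a non-zero TF-IDF weight
density = stored / (n_docs * n_terms)

# Average number of distinct vocabulary terms per document
avg_terms_per_doc = stored / n_docs

print(f"density: {density:.2%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

Only about 1.6% of the cells are non-zero, which is why scikit-learn keeps the result in a compressed sparse format rather than a dense array.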

#### Non-Negative Matrix Factorization (NMF)

In [7]:
from sklearn.decomposition import NMF

In [8]:
nmf = NMF(n_components=7, random_state=42)
nmf_topics = nmf.fit_transform(tm)

In [9]:
len(tfidf.get_feature_names_out())

7585

In [10]:
len(nmf.components_)

7

In [11]:
nmf.components_

array([[1.04851557e-04, 1.28832329e-02, 0.00000000e+00, ...,
        1.09784103e-04, 3.76198787e-02, 6.45793844e-04],
       [0.00000000e+00, 3.82897721e-02, 0.00000000e+00, ...,
        1.38506143e-02, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.72061157e-02, 0.00000000e+00, ...,
        5.67738901e-03, 0.00000000e+00, 1.88904445e-02],
       ...,
       [8.24373021e-04, 4.28158325e-03, 0.00000000e+00, ...,
        3.03517635e-05, 5.39266928e-05, 0.00000000e+00],
       [6.48773142e-03, 1.42186564e-02, 1.71793208e-01, ...,
        1.63484291e-03, 2.94534046e-05, 0.00000000e+00],
       [2.81282131e-05, 1.15939509e-03, 0.00000000e+00, ...,
        8.98361623e-05, 3.15015332e-03, 0.00000000e+00]])
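
NMF factorises the document-term matrix into two parts: `nmf_topics` has one row per document and one column per topic (737 × 7 here), while `nmf.components_` has one row per topic and one column per term (7 × 7585). One possible use of the document side, sketched below in plain Python, is to assign each document its highest-weight topic and compare that with the known labels. The weights and labels in the sketch are hypothetical toy values, not results from this dataset.

```python
from collections import Counter

# Hypothetical document-topic weights (rows: documents, columns: topics);
# in the notebook this role is played by `nmf_topics`.
doc_topic = [
    [0.41, 0.02, 0.00],
    [0.03, 0.37, 0.01],
    [0.35, 0.00, 0.04],
    [0.00, 0.05, 0.29],
]
doc_labels = ["rugby", "cricket", "rugby", "tennis"]  # hypothetical labels

# Index of the largest weight in each row is that document's dominant topic
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

# Count how often each (label, topic) pair occurs
crosstab = Counter(zip(doc_labels, dominant))
for (label, topic), n in sorted(crosstab.items()):
    print(f"{label:<8} topic #{topic}: {n}")
```

On the real NumPy array, `nmf_topics.argmax(axis=1)` would give the same dominant-topic indices in one call.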

In [12]:
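feature_names = tfidf.get_feature_names_out()  # look up the vocabulary once

for index, topic in enumerate(nmf.components_):
  print(f'The top 10 words for topic #{index}')
  # argsort is ascending, so the last 10 indices are the heaviest terms,
  # listed from lowest to highest weight
  print([feature_names[i] for i in topic.argsort()[-10:]])
  print('*\n')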

The top 10 words for topic #0
['game', 'scotland', 'half', 'rugby', 'france', 'nations', 'robinson', 'ireland', 'wales', 'england']
*

The top 10 words for topic #1
['khan', 'zealand', 'tour', 'day', 'series', 'test', 'australia', 'cricket', 'india', 'pakistan']
*

The top 10 words for topic #2
['beat', 'agassi', 'final', 'hewitt', 'set', 'australian', 'federer', 'roddick', 'seed', 'open']
*

The top 10 words for topic #3
['champions', 'cup', 'said', 'mourinho', 'club', 'liverpool', 'united', 'league', 'arsenal', 'chelsea']
*

The top 10 words for topic #4
['tests', 'charges', 'olympics', 'doping', 'athens', 'drugs', 'iaaf', 'greek', 'thanou', 'kenteris']
*

The top 10 words for topic #5
['championships', '60m', 'record', 'champion', 'european', 'world', 'holmes', 'olympic', 'race', 'indoor']
*

The top 10 words for topic #6
['andrew', 'boje', 'flintoff', 'trescothick', 'jones', 'strauss', 'vaughan', 'africa', 'england', 'south']
*
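
Note that each list above runs from the lowest to the highest of the ten weights, because `argsort()` sorts in ascending order and `[-10:]` keeps the tail. A reversed slice of the form `[:-n - 1:-1]` returns the same indices with the heaviest term first. The sketch below illustrates both slices on a hypothetical weight vector, using `sorted` as a pure-Python stand-in for NumPy's `argsort`.

```python
weights = [0.10, 0.80, 0.05, 0.40, 0.60]  # hypothetical topic-term weights

# Pure-Python equivalent of numpy's argsort: indices ordered by ascending weight
order = sorted(range(len(weights)), key=weights.__getitem__)

ascending_top3 = order[-3:]      # same slice style as `argsort()[-10:]`
descending_top3 = order[:-4:-1]  # same slice style as `argsort()[:-n - 1:-1]` with n = 3

print(ascending_top3)   # [3, 4, 1]
print(descending_top3)  # [1, 4, 3]
```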



In [13]:
from collections import defaultdict

category_docs = defaultdict(list)
for text, label in zip(texts, labels):
    category_docs[label].append(text)

In [14]:
def extract_topics(texts, n_topics=5, top_n_words=10):
    """Return the top words (heaviest first) of each NMF topic fitted on `texts`."""
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    doc_term = vectorizer.fit_transform(texts)

    nmf = NMF(n_components=n_topics, random_state=42)
    nmf.fit(doc_term)  # only the topic-term matrix (components_) is needed here

    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for topic in nmf.components_:
        # Reversed slice of the ascending argsort: indices of the largest weights first
        top_words = [feature_names[i] for i in topic.argsort()[:-top_n_words - 1:-1]]
        topics.append(top_words)
    return topics

In [15]:
list(category_docs)

['football', 'cricket', 'athletics', 'rugby', 'tennis']

In [16]:
tennis_topics = extract_topics(category_docs["tennis"], n_topics=2, top_n_words=10)
for i, topic in enumerate(tennis_topics):
    print(f"Tennis Sub-category {i+1}: {', '.join(topic)}")

Tennis Sub-category 1: seed, set, federer, match, agassi, final, hewitt, second, win, williams
Tennis Sub-category 2: cup, davis, moya, year, said, open, roddick, tennis, murray, injury


# Named Entity Recognition

In [17]:
import re

from transformers import pipeline

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

# Heuristic pattern: ", <words>" followed by a full stop or comma
JOB_PATTERN = re.compile(r', ([\w\s]+)[\.,]')

def extract_person_and_job(text):
    """Pair each PER entity with the comma-delimited phrase that follows it, if any."""
    entities = ner(text)
    persons = [e for e in entities if e['entity_group'] == 'PER']
    results = []
    for person in persons:
        # Look for ", JOB" after the name, up to 40 chars ahead
        tail = text[person['end']:person['end'] + 40]
        match = JOB_PATTERN.search(tail)
        job = match.group(1).strip() if match else None
        results.append({"name": person['word'], "job": job})
    return results
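
The `", JOB"` pattern in `extract_person_and_job` is a surface heuristic: it captures whatever words sit between a comma right after the name and the next comma or full stop, whether or not they describe a job. The sketch below runs the same regular expression on three hypothetical sentence tails to show one hit, one false positive, and one miss.

```python
import re

# Same pattern as in extract_person_and_job: ", <words>" up to the next "." or ","
pattern = re.compile(r', ([\w\s]+)[\.,]')

# Hypothetical sentence tails (the text right after a person's name)
tails = [
    ", the sporting director, said",       # a genuine job title
    ", troubled by a hip complaint, has",  # a descriptive clause, not a job
    " won in straight sets.",              # no comma right after the name
]

matches = []
for tail in tails:
    m = pattern.search(tail[:40])  # same 40-character window as the function
    matches.append(m.group(1).strip() if m else None)

print(matches)
```

Because the pattern cannot tell a title from any other comma-delimited clause, its matches are best treated as candidates that still need filtering.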



In [18]:
results = [extract_person_and_job(text) for text in category_docs["tennis"]]
print(results[:20])

[[{'name': 'Ed', 'job': None}, {'name': 'Agassi', 'job': None}, {'name': 'Dent Andre Agassi', 'job': None}, {'name': 'Taylor Dent', 'job': None}, {'name': 'Agassi', 'job': None}, {'name': 'Agassi', 'job': None}, {'name': 'Mario Ancic', 'job': None}, {'name': 'Ancic', 'job': None}, {'name': 'Safin', 'job': None}, {'name': 'Safin', 'job': None}, {'name': 'Safin', 'job': None}, {'name': 'Ancic', 'job': None}, {'name': 'Jarkko Nieminen', 'job': None}, {'name': 'Ni', 'job': None}, {'name': '##inen', 'job': None}, {'name': 'Federer', 'job': None}, {'name': 'Tommy Robredo', 'job': None}, {'name': 'Federer', 'job': None}], [{'name': 'Roche', 'job': None}, {'name': 'Federer', 'job': None}, {'name': 'Tony Roche', 'job': None}, {'name': 'Roger Federer', 'job': None}, {'name': 'Roche', 'job': 'troubled by a hip complaint'}, {'name': 'Roche', 'job': None}, {'name': 'Federer', 'job': None}, {'name': 'Peter Lundgren', 'job': None}, {'name': 'Roche', 'job': None}, {'name': 'John Newcombe', 'job': None
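
The output contains subword fragments such as `'Ni'` and `'##inen'`, where the tokenizer split a surname and the pieces were not grouped into one entity. A minimal post-processing sketch, assuming each entity dict carries the `start` and `end` character offsets that the pipeline returns, is to fold every `##` piece into the entity before it and re-read the name from the original text. The helper name `merge_wordpieces` and the sample text are hypothetical, and the sketch assumes a `##` piece always continues the immediately preceding entity.

```python
def merge_wordpieces(text, entities):
    """Fold '##' continuation pieces into the preceding entity using character offsets."""
    merged = []
    for ent in entities:
        if ent["word"].startswith("##") and merged:
            prev = merged[-1]
            prev["end"] = ent["end"]
            # Re-read the surface form from the text so skipped pieces are restored
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))
    return merged

# Hypothetical text and entity spans mimicking the 'Ni' / '##inen' split above
text = "Jarkko Nieminen lost. Nieminen said so."
second = text.index("Nieminen", 16)  # start of the second mention
entities = [
    {"word": "Jarkko Nieminen", "start": 0, "end": 15},
    {"word": "Ni", "start": second, "end": second + 2},
    {"word": "##inen", "start": second + 4, "end": second + 8},
]

merged = merge_wordpieces(text, entities)
unique_names = sorted({e["word"] for e in merged})
print(unique_names)
```

Another option that may reduce such splits at the source is a word-level `aggregation_strategy` such as `"first"` instead of `"simple"`; whether it helps on this corpus would need to be checked.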

In [19]:
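import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_job_title_extract(text):
    """Pair each PERSON entity with the head token of its appositive, if it has one."""
    doc = nlp(text)
    results = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # An appositive ("appos") child of the entity's root token often renames the person
            for token in ent.root.children:
                if token.dep_ == "appos":
                    results.append({"name": ent.text, "job": token.text})
    return results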

In [20]:
results2 = [spacy_job_title_extract(text) for text in category_docs["tennis"]]
print(results2[:20])

[[{'name': 'Tommy Robredo', 'job': '7'}], [], [{'name': 'Paradorn Srichaphan', 'job': '6'}, {'name': 'Paradorn Srichaphan', 'job': '6'}], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]


In [21]:
results3 = [spacy_job_title_extract(text) for text in texts]
print(results3[:20])

[[], [], [], [], [{'name': 'Gordon Strachan', 'job': 'Burns'}], [], [], [], [], [], [], [], [], [], [], [{'name': 'Thomas', 'job': 'Kishishev'}, {'name': 'Luis Garcia', 'job': 'Potter'}, {'name': 'Riise', 'job': 'Warnock'}, {'name': 'Att', 'job': '27,102'}], [], [], [], [{'name': 'Mateja Kezman', 'job': 'scorer'}]]
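
Most of the appositives found are not job titles: several are numbers (`'7'`, `'27,102'`) and several are other surnames (`'Burns'`, `'Potter'`). A simple filter, sketched below, keeps only lowercase alphabetic words, on the assumption that a job title is a common noun. The helper name `plausible_job` is hypothetical; the pairs are copied from the outputs above.

```python
def plausible_job(job):
    """Heuristic: keep lowercase alphabetic words, which are more likely common nouns."""
    return job.isalpha() and job.islower()

# Pairs copied from the outputs above
pairs = [
    {"name": "Tommy Robredo", "job": "7"},
    {"name": "Gordon Strachan", "job": "Burns"},
    {"name": "Att", "job": "27,102"},
    {"name": "Mateja Kezman", "job": "scorer"},
]

filtered = [p for p in pairs if plausible_job(p["job"])]
print(filtered)
```

Inside the spaCy function itself, a comparable check could test `token.pos_ == "NOUN"` on the appositive token, although that depends on the tagger's accuracy for this text.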


In [22]:
# Text Summarisation Model

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
sum = summarizer(texts[1], max_length=130, min_length=30, do_sample=False)
print(sum)


[{'summary_text': 'Jose Antonio Reyes has added to speculation linking him with a move to Real Madrid. Spaniard believed he was talking to sporting director Emilio Butragueno when he allegedly berated his team-mates as "bad people" The 21-year-old joined Arsenal from Seville for £17m at the start of 2004.'}]
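
The article summarised here is short, but `facebook/bart-large-cnn` accepts a limited input length (about 1,024 tokens), so longer articles may need to be split before summarising. The sketch below is one rough approach under stated assumptions: it chunks by whitespace-separated words rather than model tokens, and the helper name `chunk_by_words` and the 400-word limit are hypothetical choices, not values taken from the model.

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical long article made of 1,000 placeholder words
long_text = " ".join(f"word{i}" for i in range(1000))

chunks = chunk_by_words(long_text, max_words=400)
print([len(c.split()) for c in chunks])
```

Each chunk could then be passed to `summarizer` separately and the partial summaries joined, at the cost of losing context across chunk boundaries.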


In [23]:
print(texts[1])
print('\n')
print(sum)

Reyes tricked into Real admission

Jose Antonio Reyes has added to speculation linking him with a move from Arsenal to Real Madrid after falling victim to a radio prank.

The Spaniard believed he was talking to Real Madrid sporting director Emilio Butragueno when he allegedly berated his team-mates as "bad people". "I wish I was playing for Real Madrid," the 21-year-old told Cadena Cope. "Hopefully it could happen. I love the way Madrid play. I'm not happy with the way things are." The striker joined the Gunners from Seville for £17m at the start of 2004, but it has frequently been reported that he is homesick. He began the season in superb form but has struggled to maintain his high standards as Arsenal have gradually lost the Premiership initiative to Manchester United and Chelsea. "If I'm not (playing for Real) I'm going to have to carry on playing with some bad people," he added.

"I'm sure there are none in the Real dressing room. "I'm happy Madrid is interested in me because it 