## BBC News Classification Using FastText
**Dataset:** https://www.kaggle.com/datasets/alfathterry/bbc-full-text-document-classification

**Objective:** Classify BBC news articles into one of the following categories:
1. Entertainment
2. Business
3. Sport
4. Politics
5. Tech

#### Import Dataset

In [None]:
import pandas as pd

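# header=None with explicit names means the file's own header row is read as
# data row 0; it is dropped in the preprocessing step below.
df = pd.read_csv("bbc_data.csv", names=["text", "label"], header=None)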
print(df.shape)
df.head(3)

(2226, 2)


Unnamed: 0,text,label
0,data,labels
1,Musicians to tackle US red tape Musicians gro...,entertainment
2,"U2s desire to be number one U2, who have won ...",entertainment


#### Data Preprocessing

In [None]:
# Row 0 holds the CSV's original header ("data", "labels"), not an article
df = df.drop(0)
df.head(3)

Unnamed: 0,text,label
1,Musicians to tackle US red tape Musicians gro...,entertainment
2,"U2s desire to be number one U2, who have won ...",entertainment
3,Rocker Doherty in on-stage fight Rock singer ...,entertainment


In [None]:
df.dropna(inplace=True)
df.shape

(2225, 2)

In [None]:
df.label.unique()

array(['entertainment', 'business', 'sport', 'politics', 'tech'],
      dtype=object)

In [None]:
# fastText recognises labels by the "__label__" prefix (its default)
df['label'] = '__label__' + df['label'].astype(str)
df.head(5)

Unnamed: 0,text,label
1,Musicians to tackle US red tape Musicians gro...,__label__entertainment
2,"U2s desire to be number one U2, who have won ...",__label__entertainment
3,Rocker Doherty in on-stage fight Rock singer ...,__label__entertainment
4,Snicket tops US box office chart The film ada...,__label__entertainment
5,"Oceans Twelve raids box office Oceans Twelve,...",__label__entertainment


In [None]:
# fastText expects one example per line: the label followed by the text
df['label_text'] = df['label'] + ' ' + df['text']
df.head(3)

Unnamed: 0,text,label,label_text
1,Musicians to tackle US red tape Musicians gro...,__label__entertainment,__label__entertainment Musicians to tackle US ...
2,"U2s desire to be number one U2, who have won ...",__label__entertainment,__label__entertainment U2s desire to be number...
3,Rocker Doherty in on-stage fight Rock singer ...,__label__entertainment,__label__entertainment Rocker Doherty in on-st...


In [None]:
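import re

def preprocess(text):
    """Normalise one training line: strip punctuation, collapse spaces, lowercase."""
    # Replace everything except word characters, whitespace and apostrophes with
    # a space. Underscores count as word characters, so "__label__" survives.
    text = re.sub(r'[^\w\s\']', ' ', text)
    # Collapse runs of spaces left behind by the substitution
    text = re.sub(' +', ' ', text)
    return text.strip().lower()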
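
As a quick sanity check, the function can be tried on a single line. The sketch below repeats the function so that it runs on its own; the sample sentence is made up for illustration and does not come from the dataset.

```python
import re

def preprocess(text):
    # Same logic as the notebook cell above, repeated so this snippet is standalone
    text = re.sub(r'[^\w\s\']', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()

# Hypothetical input line (not taken from the dataset)
sample = "__label__sport  Hello, World! It's on-stage."
cleaned = preprocess(sample)
print(cleaned)
```

Punctuation and the hyphen become spaces, the apostrophe is kept, and the `__label__` prefix passes through untouched because underscores match `\w`.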

In [None]:
df['label_text'] = df['label_text'].map(preprocess)
df.head()

Unnamed: 0,text,label,label_text
1,Musicians to tackle US red tape Musicians gro...,__label__entertainment,__label__entertainment musicians to tackle us ...
2,"U2s desire to be number one U2, who have won ...",__label__entertainment,__label__entertainment u2s desire to be number...
3,Rocker Doherty in on-stage fight Rock singer ...,__label__entertainment,__label__entertainment rocker doherty in on st...
4,Snicket tops US box office chart The film ada...,__label__entertainment,__label__entertainment snicket tops us box off...
5,"Oceans Twelve raids box office Oceans Twelve,...",__label__entertainment,__label__entertainment oceans twelve raids box...


#### Train Test Splitting

In [None]:
from sklearn.model_selection import train_test_split

# No random_state is set, so the split (and the scores below) vary between runs
train, test = train_test_split(df, test_size=0.2)

In [None]:
train.shape, test.shape

((1780, 3), (445, 3))
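
The split used here is random and not stratified. If reproducible scores and equal category proportions in both sets are wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below shows this on toy data; the seed value 42 and the toy texts are arbitrary assumptions, and applying the same arguments to `df` would change the numbers reported in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

categories = ["entertainment", "business", "sport", "politics", "tech"]

# Toy data: 20 placeholder "articles" per category (hypothetical, for illustration)
labels = [c for c in categories for _ in range(20)]
texts = [f"{label} article {i}" for i, label in enumerate(labels)]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,   # arbitrary seed so the split is repeatable
    stratify=labels,   # keep category proportions equal in train and test
)

test_counts = Counter(test_labels)
print(len(train_texts), len(test_texts))
print(test_counts)
```

With stratification, each of the five categories contributes the same share to the test set as it does to the full data.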

In [None]:
train.to_csv("bbc.train", columns=["label_text"], index=False, header=False)
test.to_csv("bbc.test", columns=["label_text"], index=False, header=False)
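
fastText reads one example per line, so a line break inside an article would split it into two examples, and a CSV writer may wrap such a field in quotes. The cleaned text here may well contain no such characters. If that cannot be guaranteed, one alternative is to write the lines directly. The helper name `write_fasttext_file` and the sample rows below are hypothetical.

```python
import os
import tempfile

def write_fasttext_file(lines, path):
    """Write one example per line, collapsing any internal whitespace or line breaks."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            # str.split() with no arguments splits on any whitespace, including "\n"
            f.write(" ".join(line.split()) + "\n")

# Hypothetical rows; the first contains an embedded line break
rows = [
    "__label__sport first line\nsecond line",
    "__label__tech  chips  are  faster",
]

path = os.path.join(tempfile.mkdtemp(), "demo.train")
write_fasttext_file(rows, path)

with open(path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```

In the notebook this would be called as `write_fasttext_file(train["label_text"], "bbc.train")`, under the assumption that the column holds plain strings.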

#### Modelling

In [None]:
# !pip install fasttext
import fasttext

model = fasttext.train_supervised(input="bbc.train", epoch=20)
model.test("bbc.test")

(445, 0.8696629213483146, 0.8696629213483146)

After training, we evaluate the FastText model on the test set. `model.test` returns three values:

- **Number of samples (445):** the number of examples in the test set.
- **Precision at 1 (0.8697):** the fraction of the model's top-1 predictions that are correct.
- **Recall at 1 (0.8697):** the fraction of true labels that the top-1 predictions recover, computed over the whole test set rather than per category.

The two values are identical because every article has exactly one true label and the model returns exactly one prediction per article. In that setting both metrics reduce to plain accuracy: about 87% of the test articles are classified correctly.
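
To make the relationship concrete, the sketch below computes precision and recall at k using the usual definitions (correct predictions divided by predictions made, and correct predictions divided by true labels). It does not call fastText. The function name and the toy predictions are assumptions, and the count of 387 correct articles is inferred from the reported score (387 / 445 ≈ 0.8697).

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """true_labels: list of sets; predicted_labels: list of top-k prediction lists."""
    n_correct = sum(
        len(set(pred) & gold) for gold, pred in zip(true_labels, predicted_labels)
    )
    n_predicted = sum(len(pred) for pred in predicted_labels)
    n_gold = sum(len(gold) for gold in true_labels)
    return n_correct / n_predicted, n_correct / n_gold

# Toy stand-in for the test set: 445 single-label articles, 387 predicted correctly
gold = [{"business"}] * 445
top1 = [["business"]] * 387 + [["sport"]] * 58

p1, r1 = precision_recall_at_k(gold, top1)
print(p1, r1)  # identical, and equal to accuracy

# With two predictions per article the two metrics diverge
top2 = [["business", "sport"]] * 445
p2, r2 = precision_recall_at_k(gold, top2)
print(p2, r2)
```

With one label and one prediction per article the two denominators are both 445, which is why the reported values match.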

In [None]:
model.predict("ukraine strikes turkmen gas deal ukraine has agreed to pay 30 more for natural gas supplied by turkmenistan the deal was sealed three days after turkmenistan cut off gas supplies in a price dispute that threatened the ukrainian economy supplies from turkmenistan account for 45 of all natural gas imported by ukraine which has large coal deposits but no gas fields turkmenistan is also trying to strike a similar deal with russia which is not so dependent on its gas turkmen president saparmurat niyazov who signed the contract said the turkmen side agreed to lower the price demanded by 2 per 1 000 cubic metres bringing it down to 58 but the new price is still 14 higher than the price fixed in the contract for 2004 the head of the ukrainian state owned naftohaz company yury boyko said he was fully happy with the deal on friday turkmenistan acted on a threat and shut off gas supplies to ukraine in attempt to bring the price dispute to a head mr niyazov said that his government would insist on the same price for supplies to russia analysts say thay may not happen as russia the worlds leading gas producer needs the cheap turkmen gas only to relieve is state owned gazprom from costly investment in the exploration of oil fields in siberia turkmenistan is the second largest gas producer in the world")

(('__label__business',), array([0.69115645]))
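
`model.predict` returns a tuple of labels and a parallel array of probabilities, so the article above is assigned to business with a probability of about 0.69. A small helper can strip the `__label__` prefix for display. The name `format_prediction` is hypothetical, and the sketch uses a plain list in place of the NumPy array so that it runs without fastText installed.

```python
def format_prediction(prediction, prefix="__label__"):
    """Turn (labels, probabilities) into a list of (category, probability) pairs."""
    labels, probabilities = prediction
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(prob))
        for label, prob in zip(labels, probabilities)
    ]

# Values copied from the output above, with a list standing in for the array
raw = (("__label__business",), [0.69115645])
readable = format_prediction(raw)
print(readable)
```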