## Importing the libraries

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Collection and Preprocessing
* Load Dataset: Import and load the dataset.
* Text Cleaning: Remove unnecessary punctuation, special characters, numbers, and stop words.
* Tokenization: Split articles into words/tokens for further processing.
* Lemmatization/Stemming: Convert words to their base or root form to improve consistency across the dataset.
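* Data Splitting: Divide the dataset into training, validation, and test sets (e.g., 70% train, 15% validation, 15% test). This notebook instead holds out 20% as a test set and runs 5-fold cross-validation on the remaining 80%.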
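
The fixed 70/15/15 split given as an example in the list can be sketched with two calls to `train_test_split`. This is only an illustration on hypothetical toy data (`texts` and `labels` are made up here); the notebook itself uses an 80/20 split plus cross-validation.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 100 documents, 5 balanced classes.
texts = [f"document {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

# First cut: 70% train, 30% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)

# Second cut: split the held-back 30% in half, giving 15% validation and 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions roughly equal across the three subsets, which is usually worth doing when some categories are smaller than others.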

### Load Dataset

In [26]:
pd.set_option('display.max_colwidth',None)
df = pd.read_csv('bbc_text_cls.csv')

df.head(2)

Unnamed: 0,text,labels
0,"Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.\n\nTime Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. 
For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n\nTimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.",business
1,"Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.""\n\nWorries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the ""time is ripe"" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. 
The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.",business


### Cleaning, Tokenizing, and Stemming the text

In [27]:
import re   # for cleaning
import nltk # for stop words and stemming
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # for stemming

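# Stop words: very common words (e.g. 'the', 'is') that carry little topical signal
# Stemming: reduce each word to its stem (e.g. 'profits' -> 'profit')
# Removing stop words and stemming both shrink the vocabulary,
#   and therefore the number of columns in the sparse matrix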

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [28]:
def stop_words():
  all_stopwords = stopwords.words('english')
  # drop 'not' from the stop-word list so that it is kept in the text
  all_stopwords.remove('not')

  return all_stopwords

In [29]:
def clean_stem_text(text):
  # replace every non-alphabetic character with a space
  cleaned_text = re.sub('[^a-zA-Z]', ' ', text)

  # convert to lowercase
  cleaned_text = cleaned_text.lower()

  # split the text into words
  tokens = cleaned_text.split()

  # drop stop words, then stem each remaining word
  ps = PorterStemmer()
  all_stopwords = set(stop_words())  # build the set once, not once per word
  stemmed_text = [ps.stem(word) for word in tokens
                  if word not in all_stopwords]
  # join the stems back into a single space-separated string
  stemmed_text = ' '.join(stemmed_text)

  return stemmed_text
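
To see what the regex and stop-word steps do in isolation, here is a small sketch that needs only the standard library. `DEMO_STOPWORDS` and `clean_tokens` are hypothetical names introduced for this illustration; the stop list is a tiny stand-in for NLTK's, and stemming is left out on purpose.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's English list with 'not' removed.
DEMO_STOPWORDS = {"to", "the", "for", "at", "is"}

def clean_tokens(text, stopwords=DEMO_STOPWORDS):
    # Digits, currency symbols and punctuation all become spaces.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stopwords]

sample = "Profits jumped 76% to $1.13bn (£600m) for the three months."
print(clean_tokens(sample))
```

Note that figures such as `$1.13bn` lose their digits and leave only the letter fragment `bn`, which is the same effect visible in the cleaned corpus.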

In [6]:
# list of cleaned article texts, one entry per row of df
corpus = [clean_stem_text(text) for text in df['text']]

In [7]:
corpus[0]

'ad sale boost time warner profit quarterli profit us media giant timewarn jump bn three month decemb year earlier firm one biggest investor googl benefit sale high speed internet connect higher advert sale timewarn said fourth quarter sale rose bn bn profit buoy one gain offset profit dip warner bro less user aol time warner said friday own search engin googl internet busi aol mix fortun lost subscrib fourth quarter profit lower preced three quarter howev compani said aol underli profit except item rose back stronger internet advertis revenu hope increas subscrib offer onlin servic free timewarn internet custom tri sign aol exist custom high speed broadband timewarn also restat result follow probe us secur exchang commiss sec close conclud time warner fourth quarter profit slightli better analyst expect film divis saw profit slump help box offic flop alexand catwoman sharp contrast year earlier third final film lord ring trilog boost result full year timewarn post profit bn perform re

## Create the "Bag of Words" (BoW) model

In [8]:
max_features = 1500

### Traditional BoW model

In [9]:
# Sparse matrix = each row represents one article text,
#                 each column represents one word of the vocabulary,
#                 each value is the number of times that word occurs in that article

# Create the sparse matrix
from sklearn.feature_extraction.text import CountVectorizer
# Limit the number of words
cv = CountVectorizer(max_features = max_features)

# Create an array from all the words in our corpus
X_count = cv.fit_transform(corpus).toarray()
# Dependent variable = Label
y_count = df.iloc[:, -1].values
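
As a rough picture of what the count matrix contains, the sketch below builds one by hand for two made-up documents using only the standard library. It is not how `CountVectorizer` is implemented; in particular it skips tokenization rules and the `max_features` cut, which keeps only the most frequent terms across the corpus.

```python
from collections import Counter

# Two hypothetical, already-cleaned documents.
docs = ["profit rose profit", "team won match"]

# Vocabulary: every distinct word, in alphabetical order (one column each).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one count per vocabulary word.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```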

### Weighted (wBoW) Model
Term Frequency-Inverse Document Frequency (TF-IDF)

In [10]:
# Sparse matrix = each row represents one article text,
#                 each column represents one word of the vocabulary,
#                 each value is the TF-IDF weight of that word in that article
#                 (not a raw count as in the traditional BoW model)

# Create the sparse matrix
from sklearn.feature_extraction.text import TfidfVectorizer
# Limit the number of words
tfidf = TfidfVectorizer(max_features = max_features)

# Create an array from all the words in our corpus
X_tfidf = tfidf.fit_transform(corpus).toarray()
# Dependent variable = Label
y_tfidf = df.iloc[:, -1].values
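
The weights can be reproduced by hand on a toy corpus. The sketch below assumes the default `TfidfVectorizer` settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the documents and the names `doc_freq`, `idf` and `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Three hypothetical, already-cleaned documents.
docs = ["profit rose", "profit fell", "team won"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents each word appears.
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0 for word in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    # L2-normalize so every non-empty row has unit length.
    return [value / norm if norm else 0.0 for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
print([round(value, 3) for value in rows[0]])
```

In the first document, `rose` ends up with a larger weight than `profit` even though both occur once, because `profit` also appears in another document.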

## Create and Evaluate the (traditional) Machine Learning models

In [11]:
vectorizer = tfidf
X = X_tfidf
y = y_tfidf

In [12]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Hold out 20% as the test set; the remaining 80% is used for training and cross-validation
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
# Initialize classifiers
models = {
    'Naive Bayes': MultinomialNB(),
    'SVM': SVC(kernel='linear', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# Perform k-fold cross-validation and evaluate each model
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for model_name, model in models.items():
    print(f'\nModel: {model_name}')

    # 5-fold cross-validation accuracy on the train/validation portion
    cv_accuracies = cross_val_score(model, X_train_val, y_train_val, cv=kf, scoring='accuracy')
    for fold, accuracy in enumerate(cv_accuracies, 1):
        print(f'Fold {fold}: Accuracy = {accuracy:.4f}')
    print(f'Mean CV Accuracy: {cv_accuracies.mean():.4f}')

    # Train the model on the entire training-validation set
    model.fit(X_train_val, y_train_val)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model on the test set
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f'Test Set Accuracy: {test_accuracy:.4f}')

    # Print classification report
    print('Classification Report:')
    print(classification_report(y_test, y_pred))


Model: Naive Bayes
Fold 1: Accuracy = 0.9747
Fold 2: Accuracy = 0.9663
Fold 3: Accuracy = 0.9635
Fold 4: Accuracy = 0.9775
Fold 5: Accuracy = 0.9803
Mean CV Accuracy: 0.9725
Test Set Accuracy: 0.9618
Classification Report:
               precision    recall  f1-score   support

     business       0.96      0.94      0.95       115
entertainment       0.97      0.96      0.97        72
     politics       0.94      0.95      0.94        76
        sport       1.00      0.99      1.00       102
         tech       0.94      0.97      0.96        80

     accuracy                           0.96       445
    macro avg       0.96      0.96      0.96       445
 weighted avg       0.96      0.96      0.96       445


Model: SVM
Fold 1: Accuracy = 0.9747
Fold 2: Accuracy = 0.9747
Fold 3: Accuracy = 0.9719
Fold 4: Accuracy = 0.9747
Fold 5: Accuracy = 0.9831
Mean CV Accuracy: 0.9758
Test Set Accuracy: 0.9730
Classification Report:
               precision    recall  f1-score   support

     b
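
One caveat about the evaluation above: the TF-IDF vectorizer was fitted on the full corpus before splitting, so the vocabulary and IDF weights have already seen the validation and test articles. A possible alternative, sketched here on a hypothetical toy corpus, is to put the vectorizer and classifier in a `Pipeline` so that each cross-validation fold refits the vectorizer on its own training part only. Whether this changes the scores noticeably on the BBC data has not been checked here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 documents per class.
texts = [
    "profit rose sharply", "shares fell on profit warning",
    "bank reports quarterly profit", "market shares rose",
    "bank profit fell", "quarterly market report",
    "team won the match", "striker scored two goals",
    "coach praised the team", "match ended in a draw",
    "team scored late goals", "coach named the squad",
]
labels = ["business"] * 6 + ["sport"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # refitted inside every fold
    ("clf", MultinomialNB()),
])

# cv=3 uses stratified folds for classifiers.
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
print(scores.mean())
```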

## Finalize and Save the Model

Best Model: SVM
* Mean CV Accuracy: 0.9758
* Test Set Accuracy: 0.9730

In [14]:
vectorizer = tfidf
best_model = models['SVM']

In [15]:
# Refit the chosen model on the combined training and validation data (X_train_val).
# The evaluation loop already fitted it on the same data, so this call
#   mainly makes the final training step explicit.
# The test set is left out of training so that it keeps simulating
#   new, unseen data for evaluation purposes.
best_model.fit(X_train_val, y_train_val)

In [16]:
import joblib  # For saving the model
joblib.dump(best_model, 'svm_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']
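
Saving the model and the vectorizer as two files works, but the two must always be loaded together and kept in sync. As a hedged alternative, the sketch below bundles both into one `Pipeline` and checks that a save/load round trip gives identical predictions. The toy data and the file name `text_classifier.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy training data.
texts = ["profit rose", "bank profit fell", "team won match", "coach praised team"]
labels = ["business", "business", "sport", "sport"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear", random_state=42)),
])
pipeline.fit(texts, labels)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_classifier.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_texts = ["bank profit rose", "team won"]
original_pred = list(pipeline.predict(new_texts))
restored_pred = list(restored.predict(new_texts))
print(original_pred == restored_pred)
```

With a single artifact, the loaded object accepts cleaned text directly, so there is no separate `transform` step to forget at prediction time.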

## Production simulation

In [17]:
article = '''
Spain fines budget airlines including Ryanair €179m

Spain has fined five budget airlines a total of €179m (£149m) for "abusive practices" including charging for hand luggage.

Ryanair has been given the largest fine of €108m (£90m), followed by EasyJet's penalty of €29m (£24m).

Vueling, Norwegian and Volotea were issued with sanctions by Spain's Consumer Rights Ministry on Friday.

The ministry said it plans to ban practices such as charging extra for carry-on hand luggage and reserving seats for children.

The fines are the biggest sanction issued by the ministry, and follow an investigation into the budget airline industry.

The ministry said it had upheld fines that were first announced in May after dismissing appeals lodged by the companies.

Vueling, the budget arm of British Airways owner IAG, has been fined €39m (£32m), while Norwegian Airlines and Volotea have been fined €1.6m (£1.3m) and €1.2m (£1m) respectively.

The fines were issued because the airlines were found to have provided misleading information and were not transparent with prices, "which hinders consumers' ability to compare offers" and make informed decisions, the ministry said.

Ryanair was accused of violating a range of consumer rights, including charging for larger carry-on luggage, seat selection, and asking for "a disproportionate amount" to print boarding passes at terminals.

Each fine was calculated based on the "illicit profit" obtained by each airline from these practices.

Ryanair boss Michael O'Leary said the fines were "illegal" and "baseless", adding that he will appeal the case and take it to the EU courts.

"Ryanair has for many years used bag fees and airport check-in fees to change passenger behaviour and we pass on these cost savings in the form of lower fares to consumers," he said.

Easyjet and Norwegian said they would also appeal the decision.

The Spanish airline industry watchdog, ALA, plans a further appeal and has called the ministry's decision "nonsense", arguing the fine infringes EU free market rules.

But Andrés Barragán, secretary general for consumer affairs and gambling at the ministry, defended the fines, saying the government's decision was based on Spanish and EU law.

"It is an abuse to charge €20 for just printing the boarding card in the airport, [it's] something no one wants," he told the BBC's World Business Report programme.

"This is a problem consumers are facing not only in Spain but in other EU countries."

Consumer rights association Facua, which has campaigned against the fees for six years, said the decision was "historic".
'''

In [18]:
# Load the model and vectorizer
prod_model = joblib.load('svm_model.pkl')
prod_vectorizer = joblib.load('tfidf_vectorizer.pkl')

In [19]:
normalized_article = clean_stem_text(article)

In [20]:
normalized_article

'spain fine budget airlin includ ryanair spain fine five budget airlin total abus practic includ charg hand luggag ryanair given largest fine follow easyjet penalti vuel norwegian volotea issu sanction spain consum right ministri friday ministri said plan ban practic charg extra carri hand luggag reserv seat children fine biggest sanction issu ministri follow investig budget airlin industri ministri said upheld fine first announc may dismiss appeal lodg compani vuel budget arm british airway owner iag fine norwegian airlin volotea fine respect fine issu airlin found provid mislead inform not transpar price hinder consum abil compar offer make inform decis ministri said ryanair accus violat rang consum right includ charg larger carri luggag seat select ask disproportion amount print board pass termin fine calcul base illicit profit obtain airlin practic ryanair boss michael leari said fine illeg baseless ad appeal case take eu court ryanair mani year use bag fee airport check fee chang 

In [21]:
vectorized_article = prod_vectorizer.transform([normalized_article]).toarray()

In [22]:
vectorized_article

array([[0.0454829, 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ]])

In [23]:
prediction = prod_model.predict(vectorized_article)[0]

In [24]:
prediction

'business'
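
The clean, vectorize and predict steps above could be wrapped in one helper for reuse. `classify_article` is a hypothetical name; the function takes the cleaner, vectorizer and model as arguments so it does not depend on notebook globals.

```python
def classify_article(text, cleaner, vectorizer, model):
    """Return the predicted label for one raw article string."""
    # Apply the same normalization that was used on the training corpus.
    cleaned = cleaner(text)
    # The SVM in this notebook was trained on dense arrays, hence .toarray().
    features = vectorizer.transform([cleaned]).toarray()
    return model.predict(features)[0]
```

In this notebook the call would look like `classify_article(article, clean_stem_text, prod_vectorizer, prod_model)`, which should give the same label as the step-by-step cells.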