# BBC text categorization

This dataset contains 2225 BBC news articles and has two fields. The first is the article's category; the second is the article's title and body text. In this notebook I use classical machine learning models as well as a convolutional neural network to classify each article into one of 5 categories.

## Download data

You can download the data here: [BBC articles fulltext and category](https://www.kaggle.com/yufengdev/bbc-fulltext-and-category)

In [2]:
import pandas as pd
raw_data = pd.read_csv('/content/drive/My Drive/Google colab data/bbc-text.csv')

## Understanding data

In [3]:
raw_data.head()

|   | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [4]:
raw_data['text'][1] # category: business

'worldcom boss  left books alone  former worldcom boss bernie ebbers  who is accused of overseeing an $11bn (£5.8bn) fraud  never made accounting decisions  a witness has told jurors.  david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not responsible for worldcom s problems. the phone company collapsed in 2002 and prosecutors claim that losses were hidden to protect the firm s shares. mr myers has already pleaded guilty to fraud and is assisting prosecutors.  on monday  defence lawyer reid weingarten tried to distance his client from the allegations. during cross examination  he asked mr myers if he ever knew mr ebbers  make an accounting decision  .  not that i am aware of   mr myers replied.  did you ever know mr ebbers to make an accounting entry into worldcom books   mr weingarten pressed.  no   replied the witness. mr myers has admitted that he ordered false accounting entries at the request of former worldcom chief financi

In [5]:
raw_data.isnull().sum()

category    0
text        0
dtype: int64

In [6]:
raw_data.describe()

|   | category | text |
|---|---|---|
| count | 2225 | 2225 |
| unique | 5 | 2126 |
| top | sport | howl helps boost japan s cinemas japan s box o... |
| freq | 511 | 2 |
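The `describe()` output reports 2225 rows but only 2126 unique texts, so 99 rows repeat an earlier article. Such duplicates can land on both sides of a train/test split. One possible way to drop them is sketched below on plain Python lists; `drop_duplicate_texts` is a hypothetical helper introduced here, and with pandas `raw_data.drop_duplicates(subset='text')` should have the same effect.

```python
def drop_duplicate_texts(rows):
    """Keep the first (category, text) pair for each distinct text, preserving order."""
    seen = set()
    unique_rows = []
    for category, text in rows:
        if text not in seen:
            seen.add(text)
            unique_rows.append((category, text))
    return unique_rows


# Toy rows for illustration only
rows = [
    ("sport", "tigers wary of farrell gamble"),
    ("tech", "tv future in the hands of viewers"),
    ("sport", "tigers wary of farrell gamble"),  # repeated article
]
unique_rows = drop_duplicate_texts(rows)
print(len(rows) - len(unique_rows), "duplicate row(s) removed")
```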


## Preprocessing data

In [7]:
import nltk 
import re
from nltk.stem import PorterStemmer 
from gensim.parsing.preprocessing import remove_stopwords

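# Compile the patterns once instead of on every call
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def preprocessing_data(text):

  # Replace separators with a space, then drop the remaining special characters
  text = REPLACE_BY_SPACE_RE.sub(' ', text)
  text = BAD_SYMBOLS_RE.sub('', text)

  # Stemming
  # (SnowballStemmer also covers about 15 languages other than English, but not Vietnamese)
  # NOTE: stem() expects a single word. Given a whole article it treats the
  # string as one token, so the individual words stay essentially unstemmed.
  stemmer = PorterStemmer()
  text = stemmer.stem(text)

  # Remove stop-words
  text = remove_stopwords(text)
  text = text.strip()

  return text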
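Stemming every word of an article requires splitting the text into tokens and stemming each one. A minimal sketch of that per-token pattern follows. `stem_tokens` and `strip_plural` are hypothetical names introduced for illustration, and `strip_plural` is only a toy stand-in so the snippet runs without NLTK; in practice one would pass `PorterStemmer().stem` as `stem_fn`.

```python
from typing import Callable


def stem_tokens(text: str, stem_fn: Callable[[str], str]) -> str:
    """Apply a word-level stemmer to every whitespace-separated token."""
    return " ".join(stem_fn(token) for token in text.split())


def strip_plural(word: str) -> str:
    """Toy stand-in for a real stemmer: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


print(stem_tokens("viewers watch films on home systems", strip_plural))
```

Swapping this in for the single `stemmer.stem(text)` call would change the cleaned text, so the outputs recorded in this notebook would no longer match.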

In [8]:
data = raw_data.copy()
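# NOTE: the result is only displayed, not assigned back, so data['text'] still
# holds the raw text in the cells below. To train on the cleaned text, assign it:
# data['text'] = data['text'].apply(preprocessing_data)
data['text'].apply(preprocessing_data)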

0       tv future hands viewers home theatre systems p...
1       worldcom boss left books worldcom boss bernie ...
2       tigers wary farrell gamble leicester rushed ma...
3       yeading face newcastle fa cup premiership newc...
4       ocean s raids box office ocean s crime caper s...
                              ...                        
2220    cars pull retail figures retail sales fell 03 ...
2221    kilroy unveils immigration policy exchatshow h...
2222    rem announce new glasgow concert band rem anno...
2223    political squabbles snowball s commonplace arg...
2224    souness delight euro progress boss graeme soun...
Name: text, Length: 2225, dtype: object

## Feature extraction and training model

In [9]:
from sklearn.model_selection import train_test_split

X_train,X_test, y_train,y_test = train_test_split(data['text'], data['category'], train_size = 0.9 , random_state=42 )

### Using CountVectorizer with LogisticRegression

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# I won't use tree-based models here because they are a poor fit for this problem

# Using CountVectorizer with LogisticRegression
pipeline = Pipeline ([('BOW',CountVectorizer()),('clf' , LogisticRegression() )])

params = {
    'BOW__ngram_range':[(1,2),(1,3)],
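    'BOW__min_df': [0.01, 0.05, 0.1], # drop n-grams that occur in too few documents (df: document frequency)
    'clf__penalty':['l2','l1'], # note: 'l1' needs solver='liblinear' or 'saga'; with the default lbfgs solver those fits are expected to fail and score NaN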
    'clf__C':[0.001, 0.01, 0.1]
}

grid_logistic = GridSearchCV(pipeline, param_grid = params, n_jobs=-1, cv = 5)

grid_logistic.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('BOW',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prepr

In [11]:
print('Best score of logistic with BOW: {}'.format(grid_logistic.best_score_))
print('Best parameter of logistic with BOW: {}'.format(grid_logistic.best_params_))

best_logistic_estimator = grid_logistic.best_estimator_

Best score of logistic with BOW: 0.9685411471321697
Best parameter of logistic with BOW: {'BOW__min_df': 0.01, 'BOW__ngram_range': (1, 2), 'clf__C': 0.1, 'clf__penalty': 'l2'}
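
A float `min_df` such as the winning 0.01 is read by `CountVectorizer` as a fraction of documents: terms that occur in a smaller share of the documents are dropped from the vocabulary. A rough pure-Python sketch of that filter is shown below, using a hypothetical `filter_by_min_df` helper and unigrams only (the real vectorizer also builds n-grams and tokenizes differently).

```python
from collections import Counter


def filter_by_min_df(docs, min_df):
    """Return the terms whose document frequency is at least min_df (a fraction)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats inside it
        doc_freq.update(set(doc.split()))
    n_docs = len(docs)
    return sorted(term for term, df in doc_freq.items() if df / n_docs >= min_df)


# Toy corpus for illustration only
docs = [
    "shares fall as bank cuts rates",
    "bank shares rise",
    "film wins award",
    "bank film deal",
]
print(filter_by_min_df(docs, min_df=0.5))
```

With about 2000 training articles, `min_df=0.01` keeps only n-grams that appear in roughly 20 or more of them.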


### Using TfidfVectorizer with SVC

In [12]:
from sklearn.svm import LinearSVC
pipe = Pipeline([('Tfidf',TfidfVectorizer()), ('clf', SVC())])

params2 = {
    'Tfidf__ngram_range':[(1,2),(1,3)],
    'Tfidf__min_df':[0.01, 0.05, 0.1],
    'clf__C':[0.001, 0.01, 0.1],
    'clf__kernel':['rbf','linear']
}

grid_SVC = GridSearchCV(pipe, param_grid= params2, n_jobs= -1, cv=5)
grid_SVC.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('Tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n

In [14]:
print('Best score of SVC with Tfidf: {}'.format(grid_SVC.best_score_))
print('Best parameter of SVC with Tfidf: {}'.format(grid_SVC.best_params_))

best_SVC_estimator = grid_SVC.best_estimator_

Best score of SVC with Tfidf: 0.9425561097256857
Best parameter of SVC with Tfidf: {'Tfidf__min_df': 0.01, 'Tfidf__ngram_range': (1, 2), 'clf__C': 0.1, 'clf__kernel': 'linear'}


### Using a convolutional neural network

In [19]:
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# First, represent each word by an integer (its index),
# because the embedding layer maps each integer (word) to a dense vector
tokenizer = Tokenizer(num_words = 40000)
tokenizer.fit_on_texts(X_train)
X_train_cnn = tokenizer.texts_to_sequences(X_train)
X_test_cnn = tokenizer.texts_to_sequences(X_test)

print(X_train_cnn[1])
print(X_train.iloc[1])  # positional lookup; X_train[1] would look up index label 1 instead

[1689, 1733, 806, 2, 1830, 555, 510, 1893, 1689, 1733, 1, 1003, 539, 2131, 17, 2012, 2, 16, 5, 806, 2, 1, 1830, 1689, 9, 9225, 1, 64, 2184, 3, 1, 1422, 81, 29, 211, 255, 6, 1, 63, 770, 6, 1, 6561, 1831, 15, 13, 26, 1975, 66, 16, 1832, 60, 658, 35, 38, 954, 22, 1, 63, 1014, 3, 1, 409, 2184, 4, 1, 332, 35, 2, 146, 75, 828, 67, 26, 20, 2, 2520, 71, 9, 5758, 4, 1, 1003, 539, 9, 5758, 21, 1, 914, 26, 206, 1, 63, 510, 2, 20, 149, 11, 1689, 386, 6562, 2012, 9226, 173, 611, 5423, 5142, 4, 5759, 1753, 18523, 24, 15, 9, 28, 18524, 4726, 26, 222, 66, 20, 2, 16, 9227, 52, 2012, 101, 30, 20, 634, 79, 3, 829, 4, 20, 149, 1, 174, 330, 536, 26, 20, 38, 2217, 8, 171, 79, 4, 26, 20, 1, 275, 968, 3, 1003, 539, 5424, 2, 124, 664, 2012, 325, 14196, 853, 26, 150, 10, 11, 9, 133, 2, 16, 5, 659, 332, 8, 49, 101, 555, 20, 5, 130, 233, 423, 3, 234, 776, 26, 20, 302, 555, 5, 353, 337, 12, 696, 9228, 30, 25, 100, 130, 4, 30, 25, 100, 380, 12, 1, 3413, 895, 30, 708, 100, 110, 4, 466, 4211, 3, 1, 760, 76, 11, 23, 1
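
The integer sequence above comes from a frequency-ranked vocabulary: Keras' `Tokenizer` gives index 1 to the most frequent word, 2 to the next, and so on, keeping 0 free for padding. A simplified sketch of the idea follows, using hypothetical helpers `build_word_index` and `texts_to_ids` and ignoring the character filtering, lowercasing, and `num_words` handling that the real class performs.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank by descending corpus frequency."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace known words by their index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


# Toy training texts for illustration only
train = ["the bank cut the rate", "the bank rose"]
word_index = build_word_index(train)
print(word_index)
print(texts_to_ids(["the rate rose again"], word_index))
```

Words that never occurred in the training texts ("again" here) simply vanish from the sequence, which is also what happens to unseen test words in this notebook because no out-of-vocabulary token is configured.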

In [50]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

max_len = X_train.apply(lambda x:len(x.split())).max()
print(max_len)

X_train_cnn = pad_sequences(X_train_cnn, padding='post', maxlen=max_len)
X_test_cnn = pad_sequences(X_test_cnn, padding='post', maxlen=max_len)

28407
4492
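
With `padding='post'`, sequences shorter than `maxlen` get zeros appended. Longer ones are cut, and since `truncating` is left at its default of `'pre'`, the cut removes tokens from the front. A small pure-Python sketch of that behaviour, assuming a positive `maxlen` (`pad_post` is a hypothetical helper, not part of Keras):

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end up to maxlen; truncate longer sequences from the front."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])  # keep the last maxlen tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


print(pad_post([[5, 3], [7, 8, 9, 4, 2]], maxlen=4))
```

Since `max_len` here is the longest training article (4492 tokens), most rows consist largely of padding; a shorter cap would trade some truncation for less computation.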


In [51]:
# Build model
from tensorflow.keras import layers
embedding_dim = 300
model = keras.Sequential()
model.add(layers.Embedding(input_dim= vocab_size,output_dim= embedding_dim, input_length=max_len ))
model.add(layers.Conv1D(128,5, activation = 'relu'))
model.add(layers.GlobalAveragePooling1D()) # produces 128 features
model.add(layers.Dense(64, activation ='relu'))
model.add(layers.Dense(5))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4492, 300)         8522100   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 4488, 128)         192128    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 325       
Total params: 8,722,809
Trainable params: 8,722,809
Non-trainable params: 0
_________________________________________________________________
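
The shapes and parameter counts in the summary can be checked by hand from the values used above (vocabulary size 28407, embedding size 300, 128 filters of width 5, sequence length 4492). A short arithmetic check:

```python
vocab_size, embedding_dim, max_len = 28407, 300, 4492
filters, kernel_size = 128, 5

embedding_params = vocab_size * embedding_dim                  # one vector per word index
conv_params = kernel_size * embedding_dim * filters + filters  # weights + biases
conv_length = max_len - kernel_size + 1                        # no padding, stride 1
dense1_params = filters * 64 + 64                              # 128 pooled features -> 64 units
dense2_params = 64 * 5 + 5                                     # 64 units -> 5 class logits

total = embedding_params + conv_params + dense1_params + dense2_params
print(embedding_params, conv_params, conv_length, dense1_params, dense2_params, total)
```

Almost all of the 8.7 million parameters sit in the embedding layer, which is worth keeping in mind given that only about 2000 articles are available for training.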


In [56]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

le = LabelEncoder()
y_train_cnn = le.fit_transform(y_train)
y_test_cnn = le.transform(y_test)  # reuse the mapping fitted on the training labels

model.compile(optimizer='adam',
              loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics = ['accuracy'])

history = model.fit(X_train_cnn,y_train_cnn,epochs=10, batch_size=32, validation_data=(X_test_cnn,y_test_cnn))
# Early stopping would be a sensible addition here


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
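
The training cell notes that early stopping would help. In Keras this is usually done by passing a `tf.keras.callbacks.EarlyStopping` instance (for example with `monitor='val_loss'` and a `patience` value) to `model.fit` through `callbacks=[...]`. The rule it applies can be sketched in plain Python; `stopping_epoch` and the loss values below are made up for illustration.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Illustrative validation losses, not results from this notebook
print(stopping_epoch([0.90, 0.55, 0.40, 0.42, 0.45, 0.50], patience=2))
```

Monitoring a split carved out of the training data, rather than the test set passed as `validation_data` above, would keep the final test accuracy an unbiased estimate.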


In [None]:
loss, accuracy = model.evaluate(X_train_cnn, y_train_cnn)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test_cnn, y_test_cnn)
print("Testing Accuracy:  {:.4f}".format(accuracy))
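
Because the last `Dense(5)` layer has no activation and the loss uses `from_logits=True`, `model.predict` returns raw logits rather than probabilities. A minimal sketch of turning one row of logits into probabilities and a label is given below. The logit values are invented, and the class order assumes `LabelEncoder`'s alphabetical sorting of the five categories.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["business", "entertainment", "politics", "sport", "tech"]
logits = [1.2, -0.3, 0.1, 3.4, 0.0]  # invented example values
probs = softmax(logits)
predicted = classes[probs.index(max(probs))]
print(predicted, round(sum(probs), 6))
```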

## Look at Feature importance

Among the models with recorded scores, logistic regression does best (cross-validation accuracy 0.9685 versus 0.9426 for the SVC), so we use `best_logistic_estimator` to look for the most informative features. Because the count features are on roughly the same scale, the model's coefficients can serve as a measure of feature importance: the larger a term's coefficient for a class, the more strongly it points to that class.

In [69]:
coef_matrix = best_logistic_estimator[1].coef_      # shape: (n_classes, n_features)
classes = best_logistic_estimator[1].classes_
vocab = best_logistic_estimator[0].vocabulary_      # maps term -> column index

for class_idx, category in enumerate(classes):
  # Column indices of the 10 largest coefficients for this class
  ranked = sorted(enumerate(coef_matrix[class_idx, :]), key = lambda x: x[1], reverse = True)[:10]
  top_indices = {idx for idx, _ in ranked}

  # Map the indices back to terms (listed in vocabulary order, not by coefficient)
  top_terms = [term for term, idx in vocab.items() if idx in top_indices]
  print(category, ' : ', top_terms)


business  :  ['its', 'companies', 'bank', 'economy', 'economic', 'firm', 'business', 'company', 'market', 'shares']
entertainment  :  ['star', 'us', 'tv', 'film', 'films', 'album', 'band', 'music', 'show', 'singer']
politics  :  ['mps', 'mr', 'labour', 'government', 'party', 'secretary', 'uk', 'minister', 'election', 'blair']
sport  :  ['but', 'win', 'club', 'match', 'cup', 'we', 'season', 'champion', 'olympic', 'seed']
tech  :  ['game', 'games', 'computer', 'users', 'people', 'technology', 'software', 'digital', 'online', 'sony']
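
The lists above follow vocabulary order, which hides which term carries the largest weight. A variant that lists terms by descending coefficient could look like the sketch below. `top_terms` is a hypothetical helper, and the tiny vocabulary and coefficients are invented, standing in for `vocabulary_` and one row of `coef_`.

```python
def top_terms(vocabulary, coefficients, k=3):
    """Return the k terms with the largest coefficients, largest first."""
    index_to_term = {index: term for term, index in vocabulary.items()}
    ranked = sorted(range(len(coefficients)), key=lambda i: coefficients[i], reverse=True)
    return [index_to_term[i] for i in ranked[:k]]


# Invented values for illustration only
vocabulary = {"bank": 0, "film": 1, "market": 2, "shares": 3, "the": 4}
business_coef = [0.8, -0.5, 0.6, 0.9, 0.01]
print(top_terms(vocabulary, business_coef, k=3))
```

Stop-words such as 'its', 'but', and 'we' appear in the lists above, which suggests the model was trained on text that still contained them.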
