Hi everyone!

In this notebook, I will demonstrate how to do automated text classification with PyCaret.
The dataset was collected from an Indonesian online newspaper, so the text is in Indonesian.
I only use minimal pre-processing and feature extraction, so that the reader can easily understand what's going on here.

Feel free to connect with me on LinkedIn. [**LinkedIn post of this notebook**](https://www.linkedin.com/posts/yevonnael-andrew-3351b9a7_automated-text-classification-using-pycaret-activity-6827505978343325696-AsEf)

**What is PyCaret?**

PyCaret is an open-source, **low-code machine learning library** in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for **seasoned data scientists** who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for **citizen data scientists** and **those new to data science** with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment. https://pycaret.org/guide/

In [None]:
%%capture
!pip install pycaret

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
#import data and preprocess
import pandas as pd
import re
import string
from string import punctuation
from nltk.corpus import stopwords
stop_words = stopwords.words('indonesian') #stopwords for Indonesian

#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

We have three separate data sources, and each source corresponds to one category. After loading all three datasets, we concatenate them into a single DataFrame.

In [None]:
bisnis_df = pd.read_csv('../input/iykra-odin/bisnis.csv', usecols=['content'])
bisnis_df['category'] = 'bisnis'
bisnis_df.head(2)

In [None]:
lifestyle_df = pd.read_csv('../input/iykra-odin/lifestyle.csv', usecols=['content'])
lifestyle_df['category'] = 'lifestyle'
lifestyle_df.head(2)

In [None]:
sport_df = pd.read_csv('../input/iykra-odin/sport.csv', usecols=['content'])
sport_df['category'] = 'sport'
sport_df.head(2)

In [None]:
df = pd.concat([bisnis_df, lifestyle_df, sport_df])
df.describe()

In [None]:
import warnings
warnings.filterwarnings('ignore')

## BoW

In [None]:
cv = CountVectorizer(lowercase = True, stop_words = stop_words, token_pattern="[A-Za-z]+")
BoW = cv.fit_transform(df['content'])
BoW_df = pd.DataFrame(BoW.toarray(), columns=cv.get_feature_names())
BoW_df['target_cat'] = df.reset_index().category.map({'bisnis':0, 'lifestyle':1, 'sport':2})
BoW_df
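To make the bag-of-words matrix above more concrete, here is a minimal sketch on three made-up sentences (they are not taken from the dataset). It uses the same `token_pattern` as the cell above and reads the column names from `vocabulary_`, so it does not depend on which feature-name accessor your scikit-learn version provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only (not from the dataset).
toy_docs = [
    "harga saham naik",
    "tim bola menang",
    "harga tiket bola naik naik",
]

toy_cv = CountVectorizer(lowercase=True, token_pattern="[A-Za-z]+")
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort by index to get column names.
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
toy_bow_df = pd.DataFrame(toy_counts.toarray(), columns=columns)
print(toy_bow_df)
```

Each row is one document and each column counts how often one token occurs in it, which is exactly the shape of `BoW_df` before the target column is added.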

Before running any machine learning task with PyCaret, we need to set up the environment. Because this is a classification task, we use the `setup` function from `pycaret.classification`. Besides the data, we specify the target variable, the training set size, and the number of cross-validation folds.

In [None]:
from pycaret.classification import *
# Store the result under a different name so the `setup` function is not overwritten.
clf_setup = setup(data=BoW_df, target='target_cat', session_id=123, train_size=0.7, fold=10, silent=True)
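As a rough picture of what `train_size=0.7` and `fold=10` ask for, the sketch below reproduces the idea with plain scikit-learn on made-up data: 70% of the rows are kept for training, and that training part is split into 10 stratified folds for cross-validation. This illustrates the concept under those assumptions; it is not PyCaret's internal code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Made-up data: 300 rows, 3 balanced classes (0, 1, 2), 5 random features.
rng = np.random.RandomState(123)
X = rng.rand(300, 5)
y = np.repeat([0, 1, 2], 100)

# Keep 70% of the rows for training and hold out the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123
)

# Split the training part into 10 folds; each fold is used once for validation.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X_train, y_train)]

print(len(X_train), len(X_test))
print(fold_sizes)
```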

After setting up the environment, we now compare the models.

In [None]:
models = compare_models()
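Conceptually, comparing models means scoring several candidate classifiers with the same cross-validation scheme and ranking them. The sketch below does this by hand with scikit-learn on synthetic data. The three estimators are my own choice for illustration, and the loop is only an approximation of the idea, not how PyCaret implements `compare_models`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the vectorized text.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=123,
)

# Candidate models, chosen for illustration only.
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(random_state=123),
    "nb": GaussianNB(),
}

# Mean 10-fold accuracy per model, sorted best first.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in candidates.items()
}
leaderboard = pd.Series(scores, name="accuracy").sort_values(ascending=False)
print(leaderboard)
```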

### Confusion Matrix (BoW)

In [None]:
lr = create_model('lr')
plot_model(lr, "confusion_matrix")

In [None]:
svm = create_model('svm')
plot_model(svm, "confusion_matrix")

In [None]:
nb = create_model('nb')
plot_model(nb, "confusion_matrix")
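For readers new to confusion matrices: rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal. Here is a minimal sketch with made-up labels, using this notebook's encoding (0 = bisnis, 1 = lifestyle, 2 = sport).

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total.
accuracy = cm.trace() / cm.sum()
print(round(accuracy, 3))
```

In this toy example, one bisnis article is misread as lifestyle and one sport article as bisnis; the off-diagonal cells show exactly which pairs of categories get confused.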

## TF-IDF

In [None]:
tv = TfidfVectorizer(lowercase = True, stop_words = stop_words, token_pattern="[A-Za-z]+")
tf_idf = tv.fit_transform(df['content'])
tf_idf_df = pd.DataFrame(tf_idf.toarray(), columns=tv.get_feature_names())
tf_idf_df['target_cat'] = df.reset_index().category.map({'bisnis':0, 'lifestyle':1, 'sport':2})
tf_idf_df
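The difference from bag-of-words is that TF-IDF down-weights tokens that occur in many documents. A small sketch on made-up sentences (again, not from the dataset): `naik` appears in every toy document, so within the first document it gets a lower weight than `saham`, which appears only there.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
toy_docs = [
    "harga saham naik",
    "tiket bola naik",
    "gaya hidup sehat naik",
]

toy_tv = TfidfVectorizer(lowercase=True, token_pattern="[A-Za-z]+")
toy_tfidf = toy_tv.fit_transform(toy_docs).toarray()

# Look up column indices through vocabulary_ (token -> column index).
naik_col = toy_tv.vocabulary_["naik"]
saham_col = toy_tv.vocabulary_["saham"]

# Both tokens occur once in document 0, but "naik" is common to all documents.
print("naik :", toy_tfidf[0, naik_col])
print("saham:", toy_tfidf[0, saham_col])
```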

In [None]:
from pycaret.classification import *
# Store the result under a different name so the `setup` function is not overwritten.
clf_setup = setup(data=tf_idf_df, target='target_cat', session_id=123, train_size=0.7, fold=10, silent=True)

In [None]:
models = compare_models()

### Confusion Matrix (TF-IDF)

In [None]:
lr = create_model('lr')
plot_model(lr, "confusion_matrix")

In [None]:
svm = create_model('svm')
plot_model(svm, "confusion_matrix")

In [None]:
nb = create_model('nb')
plot_model(nb, "confusion_matrix")

## [BONUS] Clustering by K-Means

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Keep only the arguments that affect the result. n_jobs and precompute_distances
# were deprecated in scikit-learn 0.23 and removed in later releases.
mod = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300,
             tol=0.0001, random_state=123)

In [None]:
res = mod.fit_transform(BoW_df.drop('target_cat', axis=1))
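Note that `fit_transform` on a `KMeans` model returns, for each row, its distance to every cluster centre, so with three clusters the result has three columns (which is what makes a 3D scatter plot possible). A minimal sketch on synthetic points, assuming three well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2D blobs of 20 points each.
rng = np.random.RandomState(123)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

toy_km = KMeans(n_clusters=3, n_init=10, random_state=123)
distances = toy_km.fit_transform(X)  # shape: (n_samples, n_clusters)

print(distances.shape)

# The nearest centre (smallest distance) is the cluster the point is assigned to.
nearest = distances.argmin(axis=1)
print((nearest == toy_km.predict(X)).all())
```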

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

In [None]:
BoW_df['col'] = BoW_df['target_cat'].map({0:'green', 1:'red', 2:'blue'})

In [None]:
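plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.set_zlim(0, 50)
ax.set_xlim(0, 50)
ax.set_ylim(65, 90)
# Each axis is the distance to one cluster centre; colours are the true categories
# (green = bisnis, red = lifestyle, blue = sport). No cmap is needed with named colours.
ax.scatter3D(res[:,1], res[:,2], res[:,0], c=BoW_df['col'])
ax.view_init(10, 150) #rotate view
plt.draw()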

## Prediction

In [None]:
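# We choose Naive Bayes because it has the highest accuracy.
# The active setup is the TF-IDF one, so this model is trained on the TF-IDF features.
nb = create_model('nb')
nb_final = finalize_model(nb)  # refit on the full dataset, including the hold-out set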

The table above shows the cross-validation metrics for each of the 10 folds.

In [None]:
text_sport = ['TRIBUNNEWS.COM, JAKARTA – Pebulutangkis asal China Taipei, Lee Yang/Wang Chi-Lin akhirnya mampu mengalahkan Kevin Sanjaya Sukamuljo/Marcus Fernaldi Gideon pada laga pamungkas grup A Olimpiade Tokyo 2020. Pada laga tersebut, Lee/Wang menaklukkan Kevin/Marcus melalui drama rubber game 18-21, 21-15 dan 17-21. Hasil ini membuat Lee/Wang bangga lantaran tak pernah menang dari Kevin/Marcus dalam tiga pertemuan terakhir. Seusai laga, Wang Chi Lin mengatakan di laga tadi, dirinya bersama Lee Yang memang mengubah strategi yang awalnya bertahan menjadi menyerang. Bahkan, ia mengatakan penampilan ini dinilainya jadi penampilan terbaiknya." "Kami tidak memiliki tekanan dan jadi kami bermain lebih baik dari kemarin. Kami kalah di pertandingan pertama karena kami bermain terlalu bertahan,” kata Wang Chi Lin, Selasa (27/7/2021). “Jadi kami mencoba menikmati pertandingan hari ini dan menyerang. Sejauh ini, ini penampilan terbaik kami selama kompetisi,” sambungnya. Meski sukses mengalahkan The Minions, Lee/Wang yang sama-sama mengemas dua kali kemenangan harus puas lolos sebagai runner-up grup A. Sementara The Minions tetap keluar sebagai juara grup lantaran mengemas angka yang lebih baik dari Lee/Wang dan Shetty/Reddy. Kevin/Marcus meraih kemenangan dua game langsung saat menghadapi Shetty/Reddy dan Lane Vendy serta terakhir kalah 2-1 dari Lee/Wang.']

In [None]:
text_transformed = tv.transform(text_sport)
text_transformed_df = pd.DataFrame(text_transformed.toarray(), columns=tv.get_feature_names())
prediction = predict_model(nb_final, text_transformed_df)
prediction[['Label', 'Score']] # 'bisnis':0, 'lifestyle':1, 'sport':2

In [None]:
text_bisnis =['TRIBUNNEWS.COM, JAKARTA - Perusahaan layanan angkutan bus Damri melakukan penyesuaian jam operasional selama masa Pemberlakuan Pengetatan Kegiatan Masyarakat (PPKM) Level 4. Corporate Secretary Damri Sidik Pramono mengatakan, Damri melakukan penyesuaian jam operasional armada menuju Bandara mulai pukul 02.00 – 18.00 WIB, sedangkan dari dalam Bandara mulai pukul 07.00 – 21.00 WIB. "Selain itu, kami juga memperketat pembatasan jumlah penumpang dengan kapasitas angkut hanya 50 persen," ucap Sidik, Selasa (27/7/2021). Sidik juga mengungkapkan, penumpang Damri yang melakukan perjalanan di Pulau Jawa dan Bali wajib menunjukkan kartu vaksin dosis pertama dan surat keterangan hasil negatif tes RT-PCR yang diambil dalam kurun waktu maksimal 2x24 jam sebelum keberangkatan, atau hasil negatif rapid test antigen yang diambil dalam kurun waktu maksimal 1x24 jam sebelum keberangkatan. "Kemudian untuk yang bekerja di sektor formal diimbau untuk membawa Surat Tanda Registrasi Pekerja (STRP) atau Surat Tugas/Keperluan dari pimpinan Perusahaan," ucap Sidik. Kebijakan ini, lanjut Sidik, tentunya mengacu pada Surat Edaran Kementerian Perhubungan Nomor 54 Tahun 2021 tentang Perubahan Kedua Atas SE Menteri Perhubungan Nomor 42 Tahun 2021, serta Surat Edaran Satgas Covid-19 Nomor 15 Tahun 2021. "Kami mengimbau kepada masyarakat yang masih harus keluar rumah dan menggunakan transportasi publik agar tetap mematuhi protokol kesehatan dengan mengutamakan kesehatan dan keselamatan bersama," ujar Sidik.']

In [None]:
text_transformed = tv.transform(text_bisnis)
text_transformed_df = pd.DataFrame(text_transformed.toarray(), columns=tv.get_feature_names())
prediction = predict_model(nb_final, text_transformed_df)
prediction[['Label', 'Score']] # 'bisnis':0, 'lifestyle':1, 'sport':2

We can see that our model successfully predicts the category of our unseen data with full confidence.
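Since the `Label` column holds the numeric codes, it can be handy to map them back to category names. The sketch below assumes the encoding used earlier in this notebook; `toy_prediction` is a hypothetical stand-in for the output of `predict_model`, and its values are placeholders rather than real results.

```python
import pandas as pd

# Encoding used when building target_cat in this notebook.
category_to_id = {'bisnis': 0, 'lifestyle': 1, 'sport': 2}
id_to_category = {idx: name for name, idx in category_to_id.items()}

# Hypothetical stand-in for the predict_model output; values are placeholders.
toy_prediction = pd.DataFrame({'Label': [2, 0], 'Score': [0.9, 0.8]})

# Cast to int in case the labels come back as strings, then map to names.
toy_prediction['category'] = toy_prediction['Label'].astype(int).map(id_to_category)
print(toy_prediction)
```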

## Conclusion

In this notebook, we demonstrated how easy it is to create classification models with PyCaret. With only a few lines of code, we can compare a number of models for text classification, along with their metrics.

PyCaret's philosophy is “low-code”: it aims to make machine learning more accessible to a wider audience.

We encourage readers to dive deeper into the other parameters available in PyCaret. Please consult the official PyCaret documentation.