# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [420]:
import pandas as pd
import numpy as np

np.random.seed(1)  # seed NumPy's global random generator (assigning to np.random_seed has no effect)

### Load data

In [421]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [422]:
news.head(5)

| | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


### Check for missing values

In [423]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [424]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [425]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [426]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
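The encoding step can be checked with a round trip. The sketch below uses a short, made-up list of labels (not the real `newsgroup` column) to show that `LabelEncoder` assigns codes in alphabetical order of the class names and that `inverse_transform` recovers the original strings, which is handy when reading predictions later.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; the tutorial's real labels live in news['newsgroup'].
labels = ["graphics", "hockey", "medical", "hockey"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)        # classes are sorted, so graphics=0, hockey=1, medical=2
decoded = le_demo.inverse_transform(codes)   # map the integer codes back to the class names

print(list(le_demo.classes_))
print(list(codes))
print(list(decoded))
```

The same `inverse_transform` call can be applied to a model's integer predictions to get readable category names.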

## Split the data

In [427]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [428]:
X_train.shape, y_train.shape

((417,), (417,))

In [429]:
X_test.shape, y_test.shape

((180,), (180,))

In [430]:
X_train.head(5)

347    In article < 93106.082502ACPS6992@RyeVm.Ryerso...
287    Article-I.D.: blue.7995 References: < C4zCII.F...
291    Article-I.D.: LMSC5.93096.46336.J056600 Sender...
498    In article < 1p7ciqINN3th@tamsun.tamu.edu> cov...
337    OK, I'll join in the fun and give my playoff p...
Name: TEXT, dtype: object

In [431]:
y_train[:5]

array([1, 1, 1, 2, 1])
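The split above is random, so the rows that land in each set change between runs. A minimal sketch of a reproducible, class-balanced split follows. It uses toy arrays rather than the newsgroup data, and the `random_state=1` and `stratify` arguments are additions for illustration, not settings used in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 30 "posts", 10 per class.
X_demo = np.array([f"post {i}" for i in range(30)])
y_demo = np.repeat([0, 1, 2], 10)

# random_state fixes the shuffle; stratify keeps class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training split
print(np.bincount(y_te))  # class counts in the test split
```

With a fixed `random_state`, rerunning the cell returns the same rows, which makes accuracy figures comparable across experiments.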

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning beyond what `TfidfVectorizer` is configured to do below (lowercasing, tokenization, and stop-word removal). The raw text goes straight into the vectorizer. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [432]:
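# TfidfVectorizer handles lowercasing, tokenization, and stop-word filtering
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string so the regex escapes (\W, \d) are not read as string escapes; the pattern keeps alphabetic tokens only
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")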

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [433]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be different and would not match the train set.


X_test = tfidf_vect.transform(X_test)


In [434]:
X_train.shape, X_test.shape

((417, 9852), (180, 9852))
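The difference between `fit_transform` and `transform` is easier to see on a tiny example. The sketch below uses made-up sentences (not the newsgroup posts): the vocabulary is learned from the training documents only, so a word that first appears in the test document is simply ignored and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only.
train_docs = ["the goalie stopped the puck", "the doctor wrote a diagnosis"]
test_docs = ["the goalie saw a doctor about a sprained wrist"]

vect = TfidfVectorizer(stop_words="english", token_pattern=r"[^\W\d_]+")
train_mat = vect.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_mat = vect.transform(test_docs)        # reuses them; unseen words such as "wrist" are dropped

print(sorted(vect.vocabulary_))
print(train_mat.shape, test_mat.shape)
```

Calling `fit_transform` on the test documents instead would build a second, different vocabulary, and the column positions would no longer line up with what the model saw during training.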

In [435]:
# The transformed data sets are sparse matrices, so displaying one shows only a summary
X_train

<417x9852 sparse matrix of type '<class 'numpy.float64'>'
	with 29722 stored elements in Compressed Sparse Row format>

In [436]:
# Convert the sparse matrix to a dense array with toarray() to see the values (mostly zeros)
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
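The dense view is almost entirely zeros, which is why the sparse format is used. As a rough sketch, the fraction of stored (non-zero) entries can be computed from `nnz` and `shape`. The small matrix here is a toy example, and the second calculation plugs in the numbers reported in the output above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sparse matrix: 2 rows, 4 columns, 2 non-zero values.
demo = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 0.8]]))
demo_density = demo.nnz / (demo.shape[0] * demo.shape[1])
print(demo.nnz, demo.shape, demo_density)

# Same calculation with the figures reported for X_train above.
rows, cols, stored = 417, 9852, 29722
reported_density = stored / (rows * cols)
print(f"{reported_density:.2%} of the TF-IDF entries are non-zero")
```

Less than one percent of the entries are stored, so converting large TF-IDF matrices with `toarray()` wastes memory and is best reserved for inspection.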

## Latent Semantic Analysis (Singular Value Decomposition)

# When n_components = 500

In [437]:
from sklearn.decomposition import TruncatedSVD

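# n_components is the number of latent components ("topics") to keep; it should be less than the number of features.
# Only 417 training documents are available, so the result below has 417 columns rather than the 500 requested.
svd = TruncatedSVD(n_components=500, n_iter=10)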

X_train_5= svd.fit_transform(X_train)
X_test_5 = svd.transform(X_test)


In [438]:
X_train_5.shape, X_test_5.shape

((417, 417), (180, 417))
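Note that the output above has 417 columns, not 500: the reduced matrix cannot have more components than there are training documents. One way to judge how many components are worth keeping is the share of variance they explain. The sketch below runs on a random toy matrix (an assumption for illustration, not the TF-IDF data) and sets `random_state`, which the tutorial cells do not.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy non-negative matrix standing in for a TF-IDF matrix: 40 documents, 200 features.
rng = np.random.default_rng(0)
X_demo = rng.random((40, 200))

svd_demo = TruncatedSVD(n_components=10, n_iter=10, random_state=0)
X_reduced = svd_demo.fit_transform(X_demo)

print(X_reduced.shape)                              # (40, 10)
print(svd_demo.explained_variance_ratio_.sum())     # share of variance kept by the 10 components
```

On the real data, the same attribute can be inspected after fitting to compare how much extra variance 300 or 500 components retain over 100.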

## Random Forest

In [439]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train_5, y_train)

### Evaluating Model Performance

In [440]:
from sklearn.metrics import accuracy_score

In [441]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train_5)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9856


In [442]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test_5)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.8278


In [443]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[41,  1, 16],
       [ 1, 60,  9],
       [ 3,  1, 48]], dtype=int64)
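Accuracy alone hides which classes are confused. As a small worked example, the sketch below takes the confusion matrix printed above (rows are true classes, columns are predicted classes, in the order graphics, hockey, medical) and derives accuracy plus per-class recall and precision with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[41,  1, 16],
               [ 1, 60,  9],
               [ 3,  1, 48]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)        # per true class: how many were found
precision = np.diag(cm) / cm.sum(axis=0)     # per predicted class: how many were right

print(f"accuracy:  {accuracy:.4f}")
print("recall:   ", np.round(recall, 4))
print("precision:", np.round(precision, 4))
```

The accuracy matches the 0.8278 reported earlier. The low precision for the third class (medical) reflects the 16 graphics posts and 9 hockey posts that the random forest labeled as medical.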

## Stochastic Gradient Descent Classifier

In [444]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train_5, y_train)

### Evaluating Model Performance

In [445]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train_5)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9952


In [446]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test_5)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9222


In [447]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[58,  0,  0],
       [ 3, 64,  3],
       [ 8,  0, 44]], dtype=int64)

# When n_components = 100

# Latent Semantic Analysis (Singular Value Decomposition)


In [448]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_1= svd.fit_transform(X_train)
X_test_1 = svd.transform(X_test)


In [449]:
X_train_1.shape, X_test_1.shape

((417, 100), (180, 100))

# Random Forest

In [450]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train_1, y_train)

# Evaluating Model Performance

In [451]:
from sklearn.metrics import accuracy_score

In [452]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train_1)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9784


In [453]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test_1)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.8333


In [454]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[49,  0,  9],
       [ 3, 54, 13],
       [ 5,  0, 47]], dtype=int64)

# Stochastic Gradient Descent Classifier

In [455]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train_1, y_train)

# Evaluating Model Performance

In [456]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train_1)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9928


In [457]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test_1)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9333


In [458]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[57,  0,  1],
       [ 5, 64,  1],
       [ 5,  0, 47]], dtype=int64)

# When n_components = 300

# Latent Semantic Analysis (Singular Value Decomposition)


In [459]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_3= svd.fit_transform(X_train)
X_test_3 = svd.transform(X_test)


In [460]:
X_train_3.shape, X_test_3.shape

((417, 300), (180, 300))

# Random Forest

In [461]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train_3, y_train)

# Evaluating Model Performance

In [462]:
from sklearn.metrics import accuracy_score

In [463]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train_3)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9880


In [464]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test_3)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.7778


In [465]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[41,  1, 16],
       [ 1, 53, 16],
       [ 6,  0, 46]], dtype=int64)

# Stochastic Gradient Descent Classifier

In [466]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train_3, y_train)

# Evaluating Model Performance

In [467]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train_3)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9952


In [468]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test_3)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.8444


In [469]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[48,  0, 10],
       [ 0, 53, 17],
       [ 1,  0, 51]], dtype=int64)
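The three experiments above repeat the same steps with a different `n_components` value. A minimal sketch of the same comparison as a loop is shown below. It is not the tutorial's code: the corpus is a hypothetical toy one built from three small word pools, the component counts are scaled down to fit its tiny vocabulary, and every `random_state` is an added assumption so the numbers are repeatable.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: each class draws its words from its own pool.
pools = {
    0: "pixel render shader image polygon texture vector bitmap".split(),
    1: "goal puck skate team playoff goalie rink season".split(),
    2: "doctor patient dose symptom clinic disease therapy drug".split(),
}
rng = random.Random(1)
docs, labels = [], []
for label, words in pools.items():
    for _ in range(40):
        docs.append(" ".join(rng.choices(words, k=6)))
        labels.append(label)

docs_tr, docs_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.3, random_state=1, stratify=labels
)

results = {}
for k in (2, 5, 10):  # scaled-down stand-ins for 100, 300, 500
    # The pipeline fits TF-IDF and SVD on the training text only, then reuses them for the test text.
    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=k, n_iter=10, random_state=1),
        SGDClassifier(max_iter=100, random_state=1),
    )
    model.fit(docs_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(docs_tr))
    test_acc = accuracy_score(y_te, model.predict(docs_te))
    results[k] = (train_acc, test_acc)
    print(f"n_components={k}: train acc {train_acc:.4f}, test acc {test_acc:.4f}")
```

Wrapping the steps in a pipeline also removes the risk of accidentally calling `fit_transform` on the test data.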

# Analysis

The accuracies reported in the cells above are summarized below. Exact values change from run to run unless every random step (the split, the SVD, and both classifiers) is seeded.

| n_components | Model | Train accuracy | Test accuracy |
|---|---|---|---|
| 100 | Random Forest | 0.9784 | 0.8333 |
| 100 | SGD classifier | 0.9928 | 0.9333 |
| 300 | Random Forest | 0.9880 | 0.7778 |
| 300 | SGD classifier | 0.9952 | 0.8444 |
| 500 (417 components actually produced) | Random Forest | 0.9856 | 0.8278 |
| 500 (417 components actually produced) | SGD classifier | 0.9952 | 0.9222 |

For Random Forest, training accuracy stays at about 0.98 for every setting while test accuracy ranges from 0.78 to 0.83. Changing n_components does not help much, and the gap of 15 to 21 points between train and test accuracy indicates overfitting.

The Stochastic Gradient Descent classifier beats Random Forest on the test set at every setting (0.84 to 0.93). Its train/test gap is smaller, 6 to 15 points, so it overfits less than Random Forest, although some gap remains.

As we increase n_components from 100 to 300 to 500, training accuracy stays flat or rises slightly, but test accuracy does not improve: in this run both models score best on the test set with 100 components. With only 180 test posts, one post is worth about half a point of accuracy, so small differences such as 0.8333 versus 0.8278 are not meaningful. Overall, increasing n_components does not necessarily result in better generalization performance on unseen data.

SVD is useful for reducing the dimensionality of high-dimensional datasets and improving the efficiency of our models: here 100 components replace 9,852 TF-IDF features without lowering test accuracy relative to the larger settings.