# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np

# Seed NumPy's global random generator so that results are reproducible
np.random.seed(1)

### Load data

In [2]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. For each post, we will predict one of three categories: "graphics," "hockey," or "medical."

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

# Encode the string labels as integers (classes are sorted alphabetically)
le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
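Once the labels are encoded as integers, predictions also come back as integers. A minimal sketch of mapping them back to category names with `inverse_transform` follows; the `labels` list is a toy stand-in, not the tutorial's data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical), standing in for news['newsgroup']
labels = ["graphics", "hockey", "medical", "hockey"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # classes are sorted alphabetically
decoded = encoder.inverse_transform(encoded)   # map integers back to names

print(encoded)        # [0 1 2 1]
print(list(decoded))
```

The same call can be applied to the output of a model's `predict` method to report readable category names.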

## Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape, y_train.shape

((417,), (417,))

In [10]:
X_test.shape, y_test.shape

((180,), (180,))

In [11]:
X_train.head(5)

521    I am posting to this group in hopes of finding...
360    Smythe Division --------------- Vancouver vs. ...
585    In article < 1993Apr14.122647.16364@tms390.mic...
182    Hello, I purchased my new 486 with a NoName gr...
64     In article < 734553308snx@rjck.UUCP> rob@rjck....
Name: TEXT, dtype: object

In [12]:
y_train[:5]

array([2, 1, 2, 0, 0])
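The split above does not fix a seed or stratify by class, so the exact rows in each set can change from run to run. A minimal sketch of a reproducible, stratified split is shown below; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 30 posts in 3 balanced classes
X_toy = np.array([f"post {i}" for i in range(30)])
y_toy = np.repeat([0, 1, 2], 10)

# random_state fixes the shuffle; stratify keeps class proportions equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Stratifying matters most when classes are imbalanced or the data set is small, as it is here with 597 posts.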

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning. The only preprocessing is what `TfidfVectorizer` does itself: lowercasing, tokenizing, and removing English stop words. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [13]:
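# TfidfVectorizer handles lowercasing, tokenization, and stop-word filtering
from sklearn.feature_extraction.text import TfidfVectorizer

# The raw string avoids invalid-escape warnings; the pattern keeps runs of letters only
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

# Learn the vocabulary and IDF weights from the training data only
X_train = tfidf_vect.fit_transform(X_train)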

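**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we must use `transform` only, so that the vocabulary and IDF weights learned from the training data are reused. This keeps the number and order of columns (features) the same across both data sets. Fitting again on the test data would produce a different vocabulary, and a model trained on the training columns could not be applied to it.**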

In [14]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be different and would not match the train set.

X_test = tfidf_vect.transform(X_test)


In [15]:
X_train.shape, X_test.shape

((417, 10476), (180, 10476))
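The fit/transform distinction can be seen on a toy corpus. The sketch below uses two made-up training sentences and one made-up test sentence with default `TfidfVectorizer` settings (not the settings used above); words that appear only in the test sentence are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration only
train_docs = ["the goalie made a save", "the doctor made a diagnosis"]
test_docs = ["the goalie saw a doctor about a sprain"]

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vect.transform(test_docs)        # reuses them; unseen words are dropped

print(train_matrix.shape, test_matrix.shape)   # same number of columns
print("sprain" in vect.vocabulary_)            # word seen only in the test sentence
```

Because both matrices share the same columns, a model fitted on `train_matrix` can score `test_matrix` directly.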

In [16]:
# X_train is a SciPy sparse matrix, so displaying it shows only a summary
X_train

<417x10476 sparse matrix of type '<class 'numpy.float64'>'
	with 32061 stored elements in Compressed Sparse Row format>

In [17]:
# Convert to a dense array with toarray() to see the values (almost all are zero)
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])


## Latent Semantic Analysis (Singular Value Decomposition)

In [18]:
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score



# n_components = 100
dec_100 = TruncatedSVD(n_components=100, n_iter=10)
X_train_dec_100 = dec_100.fit_transform(X_train)
X_test_dec_100 = dec_100.transform(X_test)

ls_model_100 = RandomForestClassifier()
ls_model_100.fit(X_train_dec_100, y_train)

ls_train_acc_100 = accuracy_score(y_train, ls_model_100.predict(X_train_dec_100))
ls_test_acc_100 = accuracy_score(y_test, ls_model_100.predict(X_test_dec_100))



In [19]:
# n_components = 300
dec_300 = TruncatedSVD(n_components=300, n_iter=10)
X_train_dec_300 = dec_300.fit_transform(X_train)
X_test_dec_300 = dec_300.transform(X_test)

ls_model_300 = RandomForestClassifier()
ls_model_300.fit(X_train_dec_300, y_train)

ls_train_acc_300 = accuracy_score(y_train, ls_model_300.predict(X_train_dec_300))
ls_test_acc_300 = accuracy_score(y_test, ls_model_300.predict(X_test_dec_300))



In [20]:
# n_components = 500
dec_500 = TruncatedSVD(n_components=500, n_iter=10)
X_train_dec_500 = dec_500.fit_transform(X_train)
X_test_dec_500 = dec_500.transform(X_test)

ls_model_500 = RandomForestClassifier()
ls_model_500.fit(X_train_dec_500, y_train)

ls_train_acc_500 = accuracy_score(y_train, ls_model_500.predict(X_train_dec_500))
ls_test_acc_500 = accuracy_score(y_test, ls_model_500.predict(X_test_dec_500))
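The three cells above differ only in `n_components`, so the same steps could be wrapped in a helper and run in a loop. The sketch below is one way to do that, not the tutorial's own code: the toy corpus, the helper name `lsa_accuracy`, the small component counts, and the fixed `random_state` values are all assumptions made so the example is self-contained.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for news.csv
rng = random.Random(0)
topics = {
    0: ["image", "pixel", "render", "polygon", "shader", "bitmap"],
    1: ["goalie", "puck", "playoff", "rink", "penalty", "season"],
    2: ["doctor", "patient", "symptom", "dose", "clinic", "therapy"],
}
docs, labels = [], []
for label, words in topics.items():
    for _ in range(40):
        docs.append(" ".join(rng.choices(words, k=8)))
        labels.append(label)

docs_tr, docs_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.3, random_state=1, stratify=labels
)
vect = TfidfVectorizer()
X_tr = vect.fit_transform(docs_tr)
X_te = vect.transform(docs_te)


def lsa_accuracy(n_components):
    """Fit SVD and a random forest on train; return (train acc, test acc)."""
    svd = TruncatedSVD(n_components=n_components, n_iter=10, random_state=42)
    Z_tr = svd.fit_transform(X_tr)   # fit the decomposition on train only
    Z_te = svd.transform(X_te)       # project test with the same components
    forest = RandomForestClassifier(random_state=42).fit(Z_tr, y_tr)
    return (
        accuracy_score(y_tr, forest.predict(Z_tr)),
        accuracy_score(y_te, forest.predict(Z_te)),
    )


# Small component counts because the toy vocabulary has only 18 terms
lsa_results = {k: lsa_accuracy(k) for k in (2, 4, 8)}
for k, (train_acc, test_acc) in lsa_results.items():
    print(f"n_components={k}: train={train_acc:.3f}, test={test_acc:.3f}")
```

With the real data, the loop would run over `(100, 300, 500)` and use the `X_train` and `X_test` matrices built earlier.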



In [21]:
from sklearn.linear_model import SGDClassifier


# Random Forest on the full TF-IDF features (no SVD)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

rf_train_acc = accuracy_score(y_train, rf_model.predict(X_train))
rf_test_acc = accuracy_score(y_test, rf_model.predict(X_test))

# SGD
sgd_model = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_model.fit(X_train, y_train)

sgd_train_acc = accuracy_score(y_train, sgd_model.predict(X_train))
sgd_test_acc = accuracy_score(y_test, sgd_model.predict(X_test))

data = {'Model': ['LSA (n_components = 100)', 'LSA (n_components = 300)', 'LSA (n_components = 500)', 'Random Forest', 'SGD'], 
        'Train Accuracy': [ls_train_acc_100, ls_train_acc_300, ls_train_acc_500, rf_train_acc, sgd_train_acc], 
        'Test Accuracy': [ls_test_acc_100, ls_test_acc_300, ls_test_acc_500, rf_test_acc, sgd_test_acc]}

results_df = pd.DataFrame(data)
print(results_df)

                      Model  Train Accuracy  Test Accuracy
0  LSA (n_components = 100)        1.000000       0.861111
1  LSA (n_components = 300)        1.000000       0.794444
2  LSA (n_components = 500)        1.000000       0.761111
3             Random Forest        0.894484       0.777778
4                       SGD        1.000000       0.950000


#### The value of `n_components` in `TruncatedSVD` had a noticeable impact on the LSA models. Training accuracy was 1.00 for all three settings, but test accuracy fell from 0.861 (100 components) to 0.794 (300) and 0.761 (500). A likely explanation is overfitting: the additional components mostly capture noise specific to the 417 training posts, which gives the random forest more features to fit without helping it generalize to new, unseen data.

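#### The Random Forest model trained directly on the TF-IDF features had lower training accuracy (0.894) than the LSA models (1.00), and its test accuracy (0.778) fell within the range of the LSA models (0.761 to 0.861). The lower training accuracy is expected, because this forest is limited to `max_depth=5` while the forests used in the LSA models keep the default, unlimited depth. Limiting the depth reduced the fit to the training data, but on this split it did not improve test accuracy over LSA with 100 components.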

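#### The SGD model had the highest test accuracy (0.950) along with perfect training accuracy. Perfect training accuracy alone does not indicate harmful overfitting: with 10,476 features and only 417 training posts, a linear model can usually separate the training data exactly. The gap between training and test accuracy is the smallest of the five models, which indicates that the model generalizes well to new posts.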
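A single train/test split can be lucky or unlucky, so one way to check whether a high test accuracy holds up is k-fold cross-validation. The sketch below is illustrative only: it uses a made-up corpus with some shared noise words, and wraps the vectorizer and classifier in a pipeline so that the vocabulary is refitted inside each fold.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: topic words plus noise words shared by all classes
rng = random.Random(1)
topic_words = {
    "graphics": ["image", "pixel", "render", "polygon"],
    "hockey": ["goalie", "puck", "playoff", "rink"],
    "medical": ["doctor", "patient", "symptom", "dose"],
}
noise_words = ["question", "thanks", "article", "writes", "email"]
docs, labels = [], []
for label, words in topic_words.items():
    for _ in range(30):
        tokens = rng.choices(words, k=5) + rng.choices(noise_words, k=3)
        docs.append(" ".join(tokens))
        labels.append(label)

# The pipeline refits TF-IDF on each training fold, avoiding leakage
sgd_pipeline = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=42),
)
cv_scores = cross_val_score(sgd_pipeline, docs, labels, cv=5)

print(f"mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

A small standard deviation across folds is evidence that the accuracy does not depend on one particular split.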




#### When deciding whether to use SVD in our analysis, we should weigh dimensionality reduction against predictive performance. SVD can reduce the dimensionality of large, sparse datasets and make subsequent modeling more efficient. In this tutorial, however, adding components lowered test accuracy, so the number of components should be chosen by evaluating on held-out data rather than assumed. In some cases, a model trained directly on the TF-IDF features may be more appropriate, especially if the dataset is not particularly large: here the SGD model on the full feature set outperformed every LSA model.
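One common way to choose the number of components is a cross-validated grid search over a pipeline. The sketch below shows the idea under stated assumptions: the corpus is made up, the step names (`tfidf`, `svd`, `forest`) are arbitrary, and the candidate values are small only because the toy vocabulary is small.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the newsgroup posts
rng = random.Random(2)
topic_words = {
    "graphics": ["image", "pixel", "render", "polygon", "shader", "bitmap"],
    "hockey": ["goalie", "puck", "playoff", "rink", "penalty", "season"],
    "medical": ["doctor", "patient", "symptom", "dose", "clinic", "therapy"],
}
docs, labels = [], []
for label, words in topic_words.items():
    for _ in range(30):
        docs.append(" ".join(rng.choices(words, k=8)))
        labels.append(label)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_iter=10, random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each candidate is scored by 3-fold cross-validation on the training data
search = GridSearchCV(pipeline, {"svd__n_components": [2, 4, 8]}, cv=3)
search.fit(docs, labels)

print(search.best_params_, round(search.best_score_, 3))
```

With this approach the test set is used only once, for the final check of the selected model, so it does not influence the choice of `n_components`.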