# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.



### Import common packages

In [108]:
import pandas as pd
import numpy as np

# Seed NumPy's global random number generator for reproducibility
np.random.seed(1)

### Load data

In [109]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [110]:
news.head(5)

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


### Check for missing values

In [111]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [112]:
X = news['TEXT']

This is a multi-class classification problem. We will predict which of three categories a post belongs to:<br>
"graphics," "hockey," or "medical."

In [113]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [114]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
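
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why "graphics" becomes 0, "hockey" 1, and "medical" 2. The following is a minimal sketch on a small hypothetical list of labels (not the `news.csv` data) showing the encoding and how `inverse_transform` maps predictions back to readable names. The name `le_demo` is introduced here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, deliberately out of alphabetical order
labels = ["medical", "graphics", "hockey", "graphics"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# Classes are stored in sorted order; the position is the integer code
print(list(le_demo.classes_))
print(codes.tolist())

# Map integer codes (for example, model predictions) back to class names
decoded = list(le_demo.inverse_transform([2, 0]))
print(decoded)
```

Keeping the fitted encoder around makes it possible to report predictions and confusion-matrix rows by name rather than by number.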

## Split the data

In [115]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [116]:
X_train.shape, y_train.shape

((417,), (417,))

In [117]:
X_test.shape, y_test.shape

((180,), (180,))
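
The `train_test_split` call above passes neither `random_state` nor `stratify`, so the split depends on the global random state and the class balance of each set can vary. One possible way to make the split repeatable and keep class proportions equal is sketched below on hypothetical stand-in data (30 made-up posts, 10 per class). The names `X_demo`, `y_demo`, and `split` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 30 posts, 10 per class (not the news.csv data)
X_demo = np.array([f"post {i}" for i in range(30)])
y_demo = np.repeat([0, 1, 2], 10)

def split(seed):
    # random_state fixes the shuffle; stratify keeps class proportions equal
    return train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed, stratify=y_demo
    )

X_tr, X_te, y_tr, y_te = split(seed=1)
X_tr2, X_te2, y_tr2, y_te2 = split(seed=1)

print((X_tr == X_tr2).all())                 # same seed, same split
print(np.bincount(y_tr), np.bincount(y_te))  # class counts in train and test
```

With a fixed seed, accuracy figures reported later in a notebook can be reproduced on a rerun.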

In [118]:
X_train.head(5)

546    In article < 1qk92lINNl55@im4u.cs.utexas.edu> ...
356    Derian Hatcher's game-misconduct penalty was r...
248    What about his rectum? -- GO SKINS! ||" Now fo...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
570    noring@netcom.com (Jon Noring) writes: Recentl...
Name: TEXT, dtype: object

In [119]:
y_train[:5]

array([2, 1, 1, 2, 2])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning or preprocessing. We will pass the raw text to `TfidfVectorizer` and rely only on its built-in lowercasing, tokenization, and stop-word filtering. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [120]:
# TfidfVectorizer includes pre-processing, tokenization, and stop-word filtering
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string so that \W and \d reach the regex engine unchanged; the pattern keeps letter-only tokens
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice that the previous step calls `fit_transform` on the TRAIN data. For the TEST data, we must call `transform` only, so that the vocabulary learned from the training set is reused and both data sets have the same columns (features). Refitting on the test set would produce a different set of columns, and a model trained on the training features could not be applied to it!**
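
To make the difference concrete, here is a minimal sketch on two tiny hypothetical corpora (not the `news.csv` data), using a default `TfidfVectorizer` rather than the settings from this tutorial. It shows that `transform` keeps the training columns, that a word unseen during fitting is dropped, and that refitting on the test text yields a different number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, for illustration only
train_docs = ["the goalie made a save", "the doctor prescribed a treatment"]
test_docs = ["the injured goalie saw a doctor"]

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vect.transform(test_docs)        # reuses the training vocabulary

# Same number of columns, because the vocabulary was fixed on the training data
print(train_matrix.shape, test_matrix.shape)

# A word never seen during fit is not in the vocabulary and is ignored
print("injured" in vect.vocabulary_)

# Refitting on the test data builds a different, incompatible feature space
refit_matrix = TfidfVectorizer().fit_transform(test_docs)
print(refit_matrix.shape)
```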

In [121]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data
# features would be very different and would not match the train set!

X_test = tfidf_vect.transform(X_test)


In [122]:
X_train.shape, X_test.shape

((417, 10031), (180, 10031))

In [123]:
# The transformed data sets are sparse matrices; displaying one shows only a summary, not the values
X_train

<417x10031 sparse matrix of type '<class 'numpy.float64'>'
	with 30716 stored elements in Compressed Sparse Row format>

In [124]:
# To see the actual values, convert the sparse matrix to a dense array with toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
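
The dense view is almost entirely zeros: 30716 stored elements out of 417 × 10031 cells is roughly 0.7% of the matrix. Instead of converting to a dense array, the non-zero entries of a single row can be read directly from the CSR structure. The sketch below does this on a hypothetical two-document corpus with a default `TfidfVectorizer`; the names `docs`, `matrix`, and `index_to_term` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, not taken from news.csv
docs = ["hockey game tonight", "graphics card for image rendering"]
vect = TfidfVectorizer()
matrix = vect.fit_transform(docs)

# Map column index -> term using the fitted vocabulary
index_to_term = {idx: term for term, idx in vect.vocabulary_.items()}

# In CSR format, indptr[i]:indptr[i + 1] delimits the stored entries of row i
start, end = matrix.indptr[0], matrix.indptr[1]
columns = matrix.indices[start:end]
terms = sorted(index_to_term[j] for j in columns)

print(terms)                     # terms with non-zero weight in the first document
print(matrix.nnz, matrix.shape)  # stored entries versus full matrix size
```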

#### Testing the accuracies of the two models with three different n_components values, using a for loop instead of executing the blocks separately for each value

In [127]:
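from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Define the n_components values to be tested as a list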
n_components_values = [100, 300, 500]

for n_components in n_components_values:
    # Initialize SVD with the current n_components value from the list
    svd = TruncatedSVD(n_components=n_components, n_iter=10)
    
    # Fit the SVD on the training data only, then apply the same projection to the test data
    X_train_1 = svd.fit_transform(X_train)
    X_test_1 = svd.transform(X_test)
    
    # Using the transformed data, train and test a random forest classifier.
    rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
    _ = rnd_clf.fit(X_train_1, y_train)
    
    # Print the train and test accuracies for the random forest classifier model
    y_pred_train = rnd_clf.predict(X_train_1)
    acc_train = accuracy_score(y_train, y_pred_train)
    print(f"Random Forest Classifier model")
    print(f"n_components = {n_components}")
    print(f"Train accuracy for Random forest classifier: {acc_train:.4f}")
    
    y_pred_test = rnd_clf.predict(X_test_1)
    acc_test = accuracy_score(y_test, y_pred_test)
    print(f"Test accuracy for Random forest classifier: {acc_test:.4f}")
    
    # printing the confusion matrix for the random forest classifier model
    print(f"Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_test))
    
    # Using transformed data train and evaluate SGD classifier model
    sgd_clf = SGDClassifier(max_iter=100)
    _ = sgd_clf.fit(X_train_1, y_train)
    
    # Print the train and test accuracies for the SGD classifier model
    y_pred_train = sgd_clf.predict(X_train_1)
    acc_train = accuracy_score(y_train, y_pred_train)
    print(f"SGD Classifier model")
    print(f"n_components = {n_components}")
    print(f"Train accuracy for SGD classifier: {acc_train:.4f}")
    
    y_pred_test = sgd_clf.predict(X_test_1)
    acc_test = accuracy_score(y_test, y_pred_test)
    print(f"Test accuracy for SGD classifier: {acc_test:.4f}")
    
    # printing the confusion matrix for the SGD classifier
    print(f"Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_test))
    

Random Forest Classifier model
n_components = 100
Train accuracy for Random forest classifier: 0.9664
Test accuracy for Random forest classifier: 0.8556
Confusion Matrix:
[[59  2  8]
 [ 3 47  8]
 [ 3  2 48]]
SGD Classifier model
n_components = 100
Train accuracy for SGD classifier: 0.9880
Test accuracy for SGD classifier: 0.9000
Confusion Matrix:
[[60  0  9]
 [ 2 50  6]
 [ 1  0 52]]
Random Forest Classifier model
n_components = 300
Train accuracy for Random forest classifier: 0.9880
Test accuracy for Random forest classifier: 0.8556
Confusion Matrix:
[[57  1 11]
 [ 3 50  5]
 [ 4  2 47]]
SGD Classifier model
n_components = 300
Train accuracy for SGD classifier: 0.9952
Test accuracy for SGD classifier: 0.8778
Confusion Matrix:
[[69  0  0]
 [ 9 49  0]
 [13  0 40]]
Random Forest Classifier model
n_components = 500
Train accuracy for Random forest classifier: 0.9832
Test accuracy for Random forest classifier: 0.7778
Confusion Matrix:
[[48  0 21]
 [ 1 42 15]
 [ 1  2 50]]
SGD Classifier model

## Analysis

The results above show the performance of the random forest classifier and the SGD classifier for three n_components values: 100, 300, and 500.

1. With n_components = 100, the SGD model has a higher train accuracy (0.9880) than the random forest model (0.9664).
SGD also outperforms the random forest model in test accuracy (0.9000 versus 0.8556).
As a result, SGD is the better performer when n_components is set to 100.

2. With n_components = 300, the SGD model has a higher train accuracy (0.9952) than the random forest model (0.9880).
SGD also outperforms the random forest model in test accuracy (0.8778 versus 0.8556).
As a result, SGD is the better performer for an n_components value of 300.

3. With n_components = 500, the SGD model has a higher train accuracy (0.9952) than the random forest model (0.9832).
SGD also outperforms the random forest model in test accuracy (0.9222 versus 0.7778).
As a result, SGD is the better performer for an n_components value of 500.

Looking at each model across settings, the random forest model reaches its highest train accuracy (0.9880) at n_components = 300, while its test accuracy is the same (0.8556) at n_components = 100 and 300 and drops to 0.7778 at 500.
The SGD classifier's train accuracy is highest, and identical (0.9952), at n_components = 300 and 500, and its test accuracy is highest (0.9222) at n_components = 500.

In summary, the SGD model outperforms the random forest model in both train and test accuracy at all three n_components values (100, 300, and 500).
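
The test accuracies quoted in this analysis can be cross-checked against the printed confusion matrices, since accuracy is the sum of the diagonal divided by the total number of test posts. The sketch below recomputes them with NumPy and also reports per-class recall (diagonal divided by row sums). The SGD result for n_components = 500 is left out because its confusion matrix does not appear in the output above; the dictionary name `matrices` is an assumption for illustration.

```python
import numpy as np

# Test-set confusion matrices copied from the output above
# (rows = true class, columns = predicted class)
matrices = {
    ("Random Forest", 100): [[59, 2, 8], [3, 47, 8], [3, 2, 48]],
    ("SGD", 100): [[60, 0, 9], [2, 50, 6], [1, 0, 52]],
    ("Random Forest", 300): [[57, 1, 11], [3, 50, 5], [4, 2, 47]],
    ("SGD", 300): [[69, 0, 0], [9, 49, 0], [13, 0, 40]],
    ("Random Forest", 500): [[48, 0, 21], [1, 42, 15], [1, 2, 50]],
}

accuracies = {}
for key, cm in matrices.items():
    cm = np.array(cm)
    # Accuracy: correctly classified posts (diagonal) over all test posts
    accuracies[key] = np.trace(cm) / cm.sum()
    # Recall per class: correct predictions for a class over its true count
    recall = np.diag(cm) / cm.sum(axis=1)
    print(key, f"accuracy={accuracies[key]:.4f}", "recall:", np.round(recall, 3))
```

Per-class recall can reveal patterns that a single accuracy number hides, such as a model that favors one class over the others.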
