# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [530]:
import pandas as pd
import numpy as np

# Seed NumPy's global random number generator so that results are reproducible
np.random.seed(1)

### Load data

In [531]:
news = pd.read_csv(r'C:\DSP\WE06\news.csv')

news.shape


(597, 5)

In [532]:
news.head(5)

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


### Check for missing values

In [533]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [534]:
X = news['TEXT']

This is a multi-class classification problem. We will predict which of three categories a post belongs to: "graphics," "hockey," or "medical."

In [535]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [536]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
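As a small illustration of what `LabelEncoder` does, the sketch below encodes a toy list of labels and maps the integer codes back to strings. The list `labels` and the name `le_demo` are made up for this example and are not part of the tutorial's data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for news['newsgroup'] (illustrative only)
labels = ["hockey", "graphics", "medical", "hockey"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)    # classes are sorted before codes are assigned

print(le_demo.classes_)                  # ['graphics' 'hockey' 'medical']
print(codes)                             # [1 0 2 1]
print(le_demo.inverse_transform(codes))  # recovers the original strings
```

Because the classes are sorted alphabetically, 0 means graphics, 1 means hockey, and 2 means medical, which is the order to keep in mind when reading the confusion matrices later.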

## Split the data

In [537]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [538]:
X_train.shape, y_train.shape

((417,), (417,))

In [539]:
X_test.shape, y_test.shape

((180,), (180,))
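The split above passes neither `random_state` nor `stratify`. As a hedged sketch of an alternative (not what this tutorial ran), the example below uses toy data to show how those two arguments fix the shuffle and keep the class proportions equal in both parts. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 documents with three balanced classes (illustrative only)
X_demo = np.array([f"doc {i}" for i in range(30)])
y_demo = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=1,   # fixes the shuffle so reruns give the same split
    stratify=y_demo,  # keeps class proportions equal in train and test
)

# Class counts per split: 7 of each class in train, 3 of each in test
print(np.bincount(y_tr), np.bincount(y_te))
```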

In [540]:
X_train.head(5)

541    I hope this is the correct newsgroup for this....
56     In article < lsk1v9INN93c@caspian.usc.edu> zye...
131    Using the VMODE command, all you need to do is...
396    In article < saross01.734885336@starbase.spd.l...
16     Article-I.D.: DIALix.1praaa$pqv Organization: ...
Name: TEXT, dtype: object

In [541]:
y_train[:5]

array([2, 0, 0, 1, 0])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning or preprocessing. We pass the raw text to `TfidfVectorizer`, which handles lowercasing, tokenization, and stop-word removal itself. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [542]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string (r"...") so that \W and \d reach the regex engine as written
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)
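To see which tokens the pattern `[^\W\d_]+` keeps, the sketch below applies it with Python's `re` module to a made-up sentence. The string `sample` is an assumption for illustration; the vectorizer itself also removes stop words afterwards, which this snippet does not do.

```python
import re

# Same pattern passed to TfidfVectorizer: runs of word characters that are
# neither digits nor underscores, i.e. letters only
token_pattern = r"[^\W\d_]+"

sample = "Re: VGA_mode 320x200, see ftp.site.edu (v2.1)"
tokens = re.findall(token_pattern, sample.lower())
print(tokens)
# ['re', 'vga', 'mode', 'x', 'see', 'ftp', 'site', 'edu', 'v']
```

Note that single-letter tokens such as `x` and `v` survive with this pattern, whereas scikit-learn's default token pattern requires at least two characters.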

**Notice that in the previous step we used `fit_transform` on the TRAIN data. When we transform the TEST data, we must use `transform` only, so that both data sets share the same columns (features). If we refit the vectorizer on TEST, the columns WILL be different, and a model trained on TRAIN will not be able to score TEST!**

In [543]:
# Perform the TfidfVectorizer transformation
# Be careful: We use the vectorizer fitted on train to transform the test data set.
# Otherwise, the test data features will be very different and will NOT match the train set!

X_test = tfidf_vect.transform(X_test)


In [544]:
X_train.shape, X_test.shape

((417, 9626), (180, 9626))
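The matching column counts above follow from reusing the fitted vectorizer. A minimal sketch on two toy corpora (invented for this example, with default `TfidfVectorizer` settings rather than the tutorial's) shows the difference between reusing the fit and refitting on the test data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (illustrative only, not the newsgroup data)
train_docs = ["the goalie made a save", "render the polygon mesh"]
test_docs = ["the doctor made a diagnosis", "polygon mesh shading looks wrong"]

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights from train only
test_matrix = vect.transform(test_docs)        # reuses that vocabulary; unseen words are ignored

print(train_matrix.shape, test_matrix.shape)   # same number of columns

# Refitting on test builds a different vocabulary, so the columns no longer line up
refit_matrix = TfidfVectorizer().fit_transform(test_docs)
print(refit_matrix.shape)
```

Words that appear only in the test documents (here "doctor" or "shading") simply get no column, which is the price of keeping the feature space fixed.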

In [545]:
# These data sets are sparse matrices. Displaying one shows only a summary, not the values
X_train

<417x9626 sparse matrix of type '<class 'numpy.float64'>'
	with 29374 stored elements in Compressed Sparse Row format>

In [546]:
# To see the values of a sparse matrix, convert it to a dense array using toarray()
X_train.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.15858167,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
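The summary above reports 29374 stored elements in a 417 x 9626 matrix, which is roughly 0.7% of all cells; the rest are zeros that the sparse format never stores. The sketch below computes the same kind of density figure on a tiny hand-made matrix (the values are illustrative assumptions) and shows how to inspect only a slice instead of converting everything to dense form.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small dense array standing in for a TF-IDF matrix (illustrative values)
dense = np.array([
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
    [0.3, 0.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

# Fraction of cells that actually hold a value
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)  # 3 stored values out of 12 cells -> 0.25

# Inspect the first two rows only, rather than the whole matrix
print(sparse[:2].toarray())
```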

## Latent Semantic Analysis (Singular Value Decomposition)

In [547]:
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

for num in [100, 300, 417]:

    # n_components is the number of topics, which should be less than the number of features
    svd = TruncatedSVD(n_components=num, random_state=42)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)

    # Train Random Forest Classifier
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_svd, y_train)

    # Evaluating Model Performance - Train accuracy
    y_pred_train = rf.predict(X_train_svd)
    score_train_rf = accuracy_score(y_train, y_pred_train)

    # Evaluating Model Performance - Test accuracy
    y_pred_test = rf.predict(X_test_svd)
    score_test_rf = accuracy_score(y_test, y_pred_test)

    # Confusion Matrix (rows are true classes, columns are predicted classes)
    c_matrix_rf = confusion_matrix(y_test, y_pred_test)

    # Train Stochastic Gradient Descent Classifier
    sgd = SGDClassifier(random_state=42)
    sgd.fit(X_train_svd, y_train)

    # Evaluating Model Performance - Train accuracy
    y_pred_train = sgd.predict(X_train_svd)
    score_train_sgd = accuracy_score(y_train, y_pred_train)

    # Evaluating Model Performance - Test accuracy
    y_pred_test = sgd.predict(X_test_svd)
    score_test_sgd = accuracy_score(y_test, y_pred_test)

    # Confusion Matrix (rows are true classes, columns are predicted classes)
    c_matrix_sgd = confusion_matrix(y_test, y_pred_test)

    # Print results
    print(f"for n_components={num}")
    print(f"RF train score={score_train_rf}, RF test score={score_test_rf}")
    print(f"SGD train score={score_train_sgd}, SGD test score={score_test_sgd}")
    print(f"confusion matrix for RF={c_matrix_rf}")
    print(f"confusion matrix for sgd={c_matrix_sgd}")
    print("   ")


for n_components=100
RF train score=0.9952038369304557, RF test score=0.8388888888888889
SGD train score=0.9952038369304557, SGD test score=0.9444444444444444
confusion matrix for RF=[[47  0  8]
 [ 3 51 12]
 [ 6  0 53]]
confusion matrix for sgd=[[55  0  0]
 [ 2 63  1]
 [ 7  0 52]]
   
for n_components=300
RF train score=0.9952038369304557, RF test score=0.8166666666666667
SGD train score=0.9952038369304557, SGD test score=0.9111111111111111
confusion matrix for RF=[[43  0 12]
 [ 2 50 14]
 [ 5  0 54]]
confusion matrix for sgd=[[51  4  0]
 [ 0 66  0]
 [ 5  7 47]]
   
for n_components=417
RF train score=0.9952038369304557, RF test score=0.8444444444444444
SGD train score=0.9952038369304557, SGD test score=0.9555555555555556
confusion matrix for RF=[[45  1  9]
 [ 3 53 10]
 [ 5  0 54]]
confusion matrix for sgd=[[54  1  0]
 [ 1 64  1]
 [ 5  0 54]]
   


The results above show that the test scores of both the Random Forest classifier and the Stochastic Gradient Descent classifier decreased slightly when we increased `n_components` from 100 to 300 (RF: 0.839 to 0.817; SGD: 0.944 to 0.911). When we increased `n_components` to 417, which equals the number of training documents and is therefore the largest number of components the training matrix can support, both test scores rose slightly again (RF: 0.844; SGD: 0.956). The train score stayed at 0.995 in every run.

Before applying SVD, the RF and SGD test scores were 0.8444 and 0.9611, respectively. The difference between the scores before and after applying SVD is very small, which suggests that reducing the dimensionality of the data with SVD is not very helpful for these types of models on this data set. According to these results, we might not need SVD in our analysis. SVD can be extremely helpful for lowering the dimensionality of high-dimensional data and removing noise, but it can also result in the loss of crucial information.
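One way to set up a comparison with and without SVD is to wrap each variant in a scikit-learn `Pipeline`. The sketch below is a minimal illustration on a tiny invented corpus, not a reproduction of the numbers above; the documents, labels, and `n_components=2` are assumptions chosen only to keep the example small.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny toy corpus (illustrative only); 0 = graphics, 1 = hockey
docs = [
    "render polygon mesh texture", "shader pixel image render",
    "image texture polygon color", "pixel color shader mesh",
    "goalie save puck ice", "puck ice team playoff",
    "team goalie playoff score", "score save ice team",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Baseline: TF-IDF features fed straight into the classifier
baseline = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=42))

# Variant: TF-IDF features reduced with truncated SVD before classification
with_svd = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    SGDClassifier(random_state=42),
)

for name, model in [("no SVD", baseline), ("SVD", with_svd)]:
    model.fit(docs, labels)
    # Training accuracy only, for illustration; use held-out text in practice
    print(name, model.score(docs, labels))
```

In a real comparison the pipelines would be fitted on the raw training text and scored on the raw test text, so that the vectorizer and the SVD are both fitted on training data only.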