# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.


### Import common packages

In [1]:
import pandas as pd
import numpy as np
np.random_seed = 1

### Load data

In [2]:
news = pd.read_csv('news.csv')
news.shape

(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']
X

0      I have a few reprints left of chapters from my...
1      gnuplot, etc. make it easy to plot real valued...
2      Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3      Hello, I am looking to add voice input capabil...
4      I recently got a file describing a library of ...
                             ...                        
592    carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...
593    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
594    Article-I.D.: kestrel.1993Apr16.172052.27843 R...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
596    I have a 42 yr old male friend, misdiagnosed a...
Name: TEXT, Length: 597, dtype: object

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Like CountVectorizer, TfidfVectorizer will covert to lowercase, remove punctuation, and remove 
# stop words - to remove other things, such as numbers, use the token_pattern parameter
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X1 = vectorizer.fit_transform(X)

df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,aa,aalborg,aamrl,aangeboden,aantal,aaplay,aarnet,ab,abad,abandon,...,zoo,zool,zorn,zt,zu,zubov,zupancic,zurich,zyeh,zzz
0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.09173,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.126841,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
593,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
594,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
595,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Lemmatization
We here have applied the lemmatization technique before we split our data in train,test.

In [9]:
import nltk
#nltk.download('averaged_perceptron_tagger') # you only need to run this once
#nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag, word_tokenize

In [10]:
corpus = X
corpus

0      I have a few reprints left of chapters from my...
1      gnuplot, etc. make it easy to plot real valued...
2      Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3      Hello, I am looking to add voice input capabil...
4      I recently got a file describing a library of ...
                             ...                        
592    carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...
593    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
594    Article-I.D.: kestrel.1993Apr16.172052.27843 R...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
596    I have a 42 yr old male friend, misdiagnosed a...
Name: TEXT, Length: 597, dtype: object

In [11]:

transformed_corpus = []
wnl = WordNetLemmatizer()
for document in corpus:
    transformed_document = ""
    for word, tag in pos_tag(word_tokenize(document)):
        wntag = tag[0].lower()
        wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
        if not wntag:
            lemma = word
        else:
            lemma = wnl.lemmatize(word, wntag)
        transformed_document+= lemma + " "
    transformed_corpus += [transformed_document]

#transformed_corpus

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Like CountVectorizer, TfidfVectorizer will covert to lowercase, remove punctuation, and remove 
# stop words - to remove other things, such as numbers, use the token_pattern parameter
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X2 = vectorizer.fit_transform(transformed_corpus)

df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,aa,aalborg,aamrl,aangeboden,aantal,aaplay,aarnet,ab,abad,abandon,...,zoo,zool,zorn,zt,zu,zubov,zupancic,zurich,zyeh,zzz
0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.09234,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.1337,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
593,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
594,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
595,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Analysis
We can observe that the number of columns have been reduced by atleast 2000. This is the advantage of lemmatization.
Lemmatization is a natural language processing technique used to reduce a word to its base or root form, which is known as the lemma. The goal of lemmatization is to transform all forms of a word (such as "walk," "walked," "walking") into its base form ("walk"), which can help to improve the accuracy of text analysis and reduce the complexity of a text.

## Split the data

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.3)

In [14]:
X_train.shape, y_train.shape

((417, 10925), (417,))

In [15]:
X_test.shape, y_test.shape

((180, 10925), (180,))

In [16]:
X_train

<417x10925 sparse matrix of type '<class 'numpy.float64'>'
	with 28954 stored elements in Compressed Sparse Row format>

In [17]:
X_test

<180x10925 sparse matrix of type '<class 'numpy.float64'>'
	with 12399 stored elements in Compressed Sparse Row format>

In [18]:
y_train[:5]

array([0, 0, 1, 2, 2])

**Copying the train & test data into different variables for further analysis using different components applied to SVD**

In [19]:
X_train1=X_train
X_test1=X_test
y_train1=y_train
y_test1=y_test

In [20]:
X_test1.shape

(180, 10925)

In [22]:
X_train2=X_train
X_test2=X_test
y_train2=y_train
y_test2=y_test

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [23]:
X_train.shape, X_test.shape

((417, 10925), (180, 10925))

In [24]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x10925 sparse matrix of type '<class 'numpy.float64'>'
	with 28954 stored elements in Compressed Sparse Row format>

In [25]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

In [26]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [27]:
X_train.shape, X_test.shape

((417, 300), (180, 300))

## Random Forest

In [28]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [30]:
from sklearn.metrics import accuracy_score

In [31]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

Train acc: 0.9784
Test acc: 0.8722


array([[40,  0,  9],
       [ 3, 59,  5],
       [ 3,  3, 58]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [32]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [33]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"test acc: {accuracy_score(y_test, y_pred_test):.4f}")
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

Train acc: 0.9976
test acc: 0.9111


array([[48,  0,  1],
       [ 5, 62,  0],
       [ 7,  3, 54]], dtype=int64)

## Analysis
Initially, we had performed the lemmatization technique using the modules defined in python. We had sent the data that we want to process into a variable called corpus,The defined methods will produce an output variable transformed_corpus which will be containing the tokenized data to its root form. We then transform this data into a sparse matrix, the TfidfVectorizer will covert to lowercase, remove punctuation, and remove stop words - to remove other things, such as numbers, use the token_pattern parameter.
After we split our new data into test,train split; We then use SVD to transform our training data & test data which will reduce the dimentionality.

We have applied the random forest classifier & also the stochastic gradient classifier for the transformed data for evaluation the models.
We have used the accuracy as our main metric to evaluate the model performance. 
Comparing our results achieved with lemmatization with the previous notebook where we ommitted the lemmatization part, our models have improved results, the accuracy of both the *test & train* scores.
We can see that the changes in confusion matrix when compared to the previous one.
The accuracy with the SGD classifier gives us 0.99 with training data & 0.91 test data

# SVD with number of components set to 100, N=100
SVD (Singular Value Decomposition) is a popular matrix factorization technique used in data analysis, and machine learning. 

## Latent Semantic Analysis (Singular Value Decomposition)

In [34]:

svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train1= svd.fit_transform(X_train1)
X_test1 = svd.transform(X_test1)


In [35]:
X_train1.shape, X_test1.shape

((417, 100), (180, 100))

## Random Forest

In [36]:

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train1, y_train1)

### Evaluating Model Performance

In [38]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train1)
acc = accuracy_score(y_train1, y_pred_train)
print(f"Train acc: {accuracy_score(y_train1, y_pred_train):.4f}")
#Test accuracy
y_pred_test = rnd_clf.predict(X_test1)
acc = accuracy_score(y_test1, y_pred_test)
print(f"Test acc: {accuracy_score(y_test1, y_pred_test):.4f}")

# Confusion Matrix

confusion_matrix(y_test1, y_pred_test)

Train acc: 0.9640
Test acc: 0.8389


array([[38,  0, 11],
       [ 3, 60,  4],
       [ 7,  4, 53]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [39]:

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train1, y_train1)

### Evaluating Model Performance

In [40]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train1)
print(f"Train acc: {accuracy_score(y_train1, y_pred_train):.4f}")
#Test accuracy
y_pred_test = sgd_clf.predict(X_test1)
print(f"test acc: {accuracy_score(y_test1, y_pred_test):.4f}")
# Confusion Matrix

confusion_matrix(y_test1, y_pred_test)

Train acc: 0.9856
test acc: 0.8611


array([[48,  0,  1],
       [ 9, 58,  0],
       [13,  2, 49]], dtype=int64)

## Analysis
We can observe that there is a drastic change in accuracy of test score when the number of components are decreased, the test  score is 0.86 which has siginificant change compared to our previous results.
However the train data accuracy is comparitively not that big change as it stands at 0.98

# SVD with number of components set to 500, N=500
For the final analysis of the effect of number of components when increase to 500

## Latent Semantic Analysis (Singular Value Decomposition)

In [41]:

svd = TruncatedSVD(n_components=500, n_iter=10) #n_components is the number of topics, which should be less than the number of features
X_train2= svd.fit_transform(X_train2)
X_test2 = svd.transform(X_test2)


In [42]:
X_train2.shape, X_test2.shape

((417, 417), (180, 417))

### Discovery
Here we can observe that the shape of the data is 417, which is the max number of features, since we had set our components to 500 which is greater than our features so it is automatically reduced to number of featuers.
We should note that the number of components must always be less than number of features in our dataset.

## Random Forest

In [43]:

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train2, y_train2)

### Evaluating Model Performance

In [45]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train2)
acc = accuracy_score(y_train2, y_pred_train)
accN5tr=print(f"Train acc: {accuracy_score(y_train2, y_pred_train):.4f}")
#Test accuracy
y_pred_test = rnd_clf.predict(X_test2)
accN5ts= accuracy_score(y_test2, y_pred_test)
print(f"Test acc: {accuracy_score(y_test2, y_pred_test):.4f}")

# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test2, y_pred_test)

Train acc: 0.9856
Test acc: 0.8556


array([[38,  1, 10],
       [ 2, 61,  4],
       [ 5,  4, 55]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [46]:

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train2, y_train2)

### Evaluating Model Performance

In [47]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train2)
print(f"Train acc: {accuracy_score(y_train2, y_pred_train):.4f}")
#Test accuracy
y_pred_test = sgd_clf.predict(X_test2)
print(f"test acc: {accuracy_score(y_test2, y_pred_test):.4f}")
# Confusion Matrix
confusion_matrix(y_test2, y_pred_test)

Train acc: 0.9976
test acc: 0.7944


array([[48,  0,  1],
       [ 3, 64,  0],
       [20, 13, 31]], dtype=int64)

# Results
From the above analysis, with different number of components applied with the SVD.
We can observe that the accuracy was the least when the Number of components was 100, with the accuracy of the SGD Classifier resulting in 0.96 of train & 0.86 with the test data.

Our initial exploration of data with components set to 300, gave us the better output, with the accuracy standing at 0.99 and 0.91 of test and train data respectively.
The final experiment was done with 500 components, which gave us the train accuracy of 0.99 however, we got the poor performance of 0.79 with the test data.

The appropriate number of components to use in SVD depends on the specific application and the desired level of accuracy. 

# Discussion
When we keep increasing the number of components in SVD, we are essentially increasing the level of detail in the representation of the original matrix. This means that the approximation of the original matrix becomes more accurate as more components are added. However, adding too many components can lead to overfitting, which means that the model may become too specialized to the specific data used to train it, and may not generalize well to new data.
In addition, adding more components may increase the computational complexity of the SVD, making it slower and more memory-intensive to compute. Therefore, it is important to balance the number of components used with the desired level of accuracy and the available computational resources.Furthermore, increasing the number of components may lead to an increase in the size of the transformed data, which can become a challenge in cases where storage or transmission resources are limited. Therefore, it is important to consider the practical implications of increasing the number of components, especially when dealing with large datasets.

We can observe the patterns in the results, having the maximum components have indeed resulted in overfitting, which resulted in poor test data accuracy, Applying SVD to data can have several impacts depending on the specific application, By representing the data in a lower-dimensional space, SVD can reveal underlying patterns or relationships that may not be immediately apparent in the original data.

SVD can be utilized to decrease the number of dimensions of the data by choosing the most relevant components. This can be beneficial for several purposes, including the visualization of high-dimensional data, minimizing storage and computational requirements, and enhancing the effectiveness of machine learning algorithms.

## Conclusion

The main benefits of using SVD is reducing the complexity & dimensionality of our data. However, whether or not to use SVD in a particular analysis depends on various factors, such as the nature of the data, the goals of the analysis, and the computational resources available. But there will be also downsides of using the SVD such as overfitting, as we can see our results when n=500, we observed overfitting and got poor results.
We should use optimal values to apply SVD to get good results.