### U95310908 Srikar Pusuluri

# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np

np.random.seed(1)  # seed NumPy's global random number generator for reproducibility

### Load data

In [2]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. We will predict one of three categories for each post: "graphics," "hockey," or "medical."

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
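The encoding step above can be checked on a small scale. The following is a minimal sketch, assuming a short hypothetical list of labels rather than the real `news['newsgroup']` column; it shows how `LabelEncoder` assigns integer codes in alphabetical order of the class names and how `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for news['newsgroup']
labels = ["graphics", "hockey", "medical", "hockey"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)  # fit and encode in one step

# Map each class name to its integer code
mapping = {name: int(code) for name, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))}
print(mapping)

# Recover the original string labels from the integer codes
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is useful later, because predictions come back as integer codes and `inverse_transform` turns them into readable category names.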

## Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape, y_train.shape

((417,), (417,))

In [10]:
X_test.shape, y_test.shape

((180,), (180,))
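One possible refinement of the split, sketched here on hypothetical toy arrays rather than the newsgroup data: passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions the same in train and test. Neither argument is used in the cell above, so this is an illustration of an option, not a description of what the tutorial ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 samples, 3 balanced classes
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2], 10)

# Same arguments twice: a fixed random_state gives an identical split
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

X_tr_a, X_te_a, y_tr_a, y_te_a = split_a
print(np.array_equal(X_te_a, split_b[1]))  # identical test rows on both calls
print(np.bincount(y_te_a))                 # class counts in the test set
```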

In [11]:
X_train.head(5)

532    Probably within 50 years, a new type of eugeni...
367    In < tvartiai.734823058@vipunen.hut.fi> tvarti...
69     In article < 1993Apr11.132604.13400@ornl.gov> ...
596    I have a 42 yr old male friend, misdiagnosed a...
506    Article-I.D.: blue.8016 References: < 19214@pi...
Name: TEXT, dtype: object

In [12]:
y_train[:5]

array([2, 1, 0, 2, 2])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning or preprocessing. We pass the raw text to `TfidfVectorizer`, which lowercases it, tokenizes it, and removes English stop words. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [13]:
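#TfidfVectorizer handles pre-processing, tokenization, and stop-word filtering
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string avoids invalid-escape warnings; the pattern keeps tokens made of letters only (no digits or underscores)
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")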

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [14]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data
# features will be very different and will NOT match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [15]:
X_train.shape, X_test.shape

((417, 10160), (180, 10160))
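The matching column counts above come from reusing the vocabulary learned on the training data. The following is a minimal sketch of that behaviour, assuming a hypothetical three-sentence corpus; it contrasts `transform` on the fitted vectorizer with the mistake of fitting a second vectorizer on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus, not taken from news.csv
train_docs = ["the goalie stopped the puck", "the doctor prescribed a drug"]
test_docs = ["the surgeon stopped the bleeding"]

vect = TfidfVectorizer(stop_words="english", token_pattern=r"[^\W\d_]+")
X_tr = vect.fit_transform(train_docs)  # learns the vocabulary from train only
X_te = vect.transform(test_docs)       # reuses that vocabulary: same columns

# Mistake: fitting again on test builds a different vocabulary
refit = TfidfVectorizer(stop_words="english", token_pattern=r"[^\W\d_]+")
X_te_wrong = refit.fit_transform(test_docs)

print(X_tr.shape, X_te.shape, X_te_wrong.shape)
```

Words that appear only in the test text (here "surgeon" and "bleeding") are simply ignored by `transform`, which is the price of keeping the feature space fixed.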

In [16]:
# X_train is a sparse matrix, so displaying it shows only a summary; use toarray() to see the values
X_train

<417x10160 sparse matrix of type '<class 'numpy.float64'>'
	with 30599 stored elements in Compressed Sparse Row format>

In [17]:
# Convert the sparse matrix to a dense array with toarray() to see the values
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
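The dense view is almost all zeros because only 30599 of the 417 x 10160 entries are stored. A small sketch, using a hypothetical 3 x 4 matrix plus the counts reported above, shows how the stored-element count relates to density.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical tiny matrix with three non-zero entries
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.7],
])
sparse = csr_matrix(dense)

# Density = stored (non-zero) elements / total cells
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)

# Same calculation with the numbers reported for X_train above
news_density = 30599 / (417 * 10160)
print(f"{news_density:.4%}")
```

At well under 1% density, the sparse format avoids storing millions of zeros, which is why `toarray()` is best kept for inspection only.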

## Latent Semantic Analysis (Singular Value Decomposition)

In [18]:
from sklearn.decomposition import TruncatedSVD

# n_components is the number of topics (latent dimensions); it must not exceed the number of features
svd = TruncatedSVD(n_components=500, n_iter=10)

X_train = svd.fit_transform(X_train)
X_test = svd.transform(X_test)


#### n=500

In [19]:
X_train.shape, X_test.shape

((417, 417), (180, 417))

#### n=100
X_train.shape, X_test.shape

((417, 100), (180, 100))

#### n=300
X_train.shape, X_test.shape

((417, 300), (180, 300))

#### The shapes above show the dimensions of the X_train and X_test datasets for each setting.

#### The training set has only 417 observations, so SVD cannot return more than 417 components: even with n_components set to 500, we get 417.
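The cap on the number of components follows from a general property of the SVD: a matrix with m rows and n columns has at most min(m, n) singular values. The sketch below illustrates this with NumPy's exact SVD on a hypothetical random matrix. It is not the randomized solver that `TruncatedSVD` uses, and the `requested` and `usable` names are introduced here only for illustration.

```python
import numpy as np

# Hypothetical data: 6 observations, 20 features (fewer rows than columns)
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 20))

# Exact SVD: the number of singular values is min(n_rows, n_cols)
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values.shape)

# A simple guard when choosing how many components to ask for
requested = 10
usable = min(requested, *A.shape)
print(usable)
```

Applied to this tutorial, the same reasoning gives min(500, 417, 10160) = 417, which matches the shape reported for the n=500 run.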

## Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [21]:
from sklearn.metrics import accuracy_score

#### n = 500

In [22]:
#Train accuracy - Not a good measure of model performance, because the model is scored on the same data it was trained on
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {acc:.4f}")

Train acc: 0.9664


#### n = 100 
Train acc: 0.9760

#### n = 300 
Train acc: 0.9784

#### Comparing the train accuracy scores across all n_components settings for the Random Forest model, the highest train accuracy is 0.9784, with n=300.

#### n=500

In [23]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {acc:.4f}")

Test acc: 0.8556


#### n = 100 
Test acc: 0.8500

#### n = 300
Test acc: 0.8722

#### Comparing the test accuracy scores across all n_components settings for the Random Forest model, the highest test accuracy is 0.8722, with 300 components.

#### n=500

In [24]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[46,  2,  5],
       [ 2, 53,  0],
       [12,  5, 55]], dtype=int64)

#### n = 100
array([[44,  2,  9],
       [ 2, 57,  8],
       [ 5,  1, 52]], dtype=int64)

#### n = 300

array([[45,  0,  9],
       [ 1, 49,  6],
       [ 5,  2, 63]], dtype=int64)

#### The above are the confusion matrices for the three n_components settings with the Random Forest model. The matrices differ clearly from one setting to the next.
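A confusion matrix carries more than the overall score. As a small sketch, the n=500 Random Forest matrix reported above is typed in by hand below; accuracy is the diagonal sum divided by the total, and per-class recall is each diagonal entry divided by its row total (rows are true classes, columns are predicted classes, in the order graphics, hockey, medical).

```python
import numpy as np

# Random Forest confusion matrix for n=500, copied from the output above
cm = np.array([
    [46,  2,  5],
    [ 2, 53,  0],
    [12,  5, 55],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)

print(f"accuracy: {accuracy:.4f}")
for name, r, p in zip(["graphics", "hockey", "medical"], recall, precision):
    print(f"{name}: recall={r:.4f}, precision={p:.4f}")
```

The accuracy recovered this way matches the 0.8556 printed earlier, and the recall values show that medical posts were the ones most often misclassified in this run.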

## Stochastic Gradient Descent Classifier

In [25]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

#### n = 500

In [26]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9976


#### n = 100 
Train acc: 0.9760

#### n = 300 
Train acc: 0.9976

#### Comparing the train accuracy scores across all n_components settings for the Stochastic Gradient Descent Classifier, the scores for 300 and 500 components are the same, i.e., 0.9976.

In [27]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9222


#### n = 100 
Test acc: 0.9500

#### n = 300 
Test acc: 0.9389

#### Comparing the test accuracy scores across all n_components settings for the Stochastic Gradient Descent Classifier, the highest test accuracy is 0.9500, with 100 components. These values equal the correct predictions divided by the 180 test posts in the confusion matrices below.

#### n = 500

In [28]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[45,  1,  7],
       [ 0, 54,  1],
       [ 2,  3, 67]], dtype=int64)

#### n = 100 

array([[53,  0,  3],
       [ 3, 58,  2],
       [ 1,  0, 60]], dtype=int64)

#### n = 300 

array([[54,  0,  0],
       [ 2, 54,  0],
       [ 6,  3, 61]], dtype=int64)

#### The above are the confusion matrices for the three n_components settings with the Stochastic Gradient Descent Classifier. The matrices differ clearly from one setting to the next.
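Since test accuracy is the share of predictions on the diagonal, it can be recovered from each of the three matrices above. The sketch below types the reported SGD matrices in by hand and computes that share for each setting; the dictionary name `sgd_matrices` is introduced here for illustration only.

```python
import numpy as np

# SGD confusion matrices copied from the outputs above, keyed by n_components
sgd_matrices = {
    100: np.array([[53, 0, 3], [3, 58, 2], [1, 0, 60]]),
    300: np.array([[54, 0, 0], [2, 54, 0], [6, 3, 61]]),
    500: np.array([[45, 1, 7], [0, 54, 1], [2, 3, 67]]),
}

# Accuracy = diagonal sum / total number of test posts
sgd_test_acc = {n: np.trace(cm) / cm.sum() for n, cm in sgd_matrices.items()}

for n, acc in sgd_test_acc.items():
    print(f"n = {n}: test acc = {acc:.4f}")
```

Each matrix sums to 180, the size of the test set, so these figures are directly comparable with the Random Forest test accuracies.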

## Analysis:

We performed SVD with n_components set to 100, 300, and 500, and fit Random Forest and Stochastic Gradient Descent models on the reduced features. Because the training set has only 417 observations, the 500 setting actually yields 417 components.

For the Random Forest model, test accuracy rose from 0.8500 with 100 components to 0.8722 with 300, then fell to 0.8556 with 500. Train accuracy followed the same pattern (0.9760, 0.9784, 0.9664). Increasing the number of components from 100 to 300 therefore improved performance, but going beyond 300 did not.

For the Stochastic Gradient Descent model, train accuracy was 0.9760 with 100 components and 0.9976 with both 300 and 500. Its confusion matrices correspond to test accuracies of 0.9500, 0.9389, and 0.9222 for 100, 300, and 500 components, so more components did not improve its test performance. At every setting, Stochastic Gradient Descent scored higher on the test set than Random Forest.

Overall, 300 components is the best setting for Random Forest, while Stochastic Gradient Descent did best on the test set with 100. The three runs used different train/test splits (the row totals of the confusion matrices differ), so differences of one or two percentage points should be read with caution. SVD may also lead to better performance with larger datasets; here the 417 training observations limit how many components can be used.
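A fairer comparison keeps the split and the random seeds fixed and changes only the number of components. The following is a minimal sketch of that idea, assuming a hypothetical twelve-sentence corpus in place of `news.csv` and small component counts that fit such a tiny vocabulary; the scores it prints say nothing about the newsgroup data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for news['TEXT'] and news['newsgroup']
docs = [
    "render polygon image with shading", "plot the surface using gnuplot",
    "convert image file format for graphics", "texture mapping on polygon mesh",
    "goalie stopped puck in overtime", "team scored a goal in the playoffs",
    "hockey season playoffs start tonight", "coach traded the defenseman",
    "doctor prescribed medication for infection", "patient symptoms include fever and pain",
    "treatment for chronic pain disease", "clinical study of the vaccine",
]
labels = ["graphics"] * 4 + ["hockey"] * 4 + ["medical"] * 4

# One fixed, stratified split shared by every setting
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=1
)

results = {}
for n in (2, 4, 6):
    # The pipeline fits TF-IDF and SVD on train only, then reuses them on test
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", token_pattern=r"[^\W\d_]+"),
        TruncatedSVD(n_components=n, random_state=1),
        SGDClassifier(max_iter=1000, random_state=1),
    )
    model.fit(X_tr, y_tr)
    results[n] = model.score(X_te, y_te)  # test accuracy for this setting

print(results)
```

Wrapping the steps in a pipeline also removes the risk of accidentally calling `fit_transform` on the test data, since `score` only ever applies `transform` to it.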