### Sai Srihitha Goverdhana    U58956033

# WE06 - Text Mining - Classification 

In the weekly exercise, you will explore the impacts of applying SVD to the data.  Analyze how changing the n_components parameter impacts the modeling results. Use the values 100, 300, and 500 and discuss how each of these values impacted the performance of the models. Discuss why we may or may not want to use SVD in our analysis.

### Import common packages

In [1]:
import pandas as pd
import numpy as np

np.random_seed = 1

### Load data

In [2]:
news = pd.read_csv('../data/news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape, y_train.shape

((417,), (417,))

In [10]:
X_test.shape, y_test.shape

((180,), (180,))

In [11]:
X_train.head(5)

435    There were a few people who responded to my re...
437    Article-I.D.: netcom.kaminskiC52n0s.2uA Refere...
337    OK, I'll join in the fun and give my playoff p...
0      I have a few reprints left of chapters from my...
354    In article 1@tnclus.tele.nokia.fi, hahietanen@...
Name: TEXT, dtype: object

In [12]:
y_train[:5]

array([2, 2, 1, 0, 1])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [13]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [14]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [15]:
X_train.shape, X_test.shape

((417, 10321), (180, 10321))

In [16]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x10321 sparse matrix of type '<class 'numpy.float64'>'
	with 30904 stored elements in Compressed Sparse Row format>

In [17]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

In [18]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [19]:
X_train.shape, X_test.shape

((417, 417), (180, 417))

### n_components = 100

> At no of components is 100, shape of X_train is (417, 100) and shape of X_test is (180, 100)

### n_components = 300

> At no of components is 100, shape of X_train is (417, 300) and shape of X_test is (180, 300)

### n_components = 500

> At no of components is 100, shape of X_train is (417, 417) and shape of X_test is (180, 417)

> As there are only 417 observations , even if we try to add no of components as 500. It will reshape it to only (417,417) observations for training and test data

## Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [21]:
from sklearn.metrics import accuracy_score

In [22]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9856


### n_components = 100

> Training set accuracy for the dataset is 96.40%

### n_components = 300

> At no_components = 300, Training set accuracy for the dataset is 96.60%

### n_components = 500

> At no_components = 300, Training set accuracy for the dataset is 98.56%

In [23]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test):.4f}")

Test accuracy: 0.8889


### n_components = 100
> At no_components = 100, Test set accuracy for the dataset is 87.22%

### n_components = 300
>  At no_components = 300, Test set accuracy for the dataset is 91.11%

### n_components = 500
>  At no_components = 300, Test set accuracy for the dataset is 88.99%

In [24]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[50,  0,  7],
       [ 2, 54,  3],
       [ 6,  2, 56]], dtype=int64)

>

**We can Observe some changes in confusion matrix for diffrent values of no of components for Random Forest Model**
### n_components = 100
       [50,  0,  7],
       [ 3, 56,  5],
       [ 6,  2, 51]
       
### n_components = 300
       [55,  0,  4],
       [ 0, 56,  2],
       [ 9,  1, 53]]
       
### n_components = 500
       [50,  0,  7],
       [ 2, 54,  3],
       [ 6,  2, 56]

## Stochastic Gradient Descent Classifier

In [25]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [26]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 1.0000


### n_components = 100
> Train accuracy is 98.56%

### n_components = 300
> Train accuracy is 99.52%

### n_components = 500
> Train Acc is 100% ( As there are only 417 observations )

In [27]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_train, y_pred_train):.4f}")

Test accuracy: 1.0000


### n_components = 100
> Test accuracy is 98.56%

### n_components = 300
> Test accuracy is 99.52%

### n_components = 500
> test Acc is 100% ( As there are only 417 observations )

In [28]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[56,  1,  0],
       [ 3, 56,  0],
       [ 6,  2, 56]], dtype=int64)

**Changes in confusion matrix for diffrent values of no of components for Stochastic Gradient Descent Classifier**
### n_components = 100
       [51,  5,  1],
       [ 0, 62,  2],
       [ 3,  2, 54]
       
### n_components = 300
       [[50,  0,  9],
       [ 1, 57,  0],
       [ 0,  1, 62]]  
       
### n_components = 500
       [56,  1,  0],
       [ 3, 56,  0],
       [ 6,  2, 56]

### Analysis: 

> With 100 components, the SVD model might not capture all the underlying variation in the data, leading to less performance.

> With 300 components, the SVD model captures more of the underlying variation in the data, leading to better performance. In our model, I consider n_components at 300 will provide optimal performance.

> We achieced Test accuracy when n_components is 300 as 99.52% for the model Stochastic Gradient Descent Classifier and for randomforest model , accuracy obtained is 91.11%

> With 500 components, the SVD model captures even more of the underlying variation in the data, but at the cost of increased computational complexity and potential overfitting.

> In general, when we have a big dataset with plenty of features, we can apply SVD in our study. Considering that there are only 417 observations and fewer observation.