## Muralidhar Reddy 
### U64546777

# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [36]:
!pip install nltk



In [37]:
import warnings
warnings.simplefilter('ignore')


In [38]:

import nltk
nltk.download('punkt')

 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rmura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [39]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\rmura\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [40]:
# Import required modules
# Data handling
import pandas as pd
import numpy as np
# Text processing
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
# Feature conversion
from sklearn.feature_extraction.text import CountVectorizer


# Seed NumPy's global random number generator
np.random.seed(1)

### Load data

In [81]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [82]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


# Code for lemmatisation

In [83]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


# call lemmatizer
lemmatizer = WordNetLemmatizer()

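# Convert Penn Treebank POS tags (as returned by nltk.pos_tag) to WordNet POS tags
def getPOSWordNet(tag):
    if tag.startswith('J'):      # adjectives: JJ, JJR, JJS
        return wordnet.ADJ
    elif tag.startswith('V'):    # verbs: VB, VBD, VBG, ...
        return wordnet.VERB
    elif tag.startswith('N'):    # nouns: NN, NNS, NNP, ...
        return wordnet.NOUN
    elif tag.startswith('R'):    # adverbs: RB, RBR, RBS
        return wordnet.ADV
    else:
        return wordnet.NOUN      # fall back to noun for every other tag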


news['TEXT'] = news['TEXT'].apply(
    lambda x: ' '.join(
        [
            lemmatizer.lemmatize(word, getPOSWordNet(tag))
            for word, tag in nltk.pos_tag(nltk.word_tokenize(x))
        ]
    )
)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rmura\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\rmura\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
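
For reference, the Penn Treebank tags that `nltk.pos_tag` returns start with `J` for adjectives, `V` for verbs, `N` for nouns and `R` for adverbs, while WordNet uses the one-letter codes `a`, `v`, `n` and `r`. The sketch below shows that mapping without needing NLTK or its data files. The names `WORDNET_POS` and `penn_to_wordnet` are hypothetical and used only for illustration.

```python
# Hypothetical helper: maps a Penn Treebank tag to the one-letter
# part-of-speech code used by WordNet ('a', 'v', 'n', 'r').
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag: str) -> str:
    """Return the WordNet POS code for a Penn Treebank tag, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")


# A few common tags: adjective, past-tense verb, plural noun, adverb, preposition
for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups (such as `IN`) fall back to noun, which is also what the lemmatizer assumes when no part of speech is given.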


In [84]:
news["TEXT"]

0      I have a few reprint left of chapter from my b...
1      gnuplot , etc . make it easy to plot real valu...
2      Article-I.D . : snoopy.1pqlhnINN8k1 References...
3      Hello , I am looking to add voice input capabi...
4      I recently got a file describing a library of ...
                             ...                        
592    carl @ SOL1.GPS.CALTECH.EDU ( Carl J Lydick ) ...
593    In article < 1qmlgaINNjab @ hp-col.col.hp.com ...
594    Article-I.D . : kestrel.1993Apr16.172052.27843...
595    In article < 1qmlgaINNjab @ hp-col.col.hp.com ...
596    I have a 42 yr old male friend , misdiagnosed ...
Name: TEXT, Length: 597, dtype: object

### Check for missing values

In [85]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [86]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [87]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [88]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
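
`LabelEncoder` assigns integers in the sorted order of the class names, which is why `graphics` is 0, `hockey` is 1 and `medical` is 2. A minimal sketch of the same idea with NumPy alone, using a made-up five-label array:

```python
import numpy as np

# Made-up labels, deliberately not in sorted order
labels = np.array(["medical", "graphics", "hockey", "graphics", "medical"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of every original label inside that sorted array.
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)
print(encoded)

# Decoding is plain indexing into the sorted class array.
decoded = classes[encoded]
print(decoded)
```

With the fitted encoder in the notebook, `le.inverse_transform` plays the role of the decoding step.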

## Split the data

In [89]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [90]:
X_train.shape, y_train.shape

((417,), (417,))

In [91]:
X_test.shape, y_test.shape

((180,), (180,))
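
The 417 and 180 rows above follow from rounding the 30% test share of 597 posts up to a whole number. The sketch below reproduces those sizes with a shuffled index split. It assumes a ceiling rule for the test share and a fixed seed of 1, both chosen for illustration, and it is not the code `train_test_split` runs internally.

```python
import math

import numpy as np

n_samples, test_size = 597, 0.3

n_test = math.ceil(test_size * n_samples)   # 179.1 rounded up
n_train = n_samples - n_test

# Fixed seed (an assumption) so the same rows land in each set on every run
rng = np.random.default_rng(1)
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

`train_test_split` accepts a `random_state` argument for the same purpose; passing a fixed value makes the split, and therefore every accuracy that follows, repeatable.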

In [92]:
X_train.head(5)

502    I attended high school in the San Jose , Calif...
594    Article-I.D . : kestrel.1993Apr16.172052.27843...
488    In article < 19621.3049.uupcb @ factory.com > ...
595    In article < 1qmlgaINNjab @ hp-col.col.hp.com ...
78     Robert J.C. Kyanko ( rob @ rjck.UUCP ) wrote :...
Name: TEXT, dtype: object

In [93]:
y_train[:5]

array([2, 2, 2, 2, 0])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing beyond the lemmatisation above. The lemmatised text goes straight into the vectorizer. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [94]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

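# Raw string so that \W and \d reach the regex engine unchanged; the pattern keeps letter-only tokens
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")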

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [95]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be very different and would not match the train set!

X_test = tfidf_vect.transform(X_test)


In [96]:
X_train.shape, X_test.shape

((417, 9473), (180, 9473))
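
The matching column counts come from fitting the vectorizer on the train set only. A small sketch on a made-up corpus (the sentences and variable names are illustrative, not from `news.csv`) shows the effect, including what happens to a word that appears only in the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
train_docs = ["the goalie made a save", "the scan showed a fracture"]
test_docs = ["the goalie had a scan and a penalty"]

vect = TfidfVectorizer()
X_tr = vect.fit_transform(train_docs)   # learns the vocabulary and idf weights from train only
X_te = vect.transform(test_docs)        # reuses them; words unseen in train are ignored

print(X_tr.shape, X_te.shape)
print("penalty" in vect.vocabulary_)
```

Both matrices have one column per word in the train vocabulary, and `penalty` gets no column because it never occurred in the train documents.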

In [97]:
# X_train is a sparse matrix, so displaying it shows only its shape and storage format
X_train

<417x9473 sparse matrix of type '<class 'numpy.float64'>'
	with 30381 stored elements in Compressed Sparse Row format>

In [98]:
# Convert the sparse matrix with toarray() to see the values (most of them are zero)
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
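
The non-zero entries of this matrix are TF-IDF weights. The sketch below computes them by hand for a made-up count matrix, assuming the defaults documented for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`): the idf of a term is `ln((1 + n) / (1 + df)) + 1`, and each document row is then scaled to unit length.

```python
import numpy as np

# Toy term counts: 3 documents (rows) x 4 terms (columns); values are illustrative
counts = np.array([[2, 1, 0, 0],
                   [0, 1, 3, 0],
                   [1, 1, 0, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency

tfidf = counts * idf                           # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalise each row

print(np.round(idf, 4))
print(np.round(tfidf, 3))
```

The second term occurs in every document, so its idf is exactly 1, the lowest possible value, while terms found in a single document receive the highest weight.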

# SVD n_components =100 
## Latent Semantic Analysis (Singular Value Decomposition)

In [24]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [25]:
X_train.shape, X_test.shape

((417, 100), (180, 100))
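
As with the vectorizer, the SVD is fitted on the train matrix and the test matrix is only projected. A minimal NumPy sketch of that idea, using random stand-in matrices rather than the real TF-IDF data and a plain thin SVD rather than the randomized solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.random((6, 10))   # stand-in for the train TF-IDF matrix
X_te = rng.random((3, 10))   # stand-in for the test TF-IDF matrix
k = 2                        # number of components to keep

# Thin SVD of the train matrix: X_tr = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_tr, full_matrices=False)
components = Vt[:k]          # top-k right singular vectors, learned from train only

Z_tr = X_tr @ components.T   # train rows in the reduced space
Z_te = X_te @ components.T   # test rows projected onto the same axes

print(Z_tr.shape, Z_te.shape)
```

Because the axes come from the train matrix alone, both reduced matrices share the same `k` columns and no information from the test set leaks into the fit.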

## Random Forest

In [26]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [27]:
from sklearn.metrics import accuracy_score

In [28]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9784


In [29]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.8889


In [30]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[58,  1,  5],
       [ 1, 48,  5],
       [ 8,  0, 54]], dtype=int64)
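
Accuracy and per-class figures can be read directly off a confusion matrix, where rows are the true classes and columns the predicted ones. A short sketch using the matrix above:

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted class)
cm = np.array([[58, 1, 5],
               [1, 48, 5],
               [8, 0, 54]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # share of each true class that was found
precision = np.diag(cm) / cm.sum(axis=0)    # share of each predicted class that was right

print(f"accuracy: {accuracy:.4f}")
for name, r, p in zip(["graphics", "hockey", "medical"], recall, precision):
    print(f"{name}: recall {r:.4f}, precision {p:.4f}")
```

The diagonal sums to 160 of 180 test posts, which matches the 0.8889 reported by `accuracy_score`.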

## Stochastic Gradient Descent Classifier

In [31]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [32]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9976


In [33]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9167


In [34]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[61,  0,  3],
       [ 2, 51,  1],
       [ 8,  1, 53]], dtype=int64)

# SVD n_components =300 
## Latent Semantic Analysis (Singular Value Decomposition)

In [59]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [60]:
X_train.shape, X_test.shape

((417, 300), (180, 300))

## Random Forest

In [61]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [62]:
from sklearn.metrics import accuracy_score

In [63]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9808


In [64]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.8278


In [65]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[46,  1, 14],
       [ 0, 46, 13],
       [ 1,  2, 57]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [66]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [67]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9976


In [68]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9444


In [69]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[57,  2,  2],
       [ 1, 58,  0],
       [ 4,  1, 55]], dtype=int64)

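# SVD n_components =100 (applied to the 300-component output)
## Latent Semantic Analysis (Singular Value Decomposition)

`X_train` and `X_test` still hold the 300-component output from In [59], so this run reduces those 300 components to 100 rather than starting again from the TF-IDF matrix.

In [70]: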
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [71]:
X_train.shape, X_test.shape

((417, 100), (180, 100))

## Random Forest

In [72]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [73]:
from sklearn.metrics import accuracy_score

In [74]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9736


In [75]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9000


In [76]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[55,  2,  4],
       [ 0, 56,  3],
       [ 7,  2, 51]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [77]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [78]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9880


In [79]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9167


In [80]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[48,  3, 10],
       [ 0, 58,  1],
       [ 1,  0, 59]], dtype=int64)

# SVD n_components =500 
## Latent Semantic Analysis (Singular Value Decomposition)

In [99]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [100]:
X_train.shape, X_test.shape

((417, 417), (180, 417))
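
The output has 417 columns rather than 500 because a matrix cannot have more singular values than its smaller dimension, and the train TF-IDF matrix has only 417 rows. A small NumPy sketch with a random stand-in matrix shows the same cap:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 20))      # 5 "documents", 20 "terms"

# Singular values only; there are min(n_rows, n_columns) of them
s = np.linalg.svd(X, compute_uv=False)
print(len(s))                # 5
print(np.linalg.matrix_rank(X) <= min(X.shape))
```

In the notebook the smaller dimension is the number of training posts (417), not the number of TF-IDF features (9473).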

## Random Forest

In [101]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [102]:
from sklearn.metrics import accuracy_score

In [103]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9904


In [104]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9056


In [105]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[54,  0, 10],
       [ 0, 54,  3],
       [ 4,  0, 55]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [106]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [107]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9976


In [108]:
#Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.9444


In [109]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[59,  3,  2],
       [ 0, 57,  0],
       [ 4,  1, 54]], dtype=int64)

# In response to the inquiry, I changed the code to lemmatise the text column of the news data set during preprocessing, before splitting.
# After checking the model results, they are still about the same after lemmatisation.

# Analysis
- First we load the data. Because this is text mining, the selected data set contains text such as posts, statements, feedback, or comments.
- After loading the data, we check its properties and, where required, change them so that they suit the model.
- Lemmatization is one of the most important steps for text mining and other text-related data sets.
- Lemmatization is the process of grouping together different inflected forms of the same word.
- Next we split the data into train and test sets in a 70:30 ratio.
- We pass the lemmatized text to the TF-IDF vectorizer.
- This converts the words into a numeric format based on TF-IDF ("term frequency-inverse document frequency").
- The data is now suitable for SVD (Singular Value Decomposition), which we run with three different `n_components` values: 100, 300, and 500.
- The SVD output is ready for models such as Random Forest and the Stochastic Gradient Descent Classifier.
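- SVD n_components 100
    - RFC 
        - Train acc: 0.9784
        - Test acc: 0.8889
    - SGDC
        - Train acc: 0.9976
        - Test acc: 0.9167
- SVD n_components 300
    - RFC 
        - Train acc: 0.9808
        - Test acc: 0.8278
    - SGDC
        - Train acc: 0.9976
        - Test acc: 0.9444
- SVD n_components 100, applied to the 300-component output (In [70] to In [80])
    - RFC 
        - Train acc: 0.9736
        - Test acc: 0.9000
    - SGDC
        - Train acc: 0.9880
        - Test acc: 0.9167
- SVD n_components 500
    - RFC 
        - Train acc: 0.9904
        - Test acc: 0.9056
    - SGDC
        - Train acc: 0.9976
        - Test acc: 0.9444
- The SGDC test accuracies are computed from the confusion matrices: the sum of the diagonal divided by the 180 test posts.
- Test accuracy did not rise steadily as `n_components` increased. For 100, 300, and 500 components, RFC test accuracy was 0.8889, 0.8278, and 0.9056, and SGDC test accuracy was 0.9167, 0.9444, and 0.9444. Train accuracy stayed at 0.97 or higher in every run.
- The runs used different random train/test splits (the row totals of the confusion matrices differ), so small differences between settings may come from the split rather than from `n_components`.
- Even though we passed 500 for `n_components`, the output has only 417 components, as the shape of `X_train` shows. 417 is the number of training posts, not the number of features (9473): SVD cannot return more components than the smaller dimension of the matrix, so the 500 setting effectively used 417.

- SVD converts the sparse TF-IDF matrix (9473 columns) into a much smaller dense matrix before the models are fitted. With these reduced features, both models reached test accuracy between about 0.83 and 0.94, and SGDC scored higher than RFC on the test set in every run. The notebook does not include a model trained without SVD, so these runs show that SVD features work well for this text mining task, not that SVD is strictly required.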
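
Because `train_test_split` was called without `random_state`, each group of runs above saw a different split. One way to compare `n_components` values on identical data is to fix the split and wrap the steps in a pipeline. The sketch below does this on a tiny made-up corpus. The documents, the component counts (2, 5, 10) and `random_state=1` are assumptions for illustration, and its accuracies say nothing about the newsgroup data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for news['TEXT'] and news['newsgroup']
docs = [
    "render the polygon with a shader", "the image uses a texture map",
    "pixel color and shader texture", "polygon mesh render image",
    "the goalie made a save", "a penalty in the third period",
    "the team scored a goal", "goalie save penalty goal",
    "the patient has a fever", "the doctor prescribed a drug",
    "symptom of the disease", "patient doctor drug fever",
] * 5
labels = (["graphics"] * 4 + ["hockey"] * 4 + ["medical"] * 4) * 5

# One fixed, stratified 70:30 split shared by every setting
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.3, random_state=1, stratify=labels
)

results = {}
for k in (2, 5, 10):
    # The pipeline fits TF-IDF and SVD on the train texts only,
    # then applies the fitted steps to the test texts at predict time.
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=k, random_state=1),
        SGDClassifier(max_iter=100, random_state=1),
    )
    model.fit(X_tr, y_tr)
    results[k] = (
        accuracy_score(y_tr, model.predict(X_tr)),
        accuracy_score(y_te, model.predict(X_te)),
    )

for k, (train_acc, test_acc) in results.items():
    print(f"n_components={k}: train acc {train_acc:.4f}, test acc {test_acc:.4f}")
```

Starting each setting from the raw text also avoids applying one SVD on top of another, which happens when `X_train` is overwritten between runs.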