# Text Classification (week 2)

This lab is based on the article "A Comprehensive Guide to Understand and Implement Text Classification in Python": https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

Load the libraries for dataset preparation, feature engineering, and model training.

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# install textblob if necessary: pip install textblob
import pandas, numpy, textblob, string  
import nltk

# load functions from textpreprocess.py
from textpreprocess import denoise_text, normalize, replace_contractions, remove_non_ascii, to_lowercase, remove_punctuation, replace_numbers, remove_stopwords

## 1. Dataset preparation
We use a dataset of Amazon reviews, which can be downloaded from this link (https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235). The dataset consists of <b>10,000 text reviews</b> and their labels. To prepare the dataset, load the downloaded data into a pandas dataframe with two columns: text and label.

In [2]:
# load the dataset; the first token of each line is the label, the rest is the review text
with open('data/corpus', encoding="utf-8") as f:
    data = f.read()
labels, texts = [], []
for line in data.split("\n"):
    if not line.strip():  # skip blank lines (e.g. a trailing newline), where content[0] would fail
        continue
    line = replace_contractions(line) # Replace contractions in string of text
    content = nltk.word_tokenize(line)
    labels.append(content[0])
    words = content[1:]
    words = remove_non_ascii(words)
    #words = to_lowercase(words)
    words = remove_punctuation(words)
    #words = replace_numbers(words)
    #words = remove_stopwords(words)
    texts.append(words)

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
texts1=[' '.join(line) for line in texts] # join words in each line with space character
trainDF['text'] = texts1
trainDF['label'] = labels
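
The loop above relies on helpers from textpreprocess.py, which is not shown in this lab. As a minimal sketch of the same label/text split without those helpers, assuming every line starts with a `__label__N` prefix followed by a space (the sample lines and the name `sampleDF` are made up for illustration):

```python
import pandas

# Two made-up lines in the assumed "<label> <review text>" layout.
sample_lines = [
    "__label__2 Great sound track: I love it",
    "__label__1 Do not buy: the box looked used",
]

rows = []
for line in sample_lines:
    # maxsplit=1 separates the label from the rest of the line
    label, text = line.split(" ", 1)
    rows.append({"text": text, "label": label})

sampleDF = pandas.DataFrame(rows, columns=["text", "label"])
print(sampleDF)
```

This sketch skips tokenization and cleaning, so punctuation and contractions stay in the text column.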

In [3]:
trainDF.head()

   text                                               label
0  Stuning even for the nongamer This sound track...  __label__2
1  The best soundtrack ever to anything I am read...  __label__2
2  Amazing This soundtrack is my favorite music o...  __label__2
3  Excellent Soundtrack I truly like this soundtr...  __label__2
4  Remember Pull Your Jaw Off The Floor After Hea...  __label__2


In [4]:
trainDF.tail()

      text                                               label
9995  A revelation of life in small town America in ...  __label__2
9996  Great biography of a very interesting journali...  __label__2
9997  Interesting Subject Poor Presentation You woul...  __label__1
9998  Do not buy The box looked used and it is obvio...  __label__1
9999  Beautiful Pen and Fast Delivery The pen was sh...  __label__2


In [5]:
trainDF.shape

(10000, 2)

In [6]:
trainDF['label'].value_counts()

label
__label__1    5097
__label__2    4903
Name: count, dtype: int64

Next, we split the dataset into training and testing sets so that we can train and test the classifier. We also encode the target column so that it can be used in machine learning models.

In [7]:
# train_test_split(): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# split the dataset into training and testing datasets: 75% for training, 25% for testing by default; pass e.g. test_size=0.30 to change the ratio
# stratify: data is split in a stratified fashion
# random_state: the seed used by the random number generator
train_x, test_x, train_y, test_y = model_selection.train_test_split(trainDF['text'], trainDF['label'], 
                                                                      random_state=2, stratify=trainDF['label'])
print("Train_y: ")
print(train_y.value_counts(), '\n')
print("Test_y: ")
print(test_y.value_counts())
# label encode the target variable 
encoder = preprocessing.LabelEncoder()    # Encode target labels with value between 0 and n_classes-1.
train_y = encoder.fit_transform(train_y)  # Fit label encoder and return encoded labels.
test_y = encoder.transform(test_y)

Train_y: 
label
__label__1    3823
__label__2    3677
Name: count, dtype: int64 

Test_y: 
label
__label__1    1274
__label__2    1226
Name: count, dtype: int64
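
The counts above show that the stratified split keeps roughly the same class ratio in both parts. A small sketch of that effect on made-up data (the `toy` dataframe and the 60/40 class mix are assumptions chosen only to make the ratio easy to check):

```python
import pandas
from sklearn import model_selection

# Toy data (made up): 60 reviews of one class and 40 of the other.
toy = pandas.DataFrame({
    "text": ["review %d" % i for i in range(100)],
    "label": ["__label__1"] * 60 + ["__label__2"] * 40,
})

tr_x, te_x, tr_y, te_y = model_selection.train_test_split(
    toy["text"], toy["label"], random_state=2, stratify=toy["label"])

# The 60/40 ratio is kept in both parts of the split.
print(tr_y.value_counts().to_dict())
print(te_y.value_counts().to_dict())
```

Without `stratify`, the split is purely random, so the class ratio in the test set can drift away from the ratio in the full dataset.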


In [8]:
print(encoder.classes_)  # __label__1 becomes 0, __label__2 becomes 1

['__label__1' '__label__2']
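
The encoder can also map integer codes back to the original label strings, which is handy when reporting predictions. A minimal sketch on made-up labels (the names `toy_labels` and `toy_encoder` are illustrative only):

```python
from sklearn import preprocessing

toy_labels = ["__label__2", "__label__1", "__label__2"]  # made-up labels

toy_encoder = preprocessing.LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)   # classes are sorted, then numbered from 0
decoded = toy_encoder.inverse_transform(encoded)  # integer codes back to the original strings

print(list(toy_encoder.classes_))
print(encoded.tolist())
print(list(decoded))
```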


In [9]:
train_y[0:20]  # 0 means negative review; 1 means positive review

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1])

In [10]:
train_y.shape

(7500,)

In [11]:
test_y[0:20]

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1])

In [12]:
test_y.shape

(2500,)

## 2. Feature Engineering
The next step is feature engineering: raw text data is transformed into feature vectors, and new features are created from the existing dataset. We will implement the following ideas to obtain relevant features from our dataset.

### 2.1 Count Vectors as features
[Count Vector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a matrix representation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document.

In [20]:
# create a count vectorizer object: 
# analyzer: whether the feature should be made of word or character n-grams.
# token_pattern: regular expression denoting what constitutes a "token"; r'\w{1,}' matches runs of one or more word characters, so single-character tokens are kept.
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

# fit the count vectorizer on the training data only, then transform both the training and test data
xtrain_count =  count_vect.fit_transform(train_x)
xtest_count =  count_vect.transform(test_x)

In [21]:
print(count_vect.get_feature_names_out())

['0' '000' '001' ... 'zzzzzzzzzzzzz' 'zzzzzzzzzzzzzzzzzz'
 'zzzzzzzzzzzzzzzzzzzzz']


In [22]:
print(sorted([(v, k) for k, v in count_vect.vocabulary_.items()]))  # ordered (column id, term) pairs



In [23]:
print(xtrain_count.shape) # or print(xtrain_count.toarray().shape)

(7500, 31226)


### 2.2 TF-IDF Vectors as features

[TF-IDF Vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) can be generated at different levels of input tokens (words, n-grams, and characters):

a. Word Level TF-IDF: matrix representing the tf-idf scores of every term in different documents

b. N-gram Level TF-IDF: N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams

c. Character Level TF-IDF: matrix representing the tf-idf scores of character-level n-grams in the corpus

In [25]:
# word level tf-idf
# max_features: if not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
xtrain_tfidf =  tfidf_vect.fit_transform(train_x)
xtest_tfidf =  tfidf_vect.transform(test_x)

# n-gram level tf-idf (unigrams and bigrams)
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1,2), max_features=5000)
xtrain_tfidf_ngram =  tfidf_vect_ngram.fit_transform(train_x)
xtest_tfidf_ngram =  tfidf_vect_ngram.transform(test_x)

# character level tf-idf (character bigrams and trigrams)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.fit_transform(train_x) 
xtest_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(test_x) 

In [26]:
print(tfidf_vect_ngram.get_feature_names_out()[0:1000])  # both unigram and bigram terms

['0' '1' '1 star' '10' '100' '11' '12' '13' '14' '15' '16' '17' '18'
 '1984' '1984 is' '1st' '2' '2 and' '2 stars' '20' '2000' '24' '25' '2nd'
 '3' '30' '30 minutes' '35' '3d' '3rd' '4' '4 stars' '40' '400' '45' '451'
 '5' '5 stars' '50' '6' '6 months' '60' '7' '70' '8' '80' '80 s' '9' '90'
 'a' 'a bad' 'a beautiful' 'a better' 'a big' 'a bit' 'a book' 'a bunch'
 'a cd' 'a chance' 'a character' 'a cheap' 'a child' 'a class' 'a classic'
 'a complete' 'a copy' 'a couple' 'a day' 'a decent' 'a different'
 'a disappointment' 'a dvd' 'a fan' 'a fantastic' 'a favor' 'a few'
 'a film' 'a friend' 'a full' 'a fun' 'a gift' 'a good' 'a great'
 'a group' 'a half' 'a hard' 'a high' 'a horror' 'a huge' 'a joke' 'a kid'
 'a large' 'a little' 'a long' 'a look' 'a lot' 'a major' 'a man'
 'a masterpiece' 'a month' 'a more' 'a movie' 'a much' 'a must' 'a new'
 'a nice' 'a novel' 'a number' 'a perfect' 'a person' 'a piece' 'a pretty'
 'a problem' 'a real' 'a really' 'a regular' 'a replacement' 'a review'

In [27]:
print(tfidf_vect_ngram_chars.get_feature_names_out()[0:1000]) # both bigram and trigram terms

[' 0' ' 0 ' ' 1' ' 1 ' ' 10' ' 11' ' 12' ' 13' ' 14' ' 15' ' 16' ' 17'
 ' 18' ' 19' ' 1s' ' 2' ' 2 ' ' 20' ' 21' ' 23' ' 24' ' 25' ' 26' ' 2n'
 ' 3' ' 3 ' ' 30' ' 32' ' 34' ' 35' ' 36' ' 3d' ' 3r' ' 4' ' 4 ' ' 40'
 ' 45' ' 4t' ' 5' ' 5 ' ' 50' ' 51' ' 6' ' 6 ' ' 60' ' 65' ' 7' ' 7 '
 ' 70' ' 75' ' 8' ' 8 ' ' 80' ' 9' ' 9 ' ' 90' ' 99' ' a' ' a ' ' ab'
 ' ac' ' ad' ' ae' ' af' ' ag' ' ah' ' ai' ' al' ' am' ' an' ' ap' ' ar'
 ' as' ' at' ' au' ' av' ' aw' ' ay' ' b' ' b ' ' ba' ' bd' ' be' ' bi'
 ' bl' ' bo' ' br' ' bu' ' by' ' c' ' c ' ' ca' ' cd' ' ce' ' cg' ' ch'
 ' ci' ' cl' ' co' ' cr' ' cu' ' cy' ' d' ' d ' ' da' ' de' ' dh' ' di'
 ' do' ' dr' ' du' ' dv' ' dy' ' e' ' e ' ' ea' ' eb' ' ec' ' ed' ' ef'
 ' eg' ' ei' ' el' ' em' ' en' ' ep' ' eq' ' er' ' es' ' et' ' eu' ' ev'
 ' ex' ' ey' ' f' ' fa' ' fe' ' fi' ' fl' ' fo' ' fr' ' fu' ' fw' ' g'
 ' g4' ' ga' ' gb' ' ge' ' gh' ' gi' ' gl' ' go' ' gr' ' gu' ' gy' ' h'
 ' ha' ' hd' ' he' ' hi' ' ho' ' hp' ' ht' ' hu' ' hy' ' i' ' i ' ' i

## 3. Model Building
The final step in the text classification framework is to train a classifier using the features created in the previous step. Many different machine learning models can be used to train the final model. We will use a Naive Bayes classifier for this purpose.



The following utility function trains a model. It accepts the classifier, the feature vectors of the training data, the labels of the training data, and the feature vectors of the test data as inputs. Using these inputs, the model is trained and its accuracy on the test set is computed. The test labels are read from the global test_y defined above.

In [30]:
def train_model(classifier, feature_vector_train, label, feature_vector_test):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on test dataset
    predictions = classifier.predict(feature_vector_test)
    
    # note: relies on the global test_y; accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(test_y, predictions)
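
Because train_model() reads test_y from the notebook's global scope, it only works after the split cell has run. One possible variant, sketched here under the hypothetical name `train_and_score`, takes the test labels as an explicit argument; the toy data is made up only to show the call.

```python
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

def train_and_score(classifier, x_train, y_train, x_test, y_test):
    """Fit on the training split and return accuracy on the test split."""
    classifier.fit(x_train, y_train)
    predictions = classifier.predict(x_test)
    return metrics.accuracy_score(y_test, predictions)

# Made-up toy data, only to show the call.
toy_train = ["good great", "great fun", "bad awful", "awful boring"]
toy_train_y = [1, 1, 0, 0]
toy_test = ["great", "awful"]
toy_test_y = [1, 0]

vect = CountVectorizer()
score = train_and_score(naive_bayes.MultinomialNB(),
                        vect.fit_transform(toy_train), toy_train_y,
                        vect.transform(toy_test), toy_test_y)
print(score)
```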

### 3.1 Implementing a Naive Bayes model using sklearn with different features

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
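
To see what the independence assumption buys, the sketch below recomputes the class probabilities of a multinomial Naive Bayes model by hand: the score of a class is its log prior plus the sum of per-term log probabilities, each weighted by the term count. The count vectors are made up, and the smoothing value 1.0 is assumed to match the default `alpha` of `MultinomialNB`.

```python
import numpy as np
from sklearn import naive_bayes

# Made-up count vectors: rows are documents, columns are terms.
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([1, 1, 0, 0])
x_new = np.array([[1, 0, 1]])

clf = naive_bayes.MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
clf.fit(X, y)

# Manual computation: log P(c) + sum_i x_i * log P(term_i | c)
log_joint = []
for c in clf.classes_:
    term_counts = X[y == c].sum(axis=0)
    log_term_prob = np.log((term_counts + 1.0) / (term_counts.sum() + 1.0 * X.shape[1]))
    log_prior = np.log(np.mean(y == c))
    log_joint.append(log_prior + (x_new[0] * log_term_prob).sum())
log_joint = np.array(log_joint)

# Normalize the joint scores into probabilities that sum to 1.
manual_proba = np.exp(log_joint - np.logaddexp.reduce(log_joint))

print(manual_proba)
print(clf.predict_proba(x_new)[0])
```

Both printed vectors should agree, since each term contributes to the score independently of the others.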

In [32]:
# Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xtest_count)
print ("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xtest_tfidf)
print ("NB, WordLevel TF-IDF Vectors: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
print ("NB, N-Gram TF-IDF Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
print ("NB, CharLevel TF-IDF Vectors: ", accuracy)

NB, Count Vectors:  0.8432
NB, WordLevel TF-IDF Vectors:  0.852
NB, N-Gram TF-IDF Vectors:  0.8712
NB, CharLevel TF-IDF Vectors:  0.8216


### 3.2 Build Naive Bayes Model on Count Vectors without using the train_model() function

In [34]:
### Create and run Classifier

classifier = naive_bayes.MultinomialNB()

### Fitting requires training count vectors and labels
classifier.fit(xtrain_count, train_y)

### xtest_count is the transformed test count vectors
predictions = classifier.predict(xtest_count)

### test_x holds the 25% of the reviews that form the test set
for record, category in zip(test_x, predictions): 
    print('%r => %s' % (record, category))  # 0 means negative review; 1 means positive review

'Not what I expected Not what I expected Too much acting Disappointing to say the least I will not buy the like again Oh well' => 0
'halti I have a 100 lb lab australian shepherd mix When we are out walking he tends to forget that I exist whenever he sees a new dog and he is strong enough to pull me over I have tried several different approaches including a choke chain without a lot of successThe halti has worked better than anything else I have tried He has been able to remove it a couple of times but since it ended up around his neck he was not able to get away from me' => 0
'West Coast Avant Garde Listen to the sound bite before you buy I did not but I should have This is way to modern for me I like cool west coast jazz but this is sort of Ornette Coleman meets Karlheinz Stockhausen very modern in tone If that is your bag the album is for you otherwise pass this one by' => 1
'Good Concept VERY VERY Flimsy I got this for my 5 year old I loved the concept unfortunatley I never seen su

In [35]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd
cm = confusion_matrix(test_y, predictions)
cm_df = pd.DataFrame(cm.T, index=classifier.classes_, columns=classifier.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

True          0    1
Predicted           
0          1130  248
1           144  978


In [36]:
print(classification_report(test_y,predictions))  

              precision    recall  f1-score   support

           0       0.82      0.89      0.85      1274
           1       0.87      0.80      0.83      1226

    accuracy                           0.84      2500
   macro avg       0.85      0.84      0.84      2500
weighted avg       0.85      0.84      0.84      2500
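
The per-class figures in the report can be recovered from the confusion matrix counts printed earlier. A short sketch of that arithmetic (the variable names are illustrative; the four counts are copied from the matrix):

```python
# Counts copied from the confusion matrix above (rows: predicted, columns: true).
pred0_true0, pred0_true1 = 1130, 248
pred1_true0, pred1_true1 = 144, 978

# Class 1 (positive reviews)
precision_1 = pred1_true1 / (pred1_true0 + pred1_true1)  # of all predicted 1, the share that is truly 1
recall_1 = pred1_true1 / (pred0_true1 + pred1_true1)     # of all true 1, the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (negative reviews)
precision_0 = pred0_true0 / (pred0_true0 + pred0_true1)
recall_0 = pred0_true0 / (pred0_true0 + pred1_true0)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1
accuracy = (pred0_true0 + pred1_true1) / total

print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(accuracy)
```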



The weighted average weights each class's score by its support (the number of true instances of that class). Weighting by class frequency may give a better estimate of overall performance when the class frequencies differ widely.

In [38]:
(0.85*1274 + 0.83*1226) / (1274 + 1226) # weighted average f1-score, using the rounded per-class f1 values from the report

0.8401919999999998

In [39]:
print(accuracy_score(test_y, predictions))  

0.8432


In [40]:
### Cross-validation: repeatedly fit and evaluate the classifier on different partitions
### of the training data to obtain an average score.

### uses only the training data (already transformed to count vectors)
### cross_val_score: with scoring=None the estimator's default scorer is used (accuracy for classifiers)
scores = model_selection.cross_val_score(classifier, xtrain_count, train_y, cv=10)
print(numpy.mean(scores), scores)

0.8388 [0.82       0.84133333 0.82933333 0.81866667 0.824      0.84
 0.85866667 0.84533333 0.84666667 0.864     ]
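
What cross_val_score does for a classifier can be sketched as an explicit loop: split the data into stratified folds, fit a fresh copy of the model on all folds but one, and score it on the held-out fold. The random count features below are made up, and the loop is an illustration of the idea rather than the library's internal code.

```python
import numpy
from sklearn import model_selection, naive_bayes
from sklearn.base import clone

# Made-up non-negative count features and binary labels.
rng = numpy.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 6))
y = numpy.array([0, 1] * 20)

clf = naive_bayes.MultinomialNB()
folds = model_selection.StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_clf = clone(clf)                     # fresh, unfitted copy for every fold
    fold_clf.fit(X[train_idx], y[train_idx])  # train on 4 of the 5 parts
    manual_scores.append(fold_clf.score(X[val_idx], y[val_idx]))  # accuracy on the held-out part

scores = model_selection.cross_val_score(clf, X, y, cv=5)
print(numpy.allclose(manual_scores, scores))
```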


### 3.3 Grid Search - improve performance through grid search of parameters

In [42]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB())
])

# binary in TfidfVectorizer(): if True, all non-zero term counts are set to 1. 
# This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary.
parameters = {
    #'vect__max_df': (0.1, 0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__lowercase': (True, False),
    'vect__binary': (True, False),
    'vect__max_features': (5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    #'vect__use_idf': (True, False),
    #'vect__norm': ('l1', 'l2')
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(train_x, train_y)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(test_x)
    print('Accuracy:', accuracy_score(test_y, predictions))
    print('Precision:', precision_score(test_y, predictions))  # result for the positive class with average='binary' (default); use average='weighted' for weighted average scores
                                                                # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
    print('Recall:', recall_score(test_y, predictions))        # result for the positive class
    print('F1_score:', f1_score(test_y, predictions))          # result for the positive class

Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best score: 0.865
Best parameters set:
	vect__binary: True
	vect__lowercase: True
	vect__max_features: 10000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
Accuracy: 0.8764
Precision: 0.8804979253112033
Recall: 0.865415986949429
F1_score: 0.8728918140682846
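
Besides the single best setting, the fitted search object keeps the scores of every parameter combination in `cv_results_`. A small sketch on made-up reviews (the `toy_` names, the tiny grid, and `cv=2` are assumptions chosen to keep the example fast):

```python
import pandas as pd
from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy reviews: 1 = positive, 0 = negative.
toy_x = ["great book", "great fun movie", "good fun", "really good book",
         "bad book", "boring bad movie", "awful plot", "really awful movie"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ('vect', TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB()),
])
# '<step name>__<parameter>' addresses a parameter of one pipeline step
toy_parameters = {
    'vect__binary': (True, False),
    'vect__ngram_range': ((1, 1), (1, 2)),
}

toy_search = GridSearchCV(toy_pipeline, toy_parameters, scoring='accuracy', cv=2)
toy_search.fit(toy_x, toy_y)

# One row per parameter combination, best first.
results = pd.DataFrame(toy_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))
```

Putting the vectorizer inside the pipeline means it is refitted on the training part of each fold, so the held-out part never influences the vocabulary.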


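If you run your module (the source file) as the main program, the interpreter assigns the hard-coded string `"__main__"` to the `__name__` variable, so the block under `if __name__ == "__main__":` is executed. When the module is imported instead, `__name__` is set to the module's name and the block is skipped.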
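
A minimal sketch of this behaviour (the helper `describe` is a made-up name used only for illustration):

```python
def describe(name):
    """Say how a module with this __name__ value is being used."""
    return "run as the main program" if name == "__main__" else "imported as a module"

print(describe(__name__))

if __name__ == "__main__":
    # reached when the file is run directly
    print("running the main block")
```

Inside a Jupyter notebook, `__name__` is also `"__main__"`, so the guarded block in the grid search cell runs as usual.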