<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification
INSTRUCTIONS:
- Run the cells
- Observe and understand the results
- Answer the questions

## Import libraries

In [2]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
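
Because every line starts with a `__label__N` prefix, the label and the text can also be separated with plain string handling. The following is a minimal sketch, assuming each line follows the `__label__<digit> <text>` layout shown in the sample; the helper name `parse_line` and the shortened example lines are illustrative and not part of the lab.

```python
def parse_line(line):
    """Split '__label__N some text' into (N - 1, 'some text')."""
    prefix, _, text = line.rstrip("\n").partition(" ")
    # The label digit is the last character of the prefix.
    label = int(prefix[-1]) - 1  # map [1, 2] to [0, 1]
    return label, text

# Shortened example lines in the same layout as the sample above.
sample = [
    "__label__2 Amazing!: This soundtrack is my favorite",
    "__label__1 Buyer beware: This is a self-published book",
]
parsed = [parse_line(line) for line in sample]
print(parsed)
```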

In [3]:
## Loading the data

trainDF = pd.read_fwf(
    filepath_or_buffer = '../../DATA/corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: make it wide enough to reach the end of the line
               ], 
    header = 0,
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
trainDF['label'] = trainDF['label'] - 1

## Inspect the data

In [4]:
print(trainDF.info())
print(trainDF.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB
None
      label                                               text
6410      0  Doesn't fit!: The bolt holes don't line up. Th...
2225      0  the movie was better: Usually the book is alwa...
5970      0  Embarrassing but No Disrespect meant: I was en...
1856      0  Waste of time.: I bought this video while thin...
3422      0  I appreciate what he attempted, but...: I appr...
7062      1  One Of My Wife's Favorites: My wife really lov...
9021      0  Doesn't work! Cheaply made.: Bought this as on...
5334      0  Does not follow the book much. Had to insert a...
8358      1  a good memory: I got this card and I am comple...
9544      1  Works great with Windows 7: Just got this card...


In [5]:
trainDF.shape

(9999, 2)

## Split the data into train and test

In [6]:
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(trainDF['text'], trainDF['label'], test_size = 0.2, random_state = 42)

## Feature Engineering

### Count Vectors as features

In [7]:
# create a count vectorizer object (bag of words)
count_vect = CountVectorizer(token_pattern = r'\w{1,}') # match runs of word characters (letters, digits, underscore); punctuation is dropped

# Learn a vocabulary dictionary of all tokens in the raw documents (note: fitted on the full corpus, test split included)
count_vect.fit(trainDF['text'])

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

In [8]:
X_train_count.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
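
The dense view above is almost entirely zeros, so it says little about what the vectorizer learned. As a small illustration, assuming a made-up two-document corpus (not taken from the lab data), the sketch below shows how `CountVectorizer` maps tokens to columns and fills in the counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good sound, good price", "bad sound"]  # toy corpus, not lab data

# Same token pattern as the lab: one or more word characters.
vect = CountVectorizer(token_pattern=r"\w{1,}")
dtm = vect.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort by index for display.
vocab_index = dict(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(vocab_index)
print(dtm.toarray())  # rows are documents, columns are tokens
```

On the full corpus it is usually cheaper to inspect `X_train_count.shape` than to call `.todense()`, which materialises every zero.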

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level
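
To make the three levels concrete, here is a minimal sketch on a made-up two-document corpus. It only prints the feature names each vectorizer learns; the dictionary names `vectorizers` and `vocabularies` are chosen for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["not good", "very good"]  # toy corpus, not lab data

vectorizers = {
    "word": TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}"),
    "n-gram": TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", ngram_range=(2, 2)),
    "char": TfidfVectorizer(analyzer="char", ngram_range=(2, 2)),
}

# Fit each vectorizer and collect its learned feature names.
vocabularies = {name: sorted(v.fit(docs).vocabulary_) for name, v in vectorizers.items()}

for name, terms in vocabularies.items():
    print(name, terms)
```

Word-level features are single tokens, n-gram features are token sequences, and character-level features are short character sequences that may include spaces.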

In [9]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(trainDF['text'])
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
Wall time: 1.16 s


In [10]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(trainDF['text'])
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
Wall time: 6.16 s


In [11]:
%%time
# character level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         token_pattern = r'\w{1,}',  # ignored: only used when analyzer = 'word'
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(trainDF['text'])
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3),
                token_pattern='\\w{1,}')
Wall time: 8.43 s


### Text / NLP based features

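Create some additional features from the raw text:

- Char Count = number of characters in the text
- Word Count = number of whitespace-separated words in the text
- Word Density = average number of characters per word (computed below as `char_count / (word_count + 1)`)
- Punctuation Count = number of punctuation characters in the text
- Title Word Count = number of title-case words in the text
- Uppercase Word Count = number of all-uppercase words in the text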
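
As a quick check of these definitions, the sketch below computes each feature for a single made-up review string (the string is an assumption for illustration, not a row from the dataset).

```python
import string

text = "AWFUL!: This was the worst Movie I have ever seen."  # made-up example

words = text.split()
features = {
    "char_count": len(text),
    "word_count": len(words),
    # +1 in the denominator mirrors the lab's formula.
    "word_density": len(text) / (len(words) + 1),
    "punctuation_count": sum(ch in string.punctuation for ch in text),
    "title_word_count": sum(w.istitle() for w in words),
    "uppercase_word_count": sum(w.isupper() for w in words),
}
print(features)
```

Note that a one-letter word such as `I` satisfies both `istitle()` and `isupper()`, so it is counted in both features.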

In [12]:
%%time
trainDF['char_count'] = trainDF['text'].apply(len)
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count'] + 1)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len(''.join(_ for _ in x if _ in string.punctuation))) 
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([w for w in x.split() if w.istitle()]))
trainDF['uppercase_word_count'] = trainDF['text'].apply(lambda x: len([w for w in x.split() if w.isupper()]))

Wall time: 530 ms


In [13]:
trainDF.sample(5)

| | label | text | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count |
|---|---|---|---|---|---|---|---|---|
| 7572 | 0 | Cute but painful: Ugh. I was so excited to get... | 387 | 78 | 4.898734 | 10 | 10 | 3 |
| 4821 | 0 | Ida B. Wells: Woman of Courage: I was very eag... | 310 | 58 | 5.254237 | 11 | 11 | 3 |
| 7422 | 0 | AWFUL!: This was the worst movie I have ever s... | 603 | 115 | 5.198276 | 16 | 13 | 8 |
| 6936 | 1 | Thought-provoking and at times chilling: Many ... | 915 | 164 | 5.545455 | 26 | 11 | 0 |
| 8753 | 1 | I want to be Shirley Manson!: This group has b... | 279 | 55 | 4.982143 | 7 | 9 | 3 |


In [14]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part of Speech in **SpaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Count the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns, and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter 

In [15]:
# Initialise some columns for feature counts
trainDF['adj_count'] = 0
trainDF['adv_count'] = 0
trainDF['noun_count'] = 0
trainDF['num_count'] = 0
trainDF['pron_count'] = 0
trainDF['propn_count'] = 0
trainDF['verb_count'] = 0

In [16]:
trainDF.head(3)

| | label | text | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... | 509 | 97 | 5.193878 | 14 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... | 760 | 129 | 5.846154 | 40 | 24 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... | 743 | 118 | 6.243697 | 33 | 52 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [17]:
%%time
# for each text
for i in range(trainDF.shape[0]):
    # convert into a spaCy document
    doc = nlp(trainDF.iloc[i]['text'])
    # count how often each part-of-speech tag occurs in the document
    c = Counter([token.pos_ for token in doc])

    trainDF.at[i, 'adj_count'] = c['ADJ']
    trainDF.at[i, 'adv_count'] = c['ADV']
    trainDF.at[i, 'noun_count'] = c['NOUN']
    trainDF.at[i, 'num_count'] = c['NUM']
    trainDF.at[i, 'pron_count'] = c['PRON']
    trainDF.at[i, 'propn_count'] = c['PROPN']
    trainDF.at[i, 'verb_count'] = c['VERB']

Wall time: 2min 44s


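Pandas `at[]` reads or writes a single value in a dataframe. The location is given by labels, in the format `[row label, column name]`. In the loop above, `i` works as a row label because `trainDF` has the default `RangeIndex`, where labels and positions coincide.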
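
The difference between label-based and position-based access only shows up when the index is not `0, 1, 2, ...`. The sketch below uses a small made-up dataframe with a non-default index to contrast `at[]` with its positional counterpart `iat[]`.

```python
import pandas as pd

# Toy frame with a non-default index so labels and positions differ.
df = pd.DataFrame({"adj_count": [0, 0, 0]}, index=[10, 20, 30])

df.at[20, "adj_count"] = 7   # row *label* 20, column label 'adj_count'
df.iat[0, 0] = 5             # row *position* 0, column position 0

print(df)
```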

In [18]:
c = Counter([token.pos_ for token in doc])
c

Counter({'PROPN': 13,
         'CCONJ': 5,
         'PUNCT': 12,
         'DET': 8,
         'NOUN': 15,
         'AUX': 9,
         'VERB': 10,
         'ADV': 4,
         'PRON': 17,
         'ADJ': 7,
         'SCONJ': 1,
         'ADP': 14,
         'PART': 2})

In [19]:
c['ADJ']

7
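
One detail the loop relies on: a `Counter` returns 0 for a key it has never seen, so tags that do not occur in a document (such as `NUM` in the output above) can be read without raising a `KeyError`. A minimal sketch, using a hand-written tag list in place of `[token.pos_ for token in doc]`:

```python
from collections import Counter

# Hand-written tags standing in for the spaCy output.
tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "PUNCT", "ADJ"]
c = Counter(tags)

print(c["ADJ"])  # tag present in the list
print(c["NUM"])  # tag absent: Counter returns 0 instead of raising KeyError
```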

In [20]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

trainDF[cols].sample(5)

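| | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1577 | 326 | 59 | 5.433333 | 8 | 6 | 2 | 5 | 3 | 9 | 1 | 5 | 3 | 8 |
| 1653 | 147 | 27 | 5.25 | 7 | 12 | 2 | 4 | 0 | 3 | 0 | 3 | 4 | 1 |
| 2327 | 247 | 42 | 5.744186 | 4 | 7 | 1 | 0 | 3 | 7 | 0 | 5 | 6 | 5 |
| 3484 | 412 | 83 | 4.904762 | 17 | 10 | 4 | 6 | 2 | 19 | 2 | 6 | 2 | 6 |
| 1345 | 546 | 92 | 5.870968 | 25 | 5 | 1 | 16 | 12 | 18 | 0 | 10 | 1 | 8 |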


### Topic Models as features

**Unsupervised approach**

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus.
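
Before running LDA on the full corpus, it can help to see the shapes it produces on something tiny. The sketch below is an illustration on a made-up four-document corpus; `n_components=2` and `random_state=0` are assumptions chosen for the example, not the lab's settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy corpus, not lab data
    "great sound great music",
    "music and sound quality",
    "boots fit well",
    "shoes and boots too small",
]

counts = CountVectorizer().fit_transform(docs)

# random_state is set only so the sketch is repeatable.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)        # (documents, topics)
print(lda.components_.shape)   # (topics, vocabulary size)
print(doc_topics.sum(axis=1))  # each row is a distribution over topics
```

Each row of `doc_topics` can be used as a feature vector for a document, and each row of `components_` holds one topic's weights over the vocabulary.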

In [21]:
# Instantiate LDA model
lda_model = LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20) # n_components = number of topics

In [22]:
X_train_count.shape

(7999, 31661)

In [23]:
%%time
# train LDA model on BOW
X_topics = lda_model.fit_transform(X_train_count)

Wall time: 1min 10s


In [24]:
print(X_topics.shape)
X_topics

(7999, 20)


array([[0.0002907 , 0.0002907 , 0.02936047, ..., 0.0002907 , 0.08624418,
        0.0002907 ],
       [0.00059524, 0.00059524, 0.00059524, ..., 0.00059524, 0.02444083,
        0.00059524],
       [0.00033557, 0.00033557, 0.00033557, ..., 0.00033557, 0.00033557,
        0.00033557],
       ...,
       [0.00031447, 0.00031447, 0.05691824, ..., 0.00031447, 0.00031447,
        0.00660377],
       [0.00119048, 0.00119048, 0.00119048, ..., 0.00119048, 0.00119048,
        0.00119048],
       [0.0004386 , 0.0004386 , 0.01073809, ..., 0.0004386 , 0.0004386 ,
        0.0004386 ]])

In [25]:
topic_word = lda_model.components_ 
print(topic_word.shape)
topic_word

(20, 31661)


array([[0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05000001, ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       ...,
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05000001, 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ]])

In [26]:
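# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = count_vect.get_feature_names_out().tolist()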
vocab



['0',
 '00',
 '000',
 '001',
 '002',
 '00290',
 '007',
 '0070412901',
 '0072316373',
 '008',
 '00now',
 '00yeah',
 '01',
 '011',
 '02',
 '03',
 '04',
 '05',
 '052',
 '06',
 '06rate',
 '07',
 '088',
 '09',
 '0s9',
 '0sx',
 '1',
 '10',
 '100',
 '1000',
 '1000amp',
 '1000s',
 '1000uf',
 '1001',
 '100m',
 '100th',
 '100this',
 '101',
 '10162',
 '102',
 '1020',
 '1021',
 '103',
 '1030pm',
 '104',
 '1048259',
 '105',
 '1058',
 '1059',
 '107',
 '108',
 '1080p',
 '10bm',
 '10ft',
 '10gameplay',
 '10gb',
 '10i',
 '10mb',
 '10min',
 '10movie',
 '10overall',
 '10p',
 '10sound',
 '10th',
 '10the',
 '10this',
 '11',
 '110',
 '1100',
 '110e',
 '1115',
 '11223',
 '114622',
 '115',
 '116',
 '119',
 '11b',
 '11g',
 '11i',
 '11lbs',
 '11movie',
 '11th',
 '12',
 '120',
 '1200',
 '120100',
 '120lbs',
 '120s',
 '120v',
 '121',
 '1215',
 '123',
 '125',
 '128',
 '128mb',
 '129',
 '12th',
 '12x',
 '13',
 '130',
 '133x',
 '1340s',
 '135lbs',
 '1394',
 '13th',
 '13w',
 '13x',
 '14',
 '140',
 '140lbs',
 '144',
 

In [27]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 effects max special mp3 force office van finding catholic brando
    1 memory error zen diabetes sharp blocks attractive control violence diet
    2 manson rice haiku bugliosi cooker studies member wayne references charles
    3 scent exuviance varies completly duo qing gopichand bela llamas males
    4 economics japanese government scared theme publisher moment literature wanting higgins
    5 edition hollywood versions diane lane worthy tales gay drivel paperback
    6 richard emma stockings schure wagner questo prayer strauss è draft
    7 cave shame food winston bear ayla clan exam cooking alive
    8 boots these weight pair boot shoes anywhere source them smooth
    9 i the it to and a my for not this
   10 costs cash hip label et mill hop funk slip delicious
   11 the and a of i to this is it in
   12 thrash steer nuclear yea dede voivod gillain obtain funniest fifty
   13

**Dissect function**

In [28]:
for i, topic_dist in enumerate(topic_word): # iterate through 20 rows
    print(topic_dist)

[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05       0.05       0.05000001 ... 0.05       0.05       0.05      ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05       0.05       0.05       ... 0.05       0.05000001 0.05      ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05       0.05       0.05       ... 0.05       0.05000001 0.05      ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[22.30335756  0.05000001  0.05000003 ...  0.05        0.05
  0.05      ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[5.70724328e+01 3.81595565e+01 2.43159917e+01 ... 5.00000002e-02
 5.00000004e-02 5.00000006e-02]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05       0.05       0.05       ... 0.05       1.20707175 1.0485002 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[0.05000001 0.05       0.05       ... 0.05       0.05       0.05      ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05]


In [29]:
# most important line
topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1] # [start:stop:step]

In [30]:
np.array(vocab)

array(['0', '00', '000', ..., 'éviter', 'última', 'única'], dtype='<U76')

In [31]:
np.array(vocab)[-1]

'única'

In [32]:
topic_word[0]

array([0.05, 0.05, 0.05, ..., 0.05, 0.05, 0.05])

`np.argsort()`: Returns the indices that would sort an array
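
The top-words line combines `np.argsort()` with a reversed slice. The sketch below replays it on one made-up row of topic weights over a five-word vocabulary, so each step can be followed by hand; the words and weights are invented for illustration.

```python
import numpy as np

vocab = np.array(["bad", "boots", "good", "music", "sound"])
weights = np.array([0.05, 3.2, 0.4, 7.5, 1.1])  # one made-up topic row

order = np.argsort(weights)  # indices from smallest to largest weight
n_top_words = 2

# Reorder the vocabulary by weight, then walk backwards from the end
# to take the n_top_words heaviest entries, largest first.
top_words = vocab[order][:-(n_top_words + 1):-1]

print(order)
print(top_words)
```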

In [34]:
np.argsort(topic_word)

array([[13548, 20623, 16659, ..., 26301, 17532,  9391],
       [18250,  8527, 14694, ..., 31578,  9990, 17768],
       [26944, 29562,  7981, ..., 12927, 23796, 17282],
       ...,
       [ 2695, 31564, 10022, ..., 24651, 16516, 12496],
       [24503, 22420,  8024, ...,  1435, 18652,  4928],
       [24699,  6364, 19033, ..., 30378, 26655, 17073]], dtype=int64)

In [35]:
np.argsort(topic_word[0])

array([13548, 20623, 16659, ..., 26301, 17532,  9391], dtype=int64)

In [36]:
np.argsort(topic_word[0])[0]

13548

In [37]:
np.array(vocab)[np.argsort(topic_word[0])][:-11:-1] # step -1 reverses the order

array(['effects', 'max', 'special', 'mp3', 'force', 'office', 'van',
       'finding', 'catholic', 'brando'], dtype='<U76')

In [38]:
topic_summaries

['effects max special mp3 force office van finding catholic brando',
 'memory error zen diabetes sharp blocks attractive control violence diet',
 'manson rice haiku bugliosi cooker studies member wayne references charles',
 'scent exuviance varies completly duo qing gopichand bela llamas males',
 'economics japanese government scared theme publisher moment literature wanting higgins',
 'edition hollywood versions diane lane worthy tales gay drivel paperback',
 'richard emma stockings schure wagner questo prayer strauss è draft',
 'cave shame food winston bear ayla clan exam cooking alive',
 'boots these weight pair boot shoes anywhere source them smooth',
 'i the it to and a my for not this',
 'costs cash hip label et mill hop funk slip delicious',
 'the and a of i to this is it in',
 'thrash steer nuclear yea dede voivod gillain obtain funniest fifty',
 'lame cat descent suspense professional lord creepy henry tomcat structure',
 'christ minds jesus cop souls titan elvis sooo criticiz

## Modelling

**Supervised approach**

In [67]:
## helper function
def train_model(classifier, X_train, y_train, X_test):
    # fit the classifier on the training dataset
    classifier.fit(X_train, y_train)

    # predict the labels of the test dataset
    predictions = classifier.predict(X_test)

    # y_test is read from the global scope; accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)

Note! `y_test` is not passed to the function as an argument. Because it is defined in the global scope (i.e., outside of any function), the function can still read it. As a result, the function always scores against the global `y_test`, whatever test features are passed in.
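
Relying on a global makes the helper harder to reuse with a different split. One alternative, sketched below under the assumption that changing the signature is acceptable, is to pass the test labels explicitly. The name `train_and_score` and the tiny dataset are hypothetical and not part of the lab.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_and_score(classifier, X_train, y_train, X_test, y_test):
    """Fit on the training split and return accuracy on the test split."""
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    return accuracy_score(y_test, predictions)

# Tiny made-up, linearly separable data just to exercise the function.
X_tr, y_tr = [[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]
X_te, y_te = [[0.5], [9.5]], [0, 1]

score = train_and_score(LogisticRegression(), X_tr, y_tr, X_te, y_te)
print(score)
```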

In [40]:
# compile results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [68]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8540

Wall time: 10.1 ms


In [42]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8600

Wall time: 8 ms


In [43]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8400

Wall time: 6 ms


In [44]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8180

Wall time: 31 ms


In [45]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [46]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8520

Wall time: 1.87 s


In [47]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8730

Wall time: 86 ms


In [48]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8360

Wall time: 58 ms


In [49]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8485

Wall time: 395 ms


In [50]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [51]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8345

Wall time: 720 ms


In [52]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8610

Wall time: 70 ms


In [53]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8210

Wall time: 49 ms


In [54]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8570

Wall time: 426 ms


In [55]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [56]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8375

Wall time: 40.7 s


In [57]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8315

Wall time: 12.6 s


In [58]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7810

Wall time: 12.3 s


In [59]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7885

Wall time: 27.2 s


In [60]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [61]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

Wall time: 50.2 s


In [62]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7955

Wall time: 13.2 s


In [63]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7370

Wall time: 8.55 s


In [64]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8020

Wall time: 1min 54s


In [65]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [66]:
results

| | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.854 | 0.86 | 0.84 | 0.818 |
| Logistic Regression | 0.852 | 0.873 | 0.836 | 0.8485 |
| Support Vector Machine | 0.8345 | 0.861 | 0.821 | 0.857 |
| Random Forest | 0.8375 | 0.8315 | 0.781 | 0.7885 |
| Gradient Boosting | 0.799 | 0.7955 | 0.737 | 0.802 |
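
To read the comparison programmatically, the sketch below rebuilds the table from the accuracies reported above and looks up the best feature set per model and the best overall combination. It is an illustration only; the variable names `best_model` and `best_features` are chosen for this example.

```python
import pandas as pd

# Accuracies copied from the results table above.
results = pd.DataFrame(
    {
        "Count Vectors": [0.854, 0.852, 0.8345, 0.8375, 0.799],
        "WordLevel TF-IDF": [0.86, 0.873, 0.861, 0.8315, 0.7955],
        "N-Gram Vectors": [0.84, 0.836, 0.821, 0.781, 0.737],
        "CharLevel Vectors": [0.818, 0.8485, 0.857, 0.7885, 0.802],
    },
    index=["Naïve Bayes", "Logistic Regression", "Support Vector Machine",
           "Random Forest", "Gradient Boosting"],
)

# Best feature set for each model (column label of the row maximum).
print(results.idxmax(axis=1))

# Best overall combination: the row with the highest maximum, then its best column.
best_model = results.max(axis=1).idxmax()
best_features = results.loc[best_model].idxmax()
print(best_model, best_features, results.loc[best_model, best_features])
```

These figures come from a single train/test split, so small differences between neighbouring cells should be read with caution.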




---

© 2023 Institute of Data