<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [2]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import warnings
warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
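
As a rough illustration of the line format above, the sketch below splits one line into a numeric label and its text. The helper name `parse_labelled_line` and the shortened sample string are made up for this example and are not part of the lab code; the mapping of labels 1/2 to 0/1 mirrors the conversion done in the loading cell.

```python
def parse_labelled_line(line):
    """Split a '__label__N text' line into (label, text)."""
    prefix, _, text = line.rstrip("\n").partition(" ")
    if not prefix.startswith("__label__"):
        raise ValueError(f"unexpected line format: {line[:30]!r}")
    # Map the original labels 1/2 to 0/1, as the lab does after loading
    label = int(prefix[len("__label__"):]) - 1
    return label, text


# Hypothetical, shortened sample line in the same format as the corpus
sample = "__label__2 Amazing!: This soundtrack is my favorite"
print(parse_labelled_line(sample))
```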

In [3]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = 'corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: wide enough to reach the end of the line
               ],
    header = 0,
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1

## Inspect the data

In [4]:
# ANSWER
print(df_corpus.head())
print(df_corpus.shape)
print(df_corpus['label'].value_counts(normalize=True))

   label                                               text
0      1  The best soundtrack ever to anything.: I'm rea...
1      1  Amazing!: This soundtrack is my favorite music...
2      1  Excellent Soundtrack: I truly like this soundt...
3      1  Remember, Pull Your Jaw Off The Floor After He...
4      1  an absolute masterpiece: I am quite sure any o...
(9999, 2)
label
0    0.509751
1    0.490249
Name: proportion, dtype: float64


## Split the data into train and test

In [5]:
## ANSWER
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(df_corpus['text'], 
                                                    df_corpus['label'], 
                                                    test_size=0.2, 
                                                    random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Training set size: 7999
Test set size: 2000


## Feature Engineering

### Count Vectors as features

In [None]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = ???

In [6]:
# Create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

print(f"Shape of train count vectors: {X_train_count.shape}")
print(f"Shape of test count vectors: {X_test_count.shape}")

Shape of train count vectors: (7999, 28212)
Shape of test count vectors: (2000, 28212)
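
To make the idea of a count vector concrete, here is a from-scratch sketch on a toy corpus. It is an illustration only, not how `CountVectorizer` is implemented; the helper names and the two toy documents are assumptions for this example. It reuses the lab's token pattern and shows why the train and test matrices have the same number of columns: the vocabulary is learned from the training documents only, and unseen tokens are dropped.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w{1,}")  # same token pattern as the lab


def tokenize(text):
    """Lowercase the text and extract tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(doc, vocabulary):
    """Count known tokens; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(doc))
    # dicts keep insertion order, so this follows the column order
    return [counts.get(tok, 0) for tok in vocabulary]


# Toy training documents (made up for illustration)
train_docs = ["great sound great price", "poor sound"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
print(to_count_vector("great great value", vocabulary))
```

The word `value` never appears in the toy training documents, so it contributes nothing to the vector.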


### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [None]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = ???
X_test_tfidf  = ???

In [7]:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

In [None]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(???)
X_train_tfidf_ngram = ???
X_test_tfidf_ngram  = ???

In [8]:
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram = tfidf_vect_ngram.transform(X_test)
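
The `ngram_range = (2, 3)` setting means features are built from sequences of two or three consecutive units. The sketch below shows which n-grams that produces for words and for characters; the function names and sample strings are made up for illustration, and the code is a simplified picture rather than the vectorizer's actual implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """All runs of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def char_ngrams(text, min_n, max_n):
    """All runs of min_n..max_n consecutive characters, spaces included."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


print(word_ngrams("not worth the money".split(), 2, 3))
print(char_ngrams("not bad", 2, 3))
```

Word n-grams can capture short phrases such as negations, which single words miss. Character n-grams are more tolerant of misspellings.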

In [None]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(???)
X_train_tfidf_ngram_chars = ???
X_test_tfidf_ngram_chars  = ???

In [9]:
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_test)

print(f"Shape of train word level TF-IDF: {X_train_tfidf.shape}")
print(f"Shape of train n-gram level TF-IDF: {X_train_tfidf_ngram.shape}")
print(f"Shape of train character level TF-IDF: {X_train_tfidf_ngram_chars.shape}")

Shape of train word level TF-IDF: (7999, 5000)
Shape of train n-gram level TF-IDF: (7999, 5000)
Shape of train character level TF-IDF: (7999, 5000)
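
For intuition about the values inside these matrices, the following sketch computes TF-IDF weights by hand on two toy documents. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the function name and toy data are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocab, rows) with smoothed-IDF, L2-normalised weights."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vocab = sorted(df)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy, pre-tokenised documents (made up for illustration)
docs = [["great", "sound"], ["poor", "sound"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for row in vectors:
    print([round(w, 3) for w in row])
```

The token `sound` occurs in both documents, so it receives a lower weight than `great` or `poor`, which each occur in only one.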


### Text / NLP based features

Create some additional features from the raw text:

* `char_count`: number of characters in the text
* `word_count`: number of whitespace-separated words in the text
* `word_density`: average number of characters per word
* `punctuation_count`: number of punctuation characters in the text
* `title_word_count`: number of title-case words in the text
* `uppercase_word_count`: number of all-uppercase words in the text
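
As a quick check of what these string-based features measure, the sketch below computes them for one short sample. The sample sentence is made up for illustration, and `word_density` divides by `word_count + 1` to stay safe on empty text, which is an assumption carried over from the feature function used later in this lab.

```python
import string

# Hypothetical sample text, not taken from the corpus
sample = "Amazing!: This soundtrack is GREAT"
words = sample.split()

features = {
    "char_count": len(sample),
    "word_count": len(words),
    # +1 in the denominator avoids division by zero for empty text
    "word_density": len(sample) / (len(words) + 1),
    "punctuation_count": sum(ch in string.punctuation for ch in sample),
    # str.istitle() is True for words like "This" and "Amazing!:"
    "title_word_count": sum(w.istitle() for w in words),
    # str.isupper() is True only when every cased character is uppercase
    "uppercase_word_count": sum(w.isupper() for w in words),
}

for name, value in features.items():
    print(f"{name}: {value}")
```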


In [None]:
%%time
# ANSWER

In [10]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Count the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter
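
One way to follow the hint is to tally all `pos_` values in a single pass with `Counter` and then pick out the tags of interest. The sketch below assumes only that each token exposes a `pos_` attribute, so it uses a small stand-in `Token` type with hand-assigned tags to stay runnable without a spaCy model; with spaCy loaded, the same function could be called as `count_pos(nlp(text))`. The names `count_pos`, `POS_TAGS` and `Token` are made up for this example.

```python
from collections import Counter, namedtuple

POS_TAGS = ("ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB")


def count_pos(doc, tags=POS_TAGS):
    """Count the selected coarse POS tags in one pass over the tokens."""
    counts = Counter(token.pos_ for token in doc)
    return {f"{tag.lower()}_count": counts.get(tag, 0) for tag in tags}


# Stand-in for a spaCy Doc: tags here are assigned by hand for illustration
Token = namedtuple("Token", ["text", "pos_"])
fake_doc = [
    Token("I", "PRON"),
    Token("love", "VERB"),
    Token("this", "DET"),
    Token("great", "ADJ"),
    Token("soundtrack", "NOUN"),
]

print(count_pos(fake_doc))
```

Counting once and looking up each tag avoids iterating over the document separately for every tag.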

In [11]:
def create_nlp_features(text):
    # Basic text features
    char_count = len(text)
    word_count = len(text.split())
    word_density = char_count / (word_count + 1)  # Adding 1 to avoid division by zero
    punctuation_count = sum(1 for char in text if char in string.punctuation)
    title_word_count = sum(1 for word in text.split() if word.istitle())
    uppercase_word_count = sum(1 for word in text.split() if word.isupper())
    
    # spaCy-based features
    doc = nlp(text)
    adj_count = sum(1 for token in doc if token.pos_ == 'ADJ')
    adv_count = sum(1 for token in doc if token.pos_ == 'ADV')
    noun_count = sum(1 for token in doc if token.pos_ == 'NOUN')
    num_count = sum(1 for token in doc if token.pos_ == 'NUM')
    pron_count = sum(1 for token in doc if token.pos_ == 'PRON')
    propn_count = sum(1 for token in doc if token.pos_ == 'PROPN')
    verb_count = sum(1 for token in doc if token.pos_ == 'VERB')
    
    return {
        'char_count': char_count,
        'word_count': word_count,
        'word_density': word_density,
        'punctuation_count': punctuation_count,
        'title_word_count': title_word_count,
        'uppercase_word_count': uppercase_word_count,
        'adj_count': adj_count,
        'adv_count': adv_count,
        'noun_count': noun_count,
        'num_count': num_count,
        'pron_count': pron_count,
        'propn_count': propn_count,
        'verb_count': verb_count
    }

In [None]:
# Initialise some columns for feature's counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [12]:
# ANSWER
# Apply the function to create new features
df_corpus = df_corpus.join(df_corpus['text'].apply(create_nlp_features).apply(pd.Series))

# Display a sample of the new features
print(df_corpus[['char_count', 'word_count', 'word_density',
                 'punctuation_count', 'title_word_count',
                 'uppercase_word_count', 'adj_count',
                 'adv_count', 'noun_count', 'num_count',
                 'pron_count', 'propn_count', 'verb_count']].sample(5))

# Display summary statistics of the new features
print(df_corpus[['char_count', 'word_count', 'word_density',
                 'punctuation_count', 'title_word_count',
                 'uppercase_word_count', 'adj_count',
                 'adv_count', 'noun_count', 'num_count',
                 'pron_count', 'propn_count', 'verb_count']].describe())

      char_count  word_count  word_density  punctuation_count  \
6793       329.0        63.0      5.140625               10.0   
3215       643.0        99.0      6.430000               24.0   
6604       516.0       101.0      5.058824               19.0   
8516       363.0        71.0      5.041667               10.0   
8712       533.0        93.0      5.670213               49.0   

      title_word_count  uppercase_word_count  adj_count  adv_count  \
6793               4.0                   1.0        4.0        3.0   
3215              15.0                   6.0        9.0        6.0   
6604              18.0                   8.0        4.0       11.0   
8516              16.0                   2.0        4.0        4.0   
8712               5.0                   4.0        5.0        2.0   

      noun_count  num_count  pron_count  propn_count  verb_count  
6793        11.0        1.0        11.0          2.0         7.0  
3215        19.0        0.0         5.0         17.0  

In [None]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

df_corpus[cols].sample(5)

### Topic Models as features

In [13]:
%%time
# train a LDA Model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: total: 41 s
Wall time: 41.5 s


In [14]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 my product not dvd for on the they quality item
    1 miles theaters lackey rome bulbs products emma sarah logical housing
    2 ran san philadelphia incomprehensible francisco bargain remastered dude emphasis celiac
    3 cap os circuit fw germans scenario spy aluminum oriented poem
    4 richard anderson titan stockings kentucky steer anthology motorola ra lowest
    5 orwell fi sci winston cat titanic haiku yoga everyday textbook
    6 et est manon des cereal acid l steam korean insulting
    7 the i and a to it of this is in
    8 u card digital cards larger white henry simpletech fashion wayne
    9 with product my use to works player work software card
   10 unbelievable wanting higgins magazine stands gore overview damn cities turner
   11 la de hair camcorder lesson y lady en el spanish
   12 cave cute eye bear ayla clan 70 tree keel force
   13 paris captured achieve fa

In [15]:
# View the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(lda_model.components_):
    top_word_indices = topic_dist.argsort()[:-n_top_words - 1:-1]
    # Reuse vocab from the LDA cell; j avoids shadowing the topic index i
    topic_words = [vocab[j] for j in top_word_indices]
    topic_summary = ' '.join(topic_words)
    topic_summaries.append(topic_summary)
    print(f'  {i:3d} {topic_summary}')

# Add topic features to the training and test sets
X_train_topics = lda_model.transform(X_train_count)
X_test_topics = lda_model.transform(X_test_count)

print(f"Shape of train topic features: {X_train_topics.shape}")
print(f"Shape of test topic features: {X_test_topics.shape}")

Group Top Words
----- --------------------------------------------------------------------------------
    0 my product not dvd for on the they quality item
    1 miles theaters lackey rome bulbs products emma sarah logical housing
    2 ran san philadelphia incomprehensible francisco bargain remastered dude emphasis celiac
    3 cap os circuit fw germans scenario spy aluminum oriented poem
    4 richard anderson titan stockings kentucky steer anthology motorola ra lowest
    5 orwell fi sci winston cat titanic haiku yoga everyday textbook
    6 et est manon des cereal acid l steam korean insulting
    7 the i and a to it of this is in
    8 u card digital cards larger white henry simpletech fashion wayne
    9 with product my use to works player work software card
   10 unbelievable wanting higgins magazine stands gore overview damn cities turner
   11 la de hair camcorder lesson y lady en el spanish
   12 cave cute eye bear ayla clan 70 tree keel force
   13 paris captured achieve fa
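
The `argsort()[:-n_top_words - 1:-1]` idiom used above sorts indices by ascending weight and then reads the last `n_top_words` of them in reverse, which gives the highest-weighted words first. The plain-Python sketch below does the same selection on a toy weight vector; the function name, toy vocabulary and weights are made up for illustration, and the order of tied weights may differ from NumPy's.

```python
def top_k_indices(weights, k):
    """Indices of the k largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


# Toy vocabulary and per-word weights for a single topic (made up)
toy_vocab = ["bad", "great", "price", "sound"]
topic_weights = [0.1, 2.5, 0.7, 1.8]

top = top_k_indices(topic_weights, 2)
print([toy_vocab[i] for i in top])
```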

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [16]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # NOTE: relies on the global y_test; accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)
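
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels; the function name `accuracy` and the toy lists are assumptions for this example, not part of the lab code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels (made up for illustration)
y_true_toy = [1, 0, 1, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1]
print(accuracy(y_true_toy, y_pred_toy))
```

Because the two classes in this corpus are close to balanced (roughly 51% to 49%), accuracy is a reasonable headline metric here. On a heavily imbalanced dataset it would likely need to be complemented by precision and recall.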

In [17]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [18]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8520

CPU times: total: 15.6 ms
Wall time: 8.93 ms


In [19]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8550

CPU times: total: 31.2 ms
Wall time: 23.8 ms


In [20]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8360

CPU times: total: 15.6 ms
Wall time: 29.4 ms


In [21]:
%%time
# Naive Bayes on Character Level TF-IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8195

CPU times: total: 109 ms
Wall time: 184 ms


In [22]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [23]:
print(results)

             Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes          0.852             0.855           0.836   

             CharLevel Vectors  
Naïve Bayes             0.8195  


### Linear Classifier

In [24]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8525

CPU times: total: 19 s
Wall time: 2.34 s


In [25]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8715

CPU times: total: 62.5 ms
Wall time: 24.3 ms


In [26]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8295

CPU times: total: 15.6 ms
Wall time: 16.9 ms


In [27]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8490

CPU times: total: 141 ms
Wall time: 143 ms


In [28]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [29]:
print(results)

                     Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                 0.8520            0.8550          0.8360   
Logistic Regression         0.8525            0.8715          0.8295   

                     CharLevel Vectors  
Naïve Bayes                     0.8195  
Logistic Regression             0.8490  


### Support Vector Machine

In [30]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8345

CPU times: total: 359 ms
Wall time: 356 ms


In [31]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8605

CPU times: total: 62.5 ms
Wall time: 68.4 ms


In [32]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8120

CPU times: total: 46.9 ms
Wall time: 37.2 ms


In [33]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8590

CPU times: total: 594 ms
Wall time: 588 ms


In [34]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [35]:
print(results)

                        Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                    0.8520            0.8550          0.8360   
Logistic Regression            0.8525            0.8715          0.8295   
Support Vector Machine         0.8345            0.8605          0.8120   

                        CharLevel Vectors  
Naïve Bayes                        0.8195  
Logistic Regression                0.8490  
Support Vector Machine             0.8590  


### Bagging Models

In [36]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8230

CPU times: total: 10.6 s
Wall time: 10.6 s


In [37]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8265

CPU times: total: 6.27 s
Wall time: 6.3 s


In [38]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7860

CPU times: total: 6.22 s
Wall time: 6.24 s


In [39]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7735

CPU times: total: 20.3 s
Wall time: 20.4 s


In [40]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [41]:
print(results)

                        Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                    0.8520            0.8550          0.8360   
Logistic Regression            0.8525            0.8715          0.8295   
Support Vector Machine         0.8345            0.8605          0.8120   
Random Forest                  0.8230            0.8265          0.7860   

                        CharLevel Vectors  
Naïve Bayes                        0.8195  
Logistic Regression                0.8490  
Support Vector Machine             0.8590  
Random Forest                      0.7735  


### Boosting Models

In [42]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: total: 6.2 s
Wall time: 6.22 s


In [43]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7960

CPU times: total: 12.3 s
Wall time: 12.3 s


In [44]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7330

CPU times: total: 7.72 s
Wall time: 7.75 s


In [45]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8025

CPU times: total: 1min 48s
Wall time: 1min 49s


In [46]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [47]:
results

                        Count Vectors  WordLevel TF-IDF  N-Gram Vectors  CharLevel Vectors
Naïve Bayes                    0.8520            0.8550          0.8360             0.8195
Logistic Regression            0.8525            0.8715          0.8295             0.8490
Support Vector Machine         0.8345            0.8605          0.8120             0.8590
Random Forest                  0.8230            0.8265          0.7860             0.7735
Gradient Boosting              0.7990            0.7960          0.7330             0.8025


Which combination of features and model performed the best?
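
One way to answer this is to search the results table for its largest entry. The sketch below copies the accuracies reported above into a plain dictionary (named `scores` here, an assumption for this example) and picks the best model and feature pair.

```python
# Accuracies copied from the results table above
scores = {
    "Naïve Bayes": {"Count Vectors": 0.8520, "WordLevel TF-IDF": 0.8550,
                    "N-Gram Vectors": 0.8360, "CharLevel Vectors": 0.8195},
    "Logistic Regression": {"Count Vectors": 0.8525, "WordLevel TF-IDF": 0.8715,
                            "N-Gram Vectors": 0.8295, "CharLevel Vectors": 0.8490},
    "Support Vector Machine": {"Count Vectors": 0.8345, "WordLevel TF-IDF": 0.8605,
                               "N-Gram Vectors": 0.8120, "CharLevel Vectors": 0.8590},
    "Random Forest": {"Count Vectors": 0.8230, "WordLevel TF-IDF": 0.8265,
                      "N-Gram Vectors": 0.7860, "CharLevel Vectors": 0.7735},
    "Gradient Boosting": {"Count Vectors": 0.7990, "WordLevel TF-IDF": 0.7960,
                          "N-Gram Vectors": 0.7330, "CharLevel Vectors": 0.8025},
}

# Flatten to (model, features, score) triples and take the highest score
best_model, best_features, best_score = max(
    ((model, features, score)
     for model, row in scores.items()
     for features, score in row.items()),
    key=lambda item: item[2],
)
print(f"{best_model} + {best_features}: {best_score:.4f}")
```

With the `results` DataFrame itself, `results.stack().idxmax()` should give the same (model, features) pair. The gap between the top few combinations is about one percentage point on 2,000 test reviews from a single split, so the ranking is worth confirming with cross-validation before drawing firm conclusions.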



---

© 2024 Institute of Data
