This notebook implements spaCy 2.x models for text classification on the US_consumer_finance_complaints dataset. The data can be downloaded from https://www.kaggle.com/cfpb/us-consumer-finance-complaints. Only two columns are used: consumer_complaint_narrative supplies the text and product supplies the category labels. The model is trained to assign each narrative to one of those categories.

In [147]:
import pandas as pd
import numpy as np
import seaborn as sns
import random
import time
import re
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, classification_report

from nltk.corpus import stopwords
stop = stopwords.words('english')

import spacy
from spacy.util import minibatch, compounding

import torch

import warnings
warnings.filterwarnings(action="ignore")

In [148]:
# read the data
df = pd.read_csv('consumer_complaints.csv')

In [149]:
df.head()

Unnamed: 0,date_received,product,sub_product,issue,sub_issue,consumer_complaint_narrative,company_public_response,company,state,zipcode,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed?,complaint_id
0,08/30/2013,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,U.S. Bancorp,CA,95993,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511074
1,08/30/2013,Mortgage,Other mortgage,"Loan servicing, payments, escrow account",,,,Wells Fargo & Company,CA,91104,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511080
2,08/30/2013,Credit reporting,,Incorrect information on credit report,Account status,,,Wells Fargo & Company,NY,11764,,,Postal mail,09/18/2013,Closed with explanation,Yes,No,510473
3,08/30/2013,Student loan,Non-federal student loan,Repaying your loan,Repaying your loan,,,"Navient Solutions, Inc.",MD,21402,,,Email,08/30/2013,Closed with explanation,Yes,Yes,510326
4,08/30/2013,Debt collection,Credit card,False statements or representation,Attempted to collect wrong amount,,,Resurgent Capital Services L.P.,GA,30106,,,Web,08/30/2013,Closed with explanation,Yes,Yes,511067


In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555957 entries, 0 to 555956
Data columns (total 18 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   date_received                 555957 non-null  object
 1   product                       555957 non-null  object
 2   sub_product                   397635 non-null  object
 3   issue                         555957 non-null  object
 4   sub_issue                     212622 non-null  object
 5   consumer_complaint_narrative  66806 non-null   object
 6   company_public_response       85124 non-null   object
 7   company                       555957 non-null  object
 8   state                         551070 non-null  object
 9   zipcode                       551452 non-null  object
 10  tags                          77959 non-null   object
 11  consumer_consent_provided     123458 non-null  object
 12  submitted_via                 555957 non-null  object
 13 

In [151]:
# find out the number of null values by column
df.isnull().sum().sort_values(ascending=False)

consumer_complaint_narrative    489151
tags                            477998
company_public_response         470833
consumer_consent_provided       432499
sub_issue                       343335
sub_product                     158322
state                             4887
zipcode                           4505
product                              0
issue                                0
complaint_id                         0
company                              0
consumer_disputed?                   0
submitted_via                        0
date_sent_to_company                 0
company_response_to_consumer         0
timely_response                      0
date_received                        0
dtype: int64

In [152]:
# drop all the rows with null values in the 'consumer_complaint_narrative' column as that is the column we will 
# be using for our analysis
df.dropna(subset=['consumer_complaint_narrative'], axis=0, inplace=True)

In [153]:
# check to make sure all the null values in the required column have been dropped
df.isnull().sum().sort_values(ascending=False)

tags                            55389
company_public_response         34030
sub_issue                       33874
sub_product                     20455
zipcode                           189
state                             186
company                             0
product                             0
issue                               0
consumer_complaint_narrative        0
complaint_id                        0
consumer_disputed?                  0
consumer_consent_provided           0
submitted_via                       0
date_sent_to_company                0
company_response_to_consumer        0
timely_response                     0
date_received                       0
dtype: int64

In [154]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66806 entries, 190126 to 553096
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   date_received                 66806 non-null  object
 1   product                       66806 non-null  object
 2   sub_product                   46351 non-null  object
 3   issue                         66806 non-null  object
 4   sub_issue                     32932 non-null  object
 5   consumer_complaint_narrative  66806 non-null  object
 6   company_public_response       32776 non-null  object
 7   company                       66806 non-null  object
 8   state                         66620 non-null  object
 9   zipcode                       66617 non-null  object
 10  tags                          11417 non-null  object
 11  consumer_consent_provided     66806 non-null  object
 12  submitted_via                 66806 non-null  object
 13  date_sent_

### Preprocess

In [155]:
# create a new dataframe with only the two columns required for our analysis;
# .copy() makes it independent of the original frame, so adding columns later is safe
df = df[['product', 'consumer_complaint_narrative']].copy()

In [158]:
# function to clean text
def clean_text_round1(text):
    text = text.lower()                                                 # lowercase text
    text = re.sub(r'\{.*?\}', '', text)                                 # remove text in curly brackets, e.g. {$27.00}
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)     # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)                                # remove any token that contains a digit
    text = re.sub(r'\n', '', text)                                      # remove newline characters
    return text
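
One side effect of the cleaning steps is that deleting newline characters outright can glue the last word of one line to the first word of the next. The sketch below is a hypothetical variant (`clean_text_spaced` is not part of the notebook) that collapses all whitespace to single spaces instead. The sample string is made up for illustration.

```python
import re
import string

def clean_text_spaced(text):
    """Hypothetical variant of clean_text_round1 that preserves word boundaries."""
    text = text.lower()
    text = re.sub(r'\{.*?\}', '', text)                               # drop bracketed amounts such as {$27.00}
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # drop punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # drop any token containing a digit
    text = re.sub(r'\s+', ' ', text)                                  # newlines and repeated spaces -> one space
    return text.strip()

sample = "I owe {$27.00}.\nPaid on 12/01/2015!"
cleaned = clean_text_spaced(sample)
print(cleaned)
```

With the original newline rule, the same sample would end up containing a doubled space after "owe" and a trailing space, so this variant only changes whitespace handling.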

In [159]:
df['clean_text'] = df['consumer_complaint_narrative'].apply(clean_text_round1)

In [160]:
df.head()

Unnamed: 0,product,consumer_complaint_narrative,clean_text
190126,Debt collection,XXXX has claimed I owe them {$27.00} for XXXX ...,xxxx has claimed i owe them for xxxx years de...
190135,Consumer Loan,Due to inconsistencies in the amount owed that...,due to inconsistencies in the amount owed that...
190155,Mortgage,In XX/XX/XXXX my wages that I earned at my job...,in xxxxxxxx my wages that i earned at my job d...
190207,Mortgage,I have an open and current mortgage with Chase...,i have an open and current mortgage with chase...
190208,Mortgage,XXXX was submitted XX/XX/XXXX. At the time I s...,xxxx was submitted xxxxxxxx at the time i subm...


In [195]:
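# function to remove the redaction markers (words starting with 'xx')
def remove_xx(text):
    # build a new list rather than calling words.remove() inside the loop:
    # removing while iterating skips the word that follows each removed one
    words = [word for word in str(text).split() if not word.startswith('xx')]
    return ' '.join(words)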

remove_xx('hello, world xxxxxxxxxx')

'hello, world'
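
Removing items from a list while looping over it skips the element that follows each removal, so two redaction markers in a row are not both dropped. The sketch below contrasts the two approaches. Both helper names are hypothetical and exist only for this illustration.

```python
def drop_xx_in_place(text):
    """Removes words while iterating, which skips the neighbour of each removed word."""
    words = text.split()
    for word in words:
        if word.startswith('xx'):
            words.remove(word)
    return ' '.join(words)

def drop_xx_filtered(text):
    """Builds a new sequence, so every word is checked exactly once."""
    return ' '.join(word for word in text.split() if not word.startswith('xx'))

sample = 'paid xxxx xxxx on time'
in_place = drop_xx_in_place(sample)
filtered = drop_xx_filtered(sample)
print(repr(in_place))
print(repr(filtered))
```

The in-place version leaves one `xxxx` behind on this input, while the filtered version removes both.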

In [162]:
df['clean_text'] = df['clean_text'].map(remove_xx)

In [164]:
# remove stop words
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
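
Stop-word filtering tests every word for membership, so the container type matters on a large corpus: a list is scanned linearly, while a set is hashed. A minimal sketch, assuming a tiny hypothetical stop list in place of the much longer `stopwords.words('english')`:

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is longer.
stop_list = ['i', 'me', 'my', 'the', 'a', 'to', 'has', 'them', 'for']
stop_set = frozenset(stop_list)   # hashed lookups instead of a linear scan per word

def remove_stop_words(text, stop_words):
    """Keep only the words that are not in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = 'has claimed i owe them for years'
filtered = remove_stop_words(sample, stop_set)
print(filtered)
```

The result is the same with either container. Only the lookup cost differs.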

In [165]:
# this is enough preprocessing for now
df.head()

Unnamed: 0,product,consumer_complaint_narrative,clean_text
190126,Debt collection,XXXX has claimed I owe them {$27.00} for XXXX ...,claimed owe years despite proof payment sent c...
190135,Consumer Loan,Due to inconsistencies in the amount owed that...,due inconsistencies amount owed told bank amou...
190155,Mortgage,In XX/XX/XXXX my wages that I earned at my job...,wages earned job decreased almost half knew tr...
190207,Mortgage,I have an open and current mortgage with Chase...,open current mortgage chase bank chase reporti...
190208,Mortgage,XXXX was submitted XX/XX/XXXX. At the time I s...,submitted time submitted complaint dealt rushm...


In [168]:
# keep only the first 1000 rows to save time during training
df = df[['product', 'clean_text']][:1000]

In [169]:
df.head()

Unnamed: 0,product,clean_text
190126,Debt collection,claimed owe years despite proof payment sent c...
190135,Consumer Loan,due inconsistencies amount owed told bank amou...
190155,Mortgage,wages earned job decreased almost half knew tr...
190207,Mortgage,open current mortgage chase bank chase reporti...
190208,Mortgage,submitted time submitted complaint dealt rushm...


In [170]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 190126 to 203247
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   product     1000 non-null   object
 1   clean_text  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB


In [171]:
# preview the frame with a reset index (the result is not assigned, so df keeps its original index)
df.reset_index()

Unnamed: 0,index,product,clean_text
0,190126,Debt collection,claimed owe years despite proof payment sent c...
1,190135,Consumer Loan,due inconsistencies amount owed told bank amou...
2,190155,Mortgage,wages earned job decreased almost half knew tr...
3,190207,Mortgage,open current mortgage chase bank chase reporti...
4,190208,Mortgage,submitted time submitted complaint dealt rushm...
...,...,...,...
995,203235,Consumer Loan,credit report reporting appears account online...
996,203241,Credit card,credit card accounts capital one earlier year ...
997,203244,Bank account or service,bank america bank took whole check account go ...
998,203246,Mortgage,received tax abatement escrow payment balance ...


In [172]:
# check the frequency of each category
100.0*df['product'].value_counts()/len(df)

Debt collection            28.3
Mortgage                   23.3
Credit reporting           14.3
Credit card                12.0
Bank account or service     7.8
Consumer Loan               5.7
Student loan                4.7
Money transfers             1.7
Payday loan                 1.3
Prepaid card                0.7
Other financial service     0.2
Name: product, dtype: float64
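
The classes are clearly imbalanced, so accuracy should be read against a majority-class baseline. The sketch below derives that baseline from row counts reconstructed from the percentages above (percentage × 10, since the subset has 1000 rows).

```python
# Row counts implied by the percentages above for the 1000-row subset.
counts = {
    'Debt collection': 283,
    'Mortgage': 233,
    'Credit reporting': 143,
    'Credit card': 120,
    'Bank account or service': 78,
    'Consumer Loan': 57,
    'Student loan': 47,
    'Money transfers': 17,
    'Payday loan': 13,
    'Prepaid card': 7,
    'Other financial service': 2,
}

total = sum(counts.values())
majority_label, majority_count = max(counts.items(), key=lambda item: item[1])
baseline_accuracy = majority_count / total   # accuracy of always predicting the majority class

print(majority_label, baseline_accuracy)
```

A model that scores close to this baseline has learned little beyond the class frequencies. Note also that a class with only two rows may not appear in a 20% test split at all.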

### Objective
The goal is to use the text in the 'consumer_complaint_narrative' column to classify each complaint into the correct 'product' category.

### Prepare train/test datasets

In [173]:
label_values = list(df['product'].unique())
label_values

['Debt collection',
 'Consumer Loan',
 'Mortgage',
 'Credit card',
 'Credit reporting',
 'Student loan',
 'Bank account or service',
 'Payday loan',
 'Money transfers',
 'Other financial service',
 'Prepaid card']

In [174]:
train_X, test_X, train_y, test_y = train_test_split(df['clean_text'],
                                                   df['product'],
                                                   test_size=0.2,
                                                   stratify=df['product']
                                                   )

In [175]:
print('Shape of train_X:', train_X.shape)
print('Shape of train_y:', train_y.shape)
print('Shape of test_X:', test_X.shape)
print('Shape of test_y:', test_y.shape)

Shape of train_X: (800,)
Shape of train_y: (800,)
Shape of test_X: (200,)
Shape of test_y: (200,)


### Convert dataset to spaCy-compatible format

In [176]:
# one hot encode all the labels
train_y_df = pd.get_dummies(train_y)
test_y_df = pd.get_dummies(test_y)

In [177]:
train_y_df.head()

Unnamed: 0,Bank account or service,Consumer Loan,Credit card,Credit reporting,Debt collection,Money transfers,Mortgage,Other financial service,Payday loan,Prepaid card,Student loan
199871,0,0,0,0,1,0,0,0,0,0,0
200030,1,0,0,0,0,0,0,0,0,0,0
199616,0,0,0,0,0,0,1,0,0,0,0
200500,0,0,0,1,0,0,0,0,0,0,0
200691,0,0,0,0,0,0,1,0,0,0,0


In [178]:
# convert data to text list and label dictionaries
train_texts = train_X.tolist()
train_cats = train_y_df.to_dict(orient='records')
test_texts = test_X.tolist()
test_cats = test_y_df.to_dict(orient='records')

In [179]:
# combine the text and labels to create data in spacy format
train_data = list(zip(train_texts, [{'cats': cats} for cats in train_cats]))
test_data = list(zip(test_texts, [{'cats': cats} for cats in test_cats]))

In [180]:
# check
train_data[:2]

[('ended service receiving final statement notice additional charges needed corrected contacted informed errors ticket created amend bill new updated bill sent business days bill never received called regards bill customer service observed bill still waiting supervisor approval corrections another ticket created marked urgent corrections new bill received credit alert stating account turned third party collections agency spoke representative rom southwest credit stated attempting collect past due bill explain charges incorrect bill reviewed corrections requested transferred billing time stated third party collections hired contacted spoke reviewed bill supervisor corrected also sent request remove account third party collections contacted check status account still collections customer service representative stated first request denied collection agency refused remove request created second ticket requested account removed collections collection agency updated payment amount contacted 

In [181]:
# check
test_data[:2]

[('subscribed cancelled within first days service also shipped back equipment within days well saying owe fee refuse pay owe',
  {'cats': {'Bank account or service': 0,
    'Consumer Loan': 0,
    'Credit card': 0,
    'Credit reporting': 0,
    'Debt collection': 1,
    'Money transfers': 0,
    'Mortgage': 0,
    'Payday loan': 0,
    'Prepaid card': 0,
    'Student loan': 0}}),
 ('mortgage company made payment arrangements days later said would nt take payments',
  {'cats': {'Bank account or service': 0,
    'Consumer Loan': 0,
    'Credit card': 0,
    'Credit reporting': 0,
    'Debt collection': 0,
    'Money transfers': 0,
    'Mortgage': 1,
    'Payday loan': 0,
    'Prepaid card': 0,
    'Student loan': 0}})]
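
The test dictionaries above have ten keys, while `label_values` has eleven entries. This is because `pd.get_dummies` only creates columns for labels that occur in the series it is given, and 'Other financial service' did not land in this test split. One way to keep the keys identical across splits is to build each dictionary from the full label list. A minimal sketch, where `to_cats` is a hypothetical helper rather than part of the notebook:

```python
# The eleven product labels listed earlier in the notebook.
label_values = ['Debt collection', 'Consumer Loan', 'Mortgage', 'Credit card',
                'Credit reporting', 'Student loan', 'Bank account or service',
                'Payday loan', 'Money transfers', 'Other financial service',
                'Prepaid card']

def to_cats(label, labels):
    """One-hot dictionary over the full label set, independent of the split."""
    return {name: int(name == label) for name in labels}

example = ('mortgage company made payment arrangements',
           {'cats': to_cats('Mortgage', label_values)})

cats = example[1]['cats']
print(len(cats), sum(cats.values()))
```

Every example then carries all eleven keys, whether or not a rare class happens to appear in a given split.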

In [182]:
# unpack the texts and labels for use in evaluation later
train_texts, train_labels = list(zip(*train_data))
test_texts, test_labels = list(zip(*test_data))

### Construct spaCy model

In [190]:
def train_spacy(iterations, model_arch, dropout, learn_rate):

    nlp = spacy.load('en_core_web_lg')

    textcat = nlp.create_pipe('textcat', config={'exclusive_classes':True, 'architecture':model_arch})
    nlp.add_pipe(textcat)

    for label in label_values:
        textcat.add_label(label)

    pipe_exceptions = ['textcat']
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

    with nlp.disable_pipes(*other_pipes):  # train only the textcat component
        optimizer = nlp.begin_training()
        optimizer.learn_rate = learn_rate
        print('Training the model..')
        # time.clock() was removed in Python 3.8, so use time.perf_counter() instead
        total_start_time = time.perf_counter()

        # the training loop stays inside the `with` block so the other pipes remain disabled
        for i in range(iterations):
            print('\nIteration:', str(i+1))
            start_time = time.perf_counter()
            losses = {}
            true_labels = []
            pred_labels = []

            random.shuffle(train_data)
            batches = minibatch(train_data, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

            # evaluate on the test set using the averaged parameters
            with textcat.model.use_params(optimizer.averages):

                docs = [nlp.tokenizer(text) for text in test_texts]

                for j, doc in enumerate(textcat.pipe(docs)):
                    # the label with the highest value is the true class
                    true_series = pd.Series(test_labels[j]['cats'])
                    true_label = true_series.idxmax()
                    true_labels.append(true_label)

                    # the label with the highest score is the predicted class
                    pred_series = pd.Series(doc.cats)
                    pred_label = pred_series.idxmax()
                    pred_labels.append(pred_label)

                score_f1 = f1_score(true_labels, pred_labels, average='weighted')
                score_ac = accuracy_score(true_labels, pred_labels)

                print('textcat_loss: {:.3f}\t f1_score: {:.3f}\t accuracy_score: {:.3f}'.format(losses['textcat'], score_f1, score_ac))

                print('Elapsed time:', str(round((time.perf_counter() - start_time)/60,2)) + ' minutes')

        print('Total time:', str(round((time.perf_counter() - total_start_time)/60,2)) + ' minutes')

    return nlp
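
The batch schedule `compounding(4., 32., 1.001)` grows very slowly, so it helps to see what it does on 800 training examples. The sketch below uses simplified re-implementations based on my understanding of how spaCy 2.x's `compounding` and `minibatch` helpers behave. They are assumptions for illustration, not the library code.

```python
import itertools

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, size):
    """Cut items into consecutive batches of int(next(size)) elements each."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch

examples = list(range(800))   # stand-in for the 800 (text, annotations) training pairs
batches = list(minibatch_sketch(examples, compounding_sketch(4., 32., 1.001)))
n_batches = len(batches)
batch_sizes = {len(batch) for batch in batches}
print(n_batches, batch_sizes)
```

Under these assumptions the schedule never rounds up to 5 within a single pass over 800 examples, so every batch holds 4 texts and each iteration runs 200 updates. That is consistent with the 200 `Batch:` lines per iteration in the transformer log further down. A larger starting size or growth factor would reduce the number of updates per iteration.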

In [186]:
# bag of words model architecture
train_spacy(10, 'bow', 0.2, 4e-4)

Training the model..

Iteration: 1
textcat_loss: 9.368	 f1_score: 0.562	 accuracy_score: 0.630
Elapsed time: 0.67 minutes

Iteration: 2
textcat_loss: 6.224	 f1_score: 0.626	 accuracy_score: 0.680
Elapsed time: 0.58 minutes

Iteration: 3
textcat_loss: 4.375	 f1_score: 0.634	 accuracy_score: 0.685
Elapsed time: 0.57 minutes

Iteration: 4
textcat_loss: 3.226	 f1_score: 0.660	 accuracy_score: 0.705
Elapsed time: 0.57 minutes

Iteration: 5
textcat_loss: 2.501	 f1_score: 0.675	 accuracy_score: 0.715
Elapsed time: 0.59 minutes

Iteration: 6
textcat_loss: 2.062	 f1_score: 0.681	 accuracy_score: 0.720
Elapsed time: 0.61 minutes

Iteration: 7
textcat_loss: 1.687	 f1_score: 0.684	 accuracy_score: 0.720
Elapsed time: 0.75 minutes

Iteration: 8
textcat_loss: 1.335	 f1_score: 0.689	 accuracy_score: 0.725
Elapsed time: 0.97 minutes

Iteration: 9
textcat_loss: 1.116	 f1_score: 0.694	 accuracy_score: 0.730
Elapsed time: 1.04 minutes

Iteration: 10
textcat_loss: 0.945	 f1_score: 0.686	 accuracy_score: 0

<spacy.lang.en.English at 0x7feef1981160>

In [188]:
# try a different learning rate for bow
train_spacy(10, 'bow', 0.2, 4e-3)

Training the model..

Iteration: 1
textcat_loss: 6.909	 f1_score: 0.765	 accuracy_score: 0.780
Elapsed time: 0.7 minutes

Iteration: 2
textcat_loss: 1.860	 f1_score: 0.712	 accuracy_score: 0.730
Elapsed time: 0.71 minutes

Iteration: 3
textcat_loss: 0.737	 f1_score: 0.703	 accuracy_score: 0.730
Elapsed time: 1.46 minutes

Iteration: 4
textcat_loss: 0.380	 f1_score: 0.706	 accuracy_score: 0.735
Elapsed time: 0.98 minutes

Iteration: 5
textcat_loss: 0.353	 f1_score: 0.744	 accuracy_score: 0.760
Elapsed time: 2.46 minutes

Iteration: 6
textcat_loss: 0.204	 f1_score: 0.732	 accuracy_score: 0.750
Elapsed time: 3.06 minutes

Iteration: 7
textcat_loss: 0.102	 f1_score: 0.729	 accuracy_score: 0.745
Elapsed time: 2.92 minutes

Iteration: 8
textcat_loss: 0.061	 f1_score: 0.708	 accuracy_score: 0.735
Elapsed time: 3.63 minutes

Iteration: 9
textcat_loss: 0.121	 f1_score: 0.698	 accuracy_score: 0.725
Elapsed time: 3.5 minutes

Iteration: 10
textcat_loss: 0.150	 f1_score: 0.711	 accuracy_score: 0.7

<spacy.lang.en.English at 0x7fef16df6fd0>
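
In this run the weighted F1 peaks at the first iteration while the loss keeps falling, which suggests the model starts to overfit early at the higher learning rate. The sketch below tracks the best iteration with a simple patience rule, using the F1 values logged for the two bag-of-words runs. `best_iteration` is a hypothetical helper. In practice the selection should be made on a separate validation split rather than on the test set.

```python
def best_iteration(scores, patience):
    """Return (1-based iteration, score) of the best score, stopping once
    `patience` consecutive iterations have failed to improve on it."""
    best_index, best_score = 0, float('-inf')
    for index, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            break
    return best_index + 1, best_score

# Weighted F1 per iteration, copied from the logs of the two bag-of-words runs.
f1_lr_4e4 = [0.562, 0.626, 0.634, 0.660, 0.675, 0.681, 0.684, 0.689, 0.694, 0.686]
f1_lr_4e3 = [0.765, 0.712, 0.703, 0.706, 0.744, 0.732, 0.729, 0.708, 0.698, 0.711]

slow = best_iteration(f1_lr_4e4, patience=3)
fast = best_iteration(f1_lr_4e3, patience=3)
print(slow, fast)
```

With a patience of 3, the higher learning rate run would stop after four iterations and keep the first one, while the lower learning rate run keeps improving until iteration 9.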

In [191]:
# convolutional neural network as model architecture
train_spacy(10, 'simple_cnn', 0.2, 4e-3)

Training the model..

Iteration: 1
textcat_loss: 9.043	 f1_score: 0.535	 accuracy_score: 0.605
Elapsed time: 0.72 minutes

Iteration: 2
textcat_loss: 6.229	 f1_score: 0.593	 accuracy_score: 0.650
Elapsed time: 0.73 minutes

Iteration: 3
textcat_loss: 4.255	 f1_score: 0.712	 accuracy_score: 0.735
Elapsed time: 1.03 minutes

Iteration: 4
textcat_loss: 3.029	 f1_score: 0.717	 accuracy_score: 0.730
Elapsed time: 1.37 minutes

Iteration: 5
textcat_loss: 1.831	 f1_score: 0.737	 accuracy_score: 0.745
Elapsed time: 2.83 minutes

Iteration: 6
textcat_loss: 1.613	 f1_score: 0.723	 accuracy_score: 0.735
Elapsed time: 3.13 minutes

Iteration: 7
textcat_loss: 1.163	 f1_score: 0.722	 accuracy_score: 0.725
Elapsed time: 2.91 minutes

Iteration: 8
textcat_loss: 0.736	 f1_score: 0.720	 accuracy_score: 0.730
Elapsed time: 3.81 minutes

Iteration: 9
textcat_loss: 0.599	 f1_score: 0.717	 accuracy_score: 0.725
Elapsed time: 3.39 minutes

Iteration: 10
textcat_loss: 0.588	 f1_score: 0.705	 accuracy_score: 0

<spacy.lang.en.English at 0x7fef3880a710>

In [192]:
# ensemble model architecture
train_spacy(10, 'ensemble', 0.2, 4e-3)

Training the model..

Iteration: 1
textcat_loss: 9.235	 f1_score: 0.508	 accuracy_score: 0.570
Elapsed time: 1.05 minutes

Iteration: 2
textcat_loss: 6.580	 f1_score: 0.638	 accuracy_score: 0.665
Elapsed time: 1.2 minutes

Iteration: 3
textcat_loss: 5.604	 f1_score: 0.723	 accuracy_score: 0.730
Elapsed time: 1.64 minutes

Iteration: 4
textcat_loss: 4.349	 f1_score: 0.681	 accuracy_score: 0.680
Elapsed time: 1.71 minutes

Iteration: 5
textcat_loss: 3.843	 f1_score: 0.716	 accuracy_score: 0.720
Elapsed time: 2.92 minutes

Iteration: 6
textcat_loss: 2.944	 f1_score: 0.718	 accuracy_score: 0.720
Elapsed time: 3.55 minutes

Iteration: 7
textcat_loss: 2.823	 f1_score: 0.737	 accuracy_score: 0.740
Elapsed time: 3.57 minutes

Iteration: 8
textcat_loss: 2.100	 f1_score: 0.742	 accuracy_score: 0.745
Elapsed time: 4.12 minutes

Iteration: 9
textcat_loss: 2.163	 f1_score: 0.744	 accuracy_score: 0.745
Elapsed time: 3.78 minutes

Iteration: 10
textcat_loss: 1.669	 f1_score: 0.728	 accuracy_score: 0.

<spacy.lang.en.English at 0x7fed734d6d68>

### Transformer models

In [193]:
def trf_train_spacy(iterations, dropout, learn_rate):

    nlp = spacy.load('en_trf_bertbaseuncased_lg')

    textcat = nlp.create_pipe('trf_textcat', config={'exclusive_classes':True})

    for label in label_values:
        textcat.add_label(label)
        
    nlp.add_pipe(textcat)

#     pipe_exceptions = ['textcat']
#     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    
    optimizer = nlp.resume_training()
    optimizer.learn_rate = learn_rate
    print('Training the model..')
    # start the overall timer once, before the first iteration
    # (time.clock() was removed in Python 3.8, so use time.perf_counter() instead)
    total_start_time = time.perf_counter()

    for i in range(iterations):
        print('\nIteration:', str(i+1))
        start_time = time.perf_counter()
        losses = {}
        true_labels = []
        pred_labels = []
        b = 0
        random.shuffle(train_data)
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
            b += 1
            print('Batch:', b)

        # evaluate on the test set using the averaged parameters
        with textcat.model.use_params(optimizer.averages):

            docs = [nlp(text) for text in test_texts]

            for j, doc in enumerate(textcat.pipe(docs)):
                true_series = pd.Series(test_labels[j]['cats'])
                true_label = true_series.idxmax()
                true_labels.append(true_label)

                pred_series = pd.Series(doc.cats)
                pred_label = pred_series.idxmax()
                pred_labels.append(pred_label)

            score_f1 = f1_score(true_labels, pred_labels, average='weighted')
            score_ac = accuracy_score(true_labels, pred_labels)

            print('textcat_loss: {:.3f}\t f1_score: {:.3f}\t accuracy_score: {:.3f}'.format(losses['trf_textcat'], score_f1, score_ac))

            print('Elapsed time:', str(round((time.perf_counter() - start_time)/60,2)) + ' minutes')

    print('Total time:', str(round((time.perf_counter() - total_start_time)/60,2)) + ' minutes')
    
    return nlp

In [194]:
# trial run of a transformer model on the 1000-row subset
trf_train_spacy(5, 0.2, 4e-3)

Training the model..

Iteration: 1
Batch: 1
Batch: 2
Batch: 3
Batch: 4
Batch: 5
Batch: 6
Batch: 7
Batch: 8
Batch: 9
Batch: 10
Batch: 11
Batch: 12
Batch: 13
Batch: 14
Batch: 15
Batch: 16
Batch: 17
Batch: 18
Batch: 19
Batch: 20
Batch: 21
Batch: 22
Batch: 23
Batch: 24
Batch: 25
Batch: 26
Batch: 27
Batch: 28
Batch: 29
Batch: 30
Batch: 31
Batch: 32
Batch: 33
Batch: 34
Batch: 35
Batch: 36
Batch: 37
Batch: 38
Batch: 39
Batch: 40
Batch: 41
Batch: 42
Batch: 43
Batch: 44
Batch: 45
Batch: 46
Batch: 47
Batch: 48
Batch: 49
Batch: 50
Batch: 51
Batch: 52
Batch: 53
Batch: 54
Batch: 55
Batch: 56
Batch: 57
Batch: 58
Batch: 59
Batch: 60
Batch: 61
Batch: 62
Batch: 63
Batch: 64
Batch: 65
Batch: 66
Batch: 67
Batch: 68
Batch: 69
Batch: 70
Batch: 71
Batch: 72
Batch: 73
Batch: 74
Batch: 75
Batch: 76
Batch: 77
Batch: 78
Batch: 79
Batch: 80
Batch: 81
Batch: 82
Batch: 83
Batch: 84
Batch: 85
Batch: 86
Batch: 87
Batch: 88
Batch: 89
Batch: 90
Batch: 91
Batch: 92
Batch: 93
Batch: 94
Batch: 95
Batch: 96
Batch: 97
Batc

Batch: 154
Batch: 155
Batch: 156
Batch: 157
Batch: 158
Batch: 159
Batch: 160
Batch: 161
Batch: 162
Batch: 163
Batch: 164
Batch: 165
Batch: 166
Batch: 167
Batch: 168
Batch: 169
Batch: 170
Batch: 171
Batch: 172
Batch: 173
Batch: 174
Batch: 175
Batch: 176
Batch: 177
Batch: 178
Batch: 179
Batch: 180
Batch: 181
Batch: 182
Batch: 183
Batch: 184
Batch: 185
Batch: 186
Batch: 187
Batch: 188
Batch: 189
Batch: 190
Batch: 191
Batch: 192
Batch: 193
Batch: 194
Batch: 195
Batch: 196
Batch: 197
Batch: 198
Batch: 199
Batch: 200
textcat_loss: 10.610	 f1_score: 0.126	 accuracy_score: 0.285
Elapsed time: 61.66 minutes

Iteration: 5
Batch: 1
Batch: 2
Batch: 3
Batch: 4
Batch: 5
Batch: 6
Batch: 7
Batch: 8
Batch: 9
Batch: 10
Batch: 11
Batch: 12
Batch: 13
Batch: 14
Batch: 15
Batch: 16
Batch: 17
Batch: 18
Batch: 19
Batch: 20
Batch: 21
Batch: 22
Batch: 23
Batch: 24
Batch: 25
Batch: 26
Batch: 27
Batch: 28
Batch: 29
Batch: 30
Batch: 31
Batch: 32
Batch: 33
Batch: 34
Batch: 35
Batch: 36
Batch: 37
Batch: 38
Batch: 39

<spacy_transformers.language.TransformersLanguage at 0x7fec5b6139b0>
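
The scores logged just before iteration 5 (f1_score 0.126, accuracy_score 0.285) are worth a sanity check. They are what a classifier would produce if it predicted the majority class for every test row. The sketch below computes weighted F1 by hand to show this. It assumes that 57 of the 200 test rows are 'Debt collection' (28.3% of 200, rounded) and lumps all other classes into one placeholder label. Lumping does not change the result here, because every class that is never predicted contributes an F1 of zero.

```python
from collections import Counter

def weighted_f1(true_labels, pred_labels):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(true_labels)
    predicted = Counter(pred_labels)
    hits = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)
    total = 0.0
    for label, n_true in support.items():
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(true_labels)

# Assumption: 57 'Debt collection' rows in the test set; 'other' is a placeholder label.
true_labels = ['Debt collection'] * 57 + ['other'] * 143
pred_labels = ['Debt collection'] * 200   # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
f1 = weighted_f1(true_labels, pred_labels)
print(round(accuracy, 3), round(f1, 3))
```

If that is what happened, the transformer had not yet learned anything beyond the class prior at that point. A learning rate of 4e-3 is a plausible suspect, since much smaller values are commonly used when fine-tuning BERT-style models.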

- Try tuning hyperparameters to improve efficiency and accuracy
- Implement this notebook with spaCy 3.0
- Transformer models need a GPU; training is far too slow on a CPU