## Classification of COVID-19 tweets to ascertain what can be learnt about public knowledge of the symptoms, treatment and prevention of the virus

This project contributes to health surveillance: it aims to help medical professionals collate useful information to support intervention measures during an outbreak.

The objectives of this project are:

1. Implement the BERT algorithm for tweet classification.
2. Aid in labelling large datasets.
3. Gather information on public knowledge of the virus.


## Data processing

### Processing literature data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load covid literature data
covid_literature_df = pd.read_csv('C:/Users/Sammy/2021/final-project/data/metadata.csv', usecols=['title','abstract','authors','doi','publish_time','pdf_json_files'])
np.random.seed()




In [3]:
covid_literature_df.head()

Unnamed: 0,title,doi,abstract,publish_time,authors,pdf_json_files
0,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",document_parses/pdf_json/d1aafb70c066a2068b027...
1,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",document_parses/pdf_json/6b0567729c2143a66d737...
2,Surfactant protein-D and pulmonary host defense,10.1186/rr19,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",document_parses/pdf_json/06ced00a5fc04215949aa...
3,Role of endothelin-1 in lung disease,10.1186/rr44,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",document_parses/pdf_json/348055649b6b8cf2b9a37...
4,Gene expression in epithelial cells in respons...,10.1186/rr61,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",document_parses/pdf_json/5f48792a5fa08bed9f560...


In [4]:
#missing values
missing_values = covid_literature_df.isnull().sum()
missing_values

title                216
doi               180633
abstract          115815
publish_time         220
authors            12283
pdf_json_files    274650
dtype: int64

In [5]:
#percentage of missing data
all_cells = np.prod(covid_literature_df.shape)  # np.product is a deprecated alias of np.prod
all_missing = missing_values.sum()
percentage_missing = (all_missing/all_cells)*100
print(percentage_missing)

23.501964478366585
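
The overall figure hides large differences between columns. A minimal sketch of a per-column breakdown follows; `toy_df` and its values are made up for illustration and are not the literature metadata itself.

```python
import pandas as pd

# Toy stand-in for covid_literature_df (values invented for illustration)
toy_df = pd.DataFrame({
    "title": ["a", "b", None, "d"],
    "doi": [None, None, "10.1/x", None],
    "abstract": ["text", None, "text", "text"],
})

# isnull().mean() gives the fraction of missing cells in each column
missing_pct_by_column = toy_df.isnull().mean() * 100

# The overall percentage is the mean over every cell in the frame
overall_missing_pct = toy_df.isnull().to_numpy().mean() * 100

print(missing_pct_by_column.sort_values(ascending=False))
print(overall_missing_pct)
```

On the real frame, the counts printed earlier suggest that `pdf_json_files` and `doi` account for most of the missing cells.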


In [6]:
#drop duplicates and fill missing values
covid_literature_df.drop_duplicates(keep='first', inplace=True)
covid_literature_df.fillna('no text available', inplace=True)

In [7]:
covid_literature_df['abstract'] = (covid_literature_df['title']+ ' ' + covid_literature_df['abstract']).apply(lambda row: row.strip())

In [8]:
covid_literature_df['abstract'] = covid_literature_df['abstract'].apply(lambda row: row.replace('no text available',''))

In [9]:
#reduce the dataset to literature discussing covid specific topics
def find_covid_lit(df):
    df1 = df[df['abstract'].str.contains('ncov')]
    df2 = df[df['abstract'].str.contains('corona')]
    df3 = df[df['abstract'].str.contains('covid')]
    df4 = df[df['abstract'].str.contains('-cov-2')]
    df5 = df[df['abstract'].str.contains('cov2')]
    
    
    data =[df1,df2,df3,df4,df5]
    df = pd.concat(data)
    df=df.drop_duplicates(subset='title', keep="first")
    return df
    
df=find_covid_lit(covid_literature_df)
print (df.shape)
df.head()   

(70814, 6)


Unnamed: 0,title,doi,abstract,publish_time,authors,pdf_json_files
524,Genomic Signatures of Strain Selection and Enh...,10.1371/journal.pone.0017836,Genomic Signatures of Strain Selection and Enh...,2011-03-25,"Gibbons, Henry S.; Broomall, Stacey M.; McNew,...",document_parses/pdf_json/6cc30d377f0bd9004378e...
549,The Multifaceted Poliovirus 2A Protease: Regul...,10.1155/2011/369648,The Multifaceted Poliovirus 2A Protease: Regul...,2011-04-14,"Castelló, Alfredo; Álvarez, Enrique; Carrasco,...",document_parses/pdf_json/5bff40b0df656057c9db8...
621,Dominating Biological Networks,10.1371/journal.pone.0023016,Dominating Biological Networks Proteins are es...,2011-08-26,"Milenković, Tijana; Memišević, Vesna; Bonato, ...",document_parses/pdf_json/d0acac75cb3d2c19abfe0...
784,Both TLR2 and TRIF Contribute to Interferon-β ...,10.1371/journal.pone.0033299,Both TLR2 and TRIF Contribute to Interferon-β ...,2012-03-14,"Aubry, Camille; Corr, Sinéad C.; Wienerroither...",document_parses/pdf_json/2305cbae32a50cf992335...
884,Proteasome-Dependent Disruption of the E3 Ubiq...,10.1371/journal.ppat.1002789,Proteasome-Dependent Disruption of the E3 Ubiq...,2012-07-05,"Fehr, Anthony R.; Gualberto, Nathaniel C.; Sav...",document_parses/pdf_json/5ae7223cc6ab3437ba262...
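
`str.contains` is case-sensitive by default, so lowercase patterns do not match spellings such as "COVID" or "SARS-CoV-2". The sketch below shows a case-insensitive variant that combines the five patterns into one regular expression. `toy_df` is made up, and applying this idea to the real data would change the row count reported above.

```python
import pandas as pd

# Toy abstracts (invented) with mixed-case spellings
toy_df = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "abstract": [
        "COVID-19 outcomes in adults",
        "SARS-CoV-2 spike protein structure",
        "Influenza vaccination uptake",
    ],
})

# One alternation pattern instead of five separate filters
covid_pattern = r"ncov|corona|covid|-cov-2|cov2"

# case=False matches upper- and mixed-case spellings;
# na=False treats missing abstracts as non-matching
mask = toy_df["abstract"].str.contains(covid_pattern, case=False, regex=True, na=False)
covid_only = toy_df[mask].drop_duplicates(subset="title", keep="first")

print(covid_only["title"].tolist())
```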


### Labelling the literature data based on terms in the abstract

In [60]:
import functools
from nltk import PorterStemmer
from gensim.parsing.preprocessing import remove_stopwords

# converts terms to their morphological root, e.g. "infection" and "infected" to "infect"
def stemmer(words):
    porter = PorterStemmer()  # local name chosen so it does not shadow this function
    return [porter.stem(w) for w in words]


def get_sentences(df1,search_terms,str1):
    df_table = pd.DataFrame(columns = ["pub_date","authors","title","excerpt",'label'])
    search_terms=stemmer(search_terms)
    for index, row in df1.iterrows():
        
        pub_sentence=''
        sentences_used=0
        
        sentences = row['abstract'].split('. ')
        
        for sentence in sentences:
            
            missing=0
            
            for word in search_terms:
                
                if word not in sentence:
                    missing=1
                    
            
            if missing==0 and len(sentence)<1000 and sentence!='':
                sentence=sentence.capitalize()
                if sentence[len(sentence)-1]!='.':
                    sentence=sentence+'.'
                pub_sentence=pub_sentence+sentence
                
        if pub_sentence!='':
            sentence=pub_sentence
            sentences_used=sentences_used+1
            authors=row["authors"].split(" ")            
            title=row["title"]                       
            to_append = [row['publish_time'],authors[0]+' et al.',title,sentence,str1]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
    return df_table


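def search_dataframe(df,search_terms):
    # keep rows whose abstract contains every raw search term (case-sensitive substring match);
    # stemming is only applied afterwards, inside get_sentences
    df1=df[functools.reduce(lambda a, b: a&b, (df['abstract'].str.contains(s) for s in search_terms))]
    return df1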
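
A plain substring test such as `word in sentence` also matches inside longer words, so "care" is found in "healthcare" and "case" in "showcase". The sketch below shows one possible alternative: a term only counts when it starts a word. `sentences_matching_all` is a hypothetical helper, not part of the notebook, and the sample text is invented.

```python
import re

def sentences_matching_all(abstract, stems):
    """Return the sentences in which every stem starts some word (case-insensitive)."""
    patterns = [re.compile(r"\b" + re.escape(stem), re.IGNORECASE) for stem in stems]
    sentences = [s.strip() for s in abstract.split(". ") if s.strip()]
    return [s for s in sentences if all(p.search(s) for p in patterns)]

abstract = ("Healthcare workers showcase resilience. "
            "Early case detection limits transmission of infection. "
            "Infected cases were isolated")

matches = sentences_matching_all(abstract, ["transmiss", "infect", "case"])
print(matches)

# "care" inside "Healthcare" or "scare" is not counted as a hit
care_matches = sentences_matching_all(
    "Healthcare workers scare easily. Supportive care matters", ["care"])
print(care_matches)
```

Matching at word starts keeps stemmed prefixes such as "infect" working for "infection" and "infected" while dropping hits buried in unrelated words.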







In [17]:
#list of labels and list of search terms per label
labels = ['transmission','treatment','symptom','health risk', 'prevention']
search=[['transmission','infection','case'],['treatment','care'],['symptom','asymptomatic'],['risk','vulnerable','affect'],['prevent','quarantine']]
label_table = ['','','','','']
q=0


for search_terms in search:
    #str1=''
    str1 = labels[q]
    
    #search the dataframe for all words
    df1=search_dataframe(df,search_terms)
   
    # get best sentences
    df_table=get_sentences(df1,search_terms,str1)    
    
    length=df_table.shape[0]

    if length<1:
        print ("No reliable data could be located in the literature")
    else:
        display(df_table.head())
    label_table[q] = df_table
    q=q+1
    
print ('done')

Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-11-20,"Boshier, et al.",Remdesivir induced viral RNA and subgenomic RN...,We conclude that these are likely to have aris...,transmission
1,2020-01-03,"Chertow, et al.","Influenza, Measles, SARS, MERS, and Smallpox",Early case identification and strict infection...,transmission
2,2020-09-17,"Gupta, et al.",Estimating the Impact of Daily Weather on the ...,"Daily maximum (t(max)), minimum (t(min)), mean...",transmission
3,2015,"Al-Tawfiq, et al.",Middle East respiratory syndrome coronavirus i...,"Data suggest the overcrowding, late recognitio...",transmission
4,2020-06-17,"Jing, et al.",Household secondary attack rate of COVID-19 an...,We assessed the demographic determinants of tr...,transmission


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-12-23,"Taylor, et al.",Awake-Prone Positioning Strategy for Non-Intub...,Five inpatient medical service teams were rand...,treatment
1,2020-06-01,"Blais, et al.",Consensus statement: summary of the Quebec Lun...,"For each level-of-activity scenario, suggestio...",treatment
2,2020,"Taylor, et al.",Awake-Prone Positioning Strategy for Non-Intub...,Five inpatient medical service teams were rand...,treatment
3,2020,"Berman, et al.",Supportive Care: An Indispensable Component of...,The current focus of cancer care is on initial...,treatment
4,2020,"Chamsi-Pasha, et al.",Ethical dilemmas in the era of COVID-19,"Across the globe, hospitals are being challeng...",treatment


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2016-03-24,"Ziegler, et al.",The Lymphocytic Choriomeningitis Virus Matrix ...,The lymphocytic choriomeningitis virus matrix ...,symptom
1,2019,"Erben, et al.",Multicenter experience with endovascular treat...,Clinical presentation included asymptomatic in...,symptom
2,2020-11-20,"Badgujar, et al.",Structural insights into loss of function of a...,The opportunistic pathogen streptococcus pneum...,symptom
3,2011,"Gray, et al.",Influence of site and operator characteristics...,Methods in this assessment of the capture 2 st...,symptom
4,2016,"Brachmann, et al.",Uncovering Atrial Fibrillation Beyond Short-Te...,"However, af can be paroxysmal and asymptomatic...",symptom


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-11-04,"Eufemia, et al.",Peacebuilding in times of COVID-19: risk-adapt...,"After reviewing academic and grey literature, ...",health risk
1,2020,"Torres-Pinzon, et al.",Coronavirus Disease 2019 and the Case to Cover...,Coronavirus disease 2019 and the case to cover...,health risk
2,2020,"Stasevic-Karlicic, et al.",Perspectives on mental health services during ...,Perspectives on mental health services during ...,health risk
3,2020,"Cattaneo, et al.",Clinical characteristics and risk factors for ...,Clinical characteristics and risk factors for ...,health risk
4,2020-04-06,"Islam, et al.",Modeling risk of infectious diseases: a case o...,Results according to the calculated risk index...,health risk


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-09-21,"Su, et al.",Early diagnosis and population prevention of c...,Due to the lack of effective drugs and vaccine...,prevention
1,2020-06-17,"Jing, et al.",Household secondary attack rate of COVID-19 an...,"In addition to case finding and isolation, tim...",prevention
2,2020-09-01,"Ayenigbara, et al.","COVID-19 (SARS-CoV-2) pandemic: fears, facts a...",The world is faced with containing the spread ...,prevention
3,2020-05-13,"Singh, et al.",Knowledge and Perception Towards Universal Saf...,Although participants' overall knowledge score...,prevention
4,2020-11-04,Feiz et al.,The health effects of quarantine during the CO...,"In the case of the coronavirus epidemic, quara...",prevention


done


In [18]:
covid_data = pd.concat([label_table[0],label_table[1],label_table[2],label_table[3],label_table[4]],axis = 0)
covid_data.reset_index()

Unnamed: 0,index,pub_date,authors,title,excerpt,label
0,0,2020-11-20,"Boshier, et al.",Remdesivir induced viral RNA and subgenomic RN...,We conclude that these are likely to have aris...,transmission
1,1,2020-01-03,"Chertow, et al.","Influenza, Measles, SARS, MERS, and Smallpox",Early case identification and strict infection...,transmission
2,2,2020-09-17,"Gupta, et al.",Estimating the Impact of Daily Weather on the ...,"Daily maximum (t(max)), minimum (t(min)), mean...",transmission
3,3,2015,"Al-Tawfiq, et al.",Middle East respiratory syndrome coronavirus i...,"Data suggest the overcrowding, late recognitio...",transmission
4,4,2020-06-17,"Jing, et al.",Household secondary attack rate of COVID-19 an...,We assessed the demographic determinants of tr...,transmission
5,5,2020-07-09,"Rockett, et al.",Revealing COVID-19 transmission in Australia b...,We report that the prospective sequencing of s...,transmission
6,6,2020-02-27,"Zhu, et al.",[Challenges and countermeasures on Chinese mal...,Due to the extensive spread and high transmiss...,transmission
7,7,2020-04-08,"Wang, et al.",Current trends and future prediction of novel ...,"Based on the current control measures, we prop...",transmission
8,8,2020-02-06,no et al.,"2019-nCoV acute respiratory disease, Australia...",It includes data on australian cases notified ...,transmission
9,9,2020-07-14,"Bahramian, et al.",COVID-19 Considerations in Pediatric Dentistry.,"Therefore, we need to beware of the symptoms a...",transmission
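
The treatment table printed earlier lists the same title twice with different date formats, so some excerpts may be counted more than once after concatenation. A minimal sketch of removing such repeats on a made-up frame follows. The choice of `title`, `excerpt` and `label` as the duplicate key is an assumption.

```python
import pandas as pd

# Toy rows (invented) where the same excerpt appears under two date formats
toy = pd.DataFrame({
    "pub_date": ["2020-12-23", "2020", "2020-06-01"],
    "title": ["Awake-prone positioning", "Awake-prone positioning", "Consensus statement"],
    "excerpt": ["Five teams were randomised.", "Five teams were randomised.",
                "Suggestions are given."],
    "label": ["treatment", "treatment", "treatment"],
})

# Ignore pub_date when deciding what counts as a duplicate,
# then rebuild a clean 0..n-1 index
deduplicated = (
    toy.drop_duplicates(subset=["title", "excerpt", "label"], keep="first")
       .reset_index(drop=True)
)

print(deduplicated)
```

`reset_index` returns a new frame rather than modifying the original, so the result has to be assigned for the new index to persist.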


In [19]:
covid_data.sample(20)

Unnamed: 0,pub_date,authors,title,excerpt,label
12,2020-09-02,"Khokhar, et al.",Viricidal treatments for prevention of coronav...,"However, indirect transmission from inanimate ...",transmission
2149,2020,"Kshatri, et al.",Findings from serological surveys (in August 2...,A majority of the seropositive participants we...,symptom
464,2016,"Iinuma, et al.",[Middle East Respiratory Syndrome (MERS)].,Clinical features of mers range from asymptoma...,symptom
2100,2020-04-22,"Smith, et al.",How best to use limited tests? Improving COVID...,Findings: we estimated a median delay of 7 (95...,symptom
565,2020,"Jamerson, et al.",Glucose-6-Phosphate Dehydrogenase Deficiency: ...,"Although most individuals are asymptomatic, ex...",symptom
208,2020-05-15,"Myers, et al.",Identification and Monitoring of International...,Effectiveness of covid-19 screening and monito...,symptom
2335,2020-10-15,"Imai, et al.",Characteristics and considerations in the medi...,"Compared to adults, quite a few children are a...",symptom
225,2020-12-07,"Ebeid, et al.",COVID-19 in Children With Cancer: A Single Low...,"Initially, 10 (66.7%) were asymptomatic and 5 ...",symptom
1004,2020-04-10,"Serper, et al.",Telemedicine in Liver Disease and Beyond: Can ...,"Telemedicine, defined as the delivery of healt...",treatment
97,2020-12-23,"Jimenez-Kurlander, et al.",COVID-19 in pediatric survivors of childhood c...,"We assessed covid-19-related symptoms, severe ...",symptom


### Processing tweet data

In [34]:
tweet_data = pd.read_csv('C:/Users/Sammy/2021/final-project/data/tweets/tweets.csv')

In [35]:
print(tweet_data.head())
tweet_data.shape

                                                text         label
0  The question we need urgent answer to is how l...  transmission
1  it takes like 2 weeks for symptoms to start sh...       symptom
2  Fort Bend County has a confirmed Coronavirus p...    prevention
3  China confirms 170 deaths as coronavirus sprea...       symptom
4  Excellent discussion from @PascalJabbourMD on ...   health risk


(135, 2)

In [36]:
tweet_data['label'].value_counts(normalize=False)

prevention      60
symptom         25
transmission    23
treatment       21
health risk      6
Name: label, dtype: int64
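
These counts are imbalanced, which gives a useful reference point for the models that follow: a classifier that always predicts the largest class already reaches that class's share of the data. A small sketch of the arithmetic, using the counts printed above:

```python
from collections import Counter

# Label counts as printed by value_counts() above
label_counts = Counter({
    "prevention": 60,
    "symptom": 25,
    "transmission": 23,
    "treatment": 21,
    "health risk": 6,
})

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Accuracy of a classifier that always predicts the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline = majority_count / total

print(f"{majority_label}: {majority_baseline:.3f}")
```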

In [44]:
tweet_data.count()

text     135
label    135
dtype: int64

In [50]:
tweet_data.drop_duplicates(inplace=True)
tweet_data.dropna(inplace=True)

In [52]:
tweet_data.count()

text     135
label    135
dtype: int64

In [53]:
tweet_data.head()

Unnamed: 0,text,label
0,The question we need urgent answer to is how l...,transmission
1,it takes like 2 weeks for symptoms to start sh...,symptom
2,Fort Bend County has a confirmed Coronavirus p...,prevention
3,China confirms 170 deaths as coronavirus sprea...,symptom
4,Excellent discussion from @PascalJabbourMD on ...,health risk


### Pre-process tweets

In [37]:
#!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0




In [40]:
import preprocessor as tp

In [55]:
def tweet_processor(row):
    
    tweet = row['text']
    tweet = tp.clean(tweet)
    
    return tweet
    

In [56]:
tweet_data['text'] = tweet_data.apply(tweet_processor,axis = 1)

In [57]:
tweet_data.head()

Unnamed: 0,text,label
0,The question we need urgent answer to is how l...,transmission
1,it takes like weeks for symptoms to start show...,symptom
2,Fort Bend County has a confirmed Coronavirus p...,prevention
3,China confirms deaths as coronavirus spreads t...,symptom
4,Excellent discussion from on cases during,health risk
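
The cleaned output shows that `tp.clean` with its default options also strips numbers ("2 weeks" becomes "weeks", "170 deaths" becomes "deaths"), which may matter for symptom timing and case counts. The sketch below is a standard-library alternative that removes URLs and mentions but keeps digits. `clean_keep_numbers` is a hypothetical helper and does not reproduce the tweet-preprocessor library's behaviour.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_keep_numbers(text):
    """Remove URLs and @mentions, drop the '#' symbol, keep numbers and words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")          # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_keep_numbers(
    "it takes like 2 weeks for symptoms @user https://t.co/x #covid19")
print(cleaned)
```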


In [58]:
#remove stop words and further processing
def stopword_remover(row):
    text = row['text']
    text = remove_stopwords(text)
    return text   
    

In [61]:
tweet_data['text'] = tweet_data.apply(stopword_remover,axis = 1)

In [63]:
tweet_data['text'] = tweet_data['text'].str.lower().str.strip()

In [None]:
#tweet_data['text'].str.replace('[^\w\s]',' ').str.replace('\s\s+', ' ')

#### Defining our base model

In [64]:
from sklearn.model_selection import train_test_split

In [65]:
tweet = tweet_data['text'].values
y_label = tweet_data['label'].values

In [66]:
tweet_train, tweet_test, tweet_y_train, tweet_y_test = train_test_split(tweet, y_label,test_size = 0.25, random_state = 42)
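
With only 6 `health risk` tweets, a purely random split can leave very few examples of that class on either side. The sketch below shows a stratified split using the `stratify` argument of `train_test_split`. The texts are synthetic placeholders and only the label counts come from the output above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Label counts taken from value_counts() above; the texts are placeholders
counts = {"prevention": 60, "symptom": 25, "transmission": 23,
          "treatment": 21, "health risk": 6}
labels = np.array([label for label, n in counts.items() for _ in range(n)])
texts = np.array([f"tweet {i}" for i in range(len(labels))])

# stratify=labels keeps each class's share roughly equal in both splits
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

print(len(y_train), len(y_test))
print(sorted(set(y_test)))
```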

In [67]:
from sklearn.feature_extraction.text import CountVectorizer

In [68]:
vectorizer = CountVectorizer()
vectorizer.fit(tweet_train)

X_train = vectorizer.transform(tweet_train)
X_test = vectorizer.transform(tweet_test)

In [69]:
len(tweet)
#X_train

135

In [70]:
# y_train = pd.factorize(tweet_y_train)
# y_test = pd.factorize(tweet_y_test)

In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier 

#### Decision tree classifier model

In [72]:
classifier = DecisionTreeClassifier(max_depth = 2)
classifier.fit(X_train,tweet_y_train)
score = classifier.score(X_test,tweet_y_test)
print("Accuracy:", score)

predictions = classifier.predict(X_test) 
  
# creating a confusion matrix 
cm = confusion_matrix(tweet_y_test, predictions) 
cm

Accuracy: 0.47058823529411764


array([[ 0,  3,  0,  0,  0],
       [ 0, 15,  0,  0,  0],
       [ 0,  3,  0,  0,  0],
       [ 0,  6,  0,  1,  0],
       [ 0,  6,  0,  0,  0]], dtype=int64)
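
A bare matrix is hard to read without class names. The sketch below derives per-class recall from the matrix printed above. The class order is assumed to be alphabetical, which is how `confusion_matrix` orders string labels when no `labels` argument is given.

```python
import numpy as np

# Assumed row/column order: sorted label names
class_names = ["health risk", "prevention", "symptom", "transmission", "treatment"]

# Decision tree confusion matrix copied from the output above (rows = true labels)
cm = np.array([
    [0, 3, 0, 0, 0],
    [0, 15, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 6, 0, 1, 0],
    [0, 6, 0, 0, 0],
])

support = cm.sum(axis=1)            # number of true examples per class
recall = np.diag(cm) / support      # share of each class that was recovered
accuracy = np.trace(cm) / cm.sum()

for name, r, n in zip(class_names, recall, support):
    print(f"{name:<13} recall={r:.2f} (n={n})")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the tree assigns almost every test tweet to `prevention`, so its accuracy is only slightly above the share of `prevention` tweets in the test set (15 of 34).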

#### Logistic regression model

In [73]:
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train,tweet_y_train)
lr_score = lr_classifier.score(X_test,tweet_y_test)
print("Accuracy:", lr_score)

predictions = lr_classifier.predict(X_test) 
  
# creating a confusion matrix 
lr_cm = confusion_matrix(tweet_y_test, predictions) 
lr_cm

Accuracy: 0.5294117647058824


array([[ 0,  3,  0,  0,  0],
       [ 0, 14,  0,  0,  1],
       [ 0,  2,  0,  1,  0],
       [ 0,  4,  0,  3,  0],
       [ 0,  4,  1,  0,  1]], dtype=int64)
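
Both baselines lean heavily towards the largest class. One option that could be explored is a pipeline that reweights classes by inverse frequency. The sketch below swaps `CountVectorizer` for `TfidfVectorizer` and sets `class_weight="balanced"`. The texts and labels are invented, and no claim is made that this improves results on the real tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration)
toy_texts = [
    "wash hands wear mask", "stay home quarantine", "wear mask keep distance",
    "fever cough fatigue", "loss of smell and cough",
    "virus spreads through droplets",
    "patients given oxygen therapy",
    "elderly people are vulnerable",
]
toy_labels = [
    "prevention", "prevention", "prevention",
    "symptom", "symptom",
    "transmission",
    "treatment",
    "health risk",
]

# class_weight="balanced" scales each class inversely to its frequency
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipeline.fit(toy_texts, toy_labels)

predicted = pipeline.predict(["wash hands and wear a mask"])
print(predicted)
```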

### Implementing BERT with Keras

In [74]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

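# Small dense network over the bag-of-words features (not BERT); it is not trained in the cells shown.
# Note: a single sigmoid output models two classes, whereas the tweets carry five labels.
input_dim = X_train.shape[1]  # Number of features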

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [75]:
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                10540     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 10,551
Trainable params: 10,551
Non-trainable params: 0
_________________________________________________________________


In [76]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import logging
logging.basicConfig(level=logging.INFO)

In [77]:
#!pip install wget
#import wget

In [78]:
# url = 'https://raw.githubusercontent.com/google-research/bert/master/tokenization.py'
# filename = wget.download(url)

In [80]:
import tensorflow_hub as hub
import tokenization
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'.
INFO:absl:Downloaded https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2, Total size: 421.50MB
INFO:absl:Downloaded TF-Hub Modul

In [81]:
import tokenization

In [82]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [83]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
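
To make the padding and masking logic concrete, the sketch below runs the same steps on one sentence with a tiny stand-in tokenizer. `ToyTokenizer` and `encode_one` are hypothetical. The real `FullTokenizer` uses WordPiece and BERT's own vocabulary, so actual ids will differ.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical stand-in: whitespace split plus a vocabulary built on the fly."""
    def __init__(self):
        self.vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Unknown tokens get the next free id
        return [self.vocab.setdefault(t, len(self.vocab)) for t in tokens]

def encode_one(text, tokenizer, max_len):
    tokens = tokenizer.tokenize(text)[: max_len - 2]   # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len          # 1 = real token, 0 = padding
    segments = [0] * max_len                            # single-sentence input
    return np.array(ids), np.array(mask), np.array(segments)

ids, mask, segments = encode_one("symptoms start showing late", ToyTokenizer(), max_len=8)
print(ids)
print(mask)
```

Every encoded row has exactly `max_len` entries, which is what lets the three arrays be stacked and fed to fixed-shape inputs.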

In [90]:
def build_model(bert_layer, max_len=512):
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    net = tf.keras.layers.Dense(64, activation='relu')(clf_output)
    net = tf.keras.layers.Dropout(0.2)(net)
    net = tf.keras.layers.Dense(32, activation='relu')(net)
    net = tf.keras.layers.Dropout(0.2)(net)
    out = tf.keras.layers.Dense(5, activation='softmax')(net)
    
    model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [91]:
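from collections import Counter
# build one label-to-code mapping from the training labels and reuse it for the test labels;
# factorizing each split separately would number the classes in a different order
tweet_y_train, label_names = pd.factorize(tweet_y_train)
label_to_code = {name: code for code, name in enumerate(label_names)}
tweet_y_test = np.array([label_to_code[name] for name in tweet_y_test])
print(Counter(tweet_y_train))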

Counter({0: 45, 1: 22, 2: 16, 3: 15, 4: 3})
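
Integer codes are only comparable across splits when both splits share one label-to-code mapping, because `pd.factorize` numbers labels in order of first appearance. A small sketch with made-up labels:

```python
import numpy as np
import pandas as pd

toy_train_labels = np.array(["prevention", "symptom", "prevention", "treatment"])
toy_test_labels = np.array(["symptom", "prevention"])

# Separate calls: codes follow first appearance within each array
separate_train_codes, _ = pd.factorize(toy_train_labels)
separate_test_codes, _ = pd.factorize(toy_test_labels)
# "symptom" is code 1 in the training array but code 0 in the test array

# Shared mapping: build it once, apply it to both splits
classes = sorted(set(toy_train_labels))
label_to_code = {label: code for code, label in enumerate(classes)}
shared_train_codes = np.array([label_to_code[label] for label in toy_train_labels])
shared_test_codes = np.array([label_to_code[label] for label in toy_test_labels])

print(separate_test_codes, shared_test_codes)
```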


In [92]:
max_len = 150
train_input = bert_encode(tweet_train, tokenizer, max_len=max_len)
test_input = bert_encode(tweet_test, tokenizer, max_len=max_len)
train_labels = tf.keras.utils.to_categorical(tweet_y_train, num_classes=5)
test_labels = tf.keras.utils.to_categorical(tweet_y_test, num_classes=5)

In [93]:
model = build_model(bert_layer, max_len=max_len)
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 150)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 150)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 150)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]           

In [95]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_accuracy', save_best_only=True, verbose=1)
earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5, verbose=1)

train_history = model.fit(
    train_input, train_labels, 
    validation_split=0.2,
    epochs=100,
    callbacks=[checkpoint, earlystopping],
    batch_size=32,
    verbose=1
)

# history = model.fit(X_train, y_train,
#                     epochs=100,
#                     verbose=False,
#                     validation_data=(X_test, y_test)
#                     batch_size=10)

Epoch 1/100

Epoch 00001: val_accuracy improved from -inf to 0.52381, saving model to model.h5
Epoch 2/100

Epoch 00002: val_accuracy did not improve from 0.52381
Epoch 3/100

Epoch 00003: val_accuracy did not improve from 0.52381
Epoch 4/100

Epoch 00004: val_accuracy did not improve from 0.52381
Epoch 5/100

Epoch 00005: val_accuracy did not improve from 0.52381
Epoch 6/100


KeyboardInterrupt: 
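
With 3 training examples in the rarest class against 45 in the largest, one option worth testing is to weight classes inversely to their frequency. The sketch below applies the common "balanced" heuristic, n_samples / (n_classes * count), to the training counts printed earlier. The resulting dictionary has the form that Keras `model.fit` accepts through its `class_weight` argument. Whether it helps on this dataset is not tested here.

```python
# Training label counts as printed by Counter(...) earlier
train_class_counts = {0: 45, 1: 22, 2: 16, 3: 15, 4: 3}

n_samples = sum(train_class_counts.values())
n_classes = len(train_class_counts)

# Rare classes receive proportionally larger weights
class_weight = {
    code: n_samples / (n_classes * count)
    for code, count in train_class_counts.items()
}

for code, weight in class_weight.items():
    print(code, round(weight, 3))
```

Note also that `validation_split` takes the last fraction of the training arrays without stratifying, so the rarest class may be missing from either the fitted or the validation portion.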

In [None]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)
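
`model.predict` returns one row of five softmax probabilities per tweet. The sketch below turns such rows back into label names. The probabilities are invented, and `code_to_label` is a hypothetical ordering that must match whatever mapping was used to encode the training labels.

```python
import numpy as np

# Invented softmax outputs for two tweets (each row sums to 1)
toy_probabilities = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.05, 0.05],
])

# Hypothetical code-to-name order; use the mapping from label encoding in practice
code_to_label = ["prevention", "symptom", "transmission", "treatment", "health risk"]

predicted_codes = toy_probabilities.argmax(axis=1)      # most probable class per row
predicted_labels = [code_to_label[code] for code in predicted_codes]

print(predicted_labels)
```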

In [None]:
loss, accuracy = model.evaluate(train_input, train_labels, verbose=1)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(test_input, test_labels, verbose=1)
print("Testing Accuracy:  {:.4f}".format(accuracy))

In [None]:
print(train_history.history.keys())

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(train_history):
    acc = train_history.history['accuracy']
    val_acc = train_history.history['val_accuracy']
    loss = train_history.history['loss']
    val_loss = train_history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()


In [None]:
plot_history(train_history)

In [None]:
from tensorflow.keras.backend import clear_session
clear_session()