## Using a Fine-Tuned RoBERTa Model on Unseen Data (Part 2)

This script shows how to use a fine-tuned RoBERTa model to assign the labels of interest to new, unseen data.

#### Step 1: Import Packages & Load Data
* make sure to apply the **same data cleaning procedures** that were used for the training data

In [2]:
from transformers import AutoTokenizer
from transformers import Trainer, TrainingArguments, RobertaForSequenceClassification, \
     RobertaTokenizerFast, DataCollatorWithPadding, pipeline
from datasets import Dataset  # load_metric is no longer available in recent datasets releases; use evaluate
import numpy as np
import evaluate
import pandas as pd
import sklearn
import re
import os
import torch
from sklearn import metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import classification_report
from numpy import arange


pd.set_option('display.max_colwidth', None)
os.environ["WANDB_DISABLED"] = "true"
torch.manual_seed(123)

<torch._C.Generator at 0x10e759b90>

In [3]:
unseen_data = pd.read_csv("test.csv", index_col=0, nrows=50).drop(['keyword', 'location'], axis=1)
unseen_data.head(2)

| id | text |
|---|---|
| 0 | Just happened a terrible car crash |
| 2 | Heard about #earthquake is different cities, stay safe everyone. |


#### Step 2: Load Tokenizer & Create Pipeline Using Saved Model

* the **pipeline** combines the tokenizer and the model used for prediction
* we can now load our fine-tuned model from the local directory where it was saved
* **'text-classification'** is the first argument and specifies the type of pipeline
* the second argument is the path to the saved model (here `./model_disaster/results`)
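
Under the hood, a text-classification pipeline tokenizes the input, runs the model to obtain one logit per class, and converts those logits into a label and a score. The sketch below illustrates only that last step with a plain softmax. The logit values are made up for illustration, and the real pipeline may apply a different function depending on the model configuration.

```python
import math

def logits_to_prediction(logits):
    """Turn raw class logits into a {'label', 'score'} dict via softmax."""
    # Subtract the max logit for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted label is the index of the highest probability.
    label = max(range(len(probs)), key=probs.__getitem__)
    return {"label": label, "score": probs[label]}

# Hypothetical logits for a two-class model (class 0, class 1).
print(logits_to_prediction([-1.2, 1.0]))
```

The `score` is therefore the probability of the predicted class only, which is why it never drops below 0.5 for a two-class model.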

In [4]:
# model_max_length is the tokenizer argument that sets the maximum sequence length
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)

In [5]:
pipe = pipeline("text-classification", "./model_disaster/results", tokenizer=tokenizer)

#### Step 3: Perform Basic Data Cleaning

In [6]:
def prune_multple_consecutive_same_char(tweet_text):
    '''
    Collapse any run of the same character to exactly two occurrences:
    yesssssssss is converted to yess
    ssssssssssh is converted to ssh
    '''
    tweet_text = re.sub(r'(.)\1+', r'\1\1', tweet_text)
    return tweet_text

prune_multple_consecutive_same_char("yessss!!!!")

'yess!!'
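
Note that the pattern `(.)\1+` matches any repeated character, so it also shortens digits and punctuation (for example, `1000` becomes `100`). If numbers matter for your data, one option is to restrict the pattern to letters. The helper below, `prune_repeated_letters`, is a hypothetical variant and not part of the original preprocessing; if you adopt it, the training data would need the same treatment.

```python
import re

def prune_repeated_letters(tweet_text):
    """Reduce runs of two or more identical letters to exactly two.

    Digits and punctuation are left untouched.
    """
    return re.sub(r'([A-Za-z])\1+', r'\1\1', tweet_text)

# Compare the letter-only variant with the original catch-all pattern.
print(prune_repeated_letters("yessss 1000!!!"))
print(re.sub(r'(.)\1+', r'\1\1', "yessss 1000!!!"))
```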

In [7]:
# remove emoji
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range; also covers CJK and other non-emoji characters
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary-plane characters
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return emoj.sub('', data)

In [8]:
def clean_tweet(text):
    text = text.lower()
    text = prune_multple_consecutive_same_char(text)
    text = remove_emojis(text)
    text = text.replace('\\n', '')                          # drop literal "\n" sequences
    text = re.sub(r'http\S+', '', text)                     # drop URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)              # drop @mentions
    text = text.encode('ascii', errors='ignore').decode()   # drop remaining non-ASCII characters
    text = re.sub(r'^\s+|\s+$', '', text, flags=re.UNICODE)    # trim leading/trailing whitespace
    text = " ".join(re.split(r'\s+', text, flags=re.UNICODE))  # collapse internal whitespace
    return text

In [9]:
unseen_data.loc[:, "clean_text"] = unseen_data.loc[:, "text"].apply(clean_tweet)
unseen_data.head(2)

| id | text | clean_text |
|---|---|---|
| 0 | Just happened a terrible car crash | just happened a terrible car crash |
| 2 | Heard about #earthquake is different cities, stay safe everyone. | heard about #earthquake is different cities, stay safe everyone. |
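
The two rows above barely change, so it can help to trace a messier tweet through the cleaning steps. The sketch below is a condensed stand-in for `clean_tweet` that runs on its own: it keeps the same order of operations but lets the ASCII step drop emoji instead of calling `remove_emojis`. The sample tweet and the name `clean_tweet_sketch` are invented for illustration.

```python
import re

def clean_tweet_sketch(text):
    """Condensed version of the cleaning steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'(.)\1+', r'\1\1', text)                # collapse repeated characters
    text = re.sub(r'http\S+', '', text)                    # drop URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)             # drop @mentions
    text = text.encode('ascii', errors='ignore').decode()  # drop emoji and other non-ASCII
    return " ".join(text.split())                          # trim and collapse whitespace

sample = "@user1 HUUUGE fire near me!!! \U0001F525 http://t.co/abc123"
print(clean_tweet_sketch(sample))  # -> huuge fire near me!!
```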


#### Step 4: Run the Model

* first, try the model on two sample sentences

In [10]:
print(pipe("there was a hurricane in georgia"))
print(pipe("I like ice-cream"))

[{'label': 1, 'score': 0.898166298866272}]
[{'label': 0, 'score': 0.8866080045700073}]
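
The pipeline returns numeric labels, so downstream code usually maps them to readable names. The sketch below does this for the two outputs above. The names in `ID2LABEL` are an assumption based on these sample predictions (1 for disaster-related text, 0 otherwise) and should be checked against the labels used in training.

```python
# Assumed meaning of the numeric labels; verify against the training data.
ID2LABEL = {0: "not disaster", 1: "disaster"}

def describe(prediction):
    """Format one pipeline result as 'name (score)'."""
    name = ID2LABEL[prediction["label"]]
    return f"{name} ({prediction['score']:.2f})"

# Results copied from the two sample sentences above.
print(describe({"label": 1, "score": 0.898166298866272}))
print(describe({"label": 0, "score": 0.8866080045700073}))
```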


* run the model on the unseen data and collect the results in a DataFrame

In [11]:
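# Note: this passes the raw "text" column, so the results shown below are for uncleaned text.
# To match the preprocessing used in training, pass unseen_data["clean_text"] instead.
sentences = unseen_data["text"].values.tolist()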

In [12]:
preds = pipe(sentences)

In [13]:
def create_result_dict(sentences, preds):
    """Pair each input sentence with its predicted label and score in a DataFrame."""
    results = []
    for sent, prd in zip(sentences, preds):
        results.append({
            'clean_text': sent,
            'label': prd['label'],
            'score': prd['score'],
        })
    return pd.DataFrame(results)

In [14]:
pred_df = create_result_dict(sentences, preds)

In [15]:
pred_df.head(10)

| | clean_text | label | score |
|---|---|---|---|
| 0 | Just happened a terrible car crash | 1 | 0.915809 |
| 1 | Heard about #earthquake is different cities, stay safe everyone. | 1 | 0.89074 |
| 2 | there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all | 1 | 0.93125 |
| 3 | Apocalypse lighting. #Spokane #wildfires | 1 | 0.894984 |
| 4 | Typhoon Soudelor kills 28 in China and Taiwan | 1 | 0.928908 |
| 5 | We're shaking...It's an earthquake | 1 | 0.684168 |
| 6 | They'd probably still show more life than Arsenal did yesterday, eh? EH? | 0 | 0.846586 |
| 7 | Hey! How are you? | 0 | 0.792071 |
| 8 | What a nice hat? | 0 | 0.927709 |
| 9 | Fuck off! | 0 | 0.808605 |
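
For a two-class model, a score close to 0.5 means the model is unsure, so it can be useful to set low-confidence predictions aside for manual review. The sketch below works on plain dictionaries shaped like the pipeline output. The 0.75 threshold and the helper name `split_by_confidence` are illustrative choices, not values from the original analysis.

```python
def split_by_confidence(rows, threshold=0.75):
    """Split prediction rows into (confident, needs_review) by score."""
    confident = [r for r in rows if r["score"] >= threshold]
    needs_review = [r for r in rows if r["score"] < threshold]
    return confident, needs_review

# Three rows taken from the results table above.
rows = [
    {"clean_text": "Typhoon Soudelor kills 28 in China and Taiwan", "label": 1, "score": 0.928908},
    {"clean_text": "We're shaking...It's an earthquake", "label": 1, "score": 0.684168},
    {"clean_text": "Hey! How are you?", "label": 0, "score": 0.792071},
]
confident, needs_review = split_by_confidence(rows)
print([r["clean_text"] for r in needs_review])
```

With the DataFrame itself, the same filter can be written as `pred_df[pred_df["score"] < 0.75]`.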
