Full workspace for training the model, evaluating it, and running it on user input

Parsing and preparing the dataset. We use the SEntFiN-v1.1.csv dataset from Kaggle (Source: https://www.kaggle.com/code/ankurzing/sentfin/notebook).
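
To make the target format concrete, here is a minimal sketch of how one row becomes aspect-level samples. The headline and entities are made up for illustration, and the helper name `to_samples` is hypothetical; the three-line layout (masked sentence, entity, label) mirrors what the parsing cell writes. Unlike the cell, this sketch raises a `KeyError` on an unknown sentiment string instead of mapping it to -1.

```python
import json

# Made-up row in the shape the notebook expects: a headline plus a JSON
# string mapping each entity to its sentiment.
title = "Acme shares rally while Globex slips"
decisions = json.loads('{"Acme": "positive", "Globex": "negative"}')

LABELS = {"positive": "1", "neutral": "0", "negative": "-1"}

def to_samples(title, decisions):
    """Return one three-line sample per entity: masked sentence, entity, label."""
    return [
        "\n".join([title.replace(entity, "$T$"), entity, LABELS[sentiment]])
        for entity, sentiment in decisions.items()
    ]

samples = to_samples(title, decisions)
print("\n".join(samples))
```

A headline with two entities therefore yields two samples, each masking a different entity with `$T$`.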

ONLY RUN IF YOU HAVE NOT CREATED THE DATASET YET 

Remember to add the additional argument that SGNLP needs; refer to the Google Colab notebook for details. 
Activate the conda sgnlp environment before running the cells so that you can test your code.

In [None]:
# Importing modules 
import json 
import pandas as pd 
import random 

In [None]:
# Build the samples and split them into training and test files in an 80-20 ratio.
# (The intended overall split is 60-20-20 for train, cv, and test; this cell only writes the train and test files.)

# Load the data from the csv into a df 
CSV_FILE = './datasets/finance/SEntFiN-v1.1.csv'
df = pd.read_csv(CSV_FILE)
del df['S No.']

# The full dataframe 
df['Decisions'] = df['Decisions'].apply(json.loads)
csv_data = df.to_dict(orient='list')
# print(csv_data)

# Convert each (title, entity) pair into a three-line sample:
#   1. the title with the entity replaced by $T$
#   2. the entity itself
#   3. the label: 1 (positive), 0 (neutral), or -1 (anything else)
text = []

def get_decision(x):
    return '1' if x == 'positive' else '0' if x == 'neutral' else '-1'

for title, decisions in zip(csv_data['Title'], csv_data['Decisions']):
    for key, sentiment in decisions.items():
        text.append('\n'.join([title.replace(key, '$T$'), key, get_decision(sentiment)]))


# Shuffle, then keep the first 80% of the samples for training and the rest for testing
random.shuffle(text)
split_index = len(text) // 5 * 4
training_text, test_text = text[:split_index], text[split_index:]

with open('./datasets/finance/finance_train.raw', 'w') as f:
    f.write('\n'.join(training_text))
    
with open('./datasets/finance/finance_test.raw', 'w') as f:
    f.write('\n'.join(test_text))

# Write the data 
with open('./datasets/finance/financedata.txt', 'w') as f:
    f.write('\n'.join(text))

# Each sample spans three lines (masked sentence, entity, label)
print(f"All data written. Number of samples: {len(text)}, number of lines: {len(text) * 3}")
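
The cell above mentions a 60-20-20 split but only writes a training and a test file. If a separate cross-validation file is wanted, a three-way split could look like the sketch below. The function name `split_three_ways`, the percentage parameters, and the fixed seed are assumptions for illustration, not part of the notebook.

```python
import random

def split_three_ways(samples, seed=0, train_pct=60, val_pct=20):
    """Shuffle a copy of the samples and cut it into train, cv, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

demo = [f"sample {i}" for i in range(10)]
train, val, test = split_three_ways(demo)
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the cut points so that rounding never drops or duplicates a sample.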

The next few cells train the Sentic GCN model on the dataset. 

First, remember to configure your sentic_gcn_config.json file INSIDE the module itself. The config file is located at "config/sentic_gcn_config.json".
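
Since a wrong path in the config only surfaces once training starts, it can help to check the file beforehand. The sketch below walks any nested JSON-like config and reports string values that look like local paths but do not exist. The helper names and the demo key names (`train_args`, `train_file`, `test_files`) are hypothetical; they are not taken from the real sentic_gcn_config.json.

```python
import os
import tempfile

def iter_strings(value, name=""):
    """Yield (dotted_name, string) pairs for every string nested in dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_strings(child, f"{name}.{key}" if name else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_strings(child, f"{name}[{index}]")
    elif isinstance(value, str):
        yield name, value

def missing_paths(config):
    """Return the entries that look like local paths but do not exist on disk."""
    return [
        (name, text)
        for name, text in iter_strings(config)
        if text.startswith(("./", "../", "/")) and not os.path.exists(text)
    ]

# Demonstration with a made-up config; the key names here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "train.raw")
    open(existing, "w").close()
    demo_config = {
        "train_args": {
            "train_file": existing,
            "test_files": ["./no/such/dir/test.raw"],
        }
    }
    problems = missing_paths(demo_config)
print(problems)
```

To check the real file, load it with `json.load` from "config/sentic_gcn_config.json" and pass the resulting dictionary to `missing_paths`.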

We import the modules first, then train the model. 
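
The training and evaluation cells reset `sys.argv` before loading the config, because a notebook kernel is started with its own command-line arguments and argparse rejects arguments it does not recognise. The sketch below shows the same idea as a context manager that restores the original arguments afterwards. `clean_argv` and the `--config` flag are hypothetical names used only for this illustration; the parser stands in for whatever the library builds internally.

```python
import argparse
import sys
from contextlib import contextmanager

@contextmanager
def clean_argv():
    """Temporarily hide command-line arguments from argparse, then restore them."""
    saved = sys.argv
    sys.argv = [saved[0] if saved else ""]
    try:
        yield
    finally:
        sys.argv = saved

def parse_demo_args():
    # Hypothetical parser; parse_args() with no arguments reads sys.argv[1:].
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/sentic_gcn_config.json")
    return parser.parse_args()

# Simulate the extra arguments a notebook kernel is typically launched with.
original_argv = sys.argv
sys.argv = ["ipykernel_launcher.py", "-f", "kernel.json"]
try:
    with clean_argv():
        args = parse_demo_args()  # would exit with an error without clean_argv()
    argv_after = list(sys.argv)
finally:
    sys.argv = original_argv
print(args.config)
```

Compared with `sys.argv = ['']` followed by `del sys`, this keeps the original arguments available to later cells.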

IF YOU HAVE ALREADY TRAINED THE MODEL, DO NOT RUN THIS AGAIN! ONLY RUN THIS ONCE! 

In [None]:
# Import the training modules 
from sgnlp.models.sentic_gcn.train import SenticGCNTrainer, SenticGCNBertTrainer
from sgnlp.models.sentic_gcn.utils import parse_args_and_load_config, set_random_seed
import sys 

In [None]:
# Training the model. Takes a few hours to complete. Only run if you have not run before. 

# Required not to throw argparse error 
sys.argv = ['']
del sys 

# Instantiate the config file and start the training 
cfg = parse_args_and_load_config(config_path="config/sentic_gcn_config.json")
if cfg.seed is not None:
    set_random_seed(cfg.seed)
# Use SenticGCNTrainer when cfg.model is "senticgcn", otherwise the BERT variant
trainer = SenticGCNTrainer(cfg) if cfg.model == "senticgcn" else SenticGCNBertTrainer(cfg)
trainer.train()

After training, evaluate the model to see how accurate it is. Remember to configure the eval_args section of your sentic_gcn_config.json file INSIDE the module itself. Path: "config/sentic_gcn_config.json"

Acc: 0.7590987868284229 \
F1: 0.7558443499898401
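
For reference, the sketch below shows how accuracy and F1 can be computed by hand for the three sentiment labels. It assumes a macro-averaged F1 (the unweighted mean of the per-class scores); the notebook does not say which averaging the evaluator uses, so treat that as an assumption. The label lists are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for illustration only
y_true = [1, 0, -1, 1, 0, -1]
y_pred = [1, 0, 0, 1, -1, -1]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```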

In [None]:
# Import modules for evaluation first 
from sgnlp.models.sentic_gcn.eval import SenticGCNEvaluator, SenticGCNBertEvaluator
from sgnlp.models.sentic_gcn.utils import parse_args_and_load_config, set_random_seed
import sys 

In [None]:
# Evaluating the model's performance on a test dataset 

# Required not to throw argparse error 
sys.argv = ['']
del sys 

# Instantiate the config file and start the evaluation 
cfg = parse_args_and_load_config(config_path="config/sentic_gcn_config.json")
print(cfg)
if cfg.seed is not None:
    set_random_seed(cfg.seed)
evaluator = SenticGCNEvaluator(cfg) if cfg.model == "senticgcn" else SenticGCNBertEvaluator(cfg)
evaluator.evaluate()

After training and evaluating the model, we can feed it new input and get its predictions. The code below asks the user for sentences and aspects, and prints the predicted sentiment for each aspect.
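
The model takes a list of dictionaries, each with a "sentence" string and an "aspects" list. For testing without interactive prompts, the same structure can be built directly, as in the sketch below. `parse_aspects` and `build_input` are hypothetical helpers, not part of `inputhelper` or SGNLP, and the example sentence is made up. Rejecting aspects that do not occur in the sentence is a design choice here, on the assumption that such an aspect is a typo worth catching early.

```python
def parse_aspects(raw, separator="/"):
    """Split a separator-delimited string into a list of non-empty, trimmed aspects."""
    return [part.strip() for part in raw.split(separator) if part.strip()]

def build_input(sentence, raw_aspects):
    """Build one model input and reject aspects that do not occur in the sentence."""
    aspects = parse_aspects(raw_aspects)
    missing = [aspect for aspect in aspects if aspect not in sentence]
    if missing:
        raise ValueError(f"Aspects not found in sentence: {missing}")
    return {"sentence": sentence, "aspects": aspects}

example = build_input("Acme shares rally while Globex slips", "Acme / Globex")
print(example)
```

A list of such dictionaries can stand in for the `inputs` variable that the interactive cell produces.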

In [1]:
# Import necessary modules first 
from sgnlp.models.sentic_gcn import(
    SenticGCNConfig,
    SenticGCNModel,
    SenticGCNEmbeddingConfig,
    SenticGCNEmbeddingModel,
    SenticGCNTokenizer,
    SenticGCNPreprocessor,
    SenticGCNPostprocessor,
    download_tokenizer_files,
)
from inputhelper import (
    inputString, 
    inputInt, 
    inputList
)



In [11]:
# Gets input from the user 

# constants 
SENTENCE = "sentence"
ASPECTS = "aspects"

# Gets number of inputs 
num_sentences = inputInt("How many sentences do you want to evaluate?\n", min=1, max=20)

# Instantiate inputs 
inputs = [{} for _ in range(num_sentences)]

for i in range(num_sentences):
    inputs[i][SENTENCE] = inputString("Enter the text you wish to evaluate:\n")
    inputs[i][ASPECTS] = inputList("Enter the aspects you wish to evaluate, separating them with forward slashes (/):\n")


In [12]:
# Gets user input, parses it into inputs for the model, and runs the model 

# Ensure that input exists (raises NameError if the input cell was not run)
assert len(inputs) > 0, "Run the input cell first and enter at least one sentence"

# Obtaining the tokenizer. NOT SURE what argument to put 
tokenizer = SenticGCNTokenizer.from_pretrained("./tokenizers/senticgcn/")

# Obtaining the config variable for the MODEL
config = SenticGCNConfig.from_pretrained(
    "./models/senticgcn/config.json"
)

# Obtaining the model itself 
model = SenticGCNModel.from_pretrained(
    "./models/senticgcn/pytorch_model.bin",
    config=config
)

# Obtaining the config variable for the embedding model 
embed_config = SenticGCNEmbeddingConfig.from_pretrained(
    "./embed_models/senticgcn_embed_semeval14_rest/config.json"
)

# Obtaining the embedding model itself 
embed_model = SenticGCNEmbeddingModel.from_pretrained(
    "./embed_models/senticgcn_embed_semeval14_rest/pytorch_model.bin",
    config=embed_config
)

# Building the preprocessor from the tokenizer, the embedding model, and the SenticNet file 
preprocessor = SenticGCNPreprocessor(
    tokenizer=tokenizer, embedding_model=embed_model,
    senticnet="./senticNet/senticnet.pickle",
    device="cpu")

# Postprocessor for everything 
postprocessor = SenticGCNPostprocessor()

# Preprocessing the inputs, then getting the raw outputs from the model 
processed_inputs, processed_indices = preprocessor(inputs)
raw_outputs = model(processed_indices)

# Getting the postprocessor outputs 
post_outputs = postprocessor(processed_inputs=processed_inputs, model_outputs=raw_outputs)

for output in post_outputs:
    print(output) 



In [17]:
# Parse and return the data from the model 

# Verify that outputs exist (raises NameError if the model cell was not run)
assert len(post_outputs) > 0, "Run the model cell first"

# List storing one {aspect phrase: label} dictionary per sentence 
total_results = []

for output in post_outputs:
    output_sentiments = {} 
    for i in range(len(output['labels'])):
        # Each aspect is a list of token indices into the tokenised sentence
        resultant_phrase = ' '.join([output['sentence'][j] for j in output['aspects'][i]])
        output_sentiments[resultant_phrase] = output['labels'][i]
    total_results.append(output_sentiments)

# Print the sentiments of every sentence, one dictionary per line
for result in total_results:
    print(result)

{'Adani': -1, 'Hindenburg': -1}
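
One way to make this output easier to read is to map the numeric labels back to names. The sketch below assumes the same encoding used when the dataset was built (1 positive, 0 neutral, -1 negative) and the post-processor output shape used in the cell above (token list, aspect index lists, labels). `summarise` and the mock output are hypothetical and made up for illustration.

```python
LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

def summarise(output):
    """Map each aspect phrase of one post-processed output to a sentiment name."""
    tokens = output["sentence"]
    return {
        " ".join(tokens[j] for j in indices): LABEL_NAMES[int(label)]
        for indices, label in zip(output["aspects"], output["labels"])
    }

# Mock output in the same shape as one element of post_outputs
mock_output = {
    "sentence": ["Acme", "shares", "rally", "while", "Globex", "Corp", "slips"],
    "aspects": [[0], [4, 5]],
    "labels": [1, -1],
}
for phrase, sentiment in summarise(mock_output).items():
    print(f"{phrase}: {sentiment}")
```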
