# Create Dataset: Combine Generated Synthetic Datasets with Real Datasets

1. **Synthetic:** generated with various LLMs across multiple domains (finance, weather, policy, health, sports, misc)
    1. DATASET: Predictions: future tense sentences + more.
    2. DATASET: Non-Predictions (or Observations): past tense sentences + more.
2. **Real:**
    1. DATASET: [financial_phrasebank](https://huggingface.co/datasets/takala/financial_phrasebank): 4840 statements (with sentiments) from financial calls, reports, news, etc.

In [1]:
import os
import sys

import pandas as pd

from tqdm import tqdm

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

# import log_files
from data_processing import DataProcessing

In [2]:
pd.set_option('display.max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
save_data_bool = True

## Load Data

1. [x] Load Predictions
2. [x] Load Non-Predictions
3. [x] Load Financial PhraseBank

### Load Predictions

In [4]:
predictions_df = DataProcessing.load_multiple_batches(
    notebook_dir,
    sep=',', 
    data_type='prediction'
)
predictions_df.head(7)

Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/prediction_logs/batch_1-prediction/batch_1-from_df.csv
✓ Loaded batch 1
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/prediction_logs/batch_2-prediction/batch_2-from_df.csv
✓ Loaded batch 2
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/prediction_logs/batch_3-prediction/batch_3-from_df.csv
✓ Loaded batch 3
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/prediction_logs/batch_4-prediction/batch_4-from_df.csv
✓ Loaded batch 4
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number
0,JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.,1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
1,"On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
2,"Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
3,"According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
4,"In 21 August 2024, Morgan Stanley envisions that the gross profit at Johnson & Johnson has some probability to remain stable.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,5
5,"The stock price at Visa should stay same in Q2 of 2026, according to Wells Fargo.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,6
6,JPMorgan forecasts that the revenue at Microsoft potentially decrease in Q3 of 2027.,1,finance,llama-3.3-70b-instruct,NAVI_GATOR,0,1
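
The loader above belongs to the project's own `DataProcessing` module, whose implementation is not shown in this notebook. As a rough sketch of what such a loader might do, the hypothetical `load_batches` helper below globs the `batch_<n>-<data_type>/batch_<n>-from_df.csv` layout visible in the log output and concatenates the files with pandas. The function name, its parameters, and the demo data are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_batches(log_dir: Path, data_type: str, sep: str = ",") -> pd.DataFrame:
    """Hypothetical helper: read every batch CSV under log_dir into one DataFrame."""
    pattern = f"batch_*-{data_type}/batch_*-from_df.csv"
    # Sort on the numeric batch id so that batch_10 comes after batch_2.
    paths = sorted(
        log_dir.glob(pattern),
        key=lambda p: int(p.parent.name.split("-")[0].split("_")[1]),
    )
    if not paths:
        raise FileNotFoundError(f"No batch files matching {pattern} in {log_dir}")
    frames = [pd.read_csv(p, sep=sep) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on a throwaway directory that mimics the layout seen in the logs.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "prediction_logs"
    for n in (1, 2):
        batch_dir = log_dir / f"batch_{n}-prediction"
        batch_dir.mkdir(parents=True)
        pd.DataFrame(
            {
                "Base Sentence": [f"sentence {n}"],
                "Sentence Label": [1],
                "Domain": ["finance"],
            }
        ).to_csv(batch_dir / f"batch_{n}-from_df.csv", index=False)
    demo_df = load_batches(log_dir, data_type="prediction")

print(demo_df)
```

Writing the CSVs with `index=False` also avoids the extra `Unnamed: 0` column that appears when a saved index is read back.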


### Load Non-Predictions

In [5]:
non_predictions_df = DataProcessing.load_multiple_batches(
    notebook_dir,
    sep=',', 
    data_type='observation'
)
non_predictions_df.head(7)

Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/observation_logs/batch_1-observation/batch_1-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/observation_logs/batch_2-observation/batch_2-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/observation_logs/batch_3-observation/batch_3-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/observation_logs/batch_4-observation/batch_4-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/observation_logs/batch_5-observation/batch_5-from_df

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number
0,JPMorgan Chase observed that the net profit at Amazon had remained stable in Q2 2026.,0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
1,"On 08/20/2024 to 08/20/2025, Bank of America speculated the operating income at Microsoft changed.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
2,"Citigroup noted on 2024-08-20, the research and development expenses at Alphabet fell.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
3,"According to a financial analyst, the gross profit at Facebook rose in Q3 2025.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
4,"In 2025-08-20, a college student envisioned that the operating cash flow at Intel decreased.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,5
5,"The stock price at Google increased in August 2025, according to a research advisor.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,6
6,"JPMorgan Chase noted that the operating income at Microsoft had risen significantly on September 10, 2024.",0,finance,llama-3.3-70b-instruct,NAVI_GATOR,0,1


### Load financial_phrasebank

In [6]:
base_data_path = DataProcessing.load_base_data_path(notebook_dir)
finacial_phrase_bank_path = os.path.join(base_data_path, 'financial_phrase_bank/annotators/maya_annotations-financial_phrasebank_statements-v2.csv')
print(finacial_phrase_bank_path)
finacial_phrase_bank_df = DataProcessing.load_from_file(finacial_phrase_bank_path, 'csv', sep = ",", encoding = 'latin')
finacial_phrase_bank_df.dropna(inplace=True) # drop rows without a label
# fpb_label_values = finacial_phrase_bank_df['label_value'].values # convert to np.array
# fpb_label_values_updated = fpb_label_values.astype(int) # convert values to be type int
# finacial_phrase_bank_df['label_value'] = fpb_label_values_updated # update original df with type int
# finacial_phrase_bank_df.rename(columns={"sentence":"Base Sentence", "label_value":"Sentence Label"}, inplace=True)
# finacial_phrase_bank_df

finacial_phrase_bank_df.head(7)

/Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/financial_phrase_bank/annotators/maya_annotations-financial_phrasebank_statements-v2.csv


Unnamed: 0,statement,maya_label
1,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .",NON-PREDICTION
3,"Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .",NON-PREDICTION
5,"The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .",NON-PREDICTION
7,With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .,PREDICTION
9,"According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of net sales .",PREDICTION
11,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing its growth strategy by increasingly focusing on technologically more demanding HDI printed circuit boards PCBs .,NON-PREDICTION
13,"For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",NON-PREDICTION


In [7]:
# Using custom column name
updated_finacial_phrase_bank_df = DataProcessing.match_text_label_to_int(
    finacial_phrase_bank_df, 
    text_label_col_name='maya_label',
    target_label='PREDICTION'  # This is the default
)
updated_finacial_phrase_bank_df.rename(columns={"statement":"Base Sentence", "Binary Label":"Sentence Label"}, inplace=True)
updated_finacial_phrase_bank_df.head(3)

Unnamed: 0,Base Sentence,maya_label,Sentence Label
0,With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .,PREDICTION,1
1,"According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of net sales .",PREDICTION,1
2,TeliaSonera TLSN said the offer is in line with its strategy to increase its ownership in core business holdings and would strengthen Eesti Telekom 's offering to its customers .,PREDICTION,1
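
`DataProcessing.match_text_label_to_int` is a project helper whose source is not included here. The sketch below shows one way such a mapping could be written in plain pandas, assuming the helper adds a `Binary Label` column (the column renamed in the cell above) that is 1 when the text label equals the target and 0 otherwise. The name `text_label_to_int` and its signature are hypothetical.

```python
import pandas as pd


def text_label_to_int(
    df: pd.DataFrame,
    text_label_col_name: str,
    target_label: str = "PREDICTION",
    new_col: str = "Binary Label",
) -> pd.DataFrame:
    """Hypothetical helper: return a copy with a 0/1 column for the target label."""
    out = df.copy()
    # Normalise whitespace and case before comparing, then cast True/False to 1/0.
    normalised = out[text_label_col_name].str.strip().str.upper()
    out[new_col] = (normalised == target_label.upper()).astype(int)
    return out


demo = pd.DataFrame(
    {
        "statement": ["Sales would rise next year .", "Sales doubled in 2010 ."],
        "maya_label": ["PREDICTION", "NON-PREDICTION"],
    }
)
labeled = text_label_to_int(demo, text_label_col_name="maya_label")
print(labeled)
```

Returning a copy leaves the input frame untouched, so the raw annotations remain available for inspection.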


## Data Processing

1. [x] Assign label 0 to LLM Generated for all Predictions
2. [x] Assign label 0 to LLM Generated for all Non-Predictions
3. [x] Assign label 1 to Human for financial_phrasebank
4. [x] Filter for finance domain across Predictions
5. [x] Filter for finance domain across Non-Predictions
6. [x] Combine Predictions + Non-Predictions + financial_phrasebank

In [8]:
llm_author_type = 0
human_author_type = 1

### Assign label 0 to LLM Generated for all Predictions

In [9]:
predictions_df['Author Type'] = llm_author_type
predictions_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number,Author Type
0,JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.,1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1,0
1,"On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2,0
2,"Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3,0


### Assign label 0 to LLM Generated for all Non-Predictions

In [10]:
non_predictions_df['Author Type'] = llm_author_type
non_predictions_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number,Author Type
0,JPMorgan Chase observed that the net profit at Amazon had remained stable in Q2 2026.,0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1,0
1,"On 08/20/2024 to 08/20/2025, Bank of America speculated the operating income at Microsoft changed.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2,0
2,"Citigroup noted on 2024-08-20, the research and development expenses at Alphabet fell.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3,0


### Assign label 1 to Human for financial_phrasebank

In [11]:
updated_finacial_phrase_bank_df['Author Type'] = human_author_type
updated_finacial_phrase_bank_df.head(3)

Unnamed: 0,Base Sentence,maya_label,Sentence Label,Author Type
0,With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .,PREDICTION,1,1
1,"According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of net sales .",PREDICTION,1,1
2,TeliaSonera TLSN said the offer is in line with its strategy to increase its ownership in core business holdings and would strengthen Eesti Telekom 's offering to its customers .,PREDICTION,1,1


### Filter for finance domain across Predictions

In [12]:
filt_finance = (predictions_df['Domain'] == 'finance')
financial_predictions_df = predictions_df[filt_finance]
financial_predictions_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number,Author Type
0,JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.,1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1,0
1,"On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2,0
2,"Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3,0


### Filter for finance domain across Non-Predictions

In [13]:
filt_finance = (non_predictions_df['Domain'] == 'finance')
financial_non_predictions_df = non_predictions_df[filt_finance]
financial_non_predictions_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number,Author Type
0,JPMorgan Chase observed that the net profit at Amazon had remained stable in Q2 2026.,0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1,0
1,"On 08/20/2024 to 08/20/2025, Bank of America speculated the operating income at Microsoft changed.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2,0
2,"Citigroup noted on 2024-08-20, the research and development expenses at Alphabet fell.",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3,0
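
A boolean-mask filter like the ones above returns a subset of the original frame. If that subset is modified later, pandas may emit a `SettingWithCopyWarning`. A minimal sketch of a filter that returns an independent copy follows; `filter_domain` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd


def filter_domain(df: pd.DataFrame, domain: str = "finance") -> pd.DataFrame:
    """Hypothetical helper: keep only rows for one domain, as an independent copy."""
    mask = df["Domain"].str.lower() == domain.lower()
    return df.loc[mask].copy()


demo = pd.DataFrame(
    {
        "Base Sentence": ["Revenue may rise.", "Rain is likely.", "Profit fell."],
        "Domain": ["finance", "weather", "finance"],
    }
)
finance_df = filter_domain(demo)
# Safe to add columns: finance_df does not share data with demo.
finance_df["Author Type"] = 0
print(finance_df)
```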


### Combine Predictions + Non-Predictions + financial_phrasebank

In [14]:
joint_cols = ['Base Sentence', 'Sentence Label', 'Author Type']
combined_df = DataProcessing.concat_dfs([financial_predictions_df, financial_non_predictions_df, updated_finacial_phrase_bank_df])
# combined_df.head(3)
basic_cols_df = combined_df.loc[:, joint_cols]
basic_cols_df.tail(7)

Unnamed: 0,Base Sentence,Sentence Label,Author Type
1286,This acquisition supports our strategy of being close to our customers all around the world offering both equipment and related services .,0,1
1287,This lays a foundation for sustainable business development for the long term .,0,1
1288,"Turun kaupunkin , Finland based company has awarded contract to Lemminkainen Talotekniikka Oy for electrical installation work .",0,1
1289,The world 's biggest magazine paper maker said the program to improve efficiency will include closing several of its least competitive mills and would cover all the company 's operations resulting in annual savings of some euro200 million US$ 240 million .,0,1
1290,These moderate but significant changes resulted in a significant 24-32 % reduction in the estimated CVD risk .,0,1
1291,Uponor improved its performance in spite of the decrease in residential construction in the US .,0,1
1292,The inventor was issued U.S. Patent No. .,0,1


## Save Combined Dataset

In [15]:
if save_data_bool:
    save_path = os.path.join(base_data_path, 'combined_datasets/')
    DataProcessing.save_to_file(basic_cols_df, save_path, 'combined-synthetic-fin_phrase_bank', 'csv')

Skipping combined-synthetic-fin_phrase_bank-chronicles-v2.csv: invalid literal for int() with base 10: 'chronicles-v2'
Skipping combined-synthetic-fin_phrase_bank-chronicles-v1.csv: invalid literal for int() with base 10: 'chronicles-v1'
Using file number: 2
Saving CSV file to: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/prediction_classification_experiments-v2/../data/combined_datasets/combined-synthetic-fin_phrase_bank-v2.csv
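
The "Skipping ..." lines suggest that the version lookup tried to parse the suffix of sibling files such as `combined-synthetic-fin_phrase_bank-chronicles-v2.csv` as an integer. How `DataProcessing.save_to_file` picks the file number is not shown here, so the following is only a sketch of one alternative: a hypothetical `next_version` helper that matches `<prefix>-v<number>.<ext>` exactly with a regular expression, so files that merely share the prefix are ignored without raising an error. The file names in the demo are created in a temporary directory for illustration.

```python
import re
import tempfile
from pathlib import Path


def next_version(save_dir: Path, prefix: str, ext: str = "csv") -> int:
    """Hypothetical helper: return 1 + the highest existing <prefix>-v<n>.<ext> version."""
    pattern = re.compile(rf"^{re.escape(prefix)}-v(\d+)\.{re.escape(ext)}$")
    versions = []
    for path in save_dir.glob(f"*.{ext}"):
        match = pattern.match(path.name)
        if match:  # Files such as <prefix>-chronicles-v2.csv do not match and are skipped.
            versions.append(int(match.group(1)))
    return max(versions, default=0) + 1


prefix = "combined-synthetic-fin_phrase_bank"
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp)
    for name in (
        f"{prefix}-v1.csv",
        f"{prefix}-chronicles-v1.csv",
        f"{prefix}-chronicles-v2.csv",
    ):
        (save_dir / name).touch()
    version = next_version(save_dir, prefix)
    file_name = f"{prefix}-v{version}.csv"

print(file_name)
```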
