# Models + Evaluation Metrics

- **Goal:** Prediction Recognition

- **Purpose:** To train our models and to make predictions on unseen data.

- **Misc:**
    - `%store`: IPython line magic that persists a variable so it can be reloaded in another notebook with `%store -r`

In [1]:
import os
import sys

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing
from classification_models import PerceptronModel

In [None]:
%store -r shuffled_base_df
%store -r tfidf_vectorized_features_df
%store -r encoded_word_level_tags_entities_df

pd.set_option('max_colwidth', 800)
pre_processed_df = shuffled_base_df.copy()
pre_processed_df

| | Base Sentence | Prediction Label | Model Name | Domain | Template Number |
|---|---|---|---|---|---|
| 0 | The music echoed through the empty hall. | 0 | llama-3.3-70b-versatile | any | 0 |
| 1 | According to a policy analyst, Emily Chen, from the Congressional Budget Office, on 2024-08-22, the federal budget deficit is expected to decrease beyond $1 trillion in the timeframe of Q4 of 2027. | 1 | llama-3.3-70b-versatile | policy | 4 |
| 2 | On 2024-10-15, Dr. David Lee, a health expert, predicts that the obesity rate at the World Health Organization will likely decrease by 3% in Q2 of 2026. | 1 | llama-3.3-70b-versatile | health | 1 |
| 3 | According to a senior level person from 3M, on 2024/08/22, the operating income is expected to increase as much as $500 million, reflecting a 20% increase, in the timeframe of Q2 of 2029. | 1 | llama-3.3-70b-versatile | financial | 4 |
| 4 | On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating income at General Motors will likely increase by $5 billion in Q2 of 2026. | 1 | llama-3.3-70b-versatile | financial | 1 |
| ... | ... | ... | ... | ... | ... |
| 75 | Michael Davis, a top executive, predicts on 15 October 2024 that the stock price at Visa may rise by 15% to $200 per share in 2027. | 1 | llama-3.3-70b-versatile | financial | 3 |
| 76 | The city lights twinkled at night time. | 0 | llama-3.3-70b-versatile | any | 0 |
| 77 | The little boy fed the hungry birds. | 0 | llama-3.3-70b-versatile | any | 0 |
| 78 | In 2024/08/20, Senator James Davis from the Senate Committee on Energy and Natural Resources, forecasts that the renewable energy consumption will increase from 20% to 50% in 2028. | 1 | llama-3.3-70b-versatile | policy | 2 |


In [5]:
base_pipeline = BasePipeline()

## Combine Features (TF-IDF and POS & NER Encodings)

In [6]:
all_features_df = DataProcessing.concat_dfs([tfidf_vectorized_features_df, encoded_word_level_tags_entities_df], axis=1)
all_features_df

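| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.184990 | 0.000000 | 0.000000 | 0.000000 | 0.117449 | 0.0 | 0.000000 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.163610 | 0.202573 | 0.000000 | 0.118748 | 0.0 | 0.208631 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.160171 | 0.000000 | 0.000000 | 0.160171 | 0.101691 | 0.0 | 0.000000 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.168884 | 0.209102 | 0.000000 | 0.122575 | 0.0 | 0.215355 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.436666 | 0.000000 | 0.127987 | 0.0 | 0.000000 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 76 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 78 | 0.0 | 0.0 | 0.0 | 0.210736 | 0.000000 | 0.000000 | 0.421472 | 0.133794 | 0.0 | 0.000000 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |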
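
`DataProcessing.concat_dfs` is a project helper whose implementation is not shown here. A minimal sketch of the column-wise combination it presumably performs, assuming it wraps `pandas.concat`; the toy frames and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for the TF-IDF and POS/NER feature frames.
tfidf_df = pd.DataFrame({0: [0.0, 0.18], 1: [0.12, 0.0]})
tags_df = pd.DataFrame({0: [0, 1], 1: [1, 0], 2: [0, 0]})

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([tfidf_df, tags_df], axis=1)

# Both inputs use integer column labels, so relabel to avoid duplicates.
combined.columns = range(combined.shape[1])
print(combined.shape)
```

Relabeling the columns would explain why the combined frame above is numbered continuously from 0 to 131, though whether the helper does this is an assumption.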


## Split Features and Prediction Labels

In [8]:
prediction_labels = shuffled_base_df['Prediction Label']
prediction_labels

0     0
1     1
2     1
3     1
4     1
     ..
75    1
76    0
77    0
78    1
79    1
Name: Prediction Label, Length: 80, dtype: int64

## Models

1. Perceptron
    - $ x_n \in X $, `tfidf_vectorized_features`
        - N, `tfidf_vectorized_features_n`: the number of rows (formally, documents).
        - D, `tfidf_vectorized_features_d`: the number of columns (formally, unique terms/features).
        - Thus, $ X \in R^{N \times D} $, and each document $ x_n \in R^{D \times 1} $
    - $ w^T $, the transposed weight vector (sklearn initializes the weights to zeros by default)
        - Rows: 1
        - Columns: D, `tfidf_vectorized_features_d`, one weight per column (formally, unique terms/features).
        - Thus, $ w^T \in R^{1 \times D} $
    
    For example, with $ D = 100 $, the score for one document is a scalar:

    $$
    (w^T \cdot x_n) \Rightarrow (1 \times 100) \cdot (100 \times 1) = (1 \times 1)
    $$
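
The shapes in the product $ w^T \cdot x_n $ can be checked numerically. A minimal NumPy sketch, assuming D = 100 features as in the example dimensions; the zero-initialized weights, the bias term `b`, and the random toy input are illustrative assumptions, not values from the trained model:

```python
import numpy as np

D = 100                           # number of features (assumed)
rng = np.random.default_rng(0)

x_n = rng.random((D, 1))          # one document as a column vector, (D x 1)
w = np.zeros((D, 1))              # weight vector, (D x 1)
b = 0.0                           # bias term (assumed)

# (1 x D) @ (D x 1) -> (1 x 1): a single score per document
score = w.T @ x_n + b
y_hat = int(score.item() > 0)     # predict 1 if the score is positive, else 0

# Perceptron update on a mistake: shift w by +x_n or -x_n
y_true = 1
if y_hat != y_true:
    w += (y_true - y_hat) * x_n
    b += (y_true - y_hat)

print(score.shape, y_hat)
```

With zero weights the first score is 0, so the document is classified as 0 and a single update moves `w` onto `x_n`.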

In [9]:
prediction_labels = pre_processed_df['Prediction Label']
X_train, X_test, y_train, y_test = DataProcessing.split_data(all_features_df, prediction_labels)
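
`DataProcessing.split_data` is a project helper whose body is not shown. The 16 test rows out of 80 are consistent with an 80/20 split, and the contiguous test indices (64 onward) suggest no extra shuffling. One plausible sketch, assuming it wraps scikit-learn's `train_test_split`; the `test_size` and `shuffle` values are guesses:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 80 rows of features and binary labels.
X = np.arange(160).reshape(80, 2)
y = np.array([0, 1] * 40)

# test_size=0.2 holds out 20% of the rows; shuffle=False keeps row order,
# so the last 16 rows become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.shape, X_test.shape)
```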

In [10]:
perception_model = PerceptronModel()

perception_model.train_model(X_train, y_train)
y_predictions = perception_model.predict(X_test)
y_predictions

0     0
1     0
2     0
3     0
4     1
5     0
6     0
7     1
8     1
9     0
10    0
11    0
12    1
13    1
14    1
15    0
dtype: int64
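
`PerceptronModel` is a project wrapper. A minimal sketch of the equivalent fit/predict flow, assuming it delegates to scikit-learn's `Perceptron`; the toy data and `random_state` are illustrative:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Toy data: class 0 rows have small feature values, class 1 rows large ones.
X_train = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 0.05], [0.95, 0.9]])

model = Perceptron(random_state=0)
model.fit(X_train, y_train)

# predict returns one class label per row of X_test
y_pred = model.predict(X_test)
print(y_pred)
```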

In [11]:
DataProcessing.join_predictions_with_labels(pre_processed_df, y_test, y_predictions, perception_model)

| | Base Sentence | Model Name | Domain | Template Number | Prediction Label | Perceptron Model Prediction |
|---|---|---|---|---|---|---|
| 64 | In 2024-10-10, Governor Sarah Taylor from the State of New York, envisions that the number of new businesses will rise from 50,000 to 100,000 in Q2 of 2028. | llama-3.3-70b-versatile | policy | 2 | 1 | 0 |
| 65 | The cat purred contentedly on her lap. | llama-3.3-70b-versatile | any | 0 | 0 | 0 |
| 66 | They danced to the rhythm of the music. | llama-3.3-70b-versatile | any | 0 | 0 | 0 |
| 67 | According to a health expert from Stanford University, on 2024-08-25, the prevalence of mental health disorders is expected to increase beyond 15% in the timeframe of Q2 of 2030. | llama-3.3-70b-versatile | health | 4 | 1 | 0 |
| 68 | In Q3 of 2024, Dr. Sophia Patel from the National Institutes of Health envisions that the average daily sugar intake may fall by 5% in 2027. | llama-3.3-70b-versatile | health | 2 | 1 | 1 |
| 69 | According to a senior level person from the Australian Bureau of Meteorology, on 2024-03-10, the cyclone activity in Sydney is expected to rise beyond 5% in the timeframe of 2028-04-01. | llama-3.3-70b-versatile | weather | 4 | 1 | 0 |
| 70 | On Wednesday, November 20, 2024, Kevin White, a financial analyst, predicts that the net profit at AT&T will decrease by 5% to $3.5 billion in Q1 of 2026. | llama-3.3-70b-versatile | financial | 1 | 1 | 0 |
| 71 | The bright sun shone through the window. | llama-3.3-70b-versatile | any | 0 | 0 | 1 |
| 72 | On 2024-10-15, Dr. Rachel Lee, a senior meteorologist at the National Oceanic and Atmospheric Administration, forecasts that the precipitation levels in Los Angeles will likely increase by 15% in 2026-02-20. | llama-3.3-70b-versatile | weather | 1 | 1 | 1 |
| 73 | According to a financial expert from Cisco, on 08/20/2024, the gross profit is expected to increase beyond $10 million in the timeframe of Q4 of 2027. | llama-3.3-70b-versatile | financial | 4 | 1 | 0 |


In [12]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=True)
metrics

              precision    recall  f1-score   support

           0       0.90      1.00      0.95         9
           1       1.00      0.86      0.92         7

    accuracy                           0.94        16
   macro avg       0.95      0.93      0.94        16
weighted avg       0.94      0.94      0.94        16



In [13]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=False)
metrics

{'Accuracy': 0.9375,
 'Precision': 1.0,
 'Recall': 0.8571428571428571,
 'F1 Score': 0.9230769230769231}
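
The reported scores can be reproduced by hand from the confusion-matrix counts implied by the classification report: 6 true positives, 0 false positives, 1 false negative, and 9 true negatives. These counts are inferred from the report (supports of 7 and 9, precision 1.00, recall 0.86 for class 1); the notebook does not print them:

```python
# Counts inferred from the classification report (class 1 is the positive class).
tp, fp, fn, tn = 6, 0, 1, 9

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # predicted positives that are correct
recall = tp / (tp + fn)                      # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Under these counts the single error is a missed prediction sentence (a false negative), which is why recall falls below precision.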