# Models + Evaluation Metrics

- **Goal:** Prediction Recognition

- **Purpose:** To train our models and to make predictions on unseen data.

- **Misc:**
    - `%store`: IPython line magic that persists a variable of interest so it can be loaded in another notebook with `%store -r`

In [1]:
import os
import sys

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from pre_processing import PreProcessing
from classification_models import ModelFactory

In [2]:
%store -r shuffled_base_df
%store -r tfidf_vectorized_features

pd.set_option('max_colwidth', 800)
pre_processed_df = shuffled_base_df.copy()
pre_processed_df

| | Base Predictions | Prediction Label |
|---|---|---|
| 0 | on thursday, september 19, 2024, mia patel speculates that the dividend payout ratio at coca-cola (ko) will probably remain at 75% in q1 of 2026. | 1 |
| 1 | the city has a large population. | 0 |
| 2 | on thursday, april 18, 2024, jackson lee forsees that the total debt at intel (intc) will probably decrease by 10% to $20 billion in q1 of 2029. | 1 |
| 3 | the color blue is often associated with calmness. | 0 |
| 4 | the music is playing loudly in the room. | 0 |
| 5 | the book fell off the table. | 0 |
| 6 | on tuesday, november 19, 2024, ava lee predicts that the operating cash flow at exxonmobil (xom) should decrease by 5% to $20 billion in q2 of 2027. | 1 |
| 7 | on monday, december 16, 2024, detravious forecasts that the revenue at apple will rise by 8% to $120 per share in q1 of 2025. | 1 |
| 8 | on tuesday, june 18, 2024, detravious martin predicts that the capital expenditures at unitedhealth group (unh) should decrease by 3% to $2 billion in q2 of 2028. | 1 |
| 9 | on friday, august 16, 2024, logan white predicts that the research and development expenses at pfizer (pfe) may increase by 8% to $10 billion in fy 2029. | 1 |


# Split Features and Prediction Labels

In [3]:
base_pipeline = BasePipeline()
models = ModelFactory

In [4]:
prediction_labels = shuffled_base_df['Prediction Label']
prediction_labels

0     1
1     0
2     1
3     0
4     0
5     0
6     1
7     1
8     1
9     1
10    0
11    1
12    0
13    1
14    0
15    0
16    0
17    1
Name: Prediction Label, dtype: int64

## Models

1. Perceptron
    - $ x_n \in X $, `tfidf_vectorized_features`
        - N, `tfidf_vectorized_features_n`. Each row (formally, a document).
        - D, `tfidf_vectorized_features_d`. Each column (formally, a unique term/feature); 163 for this notebook's TF-IDF matrix.
        - Thus, $ X \in R^{N \times D} $ and each $ x_n \in R^{D \times 1} $
    - $ w $, weights, one per feature (scikit-learn's `Perceptron` initializes them to zeros unless `coef_init` is passed to `fit`)
        - Rows: D, `tfidf_vectorized_features_d`. One weight per column (unique term/feature).
        - Columns: 1
        - Thus, $ w \in R^{D \times 1} $ and $ w^T \in R^{1 \times D} $
    
    $$
    (w^T \cdot x_n) \Rightarrow (1 \times D) \cdot (D \times 1) \Rightarrow 1 \times 1
    $$
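
The inner product $w^T \cdot x_n$ and the weight update it drives can be sketched in plain NumPy. This is the classic perceptron rule on labels mapped to {-1, +1}, not scikit-learn's implementation; the helper `perceptron_epoch` and the toy matrix `X` are hypothetical stand-ins for `tfidf_vectorized_features`.

```python
import numpy as np

def perceptron_epoch(X, y, w, b, lr=1.0):
    """One pass of the classic perceptron rule; y holds labels in {-1, +1}."""
    for x_n, y_n in zip(X, y):
        score = w @ x_n + b       # (1 x D) . (D x 1) -> scalar
        if y_n * score <= 0:      # misclassified, or exactly on the boundary
            w = w + lr * y_n * x_n
            b = b + lr * y_n
    return w, b

# Hypothetical toy data: N = 4 documents, D = 3 features
X = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.9, 0.1, 0.4],
              [0.1, 0.8, 0.0]])
y = np.array([1, -1, 1, -1])

w = np.zeros(X.shape[1])          # one weight per feature (column)
b = 0.0
for _ in range(10):
    w, b = perceptron_epoch(X, y, w, b)

predictions = np.where(X @ w + b > 0, 1, -1)
```

Each row contributes one scalar score, so `X @ w` has shape `(N,)`, one score per document.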

In [5]:
features_N, features_D = tfidf_vectorized_features.shape
print(f"There are {features_N} documents (rows) with {features_D} features (columns).")

There are 18 documents (rows) with 163 features (columns).


In [6]:
prediction_labels = pre_processed_df['Prediction Label']
X_train, X_test, y_train, y_test = PreProcessing.split_data(tfidf_vectorized_features, prediction_labels)
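
`PreProcessing.split_data` is project code whose body is not shown in this notebook. A plausible sketch, assuming it wraps scikit-learn's `train_test_split` with a 20% hold-out (the `test_size`, `random_state`, and `stratify` values are assumptions, and the random matrix stands in for the TF-IDF features), shows how 18 rows yield 4 test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix: 18 documents x 163 features (random values, not real TF-IDF)
rng = np.random.default_rng(0)
X = rng.random((18, 163))

# The 18 labels printed earlier in this notebook
y = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1])

# Assumed settings: 20% hold-out, fixed seed, class-balanced split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

With 18 rows, a 20% hold-out rounds up to 4 test rows and leaves 14 for training.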

In [7]:
perceptron_model = ModelFactory.select_model("perceptron")

y_predictions = base_pipeline.train_and_predict(perceptron_model, X_train, y_train, X_test)
y_predictions

array([1, 0, 1, 0])
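
`base_pipeline.train_and_predict` is also project code. A minimal sketch, assuming it simply calls `fit` on the training split and `predict` on the test split (the function body and the toy data below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Perceptron

def train_and_predict(model, X_train, y_train, X_test):
    """Fit the model on the training split, then label the test split."""
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Hypothetical toy data with two features
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array([1, 1, 0, 0])
X_test = np.array([[0.8, 0.2], [0.2, 0.8]])

y_predictions = train_and_predict(
    Perceptron(random_state=0), X_train, y_train, X_test
)
```

Keeping the model as an argument lets the same helper serve any estimator with `fit` and `predict`, which matches how `ModelFactory.select_model` hands back a model by name.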

In [8]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=True)
metrics

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4



In [9]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=False)
metrics

{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}
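
The two calls above appear to switch between a text report and a dictionary of scores. One way such a helper could be written with scikit-learn is sketched below, assuming `default_metrics=True` maps to `classification_report` and `False` to the four scalar scores. The function body is hypothetical; the labels reuse the predictions printed earlier, which must equal `y_test` given the perfect scores.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

def evaluation_metrics(y_true, y_pred, default_metrics=True):
    """Return a text report, or a dict of scalar scores for binary labels."""
    if default_metrics:
        return classification_report(y_true, y_pred)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }

y_test = [1, 0, 1, 0]
y_predictions = [1, 0, 1, 0]

scores = evaluation_metrics(y_test, y_predictions, default_metrics=False)
report = evaluation_metrics(y_test, y_predictions, default_metrics=True)
```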