# Models + Evaluation Metrics

- **Goal:** Prediction Recognition

- **Purpose:** Train our models and classify unseen sentences as a prediction (1) or a non-prediction (0).

- **Misc:**
    - `%store`: IPython line magic that persists a variable so it can be reloaded in another notebook with `%store -r`

In [1]:
import os
import sys

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing
from classification_models import PerceptronModel

In [2]:
%store -r shuffled_base_df
%store -r tfidf_vectorized_features

pd.set_option('display.max_colwidth', 800)
pre_processed_df = shuffled_base_df.copy()
pre_processed_df

Unnamed: 0,Base Predictions,Prediction Label
0,the city has a large population.,0
1,"on wednesday, october 23, 2024, sophia rodriguez envisions that the stock price at microsoft (msft) will likely rise by 20% to $400 per share in q4 of 2028.",1
2,"on thursday, september 19, 2024, ethan hall speculates that the dividend payout ratio at procter & gamble (pg) will probably remain at 70% in q3 of 2026.",1
3,the flower smells very sweet.,0
4,the baby laughed at the clown.,0
5,"on monday, december 16, 2024, emily chen forecasts that the revenue at amazon (amzn) will rise by 12% to $150 billion in q2 of 2026.",1
6,the sun is shining brightly in the clear sky.,0
7,"on tuesday, november 19, 2024, liam kim predicts that the operating cash flow at chevron (cvx) should decrease by 4% to $15 billion in q1 of 2027.",1
8,"on thursday, april 18, 2024, hannah taylor forsees that the total debt at at&t (t) will probably decrease by 8% to $150 billion in q3 of 2029.",1
9,the teacher wrote on the board.,0


# Split Features and Prediction Labels

In [3]:
base_pipeline = BasePipeline()

In [4]:
prediction_labels = shuffled_base_df['Prediction Label']
prediction_labels

0     0
1     1
2     1
3     0
4     0
5     1
6     0
7     1
8     1
9     0
10    1
11    1
12    0
13    0
14    0
15    1
16    0
17    1
Name: Prediction Label, dtype: int64

## Models

1. Perceptron
    - $ x_n \in X $, `tfidf_vectorized_features`
        - N, `tfidf_vectorized_features_n`. The number of rows (one per document).
        - D, `tfidf_vectorized_features_d`. The number of columns (one per unique term/feature).
        - Thus, $ X \in R^{N \times D} $ and each document $ x_n \in R^{D \times 1} $
    - $ w $, the weights, which sklearn's `Perceptron` initializes to zeros by default
        - D rows, `tfidf_vectorized_features_d`. One weight per column (unique term/feature).
        - 1 column
        - Thus, $ w \in R^{D \times 1} $ and $ w^T \in R^{1 \times D} $
    
    $$
    (w^T \cdot x_n) \Rightarrow (1 \times D) \cdot (D \times 1) \Rightarrow (1 \times 1)
    $$
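
The inner product and its shapes can be checked with a small, dependency-free sketch. The vectors here are made-up three-feature toys (not the notebook's TF-IDF values), and `score` and `predict_label` are hypothetical helper names. The threshold-at-zero decision rule is the standard perceptron rule, assumed here rather than taken from `PerceptronModel`.

```python
def score(w, x_n):
    """Inner product w^T x_n: (1 x D) times (D x 1) gives a single number."""
    if len(w) != len(x_n):
        raise ValueError(f"shape mismatch: w has {len(w)} entries, x_n has {len(x_n)}")
    return sum(w_d * x_d for w_d, x_d in zip(w, x_n))


def predict_label(w, x_n, bias=0.0):
    """Standard perceptron decision rule: class 1 if the score is positive, else 0."""
    return 1 if score(w, x_n) + bias > 0 else 0


# Toy example with D = 3 (illustrative values only).
w = [0.5, -1.0, 2.0]
x_n = [1.0, 0.0, 0.25]

print(score(w, x_n))          # 0.5*1.0 + (-1.0)*0.0 + 2.0*0.25 = 1.0
print(predict_label(w, x_n))  # 1
```

Because `w` and `x_n` must both have D entries, a length mismatch is rejected up front rather than silently truncated by `zip`.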

In [5]:
features_N, features_D = tfidf_vectorized_features.shape
print(f"There are {features_N} documents (rows) with {features_D} terms (columns).")

There are 18 documents (rows) with 100 terms (columns).


In [6]:
prediction_labels = pre_processed_df['Prediction Label']
X_train, X_test, y_train, y_test = DataProcessing.split_data(tfidf_vectorized_features, prediction_labels)
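
`DataProcessing.split_data` is a project helper whose internals are not shown here. The joined table further down lists rows 14 to 17 as the test set, which is consistent with holding out the last four of the 18 rows without shuffling. A minimal sketch of such a split, assuming a 20% test fraction (`ordered_split` and `test_fraction` are hypothetical names, and the feature rows are placeholders):

```python
import math


def ordered_split(features, labels, test_fraction=0.2):
    """Hold out the last ceil(N * test_fraction) rows as the test set, without shuffling."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    n_test = math.ceil(len(features) * test_fraction)
    cut = len(features) - n_test
    return features[:cut], features[cut:], labels[:cut], labels[cut:]


# 18 placeholder rows standing in for the TF-IDF matrix; labels copied from the output above.
rows = [[float(i)] for i in range(18)]
labels = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

X_train_demo, X_test_demo, y_train_demo, y_test_demo = ordered_split(rows, labels)
print(len(X_train_demo), len(X_test_demo))  # 14 4
print(y_test_demo)                          # [0, 1, 0, 1]
```

With only four held-out rows, every metric moves in steps of 25 percentage points, so scores on this split are best read as a smoke test rather than a performance estimate.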

In [7]:
perception_model = PerceptronModel()

perception_model.train_model(X_train, y_train)
y_predictions = perception_model.predict(X_test)
y_predictions

0    0
1    1
2    1
3    1
dtype: int64
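
`PerceptronModel.train_model` is not shown in this notebook. For intuition, the classic mistake-driven perceptron update can be sketched in plain Python. This is the textbook algorithm on made-up, linearly separable toy data, not the project's or sklearn's implementation; `train_perceptron`, `predict_rows`, `epochs`, and `learning_rate` are hypothetical names.

```python
def train_perceptron(X, y, epochs=10, learning_rate=1.0):
    """Textbook perceptron: weights change only when a row is misclassified."""
    w = [0.0] * len(X[0])  # one weight per feature, starting at zero
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            activation = sum(w_d * x_d for w_d, x_d in zip(w, x_n)) + b
            y_hat = 1 if activation > 0 else 0
            error = y_n - y_hat  # -1, 0, or +1
            if error != 0:
                w = [w_d + learning_rate * error * x_d for w_d, x_d in zip(w, x_n)]
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop early
            break
    return w, b


def predict_rows(X, w, b):
    """Apply the learned weights to every row and threshold at zero."""
    return [1 if sum(w_d * x_d for w_d, x_d in zip(w, x_n)) + b > 0 else 0 for x_n in X]


# Toy, linearly separable data (illustrative values only).
X_toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y_toy = [1, 1, 0, 0]

w_toy, b_toy = train_perceptron(X_toy, y_toy)
print(predict_rows(X_toy, w_toy, b_toy))  # [1, 1, 0, 0]
```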

In [8]:
DataProcessing.join_predictions_with_labels(pre_processed_df, y_test, y_predictions, perception_model)

Unnamed: 0,Base Predictions,Prediction Label,Perceptron Model Prediction
14,the music is playing loudly.,0,0
15,"on monday, july 22, 2024, julian sanchez forecasts that the return on equity (roe) at visa (v) has a high probability of improving by 3% to 20% in q2 of 2027.",1,1
16,the dog is running quickly.,0,1
17,"on tuesday, june 18, 2024, mia garcia predicts that the capital expenditures at cisco systems (csco) should decrease by 5% to $1.5 billion in q1 of 2028.",1,1


In [9]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=True)
metrics

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         3

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4



In [10]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=False)
metrics

{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}
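
The joined table above lists one disagreement between label and prediction (row 16: label 0, predicted 1), so recomputing the scores directly from those four rows is a useful cross-check on the values reported by `evaluation_metrics`. A dependency-free sketch, assuming class 1 is the positive class and binary (not macro) averaging; `binary_metrics` is a hypothetical helper, not part of `BasePipeline`:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1 Score": f1}


# Labels and predictions for rows 14-17, copied from the joined table above.
y_true_rows = [0, 1, 0, 1]
y_pred_rows = [0, 1, 1, 1]

for name, value in binary_metrics(y_true_rows, y_pred_rows).items():
    print(f"{name}: {value:.2f}")
# Accuracy: 0.75
# Precision: 0.67
# Recall: 1.00
# F1 Score: 0.80
```

The reported support of 1 and 3 matches the class counts in `y_predictions` rather than the two-and-two split in the table's labels, so if the recomputed values differ from the reported ones, the inputs and argument order passed to `evaluation_metrics` are worth checking.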