# Models + Evaluation Metrics

- **Goal:** Prediction Recognition

- **Purpose:** To train our models and to make predictions on unseen data.

- **Misc:**
    - `%store`: IPython line magic that persists a variable of interest so it can be loaded in another notebook with `%store -r`

In [1]:
import os
import sys

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing
from classification_models import PerceptronModel

In [2]:
%store -r shuffled_df
%store -r tfidf_vectorized_features

# Widen the column display so long sentences are not truncated
pd.set_option('display.max_colwidth', 800)
pre_processed_df = shuffled_df.copy()
pre_processed_df

Index,Base Predictions,Prediction Label
0,the weather today is mostly sunny with some clouds in the sky.,0
1,"according to david harper from accuweather, on friday, august 15, 2024, the rainfall in portland will decrease by 12% in the timeframe of early june 2025.",1
2,the hotel room was clean and comfortable for guests to stay.,0
3,**health-based predictions**,1
4,"in 2027, the average weekly exercise hours in the united states are expected to rise by 18%, as predicted by the national institutes of health on friday, december 13, 2025.",1
5,the company's mission statement is to provide excellent customer service always.,0
6,**weather-based predictions**,1
7,the team is working diligently to resolve the ongoing technical issues.,0
8,the meeting will be rescheduled due to unforeseen circumstances that occurred.,0
9,"according to julian hall from microsoft (msft), on tuesday, july 23, 2024, the net profit will increase by 10% to $25 billion in the timeframe of q3 of 2029.",1


## Split Features and Prediction Labels

In [3]:
base_pipeline = BasePipeline()

In [4]:
prediction_labels = shuffled_df['Prediction Label']
prediction_labels

0     0
1     1
2     0
3     1
4     1
5     0
6     1
7     0
8     0
9     1
10    1
11    0
12    1
13    1
14    1
15    1
16    0
17    1
18    1
19    1
20    0
21    1
22    1
23    0
24    1
25    0
26    1
27    1
28    1
29    0
30    1
31    1
32    0
33    0
34    1
35    1
36    1
37    1
38    0
39    0
Name: Prediction Label, dtype: int64

## Models

1. Perceptron
    - $ x_n \in X $, `tfidf_vectorized_features`
        - N, `tfidf_vectorized_features_n`. Each row (formally, a document).
        - D, `tfidf_vectorized_features_d`. Each column (formally, a unique term/feature).
        - Thus, $ X \in R^{N \times D} $ and each $ x_n \in R^{D \times 1} $
    - $ w $, Weights (scikit-learn's `Perceptron` starts them at zero unless `coef_init` is passed to `fit`)
        - D, `tfidf_vectorized_features_d`. One weight per column (formally, per unique term/feature).
        - 1, a single column.
        - Thus, $ w \in R^{D \times 1} $ and $ w^T \in R^{1 \times D} $
    
    With $ D = 100 $ terms, the activation for one document is a scalar:

    $$
    (w^T \cdot x_n) \Rightarrow (1 \times 100) \cdot (100 \times 1) \Rightarrow (1 \times 1)
    $$
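
To make the shapes above concrete, here is a minimal, dependency-free sketch of the perceptron decision rule and the classic mistake-driven update. It is illustrative only and is not the `PerceptronModel` implementation, whose internals are not shown in this notebook. The helper names (`dot`, `predict_one`, `train_perceptron`), the zero initial weights, the learning rate, and the toy data are all assumptions.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Compute w^T . x_n: (1 x D) . (D x 1) gives a single scalar."""
    if len(w) != len(x):
        raise ValueError("w and x must both have D entries")
    return sum(w_d * x_d for w_d, x_d in zip(w, x))


def predict_one(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return label 1 if the activation is positive, otherwise 0."""
    return 1 if dot(w, x) + b > 0 else 0


def train_perceptron(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    epochs: int = 10,
    lr: float = 1.0,
) -> Tuple[List[float], float]:
    """Update the weights only when a document is misclassified."""
    w = [0.0] * len(X[0])  # assumption: one zero-initialized weight per term
    b = 0.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            error = y_n - predict_one(w, b, x_n)  # -1, 0, or +1
            if error != 0:
                w = [w_d + lr * error * x_d for w_d, x_d in zip(w, x_n)]
                b += lr * error
    return w, b


# Hypothetical, linearly separable toy data: N = 4 documents, D = 2 terms.
X_toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y_toy = [1, 1, 0, 0]

w, b = train_perceptron(X_toy, y_toy)
toy_predictions = [predict_one(w, b, x_n) for x_n in X_toy]
print(toy_predictions)
```

On the real data, `X` would have 40 rows and 100 columns, so `w` would hold 100 weights and each call to `dot` would still return one number per document.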

In [5]:
features_N, features_D = tfidf_vectorized_features.shape
print(f"There are {features_N} documents (rows) with {features_D} terms (columns).")

There are 40 documents (rows) with 100 terms (columns).


In [6]:
prediction_labels = pre_processed_df['Prediction Label']
X_train, X_test, y_train, y_test = DataProcessing.split_data(tfidf_vectorized_features, prediction_labels)
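
The body of `DataProcessing.split_data` is not shown here. The 8 test predictions out of 40 rows are consistent with an 80/20 split, so the sketch below illustrates that idea with the standard library only. The function name `split_rows`, the shuffling, the seed, and the `test_fraction` default are assumptions; whether the real method shuffles or stratifies is unknown.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(
    features: Sequence,
    labels: Sequence,
    test_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List, List, List, List]:
    """Shuffle row indices once, then cut them into test and train parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical stand-in data with the same row count as the notebook (N = 40).
toy_features = [[float(i)] for i in range(40)]
toy_labels = [i % 2 for i in range(40)]

X_tr, X_te, y_tr, y_te = split_rows(toy_features, toy_labels)
print(len(X_tr), len(X_te))  # 32 8
```

Because the same index lists select both features and labels, each row keeps its own label after the shuffle.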

In [7]:
perceptron_model = PerceptronModel()

perceptron_model.train_model(X_train, y_train)
y_predictions = perceptron_model.predict(X_test)
y_predictions

0    1
1    0
2    1
3    1
4    1
5    1
6    0
7    1
dtype: int64

In [8]:
DataProcessing.join_predictions_with_labels(pre_processed_df, y_test, y_predictions, perceptron_model)

Index,Base Predictions,Prediction Label,Perceptron Model Prediction
32,the office is closed due to a holiday today and tomorrow.,0,1
33,the new restaurant is open for lunch and dinner daily now.,0,0
34,**company-based financial predictions**,1,1
35,"on monday, march 17, 2025, dr. ethan kim predicts that the temperature will rise by 3°c in new york city by friday, march 21, 2025.",1,1
36,"on tuesday, april 15, 2025, samantha taylor from noaa forecasts that the precipitation levels will increase by 15% in san francisco in may 2025.",1,1
37,**public policy predictions**,1,1
38,the employees are required to attend a mandatory training session tomorrow.,0,0
39,the company's financial reports are available for public review now.,0,1


In [9]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=True)
metrics

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.86      0.92         7

    accuracy                           0.88         8
   macro avg       0.75      0.93      0.79         8
weighted avg       0.94      0.88      0.89         8
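
The `macro avg` and `weighted avg` rows of the report can be reproduced by hand, which helps explain why they differ so much with only one class-0 example. The sketch below is a plain-Python illustration, not the code behind `evaluation_metrics`. The per-class fractions are assumptions inferred from the rounded report (for example, 6/7 for the class-1 recall of 0.86).

```python
# Per-class values read off the report above (fractions are inferred).
per_class = {
    0: {"precision": 1 / 2, "recall": 1 / 1, "support": 1},
    1: {"precision": 6 / 6, "recall": 6 / 7, "support": 7},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for scores in per_class.values():
    scores["f1"] = f1(scores["precision"], scores["recall"])

total_support = sum(s["support"] for s in per_class.values())

macro = {}
weighted = {}
for name in ("precision", "recall", "f1"):
    # Macro: every class counts equally, regardless of its size.
    macro[name] = sum(s[name] for s in per_class.values()) / len(per_class)
    # Weighted: every class counts in proportion to its support.
    weighted[name] = (
        sum(s[name] * s["support"] for s in per_class.values()) / total_support
    )

print({k: round(v, 2) for k, v in macro.items()})
print({k: round(v, 2) for k, v in weighted.items()})
```

The macro average gives the single class-0 example the same weight as the seven class-1 examples, so it is pulled down by the 0.50 precision; the weighted average is dominated by class 1.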



In [10]:
metrics = base_pipeline.evaluation_metrics(y_test, y_predictions, default_metrics=False)
metrics

{'Accuracy': 0.875,
 'Precision': 1.0,
 'Recall': 0.8571428571428571,
 'F1 Score': 0.9230769230769231}
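
These four numbers follow directly from the confusion-matrix counts for the positive class (label 1). The sketch below recomputes them in plain Python. The counts are an assumption inferred from the classification report (support of 7 with recall 6/7 for class 1, support of 1 with recall 1.00 for class 0); the notebook does not print them, and `binary_metrics` is a hypothetical helper, not part of `BasePipeline`.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
    }


# Inferred counts: 6 true positives, 0 false positives,
# 1 false negative, 1 true negative (8 test rows in total).
recomputed = binary_metrics(tp=6, fp=0, fn=1, tn=1)
print(recomputed)
```

With these counts, accuracy is 7/8, precision is 6/6, recall is 6/7, and F1 is 12/13, which agree with the values reported above.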