# 📘 4.0 Final Model Training and Evaluation

## Notebook Overview

This notebook trains and evaluates the final sentiment classification model using the feature-frozen dataset produced in prior pipeline stages.

The objective is to construct a production-ready model using validated features and to serialize all required artifacts for downstream inference.

No feature experimentation or representation design is performed in this notebook.

---



## Objectives

* Train the final sentiment classification model
* Evaluate performance on held-out validation data
* Inspect model behavior at a high level
* Serialize model and preprocessing artifacts
* Establish deployment-ready training outputs

This notebook marks the transition from feature engineering to model operationalization.

---



## Inputs

This notebook consumes the finalized feature dataset:

```
data/processed/features_final.csv
```

This dataset includes:

* Raw tweet text
* Sentiment labels
* Emoji polarity features

The dataset is treated as immutable and feature-complete.

---



## Outputs

This notebook produces serialized artifacts stored under:

```
models/
```

Artifacts generated include:

* Trained sentiment classifier
* TF-IDF vectorizer
* (Optional) feature schema metadata

These artifacts are required for inference and deployment pipelines.
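As a rough illustration of how these artifacts might be written and read back, the sketch below uses `joblib.dump` and `joblib.load` on a toy model. The helper name `save_artifacts`, the file names `sentiment_model.joblib` and `tfidf_vectorizer.joblib`, and the toy corpus are assumptions made for illustration; the notebook does not specify them.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def save_artifacts(model, vectorizer, model_dir):
    """Persist the classifier and vectorizer; the file names are hypothetical."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "sentiment_model.joblib"
    vectorizer_path = model_dir / "tfidf_vectorizer.joblib"
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    return model_path, vectorizer_path


# Toy data standing in for the real training set (assumption).
toy_texts = ["so happy today", "very unhappy today", "happy smile", "unhappy and crying"]
toy_labels = [1, 0, 1, 0]

toy_tfidf = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000, random_state=42)
toy_model.fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path, vectorizer_path = save_artifacts(toy_model, toy_tfidf, tmp_dir)
    reloaded_model = joblib.load(model_path)
    reloaded_tfidf = joblib.load(vectorizer_path)
    # The reloaded pair should reproduce the in-memory predictions.
    original_pred = toy_model.predict(toy_tfidf.transform(toy_texts)).tolist()
    reloaded_pred = reloaded_model.predict(reloaded_tfidf.transform(toy_texts)).tolist()
```

Saving the vectorizer alongside the classifier matters because the classifier's columns are only meaningful with the vocabulary that was fitted on the training split.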

---



# 🧩 Section 1: Setup and Imports

This section defines the modeling environment.

It includes:

* Library imports
* Random seed configuration
* Path definitions for data and artifacts

Reproducibility is enforced via fixed seeds and deterministic model settings.

---



In [1]:
# --- 4.0 Final Model Training and Evaluation ---

from pathlib import Path
import pandas as pd
import numpy as np
import random
import joblib

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

from scipy.sparse import hstack


In [2]:
# Reproducibility configuration
SEED = 42
random.seed(SEED)
np.random.seed(SEED)


In [3]:
# Paths
DATA_PATH = Path("../data/processed/features_final.csv")  # relative to the notebook directory, like MODEL_DIR
MODEL_DIR = Path("../models")

MODEL_DIR.mkdir(parents=True, exist_ok=True)


# 🧩 Section 2: Load Final Feature Dataset

This section loads the finalized dataset generated in the feature engineering phase.

Integrity checks confirm:

* Required columns are present
* No missing values exist
* Label encoding is valid

No transformations are applied at this stage.

---



In [4]:
DATA_PATH = Path("../data/processed/features_final.csv")

df = pd.read_csv(DATA_PATH)
df.head()


| | label | text | emoji_pos_count | emoji_neg_count |
|---|---|---|---|---|
| 0 | 1 | Good morning every one | 0 | 0 |
| 1 | 0 | TW: S AssaultActually horrified how many frien... | 0 | 1 |
| 2 | 1 | Thanks by has notice of me Greetings : Jossett... | 0 | 0 |
| 3 | 0 | its ending soon aah unhappy 😧 | 0 | 1 |
| 4 | 1 | My real time happy 😊 | 1 | 0 |


In [5]:
# Integrity checks
required_columns = {
    "text",
    "label",
    "emoji_pos_count",
    "emoji_neg_count",
}

assert required_columns.issubset(df.columns), "Missing required columns."
assert df.isna().sum().sum() == 0, "Dataset contains null values."

df.shape


(1000, 4)

### 📊 Dataset Integrity Confirmation

The finalized feature dataset was successfully loaded.

Key observations:

- The dataset contains **1,000 records**.
- Four columns are present:
  - `text`: raw tweet content
  - `label`: sentiment target
  - `emoji_pos_count`: count of positive emojis
  - `emoji_neg_count`: count of negative emojis

Integrity checks confirm:

- No missing values exist.
- All required columns are present.
- Feature engineering outputs were correctly persisted.

This dataset represents the finalized modeling input and will not undergo further transformation.
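The integrity checks listed in this section, including the label-encoding check, can be gathered into one reusable function. The sketch below is a minimal version under stated assumptions: the function name `validate_features` is hypothetical, and the binary 0/1 label encoding is inferred from the dataset preview rather than documented.

```python
import pandas as pd

REQUIRED_COLUMNS = {"text", "label", "emoji_pos_count", "emoji_neg_count"}


def validate_features(frame):
    """Return a list of integrity problems; an empty list means all checks passed."""
    problems = []
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # the remaining checks need these columns
    if frame[sorted(REQUIRED_COLUMNS)].isna().any().any():
        problems.append("null values present")
    # Assumes a binary 0/1 label encoding, as seen in the dataset preview.
    if not set(frame["label"].dropna().unique()) <= {0, 1}:
        problems.append("labels outside {0, 1}")
    return problems


good = pd.DataFrame({
    "text": ["Good morning every one", "its ending soon aah unhappy"],
    "label": [1, 0],
    "emoji_pos_count": [0, 0],
    "emoji_neg_count": [0, 1],
})
bad = good.assign(label=[1, 2])  # deliberately invalid label

good_problems = validate_features(good)
bad_problems = validate_features(bad)
```

Returning a list of problems instead of asserting makes it possible to report every failed check at once.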


# 🧩 Section 3: Train / Validation Split

This section establishes the evaluation framework for model training.

A stratified train–validation split is performed to ensure that:

* Class distributions remain consistent across both subsets
* Performance estimates are not distorted by differing class ratios
* Every model is evaluated on the same held-out data

The split configuration is fixed for reproducibility.

---



In [6]:
X_text = df["text"]
X_emoji = df[["emoji_pos_count", "emoji_neg_count"]]
y = df["label"]

X_train_text, X_val_text, X_train_emoji, X_val_emoji, y_train, y_val = train_test_split(
    X_text,
    X_emoji,
    y,
    test_size=0.2,
    stratify=y,
    random_state=SEED,
)


### 🔀 Train–Validation Partitioning

The dataset was partitioned into training and validation subsets using a stratified split.

Configuration:

- **80% Training** (800 samples)
- **20% Validation** (200 samples)

Stratification ensures that sentiment class distributions remain consistent across splits.

This design prevents evaluation bias and ensures that performance metrics reflect generalization rather than memorization.
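To see what stratification does to the class proportions, the following sketch repeats the same `train_test_split` configuration on a small balanced toy label vector. The toy data is an assumption standing in for the real 1,000-row dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy balanced labels standing in for the real dataset (assumption).
toy_labels = [0] * 50 + [1] * 50
toy_rows = list(range(len(toy_labels)))

rows_train, rows_val, y_train_toy, y_val_toy = train_test_split(
    toy_rows,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=42,
)

train_counts = Counter(y_train_toy)
val_counts = Counter(y_val_toy)
# Share of the positive class in each split; stratification keeps them aligned.
train_pos_share = train_counts[1] / len(y_train_toy)
val_pos_share = val_counts[1] / len(y_val_toy)
print(train_pos_share, val_pos_share)
```

Comparing the two shares is a cheap sanity check to run right after any split.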


# 🧩 Section 4: Text Vectorization

This section converts tweet text into numerical representations suitable for machine learning models.

Vectorization is performed using TF-IDF with word n-grams.

The resulting sparse matrices represent the textual component of the feature space.

Emoji polarity features are not yet incorporated at this stage.

---



In [7]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95,
)


In [8]:
X_train_text_vec = tfidf.fit_transform(X_train_text)
X_val_text_vec = tfidf.transform(X_val_text)

X_train_text_vec.shape, X_val_text_vec.shape


((800, 1265), (200, 1265))

### 🧠 Text Feature Space Construction

TF-IDF vectorization transformed tweet text into numerical feature representations.

Observations:

- The vectorizer produced **1,265 textual features**.
- Features include:
  - Individual words
  - Two-word phrases (bigrams)

This representation captures lexical sentiment cues such as:

- “happy”
- “good morning”
- “not happy”

These textual features form the primary predictive signal for the classifier.


# 🧩 Section 5: Feature Matrix Assembly

This section combines:

* TF-IDF text features
* Emoji polarity count features

The objective is to construct the final model input matrices used for training and evaluation.

Care is taken to preserve sparse matrix efficiency and feature alignment.

---



In [9]:
# Convert emoji features to numpy
X_train_emoji_np = X_train_emoji.to_numpy()
X_val_emoji_np = X_val_emoji.to_numpy()


In [10]:
# Combine sparse + dense features
X_train_final = hstack([X_train_text_vec, X_train_emoji_np])
X_val_final = hstack([X_val_text_vec, X_val_emoji_np])

X_train_final.shape, X_val_final.shape


((800, 1267), (200, 1267))

### 🧩 Final Feature Matrix Assembly

Emoji polarity features were appended to the TF-IDF text feature space.

Feature composition:

- 1,265 text features
- 2 emoji polarity features
  - `emoji_pos_count`
  - `emoji_neg_count`

Final dimensionality:

- Training matrix: 800 × 1,267
- Validation matrix: 200 × 1,267

This confirms successful integration of emoji-derived sentiment signals into the modeling representation.
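The assembly step can be reproduced on a toy example to confirm that the emoji columns land at the end of the matrix and that the result stays sparse. The arrays below are illustrative stand-ins, and requesting `format="csr"` is a suggestion rather than what the notebook does (by default `hstack` returns COO format).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

# Toy stand-ins: 3 rows of 4 TF-IDF features and 2 emoji count columns (assumption).
toy_text_vec = csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.7],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))
toy_emoji = np.array([[1, 0], [0, 2], [0, 0]])

# format="csr" yields a row-oriented sparse matrix that supports slicing.
toy_final = hstack([toy_text_vec, toy_emoji], format="csr")

combined_shape = toy_final.shape
# The last two columns should be exactly the emoji counts, in the same row order.
emoji_block = toy_final[:, -2:].toarray()
still_sparse = issparse(toy_final)
```

Because the emoji columns are appended last, their coefficients later appear at the final two positions of the weight vector.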


# 🧩 Section 6: Model Training

This section trains the final classifier on the assembled feature matrix, mirroring the dataset used by the production training pipeline.

Model selection prioritizes:

* Interpretability
* Stability on small datasets
* Computational efficiency

Linear classifiers such as Logistic Regression are well-suited for this task.

---



In [11]:
model = LogisticRegression(
    max_iter=1000,
    random_state=SEED,
)


In [12]:
model.fit(X_train_final, y_train)


| Parameter | Value |
|---|---|
| penalty | 'deprecated' |
| C | 1.0 |
| l1_ratio | 0.0 |
| dual | False |
| tol | 0.0001 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'lbfgs' |


### 🤖 Model Training Completion

A Logistic Regression classifier was successfully trained on the assembled feature matrix.

Model choice rationale:

- Well-suited for high-dimensional sparse text data
- Interpretable coefficient structure
- Stable convergence on small datasets

Training completed without convergence warnings, which suggests that the iteration budget (`max_iter=1000`) was sufficient for the solver on this feature matrix.
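Convergence can also be checked programmatically rather than by watching for warnings. The sketch below is one possible approach on synthetic data (an assumption standing in for the real feature matrix): it records any `ConvergenceWarning` raised during fitting and compares the fitted `n_iter_` attribute against `max_iter`.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Small synthetic problem standing in for the real feature matrix (assumption).
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 0] + 0.5 * toy_X[:, 1] > 0).astype(int)

toy_model = LogisticRegression(max_iter=1000, random_state=42)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_model.fit(toy_X, toy_y)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
# n_iter_ holds the iterations actually used; staying below max_iter
# means the solver stopped on its own tolerance criterion.
iterations_used = int(toy_model.n_iter_.max())
converged = not convergence_warnings and iterations_used < toy_model.max_iter
```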


# 🧩 Section 7: Model Evaluation

This section evaluates predictive performance on the validation dataset.

Reported metrics include:

* Accuracy
* F1 Score
* Precision and Recall (optional)
* Confusion Matrix (optional visualization)

These metrics determine deployment readiness.

---



In [13]:
y_pred = model.predict(X_val_final)


In [14]:
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

accuracy, f1


(0.815, 0.8340807174887892)

### 📈 Predictive Performance Evaluation

Model performance was evaluated on the held-out validation dataset.

Key metrics:

- **Accuracy:** 0.815
- **F1 Score:** 0.834

Interpretation:

- The model correctly classifies 81.5% of the held-out validation tweets.
- The F1 score is computed for the positive class (label 1) and is the harmonic mean of that class's precision and recall.

Given the dataset size and binary sentiment framing, this performance is considered strong and deployment-ready within the scope of this project.


In [15]:
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

           0       0.91      0.70      0.79       100
           1       0.76      0.93      0.83       100

    accuracy                           0.81       200
   macro avg       0.83      0.81      0.81       200
weighted avg       0.83      0.81      0.81       200



### 📊 Class-Level Performance Analysis

The classification report reveals class-specific predictive behavior.

Observations:

- Negative sentiment predictions show high precision but lower recall.
- Positive sentiment predictions exhibit high recall.

Interpretation:

- The model is conservative when predicting negative sentiment.
- It is more willing to assign positive sentiment labels.

This asymmetry is common in social media sentiment datasets.


In [16]:
confusion_matrix(y_val, y_pred)


array([[70, 30],
       [ 7, 93]])

### 🔢 Confusion Matrix Interpretation

The confusion matrix summarizes prediction outcomes.

Key insights:

- 70 negative tweets were correctly classified.
- 93 positive tweets were correctly classified.
- Misclassifications total 37 instances.

The error distribution shows a tendency to over-predict positive sentiment: 30 negative tweets were labeled positive, whereas only 7 positive tweets were labeled negative.

This behavior aligns with the classification report findings.
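The reported metrics can be recomputed by hand from the four confusion-matrix counts, which is a useful cross-check on the numbers above. The sketch below uses only the counts shown in the matrix; the variable names are illustrative.

```python
# Counts from the confusion matrix (rows = true label, columns = predicted label).
tn, fp = 70, 30
fn, tp = 7, 93

total = tn + fp + fn + tp
acc_from_cm = (tp + tn) / total

# Positive class (label 1): the class f1_score reports by default for binary targets.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class (label 0), treating 0 as the class of interest.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy={acc_from_cm:.3f}, f1_pos={f1_pos:.4f}")
print(f"precision_pos={precision_pos:.2f}, recall_pos={recall_pos:.2f}")
print(f"precision_neg={precision_neg:.2f}, recall_neg={recall_neg:.2f}")
```

The gap between positive-class precision and recall comes directly from the 30 false positives versus 7 false negatives.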


# 🧩 Section 8: Error Analysis (Lightweight)

This section inspects misclassified examples to better understand model limitations.

Focus areas include:

* Emoji-heavy misclassifications
* Sarcastic or ambiguous text
* Disagreement between text and emoji sentiment

This analysis is qualitative and does not trigger feature redesign.

---



In [17]:
results_df = pd.DataFrame({
    "text": X_val_text,
    "true_label": y_val,
    "predicted_label": y_pred,
})

misclassified = results_df[results_df["true_label"] != results_df["predicted_label"]]

misclassified.head(10)


| | text | true_label | predicted_label |
|---|---|---|---|
| 602 | and i would never forget about you!! | 0 | 1 |
| 786 | Go :\*!!... I know you can do it.. | 1 | 0 |
| 183 | the email I was hoping I'd get today | 0 | 1 |
| 676 | but no sign. | 0 | 1 |
| 109 | Late nights | 0 | 1 |
| 35 | good luck! You have my vote smile 😭 | 1 | 0 |
| 709 | Lets hope a VIP does this good deed. I want to... | 0 | 1 |
| 743 | I hate people who steal my ideas | 1 | 0 |
| 163 | How people write a joke on religion online.\*Jo... | 1 | 0 |
| 417 | beauty | 0 | 1 |


### 🧪 Misclassification Review

A subset of incorrectly predicted tweets was examined.

Observed failure patterns include:

- Text–emoji sentiment disagreement
- Ambiguous or sarcastic phrasing
- Minimal textual context

Example cases reveal that conflicting sentiment signals can challenge linear classifiers.

This analysis provides qualitative insight but does not warrant feature redesign.
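One way to make the text–emoji disagreement pattern easier to spot is to flag misclassified rows whose emoji polarity points away from the true label. The sketch below is a hypothetical helper on illustrative rows: the `emoji_conflict` column name is invented here, and the emoji counts are assumptions chosen to match the emoji visible in the text.

```python
import pandas as pd

# Illustrative rows; the emoji counts are assumptions, not values from the dataset.
toy_errors = pd.DataFrame({
    "text": [
        "good luck! You have my vote smile 😭",
        "but no sign.",
        "I hate people who steal my ideas",
    ],
    "true_label": [1, 0, 1],
    "predicted_label": [0, 1, 0],
    "emoji_pos_count": [0, 0, 0],
    "emoji_neg_count": [1, 0, 0],
})

# A row "conflicts" when the dominant emoji polarity disagrees with the true label.
emoji_says_negative = toy_errors["emoji_neg_count"] > toy_errors["emoji_pos_count"]
emoji_says_positive = toy_errors["emoji_pos_count"] > toy_errors["emoji_neg_count"]
toy_errors["emoji_conflict"] = (
    (emoji_says_negative & (toy_errors["true_label"] == 1))
    | (emoji_says_positive & (toy_errors["true_label"] == 0))
)

conflict_rows = toy_errors[toy_errors["emoji_conflict"]]
conflict_index = conflict_rows.index.tolist()
```

Rows without emojis are never flagged, so the remaining errors can be reviewed separately for ambiguity or missing context.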


# 🧩 Section 9: Model Coefficient Inspection (Optional)

This section examines learned model weights to interpret sentiment drivers.

Examples include:

* Most positive text features
* Most negative text features
* Influence of emoji polarity counts

This enhances transparency and explainability.

---



In [18]:
feature_names = tfidf.get_feature_names_out()

emoji_features = ["emoji_pos_count", "emoji_neg_count"]

all_feature_names = np.concatenate([feature_names, emoji_features])


In [19]:
coefficients = model.coef_[0]

coef_df = pd.DataFrame({
    "feature": all_feature_names,
    "weight": coefficients,
})


In [20]:
# Top positive signals
coef_df.sort_values("weight", ascending=False).head(15)


| | feature | weight |
|---|---|---|
| 422 | happy | 2.828551 |
| 903 | smile | 2.769879 |
| 1265 | emoji_pos_count | 1.531368 |
| 394 | good | 1.188772 |
| 962 | thanks | 1.127404 |
| 325 | for | 0.937982 |
| 404 | great | 0.891801 |
| 659 | morning | 0.805833 |
| 963 | thanks for | 0.77205 |
| 685 | new | 0.722856 |


### ➕ Strongest Positive Sentiment Signals

Top positively weighted features were inspected.

Notable signals include:

- “happy”
- “smile”
- “good”
- “thanks”
- `emoji_pos_count`

The presence of `emoji_pos_count` among the top predictors confirms that positive emojis contribute meaningful sentiment signal within the model.


In [21]:
# Top negative signals
coef_df.sort_values("weight").head(15)


| | feature | weight |
|---|---|---|
| 1266 | emoji_neg_count | -2.570797 |
| 1096 | unhappy | -2.103328 |
| 211 | crying | -1.834094 |
| 904 | so | -1.02226 |
| 357 | fun | -0.83761 |
| 721 | of the | -0.749987 |
| 651 | miss | -0.720617 |
| 445 | he | -0.717463 |
| 1220 | yeah | -0.692912 |
| 609 | love to | -0.681424 |


### ➖ Strongest Negative Sentiment Signals

Top negatively weighted features include:

- `emoji_neg_count`
- “unhappy”
- “crying”

The strong negative coefficient for `emoji_neg_count` supports the usefulness of the emoji polarity features.

This suggests that negative emojis act as a strong indicator of negative sentiment in this dataset.
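Logistic regression weights are changes in log-odds, so exponentiating a weight gives the factor by which the odds of the positive class change per unit of the feature. The sketch below applies this to the two emoji weights reported in the coefficient tables; the variable names are illustrative.

```python
import math

# Weights copied from the coefficient tables above.
weights = {
    "emoji_pos_count": 1.531368,
    "emoji_neg_count": -2.570797,
}

# exp(weight) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the feature, holding all other features fixed.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds of positive sentiment multiplied by {ratio:.3f} per emoji")
```

One caveat when comparing magnitudes: the emoji features are raw counts, while TF-IDF values are L2-normalized per document by default, so weights for the two feature groups are not on the same scale.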


# 🧩 Section 11: Training Summary and Closure

This section formally concludes the modeling phase.

It documents:

* Final model performance
* Selected classifier rationale
* Feature set confirmation
* Artifact readiness for deployment

No further model experimentation is planned within this notebook.

---



In [24]:
print("Final Model Performance")
print("-----------------------")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")


Final Model Performance
-----------------------
Accuracy: 0.8150
F1 Score: 0.8341


### 🏁 Training Phase Summary

Final evaluation metrics:

- Accuracy: 0.8150
- F1 Score: 0.8341

These results confirm that the model generalizes effectively on unseen validation data.

The classifier is considered production-ready within the scope of this project.


# 🔒 Modeling Guarantees

This notebook guarantees that:

* Training uses a feature-frozen dataset
* Evaluation occurs on unseen validation data
* Serialized artifacts reflect validated model state
* Training is reproducible given fixed seeds and inputs
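The reproducibility guarantee can be spot-checked by running the split-and-fit step twice with the same seed and comparing the learned coefficients. The sketch below does this on synthetic data (an assumption standing in for the frozen feature matrix); the helper name `train_once` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

SEED = 42


def train_once(X, y, seed):
    """Split and fit with a fixed seed; returns the learned coefficients."""
    X_tr, _, y_tr, _ = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.coef_.copy()


# Synthetic data standing in for the frozen feature matrix (assumption).
rng = np.random.default_rng(SEED)
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 0] > 0).astype(int)

first_run = train_once(toy_X, toy_y, SEED)
second_run = train_once(toy_X, toy_y, SEED)
# allclose rather than exact equality, to tolerate floating-point noise.
runs_match = np.allclose(first_run, second_run)
```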

---



# ➡️ Next Steps

Following completion of this notebook:

* A dedicated production training pipeline will be formalized via
  `emoji_sentiment_analysis/modeling/train_model.py`,
  which will reproducibly generate all serialized model artifacts.

* Inference pipelines will consume artifacts produced by the finalized training script rather than notebook execution.

* Prediction services and application-layer integrations will be implemented on top of the frozen model outputs.

* Monitoring, audit logging, and inference tracking mechanisms will be integrated into downstream services.
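For orientation, the inference step mentioned in the list above could look roughly like the sketch below. It is not the planned implementation: the function name `predict_sentiment`, its signature, and the toy training data are assumptions, and emoji counts are passed in directly because the emoji extraction logic lives in earlier pipeline stages.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(texts, emoji_counts, model, vectorizer):
    """Score new tweets with a fitted vectorizer and classifier.

    emoji_counts is an (n, 2) array of [emoji_pos_count, emoji_neg_count],
    in the same column order used at training time.
    """
    text_vec = vectorizer.transform(texts)  # transform only; never refit at inference
    features = hstack([text_vec, np.asarray(emoji_counts)], format="csr")
    return model.predict(features)


# Toy training run standing in for the real artifacts (assumption).
train_texts = ["so happy today", "happy smile", "very unhappy", "unhappy and crying"]
train_emoji = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
train_labels = [1, 1, 0, 0]

toy_tfidf = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000, random_state=42)
toy_model.fit(
    hstack([toy_tfidf.fit_transform(train_texts), train_emoji], format="csr"),
    train_labels,
)

predictions = predict_sentiment(
    ["happy smile today", "crying, so unhappy"],
    [[1, 0], [0, 1]],
    toy_model,
    toy_tfidf,
)
```

The key constraint is that inference reuses the fitted vectorizer and the same column order as training, so that each coefficient lines up with the feature it was learned for.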

---



## 🔬 Post-Training Analysis Extension

A supplementary notebook, **Notebook 4.5: Model Interpretability & Performance Deep Dive**, will extend the work conducted here.

This follow-up analysis will focus on:

* Prediction confidence diagnostics
* Emoji feature contribution analysis
* Misclassification archetype identification
* Counterfactual and stability testing
* Behavioral interpretability of the trained classifier

Notebook 4.5 does not retrain or modify the model.
Instead, it extracts deeper analytical insight from the frozen training artifacts.

---



This notebook therefore marks the transition from **model construction** to:

* production training pipeline engineering
* interpretability analysis
* and deployment system design

---
