# Baseline ML Modeling with Logistic Regression and TF-IDF

This notebook builds a baseline author classifier from TF-IDF features and logistic regression. It covers loading and splitting the data, training the model, and evaluating it.

## Logistic Regression as a Semantic Classifier

Despite its name, logistic regression is a classification model: it estimates the probability of each class and predicts the most likely one. This baseline assumes that different authors draw on different lexicons, so when we measure the most common relevant words in the given texts, different patterns should be associated with different authors.

The idea is that if our baseline logistic regression classifier reaches 70-80% precision and recall, the hypothesis has promise and a more mature model should be able to classify these texts better.
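One way to make that target checkable is sketched below. The labels and predictions are hypothetical, the 0.70 threshold is the lower end of the range above, and macro averaging is an assumption (the target does not say how per-author scores are combined).

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true labels and predictions, for illustration only.
y_true = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"]
y_hat = ["a", "a", "a", "b", "b", "b", "b", "a", "c", "c"]

# Macro averaging weights every author equally, regardless of class size.
precision, recall, _, _ = precision_recall_fscore_support(
    y_true, y_hat, average="macro", zero_division=0
)

TARGET = 0.70  # lower bound of the 70-80% range
meets_target = bool(precision >= TARGET and recall >= TARGET)
print(f"precision={precision:.3f} recall={recall:.3f} meets_target={meets_target}")
```

Macro averaging is a reasonable default when authors are unevenly represented, since a weighted average can hide a poorly classified minority author.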

## Intuition and Limitations

Many fiction authors have their own distinctive styles and genres. A passage from J.R.R. Tolkien is likely to sound different from a passage by Frank Herbert. If we can quantify some of these lexical patterns, even crudely through term frequency, we can test whether that hypothesis holds for our dataset.

One limitation of this approach is that it works best on longer, more stylized passages. Of the two Tolkien examples below, the first is distinctively Tolkien, while the second is so short that it gives a term-frequency model little to work with:
> The Black Rider flung back his hood, and behold! he had a kingly crown; and yet upon no head visible was it set. The red fires shone between it and the mantled shoulders vast and dark. From a mouth unseen there came a deadly laughter.
>
> 'Old fool!' he said. 'Old fool! This is my hour. Do you not know Death when you see it? Die now and curse in vain!' And with that he lifted high his sword and flames ran down the blade.
>
> **Return of the King**, _The Siege of Gondor_

> "Noon?" said Sam, trying to calculate. "Noon of what day?"
>
> **Return of the King**, _The Field of Cormallen_
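To give a rough sense of the gap, the sketch below counts how many unigram and bigram features each of the two passages activates. Fitting the vocabulary on just these two passages is a simplification made for illustration; in the notebook the vocabulary comes from the training set and is capped at 10,000 features, so the real counts will differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# First paragraph of the longer quotation and the full shorter quotation.
long_passage = (
    "The Black Rider flung back his hood, and behold! he had a kingly crown; "
    "and yet upon no head visible was it set. The red fires shone between it "
    "and the mantled shoulders vast and dark. From a mouth unseen there came "
    "a deadly laughter."
)
short_passage = '"Noon?" said Sam, trying to calculate. "Noon of what day?"'

# Same n-gram range as the baseline; vocabulary fitted on these two texts only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform([long_passage, short_passage])

long_features = matrix[0].nnz   # non-zero TF-IDF entries for the long passage
short_features = matrix[1].nnz  # non-zero TF-IDF entries for the short passage
print(f"long: {long_features} features, short: {short_features} features")
```

The short passage activates only a handful of features, so a single common word can swing its predicted author.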



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Add the parent directory to the import path so the project's `src` package resolves
import sys
sys.path.append('../')

from src.utils.helpers import classification_report_to_df

In [None]:
# Load the dataset (expects `text` and `author` columns)
df = pd.read_csv('../data/raw/train.csv')

# Stratified 80/20 split so each author keeps the same share in train and test
X = df['text']
y = df['author']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

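# Build pipeline: unigram + bigram TF-IDF features (top 10,000) feeding logistic regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=10000)),
    # `multi_class` is deprecated in recent scikit-learn releases; with lbfgs the
    # multinomial (softmax) formulation is already used for three or more classes
    ('clf', LogisticRegression(max_iter=1000, solver='lbfgs'))
])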

# Train
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluation
display(classification_report_to_df(y_test, y_pred))

# Optional: Plot confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=pipeline.classes_)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=pipeline.classes_, yticklabels=pipeline.classes_)
plt.title('Confusion Matrix - Baseline TF-IDF + Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
