In [None]:
# Load full dataset
import pandas as pd
df = pd.read_csv("../data/processed_dataset.csv")

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Define feature set and target variable
X = df["Text"]
y = df["Label"]

# Convert text into numerical features using TF-IDF
# TF-IDF (Term Frequency–Inverse Document Frequency) transforms raw text into a matrix of word importance scores.
# This helps capture how unique or common a word is across the dataset, improving classification.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X_tfidf = vectorizer.fit_transform(X)

# Split the data 80/20 into training and test sets.
# stratify=y keeps the AI/Human class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Train a Logistic Regression classifier
# Logistic Regression is a simple yet effective linear model used here as a baseline to classify AI vs Human text.

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


# Evaluate model performance
# We check accuracy, precision, recall, and F1-score using a confusion matrix and classification report.

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[6 4]
 [1 9]]
              precision    recall  f1-score   support

          AI       0.86      0.60      0.71        10
       Human       0.69      0.90      0.78        10

    accuracy                           0.75        20
   macro avg       0.77      0.75      0.74        20
weighted avg       0.77      0.75      0.74        20



## TF-IDF + Logistic Regression Classifier (Baseline)

To establish a baseline for detecting AI-generated text, we trained a simple Logistic Regression classifier on TF-IDF-transformed text samples.

---

### Feature Engineering
- **Vectorizer:** `TfidfVectorizer` with `max_features=1000` and English stop words removed
- **Input:** Text column (100 samples total: 50 AI, 50 Human)
- **Output:** Sparse matrix of word importance scores
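
To make the "sparse matrix of word importance scores" concrete, here is a minimal sketch that applies the same vectorizer settings to a toy corpus. The three sentences are hypothetical stand-ins for the `Text` column, not samples from the dataset.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for df["Text"].
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Same settings as the notebook's vectorizer.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X_toy = vectorizer.fit_transform(docs)

# One row per document, one column per retained term.
print(X_toy.shape)
print(sorted(vectorizer.vocabulary_))

# By default each row is L2-normalised, so row norms are 1.
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1)).ravel())
print(row_norms)
```

Stop words such as "the" never reach the vocabulary, and terms that appear in fewer documents receive a higher inverse-document-frequency weight than terms shared across documents.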

---

### Model Summary
- **Model Used:** Logistic Regression (Scikit-learn)
- **Train/Test Split:** 80/20 stratified
- **Evaluation Metrics:** Confusion matrix, precision, recall, F1-score
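
The cells above fit the vectorizer on the full dataset before splitting, so the IDF weights are computed with the test documents included. A possible variant, sketched here under the assumption that a leakage-free baseline is wanted, splits the raw text first and wraps both steps in a scikit-learn pipeline. The toy corpus is hypothetical and only stands in for `df["Text"]` and `df["Label"]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df["Text"] and df["Label"].
texts = [
    "as an ai language model i cannot browse",
    "in conclusion the aforementioned factors matter",
    "furthermore it is important to note the results",
    "overall this highlights the significance of data",
    "in summary the analysis demonstrates clear trends",
    "lol i totally forgot my keys again",
    "we grabbed coffee and missed the bus",
    "my dog chewed the couch yesterday ugh",
    "honestly the movie was kinda boring",
    "cant believe it rained all weekend",
]
labels = ["AI"] * 5 + ["Human"] * 5

# Split the raw text first, so the vectorizer never sees test documents.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training split only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=1000, stop_words="english"),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Every vocabulary term comes from a training document.
vocabulary = set(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
train_tokens = set(" ".join(X_train).split())
print(vocabulary <= train_tokens)
```

Because the vectorizer lives inside the pipeline, calling `pipeline.predict` on new text applies the training-time vocabulary and IDF weights automatically.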

---

### Results

| Class  | Precision | Recall | F1-score |
|--------|-----------|--------|----------|
| AI     | 0.86      | 0.60   | 0.71     |
| Human  | 0.69      | 0.90   | 0.78     |
| **Accuracy** |  –       | –      | **0.75**     |

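- The model **recalls human text better** than AI text (0.90 vs. 0.60), but **precision is higher for AI** (0.86 vs. 0.69): only 1 of 10 human samples was flagged as AI, while 4 of 10 AI samples were labelled human.
- F1-scores of 0.71 (AI) and 0.78 (Human) show balanced but imperfect performance, suggesting that TF-IDF features alone already offer some separation power. With only 20 test samples, these figures are rough estimates.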
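
As a sanity check, the figures in the table can be re-derived by hand from the printed confusion matrix. The sketch below assumes scikit-learn's default label ordering (sorted, so `AI` before `Human`), with rows as true labels and columns as predictions; the helper `per_class_metrics` is illustrative and not part of the notebook.

```python
# Confusion matrix printed by the notebook: rows = true label, columns = prediction,
# both ordered as ["AI", "Human"].
cm = [[6, 4],
      [1, 9]]
labels = ["AI", "Human"]


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                   # true class i, predicted as something else
        fp = sum(row[i] for row in cm) - tp    # predicted class i, actually something else
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(cm, labels)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For the AI class this gives precision 6/7 and recall 6/10; for the Human class, precision 9/13 and recall 9/10, which round to the values reported in the table.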

---
