## Vectorising Features into TF-IDF Values

## TF-IDF (Term Frequency-Inverse Document Frequency)
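TF-IDF is a statistical measure of how important a word is to a document within a corpus (a collection of documents).

It is the product of two quantities: term frequency (TF), how often the word occurs in the document, and inverse document frequency (IDF), how rare the word is across the corpus.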

$\text{TF-IDF} = TF \times IDF$

**Usage**
It helps identify words that are significant in a document but not too common across all documents.

### sklearn differences
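To avoid a possible division by zero, sklearn uses a slightly different (smoothed) formula.
The `1`s added to the numerator and denominator act as if one extra document containing every term in the collection EXACTLY once had been seen. The trailing `+ 1` keeps terms that occur in every document from being ignored entirely.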

$IDF(t) = \log\left(\frac{1 + n}{1 + DF(t)}\right) + 1$

here, 
	`n` is the total number of documents in the corpus
	`t` is the term
	`DF(t)` is the number of documents that contain `t`

`log` is used to dampen the effect of very large values.
For example, a count of 1 million and a count of 2 million should carry nearly the same influence, so the logarithm compresses such values instead of using them as-is.
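
As a rough illustration of the smoothed formula above, the sketch below computes IDF by hand for a toy corpus using only the standard library. The corpus and the helper name `smoothed_idf` are made up for this example, and the natural logarithm is assumed. By default scikit-learn also L2-normalises each document row afterwards, which is not shown here.

```python
import math

# toy corpus (hypothetical): each document is already split into words
docs = [
    ["trump", "tax", "plan"],
    ["trump", "court", "ruling"],
    ["trump", "senate", "vote"],
]

def smoothed_idf(term, documents):
    """IDF(t) = ln((1 + n) / (1 + DF(t))) + 1, the smoothed variant."""
    n = len(documents)                                # total number of documents
    df = sum(1 for doc in documents if term in doc)   # documents containing the term
    return math.log((1 + n) / (1 + df)) + 1

idf_common = smoothed_idf("trump", docs)   # appears in every document
idf_rare = smoothed_idf("tax", docs)       # appears in one document
idf_unseen = smoothed_idf("hoax", docs)    # appears nowhere, still no division by zero

print(idf_common, idf_rare, idf_unseen)

# the logarithm compresses large counts: doubling 1 million adds only ln(2)
print(math.log(2_000_000) - math.log(1_000_000))
```

The term found in every document gets the smallest weight, and rarer terms get progressively larger ones.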

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# read and store the csv file dataset as a dataframe
df = pd.read_csv("../dataset/1_fake_and_real_news.csv")
print(df.shape)
df.head()

(9900, 2)


Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [3]:
# an imbalanced dataset can affect training;
# this one is almost 50:50, so the imbalance is treated as negligible.
df.label.value_counts()

label
Fake    5000
Real    4900
Name: count, dtype: int64

In [4]:
# mapping label categories to numbers
# fake: 0, real: 1
# df.label.map({
#     "Fake": 0,
#     "Real": 1,
# })
# takes in a dictionary, and produces an effect on dataframe

# create a new column to the current dataframe
df["df_label"] = df.label.map({
    "Fake": 0,
    "Real": 1,
})

df.head()
print(df["Text"])

0        Top Trump Surrogate BRUTALLY Stabs Him In The...
1       U.S. conservative leader optimistic of common ...
2       Trump proposes U.S. tax overhaul, stirs concer...
3        Court Forces Ohio To Allow Millions Of Illega...
4       Democrats say Trump agrees to work on immigrat...
                              ...                        
9895     Wikileaks Admits To Screwing Up IMMENSELY Wit...
9896    Trump consults Republican senators on Fed chie...
9897    Trump lawyers say judge lacks jurisdiction for...
9898     WATCH: Right-Wing Pastor Falsely Credits Trum...
9899     Sean Spicer HILARIOUSLY Branded As Chickensh*...
Name: Text, Length: 9900, dtype: object


In [5]:
# create a vectorizer object for tf-idf
"""
.fit_transform(raw_documents: iterable of str) -> sparse tf-idf matrix
# the vocabulary is learned from the text, one feature (column) per term
# each document (row) gets a TF-IDF value for every term it contains
# the resulting sparse value matrix is returned.
"""

vectorizer = TfidfVectorizer()
transformed_output = vectorizer.fit_transform(df["Text"])

# print(vectorizer.vocabulary_)
print(transformed_output)

# a lower tf-idf value means the term is rare in that document or common across the corpus

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2171752 stored elements and shape (9900, 58445)>
  Coords	Values
  (0, 52736)	0.08528328193860288
  (0, 53356)	0.24524663418922665
  (0, 50744)	0.09251973107666457
  (0, 10048)	0.09392101862235107
  (0, 49438)	0.12871702155365564
  (0, 25352)	0.11471603956019787
  (0, 26799)	0.015539354339607682
  (0, 51929)	0.07504302286307662
  (0, 7267)	0.0363100044045118
  (0, 24816)	0.13020485802624113
  (0, 39074)	0.20959773740142504
  (0, 55605)	0.041286127767650706
  (0, 28064)	0.03525468440873214
  (0, 32092)	0.05478450462654908
  (0, 6542)	0.07517839739693681
  (0, 52211)	0.044047628252456714
  (0, 44093)	0.02984434205652043
  (0, 41173)	0.03979535097146953
  (0, 10763)	0.05086183225006348
  (0, 17439)	0.03882026654266726
  (0, 27981)	0.12045052559736223
  (0, 32163)	0.06532150123396688
  (0, 50663)	0.08050833780921683
  (0, 19741)	0.09161237789961373
  (0, 22102)	0.02042598845225193
  :	:
  (9899, 58115)	0.03696372778235382
  (989

## Preprocessing Text

### The basic idea is:
1. remove stop-words (is, and, the, etc.)
2. lemmatization (reduce words to root form, running -> run)
3. convert words to lowercase
4. remove periods and other irrelevant special characters
5. tokenization: split text into words
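
The steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the spaCy pipeline used in this notebook: the stop-word list is deliberately tiny, the helper name `preprocess_basic` is hypothetical, and lemmatization is left out because it needs a language model.

```python
import re

# deliberately tiny stop-word list, for illustration only
STOP_WORDS = {"is", "and", "the", "a", "an", "of", "to", "in"}

def preprocess_basic(text):
    # lowercase, then keep runs of letters, digits, and apostrophes as tokens;
    # this drops periods, commas, and other special characters
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # drop stop words; lemmatization is intentionally not attempted here
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = preprocess_basic("The court is running the hearing, and the Senate votes.")
print(cleaned)
```

Note that "running" and "votes" stay unchanged here; a lemmatizer would reduce them to "run" and "vote".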

In [6]:
# preprocessing the text
import spacy

# load the model once (takes a few seconds)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # disable unused components

def preprocess_spacy(text):
    doc = nlp(text)
    tokens = [
        token.lemma_.lower() 
        for token in doc 
        if not token.is_stop and not token.is_punct
    ]
    return " ".join(tokens)

In [7]:
# add the preprocessed text as a column to dataframe
df["preprocessed_text"] = df["Text"].apply(preprocess_spacy)

In [8]:
df.head()

Unnamed: 0,Text,label,df_label,preprocessed_text
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0,trump surrogate brutally stabs pathetic vide...
1,U.S. conservative leader optimistic of common ...,Real,1,u.s. conservative leader optimistic common gro...
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1,trump propose u.s. tax overhaul stir concern d...
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0,court forces ohio allow millions illegally p...
4,Democrats say Trump agrees to work on immigrat...,Real,1,democrats trump agree work immigration bill wa...


In [9]:
# write the preprocessed data into an output file for this dataset
df.to_csv("../dataset/1_output.csv", index=False)

## Dataset Split: Training and Testing

In [26]:
from sklearn.model_selection import train_test_split

# now, read data from the generated output file
df = pd.read_csv("../dataset/1_output.csv")

transformed_output = vectorizer.fit_transform(df["preprocessed_text"])
X = transformed_output
y = df["df_label"]

# splits dataset into two parts:
# training data (80%) and testing data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
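
The cell above fits the vectorizer on every row before splitting, so document frequencies from the test rows also shape the training features. A common alternative, sketched below on made-up data, is to split the text first and fit the vectorizer on the training part only. The variable names and the toy texts are assumptions for this example and are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# hypothetical stand-in for df["preprocessed_text"] and df["df_label"]
texts = [
    "trump surrogate brutally stab", "court force ohio allow",
    "wikileaks admit screw", "pastor falsely credit trump",
    "conservative leader optimistic", "trump propose tax overhaul",
    "democrats trump agree immigration", "trump consult senator fed",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# split the raw text first; stratify keeps the class ratio in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

split_vectorizer = TfidfVectorizer()
X_tr = split_vectorizer.fit_transform(text_train)   # learn vocabulary and IDF from training text only
X_te = split_vectorizer.transform(text_test)        # reuse them; unseen test words are ignored

print(X_tr.shape, X_te.shape)
```

Passing a fixed `random_state` also makes the split, and therefore the reported scores, repeatable between runs.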

## Logistic Regression
A *statistical* Machine Learning algorithm used for Binary Classification: in its basic form it predicts one of two classes, 1 or 0 (here, real and fake).

It outputs a probability between 0 and 1 that an input belongs to class 1; by default, inputs with a probability above 0.5 are labelled 1 and the rest 0.
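
A minimal sketch of how such a probability can be produced: a weighted sum of the features is passed through the sigmoid function. The two features, the weights, and the bias below are made up for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_real(features, weights, bias):
    # weighted sum of the features, passed through the sigmoid
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# hypothetical TF-IDF values for the terms ("hoax", "official") and made-up weights
weights = [-3.0, 2.5]
bias = 0.1

p_real = predict_proba_real([0.6, 0.0], weights, bias)   # document mentioning "hoax"
label = 1 if p_real > 0.5 else 0                         # 1 = Real, 0 = Fake
print(round(p_real, 3), label)
```

The negative weight on "hoax" pushes the probability of class 1 below 0.5, so this document is labelled 0 (Fake).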

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# decision boundary, separates "Real" and "Fake" news based on word patterns
model = LogisticRegression()
model.fit(X_train, y_train)
# if words like, hoax or scam -> likely fake
# if words like, official or study -> likely real

In [13]:
# now, the LogisticRegression model is trained
# use the test dataset to predict the output from the trained model
y_pred = model.predict(X_test)
print(y_pred)

[0 1 1 ... 1 0 1]


In [14]:
# evaluating performance
# compare predicted values with actual values
print(accuracy_score(y_test, y_pred))

0.9914141414141414


In [15]:
print(classification_report(y_test, y_pred))
# precision: % of predicted fake news, that was actually fake
# recall: % of actual fake news correctly detected

# f1-score: balance between precision and recall

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       985
           1       0.99      0.99      0.99       995

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980
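
To make the columns of the report concrete, the sketch below computes precision, recall, and F1 by hand for one class. The label lists are made up and the helper name `precision_recall_f1` is hypothetical; it is meant as an illustration of the definitions, not a replacement for `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive):
    # count true positives, false positives, and false negatives for one class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# made-up labels: 0 = Fake, 1 = Real
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

p, r, f1 = precision_recall_f1(toy_true, toy_pred, positive=0)
print(p, r, f1)
```

Here 3 of the 5 items predicted as Fake are truly Fake (precision 0.6), and 3 of the 4 truly Fake items are found (recall 0.75).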



## Random Forest

A powerful ensemble learning model which can work well with TF-IDF vectors for *text classification* tasks like *sentiment analysis*, *spam detection*, or topic categorization.

The main dataset is resampled into random subsets (bootstrap samples, drawn with replacement), and a decision tree is trained on each subset. Each tree also considers only a random subset of the features at each split. Because of this random sampling, the model is called a random forest.

At prediction time, each input is passed to every tree, and the most common output among all the trees is selected.
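
The two ideas above, random resampling and picking the most common output, can be sketched with the standard library. This is a simplified illustration under stated assumptions: the data is made up, and a one-threshold "stump" stands in for a full decision tree. Real random forests also sample features at each split, which is not shown.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# made-up dataset: (feature value, label)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

def bootstrap_sample(rows):
    # draw len(rows) rows with replacement, so some rows repeat and some are left out
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    # stand-in for a decision tree: threshold at the mean feature value of the sample
    threshold = sum(x for x, _ in rows) / len(rows)
    return lambda x: 1 if x > threshold else 0

# one stump per bootstrap sample forms the "forest"
forest = [fit_stump(bootstrap_sample(data)) for _ in range(25)]

def forest_predict(x):
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]   # majority vote

print(forest_predict(0.15), forest_predict(0.85))
```

Each stump sees a slightly different sample and so learns a slightly different threshold; the vote smooths out those differences.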

In [16]:
# using random forest classifier to achieve the same
from sklearn.ensemble import RandomForestClassifier

In [17]:
# initialise and train the model on the training dataset
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

y_pred = rf_clf.predict(X_test)
y_pred

array([0, 1, 1, ..., 1, 0, 1])

In [18]:
# generate an accuracy and classification report
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9974747474747475
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       985
           1       1.00      0.99      1.00       995

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980



## XGBoost
eXtreme Gradient Boosting (XGBoost) is a supervised machine learning algorithm built on gradient-boosted decision trees and designed for speed and performance.
XGBoost builds an ensemble of *weak learners (here, decision trees)* sequentially: each new tree is fitted to correct the errors left by the trees before it.
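
The sequential error-correcting idea can be sketched as plain gradient boosting for squared error, where each new weak learner is fitted to the residuals of the current ensemble. This is a simplified illustration on made-up 1-D data, not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization. The helper names `fit_stump` and `mse` are hypothetical.

```python
# made-up 1-D regression data (xs must be sorted in ascending order)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

def fit_stump(x_vals, targets):
    """Find the single split that minimises squared error on the targets."""
    best = None
    for split in x_vals[:-1]:   # skip the last value so the right side is never empty
        left = [t for x, t in zip(x_vals, targets) if x <= split]
        right = [t for x, t in zip(x_vals, targets) if x > split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def mse(preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean prediction
errors = [mse(preds)]

for _ in range(5):
    residuals = [y - p for y, p in zip(ys, preds)]   # what the ensemble still gets wrong
    stump = fit_stump(xs, residuals)                 # next weak learner targets those errors
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    errors.append(mse(preds))

print([round(e, 4) for e in errors])
```

The printed training error shrinks from round to round, mirroring the falling `logloss` values that XGBoost prints during training.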

### Regularization
XGBoost uses regularization to prevent overfitting: it adds *penalty terms* to the model's loss function and limits model complexity through the following controls.
- L1/L2 regularization.
- Max depth control.
- Column/Row subsampling.
- Early stopping.
- Tree pruning.
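
As a rough guide, each item in the list above corresponds to a keyword argument of `XGBClassifier`. The values below are placeholders chosen for illustration, not tuned settings, and passing `early_stopping_rounds` to the constructor assumes a reasonably recent XGBoost release.

```python
# keyword arguments accepted by xgboost.XGBClassifier; the values are placeholders
regularised_params = {
    "reg_alpha": 0.1,             # L1 penalty on leaf weights
    "reg_lambda": 1.0,            # L2 penalty on leaf weights
    "max_depth": 6,               # cap on tree depth
    "subsample": 0.8,             # fraction of rows sampled per tree
    "colsample_bytree": 0.8,      # fraction of columns sampled per tree
    "early_stopping_rounds": 10,  # stop when the eval metric stalls for 10 rounds
    "gamma": 0.5,                 # minimum loss reduction required to keep a split (pruning)
}

for name, value in regularised_params.items():
    print(f"{name} = {value}")
```

The dictionary could then be unpacked into the constructor, for example `XGBClassifier(**regularised_params)`.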

In [20]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    n_estimators=100,
    early_stopping_rounds=10,
    random_state=42
)

In [29]:
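# train the model on the training data
# eval_set: data on which the eval metric is tracked after each boosting round (here, the test set)
# verbose: defaults to True, so progress is printed every round
# note: early stopping uses eval_set, so a separate validation split would keep the test set untouched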
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)]
)

[0]	validation_0-logloss:0.59873
[1]	validation_0-logloss:0.52358
[2]	validation_0-logloss:0.45846
[3]	validation_0-logloss:0.40872
[4]	validation_0-logloss:0.36109
[5]	validation_0-logloss:0.32006
[6]	validation_0-logloss:0.28450
[7]	validation_0-logloss:0.25355
[8]	validation_0-logloss:0.22640
[9]	validation_0-logloss:0.20239
[10]	validation_0-logloss:0.18131
[11]	validation_0-logloss:0.16256
[12]	validation_0-logloss:0.14590
[13]	validation_0-logloss:0.13111
[14]	validation_0-logloss:0.11790
[15]	validation_0-logloss:0.10615
[16]	validation_0-logloss:0.09566
[17]	validation_0-logloss:0.08629
[18]	validation_0-logloss:0.07783
[19]	validation_0-logloss:0.07114
[20]	validation_0-logloss:0.06524
[21]	validation_0-logloss:0.05898
[22]	validation_0-logloss:0.05338
[23]	validation_0-logloss:0.04832
[24]	validation_0-logloss:0.04377
[25]	validation_0-logloss:0.03962
[26]	validation_0-logloss:0.03594
[27]	validation_0-logloss:0.03277
[28]	validation_0-logloss:0.02970
[29]	validation_0-loglos

In [30]:
y_pred = xgb_model.predict(X_test)

# scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       985
           1       1.00      1.00      1.00       995

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980



## CatBoost
Categorical Boosting (CatBoost) is a Gradient Boosting Machine Learning algorithm that natively handles *categorical features* and also accepts text columns, so there's no need to encode the text manually (for example, as TF-IDF values).

Unlike XGBoost, it can be given raw text directly: columns listed in `text_features` are tokenized and converted to numeric features internally, so no explicit preprocessing of the text is required.

In [42]:
# since CatBoost handles raw_text natively
# there's no need for preprocessed_text.
X = df[["Text"]]
y = df["df_label"]

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [43]:
# initialise CatBoost classifier
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,            # number of boosting rounds.
    learning_rate=0.1,         # step-size shrinkage
    depth=6,                   # tree-depth
    text_features=["Text"],    # column(s) to treat as raw text
    eval_metric="Accuracy",    # metric reported on eval_set and used for early stopping
    verbose=100,               # print progress every 100 iterations
    random_state=42
)

model.fit(
    X_train, 
    y_train,
    eval_set=(X_test, y_test),  # validating data
    early_stopping_rounds=50    # if no improvement after 50 rounds, stop
)

0:	learn: 0.9991162	test: 0.9994949	best: 0.9994949 (0)	total: 47.2ms	remaining: 23.5s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 1
bestIteration = 1

Shrink model to first 2 iterations.


<catboost.core.CatBoostClassifier at 0x7588dc179b40>

In [45]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       973
           1       1.00      1.00      1.00      1007

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980



In [48]:
model.get_feature_importance(prettified=True)

Unnamed: 0,Feature Id,Importances
0,Text,100.0


## Stacking Classifier
Stacking combines several base models (here, ensemble models) under a meta-model, the final estimator (here, **Logistic Regression**), which learns from the base models' predictions so that the weaknesses of the individual models are covered.

StackingClassifier:
- **CatBoost**
    - Processes raw text.
    - Useful for categorising texts.
- **XGBoost**
    - Processes vectorised data.
    - Feeding features as tf-idf values.
    - Extremely performance oriented.
- **Pipelines**
    - Raw text: for CatBoost.
    - TF-IDF values: for XGBoost.
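
The overall shape of such a stack can be sketched with scikit-learn alone. To keep the example self-contained, `RandomForestClassifier` and `LogisticRegression` are swapped in as base learners in place of CatBoost and XGBoost, and the headlines are made up. Each base learner is wrapped in its own pipeline, so every model receives the input in the form it needs.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up headlines: 0 = Fake, 1 = Real
toy_texts = [
    "shocking hoax exposed in viral video", "you will not believe this scam",
    "celebrity hoax stuns fans", "viral scam tricks thousands",
    "senate passes official budget bill", "official study finds tax impact",
    "court issues official ruling", "study reviews immigration bill",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# each base learner carries its own TF-IDF step, so it receives raw strings
base_learners = [
    ("rf", make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=20, random_state=42))),
    ("lr", make_pipeline(TfidfVectorizer(), LogisticRegression())),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),   # meta-learner combining the base predictions
    cv=2,                                   # out-of-fold predictions train the meta-learner
)
stack.fit(toy_texts, toy_labels)

stack_pred = stack.predict(["official senate study", "viral hoax video"])
print(stack_pred)
```

The meta-learner is trained on out-of-fold predictions of the base learners, which keeps it from simply memorising their training-set outputs.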

In [33]:
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from catboost import CatBoostClassifier
from sklearn.base import BaseEstimator, TransformerMixin

In [34]:
# for X, use both raw and preprocessed text.
X = df[["Text", "preprocessed_text"]]
y = df["df_label"] # mapped label values

# split dataset for testing and training the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [35]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        # select the configured column(s) from a DataFrame
        if isinstance(X, pd.DataFrame):
            return X[self.columns]
        # sparse matrices and numpy arrays carry no column names: pass them through unchanged
        return X