## Instructions

For this project, you will build a text-classification model to distinguish real news from fake news.
Instructions:
1. Split the training.csv file into training and test sets.
2. Apply text preprocessing (lowercasing, tokenization, removing stop words and punctuation, etc.).
3. Create text vectors using Bag of Words or TF-IDF, and experiment with their parameters (n-grams, max_df, min_df, max_features, custom tokenizer, etc.).
4. Try different classifiers such as Logistic Regression, Random Forest, XGBoost, SVM, MultinomialNB, etc.
5. Perform hyperparameter tuning.
6. Compare all models and choose the best-performing one.
7. Use your best model to predict the labels in testing.csv, replacing the value 2 in the first column with your predicted class (0 or 1).

Deliverables:
- Your Jupyter Notebook
- A PPTX presentation
- The updated testing.csv file containing your best model’s predictions

**Deadline**: Saturday 17/01/2026. Each presentation will last 10 minutes.

In [24]:
# Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import re

### 1. Import dataset and inspect data

In [25]:
# Import data

data = pd.read_csv("dataset/training_data.csv", sep="\t", header=None, names=["label", "text"])

In [26]:
data.head(10)

Unnamed: 0,label,text
0,0,donald trump sends out embarrassing new year‚s...
1,0,drunk bragging trump staffer started russian c...
2,0,sheriff david clarke becomes an internet joke ...
3,0,trump is so obsessed he even has obama‚s name ...
4,0,pope francis just called out donald trump duri...
5,0,racist alabama cops brutalize black boy while ...
6,0,fresh off the golf course
7,0,trump said some insanely racist stuff inside t...
8,0,former cia director slams trump over un bullying
9,0,brand-new pro-trump ad features so much a** ki...


In [27]:
print(data["text"][0])

donald trump sends out embarrassing new year‚s eve message; this is disturbing


In [28]:
data.shape

(34152, 2)

In [29]:
# Check occurrence of label values
print(data["label"].value_counts())

# Check relative distribution to check balance of the dataset
print(data["label"].value_counts(normalize=True))

label
0    17572
1    16580
Name: count, dtype: int64
label
0    0.514523
1    0.485477
Name: proportion, dtype: float64


### 2. Divide in training and test set

In [30]:
from sklearn.model_selection import train_test_split

X = data["text"]
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state=42)
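
The `stratify=y` argument is what keeps the 51/49 class balance in both splits. A minimal sketch on made-up data (the `toy` frame and its headlines are illustrative assumptions, not the project dataset) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same 51/49 balance as the dataset (illustrative only)
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(100)],
    "label": [0] * 51 + [1] * 49,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["label"], test_size=0.2, stratify=toy["label"], random_state=42
)

# With stratify, train and test keep (almost) the same class proportions
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Without `stratify`, the proportions in a small test set can drift noticeably from those of the full data.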

### 3. Data Preprocessing

In [31]:
# 3.1. Lowercasing

X_train_clean = X_train.str.lower()
X_test_clean = X_test.str.lower()

In [32]:
# 3.2 Tokenization is not done manually here: the scikit-learn vectorizers used in section 4 tokenize their input themselves.
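
To see what that built-in tokenization does, the analyzer of a default `CountVectorizer` can be called directly. This is a small sketch with a made-up headline; it assumes the default settings (lowercasing on, default token pattern):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable the vectorizer uses internally:
# lowercasing followed by the default token pattern (words of 2+ characters)
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Trump's New-Year message!")
print(tokens)  # ['trump', 'new', 'year', 'message']
```

Note that the default pattern already splits on punctuation and drops one-character tokens such as the `s` left over from `Trump's`.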

In [33]:
# Remove punctuation first: if punctuation is attached to words (like "hello," or "world!"), stop-word removal will not match them.
# Clean punctuation first so words are isolated, then remove stop words.
# Hence the order of the instructions is switched here.

# 3.3. Remove punctuation

X_train_clean = X_train_clean.str.replace(r"[^\w\s]", "", regex=True) # Remove all punctuation
X_train_clean = X_train_clean.str.replace(r"\b[A-Za-z]\b", "", regex=True) # Remove all single letter words
X_train_clean = X_train_clean.str.replace(r"\b\d+\b", "", regex=True) # Remove all standalone numbers
X_train_clean = X_train_clean.str.replace(r"\s+", " ", regex=True) # Collapse repeated whitespace into a single space

print(X_train_clean.head(10))



6851     republicans punish georgia governor for refusi...
17313    father of soldier slain in niger defends presi...
22435    south dakotas governor vetoes loosening of con...
29488    turkeys erdogan says will take jerusalem resol...
6625     bill maher insults trumps supposed masculinity...
9772     is this dem senator switching parties calls ou...
22488    ryan says trump playing constructive role on h...
20387    epa chief wants scientists to debate climate o...
29500    macron rebuffs assad accusations that france s...
24209    factbox trump fills top jobs for his administr...
Name: text, dtype: object
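
The effect of the four regex steps is easiest to see on a single string. The sketch below wraps them in a hypothetical helper `clean_text` (not part of the notebook) and applies it to a made-up headline; the final `.strip()` is an addition that the cell above does not perform:

```python
import re

def clean_text(text: str) -> str:
    """Apply the same regex steps as the cell above to a single string."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\b[A-Za-z]\b", "", text)  # drop single-letter words
    text = re.sub(r"\b\d+\b", "", text)       # drop standalone numbers
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()                       # extra step: trim leading/trailing spaces

sample = "Trump's 3 rallies: a U.S. first!"
cleaned = clean_text(sample)
print(cleaned)  # trumps rallies us first
```

The example also shows a side effect worth keeping in mind: removing punctuation turns `U.S.` into `us` and `Trump's` into `trumps`.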


In [34]:
X_test_clean = X_test_clean.str.replace(r"[^\w\s]", "", regex=True)
X_test_clean = X_test_clean.str.replace(r"\b[A-Za-z]\b", "", regex=True)
X_test_clean = X_test_clean.str.replace(r"\b\d+\b", "", regex=True)
X_test_clean = X_test_clean.str.replace(r"\s+", " ", regex=True)

print(X_test_clean.head(15))

10145    msnbc propagandist the word trump is modern da...
26343    clinton says trump is most divisive candidate ...
22173     ivanka trump becomes unpaid white house employee
365      trump supporters at mother of all rallies mass...
13323    breaking fresno police release graphic video o...
22414    g20 ministers give mnuchin space to define tru...
11425    full interview president trump nails it on imm...
15385    mooch says black kids arent as welcome in muse...
11615    what democrat congresswoman calls violent riot...
569      trump lashes out at black ceo for resigning fr...
27496    republican party says complaints over delegate...
28827    president obamas final state of the union address
10487                                    obamas gunrunning
7424              this 19yearold from flint destroys trump
20284    us list of nafta goals not earthshattering can...
Name: text, dtype: object


In [35]:
# 3.4. Lemmatization

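# It generally makes more sense to lemmatize first, then remove stop words.
# Reasoning: lemmatization reduces words to their base form, and stop-word lists are usually in base form.
# Hence the lemmatization step is included here.

# Note: the NLTK tokenizer, POS tagger, and WordNet data must be downloaded once with nltk.download(...) before this cell runs.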

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wordnet_lemma  = WordNetLemmatizer()

# POS mapping for WordNet lemmatizer
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize_text(text):
    if not isinstance(text, str):
        return ""
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    lemmatized_words = [wordnet_lemma.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
    return " ".join(lemmatized_words)

X_train_clean = X_train_clean.apply(lemmatize_text)
X_test_clean = X_test_clean.apply(lemmatize_text)




In [36]:
print(X_test_clean.head(15))

10145    msnbc propagandist the word trump be modern da...
26343    clinton say trump be most divisive candidate i...
22173     ivanka trump becomes unpaid white house employee
365      trump supporter at mother of all rally massive...
13323    break fresno police release graphic video of f...
22414    g20 minister give mnuchin space to define trum...
11425    full interview president trump nail it on immi...
15385    mooch say black kid arent a welcome in museum ...
11615    what democrat congresswoman call violent riot ...
569      trump lash out at black ceo for resign from ma...
27496    republican party say complaint over delegate d...
28827    president obamas final state of the union address
10487                                    obamas gunrunning
7424              this 19yearold from flint destroys trump
20284    u list of nafta goal not earthshattering canad...
Name: text, dtype: object


In [37]:
# 3.5. Remove stop words

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

X_train_clean = X_train_clean.apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))
X_test_clean = X_test_clean.apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))


In [38]:
print(X_test_clean.head(15))

10145    msnbc propagandist word trump modern day swast...
26343        clinton say trump divisive candidate lifetime
22173     ivanka trump becomes unpaid white house employee
365      trump supporter mother rally massively outnumb...
13323    break fresno police release graphic video fata...
22414    g20 minister give mnuchin space define trump t...
11425      full interview president trump nail immigration
15385    mooch say black kid arent welcome museum white...
11615    democrat congresswoman call violent riot berke...
569      trump lash black ceo resign manufacture counci...
27496     republican party say complaint delegate distract
28827           president obamas final state union address
10487                                    obamas gunrunning
7424                        19yearold flint destroys trump
20284      u list nafta goal earthshattering canada source
Name: text, dtype: object


### 4. Vectorization

#### 4.1. Bag of Words (BoW)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the Bag of Words vectorizer
vectorizer_bow = CountVectorizer(
    ngram_range=(1, 2),
    min_df=3, # term must appear in ≥3 documents
    max_df=0.9, # appears in >90% of documents → drop
    max_features=10000,
)

X_train_bow = vectorizer_bow.fit_transform(X_train_clean) # .fit_transform learns the vocabulary from the training data and vectorizes it
X_test_bow = vectorizer_bow.transform(X_test_clean) # .transform vectorizes the test data using that learned vocabulary
# Important: call .fit_transform() on the training data and only .transform() on the test data

In [66]:
print(type(X_train_bow))
print(X_train_bow.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(27321, 6886)


In [60]:
print(type(X_test_bow))
print(X_test_bow.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(6831, 10000)


In [61]:
# Check proportion of test and train data

print(6831/(27321+6831))

0.20001756851721714


#### 4.2. TF-IDF

In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create Tfidf vectorizer:
vectorizer_tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=3, # term must appear in ≥3 documents
    max_df=0.9, # appears in >90% of documents → drop
    max_features=10000,
    lowercase=True,
    stop_words="english"
)

X_train_tfidf = vectorizer_tfidf.fit_transform(X_train_clean)
X_test_tfidf = vectorizer_tfidf.transform(X_test_clean)

In [64]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(27321, 10000)
(6831, 10000)
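
For reference, with its default settings `TfidfVectorizer` weights each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales every row to unit L2 length. The sketch below recomputes this by hand on three made-up documents and compares it with the library output; it assumes default parameters, not the ones configured above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["trump tax plan", "tax bill vote", "trump rally"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency per term

idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF (scikit-learn default)
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2 row norm

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Terms that occur in many documents (here `tax` and `trump`) receive a lower weight than terms that occur in only one.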


### 5. Classifier

#### 5.1. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomclassifier = RandomForestClassifier()
randomclassifier.fit(X_train_tfidf, y_train)

0,1,2
,n_estimators,100
,criterion,'gini'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [80]:
# Evaluate results of RandomForestClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

predictions = randomclassifier.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

[[3222  293]
 [ 284 3032]]
0.9155321329234373
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      3515
           1       0.91      0.91      0.91      3316

    accuracy                           0.92      6831
   macro avg       0.92      0.92      0.92      6831
weighted avg       0.92      0.92      0.92      6831
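
The accuracy and the per-class figures in the report can be derived from the confusion matrix alone. As a worked check, this sketch recomputes them with NumPy from the matrix printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[3222, 293],
               [284, 3032]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(precision.round(2), recall.round(2), f1.round(2))
```

The same arithmetic applies to every confusion matrix in the sections that follow.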



#### 5.2. XGBoost

In [84]:
from xgboost import XGBClassifier

xgb=XGBClassifier()
xgb.fit(X_train_bow, y_train)

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,
,device,
,early_stopping_rounds,
,enable_categorical,False


In [None]:
# Evaluate results for XGBoost

predictions = xgb.predict(X_test_bow)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

# “XGBoost is usually best” does not hold for text classification.
# On sparse Bag-of-Words or TF-IDF features, linear classifiers or Naive Bayes often outperform default boosted trees.
# Boosted trees shine on dense tabular features, numeric interactions, and mixed types.

[[2919  596]
 [ 209 3107]]
0.8821548821548821
              precision    recall  f1-score   support

           0       0.93      0.83      0.88      3515
           1       0.84      0.94      0.89      3316

    accuracy                           0.88      6831
   macro avg       0.89      0.88      0.88      6831
weighted avg       0.89      0.88      0.88      6831



#### 5.3. MultinomialNB

In [74]:
from sklearn.naive_bayes import MultinomialNB

naive=MultinomialNB()
naive.fit(X_train_bow, y_train)

0,1,2
,alpha,1.0
,force_alpha,True
,fit_prior,True
,class_prior,


In [None]:
# Evaluate results of MultinomialNB

predictions = naive.predict(X_test_bow)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)


[[3301  214]
 [ 261 3055]]
0.9304640608988435
              precision    recall  f1-score   support

           0       0.93      0.94      0.93      3515
           1       0.93      0.92      0.93      3316

    accuracy                           0.93      6831
   macro avg       0.93      0.93      0.93      6831
weighted avg       0.93      0.93      0.93      6831



#### 5.4. Logistic Regression

In [90]:
from sklearn.linear_model import LogisticRegression

log=LogisticRegression()
log.fit(X_train_tfidf, y_train)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'lbfgs'
,max_iter,100


In [None]:
# Evaluate results for Logistic Regression

predictions = log.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)


[[3250  265]
 [ 250 3066]]
0.9246084028692725
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      3515
           1       0.92      0.92      0.92      3316

    accuracy                           0.92      6831
   macro avg       0.92      0.92      0.92      6831
weighted avg       0.92      0.92      0.92      6831



#### 5.5. Support Vector Machine (SVM)

In [92]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train_tfidf, y_train)

0,1,2
,C,1.0
,kernel,'rbf'
,degree,3
,gamma,'scale'
,coef0,0.0
,shrinking,True
,probability,False
,tol,0.001
,cache_size,200
,class_weight,


In [94]:
# Evaluate results for SVM

predictions = svm.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

[[3294  221]
 [ 189 3127]]
0.9399795051968965
              precision    recall  f1-score   support

           0       0.95      0.94      0.94      3515
           1       0.93      0.94      0.94      3316

    accuracy                           0.94      6831
   macro avg       0.94      0.94      0.94      6831
weighted avg       0.94      0.94      0.94      6831



### 6. Hyperparameter Tuning

#### 6.1. Hypertuning Random Forest

In [None]:

randomclassifier = RandomForestClassifier(
    n_estimators=500,          
    max_depth=None,             
    min_samples_split=5,        
    min_samples_leaf=2,         
    max_features='sqrt',        
    bootstrap=True,             
    n_jobs=-1,                  
    random_state=42
)

randomclassifier.fit(X_train_tfidf, y_train)


predictions = randomclassifier.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

[[3158  357]
 [ 271 3045]]
0.9080661689357341
              precision    recall  f1-score   support

           0       0.92      0.90      0.91      3515
           1       0.90      0.92      0.91      3316

    accuracy                           0.91      6831
   macro avg       0.91      0.91      0.91      6831
weighted avg       0.91      0.91      0.91      6831



#### 6.2. Hypertuning XGBoost

In [None]:
xgb = XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss'
)

xgb.fit(X_train_tfidf, y_train)


predictions = xgb.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

[[3084  431]
 [ 198 3118]]
0.9079197774849949
              precision    recall  f1-score   support

           0       0.94      0.88      0.91      3515
           1       0.88      0.94      0.91      3316

    accuracy                           0.91      6831
   macro avg       0.91      0.91      0.91      6831
weighted avg       0.91      0.91      0.91      6831



#### 6.3. Hypertuning MultinomialNB

In [None]:
naive = MultinomialNB(alpha=0.1, fit_prior=True) # alpha (additive smoothing) is the main hyperparameter of MultinomialNB
naive.fit(X_train_tfidf, y_train)


predictions = naive.predict(X_test_tfidf)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)


[[3314  201]
 [ 269 3047]]
0.9311960181525399
              precision    recall  f1-score   support

           0       0.92      0.94      0.93      3515
           1       0.94      0.92      0.93      3316

    accuracy                           0.93      6831
   macro avg       0.93      0.93      0.93      6831
weighted avg       0.93      0.93      0.93      6831



#### 6.4. Hypertuning Logistic Regression

In [108]:
log = LogisticRegression(
    C=1.0,
    penalty='l2',
    solver='saga',
    max_iter=1000,
    class_weight=None,
    n_jobs=-1,
    random_state=42
)
log.fit(X_train_bow, y_train)

predictions = log.predict(X_test_bow)

matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)


[[3254  261]
 [ 192 3124]]
0.9336846728151076
              precision    recall  f1-score   support

           0       0.94      0.93      0.93      3515
           1       0.92      0.94      0.93      3316

    accuracy                           0.93      6831
   macro avg       0.93      0.93      0.93      6831
weighted avg       0.93      0.93      0.93      6831
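
The settings in this section were chosen by hand. An alternative is a cross-validated grid search over vectorizer and classifier parameters together. The following is only a sketch on a tiny made-up corpus; the grid values are placeholders rather than tuned recommendations, and in the notebook the inputs would be `X_train_clean` and `y_train`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (made up); 1 = real, 0 = fake
texts = [
    "senate pass tax bill", "court block travel ban", "minister visit trade talk",
    "government announce budget plan", "parliament vote new law", "president sign trade deal",
    "shocking video destroy trump", "you will not believe this", "watch host humiliate senator",
    "hilarious meme go viral", "wow liberal lose mind", "unbelievable rant caught on camera",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: step-name prefixes route each value to the right pipeline step
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the validation folds never influence the vocabulary.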



### 7. Choose best performing model

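Among the tuned models in section 6, Logistic Regression on Bag-of-Words features performs best (accuracy 0.934) and is the model carried forward; its results are repeated below. Note that the default SVM on TF-IDF features in section 5.5 scored slightly higher (accuracy 0.940).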

In [109]:
matrix = confusion_matrix(y_test, predictions)
print(matrix)
score=accuracy_score(y_test, predictions)
print(score)
report=classification_report(y_test, predictions)
print(report)

[[3254  261]
 [ 192 3124]]
0.9336846728151076
              precision    recall  f1-score   support

           0       0.94      0.93      0.93      3515
           1       0.92      0.94      0.93      3316

    accuracy                           0.93      6831
   macro avg       0.93      0.93      0.93      6831
weighted avg       0.93      0.93      0.93      6831
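
For the comparison step it can help to put all runs side by side. This sketch collects the accuracies printed in sections 5 and 6 (rounded to four decimals) into a hypothetical `results` table and sorts it:

```python
import pandas as pd

# Accuracies copied from the outputs in sections 5 and 6 (rounded to 4 decimals)
results = pd.DataFrame(
    [
        ("Random Forest (default)", "TF-IDF", 0.9155),
        ("XGBoost (default)", "BoW", 0.8822),
        ("MultinomialNB (default)", "BoW", 0.9305),
        ("Logistic Regression (default)", "TF-IDF", 0.9246),
        ("SVM, RBF kernel (default)", "TF-IDF", 0.9400),
        ("Random Forest (tuned)", "TF-IDF", 0.9081),
        ("XGBoost (tuned)", "TF-IDF", 0.9079),
        ("MultinomialNB (alpha=0.1)", "TF-IDF", 0.9312),
        ("Logistic Regression (saga)", "BoW", 0.9337),
    ],
    columns=["model", "features", "accuracy"],
)

ranked = results.sort_values("accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

On these numbers the default SVM on TF-IDF ranks first and the Logistic Regression on Bag of Words second, with well under one percentage point between them.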



### 8. Apply the model to the testing.csv data set

To apply the model to a new dataset, the initial preprocessing steps must be repeated for the new testing.csv dataset.
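
One way to avoid repeating these steps by hand is to bundle cleaning, vectorization, and classification into a single scikit-learn `Pipeline`, so that raw text can be passed straight to `predict`. The sketch below is an assumption about how this could look, not the notebook's method; `clean_text` is a hypothetical, simplified preprocessor (no lemmatization or stop-word removal) and the training headlines are made up:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean_text(text):
    """Hypothetical preprocessor: lowercase, then strip punctuation and extra spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Passing the function as `preprocessor` makes the vectorizer clean every document,
# both during fit and during predict
model = Pipeline([
    ("bow", CountVectorizer(preprocessor=clean_text)),
    ("clf", LogisticRegression(max_iter=1000)),
])

train_texts = ["Senate passes tax bill", "WOW! You won't believe this video",
               "Court blocks travel ban", "Hilarious: host DESTROYS senator"]
train_labels = [1, 0, 1, 0]
model.fit(train_texts, train_labels)

# New raw headlines go straight into predict(); no manual re-cleaning needed
new_texts = ["Senate blocks bill", "WOW! Hilarious video"]
new_preds = model.predict(new_texts)
print(new_preds)
```

A fitted pipeline also guarantees that the classifier always sees features from the vectorizer it was trained with.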

#### 8.1. Import new data set

In [131]:

testing = pd.read_csv("dataset/testing_data.csv", sep="\t", header=None, names=["label", "text"])

print(testing.head(10))

  label                                               text
0     2  copycat muslim terrorist arrested with assault...
1     2  wow! chicago protester caught on camera admits...
2     2   germany's fdp look to fill schaeuble's big shoes
3     2  mi school sends welcome back packet warning ki...
4     2  u.n. seeks 'massive' aid boost amid rohingya '...
5     2  did oprah just leave ‚nasty‚ hillary wishing s...
6     2  france's macron says his job not 'cool' cites ...
7     2  flashback: chilling ‚60 minutes‚ interview wit...
8     2  spanish foreign ministry says to expel north k...
9     2  trump says cuba 'did some bad things' aimed at...


In [132]:
testing.shape

(9984, 2)

#### 8.2. Predict labels

In [None]:
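# Preprocess exactly as in section 3, then vectorize with the vectorizer that `log` was fitted on.
# `log` was last trained on Bag-of-Words features (section 6.4), so vectorizer_bow is used here.
testing_clean = testing["text"].str.lower()
for pattern, repl in [(r"[^\w\s]", ""), (r"\b[A-Za-z]\b", ""), (r"\b\d+\b", ""), (r"\s+", " ")]:
    testing_clean = testing_clean.str.replace(pattern, repl, regex=True)
testing_clean = testing_clean.apply(lemmatize_text)
testing_clean = testing_clean.apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))
testing_vec_bow = vectorizer_bow.transform(testing_clean)

In [137]:
# Predict labels
test_preds = log.predict(testing_vec_bow)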

In [138]:
# Assign predicted labels
testing["label"] = test_preds

In [139]:
testing.head(10)

Unnamed: 0,label,text
0,1,copycat muslim terrorist arrested with assault...
1,1,wow! chicago protester caught on camera admits...
2,1,germany's fdp look to fill schaeuble's big shoes
3,0,mi school sends welcome back packet warning ki...
4,1,u.n. seeks 'massive' aid boost amid rohingya '...
5,0,did oprah just leave ‚nasty‚ hillary wishing s...
6,1,france's macron says his job not 'cool' cites ...
7,1,flashback: chilling ‚60 minutes‚ interview wit...
8,0,spanish foreign ministry says to expel north k...
9,0,trump says cuba 'did some bad things' aimed at...


In [144]:
print(testing.iloc[5])

label                                                    0
text     did oprah just leave ‚nasty‚ hillary wishing s...
Name: 5, dtype: object


#### 8.3. Save data set in new file

In [145]:
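testing.to_csv("dataset/testing_data_update.csv", sep="\t", header=False, index=False) # same layout as the input file: tab-separated, no header row, no index column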
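
Since the input file is tab-separated and has no header row, it is worth checking that the saved file can be read back the same way it was loaded. A minimal round-trip sketch, using a made-up two-row frame and a temporary directory instead of the real `dataset/` folder:

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the `testing` DataFrame after prediction (illustrative rows)
toy = pd.DataFrame({
    "label": [1, 0],
    "text": ["germany's fdp look to fill big shoes", "wow! chicago protester caught on camera"],
})

path = os.path.join(tempfile.mkdtemp(), "testing_data_update.csv")

# Same layout as the input file: tab-separated, no header row, no index column
toy.to_csv(path, sep="\t", header=False, index=False)

# Read it back with the same arguments used for the original file
reloaded = pd.read_csv(path, sep="\t", header=None, names=["label", "text"])
print(reloaded)
```

If the index or header were written as well, reading the file with `header=None` would produce an extra column and a spurious first row.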