# 📰 Categorizing Fake News Using NLP

*(AI Engineer Course - Applied NLP Project)*

This notebook demonstrates an **end-to-end Natural Language Processing (NLP) pipeline**
for categorizing news articles as **Fake** or **Factual**.

The project combines:
- Linguistic analysis
- Text preprocessing
- Statistical analysis
- Topic modeling
- Sentiment analysis
- Supervised machine learning

The focus is on understanding **how language patterns differ** between fake and factual news
and how NLP techniques can be combined to build effective text classification systems.


In [None]:
import seaborn as sns
import spacy
from spacy import displacy
from spacy import tokenizer
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import gensim
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import LsiModel, TfidfModel
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import pandas as pd

## 1️⃣ Data Loading and Exploration

We begin by loading the dataset and exploring its structure.

This step helps us:
- Understand available columns
- Inspect data types and missing values
- Preview sample articles
- Examine the distribution of fake vs factual news


In [None]:
#setting plot options
plt.rcParams['figure.figsize'] = (12,8)
default_plot_colour = '#00bfbf'

In [None]:
data = pd.read_csv('../../data/fake_news_data.csv')

In [None]:
data.head(10)

In [None]:
data.info()

## 2️⃣ Distribution of Fake vs Factual News

Before applying NLP techniques, it is important to understand
the balance between classes in the dataset.

We visualize the number of fake and factual articles to:
- Detect class imbalance
- Set expectations for model performance
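
A numeric check can complement the bar chart. The sketch below is a minimal illustration on made-up labels; the `labels` list and the `class_proportions` helper are hypothetical and not part of the dataset or the notebook:

```python
from collections import Counter

def class_proportions(labels):
    """Return each class's share of the total, as a dict of floats."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels, for illustration only
labels = ["Fake News", "Factual News", "Fake News", "Factual News", "Fake News"]
proportions = class_proportions(labels)
print(proportions)
```

On the real DataFrame, `data['fake_or_factual'].value_counts(normalize=True)` should give the same kind of summary.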


In [None]:
data['fake_or_factual'].value_counts().plot(kind='bar', color=default_plot_colour)
plt.title('Distribution of Fake vs Factual News Articles')

## 3️⃣ Linguistic Analysis Using spaCy

spaCy is used to extract linguistic features from text, including:
- Part-of-speech (POS) tags
- Named entity labels

Articles are processed separately for fake and factual news
to compare language usage patterns.


In [None]:
nlp =  spacy.load('en_core_web_sm')

In [None]:
fake_news = data[data['fake_or_factual']=='Fake News']
factual_news = data[data['fake_or_factual']=='Factual News']

In [None]:
fake_spacydocs = list(nlp.pipe(fake_news['text']))
factual_spacydocs = list(nlp.pipe(factual_news['text']))

## 4️⃣ Part-of-Speech (POS) Tag Analysis

Part-of-speech tagging assigns grammatical roles such as:
- Nouns
- Verbs
- Adjectives

By comparing POS tag frequencies between fake and factual news,
we can observe stylistic and structural differences in writing.
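
Raw counts depend on how many tokens each class contains, so relative shares are often easier to compare. The sketch below assumes a list of `(token, pos_tag)` pairs per class; the example pairs and the `pos_shares` helper are hypothetical stand-ins for spaCy output:

```python
from collections import Counter

def pos_shares(tagged_tokens):
    """Map each POS tag to its share of all tokens in the input."""
    counts = Counter(pos for _, pos in tagged_tokens)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}

# Hypothetical (token, pos_tag) pairs, for illustration only
fake_tokens = [("shocking", "ADJ"), ("claim", "NOUN"), ("spreads", "VERB"), ("fast", "ADV")]
fact_tokens = [("senate", "NOUN"), ("passes", "VERB"), ("bill", "NOUN"), ("today", "NOUN")]

fake_shares = pos_shares(fake_tokens)
fact_shares = pos_shares(fact_tokens)

# Print the share of each tag side by side for the two classes
for pos in sorted(set(fake_shares) | set(fact_shares)):
    print(pos, fake_shares.get(pos, 0.0), fact_shares.get(pos, 0.0))
```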


In [None]:
def extract_token_tags(doc):
    return [(i.text, i.ent_type_, i.pos_) for i in doc]

In [None]:
fake_tagsdf = []
columns = ['token', "ner_tag", "pos_tag"]

In [None]:
rows = [{'token': token.text, 'ner_tag': token.ent_type_, 'pos_tag': token.pos_} for doc in fake_spacydocs for token in doc]

In [None]:
fake_tagsdf = pd.DataFrame(rows)

In [None]:
fact_tagsdf = []
rows = [{'token': token.text, 'ner_tag': token.ent_type_, 'pos_tag': token.pos_} for doc in factual_spacydocs for token in doc]

In [None]:
fact_tagsdf = pd.DataFrame(rows)

In [None]:
pos_counts_fake = fake_tagsdf.value_counts(['token','pos_tag']).reset_index(name='counts')
pos_counts_fake.head(10)

In [None]:
pos_counts_fact = fact_tagsdf.value_counts(['token','pos_tag']).reset_index(name='counts')
pos_counts_fact.head(10)

In [None]:
# Number of distinct tokens per POS tag (each row is a unique token/tag pair)
pos_counts_fake['pos_tag'].value_counts().head(10)


In [None]:
# Number of distinct tokens per POS tag (each row is a unique token/tag pair)
pos_counts_fact['pos_tag'].value_counts().head(10)

In [None]:
pos_counts_fake[pos_counts_fake.pos_tag == 'NOUN'][:15]

In [None]:
pos_counts_fact[pos_counts_fact.pos_tag == 'NOUN'][:15]

## 5️⃣ Named Entity Recognition (NER)

Named Entity Recognition identifies references to:
- People
- Organizations
- Locations
- Dates and quantities

In this section:
- Named entities are extracted from both classes
- The most common entities are compared
- Entity frequencies are visualized for interpretation


In [None]:
#pull out all named entities before starting preprocessing
top_entities_fake = fake_tagsdf[fake_tagsdf['ner_tag'] != ''].value_counts(['token','ner_tag']).reset_index(name='counts')

In [None]:
top_entities_fact = fact_tagsdf[fact_tagsdf['ner_tag'] != ''].value_counts(['token','ner_tag']).reset_index(name='counts')

In [None]:
ner_palette = {
    'ORG': sns.color_palette("Set2").as_hex()[0],
    'GPE': sns.color_palette("Set2").as_hex()[1],
    'NORP': sns.color_palette("Set2").as_hex()[2],
    'PERSON': sns.color_palette("Set2").as_hex()[3],
    'DATE': sns.color_palette("Set2").as_hex()[4],
    'CARDINAL': sns.color_palette("Set2").as_hex()[5],
    'PERCENT': sns.color_palette("Set2").as_hex()[6]
}

In [None]:
sns.barplot(
    x = 'counts',
    y = 'token',
    hue='ner_tag',
    palette=ner_palette,
    data = top_entities_fake[:10],
    orient= 'h',
    dodge=False
).set_title('Top Common Named Entities in Fake News Articles')

In [None]:
sns.barplot(
    x = 'counts',
    y = 'token',
    hue='ner_tag',
    palette=ner_palette,
    data = top_entities_fact[:10],
    orient= 'h',
    dodge=False
).set_title('Top Common Named Entities in Factual News Articles')

## 6️⃣ Text Preprocessing

Raw text must be cleaned and standardized before modeling.

The preprocessing steps applied include:
- Removing metadata and prefixes
- Lowercasing text
- Removing punctuation
- Stopword removal
- Tokenization
- Lemmatization

These steps reduce noise and prepare the text for analysis and modeling.
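
As a rough illustration, most of these steps can be applied to a single string with the standard library alone. The stopword set and the sample sentence below are hypothetical (the notebook uses NLTK's English stopword list), and lemmatization is left out because it needs the WordNet data:

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead
STOPWORDS = {"the", "a", "an", "of", "on", "in", "to", "and", "is"}

def clean_text(text):
    """Apply prefix removal, lowercasing, punctuation and stopword removal."""
    text = re.sub(r"^[^-]*-\s", "", text)  # drop everything up to the first "- "
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)    # strip punctuation
    return [word for word in text.split() if word not in STOPWORDS]

sample = "WASHINGTON (Reuters) - The Senate voted on the bill, officials said."
print(clean_text(sample))
# ['senate', 'voted', 'bill', 'officials', 'said']
```

Note that the prefix pattern removes any leading text that ends in a hyphen followed by whitespace, so an article without a dateline can lose the start of its first sentence.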


In [None]:
data.head()

In [None]:
data['text_clean'] =data.apply(lambda x: re.sub(r"^[^-]*-\s", '', x['text']), axis=1)

In [None]:
data['text_clean'] = data['text_clean'].str.lower()

In [None]:
data['text_clean'] = data.apply(lambda x: re.sub(r"([^\w\s])", '', x['text_clean']), axis=1)

In [None]:

en_stopwords = set(stopwords.words('english'))  # a set gives fast membership tests
print(sorted(en_stopwords))

In [None]:
data['text_clean'] = data['text_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in en_stopwords]))

In [None]:
data['text_clean'] = data.apply(lambda x: word_tokenize(x['text_clean']), axis=1)

In [None]:
lemmatizer = WordNetLemmatizer()

data['text_clean'] = data['text_clean'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])

In [None]:
data.head()

## 7️⃣ N-gram Analysis

N-grams represent sequences of words:
- Unigrams (single words)
- Bigrams (pairs of words)

This analysis helps identify:
- Frequently used terms
- Common word combinations
- Dominant language patterns in the dataset
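
The counting itself can be sketched without NLTK. The example below is a minimal illustration on a hypothetical token list; `ngram_counts` is not a function from the notebook:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a flat list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical tokens, for illustration only
tokens = ["white", "house", "said", "white", "house", "official"]
bigram_counts = ngram_counts(tokens, 2)
print(bigram_counts.most_common(1))
# [(('white', 'house'), 2)]
```

When all articles are flattened into one token list, some n-grams span the boundary between two articles; counting per article and summing the counters avoids that.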


In [None]:
# Most common n-grams: flatten the per-article token lists into one list
tokens_clean = [token for tokens in data['text_clean'] for token in tokens]

In [None]:
unigrams = (pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts()).reset_index(name='count')
print(unigrams)

In [None]:
unigrams['tokens'] =  unigrams['index'].apply(lambda x: x[0])

sns.barplot(x='count',
            y='tokens',
            data=unigrams[:15],
            orient='h',
            palette= [default_plot_colour],
            hue='tokens', legend=False).set_title('Most Common Unigrams in News Articles')

In [None]:
bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts()).reset_index(name='count')


In [None]:
bigrams['tokens'] = bigrams['index'].apply(lambda x: ' '.join(x))  # keep both words of each bigram
print(bigrams)

## 8️⃣ Sentiment Analysis

Sentiment analysis measures the emotional tone of text.

Using the VADER sentiment analyzer:
- Each article receives a compound sentiment score
- Scores are categorized as negative, neutral, or positive
- Sentiment distributions are compared across fake and factual news
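
The categorization step amounts to thresholding the compound score, which lies in [-1, 1]. The sketch below is a minimal illustration; the `label_sentiment` helper and its cut-offs of -0.1 and 0.1 are assumptions chosen for the example (the VADER README suggests ±0.05 as typical thresholds):

```python
def label_sentiment(compound, lower=-0.1, upper=0.1):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound < lower:
        return "negative"
    if compound > upper:
        return "positive"
    return "neutral"

# Hypothetical compound scores, for illustration only
scores = [-0.8, -0.05, 0.0, 0.3]
labels = [label_sentiment(s) for s in scores]
print(labels)
# ['negative', 'neutral', 'neutral', 'positive']
```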


In [None]:
vader_sentiment = SentimentIntensityAnalyzer()
data['vader_sentiment'] = data['text'].apply(lambda x: vader_sentiment.polarity_scores(x)['compound'])

In [None]:
data.head()

In [None]:
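bins = [-1, -0.1, 0.1, 1]  # symmetric cut-offs around zero
names = ['negative', 'neutral', 'positive']
# include_lowest=True keeps a compound score of exactly -1 in the 'negative' bin
data['vader_sentiment_label'] = pd.cut(data['vader_sentiment'], bins, labels=names, include_lowest=True)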
data.head()

In [None]:
data['vader_sentiment_label'].value_counts().plot(kind='bar', color=default_plot_colour)
plt.title('Distribution of VADER Sentiment Labels in News Articles')

In [None]:
sns.countplot(
    x ='fake_or_factual',
    data=data, 
    palette= sns.color_palette("hls"), 
    hue= 'vader_sentiment_label'
    ).set_title('VADER Sentiment Labels by Fake vs Factual News Articles')

## 9️⃣ Topic Modeling

Topic modeling uncovers hidden thematic structures in text data.

Two approaches are used:
- Latent Dirichlet Allocation (LDA)
- Latent Semantic Indexing (LSI)

Multiple topic counts are evaluated using coherence scores
to identify meaningful topic representations.
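
One simple way to turn a list of coherence scores into a choice of topic count is to take the highest-scoring one. The sketch below assumes that rule; the `best_topic_count` helper is hypothetical and the scores are placeholders, not results from this dataset:

```python
def best_topic_count(topic_counts, coherence_values):
    """Return the topic count with the highest coherence score."""
    if len(topic_counts) != len(coherence_values):
        raise ValueError("topic_counts and coherence_values must have equal length")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return topic_counts[best_index]

# Placeholder scores for illustration only; these are not results from the dataset
topic_counts = [2, 3, 4, 5]
coherence_values = [0.31, 0.36, 0.42, 0.39]
print(best_topic_count(topic_counts, coherence_values))  # 4
```

The same rule applies to `u_mass`, whose scores are typically negative: values closer to zero indicate better coherence. In practice the coherence curve is usually inspected as well, since a slightly lower score with fewer topics can be easier to interpret.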


In [None]:
fake_news_text = data[data['fake_or_factual']=='Fake News']['text_clean'].reset_index(drop=True)

In [None]:
dictionary_fake = corpora.Dictionary(fake_news_text)

In [None]:
doc_term_fake =  [dictionary_fake.doc2bow(text) for text in fake_news_text]

In [None]:
coherence_values = []
model_list = []
min_topics = 2
max_topics = 11

for num_topics_i in range (min_topics, max_topics+1):
    model =  gensim.models.LdaModel(doc_term_fake, num_topics=num_topics_i, id2word=dictionary_fake)
    model_list.append(model)
    coherence_model = CoherenceModel(model=model, texts=fake_news_text, dictionary=dictionary_fake, coherence='c_v')
    coherence_values.append(coherence_model.get_coherence())

In [None]:
plt.plot(range(min_topics, max_topics+1), coherence_values)
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.legend(['Coherence Values'], loc='best')  # labels must be a sequence, not a bare string
plt.show()

In [None]:
num_topics_lda = 7
lda_model = gensim.models.LdaModel(doc_term_fake, num_topics=num_topics_lda, id2word=dictionary_fake)
lda_model.print_topics(num_topics=num_topics_lda, num_words=10)

In [None]:
def tdidf_corpus(doc_term_matrix):
    # Fit a TF-IDF model on the bag-of-words corpus, then reweight that corpus
    tfidf = TfidfModel(corpus=doc_term_matrix, normalize=True)
    corpus_tfidf = tfidf[doc_term_matrix]
    return corpus_tfidf

In [None]:
def get_coherence_scores_lsi(corpus,dictionary,min_topics=2,max_topics=11):
    corpus = list(corpus)
    coherence_values = []
    topic_range = range(min_topics, max_topics + 1)

    for k in topic_range:
        print(f"Training LSI model with {k} topics")
        lsi_model = LsiModel(corpus=corpus,num_topics=k,id2word=dictionary)
        coherence_model = CoherenceModel(model=lsi_model,corpus=corpus,dictionary=dictionary,coherence='u_mass')
        coherence = coherence_model.get_coherence()
        coherence_values.append(coherence)
    # Plot
    plt.figure(figsize=(12, 6))
    plt.plot(topic_range, coherence_values, marker='o')
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence Score (u_mass)")
    plt.title("LSI Topic Coherence (u_mass)")
    plt.grid(True)
    plt.show()
    return coherence_values

In [None]:
corpus_tdidf_fake = tdidf_corpus(doc_term_fake)

In [None]:
corpus_tfidf_fake = tdidf_corpus(doc_term_fake)
coherence_scores = get_coherence_scores_lsi(
    corpus=corpus_tfidf_fake,
    dictionary=dictionary_fake,
    min_topics=2,
    max_topics=11
)


In [None]:
lsa_model =  LsiModel(corpus_tdidf_fake, num_topics=3, id2word=dictionary_fake)

In [None]:
lsa_model.print_topics()

## 🔟 Text Classification

In the final stage, supervised machine learning models are trained
to classify articles as Fake or Factual.

Steps include:
- Vectorizing text using Bag-of-Words
- Splitting data into training and test sets
- Training classification models
- Evaluating performance using accuracy and classification reports

Models used:
- Logistic Regression
- Linear Support Vector Machine (SGDClassifier)
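
If the vectorizer is fitted on the full dataset before the split, vocabulary from the test articles leaks into the features. One common alternative, sketched here under the assumption that scikit-learn's `make_pipeline` is acceptable for this workflow, is to split first and let a pipeline fit the vectorizer on the training split only. The toy corpus is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "shocking secret exposed", "you will not believe this", "miracle cure hidden",
    "secret plot revealed", "senate passes budget bill", "court issues ruling today",
    "ministry publishes report", "officials confirm budget vote",
]
labels = ["Fake News"] * 4 + ["Factual News"] * 4

# Stratify so both classes appear in the training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The vectorizer learns its vocabulary from the training split only
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Swapping `LogisticRegression` for `SGDClassifier` in the pipeline gives the linear SVM variant.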


In [None]:
data.head()

In [None]:
X = [' '.join(map(str, tokens)) for tokens in data['text_clean']]  # join each token list back into one string for vectorization

In [None]:
Y= data['fake_or_factual']

In [None]:
countvec = CountVectorizer()

In [None]:
countvec_fit = countvec.fit_transform(X)

In [None]:
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(bag_of_words, Y, test_size=0.3, random_state=0)  # fixed seed for a reproducible split

In [None]:
lr = LogisticRegression(random_state=0).fit(X_train, Y_train)

In [None]:
y_pred_lr = lr.predict(X_test)

In [None]:
accuracy_score(Y_test, y_pred_lr)  # signature is (y_true, y_pred)

In [None]:
print(classification_report(Y_test, y_pred_lr))

In [None]:
svm = SGDClassifier(random_state=0).fit(X_train, Y_train)

In [None]:
y_pred_svm = svm.predict(X_test)

In [None]:
accuracy_score(Y_test, y_pred_svm)  # signature is (y_true, y_pred)

In [None]:
print(classification_report(Y_test, y_pred_svm))

## Key Takeaways

- Fake and factual news exhibit distinct linguistic patterns
- Preprocessing (lowercasing, punctuation and stopword removal, lemmatization) reduces noise before analysis and modeling
- Topic modeling reveals hidden thematic differences
- Sentiment alone is not sufficient but adds useful context
- Classical ML models (logistic regression, linear SVM) can be trained directly on bag-of-words features

This project demonstrates how multiple NLP techniques can be
integrated into a complete text classification workflow.
