# **Fake News Detection Using Logistic Regression**
### **Summary**

Fake news, designed to mislead and misinform, has become a pressing issue, particularly because it spreads rapidly on social media platforms.

Addressing this challenge is crucial to limiting its negative impact on society. Effective strategies include relying on trustworthy news sources, fact-checking, and fostering awareness.

This project focuses on identifying fake news using machine learning techniques, starting with Logistic Regression. This algorithm, known for its efficiency in binary classification tasks, was applied to classify news articles as fake or genuine.

#### **Key Aspects**

*   The dataset was preprocessed to extract meaningful features for training and testing.
*   Logistic Regression was selected for its balance between simplicity and robust performance.
*   Model evaluation was conducted using metrics like accuracy, precision, recall, and F1-score.



### Importing all the necessary libraries

In [6]:
import numpy as np
import pandas as pd

# for cleaning text
import re
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# for vectorizing text and splitting the data into train and test
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# for logistic regression
from sklearn.linear_model import LogisticRegression

# for getting accuracy and report
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score

### Data Collection

* Collect the data from the source (more info below).
* Merge it into a single dataset.
* Remove the columns that are not required.


### Data Source

I shall be using the dataset available on Kaggle, provided by Emine Bozkus.

Source: https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets

In [7]:
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")

In [8]:
print(df_true['title'][0])
print(df_true['text'][0])
print("==========================")
print(df_fake['title'][0])
print(df_fake['text'][0])

As U.S. budget fight looms, Republicans flip their fiscal script
WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-

In [9]:
# The source files carry no label telling real news from fake news,
# so add one: 0 represents fake news and 1 represents real news

df_fake['status'] = 0
df_true['status'] = 1

df = pd.concat([df_fake, df_true])
df.drop(['subject', 'date'], inplace=True, axis=1)

In [10]:
df.head()

|   | title | text | status |
|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | 0 |


In [11]:
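# The fake rows all come before the real rows, so shuffle the data.
# np.random.permutation visits every row exactly once; np.random.randint
# would sample with replacement, duplicating some rows and dropping others.
random_index = np.random.permutation(len(df))
df = df.iloc[random_index].reset_index(drop=True)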
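
A shuffle should keep every row exactly once. The sketch below contrasts drawing random indices with replacement against a true permutation. It uses a made-up 1,000-row frame; the `id` column and the seed are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the news data; "id" is a hypothetical column.
toy = pd.DataFrame({"id": range(1000)})
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Random indices drawn WITH replacement: some rows repeat, others vanish.
sampled = toy.iloc[rng.integers(0, len(toy), len(toy))].reset_index(drop=True)

# A permutation visits each row exactly once, so nothing is lost.
shuffled = toy.iloc[rng.permutation(len(toy))].reset_index(drop=True)

# pandas offers the same shuffle in a single call.
shuffled_alt = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print("unique rows after sampling with replacement:", sampled["id"].nunique())
print("unique rows after permutation:", shuffled["id"].nunique())
```

Duplicated rows matter because the same article can land in both the training and the test split, which can make the test score look better than it really is.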

In [12]:
df.head()

|   | title | text | status |
|---|---|---|---|
| 0 | Trump's body language during debate raises soc... | NEW YORK (Reuters) - U.S. Republican presiden... | 1 |
| 1 | Possible Trump VP pick says he supports aborti... | WASHINGTON (Reuters) - Retired Lt. Gen. Michae... | 1 |
| 2 | Here’s How Trump Is Going To F*CK Us Into WWI... | Has anybody ever wondered whether all these ac... | 0 |
| 3 | MEDIA SILENT: President Trump Makes Americans ... | It s Thursday, July 20th. As of today, Donald ... | 0 |
| 4 | Trump Slapped With Lawsuit For Refusing To Re... | Donald Trump is being sued again. Three orga... | 0 |


### Data Cleaning

From both Title and Content-
* Remove all special characters and numbers ((),.><$@/?"''"\][\|!*-{}^%#+) and convert the text to lowercase.
* Split the text into tokens.
* Remove all the stop words.
* Lemmatize the words.
* Join the tokens back together for the next stage.

From the Content-
* Remove URLs, the Reuters tag, and the location name that precedes it.

I will be using the nltk library to achieve this step.

In [13]:
#Download all necessary NLTK data files
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')

def get_wordnet_pos(tag):
    # Maps POS tag to the first character for WordNetLemmatizer.
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Build these once; recreating them for every article is needlessly slow.
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Perform POS tagging
    pos_tags = nltk.pos_tag(tokens)

    # Lemmatize each token using its POS tag
    lemmatized_tokens = [
        LEMMATIZER.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tags
    ]

    # Rejoin tokens into a single string
    return ' '.join(lemmatized_tokens)


def preprocess_title(text):
    # Remove special characters and numbers from the title
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip()

    # Convert to lowercase
    text = text.lower()

    return preprocess_text(text)

def preprocess_content(text):
    # Remove URLs from the text
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove the "(Reuters)" tag, along with the location name that precedes it
    # (for example "WASHINGTON (Reuters)")
    text = re.sub(r'\b[A-Za-z\s]+\s*\(Reuters\)|\(Reuters\)', '', text)

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip()

    # Convert to lowercase
    text = text.lower()

    return preprocess_text(text)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
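
To see what the regex steps of `preprocess_content` do before any NLTK processing, here is a minimal sketch that applies the same three patterns to one sentence. The helper name `strip_noise` and the sample sentence are made up for illustration.

```python
import re

def strip_noise(text):
    """Apply only the regex steps of the content cleaner (no NLTK involved)."""
    # Drop URLs.
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Drop "(Reuters)" together with any location name written before it.
    text = re.sub(r'\b[A-Za-z\s]+\s*\(Reuters\)|\(Reuters\)', '', text)
    # Keep letters and whitespace only, then lowercase.
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
    return text.lower()

# Made-up sentence in the style of the Reuters articles.
sample = "WASHINGTON (Reuters) - The senator urged restraint in 2018. See http://example.com/story"
cleaned = strip_noise(sample)

print(cleaned.split())
# ['the', 'senator', 'urged', 'restraint', 'in', 'see']
```

Because `[A-Za-z\s]+` is greedy, it removes the whole run of letters and spaces directly before the tag, which is the intended effect at the start of an article.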


In [14]:
# Apply the functions above to clean the text (takes about 10 minutes to run)
df['title'] = df['title'].apply(preprocess_title)
df['text'] = df['text'].apply(preprocess_content)

#### Combine title and text into a single column

In [15]:
df['combined_text'] = df['title'] + " " + df['text']

In [16]:
df.head()

|   | title | text | status | combined_text |
|---|---|---|---|---|
| 0 | trump body language debate raise social medium... | u republican presidential nominee donald trump... | 1 | trump body language debate raise social medium... |
| 1 | possible trump vp pick say support abortion right | retire lt gen michael flynn consideration repu... | 1 | possible trump vp pick say support abortion ri... |
| 2 | here trump go fck u wwiii doesnt actually star... | anybody ever wonder whether act war include mo... | 0 | here trump go fck u wwiii doesnt actually star... |
| 3 | medium silent president trump make american tr... | thursday july th today donald trump president ... | 0 | medium silent president trump make american tr... |
| 4 | trump slap lawsuit refuse release white house ... | donald trump sue three organization join toget... | 0 | trump slap lawsuit refuse release white house ... |


#### We shall split the data into train and test data using train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    df['combined_text'], df['status'],
    test_size = 0.2,
    random_state = 42
)
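
As a small illustration of the call above, the sketch below splits ten made-up items 80/20. The `stratify` argument is an addition that this notebook does not use; it is shown only as an option for keeping the class ratio the same in both splits.

```python
from sklearn.model_selection import train_test_split

# Ten made-up documents with alternating labels (0 = fake, 1 = real).
texts = [f"article {i}" for i in range(10)]
labels = [0, 1] * 5

# test_size=0.2 holds out 2 of the 10 items; random_state makes the split repeatable.
# stratify=labels is an assumption added here, not part of the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 8 2
print(sorted(y_te))          # [0, 1]
```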

#### Now we shall convert the qualitative data into quantitative data so that the model can understand and train on it

In [18]:
# Limit the vocabulary to the 10,000 most frequent terms
vectorize = TfidfVectorizer(max_features=10000)

X_train_Tfid = vectorize.fit_transform(X_train)
X_test_Tfid = vectorize.transform(X_test)
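
The vectorizer is fitted on the training text only and then reused on the test text, so the test set never influences the vocabulary or the IDF weights. A minimal sketch on a made-up corpus (the documents and variable names are illustrative assumptions) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned documents.
train_docs = [
    "trump sign budget bill",
    "senate pass budget bill",
    "shock video trump meltdown",
]
test_docs = ["senate budget meltdown unseenword"]

vectorizer = TfidfVectorizer(max_features=10000)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(train_matrix.shape)                      # (3, 9): 3 documents, 9 distinct terms
print(test_matrix.shape)                       # (1, 9): same columns as training
print("unseenword" in vectorizer.vocabulary_)  # False: unknown words are ignored
```

Both matrices share the same columns, which is what lets a model trained on one be applied to the other.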

### Logistic model creation and training

In [19]:
model = LogisticRegression()
model.fit(X_train_Tfid, y_train)
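
For intuition about what the fitted model computes, the sketch below trains a Logistic Regression on a tiny made-up one-feature dataset. It then checks that the predicted probability of class 1 equals the sigmoid of the linear score. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score w.x + b for each sample.
scores = clf.decision_function(X_toy)
# The sigmoid maps each score to the probability of class 1.
probs = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(probs, clf.predict_proba(X_toy)[:, 1]))  # True
print(clf.predict(X_toy))
```

A sample is assigned class 1 when this probability exceeds 0.5, which is the same as its linear score being positive.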

In [24]:
y_pred = model.predict(X_test_Tfid)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
matrix =  confusion_matrix(y_test, y_pred)

print("Accuracy= ", accuracy)
print("Classification Report=\n ", report)
print("Confusion Matrix=\n ", matrix)

Accuracy=  0.9850779510022272
Classification Report=
                precision    recall  f1-score   support

           0       0.99      0.98      0.99      4673
           1       0.98      0.99      0.98      4307

    accuracy                           0.99      8980
   macro avg       0.98      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980

Confusion Matrix=
  [[4584   89]
 [  45 4262]]
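
`roc_auc_score` is imported at the top but not used in the evaluation above. The sketch below shows how it could be computed; it takes scores or probabilities rather than hard labels. The arrays are made-up values, not results from this model.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities of class 1 (real news).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (real, fake) pairs in which the real article scores higher.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75

# With the fitted model, the scores would come from:
#   model.predict_proba(X_test_Tfid)[:, 1]
```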


## **Conclusion**
The Logistic Regression model performed very well on the held-out test set, achieving an accuracy of 98.51%. Precision, recall, and F1-scores for both classes fall between 0.98 and 0.99, so the model does not favor one class at the expense of the other. Of the 8,980 test articles, 134 were misclassified: 89 fake articles were predicted as real (false positives, taking real news as the positive class) and 45 real articles were predicted as fake (false negatives). These results make Logistic Regression a strong candidate for identifying fake news on this dataset. Further testing on real-world data is recommended to ensure robustness.
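
The figures in the classification report can be re-derived by hand from the confusion matrix. A short sketch, treating real news (`status = 1`) as the positive class:

```python
# Counts copied from the confusion matrix above (rows = true, columns = predicted).
tn, fp = 4584, 89   # true fake: predicted fake, predicted real
fn, tp = 45, 4262   # true real: predicted fake, predicted real

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (real) metrics.
precision_real = tp / (tp + fp)
recall_real = tp / (tp + fn)

# Class 0 (fake) metrics: the roles of the two classes swap.
precision_fake = tn / (tn + fn)
recall_fake = tn / (tn + fp)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9851
print(f"real: precision={precision_real:.2f} recall={recall_real:.2f} "
      f"f1={f1(precision_real, recall_real):.2f}")
print(f"fake: precision={precision_fake:.2f} recall={recall_fake:.2f} "
      f"f1={f1(precision_fake, recall_fake):.2f}")
```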