# **Fake News Detection Using Logistic Regression**
### **Summary**

Fake news, designed to mislead and misinform, has become a pressing issue, particularly because it spreads rapidly on social media platforms.

Addressing this challenge is crucial to minimizing its negative impact on society. Effective strategies include relying on trustworthy news sources, fact-checking, and raising awareness.

This project focuses on identifying fake news using machine learning techniques, starting with Logistic Regression. This algorithm, known for its efficiency in binary classification tasks, was applied to classify news articles as fake or genuine.
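
As a rough illustration of why Logistic Regression suits a two-class problem, the sketch below turns a weighted sum of features into a probability with the sigmoid function and thresholds it at 0.5. The function names, weights, bias, and feature values are made up for illustration; they are not values learned in this project.

```python
import math

def sigmoid(z):
    # Squash any real-valued score into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    # Weighted sum of the features plus a bias term
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    # 1 = real, 0 = fake (the labelling used later in this notebook)
    return (1 if probability >= threshold else 0), probability

# Hypothetical weights and features, chosen by hand
label, probability = predict_label([0.2, 0.7], [1.5, -2.0], 0.1)
print(label, round(probability, 3))
```

Training a Logistic Regression model amounts to finding the weights and bias that make these probabilities agree with the known labels as closely as possible.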

#### **Key Aspects**

* The dataset was preprocessed to extract meaningful features for training and testing.
* Logistic Regression was selected for its balance between simplicity and robust performance.
* Model evaluation was conducted using metrics like accuracy, precision, recall, and F1-score.



### Importing all the necessary libraries

In [None]:
import numpy as np
import pandas as pd

# for cleaning text
import re
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# for vectorizing text and splitting the data into train and test sets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# for logistic regression (the decision tree and random forest imports stay commented out until they are used)
from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier

# for getting accuracy and report
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score

### Data Collection

* Collect the data from the source (more info below).
* Merge it into a single dataset.
* Remove the columns that are not required.
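
The collection steps above can be sketched on two tiny stand-in tables. The frame names (`toy_fake`, `toy_true`, `toy_df`) and their contents are invented for illustration, and `ignore_index=True` is an optional choice made here so the merged frame gets a clean index; the column names follow the dataset used in this notebook.

```python
import pandas as pd

# Tiny stand-ins for Fake.csv and True.csv (contents invented for illustration)
toy_fake = pd.DataFrame({
    "title": ["fake headline"], "text": ["fake body"],
    "subject": ["News"], "date": ["January 1, 2018"],
})
toy_true = pd.DataFrame({
    "title": ["real headline"], "text": ["real body"],
    "subject": ["politicsNews"], "date": ["January 2, 2018"],
})

# Label each source before merging: 0 = fake, 1 = real
toy_fake["status"] = 0
toy_true["status"] = 1

# ignore_index=True gives the merged frame a fresh 0..n-1 index
toy_df = pd.concat([toy_fake, toy_true], ignore_index=True)

# Drop the columns the model does not use
toy_df = toy_df.drop(columns=["subject", "date"])

print(toy_df)
```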


### Data Source

I will use the dataset available on Kaggle, provided by Emine Bozkus.

Source: https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets

In [None]:
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")

In [None]:
print(df_true['title'][0])
print(df_true['text'][0])
print("==========================")
print(df_fake['title'][0])
print(df_fake['text'][0])

As U.S. budget fight looms, Republicans flip their fiscal script
WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-

In [None]:
# The source files carry no label column, so add one:
# 0 marks fake news and 1 marks real news

df_fake['status'] = 0
df_true['status'] = 1

df = pd.concat([df_fake, df_true])
df.drop(['subject', 'date'], inplace=True, axis=1)

In [None]:
df.head()

| | title | text | status |
|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | 0 |


In [None]:
# The fake rows all come before the real rows, so shuffle the data.
# A permutation keeps every row exactly once (randint would sample with replacement).
random_index = np.random.permutation(len(df))
df = df.iloc[random_index].reset_index(drop=True)
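
A shuffle should keep every row exactly once. The sketch below, on ten hypothetical row indices, contrasts drawing random indices independently (which samples with replacement) with drawing a permutation. The seed and the size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
n_rows = 10

# Drawing indices independently samples with replacement:
# some rows can repeat while others are dropped
sampled = rng.integers(0, n_rows, n_rows)

# A permutation contains every index exactly once
shuffled = rng.permutation(n_rows)

print("unique rows after sampling:", len(set(sampled.tolist())))
print("unique rows after permutation:", len(set(shuffled.tolist())))
```

Repeated rows matter here because a duplicated article can land in both the training and the test set, which makes the test score look better than it should.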

In [None]:
df.head()

| | title | text | status |
|---|---|---|---|
| 0 | MSNBC Shocks Viewers By Announcing Prime Time... | It s come to this.On Tuesday, MSNBC announced ... | 0 |
| 1 | Hillary Clinton says her family's foundation i... | WASHINGTON (Reuters) - U.S. Democratic preside... | 1 |
| 2 | Mnuchin not worried by lower U.S. tax receipts... | OTTAWA (Reuters) - U.S. Treasury Secretary Ste... | 1 |
| 3 | U.S. House Intelligence chairman questions lea... | WASHINGTON (Reuters) - The head of the U.S. Ho... | 1 |
| 4 | Britain's government to push ahead with plan o... | LONDON (Reuters) - Britain s Prime Minister Th... | 1 |


### Data Cleaning

For both the title and the content:
* Split the text into tokens.
* Remove all stop words, digits, and special characters (anything other than letters and whitespace, e.g. ( ) , . > < $ @ / ? " ' \ ] [ | ! * - { } ^ % # +).
* Lemmatize the words.
* Join the tokens back together for the next stage.

For the content only:
* Remove URLs, and remove the Reuters tag together with the location name that precedes it.

I will use the nltk library for this step.
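
The cleaning steps above can be sketched with the standard library alone. The regular expressions follow the ones used in this notebook, while `toy_clean` and `TOY_STOP_WORDS` are hypothetical names, and the tiny stop-word set only stands in for NLTK's full English list. Lemmatization is left out of the sketch because it needs the NLTK data downloads.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"the", "of", "a", "an", "to", "in"}

def toy_clean(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove the "(Reuters)" tag and any location name in front of it
    text = re.sub(r'\b[A-Za-z\s]+\s*\(Reuters\)|\(Reuters\)', '', text)
    # Keep only letters and whitespace, then lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    # Split into tokens, drop stop words, and join back together
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return ' '.join(tokens)

sample = "WASHINGTON (Reuters) - The head of 3 factions said: visit http://x.com now!"
print(toy_clean(sample))  # prints "head factions said visit now"
```

Removing the Reuters tag matters for this dataset because it appears almost only in the real articles, so a model could otherwise learn the tag instead of the writing itself.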

In [None]:
#Download all necessary NLTK data files
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')

def get_wordnet_pos(tag):
    # Map a Penn Treebank POS tag to the WordNet POS constant that WordNetLemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Build these once; recreating them for every row is slow on a large dataset
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Perform POS tagging
    pos_tags = nltk.pos_tag(tokens)

    # Lemmatize each token using its WordNet part of speech
    lemmatized_tokens = [
        LEMMATIZER.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tags
    ]

    # Rejoin tokens into a single string
    return ' '.join(lemmatized_tokens)


def preprocess_title(text):
    # Remove special characters and numbers from the title
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip()

    # Convert to lowercase
    text = text.lower()

    return preprocess_text(text)

def preprocess_content(text):
    # Remove URLs from text
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Earlier attempt: remove "Reuters - Location" or a standalone "Reuters"
    # text = re.sub(r'\bReuters\b(?:\s*-\s*[A-Za-z\s]+)?', '', text).strip()

    # Remove the "(Reuters)" tag, along with the location name when it comes before the tag
    text = re.sub(r'\b[A-Za-z\s]+\s*\(Reuters\)|\(Reuters\)', '', text)

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip()

    # Convert to lowercase
    text = text.lower()

    return preprocess_text(text)


#  testing
# if __name__ == "__main__":
#     sample_text = """
#     Breaking news: Government plans to reduce taxes next year!
#     Visit https://news.example.com for more details.
#     """
#     clean_text = preprocess_text(sample_text)
#     print("Original Text:", sample_text)
#     print("Processed Text:", clean_text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Apply the functions above to clean the text (takes about 10 minutes to run)
df['title'] = df['title'].apply(preprocess_title)
df['text'] = df['text'].apply(preprocess_content)


#### Combine title and text into single column

In [None]:
df['combined_text'] = df['title'] + " " + df['text']

In [None]:
df.head()

| | title | text | status | combined_text |
|---|---|---|---|---|
| 0 | germany fdp see common ground green education ... | germany liberal free democrat fdp see common g... | 1 | germany fdp see common ground green education ... |
| 1 | netanyahu putin israel may act curb iran clout... | sochi israeli prime minister benjamin netanyah... | 1 | netanyahu putin israel may act curb iran clout... |
| 2 | trump hurricane irma big monster | u president donald trump call hurricane irma b... | 1 | trump hurricane irma big monster u president d... |
| 3 | lawyer legal precedent clear clinton email inv... | decline seek prosecution hillary clinton fbi d... | 1 | lawyer legal precedent clear clinton email inv... |
| 4 | washington post ask major question trump admin... | trump young administration fraught trouble kne... | 0 | washington post ask major question trump admin... |


#### Split the data into train and test sets using train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['combined_text'], df['status'],
    test_size = 0.2,
    random_state = 42
)
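
To make the effect of `test_size` concrete, here is a small sketch on ten made-up documents. The `stratify` argument is an optional addition that the notebook's own split does not use; it is shown only as one way to keep the fake/real ratio the same in both parts. The variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical documents with balanced labels (0 = fake, 1 = real)
texts = [f"document {i}" for i in range(10)]
labels = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=labels,    # optional: keep the class ratio equal in both parts
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```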

#### Convert the text into numeric features so the model can train on it

In [None]:
# max_length = max(df['combined_text'].map(len))
vectorize = TfidfVectorizer(max_features=10000)

X_train_Tfid = vectorize.fit_transform(X_train)
X_test_Tfid = vectorize.transform(X_test)
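
The split between `fit_transform` on the training data and `transform` on the test data is easier to see on a toy corpus. In this sketch the documents and variable names are invented; the point is that the vocabulary is learned from the training documents only, so the test documents are projected onto the same columns and unseen words are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["trump call hurricane big monster", "senate pass budget bill"]
test_docs = ["senate call budget vote"]

vectorizer = TfidfVectorizer(max_features=10000)

# Learn the vocabulary and IDF weights from the training documents only
train_matrix = vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the test documents; unseen words ("vote") are ignored
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print("vote" in vectorizer.vocabulary_)
```

Fitting the vectorizer on the test set as well would leak information about the test documents into the features, which is why only `transform` is called on it.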

### Logistic model creation and training

In [None]:
model = LogisticRegression()
model.fit(X_train_Tfid, y_train)
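
A fitted `LogisticRegression` can return hard labels with `predict` or class probabilities with `predict_proba`. The sketch below uses synthetic data from `make_classification` as a stand-in for the TF-IDF features (an assumption made so the example runs on its own) and shows how the class-1 probability column feeds `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use X_train_Tfid / X_test_Tfid instead
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)

# predict returns hard 0/1 labels
hard_labels = clf.predict(X_te)

# predict_proba returns one column per class; column 1 is P(class == 1)
y_prob = clf.predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, y_prob)
print("ROC-AUC:", roc)
```

For a binary problem, `roc_auc_score` expects one score per sample, which is why only the second column of `predict_proba` is passed.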

In [None]:
y_pred = model.predict(X_test_Tfid)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
matrix =  confusion_matrix(y_test, y_pred)

# for roc auc score (uses the predicted probability of class 1)
# y_prob = model.predict_proba(X_test_Tfid)[:, 1]
# roc_score = roc_auc_score(y_test, y_prob)

print("Accuracy= ", accuracy)
print("Classification Report=\n ", report)
print("Confusion Matrix=\n ", matrix)
# print("ROC-AUC Score= ", roc_score)

Accuracy=  0.9820712694877506
Classification Report=
                precision    recall  f1-score   support

           0       0.99      0.98      0.98      4703
           1       0.98      0.99      0.98      4277

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980

Confusion Matrix=
  [[4599  104]
 [  57 4220]]
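
The figures in the report can be checked by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and computes the metrics for the "real" class (label 1); the variable names are chosen for this example.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions
# (scikit-learn's convention), with 0 = fake and 1 = real
matrix = [[4599, 104],
          [57, 4220]]

tn, fp = matrix[0]  # true fake predicted fake / true fake predicted real
fn, tp = matrix[1]  # true real predicted fake / true real predicted real

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_real = tp / (tp + fp)
recall_real = tp / (tp + fn)
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)

print(round(accuracy, 4), round(precision_real, 2), round(recall_real, 2), round(f1_real, 2))
```

In plain terms, 104 fake articles were labelled real and 57 real articles were labelled fake, out of 8980 test articles.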


#### Decision Tree Classifier

#### Random Forest Classifier

#### BERT Model

For the BERT model, we shall create a new numeric representation of the text.