## **Problem Statement**

### **Introduction**
Fake news has become one of the most significant challenges of our time, affecting both online and offline discourse. Its spread poses a direct threat to democratic processes and societal stability, particularly in the Western world. Accurately identifying fake news and limiting its spread is essential to maintaining informed public discourse and safeguarding democratic institutions.

### **Problem Statement**
The primary challenge addressed by this project is the automatic detection of fake news articles using machine learning and natural language processing (NLP) techniques. By developing a reliable model to classify news articles as either fake or real, we aim to contribute to the efforts to curb the spread of misinformation and enhance the quality of information available to the public.

### **Aim of the Project**

The aim of this project is to build a fake news detection system that classifies a news article as fake or real accurately and reliably.

### **How Does the Solution Solve the Problem?**

The proposed solution is a machine learning model that uses NLP and deep learning techniques to classify news articles as fake or real. Users can submit an article and receive a predicted label, which makes the model a practical tool for combating misinformation.


### **About the Dataset**

The dataset used in this project contains labeled news articles, categorized as either fake or real. This dataset is essential for training and evaluating the machine learning models developed to detect fake news.

### **Content**
Each row of the dataset is one news article. The columns are `title`, `text`, and `label` (`FAKE` or `REAL`), plus an unnamed index column carried over from the CSV file. The dataset holds 6,335 articles, split almost evenly between 3,171 `REAL` and 3,164 `FAKE`. The dataset also includes information on how it was acquired and the time period it represents, which provides context for the analysis.




In [2]:
pip install spacy


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import spacy

In [5]:
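# Load the dataset, skipping the unnamed index column stored in the CSV.
# A callable passed to usecols should return True for each column to keep.
df = pd.read_csv(r'C:\Users\Aya\Downloads\Python\Exercise Dataset\fake_or_real_news.csv', usecols=lambda col: 'Unnamed' not in col)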


In [6]:
df.head()

| | title | text | label |
|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [7]:
df['input_text'] = df['title'] + ' ' + df['text']

In [8]:
df.head()

| | title | text | label | input_text |
|---|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | You Can Smell Hillary’s Fear Daniel Greenfield... |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | Watch The Exact Moment Paul Ryan Committed Pol... |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL | Kerry to go to Paris in gesture of sympathy U.... |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | Bernie supporters on Twitter erupt in anger ag... |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL | The Battle of New York: Why This Primary Matte... |


## **Machine Learning Task Instructions**

In this task, you will work with the `input_text` column created above. Your objective is to train a machine learning model on it that classifies each article as fake or real.

### **Steps to Follow:**

1. **Preprocess the Data**: Clean and preprocess the `input_text` data as necessary. This may include tokenization, stop-word removal, and lemmatization.

2. **Extract Features**: Transform the text data into numerical features suitable for machine learning algorithms. Consider using techniques like `TfidfVectorizer` or `CountVectorizer`.

3. **Select a Machine Learning Algorithm**: Choose an appropriate classification algorithm for the task, such as Logistic Regression, SVM, or Random Forest.

4. **Train Your Model**: Split your data into training and testing sets, then train your chosen model on the preprocessed data.

5. **Evaluate Your Model**: Measure the performance of your model using suitable metrics (e.g., accuracy, precision, recall, F1-score).
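
As a rough illustration of steps 1 and 2, the sketch below cleans a couple of made-up strings and turns them into TF-IDF features. The toy rows, the `clean_text` helper, and the regular expressions are assumptions for illustration and are not part of the notebook. Lemmatization is left out because it would need a downloaded spaCy language model.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy rows standing in for df['input_text']; not taken from the dataset.
toy = pd.DataFrame({
    "input_text": [
        "BREAKING!!! Aliens endorse candidate, see http://example.com/story",
        "Senate passes the budget bill after a long debate.",
    ]
})

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


# Step 1: preprocess the raw text.
toy["clean_text"] = toy["input_text"].apply(clean_text)

# Step 2: TF-IDF features; stop_words='english' drops common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(toy["clean_text"])

print(toy["clean_text"].tolist())
print(features.shape)
```

On the real data, the same function could be applied to `df['input_text']` before vectorizing. Whether the extra cleaning helps is something to check empirically.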

Good luck!


In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib

In [21]:
data = pd.read_csv(r'C:\Users\Aya\Downloads\Python\Exercise Dataset\fake_or_real_news.csv')
print(data.head())

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  


In [22]:
# Check the class balance of the freshly loaded frame.
data['label'].value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [23]:
# Drop the leftover index column, then remove rows with missing values.
data = data.drop(columns=['Unnamed: 0'])
data = data.dropna()

In [24]:
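# TF-IDF features from the article body: drop English stop words and
# ignore terms that appear in more than 70% of the documents.
# Note: the vectorizer is fitted on the full dataset before the split,
# so its vocabulary and IDF weights also reflect the test articles.
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X = vectorizer.fit_transform(data['text'])
y = data['label']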

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
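
One possible variation, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the TF-IDF vocabulary and IDF weights come from the training articles only. The toy corpus, the `stratify` argument, and `max_iter=1000` are assumptions for illustration. This sketch was not used to produce the scores reported in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not from the dataset), just large enough to split.
texts = [f"shocking secret miracle cure exposed number {i}" for i in range(5)] + [
    f"city council approves annual budget report number {i}" for i in range(5)
]
labels = ["FAKE"] * 5 + ["REAL"] * 5
toy = pd.DataFrame({"text": texts, "label": labels})

# Split the raw text first so the vectorizer never sees the test articles.
text_train, text_test, y_train, y_test = train_test_split(
    toy["text"], toy["label"], test_size=0.2, random_state=42, stratify=toy["label"]
)

# The pipeline fits TF-IDF and the classifier on the training split only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_df=0.7)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(text_train, y_train)

# score() vectorizes the held-out text with the training vocabulary, then predicts.
test_accuracy = pipeline.score(text_test, y_test)
print(test_accuracy)
```

With this layout the held-out articles affect only the final score, which usually gives a slightly more conservative estimate than fitting the vectorizer on everything.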

In [26]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [27]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='REAL'))
print("Recall:", recall_score(y_test, y_pred, pos_label='REAL'))
print("F1 Score:", f1_score(y_test, y_pred, pos_label='REAL'))


Accuracy: 0.9155485398579322
Precision: 0.9375
Recall: 0.892018779342723
F1 Score: 0.9141940657578187


In [28]:
# Reuse y_pred from the previous cell and score it with FAKE as the positive class.
# Accuracy does not depend on pos_label, so it matches the value above.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='FAKE'))
print("Recall:", recall_score(y_test, y_pred, pos_label='FAKE'))
print("F1 Score:", f1_score(y_test, y_pred, pos_label='FAKE'))

Accuracy: 0.9155485398579322
Precision: 0.8952959028831563
Recall: 0.9394904458598726
F1 Score: 0.916860916860917
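
The two cells above compute the same metrics once per positive class. As a compact alternative, `confusion_matrix` and `classification_report` from `sklearn.metrics` return the per-class figures in one call each. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical); in the notebook they would be replaced by `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration; substitute y_test and y_pred from the notebook.
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_hat = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]

# Rows are true classes and columns are predicted classes, both in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["FAKE", "REAL"])
print(cm)

# Per-class precision, recall, and F1 in a single table.
print(classification_report(y_true, y_hat, labels=["FAKE", "REAL"]))
```

The confusion matrix also shows which kind of error dominates, for example real articles flagged as fake versus fake articles that slip through.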


In [30]:
joblib.dump(model, 'logistic_model.joblib')

['logistic_model.joblib']
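
The saved file holds only the classifier, so classifying new raw text later would also need the fitted `TfidfVectorizer`. One way to handle this, sketched below under assumptions, is to persist the vectorizer and the model together as a single pipeline. The toy training texts and the file name `fake_news_pipeline.joblib` are hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not from the dataset.
train_texts = [
    "miracle cure doctors hate this secret",
    "shocking secret the government is hiding",
    "parliament votes on the new trade agreement",
    "central bank holds interest rates steady",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Vectorizer and classifier travel together in one object.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# 'fake_news_pipeline.joblib' is an illustrative file name in a temporary directory.
model_path = Path(tempfile.mkdtemp()) / "fake_news_pipeline.joblib"
joblib.dump(pipeline, model_path)

# Later, or in another process: load the pipeline and classify raw text directly.
restored = joblib.load(model_path)
prediction = restored.predict(["shocking miracle secret revealed"])
print(prediction)
```

An equivalent option would be to dump `vectorizer` to its own file next to `logistic_model.joblib` and load both before predicting.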