## Fake News Classifier Using Passive Aggressive Classifier

1. **Import Necessary Libraries**: We import pandas for data manipulation, sklearn's train_test_split for splitting the data into training and testing sets, TfidfVectorizer for converting text data into numerical data, PassiveAggressiveClassifier for building the machine learning model, and accuracy_score and confusion_matrix for evaluating the model's performance.

In [11]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

2. **Load the Data**: We load the data from two CSV files into pandas DataFrames. One file contains fake news and the other contains true news.

In [12]:
# Load the data
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')

3. **Label the Data**: We add a new column to each DataFrame to label the news as fake or true. This will be our target variable for the machine learning model.

In [13]:
# Label the data
fake_news['label'] = 'FAKE'
true_news['label'] = 'TRUE'

4. **Concatenate the DataFrames**: We concatenate the two DataFrames into one. This is necessary because we want to train our machine learning model on both the fake news and the true news.

In [14]:
# Concatenate the dataframes
df = pd.concat([fake_news, true_news])
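
One detail worth knowing here: by default `pd.concat` keeps each frame's original row index, so the combined frame contains repeated index values. The sketch below uses two tiny made-up frames (stand-ins, not the real CSV files) to show the effect and the optional `ignore_index=True` argument. The names `fake_demo` and `true_demo` are hypothetical.

```python
import pandas as pd

# Two tiny stand-in frames; the rows are invented for illustration.
fake_demo = pd.DataFrame({"text": ["a", "b"], "label": "FAKE"})
true_demo = pd.DataFrame({"text": ["c", "d"], "label": "TRUE"})

# Default behaviour: each frame keeps its own index (0, 1), so values repeat.
stacked = pd.concat([fake_demo, true_demo])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([fake_demo, true_demo], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Repeated index values do not affect `train_test_split`, but they can surprise later label-based lookups such as `df.loc[0]`.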

5. **Check for Missing Values**: We check for missing values in the DataFrame. Missing values can cause issues with many machine learning algorithms, so it's important to identify and handle them.

In [15]:
# Check for missing values
print(df.isnull().sum())

title      0
text       0
subject    0
date       0
label      0
dtype: int64
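
Note that `isnull()` only reports true missing values (`NaN`/`None`); a text field that is empty or contains only whitespace is not counted. The sketch below shows the difference on a made-up three-row frame; whether the real dataset contains such rows is not something the output above can tell us.

```python
import pandas as pd

# Toy frame (invented rows): one normal text, one whitespace-only, one missing.
demo = pd.DataFrame({"text": ["Some article body", "   ", None]})

# isnull() counts only the None/NaN entry.
n_null = int(demo["text"].isnull().sum())

# Whitespace-only strings need an explicit check on the non-null values.
n_blank = int(demo["text"].dropna().str.strip().eq("").sum())

print(n_null, n_blank)   # 1 1
```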


6. **Handle Missing Values**: We handle missing values by removing any rows that contain them. The check above found none, so `dropna()` leaves the DataFrame unchanged here and serves only as a safeguard. Dropping rows is one of the simplest strategies, but it is not the best choice for every dataset.

In [16]:
# Handle missing values
df = df.dropna()

7. **Split the Data**: We split the data into a training set and a testing set. The model is trained on the training set and then evaluated on the testing set, which it has never seen. Here `test_size=0.2` holds out 20% of the rows for testing, and `random_state=42` makes the split reproducible.

In [17]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
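
The split above is purely random. As an optional variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below demonstrates this on made-up, deliberately imbalanced labels.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up toy data: 80 'FAKE' and 20 'TRUE' labels.
texts = [f"doc {i}" for i in range(100)]
labels = ["FAKE"] * 80 + ["TRUE"] * 20

# stratify=labels asks for the same 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_te))   # 16 'FAKE' and 4 'TRUE' in the 20-row test split
```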

8. **Initialize a TfidfVectorizer**: The TfidfVectorizer converts the text data into numerical data that can be used by the machine learning algorithm. It does this by calculating a TF-IDF score for each word in each document. The score is high for words that occur often in a document but rarely in the rest of the collection. Setting `stop_words='english'` drops common English words, and `max_df=0.7` ignores words that appear in more than 70% of the documents.

In [18]:
# Initialize a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
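
As a rough illustration of how the scores are computed, the sketch below rebuilds scikit-learn's default TF-IDF weighting by hand on a three-sentence toy corpus (invented here, not taken from the dataset) and compares the result with `TfidfVectorizer`. It assumes the library defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. No stop-word or `max_df` filtering is applied in this toy case.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration.
docs = ["cat sat mat", "dog sat log", "cat chased dog"]

# Term frequency: raw counts per document.
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: in how many documents each term appears.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and scale each row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's result under default settings.
library = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, library))   # True
```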

9. **Fit and Transform the Vectorizer**: We fit the vectorizer on the training set only, then transform both the training and testing sets. Fitting learns the vocabulary and the IDF weights from the training texts; transforming applies them to turn each text into a numerical vector. Because the testing set is never used for fitting, no information from it leaks into the features.

In [19]:
# Fit and transform the vectorizer on the training set, and transform the testing set
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
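
The difference between `fit_transform` and `transform` can be seen on a small scale. In the sketch below (toy sentences invented for illustration), the vocabulary comes from the training texts alone, so words that appear only in the test text are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences.
train_docs = ["markets rally after report", "senate passes budget bill"]
test_docs = ["senate delays budget vote"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)   # learns vocabulary and IDF
test_matrix = vec.transform(test_docs)         # reuses them unchanged

# Both matrices share the same columns: the 8 words seen in training.
print(train_matrix.shape, test_matrix.shape)   # (2, 8) (1, 8)

# 'delays' and 'vote' were never seen in training, so they are ignored;
# only 'senate' and 'budget' produce non-zero entries for the test text.
print("delays" in vec.vocabulary_)             # False
print(test_matrix.nnz)                         # 2
```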

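10. **Initialize a PassiveAggressiveClassifier**: The Passive Aggressive Classifier is an online learning algorithm: it updates its weights one example at a time. It stays passive when an example is classified correctly with a sufficient margin, and updates aggressively when it is not. This makes it well-suited to large-scale learning on sparse, high-dimensional data such as TF-IDF text features.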

In [20]:
# Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
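
To make the "passive" and "aggressive" behaviour concrete, here is a minimal NumPy sketch of a single binary update in the PA-I style (hinge loss with the step size capped by `C`). It is an illustration under simplifying assumptions, not scikit-learn's implementation: labels are encoded as +1/-1, there is no intercept, and the function name `pa_update` is hypothetical.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I style step for a single example; y must be +1 or -1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                               # passive: margin is satisfied
    tau = min(C, loss / np.dot(x, x))          # aggressive step, capped by C
    return w + tau * y * x

w0 = np.zeros(2)

# Misclassified example (margin 0 < 1): the weights move towards y * x.
w1 = pa_update(w0, np.array([1.0, 2.0]), y=1)
print(w1)                                      # approximately [0.2 0.4]

# Example already classified with margin 2 >= 1: the weights stay unchanged.
w2 = pa_update(w1, np.array([2.0, 4.0]), y=1)
print(np.array_equal(w1, w2))                  # True
```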

11. **Fit the Classifier**: We fit the classifier on the training set. This trains the model to classify news as fake or true based on the TF-IDF scores of the words in the text.

In [21]:
# Fit the classifier on the training set
pac.fit(tfidf_train, y_train)

12. **Predict on the Testing Set**: We use the trained model to make predictions on the testing set.

In [22]:
# Predict on the testing set
y_pred = pac.predict(tfidf_test)

13. **Calculate Accuracy**: We calculate the accuracy of the model by comparing the model's predictions to the actual labels of the news in the testing set.

In [23]:
# Calculate the accuracy on the testing set
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 99.48%


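14. **Build Confusion Matrix**: We build a confusion matrix to examine the model's performance in more detail. With `labels=['FAKE', 'TRUE']`, the rows are the actual labels and the columns are the predicted labels, in that order, so 'TRUE' acts as the positive class. Read row by row, the matrix gives the number of true negatives, false positives, false negatives, and true positives.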

In [24]:
# Build confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['FAKE', 'TRUE'])

In [25]:
# Calculate the performance metrics
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

# Print the performance metrics
print(f'Confusion Matrix: \n{cm}')
print(f'True Positives: {tp}')
print(f'True Negatives: {tn}')
print(f'False Positives: {fp}')
print(f'False Negatives: {fn}')
print(f'Precision: {round(precision*100,2)}%')
print(f'Recall: {round(recall*100,2)}%')
print(f'F1 Score: {round(f1_score*100,2)}%')

Confusion Matrix: 
[[4710   23]
 [  24 4223]]
True Positives: 4223
True Negatives: 4710
False Positives: 23
False Negatives: 24
Precision: 99.46%
Recall: 99.43%
F1 Score: 99.45%
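
The metrics above treat 'TRUE' as the positive class. As a quick cross-check, the sketch below starts from the four counts printed above, recomputes the accuracy, and also derives precision and recall with 'FAKE' as the positive class, which is arguably the class of interest for a fake news detector. Only the counts come from the output; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual FAKE/TRUE, columns = predicted FAKE/TRUE).
tn, fp, fn, tp = 4710, 23, 24, 4223

# Accuracy is the share of correct predictions over all test rows.
total = tn + fp + fn + tp
accuracy = (tp + tn) / total
print(f'Accuracy: {round(accuracy*100,2)}%')          # Accuracy: 99.48%

# With FAKE as the positive class, the roles of the cells swap:
# correct FAKE predictions are the former "true negatives".
precision_fake = tn / (tn + fn)   # of all predicted FAKE, how many were FAKE
recall_fake = tn / (tn + fp)      # of all actual FAKE, how many were caught
print(f'Precision (FAKE): {round(precision_fake*100,2)}%')   # 99.49%
print(f'Recall (FAKE): {round(recall_fake*100,2)}%')         # 99.51%
```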
