# Fake News Detection
#### Akshay U
A machine learning program that detects fake news using a Naive Bayes classifier.

We have two datasets, `True.csv` and `Fake.csv`. <br>
`True.csv` contains only true news and `Fake.csv` contains only fake news.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import joblib

### Location of Dataset
https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

### Import data and Cleaning

In [None]:
true_df = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")
fake_df = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")

Check each dataset's first rows, shape, and basic info.

Details of True.csv

In [None]:
true_df.head(10)

In [None]:
true_df['title'][0]

In [None]:
true_df['text'][0]

In [None]:
true_df.shape

In [None]:
true_df.info()

Details of Fake.csv

In [None]:
fake_df.head(10)

In [None]:
fake_df['title'][0]

In [None]:
fake_df['text'][0]

In [None]:
fake_df.shape

In [None]:
fake_df.info()

#### Check NaN
Check whether the dataset contains any null or NaN values.

In [None]:
true_df.isnull().values.any()

In [None]:
fake_df.isnull().values.any()

In [None]:
true_df.columns

In [None]:
fake_df.columns
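
`isnull()` only flags missing values; a cell that holds an empty or whitespace-only string is not counted. A minimal sketch of a stricter check follows. It uses a small made-up frame (`toy_df` is hypothetical) in place of `true_df` or `fake_df`, so the numbers say nothing about the real data.

```python
import pandas as pd

# Tiny stand-in frame, invented for illustration only.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "text": ["Some article body", "   ", ""],
})

# Missing values per column (NaN / None).
null_counts = toy_df.isnull().sum()

# Rows whose text is empty once surrounding whitespace is stripped.
blank_text = toy_df["text"].str.strip().eq("")

print(null_counts)
print("Blank text rows:", int(blank_text.sum()))
```

The same two lines could be run against `true_df` and `fake_df` to see whether any articles have no usable text.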

The dataset is clean, with no NaN values. We also need only the `text` attribute to predict the output. <br>
So we split the dataset columns into inputs and outputs.

Add a new column, `label`, to store whether each article is Real or Fake.

Then concatenate the two dataframes into one for training.

In [None]:
true_df['label'] = "Real"
true_df.head()

In [None]:
fake_df['label'] = "Fake"
fake_df.head()

In [None]:
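# ignore_index=True gives the combined frame a fresh index instead of two overlapping ones
df = pd.concat([true_df, fake_df], ignore_index=True)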
df.shape

#### Inputs

In [None]:
X = df['text']
X

#### Corresponding outputs

In [None]:
y = df['label']
y

### Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=99)
print("Training set - Features: ", X_train.shape, "  Target: ", y_train.shape)
print("Testing set  - Features: ", X_test.shape, "  Target: ",y_test.shape)

Let's check the split.

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

### Feature Extraction
<p style='text-align: justify'>Feature extraction turns raw data into a more manageable set of features for processing. Large data sets often contain a large number of variables that require a lot of computing resources to handle. Feature extraction methods select and/or combine those variables into features, reducing the amount of data that must be processed while still describing the original data set accurately. For text, this means converting each article into a vector of word counts.</p>

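Initialize a `CountVectorizer` with `stop_words='english'`.

Then call **fit()** and keep the fitted vectorizer in a variable so it can be saved to a ***joblib*** file later. Note that the vocabulary is learned from the full dataset `X`, including the rows later used for testing.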
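
To see what the vectorizer produces, here is a small sketch on two made-up headlines (the sentences and the `toy_` names are invented for illustration and are not part of the dataset). Each row of the result is one document and each column is one vocabulary word, with common English stop words such as "the" dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy headlines, invented for illustration only.
toy_docs = [
    "The senate passed the budget",
    "Aliens passed the budget",
]

toy_vect = CountVectorizer(stop_words='english')
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index.
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(toy_vocab)             # the learned words, in column order
print(toy_counts.toarray())  # one row of counts per document
```

The result is a sparse matrix, which is why the real notebook can hold tens of thousands of columns without running out of memory.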

In [None]:
vect = CountVectorizer(stop_words='english')
vectorizer = vect.fit(X)

In [None]:
X_train_transformed = vect.transform(X_train)
X_test_transformed = vect.transform(X_test)
print("New Transformed...")
print("Training set - Features: ", X_train_transformed.shape, "  Target: ", y_train.shape)
print("Testing set  - Features: ", X_test_transformed.shape, "  Target: ",y_test.shape)

### Modeling - Naive Bayes

In [None]:
def print_metrics(labels, preds):
    print("Precision Score\t: {}".format(precision_score(labels, preds, average='weighted')))
    print("Recall Score\t: {}".format(recall_score(labels, preds, average='weighted')))
    print("Accuracy Score\t: {}".format(accuracy_score(labels, preds)))
    print("F1 Score\t: {}".format(f1_score(labels, preds, average='weighted')))

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_transformed,y_train)

### Prediction and Accuracy

In [None]:
prediction = mnb.predict(X_test_transformed)
# print_metrics expects the true labels first, then the predictions
print_metrics(y_test, prediction)

### Confusion Matrix

In [None]:
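# confusion_matrix expects (y_true, y_pred); rows are true labels, columns are predictions
cm = confusion_matrix(y_test, prediction, labels=mnb.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=mnb.classes_)
disp.plot()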

### Train on the full dataset and store the model for future prediction
We measured the model's accuracy after training on 80% of the data.<br>For future predictions, we can retrain on 100% of the dataset, which may improve accuracy.

In [None]:
X_train_transformed = vect.transform(X)
naive = mnb.fit(X_train_transformed,y)
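
Once the model is refit on every row, no held-out data remains to measure it on. One option, which this notebook does not use, is k-fold cross-validation: every row is used for training in some folds and for testing in another. The sketch below assumes a small invented corpus (`toy_texts` and `toy_labels` are hypothetical), so its scores are not meaningful. Wrapping the vectorizer and classifier in a pipeline means the vocabulary is learned from the training fold only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented examples, for illustration only.
toy_texts = [
    "Senate passes budget after long debate",
    "Central bank raises interest rates",
    "City council approves new transit plan",
    "Court upholds election results",
    "Aliens secretly control the senate, insiders claim",
    "Miracle pill cures every disease overnight",
    "Celebrity clone spotted at secret moon base",
    "Shocking video proves the earth is hollow",
]
toy_labels = ["Real"] * 4 + ["Fake"] * 4

# The pipeline refits the vectorizer inside each fold.
pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())

# cv=2 keeps the toy example small; 5 or 10 folds are more usual on real data.
scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```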

### Joblib File
Our dataset is fairly large, so we use ***joblib*** instead of ***pickle***. A joblib file works much like a pickle file. Saving the model and vectorizer this way lets us make future predictions without retraining.

In [None]:
joblib.dump(naive,"naive.joblib")
joblib.dump(vectorizer,"vectorizer.joblib")
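
A later session could reload both files and classify new text without retraining. The sketch below is self-contained: it trains a tiny stand-in model on invented sentences and saves it to a temporary directory, so all `toy_` names and file names are hypothetical rather than the files written above. The key point is that new text must go through the same fitted vectorizer before it reaches the model.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training examples, for illustration only.
toy_texts = [
    "Senate passes budget after long debate",
    "Aliens secretly control the senate, insiders claim",
    "Central bank raises interest rates",
    "Miracle pill cures every disease overnight",
]
toy_labels = ["Real", "Fake", "Real", "Fake"]

toy_vect = CountVectorizer(stop_words='english').fit(toy_texts)
toy_model = MultinomialNB().fit(toy_vect.transform(toy_texts), toy_labels)

# Save both objects, as the notebook does, but into a temporary directory.
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "toy_model.joblib")
vect_path = os.path.join(tmp_dir, "toy_vectorizer.joblib")
joblib.dump(toy_model, model_path)
joblib.dump(toy_vect, vect_path)

# Later: reload and predict on unseen text.
loaded_model = joblib.load(model_path)
loaded_vect = joblib.load(vect_path)

new_text = ["Central bank debates budget"]
predicted = loaded_model.predict(loaded_vect.transform(new_text))
print(predicted[0])
```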