## 1. Fake News Detector 
<p>Fake news is false or misleading information presented as news. Fake news often has the aim of damaging the reputation of a person or entity, or making money through advertising revenue. The term was first used in the 1890s when sensational reports in newspapers were common.</p>
<p><img src="https://www.pngall.com/wp-content/uploads/4/Fake-News-Stamp-PNG.png" alt="Fake News Logo"></p>
<p>I am going to build a model that will accurately classify news articles as REAL or FAKE.  Using sklearn, I will build a <a  href="https://stackoverflow.com/questions/25902119/scikit-learn-tfidfvectorizer-meaning">TfidfVectorizer</a> on the dataset. Then, I will initialize a <a href="https://www.geeksforgeeks.org/passive-aggressive-classifiers/">PassiveAggressive Classifier</a> and fit the model. In the end, the accuracy score and the confusion matrix show how well the model fares. Let's look at the initial dataset I will use for this project:</p>
<ul>
<li><code>news.csv</code>: This dataset has a shape of 6335×4, which is consistent with the 1267 test articles (20%) counted in the confusion matrix later on. The first column identifies the news, the second and third are the title and text, and the fourth column has labels denoting whether the news is REAL or FAKE.</li></ul>
<p>First, install the libraries needed to run this project: <code>pip install numpy pandas scikit-learn</code> (the PyPI package is named <code>scikit-learn</code>, although it is imported as <code>sklearn</code>). The necessary imports are below:</p>


In [None]:
# import 
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## 2. Data Import and Splitting
<p>Next, I load the data into a DataFrame and split it into training and testing sets:</p>


In [226]:
#Read the data
df = pd.read_csv('news.csv') #use the full path here if the file is stored elsewhere
#Get shape and head (wrapped in print so they display even when not last in the cell)
print(df.shape)
print(df.head())
#Get the labels
labels = df.label
print(labels.head())

#split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)
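
<p>To make the effect of <code>test_size=0.2</code> concrete, here is a minimal sketch on a made-up ten-row DataFrame. The toy data and the <code>stratify</code> argument are my own additions for illustration; the cell above does not stratify. Stratifying keeps the REAL/FAKE ratio the same in both splits, which can be worth considering when the classes are imbalanced.</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of news.csv with only the two columns used above
toy = pd.DataFrame({
    'text': [f'article {i}' for i in range(10)],
    'label': ['FAKE', 'REAL'] * 5,
})

# Same split parameters as above, plus stratify (an assumption, not in the original cell)
x_tr, x_te, y_tr, y_te = train_test_split(
    toy['text'], toy['label'],
    test_size=0.2, random_state=7, stratify=toy['label'],
)

print(len(x_tr), len(x_te))           # sizes of the training and test sets
print(y_te.value_counts().to_dict())  # label counts in the test set
```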


## 3. Initialising TfidfVectorizer
<p>The next step is to initialise a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents are discarded). Stop words are the most common words in a language and are filtered out before the text is processed. Once fitted, the TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.</p>

In [228]:
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
# this code fits and transforms train set and transforms the test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
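
<p>To see what <code>stop_words</code> and <code>max_df</code> actually remove, here is a small sketch on a made-up four-sentence corpus (the sentences are hypothetical and not taken from <code>news.csv</code>). The word "senate" appears in every document, so its document frequency exceeds 0.7 and it should be dropped, while "the" is removed as an English stop word.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the senate passed the budget",
    "the senate rejected the budget",
    "aliens endorse the senate candidate",
    "the senate candidate wins the debate",
]

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))    # terms that survived the filtering
```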
      

## 4. Initialise PassiveAggressiveClassifier

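<p>This section initialises the PassiveAggressiveClassifier and fits it on <code>tfidf_train</code> and <code>y_train</code>. I then predict on the TF-IDF-transformed test set and calculate the accuracy with <code>accuracy_score()</code> from <code>sklearn.metrics</code>. Because no <code>random_state</code> is passed to the classifier, the accuracy may vary slightly between runs.</p>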


In [230]:
#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%') 

Accuracy: 92.66%


## 5. Accuracy
<p>The model reaches an accuracy of 92.66% on the test set. Next, I print a confusion matrix to see the number of true and false positives and negatives.</p>


In [232]:
#Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[588,  50],
       [ 43, 586]])
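
<p>The matrix can be unpacked into its four cells to check the accuracy by hand. In scikit-learn's convention, rows are the true labels and columns are the predicted labels, in the order given by <code>labels</code>. The sketch below copies the numbers from the output above and assumes FAKE is treated as the positive class; precision and recall are extra metrics I add here for illustration.</p>

```python
import numpy as np

# Confusion matrix from the output above, with labels=['FAKE', 'REAL']
# Rows: true labels, columns: predicted labels
cm = np.array([[588, 50],
               [43, 586]])

# Assumption: FAKE is the positive class
tp, fn = cm[0]  # true FAKE: predicted FAKE, predicted REAL
fp, tn = cm[1]  # true REAL: predicted FAKE, predicted REAL

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
```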

## 6. Summary
<p>With FAKE as the positive class, this output shows 588 true positives, 586 true negatives, 43 false positives (REAL articles predicted as FAKE), and 50 false negatives (FAKE articles predicted as REAL). I took a political dataset, implemented a TfidfVectorizer, initialised a PassiveAggressiveClassifier, and fit a model. I ended up obtaining an accuracy of 92.66%.

Now, this model can be used to classify further articles if they are arranged in the same format; minimal code alterations are required if the dataset needs to be expanded in any way.</p>
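
<p>As a rough sketch of how further articles could be classified, the vectorizer and classifier can be chained in a scikit-learn <code>Pipeline</code>, so raw text goes in and a label comes out. The training sentences and new articles below are made up so the example runs on its own, and the <code>random_state</code> is my own addition; in the notebook, the pipeline would be fitted on <code>x_train</code> and <code>y_train</code> instead.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for x_train / y_train, for illustration only
train_texts = [
    "parliament approves annual budget after long debate",
    "central bank holds interest rates steady this quarter",
    "miracle pill cures every disease overnight doctors stunned",
    "secret lizard rulers control world governments insider claims",
]
train_labels = ['REAL', 'REAL', 'FAKE', 'FAKE']

# Same settings as the notebook; random_state is an added assumption
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.7),
    PassiveAggressiveClassifier(max_iter=50, random_state=7),
)
model.fit(train_texts, train_labels)

# New, unseen articles go through the same vectorizer before prediction
new_articles = [
    "parliament debate on the budget continues",
    "insider claims miracle pill is controlled by lizard rulers",
]
predictions = model.predict(new_articles)
print(list(predictions))
```

<p>With the objects already fitted in this notebook, the equivalent would be <code>pac.predict(tfidf_vectorizer.transform(new_articles))</code>. The important point is that new text is only transformed, never re-fitted, so it is mapped onto the vocabulary learned from the training set.</p>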
