In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

In [17]:
df = pd.read_csv("news.csv", index_col=0)
print(df.shape)
df.head()

(6335, 3)


Unnamed: 0,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [18]:
labels = df.label
labels.head()

8476     FAKE
10294    FAKE
3608     REAL
10142    FAKE
875      REAL
Name: label, dtype: object

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], labels, test_size=0.2, random_state=7
)

##### `TfidfVectorizer`

`TfidfVectorizer` is a **bag of n-grams** transformer: it converts a body of text, here the text of a news article, into a feature vector that ignores the relative position of the tokens and treats the text as a bag of words.
It is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

1. `CountVectorizer` does the following:
    - **Tokenization**: Extracts words from the text as tokens (treating whitespace and punctuation as token separators) and assigns each distinct token an integer ID.
    - **Counting**: Counts the occurrences of each token in each document/text file.
    - Returns a sparse matrix in which each row corresponds to a document and each column to a token ID.
2. `TfidfTransformer` does the following:
    - Words/tokens that occur in most of the documents (like "a", "is", "the", ...) provide little to no information, and hence need to be weighted down relative to rarer tokens.
    - It achieves this by multiplying the **term frequency** (for each term in each document) by a weight, the **inverse document frequency** (for each term, calculated over all documents). With the default smoothing (`smooth_idf=True`), where $n$ is the number of documents and $\text{df}(\text{term})$ is the number of documents containing the term:
    - $$ \text{idf}(\text{term}) = \ln\frac{1+n}{1+\text{df}(\text{term})} + 1 $$
    - The adjusted count becomes: $$ \text{tf-idf}(\text{term}, \text{doc}) = \text{tf}(\text{term}, \text{doc}) \times \text{idf}(\text{term}) $$
    - Each document (row) vector is then normalized to unit Euclidean length (the default `norm="l2"`).
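
The equivalence and the formula above can be checked on a small example. The following is a minimal sketch, assuming scikit-learn's default settings; the three-sentence corpus and the variable names are made up for illustration and are not part of `news.csv`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus (hypothetical, not taken from news.csv)
docs = [
    "the senate passed the bill",
    "the bill failed in the senate",
    "aliens passed over the senate",
]

# Two-step route: raw token counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)
same = np.allclose(two_step.toarray(), one_step.toarray())

# Recompute by hand: idf = ln((1 + n) / (1 + df)) + 1, then L2-normalize each row
tf = counts.toarray().astype(float)
n = tf.shape[0]                       # number of documents
df_term = (tf > 0).sum(axis=0)        # documents containing each term
idf = np.log((1 + n) / (1 + df_term)) + 1
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)
matches_manual = np.allclose(manual, one_step.toarray())

print(same, matches_manual)
```

Both comparisons should agree, which shows that the normalization is applied per document (row), not per token (column).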

> `stop_words` is the set of words (like "and", "the", "him", ...) that the vectorizer ignores, as they carry little information about the content of the text itself. Passing `"english"` selects scikit-learn's built-in English list.

> `max_df` sets an upper limit on document frequency: a term that appears in more than this share of the documents is dropped from the vocabulary. With `max_df=0.7`, terms found in more than 70% of the training articles are ignored.
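
To see what these two parameters remove, the vocabularies of a default vectorizer and a filtered one can be compared. This is a small sketch on a made-up four-document corpus (the sentences and variable names are illustrative assumptions); the parameter values are the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "election" appears in 3 of 4 documents (75%)
docs = [
    "the election results shocked the senate",
    "the election was rigged claims blog",
    "the election turnout reached a record",
    "senate votes on the budget",
]

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english", max_df=0.7).fit(docs)

# Terms present with default settings but removed by stop_words / max_df
dropped = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print(dropped)
```

Here "the" is dropped as a stop word, while "election" is dropped by `max_df` because its document frequency (0.75) exceeds 0.7. "senate", with a document frequency of 0.5, stays in the vocabulary.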



In [23]:
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

In [39]:
print(tfidf_train.shape)

(5068, 61651)


In [44]:
print(tfidf_train)

  (0, 56381)	0.03622223988286098
  (0, 16314)	0.053492157980948106
  (0, 19620)	0.030351855107005405
  (0, 52607)	0.04266045446208797
  (0, 14900)	0.039165339742818085
  (0, 53749)	0.029756205182552464
  (0, 15211)	0.07772572986248194
  (0, 61154)	0.06726619958695557
  (0, 59042)	0.047893261248723944
  (0, 42972)	0.03152542343098286
  (0, 54232)	0.038673616329284524
  (0, 59249)	0.04106143649018827
  (0, 28891)	0.06514397995138038
  (0, 41708)	0.03983513460128018
  (0, 50192)	0.045331181477256094
  (0, 44691)	0.0318676439567658
  (0, 11820)	0.046381950858248124
  (0, 7682)	0.04137048243377956
  (0, 50343)	0.10196965191544219
  (0, 48095)	0.021092647294770877
  (0, 17916)	0.03674587236023286
  (0, 46027)	0.10236534701241509
  (0, 16993)	0.02775494464904786
  (0, 55006)	0.03368300200002207
  (0, 51389)	0.03397042876291898
  :	:
  (5067, 32909)	0.09429823872256275
  (5067, 59221)	0.11305513144362901
  (5067, 14649)	0.03772971846597005
  (5067, 55827)	0.2218263076177088
  (5067, 10398)	0.0

##### `PassiveAggressiveClassifier`

The Passive Aggressive classifier belongs to a family of online linear classification algorithms that suit large-scale learning, because they process the data one sample at a time (here, one article body at a time). The weights are updated only when the weights built from previous iterations misclassify the current sample, or classify it correctly but within the margin. The algorithm is *aggressive* in that case and *passive* otherwise, leaving the weights unchanged.
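
The passive and aggressive steps can be sketched in a few lines of NumPy. This is the textbook PA-I update rule for labels in {-1, +1}, written as a hypothetical helper `pa_update`; it is an illustration under those assumptions and not scikit-learn's internal implementation.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One textbook PA-I step for a label y in {-1, +1}; returns the new weights."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss of the current sample
    if loss == 0.0:
        return w                              # passive: margin satisfied, no change
    tau = min(C, loss / np.dot(x, x))         # step size, capped by C
    return w + tau * y * x                    # aggressive: correct the weights

w0 = np.zeros(3)

# Sample 1: zero weights give hinge loss 1, so the weights move
w1 = pa_update(w0, np.array([1.0, 0.0, 2.0]), +1)

# Sample 2: classified correctly with margin 1.6, so nothing changes
w2 = pa_update(w1, np.array([2.0, 0.0, 3.0]), +1)

# Sample 3: misclassified (true label is -1), so the weights move again
w3 = pa_update(w2, np.array([1.0, 0.0, 2.0]), -1)

print(w1, w2, w3)
```

The cap `C` limits how far a single sample can move the weights, which keeps one noisy article from dominating the model.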

In [37]:
pac = PassiveAggressiveClassifier(max_iter=100)
pac.fit(tfidf_train, y_train)

y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f"Accuracy is {round(score*100,2)}%")

Accuracy is 92.9%


In [38]:
cmat = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
cmat

array([[589,  49],
       [ 41, 588]], dtype=int64)
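
The accuracy reported earlier can be recovered from this matrix, along with per-class precision and recall. The sketch below copies the four counts from the output above; rows are true labels and columns are predicted labels, both in the order FAKE, REAL.

```python
import numpy as np

# Counts copied from the confusion matrix above
# (rows: true FAKE, REAL; columns: predicted FAKE, REAL)
cmat = np.array([[589, 49],
                 [41, 588]])

accuracy = np.trace(cmat) / cmat.sum()         # correct predictions / all predictions
recall = np.diag(cmat) / cmat.sum(axis=1)      # per true class
precision = np.diag(cmat) / cmat.sum(axis=0)   # per predicted class

print(f"Accuracy is {round(accuracy * 100, 2)}%")
for name, p, r in zip(["FAKE", "REAL"], precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

Of the 1267 test articles, 49 fake ones were labelled REAL and 41 real ones were labelled FAKE, so the errors are spread fairly evenly across the two classes.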