# Building a "Fake News" Classifier
-----------

1.	Which of the following are possible features for a text classification problem?

    - <input type="radio" disabled> Number of words in a document
    - <input type="radio" disabled> Specific named entities
    - <input type="radio" disabled> Language
    - <input type="radio" disabled checked> All of the above


## 1. About CountVectorizer 
- **CountVectorizer:** Convert a collection of text documents to a matrix of token counts 
- CountVectorizer implements both tokenization and occurrence counting in a single class
- [Must Read User Guide](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [Refer sklearn CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [3]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?'    
]
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

##### What is a sparse matrix? Did you learn about it from the user guide?
- [see now](https://scikit-learn.org/stable/modules/feature_extraction.html#sparsity)

In [4]:
vectorizer.get_feature_names() # each word is a column/feature (in scikit-learn >= 1.2 use get_feature_names_out())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [5]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
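
The dense array above has 36 cells, but only the non-zero ones need to be stored. The following is a minimal pure-Python sketch of that idea, not how SciPy lays out a CSR matrix internally; the names `stored` and `density` are illustrative. The count of stored cells should agree with the "19 stored elements" reported for `X`.

```python
# Dense count matrix copied from the X.toarray() output above.
dense = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

# Keep only the non-zero cells, keyed by (row, column).
stored = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
density = len(stored) / total_cells

print(len(stored), "stored cells out of", total_cells)
print("density:", round(density, 3))
```

On a real corpus the density is far lower than in this toy example, which is why the vectorizers return sparse matrices by default.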

In [6]:
vectorizer.vocabulary_ # mapping from feature name to column index

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [7]:
vectorizer.vocabulary_.get("and")

0

In [8]:
vectorizer.vocabulary_.get("the")

6

- Because the vocabulary is fixed during `fit`, words that were not seen in the training corpus are completely ignored in later calls to the `transform` method:

In [9]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

- Note that in the previous corpus, the first and the last documents contain exactly the same words and hence are encoded as equal vectors. In particular, we lose the information that the last document is in interrogative form. To preserve some of the local ordering information, we can extract 2-grams of words in addition to the 1-grams (individual words):


In [10]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

- The default configuration tokenizes the string by extracting words of at least 2 characters, so the single-letter word `a` in the next cell is dropped. The function that performs this step can be requested explicitly:

In [11]:
analyze = bigram_vectorizer.build_analyzer()
analyze('a Bi-grams are cool!') 

['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
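
As a rough illustration of what the analyzer does, the sketch below applies the `token_pattern` from the repr above with Python's `re` module and then builds word n-grams. It is a simplified stand-in, not scikit-learn's implementation; `word_ngrams` is a hypothetical helper name.

```python
import re

# Same regular expression as the token_pattern shown in the repr above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def word_ngrams(text, ngram_range=(1, 2)):
    """Lowercase the text, tokenize it, and return word n-grams joined by spaces."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


result = word_ngrams("a Bi-grams are cool!")
print(result)
```

The pattern requires at least two word characters, which is why `a` disappears and `Bi-grams` is split at the hyphen.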

In [12]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]],
      dtype=int64)

In [13]:
feature_index = bigram_vectorizer.vocabulary_.get('is this')
feature_index

7

In [14]:
bigram_vectorizer.get_feature_names()

['and',
 'and the',
 'document',
 'first',
 'first document',
 'is',
 'is the',
 'is this',
 'one',
 'second',
 'second document',
 'second second',
 'the',
 'the first',
 'the second',
 'the third',
 'third',
 'third one',
 'this',
 'this is',
 'this the']

## 2. TfidfTransformer
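- It normalizes raw term counts so that documents of different lengths become comparable
- Terms that are rare across the corpus are given more weight; terms that appear in most documents are given less
- **Tf** means term-frequency while **tf–idf** means term-frequency times inverse document-frequency
- The user guide linked below walks through the calculation step by step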
- [refer](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
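
The calculation can be sketched by hand for the four-document toy corpus from section 1. This assumes the default settings described in the user guide (`smooth_idf=True`, L2 normalization), where idf(t) = ln((1 + n) / (1 + df(t))) + 1. The variable names are illustrative, and the counts are copied from the `X.toarray()` output in section 1.

```python
import math

vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

n_docs = len(counts)

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf = [tfidf_row(row) for row in counts]
for term, value in zip(vocab, tfidf[0]):
    print(f"{term:10s} {value:.8f}")
```

Note that `the`, which occurs in every document, ends up with the lowest non-zero weight in the first row, while `first`, which occurs in only two documents, gets the highest.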

## 3. TfidfVectorizer = CountVectorizer + TfidfTransformer

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)                               

In [16]:
TfidfVectorizer?

In [17]:
X.shape

(4, 9)

In [18]:
vectorizer.get_feature_names()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [19]:
print(X.toarray())

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
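
To check the claim in this section's heading, the sketch below runs the two-step route (`CountVectorizer` followed by `TfidfTransformer`) and the one-step `TfidfVectorizer` on the same toy corpus and compares the results. It assumes default parameters for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical results:", same)
```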


## 4. Scenario: Fake News Classifier
- Let us see which vectorizer gives the better-performing model:
    - CountVectorizer
    - TfidfVectorizer

### Step 1: CountVectorizer for text classification

In [20]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [21]:
# Load Fake news csv file, available in lab data folder
import os
os.chdir("C:\\Users\\ramreddymyla\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")
import pandas as pd
df = pd.read_csv("fake_or_real_news.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
Unnamed: 0    6335 non-null int64
title         6335 non-null object
text          6335 non-null object
label         6335 non-null object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


In [22]:
# Print the head of df
print(df.head())

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  


#### Create two Series: one as X and another as y

In [23]:
X = df['text']
print(type(X))
X.head()

<class 'pandas.core.series.Series'>


0    Daniel Greenfield, a Shillman Journalism Fello...
1    Google Pinterest Digg Linkedin Reddit Stumbleu...
2    U.S. Secretary of State John F. Kerry said Mon...
3    — Kaydee King (@KaydeeKing) November 9, 2016 T...
4    It's primary day in New York and front-runners...
Name: text, dtype: object

In [24]:
# Create a series to store the labels: y
y = df.label
y.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [25]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size=0.33,
                                                    random_state=53)

In [26]:
X.shape

(6335,)

In [27]:
X_train.shape

(4244,)

In [28]:
X_test.shape

(2091,)
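
The shapes above can be reproduced with simple arithmetic. The sketch assumes that a fractional `test_size` is rounded up to a whole number of samples and that the training set receives the remainder, which is consistent with the shapes shown.

```python
import math

n_samples = 6335   # rows in df
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```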

In [29]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

In [30]:
count_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [31]:
# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)
count_train

<4244x56922 sparse matrix of type '<class 'numpy.int64'>'
	with 1119820 stored elements in Compressed Sparse Row format>

`Note:` In text and image analytics, data sets usually have many more features than samples, unlike our regular tabular data (examples: iris.csv, pima-indians-females data, ...).
* In text analytics, each `word/token` is `one feature`
* In image analytics, each `pixel` is `one feature`
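
A quick back-of-the-envelope calculation, using only the numbers printed for `count_train` above, shows how wide and how empty this matrix is. The variable names are illustrative.

```python
n_docs, n_features = 4244, 56922   # shape of count_train
stored = 1119820                   # stored (non-zero) elements reported above

features_per_sample = n_features / n_docs
density = stored / (n_docs * n_features)
avg_terms_per_doc = stored / n_docs

print(f"features per sample: {features_per_sample:.1f}")
print(f"non-zero cells: {density:.2%}")
print(f"distinct terms per document (average): {avg_terms_per_doc:.0f}")
```

There are over thirteen times more columns than rows, and well under 1% of the cells are non-zero, which is why the matrix is kept in sparse format.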

#### What is the column index of the first word of the first document in X_train?

In [None]:
X_train.head()

In [32]:
# What is the first word of the first record in X_train?
X_train[1539] # label-based lookup; the first word is "Report"

'Report Copyright Violation Do you think there will be as many doom sayers if trump should get in office ? I notice here at GLP the amount of doom sayers seems to go down when a republican is in office (Bush). But when the left get in office the doomsaying increases. Now i am sure the effect is opposite. If trump gets in office i am sure the doomsaying will increase on the left side of the political spectrum. Page 1'

In [33]:
# What is the column index
count_vectorizer.get_feature_names().index("report")

42470

In [34]:
count_train.toarray()[1,42470]

1

In [35]:
count_train.shape

(4244, 56922)

In [36]:
# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

In [37]:
# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10]) # top 10 columns

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [38]:
print(count_vectorizer.get_feature_names()[-10:-1]) # 9 columns near the end: [-10:-1] excludes the last item; use [-10:] for the last 10

['حلب', 'عربي', 'عن', 'لم', 'ما', 'محاولات', 'من', 'هذا', 'والمرضى']


### Step 2: TfidfVectorizer for text classification

- What is the difference between CountVectorizer and TfidfVectorizer?
- [Refer](https://stackoverflow.com/questions/22489264/is-a-countvectorizer-the-same-as-tfidfvectorizer-with-use-idf-false)

In [39]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [40]:
# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', 
                                   max_df=0.7)
# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)
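
The `max_df=0.7` argument asks the vectorizer to drop terms that appear in more than 70% of the training documents. The sketch below imitates that filter on the toy corpus from section 1 in plain Python. It is a simplified illustration (no stop-word removal), and `doc_freq` and `kept` are hypothetical names.

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
max_df = 0.7

# The set of distinct tokens in each document.
tokenized = [set(re.findall(r"(?u)\b\w\w+\b", doc.lower())) for doc in corpus]
vocab = sorted(set().union(*tokenized))

# Fraction of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in tokenized) / len(corpus) for t in vocab}

# Keep only the terms at or below the threshold.
kept = [t for t in vocab if doc_freq[t] <= max_df]
print(kept)
```

Here `the` (4 of 4 documents) and `document`, `is`, `this` (3 of 4 each) exceed the threshold and are removed.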

In [41]:
# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [42]:
# Print the first 5 vectors of the tfidf training data
print(tfidf_train.toarray()[:5])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Step 3: Inspecting the vectors created using above two methods
- To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames


In [43]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(),
                        columns=count_vectorizer.get_feature_names())

In [44]:
count_df.report[:5]

0    0
1    1
2    0
3    0
4    0
Name: report, dtype: int64

In [45]:
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(),
                        columns=tfidf_vectorizer.get_feature_names())

In [46]:
tfidf_df.report[:5]

0    0.00000
1    0.07711
2    0.00000
3    0.00000
4    0.00000
Name: report, dtype: float64

In [95]:
# Print the head of count_df
print(count_df.head())

   00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0   0    0     0         0       0      0     0       0      0      0  ...   
1   0    0     0         0       0      0     0       0      0      0  ...   
2   0    0     0         0       0      0     0       0      0      0  ...   
3   0    0     0         0       0      0     0       0      0      0  ...   
4   0    0     0         0       0      0     0       0      0      0  ...   

   حلب  عربي  عن  لم  ما  محاولات  من  هذا  والمرضى  ยงade  
0    0     0   0   0   0        0   0    0        0      0  
1    0     0   0   0   0        0   0    0        0      0  
2    0     0   0   0   0        0   0    0        0      0  
3    0     0   0   0   0        0   0    0        0      0  
4    0     0   0   0   0        0   0    0        0      0  

[5 rows x 56922 columns]


In [96]:
# Print the head of tfidf_df
print(tfidf_df.head())

    00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    0.0    0.0  ...   
1  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    0.0    0.0  ...   
2  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    0.0    0.0  ...   
3  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    0.0    0.0  ...   
4  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    0.0    0.0  ...   

   حلب  عربي   عن   لم   ما  محاولات   من  هذا  والمرضى  ยงade  
0  0.0   0.0  0.0  0.0  0.0      0.0  0.0  0.0      0.0    0.0  
1  0.0   0.0  0.0  0.0  0.0      0.0  0.0  0.0      0.0    0.0  
2  0.0   0.0  0.0  0.0  0.0      0.0  0.0  0.0      0.0    0.0  
3  0.0   0.0  0.0  0.0  0.0      0.0  0.0  0.0      0.0    0.0  
4  0.0   0.0  0.0  0.0  0.0      0.0  0.0  0.0      0.0    0.0  

[5 rows x 56922 columns]


In [47]:
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

set()


- Which of the below is the most reasonable model to use when training a new supervised model using `text` data?
    - <input type="radio" disabled> Random Forests
    - <input type="radio" disabled> Linear Regression
    - <input type="radio" disabled> Deep Learning
    - <input type="radio" disabled checked> Naive Bayes


> **Home Work:** Bayes theorem

> [Refer 1](https://www.mathsisfun.com/data/bayes-theorem.html)

> [Refer 2](http://yudkowsky.net/rational/bayes)(optional)
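
As a warm-up for the homework, here is Bayes' theorem applied to a single word. All probabilities are made-up numbers for illustration only; they are not measured on the news dataset.

```python
# Hypothetical inputs (assumptions, not estimates from the data).
p_fake = 0.5              # prior: P(FAKE)
p_word_given_fake = 0.30  # P(word appears | FAKE)
p_word_given_real = 0.10  # P(word appears | REAL)

# Total probability of seeing the word in any article.
p_word = p_word_given_fake * p_fake + p_word_given_real * (1 - p_fake)

# Bayes' theorem: P(FAKE | word) = P(word | FAKE) * P(FAKE) / P(word)
p_fake_given_word = p_word_given_fake * p_fake / p_word

print(round(p_fake_given_word, 3))
```

Seeing the word moves the probability of FAKE from the prior of 0.5 up to 0.75 under these assumed numbers.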

### Step 4: Training and testing the "Fake News" model with CountVectorizer
- Train and test a **Naive Bayes** model using the CountVectorizer data.


In [2]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix

In [3]:
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# https://en.wikipedia.org/wiki/Additive_smoothing (optional)
# https://en.wikipedia.org/wiki/Naive_Bayes_classifier (optional)
# https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

In [50]:
# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
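
To show what the `alpha=1.0` in the repr above controls, here is a tiny multinomial Naive Bayes written from scratch with additive smoothing. It is a teaching sketch on a made-up four-sentence training set, not scikit-learn's implementation; `fit` and `predict` are hypothetical helper names.

```python
import math
from collections import Counter

# Tiny made-up training set, for illustration only.
train = [
    ("shocking secret exposed", "FAKE"),
    ("secret plot exposed", "FAKE"),
    ("senate passes budget", "REAL"),
    ("budget vote delayed", "REAL"),
]


def fit(samples, alpha=1.0):
    """Count words per class and remember the vocabulary."""
    word_counts, class_counts, vocab = {}, Counter(), set()
    for text, label in samples:
        class_counts[label] += 1
        counter = word_counts.setdefault(label, Counter())
        for word in text.split():
            counter[word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab, alpha


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # words unseen in training are ignored
            # Additive smoothing: alpha keeps zero counts from zeroing the product.
            numerator = word_counts[label][word] + alpha
            denominator = total + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict(model, "secret budget exposed"))
```

Without smoothing (`alpha=0`), the word `budget` would have probability zero under FAKE and `math.log` would fail, which is the problem additive smoothing avoids.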

In [51]:
# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

In [52]:
# Calculate the accuracy score: score
score = accuracy_score(y_test, pred)
print(score)

0.893352462936394


In [53]:
# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test, 
                      pred, 
                      labels=['FAKE', 'REAL'])
print(cm)

[[ 865  143]
 [  80 1003]]
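
The accuracy printed earlier can be recovered from this confusion matrix, along with precision and recall. The sketch treats FAKE as the positive class, which is a choice made here for illustration. Rows are true labels and columns are predicted labels, in the order given by `labels`.

```python
# Confusion matrix copied from the output above.
# Rows: true FAKE, true REAL. Columns: predicted FAKE, predicted REAL.
cm = [[865, 143],
      [80, 1003]]

tp, fn = cm[0]   # true FAKE predicted as FAKE / as REAL
fp, tn = cm[1]   # true REAL predicted as FAKE / as REAL

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # of articles flagged FAKE, how many are FAKE
recall = tp / (tp + fn)      # of FAKE articles, how many were flagged

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```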


### Step 5: Training and testing the "fake news" model with TfidfVectorizer
- In the previous step, we evaluated the model using the CountVectorizer. Now you'll do the same using the TfidfVectorizer with a Naive Bayes model.


In [54]:
# Create a Multinomial Naive Bayes classifier: nb_classifier	
nb_classifier = MultinomialNB()

In [55]:
# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [56]:
# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

In [57]:
# Calculate the accuracy score: score
score = accuracy_score(y_test, pred)
print(score)

0.8565279770444764


In [58]:
# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test, 
                      pred, 
                      labels=['FAKE', 'REAL'])
print(cm)

[[ 739  269]
 [  31 1052]]
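
Accuracy alone hides where the two models differ. The sketch below compares per-class recall using the two confusion matrices printed in Steps 4 and 5; the dictionary layout and names are illustrative.

```python
labels = ["FAKE", "REAL"]

# Confusion matrices copied from the outputs of Steps 4 and 5.
matrices = {
    "count": [[865, 143], [80, 1003]],
    "tfidf": [[739, 269], [31, 1052]],
}

recalls = {}
for name, cm in matrices.items():
    # Recall per class: diagonal entry divided by the row total.
    recalls[name] = {
        label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)
    }
    print(name, {k: round(v, 3) for k, v in recalls[name].items()})
```

With default settings, the tf-idf model catches more REAL articles but misses many more FAKE ones, which accounts for its lower overall accuracy.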


- What are the possible next steps you could take to improve the model?
    - <input type="radio" disabled> Tweaking alpha levels
    - <input type="radio" disabled> Trying a new classification model
    - <input type="radio" disabled> Training on a larger dataset
    - <input type="radio" disabled> Improving text pre-processing
    - <input type="radio" disabled checked> All of the above


### Step 6: Improving your model
- Test a few different alpha levels using the Tfidf vectors to determine if there is a better performing combination.


In [59]:
# Create the list of alphas: alphas
import numpy as np
alphas = np.arange(0, 1, .1)
alphas

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
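
Two details of this grid are worth a note. Building the values by repeated multiplication with `0.1` introduces floating-point rounding (values such as `0.30000000000000004` show up when each alpha is printed individually), and `alpha = 0` turns smoothing off entirely. A possible alternative, sketched in plain Python, divides integers instead and starts at 0.1; this is a suggestion, not what the notebook ran.

```python
# Multiplying by 0.1 accumulates binary rounding error.
drifting = [i * 0.1 for i in range(10)]

# Dividing integers by 10 gives the closest float to each decimal,
# and starting at 1 skips alpha = 0 (no smoothing).
clean = [i / 10 for i in range(1, 11)]

print(drifting[3])
print(clean)
```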

In [60]:
# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = accuracy_score(y_test, pred)
    return score

In [61]:
# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684
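
The best setting can be picked programmatically rather than by eye. The sketch below copies the scores printed above into a dictionary (alpha values rounded to one decimal) and selects the maximum; `scores` and `best_alpha` are illustrative names.

```python
# Accuracy per alpha, copied from the output above.
scores = {
    0.0: 0.8813964610234337,
    0.1: 0.8976566236250598,
    0.2: 0.8938307030129125,
    0.3: 0.8900047824007652,
    0.4: 0.8857006217120995,
    0.5: 0.8842659014825442,
    0.6: 0.874701099952176,
    0.7: 0.8703969392635102,
    0.8: 0.8660927785748446,
    0.9: 0.8589191774270684,
}

best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

Because these scores were computed on the test set, the chosen alpha is likely to look slightly better than it would on new data. Selecting alpha with cross-validation on the training set alone is the usual way to avoid that.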



`Home Work:` Predict whether a given mail is spam or not