### ***N-gram bag of words***
- Language modeling involves determining the probability of a sequence of words.
- An n-gram bag of words counts how often each word sequence (n-gram) from the vocabulary learned during training occurs in the given text.
- In this way the whole text is turned into a vector of counts.
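
For reference, the n-gram idea behind the first bullet is usually written as below. The symbols $w_i$ (the i-th word), $m$ (sequence length) and $n$ (n-gram size) are notation introduced here for illustration, not taken from the notebook.

$$
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$

The first equality is the chain rule; the approximation keeps only the previous $n-1$ words as context (fewer at the start of the sequence). Note that the bag-of-n-grams features used in this notebook only count n-grams; they do not estimate these probabilities.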

> This can be implemented with scikit-learn's **`CountVectorizer`** and its **`ngram_range`** parameter.
***
***Example***
- **I am working on machine learning => work machine learn**
- **I am working on learning machine => work learn machine** 

After lemmatisation and stop-word removal, count vectorisation over single words gives both sentences the same vector, so they look like `similar sentences`.
They are not the same, though. With `bigram` features the difference shows up, because `machine learn` and `learn machine` are different bigrams.
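
The point can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper name `ngram_counts` is made up for illustration and is not part of scikit-learn or spaCy.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


a = "work machine learn".split()
b = "work learn machine".split()

# Same words, different order: unigram counts match, bigram counts do not.
same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)
print(same_unigrams)  # True
print(same_bigrams)   # False
```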

In [163]:
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [164]:
# data 
data = pd.read_csv("Fake.csv")
data.head()

| Unnamed: 0 | Text | label |
|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [165]:
data["label"] = data["label"].apply(lambda x: 1 if x == "Real" else 0)
data

| Unnamed: 0 | Text | label |
|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | 0 |
| 1 | U.S. conservative leader optimistic of common ... | 1 |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | 1 |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | 0 |
| 4 | Democrats say Trump agrees to work on immigrat... | 1 |
| ... | ... | ... |
| 9895 | Wikileaks Admits To Screwing Up IMMENSELY Wit... | 0 |
| 9896 | Trump consults Republican senators on Fed chie... | 1 |
| 9897 | Trump lawyers say judge lacks jurisdiction for... | 1 |
| 9898 | WATCH: Right-Wing Pastor Falsely Credits Trum... | 0 |
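
One thing to be aware of with the `lambda` mapping above: any value that is not exactly `"Real"` becomes 0, including typos or missing values. The sketch below, on invented toy data, contrasts it with an explicit dictionary passed to `Series.map`, which leaves unexpected values as `NaN` so they are easy to spot. It is an optional alternative, not what the notebook does.

```python
import pandas as pd

# Toy labels; "real" (lowercase) is a deliberate typo for illustration.
labels = pd.Series(["Real", "Fake", "real", "Fake"])

# The lambda silently maps anything that is not exactly "Real" to 0.
via_lambda = labels.apply(lambda x: 1 if x == "Real" else 0)

# An explicit mapping turns unexpected values into NaN instead.
via_map = labels.map({"Real": 1, "Fake": 0})

print(via_lambda.tolist())        # [1, 0, 0, 0]
print(int(via_map.isna().sum()))  # 1
```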


In [166]:
# data checking
data["label"].value_counts()

label
0    5000
1    4900
Name: count, dtype: int64

In [167]:
# train test split
from sklearn.model_selection import train_test_split
train_text, test_text, train_label, test_label = train_test_split(data["Text"], data["label"], test_size=0.2, random_state=42)

In [168]:
train_label

1665    1
1416    1
7298    1
4700    1
6192    0
       ..
5734    0
5191    0
5390    0
860     1
7270    1
Name: label, Length: 7920, dtype: int64

In [169]:
train_text

1665    As new fiscal year dawns, hope for Illinois bu...
1416    Schumer says U.S. budget deal doable if Trump ...
7298    NRA calls for more regulation after Vegas shoo...
4700    White House says Tillerson to remain as secret...
6192     Scientists Scramble To Copy Climate Data Befo...
                              ...                        
5734     WATCH: Hannity Loses His Sh*t And Refers To H...
5191     Sean Hannity Just Showed EXACTLY How Delusion...
5390     WATCH: CNN Host Makes Trump’s Campaign Manage...
860     Democratic senator lifts hold on Trump antitru...
7270    'Unremarkable' Virginia attacker shows difficu...
Name: Text, Length: 7920, dtype: object
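
The split above is a plain random split. If keeping the class ratio identical in both parts matters, `train_test_split` also accepts a `stratify` argument. The following is a small sketch on invented toy data (the lists `texts` and `labels` stand in for `data["Text"]` and `data["label"]`); the notebook itself does not stratify.

```python
from sklearn.model_selection import train_test_split

texts = [f"doc {i}" for i in range(10)]  # toy stand-ins for data["Text"]
labels = [0, 1] * 5                      # toy stand-ins for data["label"]

# stratify=labels keeps the 50/50 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))  # 8 2
print(sorted(y_test))             # [0, 1]
```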

***text modification***

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.vocab["not"].is_stop = False

In [171]:
def modify_text(text):
    doc = nlp(text)
    # lemmatization, plus stop-word and punctuation removal
    modified_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return modified_tokens
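
`modify_text` returns a list of lemmas, whereas `CountVectorizer` by default expects one string per document. One possible bridge is to join the tokens back into a string before vectorising. This is only a sketch: the token lists below are hypothetical stand-ins for the function's output, and the notebook itself fits its vectorizers on the raw `train_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical output of a lemmatising step: one token list per document.
token_lists = [
    ["work", "machine", "learn"],
    ["work", "learn", "machine"],
]

# Join each token list back into one whitespace-separated string.
cleaned_docs = [" ".join(tokens) for tokens in token_lists]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(cleaned_docs)

print(sorted(bigram_vectorizer.vocabulary_))
# ['learn machine', 'machine learn', 'work learn', 'work machine']
```

Keep in mind that the default `token_pattern` drops single-character tokens, so very short lemmas would disappear again at this step.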
    

In [172]:
# train_tokens = train_text.apply(modify_text)  # train_text is a Series, so apply is called on it directly

In [173]:
# Process all texts efficiently (currently disabled)
# docs = nlp.pipe(data["Text"])
# data["coded_text"] = [
#     [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
#     for doc in docs
# ]

***vectorization***

In [174]:
from sklearn.feature_extraction.text import CountVectorizer

`unigram (monogram)`

In [175]:
vectorizer1 = CountVectorizer()

In [176]:
vectorizer1.fit(train_text)

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |
| ngram_range | `(1, ...)` |


In [177]:
vectorizer1.vocabulary_

{'as': 5823,
 'new': 32914,
 'fiscal': 19178,
 'year': 52290,
 'dawns': 13787,
 'hope': 23344,
 'for': 19590,
 'illinois': 24036,
 'budget': 9080,
 'dims': 15051,
 'chicago': 10589,
 'reuters': 40227,
 'risking': 40572,
 'historic': 23055,
 'drop': 16161,
 'to': 47559,
 'junk': 26497,
 'bond': 8269,
 'status': 44878,
 'began': 7174,
 'its': 25519,
 'third': 47191,
 'straight': 45244,
 'without': 51687,
 'on': 34179,
 'saturday': 41662,
 'political': 36624,
 'maneuvering': 29787,
 'dimmed': 15047,
 'hopes': 23351,
 'bipartisan': 7736,
 'spending': 44386,
 'and': 5130,
 'revenue': 40254,
 'package': 34923,
 'anytime': 5406,
 'soon': 44120,
 'while': 51308,
 'the': 46982,
 'house': 23458,
 'scheduled': 41853,
 'session': 42535,
 'sunday': 45727,
 'take': 46317,
 'up': 49645,
 'senate': 42374,
 'is': 25415,
 'not': 33455,
 'slated': 43597,
 'return': 40206,
 'until': 49585,
 'monday': 31711,
 'nation': 32572,
 'fifth': 19005,
 'largest': 27943,
 'state': 44837,
 'has': 22371,
 'lacked': 27

`bigram`

In [178]:
vectorizer2 = CountVectorizer(ngram_range=(2, 2))

In [179]:
vectorizer2.fit(train_text)

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |
| ngram_range | `(2, ...)` |


In [180]:
vectorizer2.vocabulary_

{'as new': 79897,
 'new fiscal': 497751,
 'fiscal year': 284202,
 'year dawns': 856322,
 'dawns hope': 206498,
 'hope for': 357122,
 'for illinois': 290273,
 'illinois budget': 367074,
 'budget dims': 128443,
 'dims chicago': 224989,
 'chicago reuters': 154246,
 'reuters risking': 633541,
 'risking historic': 638094,
 'historic drop': 353965,
 'drop to': 239263,
 'to junk': 767678,
 'junk bond': 412695,
 'bond status': 120336,
 'status illinois': 701255,
 'illinois began': 367070,
 'began its': 106229,
 'its third': 404274,
 'third straight': 754839,
 'straight fiscal': 705845,
 'year without': 856952,
 'without budget': 847359,
 'budget on': 128559,
 'on saturday': 529671,
 'saturday as': 651048,
 'as political': 80136,
 'political maneuvering': 572202,
 'maneuvering dimmed': 455774,
 'dimmed hopes': 224956,
 'hopes for': 357388,
 'for bipartisan': 288911,
 'bipartisan spending': 115823,
 'spending and': 692360,
 'and revenue': 61448,
 'revenue package': 634125,
 'package anytime': 54

`trigram`

In [181]:
vectorizer3 = CountVectorizer(ngram_range=(3,3))

In [182]:
vectorizer3.fit(train_text)

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |
| ngram_range | `(3, ...)` |


In [183]:
vectorizer3.vocabulary_

{'as new fiscal': 193186,
 'new fiscal year': 1124423,
 'fiscal year dawns': 614333,
 'year dawns hope': 2077001,
 'dawns hope for': 457183,
 'hope for illinois': 801443,
 'for illinois budget': 629166,
 'illinois budget dims': 825607,
 'budget dims chicago': 296256,
 'dims chicago reuters': 495822,
 'chicago reuters risking': 355428,
 'reuters risking historic': 1445460,
 'risking historic drop': 1455415,
 'historic drop to': 796003,
 'drop to junk': 526106,
 'to junk bond': 1824887,
 'junk bond status': 949864,
 'bond status illinois': 283087,
 'status illinois began': 1592060,
 'illinois began its': 825601,
 'began its third': 255859,
 'its third straight': 932603,
 'third straight fiscal': 1775048,
 'straight fiscal year': 1600695,
 'fiscal year without': 614376,
 'year without budget': 2079122,
 'without budget on': 2052996,
 'budget on saturday': 296611,
 'on saturday as': 1224127,
 'saturday as political': 1486506,
 'as political maneuvering': 193891,
 'political maneuvering dim

In [184]:
monogram = vectorizer1.transform(train_text)
bigram = vectorizer2.transform(train_text)
trigram = vectorizer3.transform(train_text)

In [185]:
monogram

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 1745278 stored elements and shape (7920, 52816)>
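
The shape above says 7920 documents by 52816 unigram features, and the vocabulary indices printed earlier suggest the bigram and trigram vocabularies are far larger. A quick way to inspect this for any setting is to look at `shape` and `nnz` of the sparse matrix. The sketch below does so on a tiny invented corpus, so the numbers are not comparable to the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus, for illustration only.
corpus = [
    "the budget deal failed",
    "the senate passed the budget",
    "the house passed the deal",
]

shapes = {}
for n in (1, 2, 3):
    X = CountVectorizer(ngram_range=(n, n)).fit_transform(corpus)
    n_docs, n_features = X.shape
    # nnz is the number of stored (non-zero) entries in the sparse matrix.
    density = X.nnz / (n_docs * n_features)
    shapes[n] = X.shape
    print(f"n={n}: shape={X.shape}, stored={X.nnz}, density={density:.2f}")
```

If the vocabulary becomes too large on real data, `CountVectorizer` parameters such as `min_df` or `max_features` can be used to limit it.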

***classification analysis***

In [186]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [187]:
clf = MultinomialNB()

In [188]:
clf.fit(monogram, train_label)

| Parameter | Value |
|---|---|
| alpha | `1.0` |
| force_alpha | `True` |
| fit_prior | `True` |
| class_prior | |


In [189]:
monogram_ = vectorizer1.transform(test_text)

In [190]:
clf.predict(monogram_[0:8])

array([1, 1, 1, 1, 0, 1, 0, 0])

In [191]:
type(test_label[0:8])

pandas.core.series.Series

In [192]:
preds1 = clf.predict(monogram_)
type(preds1)

numpy.ndarray
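
Since the predictions are a NumPy array and the labels are a pandas Series with a shuffled index, a safe way to compare them by hand is positionally, after converting the Series with `to_numpy()`. A minimal sketch follows; the `truth` values and index are hypothetical, only `preds` repeats the eight predictions printed above.

```python
import numpy as np
import pandas as pd

preds = np.array([1, 1, 1, 1, 0, 1, 0, 0])  # the eight predictions shown above
# Hypothetical true labels with a shuffled index, as train_test_split would produce.
truth = pd.Series([1, 1, 0, 1, 0, 1, 0, 0], index=[42, 7, 19, 3, 88, 61, 5, 30])

# Compare by position, not by index label.
correct = preds == truth.to_numpy()
accuracy = correct.mean()
print(accuracy)  # 0.875
```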

In [193]:
print(classification_report(test_label, preds1))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       973
           1       0.98      0.98      0.98      1007

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980



In [194]:
clf.fit(bigram, train_label)

| Parameter | Value |
|---|---|
| alpha | `1.0` |
| force_alpha | `True` |
| fit_prior | `True` |
| class_prior | |


In [195]:
bigram_ = vectorizer2.transform(test_text)

In [196]:
preds2 = clf.predict(bigram_)

In [197]:
print(classification_report(test_label, preds2))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       973
           1       0.99      0.99      0.99      1007

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980



In [198]:
clf.fit(trigram, train_label)

| Parameter | Value |
|---|---|
| alpha | `1.0` |
| force_alpha | `True` |
| fit_prior | `True` |
| class_prior | |


In [199]:
trigram_ = vectorizer3.transform(test_text)

In [200]:
preds3 = clf.predict(trigram_)
preds3

array([0, 1, 1, ..., 0, 1, 1], shape=(1980,))

In [201]:
print(classification_report(test_label, preds3))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       973
           1       0.98      1.00      0.99      1007

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980
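
The cells above refit the same `clf` object three times, so each fit overwrites the previous model. One way to keep a separate model per n-gram size is to pair each vectorizer with its own classifier in a pipeline. The sketch below assumes a tiny invented corpus and labels; it shows the structure only and says nothing about the scores on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented purely for illustration (1 = Real, 0 = Fake).
train_docs = [
    "senate passes budget bill",
    "house votes on tax bill",
    "you will not believe this shocking video",
    "watch this shocking meltdown video",
]
train_y = [1, 1, 0, 0]
test_docs = ["senate votes on budget", "shocking video you must watch"]

models = {}
predictions = {}
for n in (1, 2, 3):
    # A fresh vectorizer and classifier per n-gram size, so no fitted state is shared.
    model = make_pipeline(CountVectorizer(ngram_range=(n, n)), MultinomialNB())
    model.fit(train_docs, train_y)
    models[n] = model
    predictions[n] = model.predict(test_docs)
    print(n, predictions[n])
```

Each entry in `models` can then be passed raw text directly, for example `models[2].predict(["some headline"])`, without calling `transform` by hand.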

