### **Bag of n-grams**

- Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news spreads faster than real news and creates problems and fear among groups and in society.

- We will address this problem with classical NLP techniques by classifying whether a given message or text is **Real or Fake**.

- You will use a bag of n-grams to turn the text into features and then apply different classification algorithms.

- sklearn's CountVectorizer has a built-in implementation of bag of words, and its `ngram_range` parameter extends it to a bag of n-grams.


### **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


- This data consists of two columns:
    - Text
    - label
- `Text` contains statements or messages about a particular event or situation.

- `label` tells whether the given text is Fake or Real.

- As there are only 2 classes, this is a **Binary Classification** problem.


In [3]:
import pandas as pd
import numpy as np
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
df = pd.read_csv('Fake_Real_Data.csv')
df.head()

| | Text | label |
|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [5]:
df.label.value_counts()

label
Fake    5000
Real    4900
Name: count, dtype: int64

In [6]:
df['label_num']=df['label'].apply(lambda x: 1 if x=='Real' else 0)

In [7]:
df.head()

| | Text | label | label_num |
|---|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake | 0 |
| 1 | U.S. conservative leader optimistic of common ... | Real | 1 |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real | 1 |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake | 0 |
| 4 | Democrats say Trump agrees to work on immigrat... | Real | 1 |


### **Modelling without Pre-processing Text data**

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Text'],df['label_num'],test_size=0.2)

In [9]:
X_train.shape, X_test.shape

((7920,), (1980,))
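
The split above is not seeded, so the exact rows in each part, and therefore the scores in the reports that follow, can change from run to run. Below is a minimal sketch of a reproducible, stratified split. It uses a hypothetical toy frame (`toy`, 60 Fake and 40 Real rows) and the same `random_state` and `stratify` arguments that this notebook uses later for the pre-processed text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df (hypothetical data: 60 Fake rows, 40 Real rows).
toy = pd.DataFrame({
    "Text": [f"message {i}" for i in range(100)],
    "label_num": [0] * 60 + [1] * 40,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Text"],
    toy["label_num"],
    test_size=0.2,
    random_state=2022,          # fixed seed: the same split on every run
    stratify=toy["label_num"],  # keep the 60/40 class ratio in both parts
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With `stratify`, the 20 test rows keep the original class ratio (12 of class 0 and 8 of class 1), so per-class metrics are computed on a representative sample.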

**Attempt 1**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use KNN as the classifier with n_neighbors of 10 and 'euclidean' as the distance metric.
- Print the classification report.


In [13]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('cv',CountVectorizer(ngram_range=(1,3))),
    ('knn',KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

pipe.fit(X_train, y_train)
print(classification_report(y_test,pipe.predict(X_test)))

              precision    recall  f1-score   support

           0       0.95      0.46      0.62       969
           1       0.65      0.98      0.78      1011

    accuracy                           0.72      1980
   macro avg       0.80      0.72      0.70      1980
weighted avg       0.80      0.72      0.70      1980



**Attempt 2**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **KNN** as the classifier with n_neighbors of 10 and 'cosine' as the distance metric.
- Print the classification report.


In [14]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('cv',CountVectorizer(ngram_range=(1,3))),
    ('knn',KNeighborsClassifier(n_neighbors=10, metric='cosine'))
])

pipe.fit(X_train, y_train)
print(classification_report(y_test,pipe.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      0.56      0.72       969
           1       0.70      1.00      0.82      1011

    accuracy                           0.78      1980
   macro avg       0.85      0.78      0.77      1980
weighted avg       0.85      0.78      0.77      1980




**Attempt 3**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams (`ngram_range=(1,3)`).
- Use **RandomForest** as the classifier.
- Print the classification report.


In [15]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('cv',CountVectorizer(ngram_range=(1,3))),
    ('rf',RandomForestClassifier())
])

pipe.fit(X_train, y_train)
print(classification_report(y_test,pipe.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       969
           1       1.00      1.00      1.00      1011

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980




**Attempt 4**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams (`ngram_range=(1,3)`).
- Use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
- Print the classification report.


In [16]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('cv',CountVectorizer(ngram_range=(1,3))),
    ('nb',MultinomialNB(alpha=0.75))
])

pipe.fit(X_train, y_train)
print(classification_report(y_test,pipe.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       969
           1       0.99      0.99      0.99      1011

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [18]:
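def preprocess(text):
    # Tokenize and annotate the text with the spaCy pipeline loaded earlier
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        # Skip stop words and punctuation; keep the lemma (base form) of the rest
        if not token.is_stop and not token.is_punct:
            filtered_tokens.append(token.lemma_)
    return ' '.join(filtered_tokens)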

**Build a model with pre-processed text**

In [19]:
df['preprocessed_text'] = df['Text'].apply(preprocess)

In [20]:
df.head()

| | Text | label | label_num | preprocessed_text |
|---|---|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake | 0 | Trump Surrogate BRUTALLY Stabs Pathetic vide... |
| 1 | U.S. conservative leader optimistic of common ... | Real | 1 | U.S. conservative leader optimistic common gro... |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real | 1 | trump propose U.S. tax overhaul stir concern d... |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake | 0 | Court Forces Ohio allow million illegally pu... |
| 4 | Democrats say Trump agrees to work on immigrat... | Real | 1 | Democrats Trump agree work immigration bill wa... |


In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['preprocessed_text'],
    df['label_num'],
    test_size=0.2,
    random_state=2022,
    stratify=df['label_num']
)

**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [25]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('cv',CountVectorizer(ngram_range=(3,3))),
    ('rf',RandomForestClassifier())
])

pipe.fit(X_train, y_train)
print(classification_report(y_test,pipe.predict(X_test)))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1000
           1       0.99      0.93      0.96       980

    accuracy                           0.96      1980
   macro avg       0.96      0.96      0.96      1980
weighted avg       0.96      0.96      0.96      1980



**Attempt 2**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [27]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

pipe1 = Pipeline([
    ('cv',CountVectorizer(ngram_range=(1,3))),
    ('rf',RandomForestClassifier())
])

pipe1.fit(X_train, y_train)
print(classification_report(y_test,pipe1.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1000
           1       1.00      0.99      1.00       980

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980



In [28]:
# Finally, print the confusion matrix for the best model
from sklearn.metrics import confusion_matrix
y_pred = pipe1.predict(X_test)
print(confusion_matrix(y_test,y_pred))

[[999   1]
 [  6 974]]


In [29]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.9964646464646465
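
As a check on how these numbers relate, the sketch below recomputes accuracy, recall, and precision directly from the confusion matrix printed above. It assumes sklearn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[999, 1],
               [6, 974]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions / true members of the class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: correct predictions / predicted members (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall.round(3), precision.round(3))
```

The diagonal holds 999 + 974 = 1973 correct predictions out of 1980, which gives the accuracy of about 0.9965 reported by `accuracy_score`.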


## **Please write down Final Observations**


##### Final Observations
##### As machine learning algorithms do not work on text data directly, we need to convert the text into numeric vectors and feed those to the models during training.

##### In this process, we convert text into a very high-dimensional numeric vector using the bag-of-words technique, implemented here with sklearn's CountVectorizer.

##### Without Pre-Processing Data

##### In most of the cases above, performance tends to degrade once the count vectorizer reaches trigrams or goes beyond them. A likely reason is that as the ngram_range widens, the number of dimensions/features (distinct word combinations) grows enormously, the vectors become sparser, and the models risk overfitting and generalizing poorly.
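
The growth in feature count can be seen even on a tiny corpus. This is a minimal sketch on three made-up headlines (not rows from the dataset); on the real data the vocabulary is far larger, but the trend is the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical headlines).
corpus = [
    "trump proposes tax overhaul",
    "court forces ohio to allow votes",
    "democrats say trump agrees to work on immigration",
]

# Vocabulary size for progressively wider n-gram ranges.
sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    cv = CountVectorizer(ngram_range=ngram_range)
    cv.fit(corpus)
    sizes[ngram_range] = len(cv.vocabulary_)

print(sizes)
```

Every widening of `ngram_range` adds new columns while the number of documents stays fixed, so each document's vector becomes sparser.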

##### For this reason, KNN with n-grams up to trigrams and euclidean distance performed poorly: recall was 0.98 for class 1 but only 0.46 for class 0. K-Nearest Neighbours (KNN) doesn't work well with high-dimensional, sparse data because distances between points become less informative as the number of dimensions grows, and euclidean distance on raw counts is dominated by document length rather than by shared vocabulary. Computing distances in such a space is also expensive, which slows down prediction.

##### Both recall and F1 score for class 0 improve (recall from 0.46 to 0.56, F1 from 0.62 to 0.72) when the same KNN model is trained with cosine distance. Cosine distance uses the angle between the two text vectors to calculate similarity, so it is not influenced by the magnitude of the vectors, that is, by document length.
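
The difference between the two metrics can be illustrated with a small sketch. The three count vectors below are hypothetical: `long_doc` has the same word mix as `short_doc` but is ten times longer, while `other_doc` has a different word mix and a similar length.

```python
import numpy as np

# Hypothetical count vectors over the same 4-term vocabulary.
short_doc = np.array([1, 2, 0, 1], dtype=float)
long_doc = 10 * short_doc                         # same word mix, ten times longer
other_doc = np.array([0, 1, 2, 1], dtype=float)   # different word mix, similar length

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
print(cosine_distance(short_doc, long_doc), cosine_distance(short_doc, other_doc))
```

Under euclidean distance the longer copy of the same text looks much farther away than a document about something else, whereas cosine distance treats the two same-mix documents as (numerically) identical.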

##### The Naive Bayes and RandomForest models both performed really well, and RandomForest has a slight edge on the recall metric (1.00 versus 0.99 for both classes).

##### Random Forest uses bootstrapping (row and column sampling) across many decision trees, which reduces the high variance and overfitting that high-dimensional data tends to cause, and its trees split on the words and n-grams that best separate the categories.

##### Multinomial NaiveBayes is a text-classification-friendly algorithm mainly because the per-class probabilities of the words in the corpus (Bag of words) are easy to calculate and store in a table of counts.
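
To show what that probability table looks like, the sketch below recomputes the smoothed word likelihoods by hand and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The four texts and their labels are hypothetical; `alpha=0.75` matches the value used in Attempt 4.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data (hypothetical): 0 = Fake, 1 = Real.
texts = [
    "shocking video shocking",
    "senate passes tax bill",
    "shocking claim video",
    "senate tax vote",
]
labels = np.array([0, 1, 0, 1])

alpha = 0.75
cv = CountVectorizer()
X = cv.fit_transform(texts)
nb = MultinomialNB(alpha=alpha).fit(X, labels)

# Manual estimate: P(word | class) = (count + alpha) / (total count + alpha * V)
counts = X.toarray()
class_counts = np.vstack([counts[labels == c].sum(axis=0) for c in nb.classes_])
V = counts.shape[1]  # vocabulary size
manual_log_prob = np.log(
    (class_counts + alpha) / (class_counts.sum(axis=1, keepdims=True) + alpha * V)
)

print(np.allclose(manual_log_prob, nb.feature_log_prob_))
print(nb.predict(cv.transform(["shocking video"])))
```

Training amounts to counting words per class and applying the smoothing term, which is why the model fits quickly even with a very large n-gram vocabulary.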

##### With Pre-Processing Data

##### We trained the best model so far, RandomForest, on the pre-processed data, but RandomForest with only trigrams fails to produce the same results here: accuracy drops to 0.96.

##### The same RandomForest with unigram-to-trigram features produces excellent results and tops the entire list, with an accuracy of about 0.996 and very good F1 and recall scores.

##### Machine learning is a trial-and-error, empirical process: we keep trying the algorithms available to us and select the one that gives good results and satisfies requirements such as latency and interpretability.

## [**Solution**](./bag_of_n_grams_exercise_solutions.ipynb)