### Team members
- Syed Razauddin Shahlal; UBID: 50496396
- Tajammul Shuja Sayyad; UBID: 50495179

### Traditional Models for fake news classification

##### This notebook runs in Google Colab (no GPU) and performs a classification task on a fake news dataset from Kaggle.

In [1]:
#install the required libraries
!pip install nltk gensim transformers




In [2]:
import nltk
from nltk.tokenize import word_tokenize
import string
import pandas as pd
from nltk.corpus import stopwords

#### To download the dataset from Kaggle in Google Colab, we use the Kaggle API as shown below

In [4]:
from google.colab import files
# Upload kaggle.json; assigning the result keeps the API key out of the cell output.
uploaded = files.upload()

Saving kaggle.json to kaggle.json

In [5]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
!kaggle datasets download -d saurabhshahane/fake-news-classification

Downloading fake-news-classification.zip to /content
 91% 84.0M/92.1M [00:00<00:00, 105MB/s] 
100% 92.1M/92.1M [00:00<00:00, 108MB/s]


In [7]:
!unzip fake-news-classification.zip

Archive:  fake-news-classification.zip
  inflating: WELFake_Dataset.csv     


##### The data has been unzipped and can now be used 


In [35]:
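# Download the NLTK stop word list; these words are removed from each document.
nltk.download('stopwords')
# word_tokenize also needs the Punkt tokenizer data (newer NLTK releases use 'punkt_tab' instead).
nltk.download('punkt')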


In [35]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

In [35]:
# Define a function that preprocesses one document (a single string) at a time.

# Build the stop word set and the lemmatizer once, not on every call.
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_document(text):
    # Lowercase, then tokenize
    tokens = word_tokenize(text.lower())

    # Keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Lemmatize
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    # Rejoin tokens into a single string
    return ' '.join(tokens)

In [35]:
df=pd.read_csv("/content/WELFake_Dataset.csv")

##### We use the 'text' column of df, which holds the main body of each news article, as the feature and preprocess it.

In [20]:
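df = df.dropna()  # assumption: rows with missing values are dropped first (step not shown); preprocess_document expects strings
df['text_processed'] = df['text'].apply(preprocess_document)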

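##### From text features to numerical features
- Machine learning models require numeric input features, so we convert the raw text into numerical vectors.
- Most words do not appear in most documents, so the resulting document-term matrix is sparse. TF-IDF gives a non-zero score only to terms that occur in a document, and it down-weights terms that occur in many documents.

- $\mathrm{TF}(t) = \dfrac{\text{number of times term } t \text{ appears in a document}}{\text{total number of terms in the document}}$

- $\mathrm{IDF}(t) = \log\left(\dfrac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$

- The TF-IDF score of a term in a document is the product $\mathrm{TF}(t) \times \mathrm{IDF}(t)$. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF and L2-normalizes each document vector by default, so its values differ slightly from these textbook formulas.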
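To make the two formulas above concrete, here is a minimal pure-Python sketch that applies them to a toy corpus. The corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration; this is the textbook definition, not the exact computation performed by `TfidfVectorizer`.

```python
import math

# Toy corpus (hypothetical, not taken from the dataset); each document is already tokenized.
docs = [
    ["election", "fraud", "claim"],
    ["election", "result", "official"],
    ["celebrity", "fraud", "fraud", "scandal"],
]

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing `term`).
    # Assumes `term` occurs in at least one document of the corpus.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "fraud" makes up 2 of the 4 terms in the third document and occurs in 2 of the 3 documents,
# so its score is 0.5 * ln(3 / 2).
score = tfidf("fraud", docs[2], docs)
print(round(score, 4))  # 0.2027
```

A term that occurs in every document gets an IDF of log(1) = 0, which is how very common words end up with no weight.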

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [27]:
# X---> Preprocessed text, y---> label
X_train, X_test, y_train, y_test = train_test_split(df['text_processed'], df['label'], test_size=0.3, random_state=42)


##### We split the data into train and test sets before fitting the TF-IDF vectorizer, so the vocabulary and IDF weights are learned from the training set only and nothing leaks from the test set.

In [29]:
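tfidf_vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training set only
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Reuse the fitted vocabulary and weights on the test set
X_test_tfidf = tfidf_vectorizer.transform(X_test)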
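The fit-on-train, transform-on-test pattern used above can be illustrated with a small pure-Python sketch. The functions `fit_idf` and `transform` and the toy documents are hypothetical stand-ins, not scikit-learn's implementation. The point is that the weights come from the training documents alone, and terms never seen in training are simply ignored at transform time.

```python
import math

def fit_idf(train_docs):
    # Learn the vocabulary and IDF weights from the training documents only.
    n_docs = len(train_docs)
    vocab = sorted({term for doc in train_docs for term in doc})
    return {
        term: math.log(n_docs / sum(1 for doc in train_docs if term in doc))
        for term in vocab
    }

def transform(doc, idf):
    # Score a document with the learned weights; terms unseen in training are skipped.
    return {
        term: (doc.count(term) / len(doc)) * idf[term]
        for term in set(doc)
        if term in idf
    }

# Toy, already tokenized documents (hypothetical).
train_docs = [
    ["stock", "market", "crash"],
    ["market", "rally"],
    ["alien", "market", "hoax"],
]
test_doc = ["alien", "invasion", "hoax", "confirmed"]

idf = fit_idf(train_docs)
test_vector = transform(test_doc, idf)
print(sorted(test_vector))  # ['alien', 'hoax']
```

Had the weights been fitted on train and test together, "invasion" and "confirmed" would have entered the vocabulary and shifted the document counts, which is the leakage the split is meant to prevent.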

##### Naive Bayes model
- Now that the feature set and labels are ready, we try different classification models.

In [30]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [31]:
model=MultinomialNB()

In [32]:
model.fit(X_train_tfidf,y_train)

In [33]:
y_pred = model.predict(X_test_tfidf)

##### After predicting on the test data, we evaluate the model with a classification report.

In [35]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.86      0.85      0.85     10595
           1       0.85      0.87      0.86     10867

    accuracy                           0.86     21462
   macro avg       0.86      0.86      0.86     21462
weighted avg       0.86      0.86      0.86     21462
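For readers unfamiliar with the columns in the report above, the following sketch shows how precision, recall, F1, and support can be computed for one class from the true and predicted labels. The function `binary_report` and the toy labels are hypothetical and only mirror the standard definitions; they are not scikit-learn's code.

```python
def binary_report(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives for the chosen class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    # Support is the number of true samples that belong to the class.
    return {"precision": precision, "recall": recall, "f1": f1,
            "support": tp + fn, "accuracy": accuracy}

# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

report = binary_report(y_true, y_pred)
print(report)
```

The "macro avg" row of the report is the unweighted mean of the per-class values, while "weighted avg" weights each class by its support.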



##### Logistic Regression Model

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
lr=LogisticRegression()

In [38]:
lr.fit(X_train_tfidf,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
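The warning above means the solver stopped at its iteration cap (`max_iter` defaults to 100 in `LogisticRegression`) before converging. One possible remedy, sketched here on a toy corpus that stands in for the real training data, is to raise `max_iter` and then check how many iterations were actually used. The texts and labels below are made up for illustration, and the value 1000 is an arbitrary choice, not a tuned setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical), standing in for X_train / y_train.
texts = [
    "aliens endorse candidate in secret meeting",
    "miracle cure hidden by doctors",
    "shocking hoax exposed by anonymous insider",
    "parliament passes annual budget bill",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
]
labels = [1, 1, 1, 0, 0, 0]

features = TfidfVectorizer().fit_transform(texts)

# A larger iteration cap gives the solver more room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)

# n_iter_ holds the number of iterations actually used; a value below
# max_iter suggests the solver stopped because it converged.
print(clf.n_iter_)
```

Refitting the notebook's model this way may change the reported scores slightly, so the figures would need to be regenerated rather than assumed.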


In [40]:
y_pred=lr.predict(X_test_tfidf)

In [41]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94     10595
           1       0.93      0.95      0.94     10867

    accuracy                           0.94     21462
   macro avg       0.94      0.94      0.94     21462
weighted avg       0.94      0.94      0.94     21462



##### Observations:
- Using only the `text` column, converted to TF-IDF vectors, we get the following accuracy on the test set:
  - Naive Bayes: 86%
  - Logistic regression: 94%

##### Next steps:
- Next, we will try transformer models to improve accuracy. That work is in a separate, attached notebook.