# 1. Importing Libraries

* import pandas as pd imports the pandas library and gives it the short name pd.
* pandas is a powerful Python library used for handling and analyzing data.
* It helps you load, view, clean, and process data, especially data in table form.
* We import TensorFlow (used for building deep learning models).

* `tf.__version__` holds the version of TensorFlow you're using. Useful for checking compatibility.

* nltk: A library for natural language processing.
* re: Used for regular expressions (removing unwanted characters).
* stopwords: Common words like "the", "is", "in" that add little meaning, so we remove them.
* PorterStemmer: Reduces words to their root form (e.g., "running" → "run"). It is imported later, in the preprocessing section.
* numpy: Provides the arrays that the model takes as input.
* train_test_split: Splits the data into training and testing sets.
* The remaining imports are the building blocks needed to create an LSTM model that can handle text data:

 * Embedding: Converts words into dense vector representations for the model.
 * LSTM: A type of neural network layer that remembers context, useful for sequence/text data.
 * Dense: Fully connected layer used to make predictions.
 * pad_sequences: Ensures all text inputs have the same length (needed for training).
 * Sequential: Lets you build a model layer by layer.
 * one_hot: Converts words into integer tokens as part of preprocessing.

In [3]:
import pandas as pd
import tensorflow as tf
import nltk
import re
from nltk.corpus import stopwords
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
import numpy as np
from sklearn.model_selection import train_test_split

* The loading cell reads a CSV file (which contains your dataset) from your Google Drive path using pandas.

* read_csv() is used to load data from a .csv file into a DataFrame, a table-like structure.

* df is the variable where this data is stored. We'll use it to explore and process the data later.

* on_bad_lines='skip' tells pandas to skip any rows in the file that are broken or have formatting issues, so the code doesn't crash.

# 2. Loading the Dataset

In [4]:
df = pd.read_csv('/content/drive/MyDrive/Dataset/WELFake_Dataset.csv',on_bad_lines='skip')
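
To see what on_bad_lines='skip' does without the real dataset, here is a small sketch that parses an in-memory CSV. The sample rows are made up for illustration; only read_csv and on_bad_lines come from the cell above, and the option assumes pandas 1.3 or newer.

```python
import io
import pandas as pd

# Made-up sample: the third data row has an extra field, so it is malformed.
raw = io.StringIO(
    "title,text,label\n"
    "A,first story,1\n"
    "B,second story,0\n"
    "C,third story,1,unexpected extra field\n"
)

# The malformed row is silently skipped instead of raising a parser error.
sample_df = pd.read_csv(raw, on_bad_lines='skip')
print(sample_df.shape)
```

Skipping is convenient, but it also hides how many rows were lost, so it can be worth comparing the row count against the raw file when that matters.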

# 3. Data Exploration and Cleaning

* This shows the first 5 rows of your dataset.
* It helps you take a quick look at the data: what columns are there, what kind of values are in the rows, etc.

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


* This shows the dimensions of your dataset — how many rows and columns it has.

* It returns a tuple like (number_of_rows, number_of_columns).

In [7]:
df.shape

(72412, 4)

* df.isnull() checks which cells in the dataset are empty or have missing values (NaN).

* .sum() adds up the True values (which represent missing data) column by column.

In [9]:
df.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
title,560
text,39
label,0


* dropna() removes any rows that contain missing (null) values from the dataset.

* df = means you’re saving the cleaned data back into the same variable, so now df has only complete rows.

In [10]:
df = df.dropna()
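
The two steps above (counting missing values, then dropping incomplete rows) can be tried on a toy table. The DataFrame below is invented for illustration and only mirrors the column names of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one row is missing its title, another is missing its text.
toy = pd.DataFrame({
    "title": ["Headline one", None, "Headline three"],
    "text": ["Body one", "Body two", np.nan],
    "label": [1, 0, 1],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() keeps only rows with no missing value in any column.
cleaned = toy.dropna()
print(cleaned.shape)
```

In the notebook itself, the row count goes from 72412 to 71813 after dropna(), a loss of 599 rows, which equals the 560 missing titles plus the 39 missing texts.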

* We're creating a new variable X that contains all columns except the label column.

* axis=1 means you're dropping a column (not a row).

* The label column likely contains the target values.

In [12]:
X = df.drop('label',axis=1)

# 4. Feature and Target Separation

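* y holds the label column, which is the target the model will learn to predict.
* Checking X.shape and y.shape confirms that the inputs (X) and outputs (y) have the same number of rows and are ready for training.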

In [14]:
X.shape

(71813, 3)

In [15]:
y = df['label']

In [17]:
y.shape

(71813,)

* We're setting the vocabulary size to 5000.
* one_hot hashes each word to an integer index below this size. It does not rank words by frequency, so two different words can occasionally share the same index.
* We make a copy of the input features (X) in a new variable, messages, so X itself stays unchanged.
* We remove extra columns that the model does not use.
* We reset the row numbers to 0, 1, 2, ... because dropna() left gaps in the index and the cleaning loop looks rows up by number.

In [21]:
voc_size = 5000

In [22]:
messages = X.copy()

In [35]:
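# 'index' only exists after an earlier reset_index(), so drop just the leftover CSV index column
messages = messages.drop(['Unnamed: 0'],axis=1)

In [37]:
# drop=True discards the old index instead of adding it back as an 'index' column
messages.reset_index(drop=True,inplace=True)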

# 5. Text Preprocessing

* nltk.download('stopwords'): Downloads the list of English stopwords.
* **ps** is the stemmer object we create now and use later when cleaning each word.

* **This loop cleans each news title:**
 * Removes punctuation/numbers.
 * Converts to lowercase.
 * Splits into words.
 * Removes stopwords.
 * Applies stemming.
 * Joins the cleaned words into one string.
 * Adds the cleaned title to the corpus list.

**The next section converts each cleaned sentence into a list of integers using one_hot.
Each word is hashed to a number between 1 and voc_size - 1 (4999).**

In [29]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [43]:
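from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

corpus = []

for i in range(len(messages)):
  # Keep letters only, then lowercase and split into words
  review = re.sub('[^a-zA-Z]',' ',messages['title'][i])
  review = review.lower()
  review = review.split()

  # Drop stopwords and stem the remaining words
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)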
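
To follow what the loop does to a single title, here is a standard-library sketch of the same steps, minus stemming. The stopword set and the sample headline are hypothetical stand-ins; the notebook uses the full NLTK English list and PorterStemmer.

```python
import re

# Hypothetical mini stopword list, standing in for stopwords.words('english').
STOP_WORDS = {"the", "is", "in", "on", "a", "of", "to"}

def clean_title(title):
    """Letters only, lowercase, split, drop stopwords, rejoin (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', title)
    words = letters_only.lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return ' '.join(kept)

print(clean_title("BREAKING: The Senate is in session on 3 bills!"))
# prints: breaking senate session bills
```

The punctuation and the digit disappear in the first step, and the four stopwords are filtered out before the words are joined again.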


# 6. One-Hot Encoding and Padding

In [64]:
onehot_repr = [one_hot(words,voc_size) for words in corpus]
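
Despite its name, one_hot returns integer indices obtained by hashing, not one-hot vectors. The sketch below illustrates the idea with a hypothetical helper, hash_encode, built on hashlib.md5 so that results are repeatable. It is not the Keras implementation, and its indices will not match those produced by one_hot.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an index in 1..n-1 with a stable hash (illustrative only)."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Modulo n - 1, plus 1, keeps index 0 free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

encoded = hash_encode("senat vote bill senat", 5000)
print(encoded)
```

The same word always maps to the same index, which is what the model relies on. Because the index space is limited, unrelated words can collide, and a larger voc_size makes that less likely.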


Embedding Representation

* Every news title will be represented by 20 words.
* Shorter titles will be padded; longer ones will be trimmed.

* pad_sequences() makes all sequences the same length, which is required for deep learning models.
* padding='pre' means if a sequence is shorter than 20, it will add zeros at the beginning.
* maxlen=sent_length makes sure no sequence is longer than 20 — extra words will be cut from the beginning.

* Embedding layer converts each word number (from one-hot encoding) into a dense vector of 40 values.

* input_dim=voc_size means it expects words represented by numbers from a vocabulary of size 5000.

* output_dim=embedding_vector_features defines the size of each word vector.

* Adds an **LSTM layer** with 100 units.

* LSTM (Long Short-Term Memory) helps the model remember word order and context, which is important in text.

* **Dense(1)** adds one neuron to predict either 0 or 1 (fake or real).

* **sigmoid activation** squashes output between 0 and 1, so it can be interpreted as probability.

* loss='binary_crossentropy': Used for binary classification tasks (fake vs real).

* optimizer='adam': Efficient algorithm to update weights and train the model.

* metrics=['accuracy']: Track accuracy during training and testing.

In [46]:
sent_length = 20
embedded_docs = pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 3469 4779 3727]
 [   0    0    0 ...  152 4520 3727]
 [   0    0    0 ...  601 1284 2328]
 ...
 [   0    0    0 ... 4056 1496 3119]
 [   0    0    0 ...  326 1905 2806]
 [   0    0    0 ...  983 3592 3009]]
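
The padding and trimming behaviour described above can be reproduced in a few lines of NumPy. This is a sketch of the 'pre' behaviour only; pad_pre is a hypothetical helper, and maxlen=4 is used instead of 20 to keep the output readable.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen items of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]              # long sequences lose their first items
        if tail:
            out[row, -len(tail):] = tail  # zeros remain on the left
    return out

padded = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 0 7 8]
#  [3 4 5 6]]
```

The short sequence gains two leading zeros, and the long one keeps only its last four tokens, which is why the printed embedded_docs rows start with zeros.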


* Each word will be turned into a 40-length vector that the model can learn from.
* Sequential() creates a model where you can stack layers one after another.

# 7. Building the LSTM Model

In [49]:
embedding_vector_features = 40
model = Sequential()
model.add(Embedding(input_dim= voc_size,output_dim=embedding_vector_features))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.build(input_shape=(None,sent_length))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
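
As a sanity check on model.summary(), the number of trainable parameters in each layer can be worked out by hand. The arithmetic below assumes the default Keras settings for these layers (biases enabled, a standard LSTM cell with four gates); lstm_units is a name introduced here for the value 100.

```python
voc_size = 5000
embedding_vector_features = 40
lstm_units = 100  # the value passed to LSTM(100)

# Embedding: one vector of length 40 for every vocabulary index.
embedding_params = voc_size * embedding_vector_features

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_vector_features + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 200000 56400 101 256501
```

Most of the parameters sit in the embedding table, so voc_size and the vector length are the main levers on model size.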

# 8. Preparing Data for Training

In this part of the code, the preprocessed text data and labels are converted into NumPy arrays and split into training and testing sets (70/30). The LSTM model is then trained on the training data for 10 epochs with a batch size of 64, while also validating on the test set. After training, the model makes predictions, which are converted into binary values using a 0.5 threshold. Finally, the model’s performance is evaluated using a confusion matrix, accuracy score, and a classification report showing precision, recall, and F1-score.

In [50]:
len(embedded_docs),y.shape

(71813, (71813,))

In [52]:
X_final = np.array(embedded_docs)
y_final = np.array(y)

In [53]:
X_final.shape,y_final.shape

((71813, 20), (71813,))

In [54]:
X_train,X_test,y_train,y_test = train_test_split(X_final,y_final,test_size = 0.30,random_state=42)
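
The sizes produced by the 70/30 split, and the step counts Keras reports during training and prediction, follow from simple arithmetic. The sketch assumes that train_test_split rounds the test count up and that model.predict uses its default batch size of 32.

```python
import math

n_samples = 71813
test_size = 0.30

# Test rows are rounded up; the remaining rows go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# One step per batch; the last batch may be smaller than the others.
train_steps = math.ceil(n_train / 64)    # batch_size=64 in model.fit
predict_steps = math.ceil(n_test / 32)   # assumed default batch size of model.predict
print(n_test, n_train, train_steps, predict_steps)
# prints: 21544 50269 786 674
```

These values line up with the 786 steps per epoch and 674 prediction steps in the notebook's progress output, and with the support of 21544 in the classification report.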

# 9. Model Training

In [55]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 10s 7ms/step - accuracy: 0.8127 - loss: 0.3922 - val_accuracy: 0.8826 - val_loss: 0.2727
Epoch 2/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 9s 9ms/step - accuracy: 0.9123 - loss: 0.2154 - val_accuracy: 0.8931 - val_loss: 0.2500
Epoch 3/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 5s 6ms/step - accuracy: 0.9305 - loss: 0.1768 - val_accuracy: 0.8938 - val_loss: 0.2562
Epoch 4/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9383 - loss: 0.1570 - val_accuracy: 0.8955 - val_loss: 0.2595
Epoch 5/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 5s 6ms/step - accuracy: 0.9480 - loss: 0.1315 - val_accuracy: 0.8957 - val_loss: 0.2796
Epoch 6/10
786/786 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.9549 - loss: 0.1128 - val_accuracy: 0.8948 - val_loss: 0.3384
Epoch 7/10
786/786

<keras.src.callbacks.history.History at 0x7e5a502baa10>

# 10. Prediction and Evaluation

In [56]:
y_pred = model.predict(X_test)

674/674 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step


In [57]:
y_pred = np.where(y_pred > 0.5,1,0)
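
The thresholding step is easy to check on a handful of values. The probabilities below are made up, shaped as one column per sample like the output of a Dense(1) sigmoid layer.

```python
import numpy as np

# Made-up sigmoid outputs, one row per sample.
probs = np.array([[0.91], [0.08], [0.50], [0.73]])

# Strictly greater than 0.5 becomes class 1; everything else becomes class 0.
labels = np.where(probs > 0.5, 1, 0)
print(labels.ravel())
# prints: [1 0 0 1]
```

A value of exactly 0.5 lands in class 0 because the comparison is strict. The array keeps its (n, 1) shape, which the scikit-learn metric functions used below accept.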

In [60]:
from sklearn.metrics import confusion_matrix

In [61]:
confusion_matrix(y_test,y_pred)

array([[9298, 1320],
       [1043, 9883]])

In [62]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8903174897883401

In [63]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.88      0.89     10618
           1       0.88      0.90      0.89     10926

    accuracy                           0.89     21544
   macro avg       0.89      0.89      0.89     21544
weighted avg       0.89      0.89      0.89     21544
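
The figures in the report can be recomputed directly from the confusion matrix, which helps when reading them. In scikit-learn's layout, rows are true labels and columns are predicted labels; the variable names below are introduced only for this sketch.

```python
# Confusion matrix from the notebook output above.
tn, fp = 9298, 1320   # true label 0: predicted 0, predicted 1
fn, tp = 1043, 9883   # true label 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 1: how many predicted 1s were right, and how many true 1s were found.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same ratios with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```

Rounded to two decimals, these reproduce the precision and recall columns of the report for both classes.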

