{ “cells”: \[ { “cell_type”: “markdown”, “metadata”: {}, “source”: \[
“\# Detecting Fake News using Support Vector Machines (SVM)”, “\###
Step-by-Step Implementation in Google Colab with Multilingual Support
(English & Chinese)” \] }, { “cell_type”: “markdown”, “metadata”: {},
“source”: \[ “\## Step 1: Install & Import Necessary Libraries”, “We
first install googletrans, then import the required Python libraries:
NLTK for natural language processing, scikit-learn (sklearn) for machine
learning, and pandas for data manipulation. googletrans translates the
Chinese text into English.” \] }, { “cell_type”: “code”, “execution_count”: null,
“metadata”: {}, “outputs”: \[\], “source”: \[ “\# Install & Import
Necessary Libraries”, “!pip install googletrans==4.0.0-rc1”, “import
pandas as pd \# Data handling”, “import numpy as np \# Numerical
operations”, “import re \# Regular expressions for text cleaning”,
“import string \# String operations”, “import nltk \# Natural Language
Processing”, “from nltk.corpus import stopwords \# List of stopwords”,
“from nltk.tokenize import word_tokenize \# Tokenization”, “from
nltk.stem import WordNetLemmatizer \# Lemmatization”, “from googletrans
import Translator \# Translation from Chinese to English”, “from
sklearn.feature_extraction.text import TfidfVectorizer \# Convert text
to numerical representation”, “from sklearn.model_selection import
train_test_split \# Splitting data”, “from sklearn.svm import SVC \#
Support Vector Machine model”, “from sklearn.metrics import
accuracy_score, classification_report \# Model evaluation”, “”, “\#
Download necessary resources for NLTK”, “nltk.download(‘stopwords’)”,
“nltk.download(‘punkt’)”, “nltk.download(‘punkt_tab’) \# Tokenizer data that word_tokenize needs in newer NLTK releases”, “nltk.download(‘wordnet’)” \] }, {
“cell_type”: “markdown”, “metadata”: {}, “source”: \[ “\## Step 2: Load
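and Merge Datasets”, “We load two datasets and merge them: Weibo21
(train.csv, test.csv, val.csv) and the Kaggle Fake News dataset (True.csv,
Fake.csv). Labels are encoded as 1 for fake and 0 for real, and the Chinese Weibo21 text is translated to English so that a single English pipeline can process both sources.” \] }, {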
“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”:
\[\], “source”: \[ “\# Load and Merge Datasets”, “df_train =
pd.read_csv(‘train.csv’)”, “df_test = pd.read_csv(‘test.csv’)”, “df_val
= pd.read_csv(‘val.csv’)”, “df_fake = pd.read_csv(‘Fake.csv’)”, “df_real
= pd.read_csv(‘True.csv’)”, “df_fake\[‘label’\] = 1”,
“df_real\[‘label’\] = 0”, “df_train\[‘label’\] =
df_train\[‘label’\].map({‘fake’: 1, ‘real’: 0})”, “df_test\[‘label’\] =
df_test\[‘label’\].map({‘fake’: 1, ‘real’: 0})”, “df_val\[‘label’\] =
df_val\[‘label’\].map({‘fake’: 1, ‘real’: 0})”, “df_kaggle =
pd.concat(\[df_fake, df_real\])”, “df_weibo = pd.concat(\[df_train,
df_test, df_val\])”, “”, “\# Translate Chinese text to English”,
“translator = Translator()”, “df_weibo\[‘content’\] =
df_weibo\[‘content’\].apply(lambda x: translator.translate(x,
src=‘zh-cn’, dest=‘en’).text)”, “”, “\# Merge both datasets”,
“df_combined = pd.concat(\[df_kaggle\[\[‘title’, ‘text’, ‘label’\]\],
df_weibo\[\[‘content’, ‘label’\]\].rename(columns={‘content’:
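‘text’})\])”, “df_combined =
df_combined.dropna(subset=\[‘text’, ‘label’\]) \# Drop rows with missing text or unmapped labels”, “df_combined =
df_combined.sample(frac=1, random_state=42).reset_index(drop=True) \# Shuffle rows reproducibly”,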
“df_combined.head()” \] }, { “cell_type”: “markdown”, “metadata”: {},
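“source”: \[ “Calling the translator once per row is slow, and a single failed request aborts the whole `apply`. The sketch below shows one way to make that step more forgiving, under stated assumptions: `translate_all` is a hypothetical helper (not part of googletrans) that accepts any callable mapping a string to a string, translates each distinct text only once, and falls back to the original text when the callable raises. `fake_translate` is a stand-in so the example runs without network access.

```python
def translate_all(texts, translate_fn):
    """Translate each text with translate_fn, reusing results for duplicates.

    translate_all is a hypothetical helper, not part of googletrans.
    translate_fn is any callable that takes a string and returns a string.
    """
    cache = {}      # source text -> translated text
    results = []
    for text in texts:
        # Non-string or blank entries (e.g. missing values) become empty strings
        if not isinstance(text, str) or not text.strip():
            results.append("")
            continue
        if text not in cache:
            try:
                cache[text] = translate_fn(text)
            except Exception:
                # Keep the untranslated text rather than aborting the whole run
                cache[text] = text
        results.append(cache[text])
    return results


# Stand-in translator for illustration only; it fails on one input on purpose
def fake_translate(text):
    if text == "坏":
        raise RuntimeError("simulated network error")
    return {"你好": "hello"}.get(text, text)


translated = translate_all(["你好", "坏", None, "你好"], fake_translate)
print(translated)
```

In this notebook, `translate_fn` could plausibly be `lambda t: translator.translate(t, src='zh-cn', dest='en').text`, with the result assigned back to the `content` column. Failed texts are cached as-is, so they are not retried within the same call.” \] }, { “cell_type”: “markdown”, “metadata”: {},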
“source”: \[ “\## Step 3: Data Preprocessing”, “We clean the text by
converting it to lowercase, removing digits and punctuation, dropping
English stop words, and lemmatizing the remaining words to their base
form.” \] }, { “cell_type”: “code”, “execution_count”:
null, “metadata”: {}, “outputs”: \[\], “source”: \[ “\# Data
Preprocessing”, “lemmatizer = WordNetLemmatizer()”, “stop_words =
set(stopwords.words(‘english’))”, “”, “def clean_text(text):”,
“    text = text.lower() \# Lowercase”,
“    text = re.sub(r‘\\d+’, ‘’, text) \# Remove runs of digits”,
“    text = re.sub(r‘\[^\\w\\s\]’, ‘’, text) \# Remove punctuation”,
“    tokens = word_tokenize(text)”,
“    \# Drop stop words, then lemmatize the remaining tokens”,
“    return ‘ ’.join(lemmatizer.lemmatize(word) for word in tokens if word not in stop_words)”, “”,
“df_combined\[‘processed_text’\] = df_combined\[‘text’\].apply(clean_text)”,
“df_combined\[\[‘text’, ‘processed_text’\]\].head()” \] }, { “cell_type”: “markdown”,
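“metadata”: {}, “source”: \[ “The two regular expressions in `clean_text` are easy to get wrong: one should match runs of digits, the other any character that is neither a word character nor whitespace. Below is a minimal sketch of just those steps using only the standard library. The tiny `STOP_WORDS` set and the whitespace split are stand-ins (assumptions) for the NLTK stop word list and `word_tokenize`, and lemmatization is left out.

```python
import re

# Stand-in for stopwords.words("english"); an assumption to keep the example small
STOP_WORDS = {"the", "in", "a", "is"}


def basic_clean(text):
    text = text.lower()
    text = re.sub(r"\d+", "", text)       # remove runs of digits
    text = re.sub(r"[^\w\s]", "", text)   # remove anything that is not a word character or whitespace
    # str.split() stands in for word_tokenize here
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = basic_clean("BREAKING: The president said 42 things in 2020!!!")
print(cleaned)  # breaking president said things
```

Without the backslashes, the first pattern would instead match runs of the letter d, and the second would delete every character except the letters w and s.” \] }, { “cell_type”: “markdown”,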
“metadata”: {}, “source”: \[ “\## Step 4: Feature Extraction using
TF-IDF”, “Convert the cleaned text into a numerical representation using
TF-IDF vectorization over unigrams and bigrams, keeping the 5,000 most frequent terms.” \] }, { “cell_type”: “code”, “execution_count”:
null, “metadata”: {}, “outputs”: \[\], “source”: \[ “\# Feature
Extraction”, “vectorizer = TfidfVectorizer(ngram_range=(1,2),
max_features=5000)”, “X =
vectorizer.fit_transform(df_combined\[‘processed_text’\])”, “y =
df_combined\[‘label’\]” \] }, { “cell_type”: “markdown”, “metadata”: {},
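“source”: \[ “To make the TF-IDF weights less opaque, here is a small pure-Python sketch of the computation for unigrams on a toy corpus. It follows the formula scikit-learn documents for its default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The corpus and the `tfidf_rows` helper are made up for illustration; `TfidfVectorizer` remains the tool used in this notebook.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows([
    "shocking miracle cure",
    "government announces budget",
    "shocking budget leak",
])
first = dict(zip(vocab, rows[0]))
# "miracle" occurs in one document and "shocking" in two, so "miracle" gets the larger weight
print(first["miracle"] > first["shocking"])
```

Terms that are rare across the corpus receive larger weights, which is what lets the classifier focus on distinctive wording rather than common vocabulary.” \] }, { “cell_type”: “markdown”, “metadata”: {},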
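“source”: \[ “\## Step 5: Train and Evaluate SVM Model”, “Hold out 20% of
the data as a test set, train a linear-kernel SVM classifier on the rest,
and report accuracy along with per-class precision, recall, and F1.” \] }, { “cell_type”: “code”,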
“execution_count”: null, “metadata”: {}, “outputs”: \[\], “source”: \[
“\# Train and Evaluate Model”, “X_train, X_test, y_train, y_test =
train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) \# Keep the fake/real ratio the same in both splits”, “svm_model =
SVC(kernel=‘linear’, C=1)”, “svm_model.fit(X_train, y_train)”, “y_pred =
svm_model.predict(X_test)”, “print(‘Accuracy:’, accuracy_score(y_test,
y_pred))”, “print(‘Classification Report:’)”,
“print(classification_report(y_test, y_pred, target_names=\[‘real’, ‘fake’\]))” \] } \], “metadata”: {},
“nbformat”: 4, “nbformat_minor”: 4 }