# Detecting Fake News using Support Vector Machines (SVM)
### Step-by-Step Implementation in Google Colab with Multilingual Support (English & Chinese)

## Step 1: Install & Import Necessary Libraries
We first install `deep_translator` and import the required Python libraries: NLTK for natural language processing, scikit-learn for machine learning, and pandas for data manipulation. `deep_translator` is used to translate the Chinese text into English.

In [6]:
# Install & Import Necessary Libraries
!pip install deep_translator
import pandas as pd  # Data handling
import numpy as np  # Numerical operations
import re  # Regular expressions for text cleaning
import string  # String operations
import nltk  # Natural Language Processing
from nltk.corpus import stopwords  # List of stopwords
from nltk.tokenize import word_tokenize  # Tokenization
from nltk.stem import WordNetLemmatizer  # Lemmatization
from deep_translator import GoogleTranslator  # Translation from Chinese to English
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to numerical representation
from sklearn.model_selection import train_test_split  # Splitting data
from sklearn.svm import SVC  # Support Vector Machine model
from sklearn.metrics import accuracy_score, classification_report  # Model evaluation

# Download necessary resources for NLTK
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Step 2: Load and Merge Datasets
We load and merge two datasets: Weibo21 (`train.csv`, `test.csv`, `val.csv`) and the Kaggle Fake News dataset (`True.csv`, `Fake.csv`). The Chinese Weibo text is translated into English so that both sources can share one English preprocessing pipeline.

In [None]:
# Load and Merge Datasets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_val = pd.read_csv('val.csv')
df_fake = pd.read_csv('Fake.csv')[['title', 'text']].assign(label=1)  # Fake news
df_real = pd.read_csv('True.csv')[['title', 'text']].assign(label=0)  # Real news

df_kaggle = pd.concat([df_fake, df_real])
df_weibo = pd.concat([df_train, df_test, df_val])

# Ensure the time module is imported
import time

# Safe Translation Function with Retry Mechanism
translator = GoogleTranslator(source='zh-CN', target='en')

def safe_translate(text):
    """Safely translates text with retries and rate limiting."""
    if not isinstance(text, str) or not text.strip():
        return text  # Skip empty or non-string values

    for attempt in range(3):  # Retry up to 3 times
        try:
            time.sleep(1)  # Prevent hitting rate limits
            return translator.translate(text)
        except Exception as e:
            print(f'Attempt {attempt + 1}: Translation failed for: {text[:50]}... Retrying. Error: {e}')

    print(f"Translation permanently failed for: {text[:50]}... Skipping.")
    return text  # Return original text if all retries fail

# Apply translation with error handling
df_weibo['content'] = df_weibo['content'].apply(safe_translate)

# Merge content columns
df_kaggle['content'] = df_kaggle['title'] + ' ' + df_kaggle['text']
df_combined = pd.concat([df_kaggle[['content', 'label']], df_weibo[['content', 'label']]])
df_combined = df_combined.sample(frac=1).reset_index(drop=True)  # Shuffle data
df_combined.head()





Attempt 1: Translation failed for: #12345回复北京降级热点问题#【#北京#响应等级调低后，您的这些疑问12345答复了！】4月30... Retrying. Error: Request exception can happen due to an api connection error. Please check your connection and try again
Attempt 2: Translation failed for: #12345回复北京降级热点问题#【#北京#响应等级调低后，您的这些疑问12345答复了！】4月30... Retrying. Error: Request exception can happen due to an api connection error. Please check your connection and try again
Attempt 3: Translation failed for: #12345回复北京降级热点问题#【#北京#响应等级调低后，您的这些疑问12345答复了！】4月30... Retrying. Error: Request exception can happen due to an api connection error. Please check your connection and try again
Translation permanently failed for: #12345回复北京降级热点问题#【#北京#响应等级调低后，您的这些疑问12345答复了！】4月30... Skipping.
Attempt 1: Translation failed for: 【秦朔：唯偏执狂才能让赢家们不再如此生存（节选）】（ Hello好公司 5天前）1、好几位朋友让我评... Retrying. Error: Request exception can happen due to an api connection error. Please check your connection and try again
Attempt 2: Translation failed for: 【秦朔：唯偏执狂才能让赢家们不再如此生存（节选）】

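The retry loop above waits a fixed one second before every attempt, and the log shows all three attempts failing with the same connection error. One common alternative is to wait longer after each failure (exponential backoff). The following is a minimal sketch under stated assumptions: `translate_with_backoff`, `translate_fn`, and `sleep_fn` are hypothetical names introduced here for illustration and are not part of `deep_translator`; `translate_fn` stands in for `translator.translate`.

```python
import time


def translate_with_backoff(text, translate_fn, retries=3, base_delay=1.0, sleep_fn=time.sleep):
    """Call translate_fn(text), doubling the wait after each failed attempt.

    Hypothetical helper: translate_fn stands in for translator.translate,
    and sleep_fn is injectable so the logic can be tested without waiting.
    """
    if not isinstance(text, str) or not text.strip():
        return text  # Nothing to translate

    for attempt in range(retries):
        try:
            return translate_fn(text)
        except Exception as exc:
            if attempt == retries - 1:
                break  # Out of attempts; do not sleep again
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep_fn(delay)

    return text  # Fall back to the untranslated text, as safe_translate does
```

In this notebook it could be applied as `df_weibo['content'].apply(lambda t: translate_with_backoff(t, translator.translate))`. Unlike `safe_translate`, it only sleeps after a failure, so successful calls are not slowed down; whether that stays within the service's rate limits is something to check in practice.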
## Step 3: Preprocess Text Data

Before training the model, we need to clean the text data by:

- Removing punctuation, numbers, and special characters.
- Converting text to lowercase.
- Tokenizing and removing stopwords.
- Applying lemmatization to normalize words.


In [14]:
# Ensure NLTK resources are downloaded
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Text Preprocessing Function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Cleans and preprocesses text for machine learning."""
    # Handle None or empty strings
    if text is None or not isinstance(text, str) or not text.strip():
        return ""  # Return empty string for None or empty text

    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenize words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]  # Lemmatize & remove stopwords
    return " ".join(tokens)

# Apply text preprocessing
df_combined['content'] = df_combined['content'].apply(clean_text)

# Display sample after preprocessing
df_combined.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


| | content | label |
|---|---|---|
| 0 | dear mr presidentwhen said lock werent asking ... | 1 |
| 1 | ivanka trump business get slapped lien owing u... | 1 |
| 2 | jaycee chan arrested taking drugshong kong med... | 1 |
| 3 | use disposable chopstick make dried bamboo sho... | 1 |
| 4 | jill stein demanding recount hillary camp usin... | 1 |


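The sample output above contains fused tokens such as `presidentwhen` and `drugshong`. One likely cause (an assumption, since the raw rows are not shown) is that `clean_text` deletes punctuation outright, so two words separated only by a punctuation mark are joined. The sketch below compares deleting punctuation with replacing it by a space; both function names and the example string are made up for illustration.

```python
import re


def strip_punct_delete(text):
    """Mirror the cleaning step above: punctuation is removed outright."""
    return re.sub(r'[^\w\s]', '', text.lower())


def strip_punct_space(text):
    """Variant: punctuation becomes a space, then whitespace is collapsed."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


# Made-up example with no space after the comma
raw = "Dear Mr. President,when you said..."
print(strip_punct_delete(raw))  # dear mr presidentwhen you said
print(strip_punct_space(raw))   # dear mr president when you said
```

The trade-off is that the space variant splits contractions (`weren't` becomes `weren t`), whereas deletion keeps them as one token (`werent`), so which behavior is preferable depends on the data.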
## Step 4: Convert Text to Numerical Representation (TF-IDF Vectorization)
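Since an SVM works with numerical data, we need to convert the text into TF-IDF vectors:

- TF-IDF (Term Frequency-Inverse Document Frequency) gives a word more weight when it is frequent in a document but rare across the corpus.
- `max_features=5000` keeps only the 5,000 terms with the highest frequency across the corpus, which keeps computation efficient.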

In [15]:
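# Convert text data into TF-IDF features
# Note: the vectorizer is fitted on the full dataset before the split, so test
# documents influence the vocabulary and IDF weights. For a stricter evaluation,
# split first and fit the vectorizer on the training portion only.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df_combined['content'])
y = df_combined['label']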

# Split dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dataset shapes
X_train.shape, X_test.shape

((43592, 5000), (10898, 5000))

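To make the TF-IDF weighting from this step concrete, the sketch below recomputes it by hand on a toy corpus. It assumes the formula scikit-learn documents for the default `TfidfVectorizer` settings: raw term count times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper `tfidf_rows` and the three documents are hypothetical, and the sketch ignores tokenization details and `max_features`.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


# Toy corpus, not taken from the datasets
docs = ["vaccine hoax exposed", "vaccine trial results", "election results announced"]
for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `hoax` appears in only one of the three documents while `vaccine` appears in two, so `hoax` receives the larger weight even though both occur once in that document.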
## Step 5: Train the Support Vector Machine (SVM) Model
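Now, we train an SVM classifier:

- It uses a linear kernel, a common choice for high-dimensional, sparse text features.
- The `C` parameter controls regularization: smaller values favor a wider margin and tolerate more training errors, while larger values fit the training data more closely.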

In [16]:
# Train the SVM model
svm_model = SVC(kernel="linear", C=1)
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display classification report
print(classification_report(y_test, y_pred))

Model Accuracy: 0.95
              precision    recall  f1-score   support

           0       0.96      0.94      0.95      5353
           1       0.95      0.96      0.95      5545

    accuracy                           0.95     10898
   macro avg       0.95      0.95      0.95     10898
weighted avg       0.95      0.95      0.95     10898

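The precision, recall, and f1-score columns in the report above can be reproduced from raw counts of true positives, false positives, and false negatives. The sketch below does this for a single class; `precision_recall_f1` is a hypothetical helper and the labels are toy values, not taken from the dataset (1 = fake, 0 = real, as in this notebook).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (1 = fake, 0 = real), not taken from the dataset
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred, positive=1))  # (0.75, 0.75, 0.75)
```

Here three of the four predicted fakes are correct (precision 0.75) and three of the four actual fakes are found (recall 0.75).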


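To make the role of `C` in Step 5 concrete: a soft-margin linear SVM minimizes `0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w · x_i + b)))`, with labels encoded as -1 and +1. The sketch below only evaluates this objective for a hand-picked separator on toy points; it is not how `SVC` solves the optimization, and the function name, points, and weights are all hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Evaluate 0.5*||w||^2 + C * sum(hinge losses) for labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wi * xij for wi, xij in zip(w, xi)) + b
        hinge_total += max(0.0, 1.0 - yi * score)  # Zero when the point is beyond the margin
    return margin_term + C * hinge_total


# Toy 2-D points and a hand-picked separator (all hypothetical)
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.25, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, b, X, y, C))
```

Two of the four points violate the margin, and their combined hinge loss is multiplied by `C`. With a small `C` the margin term dominates, so the optimizer can tolerate such violations; with a large `C` the violations dominate, pushing the model to fit the training points more closely.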
## Step 6: Test with a Sample News Input
Now, let's test the model with a custom news headline and check whether it is classified as real or fake.

In [23]:
# Function to predict new input
def predict_news(text):
    """Predicts whether a given news article is Fake or Real."""
    text = clean_text(text)  # Preprocess text
    text_vectorized = vectorizer.transform([text])  # Convert to TF-IDF
    prediction = svm_model.predict(text_vectorized)[0]  # Get prediction
    return "Fake News" if prediction == 1 else "Real News"

# Example Test
sample_text = "Judge grants 19 AGs preliminary injunction against DOGE access to Treasury payment system The ruling came amid a lawsuit filed by 19 state attorneys general concerned about the Elon Musk-led DOGE accessing the payment system"
print(f"Prediction: {predict_news(sample_text)}")

Prediction: Real News
