## Fake News Detection: Machine Learning & NLP Approach

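This Jupyter Notebook walks through detecting fake news with NLP and machine learning: loading and labeling the data, cleaning the text, exploring it, extracting TF-IDF features, and training and evaluating four classifiers.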

---

### **Step 1: Load and Inspect Data**
```python
import pandas as pd

# Load datasets
true_path = "True.csv"
fake_path = "Fake.csv"

df_true = pd.read_csv(true_path)
df_fake = pd.read_csv(fake_path)

# Add labels
df_true['label'] = 1  # Real news
df_fake['label'] = 0  # Fake news

# Combine datasets
df = pd.concat([df_true, df_fake], axis=0).reset_index(drop=True)

# Drop unnecessary columns
df = df.drop(columns=["subject", "date"], errors='ignore')

# Display dataset info
df.info()
```
---

### **Step 2: Data Cleaning & Preprocessing**
```python
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

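def preprocess_text(text):
    """Lowercase, keep letters only, and drop English stopwords."""
    if not isinstance(text, str):  # Guard against NaN/missing articles
        return ''
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z]', ' ', text)  # Replace punctuation/digits with spaces
    words = text.split()
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return ' '.join(words)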

# Apply preprocessing
df['clean_text'] = df['text'].apply(preprocess_text)
```
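
To make the effect of the cleaning step concrete, here is a small self-contained sketch of the same three operations (lowercasing, replacing non-letters with spaces, dropping stopwords). It assumes a tiny hand-written stopword set, `DEMO_STOP_WORDS`, in place of the NLTK list so it runs without a download; the helper name `preprocess_demo` and the sample headline are made up for illustration.

```python
import re

# Hypothetical mini stopword list; the notebook itself uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "a", "in", "of", "is", "to", "and", "has"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    """Mirror preprocess_text: lowercase, keep letters only, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)  # digits and punctuation become spaces
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "BREAKING: The Senate has passed 3 bills in 2017!"
cleaned = preprocess_demo(sample)
print(cleaned)  # breaking senate passed bills
```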
---

### **Step 3: Exploratory Data Analysis (EDA)**
```python
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Distribution of fake vs. real news
sns.countplot(x='label', data=df)
plt.xticks([0, 1], ["Fake (0)", "Real (1)"])  # Make the encoded labels readable
plt.title("Fake vs. Real News Distribution")
plt.show()

# Word cloud for fake news
fake_words = ' '.join(df[df['label'] == 0]['clean_text'])
wordcloud = WordCloud(width=800, height=400).generate(fake_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Most Common Words in Fake News")
plt.show()
```
---

### **Step 4: Convert Text into TF-IDF Features**
```python
# Convert text into TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']
```
---

### **Step 5: Train Machine Learning Models**
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

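# Split data (stratify keeps the fake/real ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train models (fixed seeds make the tree-based results reproducible)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # Extra iterations to help convergence
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}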

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} trained.")
```
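
One caveat with the flow above: Step 4 fits the vectorizer on the full dataset before the split, so test articles influence the vocabulary and IDF weights. A possible alternative, sketched below on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so `fit` only ever sees training text. The texts, labels, and variable names here are illustrative assumptions, not the notebook's data; with the real data you would pass `df['clean_text']` to `train_test_split` instead of `X`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for df['clean_text'] and df['label'] (1 = real, 0 = fake)
train_texts = [
    "senate passes budget bill", "court upholds voting law",
    "minister announces trade deal", "shocking secret cure revealed",
    "aliens endorse candidate hoax", "celebrity clone scandal exposed",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["senate announces trade bill", "shocking hoax exposed zombie"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)  # vocabulary and IDF come from training text only

predictions = pipeline.predict(test_texts)
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
print(predictions)
print("zombie" in vocabulary)  # "zombie" occurs only in the test text
```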
---

### **Step 6: Evaluate Model Performance**
```python
from sklearn.metrics import classification_report

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"\n{name} Performance:")
    # Label 0 is fake and label 1 is real (see Step 1)
    print(classification_report(y_test, y_pred, target_names=["Fake", "Real"]))
```
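
For readers unfamiliar with the columns in that report, the following sketch computes precision, recall, and F1 for one class from raw predictions using only the standard library. The label vectors and the helper name `prf_for_class` are made up for illustration; they are not outputs of the models above.

```python
def prf_for_class(y_true, y_pred, positive):
    """Precision, recall, and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical labels: 0 = fake, 1 = real
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = prf_for_class(y_true_demo, y_pred_demo, positive=0)
print(f"fake class: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# fake class: precision=0.60 recall=0.75 f1=0.67
```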
---

### **Step 7: Key Findings & Next Steps**
- **Logistic Regression** provided a strong baseline, but as a linear model over TF-IDF features it struggled with more complex patterns.
- **Decision Trees** captured some of those patterns but tended to overfit the training data.
- **Random Forest** improved generalization by aggregating many trees.
- **Gradient Boosting** achieved the highest accuracy and precision of the four models.

#### **Future Improvements:**
- Fine-tuning hyperparameters
- Exploring Deep Learning approaches (LSTMs, Transformers)
- Expanding the dataset for better generalization
- Implementing real-time detection

---

