## Step 0: Importing Required Libraries

Before we start, we need to import some Python libraries.

Each library has a specific role, and we import **only what we need**.

Below is an explanation of every import.


In [1]:
import pandas as pd
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Explanation of Imports

- **pandas (`pd`)**
  - Used to load and manipulate tabular data (CSV files).
  - Our dataset is stored in a table format, so pandas is essential.

- **re (Regular Expressions)**
  - Used for text cleaning.
  - Helps remove URLs, punctuation, and unwanted characters from text.

- **train_test_split**
  - Used to split our data into training and testing sets.
  - This helps us evaluate how well our model performs on unseen data.

- **TfidfVectorizer**
  - Converts text into numerical vectors.
  - Models operate on numbers, not raw strings, so TF-IDF turns each document into a vector of word weights.

- **LogisticRegression**
  - Our machine learning model.
  - Used for binary classification (Fake vs Real).

- **accuracy_score, confusion_matrix, classification_report**
  - Used to evaluate model performance.
  - They tell us how many predictions were correct and where the model made mistakes.


## Step 1: Loading the Dataset

We are using a **Fake News dataset** where:
- Each row represents a news article
- We use just two columns from the dataset:
  - `text` → the news content
  - `label` → whether the news is Fake or Real

We will load this dataset using pandas.
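
As a rough sketch of what loading and column selection can look like, the snippet below reads a tiny in-memory CSV instead of the real file. The CSV contents, the extra `title` column, the `FAKE`/`REAL` label values, and the name `sample_df` are all assumptions made for illustration; the real dataset may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (contents are made up)
csv_data = io.StringIO(
    "title,text,label\n"
    "Headline A,Some article body,FAKE\n"
    "Headline B,Another article body,REAL\n"
)

sample_df = pd.read_csv(csv_data)

# Keep only the two columns used in the rest of the walkthrough
sample_df = sample_df[["text", "label"]]

# Drop rows with missing values so later string operations do not hit NaN
sample_df = sample_df.dropna(subset=["text", "label"])

print(sample_df.head())
```

Dropping missing rows is an optional extra step here; it is one way to keep `clean_text` from failing on non-string values later.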


In [None]:
df = pd.read_csv("_____") #fill here

# Keep only relevant columns
#fill here

print(df.head())


## Step 2: Text Cleaning (Preprocessing)

Raw text is messy.

It contains:
- Capital letters
- URLs
- Punctuation
- Extra spaces

If we feed this raw text directly into a model, it adds **noise** and can hurt performance.

So we clean the text before doing anything else.


In [3]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)

    # Remove everything except lowercase letters and whitespace
    # (punctuation, digits, and other special characters)
    text = re.sub(r"[^a-z\s]", "", text)

    # Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text
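
To see the four cleaning steps in action, here is a small self-contained check. It repeats the function above unchanged and applies it to a made-up headline; the sample string and the name `sample` are assumptions for illustration.

```python
import re


def clean_text(text):
    # Same steps as above: lowercase, strip URLs, keep letters only, squeeze spaces
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Made-up example input
sample = "BREAKING: Visit http://example.com NOW!!!  100% true..."

print(clean_text(sample))  # breaking visit now true
```

Note that digits disappear along with punctuation, so `100%` is removed entirely.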


In [None]:
df["clean_text"] = df["text"].apply(clean_text)

print(df[["text", "clean_text"]].head())


In [5]:
X = df["clean_text"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
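
The effect of `test_size=0.2` can be checked on toy data. The sketch below uses ten made-up articles and also passes `stratify`, an optional `train_test_split` argument that keeps the class balance the same in both parts. The notebook cell above does not use `stratify`; it is shown here only as an option, and all `toy_*` names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Ten made-up articles with perfectly balanced labels
toy_texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE", "REAL"] * 5

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,       # 20% of the rows go to the test set
    random_state=42,     # fixed seed so the split is reproducible
    stratify=toy_labels  # optional: preserve the FAKE/REAL ratio in both parts
)

print(len(toy_X_train), len(toy_X_test))  # 8 2
```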


## Step 3: Converting Text into Numbers using TF-IDF

Machine learning models cannot understand text.

They only work with numbers.

TF-IDF (Term Frequency - Inverse Document Frequency) converts each document
into a numerical vector. A word gets a high weight when it appears often in
that document but rarely across the rest of the collection.


In [6]:
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_df=0.7
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


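## Step 4: Train/Test Split Recap & Training the Logistic Regression Model

We do NOT train our model on the entire dataset.

Instead:
- **Training data** → Used to teach the model
- **Testing data** → Used to evaluate the model

This simulates a real-world scenario where the model sees new, unseen data.

The split itself was already made with `train_test_split` at the end of Step 2 (80% train, 20% test), before fitting TF-IDF, so the vectorizer learned its vocabulary from the training data only. What remains is fitting the model on the training portion.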


In [None]:
model = LogisticRegression(max_iter=1000)

model.fit(X_train_tfidf, y_train)
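
Once fitted, the model can label new text, provided the text goes through the same vectorizer. The following is a minimal sketch on four made-up documents; the texts, the `FAKE`/`REAL` label strings, and the `toy_*` names are assumptions for illustration, not the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

toy_texts = [
    "shocking secret cure",
    "shocking miracle cure",
    "official report released",
    "official statement released",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# New text must be transformed with the already-fitted vectorizer
new_doc = toy_vectorizer.transform(["official report"])

toy_pred = toy_model.predict(new_doc)         # predicted label
toy_proba = toy_model.predict_proba(new_doc)  # one probability per class

print(toy_model.classes_)  # column order of predict_proba
print(toy_pred, toy_proba)
```

`predict_proba` returns its columns in the order given by `classes_`, which is worth checking before reading the numbers.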


## Step 5: Evaluation

In [None]:
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
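
To read the confusion matrix, it helps to work through a case small enough to count by hand. The five true and predicted labels below are made up, and the label strings `FAKE`/`REAL` are an assumption about how the dataset encodes its classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 3 of the 5 predictions match
true_labels = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]
pred_labels = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

toy_acc = accuracy_score(true_labels, pred_labels)

# Rows are true classes, columns are predicted classes, in the given label order
toy_cm = confusion_matrix(true_labels, pred_labels, labels=["FAKE", "REAL"])

print(toy_acc)  # 0.6
print(toy_cm)
# [[1 1]   true FAKE: 1 predicted FAKE, 1 predicted REAL
#  [1 2]]  true REAL: 1 predicted FAKE, 2 predicted REAL
```

The diagonal holds the correct predictions; the off-diagonal cells show which kind of mistake the model makes.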


## Step 6: Major Features for Prediction

In [None]:
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

coef_df = pd.DataFrame({
    "word": feature_names,
    "coefficient": coefficients
})

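# Sign convention: negative coefficients push towards model.classes_[0],
# positive ones towards model.classes_[1]. The FAKE/REAL headings below
# assume classes_ is ["FAKE", "REAL"]; check model.classes_ for your labels.

# Most negative coefficients
print("\nTop FAKE indicators:")
print(coef_df.sort_values(by="coefficient").head(10))

# Most positive coefficients
print("\nTop REAL indicators:")
print(coef_df.sort_values(by="coefficient", ascending=False).head(10))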
