#### Task 1:
***Evaluate the performance of your Simple and Advanced Models on your FakeNewsCorpus test set. It should be possible to achieve > 80% accuracy but you will not fail the project if your model cannot reach this performance.***

In [2]:
# -----------------------------------------------------------------------
# LOGISTIC REGRESSION FOR CLEANED_FILE.CSV
# -----------------------------------------------------------------------

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

print("Beginning Logistic regression...")

# Load CSV
data = pd.read_csv("WITHOUT_NUM_cleaned_file_with_labels.csv")
print("  - Loaded CSV.")

# Ensure text data is string type 
data["content"] = data["content"].fillna("").astype(str)
print("  - Ensured text data is string type.")

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=10000, binary=True)
X = vectorizer.fit_transform(data["content"])
print("  - Vectorized text data.")

# Use the binary 'label' column as the target variable
y = data["label"].astype(int)  # Ensure 'label' is integer type
print("  - Used the binary 'label' column as the target variable.")

# Split the dataset into 80% train, 10% validation, and 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print("  - Splitting dataset into train, validation, and test sets.")

# Implementing logistic regression
clf = LogisticRegression(max_iter=500, random_state=42, solver="saga", n_jobs=-1)
clf.fit(X_train, y_train)
print("  - Implementing logistic regression.")

# Evaluate the model on the validation set
y_pred = clf.predict(X_val)
f1 = f1_score(y_val, y_pred, average="binary")

# Display results
print("\nResults of logistic regression of 995k")
print(f"  - F1-score: {f1:.4f}\n")
print("Hyperparameters for Logistic Regression:")
print(f"  - Max Iterations: {clf.max_iter}")
print(f"  - Solver: {clf.solver}")
print(f"  - Random State: {clf.random_state}")

Beginning Logistic regression for 995k + scraped:
  - Loaded CSV.
  - Ensured text data is string type.
  - Vectorized text data.
  - Used the binary 'label' column as the target variable.
  - Splitting dataset into train, validation, and test sets.
  - Implementing logistic regression.

Results of logistic regression of 995k
  - F1-score: 0.9197

Hyperparameters for Logistic Regression:
  - Max Iterations: 500
  - Solver: saga
  - Random State: 42
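
One detail in the cell above is worth flagging: the TF-IDF vectorizer is fitted on the full dataset before the split, so its vocabulary and IDF weights are also computed from validation and test documents. Below is a minimal sketch of a leak-free variant, assuming the raw text is split first and the vectorizer is wrapped in a scikit-learn pipeline together with the classifier. The toy `texts` and `labels` are made up for illustration and are not taken from FakeNewsCorpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical, already-stemmed toy documents (1 = reliable, 0 = fake)
texts = [
    "senat pass budget bill vote",
    "shock secret cure doctor hate",
    "court rule appeal case today",
    "alien control govern mind chip",
    "council approv road repair fund",
    "miracl pill melt fat overnight",
    "report show employ rate rise",
    "celebr clone replac secret elit",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the vectorizer never sees held-out documents
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier
model = make_pipeline(
    TfidfVectorizer(max_features=10000, binary=True),
    LogisticRegression(max_iter=500, random_state=42),
)
model.fit(text_train, y_train)

# At prediction time the pipeline only transforms the held-out text
y_pred = model.predict(text_test)
print(f1_score(y_test, y_pred, average="binary", zero_division=0))
```

In the notebook this would mean calling `train_test_split` on `data["content"]` rather than on `X`. The reported scores would then have to be recomputed; for a corpus of this size the difference may well be small.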


Above, we have trained the model. Below, we will now evaluate it on the held-out test set.

In [3]:
# Evaluate the final model on the test set
y_test_pred = clf.predict(X_test)
f1_test = f1_score(y_test, y_test_pred, average="binary")

print("Final evaluation on the FakeNewsCorpus test set:")
print(f"  - Test F1-score: {f1_test:.4f}")


Final evaluation on the FakeNewsCorpus test set:
  - Test F1-score: 0.9180
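
The task statement is phrased in terms of accuracy, while the cells above report the F1-score. Both can be derived from the same predictions. The sketch below spells out the formulas in plain Python on a made-up pair of label lists (`toy_true` and `toy_pred` are hypothetical, not notebook results).

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

With scikit-learn, the accuracy for the notebook's model can be obtained with `accuracy_score(y_test, y_test_pred)` from `sklearn.metrics`, next to the existing `f1_score` call.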


#### Task 2: 

***In order to allow you to play around with cross-domain performance, try the same exercise on the LIAR dataset, where you know the labels, and can thus immediately calculate the performance. You are expected to directly evaluate the model you trained on the FakeNewsCorpus. In other words, you do not need to retrain the model on the LIAR dataset.***

To do this, we need to make sure that the LIAR dataset has the same format as the original dataset. We therefore apply a cleaning approach similar to the one we used in Part 1. Note that we only clean the "statement" column of the LIAR dataset, which we rename to "content" so that it matches our training data. The LIAR dataset also doesn't have a header row, so we add column names for convenience.


In [6]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# The file we want to read:
file_path = "test.tsv"
chunksize = 40000 # Can in practice be ignored since this file is very small compared to the 995K. 

# Column names from the README. The LIAR test file doesn't have a header row, so we define the column names manually.
column_names = [
    "id", "type", "content", "subjects", "speaker", "speaker_job_title", 
    "state_info", "party_affiliation", "barely_true", "false", "half_true", 
    "mostly_true", "pants_on_fire", "context"
]

# We only want to clean statement (content). We can add the other columns if needed. 
columns_to_clean = ["content"]

# Initialize NLTK
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

# Our clean_text function from Part 1:
def clean_text(data):
    if not isinstance(data, str):
        return ""
    data = data.lower()
    data = re.sub(r'\s+', " ", data)
    data = re.sub(r'\d{1,2}[./-]\d{1,2}[./-]\d{2,4}', "<DATE>", data)
    data = re.sub(r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec).? \d{1,2},? \d{4}', "<DATE>", data)
    data = re.sub(r'\d{4}-\d{2}-\d{2}', "<DATE>", data)
    data = re.sub(r'[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,}', "<EMAIL>", data)
    data = re.sub(r'http[s]?://[^\s]+', "<URL>", data)
    data = re.sub(r'\d+(\.\d+)?', "<NUM>", data)
    return data

# Our tokenize and stem function from Part 1:
def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    filtered_tokens = [ps.stem(word) for word in tokens if word.isalpha() and word not in stop_words]
    return filtered_tokens

# Process all chunks and collect them in a list; they are concatenated once at the end
processed_chunks = []

for chunk in pd.read_csv(file_path, sep='\t', chunksize=chunksize, 
                        low_memory=False, header=None, names=column_names):
    # Slightly edited from Part 1: apply the same 3-step pipeline to each column
    for col in columns_to_clean:
        if col in chunk.columns:
            # Step 1: Clean text (dates, URLs, etc.)
            chunk[col] = chunk[col].apply(clean_text)

            # Step 2: Remove stopwords
            chunk[col] = chunk[col].astype(str).apply(
                lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))

            # Step 3: Tokenize and stem (each cell becomes a list of stemmed tokens)
            chunk[col] = chunk[col].apply(tokenize_and_stem)

    # Keep the processed chunk
    processed_chunks.append(chunk)

# Combine the processed chunks into a single DataFrame
preprocessed_data = pd.concat(processed_chunks, ignore_index=True)

# Save output now cleaned 
preprocessed_data.to_csv("LIAR_DATA_test_CLEANED.csv", index=False)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/simonhvidtfeldt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/simonhvidtfeldt/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
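
To make the effect of the cleaning step concrete, the sketch below re-runs the `clean_text` function from the cell above on two made-up sentences (they are illustrative and not taken from LIAR). The second one shows a quirk of the pattern order: an ISO date is partly caught by the first date pattern, so it ends up as `<NUM><DATE>`. As the same function is stated to come from Part 1, the training data should have been treated the same way.

```python
import re


def clean_text(data):
    """Copy of the notebook's clean_text: lowercase and insert placeholders."""
    if not isinstance(data, str):
        return ""
    data = data.lower()
    data = re.sub(r'\s+', " ", data)
    data = re.sub(r'\d{1,2}[./-]\d{1,2}[./-]\d{2,4}', "<DATE>", data)
    data = re.sub(r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec).? \d{1,2},? \d{4}', "<DATE>", data)
    data = re.sub(r'\d{4}-\d{2}-\d{2}', "<DATE>", data)
    data = re.sub(r'[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,}', "<EMAIL>", data)
    data = re.sub(r'http[s]?://[^\s]+', "<URL>", data)
    data = re.sub(r'\d+(\.\d+)?', "<NUM>", data)
    return data


# Hypothetical input sentences
sample = "Visit https://example.com on 12/03/2019 or email Bob@Example.org about 42 items."
iso_sample = "Filed on 2020-01-15"

cleaned = clean_text(sample)
cleaned_iso = clean_text(iso_sample)

print(cleaned)      # visit <URL> on <DATE> or email <EMAIL> about <NUM> items.
print(cleaned_iso)  # filed on <NUM><DATE>
```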


Perfect. Now we are almost ready. We just need to make the labels binary in the LIAR dataset. 

In [7]:
# ---------------------------------------------
# CODE FROM PART_2_TASK_1 *EDITED*
# ---------------------------------------------
data_path = "LIAR_DATA_test_CLEANED.csv"
data = pd.read_csv(data_path)

# Define mapping
label_mapping = {
    "true": "True",
    "false": "Fake",
    "half-true": "Fake",
    "pants-fire": "Fake",
    "barely-true": "Fake",
    "mostly-true": "True"
}

# Create the 'label' column
data["label"] = data["type"].map(label_mapping)
print(" - Created the 'label' column")

# Convert 'True' to 1 and 'Fake' to 0
data["label"] = data["label"].map({"True": 1, "Fake": 0})
print(" - Converted 'True' to 1 and 'Fake' to 0")

# Drop rows with NaN in the 'label' column (if any)
data = data.dropna(subset=["label"])
print(" - Dropped rows with NaN in the 'label' column")

# Save the DataFrame to a CSV file (if needed)
data.to_csv("LIAR_DATA_test_CLEANED_label.csv", index=False)
print(" - Saved the DataFrame to a CSV file")

 - Created the 'label' column
 - Converted 'True' to 1 and 'Fake' to 0
 - Dropped rows with NaN in the 'label' column
 - Saved the DataFrame to a CSV file
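
The mapping above sends four of the six LIAR labels to the "Fake" class, and any label that is not a key in the dictionary becomes NaN and is dropped without a message. A small sketch of how both effects could be checked is shown below. The `toy_labels` list is hypothetical, and the two-step mapping is collapsed into one dictionary for brevity.

```python
from collections import Counter

# Same grouping as in the notebook, written directly as 1 (True) / 0 (Fake)
binary_mapping = {
    "true": 1,
    "mostly-true": 1,
    "half-true": 0,
    "barely-true": 0,
    "false": 0,
    "pants-fire": 0,
}

# Hypothetical raw labels; "TRUE" is deliberately not a valid key
toy_labels = ["true", "false", "half-true", "pants-fire",
              "barely-true", "mostly-true", "false", "TRUE"]

mapped = [binary_mapping.get(label) for label in toy_labels]

# Labels that would turn into NaN and be dropped by dropna()
unmapped = [label for label, value in zip(toy_labels, mapped) if value is None]

# Class balance among the rows that survive
class_counts = Counter(value for value in mapped if value is not None)

print("Unmapped labels:", unmapped)
print("Class counts:", dict(class_counts))
```

On the real DataFrame, `data["label"].value_counts()` before and after the `dropna` call would give the corresponding numbers.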


Great! Now we are ready to evaluate the performance of our simple logistic regression on the LIAR dataset. 

In [8]:
# Load your new dataset
new_liar_data = pd.read_csv("LIAR_DATA_test_CLEANED_label.csv")

# Same preprocessing as for the original data: fill missing values and cast to string
new_liar_data["content"] = new_liar_data["content"].fillna("").astype(str)

In [9]:
# Transform the new text data using the existing vectorizer
X_liar_data = vectorizer.transform(new_liar_data["content"])

# Get the labels
y_liar_data = new_liar_data["label"].astype(int)

In [10]:
from sklearn.metrics import f1_score

# Make predictions on the new data
y_new_pred = clf.predict(X_liar_data)

# F1 score
f1 = f1_score(y_liar_data, y_new_pred, average="binary")

# Display results
print("\nResults of logistic regression on LIAR:")
print(f"  - F1-score: {f1:.4f}\n")
print("Hyperparameters for Logistic Regression:")
print(f"  - Max Iterations: {clf.max_iter}")
print(f"  - Solver: {clf.solver}")
print(f"  - Random State: {clf.random_state}")


Results of logistic regression on LIAR:
  - F1-score: 0.5201

Hyperparameters for Logistic Regression:
  - Max Iterations: 500
  - Solver: saga
  - Random State: 42
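
One possible contributor to the gap between the two datasets is vocabulary mismatch: `vectorizer.transform` ignores every token that was not among the features learned during fitting. The sketch below gives a rough way to quantify this as an out-of-vocabulary rate. The function name `oov_rate` and the toy inputs are assumptions for illustration, and the regular expression mirrors the default `token_pattern` of `TfidfVectorizer`.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer (words of 2+ characters)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def oov_rate(vocabulary, documents):
    """Fraction of tokens in `documents` that are missing from `vocabulary`."""
    total = 0
    unknown = 0
    for doc in documents:
        for token in TOKEN_PATTERN.findall(doc.lower()):
            total += 1
            if token not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


# Hypothetical vocabulary and documents; the documents imitate the
# stringified token lists that end up in the cleaned CSV
toy_vocabulary = {"tax", "cut", "presid"}
toy_documents = ["['tax', 'cut', 'job']", "['presid', 'said']"]

rate = oov_rate(toy_vocabulary, toy_documents)
print(f"Out-of-vocabulary rate: {rate:.2f}")
```

In the notebook, `vectorizer.vocabulary_` could be passed as `vocabulary` and `new_liar_data["content"]` as `documents`. A high rate would suggest that many LIAR tokens never reach the classifier.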


In [14]:
# Compare the in-domain F1 (FakeNewsCorpus test set) with the cross-domain F1 (LIAR)


print("\nPerformance Comparison:")
print(f"  - FAKE News F1 : {f1_test:.4f}")
print(f"  - New LIAR data F1: {f1:.4f}")


Performance Comparison:
  - FAKE News F1 : 0.9180
  - New LIAR data F1: 0.5201
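
To put the LIAR score in context, it can help to compare it with a trivial classifier that always predicts the positive class. For such a classifier, precision equals the share of positives p and recall is 1, so F1 = 2p / (1 + p). The sketch below evaluates this formula for two hypothetical values of p; the actual share of label 1 in the LIAR test file is not reported above and would have to be computed from the data.

```python
def always_positive_f1(positive_share):
    """F1-score of a classifier that predicts the positive class for every row."""
    if not 0 < positive_share <= 1:
        raise ValueError("positive_share must be in the interval (0, 1]")
    precision = positive_share  # only the true positives are correct
    recall = 1.0                # every positive row is predicted positive
    return 2 * precision * recall / (precision + recall)


# Hypothetical positive-class shares, for illustration only
for share in (0.25, 0.5):
    print(f"share={share:.2f} -> baseline F1={always_positive_f1(share):.4f}")
```

In the notebook, `y_liar_data.mean()` gives the positive share, since the labels are 0 and 1. If the baseline computed from it is close to the model's F1, the model is adding little beyond guessing the positive class.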
