<a href="https://colab.research.google.com/github/simulate111/Textual-Data-Analysis-25/blob/main/project_TDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> TDA course project </h1>

Name: MRA

Pair: If you did this project as pair work, name the other student here, leave empty otherwise. If you work in pair, <b>both</b> hand out the same project report in Moodle.


<h1> Step 1: Load the data with LLM judgements </h1>

In [145]:
#work here

In [146]:
!pip -q install datasets
!pip install -q scikit-learn

In [85]:
import os
import gzip
import json
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [86]:
# Load the data
url = "http://dl.turkunlp.org/tda-course-2025/tda25-responses.jsonl.gz"
filename = "tda25-responses.jsonl.gz"
os.system(f"wget {url} -O {filename}")
# gzip.open reads the compressed JSON Lines file directly, so no separate extraction step is needed
with gzip.open(filename, 'rt', encoding='utf-8') as f_in:
    data = [json.loads(line) for line in f_in]
df = pd.DataFrame(data)

In [87]:


# Extract the 'Step-by-step' and 'Training' answers with regular expressions
# (rows where the pattern is not found get NaN)
df["Step-by-step"] = df["response"].str.extract(r"Step-by-step:\s*(Yes|No)")
df["Training"] = df["response"].str.extract(r"Training:\s*(Yes|No)")

# Extract the explanation (everything after the first blank line in the response)
df["Explanation"] = df["response"].str.split("\n\n", n=1).str[1]

# Drop the old response column (optional)
df.drop(columns=["response"], inplace=True)




In [88]:
# Display the transformed dataset
df.head()

,document,Step-by-step,Training,Explanation
0,"Peeling an onion seems like an trivial task, b...",Yes,No,The article does provide a clear sequence of o...
1,Cowboy’s WR Terrance Williams Arrested\nDallas...,No,No,The article does not provide a clear sequence ...
2,Idea Crib’s Well-thy Pinoy Profiles blog serie...,No,No,The article does not provide a clear sequence ...
3,Will I ever stop being nervous every time I pu...,No,No,The article does not provide a clear sequence ...
4,"In today’s NHL rumors rundown, Toronto Maple L...",No,No,The article does not provide a clear sequence ...
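
The label counts reported later in this notebook (2492 Step-by-step and 2493 Training judgements out of 2515 rows) suggest that a few responses do not match the extraction patterns. The following is a minimal sketch of how such rows could be counted, written with the standard `re` module instead of pandas. The response layout is an assumption inferred from the regular expressions above, and `parse_response`, `count_unparsed`, and the example strings are hypothetical names and data, not part of the original notebook.

```python
import re

# Assumed response layout (inferred from the notebook's regular expressions):
#   "Step-by-step: Yes\nTraining: No\n\n<free-text explanation>"
JUDGEMENT_PATTERNS = {
    "Step-by-step": re.compile(r"Step-by-step:\s*(Yes|No)"),
    "Training": re.compile(r"Training:\s*(Yes|No)"),
}


def parse_response(response):
    """Return both Yes/No judgements and the explanation; None where a field is missing."""
    parsed = {}
    for name, pattern in JUDGEMENT_PATTERNS.items():
        match = pattern.search(response)
        parsed[name] = match.group(1) if match else None
    # The explanation is whatever follows the first blank line, if there is one
    parts = response.split("\n\n", 1)
    parsed["Explanation"] = parts[1] if len(parts) > 1 else None
    return parsed


def count_unparsed(responses, field):
    """Count the responses in which `field` could not be extracted."""
    return sum(1 for r in responses if parse_response(r)[field] is None)


# Hypothetical examples, not taken from the dataset
examples = [
    "Step-by-step: Yes\nTraining: No\n\nThe article lists clear steps.",
    "I cannot judge this document.",
]
print(parse_response(examples[0]))
print(count_unparsed(examples, "Training"))
```

Counting the unparsed rows before building the labels makes it explicit how many documents end up as "No" only because their judgement is missing.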


In [89]:
# Convert to Hugging Face dataset
hf_dataset = Dataset.from_pandas(df)

# Save dataset for later use
hf_dataset.save_to_disk("hf_dataset")


In [90]:

# Verify the dataset
print(hf_dataset)

Dataset({
    features: ['document', 'Step-by-step', 'Training', 'Explanation'],
    num_rows: 2515
})


In [91]:
# Step 1: Convert "Training" column to binary labels (Yes -> 1, anything else -> 0)
# Rows where no judgement could be extracted (None) therefore also get label 0
hf_dataset = hf_dataset.map(lambda x: {"label": 1 if x["Training"] == "Yes" else 0})


In [92]:
print(hf_dataset[0])

{'document': 'Peeling an onion seems like an trivial task, but if you’ve never peeled an onion before, it can be quite intimidating. Don’t worry – it is pretty easy to peel an onion.\nYou can now learn how to peel an onion by following these illustrated step-by-step instructions.\nStep #1: Put the whole onion on the cutting board\nStep 2: Cut off one end of the onion with a knife, as shown on the picture below:\nHere’s a picture of the onion with that end already cut off. The end of the onion is laying on the right side of the onion on the cutting board.\nStep 3: Cut off another end of the onion with a knife, as show on the picture below:\nAfter both ends of the onion have been cut off, the onion is ready to be peeled. Here’s the picture of the onion without its ends:\nStep 4: Start peeling! Make a cut under the peel, and pull on the peel so it separates from the onion. Look at the picture: knife under the peel, thumb on top of the peel. Grab the peel and pull.\nStep 5: Keep peeling in

In [93]:
# Convert the dataset to a pandas DataFrame and show the first five rows
pd.DataFrame(hf_dataset).head()

,document,Step-by-step,Training,Explanation,label
0,"Peeling an onion seems like an trivial task, b...",Yes,No,The article does provide a clear sequence of o...,0
1,Cowboy’s WR Terrance Williams Arrested\nDallas...,No,No,The article does not provide a clear sequence ...,0
2,Idea Crib’s Well-thy Pinoy Profiles blog serie...,No,No,The article does not provide a clear sequence ...,0
3,Will I ever stop being nervous every time I pu...,No,No,The article does not provide a clear sequence ...,0
4,"In today’s NHL rumors rundown, Toronto Maple L...",No,No,The article does not provide a clear sequence ...,0


In [94]:
hf_dataset

Dataset({
    features: ['document', 'Step-by-step', 'Training', 'Explanation', 'label'],
    num_rows: 2515
})

<h1> Step 2: Classifier training and evaluation </h1>


*   Which target did you choose?
*   Label distribution and majority baseline
*   Classifier performance
*   Manual inspection of the classifier output, what kinds of mistakes it makes?
*   What is the composition of the data we gave you? What does it mean for your results?
*   Conclusions




In [6]:
#work here

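I chose the Training: Yes/No judgement as the classification target. It appears to be the stricter and more comprehensive of the two criteria, so its positive labels seem more reliable. The trade-off is that it is less general and has fewer positive examples, so the classifier has less data to learn the "Yes" class from.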

In [142]:
# Count the occurrences of each label for 'Step-by-step' and 'Training' columns
step_by_step_counts = df['Step-by-step'].value_counts()
training_counts = df['Training'].value_counts()

display(step_by_step_counts)
training_counts

Step-by-step,count
No,1770
Yes,722


Training,count
No,2132
Yes,361
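
The majority baseline follows directly from these counts: it is the accuracy of a classifier that always predicts the most frequent label. The sketch below computes it for the Training target. The counts are copied from the output above, and `majority_baseline` and `training_label_counts` are hypothetical names introduced for illustration.

```python
# Counts copied from the Training value_counts() output above
training_label_counts = {"No": 2132, "Yes": 361}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


label, baseline_accuracy = majority_baseline(training_label_counts)
print(f"Majority label: {label}")
print(f"Majority baseline accuracy: {baseline_accuracy:.4f}")
```

On the full dataset this gives a baseline of about 0.855, which is the number any accuracy figure for this target should be compared against. The baseline on a particular test split can differ slightly.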


In [95]:
# Step 2: Split dataset into train and test
# (no seed is set, so the split and the resulting scores change between runs)
train_test = hf_dataset.train_test_split(test_size=0.2)
train_dataset, test_dataset = train_test["train"], train_test["test"]

In [96]:
# Step 3: Load tokenizer (BERT base model)
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [97]:
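# Step 4: Tokenize documents
# Every document is padded or truncated to the model's maximum length (512 tokens for BERT),
# so the classifier only sees the beginning of long documents
def tokenize_function(examples):
    return tokenizer(examples["document"], padding="max_length", truncation=True)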

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/2012 [00:00<?, ? examples/s]

Map:   0%|          | 0/503 [00:00<?, ? examples/s]

In [98]:
# Step 5: Define model (BERT classifier)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [119]:
# Step 6: Training setup
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch"
)




In [120]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)



In [121]:
# Step 7: Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.2446,0.253404
2,0.2674,0.385804
3,0.1101,0.546666


TrainOutput(global_step=756, training_loss=0.20560010051522307, metrics={'train_runtime': 671.9578, 'train_samples_per_second': 8.983, 'train_steps_per_second': 1.125, 'total_flos': 1588138330152960.0, 'train_loss': 0.20560010051522307, 'epoch': 3.0})
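
In the training log above, the validation loss is lowest after the first epoch and rises in the next two, while the training loss ends lower than it started. This pattern usually indicates overfitting. The sketch below picks the best epoch from the logged values. The numbers are copied from the table above, and `best_epoch` and `overfitting_epochs` are hypothetical helper names.

```python
# (epoch, training loss, validation loss) copied from the training log above
training_log = [
    (1, 0.2446, 0.253404),
    (2, 0.2674, 0.385804),
    (3, 0.1101, 0.546666),
]


def best_epoch(log):
    """Return the epoch with the lowest validation loss."""
    return min(log, key=lambda row: row[2])[0]


def overfitting_epochs(log):
    """Return the epochs whose validation loss is higher than in the previous epoch."""
    return [cur[0] for prev, cur in zip(log, log[1:]) if cur[2] > prev[2]]


print(f"Best epoch by validation loss: {best_epoch(training_log)}")
print(f"Epochs with rising validation loss: {overfitting_epochs(training_log)}")
```

One way to act on this, assuming the standard Hugging Face setup used here, would be to train for fewer epochs or to set `load_best_model_at_end=True` in `TrainingArguments` so that the checkpoint with the lowest validation loss is kept. Neither was tried in this notebook.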

In [122]:
# Step 8: Evaluate the model
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.546666145324707, 'eval_runtime': 13.9873, 'eval_samples_per_second': 35.961, 'eval_steps_per_second': 4.504, 'epoch': 3.0}
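
The evaluation output above contains only the loss because no `compute_metrics` function was passed to the `Trainer`. The following is a minimal sketch of such a function, written in plain Python so that it works with lists as well as NumPy arrays. It assumes that the argument unpacks into `(logits, labels)` and that label 1 is the positive class. The toy logits and labels are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy check with made-up logits and labels (not notebook output)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should make these values appear next to the validation loss after every epoch, so that no separate metric computation after training is needed.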


In [129]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Get the predictions
predictions = trainer.predict(test_dataset)

# Predicted class = index of the larger of the two logits
pred_labels = predictions.predictions.argmax(axis=1)

# True labels as the original 'Yes'/'No' strings (converted to 0/1 further below)
true_labels = [item['Training'] for item in test_dataset]




In [132]:
# Print the first 10 true and predicted labels as a quick sanity check
# (this only shows a sample; it does not scan the full lists for None values)
print("Checking for None values in true_labels:")
print(true_labels[:10])

print("Checking for None values in pred_labels:")
print(pred_labels[:10])


Checking for None values in true_labels:
['No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
Checking for None values in pred_labels:
[0 1 0 0 0 0 0 0 0 0]


In [143]:


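# Convert 'Yes' to 1 and everything else to 0 for true_labels
# (a missing judgement, None, also becomes 0, matching the 'label' column used for training)
true_labels_numeric = [1 if label == 'Yes' else 0 for label in true_labels]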

# Now calculate the metrics
accuracy = accuracy_score(true_labels_numeric, pred_labels)
precision = precision_score(true_labels_numeric, pred_labels)
recall = recall_score(true_labels_numeric, pred_labels)
f1 = f1_score(true_labels_numeric, pred_labels)

# Print metrics
print("Classifier performance")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


Classifier performance
Accuracy: 0.8708
Precision: 0.5632
Recall: 0.6447
F1 Score: 0.6012


From these values:

*   Accuracy (0.8708) looks high, but it has to be read against the label distribution. In the counts above, 2132 of the 2493 parsed Training labels (about 86%) are "No", so a classifier that always predicts "No" would already be right about 86% of the time. On accuracy, the classifier is only slightly better than this majority baseline.
*   Precision (0.5632) is lower: roughly 44% of the documents that the classifier predicts as useful for training ("Yes") are false positives.
*   Recall (0.6447) is somewhat higher: the classifier finds about 64% of the documents labeled "Yes" and misses the rest (false negatives).
*   The F1 score (0.6012) is the harmonic mean of precision and recall, and it indicates moderate performance on the minority "Yes" class.
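
The reported metrics can be cross-checked with a little arithmetic. The confusion counts below are not printed anywhere in this notebook. They are an inference: one set of counts that is consistent with all four reported values and with the 503-document test set. Treat them as an assumption, not as measured output.

```python
# Hypothetical confusion counts, inferred from the reported metrics and the
# 503-document test set. They are NOT printed anywhere in the notebook.
tp, fp, fn, tn = 49, 38, 27, 389

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
# Accuracy of always predicting the majority class "No" on the same test set
test_majority_baseline = (tn + fp) / total

print(f"Test documents: {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Majority baseline: {test_majority_baseline:.4f}")
```

Under these assumed counts the majority baseline on the test set is about 0.849, so the accuracy of 0.8708 would be a gain of roughly two percentage points. Precision, recall, and F1 are therefore more informative than accuracy for this target.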

In [144]:
import numpy as np

# Raw model outputs: a 2D array of shape (n_examples, 2)
logits = predictions.predictions

# Apply a numerically stable softmax (subtract the row maximum before exponentiating)
shifted = logits - logits.max(axis=1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Predicted class = class with the highest probability
predicted_labels = np.argmax(probabilities, axis=1)

# Print the first 5 documents with their true labels and predictions
for i in range(5):
    print(f"Document: {test_dataset['document'][i]}")
    print(f"True Label: {test_dataset['Training'][i]}")
    print(f"Logits: {logits[i]}")
    print(f"Probabilities: {probabilities[i]}")
    print(f"Predicted Class (final): {predicted_labels[i]}")
    print()


Document: Phil's post this week has apparently hit a nerve with a couple of atheist internet apologists, and I wanted to pull one of the branches of the thread on God's justice up front here for the sake of filling my quota at TeamPyro this week.
Dr. Ken Pulliam has made an appearance in the comments to give us his "agnostic" view of the problem that God can order things which, if a man ordered them, he might rightly be called a genocidal maniac. However, someone called Dr. Pulliam an "atheist" in the thread, and he wanted to make sure we all knew he was actually an "agnostic".
His last comment to me in that thread was this:
I don't claim to be agnostic about every proposition only those that involve ultimate realities. I mentioned the three ways of verifying the truth of various propositions.
For example, if my wife says: "It is raining outside." How do I verify it? I step outside and see. If what I see and feel corresponds (the correspondence theory) to the accepted definitions of wh
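
Printing the first five test documents shows mostly correct predictions, which says little about the kinds of mistakes the classifier makes. A more targeted manual inspection would look only at the misclassified documents. The following is a minimal sketch of how they could be collected. `collect_errors` is a hypothetical helper, and the toy inputs are made up for illustration.

```python
def collect_errors(documents, true_labels, pred_labels, yes_probs, max_chars=200):
    """Split misclassified documents into false positives and false negatives.

    Labels are 0/1 integers; yes_probs holds the predicted probability of class 1.
    """
    false_pos, false_neg = [], []
    for doc, y, p, prob in zip(documents, true_labels, pred_labels, yes_probs):
        if y == p:
            continue  # correctly classified, nothing to inspect
        record = {"yes_prob": float(prob), "snippet": doc[:max_chars]}
        (false_pos if p == 1 else false_neg).append(record)
    # Most confident mistakes first: high P(Yes) for false positives,
    # low P(Yes) for false negatives
    false_pos.sort(key=lambda r: r["yes_prob"], reverse=True)
    false_neg.sort(key=lambda r: r["yes_prob"])
    return false_pos, false_neg


# Toy inputs (made up, not taken from the dataset)
toy_docs = ["doc a", "doc b", "doc c", "doc d"]
toy_true = [0, 1, 1, 0]
toy_pred = [1, 1, 0, 1]
toy_probs = [0.70, 0.90, 0.20, 0.95]
fps, fns = collect_errors(toy_docs, toy_true, toy_pred, toy_probs)
print("False positives:", fps)
print("False negatives:", fns)
```

In this notebook the call would presumably be `collect_errors(test_dataset["document"], true_labels_numeric, pred_labels, probabilities[:, 1])`. Reading the most confident false positives and false negatives is a quick way to see which document types the classifier confuses.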

Evaluation Analysis:
Evaluation Results Validity:
The classifier performs reasonably well, but the precision and recall numbers leave clear room for improvement. The training log also shows the validation loss rising from 0.2534 after epoch 1 to 0.5467 after epoch 3, which suggests that the model starts to overfit after the first epoch. Further tuning of the hyperparameters (for example, training for fewer epochs) or a different model could help.
Consequences of the FineWeb Dataset:
The evaluation is based on the provided sample of FineWeb documents. If this sample is biased or not representative of the whole FineWeb corpus, the evaluation results may not generalize to unseen data.
Since the "Training: Yes" class is small (361 documents), its positive examples may cover only a narrow range of topics and may not represent the diversity of FineWeb. This can influence the results.
Conclusion:
Given these metrics, the model is a reasonable starting point for this task but would benefit from further improvement, such as additional tuning or other architectures.
The classifier should also be tested on more diverse data from the FineWeb corpus to see how it generalizes to unseen documents.

Evaluating Document Distribution
If the dataset consists primarily of documents from a certain genre or topic (e.g., tutorials or how-tos), the model can inherit that bias. If most documents marked "Training: Yes" come from a particular domain (e.g., cooking or technology), the classifier might perform poorly when applied to a different genre.

Consequences for Evaluation Results
*   Skewed distribution: If the evaluation data is heavily skewed towards a particular genre, the evaluation results might not reflect the model's performance on other genres of documents.
*   Overfitting: If the model has seen a large number of similar documents, it could overfit to those types, resulting in high training accuracy but poor generalization to unseen types.
*   Valid Evaluation: The evaluation might not be entirely valid, as the model may not generalize well outside the dataset's specific genre or topic focus.

Applying the Classifier on FineWeb
Given the potential biases in the data, running the classifier on the entire FineWeb dataset might be premature. Before doing so, it would be sensible to:

*   Validate performance on more diverse data to see how well the model generalizes to other genres or topics.
*   Consider adjusting the dataset by balancing genres or using domain-adaptive techniques, especially if the goal is to generalize across diverse web content.

In summary: is it reasonable to apply the classifier to the whole FineWeb dataset? Not without first checking that the model generalizes to genres and topics beyond those it was trained on.

Conclusions
Classifier Effectiveness: The classifier reaches an F1 score of 0.6012 on the "Training: Yes" class, so it is moderately effective at separating useful training documents from the rest. Its accuracy of 0.8708 should be compared with the majority baseline, since most documents are labeled "No".

Data Composition: The labels were not assigned by hand. They come from an LLM's judgement of whether a document contains structured steps for task completion or reasoning, so the classifier learns to imitate that model. Any bias or ambiguity in those judgements directly influences the classifier's performance, and the results should be viewed with this in mind.

Generalization to FineWeb: Performance on the test set might not carry over to the entire FineWeb dataset, because the test documents come from the same sample and carry the same LLM-generated labels as the training documents. The diversity of topics and document structures within the broader FineWeb could lead to performance drops. It may still be reasonable to apply the classifier to FineWeb data, but the results should be monitored carefully, and further tuning or evaluation on a more representative dataset might be needed.

Improvements: Potential areas for improvement include:

*   Handling the class imbalance through techniques like oversampling or class weighting.
*   Fine-tuning the classifier on more varied data to improve generalization.
*   Manually reviewing edge cases to understand how the model handles nuanced documents.

Refining the model along these lines should make its predictions on new documents more accurate and robust.
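
The class weighting mentioned among the improvements can be sketched from the label counts alone. The example below uses the common "balanced" heuristic, where each class is weighted by `n_samples / (n_classes * n_samples_in_class)`. The counts are copied from the Training label distribution earlier in this notebook, and `balanced_class_weights` is a hypothetical helper name. Nothing like this was used in the training run reported here.

```python
# Counts copied from the Training value_counts() output earlier in the notebook
label_counts = {"No": 2132, "Yes": 361}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * n_samples_in_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(label_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, the "Yes" class gets a weight of roughly 3.45 and the "No" class roughly 0.58, so each class contributes equally to a weighted loss. Applying the weights would require a weighted cross-entropy loss in the training loop, which is left as an assumption here.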

Composition of the Data
The dataset is derived from FineWeb, and the labels were generated by a GPT-4o-mini model, which was presumably prompted to identify "step-by-step" procedural documents and those that might aid in deeper reasoning. The labels therefore reflect that model's judgement rather than human annotation, and the sample itself may already be somewhat curated.

Consequences for Results:
*   Label Quality: The labels given by GPT-4o-mini are treated as the gold standard here, but they may not be perfect. Because the model had to make binary Yes/No judgements, documents that do not clearly fit one category may be labeled inconsistently. In addition, the parsed label counts (2493 for Training, 2492 for Step-by-step) are lower than the 2515 rows in the dataset, so a small number of responses could not be parsed, and the label mapping treats these rows as "No".
*   Class Imbalance: Both targets are imbalanced. The Training label has 2132 "No" and 361 "Yes" documents, and the Step-by-step label has 1770 "No" and 722 "Yes". With "Yes" underrepresented, the classifier can be biased toward predicting the majority class.
*   Data Bias: GPT-4o-mini may favor certain types of procedural documents or reasoning-heavy articles when assigning labels. The classifier inherits any such preference, which could affect how well it generalizes.

<h1> Bonus step </h1>

(leave empty if you do not do this)

*   Prompt design
*   Build (prompt,response pairs)
*   Turn into HF Dataset and save



In [9]:
#work here

<h1> Summary and Conclusions </h1>

* Brief TL;DR -style summary and main conclusions of your project.