## Getting Started
This notebook provides a baseline model using PU learning techniques to tackle this challenge.

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import problem

X_df, y = problem.get_train_data()
train_df, test_df, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)
train_df['label'] = y_train
test_df['label'] = y_test
# Display dataset information
print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)

# Sample of train dataset
train_df.head()

Train dataset shape: (44730, 3)
Test dataset shape: (11183, 3)


| | text | type | label |
|---|---|---|---|
| 31309 | “my son holds a lot of resentment towards me a... | unlabeled | 0 |
| 30919 | i gotta friend who is known for talkin shit al... | unlabeled | 0 |
| 37807 | If that's what a voter believe, that explained... | unlabeled | 0 |
| 21499 | Snoop Dogg didn't start saying -izzle but he w... | human | 1 |
| 47441 | These alternatives to popular apps can help re... | unlabeled | 0 |


In [5]:
train_df['type'].unique()

array(['unlabeled', 'human', 'llm'], dtype=object)

Let's look at an interesting column in our dataset: 'type'. This column records the labelling method used for each text and takes three values:

* 'unlabeled': the text has not been labeled
* 'human': the text has been annotated by a human labeler
* 'llm': the text has been annotated by an LLM (notably Llama 3.2 2B)

Note that unlabeled samples receive the label 0 by default. The PU learning task is to recover, as accurately as possible, the true label of these samples.
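As a minimal sketch of how the 'type' column can be used, the snippet below separates rows with a real annotation from rows whose label is only the default 0. The toy frame and the names `toy_df`, `labeled_df` and `unlabeled_df` are made up for illustration; only the column names come from the dataset shown above.

```python
import pandas as pd

# Toy frame mirroring the columns of train_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "text": ["first text", "second text", "third text", "fourth text"],
    "type": ["unlabeled", "human", "llm", "unlabeled"],
    "label": [0, 1, 1, 0],
})

# Rows whose label is only the default placeholder
is_unlabeled = toy_df["type"].eq("unlabeled")

labeled_df = toy_df[~is_unlabeled]    # annotated by a human or an LLM
unlabeled_df = toy_df[is_unlabeled]   # label 0 here is not a real annotation

print(len(labeled_df), "labeled rows;", len(unlabeled_df), "unlabeled rows")
```

The same mask could be applied to `train_df` to inspect or reweight the unlabeled portion separately.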


## Exploring the Labels
Before training the model, let's inspect the unique values and their counts in the `label` column for both the training and test datasets.

In [6]:
# Unique labels and counts in the training dataset
print("Unique labels in train dataset:")
print(train_df['label'].unique())
print("Label distribution in train dataset:")
print(train_df['label'].value_counts())

# Unique labels and counts in the test dataset
print("Unique labels in test dataset:")
print(test_df['label'].unique())
print("Label distribution in test dataset:")
print(test_df['label'].value_counts())

Unique labels in train dataset:
[0 1]
Label distribution in train dataset:
label
0    27365
1    17365
Name: count, dtype: int64
Unique labels in test dataset:
[0 1]
Label distribution in test dataset:
label
0    6868
1    4315
Name: count, dtype: int64


## Baseline Model
We will implement a simple **Positive-Unlabeled (PU) learning** model using logistic regression. Since we lack explicit negative labels, we will use a two-step approach:
1. Train a naive model using only positive and unlabeled data.
2. Use heuristics or additional methods to infer probable negative instances.
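
The two steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions, not the baseline used in this notebook: the Gaussian features stand in for TF-IDF vectors, the 30% cutoff for "reliable negatives" is an arbitrary heuristic, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for text features: two well-separated Gaussian blobs
X_pos = rng.normal(loc=2.0, size=(200, 2))
X_neg = rng.normal(loc=-2.0, size=(200, 2))

# Only half of the true positives are labeled; the rest hide among the unlabeled
X_labeled = X_pos[:100]
X_unlabeled = np.vstack([X_pos[100:], X_neg])

# Step 1: naive classifier, labeled positives (1) vs. all unlabeled rows (0)
X_step1 = np.vstack([X_labeled, X_unlabeled])
s_step1 = np.r_[np.ones(len(X_labeled)), np.zeros(len(X_unlabeled))]
naive = LogisticRegression().fit(X_step1, s_step1)

# Step 2: treat the lowest-scoring 30% of unlabeled rows as probable negatives
scores = naive.predict_proba(X_unlabeled)[:, 1]
cutoff = np.quantile(scores, 0.30)  # heuristic threshold, an assumption
X_reliable_neg = X_unlabeled[scores <= cutoff]

# Retrain on labeled positives vs. probable negatives only
X_step2 = np.vstack([X_labeled, X_reliable_neg])
y_step2 = np.r_[np.ones(len(X_labeled)), np.zeros(len(X_reliable_neg))]
refined = LogisticRegression().fit(X_step2, y_step2)

print("Probable negatives selected:", len(X_reliable_neg))
```

On real data the cutoff would need tuning, since a threshold that is too generous pulls hidden positives into the negative set.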


In [8]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Convert text data into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['text'].astype(str))
X_test = vectorizer.transform(test_df['text'].astype(str))

# Labels are already binary integers (0/1), so no conversion is needed
y_train = train_df['label']

# Test labels use the same 0/1 encoding
y_test = test_df['label']

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Convert probabilities to binary predictions using a threshold
threshold = 0.5
y_pred = (y_pred_proba >= threshold).astype(int)

# Evaluate the model using F1-score
f1 = f1_score(y_test, y_pred, average='binary')
print("F1-score:", f1)


F1-score: 0.7233883989566513


Note that this score is artificially inflated, because every 'unlabeled' test sample is assigned the default label 0, which is not necessarily its true class.

Combining these steps in a scikit-learn pipeline and evaluating the score with cross-validation gives the following:

In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer

def get_tf_idf_features(X_df):
    tf_idf = TfidfVectorizer(max_features=100)
    X_df_features = tf_idf.fit_transform(X_df['text'].astype(str))
    return X_df_features.toarray() # Convert to dense format 

cols = ['text']

transformer = make_column_transformer(
    (FunctionTransformer(get_tf_idf_features), cols),
)

pipeline = make_pipeline(
    transformer,
    LogisticRegression()
)

def get_estimator():
    return pipeline

In [10]:
import problem
from sklearn.model_selection import cross_val_score

X_df, y = problem.get_train_data()

scores = cross_val_score(get_estimator(), X_df, y, cv=2, scoring='f1')
print(scores)

[0.54706341 0.31926659]
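
One caveat about the pipeline above: because `get_tf_idf_features` calls `fit_transform` inside a `FunctionTransformer`, the TF-IDF vocabulary is refit on whatever data is passed in, including the validation fold at prediction time, so feature columns may not line up between fitting and scoring. A possible alternative, sketched below, places `TfidfVectorizer` directly in the column transformer so the vocabulary is learned once during `fit`. The name `get_estimator_fitted_vocab` and the toy data are hypothetical, and the sketch assumes the `text` column contains no missing values.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def get_estimator_fitted_vocab():
    # Passing the column name as a plain string hands a 1-D column of
    # strings to TfidfVectorizer, which is the input it expects
    transformer = make_column_transformer(
        (TfidfVectorizer(max_features=100), "text"),
    )
    return make_pipeline(transformer, LogisticRegression())


# Toy data, invented for illustration
train_toy = pd.DataFrame({"text": ["good movie", "great film", "bad movie", "awful film"]})
y_toy = [1, 1, 0, 0]
test_toy = pd.DataFrame({"text": ["great unseen words", "awful"]})

est = get_estimator_fitted_vocab().fit(train_toy, y_toy)
pred = est.predict(test_toy)

# The vocabulary comes from the training data only
vocab = est[0].named_transformers_["tfidfvectorizer"].vocabulary_
print(sorted(vocab))
```

Scores obtained with this variant would differ from the ones printed above, since the features are no longer recomputed per fold at prediction time.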



## Next Steps
Participants are encouraged to:
- Experiment with different **PU learning techniques** such as **weighted loss functions** or **semi-supervised learning**.
- Use **self-training** or **EM algorithms** to improve pseudo-labeling.
- Incorporate **pre-trained embeddings (e.g., BERT, RoBERTa)** for feature extraction.
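
As one possible starting point for the PU techniques listed above, here is a minimal sketch of the probability correction commonly attributed to Elkan and Noto. It assumes labeled positives are a random sample of all positives, which may not hold for this dataset. The synthetic features and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 400 true positives and 400 true negatives
X_pos = rng.normal(loc=1.5, size=(400, 2))
X_neg = rng.normal(loc=-1.5, size=(400, 2))
X = np.vstack([X_pos, X_neg])

# s = 1 only for positives that happen to be labeled (half of them here)
s = np.zeros(len(X), dtype=int)
s[:200] = 1

X_fit, X_hold, s_fit, s_hold = train_test_split(
    X, s, test_size=0.25, stratify=s, random_state=0
)

# g(x) approximates P(s=1 | x): labeled vs. unlabeled
g = LogisticRegression().fit(X_fit, s_fit)

# c approximates P(s=1 | y=1): mean score over held-out labeled positives
c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Corrected estimate of P(y=1 | x), clipped to a valid probability
p_y = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)

print("Estimated label frequency c:", round(float(c), 3))
```

Thresholding `p_y` instead of the raw classifier score is one way to assign pseudo-labels to the unlabeled samples.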

Good luck with the challenge!
