## Getting Started
This notebook provides a baseline model for the challenge using positive-unlabeled (PU) learning.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load datasets
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

# Display dataset information
print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)

# Sample of train dataset
train_df.head()

Train dataset shape: (55913, 3)
Test dataset shape: (6347, 3)


|   | text | type | label |
|---|------|------|-------|
| 0 | "QT @user In the original draft of the 7th boo... | human | 2 |
| 1 | @user Alciato: Bee will invest 150 million in ... | human | 2 |
| 2 | @user LIT MY MUM 'Kerry the louboutins I wonde... | human | 2 |
| 3 | \"""" SOUL TRAIN\"""" OCT 27 HALLOWEEN SPECIA... | human | 2 |
| 4 | So disappointed in wwe summerslam! I want to s... | unlabeled | unlabeled |


## Exploring the Labels
Before training the model, let's inspect the unique values and their counts in the `label` column for both the training and test datasets.

In [4]:
# Unique labels and counts in the training dataset
print("Unique labels in train dataset:")
print(train_df['label'].unique())
print("Label distribution in train dataset:")
print(train_df['label'].value_counts())

# Unique labels and counts in the test dataset
print("Unique labels in test dataset:")
print(test_df['label'].unique())
print("Label distribution in test dataset:")
print(test_df['label'].value_counts())

Unique labels in train dataset:
['2' 'unlabeled' '0']
Label distribution in train dataset:
label
unlabeled    33059
2            21680
0             1174
Name: count, dtype: int64
Unique labels in test dataset:
[2 0]
Label distribution in test dataset:
label
0    3972
2    2375
Name: count, dtype: int64
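
The outputs above show that the training labels are strings (`'2'`, `'0'`, `'unlabeled'`) while the test labels are integers. One way to bring both onto a common encoding is sketched below. The helper name `to_pu_target`, the `-1` code for unlabeled rows, and the miniature demo frames are assumptions made for illustration; they are not part of the challenge data.

```python
import pandas as pd

# Hypothetical miniature frames that mimic the dtypes seen in the outputs above.
train_demo = pd.DataFrame({"label": ["2", "unlabeled", "0", "2"]})
test_demo = pd.DataFrame({"label": [2, 0, 2]})


def to_pu_target(labels: pd.Series) -> pd.Series:
    """Map raw labels to 1 (positive), 0 (negative), or -1 (unlabeled)."""
    mapping = {"2": 1, "0": 0, "unlabeled": -1}
    # Casting to str first lets the same mapping handle string and integer labels.
    # Any value outside the mapping becomes NaN, so astype(int) fails loudly.
    return labels.astype(str).map(mapping).astype(int)


train_target = to_pu_target(train_demo["label"])
test_target = to_pu_target(test_demo["label"])
print(train_target.tolist())  # [1, -1, 0, 1]
print(test_target.tolist())   # [1, 0, 1]
```

Keeping unlabeled rows as a separate code, rather than folding them into class 0, leaves the choice of how to treat them to the modelling step.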


## Baseline Model
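We will implement a simple **Positive-Unlabeled (PU) learning** baseline using logistic regression. Explicit negative labels are scarce in the training set (only 1,174 of 55,913 rows are labeled `0`), so the general strategy is a two-step approach:
1. Train a naive model that treats positive examples as class 1 and all remaining training rows as class 0.
2. Use heuristics or additional methods to infer probable negative instances.

The baseline code below implements only step 1.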


In [5]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Convert text data into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['text'].astype(str))
X_test = vectorizer.transform(test_df['text'].astype(str))

# Training labels are strings: '2' (positive) maps to 1; everything else
# ('unlabeled' and the few rows labeled '0') maps to 0
y_train = np.where(train_df['label'].astype(str) == '2', 1, 0)

# For test data, convert labels to binary (assuming 2 is positive and 0 is negative)
y_test = np.where(test_df['label'] == 2, 1, 0)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Convert probabilities to binary predictions using a threshold
threshold = 0.5
y_pred = (y_pred_proba >= threshold).astype(int)

# Evaluate the model using F1-score
f1 = f1_score(y_test, y_pred, average='binary')
print("F1-score:", f1)


F1-score: 0.510138113429327
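
The baseline cell above covers only the first step of the two-step approach. A minimal sketch of the second step follows, under stated assumptions: the features are synthetic Gaussian clusters standing in for TF-IDF vectors, and `neg_fraction` is a hypothetical setting that would need tuning on real data. It picks the lowest-scoring unlabeled rows as "reliable negatives" and retrains on positives versus those rows only. This is one common heuristic, not the challenge's prescribed method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for TF-IDF features: two well-separated Gaussian clusters.
X_pos = rng.normal(loc=2.0, size=(200, 2))
X_neg = rng.normal(loc=-2.0, size=(200, 2))
X = np.vstack([X_pos, X_neg])
true_y = np.r_[np.ones(200, dtype=int), np.zeros(200, dtype=int)]

# Only half of the true positives are labeled; every other row is unlabeled (s = 0).
s = np.zeros(400, dtype=int)
s[:100] = 1

# Step 1: naive model, labeled positives vs. everything else.
step1 = LogisticRegression().fit(X, s)
scores = step1.predict_proba(X)[:, 1]

# Step 2: treat the lowest-scoring fraction of unlabeled rows as reliable negatives.
unlabeled_idx = np.flatnonzero(s == 0)
neg_fraction = 0.3  # hypothetical value, to be tuned
n_neg = int(neg_fraction * len(unlabeled_idx))
reliable_neg_idx = unlabeled_idx[np.argsort(scores[unlabeled_idx])[:n_neg]]

# Retrain on labeled positives vs. reliable negatives only.
pos_idx = np.flatnonzero(s == 1)
train_idx = np.r_[pos_idx, reliable_neg_idx]
y_step2 = np.r_[np.ones(len(pos_idx), dtype=int), np.zeros(n_neg, dtype=int)]
step2 = LogisticRegression().fit(X[train_idx], y_step2)

accuracy = (step2.predict(X) == true_y).mean()
print("Accuracy against the hidden true labels:", accuracy)
```

On the real data, `X` and `s` would correspond to `X_train` and `y_train` from the baseline cell, and the quality of the selected negatives would depend heavily on how well the step 1 scores separate the classes.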



## Next Steps
Participants are encouraged to:
- Experiment with different **PU learning techniques** such as **weighted loss functions** or **semi-supervised learning**.
- Use **self-training** or **EM algorithms** to improve pseudo-labeling.
- Incorporate **pre-trained embeddings (e.g., BERT, RoBERTa)** for feature extraction.
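
As one illustration of the weighted-loss idea in the list above, the sketch below up-weights the labeled positive class in logistic regression through scikit-learn's `class_weight` parameter. The synthetic data and the weight of `3.0` are assumptions chosen for the demo; a suitable weight for the challenge data would have to be found by validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic features: 300 true positives around (+1, +1), 300 true negatives around (-1, -1).
X = np.vstack([
    rng.normal(loc=1.0, scale=1.0, size=(300, 2)),
    rng.normal(loc=-1.0, scale=1.0, size=(300, 2)),
])

# PU labels: only a third of the true positives are labeled; the rest look like class 0.
s = np.zeros(600, dtype=int)
s[:100] = 1

# Unweighted fit: hidden positives inside class 0 pull predictions toward 0.
unweighted = LogisticRegression().fit(X, s)

# Weighted fit: errors on labeled positives cost three times as much (hypothetical weight).
weighted = LogisticRegression(class_weight={0: 1.0, 1: 3.0}).fit(X, s)

n_unweighted = int((unweighted.predict(X) == 1).sum())
n_weighted = int((weighted.predict(X) == 1).sum())
print("Predicted positives (unweighted):", n_unweighted)
print("Predicted positives (weighted):  ", n_weighted)
```

The weighted model predicts more positives at the same 0.5 threshold, which partially compensates for positives hidden among the unlabeled rows. The same effect can also be approached by lowering the decision threshold instead of reweighting.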

Good luck with the challenge!
