# Stolen Content Detection Model

In this notebook, I build a simple text-similarity model to detect stolen posts on DOT.

Goal:
- Use only **post text and timestamps** (no ground-truth labels).
- Flag posts that are likely duplicates of earlier posts.
- Compare model predictions to our synthetic ground truth (`is_stolen`) to see how well a real-world detector could perform.


In [80]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [81]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report, confusion_matrix

In [82]:
# Load the tables generated in step 1 (posts, users, feed impressions)
posts = pd.read_csv("../data/posts.csv")
users = pd.read_csv("../data/users.csv")
impr = pd.read_csv("../data/feed_impressions.csv")

posts.head()



,post_id,author_id,group_id,is_original,is_stolen,created_at,media_type,text,like_count,comment_count,share_count
0,1,2986,1,True,False,2024-08-04,image,Content group 1 original post about topic 9,37,9,1
1,2,559,2,True,False,2024-07-17,image,Content group 2 original post about topic 18,2,0,0
2,3,1170,3,True,False,2024-07-24,video,Content group 3 original post about topic 23,28,4,2
3,4,21,4,True,False,2024-07-26,video,Content group 4 original post about topic 34,20,4,2
4,5,130,5,True,False,2024-07-05,video,Content group 5 original post about topic 4,9,1,0


In [83]:
posts.shape, users.shape


((3021, 11), (3000, 5))

## Text preprocessing and vectorization

Here I:
1. Clean the post text (fill missing values, lowercase).
2. Use TF-IDF to turn each post into a sparse text vector.

With every post represented as a vector, I can compute cosine similarity between posts to detect near-duplicates.
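As a rough illustration of why TF-IDF plus cosine similarity surfaces near-duplicates, here is a minimal sketch on three made-up posts. The texts and the `toy_*` names are assumptions for this example only and are not part of the DOT dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical posts: the second is a copy of the first with trivial edits.
toy_posts = [
    "Sunset photos from my trip to Lisbon",
    "sunset photos from my trip to lisbon!!",
    "Ten tips for training a new puppy",
]

# TfidfVectorizer lowercases and strips punctuation by default,
# so the first two posts produce identical token sequences.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_posts)

# Pairwise cosine similarity: near 1 for the copy, 0 for the unrelated post.
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The copy scores close to 1 against its source because casing and punctuation are normalized away, while the unrelated post shares no terms and scores 0.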


In [84]:
# Basic text cleaning
posts["text"] = posts["text"].fillna("").astype(str)
posts["post_text_clean"] = posts["text"].str.lower()

# TF-IDF vectorization on unigrams and bigrams
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    stop_words="english"
)

tfidf_matrix = vectorizer.fit_transform(posts["post_text_clean"])
tfidf_matrix.shape


(3021, 1454)

## Similarity graph between posts

Next, I compute cosine similarity between every pair of posts.
For each post, I find the **most similar other post** and treat that as its best candidate "original".

Later, if:
- similarity is high, and  
- the candidate original is **older**

then I flag the newer post as a potential stolen post.


In [85]:
# cosine_similarity was already imported at the top of the notebook.
# This builds a dense n x n matrix, which is fine for ~3k posts.
sim_matrix = cosine_similarity(tfidf_matrix)

# A post is perfectly similar to itself; set the diagonal to 0
np.fill_diagonal(sim_matrix, 0)

# For each post, find the index and score of the most similar other post.
# Note: argmax returns the first index when several posts tie for the top score.
best_match_idx = sim_matrix.argmax(axis=1)
best_match_score = sim_matrix.max(axis=1)

posts["best_match_index"] = best_match_idx
posts["best_match_score"] = best_match_score

posts[["post_id", "best_match_score"]].head()


,post_id,best_match_score
0,1,1.0
1,2,0.744558
2,3,1.0
3,4,0.746047
4,5,1.0


## Rule-based detector: is this post stolen?

Heuristic:
- If a post's best match has cosine similarity above a threshold (e.g. 0.8)
- and the best-match post was created **earlier in time**
- then the newer post is flagged as `pred_is_stolen = True`.

This simulates how a simple production detector might work using only text similarity and time.


In [86]:
# Make sure created_at is in datetime format
posts["created_at"] = pd.to_datetime(posts["created_at"])

# Look up each best match's creation time by position
created_at_values = posts["created_at"].values
best_match_created_at = created_at_values[best_match_idx]

# Similarity threshold
SIM_THRESHOLD = 0.8

posts["pred_is_stolen"] = (
    (posts["best_match_score"] >= SIM_THRESHOLD) &
    # Strictly newer than its best match; timestamps have day resolution,
    # so a copy posted on the same day as its best match is not flagged
    (created_at_values > best_match_created_at)
)

# Count flagged vs. unflagged posts (a cell displays only its last expression)
posts["pred_is_stolen"].value_counts()


pred_is_stolen
False    1912
True     1109
Name: count, dtype: int64
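The rule above compares each post only with its single best match. When several posts tie for the top score, `argmax` returns the first one by index, which may be a newer copy rather than the original. A hedged variant is to flag a post if any sufficiently similar post is strictly older. The sketch below uses made-up similarities and dates; `flag_with_any_earlier_match` is a hypothetical helper, and this is not a claim about how it would change the metrics reported in this notebook.

```python
import numpy as np
import pandas as pd

def flag_with_any_earlier_match(sim, created_at, threshold=0.8):
    """Flag post i if ANY post j has sim[i, j] >= threshold and is strictly older."""
    sim = np.asarray(sim, dtype=float)
    ts = pd.to_datetime(pd.Series(created_at)).to_numpy()
    # earlier[i, j] is True when post j was created strictly before post i.
    earlier = ts[None, :] < ts[:, None]
    return ((sim >= threshold) & earlier).any(axis=1)

# Hypothetical example: posts 0 and 1 are copies, post 2 is the original.
toy_sim = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
toy_dates = ["2024-07-10", "2024-07-20", "2024-07-01"]

flags = flag_with_any_earlier_match(toy_sim, toy_dates)

# Best-match-only rule, for comparison with the heuristic above.
ts = pd.to_datetime(pd.Series(toy_dates)).to_numpy()
best = toy_sim.argmax(axis=1)
best_only = (toy_sim.max(axis=1) >= 0.8) & (ts > ts[best])

print("any earlier match:", flags)
print("best match only:  ", best_only)
```

In this toy case the best-match rule misses post 0, because its first tied match is the newer copy (post 1), while the any-earlier rule still finds the original. The pairwise boolean mask is n x n, so this form suits tables of a few thousand posts.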

## Model evaluation against synthetic ground truth

Because I generated the dataset, I have a true `is_stolen` label for each post.

I can now:
- Compare my rule-based detector (`pred_is_stolen`) with `is_stolen`
- Compute precision, recall, F1-score
- Look at the confusion matrix to see where the model makes mistakes


In [87]:
# Ground truth label from our generator
y_true = posts["is_stolen"].astype(bool)
y_pred = posts["pred_is_stolen"].astype(bool)

print("Class balance (true):")
print(y_true.value_counts(normalize=True))

print("\nClassification report:")
print(classification_report(y_true, y_pred, digits=3))

print("Confusion matrix [[TN, FP],[FN, TP]]:")
print(confusion_matrix(y_true, y_pred))

Class balance (true):
is_stolen
False    0.662032
True     0.337968
Name: proportion, dtype: float64

Classification report:
              precision    recall  f1-score   support

       False      0.640     0.612     0.625      2000
        True      0.299     0.325     0.312      1021

    accuracy                          0.515      3021
   macro avg      0.470     0.468     0.468      3021
weighted avg      0.525     0.515     0.519      3021

Confusion matrix [[TN, FP],[FN, TP]]:
[[1223  777]
 [ 689  332]]
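To make the link between the confusion matrix and the report explicit, the stolen-class metrics can be recomputed by hand from the four counts printed above. This is a small arithmetic check, using only numbers already shown in the output.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[1223, 777],
               [689, 332]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of the posts flagged, how many were stolen
recall = tp / (tp + fn)           # of the stolen posts, how many were flagged
accuracy = (tp + tn) / cm.sum()
base_rate = (tp + fn) / cm.sum()  # share of posts that are actually stolen

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} base_rate={base_rate:.3f}")
```

Precision on the stolen class (0.299) is below the base rate of stolen posts (0.338), which suggests that a flagged post is currently no more likely to be stolen than a post picked at random.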


## Who does the detector protect (or fail)?

Finally, I segment model performance by creator geography (country) to see:

- Are we better at catching stolen content for some creators than others?
- Do we under-protect certain countries or segments?

The same breakdown could be repeated for other creator attributes, such as creator type. This matters for fairness and long-term ecosystem health.

In [88]:
# Load users to join creator attributes
users = pd.read_csv("../data/users.csv")

posts_with_users = posts.merge(
    users,
    left_on="author_id",
    right_on="user_id",
    how="left",
    suffixes=("", "_user")
)

# Example: performance by country
country_stats = (
    posts_with_users
    .groupby("country")
    .apply(lambda df: pd.Series({
        "n_posts": len(df),
        "precision": (
            (df["pred_is_stolen"] & df["is_stolen"]).sum() /
            max(df["pred_is_stolen"].sum(), 1)
        ),
        "recall": (
            (df["pred_is_stolen"] & df["is_stolen"]).sum() /
            max(df["is_stolen"].sum(), 1)
        )
    }))
    .sort_values("n_posts", ascending=False)
)

country_stats.head()


country,n_posts,precision,recall
US,1095.0,0.297436,0.310992
IN,765.0,0.32069,0.341912
BR,463.0,0.305882,0.351351
GB,443.0,0.266272,0.304054
CA,255.0,0.288889,0.325
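The same per-segment breakdown can also be written with named aggregation instead of `groupby(...).apply`. Recent pandas versions can warn when `apply` operates on the grouping column, and returning a `pd.Series` from the lambda turns `n_posts` into a float. The sketch below is one possible alternative on made-up rows; `segment_precision_recall` is a hypothetical helper name.

```python
import pandas as pd

def segment_precision_recall(df, by):
    """Per-segment precision and recall for boolean is_stolen / pred_is_stolen columns."""
    work = df.assign(true_positive=df["pred_is_stolen"] & df["is_stolen"])
    out = work.groupby(by).agg(
        n_posts=("is_stolen", "size"),
        n_flagged=("pred_is_stolen", "sum"),
        n_stolen=("is_stolen", "sum"),
        n_true_positive=("true_positive", "sum"),
    )
    # clip(lower=1) mirrors the max(..., 1) guard against division by zero.
    out["precision"] = out["n_true_positive"] / out["n_flagged"].clip(lower=1)
    out["recall"] = out["n_true_positive"] / out["n_stolen"].clip(lower=1)
    return out.sort_values("n_posts", ascending=False)

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "IN", "IN"],
    "is_stolen": [True, False, True, True, False],
    "pred_is_stolen": [True, True, False, True, False],
})
toy_stats = segment_precision_recall(toy, "country")
print(toy_stats[["n_posts", "precision", "recall"]])
```

Keeping the raw counts (`n_flagged`, `n_stolen`) next to the ratios makes it easier to judge whether differences between small segments are meaningful.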


## Summary

In this step, I built a simple text-similarity-based detector for stolen posts on DOT:

- Represented posts using **TF-IDF** over unigrams and bigrams.
- Computed **cosine similarity** between all posts to find the best candidate original for each post.
- Used a rule-based heuristic (similarity threshold + earlier timestamp) to flag potential stolen posts.
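- Evaluated the detector against synthetic ground truth (`is_stolen`): on the stolen class it reaches precision 0.299, recall 0.325, and F1 0.312, against a base rate of about 33.8% stolen posts, so the current rule performs roughly at chance level and needs further iteration.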
- Segmented performance by **country** (and optionally creator type) to understand fairness and coverage.

This mirrors how a data scientist in a product analytics role might:
- Prototype a detection signal
- Quantify its quality
- Connect model performance back to **creator experience** and **ecosystem health**

In [89]:
posts.to_csv("../data/posts_with_predictions.csv", index=False)
