# DOT Stolen Content Detection Project

This project simulates a social media platform (DOT) where users create posts and, in some cases, copy or "steal" content from other users.  
The goal is to:

1. Generate a synthetic dataset representing users, posts, and feed impressions.  
2. Identify stolen posts using metadata.  
3. Quantify the harm stolen posts cause in terms of exposure, engagement, and fairness.  
4. Run SQL-based product analytics on the result, in the style of Meta data science interview questions.  

In [10]:
import sys
print(sys.executable)

/Users/sakshigandhi/Desktop/dot_stolen_content_project/.venv/bin/python


In [11]:
!{sys.executable} -m pip install numpy pandas



In [12]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import os

np.random.seed(42)

# Ensure data folder exists
os.makedirs("../data", exist_ok=True)

## Step 1 — Generate Users Table

We create a simulated users table with:

- `user_id`: unique identifier  
- `join_date`: when the user joined the platform  
- `country`: geographic attribute  
- `creator_type`: casual, influencer or business  
- `is_flagged`: whether the user has past policy violations  

This helps us segment harm and behavior by creator type and geography later.
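As a rough illustration of the segmentation this table enables, the sketch below builds a small toy table with the same columns and summarises user share and flagged rate per creator type. The names `toy_users` and `segment_summary` are hypothetical and not part of the project, and a local random generator is used so the notebook's global seed is left untouched.

```python
import numpy as np
import pandas as pd

# Local generator: does not interfere with np.random.seed(42) used in the notebook
rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy table mirroring the columns of `users`
toy_users = pd.DataFrame({
    "user_id": np.arange(1, n + 1),
    "country": rng.choice(["US", "IN", "BR", "GB", "CA"], size=n,
                          p=[0.35, 0.25, 0.15, 0.15, 0.10]),
    "creator_type": rng.choice(["casual", "influencer", "business"], size=n,
                               p=[0.7, 0.2, 0.1]),
    "is_flagged": rng.random(n) < 0.05,
})

# User count, flagged rate, and share of all users per creator type
segment_summary = (
    toy_users.groupby("creator_type")
    .agg(n_users=("user_id", "size"), flagged_rate=("is_flagged", "mean"))
    .assign(user_share=lambda d: d["n_users"] / d["n_users"].sum())
)
print(segment_summary)
```

The same `groupby` pattern should apply to `country`, or to both columns together, once the real `users` table exists.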


In [13]:
n_users = 3000

user_ids = np.arange(1, n_users + 1)

countries = ["US", "IN", "BR", "GB", "CA"]
creator_types = ["casual", "influencer", "business"]

users = pd.DataFrame({
    "user_id": user_ids,
    "join_date": pd.to_datetime("2024-01-01") + pd.to_timedelta(
        np.random.randint(0, 180, size=n_users), unit="D"
    ),
    "country": np.random.choice(countries, size=n_users, p=[0.35, 0.25, 0.15, 0.15, 0.10]),
    "creator_type": np.random.choice(creator_types, size=n_users, p=[0.7, 0.2, 0.1]),
    "is_flagged": np.random.binomial(1, 0.05, size=n_users).astype(bool)
})

users.head()

| | user_id | join_date | country | creator_type | is_flagged |
|---|---|---|---|---|---|
| 0 | 1 | 2024-04-12 | IN | business | False |
| 1 | 2 | 2024-06-28 | CA | casual | False |
| 2 | 3 | 2024-04-02 | IN | casual | False |
| 3 | 4 | 2024-01-15 | CA | business | False |
| 4 | 5 | 2024-04-16 | BR | casual | False |


## Step 2 — Generate Posts Table

Each content group (`group_id`) represents an underlying original idea or piece of content.  
For each group:

- We create *one* original post.  
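- With 25% probability, the group is marked as stolen and we draw 1 to 3 copies by randomly chosen users. A draw is skipped when the chosen user happens to be the original author, so a marked group can end up with fewer copies.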

Fields include:
- `post_id`
- `author_id`
- `group_id`
- `is_original`
- `is_stolen`
- `created_at`
- `media_type`
- `text`

This allows us to study how stolen content competes with originals in the feed.
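One metadata-based heuristic for spotting copies, assumed here for illustration rather than taken from the project, is to treat the earliest post in each `group_id` as the original and every later post as a suspected copy. The sketch uses a hypothetical `toy_posts` table, and the `rank_in_group` and `predicted_stolen` columns are likewise made up for this example.

```python
import pandas as pd

# Hypothetical toy posts: group 1 has one copy, group 3 has a same-day copy
toy_posts = pd.DataFrame({
    "post_id":    [1, 2, 3, 4, 5],
    "author_id":  [10, 20, 30, 40, 50],
    "group_id":   [1, 1, 2, 3, 3],
    "created_at": pd.to_datetime([
        "2024-07-01", "2024-07-04", "2024-07-02", "2024-07-10", "2024-07-10",
    ]),
})

# Order posts within each group by time, breaking same-day ties by post_id
ordered = toy_posts.sort_values(["group_id", "created_at", "post_id"])

# cumcount() numbers rows within each group; assignment aligns on the index
toy_posts["rank_in_group"] = ordered.groupby("group_id").cumcount() + 1

# Assumption: anything after the first post in a group is a suspected copy
toy_posts["predicted_stolen"] = toy_posts["rank_in_group"] > 1
print(toy_posts[["post_id", "group_id", "rank_in_group", "predicted_stolen"]])
```

Because a copy in this simulation can share its `created_at` date with the original, the `post_id` tie-break matters. On real data, where the grouping is not known in advance, this heuristic would first need some form of content matching.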


In [14]:
n_groups = 2000
group_ids = np.arange(1, n_groups + 1)

# whether this group of content will have stolen copies
group_is_stolen = np.random.binomial(1, 0.25, size=n_groups).astype(bool)

posts_rows = []
post_id = 1
start_date = datetime(2024, 7, 1)

for gid, stolen_flag in zip(group_ids, group_is_stolen):
    original_author = np.random.choice(user_ids)
    created_at = start_date + timedelta(days=int(np.random.randint(0, 60)))

    # original post
    posts_rows.append({
        "post_id": post_id,
        "author_id": original_author,
        "group_id": gid,
        "is_original": True,
        "is_stolen": False,
        "created_at": created_at,
        "media_type": np.random.choice(["image", "video", "text"], p=[0.4, 0.5, 0.1]),
        "text": f"Content group {gid} original post about topic {np.random.randint(1, 50)}",
    })
    post_id += 1

    # stolen copies
    if stolen_flag:
        n_copies = np.random.randint(1, 4)
        for _ in range(n_copies):
            thief = np.random.choice(user_ids)
            if thief == original_author:
                continue

            copy_created_at = created_at + timedelta(days=int(np.random.randint(0, 10)))

            posts_rows.append({
                "post_id": post_id,
                "author_id": thief,
                "group_id": gid,
                "is_original": False,
                "is_stolen": True,
                "created_at": copy_created_at,
                "media_type": np.random.choice(["image", "video", "text"], p=[0.4, 0.5, 0.1]),
                "text": f"Content group {gid} copied post about topic {np.random.randint(1, 50)}",
            })
            post_id += 1

posts = pd.DataFrame(posts_rows)
posts.head()


| | post_id | author_id | group_id | is_original | is_stolen | created_at | media_type | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2986 | 1 | True | False | 2024-08-04 | image | Content group 1 original post about topic 9 |
| 1 | 2 | 559 | 2 | True | False | 2024-07-17 | image | Content group 2 original post about topic 18 |
| 2 | 3 | 1170 | 3 | True | False | 2024-07-24 | video | Content group 3 original post about topic 23 |
| 3 | 4 | 21 | 4 | True | False | 2024-07-26 | video | Content group 4 original post about topic 34 |
| 4 | 5 | 130 | 5 | True | False | 2024-07-05 | video | Content group 5 original post about topic 4 |


## Step 3 — Add Engagement Metrics

We simulate basic engagement:

- `like_count`
- `comment_count`
- `share_count`

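Engagement scales with a per-group popularity score. Originals get a 1.1× multiplier and stolen copies a 0.9× multiplier, so an original tends to receive slightly more engagement than a copy of the same content.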
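To see how large that gap is on average, the sketch below simulates like counts for many original/copy pairs using the same distributions as the notebook (gamma popularity, a uniform like rate, and the 1.1 and 0.9 multipliers). It uses its own random generator and skips the integer truncation, so it approximates the setup rather than reproducing it. The ratio of mean likes should land close to 1.1 / 0.9 ≈ 1.22.

```python
import numpy as np

# Local generator so the notebook's global seed is not disturbed
rng = np.random.default_rng(0)
n_pairs = 100_000

# One popularity value per content group, shared by the original and its copy
popularity = rng.gamma(shape=2.0, scale=100.0, size=n_pairs)

# Independent like rates for the original and the copy
likes_original = popularity * 1.1 * rng.uniform(0.05, 0.15, size=n_pairs)
likes_copy = popularity * 0.9 * rng.uniform(0.05, 0.15, size=n_pairs)

# Ratio of average likes: original vs. copy
ratio = likes_original.mean() / likes_copy.mean()
print(f"mean likes ratio (original / copy): {ratio:.3f}")
```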


In [15]:
# popularity per content group
group_popularity = {
    gid: np.random.gamma(shape=2.0, scale=100.0)
    for gid in group_ids
}

popularity = posts["group_id"].map(group_popularity)
popularity_adjusted = popularity * np.where(posts["is_original"], 1.1, 0.9)

posts["like_count"] = (popularity_adjusted * np.random.uniform(0.05, 0.15, size=len(posts))).astype(int)
posts["comment_count"] = (posts["like_count"] * np.random.uniform(0.1, 0.3, size=len(posts))).astype(int)
posts["share_count"] = (posts["like_count"] * np.random.uniform(0.05, 0.2, size=len(posts))).astype(int)

posts.head()


| | post_id | author_id | group_id | is_original | is_stolen | created_at | media_type | text | like_count | comment_count | share_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2986 | 1 | True | False | 2024-08-04 | image | Content group 1 original post about topic 9 | 37 | 9 | 1 |
| 1 | 2 | 559 | 2 | True | False | 2024-07-17 | image | Content group 2 original post about topic 18 | 2 | 0 | 0 |
| 2 | 3 | 1170 | 3 | True | False | 2024-07-24 | video | Content group 3 original post about topic 23 | 28 | 4 | 2 |
| 3 | 4 | 21 | 4 | True | False | 2024-07-26 | video | Content group 4 original post about topic 34 | 20 | 4 | 2 |
| 4 | 5 | 130 | 5 | True | False | 2024-07-05 | video | Content group 5 original post about topic 4 | 9 | 1 | 0 |


## Step 4 — Generate Feed Impressions

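This table simulates what the feed served to viewers, one row per impression:

- Each post receives a popularity-scaled number of impressions. The base count lies between 200 and 1500, is multiplied by ±30% random noise, and is then capped at 1500, so the final count ranges from roughly 140 to 1500.
- Impression-level data includes:
  - `impression_id` and `post_id`
  - `viewer_id`: a randomly drawn user
  - `position`: slot in the feed, drawn uniformly from 1 to 100
  - `clicked`: 15% chance per impression
  - `liked`: 40% chance, only possible after a click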

This table is key for analyzing the exposure and harm caused by stolen posts.
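The impression count and the click/like draws can also be computed in a vectorised way, which may be worth considering given that a row-by-row loop has to build about two million rows. The sketch below is an alternative formulation on a hypothetical three-post table (`toy_posts`, with a made-up `popularity` column), not the notebook's own code. It draws random numbers in a different order, so it would not reproduce the notebook's exact output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # local generator, independent of the notebook's seed
MIN_IMPRESSIONS, MAX_IMPRESSIONS = 200, 1500

# Hypothetical posts with low, medium, and high group popularity
toy_posts = pd.DataFrame({"post_id": [1, 2, 3], "popularity": [20.0, 200.0, 2000.0]})

# Saturating transform: popularity / (popularity + 300) lies in (0, 1)
saturation = toy_posts["popularity"] / (toy_posts["popularity"] + 300)
scaled = MIN_IMPRESSIONS + saturation * (MAX_IMPRESSIONS - MIN_IMPRESSIONS)

# +/-30% noise, truncate to int, then cap at the maximum
noise = rng.uniform(0.7, 1.3, size=len(toy_posts))
n_impressions = np.minimum((scaled * noise).astype(int), MAX_IMPRESSIONS).to_numpy()

# Expand to one row per impression with np.repeat instead of nested loops
total = int(n_impressions.sum())
clicked = rng.random(total) < 0.15
liked = clicked & (rng.random(total) < 0.4)  # a like requires a click

toy_impressions = pd.DataFrame({
    "impression_id": np.arange(1, total + 1),
    "viewer_id": rng.integers(1, 3001, size=total),   # user ids 1..3000
    "post_id": np.repeat(toy_posts["post_id"].to_numpy(), n_impressions),
    "position": rng.integers(1, 101, size=total),     # feed slots 1..100
    "clicked": clicked,
    "liked": liked,
})
print(toy_impressions.groupby("post_id").size())
```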


In [16]:
impression_rows = []
impression_id = 1

max_impressions_per_post = 1500
min_impressions_per_post = 200

for _, row in posts.iterrows():
    base = group_popularity[row["group_id"]]
    scaled = min_impressions_per_post + (base / (base + 300)) * (max_impressions_per_post - min_impressions_per_post)
    n_impressions = int(scaled * np.random.uniform(0.7, 1.3))
    n_impressions = min(n_impressions, max_impressions_per_post)

    viewers = np.random.choice(user_ids, size=n_impressions, replace=True)

    for viewer in viewers:
        clicked = np.random.binomial(1, 0.15)
        liked = clicked and np.random.binomial(1, 0.4)

        impression_rows.append({
            "impression_id": impression_id,
            "viewer_id": viewer,
            "post_id": row["post_id"],
            "position": np.random.randint(1, 101),
            "clicked": bool(clicked),
            "liked": bool(liked),
        })
        impression_id += 1

feed_impressions = pd.DataFrame(impression_rows)
feed_impressions.head()


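| | impression_id | viewer_id | post_id | position | clicked | liked |
|---|---|---|---|---|---|---|
| 0 | 1 | 1909 | 1 | 63 | True | False |
| 1 | 2 | 227 | 1 | 30 | False | False |
| 2 | 3 | 1406 | 1 | 69 | True | True |
| 3 | 4 | 2039 | 1 | 39 | False | False |
| 4 | 5 | 2228 | 1 | 10 | False | False |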


## Step 5 — Save Dataset

We save the users, posts, and feed impressions tables as CSV files in the `../data/` directory, one level above `notebooks/`.  
These will be used for SQL-based analysis in the next notebook.
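The source does not say which SQL engine the next notebook uses. As one possible setup, the sketch below loads tables into an in-memory SQLite database with `DataFrame.to_sql` and counts impressions for original versus stolen posts. The toy frames stand in for the saved CSVs, and the table names `posts` and `feed_impressions` are assumptions chosen to match the file names.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-ins for posts.csv and feed_impressions.csv
toy_posts = pd.DataFrame({
    "post_id": [1, 2, 3],
    "group_id": [1, 1, 2],
    "is_stolen": [0, 1, 0],  # stored as 0/1 because SQLite has no boolean type
})
toy_impressions = pd.DataFrame({
    "impression_id": [1, 2, 3, 4, 5],
    "post_id": [1, 1, 2, 2, 3],
})

# Load both tables into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
toy_posts.to_sql("posts", conn, index=False)
toy_impressions.to_sql("feed_impressions", conn, index=False)

# Exposure by stolen status: impressions served to originals vs. copies
query = """
SELECT p.is_stolen, COUNT(*) AS impressions
FROM feed_impressions AS f
JOIN posts AS p ON p.post_id = f.post_id
GROUP BY p.is_stolen
ORDER BY p.is_stolen
"""
exposure = pd.read_sql_query(query, conn)
conn.close()
print(exposure)
```

With the real data, the toy frames would be replaced by `pd.read_csv("../data/posts.csv")` and `pd.read_csv("../data/feed_impressions.csv")`.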

In [17]:
users.to_csv("../data/users.csv", index=False)
posts.to_csv("../data/posts.csv", index=False)
feed_impressions.to_csv("../data/feed_impressions.csv", index=False)

users.shape, posts.shape, feed_impressions.shape

((3000, 5), (3021, 11), (2007080, 6))

In [18]:
import os

print(os.getcwd())
print(os.listdir(".."))     # what is above "notebooks/"?
print(os.listdir("../data") if "data" in os.listdir("..") else "No data folder above")

/Users/sakshigandhi/Desktop/dot_stolen_content_project/notebooks:
['.DS_Store', 'new_venv', 'src:', '.venv', 'sql:', 'notebooks:', 'data']
['.DS_Store', 'posts.csv', 'users.csv', 'harm_by_country_summary.csv', 'policy_metrics_summary.csv', 'feed_impressions.csv', 'posts_with_predictions.csv']
