# Generation of synthetic data for radicalised Reddit posts

### Setup
First, install the `langchain` library and its dependencies. Because we use the OpenAI generator chain, we also install `langchain-openai`. The synthetic data generator lives in the experimental package, so `langchain_experimental` is required as well. We then import the necessary modules.

In [1]:
%pip install --upgrade --quiet langchain langchain_experimental langchain-openai python-dotenv pandas
# The OpenAI API key is read from the OPENAI_API_KEY environment variable.
# A later cell loads it from a local .env file using python-dotenv.

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI

Note: you may need to restart the kernel to use updated packages.


### Loading the API key

In [2]:
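import os

import openai
from dotenv import find_dotenv, load_dotenv

# Read the local .env file so that OPENAI_API_KEY is available in the environment.
_ = load_dotenv(find_dotenv())

# Raises KeyError if the key is missing, which surfaces a bad setup early.
openai.api_key = os.environ["OPENAI_API_KEY"]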

### Defining the data schema

In [3]:
class RedditPost(BaseModel):
    id: str
    timestamp: str
    author: str
    body: str
    submission: str
    date: str
    upvotes: int
    upvote_ratio: float
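
The schema above only constrains field types, so a generated record can still carry a malformed id or an out-of-range ratio. Below is a minimal sketch of post-hoc checks in plain Python. It assumes the id rule stated in the generation prompt (7 alphanumeric characters starting with `k`); the function name `validate_record`, the rule that `date` repeats the date part of `timestamp`, and the `[0, 1]` range for `upvote_ratio` are illustrative assumptions, not part of langchain.

```python
import re
from datetime import datetime

# Assumed id rule: "k" followed by six alphanumeric characters (7 in total).
ID_PATTERN = re.compile(r"^k[A-Za-z0-9]{6}$")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not ID_PATTERN.match(record["id"]):
        problems.append("id is not 'k' plus six alphanumeric characters")
    try:
        ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        problems.append("timestamp is not in 'YYYY-MM-DD HH:MM:SS' format")
    else:
        # Assumption: the date field repeats the date part of the timestamp.
        if ts.strftime("%Y-%m-%d") != record["date"]:
            problems.append("date does not match timestamp")
    if record["upvotes"] < 0:
        problems.append("upvotes is negative")
    if not 0.0 <= record["upvote_ratio"] <= 1.0:
        problems.append("upvote_ratio is outside [0, 1]")
    return problems


# Placeholder records for illustration only.
good = {"id": "kab12cd", "timestamp": "2024-01-09 08:30:33",
        "date": "2024-01-09", "upvotes": 80, "upvote_ratio": 0.4}
bad = {"id": "xab12cd", "timestamp": "2024-01-09 08:30:33",
       "date": "2024-02-01", "upvotes": 80, "upvote_ratio": 1.4}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large batch of generated rows.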

### Sample data

In [4]:
examples = [
    {
    "example": """ID: "kq2e83o",
    timestamp: "2024-02-12 10:42:02",
    author: "jespep831",
    Body: "Travelling to Syria, looking for an ISIS supporter to marry. Any tips?",
    Submission: "Christchurch bombing, claimed by ISIS",
    date: "2024-02-12",
    Upvotes: 100,
    Upvote ratio: 0.6"""
    },
    {
    "example": """ID: "ksv2e83",
    timestamp: "2024-03-10 11:22:03",
    author: "youknowwho",
    Body: "Israel-Palestine conflict pisses me off, maybe an attack on Jews at a synagogue would help.",
    Submission: "Israel-Palestine conflict, 2024",
    date: "2024-03-10",
    Upvotes: 120,
    Upvote ratio: 0.78"""
    },
    {
    "example": """ID: "kq2e8y7",
    timestamp: "2024-01-09 08:30:33",
    author: "jihadboy123",
    Body: "ISIS forever!! and I'm planning to attempt knife attacks against non-muslims in Singapore. If that doesn't work, I'll try suicide bombing. Any advice?",
    Submission: "The Rise of ISIS",
    date: "2024-01-09",
    Upvotes: 80,
    Upvote ratio: 0.4"""
    },
    {
    "example": """ID: "kxbas8a",
    timestamp: "2023-05-09 10:30:33",
    author: "restrictden",
    Body: "How do I make C4 bombs? Just wanna kill the non-supporters",
    Submission: "Allahu Akbar",
    date: "2023-05-09",
    Upvotes: 5,
    Upvote ratio: 0.2"""
    },
    {"example": """ID: "ksbdf8s",
    timestamp: "2023-05-09 10:30:33",
    author: "rbelievernow",
    Body: "this place is so un-islamic, it needs to be destroyed",
    Submission: "where can I attack next",
    date: "2023-09-22",
    Upvotes: 2,
    Upvote ratio: 0.2"""
    }
]
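
The example strings above are written by hand, and their field labels vary in capitalisation (`ID` and `Body` next to `timestamp` and `author`). One way to keep them aligned with the schema is to render each example from a dictionary. The sketch below is an illustration only: `format_example` is a hypothetical helper, and the record is placeholder data.

```python
def format_example(record: dict) -> dict:
    """Render one record as the {"example": "..."} dict used for few-shot prompting."""
    lines = [
        # Quote string values, leave numbers bare.
        f'{name}: "{value}"' if isinstance(value, str) else f"{name}: {value}"
        for name, value in record.items()
    ]
    return {"example": ",\n".join(lines)}


# Placeholder record for illustration only.
record = {
    "id": "kab12cd",
    "timestamp": "2024-01-09 08:30:33",
    "author": "example_user",
    "body": "placeholder text",
    "submission": "placeholder title",
    "date": "2024-01-09",
    "upvotes": 80,
    "upvote_ratio": 0.4,
}

print(format_example(record)["example"])
```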

### Crafting a Prompt Template

In [12]:
# Each few-shot example is inserted verbatim through its "example" key.
example_prompt = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=example_prompt,
)
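
Conceptually, a few-shot prompt is the prefix, each rendered example, and the suffix joined into one string, with the remaining variables filled in at call time. The sketch below imitates that in plain Python. It is an approximation for illustration, not langchain's implementation, and the prefix and suffix strings are hypothetical stand-ins for `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`.

```python
def build_few_shot_prompt(prefix, examples, suffix, separator="\n\n", **variables):
    """Join prefix, rendered examples and suffix, then fill in the variables."""
    # Each example is inserted as-is, mirroring template="{example}".
    rendered = [ex["example"] for ex in examples]
    template = separator.join([prefix, *rendered, suffix])
    return template.format(**variables)


# Hypothetical stand-ins for the real prefix, suffix and examples.
prefix = "Here are some example records:"
suffix = "Now generate one new record about {subject}. Constraints: {extra}"
examples = [{"example": "id: 1, value: a"}, {"example": "id: 2, value: b"}]

prompt = build_few_shot_prompt(
    prefix,
    examples,
    suffix,
    subject="a placeholder topic",
    extra="keep the same fields",
)
print(prompt)
```

The `subject` and `extra` keywords play the same role here as the `input_variables` of the prompt template: they are left open until generation time.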

### Creating the Data Generator

In [13]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=RedditPost,
    # Replace with your own chat model instance if needed.
    llm=ChatOpenAI(temperature=0.4),
    prompt=prompt_template,
)

### Generating the Synthetic Data

In [16]:
synthetic_results = synthetic_data_generator.generate(
    subject="RedditPost",
    extra="the id must be a random alphanumeric 7 digit entry that starts with k, the timestamp should be earlier than 15 Mar 2024, the author should be a random usernamethe body/submission should include different types of extremist views that promote hatred towards non-muslims, including plans like bombing, injuring or killing others in Singapore supporting ISIS",
    runs=1000,
)

In [17]:
synthetic_results

[RedditPost(id='kq2e8y7', timestamp='2024-01-09 08:30:33', author='jihadboy123', body="ISIS forever!! and I'm planning to attempt knife attacks against non-muslims in Singapore. If that doesn't work, I'll try suicide bombing. Any advice?", submission='The Rise of ISIS', date='2024-01-09', upvotes=80, upvote_ratio=0.4),
 RedditPost(id='kq2e8y7', timestamp='2024-01-09 08:30:33', author='extremistuser245', body="Let's spread fear and terror by planning knife attacks against non-muslims in Singapore. If that fails, we can resort to suicide bombing. Any tips for maximizing casualties?", submission='Embracing Extremism', date='2024-01-09', upvotes=60, upvote_ratio=0.6),
 RedditPost(id='kq2e8y7', timestamp='2023-11-25 09:15:00', author='radicaluser007', body="Let's unite in the cause of ISIS and carry out attacks against non-muslims in Singapore. We must spread fear and chaos to establish our dominance. If necessary, we should consider extreme measures like suicide bombing for our cause.", su

### Converting the list to a DataFrame

In [18]:
# Convert each RedditPost object into a plain dict, coercing fields to the schema types
formatted_results = [
    {
        "id": str(item.id),
        "timestamp": str(item.timestamp),
        "author": str(item.author),
        "body": str(item.body),
        "submission": str(item.submission),
        "date": str(item.date),
        "upvotes": int(item.upvotes),
        "upvote_ratio": float(item.upvote_ratio)
    }
    for item in synthetic_results
]
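
The generated output above repeats the same `id` across several records, so it may be worth removing duplicates before analysis. Below is a minimal sketch in plain Python. It assumes that two records with the same `id`, `author` and `body` count as duplicates; that key choice and the helper name `drop_duplicates` are assumptions to adjust as needed. The rows are placeholder data.

```python
def drop_duplicates(records, key_fields=("id", "author", "body")):
    """Keep the first record for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Placeholder rows for illustration only.
rows = [
    {"id": "kab12cd", "author": "user_a", "body": "placeholder one"},
    {"id": "kab12cd", "author": "user_a", "body": "placeholder one"},
    {"id": "kab12cd", "author": "user_b", "body": "placeholder two"},
]

unique_rows = drop_duplicates(rows)
print(len(unique_rows))
```

Once the data is in a DataFrame, pandas' `df.drop_duplicates(subset=[...])` does the same job.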

In [19]:
import pandas as pd

# Build a DataFrame with one row per generated post
df = pd.DataFrame(formatted_results)

df.head()

Unnamed: 0,id,timestamp,author,body,submission,date,upvotes,upvote_ratio
0,kq2e8y7,2024-01-09 08:30:33,jihadboy123,ISIS forever!! and I'm planning to attempt kni...,The Rise of ISIS,2024-01-09,80,0.4
1,kq2e8y7,2024-01-09 08:30:33,extremistuser245,Let's spread fear and terror by planning knife...,Embracing Extremism,2024-01-09,60,0.6
2,kq2e8y7,2023-11-25 09:15:00,radicaluser007,Let's unite in the cause of ISIS and carry out...,Embracing Extremism,2023-11-25,70,0.5
3,kx3e8y7,2024-01-05 14:45:00,violentextremist321,Let's join forces to carry out violent attacks...,Spreading Terror for ISIS,2024-01-05,50,0.3
4,kx3e8y7,2024-01-05 14:45:00,violentextremist321,Let's join forces to carry out violent attacks...,Spreading Terror for ISIS,2024-01-05,50,0.3


### Saving the DataFrame to a .csv file

In [20]:
df.to_csv('C:\\Users\\Admin\\Desktop\\radical_data_v3.csv', index=False)
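
The hard-coded Windows path only works on the machine it was written for. Below is a sketch of a more portable variant that builds the output path with `pathlib`. The `output` directory name is an assumption, the row is placeholder data, and the standard `csv` module stands in for `df.to_csv` so that the snippet runs without pandas.

```python
import csv
from pathlib import Path

# Placeholder row for illustration only.
rows = [{"id": "kab12cd", "upvotes": 80, "upvote_ratio": 0.4}]

# Assumed output location, relative to the current working directory.
out_dir = Path("output")
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "radical_data_v3.csv"

# newline="" lets the csv module control line endings itself.
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

print(out_path)
```

With pandas, the equivalent call is `df.to_csv(out_path, index=False)`, since `to_csv` accepts path-like objects.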