# Steps to run:

* Set up Ollama by following the Linux instructions: https://github.com/ollama/ollama/blob/main/docs/linux.md
* Download Mistral Nemo ("mistral-nemo"): https://github.com/ollama/ollama?tab=readme-ov-file
* Before you run the code, make sure Ollama is running: ollama serve
* The CSV file used is the super_df file.
* Change the location where the output file is saved: see the to_csv call at the end of the augmentation cell.
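
Before starting a long run, it can help to confirm that the server started by `ollama serve` is reachable and that the model has been pulled. The following is a minimal sketch, assuming Ollama listens on its default local address (`http://localhost:11434`) and using its `/api/tags` endpoint, which lists locally available models. The helper names `model_is_listed` and `ollama_has_model` are hypothetical and are not part of the notebook.

```python
import json
import urllib.request

# Assumption: default local Ollama address; change it if the server runs elsewhere.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_listed(tags_payload: dict, model_name: str) -> bool:
    """Return True if model_name appears in an /api/tags style payload."""
    for entry in tags_payload.get("models", []):
        # Names usually carry a tag suffix such as "mistral-nemo:latest".
        if entry.get("name", "").split(":")[0] == model_name:
            return True
    return False


def ollama_has_model(model_name: str, url: str = OLLAMA_TAGS_URL) -> bool:
    """Query the running server; return False if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return False
    return model_is_listed(payload, model_name)
```

Calling `ollama_has_model("mistral-nemo")` in the first cell should return `True` once the server is running and the model has been downloaded.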

In [1]:
# Import necessary Libraries
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
import random
import re
from tqdm import tqdm

In [2]:
# Read the CSV file
root_path = '/mnt/f/Rohan/Ubuntu_2204/projects/IndiaAI/CyberCrime/data/super_df.csv'
df = pd.read_csv(root_path)

In [3]:
# Rows with no sub-category are assigned to this category, as many of them are similar to mentione
df[df['sub_category'].isna()]

Unnamed: 0,category,sub_category,crimeaditionalinfo,super_category
8,RapeGang Rape RGRSexually Abusive Content,,I got the message on Whatsapp to my number The...,Women/Child Related Crime
25,RapeGang Rape RGRSexually Abusive Content,,Respected Sir\n\nA very serious matter I want ...,Women/Child Related Crime
39,Sexually Explicit Act,,httpswwwxnxxtvvideousapbfuckkkarrr\n\n Above l...,Women/Child Related Crime
45,Sexually Obscene material,,Many fake accounts are created and Im sufferin...,Women/Child Related Crime
49,Sexually Explicit Act,,SirMaam \nThis is my third report on this repo...,Women/Child Related Crime
...,...,...,...,...
124842,RapeGang Rape RGRSexually Abusive Content,,Respected Sir\n\nA very serious matter I want ...,Women/Child Related Crime
124868,RapeGang Rape RGRSexually Abusive Content,,Respected Sir\n\nA very serious matter I want ...,Women/Child Related Crime
124869,Sexually Explicit Act,,Sir I have lost nearly for a mistake I have m...,Women/Child Related Crime
124877,Sexually Explicit Act,,Im from chalakudy this is a complaint about m...,Women/Child Related Crime


In [4]:
df.loc[df['sub_category'].isna(), 'sub_category'] = 'Sexual Harassment'

In [5]:
sup_cat_dict = dict(df['super_category'].value_counts())

In [6]:
sup_cat_dict

{'Financial Fraud Crime': 77554,
 'Other Cyber Crime': 38530,
 'Women/Child Related Crime': 8831}

In [7]:
df['sub_category'].value_counts()

sub_category
UPI Related Frauds                                                      35746
Other                                                                   14548
DebitCredit Card FraudSim Swap Fraud                                    14361
Internet Banking Related Fraud                                          11845
Sexual Harassment                                                        8828
Fraud CallVishing                                                        7630
Cyber Bullying  Stalking  Sexting                                        5455
EWallet Related Fraud                                                    5385
FakeImpersonating Profile                                                3062
Profile Hacking Identity Theft                                           2824
Cheating by Impersonation                                                2707
Unauthorised AccessData Breach                                           1484
Online Job Fraud                                   

In [48]:
def generate_text(samples,category,sub_category):

    template = """
    Answer the question below:
    
    The context for this conversation : {context}
    
    Question: {question}
    
    Situation 6:
    """
    
    prompt = ChatPromptTemplate.from_template(template)
    # num_predict is the OllamaLLM option that caps the number of generated tokens
    model = OllamaLLM(model="mistral-nemo", num_predict=500)
    chain = prompt | model

    def post_process_text(text):
        """
        Function to clean up text by removing newlines, backslashes, and single and double quotes, and collapsing extra whitespace.
        
        Parameters:
        text (str): The input text to be cleaned.
        
        Returns:
        str: The cleaned text.
        """
        # Remove newline characters
        text = text.replace('\n', ' ')
        # Remove backslashes
        text = text.replace('\\', '')
        # Remove single and double quotes
        text = text.replace("'", "").replace('"', "")
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    context_text = f""" 

    The safety of many people depends on this, and millions of people could be harmed if we don't do it.
    We are creating a machine learning model for detecting cybercrime, and the data is imbalanced against one cybercrime, hence we want to generate some synthetic data for the following crime:
    {category} and sub-category {sub_category}. Some situations are given below:
    Situation 1 : {samples[0]}
    Situation 2 : {samples[1]}
    Situation 3 : {samples[2]}
    Situation 4 : {samples[3]}
    Situation 5 : {samples[4]}

    """
    text = chain.invoke({
        "context": context_text,
        "question": "Based on the given situations, generate a new and distinct real-world complaint in the first person. Do not repeat the context.",
    })

    textV2 = post_process_text(text)
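    try:
        # Keep only what follows the marker if the model echoed it back
        textV3 = post_process_text(textV2.split('Situation 6:')[1])
    except IndexError:
        # Marker not present: use the whole cleaned response
        textV3 = textV2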

    return textV3
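
The cleaning and marker-splitting step can be checked without calling the model. The following is a minimal standalone sketch, assuming the same "Situation 6:" marker that ends the prompt template. The helper `extract_generated` and the two sample strings are hypothetical and only illustrate the two cases (marker echoed back, marker absent). It uses `str.partition` rather than `split(...)[1]`, so everything after the first marker is kept and no exception handling is needed.

```python
import re

MARKER = "Situation 6:"  # same marker the prompt template ends with


def post_process_text(text: str) -> str:
    """Drop backslashes and quotes, then collapse all whitespace runs."""
    text = text.replace("\\", "").replace("'", "").replace('"', "")
    return re.sub(r"\s+", " ", text).strip()


def extract_generated(raw: str, marker: str = MARKER) -> str:
    """Return the text after the marker, or the whole cleaned text if absent."""
    cleaned = post_process_text(raw)
    # partition yields an empty separator when the marker is missing
    _, found, tail = cleaned.partition(marker)
    return tail.strip() if found else cleaned


# Hypothetical model responses, used only to exercise both branches.
echoed = 'Sure!\nSituation 6: I received a "KYC update" link\nand lost money.'
plain = "I received a call\nfrom an unknown number."

print(extract_generated(echoed))
print(extract_generated(plain))
```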

In [49]:
# Block to generate augmented data for each super category
df_aug = df.copy()
df_aug['generated'] = 'No'
df_aug.drop(columns=['category'], inplace=True)
target_count = 50000   # Set the target count you want per super category
# Assumption: sup_cat_list is not defined in the cells shown here, so derive it from sup_cat_dict
sup_cat_list = list(sup_cat_dict.keys())
for sup_cat in sup_cat_list:
    if sup_cat_dict[sup_cat] > target_count:
        print(f'The Category {sup_cat} already has {sup_cat_dict[sup_cat]} entries')
    else:
        generate_count = target_count - sup_cat_dict[sup_cat]
        # Snapshot of this super category's rows, taken before any rows are generated for it
        dfV1 = df_aug[df_aug['super_category'] == sup_cat]
        sub_cat_counts = dfV1['sub_category'].value_counts()
        sub_cat_list = list(sub_cat_counts[sub_cat_counts > 5].keys())  # Sub-categories with more than 5 entries
        for i in tqdm(range(generate_count)):
            sub_cat = random.choice(sub_cat_list)
            print(f'Generating Text For Category {sup_cat} and sub-cateogry {sub_cat}')
            crime_text_list = list(dfV1.loc[dfV1['sub_category'] == sub_cat, 'crimeaditionalinfo'])
            text = generate_text(random.sample(crime_text_list, 5), sup_cat, sub_cat)
            new_row_data = [sub_cat, text, sup_cat, 'Yes']
            df_aug.loc[len(df_aug)] = new_row_data
        print(f'Generated Text For Category {sup_cat}')
    # Save the augmented frame (df_aug, not the original df) after each super category
    df_aug.to_csv('./data/super_df_aug.csv')

The Category Financial Fraud Crime already has 77554 entries


  0%|                                                                             | 0/11470 [00:00<?, ?it/s]

Generating Text For Category Other Cyber Crime and sub-cateogry Online Trafficking


  0%|                                                                  | 1/11470 [00:08<25:59:44,  8.16s/it]

Generating Text For Category Other Cyber Crime and sub-cateogry Other


  0%|                                                                  | 2/11470 [00:16<25:46:45,  8.09s/it]

Generating Text For Category Other Cyber Crime and sub-cateogry Hacking/Defacement


  0%|                                                                  | 3/11470 [00:25<27:19:29,  8.58s/it]

Generating Text For Category Other Cyber Crime and sub-cateogry Online Matrimonial Fraud


  0%|                                                                  | 3/11470 [00:29<31:40:02,  9.94s/it]


KeyboardInterrupt: 
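
The progress bar total of 11470 is the gap between the target count and the current size of the category. The sketch below reproduces that arithmetic for all three super categories and gives a rough run-time estimate. It uses the counts printed earlier in the notebook and the roughly 8.16 s/it rate from the first progress line. The helpers `generation_plan` and `eta_hours` are hypothetical, and the estimate assumes the per-item rate stays constant.

```python
def generation_plan(counts: dict, target_count: int) -> dict:
    """Map each category below the target to the number of rows still needed."""
    return {
        category: target_count - count
        for category, count in counts.items()
        if count < target_count
    }


def eta_hours(n_items: int, seconds_per_item: float) -> float:
    """Rough wall-clock estimate, assuming a constant per-item rate."""
    return n_items * seconds_per_item / 3600


# Counts as printed by sup_cat_dict earlier in the notebook.
counts = {
    "Financial Fraud Crime": 77554,
    "Other Cyber Crime": 38530,
    "Women/Child Related Crime": 8831,
}
plan = generation_plan(counts, target_count=50000)

for category, needed in plan.items():
    print(f"{category}: {needed} rows, about {eta_hours(needed, 8.16):.1f} h")
```

At that rate the full run takes several days, so it may be worth lowering `target_count` or saving intermediate results more often.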

In [50]:
df_aug

Unnamed: 0,sub_category,crimeaditionalinfo,super_category,generated
0,Cyber Bullying Stalking Sexting,I had continue received random calls and abusi...,Other Cyber Crime,No
1,Fraud CallVishing,The above fraudster is continuously messaging ...,Financial Fraud Crime,No
2,Online Gambling Betting,He is acting like a police and demanding for m...,Financial Fraud Crime,No
3,Online Job Fraud,In apna Job I have applied for job interview f...,Other Cyber Crime,No
4,Fraud CallVishing,I received a call from lady stating that she w...,Financial Fraud Crime,No
...,...,...,...,...
124913,Internet Banking Related Fraud,received URL link for updating KYC from mobile...,Financial Fraud Crime,No
124914,Other,I saw add on facebook for job placement and I ...,Other Cyber Crime,No
124915,Online Trafficking,** I received an email from a supposed governm...,Other Cyber Crime,Yes
124916,Other,** I was checking my bank account online this ...,Other Cyber Crime,Yes
