## A tutorial on profiling suspects based on their browser history

### Background
Suspect profiling, also known as criminal profiling, is a technique used in criminal investigations for nearly a century to identify potential suspects based on various psychological, behavioral, and demographic characteristics. Dr. Walter C. Langer, a psychiatrist, was commissioned by the Office of Strategic Services (OSS) to profile [Adolf Hitler](https://www.cia.gov/readingroom/document/cia-rdp78-02646r000600240001-5), marking one of the earliest known attempts at profiling in the 1940s. 

### Goal
This tutorial demonstrates how to build a suspect profile from browser history, a digital footprint that offers valuable insights into an individual's thoughts, interests, and behaviors. The profile aims to:

- Identify the individual's age range
- Determine the individual's interests
- Estimate the individual's location


### Dataset in this study
- The browser history was donated by a volunteer
    - it serves as a demo and is not necessarily related to criminal behavior
    - the ground truth is known, so we can evaluate the accuracy of the profile

- Source: [Google Takeout](https://takeout.google.com/settings/takeout?pli=1)
    - trimmed to 113 records
    - a sample record is shown below; ONLY **title** and the timestamp (`time_usec`) are used for profiling in this demo
```
        {
            "favicon_url": "https://leetcode.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Valid Parentheses - LeetCode", 
            "ptoken": {},
            "url": "https://leetcode.com/problems/valid-parentheses/",
            "client_id": "URp+B/gdRCTBo88fQvclyQ==",
            "time_usec": 1719361796122188
        }
```
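
The `time_usec` value appears to be a Unix timestamp in microseconds. A minimal sketch of turning one Takeout record into a `timestamp: title` line, similar to the lines used in this demo, might look like the following. The helper name `format_record` is hypothetical, and rendering in UTC is an assumption; the demo file may have been produced in local time.

```python
from datetime import datetime, timezone


def format_record(record):
    """Render one history record as 'YYYY-MM-DD HH:MM:SS: title' (UTC assumed)."""
    # time_usec is in microseconds; integer division gives whole seconds
    seconds = record["time_usec"] // 1_000_000
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S}: {record['title']}"


sample_record = {
    "title": "Valid Parentheses - LeetCode",
    "url": "https://leetcode.com/problems/valid-parentheses/",
    "time_usec": 1719361796122188,
}
print(format_record(sample_record))
```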


### Implementation Plan
- [langchain](https://www.langchain.com/)
    - a popular open-source framework 
    - designed to simplify the development of applications using LLMs
- Gemini - API is [free](https://aistudio.google.com/app/apikey)
    - summarization
    - political analysis 
- Can we use DSPy?


### Step 0: Download and check the visited website title dataset 

In [1]:
! wget -q https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/BrowserHistory/Eric/titles_with_timestamp.txt
file_path = "titles_with_timestamp.txt"

# Open the file and read its content
with open(file_path, "r", encoding="utf-8") as file:
    provided_data = file.read()

# Display the content
print(provided_data)

2024-06-27 08:46:00: Google Takeout
2024-06-27 08:16:26: Hell Hades Artifact Optimiser
2024-06-26 22:21:21: Stim Beacon | Valorant Wiki | Fandom
2024-06-26 21:52:33: ChatGPT
2024-06-26 21:52:19: Smart Homes: Remote Control
2024-06-26 21:52:07: VR Training for FB
2024-06-26 21:48:00: Wizard Beer - YouTube
2024-06-26 21:45:49: Hell Hades Artifact Optimiser
2024-06-26 21:45:18: Alternatives to Gnut? : r/RaidShadowLegends
2024-06-26 21:14:41: Sustainability | Free Full-Text | Healthcare in the Smart Home: A Study of Past, Present and Future
2024-06-26 21:06:36: ChatGPT
2024-06-26 20:56:58: VR Training for FB
2024-06-26 20:56:56: Linked List Node Removal
2024-06-26 20:56:55: Optimized Letter Combinations.
2024-06-26 20:56:54: Bus Arrival Analysis: Exponential, Probability.
2024-06-26 20:56:54: ThreeSum with Two-Pointer Algorithm
2024-06-26 20:56:53: Remove nth Node Python
2024-06-26 20:54:39: Here’s a concept for a tank I’ve quickly designed. If this post gains traction then I’ll be happy t
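
To sanity-check the dataset beyond printing it, each line can be split into a timestamp and a title. The sketch below assumes every line follows the `YYYY-MM-DD HH:MM:SS: title` layout seen in the output above; `parse_line` is a hypothetical helper, and the sample lines are copied from that output.

```python
from collections import Counter
from datetime import datetime


def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS: title' into (datetime, title)."""
    # The timestamp is a fixed 19 characters, followed by ': ' and the title
    stamp, title = line[:19], line[21:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), title.strip()


sample_lines = [
    "2024-06-27 08:46:00: Google Takeout",
    "2024-06-26 22:21:21: Stim Beacon | Valorant Wiki | Fandom",
    "2024-06-26 21:52:33: ChatGPT",
]
records = [parse_line(line) for line in sample_lines]

# Count visits per calendar day as a quick overview of activity
visits_per_day = Counter(ts.date().isoformat() for ts, _ in records)
print(visits_per_day)
```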

### Step 1: Download libraries 
- Make sure you use `pip` to install the necessary libraries
- All downloaded and saved files are located in the `content` folder when using Google Colab


In [2]:
#!pip -q install google-generativeai
#!pip -q install langchain-google-genai
#!pip install python-dotenv
#!pip -q install langchain_experimental langchain_core
#!pip install --upgrade langchain

import os
import google.generativeai as genai
from IPython.display import display
from IPython.display import Markdown
from dotenv import load_dotenv
from langchain_google_genai import (
    ChatGoogleGenerativeAI,
    HarmBlockThreshold,
    HarmCategory,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

### Step 2: Configure LangChain with Gemini
- You `MUST` have a Gemini API key
- You can load the API key from a `my_config.env` file
- Alternatively, hard-code your API key later when you create the model
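
If the key is missing, `os.getenv` silently returns `None`, and the failure only surfaces later when the model is called. One possible way to fail early is sketched below; `require_env` is a hypothetical helper and not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check my_config.env"
        )
    return value


# Example use after load_dotenv("my_config.env"):
# GOOGLE_AI_STUDIO = require_env("GOOGLE_AI_STUDIO2")
```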

In [3]:
# ================ Key configuration===========
# Load environment variables from the .env file
load_dotenv("my_config.env")

# Access the environment variables
GOOGLE_AI_STUDIO = os.getenv("GOOGLE_AI_STUDIO2")
genai.configure(api_key=GOOGLE_AI_STUDIO)

# ======= Generation configuration ===========
# Sampling settings for the model
generation_config = {
    "temperature": 0.0,  # Controls the randomness of the model's output; 0.0 makes it as deterministic as possible
    "top_p": 1,  # Keeps the smallest set of tokens whose cumulative probability reaches the threshold p; 1 means all tokens are considered
    "top_k": 16,  # Considers only the k most likely next tokens
    "max_output_tokens": 4096,  # Upper limit on the length of the response, in tokens
}

# ======= Safety configuration=================
# Disable the safety filters through LangChain
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}
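
To make the `top_k` and `top_p` settings more concrete, the toy sketch below filters a small made-up probability table. It is only an illustration: `filter_candidates` and `toy_probs` are hypothetical, and applying top-k before top-p is an assumption rather than a description of how Gemini samples internally.

```python
def filter_candidates(probs, top_k, top_p):
    """Toy top-k then top-p (nucleus) filtering over a token -> probability dict."""
    # Keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    # Then keep tokens until their cumulative probability reaches top_p
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
print(filter_candidates(toy_probs, top_k=3, top_p=0.7))
print(filter_candidates(toy_probs, top_k=16, top_p=1.0))
```

With `top_p` set to 1 and `top_k` larger than the toy vocabulary, nothing is filtered out, so a temperature of 0.0 is what keeps the output stable in that case.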

### Step 3: Build a Gemini model with the configuration

Note: the Gemini key can be hard-coded here.

In [4]:
model = ChatGoogleGenerativeAI(
    model="gemini-pro",
    # Unpack temperature, top_p, top_k, and max_output_tokens as constructor arguments
    **generation_config,
    safety_settings=safety_settings,
    # You can hard-code the key here instead
    google_api_key=GOOGLE_AI_STUDIO,
)

### Step 4: Create a prompt template
- The template is a multi-line string containing placeholders in curly braces. The placeholders are filled in when the prompt is formatted, for example:
```
        formatted_prompt = prompt.format(
            role="You are a helpful assistant.",
            provided_data="Here's some context: ...",
            start="Please answer the following question:"
        )
```
- `{role}`, `{provided_data}`, and `{start}` are placeholders that are filled in later.
    - `{role}`: the role definition, which specifies the role's name, overall objective, task-specific context, and any applicable constraints.
        - Role Name: Criminal profiler.
        - Role Task: Create a psychological profile based on browsing history.
        - Role Focus: Motivations, psychological characteristics, behavioral patterns, relevant insights.
        - Role Restrictions: Avoid identification or accusations, no legal advice.
    - `{provided_data}`: the data required to complete the task.
        - A list of visited web pages with titles and timestamps.
    - `{start}`: the initiation instruction, which triggers the role to carry out the task.

In [5]:
template = """
{role}\
{provided_data}\
{start}
"""
prompt = ChatPromptTemplate.from_template(template)
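
To see what the template renders to, the sketch below fills the same template text with plain `str.format`, used here as a stand-in for `ChatPromptTemplate` on the assumption that simple `{name}` placeholders are substituted the same way. The short filler values are made up for illustration.

```python
template = """
{role}\
{provided_data}\
{start}
"""

# A backslash at the end of a line inside a triple-quoted string joins it
# to the next line, so the three parts are concatenated without a separator.
rendered = template.format(role="ROLE.", provided_data="DATA.", start="START.")
print(repr(rendered))
```

If a line break between the role, the data, and the instruction is wanted, one option is to drop the trailing backslashes.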

### Step 5: Use LangChain to create a simple processing chain

Flow of operation `chain = prompt | model | output_parser`
- The prompt is first formatted and sent to the model.
- The model processes the prompt and generates a response.
- The output parser then processes the model's response, ensuring it's in the correct string format.
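
The pipe syntax can be pictured as function composition: each stage receives the previous stage's output. The toy `Step` class below is hypothetical and is not LangChain's implementation; it only mimics the `a | b | c` pattern with stand-in stages.

```python
class Step:
    """A minimal composable stage supporting the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Chaining two steps yields a new step that runs them in order
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the prompt, the model, and the output parser
fake_prompt = Step(lambda inputs: f"{inputs['role']} {inputs['start']}")
fake_model = Step(lambda text: {"content": text.upper()})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_prompt | fake_model | fake_parser
print(toy_chain.invoke({"role": "profiler:", "start": "go"}))
```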

In [6]:
# StrOutputParser is a LangChain utility that converts the model's output into a plain string.
output_parser = StrOutputParser()

# Build the processing chain with the pipe (|) operator:
# the formatted prompt goes to the model, and the model's response goes to the parser.
chain = prompt | model | output_parser

role = "I want you to act as a criminal profiler. I will provide a list of web pages and the times they were visited by a suspect, and your task is to create a psychological profile of the suspect based on the browsing history. Remember, your responses should focus on the psychological analysis and profiling aspect, avoiding any direct identification or accusations against real individuals. Do not provide legal advice or procedural law enforcement steps."

start = "I want you to also try to guess their age range, interests, and their location. Age ranges are 10-19, 20-29, 30-39, 40-49, 50-59, and 60+. The interest categories are Technology and Gadgets, Entertainment, Sports and Fitness, Travel and Adventure, Food and Cooking, Hobbies and Crafts, Health and Wellness, Education and Learning, Socializing and Community, Nature and Environment, Fashion and Style, and Pets and Animals. The location is assumed to be in the United States. Guess which state they are in."

result = chain.invoke(
    {
        "role": role,
        "provided_data": provided_data,
        "start": start,
    }
)

Markdown(result)

**Psychological Profile:**

The individual exhibits a diverse range of interests, spanning from gaming to technology, entertainment, and education. Their browsing history suggests a curious and explorative nature, with a tendency to seek out information and resources on various topics.

The frequent visits to ChatGPT, an AI-powered chatbot, indicate a reliance on technology for knowledge acquisition and problem-solving. The browsing history also includes visits to online LaTeX editors and programming resources, suggesting an interest in coding and technical knowledge.

The presence of gaming-related searches, such as "Overwatch," "Honkai: Star Rail," and "Raid: Shadow Legends," points to a possible interest in online gaming and virtual worlds. The browsing history also includes visits to subreddits dedicated to memes and humor, indicating a sense of playfulness and a desire for entertainment.

The visits to websites related to sustainability, healthcare, and education suggest a concern for social and environmental issues. This could indicate a socially conscious and empathetic nature.

**Age Range:**

Based on the browsing history, the individual's age range is likely between **20-29**. The interest in gaming, technology, and online communities is common among young adults within this age group.

**Interests:**

* Technology and Gadgets
* Entertainment (Gaming, Memes)
* Education and Learning
* Health and Wellness
* Socializing and Community

**Location:**

The browsing history does not provide any clear geographical indicators. However, given the prevalence of English-language websites and the absence of any specific regional references, it is likely that the individual resides in the **United States**.

### Step 6: Evaluation
- We evaluate the predicted age range, interests, and location against the volunteer's actual values.
- Age: a score of 1 if the predicted range matches the true range, and 0.5 if it is off by one range (e.g., the true range is 30-39 and the prediction is 40-49 or 20-29).
- Interests: the number of correct interests divided by the largest number of interests guessed. For example, if 4 interests are predicted and 3 are correct, the score is 3/4 = 0.75.
- Location: a score of 1 if the predicted state is correct, 0.6 if it borders the true state, and 0.3 if the two states are in the same section of the United States. The sections are Northeast, Southeast, Midwest, Southwest, and West.
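
The scoring rules above can be sketched in code as follows. This is one possible reading, not a reference implementation: the function names are hypothetical, scores that the rules leave unstated are assumed to be 0, the interest denominator is assumed to be the larger of the true and predicted counts, and the border and section lookups are small illustrative tables (the section assigned to Maryland is an assumption).

```python
AGE_RANGES = ["10-19", "20-29", "30-39", "40-49", "50-59", "60+"]


def age_score(truth, predicted):
    """1 for an exact match, 0.5 if off by one range, otherwise 0 (assumed)."""
    gap = abs(AGE_RANGES.index(truth) - AGE_RANGES.index(predicted))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)


def interest_score(truth, predicted):
    """Correct interests over the larger of the two list sizes (assumed reading)."""
    correct = len(set(truth) & set(predicted))
    denominator = max(len(set(truth)), len(set(predicted)))
    return correct / denominator if denominator else 0.0


def location_score(truth, predicted, borders, section_of):
    """1 for the same state, 0.6 for a bordering state, 0.3 for the same section."""
    if predicted == truth:
        return 1.0
    if predicted in borders.get(truth, set()):
        return 0.6
    truth_section = section_of.get(truth)
    if truth_section is not None and truth_section == section_of.get(predicted):
        return 0.3
    return 0.0


# Illustrative lookup tables covering only the states used in this example
borders = {"Maryland": {"Delaware", "Pennsylvania", "Virginia", "West Virginia"}}
section_of = {"Maryland": "Northeast", "California": "West"}

truth_interests = ["Technology and Gadgets", "Entertainment", "Education and Learning"]
predicted_interests = truth_interests + ["Hobbies and Crafts"]

print(age_score("10-19", "20-29"))
print(interest_score(truth_interests, predicted_interests))
print(location_score("Maryland", "California", borders, section_of))
```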


|              | Truth                  | Prediction            | Score |
|--------------|------------------------|-----------------------|-------|
| Age          | 10-19                  | 20-29                 | 0.5   |
| Interests    | Technology and Gadgets, Entertainment, Education and Learning | Technology and Gadgets, Entertainment, Education and Learning, Hobbies and Crafts | 0.75  |
| Location     | Maryland               | California            | 0     |
