<a href="https://www.kaggle.com/code/xyizko/xgaicps?scriptVersionId=232111298" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🏴‍☠️ Notebook Description

1. This notebook is a submission for the [Gen AI Intensive Course Capstone 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1/overview).

## Author Details

Description | Details
:--: | :--:
Discord Username | `xyizko`
Kaggle Profile | `https://www.kaggle.com/xyizko`
X | `https://x.com/xyizko`


# 🚀 AI OSINT Recon Agent (Powered by Google AI Studio + Google Dorking)

Welcome to your friendly, slightly paranoid AI-powered passive reconnaissance tool! 🕵️‍♂️

This notebook uses **Google Gemini** (`gemini-2.0-flash`) and classic **Google Dorking** to collect open-source intelligence (OSINT) on domains, emails, or usernames. The only API it calls is Gemini: search results are pasted in by hand rather than scraped.

By the end, you'll have:

✅ A classification of your target

✅ Dork queries to search Google like a cyber sleuth

✅ AI-parsed insights from search results

✅ A beautifully crafted markdown report

Let’s do some cyber investigating — responsibly, of course! 🧑‍💻🔎

## 🧠 Three GenAI Capabilities Used

Here are the three generative AI capabilities from Google AI Studio used in this project:

### 1. **Structured Output / JSON Mode**
- Used in: `extract_intel(snippets)`
- What it does: Asks the model to pull emails, credentials, repos, etc. out of unstructured text and return them as a JSON object that follows a template given in the prompt.

### 2. **Few-shot Prompting**
- Used in: `classify_input(input_data)`
- What it does: Guides the model to classify whether the input is a domain, email, or username using clear examples.

### 3. **Document Understanding** (simulated via search snippets)
- Used in: `extract_intel()` and `generate_report()`
- What it does: Parses raw search result snippets to extract intel and then writes a polished threat report based on JSON findings.


## 🧹 Housekeeping and Setup

In [1]:
# Housekeeping

# Use rich for prettier tracebacks
from rich.traceback import install
install(show_locals=True)

# Package cleanup: uninstall jupyterlab
!pip uninstall -qy jupyterlab

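# Google GenAI setup
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

# Print the SDK version (a bare expression mid-cell would not be displayed)
print("google-genai version:", genai.__version__)

# Retry helper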
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

# Key setup: read the API key from Kaggle Secrets
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")


# Main model: used throughout the notebook

use_model = "gemini-2.0-flash"
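
The retry helper in the setup cell wraps `generate_content` so that rate-limit (429) and service-unavailable (503) errors are retried instead of aborting the notebook. As a rough illustration of that idea only, here is a standard-library sketch. `TransientAPIError`, `call_with_retry`, and the backoff values are hypothetical names and numbers chosen for this example; this is not how `google.api_core.retry` is implemented.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def call_with_retry(fn, retriable_codes=(429, 503), max_attempts=4, base_delay=0.01):
    """Call fn(), retrying with exponential backoff on retriable status codes."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError as err:
            last_attempt = attempt == max_attempts - 1
            # Give up on non-retriable codes or once the attempts run out.
            if err.code not in retriable_codes or last_attempt:
                raise
            time.sleep(base_delay * 2 ** attempt)


# Usage: a fake call that is rate-limited twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientAPIError(429)
    return "ok"


result = call_with_retry(flaky_call)
print(result, "after", attempts["count"], "attempts")
```

Errors with any other status code are re-raised immediately, which mirrors the intent of the `is_retriable` predicate above.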

# 📦 Setup: Define Input

In [2]:
# 👤 Input the target you want to investigate: domain, username, or email.
# Feel free to change this to anything you want to test.
input_data = "example.com"

# 🧠 Step 1: Classify Input Type

In [3]:
# Initialize the Gemini client with the API key from Kaggle Secrets
client = genai.Client(api_key=GOOGLE_API_KEY)

# Let’s ask Gemini to figure out if you gave us a domain, username, or email.
def classify_input(input_data):
    prompt = f"""
Classify the following input as one of: "domain", "email", or "username".

Examples:
Input: admin@example.com → Type: email
Input: cyberhunter42 → Type: username
Input: example.org → Type: domain

Now classify:
Input: {input_data} → Type:
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text.strip().lower()

input_type = classify_input(input_data)
print("✅ Input Type:", input_type)

✅ Input Type: domain
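
`classify_input` returns whatever text the model produces, and the dork lookup in the next step only works if that text is exactly `domain`, `email`, or `username`. One possible safeguard, sketched here under assumptions, is to normalize the label and fall back to a simple local check when the model answers with something else. `classify_locally`, `normalize_label`, and both regular expressions are hypothetical helpers for illustration, and the patterns are deliberately loose rather than standards-compliant validators.

```python
import re

VALID_TYPES = {"domain", "email", "username"}

# Loose patterns, good enough for routing but not for strict validation.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DOMAIN_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)


def classify_locally(value):
    """Guess the input type without calling the model."""
    value = value.strip()
    if EMAIL_RE.match(value):
        return "email"
    if DOMAIN_RE.match(value):
        return "domain"
    return "username"


def normalize_label(model_label, value):
    """Clean up the model's answer; fall back to the local guess if it is unusable."""
    label = model_label.strip().strip('".').lower()
    return label if label in VALID_TYPES else classify_locally(value)


print(normalize_label(" Domain.\n", "example.com"))
print(normalize_label("Type: domain", "example.com"))
```

With this in place, `input_type = normalize_label(classify_input(input_data), input_data)` would always yield a key that exists in the dork map.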


# 🔍 Step 2: Generate Google Dorks

🕵️‍♀️ Based on the input type, let’s prepare some Google Dorks! These are powerful search queries used by hackers, pentesters, and digital sleuths to find juicy bits of info.

In [4]:
dork_map = {
    "domain": [
        f"site:pastebin.com {input_data}",
        f"intitle:index.of {input_data}",
        f"filetype:log {input_data}",
        f"site:github.com {input_data}",
    ],
    "email": [
        f'"{input_data}" site:pastebin.com',
        f'"{input_data}" filetype:txt',
        f'"{input_data}" intext:password',
    ],
    "username": [
        f"site:github.com {input_data}",
        f"site:reddit.com {input_data}",
        f"site:twitter.com {input_data}",
    ],
}

google_dorks = dork_map.get(input_type, [])
print("\n🔍 Generated Google Dorks:")
for dork in google_dorks:
    print("-", dork)


🔍 Generated Google Dorks:
- site:pastebin.com example.com
- intitle:index.of example.com
- filetype:log example.com
- site:github.com example.com
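
Since the searches are run by hand, it can help to turn each dork into a ready-to-click search URL. A minimal sketch using only the standard library follows; `dork_to_url` is a hypothetical helper, and the sample list simply repeats the dorks printed above.

```python
from urllib.parse import quote_plus


def dork_to_url(dork):
    """URL-encode a dork and attach it to the Google search endpoint."""
    return "https://www.google.com/search?q=" + quote_plus(dork)


# The dorks generated for the example.com target above.
sample_dorks = [
    "site:pastebin.com example.com",
    "intitle:index.of example.com",
    "filetype:log example.com",
    "site:github.com example.com",
]

urls = [dork_to_url(d) for d in sample_dorks]
for url in urls:
    print(url)
```

`quote_plus` encodes spaces as `+` and characters such as `:` and `"` as percent escapes, so quoted email dorks survive the round trip as well.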


# ✍️ Step 3: Manually Collected Search Snippets (Simulated)

🔖 This is where YOU come in: run the dorks above in Google and paste 2–5 search result snippets into the list below. The snippets stand in for what a scraped results page might look like, so the AI has something to extract insights from.

In [5]:
search_snippets = [
    "Found in pastebin: admin@example.com:123456",
    "GitHub repo found: github.com/user/example-internal",
    "Index of /logs – contains logs of users with IPs",
]
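
Hand-pasted snippets tend to carry stray line breaks and duplicates. Below is a small sketch of tidying them before they go into the prompt. `clean_snippets` is a hypothetical helper, and the default cap of five is an assumption taken from the 2–5 guidance above.

```python
def clean_snippets(raw_snippets, max_snippets=5):
    """Collapse whitespace, drop empties and case-insensitive duplicates, cap the count."""
    seen = set()
    cleaned = []
    for snippet in raw_snippets:
        text = " ".join(snippet.split())  # collapses newlines, tabs, repeated spaces
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned[:max_snippets]


pasted = [
    "  Found in  pastebin:\n admin@example.com:123456 ",
    "found in pastebin: admin@example.com:123456",
    "",
    "GitHub repo found: github.com/user/example-internal",
]
tidy = clean_snippets(pasted)
print(tidy)
```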


# 🧠 Step 4: Extract Intelligence as JSON 

🧩 Now the fun begins: let’s use Gemini to turn raw snippets into structured JSON! Structured output is easier to automate and to feed into SIEM tools.

In [6]:
def extract_intel(snippets):
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = f"""
You're an OSINT analyst. Given the following search result snippets, extract key findings in JSON format.

Snippets:
{snippet_text}

Return JSON like this:
{{
  "emails": [],
  "credentials": [],
  "github_repos": [],
  "exposed_logs": [],
  "other_mentions": []
}}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

intel_json = extract_intel(search_snippets)
print("\n🧠 Extracted Intelligence:")
print(intel_json)


🧠 Extracted Intelligence:
```json
{
  "emails": [
    "admin@example.com"
  ],
  "credentials": [
    {
      "username": "admin@example.com",
      "password": "123456"
    }
  ],
  "github_repos": [
    "github.com/user/example-internal"
  ],
  "exposed_logs": [
    "Index of /logs - contains logs of users with IPs"
  ],
  "other_mentions": []
}
```
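
As the output above shows, the model wraps its JSON in a Markdown code fence, so `intel_json` is a string rather than a Python object. For any downstream automation it has to be parsed first. A minimal sketch, assuming the response contains at most one fenced block; `parse_model_json` is a hypothetical helper, not part of the notebook.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime so they do not clash with this block

FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_model_json(text):
    """Strip an optional Markdown code fence and parse the remaining JSON."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# A shortened stand-in for the model output shown above.
sample = FENCE + 'json\n{"emails": ["admin@example.com"], "credentials": []}\n' + FENCE
intel = parse_model_json(sample)
print(intel["emails"])
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which is a natural place to retry the request or log the raw text. The SDK also lets a request ask for a JSON response type through its generation config, which would avoid the fence entirely; that variant is not shown here because the notebook does not use it.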


# 📝 Step 5: Generate Markdown Report

📋 Finally, we’ll convert those findings into a beautiful markdown report. This is useful for threat analysts, CTF writeups, or your personal archive of shady domains 😎

In [7]:
def generate_report(json_data):
    prompt = f"""
You are a cyber analyst. Use the following JSON findings to create a detailed markdown report with:
- Summary
- Key findings
- Risk rating (low/medium/high)
- Remediation suggestions

JSON:
{json_data}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

markdown_report = generate_report(intel_json)


# 📊 Show off your OSINT magic with a polished report

In [8]:
from IPython.display import Markdown, display

display(Markdown(markdown_report))

```markdown
## Cyber Security Incident Report

**Date:** October 26, 2023 (Assumed, update if known)

**Prepared By:** AI Cyber Analyst

**Subject:** Security Findings Report - Example Domain

### Summary

This report details findings from a recent security assessment related to the "example.com" domain. The assessment identified several concerning exposures, including the presence of publicly accessible credentials, potential internal code exposure via GitHub, and exposed server logs containing user IP addresses. These findings pose a significant risk to the organization's security posture and require immediate remediation.

### Key Findings

1.  **Exposed Credentials:** The username `admin@example.com` and password `123456` were discovered. This represents a critical vulnerability. This credential pair, if valid, could grant unauthorized access to sensitive systems and data.

2.  **Potential Internal Code Exposure:** A GitHub repository `github.com/user/example-internal` was found. The name suggests that this repository may contain internal code or sensitive information. Publicly accessible internal code can be reverse-engineered, exploited, or used to identify vulnerabilities in the organization's systems.

3.  **Exposed Server Logs:** The finding "Index of /logs - contains logs of users with IPs" indicates publicly accessible server logs. These logs potentially contain sensitive user information, including IP addresses.  This violates privacy regulations and could be exploited to track user activity or conduct targeted attacks.

4.  **Email Addresses:** The email address `admin@example.com` was identified. While common, it's crucial to ensure this account is secured using strong authentication and monitored for suspicious activity, especially given the exposed password.

5.  **Other Mentions:** No other notable mentions were found in this assessment.

### Risk Rating

Based on the findings, the overall risk rating is **High**. The presence of exposed credentials, potentially sensitive code repository, and publicly accessible logs containing user IP addresses represent severe security vulnerabilities with the potential for significant damage.

*   **Exposed Credentials:** **Critical** - The direct exposure of credentials is a highly critical issue.
*   **Potential Internal Code Exposure:** **High** - Exposing internal code opens up many attack vectors.
*   **Exposed Server Logs:** **High** - Exposure of user IP addresses violates privacy and aids in targeted attacks.
*   **Email Addresses:** **Low** - By itself, email exposure is minimal, but exacerbates the credential exposure.
*   **Other Mentions:** **None** - No impact.

### Remediation Suggestions

The following actions are recommended to mitigate the identified risks:

1.  **Immediate Password Reset:** Immediately reset the password for the `admin@example.com` account, and any other accounts using the exposed password `123456`.  Enforce strong password policies and multi-factor authentication (MFA) for all accounts, especially privileged accounts.  Consider using a password manager to generate and store strong, unique passwords.

2.  **Investigate GitHub Repository:**  Immediately investigate the `github.com/user/example-internal` repository.
    *   Determine the repository's contents and sensitivity level.
    *   If the repository contains sensitive information or internal code, ensure it is made private immediately.
    *   Review the repository's commit history for any accidentally committed credentials or sensitive information. Revoke or rotate any exposed credentials.
    *   Implement code scanning tools in the development pipeline to prevent future accidental exposure of sensitive information.

3.  **Secure Exposed Server Logs:** Immediately remove public access to the `/logs` directory.
    *   Implement access controls to restrict access to server logs to authorized personnel only.
    *   Review the logs for any signs of unauthorized access or malicious activity.
    *   Implement a log rotation policy to minimize the amount of data stored in logs.
    *   Consider anonymizing or pseudonymizing user IP addresses in logs to protect user privacy, depending on the legal requirements and business needs.

4.  **Incident Response:** Initiate an incident response process to investigate the extent of the potential compromise.
    *   Review system logs for suspicious activity related to the exposed credentials.
    *   Monitor network traffic for any signs of data exfiltration or unauthorized access.

5.  **Vulnerability Scanning:** Conduct a thorough vulnerability scan of all systems to identify any other potential weaknesses.

6.  **Security Awareness Training:** Conduct security awareness training for all employees to educate them about the risks of weak passwords, accidental code exposure, and proper data handling procedures.

7.  **Data Breach Notification:** Assess whether the exposed user IP addresses constitute a data breach under applicable privacy laws (e.g., GDPR, CCPA). If so, follow the required notification procedures.

8.  **Review and Update Security Policies:** Review and update the organization's security policies to address the identified vulnerabilities and prevent future occurrences. This includes password policies, code management policies, and data handling policies.

**Follow-up Actions:**

*   Schedule a follow-up meeting to review the progress of the remediation efforts.
*   Conduct regular security assessments to identify and address potential vulnerabilities proactively.
*   Monitor systems and logs for suspicious activity.
```
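
The risk rating in the report is assigned by the model, so it may vary between runs. As a cross-check, a rule-based rating can be computed from the same JSON findings. The sketch below is illustrative only: `rate_risk` and its rules (any credential means high, any repo or exposed log means medium) are assumptions made for this example, not a method used in the notebook.

```python
def rate_risk(intel):
    """Return 'high', 'medium', or 'low' from extracted findings (assumed rules)."""
    if intel.get("credentials"):
        return "high"
    if intel.get("github_repos") or intel.get("exposed_logs"):
        return "medium"
    return "low"


# The findings extracted in Step 4, as a Python dict.
findings = {
    "emails": ["admin@example.com"],
    "credentials": [{"username": "admin@example.com", "password": "123456"}],
    "github_repos": ["github.com/user/example-internal"],
    "exposed_logs": ["Index of /logs - contains logs of users with IPs"],
    "other_mentions": [],
}

rating = rate_risk(findings)
print("Rule-based rating:", rating)
```

If the rule-based rating and the model's rating disagree, that is a reasonable cue to review the report by hand before sharing it.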