<a href="https://www.kaggle.com/code/xyizko/xgaicps?scriptVersionId=232110425" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🏴‍☠️ Notebook Description

1. This notebook is the submission for the [Gen AI Intensive Course Capstone 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1/overview).

## Author Details

Description | Details
:--: | :--:
Discord Username | `xyizko`
Kaggle Profile | `https://www.kaggle.com/xyizko`
X | `https://x.com/xyizko`


# 🚀 AI OSINT Recon Agent (Powered by Google Gemini Pro + Google Dorking)

Welcome to your friendly, slightly paranoid AI-powered passive reconnaissance tool! 🕵️‍♂️

This notebook uses the power of **Google Gemini** (the notebook runs `gemini-2.0-flash`) and classic **Google Dorking** to collect open-source intelligence (OSINT) on domains, emails, or usernames, all without using any sensitive APIs.

By the end, you'll have:

✅ A classification of your target

✅ Dork queries to search Google like a cyber sleuth

✅ AI-parsed insights from search results

✅ A beautifully crafted markdown report

Let’s do some cyber investigating — responsibly, of course! 🧑‍💻🔎

## 🧠 Three GenAI Capabilities Used

Here are the three generative AI capabilities from Google AI Studio used in this project:

### 1. **Structured Output / JSON Mode**
- Used in: `extract_intel(snippets)`
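- What it does: Extracts emails, credentials, repos, etc. into a neat JSON object from unstructured text. The JSON shape is requested in the prompt itself, so the reply arrives as text containing a JSON object.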

### 2. **Few-shot Prompting**
- Used in: `classify_input(input_data)`
- What it does: Guides the model to classify whether the input is a domain, email, or username using clear examples.

### 3. **Document Understanding** (simulated via search snippets)
- Used in: `extract_intel()` and `generate_report()`
- What it does: Parses raw search result snippets to extract intel and then writes a polished threat report based on JSON findings.
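
For capability 1, JSON output can also be requested at the API level rather than through prompt wording alone. The following is a minimal sketch, assuming the `google-genai` SDK accepts a plain dict for `config` with a `response_mime_type` entry. The function name `extract_intel_structured` and its `client` and `model` parameters are illustrative and not part of this notebook.

```python
import json


def extract_intel_structured(client, model, snippets):
    """Ask for a JSON response via the request config, then parse it."""
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "You're an OSINT analyst. Extract key findings from these search result "
        "snippets as a JSON object with the keys: emails, credentials, "
        "github_repos, exposed_logs, other_mentions.\n\n"
        f"Snippets:\n{snippet_text}"
    )
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        # Assumption: the SDK accepts a dict here and honors response_mime_type.
        config={"response_mime_type": "application/json"},
    )
    # With JSON output requested, the reply text should parse directly.
    return json.loads(response.text)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before spending API quota.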


## 🧹 Housekeeping and Setup

In [1]:
# Housekeeping

# Using rich handler for prettier errors 
from rich.traceback import install
install(show_locals=True)

# Package cleanup: uninstall jupyterlab quietly (-q) and without prompting (-y)
!pip uninstall -qy jupyterlab

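# Google GenAI SDK setup
from google import genai
from google.genai import types
from IPython.display import HTML, Markdown, display

# Print the SDK version (a bare expression in the middle of a cell is not displayed)
print("google-genai version:", genai.__version__)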

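# Retry helper: retry generate_content on rate-limit (429) and service-unavailable (503) errors
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# Wrap only once so that re-running this cell does not stack retry wrappers
if not hasattr(genai.models.Models.generate_content, "__wrapped__"):
    genai.models.Models.generate_content = retry.Retry(
        predicate=is_retriable)(genai.models.Models.generate_content)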

# Key setup: read the API key from Kaggle Secrets
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")


# Main model: used throughout the notebook
use_model = "gemini-2.0-flash"
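
To make the retry step concrete, here is a standard-library sketch of what a predicate-based retry wrapper does: it retries only errors the predicate accepts and waits longer after each failure. This is an illustration of the idea, not how `google.api_core.retry` is implemented. `FakeAPIError`, `call_with_backoff`, and the delay values are hypothetical.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc):
    # Same idea as the notebook's predicate: retry only on 429 and 503.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def call_with_backoff(fn, predicate, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while predicate(exc) is true."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not predicate(exc):
                raise
            # Delay doubles after every failed attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.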

# 📦 Setup: Define Input

In [2]:
# 👤 Input the target you want to investigate: domain, username, or email.
# Feel free to change this to anything you want to test.
input_data = "example.com"

# 🧠 Step 1: Classify Input Type

In [3]:
# Initialize the Gemini client with the API key loaded from Kaggle Secrets
client = genai.Client(api_key=GOOGLE_API_KEY)

# Let’s ask Gemini to figure out if you gave us a domain, username, or email.
def classify_input(input_data):
    prompt = f"""
Classify the following input as one of: "domain", "email", or "username".

Examples:
Input: admin@example.com → Type: email
Input: cyberhunter42 → Type: username
Input: example.org → Type: domain

Now classify:
Input: {input_data} → Type:
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text.strip().lower()

input_type = classify_input(input_data)
print("✅ Input Type:", input_type)

✅ Input Type: domain
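
Because the model replies in free text, the label can come back with extra punctuation or wording, and the dork lookup in the next step only matches the exact strings `domain`, `email`, and `username`. A small sketch of a guard follows. The helpers `normalize_label` and `guess_type` and their regex heuristics are hypothetical and not part of this notebook.

```python
import re

VALID_TYPES = ("domain", "email", "username")


def guess_type(value):
    """Rough local heuristic, used only when the model's label is unusable."""
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"([A-Za-z0-9-]+\.)+[A-Za-z]{2,}", value):
        return "domain"
    return "username"


def normalize_label(raw_label, value):
    """Map the model's raw answer onto exactly one of VALID_TYPES."""
    cleaned = raw_label.strip().lower()
    # Whole-word search so answers like "Type: domain." still match.
    found = [t for t in VALID_TYPES if re.search(rf"\b{t}\b", cleaned)]
    if len(found) == 1:
        return found[0]
    # Ambiguous or empty answer: fall back to the local heuristic.
    return guess_type(value)
```

In this notebook it could be applied as `input_type = normalize_label(classify_input(input_data), input_data)`.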


# 🔍 Step 2: Generate Google Dorks

## 🕵️‍♀️ Based on the input type, let’s prepare some Google Dorks!
## These are powerful search queries used by hackers, pentesters, and digital sleuths to find juicy bits of info.

In [4]:
dork_map = {
    "domain": [
        f"site:pastebin.com {input_data}",
        f"intitle:index.of {input_data}",
        f"filetype:log {input_data}",
        f"site:github.com {input_data}",
    ],
    "email": [
        f'"{input_data}" site:pastebin.com',
        f'"{input_data}" filetype:txt',
        f'"{input_data}" intext:password',
    ],
    "username": [
        f"site:github.com {input_data}",
        f"site:reddit.com {input_data}",
        f"site:twitter.com {input_data}",
    ],
}

google_dorks = dork_map.get(input_type, [])
print("\n🔍 Generated Google Dorks:")
for dork in google_dorks:
    print("-", dork)


🔍 Generated Google Dorks:
- site:pastebin.com example.com
- intitle:index.of example.com
- filetype:log example.com
- site:github.com example.com


# ✍️ Step 3: Manually Collected Search Snippets (Simulated)

## 🔖 This is where YOU come in — go copy-paste 2–5 Google search result snippets manually.
## We’re simulating what a scraped page might look like so the AI can extract insights.

In [5]:
search_snippets = [
    "Found in pastebin: admin@example.com:123456",
    "GitHub repo found: github.com/user/example-internal",
    "Index of /logs – contains logs of users with IPs",
]
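
Pasted snippets often carry stray line breaks or duplicates. The sketch below tidies them before they go into the prompt. `clean_snippets` is a hypothetical helper, and the default limit of 5 simply mirrors the 2–5 snippets suggested above.

```python
def clean_snippets(raw_snippets, limit=5):
    """Trim whitespace, drop empties and duplicates, keep at most `limit` snippets."""
    seen = set()
    cleaned = []
    for snippet in raw_snippets:
        text = " ".join(snippet.split())  # collapse newlines and repeated spaces
        key = text.lower()                # compare case-insensitively
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned[:limit]
```

It could be applied as `search_snippets = clean_snippets(search_snippets)` before calling `extract_intel`.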


# 🧠 Step 4: Extract Intelligence as JSON 

## 🧩 Now the fun begins — let’s use Gemini to turn raw snippets into structured JSON!
## This makes it easier for further automation or integration with SIEM tools.

In [6]:
def extract_intel(snippets):
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = f"""
You're an OSINT analyst. Given the following search result snippets, extract key findings in JSON format.

Snippets:
{snippet_text}

Return JSON like this:
{{
  "emails": [],
  "credentials": [],
  "github_repos": [],
  "exposed_logs": [],
  "other_mentions": []
}}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

intel_json = extract_intel(search_snippets)
print("\n🧠 Extracted Intelligence:")
print(intel_json)


🧠 Extracted Intelligence:
```json
{
  "emails": [
    "admin@example.com"
  ],
  "credentials": [
    {
      "username": "admin@example.com",
      "password": "123456"
    }
  ],
  "github_repos": [
    "github.com/user/example-internal"
  ],
  "exposed_logs": [
    "Index of /logs contains logs of users with IPs"
  ],
  "other_mentions": []
}
```
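
As the output above shows, the reply is a string wrapped in a code fence, not a Python object. For automation it needs to be unwrapped and parsed. A minimal sketch, where `parse_intel` and `EXPECTED_KEYS` are hypothetical names and the key list is taken from the prompt template:

```python
import json
import re

FENCE = "`" * 3  # built at runtime to keep a literal fence out of this example
EXPECTED_KEYS = ("emails", "credentials", "github_repos", "exposed_logs", "other_mentions")
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_intel(raw_text):
    """Extract the JSON object from a model reply, with or without a code fence."""
    match = FENCED_BLOCK.search(raw_text)
    payload = match.group(1) if match else raw_text
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Keep only the expected keys, defaulting any missing one to an empty list.
    return {key: data.get(key, []) for key in EXPECTED_KEYS}
```

For example, `intel = parse_intel(intel_json)` would give a dict whose `intel["emails"]` can be used directly.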


# 📝 Step 5: Generate Markdown Report

## 📋 Finally, we’ll convert those findings into a beautiful markdown report.
## This is useful for threat analysts, CTF writeups, or your personal archive of shady domains 😎

In [7]:
def generate_report(json_data):
    prompt = f"""
You are a cyber analyst. Use the following JSON findings to create a detailed markdown report with:
- Summary
- Key findings
- Risk rating (low/medium/high)
- Remediation suggestions

JSON:
{json_data}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

markdown_report = generate_report(intel_json)


# 📊 Show off your OSINT magic with a polished report

In [8]:
from IPython.display import Markdown, display

display(Markdown(markdown_report))

```markdown
# Cybersecurity Incident Report

**Date:** October 26, 2023
**Analyst:** AI Cyber Analyst
**Subject:** Security Exposure Assessment based on External Monitoring

## 1. Summary

This report details findings from an external monitoring assessment.  The findings indicate potential security exposures related to leaked credentials, potentially sensitive internal GitHub repository, and exposed logs containing user IP information.  Immediate action is required to mitigate these risks.

## 2. Key Findings

The following key findings were identified:

*   **Exposed Credentials:** A username and password combination (`admin@example.com`, `123456`) was discovered.  This is a critical finding as it could grant unauthorized access to systems or data. The extremely weak password elevates the severity.
*   **Potentially Sensitive GitHub Repository:** A GitHub repository named `github.com/user/example-internal` was identified. The name suggests this repository might contain internal company information and needs immediate review to verify.
*   **Exposed Logs with User IPs:** Logs containing user IP addresses are publicly accessible through an "Index of /logs" page.  Exposing user IPs is a privacy violation and could be used for malicious purposes.
*   **Identified Email Address:**  The email address `admin@example.com` was discovered. While not inherently problematic, it is included for completeness and should be considered when investigating further exposures related to that account.
*   **No Other Mentions:** No other concerning mentions were identified in the current data set.

## 3. Risk Rating

Based on the findings, the overall risk rating is **High**.  The exposed credentials and logs pose significant immediate threats.

## 4. Remediation Suggestions

The following remediation steps are recommended to address the identified vulnerabilities:

*   **Immediate Action: Password Reset and Account Review:**
    *   Immediately reset the password for the `admin@example.com` account.
    *   Disable the account temporarily until a thorough investigation can be conducted to determine if the exposed credentials were used for unauthorized access.
    *   Review the account's access permissions and privileges. Ensure the account only has the necessary privileges and that no unnecessary access is granted.
    *   Implement multi-factor authentication (MFA) for the `admin@example.com` account and all other privileged accounts.
*   **GitHub Repository Review:**
    *   Immediately investigate the `github.com/user/example-internal` repository.
    *   Determine the repository's purpose and the sensitivity of the data it contains.
    *   If the repository contains internal company information, immediately make it private or implement appropriate access controls.
    *   Review the commit history for any accidental exposure of sensitive information (API keys, passwords, etc.). Revoke any exposed secrets.
*   **Secure Exposed Logs:**
    *   Immediately remove the "Index of /logs" page from public access.
    *   Determine the root cause of the exposed logs. Investigate the web server configuration or application logic that led to the exposure.
    *   Implement proper access controls to restrict access to the logs to authorized personnel only.
    *   Implement a data retention policy for logs to minimize the amount of sensitive data stored.
    *   Consider anonymizing or pseudonymizing user IP addresses in logs to protect user privacy.
*   **Credential Monitoring:**
    *   Implement continuous credential monitoring to detect any future exposures of usernames, passwords, or API keys.
*   **Security Awareness Training:**
    *   Conduct security awareness training for employees to educate them about the risks of using weak passwords, storing sensitive data in public repositories, and exposing logs containing sensitive information.
*   **Penetration Testing:**
    *   Consider performing a penetration test to identify other potential vulnerabilities in the organization's systems and applications.
*   **Incident Response Plan:**
    *   Review and update the organization's incident response plan to ensure it includes procedures for handling credential breaches and data leaks.

## 5. Further Investigation

The following areas require further investigation:

*   **Root Cause Analysis:**  Determine how the credentials and logs were exposed.
*   **Scope of Impact:** Determine if the exposed credentials were used for unauthorized access.
*   **Data Breach Notification:** Evaluate the need to notify affected users and regulatory agencies regarding the exposed user IPs.
*   **GitHub User Identification:** Identify the user associated with the `github.com/user/example-internal` repository to determine their role and access within the organization.

This report represents a preliminary assessment based on the provided data. A more comprehensive investigation may be required to fully assess the scope and impact of these findings.
```
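
The report above came back wrapped in a `markdown` code fence, which makes `display(Markdown(...))` render it as a literal code block instead of formatted text. A small sketch for removing that wrapper and saving the result follows. `unwrap_fence`, `save_report`, and the file name `osint_report.md` are hypothetical.

```python
from pathlib import Path

FENCE = "`" * 3  # built at runtime to keep a literal fence out of this example


def unwrap_fence(text):
    """Remove a single outer code fence if the whole text is wrapped in one."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        return "\n".join(lines[1:-1])
    return text.strip()


def save_report(text, path="osint_report.md"):
    """Write the unwrapped report to disk and return the path."""
    target = Path(path)
    target.write_text(unwrap_fence(text) + "\n", encoding="utf-8")
    return target
```

In this notebook that would look like `display(Markdown(unwrap_fence(markdown_report)))`, optionally followed by `save_report(markdown_report)`.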
