<a href="https://www.kaggle.com/code/xyizko/xgaicps?scriptVersionId=232133517" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🏴‍☠️ Notebook Description

1. This notebook is the submission for the [Gen AI Intensive Course Capstone 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1/overview) competition.

## Author Details

Description | Details
:--: | :--:
Discord Username | `xyizko`
Kaggle Profile | `https://www.kaggle.com/xyizko`
X | `https://x.com/xyizko`


# 🚀 AI OSINT Recon Agent (Powered by Google AI Studio + Google Dorking)

Welcome to your friendly, slightly paranoid AI-powered passive reconnaissance tool! 🕵️‍♂️

This notebook combines **Google's Gemini models** (via Google AI Studio) with classic **Google Dorking** to collect open-source intelligence (OSINT) on domains, emails, or usernames, all without using any sensitive APIs.

By the end, you'll have:

✅ A classification of your target

✅ Dork queries to search Google like a cyber sleuth

✅ AI-parsed insights from search results

✅ A beautifully crafted markdown report

Let’s do some cyber investigating — responsibly, of course! 🧑‍💻🔎

## 🧠 Three GenAI Capabilities Used

Here are the three generative AI capabilities from Google AI Studio used in this project:

### 1. **Structured Output / JSON Mode**
- Used in: `extract_intel(snippets)`
- What it does: Extracts emails, credentials, repos, etc. into a neat JSON object from unstructured text.

### 2. **Few-shot Prompting**
- Used in: `classify_input(input_data)`
- What it does: Guides the model to classify whether the input is a domain, email, or username using clear examples.

### 3. **Document Understanding** (simulated via search snippets)
- Used in: `extract_intel()` and `generate_report()`
- What it does: Parses raw search result snippets to extract intel and then writes a polished threat report based on JSON findings.


## 🧹 Housekeeping and Setup

In [1]:
# HouseKeeping

# Using rich handler for prettier errors 
from rich.traceback import install
install(show_locals=True)

# Package housekeeping: remove jupyterlab from the environment
!pip uninstall -qy jupyterlab

# Google GenAI setup
from google import genai
from google.genai import types

genai.__version__
from IPython.display import HTML, Markdown, display

# Retry helper: retry on HTTP 429 (too many requests) and 503 (service unavailable)
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

# Key setup: read the API key from Kaggle Secrets
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")


# Main model, used throughout the notebook
use_model = "gemini-2.0-flash"
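
The retry helper in the cell above re-runs `generate_content` when the API answers with a 429 or 503. As a rough illustration of that idea, here is a minimal standalone sketch with exponential backoff. It is not how `google.api_core.retry` is implemented; `FakeAPIError`, `is_retriable_sketch`, `call_with_retry`, and the delay values are hypothetical names and numbers chosen for the demo.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable_sketch(exc):
    # Same rule as the notebook's predicate: only 429 and 503 are retried.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def call_with_retry(func, max_attempts=4, base_delay=0.01):
    """Call func(), retrying retriable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retriable_sketch(exc):
                raise
            # Wait longer after each failure: base, 2x base, 4x base, ...
            time.sleep(base_delay * (2 ** attempt))


# Demo: a function that is rate-limited twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = call_with_retry(flaky)
print(result, "after", calls["count"], "calls")
```

Errors outside the retriable set (a 400, for example) are re-raised on the first attempt, which keeps genuine request bugs visible instead of hiding them behind retries.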

# 📦 Setup: Define Input

In [2]:
# 👤 Input the target you want to investigate: domain, username, or email.
# Feel free to change this to anything you want to test.
input_data = "example.com"

# 🧠 Step 1: Classify Input Type

In [3]:
# Initialize the client with the API key from Kaggle Secrets
client = genai.Client(api_key=GOOGLE_API_KEY)

# Ask the model whether the input is a domain, username, or email.
def classify_input(input_data):
    prompt = f"""
Classify the following input as one of: "domain", "email", or "username".

Examples:
Input: admin@example.com → Type: email
Input: cyberhunter42 → Type: username
Input: example.org → Type: domain

Now classify:
Input: {input_data} → Type:
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text.strip().lower()

input_type = classify_input(input_data)
print("✅ Input Type:", input_type)

✅ Input Type: domain
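
`classify_input` returns whatever text the model produces, so a reply such as `"Domain".` would miss every key in the dork lookup used in the next step. The sketch below shows one possible way to normalize the reply and to cross-check it with a simple rule-based guess. `normalize_label` and `heuristic_type` are hypothetical helpers, not part of the original notebook, and the regular expressions are deliberately loose.

```python
import re

VALID_TYPES = ("domain", "email", "username")


def normalize_label(raw_label):
    """Return the first valid type mentioned in a model reply, or None."""
    match = re.search(r"\b(domain|email|username)\b", raw_label.strip().lower())
    return match.group(1) if match else None


def heuristic_type(value):
    """Rule-based guess, usable as a cross-check or fallback."""
    value = value.strip()
    # Loose email shape: something@something.tld
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    # Loose domain shape: dot-separated labels ending in a letters-only TLD
    if re.fullmatch(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}", value):
        return "domain"
    return "username"


print(normalize_label(' "Domain". '))
print(heuristic_type("admin@example.com"))
print(heuristic_type("cyberhunter42"))
```

A possible use is `normalize_label(model_reply) or heuristic_type(input_data)`, so the pipeline still gets a usable type when the model answers in an unexpected format.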


# 🔍 Step 2: Generate Google Dorks

🕵️‍♀️ Based on the input type, let’s prepare some Google Dorks! These are targeted search queries that hackers, pentesters, and digital sleuths use to surface exposed information.

In [4]:
dork_map = {
    "domain": [
        f"site:pastebin.com {input_data}",
        f"intitle:index.of {input_data}",
        f"filetype:log {input_data}",
        f"site:github.com {input_data}",
    ],
    "email": [
        f'"{input_data}" site:pastebin.com',
        f'"{input_data}" filetype:txt',
        f'"{input_data}" intext:password',
    ],
    "username": [
        f"site:github.com {input_data}",
        f"site:reddit.com {input_data}",
        f"site:twitter.com {input_data}",
    ],
}

google_dorks = dork_map.get(input_type, [])
print("\n🔍 Generated Google Dorks:")
for dork in google_dorks:
    print("-", dork)


🔍 Generated Google Dorks:
- site:pastebin.com example.com
- intitle:index.of example.com
- filetype:log example.com
- site:github.com example.com
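
Since the searches are run by hand, it can help to turn each dork into a clickable search URL. This is a small sketch that only uses the standard library; `dork_to_url` and `example_dorks` are hypothetical names, and the sample dorks are copied from the patterns above.

```python
from urllib.parse import quote_plus


def dork_to_url(dork):
    """Build a Google search URL for a single dork query."""
    return "https://www.google.com/search?q=" + quote_plus(dork)


example_dorks = [
    "site:pastebin.com example.com",
    '"admin@example.com" intext:password',
]

for dork in example_dorks:
    print(dork_to_url(dork))
```

`quote_plus` percent-encodes characters such as `:`, `"`, and `@` and turns spaces into `+`, so each operator survives the trip into the query string.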


# ✍️ Step 3: Manually Collected Search Snippets (Simulated)

🔖 This is where YOU come in: copy and paste 2–5 Google search result snippets manually. We’re simulating what a scraped page might look like so the AI can extract insights.

In [5]:
search_snippets = [
    "Found in pastebin: admin@example.com:123456",
    "GitHub repo found: github.com/user/example-internal",
    "Index of /logs – contains logs of users with IPs",
]


# 🧠 Step 4: Extract Intelligence as JSON 

🧩 Now the fun begins: let’s use the model to turn raw snippets into structured JSON! This makes it easier to automate further steps or integrate with SIEM tools.

In [6]:
def extract_intel(snippets):
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = f"""
You're an OSINT analyst. Given the following search result snippets, extract key findings in JSON format.

Snippets:
{snippet_text}

Return JSON like this:
{{
  "emails": [],
  "credentials": [],
  "github_repos": [],
  "exposed_logs": [],
  "other_mentions": []
}}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

intel_json = extract_intel(search_snippets)
print("\n🧠 Extracted Intelligence:")
print(intel_json)


🧠 Extracted Intelligence:
```json
{
  "emails": [
    "admin@example.com"
  ],
  "credentials": [
    {
      "username": "admin@example.com",
      "password": "123456"
    }
  ],
  "github_repos": [
    "github.com/user/example-internal"
  ],
  "exposed_logs": [
    "Index of /logs - contains logs of users with IPs"
  ],
  "other_mentions": []
}
```
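
As the output above shows, the model wraps its JSON in a Markdown code fence, so `intel_json` is a string rather than a Python object. The following sketch strips an optional fence and parses the payload with the standard library. `parse_intel` and `EXPECTED_KEYS` are hypothetical names introduced here, and the sketch assumes the reply contains a single JSON object. The SDK also has a JSON response mode (set through `types.GenerateContentConfig`), which this sketch does not rely on.

```python
import json
import re

EXPECTED_KEYS = ("emails", "credentials", "github_repos", "exposed_logs", "other_mentions")


def parse_intel(raw_text):
    """Strip an optional Markdown code fence and parse the JSON payload."""
    text = raw_text.strip()
    # `{3} matches a run of three backticks, optionally followed by a language tag.
    text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*`{3}$", "", text)
    data = json.loads(text)
    # Keep only the expected keys and default any missing one to an empty list.
    return {key: data.get(key, []) for key in EXPECTED_KEYS}


# Demo input built to look like a fenced model reply.
fence = "`" * 3
sample = fence + 'json\n{"emails": ["admin@example.com"], "github_repos": []}\n' + fence

intel = parse_intel(sample)
print(intel["emails"])
print(sorted(intel))
```

Parsing early also surfaces malformed replies as a `json.JSONDecodeError` at this step instead of letting them flow silently into the report prompt.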


# 📝 Step 5: Generate Markdown Report

📋 Finally, we’ll convert those findings into a beautiful markdown report. This is useful for threat analysts, CTF writeups, or your personal archive of shady domains 😎

In [7]:
def generate_report(json_data):
    prompt = f"""
You are a cyber analyst. Use the following JSON findings to create a detailed markdown report with:
- Summary
- Key findings
- Risk rating (low/medium/high)
- Remediation suggestions

JSON:
{json_data}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

markdown_report = generate_report(intel_json)


# 📊 Show off your OSINT magic with a polished report

In [8]:
from IPython.display import Markdown, display

display(Markdown(markdown_report))

```markdown
## Cyber Security Incident Report

**Date:** 2024-02-29 (This date is for example only, adjust accordingly)

**Report Generated By:** Automated Cyber Analyst

**Subject:** Potential Data Exposure and Credential Leak

### 1. Executive Summary

This report summarizes findings indicating potential data exposure and credential leakage based on discovered data. The key concern is the exposure of internal data through multiple channels, including publicly accessible logs and potentially compromised credentials. The presence of a simple password ("123456") associated with an administrative email address significantly elevates the risk. Immediate action is required to mitigate the identified vulnerabilities and prevent potential exploitation.

### 2. Detailed Findings

The analysis of the provided JSON findings revealed the following:

**2.1 Email Addresses:**

*   **Finding:** The email address `admin@example.com` was discovered.
*   **Implication:** This indicates a potential point of entry for targeted phishing attacks or credential stuffing attempts.

**2.2 Exposed Credentials:**

*   **Finding:** The username `admin@example.com` was found associated with the password `123456`.
*   **Implication:** This is a critical finding. The use of a weak and easily guessable password for an administrative account poses a significant security risk.  An attacker could potentially gain unauthorized access to sensitive systems and data.

**2.3 GitHub Repositories:**

*   **Finding:** The repository `github.com/user/example-internal` was identified.
*   **Implication:**  The repository name suggests that it may contain internal, potentially sensitive code or documentation.  Its public accessibility allows potential attackers to reverse engineer internal systems, identify vulnerabilities, or steal proprietary information. This requires immediate assessment.

**2.4 Exposed Logs:**

*   **Finding:** An "Index of /logs" was discovered which contains logs of users with IPs.
*   **Implication:** Publicly accessible logs containing user IPs present a significant privacy and security risk. Attackers could potentially use these logs to correlate user activity, identify potential targets, and launch targeted attacks. The exposure of IP addresses also violates privacy regulations like GDPR in some jurisdictions.

**2.5 Other Mentions:**

*   **Finding:** No other mentions were found.
*   **Implication:** This could be a positive sign, but the lack of further findings does not negate the severity of the existing issues.

### 3. Risk Assessment

Based on the findings, the overall risk level is considered **HIGH**.

**Justification:**

*   **High Risk:** The presence of a compromised administrative credential (`admin@example.com` with password `123456`) represents an immediate and critical threat.
*   **Medium Risk:**  The exposure of internal repositories on GitHub could lead to the discovery of vulnerabilities or the theft of sensitive information.
*   **Medium Risk:** The exposed logs with user IPs pose a significant privacy risk and could be used for malicious purposes.

### 4. Remediation Suggestions

The following actions are recommended to mitigate the identified risks:

**4.1 Immediate Actions:**

*   **Password Reset:** Immediately disable the account `admin@example.com` and enforce a strong password reset policy.  Implement multi-factor authentication (MFA) for all administrative accounts and enforce it across the organization.
*   **Investigate Access Logs:** Immediately investigate the access logs for `admin@example.com` to determine if the account has already been compromised and what actions an attacker may have taken.
*   **Remove Exposed Logs:** Immediately remove the publicly accessible "Index of /logs" directory. Review the logs for any sensitive information that should be redacted or removed entirely. Implement proper logging practices that prevent sensitive information from being logged in the first place (e.g., anonymize or hash IP addresses).
*   **GitHub Repository Review:** Immediately make the `github.com/user/example-internal` repository private. Conduct a thorough security audit of the repository to identify and address any exposed secrets, vulnerabilities, or sensitive information. Revoke any leaked credentials found within the repository.
*   **Incident Response:** Initiate an incident response plan to contain the damage, investigate the extent of the breach, and restore the affected systems.

**4.2 Long-Term Improvements:**

*   **Password Policy:** Implement and enforce a strong password policy that requires complex passwords, regular password changes, and prohibits the use of common or easily guessable passwords.
*   **Multi-Factor Authentication (MFA):**  Implement MFA for all user accounts, especially those with privileged access.
*   **Code Review:** Implement a rigorous code review process to identify and address security vulnerabilities before code is deployed.
*   **Access Control:** Implement strict access control policies to limit access to sensitive data and systems to only authorized personnel.
*   **Data Loss Prevention (DLP):** Implement DLP measures to prevent sensitive data from being accidentally or intentionally exposed.
*   **Vulnerability Scanning:** Implement regular vulnerability scanning to identify and address security weaknesses in systems and applications.
*   **Security Awareness Training:** Provide regular security awareness training to employees to educate them about phishing attacks, password security, and other security threats.
*   **Log Management:** Implement proper log management practices, including centralized logging, log retention, and security monitoring. Ensure logs are regularly reviewed for suspicious activity.  Configure logs to avoid storing sensitive data.
*   **Regular Security Audits:** Conduct regular security audits to identify and address potential security weaknesses.

### 5. Conclusion

The identified data exposure and credential leak pose a significant threat to the organization. Implementing the recommended remediation steps is crucial to mitigate the risks and protect sensitive data. Continuous monitoring and improvement of security practices are essential to prevent future incidents. This report should be disseminated to relevant stakeholders within the security and IT departments for immediate action.
```
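
To keep a copy of the report outside the notebook, the markdown text can be written to a file. This is a minimal sketch, assuming a file name derived from the target; `save_report` and the `osint_report_` prefix are hypothetical choices, and the demo writes to a temporary directory rather than a real output folder.

```python
import tempfile
from pathlib import Path


def save_report(report_text, directory, target):
    """Write the report to <directory>/osint_report_<target>.md and return the path."""
    # Replace characters that are awkward in file names (for example "@" or "/").
    safe_name = "".join(c if c.isalnum() or c in "-_." else "_" for c in target)
    path = Path(directory) / f"osint_report_{safe_name}.md"
    path.write_text(report_text, encoding="utf-8")
    return path


# Demo with placeholder text; in the notebook this would be markdown_report.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_report("## Sample report\n", tmp, "example.com")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

In the notebook, the call would look like `save_report(markdown_report, ".", input_data)`.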
