<a href="https://www.kaggle.com/code/xyizko/xgaicps?scriptVersionId=232133342" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🏴‍☠️ Notebook Description

1. This notebook is a submission for the [Gen AI Intensive Course Capstone 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1/overview).

## Author Details

Description | Details
:--: | :--:
Discord Username | `xyizko`
Kaggle Profile | `https://www.kaggle.com/xyizko`
X | `https://x.com/xyizko`


# 🚀 AI OSINT Recon Agent (Powered by Google AI Studio + Google Dorking)

Welcome to your friendly, slightly paranoid AI-powered passive reconnaissance tool! 🕵️‍♂️

This notebook uses **Google Gemini** (via Google AI Studio) and classic **Google Dorking** to collect open-source intelligence (OSINT) on domains, emails, or usernames, all without using any sensitive APIs.

By the end, you'll have:

✅ A classification of your target

✅ Dork queries to search Google like a cyber sleuth

✅ AI-parsed insights from search results

✅ A beautifully crafted markdown report

Let’s do some cyber investigating — responsibly, of course! 🧑‍💻🔎

## 🧠 Three GenAI Capabilities Used

Here are the three generative AI capabilities from Google AI Studio used in this project:

### 1. **Structured Output / JSON Mode**
- Used in: `extract_intel(snippets)`
- What it does: Extracts emails, credentials, repos, etc. into a neat JSON object from unstructured text.
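
The `extract_intel` cell later in this notebook asks for JSON only in the prompt text, so the reply comes back wrapped in a Markdown fence. Below is a minimal sketch of requesting JSON through the request config instead, assuming the `google-genai` client accepts a `config` dict with `response_mime_type`. The function name `extract_intel_json` and the offline stub client are hypothetical and exist only so the sketch runs without an API key.

```python
import json
from types import SimpleNamespace


def extract_intel_json(client, model, snippets):
    """Ask for JSON via the request config and return a parsed dict."""
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "You're an OSINT analyst. Extract key findings from these snippets "
        "as a JSON object with the keys emails, credentials, github_repos, "
        "exposed_logs, and other_mentions.\n\n" + snippet_text
    )
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        # Assumption: JSON mode is requested through response_mime_type.
        config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)


class _StubModels:
    """Offline stand-in for client.models, used only to exercise the function."""

    def generate_content(self, model, contents, config=None):
        self.last_config = config
        return SimpleNamespace(text='{"emails": ["admin@example.com"]}')


stub_client = SimpleNamespace(models=_StubModels())
parsed = extract_intel_json(
    stub_client, "gemini-2.0-flash", ["Found in pastebin: admin@example.com:123456"]
)
print(parsed["emails"])
```

With a real client, pass `client` and `use_model` from the setup cells in place of the stub.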

### 2. **Few-shot Prompting**
- Used in: `classify_input(input_data)`
- What it does: Guides the model to classify whether the input is a domain, email, or username using clear examples.
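
A few-shot prompt is an instruction, a handful of labeled examples, and then the unlabeled query. The sketch below assembles such a prompt from a list of pairs so examples can be added or swapped without editing a long string. `FEW_SHOT_EXAMPLES` and `build_classification_prompt` are hypothetical names, not part of the notebook.

```python
# Labeled examples, taken from the notebook's classification prompt.
FEW_SHOT_EXAMPLES = [
    ("admin@example.com", "email"),
    ("cyberhunter42", "username"),
    ("example.org", "domain"),
]


def build_classification_prompt(target, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    lines = [
        'Classify the following input as one of: "domain", "email", or "username".',
        "",
        "Examples:",
    ]
    lines += [f"Input: {value} → Type: {label}" for value, label in examples]
    # The final line is left unlabeled so the model completes it.
    lines += ["", "Now classify:", f"Input: {target} → Type:"]
    return "\n".join(lines)


prompt = build_classification_prompt("example.com")
print(prompt)
```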

### 3. **Document Understanding** (simulated via search snippets)
- Used in: `extract_intel()` and `generate_report()`
- What it does: Parses raw search result snippets to extract intel and then writes a polished threat report based on JSON findings.


## 🧹 Housekeeping and Setup

In [1]:
# HouseKeeping

# Using rich handler for prettier errors 
from rich.traceback import install
install(show_locals=True)

# Package cleanup: uninstall jupyterlab, which can conflict with the google-genai SDK on Kaggle
!pip uninstall -qy jupyterlab

# Google GenAI SDK setup
from google import genai
from google.genai import types

print("google-genai version:", genai.__version__)
from IPython.display import HTML, Markdown, display

# Retry helper: retry automatically on rate-limit (429) and service-unavailable (503) errors
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

# API key setup via Kaggle Secrets
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")


# Model used throughout the notebook

use_model="gemini-2.0-flash"
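
The retry helper in the cell above wraps `generate_content` so that calls failing with HTTP 429 or 503 are attempted again. The pure-Python sketch below illustrates that idea only. `FakeAPIError`, `retry_on`, and `flaky_call` are hypothetical names, and the attempt count and backoff delays are assumptions rather than the defaults of `google.api_core.retry`.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc):
    """Retry only on rate limiting (429) and service unavailable (503)."""
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def retry_on(predicate, attempts=4, initial_delay=0.01, multiplier=2.0):
    """Decorator: re-run the call with exponential backoff while predicate(exc) holds."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-retriable error.
                    if attempt == attempts or not predicate(exc):
                        raise
                    time.sleep(delay)
                    delay *= multiplier

        return wrapper

    return decorator


calls = {"count": 0}


@retry_on(is_retriable)
def flaky_call():
    """Fails twice with 429, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = flaky_call()
print(result, calls["count"])
```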

# 📦 Setup: Define Input

In [2]:
# 👤 Input the target you want to investigate: domain, username, or email.
# Feel free to change this to anything you want to test.
input_data = "example.com"

# 🧠 Step 1: Classify Input Type

In [3]:
# Initialize the Gemini client with the API key from Kaggle Secrets
client = genai.Client(api_key=GOOGLE_API_KEY)

# Ask Gemini to figure out whether the input is a domain, username, or email.
def classify_input(input_data):
    prompt = f"""
Classify the following input as one of: "domain", "email", or "username".

Examples:
Input: admin@example.com → Type: email
Input: cyberhunter42 → Type: username
Input: example.org → Type: domain

Now classify:
Input: {input_data} → Type:
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text.strip().lower()

input_type = classify_input(input_data)
print("✅ Input Type:", input_type)

✅ Input Type: domain
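
Because the model returns free text, the label can come back with extra punctuation or wording, in which case `dork_map.get(input_type, [])` in the next step silently yields no dorks. The sketch below is one possible guard, assuming a rough regex fallback is acceptable when the label is unrecognized. `normalize_label` and `guess_type` are hypothetical helpers, not part of the notebook.

```python
import re

VALID_TYPES = ("domain", "email", "username")


def guess_type(value):
    """Rule-based fallback: a rough heuristic, not a full validator."""
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}", value):
        return "domain"
    return "username"


def normalize_label(raw_label, value):
    """Return the first valid type named in the model's reply, else fall back."""
    for word in re.findall(r"[a-z]+", raw_label.lower()):
        if word in VALID_TYPES:
            return word
    return guess_type(value)


print(normalize_label("Type: Domain.", "example.com"))         # domain
print(normalize_label("I am not sure", "admin@example.com"))   # email
print(normalize_label("", "cyberhunter42"))                    # username
```

It could be applied as `input_type = normalize_label(classify_input(input_data), input_data)`.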


# 🔍 Step 2: Generate Google Dorks

🕵️‍♀️ Based on the input type, let’s prepare some Google Dorks! These are powerful search queries used by hackers, pentesters, and digital sleuths to find juicy bits of info.

In [4]:
dork_map = {
    "domain": [
        f"site:pastebin.com {input_data}",
        f"intitle:index.of {input_data}",
        f"filetype:log {input_data}",
        f"site:github.com {input_data}",
    ],
    "email": [
        f'"{input_data}" site:pastebin.com',
        f'"{input_data}" filetype:txt',
        f'"{input_data}" intext:password',
    ],
    "username": [
        f"site:github.com {input_data}",
        f"site:reddit.com {input_data}",
        f"site:twitter.com {input_data}",
    ],
}

google_dorks = dork_map.get(input_type, [])
print("\n🔍 Generated Google Dorks:")
for dork in google_dorks:
    print("-", dork)


🔍 Generated Google Dorks:
- site:pastebin.com example.com
- intitle:index.of example.com
- filetype:log example.com
- site:github.com example.com
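
Each dork can be pasted into Google by hand. As a convenience, the sketch below turns dorks into clickable search URLs, assuming the usual `https://www.google.com/search?q=` query format. `dork_to_url` is a hypothetical helper, and nothing here sends a request.

```python
from urllib.parse import quote_plus


def dork_to_url(dork):
    """URL-encode a dork query so it can be opened directly in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(dork)


example_dorks = [
    "site:pastebin.com example.com",
    "intitle:index.of example.com",
]
for dork in example_dorks:
    print(dork_to_url(dork))
```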


# ✍️ Step 3: Manually Collected Search Snippets (Simulated)

🔖 This is where YOU come in: copy and paste 2–5 Google search result snippets manually. We’re simulating what a scraped page might look like so the AI can extract insights.

In [5]:
search_snippets = [
    "Found in pastebin: admin@example.com:123456",
    "GitHub repo found: github.com/user/example-internal",
    "Index of /logs – contains logs of users with IPs",
]
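
Hand-pasted snippets often carry stray line breaks, blank entries, or duplicates. The sketch below shows one way to tidy them before they go into a prompt. `clean_snippets` is a hypothetical helper, and the default cap of five simply mirrors the 2–5 snippets suggested in this step.

```python
def clean_snippets(raw_snippets, limit=5):
    """Collapse whitespace, drop blanks and case-insensitive duplicates, cap the count."""
    seen = set()
    cleaned = []
    for snippet in raw_snippets:
        text = " ".join(snippet.split())  # normalize newlines, tabs, repeated spaces
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned[:limit]


pasted = [
    "  Found in pastebin:\n admin@example.com:123456 ",
    "",
    "found in pastebin: admin@example.com:123456",
    "Index of /logs",
]
print(clean_snippets(pasted))
```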


# 🧠 Step 4: Extract Intelligence as JSON 

🧩 Now the fun begins: let’s use Gemini to turn raw snippets into structured JSON! Structured output makes further automation or integration with SIEM tools easier.

In [6]:
def extract_intel(snippets):
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = f"""
You're an OSINT analyst. Given the following search result snippets, extract key findings in JSON format.

Snippets:
{snippet_text}

Return JSON like this:
{{
  "emails": [],
  "credentials": [],
  "github_repos": [],
  "exposed_logs": [],
  "other_mentions": []
}}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

intel_json = extract_intel(search_snippets)
print("\n🧠 Extracted Intelligence:")
print(intel_json)


🧠 Extracted Intelligence:
```json
{
  "emails": [
    "admin@example.com"
  ],
  "credentials": [
    {
      "email": "admin@example.com",
      "password": "123456"
    }
  ],
  "github_repos": [
    "github.com/user/example-internal"
  ],
  "exposed_logs": [
    "Index of /logs - contains logs of users with IPs"
  ],
  "other_mentions": []
}
```
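
As the output above shows, the model wraps its JSON in a Markdown code fence, so `intel_json` is a string rather than a Python object. The sketch below strips an optional fence and parses the payload. `parse_model_json` is a hypothetical helper, and the fence marker is built programmatically only to keep the example free of literal fence lines.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to avoid a literal fence in the example
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)


def parse_model_json(text):
    """Strip an optional Markdown code fence, then parse the JSON payload."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


raw = f'{FENCE}json\n{{"emails": ["admin@example.com"], "other_mentions": []}}\n{FENCE}'
intel = parse_model_json(raw)
print(intel["emails"])
```

`json.loads` raises `json.JSONDecodeError` if the reply is not valid JSON, so a caller may want to catch that and retry or fall back.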


# 📝 Step 5: Generate Markdown Report

📋 Finally, we’ll convert those findings into a beautiful markdown report. This is useful for threat analysts, CTF writeups, or your personal archive of shady domains 😎

In [7]:
def generate_report(json_data):
    prompt = f"""
You are a cyber analyst. Use the following JSON findings to create a detailed markdown report with:
- Summary
- Key findings
- Risk rating (low/medium/high)
- Remediation suggestions

JSON:
{json_data}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

markdown_report = generate_report(intel_json)
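
The report prompt receives the findings verbatim, so any leaked password in the JSON is likely to be repeated in the report. When a report will be shared, masking secrets first is one option. The sketch below assumes the findings have already been parsed into a dict whose `credentials` entries are objects with a `password` field, as in the sample output of Step 4. `redact_credentials` is a hypothetical helper.

```python
import copy


def redact_credentials(intel, keep=1):
    """Return a copy of the findings with each password masked except its first `keep` characters."""
    redacted = copy.deepcopy(intel)  # leave the caller's data untouched
    for cred in redacted.get("credentials", []):
        if isinstance(cred, dict) and "password" in cred:
            secret = str(cred["password"])
            cred["password"] = secret[:keep] + "*" * max(len(secret) - keep, 0)
    return redacted


findings = {
    "emails": ["admin@example.com"],
    "credentials": [{"email": "admin@example.com", "password": "123456"}],
}
safe_findings = redact_credentials(findings)
print(safe_findings["credentials"][0]["password"])  # 1*****
```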


# 📊 Show off your OSINT magic with a polished report

In [8]:
from IPython.display import Markdown, display

display(Markdown(markdown_report))

```markdown
## Cyber Security Incident Report

**Date:** October 26, 2023
**Analyst:** AI Cyber Analyst

### Summary

This report summarizes findings from a security scan of publicly available information. The scan uncovered several potential security vulnerabilities, including exposed credentials, internal repository information, and accessible logs containing user data. Immediate action is recommended to mitigate the identified risks.

### Key Findings

1. **Exposed Credentials:**
    *   An email address (`admin@example.com`) and corresponding password (`123456`) were discovered. This is a significant security risk, as compromised credentials could grant unauthorized access to sensitive systems and data.
    *   **Details:** The password found is extremely weak and susceptible to brute-force attacks.

2. **Internal GitHub Repository Mention:**
    *   The scan identified a mention of a GitHub repository (`github.com/user/example-internal`). This suggests the potential exposure of internal code, configurations, or other sensitive information.
    *   **Details:** The repository name "example-internal" strongly indicates that it should not be publicly accessible.  The 'user' may also be a point of investigation.

3. **Exposed Log Files:**
    *   The scan identified publicly accessible log files at the URL `Index of /logs`.  These logs reportedly contain user IPs, potentially exposing user activity and enabling tracking or correlation with other data breaches.
    *   **Details:** Exposing IP addresses combined with other log data can enable malicious actors to identify, track, and potentially target specific users or systems. The description "Index of /logs" implies directory listing is enabled, which is poor security practice.

4. **Email Addresses Mentioned:**
    *   The email address `admin@example.com` was identified.
    *   **Details:** This may be a common email naming convention, so a full audit of email addresses may uncover more useful data.

5. **Other Mentions:**
    *   No other significant mentions were identified.

### Risk Rating

| Finding                     | Risk Rating | Justification                                                                                                                                       |
| --------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Exposed Credentials           | **High**    | Compromised credentials can lead to complete system takeover, data breaches, and reputational damage. The weak password further exacerbates the risk. |
| Internal GitHub Repository  | **Medium**  | Potential exposure of proprietary code and sensitive data. The level of risk depends on the content of the repository.                                 |
| Exposed Log Files           | **Medium**  | Exposure of user IP addresses and other log data can lead to privacy violations and targeted attacks.                                              |
| Email Address Mention        | **Low**     | The email address does not itself pose a direct risk, but it does mean that the email address has been found in a public context.                 |

### Remediation Suggestions

1. **Immediately Invalidate Exposed Credentials:**
    *   **Action:** Reset the password for `admin@example.com` to a strong, unique password immediately. Enable multi-factor authentication (MFA) for the account.  Consider invalidating the account completely and creating a new admin account with a randomly generated username.
    *   **Justification:** This is the most critical action to prevent unauthorized access.
    *   **Responsible Party:** IT Security Team / System Administrator

2. **Investigate and Secure the Internal GitHub Repository:**
    *   **Action:**
        *   Verify if the repository `github.com/user/example-internal` is indeed internal and should not be publicly accessible.
        *   If it is internal, immediately make the repository private.
        *   Audit the repository's content to identify any other exposed credentials, API keys, or sensitive information. Revoke or rotate any identified secrets.
        *   Investigate who the 'user' is that created the repo.
    *   **Justification:** Prevents further exposure of potentially sensitive code and data.
    *   **Responsible Party:** Development Team / IT Security Team

3. **Secure the Exposed Log Files:**
    *   **Action:**
        *   Immediately remove public access to the `/logs` directory.
        *   Implement proper access controls to restrict access to authorized personnel only.
        *   Review log retention policies to ensure logs are not stored longer than necessary.
        *   Consider anonymizing or pseudonymizing IP addresses in logs where personally identifiable information is not essential.
        *   Disable directory listing for all web directories.
    *   **Justification:** Prevents unauthorized access to user data and mitigates potential privacy violations.
    *   **Responsible Party:** IT Operations Team / System Administrator

4. **Security Awareness Training:**
    *   **Action:** Conduct security awareness training for all employees, emphasizing the importance of strong passwords, secure coding practices, and the proper handling of sensitive data.  Specifically, train developers on the dangers of committing secrets to repositories.
    *   **Justification:** Reduces the likelihood of future security incidents.
    *   **Responsible Party:** HR / IT Security Team

5. **Regular Security Audits and Monitoring:**
    *   **Action:** Implement regular security audits and monitoring of public-facing resources to identify and address potential vulnerabilities proactively.  Utilize automated tools for credential scanning and secret detection.
    *   **Justification:** Ensures ongoing security posture and timely detection of vulnerabilities.
    *   **Responsible Party:** IT Security Team

6. **Password Complexity Policy:**
    * **Action:** Review and enforce a strong password complexity policy for all user accounts, mandating minimum length, character diversity, and regular password changes.
    * **Justification:** Increases the difficulty for attackers to compromise user accounts through brute-force or dictionary attacks.

By implementing these remediation steps, the organization can significantly reduce the risk associated with the identified vulnerabilities and improve its overall security posture.
```