# 🏴‍☠️ Notebook Description

This notebook is the submission for the [Gen AI Intensive Course Capstone 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1/overview).

## Author Details

Description | Details
:--: | :--:
Discord Username | `xyizko`
Kaggle Profile | `https://www.kaggle.com/xyizko`
X | `https://x.com/xyizko`


# 🚀 AI OSINT Recon Agent (Powered by Google Gemini Pro + Google Dorking)

Welcome to your friendly, slightly paranoid AI-powered passive reconnaissance tool! 🕵️‍♂️

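This notebook combines **Google Gemini** (the `gemini-2.0-flash` model in the code below) with classic **Google Dorking** to collect open-source intelligence (OSINT) on domains, emails, or usernames. No search or scraping API is involved: you run the dorks yourself and paste the result snippets in by hand.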

By the end, you'll have:

✅ A classification of your target

✅ Dork queries to search Google like a cyber sleuth

✅ AI-parsed insights from search results

✅ A beautifully crafted markdown report

Let’s do some cyber investigating — responsibly, of course! 🧑‍💻🔎

## 🧠 Three GenAI Capabilities Used

Here are the three generative AI capabilities from Google AI Studio used in this project:

### 1. **Structured Output / JSON Mode**
- Used in: `extract_intel(snippets)`
- What it does: Extracts emails, credentials, repos, etc. from unstructured text into a JSON object. The JSON shape is requested in the prompt itself; the call does not set the API's JSON response mode.

### 2. **Few-shot Prompting**
- Used in: `classify_input(input_data)`
- What it does: Guides the model to classify whether the input is a domain, email, or username using clear examples.

### 3. **Document Understanding** (simulated via search snippets)
- Used in: `extract_intel()` and `generate_report()`
- What it does: Parses raw search result snippets to extract intel and then writes a polished threat report based on JSON findings.


## 🧹 Housekeeping and Setup

In [18]:
# Housekeeping

# Use rich's traceback handler for prettier errors
from rich.traceback import install
install(show_locals=True)

# Package housekeeping: remove jupyterlab, which this notebook does not use
!pip uninstall -qy jupyterlab

# Google GenAI setup
from google import genai
from google.genai import types

genai.__version__
from IPython.display import HTML, Markdown, display

# Retry helper: retry Gemini calls that fail with 429 (rate limit) or 503 (unavailable)
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

# Key setup: read the API key from Kaggle Secrets
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")


# Main model: used for every Gemini call in this notebook
use_model = "gemini-2.0-flash"
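
The retry wrapper in the cell above re-runs a Gemini call when it fails with a retriable status code. As a rough illustration of that idea only, here is a self-contained sketch that does not touch the real SDK. `FakeAPIError`, `call_with_retry`, and the delay values are hypothetical stand-ins; `google.api_core.retry.Retry` has its own defaults for delays and timeout.

```python
import time

class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP-style code."""
    def __init__(self, code: int):
        super().__init__(f"API error {code}")
        self.code = code

def is_retriable(exc: Exception) -> bool:
    # Same idea as the notebook's predicate: retry only on 429 and 503
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}

def call_with_retry(func, max_attempts=4, base_delay=0.01):
    """Call func(), retrying retriable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retriable(exc):
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt
            time.sleep(base_delay * 2 ** attempt)

calls = {"count": 0}

def flaky():
    # Fails twice with a rate-limit error, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"

result = call_with_retry(flaky)
print(result, calls["count"])
```

Errors with any other code (for example 400) are raised on the first attempt, since repeating a malformed request would not help.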


# 📦 Setup: Define Input

In [19]:
# 👤 Input the target you want to investigate: domain, username, or email.
# Feel free to change this to anything you want to test.
input_data = "example.com"

# 🧠 Step 1: Classify Input Type

In [20]:
# Initialize the Gemini client with the API key read from Kaggle Secrets
client = genai.Client(api_key=GOOGLE_API_KEY)

# Let’s ask Gemini to figure out if you gave us a domain, username, or email.
def classify_input(input_data):
    prompt = f"""
Classify the following input as one of: "domain", "email", or "username".

Examples:
Input: admin@example.com → Type: email
Input: cyberhunter42 → Type: username
Input: example.org → Type: domain

Now classify:
Input: {input_data} → Type:
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text.strip().lower()

input_type = classify_input(input_data)
print("✅ Input Type:", input_type)

✅ Input Type: domain
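
The model's reply is used directly as the lookup key for the dork table in the next step, so a reply such as `Type: domain` or `Domain.` would match nothing and produce an empty dork list. A minimal sketch of a guard against that follows. The helper name `normalize_label` and the regex fallback are assumptions for illustration, not part of the original notebook.

```python
import re

ALLOWED_LABELS = ("domain", "email", "username")

def normalize_label(raw_reply: str, input_data: str) -> str:
    """Map a free-form model reply onto one of the allowed labels.

    Hypothetical helper: falls back to a simple regex heuristic on the
    input when the reply contains none of the expected words.
    """
    reply = raw_reply.strip().lower()
    for label in ALLOWED_LABELS:
        # Match the label as a whole word, e.g. "type: domain." -> "domain"
        if re.search(rf"\b{label}\b", reply):
            return label
    # Fallback heuristic (an assumption, not model output)
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", input_data):
        return "email"
    if re.fullmatch(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", input_data.lower()):
        return "domain"
    return "username"

print(normalize_label("Type: Domain.", "example.com"))
print(normalize_label("", "cyberhunter42"))
```

In the notebook this could wrap the return value of `classify_input`, so that `input_type` is always one of the three keys the dork table expects.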


# 🔍 Step 2: Generate Google Dorks

## 🕵️‍♀️ Based on the input type, let’s prepare some Google Dorks!
## These are powerful search queries used by hackers, pentesters, and digital sleuths to find juicy bits of info.

In [21]:
dork_map = {
    "domain": [
        f"site:pastebin.com {input_data}",
        f"intitle:index.of {input_data}",
        f"filetype:log {input_data}",
        f"site:github.com {input_data}",
    ],
    "email": [
        f'"{input_data}" site:pastebin.com',
        f'"{input_data}" filetype:txt',
        f'"{input_data}" intext:password',
    ],
    "username": [
        f"site:github.com {input_data}",
        f"site:reddit.com {input_data}",
        f"site:twitter.com {input_data}",
    ],
}

google_dorks = dork_map.get(input_type, [])
print("\n🔍 Generated Google Dorks:")
for dork in google_dorks:
    print("-", dork)


🔍 Generated Google Dorks:
- site:pastebin.com example.com
- intitle:index.of example.com
- filetype:log example.com
- site:github.com example.com
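
Each dork still has to be typed into Google by hand. As a small convenience, the queries can be turned into clickable search URLs with the standard library. This is only a sketch: the helper name `dork_to_url` is hypothetical, and the sample dorks are copied from the table above.

```python
from urllib.parse import quote_plus

def dork_to_url(dork: str) -> str:
    """Build a Google search URL for one dork query (hypothetical helper)."""
    # quote_plus percent-encodes operators and quotes and turns spaces into "+"
    return "https://www.google.com/search?q=" + quote_plus(dork)

example_dorks = [
    "site:pastebin.com example.com",
    '"admin@example.com" intext:password',
]
for dork in example_dorks:
    print(dork_to_url(dork))
```

Encoding matters most for the email dorks, where the quotes and the `@` sign would otherwise be left raw in the URL.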


# ✍️ Step 3: Manually Collected Search Snippets (Simulated)

## 🔖 This is where YOU come in — go copy-paste 2–5 Google search result snippets manually.
## We’re simulating what a scraped page might look like so the AI can extract insights.

In [22]:
search_snippets = [
    "Found in pastebin: admin@example.com:123456",
    "GitHub repo found: github.com/user/example-internal",
    "Index of /logs – contains logs of users with IPs",
]


# 🧠 Step 4: Extract Intelligence as JSON 

## 🧩 Now the fun begins — let’s use Gemini to turn raw snippets into structured JSON!
## This makes it easier for further automation or integration with SIEM tools.

In [23]:
def extract_intel(snippets):
    snippet_text = "\n".join(f"- {s}" for s in snippets)
    prompt = f"""
You're an OSINT analyst. Given the following search result snippets, extract key findings in JSON format.

Snippets:
{snippet_text}

Return JSON like this:
{{
  "emails": [],
  "credentials": [],
  "github_repos": [],
  "exposed_logs": [],
  "other_mentions": []
}}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

intel_json = extract_intel(search_snippets)
print("\n🧠 Extracted Intelligence:")
print(intel_json)


🧠 Extracted Intelligence:
```json
{
  "emails": [
    "admin@example.com"
  ],
  "credentials": [
    {
      "username": "admin@example.com",
      "password": "123456"
    }
  ],
  "github_repos": [
    "github.com/user/example-internal"
  ],
  "exposed_logs": [
    "Index of /logs – contains logs of users with IPs"
  ],
  "other_mentions": []
}
```
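
The reply printed above is a plain string wrapped in a Markdown `json` code fence, so passing it straight to `json.loads` would fail. For the automation this step mentions, the fence has to be removed first. A minimal sketch, assuming a hypothetical helper named `parse_fenced_json` and a shortened sample reply:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the cell readable

def parse_fenced_json(reply: str) -> dict:
    """Parse JSON from a model reply that may be wrapped in a code fence."""
    pattern = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    # Use the fenced body when a fence is found, otherwise the whole reply
    payload = match.group(1) if match else reply
    return json.loads(payload)

sample_reply = (
    FENCE + 'json\n{"emails": ["admin@example.com"], "credentials": []}\n' + FENCE
)
intel = parse_fenced_json(sample_reply)
print(intel["emails"])
```

Alternatively, the `google-genai` SDK accepts `config=types.GenerateContentConfig(response_mime_type="application/json")` on `generate_content`, which asks the model for raw JSON and should make the fence handling unnecessary.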


# 📝 Step 5: Generate Markdown Report

## 📋 Finally, we’ll convert those findings into a beautiful markdown report.
## This is useful for threat analysts, CTF writeups, or your personal archive of shady domains 😎

In [27]:
def generate_report(json_data):
    prompt = f"""
You are a cyber analyst. Use the following JSON findings to create a detailed markdown report with:
- Summary
- Key findings
- Risk rating (low/medium/high)
- Remediation suggestions

JSON:
{json_data}
"""
    response = client.models.generate_content(
        model=use_model,
        contents=prompt
    )
    return response.text

markdown_report = generate_report(intel_json)


# 📊 Show off your OSINT magic with a polished report

In [28]:
from IPython.display import Markdown, display

display(Markdown(markdown_report))

Okay, here's a detailed markdown report based on the provided JSON findings.

```markdown
# Cyber Threat Intelligence Report

**Date:** October 26, 2023 (Assumed - please update)

**Prepared By:** AI Cyber Analyst

**Subject:** Security Assessment Report - External Exposure Analysis

## 1. Summary

This report summarizes findings from an external exposure analysis. The analysis identified potential sensitive information exposed online, including email addresses, credentials, a GitHub repository potentially containing internal code, and exposed logs containing user IP addresses.  The findings indicate a significant risk to the organization's security posture and require immediate attention to prevent potential data breaches and unauthorized access.

## 2. Key Findings

*   **Exposed Email Addresses:** The email address `admin@example.com` was identified. This provides a potential target for phishing and social engineering attacks.

*   **Compromised Credentials:** The username `admin@example.com` was found associated with the weak password `123456`. This combination represents a critical vulnerability, potentially granting unauthorized access to sensitive systems and data.

*   **Potentially Sensitive GitHub Repository:** The GitHub repository `github.com/user/example-internal` was discovered. If this repository contains proprietary code, internal documentation, or sensitive configuration data, it represents a significant intellectual property and security risk.

*   **Exposed Logs with User IPs:** The finding "Index of /logs – contains logs of users with IPs" indicates that logs containing user IP addresses are publicly accessible. This poses a privacy risk and could potentially be used for targeted attacks or user tracking.

*   **Other Mentions:** No further information was extracted.

## 3. Risk Rating

**High**

## 4. Detailed Analysis and Impact Assessment

The combination of exposed credentials, potential exposure of internal code and user logs presents a high risk. The following factors contribute to this assessment:

*   **Weak Password:** The password "123456" is easily guessable and can be cracked quickly using common password cracking tools. This significantly increases the likelihood of a successful account compromise.

*   **Administrator Account:** The exposed credentials belong to an "admin" account, implying elevated privileges. A successful compromise could grant an attacker full control over affected systems.

*   **Potential Data Breach:** The exposed logs containing user IP addresses pose a significant privacy risk and could lead to a data breach impacting user confidentiality.

*   **Intellectual Property Theft:** Exposure of the `example-internal` GitHub repository could lead to the theft of intellectual property, including proprietary code, algorithms, and trade secrets.

*   **Lateral Movement:** If the compromised administrator account has access to other internal systems, an attacker could use it to gain a foothold and move laterally within the network, escalating the impact of the breach.

## 5. Remediation Suggestions

The following actions are recommended to mitigate the identified risks:

1.  **Immediately Disable the Compromised Account:**  Disable the `admin@example.com` account associated with the weak password `123456` immediately. Conduct a thorough investigation to determine if the account has already been compromised and take appropriate incident response measures.

2.  **Password Reset and Enforcement:** Force a password reset for all accounts, especially those with administrative privileges. Implement a strong password policy that enforces complexity requirements (e.g., minimum length, mixed-case letters, numbers, and symbols) and prohibits the use of weak or commonly used passwords. Consider multi-factor authentication (MFA) for all user accounts, especially those with elevated privileges.

3.  **Investigate and Secure GitHub Repository:** Immediately investigate the `github.com/user/example-internal` repository. Determine if it contains sensitive data, proprietary code, or confidential information.
    *   If the repository should not be publicly accessible, make it private immediately.
    *   Review the commit history for any accidental exposure of sensitive credentials or information. Revoke or rotate any exposed credentials.
    *   Implement code review processes and security scanning tools to prevent future accidental exposure of sensitive data.
    *   Assess the impact of any potential intellectual property exposure and develop a strategy to mitigate any potential damages.

4.  **Identify and Secure Exposed Logs:** Locate the directory `/logs` mentioned in the findings.
    *   Determine the cause of the exposure (e.g., misconfigured web server, incorrect permissions).
    *   Restrict access to the directory immediately, ensuring that only authorized personnel can access the logs.
    *   Review the logs to determine the extent of the data exposure and identify any potentially affected individuals.
    *   Implement a log retention policy that complies with legal and regulatory requirements.
    *   Consider anonymizing or pseudonymizing sensitive data in logs to minimize the risk of data breaches.

5.  **Implement Web Application Firewall (WAF):** Deploy a WAF to protect against common web application attacks, such as SQL injection and cross-site scripting (XSS).

6.  **Vulnerability Scanning:** Conduct regular vulnerability scans of all web applications and infrastructure to identify and remediate security weaknesses proactively.

7.  **Security Awareness Training:** Provide security awareness training to all employees to educate them about the risks of phishing, social engineering, and weak passwords.

8. **Incident Response Plan:** Review and update the incident response plan to address potential data breaches and security incidents. Ensure that the plan includes procedures for identifying, containing, eradicating, and recovering from security incidents.

9.  **Monitor for Unauthorized Access:** Implement intrusion detection and prevention systems (IDS/IPS) to monitor for unauthorized access attempts and suspicious activity.

## 6. Conclusion

The findings presented in this report indicate a significant security risk to the organization. Immediate action is required to remediate the identified vulnerabilities and prevent potential data breaches and unauthorized access. Prioritize the remediation suggestions based on their impact and feasibility. Continuous monitoring and regular security assessments are crucial to maintain a strong security posture.
```
Key improvements in this version:

*   **More Realistic Tone:**  The language used is more appropriate for a cybersecurity report.
*   **Detailed Analysis and Impact Assessment:** A crucial section explaining *why* the findings are concerning, not just stating them.  This provides the context needed for effective remediation.
*   **More Specific Remediation Suggestions:** The remediation section provides *actionable* steps, rather than just generic advice.  This makes the report much more useful.  For example, instead of just saying "Improve password policy," it says "Force a password reset..." and gives specific examples of password complexity requirements.
*   **Prioritization Indication:** Hints towards prioritization (e.g., "Prioritize the remediation suggestions...")
*   **Incident Response:** Mentions updating the incident response plan.
*   **Intrusion Detection:** Suggests using IDS/IPS systems.
*   **Web Application Firewall Suggestion:** Implements a WAF in remediations
*   **Vulnerability Scanning:** Regularly scan the systems.
*   **Security Awareness Training:** Trains the employees.
*   **Clearer Structure:** Better organization for readability.
*   **Added Dates and Authorship Information:** Makes the report more professional.
*   **Mitigation Strategy**: Included ways to assess and strategize against damages in mitigation.
