# Semgrep AI

In [26]:
import os, sys, json, re
from dotenv import load_dotenv
import google.generativeai as genai
from IPython.display import Markdown, display, update_display

In [11]:
load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')

if not GEMINI_API_KEY:
    print("No GEMINI_API_KEY in environment")
    sys.exit(1)

In [8]:
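# genai.configure() sets the API key globally and returns None,
# so there is nothing useful to assign here.
genai.configure(
    api_key = GEMINI_API_KEY
)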

In [81]:
try:
    with open("test/report.semgrep.formatted.json", "r") as f:
        report = f.read()
except FileNotFoundError:
    print("Error: Report file not found.")
    sys.exit(1)

try:
    with open("test/Comment.jsx", "r") as f:
        file_content = f.read()
        file_type = "jsx"
except FileNotFoundError:
    print("Error: Source file not found.")
    sys.exit(1)
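
The cell above reads one report and one source file, while the system prompt in this notebook allows several code files per report. A minimal sketch of collecting them into one prompt section could look like the following. `build_files_section` and the `--- <path> ---` delimiter are hypothetical, chosen for illustration only, and are not part of the notebook, Semgrep, or the Gemini SDK.

```python
from pathlib import Path


def build_files_section(paths: list[str]) -> str:
    """Concatenate source files into one prompt section, labelling each by path.

    Hypothetical helper: the '--- <path> ---' delimiter is an assumption,
    not a format that the model or the analyzer requires.
    """
    parts = []
    for path in paths:
        p = Path(path)
        if not p.is_file():
            # Skip missing files instead of exiting, so one bad path
            # does not abort the whole review.
            print(f"Warning: {path} not found, skipping.")
            continue
        parts.append(f"--- {path} ---\n{p.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

The returned string could stand in for `file_content` when the user prompt is built, for example `build_files_section(["test/Comment.jsx"])`.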

In [69]:
system_prompt = """
You are a Security Analyst AI.
Your task is to review static code analysis findings and their corresponding code.
You will be provided with a JSON object containing the fingerprints from a static code analyzer report, together with the relevant code files.
Your sole output must be a single JSON object in the format described below, with no other text or explanation.
First analyze the report. The findings are nested under the `fingerprints` array, and each one includes the following fields:
- fingerprint: the fingerprint of the finding
- warning_type: the warning category of the finding
- severity: the severity category of the finding, one of WARNING, ERROR or INFO
- confidence: the confidence level of the finding
- impact: the actual severity of the finding
- check_name: the rule file that the finding was checked against
- message: a short message about the finding
- file: the file in which the finding was found
- start_line: the starting line of the finding
- end_line: the ending line of the finding
- code: the code snippet that triggered the finding
- references: a reference url for the finding

Review these files against their respective findings and output a single JSON object in the following format:
{
  "classification": "True Positive" | "False Positive",
  "confidence": "LOW" | "MEDIUM" | "HIGH",
  "remediation": [
    "<string>"
  ],
  "references": [
    "<string>"
  ],
  "comment": "<string>",
  "code": [
      "<string>"
  ]
}

You have to do the following:
1. Analyze the Input: You will receive a JSON object containing the fingerprints array. Each object in the array represents a single security finding.
2. Determine Classification: Based on the provided code and message, classify the finding as either a "True Positive" or a "False Positive".
3. Set Confidence Level: Indicate your confidence in this classification as "LOW", "MEDIUM", or "HIGH".
4. Provide Remediation: If you classify the finding as a "True Positive", provide a remediation array containing one or more clear, actionable steps to fix the vulnerability. If it's a "False Positive", this array should contain a single item with a brief explanation. Keep in mind that we have implemented `purifyHTML` and `purifyURL` functions that sanitize HTML and URLs respectively. Recommend them where applicable. An example of purifyHTML: `import { purifyHTML } from 'app/helpers/purify'; ... dangerouslySetInnerHTML={{ __html: purifyHTML(formattedBody) }} ...`. You can enhance their code in the remediation actions for more readable output.
5. Provide Code Examples: Do not output whole code blocks in the remediation array. Use the code section for that purpose, with one or more examples. Feel free to add example code that implements the remediation or shows alternatives.
6. Manage References: Populate the references array. Always include the original reference URL from the input. You may add one to two additional, relevant, and active URLs.
7. Write a Comment: In the comment field, provide a concise plaintext summary of your analysis. For instance: "This is a clear True Positive with high confidence because the user-controlled input is passed directly to a SQL query, leading to a potential SQL injection." or "This appears to be a False Positive with high confidence as the variable is hardcoded and not influenced by external input."
8. Final Output: Ensure your final output is only the JSON object described above. Do NOT include any other text, markdown formatting, or explanations before or after the JSON object. Ignore any empty or irrelevant fields from the initial report.
9. You will be provided with a single static code analysis report and one or more files containing the code relevant to the finding. Use them accordingly.
10. Your output must be parseable by Python's json.loads(response.text).
"""
# The first placeholder must be the Semgrep report, not the source file again.
user_prompt = f"Please review. Static code analysis report:\n\n{report}\n\nFile contents:\n\n{file_content}\n\n"

generation_config = {
    "temperature": 0.2,
    "max_output_tokens": 8192,
}

try:
    gemini = genai.GenerativeModel(
        model_name = "gemini-2.5-pro",
        system_instruction = system_prompt
    )
    response = gemini.generate_content(
        user_prompt,
        generation_config = generation_config
    )
    print(response.text)
except Exception as e:
    print(f"Something went wrong: {e}")

```json
{
  "classification": "True Positive",
  "confidence": "HIGH",
  "remediation": [
    "The `formattedBody` prop, which likely contains user-generated content, should be sanitized before being passed to `dangerouslySetInnerHTML`. Use the `purifyHTML` helper function to mitigate the risk of Cross-Site Scripting (XSS)."
  ],
  "references": [
    "https://brakeman-pro.com/docs/warning_types/cross_site_scripting_react/",
    "https://react.dev/reference/react-dom/components/common#dangerouslysetinnerhtml",
    "https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html"
  ],
  "comment": "The component uses `dangerouslySetInnerHTML` to render the `formattedBody` prop. Since this prop contains content originating from a user's comment, it is susceptible to Cross-Site Scripting (XSS) attacks if not properly sanitized. The recommended fix is to wrap the `formattedBody` variable with a sanitization function like `purifyHTML` before rendering.",
  "code": [
    "import { purifyHTML } from 'app/helpers/purify';\n\n// ...\n\n    

In [70]:
def parse_llm_json_output(response_text: str) -> dict:
    """
    Parses a JSON object from a string that might be wrapped
    in Markdown code blocks.
    """
    # Use a regex to find the JSON content between ```json and ```
    match = re.search(r"```json\s*(\{.*?\})\s*```", response_text, re.DOTALL)
    
    if match:
        # If a match is found, extract the JSON part
        json_str = match.group(1)
    else:
        # Otherwise, assume the whole string is the JSON
        json_str = response_text

    try:
        # Load the cleaned string as JSON
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        # Handle the error appropriately, maybe return None or raise
        return None

response_json = parse_llm_json_output(response.text)
response_json

{'classification': 'True Positive',
 'confidence': 'HIGH',
 'remediation': ['The `formattedBody` prop, which likely contains user-generated content, should be sanitized before being passed to `dangerouslySetInnerHTML`. Use the `purifyHTML` helper function to mitigate the risk of Cross-Site Scripting (XSS).'],
 'references': ['https://brakeman-pro.com/docs/warning_types/cross_site_scripting_react/',
  'https://react.dev/reference/react-dom/components/common#dangerouslysetinnerhtml',
  'https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html'],
 'comment': "The component uses `dangerouslySetInnerHTML` to render the `formattedBody` prop. Since this prop contains content originating from a user's comment, it is susceptible to Cross-Site Scripting (XSS) attacks if not properly sanitized. The recommended fix is to wrap the `formattedBody` variable with a sanitization function like `purifyHTML` before rendering.",
 'code': ['import { purifyHTML } from \'app/helpers/purify\';\n\n// ...\n\n        <div>\n          <div\n            className="q
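
The system prompt fixes the response format, but the model is not guaranteed to follow it, so the parsed dictionary is worth checking before it is rendered. The sketch below is one possible check, assuming the six fields and allowed values listed in the prompt. `validate_review` and the constant names are hypothetical and not part of the notebook or any library.

```python
# Allowed values, taken from the output format in the system prompt.
ALLOWED_CLASSIFICATIONS = ("True Positive", "False Positive")
ALLOWED_CONFIDENCE = ("LOW", "MEDIUM", "HIGH")
LIST_FIELDS = ("remediation", "references", "code")


def validate_review(review) -> list[str]:
    """Return a list of problems in a parsed review; an empty list means it looks valid."""
    if not isinstance(review, dict):
        # Covers the None returned by a failed parse.
        return ["review is not a JSON object"]

    problems = []
    if review.get("classification") not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unexpected classification: {review.get('classification')!r}")
    if review.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append(f"unexpected confidence: {review.get('confidence')!r}")
    if not isinstance(review.get("comment"), str):
        problems.append("comment must be a string")
    for field in LIST_FIELDS:
        value = review.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
    return problems
```

Calling `validate_review(response_json)` before rendering would surface a failed parse or a malformed field as a readable message rather than a `KeyError` or `TypeError` later on.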

In [84]:
def markdown_json_output(response: dict, file_type: str) -> str:
    """
    Parses the final JSON object and creates a GitHub compatible
    Markdown text.
    """
    # Use emojis for quick visual identification
    classification_emoji = "❌" if response['classification'] == 'True Positive' else "✅"
    
    markdown_parts = [
        "## ✨ Gemini thoughts",
        f"{classification_emoji} This looks like a **{response['classification']}** with a **{response['confidence'].lower()}** confidence.",
        f"{response['comment']}"
    ]

    if response.get('remediation'):
        markdown_parts.append("### 🛠️ Remediation")
        remediation_steps = [f"{i+1}. {step}" for i, step in enumerate(response['remediation'])]
        markdown_parts.append("\n".join(remediation_steps))
        markdown_parts.append("\n")

    if response.get('references'):
        markdown_parts.append("### 🔗 References")
        reference_links = [f"- [{url.split('//')[-1]}]({url})" for url in response['references']]
        markdown_parts.append("\n".join(reference_links))

    if response.get('code'):
        markdown_parts.append("### 🖥️ Code Example")
        code_block = "\n".join(response['code'])
        markdown_parts.append(f"```{file_type}\n{code_block}\n```")
        markdown_parts.append("\n")

    return "\n".join(markdown_parts)

display(Markdown(markdown_json_output(response_json, file_type)))

## ✨ Gemini thoughts
❌ This looks like a **True Positive** with a **high** confidence.
The component uses `dangerouslySetInnerHTML` to render the `formattedBody` prop. Since this prop contains content originating from a user's comment, it is susceptible to Cross-Site Scripting (XSS) attacks if not properly sanitized. The recommended fix is to wrap the `formattedBody` variable with a sanitization function like `purifyHTML` before rendering.
### 🛠️ Remediation
1. The `formattedBody` prop, which likely contains user-generated content, should be sanitized before being passed to `dangerouslySetInnerHTML`. Use the `purifyHTML` helper function to mitigate the risk of Cross-Site Scripting (XSS).


### 🔗 References
- [brakeman-pro.com/docs/warning_types/cross_site_scripting_react/](https://brakeman-pro.com/docs/warning_types/cross_site_scripting_react/)
- [react.dev/reference/react-dom/components/common#dangerouslysetinnerhtml](https://react.dev/reference/react-dom/components/common#dangerouslysetinnerhtml)
- [cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html](https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html)
### 🖥️ Code Example
```jsx
import { purifyHTML } from 'app/helpers/purify';

// ...

        <div>
          <div
            className="qna-comment-content"
            dangerouslySetInnerHTML={{ __html: purifyHTML(formattedBody) }}
          />
          {hasVideo && <VideoList videoIds={ytVideoIds} />}
          {skuPreviews && skuPreviews.length > 0 && (
            <ul className="qna-sku-preview-card-list">
              {skuPreviews.map((skuPreview) => (
                <SkuPreview key={skuPreview.id} {...skuPreview} />
              ))}
            </ul>
          )}
          <div className="qna-user-actions">
            <Score commentId={id} upvoteCount={upvotesCount} hasUserVote={hasUserVote} />
            <Share compact hash={`category_comment_${id}`} />
          </div>
        </div>
// ...
```

