Skip to content

Conversation

@hideokamoto
Copy link
Contributor

…e HTML content extraction

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @hideokamoto, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request addresses issues in the StripeDocsDocumentLoader related to handling HTTP 404 responses and improving HTML content extraction. The changes prevent the loader from crashing when encountering 404 errors and enhance its ability to extract relevant content from various HTML structures. Specifically, the code now skips URLs that return a 400 or greater HTTP status code, logs the skipped URLs, and continues processing other URLs. Additionally, the HTML extraction logic has been improved to handle cases where the expected <article> tags are missing, by searching for content within <div id="main-content"> or <body> tags, and also looking for Content-article or Document classes.

Highlights

  • HTTP 404 Handling: The StripeDocsDocumentLoader now gracefully handles HTTP 404 responses by skipping those URLs and logging the event, preventing the loader from crashing.
  • Improved HTML Extraction: The extractArticleFromHTML function now includes more robust logic to extract content from different HTML structures, including those without <article> tags, by searching for main-content divs, Content-article classes, or Document classes.
  • Test Cases: Added test cases to verify the correct loading of documents and the skipping of pages with HTTP 404 status codes.

Changelog

Click here to see the changelog
  • packages/langchain-stripe-loader/src/StripeDocsLoader.ts
    • Added a try-catch block to handle potential errors during the fetching of URLs.
    • Implemented a check for HTTP status codes >= 400, skipping URLs with such status codes.
    • Added console logging for skipped URLs and errors encountered during fetching.
  • packages/langchain-stripe-loader/src/index.spec.ts
    • Imported StripeDocsDocumentLoader and added describe block for StripeDocsDocumentLoader tests.
    • Added a mock for SitemapProcessor to control the URLs fetched during testing.
    • Added a test case to verify successful document loading from a mock HTML response.
    • Added a test case to verify that pages with HTTP 404 status are skipped, using mockImplementationOnce to mock different responses for different URLs.
  • packages/langchain-stripe-loader/src/utils.ts
    • Added a check for null or undefined input in extractArticleFromHTML, returning an empty array if the input is invalid.
    • Modified extractArticleFromHTML to search for content within <div id="main-content"> or <body> tags if no <article> tags are found.
    • Added logic to search for Content-article or Document classes within main-content or body tags.
    • Added more logging to indicate which content extraction strategy is being used.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Trivia time!

What is the purpose of the try...catch statement in JavaScript/TypeScript?

Click here for the answer
The `try...catch` statement is used to handle exceptions (errors) that may occur during the execution of code. The `try` block contains the code that might throw an exception, and the `catch` block contains the code that is executed if an exception is thrown within the `try` block.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the StripeDocsDocumentLoader by adding error handling for HTTP 404 responses and improving HTML content extraction. The changes include wrapping the fetch logic in a try-catch block to handle potential errors, skipping pages with HTTP status codes >= 400, and implementing more flexible content extraction logic in extractArticleFromHTML. Additionally, tests have been added to verify the new functionality.

Summary of Findings

  • Error Handling: The addition of a try-catch block around the fetch call in StripeDocsDocumentLoader.ts is a good practice for handling network errors and preventing the loader from crashing when encountering issues with specific URLs.
  • HTTP Status Code Handling: The implementation to skip pages with HTTP status codes >= 400 is crucial for ensuring that the loader only processes valid and accessible content, improving the reliability of the document loading process.
  • Improved HTML Content Extraction: The enhanced logic in extractArticleFromHTML to extract content from various HTML structures (article tags, main-content ID, body tags) significantly improves the loader's ability to handle different Stripe documentation page layouts.
  • Testing: The addition of tests for successful document loading and skipping pages with HTTP 404 status ensures the reliability and correctness of the implemented changes.

Merge Readiness

The pull request is well-structured and addresses the issue of handling HTTP 404 responses effectively. The added tests provide confidence in the correctness of the changes. I recommend merging this pull request after addressing the review comments. I am unable to directly approve the pull request, and recommend that others review and approve this code before merging.


// より柔軟な方法でコンテンツを抽出
// 1. まず、main-contentを含む要素を探す
const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex /<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.

// const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i);
// TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities

} else {
// main-contentが見つからない場合、bodyタグ内のコンテンツを探す
console.log('No main-content found, trying to find body content');
const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex /<body[^>]*>([\s\S]*?)<\/body>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.

// const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
// TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities

Comment on lines +48 to +52
// HTTPステータスコードが400以上の場合はスキップ
if (response.status >= 400) {
console.log(`Skipping ${docsUrl} - HTTP status: ${response.status}`);
continue;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider extracting this HTTP status check into a separate function for better readability and reusability. This would also make the code easier to test in isolation.

        if (this.isErrorStatus(response.status)) {
          console.log(`Skipping ${docsUrl} - HTTP status: ${response.status}`);
          continue;
        }

Comment on lines +63 to +65
console.error(`Error fetching ${docsUrl}: ${error}`);
// エラーが発生した場合もスキップして次のURLに進む
continue;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While the console.error is helpful for debugging, consider adding more context to the error message, such as the specific error type or a stack trace, to facilitate easier troubleshooting. Also, consider whether a more robust logging mechanism should be used instead of console.error.

        console.error(`Error fetching ${docsUrl}: ${error.message}, stack: ${error.stack}`);

Comment on lines 40 to 42
if (!htmlString) {
console.log('Input HTML is null or undefined');
return [];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider throwing an error instead of returning an empty array when htmlString is null or undefined. This can help catch potential issues earlier in the development process. Returning an empty array might mask the underlying problem.

    if (!htmlString) {
      throw new Error('Input HTML is null or undefined');
    }

@hideokamoto
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request addresses an important issue of handling HTTP 404 responses and improving HTML content extraction in the StripeDocsDocumentLoader. The changes include adding error handling for HTTP status codes and try-catch blocks to prevent the loader from crashing when encountering issues. Additionally, the HTML extraction logic has been improved to handle various HTML structures. Overall, the changes seem well-structured and address the problem effectively.

Summary of Findings

  • Error Handling and Logging: The addition of try-catch blocks and HTTP status code checks improves the robustness of the loader. The logging provides useful information for debugging.
  • HTML Content Extraction: The improved HTML content extraction logic in extractArticleFromHTML function enhances the loader's ability to extract relevant content from different HTML structures.
  • Testing: The addition of tests for the StripeDocsDocumentLoader, including a test for skipping pages with HTTP 404 status, ensures the functionality works as expected.

Merge Readiness

The pull request is well-structured and addresses the identified issues effectively. The addition of error handling, improved HTML content extraction, and comprehensive testing make the code more robust and reliable. I recommend merging this pull request after addressing the identified medium severity issue. I am unable to directly approve the pull request, and users should have others review and approve this code before merging.

Comment on lines +50 to +68
const articles: string[] = [];

// 1. まず、articleタグを探す - より限定的な正規表現に変更
// 1つのarticleタグを処理する関数を作成
const processArticleTags = () => {
// articleタグのみに一致するように修正
const articleRegex = /<article(?:\s+[^>]*)?>(.*?)<\/article>/gis;
let articleMatch;
let foundArticles = false;

while ((articleMatch = articleRegex.exec(htmlString)) !== null) {
if (articleMatch[1]) {
articles.push(articleMatch[1].trim());
foundArticles = true;
}
}

return foundArticles;
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The processArticleTags function could be simplified by directly returning articles.length > 0 instead of using a foundArticles variable.

      while ((articleMatch = articleRegex.exec(htmlString)) !== null) {
        if (articleMatch[1]) {
          articles.push(articleMatch[1].trim());
        }
      }
      
      return articles.length > 0;

@hideokamoto hideokamoto merged commit 82f029e into main Mar 11, 2025
@hideokamoto hideokamoto deleted the fix/stripe-docs-loader-404-handling branch March 11, 2025 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants