Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2

hideokamoto · 2025-03-11T06:28:34Z

…e HTML content extraction

gemini-code-assist

Hello @hideokamoto, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request addresses issues in the StripeDocsDocumentLoader related to handling HTTP 404 responses and improving HTML content extraction. The changes prevent the loader from crashing when encountering 404 errors and enhance its ability to extract relevant content from various HTML structures. Specifically, the code now skips URLs that return a 400 or greater HTTP status code, logs the skipped URLs, and continues processing other URLs. Additionally, the HTML extraction logic has been improved to handle cases where the expected <article> tags are missing, by searching for content within <div id="main-content"> or <body> tags, and also looking for Content-article or Document classes.

Highlights

HTTP 404 Handling: The StripeDocsDocumentLoader now gracefully handles HTTP 404 responses by skipping those URLs and logging the event, preventing the loader from crashing.
Improved HTML Extraction: The extractArticleFromHTML function now includes more robust logic to extract content from different HTML structures, including those without <article> tags, by searching for main-content divs, Content-article classes, or Document classes.
Test Cases: Added test cases to verify the correct loading of documents and the skipping of pages with HTTP 404 status codes.

Changelog

Click here to see the changelog

packages/langchain-stripe-loader/src/StripeDocsLoader.ts
- Added a try-catch block to handle potential errors during the fetching of URLs.
- Implemented a check for HTTP status codes >= 400, skipping URLs with such status codes.
- Added console logging for skipped URLs and errors encountered during fetching.
packages/langchain-stripe-loader/src/index.spec.ts
- Imported StripeDocsDocumentLoader and added describe block for StripeDocsDocumentLoader tests.
- Added a mock for SitemapProcessor to control the URLs fetched during testing.
- Added a test case to verify successful document loading from a mock HTML response.
- Added a test case to verify that pages with HTTP 404 status are skipped, using mockImplementationOnce to mock different responses for different URLs.
packages/langchain-stripe-loader/src/utils.ts
- Added a check for null or undefined input in extractArticleFromHTML, returning an empty array if the input is invalid.
- Modified extractArticleFromHTML to search for content within <div id="main-content"> or <body> tags if no <article> tags are found.
- Added logic to search for Content-article or Document classes within main-content or body tags.
- Added more logging to indicate which content extraction strategy is being used.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Trivia time!

What is the purpose of the try...catch statement in JavaScript/TypeScript?

Click here for the answer

The `try...catch` statement is used to handle exceptions (errors) that may occur during the execution of code. The `try` block contains the code that might throw an exception, and the `catch` block contains the code that is executed if an exception is thrown within the `try` block.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request enhances the StripeDocsDocumentLoader by adding error handling for HTTP 404 responses and improving HTML content extraction. The changes include wrapping the fetch logic in a try-catch block to handle potential errors, skipping pages with HTTP status codes >= 400, and implementing more flexible content extraction logic in extractArticleFromHTML. Additionally, tests have been added to verify the new functionality.

Summary of Findings

Error Handling: The addition of a try-catch block around the fetch call in StripeDocsDocumentLoader.ts is a good practice for handling network errors and preventing the loader from crashing when encountering issues with specific URLs.
HTTP Status Code Handling: The implementation to skip pages with HTTP status codes >= 400 is crucial for ensuring that the loader only processes valid and accessible content, improving the reliability of the document loading process.
Improved HTML Content Extraction: The enhanced logic in extractArticleFromHTML to extract content from various HTML structures (article tags, main-content ID, body tags) significantly improves the loader's ability to handle different Stripe documentation page layouts.
Testing: The addition of tests for successful document loading and skipping pages with HTTP 404 status ensures the reliability and correctness of the implemented changes.

Merge Readiness

The pull request is well-structured and addresses the issue of handling HTTP 404 responses effectively. The added tests provide confidence in the correctness of the changes. I recommend merging this pull request after addressing the review comments. I am unable to directly approve the pull request, and recommend that others review and approve this code before merging.

gemini-code-assist · 2025-03-11T06:28:56Z

packages/langchain-stripe-loader/src/utils.ts

+
+      // より柔軟な方法でコンテンツを抽出
+      // 1. まず、main-contentを含む要素を探す
+      const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i);


The regex /<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.

// const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i); // TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities

gemini-code-assist · 2025-03-11T06:28:56Z

packages/langchain-stripe-loader/src/utils.ts

+      } else {
+        // main-contentが見つからない場合、bodyタグ内のコンテンツを探す
+        console.log('No main-content found, trying to find body content');
+        const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i);


The regex /<body[^>]*>([\s\S]*?)<\/body>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.

// const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i); // TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities

gemini-code-assist · 2025-03-11T06:28:56Z

packages/langchain-stripe-loader/src/StripeDocsLoader.ts

+        // HTTPステータスコードが400以上の場合はスキップ
+        if (response.status >= 400) {
+          console.log(`Skipping ${docsUrl} - HTTP status: ${response.status}`);
+          continue;
+        }


Consider extracting this HTTP status check into a separate function for better readability and reusability. This would also make the code easier to test in isolation.

if (this.isErrorStatus(response.status)) { console.log(`Skipping ${docsUrl} - HTTP status: ${response.status}`); continue; }

gemini-code-assist · 2025-03-11T06:28:56Z

packages/langchain-stripe-loader/src/StripeDocsLoader.ts

+        console.error(`Error fetching ${docsUrl}: ${error}`);
+        // エラーが発生した場合もスキップして次のURLに進む
+        continue;


While the console.error is helpful for debugging, consider adding more context to the error message, such as the specific error type or a stack trace, to facilitate easier troubleshooting. Also, consider whether a more robust logging mechanism should be used instead of console.error.

console.error(`Error fetching ${docsUrl}: ${error.message}, stack: ${error.stack}`);

gemini-code-assist · 2025-03-11T06:28:56Z

packages/langchain-stripe-loader/src/utils.ts

+    if (!htmlString) {
+      console.log('Input HTML is null or undefined');
+      return [];


Consider throwing an error instead of returning an empty array when htmlString is null or undefined. This can help catch potential issues earlier in the development process. Returning an empty array might mask the underlying problem.

if (!htmlString) { throw new Error('Input HTML is null or undefined'); }

hideokamoto · 2025-03-11T06:36:06Z

/gemini review

gemini-code-assist

Code Review

The pull request addresses an important issue of handling HTTP 404 responses and improving HTML content extraction in the StripeDocsDocumentLoader. The changes include adding error handling for HTTP status codes and try-catch blocks to prevent the loader from crashing when encountering issues. Additionally, the HTML extraction logic has been improved to handle various HTML structures. Overall, the changes seem well-structured and address the problem effectively.

Summary of Findings

Error Handling and Logging: The addition of try-catch blocks and HTTP status code checks improves the robustness of the loader. The logging provides useful information for debugging.
HTML Content Extraction: The improved HTML content extraction logic in extractArticleFromHTML function enhances the loader's ability to extract relevant content from different HTML structures.
Testing: The addition of tests for the StripeDocsDocumentLoader, including a test for skipping pages with HTTP 404 status, ensures the functionality works as expected.

Merge Readiness

The pull request is well-structured and addresses the identified issues effectively. The addition of error handling, improved HTML content extraction, and comprehensive testing make the code more robust and reliable. I recommend merging this pull request after addressing the identified medium severity issue. I am unable to directly approve the pull request, and users should have others review and approve this code before merging.

packages/langchain-stripe-loader/src/StripeDocsLoader.ts

gemini-code-assist · 2025-03-11T06:36:21Z

packages/langchain-stripe-loader/src/utils.ts

+    const articles: string[] = [];
+
+    // 1. まず、articleタグを探す - より限定的な正規表現に変更
+    // 1つのarticleタグを処理する関数を作成
+    const processArticleTags = () => {
+      // articleタグのみに一致するように修正
+      const articleRegex = /<article(?:\s+[^>]*)?>(.*?)<\/article>/gis;
+      let articleMatch;
+      let foundArticles = false;
+
+      while ((articleMatch = articleRegex.exec(htmlString)) !== null) {
+        if (articleMatch[1]) {
+          articles.push(articleMatch[1].trim());
+          foundArticles = true;
+        }
+      }
+
+      return foundArticles;
+    };


The processArticleTags function could be simplified by directly returning articles.length > 0 instead of using a foundArticles variable.

while ((articleMatch = articleRegex.exec(htmlString)) !== null) { if (articleMatch[1]) { articles.push(articleMatch[1].trim()); } } return articles.length > 0;

Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv…

eac640f

…e HTML content extraction

gemini-code-assist bot reviewed Mar 11, 2025

View reviewed changes

gemini-code-assist bot suggested changes Mar 11, 2025

View reviewed changes

update regs

3556380

gemini-code-assist bot reviewed Mar 11, 2025

View reviewed changes

hideokamoto merged commit 82f029e into main Mar 11, 2025

hideokamoto deleted the fix/stripe-docs-loader-404-handling branch March 11, 2025 06:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2

Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2

Uh oh!

hideokamoto commented Mar 11, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

hideokamoto commented Mar 11, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

gemini-code-assist bot Mar 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2

Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2

Uh oh!

Conversation

hideokamoto commented Mar 11, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Changelog

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Summary of Findings

Merge Readiness

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

hideokamoto commented Mar 11, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Summary of Findings

Merge Readiness

Uh oh!

Uh oh!

gemini-code-assist bot Mar 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants