-
Notifications
You must be signed in to change notification settings - Fork 0
Fix: Handle HTTP 404 responses in StripeDocsDocumentLoader and improv… #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…e HTML content extraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @hideokamoto, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request addresses issues in the StripeDocsDocumentLoader related to handling HTTP 404 responses and improving HTML content extraction. The changes prevent the loader from crashing when encountering 404 errors and enhance its ability to extract relevant content from various HTML structures. Specifically, the code now skips URLs that return a 400 or greater HTTP status code, logs the skipped URLs, and continues processing other URLs. Additionally, the HTML extraction logic has been improved to handle cases where the expected <article> tags are missing, by searching for content within <div id="main-content"> or <body> tags, and also looking for Content-article or Document classes.
Highlights
- HTTP 404 Handling: The
StripeDocsDocumentLoadernow gracefully handles HTTP 404 responses by skipping those URLs and logging the event, preventing the loader from crashing. - Improved HTML Extraction: The
extractArticleFromHTMLfunction now includes more robust logic to extract content from different HTML structures, including those without<article>tags, by searching formain-contentdivs,Content-articleclasses, orDocumentclasses. - Test Cases: Added test cases to verify the correct loading of documents and the skipping of pages with HTTP 404 status codes.
Changelog
Click here to see the changelog
- packages/langchain-stripe-loader/src/StripeDocsLoader.ts
- Added a try-catch block to handle potential errors during the fetching of URLs.
- Implemented a check for HTTP status codes >= 400, skipping URLs with such status codes.
- Added console logging for skipped URLs and errors encountered during fetching.
- packages/langchain-stripe-loader/src/index.spec.ts
- Imported
StripeDocsDocumentLoaderand addeddescribeblock forStripeDocsDocumentLoadertests. - Added a mock for
SitemapProcessorto control the URLs fetched during testing. - Added a test case to verify successful document loading from a mock HTML response.
- Added a test case to verify that pages with HTTP 404 status are skipped, using
mockImplementationOnceto mock different responses for different URLs.
- Imported
- packages/langchain-stripe-loader/src/utils.ts
- Added a check for null or undefined input in
extractArticleFromHTML, returning an empty array if the input is invalid. - Modified
extractArticleFromHTMLto search for content within<div id="main-content">or<body>tags if no<article>tags are found. - Added logic to search for
Content-articleorDocumentclasses withinmain-contentorbodytags. - Added more logging to indicate which content extraction strategy is being used.
- Added a check for null or undefined input in
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Trivia time!
What is the purpose of the try...catch statement in JavaScript/TypeScript?
Click here for the answer
The `try...catch` statement is used to handle exceptions (errors) that may occur during the execution of code. The `try` block contains the code that might throw an exception, and the `catch` block contains the code that is executed if an exception is thrown within the `try` block.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request enhances the StripeDocsDocumentLoader by adding error handling for HTTP 404 responses and improving HTML content extraction. The changes include wrapping the fetch logic in a try-catch block to handle potential errors, skipping pages with HTTP status codes >= 400, and implementing more flexible content extraction logic in extractArticleFromHTML. Additionally, tests have been added to verify the new functionality.
Summary of Findings
- Error Handling: The addition of a try-catch block around the
fetchcall inStripeDocsDocumentLoader.tsis a good practice for handling network errors and preventing the loader from crashing when encountering issues with specific URLs. - HTTP Status Code Handling: The implementation to skip pages with HTTP status codes >= 400 is crucial for ensuring that the loader only processes valid and accessible content, improving the reliability of the document loading process.
- Improved HTML Content Extraction: The enhanced logic in
extractArticleFromHTMLto extract content from various HTML structures (article tags, main-content ID, body tags) significantly improves the loader's ability to handle different Stripe documentation page layouts. - Testing: The addition of tests for successful document loading and skipping pages with HTTP 404 status ensures the reliability and correctness of the implemented changes.
Merge Readiness
The pull request is well-structured and addresses the issue of handling HTTP 404 responses effectively. The added tests provide confidence in the correctness of the changes. I recommend merging this pull request after addressing the review comments. I am unable to directly approve the pull request, and recommend that others review and approve this code before merging.
|
|
||
| // より柔軟な方法でコンテンツを抽出 | ||
| // 1. まず、main-contentを含む要素を探す | ||
| const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex /<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.
// const mainContentMatch = htmlString.match(/<[^>]*id=["']main-content["'][^>]*>([\s\S]*?)<\/[^>]*>/i);
// TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities| } else { | ||
| // main-contentが見つからない場合、bodyタグ内のコンテンツを探す | ||
| console.log('No main-content found, trying to find body content'); | ||
| const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex /<body[^>]*>([\s\S]*?)<\/body>/i is susceptible to ReDoS (Regular Expression Denial of Service) attacks. Consider using a more specific and less computationally intensive regex, or a HTML parser library.
// const bodyMatch = htmlString.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
// TODO: Refactor to use an HTML parser library to avoid ReDoS vulnerabilities| // HTTPステータスコードが400以上の場合はスキップ | ||
| if (response.status >= 400) { | ||
| console.log(`Skipping ${docsUrl} - HTTP status: ${response.status}`); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| console.error(`Error fetching ${docsUrl}: ${error}`); | ||
| // エラーが発生した場合もスキップして次のURLに進む | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the console.error is helpful for debugging, consider adding more context to the error message, such as the specific error type or a stack trace, to facilitate easier troubleshooting. Also, consider whether a more robust logging mechanism should be used instead of console.error.
console.error(`Error fetching ${docsUrl}: ${error.message}, stack: ${error.stack}`);| if (!htmlString) { | ||
| console.log('Input HTML is null or undefined'); | ||
| return []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider throwing an error instead of returning an empty array when htmlString is null or undefined. This can help catch potential issues earlier in the development process. Returning an empty array might mask the underlying problem.
if (!htmlString) {
throw new Error('Input HTML is null or undefined');
}|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request addresses an important issue of handling HTTP 404 responses and improving HTML content extraction in the StripeDocsDocumentLoader. The changes include adding error handling for HTTP status codes and try-catch blocks to prevent the loader from crashing when encountering issues. Additionally, the HTML extraction logic has been improved to handle various HTML structures. Overall, the changes seem well-structured and address the problem effectively.
Summary of Findings
- Error Handling and Logging: The addition of try-catch blocks and HTTP status code checks improves the robustness of the loader. The logging provides useful information for debugging.
- HTML Content Extraction: The improved HTML content extraction logic in
extractArticleFromHTMLfunction enhances the loader's ability to extract relevant content from different HTML structures. - Testing: The addition of tests for the StripeDocsDocumentLoader, including a test for skipping pages with HTTP 404 status, ensures the functionality works as expected.
Merge Readiness
The pull request is well-structured and addresses the identified issues effectively. The addition of error handling, improved HTML content extraction, and comprehensive testing make the code more robust and reliable. I recommend merging this pull request after addressing the identified medium severity issue. I am unable to directly approve the pull request, and users should have others review and approve this code before merging.
| const articles: string[] = []; | ||
|
|
||
| // 1. まず、articleタグを探す - より限定的な正規表現に変更 | ||
| // 1つのarticleタグを処理する関数を作成 | ||
| const processArticleTags = () => { | ||
| // articleタグのみに一致するように修正 | ||
| const articleRegex = /<article(?:\s+[^>]*)?>(.*?)<\/article>/gis; | ||
| let articleMatch; | ||
| let foundArticles = false; | ||
|
|
||
| while ((articleMatch = articleRegex.exec(htmlString)) !== null) { | ||
| if (articleMatch[1]) { | ||
| articles.push(articleMatch[1].trim()); | ||
| foundArticles = true; | ||
| } | ||
| } | ||
|
|
||
| return foundArticles; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…e HTML content extraction