Add unix timestamp parsing to scrapers parseDate #2817
Fix documentation formatting
@JackDawson94 a small diff to add the HTML parsing post-process action mentioned on Discord:

diff --git a/pkg/scraper/mapped.go b/pkg/scraper/mapped.go
index 2886adcf..a1ea0cd2 100644
--- a/pkg/scraper/mapped.go
+++ b/pkg/scraper/mapped.go
@@ -14,6 +14,9 @@ import (
"github.com/stashapp/stash/pkg/logger"
"github.com/stashapp/stash/pkg/models"
"github.com/stashapp/stash/pkg/sliceutil/stringslice"
+
+ "github.com/antchfx/htmlquery"
+
"gopkg.in/yaml.v2"
)
@@ -523,6 +526,19 @@ func (p *postProcessLbToKg) Apply(ctx context.Context, value string, q mappedQue
return value
}
+type postProcessHtmlToText bool
+
+func (p *postProcessHtmlToText) Apply(ctx context.Context, value string, q mappedQuery) string {
+ doc, err := htmlquery.Parse(strings.NewReader(value))
+ if err != nil {
+ logger.Warn("Could not parse html value")
+ return value
+ }
+ value = htmlquery.InnerText(doc)
+
+ return value
+}
+
type mappedPostProcessAction struct {
ParseDate string `yaml:"parseDate"`
SubtractDays bool `yaml:"subtractDays"`
@@ -531,6 +547,7 @@ type mappedPostProcessAction struct {
Map map[string]string `yaml:"map"`
FeetToCm bool `yaml:"feetToCm"`
LbToKg bool `yaml:"lbToKg"`
+ HtmlToText bool `yaml:"htmlToText"`
}
func (a mappedPostProcessAction) ToPostProcessAction() (postProcessAction, error) {
@@ -582,6 +599,14 @@ func (a mappedPostProcessAction) ToPostProcessAction() (postProcessAction, error
action := postProcessLbToKg(a.LbToKg)
ret = &action
}
+ if a.HtmlToText {
+ if found != "" {
+ return nil, fmt.Errorf("post-process actions must have a single field, found %s and %s", found, "htmlToText")
+ }
+ found = "htmlToText"
+ action := postProcessHtmlToText(a.HtmlToText)
+ ret = &action
+ }
if a.SubtractDays {
if found != "" {
return nil, fmt.Errorf("post-process actions must have a single field, found %s and %s", found, "subtractDays")
diff --git a/ui/v2.5/src/docs/en/ScraperDevelopment.md b/ui/v2.5/src/docs/en/ScraperDevelopment.md
index 4fd0cced..bc9f2fd3 100644
--- a/ui/v2.5/src/docs/en/ScraperDevelopment.md
+++ b/ui/v2.5/src/docs/en/ScraperDevelopment.md
@@ -342,6 +342,17 @@ scene:
Post-processing operations are contained in the `postProcess` key. Post-processing operations are performed in the order they are specified. The following post-processing operations are available:
* `feetToCm`: converts a string containing feet and inches numbers into centimeters. Looks for up to two separate integers and interprets the first as the number of feet, and the second as the number of inches. The numbers can be separated by any non-numeric character including the `.` character. It does not handle decimal numbers. For example `6.3` and `6ft3.3` would both be interpreted as 6 feet, 3 inches before converting into centimeters.
+* `htmlToText`: parses a string as html and converts it into plain text.
+Example:
+```yaml
+Details:
+ selector: //meta[@name="description"]/@content
+ postProcess:
+ - replace:
+ - regex: \\
+ with: \
+ - htmlToText: true
+```
* `lbToKg`: converts a string containing lbs to kg.
* `map`: contains a map of input values to output values. Where a value matches one of the input values, it is replaced with the matching output value. If no value is matched, then value is unmodified.
and the relevant Stash scraper (which also uses `parseDate: unix`) to test it with (sample URL with text that needs parsing):

name: BellesaCo
sceneByURL:
- action: scrapeJson
url:
- bellesa.co/videos/
scraper: sceneScraper
queryURL: "https://www.bellesa.co/api/rest/v1/videos/{url}"
queryURLReplace:
url:
- regex: '.+/videos/(\d+)/.+'
with: "${1}"
jsonScrapers:
sceneScraper:
scene:
Title: title
Image: image
Details:
selector: description
postProcess:
- replace:
- regex: \\
with:
- htmlToText: true
Performers:
Name: performers.#.name
Studio:
Name: content_provider.#.name
Tags:
Name:
selector: tags
split: ","
Date:
selector: posted_on
postProcess:
- parseDate: unix
# Last Updated August 12, 2022

@bnkai nice!

@bnkai please submit your changes in a separate PR.
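For anyone reading the scraper above, here is a rough sketch of what the `queryURLReplace` step followed by the `{url}` substitution amounts to. The helper name `buildQueryURL` and the sample scene URL are made up for illustration, and this is a simplification rather than Stash's actual code path:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildQueryURL is a hypothetical helper (not part of Stash). It applies the
// regex from queryURLReplace to the scene URL, then substitutes the result
// for the {url} placeholder in queryURL.
func buildQueryURL(sceneURL string) string {
	// Same pattern as in the scraper: capture the numeric video ID.
	re := regexp.MustCompile(`.+/videos/(\d+)/.+`)
	// The pattern matches the whole URL, so the whole URL becomes the ID.
	id := re.ReplaceAllString(sceneURL, "${1}")

	const queryURL = "https://www.bellesa.co/api/rest/v1/videos/{url}"
	return strings.Replace(queryURL, "{url}", id, 1)
}

func main() {
	// Assumed sample URL; only its shape matters here.
	fmt.Println(buildQueryURL("https://www.bellesa.co/videos/1234/example-title"))
}
```

If the scene URL does not match the pattern, `ReplaceAllString` returns it unchanged, so a malformed URL would end up embedded in the query URL as-is.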
Adding the ability for scrapers to parse unix timestamps in parseDate.
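As a rough illustration of the idea (not necessarily how this PR implements it), converting a unix timestamp string into a date could look like the sketch below. The helper name `parseUnixDate`, the choice of UTC, and the `2006-01-02` output layout are assumptions made for the example:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseUnixDate is a hypothetical helper: it reads value as a count of
// seconds since the unix epoch and formats it as YYYY-MM-DD.
// Assumption: the timestamp is interpreted in UTC.
func parseUnixDate(value string) (string, error) {
	secs, err := strconv.ParseInt(strings.TrimSpace(value), 10, 64)
	if err != nil {
		return "", fmt.Errorf("parsing %q as unix timestamp: %w", value, err)
	}
	return time.Unix(secs, 0).UTC().Format("2006-01-02"), nil
}

func main() {
	date, err := parseUnixDate("1660262400")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(date) // prints 2022-08-12
}
```

Formatting in local time instead of UTC could shift the date by a day for timestamps close to midnight, so the time zone choice is worth stating explicitly.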
Tested using this parser: