Add unix timestamp parsing to scrapers parseDate #2817

Merged: 6 commits, Sep 30, 2022


JackDawson94
Contributor

Adds the ability for scrapers to parse Unix timestamps in parseDate.

Tested using this scraper:

name: "Test Scraper"
sceneByURL:
  - action: scrapeXPath
    url:
      - unixtimestamp.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title:
        fixed: Test Title
      Details:
        fixed: Test Details
      Date:
        selector: //div[@class="value epoch"]/text()
        postProcess:
          - parseDate: unix
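
For readers unfamiliar with the conversion, here is a minimal Go sketch of what `parseDate: unix` is meant to do. It is not the code merged in this PR: `parseUnixDate` is a hypothetical helper, and the choices to treat the value as whole seconds, to format in UTC as YYYY-MM-DD, and to return the input unchanged on failure are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseUnixDate is a hypothetical helper (not the PR's implementation).
// It interprets value as whole seconds since the Unix epoch and returns
// the corresponding date as YYYY-MM-DD in UTC. If value is not a valid
// integer, it is returned unchanged (an assumed fallback behaviour).
func parseUnixDate(value string) string {
	secs, err := strconv.ParseInt(strings.TrimSpace(value), 10, 64)
	if err != nil {
		return value
	}
	return time.Unix(secs, 0).UTC().Format("2006-01-02")
}

func main() {
	fmt.Println(parseUnixDate("1660262400"))      // prints 2022-08-12
	fmt.Println(parseUnixDate("not-a-timestamp")) // prints the input unchanged
}
```

Formatting in UTC keeps the result independent of the machine's local time zone; a real implementation might reasonably choose local time instead.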

@bnkai
Collaborator

bnkai commented Aug 12, 2022

@JackDawson94 here is a small diff that adds the HTML-parsing post-process action mentioned on Discord

diff --git a/pkg/scraper/mapped.go b/pkg/scraper/mapped.go
index 2886adcf..a1ea0cd2 100644
--- a/pkg/scraper/mapped.go
+++ b/pkg/scraper/mapped.go
@@ -14,6 +14,9 @@ import (
 	"github.com/stashapp/stash/pkg/logger"
 	"github.com/stashapp/stash/pkg/models"
 	"github.com/stashapp/stash/pkg/sliceutil/stringslice"
+
+	"github.com/antchfx/htmlquery"
+
 	"gopkg.in/yaml.v2"
 )
 
@@ -523,6 +526,19 @@ func (p *postProcessLbToKg) Apply(ctx context.Context, value string, q mappedQue
 	return value
 }
 
+type postProcessHtmlToText bool
+
+func (p *postProcessHtmlToText) Apply(ctx context.Context, value string, q mappedQuery) string {
+	doc, err := htmlquery.Parse(strings.NewReader(value))
+	if err != nil {
+		logger.Warn("Could not parse html value")
+		return value
+	}
+	value = htmlquery.InnerText(doc)
+
+	return value
+}
+
 type mappedPostProcessAction struct {
 	ParseDate    string                   `yaml:"parseDate"`
 	SubtractDays bool                     `yaml:"subtractDays"`
@@ -531,6 +547,7 @@ type mappedPostProcessAction struct {
 	Map          map[string]string        `yaml:"map"`
 	FeetToCm     bool                     `yaml:"feetToCm"`
 	LbToKg       bool                     `yaml:"lbToKg"`
+	HtmlToText   bool                     `yaml:"htmlToText"`
 }
 
 func (a mappedPostProcessAction) ToPostProcessAction() (postProcessAction, error) {
@@ -582,6 +599,14 @@ func (a mappedPostProcessAction) ToPostProcessAction() (postProcessAction, error
 		action := postProcessLbToKg(a.LbToKg)
 		ret = &action
 	}
+	if a.HtmlToText {
+		if found != "" {
+			return nil, fmt.Errorf("post-process actions must have a single field, found %s and %s", found, "htmlToText")
+		}
+		found = "htmlToText"
+		action := postProcessHtmlToText(a.HtmlToText)
+		ret = &action
+	}
 	if a.SubtractDays {
 		if found != "" {
 			return nil, fmt.Errorf("post-process actions must have a single field, found %s and %s", found, "subtractDays")
diff --git a/ui/v2.5/src/docs/en/ScraperDevelopment.md b/ui/v2.5/src/docs/en/ScraperDevelopment.md
index 4fd0cced..bc9f2fd3 100644
--- a/ui/v2.5/src/docs/en/ScraperDevelopment.md
+++ b/ui/v2.5/src/docs/en/ScraperDevelopment.md
@@ -342,6 +342,17 @@ scene:
 
 Post-processing operations are contained in the `postProcess` key. Post-processing operations are performed in the order they are specified. The following post-processing operations are available:
 * `feetToCm`: converts a string containing feet and inches numbers into centimeters. Looks for up to two separate integers and interprets the first as the number of feet, and the second as the number of inches. The numbers can be separated by any non-numeric character including the `.` character. It does not handle decimal numbers. For example `6.3` and `6ft3.3` would both be interpreted as 6 feet, 3 inches before converting into centimeters.
+* `htmlToText`: parses a string as HTML and converts it into plain text.
+Example:
+```yaml
+Details:
+  selector: //meta[@name="description"]/@content
+  postProcess:
+    - replace:
+        - regex: \\
+          with: \
+    - htmlToText: true
+```
 * `lbToKg`: converts a string containing lbs to kg.
 * `map`: contains a map of input values to output values. Where a value matches one of the input values, it is replaced with the matching output value. If no value is matched, then value is unmodified.
 

And here is the relevant stash scraper to test it with (it also uses parseDate: unix). Sample URL with text that needs parsing: https://www.bellesa.co/videos/1205/feeling-your-rhythm-sacrosanct-by-kayden-kross

name: BellesaCo
sceneByURL:
  - action: scrapeJson
    url:
      - bellesa.co/videos/
    scraper: sceneScraper
    queryURL: "https://www.bellesa.co/api/rest/v1/videos/{url}"
    queryURLReplace:
      url:
        - regex: '.+/videos/(\d+)/.+'
          with: "${1}"
jsonScrapers:
  sceneScraper:
    scene:
      Title: title
      Image: image
      Details:
        selector: description
        postProcess:
          - replace:
              - regex: \\
                with:
          - htmlToText: true
      Performers:
        Name: performers.#.name
      Studio:
        Name: content_provider.#.name
      Tags:
        Name:
          selector: tags
          split: ","
      Date:
        selector: posted_on
        postProcess:
          - parseDate: unix
# Last Updated August 12, 2022
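
The `queryURLReplace` step in the scraper above can be sketched in Go as follows. This is only an illustration of the regex substitution: `buildQueryURL` is a hypothetical helper, the slug in the example URL is a placeholder, and the assumption that the `{url}` placeholder is filled with the regex result via a plain string replacement is mine, not taken from stash's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Template and pattern copied from the scraper config above.
const queryURLTemplate = "https://www.bellesa.co/api/rest/v1/videos/{url}"

var videoIDPattern = regexp.MustCompile(`.+/videos/(\d+)/.+`)

// buildQueryURL is a hypothetical helper. It reduces the scene URL to the
// numeric video ID captured by the regex, then substitutes that ID for the
// {url} placeholder. If the pattern does not match, ReplaceAllString leaves
// the input unchanged, so the whole scene URL is substituted instead.
func buildQueryURL(sceneURL string) string {
	id := videoIDPattern.ReplaceAllString(sceneURL, "${1}")
	return strings.Replace(queryURLTemplate, "{url}", id, 1)
}

func main() {
	// "some-title" is a placeholder slug used for illustration.
	fmt.Println(buildQueryURL("https://www.bellesa.co/videos/1205/some-title"))
	// prints https://www.bellesa.co/api/rest/v1/videos/1205
}
```

Because the pattern begins and ends with `.+`, it consumes the entire scene URL, which is why replacing the match with `${1}` leaves only the ID.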

@JackDawson94
Contributor Author

@bnkai nice!
I'm not sure what happens next. Did you add the code already, or should I make the change? 😊

@WithoutPants
Collaborator

WithoutPants commented Sep 30, 2022

@bnkai please submit your changes in a separate PR.


3 participants