
Excluding font files #29

Closed
Germminate opened this issue Jun 17, 2021 · 4 comments

Comments

Germminate commented Jun 17, 2021

Hello, I tested the repository using the script below:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

scrape({
    urls: [
        '/site/url'
    ],
    directory: 'path/to/dir',
    subdirectories: [
        {
            directory: 'binaries',
            extensions: ['.jpg', '.png', '.svg', '.jpeg', '.mp3', '.mp4', '.wav']
        },
        {
            directory: 'js',
            extensions: ['.js']
        },
        {
            directory: 'css',
            extensions: ['.css']
        },
    ],
    sources: [
        {
            selector: 'img',
            attr: 'src'
        },
        {
            selector: 'audio',
            attr: 'src'
        },
        {
            selector: 'video',
            attr: 'src'
        },
        {
            selector: 'link[rel="stylesheet"]',
            attr: 'href'
        },
        {
            selector: 'script',
            attr: 'src'
        }
    ],
    plugins: [ 
        new PuppeteerPlugin({
          scrollToBottom: { timeout: 10000, viewportN: 10 }, /* optional */
          blockNavigation: true, /* optional */
        })
      ]
}).then(function (result) {
    // Outputs HTML 
    // console.log(result);
    console.log("Content succesfully downloaded");
}).catch(function (err) {
    console.log(err);
});

It returned the font files as well. How do I save a webpage without saving all the font files?

Edit:
In fact, after further testing, defining subdirectories and sources does not restrict the scrape to only the stated extension types.

s0ph1e (Member) commented Jun 20, 2021

Hi @Germminate

It depends on how these fonts are loaded. I suggest one of the following:

  • review which HTML elements the fonts come from and update sources if needed,
  • or use the urlFilter option to filter them out (see the sketch below).
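
For example, a minimal sketch of the urlFilter approach (the font extension list is an assumption; adjust it to whatever your site actually serves):

const scrape = require('website-scraper');

scrape({
    urls: ['/site/url'],
    directory: 'path/to/dir',
    // Reject any resource whose URL ends with a common font extension,
    // optionally followed by a query string; everything else is kept.
    urlFilter: (url) => !/\.(woff2?|ttf|otf|eot)(\?.*)?$/i.test(url)
});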

Hope it helps

Germminate (Author) commented Jun 29, 2021

Hi @s0ph1e,
Thank you for your response.
I am unable to exclude the fonts using urlFilter, as they are downloaded from the hrefs inside the .css files.

I have another question: how can the port of a specific website-scraper instance be closed after it is done (say, if I have 100 parallel instances running)?

Right now, I am facing the following issue:

Extracting batch of 5000 urls ...
events.js:353
      throw er; // Unhandled 'error' event
      ^

Error: read ENOTCONN
    at tryReadStart (net.js:574:20)
    at Socket._read (net.js:585:5)
    at Socket.Readable.read (internal/streams/readable.js:481:10)
    at Socket.read (net.js:625:39)
    at new Socket (net.js:377:12)
    at Object.Socket (net.js:269:41)
    at createSocket (internal/child_process.js:314:14)
    at ChildProcess.spawn (internal/child_process.js:435:23)
    at spawn (child_process.js:577:9)
    at Object.spawnWithSignal [as spawn] (child_process.js:714:17)
    at BrowserRunner.start (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:77:30)
    at ChromeLauncher.launch (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:242:12)
    at async /home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/lib/index.js:21:19
    at async Scraper.runActions (/home/local/KLASS/germaine.tan/Desktop/gitlab/scraper/node_modules/website-scraper/lib/scraper.js:228:14)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  errno: -107,
  code: 'ENOTCONN',
  syscall: 'read'
}

My script simply passes a list of URLs to your API and calls the scrape function.

If I run them one by one, this error doesn't occur.

s0ph1e (Member) commented Jun 30, 2021

Hi @Germminate

  1. I believe urlFilter should work fine with urls from css files. If it doesn't, then it looks like a bug; please open an issue in https://github.com/website-scraper/node-website-scraper/issues
  2. As for closing the browser: it should be closed automatically after everything is done, via
    registerAction('afterFinish', () => this.browser && this.browser.close());
    but I didn't test how it works with multiple parallel instances. I assume the operating system opens only 1 chrome process and reuses it instead of opening 100 chrome processes. I suggest trying multiple urls in the same scraper call, see https://github.com/website-scraper/node-website-scraper#urls and the sketch below, if that's possible for your use case.
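
For reference, a minimal sketch of the single-call approach; the URLs and directory here are placeholders:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

// One scrape() call fetches the whole batch and shares one browser,
// instead of launching a separate instance (and Chrome process) per url.
scrape({
    urls: [
        'http://example.com/page-1',
        'http://example.com/page-2'
        // ...rest of the batch
    ],
    directory: 'path/to/dir',
    plugins: [new PuppeteerPlugin()]
}).then(function () {
    console.log("Batch successfully downloaded");
}).catch(function (err) {
    console.log(err);
});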

Germminate (Author) commented

Hi Sophie,

Thanks. It works with parallel instances; it was my URLs that were problematic.
