PhantomEnvironment is undefined #88

Closed
Dugnist opened this issue Apr 1, 2018 · 10 comments

Dugnist commented Apr 1, 2018

import {
   PhantomEnvironment,
   Parser
} from 'goose-parser';

const env = new PhantomEnvironment({
   url: 'http://www.gooseplanet.ru/'
});

TypeError: _gooseParser.PhantomEnvironment is not a constructor

I inspected the imported entities, and both of them are undefined.

If I write:
import Parser from 'goose-parser'

it returns [Function: Parser].
But where can I find PhantomEnvironment?

Member

maZahaca commented Apr 1, 2018

Hello @Dugnist

Which version of goose-parser are you using?

We've started a process of splitting goose into the parser, environments, and other blocks that can be used separately.
Also goose-parser since version v0.5 was:

Here is a usage example for the latest version of goose-parser, 0.5.0-alpha.3:
package.json:

{
  "dependencies": {
    "goose-parser": "^0.5.0-alpha.3",
    "goose-phantom-environment": "^1.0.12"
  }
}

Usage:

const Parser = require('goose-parser');
const PhantomEnvironment = require('goose-phantom-environment');

const env = new PhantomEnvironment({
  url: 'http://www.gooseplanet.ru/',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse(
      require('./rules/rules'),
    );
  } catch (e) {
    console.log(e.message, e.stack);
  }
})();

You can also consider using version 0.2.* of goose, which matches the original documentation. We're working hard on 0.5 to bring all the amazing features soon, so you can use that as well.
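
To illustrate why the named imports in the original snippet came back undefined, here is a minimal sketch. It assumes that goose-parser 0.5 exports the Parser constructor itself as `module.exports`, which is consistent with the `[Function: Parser]` output reported above but is not confirmed from the package source. `gooseParserExport` is a hypothetical stand-in, not the real package.

```javascript
// Hypothetical stand-in for a CommonJS package whose module.exports is a
// single constructor, as goose-parser 0.5 is assumed to be here.
function Parser(options) {
  this.environment = options.environment;
}
const gooseParserExport = Parser; // what require('goose-parser') is assumed to return

// A named import reads a property off the export. A plain function has no
// `Parser` or `PhantomEnvironment` property, so both lookups are undefined.
const { PhantomEnvironment, Parser: NamedParser } = gooseParserExport;
console.log(typeof PhantomEnvironment); // undefined
console.log(typeof NamedParser);        // undefined

// Using the export itself as the constructor works.
const parser = new gooseParserExport({ environment: 'placeholder environment' });
console.log(parser instanceof Parser);  // true
```

Under that assumption, the environment has to come from its own package (for example goose-phantom-environment), as in the usage example above.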

@maZahaca maZahaca self-assigned this Apr 1, 2018
Member

maZahaca commented Apr 1, 2018

Let me know if you have any other issues

Author

Dugnist commented Apr 1, 2018

@maZahaca
I switched to goose-jsdom-environment because the PhantomEnvironment install crashed.

const Parser = require('goose-parser');
const JsDOMEnvironment = require('goose-jsdom-environment');

const env = new JsDOMEnvironment({
  url: 'http://www.google.com',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.container',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

and it throws this error:

ReferenceError: arguments is not defined

Member

maZahaca commented Apr 2, 2018

> I connected goose-jsdom-environment because PhantomEnvironment install was crushed

Could you please provide:

  • the operating system you use
  • the error that occurred
  • the versions you tried (your package.json)
  • your code, if there is any

The current issue with JSDom comes from the fact that this environment does not support dynamic JavaScript, so wait, click, or any other interactions with the page won't work.
For dynamic JS you need to use either goose-phantom-environment or goose-chrome-environment.
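
The rule of thumb above can be sketched as a small pre-flight check. This is only an illustration: `needsDynamicEnvironment` and `pickEnvironmentPackage` are hypothetical helpers, not part of goose-parser, and the list of dynamic action types contains only the two named in this comment (`wait` and `click`), so it is assumed to be incomplete.

```javascript
// Action types that interact with a live page. Only 'wait' and 'click' are
// named in the thread; the real list is assumed to be longer.
const DYNAMIC_ACTION_TYPES = new Set(['wait', 'click']);

// Hypothetical helper: true if the top-level actions need a real browser.
function needsDynamicEnvironment(parseConfig) {
  const actions = Array.isArray(parseConfig.actions) ? parseConfig.actions : [];
  return actions.some((action) => DYNAMIC_ACTION_TYPES.has(action.type));
}

// Hypothetical helper: suggests which environment package to install.
function pickEnvironmentPackage(parseConfig) {
  return needsDynamicEnvironment(parseConfig)
    ? 'goose-chrome-environment'
    : 'goose-jsdom-environment';
}

const config = {
  actions: [{ type: 'wait', timeout: 10 * 1000, scope: '.container', parentScope: 'body' }],
};
console.log(pickEnvironmentPackage(config)); // goose-chrome-environment
```

The check only looks at top-level actions; nested actions inside rules would need a recursive walk.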

Author

Dugnist commented Apr 2, 2018

@maZahaca OK, I changed jsdom to goose-chrome-environment.

const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

const env = new ChromeEnvironment({
  url: 'https://www.google.com',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '#ctr-p',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

It also throws errors:

tion id: 959): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 960): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 961): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 962): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 963): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 964): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 965): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 966): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 967): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 968): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 969): ReferenceError: arguments is not defined
Timeout for wait with arguments: body #ctr-p

Author

Dugnist commented Apr 2, 2018

@maZahaca I also changed the URL to 'https://habrahabr.ru' and got this error:

(node:7544) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: msg.match is not a function
(node:7544) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
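
The warnings above mean that a promise rejected somewhere without a handler attached, so the `try`/`catch` around `parser.parse` never saw it. As a debugging aid only, here is a minimal sketch that uses Node's standard `unhandledRejection` process event to print the full stack trace of each such rejection. `formatRejection` is a hypothetical helper name, and this does not fix the underlying error.

```javascript
// Hypothetical helper: turn any rejection reason into a readable string.
function formatRejection(reason) {
  if (reason instanceof Error) {
    // Prefer the stack trace; fall back to "Name: message" if it is missing.
    return reason.stack || `${reason.name}: ${reason.message}`;
  }
  return `Non-Error rejection: ${String(reason)}`;
}

// Log every rejection that no promise chain handled, instead of relying on
// the one-line warning Node prints by default.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:\n' + formatRejection(reason));
});
```

Registering the handler before creating the parser should show which file and line raise `arguments is not defined` or `msg.match is not a function`.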

Member

maZahaca commented Apr 2, 2018

@Dugnist please provide your package.json and the OS you're running on, and I will try to reproduce these issues.

These are weird bugs, because we currently use these parsers in production.

Member

maZahaca commented Apr 2, 2018

Also, let's stick to one website while testing.
Please tell me what example data you want to scrape from it.

Author

Dugnist commented Apr 2, 2018

@maZahaca I'm using Ubuntu Linux 16.04 LTS.

{
  "dependencies": {
    "goose-chrome-environment": "^1.0.2",
    "goose-parser": "^0.5.0-alpha.3",
    "phantomjs-prebuilt": "^2.1.16"
  },
  "devDependencies": {}
}

I want to get the full HTML of a page after its JavaScript has executed (for example, when the target site uses a framework like React.js) and save the result to an HTML file along with the required assets.

Member

maZahaca commented Apr 2, 2018

@Dugnist This parsing tool (goose-parser) only lets you save JSON results, not HTML and assets.
However, we're planning to add the ability to save assets and the whole HTML in the future.

Here is an example of using goose-parser with goose-chrome-environment to fetch JSON results:

const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

const env = new ChromeEnvironment({
  url: 'https://www.google.com/search?newwindow=1&ei=mzDCWoPkOI-RmwWaoLzYCg&q=goose-parser&oq=goose-parser&gs_l=psy-ab.3..0i30k1.1186908.1189012.0.1189621.12.12.0.0.0.0.154.877.9j2.11.0....0...1c.1.64.psy-ab..1.11.876...0j0i131k1j0i131i67k1j0i67k1j0i10k1j0i19k1j0i30i19k1j0i10i30i19k1j0i13i30k1j0i8i30k1.0.lU1cumFem2s&gws_rd=cr&dcr=0&fg=1',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.srg>.g',
          parentScope: 'body'
        }
      ],
      rules: {
        scope: '.srg>.g',
        collection: [[
          {
            name: 'url',
            scope: 'h3.r>a',
            attr: 'href',
          },
          {
            name: 'text',
            scope: 'h3.r>a',
          }
        ]]
      }
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

And results will be:

[
  {
    url: 'https://www.npmjs.com/package/goose-parser',
    text: 'goose-parser - npm'
  },
  {
    url: 'https://github.com/advancedlogic/GoOse/blob/master/parser.go',
    text: 'GoOse/parser.go at master · advancedlogic/GoOse · GitHub'
  },
  {
    url: 'https://habrahabr.ru/post/271425/',
    text: 'Как парсить интернет по-гусиному / Хабрахабр'
  },
  {
    url: 'https://pypi.python.org/pypi/goose-extractor/',
    text: 'goose-extractor 1.0.25 : Python Package Index'
  },
  {
    url: 'https://toster.ru/q/337511',
    text: 'Как добавлять комментарии в Instagram без api? — Toster.ru'
  },
  {
    url: 'https://www.youtube.com/watch?v=BEbAhwyQeOM',
    text: 'Continued Work on Goose\'s Parser - YouTube'
  },
  {
    url: 'https://godoc.org/github.com/advancedlogic/GoOse',
    text: 'goose - GoDoc'
  },
  {
    url: 'http://blog.reddikh.com/goose-parser/',
    text: 'Goose parser |'
  },
  {
    url: 'https://www.kth.se/social/upload/538599b1f27654141f4cc333/Master',
    text: 'Development of a library to generate and parse IEC 61850-90-5 ... - KTH'
  },
  {
    url: 'http://nullege.com/codes/search/goose.parsers.Parser',
    text: 'goose.parsers.Parser - Nullege Python Samples'
  }
]
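
Since the parser returns plain JSON-compatible data, saving it is a matter of ordinary Node.js file I/O. Here is a minimal sketch using only the standard `fs`, `os`, and `path` modules. `saveResults` is a hypothetical helper, the output file name is an arbitrary choice, and `sampleResults` is copied from the first two entries above so that the snippet runs without goose-parser installed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write parse results to disk as pretty-printed JSON.
function saveResults(results, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(results, null, 2), 'utf8');
  return filePath;
}

// Stand-in for the array returned by parser.parse(...) in the example above.
const sampleResults = [
  { url: 'https://www.npmjs.com/package/goose-parser', text: 'goose-parser - npm' },
  {
    url: 'https://github.com/advancedlogic/GoOse/blob/master/parser.go',
    text: 'GoOse/parser.go at master · advancedlogic/GoOse · GitHub',
  },
];

// The temp directory is used only to keep the sketch side-effect free.
const outputPath = saveResults(sampleResults, path.join(os.tmpdir(), 'goose-results.json'));
console.log(`Saved ${sampleResults.length} results to ${outputPath}`);
```

In a real script, `sampleResults` would be replaced by the `results` value returned from `parser.parse`.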

@Dugnist Dugnist closed this as completed Apr 2, 2018