use the api instead of scraping the wiki #3

Merged
merged 3 commits into from
May 14, 2018

zeke
Member

@zeke zeke commented May 12, 2018

Hey @jdlrobson 👋

Now that I know about the Wikipedia Summary API, I can toss out the unreliable HTML-scraping approach formerly used by this module. 🎉

Hitting a snag though: When doing a query with a special character in it like muñeca, I get a 400 error:

{ HTTPError: Response code 400 (Bad Request)
    at stream.catch.then.data (/Users/z/git/words/wikipedia-tldr/node_modules/got/index.js:341:13)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
  name: 'HTTPError',
  host: 'es.wikipedia.org',
  hostname: 'es.wikipedia.org',
  method: 'GET',
  path: '/api/rest_v1/page/summary/Muñeca',
  protocol: 'https:',
  url: 'https://es.wikipedia.org/api/rest_v1/page/summary/Muñeca',
  statusCode: 400,
  statusMessage: 'Bad Request',
  headers:
   { date: 'Sat, 12 May 2018 06:24:22 GMT',
     'content-type': 'application/problem+json',
     'content-length': '172',
     connection: 'close',
     'content-location': 'https://es.wikipedia.org/api/rest_v1/page/summary/Mu%25F1eca',
     'access-control-allow-origin': '*',
     'access-control-allow-methods': 'GET,HEAD',
     'access-control-allow-headers': 'accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding',
     'access-control-expose-headers': 'etag',
     'x-content-type-options': 'nosniff',
     'x-frame-options': 'SAMEORIGIN',
     'referrer-policy': 'origin-when-cross-origin',
     'x-xss-protection': '1; mode=block',
     'content-security-policy': 'default-src \'none\'; frame-ancestors \'none\'',
     'x-content-security-policy': 'default-src \'none\'; frame-ancestors \'none\'',
     'x-webkit-csp': 'default-src \'none\'; frame-ancestors \'none\'',
     'cache-control': 'private, max-age=0, s-maxage=0, must-revalidate',
     'x-request-id': '1bee36c6-55ad-11e8-b33e-98eb3e8befd2',
     server: 'restbase1011',
     'x-varnish': '544595303, 476608204, 903378207, 972304234',
     via: '1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)',
     age: '0',
     'x-cache': 'cp1067 pass, cp2004 pass, cp4029 pass, cp4027 pass',
     'x-cache-status': 'pass',
     'strict-transport-security': 'max-age=106384710; includeSubDomains; preload',
     'set-cookie':
      [ 'WMF-Last-Access=12-May-2018;Path=/;HttpOnly;secure;Expires=Wed, 13 Jun 2018 00:00:00 GMT',
        'WMF-Last-Access-Global=12-May-2018;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 13 Jun 2018 00:00:00 GMT',
        'GeoIP=US:CA:Moorpark:34.31:-118.88:v4; Path=/; secure; Domain=.wikipedia.org' ],
     'x-analytics': 'https=1;nocookies=1',
     'x-client-ip': '99.175.68.19' } }
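
The 'content-location' header in the error above reports the path segment as Mu%25F1eca. The sketch below decodes that value to show what the server appears to have seen. It only inspects the string from the log; the reading that the raw ñ arrived as a single Latin-1 byte (F1) rather than as UTF-8 (C3 B1) is an assumption, not something the response states.

```javascript
// Path segment copied from the 'content-location' header in the error above.
const reported = 'Mu%25F1eca';

// Undo one layer of percent-encoding: "%25" is an escaped "%".
const once = decodeURIComponent(reported);

// "%F1" is the Latin-1 byte for "ñ", which is not a valid UTF-8 sequence,
// so a second decode throws a URIError.
let validUtf8 = true;
try {
  decodeURIComponent(once);
} catch (err) {
  validUtf8 = false;
}

// For comparison, the UTF-8 percent-encoding of the same title.
const utf8Encoded = encodeURIComponent('Muñeca');

console.log(once);
console.log(validUtf8);
console.log(utf8Encoded);
```

If that reading is right, the request path needs to be percent-encoded as UTF-8 before it is sent.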

Example URL: https://es.wikipedia.org/api/rest_v1/page/summary/Muñeca

However, if I open this URL in the browser, I see results. Any idea what might be wrong? At first I thought it was the redirect from lowercase muñeca to capitalized Muñeca, but got follows redirects, so that's not it.

Any ideas?

Resolves #2

BREAKING CHANGE: The response object structure is now different, as it returns the full object from the API response.
@jdlrobson

Does hitting https://en.wikipedia.org/api/rest_v1/page/summary/Mu%C3%B1eca work any better? I'm in transit right now but will take a closer look when I am not on my mobile phone :)

@zeke
Member Author

zeke commented May 12, 2018

Curiously, that URL redirects to https://en.wikipedia.org/api/rest_v1/page/summary/Sinfon%C3%ADa_Soledad

@jdlrobson

Got(cha!)
const response = await got('https://es.wikipedia.org/api/rest_v1/page/summary/' + encodeURIComponent('Muñeca'));
This works for me. Basically, run encodeURIComponent on every title before appending it to the URL. I guess that would be query in your example?
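
A minimal sketch of that suggestion, with the URL building pulled into a helper so it can be checked without a network call. The name summaryUrl and its lang/title parameters are hypothetical and not taken from the module; only the endpoint path and encodeURIComponent come from this thread.

```javascript
// Hypothetical helper: builds a Summary API URL for a language code and title.
function summaryUrl(lang, title) {
  // encodeURIComponent emits UTF-8 percent-escapes ("ñ" becomes "%C3%B1")
  // and also escapes "/" and "?", which would otherwise alter the path.
  return 'https://' + lang + '.wikipedia.org/api/rest_v1/page/summary/' +
    encodeURIComponent(title);
}

console.log(summaryUrl('es', 'Muñeca'));
console.log(summaryUrl('en', 'AC/DC'));
```

The resulting string can then be passed to got in place of the hand-concatenated URL.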

@zeke
Copy link
Member Author

zeke commented May 14, 2018

@jdlrobson that fixed it! Thank you.

Surprised that node's built-in URL.format doesn't encode the path.
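
As a side note on the URL.format remark, the sketch below compares encodeURIComponent with the WHATWG URL class, which is not what this PR uses and is shown only as an assumption-labeled alternative. The URL class is a global in current Node releases and browsers; on older Node versions it may need to be imported from the 'url' module.

```javascript
const base = 'https://es.wikipedia.org/api/rest_v1/page/summary/';

// The WHATWG URL parser percent-encodes non-ASCII path characters as UTF-8.
const viaUrlClass = new URL(base + 'Muñeca').href;
const viaEncode = base + encodeURIComponent('Muñeca');
const sameForAccent = viaUrlClass === viaEncode;

// It leaves "/" alone, though, so a title containing a slash is split into
// two path segments unless encodeURIComponent is applied first.
const slashViaUrlClass = new URL(base + 'AC/DC').href;
const slashViaEncode = base + encodeURIComponent('AC/DC');

console.log(sameForAccent);
console.log(slashViaUrlClass);
console.log(slashViaEncode);
```

For arbitrary titles, encoding each title with encodeURIComponent remains the safer choice, since it also covers reserved characters such as "/" and "?".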

@zeke zeke merged commit 8dbf4e4 into master May 14, 2018
@zeke zeke deleted the use-the-api branch May 14, 2018 16:05
Development

Successfully merging this pull request may close these issues.

Use Wikipedia's summary API