infer() doesn't work #7

yuyuyaya · 2013-12-08T06:24:01Z

from pydepta import Depta

>>> d = Depta()
>>> seed = d.extract(url='http://www.iens.nl/restaurant/10545/enschede-rhodos')[5]
>>> d.infer(seed=seed, url='http://www.iens.nl/restaurant/34397/apeldoorn-de-boschvijver')

this throws the error

infer() takes at least 2 arguments (1 given)

what does infer do exactly and how can I get it working?

when I do

>>> d.infer(seed, url='http://www.iens.nl/restaurant/34397/apeldoorn-de-boschvijver')

it just gives me an empty list

tpeng · 2013-12-08T07:33:15Z

thanks for reporting this.
I just fixed it, could you try again?

yuyuyaya · 2013-12-08T11:11:44Z

Thank you!

Now I get something, but the output seems to be different from the example

from flask import Flask, request, render_template
from pydepta import Depta

app = Flask(__name__)

@app.route('/')
def pydepta():
    url = request.args.get('url')
    print url
    if url:
        depta = Depta()
        regions = depta.extract(url='http://www.iens.nl/restaurant/10545/enschede-rhodos')
        a_region = depta.infer(regions[8], url='http://www.iens.nl/restaurant/34397/apeldoorn-de-boschvijver')
        regions = a_region
        tables = [[i, region.as_html_table().decode('utf-8')] for i, region in enumerate(regions)]
        return render_template('tables.html', tables=tables)
    else:
        return render_template('index.html')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5444, debug=True)

It produces this... is this correct? It doesn't look like the one in the example.

Also, it seems to place rows side by side. How can I make it one row per line?

Thanks again!

yuyuyaya · 2013-12-08T11:12:44Z

also, I don't understand what infer() is supposed to do. does it take the diff? does it figure out the data fields?

yuyuyaya · 2013-12-08T11:31:23Z

While extract() works well, infer() seems to behave more erratically. For instance, on some sites where extract() works, infer() either returns no tables at all or produces only a few rows.

from flask import Flask, request, render_template
from pydepta import Depta

app = Flask(__name__)

@app.route('/')
def pydepta():
    url = request.args.get('url')
    print url
    if url:
        depta = Depta()
        regions = depta.extract(url='http://www.amazon.ca/s/ref=lp_916520_nr_n_0?rh=n%3A916520%2Cn%3A%21927726%2Cn%3A933484&bbn=927726&ie=UTF8&qid=1386501729&rnid=927726')
        a_region = depta.infer(regions[16], url='http://www.amazon.ca/s/ref=lp_933484_pg_2?rh=n%3A916520%2Cn%3A%21927726%2Cn%3A933484&page=2&ie=UTF8&qid=1386501736')
        regions = a_region
        tables = [[i, region.as_html_table().decode('utf-8')] for i, region in enumerate(regions)]
        return render_template('tables.html', tables=tables)
    else:
        return render_template('index.html')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5444, debug=True)

This produces an output like.

tpeng · 2013-12-08T13:38:53Z

Hi,

It seems DEPTA treats every 2 items as a group (their similarity is >= the default threshold, so it can find a larger data record). That's why the output is different from the example and has 2 items in 1 row.
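To make the "2 items in 1 row" effect concrete, here is a toy sketch, not pydepta's code. It assumes only what is described above: when items are grouped in pairs, each pair becomes one data record, so it is rendered as one row. The names `rows_of`, `items` and `group_size` are made up for illustration.

```python
def rows_of(items, group_size):
    """Split a flat list of items into rows of `group_size` items each."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


items = ["item 1", "item 2", "item 3", "item 4"]

# Grouping every 2 items: each row holds two items side by side.
side_by_side = rows_of(items, 2)

# Grouping every single item: one item per row, as in the example output.
one_per_row = rows_of(items, 1)

for row in side_by_side:
    print(" | ".join(row))
```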

tpeng · 2013-12-08T13:41:18Z

infer() is supposed to find the data records on similar pages (similar to the page the seed was extracted from), even if the data record has only 1 item.
(DEPTA assumes the page has at least 2 data records, otherwise the similarity comparison won't work, so infer() is intended to work around this limit.)
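A minimal sketch of that workflow, using only the calls that appear in this thread (`extract(url=...)` and `infer(seed, url=...)`). The helper name `infer_from_seed` and its parameters are hypothetical, and it assumes `extract()` returns a list-like object, which the indexing in the snippets above suggests.

```python
def infer_from_seed(depta, seed_url, seed_index, target_url):
    """Pick one region from the seed page and look for it on a similar page.

    `depta` is expected to be a pydepta Depta instance (or anything with the
    same extract/infer methods); nothing else about the library is assumed.
    """
    regions = depta.extract(url=seed_url)
    if not 0 <= seed_index < len(regions):
        raise IndexError("seed page has only %d regions" % len(regions))
    # Pass the seed positionally, like the call that ran earlier in this thread.
    return depta.infer(regions[seed_index], url=target_url)


# Usage (needs pydepta installed and network access, so it is left commented out):
# from pydepta import Depta
# found = infer_from_seed(
#     Depta(),
#     'http://www.iens.nl/restaurant/10545/enschede-rhodos', 8,
#     'http://www.iens.nl/restaurant/34397/apeldoorn-de-boschvijver')
```

The index check is there because the region index is only meaningful for the seed page; the target page is never indexed directly.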

tpeng · 2013-12-08T13:41:43Z

It seems these 2 pages are not similar. That's why infer() doesn't work.

yuyuyaya · 2013-12-08T19:19:09Z

hi tpeng!

Thanks for the explanation.

is it possible to change the default threshold to make each row on a line?

so one should use infer() for a non-MDR (multiple data record) page and extract() for MDR page?

Thanks again!

tpeng · 2013-12-09T01:00:45Z

Yes, you can create a Depta instance with the threshold set to a different value.
And yes: use infer() for a non-MDR page and extract() for an MDR page.

yuyuyaya · 2013-12-09T02:42:27Z

How can I do this? Is there a list of arguments and methods? There is very little documentation.

tpeng · 2013-12-09T02:50:01Z

e.g.

from pydepta import Depta
d = Depta(threshold=0.9)

I agree there is very little documentation; I can probably add some later.
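To give a feel for what a similarity threshold does, here is a toy illustration. It is not pydepta's algorithm: it compares flat tag lists with `difflib.SequenceMatcher` from the standard library, and the helper names and sample data are made up. The point is only that a stricter threshold splits items that a looser one keeps together.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, a, b).ratio()


def group_adjacent(items, threshold):
    """Put each item in the previous group if it is similar enough to that
    group's last item, otherwise start a new group."""
    groups = []
    for item in items:
        if groups and similarity(groups[-1][-1], item) >= threshold:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


# Hypothetical tag sequences standing in for page fragments.
items = [
    ["div", "img", "a", "span"],
    ["div", "img", "a", "span"],
    ["div", "img", "a"],
    ["ul", "li", "li"],
]

loose = group_adjacent(items, 0.8)   # third item still joins the first two
strict = group_adjacent(items, 0.9)  # third item is split off on its own

print(len(loose), len(strict))
```

With this sample data, the third item scores 6/7 against the second, so it passes a threshold of 0.8 but not 0.9.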

yuyuyaya · 2013-12-09T07:44:18Z

I am still having trouble with infer()

Consider the following code: it takes an Amazon product detail page and returns a blank result. I made sure I am using the right table index (I'm trying to get the ISBN of the book), which is the 12th table.

http://pydepta-heroku.herokuapp.com/?url=http%3A%2F%2Fwww.amazon.ca%2FFlood-2013-Summer-Southern-Alberta%2Fdp%2F1771640308%2Fref%3Dsr_1_17%3Fs%3Dbooks%26ie%3DUTF8%26qid%3D1386574042%26sr%3D1-17

But for the other URL, the ISBN is actually in the 11th table.

http://pydepta-heroku.herokuapp.com/?url=http%3A%2F%2Fwww.amazon.ca%2FEarth-Spirit-Place-Featuring-Photographs%2Fdp%2F1894673670%2Fref%3Dsr_1_18%3Fs%3Dbooks%26ie%3DUTF8%26qid%3D1386574042%26sr%3D1-18

Is there a way to resolve this issue? Both pages look the same.

depta = Depta(threshold=0.9)
regions = depta.extract(url='http://www.amazon.ca/Flood-2013-Summer-Southern-Alberta/dp/1771640308/ref=sr_1_17?s=books&ie=UTF8&qid=1386574042&sr=1-17')
a_region = depta.infer(regions[12], url='http://www.amazon.ca/Earth-Spirit-Place-Featuring-Photographs/dp/1894673670/ref=sr_1_18?s=books&ie=UTF8&qid=1386574042&sr=1-18')
regions = a_region

tpeng · 2013-12-27T09:15:43Z

Hi @yuyuyaya ,

I'm working on a new infer(). It will use Scrapely for extracting structured data. You can find the changes at https://github.com/tpeng/pydepta/tree/infer-with-scrapely.

It's still WIP and it also needs some patches to Scrapely, but hopefully I can finish it soon.
Stay tuned!

Thanks

tpeng closed this as completed Dec 27, 2013
