knn_price

This is a personal project for efficiently checking the price of a product on any webpage. It's fairly simple to write a web extractor for a particular page, or even for pages within the same domain. However, there is no standard that websites conform to, so an extractor written for one domain is unlikely to work on another (different layouts, selectors, naming conventions and sometimes even technologies).

Basic Idea

I use Selenium to navigate to web pages. The benefit of Selenium is that it loads the page in a real browser, as it was meant to be seen, and also offers useful tools for traversal.

Once the page loads, JavaScript is injected to extract the visible data (i.e. the important information like Title and Price, leaving out reviews, comments or product recommendations). For ease of use, this extract is stored in and read from a SQLite database using the peewee ORM.
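As a rough illustration of the injection step, the sketch below runs a script in the loaded page through Selenium's `execute_script` and returns one record per visible text element. The script body, the function names (`extract_visible_elements`, `open_and_extract`) and the collected fields (position, size, font size) are assumptions made for this example, not the project's actual extractor.

```python
# Hypothetical script: collects elements that are rendered and carry their own text.
EXTRACT_JS = """
const out = [];
for (const el of document.querySelectorAll('body *')) {
  const rect = el.getBoundingClientRect();
  const style = window.getComputedStyle(el);
  // Skip elements that are not rendered on the page.
  if (rect.width === 0 || rect.height === 0) continue;
  if (style.display === 'none' || style.visibility === 'hidden') continue;
  // Keep only text that belongs directly to this element, not to its children.
  const text = Array.from(el.childNodes)
    .filter(n => n.nodeType === Node.TEXT_NODE)
    .map(n => n.textContent.trim())
    .join(' ')
    .trim();
  if (!text) continue;
  out.push({
    tag: el.tagName,
    text: text,
    x: rect.left,
    y: rect.top,
    width: rect.width,
    height: rect.height,
    font_size: parseFloat(style.fontSize)
  });
}
return out;
"""


def extract_visible_elements(driver):
    """Run the script in the current page; returns a list of dicts."""
    return driver.execute_script(EXTRACT_JS)


def open_and_extract(url):
    """Open `url` in Chrome (assumes a local Chrome install) and extract."""
    from selenium import webdriver  # imported lazily so the module loads without it

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return extract_visible_elements(driver)
    finally:
        driver.quit()
```

Each returned dict could then be written as one row in the SQLite database; filtering out reviews, comments and recommendations would need extra rules that are not shown here.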

The data is loaded into a pandas DataFrame, where its original values can be preprocessed before being converted to a NumPy array and fed into DBSCAN for clustering. I've implemented a labeller that uses partially labelled data. By labelling all of the datapoints for Title, we can determine the minimum epsilon required to cluster all Titles together, as well as the maximum epsilon (the largest value before a non-labelled datapoint gets included in the Title cluster). The search increment is passed in as a parameter.
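The epsilon search described above could be sketched as follows, using scikit-learn's `DBSCAN`. The function name `title_eps_range`, its parameters and the toy feature matrix are assumptions for illustration; the project's own labeller may differ in how it scans and which features it uses.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def title_eps_range(X, is_title, start=0.1, stop=5.0, increment=0.1, min_samples=2):
    """Scan eps upwards and return (min_eps, max_eps) for the Title cluster.

    min_eps: first eps at which every labelled Title point shares one cluster.
    max_eps: last eps at which that cluster still contains only Title points.
    Returns (None, None) if no scanned eps isolates the Titles.
    """
    is_title = np.asarray(is_title, dtype=bool)
    min_eps = max_eps = None
    for eps in np.arange(start, stop + increment / 2, increment):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        title_labels = set(labels[is_title])
        # All Titles must share a single cluster, and not be noise (-1).
        if len(title_labels) != 1 or -1 in title_labels:
            continue
        cluster = title_labels.pop()
        # Stop once a non-labelled point has joined the Title cluster.
        if np.any(labels[~is_title] == cluster):
            break
        if min_eps is None:
            min_eps = float(eps)
        max_eps = float(eps)
    return min_eps, max_eps


# Toy data: three Title points close together, two unlabelled points further away.
X = np.array([[0.0, 0.0], [0.25, 0.0], [0.5, 0.0], [2.55, 0.0], [2.8, 0.0]])
is_title = [True, True, True, False, False]
min_eps, max_eps = title_eps_range(X, is_title)
```

A smaller increment gives a tighter estimate of both bounds at the cost of more DBSCAN runs.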

Next steps

By taking the midpoint of those two epsilon values (or better yet, as a todo, using multiple labelled datasets), you can label the other clusters and set those labels in the database. Once all of the data has been labelled, we can train a model (I plan to use k-nearest neighbours with the same features used during clustering) to build a generalised model.
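As a minimal sketch of the planned training step, the example below fits scikit-learn's `KNeighborsClassifier` on a small, made-up labelled feature matrix. The two features, the label names and the helper `train_knn` are placeholders chosen for the example, since the real feature set is whatever was used for clustering.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def train_knn(X, y, n_neighbors=3):
    """Fit a k-nearest-neighbours classifier on labelled feature rows."""
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X, y)
    return model


# Made-up features per element: [vertical position, font size].
X = np.array([
    [0.0, 30.0], [0.1, 28.0], [0.2, 32.0],   # rows labelled as titles
    [5.0, 14.0], [5.1, 16.0], [5.2, 15.0],   # rows labelled as prices
])
y = np.array(["title", "title", "title", "price", "price", "price"])

model = train_knn(X, y)
# Predict the label of an unseen element from its features.
prediction = model.predict(np.array([[0.05, 29.0]]))[0]
```

Because k-nearest neighbours is distance based, features on very different scales would probably need normalising before training.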

Following that: include further domains to build an even better model, and have the model detect "sale price".

Great sources of info:

"Web Content Extraction Through Machine Learning", Ziyan Zhou & Muntasir Mashuq

"Sociopath: Automatic Local Events Extractor", Galina Alperovich
