dynamic-WebCrawler

When I tried to fetch a webpage built with React, I found that Selenium could help me do this.
The mechanism is that Selenium can automatically open a real browser and "drive" it.
Hence, we can wait until the browser finishes rendering the components, and then read the content of the components we want. A minimal sketch of this idea follows.
However, in the end I switched to fetching another dynamic page that uses a simpler framework than React. XD
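
A minimal sketch of the waiting idea, not taken from this repo; the URL and CSS selector below are placeholder assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # needs geckodriver; or webdriver.Chrome() with chromedriver
driver.get("https://example.com/some-react-app")  # placeholder URL

# Wait (up to 10 seconds) until the component we want has actually been rendered.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".news-item"))  # placeholder selector
)
print(element.text)
driver.quit()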

udn.py

It's a web crawler that uses Selenium to automatically scroll the page until enough news items have been loaded.
"my_url" is the news list page of a category on udn.com.
Before using it, we need to download a driver binary for Firefox or Chrome.
Selenium opens the browser (Firefox or Chrome), loads the page, then alternates between fetching links and scrolling down to load more.
After fetching all the news links we need, use BeautifulSoup to extract the content you need (Selenium can also do this).
Then get the specific block you need and output it as JSON. A rough sketch of this flow is shown below.
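
A rough sketch of the scroll-and-collect flow, not the actual udn.py; the category URL, link filter, scroll count, and selector are placeholder assumptions:

import json
import time

from bs4 import BeautifulSoup
from selenium import webdriver

my_url = "https://udn.com/news/cate/SOME_CATEGORY"  # placeholder category list page

driver = webdriver.Firefox()  # or webdriver.Chrome(); needs the matching driver binary
driver.get(my_url)

links = set()
for _ in range(10):  # scroll a fixed number of times, or until enough links are collected
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch of news items
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for a in soup.select("a"):  # udn.py likely uses a stricter selector; this is a placeholder
        href = a.get("href", "")
        if "/news/story/" in href:  # keep only links that look like news articles
            links.add(href)

driver.quit()

with open("links.json", "w", encoding="utf-8") as f:
    json.dump(sorted(links), f, ensure_ascii=False, indent=2)

After this step, each collected link can be fetched and parsed with BeautifulSoup to pull out the specific block of content and write it to JSON.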

mobile01.py

It's a simple static web crawler: it fetches data from mobile01, parses it with BeautifulSoup, and outputs JSON.
I encountered strange problems while crawling: the connection drops sometimes.
If it disconnects, read the articlesNum printed on the console and update the "articleNum" parameter in this file; it can then continue fetching from the following articles. The sketch below shows this resume pattern.
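
A minimal sketch of the resume-by-counter pattern, not the actual mobile01.py; the forum URL, CSS selector, and output fields are placeholder assumptions:

import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

articleNum = 0  # after a disconnect, set this to the last number printed below and rerun

list_url = "https://www.mobile01.com/topiclist.php?f=SOME_FORUM_ID"  # placeholder list page
headers = {"User-Agent": "Mozilla/5.0"}
soup = BeautifulSoup(requests.get(list_url, headers=headers).text, "html.parser")

# Collect article links from the list page (the selector here is a placeholder).
links = [a["href"] for a in soup.select("a.topic_gen") if a.get("href")]

articles = []
for i, link in enumerate(links):
    if i < articleNum:  # skip articles already fetched in a previous run
        continue
    page = requests.get(urljoin(list_url, link), headers=headers)
    page_soup = BeautifulSoup(page.text, "html.parser")
    title = page_soup.title.get_text(strip=True) if page_soup.title else ""
    articles.append({"url": link, "title": title})
    print("articlesNum:", i + 1)  # note this value to resume after a disconnect

with open("mobile01.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)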
