webcrawler

Introduction

Design a web service that crawls two shopping sites (Watsons, eBay) for a given keyword and presents the results

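Basic Architecture

Uses an HTTP handler to build a basic Web API
Uses the third-party crawling framework Colly to cover the basic crawling needs
Stores crawl results as JSON, which offers broad data compatibility and makes future extension easier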
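
The handler-plus-JSON design above could look roughly like the following. This is a minimal sketch that uses only the standard library: the `Product` field names, the JSON keys, and the `fakeCrawl` and `demoSearch` helpers are assumptions made for illustration and are not taken from the project, where the crawling itself is done by Colly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// Product holds the four scraped fields. The JSON key names are illustrative.
type Product struct {
	Name     string `json:"name"`
	Price    string `json:"price"`
	ImageURL string `json:"image_url"`
	URL      string `json:"url"`
}

// fakeCrawl is a hypothetical stand-in for the real Colly-based crawler.
func fakeCrawl(keyword string) []Product {
	return []Product{{
		Name:     keyword + " sample",
		Price:    "1.00",
		ImageURL: "https://example.com/img.png",
		URL:      "https://example.com/item",
	}}
}

// searchHandler reads the "keyword" query parameter and writes results as JSON.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	keyword := r.URL.Query().Get("keyword")
	if keyword == "" {
		http.Error(w, "missing keyword", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(fakeCrawl(keyword)); err != nil {
		// The status line is already sent at this point, so only log the error.
		log.Printf("encode response: %v", err)
	}
}

// demoSearch calls the handler in-process and decodes a successful response.
func demoSearch(target string) (int, []Product, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	searchHandler(rec, req)
	if rec.Code != http.StatusOK {
		return rec.Code, nil, nil
	}
	var items []Product
	err := json.Unmarshal(rec.Body.Bytes(), &items)
	return rec.Code, items, err
}

func main() {
	code, items, err := demoSearch("/search?keyword=soap")
	fmt.Println(code, items, err)
}
```

In a real server the handler would be registered with `http.HandleFunc("/search", searchHandler)` and served with `http.ListenAndServe`; the in-process call here only keeps the sketch runnable without opening a port.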

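Other Details

The scraped product information includes: name, price, image link, and product link
Uses Colly's Parallelism and Async options to implement workers
Uses Colly options such as Limit and UserAgent to mimic real user behavior and avoid being blocked by the site
Uses interfaces to swap out the underlying implementation, making the code extensible and easier to test
Uses context to implement graceful shutdown
When search results span multiple pages, the number of pages to crawl is calculated automatically from the configured number of products
When the program is interrupted, workers finish their in-progress tasks before exiting
Uses a mutex to avoid race conditions on the HTTP writer
Combines the HTTP writer with Colly to render results in real time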
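
Several of the points above (page-count calculation, a mutex around the shared writer, real-time flushing, and letting workers finish on shutdown) can be sketched with the standard library alone. Note the substitution: plain goroutines and a channel stand in for Colly's Async and Parallelism options, and every name here (`pageCount`, `safeWriter`, `crawlPages`, `demo`) is hypothetical rather than taken from the project.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync"
)

// pageCount returns how many pages are needed to collect `want` items
// when each page lists `perPage` items (ceiling division).
func pageCount(want, perPage int) int {
	if want <= 0 || perPage <= 0 {
		return 0
	}
	return (want + perPage - 1) / perPage
}

// safeWriter serializes writes from concurrent workers to one io.Writer.
type safeWriter struct {
	mu sync.Mutex
	w  io.Writer
}

// WriteLine writes one line and, if the writer supports it (as an
// http.ResponseWriter usually does), flushes so the client sees it at once.
func (s *safeWriter) WriteLine(line string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fmt.Fprintln(s.w, line)
	if f, ok := s.w.(http.Flusher); ok {
		f.Flush()
	}
}

// crawlPages hands page numbers to `workers` goroutines. Once ctx is
// cancelled no new page is handed out, but a worker that already took a
// page finishes it before the function returns.
func crawlPages(ctx context.Context, pages, workers int, out *safeWriter) int {
	jobs := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				out.WriteLine(fmt.Sprintf("page %d crawled", p)) // stand-in for scraping
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}
	for p := 1; p <= pages && ctx.Err() == nil; p++ {
		select {
		case <-ctx.Done(): // stop feeding; the loop condition ends the loop
		case jobs <- p:
		}
	}
	close(jobs)
	wg.Wait() // graceful part: wait for in-flight pages
	return done
}

// demo runs the pipeline for 45 items at 20 per page with two workers and
// reports pages completed and lines written.
func demo(cancelFirst bool) (done, lines int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	var sb strings.Builder
	done = crawlPages(ctx, pageCount(45, 20), 2, &safeWriter{w: &sb})
	return done, strings.Count(sb.String(), "\n")
}

func main() {
	fmt.Println(demo(false))
	fmt.Println(demo(true))
}
```

In a server, the context would typically come from `signal.NotifyContext` so that an interrupt cancels it, and the `safeWriter` would wrap the request's `http.ResponseWriter`.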

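Possible Improvements

Use a database to build a caching mechanism so that when a user searches for the same keyword again within a set period, the crawl does not need to run again; the DB connection information should not be hard-coded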
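
One possible shape for this improvement is sketched below. It is not part of the project: an in-memory map stands in for the database, and the environment variable name `CRAWLER_DB_DSN`, the `ttlCache` type, and the `demoCache` helper are all assumptions chosen for illustration. The two ideas it shows are keyword-keyed entries that expire after a fixed period, and connection information read from the environment instead of the source code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
	"time"
)

// entry is one cached crawl result together with its expiry time.
type entry struct {
	value   string
	expires time.Time
}

// ttlCache is an in-memory stand-in for a database-backed cache.
type ttlCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	now  func() time.Time // injectable clock, so expiry can be tested
	data map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, data: make(map[string]entry)}
}

// Set stores the crawl result for a keyword until the TTL elapses.
func (c *ttlCache) Set(keyword, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[keyword] = entry{value: value, expires: c.now().Add(c.ttl)}
}

// Get returns the cached result if it exists and has not expired.
func (c *ttlCache) Get(keyword string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[keyword]
	if !ok {
		return "", false
	}
	if c.now().After(e.expires) {
		delete(c.data, keyword) // expired: force a fresh crawl
		return "", false
	}
	return e.value, true
}

// dsnFromEnv reads the DB connection string from the environment rather
// than from source code. The variable name is hypothetical.
func dsnFromEnv() (string, error) {
	dsn := os.Getenv("CRAWLER_DB_DSN")
	if dsn == "" {
		return "", errors.New("CRAWLER_DB_DSN is not set")
	}
	return dsn, nil
}

// demoCache checks a 10-minute entry after 5 minutes and after 11 minutes.
func demoCache() (hitBefore, hitAfter bool) {
	clock := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	c := newTTLCache(10*time.Minute, func() time.Time { return clock })
	c.Set("soap", `[{"name":"soap sample"}]`)
	clock = clock.Add(5 * time.Minute)
	_, hitBefore = c.Get("soap")
	clock = clock.Add(6 * time.Minute)
	_, hitAfter = c.Get("soap")
	return hitBefore, hitAfter
}

func main() {
	fmt.Println(demoCache())
	if _, err := dsnFromEnv(); err != nil {
		fmt.Println("no DSN configured:", err)
	}
}
```

The search handler would consult `Get` before crawling and call `Set` after a successful crawl; swapping the map for a real database would keep the same two methods behind an interface.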

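Usage

Open a terminal in the program's directory and run

go run main.go
Open any browser on the local machine and enter the following in the address bar

localhost:<port>/search?keyword=<your keyword>
Or open a terminal and run

curl 'localhost:<port>/search?keyword=<your keyword>'
Unit tests

go test -v ./...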

Crawl Results

Client side
Server side
