The original project description can be found here.
Source code is located in the webscraper/src folder. The out folder contains the data collected by the web scraper.
The entities to be collected include:
- Dynasties
- Historical figures (kings, queens, commanders, philosophers, ...)
- Monuments, travel destinations (temples, churches, ...) and historical places
- Cultural festivals & celebrations
- Historical events
Each entity has an identifier and a set of properties, and related entities need to be linked together.
For example, the event Den Hung Festival should be linked to the destination Den Hung and the historical figure Hung King.
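One way to picture the identifier, properties and links described above is the minimal Python sketch below. The `Entity` class, its field names, the `link` helper and the id format are all assumptions made for illustration; they are not taken from the project's source code.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One collected entity. Field names are illustrative assumptions."""
    id: str                                          # unique identifier (hypothetical format)
    kind: str                                        # e.g. "dynasty", "figure", "monument", "event"
    name: str
    properties: dict = field(default_factory=dict)   # free-form scraped attributes
    links: set = field(default_factory=set)          # ids of related entities


def link(a: Entity, b: Entity) -> None:
    """Record a two-way link between two entities by storing each other's id."""
    a.links.add(b.id)
    b.links.add(a.id)


# The example from the text: one event linked to a destination and a figure.
festival = Entity("event:den-hung-festival", "event", "Den Hung Festival")
temple = Entity("monument:den-hung", "monument", "Den Hung")
king = Entity("figure:hung-king", "figure", "Hung King")

link(festival, temple)
link(festival, king)

print(sorted(festival.links))
```

Storing ids rather than object references keeps the links easy to serialize to a file, which is one possible design choice among several.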
- To collect data, a web scraper should be implemented. This scraper should be able to: (1) Collect data; (2) Save data to file; (3) Clean up data.
- The user should be able to search and retrieve data, so a proper interface (either GUI or command-line prompt) should be built.
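The clean, save and search requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON output format, the `entities.json` file name, the record layout and the function names `clean_text`, `save_records` and `search` are hypothetical, and the raw strings are placeholder input rather than real scraped data. The actual page fetching step is left out.

```python
import json
import re
import tempfile
from pathlib import Path


def clean_text(raw: str) -> str:
    """Remove citation markers such as [1] and collapse runs of whitespace."""
    without_refs = re.sub(r"\[\d+\]", "", raw)
    return re.sub(r"\s+", " ", without_refs).strip()


def save_records(records: list, path: Path) -> None:
    """Write records to a UTF-8 JSON file, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")


def search(records: list, query: str) -> list:
    """Return the records whose name contains the query, ignoring case."""
    q = query.casefold()
    return [r for r in records if q in r["name"].casefold()]


# Placeholder input standing in for text pulled from a page (not real scraped data).
raw_names = ["  Den Hung\n Festival[1] ", "Den  Hung", "Hung King[2]"]
records = [{"name": clean_text(n)} for n in raw_names]

# Use a temporary folder here so the demo does not touch the real out folder.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "out" / "entities.json"   # hypothetical file name
    save_records(records, out_file)
    loaded = json.loads(out_file.read_text(encoding="utf-8"))

matches = search(loaded, "den hung")
print([r["name"] for r in matches])
```

A command-line prompt or GUI could call a function like `search` on the loaded records; which interface to build is left open by the requirements.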
This document contains the list of websites to get the data from.
Wikipedia exists in multiple languages; however, we mostly need data from the Vietnamese and English editions. Wiki pages contain many links to other pages, both inside and outside Wikipedia. We only need to find the largest, most general pages that lead to other sub-topics. The following links are organized hierarchically, from the most general to the more detailed ones.