linguacrawl is a tool implemented in Python3 that crawls a number of top-level domains to download text documents in the languages specified by the user. The objective of the tool is to obtain as much data of interest as possible in the minimum amount of time. To achieve this, a scout strategy is adopted: crawling is stopped on web hosts that are not productive enough. The user can specify the languages of interest and the number of documents to be downloaded before a web host is evaluated. After downloading this number of documents, the amount of data in the targeted languages is checked and, if the criteria set by the user are not fulfilled, the web host is discarded from further crawling.
Another interesting feature of linguacrawl as regards performance is that it is implemented following a provider-consumer architecture. When crawling a website, it is important to keep a waiting time between consecutive requests to avoid overloading the server that hosts the website. This means that the crawler may be inactive for some time between requests. Since linguacrawl targets top-level domains, many websites can be crawled at the same time, and the provider-consumer architecture makes it possible to spend the waiting time between requests to one website downloading documents from other sites. In this way, as the crawler discovers new web hosts to crawl, it becomes more and more productive (until reaching the limits set by the user).
To install linguacrawl, first clone the code from the repository:
git clone https://github.com/transducens/linguacrawl.git
Then, enter the downloaded directory and install the dependencies by running:
cd linguacrawl
pip3 install -r requirements.txt
Finally, use pip3 again to install the tool:
pip3 install .
To run the tool, just execute the linguacrawl command followed by the path to the configuration file:
linguacrawl /home/user/config.yaml
Note that there is a sample configuration file in the config directory; it can be adapted to any specific crawling project. The following section describes all the options that can be included in the configuration file.
To use linguacrawl, we need to prepare a configuration file. This file must be in YAML format and contains the different options related to the crawling process, the targeted contents, etc. All the available options are described below.
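As an overview, the following is a minimal configuration sketch assembled only from the options documented in this section; the values repeat the examples given below and are placeholders to be adapted to each crawling project (depending on the version of the tool, further options may be required):
seed_urls: ['https://www.ffff.es/', 'https://www.dddd.cat/']
langs_of_interest: ['en','es','ca']
accepted_tlds: ['es','cat']
accepted_content: '(text/html)'
output_dir: '/home/user/crawl_output'
user_agent: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
max_jobs: 12
verbose: False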
These are general options related to the basic aspects of the configuration of the tool and of the crawling task to be carried out.
Option seed_urls is a list of seed URLs from which to start crawling. The larger this list, the faster new data is crawled. During crawling, linguacrawl discovers new websites to visit by looking for new URLs in the documents available from the seed URLs. If only one seed URL is set, this process of discovering new sites to visit will be slower (or may even be impossible, if the seed website does not contain links to other sites in the accepted TLDs). Therefore, it is advisable to add as many different URLs to the list as possible. An example:
seed_urls: ['https://www.ffff.es/', 'https://www.dddd.cat/']
If this list is too large, the alternative option seed_urls_from_file can be used instead. It defines the path to a text file that contains the list of seed URLs (one URL per line):
seed_urls_from_file: '/home/user/my_seed_urls.txt'
Option langs_of_interest is mandatory and specifies the codes of the languages of interest for the crawl. If we are interested in crawling every document in English, Spanish and Catalan, we will set this option to the following list:
langs_of_interest: ['en','es','ca']
Option accepted_tlds defines the list of top-level domains (TLDs) accepted during crawling. This means that any website in a different TLD will not be visited. For example, if we want to constrain our crawling to the .cat and .es TLDs, we can set this option to:
accepted_tlds: ['es','cat']
Option accepted_content specifies the type of content accepted. By default, this option is set to (text/html):
accepted_content: '(text/html)'
Option output_dir is mandatory and defines the output directory where the files produced during crawling will be stored, for example:
output_dir: '/home/user/crawl_output'
Three files may be created for every web host visited:
- one or more files with extension .warc.gz containing all the documents downloaded, in WARC format,
- a file with extension .state that contains the internal state of the crawler when crawling is stopped, which allows resuming the crawl at some point in the future, and
- a file with extension .bitextor.gz that consists of a TSV list of fields: URL, language code, HTML, and text extracted with the library html2text; some of these fields can be used by the tool Bitextor to try to identify parallel data. In the near future, this tool will provide a script to transform these fields into the format expected by Bitextor.
If option verbose is set to false, only errors will be reported through stderr; if it is set to true, much more information will be provided about the process (be careful, this output can become very large if many jobs are run in parallel). For example:
verbose: False
Option max_jobs determines how many crawling processes can be run in parallel. This value should be chosen according to the computational resources of the machine where the tool is run. For example, on a machine with 12 threads, this option can be set to 12:
max_jobs: 12
If option resume_crawling is set to true, the tool will look for the file with extension .state for every new web host to be visited. If this file exists, the previous state of the crawler for that host will be loaded and only pages that have not been visited before will be crawled. As regards the WARC files produced, a new file will be created every time crawling is resumed; new files will be named with extension .1.warc.gz, .2.warc.gz, etc. The same applies to files with extension .bitextor.gz. To activate resuming, set the option as follows:
resume_crawling: True
The following options configure the behaviour of the crawling robot(s) used.
Option user_agent is mandatory and specifies the user agent for the crawler launched. The user agent is an important piece of information provided to web servers at the time of requesting access to a page, as it allows servers to limit access to contents through robots.txt. For example, we could use Google's bot user agent string by setting this option to the following value:
user_agent: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
Option crawl_delay specifies the delay (in seconds) between consecutive requests to a web domain. This delay only affects requests to the same web domain and is aimed at avoiding overloading the web servers that host the website. The default value for this option is 3 seconds, but it can be modified by setting the option in the configuration file, for example:
crawl_delay: 5
Option max_time_per_site specifies the maximum crawling time (in seconds) devoted to a single web domain. When this time is reached, crawling is stopped for that website. For example, if we want to stop crawling a site after 24 hours (86400 seconds), we can set the option:
max_time_per_site: 86400
Option connection_timeout sets the connection timeout (in seconds) for a given web server. For example, to set this option to 10 seconds:
connection_timeout: 10
With option prefix_filter, one can automatically discard links that start with a given prefix. This option accepts a list of prefixes, each of which can be defined as a regular expression. For example, to avoid adding links to e-mail addresses, we could set the following option:
prefix_filter: ['mailto:']
Option max_folder_tree_depth sets the maximum folder depth for a URL to be taken into account. Defining this option helps to avoid falling into loops that keep concatenating a string to a URL (in this case, a string that corresponds to a folder). For example, to set this option to 20, use:
max_folder_tree_depth: 20
Option max_attempts defines the maximum number of attempts made to visit a web page. If it is not possible to download a page after this number of attempts, the page is discarded. For example, to set this option to three attempts, use:
max_attempts: 3
Option url_blacklist specifies a list of web domains that will not be taken into account. The following is an example of web domains that we may want to discard in our crawling:
url_blacklist: ['wordpress','blogspot','facebook','google','wikipedia','youtube','perehodi','twitter','instagram']
Note that, by defining a web domain, for example google, we are discarding web hosts such as www.google.com, www.google.cat, mail.google.com, etc.
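Putting these options together, the crawler-behaviour part of a configuration file could look as follows; the values simply repeat the examples given above and are only illustrative:
user_agent: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
crawl_delay: 5
max_time_per_site: 86400
connection_timeout: 10
prefix_filter: ['mailto:']
max_folder_tree_depth: 20
max_attempts: 3
url_blacklist: ['wordpress','blogspot','facebook','google','wikipedia','youtube','perehodi','twitter','instagram']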
One of the most relevant features of linguacrawl is that it is designed to obtain language-specific text data from the Internet. In order to make crawling as productive as possible, it implements a scout strategy that stops crawling a web host if, after downloading a given number of documents, not enough useful data has been obtained. The following options allow the user to configure this scouting method.
Option scout_steps determines the number of pages to be downloaded from a web host before the scouting criterion is evaluated. After this evaluation, the web host is either discarded (crawling will be stopped on this site) or accepted for further crawling; in the latter case, the scout criterion will not be evaluated again. Example:
scout_steps: 200
Option min_langs_in_site is used by the scout criterion. If we are interested in identifying websites with multilingual content and we have defined a list of languages for option langs_of_interest, this option specifies the minimum number of those languages that need to appear in a web host for it to be accepted by the scout criterion. For example, we could set English, Spanish and Catalan as our languages of interest, and specify that at least two of them have to appear in a web host for it to be considered useful:
min_langs_in_site: 2
Option mandatory_lang is related to the previous option min_langs_in_site, and specifies a language that is required to appear in a web host for it to be considered useful by the scout criterion. When running a multilingual crawl, we may be mostly interested in one of the languages. Following the previous example, if we are crawling English, Spanish and Catalan data, we may be mostly interested in Catalan (for example, if we plan to build Catalan-English and Catalan-Spanish parallel corpora). In that case, we would define the option as follows:
mandatory_lang: 'ca'
Option min_percent_mandatory_lang is related to the previous option mandatory_lang and defines the minimum percentage of documents that must be in the mandatory language at the moment of evaluating the scout criterion. For example, we can specify that at least 10% of the documents downloaded from a web host need to be in the mandatory language for it to be considered a useful web host:
min_percent_mandatory_lang: 10
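Putting the scouting options together with langs_of_interest, a configuration aimed at multilingual websites that must contain Catalan could look as follows; again, the values repeat the examples given above and are only indicative:
langs_of_interest: ['en','es','ca']
scout_steps: 200
min_langs_in_site: 2
mandatory_lang: 'ca'
min_percent_mandatory_lang: 10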
This software is released under the GPL3 license.
Developed by Universitat d'Alacant as part of its contribution to the GoURMET project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299.