In [28]:
:sccache 1

sccache: true


In [29]:
:dep reqwest = { version = "0.10", features = ["json", "blocking"] }

In [30]:
:dep quick-xml = { version = "*", features = [ "serialize" ] }


In [31]:
:dep rayon = { version = "*" }


In [32]:
:dep select = "0.4.3"

In [69]:
:dep structopt = "0.3"

# Intro to Rust
## rolisz@

# About me

* I worked at Google between 2014 and 2018
* Currently TL of AI/ML team at Archive360
* I blog at rolisz.ro
* Started playing with Rust last year in September, in my free time



# What is Rust?

* A language empowering everyone to build reliable and efficient software. https://www.rust-lang.org/
* Announced by Mozilla in 2010
* Stable release since 2015 (1.0)


# How?


## Performance
* Fast and memory efficient
* No garbage collector


## Reliability
* Very rich type system
* Ownership tracking
* Guarantees memory and thread safety


## Productivity

* Great documentation
* Great error messages
* Great tooling 
    * Cargo - package manager, build tool
    * IDE support
    * auto-formatter
    * Clippy - linter



# Ownership and borrowing

* A new and central concept in Rust
* Makes a garbage collector unnecessary
* Without requiring manual memory allocation and freeing

### Ownership

* Each value in Rust has a variable that’s called its owner.
* There can only be one owner at a time.
* When the owner goes out of scope, the value will be dropped.
* Ownership can be transferred by moves

In [14]:
let x = "Hello, world!".to_string();
do_something(x);

fn do_something(x: String) {
    println!("Do something: {}", x);
    // x is deallocated here
}

Do something: Hello, world!


In [13]:
let x = "Hello, world!".to_string();
do_something(x);
println!("{}", x);

fn do_something(x: String) {
    println!("Do something: {}", x);
}

Error: borrow of moved value: `x`

### Borrowing

* When we pass by reference, we lend a value to another function without giving up ownership
* At any given time, you can have either one mutable reference or any number of immutable references.
* References must always be valid.


In [12]:
let x = "Hello, world!".to_string();
do_something(&x);
println!("{}", x);

fn do_something(x: &String) {
    println!("Do something: {}", x);
}

Do something: Hello, world!
Hello, world!
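The rule about mutable and immutable references can be sketched in the same style. This is a minimal illustration; `shout` is a hypothetical helper that does not appear elsewhere in the notebook.

```rust
// Hypothetical helper: borrows the string mutably and changes it in place.
fn shout(text: &mut String) {
    text.push('!');
}

fn main() {
    let mut x = "Hello, world".to_string();

    // Any number of immutable borrows may coexist.
    let first = &x;
    let second = &x;
    println!("{} {}", first, second);

    // A mutable borrow is allowed here because `first` and `second`
    // are not used again after this point.
    shout(&mut x);
    println!("{}", x);
}
```

Using `first` or `second` after the call to `shout` would be rejected by the compiler, because an immutable and a mutable borrow would then overlap.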


# Let’s build a web crawler in Rust!


* Given a URL (http://rolisz.ro), load all the pages that are under that URL
  * For example http://rolisz.ro/my-awesome-post
  * But not http://oradeatechhub.ro/events
* Save a copy of them locally


* Interactive, live coding from a prepared notebook
* Feel free to ask questions

In [59]:
use std::io::Read;
use reqwest::Url;
use reqwest::blocking::Client;

fn main() {
    let client = Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    println!("HTML: {}", &body[0..40]);
}
main()

Status for https://rolisz.ro/: 200 OK
HTML: <!DOCTYPE html>
<html lang="en">
<head


()


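Instantiate an HTTP client.

Send the request and `unwrap` the result.

Read the response body into a string.

Variables are immutable by default, so `res` and `body` are declared with `mut`.

Reading the response body returns a `Result` too, so it is also `unwrap`ped.

Print the first 40 bytes of the body (string slices are indexed by byte, not by character).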
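Slicing with `&body[0..40]` counts bytes and panics if byte 40 falls inside a multi-byte UTF-8 character. A minimal sketch of a character-based preview, assuming a hypothetical helper named `preview` that is not part of the notebook:

```rust
/// Returns at most `n` characters (not bytes) from the start of `text`.
fn preview(text: &str, n: usize) -> String {
    text.chars().take(n).collect()
}

fn main() {
    let body = "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<title>É</title>";
    println!("HTML: {}", preview(body, 40));
}
```

For an ASCII-only prefix such as the one printed above, both approaches give the same result; the difference only shows up with non-ASCII text.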

## What is `unwrap`?

* Rust has a strong type system to handle errors. Most operations that can fail return the following:

In [21]:
enum Result<T, E> {
    Ok(T),
    Err(E),
}

let success: Result<i32, String> = Result::Ok(1);
let error: Result<i32, String> = Result::Err("Fail".to_string());

`unwrap` turns a `Result<T, E>` into a `T` and panics if the value is an `Err`.

Using `unwrap` is usually bad practice; you should handle the error properly.
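One way to handle the error instead of unwrapping is to return the `Result` to the caller with the `?` operator and decide what to do with a `match`. The sketch below uses integer parsing from the standard library; `parse_and_double` is a hypothetical helper chosen only for illustration.

```rust
use std::num::ParseIntError;

// Hypothetical helper: `?` returns early with the error instead of panicking.
fn parse_and_double(input: &str) -> Result<i32, ParseIntError> {
    let number = input.trim().parse::<i32>()?;
    Ok(number * 2)
}

fn main() {
    for input in ["21", "not a number"] {
        match parse_and_double(input) {
            Ok(value) => println!("{} doubled is {}", input, value),
            Err(e) => println!("could not parse {:?}: {}", input, e),
        }
    }
}
```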

## Parsing the HTML 

In [27]:
let found_urls = Document::from(body.as_str())
    .find(Name("a"))
    .filter_map(|n| n.attr("href"))
    .map(str::to_string)
    .collect::<HashSet<String>>();
println!("URLs: {:#?}", found_urls) 

URLs: {
    "/2020/05/08/on-text-editors/",
    "#",
    "/2020/06/08/happy-decennial/",
    "/2020/06/14/operating-system-journey/",
    "/2020/05/20/about-time/",
    "https://rolisz.ro",
    "https://feedly.com/i/subscription/feed/https://rolisz.ro/rss/",
    "/2020/05/11/quarantine-boardgames-plague-inc/",
    "https://ghost.org",
    "https://rolisz.ro/about-me/",
    "/2020/05/05/why-i-blog/",
    "#subscribe",
    "/2020/05/19/bridging-networks-with-a-synology-nas/",
    "/2020/06/07/new-desk-setup/",
    "/2020/05/16/quarantine-boardgames-itchy-feet/",
    "/2020/05/18/an-unexpected-error-in-rust/",
    "https://twitter.com/rolisz",
    "https://rolisz.ro/projects/",
    "/2020/06/08/boardgames-party-codenames/",
    "/2020/05/26/duckduckgo/",
    "/2020/05/24/bullet-journaling/",
    "/2020/05/14/travelers/",
    "/2020/06/15/how-much-does-it-cost-to-run-this-blog/",
    "/2020/05/15/context-variables-in-python/",
    "/2020/05/12/productivity-tips-pomodoros/",
    "/2020/05/0

()

Use a library to parse the HTML.

Find `a` elements that have an `href` attribute and keep only that value.

`n.attr("href")` returns an `Option` with the value.

Put everything into a `HashSet`, to avoid duplicates.

`collect` generates a container of the given type that holds the resulting elements. 
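The same `filter_map`, `map`, `collect` chain can be tried without any HTML parsing. In this sketch a slice of `Option<&str>` stands in for the results of `n.attr("href")`; `unique_links` is a hypothetical name.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the parsed tags: each one either has a link or not.
fn unique_links(hrefs: &[Option<&str>]) -> HashSet<String> {
    hrefs
        .iter()
        .filter_map(|href| *href) // drop the `None`s, keep the inner `&str`
        .map(str::to_string) // &str -> String
        .collect::<HashSet<String>>() // duplicates collapse in the set
}

fn main() {
    let hrefs = [Some("/about-me/"), None, Some("#"), Some("/about-me/")];
    let links = unique_links(&hrefs);
    println!("{} unique links: {:?}", links.len(), links);
}
```

Changing the turbofish to `collect::<Vec<String>>()` would keep the duplicates and the original order, which shows that the target type decides what `collect` builds.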

In [28]:
use std::io::Read;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .map(str::to_string)
        .collect::<HashSet<String>>();
    println!("URLs: {:#?}", found_urls)
}

main()

Status for https://rolisz.ro/: 200 OK
URLs: {
    "/2020/05/08/on-text-editors/",
    "/2020/06/07/new-desk-setup/",
    "https://www.facebook.com/rolisz",
    "/2020/05/20/about-time/",
    "/2020/05/13/adding-search-to-static-ghost/",
    "https://rolisz.ro",
    "/2020/05/16/quarantine-boardgames-itchy-feet/",
    "/2020/03/08/boardgames-party-ha/",
    "/2020/05/18/an-unexpected-error-in-rust/",
    "/2020/05/07/splitting-up-my-blog/",
    "/2020/05/06/connecting-to-azure-pypi-repositories/",
    "/2020/05/05/why-i-blog/",
    "/2020/03/18/reflections-during-covid19-times/",
    "https://rolisz.ro/projects/",
    "/2020/03/27/quarantine-boardgames-travelin/",
    "https://rolisz.ro/uses/",
    "/2020/06/02/quarantine-boardgames-pandemic/",
    "/2020/05/26/duckduckgo/",
    "javascript:;",
    "https://ghost.org",
    "/2020/05/12/productivity-tips-pomodoros/",
    "/2020/05/11/quarantine-boardgames-plague-inc/",
    "/2020/05/15/context-variables-in-python/",
    "/2020/06/08/happ

()

## `Option` type in Rust

Sometimes `Result` is more than we need: a value may simply be absent without that being an error.

```rust
enum Option<T> {
    None,
    Some(T),
}
```

For example: searching in a list or retrieving a field. Found => `Some(value)`, not found => `None`.
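Searching in a list is a typical case. A minimal sketch, where `find_url` is a hypothetical helper built on the standard `Iterator::position`:

```rust
// Hypothetical helper: index of the first URL that contains `needle`.
fn find_url(urls: &[&str], needle: &str) -> Option<usize> {
    urls.iter().position(|url| url.contains(needle))
}

fn main() {
    let urls = ["/2020/05/20/about-time/", "/2020/05/26/duckduckgo/"];

    for needle in ["duckduckgo", "missing"] {
        match find_url(&urls, needle) {
            Some(index) => println!("{} found at index {}", needle, index),
            None => println!("{} not found", needle),
        }
    }
}
```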


## Filtering results to only our domain

In [38]:
fn normalize_url(url: &str) -> Option<String> {
    let new_url = Url::parse(url);
    match new_url { 
        Ok(new_url) => {
            if let Some("rolisz.ro") = new_url.host_str() {
                Some(url.to_string())
            } else {
                None
            }
        },
        Err(_e) => {
            // Relative URLs cannot be parsed without a base, so handle them here
            if url.starts_with('/') {
                Some(format!("https://rolisz.ro{}", url))
            } else {
                None
            }
        }
    }
}

* `match` syntax for flow control
* `if let` syntax as a shorthand for a `match` with a single interesting case
* Implicit returns: the last expression of a block is its value
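These three features can be seen in isolation in a crate-free sketch. Both functions below are hypothetical and only mirror the shape of `normalize_url`.

```rust
// The `match` is the last expression, so its value is the return value.
fn classify(href: &str) -> &'static str {
    match href.chars().next() {
        Some('/') => "relative",
        Some('#') => "fragment",
        Some(_) => "other",
        None => "empty",
    }
}

// `if let` handles the one pattern we care about; `else` covers the rest.
fn first_segment(path: &str) -> Option<&str> {
    if let Some(rest) = path.strip_prefix('/') {
        rest.split('/').next()
    } else {
        None
    }
}

fn main() {
    for href in ["/2020/05/20/about-time/", "#subscribe", "https://ghost.org"] {
        println!("{} is {} ({:?})", href, classify(href), first_segment(href));
    }
}
```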

## String vs str

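### String

* allocated on the heap
* owns its data and can grow

### str

* almost always used as `&str`, an immutable borrowed slice
* can point into the heap, the stack, or the program binary (string literals)
* used as a read-only view of a string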
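A practical consequence is that a function taking `&str` accepts both kinds of string. A minimal sketch, with `is_relative` as a hypothetical helper:

```rust
// Taking `&str` lets callers pass either a `String` or a string literal.
fn is_relative(url: &str) -> bool {
    url.starts_with('/')
}

fn main() {
    let owned: String = format!("/{}/", "about-me"); // heap-allocated, owned
    let literal: &str = "https://ghost.org"; // view into the program binary

    // `&owned` is a `&String`, which dereferences to `&str` automatically.
    println!("{} -> {}", owned, is_relative(&owned));
    println!("{} -> {}", literal, is_relative(literal));

    // A slice borrows part of the `String` without copying it.
    let slice: &str = &owned[1..];
    println!("slice: {}", slice);
}
```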



# Applying the normalization 

In [41]:
use select::predicate::Predicate;

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

# Putting everything together

In [24]:
fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = get_links_from_html(&body);
    println!("URLs: {:#?}", found_urls)
}
main()

Status for https://rolisz.ro/: 200 OK
URLs: {
    "https://rolisz.ro/2020/03/01/web-crawler-in-rust/",
    "https://rolisz.ro/2020/02/13/lost-in-space/",
    "https://rolisz.ro/2020/06/07/new-desk-setup/",
    "https://rolisz.ro/2020/05/24/bullet-journaling/",
    "https://rolisz.ro/2020/06/01/yard-work/",
    "https://rolisz.ro/2020/05/26/duckduckgo/",
    "https://rolisz.ro/2020/04/11/moving-away-from-gmail/",
    "https://rolisz.ro/2020/03/18/reflections-during-covid19-times/",
    "https://rolisz.ro/2020/05/20/about-time/",
    "https://rolisz.ro/2020/05/07/splitting-up-my-blog/",
    "https://rolisz.ro/2020/05/13/adding-search-to-static-ghost/",
    "https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/",
    "https://rolisz.ro/2020/05/14/travelers/",
    "https://rolisz.ro/assets/built/screen.css?v=0df9ee8832",
    "https://rolisz.ro/favicon.ico",
    "https://rolisz.ro/2020/05/16/quarantine-boardgames-itchy-feet/",
    "https://rolisz.ro/2020/05/12/productivity-tips-pomod

()

# Refactor fetching into a function

In [36]:
fn fetch_url(client: &Client, url: &str) -> String {
    let mut res = client.get(url).send().unwrap();
    println!("Status for {}: {}", url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    body  
}

# Visiting found URLs too

In [33]:
let mut new_urls = found_urls;
while !new_urls.is_empty() {
    let found_urls = new_urls
        .iter()
        .map(|url| {
            let body = fetch_url(&client, url);
            let links = get_links_from_html(&body);
            println!("Visited: {} found {} links", url, links.len());
            links
        })
        .fold(HashSet::new(), |mut acc, x| {
            acc.extend(x);
            acc
        });
    visited.extend(new_urls);

    new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();
    println!("New urls: {}", new_urls.len())
}

Error: Failed to determine type of variable `client`. rustc suggested type reqwest::blocking::client::Client, but that's private. Sometimes adding an extern crate will help rustc suggest the correct public type name, or you can give an explicit type.

### Algorithm:

Take all URLs found on home page and crawl them. 

Check what new URLs we find and repeat until we don't find any new URLs.

### Rust:

* Using `iter` to go over all elements of the set
* Using `fold` to join all the results into a single `HashSet`.
* Using a functional-ish style
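The algorithm can be exercised without touching the network. In this sketch the "site" is a hypothetical in-memory map from a page to the pages it links to, which replaces `fetch_url` and `get_links_from_html`; the loop itself follows the same batch-by-batch structure.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for fetching and parsing: look the page up in a map.
fn crawl(site: &HashMap<&str, Vec<&str>>, origin: &str) -> HashSet<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut new_urls: HashSet<String> = HashSet::new();
    new_urls.insert(origin.to_string());

    while !new_urls.is_empty() {
        // Visit every URL in the current batch and merge the links found.
        let found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| match site.get(url.as_str()) {
                Some(links) => links.iter().map(|l| l.to_string()).collect::<HashSet<String>>(),
                None => HashSet::new(),
            })
            .fold(HashSet::new(), |mut acc, links| {
                acc.extend(links);
                acc
            });
        visited.extend(new_urls);

        // Only URLs we have not visited yet go into the next batch.
        new_urls = found_urls.difference(&visited).cloned().collect();
    }
    visited
}

fn main() {
    let mut site = HashMap::new();
    site.insert("/", vec!["/a", "/b"]);
    site.insert("/a", vec!["/", "/c"]);
    site.insert("/b", vec![]);

    let visited = crawl(&site, "/");
    println!("visited {} pages", visited.len());
}
```

The loop ends because `visited` only grows and every batch excludes what is already in it.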

In [36]:
use std::time::Instant;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let mut found_urls = new_urls.iter().map(|url| {
            let body = fetch_url(&client, url);
            let links = get_links_from_html(&body);
            println!("Visited: {} found {} links", url, links.len());
            links
        }).fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
        });
        visited.extend(new_urls);
        
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());

}
main()

Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/favicon.ico: 200 OK


thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }', src\lib.rs:55:5
stack backtrace:
   0: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
   1: core::fmt::write
   2: <std::io::IoSlice as core::fmt::Debug>::fmt
   3: std::panicking::take_hook
   4: std::panicking::take_hook
   5: std::panicking::rust_panic_with_hook
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::option::expect_none_failed
   9: ctx::fetch_url
  10: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
  11: run_user_code_23
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
  16: <unknown>
  17: <unknown>
  18: <unknown>
  19: <unknown>
  20: <unknown>
  21: BaseThreadInitThunk
  22: RtlUserThreadStart
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


Error: Child process terminated with status: exit code: 101
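The panic comes from reading a binary file (the favicon) into a `String`, which must be valid UTF-8. The sketch below reproduces the check with the standard library only; the byte values are made up and `body_as_text` is a hypothetical helper.

```rust
// Hypothetical helper: returns the body as text, or `None` for binary data.
fn body_as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // 0xFF can never appear in valid UTF-8, so this stands in for an icon.
    let icon: Vec<u8> = vec![0x00, 0x00, 0x01, 0xFF];
    let page = b"<!DOCTYPE html>";

    for bytes in [&icon[..], &page[..]] {
        match body_as_text(bytes) {
            Some(text) => println!("text: {}", text),
            None => println!("{} bytes of non-UTF-8 data", bytes.len()),
        }
    }
}
```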

# Filtering out non-HTML links based on presence of extension

In [57]:
use std::path::Path;

// Returns true when the URL has no file extension, i.e. it is probably an HTML page.
fn has_no_extension(url: &&str) -> bool {
    Path::new(url).extension().is_none()
}

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter(has_no_extension)
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

main()

Status for https://rolisz.ro/: 200 OK
HTML: <!DOCTYPE html>
<html lang="en">
<head


()

* Use the stdlib `Path` parser to get the extension
* `extension` returns an `Option`
* `is_none` is true when the `Option` is `None`, so only links without an extension are kept
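To see what the check does with typical links from the crawl, here is a small stdlib-only sketch. `looks_like_page` is a hypothetical name for the same test.

```rust
use std::path::Path;

// Hypothetical helper: true when the link has no file extension.
fn looks_like_page(href: &str) -> bool {
    Path::new(href).extension().is_none()
}

fn main() {
    let hrefs = [
        "/2020/05/20/about-time/",
        "/favicon.ico",
        "/assets/built/screen.css?v=0df9ee8832",
    ];
    for href in hrefs {
        println!("{:<40} page? {}", href, looks_like_page(href));
    }
}
```

`Path` knows nothing about URLs, so this is a heuristic: anything after the last dot in the final segment counts as an extension, including a query string.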

# Let's write the pages to disk

In [34]:
use std::fs;


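// `path` is the URL path (for example "/projects/"); files go under `static/`.
fn write_file(path: &str, content: &str) {
    fs::create_dir_all(format!("static{}", path)).unwrap();
    // `fs::write` returns a `Result`; unwrap it so a failed write is not silently ignored.
    fs::write(format!("static{}/index.html", path), content).unwrap();
}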

* Create the folder structure, including all parents
* Write the file to `index.html` in that folder
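A variant that reports errors to the caller instead of panicking could look like the sketch below. `save_page`, its `root` parameter and the temporary directory name are assumptions made for the example, not part of the notebook.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical variant of `write_file`: returns errors and takes the
// output root as a parameter.
fn save_page(root: &Path, url_path: &str, content: &str) -> io::Result<PathBuf> {
    // "/2020/05/20/about-time/" -> "<root>/2020/05/20/about-time"
    let dir = root.join(url_path.trim_matches('/'));
    fs::create_dir_all(&dir)?; // creates missing parents too
    let file = dir.join("index.html");
    fs::write(&file, content)?;
    Ok(file)
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("crawler-sketch");
    let file = save_page(&root, "/2020/05/20/about-time/", "<html></html>")?;
    println!("wrote {}", file.display());
    Ok(())
}
```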

In [49]:
fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);
                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());

}
main()


Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/2020/06/14/operating-system-journey/: 200 OK
Visited: https://rolisz.ro/2020/06/14/operating-system-journey/ found 18 links
Status for https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/: 200 OK
Visited: https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/ found 13 links
Status for https://rolisz.ro/projects/: 200 OK
Visited: https://rolisz.ro/projects/ found 19 links
Status for https://rolisz.ro/2020/05/11/quarantine-boardgames-plague-inc/: 200 OK
Visited: https://rolisz.ro/2020/05/11/quarantine-boardgames-plague-inc/ found 14 links
Status for https://rolisz.ro/page/2/: 200 OK
Visited: https://rolisz.ro/page/2/ found 34 links
Status for https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/: 200 OK
Visited: https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/ found 14 links
Status for https://rolisz.ro/2020/05/24/bullet-journaling/: 200 OK
Visited: https://rolisz.ro/2020/05/24/bul

Visited: https://rolisz.ro/2020/01/02/boardgames-party-rival-restaurants/ found 14 links
Status for https://rolisz.ro/2012/06/29/a-tale-of-two-interviews/: 200 OK
Visited: https://rolisz.ro/2012/06/29/a-tale-of-two-interviews/ found 14 links
Status for https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/: 200 OK
Visited: https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/ found 15 links
Status for https://rolisz.ro/2019/11/25/git-tutorial-part-3/: 200 OK
Visited: https://rolisz.ro/2019/11/25/git-tutorial-part-3/ found 14 links
Status for https://rolisz.ro/page/3/: 200 OK
Visited: https://rolisz.ro/page/3/ found 35 links
Status for https://rolisz.ro/2020/01/21/blogs-are-best-served-static/: 200 OK
Visited: https://rolisz.ro/2020/01/21/blogs-are-best-served-static/ found 14 links
Status for https://rolisz.ro/2018/05/07/scraping-for-houses/: 200 OK
Visited: https://rolisz.ro/2018/05/07/scraping-for-houses/ found 12 links
Status for https://rolisz.ro/2010/06/18/windows-vis

Status for https://rolisz.ro/2011/05/24/the-event/: 200 OK
Visited: https://rolisz.ro/2011/05/24/the-event/ found 14 links
Status for https://rolisz.ro/2010/10/10/rubicon/: 200 OK
Visited: https://rolisz.ro/2010/10/10/rubicon/ found 14 links
Status for https://rolisz.ro/2012/02/19/olimpiada-judeteana-de-fizica/: 200 OK
Visited: https://rolisz.ro/2012/02/19/olimpiada-judeteana-de-fizica/ found 14 links
Status for https://rolisz.ro/2013/04/13/single-table-inheritance-in-camelot/: 200 OK
Visited: https://rolisz.ro/2013/04/13/single-table-inheritance-in-camelot/ found 14 links
Status for https://rolisz.ro/2016/01/13/2015-in-review/: 200 OK
Visited: https://rolisz.ro/2016/01/13/2015-in-review/ found 17 links
Status for https://rolisz.ro/2012/12/08/global-day-of-code-retreat-2012/: 200 OK
Visited: https://rolisz.ro/2012/12/08/global-day-of-code-retreat-2012/ found 13 links
Status for https://rolisz.ro/2017/02/20/backing-up/: 200 OK
Visited: https://rolisz.ro/2017/02/20/backing-up/ found 13 l

Visited: https://rolisz.ro/2010/10/12/experienta-mea-cu-linux-part-2/ found 14 links
Status for https://rolisz.ro/2011/06/08/happy-birthday-dear-blog/: 200 OK
Visited: https://rolisz.ro/2011/06/08/happy-birthday-dear-blog/ found 14 links
Status for https://rolisz.ro/2016/01/01/setting-goals/: 200 OK
Visited: https://rolisz.ro/2016/01/01/setting-goals/ found 14 links
Status for https://rolisz.ro/2014/09/28/acrylamid-image-gallery/: 200 OK
Visited: https://rolisz.ro/2014/09/28/acrylamid-image-gallery/ found 13 links
Status for https://rolisz.ro/projects/ant/#{0,0,1L1}{0,1,1L1}{1,0,1R1}{1,1,0A0}: 200 OK
Visited: https://rolisz.ro/projects/ant/#{0,0,1L1}{0,1,1L1}{1,0,1R1}{1,1,0A0} found 0 links
Status for https://rolisz.ro/tag/blog/page/2/: 200 OK
Visited: https://rolisz.ro/tag/blog/page/2/ found 17 links
Status for https://rolisz.ro/2013/08/09/fun-at-arka-park/: 200 OK
Visited: https://rolisz.ro/2013/08/09/fun-at-arka-park/ found 14 links
Status for https://rolisz.ro/2013/04/18/neural-net

Status for https://rolisz.ro/2012/02/04/arrested-development/: 200 OK
Visited: https://rolisz.ro/2012/02/04/arrested-development/ found 14 links
Status for https://rolisz.ro/2015/10/09/google-tech-talk-babes-bolyai-university/: 200 OK
Visited: https://rolisz.ro/2015/10/09/google-tech-talk-babes-bolyai-university/ found 15 links
Status for https://rolisz.ro/2017/03/09/silence/: 200 OK
Visited: https://rolisz.ro/2017/03/09/silence/ found 14 links
Status for https://rolisz.ro/tag/reviews/page/2/: 200 OK
Visited: https://rolisz.ro/tag/reviews/page/2/ found 35 links
Status for https://rolisz.ro/2015/04/13/mail-subscription/: 200 OK
Visited: https://rolisz.ro/2015/04/13/mail-subscription/ found 14 links
Status for https://rolisz.ro/2019/03/24/old-man-s-cave/#fn-2: 200 OK
Visited: https://rolisz.ro/2019/03/24/old-man-s-cave/#fn-2 found 16 links
Status for https://rolisz.ro/2010/07/10/eureka-4x01/: 200 OK
Visited: https://rolisz.ro/2010/07/10/eureka-4x01/ found 14 links
Status for https://roli

Status for https://rolisz.ro/2011/01/31/olimpiada-nationala-de-fizica-2011-part-one/: 200 OK
Visited: https://rolisz.ro/2011/01/31/olimpiada-nationala-de-fizica-2011-part-one/ found 14 links
Status for https://rolisz.ro/2016/12/09/noh-hai-la-vot/: 200 OK
Visited: https://rolisz.ro/2016/12/09/noh-hai-la-vot/ found 13 links
Status for https://rolisz.ro/2012/01/05/2011-in-review/: 200 OK
Visited: https://rolisz.ro/2012/01/05/2011-in-review/ found 14 links
Status for https://rolisz.ro/2015/04/30/trip-to-geneva-part-2/: 200 OK
Visited: https://rolisz.ro/2015/04/30/trip-to-geneva-part-2/ found 14 links
Status for https://rolisz.ro/2012/12/31/la-multi-ani/: 200 OK
Visited: https://rolisz.ro/2012/12/31/la-multi-ani/ found 18 links
Status for https://rolisz.ro/2017/01/28/sous-vide-with-anova-precision-cooker/: 200 OK
Visited: https://rolisz.ro/2017/01/28/sous-vide-with-anova-precision-cooker/ found 14 links
Status for https://rolisz.ro/tag/cmd/: 200 OK
Visited: https://rolisz.ro/tag/cmd/ found 

Status for https://rolisz.ro/2017/10/06/around-europe-with-my-wifey/: 200 OK
Visited: https://rolisz.ro/2017/10/06/around-europe-with-my-wifey/ found 14 links
Status for https://rolisz.ro/2016/05/24/paris-revisited/: 200 OK
Visited: https://rolisz.ro/2016/05/24/paris-revisited/ found 15 links
Status for https://rolisz.ro/2015/08/30/after-a-year-in-zuerich/: 404 Not Found
Visited: https://rolisz.ro/2015/08/30/after-a-year-in-zuerich/ found 0 links
Status for https://rolisz.ro/2010/07/03/prince-of-persia-the-forgotten-sands/: 200 OK
Visited: https://rolisz.ro/2010/07/03/prince-of-persia-the-forgotten-sands/ found 14 links
Status for https://rolisz.ro/2010/12/07/xhr-file-upload/: 200 OK
Visited: https://rolisz.ro/2010/12/07/xhr-file-upload/ found 14 links
Status for https://rolisz.ro/tag/eight/: 200 OK
Visited: https://rolisz.ro/tag/eight/ found 16 links
Status for https://rolisz.ro/2012/11/12/tutorial-prolog/: 200 OK
Visited: https://rolisz.ro/2012/11/12/tutorial-prolog/ found 14 links
S

Status for https://rolisz.ro/2016/07/24/50-days-of-roland/: 200 OK
Visited: https://rolisz.ro/2016/07/24/50-days-of-roland/ found 12 links
Status for https://rolisz.ro/2011/04/23/php-benchmarking/: 200 OK
Visited: https://rolisz.ro/2011/04/23/php-benchmarking/ found 14 links
Status for https://rolisz.ro/2010/10/28/webcomics/: 200 OK
Visited: https://rolisz.ro/2010/10/28/webcomics/ found 14 links
Status for https://rolisz.ro/2014/06/05/festivitate-informatica-ubb/: 200 OK
Visited: https://rolisz.ro/2014/06/05/festivitate-informatica-ubb/ found 13 links
Status for https://rolisz.ro/2010/06/13/experienta-mea-cu-linux-part-1/: 200 OK
Visited: https://rolisz.ro/2010/06/13/experienta-mea-cu-linux-part-1/ found 14 links
Status for https://rolisz.ro/2013/09/15/ultima-vacanta-de-vara/: 200 OK
Visited: https://rolisz.ro/2013/09/15/ultima-vacanta-de-vara/ found 14 links
Status for https://rolisz.ro/2010/07/25/used-by-me-3/: 200 OK
Visited: https://rolisz.ro/2010/07/25/used-by-me-3/ found 14 links

Status for https://rolisz.ro/2012/10/11/anul-2/: 200 OK
Visited: https://rolisz.ro/2012/10/11/anul-2/ found 14 links
Status for https://rolisz.ro/tag/competition/: 200 OK
Visited: https://rolisz.ro/tag/competition/ found 16 links
Status for https://rolisz.ro/2011/10/07/facultate-first-impressions/: 200 OK
Visited: https://rolisz.ro/2011/10/07/facultate-first-impressions/ found 14 links
Status for https://rolisz.ro/page/6/: 200 OK
Visited: https://rolisz.ro/page/6/ found 35 links
Status for https://rolisz.ro/tag/holiday/: 200 OK
Visited: https://rolisz.ro/tag/holiday/ found 11 links
Status for https://rolisz.ro/2010/12/17/tron-evolution/: 200 OK
Visited: https://rolisz.ro/2010/12/17/tron-evolution/ found 14 links
Status for https://rolisz.ro/tag/php/: 200 OK
Visited: https://rolisz.ro/tag/php/ found 10 links
Status for https://rolisz.ro/2012/06/03/cu-bita-prin-cluj/: 200 OK
Visited: https://rolisz.ro/2012/06/03/cu-bita-prin-cluj/ found 13 links
Status for https://rolisz.ro/tag/voting/: 

Visited: https://rolisz.ro/page/7/ found 35 links
New urls: 3
Status for https://rolisz.ro/tag/fun/: 200 OK
Visited: https://rolisz.ro/tag/fun/ found 26 links
Status for https://rolisz.ro/author/rolisz/page/7/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/7/ found 35 links
Status for https://rolisz.ro/page/8/: 200 OK
Visited: https://rolisz.ro/page/8/ found 35 links
New urls: 2
Status for https://rolisz.ro/page/9/: 200 OK
Visited: https://rolisz.ro/page/9/ found 35 links
Status for https://rolisz.ro/author/rolisz/page/8/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/8/ found 35 links
New urls: 2
Status for https://rolisz.ro/page/10/: 200 OK
Visited: https://rolisz.ro/page/10/ found 35 links
Status for https://rolisz.ro/author/rolisz/page/9/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/9/ found 35 links
New urls: 2
Status for https://rolisz.ro/author/rolisz/page/10/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/10/ found 35 links
Status for https://rolis

()

# Let's parallelize it!

In [None]:
use rayon::prelude::*;

// ...
while !new_urls.is_empty() {
    let found_urls: HashSet<String> = new_urls
        .par_iter()
        // ...
        .reduce(HashSet::new, |mut acc, x| {
            acc.extend(x);
            acc
        });
    // ...
}

* Use the `rayon` library
* Use `par_iter` instead of `iter`
* Use `reduce` instead of `fold`; its signature differs slightly: it takes a closure that creates the initial value (`HashSet::new`) rather than the value itself (`HashSet::new()`)
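The map-then-merge shape can also be sketched with the standard library alone. This example deliberately swaps rayon for `std::thread::scope` (one thread per URL, where rayon would use a work-stealing pool), and `links_for` is a made-up stand-in for fetching and parsing a page.

```rust
use std::collections::HashSet;
use std::thread;

// Hypothetical stand-in for "fetch a page and extract its links".
fn links_for(url: &str) -> HashSet<String> {
    let mut links = HashSet::new();
    links.insert(format!("{}a/", url));
    links.insert(format!("{}b/", url));
    links
}

// Map step on separate threads, then merge the per-thread sets.
fn parallel_links(urls: &[&str]) -> HashSet<String> {
    thread::scope(|scope| {
        let handles: Vec<_> = urls
            .iter()
            .map(|url| scope.spawn(move || links_for(url)))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            // Start from an empty set and fold every partial result into it.
            .fold(HashSet::new(), |mut acc, links| {
                acc.extend(links);
                acc
            })
    })
}

fn main() {
    let links = parallel_links(&["/x/", "/y/"]);
    println!("found {} links", links.len());
}
```

With rayon the merge itself may run on several threads, each starting from its own empty set, which is why `reduce` asks for a closure such as `HashSet::new` rather than a single initial value.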

In [48]:
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .par_iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);

                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .reduce(HashSet::new, |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());
}
main()

Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/: 200 OK
Status for https://rolisz.ro/uses/: 200 OK
Visited: https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/ found 13 links
Status for https://rolisz.ro/2020/05/24/bullet-journaling/: 200 OK
Visited: https://rolisz.ro/uses/ found 9 links
Status for https://rolisz.ro/2020/05/18/an-unexpected-error-in-rust/: 200 OK
Status for https://rolisz.ro/2020/05/14/travelers/: 200 OK
Visited: https://rolisz.ro/2020/05/24/bullet-journaling/ found 13 links
Status for https://rolisz.ro/2020/05/08/on-text-editors/: 200 OK
Status for https://rolisz.ro/2020/06/02/quarantine-boardgames-pandemic/: 200 OK
Status for https://rolisz.ro/2020/06/27/pixel-3a/: 200 OK
Status for https://rolisz.ro/2020/05/06/connecting-to-azure-pypi-repositories/: 200 OK
Status for https://rolisz.ro/2020/06/14/operating-system-journey/: 200 OK
Visited: https://rolisz.ro/2020/05/18/an-unexpected-er

Status for https://rolisz.ro/2019/11/30/boardgames-party-tokaido/: 200 OK
Status for https://rolisz.ro/2019/08/07/breaking-radio-silence/: 200 OK
Visited: https://rolisz.ro/2018/05/07/scraping-for-houses/ found 12 links
Status for https://rolisz.ro/2019/11/25/git-tutorial-part-3/: 200 OK
Visited: https://rolisz.ro/2019/04/09/hacktm-oradea/ found 16 links
Status for https://rolisz.ro/2010/10/23/dicotomia-mea/: 200 OK
Status for https://rolisz.ro/2019/05/25/vot-26-mai/: 200 OK
Visited: https://rolisz.ro/2020/04/11/moving-away-from-gmail/ found 14 links
Visited: https://rolisz.ro/2019/11/30/boardgames-party-tokaido/ found 14 links
Visited: https://rolisz.ro/2013/11/15/receiptbudget/ found 15 links
Visited: https://rolisz.ro/2019/08/07/breaking-radio-silence/ found 14 links
Visited: https://rolisz.ro/tag/reviews/ found 34 links
Status for https://rolisz.ro/2019/12/25/the-rise-of-skywalker/: 200 OK
Status for https://rolisz.ro/2012/08/20/processing-im-logs/: 200 OK
Status for https://rolisz

Visited: https://rolisz.ro/2018/10/14/sweating-in-hong-kong/ found 14 links
Visited: https://rolisz.ro/tag/programming/page/2/ found 31 links
Visited: https://rolisz.ro/2012/02/16/simplu-calculator-in-python-partea-2/ found 14 links
Visited: https://rolisz.ro/2017/02/20/backing-up/ found 13 links
Status for https://rolisz.ro/tag/yahoo/: 200 OK
Status for https://rolisz.ro/tag/psychology/: 200 OK
Status for https://rolisz.ro/2016/01/02/setting-goals/: 200 OK
Status for https://rolisz.ro/2014/09/22/moving-from-wordpress-to-acrylamid/: 200 OK
Status for https://rolisz.ro/2018/04/01/man-versus-nature/: 200 OK
Visited: https://rolisz.ro/2018/08/12/setting-up-ssh-keys/ found 14 links
Status for https://rolisz.ro/2011/04/17/white-collar/: 200 OK
Visited: https://rolisz.ro/2012/09/09/frumusetea-miscai/ found 14 links
Visited: https://rolisz.ro/tag/yahoo/ found 9 links
Visited: https://rolisz.ro/tag/psychology/ found 9 links
Status for https://rolisz.ro/2010/07/10/eureka-4x01/: 200 OK
Status fo

Visited: https://rolisz.ro/2011/07/24/burn-notice/ found 16 links
Status for https://rolisz.ro/2014/12/01/interstellar/: 200 OK
Status for https://rolisz.ro/tag/desktop/: 200 OK
Status for https://rolisz.ro/2011/05/28/ultima-zi-de-liceu/: 200 OK
Status for https://rolisz.ro/2017/05/06/happy-mother-s-day/: 200 OK
Visited: https://rolisz.ro/2013/04/26/the-mysterious-vcvarsall-bat/ found 12 links
Status for https://rolisz.ro/2018/06/08/regular-expressions-for-objects/: 200 OK
Status for https://rolisz.ro/2012/08/02/pioneer-one/: 200 OK
Visited: https://rolisz.ro/2010/10/10/rubicon/ found 14 links
Status for https://rolisz.ro/2011/06/08/happy-birthday-dear-blog/: 200 OK
Visited: https://rolisz.ro/tag/desktop/ found 9 links
Status for https://rolisz.ro/2010/08/27/scoala-de-soferi/: 200 OK
Visited: https://rolisz.ro/2017/05/06/happy-mother-s-day/ found 14 links
Visited: https://rolisz.ro/2012/08/02/pioneer-one/ found 14 links
Visited: https://rolisz.ro/2018/06/08/regular-expressions-for-obje

Status for https://rolisz.ro/2015/01/03/saying-goodbye-to-lenovo/: 200 OK
Status for https://rolisz.ro/tag/celullar-automata/: 200 OK
Visited: https://rolisz.ro/2018/01/13/mass-effect-andromeda/ found 15 links
Status for https://rolisz.ro/2017/03/09/silence/: 200 OK
Status for https://rolisz.ro/2019/03/24/old-man-s-cave/#fn-2: 200 OK
Status for https://rolisz.ro/2015/03/29/time-for-a-new-look/: 200 OK
Visited: https://rolisz.ro/2016/01/13/2015-in-review/ found 17 links
Status for https://rolisz.ro/2014/06/21/misca-reloaded/: 200 OK
Visited: https://rolisz.ro/2016/06/08/gluecklicher-6-geburtstag-mein-blog/ found 23 links
Visited: https://rolisz.ro/tag/celullar-automata/ found 9 links
Visited: https://rolisz.ro/2015/01/03/saying-goodbye-to-lenovo/ found 12 links
Visited: https://rolisz.ro/2015/03/29/time-for-a-new-look/ found 14 links
Visited: https://rolisz.ro/2019/03/24/old-man-s-cave/#fn-2 found 16 links
Visited: https://rolisz.ro/2017/03/09/silence/ found 14 links
Status for https://

Status for https://rolisz.ro/2016/03/25/the-museum/: 200 OK
Status for https://rolisz.ro/2017/01/28/sous-vide-with-anova-precision-cooker/: 200 OK
Status for https://rolisz.ro/2013/11/30/stephen-wolframs-new-big-thing/: 200 OK
Status for https://rolisz.ro/2011/01/01/afisarea-treptata-a-unei-figuri/: 200 OK
Visited: https://rolisz.ro/2014/07/04/review-materii-anul-3/ found 14 links
Visited: https://rolisz.ro/2013/02/25/cura-de-slabire/ found 14 links
Visited: https://rolisz.ro/2012/03/16/tutorial-awk/ found 14 links
Visited: https://rolisz.ro/2016/03/06/mayumana/ found 14 links
Visited: https://rolisz.ro/2016/03/25/the-museum/ found 15 links
Visited: https://rolisz.ro/2017/01/28/sous-vide-with-anova-precision-cooker/ found 14 links
Status for https://rolisz.ro/2011/07/17/lenovo-t520-review/: 200 OK
Status for https://rolisz.ro/2011/01/02/2010-in-review/: 200 OK
Status for https://rolisz.ro/2014/06/05/festivitate-informatica-ubb/: 200 OK
Status for https://rolisz.ro/2015/05/29/weekend-in

Visited: https://rolisz.ro/2010/10/08/big-day/ found 14 links
Visited: https://rolisz.ro/2011/09/27/windows-8-part-2/ found 15 links
Status for https://rolisz.ro/tag/acrylamid/: 200 OK
Visited: https://rolisz.ro/2012/12/31/la-multi-ani/ found 18 links
Status for https://rolisz.ro/2016/02/27/new-york-again/: 200 OK
Status for https://rolisz.ro/tag/challenge/: 200 OK
Status for https://rolisz.ro/2014/09/29/knabenschiessen/: 200 OK
Status for https://rolisz.ro/2017/11/09/blade-runner-2049/: 200 OK
Visited: https://rolisz.ro/2013/11/10/laptop-refreshing/ found 14 links
Status for https://rolisz.ro/2017/06/29/i-believe-i-can-fly/: 200 OK
Visited: https://rolisz.ro/tag/acrylamid/ found 11 links
Visited: https://rolisz.ro/tag/challenge/ found 12 links
Status for https://rolisz.ro/2010/07/25/used-by-me-3/: 200 OK
Visited: https://rolisz.ro/2017/11/09/blade-runner-2049/ found 14 links
Visited: https://rolisz.ro/2014/09/29/knabenschiessen/ found 13 links
Status for https://rolisz.ro/tag/plants/:

Status for https://rolisz.ro/2016/05/24/paris-revisited/: 200 OK
Visited: https://rolisz.ro/2015/10/08/google-tech-talk-babes-bolyai-university/amp/ found 3 links
Visited: https://rolisz.ro/2015/01/20/winter-holiday-movies/ found 14 links
Status for https://rolisz.ro/2011/01/20/fall-of-giants/: 200 OK
Status for https://rolisz.ro/2012/06/02/osom/: 200 OK
Visited: https://rolisz.ro/2015/04/10/biking/ found 14 links
Status for https://rolisz.ro/2012/01/05/2011-in-review/: 200 OK
Status for https://rolisz.ro/tag/windows/: 200 OK
Status for https://rolisz.ro/2010/11/15/concursul-de-fizica-schwartz/: 200 OK
Status for https://rolisz.ro/2014/09/24/first-week-in-zurich/: 200 OK
Visited: https://rolisz.ro/2011/01/20/fall-of-giants/ found 15 links
Visited: https://rolisz.ro/2016/05/11/hanz-zimmer-live-on-tour/ found 14 links
Visited: https://rolisz.ro/2012/06/02/osom/ found 14 links
Visited: https://rolisz.ro/2012/01/05/2011-in-review/ found 14 links
Visited: https://rolisz.ro/tag/windows/ foun

Visited: https://rolisz.ro/2011/03/17/adblocking/ found 11 links
Status for https://rolisz.ro/2010/07/29/bancuri-bilingve/: 200 OK
Visited: https://rolisz.ro/tag/voting/ found 11 links
Visited: https://rolisz.ro/2016/09/17/good-bye-alida/amp/ found 3 links
Visited: https://rolisz.ro/2012/01/07/redirecting-python-output-from-the-command-line/ found 14 links
Status for https://rolisz.ro/2015/06/12/trip-to-dublin/amp/: 200 OK
Visited: https://rolisz.ro/tag/philosophy/ found 11 links
Status for https://rolisz.ro/2013/11/21/cmd-line-batch-installer/: 404 Not Found
Visited: https://rolisz.ro/2013/11/21/cmd-line-batch-installer/ found 0 links
Status for https://rolisz.ro/2012/03/17/mass-effect-3/: 200 OK
Status for https://rolisz.ro/2012/03/03/viata-de-student-desert/: 200 OK
Status for https://rolisz.ro/2013/12/06/wireshark-and-amazon-swf/: 200 OK
Visited: https://rolisz.ro/2014/01/17/today-software-magazine/ found 14 links
Visited: https://rolisz.ro/2015/06/12/trip-to-dublin/amp/ found 3 li

Status for https://rolisz.ro/page/8/: 200 OK
Status for https://rolisz.ro/author/rolisz/page/7/: 200 OK
Visited: https://rolisz.ro/tag/fun/ found 26 links
Visited: https://rolisz.ro/author/rolisz/page/7/ found 35 links
Visited: https://rolisz.ro/page/8/ found 35 links
New urls: 2
Status for https://rolisz.ro/author/rolisz/page/8/: 200 OK
Status for https://rolisz.ro/page/9/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/8/ found 35 links
Visited: https://rolisz.ro/page/9/ found 35 links
New urls: 2
Status for https://rolisz.ro/page/10/: 200 OK
Status for https://rolisz.ro/author/rolisz/page/9/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/9/ found 35 links
Visited: https://rolisz.ro/page/10/ found 35 links
New urls: 2
Status for https://rolisz.ro/page/11/: 200 OK
Status for https://rolisz.ro/author/rolisz/page/10/: 200 OK
Visited: https://rolisz.ro/author/rolisz/page/10/ found 35 links
Visited: https://rolisz.ro/page/11/ found 35 links
New urls: 2
Status for https://roli

()

## A nearly 4x speedup, from 30 seconds to 8 seconds, with 3 lines changed

* Rust checks at compile time that the parallel version is memory safe and free of data races.
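
As a rough illustration of that guarantee, here is a minimal sketch that uses only the standard library instead of rayon (an assumption made to keep it self-contained). The helper `count_links_parallel` and its input numbers are hypothetical. The point is that shared mutable state has to go through a synchronization type such as `Mutex`; the compiler rejects code that mutates a plain `usize` from several threads.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: sums per-page link counts, one thread per page.
fn count_links_parallel(link_counts: &[usize]) -> usize {
    // The shared total lives behind a Mutex. Mutating a plain `usize`
    // from several threads would be rejected at compile time.
    let total = Mutex::new(0usize);

    // Scoped threads may borrow `total` because they are guaranteed
    // to finish before `scope` returns.
    thread::scope(|s| {
        for &count in link_counts {
            let total = &total;
            s.spawn(move || {
                *total.lock().unwrap() += count;
            });
        }
    });

    total.into_inner().unwrap()
}

fn main() {
    let total = count_links_parallel(&[14, 13, 9, 16]);
    println!("Total links: {}", total);
}
```

Rayon hides the thread management, but the same ownership and borrowing rules are what make its parallel iterators safe to drop in.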

# Questions?

# Extras

## Command line options

In [82]:
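use structopt::StructOpt;

#[derive(StructOpt, Debug)]
#[structopt(version = "1.0", author = "rolisz")]
struct Opts {
    /// Positional argument: the site to crawl
    url: String,
    /// Number of worker threads (-n / --num-threads)
    #[structopt(short, long, default_value = "4")]
    num_threads: usize,
    /// Verbosity; every repeated -v adds one
    #[structopt(short, long, parse(from_occurrences))]
    verbose: i32,
    /// Boolean flag, false unless --dry-run is passed
    #[structopt(long)]
    dry_run: bool,
}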

In [85]:
let opt = Opts::from_iter(vec!["crawler", "--num-threads", "8", "--dry-run", "rolisz.ro"]);
opt

Opts { url: "rolisz.ro", num_threads: 8, verbose: 0, dry_run: true }

## Proper error handling

In [51]:
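use std::io::Error as IoErr;

// Every variant keeps the URL it failed on next to the underlying error.
#[derive(Debug)]
enum Error {
    Write { url: String, e: IoErr },
    Fetch { url: String, e: reqwest::Error },
}
// Shorthand so functions can return `Result<T>` with our `Error` type.
type Result<T> = std::result::Result<T, Error>;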

In [52]:
impl<S: AsRef<str>> From<(S, IoErr)> for Error {
    fn from((url, e): (S, IoErr)) -> Self {
        Error::Write {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

impl<S: AsRef<str>> From<(S, reqwest::Error)> for Error {
    fn from((url, e): (S, reqwest::Error)) -> Self {
        Error::Fetch {
            url: url.as_ref().to_string(),
            e,
        }
    }
}
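
To see how these tuple conversions are meant to be used, here is a minimal standard-library-only sketch. `CrawlError`, `read_page` and the path are hypothetical stand-ins for the notebook's `Error` and `fetch_url`. `map_err` pairs the context with the low-level error, and `?` then calls `From::from` on the tuple.

```rust
use std::fs;
use std::io;

// Hypothetical error type mirroring the shape of `Error::Write`.
#[derive(Debug)]
enum CrawlError {
    Read { path: String, e: io::Error },
}

impl<S: AsRef<str>> From<(S, io::Error)> for CrawlError {
    fn from((path, e): (S, io::Error)) -> Self {
        CrawlError::Read {
            path: path.as_ref().to_string(),
            e,
        }
    }
}

fn read_page(path: &str) -> Result<String, CrawlError> {
    // `map_err` attaches the path; `?` converts the tuple into `CrawlError`.
    let body = fs::read_to_string(path).map_err(|e| (path, e))?;
    Ok(body)
}

fn main() {
    match read_page("/definitely/not/here/index.html") {
        Ok(body) => println!("read {} bytes", body.len()),
        Err(err) => println!("Error: {:?}", err),
    }
}
```

The appeal of this design is that the call sites stay short while every error still records which URL or path it belongs to.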

In [55]:
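use std::io::Read; // `read_to_string` is a method of the `Read` trait

fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> Result<String> {
    // `?` turns the (url, error) tuple into `Error::Fetch` through `From`.
    let mut res = client.get(url).send().map_err(|e| (url, e))?;
    println!("Status for {}: {}", url, res.status());

    let mut body = String::new();
    // A failed read yields an `io::Error`, which becomes `Error::Write`.
    res.read_to_string(&mut body).map_err(|e| (url, e))?;
    Ok(body)
}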

Error: mismatched types

In [54]:

use std::fs;

fn write_file(path: &str, content: &str) -> Result<()> {
    // Mirror the URL path under `static/`, creating directories as needed.
    let dir = format!("static{}", path);
    fs::create_dir_all(&dir).map_err(|e| (&dir, e))?;
    let index = format!("static{}/index.html", path);
    fs::write(&index, content).map_err(|e| (&index, e))?;

    Ok(())
}

In [None]:
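// Fetch and store every new URL in parallel, then split the per-URL
// results into successes and failures.
let (found_urls, errors): (Vec<Result<HashSet<String>>>, Vec<_>) = new_urls
    .par_iter()
    .map(|url| -> Result<HashSet<String>> {
        let body = fetch_url(&client, url)?;
        // Assuming `origin_url` ends with `/`, this keeps the path with its leading slash.
        write_file(&url[origin_url.len() - 1..], &body)?;

        let links = get_links_from_html(&body);
        println!("Visited: {} found {} links", url, links.len());
        Ok(links)
    })
    .partition(Result::is_ok);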
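
The `partition(Result::is_ok)` step can be tried in isolation. The following is a sequential sketch using `Iterator::partition` from the standard library, which has the same shape as the rayon call above. `split_results` and its string inputs are hypothetical and only stand in for the crawler's per-URL results.

```rust
/// Hypothetical helper: parses each input and separates successes from failures.
fn split_results(inputs: &[&str]) -> (Vec<usize>, Vec<String>) {
    let (oks, errs): (Vec<Result<usize, String>>, Vec<Result<usize, String>>) = inputs
        .iter()
        .map(|s| s.parse::<usize>().map_err(|e| format!("{}: {}", s, e)))
        .partition(Result::is_ok);

    // Each half now holds a single variant, so unwrapping cannot panic.
    let values = oks.into_iter().map(Result::unwrap).collect();
    let messages = errs.into_iter().map(Result::unwrap_err).collect();
    (values, messages)
}

fn main() {
    let (values, messages) = split_results(&["14", "x", "9"]);
    println!("values: {:?}", values);
    println!("errors: {:?}", messages);
}
```

Partitioning first and unwrapping afterwards keeps one failing page from aborting the whole batch.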

In [None]:
// Everything fetched in this round now counts as visited.
visited.extend(new_urls);
// Merge the link sets of all pages, then keep only unseen URLs.
new_urls = found_urls
    .into_par_iter()
    .map(Result::unwrap)
    .reduce(HashSet::new, |mut acc, x| {
        acc.extend(x);
        acc
    })
    .difference(&visited)
    .map(|x| x.to_string())
    .collect::<HashSet<String>>();
println!("New urls: {}", new_urls.len());
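
The merge-then-subtract step is easier to follow on a tiny input. This is a sequential sketch that swaps rayon's `reduce` for `Iterator::fold` from the standard library; `next_frontier` and the sample URLs are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: unions all found link sets and drops visited URLs.
fn next_frontier(found: Vec<HashSet<String>>, visited: &HashSet<String>) -> HashSet<String> {
    found
        .into_iter()
        // Union every page's links into one set.
        .fold(HashSet::new(), |mut acc, links| {
            acc.extend(links);
            acc
        })
        // Keep only the URLs that have not been crawled yet.
        .difference(visited)
        .cloned()
        .collect()
}

fn to_set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

fn main() {
    let found = vec![to_set(&["/a/", "/b/"]), to_set(&["/b/", "/c/"])];
    let visited = to_set(&["/a/"]);
    let frontier = next_frontier(found, &visited);
    println!("New urls: {}", frontier.len());
}
```

The crawl ends when this set comes back empty.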

In [None]:
println!(
    "Errors: {:#?}",
    errors
        .into_iter()
        .map(Result::unwrap_err)
        .collect::<Vec<Error>>()
)