In [2]:
:sccache 1

sccache: true


In [3]:
:dep reqwest = { version = "0.10", features = ["json", "blocking"] }

In [4]:
:dep quick-xml = { version = "*", features = [ "serialize" ] }


In [5]:
:dep rayon = { version = "*" }


In [6]:
:dep select = "0.4.3"

# Intro to Rust
## rolisz@

# About me

* I worked at Google between 2014 and 2018
* Currently TL of AI/ML team at Archive360
* I blog at rolisz.ro
* Started playing with Rust last year in September, in my free time



# What is Rust?

* A language empowering everyone to build reliable and efficient software. https://www.rust-lang.org/
* Announced by Mozilla in 2010
* Stable release since 2015 (1.0)


# How?


## Performance
* Fast and memory efficient
* No garbage collector


## Reliability
* Very rich type system
* Ownership tracking
* Guarantees memory and thread safety


## Productivity

* Great documentation
* Great error messages
* Great tooling - package manager, build tool, IDE support, auto-formatter, etc.



# Let’s build a web crawler in Rust!


* Given a URL (http://rolisz.ro), load all the pages that are under that URL
  * For example http://rolisz.ro/my-awesome-post
  * But not http://subdomain.rolisz.ro/another-page
* Save a copy of them locally


* Interactive, live coding from a prepared notebook
* Feel free to ask questions

In [15]:
use std::io::Read;
use reqwest::Url;
use reqwest::blocking::Client;

fn main() {
    let client = Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    println!("HTML: {}", &body[0..40]);
}

main()

Status for https://rolisz.ro/: 200 OK
HTML: <!DOCTYPE html>
<html lang="en">
<head


()


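Instantiate an HTTP client.

Send the request and `unwrap` the result.

Read the contents into a string.

Variables are immutable by default, so `res` and `body` are declared with `mut`.

Reading the response body returns a `Result` too, so it is also `unwrap`ped.

Print the first 40 bytes; slicing a `String` indexes by bytes, not characters.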
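
Because `&body[0..40]` counts bytes, it panics when the body is shorter than 40 bytes or when byte 40 falls inside a multi-byte UTF-8 character. A minimal sketch of a character-based preview follows; `preview` is a hypothetical helper, not something the notebook defines.

```rust
/// Returns at most `n` characters from the start of `text`.
/// Hypothetical helper, for illustration only.
fn preview(text: &str, n: usize) -> String {
    // `chars()` iterates over Unicode scalar values, so this never cuts a
    // multi-byte character in half and never panics on short input.
    text.chars().take(n).collect()
}

fn main() {
    let body = "<!DOCTYPE html>\n<html lang=\"en\">";
    println!("HTML: {}", preview(body, 40));
    println!("Short: {}", preview("héllo", 2));
}
```

The trade-off is an extra allocation for the returned `String`, which is negligible for a debug print.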

## What is `unwrap`?

* Rust has a strong type system to handle errors. Most operations that can fail return the following:

In [21]:
enum Result<T, E> {
    Ok(T),
    Err(E),
}

let success: Result<i32, String> = Result::Ok(1);
let error: Result<i32, String> = Result::Err("Fail".to_string());

`unwrap` turns a `Result<T, E>` into a `T` and panics if there is an error.

Calling `unwrap` is usually bad practice; you should handle the error properly.
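
As a rough sketch of what handling the error could look like, the example below uses only the standard library. `parse_port` is a made-up function for illustration; it shows the `?` operator, a `match` on the `Result`, and a fallback value as alternatives to `unwrap`.

```rust
use std::num::ParseIntError;

/// Parses a port number, passing any parse error up to the caller.
/// Hypothetical function, for illustration only.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    // `?` returns early with the `Err` value instead of panicking.
    let port = text.trim().parse::<u16>()?;
    Ok(port)
}

fn main() {
    for input in &["8080", "not a port"] {
        // `match` forces us to say what happens in both cases.
        match parse_port(input) {
            Ok(port) => println!("Parsed port {}", port),
            Err(e) => println!("Could not parse {:?}: {}", input, e),
        }
    }
    // A fallback value is another alternative to `unwrap`.
    let port = parse_port("oops").unwrap_or(80);
    println!("Falling back to port {}", port);
}
```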

## Parsing the HTML 

In [None]:
let found_urls = Document::from(body.as_str())
    .find(Name("a"))
    .filter_map(|n| n.attr("href"))
    .map(str::to_string)
    .collect::<HashSet<String>>();
println!("URLs: {:#?}", found_urls) 

Use a library to parse the HTML.

Find `a` elements that have an `href` attribute and keep only that value.

`n.attr("href")` returns an `Option` with the value.

Put everything into a `HashSet`, to avoid duplicates.
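
To see the `filter_map` and `HashSet` steps in isolation, here is a std-only sketch. The slice of `Option<&str>` values is an assumption standing in for what `n.attr("href")` yields per element, and `collect_hrefs` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Collects the `href` values that are present, dropping duplicates.
/// The input stands in for the per-element results of `n.attr("href")`.
fn collect_hrefs(attrs: &[Option<&str>]) -> HashSet<String> {
    attrs
        .iter()
        // `filter_map` drops every `None` and unwraps every `Some`.
        .filter_map(|href| *href)
        .map(str::to_string)
        // Collecting into a set removes repeated links.
        .collect::<HashSet<String>>()
}

fn main() {
    let attrs = [Some("/about/"), None, Some("/about/"), Some("https://ghost.org")];
    let urls = collect_hrefs(&attrs);
    println!("{} unique URLs", urls.len());
}
```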

In [28]:
use std::io::Read;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .map(str::to_string)
        .collect::<HashSet<String>>();
    println!("URLs: {:#?}", found_urls)
}

main()

Status for https://rolisz.ro/: 200 OK
URLs: {
    "/2020/05/08/on-text-editors/",
    "/2020/06/07/new-desk-setup/",
    "https://www.facebook.com/rolisz",
    "/2020/05/20/about-time/",
    "/2020/05/13/adding-search-to-static-ghost/",
    "https://rolisz.ro",
    "/2020/05/16/quarantine-boardgames-itchy-feet/",
    "/2020/03/08/boardgames-party-ha/",
    "/2020/05/18/an-unexpected-error-in-rust/",
    "/2020/05/07/splitting-up-my-blog/",
    "/2020/05/06/connecting-to-azure-pypi-repositories/",
    "/2020/05/05/why-i-blog/",
    "/2020/03/18/reflections-during-covid19-times/",
    "https://rolisz.ro/projects/",
    "/2020/03/27/quarantine-boardgames-travelin/",
    "https://rolisz.ro/uses/",
    "/2020/06/02/quarantine-boardgames-pandemic/",
    "/2020/05/26/duckduckgo/",
    "javascript:;",
    "https://ghost.org",
    "/2020/05/12/productivity-tips-pomodoros/",
    "/2020/05/11/quarantine-boardgames-plague-inc/",
    "/2020/05/15/context-variables-in-python/",
    "/2020/06/08/happ

()

## `Option` type in Rust

Sometimes `Result` is more than we need: the value simply doesn't exist, and its absence is not an error.

```rust
enum Option<T> {
    None,
    Some(T),
}
```

For example: searching in a list or retrieving a field. Found => `Some(value)`, not found => `None`.
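
Both cases can be sketched with the standard library alone. The names `position_of`, `tags` and `attrs` below are made up for the example.

```rust
use std::collections::HashMap;

/// Searching a list: `Some(index)` if found, `None` otherwise.
/// Hypothetical helper, for illustration only.
fn position_of(items: &[&str], wanted: &str) -> Option<usize> {
    items.iter().position(|item| *item == wanted)
}

fn main() {
    let tags = ["rust", "python", "boardgames"];
    println!("{:?}", position_of(&tags, "python"));
    println!("{:?}", position_of(&tags, "haskell"));

    // Retrieving a field: `HashMap::get` also returns an `Option`.
    let mut attrs = HashMap::new();
    attrs.insert("href", "/uses/");
    println!("{:?}", attrs.get("href"));
    println!("{:?}", attrs.get("title"));
}
```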


## Filtering results to only our domain

In [29]:
fn normalize_url(url: &str) -> Option<String> {
    let new_url = Url::parse(url);
    match new_url {
        Ok(new_url) => {
            if let Some("ghost.rolisz.ro") = new_url.host_str() {
                Some(url.to_string())
            } else {
                None
            }
        },
        Err(_e) => {
            // Relative urls are not parsed by Reqwest
            if url.starts_with('/') {
                Some(format!("https://rolisz.ro{}", url))
            } else {
                None
            }
        }
    }
}

* `match` syntax for flow control
* `if let` syntax as a shorthand
* Implicit returns
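
The three features listed above can be shown on a smaller, std-only example. `describe` and `host_label` are hypothetical functions, not part of the crawler.

```rust
/// `match` must cover every case. The value of the matching arm is the
/// function's return value, so no `return` keyword is needed.
fn describe(parsed: Result<u32, String>) -> String {
    match parsed {
        Ok(n) if n >= 400 => format!("error status {}", n),
        Ok(n) => format!("status {}", n),
        Err(e) => format!("failed: {}", e),
    }
}

/// `if let` is shorthand for a `match` with a single interesting arm.
fn host_label(host: Option<&str>) -> &str {
    if let Some(h) = host {
        h
    } else {
        "no host"
    }
}

fn main() {
    println!("{}", describe(Ok(200)));
    println!("{}", describe(Err("timeout".to_string())));
    println!("{}", host_label(Some("rolisz.ro")));
    println!("{}", host_label(None));
}
```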

# Applying the normalization 

In [35]:
use select::predicate::Predicate;

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

# Putting everything together

In [24]:
fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = get_links_from_html(&body);
    println!("URLs: {:#?}", found_urls)
}
main()

Status for https://rolisz.ro/: 200 OK
URLs: {
    "https://rolisz.ro/2020/03/01/web-crawler-in-rust/",
    "https://rolisz.ro/2020/02/13/lost-in-space/",
    "https://rolisz.ro/2020/06/07/new-desk-setup/",
    "https://rolisz.ro/2020/05/24/bullet-journaling/",
    "https://rolisz.ro/2020/06/01/yard-work/",
    "https://rolisz.ro/2020/05/26/duckduckgo/",
    "https://rolisz.ro/2020/04/11/moving-away-from-gmail/",
    "https://rolisz.ro/2020/03/18/reflections-during-covid19-times/",
    "https://rolisz.ro/2020/05/20/about-time/",
    "https://rolisz.ro/2020/05/07/splitting-up-my-blog/",
    "https://rolisz.ro/2020/05/13/adding-search-to-static-ghost/",
    "https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/",
    "https://rolisz.ro/2020/05/14/travelers/",
    "https://rolisz.ro/assets/built/screen.css?v=0df9ee8832",
    "https://rolisz.ro/favicon.ico",
    "https://rolisz.ro/2020/05/16/quarantine-boardgames-itchy-feet/",
    "https://rolisz.ro/2020/05/12/productivity-tips-pomod

()

# Refactor fetching into a function

In [None]:
fn fetch_url(client: &Client, url: &str) -> String {
    let mut res = client.get(url).send().unwrap();
    println!("Status for {}: {}", url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    body
}

# Visiting found URLs too

In [33]:
let mut new_urls = found_urls;
while !new_urls.is_empty() {
    let mut found_urls = new_urls.iter()
    .map(|url| {
        let body = fetch_url(&client, url);
        let links = get_links_from_html(&body);
        println!("Visited: {} found {} links", 
            url, links.len());
        links
    }).fold(HashSet::new(), |mut acc, x| {
            acc.extend(x);
            acc
    });
    visited.extend(new_urls);

    new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();
    println!("New urls: {}", new_urls.len())
}

Error: Failed to determine type of variable `client`. rustc suggested type reqwest::blocking::client::Client, but that's private. Sometimes adding an extern crate will help rustc suggest the correct public type name, or you can give an explicit type.

### Algorithm:

Take all URLs found on home page and crawl them. 

Check what new URLs we find and repeat until we don't find any new URLs.

### Rust:

* Using `iter` to go over all elements of the set
* Using `fold` to join all the results into a single `HashSet`.
* Using a functional-ish style
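
To try the loop without touching the network, here is a sketch that crawls an in-memory "site". The `HashMap` from page to links is an assumption that replaces `fetch_url` plus `get_links_from_html`; `crawl` and `links_of` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Crawls an in-memory site (page -> links on that page) starting at
/// `origin` and returns every page reached. The map is a stand-in for
/// fetching a page and extracting its links.
fn crawl(site: &HashMap<&str, Vec<&str>>, origin: &str) -> HashSet<String> {
    let links_of = |page: &str| -> HashSet<String> {
        site.get(page)
            .map(|links| links.iter().map(|l| l.to_string()).collect::<HashSet<String>>())
            .unwrap_or_default()
    };

    let mut visited = HashSet::new();
    visited.insert(origin.to_string());
    let mut new_urls: HashSet<String> = links_of(origin)
        .difference(&visited)
        .cloned()
        .collect();

    while !new_urls.is_empty() {
        // Visit every new page and merge all discovered links into one set.
        let found: HashSet<String> = new_urls
            .iter()
            .map(|url| links_of(url.as_str()))
            .fold(HashSet::new(), |mut acc, links| {
                acc.extend(links);
                acc
            });
        visited.extend(new_urls);
        // Keep only links we have not visited yet for the next round.
        new_urls = found.difference(&visited).cloned().collect();
    }
    visited
}

fn main() {
    let mut site = HashMap::new();
    site.insert("/", vec!["/a", "/b"]);
    site.insert("/a", vec!["/", "/c"]);
    site.insert("/b", vec!["/a"]);
    site.insert("/c", vec![]);
    let visited = crawl(&site, "/");
    println!("Visited {} pages", visited.len());
}
```

The loop ends because `visited` only grows and each round keeps just the links that are not in it yet.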

In [36]:
use std::time::Instant;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let mut found_urls = new_urls.iter().map(|url| {
            let body = fetch_url(&client, url);
            let links = get_links_from_html(&body);
            println!("Visited: {} found {} links", url, links.len());
            links
        }).fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
        });
        visited.extend(new_urls);
        
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
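    // `found_urls` in this scope only holds the home page links;
    // `visited` holds every URL that was crawled.
    println!("URLs: {:#?}", visited);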
    println!("{}", now.elapsed().as_secs());

}
main()

Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/favicon.ico: 200 OK


thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }', src\lib.rs:55:5
stack backtrace:
   0: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
   1: core::fmt::write
   2: <std::io::IoSlice as core::fmt::Debug>::fmt
   3: std::panicking::take_hook
   4: std::panicking::take_hook
   5: std::panicking::rust_panic_with_hook
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::option::expect_none_failed
   9: ctx::fetch_url
  10: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
  11: run_user_code_23
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
  16: <unknown>
  17: <unknown>
  18: <unknown>
  19: <unknown>
  20: <unknown>
  21: BaseThreadInitThunk
  22: RtlUserThreadStart
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


Error: Child process terminated with status: exit code: 101

# Filtering out non-HTML links based on presence of extension

In [40]:
use std::path::Path;

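// Returns true when the URL's last segment has no file extension,
// which we treat as "probably an HTML page".
fn has_no_extension(url: &&str) -> bool {
    Path::new(url).extension().is_none()
}

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        // Skip links such as /favicon.ico before normalizing them.
        .filter(has_no_extension)
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}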

main()

Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/2020/06/08/happy-decennial/: 200 OK
Status for https://rolisz.ro/2020/06/07/new-desk-setup/: 200 OK
Visited: https://rolisz.ro/2020/06/08/happy-decennial/ found 6 links
Status for https://rolisz.ro/2020/05/08/on-text-editors/: 200 OK
Visited: https://rolisz.ro/2020/06/07/new-desk-setup/ found 7 links
Status for https://rolisz.ro/2020/05/11/quarantine-boardgames-plague-inc/: 200 OK
Status for https://rolisz.ro/2020/03/01/web-crawler-in-rust/: 200 OK
Status for https://rolisz.ro/2020/05/15/context-variables-in-python/: 200 OK
Status for https://rolisz.ro/2020/05/09/the-beauty-of-suncuius-part-2/: 200 OK
Status for https://rolisz.ro/2020/03/18/reflections-during-covid19-times/: 200 OK
Status for https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/: 200 OK
Visited: https://rolisz.ro/2020/05/08/on-text-editors/ found 7 links
Status for https://rolisz.ro/author/rolisz/: 200 OK
Status for https://rolisz.ro/2020/

Visited: https://rolisz.ro/2013/02/22/grafice-cu-d3-js-part-2/ found 7 links
Visited: https://rolisz.ro/2019/09/29/ad-astra/ found 7 links
Visited: https://rolisz.ro/2020/01/18/moving-from-acrylamid-to-ghost/ found 7 links
Status for https://rolisz.ro/2013/01/14/moar-lisp/: 200 OK
Status for https://rolisz.ro/2018/12/31/2018-in-review/: 200 OK
Status for https://rolisz.ro/2013/03/22/tutorial-camelot/: 200 OK
Status for https://rolisz.ro/2018/02/23/goals-for-2018/: 200 OK
Status for https://rolisz.ro/2020/02/07/interview-about-wfh/: 200 OK
Visited: https://rolisz.ro/2017/07/07/evaluating-goals-2017-edition/ found 7 links
Visited: https://rolisz.ro/2011/04/17/white-collar/ found 7 links
Visited: https://rolisz.ro/2019/01/14/books-of-2018/ found 7 links
Status for https://rolisz.ro/2017/05/06/happy-mother-s-day/: 200 OK
Status for https://rolisz.ro/2013/02/09/elementary/: 200 OK
Status for https://rolisz.ro/2013/12/14/code-retreat-2013/: 200 OK
Visited: https://rolisz.ro/2018/12/31/2018-i

Status for https://rolisz.ro/2012/02/04/arrested-development/: 200 OK
Visited: https://rolisz.ro/2015/01/30/comments-are-back/ found 7 links
Status for https://rolisz.ro/2010/06/14/eureka/: 200 OK
Status for https://rolisz.ro/2017/08/05/how-to-organize-a-wedding-in-20-easy-steps/: 200 OK
Status for https://rolisz.ro/2019/05/25/vot-26-mai/: 200 OK
Status for https://rolisz.ro/2012/08/02/pioneer-one/: 200 OK
Status for https://rolisz.ro/2019/04/09/hacktm-oradea/: 200 OK
Visited: https://rolisz.ro/2015/03/29/time-for-a-new-look/ found 7 links
Visited: https://rolisz.ro/2012/02/04/arrested-development/ found 7 links
Visited: https://rolisz.ro/2010/06/14/eureka/ found 7 links
Visited: https://rolisz.ro/2019/05/25/vot-26-mai/ found 7 links
Visited: https://rolisz.ro/2017/08/05/how-to-organize-a-wedding-in-20-easy-steps/ found 7 links
Status for https://rolisz.ro/2015/11/07/line-charts-in-javascript/: 200 OK
Visited: https://rolisz.ro/2020/01/21/blogs-are-best-served-static/ found 7 links
Vis

Visited: https://rolisz.ro/2014/07/29/roland-origins/ found 7 links
Visited: https://rolisz.ro/2018/12/16/synology-moments/ found 7 links
Visited: https://rolisz.ro/2017/03/09/silence/ found 7 links
Status for https://rolisz.ro/2014/11/17/global-day-of-code-retreat-2014/: 200 OK
Status for https://rolisz.ro/2012/01/15/50-apps/: 200 OK
Visited: https://rolisz.ro/2016/05/04/rise-of-the-tomb-raider/ found 7 links
Status for https://rolisz.ro/2015/09/05/san-francisco-and-the-so-called-mountain-view/: 200 OK
Visited: https://rolisz.ro/tag/review/ found 19 links
Status for https://rolisz.ro/2018/04/26/synology-and-docker/: 200 OK
Status for https://rolisz.ro/2010/07/10/sfml/: 200 OK
Visited: https://rolisz.ro/2015/02/21/what-i-think-of-the-ai-hype/ found 6 links
Status for https://rolisz.ro/tag/events/: 200 OK
Status for https://rolisz.ro/2012/06/20/chestie-utila-pt-sda/: 200 OK
Status for https://rolisz.ro/tag/birthday/: 200 OK
Visited: https://rolisz.ro/2014/11/17/global-day-of-code-retrea

Status for https://rolisz.ro/2016/12/22/shalom-from-israel/: 200 OK
Visited: https://rolisz.ro/2015/12/18/the-force-awakens/ found 7 links
Status for https://rolisz.ro/2017/06/14/istanbul-land-of-turbans-spice-and-carpets/: 200 OK
Status for https://rolisz.ro/2014/09/24/first-week-in-zurich/: 200 OK
Visited: https://rolisz.ro/2016/01/19/bose-bluetooth-headphones/ found 7 links
Visited: https://rolisz.ro/2014/08/12/delta-dunarii/ found 7 links
Visited: https://rolisz.ro/2016/08/29/my-desktop-setup/ found 7 links
Status for https://rolisz.ro/tag/presentation/: 200 OK
Visited: https://rolisz.ro/2011/08/02/introducere-in-node-js/ found 7 links
Status for https://rolisz.ro/2013/06/27/anul-2-2/: 200 OK
Status for https://rolisz.ro/tag/ssl/: 200 OK
Status for https://rolisz.ro/2015/06/12/trip-to-dublin/: 200 OK
Status for https://rolisz.ro/2012/07/20/experienta-mea-cu-orange/: 200 OK
Visited: https://rolisz.ro/2014/09/24/first-week-in-zurich/ found 7 links
Visited: https://rolisz.ro/tag/prese

Status for https://rolisz.ro/2016/02/27/new-york-again/: 200 OK
Visited: https://rolisz.ro/2010/07/25/sfml-in-visual-studio-2010/ found 7 links
Visited: https://rolisz.ro/2011/11/17/update-facultate/ found 7 links
Status for https://rolisz.ro/2014/12/01/interstellar/: 200 OK
Status for https://rolisz.ro/2014/03/02/walter-the-waiter/: 200 OK
Status for https://rolisz.ro/2011/02/04/olimpiada-nationala-de-fizica-2011-part-two/: 200 OK
Status for https://rolisz.ro/2012/01/09/dont-click-it/: 200 OK
Visited: https://rolisz.ro/tag/scraping/ found 2 links
Visited: https://rolisz.ro/2011/08/08/chat-cu-node-js/ found 7 links
Visited: https://rolisz.ro/tag/philosophy/ found 4 links
Status for https://rolisz.ro/2011/06/29/planuri-de-vacanta/: 200 OK
Status for https://rolisz.ro/tag/voting/: 200 OK
Visited: https://rolisz.ro/2014/03/02/walter-the-waiter/ found 7 links
Status for https://rolisz.ro/tag/nablopomo/: 200 OK
Visited: https://rolisz.ro/2012/01/09/dont-click-it/ found 7 links
Visited: http

Visited: https://rolisz.ro/tag/orange/ found 4 links
Visited: https://rolisz.ro/2012/03/02/instalare-linux-pe-masina-virtuala/ found 7 links
Visited: https://rolisz.ro/2014/11/05/going-to-vote/ found 6 links
Status for https://rolisz.ro/2012/01/05/2011-in-review/: 200 OK
Visited: https://rolisz.ro/2011/09/17/horvath-janos-iskolacsoport-megnyitasa/ found 6 links
Visited: https://rolisz.ro/2014/06/28/interviu-cu-echipa-secureye/ found 6 links
Visited: https://rolisz.ro/2017/11/09/blade-runner-2049/ found 7 links
Visited: https://rolisz.ro/2011/03/17/adblocking/ found 4 links
Visited: https://rolisz.ro/2012/01/05/2011-in-review/ found 7 links
New urls: 69
Status for https://rolisz.ro/tag/interview/: 200 OK
Status for https://rolisz.ro/2011/12/19/project-euler-si-ac/: 200 OK
Status for https://rolisz.ro/tag/ads/: 200 OK
Status for https://rolisz.ro/2016/03/06/mayumana/: 200 OK
Status for https://rolisz.ro/tag/tutorial/: 200 OK
Status for https://rolisz.ro/2011/10/08/filmele-verii/: 200 OK


Status for https://rolisz.ro/tag/fun/: 200 OK
Visited: https://rolisz.ro/2011/02/08/windows-7-take-ownership/ found 7 links
Visited: https://rolisz.ro/2010/11/29/50-de-ani-de-casatorie/ found 7 links
Status for https://rolisz.ro/2012/02/13/statistici-si-grafice-in-r/: 200 OK
Status for https://rolisz.ro/2012/03/18/mass-effect-3-spoilerish-rant/: 200 OK
Status for https://rolisz.ro/2011/09/04/pregatire-pentru-facultate-part-two/: 200 OK
Visited: https://rolisz.ro/2010/10/28/webcomics/ found 7 links
Visited: https://rolisz.ro/tag/goals/ found 8 links
Visited: https://rolisz.ro/tag/fun/ found 18 links
Visited: https://rolisz.ro/2013/12/06/wireshark-and-amazon-swf/ found 4 links
Visited: https://rolisz.ro/2012/01/07/redirecting-python-output-from-the-command-line/ found 7 links
Status for https://rolisz.ro/tag/rant/: 200 OK
Visited: https://rolisz.ro/2012/02/13/statistici-si-grafice-in-r/ found 7 links
Status for https://rolisz.ro/2014/11/09/rabechilbi/: 200 OK
Status for https://rolisz.ro

()

* Use the stdlib `Path` parser to get the extension
* `extension` returns an `Option`
* `is_none` is true when the `Option` is `None`, i.e. when there is no extension
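
A quick std-only check of how `Path::extension` behaves on URL-like strings. `lacks_extension` is a hypothetical name that mirrors the filter in the cell above. Since `Path` is a filesystem parser and knows nothing about URLs, treat this as a heuristic.

```rust
use std::path::Path;

/// True when the last path segment has no `.ext` suffix.
/// Hypothetical name; mirrors the check used by the crawler's filter.
fn lacks_extension(url: &str) -> bool {
    Path::new(url).extension().is_none()
}

fn main() {
    // `extension()` returns `Some(..)` for a file-like last segment...
    println!("{:?}", Path::new("/favicon.ico").extension());
    // ...and `None` for a directory-like URL path.
    println!("{:?}", Path::new("/2020/05/26/duckduckgo/").extension());

    for url in &["/favicon.ico", "/2020/05/26/duckduckgo/", "/uses/"] {
        println!("{} -> keep: {}", url, lacks_extension(url));
    }
}
```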

# Let's write the pages to disk

In [38]:
use std::fs;


fn write_file(path: &str, content: &str) {
    fs::create_dir_all(format!("static{}", path)).unwrap();
    // `fs::write` also returns a `Result`; unwrap it so failures are not silently ignored.
    fs::write(format!("static{}/index.html", path), content).unwrap();
}

* Create the folder structure, including all parents
* Write the file to `index.html` in that folder
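
A variant that returns the I/O error instead of panicking could look roughly like this. `local_path`, `write_page` and the scratch folder name are assumptions made for the sketch, not part of the notebook.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Maps a URL path such as "/2020/05/26/duckduckgo/" to
/// `<root>/2020/05/26/duckduckgo/index.html`.
fn local_path(root: &Path, url_path: &str) -> PathBuf {
    // Strip leading and trailing slashes so `join` does not treat the URL
    // path as absolute and discard `root`.
    root.join(url_path.trim_matches('/')).join("index.html")
}

/// Creates the parent folders and writes `content`, returning any I/O error.
fn write_page(root: &Path, url_path: &str, content: &str) -> io::Result<PathBuf> {
    let target = local_path(root, url_path);
    if let Some(parent) = target.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(&target, content)?;
    Ok(target)
}

fn main() -> io::Result<()> {
    // Assumption: a scratch folder under the system temp directory.
    let root = std::env::temp_dir().join("crawler_sketch_static");
    let written = write_page(&root, "/2020/05/26/duckduckgo/", "<html></html>")?;
    println!("Wrote {}", written.display());
    Ok(())
}
```

Building the target with `Path::join` rather than `format!` keeps the path separators correct on every platform.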

In [44]:
fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let mut found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);
                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
        })
        .fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
        });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
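    // `found_urls` in this scope only holds the home page links;
    // `visited` holds every URL that was crawled.
    println!("URLs: {:#?}", visited);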
    println!("{}", now.elapsed().as_secs());

}
main()


Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/2020/03/08/boardgames-party-ha/: 200 OK
Visited: https://rolisz.ro/2020/03/08/boardgames-party-ha/ found 7 links
Status for https://rolisz.ro/2020/05/09/the-beauty-of-suncuius-part-2/: 200 OK
Visited: https://rolisz.ro/2020/05/09/the-beauty-of-suncuius-part-2/ found 7 links
Status for https://rolisz.ro/2020/02/13/lost-in-space/: 200 OK
Visited: https://rolisz.ro/2020/02/13/lost-in-space/ found 7 links
Status for https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/: 200 OK
Visited: https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/ found 7 links
Status for https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/: 200 OK
Visited: https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/ found 7 links
Status for https://rolisz.ro/2020/05/07/splitting-up-my-blog/: 200 OK
Visited: https://rolisz.ro/2020/05/07/splitting-up-my-blog/ found 7 links
Status for https://rolisz.ro/2020/06/07/new-desk-setup/: 200 O

Status for https://rolisz.ro/2017/01/01/2016-in-review/: 200 OK
Visited: https://rolisz.ro/2017/01/01/2016-in-review/ found 7 links
Status for https://rolisz.ro/2020/01/01/2019-in-review/: 200 OK
Visited: https://rolisz.ro/2020/01/01/2019-in-review/ found 7 links
Status for https://rolisz.ro/2017/08/05/how-to-organize-a-wedding-in-20-easy-steps/: 200 OK
Visited: https://rolisz.ro/2017/08/05/how-to-organize-a-wedding-in-20-easy-steps/ found 7 links
Status for https://rolisz.ro/2013/08/09/fun-at-arka-park/: 200 OK
Visited: https://rolisz.ro/2013/08/09/fun-at-arka-park/ found 7 links
Status for https://rolisz.ro/2019/11/30/boardgames-party-tokaido/: 200 OK
Visited: https://rolisz.ro/2019/11/30/boardgames-party-tokaido/ found 7 links
Status for https://rolisz.ro/2018/08/12/setting-up-ssh-keys/: 200 OK
Visited: https://rolisz.ro/2018/08/12/setting-up-ssh-keys/ found 7 links
Status for https://rolisz.ro/2019/02/26/boardgames-party-splendor/: 200 OK
Visited: https://rolisz.ro/2019/02/26/board

Visited: https://rolisz.ro/2017/04/23/look-what-i-found-on-uetliberg/ found 7 links
Status for https://rolisz.ro/2016/09/17/good-bye-alida/: 200 OK
Visited: https://rolisz.ro/2016/09/17/good-bye-alida/ found 7 links
Status for https://rolisz.ro/2016/06/12/nice-time-in-nice/: 200 OK
Visited: https://rolisz.ro/2016/06/12/nice-time-in-nice/ found 7 links
Status for https://rolisz.ro/2018/12/31/2018-in-review/: 200 OK
Visited: https://rolisz.ro/2018/12/31/2018-in-review/ found 7 links
Status for https://rolisz.ro/2012/02/04/arrested-development/: 200 OK
Visited: https://rolisz.ro/2012/02/04/arrested-development/ found 7 links
Status for https://rolisz.ro/2011/01/26/sherlock/: 200 OK
Visited: https://rolisz.ro/2011/01/26/sherlock/ found 7 links
Status for https://rolisz.ro/2010/06/14/eureka/: 200 OK
Visited: https://rolisz.ro/2010/06/14/eureka/ found 7 links
Status for https://rolisz.ro/2013/12/14/code-retreat-2013/: 200 OK
Visited: https://rolisz.ro/2013/12/14/code-retreat-2013/ found 6 li

Status for https://rolisz.ro/2015/01/11/my-experience-with-linux-part-6/: 200 OK
Visited: https://rolisz.ro/2015/01/11/my-experience-with-linux-part-6/ found 7 links
Status for https://rolisz.ro/2017/12/30/let-s-go-to-africa/: 200 OK
Visited: https://rolisz.ro/2017/12/30/let-s-go-to-africa/ found 7 links
Status for https://rolisz.ro/tag/conference/: 200 OK
Visited: https://rolisz.ro/tag/conference/ found 4 links
Status for https://rolisz.ro/2015/04/10/biking/: 200 OK
Visited: https://rolisz.ro/2015/04/10/biking/ found 7 links
Status for https://rolisz.ro/2015/10/08/google-tech-talk-babes-bolyai-university/: 200 OK
Visited: https://rolisz.ro/2015/10/08/google-tech-talk-babes-bolyai-university/ found 7 links
Status for https://rolisz.ro/2014/07/05/end-of-june-end-of-college/: 200 OK
Visited: https://rolisz.ro/2014/07/05/end-of-june-end-of-college/ found 7 links
Status for https://rolisz.ro/2018/04/26/synology-and-docker/: 200 OK
Visited: https://rolisz.ro/2018/04/26/synology-and-docker/ 

Status for https://rolisz.ro/2011/07/17/lenovo-t520-review/: 200 OK
Visited: https://rolisz.ro/2011/07/17/lenovo-t520-review/ found 7 links
Status for https://rolisz.ro/2012/09/06/summing-up-contacts/: 200 OK
Visited: https://rolisz.ro/2012/09/06/summing-up-contacts/ found 7 links
Status for https://rolisz.ro/2015/06/06/gardening/: 200 OK
Visited: https://rolisz.ro/2015/06/06/gardening/ found 4 links
Status for https://rolisz.ro/2011/01/31/olimpiada-nationala-de-fizica-2011-part-one/: 200 OK
Visited: https://rolisz.ro/2011/01/31/olimpiada-nationala-de-fizica-2011-part-one/ found 7 links
Status for https://rolisz.ro/tag/acrylamid/: 200 OK
Visited: https://rolisz.ro/tag/acrylamid/ found 4 links
Status for https://rolisz.ro/2011/05/26/used-by-me-6/: 200 OK
Visited: https://rolisz.ro/2011/05/26/used-by-me-6/ found 7 links
Status for https://rolisz.ro/tag/cmd/: 200 OK
Visited: https://rolisz.ro/tag/cmd/ found 3 links
Status for https://rolisz.ro/2015/01/20/winter-holiday-movies/: 200 OK
Vis

Status for https://rolisz.ro/2013/09/15/ultima-vacanta-de-vara/: 200 OK
Visited: https://rolisz.ro/2013/09/15/ultima-vacanta-de-vara/ found 7 links
Status for https://rolisz.ro/2011/07/10/google/: 200 OK
Visited: https://rolisz.ro/2011/07/10/google/ found 7 links
Status for https://rolisz.ro/2013/12/21/gentoo-vs-rolisz-round-1/: 200 OK
Visited: https://rolisz.ro/2013/12/21/gentoo-vs-rolisz-round-1/ found 5 links
Status for https://rolisz.ro/2015/03/01/survived-first-oncall/: 200 OK
Visited: https://rolisz.ro/2015/03/01/survived-first-oncall/ found 7 links
Status for https://rolisz.ro/tag/philosophy/: 200 OK
Visited: https://rolisz.ro/tag/philosophy/ found 4 links
Status for https://rolisz.ro/tag/fail/: 200 OK
Visited: https://rolisz.ro/tag/fail/ found 7 links
Status for https://rolisz.ro/2013/11/03/nablopomo/: 200 OK
Visited: https://rolisz.ro/2013/11/03/nablopomo/ found 5 links
Status for https://rolisz.ro/2012/08/20/processing-im-logs/: 200 OK
Visited: https://rolisz.ro/2012/08/20/pr

Status for https://rolisz.ro/2014/07/05/the-dungeon/: 200 OK
Visited: https://rolisz.ro/2014/07/05/the-dungeon/ found 7 links
Status for https://rolisz.ro/tag/yahoo/: 200 OK
Visited: https://rolisz.ro/tag/yahoo/ found 2 links
Status for https://rolisz.ro/2011/03/29/polonia/: 200 OK
Visited: https://rolisz.ro/2011/03/29/polonia/ found 7 links
Status for https://rolisz.ro/2010/12/17/tron-evolution/: 200 OK
Visited: https://rolisz.ro/2010/12/17/tron-evolution/ found 7 links
Status for https://rolisz.ro/2013/11/22/studcard-omnipass-bt/: 200 OK
Visited: https://rolisz.ro/2013/11/22/studcard-omnipass-bt/ found 7 links
Status for https://rolisz.ro/2014/05/13/ghid-optionale-ubb-info/: 200 OK
Visited: https://rolisz.ro/2014/05/13/ghid-optionale-ubb-info/ found 6 links
Status for https://rolisz.ro/2010/10/23/dicotomia-mea/: 200 OK
Visited: https://rolisz.ro/2010/10/23/dicotomia-mea/ found 4 links
Status for https://rolisz.ro/2011/06/08/happy-birthday-dear-blog/: 200 OK
Visited: https://rolisz.ro

Status for https://rolisz.ro/tag/sale/: 200 OK
Visited: https://rolisz.ro/tag/sale/ found 2 links
Status for https://rolisz.ro/tag/tutorial/: 200 OK
Visited: https://rolisz.ro/tag/tutorial/ found 26 links
Status for https://rolisz.ro/2010/11/06/dimensionalitatea-spatiului-timp/: 200 OK
Visited: https://rolisz.ro/2010/11/06/dimensionalitatea-spatiului-timp/ found 7 links
Status for https://rolisz.ro/2011/02/08/windows-7-take-ownership/: 200 OK
Visited: https://rolisz.ro/2011/02/08/windows-7-take-ownership/ found 7 links
Status for https://rolisz.ro/2014/12/17/strike-the-iron-while-its-hot/: 200 OK
Visited: https://rolisz.ro/2014/12/17/strike-the-iron-while-its-hot/ found 7 links
Status for https://rolisz.ro/2017/02/20/backing-up/: 200 OK
Visited: https://rolisz.ro/2017/02/20/backing-up/ found 6 links
Status for https://rolisz.ro/2010/12/20/rapirea-lui-mos-craciun-de-catre-teroristi/: 200 OK
Visited: https://rolisz.ro/2010/12/20/rapirea-lui-mos-craciun-de-catre-teroristi/ found 4 links
S

()

# Let's parallelize it!

In [None]:
use rayon::prelude::*;

....
while !new_urls.is_empty() {
  let found_urls: HashSet<String> = new_urls
    .par_iter()
...
    
    .reduce(HashSet::new, |mut acc, x| {
                acc.extend(x);
                acc
            })

* Use the `rayon` library
* Use `par_iter` instead of `iter`
* Use the `reduce` function with a slightly different signature instead of `fold`
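
`rayon`'s `reduce` takes a closure that creates the identity value (`HashSet::new`, without parentheses) because every parallel split needs its own fresh accumulator, whereas the sequential `fold` takes a single starting value. The sketch below shows the same idea with plain `std::thread::scope` (Rust 1.63 or newer) instead of `rayon`. The names `parallel_links` and `fake_links` are hypothetical, and the fixed limit of four threads is an arbitrary choice.

```rust
use std::collections::HashSet;
use std::thread;

/// Splits `urls` into chunks, processes each chunk on its own thread,
/// then merges the partial sets. `links_for` stands in for fetch + parse.
fn parallel_links(urls: &[String], links_for: fn(&str) -> Vec<String>) -> HashSet<String> {
    let chunk_size = ((urls.len() + 3) / 4).max(1); // at most 4 threads
    thread::scope(|scope| {
        let handles: Vec<_> = urls
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // Each thread starts from its own empty set, which is why
                    // a parallel reduce needs a way to create fresh identities.
                    let mut partial = HashSet::new();
                    for url in chunk {
                        partial.extend(links_for(url));
                    }
                    partial
                })
            })
            .collect();
        // Merge the per-thread results into a single set.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(HashSet::new(), |mut acc, part| {
                acc.extend(part);
                acc
            })
    })
}

/// Hypothetical stand-in: every page links to itself plus a shared page.
fn fake_links(url: &str) -> Vec<String> {
    vec![url.to_string(), "/shared/".to_string()]
}

fn main() {
    let urls: Vec<String> = (1..=10).map(|i| format!("/post-{}/", i)).collect();
    let found = parallel_links(&urls, fake_links);
    println!("Found {} distinct links", found.len());
}
```

`rayon` does this splitting and merging for you on a work-stealing thread pool, so the manual version is only meant to show what happens underneath.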

In [39]:
use rayon::prelude::*;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .par_iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);

                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .reduce(HashSet::new, |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
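    // `found_urls` in this scope only holds the home page links;
    // `visited` holds every URL that was crawled.
    println!("URLs: {:#?}", visited);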
    println!("{}", now.elapsed().as_secs());
}
main()

Status for https://rolisz.ro/: 200 OK
Status for https://rolisz.ro/2020/05/07/splitting-up-my-blog/: 200 OK
Status for https://rolisz.ro/2020/03/01/web-crawler-in-rust/: 200 OK
Visited: https://rolisz.ro/2020/05/07/splitting-up-my-blog/ found 9 links
Status for https://rolisz.ro/2020/05/11/quarantine-boardgames-plague-inc/: 200 OK
Visited: https://rolisz.ro/2020/03/01/web-crawler-in-rust/ found 8 links
Status for https://rolisz.ro/2020/06/08/happy-decennial/: 200 OK
Status for https://rolisz.ro/2020/05/26/duckduckgo/: 200 OK
Status for https://rolisz.ro/2020/05/24/bullet-journaling/: 200 OK
Status for https://rolisz.ro/author/rolisz/: 200 OK
Status for https://rolisz.ro/2020/06/07/new-desk-setup/: 200 OK
Status for https://rolisz.ro/2020/03/08/boardgames-party-ha/: 200 OK
Status for https://rolisz.ro/2020/05/15/context-variables-in-python/: 200 OK
Visited: https://rolisz.ro/2020/06/08/happy-decennial/ found 8 links
Visited: https://rolisz.ro/2020/05/11/quarantine-boardgames-plague-inc/

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }', src\lib.rs:61:5
stack backtrace:
   0: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
   1: core::fmt::write
   2: <std::io::IoSlice as core::fmt::Debug>::fmt
   3: std::panicking::take_hook
   4: std::panicking::take_hook
   5: std::panicking::rust_panic_with_hook
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::option::expect_none_failed


Status for https://rolisz.ro/2020/06/02/quarantine-boardgames-pandemic/: 200 OK
Visited: https://rolisz.ro/2020/06/07/new-desk-setup/ found 9 links
Visited: https://rolisz.ro/2020/03/08/boardgames-party-ha/ found 9 links


   9: ctx::fetch_url
  10: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
  11: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
  12: rayon::iter::plumbing::bridge_producer_consumer::helper
  13: std::panicking::try::do_call
  14: _rust_maybe_catch_panic
  15: rayon_core::registry::in_worker
  16: rayon::iter::plumbing::bridge_producer_consumer::helper
  17: rayon_core::job::StackJob<L,F,R>::run_inline
  18: rayon_core::registry::in_worker
  19: rayon::iter::plumbing::bridge_producer_consumer::helper
  20: std::panicking::try::do_call
  21: _rust_maybe_catch_panic
  22: rayon_core::registry::in_worker
  23: rayon::iter::plumbing::bridge_producer_consumer::helper
  24: std::panicking::try::do_call
  25: _rust_maybe_catch_panic
  26: rayon_core::registry::in_worker
  27: rayon::iter::plumbing::bridge_producer_consumer::helper
  28: std::panicking::try::do_call
  29: _rust_maybe_catch_panic
  30: <std::panic::AssertUnwind

Visited: https://rolisz.ro/2020/05/15/context-variables-in-python/ found 10 links
Status for https://rolisz.ro/2020/03/18/reflections-during-covid19-times/: 200 OK


  42: BaseThreadInitThunk
  43: RtlUserThreadStart
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


Visited: https://rolisz.ro/author/rolisz/ found 28 links
Status for https://rolisz.ro/2020/05/06/connecting-to-azure-pypi-repositories/: 200 OK
Status for https://rolisz.ro/2020/05/05/why-i-blog/: 200 OK
Status for https://rolisz.ro/2020/05/20/about-time/: 200 OK
Status for https://rolisz.ro/2020/05/13/adding-search-to-static-ghost/: 200 OK
Status for https://rolisz.ro/2020/05/18/an-unexpected-error-in-rust/: 200 OK
Visited: https://rolisz.ro/2020/06/02/quarantine-boardgames-pandemic/ found 9 links
Visited: https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/ found 9 links
Visited: https://rolisz.ro/2020/03/18/reflections-during-covid19-times/ found 9 links
Status for https://rolisz.ro/2020/03/27/quarantine-boardgames-travelin/: 200 OK
Status for https://rolisz.ro/2020/06/01/yard-work/: 200 OK
Visited: https://rolisz.ro/2020/05/06/connecting-to-azure-pypi-repositories/ found 7 links
Visited: https://rolisz.ro/2020/05/05/why-i-blog/ found 9 links
Status for https://rolisz.ro/2020/0

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 123, kind: Other, message: "The filename, directory name, or volume label syntax is incorrect." }', src\lib.rs:99:5
stack backtrace:
   0: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
   1: core::fmt::write
   2: <std::io::IoSlice as core::fmt::Debug>::fmt
   3: std::panicking::take_hook
   4: std::panicking::take_hook
   5: std::panicking::rust_panic_with_hook
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::option::expect_none_failed
   9: ctx::write_file
  10: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
  11: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
  12: rayon::iter::plumbing::bridge_producer_consumer::helper
  13: std::panicking::try::do_call
  14: _rust_maybe_catch_panic
  15: rayon_core::registry::in_worker
  16: rayon::iter::plumbing::bridge_producer_consumer

Visited: https://rolisz.ro/2020/04/11/moving-away-from-gmail/ found 9 links


  40: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter
  41: std::panicking::try::do_call
  42: _rust_maybe_catch_panic
  43: std::thread::Builder::spawn
  44: ZN244_$LT$std..error..$LT$impl$u20$core..convert..From$LT$alloc..string..String$GT$$u20$for$u20$alloc..boxed..Box$LT$dyn$u20$std..error..Error$u2b$core..marker..Sync$u2b$core..marker..Send$GT$$GT$..from..StringError$u20$as$u20$core..fmt..Display$GT$3fmt17
  45: std::sys::windows::thread::Thread::new
  46: BaseThreadInitThunk
  47: RtlUserThreadStart
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


Visited: https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/ found 7 links
Visited: https://rolisz.ro/2020/05/16/quarantine-boardgames-itchy-feet/ found 9 links
Visited: https://rolisz.ro/2020/05/09/the-beauty-of-suncuius-part-2/ found 9 links
Visited: https://rolisz.ro/2020/05/14/travelers/ found 9 links
Visited: https://rolisz.ro/2020/05/08/on-text-editors/ found 9 links


Error: Child process terminated with status: exit code: 0xc0000005

## An over 3x speedup, from 30 seconds to 8 seconds, with 3 lines changed

* The compiler guarantees that the parallel version has no data races. Runtime failures, such as the `unwrap` panics in the output above, can still happen.
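
To make the data-race point concrete, here is a minimal sketch that uses only the standard library instead of rayon. Scoped threads read a shared `visited` set through an immutable reference, each returns its own set of links, and the sets are merged afterwards. `fake_links` and `crawl_level` are hypothetical names for illustration, not part of the crawler above.

```rust
use std::collections::HashSet;
use std::thread;

// Hypothetical stand-in for fetch_url + get_links_from_html.
fn fake_links(url: &str) -> HashSet<String> {
    let mut links = HashSet::new();
    links.insert(format!("{}a/", url));
    links.insert(format!("{}b/", url));
    links
}

// Visit every URL on its own thread and merge the links not yet visited.
fn crawl_level(urls: &[String], visited: &HashSet<String>) -> HashSet<String> {
    thread::scope(|s| {
        let handles: Vec<_> = urls
            .iter()
            .map(|url| {
                // Each thread only gets shared, read-only access to `visited`.
                // Calling `visited.insert(..)` in here would not compile.
                s.spawn(move || {
                    fake_links(url)
                        .into_iter()
                        .filter(|link| !visited.contains(link))
                        .collect::<HashSet<String>>()
                })
            })
            .collect();

        // Merging happens on a single thread, after the workers finish.
        let mut merged = HashSet::new();
        for handle in handles {
            merged.extend(handle.join().unwrap());
        }
        merged
    })
}
```

rayon's `par_iter().map(..)` relies on the same mechanism: the closure has to be `Send + Sync`, so mutating shared state without synchronization is rejected at compile time.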

# Questions?

# Extras

## Proper error handling

In [43]:
use std::io::Error as IoErr;
#[derive(Debug)]
enum Error {
    Write { url: String, e: IoErr },
    Fetch { url: String, e: reqwest::Error },
}
type Result<T> = std::result::Result<T, Error>;

In [44]:
impl<S: AsRef<str>> From<(S, IoErr)> for Error {
    fn from((url, e): (S, IoErr)) -> Self {
        Error::Write {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

impl<S: AsRef<str>> From<(S, reqwest::Error)> for Error {
    fn from((url, e): (S, reqwest::Error)) -> Self {
        Error::Fetch {
            url: url.as_ref().to_string(),
            e,
        }
    }
}
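
The `From` impls above are what let `?` attach context to an error. A minimal, standard-library-only sketch of the pattern follows; `CrawlError`, `failing_write` and `save` are hypothetical names used only for this illustration.

```rust
use std::io;

#[derive(Debug)]
enum CrawlError {
    Write { url: String, e: io::Error },
}

// Same shape as the impls above: build the error from a (context, cause) pair.
impl<S: AsRef<str>> From<(S, io::Error)> for CrawlError {
    fn from((url, e): (S, io::Error)) -> Self {
        CrawlError::Write {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

// Hypothetical operation that always fails, standing in for fs::write.
fn failing_write() -> io::Result<()> {
    Err(io::Error::new(io::ErrorKind::Other, "disk full"))
}

fn save(url: &str) -> Result<(), CrawlError> {
    // map_err pairs the URL with the cause; `?` then calls
    // CrawlError::from on that tuple before returning early.
    failing_write().map_err(|e| (url, e))?;
    Ok(())
}
```

Because the impl is generic over `S: AsRef<str>`, the context can be a `&str`, a `String` or a `&String` without extra conversions at the call site.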

In [48]:
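fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> Result<String> {
    // `?` converts the (url, error) tuple into `Error` through the `From` impls.
    let mut res = client.get(url).send().map_err(|e| (url, e))?;
    println!("Status for {}: {}", url, res.status());

    let mut body = String::new();
    // Fails if the body is not valid UTF-8. Note that an I/O error here is
    // reported as `Error::Write`, the variant built from `io::Error`.
    res.read_to_string(&mut body).map_err(|e| (url, e))?;
    Ok(body)
}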

Error: mismatched types

Error: mismatched types

Error: mismatched types

Error: mismatched types

In [46]:

fn write_file(path: &str, content: &str) -> Result<()> {
    // Each page is stored as static/<path>/index.html.
    let dir = format!("static{}", path);
    fs::create_dir_all(&dir).map_err(|e| (&dir, e))?;
    let index = format!("static{}/index.html", path);
    fs::write(&index, content).map_err(|e| (&index, e))?;

    Ok(())
}
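
Once `fetch_url` and `write_file` return `Result`, the crawl loop can record failures instead of panicking. A minimal sketch of one way to do that, using only the standard library: `fetch` is a hypothetical stub that replaces the real network call, and `fetch_all` is an assumed helper name.

```rust
use std::collections::HashSet;

// Hypothetical fetcher: fails for URLs that contain "bad", succeeds otherwise.
fn fetch(url: &str) -> Result<String, String> {
    if url.contains("bad") {
        Err(format!("could not fetch {}", url))
    } else {
        Ok(format!("<html>{}</html>", url))
    }
}

// Fetch every URL, keeping successes and failures apart instead of panicking.
fn fetch_all(urls: &HashSet<String>) -> (Vec<String>, Vec<String>) {
    let (ok, failed): (Vec<_>, Vec<_>) = urls
        .iter()
        .map(|url| fetch(url))
        .partition(|res| res.is_ok());

    let bodies = ok.into_iter().filter_map(Result::ok).collect();
    let errors = failed.into_iter().filter_map(Result::err).collect();
    (bodies, errors)
}
```

With this shape, one page with an invalid body or an unwritable file name ends up in the error list, and the rest of the crawl carries on.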