# A08 - [Project: Scraping Nuclear Reactors](https://dtkaplan.github.io/DataComputingEbook/project-scraping-nuclear-reactors.html#project-scraping-nuclear-reactors)
Kaplan, Daniel & Matthew Beckman. (2021). _Data Computing_. 2nd Ed. [Home](https://dtkaplan.github.io/DataComputingEbook/).

https://davefriedman01.github.io/Mathematics/computer/program/rlang/STAT184/intro.html

---

```{admonition} Revised
19 Jun 2023
```
```{contents}
```

---

## Programming Environment

In [2]:
library(rvest)
library(tidyverse)

str_c('EXECUTED : ', now())
sessionInfo()

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2    
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.2   tidyverse_2.0.0 rvest_1.0.3    

loaded via a namespace (and not attached):
 [1] gtable_0.3.3     jsonlite_1.8.5   compiler_4.3.0   crayon_1.5.2    
 [5] tidyselect_1.2.0 IRdisplay_1.1    xml2_1.3.4       scales_1.2.1    
 [9] uuid_1.1-0       fastma

---

In [28]:
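# The Wikipedia article is titled "List of commercial nuclear reactors"
page      <- 'https://en.wikipedia.org/wiki/List_of_commercial_nuclear_reactors'
tableList <-
  page %>%
    read_html() %>%                   # download and parse the page
    html_elements(css = 'table') %>%  # every <table> node; html_nodes() is superseded
    html_table()                      # one tibble per table; `fill` is deprecated in rvest 1.0
length(tableList)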

Japan <-
  tableList[[21]] %>%
    select(1:9)
#names(Japan)[c(3, 6)] <- c('type', 'grossMW')
head(Japan)

Plantname,UnitNo.,Type,Model,Status,Capacity(MW),Beginbuilding,Commercialoperation,Closed
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Plantname,UnitNo.,Type,Model,Status,Capacity(MW),Beginbuilding,Commercialoperation,Closed
Fukushima Daiichi,1,BWR,BWR-3,Inoperable,439,25 Jul 1967,26 Mar 1971,19 May 2011
Fukushima Daiichi,2,BWR,BWR-4,Inoperable,760,9 Jun 1969,18 Jul 1974,19 May 2011
Fukushima Daiichi,3,BWR,BWR-4,Inoperable,760,28 Dec 1970,27 Mar 1976,19 May 2011
Fukushima Daiichi,4,BWR,BWR-4,Inoperable,760,12 Feb 1973,12 Oct 1978,19 May 2011
Fukushima Daiichi,5,BWR,BWR-4,Shut down,760,22 May 1972,18 Apr 1978,17 Dec 2013
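
The index in `tableList[[21]]` reflects the page layout at the time of scraping and may shift whenever the article is edited. As a minimal sketch of a less position-dependent approach, the table can be chosen by its column names instead. The objects `toyTables`, `wanted`, and `reactors` are hypothetical names, and the list below is a small hand-built stand-in for `tableList`, not the live Wikipedia data.

```r
library(purrr)
library(tibble)

# Toy stand-in for the list returned by html_table(); not scraped data
toyTables <- list(
  tibble(Country = c("Japan", "France")),
  tibble(
    Plantname = c("Plantname", "Fukushima Daiichi"),
    Type      = c("Type", "BWR")
  )
)

# Column names the table of interest is assumed to have
wanted <- c("Plantname", "Type")

# detect() returns the first list element for which the predicate is TRUE
reactors <- detect(toyTables, function(tbl) all(wanted %in% names(tbl)))
names(reactors)
```

If several tables share the same column names (one per country, for example), `detect()` returns only the first, so a further check on the contents would still be needed.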


Among other things, the scraped variable names are awkward: several run words together (`Plantname`, `Beginbuilding`) and two contain punctuation (`UnitNo.`, `Capacity(MW)`), so they must be quoted with backticks. You can rename variables using the data verb `rename()`, choosing appropriate names from the Wikipedia table. Another problem is that the first row is not data but a repeat of the variable names, so row 1 should be dropped.

In [29]:
Japan <-
  Japan %>%
    filter(row_number() > 1) %>%
    rename(
      name         = Plantname,
      reactor      = `UnitNo.`,
      model        = Model,
      status       = Status,
      newMW        = `Capacity(MW)`,
      construction = Beginbuilding,
      operation    = Commercialoperation,
      closure      = Closed
    )
head(Japan)

name,reactor,Type,model,status,newMW,construction,operation,closure
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Fukushima Daiichi,1,BWR,BWR-3,Inoperable,439,25 Jul 1967,26 Mar 1971,19 May 2011
Fukushima Daiichi,2,BWR,BWR-4,Inoperable,760,9 Jun 1969,18 Jul 1974,19 May 2011
Fukushima Daiichi,3,BWR,BWR-4,Inoperable,760,28 Dec 1970,27 Mar 1976,19 May 2011
Fukushima Daiichi,4,BWR,BWR-4,Inoperable,760,12 Feb 1973,12 Oct 1978,19 May 2011
Fukushima Daiichi,5,BWR,BWR-4,Shut down,760,22 May 1972,18 Apr 1978,17 Dec 2013
Fukushima Daiichi,6,BWR,BWR-5,Shut down,1067,26 Oct 1973,24 Oct 1979,17 Dec 2013


In [30]:
str(Japan)

tibble [68 × 9] (S3: tbl_df/tbl/data.frame)
 $ name        : chr [1:68] "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" ...
 $ reactor     : chr [1:68] "1" "2" "3" "4" ...
 $ Type        : chr [1:68] "BWR" "BWR" "BWR" "BWR" ...
 $ model       : chr [1:68] "BWR-3" "BWR-4" "BWR-4" "BWR-4" ...
 $ status      : chr [1:68] "Inoperable" "Inoperable" "Inoperable" "Inoperable" ...
 $ newMW       : chr [1:68] "439" "760" "760" "760" ...
 $ construction: chr [1:68] "25 Jul 1967" "9 Jun 1969" "28 Dec 1970" "12 Feb 1973" ...
 $ operation   : chr [1:68] "26 Mar 1971" "18 Jul 1974" "27 Mar 1976" "12 Oct 1978" ...
 $ closure     : chr [1:68] "19 May 2011" "19 May 2011" "19 May 2011" "19 May 2011" ...


In [None]:
Japan %>%
  mutate(
    
  )
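
The `str()` output shows that every column is stored as character, including the capacity and the three date columns. The following is a minimal sketch of one way to convert them, not necessarily the conversion intended for the empty `mutate()` call. It assumes the dates are always in day-month-year form with English month abbreviations and that `dmy()` runs in an English or C locale. `JapanSample` and `JapanTyped` are hypothetical names, and the rows are copied from the `head(Japan)` output rather than scraped.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Three rows copied from the head(Japan) output; all columns are character
JapanSample <- tribble(
  ~name,               ~newMW, ~construction, ~operation,    ~closure,
  "Fukushima Daiichi", "439",  "25 Jul 1967", "26 Mar 1971", "19 May 2011",
  "Fukushima Daiichi", "760",  "9 Jun 1969",  "18 Jul 1974", "19 May 2011",
  "Fukushima Daiichi", "1067", "26 Oct 1973", "24 Oct 1979", "17 Dec 2013"
)

JapanTyped <-
  JapanSample %>%
    mutate(
      newMW = as.numeric(newMW),                          # capacity as a number
      across(c(construction, operation, closure), dmy)    # day-month-year text to Date
    )
str(JapanTyped)
```

On the full table, cells that are empty or not in day-month-year form would come back from `dmy()` as `NA` with a warning, so the converted columns are worth inspecting before plotting.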

---