
Add usdoj blog post #9

Merged
merged 2 commits into from
Apr 3, 2023
Conversation

stephbuon
Member

No description provided.

Member

@antagomir antagomir left a comment


Looks good and the length is also suitable!

I suggest that @pitkant also approve this before release.

@antagomir antagomir requested a review from pitkant April 2, 2023 10:00
Member

@pitkant pitkant left a comment


Looking good, thanks for submitting this!

@pitkant pitkant merged commit d8b1bb6 into rOpenGov:master Apr 3, 2023
@pitkant
Member

pitkant commented Apr 3, 2023

It will take a while longer for the blog post to appear on the website; we're looking into it.

@stephbuon
Member Author

Thank you for your help!!

@stephbuon
Member Author

Hello, @pitkant -- I saw the blog isn't up yet. Is there something I can do to help? Thanks.

@pitkant
Member

pitkant commented May 3, 2023

Hi @stephbuon, we have now figured out the web server rendering problems together with the University IT people. Apologies that this took so long!

I tried re-rendering your blogpost from the .Rmd file and encountered the following problems:

  • Row 32: downloading 100,000 press releases caused, on several occasions, a situation where the download would seemingly stall. When I manually cancelled the process I got the following message:

```
Warning: Error in curl::curl_fetch_memory: Operation was aborted by an application callback
```

I don't know what that warning message means, but I then read the documentation on the usdoj API site:

There is a maximum limit of 50 results per request. If you request a pagesize that is larger than 50, you will receive a response with no more than 50 results. Developers leveraging this API should keep the stability of the API and their own applications in mind. Individual users issuing more than 10 requests per second will experience degraded performance and may be blocked entirely.

For smaller numbers of downloaded press releases (10,000, 20,000, etc.) things worked fine, so I assumed my download was getting throttled. Funnily enough, when I set my location to the US East Coast with a VPN I was able to download the 100K records, albeit very slowly. Maybe the DOJ API discriminates against calls from abroad? Anyway, all this may be more related to the usdoj package than to the context of this blog post, so let's move on...
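Given those documented limits, a throttled, paginated download loop might avoid the stall. This is only a hypothetical sketch against the raw API: the endpoint URL and the `page`/`pagesize` query parameters are my assumptions based on the quoted API documentation, not something taken from the usdoj package.

```r
# Hypothetical sketch: fetch press releases page by page, respecting the
# documented 50-results-per-page cap and staying well under 10 requests/second.
# The endpoint URL and query parameter names are assumptions, not verified.
library(httr)
library(jsonlite)

fetch_page <- function(page, pagesize = 50) {
  resp <- GET("https://www.justice.gov/api/v1/press_releases.json",
              query = list(pagesize = pagesize, page = page))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$results
}

fetch_many <- function(n_pages) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    pages[[i]] <- fetch_page(i - 1)  # assuming zero-indexed pages
    Sys.sleep(0.15)                  # simple rate limiting between requests
  }
  do.call(rbind, pages)
}
```

If throttling really is the culprit, a loop like this at least fails one page at a time instead of stalling an entire 100K-record request.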

  • Row 50: I wasn't able to replicate the colourful plot_usmap graph even when I was able to download the 100K records. For some reason my earliest_date was Jan 5, 2009 and latest_date was Jan 19, 2009, so maybe the observed differences were too small to be visible. Here is the map:

[attached: rendered plot_usmap image]

  • Rows 99-102: I wasn't able to replicate the described phenomenon of multiple values in a single field: "A single field may contain multiple values. For example, the field "name" contains the (sometimes multiple) US DOJ divisions related to a press release, as shown by lines 7 and 9. A single press release may relate to USAOs across multiple states or may implicate multiple offices." I got the following output:

```r
head(press_releases$name, 10)
 [1] "Office of the Attorney General"
 [2] "Civil Rights Division"
 [3] "Civil Division"
 [4] "Criminal Division"
 [5] "Environment and Natural Resources Division"
 [6] "Office of the Deputy Attorney General"
 [7] "Environment and Natural Resources Division"
 [8] "Tax Division"
 [9] "Criminal Division"
[10] "Tax Division"
```
  • Rows 113-118: I'm not sure if the code here is correct. For example,

```r
state_names <- paste(statepop$full, collapse = "|USAO - ")
```

returns the following:

```
[1] "Alabama|USAO - Alaska|USAO - Arizona|USAO - Arkansas|USAO - California|USAO - Colorado|USAO - Connecticut|USAO - Delaware|USAO - District of Columbia|USAO - Florida|USAO - Georgia|USAO - Hawaii|USAO - Idaho|USAO - Illinois|USAO - Indiana|USAO - Iowa|USAO - Kansas|USAO - Kentucky|USAO - Louisiana|USAO - Maine|USAO - Maryland|USAO - Massachusetts|USAO - Michigan|USAO - Minnesota|USAO - Mississippi|USAO - Missouri|USAO - Montana|USAO - Nebraska|USAO - Nevada|USAO - New Hampshire|USAO - New Jersey|USAO - New Mexico|USAO - New York|USAO - North Carolina|USAO - North Dakota|USAO - Ohio|USAO - Oklahoma|USAO - Oregon|USAO - Pennsylvania|USAO - Rhode Island|USAO - South Carolina|USAO - South Dakota|USAO - Tennessee|USAO - Texas|USAO - Utah|USAO - Vermont|USAO - Virginia|USAO - Washington|USAO - West Virginia|USAO - Wisconsin|USAO - Wyoming"
```

So it's a single item. I think the intention was to prefix each state name with "USAO - "? Because of this, the str_extract call fails on row 115, and probably as a result the code on rows 124-126 returns an empty tibble, which in turn causes the final visualization on rows 171-179 to fail with the following error:

```
Error in `combine_vars()`:
! Faceting variables must have at least one value
Backtrace:
 1. base (local) `<fn>`(x)
 2. ggplot2:::print.ggplot(x)
 4. ggplot2:::ggplot_build.ggplot(x)
 5. layout$setup(data, plot$data, plot$plot_env)
 6. ggplot2 (local) setup(..., self = self)
 7. self$facet$compute_layout(data, self$facet_params)
 8. ggplot2 (local) compute_layout(..., self = self)
 9. ggplot2::combine_vars(data, params$plot_env, vars, drop = params$drop)
```
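For what it's worth, a small change seems to produce what the original code likely intended: prefixing every state name with "USAO - " and joining the results with "|" into a single alternation pattern. This is just my guess at the intent, not a confirmed fix:

```r
library(usmap)    # provides statepop
library(stringr)

# paste0() prepends the prefix to every element before collapsing, so the
# pattern becomes "USAO - Alabama|USAO - Alaska|...|USAO - Wyoming"
state_pattern <- paste0("USAO - ", statepop$full, collapse = "|")

# each name can then be matched against the alternation pattern, e.g.:
str_extract("USAO - Texas, Southern District", state_pattern)
#> [1] "USAO - Texas"
```

With a pattern like this, str_extract would return the matching "USAO - <State>" substring instead of failing on one giant unmatched string.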

If you could re-render the html file and re-upload it to this repository, we could see if things work correctly now. Thanks and sorry for the inconvenience!

@stephbuon
Member Author

stephbuon commented May 8, 2023

Thanks for this @pitkant ! It was meant to read from a csv file, not pull from the API in real time. Let me fix that and send it back to you.

@stephbuon
Member Author

stephbuon commented May 8, 2023

Hi, @pitkant -- The code that pulls 100,000 press releases should have been only for my use (not to be viewed in the blog post). I uploaded a markdown file with echo=F in the map building section so that I would just display the visual (which reads from a CSV file). Does this process not work on your server? Should I do something else?


````
```{r, echo=F, message=FALSE}
library(usmap)
library(lubridate)
library(tidyverse)
library(usdoj)

# press_releases <- doj_press_releases(n_results = 100000, search_direction = "DESC")
# write_csv(press_releases, "press_releases_doj_intro.csv")
press_releases <- read_csv("press_releases_doj_intro.csv")
```
````

@pitkant
Member

pitkant commented May 9, 2023

Hi @stephbuon -- In the case of .Rmd files, the server does not do any rendering; it just displays already-rendered HTML files. In the case of plain .md files it does some basic parsing to display the page, but that does not allow for more complicated blog posts, such as ones with embedded interactive visualisations and so on.

I didn't mean that you should do anything differently. I was just trying to replicate the HTML file on my own computer to see if the output differs from the one that is there now. I don't know if this is significant, but it seems that some older .html files have the YAML front matter left in them whereas yours doesn't. There are also some other minute differences.

For example, compare these two:
usdoj-cran-release/index.en.html
minatutkin-twiitit/index.fi.html

It's clearly a knitting-related issue, but it's hard to say without being able to re-render the HTML file. Maybe you could try using this formatting in the front matter:

```yaml
output:
  blogdown::html_page:
    highlight: tango
```

instead of this

```yaml
output: blogdown::html_page
```

@pitkant
Member

pitkant commented May 11, 2023

I made an attempt at converting the blog post from .Rmd to an .md file and it worked; the blog post is now live here: https://ropengov.org/2023/04/usdoj-cran-release/

Maybe we should still try to fix the .Rmd file somehow; I can't come up with any reason other than the problem being in the YAML front matter.

@pitkant
Member

pitkant commented May 11, 2023

I made some further changes to the website, converting some older blog posts to use standard code fences:

```r
some r code example
```

instead of this

```{% highlight r %}
some r code example
```

and updated config.toml to use our preferred syntax highlighting style, tango, instead of the Hugo default, monokai. (There are some good-looking alternatives too; maybe we can consider those at some point: https://xyproto.github.io/splash/docs/all.html)
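For reference, the highlighting style in Hugo is usually set in config.toml roughly like this (a sketch; the exact nesting may differ depending on the Hugo version and how the site is set up):

```toml
[markup]
  [markup.highlight]
    style = "tango"
```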

The blog post should now look very similar to what it would look like if it were rendered from an .Rmd file. @stephbuon can decide if it's good enough for now.

@stephbuon
Member Author

stephbuon commented May 18, 2023

Thank you, @pitkant -- I really appreciate your help with this.

Is it possible to remove the code before the map of the United States and just show the map of the United States?

In the future I will use the method you used to create this (instead of the .Rmd file).

@pitkant
Member

pitkant commented May 19, 2023

It's just my personal preference, but I think it's nice to have some code examples alongside the visualisations, even if they aren't fully reproducible. Maybe instead of removing it all, the code chunk could be slightly modified to make it clear that it's for illustrative purposes and that reproducing the example as presented would not work due to API limitations, etc.?

Something to this effect:

Original:

```r
# press_releases <- doj_press_releases(n_results = 100000, search_direction = "DESC")
# write_csv(press_releases, "press_releases_doj_intro.csv")
press_releases <- read_csv("press_releases_doj_intro.csv")
```

Modified:

```r
# A NON-REPRODUCIBLE example of downloading a large number of press releases and saving them
# press_releases <- doj_press_releases(n_results = 100000, search_direction = "DESC")
# write_csv(press_releases, "press_releases_doj_intro.csv")
# press_releases <- read_csv("press_releases_doj_intro.csv")
```

But if you wish that all of lines 21-57 (

```r
library(usmap)
library(lubridate)
library(tidyverse)
library(usdoj)

# press_releases <- doj_press_releases(n_results = 100000, search_direction = "DESC")
# write_csv(press_releases, "press_releases_doj_intro.csv")
press_releases <- read_csv("press_releases_doj_intro.csv")

state <- statepop$full
count <- list()
for (state_name in state) {
  count <- append(count, sum(str_count(press_releases$name, state_name)))
}
df <- data.frame(state = unlist(state), count = unlist(count))

earliest_date <- ymd(min(press_releases$date))
earliest_date <- paste0(month(earliest_date, label = TRUE), " ", day(earliest_date), ", ", year(earliest_date))
latest_date <- ymd(max(press_releases$date))
latest_date <- paste0(month(latest_date, label = TRUE), " ", day(latest_date), ", ", year(latest_date))

plot_usmap(data = df, values = "count", color = "#4682b4") +
  scale_fill_continuous(low = "white",
                        high = "#4682b4",
                        name = "n",
                        label = scales::comma) +
  theme(legend.position = "right") +
  labs(title = "US DOJ Press Releases Involving the FBI Corresponding to State",
       subtitle = paste0("Raw Count From ", earliest_date, " to ", latest_date),
       caption = "This plot was generated using data from usdoj. It visualizes the raw count of press releases that are tagged as involving both the FBI and a state's office of the United States Attorney.")
```
) be removed, I can of course do that for you. Some added clarification might then be needed, even though the sentence "Data is cleaned and structured before it is returned as a data frame with fields for the body text, date, title, url, the name of the corresponding division, to name just a few" explains the process pretty well?

@stephbuon
Member Author

Hi, @pitkant ! If you think having non-reproducible code is useful, I take your word for it and am happy to keep it! Good idea adding the disclaimer on top.

I'm happy to publish the blog post (including the disclaimer and code before the first visualization) if you also think it looks good enough.
