Skip to content

R data package of German NDR Coronavirusupdate podcast

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

quickcoffee/coronavirusupdate

Repository files navigation

coronavirusupdate

When the NDR Coronavirus-Update announced that they would make transcripts of all episodes publicly available I firstly tried to scrape the PDF files, which kind of worked but was not very reliable. So I went back and now used the transcripts directly from the podcast’s homepage.
Consequently I decided to create a R package to make it easy for anyone to scrape and access the data.
In this repo every Wednesday a GitHub action is triggered to get the newest transcripts from NDR.de (as long as the website doesn’t change).
For all non-R users you can also find the data as parquet file in data/coronavirusupdate_transcripts.parquet.

Installation

You can install this package from GitHub with:

# install.packages("devtools")
devtools::install_github("quickcoffee/coronavirusupdate")

Example

This is a basic example how to load the transcript data:

library(coronavirusupdate)
## load data
data("coronavirusupdate_transcripts")

This will load a tidy dataframe called coronavirus_update with one row per paragraph. Currently this includes 114 episodes with the following structure:

library(dplyr)
glimpse(coronavirusupdate_transcripts)
#> Rows: 10,295
#> Columns: 8
#> $ title            <chr> "Mit BA.4 und BA.5 in den Sommer", "Mit BA.4 und BA.5…
#> $ link             <chr> "https://www.ndr.de/nachrichten/info/114-Coronavirus-…
#> $ episode_no       <int> 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114…
#> $ speaker          <chr> "Beke Schulmann", "Sandra Ciesek", "Sandra Ciesek", "…
#> $ text             <chr> "Die Zahl der Neuinfektionen steigt wieder nach einer…
#> $ paragraph_no     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
#> $ last_change      <dttm> 2022-06-17 17:00:00, 2022-06-17 17:00:00, 2022-06-17…
#> $ duration_episode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

With this dataset one could for example inspect the share per speaker and episode:

library(ggplot2)

coronavirusupdate_transcripts %>%
  mutate(paragraph_length = nchar(text)) %>% 
  group_by(episode_no, speaker) %>% 
  summarise(speaker_length = sum(paragraph_length)) %>% 
  group_by(episode_no) %>% 
  mutate(speaker_share = speaker_length/sum(speaker_length)) %>% 
  ggplot(aes(x=episode_no, y=speaker_share, fill = speaker))+
  geom_bar(position="stack", stat="identity")+
  theme_minimal()+
  theme(legend.position="bottom")+
  labs(title = "Share of speaker per episode",
       y = "Share",
       x = "Episode No",
       fill = element_blank())

Or one can create a tf-idf analysis to find out which words are “unique” to each speaker:

library(tidytext) #for text processing and analysis
library(ggplot2) # for plotting

# create tibble with stopwords from stopwords package and some very frequent words in the corpus
stop_words_de <- tibble(word = c(stopwords::stopwords("de"),
                                 "dass",
                                 "schon",
                                 "mal",
                                 "ja",
                                 "ganz",
                                 "gibt",
                                 "sagen"))

# get top 6 speaker by paragraph count
top_speaker <- coronavirusupdate_transcripts %>% 
  count(speaker, sort = TRUE) %>% 
  head(6) %>% 
  pull(speaker)

tidy_corona <- coronavirusupdate_transcripts %>%
  filter(speaker %in% top_speaker) %>%
  select(episode_no, speaker, text) %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words_de)

#get tf-idf per speaker
tidy_corona_tf_idf <- tidy_corona %>%
  count(speaker, word) %>% 
  ungroup() %>% 
  bind_tf_idf(word, speaker, n)


tidy_corona_tf_idf %>% 
  group_by(speaker) %>% 
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  ggplot(aes(tf_idf, word, fill = speaker)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~speaker, ncol = 2, scales = "free") +
  labs(title = "Highest tf-idf words of top 6 speakers from the NDR Coronavirusupdate",
       caption = "Transcripts from NDR.de",
       x = "tf-idf", y = NULL)+
  theme_minimal()

About

R data package of German NDR Coronavirusupdate podcast

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages