- Introduction
- Project 2: TV's Golden Age is real.
- Dataset: link
- Questions
- How many unique genres?
- How many unique titles per genre?
- How many shows were released every year, for each genre?
- Find the average rating of each title across seasons
- Average rating variation across genres. Which genres seem more popular?
- Average number of seasons and the connection with the average rating
- Top 10 titles based on number of seasons and average rating, and the connection with genre.
- Code notebook
- Todo: Alternate and cleaner method of generating flags
- Project 1: Federal R&D spending by agency.
This repo contains my code explorations of datasets available via the #TidyTuesday project. The idea is to evolve towards a document based on literate programming principles, with the code side by side with the exploration and analysis.
The latest project is on top.
- I use ESS (and Emacs) for all my workflows. This is a single Org document containing each exploration under a heading, with code blocks enclosed within Org-babel headers.
- Completed code snippets are tangled into scripts in the 00_scripts folder (see the example block after this list).
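For reference, a typical source block in this document looks something like the following sketch (the tangle path is illustrative, following the 00_scripts convention):

#+BEGIN_SRC R :tangle 00_scripts/example_snippet.R
## R code here is evaluated in place and tangled out to the script named above
library("tidyverse")
#+END_SRC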
Dataset: link
Loading libraries and reading in data
# Loading libraries
library("easypackages")
libraries("tidyverse", "tidyquant")
# Reading in the data directly from github
tv_ratings_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-08/IMDb_Economist_tv_ratings.csv")
Let's get a quick overview of the data available.
tv_ratings_raw %>% skimr::skim()
tv_ratings_raw %>% glimpse()
Observations:
- We have 7 variables, and all of them appear to be useful.
- There are no missing values.
- Each serial has a unique titleId. This can be useful to separate out serials from seasons; a quick check follows this list.
- What does the 'share' column imply?
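As a quick sanity check (a sketch added here, not part of the original analysis), comparing the number of distinct titleId values against the total row count should confirm that each serial spans multiple season-rows:

## Each row is a (serial, season) pair, so n_serials should be well below n_rows
tv_ratings_raw %>%
  summarise(
    n_rows    = n(),
    n_serials = n_distinct(titleId)
  )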
To get deeper insight into these serials, the genres have to be separated out. This will allow analysis of the serials based on genre.
Viewing the data shows that individual genres are separated by commas. Though I initially started with 6 splits, I found the maximum number of genres per title to be 3, and there are several titles with fewer than 3 genres.
So how many unique genres do we have?
# Note: the unique genres can also be obtained by starting with map(unique), followed by some further processing.
## test <- tv_ratings_raw %>%
## separate(genres,
## into = c("genre1_", "genre2_", "genre3_"),
## sep = ',' ,
## remove = FALSE
## ) %>%
## select(ends_with("_")) %>%
## map(unique)
all_genres <- tv_ratings_raw %>%
  separate(genres,
           into = c("genre1_", "genre2_", "genre3_"),
           sep = ",",
           remove = FALSE
           ) %>%
  select(ends_with("_")) %>%
  gather(key = genres_col,
         value = genre_all,
         na.rm = TRUE
         ) %>%
  select(genre_all) %>%
  unique() %>%
  ## Spread the unique genres into a 1-row wide tibble: one column per genre
  spread(key = genre_all,
         value = genre_all
         )
There are 22 unique genre categories. To enable further analysis and make the data more suitable for machine learning, the genres could be combined with the dataset as separate columns, with a logical value denoting whether the serial falls into the category or not. In addition, it would be nice to standardise the genre name formatting, especially for names like Sci-Fi and Reality-TV (a sketch follows).
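As a sketch of that standardisation (not applied in the analysis below; tv_ratings_std is an illustrative name), the hyphens could be replaced up front so that the resulting flag columns get syntactic names:

## Hypothetical pre-processing step: make genre names syntactic (Sci-Fi -> Sci_Fi)
tv_ratings_std <- tv_ratings_raw %>%
  mutate(genres = str_replace_all(genres, "-", "_"))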
With 22 genre columns to consider, a tidyeval function might help.
## Prefix each genre column with "g_", then bind the genre columns onto the
## ratings table (bind_rows adds the g_ columns, which are filled in as flags below)
names(all_genres) <- str_c("g_", names(all_genres))
tv_ratings_conditioned <- bind_rows(tv_ratings_raw, all_genres)
all_genres %>% glimpse()
## Tidyeval helper: adds a 0/1 column named after flag_col, flagging rows where
## search_col matches search_pattern
genre_flag_fn <- function(data, search_col = genres, search_pattern, flag_col)
{
  flag_col <- enquo(flag_col)
  flag_col_name <- quo_name(flag_col)
  search_col <- enquo(search_col)
  data %>%
    mutate(!!flag_col_name := case_when(
      str_detect(!!search_col, search_pattern) ~ 1,
      TRUE ~ 0
    ))
}
tv_ratings_genre_sep_tbl <- tv_ratings_conditioned %>%
genre_flag_fn(search_pattern = "Action" , flag_col = g_Action ) %>%
genre_flag_fn(search_pattern = "Adventure" , flag_col = g_Adventure ) %>%
genre_flag_fn(search_pattern = "Animation" , flag_col = g_Animation ) %>%
genre_flag_fn(search_pattern = "Biography" , flag_col = g_Biography ) %>%
genre_flag_fn(search_pattern = "Comedy" , flag_col = g_Comedy ) %>%
genre_flag_fn(search_pattern = "Crime" , flag_col = g_Crime ) %>%
genre_flag_fn(search_pattern = "Documentary" , flag_col = g_Documentary ) %>%
genre_flag_fn(search_pattern = "Drama" , flag_col = g_Drama ) %>%
genre_flag_fn(search_pattern = "Family" , flag_col = g_Family ) %>%
genre_flag_fn(search_pattern = "Fantasy" , flag_col = g_Fantasy ) %>%
genre_flag_fn(search_pattern = "History" , flag_col = g_History ) %>%
genre_flag_fn(search_pattern = "Horror" , flag_col = g_Horror ) %>%
genre_flag_fn(search_pattern = "Music" , flag_col = g_Music ) %>%
genre_flag_fn(search_pattern = "Musical" , flag_col = g_Musical ) %>%
genre_flag_fn(search_pattern = "Mystery" , flag_col = g_Mystery ) %>%
genre_flag_fn(search_pattern = "Reality-TV" , flag_col = `g_Reality-TV`) %>%
genre_flag_fn(search_pattern = "Romance" , flag_col = g_Romance ) %>%
genre_flag_fn(search_pattern = "Sci-Fi" , flag_col = `g_Sci-Fi` ) %>%
genre_flag_fn(search_pattern = "Sport" , flag_col = g_Sport ) %>%
genre_flag_fn(search_pattern = "Thriller" , flag_col = g_Thriller ) %>%
genre_flag_fn(search_pattern = "War" , flag_col = g_War ) %>%
genre_flag_fn(search_pattern = "Western" , flag_col = g_Western )
Considering releases of serials per year, what was the distribution of genres among the serials released each year? This could include a new season of a serial. A serial will generally run at least a season or two, unless it is terribly unpopular. However, what was the overall mood and distribution per genre?
tv_genre_y <- tv_ratings_genre_sep_tbl %>%
  mutate(dt_year = year(date)) %>%
  group_by(dt_year) %>%
  ## Count each serial only once per year; this must happen before titleId is dropped
  filter(!duplicated(titleId)) %>%
  select(-c(seasonNumber,
            titleId,
            title,
            av_rating,
            share,
            genres,
            date)) %>%
  summarise_if(.predicate = is.numeric, funs(sum(.))) %>%
  ungroup() %>%
  arrange(desc(dt_year))
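A possible next step (a sketch; tv_genre_y_long and its column names are illustrative) is to gather the per-genre counts back into long format, which is the shape ggplot2 prefers for plotting per-genre trends:

## Reshape the wide per-year genre counts into long format for plotting
tv_genre_y_long <- tv_genre_y %>%
  gather(key = genre, value = n_titles, -dt_year)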
## Todo (WIP): an alternate and cleaner method of generating the genre flags,
## replacing the 22 chained genre_flag_fn() calls above.

## Count the occurrences of each genre, and extract the unique genre names
## as a character vector to iterate over
key_terms <- tv_ratings_raw %>%
  separate(genres,
           into = c("genre1_", "genre2_", "genre3_"),
           sep = ",",
           remove = FALSE
           ) %>%
  select(ends_with("_")) %>%
  gather(key = genres_col,
         value = genre_all,
         na.rm = TRUE
         ) %>%
  count(genre_all)

search_pattern <- key_terms %>% pull(genre_all)

## Multi-pattern version of genre_flag_fn(): purrr::reduce() folds over the
## pattern vector, adding one 0/1 flag column (g_<genre>) per pattern
genre_flag_multi_fn <- function(data, search_col = genres, patterns)
{
  search_col <- enquo(search_col)
  reduce(
    patterns,
    function(d, pattern) {
      d %>%
        mutate(!!str_c("g_", pattern) := case_when(
          str_detect(!!search_col, fixed(pattern)) ~ 1,
          TRUE ~ 0
        ))
    },
    .init = data
  )
}

tv_ratings_raw %>%
  select(genres) %>%
  genre_flag_multi_fn(patterns = search_pattern)
Dataset: link
The current code explores only the Climate Spending portion of the dataset.
# Loading libraries
library("easypackages")
libraries("tidyverse", "tidyquant")
# Reading in data directly from github
climate_spend_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/climate_spending.csv", col_types = "cin")
# This initial conditioning need not have involved the date manipulation, as the year extracted from a date object is still a double.
climate_spend_conditioned <- climate_spend_raw %>%
mutate(year_dt = str_glue("{year}-01-01")) %>%
mutate(year_dt = as.Date(year_dt)) %>%
mutate(gcc_spending_txt = scales::dollar(gcc_spending,
scale = 1e-09,
suffix = "B"
)
)
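Following the note in the block above, a simpler version of the conditioning (a sketch; climate_spend_conditioned_alt is an illustrative name) could skip the date round-trip entirely, since the integer year column is directly usable:

## Simpler alternative: keep year as a number, only add the formatted text column
climate_spend_conditioned_alt <- climate_spend_raw %>%
  mutate(gcc_spending_txt = scales::dollar(gcc_spending,
                                           scale = 1e-09,
                                           suffix = "B"
                                           ))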
climate_spend_dept_y <- climate_spend_conditioned %>%
group_by(department, year_dt = year(year_dt)) %>%
summarise(
tot_spend_dept_y = sum(gcc_spending)) %>%
mutate(tot_spend_dept_y_txt = tot_spend_dept_y %>%
scales::dollar(scale = 1e-09,
suffix = "B")
) %>%
ungroup()
glimpse(climate_spend_dept_y)
climate_spend_plt_fn <- function(data,
                                 y_range_low = 2000,
                                 y_range_hi = 2010,
                                 ncol = 3,
                                 caption = "")
{
  data %>%
    filter(year_dt >= y_range_low & year_dt <= y_range_hi) %>%
    ## Plot the numeric totals and format the axis labels, rather than plotting
    ## the pre-formatted text column (which would make the y axis discrete)
    ggplot(aes(y = tot_spend_dept_y, x = department, fill = department)) +
    geom_col() +
    facet_wrap(~ year_dt,
               ncol = ncol,
               scales = "free_y") +
    scale_y_continuous(labels = scales::dollar_format(scale = 1e-09, suffix = "B")) +
    theme_tq() +
    scale_fill_tq(theme = "dark") +
    theme(
      axis.text.x = element_text(angle = 45,
                                 hjust = 1),
      legend.position = "none",
      plot.background = element_rect(fill = "#f7f7f7")
    ) +
    labs(
      title = str_glue("Federal R&D budget towards Climate Change: {y_range_low}-{y_range_hi}"),
      x = "Department",
      y = "Total Budget $ Billion",
      subtitle = "NASA dwarfs all the other departments, spending upwards of 1.1 billion dollars every year since 2000.",
      caption = caption
    )
}
climate_spend_plt_fn(climate_spend_dept_y,
y_range_low = 2000,
y_range_hi = 2017,
caption = "#TidyTuesday:\nDataset 2019-02-12\nShreyas Ragavan"
)
## The remaining code is partially complete and is in place for further exploration planned in the future.
## Code to download all the data.
## fed_rd <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/fed_r_d_spending.csv")
## energy_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/energy_spending.csv")
## climate_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/climate_spending.csv")
## climate_spend_pct_all <- climate_spend_conditioned %>%
## group_by(year_dt = year(year_dt)) %>%
## summarise(
## tot_spend_all_y = sum(gcc_spending)
## ) %>%
## mutate(tot_spend_all_y_txt = tot_spend_all_y %>%
## scales::dollar(scale = 1e-09,
## suffix = "B"
## )
## )%>%
## ungroup() %>%
## mutate(tot_spend_all_lag = lag(tot_spend_all_y, 1)) %>%
## tidyr::fill(tot_spend_all_lag ,.direction = "up") %>%
## mutate(tot_spend_all_pct = (tot_spend_all_y - tot_spend_all_lag)/ tot_spend_all_y,
## tot_spend_all_pct_txt = scales::percent(tot_spend_all_pct, accuracy = 1e-02)
## )