-
Notifications
You must be signed in to change notification settings - Fork 54
/
emoji_vignette.Rmd
155 lines (116 loc) · 8.28 KB
/
emoji_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "Analyzing Emojis in YouTube Comments"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Analyzing Emojis in YouTube Comments}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r Setup 0, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide'}
library(tuber)
library(ggplot2)
library(DataCombine)
library(tidyverse)
library(tidytext)
library(anytime)
library(devtools)
install_github("dill/emoGG")
install_github("hadley/emo")
library(emoGG)
library(emo)
```
Depending on the video(s) you are exploring, it might be useful to account for and analyze the use of emojis in comments. As emojis become more and more popular and more complex in the meanings they are able to signify, the more important it is to at least account for emojis and include them in textual analyses. Here's how you can do this with YouTube data!
## Getting YouTube Data
I wanted to choose something that had both 1) a lot of comments and 2) a strong likelihood of comments containing emojis, so let's look at the comments from the ill-advised (and ill-fated) 'Emoji Movie' trailer. This also has a lot of varying sentiment (one of the comments is "The movie is a such disgrace to the animation film industry."`r emo::ji("joy_cat")``r emo::ji("joy_cat")``r emo::ji("joy_cat")`).
If you don't have the YouTube API set up, please see instructions on how to do so [here](https://developers.google.com/youtube/v3/).
```{r Set Up Youtube + Get Comments, echo = FALSE, results = 'hide', message = FALSE, warning = FALSE, error = FALSE, eval = FALSE}
yt_oauth()
```
```{r Get YouTube Comments, echo = TRUE, results = 'hide', message = FALSE, warning = FALSE, error = FALSE, eval = FALSE}
# Connect to YouTube API
# Leave token blank
# yt_oauth("app_id", "app_password", token='')
# Get comments. 'max_results = 101' ensures I get all of the comments on the video.
emojimovie <- get_comment_threads(c(video_id="o_nfdzMhmrA"), max_results = 101)
# Save data (if you want)
# write_csv(emojimovie, path = "sampletubedata.csv")
```
Now we have some (~10,300) comments to play with -- let's identify the emojis in our data. To do so, we'll use the FindReplace function from the [DataCombine package](https://CRAN.R-project.org/package=DataCombine) and an [emoji dictionary](https://lyons7.github.io/portfolio/2017-10-04-emoji-dictionary/) I put together that has each emoji's prose name, UTF-8 encoding and R encoding (the R encoding is specifically for emojis in Twitter data). There are a couple of steps to change the dictionary to be able to identify emojis in our YouTube data, but depending on your computer you might be able to just search by UTF-8 encoding.
Help figuring out the emoji encoding issue from [Patrick Perry](https://stackoverflow.com/questions/47243155/get-r-to-keep-utf-8-codepoint-representation/47243425#47243425) -- thanks Patrick! `r emo::ji("smiling_face_with_smiling_eyes")`
```{r Emojis in YouTube Comments, echo=TRUE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
emojis <- read_csv("https://raw.githubusercontent.com/lyons7/emojidictionary/master/emoji_dictionary.csv")
# Specific to YouTube data
emojis <- emojis[!emojis$Name == " SHRUGFACE ",]
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
emojis$escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
emojis$codes <- sapply(parse(text = paste0("'", emojis$escapes, "'"),
keep.source = FALSE), eval)
```
Now we'll use the FindReplace function to go through and identify emojis.
```{r Emojis in YouTube Comments 2, echo = TRUE, results = 'hide', message = FALSE, warning = FALSE, error = FALSE}
emojimovie <- read_csv("https://github.com/soodoku/tuber/blob/master/data-raw/sampletubedata.csv?raw=TRUE")
emojimovie <- as.data.frame(emojimovie)
# First have to change class of comment text to character from factor
emojimovie$text <- as.character(emojimovie$textOriginal)
# Go through and identify emojis
emoemo <- FindReplace(data = emojimovie, Var = "text",
replaceData = emojis,
from = "codes", to = "Name",
exact = FALSE)
# This might take some time, we have a big data set.
```
Now you have your comments with emojis identified. Let's look at the top emojis in our data set.
```{r YouTube Emoji Comments, echo = TRUE, message = FALSE, warning = FALSE, error = FALSE}
# Have to do keep the "to_lower" parameter FALSE so our emojis in our dictionary are kept separate from words that happen to be the same as emoji names
# First want an id for each comment -- this will come in handy later on
emoemo$comment_id <- 1:nrow(emoemo)
emotidy_tube <- emoemo %>%
unnest_tokens(word, text, to_lower = FALSE)
# Put emojis in tidy format too so as to use a join function
emojis$Name <- as.character(emojis$Name)
tube_tidy_emojis <- emojis %>%
unnest_tokens(word, Name, to_lower = FALSE)
# Now we will use inner_join to keep matches and get rid of non-matches
tube_emojis_total <- tube_tidy_emojis %>%
inner_join(emotidy_tube)
# What is the most frequent emoji?
tube_freqe <- tube_emojis_total %>%
count(word, sort = TRUE)
tube_freqe[1:10,]
```
So, our ten most frequent emojis in the comments of the Emoji Movie trailer are `r emo::ji("face_with_tears_of_joy")`, `r emo::ji("boy")`, `r emo::ji("mobile_phone")`, `r emo::ji("kissing_heart")`, `r emo::ji("man")`, `r emo::ji("skull_and_crossbones")`, `r emo::ji("atom_symbol")`, `r emo::ji("dancing_women")`, `r emo::ji("grimacing")` and `r emo::ji("kissing_smiling_eyes")`. Read into that what you will! `r emo::ji("face_with_tears_of_joy")`
What if we want to look at how the use of these emojis has changed over time? We can also look at WHEN the posts were generated. We can make a graph of comment frequency over time. Graphs constructed with help from [here](http://www.cyclismo.org/tutorial/R/time.html), [here](https://gist.github.com/stephenturner/3132596),
[here](http://stackoverflow.com/questions/27626915/r-graph-frequency-of-observations-over-time-with-small-value-range), [here](http://michaelbommarito.com/2011/03/12/a-quick-look-at-march11-saudi-tweets/), [here](http://stackoverflow.com/questions/31796744/plot-count-frequency-of-tweets-for-word-by-month), [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html), [here](http://sape.inf.usi.ch/quick-reference/ggplot2/geom) and [here](http://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r).
We will also use the [anytime](https://CRAN.R-project.org/package=anytime) package to format the time in a usable way.
```{r, echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 4}
# Subset to just have posts that have our top ten emojis
top_ten <- subset(tube_emojis_total, word == "FACEWITHTEARSOFJOY" | word == "BOY"| word == "MOBILEPHONE" | word == "FACETHROWINGAKISS" | word == "MAN" | word == "SKULLANDCROSSBONES" | word == "ATOMSYMBOL" | word == "COLONEWOMANWITHBUNNYEARS"| word == "GRIMACINGFACE" | word == "KISSINGFACEWITHSMILINGEYES")
# Now use the 'anytime' package to convert to time format we can use
top_ten$created <- anytime(as.factor(top_ten$publishedAt))
Emoji <- top_ten$word
minutes <- 60
ggplot(top_ten, aes(created, color = Emoji)) +
geom_freqpoly(binwidth=10080*minutes)
```
We can look at these one by one too and use the emoGG package to use actual emojis to show which ones we are talking about.
```{r, echo=TRUE, message=FALSE, warning=FALSE, fig.width=7, fig.height=4}
# The code you use in emoGG is the same as UTF-8 but without "U+" etc, and all letters lowercase
tearsofjoy <- top_ten[top_ten$word == "FACEWITHTEARSOFJOY",]
ggplot(tearsofjoy, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f602")
boy <- top_ten[top_ten$word == "BOY",]
ggplot(boy, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f466")
# Sometimes emoGG doesn't have your emoji -- here we have to use skull, not skull and crossbones
skull <- top_ten[top_ten$word == "SKULLANDCROSSBONES",]
ggplot(skull, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f480")
grimace <- top_ten[top_ten$word == "GRIMACINGFACE",]
ggplot(grimace, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f62c")
# ad infinitum!
```