# Read these instructions completely in order to receive full credit

- Before you submit the problem set, make sure everything runs as expected. Go to the menu bar at the top of Jupyter Notebook and click `Kernel > Restart & Run All`. Your code should run from top to bottom with no errors. Failure to do this will result in loss of points.

- You should not use `install.packages()` anywhere. You may assume that we have already installed all the packages needed to run your code.

- Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE" and delete the `stop()` functions. The `stop()` functions produce an error and are there to remind you of cells that need an answer.

- If you are working in a group, make sure you and your collaborators have been added to a group on Canvas as described at the beginning of lecture 2.
- As a backup, *also* fill in your uniqid as well as those of your collaborators below:

Your uniqid: `<replace with your uniqid>`

Uniqids of your collaborator(s): `<replace with their uniqids>`

- **Carefully proofread the PDF that you upload to Canvas. PDFs that have missing or truncated code cannot be graded and will not receive credit.**

---

In [None]:
library(tidyverse)
library(stringr)
options(jupyter.rich_display=TRUE)
library(harrypotter)
library(tidytext)
options(repr.plot.width=4, repr.plot.height=3)

# STATS 306
## Problem Set 5: Text Analysis
This problem set is shorter than usual because of the midterm. It contains five problems, each worth one point.

### Sentiment Analysis
In problems 1-2 we will perform *sentiment analysis* of the Harry Potter books. The file `afinn.RData` contains a sentiment score for a large number of words in the English language:

In [None]:
load("afinn.RData")
print(afinn)

Negatively connoted words receive low scores, while positively connoted words receive high scores:

In [None]:
filter(afinn, word %in% c("death", "hurrah"))

By joining this table to other tables containing text data and summarizing, we can generate scores of how positive or negative the text is.
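
As a minimal sketch of that join-and-summarize pattern, the toy example below uses made-up words and scores rather than the real `afinn` table. The names `toy_scores` and `toy_words`, and the score column name `value`, are assumptions for illustration only; check the output of `print(afinn)` for the actual column names.

```r
library(dplyr)
library(tibble)

# Hypothetical scores, invented for illustration (not taken from afinn.RData)
toy_scores <- tibble(word = c("good", "bad", "happy"),
                     value = c(3, -3, 3))

# One row per word, tagged with the document it came from
toy_words <- tibble(doc  = c(1, 1, 1, 2, 2),
                    word = c("a", "good", "day", "bad", "news"))

toy_summary <- toy_words %>%
    inner_join(toy_scores, by = "word") %>%  # keeps only words that have a score
    group_by(doc) %>%
    summarize(total = sum(value), n_scored = n())

print(toy_summary)
```

Because `inner_join()` drops words with no score, it can be worth tracking how many words were actually scored (here `n_scored`) alongside the total.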

The `tidytext::unnest_tokens()` function can be used to break a chunk of text into "tokens" (words, sentences, etc.). It works as follows. Consider the following tibble, which contains all 17 chapters of the first book in the Harry Potter series:

In [None]:
phil_tbl <- tibble(chapter=seq_along(philosophers_stone), 
                   text=philosophers_stone) %>% print

To perform sentiment analysis, we need to break each chapter into words so that we can join it to the `afinn` table. This is what `unnest_tokens()` does:

In [None]:
phil_tok <- unnest_tokens(phil_tbl, input=text, output=word) %>% print

Using this table and `afinn`, we can assign sentiment scores to various portions of text. For example:

![image.png](attachment:image.png)
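
To see what the tokenization step does on a small scale, here is a sketch using a made-up one-row tibble (`toy_tbl` is hypothetical, not part of the problem set data). It also shows the `token = "sentences"` option of `unnest_tokens()`, which splits text into sentences instead of words.

```r
library(tibble)
library(tidytext)

# A made-up snippet of text, for illustration only
toy_tbl <- tibble(line = 1, text = "It was dark. Harry smiled!")

# Default tokenizer: one row per word, lowercased, punctuation stripped
toy_tok <- unnest_tokens(toy_tbl, input = text, output = word)
print(toy_tok)

# Sentence tokenizer: one row per sentence
toy_sent <- unnest_tokens(toy_tbl, input = text, output = sentence,
                          token = "sentences")
print(toy_sent)
```

Note that the other columns (here `line`) are carried along with each token, which is what makes grouping by chapter or book possible later.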

#### Problem 1
Some people say that the Harry Potter books became darker (more negative) over time. Use sentiment analysis to investigate this, and report your conclusion here. (Hint: A list of all the Harry Potter books can be obtained by looking at the help page for the `harrypotter` package.)

In [None]:
# YOUR CODE HERE
stop()

YOUR ANSWER HERE

#### Problem 2

According to the sentiment scores, what is the most negative/positive chapter in any of the Harry Potter books? 

Of the sentences that contain three or more scored words, what is the happiest (most positive)?

In [None]:
# YOUR CODE HERE
stop()

### Reddit dataset
The file `reddit_xmas_2017.RData` contains 100,000 comments posted to Reddit on Christmas Day, 2017.

In [None]:
load('reddit_xmas_2017.RData')
reddit %>% print

Problems 3-5 ask you questions about this data set. Unless specified otherwise, all matches are case insensitive.

(*Disclaimer*: I filtered out objectionable comments as best I could, but you may find more if you dig around this data for long enough.)
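
As a quick illustration of case-insensitive matching, the sketch below compares a plain pattern with one wrapped in `stringr::regex(ignore_case = TRUE)`. The strings in `toy` are made up and are not taken from the Reddit data.

```r
library(stringr)

# Made-up strings, for illustration only
toy <- c("Gift ideas", "best GIFT ever", "a gift", "nothing")

# Case-sensitive: only the exact lowercase spelling matches
cs <- str_detect(toy, "gift")

# Case-insensitive: every capitalization matches
ci <- str_detect(toy, regex("gift", ignore_case = TRUE))

print(cs)
print(ci)
```

An alternative is to lowercase the text first with `str_to_lower()` and then match a lowercase pattern.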

#### Problem 3
Comment 174 wishes everyone happy holidays:

In [None]:
reddit %>% slice(174)

What are other people wishing? Extract the first occurrence of the string "Happy `<word>`" or "Merry `<word>`" in each comment body, if any, and count the matches.

To keep things interesting, do *not* include phrases matching `(happy|merry) (to|with|for|about|and|that|if|i|you|when)`. 

Print a table containing the top 20 matches; a few of the rows are:

<table>
<thead>
    <tr><th scope="col">greeting</th><th scope="col">n</th></tr>
    <tr><th scope="col">&lt;chr&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
    <tr><td>merry christmas</td><td>2040</td></tr>
    <tr><td>happy holidays </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td> &mdash;</td></tr>
    <tr><td>&mdash;        </td><td>   7</td></tr>
    <tr><td>happy cakeday  </td><td>   6</td></tr>
</tbody>
</table>

Your numbers may vary slightly depending on how you performed the match.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 4
The hourly number of mentions of the word `christmas` or `xmas` is:
![image.png](attachment:image.png)

Make a similar plot for hourly mentions of any word which contains "snow" or "flakes".

In [None]:
# YOUR CODE HERE
stop()

#### Problem 5
The most common word in the comments is "the", which occurs 81,104 times.

In [None]:
reddit %>% mutate(c=str_count(str_to_lower(body), '\\bthe\\b')) %>% summarize(sum(c))

The word `christmas` occurs 4,265 times:

In [None]:
xmas_re <- regex('\\bchristmas\\b', ignore_case = TRUE)
reddit %>% mutate(c=str_count(body, xmas_re)) %>% summarize(sum(c))
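
The `\\b` anchors in the patterns above restrict matches to whole words. A small sketch on a made-up string (`toy` is hypothetical) shows the difference:

```r
library(stringr)

# Made-up string, for illustration only
toy <- str_to_lower("The other theme: the end")

# Without word boundaries, "the" is also found inside "other" and "theme"
n_substring <- str_count(toy, "the")

# With word boundaries, only the standalone word "the" is counted
n_word <- str_count(toy, "\\bthe\\b")

print(c(n_substring, n_word))
```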

What is the next most common word after Christmas, and how many times does it appear?

In [None]:
# YOUR CODE HERE
stop()