# COMP40020: Human Language Technologies
## Assignment 1



### Instructions

For this assignment you will investigate the language styles used online by processing and analysing data from Reddit. Submit your report online via Brightspace by **11:59pm Wednesday 10th March 2021**. Extensions to this deadline will only be granted for formal extenuating circumstances; please contact us as soon as possible should you require one. All assignments will undergo a plagiarism check as standard.


### Task

For this assignment you will build a corpus (or multiple corpora) of internet language data from one or more subreddits, decide which preprocessing steps should be carried out, then analyse the data to investigate a particular aspect of language usage. You can use any of the tools presented in previous workshops or come up with your own methods of analysis.

Some examples of things you might want to investigate:

* can you infer the subject of a subreddit from its most common words?
* does the style of language differ if you compare different subjects?
* are there subreddits that are very similar in their language?
* do common n-grams capture popular jokes or memes within the subreddit?
* are there texts which are more lexically diverse than others?
* can you spot trends or popular topics over time?
* are particular texts more likely to use certain emojis over others?

This is not an exhaustive list, nor do we expect you to answer all of these questions. Pick a research focus that interests you - a small, specific question is more effective than a vague, general investigation. If you have any questions or issues, feel free to email me (emma.l.oneill@ucdconnect.ie).
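
As a rough illustration of how a couple of the example questions above might be approached, the sketch below counts the most common words and bigrams in a text and computes a type-token ratio as a simple measure of lexical diversity. The `tokenize` helper, its regular expression, and the two sample strings are assumptions made up for illustration; they are not the Workshop tools and not a recommended preprocessing pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_ngrams(tokens, n=1, k=3):
    """Return the k most common n-grams as (tuple_of_tokens, count) pairs."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)


def type_token_ratio(tokens):
    """Number of distinct tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Two made-up sample texts standing in for two corpora.
text_a = "the cat sat on the mat and the cat slept"
text_b = "quantum fields interact through gauge bosons"

tokens_a = tokenize(text_a)
tokens_b = tokenize(text_b)

print(top_ngrams(tokens_a, n=1, k=1))  # most common word
print(top_ngrams(tokens_a, n=2, k=1))  # most common bigram
print(round(type_token_ratio(tokens_a), 2), round(type_token_ratio(tokens_b), 2))
```

Type-token ratio tends to fall as a text gets longer, so it is most meaningful when the samples being compared are of similar size.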

Please also be aware that we are working with real-world data from the internet. Use your own judgement in the case of potentially inappropriate or sensitive content. If your investigation deals with such content it should be handled professionally and respectfully. 

To get you started I have supplied steps and code to scrape data from Reddit. To do this you will need to sign up for a Reddit account. If you are unable to create an account or do not wish to do so you can collect data elsewhere. For example you could work with different books and authors by downloading texts from Project Gutenberg (https://www.gutenberg.org/).


### Format

The assignment should be submitted as a **PDF file**. It should take the form of a short scientific report with the following sections:

* Research Question - what is the focus of your investigation
* Methods - a description of the data you collected, the preprocessing steps you carried out (and why), and the analysis you performed
* Results - the results of your analysis
* Discussion - an interpretation of your results; what they mean, conclusions you can draw, any potential issues or areas of future work

The report should be around 2000 words (absolute maximum 3000). Please include a word count at the end of your assignment.

**You should also upload the .ipynb notebook file and any additional files needed to run your code (like your corpus) as a separate .zip file.** These will **not** be graded but may be examined to verify your results or check for plagiarism.

Any code that was supplied here or in the Workshops does not need to be included in your report. If you have written any additional code of your own please include relevant snippets where appropriate. 

Any referenced material should be properly cited and it should be clear where work is not your own.

Again, submission should be a **PDF** of your report and a **separate .zip** of your notebook and corpus. 

### Grading

This Assignment accounts for 20% of the overall grade for this module. The assignment will be letter graded according to the UCD grading scheme for Stage 4 modules.

(see: https://www.ucd.ie/registry/t4media/UCD%20Module%20Grade%20Descriptors.pdf)


## Accessing Reddit

To access Reddit data you will need to follow these steps:

1. Sign in to Reddit (feel free to create a new account for this Assignment and delete it afterwards)


2. Go to https://www.reddit.com/prefs/apps and click "are you a developer? create an app..." 


3. Enter a name (combine a random word and some numbers - if it's too short you might have problems later), select "script", and for the redirect URI enter http://localhost:8080


4. Click "create app"


5. You should now see the following:

<img src="reddit.png">

For scraping our Reddit data we'll be using PRAW; the corpus-building code below also imports the emoji package. In a terminal (Mac: Terminal, Windows: cmd) type:

<code>pip install praw emoji</code>

Now we're ready to start.

## Building a corpus

We will now create a .txt file containing a number of posts for a particular subreddit. Once this code has been run you should have a file (SUBREDDIT_NAME.txt) that you can work with using the tools presented in previous workshops.

Using the image above for reference, fill in your own values between the quotation marks for the following:

(For examples of subreddits you could look at see https://blog.oneupapp.io/biggest-subreddits/ )

```python
ci = "mathsmats21354"  # your client id
cs = "dS2I4meWoXMgQ-AtZObOO8C5WM_smw"  # your client secret
ua = "HLTUCDScript"  # your user agent name
sub = "apexlegends"  # the name of the subreddit (not including the 'r/')
```

Then run the following cell to create a corpus containing post titles, content and top-level comments:

Note that you can change how posts are sorted (e.g. "hot", "new", "rising", "top") and how many posts you collect; your choice will depend on your research question and the subreddit you choose. Depending on how many posts you extract and how many comments there are, this could take a few minutes. You might also choose not to extract titles or comments - feel free to remove the corresponding lines of code in that case.

```python
import emoji
import praw
from praw.models import MoreComments

reddit = praw.Reddit(
    client_id=ci,
    client_secret=cs,
    user_agent=ua
)

def remove_emojis(text):
    # drops every non-ASCII character (emojis, but also accented letters)
    return text.encode('ascii', 'ignore').decode('ascii')

def decode_emojis(text):
    # alternative to remove_emojis: replaces each emoji with its text name
    return emoji.demojize(text)

# encoding="utf-8" lets titles and post content that contain emojis be written on any platform
with open(sub + ".txt", "w", encoding="utf-8") as f:
    # on the following line you can change top to any of the previously mentioned ways of sorting
    # and the limit to however many posts you would like to extract (here we extract just 10).
    for post in reddit.subreddit(sub).top(limit=10):

        # this line collects the post titles
        f.write(post.title + "\n")

        # this line collects the post content
        f.write(post.selftext + "\n")

        # this section collects the top-level comments
        for comment in post.comments:
            # skip the "load more comments" placeholders
            if isinstance(comment, MoreComments):
                continue

            f.write(remove_emojis(comment.body) + "\n")
```
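
One way to make the sort order and the number of posts easy to change is to look the listing method up by name. The sketch below is an illustration only: `fetch_posts` and `ALLOWED_SORTS` are hypothetical names, and `FakeSubreddit` is a stand-in object (not part of PRAW) used so the example runs without Reddit credentials.

```python
ALLOWED_SORTS = ("hot", "new", "rising", "top")  # the sort orders mentioned above


def fetch_posts(subreddit, sort="top", limit=10):
    """Return up to `limit` posts from `subreddit` using the named sort order.

    `subreddit` is assumed to expose one method per sort order that accepts a
    `limit` keyword, as PRAW's subreddit objects do.
    """
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"sort must be one of {ALLOWED_SORTS}, got {sort!r}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.top
    return list(listing(limit=limit))


class FakeSubreddit:
    """Stand-in for a PRAW subreddit so the sketch runs offline (hypothetical)."""

    def __init__(self, titles):
        self._titles = titles

    def top(self, limit=None):
        # pretend the stored order is the "top" order
        return iter(self._titles[:limit])

    def new(self, limit=None):
        # pretend the reversed order is the "new" order
        return iter(list(reversed(self._titles))[:limit])


fake = FakeSubreddit(["first post", "second post", "third post"])
print(fetch_posts(fake, sort="top", limit=2))
print(fetch_posts(fake, sort="new", limit=1))
```

With real credentials, the equivalent call would be something like `fetch_posts(reddit.subreddit(sub), sort="new", limit=50)`, looping over the returned posts in the same way as the cell above.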


As a final note, you are not restricted to this methodology, file format, or the analysis tools from Workshops. You are very welcome to try out your own ideas in terms of coding, data structures, and/or linguistic theories. If you have an idea you want to implement but aren't sure how to go about it, let me know - you are not being graded on your code but rather on your ability to apply techniques, generate results, and discuss your interpretations.