# Assignment 1

## Guidelines

> Remember that this is a code notebook - add an explanation of what you do using text boxes and markdown, and comment your code. Answers without explanations may get fewer points.
>
> If you re-use a substantial portion of code you find online, e.g. on Stack Overflow, you need to add a link to it and make the borrowing explicit. The same applies if you take it and modify it, even substantially. There is nothing wrong with doing that, provided you acknowledge it and make it clear you know what you're doing.
>
> The **Generative AI policy** from the syllabus for the programming assignments applies. Generative AI can be used as a source of information in these assignments if properly referenced. You can use generative AI assistance for writing code, but you must reference the chat used as a source, just as you would for code taken from Stack Overflow. In ChatGPT, you can create a URL to the information you obtained by clicking the "Share link to Chat" button and then "Copy Link". This allows you to cite the source of the information you use in your answer or code solution. Of course, as you know, GenAI tools are not always a reliable source and their answers are drawn from other sources in a non-transparent way, so it is recommended to cross-check their output with other sources or your own understanding of the topic.
> 
> For the explanations of what you do that you provide with each question, as well as for (sub)questions that ask about things like the motivation for your choices or your opinion, the answers must be conceptualized and written by yourself and not copied from a generative AI source.
>
> Make sure your notebooks have been run when you submit, as I won't run them myself. Submit the `.ipynb` file along with an `.html` export of the same. Submit all necessary auxiliary files as well. Please compress your submission into a `.zip` archive. Only `.zip` files can be submitted.
> If you are using Google Colab, here is a tutorial for obtaining an HTML export: https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab .
>
> With Jupyter, you can simply export it as HTML through the File menu.

## Grading policy
> As follows:
>
> * 70 points for correctly completing the assignment.
>
> * 20 points for appropriately writing and organizing your code in terms of structure, readability (also by humans), comments and minimal documentation. It is important to be concise but also to explain what you did and why, when not obvious. Feel free to re-use functions and variables from previous questions if that helps for structure and readability - you do not need to repeat previous steps for each question.
> 
> * 10 points for doing something extra, e.g., if you go beyond expectations (overall or on something specific). Some ideas for extras might be mentioned in the exercises, or you can come up with your own. You don't need to do them all to get the bonus. The points above sum to 90; doing (some of) the extras can bring you to 100, so the extras are not necessary to get an A.
> 

**The AUC code of conduct applies to this assignment: please only submit your own work and follow the instructions on referencing external sources above.**

---

# Warm up (20 points)

## Question 1 (2 points)

Explain why `list1` and `list2` behave differently when they are passed to the `append_to_nested_list()` function.

In [1]:
def append_to_nested_list(a_list):
    a_list[0].append("Python")
    return a_list
    
list1 = [[], [], []]
list2 = [[]] * 3

print(append_to_nested_list(list1))
print(append_to_nested_list(list2))

[['Python'], [], []]
[['Python'], ['Python'], ['Python']]


<b> EXPLANATION </b>

In `list1`, three distinct nested lists are created, so appending to one does not affect the others.
In `list2`, however, only one nested list is created in memory, and the outer list holds three references to it. In other words, `list2[0]` is the same object as `list2[1]` and `list2[2]`, so appending to `list2[0]` shows up at all three positions.
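
The difference can be checked directly with the `is` operator, which tests object identity rather than equality. A minimal sketch, reusing the two list definitions from the question (the name `list3` is added here for illustration):

```python
list1 = [[], [], []]   # three separate inner lists
list2 = [[]] * 3       # one inner list, referenced three times

# `is` compares identity, not contents.
print(list1[0] is list1[1])  # False
print(list2[0] is list2[1])  # True

# A list comprehension evaluates `[]` once per iteration,
# so it builds independent inner lists, like list1.
list3 = [[] for _ in range(3)]
list3[0].append("Python")
print(list3)  # [['Python'], [], []]
```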

## Question 2 (2 points)

Write a function that counts the **total** frequency of words that start and end with the same character (e.g. comic) in a text file and test it on `data/melville-md.txt`. Total frequency means that you end with one number, although you are encouraged to show intermediate steps.

Ensure that the words are treated case-insensitively.

In [7]:
# your code here

def freq_start_end_same(file_name: str) -> int:
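    '''
    Count the words in a file that start and end with the same letter.

    Words are split on whitespace and compared case-insensitively. Tokens
    containing any non-alphabetic character (e.g. attached punctuation) are skipped.
    '''
    count = 0
    with open(file_name) as file:
        for line in file:
            for word in line.split():
                lowered = word.lower()
                if word.isalpha() and lowered[0] == lowered[-1]:
                    count += 1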

    return count

print(freq_start_end_same("data/melville-md.txt"))

12688


<b> EXPLANATION </b>

A counter is initialised to 0. We open the file, and for every line, we split it on whitespace into a list of individual words. Then, for every word, if it contains only alphabetic characters and its first and last characters are the same (ignoring case), we increment the counter by 1. Finally, we return the counter.
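
To see which tokens this condition accepts, the same check can be applied to a short string instead of the file. This is only a sketch: the helper name `starts_and_ends_alike` and the sample sentence are made up for illustration.

```python
def starts_and_ends_alike(word: str) -> bool:
    """Apply the same test as the counting loop to a single token."""
    lowered = word.lower()
    return word.isalpha() and lowered[0] == lowered[-1]

sample = "A comic, Dad said: that level is high"
matches = [w for w in sample.split() if starts_and_ends_alike(w)]
print(matches)  # ['A', 'Dad', 'that', 'level', 'high']
```

Single-letter words such as "A" are counted, because their first and last character coincide, while "comic," is skipped because the attached comma makes `isalpha()` return `False`. Stripping punctuation first, for example with `word.strip(string.punctuation)`, would be one way to include such tokens.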

## Question 3 (2 points)

Rewrite the following code such that:

- the sequence of fruit names is presented in random order (without replacement). Use a function from the [random](https://docs.python.org/3.7/library/random.html) module for this.


- the article "an" is used when a fruit name begins with a vowel.

In [2]:
available_fruit = ['apple', 'apricot', 'avocado', 'banana', 'bilberry', 'blackberry', 'blackcurrant', 'blueberry', 'boysenberry', 'currant', 'cherry', 'cherimoya', 'cloudberry', 'coconut', 'cranberry', 'cucumber', 'damson', 'date', 'dragonfruit', 'durian', 'elderberry', 'feijoa', 'fig', 'gooseberry', 'grape', 'raisin', 'grapefruit', 'guava', 'honeyberry', 'huckleberry', 'jabuticaba', 'jackfruit', 'jambul', 'jujube', 'kiwano', 'kiwifruit', 'kumquat', 'lemon', 'lime', 'loquat', 'longan', 'lychee', 'mango', 'mangosteen', 'marionberry', 'melon', 'cantaloupe', 'honeydew', 'watermelon', 'mulberry', 'nectarine', 'nance', 'orange', 'clementine', 'mandarine', 'tangerine', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'plantain', 'plum', 'prune', 'pineapple', 'plumcot', 'pomegranate', 'pomelo', 'quince', 'raspberry', 'salmonberry', 'rambutan', 'redcurrant', 'salak', 'satsuma', 'soursop', 'strawberry', 'tamarillo', 'tamarind', 'yuzu']

for fruit in available_fruit:
    print("We have a " + fruit)

We have a apple
We have a apricot
We have a avocado
We have a banana
We have a bilberry
We have a blackberry
We have a blackcurrant
We have a blueberry
We have a boysenberry
We have a currant
We have a cherry
We have a cherimoya
We have a cloudberry
We have a coconut
We have a cranberry
We have a cucumber
We have a damson
We have a date
We have a dragonfruit
We have a durian
We have a elderberry
We have a feijoa
We have a fig
We have a gooseberry
We have a grape
We have a raisin
We have a grapefruit
We have a guava
We have a honeyberry
We have a huckleberry
We have a jabuticaba
We have a jackfruit
We have a jambul
We have a jujube
We have a kiwano
We have a kiwifruit
We have a kumquat
We have a lemon
We have a lime
We have a loquat
We have a longan
We have a lychee
We have a mango
We have a mangosteen
We have a marionberry
We have a melon
We have a cantaloupe
We have a honeydew
We have a watermelon
We have a mulberry
We have a nectarine
We have a nance
We have a orange
We have a clemen

In [9]:
# your code here
from random import randrange

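def rand_fruit(fruits: list) -> None:
    '''
    Print every fruit in random order, each preceded by "a" or "an".

    Note: the list passed in is emptied, because items are popped from it.
    '''
    while fruits:
        # Remove and return the element at a random valid index.
        fruit = fruits.pop(randrange(len(fruits)))
        # Use "an" before a vowel; only the five unaccented lowercase vowels are checked.
        article = "an" if fruit[0] in "aeiou" else "a"
        print(f"We have {article} {fruit}")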

fruits_pt = [
    "abacate", "ananás", "cereja", "framboesa", "lima", "maçã", "melancia", "mirtilo", "morango", "uva"
]
rand_fruit(fruits_pt)

We have a melancia
We have a morango
We have an abacate
We have a mirtilo
We have a cereja
We have an ananás
We have a framboesa
We have a maçã
We have an uva
We have a lima


<b> EXPLANATION </b>

I chose to use the `randrange()` function from the `random` module to get a random index from 0 up to (but not including) `len(fruits)` and pop the element at that index. A conditional expression determines whether the article should be "a" or "an" based on the first letter of the element, which is then formatted into the desired output. This repeats until all elements have been popped, which leaves the original list empty.
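
An alternative that leaves the caller's list untouched is `random.sample()`, which returns a new list of picks drawn without replacement (`random.shuffle()` would instead reorder the list in place). The following is a sketch rather than the required solution; the function name `rand_fruit_sample` and the short English list are illustrative.

```python
from random import sample

def rand_fruit_sample(fruits: list) -> list:
    """Return one sentence per fruit, in random order, without modifying `fruits`."""
    sentences = []
    # sample(population, k) picks k items without replacement.
    for fruit in sample(fruits, k=len(fruits)):
        article = "an" if fruit[0].lower() in "aeiou" else "a"
        sentences.append(f"We have {article} {fruit}")
    return sentences

fruits_en = ["apple", "banana", "elderberry", "orange", "plum"]
for sentence in rand_fruit_sample(fruits_en):
    print(sentence)
```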

## Question 4 (5 points)

The following code has been written to extract all word-initial consonant clusters in a text (e.g. "br" in "bread"). Each cluster is obtained by matching a sequence of 2 or more letters, none of which is a vowel ('aeiou'; the regular expression below also excludes 'y'), that occurs after whitespace or at the start of the line.

It works by reading an input file line by line, and finding all matches of a regular expression in this line (case insensitive).

Unfortunately, the code only counts the matches, so we do not find out which word-initial consonant clusters are present in the text. Can you find a way to save all matching consonant clusters to the dictionary named `consonantclusters`, with their frequency as the value, and then print this dictionary? Note that there can be multiple results per line. Also try to avoid capturing the space(s) before the consonant cluster. As for every question, explain what you did and how your solution works.

Solutions where you adapt the provided regular expression will get more points than non-regex solutions, but you can try a non-regex solution if you are stuck.

**Possible extra:** Print the consonant clusters sorted by frequency and in a nice looking way.

In [4]:
import codecs
import re

In [25]:
consonantclusters = {}
consonantclustercount = 0
with codecs.open("data/melville-md.txt", "r", encoding="utf8") as infile:
    consonantclusterregex = re.compile(r'(^|\s)(?:(?![aeiouy])[a-z]){2,}')
    for line in infile:
        result = consonantclusterregex.findall(line.lower())
        if result:
            consonantclustercount += len(result)
print(consonantclustercount)

51523


In [None]:
# your code here
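
One detail worth knowing here: when a pattern contains a capturing group, `re.findall()` returns the content of that group rather than the whole match, so the provided pattern yields the leading whitespace (or an empty string) instead of the cluster. The sketch below shows one possible adaptation, tried on two made-up lines rather than the data file. It swaps the `(^|\s)` group for a negative lookbehind, which is an assumption about what counts as an acceptable change to the regular expression.

```python
import re
from collections import Counter

# Same character class as the provided pattern, but the leading (^|\s) group
# is replaced by a lookbehind, so no whitespace ends up in the match and
# findall() returns the cluster itself rather than the captured prefix.
cluster_regex = re.compile(r'(?<!\S)(?:(?![aeiouy])[a-z]){2,}')

lines = ["The bread and the broth:", "three strong cheers"]  # stand-in for the file

consonantclusters = Counter()
for line in lines:
    consonantclusters.update(cluster_regex.findall(line.lower()))

# most_common() sorts by descending frequency.
for cluster, freq in consonantclusters.most_common():
    print(f"{cluster:<5}{freq:>4}")
```

`Counter` is a `dict` subclass, so it can be used wherever the question asks for a dictionary.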

## Question 5 (9 points)

Please use the frequencies in `late_arrival_causes` to reproduce the plot below as closely as possible. This is called a Pareto chart.

Note: the line plot above the bars shows the cumulative frequency.
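
The numbers behind such a chart can be prepared before any plotting. The sketch below sorts the causes by frequency and computes the running total as a percentage. Whether the reference plot labels its line in percentages or in raw counts is an assumption to check against the image.

```python
from itertools import accumulate

late_arrival_causes = {"Child Care": 44, "Emergency": 7, "Overslept": 11,
                       "Traffic": 56, "Transp.": 27, "Weather": 20}

# A Pareto chart orders the bars from most to least frequent.
ordered = sorted(late_arrival_causes.items(), key=lambda item: item[1], reverse=True)
causes = [cause for cause, _ in ordered]
counts = [count for _, count in ordered]

# Running total, expressed as a percentage of all late arrivals.
total = sum(counts)
cumulative_pct = [100 * running / total for running in accumulate(counts)]

for cause, count, pct in zip(causes, counts, cumulative_pct):
    print(f"{cause:<12}{count:>4}{pct:>8.1f}%")
```

With matplotlib, `causes` and `counts` could then go to `ax.bar()`, and `cumulative_pct` to a line drawn on a second y-axis obtained from `ax.twinx()`.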

**Possible extra:** suggest, motivate and implement an alternative visualization for the same data.

![pareto chart](images/pareto-chart.png)

In [6]:
late_arrival_causes = {"Child Care" : 44, "Emergency" : 7, "Overslept" : 11, "Traffic" : 56, "Transp." : 27, "Weather" : 20}

In [7]:
# your code here

---

# Preprocessing pipelines (30 points)

## Question 6 (20 points)

- Download a 19th-century edition (or earlier, but not later!) of a book you like from the [Internet Archive](https://archive.org) in `.txt` format. For example, [Frankenstein](https://archive.org/details/ghostseer01schiuoft/page/n6). Add the link to the edition you used to your answer, as well as the `.txt` file to your submission.

- Write code that:

    1. Reads the text in memory.
    
    1. Pre-processes the text in a way that suits this type of data. One step is typically tokenization, for which you can use a tokenizer from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) and optionally other preprocessing steps if you feel this helps. To get full points, you should motivate your choice of tokenizer and choice of other preprocessing steps (or lack thereof).
    
    1. Filters out words that consist of strictly fewer than 4 alphabetic characters.

    1. Counts the frequencies of all the words in the corpus (words should be counted case-insensitively).

    1. Writes each word-frequency pair to a csv file (from most frequent to rarest).
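
The filtering, counting and writing steps above can be sketched with the standard library alone. The regular-expression tokenizer is only a stand-in (an assumption) for whichever NLTK tokenizer you choose and motivate; the function names and the sample sentence are hypothetical as well.

```python
import csv
import io
import re
from collections import Counter

def word_frequencies(text: str, min_letters: int = 4) -> Counter:
    """Count lowercased alphabetic tokens with at least `min_letters` letters."""
    # Stand-in tokenizer (assumption): runs of letters. An NLTK tokenizer would go here.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if len(token) >= min_letters)

def write_frequencies(freqs: Counter, handle) -> None:
    """Write word,frequency rows from most to least frequent."""
    writer = csv.writer(handle)
    writer.writerow(["word", "frequency"])
    writer.writerows(freqs.most_common())

sample_text = "The whale saw the ship; the ship saw the whale, and some sailors fled."
buffer = io.StringIO()  # for a real file: open("frequencies.csv", "w", newline="")
write_frequencies(word_frequencies(sample_text), buffer)
print(buffer.getvalue())
```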

Comment on your results, especially looking at very frequent and very infrequent words. What is problematic about processing these old editions? Can you find some limitations of the tokenizer in use, and think about how you would improve on it? Naturally, this part is required for full points.

**Possible extra:** Write your preprocessing code as reusable functions, as you often have to preprocess multiple textual sources in a consistent way. You can then avoid code duplication in Question 8.

**Possible extra:** Plot the relative frequency of the top N words (e.g., use the Pareto chart you did above, or another suitable plot) and discuss whether the distribution might follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).

**Possible extra:** Add lemmatization or stemming and part-of-speech tagging.

In [None]:
# your code here

## Question 7 (10 points)

Do some self-learning: implement the same pipeline as in Question 6 using [spaCy pipelines](https://spacy.io/usage/processing-pipelines).

Hint: Make sure you know what spaCy does by default when loading specific models; the default options are not always what you need.

In [1]:
# your code here

---

# Descriptive text analysis (20 points)

## Question 8 (20 points)

In the `data/numan` directory, there are lyrics of some songs from two albums by electronic music pioneer Gary Numan. There are 5 songs from his 1979 album and 5 songs from his 2017 album. This data was acquired from [Genius](https://genius.com) (Genius Lyrics) using their API, something that you could do too using the `lyricsgenius` package for Python!

- Load the data from these files into an appropriate data structure and perform appropriate preprocessing for this type of data (which might be different from preprocessing old books). Motivate your choices.

**Possible extra:** This might be a good opportunity to show how good you were at writing re-usable code in Question 6. Avoid duplication of code by calling back to functions you made in Question 6. Only include the steps that also make sense for song lyrics.

- Write a function that can return some statistics about an album's song lyrics, and run this function for both albums:

    * Most frequent words
    * Type-token ratio (unique words/words)
    * Average word length
    * Longest and shortest songs (by lyrics)
    * What are the songs with the largest vocabulary and smallest vocabulary?
    
- Print these results to your notebook for both albums in a nice looking way. Then, also show these same statistics for just the song 'Cars' from the 1979 album, which was Gary Numan's most famous song.
   
   * Write down your interpretation of these results in this notebook.  
   * In which of the two time periods was Gary Numan more verbose? Back it with some evidence.
   * Electronic music is sometimes said to make more use of repetition than other forms of music. In which of the two time periods did Gary Numan make more use of lyrical repetition? You can either argue your case based on the numbers you were asked to calculate, or you can come up with your own definition of 'repetitiveness' and calculate it with Python code.
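
Three of the statistics listed above can be computed from a plain list of tokens. The sketch below assumes the lyrics have already been tokenized and lowercased and that the list is non-empty. The function name `lyric_stats` and the sample tokens are made up and are not taken from the data.

```python
from collections import Counter

def lyric_stats(tokens: list, top_n: int = 3) -> dict:
    """Compute a few descriptive statistics for a non-empty list of word tokens."""
    counts = Counter(tokens)
    return {
        "most_frequent": counts.most_common(top_n),
        # Type-token ratio: unique words divided by total words.
        "type_token_ratio": len(counts) / len(tokens),
        "average_word_length": sum(len(t) for t in tokens) / len(tokens),
    }

# Made-up tokens, already lowercased and stripped of punctuation.
tokens = "la la la the night is long the night is cold".split()
stats = lyric_stats(tokens)
print(stats)
```

A lower type-token ratio means more repeated words, but the ratio also tends to fall as texts get longer, so comparing songs of very different lengths needs some care.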

In [None]:
# your code here