# Assignment 3a

## Due: Friday, September 30, 2022 at 5pm (submission via Canvas)


* Please submit your assignment (notebooks of parts 3a and 3b + Python modules) as **a single .zip file** using Canvas (Assignments --> Assignment 3). Please put the notebooks for Assignment 3a and 3b as well as the Python modules (files ending with .py) in one folder, which you call ASSIGNMENT_3_FIRSTNAME_LASTNAME. Please zip this folder and upload it as your submission.

* Please name your zip file with the following naming convention: ASSIGNMENT_3_FIRSTNAME_LASTNAME.zip

**IMPORTANT NOTE**:
* The students who follow the Bachelor version of this course, i.e., the course Introduction to Python for Humanities and Social Sciences (L_AABAALG075) as part of the minor Digital Humanities, do **NOT have to do Exercises 3 and 4 of Assignment 3b**
* The other students, who follow the Master version of Programming in Python for Text Analysis (L_AAMPLIN021), are required to **DO Exercises 3 and 4 of Assignment 3b**

If you have **questions** about this topic, please contact us **(cltl.python.course@gmail.com)**. Questions and answers will be collected on Piazza, so please check if your question has already been answered first.


In this block, we covered a lot of ground:

* Chapter 12 - Importing external modules 
* Chapter 13 - Working with Python scripts
* Chapter 14 - Reading and writing text files
* Chapter 15 - Off to analyzing text 


In this assignment, you will first complete a number of small exercises about each chapter to make sure you are familiar with the most important concepts. In the second part of the assignment, you will apply your newly acquired skills to write your very own text processing program (ASSIGNMENT-3b) :-). But don't worry, there will be instructions and hints along the way. 


**Can I use external modules other than the ones treated so far?**

For now, please try to avoid it. All the exercises can be solved with what we have covered in blocks I, II, and III. 


## Functions & scope

### Exercise 1

Define a function called `split_sort_text` which takes one positional parameter called **text** (a string).

The function:
* splits the string on a space character, i.e., ' '
* returns all the unique words in alphabetical order as a list.

* Hint 1: There is a specific Python container which does not allow for duplicates and simply removes them. Use this one. 
* Hint 2: There is a function called `sorted` which sorts the items in an iterable. Look at the documentation to see how it is used. 
* Hint 3: Don't forget to write a docstring. Please make sure that the docstring explains what the input is, what the function does, and what the function returns. You can use [reStructuredText](http://docutils.sourceforge.net/rst.html) if you want, but this is not needed to receive full points.

In [4]:
# your code here
def split_sort_text(text: str):
    '''
    Return the unique words of a text in alphabetical order.

    :param str text: the text to split into words
    :returns: alphabetically sorted list of the unique, lowercased words
    '''
    words = text.lower().split()  # lowercase, then split on whitespace
    unique_words = set(words)  # a set drops duplicates
    return sorted(unique_words)

print(split_sort_text('This is a test string'))

['a', 'is', 'string', 'test', 'this']
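
The exercise asks for a split on a single space character, while the solution above lowercases the text and calls `split()` without an argument. The two are not equivalent. The following minimal sketch illustrates the difference; the example string is made up for illustration and is not part of the assignment.

```python
# Hypothetical example string with a double space and mixed case.
text = 'The cat  and the dog'

# split(' ') cuts at every single space, so a double space yields an empty string.
on_space = text.split(' ')

# split() without arguments cuts at runs of whitespace and drops empty strings.
on_whitespace = text.split()

print(on_space)
print(on_whitespace)

# Without lowercasing, 'The' and 'the' count as two different words,
# and uppercase letters sort before lowercase ones.
case_sensitive = sorted(set(on_whitespace))
case_insensitive = sorted(set(text.lower().split()))

print(case_sensitive)
print(case_insensitive)
```

Whether lowercasing is desirable depends on how "unique words" is interpreted; the exercise text does not ask for it explicitly.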


## Working with external modules

### Exercise 2
NLTK offers a way of using WordNet in Python. Do some research (using Google, because quite frankly, that's what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). See if you can print all the **synset definitions** of the lemma **dog**.

Run the following cell to make sure that WordNet is installed:

In [5]:
import nltk
import ssl

# Bypass SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

nltk.download('wordnet')

# uncomment the following lines to download material including WordNet
# nltk.download('book')
# nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rohanzonneveld/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# print synsets of the word 'dog'
from nltk.corpus import wordnet as wn
print(wn.synsets('dog'))

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
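
The cell above prints the synsets themselves, whereas the exercise asks for their definitions. A minimal sketch of one way to get them is shown below, assuming the `name()` and `definition()` methods of NLTK synsets; the helper name `synset_definitions` is made up for illustration. The NLTK part is skipped gracefully if NLTK or the WordNet data is not available.

```python
def synset_definitions(synsets):
    """Return a list of (synset name, definition) pairs.

    :param synsets: iterable of WordNet synsets, e.g. the result of wn.synsets('dog')
    """
    return [(synset.name(), synset.definition()) for synset in synsets]


if __name__ == '__main__':
    try:
        from nltk.corpus import wordnet as wn
        for name, definition in synset_definitions(wn.synsets('dog')):
            print(f'{name}: {definition}')
    except (ImportError, LookupError):
        # NLTK is not installed or the WordNet data has not been downloaded yet.
        print("WordNet is not available; run nltk.download('wordnet') first.")
```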


## Working with Python scripts

### Exercise 3

#### a.) Define a function called `my_word_count`, which determines how often each word occurs in a string. Do not use NLTK just yet. Find a way to test it. 

* Write a helper-function called `preprocess`, which removes the punctuation specified by the user, and returns the same string without the unwanted characters. You call the function `preprocess` inside the `my_word_count` function.

* Remember that there are string methods that you can use to get rid of unwanted characters. Test the `preprocess` function using the following string `'this is a (tricky) test'`.

* Remember how we used dictionaries to count words? If not, have a look at Chapter 10 - Dictionaries. 

* Make sure you split the string on a space character ' '. You then can loop over the list to count the words.

* Test your function using an example string, which will tell you whether it fulfills the requirements (remove punctuation, split, count). You will get a point for good testing.

#### b.) Create a Python script 

Use your editor to create a Python script called **count_words.py**. Place the function definition of the **my_word_count** function in **count_words.py**. Also put a function call of the **my_word_count** function in this file to test it. Place your helper function definition, i.e., **preprocess**, in a separate script called **utils_3a.py**. Import your helper function **preprocess** into count_words.py. Test whether everything works as expected by calling the script count_words.py from the terminal.

The function **preprocess** preprocesses the text by removing the characters that the user does not want. It is called within the **my_word_count** function, which builds upon its output and creates a dictionary in which each key is a word and each value is the frequency of that word.

**Please submit these scripts together with the other notebooks**.

Don't forget to add docstrings to your functions. 

In [7]:
from count_words import my_word_count
text = 'this is a (very very very very tricky) test'
print(my_word_count(text))

{'this': 1, 'is': 1, 'a': 1, '(very': 1, 'very': 3, 'tricky)': 1, 'test': 1}
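
The output above still contains `'(very'` and `'tricky)'`, which suggests that no punctuation was removed in this call. The contents of count_words.py and utils_3a.py are not shown here, so the following is only a minimal sketch of how the two functions could fit together. The parameter name `chars_to_remove` and its default value are assumptions, and both functions are shown in one block for readability, whereas the exercise asks for two separate files.

```python
def preprocess(text, chars_to_remove=('(', ')')):
    """Return text with every character in chars_to_remove deleted.

    chars_to_remove is a hypothetical parameter name, not taken from the assignment.
    """
    for char in chars_to_remove:
        text = text.replace(char, '')
    return text


def my_word_count(text, chars_to_remove=('(', ')')):
    """Return a dict mapping each word in text to its frequency."""
    word_counts = {}
    for word in preprocess(text, chars_to_remove).split(' '):
        if word:  # skip empty strings caused by repeated spaces
            word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


# Test the helper with the string suggested in the exercise.
assert preprocess('this is a (tricky) test') == 'this is a tricky test'

counts = my_word_count('this is a (very very very very tricky) test')
print(counts)
```

With the brackets removed, all four occurrences of `very` are counted under one key, so the sketch prints `{'this': 1, 'is': 1, 'a': 1, 'very': 4, 'tricky': 1, 'test': 1}`.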


## Dealing with text files

### Exercise 4

**Playing with lyrics**

a.) Write a function called `load_text`, which opens and reads a file and returns the text in the file. It should have the file path as a parameter. Test it by loading this file: ../Data/lyrics/walrus.txt

* Hint: remember it is best practice to use a context manager
* Hint: **FileNotFoundError**: This means that the path you provide does not lead to an existing file on your computer. Please carefully study Chapter 14. Please determine where the notebook or Python module that you are working with is located on your computer. Try to determine where Python is looking if you provide a path such as “../Data/lyrics/walrus.txt”. Try to go from your notebook to the location on your computer where Python is trying to find the file. One tip: if you did not store the Assignments notebooks 3a and 3b in the folder “Assignments”, you would get this error.
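
To see where Python actually looks for a relative path, it can help to print the working directory and the resolved path. A minimal sketch using the standard-library `pathlib` module:

```python
from pathlib import Path

# Relative paths are resolved against the current working directory,
# which for a notebook is normally the folder the notebook is stored in.
relative_path = Path('../Data/lyrics/walrus.txt')

print('Working directory:', Path.cwd())
print('Python will look for:', relative_path.resolve())
print('File exists:', relative_path.exists())
```

If the last line prints `False`, compare the resolved path with the actual location of walrus.txt on your computer.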

b.) Write a function called `replace_walrus`, which takes lyrics as input and replaces every instance of 'walrus' by 'hippo' (make sure to account for upper and lower case; it is fine to transform everything to lower case). The function should write the new version of the song to a file called 'walrus_hippo.txt' and store it in ../Data/lyrics. 

Don't forget to add docstrings to your functions. 

In [2]:
# your code here
def load_text(file_path):
    '''
    Open the file at file_path, read it, and return its content as a string.
    '''
    with open(file_path, 'r') as f:
        return f.read()

text = load_text('/Users/rohanzonneveld/Documents/Artificial Intelligence/Jaar 2/Programming in Python for Text Analysis/python-for-text-analysis/Data/lyrics/walrus.txt')
print(text)

"I Am The Walrus"
("Magical Mystery Tour" Version)

I am he
As you are he
As you are me
And we are all together

See how they run
Like pigs from a gun
See how they fly
I'm crying

Sitting on a cornflake
Waiting for the van to come
Corporation tee shirt
Stupid bloody Tuesday
Man, you been a naughty boy
You let your face grow long

I am the eggman (Ooh)
They are the eggmen, (Ooh)
I am the walrus
Goo goo g' joob

Mister city p'liceman sitting pretty
Little p'licemen in a row
See how they fly
Like Lucy in the sky
See how they run
I'm crying
I'm crying, I'm crying, I'm crying

Yellow matter custard
Dripping from a dead dog's eye
Crabalocker fishwife pornographic priestess
Boy you been a naughty girl
You let your knickers down

I am the eggman (Ooh)
They are the eggmen (Ooh)
I am the walrus
Goo goo g' joob

Sitting in an English
Garden waiting for the sun
If the sun don't come
You get a tan from standing in the English rain

I am the eggman
They are the eggmen
I am the walrus
Goo goo g' joob

In [4]:
def replace_walrus(lyrics):
    '''
    Lowercase the lyrics, replace every 'walrus' by 'hippo', and write the result to walrus_hippo.txt.
    '''
    with open('/Users/rohanzonneveld/Documents/Artificial Intelligence/Jaar 2/Programming in Python for Text Analysis/python-for-text-analysis/Data/lyrics/walrus_hippo.txt', 'w') as f:
        f.write(lyrics.lower().replace('walrus', 'hippo'))

replace_walrus(text)
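
The solution above hard-codes an absolute path that only exists on one computer. A possible variant, sketched below, takes the output path as a parameter so that the relative path from the exercise can be passed in. The parameter name `output_path` and its default value are assumptions and not part of the assignment. The demo writes to a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def replace_walrus(lyrics, output_path='../Data/lyrics/walrus_hippo.txt'):
    """Lowercase lyrics, replace 'walrus' by 'hippo', and write the result to output_path.

    output_path is a hypothetical parameter added for illustration.
    """
    new_lyrics = lyrics.lower().replace('walrus', 'hippo')
    with open(output_path, 'w', encoding='utf-8') as outfile:
        outfile.write(new_lyrics)
    return new_lyrics


# Demo in a temporary folder, so that no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'walrus_hippo.txt'
    replace_walrus("I am the Walrus\nGoo goo g' joob", demo_path)
    written = demo_path.read_text(encoding='utf-8')

print(written)
```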

## Analyzing text with NLTK

### Exercise 5

**Building a simple NLP pipeline**

For this exercise, you will need NLTK. Don't forget to import it. 

Write a function called `tag_text`, which takes raw text as input and returns the tagged text. To do this, make sure you follow the steps below:

* Tokenize the text. 

* Perform part-of-speech tagging on the list of tokens. 

* Return the tagged text


Then test your function using the text snippet below (`test_text`) as input.

Please note that some of the tags may be incorrect. This is not a mistake on your end; NLP tools are simply not perfect.

In [9]:
test_text = """Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:"""

In [14]:
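# your code here
import nltk

def tag_text(text: str):
    '''
    Tokenize a raw text and return its part-of-speech tags.

    :param str text: the raw text to tag
    :returns: list of (token, tag) tuples
    '''
    tokens = nltk.word_tokenize(text)  # split the raw text into tokens
    return nltk.pos_tag(tokens)  # assign a part-of-speech tag to each token

print(tag_text(test_text))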

[('Shall', 'NN'), ('I', 'PRP'), ('compare', 'VBP'), ('thee', 'JJ'), ('to', 'TO'), ('a', 'DT'), ('summer', 'NN'), ("'s", 'POS'), ('day', 'NN'), ('?', '.'), ('Thou', 'NNP'), ('art', 'RB'), ('more', 'RBR'), ('lovely', 'RB'), ('and', 'CC'), ('more', 'JJR'), ('temperate', 'NN'), (':', ':'), ('Rough', 'NNP'), ('winds', 'NNS'), ('do', 'VBP'), ('shake', 'VB'), ('the', 'DT'), ('darling', 'VBG'), ('buds', 'NNS'), ('of', 'IN'), ('May', 'NNP'), (',', ','), ('And', 'CC'), ('summer', 'NN'), ("'s", 'POS'), ('lease', 'NN'), ('hath', 'NN'), ('all', 'DT'), ('too', 'RB'), ('short', 'JJ'), ('a', 'DT'), ('date', 'NN'), (':', ':')]


## Python knowledge

### Exercise 6

6.a) Explain in your own words the difference between the global and the local scope.

Global variables are defined outside of any function, often at the top of the program, and are accessible from every part of the program, even within functions. Local variables are only accessible within the function in which they are created; they are destroyed once the function has finished.
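
A minimal sketch of the difference; the variable and function names are made up for illustration:

```python
counter = 0  # global variable: defined at module level


def read_global():
    # A function can read a global variable without any declaration.
    return counter


def shadow_global():
    # Assigning inside a function creates a new local variable;
    # the global counter is left untouched.
    counter = 10
    return counter


def change_global():
    # The global keyword is needed to rebind the global variable.
    global counter
    counter += 1


print(read_global())    # 0
print(shadow_global())  # 10
print(counter)          # 0
change_global()
print(counter)          # 1
```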

6.b) What is the difference between the modes 'w' and 'a' when opening a file?

The 'w' mode stands for write: it overwrites an existing file, or creates a new one if the file does not exist yet. The 'a' mode stands for append: it adds new content to the end of an existing file, and also creates the file if it does not exist yet.
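
The difference can be checked with a small experiment. The sketch below uses a temporary folder and a made-up file name, so no existing file is touched:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'demo.txt'

    # 'w' creates the file because it does not exist yet.
    with open(path, 'w') as outfile:
        outfile.write('first\n')

    # 'w' again: the previous content is discarded.
    with open(path, 'w') as outfile:
        outfile.write('second\n')
    after_write = path.read_text()

    # 'a': the new text is added after the existing content.
    with open(path, 'a') as outfile:
        outfile.write('third\n')
    after_append = path.read_text()

print(repr(after_write))   # 'second\n'
print(repr(after_append))  # 'second\nthird\n'
```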