# Programming Assignment 1: Tokenize and Count Words in Alice in Wonderland

## CS 584-WN: Natural Language Processing

#### Assignment Overview
The goal of this assignment is to thoroughly tokenize the text of Alice in
Wonderland (a .txt file found in the "Programming Assignment 1" module)
and create two frequency dictionaries:
1. A Token Frequency Dictionary: Case-sensitive tokens based on specific tokenization rules.
2. A Full Word Frequency Dictionary: Complete, correctly spelled words.

These will be Python dictionaries. You must comment your code thoroughly,
line by line, explaining each new function and what it does to process the text.
You do NOT need to comment on well-known, trivial functions like the print
statement. However, you MUST explain how each regex or tokenization function
changes the text.

Words in the "Full Word Frequency Dictionary" must be spelled correctly.
You will be graded at least partially on how correct and comprehensive this
dictionary is.

#### Output Requirements
1. Token Frequency Dictionary
  * Data Structure: Python dictionary.
  * Keys: Case-sensitive tokens (e.g., "Alice", "ALICE", "said").
  * Values: Frequency of each token (e.g., the number of occurrences).
  * Punctuation: Include punctuation as standalone tokens.
  * Example: "Alice": 5, "said": 10, "ca": 3, "n’t": 3, ",": 15
2. Full Word Frequency Dictionary
  * Data Structure: Python dictionary.
  * Keys: Complete words.
  * Values: Frequency of each complete word.
  * Example: "Alice": 5, "can’t": 3, "believe": 2

#### Instructions
1. Tokenization:
  * Tokenize the text of Alice in Wonderland using industry-standard rules:  
    – Split punctuation if it’s not part of the word (e.g., commas, periods).  
    – Split contractions into separate tokens (e.g., "ca", "n’t").  
    – Maintain case sensitivity (e.g., "Alice" is distinct from "ALICE").  
    – Split hyphenated words unless they are common expressions or proper nouns.
2. Dictionary Creation:
  * Create a Token Dictionary that tracks the frequency of each token.
  * Create a Full Word Dictionary that tracks the frequency of each full word in the text (e.g., "can’t" remains one word in this dictionary).
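The hyphenation rule is the least mechanical of the tokenization rules, so here is a minimal sketch of one way to approach it. The `KEEP_WHOLE` allow-list and the `split_hyphens` helper are hypothetical names introduced for illustration; the assignment does not say which expressions count as "common", so the list contents are an assumption.

```python
import re

# Hypothetical allow-list: the assignment does not specify which
# hyphenated expressions should stay whole, so this entry is an assumption.
KEEP_WHOLE = {"bread-and-butter"}

def split_hyphens(text):
    """Return word tokens, splitting hyphenated words unless allow-listed."""
    tokens = []
    # \w+(?:-\w+)* matches a word plus any hyphen-joined continuations
    for chunk in re.findall(r"\w+(?:-\w+)*", text):
        if "-" not in chunk or chunk in KEEP_WHOLE:
            tokens.append(chunk)
        else:
            # The capturing group in re.split keeps each hyphen as its own token
            tokens.extend(part for part in re.split(r"(-)", chunk) if part)
    return tokens

print(split_hyphens("a daisy-chain and bread-and-butter"))
# ['a', 'daisy', '-', 'chain', 'and', 'bread-and-butter']
```

Keeping the hyphen as a standalone token follows the rule that punctuation is counted separately; dropping it instead would be an equally defensible reading.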

In [1]:
import re
from collections import defaultdict

### Part 1

In [31]:
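def tokenize(text):
    # defaultdict(int) starts every unseen token at 0, so += 1 needs no key check
    token_freq = defaultdict(int)

    # re.findall returns every non-overlapping match, scanning left to right.
    # The alternatives are tried in order at each position:
    #   \b\w+(?=n't)  word stem directly followed by n't ("can't" -> "ca")
    #   n't           the contraction suffix as its own token
    #   \b\w+         any other run of letters, digits, or underscores
    #   [.,!?;]       one punctuation mark as a standalone token
    # Characters matched by no alternative (apostrophes, quotes, colons,
    # hyphens, whitespace) are dropped, so "Alice's" becomes "Alice" and "s".
    tokens = re.findall(r"\b\w+(?=n't)|n't|\b\w+|[.,!?;]", text)
    for token in tokens:
        # Keys are case-sensitive: "Alice" and "ALICE" are counted separately
        token_freq[token] += 1

    # Convert to a plain dict for the caller
    return dict(token_freq)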

### Testing the tokenization function

In [32]:
text = "Alice can't can't climb up the stairs"
token_frequencies = tokenize(text)

for token, freq in token_frequencies.items():
    print(f"'{token}': {freq}")

'Alice': 1
'ca': 2
'n't': 2
'climb': 1
'up': 1
'the': 1
'stairs': 1
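The test above exercises only the `n't` case. As a hedged sketch, the pattern below also keeps clitics such as `'s` and `'ll` as tokens with their apostrophes, accepts curly apostrophes, and treats any other punctuation mark as a standalone token. The function name `tokenize_with_clitics` and the list of clitics are assumptions made for illustration, not part of the assignment.

```python
import re

# Assumed pattern, tried left to right at each position:
#   \w+(?=n['’]t\b)          stem before n't ("didn't" -> "did")
#   n['’]t\b                 the n't suffix itself
#   ['’](?:s|ll|re|ve|d|m)\b clitics such as 's, 'll, 're
#   \w+                      any other word
#   [^\w\s]                  any single punctuation mark
CLITIC_RE = re.compile(
    r"\w+(?=n['’]t\b)|n['’]t\b|['’](?:s|ll|re|ve|d|m)\b|\w+|[^\w\s]"
)

def tokenize_with_clitics(text):
    """Return the list of tokens found by CLITIC_RE, in text order."""
    return CLITIC_RE.findall(text)

print(tokenize_with_clitics("Alice's sister didn't say I'll go!"))
# ['Alice', "'s", 'sister', 'did', "n't", 'say', 'I', "'ll", 'go', '!']
```

An opening single quote followed by a word that happens to be one of the listed clitics would be misread as a clitic, so real text may need a check on the preceding character.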


### Processing the file with the tokenization function

In [36]:
def process_file(file_path, freqFunc):
    # Read the whole file into a single string
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # 'token' selects tokenize(); any other value selects frequency().
    # Only the chosen name is looked up, so frequency() may be defined later.
    counter = tokenize if freqFunc == 'token' else frequency

    # Print each entry in order of first appearance in the text
    for token, freq in counter(text).items():
        print(f"'{token}': {freq}")

process_file('alice_in_wonderland.txt', 'token')

'Alice': 396
's': 195
'Adventures': 3
'in': 357
'Wonderland': 3
'ALICE': 3
'S': 7
'ADVENTURES': 1
'IN': 2
'WONDERLAND': 1
'Lewis': 1
'Carroll': 1
'THE': 9
'MILLENNIUM': 1
'FULCRUM': 1
'EDITION': 1
'3': 1
'.': 978
'0': 1
'CHAPTER': 12
'I': 543
'Down': 3
'the': 1527
'Rabbit': 45
'Hole': 1
'was': 363
'beginning': 14
'to': 725
'get': 44
'very': 126
'tired': 7
'of': 500
'sitting': 10
'by': 54
'her': 243
'sister': 9
'on': 189
'bank': 3
',': 2418
'and': 802
'having': 10
'nothing': 30
'do': 119
'once': 31
'or': 76
'twice': 5
'she': 509
'had': 184
'peeped': 3
'into': 67
'book': 11
'reading': 3
'but': 133
'it': 527
'no': 69
'pictures': 4
'conversations': 1
'what': 93
'is': 104
'use': 18
'a': 615
'thought': 74
'without': 26
'conversation': 10
'?': 202
'So': 27
'considering': 3
'own': 10
'mind': 9
'as': 246
'well': 40
'could': 82
'for': 140
'hot': 7
'day': 29
'made': 30
'feel': 8
'sleepy': 5
'stupid': 5
'whether': 11
'pleasure': 2
'making': 8
'daisy': 1
'chain': 1
'would': 82
'be': 145
'worth': 4
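The listing above is in order of first appearance, which makes it hard to see the most frequent tokens. One possible way to rank a frequency dictionary is `collections.Counter.most_common`; the helper name `top_n` is hypothetical, and the sample counts are copied from the output above.

```python
from collections import Counter

def top_n(freqs, n=10):
    """Return the n most frequent (token, count) pairs, highest count first."""
    return Counter(freqs).most_common(n)

# A few entries taken from the printed output
sample = {"the": 1527, ",": 2418, "Alice": 396, "Hole": 1}
for token, freq in top_n(sample, 3):
    print(f"'{token}': {freq}")
# ',': 2418
# 'the': 1527
# 'Alice': 396
```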


### Part 2 

In [48]:
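def frequency(text):
    # Counts start at 0 for unseen keys
    word_freq = defaultdict(int)

    # \w+ matches a run of letters, digits, or underscores; [.,!?;] matches one
    # punctuation mark. Note: \w does not match an apostrophe, so this pattern
    # splits "can't" into "can" and "t" rather than keeping it as one word.
    full_words = re.findall(r"\w+|[.,!?;]", text)
    for token in full_words:
        word_freq[token] += 1
    return dict(word_freq)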
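Because `\w+` stops at an apostrophe, the pattern `\w+|[.,!?;]` breaks "can't" into two pieces. A minimal sketch of one alternative, assuming that apostrophes may be straight or curly and that punctuation is left out of the full-word dictionary, is shown below. The names `FULL_WORD_RE` and `full_word_frequency` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a full word is a run of word characters, optionally joined
# to further runs by a straight (') or curly (’) apostrophe.
FULL_WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

def full_word_frequency(text):
    """Count complete words, keeping contractions such as "can't" whole."""
    return dict(Counter(FULL_WORD_RE.findall(text)))

print(full_word_frequency("Alice can't can't believe it"))
# {'Alice': 1, "can't": 2, 'believe': 1, 'it': 1}
```

This sketch also keeps possessives such as "Alice's" as a single key; whether they should be merged with "Alice" is a judgment call the assignment leaves open.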

In [49]:
process_file('alice_in_wonderland.txt', 'frequency')

'Alice': 396
's': 195
'Adventures': 3
'in': 357
'Wonderland': 3
'ALICE': 3
'S': 7
'ADVENTURES': 1
'IN': 2
'WONDERLAND': 1
'Lewis': 1
'Carroll': 1
'THE': 9
'MILLENNIUM': 1
'FULCRUM': 1
'EDITION': 1
'3': 1
'.': 978
'0': 1
'CHAPTER': 12
'I': 543
'Down': 3
'the': 1527
'Rabbit': 45
'Hole': 1
'was': 352
'beginning': 14
'to': 725
'get': 44
'very': 126
'tired': 7
'of': 500
'sitting': 10
'by': 54
'her': 243
'sister': 9
'on': 189
'bank': 3
',': 2418
'and': 802
'having': 10
'nothing': 30
'do': 68
'once': 31
'or': 76
'twice': 5
'she': 509
'had': 177
'peeped': 3
'into': 67
'book': 11
'reading': 3
'but': 133
'it': 527
'no': 69
'pictures': 4
'conversations': 1
'what': 93
'is': 97
'use': 18
'a': 615
'thought': 74
'without': 26
'conversation': 10
'?': 202
'So': 27
'considering': 3
'own': 10
'mind': 9
'as': 246
'well': 40
'could': 73
'for': 140
'hot': 7
'day': 29
'made': 30
'feel': 8
'sleepy': 5
'stupid': 5
'whether': 11
'pleasure': 2
'making': 8
'daisy': 1
'chain': 1
'would': 70
'be': 145
'worth': 4
't