# 210315_study

## Problem

You want to build a word cloud, an infographic where the size of a word corresponds to how often it appears in the body of text.

To do this, you'll need data. Write code that takes a long string and builds its word cloud data in a dictionary, where the keys are words and the values are the number of times the words occurred.

Think about capitalized words. For example, look at these sentences:

  'After beating the eggs, Dana read the next step:'
  'Add milk and eggs, then add flour and sugar.'

What do we want to do with "After", "Dana", and "add"? In this example, your final dictionary should include one "Add" or "add" with a value of 2. Make reasonable (not necessarily perfect) decisions about cases like "After" and "Dana".
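
To see why capitalization needs a decision at all, here is a minimal sketch that counts the two example sentences with a plain whitespace split and punctuation strip. The helper name `naive_counts` is hypothetical and is not part of the solution; it only illustrates that "Add" and "add" land in separate keys when case is ignored.

```python
from collections import Counter
import string


def naive_counts(text: str) -> dict:
    """Count whitespace-separated tokens after stripping surrounding punctuation.

    Hypothetical helper for illustration only: it makes no attempt to merge
    differently-cased forms of the same word.
    """
    tokens = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that were nothing but punctuation.
    return dict(Counter(token for token in tokens if token))


text = ("After beating the eggs, Dana read the next step: "
        "Add milk and eggs, then add flour and sugar.")
counts = naive_counts(text)
print(counts["Add"], counts["add"])  # the same word is split across two keys
```

Both keys end up with a count of 1, whereas the problem asks for a single entry with a count of 2.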

Assume the input will only contain words and standard punctuation.

You could make a reasonable argument to use regex in your solution. We won't, mainly because performance is difficult to measure and can get pretty bad.

## Approach

1. Strip punctuation.
2. Split on whitespace to get the words.
3. If the word exists in the dict, add 1 to its count.
4. If the word doesn't exist and is not empty, add it to the dict with a count of 1.
5. Check whether the upper- or lower-case form of the word is already in the dictionary. If the word appears in both cases, merge the counts under the lowercase form; otherwise, keep its original capitalization.
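
Step 5 can also be sketched as a separate pass over an already-built dictionary, which may be easier to reason about than merging during the loop. This is an illustrative alternative under one assumed rule, not the implementation used in the Code section: a capitalized key is folded into its lowercase key only when the lowercase form also appears. The name `merge_cases` is hypothetical.

```python
def merge_cases(counts: dict) -> dict:
    """Fold capitalized keys into lowercase keys when both forms are present.

    Assumed rule (one reasonable choice among several): a word seen in
    lowercase anywhere is treated as a common word, so its capitalized
    occurrences are counted under the lowercase key. Words seen only
    capitalized (for example, names) keep their capitalization.
    """
    merged = {}
    for word, count in counts.items():
        lower = word.lower()
        # Fold into the lowercase key only if that key exists in the input.
        key = lower if lower in counts else word
        merged[key] = merged.get(key, 0) + count
    return merged


print(merge_cases({"Add": 1, "add": 1, "Dana": 1, "eggs": 2}))
# {'add': 2, 'Dana': 1, 'eggs': 2}
```

Under this rule "Dana" stays capitalized because no lowercase "dana" was seen, which matches the "reasonable, not necessarily perfect" goal in the problem statement.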

## Code

In [1]:
def words_to_counts(sentence: str) -> dict:
    res = {}
    allowed = ["'"]
    # replace() and split() each scan the whole string once: O(n)
    # heuristic for handling "..." in a quick and dirty fashion
    wordList = sentence.replace("...", ", ").split()

    for word in wordList:
        # strip a single trailing punctuation character, if present
        if word[-1] not in allowed and not word[-1].isalpha():
            word = word[0:-1]
        if word in res:
            res[word] += 1
        # check for the same word stored in a different case
        elif word.title() in res or word.lower() in res:
            # if the capitalized key exists, remove it and carry its count over;
            # pop() with a default avoids a KeyError when only the lowercase key exists
            currentCnt = res.pop(word.title(), 0)
            # either way, the lowercase key gets one more occurrence
            res[word.lower()] = res.get(word.lower(), 0) + currentCnt + 1
        # only add non-empty words, in case a lone punctuation mark sat between spaces
        elif word:
            res[word] = 1
    return res

## Test

In [2]:
def assertTest(actual, expected):
    if actual == expected:
        print("PASS")
    else:
        print("FAIL")

def test_simple_sentence():
    input = 'I like cake'

    actual = words_to_counts(input)

    expected = {'I': 1, 'like': 1, 'cake': 1}
    assertTest(actual, expected)

def test_longer_sentence():
    input = 'Chocolate cake for dinner and pound cake for dessert'

    actual = words_to_counts(input)

    expected = {
        'and': 1,
        'pound': 1,
        'for': 2,
        'dessert': 1,
        'Chocolate': 1,
        'dinner': 1,
        'cake': 2,
    }
    assertTest(actual, expected)

def test_punctuation():
    input = 'Strawberry short cake? Yum!'

    actual = words_to_counts(input)

    expected = {'cake': 1, 'Strawberry': 1, 'short': 1, 'Yum': 1}
    assertTest(actual, expected)

def test_hyphenated_words():
    input = 'Dessert - mille-feuille cake'

    actual = words_to_counts(input)

    expected = {'cake': 1, 'Dessert': 1, 'mille-feuille': 1}
    assertTest(actual, expected)

def test_ellipses_between_words():
    input = 'Mmm...mmm...decisions...decisions'

    actual = words_to_counts(input)

    expected = {'mmm': 2, 'decisions': 2}
    assertTest(actual, expected)

def test_apostrophes():
    input = "Allie's Bakery: Sasha's Cakes"

    actual = words_to_counts(input)

    expected = {"Bakery": 1, "Cakes": 1, "Allie's": 1, "Sasha's": 1}
    assertTest(actual, expected)
    
test_simple_sentence()
test_longer_sentence()
test_punctuation()
test_hyphenated_words()
test_ellipses_between_words()
test_apostrophes()

PASS
PASS
PASS
PASS
PASS
PASS


## Complexity

### Time

- `replace()` -> `O(n)`, as we go through every char in the string of length `n`
- `split()` -> `O(n)`, as we go through every char to find the split char
- `for` loop -> `O(m)` iterations, where `m` is the length of the word list
- Strip punctuation -> the two checks on the last char are `O(1)`; the slice copies the word, which is `O(n/m)`, where `n/m` is the average word length (total chars / `len(wordList)`)
- Check word in dict -> `O(1)` lookup on average, because a Python dict is a hash table, plus `O(n/m)` to hash the word. The cost does not depend on the dictionary size `d = m - r`, where `r` is the number of repeated words.
- Duplicate handling -> `O(n/m)`: `title()` and `lower()` each copy the word, and the extra lookups are again `O(1)` on average
- The rest of the operations are `O(1)`

Each loop iteration therefore costs time proportional to the length of its word, and the word lengths sum to at most `n`, so the whole loop is `O(n)`.

Basic mathematics implies the following relationship between these variables:

$n \geq m$ because every word contains at least one character.

$m \geq d$ because each dictionary key comes from at least one word in the list.

$\therefore n \geq m \geq d$

Therefore, the dominant term in the algorithm is `n`, and the time complexity grows linearly with the input size `n` asymptotically -> `O(n)`.
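
The relationship between the three sizes can be checked on a concrete input. The sketch below is only an illustration: it measures `n`, `m`, and `d` for one of the test sentences, using a plain `set` of tokens as a stand-in for the dictionary keys, so it ignores the punctuation and case handling in `words_to_counts`.

```python
sentence = "Chocolate cake for dinner and pound cake for dessert"

n = len(sentence)        # number of characters in the input
words = sentence.split()
m = len(words)           # number of words in the list
d = len(set(words))      # number of distinct words (stand-in for dictionary size)

print(n, m, d)  # 52 9 7
assert n >= m >= d
```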

### Space

The allowed characters take constant space, the word list takes `O(n)` space, and the dictionary takes `O(n)` space. The overall space complexity is therefore `O(n)`.