## Objective
***
This notebook demonstrates three functions from the tokenizer library: `clean_text`, `tokenizer`, and `count_words`.

In [1]:
from tokenizer import clean_text, tokenizer, count_words


In [2]:
!head ../data/raw/The_Raven_17192.txt

﻿The Project Gutenberg eBook of The Raven
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.



In [3]:
# helper function - reads a book file into a single string
from pathlib import Path


def converts_book_to_string(book_path: str) -> str:
    """Take a Project Gutenberg book and convert it to a string.

    Removes newlines and the byte order mark (BOM). Newlines are removed
    without inserting a space, so the words on either side of a line
    break are joined.
    """
    txt = Path(book_path).read_text(encoding="utf-8")

    txt = txt.replace('\n', '')
    txt = txt.replace('\ufeff', '')

    return txt


In [4]:
# load the book with newlines and the byte order mark (BOM) removed; neither is part of a valid word
sample_text = converts_book_to_string("../data/raw/The_Raven_17192.txt")
sample_text[:500]

'The Project Gutenberg eBook of The Raven    This ebook is for the use of anyone anywhere in the United States andmost other parts of the world at no cost and with almost no restrictionswhatsoever. You may copy it, give it away or re-use it under the termsof the Project Gutenberg License included with this ebook or onlineat www.gutenberg.org. If you are not located in the United States,you will have to check the laws of the country where you are locatedbefore using this eBook.Title: The RavenAuth'
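
The output above contains joined words such as `andmost` because each newline is removed without a replacement. Below is a minimal sketch of one alternative, assuming that keeping word boundaries intact is desired: replace each newline with a space. The name `book_text_to_string` is hypothetical and is not part of this notebook or the tokenizer library. It takes text that has already been read, so it runs without the data file.

```python
def book_text_to_string(raw_text: str) -> str:
    """Hypothetical variant: drop the BOM and turn newlines into spaces."""
    txt = raw_text.replace('\ufeff', '')
    # a space keeps the words on either side of a line break separate
    return txt.replace('\n', ' ')


sample = '\ufeffin the United States and\nmost other parts of the world'
print(book_text_to_string(sample))
# in the United States and most other parts of the world
```

Text prepared this way would give different tokens and counts from the outputs shown in this notebook.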

### Clean Text
* lowercases the text
* removes punctuation

In [5]:
clean_text(sample_text)[:300]

'the project gutenberg ebook of the raven    this ebook is for the use of anyone anywhere in the united states andmost other parts of the world at no cost and with almost no restrictionswhatsoever you may copy it give it away or reuse it under the termsof the project gutenberg license included with t'
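
The library's implementation of `clean_text` is not shown here. As a rough sketch of the two steps listed above, assuming only ASCII punctuation needs to be removed, the same effect can be had with `str.lower` and `str.translate`. The name `sketch_clean_text` is hypothetical and is used to avoid confusion with the library function.

```python
import string


def sketch_clean_text(text: str) -> str:
    """Assumed behavior: lowercase the text and delete ASCII punctuation."""
    # with a third argument, str.maketrans maps each listed character to None
    remove_punct = str.maketrans('', '', string.punctuation)
    return text.lower().translate(remove_punct)


print(sketch_clean_text('You may copy it, give it away or re-use it.'))
# you may copy it give it away or reuse it
```

As in the notebook output, `re-use` becomes `reuse` because the hyphen is deleted rather than replaced with a space.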

### Tokenize Text
* splits a string into a list of tokens

In [6]:
tokenizer(sample_text)[:10]

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Raven', '', '', '']

In [7]:
tokenizer(clean_text(sample_text))[:10]

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'raven', '', '', '']
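
The empty strings at the end of both outputs are consistent with splitting on a single space character: the source text has four spaces after "Raven", which yields three empty tokens. The sketch below assumes that behavior (the library's implementation is not shown) and contrasts it with `str.split()` called without arguments, which collapses runs of whitespace. The name `sketch_tokenizer` is hypothetical.

```python
def sketch_tokenizer(text: str) -> list[str]:
    """Assumed behavior: split on single spaces, keeping empty tokens."""
    return text.split(' ')


text = 'The Raven    This ebook'  # four spaces after 'Raven'
print(sketch_tokenizer(text))  # ['The', 'Raven', '', '', '', 'This', 'ebook']
print(text.split())            # ['The', 'Raven', 'This', 'ebook']
```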

### Count Text
* counts how many times each word occurs in `sample_text`; the text is lowercased and tokenized before counting
* returns the empty string `''` as a valid key

In [8]:
# the full output is long, so show only the first 20 entries
dict(list(count_words(sample_text).items())[:20])

{'the': 640,
 'project': 66,
 'gutenberg': 21,
 'ebook': 12,
 'of': 379,
 'raven': 36,
 '': 2844,
 'this': 79,
 'is': 99,
 'for': 68,
 'use': 13,
 'anyone': 4,
 'anywhere': 2,
 'in': 153,
 'united': 11,
 'states': 13,
 'andmost': 1,
 'other': 22,
 'parts': 3,
 'world': 5}
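
The `''` key has the highest count shown above (2844) because every empty token is counted like a word. If those entries are not wanted, one option is to drop empty tokens before counting. This is a minimal sketch using `collections.Counter`. The name `sketch_count_words` is hypothetical, and it does not describe how the library's `count_words` works.

```python
from collections import Counter


def sketch_count_words(tokens: list[str]) -> dict[str, int]:
    """Hypothetical counter that skips empty-string tokens."""
    # an empty string is falsy, so `if token` filters it out
    return dict(Counter(token for token in tokens if token))


tokens = ['the', 'raven', '', '', 'the', 'ebook']
print(sketch_count_words(tokens))
# {'the': 2, 'raven': 1, 'ebook': 1}
```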