# Practice with Dictionaries & Counters

NLP challenges that introduce Counters and give more practice with dictionaries and lists.

## Counters

Counters are a subclass (subtype) of Python's `dict`. They're useful for quickly counting or tallying things up.


Key difference from dictionaries:
* A `Counter` returns a count of 0 for missing items instead of raising a `KeyError`.
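
A minimal sketch of that difference, using made-up keys (the names `plain` and `counts` are only for illustration): the plain dictionary raises a `KeyError` for a missing key, while the `Counter` returns 0 and does not add the key.

```python
from collections import Counter

plain = {'eggs': 1, 'ham': 1}
counts = Counter(plain)

# A plain dict raises KeyError when the key is missing
try:
    plain['green_eggs']
    dict_result = 'found'
except KeyError:
    dict_result = 'KeyError'

# A Counter returns 0 instead, and the lookup does not add the key
counter_result = counts['green_eggs']

print(dict_result)
print(counter_result)
print('green_eggs' in counts)
```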

In [None]:
from collections import Counter
# import the Counter class from the collections module

In [None]:
c = Counter(['eggs', 'ham'])
c

In [None]:
# The key "green_eggs" does not exist in our counter
c['green_eggs']

As with dictionaries, you can retrieve a value by its key.

In [None]:
c['eggs']

We can increment values

In [None]:
c['eggs'] += 1

In [None]:
print(c['eggs'])

Why a default count of zero is useful: we can increment a key without first checking whether it exists.

In [None]:
color_counts = Counter()

for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    color_counts[word] += 1

In [None]:
color_counts
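
For comparison, here is a sketch of the same tally with a plain dictionary, where `dict.get` has to supply the default of 0. It also shows two `Counter` conveniences that are not used above: building the counter directly from the list, and `most_common`, which returns the highest counts first. The names `words` and `dict_counts` are introduced only for this example.

```python
from collections import Counter

words = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# With a plain dict we have to supply the default of 0 ourselves
dict_counts = {}
for word in words:
    dict_counts[word] = dict_counts.get(word, 0) + 1

# A Counter can also be built directly from the list
color_counts = Counter(words)

print(dict_counts)
print(color_counts.most_common(1))
```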

# NLP Example

Practice with lists and dictionaries.

In [None]:
import requests

response = requests.get("https://raw.githubusercontent.com/khushmeeet/potter-nlp/master/final_data/book1.txt")
data = response.text

### 1. Inspect the data

* What type of object is the `data` variable?
* What's the length of the `data` variable?
* How many words and characters?
* What does the text look like? (Can you look at a subset of the data?)
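
One possible way to answer these questions, sketched on a short stand-in string so it runs without downloading anything. The variable `sample` and its contents are assumptions for illustration; in the notebook you would use `data` instead.

```python
# Hypothetical stand-in for the downloaded text, so the example runs offline
sample = "mr. and mrs. dursley of number four privet drive were proud to say"

kind = type(sample)            # the object type
n_chars = len(sample)          # number of characters
n_words = len(sample.split())  # number of whitespace-separated words
preview = sample[:20]          # a slice shows the first 20 characters

print(kind)
print(n_chars, n_words)
print(preview)
```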

To count word occurrences, we want to split the text from a single huge string into individual words, or tokens. We could do this with the string `split` method.

In [None]:
# split 40 characters (indices 260 to 299) into tokens on the space character
data[260:300].split(" ")

However, this makes it harder to count characters like "mrs. dursley", whose names span multiple words. Let's replace these names with single tokens before we count the occurrences of characters.
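
A small sketch of the problem, using a made-up sentence rather than the real book text: after splitting on spaces, the two-word name never appears as a single token, so its count is 0.

```python
from collections import Counter

# Hypothetical sentence standing in for a slice of the book text
sentence = "mrs. dursley was thin and mrs. dursley was blonde"

tokens = sentence.split(" ")
counts = Counter(tokens)

# The two-word name is never a single token, so its count is 0
print(counts["mrs. dursley"])
# Its pieces are counted separately instead
print(counts["mrs."], counts["dursley"])
```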

### 2. Correct some mistakes

We realize that our naive tokenization (splitting our full text string into individual words) made some errors.

Using the dictionary below, which maps each original substring to its desired replacement, replace these substrings in the original data.

In [None]:
# Dictionary mapping a substring to its replacement
corrections = {'mr. and mrs. dursley': 'the_dursleys',
               'mr. dursley': 'mr_dursley',
               'mrs. dursley': 'mrs_dursley'
              }

### 3. Split data into tokens

Make a list of strings, where each element is one word from the `data` variable. To simplify things, we can split on the space character only.
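
One possible approach, sketched on a made-up string (`text` is a stand-in for `data`): apply the corrections first, then split on spaces. Sorting the keys by length is an extra precaution assumed here, so that the longest phrase is replaced before the shorter ones that overlap with it.

```python
# Hypothetical stand-in for the book text
text = "mr. and mrs. dursley were proud. mr. dursley hummed and mrs. dursley gossiped."

corrections = {'mr. and mrs. dursley': 'the_dursleys',
               'mr. dursley': 'mr_dursley',
               'mrs. dursley': 'mrs_dursley'}

# Replace longer phrases first so 'mr. and mrs. dursley' is not
# partially rewritten by the shorter 'mrs. dursley' rule
for original in sorted(corrections, key=len, reverse=True):
    text = text.replace(original, corrections[original])

# Split the corrected text on the space character only
tokens = text.split(" ")
print(tokens)
```

Note that punctuation stays attached to the neighbouring word (for example `'proud.'`), which is a limitation of splitting on spaces only.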

### 4. How many times does each character's name appear?

In [None]:
hp_chars = list(corrections.values()) + ['harry', 'hagrid', 'dumbledore', 'hermione','ron']
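
A sketch of one way to do the count, assuming the tokens are already in a list. The token list below is made up, and `token_counts` is a name introduced for illustration; `char_counts` matches the name used by the plotting helper code later on.

```python
from collections import Counter

# Hypothetical token list; in the notebook this would come from splitting `data`
tokens = ['harry', 'looked', 'at', 'hagrid', 'and', 'mr_dursley',
          'while', 'harry', 'waited', 'for', 'ron']

hp_chars = ['the_dursleys', 'mr_dursley', 'mrs_dursley',
            'harry', 'hagrid', 'dumbledore', 'hermione', 'ron']

# Count every token once
token_counts = Counter(tokens)

# Keep only the names we care about; names that never appear get a count of 0
char_counts = {name: token_counts[name] for name in hp_chars}
print(char_counts)
```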

### 5. Plot your results

Read the documentation for Seaborn barplot to learn how to make a chart: https://seaborn.pydata.org/generated/seaborn.barplot.html

Below is some helper code to convert our dictionary into a new object type, a pandas `DataFrame`, to match the Seaborn documentation and examples.

In [None]:
import seaborn as sns
import pandas as pd

In [None]:
# we're going to make a DataFrame object from our dictionary
# note: this reuses the name `data`, so the book text is no longer available under that name
data = pd.DataFrame(list(char_counts.values()), index=list(char_counts.keys()))
# I'm renaming the columns and resetting the index to match the Seaborn tutorial
data = data.reset_index().rename(columns={'index': 'character', 0: 'num_occurances'})

Now we have a DataFrame with one row for each character and two columns: `character` and `num_occurances`.

In [None]:
data

In [None]:
# We can remove harry from the dataset if we want
# data = data[data['character'] != 'harry']
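
A minimal sketch of the plot itself, assuming a DataFrame with the two columns described above. The counts are made-up numbers, and the name `df` is used only in this example so it does not clash with the notebook's own variables.

```python
import pandas as pd
import seaborn as sns

# Hypothetical counts, standing in for the real `char_counts`
char_counts = {'harry': 12, 'hagrid': 5, 'ron': 3}

# Build the two-column DataFrame directly from the dictionary
df = pd.DataFrame({'character': list(char_counts.keys()),
                   'num_occurances': list(char_counts.values())})

# One bar per character, with bar height given by the count
ax = sns.barplot(data=df, x='character', y='num_occurances')
ax.set_title('Name mentions (made-up numbers)')
```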