# Counters and plotting

by Koenraad De Smedt at UiB

---
We have already seen that a frequency distribution (`FreqDist`) in NLTK is a kind of counter. We can also make our own counters to find the frequency of each item in a sequence.

This notebook shows:
1.   How to make a counter with the help of `Counter` (a subclass of dict) from the `collections` module. 
2.   How to plot the counts in a bar or line plot.

---



The genetic code in RNA is also a language! As an example, consider the sequence of nucleotides in the five-prime untranslated region (5' UTR) of [the mRNA in the BNT162b2 vaccine](https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/). We want to count how many times the different nucleotides occur in the sequence.

In [None]:
from collections import Counter
fiveprimeutr = 'GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC'
cntr = Counter(fiveprimeutr)
cntr
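
Because `Counter` is a subclass of dict, individual counts can be looked up by key. The following is a minimal sketch on a short made-up sequence (`'GATTACA'` and the names `demo`, `total` and `relative` are illustrative assumptions, not part of the vaccine data). It also shows that asking for an item that never occurred gives 0 rather than an error, and how counts could be turned into relative frequencies.

```python
from collections import Counter

# Hypothetical short sequence, used only for illustration
demo = Counter('GATTACA')

print(demo['A'])  # count of an item that occurs in the sequence
print(demo['X'])  # an item that never occurred counts as 0 (no KeyError)

# Total number of counted items
total = sum(demo.values())

# Relative frequency of each item (count divided by total)
relative = {item: count / total for item, count in demo.items()}
print(relative)
```

Looking up a missing item does not add it to the counter, so `'X' in demo` is still `False` afterwards.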


Let's make a bar plot of the counter. A counter is a kind of dict, so we can retrieve its keys and values.

In [None]:
import matplotlib.pyplot as plt
plt.bar(cntr.keys(), cntr.values(), align='center')
plt.title('Nucleotide frequencies')
plt.show()
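
The bars in such a plot appear in the order in which the items were first counted. If bars sorted by height are preferred, one possible approach is to unpack the pairs from `most_common()` into labels and heights before plotting. This is only a sketch on a made-up sequence; `demo`, `labels` and `heights` are illustrative names.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical short sequence, used only for illustration
demo = Counter('GATTACA')

# most_common() yields (item, count) pairs by decreasing count;
# zip(*...) splits them into one tuple of items and one tuple of counts
labels, heights = zip(*demo.most_common())

plt.bar(labels, heights)
plt.title('Item frequencies, sorted')
plt.ylabel('Count')
plt.show()
```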

---

Let's count character frequencies in a text.

In [None]:
story = '''Once upon a time, there was a princess called Buttercup. She had a
farm-hand called Westley; whenever she tells him to do something, he always
answers: "As you wish." At first she didn't realize he loves her, but
eventually she realizes it and she loves him too! Westley leaves to seek his
fortune overseas so they can marry. When his ship is attacked by the Dread
Pirate Roberts, who is infamous for never leaving survivors, Westley is
presumed dead...'''

charcount = Counter(story.casefold())
charcount

A counter is not sorted by frequency: it keeps its items in the order in which they were first encountered. If we want to get the counted items by decreasing frequency, we can use the `most_common()` method. This produces a list of (item, count) pairs.

In [None]:
charcount.most_common()
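
To make the difference concrete, here is a small sketch on a made-up word (`'banana'` and the variable names are illustrative assumptions). Iterating over a counter gives the items in first-seen order, not sorted by frequency, while `most_common()` sorts by decreasing count and accepts an optional number of items to return.

```python
from collections import Counter

# Hypothetical example word, used only for illustration
demo = Counter('banana')

# Iterating over the counter follows the order of first occurrence
first_seen = list(demo)
print(first_seen)

# most_common() sorts the (item, count) pairs by decreasing count
by_frequency = demo.most_common()
print(by_frequency)

# An optional argument limits the result to the n most frequent items
top_one = demo.most_common(1)
print(top_one)
```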

Let's get the 10 most common items and print each character together with its frequency, one pair per line. Note that the space character is invisible in the output.

In [None]:
for ch, freq in charcount.most_common(10):
    print(ch, freq)
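
If invisible characters such as spaces and newlines should show up in the output, one option is to print the `repr()` of each character, which puts it in quotes and writes a newline as `'\n'`. A minimal sketch on a made-up string (`sample` and `lines` are illustrative names):

```python
from collections import Counter

# Hypothetical short text containing spaces and a newline
sample = 'a b a\n'

# repr() (here via the !r conversion) shows each character in quotes,
# so whitespace characters become visible
lines = [f'{ch!r} {freq}' for ch, freq in Counter(sample).most_common()]
for line in lines:
    print(line)
```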

Put only the frequencies of all items, in decreasing order, into a list. Print the first ten of these numbers.

In [None]:
frequencies = [freq for ch, freq in charcount.most_common()]
print(frequencies[:10])

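Make a line plot of all frequencies with the `pyplot` module. Since the list is sorted by decreasing frequency, the index on the x-axis corresponds to the rank of each character. Uncomment the `plt.figure` instruction to specify the plot resolution.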

In [None]:
# plt.figure(dpi=120)
plt.plot(frequencies)
plt.title('Character frequencies')
plt.ylabel('Frequency')
plt.xlabel('Index')
plt.show()

### Exercises

1.   Can you make a counter of a set? Does it make sense?
2.   For which purposes can knowledge about character frequencies be useful?
3.   If you look at a frequency plot of characters in normal text, what do you observe? Why is it not a straight line?