# Word Counts

When you are working with a large collection of Twitter data, it can be useful to look at which words are used in the dataset. To do this, we need a function that reads through the tweets, splits the text of each tweet on spaces, normalizes (lowercases) each word, and keeps track of how many times each word was seen using a Python dictionary.

In [4]:
import json
import gzip

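def word_counts(filename):
    """Return a dict mapping each lowercased word to the number of times it appears."""
    counts = {}
    # The file is gzip-compressed, with one JSON object (tweet) per line.
    with gzip.open(filename, 'rt', encoding='utf-8') as f:
        for line in f:
            tweet = json.loads(line)
            # Skip records without text instead of abandoning the rest of the file.
            if 'text' not in tweet:
                continue
            # Splitting on a single space means consecutive spaces yield empty strings.
            for word in tweet['text'].split(' '):
                word = word.lower()
                counts[word] = counts.get(word, 0) + 1
    return counts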
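
To see the counting idiom in isolation, here is a minimal sketch that runs the same split, lowercase, and tally steps on a single string. The names `sample_text` and `sample_counts` and the text itself are made up for illustration and do not come from the dataset.

```python
# Hypothetical sample text (note the two spaces before "world").
sample_text = "RT Hello hello  world"

sample_counts = {}
for word in sample_text.split(' '):
    word = word.lower()
    # get() returns 0 the first time a word is seen, so no KeyError is raised.
    sample_counts[word] = sample_counts.get(word, 0) + 1

print(sample_counts)
# {'rt': 1, 'hello': 2, '': 1, 'world': 1}
```

The empty string appears as a "word" because splitting on a single space produces an empty token between two consecutive spaces.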

Go ahead and change this filename if you want to look at another dataset:

In [5]:
counts = word_counts('data/assorted/ferguson-blacklivesmatter.json.gz')

We can try printing out the results, but the dictionary isn't sorted by count, so the output is hard to read.

In [6]:
print(counts)



We can create a list of the words sorted in descending order by the number of times they appear.

In [7]:
words = sorted(counts, key=counts.get, reverse=True)
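
The call above works because iterating over a dictionary yields its keys, and `key=counts.get` tells `sorted` to rank each key by its count. A small sketch with a made-up dictionary (the name `toy_counts` and its values are hypothetical) shows the effect:

```python
# Hypothetical counts, chosen only to illustrate the ordering.
toy_counts = {"the": 3, "police": 5, "a": 1}

# sorted() iterates over the keys; each key is ranked by toy_counts.get(key).
ranked = sorted(toy_counts, key=toy_counts.get, reverse=True)

print(ranked)
# ['police', 'the', 'a']
```

The result is a list of words only; the counts still live in the dictionary and can be looked up by word.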

Now we can see how many unique words there are:

In [8]:
print(len(words))

18332


And we can print out the top 25 words with their frequency:

In [9]:
for word in words[0:25]:
    print(word, counts[word])

#blacklivesmatter 12322
#ferguson 10851
rt 10582
the 5089
in 4092
to 3368
ferguson 3052
of 3051
police 2631
is 2591
a 2096
 1883
for 1720
i 1584
was 1399
and 1366
have 1327
on 1290
#fergusonreport 1255
not 1242
&amp; 1185
black 1088
from 1032
he 1019
after 998
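
Two entries in this list are worth a second look. The entry with no visible word is most likely the empty string, which `split(' ')` produces when a tweet contains consecutive spaces, and `&amp;` is an HTML-escaped ampersand. One possible cleanup, sketched here under those assumptions, is to unescape the text, split on any run of whitespace, and tally with `collections.Counter`. The helper `clean_word_counts` and the sample texts are hypothetical and are not part of the notebook above.

```python
from collections import Counter
import html


def clean_word_counts(texts):
    """Tally lowercased tokens from an iterable of tweet texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Turn HTML entities such as &amp; back into plain characters.
        unescaped = html.unescape(text)
        # split() with no argument splits on any run of whitespace
        # and never yields empty strings.
        counts.update(word.lower() for word in unescaped.split())
    return counts


# Made-up texts, used only to exercise the helper.
sample_texts = ["Police &amp; protesters  in Ferguson", "RT police in\nferguson"]
sample = clean_word_counts(sample_texts)

# most_common(n) returns the n highest (word, count) pairs.
for word, n in sample.most_common(3):
    print(word, n)
```

Running this kind of cleanup on the real dataset would change the totals shown above, so treat it as a variation to experiment with.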
