# Building and Visualizing Word Frequencies

In this lab, we focus on the build_freqs() helper function and on visualizing a dataset fed into it. For tweet sentiment analysis, this function builds a dictionary in which we can look up how many times a word appears in the positive or the negative tweets. This will be very helpful when extracting the features of the dataset in the week's programming assignment. Let's see how the function is implemented under the hood.


In [2]:
import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import numpy as np 

In [3]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

tweets = all_positive_tweets + all_negative_tweets

print("number of tweets : ",len(tweets))


number of tweets :  10000


Now we will build a label array that matches our tweets.
This array has 10000 elements.
The first 5000 are 1 labels denoting positive sentiment, and the next 5000 are 0 labels denoting negative sentiment.

In [4]:
labels = np.append(np.ones(len(all_positive_tweets)), np.zeros(len(all_negative_tweets)))

In [5]:
labels[0]

1.0

In [6]:
labels[9999]

0.0
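
The layout of the label array can also be checked as a whole rather than one index at a time. The following is a minimal sketch that assumes made-up sizes (3 positive and 2 negative items) so that it runs without the corpus; `n_pos` and `n_neg` are illustrative names, not part of the lab.

```python
import numpy as np

# Made-up sizes standing in for len(all_positive_tweets) and len(all_negative_tweets)
n_pos, n_neg = 3, 2

# Same construction as in the lab: ones first, then zeros
labels = np.append(np.ones(n_pos), np.zeros(n_neg))

print(labels.shape)          # one label per item
print(int(labels.sum()))     # number of positive labels
print(labels[:n_pos])        # the positive block
print(labels[n_pos:])        # the negative block
```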

In [7]:
dictionary = {'key1' : 1 , 'key2': 2 }

In [8]:
# add a new entry
dictionary['key3'] = -4

# overwrite the value of key1
dictionary['key1'] = 0

print(dictionary)


{'key1': 0, 'key2': 2, 'key3': -4}


Accessing values and looking up keys
- we can use the square bracket notation
- or the get() method

In [10]:
print(dictionary['key2'])

2


In [11]:
print(dictionary['key9'])

KeyError: 'key9'

In [13]:
#this prints a value
if 'key1' in dictionary:
    print("item found : ",dictionary['key1'])

else:
    print('key1 is not defined')

#same but with the get function
print("item found : " , dictionary.get('key1' , -1))

item found :  0
item found :  0


In [14]:
# This prints a message because the key is not found
if 'key7' in dictionary:
    print(dictionary['key7'])
else:
    print('key does not exist!')

# This prints -1 because the key is not found and we set the default to -1
print(dictionary.get('key7', -1))

key does not exist!
-1
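
The default argument of get() is also what makes counting compact: reading the current count with a default of 0 and adding 1 covers both the first and the later occurrences of a key. A minimal sketch, using a made-up token list (`words` and `counts` are illustrative names, not part of the lab):

```python
# Hypothetical token list, used only for illustration
words = ['happi', 'friday', 'happi', ':)', 'friday', 'happi']

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no if/else is needed
    counts[word] = counts.get(word, 0) + 1

print(counts)
```

This is equivalent to the explicit if/else membership test, just shorter.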


Let's take a look at the build_freqs() function provided in the Coursera notebook.
This function creates a dictionary containing the word counts from each corpus. It relies on a tweet preprocessing helper, which is defined first.

In [18]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweets(tweet):
    # remove the old-style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]+', '', tweet)

    # remove hyperlinks
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

    # remove hashtags (only the # sign)
    tweet2 = re.sub(r'#', '', tweet2)

    # instantiate the tokenizer class
    tokenizer = TweetTokenizer(preserve_case=False,
                               strip_handles=True,
                               reduce_len=True)

    # tokenize the tweet
    tweet_tokens = tokenizer.tokenize(tweet2)

    # load the English stop words from nltk
    stopwords_english = stopwords.words('english')

    # keep only tokens that are neither stop words nor punctuation
    tweets_clean = []
    for word in tweet_tokens:
        if word not in stopwords_english and word not in string.punctuation:
            tweets_clean.append(word)

    stemmer = PorterStemmer()

    # create an empty list to store the stems
    tweets_stem = []
    for word in tweets_clean:
        # stem each word
        stem_word = stemmer.stem(word)
        tweets_stem.append(stem_word)

    return tweets_stem

print(process_tweets(tweets[2277]))



['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']


In [15]:
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its frequency
    """
    
    # Convert the np array to a list, since zip needs an iterable.
    # The squeeze is necessary; otherwise tolist() returns nested lists
    # instead of a flat list of labels.
    # Note that this is effectively a no-op if ys is already a flat list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}

    for y, tweet in zip(yslist, tweets):
        # the helper defined earlier in this notebook is named process_tweets
        for word in process_tweets(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
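
To see why the squeeze matters, here is a small sketch that assumes `ys` arrives as an m x 1 column array (the label values below are made up). Without np.squeeze, tolist() returns nested lists, and a list cannot be part of a dictionary key.

```python
import numpy as np

# Made-up column of labels with shape (3, 1)
ys = np.array([[1.0], [1.0], [0.0]])

nested = ys.tolist()            # each label is wrapped in its own list
flat = np.squeeze(ys).tolist()  # flat list of floats

print(nested)
print(flat)

# A (word, label) tuple is hashable only when the label itself is hashable
freqs = {}
freqs[('happi', flat[0])] = 1
try:
    freqs[('happi', nested[0])] = 1
except TypeError as err:
    print('unhashable key:', err)
```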
    

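Let's understand the logic behind this. Each key of freqs is a (word, label) pair, and its value is the number of times that word appears in tweets with that label. For example: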

### "followfriday" appears 25 times in the positive tweets
('followfriday', 1.0): 25

### "shame" appears 19 times in the negative tweets
('shame', 0.0): 19 
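
Entries like these can be read back with the dictionary get() method, using 0 as the default for pairs that never occurred. A minimal sketch, assuming a toy `freqs` dictionary: the first two counts reuse the numbers above, the third entry is made up, and `lookup` is a hypothetical helper that is not part of the lab.

```python
# Toy frequency dictionary; the first two counts come from the text above,
# the third entry is made up for illustration
freqs = {
    ('followfriday', 1.0): 25,
    ('shame', 0.0): 19,
    ('shame', 1.0): 2,
}

def lookup(freqs, word):
    """Return (positive_count, negative_count) for a word, defaulting to 0."""
    pos = freqs.get((word, 1.0), 0)
    neg = freqs.get((word, 0.0), 0)
    return pos, neg

for word in ['followfriday', 'shame', 'sunflow']:
    print(word, lookup(freqs, word))
```

A table of such (positive, negative) counts per word is one plausible starting point for plotting word frequencies.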