# Assignment 5

__Table of contents__

1. [Module 7 walkthrough](#Module-7-walkthrough)
1. [Module 8 walkthrough](#Module-8-walkthrough)
1. [Assignment 5](#assignment)
    1. [Acquire tweets](#Acquire-tweets)
    1. [Remove username, URL](#Remove-username-URL)
    1. [Remove punctuation](#Remove-punctuation)
    1. [Remove apostrophes](#Remove-apostrophes)
    1. [Word pattern formatting](#Word-pattern-formatting)
    1. [Remove hashtags](#Remove-hashtags)
    1. [Polarity analysis](#Polarity-analysis)

In [11]:
import os
import sys
import jsonpickle
import json
import tweepy
import html.parser as HTMLParser
import re

import nltk
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
#from nltk.tag import ps_tag
from nltk.corpus import stopwords


modulePath = os.path.abspath(os.path.join('../../..'))
if modulePath not in sys.path:
    sys.path.append(modulePath)
import config

# Standard tweepy API setup

auth = tweepy.OAuthHandler(config.apiKey, config.apiSec)
auth.set_access_token(config.accessToken, config.accessSec)

api = tweepy.API(auth)

# Application authentication tweepy setup
# Use application-only authentication for higher Twitter API rate limit
# Twitter API returns a max of 100 tweets per query
# Allows for 450 queries every 15 minutes
# So we can gather 45,000 tweets every 15 minutes

#Switching to application authentication
auth = tweepy.AppAuthHandler(config.apiKey, config.apiSec)

#Setting up new api wrapper, using authentication only
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
 
# View rate limit status

api.rate_limit_status()['resources']['search']


[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1541983640}}

<a id = 'Module-7-walkthrough'></a>

# Module 7 walkthrough

In [12]:
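# Decode HTML entities (for example, &amp; becomes &) in the raw tweet text.
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() is the supported equivalent.

import html

tweet = "@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com ."
parsedTweet = html.unescape(tweet)
print(parsedTweet)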


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com .


  


In [13]:
# Remove URLs. The raw string keeps \S from being read as a string escape.

urlPattern = re.compile(r'http\S+')
tweet_v1 = re.sub(urlPattern, '', parsedTweet)
print(tweet_v1)


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [14]:
# Remove @usernames (every run of non-whitespace characters that starts with @)

usernamePattern = re.compile(r'@\S+')
tweet_v2 = re.sub(usernamePattern, '', tweet_v1)
print(tweet_v2)


 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [16]:
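# Collapse an elongated "sooooo" to "so". The word boundaries keep the
# pattern from touching other words that contain "so", such as "soon".

wordPattern = re.compile(r'\bso+\b')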
tweet_v3 = re.sub(wordPattern, 'so', tweet_v2)
print(tweet_v3)


 Life is great & I like it so much. It's whatis life. #life #great#like  .


<a id = 'Module-8-walkthrough'></a>

# Module 8 walkthrough

In [None]:
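# Look up the SentiWordNet scores for the first adjective sense of 'happy'.
# obj_score() is the objectivity score, equal to 1 - (positive + negative).

print('positive score: {0}'.format(list(swn.senti_synsets('happy','a'))[0].pos_score()))
print('negative score: {0}'.format(list(swn.senti_synsets('happy','a'))[0].neg_score()))
print('objective score: {0}'.format(list(swn.senti_synsets('happy','a'))[0].obj_score()))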


In [None]:
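# Print the first SentiWordNet synset for each token, if one exists.
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.

sentence = 'i am happy'
tokens = nltk.tokenize.word_tokenize(sentence)
for token in tokens:
    # senti_synsets returns an iterator, so convert it to a list before indexing
    synsets = list(swn.senti_synsets(token))
    if synsets:
        print(synsets[0])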
    

In [None]:
# Remove English stopwords from the tokenized sentence
# (reuses `tokens` from the previous cell).

stop = stopwords.words('english')
sentence = 'i am happy'
newSentence = []
for word in tokens:
    if word not in stop:
        newSentence.append(word)

print('The sentence has been reduced from \'{0}\' \n to \'{1}\''.format(sentence, newSentence))


<a id = 'assignment'></a>

# Assignment 5

* Try cleaning the tweets that you extracted in the previous chapter. Apply the rules above, and in addition apply the rules below:
    * Remove punctuation. Punctuation marks often carry no weight, so you can remove them. Try writing a regular expression to remove commas from sentences. Don't remove question marks ("?") or exclamation marks ("!"), as they affect the meaning of a sentence.
    * Remove apostrophes and expand the words. For example, in the sentence "It's a great time to code!" the first word, "It's", can be expanded to "it is". You can do this with regular expressions.
    * Create a list of word patterns for word formatting. For example, 'gud' should be substituted with 'good'.

* Calculate the polarity of a sentence, and write a program to calculate the polarity of all the tweets that you extracted and preprocessed in the previous questions. Your program should also include the features below:

    * Tweets have hashtags. Remove the hashtags and then find the polarity of each tweet.

    * There might be words that are not present in the SentiWordNet lexicon.
    * The program should handle these cases by giving a zero score for such words.
    * Depending on the question, file uploads or screenshots are necessary to show your work.

<a id = 'Acquire-tweets'></a>

## Acquire tweets

In [None]:
# Find up to 500,000 tweets from the last week containing the word election.
# Store in JSON file

maxTweets = 500000
tweetCount = 0
with open('electionTweets.json','w') as f:
    for tweet in tweepy.Cursor(api.search, q = 'election', tweet_mode = 'extended', lang = 'en').items(maxTweets):
        f.write(jsonpickle.encode(tweet._json, unpicklable = False) + '\n')
        tweetCount += 1
    print('Downloaded {0} tweets'.format(tweetCount))


In [22]:
# Load election tweets into memory

data = []
with open('./electionTweets.json', 'r') as jsonFile:
    for line in jsonFile:
        data.append(json.loads(line))
print('Total number of tweets loaded: {0}'.format(len(data)))



Total number of tweets loaded: 151122


In [23]:
# Unpack all tweets in data

tweets = []
for item in data:
    if 'full_text' in item.keys():
        tweet = item['full_text']
        tweets.append(tweet)
print('Total number of tweets extracted from json: {0}'.format(len(tweets)))


Total number of tweets extracted from json: 151122


In [27]:
data[1]

{'contributors': None,
 'coordinates': None,
 'created_at': 'Sun Nov 11 18:41:56 +0000 2018',
 'display_text_range': [0, 140],
 'entities': {'hashtags': [],
  'symbols': [],
  'urls': [],
  'user_mentions': [{'id': 16199594,
    'id_str': '16199594',
    'indices': [3, 17],
    'name': 'The_Evil_Dr_R',
    'screen_name': 'The_Evil_Dr_R'}]},
 'favorite_count': 0,
 'favorited': False,
 'full_text': "RT @The_Evil_Dr_R: THIS is what voter suppression looks like. How is it even legal for someone to oversee an election they're running in an…",
 'geo': None,
 'id': 1061690506994950144,
 'id_str': '1061690506994950144',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': True,
 'lang': 'en',
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'place': None,
 'quoted_status_id': 1061322349524455425,
 'quoted_status_id_str': '1061322349524455425',


<a id = 'Remove-username-URL'></a>

## Remove username, URL
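
A minimal sketch of this step, reusing the two patterns from the Module 7 walkthrough. The function name `removeUsernameUrl` and the whitespace collapsing at the end are assumptions added for illustration, not part of the walkthrough.

```python
import re

# Same patterns as in the Module 7 walkthrough
urlPattern = re.compile(r'http\S+')
usernamePattern = re.compile(r'@\S+')

def removeUsernameUrl(text):
    """Strip URLs, then @mentions, then collapse the leftover whitespace."""
    text = urlPattern.sub('', text)
    text = usernamePattern.sub('', text)
    # Assumption: repeated spaces left by the removals are not meaningful
    return ' '.join(text.split())

sample = "RT @The_Evil_Dr_R: THIS is what voter suppression looks like. http://example.com"
print(removeUsernameUrl(sample))
```

Because `@\S+` consumes everything up to the next space, the colon after a retweeted username is removed along with the name. Applying the function to the whole corpus could look like `[removeUsernameUrl(t) for t in tweets]`.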

<a id = 'Remove-punctuation'></a>

## Remove punctuation

- Remove ','
- Keep '?','!'
- Others?
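
One possible sketch of the punctuation rule. The default set of characters to drop (comma, semicolon, colon, double quote, brackets) is an assumption, since the list above leaves "others" open; question marks and exclamation marks are deliberately kept, and apostrophes are left for the next step.

```python
import re

def removePunctuation(text, drop=',;:"()[]'):
    """Remove every character listed in `drop`; '?' and '!' are kept."""
    # re.escape guards characters such as ']' that are special in a character class
    pattern = re.compile('[' + re.escape(drop) + ']')
    return pattern.sub('', text)

print(removePunctuation('Well, is it over? Yes (finally)!'))
```

Passing a different `drop` string, for example `drop=','`, restricts the removal to commas only.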

<a id = 'Remove-apostrophes'></a>

## Remove apostrophes

- Remove apostrophes and expand words
    - "It's" becomes "It is", however "Trump's" stays "Trump's"

- Contractions identified so far: It's, what's, whats, that's, thats
- Search through the corpus to find the rest
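
A sketch of how the contractions listed above might be expanded with a lookup table and a single regular expression. The names `contractions` and `expandContractions` are hypothetical, and treating every "'s" in the table as "is" is a simplification ("it's" can also mean "it has"). Possessives such as "Trump's" are untouched because they are not in the table.

```python
import re

# Lookup table built from the contractions noted above; extend after searching the corpus
contractions = {
    "it's": "it is",
    "what's": "what is",
    "whats": "what is",
    "that's": "that is",
    "thats": "that is",
}

# One alternation over all keys, matched as whole words, ignoring case
contractionPattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
    re.IGNORECASE)

def expandContractions(text):
    """Replace each known contraction with its expansion, keeping a leading capital."""
    def replace(match):
        word = match.group(0)
        expanded = contractions[word.lower()]
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return contractionPattern.sub(replace, text)

print(expandContractions("It's what Trump's team said, thats all"))
```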

<a id = 'Word-pattern-formatting'></a>

## Word pattern formatting

- Condense elongated runs of repeated letters so that the word is spelled correctly
    - "Gooooooood" becomes "Good"
    - "Realllllly" becomes "Really"
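
A rough sketch of one way to do this: collapse any run of three or more identical letters down to two, then apply a small substitution list for the words that are still misspelled. The two entries in `wordFixes` are illustrative assumptions ('gud' comes from the assignment text); a real list would be built from the corpus.

```python
import re

# Three or more of the same letter in a row, e.g. "ooooo"
repeatPattern = re.compile(r'([A-Za-z])\1{2,}')

# Hypothetical substitution list for words the collapsing step cannot fix
wordFixes = {'gud': 'good', 'soo': 'so'}
wordFixPattern = re.compile(r'\b(' + '|'.join(wordFixes) + r')\b', re.IGNORECASE)

def formatWords(text):
    """Collapse elongated letter runs, then substitute known misspellings."""
    # Keep two copies of the repeated letter: "Gooooooood" -> "Good"
    text = repeatPattern.sub(r'\1\1', text)

    def replace(match):
        word = match.group(0)
        fixed = wordFixes[word.lower()]
        return fixed.capitalize() if word[0].isupper() else fixed

    return wordFixPattern.sub(replace, text)

print(formatWords('Gooooooood'))
print(formatWords('Realllllly'))
print(formatWords('sooooooooo gud'))
```

Keeping two letters rather than one handles both examples above, but it leaves words like "soo", which is why the substitution list runs second.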

<a id = 'Remove-hashtags'></a>

## Remove hashtags
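
A minimal sketch of hashtag removal in the same style as the URL and username patterns. It assumes the whole tag (the `#` and the word after it) should be dropped rather than only the `#` symbol; the function name `removeHashtags` is hypothetical.

```python
import re

# '#' followed by word characters; adjacent tags like "#great#like" match one after another
hashtagPattern = re.compile(r'#\w+')

def removeHashtags(text):
    """Drop hashtags and collapse the whitespace they leave behind."""
    return ' '.join(hashtagPattern.sub('', text).split())

print(removeHashtags('Life is great #life #great#like .'))
```

If the words inside hashtags should contribute to the polarity score, replacing only the `#` character would be the alternative design.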

<a id = 'Polarity-analysis'></a>

## Polarity analysis
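
A sketch of one possible polarity calculation, under several assumptions: a word's score is the positive minus the negative score of its first SentiWordNet synset, a tweet's polarity is the sum of its word scores, and words missing from the lexicon score zero as the assignment requires. The names `sentiwordnetScore`, `tweetPolarity`, and `toyLexicon` are hypothetical, and the toy scores are made up so the example runs without NLTK data.

```python
import re

def sentiwordnetScore(word):
    """Positive minus negative score of the word's first SentiWordNet synset,
    or 0.0 when the word is not in the lexicon."""
    # Imported lazily so the rest of the sketch runs without NLTK installed;
    # requires the 'sentiwordnet' and 'wordnet' corpora to be downloaded.
    from nltk.corpus import sentiwordnet as swn
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return synsets[0].pos_score() - synsets[0].neg_score()

def tweetPolarity(text, scoreWord=sentiwordnetScore):
    """Sum the word-level scores over the alphabetic tokens of a tweet."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return sum(scoreWord(token) for token in tokens)

# Toy lexicon with made-up scores, used to exercise the zero-score fallback
toyLexicon = {'great': 0.75, 'terrible': -0.625}

def toyScore(word):
    return toyLexicon.get(word, 0.0)

print(tweetPolarity('Life is great', toyScore))
```

With the NLTK corpora available, `tweetPolarity(tweet)` uses SentiWordNet directly. Taking only the first synset ignores part of speech and word sense, so averaging over all synsets is a reasonable alternative.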