# Intro

To build a sentiment analyzer, we first need the right tools and methods. **Machine learning** is one such tool, and it offers a range of classification methods. Some classifiers need labeled training data and some do not.

In particular, we will deal with three machine learning classifiers: the *Naive Bayes Classifier*, the *Maximum Entropy Classifier* and *Support Vector Machines*. All of these classifiers require training data, so they fall under the category of supervised classification.

<img src="https://www.ravikiranj.net/images/supervised-classification.png" width="640" height="480">

# Training the Classifiers
The classifiers need to be trained, and to do that we need a list of manually classified tweets. Let's start with 3 positive, 3 neutral and 3 negative tweets.

### Positive tweets
> *@PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!!*

> *@Msdebramaye I heard about that contest! Congrats girl!!*

> *UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER* http://tinyurl.com/49955t3

### Neutral tweets
> *Do you Share More #jokes #quotes #music #photos or #news #articles on #Facebook or #Twitter?*

> *Good night #Twitter and #TheLegionoftheFallen. 5:45am cimes awfully early!*

> *I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount*

### Negative tweets
> *Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh*

> *no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.*

> *Just had some bloodwork done. My arm hurts*

## Preprocess tweets

1. Lower Case - Convert the tweets to lower case.
2. URLs - We don't follow the short URLs to inspect the linked content, so we can either remove all URLs via regular expression matching or replace them with the generic word URL.
3. @username - We can remove "@username" via regex matching or replace it with the generic word AT_USER.
4. #hashtag - Hashtags can carry useful information, so we replace each one with the same word without the hash. E.g. #nike is replaced with 'nike'.
5. Punctuation and additional white space - Remove punctuation at the start and end of the tweet. E.g. ' the day is beautiful! ' is replaced with 'the day is beautiful'. It also helps to replace runs of whitespace with a single space.
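
The steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's final implementation: the function name `preprocess_tweet` and the exact set of trimmed punctuation characters are assumptions made for this example.

```python
import re

def preprocess_tweet(text):
    """Apply the five preprocessing steps to one tweet (illustrative sketch)."""
    text = text.lower()                                       # 1. lower case
    text = re.sub(r"(www\.\S+)|(https?://\S+)", "URL", text)  # 2. URLs -> URL
    text = re.sub(r"@\S+", "AT_USER", text)                   # 3. @username -> AT_USER
    text = re.sub(r"#(\S+)", r"\1", text)                     # 4. #hashtag -> hashtag
    text = re.sub(r"\s+", " ", text)                          # 5a. collapse whitespace
    return text.strip(" \t\n'\"!?.,")                         # 5b. trim both ends

print(preprocess_tweet("@Msdebramaye I heard about that contest! Congrats girl!!"))
# AT_USER i heard about that contest! congrats girl
```

Punctuation inside the tweet is left alone here; only the start and end are trimmed, as step 5 describes.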

# Feature Vector

In pattern recognition and machine learning, a _**feature vector** is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis._

The feature vector is the most important concept in implementing a classifier: the quality of the features largely determines how well the classifier performs. The classifier learns a model from the feature vectors of the training data, and that model can then be used to classify previously unseen data.

In tweets, we can use the presence/absence of words that appear in a tweet as features. In the training data, consisting of positive, negative and neutral tweets, we can split each tweet into words and add each word to the feature vector. Some words contribute nothing to indicating the sentiment of a tweet, so we can filter them out. _Adding individual (single) words to the feature vector is referred to as the **'unigrams'** approach._
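
As a rough illustration of presence/absence features, the sketch below builds a vocabulary from two toy training tweets and maps a new tweet onto it. The toy data and the name `presence_features` are made up for this example.

```python
# Two toy, already-preprocessed training tweets with their labels.
training = [
    ("congrats girl", "positive"),
    ("my arm hurts", "negative"),
]

# The vocabulary is every distinct word seen in the training tweets.
vocabulary = sorted({word for text, _ in training for word in text.split()})

def presence_features(text):
    """Map a tweet to {word: True/False} over the training vocabulary."""
    words = set(text.split())
    return {word: (word in words) for word in vocabulary}

print(vocabulary)
print(presence_features("my head hurts"))
```

Words that never appeared in training ('head' here) have no slot in the vector, so they are ignored at classification time.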

Some approaches also add 'bigrams' (pairs of adjacent words) in combination with 'unigrams'.

For example, 'not good' (bigram) completely changes the sentiment compared to adding 'not' and 'good' individually. Here, for simplicity, we will only consider the unigrams. 
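
To make the difference concrete, here is a small sketch that extracts unigrams and bigrams from one sentence. The helper name `ngrams` and the sample sentence are illustrative only.

```python
def ngrams(words, n):
    """Return every run of n adjacent words, joined by a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the movie was not good".split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)

print(unigrams)
print(bigrams)
```

The bigram list contains 'not good' as a single feature, while the unigram list only has 'not' and 'good' as separate, unrelated features.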

Before adding words to the feature vector, we need to filter them; otherwise the feature vector will grow very large.

## Filtering tweet words (for feature vector)
1. Stop words - a, is, the, with etc. The full list of stop words is in a text file included with the project. These words don't indicate any sentiment and can be removed.
2. Repeating letters - If you look at the tweets, people sometimes repeat letters to stress an emotion. E.g. hunggrryyy, huuuuuuungry for 'hungry'. We can look for runs of 2 or more repeated letters in a word and replace each run with 2 of the same letter.
3. Punctuation - We can remove punctuation such as commas, single/double quotes and question marks at the start and end of each word. E.g. beautiful!!!!!! is replaced with beautiful.
4. Words must start with a letter - For simplicity's sake, we remove all words that don't start with a letter. E.g. 15th, 5.34am
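
The four filters can be sketched as follows. This is an illustration under stated assumptions: `STOP_WORDS` is a tiny stand-in for the project's stop word file, `filter_words` is a hypothetical name, and the letter check only looks at the first character.

```python
import re

# Tiny illustrative subset; the project reads the full list from a text file.
STOP_WORDS = {"a", "is", "the", "with"}

def filter_words(tweet):
    """Return the words of a tweet that survive the four filters."""
    kept = []
    for word in tweet.lower().split():
        word = re.sub(r"(.)\1+", r"\1\1", word)  # 2. squeeze repeated characters to two
        word = word.strip("'\"?,.!")             # 3. trim punctuation at both ends
        if word in STOP_WORDS:                   # 1. drop stop words
            continue
        if not re.match(r"[a-z]", word):         # 4. must start with a letter
            continue
        kept.append(word)
    return kept

print(filter_words("The day is beautiful!!!!!! huuuuuuungry 5.34am 15th"))
# ['day', 'beautiful', 'huungry']
```

Note that squeezing to two letters normalizes 'huuuuuuungry' to 'huungry', not to 'hungry'; the goal is only to collapse the many spellings into one feature.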

# Pulling tweets from Twitter

## Crawl tweets against hashtags

To get access to the Twitter API, you'll need to log in to the Twitter Developer website and create an application. Enter your desired Application Name, Description and website address, making sure to enter the full address including the http://. You can leave the callback URL empty.

<img src="http://ipullrank.com/wp-content/uploads/2017/04/create-an-application.png">

---

After registering, create an access token and grab your application’s Consumer Key, Consumer Secret, Access Token and Access Token Secret from the Keys and Access Tokens tab.

<img src="http://ipullrank.com/wp-content/uploads/2017/04/application-settings2.png">
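
These four values are secrets, so it is safer not to paste them into a notebook that may be shared. One possible approach, sketched below, is to read them from environment variables. The variable names and the helper `load_twitter_credentials` are assumptions made for this example, not part of tweepy or the Twitter API.

```python
import os

# Hypothetical environment variable names; choose any names you like.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=os.environ):
    """Return the four credentials in order, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)
```

With the variables set in your shell, `consumer_key, consumer_secret, access_token, access_token_secret = load_twitter_credentials()` yields the four values used in the next cell.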

In [2]:
import csv

import pandas as pd
import tweepy

# Input your credentials here. Never publish real keys in a shared notebook.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Authenticate and create the API client; wait automatically when rate-limited
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Kotak Bank
# Open (or create) the output file; mode 'w' overwrites any existing file
csvFile = open('C:/Users/admin/Desktop/Training Material/8. Data Science With Python/20. Text Mining - NLP/Sentiment Analysis/kotak.csv', 'w', newline='', encoding='utf-8')
# Use a csv writer
csvWriter = csv.writer(csvFile)

### Brief intro to `tweepy.Cursor()`, the function that extracts tweets by hashtag, user ID or page

**User timeline (tweets from a given @username)**

`tweepy.Cursor(api.user_timeline, id="twitter")`


**Generic search for an @username or a #hashtag**

`API.search(q[, lang ][, locale ][, rpp ][, page ][, since_id ][, geocode ][, show_user])`

Returns tweets that match a specified query.

Parameters

• q – the search query string

• lang – Restricts tweets to the given language, given by an ISO 639-1 code.

• locale – Specifies the language of the query you are sending. This is intended for language-specific clients, and the default should work in the majority of cases.

• rpp – The number of tweets to return per page, up to a max of 100.

• page – The page number (starting at 1) to return, up to a max of roughly 1500 results (based on rpp * page).

• since_id – Returns only statuses with an ID greater than (that is, more recent than) the specified ID.

• geocode – Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API, but will fall back to their Twitter profile. The parameter value is specified by “latitude,longitude,radius”, where radius units must be specified as either “mi” (miles) or “km” (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however, you can use this geocode parameter to search near geocodes directly.

• show_user – When true, prepends “<user>:” to the beginning of the tweet. This is useful for readers that do not display Atom’s author field. The default is false.

For more info, please refer to the official tweepy documentation:
https://media.readthedocs.org/pdf/tweepy/latest/tweepy.pdf

In [3]:
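# Search tweets by a hashtag.
# Written against Tweepy 3.x; Tweepy 4 renamed api.search to api.search_tweets.

date_c = []
tweet_s = []
for tweet in tweepy.Cursor(api.search, q="#kotak", count=100,
                           lang="en",
                           since="2017-05-01").items():
    date_c.append(tweet.created_at)
    tweet_s.append(tweet.text)
    # print(tweet.created_at, tweet.text)
    # Note: on Python 3, .encode() stores the text as a b'...' literal in the CSV;
    # write tweet.text directly if the file was opened with encoding='utf-8'.
    csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])

# Flush buffered rows so the CSV is complete on disk before it is read back
csvFile.flush()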

In [4]:
import re

### Preprocess tweets
def processTweet2(tweet):
    # Normalize one raw tweet before feature extraction

    # Convert to lower case
    tweet = tweet.lower()
    # Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Trim surrounding quote characters
    tweet = tweet.strip('\'"')
    return tweet

### Get stop word list
def getStopWordList(stopWordListFileName):
    # Read the stop words file (one word per line) and build a list
    stopWords = ['AT_USER', 'URL']
    with open(stopWordListFileName, 'r') as fp:
        for line in fp:
            stopWords.append(line.strip())
    return stopWords

stopWords = getStopWordList('C:/Users/admin/Desktop/Training Material/8. Data Science With Python/20. Text Mining - NLP/Sentiment Analysis/stopwords.txt')


def replaceTwoOrMore(s):
    # Look for 2 or more repetitions of a character and replace them with exactly two
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

In [5]:
import re

In [6]:
def getFeatureVector(tweet):
    featureVector = []
    # Split tweet into words
    words = tweet.split()
    for w in words:
        # Replace runs of two or more repeated characters with two occurrences
        w = replaceTwoOrMore(w)
        # Strip punctuation
        w = w.strip('\'"?,.')
        # Check that the word starts with a letter and is alphanumeric throughout
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        # Ignore stop words and words that fail the check
        if w in stopWords or val is None:
            continue
        featureVector.append(w.lower())
    return featureVector

### Load airline sentiment training data
airlinetrain = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding="ISO-8859-1")
tweets = []
featureList = []
for tweet, sentiment in zip(airlinetrain['text'], airlinetrain['airline_sentiment']):
    processedTweet = processTweet2(tweet)
    featureVector = getFeatureVector(processedTweet)
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment))

### Remove featureList duplicates
featureList = list(set(featureList))

def extract_features(tweet):
    # Map a tweet's word list to presence/absence features over featureList
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features



In [10]:
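# The CSV was written without a header row, so name the columns while reading;
# otherwise the first tweet would be consumed as the header.
ua = pd.read_csv("C:/Users/admin/Desktop/Training Material/8. Data Science With Python/20. Text Mining - NLP/Sentiment Analysis/kotak.csv", header=None, names=["Date", "tweets"])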

In [11]:
import nltk
training_set = nltk.classify.util.apply_features(extract_features, tweets)
# Train the Naive Bayes classifier
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)

# ua is a dataframe containing all the tweets; classify each one
ua['sentiment'] = ua['tweets'].apply(lambda tweet: NBClassifier.classify(extract_features(getFeatureVector(processTweet2(tweet)))))


In [None]:
ua.head(3)

In [159]:
csvFile.close()