# Foundations of Artificial Intelligence and Machine Learning
## A Program by IIIT-H and TalentSprint

#### To be done in the Lab

The objective of this experiment is to perform sentiment analysis.

In this experiment we will use a Twitter dataset as training data and tweets crawled in real time for testing.

The ground truth is 1 for a positive tweet and 0 for a negative tweet.

A few examples of positive and negative tweets are:

**Few Positive Tweets:**
1.  @Msdebramaye I heard about that contest! Congrats girl!!
2. UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3

**Few Negative Tweets:**
1. no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
2. Just had some bloodwork done. My arm hurts

### Data Source

https://www.kaggle.com/c/twitter-sentiment-analysis2/data


## Exercise 1: (2 marks)

The first exercise is cleaning the tweets.
Perform preprocessing as required.

Complete the function: preprocess_tweets

Input or argument to the function: tweet as a string

Return value: processed tweet as a string

Hint: Use regular expressions
* Convert all characters to lower case
  + look at lower()
* Replace any URLs with the word "URL"
  + Hint : 
      - re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) (re is Python's regular expression package)
* Convert usernames to "AT_USER"; consider any word that starts with @ to be a username
  + Hint : 
      - re.sub(r'@[^\s]+', 'AT_USER', tweet)
* Replace multiple whitespace characters with a single space
  + Hint :
      - re.sub(r'[\s]+', ' ', tweet)
* Replace hashtag words (#word) with just the words (word)
  + Hint : 
      - re.sub(r'#([^\s]+)', r'\1', tweet)
      
* TEST CASE :
    + given the tweet "@V_DEL_ROSSI: Me         #dragging myself to the gym https://t.co/cOjM0mBVeY"
    + output should be "AT_USER me dragging myself to the gym URL"
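
The URL hint escapes the dot in `www\.`. Here is a minimal sketch of why that matters, using a made-up sentence: an unescaped dot matches any character, so an ordinary word that happens to start with "www" would also be replaced.

```python
import re

# Made-up example text; "wwwhat" is an ordinary word, not a URL
text = "wwwhat a day, see www.example.com"

# Unescaped dot: "." matches any character, so "wwwhat" is treated as a URL
loose = re.sub(r'www.[^\s]+', 'URL', text)

# Escaped dot: only a literal "www." prefix is treated as a URL
strict = re.sub(r'www\.[^\s]+', 'URL', text)

print(loose)   # URL a day, see URL
print(strict)  # wwwhat a day, see URL
```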

In [59]:
import re
import pandas as pd
import numpy as np
from sklearn import feature_extraction, svm
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
def preprocess_tweets(tweet):
    # Convert to lower case
    tweet = tweet.lower()

    # Replace URLs with the token "URL" (the dot in "www." is escaped so it matches literally)
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    # Replace @usernames with the token "AT_USER"
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    # Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)

    # Replace hashtag words (#word) with just the word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    return tweet

# Test it out
tweet = preprocess_tweets("@V_DEL_ROSSI: Me #dragging myself to     the gym   https://t.co/cOjM0mBVeY")
print(tweet)  # AT_USER me dragging myself to the gym URL

## Exercise 2: (3 marks)

Tokenize the processed tweet into a list of words, making sure that no punctuation is returned, so that it can be used in the next steps to represent the tweet as a feature vector. Remove the stop words, if necessary.

Complete the function: word_tokenizer

Input or argument to the function: processed tweet

Return value: list of words without any punctuation

TEST CASE :

Given an input :
    "Neither Man, nor machine can replace its creator. really?. hahaha hahaha"
    
Result (the exact words kept depend on the stop-word list) : 
    ['neither', 'man', 'nor', 'machine', 'replace', 'creator', 'hahaha']

In [81]:
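# Each word list is read with read_csv, which takes the first line of the file as the
# column name ("a", "a+", "S2-faced"); that first entry is therefore not part of the set.
stopWords = pd.read_csv('stopwords.txt')["a"].values.tolist()
stopWords = set(stopWords)

posWords = pd.read_csv('positive-words.txt')["a+"].values.tolist()
posWords = set(posWords)

negWords = pd.read_csv('negative-words.txt')["S2-faced"].values.tolist()
negWords = set(negWords)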

In [22]:
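def word_tokenizer(tweet):
    # Split on runs of non-word characters and underscores, which strips punctuation
    # (this also splits the token AT_USER into "AT" and "USER")
    tweet_words = re.split(r"[\W\s_]+", tweet)
    # Drop the empty strings that re.split can leave at the edges
    tweet_words = list(filter(None, tweet_words))

    # Remove stop words
    tokenized_tweet = [word for word in tweet_words if word not in stopWords]
    # set() removes duplicates, so the order of the returned words is not guaranteed
    return list(set(tokenized_tweet))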

tweet = "Neither Man, nor machine can replace its creator. really?. hahaha hahaha"
tweet = preprocess_tweets(tweet)
word_tokenizer(tweet)

['neither', 'machine', 'creator', 'replace', 'hahaha', 'nor']
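
The output above comes back in arbitrary order because the function returns `list(set(...))`. If the original word order matters, one possible variant is sketched below. The function name `ordered_tokenizer` and the tiny `demo_stop_words` set are hypothetical stand-ins for illustration; the lab loads its real stop words from stopwords.txt.

```python
import re

# Hypothetical stop-word set for illustration only
demo_stop_words = {"can", "its", "really"}

def ordered_tokenizer(text, stop_words):
    # Split on runs of non-word characters and underscores, dropping empty strings
    words = [w for w in re.split(r"[\W_]+", text.lower()) if w]
    # Remove stop words
    kept = [w for w in words if w not in stop_words]
    # dict.fromkeys keeps the first occurrence of each word and preserves order
    return list(dict.fromkeys(kept))

tokens = ordered_tokenizer(
    "Neither Man, nor machine can replace its creator. really?. hahaha hahaha",
    demo_stop_words,
)
print(tokens)
# ['neither', 'man', 'nor', 'machine', 'replace', 'creator', 'hahaha']
```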

## Exercise 3: (5 marks)

Using the list of words from the step above,
* represent the tweet as a feature vector using bag of words

Hint: counts of positive/negative/neutral words can also be used as three features

In [5]:
def getfeaturevector(tokenized_tweet):
    # Count how many tokens are positive, negative, or neither (neutral)
    pos, neg, neu = 0, 0, 0
    for x in tokenized_tweet:
        if x in posWords:
            pos += 1
        elif x in negWords:
            neg += 1
        else:
            neu += 1
    return [pos, neg, neu]
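
The function above follows the three-count hint. A plain bag-of-words count vector, as named in the exercise statement, could be sketched as follows. The helper `bag_of_words` and the three-word `demo_vocab` are hypothetical and only illustrate the idea: one position per vocabulary word, holding how often that word occurs in the tweet.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    # Count every token once, then read off the counts in vocabulary order;
    # a Counter returns 0 for words that never occurred
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Hypothetical vocabulary for illustration
demo_vocab = ["good", "bad", "day"]

vector = bag_of_words(["good", "good", "day", "sunny"], demo_vocab)
print(vector)  # [2, 0, 1]
```

Words outside the vocabulary ("sunny" here) are simply ignored.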

## Exercise 4: (Marks : 5 ) 


Load the given training data and use the functions you created above to preprocess each tweet, tokenize it, and build its feature vector.

Using the feature vector as input, train a classifier to classify the sentiment of each tweet correctly.

Divide the training data into two sets so that you can validate your classifier.

In [71]:
# Read only 10k rows
train = pd.read_csv("twitter_train.csv", encoding = "ISO-8859-1", index_col=["ItemID"], nrows=10000)
train.head()

| ItemID | Sentiment | SentimentText |
|---|---|---|
| 1 | 0 | is so sad for my APL frie... |
| 2 | 0 | I missed the New Moon trail... |
| 3 | 1 | omg its already 7:30 :O |
| 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 5 | 0 | i think mi bf is cheating on me!!! ... |


In [72]:
# Build a vocabulary

# Get a hashmap of each word and the number of tweets it appears in
# (word_tokenizer removes duplicates, so each tweet counts a word at most once)
word_counts = {}
for index, row in train.iterrows():
    tweet = row["SentimentText"]
    tweet_words = word_tokenizer(preprocess_tweets(tweet))
    for word in tweet_words:
        word_counts[word] = word_counts.get(word, 0) + 1

# Keep words that appear in more than 100 tweets and in fewer than 5000 tweets
vocabulary_list = []
for word, count in word_counts.items():
    if 100 < count < 5000:
        vocabulary_list.append(word)

# Assign an index to each word
vocabulary = {word:i for i, word in enumerate(vocabulary_list)}
print(vocabulary, len(vocabulary))

{'sad': 0, 'a': 1, '2': 2, 've': 3, 'im': 4, 'tonight': 5, 'tomorrow': 6, 'miss': 7, 'USER': 8, 'AT': 9, 'thanks': 10, 'day': 11, 'feel': 12, 'lt': 13, 'URL': 14, 'twitter': 15, 'gonna': 16, 'happy': 17, 'awesome': 18, 're': 19, '3': 20, 'please': 21, 'wish': 22, 'life': 23, 'sleep': 24, 'didn': 25, 'hate': 26, 'guys': 27, 'morning': 28, 'days': 29, 'week': 30, 'cant': 31, 'dont': 32, 'amp': 33, 'time': 34, 'fun': 35, 'love': 36, 'night': 37, 'lol': 38, 'home': 39, 'wanna': 40, '1': 41, '4': 42, 'bed': 43, 'gt': 44, 'hope': 45, 'll': 46, 'am': 47, 'quot': 48, 'oh': 49, 'getting': 50, 'people': 51, 'bad': 52, 'don': 53, 'haha': 54, 'school': 55, 'song': 56, 'follow': 57, 'followfriday': 58, 'myweakness': 59, 'musicmonday': 60, 'iremember': 61} 62


In [73]:
def onehot(words, vocabulary):
    vector = [0 for _ in range(len(vocabulary))]
    for w in words:
        if w in vocabulary:
            vector[vocabulary[w]] = 1
    return vector
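
To make the vocabulary filter and the presence vector concrete, here is a small self-contained sketch on toy data. The tweets, the thresholds `min_df` and `max_df`, and the helper `presence_vector` are all hypothetical; the lab itself keeps words that occur in more than 100 and fewer than 5000 tweets.

```python
from collections import Counter

# Toy tokenized tweets for illustration
toy_tweets = [["happy", "day"], ["sad", "day"], ["happy", "happy", "song"]]

# Document frequency: count each word at most once per tweet
doc_freq = Counter(word for tweet in toy_tweets for word in set(tweet))

# Hypothetical thresholds: keep words seen in at least 2 and at most 3 tweets
min_df, max_df = 2, 3
kept = sorted(w for w, c in doc_freq.items() if min_df <= c <= max_df)
toy_vocab = {word: i for i, word in enumerate(kept)}

def presence_vector(tokens, vocab):
    # 1 where the vocabulary word occurs in the tweet, 0 elsewhere
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

print(toy_vocab)                                       # {'day': 0, 'happy': 1}
print(presence_vector(["happy", "song"], toy_vocab))   # [0, 1]
```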

In [74]:
# Get a binary presence vector for each tweet (1 where a vocabulary word occurs in it)
train["Count_vector"] = train["SentimentText"].apply(lambda x: onehot(word_tokenizer(preprocess_tweets(x)), vocabulary))
train.head()

| ItemID | Sentiment | SentimentText | Count_vector |
|---|---|---|---|
| 1 | 0 | is so sad for my APL frie... | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | 0 | I missed the New Moon trail... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | 1 | omg its already 7:30 :O | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... | [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 5 | 0 | i think mi bf is cheating on me!!! ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [75]:
# Get pos, neg, neutral counts
train["Feature"] = train["SentimentText"].apply(lambda x: getfeaturevector(word_tokenizer(preprocess_tweets(x))))
train.head()

| ItemID | Sentiment | SentimentText | Count_vector | Feature |
|---|---|---|---|---|
| 1 | 0 | is so sad for my APL frie... | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] |
| 2 | 0 | I missed the New Moon trail... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] |
| 3 | 1 | omg its already 7:30 :O | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 3] |
| 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... | [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 12] |
| 5 | 0 | i think mi bf is cheating on me!!! ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] |


In [76]:
# Split up the count vectors and features into separate columns
train[vocabulary_list] = pd.DataFrame(train["Count_vector"].values.tolist(), index=train.index)
# getfeaturevector returns [pos, neg, neu], so the column names follow that order
train[["Pos", "Neg", "Neu"]] = pd.DataFrame(train["Feature"].values.tolist(), index=train.index)
train.head()

| ItemID | Sentiment | SentimentText | Count_vector | Feature | sad | a | 2 | ve | im | tonight | ... | school | song | follow | followfriday | myweakness | musicmonday | iremember | Pos | Neg | Neu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | is so sad for my APL frie... | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| 2 | 0 | I missed the New Moon trail... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| 3 | 1 | omg its already 7:30 :O | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 3] | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... | [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 12] | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 12 |
| 5 | 0 | i think mi bf is cheating on me!!! ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 2] | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |


In [77]:
# Drop the list columns
train.drop(["SentimentText", "Count_vector", "Feature"], inplace=True, axis=1)
train.columns.values

array(['Sentiment', 'sad', 'a', '2', 've', 'im', 'tonight', 'tomorrow',
       'miss', 'USER', 'AT', 'thanks', 'day', 'feel', 'lt', 'URL',
       'twitter', 'gonna', 'happy', 'awesome', 're', '3', 'please',
       'wish', 'life', 'sleep', 'didn', 'hate', 'guys', 'morning', 'days',
       'week', 'cant', 'dont', 'amp', 'time', 'fun', 'love', 'night',
       'lol', 'home', 'wanna', '1', '4', 'bed', 'gt', 'hope', 'll', 'am',
       'quot', 'oh', 'getting', 'people', 'bad', 'don', 'haha', 'school',
       'song', 'follow', 'followfriday', 'myweakness', 'musicmonday',
       'iremember', 'Pos', 'Neg', 'Neu'], dtype=object)

In [78]:
# 80:20 Split
X, Y = train.iloc[:, 1:], train.iloc[:,0]
train_x, valid_x, train_y, valid_y = train_test_split(X, Y, test_size=0.2)

In [80]:
# Try out different classifiers
classifier = svm.SVC(tol=0.3)
#classifier = SGDClassifier(loss="log", tol=10, max_iter= 10000, penalty=None)
#classifier = RandomForestClassifier(max_features=6)

classifier.fit(train_x, train_y)
print(classifier.score(train_x, train_y))
print(classifier.score(valid_x, valid_y))

0.72175
0.743
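
Accuracy alone hides which kind of mistake the classifier makes. The notebook already imports `confusion_matrix` and `precision_recall_fscore_support` from sklearn.metrics, which report these quantities directly; the sketch below computes the same counts in plain Python so the definitions are visible. The label lists are made up for illustration and are not results from this experiment.

```python
def binary_confusion(y_true, y_pred):
    # Count the four outcomes for labels in {0, 1}, with 1 as the positive class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels for illustration only
demo_true = [1, 0, 1, 1, 0, 0]
demo_pred = [1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = binary_confusion(demo_true, demo_pred)
precision = tp / (tp + fp)   # of the tweets predicted positive, how many were positive
recall = tp / (tp + fn)      # of the positive tweets, how many were found
print(tp, fp, fn, tn)        # 2 1 1 2
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```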


## Exercise 5: (Marks : 5)

#### Twitter crawling using tweepy

Use tweepy to fetch tweets in real time; these are used as test data for the classifier.

## Requirements: 

Twitter account

Create a twitter account if you don't have one by going to the link given below:

https://twitter.com/i/flow/signup

## Installation

Tweepy is a Python client for the official Twitter API.
Install it using the following pip command:

In [None]:
!pip install tweepy

The tweets need to be gathered so that sentiment analysis can be performed on them. They can be fetched from Twitter using the Twitter API.

To fetch tweets through the Twitter API, you need to register an app through your Twitter account. Follow these steps:
<ul>
<li>Open the link given below to create an app through your Twitter account.
    https://apps.twitter.com
<li>Click the button: ‘Create New App’
<li>Fill in the application details. You can leave the callback URL field empty.
<li>Once the app is created, you will be redirected to the app page.
<li>Open the ‘Keys and Access Tokens’ tab.
<li>Copy ‘Consumer Key’, ‘Consumer Secret’, ‘Access token’ and ‘Access Token Secret’.
</ul>

In [None]:
#Replace with your ‘Consumer Key’, ‘Consumer Secret’, ‘Access token’ and ‘Access Token Secret’ below. 

consumer_key = 'XXXXXXXXX'
consumer_secret = 'XXXXXXXXXX'
access_token = 'XXXXXXXXXXXXXX'
access_secret = 'XXXXXXXXXXXXX'

Run the code below to authenticate your application.

In [None]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)    
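
Pasting keys into a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here, reads them from environment variables. The variable names in `ENV_NAMES` and the helper `load_credentials` are hypothetical choices for this sketch, not something Tweepy or Twitter requires.

```python
import os

# Hypothetical environment variable names; any names would work
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(env=None):
    # Return the four credentials, or raise if any variable is missing or empty
    env = os.environ if env is None else env
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in ENV_NAMES.items()}

# Demonstration with a plain dict standing in for os.environ
demo_env = {var: "XXXXXXXXX" for var in ENV_NAMES.values()}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The returned values can then be passed to the authentication calls in place of the hard-coded strings.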

## Tweepy Cursor

The code below prints the search results from Twitter for the search string passed as the keyword argument "q" to tweepy.Cursor. The number passed to the items method of the cursor limits the results to 100 tweets, if that many are available.

In [None]:
# Note: Tweepy 4.0 and later name this method api.search_tweets instead of api.search
for i in tweepy.Cursor(api.search, q='searchme', lang = 'en', full_text=True).items(100):
    print(i._json['text'])
    #print(preprocess_tweets(i._json['text']), end='\n\n\n')

Also apply the preprocessing steps to the crawled Twitter data and obtain its feature vectors.
Classify the crawled tweets by passing their feature vectors to the trained classifier.

In [None]:
##Your code here
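
As a starting point, the overall flow (raw text, then features, then prediction) can be sketched as below. Everything here is a stand-in so the block runs on its own: `demo_featurize` uses a tiny made-up word list, and `ThresholdClassifier` is a hypothetical object with a `predict` method that mimics the interface of a trained scikit-learn classifier. In the lab you would pass your own preprocessing, tokenizing and feature functions and the classifier trained in Exercise 4, using the same features it was trained on.

```python
def classify_tweets(texts, featurize, classifier):
    # Turn every raw tweet into a feature vector, then predict all of them at once
    features = [featurize(text) for text in texts]
    return list(classifier.predict(features))

def demo_featurize(text):
    # Stand-in feature function: [positive count, negative count, other count]
    words = text.lower().split()
    pos = sum(w in {"love", "great"} for w in words)
    neg = sum(w in {"hate", "hurts"} for w in words)
    return [pos, neg, len(words) - pos - neg]

class ThresholdClassifier:
    # Hypothetical stand-in for a trained model: predicts 1 when pos >= neg
    def predict(self, rows):
        return [1 if row[0] >= row[1] else 0 for row in rows]

predictions = classify_tweets(
    ["I love this", "my arm hurts"], demo_featurize, ThresholdClassifier()
)
print(predictions)  # [1, 0]
```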