# Foundations of Artificial Intelligence and Machine Learning
## A Program by IIIT-H and TalentSprint

In [23]:
%config IPCompleter.greedy=True

#### To be done in the Lab

The objective of this experiment is to perform sentiment analysis on tweets.

In this experiment we will use a Twitter dataset as training data and tweets crawled in real time for testing.

The ground truth is 1 for a positive tweet and 0 for a negative tweet.

A few examples of positive and negative tweets are:

**Few Positive Tweets:**
1.  @Msdebramaye I heard about that contest! Congrats girl!!
2. UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3

**Few Negative Tweets:**
1. no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
2. Just had some bloodwork done. My arm hurts

### Data Source

https://www.kaggle.com/c/twitter-sentiment-analysis2/data


## Exercise 1: (2 marks)

The first exercise is cleaning the tweets.
Perform preprocessing as required.

Complete the function: preprocess_tweets

Input (argument to the function): tweet as a string

Return value: processed tweet as string 

Hint: Use regular expressions
* Convert all characters to lower case
  + look at lower()
* Replace any urls with the word "URL"
  + Hint : 
      - re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',"Tweet") (re is python regular expression package)
* Convert each user name to "AT_USER"; treat any word that starts with @ as a user name
  + Hint : 
      - re.sub('@[^\s]+','AT_USER',"Tweet")
* Replace runs of whitespace with a single space
  + Hint :
      - re.sub('[\s]+', ' ', tweet)
* Replace hashtag words (#word) with just the words (word)
  + Hint : 
      - re.sub(r'#([^\s]+)', r'\1', "tweet")
      
* TEST CASE :
    + given the tweet "@V_DEL_ROSSI: Me         #dragging myself to the gym https://t.co/cOjM0mBVeY"
    + output should be "AT_USER me dragging myself to the gym URL"
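
The escaped dot in the URL hint matters. As a minimal sketch (the sample strings are made up for illustration), the comparison below suggests that an unescaped `www.` also matches ordinary words that contain "www", while `www\.` only matches a literal dot:

```python
import re

# Pattern from the hint (dot escaped) and a variant with the dot left unescaped
escaped = r'((www\.[^\s]+)|(https?://[^\s]+))'
unescaped = r'((www.[^\s]+)|(https?://[^\s]+))'

# Illustrative strings, not taken from the dataset
samples = ["visit www.example.com now", "awww so cute"]

escaped_results = [re.sub(escaped, 'URL', s) for s in samples]
unescaped_results = [re.sub(unescaped, 'URL', s) for s in samples]

print(escaped_results)    # the second string is left untouched
print(unescaped_results)  # part of the second string is replaced by URL
```

With the unescaped pattern, the `.` matches the space after "awww", so plain text is mistaken for a URL.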

In [24]:
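import re

def preprocess_tweets(tweet):
    # Lower-case first so that later matching is case-insensitive
    tweet = tweet.lower()
    # Replace URLs; the dot after "www" is escaped so that it matches a literal "."
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Replace @mentions with a placeholder
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Keep the word of a hashtag but drop the "#"
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet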


In [25]:
preprocess_tweets("@V_DEL_ROSSI: Me #dragging myself to the gym https://t.co/cOjM0mBVeY")

'AT_USER me dragging myself to the gym URL'
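
The test case stated in the exercise contains a run of spaces between "Me" and "#dragging", which the call above does not include. A minimal sketch of that check follows; it re-declares the function with the regular expressions from the hints so that the block runs on its own (the name `preprocess` is used only for this illustration):

```python
import re

def preprocess(tweet):
    # Same steps as in the hints: lower-case, URLs, mentions, whitespace, hashtags
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet

raw = "@V_DEL_ROSSI: Me         #dragging myself to the gym https://t.co/cOjM0mBVeY"
cleaned = preprocess(raw)
print(cleaned)
```

Note that the colon after the user name disappears because `@[^\s]+` consumes every non-space character that follows the @ sign.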

## Exercise 2: (3 marks)

Tokenize the processed tweet into a list of words, making sure that no punctuation is returned, so that it can be used in the next steps to represent the tweet as a feature vector. Remove the stop words, if necessary.

Complete the function: word_tokenizer

Input (argument to the function): processed tweet

Return value: list of words without any punctuation

TEST CASE :

Given an input :
    "Neither Man, nor machine can replace its creator. really?."
    
Result : 
    ['neither', 'man', 'nor', 'machine', 'replace', 'creator', 'hahaha']

In [26]:
def remove_punctuations(tweet):
    # Strip the punctuation marks . ? , ! ; & wherever they occur
    return re.sub(r'[.?,!;&]+', '', tweet)

In [65]:
import pandas as pd
stop_words = set(pd.read_csv('stopwords.txt').values.ravel())  # a set makes membership tests fast

In [66]:
def word_tokenizer(tweet):
    # split() without arguments also drops empty strings caused by leading/trailing spaces
    tokenized_tweets = tweet.split()
    return tokenized_tweets

In [67]:
def filter_stop_words(tokenized_tweets) :
    return [tweet for tweet in tokenized_tweets if tweet not in stop_words]
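
The three steps above (punctuation removal, tokenization, stop-word filtering) can be sketched in one self-contained block. The stop-word set here is a hypothetical stand-in for the contents of stopwords.txt, and the function name `tokenize` is introduced only for this illustration:

```python
import re

# Hypothetical stand-in for the words listed in stopwords.txt
STOP_WORDS = {"can", "its", "really"}

def tokenize(tweet):
    # Remove the punctuation marks . ? , ! ; & and lower-case the text
    cleaned = re.sub(r"[.?,!;&]+", "", tweet.lower())
    # Split on whitespace and drop stop words
    return [word for word in cleaned.split() if word not in STOP_WORDS]

tokens = tokenize("Neither Man, nor machine can replace its creator. really?.")
print(tokens)
```

Which words are filtered depends entirely on the stop-word list, so the result with the real stopwords.txt may differ.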

## Exercise 3: (5 marks)

Using the list of words from the step above,
* represent the tweet as a feature vector using bag of words

Hint: counts of positive/negative/neutral words can also be used as three features

In [68]:
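import pandas as pd
# Note: read_csv treats the first line of each file as a header row
# Sets make the membership tests below fast
posWords = set(pd.read_csv('positive-words.txt').values.ravel())
negWords = set(pd.read_csv('negative-words.txt').values.ravel())

def getfeaturevector(tokenized_tweet):
    # Count positive, negative and neutral (everything else) tokens
    p, n, ne = 0, 0, 0
    for token in tokenized_tweet:
        if token in posWords:
            p = p + 1
        elif token in negWords:
            n = n + 1
        else:
            ne = ne + 1
    return p, n, ne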

In [69]:
def process_text(tweet) :
    tweet = preprocess_tweets(tweet)
    tweet = remove_punctuations(tweet)
    tweet = word_tokenizer(tweet)
    tweet = filter_stop_words(tweet)
    tweet = getfeaturevector(tweet)
    return tweet
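
The cells above use the three-count features from the hint. For comparison, a plain bag-of-words representation can be sketched as follows; this is only an illustration, and the helper names `build_vocabulary` and `bag_of_words` as well as the toy corpus are assumptions, not part of the notebook:

```python
from collections import Counter

def build_vocabulary(tokenized_tweets):
    # Sorted so that every word maps to a stable column index
    return sorted({word for tokens in tokenized_tweets for word in tokens})

def bag_of_words(tokens, vocabulary):
    # One count per vocabulary word; words outside the vocabulary are ignored
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy corpus of already tokenized tweets
corpus = [["my", "arm", "hurts"], ["my", "head", "hurts", "hurts"]]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(tokens, vocab) for tokens in corpus]

print(vocab)
print(vectors)
```

On a real dataset the vocabulary grows to thousands of words, which is one reason the three-count features can be a convenient starting point.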

## Exercise 4: (Marks : 5 ) 


Load the given training data and use the functions you created above to preprocess and tokenize each tweet and to get its feature vector.

Using the feature vector as input, train a classifier to classify the sentiment of each tweet correctly.

Divide the training data into two sets so that you can validate your classifier.

In [70]:
#Code here
import pandas as pd

In [71]:
df = pd.read_csv('twitter_train.csv', encoding='iso-8859-1',engine='python')

In [72]:
df.drop('ItemID', axis = 1, inplace=True)

In [73]:
X = []
for text in df.SentimentText :
    X.append(process_text(text))

In [75]:
y = df.Sentiment

In [76]:
from sklearn.model_selection import train_test_split


In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20)

In [103]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [90]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [91]:
predict = knn.predict(X_test)

In [104]:
accuracy_score(y_test.values, predict)

0.6292129212921292

In [101]:
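# Fails because score() expects the features and the true labels, i.e. knn.score(X_test, y_test); here the 1D predictions are passed as X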
knn.score(predict, y_test.values)

ValueError: Expected 2D array, got 1D array instead:
array=[1 1 1 ... 1 1 1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
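
The `ValueError` above arises because `score` takes the feature matrix and the true labels, runs `predict` internally, and then computes the accuracy. A minimal sketch on made-up feature vectors (the data and the `random_state` value are assumptions chosen for illustration) shows that `score(X_test, y_test)` agrees with `accuracy_score`:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up (positive, negative, neutral) counts with their sentiment labels
X = [[3, 0, 2], [2, 0, 1], [4, 1, 3], [3, 1, 2],
     [0, 3, 2], [0, 2, 1], [1, 4, 3], [1, 3, 2]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# score() receives features and true labels, not predictions
score_value = knn.score(X_test, y_test)
accuracy_value = accuracy_score(y_test, knn.predict(X_test))
print(score_value, accuracy_value)
```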

## Exercise 5: (Marks : 5)

#### Twitter crawling using tweepy

Use tweepy to fetch tweets in real time; these are used as test data for the classifier.

## Requirements: 

Twitter account

Create a twitter account if you don't have one by going to the link given below:

https://twitter.com/i/flow/signup

## Installation

Tweepy is a Python client for the official Twitter API.
Install it using the following pip command:

In [None]:
!pip install tweepy

Tweets must be gathered before sentiment analysis can be performed on them. They can be fetched from Twitter using the Twitter API.

To fetch tweets through the Twitter API, you need to register an app through your Twitter account. Follow these steps:
<ul>
<li>Open the link given below to create an app through your Twitter account.
    https://apps.twitter.com
<li>Click the button: ‘Create New App’
<li>Fill in the application details. You can leave the callback URL field empty.
<li>Once the app is created, you will be redirected to the app page.
<li>Open the ‘Keys and Access Tokens’ tab.
<li>Copy ‘Consumer Key’, ‘Consumer Secret’, ‘Access token’ and ‘Access Token Secret’.
</ul>

In [None]:
#Replace with your ‘Consumer Key’, ‘Consumer Secret’, ‘Access token’ and ‘Access Token Secret’ below. 

consumer_key = 'XXXXXXXXX'
consumer_secret = 'XXXXXXXXXX'
access_token = 'XXXXXXXXXXXXXX'
access_secret = 'XXXXXXXXXXXXX'

Run the code below to authenticate your application.

In [None]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)    

## Tweepy Cursor

The code below returns search results from Twitter for the search string passed to the keyword argument "q" of tweepy.Cursor. The number passed to the items method of tweepy.Cursor limits the results to 100 tweets, if that many are available.

In [None]:
# Note: Tweepy 4.x renamed api.search to api.search_tweets
for i in tweepy.Cursor(api.search, q='searchme', lang = 'en', full_text=True).items(100):
    print(i._json['text'])
    #print(process_text(i._json['text']), end='\n\n\n')

Also apply the preprocessing steps and obtain the feature vectors for the crawled Twitter data.
Classify the crawled tweets by passing their feature vectors to the trained classifier.
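
One possible shape for this step is sketched below. Everything in it is an assumption made for illustration: `classify_tweets` is a hypothetical helper, and the toy featurizer and the stand-in classifier replace the notebook's `process_text` and trained `knn` so that the block runs without Twitter access or training data:

```python
def classify_tweets(texts, featurize, classifier):
    """Return (text, predicted_label) pairs for raw tweet strings."""
    features = [featurize(text) for text in texts]
    labels = classifier.predict(features)
    return list(zip(texts, labels))

class MorePositiveThanNegative:
    # Stand-in for a trained classifier; only mimics the predict() interface
    def predict(self, features):
        return [1 if pos > neg else 0 for pos, neg, _ in features]

def toy_featurize(text):
    # Stand-in for process_text: counts against two tiny made-up lexicons
    words = text.lower().split()
    pos = sum(word in {"congrats", "great"} for word in words)
    neg = sum(word in {"hurts", "sad"} for word in words)
    return pos, neg, len(words) - pos - neg

results = classify_tweets(["Congrats girl", "My arm hurts"],
                          toy_featurize, MorePositiveThanNegative())
print(results)
```

In the notebook, the same helper could be called with `process_text` and the fitted `knn`, passing the text of each crawled tweet.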

In [None]:
##Your code here