Please fill up the following
* **Student Name:** Abhijit Khuperkar
* **Student ID:** 10150121
* **Student Email ID:** abhijit.bdapm10150121@spjain.org
* **Student GitHub Repo:** https://github.com/akhuperkar

## The Assignment
1. Write a blog post on how to use **OR** operator for find queries in mongodb.
2. Feed negative and positive tweets to the classification function for training. (using the Sentiment140 dataset)
3. Crawl all followers of ***naveen_odisha***, Odisha CM (note: you'll have to pay attention to rate limiting)
4. Crawl all followers of SRK. How can you calculate if this is feasible or not? (show the math)
5. Predict the sentiment of tweets by followers of ***naveen_odisha*** 

### Q1. Write a blog post on how to use **OR** operator for find queries in mongodb.

#### Is it TRUE or FALSE?
This type of conditional assessment is often used in programming. The logical operators evaluate boolean conditions of a query expression. The logical operators like OR, AND, NOT join two or more clauses of a query expressions and return a single logical value of either TRUE or FALSE. 

It is no different in MongoDB query. A MongoDB query targets a specific collection of documents. Queries specify criteria, or conditions, that identify the documents MongoDB returns to the clients. MongoDB query specifies the criteria and conditions using the following logical operators within the [db.collection.find()](https://docs.mongodb.org/manual/reference/method/db.collection.find/#db.collection.find) method to receive the matching documents.

1. `$OR`: Returns all documents that match either conditions of the conditional clauses
2. `$AND`: Returns all documents that match all conditions of clauses
3. `$NOT`: Returns documents that do not match the query condition
4. `$NOR`: Returns all documents that fail to match either conditional clauses

#### The OR operator 
Using the `$OR` operator, you can specify a query that joins each clause with a logical OR so that the query selects the documents in the collection that match at least one condition. In syntaxical language, the `$OR` operator performs a logical OR operation on two or more `<expressions>` and selects the documents that satisfy at least one of the `<expressions>`. This is expressed as the following query syntax:

`{ $or: [ { <expression1> }, { <expression2> },..., { <expressionN> } ] }`

In the below example, the query will select all documents in the inventory collection where either the quantity field value is less than (`$lt`) 20 or the price field value equals 10.

`db.inventory.find( { $or: [ { quantity: { $lt: 20 } }, { price: 10 } ] } )`

With additional clauses, you can specify precise conditions. In the below example, the compound query selects all documents in the collection where the value of the *type* field is *'food'* and *either* the *qty* has a value greater than (`$gt`) 100 or the value of the *price* field is less than (`$lt`) 9.95:

`db.inventory.find(
   {
     type: 'food',
     $or: [ { qty: { $gt: 100 } }, { price: { $lt: 9.95 } } ]
   }
)`

When evaluating the clauses in the `$or` expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. The index scan is required when `$or` is used with `$text` and sort() operations. This change was introduced in the 2.6 version of MongoDB.

To conclude, the `$OR` operator helps user to receive the output satisfying at least one of many conditional clauses in the MongoDB query expression.

### Q2. Feed negative and positive tweets to the classification function for training. (using the Sentiment140 dataset)

In [3]:
import csv
import re, math, collections, itertools, os
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import accuracy, precision, recall

#this function takes a feature selection mechanism and returns its performance in a variety of metrics
def evaluate_features(feature_select):
	posFeaturesTrain = []
	negFeaturesTrain = []

	#breaks up the sentences into lists of individual words and appends 'pos' or 'neg' after each list
	with open("/home/abhijit/Documents/TwitterAnalysis/trainingandtestdata/training.1600000.processed.noemoticon.csv", 'r') as TrainFile:
		TrainTweets = csv.reader(TrainFile)
		for i in TrainTweets:
			if i[0]=="4":
				posWordsTrain = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				posWordsTrain = [feature_select(posWordsTrain), 'pos']
				posFeaturesTrain.append(posWordsTrain)
			elif i[0]=="0":
				negWordsTrain = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				negWordsTrain = [feature_select(negWordsTrain), 'neg']
				negFeaturesTrain.append(negWordsTrain)

	trainFeatures = posFeaturesTrain + negFeaturesTrain
	
	#selects the features to be used for testing
	posFeaturesTest = []
	negFeaturesTest = []
	with open("/home/abhijit/Documents/TwitterAnalysis/trainingandtestdata/testdata.manual.2009.06.14.csv", 'r') as TestFile:
		TestTweets = csv.reader(TestFile)
		for j in TestTweets:
			if j[0]=="4":
				posWordsTest = re.findall(r"[\w']+|[.,!?;]", j[4].rstrip())
				posWordsTest = [feature_select(posWordsTest), 'pos']
				posFeaturesTest.append(posWordsTest)
			elif j[0]=="0":
				negWordsTest = re.findall(r"[\w']+|[.,!?;]", j[4].rstrip())
				negWordsTest = [feature_select(negWordsTest), 'neg']
				negFeaturesTest.append(negWordsTest)
	
	testFeatures = posFeaturesTest + negFeaturesTest
	
	#trains a Naive Bayes Classifier
	classifier = NaiveBayesClassifier.train(trainFeatures)

	#initiates referenceSets and testSets
	referenceSets = collections.defaultdict(set)
	testSets = collections.defaultdict(set)	

	#puts correctly labeled sentences in referenceSets and the predictively labeled version in testsets
	for k, (features, label) in enumerate(testFeatures):
		referenceSets[label].add(k)
		predicted = classifier.classify(features)
		testSets[predicted].add(k)	

	#prints metrics to show how well the feature selection did
	print 'train on %d instances, test on %d instances' % (len(trainFeatures), len(testFeatures))
	print 'accuracy:', nltk.classify.util.accuracy(classifier, testFeatures)
	#print 'pos precision:', nltk.metrics.precision(referenceSets['pos'], testSets['pos'])
	#print 'pos recall:', nltk.metrics.recall(referenceSets['pos'], testSets['pos'])
	#print 'neg precision:', nltk.metrics.precision(referenceSets['neg'], testSets['neg'])
	#print 'neg recall:', nltk.metrics.recall(referenceSets['neg'], testSets['neg'])
	classifier.show_most_informative_features(10)

#creates a feature selection mechanism that uses all words
def make_full_dict(words):
	return dict([(word, True) for word in words])

#tries using all words as the feature selection mechanism
print 'using all words as features'
evaluate_features(make_full_dict)

#scores words based on chi-squared test to show information gain
def create_word_scores():
	#creates lists of all positive and negative words
	posWords = []
	negWords = []
	with open("/home/abhijit/Documents/TwitterAnalysis/trainingandtestdata/training.1600000.processed.noemoticon.csv", 'r') as TrainFile:
		TrainTweets = csv.reader(TrainFile)
		for i in TrainTweets:
			if i[0]=="4":
				posWord = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				posWords.append(posWord)
			elif i[0]=="0":
				negWord = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				negWords.append(negWord)
	posWords = list(itertools.chain(*posWords))
	negWords = list(itertools.chain(*negWords))

	#build frequency distibution of all words and then frequency distributions of words within positive and negative labels
	word_fd = FreqDist()
	cond_word_fd = ConditionalFreqDist()
	for word in posWords:
		word_fd[word.lower()] += 1
		cond_word_fd['pos'][word.lower()] += 1
	for word in negWords:
		word_fd[word.lower()] += 1
		cond_word_fd['neg'][word.lower()] += 1

	#finds the number of positive and negative words, as well as the total number of words
	pos_word_count = cond_word_fd['pos'].N()
	neg_word_count = cond_word_fd['neg'].N()
	total_word_count = pos_word_count + neg_word_count

	#builds dictionary of word scores based on chi-squared test
	word_scores = {}
	for word, freq in word_fd.iteritems():
		pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
		neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
		word_scores[word] = pos_score + neg_score

	return word_scores

#finds word scores
word_scores = create_word_scores()

#finds the best 'number' words based on word scores
def find_best_words(word_scores, number):
	best_vals = sorted(word_scores.iteritems(), key=lambda (w, s): s, reverse=True)[:number]
	best_words = set([w for w, s in best_vals])
	return best_words

#creates feature selection mechanism that only uses best words
def best_word_features(words):
	return dict([(word, True) for word in words if word in best_words])

#numbers of features to select
numbers_to_test = [100, 1000, 10000]
#tries the best_word_features mechanism with each of the numbers_to_test of features
for num in numbers_to_test:
	print 'evaluating best %d word features' % (num)
	best_words = find_best_words(word_scores, num)
	evaluate_features(best_word_features)	

using all words as features
train on 1600000 instances, test on 359 instances
accuracy: 0.557103064067
Most Informative Features
                  wowlew = True              neg : pos    =     84.2 : 1.0
                 dogzero = True              pos : neg    =     63.0 : 1.0
                  nsane8 = True              pos : neg    =     54.3 : 1.0
                 snedwan = True              pos : neg    =     52.3 : 1.0
              Angel42579 = True              pos : neg    =     49.7 : 1.0
                   RGM77 = True              pos : neg    =     43.0 : 1.0
            angelaxjonas = True              pos : neg    =     42.3 : 1.0
            Cuttersftbll = True              pos : neg    =     42.3 : 1.0
               DarkPiano = True              pos : neg    =     42.1 : 1.0
             GlitzyGloss = True              pos : neg    =     41.0 : 1.0
evaluating best 100 word features
train on 1600000 instances, test on 359 instances
accuracy: 0.506963788301
Most Informa

### Q3. Crawl all followers of ***naveen_odisha***, Odisha CM (note: you'll have to pay attention to rate limiting)

In [1]:
import os
import time
import tweepy
import csv
import numpy as np
import pandas as pd

TWITTER_CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
TWITTER_CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
TWITTER_ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
TWITTER_ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

if(api.verify_credentials):
    print 'Sucessfully logged in \n'

def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            time.sleep(15 * 60)

for follower in limit_handled(tweepy.Cursor(api.followers, screen_name="naveen_odisha").items()):
    print("User: {0} \t\t Name: {1} \t\t Number of tweets: {2}".format(follower.screen_name.encode("utf-8"), 
                                                                          follower.name.encode("utf-8"), 
                                                                      follower.statuses_count))
    
#wrote below lines to save results of follower info in a csv file. This did not work.
#f = pd.DataFrame()
#sn = []
#un = []
#sc = []

#for follower in limit_handled(tweepy.Cursor(api.followers, screen_name="naveen_odisha").items()):
#    sn.append(follower.screen_name.encode("utf-8"))
#    print(sn)

#    un.append(follower.name.encode("utf-8"))
#    sc.append(follower.statuses_count)
#    x = np.array((sn,un,sc))
#    z = x.transpose()

#f = pd.DataFrame(z)
#f.head()
#f.to_csv('/home/abhijit/Documents/GitProjects/bdap2015/Hands-on: Twitter Sentiment Mining/nofollwers_abhijitk.csv')

Sucessfully logged in 

User: kapil_kambe 		 Name: Kapil Kambe 		 Number of tweets: 0
User: ManojaMahani 		 Name: manoja kumar mahani 		 Number of tweets: 4
User: priyadarshidash 		 Name: Priyadarshi 		 Number of tweets: 10
User: RajuKha75715155 		 Name: Raju Khan 		 Number of tweets: 1
User: lokanathdas1 		 Name: lokanath das 		 Number of tweets: 8
User: BalGyana 		 Name: Gyana Ranjan Bal 		 Number of tweets: 3
User: AKASHPATTANAIK1 		 Name: AKASH PATTANAIK 		 Number of tweets: 33
User: skrashidali3 		 Name: skrashidali 		 Number of tweets: 0
User: PrasmitJagadala 		 Name: PRASMIT JAGADALA 		 Number of tweets: 0
User: sambhunsahoo1 		 Name: sambhunath sahoo 		 Number of tweets: 0
User: bbsrdinesh 		 Name: Dinesh Kumar Patel 		 Number of tweets: 0
User: bapumishra506 		 Name: chandrashekhar mishr 		 Number of tweets: 0
User: SSwayam_Sarthak 		 Name: Swayam Sarthak 		 Number of tweets: 1
User: ambika709 		 Name: Ambika Pr. Kanungo 		 Number of tweets: 10
User: sumitbunty4 		 Name: sumit

KeyboardInterrupt: 

### Q4. Crawl all followers of SRK. How can you calculate if this is feasible or not? (show the math)

#### The Math
Twitter GET followers/ids method in API 1.1 version has the Rate Limit of 15 requests per 15-minutes window, i.e. about 60 requests per hour. Further, the method returns maximum 5000 ids per request. Given the rate limit, you have to issue more requests to get all the followers of popular users. 

SRK has a total 17.3M followers. With no interruptions, it will take about 2.4 days (=17300000/(5000x60x24)) to load SRK's all followers. 

In [5]:
import os
import time
import tweepy

TWITTER_CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
TWITTER_CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
TWITTER_ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
TWITTER_ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

if(api.verify_credentials):
    print 'Sucessfully logged in \n'

def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            time.sleep(15 * 60)

for follower in limit_handled(tweepy.Cursor(api.followers, screen_name="iamsrk").items()):
    print("User: {0} \t\t Name: {1} \t\t Number of tweets: {2}".format(follower.screen_name.encode("utf-8"), 
                                                                          follower.name.encode("utf-8"), 
                                                                      follower.statuses_count))

Sucessfully logged in 

User: lalajr_ 		 Name: Lala Jr. 		 Number of tweets: 0
User: khawerali18 		 Name: khawer ali 		 Number of tweets: 3
User: Umesha959106357 		 Name: Umesha 		 Number of tweets: 0
User: karankrsp75 		 Name: Rajesh kumar Das 		 Number of tweets: 1
User: rish_thakkar 		 Name: Rish Thakkar 		 Number of tweets: 0
User: djtstar20 		 Name: Sanchit Krishan 		 Number of tweets: 1
User: asishdebnath11 		 Name: asish debnath 		 Number of tweets: 0
User: Tisul_AS 		 Name: Tisul tgh bertenang 		 Number of tweets: 2617
User: goldenryu2000 		 Name: Nishant Sharma 		 Number of tweets: 0
User: CornelaTombuku 		 Name: Cornela Tombuku 		 Number of tweets: 0
User: 31bfd6b98b0b465 		 Name: JUNED 		 Number of tweets: 0
User: uouououo1811 		 Name: +9647708134488 		 Number of tweets: 0
User: sk19882 		 Name: suhail khan 		 Number of tweets: 0
User: rohit7071 		 Name: rohit sharma 		 Number of tweets: 0
User: TanviChipkar 		 Name: TANVI CHIPKAR 		 Number of tweets: 0
User: brzoskvinoid 		

KeyboardInterrupt: 

### Q5. Predict the sentiment of tweets by followers of ***naveen_odisha*** 

In [4]:
import csv
import re, math, collections, itertools, os
import time
import tweepy
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import accuracy, precision, recall

#this function takes a feature selection mechanism and returns its performance in a variety of metrics
def evaluate_features(feature_select):
	posFeaturesTrain = []
	negFeaturesTrain = []

	#breaks up the sentences into lists of individual words and appends 'pos' or 'neg' after each list
	with open("/home/abhijit/Documents/TwitterAnalysis/trainingandtestdata/training.1600000.processed.noemoticon.csv", 'r') as TrainFile:
		TrainTweets = csv.reader(TrainFile)
		for i in TrainTweets:
			if i[0]=="4":
				posWordsTrain = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				posWordsTrain = [feature_select(posWordsTrain), 'pos']
				posFeaturesTrain.append(posWordsTrain)
			elif i[0]=="0":
				negWordsTrain = re.findall(r"[\w']+|[.,!?;]", i[4].rstrip())
				negWordsTrain = [feature_select(negWordsTrain), 'neg']
				negFeaturesTrain.append(negWordsTrain)

	trainFeatures = posFeaturesTrain + negFeaturesTrain

#pulls follower tweets for sentiment prediction
TWITTER_CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
TWITTER_CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
TWITTER_ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
TWITTER_ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

if(api.verify_credentials):
    print 'Sucessfully logged in'

def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            time.sleep(15 * 60)
    
	#selects the features to be used for testing
	TestTweets = []
    for follower in limit_handled(tweepy.Cursor(api.followers, screen_name="naveen_odisha").items()):
        TestTweets.append(follower.text.encode("utf-8"))
        print(TestTweets)
    
    for j in TestTweets:
        WordsTest = re.findall(r"[\w']+|[.,!?;]".j.rstrip())
		WordsTest = [feature_select(WordsTest)]
		testFeatures.append(WordsTest)
	
	#trains a Naive Bayes Classifier
	classifier = NaiveBayesClassifier.train(trainFeatures)

	#initiates referenceSets and testSets
	referenceSets = collections.defaultdict(set)
	testSets = collections.defaultdict(set)	

	#puts correctly labeled sentences in referenceSets and the predictively labeled version in testsets
	for k, (features, label) in enumerate(testFeatures):
		referenceSets[label].add(k)
		predicted = classifier.classify(features)
		testSets[predicted].add(k)	

	#prints metrics to show how well the feature selection did
	print 'train on %d instances, test on %d instances' % (len(trainFeatures), len(testFeatures))
	classifier.show_most_informative_features(10)

#creates a feature selection mechanism that uses all words
def make_full_dict(words):
	return dict([(word, True) for word in words])

#tries using all words as the feature selection mechanism
print 'using all words as features'
evaluate_features(make_full_dict)

#for status in result["statuses"]:
#    print("Tweet: {0} \n Sentiment: {1} \n".format(status["text"].encode("utf-8"), classifier.classify(getFeatures(status["text"].encode("utf-8").split()))))

IndentationError: unexpected indent (<ipython-input-4-485dbe1e7a06>, line 58)