# Getting information about concert appearances of bands and musicians from tweets

This notebook presents my work on the data science test task provided by John Snow Labs. The goal of the test is to extract information about concert appearances of musicians, performers, or bands from tweets.

## Importing relevant external packages

In [4]:
import re
import requests
import zipfile
import io
import nltk
from nltk import word_tokenize

## Opening Bands & Artists File

The names of the bands & artists used for the purpose of this test were extracted from the webpage www.musicmp3.ru. All bands included in the categories Metal, Pop, Rock, Electronic, R&B, Classical, Alternative, Dance, Hip Hop and Soundtracks were extracted.
The file was then sorted by descending name length so that, as shown later, when a tweet contains several band names, the longest one is chosen. This reduces the chance of matching short names that are also common English words.
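
A minimal sketch of such an ordering, assuming the names are held in a Python list. The sample names are illustrative only, and how the original file was actually sorted is not shown in this notebook.

```python
# Illustrative sample; the real list is read from Bands_Artists.txt.
sample_names = ["smith", "taking back sunday", "the music", "rise against"]

# Sort by name length, longest first; ties are broken alphabetically
# so that the resulting order is deterministic.
sorted_names = sorted(sample_names, key=lambda name: (-len(name), name))

print(sorted_names)
```

With this ordering, a loop that stops at the first matching name will always prefer the longest candidate.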

In [2]:
MyBandsList = []
with open('Bands_Artists.txt', 'r', encoding="utf8") as mybands:
    for band in mybands:
        lowercase_base = band.lower()
        MyBandsList.append(lowercase_base.strip())

The number of distinct band names used is:

In [3]:
print(len(MyBandsList))

22663


In [4]:
print("\n".join(MyBandsList))

john alldis choir, london symphony orchestra, nobuko imai & sir colin davis
kurt sanderling, peter schreier, birgit finnila & berlin symphony orchestra
laurent garnier & bugge wesseltoft & philippe nadaud & benjamin rippert
nicky hopkins, ry cooder, mick jagger, bill wyman & charlie watts
steven lubin & the academy of ancient music & christopher hogwood
ketil bjornstad, david darling, terje rypdal & jon christensen
christine pedi carolee carmello philip hoffman frank wildhorn
acid mothers temple & the melting paraiso u.f.o. / escapade
nick glennie-smith, hans zimmer and harry gregson-williams
adrian johnston / dickon hinchliffe / barrington pheloung
tigran hamasyan, arve henriksen, eivind aarset & jan bang
shye ben tzur, jonny greenwood and the rajasthan express
mahalia barnes & the soul mates featuring joe bonamassa
marcus miller & orchestre philharmonique de monte-carlo
james luther dickinson and north mississippi all stars
michael landau, robben ford, jimmy haslip & gary novak
shirl

## Opening Sentiment Analysis Dictionary

I used AFINN-111 (accessible from http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010) to compute the sentiment of the tweets. AFINN-111 is a list of 2477 English words and phrases, each rated for sentiment with an integer between minus five (negative) and plus five (positive).

In [5]:
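sentiment_dictionary = {}
with open('AFINN-111.txt', 'r', encoding="utf8") as afinn_file:
    for line in afinn_file:
        # Each line holds a word or phrase and its integer score, separated by a tab.
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)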

In [6]:
print(len(sentiment_dictionary))

2477


In [7]:
for word in sentiment_dictionary:
    print(word,sentiment_dictionary[word])

abandon -2
abandoned -2
abandons -2
abducted -2
abduction -2
abductions -2
abhor -3
abhorred -3
abhorrent -3
abhors -3
abilities 2
ability 2
aboard 1
absentee -1
absentees -1
absolve 2
absolved 2
absolves 2
absolving 2
absorbed 1
abuse -3
abused -3
abuses -3
abusive -3
accept 1
accepted 1
accepting 1
accepts 1
accident -2
accidental -2
accidentally -2
accidents -2
accomplish 2
accomplished 2
accomplishes 2
accusation -2
accusations -2
accuse -2
accused -2
accuses -2
accusing -2
ache -2
achievable 1
aching -2
acquit 2
acquits 2
acquitted 2
acquitting 2
acrimonious -3
active 1
adequate 1
admire 3
admired 3
admires 3
admiring 3
admit -1
admits -1
admitted -1
admonish -2
admonished -2
adopt 1
adopts 1
adorable 3
adore 3
adored 3
adores 3
advanced 1
advantage 2
advantages 2
adventure 2
adventures 2
adventurous 2
affected -1
affection 3
affectionate 3
afflicted -1
affronted -1
afraid -2
aggravate -2
aggravated -2
aggravates -2
aggravating -2
aggression -2
aggressions -2
aggressive -2
aghast 

## Opening User Location Dictionary

User location was inferred from the file training_set_users.txt (available at https://archive.org/details/twitter_cikm_2010). It contains the ID of each Twitter user and his or her location.

In [8]:
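where_dictionary = {}
with open('training_set_users.txt', 'r', encoding="utf8") as users_file:
    for line in users_file:
        # Each line holds a user ID and a free-text location, separated by a tab.
        sline = line.rstrip('\n')
        user, local = sline.split('\t')
        where_dictionary[user] = local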

In [9]:
for user in where_dictionary:
    print(user,where_dictionary[user])

14 San Francisco
15 San Francisco
18 San Francisco, CA
19922973 Chicago, IL
63963170 New York
17825828 Houston
71303208 Buffalo, NY
24117294 Brooklyn
67 Bloomington, IN
15728730 Las Vegas, NV
18874383 San Francisco
94 San Francisco, CA
20097724 Miami, Florida
107 San Francisco, CA
17126765 Atlanta
30408856 New Orleans
21670598 Kent
20126941 Miami
84585166 Ontario
78643425 Houma, Louisiana
10486002 New York
246 San Francisco, CA
257 Portland, OR
259 San Francisco, CA
35651857 Birmingham, AL
291 San Francisco, CA
24117550 Phoenix, Arizona
68157749 Houston
87032139 Richmond
45089102 Murfreesboro, TN
75148003 Ballwin, MO
59 St. Louis, MO
364 Boston, MA
80740718 Tuscaloosa, AL
61 San Francisco, CA
15729017 Los Angeles, CA
66 San Francisco, California
15379182 Marina del Rey, CA
71303574 Pontiac, Michigan
414 San Francisco, California
6291872 San Francisco
45089208 Hialeah, FL
84935116 San Diego, California
32506326 Las Vegas
10486242 Chicago, IL
38797799 Los Angeles
46137833 Mount Dora, Flo

## Identifying relevant Tweets (with the word 'concert') from 3 844 612 Tweets

The archive at https://archive.org/details/twitter_cikm_2010 contains a collection of tweets; each record includes the user ID, the tweet ID, the tweet text, and the respective date and time.

Its file training_set_tweets.txt is too large (> 400 MB) to upload to GitHub (which has a file size limit of 100 MB). Therefore, the zip file must first be downloaded and the file then extracted to a local directory.

In [5]:
r = requests.get('https://archive.org/download/twitter_cikm_2010/twitter_cikm_2010.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extract('training_set_tweets.txt')

'C:\\Projects\\concert_tweets\\training_set_tweets.txt'

Only the tweets from this file that include the word "concert" were taken into account for this test.

In [10]:
ConcertTweetList = []
with open('training_set_tweets.txt', 'r', encoding="utf8") as mydata:
    for line in mydata:
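        # Case-sensitive match on lowercase 'concert'; the count reported below
        # comes from this behaviour. A bare line.lower() call discards its result,
        # so use `'concert' in line.lower()` to also catch 'Concert' or 'CONCERT'.
        if 'concert' in line: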
            ConcertTweetList.append(line.strip())

In [11]:
SentenceList = []
for sentence in ConcertTweetList:
    SentenceList.append(sentence.split('\t'))

The number of tweets including the word 'concert' amounted to:

In [12]:
print(len(SentenceList))

5352


In [13]:
for sentence in SentenceList:
    print(sentence[0])
    print(sentence[1])
    print(sentence[2])
    print(sentence[3])
    print('===========================================')

22937622
3184069233
taking back sunday is playing the asu fall welcome concert. i used to like them, like 4 years ago.
2009-08-07 16:52:52
18612228
5568283054
Off to digital design class & then off to a jazz concert later on tonight! :)
2009-11-09 14:39:56
67
5081876056
You can hear the concert from across campus, hopefully the rain won't dampen the event too much.
2009-10-22 17:55:30
67
5080442016
Wow the music from the concert over on Dunn meadow is rattling the windows in my office at Lindley Hall, and I can almost make out the words
2009-10-22 16:53:55
17301577
5250078985
RT @IDS07: Nususa Live 4 Hip Hop Edition Nov 2 @ 7PM CST free streaming concert http://nususalive.com feat. Analyrical & Tonecrusher Smith
2009-10-28 22:36:07
16908436
5428808334
Just bought my tix for the 91X concert in San Diego with Rise Against and Anberlin. O frabjous day! Callooh! Callay! /chortle
2009-11-04 13:59:57
73138375
5979606016
VOTE George Press as BEST OF NJ: Jewelry! ....and be entered to win 4 VI

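IndexError: list index out of range

The loop stops with an IndexError because at least one record splits into fewer than four tab-separated fields, so one of the indexed fields does not exist. The extraction code in the next section guards against such records by checking `len(sentence)` before reading the timestamp.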
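
A defensive way to inspect the records, sketched under the assumption that short records are simply missing their trailing fields. The helper name `format_record` and the field labels are hypothetical, and the two sample records are illustrative rather than copied from the dataset.

```python
def format_record(fields):
    """Return a dict describing one tab-split tweet record.

    Missing trailing fields are reported as None instead of raising IndexError.
    """
    labels = ("user_id", "tweet_id", "text", "timestamp")
    head = list(fields[:4])
    # Pad short records with None so that every label has a value.
    padded = head + [None] * (len(labels) - len(head))
    return dict(zip(labels, padded))


# Illustrative records: one complete, one without a timestamp.
records = [
    ["67", "5081876056", "You can hear the concert from across campus", "2009-10-22 17:55:30"],
    ["73138375", "5979606016", "VOTE George Press as BEST OF NJ"],
]
for fields in records:
    print(format_record(fields))
```

Printing the padded records makes incomplete lines visible without interrupting the loop.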

## Extracting the requested information (who, when, where, audience, sentiment) from relevant concert tweets having the name of the band or artist

The requested information was extracted from the relevant tweets (with the word 'concert') only.
The date of the concert was considered to be the same as the date of the tweet.
The sentiment of a tweet was computed by summing the AFINN-111 scores of all its words. A total between -1 and 1 (inclusive) was considered neutral, a total below -1 negative, and a total above 1 positive.
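
The thresholding rule can be sketched on its own as follows. This is a simplified illustration: it splits on whitespace instead of using `word_tokenize`, the function name `classify_sentiment` is hypothetical, and the toy lexicon holds only three AFINN-111 entries.

```python
def classify_sentiment(tokens, lexicon):
    """Sum per-token scores and map the total to a sentiment label."""
    # Tokens missing from the lexicon contribute a score of 0.
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 1:
        return "positive"
    if score < -1:
        return "negative"
    return "neutral"


# Three entries taken from AFINN-111; the full lexicon has 2477.
toy_lexicon = {"adore": 3, "accident": -2, "aboard": 1}

print(classify_sentiment("i adore this concert".split(), toy_lexicon))
print(classify_sentiment("accident at the concert".split(), toy_lexicon))
print(classify_sentiment("aboard the concert bus".split(), toy_lexicon))
```

A single mildly positive or negative word (score of 1 or -1) is therefore not enough to move a tweet out of the neutral class.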

Running this code over 5352 tweets, with 22663 band names and 2477 sentiment words, takes around 1 hour. Therefore, the code below is limited to the first 500 relevant tweets; this limit can easily be changed in the code.

In [14]:
WhoList = []
WhenList = []
WhereList = []
AudienceList = []
SentimentList = []

#Slice from the first 500 relevant tweets (since it takes a long time running over the 5 thousand relevant tweets)
for sentence in SentenceList[:500]:
    for band in MyBandsList:
        if sentence[0].isdigit() and re.search(r'\b' + re.escape(band)+ r'\b', sentence[2].lower()):
            WhoList.append(band)
            if len(sentence)==4:                
                WhenList.append(sentence[3][:10])
            else:
                WhenList.append(None)
            WhereList.append(where_dictionary.get(sentence[0],None))
            AudienceList.append(sentence[0])
            token_words = word_tokenize(str(sentence[2]).lower())
            sentiment_score = sum(sentiment_dictionary.get(word,0) for word in token_words)
            if sentiment_score>1:
                sentiment = 'positive'
            elif sentiment_score <-1:
                sentiment = 'negative'
            else:
                sentiment = 'neutral'
            SentimentList.append(sentiment)
            break

The number of relevant tweets (with the word 'concert') among the first 500 that include the name of a band or artist is shown below:

In [15]:
print(len(WhoList))

271


These 271 tweets are listed below, along with the extracted information:

In [16]:
for index, item in enumerate(WhoList):
    print(index,'| who:',item,'| when:',WhenList[index],'| where:',WhereList[index],'| audience:',AudienceList[index], '| sentiment:', SentimentList[index])

0 | who: taking back sunday | when: 2009-08-07 | where: Phoenix | audience: 22937622 | sentiment: positive
1 | who: the music | when: 2009-10-22 | where: Bloomington, IN | audience: 67 | sentiment: positive
2 | who: smith | when: 2009-10-28 | where: Minneapolis, MN | audience: 17301577 | sentiment: neutral
3 | who: rise against | when: 2009-11-04 | where: Los Angeles, CA | audience: 16908436 | sentiment: neutral
4 | who: taylor swift | when: 2009-11-23 | where: Livingston, NJ | audience: 73138375 | sentiment: positive
5 | who: motley crue | when: 2009-08-10 | where: Compton | audience: 33161623 | sentiment: neutral
6 | who: blue oyster cult | when: 2009-08-03 | where: Columbus, Ohio | audience: 15598051 | sentiment: positive
7 | who: jamie foxx | when: 2009-10-01 | where: Mesa, AZ | audience: 23331369 | sentiment: neutral
8 | who: michael jackson | when: 2009-11-06 | where: Los Angeles, CA | audience: 32200125 | sentiment: neutral
9 | who: ida | when: 2009-12-02 | where: Orlando, FL | 

## Conclusion & Further Work

There are several ways this project could be improved, including:
1) computationally, by using more efficient data structures and/or nesting distinct blocks (at the expense, however, of source code readability);
2) technically, by:
a) researching further the existing databases that list the names of bands and musicians;
b) excluding band names that are identical to commonly used English words;
c) when a tweet contains several potential band names, taking into account the position of each name relative to the word "concert" (e.g., "U2 concert" or "Rihanna's concert");
d) searching online databases of past concert dates and correlating them with the date and location of the tweet;
e) looking in the tweets for the names of specific locations (cities and countries) rather than relying only on the user's location.
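
As one possible illustration of point 1, all band names could be compiled into a single regular expression so that each tweet is scanned once instead of once per name. This is a sketch, not the method used above: the helper names `build_band_pattern` and `find_band` are hypothetical, and the behaviour differs slightly, since the pattern returns the leftmost match in the tweet (preferring the longest name at that position) rather than the longest name anywhere in the tweet.

```python
import re


def build_band_pattern(band_names):
    """Compile one alternation pattern from a non-empty list of lowercase names."""
    # Longest names first, so the alternation tries them before shorter ones.
    ordered = sorted(set(band_names), key=lambda name: (-len(name), name))
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


def find_band(pattern, text):
    """Return the first band name found in the text, or None."""
    match = pattern.search(text.lower())
    return match.group(0) if match else None


# Illustrative subset of band names.
pattern = build_band_pattern(["rise", "rise against", "anberlin"])
print(find_band(pattern, "91X concert in San Diego with Rise Against and Anberlin"))
print(find_band(pattern, "Off to a jazz concert later on tonight!"))
```

The pattern is built once and reused for every tweet, which removes the inner loop over all band names.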

Nevertheless, I believe the purpose of this task is not to deliver a complete solution, which would take somewhat longer, but to show some expertise in using the specific analytical tools you mentioned, as well as the rationale behind how I approach a given problem.

I hope you enjoyed reading it as much as I enjoyed doing and learning from it.