### Explore annotated airline tweet data provided by CrowdFlower
A Super Handy CrowdFlower Glossary of Terms can be found [here](https://success.crowdflower.com/hc/en-us/articles/202703305-Glossary-of-Terms)!

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", 500)

#### Read-In Jobs-Level Data (from CrowdFlower's *Data for Everyone* [library](https://www.crowdflower.com/data-for-everyone/))

In [2]:
cf = pd.read_csv("http://cdn2.hubspot.net/hub/346378/file-2612489700-csv/DFE_CSVs/Airline-Full-Non-Ag-DFE-Sentiment.csv")
print(cf.columns)
cf.head(2)

Index([u'_unit_id', u'_created_at', u'_golden', u'_id', u'_missed',
       u'_started_at', u'_tainted', u'_channel', u'_trust', u'_worker_id',
       u'_country', u'_region', u'_city', u'_ip', u'airline_sentiment',
       u'negativereason', u'airline', u'airline_sentiment_gold', u'name',
       u'negativereason_gold', u'retweet_count', u'text', u'tweet_coord',
       u'tweet_created', u'tweet_id', u'tweet_location', u'user_timezone'],
      dtype='object')


Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,2/25/2015 04:52:40,False,1575073003,,2/25/2015 04:49:12,False,elite,0.8108,31110645,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)
1,681448150,2/25/2015 05:22:10,False,1575093916,,2/25/2015 05:19:59,False,prodege,0.8919,1908948,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)


#### Check for Duplicative Records

In [3]:
##Check for duplicate judgement IDs (_id)
print(cf.duplicated("_id").value_counts()) #no duplicate judgement IDs

##Check for duplicate tweet IDs (_unit_id, tweet_id)
print(cf.drop_duplicates(["_unit_id","tweet_id"]).duplicated("_unit_id").value_counts()) #no _unit_id maps to more than one tweet_id
print(cf.drop_duplicates(["_unit_id","tweet_id"]).duplicated("tweet_id").value_counts()) #N=195 (_unit_id, tweet_id) pairs repeat a tweet_id
print(cf.drop_duplicates(["tweet_id","text"]).duplicated("tweet_id").value_counts()) #no tweet_id maps to more than one text

False    55783
dtype: int64
False    14680
dtype: int64
False    14485
True       195
dtype: int64
False    14485
dtype: int64


It is not clear why some tweets appear under more than one `_unit_id`, or what `_unit_id` actually represents. Since `tweet_id` lines up with the actual tweet text (from which our model feature vectors will eventually be created), use `tweet_id`, not `_unit_id`, as the unit of 1 tweet.
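
A quick way to see which tweets sit under more than one unit is to count distinct `_unit_id` values per `tweet_id`. The sketch below uses a small hypothetical frame (`toy`) in place of `cf`; only the two column names come from the data above.

```python
import pandas as pd

# Hypothetical stand-in for cf: tweet 102 appears under two different _unit_id values.
toy = pd.DataFrame({
    "_unit_id": [1, 1, 2, 3, 4],
    "tweet_id": [101, 101, 102, 102, 103],
})

# Number of distinct _unit_id values attached to each tweet_id.
units_per_tweet = toy.groupby("tweet_id")["_unit_id"].nunique()

# Tweets that were loaded under more than one unit.
multi_unit_tweets = units_per_tweet[units_per_tweet > 1]
print(multi_unit_tweets)
```

Running the same two lines on `cf` would list the affected tweet IDs, which can then be inspected by hand.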


In [4]:
##Check for duplicates by tweet ID and worker ID (each pair should represent 1 judgement)
cf.duplicated(["tweet_id","_worker_id"]).value_counts() #962 rows repeat an earlier (tweet_id, _worker_id) pair

False    54821
True       962
dtype: int64
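
To look at the rows behind any repeated (tweet_id, _worker_id) pair, `duplicated(..., keep=False)` flags every member of a repeated group rather than only the later copies. This is a minimal sketch on a hypothetical frame (`toy`); the values are made up and only the column names match `cf`.

```python
import pandas as pd

# Hypothetical stand-in for cf: worker 7 judged tweet 101 twice.
toy = pd.DataFrame({
    "_id": [10, 11, 12, 13],
    "tweet_id": [101, 101, 101, 102],
    "_worker_id": [7, 7, 8, 7],
    "_golden": [True, True, True, False],
})

pair = ["tweet_id", "_worker_id"]

# keep=False marks all rows of a repeated pair, so both judgements are shown side by side.
repeated = toy[toy.duplicated(pair, keep=False)].sort_values(pair + ["_id"])
print(repeated)
```

Checking the `_golden` column of the result would show whether repeats are limited to test tweets.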

In [None]:
##Check for duplicates among user name and worker id

#### Investigate How Worker Trust Scores are Calculated

In [5]:
print(cf._tainted.value_counts(dropna=False)) #no judgements marked as tainted
print(cf._trust.describe()) ##All trust scores fall in the range 0.70 - 1.00, consistent with "tainted" judgements having been dropped

False    55783
Name: _tainted, dtype: int64
count    55783.000000
mean         0.850374
std          0.066688
min          0.700000
25%          0.809500
50%          0.857100
75%          0.892900
max          1.000000
Name: _trust, dtype: float64


In [6]:
###Look at Progression
cf.sort_values(by=["_worker_id","_started_at"])[["_worker_id","_started_at","_created_at","tweet_id","text",\
                                                  "_golden","airline_sentiment","airline_sentiment_gold","_trust"]].head(15)

Unnamed: 0,_worker_id,_started_at,_created_at,tweet_id,text,_golden,airline_sentiment,airline_sentiment_gold,_trust
3954,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,569851578276048896,"@united I'm aware of the flight details, thank...",True,negative,negative,0.8919
6398,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,569473998519578624,@united flighted delayed for hours. 10pm arriv...,True,negative,negative,0.8919
10921,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,568637541513089024,@united rebooked 24 hours after original fligh...,True,negative,negative,0.8919
19831,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,568752276040495104,@SouthwestAir If a travel advisory is posted f...,True,neutral,neutral,0.8919
23682,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,567717985092395008,@southwestair - kind of early but any idea whe...,True,neutral,neutral,0.8919
29994,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,568182544124014592,@JetBlue I am heading to JFK now just on princ...,True,negative,negative,0.8919
44130,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,568824537338417154,@AmericanAir - how long does it take to get cr...,True,negative,negative,0.8919
44347,1908948,2/25/2015 03:16:13,2/25/2015 03:32:53,568551906634797056,@AmericanAir Hopefully you ll see bad ones as ...,True,neutral,positive,0.8919
14307,1908948,2/25/2015 03:32:54,2/25/2015 03:33:59,567778009013178368,@united So what do you offer now that my fligh...,True,negative,negative,0.8919
43466,1908948,2/25/2015 03:32:54,2/25/2015 03:33:59,569601363799359488,@AmericanAir should reconsider #usairways acqu...,True,negative,negative,0.8919


* Each worker starts with a series of test (golden) tweets to determine the trust score, then gets occasional spot checks with a test tweet.
* It doesn't look like the trust score fluctuates with performance in the judgement-level data.

Confirm that there is only 1 trust score per worker.
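
One way to probe how the golden tweets noted above relate to `_trust` is to compute each worker's accuracy on golden rows and put it next to the reported score. That trust is roughly "share of golden tweets answered correctly" is an assumption here, not something the data confirms, and the frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for cf. Worker 1 gets 2 of 3 golden tweets right, worker 2 gets 2 of 2.
toy = pd.DataFrame({
    "_worker_id": [1, 1, 1, 1, 2, 2],
    "_golden": [True, True, True, False, True, True],
    "airline_sentiment": ["negative", "neutral", "neutral", "positive", "negative", "positive"],
    "airline_sentiment_gold": ["negative", "neutral", "positive", None, "negative", "positive"],
    "_trust": [0.6667, 0.6667, 0.6667, 0.6667, 1.0, 1.0],
})

# Keep only golden (test) rows and mark whether the worker matched the gold label.
gold = toy[toy["_golden"]].copy()
gold["correct"] = gold["airline_sentiment"] == gold["airline_sentiment_gold"]

# Assumed definition: accuracy on golden rows, compared with the reported trust score.
gold_accuracy = gold.groupby("_worker_id")["correct"].mean()
trust = toy.groupby("_worker_id")["_trust"].first()
summary = pd.DataFrame({"gold_accuracy": gold_accuracy, "trust": trust})
print(summary)
```

If the two columns disagree widely on the real data, the assumed definition is wrong or trust is computed over more than the sentiment question.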

In [7]:
##Confirm 1 trust score per worker
cf.drop_duplicates(["_worker_id","_trust"]).duplicated("_worker_id").value_counts() ##N=69 extra (worker, trust) pairs, so some workers have multiple trust scores

False    503
True      69
dtype: int64
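
The count of `True` values above is the number of extra (worker, trust) pairs, which equals the number of affected workers only if nobody has three or more scores. A sketch of counting workers directly with `nunique`, on a hypothetical frame (`toy`):

```python
import pandas as pd

# Hypothetical stand-in for cf: worker 1 has three distinct trust scores, workers 2 and 3 have one each.
toy = pd.DataFrame({
    "_worker_id": [1, 1, 1, 2, 2, 3],
    "_trust": [0.80, 0.90, 0.95, 0.85, 0.85, 0.70],
})

# Distinct trust scores per worker, then the number of workers with more than one.
scores_per_worker = toy.groupby("_worker_id")["_trust"].nunique()
n_multi = int((scores_per_worker > 1).sum())

# For comparison: the drop_duplicates/duplicated approach counts extra pairs, not workers.
n_extra_pairs = int(
    toy.drop_duplicates(["_worker_id", "_trust"]).duplicated("_worker_id").sum()
)
print(n_multi)
print(n_extra_pairs)
```

On the toy frame the two numbers differ because one worker carries three scores; on `cf` they may or may not coincide.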

In [8]:
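##One row per distinct (worker, trust) pair
dups = cf.drop_duplicates(["_worker_id","_trust"])[["_worker_id","_trust"]].copy()
##Flag every pair after a worker's first one, i.e. the additional trust scores
dups["dups"] = dups.duplicated("_worker_id")

dups[dups.dups][:10]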

Unnamed: 0,_worker_id,_trust,dups
45196,10078394,0.931,True
45197,25620782,0.9189,True
45198,27392644,0.8077,True
45199,31777977,0.8182,True
45201,25411252,0.9545,True
45202,18776579,0.8667,True
45204,31560993,0.8,True
45206,28976121,0.9412,True
45209,28834224,0.9333,True
45210,29746131,0.8889,True


#### Look at Tasking for a couple of workers with multiple trust scores

In [9]:
view = cf.sort_values(by=["_worker_id","_started_at"])[["_worker_id","_started_at","_created_at","tweet_id","text",\
                                                  "_golden","airline_sentiment","airline_sentiment_gold","_trust"]].copy()

In [10]:
view[view._worker_id==25620782]

Unnamed: 0,_worker_id,_started_at,_created_at,tweet_id,text,_golden,airline_sentiment,airline_sentiment_gold,_trust
6508,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,569473998519578624,@united flighted delayed for hours. 10pm arriv...,True,negative,negative,0.8108
19939,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,568752276040495104,@SouthwestAir If a travel advisory is posted f...,True,neutral,neutral,0.8108
28886,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,568606685230555136,@JetBlue time to reevaluate my nyc carrier.,True,neutral,negative,0.8108
42140,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,569944281512685570,@AmericanAir FYI...call stilling getting dropp...,True,negative,negative,0.8108
42378,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,569842758967386112,@AmericanAir how can I get you guys to respond...,True,negative,negative,0.8108
42860,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,569680231012773888,@AmericanAir 800 number will not even let you...,True,negative,negative,0.8108
43083,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,569622568459636736,@AmericanAir I want to speak to a human being!...,True,negative,negative,0.8108
44230,25620782,2/25/2015 08:44:49,2/25/2015 08:48:44,568824537338417154,@AmericanAir - how long does it take to get cr...,True,negative,negative,0.8108
7082,25620782,2/25/2015 08:48:45,2/25/2015 08:49:52,569343661063823360,@united I have a question,True,neutral,neutral,0.8108
38574,25620782,2/25/2015 08:48:45,2/25/2015 08:49:52,568561924985782272,@USAirways thank you finally got our bag. Cust...,True,positive,positive,0.8108


It looks like the two separate trust scores per worker are associated with two different "tasking" sessions. 
Use a (weighted?) average, or treat them as if they came from two different taskers?
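
Both options are easy to prototype. The sketch below runs on a hypothetical frame (`toy`), and the `tasker_key` column is an illustrative name, not part of the data. It relies on each row being one judgement, so a plain mean of `_trust` over a worker's rows is already weighted by the number of judgements carrying each score.

```python
import pandas as pd

# Hypothetical stand-in for cf: worker 1 has 3 judgements at 0.80 and 1 at 0.90.
toy = pd.DataFrame({
    "_worker_id": [1, 1, 1, 1, 2, 2],
    "_trust": [0.80, 0.80, 0.80, 0.90, 0.70, 0.70],
})

# Option 1: judgement-weighted average trust per worker.
weighted_trust = toy.groupby("_worker_id")["_trust"].mean()

# Option 2: treat each (worker, trust) combination as its own tasker.
# Caveat: two sessions that happen to share a trust score would be merged.
toy["tasker_key"] = toy["_worker_id"].astype(str) + "_" + toy["_trust"].astype(str)

print(weighted_trust)
print(toy["tasker_key"].nunique())
```

Option 1 keeps one identity per worker; option 2 keeps the session-level score that was in effect when each judgement was made.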

#### Checks to be done
* Duplicate worker IDs by name
* Golden Tweet Flag X Airline Sentiment X Gold Airline Sentiment
* Golden Tweet Flag X Airline Topic X Gold Airline Topic
* Look at examples where Airline Sentiment != Gold Airline Sentiment. Confirm that Gold Airline Sentiment is the "correct" label.
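
The sentiment cross-tabulation and the mismatch check in the list above could be sketched with `pd.crosstab` and a boolean filter. The frame `toy` is hypothetical, and the `"(none)"` placeholder for missing gold labels is an arbitrary choice made so that non-golden rows still show up in the table.

```python
import pandas as pd

# Hypothetical stand-in for cf: three golden rows (one mismatch) and two regular rows.
toy = pd.DataFrame({
    "_golden": [True, True, True, False, False],
    "airline_sentiment": ["negative", "neutral", "neutral", "positive", "negative"],
    "airline_sentiment_gold": ["negative", "neutral", "positive", None, None],
})

# Golden flag x worker sentiment (rows) against gold sentiment (columns).
gold_col = toy["airline_sentiment_gold"].fillna("(none)")
table = pd.crosstab([toy["_golden"], toy["airline_sentiment"]], gold_col)
print(table)

# Golden rows where the worker's label differs from the gold label.
mismatches = toy[toy["_golden"] & (toy["airline_sentiment"] != toy["airline_sentiment_gold"])]
print(mismatches)
```

The same pattern would apply to the topic check by swapping in `negativereason` and `negativereason_gold`.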