### (0) Explore annotated airline tweet data provided by CrowdFlower
A Super Handy CrowdFlower Glossary of Terms can be found [here](https://success.crowdflower.com/hc/en-us/articles/202703305-Glossary-of-Terms)!

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", 500)

#### Read-In Jobs-Level Data (from CrowdFlower's *Data for Everyone* [library](https://www.crowdflower.com/data-for-everyone/))

In [2]:
cf = pd.read_csv("http://cdn2.hubspot.net/hub/346378/file-2612489700-csv/DFE_CSVs/Airline-Full-Non-Ag-DFE-Sentiment.csv")
print cf.columns
cf.head(2)

Index([u'_unit_id', u'_created_at', u'_golden', u'_id', u'_missed',
       u'_started_at', u'_tainted', u'_channel', u'_trust', u'_worker_id',
       u'_country', u'_region', u'_city', u'_ip', u'airline_sentiment',
       u'negativereason', u'airline', u'airline_sentiment_gold', u'name',
       u'negativereason_gold', u'retweet_count', u'text', u'tweet_coord',
       u'tweet_created', u'tweet_id', u'tweet_location', u'user_timezone'],
      dtype='object')


| | _unit_id | _created_at | _golden | _id | _missed | _started_at | _tainted | _channel | _trust | _worker_id | ... | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 681448150 | 2/25/2015 04:52:40 | False | 1575073003 | | 2/25/2015 04:49:12 | False | elite | 0.8108 | 31110645 | ... | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | 570306133677760513 | | Eastern Time (US & Canada) |
| 1 | 681448150 | 2/25/2015 05:22:10 | False | 1575093916 | | 2/25/2015 05:19:59 | False | prodege | 0.8919 | 1908948 | ... | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | 570306133677760513 | | Eastern Time (US & Canada) |


#### Check for Duplicative Records

In [4]:
##Check for Duplicate judgement IDs (_id)
print cf.duplicated("_id").value_counts() #no duplicate judgement IDs

##Check for Duplicate Tweet IDs (_unit_id, tweet_id)
print cf.drop_duplicates(["_unit_id","tweet_id"]).duplicated("_unit_id").value_counts() #No duplicates: each _unit_id maps to one tweet_id
print cf.drop_duplicates(["_unit_id","tweet_id"]).duplicated("tweet_id").value_counts() #N=195 duplicates: some tweet_ids appear under more than one _unit_id
print cf.drop_duplicates(["tweet_id","text"]).duplicated("tweet_id").value_counts() #No duplicates: each tweet_id maps to one text

False    55783
dtype: int64
False    14680
dtype: int64
False    14485
True       195
dtype: int64
False    14485
dtype: int64


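The 14,680 distinct (_unit_id, tweet_id) pairs cover only 14,485 distinct tweets, so some tweets were loaded under more than one _unit_id (195 extra pairs). It is not clear why, or what _unit_id represents. Since tweet_id lines up with the actual tweet text (from which our model feature vectors will eventually be created), use tweet_id, not _unit_id, as the unit for 1 tweet.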
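One way to list the affected tweets directly is to count distinct _unit_id values per tweet_id. This is only a sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the CrowdFlower file); on the real data the same two lines would be applied to `cf`.

```python
import pandas as pd

# Hypothetical toy judgements: tweet 101 appears under two different unit IDs.
toy = pd.DataFrame({
    "_unit_id": [1, 1, 2, 3, 3],
    "tweet_id": [100, 100, 101, 101, 101],
})

# Number of distinct unit IDs attached to each tweet.
units_per_tweet = toy.groupby("tweet_id")["_unit_id"].nunique()

# Keep only the tweets that were loaded as more than one unit.
multi_unit_tweets = units_per_tweet[units_per_tweet > 1]
print(multi_unit_tweets)
```

Looking at a few of these tweets side by side may help explain what a second _unit_id means before it is discarded as a key.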


In [5]:
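##Check for Duplicates by Tweet ID, Worker ID (this should represent 1 judgement)
##Note: duplicated() on a frame already de-duplicated on the same keys is always False, so read the row count instead
cf.drop_duplicates(["tweet_id","_worker_id"]).duplicated(["tweet_id","_worker_id"]).value_counts() #54821 unique pairs vs. 55783 rows => 962 repeated (tweet_id, _worker_id) pairs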

False    54821
dtype: int64
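Because the check above calls `duplicated` on a frame that has already been de-duplicated on the same keys, it can only ever report False. A more direct check is sketched below on a made-up frame (`toy`, `pair`, and the values are hypothetical); it counts repeated (tweet_id, _worker_id) pairs on the full frame and pulls out every row involved.

```python
import pandas as pd

# Hypothetical toy judgements: worker 7 judged tweet 100 twice.
toy = pd.DataFrame({
    "tweet_id":   [100, 100, 100, 101, 101],
    "_worker_id": [7, 7, 8, 7, 9],
    "_id":        [1, 2, 3, 4, 5],
})

pair = ["tweet_id", "_worker_id"]

# Rows that repeat an earlier (tweet_id, _worker_id) pair.
n_repeated = toy.duplicated(pair).sum()

# keep=False marks every member of a repeated pair, which is handy for inspection.
repeated_rows = toy[toy.duplicated(pair, keep=False)]
print(repeated_rows)
```

If the repeats turn out to be golden (test) tweets shown to the same worker more than once, that would be worth confirming before deciding whether to drop them.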

#### Investigate How Worker Trust Scores are Calculated

In [None]:
print cf._tainted.value_counts(dropna=False) #no rows marked as tainted
print cf._trust.describe() ##All trust scores fall in the range 70% - 100%; "tainted" judgements appear to have been dropped already

In [None]:
###Look at Progression
cf.sort_values(by=["_worker_id","_started_at"])[["_worker_id","_started_at","_created_at","tweet_id","text",\
                                                  "_golden","airline_sentiment","airline_sentiment_gold","_trust"]].head(15)

* Each worker starts with a series of test (golden) tweets that determine the trust score, followed by occasional spot checks with a test tweet.
* It doesn't look like the trust score fluctuates with performance in the judgement-level data.

Next, confirm that there is only 1 trust score per worker.

In [None]:
##Confirm 1 Trust Score Per Worker
cf.drop_duplicates(["_worker_id","_trust"]).duplicated("_worker_id").value_counts() ##N=69 Workers w/ multiple trust scores

In [None]:
dups = cf.drop_duplicates(["_worker_id","_trust"])[["_worker_id","_trust"]].copy()
dups["dups"] = dups.duplicated("_worker_id") #flag a worker's second (and later) distinct trust score

dups[dups.dups==True][:10]
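An alternative way to find workers with more than one trust score is to count distinct _trust values per _worker_id. The frame below is made up for illustration (`toy` and its values are hypothetical); on the real data the same groupby would run on `cf`.

```python
import pandas as pd

# Hypothetical toy judgements: worker 7 carries two different trust scores.
toy = pd.DataFrame({
    "_worker_id": [7, 7, 7, 8, 8],
    "_trust":     [0.81, 0.81, 0.90, 0.75, 0.75],
})

# Distinct trust scores observed for each worker.
scores_per_worker = toy.groupby("_worker_id")["_trust"].nunique()

# Workers with more than one distinct score.
multi_score_workers = scores_per_worker[scores_per_worker > 1]
print(multi_score_workers)
```

The length of `multi_score_workers` corresponds to the count of workers with multiple trust scores, and its values show how many scores each one has.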

#### Look at Tasking for a couple users with multiple trust scores

In [None]:
view = cf.sort_values(by=["_worker_id","_started_at"])[["_worker_id","_started_at","_created_at","tweet_id","text",\
                                                  "_golden","airline_sentiment","airline_sentiment_gold","_trust"]].copy()

In [None]:
view[view._worker_id==25620782]

It looks like the two separate trust scores per user are associated with two different "tasking" sessions. 
Use a (weighted?) average? Or treat them as if they were two different taskers?
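The two averaging options can be compared with a short sketch. It assumes, as a simplification, that each distinct _trust value for a worker marks one session; the frame `toy` and its values are hypothetical. Since every row is one judgement, a plain mean over rows is already weighted by the number of judgements per session.

```python
import pandas as pd

# Hypothetical toy judgements: worker 7 has 3 judgements at 0.80 and 1 at 0.90.
toy = pd.DataFrame({
    "_worker_id": [7, 7, 7, 7, 8, 8],
    "_trust":     [0.80, 0.80, 0.80, 0.90, 0.75, 0.75],
})

# Weighted average: each judgement counts once, so longer sessions weigh more.
weighted_trust = toy.groupby("_worker_id")["_trust"].mean()

# Unweighted average: each distinct score counts once, regardless of session length.
unweighted_trust = (
    toy.drop_duplicates(["_worker_id", "_trust"])
       .groupby("_worker_id")["_trust"]
       .mean()
)
print(weighted_trust)
print(unweighted_trust)
```

Treating the sessions as two different taskers would instead keep _trust at the judgement level and skip the aggregation entirely.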

#### Checks to be done
* Duplicate worker IDs by name
* Golden Tweet Flag X Airline Sentiment X Gold Airline Sentiment
* Golden Tweet Flag X Airline Topic X Gold Airline Topic
* Look at examples where Airline Sentiment != Gold Airline Sentiment. Confirm Gold Airline Sentiment is the "correct" label
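The sentiment check in the list above could be sketched as a crosstab restricted to golden rows, plus a simple agreement rate. The frame `toy` and its values are made up for illustration; only the column names come from the data.

```python
import pandas as pd

# Hypothetical toy judgements: three golden rows, one regular row.
toy = pd.DataFrame({
    "_golden":                [True, True, True, False],
    "airline_sentiment":      ["negative", "neutral", "negative", "positive"],
    "airline_sentiment_gold": ["negative", "negative", "negative", None],
})

# Only golden (test) tweets carry a gold label to compare against.
golden = toy[toy["_golden"]]

# Rows: gold label; columns: worker answer.
table = pd.crosstab(golden["airline_sentiment_gold"], golden["airline_sentiment"])

# Share of golden judgements where the worker matched the gold label.
agreement = (golden["airline_sentiment"] == golden["airline_sentiment_gold"]).mean()
print(table)
```

Off-diagonal cells in `table` point at the label pairs worth reading by hand.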

#### Look at Test Tweets X Sentiment/Topic Flags and Golden Sentiment/ Topic Flags

In [8]:
print cf.columns

Index([u'_unit_id', u'_created_at', u'_golden', u'_id', u'_missed',
       u'_started_at', u'_tainted', u'_channel', u'_trust', u'_worker_id',
       u'_country', u'_region', u'_city', u'_ip', u'airline_sentiment',
       u'negativereason', u'airline', u'airline_sentiment_gold', u'name',
       u'negativereason_gold', u'retweet_count', u'text', u'tweet_coord',
       u'tweet_created', u'tweet_id', u'tweet_location', u'user_timezone'],
      dtype='object')


In [11]:
print cf.airline_sentiment.value_counts()
print cf.negativereason.value_counts()

negative    36280
neutral     11027
positive     8476
Name: airline_sentiment, dtype: int64
CSProblem         12419
late               5914
canttell           4771
cancel             3685
lostluggae         2466
badflight          2135
booking            1818
airplanestaff      1804
longlines           981
damagedluggage      292
Name: negativereason, dtype: int64


In [19]:
pd.crosstab(cf._golden, cf.airline_sentiment_gold, dropna=False)

| _golden | negative | neutral | positive |
|---|---|---|---|
| True | 9588 | 920 | 1489 |


In [20]:
pd.crosstab(cf.airline_sentiment_gold, cf.negativereason_gold, dropna=False)

| airline_sentiment_gold | CSProblem | CSProblem canttell | CSProblem lostluggae | airplanestaff | badflight | cancel | cancel CSProblem | canttell | late | late airplanestaff | late cancel | late lostluggae | lostluggae damagedluggage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| negative | 3649 | 314 | 286 | 301 | 277 | 892 | 604 | 876 | 1147 | 317 | 293 | 310 | 322 |


In [23]:
###Confirm that Golden Sentiment/ Topics are "correct"
view1 = cf[(cf._golden==True) & (cf.airline_sentiment != cf.airline_sentiment_gold)].copy()
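##Note: NaN != NaN evaluates to True in pandas, so view2 also includes golden rows where both reason fields are missing
view2 = cf[(cf._golden==True) & (cf.negativereason != cf.negativereason_gold)].copy()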

In [24]:
for a, b, c in zip(view1["text"], view1["airline_sentiment"], view1["airline_sentiment_gold"])[:20]:
    print a, b, c

@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit positive negative
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit positive negative
@united flighted delayed for hours. 10pm arrival to Vegas is now 4am. Did you seriously lose my luggage??? neutral negative
@united I have a question positive neutral
@united I have a question positive neutral
@united I have a question negative neutral
@united I have a question negative neutral
@united I have a question positive neutral
@united I have a question negative neutral
@united I have a question negative neutral
@united I have a question negative neutral
@united You shouldn't page o'head that it's best to call 1-800# - on hold 26+ mins neutral negative
@united You shouldn't page o'head that it's best to call 1-800# - on hold 26+ mins neutral negative
@united You shouldn't page o'head that it's best to call 1-800# - on hold 26+ mi

In [25]:
for a, b, c in zip(view2["text"], view2["negativereason"], view2["negativereason_gold"])[:20]:
    print a, b, c

@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit CSProblem late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
@united I'm aware of the flight details, thanks. Three hours late a crew that could not give less of a shit late late
airplanestaff
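@united I'm aware of the flight details, thanks. Three hours late a cre

In the examples above, the "gold" labels are the "correct" ones. A gold reason can list more than one reason (e.g. "late airplanestaff"), so a worker answer of "late" is flagged as a mismatch by the simple != comparison even though it appears to be one of the accepted answers.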