### Clean & Preprocess CrowdFlower Data Prior to Model Training
A Super Handy CrowdFlower Glossary of Terms can be found [here](https://success.crowdflower.com/hc/en-us/articles/202703305-Glossary-of-Terms)!

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

#### Read-In Jobs-Level Data (from CrowdFlower's *Data for Everyone* [library](https://www.crowdflower.com/data-for-everyone/))

In [2]:
cf = pd.read_csv("http://cdn2.hubspot.net/hub/346378/file-2612489700-csv/DFE_CSVs/Airline-Full-Non-Ag-DFE-Sentiment.csv")
print(cf.columns)
cf.head(2)

Index([u'_unit_id', u'_created_at', u'_golden', u'_id', u'_missed',
       u'_started_at', u'_tainted', u'_channel', u'_trust', u'_worker_id',
       u'_country', u'_region', u'_city', u'_ip', u'airline_sentiment',
       u'negativereason', u'airline', u'airline_sentiment_gold', u'name',
       u'negativereason_gold', u'retweet_count', u'text', u'tweet_coord',
       u'tweet_created', u'tweet_id', u'tweet_location', u'user_timezone'],
      dtype='object')


Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,2/25/2015 04:52:40,False,1575073003,,2/25/2015 04:49:12,False,elite,0.8108,31110645,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)
1,681448150,2/25/2015 05:22:10,False,1575093916,,2/25/2015 05:19:59,False,prodege,0.8919,1908948,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)


#### Split Test (i.e. Golden Tweets) out from the non-test tweets.
* (We already know the "correct" answers for the Test tweets, so we can process those separately.)
* Also, per CF's recommendation, we only keep judgements from contributors with a trust score of at least 70%.

In [3]:
gold =   cf[cf._golden==True].copy()
nogold = cf[cf._golden==False].copy()

print(gold.shape, nogold.shape)

(11997, 27) (43786, 27)
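
The cell above splits on `_golden` only. If the 70% trust cutoff from the notes needs to be applied explicitly, a minimal sketch could look like the following. It assumes `_trust` is stored on a 0-1 scale (as the `head` output suggests); the toy frame and the name `MIN_TRUST` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for `nogold`; the values are made up for illustration.
toy = pd.DataFrame({
    "_unit_id": [1, 1, 2, 2],
    "_trust": [0.81, 0.65, 0.70, 0.92],
    "airline_sentiment": ["negative", "neutral", "positive", "negative"],
})

MIN_TRUST = 0.70  # assumed cutoff, taken from the "at least 70%" note

# Keep only judgements from contributors at or above the cutoff.
trusted = toy[toy["_trust"] >= MIN_TRUST].copy()
print(trusted.shape)
```

On the real data the same boolean mask would be applied to `nogold` (and, if desired, `gold`) before any aggregation.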


In [None]:
### Look at D

#### Create 0/1 sentiment indicator for non-test tweets

In [None]:
nogold.airline_sentiment.value_counts(dropna=False)

In [None]:
# Flag each judgement: complaint0 = positive/neutral, complaint1 = negative.
is_negative = nogold["airline_sentiment"] == "negative"
is_not_negative = nogold["airline_sentiment"].isin(["positive", "neutral"])

nogold["complaint0"] = 0
nogold["complaint1"] = 0
nogold.loc[is_not_negative, "complaint0"] = 1
nogold.loc[is_negative, "complaint1"] = 1

# Record the contributor's trust score on the side they voted for (NaN otherwise).
# .loc replaces the deprecated .ix indexer.
nogold["complaint0trust"] = np.nan
nogold["complaint1trust"] = np.nan
nogold.loc[is_not_negative, "complaint0trust"] = nogold._trust
nogold.loc[is_negative, "complaint1trust"] = nogold._trust

In [None]:
nogold[["complaint0","complaint1","complaint0trust","complaint1trust","_trust"]].head(5)

In [None]:
cp = nogold.groupby(by=["tweet_id", "text"], as_index=False)\
[["complaint0","complaint1","complaint0trust","complaint1trust"]].sum()

cp["negProb"] = cp.complaint1trust/(cp.complaint0trust+cp.complaint1trust)

cp.head(5)
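
To make the `negProb` formula concrete, here is a small worked sketch on made-up judgements: the trust-weighted share of "negative" votes per tweet. It is a variant of the cell above, not a copy. It writes 0.0 rather than NaN for the side a contributor did not vote for, so unanimous tweets come out as exactly 0 or 1 regardless of how the installed pandas version sums all-NaN groups. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical judgements: three for tweet 101, two for tweet 102.
toy = pd.DataFrame({
    "tweet_id": [101, 101, 101, 102, 102],
    "airline_sentiment": ["negative", "neutral", "negative", "positive", "positive"],
    "_trust": [0.9, 0.8, 0.7, 0.85, 0.95],
})

is_negative = toy["airline_sentiment"] == "negative"

# Trust mass on each side; 0.0 (not NaN) where the contributor voted the other way.
toy["complaint1trust"] = toy["_trust"].where(is_negative, 0.0)
toy["complaint0trust"] = toy["_trust"].where(~is_negative, 0.0)

# Sum trust per tweet, then take the negative share of the total.
sums = toy.groupby("tweet_id")[["complaint0trust", "complaint1trust"]].sum()
sums["negProb"] = sums["complaint1trust"] / (sums["complaint0trust"] + sums["complaint1trust"])
print(sums["negProb"].round(3).to_dict())
```

Tweet 101 has 1.6 of its 2.4 total trust on the negative side (about 0.67), while tweet 102 has none, so its `negProb` is 0.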

In [None]:
ambiguous = cp[(cp.complaint0 > 0) & (cp.complaint1 > 0)]

In [None]:
plt.figure(figsize=(20,10))
plt.hist(ambiguous["negProb"].tolist(), bins=50)

##### Classify Labels

In [None]:
def define_complaints(sentiment, complaint):
    """Map a sentiment / negative-reason pair to a coarse complaint category."""
    if sentiment in ["positive", "neutral"]:
        return "No Complaint"
    elif complaint in ["Cancelled Flight", "Late Flight"]:
        return "Delay or Cancellation"
    elif complaint in ["Lost Luggage", "Damaged Luggage"]:
        return "Lost or Damaged Luggage"
    elif complaint in ["Customer Service Issue", "Flight Attendant Complaints", "Flight Booking Problems", "longlines"]:
        return "Customer Service"
    elif complaint in ["Bad Flight", "Can't Tell"]:
        return "Unknown"
    # Any other combination (e.g. a missing negativereason) falls through and returns None.

# Apply row by row (axis=1) so each call receives one judgement's sentiment and reason.
cf["complaint"] = cf.apply(
    lambda row: define_complaints(row["airline_sentiment"], row["negativereason"]), axis=1)
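
As an alternative to the row-wise `apply`, the same categories can be produced with a lookup table and `Series.map`. This is only a sketch: the dictionary `REASON_TO_CATEGORY` and the toy frame are hypothetical, with the reason strings copied from the branches of `define_complaints`.

```python
import pandas as pd

# Reason -> category lookup, copied from the branches of define_complaints.
REASON_TO_CATEGORY = {
    "Cancelled Flight": "Delay or Cancellation",
    "Late Flight": "Delay or Cancellation",
    "Lost Luggage": "Lost or Damaged Luggage",
    "Damaged Luggage": "Lost or Damaged Luggage",
    "Customer Service Issue": "Customer Service",
    "Flight Attendant Complaints": "Customer Service",
    "Flight Booking Problems": "Customer Service",
    "longlines": "Customer Service",
    "Bad Flight": "Unknown",
    "Can't Tell": "Unknown",
}

# Hypothetical rows standing in for `cf`.
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative", "negative", "negative"],
    "negativereason": [None, "Late Flight", "longlines", None],
})

# Map the reason first, then overwrite non-negative rows with "No Complaint".
toy["complaint"] = toy["negativereason"].map(REASON_TO_CATEGORY)
not_negative = toy["airline_sentiment"].isin(["positive", "neutral"])
toy.loc[not_negative, "complaint"] = "No Complaint"
print(toy["complaint"].tolist())
```

One difference from the function above: a negative judgement with no recognised reason ends up as NaN here rather than `None`, which may or may not matter downstream.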