### Clean & Preprocess CrowdFlower Data Prior to Model Training
A Super Handy CrowdFlower Glossary of Terms can be found [here](https://success.crowdflower.com/hc/en-us/articles/202703305-Glossary-of-Terms)!

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

#### Read-In Jobs-Level Data (from CrowdFlower's *Data for Everyone* [library](https://www.crowdflower.com/data-for-everyone/))

In [2]:
cf = pd.read_csv("http://cdn2.hubspot.net/hub/346378/file-2612489700-csv/DFE_CSVs/Airline-Full-Non-Ag-DFE-Sentiment.csv")
print cf.columns
cf.head(2)

Index([u'_unit_id', u'_created_at', u'_golden', u'_id', u'_missed',
       u'_started_at', u'_tainted', u'_channel', u'_trust', u'_worker_id',
       u'_country', u'_region', u'_city', u'_ip', u'airline_sentiment',
       u'negativereason', u'airline', u'airline_sentiment_gold', u'name',
       u'negativereason_gold', u'retweet_count', u'text', u'tweet_coord',
       u'tweet_created', u'tweet_id', u'tweet_location', u'user_timezone'],
      dtype='object')


,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,2/25/2015 04:52:40,False,1575073003,,2/25/2015 04:49:12,False,elite,0.8108,31110645,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)
1,681448150,2/25/2015 05:22:10,False,1575093916,,2/25/2015 05:19:59,False,prodege,0.8919,1908948,...,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,570306133677760513,,Eastern Time (US & Canada)


#### Split Test (i.e. Golden Tweets) out from the non-test tweets.
(We already know the "correct" answers for the Test tweets, so we can process those separately.)

In [16]:
gold = cf[cf._golden == True].copy()
nogold = cf[cf._golden == False].copy()

print gold.shape, nogold.shape

(11997, 27) (43786, 27)
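
A quick sanity check on a split like this is that the two pieces add back up to the full table. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the CrowdFlower file) and assumes `_golden` holds plain booleans.

```python
import pandas as pd

# Toy stand-in for the judgments table; all values are made up.
toy = pd.DataFrame({
    "_unit_id": [1, 1, 2, 3],
    "_golden": [False, False, True, False],
    "airline_sentiment": ["negative", "neutral", "positive", "negative"],
})

# Boolean mask for the gold ("test question") rows.
is_gold = toy["_golden"].astype(bool)
toy_gold = toy[is_gold].copy()
toy_nogold = toy[~is_gold].copy()

# Every row should land in exactly one of the two frames.
assert len(toy_gold) + len(toy_nogold) == len(toy)
print(toy_gold.shape)
print(toy_nogold.shape)
```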


#### Process "Test" Tweets - Use "Correct" Sentiment & Topics
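
This step is left empty in the notebook. One possible approach, sketched here on made-up data, is to keep a single row per gold tweet and take its label from `airline_sentiment_gold`, pooling positive and neutral the same way as for the non-test tweets. The frame `toy_gold`, the `label_map` dict and the `sentiment` column are hypothetical names, and the sketch assumes each gold tweet carries exactly one accepted label, which may not hold in the real file.

```python
import pandas as pd

# Made-up gold judgments: several workers per tweet, one known answer per tweet.
toy_gold = pd.DataFrame({
    "tweet_id": [10, 10, 11, 11],
    "text": ["@air late again", "@air late again", "@air thanks!", "@air thanks!"],
    "airline_sentiment": ["negative", "neutral", "positive", "positive"],
    "airline_sentiment_gold": ["negative", "negative", "positive", "positive"],
})

# Same pooling as for non-test tweets: positive and neutral -> 1, negative -> -1.
label_map = {"positive": 1, "neutral": 1, "negative": -1}

# The gold answer is repeated on every judgment row, so one row per tweet is enough.
gold_labels = toy_gold.drop_duplicates("tweet_id")[
    ["tweet_id", "text", "airline_sentiment_gold"]
].copy()
gold_labels["sentiment"] = gold_labels["airline_sentiment_gold"].map(label_map)

print(gold_labels)
```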

#### Process Non-Test Tweets

##### Clean Airline Sentiment (Positive/Neutral or Negative) Label

In [17]:
nogold.airline_sentiment.value_counts(dropna=False)

negative    26919
neutral      9742
positive     7125
Name: airline_sentiment, dtype: int64

In [18]:
##Convert text labels to numeric, pooling positive and neutral into one class
nogold["airline_sentiment"] = nogold["airline_sentiment"].replace(["positive", "neutral", "negative"], [1, 1, -1])
nogold.airline_sentiment.value_counts()

-1    26919
 1    16867
Name: airline_sentiment, dtype: int64

In [19]:
##Aggregate to the tweet level, summing worker trust per label.
##A tweet keeps one row per distinct label, so tweets that taskers flagged differently appear twice.
t_sentiment = nogold.groupby(by=["tweet_id","text","airline_sentiment"], as_index=False)[["_trust"]].sum()

t_sentiment.head(10)

,tweet_id,text,airline_sentiment,_trust
0,567588278875213824,@JetBlue's new CEO seeks the right balance to ...,1,2.7027
1,567590027375702016,@JetBlue is REALLY getting on my nerves !! 😡...,-1,2.7494
2,567591480085463040,@united yes. We waited in line for almost an h...,-1,2.5777
3,567592368451248130,@united the we got into the gate at IAH on tim...,-1,2.5215
4,567594449874587648,@SouthwestAir its cool that my bags take a bit...,-1,2.5058
5,567594579310825473,@united and don't hope for me having a nicer f...,-1,2.6997
6,567595670463205376,@united I like delays less than you because I'...,-1,2.5135
7,567614049425555457,"@united, link to current status of flights/air...",-1,2.5157
8,567617081336950784,@SouthwestAir you guys there? Are we on hour 2...,-1,2.6075
9,567617486703853568,@united I tried 2 DM it would not go thru... n...,-1,2.2953


In [7]:
print t_sentiment.duplicated("tweet_id").value_counts() ##3,115 of 14,455 unique tweets (~22%) were marked as both positive/neutral and negative

False    14455
True      3115
dtype: int64
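
To make the aggregation and the duplicate check concrete, here is a minimal sketch on made-up judgments (`judgments`, `per_label` and `ambiguous_ids` are illustrative names). Because the labels were pooled into two classes, a tweet has at most two rows after grouping, so `duplicated("tweet_id")` marks exactly one row per ambiguous tweet.

```python
import pandas as pd

# Made-up judgments: tweet 1 is unanimous, tweet 2 received both labels.
judgments = pd.DataFrame({
    "tweet_id": [1, 1, 1, 2, 2, 2],
    "text": ["a", "a", "a", "b", "b", "b"],
    "airline_sentiment": [-1, -1, -1, -1, 1, 1],
    "_trust": [0.9, 0.8, 0.7, 0.6, 0.9, 0.8],
})

# One row per (tweet, label), with the workers' trust scores summed.
per_label = judgments.groupby(
    ["tweet_id", "text", "airline_sentiment"], as_index=False
)[["_trust"]].sum()

# Tweets that still have more than one row carry conflicting labels.
n_labels = per_label.groupby("tweet_id")["airline_sentiment"].nunique()
ambiguous_ids = n_labels[n_labels > 1].index.tolist()

print(per_label)
print(ambiguous_ids)
```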


#### Output Sample "Ambiguous" Tweets for Handcoding

In [29]:
t_sample = t_sentiment[t_sentiment.duplicated("tweet_id")].sample(frac=0.15, replace=False, random_state=4444)
t_sample[["tweet_id","text"]].to_csv("ambigous_sentiment_hand_coded.csv", index=False)

In [9]:
##Calculate tweet "score": pivot to one row per tweet, with summed trust for each sentiment label as columns
t_sentiment = t_sentiment.pivot_table(index=["tweet_id","text"], columns="airline_sentiment", values="_trust")
t_sentiment.head(5)

tweet_id,text,-1,1
567588278875213824,@JetBlue's new CEO seeks the right balance to please passengers and Wall ... - Greenfield Daily Reporter http://t.co/LM3opxkxch,,2.7027
567590027375702016,@JetBlue is REALLY getting on my nerves !! 😡😡 #nothappy,2.7494,
567591480085463040,@united yes. We waited in line for almost an hour to do so. Some passengers just left not wanting to wait past 1am.,2.5777,
567592368451248130,"@united the we got into the gate at IAH on time and have given our seats and closed the flight. If you know people is arriving, have to wait",2.5215,
567594449874587648,"@SouthwestAir its cool that my bags take a bit longer, dont give me baggage blue balls-turn the carousel on, tell me it's coming, then not.",2.5058,


In [13]:
##Flag tweets that received both a negative (-1) and a positive/neutral (1) label
t_sentiment["flag"] = 0
t_sentiment.loc[(t_sentiment[-1].notnull()) & (t_sentiment[1].notnull()), "flag"] = 1

In [14]:
t_sentiment["flag"].value_counts()

0    11340
1     3115
Name: flag, dtype: int64
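
The notebook stops at the flag. One way the flagged tweets could be resolved automatically is a net trust score: summed trust for label 1 minus summed trust for label -1, with the sign giving the label. This is a hypothetical rule sketched on made-up numbers, not the notebook's method; `net_trust` and `label` are illustrative column names, and ties are arbitrarily sent to -1.

```python
import pandas as pd

# Made-up per-label trust sums: tweet 1 is unanimous, tweet 2 is ambiguous.
per_label = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "text": ["a", "b", "b"],
    "airline_sentiment": [-1, -1, 1],
    "_trust": [2.4, 0.6, 1.7],
})

# One row per tweet, one column per sentiment label.
wide = per_label.pivot_table(
    index=["tweet_id", "text"], columns="airline_sentiment", values="_trust"
)

# Pull out both label columns before adding new ones.
neg = wide[-1]
pos = wide[1]

# Flag tweets that received both labels.
wide["flag"] = (neg.notnull() & pos.notnull()).astype(int)

# Hypothetical score: trust behind label 1 minus trust behind label -1.
wide["net_trust"] = pos.fillna(0) - neg.fillna(0)
wide["label"] = wide["net_trust"].apply(lambda s: 1 if s > 0 else -1)

print(wide)
```

A rule like this could be checked against the hand-coded sample before being trusted for the remaining flagged tweets.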