## Exploratory Analysis

Exploratory analysis of the SemEval 2017 dataset. The files are already on disk, so nothing is downloaded here.

In [28]:
import os
from glob import glob
import pandas as pd

# Gold-label files for Subtask A (the pattern has no "**", so a plain glob is enough)
files = glob("../data/SemEval2017/GOLD/Subtask_A/twitter*.txt")
files

['../data/SemEval2017/GOLD/Subtask_A/twitter-2016test-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2013dev-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2015test-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2014test-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2013train-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2016dev-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2013test-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2016devtest-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2016train-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2014sarcasm-A.txt',
 '../data/SemEval2017/GOLD/Subtask_A/twitter-2015train-A.txt']

In [3]:
import pandas as pd

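def read_table(path):
    """
    Read a SemEval gold file (tab-separated, no header row) and return
    a dataframe with the columns id, label and text.
    """
    df = pd.read_table(path, header=None)
    # If a fourth column is present, drop it so every file has the same three columns
    if len(df.columns) > 3:
        del df[3]
    df.columns = ["id", "label", "text"]
    #df.set_index("id", inplace=True)
    return df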

pd.options.display.max_colwidth = 200
pd.options.display.max_rows = 100

read_table(files[0])



index,id,label,text
0,619950566786113536,neutral,"Picturehouse's, Pink Floyd's, 'Roger Waters: The Walll - opening 29 Sept is now making waves. Watch the trailer on Rolling Stone - look..."
1,619969366986235905,neutral,Order Go Set a Watchman in store or through our website before Tuesday and get it half price! #GSAW @GSAWatchmanBook https://t.co/KET6EGD1an
2,619971047195045888,negative,"If these runway renovations at the airport prevent me from seeing Taylor Swift on Monday, Bad Blood will have a new meaning."
3,619974445185302528,neutral,"If you could ask an onstage interview question at Miss USA tomorrow, what would it be?"
4,619987808317407232,positive,A portion of book sales from our Harper Lee/Go Set a Watchman release party on Mon. 7/13 will support @CAP_Tulsa and the great work they do.
...,...,...,...
20627,681877834982232064,neutral,"@ShaquilleHoNeal from what I think you're asking, in no order. Future, Drake, Thug, Cole, Kendrick and Tiller a close 6th"
20628,681879579129200640,positive,"Iran ranks 1st in liver surgeries, Allah bless the country."
20629,681883903259357184,neutral,"Hours before he arrived in Saudi Arabia on Tuesday, Turkish President Recep Tayyip Erdogan accused Syria's president of ""mercilessly""..."
20630,681904976860327936,negative,@VanityFair Alex Kim Kardashian worth how to love Kim Kardashian she's so bad Sun Conure to
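
Tweets often contain literal double quotes, and `pd.read_table` treats `"` as a quote character by default, so an unbalanced quote could in principle interfere with how neighbouring lines are split. The sketch below is a possible safeguard, not part of the original notebook: it reads a small in-memory sample with `quoting=csv.QUOTE_NONE`, so quotes are kept as ordinary characters. The `sample` text and the `read_table_raw` name are hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same layout as the gold files:
# id <TAB> label <TAB> text, with no header row.
sample = (
    '1\tneutral\t"An unbalanced quote at the start\n'
    '2\tpositive\tA plain tweet\n'
    '3\tnegative\tAnother "quoted" tweet\n'
)

def read_table_raw(source):
    """Like read_table above, but treat double quotes as ordinary characters."""
    df = pd.read_table(source, header=None, quoting=csv.QUOTE_NONE)
    df = df.iloc[:, :3]  # keep only the first three columns
    df.columns = ["id", "label", "text"]
    return df

raw_df = read_table_raw(io.StringIO(sample))
print(raw_df.shape)
```

Comparing the row count of such a reader with the row count from `read_table` on the same file is one way to check whether quote handling changes anything for a given file.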


In [30]:
test_files = [
    '../data/SemEval2017/GOLD/Subtask_A/twitter-2016test-A.txt',
]
# "dev" also matches the "devtest" file
dev_files = [f for f in files if "dev" in f]

train_files = [f for f in files if f not in test_files and f not in dev_files]

print("Train files : ", train_files)

print("Dev files   : ", dev_files)

print("Test files  : ", test_files)


Train files :  ['../data/SemEval2017/GOLD/Subtask_A/twitter-2015test-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2014test-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2013train-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2013test-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2016train-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2014sarcasm-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2015train-A.txt']
Dev files   :  ['../data/SemEval2017/GOLD/Subtask_A/twitter-2013dev-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2016dev-A.txt', '../data/SemEval2017/GOLD/Subtask_A/twitter-2016devtest-A.txt']
Test files  :  ['../data/SemEval2017/GOLD/Subtask_A/twitter-2016test-A.txt']
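
The same split can be written as a single function that looks only at the file name, which avoids accidental matches if a parent directory happens to contain "dev". This is a minimal sketch; `assign_slice` and its `test_names` parameter are hypothetical names, not part of the original notebook.

```python
from pathlib import PurePosixPath

def assign_slice(path, test_names=("twitter-2016test-A.txt",)):
    """Return "test", "dev" or "train" for a gold file, based on its base name."""
    name = PurePosixPath(path).name
    if name in test_names:
        return "test"
    if "dev" in name:  # covers both "dev" and "devtest" files
        return "dev"
    return "train"  # everything else, including older test sets

base = "../data/SemEval2017/GOLD/Subtask_A/"
for fname in ["twitter-2016test-A.txt", "twitter-2016devtest-A.txt", "twitter-2015test-A.txt"]:
    print(fname, "->", assign_slice(base + fname))
```

Because every file gets exactly one slice, the three groups cannot overlap.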


In [36]:

train_df = pd.concat([read_table(f) for f in train_files])
train_df["slice"] = "train"
dev_df = pd.concat([read_table(f) for f in dev_files])
dev_df["slice"] = "dev"
test_df = pd.concat([read_table(f) for f in test_files])
test_df["slice"] = "test"

df = pd.concat([train_df, dev_df, test_df])
print("Len train : ", train_df.shape)
print("Len dev   : ", dev_df.shape)
print("Len test  : ", test_df.shape)


Len train :  (23880, 4)
Len dev   :  (5620, 4)
Len test  :  (20632, 4)


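Some tweet ids appear more than once, so duplicates are dropped by `id`. `drop_duplicates` keeps the first occurrence, and `df` is ordered train, dev, test.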

In [37]:
df = df.drop_duplicates(subset="id")

train_df = df[df["slice"] == "train"]
dev_df = df[df["slice"] == "dev"]
test_df = df[df["slice"] == "test"]
print("Len train : ", train_df.shape)
print("Len dev   : ", dev_df.shape)
print("Len test  : ", test_df.shape)

Len train :  (23310, 4)
Len dev   :  (5577, 4)
Len test  :  (20481, 4)
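
Since only the first occurrence of an id survives, an id that appears in more than one slice stays in whichever slice comes first. The sketch below shows, on a hypothetical toy frame (`toy` and its values are made up for illustration), how one might list the ids shared across slices and see which slice keeps them.

```python
import pandas as pd

# Hypothetical toy data: id 2 is repeated inside train,
# id 3 appears in both dev and test.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 3, 4],
        "slice": ["train", "train", "train", "dev", "test", "test"],
    }
)

# Ids that occur in more than one slice
slices_per_id = toy.groupby("id")["slice"].nunique()
cross_slice_ids = slices_per_id[slices_per_id > 1].index.tolist()

# drop_duplicates keeps the first row for each id
deduped = toy.drop_duplicates("id")
kept_slice = deduped.set_index("id")["slice"].to_dict()

print(cross_slice_ids)
print(kept_slice)
```

Running the same `groupby` on the real `df` before dropping duplicates would show whether the reduction in the test set comes from repeats within the test file or from ids shared with train and dev.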


In [33]:
train_df[train_df["label"] == "positive"].sample(10)

index,id,label,text,slice
7030,115931367044427776,positive,I was thrilled when Melissa McCarthy won an Emmy last night. She called me today to talk about her big win. http://t.co/OKM3X6DB,train
2964,258819156424683520,positive,@Sarah_zayed91 I'll come for the faculty exhibition on Tuesday.. and I'm like most of the time in uni after 3:30 PM :) in CIT building,train
4714,257544455576485888,positive,@KWAMEDIDIT: FRIDAY OCT 19th!!!! THE ALUMNI will be rocking Philly!!!!! FREE SHOW @ the Liacouras Center 7pm CLASSIC HIP HOP!,train
5262,641650142827950080,positive,@PittsfordDad There is a one day delay due to Labor day for the 14534 area. If your normal service day is Wednesday service will (1/2),train
4,255713054224949249,positive,Hello from the Foundation Trekkers! We're up in chilly Haltwhistle getting ready to trek Hadrian's wall tomorrow :-) http://t.co/MmRnhLAL,train
5233,641584836222808064,positive,"Hello everyone! I had a fabulous time in Hermitage PA, on Labor Day! As you may or may not know, I was in... http://t.co/UD0vLq4mEM",train
2345,158551378271277058,positive,"Watched a movie yesterday #The70's on #OVTV and was pleasantly surprised to see Michael Easton (John, #OLTL) in it.",train
2183,262664798473445377,positive,With the election right around the corner\u002c RRFP is pleased to announce Super Tuesday! Get an additional 5%... http://t.co/aKR98s83,train
4907,263292830171156480,positive,good eats on deck 2day. Volare 4 lunch. Maya del Sol 4 dinner. i was smart 2 wear black. i may look like i\u2019m w/ child by the end of the day,train
640,641631147370352640,positive,The Apple event starts 10 minutes after I get out of Chemistry. I may be sprinting back to my dorm to catch it in time.,train
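
Some of the sampled tweets contain literal escape sequences such as `\u002c` (comma) and `\u2019` (right single quotation mark). A minimal sketch of how these could be decoded before further processing, assuming they are always of the form `\u` followed by four hex digits; `decode_unicode_escapes` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_unicode_escapes(text):
    """Replace literal \\uXXXX sequences with the character they encode."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

example = r"With the election right around the corner\u002c RRFP is pleased"
decoded = decode_unicode_escapes(example)
print(decoded)
```

It could be applied to a whole column with `df["text"].map(decode_unicode_escapes)`.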


In [34]:
train_df[train_df["label"] == "neutral"].sample(10)

index,id,label,text,slice
3738,639529925188452352,neutral,NEW: IBM executive J. Bruce Harreld named 21st president of @uiowa over faculty concerns about his qualifications. http://t.co/ztHFnLgdln,train
4812,641411751846674432,neutral,Lake James was dope on Sunday even with Justin,train
8893,259271052906074112,neutral,LOCAL SPORTS: Special to the News-Sun: PLANT CITY - The Highlands Youth Football and Cheer Organization (HYFC) t... http://t.co/Kgy7m7vH,train
5763,100032433461805057,neutral,RT @VomitsHerMinddx: Britney: Lady Gaga's here tonight. Me: WHAT? *Watches Gaga the rest of the night * Me: Who's concert was that ? I ...,train
3130,263703333628411906,neutral,Chelsea need to beat Ajax to keep up there battle with Ajax for 2nd place #NextGenSeries,train
7650,264251309492940801,neutral,My roommate going to Eastern tomorrow.,train
3737,203361721329516544,neutral,Postal plants to shrink\u002c 28\u002c000 jobs at stake: The U.S. Postal Service announced on Thursday it\u2019s moving forward... http://t.co/1CWNuysA,train
3428,641371757161644032,neutral,"Mulcair, Harper and Trudeau still in close race after campaign's 1st phase. Leaders considering time-sharing the country.",train
1166,641569184388935680,neutral,House Of Cards author says my treatment raises 'disturbing questions about the inner workings' of the BBC. http://t.co/fAV54NKGFs,train
6212,100502269635731456,neutral,RT @SkinnyCuh: They should make a new Friday and have Kevin Hart Katt Williams And Mike Epps in it.,train


In [35]:
train_df[train_df["label"] == "neutral"].sample(10)

index,id,label,text,slice
4592,262707915985661952,neutral,Alex\u002c the female sideline commentator for Sunday night football...sounds like Kermit the Frog. How did they let you on the air?,train
525,622457199399317504,neutral,Contrasts 20th-21st cent. Angela Merkel understands Quantum Physics. Does she have political depth for long term German survival?,train
4562,262977992077221889,neutral,"\""""Well that\u2019s it\u002c the very last one. That may stop you.\"""" - The Lorax",train
3563,261788570719752193,neutral,@PatCunningham16 u just hold out to Ash Wednesday #letthemknowthecraic,train
5421,261630924570103808,neutral,@demi_nicole12\u2019s 16th birthday at Garner. I gave her a face full of chocolate for her birthday. She gave it http://t.co/sdHu1y1E,train
937,638473308988571649,neutral,"@_suprene @sinTripas_ if he can play then I hope he plays against barca, on the 20th",train
3178,640814559167541248,neutral,I just Googled something and the 10th result was my Google+ post from almost 3 years ago~,train
951,225300174812102656,neutral,Lechlade's view on the upcoming clash with Marlborough CC on Saturday http://t.co/5jKiTOjN,train
4068,264155923780612096,neutral,@ArinRodriguez I texted you on Tuesday I was Trina see if you wanted to match,train
579,638971504277983232,neutral,"There's no love lost btw them but Angela Merkel will present the biography of her predecessor Gerhard Schroeder on Sept 22, @Bild reports",train
