## 원본 데이터셋을 이용해서 수도라벨링을 시도해보자
- https://www.kaggle.com/maxjon/complete-tweet-sentiment-extraction-data

1. 분류 모델을 학습해서(81%) -> 감정이 태깅 되지 않은 약 8000건의 unseen 데이터에 태깅
    - pseudo_senti.csv (senti_cls/pseudo_labeling.py)
2. test.csv에 대해서 selected_text를 태깅
    - LB 0.715 나오는 모델로 태깅
    - pseudo_selected.csv (senti_ext/pseudo_labeling.py)
3. 혹은 test.csv + pseudo_senti.csv에 대해서 selected_text를 태깅
    - LB 0.715 나오는 모델로 태깅
    - pseudo_selected.csv (senti_ext/pseudo_labeling.py)

In [193]:
INPUT_BASE = '/home/jeinsong/senti_ext/dataset'

In [172]:
import os
import re
import time
import random
import datetime

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from textblob import TextBlob

sns.set()

In [190]:
train_data = pd.read_csv(os.path.join('/DATA/image-search/kgg/input', 'train.csv'))
train_data['text'] = train_data.apply(lambda row: str(row.text).strip(), axis=1)
train_data['selected_text'] = train_data.apply(lambda row: str(row.selected_text).strip(), axis=1)

In [231]:
train_data

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on th...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband lo...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has m...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - yo...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive


In [191]:
test_data = pd.read_csv(os.path.join('/DATA/image-search/kgg/input', 'test.csv'))
test_data['text'] = test_data.apply(lambda row: str(row.text).strip(), axis=1)

In [232]:
test_data

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely --...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive
...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative
3530,416863ce47,All alone in this old house again. Thanks for...,positive
3531,6332da480c,I know what you mean. My little dog is sinking...,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive


In [363]:
test_data.sentiment.value_counts()

neutral     1430
positive    1103
negative    1001
Name: sentiment, dtype: int64

## 원본 데이터

In [250]:
orig_data = pd.read_csv(os.path.join(INPUT_BASE, 'tweet_dataset.csv'))
orig_data['text'] = orig_data.apply(lambda row: str(row.text).strip(), axis=1)
orig_data['selected_text'] = orig_data.apply(lambda row: str(row.selected_text).strip(), axis=1)

In [333]:
len(orig_data.text.unique())

39502

In [251]:
orig_data

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
0,1956967341,empty,xoshayzers,i know i was listenin to bad habit earlier an...,@tiffanylue i know i was listenin to bad habi...,p1000000000,,
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh...waitin o...,c811396dc2,negative,headache
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony...gloomy friday...,9063631ab1,negative,gloomy
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!,2a815f151d,positive,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,We want to trade with someone who has Houston ...,@dannycastillo We want to trade with someone w...,82565a56d3,neutral,We want to trade with someone who has Houston ...
...,...,...,...,...,...,...,...,...
39995,1753918954,neutral,showMe_Heaven,,@JohnLloydTaylor,p1000008985,neutral,
39996,1753919001,love,drapeaux,Happy Mothers Day All my love,Happy Mothers Day All my love,0b62ea4f2d,positive,Happy
39997,1753919005,love,JenniRox,Happy Mother`s Day to all the mommies out ther...,Happy Mother's Day to all the mommies out ther...,1adaa3519d,positive,Happy Mother`s Day
39998,1753919043,happiness,ipdaman1,WASSUP BEAUTIFUL!!! FOLLOW ME!! PEEP OUT MY N...,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...,d63253be9a,neutral,WASSUP BEAUTIFUL!!! FOLLOW ME!! PEEP OUT MY N...


## 원본 데이터중 학습/테스트 제외하고 한번도 안본 (감정 라벨링 안된) 애들을 찾자


In [252]:
orig_data_ = orig_data[orig_data.selected_text == 'nan']
orig_data_ = orig_data_[orig_data_.text != 'nan']

In [253]:
orig_data_

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
0,1956967341,empty,xoshayzers,i know i was listenin to bad habit earlier an...,@tiffanylue i know i was listenin to bad habi...,p1000000000,,
6,1956968487,sadness,ShansBee,"I should be sleep, but im not! thinking about ...","I should be sleep, but im not! thinking about ...",p1000000001,,
7,1956968636,worry,mcsleazy,Hmmm. http://www.djhero.com/ is down,Hmmm. http://www.djhero.com/ is down,2dfbe0b7fb,negative,
9,1956969172,sadness,Ingenue_Em,I`m sorry at least it`s Friday?,@kelcouch I'm sorry at least it's Friday?,6d846d7d50,negative,
15,1956971077,sadness,Sim_34,The storm is here and the electricity is gone,The storm is here and the electricity is gone,p1000000002,,
...,...,...,...,...,...,...,...,...
39982,1753905020,neutral,2str20lt,"hey guys, if you have something to ask, just a...","hey guys, if you have something to ask, just a...",p1000008981,,
39983,1753905073,neutral,ABZQuine,"not really just leaving flat now, on the looko...","@Astronick not really just leaving flat now, o...",p1000008982,,
39990,1753918829,neutral,kdpaine,I think the lesson of the day is not to have l...,@shonali I think the lesson of the day is not ...,p1000008983,,
39993,1753918892,neutral,bushidosan,"haha, yeah. Twitter has many uses. For me it`s...","@sendsome2me haha, yeah. Twitter has many uses...",p1000008984,,


## 한번도 안본 데이터는 이렇게 얻을 수 있고, unseen.csv로 저장
- 여기에 감정 수도 라벨링!
- 이거 진짜로 한번도 안본데이터인지 확인 필요할거같은게 test랑 겹치는거같음

In [384]:
unseen_data = orig_data_[orig_data_.new_sentiment.isnull()].reset_index(drop=True)

In [385]:
unseen_data

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
0,1956967341,empty,xoshayzers,i know i was listenin to bad habit earlier an...,@tiffanylue i know i was listenin to bad habi...,p1000000000,,
1,1956968487,sadness,ShansBee,"I should be sleep, but im not! thinking about ...","I should be sleep, but im not! thinking about ...",p1000000001,,
2,1956971077,sadness,Sim_34,The storm is here and the electricity is gone,The storm is here and the electricity is gone,p1000000002,,
3,1956972116,neutral,jansc,No Topic Maps talks at the Balisage Markup Con...,No Topic Maps talks at the Balisage Markup Con...,p1000000003,,
4,1956972359,sadness,xamountoftruth,so tired and i think i`m definitely going to g...,so tired and i think i'm definitely going to g...,p1000000004,,
...,...,...,...,...,...,...,...,...
8600,1753904526,enthusiasm,Freakonomy,Hey Sarah! Hws u? Hope u remember me,@SarahSaner Hey Sarah! Hws u? Hope u remember me,p1000008980,,
8601,1753905020,neutral,2str20lt,"hey guys, if you have something to ask, just a...","hey guys, if you have something to ask, just a...",p1000008981,,
8602,1753905073,neutral,ABZQuine,"not really just leaving flat now, on the looko...","@Astronick not really just leaving flat now, o...",p1000008982,,
8603,1753918829,neutral,kdpaine,I think the lesson of the day is not to have l...,@shonali I think the lesson of the day is not ...,p1000008983,,


In [386]:
del_idxs = []
for idx, text1 in enumerate(unseen_data.text.values):
    for text2 in test_data.text.values:
        if set(text1.split()) == set(text2.split()):
            print(text1)
            print(text2)
            del_idxs.append(idx)

welcome
welcome
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
hey
hey hey
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!
Thank you!


In [387]:
del_idxs

[4662, 4729, 4990, 5272, 5706, 6386, 7241, 7614, 8037, 8161, 8253]

In [388]:
unseen_data = unseen_data.drop(del_idxs).reset_index(drop=True)

In [390]:
unseen_data

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
0,1956967341,empty,xoshayzers,i know i was listenin to bad habit earlier an...,@tiffanylue i know i was listenin to bad habi...,p1000000000,,
1,1956968487,sadness,ShansBee,"I should be sleep, but im not! thinking about ...","I should be sleep, but im not! thinking about ...",p1000000001,,
2,1956971077,sadness,Sim_34,The storm is here and the electricity is gone,The storm is here and the electricity is gone,p1000000002,,
3,1956972116,neutral,jansc,No Topic Maps talks at the Balisage Markup Con...,No Topic Maps talks at the Balisage Markup Con...,p1000000003,,
4,1956972359,sadness,xamountoftruth,so tired and i think i`m definitely going to g...,so tired and i think i'm definitely going to g...,p1000000004,,
...,...,...,...,...,...,...,...,...
8589,1753904526,enthusiasm,Freakonomy,Hey Sarah! Hws u? Hope u remember me,@SarahSaner Hey Sarah! Hws u? Hope u remember me,p1000008980,,
8590,1753905020,neutral,2str20lt,"hey guys, if you have something to ask, just a...","hey guys, if you have something to ask, just a...",p1000008981,,
8591,1753905073,neutral,ABZQuine,"not really just leaving flat now, on the looko...","@Astronick not really just leaving flat now, o...",p1000008982,,
8592,1753918829,neutral,kdpaine,I think the lesson of the day is not to have l...,@shonali I think the lesson of the day is not ...,p1000008983,,


In [389]:
unseen_data.to_csv('unseen_no_dup.csv', index=False)

## Pseudo Labeling 결과
- unseen.csv에 대해 senti_cls에서 감정 라벨링 -> pseudo_senti.csv 생성
- pseudo_senti.csv에 대해서 senti_ext에서 selected_text 라벨링

In [456]:
pseudo = pd.read_csv(os.path.join('dataset', 'pseudo_selected.csv'))
pseudo['text'] = pseudo.apply(lambda row: str(row.text).strip(), axis=1)

In [457]:
pseudo

Unnamed: 0,textID,text,sentiment,selected_text
0,1695882869,But Chevy & Chrysler may soon be owned by the ...,neutral,But Chevy & Chrysler may soon be owned by the...
1,1751851500,aww that`s sweet! i made a home made card and ...,positive,aww that`s sweet!
2,56f5217a16,happy birthday ness!!,positive,happy birthday ness!!
3,1752917807,Yes ma`am lol. I`ll switch to my inspirational...,neutral,Yes ma`am lol. I`ll switch to my inspirationa...
4,1694792115,music-habits - I`ll join your study,neutral,music-habits - I`ll join your study
...,...,...,...,...
12018,1753852785,woah in the uk it isn`t mothers day werid! I`v...,positive,awesome
12019,1961890740,is there a program that tells you when someone...,negative,Lost
12020,1962264095,*eekkk* back to work,neutral,*eekkk* back to work
12021,136b1926c9,Cool. That`d be fantastic!,positive,Cool. That`d be fantastic!


In [458]:
pseudo['text'] = pseudo.apply(lambda row: str(row.text).strip(), axis=1)

## pseudo labled data가 학습/테스트셋에 있는지 체크
해야할일이 뭐냐면 
- 테스트 데이터에서 아래처럼 체크한 뒤 트레인 데이터에 포함되는 내용 지워야하고
- 안본 데이터에 대해서 감정을 태깅 했으니까
- 그렇게 많지 않긴한데...
    - 테스트 데이터랑 학습 데이터는 거의 안겹치고
    - 학습 데이터랑 (테스트+언씬) 데이터는 겹치는게 쫌 있음 (이게 사실 바로 전걸 포함함)

## 학습셋에 존재하는 데이터 필터링

In [459]:
del_idxs = []
for idx, text1 in enumerate(pseudo.text.values):
    if idx%500 == 0:
        print(idx)
    for text2 in train_data.text.values:
        if set(text1.split()) == set(text2.split()):
            print(text1)
            print(text2)
            del_idxs.append(idx)

0
thanx
thanx
thanx
thanx
Good Morning!
Good Morning!
500
1000
1500
happy mothers day
happy mothers day
2000
Happy Star Wars Day!!!
Happy Star Wars Day!!!
#Java is not working - hmph!  Can`t upload photos to #Facebook.
#Java is not working - hmph!  Can`t upload photos to #Facebook.
Where`d the songs go on the site, I want 'Do You' on this computer too
Where`d the songs go on the site, I want 'Do You' on this computer too
2500
check out review for the movie Fighting - http://bit.ly/Fle9j Hilarious!! leave this guy a comment!
check out review for the movie Fighting - http://bit.ly/Fle9j  Hilarious!! leave this guy a comment!
3000
my cousin is in jail for shoplifting and drugs she is 16! im upset please help me feel better
my cousin is in jail for shoplifting and drugs she is 16! im upset please help me feel better
3500
good morning!
good morning!
i think my niece got me sickee  lame.
i think my niece got me sickee  lame.
4000
lol exams i didn`t go to mcast or other school i finished form

In [466]:
len(del_idxs)

28

In [467]:
pseudo = pseudo.drop(del_idxs).reset_index(drop=True)

In [468]:
pseudo

Unnamed: 0,textID,text,sentiment,selected_text
0,1695882869,But Chevy & Chrysler may soon be owned by the ...,neutral,But Chevy & Chrysler may soon be owned by the...
1,1751851500,aww that`s sweet! i made a home made card and ...,positive,aww that`s sweet!
2,56f5217a16,happy birthday ness!!,positive,happy birthday ness!!
3,1752917807,Yes ma`am lol. I`ll switch to my inspirational...,neutral,Yes ma`am lol. I`ll switch to my inspirationa...
4,1694792115,music-habits - I`ll join your study,neutral,music-habits - I`ll join your study
...,...,...,...,...
11968,1753852785,woah in the uk it isn`t mothers day werid! I`v...,positive,awesome
11969,1961890740,is there a program that tells you when someone...,negative,Lost
11970,1962264095,*eekkk* back to work,neutral,*eekkk* back to work
11971,136b1926c9,Cool. That`d be fantastic!,positive,Cool. That`d be fantastic!


In [472]:
pseudo = pseudo.drop_duplicates(['text'], keep='first').reset_index(drop=True)

In [473]:
pseudo.to_csv('./dataset/pseudo_selected_no_dup.csv')

## 문제가 되는 문장 (영어가 아닌 문장, 삭제해버리기 -> 코드에 반영 완료)
- 혹시 모르니 내일 빈칸 개수가 안맞는지 여부 체크해보기
    - 아마 데이터셋.py 에서 제대로 처리할듯?

In [464]:
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

In [465]:
a = 'I did one last night  it will be available in dvd, blue ray, and digital download via the iTunes store by the end of the week'
' '.join(a.split())

'I did one last night it will be available in dvd, blue ray, and digital download via the iTunes store by the end of the week'

In [317]:
a = ' Photo: jdperry: Seriously, theseï¿½pictures make my day. Hahaha. Iï¿½always just go aroundï¿½saying ï¿½OMG did you...'
isEnglish(a)

False

In [318]:
ids = []
for idx, text in enumerate(pseudo.text):
    row = pseudo.iloc[idx]
    if not isEnglish(text):
        print(idx, text, row.selected_text)
        ids.append(idx)
    #else:
    #    if row.selected_text.strip() not in text.strip():
    #        print("@@", text.strip(), '\n', row.selected_text.strip())

411 Heathrow Connect is ï¿½7.40,i thought it was a bargain (express is 15),but then it terminated at Hayes/Harlington and i had to take a bus  Heathrow Connect is ï¿½7.40,i thought it was a bargain (express is 15),but then it terminated at Hayes/Harlington and i had to take a bus
451 God, I`ll miss my bf so ****` much!  It`s only 2 months now *snï¿½ff*  miss
483 at economics i wanna go homeeeeeee  im tired and i hate the teacher.  and iï¿½m sick of half of my classmates and i wanna go home and sleep  hate
638 vll iwas mit kï¿½se  vll iwas mit kï¿½se
747 Photovia novusnovendo) ï¿½o_0ï¿½*giggles*ï¿½who am i kidding? heï¿½s probably ****  but def my type, hey now! http://tumblr.com/x2k1wgbpm  Photovia novusnovendo) ï¿½o_0ï¿½*giggles*ï¿½who am i kidding? heï¿½s probably **** but def my type, hey now! http://tumblr.com/x2k1wgbpm
828 ist im der Haus... or something like that.  Guten morgen, Frï¿½ulein!  ist im der Haus... or something like that. Guten morgen, Frï¿½ulein!
950 haha xD LMFAOO ;

In [319]:
len(ids)

105

In [322]:
pseudo = pseudo.drop(ids).reset_index(drop=True)

In [324]:
pseudo

Unnamed: 0,textID,text,sentiment,selected_text
0,1694291488,you are so sexy mama!,positive,sexy
1,1963588805,shout out to all the people goin to prom & iis...,negative,miss
2,64b297c15e,says my new layout is so cute x) see the cutie...,positive,cute
3,1753094743,"your right, it is a great plant i love it",positive,great
4,1965986606,7pm on a Fri night & I`m sitting at home alone...,neutral,7pm on a Fri night & I`m sitting at home alon...
...,...,...,...,...
12029,1753788294,sorry i havent tweeted in a while- i was on ho...,neutral,sorry i havent tweeted in a while- i was on h...
12030,1961871203,aw my dear i`m sorry,negative,aw my dear i`m sorry
12031,1962234541,noo i only started 4th season on wednesday bu...,neutral,noo i only started 4th season on wednesday bu...
12032,136b1926c9,Cool. That`d be fantastic!,positive,Cool. That`d be fantastic!


In [323]:
pseudo.to_csv('pseudo_selected.csv', index=False)

------------

---------------------------

# 아래는 사전 기반 (실패작)

In [207]:
test_texts = dict()

for i, txt in enumerate(test_data.text):
    test_data[txt] = test_data.iloc[i].sentiment

In [200]:
# train_data = train_data[train_data.sentiment != 'empty']

## 주어진 학습 데이터가 원래 무슨 감정이었는지?
- 단순 매핑이 아니었네

In [227]:
def get_sent(text):
    testimonial = TextBlob(str(text))
    return testimonial.sentiment.polarity

In [229]:
sent_tup = set()
for i, _ in enumerate(orig_data.textID):
    row = orig_data.iloc[i]
    
    if row.text in texts_senti:
        if  texts_senti[row.text] == row.new_sentiment:
            print("?")
            print(row.sentiment, row.new_sentiment, '=>', texts_senti[row.text], '==>', get_sent(row.text), row.text)
        # sent_tup.add((row.sentiment, texts_senti[row.text], row.new_sentiment))


?
sadness negative => negative ==> 0.0 Layin n bed with a headache  ughhhh...waitin on your call...
?
sadness negative => negative ==> 0.0 Funeral ceremony...gloomy friday...
?
enthusiasm positive => positive ==> 0.25 wants to hang out with friends SOON!
?
neutral neutral => neutral ==> 0.0 We want to trade with someone who has Houston tickets, but no one will.
?
worry negative => negative ==> 0.0 Re-pinging : why didn`t you go to prom? BC my bf didn`t like my friends
?
sadness negative => negative ==> 0.5 Charlene my love. I miss you
?
neutral neutral => neutral ==> 0.0 cant fall asleep
?
worry negative => negative ==> 0.0 Choked on her retainers
?
sadness negative => negative ==> -0.3916666666666666 Ugh! I have to beat this stupid song to get to the next  rude!
?
sadness negative => negative ==> -0.3 if u watch the hills in london u will realise what tourture it is because were weeks and weeks late  i just watch itonlinelol
?
surprise neutral => neutral ==> 0.0 Got the news
?
love ne

In [211]:
sent_tup

{('anger', 'negative', 'negative'),
 ('anger', 'neutral', 'neutral'),
 ('anger', 'positive', 'positive'),
 ('boredom', 'negative', 'negative'),
 ('boredom', 'neutral', 'neutral'),
 ('boredom', 'positive', 'positive'),
 ('empty', 'negative', 'negative'),
 ('empty', 'neutral', 'neutral'),
 ('empty', 'positive', 'positive'),
 ('enthusiasm', 'negative', 'negative'),
 ('enthusiasm', 'neutral', 'neutral'),
 ('enthusiasm', 'positive', 'positive'),
 ('enthusiasm', 'positive', nan),
 ('fun', 'negative', 'negative'),
 ('fun', 'neutral', 'neutral'),
 ('fun', 'positive', 'positive'),
 ('happiness', 'negative', 'negative'),
 ('happiness', 'neutral', 'neutral'),
 ('happiness', 'positive', 'positive'),
 ('happiness', 'positive', nan),
 ('hate', 'negative', 'negative'),
 ('hate', 'neutral', 'neutral'),
 ('hate', 'positive', 'positive'),
 ('love', 'negative', 'negative'),
 ('love', 'neutral', 'neutral'),
 ('love', 'positive', 'positive'),
 ('love', 'positive', nan),
 ('neutral', 'negative', 'negative')

In [111]:
train_data

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...,Layin n bed with a headache ughhhh...waitin o...,c811396dc2,negative,headache
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...,Funeral ceremony...gloomy friday...,9063631ab1,negative,gloomy
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!,wants to hang out with friends SOON!,2a815f151d,positive,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,We want to trade with someone who has Houston ...,@dannycastillo We want to trade with someone w...,82565a56d3,neutral,We want to trade with someone who has Houston ...
5,1956968477,worry,xxxPEACHESxxx,Re-pinging : why didn`t you go to prom? BC my ...,Re-pinging @ghostridah14: why didn't you go to...,a610d6b25b,negative,didn`t like my
...,...,...,...,...,...,...,...,...
39995,1753918954,neutral,showMe_Heaven,,@JohnLloydTaylor,p1000008985,neutral,
39996,1753919001,love,drapeaux,Happy Mothers Day All my love,Happy Mothers Day All my love,0b62ea4f2d,positive,Happy
39997,1753919005,love,JenniRox,Happy Mother`s Day to all the mommies out ther...,Happy Mother's Day to all the mommies out ther...,1adaa3519d,positive,Happy Mother`s Day
39998,1753919043,happiness,ipdaman1,WASSUP BEAUTIFUL!!! FOLLOW ME!! PEEP OUT MY N...,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...,d63253be9a,neutral,WASSUP BEAUTIFUL!!! FOLLOW ME!! PEEP OUT MY N...


In [54]:
train_data.sentiment.unique()

array(['sadness', 'enthusiasm', 'neutral', 'worry', 'surprise', 'love',
       'fun', 'hate', 'happiness', 'boredom', 'relief', 'anger'],
      dtype=object)

In [55]:
neg = 'negative'
pos = 'positive'
neu = 'neutral'

sent_map = {
    'sadness': neg,
    'enthusiasm': pos,
    'neutral': neu,
    'worry': neg,
    'surprise': pos,
    'love': pos,
    'fun': pos,
    'hate': neg,
    'happiness':pos,
    'boredom': neg,
    'relief': pos,
    'anger': pos,
}

## nltk.sentiment.vader를 이용한 수도 라벨링
- https://www.nltk.org/howto/sentiment.html

In [161]:
import re
import nltk
import heapq
import string

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

def ext_sent_word(text, sentiment, topk=2, verbose=False):
    word_list = []
    sen_list = []
    score_list = []
    
    splited_text = text.split()
    
    idx_list = []

    if sentiment == 'positive':
        for w in splited_text:
            score = sid.polarity_scores(w)['compound']
            score_list.append(score)
        if verbose: print('scores:', score_list)
        for scr in [x for x in set(heapq.nlargest(topk, score_list))]:
            if scr <= 0.0:
                continue
            idx_list.append(score_list.index(scr))

    elif sentiment == 'negative':
        for w in splited_text:
            score = sid.polarity_scores(w)['compound']
            score_list.append(score)
        if verbose: print('scores:', score_list)
        for scr in [x for x in set(heapq.nsmallest(topk, score_list))]:
            if scr >= 0.0:
                continue
            idx_list.append(score_list.index(scr))
    
    if not idx_list:
        return ''
    
    start_idx = min(idx_list)
    end_idx = max(idx_list)
    selected_text = splited_text[start_idx:end_idx+1]
    
    if len(selected_text) < 2:
        return ''
    
    return ' '.join(selected_text)

## nan 인 것 만 새로 라벨링 (11975 rows)

In [162]:
orig_data = train_data[train_data.selected_text == 'nan']

In [163]:
orig_data

Unnamed: 0,textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text
6,1956968487,sadness,ShansBee,"I should be sleep, but im not! thinking about ...","I should be sleep, but im not! thinking about ...",p1000000001,,
7,1956968636,worry,mcsleazy,Hmmm. http://www.djhero.com/ is down,Hmmm. http://www.djhero.com/ is down,2dfbe0b7fb,negative,
9,1956969172,sadness,Ingenue_Em,I`m sorry at least it`s Friday?,@kelcouch I'm sorry at least it's Friday?,6d846d7d50,negative,
15,1956971077,sadness,Sim_34,The storm is here and the electricity is gone,The storm is here and the electricity is gone,p1000000002,,
22,1956972116,neutral,jansc,No Topic Maps talks at the Balisage Markup Con...,No Topic Maps talks at the Balisage Markup Con...,p1000000003,,
...,...,...,...,...,...,...,...,...
39983,1753905073,neutral,ABZQuine,"not really just leaving flat now, on the looko...","@Astronick not really just leaving flat now, o...",p1000008982,,
39990,1753918829,neutral,kdpaine,I think the lesson of the day is not to have l...,@shonali I think the lesson of the day is not ...,p1000008983,,
39993,1753918892,neutral,bushidosan,"haha, yeah. Twitter has many uses. For me it`s...","@sendsome2me haha, yeah. Twitter has many uses...",p1000008984,,
39994,1753918900,happiness,courtside101,Succesfully following Tayla!!,Succesfully following Tayla!!,bd499c0bf7,positive,


## 일단 뉴트럴 제외하고!, 길이 2 이상인 긍/부정 답안만 포함 (1이 많이 포함되면 짧은 애들을 더 찾을까봐)

In [164]:
aug_data = []

empty_cnt = 0
for idx, text in enumerate(orig_data.text):
    row = orig_data.iloc[idx]
    
    text = re.sub('http[s]?://\S+', '', row.text)

    senti = row.sentiment
    senti = sent_map[senti]
    if senti == neu:
        #aug_data.append({
        #    'textID': row.textID,
        #    'text': row.text,
        #    'sentiment': senti,
        #    'selected_text': row.text,
        #})
        continue
    
    pseudo_label = ext_sent_word(text, senti)
    
    if pseudo_label != '':
        aug_data.append({
            'textID': row.textID,
            'text': text,
            'sentiment': senti,
            'selected_text': pseudo_label,
        })
    else:
        empty_cnt += 1
    
empty_cnt

7006

In [165]:
len(aug_data)

2223

In [166]:
textIDs = []
texts = []
sentiments = []
selected_texts = []

for data in aug_data:
    textIDs.append(data['textID'])
    texts.append(data['text'])
    sentiments.append(data['sentiment'])
    selected_texts.append(data['selected_text'])

In [167]:
df = pd.DataFrame.from_dict({
    'textID': textIDs,
    'text': texts,
    'sentiment': sentiments,
    'selected_text': selected_texts
})

In [168]:
df

Unnamed: 0,textID,text,sentiment,selected_text
0,1956977187,<3 your gonna be the first twitter ;) cause y...,positive,<3 your gonna be the first twitter ;) cause yo...
1,1956982383,"Aww Onward and upwards now, yay! Still sad to...",negative,sad to leave
2,1956987950,.. I`m suppposed to be sleep. But i got some m...,negative,stuck in my head 'your a jerk
3,1956991673,i wanted to come to BZ this summer :/ not so s...,negative,:/ not so sure anymore... a teacher`s life in ...
4,1956996776,Is miserable i feel like im gona cry sux!,negative,miserable i feel like im gona cry
...,...,...,...,...
2218,1753886405,thanks for promoing the show for me in my abse...,positive,thanks for promoing the show for me in my abse...
2219,1753902805,U think web design wld go down well at the Int...,positive,well at the International Brotherhood of Magic...
2220,1753903062,Haha! I`ll try that next time he`s up north! ...,positive,Haha! I`ll try that next time he`s up north! T...
2221,1753904142,"LOL or maybe it`s the tooth fairy, takes `em t...",positive,"LOL or maybe it`s the tooth fairy, takes `em t..."


In [169]:
df.to_csv('./dataset/pseudo_senti_dict.csv', index=False)

In [170]:
df.sentiment.value_counts()

positive    1491
negative     732
Name: sentiment, dtype: int64