# Storytelling Twitter with POSLDA

**Author: Yasir Abdurrahman**

# Background Information
**Twitter** is one of the most popular social media platforms in Indonesia. The 140-character limit on a *tweet* does not reduce its informational value. This is shown by the hashtag **#PilkadaDKI** becoming a *trending topic* during the gubernatorial election in April 2017, as reported by [CNN Indonesia](https://www.cnnindonesia.com/teknologi/20170419143713-192-208647/tagar-quick-count-dan-pilkada-dki-kuasai-twitter/). Looking back to 2014, during the legislative general election (Pemilu), the hashtag **#IndonesiaElectionDay** also became a *trending topic*, as reported by [Liputan6](http://tekno.liputan6.com/read/2034443/hashtag-bertema-pemilu-dominasi-trending-topic-twitter).

Information is highly valuable to journalists when writing a news story. Information gathering from Twitter is still largely limited to *trending topics*, even though much more information could be used. Collecting *tweets* within a radius of a location, grouping them by topic, and then arranging the *tweets* that share a topic into a *storytelling* narrative would give journalists a new source of material.

# Questions for Investigation
1. How well does the resulting *storytelling* fit when it is built from tweets on the same topic?
2. Which topics are discussed most often on Twitter?

# Dataset
The dataset consists of 1000 *tweets* crawled by the author. Its structure is as follows:
```
id_user      : user ID
username     : Twitter username
created_at   : date and time the tweet was created
latitude     : latitude
longitude    : longitude
text         : tweet text
```
The dataset is stored in a *.csv* file.
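
As a minimal sketch of this layout, the snippet below writes one record with the six columns to an in-memory CSV and reads it back using only the standard library. The sample values are taken from the `head()` output shown in the Preprocessing section; the names `FIELDS` and `buffer` are illustrative and not part of the original notebook.

```python
import csv
import io

# Column order of the dataset, as listed above.
FIELDS = ["id_user", "username", "created_at", "latitude", "longitude", "text"]

# Write one sample record to an in-memory CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "id_user": 791650560961241088,
    "username": "isarndr88",
    "created_at": "2018-01-24 04:49:10",
    "latitude": None,    # a tweet without coordinates is written as an empty cell
    "longitude": None,
    "text": "Apakah IPK penting?",
})

# Read the record back; every value comes back as a string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["username"], repr(rows[0]["latitude"]))
```

Because a missing coordinate is read back as an empty string, downstream code has to handle empty `latitude` and `longitude` cells explicitly.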

# Crawling
#### Crawling process
1. Using the Python library tweepy, tweets are crawled within a radius around a latitude and longitude coordinate.
2. The crawled data includes the attributes <b>id_user, username, created_at, latitude, longitude, text</b>.
3. The crawl results are saved to a .csv file.

In [None]:
import tweepy
import time
import sys
import csv

latitude = -7.059035     # geographical centre of search
longitude = 110.443972   # geographical centre of search
max_range = 10            # search range in kilometres
outfile = "tweets14.csv"

# Replace the placeholders with your own credentials; never publish real keys.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [None]:
csvfile = open(outfile, "w", newline='', encoding='utf-8')
csvwriter = csv.writer(csvfile)

row = [ "id_user", "username", "created_at", "latitude", "longitude", "text" ]
csvwriter.writerow(row)

In [None]:
result_count = 0
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8

# api.search and tweepy.TweepError are the tweepy 3.x names.
# Each pass of the outer loop restarts the search, so a tweet can be written more than once.
while result_count < 10000:
    c = tweepy.Cursor(api.search,
                      q="*",
                      count=100,  # the search endpoint returns at most 100 tweets per page
                      geocode="%f,%f,%dkm" % (latitude, longitude, max_range)).items(10000)
    while True:
        try:
            tweet = c.next()
            # tweet.geo is None when the tweet carries no coordinates
            lat, lon = tweet.geo['coordinates'] if tweet.geo else (None, None)
            csvwriter.writerow([tweet.user.id, tweet.user.screen_name, tweet.created_at,
                                lat, lon, tweet.text])
            result_count += 1
            print("got %d results" % result_count)
        except tweepy.TweepError:
            # Most likely a rate limit: wait 15 minutes, then retry
            print("sleeping")
            time.sleep(15 * 60)
            continue
        except StopIteration:
            break
csvfile.close()
print(time.perf_counter() - start_time, "seconds")
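
The radius filter above is applied by the search API itself. As an optional client-side check, the sketch below computes the great-circle (haversine) distance between the search centre and a tweet's coordinates. The function names `haversine_km` and `within_radius` are hypothetical helpers, not part of tweepy or of the original notebook; the default centre and range simply repeat the values used in the crawl.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(lat, lon, centre_lat=-7.059035, centre_lon=110.443972, max_km=10):
    """True if the point lies within max_km of the search centre."""
    return haversine_km(centre_lat, centre_lon, lat, lon) <= max_km

# A point at the centre is inside; a point several degrees away is not.
print(within_radius(-7.059035, 110.443972))
print(within_radius(-6.2, 106.8))
```

Such a check only applies to rows where `latitude` and `longitude` are filled in; tweets without coordinates cannot be verified this way.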

# Preprocessing
Steps performed:
1. Common Preprocessing
    1. Remove non-ASCII characters
    2. Tokenization
    3. Case folding (convert to lowercase)
    4. Collapse repeated dots (sedih... -> sedih.)
    5. Collapse repeated characters ('hehe :)))' -> 'hehe :)')
    6. Remove ellipsis (lanjut baca... -> lanjut baca)
    7. Join meaningful repeated words with a hyphen ('malam malam' -> 'malam-malam')
    8. Remove newlines
2. Specific Preprocessing
    1. Remove Twitter-specific symbols: hashtags, mentions, RT, and FAV
    2. Remove all emoticons
    3. Remove URLs
    4. Spell checking using the noisy channel approach
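
A minimal sketch of several of the rule-based steps above, using only regular expressions from the standard library. It is not the `modulenorm` implementation used later in this notebook: `rule_based_preprocess` is a hypothetical function, the regular expressions are assumptions, and ellipsis removal, emoticon removal, and spell checking are left out.

```python
import re

def rule_based_preprocess(text):
    """Apply a subset of the preprocessing steps with simple regex rules."""
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = text.replace("\n", " ")                         # remove newlines
    text = re.sub(r"https?://\S+", "", text)               # remove URLs
    text = re.sub(r"\bRT\b|\bFAV\b", "", text)             # remove RT and FAV markers
    text = re.sub(r"[@#]\w+:?", "", text)                  # remove mentions and hashtags
    text = text.lower()                                    # case folding
    text = re.sub(r"\.{2,}", ".", text)                    # "sedih..." -> "sedih."
    text = re.sub(r"([^\w\s])\1+", r"\1", text)            # ":)))" -> ":)"
    text = re.sub(r"\b(\w+) \1\b", r"\1-\1", text)         # "malam malam" -> "malam-malam"
    return re.sub(r"\s+", " ", text).strip()               # tidy leftover whitespace

sample = "RT @user: Sedih...\nmalam malam hehe :))) https://t.co/abc #tag"
print(rule_based_preprocess(sample))
```

The order matters in this sketch: URLs are removed before hashtags so that a `#` fragment inside a link is not treated as a hashtag, and RT/FAV are removed before case folding because the rule matches the uppercase markers.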

In [1]:
import pandas as pd

In [2]:
df_tweets = pd.read_csv('tweets12.csv')
df_tweets.shape

(500, 6)

In [3]:
df_tweets.head()

| | id_user | username | created_at | latitude | longitude | text |
|---|---|---|---|---|---|---|
| 0 | 2313451298 | dikyock | 2018-01-24 04:49:23 | | | @rossonerifreak Masih kurang \nButuh winger ta... |
| 1 | 245873655 | phee_nophee | 2018-01-24 04:49:21 | | | Apa yang Akan Berubah Untukmu Tahun Ini? https... |
| 2 | 955268887406264321 | FahrulAditya16 | 2018-01-24 04:49:21 | | | RT @Pesona_STW: Support by (˛`̯´̯)-☞\n@Copasaj... |
| 3 | 791650560961241088 | isarndr88 | 2018-01-24 04:49:10 | | | Apakah IPK penting? |
| 4 | 2395660873 | Lukman_Prayogoo | 2018-01-24 04:49:09 | | | https://t.co/YLyXSx3EVh |


In [4]:
from modulenorm.Normalize import Normalize
from modulenorm.Tokenize import Tokenize
from modulenorm.LanguageNgramModel import LanguageNgramModel
from modulenorm.MissingLetterModel import MissingLetterModel

In [5]:
import re
# Read the corpus used to train the language model
with open('resource/opensubtitle.txt', encoding = 'utf-8') as f:
    text_id = f.read()

In [6]:
# leave only letters and spaces in the text
text_id2 = re.sub(r'[^a-z ]+', '', text_id.lower().replace('\n', ' '))
all_letters = ''.join(sorted(set(text_id2)))
print(repr(all_letters))

' abcdefghijklmnopqrstuvwxyz'


In [7]:
# Prepare training sample for the abbreviation model 
missing_set =  (
    [(all_letters, '-' * len(all_letters))] * 3 # all chars missing
    + [(all_letters, all_letters)] * 10 # all chars are NOT missing
    + [('aeiouy', '------')] * 30 # only vowels are missing
)

In [8]:
# Train both models
big_lang_m = LanguageNgramModel(order=4, smoothing=0.001, recursive=0.01)
big_lang_m.fit(text_id2)
big_err_m = MissingLetterModel(order=0, smoothing_missed=0.1)
big_err_m.fit(missing_set)
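
The classes `LanguageNgramModel` and `MissingLetterModel` come from `modulenorm` and their internals are not shown in this notebook. To illustrate the general noisy channel idea (pick the candidate that maximises P(word) × P(observed token | word)), here is a deliberately simplified word-level sketch. Everything in it is an assumption made for illustration: the toy word counts, the drop probabilities, and the helper names `log_error_prob` and `correct`. As in the training sample above, vowels are treated as more likely to be dropped than consonants.

```python
import math

VOWELS = set("aeiouy")

def log_error_prob(observed, candidate, p_drop_vowel=0.3, p_drop_consonant=0.05):
    """log P(observed | candidate), assuming letters are dropped independently.

    The candidate is aligned to the observed token greedily from left to right.
    Returns -inf if the observed token cannot be produced by deleting letters.
    """
    logp, i = 0.0, 0
    for ch in candidate:
        p_drop = p_drop_vowel if ch in VOWELS else p_drop_consonant
        if i < len(observed) and observed[i] == ch:
            logp += math.log(1 - p_drop)  # letter kept
            i += 1
        else:
            logp += math.log(p_drop)      # letter dropped
    return logp if i == len(observed) else float("-inf")

def correct(token, word_counts):
    """Return the word with the highest prior-times-error score, or the token itself."""
    total = sum(word_counts.values())
    scores = {
        word: math.log(count / total) + log_error_prob(token, word)
        for word, count in word_counts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > float("-inf") else token

# Toy unigram counts standing in for a language model (made-up values).
toy_counts = {"yang": 5, "tidak": 4, "tahu": 2, "tadi": 1}
print(correct("tdk", toy_counts), correct("yg", toy_counts))
```

The real models work at the character level with n-gram context, so this sketch only conveys the scoring principle, not their behaviour or cost.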

In [9]:
import time

# time.clock() was removed in Python 3.8; perf_counter() is the replacement.
start = time.perf_counter()
norm = Normalize()
tok = Tokenize()
df_tweets['normalize'] = None
for idx, row in enumerate(df_tweets['text']):
    start_tweet = time.perf_counter()
    # Rule-based normalization
    text_norm = norm.remove_ascii(row)
    text_norm = norm.remove_rt_fav(text_norm)
    text_norm = norm.lower_text(text_norm)
    text_norm = norm.repeat_char_modify(text_norm)
    text_norm = norm.remove_elipsis(text_norm)
    text_norm = norm.remove_newline(text_norm)
    text_norm = norm.remove_url(text_norm)
    text_norm = norm.remove_emoticons(text_norm)
    text_norm = norm.remove_hashtags_mentions(text_norm)

    # Spell-check each token with the noisy channel model
    temp_sentence = []
    for token in tok.WordTokenize(text_norm):
        if any(char.isdigit() for char in token) or len(token) >= 20:
            # Leave tokens containing digits and very long tokens unchanged
            choosen_word = token
        else:
            nc = norm.noisy_channel(token, big_lang_m, big_err_m)
            # Keep the candidate with the highest value
            choosen_word = max(nc, key=nc.get)
        temp_sentence.append(choosen_word)
    # .at avoids the chained assignment that triggers SettingWithCopyWarning
    df_tweets.at[idx, 'normalize'] = ' '.join(temp_sentence)
    print('tweets', idx, 'selesai', time.perf_counter() - start_tweet, 'seconds')  # "selesai" = "done"
print("Selesai dalam", time.perf_counter() - start, "seconds")  # "Finished in"


tweets 0 selesai 250.56433192346333 seconds
tweets 1 selesai 4.4807238574408075 seconds
tweets 2 selesai 59.26022706927594 seconds
tweets 3 selesai 39.20278408462411 seconds
tweets 4 selesai 0.6488083766576551 seconds
tweets 5 selesai 138.11705380123226 seconds
tweets 6 selesai 0.10057383115537277 seconds
tweets 7 selesai 2139.688049441621 seconds
tweets 8 selesai 83.23973358816647 seconds
tweets 9 selesai 917.1112978140363 seconds
tweets 10 selesai 0.5300828018284847 seconds
tweets 11 selesai 8.04580442985798 seconds
tweets 12 selesai 4.884312924999449 seconds
tweets 13 selesai 0.09914495508701293 seconds
tweets 14 selesai 33.15529077571 seconds
tweets 15 selesai 304.94678890608293 seconds
tweets 16 selesai 703.2756881241348 seconds
tweets 17 selesai 624.3106340081622 seconds
tweets 18 selesai 73.22696778357749 seconds
tweets 19 selesai 0.16495593063336855 seconds
tweets 20 selesai 1.403140904513748 seconds
tweets 21 selesai 35.62710382900059 seconds
tweets 22 selesai 82.5953412710268

tweets 183 selesai 194.56205119550577 seconds
tweets 184 selesai 25.489933436954743 seconds
tweets 185 selesai 295.6332720577775 seconds
tweets 186 selesai 271.10912256973825 seconds
tweets 187 selesai 1.0926163712938433 seconds
tweets 188 selesai 84.79406394933176 seconds
tweets 189 selesai 285.3860941735329 seconds
tweets 190 selesai 365.5345860991947 seconds
tweets 191 selesai 65.26676990761916 seconds
tweets 192 selesai 69.88617057792726 seconds
tweets 193 selesai 139.90289686661708 seconds
tweets 194 selesai 57.49721865225001 seconds
tweets 195 selesai 27.292238226596965 seconds
tweets 196 selesai 18.41083265956695 seconds
tweets 197 selesai 138.570609057002 seconds
tweets 198 selesai 134.20287189265218 seconds
tweets 199 selesai 649.1564995615263 seconds
tweets 200 selesai 200.1018995042832 seconds
tweets 201 selesai 630.5310618043077 seconds
tweets 202 selesai 1.2983220684400294 seconds
tweets 203 selesai 53.854833945893915 seconds
tweets 204 selesai 136.3839115168812 seconds

tweets 364 selesai 59.3451180447737 seconds
tweets 365 selesai 129.77129487189814 seconds
tweets 366 selesai 104.3218985235726 seconds
tweets 367 selesai 39.4660901366442 seconds
tweets 368 selesai 38.87002473449684 seconds
tweets 369 selesai 959.0582104496134 seconds
tweets 370 selesai 122.64450806408422 seconds
tweets 371 selesai 1.3817528078652686 seconds
tweets 372 selesai 239.79726495103387 seconds
tweets 373 selesai 550.9390412732027 seconds
tweets 374 selesai 112.84170208446449 seconds
tweets 375 selesai 102.20682895566279 seconds
tweets 376 selesai 61.92729324345419 seconds
tweets 377 selesai 1478.0469943996723 seconds
tweets 378 selesai 553.6516438916879 seconds
tweets 379 selesai 204.91028699540766 seconds
tweets 380 selesai 130.29443741853174 seconds
tweets 381 selesai 95.80047165454016 seconds
tweets 382 selesai 124.62675886889338 seconds
tweets 383 selesai 112.18101111512806 seconds
tweets 384 selesai 121.04349519280368 seconds
tweets 385 selesai 153.19073670933722 seconds

In [10]:
df_tweets['normalize'].to_csv('normalize_noisy_channel.csv', header=False, index=False)

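In [None]:
# Scratch cell: result2 is not defined in this notebook. It is assumed to be a
# list of tokenized tweets (one list of tokens per tweet).
result2norm = []
norm = Normalize()
for i, tweet in enumerate(result2, start=1):
    temp_kalimat = []  # normalized tokens of the current tweet ("kalimat" = sentence)
    print('Tweet', i)
    for token in tweet:
        if any(char.isdigit() for char in token) or len(token) >= 20:
            # Leave tokens containing digits and very long tokens unchanged
            choosen_word = token
            max_values = 0
        else:
            nc = norm.noisy_channel(token, big_lang_m, big_err_m)
            # Keep the candidate with the highest value
            choosen_word = max(nc, key=nc.get)
            max_values = nc[choosen_word]
        print(choosen_word, max_values)
        temp_kalimat.append(choosen_word)
    result2norm.append(temp_kalimat)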

In [None]:
text = "hello kamu yang ada disana hehe."
text_norm = tok.WordTokenize(text)
print(text_norm)

In [7]:
token = "gas"
if (not(any(char.isdigit() for char in token))):
    print('aa')

aa
