# Data_loader basics

### First, re-download the "data.pkl" file from the Google Drive folder "Datasets/pickle_04212018" into the data directory. It contains all the data (labeled and unlabeled) in a transformed format.

In [1]:
# import the data loader
from data_loader import Data_loader

In [2]:
# initialization
# word level tokenization
option = 'word'
max_len = 20
vocab_size = 30000
dl = Data_loader(vocab_size=vocab_size, max_len=max_len, option=option)

Loading vocabulary ...
30000 vocab is considered.
Loading user information finished
Loading tweets ...
Processing tweets ...
Data loader initialization finishes


### Access data for cross validation

In [3]:
fold_idx = 0 # suppose that we want to load the cross validation data for fold 0
tr, val, test = dl.cv_data(fold_idx) # get the cross validation data

### Format of a data point

In [4]:
data_point0 = tr[0]
print("Each data point is now a dictionary")
print("------ a single data point ------")
print(data_point0)

Each data point is now a dictionary
------ a single data point ------
{'tweet_id': 833942638038548480, 'user_mentions': [1527], 'int_arr': [2, 8, 1, 5408, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 'Loss', 'created_at': datetime.datetime(2017, 11, 23, 12, 24, 29, 403482), 'user_post': 203}


Attributes:
<br>*int_arr*: integer representation of the tweet, padded with 0 to length max_len
<br>*user_post*: id of the user who posted the tweet
<br>*label*: label of the tweet for classification
<br>*user_mentions*: list of ids of the users mentioned in the tweet
<br>*tweet_id*: id of the tweet
<br>*created_at*: datetime object for when the tweet was posted
<br>*user_retweet*: present only if the tweet is a retweet; holds the id of the user it was retweeted from
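
As a minimal sketch of working with these attributes, the snippet below strips the padding from int_arr and checks whether a data point is a retweet. It assumes that id 0 is the padding token (inferred from the recovered tweets printed in this notebook) and that the retweet key is user_retweet, as it appears in the printed data points. The helpers `strip_padding` and `is_retweet` are hypothetical and not part of Data_loader.

```python
import datetime

PAD_ID = 0  # assumption: id 0 is _PAD_, inferred from the recovered tweets in this notebook

# Sample data point copied from the notebook output
data_point = {
    'tweet_id': 833942638038548480,
    'user_mentions': [1527],
    'int_arr': [2, 8, 1, 5408] + [0] * 16,
    'label': 'Loss',
    'created_at': datetime.datetime(2017, 11, 23, 12, 24, 29, 403482),
    'user_post': 203,
}


def strip_padding(int_arr, pad_id=PAD_ID):
    """Return the token ids with trailing padding removed (hypothetical helper)."""
    end = len(int_arr)
    while end > 0 and int_arr[end - 1] == pad_id:
        end -= 1
    return int_arr[:end]


def is_retweet(point):
    """Treat a data point as a retweet if it carries the 'user_retweet' key."""
    return 'user_retweet' in point


print(strip_padding(data_point['int_arr']))
print(is_retweet(data_point))
print(data_point.get('label'))  # .get() returns None for unlabeled data points
```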

### Get info about a data point

In [5]:
# I deliberately deleted the unicode representation from the dictionaries
# to avoid confusion.
# Print the information about the tweet:
dl.print_recovered_tweet(data_point0)

tweet_id: 833942638038548480
user_mentions: [1527]
int_arr: [2, 8, 1, 5408, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label: Loss
created_at: 2017-11-23 12:24:29.403482
user_post: 203
User greedybandz posted the tweet.
Users being mentioned: _shayauna
original tweet content: @user my _UNKNOWN_ apologies _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_


### Access data indexed by user and time

In [6]:
user_id = 2
user_tweets = dl.tweets_by_user(user_id)
for idx in range(2):
    print('-------------')
    dl.print_recovered_tweet(user_tweets[idx])

-------------
tweet_id: 422958425316667392
user_mentions: []
int_arr: [92, 45, 72, 549, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label: Loss
created_at: 2014-01-14 05:08:00
user_post: 2
User tyquanassassin posted the tweet.
Users being mentioned: 
original tweet content: rip lil bro tyquan _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_
-------------
tweet_id: 422963500479045632
user_mentions: []
int_arr: [154, 459, 38, 188, 686, 62, 198, 794, 43, 46, 1074, 925, 5, 0, 0, 0, 0, 0, 0, 0]
label: Other
created_at: 2014-01-14 05:28:00
user_post: 2
User tyquanassassin posted the tweet.
Users being mentioned: 
original tweet content: why dese niggas aint riding if dey part of da set ❓ 💯 _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_


### Access all the data

In [7]:
all_data = dl.all_data()
print("There are %d data points for unsupervised learning" % len(all_data))
print("As before, each data point is a dictionary. However, it might not have the \"label\" attribute.")
print("------ a single data point ------")
print(all_data[-1])
dl.print_recovered_tweet(all_data[-1])

There are 1033655 data points for unsupervised learning
As before, each data point is a dictionary. However, it might not have the "label" attribute.
------ a single data point ------
{'tweet_id': 905716028390481921, 'user_mentions': [], 'int_arr': [62, 16, 911, 163, 4, 66, 151, 1474, 6, 319, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'user_retweet': 1231, 'created_at': datetime.datetime(2017, 9, 7, 8, 55, 3), 'user_post': 44}
tweet_id: 905716028390481921
user_mentions: []
int_arr: [62, 16, 911, 163, 4, 66, 151, 1474, 6, 319, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
user_retweet: 1231
created_at: 2017-09-07 08:55:03
user_post: 44
User tgottifrmlowe_ posted the tweet.
Users being mentioned: 
Retweet from shellywelly53.
original tweet content: if u mine ... i go crazy behind you 🤗 _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_
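
Because the full data set mixes labeled and unlabeled tweets, one simple way to separate them is to test for the "label" key. The sketch below uses a small stand-in list in place of `dl.all_data()`; the stand-in entries keep only a few attributes for brevity.

```python
# Small stand-in for dl.all_data(); real data points carry more attributes
all_data = [
    {'tweet_id': 422958425316667392, 'label': 'Loss', 'user_post': 2},
    {'tweet_id': 422963500479045632, 'label': 'Other', 'user_post': 2},
    {'tweet_id': 905716028390481921, 'user_retweet': 1231, 'user_post': 44},
]

# A data point is labeled exactly when it has the "label" key
labeled = [p for p in all_data if 'label' in p]
unlabeled = [p for p in all_data if 'label' not in p]

print(len(labeled), len(unlabeled))
```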


# Extending vocab properties

In [8]:
import pickle as pkl
# The current vocabulary lookup table is in model/word.pkl
with open('../model/word.pkl', 'rb') as f:
    word2property = pkl.load(f)
print(word2property[b'wow'])
print(word2property['😓'.encode('utf-8')])

{'isemoji': False, 'occurence_in_labeled': 4, 'id': 1875, 'occurence_in_unlabeled': 806}
{'isemoji': True, 'occurence_in_labeled': 60, 'id': 282, 'occurence_in_unlabeled': 3608}


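Essentially, word.pkl is a dictionary that maps a **binary encoded** (UTF-8 bytes) token to its property dictionary. Currently it contains: i) the token's id (its index in the int array representation), ii) its occurrence counts in the labeled and unlabeled corpora (stored under the keys *occurence_in_labeled* and *occurence_in_unlabeled*, spelled as shown), and iii) whether it is an emoji (*isemoji*).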

In the future it might include word embeddings, Splex scores, etc. To extend it, simply load the dictionary, update it, dump it, and commit. **Be sure not to change the other attributes, especially the lookup index, and add the new attribute for every single word.**
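
The load, update, dump cycle could look roughly like the sketch below. To stay runnable on its own, it writes a two-entry stand-in for word.pkl to a temporary directory instead of touching `../model/word.pkl`. The attribute name `num_bytes` is hypothetical and used only for illustration.

```python
import os
import pickle as pkl
import tempfile

# Tiny stand-in for ../model/word.pkl so the sketch runs on its own
emoji_token = '\U0001F613'.encode('utf-8')
stand_in = {
    b'wow': {'isemoji': False, 'occurence_in_labeled': 4,
             'id': 1875, 'occurence_in_unlabeled': 806},
    emoji_token: {'isemoji': True, 'occurence_in_labeled': 60,
                  'id': 282, 'occurence_in_unlabeled': 3608},
}
path = os.path.join(tempfile.mkdtemp(), 'word.pkl')
with open(path, 'wb') as f:
    pkl.dump(stand_in, f)

# 1) Load the dictionary
with open(path, 'rb') as f:
    word2property = pkl.load(f)

# 2) Add the new attribute to every single word, leaving existing keys untouched
for token, props in word2property.items():
    props['num_bytes'] = len(token)  # hypothetical attribute

# Sanity check: no word was skipped
assert all('num_bytes' in props for props in word2property.values())

# 3) Dump it back
with open(path, 'wb') as f:
    pkl.dump(word2property, f)
```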

# Extending tweet properties

In [9]:
with open('../data/data.pkl', 'rb') as f:
    data = pkl.load(f)

In [10]:
tweet_data = data['data']
tweet = tweet_data[905716028390481921]
print(tweet)

{'tweet_id': 905716028390481921, 'user_mentions': [], 'char_int_arr': [7, 21, 2, 11, 2, 15, 7, 9, 3, 2, 28, 28, 28, 2, 7, 2, 16, 5, 2, 19, 10, 6, 32, 14, 2, 18, 3, 13, 7, 9, 17, 2, 14, 5, 11, 2, 118], 'word_int_arr': [62, 16, 911, 163, 4, 66, 151, 1474, 6, 319], 'user_retweet': 1231, 'created_at': datetime.datetime(2017, 9, 7, 8, 55, 3), 'user_post': 44}


Essentially, data['data'] is a dictionary that maps a tweet_id to its corresponding tweet attributes. To update, simply add an attribute to the tweet dictionary. **Be sure that all labeled tweets (tweets that have a "label" field) have the same attributes** (e.g. make sure that all labeled tweets have a context representation).
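
A minimal sketch of adding a tweet attribute and then checking that no labeled tweet was missed is shown below. It uses a small stand-in dictionary in place of data['data'], with trimmed attribute sets. The attribute name `num_words` is hypothetical.

```python
# Tiny stand-in for data['data'] loaded from ../data/data.pkl
tweet_data = {
    422958425316667392: {'tweet_id': 422958425316667392, 'label': 'Loss',
                         'word_int_arr': [92, 45, 72, 549], 'user_post': 2},
    905716028390481921: {'tweet_id': 905716028390481921,
                         'word_int_arr': [62, 16, 911, 163, 4, 66, 151, 1474, 6, 319],
                         'user_retweet': 1231, 'user_post': 44},
}

# Add the new (hypothetical) attribute to every tweet dictionary
for tweet in tweet_data.values():
    tweet['num_words'] = len(tweet['word_int_arr'])

# Collect labeled tweets that lack the new attribute; the list should be empty
missing = [tid for tid, tweet in tweet_data.items()
           if 'label' in tweet and 'num_words' not in tweet]
assert not missing, "labeled tweets without the new attribute: %s" % missing
```

The check looks only at the new attribute rather than comparing full key sets, since optional keys such as user_retweet legitimately differ between tweets.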

# Extending user properties

In [11]:
with open('../model/user.pkl', 'rb') as f:
    user2property = pkl.load(f)
user_info = user2property['tyquanassassin']
print(user_info)

{'labeled_user_post': 426, 'unlabeled_user_post': 0, 'unlabeled_user_mentioned': 179, 'occurence_in_labeled': 787, 'labeled_user_mentioned': 361, 'unlabeled_user_rt': 224, 'id': 2, 'labeled_user_rt': 0, 'occurence_in_unlabeled': 403}
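
Data points refer to users by integer id (*user_post*, *user_mentions*), while user.pkl is keyed by user name. One way to bridge the two, sketched here on a trimmed stand-in for user2property, is to build a reverse lookup from id to name. The name `id2user` is hypothetical; the name and id pairs are taken from the outputs shown in this notebook.

```python
# Trimmed stand-in for user2property; the real entries carry more counters
user2property = {
    'tyquanassassin': {'id': 2, 'labeled_user_post': 426, 'unlabeled_user_post': 0},
    'greedybandz': {'id': 203},
    'tgottifrmlowe_': {'id': 44},
}

# Reverse lookup: user id -> user name
id2user = {props['id']: name for name, props in user2property.items()}

print(id2user[2])
print(id2user.get(1527, '<unknown user>'))  # ids missing from the stand-in fall back
```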
