# Data_loader basics

### First, download the "data.pkl" file from the Google Drive folder "Datasets/pickle_04212018" into the data directory. It contains all the data (labeled and unlabeled) in a transformed format.

In [1]:
# import the data loader
from data_loader import Data_loader

In [2]:
# initialization
# word level tokenization
option = 'word'
max_len = 20
vocab_size = 30000
dl = Data_loader(vocab_size=vocab_size, max_len=max_len, option=option)

Loading vocabulary ...
30000 vocab is considered.
Loading tweets ...
Processing tweets ...
Data loader initialization finishes


In [3]:
fold_idx = 0 # suppose that we want to load the cross validation data for fold 0
tr, val, test = dl.cv_data(fold_idx) # get the cross validation data

In [4]:
data_point0 = tr[0]
print("Each data point is now a dictionary")
print("------ a single data point ------")
for key in data_point0:
    print('%s: %s' % (str(key), str(data_point0[key])))

Each data point is now a dictionary
------ a single data point ------
tweet_id: 549982154667859968
user_name: HeadHunchoMillz
created_at: 2017-11-20 14:54:33.315932
int_arr: [2, 16, 662, 709, 282, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label: Other


In [5]:
# I deliberately removed the unicode representation from the dictionaries
# to avoid confusion.
# To get the unicode representation of a tweet,
# use the "convert2unicode" function.
s = dl.convert2unicode(data_point0['int_arr'])
print(s)

@user u acting slow 😓 _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_ _PAD_
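
The output above suggests that index 0 is the `_PAD_` token and that every `int_arr` is right-padded to `max_len`. If you only want the real tokens, a minimal sketch such as the following could strip the padding first. The helper `strip_padding` and the constant `PAD_ID` are hypothetical names, not part of `Data_loader`, and the padding id is an assumption read off the example.

```python
# Assumption: index 0 is the _PAD_ token, as the printed example suggests.
PAD_ID = 0


def strip_padding(int_arr, pad_id=PAD_ID):
    """Return a copy of int_arr without its trailing padding ids."""
    end = len(int_arr)
    # Walk back from the end until we hit a non-padding id.
    while end > 0 and int_arr[end - 1] == pad_id:
        end -= 1
    return int_arr[:end]


# The int_arr of the data point printed above (5 tokens + 15 pads).
int_arr = [2, 16, 662, 709, 282] + [0] * 15
print(strip_padding(int_arr))  # [2, 16, 662, 709, 282]
```

Only trailing zeros are removed, so a 0 that appears in the middle of a sequence is left untouched.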


In [None]:
all_data = dl.all_data()
print("There are %d data for unsupervised learning" % len(all_data))
print("As before, each datapoint is a dictionary. However, it might have the \"label\" attribute.")
print("------ a single data point ------")
print(all_data[-1])

There are 1033655 data for unsupervised learning
As before, each datapoint is a dictionary. However, it might have the "label" attribute.
------ a single data point ------
{'tweet_id': 905716028390481921, 'user_name': 'TGottiFrmLowe_', 'created_at': datetime.datetime(2017, 9, 7, 8, 55, 3), 'int_arr': [62, 16, 911, 163, 4, 66, 151, 1474, 6, 319, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
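
Since only some of the data points returned by `all_data()` carry a "label" attribute, it can be handy to separate the two groups. The sketch below does this on a small made-up list; `split_by_label` is a hypothetical helper, and the toy dictionaries only imitate the shape of the real data points.

```python
def split_by_label(data_points):
    """Split data points into (labeled, unlabeled) by presence of 'label'."""
    labeled = [d for d in data_points if 'label' in d]
    unlabeled = [d for d in data_points if 'label' not in d]
    return labeled, unlabeled


# Made-up data points that mimic the dictionary layout shown above.
toy_points = [
    {'tweet_id': 1, 'int_arr': [2, 16, 0, 0], 'label': 'Other'},
    {'tweet_id': 2, 'int_arr': [62, 16, 911, 0]},
    {'tweet_id': 3, 'int_arr': [4, 66, 0, 0]},
]

labeled, unlabeled = split_by_label(toy_points)
print(len(labeled), len(unlabeled))  # 1 2
```

With the real loader you would pass `dl.all_data()` instead of `toy_points`.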


# Extending vocab properties

In [None]:
import pickle as pkl
# The current vocabulary lookup table is in model/word.pkl
with open('../model/word.pkl', 'rb') as f:
    word2property = pkl.load(f)
# Keys are UTF-8 encoded bytes, so encode a str token before looking it up
print(word2property[b'wow'])
print(word2property['😓'.encode('utf-8')])

Essentially, word.pkl is a dictionary that maps a **binary-encoded** token to its property dictionary. Currently it contains, for each token: i) its id (the index used in int_arr_rep), ii) its occurrence count in the labeled/unlabeled corpus, and iii) whether it is an emoji (isemoji).

In the future it might include word embeddings, Splex scores, etc. To extend it, simply load the dictionary, update it, dump it, and commit. **Be sure not to change the other attributes, especially the lookup index, and be sure to set the new attribute for every single word.**
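
The load, update, dump cycle could look roughly like the sketch below. It runs on a toy pickle in a temporary directory rather than on the real `model/word.pkl`. The helper `add_word_attribute`, the attribute name `num_chars`, and the toy property keys and values are all made up for illustration; the real property dictionaries may use different key names.

```python
import os
import pickle as pkl
import tempfile


def add_word_attribute(path, attr_name, compute):
    """Add attr_name to every token's property dict and write the file back.

    compute is called with the (bytes) token and returns the new value.
    Existing attributes are never overwritten, which protects the lookup index.
    """
    with open(path, 'rb') as f:
        word2property = pkl.load(f)
    for token, props in word2property.items():
        if attr_name in props:
            raise KeyError('%r already exists for token %r' % (attr_name, token))
        props[attr_name] = compute(token)  # set for every single word
    with open(path, 'wb') as f:
        pkl.dump(word2property, f)
    return word2property


# Toy stand-in for word.pkl; keys and values here are hypothetical.
toy_vocab = {
    b'wow': {'id': 1, 'isemoji': False},
    '😓'.encode('utf-8'): {'id': 2, 'isemoji': True},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'word.pkl')
    with open(toy_path, 'wb') as f:
        pkl.dump(toy_vocab, f)
    # Hypothetical new attribute: number of unicode characters in the token.
    updated = add_word_attribute(
        toy_path, 'num_chars', lambda tok: len(tok.decode('utf-8')))

print(updated[b'wow'])
```

Raising on an existing attribute name is a deliberate choice here: it makes it hard to clobber the id or the counts by accident.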

# Extending tweet properties

In [None]:
with open('../data/data.pkl', 'rb') as f:
    data = pkl.load(f)

In [None]:
tweet_data = data['data']
tweet = tweet_data[905716028390481921]
print(tweet)

Essentially, data['data'] is a dictionary that maps a tweet_id to its corresponding tweet attributes. To update, simply add an attribute to the tweet dictionary. **Be sure that all labeled tweets (tweets that have a "label" field) have the same attributes.** (e.g. make sure that all labeled tweets have a context representation.)
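
One way to check this consistency rule before dumping the data back is sketched below. It assumes that labeled tweets are exactly those whose dictionary has a `'label'` key; `find_inconsistent_labeled`, the `context` attribute, and the toy tweets are hypothetical and only illustrate the idea.

```python
def find_inconsistent_labeled(tweet_data):
    """Return {tweet_id: missing attributes} for labeled tweets.

    A labeled tweet is reported if it lacks any attribute that at least
    one other labeled tweet has.
    """
    labeled = {tid: t for tid, t in tweet_data.items() if 'label' in t}
    # Union of all attribute names seen on labeled tweets.
    all_attrs = set()
    for tweet in labeled.values():
        all_attrs.update(tweet)
    missing = {}
    for tid, tweet in labeled.items():
        gap = all_attrs - set(tweet)
        if gap:
            missing[tid] = gap
    return missing


# Made-up tweets: tweet 2 is labeled but was not given a 'context' attribute.
toy_tweets = {
    1: {'label': 'Other', 'int_arr': [2, 16, 0], 'context': [5, 7]},
    2: {'label': 'Other', 'int_arr': [62, 16, 0]},
    3: {'int_arr': [4, 66, 0]},  # unlabeled, so it is ignored
}

problems = find_inconsistent_labeled(toy_tweets)
print(problems)  # {2: {'context'}}
```

An empty result means every labeled tweet has the same set of attributes; on the real data you would pass `data['data']`.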