# Data_loader basics

### First, you will need to RE-DOWNLOAD the "data.pkl" file from the Google Drive folder "Datasets/pickle_04212018" into the data directory. It contains all the data (labeled and unlabeled) in a transformed format.

In [1]:
# import the data loader
from data_loader import Data_loader
# initialization
# character-level tokenization
# CHANGE THIS ARGUMENT TO LOAD A DIFFERENT TOKENIZATION LEVEL ('char' or 'word')
option = 'char'
# the vocab_size has to be changed for the 'char' option
dl = Data_loader(option=option, vocab_size=1000)

Data loader ...
Loading vocabulary ...
1000 vocab is considered.
Loading tweets ...
Processing tweets ...
Data loader initialization finishes


### Access data for cross validation

In [2]:
fold_idx = 0 # suppose that we want to load the cross validation data for fold 0
# get the cross validation data
# each of tr, val, test is a list of data points
tr, val, test = dl.cv_data(fold_idx)

### Format of a data point

In [3]:
data_point0 = tr[0]
print("Each data point is now a dictionary")
print("------ a single data point ------")
print(data_point0)

Each data point is now a dictionary
------ a single data point ------
{'created_at': datetime.datetime(2017, 11, 20, 12, 18, 27, 645188), 'user_post': 215, 'user_mentions': [], 'padded_int_arr': [22, 19, 2, 18, 2, 28, 2, 46, 4, 5, 2, 45, 8, 9, 2, 33, 9, 9, 2, 29, 15, 4, 27, 3, 16, 2, 37, 6, 2, 45, 3, 2, 33, 8, 7, 32, 5, 2, 29, 15, 4, 27, 8, 7, 2, 35, 7, 2, 44, 4], 'int_arr': [22, 19, 2, 18, 2, 28, 2, 46, 4, 5, 2, 45, 8, 9, 2, 33, 9, 9, 2, 29, 15, 4, 27, 3, 16, 2, 37, 6, 2, 45, 3, 2, 33, 8, 7, 32, 5, 2, 29, 15, 4, 27, 8, 7, 2, 35, 7, 2, 44, 4, 11, 27, 9, 2, 85, 2, 115], 'tweet_id': 832508907998244865, 'label': 'Aggression', 'user_retweet': 23}


attributes: 
<br>*int_arr*: integer representation of the tweet (one id per token)
<br>*padded_int_arr*: *int_arr* truncated or zero-padded to a fixed length (50 in the examples shown here)
<br>*user_post*: the id of the user who posted the tweet
<br>*label*: label of the tweet for classification
<br>*user_mentions*: the list of user ids mentioned in the tweet
<br>*tweet_id*: the id of the tweet
<br>*created_at*: datetime object for when the tweet was posted
<br>*user_retweet*: present only if the tweet is a retweet; the id of the user it was retweeted from
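
As a rough illustration of how a fixed-length array such as *padded_int_arr* can be derived from *int_arr*, here is a minimal sketch. The function name `pad_or_truncate`, the length 50, and the pad id 0 are assumptions read off the printed examples in this notebook, not the actual Data_loader code.

```python
def pad_or_truncate(int_arr, max_len=50, pad_id=0):
    """Return a copy of int_arr that is exactly max_len items long.

    max_len=50 and pad_id=0 are assumptions inferred from the printed
    data points, not values taken from the Data_loader source.
    """
    clipped = list(int_arr[:max_len])                      # drop anything past max_len
    return clipped + [pad_id] * (max_len - len(clipped))   # fill the rest with pad_id


short = [22, 8, 24, 2, 40, 8, 11]
print(pad_or_truncate(short, max_len=10))        # short input gets trailing pad ids
print(len(pad_or_truncate(list(range(57)))))     # long input is cut to max_len
```

Short tweets end in a run of pad ids, and tweets longer than the limit lose their tail, which matches the two kinds of *padded_int_arr* printed in this notebook.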

### Get info about a data point

In [4]:
# I deliberately deleted the unicode representation from the dictionaries
# to avoid confusion
# print the information about the tweet
dl.print_recovered_tweet(data_point0)

created_at: 2017-11-20 12:18:27.645188
user_post: 215
user_mentions: []
padded_int_arr: [22, 19, 2, 18, 2, 28, 2, 46, 4, 5, 2, 45, 8, 9, 2, 33, 9, 9, 2, 29, 15, 4, 27, 3, 16, 2, 37, 6, 2, 45, 3, 2, 33, 8, 7, 32, 5, 2, 29, 15, 4, 27, 8, 7, 2, 35, 7, 2, 44, 4]
int_arr: [22, 19, 2, 18, 2, 28, 2, 46, 4, 5, 2, 45, 8, 9, 2, 33, 9, 9, 2, 29, 15, 4, 27, 3, 16, 2, 37, 6, 2, 45, 3, 2, 33, 8, 7, 32, 5, 2, 29, 15, 4, 27, 8, 7, 2, 35, 7, 2, 44, 4, 11, 27, 9, 2, 85, 2, 115]
tweet_id: 832508907998244865
label: Aggression
user_retweet: 23
original tweet content: RT b'@user' : Got His Ass Smoked Na He Ain't Smokin On Folks b'\xf0\x9f\xa4\xa6' b'\xf0\x9f\xa4\xb7'
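
For classification, the lists returned by `dl.cv_data` usually need to be turned into inputs and integer labels. The sketch below shows one way to do that from the dictionary format described above. The helper `to_xy` and the label order in `LABELS` are hypothetical; only the three labels that appear in this notebook are listed, and the real dataset may define more.

```python
# Hypothetical label order; only these three labels appear in this notebook.
LABELS = ['Aggression', 'Loss', 'Other']
LABEL2IDX = {label: idx for idx, label in enumerate(LABELS)}


def to_xy(data_points):
    """Split a list of tweet dictionaries into inputs and integer labels."""
    X = [dp['padded_int_arr'] for dp in data_points]
    y = [LABEL2IDX[dp['label']] for dp in data_points]
    return X, y


# Stand-ins for entries of tr; the real ones come from dl.cv_data(fold_idx).
toy_tr = [
    {'padded_int_arr': [22, 8, 24, 0, 0], 'label': 'Loss'},
    {'padded_int_arr': [42, 10, 13, 2, 39], 'label': 'Other'},
]
X, y = to_xy(toy_tr)
print(X)
print(y)
```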


### Access data indexed by user and time

In [5]:
user_id = 2
user_tweets = dl.tweets_by_user(user_id)
for idx in range(2):
    print('-------------')
    dl.print_recovered_tweet(user_tweets[idx])

-------------
created_at: 2014-01-14 05:08:00
user_post: 2
user_mentions: []
int_arr: [22, 8, 24, 2, 40, 8, 11, 2, 25, 12, 4, 2, 5, 13, 93, 14, 6, 7]
tweet_id: 422958425316667392
label: Loss
padded_int_arr: [22, 8, 24, 2, 40, 8, 11, 2, 25, 12, 4, 2, 5, 13, 93, 14, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
original tweet content: Rip Lil bro tyquan
-------------
created_at: 2014-01-14 05:28:00
user_post: 2
user_mentions: []
int_arr: [42, 10, 13, 2, 39, 3, 9, 3, 2, 37, 8, 17, 17, 6, 9, 2, 33, 8, 7, 5, 2, 22, 8, 16, 8, 7, 17, 2, 21, 26, 2, 39, 3, 13, 2, 51, 6, 12, 5, 2, 35, 26, 2, 39, 6, 2, 29, 3, 5, 2, 220, 2, 43]
tweet_id: 422963500479045632
label: Other
padded_int_arr: [42, 10, 13, 2, 39, 3, 9, 3, 2, 37, 8, 17, 17, 6, 9, 2, 33, 8, 7, 5, 2, 22, 8, 16, 8, 7, 17, 2, 21, 26, 2, 39, 3, 13, 2, 51, 6, 12, 5, 2, 35, 26, 2, 39, 6, 2, 29, 3, 5, 2]
original tweet content: Why Dese Niggas Aint Riding If Dey Part Of Da Set b'\xe2\x9d\x93' 
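
To make "indexed by user and time" concrete, here is a minimal sketch of such an index built from a flat list of tweet dictionaries. It is not the Data_loader implementation: `index_by_user` is a hypothetical helper, and sorting by `created_at` is an assumption suggested by the output above (05:08 comes before 05:28).

```python
from collections import defaultdict
from datetime import datetime


def index_by_user(data_points):
    """Group tweet dictionaries by user_post, each group sorted by created_at."""
    by_user = defaultdict(list)
    for dp in data_points:
        by_user[dp['user_post']].append(dp)
    for tweets in by_user.values():
        # Assumption: chronological order within each user
        tweets.sort(key=lambda dp: dp['created_at'])
    return dict(by_user)


# Toy data points reusing ids and timestamps printed in this notebook
toy = [
    {'user_post': 2, 'created_at': datetime(2014, 1, 14, 5, 28), 'tweet_id': 422963500479045632},
    {'user_post': 2, 'created_at': datetime(2014, 1, 14, 5, 8), 'tweet_id': 422958425316667392},
    {'user_post': 46, 'created_at': datetime(2017, 9, 7, 8, 55, 3), 'tweet_id': 905716028390481921},
]
by_user = index_by_user(toy)
print([dp['tweet_id'] for dp in by_user[2]])
```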

### Access all the data

In [6]:
all_data = dl.all_data()
print("There are %d data for unsupervised learning" % len(all_data))
print("As before, each datapoint is a dictionary. However, it might have the \"label\" attribute.")
print("------ a single data point ------")
print(all_data[-1])
dl.print_recovered_tweet(all_data[-1])

There are 1033655 data for unsupervised learning
As before, each datapoint is a dictionary. However, it might have the "label" attribute.
------ a single data point ------
{'created_at': datetime.datetime(2017, 9, 7, 8, 55, 3), 'user_mentions': [], 'user_post': 46, 'int_arr': [22, 19, 2, 18, 2, 28, 2, 21, 26, 2, 14, 2, 15, 8, 7, 3, 2, 38, 38, 38, 2, 8, 2, 17, 4, 2, 20, 12, 6, 54, 13, 2, 25, 3, 10, 8, 7, 16, 2, 13, 4, 14, 2, 157], 'tweet_id': 905716028390481921, 'padded_int_arr': [22, 19, 2, 18, 2, 28, 2, 21, 26, 2, 14, 2, 15, 8, 7, 3, 2, 38, 38, 38, 2, 8, 2, 17, 4, 2, 20, 12, 6, 54, 13, 2, 25, 3, 10, 8, 7, 16, 2, 13, 4, 14, 2, 157, 0, 0, 0, 0, 0, 0], 'user_retweet': 2704}
created_at: 2017-09-07 08:55:03
user_mentions: []
user_post: 46
int_arr: [22, 19, 2, 18, 2, 28, 2, 21, 26, 2, 14, 2, 15, 8, 7, 3, 2, 38, 38, 38, 2, 8, 2, 17, 4, 2, 20, 12, 6, 54, 13, 2, 25, 3, 10, 8, 7, 16, 2, 13, 4, 14, 2, 157]
tweet_id: 905716028390481921
padded_int_arr: [22, 19, 2, 18, 2, 28, 2, 21, 26, 2, 14, 2, 1
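
Since only some entries of `dl.all_data()` carry a "label" attribute, a simple key check is enough to separate the two groups. The sketch below uses toy dictionaries with tweet ids taken from the outputs in this notebook; `split_by_label` is a hypothetical helper, not part of Data_loader.

```python
def split_by_label(data_points):
    """Separate data points that carry a 'label' key from those that do not."""
    labeled = [dp for dp in data_points if 'label' in dp]
    unlabeled = [dp for dp in data_points if 'label' not in dp]
    return labeled, unlabeled


# Toy stand-ins for entries of all_data
toy_all = [
    {'tweet_id': 832508907998244865, 'label': 'Aggression'},
    {'tweet_id': 905716028390481921},  # no 'label' key, so unlabeled
]
labeled, unlabeled = split_by_label(toy_all)
print(len(labeled), len(unlabeled))
```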

## New features requested on 5/2

In [7]:
tr, val = dl.unlabeled_tr_val()
# each of tr, val is a list of dictionaries
print('size of training set %d: ' % len(tr))
print('a tweet dictionary looks like')
print(val[0])

size of training set 818305: 
a tweet dictionary looks like
{'created_at': datetime.datetime(2014, 10, 12, 17, 12, 52), 'user_mentions': [], 'user_post': 239, 'int_arr': [44, 8, 7, 7, 6, 2, 17, 12, 6, 25, 2, 9, 4, 15, 3, 2, 6, 20, 17, 9], 'tweet_id': 521347782691815424, 'padded_int_arr': [44, 8, 7, 7, 6, 2, 17, 12, 6, 25, 2, 9, 4, 15, 3, 2, 6, 20, 17, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


### To get more "formatted" info about a tweet dictionary, use the following function

In [8]:
dl.print_recovered_tweet(val[0])

created_at: 2014-10-12 17:12:52
user_mentions: []
user_post: 239
int_arr: [44, 8, 7, 7, 6, 2, 17, 12, 6, 25, 2, 9, 4, 15, 3, 2, 6, 20, 17, 9]
tweet_id: 521347782691815424
padded_int_arr: [44, 8, 7, 7, 6, 2, 17, 12, 6, 25, 2, 9, 4, 15, 3, 2, 6, 20, 17, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
original tweet content: Finna grab some acgs
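
With several hundred thousand unlabeled tweets, it is usually more practical to consume them in mini-batches. Here is a minimal sketch, assuming each data point has a *padded_int_arr* of the same length; the generator `batches` and the batch size are hypothetical choices, not something Data_loader provides.

```python
def batches(data_points, batch_size):
    """Yield lists of padded_int_arr, batch_size data points at a time."""
    for start in range(0, len(data_points), batch_size):
        chunk = data_points[start:start + batch_size]
        yield [dp['padded_int_arr'] for dp in chunk]


# Toy stand-ins for the list returned by dl.unlabeled_tr_val()
toy_tr = [{'padded_int_arr': [i, 0, 0]} for i in range(1, 6)]
sizes = [len(batch) for batch in batches(toy_tr, batch_size=2)]
print(sizes)  # the last batch may be smaller than batch_size
```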


### Notice that the option argument (cell 1) determines whether int_arr is at the character level or the word level

In [9]:
# token2property is a map from a token to a property dictionary
# note that all tokens are stored as bytes (e.g. b'why'), not str
token2property = dl.token2property
if option == 'word':
    word = b'why'
    print(token2property[word])
    print('id = %d' % token2property[word]['id'])
if option == 'char':
    char = '😓'.encode('utf-8')
    print(token2property[char])
    print('id = %d' % token2property[char]['id'])
id2token = dl.id2token
print(id2token[10])

{'occurence_in_labeled': 51, 'id': 135, 'isemoji': True, 'occurence_in_unlabeled': 3561}
id = 135
h
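
An id-to-token map can also be used to turn an *int_arr* back into text. The sketch below uses a partial character table recovered by aligning the *int_arr* of "Rip Lil bro tyquan" with its printed tweet content earlier in this notebook. `TOY_ID2TOKEN` and `decode` are hypothetical names; the real `dl.id2token` covers the full vocabulary and may differ in type, and treating id 0 as padding is an assumption based on the printed *padded_int_arr* values.

```python
# Partial id -> character table, recovered from one example in this notebook.
# The real dl.id2token covers the full vocabulary and may differ in type.
TOY_ID2TOKEN = {
    22: 'R', 8: 'i', 24: 'p', 2: ' ', 40: 'L', 11: 'l', 25: 'b', 12: 'r',
    4: 'o', 5: 't', 13: 'y', 93: 'q', 14: 'u', 6: 'a', 7: 'n',
}


def decode(int_arr, id2token, pad_id=0, unknown='?'):
    """Map ids back to characters, skipping padding (assumed to be id 0)."""
    return ''.join(id2token.get(i, unknown) for i in int_arr if i != pad_id)


padded = [22, 8, 24, 2, 40, 8, 11, 2, 25, 12, 4, 2, 5, 13, 93, 14, 6, 7, 0, 0]
print(decode(padded, TOY_ID2TOKEN))
```

For real use, `dl.print_recovered_tweet` remains the simpler option; the sketch only shows what the mapping looks like.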
