- This script converts multi-label data in numpy ndarray format to the format that is used by the MIMLSVM code (which can be downloaded here: http://lamda.nju.edu.cn/code_MIMLBoost%20and%20MIMLSVM.ashx)

- The first step is optional (you can just use the second step if your data is already in vectorized format)

```matlab
%       MIML_TO_MLL takes,
%           train_bags       - An M1x1 cell, the jth instance of the ith training bag is stored in train_bags{i,1}(j,:)
%           train_target     - A QxM1 array, if the ith training bag belongs to the jth class, then train_target(j,i) equals +1, otherwise train_target(j,i) equals -1
%           test_bags        - An M2x1 cell, the jth instance of the ith test bag is stored in test_bags{i,1}(j,:)
%           test_target      - A QxM2 array, if the ith test bag belongs to the jth class, test_target(j,i) equals +1, otherwise test_target(j,i) equals -1
```

Notes:

- M1 = number of training bags
- M2 = number of test bags
- Q = number of tags
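
As a rough illustration of the layout described above, the sketch below builds a toy dataset with M1 = 3 bags and Q = 2 tags using only NumPy. The variable names and the feature dimension are made up for the example; only the shapes and the +1/-1 convention come from the MIML_TO_MLL comment.

```python
import numpy as np

num_features = 4  # hypothetical feature dimension

# number of instances in each toy bag (bags may differ in size)
instances_per_bag = [2, 5, 1]

# an object array of shape (M1, 1) plays the role of an M1x1 MATLAB cell
toy_bags = np.empty((len(instances_per_bag), 1), dtype=object)
for i, n in enumerate(instances_per_bag):
    # row j of this matrix is the jth instance of the ith bag
    toy_bags[i, 0] = np.zeros((n, num_features))

# 0/1 indicators with one row per bag and one column per tag
indicators = np.array([[1, 0],
                       [0, 1],
                       [1, 1]])

# convert to +1/-1 and transpose so the result is Q x M1
toy_targets = np.where(indicators == 1, 1, -1).T

print(toy_bags.shape, toy_targets.shape)
```

Each bag keeps its own number of rows, which is why the bags are held in an object array rather than a single 3-D array.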

In [145]:
import os
import sys

import nltk
import numpy as np
import pandas as pd

from nltk import TextTilingTokenizer
from scipy.io import savemat
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler,MinMaxScaler

pd.set_option('display.max_colwidth',1000)

%matplotlib inline
%load_ext autoreload
%autoreload 1

In [8]:
src_dir = os.path.join(os.getcwd(), os.pardir, '../')
sys.path.append(src_dir)

In [9]:
%aimport src.data.movielens_20m_imdb
%aimport src.helpers.labels,src.helpers.neighbours, src.helpers.segments
%aimport src.utils.dataframes, src.utils.clusters, src.utils.metrics

In [10]:
from src.data.movielens_20m_imdb import load_df_or_get_from_cache
from src.helpers.labels import truncate_labels

from src.utils.dataframes import sample_rows

In [126]:
MODELS_ROOT = os.path.abspath("../../models/ranking/movielens-ovr-linear-svc-calibrated/")
INTERIM_DATA_ROOT = os.path.abspath("../../data/interim/movielens-ml20m-imdb/")
PATH_TO_PROCESSED_FILE = os.path.abspath('../../data/processed/movielens-20m-imdb-tags-and-synopses-2017-12-20.csv')

OUT_PATH = '/home/felipe/Downloads/MIMLBoost&MIMLSVM/sample/testing/'

# CONFIGS
SEED = 42
MAX_NB_WORDS = 200
SAMPLING_FACTOR = 0.3
MIN_TAG_DF = 10

W = 20 # Pseudosentence size (in words) - not specified in the paper, taken from TextTiling default values
K = 10 # Size (in sentences) of the block used in the block comparison method - not specified in the paper, taken from TextTiling default values

In [18]:
np.random.seed(SEED)

In [20]:
docs_df = load_df_or_get_from_cache(PATH_TO_PROCESSED_FILE,INTERIM_DATA_ROOT)

docs_df = sample_rows(docs_df,SAMPLING_FACTOR)

In [42]:
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
docs_df['sentences'] = docs_df['synopsis'].map(lambda row: sentence_tokenizer.tokenize(row))

In [44]:
tok = TextTilingTokenizer(w=W, k=K)

In [45]:
def extract_segments(candidates):
    try:
        # We must manually insert "\n\n" because this is how
        # TextTilingTokenizer requires candidate boundaries to be
        # represented.
        segments = tok.tokenize("\n\n".join(candidates))
    except ValueError:
        # This happens when the candidate list is too small for the
        # text tiling tokenizer to be able to find segments, so just return
        # the original sentences.
        segments = candidates

    # Now remove the artificially added chars.
    segments = [segment.replace("\n\n", " ").strip() for segment in segments]

    return segments
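
The join and clean-up steps in extract_segments can be seen in isolation with plain strings. In the sketch below, raw_segments is a hand-written stand-in for what a tokenizer might return (it is not real TextTilingTokenizer output), so only the string handling is illustrated.

```python
candidates = ["First sentence.", "Second sentence.", "Third sentence."]

# sentences are joined with blank lines so that each sentence boundary
# looks like a paragraph break, i.e. a candidate segment boundary
joined = "\n\n".join(candidates)

# hypothetical tokenizer output: the first two sentences form one segment
raw_segments = ["First sentence.\n\nSecond sentence.\n\n", "Third sentence."]

# remove the artificial paragraph breaks again
cleaned = [segment.replace("\n\n", " ").strip() for segment in raw_segments]

print(cleaned)
```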

In [46]:
docs_df['segments'] = docs_df['sentences'].map(lambda candidates: extract_segments(candidates))
docs_df['num_segments'] = docs_df['segments'].map( lambda sents: len(sents))

In [81]:
docs_df[['num_segments']].describe()

| | num_segments |
|---|---|
| count | 2013.0 |
| mean | 9.982116 |
| std | 10.242181 |
| min | 1.0 |
| 25% | 3.0 |
| 50% | 7.0 |
| 75% | 12.0 |
| max | 92.0 |


In [87]:
documents = docs_df['synopsis'].values
segments = docs_df['segments'].values

labels = docs_df["tags"].map(lambda tagstring: tagstring.split(","))
labels = truncate_labels(labels,MIN_TAG_DF)
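
truncate_labels comes from this project's src.helpers.labels module and its implementation is not shown here. Judging by the MIN_TAG_DF argument, it presumably drops tags that occur in too few documents; that is an assumption, and the real helper may behave differently. A minimal sketch of such a document-frequency filter, with the hypothetical name truncate_labels_sketch:

```python
from collections import Counter


def truncate_labels_sketch(labelsets, min_doc_freq):
    """Hypothetical stand-in: keep only tags seen in at least min_doc_freq documents."""
    # count in how many documents each tag occurs (set() avoids double counting)
    doc_freq = Counter(tag for tags in labelsets for tag in set(tags))
    return [[tag for tag in tags if doc_freq[tag] >= min_doc_freq]
            for tags in labelsets]


toy_labels = [["drama", "war"], ["drama"], ["comedy", "drama"]]
truncated = truncate_labels_sketch(toy_labels, min_doc_freq=2)

print(truncated)
```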

In [101]:
documents_train, documents_test, segments_train, segments_test, target_train, target_test = train_test_split(
    documents, 
    segments,
    labels,
    test_size=0.15,
    random_state=SEED)

print('total number of train documents: {}'.format(documents_train.shape[0]))
print('total number of validation documents: {}'.format(documents_test.shape[0]))
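# count the distinct tags directly, since no label binarizer has been fitted at this point
print("total number of unique tags: {} ".format(len({tag for tags in labels for tag in tags})))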

total number of train documents: 1711
total number of validation documents: 302
total number of unique tags: 608 


In [157]:
vectorizer = TfidfVectorizer(max_features=MAX_NB_WORDS).fit(documents_train)
label_binarizer = MultiLabelBinarizer().fit(labels)

## vectorize each instance in each bag
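
One way to turn a bag into a matrix with one row per instance is to fit the vectorizer on whole documents and then transform the segments of a single bag. The toy sentences and variable names below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_documents = [
    "the cat sat on the mat. the cat slept.",
    "the dog chased the cat in the yard. the dog barked.",
]

# the vocabulary is learned from whole documents, capped at 5 terms
toy_vectorizer = TfidfVectorizer(max_features=5).fit(toy_documents)

# one bag = the segments (instances) of one document
toy_bag = ["the dog chased the cat in the yard.", "the dog barked."]

# rows are instances, columns are vocabulary terms
bag_matrix = toy_vectorizer.transform(toy_bag).toarray()

print(bag_matrix.shape)
```

Because the vocabulary is fixed at fit time, every bag ends up with the same number of columns, while the number of rows follows the number of segments in that bag.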

In [153]:
documents_train[0]

'September 1960: on a purely punitive basis, eight scouts must climb the solid mass of Brévent to 2500 meters of altitude. The so beautiful and so majestic mountain which draws up face them very quickly reveals dangerous. All the techniques of orientation learned at the scouts will do nothing there. The teenagers find themselves delivered to themselves. Lost in the abrupt throats, the eight boys are confronted cold, with the hunger and the fear. A tension starts to reign between these young people of different origins and of which concerns divergent: some worry for Algeria, for a brother left to the combat, or a family threatened of expulsion in metropolis, still the girls worry to like others. The group divides then to find an exit, but one of them disappears in water frozen from a torrent. While young people try to protect itself in a closed down refuge, the others will seek the helps...'

In [181]:
# right now, segments_train and segments_test are arrays of lists of strings.
# each bag becomes one cell holding a (num_instances x num_features) matrix,
# which is the M1x1 cell layout described at the top.

num_bags = 100

# savemat writes an object array of shape (num_bags, 1) as a MATLAB cell;
# a plain np.array() over ragged lists fails on recent numpy versions
np_vectorized_segments = np.empty((num_bags, 1), dtype=object)

for i in range(num_bags):
    # vectorize all segments (instances) of the ith bag in one call
    np_vectorized_segments[i, 0] = vectorizer.transform(segments_train[i]).toarray()

train_labels = label_binarizer.transform(target_train[:num_bags])

# they want +1/-1 indicators
train_labels[train_labels == 0] = -1

# transpose so that targets is QxM1 (tags x bags)
savemat(os.path.join(OUT_PATH, 'train.mat'), {'bags': np_vectorized_segments, 'targets': train_labels.T})

In [182]:
# right now, segments_train and segments_test are arrays of lists of strings.
# each bag becomes one cell holding a (num_instances x num_features) matrix,
# which is the M2x1 cell layout described at the top.

num_bags = 100

# savemat writes an object array of shape (num_bags, 1) as a MATLAB cell;
# a plain np.array() over ragged lists fails on recent numpy versions
np_vectorized_segments = np.empty((num_bags, 1), dtype=object)

for i in range(num_bags):
    # vectorize all segments (instances) of the ith bag in one call
    np_vectorized_segments[i, 0] = vectorizer.transform(segments_test[i]).toarray()

test_labels = label_binarizer.transform(target_test[:num_bags])

# they want +1/-1 indicators
test_labels[test_labels == 0] = -1

# transpose so that targets is QxM2 (tags x bags)
savemat(os.path.join(OUT_PATH, 'test.mat'), {'bags': np_vectorized_segments, 'targets': test_labels.T})
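
A quick way to sanity-check a file written like this is to load it back with scipy.io.loadmat and inspect the shapes. The sketch below does so with a small made-up dataset written to a temporary directory, so it touches neither OUT_PATH nor the real data; the check_ names are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# two toy bags with 3 and 1 instances, 4 features each
check_bags = np.empty((2, 1), dtype=object)
check_bags[0, 0] = np.ones((3, 4))
check_bags[1, 0] = np.ones((1, 4))

# Q = 3 tags (rows), M = 2 bags (columns), +1/-1 indicators
check_targets = np.array([[1, -1],
                          [-1, 1],
                          [1, 1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'check.mat')
    savemat(path, {'bags': check_bags, 'targets': check_targets})
    loaded = loadmat(path)

print(loaded['bags'].shape)        # shape of the cell array
print(loaded['bags'][0, 0].shape)  # shape of the first bag
print(loaded['targets'].shape)     # tags x bags
```

If the loaded bags come back as a column of 2-D matrices and the targets as tags x bags, the file follows the layout in the MIML_TO_MLL comment.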