## Q1) Vector compression

You want to compress an array that often contains runs of the same integer. Reduce the array to a compressed array that lists, for each run in order, the integer followed by the number of consecutive times it appears.

```
Input: arr = [1,1,1,5,3,3,9,3,3,3,3,3]
Output: [1,3,5,1,3,2,9,1,3,5]
```


In [1]:
from itertools import groupby

def compress(a):
    compressed = []
    
    # groupby yields one (value, run) pair per run of equal consecutive items
    # [1,1,1,5,3,3,9,3,3,3,3,3] -> values [1, 5, 3, 9, 3], run lengths [3, 1, 2, 1, 5]
    for value, run in groupby(a):
        # append the value, then the length of its run
        compressed.extend([value, sum(1 for _ in run)])
    
    return compressed

In [2]:
arr = [1,1,1,5,3,3,9,3,3,3,3,3]
compress(arr)

[1, 3, 5, 1, 3, 2, 9, 1, 3, 5]
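
The same result can be obtained without `itertools`. The following is a minimal sketch rather than part of the original solution; the name `compress_manual` is hypothetical. It walks the array once and either extends the current run or starts a new one.

```python
def compress_manual(arr):
    """Run-length encode arr as a flat list [value, count, value, count, ...]."""
    compressed = []
    for value in arr:
        # compressed[-2] is the value of the current run, compressed[-1] its count.
        if compressed and compressed[-2] == value:
            compressed[-1] += 1
        else:
            # Start a new run with a count of 1.
            compressed.extend([value, 1])
    return compressed


print(compress_manual([1, 1, 1, 5, 3, 3, 9, 3, 3, 3, 3, 3]))
# prints [1, 3, 5, 1, 3, 2, 9, 1, 3, 5]
```

An empty input yields an empty list, and a single element `x` yields `[x, 1]`.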

## Q2) Cross-validation

Given a list L of values generated independently by some unknown process, we use the mean of L to predict unseen values generated by the same process. Use leave-one-out cross-validation to estimate the mean absolute error of this predictor: hold out each value in turn, predict it with the mean of the remaining values, and average the absolute errors.

```
- Input: An array of floats arr
- Output: A float score

Example:
- arr = [1, 2, 3]
- score = 1.0

Explanation:
1. Leave 1 out:
    - Mean = (2+3)/2 = 2.5
    - Absolute error = |1 - 2.5| = 1.5
   
2. Leave 2 out:
    - Mean = (1+3)/2 = 2
    - Absolute error = |2 - 2| = 0
    
3. Leave 3 out:
    - Mean = (1+2)/2 = 1.5
    - Absolute error = |3 - 1.5| = 1.5
    
MAE = average of the absolute errors = (1.5 + 0 + 1.5)/3 = 1.0
```


In [3]:
import numpy as np

def loocv(x):
    
    abs_errors = []
    
    for i in range(len(x)):
        # Leave One Out = exclude the element at position i
        # (filter by index, not by value, so repeated values are kept)
        y = [el for j, el in enumerate(x) if j != i]
        
        # the element that is left out is the value to predict
        y_true = x[i]
        
        # prediction = average of the remaining elements in the list
        prediction = np.mean(y)
        
        # absolute error of this fold
        abs_errors.append(abs(y_true - prediction))
        
    # MAE of LOOCV = average of the absolute errors over all folds
    mae_loocv = float(np.mean(abs_errors))
    return mae_loocv

In [4]:
true = [1,2,3]
loocv(true)

1.0
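
Because the predictor is a plain mean, each leave-one-out mean can be derived from the total: if S is the sum of all n values, leaving out x_i gives (S - x_i)/(n - 1). The following is a minimal sketch based on that observation, in pure Python; the name `loocv_closed_form` is hypothetical, and it assumes at least two values.

```python
def loocv_closed_form(values):
    """LOOCV mean absolute error of the mean predictor, without rebuilding each fold."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to leave one out")
    total = sum(values)
    # Leaving out x gives the mean (total - x) / (n - 1) of the remaining values.
    abs_errors = [abs(x - (total - x) / (n - 1)) for x in values]
    return sum(abs_errors) / n


print(loocv_closed_form([1, 2, 3]))  # prints 1.0
```

This avoids building a new list for every fold, so the work is linear in the length of the input rather than quadratic.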

## Q3) Tokenization

Given a list of sentences:

- convert words to lower case
- tokenize each sentence by white space
- build a word-count matrix with one row per sentence and one column per unique word, with columns in alphabetical order

```
Input: ['This is ball','My ball','This is cat Cat']
Output: [[1 0 1 0 1], [1 0 0 1 0], [0 2 1 0 1]] 

               ball, cat, is, my, this
sentence 1:    1     0    1   0   1
sentence 2:    1     0    0   1   0
sentence 3:    0     2    1   0   1
```

### Solution 1: Custom count vectorizer

In [5]:
import itertools
from collections import Counter
from scipy.sparse import csr_matrix

def custom_fit(a):
    
    # convert words to lower case and split by white space
    tokens = [sub.lower().split() for sub in a]
    
    # bag of words = unique words across all sentences, sorted alphabetically
    bow = sorted(set(itertools.chain.from_iterable(tokens)))
    
    # map each word to its column index
    vocab = {word: index for index, word in enumerate(bow)}
    print('bag of words:', vocab)
    
    # lists holding the row index, column index and count of each non-zero entry
    row, col, val = [], [], []
    
    for idx, words in enumerate(tokens):
        # count the occurrences of each word in the sentence
        for word, count in Counter(words).items():
            row.append(idx)
            col.append(vocab[word])
            val.append(count)
    
    # build the sparse matrix, then convert it to a dense array
    word_matrix = csr_matrix((val, (row, col)), shape=(len(a), len(vocab))).toarray()
    return word_matrix

In [6]:
text = ['This is ball','My ball','This is cat Cat']
custom_fit(text)

bag of words: {'ball': 0, 'cat': 1, 'is': 2, 'my': 3, 'this': 4}


array([[1, 0, 1, 0, 1],
       [1, 0, 0, 1, 0],
       [0, 2, 1, 0, 1]])
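
For small inputs the sparse matrix is not strictly needed. The following is a minimal sketch that uses only the standard library and returns nested lists rather than a NumPy array; the name `count_matrix` is hypothetical.

```python
from collections import Counter


def count_matrix(sentences):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in sentence i."""
    # Lower-case each sentence and split it on white space.
    tokenized = [sentence.lower().split() for sentence in sentences]
    # Unique words in alphabetical order define the columns.
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        # A Counter returns 0 for words that do not occur in the sentence.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = count_matrix(['This is ball', 'My ball', 'This is cat Cat'])
print(vocabulary)  # prints ['ball', 'cat', 'is', 'my', 'this']
print(matrix)      # prints [[1, 0, 1, 0, 1], [1, 0, 0, 1, 0], [0, 2, 1, 0, 1]]
```

The dense form stores a zero for every absent word, so the sparse representation remains the better fit for large vocabularies.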

### Solution 2: With scikit-learn's CountVectorizer

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

def token_sk(a):
    
    # CountVectorizer lowercases its input by default (lowercase=True),
    # so no manual conversion is needed
    vectorizer = CountVectorizer()
    vectorizer.fit(a)
    
    # print the unique words along with their column indices
    # note: the default token pattern keeps only tokens of two or more
    # alphanumeric characters, so single-character words such as "a" are dropped
    print("Vocabulary: ", vectorizer.vocabulary_)
    
    # encode the sentences as a matrix of word counts
    vector = vectorizer.transform(a)
    print(vector.toarray())

In [8]:
text = ['This is ball','My ball','This is cat Cat']
token_sk(text)

Vocabulary:  {'this': 4, 'is': 2, 'ball': 0, 'my': 3, 'cat': 1}
[[1 0 1 0 1]
 [1 0 0 1 0]
 [0 2 1 0 1]]


---------