Copy this notebook and don't forget to **ADD YOUR NAME** in the name of the copy.

Answers to send by email by next Wednesday evening:


*   subject : **ML1 LAB4**
*   (and not ML1-LAB4 or ML1LAB4) etc...
*   with the link to your colab, with edit rights (cf. sharing settings)



## TODO1 : linear prediction in numpy

Suppose we have already trained a 3-class classifier, with weight matrix W and bias vector b.

(below we just set them randomly, but suppose they are the result of a learning phase)

Suppose we have 4 input objects to classify (matrix X below).

Implement how to predict the class
- for the full batch X
- for a single row in X (take the first row)

Tip: look up the numpy `argmax` function


In [29]:
import numpy as np

# weight matrix and bias vector
W = np.random.rand(10,3)  # one column of weights per class
b = np.random.rand(3)     # one bias value per class

# 4 input vectors of size 10
X = np.random.rand(4,10)

# Predict the class for the full batch X
scores = np.dot(X, W) + b  # shape (4, 3): one score per input and per class
predictions = np.argmax(scores, axis=1)
print("predictions:", predictions)


predictions: [1 1 1 0]
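
The cell above covers the full batch only. A minimal sketch of the single-row case follows, assuming the same shapes as above. The fixed seed and the names `rng`, `scores`, `x0`, `batch_predictions` and `single_prediction` are illustrative additions, not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

W = rng.random((10, 3))  # one column of weights per class
b = rng.random(3)        # one bias value per class
X = rng.random((4, 10))  # 4 input vectors of size 10

# Full batch: scores has shape (4, 3); b is broadcast over the rows
scores = X @ W + b
batch_predictions = np.argmax(scores, axis=1)

# Single row: x0 has shape (10,), so its scores have shape (3,)
x0 = X[0]
single_prediction = np.argmax(x0 @ W + b)

print("batch predictions:", batch_predictions)
print("first row prediction:", single_prediction)
```

For a single vector there is only one axis, so `argmax` needs no `axis` argument; the result should agree with the first entry of the batch predictions.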


# The scikit-learn python framework

[Scikit-learn](https://scikit-learn.org/stable/) is a machine learning python library, which implements various regression, classification and clustering algorithms.

It's imported as `sklearn`.

Note that sklearn is widely used for linear and log-linear models, and less so for deep learning.

The classification and regression parts on the home page both point to the same general page on supervised learning: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

# Scikit-learn vectorizers to get BOW vectors

In the last lab, we coded how to transform a collection of documents into matrices of "**bag of words**" (BOW) representations of these documents (the X_train and X_test matrices of LAB2 and LAB3).

Scikit-learn provides "vectorizer" classes to do that:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

These matrices use the scipy.sparse type, which is appropriate for **sparse matrices**.

All the vectorizer classes have 3 methods:
- **fit** : builds the vocabulary and the correspondence between word forms and word ids
- **transform** : transforms the documents into matrices of counts
- **fit_transform** : performs both actions
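
As a rough illustration of how the three methods relate, the sketch below checks that `fit` followed by `transform` gives the same result as `fit_transform`. The toy corpus `docs` and the variable names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["le chat dort", "le chien dort", "le chat mange"]  # toy corpus (illustrative)

# Two steps: build the vocabulary, then count
two_step = CountVectorizer()
two_step.fit(docs)
X_two_step = two_step.transform(docs)

# One step: both actions at once
one_step = CountVectorizer()
X_one_step = one_step.fit_transform(docs)

# Same vocabulary and same counts in both cases
same_vocab = two_step.vocabulary_ == one_step.vocabulary_
same_counts = np.array_equal(X_two_step.toarray(), X_one_step.toarray())
print(same_vocab, same_counts)
```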

In [8]:
#
from sklearn.feature_extraction.text import CountVectorizer

# a French corpus (to see what is going on with diacritics)
train_corpus = [
     'Ceci est un document.',
     "Aujourd'hui, ce document est à moi.",
     'Et voilà le troisième.',
     'Le premier document est-il le plus intéressant?',
 ]
vectorizer = CountVectorizer()

# the vectorizer has not been fitted yet: the two lines below would raise an error
#print(vectorizer.vocabulary_)
#print(vectorizer.get_feature_names_out())


## The fit_transform method

Calling `fit_transform` on train_corpus will :
- tokenize the text : it will split it into "words" using a regular expression to define what can be a separator between words
  - NB: this is a very uninformed and rough tokenization, meaning the obtained tokens are not always words as defined in linguistics
- identify the vocabulary and associate an id to each element of the vocabulary
- AND transform the training set into a matrix of BOW vectors
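
To see the tokenization step in isolation, one option is `build_analyzer()`, which returns the function the vectorizer applies to each document. The sketch below uses the default settings; the names `analyze` and `tokens` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns a callable that does the preprocessing
# (lowercasing by default) and the tokenization with the default
# token_pattern r"(?u)\b\w\w+\b" (tokens of at least 2 word characters)
analyze = CountVectorizer().build_analyzer()

tokens = analyze("Aujourd'hui, ce document est à moi.")
print(tokens)
# ['aujourd', 'hui', 'ce', 'document', 'est', 'moi']
```

The apostrophe splits "Aujourd'hui" into two tokens, and the one-letter word "à" is dropped, which illustrates why this tokenization is only a rough one.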

In [11]:

X_train = vectorizer.fit_transform(train_corpus)

# the matrix is sparse
print("type of X_train", type(X_train))
print("shape of X_train", X_train.shape)
print(X_train)

# here it is as a standard matrix
print(X_train.toarray())

type of X_train <class 'scipy.sparse._csr.csr_matrix'>
shape of X_train (4, 16)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 21 stored elements and shape (4, 16)>
  Coords	Values
  (0, 2)	1
  (0, 4)	1
  (0, 14)	1
  (0, 3)	1
  (1, 4)	1
  (1, 3)	1
  (1, 0)	1
  (1, 6)	1
  (1, 1)	1
  (1, 10)	1
  (2, 5)	1
  (2, 15)	1
  (2, 9)	1
  (2, 13)	1
  (3, 4)	1
  (3, 3)	1
  (3, 9)	2
  (3, 12)	1
  (3, 7)	1
  (3, 11)	1
  (3, 8)	1
[[0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0]
 [1 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1]
 [0 0 0 1 1 0 0 1 1 2 0 1 1 0 0 0]]


In [4]:
# here is the mapping between word forms and ids (our "w2i" in previous lab session)
print(vectorizer.vocabulary_)

# the list of word forms (our i2w)
print(vectorizer.get_feature_names_out())



{'ceci': 2, 'est': 4, 'un': 14, 'document': 3, 'aujourd': 0, 'hui': 6, 'ce': 1, 'moi': 10, 'et': 5, 'voilà': 15, 'le': 9, 'troisième': 13, 'premier': 12, 'il': 7, 'plus': 11, 'intéressant': 8}
['aujourd' 'ce' 'ceci' 'document' 'est' 'et' 'hui' 'il' 'intéressant' 'le'
 'moi' 'plus' 'premier' 'troisième' 'un' 'voilà']


## TODO2 : answer the following comprehension questions:
- What is the size of the vocabulary?

The vocabulary size is 16.

- What does the 4th column of X_train.toarray() represent?

The 4th column of X_train.toarray() (column index 3) holds the number of times the word with id 3 ("document" in this case) appears in each document.

- What is printed when printing the sparse matrix?

Printing the sparse matrix does not show a dense grid of values. It shows a summary line (format, dtype, number of stored elements, shape), followed by the coordinates and the value of each non-zero element. Storing only the non-zero elements uses much less memory, and is usually faster to compute with, when most entries are zero.

## The transform method


In [16]:
test_corpus = [ 'Ah un nouveau document.',
              'Et ceci est encore un document.']
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)
print(vectorizer.vocabulary_)
print("shape of X_test", X_test.shape)

{'ceci': 2, 'est': 4, 'un': 14, 'document': 3, 'aujourd': 0, 'hui': 6, 'ce': 1, 'moi': 10, 'et': 5, 'voilà': 15, 'le': 9, 'troisième': 13, 'premier': 12, 'il': 7, 'plus': 11, 'intéressant': 8}
shape of X_test (2, 16)


## TODO3: analysis
- What happened to the words in test_corpus that are not present in train_corpus?

They are simply ignored: the vocabulary was built from train_corpus only, so words that appear only in test_corpus ("ah", "nouveau", "encore") have no column in the matrix and are not counted.

- Compare to vectorizer.fit_transform

`fit_transform` builds the vocabulary (the correspondence between word forms and word ids) and then transforms the documents into a matrix of counts. `transform` only does the second step, reusing the vocabulary built by a previous call to `fit` or `fit_transform`.
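
The behaviour on unknown words can be checked with a short sketch like the one below, which assumes the `train_corpus` and `test_corpus` defined earlier. The names `analyze`, `ignored` and `counted` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = [
    'Ceci est un document.',
    "Aujourd'hui, ce document est à moi.",
    'Et voilà le troisième.',
    'Le premier document est-il le plus intéressant?',
]
test_corpus = ['Ah un nouveau document.',
               'Et ceci est encore un document.']

vectorizer = CountVectorizer()
vectorizer.fit(train_corpus)                # vocabulary comes from the training set only
X_test = vectorizer.transform(test_corpus)  # no new column is created here

# Tokens of each test document that are missing from the vocabulary
analyze = vectorizer.build_analyzer()
ignored = [[t for t in analyze(doc) if t not in vectorizer.vocabulary_]
           for doc in test_corpus]

# Number of tokens actually counted in each test document
counted = X_test.toarray().sum(axis=1)

print("ignored:", ignored)
print("counted tokens per document:", counted)
```

The number of columns of `X_test` stays equal to the training vocabulary size, whatever the test documents contain.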

## TODO4: changing the parameters

We are now providing input sentences in which tokens have all been separated by spaces.

1.   How can you change the tokenization that the CountVectorizer will use ? (see its constructor)

The tokenization can be changed with the `token_pattern` parameter, a regular expression that defines what counts as a token. This parameter is only used when `analyzer` is 'word', which is the default. Alternatively, the `tokenizer` parameter accepts a custom tokenization function.

2.   in particular, how can you make CountVectorizer split on spaces only?

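Setting `token_pattern` to r"(?u)[\w\-\']+" keeps hyphens and apostrophes inside tokens (so "aujourd'hui" and "-il" stay whole), but it still discards tokens made only of punctuation, such as "." or "?". To split on whitespace only, a pattern such as r"\S+" (any run of non-whitespace characters) can be used instead.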

Hint: study https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html to see all the parameters of the constructor, and deduce which ones to modify:


3.   Find out which parameters to modify to switch to bigram and trigram **character** features, and print the obtained vocabulary
    - this means that the vocabulary will not be made of words, but of sequences of characters, of length 2 (character bigram) or 3 (character trigram)

For character bigrams only, set the `analyzer` parameter to 'char' and the `ngram_range` parameter to (2, 2). For character trigrams only, set `ngram_range` to (3, 3). To get bigrams and trigrams together, set `ngram_range` to (2, 3).

4.   Study the TfidfVectorizer class
 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and deduce how to easily obtain TF.IDF weighted vector representations of the documents


In [46]:
train_corpus = [
     'Ceci est un document .',
     "Aujourd'hui , ce document est encore un document à moi .",
     'Et voilà le troisième .',
     'Le premier document est -il le plus intéressant ?',
 ]


# analyzer='word' is the default; this token_pattern keeps hyphens and apostrophes inside tokens
vectorizer = CountVectorizer(analyzer='word', token_pattern=r"(?u)[\w\-\']+")
X_train = vectorizer.fit_transform(train_corpus)
vectorizer.get_feature_names_out()

array(['-il', "aujourd'hui", 'ce', 'ceci', 'document', 'encore', 'est',
       'et', 'intéressant', 'le', 'moi', 'plus', 'premier', 'troisième',
       'un', 'voilà', 'à'], dtype=object)

In [39]:
train_corpus = [
     'Ceci est un document .',
     "Aujourd'hui , ce document est encore un document à moi .",
     'Et voilà le troisième .',
     'Le premier document est -il le plus intéressant ?',
 ]
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_corpus)
# TF.IDF weights of the first document
print(X_train.toarray()[0])

[0.         0.         0.64065543 0.40892206 0.         0.40892206
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.5051001  0.        ]
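
For comparison, the scikit-learn documentation describes `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks this on the corpus above with default parameters; the names `X_direct`, `counts`, `X_two_step` and `row_norms` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

train_corpus = [
    'Ceci est un document .',
    "Aujourd'hui , ce document est encore un document à moi .",
    'Et voilà le troisième .',
    'Le premier document est -il le plus intéressant ?',
]

# Direct route: raw text to TF.IDF weights
X_direct = TfidfVectorizer().fit_transform(train_corpus)

# Two-step route: raw counts first, then TF.IDF reweighting
counts = CountVectorizer().fit_transform(train_corpus)
X_two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(X_direct.toarray(), X_two_step.toarray()))

# With the default norm='l2', each document vector has unit length
row_norms = np.linalg.norm(X_direct.toarray(), axis=1)
print(row_norms)
```

The two-step route can be handy when the count matrix already exists, for instance from an earlier `CountVectorizer` cell.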
