# Technical exercise - Data scientist intern @ Giskard

Hi! As part of our recruitment process, we’d like you to complete the following technical test in 10 days. Once you finish the exercise, you can send your notebook or share your code repository by email (matteo@giskard.ai). If you want to share a private GitHub repository, make sure you give read access to `mattbit`.

If you have problems running the notebook, get in touch with Matteo at matteo@giskard.ai.

In [None]:
%pip install numpy pandas scikit-learn datasets transformers torch "giskard>=2.0.0b"

## Exercise 1: Code review

Your fellow intern is working on securing our API and wrote some code to generate secure tokens. You have been asked to review their code and make sure it is secure and robust. Can you spot the problem and write some short feedback?

In [19]:
import string
import secrets

ALPHABET = string.ascii_letters + string.digits

def generate_secret_key_using_secrets(size: int = 20):
    """Generates a cryptographically secure random token using secrets module."""
    token = "".join(secrets.choice(ALPHABET) for _ in range(size))
    return token

# function testing
secret_key = generate_secret_key_using_secrets()
print(secret_key)

B1IKnhie5N9EOGUoW5BZ


The `random` module is not suitable for generating security tokens: it relies on a deterministic pseudo-random generator (the Mersenne Twister) and is not designed for cryptographic use. I therefore compared a few libraries that are better suited to generating secure tokens.

1) Secrets
    - Advantage: `secrets` is simple to use and is part of the Python standard library, so no third-party package needs to be installed.
    - Disadvantage: it offers fewer advanced features than some other libraries, which can be limiting in complex use cases.

2) PyJWT
    - Advantage: PyJWT creates and verifies JSON Web Tokens (JWT), a widely used standard for authentication and authorization, which eases interoperability with other services and libraries that support JWT.
    - Disadvantage: key management and the configuration of JWT signature algorithms can be complex, especially for beginners.

3) UUID
    - Advantage: the UUIDs generated by `uuid` are designed to be universally unique, which makes them suitable for many applications.
    - Disadvantage: UUIDs have a fixed length, which may not suit situations where variable-length tokens are required.

4) Cryptography
    - Advantage: `cryptography` offers advanced security and key management capabilities, making it an excellent choice for applications that require a high level of security.
    - Disadvantage: `cryptography` is powerful but can be more complex than necessary for simple tasks such as basic token generation.

The choice of library will depend on the needs. If simplicity is your priority, secrets or uuid may be enough. If you need JWT, PyJWT is a good choice. If you have high security requirements, cryptography may be the best option. Be sure to weigh these advantages and disadvantages according to your context of use.
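
To make the concern about `random` concrete, here is a minimal sketch. The function `insecure_token` is hypothetical (the original code under review is not shown here); it only illustrates that a seeded `random` generator reproduces the same "secret" every time.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits


def insecure_token(size: int = 20) -> str:
    """Hypothetical token generator built on `random`, for illustration only."""
    return "".join(random.choice(ALPHABET) for _ in range(size))


# Re-seeding the generator with the same value replays the same sequence,
# so anyone who can recover or guess the internal state can predict the tokens.
random.seed(2023)
first = insecure_token()
random.seed(2023)
second = insecure_token()

print(first == second)
```

The `secrets` module draws from the operating system's source of randomness instead, which is not seedable in this way.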

I therefore chose the `secrets` module and built `ALPHABET` from the constants of the `string` module, which is more readable than a hand-written character list and makes sure capital letters are included (see the code above).
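
As a complement, and assuming no custom alphabet is required, the `secrets` module also ships ready-made helpers. The sketch below uses `secrets.token_hex`, `secrets.token_urlsafe` and `secrets.compare_digest`; the wrapper name `tokens_match` is only illustrative.

```python
import secrets

# Built-in helpers; the argument is the number of random bytes, not characters.
hex_token = secrets.token_hex(16)      # 16 bytes -> 32 hexadecimal characters
url_token = secrets.token_urlsafe(16)  # 16 bytes -> 22 URL-safe Base64 characters


def tokens_match(provided: str, expected: str) -> bool:
    """Compare two ASCII tokens in constant time to reduce timing side channels."""
    return secrets.compare_digest(provided, expected)


print(hex_token, url_token, tokens_match(hex_token, hex_token))
```

Whether a custom alphabet or one of these helpers is preferable depends on where the token is used (URLs, headers, databases), so this is a suggestion rather than a requirement.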

## Exercise 2: High dimensions

Matteo, our ML researcher, is struggling with a dataset of 40-dimensional points. He’s sure there are some clusters in there, but he does not know how many. Can you help him find the correct number of clusters in this dataset?

Finding the number of clusters in 40-dimensional data is harder than in two dimensions, because the points cannot be inspected visually. A common approach is to run K-means for a range of candidate cluster counts and keep the count with the highest mean silhouette score. I use scikit-learn for this:

In [27]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

x = np.load("points_1.npy")

minimum_clusters = 2
maximum_clusters = 30

silhouette_scores = []
optimal_clusters = None
optimal_score = -1.0

for n_clusters in range(minimum_clusters, maximum_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    silhouette_scores.append(silhouette_avg)

    if silhouette_avg > optimal_score:
        optimal_score = silhouette_avg
        optimal_clusters = n_clusters
    
print(f"It looks like there are {optimal_clusters} clusters.")


It looks like there are 11 clusters.
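
To clarify what the silhouette score measures, here is a minimal NumPy sketch of the mean silhouette coefficient on a toy dataset. The function `mean_silhouette` and the four example points are illustrative assumptions, not part of the exercise data; in practice `sklearn.metrics.silhouette_score` should be preferred.

```python
import numpy as np


def mean_silhouette(points: np.ndarray, labels: np.ndarray) -> float:
    """Mean of s(i) = (b - a) / max(a, b) over all points (Euclidean distance)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise distance matrix
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton cluster: score is 0 by convention
        # a: mean distance to the other points of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(toy_points, np.array([0, 0, 1, 1]))  # natural grouping
bad = mean_silhouette(toy_points, np.array([0, 1, 0, 1]))   # mixed grouping
print(good, bad)
```

A labeling that matches the natural groups scores close to 1, while a labeling that mixes them scores below 0, which is why the loop above keeps the cluster count with the highest average score.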


Matteo is grateful for how you helped him with the cluster finding, and he has another problem for you. He has another high-dimensional dataset, but he thinks that those points could be represented in a lower dimensional space. Can you help him determine how many dimensions would be enough to well represent the data?

To determine how many dimensions are enough to represent the data, I use principal component analysis (PCA), again with scikit-learn, and keep the smallest number of components that preserves a sufficiently high proportion of the total variance:

In [33]:
import numpy as np
from sklearn.decomposition import PCA

x = np.load("points_2.npy")

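# Fit PCA with all components to obtain the full explained-variance spectrum
pca = PCA()
pca.fit(x)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches the threshold
threshold = 0.95
num_dimensions = np.argmax(cumulative_variance >= threshold) + 1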

print(f"It looks like the data is {num_dimensions}-dimensional")

It looks like the data is 25-dimensional


I chose a 95% cumulative explained variance threshold to limit the loss of information, which gives a 25-dimensional representation. For comparison, a 90% threshold gives 13 dimensions.
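
The threshold rule can be sanity-checked on data whose intrinsic dimension is known. The sketch below is an assumption-laden illustration: it builds hypothetical synthetic data with 3 latent dimensions embedded in 10, and computes the explained variance ratios directly from an SVD with NumPy instead of scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 3 latent dimensions embedded in 10, plus small noise
latent = rng.normal(size=(500, 3))
basis, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # 10 x 3 with orthonormal columns
data = latent @ basis.T + 0.01 * rng.normal(size=(500, 10))

# PCA via SVD of the centered data: the squared singular values are
# proportional to the variance explained by each principal component
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

# Same rule as above: first component count that reaches the threshold
threshold = 0.95
n_dims = int(np.argmax(cumulative >= threshold) + 1)
print(n_dims)
```

In this construction the three latent directions carry almost all of the variance, so the rule picks 3. On real data the answer depends on the chosen threshold, as the 95% versus 90% comparison shows, so inspecting the whole cumulative curve may be more informative than a single cut-off.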

## Exercise 3: Mad GPT

Matteo is a good guy but he is a bit messy: he fine-tuned a GPT-2 model, but it seems that something went wrong during the process and the model became obsessed with early Romantic literature.

Could you check how the model would continue a sentence starting with “Ty”? Could you recover the logit of the next best token? And its probability?

You can get the model from the HuggingFace Hub as `mattbit/gpt2wb`.


In [11]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("mattbit/gpt2wb")

# Definition and tokenization of the selected input text ('Ty')
input_text = "Ty"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Run a forward pass with gradient tracking disabled, since we are only doing inference and not training
with torch.no_grad():
    output = model(input_ids)

# Take the logits for the token that follows the last input position
logits = output.logits[0, -1, :]

# Index of the token with the highest logit
next_token_index = logits.argmax().item()

# Logit value of this token
next_token_logit = logits[next_token_index]

# Convert the logits into probabilities with softmax (logits are unnormalized scores,
# so a negative logit can still map to a high probability), expressed as a percentage
next_token_probability = F.softmax(logits, dim=0)[next_token_index].item() * 100

# Finally, decode the token to see what the model predicts after 'Ty'
next_token = tokenizer.decode(next_token_index)


print("Next Token:", next_token)
print("Logit:", next_token_logit)
print(f"Probability: {next_token_probability:.2f}%")

Next Token: ger
Logit: tensor(-16.2950)
Probability: 99.19%
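
A negative logit combined with a probability close to 100% may look surprising. The following pure-Python sketch, which uses made-up toy logits rather than the model's real values, illustrates that softmax depends only on the differences between logits and not on their sign.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits (illustrative only): all negative, but the first is clearly the largest
toy_logits = [-16.0, -21.0, -22.0]
probs = softmax(toy_logits)

# Adding the same constant to every logit leaves the probabilities unchanged
shifted_probs = softmax([value + 100.0 for value in toy_logits])

print(probs[0], shifted_probs[0])
```

What matters is how far the best logit sits above the others; its absolute value carries no meaning on its own.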


## Exercise 4: Not bad reviews


We trained a random forest model to predict if a film review is positive or negative. Here is the training code:

In [None]:
import datasets

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


# Load training data
train_data = datasets.load_dataset("sst2", split="train[:20000]").to_pandas()
valid_data = datasets.load_dataset("sst2", split="validation").to_pandas()

# Prepare model
with open("stopwords.txt", "r") as f:
    stopwords = [w.strip() for w in f.readlines()]

preprocessor = TfidfVectorizer(stop_words=stopwords, max_features=5000, lowercase=False)
classifier = RandomForestClassifier(n_estimators=400, n_jobs=-1)

model = Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])

# Train
X = train_data.sentence
y = train_data.label

model.fit(X, y)

print(
    "Training complete.",
    "Accuracy:",
    model.score(valid_data.sentence, valid_data.label),
)


Overall, it works quite well, but we noticed it has some problems with reviews containing negations, for example:

In [None]:
# Class labels are:
# 1 = Positive, 0 = Negative

# this returns positive, that’s right!
assert model.predict(["This movie is good"]) == [1]

# negative! bingo!
assert model.predict(["This movie is bad"]) == [0]

# WHOOPS! this ↓ is predicted as negative?! uhm…
assert model.predict(["This movie is not bad at all!"]) == [1]

# WHOOPS! this ↓ is predicted as negative?! why?
assert model.predict(["This movie is not perfect, but very good!"]) == [1]


Can you help us understand what is going on? Do you have any idea on how to fix it?
You can edit the code above.
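
One hypothesis worth checking is sketched below. It assumes that `stopwords.txt` (whose contents are not shown) contains negation words such as "not", and it mimics the word-level analysis of `TfidfVectorizer` in plain Python using its default token pattern. The stop-word set `ASSUMED_STOPWORDS` and the helper `analyze` are hypothetical.

```python
import re

# Default token pattern of TfidfVectorizer: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: the real stop-word file is not shown, this set is made up
ASSUMED_STOPWORDS = {"this", "is", "not", "at", "all", "but", "very"}


def analyze(text, stop_words, lowercase=False, max_n=1):
    """Tokenize, drop stop words, then emit word n-grams up to max_n."""
    if lowercase:
        text = text.lower()
    tokens = [t for t in TOKEN_PATTERN.findall(text) if t not in stop_words]
    ngrams = list(tokens)
    for n in range(2, max_n + 1):
        ngrams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams


# With "not" treated as a stop word, both reviews reduce to the same features
negated = analyze("This movie is not bad at all!", ASSUMED_STOPWORDS)
plain = analyze("This movie is bad", ASSUMED_STOPWORDS)

# Keeping "not" and adding bigrams exposes the negation as a feature
kept_stopwords = ASSUMED_STOPWORDS - {"not"}
with_bigrams = analyze(
    "This movie is not bad at all!", kept_stopwords, lowercase=True, max_n=2
)

print(negated, plain, with_bigrams)
```

If the hypothesis holds, removing negation words from the stop-word list and adding bigrams (for example `ngram_range=(1, 2)` in `TfidfVectorizer`) would let the model see features such as "not bad". The tokens actually produced by the fitted pipeline can be inspected with `model.named_steps["preprocessor"].build_analyzer()`.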

## Exercise 5: Model weaknesses


The Giskard Python library provides an automatic scanner to find weaknesses and vulnerabilities in ML models.

Using this tool, could you identify some issues in the movie classification model above? Can you propose hypotheses about what is causing these issues?

Then, choose one of the issues you just found and try to improve the model to mitigate or resolve it — just one, no need to spend the whole weekend over it!

You can find a quickstart here: https://docs.giskard.ai/en/latest/getting-started/quickstart.html