<a href="https://colab.research.google.com/github/tawounfouet/dataimpact-technical-test/blob/main/Copie_de_ML_DL_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science

**Questions marked with a 🔴 are considered mandatory for CDI.**

Before starting the exercises, make sure the GPU is enabled:


*   Go to the *Runtime* menu (↑)
*   Select *Change runtime type*
*   Select GPU as *Hardware accelerator*


## Machine Learning

### Exercise 1 **(10 points)**
Using the sklearn library, find a suitable algorithm to classify whether someone has diabetes, based on the dataset found in the repository at `Data_science/machine_learning/diabetes.csv`. The column containing the target is 'Outcome'.

An accuracy of around 0.7 is considered acceptable, but make sure to try different models to find the best one.


---


To upload the `diabetes.csv` file to Google Colab, run the cell below. It will display an upload button. After the upload, the file will be visible in the Files tab on the left (folder icon).

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving diabetes.csv to diabetes.csv
User uploaded file "diabetes.csv" with length 23873 bytes


In [None]:
import sklearn

In [None]:
# Write your code here
import pandas as pd

# Load the dataset
# df = pd.read_csv("Data_science/machine_learning/diabetes.csv")
df = pd.read_csv("diabetes.csv")

In [None]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare the data
X = df.drop(columns=['Outcome'])  # Features
y = df['Outcome']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression

# Choose a classification algorithm and train the model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

print(f"y      : {y_test[:10].to_numpy()}")
print(f"y_pred : {y_pred[:10]}")

y      : [0 0 0 0 0 0 0 1 1 1]
y_pred : [0 0 0 0 0 0 0 1 1 1]


In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", round(accuracy, 3))

Accuracy: 0.753


Testing other models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


# Define a list of classifiers
classifiers = [
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier()),
    ("Support Vector Machine", SVC())
]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(X_train_scaled, y_train)
    y_pred = clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy - {accuracy:.3f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("-" * 50, "\n")

Decision Tree: Accuracy - 0.753
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.75      0.80        99
           1       0.63      0.76      0.69        55

    accuracy                           0.75       154
   macro avg       0.74      0.76      0.74       154
weighted avg       0.77      0.75      0.76       154

Confusion Matrix:
[[74 25]
 [13 42]]
-------------------------------------------------- 

Random Forest: Accuracy - 0.773
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.82      0.82        99
           1       0.68      0.69      0.68        55

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154

Confusion Matrix:
[[81 18]
 [17 38]]
-------------------------------------------------- 

Support Vector Machine: Accuracy - 0.734
Classification 
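
The figures in a classification report can be recomputed by hand from the confusion matrix. The following minimal sketch uses the Decision Tree matrix printed above, assuming sklearn's convention that rows are true classes and columns are predicted classes. The helper name `report_from_confusion` is hypothetical and not part of sklearn.

```python
def report_from_confusion(cm):
    """Accuracy plus per-class precision and recall from a 2x2 confusion matrix.

    cm[i][j] counts samples of true class i that were predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    report = {"accuracy": accuracy}
    for k in (0, 1):
        predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
        actual_k = sum(cm[k])              # row sum: everything that truly is k
        report[k] = {
            "precision": cm[k][k] / predicted_k,
            "recall": cm[k][k] / actual_k,
        }
    return report


decision_tree_cm = [[74, 25], [13, 42]]
report = report_from_confusion(decision_tree_cm)
print(round(report["accuracy"], 3))      # 0.753
print(round(report[1]["precision"], 2))  # 0.63
print(round(report[1]["recall"], 2))     # 0.76
```

Reading the numbers this way shows that the Decision Tree finds about three quarters of the diabetic cases (recall for class 1) but raises a fair number of false alarms (precision for class 1), which accuracy alone does not reveal.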

## Deep Learning 1

### Exercise 3 -- Recommender Systems

#### 🔴 3.1 A wrong model **(12 points)**

We build a recommender system that takes as input a user id and an item id and outputs a rating (continuous value between 1 and 10).
Let's suppose there are `10000` users and `1000` items.
There are 6 wrong / incoherent lines in this code, correct them all, and place a short comment using `#` in the same line.

In [None]:
import tensorflow as tf

from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.layers import Dense, Dropout, Dot, Concatenate
from tensorflow.keras.models import Model

tf.random.set_seed(0)  # fix the random seed (do not change this line)

EMBEDDING_SIZE = 100  # more reasonable size for embeddings
NUM_ITEMS = 1000  # corrected the number of items to 1000, as stated in the problem
NUM_USERS = 10000
item_embedding = Embedding(output_dim=EMBEDDING_SIZE,
                           input_dim=NUM_ITEMS,
                           input_length=1)
user_embedding = Embedding(output_dim=EMBEDDING_SIZE,
                           input_dim=NUM_USERS,
                           input_length=1)

item = Input(shape=(1,), dtype='int32')
user = Input(shape=(1,), dtype='int32')
emb_item = Flatten()(item_embedding(item))
emb_user = Flatten()(user_embedding(user))
x = Concatenate()([emb_item, emb_user])
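x = Dropout(0.1)(x)  # Corrected dropout rate to 0.1, as 0.99 would drop almost every activation
x = Dense(64, activation="relu")(x)  # Hidden layer widened from 1 unit (64 is an arbitrary choice)
x = Dense(1, activation="linear")(x)  # 1 unit for the output rating; linear because tanh is bounded to [-1, 1]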

model = Model([item, user], x)
model.compile(optimizer='rmsprop', loss='mean_squared_error')  # Corrected loss to 'mean_squared_error'
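
As a quick check on the sizes involved, an `Embedding` layer stores one vector of length `output_dim` per id, so it holds `input_dim * output_dim` weights and no bias. The sketch below works this out in plain Python for the constants used in the model. The helper `embedding_params` is a hypothetical name introduced only for illustration.

```python
EMBEDDING_SIZE = 100
NUM_ITEMS = 1000
NUM_USERS = 10000


def embedding_params(num_ids, embedding_size):
    """Number of weights in an embedding table: one row per id, no bias."""
    return num_ids * embedding_size


item_params = embedding_params(NUM_ITEMS, EMBEDDING_SIZE)
user_params = embedding_params(NUM_USERS, EMBEDDING_SIZE)
concat_width = 2 * EMBEDDING_SIZE  # item vector and user vector side by side

print(item_params)   # 100000
print(user_params)   # 1000000
print(concat_width)  # 200
```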

In [None]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 embedding (Embedding)       (None, 1, 100)               100000    ['input_1[0][0]']             
                                                                                                  
 embedding_1 (Embedding)     (None, 1, 100)               1000000   ['input_2[0][0]']             
                                                                                              

#### 3.2 Most similar embedding vectors **(10 points)**

We want to find the most similar items to a given vector $x$ using the embedding matrix $W$ of the previous (untrained) recommendation model.

🔴 1. Retrieve the right embedding matrix from the previous model and store it in `item_embeddings` **(3 points)**

In [None]:
#item_embeddings =  # TODO: change this
#item_embeddings.shape

# Retrieve the weights of the item_embedding layer from the previous model
item_embeddings = item_embedding.get_weights()[0]  # use the layer object: auto-generated names such as 'embedding' change when cells are re-run

# Check the shape of the item_embeddings matrix
print("Shape of item_embeddings matrix:", item_embeddings.shape)

Shape of item_embeddings matrix: (1000, 100)


2. Write a function to compute the cosine similariti**es** between a vector x and all the possible vectors y in the item embedding matrix **(2 points)**. Recall that for a given vector y, the cosine similarity is given by:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\| \mathbf{x} \|_2 \, \| \mathbf{y} \|_2 }$$



In [None]:
import numpy as np


def cosine_sims(x, y):
    # TODO: write me!
    # Compute dot product between x and y
    dot_product = np.dot(x, y.T)

    # Compute norms of x and y
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y, axis=1)

    # Compute cosine similarities
    similarities = dot_product / (norm_x * norm_y)

    return similarities


# Arbitrary query vector x used for testing:
x = np.ones(shape=(EMBEDDING_SIZE,))
similarities = cosine_sims(x, item_embeddings)

In [None]:
print("shape:", similarities.shape)
print(f"min={similarities.min()}, max={similarities.max()}")

shape: (1000,)
min=-0.3622936427789141, max=0.3180012145755963
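
Because the embeddings are untrained, the values above are hard to judge by eye. One way to sanity-check a cosine similarity function is to feed it vectors whose angles are known in advance. The sketch below is self-contained and uses a hypothetical function name, `cosine_similarities`, with the same logic as the formula above.

```python
import numpy as np


def cosine_similarities(x, Y):
    """Cosine similarity between vector x and every row of matrix Y."""
    return (Y @ x) / (np.linalg.norm(x) * np.linalg.norm(Y, axis=1))


x = np.array([1.0, 0.0])
Y = np.array([
    [2.0, 0.0],   # same direction as x, different length
    [0.0, 3.0],   # orthogonal to x
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees from x
])
sims = cosine_similarities(x, Y)
print(np.round(sims, 4))  # approximately 1, 0, -1 and 0.7071
```

The first row confirms that vector length does not matter, only direction, and all values stay within [-1, 1].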


3. Write a function to find the 3 most similar item vectors to our query vector `x`: **(5 points)**

In [None]:
def most_similar(x, item_embeddings, top_n=3):
    # TODO: write me!
    # Compute cosine similarities between x and all item vectors
    similarities = cosine_sims(x, item_embeddings)

    # Get indices of top_n items with highest similarities
    top_indices = np.argsort(similarities)[::-1][:top_n]

    # Create a list of (item_id, similarity) tuples
    similar_items = [(index, similarities[index]) for index in top_indices]

    return similar_items


most_similar(x, item_embeddings, top_n=3)
# possible output if the function is well written:
# A list of (item_id, similarity) tuples in descending similarity order:
# [(686, 0.333...),
# (728, 0.292...),
# (675, 0.257...)]

[(660, 0.3180012145755963),
 (480, 0.29472378435483043),
 (950, 0.2869575843145515)]
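
For large catalogues, a full sort is more work than needed when only a few neighbours are wanted. A possible variant, sketched here on a toy score vector rather than the real embeddings, uses `np.argpartition` to select the top candidates and then sorts only those. `top_n_indices` is a hypothetical helper name, and the sketch assumes `top_n` does not exceed the number of scores.

```python
import numpy as np


def top_n_indices(scores, top_n=3):
    """Indices of the top_n largest scores, in descending score order."""
    # argpartition puts the top_n largest entries (unordered) in the last top_n slots
    candidates = np.argpartition(scores, -top_n)[-top_n:]
    # sort only those candidates by score, highest first
    return candidates[np.argsort(scores[candidates])[::-1]]


scores = np.array([0.10, 0.90, -0.30, 0.50, 0.70])
print(top_n_indices(scores, top_n=3))  # [1 4 3]
```

With 1000 items the difference is negligible, so the full `argsort` used above is a perfectly reasonable choice here.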

In [19]:
# simple CNN
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Activation
from tensorflow.keras.models import Model

# Define input layer
my_input = Input(shape=(10, 10, 1))

# Add convolutional layer
my_output = Conv2D(1, (3, 3))(my_input)

# Add activation layer
my_output = Activation("sigmoid")(my_output)

# Create model
model = Model(inputs=my_input, outputs=my_output)

# Display model summary
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 10, 10, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 8, 8, 1)           10        
                                                                 
 activation_3 (Activation)   (None, 8, 8, 1)           0         
                                                                 
Total params: 10 (40.00 Byte)
Trainable params: 10 (40.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


### Exercise 4 -- Computer Vision

A data scientist named Alice has built the following CNN:

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Softmax
from tensorflow.keras.layers import Input, Convolution2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

In [3]:
input = Input((112, 112, 3))
x = Convolution2D(32, kernel_size=(3, 3), padding='same',
                  activation='relu', name='conv1')(input)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(64, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv2")(x)
x = Convolution2D(64, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv3")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(128, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv4")(x)
x = Convolution2D(128, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv5")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(256, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv6")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten(name="flatten")(x)
x = Dense(2048, name="fc1")(x)
output = Dense(1000, activation="softmax", name="fc2")(x)
model = Model(input, output)

In [4]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 112, 112, 3)]     0         
                                                                 
 conv1 (Conv2D)              (None, 112, 112, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2  (None, 56, 56, 32)        0         
 D)                                                              
                                                                 
 conv2 (Conv2D)              (None, 56, 56, 64)        18496     
                                                                 
 conv3 (Conv2D)              (None, 56, 56, 64)        36928     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 28, 28, 64)        0         
 g2D)                                                        

## Understanding the Layer Architecture:

- **Convolutional Layers (conv1 to conv6)**:

These layers apply filters to the input image, extracting features like edges, textures, and shapes.
Each filter has a specific size (e.g., 3x3) and generates a feature map.
The number of filters determines the number of feature maps produced by the layer.


- **MaxPooling2D Layers**:

These layers downsample the feature maps by taking the maximum value from a specific region (e.g., 2x2 pool size).
This reduces the height and width of the feature maps but retains the most prominent features.


- **Flatten Layer**:

This layer takes the multi-dimensional output of the last pooling layer (batch size, height, width, number of feature maps) and reshapes each sample into a single long vector, keeping the batch dimension.
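
The effect of each layer type on the tensor shape can be traced by hand. The sketch below does this in plain Python for the layer sequence of Alice's network, assuming 3x3 convolutions with `padding='same'` and stride 1, and 2x2 pooling, as in the code above. The helpers `conv_same` and `max_pool` are hypothetical names, not Keras functions.

```python
def conv_same(shape, filters, kernel=3):
    """'same' convolution with stride 1: height and width are preserved."""
    h, w, channels = shape
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return (h, w, filters), params


def max_pool(shape, pool=2):
    """2x2 max pooling: halves height and width, keeps channels, no parameters."""
    h, w, channels = shape
    return (h // pool, w // pool, channels), 0


# Layer sequence of the network above: ("conv", filters) or ("pool",)
layers = [("conv", 32), ("pool",),
          ("conv", 64), ("conv", 64), ("pool",),
          ("conv", 128), ("conv", 128), ("pool",),
          ("conv", 256), ("pool",)]

shape = (112, 112, 3)
param_counts = []
for layer in layers:
    if layer[0] == "conv":
        shape, params = conv_same(shape, layer[1])
    else:
        shape, params = max_pool(shape)
    param_counts.append(params)
    print(layer[0], shape, params)

flattened_size = shape[0] * shape[1] * shape[2]
print(flattened_size)  # 12544
```

The first three convolution counts (896, 18496, 36928) match the `model.summary()` output, and the final feature map is 7 x 7 x 256.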


## Calculation Steps:

**1. Convolutional Layer Output:**

- The last convolutional layer (conv6) has 256 filters (output channels).
- All convolutions use `padding='same'` with a stride of 1, so they preserve height and width. Only the four 2x2 max-pooling layers shrink the feature maps: 112 -> 56 -> 28 -> 14 -> 7.
- conv6 runs after the third pooling layer, so its output shape is (batch_size, 14, 14, 256). The fourth pooling layer halves this again, so the input of the flatten layer is:
(batch_size, h, w, num_filters) = (32, 7, 7, 256).

In [5]:
(batch_size, h, w, num_filters) = (32, 7, 7, 256)

In [8]:
flattened_size = h * w * num_filters
flattened_size

12544

**2. Flatten Layer Calculation:**

The flatten layer keeps the batch dimension and multiplies the remaining height, width, and number of channels to get the number of elements in each flattened vector:
- flattened_size = h * w * num_filters
- In this case: flattened_size = 7 * 7 * 256 = 12544


In [12]:
# flattened_size = h * w * num_filters
flattened_size = 7 * 7 * 256
flattened_size

12544

**3. Final Shape:**

Since we have a batch size of 32 images, the final shape of the activations at the output of the flatten layer is: (batch_size, flattened_size) = (32, 12544).

In [13]:
(batch_size, flattened_size) = (32, 12544)

##### 4.5 Our data scientist wants to make the original network fully convolutional (e.g. to do coarse semantic segmentation) **(12 points)**

Modify the following code to make the network fully convolutional.
- Remember to use (1, 1) convolutions instead of Dense layers and do all necessary transformations.
- If needed, you may use `Softmax(axis=-1)`.
- Keep exactly the same number of parameters as in the standard model.
- You may add or remove layers to make the whole thing work.

In [None]:
input = Input((112, 112, 3))
x = Convolution2D(32, kernel_size=(3, 3), padding='same',
                  activation='relu', name='conv1')(input)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(64, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv2")(x)
x = Convolution2D(64, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv3")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(128, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv4")(x)
x = Convolution2D(128, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv5")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Convolution2D(256, kernel_size=(3, 3), padding='same',
                  activation='relu', name="conv6")(x)

x = MaxPooling2D(pool_size=(2, 2))(x)

x = Convolution2D(2048, kernel_size=(7, 7), padding='valid',  # 7x7 'valid' kernel covers the whole 7x7 map, matching fc1's parameter count
                  activation='relu', name="conv7")(x)

x = Convolution2D(1000, kernel_size=(1, 1), padding='same',  # 1x1 convolution to get 1000 channels
                  activation=None, name="conv8")(x)

output = Softmax(axis=-1, name="softmax")(x)  # Softmax activation along the channel dimension
model = Model(input, output)
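
The requirement to keep the same number of parameters can be checked with simple arithmetic. The sketch below compares each Dense layer of the original network with the convolution that replaces it, assuming the feature map entering the classifier is 7 x 7 x 256. The helper names `dense_params` and `conv_params` are hypothetical.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2D convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters


# fc1 sees the flattened 7 x 7 x 256 feature map; conv7 sees it unflattened
fc1 = dense_params(7 * 7 * 256, 2048)
conv7 = conv_params(7, 7, 256, 2048)

# fc2 maps 2048 features to 1000 classes; conv8 does the same at each position
fc2 = dense_params(2048, 1000)
conv8 = conv_params(1, 1, 2048, 1000)

print(fc1, conv7)  # 25692160 25692160
print(fc2, conv8)  # 2049000 2049000
```

The counts agree because a 7x7 'valid' convolution over a 7x7 map uses every input value exactly once per filter, just as a Dense layer does on the flattened vector.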

Use the following cells to test your model on a random input (you do not need to edit the code of those cells):

In [None]:
test_batch = np.random.normal(size=(5, 112, 112, 3))
predicted = model.predict(test_batch)
predicted.shape



(5, 1, 1, 1000)

In [None]:
predicted.min()

0.0008871992

In [None]:
predicted.sum(axis=-1)[0]

array([[1.]], dtype=float32)
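
The sum of 1 along the last axis is what a softmax over the channel dimension should produce at every spatial position. A minimal NumPy sketch of that behaviour follows, on random logits rather than the real network output. `softmax_last_axis` is a hypothetical helper, not the Keras layer.

```python
import numpy as np


def softmax_last_axis(logits):
    """Numerically stable softmax over the last (channel) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 1, 1, 1000))  # same shape as the model output above
probs = softmax_last_axis(logits)

print(probs.shape)                           # (5, 1, 1, 1000)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True
```

On a larger input image the spatial dimensions would exceed 1 x 1, and the same normalisation would hold independently at each position.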