# Facial keypoints detection Kaggle Challenge

This is my attempt at using ML to complete the task proposed by the challenge. 

The dataset consists of 96x96px images of faces, and the goal is to find the position of 15 features in each image. 

The training set contains the pictures and, for each training example, the 30 coordinates of the features (one (x, y) pair per feature). 

I'll begin by visualizing some of the pictures, and if possible their features. Then I'll think about how to detect the features. 

## Importing some useful libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns

import random

%matplotlib inline

## Loading data

In [None]:
# Loading the training set

source_data = pd.read_csv('training.csv')

In [None]:
source_data.info()

In [None]:
type(source_data['Image'][0])

The pictures are given as strings of space-separated pixel values. I will need to convert them to arrays of integers. 

In [None]:
# extracting all the pictures

photos = source_data['Image'].apply(lambda str_pic: np.array([int(px) for px in str_pic.split()]))

# Now I have the pictures as arrays of pixel intensity.

np.sqrt(len(photos[0])) # Checking that the arrays have the right size. 

In [None]:
type(photos.head()[0])
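The conversion above could also be wrapped in a small helper that checks the pixel count and reshapes in one go. This is only a sketch: `parse_image` and the synthetic `fake_row` string are names made up for illustration, and the 96x96 size is taken from the challenge description.

```python
import numpy as np

def parse_image(pixel_string, side=96):
    """Turn a space-separated pixel string into a (side, side) uint8 array."""
    # Parse as regular integers first, then shrink to 8 bits (values are 0..255).
    pixels = np.array(pixel_string.split(), dtype=np.int64)
    if pixels.size != side * side:
        raise ValueError(f"expected {side * side} pixels, got {pixels.size}")
    return pixels.astype(np.uint8).reshape(side, side)

# Synthetic stand-in for one entry of the 'Image' column (not real data).
fake_row = " ".join(str(v % 256) for v in range(96 * 96))
image = parse_image(fake_row)
print(image.shape, image.dtype)
```

With the real data this would be used as `source_data['Image'].apply(parse_image)`, which would also make the later `reshape(96,96)` calls unnecessary.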

In [None]:
# Plotting a face.
plt.imshow(random.choice(photos).reshape(96,96),cmap='gist_gray')

Plotting some random faces to see what they look like.

In [None]:
grid_size = 4
chosen_faces = random.sample(range(0,len(photos)), grid_size**2) #picking faces for a 4x4 grid. 

fig, axes = plt.subplots(grid_size, grid_size, gridspec_kw = dict(hspace = .05, wspace = .05), 
                         figsize=(10,10))

for i, ax in zip(chosen_faces, axes.flat):

    ax.axis('off')
    ax.imshow(photos[i].reshape(96,96),cmap='gist_gray')
    
#plt.tight_layout()


## Gathering the features

Now that we have the faces, let's have a closer look at the features. 

In [None]:
keypoints = source_data.drop('Image', axis = 1)

In [None]:
keypoints.info()

In [None]:
# Example 7041 is missing features. Let's plot it to see what that looks like.

#Getting the keypoints positions
guy_face = keypoints.iloc[7041]

# Columns alternate x, y, so slice by position (plain [] with integers looks up labels).
x_points = guy_face.iloc[0::2].to_numpy()
y_points = guy_face.iloc[1::2].to_numpy()

# Plotting a face.

plt.imshow(photos[7041].reshape(96,96),cmap='gist_gray')
plt.plot(x_points, y_points, 'ro', markerfacecolor = 'none')
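Since the same x/y split comes back in every plotting cell, it could live in one helper. A minimal sketch, assuming the label columns alternate x, y as the `range(0,30,2)` indexing implies; `split_xy` and the toy row are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_xy(row):
    """Split a row of alternating x, y labels into two float arrays."""
    values = np.asarray(row, dtype=float)
    return values[0::2], values[1::2]

# Toy row with two keypoints, one of them only half labelled (not real data).
toy_row = pd.Series({
    'left_eye_center_x': 66.0, 'left_eye_center_y': 39.0,
    'nose_tip_x': 44.4, 'nose_tip_y': np.nan,
})
xs, ys = split_xy(toy_row)
print(xs, ys)
```

Missing coordinates stay as NaN, which matplotlib simply skips when plotting.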

In [None]:
# Plotting faces with features

grid_size = 4
chosen_faces = random.sample(range(0,len(photos)), grid_size**2) #picking faces for a 4x4 grid. 

fig, axes = plt.subplots(grid_size, grid_size, gridspec_kw = dict(hspace = .05, wspace = .05), 
                         figsize=(10,10))

for i, ax in zip(chosen_faces, axes.flat):

    ax.axis('off')
    ax.imshow(photos[i].reshape(96,96),cmap='gist_gray')
    
    x_points = keypoints.iloc[i, 0::2].to_numpy()
    y_points = keypoints.iloc[i, 1::2].to_numpy()
    
    ax.plot(x_points, y_points, 'ro', markerfacecolor = 'none')


### Saving new dataframes with photos and keypoints to new csv files. 

In [None]:
# This works, but you end up with a csv file of about 2 GB. It is neither worth it nor necessary.
# My only reason for doing this was that I needed to use this on a different computer. 

"""photos_for_csv = source_data['Image'].apply(lambda str_pic: [int(px) for px in str_pic.split()])

photo_arr = np.array([l for l in photos_for_csv])

np.savetxt('training_photos.csv', photo_arr, delimiter = ',')"""

In [None]:
"""photos.to_csv(r'training_photos.csv')
keypoints.to_csv(r'training_keypoints.csv')"""

# Dealing with missing labels. 

I want to eventually put all of these training examples into a neural network, but first I need to deal with some issues. 

* MANY of the pictures are lacking some keypoint positions. This poses the problem of what the algorithm should find for those examples. I can think of a few alternatives:

    * Try to complete the missing labels and then use everything to train a NN that gives all 15 keypoints at once:
    
        * By manually finding and labelling the missing keypoints --> Of course not. 
        
        * Use the known labels as _features_, and train some sort of algorithm to find the rest of them. I could use KNN, for example, to fill in the missing info. I think it shouldn't take too long to give this a try. The problem is that I need to find batches of images with the same set of labelled keypoints to use as the nearest neighbors. This would mean working with vectors of different dimensions, or choosing at each step which features to ignore and which to use as reference. __This sort of defeats the purpose: I would build an ML algorithm to find keypoints, and then use those keypoints to train an ML algorithm to find keypoints. The difference is that in the first case the features are known keypoints, while in the second they are the pixel intensities of the image, but still...__
        
        * Impute the missing labels using the mean position of each feature, or by estimating the distribution of positions for that feature and drawing randomly from it. 
    
    For all these methods of imputing the missing data I can roughly evaluate the result by looking at pictures with imputed keypoints and checking whether the filled-in positions look more or less right. However, this does not sound very rigorous. 
    
    * If I don't want to fill in the missing labels, I could train one NN for each specific keypoint, and feed it as training data only the images where that keypoint is known. This implies training 15 different NNs, each with a different number of training examples. 
    
        * I could use this set of networks to predict the keypoints I need to submit as a solution, or use them to complete the missing points and then retrain. This sounds more iterative and somewhat suboptimal. __Also quite redundant, and I'd probably be introducing some bias this way__.  
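The two simplest imputation options from the list above (column mean, or a random draw from the observed positions) can be sketched as follows. This is an illustration under assumptions, not a tested part of the pipeline: `impute_mean`, `impute_sampled` and the toy table are made-up names and values, and the random draw treats each coordinate independently, so it ignores the link between the x and y of the same keypoint.

```python
import numpy as np
import pandas as pd

def impute_mean(labels):
    """Fill every missing label with the mean of its column."""
    return labels.fillna(labels.mean())

def impute_sampled(labels, rng):
    """Fill every missing label with a random draw from the observed values
    of the same column (the empirical distribution of that coordinate)."""
    filled = labels.copy()
    for col in filled.columns:
        missing = filled[col].isna()
        observed = filled.loc[~missing, col].to_numpy()
        if missing.any() and observed.size > 0:
            filled.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Toy label table with a single keypoint (not real data).
toy = pd.DataFrame({
    'nose_tip_x': [44.0, 48.0, np.nan, 46.0],
    'nose_tip_y': [57.0, np.nan, np.nan, 61.0],
})
rng = np.random.default_rng(0)
mean_filled = impute_mean(toy)
sampled_filled = impute_sampled(toy, rng)
print(mean_filled)
print(sampled_filled)
```

Mean imputation puts every missing keypoint at the same spot regardless of the face, while sampling keeps the spread of the labels but can place a point far from where it belongs in a given image.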

### Finding images with the same labeled keypoints. 

I want to make a table telling me which keypoints are already labelled for each training image. I guess verifying only one coordinate per feature is enough. 
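The guess that one coordinate is enough can itself be checked: for each keypoint, x and y should be missing on exactly the same rows. A small sketch on toy data, assuming the columns are named `<keypoint>_x` and `<keypoint>_y`; `xy_missing_together` is a made-up helper name.

```python
import numpy as np
import pandas as pd

def xy_missing_together(labels):
    """For each keypoint, report whether its x and y are missing on the same rows."""
    names = [col[:-2] for col in labels.columns if col.endswith('_x')]
    return pd.Series({
        name: bool((labels[name + '_x'].isna() == labels[name + '_y'].isna()).all())
        for name in names
    })

# Toy labels: nose_tip is consistent, left_eye_center has a y missing on its own.
toy = pd.DataFrame({
    'nose_tip_x': [44.0, np.nan], 'nose_tip_y': [57.0, np.nan],
    'left_eye_center_x': [66.0, 65.0], 'left_eye_center_y': [39.0, np.nan],
})
check = xy_missing_together(toy)
print(check)
```

If `xy_missing_together(keypoints).all()` comes out `True` on the real labels, looking at the `_x` column alone is safe.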

In [None]:
new_col_list = [keypoints.columns[i][:-2] for i in range(0,len(keypoints.columns),2)]
new_col_list

In [None]:
# a dataframe saying which keypoints are marked on each image. 
present_keypoints = pd.DataFrame(columns = new_col_list)

In [None]:
for col in new_col_list:
    present_keypoints[col] = pd.notnull(keypoints[col+'_x'])

In [None]:
present_keypoints['Total'] = present_keypoints.sum(axis = 1)

In [None]:
present_keypoints['Total'].value_counts()
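The same kind of presence information also tells how many training examples each per-keypoint network would get, and which rows they are. A hedged sketch on toy data; `examples_per_keypoint` and `rows_with_keypoint` are illustrative names, and the `_x`/`_y` column naming is assumed.

```python
import numpy as np
import pandas as pd

def examples_per_keypoint(labels):
    """Count, for each keypoint, the rows where both its x and y are labelled."""
    names = [col[:-2] for col in labels.columns if col.endswith('_x')]
    return {
        name: int(labels[[name + '_x', name + '_y']].notna().all(axis=1).sum())
        for name in names
    }

def rows_with_keypoint(labels, name):
    """Index of the rows usable to train a network for one keypoint."""
    mask = labels[[name + '_x', name + '_y']].notna().all(axis=1)
    return labels.index[mask]

# Toy labels (not real data): the eye center is only labelled on the first row.
toy = pd.DataFrame({
    'nose_tip_x': [44.0, 45.0, 46.0], 'nose_tip_y': [57.0, 58.0, 59.0],
    'left_eye_center_x': [66.0, np.nan, np.nan],
    'left_eye_center_y': [39.0, np.nan, np.nan],
})
counts = examples_per_keypoint(toy)
eye_rows = rows_with_keypoint(toy, 'left_eye_center')
print(counts, list(eye_rows))
```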

In [None]:
present_keypoints[present_keypoints['Total'] == 14]

In [None]:
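import seaborn as sns  # only needed for this heatmap

# One row per image with exactly 14 labelled keypoints; the odd color marks the missing one.
sns.heatmap(present_keypoints[present_keypoints['Total'] == 14].drop('Total', axis = 1).astype(int), yticklabels=False,cbar=False,cmap='viridis')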

## Filling in missing keypoints

I'll try to see how it looks when I fill the missing labels with the mean.

In [None]:
mean_positions = keypoints.mean()

In [None]:
# Columns alternate x, y, so even positions are x means and odd positions are y means.
mean_x = mean_positions.iloc[0::2].tolist()
mean_y = mean_positions.iloc[1::2].tolist()

In [None]:
# Plotting faces with features

grid_size = 4
chosen_faces = random.sample(range(0,len(photos)), grid_size**2) #picking faces for a 4x4 grid. 

fig, axes = plt.subplots(grid_size, grid_size, gridspec_kw = dict(hspace = .05, wspace = .05), 
                         figsize=(10,10))

for i, ax in zip(chosen_faces, axes.flat):

    ax.axis('off')
    ax.imshow(photos[i].reshape(96,96), cmap='gist_gray')
    
    x_points = keypoints.iloc[i, 0::2].to_numpy()
    y_points = keypoints.iloc[i, 1::2].to_numpy()
    
    ax.plot(x_points, y_points, 'ro', markerfacecolor = 'none')
    
    # Mean position wherever the label is missing, NaN (not drawn) everywhere else.
    filled_x = np.where(np.isnan(x_points), mean_positions.iloc[0::2].to_numpy(), np.nan)
    filled_y = np.where(np.isnan(y_points), mean_positions.iloc[1::2].to_numpy(), np.nan)
    
    ax.plot(filled_x, filled_y, 'bD', markerfacecolor = 'none')

In [None]:
# Creating a df with complementary keypoints.

complement_keypoints = pd.DataFrame(columns = keypoints.columns)

In [None]:
keypoints[pd.notnull(keypoints)].iloc[7041] 

In [None]:
(4 - np.nan ) * 0

In [None]:
# Walk through the labels of example 7041 and flag the missing ones.
for value in keypoints.iloc[7041]:
    if np.isnan(value):
        print('replace')
    else:
        print('ignore')
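The loop above could probably be replaced by a vectorized version. The sketch below assumes that "complementary keypoints" means a table holding the column mean wherever a label is missing and NaN wherever it is known; that reading, the helper name `complement_with_mean` and the toy table are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def complement_with_mean(labels):
    """Return a table with the column mean where `labels` is missing, NaN elsewhere."""
    means = labels.mean()
    # Repeat the row of column means once per example.
    mean_table = pd.DataFrame(
        np.tile(means.to_numpy(), (len(labels), 1)),
        index=labels.index, columns=labels.columns,
    )
    # DataFrame.where keeps values where the condition holds and puts NaN elsewhere.
    return mean_table.where(labels.isna())

# Toy labels (not real data).
toy = pd.DataFrame({
    'nose_tip_x': [44.0, np.nan, 48.0],
    'nose_tip_y': [57.0, 59.0, np.nan],
})
complement = complement_with_mean(toy)
completed = toy.fillna(complement)  # original labels plus the filled-in ones
print(complement)
print(completed)
```

Keeping the complement separate from the original labels makes it easy to plot the filled-in points in a different color, as in the grid above.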