# CBU5201 mini-project submission

The mini-project has two separate components:


1.   **Basic component** [6 marks]: Using the genki4k dataset, build a machine learning pipeline that takes an image as input and predicts 1) whether the person in the image is smiling or not and 2) the 3D head pose labels of the image.
2.   **Advanced component** [10 marks]: Formulate your own machine learning problem and build a machine learning solution using the genki4k dataset (https://inc.ucsd.edu/mplab/398/). 

Your submission will consist of two Jupyter notebooks, one for the basic component and another for the advanced component. Please **name each notebook**:

* CBU5201_miniproject_basic.ipynb
* CBU5201_miniproject_advanced.ipynb

then **zip and submit them together**.

Each uploaded notebook should include: 

*   **Text cells**, concisely describing each step and its results.
*   **Code cells**, implementing each step.
*   **Output cells**, i.e. the output from each code cell.

and **should have the structure** indicated below. Notebooks might not be run, so please make sure that the output cells are saved.

How will we evaluate your submission?

*   Conciseness in your writing (10%).
*   Correctness in your methodology (30%).
*   Correctness in your analysis and conclusions (30%).
*   Completeness (10%).
*   Originality (10%).
*   Efforts to try something new (10%).

Suggestion: Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. 

Each notebook should be structured into the following 9 sections:


# 1 Author

**Student Name**:  Yaoan Yang

**Student ID**:  210976881



# 2 Problem formulation

Describe the machine learning problem that you want to solve and explain what's interesting about it.

# 3 Machine Learning pipeline

Describe your ML pipeline. Clearly identify its input and output, any intermediate stages (for instance, transformation -> models), and intermediate data moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. 

**Pre-Processing**

Creating folders for the training, validation, and test sets, and copying the necessary files.

In [3]:
import os, shutil

base_dir = 'D:\\desktop\\bupt\\ML\\Mini_Project\\dataset'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')

# Directory with our training smile pictures
train_smile_dir = os.path.join(train_dir, 'smile')

# Directory with our training nosmile pictures
train_nosmile_dir = os.path.join(train_dir, 'nosmile')

# Directory with our validation smile pictures
validation_smile_dir = os.path.join(validation_dir, 'smile')

# Directory with our validation nosmile pictures
validation_nosmile_dir = os.path.join(validation_dir, 'nosmile')

# Directory with our test smile pictures
test_smile_dir = os.path.join(test_dir, 'smile')

# Directory with our test nosmile pictures
test_nosmile_dir = os.path.join(test_dir, 'nosmile')
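
The cell above only defines the folder paths; the copying step itself is not shown. A minimal sketch of how the files of one class could be distributed across the three folders is given below. The helper `split_class_files`, the split fractions, and the seed are hypothetical and are not the procedure actually used to build this dataset.

```python
import os
import random
import shutil


def split_class_files(src_dir, base_dir, class_name,
                      val_frac=0.1, test_frac=0.1, seed=0):
    """Copy files from src_dir into base_dir/{train,validation,test}/class_name.

    Hypothetical helper: the fractions and the seed are illustrative assumptions.
    Returns the number of files copied into each split.
    """
    files = sorted(os.listdir(src_dir))
    random.Random(seed).shuffle(files)  # reproducible shuffle

    n_val = int(len(files) * val_frac)
    n_test = int(len(files) * test_frac)
    splits = {
        'validation': files[:n_val],
        'test': files[n_val:n_val + n_test],
        'train': files[n_val + n_test:],
    }

    for split, names in splits.items():
        dst = os.path.join(base_dir, split, class_name)
        os.makedirs(dst, exist_ok=True)  # create the folder if it is missing
        for name in names:
            shutil.copyfile(os.path.join(src_dir, name),
                            os.path.join(dst, name))
    return {split: len(names) for split, names in splits.items()}
```

Calling it once per class (for example with `class_name='smile'` and again with `class_name='nosmile'`) would produce the folder layout that `flow_from_directory` expects.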


In [4]:
print('total training smile images:', len(os.listdir(train_smile_dir)))
print('total training nosmile images:', len(os.listdir(train_nosmile_dir)))
print('total validation smile images:', len(os.listdir(validation_smile_dir)))
print('total validation nosmile images:', len(os.listdir(validation_nosmile_dir)))
print('total test smile images:', len(os.listdir(test_smile_dir)))
print('total test nosmile images:', len(os.listdir(test_nosmile_dir)))

total training smile images: 1730
total training nosmile images: 1470
total validation smile images: 200
total validation nosmile images: 190
total test smile images: 232
total test nosmile images: 177
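
The two classes are not perfectly balanced, so it helps to know what accuracy a trivial classifier would reach. The sketch below takes the counts printed above and computes, for each split, the accuracy of always predicting the most frequent class; `majority_baseline` is a name introduced here for illustration.

```python
# Image counts copied from the output above.
counts = {
    'train': {'smile': 1730, 'nosmile': 1470},
    'validation': {'smile': 200, 'nosmile': 190},
    'test': {'smile': 232, 'nosmile': 177},
}


def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


for split, class_counts in counts.items():
    total = sum(class_counts.values())
    print(f"{split}: {total} images, "
          f"majority-class baseline = {majority_baseline(class_counts):.3f}")
```

Any trained model should be compared against this baseline rather than against 50%.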


In [9]:
# Import Keras through TensorFlow only, to avoid mixing it with the standalone `keras` package
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
test_generator = test_datagen.flow_from_directory(
        test_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

Found 3200 images belonging to 2 classes.
Found 390 images belonging to 2 classes.
Found 409 images belonging to 2 classes.
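
Two details of the generators matter later: how folder names become binary labels, and what `rescale=1./255` does to pixel values. The sketch below mimics both in plain Python, assuming the alphanumeric ordering of class folders that `flow_from_directory` applies by default; `infer_class_indices` and `rescale_pixel` are illustrative helpers, not Keras functions. On the real generator the mapping can be read from `train_generator.class_indices`.

```python
def infer_class_indices(class_dirs):
    """Mimic alphanumeric class ordering: sorted folder names -> 0, 1, ..."""
    return {name: index for index, name in enumerate(sorted(class_dirs))}


def rescale_pixel(value, rescale=1.0 / 255):
    """Map an 8-bit pixel value in [0, 255] to a float in roughly [0, 1]."""
    return value * rescale


class_indices = infer_class_indices(['smile', 'nosmile'])
print(class_indices)  # {'nosmile': 0, 'smile': 1}
print(rescale_pixel(0), rescale_pixel(128), rescale_pixel(255))
```

Under this mapping a sigmoid output close to 1 means "smile" and an output close to 0 means "nosmile".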


**Building Model**

In [6]:
from tensorflow.keras import layers
from tensorflow.keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                    Output Shape              Param #
 conv2d (Conv2D)                 (None, 148, 148, 32)      896
 max_pooling2d (MaxPooling2D)    (None, 74, 74, 32)        0
 conv2d_1 (Conv2D)               (None, 72, 72, 64)        18496
 max_pooling2d_1 (MaxPooling2D)  (None, 36, 36, 64)        0
 conv2d_2 (Conv2D)               (None, 34, 34, 128)       73856
 max_pooling2d_2 (MaxPooling2D)  (None, 17, 17, 128)       0
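
The output shapes and parameter counts in the summary can be checked by hand. The sketch below does so for the convolutional part of the network, assuming the Keras defaults used in the model definition (3x3 kernels, `'valid'` padding, stride 1, and non-overlapping 2x2 max pooling). The helper names are introduced here for illustration.

```python
def conv2d_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def maxpool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def conv2d_params(in_channels, filters, kernel=3):
    """kernel*kernel*in_channels weights per filter, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 150, 3  # input images are 150x150 RGB
for filters in (32, 64, 128, 128):
    params = conv2d_params(channels, filters)
    size = conv2d_output(size)
    print(f"Conv2D:       ({size}, {size}, {filters})  params={params}")
    size = maxpool_output(size)
    print(f"MaxPooling2D: ({size}, {size}, {filters})")
    channels = filters
```

The first three convolution/pooling pairs reproduce the rows shown in the summary above.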

**Compile and Training**

In [11]:
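from tensorflow.keras import optimizers

model.compile(loss='binary_crossentropy',
              # `lr` is deprecated in TensorFlow 2.x; use `learning_rate`
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['acc'])

# `fit_generator` is deprecated in TensorFlow 2.x; `fit` accepts generators directly
history = model.fit(
      train_generator,
      steps_per_epoch=100,
      epochs=30,
      validation_data=validation_generator,
      validation_steps=50)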



Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


**Validation and Training result**

In [12]:
model.save('D:\\desktop\\bupt\\ML\\Mini_Project\\smile_and_nosmile.h5')

In [6]:
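import matplotlib.pyplot as plt

# `history` is returned by the training cell above; that cell must be run in the
# same session first, otherwise this cell raises a NameError
acc = history.history['acc']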
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

NameError: name 'history' is not defined
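
The error above occurs because `history` only exists in the session that ran the training cell. One possible way to keep the curves available after a restart is to write the metrics dictionary to disk right after training, next to the saved model. The sketch below is an assumption about how this could be done, not part of the original notebook: `save_history` and `load_history` are hypothetical helpers, the metric values are made up for illustration, and it assumes `history.history` holds plain Python floats (convert with `float()` otherwise).

```python
import json
import os
import tempfile


def save_history(history_dict, path):
    """Write the per-epoch metrics (a dict of lists) to a JSON file."""
    with open(path, 'w') as f:
        json.dump(history_dict, f)


def load_history(path):
    """Read the per-epoch metrics back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Illustrative values only, not results from this notebook.
example = {'acc': [0.6, 0.7], 'val_acc': [0.55, 0.65],
           'loss': [0.68, 0.6], 'val_loss': [0.69, 0.63]}

path = os.path.join(tempfile.gettempdir(), 'history_example.json')
save_history(example, path)
restored = load_history(path)
print(restored['val_acc'])
```

In the notebook this would amount to calling `save_history(history.history, ...)` after training and plotting from the reloaded dictionary instead of from `history.history`.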

**One Photo Test**

In [7]:

import numpy as np

model = keras.models.load_model('D:\\desktop\\bupt\\ML\\Mini_Project\\smile_and_nosmile.h5')

def predict_smile(img_path):
    # Load the image, rescale it to [0, 1] as in training, and add a batch axis
    img = keras.utils.load_img(img_path, target_size=(150, 150))
    img_tensor = keras.utils.img_to_array(img) / 255.0
    img_tensor = np.expand_dims(img_tensor, axis=0)
    prediction = model.predict(img_tensor)
    print(prediction)
    # Class folders are indexed alphabetically: nosmile -> 0, smile -> 1
    result = 'smile' if prediction[0][0] > 0.5 else 'nosmile'
    print(result)

# smile case
predict_smile('D:\\desktop\\bupt\\ML\\Mini_Project\\dataset\\test\\smile\\file1931.jpg')
# non-smile case
predict_smile('D:\\desktop\\bupt\\ML\\Mini_Project\\dataset\\test\\nosmile\\file3824.jpg')

[[0.7292042]]
smile
[[0.1454728]]
nosmile
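
To connect these two outputs with the `binary_crossentropy` loss used in training, the sketch below evaluates the per-sample loss and the 0.5 decision threshold for the two probabilities printed above. It assumes label 1 for the smile image and label 0 for the nosmile image, matching the folders the files were taken from; `binary_crossentropy` and `decode` are plain-Python helpers written for illustration, not the Keras implementations.

```python
import math


def binary_crossentropy(y_true, p):
    """Loss for one sample: -[y*log(p) + (1 - y)*log(1 - p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def decode(p, threshold=0.5):
    """Turn a sigmoid output into a class name (smile is class 1)."""
    return 'smile' if p > threshold else 'nosmile'


# (true label, predicted probability) for the two test images above.
samples = [(1, 0.7292042), (0, 0.1454728)]
for y_true, p in samples:
    print(decode(p), round(binary_crossentropy(y_true, p), 4))
```

A confident correct prediction gives a loss close to 0, while a prediction of exactly 0.5 gives a loss of ln 2 regardless of the label.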


**Evaluation**

In [40]:
import math

def evaluate_model(model, generator, nBatches):
    # Note: `nBatches` acts as the batch size when the number of steps is computed.
    # With nBatches=816 and 409 test images, steps = ceil(409 / 816) = 1, so the score
    # below comes from a single batch of 20 images, not from the whole test set.
    # `evaluate_generator` is deprecated in TensorFlow 2.x; `evaluate` accepts generators.
    score = model.evaluate(generator,  # generator yielding (images, labels) batches
                           steps=math.ceil(generator.samples / nBatches),  # batches to draw before stopping
                           max_queue_size=10,  # maximum size of the generator queue
                           workers=1,  # maximum number of workers feeding the queue
                           use_multiprocessing=False,  # use threads rather than processes
                           verbose=1)
    print("loss: %.6f - acc: %.6f" % (score[0], score[1]))

evaluate_model(model, test_generator, nBatches=816)





loss: 0.471140 - acc: 0.900000
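
The number of evaluation steps determines how much of the test set this score is based on. The sketch below compares the step count produced by the call above with the count needed to cover all 409 test images at the generator's batch size of 20. `steps_for_full_pass` is a helper introduced here for illustration.

```python
import math


def steps_for_full_pass(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_test = 409      # "Found 409 images" for the test generator
batch_size = 20   # batch_size passed to flow_from_directory

print(math.ceil(n_test / 816))                  # 1 step, as in the call above
print(steps_for_full_pass(n_test, batch_size))  # 21 steps cover all 409 images
```

With a single step the score is computed on one batch of 20 images, and an accuracy of exactly 0.900000 is consistent with 18 of those 20 being classified correctly. Evaluating with `generator.batch_size` as the divisor would be one way to cover the full test set; the resulting figure would need to be obtained by re-running the cell.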


# 4 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

# 5 Modelling

Describe the ML model(s) that you will build. Explain why you have chosen them.

# 6 Methodology

Describe how you will train and validate your models, and how model performance is assessed (e.g. accuracy, confusion matrix, etc.).
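
As an illustration of the two metrics mentioned above, the sketch below computes a binary confusion matrix and the accuracy derived from it in plain Python. The labels are hypothetical and are not predictions from this notebook; 1 stands for smile and 0 for nosmile.

```python
def confusion_matrix(y_true, y_pred):
    """Return the counts (tn, fp, fn, tp) for binary labels, with 1 = smile."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of samples that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


# Hypothetical labels, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
print(f"tn={tn} fp={fp} fn={fn} tp={tp} accuracy={accuracy(tn, fp, fn, tp):.3f}")
```

Unlike accuracy alone, the four counts show whether errors are mostly missed smiles or false smiles.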

# 7 Dataset

Describe the dataset that you will use to create your models and validate them. If you need to preprocess it, do it here. Include visualisations too. You can visualise raw data samples or extracted features.

# 8 Results

Carry out your experiments here, explain your results.

# 9 Conclusions

Your conclusions, improvements, etc. should go here.