# CPS330 Assignment 1
*September 2020*

This first assignment is designed to give you experience with some basic data preparation and handling, Scikit-Learn classifiers, and some measures of performance.

This notebook is based on the dataset https://www.kaggle.com/jerzydziewierz/bee-vs-wasp which appeared on Kaggle in late August 2020.

**Instructions:** Notebooks have two types of cells: "code" cells that contain valid Python code, and "markdown" cells that contain text using markdown or LaTeX.  Throughout this notebook there are markdown cells with instructions and questions for you to answer.  After each question is a markdown cell for your answer -- double click on the markdown cell to edit it.  You only need to single click on a code cell to edit it.

At a minimum, complete all the cells in this notebook.


# Import libraries

In [None]:
import numpy as np
import pandas as pd
import cv2
import os

%matplotlib inline
import matplotlib.pyplot as plt

# Read CSV file with labels and data file names

In [None]:
ROOT = '/kaggle/input/bee-vs-wasp/kaggle_bee_vs_wasp'
df = pd.read_csv(os.path.join(ROOT, 'labels.csv'))

Be sure you know what the pandas method `read_csv()` does (look it up if you need to).

The `os.path.join()` function connects two or more strings representing parts of a valid path with the OS-specific path separator (slash on Mac OS X and Linux and backslash on Windows).
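
As a small illustration of `os.path.join()`, the sketch below joins three made-up path pieces (the names `data`, `images`, and `bee1.jpg` are hypothetical and are not part of the dataset). The printed result depends on the operating system you run it on.

```python
import os

# Hypothetical path pieces, used only for illustration
root = "data"
full_path = os.path.join(root, "images", "bee1.jpg")

# The separator between the pieces is os.sep: '/' on Linux and macOS, '\\' on Windows
print(full_path)
```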

Create a new code cell (move your mouse pointer directly below this cell and click on the "+ Code" button) and execute the command `df.head()`.  Create another code cell and run the command `df.tail()`.

**Question:** What do the `df.head()` and `df.tail()` methods do?  How many records are in the dataset?

**Your answer:**

**Question:** Find out what the `tqdm` module does.  Why do you think it's used in the cell below?

**Your answer:** 

In [None]:
from tqdm import tqdm
for idx in tqdm(df.index):
    df.loc[idx, 'path'] = df.loc[idx,'path'].replace('\\', '/')

**Question:** Why do we need to use the string `replace()` method here?  *Hint: use the `df.head()` function again and compare the output with the previous output.*

**Your answer:**

# Examine the data
The next cell produces a pie chart that shows the relative distribution of labels in the data.

In [None]:
counts = df['label'].value_counts()
labels = counts.index.tolist()
plt.pie(counts, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Unique values of the original data')
plt.show()

Perhaps you noticed in the output from `df.head()` that photos are marked with a "photo_quality" value.  A 0 value indicates a low(er) quality photo and a 1 indicates a high(er) quality photo.  We can see the relative distribution of these with a bar chart.

In [None]:
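# Take the tick labels and bar heights from the same value_counts() result
# so that each bar is guaranteed to carry its own label
counts = df['photo_quality'].value_counts().sort_index()
plt.bar(range(len(counts)), counts.values, tick_label=counts.index.tolist())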
plt.title('High quality photos in original data (1=high, 0=low)')

plt.show()
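
One detail worth checking when plotting category counts: `unique()` returns values in order of first appearance, while `value_counts()` sorts by frequency, so labels and bar heights taken from the two separate calls may not line up. The sketch below uses a small made-up Series (not the bee/wasp data) to show the difference.

```python
import pandas as pd

# Made-up quality flags for illustration: two 0s and four 1s
quality = pd.Series([0, 1, 1, 1, 0, 1])

order_of_appearance = quality.unique().tolist()   # [0, 1]
counts = quality.value_counts()                   # sorted by frequency: 1 first, then 0

print(order_of_appearance)
print(counts.index.tolist(), counts.tolist())     # [1, 0] [4, 2]
```

Reading both the labels (`counts.index`) and the heights (`counts.values`) from the one `value_counts()` result keeps them aligned.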

Let's examine some of the images in the dataset to see what we're working with.

Study the following code, consulting the pandas and OpenCV (`cv2`) documentation as needed, until you have an overall sense of how the function works.  Don't forget that you can try these commands individually to see what they do.

In [None]:
def img_plot(df, root, label):
    """show the first 9 images that match the given label"""
    df = df.query('label == @label')
    imgs = []
    for path in df['path'][:9]:
        img = cv2.imread(os.path.join(root, path))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        imgs.append(img)
    f, ax = plt.subplots(3, 3, figsize=(15,15))
    for i, img in enumerate(imgs):
        ax[i//3, i%3].imshow(img)
        ax[i//3, i%3].axis('off')
        ax[i//3, i%3].set_title('label: %s' % label)
    plt.show()

# show the first 9 bees
img_plot(df, ROOT, 'bee')

Use additional `img_plot()` commands (each in its own code cell) to see samples from the other three classes.

# Create the train and test datasets
From the output of `df.head()` we see that the data is already partitioned into training, validation, and final_validation (i.e. testing) sets.  We're going to ignore that for now, and create our own training and test sets from the entire dataset.


In [None]:
def create_dataset(df, root, img_size):
    """Read dataset and convert images to img_size X img_size"""
    img_length = 3 * img_size * img_size
    imgs = []
    lbls = []
    for path, label in zip(tqdm(df['path']), df['label']):
        img = cv2.imread(os.path.join(root, path))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (img_size,img_size))
        imgs.append(np.array(img).reshape(img_length))
        lbls.append(label)
        
    imgs = np.array(imgs, dtype='float32')
    imgs = imgs / 255.0
    lbls = np.array(lbls)
    return imgs, lbls


In [None]:
X, y = create_dataset(df, ROOT, 128)
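
To get a feel for the shape of each row in `X`, the following sketch flattens a single synthetic image in the same way `create_dataset()` does. It is only an illustration: the random array stands in for a resized RGB photo, and the names `fake_img` and `flat` are hypothetical.

```python
import numpy as np

img_size = 128

# Synthetic stand-in for a resized RGB image (height x width x 3 channels, values 0..255)
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(img_size, img_size, 3), dtype=np.uint8)

# Flatten to one long vector and scale pixel values into the range [0, 1]
flat = fake_img.reshape(3 * img_size * img_size).astype("float32") / 255.0

print(flat.shape)  # (49152,)
```

With `img_size = 128`, each image becomes a vector of 3 × 128 × 128 = 49152 features, so `X` should have one such row per record in `df`.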

In [None]:
# Here we split the entire dataset into a training set (70%) and a test set (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
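
If you want to see what `train_test_split()` does without waiting on the full image set, here is a minimal sketch on toy data. The arrays `X_demo` and `y_demo` are made up for illustration; only the `test_size` and `random_state` arguments mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 4 features each
X_demo = np.arange(40, dtype="float32").reshape(10, 4)
y_demo = np.array(["bee", "wasp"] * 5)

# Hold out 30% of the samples for testing; random_state fixes the shuffle
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(Xd_train.shape, Xd_test.shape)  # (7, 4) (3, 4)
```

Because `random_state` is fixed, rerunning the cell produces the same split each time, which makes results easier to compare between runs.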

Final step (for now): generate true/false labels that indicate whether each image is a bee.

In [None]:
y_train_bee = y_train == 'bee'
y_test_bee = y_test == 'bee'
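
Comparing a NumPy array of strings with a single string works element by element and yields a boolean array. A minimal sketch, using a short made-up label array rather than the real `y_train`:

```python
import numpy as np

# Made-up labels for illustration only
labels_demo = np.array(["bee", "wasp", "insect", "bee"])

# Element-wise comparison: True where the label is 'bee', False elsewhere
is_bee_demo = labels_demo == "bee"

print(is_bee_demo)        # [ True False False  True]
print(is_bee_demo.sum())  # 2 (True counts as 1)
```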

# Time to classify
You've got training and testing sets so it's time to start classifying!