<a href="https://colab.research.google.com/github/xslittlemaggie/Deep-Learning-Machine-Learning-Projects/blob/master/Details_of_importing_dataset_from_Kaggle_and_organizing_train_validation_files_honey_bee_pollen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 0: Import libraries

In [1]:
# libraries for files
import os
import glob
import cv2

# libraries for image processing and NN models
import numpy as np
import pandas as pd
import random

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from tqdm import tqdm
from keras.preprocessing.image import ImageDataGenerator
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout
from tensorflow.keras.layers import Dense, Activation, Flatten
from keras.optimizers import SGD, Adam, RMSprop  # the class is named Adam; the lowercase alias is missing in newer Keras releases

Using TensorFlow backend.


## Step 1. Load data from Kaggle

#### 1. set the Kaggle API username and key (please enter your own Kaggle username and key)

In [0]:
os.environ['KAGGLE_USERNAME'] = "maggie" # username from the kaggle.json file
os.environ['KAGGLE_KEY'] = "7adfc6c4e6c5eec087031fbb7397aee5" # key from the kaggle.json file (this key is not valid; replace it with your own)
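
If you would rather not type the key into the notebook, one option is to read it from the `kaggle.json` file itself. This is only a sketch: the helper name `load_kaggle_credentials` is made up for illustration, and it assumes you have uploaded your own `kaggle.json` (with its usual `username` and `key` fields) somewhere readable.

```python
import json
import os


def load_kaggle_credentials(json_path):
    """Copy the username and key from a kaggle.json file into os.environ.

    Hypothetical helper: json_path is wherever you saved your own kaggle.json.
    """
    with open(json_path) as f:
        creds = json.load(f)
    # These are the same two environment variables set by hand in the cell above.
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```

For example, `load_kaggle_credentials("/content/kaggle.json")` would work if the file was uploaded to `/content/` (that path is an assumption, not something this notebook creates).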

In [0]:
!pip install -q kaggle

#### 2. search for the related datasets

In [4]:
!kaggle datasets list -s honey-bee-pollen  # list the Kaggle datasets that match the search term "honey-bee-pollen"

ref                                          title                                size  lastUpdated          downloadCount  
-------------------------------------------  -----------------------------------  ----  -------------------  -------------  
ivanfel/honey-bee-pollen                     Honey Bee pollen                     10MB  2018-11-20 16:06:11           1145  
jenny18/open-source-bee-hive-labeled-images  Open Source Bee Hive Labeled Images  31MB  2018-09-08 01:06:10            156  


#### 3. download dataset   
**ivanfel/honey-bee-pollen**

In [5]:
!kaggle datasets download -d ivanfel/honey-bee-pollen -p /content/

Downloading honey-bee-pollen.zip to /content
 50% 5.00M/9.98M [00:00<00:00, 24.8MB/s]
100% 9.98M/9.98M [00:00<00:00, 39.5MB/s]


#### 4. check the data file name in the Files panel on the left

The file name is honey-bee-pollen.zip, so we need to unzip this file and check what the data inside looks like.

In [0]:
!unzip -q /content/honey-bee-pollen.zip -d /content/honey-bee-pollen/
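
The same step can also be done from Python, which makes it easy to peek at the archive contents at the same time. A minimal sketch using the standard `zipfile` module; the function name `extract_and_list` and the `preview` parameter are my own, not part of the dataset or the Kaggle tooling.

```python
import zipfile


def extract_and_list(zip_path, dest_dir, preview=5):
    """Extract zip_path into dest_dir and return the first few member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()      # every file and folder stored in the archive
        archive.extractall(dest_dir)    # same effect as `unzip -d dest_dir`
    return names[:preview]
```

Calling `extract_and_list("/content/honey-bee-pollen.zip", "/content/honey-bee-pollen/")` should return a handful of paths, which is enough to see how the archive is organized.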

#### 4. create train/validation folders to store the training and validation data

In [0]:
# create the train and validation folders
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/train")
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/validation")

In [0]:
# create NP and P subfolders under the train and validation folders respectively

# 1. create the NP and P subfolders under train
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/train/NP")
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/train/P")

# 2. create the NP and P subfolders under validation
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/validation/NP")
os.mkdir("/content/honey-bee-pollen/PollenDataset/images/validation/P")
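
Note that `os.mkdir` raises an error if a folder already exists, so re-running the cells above fails the second time. A small sketch of a re-runnable alternative with `os.makedirs(..., exist_ok=True)`; the helper name `make_split_dirs` and its default arguments are assumptions that mirror the folder names used here.

```python
import os


def make_split_dirs(root, splits=("train", "validation"), labels=("NP", "P")):
    """Create root/<split>/<label> for every combination; safe to call repeatedly."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(root, split, label)
            os.makedirs(path, exist_ok=True)  # also creates the parent split folder
            created.append(path)
    return created
```

With `root` set to `/content/honey-bee-pollen/PollenDataset/images`, this creates the same four folders as the cells above.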

#### 5. Move all images beginning with NP to the NP folder under train, and all images beginning with P to the P folder under train

Later we can randomly move 10% of the images from NP under train to NP under validation, and 10% of the images from P under train to P under validation.

You can use other methods to do this split.

It seems that this dataset has no images for testing: I didn't see any unlabelled images. Maybe I missed some information, so please check.

In [9]:
# check one image to see what the file path looks like
paths = glob.glob('/content/honey-bee-pollen/PollenDataset/images/*.jpg')
for path in paths:
  head, tail = os.path.split(path)
  print("head:", head)
  print("tail:", tail)
  break

head: /content/honey-bee-pollen/PollenDataset/images
tail: P53827-50r.jpg


From the output above, the tail of the path (the file name) carries the label, P or NP.

Next, I will check the first letter of each file name: if it is N, the image moves to the NP folder; if it is P, it moves to the P folder.

In [0]:
# move all images beginning with NP to train/NP, and all images beginning with P to train/P
image_dir = '/content/honey-bee-pollen/PollenDataset/images'
paths = glob.glob(os.path.join(image_dir, '*.jpg'))
for path in paths:
  head, tail = os.path.split(path)
  if tail[:1] == 'N':
    new_path = os.path.join(image_dir, 'train', 'NP', tail)
  elif tail[:1] == 'P':
    new_path = os.path.join(image_dir, 'train', 'P', tail)
  else:
    continue  # skip files without an NP/P prefix instead of reusing a stale new_path
  os.rename(path, new_path)
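
The labelling rule can also be pulled out into a tiny function, so it can be checked on a few file names before any file is moved. This is just a sketch; the name `label_for` is hypothetical, and it assumes (as described above) that non-pollen images start with `NP` and pollen images start with `P`.

```python
def label_for(filename):
    """Return 'NP' or 'P' based on the file name prefix, or None if neither matches."""
    # Check 'NP' explicitly; names starting with 'P' are the pollen class.
    if filename.startswith("NP"):
        return "NP"
    if filename.startswith("P"):
        return "P"
    return None
```

For the file printed earlier, `label_for("P53827-50r.jpg")` returns `"P"`.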

If you have run the code up to this point, you can open the Files panel and see that all images have been moved into the NP and P folders under train, based on their file names.

Next we want to randomly move 10% of the images from NP under train to NP under validation,
and 10% of the images from P under train to P under validation.

#### 6. check the number of images in the NP and P folders

In [12]:
total_NP_iamges = os.listdir("/content/honey-bee-pollen/PollenDataset/images/train/NP")
print("The total number of images labelled NP is: ", len(total_NP_iamges))

total_P_iamges = os.listdir("/content/honey-bee-pollen/PollenDataset/images/train/P")
print("The total number of images labelled P is: ", len(total_P_iamges))

The total number of images labelled NP is:  345
The total number of images labelled P is:  369
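
With these counts, it is easy to work out how many images a 10% split would move. A quick sketch of the arithmetic; `validation_size` is a name I made up, and rounding down with `int` is one possible choice, not the only one.

```python
def validation_size(n_images, fraction=0.1):
    """Number of images to move to validation, rounded down."""
    return int(n_images * fraction)


# Class counts taken from the output above.
counts = {"NP": 345, "P": 369}
for label, n in counts.items():
    n_val = validation_size(n)
    print(label, "validation:", n_val, "train:", n - n_val)
```

This gives 34 validation and 311 training images for NP, and 36 validation and 333 training images for P.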


#### 7. move 10% of images to validation dataset

In [0]:
# move a random 10% of the NP images to validation/NP
NP_paths = glob.glob('/content/honey-bee-pollen/PollenDataset/images/train/NP/*.jpg')
val_NP = random.sample(NP_paths, k = int(0.1 * len(NP_paths)))  # change 0.1 to move a different share

for path in val_NP:
  head, tail = os.path.split(path)
  new_path = "/content/honey-bee-pollen/PollenDataset/images/validation/NP/" + tail
  os.rename(path, new_path)

# move a random 10% of the P images to validation/P
P_paths = glob.glob('/content/honey-bee-pollen/PollenDataset/images/train/P/*.jpg')
val_P = random.sample(P_paths, k = int(0.1 * len(P_paths)))  # change 0.1 to move a different share

for path in val_P:
  head, tail = os.path.split(path)
  new_path = "/content/honey-bee-pollen/PollenDataset/images/validation/P/" + tail
  os.rename(path, new_path)
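
Because the sample is random, every run picks different validation images. If you want the same split each time, one option is to seed the random generator. A hedged sketch: `move_random_fraction`, its parameters, and the default seed are all my own choices for illustration.

```python
import os
import random
import shutil


def move_random_fraction(src_dir, dst_dir, fraction=0.1, seed=0):
    """Move a reproducible random fraction of the files in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    # Sort first: os.listdir order is not guaranteed, and the seed only
    # reproduces the same pick if the input order is the same.
    names = sorted(os.listdir(src_dir))
    rng = random.Random(seed)
    chosen = rng.sample(names, k=int(len(names) * fraction))
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

The function would be called once per class, for example with the `train/NP` folder as `src_dir` and the `validation/NP` folder as `dst_dir`.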

Now we have successfully moved the images into the **train/NP** and **train/P** folders, and the **validation/NP** and **validation/P** folders.

You can check the results in the Files panel on the left.
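
Instead of checking by eye, the result can also be counted in code. A minimal sketch, assuming the folder layout created above; `count_images` is a hypothetical helper and only counts files ending in `.jpg`.

```python
import os


def count_images(root, splits=("train", "validation"), labels=("NP", "P")):
    """Return {(split, label): number of .jpg files} for each split/label folder."""
    counts = {}
    for split in splits:
        for label in labels:
            folder = os.path.join(root, split, label)
            counts[(split, label)] = sum(
                1 for name in os.listdir(folder) if name.lower().endswith(".jpg")
            )
    return counts
```

Calling `count_images("/content/honey-bee-pollen/PollenDataset/images")` returns one count per folder, and the train and validation counts for each label should add up to the totals printed in step 6.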