# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle and prepare it for further processing.

## Inputs
* [https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor](https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor)
* Kaggle JSON file (kaggle.json): the authentication token.

## Outputs

* Generate Dataset: 
    * input/
    * ├── Brain Tumor/ (Image files)
    * ├── Brain Tumor.csv
    * └── bt_dataset_t3.csv

## Additional Comments | Insights | Conclusions

Brain Tumor Data Set
- This dataset includes the brain MRI image files and two CSV files.

- The CSV files contain the brain tumor feature dataset: five first-order features and eight texture features, plus the target label (in the column Class).

    - First Order Features
        - Mean
        - Variance
        - Standard Deviation
        - Skewness
        - Kurtosis

    - Second Order Features
        - Contrast
        - Energy
        - ASM (Angular second moment)
        - Entropy
        - Homogeneity
        - Dissimilarity
        - Correlation
        - Coarseness 

- The Image column holds the image name, and the Class column states whether the image shows a tumor (1 = Tumor, 0 = Non-Tumor). These two columns are the ones we will use to classify the images.
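As a quick sanity check on the two columns described above, the class balance can be counted before any files are moved. The following is a minimal sketch using only the standard library; the sample rows and the helper name `count_classes` are hypothetical and stand in for the real CSV file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample that mimics the Image and Class columns of the CSV file
sample_csv = io.StringIO(
    "Image,Class\n"
    "Image1,0\n"
    "Image2,1\n"
    "Image3,1\n"
)


def count_classes(csv_file):
    """Count rows per Class value (1 = Tumor, 0 = Non-Tumor)."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["Class"]) for row in reader)


class_counts = count_classes(sample_csv)
print(class_counts)
```

To run it against the real data, an open file handle for the CSV file can be passed in place of `sample_csv`.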



---

## Install packages

In [None]:
%pip install -r ../requirements.txt

# Change working directory

* The Jupyter notebooks are stored in a subfolder, so we need to change the working directory before the code runs.

We need to change the working directory from the notebook folder to its parent folder
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/home/tom/codeinstitute/brain-tumor-detect/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/home/tom/codeinstitute/brain-tumor-detect'
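Running the directory-change cell twice moves up two levels, which is easy to do by accident in a notebook. A minimal sketch of a guard against that is shown here; the helper name `move_to_project_root` is hypothetical, and it assumes the notebook folder is called `jupyter_notebooks`, as in the path printed above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent directory only when run from the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: the second call leaves the directory unchanged
print(move_to_project_root())
```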

## Setup Kaggle

### Install Kaggle

In [4]:
%pip install kaggle==1.5.12


[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Set up the Kaggle credentials

In [5]:
# Kaggle json file and directory setup
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
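The cell above assumes kaggle.json already sits in the project root. A minimal sketch of the same setup in plain Python, with an explicit check that the token exists, could look like this; the helper name `prepare_kaggle_token` is hypothetical, and the permission call assumes a POSIX system.

```python
import os
import stat


def prepare_kaggle_token(config_dir, token_name="kaggle.json"):
    """Point the Kaggle CLI at config_dir and restrict the token to its owner."""
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In the notebook it would be called as `prepare_kaggle_token(os.getcwd())`, which fails early with a clear message if the token has not been uploaded yet.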

Define the Kaggle download settings and download the dataset

In [12]:
KAGGLE_DATASET_URL = 'jakeshbohaju/brain-tumor'
DESTINATION_FOLDER = 'input/'
! kaggle datasets download -d $KAGGLE_DATASET_URL -p $DESTINATION_FOLDER


Downloading brain-tumor.zip to input
100%|██████████████████████████████████████| 14.0M/14.0M [00:01<00:00, 12.7MB/s]
100%|██████████████████████████████████████| 14.0M/14.0M [00:01<00:00, 9.78MB/s]


Unzip the downloaded file, and delete the zip file.

In [16]:
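import zipfile

# DESTINATION_FOLDER already ends with '/', so build the path with os.path.join
zip_path = os.path.join(DESTINATION_FOLDER, 'brain-tumor.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DESTINATION_FOLDER)

# Delete the archive once its contents are extracted
os.remove(zip_path)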
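If the download was interrupted, the zip file may be missing or corrupt, and the extraction fails with a less helpful error. The unzip-and-delete step can be wrapped in a small helper that checks the archive first. This is a sketch under that assumption; the helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    # is_zipfile returns False for a missing file as well as an invalid one
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        extracted = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return extracted
```

The returned list of member names gives a quick view of what the archive contained without a separate `ls` call.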

Rename directories and files

In [17]:
! ls input/

'Brain Tumor'  'Brain Tumor.csv'   bt_dataset_t3.csv


In [18]:
! mv 'input/Brain Tumor.csv' input/brain-tumor.csv
! mv input/Brain\ Tumor/ input/brain-tumor/
! mv input/brain-tumor/Brain\ Tumor/ input/brain-tumor/brain-tumor/
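The `mv` commands above depend on a Unix shell. A portable sketch of the same naming rule (lowercase, spaces replaced by hyphens) in plain Python follows; the helper names `slugify_name` and `slugify_entries` are hypothetical, and the sketch only renames entries directly under the given directory, so the nested Brain Tumor folder would need a second call.

```python
import os


def slugify_name(name):
    """Lowercase a file or directory name and replace spaces with hyphens."""
    return name.lower().replace(" ", "-")


def slugify_entries(directory):
    """Rename each entry directly under directory to its slug form."""
    renamed = {}
    for entry in sorted(os.listdir(directory)):
        slug = slugify_name(entry)
        if slug != entry:
            os.rename(os.path.join(directory, entry), os.path.join(directory, slug))
            renamed[entry] = slug
    return renamed
```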

---

# Data Preparation

---

## Data Cleaning

1. Sort the image files into tumor and non-tumor directories
2. Remove non-image files
3. Remove empty directories

In [19]:
! ls input/brain-tumor

brain-tumor


In [20]:
# Change the dir structure of the input folder
! mkdir input/mri-brain-tumor/
! cp input/brain-tumor/brain-tumor/* input/mri-brain-tumor/
! rm -rf input/brain-tumor/ 

In [21]:
# Classify the images according to the target column 'Class'
import pandas as pd
df = pd.read_csv('input/brain-tumor.csv')

# Keep only the Image and Class columns in a new DataFrame
new_df = df[['Image', 'Class']]
new_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3762 entries, 0 to 3761
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Image   3762 non-null   object
 1   Class   3762 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 58.9+ KB


In [22]:
import shutil

# Create the target directories mri-tumor and mri-non-tumor in input
os.makedirs('input/mri-tumor/', exist_ok=True)
os.makedirs('input/mri-non-tumor/', exist_ok=True)

# Move each image into the directory that matches its class
for index, row in new_df.iterrows():
    image_file = row['Image'] + '.jpg'
    image_class = row['Class']
    if image_class == 0:
        # Class 0: non-tumor
        shutil.move('input/mri-brain-tumor/' + image_file, 'input/mri-non-tumor/')
    else:
        # Class 1: tumor
        shutil.move('input/mri-brain-tumor/' + image_file, 'input/mri-tumor/')
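The loop above stops with an error at the first image listed in the CSV file that is not on disk. A minimal sketch of a more forgiving variant, which records missing images instead of failing, is given here; the function name `sort_images_by_class` and its parameters are hypothetical.

```python
import os
import shutil


def sort_images_by_class(rows, source_dir, class_dirs, extension=".jpg"):
    """Move each image into the directory mapped to its class.

    rows is an iterable of (image_name, image_class) pairs.
    Returns the names of images that were not found in source_dir.
    """
    for directory in class_dirs.values():
        os.makedirs(directory, exist_ok=True)

    missing = []
    for image_name, image_class in rows:
        source_path = os.path.join(source_dir, image_name + extension)
        if not os.path.isfile(source_path):
            missing.append(image_name)
            continue
        shutil.move(source_path, class_dirs[image_class])
    return missing
```

With the DataFrame from the previous cell, the rows could be supplied as `new_df.itertuples(index=False, name=None)` and the mapping as `{0: 'input/mri-non-tumor', 1: 'input/mri-tumor'}`.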


In [23]:
# remove non image files and empty folders
! rm input/*.csv
! rm -rf input/mri-brain-tumor


In [24]:
os.listdir('input')


['mri-tumor', 'mri-non-tumor']

---

## Split train validation test set

In [25]:
import math
import os
import shutil
import random

# code adapted from the CI walkthrough project malaria detector
def split_dataset(input_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance: a sum of float ratios may not equal 1.0 exactly
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return
    
    labels = os.listdir(input_dir)  
    if 'test' in labels:
        pass
    else:
        # create train, test and validation folders with classes labels sub-folder
        for folder in ['train', 'test', 'validation']:
            os.makedirs(os.path.join(input_dir, folder))
            for label in labels:
                os.makedirs(os.path.join(input_dir, folder, label))

        for label in labels:
            images = os.listdir(os.path.join(input_dir, label))
            random.shuffle(images)
            train_set_size = int(len(images) * train_set_ratio)
            test_set_size = int(len(images) * test_set_ratio)
            validation_set_size = len(images) - train_set_size - test_set_size

            for image in images[:train_set_size]:
                shutil.move(os.path.join(input_dir, label, image), os.path.join(input_dir, 'train', label))
            for image in images[train_set_size:train_set_size + test_set_size]:
                shutil.move(os.path.join(input_dir, label, image), os.path.join(input_dir, 'test', label))
            for image in images[train_set_size + test_set_size:]:
                shutil.move(os.path.join(input_dir, label, image), os.path.join(input_dir, 'validation', label))

            os.rmdir(os.path.join(input_dir, label))
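Because `random.shuffle` is unseeded in the function above, each run produces a different split. If a reproducible split is wanted, one option is to sort the file names and shuffle them with a seeded generator. This is a sketch of that idea, not part of the original function; the helper name `shuffled_copy` and the default seed are hypothetical.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items that is identical for a given seed."""
    # Sort first: os.listdir does not guarantee any particular order
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Inside `split_dataset`, the two lines that list and shuffle the images could then become `images = shuffled_copy(os.listdir(os.path.join(input_dir, label)))`.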

A common convention is to split the data as follows:
* Training set: 0.70 of the data.
* Validation set: 0.10 of the data.
* Test set: 0.20 of the data.
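The ratios rarely divide a class folder evenly, so it helps to see how the sizes are derived. The sketch below mirrors the arithmetic in `split_dataset`: the training and test sizes are truncated with `int()`, and the validation set takes the remainder. The image count of 1,683 is a hypothetical example, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(image_count, train_set_ratio, test_set_ratio):
    """Return (train, validation, test) sizes for one class folder."""
    train_set_size = int(image_count * train_set_ratio)
    test_set_size = int(image_count * test_set_ratio)
    # Validation takes whatever is left, so no image is dropped by truncation
    validation_set_size = image_count - train_set_size - test_set_size
    return train_set_size, validation_set_size, test_set_size


# Hypothetical class folder with 1,683 images
print(split_sizes(1683, train_set_ratio=0.7, test_set_ratio=0.2))  # (1178, 169, 336)
```

Because of the truncation, the validation share ends up slightly above 0.10 in this example (169 of 1,683 images).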

In [26]:
split_dataset(input_dir='input', train_set_ratio=0.7, validation_set_ratio=0.1, test_set_ratio=0.2)
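After the split, it is worth confirming that every folder holds the expected number of images. A minimal sketch of such a check follows; the helper name `count_split_images` is hypothetical, and it assumes the train, validation and test folder layout created by `split_dataset`.

```python
import os


def count_split_images(input_dir, splits=("train", "validation", "test")):
    """Return {split: {label: image_count}} for the folders under input_dir."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(input_dir, split)
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts
```

Calling `count_split_images('input')` in the notebook returns a nested dictionary that can be printed or turned into a table for the next notebook.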

---

# Push files to Repo

* Data collection and cleaning are finished. You can push the files to the GitHub repository and close this notebook.
* Next: [Data Visualization Notebook](./02_data_visualization.ipynb)