# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs
* [https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor](https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor)
*   Kaggle JSON file - the authentication token. 

## Outputs

* Generate Dataset: 
    * input/
        * Brain Tumor/ (image files)
        * Brain Tumor.csv
        * bt_dataset_t3.csv

## Additional Comments | Insights | Conclusions

Brain Tumor Data Set
- This dataset includes the Brain MRI image files and two csv files.

- The csv files contain the brain tumor feature dataset, which includes five first-order features and eight texture features along with the target label (in the column Class).

    - First Order Features
        - Mean
        - Variance
        - Standard Deviation
        - Skewness
        - Kurtosis

    - Second Order Features
        - Contrast
        - Energy
        - ASM (Angular second moment)
        - Entropy
        - Homogeneity
        - Dissimilarity
        - Correlation
        - Coarseness 

- The Image column holds the image name, and the Class column indicates whether the image shows a tumor (1 = Tumor, 0 = Non-Tumor). These two features are the ones we will use when classifying the images.



---

## Import packages

In [1]:
%pip install -r ../requirements.txt

Collecting streamlit==0.85.0 (from -r ../requirements.txt (line 2))
  Using cached streamlit-0.85.0-py2.py3-none-any.whl (7.9 MB)
Collecting altair<5 (from -r ../requirements.txt (line 3))
  Using cached altair-4.2.2-py3-none-any.whl (813 kB)
Collecting astor (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting attrs (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting base58 (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting blinker (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Obtaining dependency information for blinker from https://files.pythonhosted.org/packages/fa/2a/7f3714cbc6356a0efec525ce7a0613d581072ed6eb53eb7b9754f33db807/blinker-1.7.0-py3-none-any.whl.metadata
  Using cached blinker-1.7.0-py3-none-any.whl.metadata (1.9 kB)
Collecting cachetools>=4.0 (

# Change working directory

* The Jupyter notebooks are stored in a subfolder, so we need to change the working directory before running the code

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/home/tom/codeinstitute/brain-tumor-detect/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/home/tom/codeinstitute/brain-tumor-detect'
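Running the directory-change cell a second time would move one level further up. A minimal sketch of a guard against that, assuming the notebook folder is named `jupyter_notebooks` as in the path printed above; the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent folder only if we are still in the notebook folder.

    `notebook_folder` is an assumed name taken from the path shown above.
    Returns the working directory after the (possible) change.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        # os.path.dirname() gives the parent; os.chdir() makes it current
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory unchanged after the first call, which keeps the cell safe to re-run.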

## Setup Kaggle

### Install Kaggle

In [5]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting tqdm (from kaggle==1.5.12)
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.1 text-unidecode-1.3 tqdm-4.66.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1

### Set up Kaggle details

In [15]:
# Kaggle json file and directory setup
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
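The cell above assumes `kaggle.json` is already in the working directory. A minimal sketch of the same setup in pure Python, with an explicit check that the token file exists; the helper name `prepare_kaggle_token` is hypothetical, and the permission change is meant for POSIX systems:

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at `config_dir` and restrict the token file permissions.

    Raises FileNotFoundError if kaggle.json is not in `config_dir`.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, the equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook the call would be `prepare_kaggle_token(os.getcwd())`.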

Kaggle download settings and download

In [16]:
KAGGLE_DATASET_URL = 'jakeshbohaju/brain-tumor'
DESTINATION_FOLDER = 'input/'
! kaggle datasets download -d $KAGGLE_DATASET_URL -p $DESTINATION_FOLDER


Downloading brain-tumor.zip to input
 79%|█████████████████████████████▉        | 11.0M/14.0M [00:00<00:00, 18.2MB/s]
100%|██████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 18.4MB/s]


Unzip the downloaded file, and delete the zip file.

In [17]:
import zipfile

# Build the zip path with os.path.join to avoid a doubled slash ('input//...')
zip_path = os.path.join(DESTINATION_FOLDER, 'brain-tumor.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DESTINATION_FOLDER)

os.remove(zip_path)
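If the download is interrupted, the zip file may be missing or incomplete. A minimal sketch of the unzip-and-delete step wrapped in a function that checks the archive first; the helper name `extract_and_remove` is hypothetical:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.

    Returns the list of member names that were extracted.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip file")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Delete the archive only after extraction has succeeded
    os.remove(zip_path)
    return names
```

Here the call would be `extract_and_remove(os.path.join(DESTINATION_FOLDER, 'brain-tumor.zip'), DESTINATION_FOLDER)`.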

Rename directories and files

In [18]:
! ls input/

'Brain Tumor'	  'Brain Tumor.csv'    mri-brain-tumor
 brain-tumor.csv   bt_dataset_t3.csv


In [19]:
! mv 'input/Brain Tumor.csv' input/brain-tumor.csv
! mv input/Brain\ Tumor/ input/brain-tumor/
! mv input/brain-tumor/Brain\ Tumor/ input/brain-tumor/brain-tumor/

---

# Data Preparation

---

## Data Cleaning

1. Sort the image files into tumor and non-tumor directories
2. Remove non-image files

In [11]:
! ls input/brain-tumor

brain-tumor


In [20]:
# Change the dir structure of the input folder
! mkdir input/mri-brain-tumor/
! cp input/brain-tumor/brain-tumor/* input/mri-brain-tumor/
! rm -rf input/brain-tumor/ 

mkdir: cannot create directory ‘input/mri-brain-tumor/’: File exists


In [23]:
# classify images according to the target 'Class'
import pandas as pd
df = pd.read_csv('input/brain-tumor.csv')

# keep only the Image and Class columns in a new DataFrame
new_df = df[['Image', 'Class']]
new_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3762 entries, 0 to 3761
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Image   3762 non-null   object
 1   Class   3762 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 58.9+ KB


In [28]:
import shutil

# Create the mri-tumor and mri-non-tumor directories in input
os.makedirs('input/mri-tumor/', exist_ok=True)
os.makedirs('input/mri-non-tumor/', exist_ok=True)

# Move each image into the directory that matches its class
for index, row in new_df.iterrows():
    image_file = row['Image'] + '.jpg'
    image_class = row['Class']
    if image_class == 0:
        # Class 0: non-tumor
        shutil.move('input/mri-brain-tumor/' + image_file, 'input/mri-non-tumor/')
    else:
        # Class 1: tumor
        shutil.move('input/mri-brain-tumor/' + image_file, 'input/mri-tumor/')


Image1.jpg 0
Image2.jpg 0
Image3.jpg 1
Image4.jpg 1
Image5.jpg 0
Image6.jpg 0
Image7.jpg 0
Image8.jpg 0
Image9.jpg 0
Image10.jpg 1
Image11.jpg 1
Image12.jpg 1
Image13.jpg 1
Image14.jpg 0
Image15.jpg 0
Image16.jpg 1
Image17.jpg 1
Image18.jpg 0
Image19.jpg 0
Image20.jpg 0
Image21.jpg 0
Image22.jpg 0
Image23.jpg 0
Image24.jpg 0
Image25.jpg 0
Image26.jpg 1
Image27.jpg 0
Image28.jpg 0
Image29.jpg 0
Image30.jpg 0
Image31.jpg 0
Image32.jpg 1
Image33.jpg 1
Image34.jpg 1
Image35.jpg 0
Image36.jpg 0
Image37.jpg 1
Image38.jpg 0
Image39.jpg 0
Image40.jpg 0
Image41.jpg 0
Image42.jpg 0
Image43.jpg 0
Image44.jpg 0
Image45.jpg 0
Image46.jpg 0
Image47.jpg 0
Image48.jpg 0
Image49.jpg 0
Image50.jpg 0
Image51.jpg 0
Image52.jpg 1
Image53.jpg 1
Image54.jpg 1
Image55.jpg 1
Image56.jpg 0
Image57.jpg 1
Image58.jpg 1
Image59.jpg 1
Image60.jpg 1
Image61.jpg 0
Image62.jpg 0
Image63.jpg 0
Image64.jpg 0
Image65.jpg 0
Image66.jpg 0
Image67.jpg 0
Image68.jpg 0
Image69.jpg 1
Image70.jpg 0
Image71.jpg 0
Image72.jpg 0
I

Image420.jpg 0
Image421.jpg 0
Image422.jpg 0
Image423.jpg 0
Image424.jpg 0
Image425.jpg 0
Image426.jpg 0
Image427.jpg 0
Image428.jpg 0
Image429.jpg 0
Image430.jpg 0
Image431.jpg 0
Image432.jpg 0
Image433.jpg 0
Image434.jpg 0
Image435.jpg 0
Image436.jpg 0
Image437.jpg 1
Image438.jpg 1
Image439.jpg 0
Image440.jpg 0
Image441.jpg 0
Image442.jpg 0
Image443.jpg 0
Image444.jpg 0
Image445.jpg 0
Image446.jpg 0
Image447.jpg 0
Image448.jpg 1
Image449.jpg 0
Image450.jpg 0
Image451.jpg 0
Image452.jpg 1
Image453.jpg 1
Image454.jpg 1
Image455.jpg 0
Image456.jpg 1
Image457.jpg 0
Image458.jpg 0
Image459.jpg 1
Image460.jpg 0
Image461.jpg 0
Image462.jpg 0
Image463.jpg 0
Image464.jpg 1
Image465.jpg 1
Image466.jpg 1
Image467.jpg 0
Image468.jpg 0
Image469.jpg 0
Image470.jpg 0
Image471.jpg 0
Image472.jpg 0
Image473.jpg 0
Image474.jpg 0
Image475.jpg 0
Image476.jpg 0
Image477.jpg 1
Image478.jpg 0
Image479.jpg 1
Image480.jpg 1
Image481.jpg 1
Image482.jpg 1
Image483.jpg 1
Image484.jpg 1
Image485.jpg 0
Image486.j
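The move loop above stops with an error if an image listed in the csv file is not in the source folder, for example when the cell is run a second time. A minimal sketch of a variant that skips and reports missing files instead; the helper name `sort_images` and its parameters are hypothetical:

```python
import os
import shutil


def sort_images(rows, source_dir, tumor_dir, non_tumor_dir):
    """Move images into class folders, skipping files that are not found.

    `rows` is any iterable of (image_name, image_class) pairs, where
    class 0 means non-tumor and any other value means tumor.
    Returns (number_of_files_moved, list_of_missing_file_names).
    """
    os.makedirs(tumor_dir, exist_ok=True)
    os.makedirs(non_tumor_dir, exist_ok=True)
    moved, missing = 0, []
    for image_name, image_class in rows:
        image_file = image_name + ".jpg"
        source_path = os.path.join(source_dir, image_file)
        if not os.path.isfile(source_path):
            missing.append(image_file)
            continue
        target_dir = non_tumor_dir if image_class == 0 else tumor_dir
        shutil.move(source_path, os.path.join(target_dir, image_file))
        moved += 1
    return moved, missing
```

With the DataFrame from this notebook, the rows could be supplied as `new_df.itertuples(index=False)`, which yields one (Image, Class) pair per row.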

In [29]:
! rm -rf input/mri-brain-tumor


In [None]:
# remove non-image files
def remove_non_image_files(directory):
    total_deleted = 0
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(directory)
    for folder in folders:
        folder_path = os.path.join(directory, folder)
        # skip plain files such as the csv files stored directly in input/
        if not os.path.isdir(folder_path):
            continue
        files = os.listdir(folder_path)
        image_files = []
        non_image_files = []
        for file in files:
            file_path = os.path.join(folder_path, file)

            if file.lower().endswith(image_extension):
                image_files.append(file_path)
            else:
                os.remove(file_path)  # remove non-image file
                non_image_files.append(file_path)
                total_deleted += 1

        print(f'Folder: {folder} - has {len(image_files)} image file(s) ')
        print(f'Folder: {folder} - has {len(non_image_files)} non-image file(s) ')
    print('-------------------------')
    print(f'Total file(s) deleted: {total_deleted}')


remove_non_image_files('input/')
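After cleaning, it can help to confirm how many images ended up in each class folder. A minimal sketch, assuming the same image extensions as the cleaning step above; the helper name `count_images` is hypothetical:

```python
import os


def count_images(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Return {subfolder_name: number_of_image_files} for `directory`.

    Plain files stored directly in `directory` (such as csv files) are ignored.
    """
    counts = {}
    for entry in sorted(os.listdir(directory)):
        folder_path = os.path.join(directory, entry)
        if not os.path.isdir(folder_path):
            continue
        counts[entry] = sum(
            1
            for name in os.listdir(folder_path)
            if name.lower().endswith(extensions)
        )
    return counts
```

Calling `count_images('input/')` would return one count per class folder. The total can then be compared with the 3762 rows reported for the csv file.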

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should be run top-down (you can't create a workflow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block stays valid until a folder is created
except Exception as e:
  print(e)
