# Classification Of Brain Tumors - Data Wrangling

This notebook documents:

1. How the data for the Train and Test sets was collected, i.e., the data sources.
2. How the data has been organized for training and testing the model.
3. What type of data is being used for the classification.
4. How the data was cleaned.

## Step# 1: Downloading Data from Data Sources.

The data used in this problem was downloaded from the locations below and is already available for further processing:

1. https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection?
2. https://www.kaggle.com/sartajbhuvaji/brain-tumor-classification-mri?


## Step# 2: Organizing the data into a logical folder structure that the system will read and process.

The downloaded data has been divided into 2 categories:

1. Yes - Containing images of patients with a brain tumor.
2. No - Containing images of patients with no tumor.

These two folders have been placed inside a 'Data' folder.

The data for this problem will have the following folder structure:

    Work_Dir
    |
    |
    |----> Data
    |       |
    |       |
    |       |----> No
    |       |
    |       |
    |       |----> Yes
    |
    |
    |----> Train
    |       |
    |       |
    |       |----> No
    |       |
    |       |
    |       |----> Yes
    |
    |
    |----> Test
    |       |
    |       |
    |       |----> No
    |       |
    |       |
    |       |----> Yes
    
The cells below create the required folder structure.

In [1]:
#Importing the Required Python Packages
import os
import shutil
import glob

Printing the Current Working Directory

In [2]:
print(os.getcwd())

d:\Learning\Springboard\GitHub\Classification-Brain-Tumors


Printing the content of the Working Directory

In [3]:
print(os.listdir())

['.git', 'Capstone 2 Project Proposal.pdf', 'Data', 'Data Wrangling.ipynb', 'README.md']


The Data folder is already present here, so only the remaining folder structure needs to be created.

In [4]:
#Reading the current path in a variable
path = os.getcwd()

In [5]:
#Creating a folder by name Train
os.mkdir(os.path.join(path, 'Train'))

In [6]:
#Creating subfolders Yes and No inside the Train folder
os.mkdir(os.path.join(path,'Train','Yes'))
os.mkdir(os.path.join(path,'Train','No'))

In [7]:
#Creating a folder by name Test
os.mkdir(os.path.join(path, 'Test'))

In [8]:
#Creating subfolders Yes and No inside the Test folder
os.mkdir(os.path.join(path,'Test','Yes'))
os.mkdir(os.path.join(path,'Test','No'))
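
Note that `os.mkdir` raises `FileExistsError` when the folder is already there, so re-running the cells above fails once the structure exists. A minimal sketch of a re-runnable alternative, assuming the same Train/Test and Yes/No layout, is shown here. The helper name `make_split_folders` and the temporary demo directory are illustrative assumptions and are not part of the original notebook.

```python
import os
import tempfile

def make_split_folders(base_dir, splits=("Train", "Test"), labels=("Yes", "No")):
    """Create base_dir/<split>/<label> for every combination; return the paths."""
    created = []
    for split in splits:
        for label in labels:
            folder = os.path.join(base_dir, split, label)
            # exist_ok=True turns the call into a no-op if the folder already exists
            os.makedirs(folder, exist_ok=True)
            created.append(folder)
    return created

# Demonstration in a throwaway directory, so the project folders stay untouched
demo_dir = tempfile.mkdtemp()
folders = make_split_folders(demo_dir)
folders_again = make_split_folders(demo_dir)  # second call does not raise
```

In the notebook itself, the call would be `make_split_folders(path)`.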

## Step# 3: Printing out all the different types of Image data we are going to work with.

In [9]:
imageType = []
#Fetching the different types of Images present in Data/Yes folder
for fl in os.listdir(os.path.join(path,'Data','Yes')):
    filename, fileExtension = os.path.splitext(fl)
    if(fileExtension not in imageType):
        imageType.append(fileExtension)

#Fetching the different types of Images present in Data/No folder
for fl in os.listdir(os.path.join(path,'Data','No')):
    filename, fileExtension = os.path.splitext(fl)
    if(fileExtension not in imageType):
        imageType.append(fileExtension)

#Printing the different types of Images present in the Dataset
print(imageType)

['.jpg', '.JPG', '.png', '.jpeg']
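
The list above shows which extensions occur but not how many files use each one, and it treats `.jpg` and `.JPG` as different types. A small sketch that counts files per extension, ignoring case, could look like this. The helper `count_extensions` and the demo file names are made up for illustration.

```python
import os
import tempfile
from collections import Counter

def count_extensions(folder):
    """Return a Counter mapping lower-cased file extension -> number of files."""
    counts = Counter()
    for name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, name)):
            _, ext = os.path.splitext(name)
            counts[ext.lower()] += 1
    return counts

# Demonstration on a temporary folder with made-up, empty files
ext_demo_dir = tempfile.mkdtemp()
for name in ("scan_a.jpg", "scan_b.JPG", "scan_c.png", "scan_d.jpeg"):
    open(os.path.join(ext_demo_dir, name), "wb").close()

ext_counts = count_extensions(ext_demo_dir)
print(ext_counts)
```

In the notebook, this could be applied to `os.path.join(path, 'Data', 'Yes')` and `os.path.join(path, 'Data', 'No')`.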


## Step# 4: Cleaning Data

In [10]:
# Creating a copy of the Raw data by copying all the images from Data/Yes and Data/No folder to Train/Yes and Train/No Folders
dataPath = os.path.join(path,'Data')
trainPath = os.path.join(path,'Train')
for fl in os.listdir(os.path.join(dataPath,'Yes')):
    shutil.copy(os.path.join(dataPath, 'Yes', fl),os.path.join(trainPath,'Yes'))

for fl in os.listdir(os.path.join(dataPath,'No')):
    shutil.copy(os.path.join(dataPath, 'No', fl),os.path.join(trainPath,'No'))

In [11]:
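# Rename the files to a sequence. Format: Y_seq# or N_seq#
# Give every file the common extension .JPG
# Note: os.rename only changes the file name. The image data is not re-encoded,
# so a file that was saved as PNG is still PNG-encoded after this step.
# glob.glob returns the complete list up front, so the loop does not depend on
# how a lazy iterator behaves while files in the same folder are being renamed.
count = 1
for fl in glob.glob(os.path.join(trainPath,'Yes','*.*')):
    os.rename(fl, os.path.join(trainPath, 'Yes', 'Y_' + str(count) + '.JPG'))
    count = count + 1
count = 1
for fl in glob.glob(os.path.join(trainPath,'No','*.*')):
    os.rename(fl, os.path.join(trainPath, 'No', 'N_' + str(count) + '.JPG'))
    count = count + 1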
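
Renaming a file changes its extension but not its contents, so the `.JPG` suffix does not guarantee JPEG-encoded data. The sketch below checks the leading signature bytes of a file to report its actual encoding. The function name `sniff_image_format` and the demo file are assumptions for illustration, and only the JPEG and PNG signatures are covered.

```python
import os
import tempfile

# Leading signature ("magic") bytes of the two encodings seen in this dataset
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def sniff_image_format(file_path):
    """Return 'jpeg', 'png' or 'unknown' based on the file's first bytes."""
    with open(file_path, "rb") as fh:
        header = fh.read(8)
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    if header.startswith(PNG_MAGIC):
        return "png"
    return "unknown"

# Demonstration: a file carrying a PNG signature keeps it after being renamed
sniff_demo_dir = tempfile.mkdtemp()
png_path = os.path.join(sniff_demo_dir, "scan.png")
with open(png_path, "wb") as fh:
    fh.write(PNG_MAGIC + b"placeholder bytes, not a real image")

renamed_path = os.path.join(sniff_demo_dir, "Y_1.JPG")
os.rename(png_path, renamed_path)
format_after_rename = sniff_image_format(renamed_path)
print(format_after_rename)
```

Many image libraries identify the format from the file contents, so mixed encodings often load without trouble, but a check like this makes the mix visible before EDA.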

In [13]:
# Again Printing the different types of Files we are working with
imageType = []
#Fetching the different types of Images present in Train/Yes folder
for fl in os.listdir(os.path.join(trainPath,'Yes')):
    filename, fileExtension = os.path.splitext(fl)
    if(fileExtension not in imageType):
        imageType.append(fileExtension)

#Fetching the different types of Images present in Train/No folder
for fl in os.listdir(os.path.join(trainPath,'No')):
    filename, fileExtension = os.path.splitext(fl)
    if(fileExtension not in imageType):
        imageType.append(fileExtension)

#Printing the different types of Images present in the Dataset
print(imageType)

['.JPG']
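
The extension check confirms the renaming, but it does not show whether every image survived the copy and rename. One possible sanity check, sketched under the assumption that both roots contain `Yes` and `No` subfolders, compares file counts between the source and the copy. The helper names and the temporary demo folders are illustrative only.

```python
import os
import shutil
import tempfile

def count_files(folder):
    """Count regular files directly inside folder."""
    return sum(1 for name in os.listdir(folder)
               if os.path.isfile(os.path.join(folder, name)))

def compare_counts(source_root, copy_root, labels=("Yes", "No")):
    """Return {label: (source_count, copy_count)} for each class folder."""
    return {
        label: (count_files(os.path.join(source_root, label)),
                count_files(os.path.join(copy_root, label)))
        for label in labels
    }

# Demonstration with made-up files: 3 'Yes' images and 2 'No' images
root = tempfile.mkdtemp()
demo_data = os.path.join(root, "Data")
demo_train = os.path.join(root, "Train")
for label, n_files in (("Yes", 3), ("No", 2)):
    os.makedirs(os.path.join(demo_data, label))
    os.makedirs(os.path.join(demo_train, label))
    for i in range(n_files):
        src = os.path.join(demo_data, label, "img_" + str(i) + ".png")
        open(src, "wb").close()
        shutil.copy(src, os.path.join(demo_train, label))

report = compare_counts(demo_data, demo_train)
print(report)
```

In the notebook, `compare_counts(dataPath, trainPath)` would give the corresponding counts. A mismatch for either label would suggest that a file was overwritten or skipped.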


### At this point the Data Wrangling has been completed and the resulting data is now ready for EDA