# <center> An ML Classification Fun Trial on the Kvasir Dataset</center>


## 1) Dataset downloaded via the following commands
```bash
wget https://datasets.simula.no/kvasir/data/kvasir-dataset.zip
unzip kvasir-dataset.zip
```
Dataset description as per the website 
```
"Kvasir version 1

The first version of the Kvasir dataset (v1) consists of 4,000 images in 8 classes showing anatomical landmarks, phatological findings or endoscopic procedures in the GI tract, i.e., 500 images for each class. The anatomical landmarks are Z-line, pylorus and cecum, while the pathological finding are esophagitis, polyps and ulcerative colitis. In addition, we provide two set of images related to removal of polyps, the "dyed and lifted polyp" and the "dyed resection margins".

Kvasir Dataset v1

The kvasir-dataset.zip (size 1.2 GB) archive contains 4,000 images, 8 classes, 500 images for each class. The images are stored in the separate folders named accordingly to the name of the class images belongs to. The image files are encoded using JPEG compression. The encoding settings can vary across the dataset and they reflecting the a priori unknown endoscopic equipment settings. The extension of the image files is ".jpg".

Extracted Features (Kvasir Dataset v1)

The kvasir-dataset-features.zip (size 4.7 MB) archive contains the extracted visual feature descriptors for all the image from the Kvasir Dataset. The extracted visual features are stored in the separate folders and files named accordingly to the name and the path of the corresponding image files. The extracted visual features are the global image features, namely: JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram and PHOG. Each feature vector consists of a number of floating point values. The size of the vector depends on the feature. The size of the feature vectors are: 168 (JCD), 18 (Tamura), 33 (ColorLayout), 80 (EdgeHistogram), 256 (AutoColorCorrelogram) and 630 (PHOG). The extracted visual features are stored in the text files. Each file consists of eight lines, one line per each feature. Each line consists of a feature name separated from the feature vector by colon. Each feature vector consists of a corresponding number of floating point values separated by commas. The extension of the extracted visual feature files is ".features"."
```

## 2) Explore the data
We have 8 directories, each with 500 .jpg images.
Directories' names (classes' names): 
1. dyed-lifted-polyps
2. dyed-resection-margins
3. esophagitis
4. normal-cecum
5. normal-pylorus
6. normal-z-line
7. polyps
8. ulcerative-colitis
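
The directory layout described above can be checked with a short script. This is a minimal sketch: the helper `count_images_per_class` is a hypothetical name, and `kvasir-dataset` is an assumed extraction directory that may differ from the folder the archive actually creates.

```python
from pathlib import Path

def count_images_per_class(root):
    """Return {class_name: number of .jpg files} for each sub-directory of root."""
    root = Path(root)
    return {
        class_dir.name: sum(1 for _ in class_dir.glob('*.jpg'))
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }

# 'kvasir-dataset' is an assumed directory name; adjust it to the actual path
if Path('kvasir-dataset').is_dir():
    for class_name, n_images in count_images_per_class('kvasir-dataset').items():
        print(class_name, n_images)
```

If the layout matches the description, each of the 8 classes should report 500 images.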

## 3) Downloading and Exploring the feature files
Using the following _Bash_ commands:
```bash
wget https://datasets.simula.no/kvasir/data/kvasir-dataset-features.zip
unzip kvasir-dataset-features.zip
```
The data is organized in the same way: 8 directories, each with 500 files (plain-text files with comma-separated values, but with file names ending in `.features`).
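
To make the file layout concrete, here is a minimal sketch of parsing the text of one `.features` file, assuming each non-empty line has the form `name:v1,v2,...` as in the dataset description. The function `parse_features` and the sample text are hypothetical and are not taken from the dataset.

```python
def parse_features(text):
    """Parse the text of one .features file into {feature_name: [float, ...]}."""
    features = {}
    for line in text.splitlines():
        if ':' not in line:
            continue  # skip blank or malformed lines
        name, values = line.split(':', 1)
        features[name.strip()] = [float(v) for v in values.split(',') if v.strip()]
    return features

# Made-up two-line sample in the described layout (not real Kvasir data)
sample_text = "Tamura:1.5,2.0,3.25\nColorLayout:4.0,5.0\n"
parsed = parse_features(sample_text)
print({name: len(vector) for name, vector in parsed.items()})
# → {'Tamura': 3, 'ColorLayout': 2}
```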

## 4) Analysis

### Importing Needed Libraries


In [1]:
# importing libraries
import os
import glob
import pandas as pd

### Reading the data

In [2]:
! ls kvasir-dataset-features -1
! echo
! cat ./kvasir-dataset-features/dyed-lifted-polyps/0053d7cd-549c-48cd-b370-b4ad64a8098a.features | cut -d, -f 1 | cut -d: -f 1

dyed-lifted-polyps
dyed-resection-margins
esophagitis
normal-cecum
normal-pylorus
normal-z-line
polyps
ulcerative-colitis

JCD
Tamura
ColorLayout
EdgeHistogram
AutoColorCorrelogram
PHOG


In [None]:
path = './kvasir-dataset-features/'
data_classes = sorted(os.listdir(path))

# feature names in the order their lines appear in each .features file
feature_names = ['JCD', 'Tamura', 'ColorLayout', 'EdgeHistogram', 'AutoColorCorrelogram', 'PHOG']

# collect one pd.Series per sample for each of the 6 features;
# DataFrame.append was removed in pandas 2.0 and copies the whole frame on every call
rows = {feature_name: [] for feature_name in feature_names}

for data_class in data_classes:
    # loop over the 8 directories
    sub_path = os.path.join(path, data_class)

    for file_path in glob.iglob(os.path.join(sub_path, '*.features')):
        # loop over the 500 samples in each directory
        file_name = os.path.basename(file_path)
        sample_name = data_class + '/' + file_name.split('.features')[0]

        with open(file_path) as file_object:
            # read one line per feature: "<name>:<comma-separated values>"
            for feature_name in feature_names:
                values = file_object.readline().split(':')[1].rstrip('\n').split(',')
                rows[feature_name].append(pd.Series(values, name=sample_name))

# build the 6 dataframes, one row per sample
JCD_df = pd.DataFrame(rows['JCD'])
Tamura_df = pd.DataFrame(rows['Tamura'])
ColorLayout_df = pd.DataFrame(rows['ColorLayout'])
EdgeHistogram_df = pd.DataFrame(rows['EdgeHistogram'])
AutoColorCorrelogram_df = pd.DataFrame(rows['AutoColorCorrelogram'])
PHOG_df = pd.DataFrame(rows['PHOG'])


In [None]:
# Saving each dataframe to a .csv file for easier reading later
# (-p avoids an error if the directory already exists)
! mkdir -p data
JCD_df.to_csv(r'./data/JCD_df.csv', header=False)
Tamura_df.to_csv(r'./data/Tamura_df.csv', header=False)
ColorLayout_df.to_csv(r'./data/ColorLayout_df.csv', header=False)
EdgeHistogram_df.to_csv(r'./data/EdgeHistogram_df.csv', header=False)
AutoColorCorrelogram_df.to_csv(r'./data/AutoColorCorrelogram_df.csv', header=False)
PHOG_df.to_csv(r'./data/PHOG_df.csv', header=False)

In [None]:
# Reading from the .csv files previously saved
# header=None: the files were written without a header row
# index_col=0: the first column holds the sample names written as the index
JCD_df = pd.read_csv(r'./data/JCD_df.csv', header=None, index_col=0)
Tamura_df = pd.read_csv(r'./data/Tamura_df.csv', header=None, index_col=0)
ColorLayout_df = pd.read_csv(r'./data/ColorLayout_df.csv', header=None, index_col=0)
EdgeHistogram_df = pd.read_csv(r'./data/EdgeHistogram_df.csv', header=None, index_col=0)
AutoColorCorrelogram_df = pd.read_csv(r'./data/AutoColorCorrelogram_df.csv', header=None, index_col=0)
PHOG_df = pd.read_csv(r'./data/PHOG_df.csv', header=None, index_col=0)
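
A small round trip shows how a table written without a header row can be read back with its sample names as the index. This is only a sketch on made-up values held in an in-memory buffer; the sample names and numbers are hypothetical.

```python
import io
import pandas as pd

# Made-up feature table: index = "class/sample" names, columns = feature values
original_df = pd.DataFrame(
    [[0.5, 1.0, 2.0], [3.0, 4.5, 6.0]],
    index=['polyps/sample-a', 'esophagitis/sample-b'],
)

# Write without a header row; the index (sample names) becomes the first CSV column
buffer = io.StringIO()
original_df.to_csv(buffer, header=False)
buffer.seek(0)

# header=None: no header row in the file; index_col=0: first column holds the sample names
restored_df = pd.read_csv(buffer, header=None, index_col=0)

# read_csv labels the value columns 1, 2, 3, so compare the values and the index separately
values_match = (restored_df.to_numpy() == original_df.to_numpy()).all()
index_match = list(restored_df.index) == list(original_df.index)
print(values_match, index_match)
```

Note that the column labels shift by one after the round trip (1, 2, 3 instead of 0, 1, 2), because column 0 of the file is used as the index.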

In [None]:
# pd.concat([JCD_df.reset_index(drop=True), pd.Series(['dyed-lifted-polyps'], name='class_label'), pd.Series(['01d38b8f-74b2-4147-9519-448d05bf8745'], name='sample_name')], axis=1)
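
The commented-out line above appears to aim at attaching a class label and a sample name to each feature row. One possible way to do that, sketched on made-up stand-in tables, is to join the feature tables side by side and split the `class/sample` index into two columns. The variable names `tamura_df`, `color_df` and `features_df` and all values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for two of the feature tables, indexed by "class/sample" names
index = ['polyps/sample-a', 'esophagitis/sample-b']
tamura_df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=index)
color_df = pd.DataFrame([[5.0], [6.0]], index=index)

# Prefix the column names so they stay unique after joining the tables side by side
tamura_df = tamura_df.add_prefix('Tamura_')
color_df = color_df.add_prefix('ColorLayout_')
features_df = pd.concat([tamura_df, color_df], axis=1)

# Split "class/sample" into two label columns, aligned on the shared index
labels = features_df.index.to_series().str.split('/', n=1, expand=True)
features_df['class_label'] = labels[0]
features_df['sample_name'] = labels[1]
print(features_df)
```

Keeping the labels as ordinary columns makes it straightforward to pass `class_label` as the target and the remaining numeric columns as inputs to a classifier later on.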