## Goal 
### Identifying and localizing COVID-19 abnormalities on chest radiographs

## Reference  
* [YOLO ref1](https://www.kaggle.com/ayuraj/train-covid-19-detection-using-yolov5)  
* [YOLO ref2](https://www.kaggle.com/h053473666/siim-cov19-yolov5-train#YOLOv5)  
* [simple-tutorial](https://www.kaggle.com/yujiariyasu/catch-up-on-positive-samples-plot-submission-csv?scriptVersionId=63394385)  

## The Domain Knowledge
#### ★Ground glass opacities
Ground glass opacities (GGOs for short) indicate abnormalities in the lungs. "Ground glass opacities [are] a pattern that can be seen when the lungs are sick," says Dr. Cortopassi. She adds that, while normal lung CT scans appear black, an abnormal chest CT with GGOs will show lighter-colored or gray patches.

#### ★Opacity
The degree to which a region of the X-ray image is not transparent; opaque regions appear lighter on the radiograph.

#### 1. Typical Appearance  
Commonly reported imaging features of greater specificity for COVID-19 pneumonia.
#### 2. Atypical Appearance  
Uncommonly or not reported features of COVID-19 pneumonia.
#### 3. Indeterminate Appearance (uncertain)  
Nonspecific imaging features of COVID-19 pneumonia.
#### 4. Negative for Pneumonia

#### boxes
Bounding boxes in an easily readable dictionary format.
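
A minimal sketch of reading one `boxes` cell follows. It assumes each cell is a string holding a list of dictionaries with the keys `x`, `y`, `width` and `height`, and that images without boxes have a missing (non-string) value. `boxes_to_corners` is a hypothetical helper, not part of the competition data or any library.

```python
import ast


def boxes_to_corners(boxes_cell):
    """Convert one `boxes` cell into a list of (xmin, ymin, xmax, ymax) tuples."""
    # Missing values (for example NaN for images without boxes) are not strings.
    if not isinstance(boxes_cell, str):
        return []
    boxes = ast.literal_eval(boxes_cell)  # safely parse the dictionary list
    return [
        (b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"])
        for b in boxes
    ]


# Illustrative values, not taken from the dataset.
cell = "[{'x': 100.0, 'y': 50.0, 'width': 200.0, 'height': 300.0}]"
print(boxes_to_corners(cell))  # [(100.0, 50.0, 300.0, 350.0)]
```

Corner coordinates are convenient because drawing functions such as `cv2.rectangle` take two opposite corners.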

#### DICOM format
Any DICOM medical image consists of two parts: a header and the actual image itself. The header consists of data that describes the image, the most important being patient data. This includes the patient's demographic information such as the patient's name, age, gender, and date of birth.
(https://theaisummer.com/medical-image-coordinates/)

# Data

* train_study_level.csv - the train study-level metadata, with one row for each study, including correct labels.
* train_image_level.csv - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format.  
Some images in both test and train have multiple bounding boxes.
* sample_submission.csv - a sample submission file containing all image- and study-level IDs.
* train folder - comprises 6,334 chest scans in DICOM format, stored in paths with the form study/series/image
* test folder - The hidden test dataset is of roughly the same scale as the training dataset.

## EDA

The process of EDA:
1. Data visualization
2. Feature selection
3. Feature engineering
4. Filling in missing values

### Set up W&B
* Save training parameters
* Visualize image files

In [None]:
import wandb
import os
# os.environ is a mapping object that pairs each environment variable name (key) with its value
# print(os.environ)
# Authenticate the API connection to wandb
#!wandb login $api_key

In [None]:
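"""from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
# get_secret() takes the label of a secret stored in Kaggle Secrets, not the key itself.
# "wandb_api" is a placeholder label (assumption); use the label the key was saved under.
api_key = user_secrets.get_secret("wandb_api")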

os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'siim-fisabio-rsna', '_wandb_kernel': 'ruch'}"""

### Library

In [None]:
import pandas as pd
import pandas_profiling
import cv2
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pydicom 
import random
import albumentations as A
from sklearn.model_selection import train_test_split
from fastai.vision.all import *
from fastai.medical.imaging import *
from pydicom.pixel_data_handlers.util import apply_voi_lut

### Look at train_study_level.csv / train_image_level.csv

In [None]:
%cd input

In [None]:
train_study_df = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv")
train_image_data = pd.read_csv("../input/siim-covid19-detection/train_image_level.csv")

In [None]:
print(train_study_df.shape)
train_study_df.head()

### Show Distribution 
train_study_level.csv 

In [None]:
study_class = ["Negative for Pneumonia", "Typical Appearance","Indeterminate Appearance", "Atypical Appearance"]
plt.figure(figsize = (10,5))
plt.bar([1,2,3,4], train_study_df[study_class].values.sum(axis=0))
plt.xticks([1,2,3,4],study_class)
plt.ylabel('Frequency')
plt.show()

In [None]:
train_image_data.head()

We have our bounding box labels provided in the label column. The format is as follows:  
[class ID] [confidence score] [bounding box]
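
As a minimal sketch of how such a label string can be unpacked, the code below assumes each box takes six whitespace-separated tokens in the order class, confidence, xmin, ymin, xmax, ymax. `parse_label` is a hypothetical helper, and the second box in the example uses illustrative values.

```python
from collections import Counter


def parse_label(label):
    """Split a label string into a list of box dictionaries."""
    tokens = label.split()
    boxes = []
    # Each box occupies six consecutive tokens.
    for start in range(0, len(tokens), 6):
        name, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        boxes.append({
            "class": name,
            "conf": float(conf),
            "xmin": float(xmin),
            "ymin": float(ymin),
            "xmax": float(xmax),
            "ymax": float(ymax),
        })
    return boxes


example = ("opacity 1 789.28836 582.43035 1815.94498 2499.73327 "
           "opacity 1 2000.0 600.0 3000.0 2300.0")
boxes = parse_label(example)
class_counts = Counter(box["class"] for box in boxes)
print(len(boxes), class_counts)
```

Counting the `class` field over all rows gives the opacity versus none distribution.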

#### look at the distribution of opacity vs none:

In [None]:
train_image_data['split_label'] = train_image_data.label.apply(lambda x: [x.split()[offs:offs+6] for offs in range(0, len(x.split()), 6)])
# show the split_label
train_image_data['split_label'][:5]

In [None]:
train_image_data['split_label'].values[0]

In [None]:
classes_freq = []
for i in range(len(train_image_data)):
    for j in train_image_data.iloc[i].split_label: classes_freq.append(j[0])
plt.hist(classes_freq)
plt.ylabel('Frequency')

### target label distribution

In [None]:
image_data_path = "../input/siim-covid19-detection/train"

# show image
#cv2.imshow("image_data")

### look at the images 


In [None]:
# The pixel_array property returns the image data as a NumPy ndarray

def dicom2array(path, voi_lut=True, fix_monochrome=True):
    dicom = pydicom.dcmread(path)  # dcmread replaces the deprecated read_file
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data


def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    # is_rgb is accepted but not used
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
    
def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500, 500)):
    rows = len(imgs) // cols + 1
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()

In [None]:
dataset_path = Path('../input/siim-covid19-detection')
dicom_paths = get_dicom_files(dataset_path/'train')
imgs = [dicom2array(path) for path in dicom_paths[:4]]
plot_imgs(imgs)

#### Let's actually look at how many images are available per study:

In [None]:
num_images_per_study = []
for study_dir in (dataset_path/'train').ls():
    n_images = len(get_dicom_files(study_dir))  # scan each study folder only once
    num_images_per_study.append(n_images)
    if n_images > 5:
        print(f'Study {study_dir} had {n_images} images')

In [None]:
plt.hist(num_images_per_study)

#### look at images with bounding boxes applied

In [None]:
# Find the DICOM file that matches the row's image id
def image_path(row):
    study_path = dataset_path/'train'/row.StudyInstanceUID
    for i in get_dicom_files(study_path):
        # The stem attribute gives the file name without its extension
        if row.id.split('_')[0] == i.stem: return i
    

train_image_data['image_path'] = train_image_data.apply(image_path, axis=1)

In [None]:
train_image_data.head()

In [None]:
imgs = []
image_paths = train_image_data['image_path'].values
# ex ('../input/siim-covid19-detection/train/5776db0cec75/81456c9c5423/000a312787f2.dcm')

thickness = 10
scale = 5


for _ in range(8):
    # Use `path` so the image_path() function defined earlier is not overwritten
    path = random.choice(image_paths)
    print(path)
    img = dicom2array(path=path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    # Draw every opacity box, scaled down to match the resized image
    for box in train_image_data.loc[train_image_data['image_path'] == path].split_label.values[0]:
        if box[0] == 'opacity':
            img = cv2.rectangle(img, (int(float(box[2])/ scale), int(float(box[3])/ scale)),
                                     (int(float(box[4])/ scale), int(float(box[5])/ scale)),
                                     [0, 255, 0], thickness)
    img = cv2.resize(img, (500, 500))
    imgs.append(img)
            
plot_imgs(imgs, cmap=None)

Example of one `split_label` entry: `['opacity', '1', '789.28836', '582.43035', '1815.94498', '2499.73327']`

# Albumentations

sample1 https://propen.dream-target.jp/blog/python_albumentations  
sample2 https://qiita.com/Takayoshi_Makabe/items/79c8a5ba692aa94043f7

## Modeling

### Model: YOLOv5

Download the YOLOv5 repository into a temporary directory

In [None]:
%cd ../kaggle
#!mkdir tmp
%cd tmp

In [None]:
# Download YOLOv5
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
# Install dependencies
%pip install -qr requirements.txt
%cd ../
import torch
# Turn the GPU on when running training
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

Choose a model: YOLOv5s, YOLOv5m, YOLOv5l, or YOLOv5x (in increasing order of model size and GPU memory use).  
Use YOLOv5s first.

Prepare Folder 

* required structure for dataset directory

```
/parent_folder
    /dataset
         /images
             /train
             /val
         /labels
             /train
             /val
    /yolov5
```
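
Each file in the `labels` folders holds one line per box in the YOLO format `class x_center y_center width height`, with all four values normalized to the range 0 to 1. The sketch below shows that conversion from corner coordinates. `to_yolo_line` is a hypothetical helper, and the image size passed in is assumed to be the size of the image the box coordinates refer to.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a corner-format box into one YOLO label line."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Illustrative box on a 1000 x 500 image; class 1 is assumed to mean 'opacity'.
line = to_yolo_line(1, 100, 50, 300, 350, img_w=1000, img_h=500)
print(line)  # 1 0.200000 0.400000 0.200000 0.600000
```

Because the values are normalized, the same label line stays valid after the image is resized, provided the whole image is scaled without cropping or padding.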

## Hyperparameters Set

In [None]:
TRAIN_PATH = 'input/siim-covid19-resized-to-256px-jpg/train/'
IMG_SIZE = 256
BATCH_SIZE = 16
EPOCHS = 10

### Prepare Dataset

In [None]:
%cd ..

In [None]:
#%cd kaggle
df = pd.read_csv("input/siim-covid19-detection/train_image_level.csv")

In [None]:
df['id'] = df.apply(lambda row: row.id.split('_')[0], axis = 1)
df['path'] = df.apply(lambda row: TRAIN_PATH+row.id+'.jpg', axis=1)
df['image_level'] = df.apply(lambda row: row.label.split(' ')[0], axis=1)

#### Load meta.csv file

In [None]:
meta_df = pd.read_csv('input/siim-covid19-resized-to-256px-jpg/meta.csv')

In [None]:
train_meta_df = meta_df.loc[meta_df.split == 'train'] # keep only the train rows
train_meta_df = train_meta_df.drop('split', axis=1) # drop the split column, which is now constant
train_meta_df.columns = ['id', 'dim0', 'dim1']

In [None]:
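# Merge both dataframes so each row also carries the dim0 and dim1 columns from meta.csv
# (presumably the original image dimensions, which are needed to rescale bounding boxes)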
df = df.merge(train_meta_df, on='id', how="left")
df.head()

#### Train-Validation split

In [None]:
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df.image_level.values)

# ignore warning
train_df.loc[:, 'split'] = 'train'
valid_df.loc[:, 'split'] = 'valid'

df = pd.concat([train_df, valid_df]).reset_index(drop=True)

In [None]:
print(f'Size of dataset: {len(df)}, training images: {len(train_df)}. validation images: {len(valid_df)}')

#### prepare required folder structure

In [None]:
os.makedirs('tmp/covid/images/train', exist_ok=True)
os.makedirs('tmp/covid/images/valid', exist_ok=True)
! ls tmp/covid/images

In [None]:
# Copy the images to the relevant split folder.
from tqdm import tqdm
from shutil import copyfile
for i in tqdm(range(len(df))):
    row = df.loc[i]
    if row.split == 'train':
        copyfile(row.path, f'tmp/covid/images/train/{row.id}.jpg')
    else:
        copyfile(row.path, f'tmp/covid/images/valid/{row.id}.jpg')

### Create .yaml file

In [None]:
%cd tmp

In [None]:
%pwd

In [None]:
import yaml

data_yaml = dict(
    train = '../covid/images/train',
    val = '../covid/images/valid',
    nc = 2,
    names = ['none', 'opacity']
)

with open('data/data.yaml', 'w') as outfile:
    yaml.dump(data_yaml, outfile, default_flow_style=True)

%cat data/data.yaml

## Training

#### Use W&B

```
--img {IMG_SIZE} \ # Input image size.
--batch {BATCH_SIZE} \ # Batch size
--epochs {EPOCHS} \ # Number of epochs
--data data.yaml \ # Configuration file
--weights yolov5s.pt \ # Model name
--save_period 1\ # Save model after interval
--project kaggle-siim-covid # W&B project name
```

In [None]:
# Run training
"""
!python train.py --img {IMG_SIZE} \
                 --batch {BATCH_SIZE} \
                 --epochs {EPOCHS} \
                 --data data.yaml \
                 --weights yolov5s.pt \
                 --save_period 1\
                 --project kaggle-siim-covid
"""

### Inference

In [None]:
TEST_PATH = '/kaggle/input/siim-covid19-resized-to-256px-jpg/test/' # absolute path
MODEL_PATH = 'kaggle-siim-covid/exp/weights/best.pt'


```
--weights {MODEL_PATH} \ # path to the best model.
--source {TEST_PATH} \ # absolute path to the test images.
--img {IMG_SIZE} \ # Size of image
--conf 0.281 \ # Confidence threshold (default is 0.25)
--iou-thres 0.5 \ # IOU threshold (default is 0.45)
--max-det 3 \ # Number of detections per image (default is 1000) 
--save-txt \ # Save predicted bounding box coordinates as txt files
--save-conf # Save the confidence of prediction for each bounding box
```
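
With `--save-txt` and `--save-conf`, each prediction file holds one line per detection in the form `class x_center y_center width height confidence`, with normalized coordinates. The sketch below turns those lines into a prediction string in the `[class ID] [confidence score] [bounding box]` layout described earlier. `yolo_preds_to_string` and `NAMES` are hypothetical, the image size is assumed to be that of the original image, and the `none 1 0 0 1 1` fallback for images without detections is an assumption about the expected submission format.

```python
NAMES = ["none", "opacity"]  # assumed to match the order of `names` in data.yaml


def yolo_preds_to_string(lines, img_w, img_h):
    """Convert YOLO prediction lines into one prediction string."""
    parts = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        class_id = int(fields[0])
        xc, yc, w, h, conf = (float(v) for v in fields[1:])
        if NAMES[class_id] != "opacity":
            continue
        # Convert normalized center/size back to pixel corner coordinates.
        xmin = (xc - w / 2) * img_w
        ymin = (yc - h / 2) * img_h
        xmax = (xc + w / 2) * img_w
        ymax = (yc + h / 2) * img_h
        parts.append(f"opacity {conf:.3f} {xmin:.1f} {ymin:.1f} {xmax:.1f} {ymax:.1f}")
    # Assumed fallback when nothing was detected.
    return " ".join(parts) if parts else "none 1 0 0 1 1"


# Illustrative prediction line, not real model output.
preds = ["1 0.5 0.5 0.2 0.4 0.9"]
print(yolo_preds_to_string(preds, img_w=1000, img_h=2000))
```

Passing the original image size rather than `IMG_SIZE` keeps the boxes in the coordinate system of the original images.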

In [None]:
"""
!python detect.py --weights {MODEL_PATH} \
                  --source {TEST_PATH} \
                  --img {IMG_SIZE} \
                  --conf 0.281 \
                  --iou-thres 0.5 \
                  --max-det 3 \
                  --save-txt \
                  --save-conf
"""

## submit

In [None]:
submission_df = pd.read_csv(dataset_path/'sample_submission.csv')

In [None]:
submission_df.head()

In [None]:
submission_df.iloc[2000:2010]

In [None]:
submission_df.to_csv("submission.csv", index=False)