## Note
> **This notebook was created for the Portuguese version of a Data Science class at Awari School, so it was originally written with many comments in Portuguese. If you have any questions, please leave a comment.**

### Step-by-Step Video Playlist
- [Playlist link](https://loom.com/share/folder/8f3d5415a9fb4d37b8d6626d30b000b3)
- [Companion notebook used in the playlist](https://github.com/WittmannF/course/blob/master/day-4/assignment-3-cats-dogs-solved.ipynb)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
## 1. Reading and analyzing the CSV metadata
sample_submission = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')
test = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/test.csv')
train = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/train.csv')

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
train.describe(include='all')

In [None]:
test.describe(include='all')

In [None]:
cat_cols = ['patient_id', 'sex', 'anatom_site_general_challenge', 'diagnosis', 'benign_malignant']
print('Value counts for the categorical attributes of the training set')
for col in cat_cols:
    print('Value counts for column {}'.format(col))
    print(train[col].value_counts().head(20))
    print('='*80)


In [None]:
cat_cols = ['patient_id', 'sex', 'anatom_site_general_challenge']
print('Value counts for the categorical attributes of the test set')
for col in cat_cols:
    print('Value counts for column {}'.format(col))
    print(test[col].value_counts().head(20))
    print('='*80)


In [None]:
submission = sample_submission

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
pd.read_csv('submission.csv')

In [None]:
## 2. Image visualization
DATA_PATH = '../input/siim-isic-melanoma-classification/jpeg/'
TRAIN_PATH = f'{DATA_PATH}train/'
TEST_PATH = f'{DATA_PATH}test/'

In [None]:
import glob

# TRAIN_PATH already ends with '/', so no extra separator is needed
filepaths = glob.glob(TRAIN_PATH + '*.jpg')

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random
from keras.preprocessing.image import load_img

img2diag = train[['image_name', 'benign_malignant']].set_index('image_name')['benign_malignant'].to_dict()


In [None]:
img_path = random.choice(filepaths)
img_name = img_path.split('/')[-1].replace('.jpg', "")
img = load_img(img_path)
img_diagnostic = img2diag[img_name]
img_np = np.asarray(img)
plt.imshow(img_np)
plt.title(img_diagnostic)
plt.show()
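
The cell above recovers the image name by splitting the path on `/` and stripping the `.jpg` extension. A minimal sketch of the same step using `pathlib` from the standard library, which does not depend on the path separator. The helper name `image_name_from_path` and the example file name are hypothetical, chosen only for illustration:

```python
from pathlib import Path

def image_name_from_path(img_path):
    """Return the file name without directory or extension (hypothetical helper)."""
    return Path(img_path).stem

# Made-up file name, used only to illustrate the call
example_path = '../input/siim-isic-melanoma-classification/jpeg/train/EXAMPLE_0000001.jpg'
print(image_name_from_path(example_path))
```

The returned string can then be used as the key into `img2diag`, exactly like `img_name` above.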

In [None]:
## 3 - Building the baseline model (starting point)
## 3.1 - Image data generator

# Import the preprocess_input function that matches ResNet50
from keras.applications.resnet50 import preprocess_input

# Import the ImageDataGenerator class
from keras.preprocessing.image import ImageDataGenerator

In [None]:
# Shape to which all images will be resized
TARGET_SHAPE = (224, 224, 3)

# Initialize the data generator with the ResNet50 preprocessing function
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

In [None]:
train_df_datagen = train[['image_name', 'benign_malignant']].copy()
train_df_datagen['image_name'] = train_df_datagen['image_name']+'.jpg'
train_df_datagen.head()

In [None]:
# Number of benign images to sample (undersampling the majority class)
N_BENIGN = 584

filter_benign = train_df_datagen['benign_malignant']=='benign'
filter_malignant = train_df_datagen['benign_malignant']=='malignant'
sample_benign = train_df_datagen[filter_benign].sample(N_BENIGN, random_state=10)
# Alternative: skip the class balancing above and keep every benign image
#sample_benign = train_df_datagen[filter_benign]
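
The cell above undersamples the benign class to a fixed `N_BENIGN`. As a sketch of the same idea, the sample size can be derived from the size of the minority class instead of being hard-coded. The function name `undersample_majority` and the toy DataFrame are hypothetical and exist only for illustration:

```python
import pandas as pd

def undersample_majority(df, label_col, minority_label, random_state=10):
    """Keep all minority rows and sample an equal number of the other rows."""
    is_minority = df[label_col] == minority_label
    minority = df[is_minority]
    majority = df[~is_minority].sample(len(minority), random_state=random_state)
    return pd.concat([majority, minority])

# Toy data: 8 benign rows and 2 malignant rows
toy = pd.DataFrame({
    'image_name': [f'img_{i}.jpg' for i in range(10)],
    'benign_malignant': ['benign'] * 8 + ['malignant'] * 2,
})
balanced = undersample_majority(toy, 'benign_malignant', 'malignant')
print(balanced['benign_malignant'].value_counts())
```

Deriving the count this way keeps the classes balanced if the training metadata changes, at the cost of discarding most benign images, just as the fixed-size sample does.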

In [None]:
train_val_sampled = pd.concat([sample_benign, train_df_datagen[filter_malignant]])

In [None]:
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(train_val_sampled, 
                                      test_size=0.2, 
                                      random_state=1,
                                      stratify=train_val_sampled['benign_malignant']
                                     )

In [None]:
train_gen = datagen.flow_from_dataframe(train_df,
                                        directory=TRAIN_PATH,
                                        x_col='image_name',
                                        y_col='benign_malignant',
                                        target_size=TARGET_SHAPE[:2],
                                        class_mode='sparse'
                                       )

In [None]:
valid_gen = datagen.flow_from_dataframe(valid_df,
                                        directory=TRAIN_PATH,
                                        x_col='image_name',
                                        y_col='benign_malignant',
                                        target_size=TARGET_SHAPE[:2],
                                        class_mode='sparse',
                                        shuffle=False
                                       )

In [None]:
test_df_datagen = test[['image_name']]+'.jpg'
test_df_datagen.head()

In [None]:
test_gen = datagen.flow_from_dataframe(test_df_datagen,
                                       directory=TEST_PATH,
                                       x_col='image_name',
                                       target_size=TARGET_SHAPE[:2],
                                       class_mode=None,
                                       shuffle=False
                                      )

In [None]:
## 3.2 Creating the base model
# Recommended reading on ResNet and other models: https://medium.com/analytics-vidhya/timeline-of-transfer-learning-models-db2a0be39b37 
from keras.models import Sequential
from keras.layers import Dense
from keras.applications.resnet50 import ResNet50


resnet_model = ResNet50(include_top=False, input_shape=TARGET_SHAPE, pooling='avg')

In [None]:
for layer in resnet_model.layers:
    layer.trainable = False

In [None]:
base_model = Sequential([resnet_model,
                         Dense(1024, activation='relu'),
                         Dense(2, activation='softmax')
                        ])

In [None]:
## 3.3 Training the model

In [None]:
import tensorflow as tf

In [None]:
from keras.optimizers import Adam
# `lr` is a deprecated alias; `learning_rate` is the current argument name
base_model.compile(optimizer=Adam(learning_rate=1e-4), 
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy']
                  )

In [None]:
# `fit` accepts generators directly; `fit_generator` is deprecated
base_model.fit(train_gen,
               validation_data=valid_gen,
               epochs=3
              )

In [None]:
test_gen

In [None]:
# Predict on the test set
pred = base_model.predict(test_gen)
# Keep only the malignant-class probability (column 1)
pred = pred[:, 1]
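
When classes are inferred from the label column, the Keras generators assign class indices in alphanumeric order of the labels, so `benign` maps to 0 and `malignant` to 1. That is why column 1 of the softmax output is selected above. A small NumPy sketch with made-up probabilities; the array and the `class_indices` dict are illustrative stand-ins, not real model output:

```python
import numpy as np

# Stand-in for train_gen.class_indices
class_indices = {'benign': 0, 'malignant': 1}

# Made-up softmax outputs for three images: one row per image, one column per class
fake_pred = np.array([[0.9, 0.1],
                      [0.3, 0.7],
                      [0.5, 0.5]])

# Look the column up by label instead of hard-coding the index
malignant_prob = fake_pred[:, class_indices['malignant']]
print(malignant_prob)
```

In the notebook, checking `train_gen.class_indices` before slicing is a cheap way to confirm the column order.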

In [None]:
submission_dict = {'image_name': test.image_name.values,
                   'target': pred}

submission = pd.DataFrame(submission_dict)

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
pd.read_csv('submission.csv')

### Ideas and Next Steps
- Include Keras best practices from [here](https://github.com/WittmannF/course/blob/master/day-4/Best_Practices_Playground.ipynb) and [here](https://www.kaggle.com/ipythonx/tf-keras-melanoma-classification-starter-tabnet)
    - AugMix, LRFinder, [attention](https://www.kaggle.com/ibtesama/melanoma-classification-with-attention), [EfficientNet](https://www.kaggle.com/andradaolteanu/melanoma-competiton-aug-resnet-effnet-lb-0-91), [another EfficientNet](https://www.kaggle.com/nroman/melanoma-pytorch-starter-efficientnet), include more data
- [Include metafeatures](https://www.kaggle.com/titericz/simple-baseline)
- Unfreeze layers
- Create multiple feature extractors and try different transfer learning (TL) models
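
For the metafeatures idea in the list above, one possible starting point is to one-hot encode the categorical metadata so it can be fed to a tabular model or concatenated with image features. This is only a sketch on a toy DataFrame: the column names `sex` and `anatom_site_general_challenge` come from the metadata explored earlier, while the helper name `encode_metadata`, the `'unknown'` fill value, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def encode_metadata(df, cat_cols):
    """One-hot encode categorical columns, treating missing values as 'unknown'."""
    filled = df[cat_cols].fillna('unknown')
    return pd.get_dummies(filled, columns=cat_cols)

# Toy rows standing in for the real metadata
toy = pd.DataFrame({
    'sex': ['male', 'female', None],
    'anatom_site_general_challenge': ['torso', None, 'head/neck'],
})
features = encode_metadata(toy, ['sex', 'anatom_site_general_challenge'])
print(features.columns.tolist())
```

If this were applied to the real data, the train and test sets would need to be encoded with the same set of columns so the feature matrices line up.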
