# Generate Features

The purpose of this kernel is to create a few datasets that can be used for further exploration and modeling in other kernels.
It turns out that the image metadata and the image size alone contain a lot of useful information. We'll start by creating feature-engineered datasets with just that information.

<u>**Contents**</u>
1. Import data
2. Data Imputation<br>
    2.1 Approach 1<br>
    2.2 Approach 2<br>

## 1. Import data
The code below imports the necessary libraries and loads the metadata into dataframes.

In [1]:
import numpy as np
import pandas as pd
import os
from PIL import Image
from tqdm import tqdm
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn.metrics import roc_auc_score

import lightgbm as lgb

train = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')
test = pd.read_csv('../input/siim-isic-melanoma-classification/test.csv')
sample = pd.read_csv('../input/siim-isic-melanoma-classification/sample_submission.csv')

## 2. Data Imputation

### 2.1 Approach 1
The following features from train.csv and test.csv are imputed:
* sex with 'na'
* age_approx with 0
* anatom_site_general_challenge with 'na'

In [None]:
train['sex'] = train['sex'].fillna('na')
train['age_approx'] = train['age_approx'].fillna(0)
train['anatom_site_general_challenge'] = train['anatom_site_general_challenge'].fillna('na')

test['sex'] = test['sex'].fillna('na')
test['age_approx'] = test['age_approx'].fillna(0)
test['anatom_site_general_challenge'] = test['anatom_site_general_challenge'].fillna('na')

The below code computes the width and height of corresponding Train images which are in jpeg format.

In [3]:
trn_images = train['image_name'].values
trn_sizes = np.zeros((trn_images.shape[0],2))
for i, img_path in enumerate(tqdm(trn_images)):
    img = Image.open(os.path.join('../input/siim-isic-melanoma-classification/jpeg/train/', f'{img_path}.jpg'))
    trn_sizes[i] = np.array([img.size[0],img.size[1]])

100%|██████████| 33126/33126 [03:04<00:00, 179.11it/s]


The below code computes the width and height of corresponding Test images which are in jpeg format.

In [4]:
test_images = test['image_name'].values
test_sizes = np.zeros((test_images.shape[0],2))
for i, img_path in enumerate(tqdm(test_images)):
    img = Image.open(os.path.join('../input/siim-isic-melanoma-classification/jpeg/test/', f'{img_path}.jpg'))
    test_sizes[i] = np.array([img.size[0],img.size[1]])

100%|██████████| 10982/10982 [00:58<00:00, 186.19it/s]
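
The size lookup is cheap because `Image.open` is lazy: it identifies the file and reads the header, but does not decode the pixel data. A minimal sketch of the same loop wrapped in a helper is shown below. The function name `image_sizes` and the generated toy images are assumptions made for illustration, not part of the original kernel.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def image_sizes(image_dir, image_names, ext=".jpg"):
    """Return an (n, 2) integer array of (width, height) for the named images."""
    sizes = np.zeros((len(image_names), 2), dtype=np.int64)
    for i, name in enumerate(image_names):
        # The context manager closes the file handle once the size is read.
        with Image.open(os.path.join(image_dir, f"{name}{ext}")) as img:
            sizes[i] = img.size  # PIL reports size as (width, height)
    return sizes


# Toy check with two generated images (stand-ins for the competition JPEGs).
with tempfile.TemporaryDirectory() as tmp:
    Image.new("RGB", (64, 48)).save(os.path.join(tmp, "a.jpg"))
    Image.new("RGB", (30, 20)).save(os.path.join(tmp, "b.jpg"))
    demo_sizes = image_sizes(tmp, ["a", "b"])
```

Using `with` is a design choice rather than a requirement; it avoids keeping tens of thousands of file handles open until garbage collection.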


Now we add the width and height columns ("w" and "h") to the pandas dataframes.

In [5]:
train['w'] = trn_sizes[:,0]
train['h'] = trn_sizes[:,1]
test['w'] = test_sizes[:,0]
test['h'] = test_sizes[:,1]

We will convert "sex", "anatom_site_general_challenge" features in both Train & Test into Categorical features. Finally, we will also save them as "train_meta_size.csv" & "test_meta_size.csv".

In [6]:
# Fit each encoder on train and test together so that a category
# gets the same integer code in both files.
for col in ['sex', 'anatom_site_general_challenge']:
    le = preprocessing.LabelEncoder()
    le.fit(pd.concat([train[col], test[col]]))
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])

feature_names = ['sex','age_approx','anatom_site_general_challenge','w','h']
ycol = ['target']

train[feature_names + ycol].to_csv('train_meta_size.csv', index=False)
test[feature_names ].to_csv('test_meta_size.csv', index=False)

The problem with the above approach is that train and test have very different distributions of missing values, so any algorithm that is sensitive to that discrepancy will show a gap between local CV and LB scores. We'll try something a bit more sophisticated now.
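
One way to see such a discrepancy is to put the per-column missing fractions of the two sets side by side. The sketch below uses small made-up frames (`toy_train`, `toy_test`) as stand-ins for the raw train.csv and test.csv before any filling; the values are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the raw train.csv and test.csv (values are made up).
toy_train = pd.DataFrame({
    "sex": ["male", None, "female", "male"],
    "age_approx": [45.0, 50.0, np.nan, 60.0],
    "anatom_site_general_challenge": ["torso", "torso", "head/neck", "torso"],
})
toy_test = pd.DataFrame({
    "sex": ["female", "male"],
    "age_approx": [40.0, 70.0],
    "anatom_site_general_challenge": [None, "lower extremity"],
})

# Fraction of missing values per column, side by side.
missing = pd.DataFrame({
    "train": toy_train.isna().mean(),
    "test": toy_test.isna().mean(),
})
missing["diff"] = missing["test"] - missing["train"]
print(missing)
```

Columns with a large `diff` are the ones where a dedicated "missing" category or a sentinel value such as 0 can behave differently between local CV and the leaderboard.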

### 2.2 Approach 2
We will now impute the data in a different way. First, we load the train and test CSV files again.

In [10]:
train = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')
test = pd.read_csv('../input/siim-isic-melanoma-classification/test.csv')

test.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge
0,ISIC_0052060,IP_3579794,male,70.0,
1,ISIC_0052349,IP_7782715,male,40.0,lower extremity
2,ISIC_0058510,IP_7960270,female,55.0,torso
3,ISIC_0073313,IP_6375035,female,50.0,torso
4,ISIC_0073502,IP_0589375,female,45.0,lower extremity


Printing sample data from train.csv.

In [11]:
train.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,diagnosis,benign_malignant,target
0,ISIC_2637011,IP_7279968,male,45.0,head/neck,unknown,benign,0
1,ISIC_0015719,IP_3075186,female,45.0,upper extremity,unknown,benign,0
2,ISIC_0052212,IP_2842074,female,50.0,lower extremity,nevus,benign,0
3,ISIC_0068279,IP_6890425,female,45.0,head/neck,unknown,benign,0
4,ISIC_0074268,IP_8723313,female,55.0,upper extremity,unknown,benign,0


Printing unique values of "diagnosis" feature in Train.

In [12]:
np.unique(train.diagnosis.values, return_counts=True)

(array(['atypical melanocytic proliferation', 'cafe-au-lait macule',
        'lentigo NOS', 'lichenoid keratosis', 'melanoma', 'nevus',
        'seborrheic keratosis', 'solar lentigo', 'unknown'], dtype=object),
 array([    1,     1,    44,    37,   584,  5193,   135,     7, 27124]))

Making a list of the features that need to be imputed. We also combine the Train and Test dataframes before imputing.

In [13]:
cols = ['sex', 'age_approx', 'anatom_site_general_challenge']

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way.
train_test = pd.concat([train[cols], test[cols]])

Imputing the values in the Train & Test features as follows:
1. sex with the most common value (i.e. male) from the combined Train & Test data.
2. age_approx with the mean age from the combined Train & Test data.
3. anatom_site_general_challenge with the most common value (i.e. torso) from the combined Train & Test data.

In [18]:
train_test['age_approx'] = train_test['age_approx'].fillna(train_test['age_approx'].mean())
train_test['sex'] = train_test['sex'].fillna(train_test['sex'].value_counts().index[0])
train_test['anatom_site_general_challenge'] = train_test['anatom_site_general_challenge'].fillna(train_test['anatom_site_general_challenge'].value_counts().index[0])

train[cols] = train_test[:train.shape[0]][cols].values
test[cols] = train_test[train.shape[0]:][cols].values

test.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge
0,ISIC_0052060,IP_3579794,male,70.0,torso
1,ISIC_0052349,IP_7782715,male,40.0,lower extremity
2,ISIC_0058510,IP_7960270,female,55.0,torso
3,ISIC_0073313,IP_6375035,female,50.0,torso
4,ISIC_0073502,IP_0589375,female,45.0,lower extremity
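
The combined mean/mode imputation can be checked on toy data. The sketch below is an illustration under stated assumptions: the frames `toy_train` and `toy_test` are made up, and it uses `Series.mode()` to pick the most frequent value where the cell above uses `value_counts().index[0]`.

```python
import numpy as np
import pandas as pd

cols = ["sex", "age_approx", "anatom_site_general_challenge"]

# Toy stand-ins for train.csv and test.csv (values are made up).
toy_train = pd.DataFrame({
    "sex": ["male", "male", None],
    "age_approx": [40.0, np.nan, 60.0],
    "anatom_site_general_challenge": ["torso", "torso", "head/neck"],
})
toy_test = pd.DataFrame({
    "sex": ["female", "male"],
    "age_approx": [50.0, 50.0],
    "anatom_site_general_challenge": [None, "lower extremity"],
})

# Stack train on top of test so the fill values come from both sets.
combined = pd.concat([toy_train[cols], toy_test[cols]], ignore_index=True)

combined["age_approx"] = combined["age_approx"].fillna(combined["age_approx"].mean())
for col in ["sex", "anatom_site_general_challenge"]:
    # mode() returns the most frequent value(s); take the first one.
    combined[col] = combined[col].fillna(combined[col].mode().iloc[0])

# Split back by position: the first len(toy_train) rows are train.
n_train = len(toy_train)
toy_train_filled = combined.iloc[:n_train].reset_index(drop=True)
toy_test_filled = combined.iloc[n_train:].reset_index(drop=True)
```

Because the fill values are computed once on the stacked frame, a missing entry receives the same replacement whether it sits in train or in test.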


The below code computes the width and height of corresponding Train & Test images which are in jpeg format.

In [19]:
trn_images = train['image_name'].values
trn_sizes = np.zeros((trn_images.shape[0],2))
for i, img_path in enumerate(tqdm(trn_images)):
    img = Image.open(os.path.join('../input/siim-isic-melanoma-classification/jpeg/train/', f'{img_path}.jpg'))
    trn_sizes[i] = np.array([img.size[0],img.size[1]])
    
    
test_images = test['image_name'].values
test_sizes = np.zeros((test_images.shape[0],2))
for i, img_path in enumerate(tqdm(test_images)):
    img = Image.open(os.path.join('../input/siim-isic-melanoma-classification/jpeg/test/', f'{img_path}.jpg'))
    test_sizes[i] = np.array([img.size[0],img.size[1]])

100%|██████████| 33126/33126 [03:05<00:00, 178.54it/s]
100%|██████████| 10982/10982 [00:56<00:00, 193.81it/s]


Now we add the width and height columns ("w" and "h") to the pandas dataframes.

In [21]:
train['w'] = trn_sizes[:,0]
train['h'] = trn_sizes[:,1]
test['w'] = test_sizes[:,0]
test['h'] = test_sizes[:,1]

We will convert "sex", "anatom_site_general_challenge" features in both Train & Test into Categorical features.

In [22]:
le = preprocessing.LabelEncoder()

le.fit(train_test.sex)

train.sex = le.transform(train.sex)
test.sex = le.transform(test.sex)

le = preprocessing.LabelEncoder()

le.fit(train_test.anatom_site_general_challenge)

train.anatom_site_general_challenge = le.transform(train.anatom_site_general_challenge)
test.anatom_site_general_challenge = le.transform(test.anatom_site_general_challenge)


train.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,diagnosis,benign_malignant,target,w,h
0,ISIC_2637011,IP_7279968,1,45.0,0,unknown,benign,0,6000.0,4000.0
1,ISIC_0015719,IP_3075186,0,45.0,5,unknown,benign,0,6000.0,4000.0
2,ISIC_0052212,IP_2842074,0,50.0,1,nevus,benign,0,1872.0,1053.0
3,ISIC_0068279,IP_6890425,0,45.0,0,unknown,benign,0,1872.0,1053.0
4,ISIC_0074268,IP_8723313,0,55.0,5,unknown,benign,0,6000.0,4000.0
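
Fitting each encoder on the combined values matters because `LabelEncoder` assigns integer codes by sorted class order, so encoders fitted on different subsets can give the same category different codes. A toy sketch of the effect follows; the series `toy_train_site` and `toy_test_site` are made up for illustration.

```python
import pandas as pd
from sklearn import preprocessing

toy_train_site = pd.Series(["head/neck", "torso", "upper extremity"])
toy_test_site = pd.Series(["torso", "upper extremity"])  # no "head/neck"

# Separate fits: codes depend on which categories each split contains.
sep_train = preprocessing.LabelEncoder().fit(toy_train_site)
sep_test = preprocessing.LabelEncoder().fit(toy_test_site)
torso_train = int(sep_train.transform(["torso"])[0])
torso_test = int(sep_test.transform(["torso"])[0])

# Joint fit: one mapping shared by both splits.
joint = preprocessing.LabelEncoder().fit(pd.concat([toy_train_site, toy_test_site]))
mapping = {label: code for code, label in enumerate(joint.classes_)}
print(torso_train, torso_test, mapping)
```

With separate fits, "torso" is encoded differently in the two splits, which would silently corrupt a model trained on one and applied to the other.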


Printing sample Test rows.

In [23]:
test.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,w,h
0,ISIC_0052060,IP_3579794,1,70.0,4,6000.0,4000.0
1,ISIC_0052349,IP_7782715,1,40.0,1,6000.0,4000.0
2,ISIC_0058510,IP_7960270,0,55.0,4,6000.0,4000.0
3,ISIC_0073313,IP_6375035,0,50.0,4,6000.0,4000.0
4,ISIC_0073502,IP_0589375,0,45.0,1,1920.0,1080.0


Finally, we will also save them as "train_meta_size_2.csv" & "test_meta_size_2.csv".

In [24]:
train[feature_names + ycol].to_csv('train_meta_size_2.csv', index=False)
test[feature_names ].to_csv('test_meta_size_2.csv', index=False)

We'll now add metafeatures from the final layer of the trained VGG-16 model with attention:

In [26]:
oof_c = pd.read_csv('../input/output/submission_train.csv')
submission_c = pd.read_csv('../input/outpu/submission.csv')
oof_c.head()

Unnamed: 0,image_name,target,pred,fold
0,ISIC_2637011,0,0.018716,0
1,ISIC_0076262,0,0.02536,0
2,ISIC_0074268,0,0.024038,0
3,ISIC_0015719,0,0.020294,0
4,ISIC_0082543,0,0.020425,0


Append the predictions to test.

In [34]:
del oof_c['target']

test['pred'] = submission_c['target']
test.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,w,h,pred
0,ISIC_0052060,IP_3579794,1,70.0,4,6000.0,4000.0,0.027359
1,ISIC_0052349,IP_7782715,1,40.0,1,6000.0,4000.0,0.025799
2,ISIC_0058510,IP_7960270,0,55.0,4,6000.0,4000.0,0.025983
3,ISIC_0073313,IP_6375035,0,50.0,4,6000.0,4000.0,0.024942
4,ISIC_0073502,IP_0589375,0,45.0,1,1920.0,1080.0,0.032569
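
The export cell at the end refers to `train_2`, which is not defined in the cells shown here. One plausible construction, and this is only an assumption, is a merge of `train` with the out-of-fold frame on `image_name`. A key-based join is needed because the rows of `oof_c` are not in the same order as the rows of `train`. The sketch below illustrates the idea on made-up frames with hypothetical image names.

```python
import pandas as pd

# Toy stand-ins; column names follow the outputs shown above.
toy_train = pd.DataFrame({
    "image_name": ["ISIC_A", "ISIC_B", "ISIC_C"],
    "sex": [1, 0, 0],
    "target": [0, 0, 1],
})
toy_oof = pd.DataFrame({
    "image_name": ["ISIC_C", "ISIC_A", "ISIC_B"],  # different row order
    "pred": [0.80, 0.02, 0.03],
    "fold": [1, 0, 0],
})

# Join on the image identifier so each prediction lands on the right row,
# regardless of how the two files are ordered.
toy_train_2 = toy_train.merge(
    toy_oof, on="image_name", how="left", validate="one_to_one"
)
```

Dropping `target` from the out-of-fold frame before such a merge, as the cell above does, avoids ending up with duplicated `target_x` / `target_y` columns.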


Finally, we save the datasets as "train_meta_size_3.csv" & "test_meta_size_3.csv". The Train file also keeps the "fold" column.

In [36]:
train_2[feature_names + ['fold'] + ycol].to_csv('train_meta_size_3.csv', index=False)
test[feature_names ].to_csv('test_meta_size_3.csv', index=False)