###### Notebook created by: Arnav Chavan (@[carnav0400](https://www.kaggle.com/carnav0400)), Udbhav Bamba (@[ubamba98](https://www.kaggle.com/ubamba98))

## NOTE: Turn on Internet access and the GPU for this kernel before starting

# How to add the dataset to the kernel
* Click on "Add Data" 
* Search "CLabsCVcomp"
* Click on "Add"
* Done


# Importing all Libraries
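PS: the fastai star imports also bring in the common libraries used below (`os`, `random`, `numpy` as `np`, `pandas` as `pd`, `torch`, and `matplotlib.pyplot` as `plt`), so they need no separate import.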

In [None]:
from fastai import *
from fastai.vision import *
from sklearn.metrics import f1_score



# Seed everything for reproducibility
You may like to read more about it at [link](https://medium.com/@ODSC/properly-setting-the-random-seed-in-ml-experiments-not-as-simple-as-you-might-imagine-219969c84752).

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything(43)
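
To see what seeding buys you, here is a minimal sketch that uses only Python's standard `random` module as a stand-in for the full set of generators seeded above. `draw_sample` is a hypothetical helper for illustration and is not part of the notebook.

```python
import random

def draw_sample(seed, n=5):
    """Seed the stdlib RNG, then draw n pseudo-random integers in [0, 99]."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw_sample(43)
second = draw_sample(43)  # same seed, so the same sequence is drawn again

print(first == second)  # True
```

The same idea applies to NumPy and PyTorch, which keep their own generators and therefore need their own seed calls, as `seed_everything` does.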

# EDA

## Reading CSV

In [None]:
train = pd.read_csv('../input/clabscvcomp/data/train.csv')
test_df = pd.read_csv('../input/clabscvcomp/data/sample_submission.csv')

train.head() ## Shows the first five rows of data frame

In [None]:
sorted(train.genres.unique()) ## Shows all classes in the dataframe

In [None]:
train.genres.value_counts(normalize=True) ## Distribution of dataset

The dataset looks heavily imbalanced. It is worth reading about how to handle this; this blog post is a good starting point: [link](https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/)
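
One common way to act on class imbalance is to weight each class inversely to its frequency. The following is a minimal sketch, not part of the original notebook: the toy labels and the helper `inverse_frequency_weights` are hypothetical, and the formula `n_samples / (n_classes * class_count)` is just one convention among several.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical toy labels: 'drama' is three times as common as 'horror'.
toy_labels = ['drama', 'drama', 'drama', 'horror']
weights = inverse_frequency_weights(toy_labels)
print(weights)  # the rare class receives the larger weight
```

Weights like these could be passed to a weighted loss (for example the `weight` argument of `torch.nn.CrossEntropyLoss`). Whether that helps on this dataset is something to check on the validation set.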

# Defining DataBunch for FastAI
Read more about it [here](https://docs.fast.ai/vision.data.html#ImageDataBunch.from_df)

In [None]:
sz = (512, 512) ## Image size
bs = 32 ## Batch size
tfms = get_transforms( ## Transformation to apply on Train data
    do_flip=True, ## Horizontal flip
    flip_vert=False, ## Vertical flip
    max_rotate=20, ## Rotation
    max_zoom=1.5, ## Center zoom
    max_lighting=0.75 ## lighting
)

In [None]:
plt.imread('../input/clabscvcomp/data/train_data/1000.jpg').shape

In [None]:
data = (
    ImageList.from_df(df=train, path='', folder='../input/clabscvcomp/data/train_data/', cols='id', suffix = '.jpg') ## define data path
    .split_by_rand_pct(valid_pct=0.2) ## validation split
    .label_from_df(cols='genres') ## load labels from the 'genres' column
    .transform(tfms, size=sz)
    .databunch(bs=bs, num_workers=4) 
    .normalize(imagenet_stats)
    )

Now let's add the test data to the DataBunch

In [None]:
test_data = ImageList.from_df(test_df, path='../input/clabscvcomp/data/test_data/', cols='id', suffix = '.jpg')
data.add_test(test_data)

# Visualizing dataset

In [None]:
data.show_batch(3)

# Define F1 metric

In [None]:
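def F1(y_pred, y):
    y_pred = y_pred.softmax(dim=1) ## optional: argmax over the raw scores picks the same class
    y_pred = y_pred.argmax(dim=1) ## predicted class index per sample
    ## labels=list(range(10)) assumes 10 classes; return on the targets' device instead of a hard-coded 'cuda:0'
    return torch.tensor(f1_score(y.cpu(), y_pred.cpu(), labels=list(range(10)), average='weighted'), device=y.device)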
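
To make `average='weighted'` concrete, here is a small pure-Python sketch of a support-weighted F1 score. `weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation, though it is intended to agree with it on simple inputs such as the toy example.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of class c
        score += f1 * support / total
    return score

# Toy example: class 0 has F1 = 0.8 (support 3), class 1 has F1 = 2/3 (support 1).
score = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
print(round(score, 4))  # 0.7667
```

Because each class is weighted by its support, frequent classes dominate this score. On an imbalanced dataset it can look healthy even when rare classes are predicted poorly.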

# Defining FastAI's Learner 
A Learner ties together a DataBunch, a model, and callbacks.
More about it can be found [here](https://docs.fast.ai/vision.learner.html).

In [None]:
learn = cnn_learner(
                    data, ## DataBunch
                    models.resnet18, ## ResNet-18 architecture
                    metrics=[F1, accuracy], ## Metrics
                    callback_fns=ShowGraph ## Allows us to visualize training
).mixup().to_fp16()

# Let's start training!!

###### Freeze all layers except the last layer group and train for a few epochs with the one-cycle policy
Read more: [1-cycle policy basics](https://sgugger.github.io/the-1cycle-policy.html), [Documentation](https://docs.fast.ai/callbacks.one_cycle.html)

In [None]:
learn.freeze() 
learn.fit_one_cycle(3)
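
The shape of a one-cycle schedule can be sketched in a few lines: the learning rate warms up to a peak and then anneals far below its starting value. This is an illustrative sketch, not fastai's code. The constants (`pct_start=0.3`, `div_factor=25`, the very small final rate, and the peak of `3e-3`) are assumed to mirror fastai v1's defaults, so check the linked documentation before relying on them.

```python
import math

def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div=25.0 * 1e4):
    """Learning rate at `step`: warm up to max_lr, then anneal to max_lr / final_div."""
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        return cosine_anneal(max_lr / div_factor, max_lr, step / warmup_steps)
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cosine_anneal(max_lr, max_lr / final_div, pct)

# Hypothetical run of 100 iterations with an assumed peak learning rate of 3e-3.
schedule = [one_cycle_lr(s, 100, max_lr=3e-3) for s in range(101)]
print(schedule[0], max(schedule), schedule[-1])
```

The rate starts at `max_lr / div_factor`, peaks after 30% of the iterations, and ends several orders of magnitude lower.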

###### Unfreeze all layers and find a good learning rate

In [None]:
learn.unfreeze()
learn.lr_find()
learn.recorder.plot(suggestion=True)

###### Continue training

In [None]:
learn.fit_one_cycle(20, max_lr=slice(1e-4, 1e-3))
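
Passing `slice(1e-4, 1e-3)` asks for discriminative learning rates: earlier layer groups train with smaller rates than the head. Below is a minimal sketch of how such a range can be spread geometrically across layer groups. The helper `spread_lrs` and the count of three groups are assumptions for illustration, not fastai's actual code.

```python
def spread_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced learning rates, smallest for the earliest layer group."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

# Assumed: three layer groups, with the range used in the cell above.
lrs = spread_lrs(1e-4, 1e-3, 3)
print(lrs)  # roughly [1e-4, 3.16e-4, 1e-3]
```

The pretrained early layers, which already encode generic features, move slowly, while the freshly initialized head gets the largest updates.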

## Predicting for test data

In [None]:
preds = learn.get_preds(ds_type=DatasetType.Test) ## get predictions on the test data
np.save("preds.npy", preds[0].numpy())
preds = np.argmax(preds[0].numpy(),axis = 1)
categories = sorted(train.genres.unique().astype('str'))
final_preds = []
for idx in preds:
    final_preds.append(categories[idx])
final_submit = pd.read_csv('../input/clabscvcomp/data/sample_submission.csv')
final_submit.genres = final_preds
final_submit.head()
final_submit.to_csv('submission.csv',index = False)

## Now click on the "Commit" button to commit the notebook. This notebook generates 'submission.csv', which is used to check how the model performed.
## After the notebook is committed successfully, click on the Output button. This will bring you to a screen with an option to Submit to Competition. Hit that and you will see how your model performed.
## NOTE: We expect every team to generate such a notebook for its final submission. Only teams that have submitted a notebook for their final submission will be considered for prize money!

# Things to try next:
* Try different architectures, optimizers, loss functions, etc.
* Think of ways to tackle the data imbalance problem.
* Try different image sizes.
* Try ensembling methods.
* Apply semi-supervised learning.
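
As a starting point for the ensembling idea, here is a minimal sketch of the simplest variant: average the class probabilities of several models, then take the argmax. The probability tables are made-up toy values and the helpers `average_probabilities` and `argmax` are hypothetical. In practice the inputs could be arrays like the `preds.npy` file saved above, one per trained model.

```python
def average_probabilities(prob_sets):
    """Element-wise mean of several [n_samples][n_classes] probability tables."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(model[i][j] for model in prob_sets) / n_models for j in range(n_classes)]
        for i in range(n_samples)
    ]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Toy predictions from two hypothetical models: 2 samples, 2 classes.
model_a = [[0.6, 0.4], [0.3, 0.7]]
model_b = [[0.2, 0.8], [0.4, 0.6]]

avg = average_probabilities([model_a, model_b])
labels = [argmax(row) for row in avg]
print(labels)  # [1, 1]
```

For the first sample the two models disagree, and the average sides with the more confident one.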

# PS: This competition is hosted to promote learning. So we request you to publish your baseline models via Kaggle kernels and discuss on the discussion tab to help others learn. Thanks!