# Overview of the project 

This project attempts to implement the NIPS 2017 paper "Searching for Activation Functions" (Zoph & Le 2017). Although neural networks are powerful and flexible models, they are still hard to design and are limited by human creativity. One important consideration when designing a neural network is the activation function, since it provides the non-linearity between the affine transformations in the network. But how do we choose which basic functions to use, and how to combine them into a new activation function? Essentially, this becomes a search problem: finding the best activation function in a search space. The approach that Zoph and Le took in their paper is similar to their earlier work, "Neural Architecture Search with Reinforcement Learning", which used an RNN to sample the hyperparameters of a neural network and trained that RNN with a policy gradient method (specifically REINFORCE) to maximise the validation accuracy of the child network (i.e. the reward signal).

![title](img/nas.jpeg)
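To make the REINFORCE idea concrete, here is a minimal sketch that replaces the RNN controller with a plain softmax policy over three choices. Everything in it is an assumption for illustration: `TOY_REWARDS` stands in for the validation accuracy of a trained child network, and `reinforce_step` and `run_toy_search` are hypothetical names that do not appear in this project's code.

```python
import math
import random


def softmax(logits):
    """Convert unnormalised logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, advantage, lr):
    """One REINFORCE update for a softmax policy.

    The gradient of log pi(action) with respect to logit i is
    (1 if i == action else 0) - pi(i); it is scaled by the advantage.
    """
    probs = softmax(logits)
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical stand-in for "train the child network and read off its
# validation accuracy": each of the three choices has a fixed reward.
TOY_REWARDS = [0.2, 0.5, 0.9]


def run_toy_search(steps=500, lr=0.5, seed=0):
    """Sample a choice, observe its reward, and update the policy."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    baseline = 0.0  # moving average of past rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = TOY_REWARDS[action]
        logits = reinforce_step(logits, action, reward - baseline, lr)
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)


if __name__ == "__main__":
    print(run_toy_search())
```

In the real search, the policy is an RNN that emits a sequence of choices, and each reward requires training a child network, which is why sample efficiency matters so much.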

In this paper they instead used an RNN to generate the choice of unary and binary functions that make up the activation function, and trained the RNN using the Proximal Policy Optimization (PPO) algorithm.
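As a rough illustration of what PPO optimises, the sketch below computes the clipped surrogate objective for a single sample. The function name is hypothetical, and the default `clip_eps=0.2` is an assumption borrowed from common PPO usage, not a value taken from this project or from the activation search paper.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample.

    ratio     -- pi_new(action) / pi_old(action)
    advantage -- estimated advantage of the sampled action
    clip_eps  -- clipping range (assumed value, see the note above)
    """
    # Keep the probability ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum gives a pessimistic bound, so the policy gains
    # nothing from moving far away from the old policy in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
        print(ratio, advantage, ppo_clipped_objective(ratio, advantage))
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + clip_eps`, and with a negative advantage it stops improving once the ratio falls below `1 - clip_eps`.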

![title](img/activationfunctions.png)
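The idea of building an activation function from unary and binary functions can be sketched as below. This is a simplified illustration, not the project's implementation: the `UNARY` and `BINARY` tables hold only a small, arbitrarily chosen subset of functions, both unary functions receive the same scalar input, and `make_activation` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid for a scalar."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Illustrative subsets only; a real search space would be larger.
UNARY = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": lambda x: max(x, 0.0),
}

BINARY = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "maximum": max,
    "minimum": min,
}


def make_activation(unary1, unary2, binary):
    """Build f(x) = binary(unary1(x), unary2(x)) from function names."""
    u1, u2, b = UNARY[unary1], UNARY[unary2], BINARY[binary]
    return lambda x: b(u1(x), u2(x))


if __name__ == "__main__":
    # identity and sigmoid combined by multiplication gives x * sigmoid(x).
    candidate = make_activation("identity", "sigmoid", "multiply")
    print([candidate(x) for x in (-2.0, 0.0, 2.0)])
```

The controller's job is then to pick the three names, so each sampled activation function corresponds to a short sequence of discrete choices.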

Using PPO is in stark contrast with the two papers previously published by Zoph & Le, which used REINFORCE and Trust Region Policy Optimization (TRPO). The reason they did not use REINFORCE was given in their paper "Neural Optimizer Search with Reinforcement Learning": the authors explained that REINFORCE exhibited poor sample efficiency compared to TRPO. However, no explanation was given as to why PPO was used instead of TRPO when searching for activation functions. The authors also did not explain why they used policy gradient methods rather than other approaches such as evolutionary strategies. Finally, none of the papers included a control showing whether their approach was better than, say, random search, and by how much. Such a controlled experiment would also make it possible to compare other approaches, such as evolutionary strategies, with their method more fairly. This is, however, outside the scope of a mere paper implementation and is worth considering as a future project.
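A random search control of the kind described above could look roughly like the following sketch. It is not part of this project: the name lists are a small illustrative search space, and `dummy_evaluate` is a placeholder for training a child network and returning its validation accuracy.

```python
import random

# Illustrative search space (an assumption, not the paper's full space).
UNARY_NAMES = ["identity", "negate", "sigmoid", "tanh"]
BINARY_NAMES = ["add", "multiply", "maximum"]


def sample_candidate(rng):
    """Draw (unary, unary, binary) uniformly at random."""
    return (
        rng.choice(UNARY_NAMES),
        rng.choice(UNARY_NAMES),
        rng.choice(BINARY_NAMES),
    )


def random_search(evaluate, n_trials, seed=0):
    """Return the best candidate and its score after n_trials samples."""
    rng = random.Random(seed)
    best_candidate, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = sample_candidate(rng)
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


def dummy_evaluate(candidate):
    """Placeholder score; a real control would train a child network here."""
    unary1, unary2, binary = candidate
    return (
        float(unary1 == "identity")
        + float(unary2 == "sigmoid")
        + float(binary == "multiply")
    )


if __name__ == "__main__":
    print(random_search(dummy_evaluate, n_trials=50))
```

Giving random search the same number of child network evaluations as the RNN controller would allow a like-for-like comparison of the two.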

![title](img/Rnn.png)

# Dependencies

This project requires the GPU build of TensorFlow (`tensorflow-gpu`).

In [None]:
!pip install tensorflow-gpu

# Start searching for activation functions

Download the CIFAR-10 dataset

In [None]:
!python cifar10_download_and_extract.py

Run the training program

In [None]:
!python main.py

The search is conducted using ResNet-20 as the child network architecture, trained on CIFAR-10 for 10k steps.

# Testing on CIFAR-100

In the paper, three datasets were used to test the transfer capabilities of the newly found activation functions:
1. CIFAR-100
2. ImageNet
3. WMT

I wasn't able to download ImageNet because of its large size. In this notebook we only look at CIFAR-100 and the WMT 2014 English-German dataset.

In [None]:
!python cifar100_download_and_extract.py
!python cifar100_train.py
!python cifar100_test.py

# Swish

Swish was found by the activation function search in the original paper by Zoph and Le and was demonstrated to **improve top-1 classification accuracy on ImageNet by 0.9% simply by replacing all ReLU activation functions with Swish**.
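For reference, Swish is defined in the paper as f(x) = x · sigmoid(βx). The scalar sketch below is only an illustration of that formula and is not the contents of `swish.py`; fixing `beta=1.0` as the default is an assumption made here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid for a scalar."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x):
    """ReLU activation, shown for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(x, relu(x), swish(x))
```

Unlike ReLU, Swish is smooth and dips slightly below zero for negative inputs, while approaching ReLU for inputs of large magnitude.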

In [None]:
!python swish.py

![title](img/loss_rmsprop.png)