# Ambizione

The goal is to generate the perfect research project title to win an Ambizione grant ([link](http://www.snf.ch/en/funding/careers/ambizione/)).

As input we use the successful grantees from the past years ([pdf](http://www.snf.ch/SiteCollectionDocuments/ambizione_liste_beitragsempfangende_e.pdf)). 
This input is converted to a plain text document where each line is the title of a winning proposal.

Okay, we're in Switzerland, so one issue is that the list contains mainly English titles, but also some German and French ones.
Since this could hurt the quality of our generative model, we try to remove most of them by dropping every line that contains an accented letter or umlaut that doesn't occur in English.

In [30]:
bad_words = ['é', 'è', 'à', 'ö', 'ä', 'ü']

with open('ambizione.txt') as oldfile, \
        open('ambizione_cleaned.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)

The titles in this file are sorted alphabetically, which might cause some unwanted patterns during our model's training.
Let's shuffle the lines to make sure that doesn't happen.

In [2]:
import random
with open('ambizione_cleaned.txt', 'r') as source:
    data = [(random.random(), line) for line in source]
data.sort()
with open('ambizione_cleaned_shuffled.txt', 'w') as target:
    for _, line in data:
        target.write(line)
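
The sort-by-random-key trick above works fine. As a shorter alternative, here is a minimal sketch using `random.shuffle`. The helper name `shuffle_lines` and the `seed` argument are assumptions made for illustration; a fixed seed just makes the shuffled order reproducible between runs.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return a shuffled copy of `lines`; a fixed seed makes the order reproducible."""
    rng = random.Random(seed)   # private generator, leaves the global one untouched
    shuffled = list(lines)      # copy so the input list is not modified
    rng.shuffle(shuffled)       # in-place shuffle of the copy
    return shuffled


# Toy data standing in for the lines of 'ambizione_cleaned.txt'
titles = ["a title\n", "b title\n", "c title\n", "d title\n"]
shuffled_titles = shuffle_lines(titles, seed=42)
print(shuffled_titles)
```

With a real file, one would read the lines with `readlines()`, pass them through the helper and write the result with `writelines()`.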

Luckily, we don't have to reinvent the wheel to get our model.
textgenrnn is a Python 3 module on top of Keras/TensorFlow for creating char-rnns, with many cool features - we're gonna use that.

In [2]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.


You can specify many options to train the model straight from a text file.
Let's try to train this with the following ones to start off:
```python
new_model=True,      # start a new model trained only on this text
line_delimited=True, # treat each line of the file as a separate text (one title per line)
word_level=True,     # for now use existing words and rearrange them; the alternative is to use each char as a pattern
max_length=5,        # maximum number of previous patterns used to predict the next one; 5 because these titles are mainly simple compositions of noun + preposition + noun
max_gen_length=20,   # maximum number of patterns in a generated "sentence"
num_epochs=3,        # number of epochs to train for
gen_epochs=-1,       # after how many epochs a test output is generated; we don't want this for now
train_size=0.8,      # use 80% of the sample for training, the rest for validation
dropout=0.1          # try a dropout rate of 10% to reduce the chance of overtraining
```
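
To get a feeling for what `word_level=True` and `max_length=5` mean, here is a rough sketch of how a single title can be turned into (previous words, next word) training pairs with a sliding window. This is only an illustration and not textgenrnn's actual preprocessing (which, for example, also handles punctuation as separate tokens); the function name `word_sequences` and the example title are made up.

```python
def word_sequences(title, max_length=5):
    """Yield (context, next_word) pairs using at most `max_length` previous words."""
    words = title.lower().split()
    for i in range(1, len(words)):
        # keep only the last `max_length` words before position i
        context = words[max(0, i - max_length):i]
        yield context, words[i]


# Hypothetical title, just for illustration
pairs = list(word_sequences("novel strategies for brain regeneration in adult mice"))
for context, target in pairs:
    print(context, "->", target)
```

Each title therefore contributes several short training sequences, which is why the number of word sequences reported during training is much larger than the number of titles.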

In [4]:
textgen.reset()
textgen.train_from_file('ambizione_cleaned_shuffled.txt',
                        new_model=True,
                        line_delimited=True,
                        word_level=True,
                        max_length=5,
                        max_gen_length=20,
                        num_epochs=3,
                        gen_epochs=-1,
                        train_size=0.8,
                        dropout=0.1)

633 texts collected.
Training new model w/ 2-layer, 128-cell LSTMs
Training on 6,875 word sequences.
Epoch 1/3
Epoch 2/3
Epoch 3/3


Huh, the model seems to overtrain on the training data right away. Maybe we don't have sufficient input patterns available?
Let's still see what it spits out.

In [5]:
textgen.generate(5, temperature=0.8)

 40%|████      | 2/5 [00:01<00:01,  1.51it/s]

@ and images atlantic at compared

quantum the and ions of neurodevelopmental from linkage materials



 80%|████████  | 4/5 [00:01<00:00,  2.39it/s]

genetic and and social latin the after of : cell in interfaces

. a of , an policy the and : of



100%|██████████| 5/5 [00:01<00:00,  3.07it/s]

phages cloud . idea the communication of the semiconductors






Hm, that doesn't look great.

The SNF has another funding program, "Eccellenza", for professorships, with a similarly sized list of winners.
If those project titles are good enough to get a better grant, surely we can combine them with our patterns so far.
The input pdf is [here](http://www.snf.ch/SiteCollectionDocuments/fop_awa_pfs_zusprachen_2018.pdf).
It has the same issue with the languages, but not with the sorting, so we only need the language filter.

In [6]:
with open('eccellenza.txt') as oldfile, \
        open('eccellenza_cleaned.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)

In [7]:
filenames = ['ambizione_cleaned_shuffled.txt', 'eccellenza_cleaned.txt']
with open('winning_titles.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
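
One thing to watch out for when concatenating: if a file does not end with a newline, its last title gets glued to the first title of the next file. A small sketch of a guard against that, working on plain strings; the helper name `join_texts` is hypothetical.

```python
def join_texts(texts):
    """Concatenate texts, making sure each non-empty one ends with a newline."""
    parts = []
    for text in texts:
        if text and not text.endswith("\n"):
            text += "\n"    # add the missing line break
        parts.append(text)
    return "".join(parts)


# The first toy "file" has no trailing newline
merged = join_texts(["title one\ntitle two", "title three\n"])
print(merged.splitlines())
```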

In [18]:
textgen.reset()
textgen.train_from_file('winning_titles.txt',
                        new_model=True,
                        line_delimited=True,
                        word_level=True,
                        max_length=5,
                        max_gen_length=20,
                        num_epochs=3,
                        gen_epochs=-1,
                        train_size=0.8,
                        dropout=0.2)

1,281 texts collected.
Training new model w/ 2-layer, 128-cell LSTMs
Training on 13,463 word sequences.
Epoch 1/3
Epoch 2/3
Epoch 3/3


In fact, I want to have a proposal on "Dark matter", so we can tell the model to start generation with that pattern:

In [19]:
textgen.generate(5, temperature=0.8, prefix="Dark matter")

 60%|██████    | 3/5 [00:02<00:03,  1.83s/it]

dark matter the of : idea in in cells s

dark matter and - overlooked

dark matter and of and

dark matter in mechanism metabolic



100%|██████████| 5/5 [00:02<00:00,  1.31s/it]

dark matter , generational macrophages






Hm, still not great...

Maybe we can work around the issue of the few training patterns?
There's a thing called transfer learning.
Basically we take an already trained model and then fine-tune it on our patterns.
The idea is that a model which has already been trained on lots of text knows basic things about language patterns.
The fine-tuning then just adds the words specific to this problem.

Our "language" is a bit specific, so let's use something related. The EU publishes their successful grants as well:
https://data.europa.eu/euodp/en/data/

The data comes as a CSV file, so we need to remove all the overhead and format it so we can feed it to our text generator.

In [20]:
import pandas as pd
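# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('https://cordis.europa.eu/data/cordis-fp7projects.csv', sep=';',
                 header=0, on_bad_lines='skip')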

In [21]:
df

Unnamed: 0,rcn,id,acronym,status,programme,topics,frameworkProgramme,title,startDate,endDate,...,objective,totalCost,ecMaxContribution,call,fundingScheme,coordinator,coordinatorCountry,participants,participantCountries,subjects
0,104434,304806,GAUGE/GRAVITY,ONG,FP7-IDEAS-ERC,ERC-SG-PE2,FP7,The Gauge/Gravity Duality and Geometry in Stri...,2013-01-01,2018-12-31,...,While the three sub-atomic forces are describe...,1253098,1253098,ERC-2012-StG_20111012,ERC-SG,KING'S COLLEGE LONDON,UK,,,PSE;SCI
1,98756,265847,ECO2,ONG,FP7-ENVIRONMENT,Ocean.2010-3,FP7,Sub-seabed CO2 Storage: Impact on Marine Ecosy...,2011-05-01,2015-04-30,...,The ECO2 project sets out to assess the risks ...,1397817412,10500000,FP7-OCEAN-2010,CP-IP,HELMHOLTZ ZENTRUM FUR OZEANFORSCHUNG KIEL,DE,UNIVERSITETET I TROMSOE - NORGES ARKTISKE UNIV...,NO;UK;NL;BE;PL;DE;IT;FR;SE,ENV
2,108338,332769,OBESCLAIM,ONG,FP7-PEOPLE,FP7-PEOPLE-2012-CIG,FP7,Fighting against obesity in Europe”: The role ...,2013-09-01,2018-08-07,...,The aim of this project is to investigate whet...,100000,100000,FP7-PEOPLE-2012-CIG,MC-CIG,CENTRO DE INVESTIGACION Y TECNOLOGIA AGROALIME...,ES,,,LIF
3,91155,228344,EUROFLEETS,ONG,FP7-INFRASTRUCTURES,INFRA-2008-1.1.1,FP7,TOWARDS AN ALLIANCE OF EUROPEAN RESEARCH FLEETS,2009-09-01,2013-08-31,...,The quality of the infrastructures available f...,894520212,7200000,FP7-INFRASTRUCTURES-2008-1,CP-CSA-Infra,INSTITUT FRANCAIS DE RECHERCHE POUR L'EXPLOITA...,FR,HAVFORSKNINGSINSTITUTTET;INSTITUTO ESPANOL DE ...,NO;ES;BG;EL;DE;BE;IE;PT;RO;FR;IT;PL;NL;EE;TR;UK,SCI
4,107499,319818,I2MOVE,ONG,FP7-IDEAS-ERC,ERC-2012-SyG,FP7,An Intelligent Implantable MOdulator of Vagus ...,2013-04-01,2018-11-30,...,Obesity is one of the greatest public health c...,7175339,7175339,ERC-2012-SyG,ERC-SyG,IMPERIAL COLLEGE OF SCIENCE TECHNOLOGY AND MED...,UK,,,SCI
5,90086,200431,INNOSHADE,ONG,FP7-NMP,NMP-2007-1.2-1,FP7,Innovative Switchable Shading Appliances based...,2008-09-01,2012-08-31,...,"INNOSHADE is concerned with an innovative, nan...",109476058,7555176,FP7-NMP-2007-LARGE-1,CP-IP,FRAUNHOFER GESELLSCHAFT ZUR FOERDERUNG DER ANG...,DE,UNIVERSIDADE DO MINHO;VYZKUMNY USTAV ORGANICKY...,PT;CZ;DE;FR;IT;SI;NL;TR;IL;ES;CA,NNT
6,186157,625238,EPIREP,ONG,FP7-PEOPLE,FP7-PEOPLE-2013-IIF,FP7,Characterization of epithelial wound repair at...,2015-02-26,2017-06-01,...,Inflammatory bowel disease (IBD) affects milli...,2308098,2308098,FP7-PEOPLE-2013-IIF,MC-IIF,KOBENHAVNS UNIVERSITET,DK,,,LIF
7,106470,324514,EPICSTENT,ONG,FP7-PEOPLE,FP7-PEOPLE-2012-IAPP,FP7,Antibody-functionalised cardiovascular stents ...,2013-04-01,2017-03-31,...,An industry-academia collaboration is proposed...,102418565,102418565,FP7-PEOPLE-2012-IAPP,MC-IAPP,NATIONAL UNIVERSITY OF IRELAND GALWAY,IE,ASHLAND SPECIALTIES IRELAND LIMITED;BALTON SPO...,IE;PL;SK,LIF
8,108144,320330,FLAGSHIP,ONG,FP7-SSH,SSH.2012.7.1-1,FP7,Forward Looking Analysis of Grand Societal cHa...,2013-01-01,2015-12-31,...,The objectives of FLAGSHIP are: i) Understandi...,32427234,2496656,FP7-SSH-2012-2,CP-FP,ISTITUTO DI STUDI PER L'INTEGRAZIONE DEI SISTE...,IT,STICHTING THE HAGUE INSTITUTE FOR THE INTERNAT...,NL;FR;PL;BE;LU;ES;BG;PT;AT;EE,SCI
9,97935,256721,STAYERS,ONG,FP7-JTI,SP1-JTI-FCH.2009.3.1;SP1-JTI-FCH.2009.3.2,FP7,STAYERS\nStationary PEM fuel cells with lifeti...,2011-01-01,2014-06-30,...,Economical use of PEM fuel cell power for stat...,4305717,1938497,FCH-JU-2009-1,JTI-CP-FCH,NEDSTACK FUEL CELL TECHNOLOGY BV,NL,SOLVICORE GMBH & CO KG;STIFTELSEN SINTEF;SOLVA...,DE;NO;IT;BE,HFC;SCI;MAT


We have to be a bit careful now. This table includes related funding schemes, where we expect titles to follow similar patterns, but also unrelated ones, where this might not be the case. Let's select the Ambizione equivalents: ERC-SG (Starting Grants for young researchers) and the "MC-.." schemes (Marie Curie fellowships, which are also projects by young researchers).

Okay, we can also drop everything except the title.

In [22]:
df1 = df[df.fundingScheme == 'ERC-SG']
# na=False treats rows without a funding scheme as non-matching
df2 = df[df['fundingScheme'].str.contains("MC-", na=False)]
df = pd.concat([df1, df2])
df = df.filter(['title'])

Now let's save it to a text file so we can use it.

In [23]:
df.to_csv('fp7projects.txt', header=None, index=None, sep=' ')

In [24]:
with open('fp7projects.txt', 'r') as infile, \
        open('fp7projects_cleaned.txt', 'w') as outfile:
    data = infile.read()
    data = data.replace("\"", "")
    outfile.write(data)

Check that everything is consistent by seeing how many lines of text there are in the file:

In [23]:
def file_lengthy(fname):
    # count the lines lazily; also returns 0 for an empty file
    with open(fname) as f:
        return sum(1 for _ in f)


print("Number of lines in the file: ", file_lengthy("fp7projects_cleaned.txt"))


Number of lines in the file:  13362


Okay, a few stray long lines were split in two. We can live with that for now.
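
A likely cause is titles that contain a line break themselves; row 9 of the table above (`STAYERS\nStationary PEM fuel cells ...`) looks like one. If we wanted to fix this, a minimal sketch would be to collapse all whitespace in each title before writing the file. The helper name `single_line` is made up, and it assumes every title is a string.

```python
def single_line(title):
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return " ".join(title.split())


# Example modelled on the title shown in row 9 of the table
raw_title = "STAYERS\nStationary PEM fuel cells"
print(single_line(raw_title))
```

On the DataFrame this could be applied with something like `df['title'].map(single_line)` before calling `to_csv`.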

Now let's train our model again. This time with 13,362 training patterns - nice!

In [None]:
textgenbase = textgenrnn(name="base_model")
textgenbase.reset()
textgenbase.train_from_file('fp7projects_cleaned.txt',
                            new_model=True,
                            line_delimited=True,
                            word_level=True,
                            max_length=5,
                            max_gen_length=20,
                            num_epochs=20,
                            gen_epochs=-1,
                            train_size=0.8,
                            dropout=0.2)

13,361 texts collected.
Training new model w/ 2-layer, 128-cell LSTMs
Training on 132,169 word sequences.
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
   6/1032 [..............................] - ETA: 6:34 - loss: 3.4880

First save the model weights to a file, and then see how it performs.

In [6]:
textgenbase.save("base_model.hdf5")
textgenbase.generate(5, max_gen_length=20, temperature=0.5)

 40%|████      | 2/5 [00:00<00:00, 17.75it/s]

the role of the pathway in integrity of disease

impact of climate change on sea surface microlayer effect

the role of in the regulation of atmospheric carbon dioxide ( - )

novel strategies for brain regeneration



100%|██████████| 5/5 [00:00<00:00, 18.07it/s]

“ biochemical isolation and functional characterization of native - type - iii - v / semiconductor heterostructures on self






Now that we have our base model, we want to fine-tune it with the patterns from the SNF Ambizione winners. For this we don't start with a new model but build upon the existing one from the previous step. We also don't want to train for many epochs, as this would again lead to overtraining as we've seen before.

In [7]:
textgenbase.train_from_file('winning_titles.txt',
                            new_model=False,
                            line_delimited=True,
                            word_level=True,
                            max_length=5,
                            max_gen_length=20,
                            num_epochs=3,
                            gen_epochs=-1,
                            train_size=0.8,
                            dropout=0.2)
textgenbase.save("winning_model.hdf5")

1,281 texts collected.
Training on 11,857 word sequences.
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [22]:
textgenbase.generate(5,
                     max_gen_length=18,
                     temperature=0.8)

100%|██████████| 5/5 [00:00<00:00, 18.90it/s]

infection during the development of progression in children with severe infections

the normative or how regions of atlantic transfer

next generation sustainable organic catalytic compounds

a novel technique for guided treatment of cardiovascular disease

proton transfer in the overcome system






Here we go. With a temperature of 0.8 there seems to be a reasonable balance between creativity and correctness.
Temperature here basically defines how much risk the model is willing to take of being "wrong" on the next prediction: the higher the value, the more often it picks an unlikely word.
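
To make the temperature idea a bit more concrete, here is a small sketch of the usual way temperature is applied when sampling from a char-rnn style model: the predicted probabilities are rescaled as p^(1/T) and renormalised before drawing the next word. That textgenrnn does exactly this internally is an assumption; the function name `apply_temperature` and the toy probabilities are made up.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution as p_i^(1/T) and renormalise it."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                            # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy prediction over three candidate next words
probs = [0.7, 0.2, 0.1]
cold = apply_temperature(probs, 0.2)   # almost always picks the top word
warm = apply_temperature(probs, 0.8)   # slightly sharper than the raw prediction
hot = apply_temperature(probs, 2.0)    # flatter, unlikely words get picked more often
print(cold, warm, hot)

# Draw the index of the next word from the rescaled distribution
rng = random.Random(0)
next_index = rng.choices(range(len(probs)), weights=warm)[0]
print(next_index)
```

Low temperatures give safe but repetitive titles, high temperatures give more surprising but also more broken ones.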

Who wouldn't fund this research proposal? Maybe I should have submitted something titled:  
***"dark matter physics with gravitational waves"***  
Just need to work on a model that also predicts the ~15-page project description...

Other outputs:
1. *"inference with a focus on machine learning"*
2. *"next generation sustainable organic catalytic compounds"*
3. *"theoretical foundations of practical impact and global implications"*
4. *"theory and applications of linear and topological insulators in matter at interfaces"*
5. *"knowledge in the physics of the universe"*

Or we tell it to start with something non-scientific (`prefix='star wars'`):
1. *"star wars at the large hadron collider"*