Lyrics-Conditioned Neural Melody Generation ( demonstration at https://www.youtube.com/watch?v=2PHcKhaLxAU ).

@September 19, 2020: The full source code for melody generation from lyrics can be downloaded at https://drive.google.com/file/d/1FbKTMX4w7nKyMf4-ZQF5J5DC71S-GPzh/view?usp=sharing .

@September 17, 2020: Updated the answers for readers and released the updated version of this work at https://drive.google.com/file/d/1NIJAHuZMD2gro6Ws3o-b5K_Ec-nQiJuC/view?usp=sharing . Accepted by ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2021.

@Feb. 14, 2020: Released the code of Conditional LSTM-GAN for Melody Generation from Lyrics at https://drive.google.com/file/d/1j0qhd0YkTp1-q6FNEE7y4O8JpA865KQ5/view?usp=sharing

If you use our lyrics-melody dataset and lyrics embedding (including our skip-gram model and BERT model, each trained on our lyrics dataset), please cite our paper "Conditional LSTM-GAN for Melody Generation from Lyrics", available at https://arxiv.org/pdf/1908.05551.pdf and accepted by ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2021. You can find the 12 melodies (melodies_experiments.zip) used in the subjective evaluation of this paper. These 12 melodies come from the baseline method, LSTM-GAN, and the ground truth, respectively.

--baseline method: bas1-4--; --LSTM-GAN:gen1-4--; --ground truth:GT1-4--

If you have any questions, contact Yi Yu (https://home.hiroshima-u.ac.jp/yiyu/index.html, yiyu@hiroshima-u.ac.jp). Abhishek Srivastava and Simon Canales were involved in this work during their internships at NII, Tokyo, from August to September 2020 and from June to August 2019, respectively.

Answers to some questions raised by the readers are summarized as follows (have merged same/similar questions to avoid unnecessary repetition). We will keep updating answer lists once new questions come out.

Important: Regarding syllable-level and word-level embedding: download all source code of "Conditional LSTM-GAN for Melody Generation from Lyrics" at https://drive.google.com/file/d/1FbKTMX4w7nKyMf4-ZQF5J5DC71S-GPzh/view . Then find the file "Create song.ipynb" to extract the syllable embedding and word embedding.

Updated questions and answers @ September 17, 2020

Q1: One syllable in the lyrics can correspond to many notes in the melody. This means that the sequence of syllables and the sequence of notes may not have the same length, and one also has to determine which notes correspond to which syllables. -------------------------------A1: An English syllable is a unit of sound, which could be a word or a part of a word. One syllable may correspond to one or more notes in English songs. One-syllable-to-one-note pairs are also considered when generating melody from lyrics in this study.

Q2: This paper treats "rests" as an attribute of a note, which is not the most intuitive approach. A more natural approach would be to treat a rest as a note with a special pitch value (e.g. 0). ---------------------A2: A note typically contains two attributes: pitch and duration. A rest indicates how long a silence in a piece of melody lasts, which is an important element of music. If we treated a rest as a note with a special pitch value, it would be hard to find a syllable to pair with it. Therefore, in our study, we take the "rest" as a music attribute. Each triplet of music attributes {note, duration, rest} then corresponds to a syllable, as shown in Fig. 1.

Q3: The dimensionality of the lyrics embedding is only 20.--------------------A3: Our method encodes syllable information at two different semantic levels: syllable-level and word-level. Two skip-gram models for syllable-level and word-level embedding are trained as a lyrics encoder (lyrics embedding), which aims at associating a vector representation with an English syllable. The number of syllables is much smaller than that of general words. The concatenation of a syllable and its word forms a 20-dimension vector. Accordingly, lyrics sequences (each with 20 syllables), each syllable being encoded in 20-dimension, are used in our experiment.
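As a rough illustration of the concatenation described in A3, the sketch below joins a syllable-level vector and a word-level vector into one 20-dimensional vector. The 10 + 10 split, the lookup tables, and the function name are assumptions made for illustration; the actual vectors come from the trained skip-gram models shipped with this repository.

```python
# Hypothetical lookup tables; real vectors come from the trained skip-gram models.
SYLLABLE_DIM = 10  # assumed split: 10 (syllable) + 10 (word) = 20
WORD_DIM = 10

def embed_syllable(syllable, word, syll_vectors, word_vectors):
    """Concatenate the syllable-level and word-level vectors for one syllable."""
    s = syll_vectors[syllable]
    w = word_vectors[word]
    if len(s) != SYLLABLE_DIM or len(w) != WORD_DIM:
        raise ValueError("unexpected embedding size")
    return list(s) + list(w)

# Toy values, for illustration only.
syll_vectors = {"rhy": [0.1] * 10, "thm": [0.2] * 10}
word_vectors = {"rhythm": [0.5] * 10}
vec = embed_syllable("rhy", "rhythm", syll_vectors, word_vectors)
```

Both syllables of "rhythm" would share the same word-level half and differ only in the syllable-level half.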

Q4: When discretizing the generated melody, the authors use a predefined set of allowed note lengths (Table 2).-----------------A4: The note durations shown in Table 2 are based on the distribution of note durations in our dataset, which is shown in Fig. 5. Therefore, when we discretize the note lengths of the generated melody, the note durations in Table 2 are used.
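One simple way to picture this discretization is nearest-value snapping. The sketch below is an assumption about how it could be done, not the repository's implementation, and it uses the duration set listed later in this README as a stand-in for Table 2 of the paper, which may differ.

```python
# Assumed allowed values: the duration set listed later in this README.
ALLOWED_DURATIONS = [0.25, 0.5, 1, 2, 4, 8, 16, 32]

def quantize_duration(value, allowed=ALLOWED_DURATIONS):
    """Return the allowed duration closest to a continuous generator output."""
    return min(allowed, key=lambda d: abs(d - value))

quantized = [quantize_duration(v) for v in [0.31, 0.9, 3.2, 1.4]]
```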

Q5: The proposed method makes no effort trying to ensure the melody has a clear bar structure (e.g. 3/4 or 4/4). One note generated with the wrong duration can totally destroy the rhythm. ------------A5: The continuous-valued sequence generated by the generator needs to be constrained to the underlying musical representation of discrete-valued MIDI attributes. Due to quantization error, a generated note could be assigned an improper duration, which would destroy the rhythm. But we think the probability of this is low after applying LSTM in the melody generation. We plan to apply the Gumbel-Softmax relaxation technique to train a conditional LSTM-GAN for the discrete generation of music attributes in the future.

Q6: The baseline system, which samples notes from certain distributions, is rather weak. From the description in the paper, it does not seem to consider the relationship between neighboring notes. -------------A6: Since the topic of this work focuses on lyrics-conditioned melody generation, we added maximum likelihood estimation (MLE) with LSTM as a stronger baseline, which is often used as a classical baseline to compare with the GANs. The corresponding description of MLE is newly added in Sec.6.3 and some new results of MLE are added in Fig. 8, Table 5, and Fig. 15.

Q7: At most two sequences of 20 notes are taken from each song. Are the songs significantly longer than 40 notes? If so, isn't this a waste of data? Also, are the sequences complete sentences? ----------------A7: Most songs have more than 40 notes with corresponding syllables, and a few songs are significantly longer than 40 notes. To avoid bias in the training data (many samples from just a few songs), we chose to extract 2 sequences of 20 notes from each song, if possible. In this process, we ensure that the sequences of syllable-note pairs are complete.
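The README does not say how the two sequences are chosen within a song. Purely as an illustration, the sketch below assumes non-overlapping windows taken from the start of the song; the function name and the windowing rule are hypothetical.

```python
SEQ_LEN = 20
MAX_SEQS_PER_SONG = 2

def extract_sequences(pairs, seq_len=SEQ_LEN, max_seqs=MAX_SEQS_PER_SONG):
    """Cut up to `max_seqs` non-overlapping windows of `seq_len` syllable-note pairs."""
    windows = []
    for start in range(0, len(pairs) - seq_len + 1, seq_len):
        windows.append(pairs[start:start + seq_len])
        if len(windows) == max_seqs:
            break
    return windows
```

Under this rule a song with fewer than 20 pairs contributes nothing, one with 20 to 39 pairs contributes a single sequence, and longer songs contribute two.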

Q8: When analyzing the effect of conditioning on lyrics, it would be especially interesting to see how the sentiments (e.g. happy, sad) of the lyrics affect the distribution of pitch and note lengths. Here it will be important to compare against a baseline not conditioned on lyrics. Some examples would also assist the analysis. --------------------A8: This could be an interesting experiment for analyzing emotion-based music generation. Unfortunately, since our original dataset with paired syllable and note does not contain emotion information, it is hard to consider the impact of sentiments. To prove the effectiveness of melody generation from lyrics which is our main focus in the paper, a lot of experiments have been provided in Sec. 6.6. In addition, we have added new experiments to show the performance of the estimated distribution without conditioning lyrics, which is also compared with the results of conditioning lyrics and ground truth.

Q9: In Fig. 1, the rhythm is totally off from the original song. Is this intended? --------------A9: The figure shows the alignment between lyrics and melody. Because the MIDI song is played by software to show the alignment, it could have a small offset from the original song, but this does not affect the understanding of the alignment.

Q10: The sentence is confusing: "each melody sequence has 20 notes, which needs 20 LSTM cells to learn the sequential alignment between lyrics and melody" -- this doesn't explain clearly how alignment is performed, and also seems to imply that the hidden layer size 400 is a product of 20 * 20. Also, does the sentence "The output with the triplet music attributes of previous LSTM cell is concatenated with current 20-dimensional syllable embedding, which are further fed to current LSTM cell" imply that the output of the final layer is also fed into the lower LSTM layers? --------------A10: The alignment means the musical relationship between syllables and notes, which is originally contained in the MIDI songs. During the training, we actually feed the pairs of syllables and notes into the model. LSTM can learn this sequential relationship. The hidden layer size 400 is not related to a product of 20*20. Since this is lyrics-conditioned melody generation, the final output (generated MIDI sequences) of generator should be concatenated with syllable embedding and used as the input of the discriminator.

Q11: "scale" is better called "key". ----------------A11: The concept of standard scales is important for the melody generation task, and is defined in Wikipedia (https://en.wikipedia.org/wiki/Scale_(music)). If a melody contains all notes belonging to one of the standard scales, it indicates this melody has a perfect scale consistency.
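A scale-consistency score of this kind can be sketched as the best fraction of notes that fall inside a single candidate scale. The sketch below assumes that "standard scales" means the 12 major and 12 natural minor scales; the paper may use a different set, so treat the scale list and the function name as assumptions.

```python
MAJOR = [0, 2, 4, 5, 7, 9, 11]          # semitone offsets of a major scale
NATURAL_MINOR = [0, 2, 3, 5, 7, 8, 10]  # semitone offsets of a natural minor scale

def scale_consistency(midi_notes):
    """Best fraction of notes that fall inside one candidate scale."""
    if not midi_notes:
        return 0.0
    best = 0.0
    for root in range(12):
        for pattern in (MAJOR, NATURAL_MINOR):
            pitch_classes = {(root + step) % 12 for step in pattern}
            hits = sum(1 for n in midi_notes if n % 12 in pitch_classes)
            best = max(best, hits / len(midi_notes))
    return best
```

A melody whose notes all belong to one of the candidate scales scores 1.0, which corresponds to the "perfect scale consistency" mentioned in A11.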

Q12: In Fig. 9, the Random baseline system generates more transitions with a difference of 0 than other transitions. Why is this the case if the baseline system samples notes at random? ----------A12: A difference of 0 means the next note is the same as the current note. This is because the baseline samples from the note distribution, so the note with the highest probability of occurrence appears more often.

Q13: As the lyric/melody is paired, supervised learning with LSTM / transformer could be a reasonable baseline, which is not compared to in the paper. To be specific, the randomness could be injected with dropout in both training and test so that supervised learning does not necessarily mean no randomness. --------------A13: We have added the Maximum likelihood estimation (MLE) with LSTM as a stronger baseline. Some corresponding results are newly added in Fig. 8, Fig. 9, Table 5, and Fig. 15. Drop-out is used in the training to improve system reliability. Randomization of the generation results is realized by using a random noise as input.

Q14: Two important and basic elements of music (tonality and rhythm) are almost completely missing in this paper. For example, there could be evaluation on alignment of bars or beats, and evaluation on the alignment on tonality (C major, D minor). Music is different from voice / sound and music has its inner manually defined structure in music theory. ----------A14: As we know, GANs can create new data samples that resemble the training data. In this initial study, three music attributes (note, duration, and rest) are considered as our training data. Thus, we put considerable effort into evaluating lyrics-conditioned melody generation in Sec. 6.5-6.7, where properties of generated notes, the transitions between notes, note duration, rest duration, the effect of lyrics conditioning, and subjective listening are evaluated. Our quantization of continuous values generated from GAN follows the standard scale consistency such as C major and D minor. We have newly added the description of standard scales in Sec. 4.4 to avoid confusion. Rhythm has been evaluated in Sec. 6.7 as a subjective metric.

Q15: As the lyric/melody is paired, a reasonable baseline could be supervised learning with LSTM / transformer. Two important and basic elements, tonality and rhythm, are completely missing in either modeling or evaluation. For example, there is no evaluation based on alignment of bars or beats. There is no evaluation on tonality, e.g. the ground-truth could be C Major and the generated one could be F Minor. -----------------------A15: We have newly added MLE as a stronger baseline in this paper. Accordingly, some new experiments are newly added in the experiments such as Fig. 8, Table 5, and Fig. 15. The nature of GAN is to train a model to generate melodies resembling the training samples. In our study, our training data contains note, note duration, and rest, which can represent the melody. Thus, to evaluate the effectiveness of our method, the experiments mainly focus on if melodies generated by our conditional LSTM-GAN resemble the distribution of the training samples, which is consistent with the motivation of our GAN generation and can indicate if generated melodies satisfy the music knowledge/theory contained in the original training data. Our original dataset does not contain bar information. We actually evaluated music theory treatment to see if generated melodies resemble the training data in Fig. 8, Table 5, Fig. 9, Fig. 13 and Fig. 14. The evaluation of Rhythm and the entire melody are presented in Sec. 6.7. Our quantization of continuous values generated from GAN follows the standard scale consistency such as C major and D minor. We have newly added the description in Sec. 4.4 to avoid misleading.

Q16: It is not clear how the syllable embedding vectors are obtained, or why they are concatenated with noise vectors. -------------A16: We created a syllable embedding which encodes lyrics information at two different semantic levels: syllable-level and word-level. Two skip-gram models for syllable-level and word-level embedding are trained as lyrics encoders (lyrics embedding), which aim at associating a vector representation with an English syllable. In the original unconditional GAN model, the generator builds a mapping function from a prior noise distribution to the data space. By concatenating the syllable embedding and noise as input, the GAN is expected to guide the melody generation process (through the lyrics) while keeping enough diversity (through the noise) in melody generation. We have newly added descriptions in Sec. 4.2 and Sec. 4.3.4.
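To make the concatenation in A16 concrete, the sketch below builds a per-time-step generator input from noise and a 20-dimensional syllable embedding. The noise size, the uniform noise distribution, and the function name are assumptions; this README does not state them, and the actual model lives in lstm-gan-lyrics2melody.py.

```python
import random

EMBED_DIM = 20   # syllable + word embedding size stated in A3
NOISE_DIM = 20   # assumed noise size; not stated in this README
SEQ_LEN = 20     # syllables per lyrics sequence

def build_generator_input(embeddings, noise_dim=NOISE_DIM, rng=random):
    """Concatenate a fresh noise vector with each syllable embedding."""
    steps = []
    for emb in embeddings:
        # Assumed uniform noise; the paper may use another prior.
        noise = [rng.uniform(0.0, 1.0) for _ in range(noise_dim)]
        steps.append(noise + list(emb))
    return steps

lyrics = [[0.0] * EMBED_DIM for _ in range(SEQ_LEN)]
gen_input = build_generator_input(lyrics)
```

Drawing new noise for the same lyrics yields a different input, which is how the same lyrics can lead to different melodies.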

Q17: It is not clear why the proposed model use GAN. Moreover, the choice of discriminator seems not representative. -----------A17: Our aim is to learn a deep model that is able to represent the distribution of real samples, which further has the capability of generating new samples from this estimated distribution. In particular, using the capability of deep learning and generative modeling, sequential alignment relationship can be learned between lyrics and melody from real training samples. We have newly added the description for addressing why GAN is used in the first paragraph of Sec.4. The discriminator loss function penalizes the discriminator for misestimating a real sample as fake or a fake sample as real, while the generator is trained to generate better samples so as to be regarded as real samples. Because the input of the discriminator is a sequence, LSTM is also used in the discriminator.

Q18: The question on two high-level aspects of melody tonality and rhythm is addressed. The rhythm is evaluated subjectively. For the tonality (called standard scale in section 4.4), scale consistency is achieved by a postprocess step called tuning scheme in this paper. The tuning scheme only ensures that the generated melody is consistent in one scale. However, it doesn't ensure that the scale of the generated melody matches the scale of the ground truth. Evaluation on scale matching could be done automatically. However, there is no evaluation on this aspect. ---------A18: We have done the new experiment of scale consistency and the results are described in Sec.6.5, as follows: “We also investigated scale-consistency of generated melodies. The mean accuracy of scale-consistency for the Conditional LSTM-GAN model is 48.6%. In contrast, for the MLE baseline and Random baseline models, it is 47.3% and 46.6% respectively. The mean accuracy of scale-consistency is not high due to two main reasons: i) the mapping between lyrics and melody is not unique i.e. for a given lyrics there can be multiple melodies belonging to different standard scales. ii) some notes overlap between different standard scales, e.g., D,E,F,G,A appear in both C𝑚𝑎𝑗𝑜𝑟 and D𝑛𝑎𝑡𝑢𝑟𝑎𝑙 𝑚𝑖𝑛𝑜𝑟.”

Q19: I still have questions about the n-gram related evaluation. As pointed out in author response A 2.3, 2-MIDI-number and 3-MIDI-number are n-gram evaluation metrics. Standard n-gram metrics (e.g. BLEU2, BLEU3, BLEU4) in sequence generation tasks (e.g. music melody, captioning, translation) calculate the percentage of n-gram terms matched in the ground truth. However, 2-MIDI-number and 3-MIDI-number don't fall into this category of metrics and are merely statistics. In my opinion, such a statistic-style n-gram metric is too weak for evaluation. I'd like to see stronger automatic evaluation metrics.------------A19: We have done some new experiments for calculating BLEU2, BLEU3, BLEU4, which are shown in Table 6. “To compare the qualities of generated melodies, standard BLEU scores as evaluation metrics for conditional LSTM-GAN, MLE baseline, and Random baseline are respectively calculated, which are shown in Table 6. From this table, we can find that our conditional LSTM-GAN model achieves the highest BLEU scores compared with the other two baseline methods. This also concludes that our method can generate melodies of relatively higher qualities.”

Q20: The first 8 notes in Fig.1 should have the same duration. If they don't, it probably indicates an error when you quantize the duration of notes. -------------A20: The alignment in Fig. 1 is from the original MIDI dataset. We have checked the data again and confirmed that the durations in Fig. 1 match and reflect the original durations in the MIDI dataset. Moreover, we also invited a music expert to review this alignment between lyrics and melody. She confirmed that there is no problem.

Q21: The definition of "3-MIDI-number repetitions" and "2-MIDI-number repetitions" is still not clear. The added red text provided the definition of 3-grams and 2-grams. But how are these used to compute the "repetitions"? -------------A21: 3-MIDI-number repetitions: a count of how often 3-grams of MIDI numbers (a melody slice consisting of 3 adjacent notes) repeat throughout the sequence, which is also used as a metric in previous work. In each generated sequence, the number of 3-gram repetitions equals the sum of the “number of each 3-gram occurrence minus 1”, i.e., only repeated 3-grams are taken into account. 2-MIDI-number repetitions: a count of how often 2-grams of MIDI numbers repeat throughout a sequence. In each generated sequence, the number of 2-gram repetitions equals the sum of the “number of each 2-gram occurrence minus 1”. In Table 5 (In-songs attributes metrics evaluation), the averages of 3-MIDI-number repetitions and 2-MIDI-number repetitions are calculated for each method.
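The definition in A21 translates directly into a few lines of code. This is a minimal sketch of that definition (the function name is hypothetical, and midi_statistics.py in the repository may compute it differently):

```python
from collections import Counter

def ngram_repetitions(midi_numbers, n):
    """Sum of (occurrences - 1) over all distinct n-grams in the sequence."""
    grams = [tuple(midi_numbers[i:i + n])
             for i in range(len(midi_numbers) - n + 1)]
    counts = Counter(grams)
    return sum(c - 1 for c in counts.values())

seq = [60, 62, 64, 60, 62, 64, 60, 62]
two_gram_reps = ngram_repetitions(seq, 2)    # (60,62) x3, (62,64) x2, (64,60) x2
three_gram_reps = ngram_repetitions(seq, 3)  # each of the three 3-grams occurs twice
```

For the toy sequence above, the 2-gram count is 2 + 1 + 1 = 4 and the 3-gram count is 1 + 1 + 1 = 3. An n-gram that occurs only once contributes 0.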

Old questions and answers before September 17, 2020

Q1 You analyse a dataset of 12,197 MIDI songs and demonstrate the results in Figure 5. I have a question about rest duration and it would be great if you could help me figure it out. It shows that almost 80% of sequences do not have a rest. Does it account for the breathing breaks of singers?-------------------- A1: Fig. 5 shows the flattened distribution of the rest duration over all the notes from our dataset. This doesn't mean that 80% of sequences do not have a rest, but that 80% of the notes don't have a rest before them. The rest duration value was calculated according to the note-off, note-on information of MIDI files, using eq. (12) and the allowed rest duration values shown in Table 3. You can notice that the shortest rest representation we use is a quarter rest. This means that short silences will be represented with a 0 rest value.

Q2 The Neural Melody Composition from Lyrics paper recommends setting every midi file to the same tempo and key. Would it help to do this with your data? (Or maybe they're the same already? Your paper doesn't say.)--------------------- A2: The data is not set to the same key. However, people are free to train the model with another dataset. The tempo information is used to compute the value of the rest duration or note duration. The tempo is then not taken into account.

Q3 "The tempo information is used to compute the value of the rest duration or note duration. The tempo is then not taken into account." What do you mean by "The tempo is then not taken into account"?--------------------- A3: The unit of the durations is the beat. Therefore, the absolute duration of a note or a rest depends on the tempo. However, this is not used during training or sampling. To recreate MIDI files, we can choose any tempo.
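Because durations are stored in beats, turning them back into absolute time only needs a tempo. The helper below is a small sketch of the standard beats-to-seconds conversion; the function name is hypothetical and the tempo values are arbitrary choices.

```python
def beats_to_seconds(beats, tempo_bpm):
    """Convert a duration in beats to seconds at a given tempo (beats per minute)."""
    if tempo_bpm <= 0:
        raise ValueError("tempo must be positive")
    return beats * 60.0 / tempo_bpm

# The same one-beat note at two different tempos.
slow = beats_to_seconds(1, 60)
fast = beats_to_seconds(1, 120)
```

The same sequence of beat durations therefore plays back faster or slower depending on the tempo chosen when the MIDI file is recreated.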

Q4 In I – Introduction Existing works, e.g., Markov models [7], random forests [8], and recurrent neural network (RNN)[9], can generate lyrics-conditioned music melody. However, these methods cannot ensure that the distribution of generated data is consistent with that of real samples. Could you detail why?-------------------- A4: Markov models, random forests, and RNN, all can learn the transition probability between adjacent notes in sequences, but they do not explicitly model the distribution of the note sequence. In our work, LSTM does the similar work as RNN. But we adopt the GAN model by adding a discriminator. GAN helps to model unknown distributions, and ensures that the learned distribution (of note sequence) is consistent with that of the real samples.

Q5 In the case of Markov, etc., I understand that the generation may not conform well to the learnt style, unless a high-order Markov model is used, but then with the risk of copying entire sequences from the corpus and thus plagiarism. But, in the case of an RNN-based architecture [9], what is the rationale?-------------------- A5: As mentioned before, RNN does similar work to the LSTM in our work. But without a discriminator, it only learns the transition probability between adjacent notes and does not guarantee that generated sequences look like real ones.

Q6 I mean you could use an RNN (LSTM) architecture with conditioning. Have you tried to compare your LSTM-GAN-conditioning architecture to an equivalent LSTM-GAN architecture?-------------------- A6: This was tried, and we noticed mode collapse, meaning that there was less variety in the generated melodies. In other words, conditioning seems to reduce mode collapse.

Q7 RNN GAN architecture. How do you use the Generator? Do you enter a sequence of syllables and generate the corresponding sequence of MIDI triplets? Or do you do it iteratively, i.e., generating successive triplets one by one from successive syllables? Please confirm! --------------------A7: We feed the whole sequence to an LSTM, ensuring that the output state of the LSTM at time t-1 is given as an input to the same LSTM at time t. This can be viewed as a non-iterative process (unrolled representation of LSTM) or as an equivalent iterative process. In the TensorFlow implementation this process is non-iterative.

Q8 The same for the Discriminator, what is the exact input? A sequence of triplets I guess?-------------------- A8: The exact input of discriminator is the generated sequences of MIDI triplets which are the output of generator.

Q9 Baseline model. Do you create melodies by concatenating randomly sampled MIDI triplets from the validation set? But the resulting sequence is not meaningful.-------------------- A9: Melodies from the baseline model are created by randomly sampling the testing set based on the dataset distribution of music attributes. The MIDI numbers are restricted between 60 and 80.
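A baseline of this kind can be sketched as independent draws from the empirical pitch distribution, restricted to the 60 to 80 range. The function name, the per-note independence, and the seed handling below are assumptions for illustration, not the repository's code.

```python
import random
from collections import Counter

def sample_baseline_pitches(pitches, length=20, low=60, high=80, seed=None):
    """Draw pitches independently from the empirical distribution within [low, high]."""
    counts = Counter(p for p in pitches if low <= p <= high)
    if not counts:
        raise ValueError("no pitches in the allowed range")
    values = list(counts)
    weights = [counts[v] for v in values]
    rng = random.Random(seed)
    return rng.choices(values, weights=weights, k=length)

# Toy pitch data: 85 and 40 fall outside the allowed range and are ignored.
data = [60] * 50 + [62] * 30 + [85] * 20 + [40] * 5
melody = sample_baseline_pitches(data, seed=0)
```

Since each note is drawn without looking at its neighbour, the most frequent pitch often follows itself, which matches the explanation given for the many 0-difference transitions of the Random baseline.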

Answers to your questions:

--Which folder/file can I use to get the 12,197 MIDI files (7,998 + 4,199)? -- lmd-full_and_reddit_MIDI_dataset.
--Which folder/file can I use to get the 7,998 MIDI files that come from the "LMD-full" MIDI dataset? -- lmd-full_MIDI_dataset.
--Which folder/file can I use to get the 4,199 MIDI files that come from the reddit MIDI dataset? -- The reddit dataset is not parsed alone. You can find it by using "lmd-full_and_reddit_MIDI_dataset".
--Which folder/file can I use to get all syllable-level, word-level, and sentence-level embedding vectors extracted from our trained Skip-gram model? -- Skip-gram_lyric_encoders (lyric_encoders) contains all trained models.
--Which folder/file can I use to get all syllable-level, word-level, and sentence-level embedding vectors extracted from our trained BERT model? -- This is not uploaded due to limited space, but you can email us to obtain it.
--Which folder/file can I use to get information about how to directly use the trained Skip-gram model to extract lyrics embedding vectors? -- Skip-gram_model_script_to_extract_syllables_and_word_level_embeddings.ipynb can be used to directly get the embeddings.
--Which folder/file can I use to get information about how to directly use the trained BERT model to extract lyrics embedding vectors? -- This is not uploaded due to limited space, but you can email us to obtain it.
--Which folder/file can I use to get the lyrics embedding vectors used in the paper "Conditional LSTM-GAN for Melody Generation from Lyrics"? -- The same script can be used to get the embeddings (Skip-gram_model_script_to_extract_syllables_and_word_level_embeddings.ipynb can be used to directly get the embeddings).

More details:

For the project, two different datasets were used:

One dataset can be found in the "partial_dataset" folder and comes from the Lakh MIDI Dataset lmd-full (downloadable at this URL: https://colinraffel.com/projects/lmd/). Only English songs from the dataset were used. To download the MIDI files corresponding to the .npy files from the dataset, you can search for the file names, which are unchanged in both datasets and serve as IDs. This dataset was used for training the LSTM-GAN model. Both word-level parsing and syllable-level parsing were used in the training (see below for more information).
The other dataset was made by merging the one from the Lakh MIDI Dataset and one found at https://www.reddit.com/r/datasets/comments/3akhxy/the_largest_midi_collection_on_the_internet/. This dataset was used for training the Skip-gram embeddings as well as the BERT embeddings. From this dataset, only word-level parsing was used.

Lyrics embeddings for "lmd-full + reddit" are used for training the skip-gram model and the BERT model, while lyrics embeddings for "lmd-full" are used for training, validation, and testing of the Conditional LSTM-GAN model for melody generation from lyrics.

The parsing is as follows:

— Syllable parsing: This is the lowest-level format. It pairs every note with the corresponding syllable and its attributes.

— Word parsing: This format groups all the notes of a word and gives the attributes of every syllable that makes up the word.

— Sentence parsing: Similarly, this format groups all the notes that form a sentence (in most cases, a lyric line) with their corresponding attributes.

— Sentence and word parsing: Using the two previous formats, this one parses the lyrics and notes into sentences and, within these sentences, separates everything into words.

One note always contains one and only one syllable. We parsed every song into continuous attributes and discrete attributes.

The discrete attributes are, in order:

— The pitch of the note: In music, the pitch determines which note is played. We use the MIDI note number as the unit of pitch; it can take any integer value between 0 and 127.

— The duration of the note: The duration of the note in beats. It can be a quarter note, a half note, a whole note or more. The exhaustive set of values it can take in our parsing is: [0.25; 0.5; 1; 2; 4; 8; 16; 32].

— The duration of the rest before the note: This value can take the same numerical values as the duration, but it can also be zero (no rest).

The continuous attributes are, in order:

— The start of the note : In seconds since the beginning of the sung song.

— The length of the note : In seconds.

— The frequency of the note : In Hertz.

— The velocity of the note : Measured as an integer by the pretty_midi Python package.
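The frequency values in the example further down are consistent with the standard equal-temperament conversion from MIDI note number to Hertz, with A4 (MIDI 69) at 440 Hz. The sketch below shows that formula; whether the dataset was built with exactly this function is an assumption, and the function name is hypothetical.

```python
def midi_to_hz(midi_number):
    """Equal-temperament frequency in Hz, with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_number - 69) / 12.0)

a4 = midi_to_hz(69)  # reference pitch
d5 = midi_to_hz(74)  # pitch of the first syllable in the example
```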

An example from the song Listen to the Rhythm of the falling rain, in the syllable parsing, is:

List [74.0, 0.5, 0.0] [0.0, 0.18595050000000057, 587.3295358348151, 110.0]

en [72.0, 1.0, 0.0] [0.247933999999999, 0.24380176666666742, 523.2511306011972, 110.0]

to [72.0, 0.5, 0.0] [0.24793400000000076, 0.18595050000000057, 523.2511306011972, 110.0]

the [69.0, 1.0, 0.0] [0.24793400000000076, 0.24380176666666564, 440.0, 110.0]

rhy [69.0, 0.5, 0.0] [0.247933999999999, 0.18595050000000057, 440.0, 110.0]

thm [67.0, 1.0, 0.0] [0.24793400000000076, 0.24380176666666564, 391.99543598174927, 110.0]

of [67.0, 1.0, 0.0] [0.247933999999999, 0.24380176666666742, 391.99543598174927, 110.0]

the [65.0, 0.5, 0.0] [0.24793400000000076, 0.18595050000000057, 349.2282314330039, 110.0]

fall [67.0, 2.5, 0.0] [0.24793400000000076, 0.41322333333333283, 391.99543598174927, 110.0]

ing [65.0, 1.0, 0.0] [0.49586799999999975, 0.24380176666666564, 349.2282314330039, 110.0]

rain [65.0, 4.0, 0.0] [0.247933999999999, 0.9917360000000013, 349.2282314330039, 110.0]

Tel [74.0, 1.0, 1.0] [1.2396700000000003, 0.24380176666666742, 587.3295358348151, 110.0]

ling [72.0, 0.5, 0.0] [0.24793400000000076, 0.18595050000000057, 523.2511306011972, 110.0]

me [72.0, 1.0, 0.0] [0.247933999999999, 0.24380176666666742, 523.2511306011972, 110.0]

just [69.0, 0.5, 0.0] [0.24793400000000076, 0.18595050000000057, 440.0, 110.0]

what [69.0, 0.5, 0.0] [0.24793400000000076, 0.1239669999999986, 440.0, 110.0]

a [67.0, 0.5, 0.0] [0.1239669999999986, 0.12396700000000038, 391.99543598174927, 110.0]

fool [69.0, 1.5, 0.0] [0.12396700000000038, 0.37190100000000115, 440.0, 110.0]

I've [72.0, 0.5, 0.0] [0.49586799999999975, 0.18595050000000057, 523.2511306011972, 110.0]

been [72.0, 2.0, 0.0] [0.24793400000000076, 0.49586799999999975, 523.2511306011972, 110.0]
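For readers who want to work with the printed form of the example above, the sketch below parses one such line into named fields. It only handles the text layout shown here (syllable, discrete triplet, continuous quadruplet); the layout of the actual .npy files may differ, and the function name and dictionary keys are hypothetical.

```python
import ast
import re

# syllable, then two bracketed lists, as printed in the example above
LINE_RE = re.compile(r"^(\S+)\s+(\[[^\]]*\])\s+(\[[^\]]*\])$")

def parse_syllable_line(line):
    """Split one printed line into syllable, discrete and continuous attributes."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    syllable, discrete, continuous = match.groups()
    pitch, duration, rest = ast.literal_eval(discrete)
    start, length, frequency, velocity = ast.literal_eval(continuous)
    return {
        "syllable": syllable,
        "pitch": pitch, "duration": duration, "rest": rest,
        "start": start, "length": length,
        "frequency": frequency, "velocity": velocity,
    }

row = parse_syllable_line(
    "Tel [74.0, 1.0, 1.0] "
    "[1.2396700000000003, 0.24380176666666742, 587.3295358348151, 110.0]"
)
```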

MIT License

(We will update our source timely and codes used for this project will be shared soon. Accordingly, license will be updated. ) Permission is hereby granted, free of charge, to any person obtaining a copy of this source and associated documentation files (the "dataset and embedding vectors"), without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the source.

THE SOURCE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOURCE OR THE USE OR OTHER DEALINGS IN THE SOURCE.

Repository contents:

Skip-gram_lyric_encoders
lmd-full_MIDI_dataset
lmd-full_and_reddit_MIDI_dataset
0. Create train-test-valid datasets.ipynb
3. Melody generation using test data.ipynb
4. Create song.ipynb
README.md
Readme
Skip-gram_model_script_to_extract_syllables_and_word_level_embeddings.ipynb
lstm-gan-lyrics2melody.py
melodies_experiments.zip
midi_statistics.py
mmd.py
roll.py
utils.py
