# Song Generation
<br>
The neural network generates a new, "fake" song script based on patterns it recognizes in the training data, so it can be used to produce new song lyrics.

## Introduction 
We will implement a word-level RNN to generate our own song lyrics.


## Dataset
We will use the [55000+ Song Lyrics](https://www.kaggle.com/mousehead/songlyrics/kernels) dataset to train our model. The dataset contains song lyrics by different artists. As a first step, we will load this data and look at some samples.
        



In [4]:
# import required libraries
import numpy as np
import pandas as pd

In [5]:
# load dataset
data = pd.read_csv('data/songdata.csv')
data.head()

| | artist | song | link | text |
|---|---|---|---|---|
| 0 | ABBA | Ahe's My Kind Of Girl | /a/abba/ahes+my+kind+of+girl_20598417.html | Look at her face, it's a wonderful face \nAnd... |
| 1 | ABBA | Andante, Andante | /a/abba/andante+andante_20002708.html | Take it easy with me, please \nTouch me gentl... |
| 2 | ABBA | As Good As New | /a/abba/as+good+as+new_20003033.html | I'll never know why I had to go \nWhy I had t... |
| 3 | ABBA | Bang | /a/abba/bang_20598415.html | Making somebody happy is a question of give an... |
| 4 | ABBA | Bang-A-Boomerang | /a/abba/bang+a+boomerang_20002668.html | Making somebody happy is a question of give an... |


Since we only need the `text` column for training, we extract it and discard the rest for now.

In [7]:
# extract song lyrics
text = data["text"]
# print a sample of the text
text[:10]

0    Look at her face, it's a wonderful face  \nAnd...
1    Take it easy with me, please  \nTouch me gentl...
2    I'll never know why I had to go  \nWhy I had t...
3    Making somebody happy is a question of give an...
4    Making somebody happy is a question of give an...
5    Well, you hoot and you holler and you make me ...
6    Down in the street they're all singing and sho...
7    Chiquitita, tell me what's wrong  \nYou're enc...
8    I was out with the morning sun  \nCouldn't sle...
9    I'm waitin' for you baby  \nI'm sitting all al...
Name: text, dtype: object

## Data Exploration

To understand the songs and get a sense of the data and its structure, we will now explore the dataset.



In [41]:
corpus = [sent for sent in text]
print('Dataset Stats')
# approximate vocabulary size: whitespace-separated tokens with punctuation still attached
print('Number of unique words(approx): {}'.format(len({word for sent in corpus for word in sent.split()})))

# each entry of corpus is one song
print('Number of songs: {}'.format(len(corpus)))

# splitting a song on '\n' gives its lines, so this counts lines per song
line_count_song = [len(song.split('\n')) for song in corpus]
print('Average number of lines in each song: {}'.format(np.average(line_count_song)))

lines = [line for song in corpus for line in song.split('\n')]

print()
print('The lines {} to {}:'.format(0,10))
print('\n'.join(lines[0:10]))

Dataset Stats
Number of unique words(approx): 210321
Number of songs: 57650
Average number of lines in each song: 42.11129228100607

The lines 0 to 10:
Look at her face, it's a wonderful face  
And it means something special to me  
Look at the way that she smiles when she sees me  
How lucky can one fellow be?  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?  
  


## Data processing
In this section we will process and clean the data so that it is easier for the model to learn from. We will implement the following pre-processing functions:

    1. Lookup Table
    2. Tokenize Punctuation


### Lookup Table

For word embedding we create two `dict`s for the following purposes:<br>

    1. Converting a word to an integer
    2. Converting an integer back to its corresponding word


In [42]:
def lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of the song lyrics split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    unique_words = tuple(set(text))
    int_word = dict(enumerate(unique_words))
    word_int = {int_word[i]: i for i in int_word}
    
    # return tuple
    return (word_int, int_word)
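
The round trip through the two dictionaries can be sketched on a toy word list. This is only an illustration: `build_lookup_tables` is a hypothetical helper, not the notebook's `lookup_tables`, and it sorts the vocabulary (an assumption made here) so that the ids are the same on every run.

```python
def build_lookup_tables(words):
    """Hypothetical helper: same idea as lookup_tables, but with a sorted
    vocabulary so the integer ids are reproducible between runs."""
    unique_words = sorted(set(words))
    int_to_word = dict(enumerate(unique_words))
    word_to_int = {word: idx for idx, word in int_to_word.items()}
    return word_to_int, int_to_word


# Toy input, not taken from the dataset
sample_words = "look at her face it's a wonderful face".split()
word_to_int, int_to_word = build_lookup_tables(sample_words)

# Encode the words as integers, then decode them back
encoded = [word_to_int[w] for w in sample_words]
decoded = [int_to_word[i] for i in encoded]

print(encoded)
print(decoded == sample_words)
```

Both occurrences of `face` map to the same id, and decoding the ids recovers the original word list.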


### Tokenize Punctuation
Punctuation plays a vital role in text generation, so we will tokenize it. We split songs on `'\n'` and then on whitespace, so without further processing words like `baby` and `baby!` would be treated as different words. To avoid this, we replace each punctuation mark with a token that is very unlikely to appear as a real word in our sample.<br>
<br>
For this purpose we create a `dict` that maps each punctuation mark to its token.

In [43]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """    
    punct_dict = {'.': "||Period||",
                  ',': "||Comma||",
                  '"': "||QuotationMark||",
                  ';': "||Semicolon||",
                  '!': "||ExclamationMark||",
                  '?': "||QuestionMark||",
                  '(': "||LeftParentheses||",
                  ')': "||RightParentheses||",
                  '-': "||Dash||",
                  '\n':"||Return||"}

    
        
    return punct_dict
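
The effect of the punctuation tokens can be seen on a single made-up line. The sketch below uses a shortened, illustrative subset of the dictionary and an invented sample string. It applies the same replace, lowercase, and split steps that the preprocessing cell uses.

```python
# Shortened, illustrative subset of the punctuation dictionary
punct_dict = {',': "||Comma||",
              '!': "||ExclamationMark||",
              '\n': "||Return||"}

# Invented sample text, not taken from the dataset
sample = "Baby, baby!\nOh baby"

# Surround each token with spaces so that split() separates it from the word
for mark, token in punct_dict.items():
    sample = sample.replace(mark, ' {} '.format(token))

tokens = sample.lower().split()
print(tokens)
```

After this step, `baby,`, `baby!` and `baby` all contribute the same word `baby` to the vocabulary, and each punctuation mark becomes its own token.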

### Saving current progress
As the project can get large, we save the current progress to disk.

In [49]:
import pickle

SPECIAL_WORDS = {'PADDING': '<PAD>'}

# rebuild one long string from the individual lines
text = "\n".join(lines)

# replace each punctuation mark with its token, padded with spaces
token_dict = token_lookup()
for key, token in token_dict.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

vocab_to_int, int_to_vocab = lookup_tables(text + list(SPECIAL_WORDS.values()))

int_text = [vocab_to_int[word] for word in text]
# the 'checkpoint' directory must already exist
with open('checkpoint/preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)
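
Restoring the saved progress is the mirror image of saving it. The sketch below shows the round trip with toy stand-ins for the four saved objects. It writes to a temporary directory rather than `checkpoint/`, which is an assumption made only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real preprocessing results
int_text = [0, 1, 2, 1]
vocab_to_int = {'oh': 0, 'baby': 1, '||return||': 2}
int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
token_dict = {'\n': '||Return||'}

# A temporary directory is used here instead of 'checkpoint/'
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocess.p')

    # Save the tuple in the same order as the preprocessing cell
    with open(path, 'wb') as f:
        pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

    # Load it back and unpack in the same order
    with open(path, 'rb') as f:
        loaded_text, loaded_v2i, loaded_i2v, loaded_tokens = pickle.load(f)

print(loaded_text == int_text)
print(loaded_i2v[1])
```

The tuple must be unpacked in the order it was saved, since pickle stores it as a plain tuple with no field names.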

## Build the Neural Network

In this section, we'll build the components needed for an RNN: the RNN module and the forward and backpropagation functions.

In [50]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.


### Batching and creating dataloader

In [53]:
from torch.utils.data import TensorDataset, DataLoader

def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the song lyrics
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    words = np.array(words)
    # keep only as many words as fill complete batches
    batch_len = batch_size*sequence_length
    n_batches = len(words)//batch_len
    words = words[:batch_len*n_batches]
    feature, target = [], []
    # non-overlapping windows of sequence_length words
    for ii in range(0, len(words), sequence_length):
        x = words[ii:ii+sequence_length]
        feature.append(x)
        try:
            # the target is the word that follows the window
            y = words[ii+sequence_length]
        except IndexError:
            # the last window has no following word; fall back to its first word
            y = x[0]
        target.append(y)
    feature, target = np.asarray(feature), np.asarray(target)
    feature, target = torch.from_numpy(feature), torch.from_numpy(target)
    dataset = TensorDataset(feature, target)
    dataloader = DataLoader(dataset=dataset,batch_size=batch_size)
    return dataloader

In [54]:
# test dataloader

test_text = np.arange(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)


print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49]], dtype=torch.int32)

torch.Size([10])
tensor([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 45], dtype=torch.int32)
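
To see where the targets in this output come from, the windowing loop can be reproduced on plain Python lists. `make_windows` is a hypothetical helper written only for this illustration. It mirrors the loop inside `batch_data` but skips the truncation to full batches and the conversion to tensors.

```python
def make_windows(words, sequence_length):
    """Hypothetical helper mirroring the windowing loop in batch_data."""
    features, targets = [], []
    # Step through the ids in non-overlapping windows
    for start in range(0, len(words), sequence_length):
        window = words[start:start + sequence_length]
        features.append(window)
        next_index = start + sequence_length
        if next_index < len(words):
            # The target is the id that directly follows the window
            targets.append(words[next_index])
        else:
            # The last window has no following id, so reuse its first id
            targets.append(window[0])
    return features, targets


features, targets = make_windows(list(range(50)), sequence_length=5)
print(features[0], targets[0])
print(features[-1], targets[-1])
```

Each target is the id immediately after its window. The last window `[45, ..., 49]` has nothing after it, so it falls back to its own first id, which is why `45` appears twice at the end of the target tensor.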
