# Cifra Genre Classification

- In this notebook, we will use FastAI and HuggingFace libraries to fine-tune deep learning models for music genre classification on our dataset.
- These libraries help us leverage state-of-the-art models like RNNs and Transformers.
- Initially, we'll use each artist's genre as the label for that artist's songs, but keep in mind this label can be noisy, since genres vary within an artist's work.
- Since music genre classification is often linked to song content, we can adapt successful NLP (Natural Language Processing) techniques for text classification to analyze song cifras and predict genre.
- These models can be used to predict genres for unlabeled data ("nao-informada") in our dataset.
- While RNNs are capable of text generation, generating cifras is a more specialized task that might require additional techniques beyond basic RNNs.

## Dataset available

Let's review the dataset we have consolidated and check for issues that could affect fine-tuning a deep learning model:
1. There are too many artist genres.
2. The distribution of cifras is not balanced across genres.
3. Some genres have only a few samples.

Therefore, let's select only the top 7 well-known Brazilian music genres, excluding some redundant classifications.

In [1]:
import pandas as pd

cifras_df = pd.read_csv("valid_cifras.csv", index_col=0)
cifras_df.describe(include=[object])

Unnamed: 0,cifra_url,url_artist_name,artist_name,artist_genre,url_song_name,song_name,cifra_file_loc,cifra_key
count,60561,60561,60561,60561,60561,60561,60561,60561
unique,60561,15271,15953,73,48463,49596,60561,37
top,https://www.cifras.com.br/cifra/gesil-amarante...,hinos-avulsos-ccb,Hinos Avulsos Ccb,nao-informada,saudade,Saudade,cifras/gesil-amarante-jr/sou-um-milagre.txt,G
freq,1,306,293,14746,62,61,1,9637


In [2]:
cifras_df.artist_genre.unique()

array(['gospel', 'mpb', 'indie-rock', 'rockn-roll', 'nao-informada',
       'catolicas', 'sertanejo', 'infantis', 'brasil', 'worship',
       'jovem-guarda', 'pisadinha', 'rock-alternativo', 'forro',
       'punk-rock', 'diversos', 'gauchas', 'romantica', 'pop-music',
       'funk-carioca', 'folk', 'samba-e-pagode', 'pop-rock',
       'velha-guarda', 'oldies', 'latinas', 'regional', 'musica-crista',
       'reggae', 'eletronica', 'raphip-hop', 'axe-music', 'brega',
       'espiritas', 'rock-classico', 'indie', 'heavy-metal', 'country',
       'indie-pop', 'funk', 'bossa-nova', 'rb', 'trap', 'fado', 'musical',
       'blues', 'indian', 'Samba-Rock', 'besteirol', 'autoconhecimento',
       'rock-grunge', 'musica-nativista', 'hinos', 'samba-enredo',
       'australiano', 'thrash-metal', 'umbanda', 'jazz', 'novelas',
       'jingle', 'rock-gotico', 'choro', 'soul', 'filmes', 'lambada',
       'reggaeton', 'disco', 'k-pop', 'especial-de-natal', 'numetal',
       'dance-music', 'opera', 'man

In [3]:
len(cifras_df.artist_genre.unique())

73

In [4]:
genres = [[genre,len(cifras_df[cifras_df['artist_genre']==genre])] for genre in cifras_df.artist_genre.unique()]
sorted(genres,key=lambda x: x[1], reverse=True)[0:10]

[['nao-informada', 14746],
 ['gospel', 9902],
 ['sertanejo', 8058],
 ['mpb', 4291],
 ['samba-e-pagode', 2527],
 ['forro', 2509],
 ['pop-music', 2198],
 ['catolicas', 1907],
 ['rockn-roll', 1872],
 ['diversos', 1466]]
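
The per-genre counts above can also be obtained with `Series.value_counts`, which avoids filtering the DataFrame once per genre. A minimal sketch on a toy DataFrame; `toy_df` and the `min_samples` threshold are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for cifras_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "artist_genre": ["gospel", "gospel", "gospel", "mpb", "mpb", "fado"],
})

# Count rows per genre, sorted from most to least frequent.
genre_counts = toy_df["artist_genre"].value_counts()

# Hypothetical threshold: genres with fewer rows are too small to train on.
min_samples = 2
small_genres = genre_counts[genre_counts < min_samples].index.tolist()

print(genre_counts.head(10))
print("Genres below threshold:", small_genres)
```

On the real data, `cifras_df["artist_genre"].value_counts().head(10)` should list the same ten genres as the cell above.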

In [5]:
selected_genres = ['gospel','sertanejo','mpb','samba-e-pagode','forro','pop-music','rockn-roll']
print("Selected genres: " + str(selected_genres))

Selected genres: ['gospel', 'sertanejo', 'mpb', 'samba-e-pagode', 'forro', 'pop-music', 'rockn-roll']
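
With the genres chosen, the labeled rows used for training and the unlabeled rows (`nao-informada`) that a trained model could later predict can be separated with boolean masks. A sketch on invented rows; `toy_df`, `labeled_df`, and `unlabeled_df` are illustrative names, not part of the notebook:

```python
import pandas as pd

selected_genres = ['gospel', 'sertanejo', 'mpb', 'samba-e-pagode',
                   'forro', 'pop-music', 'rockn-roll']

# Invented rows standing in for cifras_df.
toy_df = pd.DataFrame({
    "song_name": ["A", "B", "C", "D"],
    "artist_genre": ["gospel", "nao-informada", "fado", "mpb"],
})

# Rows whose genre is one of the selected classes: usable for training.
labeled_df = toy_df[toy_df["artist_genre"].isin(selected_genres)]

# Rows without a genre: candidates for prediction after training.
unlabeled_df = toy_df[toy_df["artist_genre"] == "nao-informada"]

print(len(labeled_df), "labeled rows;", len(unlabeled_df), "unlabeled rows")
```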


#### FastAI expects the labeled data to be organized in a specific way:
- The "train" and "test" folders hold the training and validation data, respectively.
- Each of these folders contains one subfolder per class, so we will organize the cifras folder by genre instead of by artist.

## Dataset Path Organization and balance data

- Reorganize the paths according to what FastAI expects.
- Balance the data by randomly sampling 1500 cifras for each genre.
- Split the dataset into 80% training and 20% validation.
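
The balancing and splitting steps can be sketched independently of the file copying. The example below assumes pandas 1.1 or later for `DataFrameGroupBy.sample`; `toy_df`, `n_per_genre`, and the other names are illustrative, and the data is synthetic:

```python
import pandas as pd

# Synthetic, deliberately imbalanced data: 50 'gospel' rows and 20 'mpb' rows.
toy_df = pd.DataFrame({
    "artist_genre": ["gospel"] * 50 + ["mpb"] * 20,
    "cifra_id": range(70),
})

n_per_genre = 10   # stands in for the 1500 used in the notebook
train_prop = 0.8

# Draw the same number of rows from every genre.
balanced = toy_df.groupby("artist_genre").sample(n=n_per_genre, random_state=42)

# Within each genre, the first 80% of the sampled rows go to training.
n_train = int(n_per_genre * train_prop)
position = balanced.groupby("artist_genre").cumcount()
train_df = balanced[position < n_train]
valid_df = balanced[position >= n_train]

print(train_df["artist_genre"].value_counts().to_dict())
print(valid_df["artist_genre"].value_counts().to_dict())
```

Sampling a fixed number per genre only works when every genre has at least that many rows, which is why the selection is limited to well-populated genres.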

In [6]:
from pathlib import Path
import shutil
import os

def organize_df_to_training(new_folder, num_cifras, train_prop, random_seed=42):

    train_folder = new_folder + '/train'
    valid_folder = new_folder + "/test"

    # Check if new folder already exists
    if os.path.isdir(new_folder):
        # Remove folder if already exists
        shutil.rmtree(new_folder)

    # Create new folder
    os.makedirs(new_folder)
    os.makedirs(train_folder)
    os.makedirs(valid_folder)

    # Loop through all selected genres
    for genre in selected_genres:

        # Create subfolders for that genre
        new_train_folder = train_folder + "/" + genre
        os.makedirs(new_train_folder)
        new_train_path = Path(new_train_folder)

        new_valid_folder = valid_folder + "/" + genre
        os.makedirs(new_valid_folder)
        new_valid_path = Path(new_valid_folder)

        # Get all items for the selected genre
        subset_df = cifras_df[cifras_df["artist_genre"] == genre]
        # Randomly sample num_cifras cifras for that genre
        samples = subset_df.sample(num_cifras, random_state=random_seed)

        # Split into train and valid
        n_train = int(len(samples) * train_prop)
        train_samples = samples.iloc[:n_train]
        valid_samples = samples.iloc[n_train:]

        # Copy training samples
        for index, row in train_samples.iterrows():    
            if Path(row.cifra_file_loc).is_file():
                # Copy file to new folder while adding artist name to cifra file name
                new_location = new_train_path / (row.url_artist_name + "_" + row.url_song_name + ".txt" )

                shutil.copy(row.cifra_file_loc, new_location)
            else:
                print(f"Cifra File {row.cifra_file_loc} not found!")

        # Copy validation samples
        for index, row in valid_samples.iterrows():    
            if Path(row.cifra_file_loc).is_file():
                # Copy file to new folder while adding artist name to cifra file name
                new_location = new_valid_path / (row.url_artist_name + "_" + row.url_song_name + ".txt" )

                shutil.copy(row.cifra_file_loc, new_location)
            else:
                print(f"Cifra File {row.cifra_file_loc} not found!")


num_cifras = 10 # Number of cifras for each genre
train_prop = 0.8 # Proportion of cifras used for training; the rest is used for validation

new_folder ='selected_cifras_test' # Name of folder where training dataset will be stored

organize_df_to_training(new_folder, num_cifras, train_prop)

In [7]:
!tree selected_cifras_test

selected_cifras_test
├── test
│   ├── forro
│   │   ├── avioes-do-forro_seu-choro-nao-me-faz-desistir.txt
│   │   └── wesley-safadao_amiga-parceira.txt
│   ├── gospel
│   │   ├── comunidade-doce-mae-de-deus_apenas-comecou.txt
│   │   └── julio-cesar-e-marlene_foi-ele.txt
│   ├── mpb
│   │   ├── nilo-amaro-e-seus-cantores-de-ebano_urutau.txt
│   │   └── tie_a-bailarina-e-o-astronauta.txt
│   ├── pop-music
│   │   ├── junior-de-oliveira_jeitinho-perfeito.txt
│   │   └── wanessa-camargo_vou-lembrar.txt
│   ├── rockn-roll
│   │   ├── cor-do-invisivel_sosia-em-ideias.txt
│   │   └── divididos_ojos-de-rio.txt
│   ├── samba-e-pagode
│   │   ├── grupo-na-hora-h_no-batuque-do-meu-samba.txt
│   │   └── os-originais-do-samba_tenha-fe-pois-amanha-um-lindo-dia-vai-nascer.txt
│   └── sertanejo
│       ├── roberta-miranda_gracas-a-deus.txt
│       └── tonico-e-tinoco_histo

- The previous code copies a given number of samples from our dataset into a folder organized according to FastAI conventions.
- We ran it on only a few samples and checked that it splits the dataset correctly.
- Now we can generate the split with 1500 cifras for each music genre.
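
Before scaling up, the split can also be verified programmatically instead of reading the `tree` output. A sketch, assuming the `<root>/<split>/<genre>/*.txt` layout created above; `count_files_per_class` is a hypothetical helper, and the demo builds a throwaway directory rather than touching the real dataset:

```python
import tempfile
from pathlib import Path


def count_files_per_class(root):
    """Return {split: {genre: number of .txt files}} for a root/split/genre layout."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        counts[split_dir.name] = {
            genre_dir.name: len(list(genre_dir.glob("*.txt")))
            for genre_dir in sorted(split_dir.iterdir())
            if genre_dir.is_dir()
        }
    return counts


# Demo on a temporary directory with 8 training and 2 validation files per genre.
with tempfile.TemporaryDirectory() as tmp:
    for split, n_files in [("train", 8), ("test", 2)]:
        for genre in ["forro", "gospel"]:
            genre_dir = Path(tmp) / split / genre
            genre_dir.mkdir(parents=True)
            for i in range(n_files):
                (genre_dir / f"song_{i}.txt").write_text("C G Am F")
    demo_counts = count_files_per_class(tmp)

print(demo_counts)
```

On the real data, `count_files_per_class("selected_cifras_test")` should report 8 training and 2 validation files per genre if every source file was found.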

In [8]:
num_cifras = 1500 # Number of cifras for each genre
train_prop = 0.8 # Proportion of cifras used for training; the rest is used for validation

new_folder ='selected_cifras' # Name of folder where training dataset will be stored

organize_df_to_training(new_folder, num_cifras, train_prop)

In [11]:
!tree selected_cifras --filelimit 10

[01;34mselected_cifras[0m
├── [01;34mtest[0m
│   ├── [01;34mforro[0m  [300 entries exceeds filelimit, not opening dir]
│   ├── [01;34mgospel[0m  [300 entries exceeds filelimit, not opening dir]
│   ├── [01;34mmpb[0m  [300 entries exceeds filelimit, not opening dir]
│   ├── [01;34mpop-music[0m  [300 entries exceeds filelimit, not opening dir]
│   ├── [01;34mrockn-roll[0m  [300 entries exceeds filelimit, not opening dir]
│   ├── [01;34msamba-e-pagode[0m  [300 entries exceeds filelimit, not opening dir]
│   └── [01;34msertanejo[0m  [300 entries exceeds filelimit, not opening dir]
└── [01;34mtrain[0m
    ├── [01;34mforro[0m  [1200 entries exceeds filelimit, not opening dir]
    ├── [01;34mgospel[0m  [1200 entries exceeds filelimit, not opening dir]
    ├── [01;34mmpb[0m  [1200 entries exceeds filelimit, not opening dir]
    ├── [01;34mpop-music[0m  [1200 entries exceeds filelimit, not opening dir]
    ├── [01;34mrockn-roll[0m  [1200 entries excee

## FastAI Text Classification

In [12]:
from fastai.text.all import *

- After organizing the folder and balancing the data for the top 7 music genres in the dataset, we can load it with FastAI DataLoaders and check what a batch looks like.

In [13]:
ds_path = Path(new_folder)
dls = TextDataLoaders.from_folder(ds_path, valid='test')
dls.show_batch()

  return getattr(torch, 'has_mps', False)


Unnamed: 0,text,category
0,"xxbos [ intro]dução : f xxmaj dm xxmaj bb \n\n xxup verso 1 \n\n f xxmaj dm \n xxmaj pisando fundo , acelerando tudo \n xxmaj bb f xxup f7 f \n xxmaj xxunk saindo do limite \n f xxup f7 f xxmaj dm xxmaj bb xxmaj bb9 f xxup f7 f \n é o que eu te disse eu sou assim , partindo pra cima , fugindo de mim \n f xxup f7 f xxmaj dm \n xxmaj eu corro muito , eu vou pra todo lado \n xxmaj bb f xxup f7 f \n xxmaj levando comigo quem ta do meu lado xxunk … \n f xxmaj dm xxmaj bb xxmaj bb9 f xxup f7 f \n é o que eu te disse eu sou assim , partindo pra cima , fugindo de mim \n\n pré - refrão \n\n xxmaj bb xxmaj dm \n xxmaj ah não perco",rockn-roll
1,xxbos xxmaj abrir tablatura \n▁\n▁\n▁ xxup intro ( freely ) \n▁ xxmaj em xxmaj bm \n▁ 1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 + \n e| xxrep 16 - | xxrep 16 - | xxrep 3 - +7 xxrep 11 - | \n xxup b| xxrep 14 - 0-| xxrep 16 - | xxrep 3 - +7 xxrep 11 - | \n xxup g| xxrep 12 - 0 xxrep 3 - | xxrep 16 - | xxrep 3 - +7 xxrep 11 - | \n xxup d| xxrep 4 - 0 xxrep 3 - 2 / 4 xxrep 5 - \ 2 - 0 xxrep 9 - 0 xxrep 3 - | xxrep 16 - | \n xxmaj a|0h2 xxrep 13 - | xxrep 4 - 2 - 0 - h2 xxrep 6 -,pop-music
2,xxbos [ intro ] ( base del tema ) \n\n▁ d f d f d f d \n xxmaj viaje de xxunk de sombras y cielos \n▁ d f d f d xxmaj dsus4 \n xxmaj mezcla de muerte y amor \n▁ d f d f d f d \n xxmaj pies que xxunk xxunk el suelo \n▁ d f d f d xxmaj riff1 x2 \n xxmaj mezcla de miedo y amor \n\n▁ xxmaj riff1 \n xxmaj no son tus ojos ni tu bandera \n▁ xxmaj riff1 \n xxmaj no son tus ojos ni tu bandera \n\n d f d f d f d \n xxmaj eso que dicen a veces no es cierto \n d f d f d \n xxmaj esto no es un desierto \n▁ d f d f d f d \n xxmaj nubes que siempre se caen del cielo \n▁ d f d f,rockn-roll
3,"xxbos [ intro ] d7m(9 ) xxup d6(9 ) xxup d7m(9 ) xxup d6(9 ) xxup d7m(9 ) \n▁ bb7m(9 ) bb7 m bb7m(9 ) bb7 m \n▁ bb7m(9 ) bb7 m bb7m(9 ) \n\n xxup d7m(9 ) \n xxmaj melhor eu ir \n▁ xxup g7 m f # m7 xxup g7 m \n xxmaj tudo bem vai ser melhor só \n▁ xxup d7m(9 ) \n xxmaj se teve que ser assim \n▁ xxup g7 m xxup c7(4 ) xxup b7(4 ) xxmaj bb7(4 ) xxup a7(4 ) \n é que pensando bem nunca existiu nós \n▁ xxup d7m(9 ) \n xxup só eu que pensei na gente \n▁ xxup g7 m f # m7 xxup g7 m \n xxmaj ainda que demorei pra terminar , dói \n▁ xxmaj bm7 xxup a9 \n xxmaj não era só comigo que você ficava \n▁ xxup g7 m d / f # f",samba-e-pagode
4,"xxbos xxmaj intro : f xxmaj gm c xxmaj dm \n\n [ verso 1 ] \n▁ xxmaj gm \n xxmaj eu sei que eu me atrasei \n▁ c f \n xxmaj desculpa eu vacilei , pode dizer \n▁ xxmaj dm \n xxmaj eu só vou ouvir porque \n\n▁ xxmaj gm c \n xxmaj ultimamente , você anda carente \n▁ f \n xxmaj diz que só foi mais uma \n▁ xxmaj dm \n xxmaj das mancadas que eu dei \n\n [ verso 2 ] \n▁ xxmaj gm c \n xxmaj pensando bem , são xxunk de um fim \n▁ f xxmaj dm \n xxmaj parece o limite , qualquer coisa é pretexto aqui \n▁ xxmaj gm c \n xxmaj uns minutos de espera , você vira uma fera \n▁ f xxup f7 \n e diz que não confia em mim \n\n [ pré - refrão ] \n▁ xxmaj gm \n xxmaj",sertanejo
5,xxbos xxmaj parte 1 c # m b a e \n▁ c # m b e a \n\n ( guitarra 1 ) \n\n▁ c # m b a e \n xxup e| xxrep 30 - | \n xxup b|-5 - 5 - 5 - 5 - 4 - 4 - 4 - 4 - 2 - 2 - 2 - 2 - 9 - -9-| \n xxup g| xxrep 30 - | \n xxup d|-6 - 6 - 6 - 6 - 4 - 4 - 4 - 4 - 2 - 2 - 2 - 2 - 9 - -9-| \n xxup a| xxrep 30 - | \n xxup e| xxrep 30 - | \n\n▁ c # m b e a \n xxup e| xxrep 30 - | \n xxup b|-5 - 5 - 5 - 5 - 4 - 4 - 4 - 4 - 9 - 9,gospel
6,xxbos xxmaj abrir tablatura \n▁\n▁\n▁\n xxup intro \n▁ e xxrep 49 - \n▁ b xxrep 49 - \n▁ g xxrep 49 - \n▁ d xxrep 12 - 2 xxrep 5 - 2 xxrep 3 - 2 xxrep 12 - 2 xxrep 5 - 2 xxrep 3 - 2 xxrep 3 - \n▁ a xxrep 12 - 3 xxrep 5 - 3 xxrep 3 - 3 xxrep 12 - 5 xxrep 5 - 5 xxrep 3 - 5 xxrep 3 - \n▁ e -0 - 3 - 5 - -5 - 5 - 5 - 5 - 5 - 5 - 5 - 5 - -0 - 5 - 3 - 3 - 3 - 3 - 3 - 3 - 3 - 3 - 3 xxrep 3 - \n\n xxup riff 1 \n▁ e xxrep 49 - \n▁ b xxrep 49 - \n▁ g xxrep 33 - 7 - 7,pop-music
7,xxbos xxmaj intro d / a a \n▁ c / a g / a c / a g / a \n▁ d / a a f / a g / a \n▁ d / a a \n▁ c / a g / a c / a g / a \n▁ f g a g a \n\n [ verso 1 ] \n\n a a / c # \n xxmaj este é um tempo de festa \n▁ d a / c # \n xxmaj este é um tempo de louvor \n▁ xxmaj bm xxup e4 a d / f # \n xxmaj pra celebrar aquele que primeiro nos amou \n e / g # a g / a \n xxmaj transformou nosso choro em riso \n▁ d a / c # f # m \n xxmaj nos deu novas vestes de louvor \n▁ g xxup e4 a \n xxmaj pra celebrar aquele,gospel
8,xxbos xxmaj intro f # c # d # m a # m \n▁ b f # g # m7 c # \n▁ f # c # d # m a # m \n▁ b f # g # m c # \n\n f # \n xxup lá vai mais um sonhador \n▁ c # 4 c # \n xxmaj falavam assim de mim \n g # m f # b \n xxmaj ele é só mais um caso comum \n▁ c # 4 c # \n xxmaj xxunk meu fim \n f # \n xxup lá vai mais um sonhador \n▁ c # 4 c # \n xxmaj falavam também de xxmaj josé \n g # m \n xxmaj mas xxmaj josé não xxunk \n f # b \n xxmaj não reclamou \n▁ c # 4 c # \n xxmaj segurou a fé \n\n▁ d # m \n xxmaj,gospel


- The FastAI Learner defines the training pipeline for the RNN.
- For text classification, we use FastAI's `AWD_LSTM`, a Long Short-Term Memory (LSTM) architecture.
- The metric we will use is the accuracy of the model.
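
Accuracy is the fraction of validation samples whose predicted genre matches the label. With 7 balanced classes, a classifier that guesses uniformly at random is expected to score about 1/7, or roughly 14.3%, which gives a baseline for judging the fine-tuned model. A plain-Python sketch of the metric (the label lists are invented, and this is not FastAI's implementation):

```python
def accuracy_score(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Invented labels for illustration only.
actual = ["gospel", "mpb", "forro", "gospel", "sertanejo"]
predicted = ["gospel", "mpb", "gospel", "gospel", "mpb"]

print(f"accuracy: {accuracy_score(predicted, actual):.2f}")

# Expected accuracy of uniform random guessing over 7 balanced genres.
chance_level = 1 / 7
print(f"chance level: {chance_level:.3f}")
```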

In [14]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

In [None]:
learn.fine_tune(4, 1e-2)

  return getattr(torch, 'has_mps', False)


epoch,train_loss,valid_loss,accuracy,time



- Sadly, my notebook has no GPU and cannot handle the fine-tuning process.
- Also, PyTorch emits many warnings during training.
- Let's save the dataset so we can run the training in another notebook with more processing power.

In [1]:
!tar -czf selected_cifras.tar.gz selected_cifras
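
If the `tar` command is not available (for example on Windows), a similar archive can be produced with Python's standard `tarfile` module. A sketch; `make_tar_gz` is a hypothetical helper, and the demo archives a temporary folder instead of `selected_cifras`:

```python
import tarfile
import tempfile
from pathlib import Path


def make_tar_gz(folder, archive_path):
    """Create a gzip-compressed tar archive containing `folder` at its top level."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(folder, arcname=folder.name)
    return Path(archive_path)


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny folder that mimics the dataset layout.
    demo_folder = Path(tmp) / "demo_cifras"
    (demo_folder / "train" / "gospel").mkdir(parents=True)
    (demo_folder / "train" / "gospel" / "song.txt").write_text("C G Am F")

    # Archive it and read the member names back to confirm the contents.
    archive = make_tar_gz(demo_folder, Path(tmp) / "demo_cifras.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        archived_names = tar.getnames()

print(archived_names)
```

For the real dataset the call would be `make_tar_gz("selected_cifras", "selected_cifras.tar.gz")`.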