<a href="https://colab.research.google.com/github/remi-vidal/NLP-ensae/blob/main/notebook_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset used in this project can be downloaded [here](https://www.kaggle.com/datasets/michaelarman/poemsdataset). It consists of two folders, each containing subfolders of poems. The poems are categorized either by form (e.g. haiku, sonnet) or by topic (love, nature, joy, peace, etc.). For our classification task, we keep only the topics.

In this notebook, we convert the subfolders into a dataframe. The folder "topics" needs to be in the same directory as this notebook.

In [1]:
import os
import pandas as pd
import re
import numpy as np

rootdir = 'topics'
df = pd.DataFrame(columns=['file_title','content', 'theme'])

# Iteration over topics
for subdir, _, files in os.walk(rootdir):
    theme = os.path.basename(os.path.normpath(subdir)) #Theme string
    content = []
    titles = []
    
    # Iteration over files inside a topic
    for file in files:
        f = os.path.join(subdir, file)
        if os.path.isfile(f):
            with open(f, "r") as my_file:  # the file is closed once it has been read
                content.append(my_file.read())
            titles.append(file)
            
#     df_theme = pd.DataFrame(content, columns = ['content'])
    df_theme = pd.DataFrame({"file_title" : titles, "content" : content})
    df_theme['theme'] = theme
    
    df = pd.concat([df, df_theme], ignore_index=True)
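
The loop above grows `df` with one `pd.concat` call per topic. As a hedged alternative, the sketch below collects one record per file and leaves the dataframe construction to a single call at the end. The function name `collect_poems` and the `encoding="utf-8"` argument are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def collect_poems(rootdir):
    """Return one dict per poem file found under rootdir (hypothetical helper)."""
    rows = []
    for subdir, _, files in os.walk(rootdir):
        # The name of the containing folder is used as the theme label
        theme = os.path.basename(os.path.normpath(subdir))
        for name in sorted(files):
            path = os.path.join(subdir, name)
            # Assumption: the poem files are UTF-8 encoded
            with open(path, "r", encoding="utf-8") as handle:
                rows.append(
                    {"file_title": name, "content": handle.read(), "theme": theme}
                )
    return rows
```

Passing the result to `pd.DataFrame(collect_poems('topics'))` should give the same three columns as the loop above, although the row order may differ because the file names are sorted here.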

In [2]:
df

Unnamed: 0,file_title,content,theme
0,LonelyPoemsALonelyHeartInAThunderstormPoembysy...,A lonely heart sets the table but is she a Mil...,lonely
1,LonelyPoemsALonelyWeekendPoembysylviaspencer.txt,"How bright it is on a Friday, when I am away f...",lonely
2,LonelyPoemsLonelyInTheDarkPoembynouriblack.txt,I am living in the darkness\nI feel so scared\...,lonely
3,LonelyPoemsLonelyBurialPoembyStephenVincentBen...,"There were not many at that lonely place,\nWhe...",lonely
4,LonelyPoemsFilmscriptForALonelyAfternoonPoemby...,"walks into deserted playground\nindifferently,...",lonely
...,...,...,...
14330,MusicPoemsWarMusicPoembyHenryVanDyke.txt,Break off! Dance no more!\nDanger is at the do...,music
14331,MusicPoemsMusicPoembyRainerMariaRilke.txt,"Take me by the hand;\nit's so easy for you, An...",music
14332,MusicPoemsTheSoundOfMusicPoembyRaviSathasivam.txt,The sound of music gives us so much emotions\n...,music
14333,MusicPoemsReedMusicPoembyFrederickKesner.txt,Amber frosted reeds\nin the summer's wind\nswa...,music


In [3]:
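def extract_author(x):
    # The author is the text after the first "by", up to the ".txt" extension
    expr = r'(?<=by)(.*)(?=\.txt)'
    author = re.findall(expr, x)[0]

    # Split CamelCase names ("HenryVanDyke" -> "Henry Van Dyke");
    # all-lowercase names are left unchanged
    upper_separator = re.findall('[A-Z][^A-Z]*', author)
    if len(upper_separator) > 0:
        author = ' '.join(upper_separator)

    return author


def extract_title(x):
    # The title is the text between "...Poems" and "Poemby"
    expr = r'(?<=oems)(.*)(?=Poemby)'
    matches = re.findall(expr, x)

    if len(matches) == 0:
        return np.nan

    # Split the CamelCase title into words; keep the raw string otherwise
    title = matches[0]
    upper_separator = re.findall('[A-Z][^A-Z]*', title)
    if len(upper_separator) > 0:
        title = ' '.join(upper_separator)

    return title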

In [4]:
df['author'] = df['file_title'].apply(extract_author)
df['title'] = df['file_title'].apply(extract_title)
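
To make the file-name convention explicit, here is a minimal sketch that parses title and author with one regular expression instead of two. It assumes every file name follows the `<Theme>Poems<Title>Poemby<Author>.txt` pattern seen in the table above; `split_camel_case` and `parse_file_title` are hypothetical names, and this is an illustration rather than the notebook's own method.

```python
import re


def split_camel_case(text):
    """Put a space before each capitalized chunk; return text unchanged if it has no capitals."""
    parts = re.findall(r'[A-Z][^A-Z]*', text)
    return ' '.join(parts) if parts else text


def parse_file_title(file_title):
    """Return (title, author), or (None, None) if the name does not fit the assumed pattern."""
    match = re.search(r'oems(.*)Poemby(.*)\.txt$', file_title)
    if match is None:
        return None, None
    title, author = match.groups()
    return split_camel_case(title), split_camel_case(author)


print(parse_file_title("MusicPoemsWarMusicPoembyHenryVanDyke.txt"))
# → ('War Music', 'Henry Van Dyke')
print(parse_file_title("LonelyPoemsALonelyWeekendPoembysylviaspencer.txt"))
# → ('A Lonely Weekend', 'sylviaspencer')
```

As in the cells above, an all-lowercase author such as `sylviaspencer` cannot be split into first and last name, because the split relies on capital letters.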

In [5]:
df = df[['title', 'author', 'content', 'theme', 'file_title']]
df

Unnamed: 0,title,author,content,theme,file_title
0,A Lonely Heart In A Thunderstorm,sylviaspencer,A lonely heart sets the table but is she a Mil...,lonely,LonelyPoemsALonelyHeartInAThunderstormPoembysy...
1,A Lonely Weekend,sylviaspencer,"How bright it is on a Friday, when I am away f...",lonely,LonelyPoemsALonelyWeekendPoembysylviaspencer.txt
2,Lonely In The Dark,nouriblack,I am living in the darkness\nI feel so scared\...,lonely,LonelyPoemsLonelyInTheDarkPoembynouriblack.txt
3,Lonely Burial,Stephen Vincent Benet,"There were not many at that lonely place,\nWhe...",lonely,LonelyPoemsLonelyBurialPoembyStephenVincentBen...
4,Filmscript For A Lonely Afternoon,Michael Shepherd,"walks into deserted playground\nindifferently,...",lonely,LonelyPoemsFilmscriptForALonelyAfternoonPoemby...
...,...,...,...,...,...
14330,War Music,Henry Van Dyke,Break off! Dance no more!\nDanger is at the do...,music,MusicPoemsWarMusicPoembyHenryVanDyke.txt
14331,Music,Rainer Maria Rilke,"Take me by the hand;\nit's so easy for you, An...",music,MusicPoemsMusicPoembyRainerMariaRilke.txt
14332,The Sound Of Music,Ravi Sathasivam,The sound of music gives us so much emotions\n...,music,MusicPoemsTheSoundOfMusicPoembyRaviSathasivam.txt
14333,Reed Music,Frederick Kesner,Amber frosted reeds\nin the summer's wind\nswa...,music,MusicPoemsReedMusicPoembyFrederickKesner.txt


In [None]:
df.to_csv('df_cleaned.csv')
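
Because `to_csv` writes the index by default, reading the file back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the effect on a small stand-in dataframe (the two rows are placeholders, and an in-memory buffer replaces `df_cleaned.csv`), together with the `index_col=0` argument that restores the index.

```python
import io

import pandas as pd

# Small stand-in for the cleaned dataframe (placeholder rows)
sample = pd.DataFrame({"title": ["Music", "Reed Music"], "theme": ["music", "music"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer)

# Reading without index_col turns the saved index into a regular column
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 uses the first column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))     # → ['Unnamed: 0', 'title', 'theme']
print(list(restored.columns))  # → ['title', 'theme']
```

Alternatively, `df.to_csv('df_cleaned.csv', index=False)` omits the index from the file altogether.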