# Project: Lyrics Classifier

### Goal
* Scrape poems from lyrikline.org and save them to disk
* Load the poems into a corpus
* Create TF-IDF vectors for a bag-of-words approach
* Train several models on the poems
* Predict the poet of a given poem

### Import libraries

In [1]:
from bs4 import BeautifulSoup as soup
import os

### Load corpus

In [2]:
POETS = [
    "H. C. Artmann",
    "Marcel Beyer",
    "Nico Bleutge",
    "Nora Bossong",
    "Ann Cotten",
    "Paul Celan",
    "Oswald Egger",
    "Durs Grünbein",
    "Ernst Jandl",
    "Thomas Kling",
    "Friederike  Mayröcker",
    "Monika Rinck",
    "Ulf Stolterfoht",
    "Uljana Wolf",
]


def create_poems_corpus(poets):
    """Load poem texts from files; return the poems and the matching poet indices as separate lists."""
    complete_poems = []
    indices = []
    for i, poet in enumerate(poets):
        # Each poet has a folder such as lyrik/paul_celan
        directory = f"lyrik/{poet.lower().replace(' ', '_')}"
        allfiles = os.listdir(directory)
        all_poems = []
        for file in allfiles:
            with open(directory + "/" + file, "r", encoding="utf-8") as f:
                poem = f.read()
                all_poems.append(poem)
        indices += [i] * len(all_poems)
        print(poet, len(all_poems))
        complete_poems += all_poems
    return complete_poems, indices

In [3]:
# Store the lists in variables; prints the number of poems per poet
complete_poems, indices = create_poems_corpus(POETS)

H. C. Artmann 8
Marcel Beyer 10
Nico Bleutge 15
Nora Bossong 10
Ann Cotten 15
Paul Celan 10
Oswald Egger 10
Durs Grünbein 10
Ernst Jandl 10
Thomas Kling 10
Friederike  Mayröcker 10
Monika Rinck 15
Ulf Stolterfoht 15
Uljana Wolf 14
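The folder lookup in `create_poems_corpus` depends on how each name in `POETS` is lowercased and how its spaces are replaced. The sketch below isolates that mapping and adds a guard for missing folders. The helper names `poet_directory` and `load_poems` are hypothetical and not part of the notebook; the `lyrik` root is taken from the code above.

```python
from pathlib import Path


def poet_directory(poet, root="lyrik"):
    """Map a poet's display name to the folder name used by the loader."""
    # Same rule as in create_poems_corpus: lowercase, spaces become underscores
    return Path(root) / poet.lower().replace(" ", "_")


def load_poems(poet, root="lyrik"):
    """Return the poem texts for one poet, or an empty list if the folder is missing."""
    directory = poet_directory(poet, root)
    if not directory.is_dir():
        return []
    # Sort the files so the corpus order is reproducible across runs
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(directory.iterdir())
        if path.is_file()
    ]


print(poet_directory("H. C. Artmann").as_posix())
print(poet_directory("Friederike  Mayröcker").as_posix())
```

Note that every space is replaced individually, so the double space in "Friederike  Mayröcker" maps to a folder name with a double underscore.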


### Create vectors

In [4]:
german = ["a","ab","aber","ach","acht","achte","achten","achter","achtes","ag","alle","allein","allem","allen","aller","allerdings","alles","allgemeinen","als","also","am","an","ander","andere","anderem","anderen","anderer","anderes","anderm","andern","anderr","anders","au","auch","auf","aus","ausser","ausserdem","außer","außerdem","b","bald","bei","beide","beiden","beim","beispiel","bekannt","bereits","besonders","besser","besten","bin","bis","bisher","bist","c","d","d.h","da","dabei","dadurch","dafür","dagegen","daher","dahin","dahinter","damals","damit","danach","daneben","dank","dann","daran","darauf","daraus","darf","darfst","darin","darum","darunter","darüber","das","dasein","daselbst","dass","dasselbe","davon","davor","dazu","dazwischen","daß","dein","deine","deinem","deinen","deiner","deines","dem","dementsprechend","demgegenüber","demgemäss","demgemäß","demselben","demzufolge","den","denen","denn","denselben","der","deren","derer","derjenige","derjenigen","dermassen","dermaßen","derselbe","derselben","des","deshalb","desselben","dessen","deswegen","dich","die","diejenige","diejenigen","dies","diese","dieselbe","dieselben","diesem","diesen","dieser","dieses","dir","doch","dort","drei","drin","dritte","dritten","dritter","drittes","du","durch","durchaus","durfte","durften","dürfen","dürft","e","eben","ebenso","ehrlich","ei","ei,","eigen","eigene","eigenen","eigener","eigenes","ein","einander","eine","einem","einen","einer","eines","einig","einige","einigem","einigen","einiger","einiges","einmal","eins","elf","en","ende","endlich","entweder","er","ernst","erst","erste","ersten","erster","erstes","es","etwa","etwas","euch","euer","eure","eurem","euren","eurer","eures","f","folgende","früher","fünf","fünfte","fünften","fünfter","fünftes","für","g","gab","ganz","ganze","ganzen","ganzer","ganzes","gar","gedurft","gegen","gegenüber","gehabt","gehen","geht","gekannt","gekonnt","gemacht","gemocht","gemusst","genug","gerade","gern","gesagt","geschweige","gewesen",
    "gewollt","geworden","gibt","ging","gleich","gott","gross","grosse","grossen","grosser","grosses","groß","große","großen","großer","großes","gut","gute","guter","gutes","h","hab","habe","haben","habt","hast","hat","hatte","hatten","hattest","hattet","heisst","her","heute","hier","hin","hinter","hoch","hätte","hätten","i","ich","ihm","ihn","ihnen","ihr","ihre","ihrem","ihren","ihrer","ihres","im","immer","in","indem","infolgedessen","ins","irgend","ist","j","ja","jahr","jahre","jahren","je","jede","jedem","jeden","jeder","jedermann","jedermanns","jedes","jedoch","jemand","jemandem","jemanden","jene","jenem","jenen","jener","jenes","jetzt","k","kam","kann","kannst","kaum","kein","keine","keinem","keinen","keiner","keines","kleine","kleinen","kleiner","kleines","kommen","kommt","konnte","konnten","kurz","können","könnt","könnte","l","lang","lange","leicht","leide","lieber","los","m","machen","macht","machte","mag","magst","mahn","mal","man","manche","manchem","manchen","mancher","manches","mann","mehr","mein","meine","meinem","meinen","meiner","meines","mensch","menschen","mich","mir","mit","mittel","mochte","mochten","morgen","muss","musst","musste","mussten","muß","mußt","möchte","mögen","möglich","mögt","müssen","müsst","müßt","n","na","nach","nachdem","nahm","natürlich","neben","nein","neue","neuen","neun","neunte","neunten","neunter","neuntes","nicht","nichts","nie","niemand","niemandem","niemanden","noch","nun","nur","o","ob","oben","oder","offen","oft","ohne","ordnung","p","q","r","recht","rechte","rechten","rechter","rechtes","richtig","rund","s","sa","sache","sagt","sagte","sah","satt","schlecht","schluss","schon","sechs","sechste","sechsten","sechster","sechstes","sehr","sei","seid","seien","sein","seine","seinem","seinen","seiner","seines","seit","seitdem","selbst","sich","sie","sieben","siebente","siebenten","siebenter","siebentes","sind","so","solang","solche","solchem","solchen","solcher","solches","soll","sollen","sollst","sollt","sollte","sollten","sondern",
    "sonst","soweit","sowie","später","startseite","statt","steht","suche","t","tag","tage","tagen","tat","teil","tel","tritt","trotzdem","tun","u","uhr","um","und","uns","unse","unsem","unsen","unser","unsere","unserer","unses","unter","v","vergangenen","viel","viele","vielem","vielen","vielleicht","vier","vierte","vierten","vierter","viertes","vom","von","vor","w","wahr","wann","war","waren","warst","wart","warum","was","weg","wegen","weil","weit","weiter","weitere","weiteren","weiteres","welche","welchem","welchen","welcher","welches","wem","wen","wenig","wenige","weniger","weniges","wenigstens","wenn","wer","werde","werden","werdet","weshalb","wessen","wie","wieder","wieso","will","willst","wir","wird","wirklich","wirst","wissen","wo","woher","wohin","wohl","wollen","wollt","wollte","wollten","worden","wurde","wurden","während","währenddem","währenddessen","wäre","würde","würden","x","y","z","z.b","zehn","zehnte","zehnten","zehnter","zehntes","zeit","zu","zuerst","zugleich","zum","zunächst","zur","zurück","zusammen","zwanzig","zwar","zwei","zweite","zweiten","zweiter","zweites","zwischen","zwölf","über","überhaupt","übrigens"]

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


def vectors_and_df(complete_poems, indices):
    """Fit a TF-IDF vectorizer on the poems; return a DataFrame of poem vectors (indexed by poet index) and the fitted vectorizer."""
    cv = TfidfVectorizer(stop_words=german)
    corpus_vecs = cv.fit_transform(complete_poems)
    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
    return pd.DataFrame(corpus_vecs.toarray(), index=indices, columns=cv.get_feature_names_out()), cv


In [6]:
# Store results into dataframe, keep cv for later prediction
df, cv = vectors_and_df(complete_poems, indices)
df

Unnamed: 0,10,1499,20,200,30,3000ern,aar,abblend,abbruch,abbruchreif,...,überzeichnetem,überzeuge,überzogen,übrig,übrigblieb,üppigen,üppiger,все,жиды,поэты
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
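Most entries in the matrix are 0.0 because each poem uses only a small part of the vocabulary. To make the non-zero weights concrete, here is a minimal sketch that recomputes TF-IDF by hand for a toy corpus. It follows scikit-learn's documented defaults (smoothed idf, L2-normalized rows) but simplifies tokenization to whitespace splitting, so it is not a drop-in replacement for `TfidfVectorizer`. The function name `tfidf` and the toy sentences are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    # Simplification: lowercase and split on whitespace only
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        term_freq = Counter(doc)
        raw = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf(["rose rot", "rose blau"])
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In the first toy document, "rot" gets a higher weight than "rose" because "rose" occurs in both documents and is therefore less distinctive.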


In [7]:
# Define features and target column
X = df
y = df.index

### Train test split

In [8]:
# Split the data into train and test set
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
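With 14 poets and only 8 to 15 poems each, a purely random 20% split can leave some poets with very few poems in the test set, and the split changes on every run. A possible variant, sketched here on toy labels, passes `stratify` and `random_state` to `train_test_split`. The wrapper name `stratified_split`, the seed, and the toy data are assumptions for illustration; the notebook itself uses the plain call above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split


def stratified_split(X, y, test_size=0.2, seed=42):
    """Split so that each class keeps roughly the same share in train and test."""
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Toy stand-in for the corpus: 14 "poets" with 10 "poems" each
toy_y = [poet for poet in range(14) for _ in range(10)]
toy_X = [[float(i)] for i in range(len(toy_y))]

X_tr, X_te, y_tr, y_te = stratified_split(toy_X, toy_y)
test_counts = Counter(y_te)
print(test_counts)
```

In the notebook this would be called as `stratified_split(X, y)`, which requires at least two poems per poet.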

### Train models

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, CategoricalNB

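models_params = {
    "MultinomialNB": {},
    "CategoricalNB": {},
    "RandomForestClassifier": {
        "n_estimators": 500,
        "max_depth": 200,
        # "sqrt" is what "auto" meant for classifiers; newer scikit-learn releases no longer accept "auto"
        "max_features": "sqrt",
        "n_jobs": -1,
        "random_state": 1,
    },
    # A very large C makes the regularization very weak
    "LogisticRegression": {"C": 1e6},
}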

def train_models(models_params):
    """trains models on corpus and returns dataframe with scores"""
    scores = {}
    for model in models_params:
        if model == "LogisticRegression":
            m = LogisticRegression(**models_params[model])
        elif model == "RandomForestClassifier":
            m = RandomForestClassifier(**models_params[model])
        elif model == "MultinomialNB":
            m = MultinomialNB(**models_params[model])
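        elif model == "CategoricalNB":
            # NOTE: this branch builds MultinomialNB, not CategoricalNB, so the
            # "CategoricalNB" row in the scores repeats the MultinomialNB result.
            # CategoricalNB expects integer-encoded categorical features and is
            # not a natural fit for TF-IDF weights.
            m = MultinomialNB(**models_params[model])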

        m.fit(Xtrain, ytrain)
        score_train = m.score(Xtrain, ytrain)
        score_test = m.score(Xtest, ytest)
        scores[f"{model}"] = {
            "params": models_params[model],
            "train score": score_train,
            "test score": score_test,
        }
    return pd.DataFrame(scores).T

In [10]:
df_scores = train_models(models_params)
df_scores

Unnamed: 0,params,train score,test score
MultinomialNB,{},0.976744,0.212121
CategoricalNB,{},0.976744,0.212121
RandomForestClassifier,"{'n_estimators': 500, 'max_depth': 200, 'max_f...",1.0,0.0909091
LogisticRegression,{'C': 1000000.0},1.0,0.333333
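All models fit the training data almost perfectly but score far lower on the test set, which points to overfitting. The test set holds only 33 poems (20% of 162), so a single poem shifts the test score by about 3 percentage points, and guessing among 14 poets would already be right about 7% of the time. A steadier estimate could come from stratified k-fold cross-validation. The sketch below uses toy data; the helper name `cv_accuracy`, the fold count, and the seed are assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_accuracy(model, X, y, n_splits=5, seed=1):
    """Return mean and standard deviation of accuracy over stratified folds."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()


# Toy data: 3 classes with 10 samples each, shifted apart so they are separable
rng = np.random.default_rng(0)
toy_y = np.repeat(np.arange(3), 10)
toy_X = rng.normal(size=(30, 4)) + toy_y[:, None] * 3.0

mean_acc, std_acc = cv_accuracy(LogisticRegression(max_iter=1000), toy_X, toy_y)
print(f"accuracy: {mean_acc:.2f} +/- {std_acc:.2f}")
```

In the notebook the call would be `cv_accuracy(LogisticRegression(**models_params["LogisticRegression"]), X, y)`. The smallest class has 8 poems, so `n_splits` cannot exceed 8.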


### Train on full data set

In [11]:
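# Refit the most promising model on the full data set.
# The score below is measured on the training data itself, so it says nothing about generalization.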
model = "LogisticRegression"
m = LogisticRegression(**models_params[model])
m.fit(X, y)
m.score(X, y)

1.0

In [20]:
# Poems to classify: Goethe's "Osterspaziergang" (new_poem) and a second sample (new_poem2)
new_poem = [
    """
    Vom Eise befreit sind Strom und Bäche
    Durch des Frühlings holden, belebenden Blick,
    Im Tale grünet Hoffnungsglück;
    Der alte Winter, in seiner Schwäche,
    Zog sich in rauhe Berge zurück.
    Von dort her sendet er, fliehend, nur
    Ohnmächtige Schauer körnigen Eises
    In Streifen über die grünende Flur.
    Aber die Sonne duldet kein Weißes,
    Überall regt sich Bildung und Streben,
    Alles will sie mit Farben beleben;
    Doch an Blumen fehlts im Revier,
    Sie nimmt geputzte Menschen dafür.
    Kehre dich um, von diesen Höhen
    Nach der Stadt zurück zu sehen!
    Aus dem hohlen finstern Tor
    Dringt ein buntes Gewimmel hervor.
    Jeder sonnt sich heute so gern.
    Sie feiern die Auferstehung des Herrn,
    Denn sie sind selber auferstanden:
    Aus niedriger Häuser dumpfen Gemächern,
    Aus Handwerks- und Gewerbesbanden,
    Aus dem Druck von Giebeln und Dächern,
    Aus der Straßen quetschender Enge,
    Aus der Kirchen ehrwürdiger Nacht
    Sind sie alle ans Licht gebracht.
    Sieh nur, sieh! wie behend sich die Menge
    Durch die Gärten und Felder zerschlägt,
    Wie der Fluß in Breit und Länge
    So manchen lustigen Nachen bewegt,
    Und, bis zum Sinken überladen,
    Entfernt sich dieser letzte Kahn.
    Selbst von des Berges fernen Pfaden
    Blinken uns farbige Kleider an.
    Ich höre schon des Dorfs Getümmel,
    Hier ist des Volkes wahrer Himmel,
    Zufrieden jauchzet groß und klein:
    Hier bin ich Mensch, hier darf ichs sein!
    """
]

new_poem2 = [
    """
    Beete, hört nun kurz her, bitte. 
    Ich habe überlegt, diese Abhandlung in Versen zu schreiben. 
    Es düngte mich eine Weile eine gute Idee. 
    Die verschwand, zu recht, wie ich meine. 
    Verzeiht mir also die unregelmäßigen Zeilen, in denen ich mich euch aussetzen werde. 
    Es ist kaum was drin: eins für die Vögel, eins für die anderen Vögel, 
    eins für den Tod und eines, das möglicherweise überleben könnte. 
    Verzeiht. 
    Zu fette Witze schärfen zuweilen den Mulch, mit dem ich euch belegen wollte, 
    an die Grenze der Unannehmlichkeit, einer Art von Kratzen oder Schaben, 
    das mitunter auch zu merken ist, wenn Spaten an einem Stein vorbeigeht, 
    aber ich etwas tiefer davon haben will. 
    Das kennt ihr wohl oder übel gründlicher als ich. 
    Ich hoffe euer Mark zu erschüttern, 
    und wenn ich dabei in einen sich auftuenden Schacht trete 
    und mir im günstigsten Fall etwas verrenke. 
    Ihr seht, ich meins ernst und schone niemanden. 
    Im Gegenteil, hier habt ihr ein Versprechen, 
    ich werde solche Planen über euch ausbreiten, 
    dass das Kondenswasser allen Überlegens, 
    das Brüten eines ganzen Frühlings über euch hereinbricht, 
    sooft eine leichte Brise die versprochene Behütung, 
    leicht wie das Versprechen selbst, zu Wellen ähnlich einem Meer im Theater bewegt. 
    Wollt ihr das? Ich werde euch später, wenn mein Versuch zu Ende ist, danach fragen, 
    ob ihr das gewollt habt, weil mir schon klar ist, 
    dass ich mir im vorläufigen Stadium, dem lediglich der Bestellung, 
    keine klare Antwort abholen kann, 
    und vertraue auf eure zuneigungsvollen Informationen, 
    wenn ich, nachdem dies alles vorüber ist, 
    unter euch mich aufzuhalten die Ehre haben werde. 
    """
]

In [21]:
def predict_poet(poem, poets):
    """Predict which poet in the corpus wrote the poem, using the fitted vectorizer `cv` and model `m`."""
    # Transform the poem into a dense TF-IDF vector
    poem_vec = cv.transform(poem).toarray()
    # Class probabilities for the single input poem, in the same order as `poets`
    probabilities = m.predict_proba(poem_vec)[0]

    print("This classifier predicts the poem to be written by:\n")
    for poet, probability in zip(poets, probabilities):
        print(f"{poet}: {round(probability * 100, 1)}%.")
    poem_pred = m.predict(poem_vec)[0]
    confidence = probabilities.max()
    if confidence > 0.9:
        confidence_word = "definitely"
    elif confidence > 0.7:
        confidence_word = "probably"
    else:
        confidence_word = "maybe"
    print(f"\nThis poem is {confidence_word} by {poets[poem_pred]}!")

In [22]:
predict_poet(new_poem2, POETS)

This classifier predicts the poem to be written by:

H. C. Artmann: 0.0%.
Marcel Beyer: 0.0%.
Nico Bleutge: 0.0%.
Nora Bossong: 0.0%.
Ann Cotten: 100.0%.
Paul Celan: 0.0%.
Oswald Egger: 0.0%.
Durs Grünbein: 0.0%.
Ernst Jandl: 0.0%.
Thomas Kling: 0.0%.
Friederike  Mayröcker: 0.0%.
Monika Rinck: 0.0%.
Ulf Stolterfoht: 0.0%.
Uljana Wolf: 0.0%.

This poem is definitely by Ann Cotten!
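The classifier can only choose among the 14 poets in `POETS`, so a poem by anyone else, such as Goethe, is still assigned to one of them. When the probabilities are less lopsided than in the output above, a short ranked list is easier to read than the full table. The helper below is a hypothetical addition, not part of the notebook; it takes one row of `predict_proba` output and the matching list of names.

```python
def top_candidates(probabilities, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    ranked = sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for three poets, for illustration only
example = top_candidates([0.1, 0.7, 0.2], ["Paul Celan", "Ann Cotten", "Ernst Jandl"], k=2)
for name, probability in example:
    print(f"{name}: {probability:.0%}")
```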
