# Data Preparation

1. Read the CSV file
2. Remove the NaN entries
3. Split the continuous text into individual words
4. Remove all punctuation
5. Restrict the time period to 1970 through 2016
6. Add a word-count column
7. Keep texts of between 15 and 1500 words
8. Save the prepared CSV file

In [1]:
import pandas as pd

### Reading the CSV file

In [21]:
df = pd.read_csv('lyrics.csv')
df.head()

Unnamed: 0,index,song,year,artist,genre,text
0,0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


## Normalizing letter case
All letters are converted to lowercase.

In [3]:
df['text'] = df['text'].str.lower()
df.head()

Unnamed: 0,index,song,year,artist,genre,text
0,0,ego-remix,2009,beyonce-knowles,Pop,"oh baby, how you doing?\nyou know i'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,if you search\nfor tenderness\nit isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"oh oh oh i, oh oh oh i\n[verse 1:]\nif i wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"party the people, the people the party it's po..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362237 entries, 0 to 362236
Data columns (total 6 columns):
index     362237 non-null int64
song      362235 non-null object
year      362237 non-null int64
artist    362237 non-null object
genre     362237 non-null object
text      266557 non-null object
dtypes: int64(2), object(4)
memory usage: 16.6+ MB


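### Removing the NaN entries
Missing values are replaced with empty strings. Songs whose text is empty are dropped later by the word-count filter.

In [22]:
df = df.fillna("")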
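The difference between filling and dropping missing lyrics can be sketched on a small, made-up table. The frame `toy` and its contents are assumptions for illustration only; `dropna(subset=...)` is shown as an alternative, not as the approach this notebook takes.

```python
import pandas as pd

# Hypothetical miniature of the lyrics table: one song has no text.
toy = pd.DataFrame({
    "song": ["a", "b", "c"],
    "text": ["oh baby", None, "hello hello"],
})

# fillna("") keeps every row and turns the missing text into an empty string.
filled = toy.fillna("")

# dropna(subset=["text"]) removes rows without lyrics right away.
dropped = toy.dropna(subset=["text"])

print(len(filled), len(dropped))
```

Filling keeps the row count unchanged, so the songs without lyrics only disappear once a minimum word count is enforced.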

### Splitting the continuous text into individual words
We use the word_tokenize function from nltk.tokenize to split each text into individual words.

In [6]:
from nltk.tokenize import word_tokenize
df['word_tokenize'] = df['text'].apply(word_tokenize)

### Removing all punctuation
We remove all punctuation tokens from our dataset.

In [7]:
import string
stop = list(string.punctuation)

df['word_tokenize_no_punctuation'] = df['word_tokenize'].apply(lambda x: [item for item in x if item not in stop])
df['word_tokenize_no_punctuation']

0         [oh, baby, how, you, doing, you, know, i, 'm, ...
1         [playin, everything, so, easy, it, 's, like, y...
2         [if, you, search, for, tenderness, it, is, n't...
3         [oh, oh, oh, i, oh, oh, oh, i, verse, 1, if, i...
4         [party, the, people, the, people, the, party, ...
5         [i, heard, church, bells, ringing, i, heard, a...
6         [this, is, just, another, day, that, i, would,...
7         [waiting, waiting, waiting, waiting, waiting, ...
8         [verse, 1, i, read, all, of, the, magazines, w...
9         [n-n-now, honey, you, better, sit, down, and, ...
10        [i, lay, alone, awake, at, night, sorrow, fill...
11        [hello, hello, baby, you, called, i, ca, n't, ...
12        [feels, like, i, 'm, losing, my, mind, love, i...
13        [youre, everything, i, thought, you, never, we...
14        [i, got, ta, give, up, to, quite, the, storm, ...
15        [it, really, hurts, to, say, this, yes, it, do...
16        [you, 're, bad, for, me, i, cl
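One detail of the membership test above may be worth checking: `string.punctuation` contains single characters only, so a token made of several punctuation characters (for example `...`) would pass the filter. A minimal sketch of a stricter variant, using a hypothetical helper `drop_punctuation` and a made-up token list:

```python
import string

PUNCT = set(string.punctuation)

def drop_punctuation(tokens):
    """Drop every token that consists only of punctuation characters."""
    return [tok for tok in tokens if not all(ch in PUNCT for ch in tok)]

# Made-up tokens, including multi-character punctuation.
tokens = ["oh", "baby", ",", "...", "``", "i", "'m", "n-n-now", "?"]

# Membership test as used in the cell above: "..." and "``" survive.
by_membership = [tok for tok in tokens if tok not in list(string.punctuation)]

# Stricter variant: tokens with at least one non-punctuation character stay.
cleaned = drop_punctuation(tokens)

print(by_membership)
print(cleaned)
```

Tokens such as `'m` and `n-n-now` are kept by both variants because they contain letters.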

### Restricting the time period
Only songs released from 1970 through 2016 are kept.

In [8]:
df = df[df['year'] >= 1970]
df = df[df['year'] <= 2016]
df

Unnamed: 0,index,song,year,artist,genre,text,word_tokenize,word_tokenize_no_punctuation
0,0,ego-remix,2009,beyonce-knowles,Pop,"oh baby, how you doing?\nyou know i'm gonna cu...","[oh, baby, ,, how, you, doing, ?, you, know, i...","[oh, baby, how, you, doing, you, know, i, 'm, ..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see...","[playin, ', everything, so, easy, ,, it, 's, l...","[playin, everything, so, easy, it, 's, like, y..."
2,2,honesty,2009,beyonce-knowles,Pop,if you search\nfor tenderness\nit isn't hard t...,"[if, you, search, for, tenderness, it, is, n't...","[if, you, search, for, tenderness, it, is, n't..."
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"oh oh oh i, oh oh oh i\n[verse 1:]\nif i wrote...","[oh, oh, oh, i, ,, oh, oh, oh, i, [, verse, 1,...","[oh, oh, oh, i, oh, oh, oh, i, verse, 1, if, i..."
4,4,black-culture,2009,beyonce-knowles,Pop,"party the people, the people the party it's po...","[party, the, people, ,, the, people, the, part...","[party, the, people, the, people, the, party, ..."
5,5,all-i-could-do-was-cry,2009,beyonce-knowles,Pop,i heard\nchurch bells ringing\ni heard\na choi...,"[i, heard, church, bells, ringing, i, heard, a...","[i, heard, church, bells, ringing, i, heard, a..."
6,6,once-in-a-lifetime,2009,beyonce-knowles,Pop,this is just another day that i would spend\nw...,"[this, is, just, another, day, that, i, would,...","[this, is, just, another, day, that, i, would,..."
7,7,waiting,2009,beyonce-knowles,Pop,"waiting, waiting, waiting, waiting\nwaiting, w...","[waiting, ,, waiting, ,, waiting, ,, waiting, ...","[waiting, waiting, waiting, waiting, waiting, ..."
8,8,slow-love,2009,beyonce-knowles,Pop,[verse 1:]\ni read all of the magazines\nwhile...,"[[, verse, 1, :, ], i, read, all, of, the, mag...","[verse, 1, i, read, all, of, the, magazines, w..."
9,9,why-don-t-you-love-me,2009,beyonce-knowles,Pop,"n-n-now, honey\nyou better sit down and look a...","[n-n-now, ,, honey, you, better, sit, down, an...","[n-n-now, honey, you, better, sit, down, and, ..."
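The two comparisons can also be written as a single inclusive range check. The following is a sketch on made-up data (the frame `songs` is an assumption); the trailing `.copy()` is an optional precaution so that columns added afterwards are written to an independent frame rather than to a slice of the original.

```python
import pandas as pd

# Hypothetical years around both boundaries of the range.
songs = pd.DataFrame({
    "song": ["s1", "s2", "s3", "s4", "s5"],
    "year": [1968, 1970, 1999, 2016, 2017],
})

# Series.between is inclusive on both ends by default.
in_range = songs[songs["year"].between(1970, 2016)].copy()

print(in_range["year"].tolist())
```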


### Counting the words per song
The count is stored as a new column.

In [16]:
df['word_tokenize_length'] = df['word_tokenize_no_punctuation'].apply(len)

### Restricting the word count
Only texts of between 15 and 1500 words are kept.

In [10]:
df = df[df['word_tokenize_length'] >= 15]
df = df[df['word_tokenize_length'] <= 1500]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259793 entries, 0 to 362236
Data columns (total 9 columns):
index                           259793 non-null int64
song                            259793 non-null object
year                            259793 non-null int64
artist                          259793 non-null object
genre                           259793 non-null object
text                            259793 non-null object
word_tokenize                   259793 non-null object
word_tokenize_no_punctuation    259793 non-null object
word_tokenize_length            259793 non-null int64
dtypes: int64(3), object(6)
memory usage: 19.8+ MB


In [11]:
df.head()

Unnamed: 0,index,song,year,artist,genre,text,word_tokenize,word_tokenize_no_punctuation,word_tokenize_length
0,0,ego-remix,2009,beyonce-knowles,Pop,"oh baby, how you doing?\nyou know i'm gonna cu...","[oh, baby, ,, how, you, doing, ?, you, know, i...","[oh, baby, how, you, doing, you, know, i, 'm, ...",474
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see...","[playin, ', everything, so, easy, ,, it, 's, l...","[playin, everything, so, easy, it, 's, like, y...",270
2,2,honesty,2009,beyonce-knowles,Pop,if you search\nfor tenderness\nit isn't hard t...,"[if, you, search, for, tenderness, it, is, n't...","[if, you, search, for, tenderness, it, is, n't...",177
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"oh oh oh i, oh oh oh i\n[verse 1:]\nif i wrote...","[oh, oh, oh, i, ,, oh, oh, oh, i, [, verse, 1,...","[oh, oh, oh, i, oh, oh, oh, i, verse, 1, if, i...",555
4,4,black-culture,2009,beyonce-knowles,Pop,"party the people, the people the party it's po...","[party, the, people, ,, the, people, the, part...","[party, the, people, the, people, the, party, ...",338
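How the word-count bounds act on very short, empty, and very long texts can be sketched on made-up data. Note the assumptions: whitespace splitting via `str.split()` stands in for `word_tokenize` here, and the column names `tokens` and `n_words` are hypothetical.

```python
import pandas as pd

# Hypothetical texts: empty, 20 words, 3 words, 2000 words.
toy = pd.DataFrame({"text": ["", "la " * 20, "one two three", "na " * 2000]})

# Whitespace splitting as a simple stand-in for a real tokenizer.
toy["tokens"] = toy["text"].str.split()
toy["n_words"] = toy["tokens"].apply(len)

# Keep only texts with 15 to 1500 words (both bounds inclusive).
kept = toy[toy["n_words"].between(15, 1500)].copy()

print(toy["n_words"].tolist())
print(kept["n_words"].tolist())
```

An empty text yields zero tokens, so the lower bound also removes songs whose lyrics were missing.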


## Removing stop words
Define the words to remove.

In [12]:
from nltk.corpus import stopwords
# The lyrics were lowercased above, so the lowercase NLTK list is sufficient
stop = list(stopwords.words('english'))
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Remove the defined words.

In [13]:
df['word_tokenize_no_stopwords'] = df['word_tokenize_no_punctuation'].apply(lambda x: [item for item in x if item not in stop])
df.head()

Unnamed: 0,index,song,year,artist,genre,text,word_tokenize,word_tokenize_no_punctuation,word_tokenize_length,word_tokenize_no_stopwords
0,0,ego-remix,2009,beyonce-knowles,Pop,"oh baby, how you doing?\nyou know i'm gonna cu...","[oh, baby, ,, how, you, doing, ?, you, know, i...","[oh, baby, how, you, doing, you, know, i, 'm, ...",474,"[oh, baby, know, 'm, gon, na, cut, right, chas..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see...","[playin, ', everything, so, easy, ,, it, 's, l...","[playin, everything, so, easy, it, 's, like, y...",270,"[playin, everything, easy, 's, like, seem, sur..."
2,2,honesty,2009,beyonce-knowles,Pop,if you search\nfor tenderness\nit isn't hard t...,"[if, you, search, for, tenderness, it, is, n't...","[if, you, search, for, tenderness, it, is, n't...",177,"[search, tenderness, n't, hard, find, love, ne..."
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"oh oh oh i, oh oh oh i\n[verse 1:]\nif i wrote...","[oh, oh, oh, i, ,, oh, oh, oh, i, [, verse, 1,...","[oh, oh, oh, i, oh, oh, oh, i, verse, 1, if, i...",555,"[oh, oh, oh, oh, oh, oh, verse, 1, wrote, book..."
4,4,black-culture,2009,beyonce-knowles,Pop,"party the people, the people the party it's po...","[party, the, people, ,, the, people, the, part...","[party, the, people, the, people, the, party, ...",338,"[party, people, people, party, 's, popping, si..."
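The output above still contains contraction fragments such as `'m`, `'s`, and `n't`, because the tokenizer splits contractions while the stop-word list stores them in full form (for example `you're`). A minimal sketch of one way to handle this, assuming a shortened, hypothetical stop-word set and a hypothetical helper `remove_stopwords`; using a set also makes each lookup constant-time:

```python
# Hypothetical, shortened stop-word set for illustration only.
STOP_WORDS = {"how", "you", "doing", "i"}

# Assumed extra entries covering the fragments the tokenizer produces.
CONTRACTION_FRAGMENTS = {"'m", "'s", "'re", "n't"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not contained in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["oh", "baby", "how", "you", "doing", "you", "know", "i", "'m"]

basic = remove_stopwords(tokens, STOP_WORDS)
extended = remove_stopwords(tokens, STOP_WORDS | CONTRACTION_FRAGMENTS)

print(basic)
print(extended)
```

Whether the fragments should be removed depends on the later analysis; negations such as `n't` may carry meaning worth keeping.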


### Saving the prepared CSV file

In [14]:
# Write the selected columns to a new CSV file
df[['song', 'year', 'genre', 'word_tokenize_no_punctuation',
    'word_tokenize_length', 'word_tokenize_no_stopwords']].to_csv(
        "lyrics_clean_utf8.csv", encoding='utf-8')
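When a CSV file with list-valued columns is read back, the lists arrive as plain strings such as `"['oh', 'baby']"`. A sketch of one way to restore them, assuming a made-up one-row frame and an in-memory buffer in place of the real file; the column name `tokens` is hypothetical.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with a list-valued column.
df_out = pd.DataFrame({"song": ["demo"], "tokens": [["oh", "baby"]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Plain read: the list comes back as its string representation.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With a converter, the string is parsed back into a Python list.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"tokens": ast.literal_eval})

print(type(plain["tokens"].iloc[0]), type(parsed["tokens"].iloc[0]))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.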