# Examining Gender Bias in Song Lyrics

### Introduction

In this project, I compare the gender bias in Hip-Hop and Electronic song lyrics. For each genre, I train a Word2Vec model and calculate the distances between a selection of words in the lyrics and the mean embeddings of male and female words.

Datasets used in this mini project:
- word_cats.csv: lists of words within certain categories such as occupation.
- english_cleaned_lyrics.csv: lyrics of 160,856 songs. Retrieved from:
https://github.com/hiteshyalamanchili/SongGenreClassification/tree/master/dataset
- male_words.p: list of male words.
- female_words.p: list of female words.
    
This project was initially a school assignment for my Master's degree at Utrecht University. Hence, some of the code in this notebook was adapted from the code in this lab manual: https://jveerbeek.gitlab.io/dm-manual/.

### Import Modules

In [1]:
import pandas as pd
import numpy as np
import pickle
import gensim
from gensim.models import Word2Vec
import spacy

### Importing the Data

First, I import the dataset containing the song lyrics, which is used to analyze gender bias in the lyrics. The release year of each song is taken from the lookup in indx2newdate.p, and only songs released after 1960 are kept. I also import word_cats.csv.

In [2]:
PATH_DF = 'english_cleaned_lyrics.csv'
PATH_CORRECTION = 'indx2newdate.p'

def load_dataset(data_path, path_correction):
    df = pd.read_csv(data_path)
    # Load the index-to-date lookup used to set the release year
    with open(path_correction, 'rb') as f:
        indx2newdate = pickle.load(f)
    # The year is the first four characters of the date string; 0 if the date is missing
    df['year'] = df['index'].apply(lambda x: int(indx2newdate[x][0][:4]) if indx2newdate[x][0] != "" else 0)
    return df[df.year > 1960][['song', 'year', 'artist', 'genre', 'lyrics']]

df_songs = load_dataset(PATH_DF, PATH_CORRECTION)

df_wordcats = pd.read_csv('word_cats.csv')
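
The year extraction inside `load_dataset` can be tried in isolation. The sketch below assumes, as the lambda in the cell above implies, that each entry of the lookup maps a row index to a sequence whose first element is a date string starting with a four-digit year, or an empty string when no date is known. The `extract_year` helper and the toy dictionary are hypothetical and only illustrate the logic.

```python
def extract_year(index, index_to_date):
    """Return the four-digit year for a row index, or 0 if no date is stored."""
    date_string = index_to_date[index][0]
    return int(date_string[:4]) if date_string != "" else 0

# Toy stand-in for the contents of indx2newdate.p (the exact date format is assumed).
toy_dates = {0: ("1995-03-01",), 1: ("",), 2: ("1958",)}

years = [extract_year(i, toy_dates) for i in sorted(toy_dates)]
kept = [year for year in years if year > 1960]  # same filter as df[df.year > 1960]
print(years, kept)
```

Songs without a date get year 0, so the `year > 1960` filter drops them together with the older songs.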

## Genre: Hip-Hop

I create a subset of all the Hip-Hop songs and lemmatize them using spaCy so they can be used to train the word embeddings model. Next, I set the parameters of the word embeddings model and train it on the lemmatized songs.

### Lemmatizing the Songs

In [4]:
texts_h = df_songs[df_songs.genre == 'Hip-Hop'].lyrics
texts_h = [text.lower() for text in texts_h]
nlp = spacy.load("en_core_web_sm")
processed_texts_h = [text for text in nlp.pipe(texts_h,
                                               disable=["ner",
                                                        "parser"])]
lemmatized_texts_h = [[token.lemma_ for token in text if not token.is_punct] for text in processed_texts_h]

### Model Training

In [6]:
model1 = Word2Vec(size=100,
                  sg=1,
                  window=10, 
                  min_count=1,
                  workers=1)

model1.build_vocab(lemmatized_texts_h)

model1.train(lemmatized_texts_h,
             total_examples=model1.corpus_count,
             epochs=model1.epochs) # grab some coffee while training

(25507600, 38820875)
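
The training cell above uses the gensim 3.x interface. In gensim 4.0 the `size` argument was renamed to `vector_size`, and `wv.vocab` (used in the cells below) was replaced by `wv.key_to_index`. The following is a minimal sketch of one way to keep the parameters in a single place and pick the right keyword per version. The helper name `word2vec_kwargs` is hypothetical and not part of gensim.

```python
def word2vec_kwargs(gensim_version, dimensions=100):
    """Build Word2Vec keyword arguments for the given gensim version string."""
    major = int(gensim_version.split(".")[0])
    # gensim 4.0 renamed `size` to `vector_size`.
    size_key = "vector_size" if major >= 4 else "size"
    return {size_key: dimensions, "sg": 1, "window": 10, "min_count": 1, "workers": 1}

print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))
# Possible usage: Word2Vec(**word2vec_kwargs(gensim.__version__))
```

The remaining parameters are the ones used in this notebook: skip-gram (`sg=1`), a window of 10, no minimum word count filter, and a single worker.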

### Calculating Gender Bias

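Next, I load the male and female word lists and calculate the mean embedding of each list. I then loop over the categories in the word_cats dataset and over the words in each category. For every word that is in the model's vocabulary, the gender bias is the distance to the mean male embedding minus the distance to the mean female embedding, so a positive value means the word lies closer to the female words. Finally, I average the biases per category and store the results in a dataframe.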

In [8]:
male_words = pickle.load(open('male_words.p', 'rb'))
words = [word for word in male_words if word in model1.wv.vocab]
mean_embedding_male_h = np.mean([model1.wv[word] for word in words], axis=0)

female_words = pickle.load(open('female_words.p', 'rb'))
words = [word for word in female_words if word in model1.wv.vocab]
mean_embedding_female_h = np.mean([model1.wv[word] for word in words], axis=0)

In [94]:
cat_bias = []
for category in df_wordcats.columns:
    word_bias = []
    for word in df_wordcats[category]:
        if word in model1.wv.vocab:
            male_dist = np.linalg.norm(np.subtract(model1.wv[word], mean_embedding_male_h))
            female_dist = np.linalg.norm(np.subtract(model1.wv[word], mean_embedding_female_h))
            bias = male_dist - female_dist
            word_bias.append(bias)
    cat_bias.append(
        {
            'Category': category,
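            # NaN (without a RuntimeWarning) if no word of the category is in the vocabulary
            'Bias Hip-Hop': np.mean(word_bias) if word_bias else np.nan,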
        }
    )
    
df_bias_h = pd.DataFrame(cat_bias)
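
To make the sign of the bias measure concrete, here is a small sketch of the same calculation on toy vectors. The two-dimensional embeddings are hypothetical and chosen only so the distances are easy to check by hand; the `gender_bias` helper is not part of the notebook.

```python
import numpy as np

def gender_bias(word_vec, mean_male, mean_female):
    """Distance to the male mean minus distance to the female mean."""
    male_dist = np.linalg.norm(word_vec - mean_male)
    female_dist = np.linalg.norm(word_vec - mean_female)
    return male_dist - female_dist

# Hypothetical 2-D mean embeddings.
mean_male = np.array([0.0, 0.0])
mean_female = np.array([4.0, 0.0])

near_female = np.array([4.0, 3.0])  # distances: 5 to male mean, 3 to female mean
near_male = np.array([0.0, 3.0])    # distances: 3 to male mean, 5 to female mean

print(gender_bias(near_female, mean_male, mean_female))  # 2.0
print(gender_bias(near_male, mean_male, mean_female))    # -2.0
```

A positive value therefore indicates a word that is closer to the female words, and a negative value a word that is closer to the male words.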


## Genre: Electronic

The same steps are repeated for the genre 'Electronic'.

### Lemmatizing the Songs

In [11]:
texts_e = df_songs[df_songs.genre == 'Electronic'].lyrics
texts_e = [text.lower() for text in texts_e]
nlp = spacy.load("en_core_web_sm")
processed_texts_e = [text for text in nlp.pipe(texts_e,
                                               disable=["ner",
                                                        "parser"])]
lemmatized_texts_e = [[token.lemma_ for token in text if not token.is_punct] for text in processed_texts_e]

### Model Training

In [12]:
model2 = Word2Vec(size=100,
                  sg=1,
                  window=10, 
                  min_count=1,
                  workers=1)

model2.build_vocab(lemmatized_texts_e)

model2.train(lemmatized_texts_e,
             total_examples=model2.corpus_count,
             epochs=model2.epochs) # grab some coffee while training

(3451514, 5555110)

### Calculating Gender Bias

In [13]:
male_words = pickle.load(open('male_words.p', 'rb'))
words = [word for word in male_words if word in model2.wv.vocab]
mean_embedding_male_e = np.mean([model2.wv[word] for word in words], axis=0)

female_words = pickle.load(open('female_words.p', 'rb'))
words = [word for word in female_words if word in model2.wv.vocab]
mean_embedding_female_e = np.mean([model2.wv[word] for word in words], axis=0)

In [95]:
cat_bias = []
for category in df_wordcats.columns:
    word_bias = []
    for word in df_wordcats[category]:
        if word in model2.wv.vocab:
            male_dist = np.linalg.norm(np.subtract(model2.wv[word], mean_embedding_male_e))
            female_dist = np.linalg.norm(np.subtract(model2.wv[word], mean_embedding_female_e))
            bias = male_dist - female_dist
            word_bias.append(bias)
    cat_bias.append(
        {
            'Category': category,
            # NaN (without a RuntimeWarning) if no word of the category is in the vocabulary
            'Bias Electronic': np.mean(word_bias) if word_bias else np.nan,
        }
    )
    
df_bias_e = pd.DataFrame(cat_bias)

## Comparing Hip-Hop and Electronic

I combine the per-category bias results of both genres in one dataframe for convenient comparison and calculate the mean gender bias for each genre. I then print the combined dataframe twice, leaving out the duplicate 'Category' column: once sorted by Hip-Hop bias and once sorted by Electronic bias.

In [91]:
combined_bias = pd.concat([df_bias_h, df_bias_e], axis=1)
print(combined_bias['Bias Hip-Hop'].mean())
print(combined_bias['Bias Electronic'].mean())

0.022187198570463806
0.018668477369758945
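
The cells in this section combine the two dataframes by position and then select columns by index. As an alternative sketch, the frames could be joined on the 'Category' column, which avoids the duplicate column and does not depend on both frames listing the categories in the same order. The two small frames below are toy stand-ins that reuse three rows from the result tables in this section.

```python
import pandas as pd

# Toy stand-ins for df_bias_h and df_bias_e, deliberately in different row order.
df_h = pd.DataFrame({"Category": ["work", "money", "occupation"],
                     "Bias Hip-Hop": [0.000523, 0.008971, 0.01068]})
df_e = pd.DataFrame({"Category": ["occupation", "work", "money"],
                     "Bias Electronic": [-0.022288, -0.00339, 0.007808]})

# Join on the shared key instead of concatenating side by side.
merged = df_h.merge(df_e, on="Category")

by_hiphop = merged.sort_values("Bias Hip-Hop")
by_electronic = merged.sort_values("Bias Electronic")
print(by_hiphop)
print(by_electronic)
```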


In [92]:
combined_bias.iloc[:,[0,1,3]].sort_values('Bias Hip-Hop')

|    | Category | Bias Hip-Hop | Bias Electronic |
|---:|:---|---:|---:|
| 9 | work | 0.000523 | -0.00339 |
| 11 | money | 0.008971 | 0.007808 |
| 10 | leisure | 0.010284 | 0.004449 |
| 13 | occupation | 0.01068 | -0.022288 |
| 8 | body | 0.02164 | 0.019641 |
| 12 | relig | 0.022707 | 0.024891 |
| 5 | family | 0.02328 | 0.003592 |
| 3 | negemo | 0.025807 | 0.037792 |
| 7 | percept | 0.026788 | 0.036827 |
| 4 | social | 0.031037 | 0.015025 |


In [93]:
combined_bias.iloc[:,[0,1,3]].sort_values('Bias Electronic')

|    | Category | Bias Hip-Hop | Bias Electronic |
|---:|:---|---:|---:|
| 13 | occupation | 0.01068 | -0.022288 |
| 9 | work | 0.000523 | -0.00339 |
| 5 | family | 0.02328 | 0.003592 |
| 10 | leisure | 0.010284 | 0.004449 |
| 11 | money | 0.008971 | 0.007808 |
| 4 | social | 0.031037 | 0.015025 |
| 8 | body | 0.02164 | 0.019641 |
| 12 | relig | 0.022707 | 0.024891 |
| 7 | percept | 0.026788 | 0.036827 |
| 2 | posemo | 0.041764 | 0.03699 |


The results suggest that the difference in overall gender bias between Hip-Hop and Electronic is very small: the mean over all categories is 0.022 for Hip-Hop and 0.019 for Electronic. They also suggest that gender bias is distributed over the categories in a similar way. For both genres, all mean gender biases per category lie between -0.022 and 0.044, and most categories are very slightly biased towards women (mean male distance > mean female distance). However, the numbers are so small that they can be considered negligible.