In this kernel, I'm going to explore how a person's gender relates to their interest in various music genres and to their phobias. The dataset contains a huge number of variables, and since this is just my second kernel on Kaggle, I want to keep things simple. Therefore, I have narrowed the analysis down to this one specific use case.

We'll start off by importing the dataset and narrowing it down to the subset we are interested in. For convenience, I'm going to rename some of the columns.

In [None]:
import pandas as pd

df = pd.read_csv('../input/responses.csv', header=0)

# Shorten a few unwieldy column names
df.rename(columns={'Metal or Hardrock': 'Metal',
                   'Number of siblings': 'NumSiblings',
                   'Left - right handed': 'Hand',
                   'House - block of flats': 'Block'}, inplace=True)

# Select the column groups by position: music preferences, phobias and demographics
music = df.columns[:19].tolist()
phobias = df.columns[63:73].tolist()
demographics = df.columns[-10:].tolist()
df = df[music + phobias + demographics]

print(df.shape)

The following are the features which we are interested in:

In [None]:
print (demographics)
print (music)
print (phobias)


The first interesting hypothesis I wanted to test is: how does interest in music vary with gender? To start with, let's look at the interest in metal music for men and women (since metal is my favourite genre of music). The two distributions we want to compare are the metal ratings given by men and the metal ratings given by women.

In [None]:
from scipy import stats

df_metal_male = df[df['Gender'] == 'male']['Metal']
df_metal_female = df[df['Gender'] == 'female']['Metal']
print(df_metal_male.mean(), df_metal_male.std())
print(df_metal_female.mean(), df_metal_female.std())

# Two-sample t-test on the ratings; missing answers are dropped first
stats.ttest_ind(df_metal_male.dropna(), df_metal_female.dropna())


In [None]:
# Order the bars by rating rather than by frequency
df_metal_male.value_counts().sort_index().plot(kind='bar')

In [None]:
# Order the bars by rating rather than by frequency
df_metal_female.value_counts().sort_index().plot(kind='bar')

We've looked at the means, but a better way to quantify the difference between two distributions is the "effect size". There are different kinds of effect size; the one we'll be looking at is "Cohen's d", which standardises the difference between the means of the two distributions by dividing it by their pooled standard deviation.
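
For reference, here is the textbook form of Cohen's d as I understand it. The symbols are introduced just for this illustration: $\bar{x}_1, \bar{x}_2$ are the two group means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes.

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
$$

For large groups, weighting the variances by $n_i$ instead of $n_i - 1$ gives practically the same value.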

In [None]:
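import math

def effect_size(series1, series2):
    # Cohen's d: difference in means divided by the pooled standard deviation
    s1, s2 = series1.dropna(), series2.dropna()  # count only actual answers
    n1, n2 = len(s1), len(s2)
    diff = s1.mean() - s2.mean()
    # pandas' var() is the sample variance (ddof=1), so weight by n - 1
    pooled_var = ((n1 - 1) * s1.var() + (n2 - 1) * s2.var()) / (n1 + n2 - 2)
    return diff / math.sqrt(pooled_var)

print(effect_size(df_metal_male, df_metal_female))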

Let's compare this effect size with other differences between males and females that are considered more "obvious". For example, the distributions of height and weight should differ considerably between males and females.

In [None]:
df_height_male =  df[df['Gender']=='male']['Height']
df_height_female =  df[df['Gender']=='female']['Height']
print (effect_size(df_height_male, df_height_female))

In [None]:
df_weight_male = df[df['Gender'] == 'male']['Weight']
df_weight_female = df[df['Gender'] == 'female']['Weight']
print(effect_size(df_weight_male, df_weight_female))

This is in line with the Wikipedia page on [Cohen's d effect size](https://en.wikipedia.org/wiki/Effect_size), according to which 0.35 (the difference in how much men and women like metal music) falls somewhere between "small" and "medium", while height and weight are in the "very large" to "huge" category of effect sizes.
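
The rule-of-thumb labels can be turned into a small lookup. This is only a sketch: the name `describe_effect_size` is made up for this example, and the cut-offs (0.01, 0.2, 0.5, 0.8, 1.2, 2.0) are the commonly quoted ones as I recall them, so please double-check them against the page before relying on them.

```python
# Rule-of-thumb labels for |d|. The cut-offs are an assumption taken from the
# commonly quoted table (Cohen's three levels plus the later extensions).
THRESHOLDS = [
    (2.0, "huge"),
    (1.2, "very large"),
    (0.8, "large"),
    (0.5, "medium"),
    (0.2, "small"),
    (0.01, "very small"),
]

def describe_effect_size(d):
    """Return the label of the largest cut-off that |d| reaches."""
    magnitude = abs(d)  # the sign only tells us which group scored higher
    for cutoff, label in THRESHOLDS:
        if magnitude >= cutoff:
            return label
    return "negligible"

print(describe_effect_size(0.35))
```

A value of 0.35 is labelled "small" here because it has passed 0.2 but not yet reached 0.5, which is the same as saying it sits between "small" and "medium".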

So, this gives us a quantitative sense of how tastes in metal music differ between men and women. While the difference is interesting, it is not as pronounced as the difference in weight or height.

Finally, we'll calculate the effect size for all music types and phobias and see what we get.

In [None]:
effect_sizes = []
for col in music + phobias:
    df_male = df[df['Gender'] == 'male'][col]
    df_female = df[df['Gender'] == 'female'][col]
    effect_sizes.append((col, effect_size(df_male, df_female)))

# Rank by absolute effect size; a positive d means males gave higher ratings
df_final = pd.DataFrame(sorted(effect_sizes, key=lambda a: abs(a[1]), reverse=True),
                        columns=['Feature', 'EffectSize'])
df_final.head(10)

As we can see, Metal shows the most pronounced difference among the music genres, while Spiders shows the most pronounced difference among the phobias (females tend to be more afraid of spiders). This is in line with [Miroslav Sabo's kernel](https://www.kaggle.com/miroslavsabo/analyzing-gender-differences), where he analyzed the average responses and found the biggest difference between genders to be in the fear of spiders.

I would love some feedback. One thing I have been wondering about: what other types of effect size would you have used, and why?

Thanks for reading.