# Spain Wage Structure Survey

The data presented here were retrieved from the INE's periodic survey on the wage structure of Spain's population:
https://www.ine.es/dyngs/INEbase/en/operacion.htm?c=Estadistica_C&cid=1254736177025&menu=ultiDatos&idp=1254735976596

Information about each variable can be found in the dr_EES_2014.xlsx file.


In [None]:
salary <- read.csv("Data/salario.csv")
head(salary)
dim(salary)

In [None]:
library(ggplot2)

# default plot size
options(repr.plot.width = 20, repr.plot.height = 10, repr.plot.res = 100)

# default theme and font size
theme_set(theme_bw(base_size = 24))



ggplot(salary, aes(x=SALBRUTO)) + geom_density(fill="darkorange", color="black")

In [None]:
ggplot(salary, aes(x=log(SALBRUTO))) + geom_density(fill="darkorange", color="black")

In [None]:
ggplot(salary, aes(x=log(SALBRUTO))) + geom_density(aes(fill=SEXO),  color="black", alpha=0.5)

In [None]:
# How different are wages, on average? Remember aggregate?
aggregate(salary$SALBRUTO, by=list(salary$SEXO), mean)
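
As a side note, `aggregate` also has a formula interface that keeps the original column names in its output instead of `Group.1` and `x`. The sketch below uses a small simulated data frame, not the survey: the `toy` table, the `H`/`M` codes and all the numbers are made up for illustration.

```r
# Simulated stand-in for the survey data (all values are invented)
set.seed(42)
toy <- data.frame(
    SEXO = rep(c("H", "M"), each = 100),
    SALBRUTO = c(rlnorm(100, meanlog = 10.1, sdlog = 0.5),
                 rlnorm(100, meanlog = 9.9, sdlog = 0.5))
)

# Formula interface: "SALBRUTO grouped by SEXO", output columns keep their names
by_sex <- aggregate(SALBRUTO ~ SEXO, data = toy, FUN = mean)
by_sex

# Ratio between the two group means, as one way to summarise the difference
gap_ratio <- by_sex$SALBRUTO[by_sex$SEXO == "M"] / by_sex$SALBRUTO[by_sex$SEXO == "H"]
gap_ratio
```

The same call should work on `salary` by replacing `data = toy` with `data = salary`, assuming the column names are the ones used above.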

In [None]:
# Occupation codes (CNO1):
# A0	Directors and managers
# B0	Scientific and intellectual technicians and professionals in health and education
# C0	Other scientific and intellectual technicians and professionals
# D0	Technicians; associate professionals
# E0	Office employees who do not deal with the public
# F0	Office employees who deal with the public
# G0	Catering and retail service workers
# H0	Health and personal care service workers
# I0	Protection and security service workers
# J0	Skilled workers in agriculture, livestock, forestry and fishing
# K0	Skilled construction workers, except machine operators
# L0	Skilled manufacturing workers, except plant and machine operators
# M0	Stationary plant and machinery operators, and assemblers
# N0	Drivers and mobile machinery operators
# O0	Unskilled service workers
# P0	Labourers in agriculture, fishing, construction, manufacturing and transport
# Q0	Military occupations

# How different are wages by position?
aggregate(salary$SALBRUTO, by=list(salary$CNO1), mean)

In [None]:
# Boxplots by positions
ggplot(salary, aes(y=log(SALBRUTO))) + geom_boxplot(aes(fill=CNO1))

In [None]:
# Let's compare wages between sexes depending on position
means <- aggregate((salary$SALBRUTO), by=list(salary$SEXO, salary$CNO1), mean)
sds <- aggregate((salary$SALBRUTO), by=list(salary$SEXO, salary$CNO1), sd)

head(means)
head(sds)

sex_positions <- data.frame(SEXO=means$`Group.1`, Position=means$`Group.2`,   mean=means$x, sd=sds$x)
sex_positions
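
The cell above pairs `means$x` with `sds$x` by row position, which works because both `aggregate` calls use the same grouping. A slightly more defensive alternative is to join the two tables on their group columns with `merge`. The sketch below shows the idea on simulated data; the `toy` table, its codes and its values are invented for illustration and are not taken from the survey.

```r
# Simulated stand-in for the survey data (all values are invented)
set.seed(1)
toy <- data.frame(
    SEXO = sample(c("H", "M"), 200, replace = TRUE),
    CNO1 = sample(c("A0", "B0", "C0"), 200, replace = TRUE),
    SALBRUTO = rlnorm(200, meanlog = 10, sdlog = 0.5)
)

# One table per statistic, both keyed by SEXO and CNO1
toy_means <- aggregate(SALBRUTO ~ SEXO + CNO1, data = toy, FUN = mean)
toy_sds <- aggregate(SALBRUTO ~ SEXO + CNO1, data = toy, FUN = sd)
names(toy_means)[3] <- "mean"
names(toy_sds)[3] <- "sd"

# Join on the group columns, so the rows are matched by key rather than by position
toy_summary <- merge(toy_means, toy_sds, by = c("SEXO", "CNO1"))
toy_summary
```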

## Facets

Facets allow you to create multiple plots, one per level of a factor.

In [None]:
means <- aggregate((salary$SALBRUTO), by=list(salary$SEXO, salary$CNO1), mean)
sds <- aggregate((salary$SALBRUTO), by=list(salary$SEXO, salary$CNO1), sd)

sex_positions <- data.frame(SEXO=means$`Group.1`, Position=means$`Group.2`,   mean=means$x, sd=sds$x)


options(repr.plot.width = 20, repr.plot.height = 20, repr.plot.res = 100)


# Let's compare wages between sexes depending on position
#ggplot(sex_positions, aes(x=Position, y=mean)) + geom_point(aes(color=SEXO), size=5)
#ggplot(sex_positions, aes(x=Position, y=mean)) + geom_point(aes(x=SEXO, color=SEXO), size=5) + facet_wrap(Position~.)
# x is mapped to SEXO once here, so the x-axis label matches what is plotted
ggplot(sex_positions, aes(x=SEXO, y=mean)) + 
    geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=0) + 
    geom_point(aes(color=SEXO), size=5) +     
    facet_wrap(~Position, scales = "free_y")

## grid.arrange and cheating with data

In previous classes we saw that you can use the par() function to arrange several different plots on the same page. With ggplot you need to use grid.arrange(), from the gridExtra package.

In [None]:
library(ggplot2)
library(gridExtra)

# default plot size
options(repr.plot.width = 20, repr.plot.height = 10, repr.plot.res = 100)

# default theme and font size
theme_set(theme_bw(base_size = 24))


# Let's compare wages between sexes depending on studies (with the LOG salary)
# (results are not named `mean`/`sd`, to avoid shadowing the base functions passed to aggregate)
means_log <- aggregate(log(salary$SALBRUTO), by=list(salary$SEXO, salary$ESTU), mean)
sds_log <- aggregate(log(salary$SALBRUTO), by=list(salary$SEXO, salary$ESTU), sd)
sex_estu_log <- data.frame(SEXO=means_log$`Group.1`, ESTU=means_log$`Group.2`, mean=means_log$x, sd=sds_log$x)
head(sex_estu_log)

# Let's compare wages between sexes depending on studies (with the NATURAL salary)
means_nat <- aggregate(salary$SALBRUTO, by=list(salary$SEXO, salary$ESTU), mean)
sds_nat <- aggregate(salary$SALBRUTO, by=list(salary$SEXO, salary$ESTU), sd)
sex_estu_nat <- data.frame(SEXO=means_nat$`Group.1`, ESTU=means_nat$`Group.2`, mean=means_nat$x, sd=sds_nat$x)
head(sex_estu_nat)


# geom_pointrange allows you to draw points and range bars at the same time (for more shapes: http://www.sthda.com/english/wiki/ggplot2-point-shapes)
p1 <- ggplot(sex_estu_log) + geom_pointrange(aes(x=ESTU, y=mean, ymin=mean-sd, ymax=mean+sd, fill=SEXO), size=1.5, shape=22, color="black", position = position_dodge(width=0.3))
p2 <- ggplot(sex_estu_nat) + geom_pointrange(aes(x=ESTU, y=mean, ymin=mean-sd, ymax=mean+sd, fill=SEXO), size=1.5, shape=24, color="black", position = position_dodge(width=0.3))

# plot the two graphs on the same page, next to each other
grid.arrange(p1, p2, nrow=1)



It looks like we get two different results depending on whether the data have been log-transformed or not. In one case, studying decreases the gender pay gap, and in the other it makes it worse! What do you think is happening? Which of the two plots is more correct? Are they both wrong? *Hint: try plotting the histogram of each data distribution (as in the first figures) against its own mean (as a vertical line). What do you observe?*
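
One possible way to start on the hint is sketched below. It uses simulated right-skewed data rather than the survey, so the `toy` table and its parameters are assumptions made only for illustration. The sketch draws one histogram of the raw values and one of the log values, each with its own mean as a dashed vertical line.

```r
library(ggplot2)

# Simulated right-skewed "wages" (log-normal); the parameters are arbitrary
set.seed(123)
toy <- data.frame(SALBRUTO = rlnorm(5000, meanlog = 10, sdlog = 0.8))

mean_nat <- mean(toy$SALBRUTO)       # mean on the natural scale
median_nat <- median(toy$SALBRUTO)   # median on the natural scale
mean_log <- mean(log(toy$SALBRUTO))  # mean on the log scale

# Histogram of the raw values with their mean as a dashed vertical line
p_nat <- ggplot(toy, aes(x = SALBRUTO)) +
    geom_histogram(bins = 60, fill = "darkorange", color = "black") +
    geom_vline(xintercept = mean_nat, linetype = "dashed")

# Histogram of the log values with their mean as a dashed vertical line
p_log <- ggplot(toy, aes(x = log(SALBRUTO))) +
    geom_histogram(bins = 60, fill = "darkorange", color = "black") +
    geom_vline(xintercept = mean_log, linetype = "dashed")

p_nat
p_log

# Compare the summaries: back-transforming the log mean does not give the raw mean
c(mean = mean_nat, median = median_nat, exp_mean_log = exp(mean_log))
```

To repeat this on the real data, replace `toy` with `salary` (or with a subset for one `SEXO` and `ESTU` group) and compare where each line falls relative to the bulk of its histogram.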