Xavier Perez Gonzalez - ANR 386264
===============

Analysis of the Salary Gap across Regions in Spain
===============

## Final Assignment  - Applied Economics 1

### Question: 
Are there regional differences in salaries in Spain?


#### Motivation:
Economic differences across Spanish regions are wide, and salaries in particular vary considerably. In industrial regions such as the Basque Country or Catalonia, salaries are close to the EU average, while in less developed regions such as Andalusia or Extremadura they are lower. Tourist regions such as the Canary Islands or the Balearic Islands have specific labour-market characteristics that make them difficult to classify. In this context, the effect of education on salary is an interesting topic to discuss.

#### Data:
The data for this study come from the four-yearly [Wage Structure Survey of 2010](http://www.ine.es/dyngs/INEbase/en/operacion.htm?c=Estadistica_C&cid=1254736177025&menu=resultados&secc=1254736195110&idp=1254735976596). The large sample supports the quality of the data: it includes 28,500 establishments and 220,000 workers. However, the data on workers in the Autonomous Cities of Ceuta and Melilla may not be statistically reliable, as both cities have a population of approximately 85,000.
The INE (Spanish Statistical Office) provides a detailed microdata file with the characteristics of the worker, the establishment, and the worker's contract. The microdata are somewhat capped to meet confidentiality requirements. Another limitation is the absence of data on agricultural employment.


#### Method:
To estimate the contribution of an extra year of education to salary, two equations are specified based on [Pastor et al. (2007)](http://www.ivie.es/es/productos/85-el-rendimiento-del-capital-humano-en-espana.php). The first is a basic wage equation without regional variables. It contains only years of education, years of experience, years of experience squared (to capture the concave profile of experience), and gender.

(1) $$\ln(Salary/hour) = \beta_1 educyears + \beta_2 expyears + \beta_3 expyears^2 + \beta_4 woman + \epsilon_i$$

Where $educyears$ is the worker's number of years of education, $expyears$ is the worker's number of years of experience, $expyears^2$ is the number of years of experience squared, $woman$ is a dichotomous gender variable, and $\epsilon_i$ is the random error term.
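
The quadratic experience term implies a return to experience that changes over the career. As a short worked step from equation (1), and assuming $\beta_3 < 0$ (the concave case the specification is meant to capture; the sign is an assumption here, not a reported estimate):

$$\frac{\partial \ln(Salary/hour)}{\partial expyears} = \beta_2 + 2 \beta_3 expyears, \qquad expyears^{*} = -\frac{\beta_2}{2 \beta_3}$$

Here $expyears^{*}$ is the level of experience at which the predicted log wage peaks, holding the other variables fixed.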

(2) $$\ln(Salary/hour) = \beta_1 educyears + \beta_2 expyears + \beta_3 expyears^2 + \beta_4 woman + \beta_5 fulltime + \beta_6 permanentcontract + \beta_7 Energysector + \beta_8 Manufacturing + \beta_9 Office + \beta_{10} Retailing + \beta_{11} Transport + \beta_{12} Hospitality + \beta_{13} Information + \beta_{14} Finance + \beta_{15} Scientific + \beta_{16} Outofmarketservices + \beta_{17} Arts + \beta_{18} Otherservices + \epsilon_i$$

Where $fulltime$ is a dichotomous variable for the type of working day and $permanentcontract$ is a dichotomous variable for the duration of the contract. The remaining variables indicate the sector of the worker's establishment. The construction sector is omitted as the reference category.
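
Omitting one sector as the reference category can be illustrated with a small sketch. It is not the preprocessing used for this dataset: the `sector` column, its labels, and the toy rows are hypothetical, and `pd.get_dummies` is just one way to build the dummies.

```python
import pandas as pd

# Hypothetical toy data: the 'sector' column and its labels are illustrative only.
workers = pd.DataFrame({
    "sector": ["construction", "manufacturing", "finance", "construction", "transport"],
})

# One 0/1 column per sector (columns come out in alphabetical order).
sector_dummies = pd.get_dummies(workers["sector"], dtype=int)

# Drop the reference category so the remaining coefficients are measured
# relative to construction and the dummies are not collinear with the
# constant term of the regression.
sector_dummies = sector_dummies.drop(columns="construction")

design = pd.concat([workers, sector_dummies], axis=1)
print(design)
```

A construction worker then has zeros in every sector column, so each sector coefficient reads as a log-wage difference with respect to construction.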



## Importing Libraries

In [9]:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as sm

# Load the survey microdata, using the first column as the index
information = pd.read_csv('/Users/Xavi/Documents/Xavi/Applied Economics/Final Assignment/xavi_data.csv', index_col=0)
information.head()

| ORDENCCC | ORDENTRA | REG | SECC | ESTRATO | CONTROL | MERCADO | REGULACION | SEXO | ANONAC | NACI | ... | madrid | murcia | navarra | pvasco | rioja | exp2 | expe2 | secenergetico | inmofin | serviciosfueradelmercado |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1000027.0 | 1 | 16 | B0 | 1 | 2 | 1 | 2 | 1 | 1952 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 2116 | 2116 | 1 | 0 | 0 |
| 1000027.0 | 2 | 16 | B0 | 1 | 2 | 1 | 2 | 1 | 1957 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1681 | 1681 | 1 | 0 | 0 |
| 1000028.0 | 1 | 16 | C0 | 2 | 2 | 3 | 3 | 1 | 1950 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1369 | 1369 | 0 | 0 | 0 |
| 1000028.0 | 2 | 16 | C0 | 2 | 2 | 3 | 3 | 6 | 1956 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1296 | 1296 | 0 | 0 | 0 |
| 1000028.0 | 3 | 16 | C0 | 2 | 2 | 3 | 3 | 1 | 1950 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 2304 | 2304 | 0 | 0 | 0 |



Table 1 shows summary statistics for the variables.



#### Table 1

In [11]:
information.describe().transpose()


| Variable | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ORDENTRA | 174350.0 | 8.166183 | 6.816664 | 1.0000 | 3.000000 | 6.000000 | 13.000000 | 50.0000 |
| REG | 174350.0 | 9.113679 | 4.523674 | 1.0000 | 6.000000 | 9.000000 | 13.000000 | 18.0000 |
| ESTRATO | 174350.0 | 3.580614 | 1.530800 | 1.0000 | 2.000000 | 4.000000 | 5.000000 | 5.0000 |
| CONTROL | 174350.0 | 1.864302 | 0.342469 | 1.0000 | 2.000000 | 2.000000 | 2.000000 | 2.0000 |
| MERCADO | 174350.0 | 1.882392 | 0.934500 | 1.0000 | 1.000000 | 2.000000 | 2.000000 | 4.0000 |
| REGULACION | 174350.0 | 2.026355 | 0.787836 | 1.0000 | 1.000000 | 2.000000 | 3.000000 | 3.0000 |
| SEXO | 174350.0 | 3.119644 | 2.470904 | 1.0000 | 1.000000 | 1.000000 | 6.000000 | 6.0000 |
| ANONAC | 174350.0 | 1969.687927 | 10.533206 | 1930.0000 | 1962.000000 | 1971.000000 | 1978.000000 | 1994.0000 |
| NACI | 174350.0 | 1.150622 | 0.647556 | 1.0000 | 1.000000 | 1.000000 | 1.000000 | 5.0000 |
| CNO2 | 174350.0 | 53.520786 | 24.705854 | 11.0000 | 34.000000 | 45.000000 | 75.000000 | 98.0000 |



Table 2 shows the results of the first regression, which adds regional dummy variables to the variables of equation (1).

#### Table 2


In [16]:
result = sm.ols('log_salariohora ~ escolaridad + expe + exp2 + mujer + cat + pvasco + andalucia + extremadura + canarias + balears', data=information).fit()
print(result.summary())




                            OLS Regression Results                            
Dep. Variable:        log_salariohora   R-squared:                       0.287
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     7027.
Date:                Wed, 18 Jan 2017   Prob (F-statistic):               0.00
Time:                        21:35:21   Log-Likelihood:                -98835.
No. Observations:              174350   AIC:                         1.977e+05
Df Residuals:                  174339   BIC:                         1.978e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       1.1692      0.006    190.843      


Table 3 shows the results of the second regression, which also includes the contract, working-day, and sector controls:

#### Table 3



In [19]:
result = sm.ols('log_salariohora ~ escolaridad + expe + exp2 + mujer + indefinido + extractivas + manufacturas + energia + agua + comercio + transporte + hosteleria + informacion + finanzas + tcompleto + ciencias + educacion + sanidad + artisticas + otros_servicios + cat + pvasco + andalucia + extremadura + canarias + balears', data=information).fit()
print(result.summary())



                            OLS Regression Results                            
Dep. Variable:        log_salariohora   R-squared:                       0.335
Model:                            OLS   Adj. R-squared:                  0.334
Method:                 Least Squares   F-statistic:                     3371.
Date:                Wed, 18 Jan 2017   Prob (F-statistic):               0.00
Time:                        21:44:41   Log-Likelihood:                -92852.
No. Observations:              174350   AIC:                         1.858e+05
Df Residuals:                  174323   BIC:                         1.860e+05
Df Model:                          26                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept           1.1105      0.007    1
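
One way to address the research question directly is a joint F-test of the hypothesis that all regional coefficients are zero. The sketch below is not the computation run in this notebook: it uses simulated data with hypothetical variable names and effect sizes, and compares the residual sum of squares of a model without and with region dummies using plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (hypothetical) data: education plus two region dummies.
educ = rng.integers(6, 20, size=n).astype(float)
region = rng.integers(0, 3, size=n)            # 0 = reference region
reg_a = (region == 1).astype(float)
reg_b = (region == 2).astype(float)
log_wage = 1.0 + 0.06 * educ + 0.10 * reg_a - 0.08 * reg_b + rng.normal(0, 0.4, n)


def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


const = np.ones(n)
X_restricted = np.column_stack([const, educ])            # no region terms
X_full = np.column_stack([const, educ, reg_a, reg_b])    # with region terms

q = X_full.shape[1] - X_restricted.shape[1]    # number of restrictions tested
df_resid = n - X_full.shape[1]                 # residual degrees of freedom
rss_r, rss_u = rss(X_restricted, log_wage), rss(X_full, log_wage)

# F = ((RSS_r - RSS_u) / q) / (RSS_u / df_resid)
f_stat = ((rss_r - rss_u) / q) / (rss_u / df_resid)
print("F statistic for joint regional effects:", round(f_stat, 2))
```

A large F statistic relative to the critical value of the F(q, df_resid) distribution rejects the hypothesis of no regional differences. A fitted statsmodels OLS result also exposes an `f_test` method that should allow the same kind of joint restriction to be tested on the regional dummies used here.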

## Conclusion

As regressions 1 and 2 show, the latter with more control variables, the more industrial regions ('cat' for Catalonia and 'pvasco' for the Basque Country) have higher salaries than the rest of Spain. Among the tourist regions, only 'balears' (Balearic Islands) has higher salaries, while the Canary Islands ('canarias') do not. Finally, the less developed regions ('andalucia' for Andalusia and 'extremadura' for Extremadura) have lower salaries, as expected.
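
Because the dependent variable is in logs, the coefficient of a regional dummy is only approximately a percentage difference in salary. A minimal sketch of the exact conversion, $100\,(e^{\beta} - 1)$, follows; the function name and the coefficient values are hypothetical placeholders, not estimates from the regressions above.

```python
import math


def log_points_to_percent(coef):
    """Exact percentage wage difference implied by a dummy coefficient
    in a log-wage regression: 100 * (exp(coef) - 1)."""
    return 100.0 * math.expm1(coef)


# Hypothetical coefficients, for illustration only.
example_coefs = {"region_high": 0.10, "region_low": -0.10}

for name, coef in example_coefs.items():
    print(name, round(log_points_to_percent(coef), 2))
```

The conversion is asymmetric: a coefficient of 0.10 corresponds to a difference slightly above 10%, while a coefficient of -0.10 corresponds to a difference slightly below 10% in absolute value.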

