# Project name: Civil Servants Remuneration in the EU
## Preliminary analysis
### Part 2: EU Civil Servant average salary
#### Step 1: Data wrangling

LAST UPDATE: 22.08.22 17:37:26
EXTRACTION DATE: 29.08.22 11:31:53
SOURCE OF DATA: Eurostat

### Column names and the values they hold
TIME: Year

GEO: EU country name

LCSTRUCT: Net remuneration in nominal terms / real terms

P_ADJ: Nominal value

UNIT: Euro/Purchasing power standard (PPS)

Value: Value of the income

## Data Gathering
The cell below loads only the data on the average remuneration of national civil servants in the EU member states. The remaining files are wrangled in separate notebooks.

In [116]:
# Import relevant libraries
import pandas as pd

In [117]:
# Load the data
df_salary = pd.read_csv('/Volumes/GoogleDrive-114951830941804947409/My Drive/Data analyst/Projects/Civil_Servant_Salary_EU/original_data/2. Average remuneration of national civil servants in central public administration [prc_rem_avg].csv')

In [118]:
# take a first look at the data
df_salary.head()

Unnamed: 0,TIME,GEO,LCSTRUCT,P_ADJ,UNIT,Value
0,2014S2,Belgium,Net remuneration in nominal terms / real terms,Nominal value,Euro,2 647.7
1,2014S2,Belgium,Net remuneration in nominal terms / real terms,Nominal value,Purchasing power standard (PPS),2 647.7
2,2014S2,Bulgaria,Net remuneration in nominal terms / real terms,Nominal value,Euro,547.40
3,2014S2,Bulgaria,Net remuneration in nominal terms / real terms,Nominal value,Purchasing power standard (PPS),993.20
4,2014S2,Czechia,Net remuneration in nominal terms / real terms,Nominal value,Euro,948.50


In [119]:
# make column names lower case
df_salary.columns = df_salary.columns.str.lower()

In [120]:
# simplify the dataframe
df_salary['time'] = df_salary['time'].str.replace("S2", "", regex=False) # remove the S2 suffix
df_salary.drop(df_salary[df_salary['unit'] == "Purchasing power standard (PPS)"].index, axis=0, inplace=True) # drop the PPS rows
df_salary.drop(['lcstruct', 'p_adj', 'unit'], axis=1, inplace=True) # drop the columns with redundant info
df_salary.reset_index(drop=True) # display the frame with a fresh index (the result is not assigned, so df_salary keeps its original index)

Unnamed: 0,time,geo,value
0,2014,Belgium,2 647.7
1,2014,Bulgaria,547.40
2,2014,Czechia,948.50
3,2014,Denmark,2 515.6
4,2014,Germany,3 654.3
...,...,...,...
219,2021,Slovenia,1 726.0
220,2021,Slovakia,1 850.0
221,2021,Finland,3 665.0
222,2021,Sweden,3 080.0
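
The simplification step above can also be written without `inplace=True`, as one chained expression. The sketch below is only an illustration on a toy frame built from the rows shown in the `head()` output; the names `toy`, `euro_only` and `simplified` are hypothetical and not part of the notebook. It also checks that the dropped columns really hold a single value once the PPS rows are gone.

```python
import pandas as pd

# Toy frame: the Belgium and Bulgaria rows from the head() output above
toy = pd.DataFrame({
    "time": ["2014S2"] * 4,
    "geo": ["Belgium", "Belgium", "Bulgaria", "Bulgaria"],
    "lcstruct": ["Net remuneration in nominal terms / real terms"] * 4,
    "p_adj": ["Nominal value"] * 4,
    "unit": ["Euro", "Purchasing power standard (PPS)"] * 2,
    "value": ["2 647.7", "2 647.7", "547.40", "993.20"],
})

# Keep the Euro rows only
euro_only = toy[toy["unit"] == "Euro"]

# After filtering, each of these columns holds one distinct value, so dropping them loses nothing
redundant = ["lcstruct", "p_adj", "unit"]
assert (euro_only[redundant].nunique() == 1).all()

simplified = (
    euro_only
    .assign(time=lambda d: d["time"].str.replace("S2", "", regex=False))  # "2014S2" -> "2014"
    .drop(columns=redundant)
    .reset_index(drop=True)  # assigned here, so the fresh index is kept
)
print(simplified)
```

Because the result of the chain is assigned, the reset index persists, which is the main practical difference from calling `reset_index` without assignment.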


In [121]:
# check the overall data characteristics
df_salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 446
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   time    224 non-null    object
 1   geo     224 non-null    object
 2   value   223 non-null    object
dtypes: object(3)
memory usage: 5.4+ KB


In [122]:
# convert the value column to float64
df_salary['value'] = df_salary['value'].str.replace(" ", "", regex=False) # strip the space used as thousands separator, in this column only
df_salary['value'] = df_salary['value'].astype('float64')
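
As a small illustration of the conversion, the sketch below parses strings that use a space as the thousands separator, using two values from the output above plus one missing entry (the `info()` output reports one null in `value`). The names `raw` and `parsed` are hypothetical, and `pd.to_numeric` is used here as one possible alternative to `astype`, not as the notebook's method.

```python
import pandas as pd

# Two values as they appear in the raw data, plus one missing entry
raw = pd.Series(["2 647.7", "547.40", None], name="value", dtype=object)

# Remove the separator, then convert; missing entries become NaN
parsed = pd.to_numeric(raw.str.replace(" ", "", regex=False), errors="coerce")
print(parsed)
```

Limiting the string replacement to the one column leaves spaces in other text columns, such as country names, untouched.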

In [123]:
# replace Germany's long name with the short one
index_ger = df_salary[df_salary['geo'].str.contains("Germany")].index # index of the rows whose name contains Germany
df_salary.loc[index_ger, "geo"] = 'Germany' # assign the short name

In [124]:
52.1429/12 * 12

52.1429

In [125]:
# convert the monthly average to a yearly value
## since the value is a monthly average, we adjust it to reflect the values in the income dataset
df_salary.value = (df_salary.value / (52.1429/12)) * 52.1429 # monthly -> weekly (divide by 52.1429/12 weeks per month), then weekly -> yearly (multiply by 52.1429 weeks); equivalent to multiplying by 12
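
The arithmetic in this cell can be checked in isolation. The sketch below is a minimal illustration: the function name `monthly_to_yearly` and the constant names are hypothetical, and the sample value is Belgium's 2014 figure from the output above. It shows that dividing by the number of weeks per month and then multiplying by the number of weeks per year amounts to multiplying by 12.

```python
import math

WEEKS_PER_YEAR = 52.1429
MONTHS_PER_YEAR = 12

def monthly_to_yearly(monthly):
    """Convert a monthly amount to a yearly one via a weekly amount."""
    weeks_per_month = WEEKS_PER_YEAR / MONTHS_PER_YEAR
    weekly = monthly / weeks_per_month
    return weekly * WEEKS_PER_YEAR

monthly = 2647.7  # Belgium, 2014, from the output above
yearly = monthly_to_yearly(monthly)

# The weeks cancel out, so the result matches monthly * 12 up to floating-point rounding
assert math.isclose(yearly, monthly * MONTHS_PER_YEAR)
print(yearly)
```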

In [126]:
# rename the value column to salary
df_salary.rename(columns={'value': 'salary'}, inplace=True)

In [127]:
# store the new data
df_salary.to_csv('/Volumes/GoogleDrive-114951830941804947409/My Drive/Data analyst/Projects/Civil_Servant_Salary_EU/modified_data/2. EU Civil Servant average salary.csv', index=False)