# Project name: Civil Servants Remuneration in the EU
## Part 3: Number of civil servants employed in the EU
### Step 1: Data wrangling

LAST UPDATE: 22.08.22 17:37:26
EXTRACTION DATE: 29.08.22 11:31:53
SOURCE OF DATA: Eurostat

### Column names and descriptions of the values they hold
TIME: Year

GEO: Country name

STATINFO: Data coverage (the only value is Total)

UNIT: Unit of the value (the only value is Number)

Value: Total number of civil servants employed

# Data Gathering

The cells below walk through wrangling the data on the total number of civil servants in EU countries.

In [1]:
# import relevant libraries
import pandas as pd

In [2]:
# load the data
df_number = pd.read_csv('/Volumes/GoogleDrive-114951830941804947409/My Drive/Data analyst/Projects/Civil_Servant_Salary_EU/original_data/3. National civil servants in central public administration [prc_rem_nr].csv')
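
As a possible alternative to cleaning the `Value` column by hand later on, `pd.read_csv` can parse thousands separators and missing-value markers at load time. The sketch below is only an illustration under stated assumptions: it reads a small in-memory sample (the rows and the name `sample_csv` are made up for the example, not taken from the real file) and assumes that `,` is the thousands separator and `:` marks a missing value, as the output of the unique-values check suggests.

```python
import io

import pandas as pd

# Illustrative in-memory sample shaped like the Eurostat export (not the real file).
sample_csv = io.StringIO(
    'TIME,GEO,STATINFO,UNIT,Value\n'
    '2012S2,Country A,Total,Number,"34,697"\n'
    '2012S2,Country B,Total,Number,"7,832"\n'
    '2012S2,Country C,Total,Number,:\n'
)

# thousands=',' strips the separator while parsing numbers;
# na_values=':' treats the colon as a missing value (NaN).
df_sample = pd.read_csv(sample_csv, thousands=',', na_values=':')

print(df_sample['Value'])
```

With this approach the column is numeric straight after loading, so the later string replacements would not be needed.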

In [3]:
# view data
df_number.head()

Unnamed: 0,TIME,GEO,STATINFO,UNIT,Value
0,2012S2,Belgium,Total,Number,34697
1,2012S2,Bulgaria,Total,Number,45114
2,2012S2,Czechia,Total,Number,19163
3,2012S2,Denmark,Total,Number,7832
4,2012S2,Germany (until 1990 former territory of the FRG),Total,Number,13280


In [4]:
# check the datatypes
df_number.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   TIME      280 non-null    object
 1   GEO       280 non-null    object
 2   STATINFO  280 non-null    object
 3   UNIT      280 non-null    object
 4   Value     280 non-null    object
dtypes: object(5)
memory usage: 11.1+ KB


In [5]:
# view unique values in each column
def uniq_val(column_name):
    """Print the unique values of one column of df_number."""
    print(df_number[column_name].unique())

for column in df_number.columns:
    print(column + ':')
    uniq_val(column)
    print('\n')

TIME:
['2012S2' '2013S2' '2014S2' '2015S2' '2016S2' '2017S2' '2018S2' '2019S2'
 '2020S2' '2021S2']


GEO:
['Belgium' 'Bulgaria' 'Czechia' 'Denmark'
 'Germany (until 1990 former territory of the FRG)' 'Estonia' 'Ireland'
 'Greece' 'Spain' 'France' 'Croatia' 'Italy' 'Cyprus' 'Latvia' 'Lithuania'
 'Luxembourg' 'Hungary' 'Malta' 'Netherlands' 'Austria' 'Poland'
 'Portugal' 'Romania' 'Slovenia' 'Slovakia' 'Finland' 'Sweden'
 'United Kingdom']


STATINFO:
['Total']


UNIT:
['Number']


Value:
['34,697' '45,114' '19,163' '7,832' '13,280' '15,200' '24,077' '69,993'
 '34,897' '77,774' ':' '166,420' '13,525' '2,973' '16,368' '946' '11,694'
 '23,865' '110,887' '48,734' '128,596' '118,552' '12,221' '21,149'
 '33,778' '24,974' '60,224' '350,990' '28,681' '36,788' '18,806' '12,980'
 '10,081' '24,286' '58,300' '42,382' '73,837' '159,674' '9,590' '2,904'
 '16,951' '984' '12,403' '24,511' '109,098' '19,996' '127,622' '92,391'
 '13,191' '17,872' '62,465' '337,570' '31,297' '34,718' '17,691' '12,376'
 '9

In [6]:
# drop the redundant columns (each holds a single constant value)
df_number.drop(columns=['STATINFO', 'UNIT'], inplace=True)

In [7]:
# remove S2 from the year values in TIME column
df_number['TIME'] = df_number['TIME'].str.replace('S2', '')
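
After this step the `TIME` values are still strings such as `'2012'`. If integer years are preferred, one option is to extract the four digits and cast them. This is a minimal sketch on a made-up sample series (`time_sample` is a hypothetical name), assuming every value starts with a four-digit year.

```python
import pandas as pd

# Hypothetical sample in the same format as the TIME column.
time_sample = pd.Series(['2012S2', '2013S2', '2021S2'])

# Capture the four-digit year and convert it to an integer.
years = time_sample.str.extract(r'(\d{4})', expand=False).astype(int)

print(years.tolist())
```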

In [8]:
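# convert Value column to numeric (float64, because missing values become NaN)
df_number['Value'] = df_number['Value'].str.replace(',', '', regex=False) # remove the thousands separators
df_number['Value'] = df_number['Value'].str.replace(':', '', regex=False) # ':' marks a missing value; leave it empty
df_number['Value'] = pd.to_numeric(df_number['Value']) # empty strings are parsed as NaN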
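
Because the column contains missing values, `pd.to_numeric` returns floats rather than integers. If whole-number counts are wanted, pandas' nullable `Int64` dtype is one option. The sketch below uses a small made-up series (`value_sample` is a hypothetical name) and assumes `:` is the only non-numeric marker. Note that `errors='coerce'` would also silently turn any other malformed value into a missing one.

```python
import pandas as pd

# Hypothetical sample in the same format as the raw Value column.
value_sample = pd.Series(['34,697', ':', '7,832'])

# Remove thousands separators, coerce anything non-numeric (here ':') to NaN,
# then cast to the nullable integer dtype so missing values are kept as <NA>.
value_int = pd.to_numeric(
    value_sample.str.replace(',', '', regex=False),
    errors='coerce',
).astype('Int64')

print(value_int)
```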

In [9]:
# change Germany's long name to the short one
index_ger = df_number[df_number['GEO'].str.contains("Germany")].index # index of the rows whose GEO contains "Germany"
df_number.loc[index_ger, "GEO"] = 'Germany' # assign the short name
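
The same rename could also be written as an exact-match replacement with a mapping, which avoids touching any other label that happens to contain "Germany". This is a sketch on a made-up sample series (`geo_sample` and `short_names` are hypothetical names), not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample in the same format as the GEO column.
geo_sample = pd.Series([
    'Belgium',
    'Germany (until 1990 former territory of the FRG)',
    'Estonia',
])

# Map the exact long label to the short one; other values are left unchanged.
short_names = {'Germany (until 1990 former territory of the FRG)': 'Germany'}
geo_clean = geo_sample.replace(short_names)

print(geo_clean.tolist())
```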

In [10]:
# check again the unique values in each column
for column in df_number.columns:
    print(column + ':')
    uniq_val(column)
    print('\n')

TIME:
['2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019' '2020' '2021']


GEO:
['Belgium' 'Bulgaria' 'Czechia' 'Denmark' 'Germany' 'Estonia' 'Ireland'
 'Greece' 'Spain' 'France' 'Croatia' 'Italy' 'Cyprus' 'Latvia' 'Lithuania'
 'Luxembourg' 'Hungary' 'Malta' 'Netherlands' 'Austria' 'Poland'
 'Portugal' 'Romania' 'Slovenia' 'Slovakia' 'Finland' 'Sweden'
 'United Kingdom']


Value:
[ 34697.  45114.  19163.   7832.  13280.  15200.  24077.  69993.  34897.
  77774.     nan 166420.  13525.   2973.  16368.    946.  11694.  23865.
 110887.  48734. 128596. 118552.  12221.  21149.  33778.  24974.  60224.
 350990.  28681.  36788.  18806.  12980.  10081.  24286.  58300.  42382.
  73837. 159674.   9590.   2904.  16951.    984.  12403.  24511. 109098.
  19996. 127622.  92391.  13191.  17872.  62465. 337570.  31297.  34718.
  17691.  12376.   9360.  24420.  66857.  41590.  78182.  11537. 155482.
   8992.   3041.  16627.    986.  14287.  25273. 108732.  20566.  22455.
  89152.  36130.  62215. 310

In [11]:
# store new data
df_number.to_csv('/Volumes/GoogleDrive-114951830941804947409/My Drive/Data analyst/Projects/Civil_Servant_Salary_EU/modified_data/3. EU_Number of civil servants.csv', index=False)