# Data translation

In September 2020, the YouTube channel [SocialNerds](https://www.youtube.com/channel/UCd5jW000te6bExqYth4TIxQ) released an anonymized [dataset](https://docs.google.com/spreadsheets/d/1TVL6IfF9yaEKa3S6ma69pn-6o2YFxzUgEMTdiec8BpU/edit#gid=613445015) of nearly 600 entries describing the salary levels of software engineers. The data was collected online during the summer of 2020 through a Google Forms questionnaire and discussed in a [video](https://www.youtube.com/watch?v=e-83bz4RhQ4). The participants are Greek software engineers working for companies based mostly in Greece, with some based abroad.

The dataset is in Greek, so it had to be translated into English. This notebook documents the translation process.

## Load data

In [1]:
# install google translator
%pip install googletrans

# import libraries
import os
import pandas as pd
from googletrans import Translator

Collecting googletrans
  Downloading googletrans-3.0.0.tar.gz (17 kB)
Collecting httpx==0.13.3
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting hstspreload
  Downloading hstspreload-2020.10.20-py3-none-any.whl (972 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting httpcore==0.9.*
  Downloading httpcore-0.9.1-py3-none-any.whl (42 kB)
Collecting rfc3986<2,>=1.3
  Downloading rfc3986-1.4.0-py2.py3-none-any.whl (31 kB)
Collecting contextvars>=2.1; python_version < "3.7"
  Downloading contextvars-2.4.tar.gz (9.6 kB)
Collecting h11<0.10,>=0.8
  Downloading h11-0.9.0-py2.py3-none-any.whl (53 kB)
Collecting h2==3.*
  Downloading h2-3.2.0-py2.py3-

In [2]:
input_file_path = 'data/data_original.csv'
data_original = pd.read_csv(input_file_path)

# show the first row of the data
data_original.head(1)

Unnamed: 0,Timestamp,Πόσα χρόνια δουλεύεις επαγγελματικά ως προγραμματιστής;,Με τι είδος development ασχολείσαι επαγγελματικά αυτή την περίοδο;,Σε ποιες γλώσσες προγραμματισμού δουλεύεις επαγγελματικά αυτή την περίοδο;,Τι μέγεθος είναι η εταιρεία που δουλεύεις;,Ποιος είναι ο τρόπος εργασίας;,Έχεις άτομα υπό την επίβλεψη σου;,Έχεις προσωπικά projects ή κάνεις freelancing πέρα από την κύρια εργασία σου;,Σε ποια πόλη μένεις;,Σε ποια πόλη δουλεύεις;,Φύλλο;,Ποιος είναι ο ετήσιος καθαρός μισθός σου;
0,7/15/2020 12:03:11,4-5,"DevOps, Backend, Frontend","C#, JavaScript",11-50,Και τα δύο,Όχι,Ναι,Αθήνα,Αθήνα,Άντρας,18200


## Translate column labels

The first major step is to check and translate the column labels of the DataFrame.

In [3]:
data_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 585 entries, 0 to 584
Data columns (total 12 columns):
 #   Column                                                                         Non-Null Count  Dtype 
---  ------                                                                         --------------  ----- 
 0   Timestamp                                                                      585 non-null    object
 1   Πόσα χρόνια δουλεύεις επαγγελματικά ως προγραμματιστής;                        585 non-null    object
 2   Με τι είδος development ασχολείσαι επαγγελματικά αυτή την περίοδο;             585 non-null    object
 3   Σε ποιες γλώσσες προγραμματισμού δουλεύεις  επαγγελματικά αυτή την περίοδο;    585 non-null    object
 4   Τι μέγεθος είναι η εταιρεία που δουλεύεις;                                     585 non-null    object
 5   Ποιος είναι ο τρόπος εργασίας;                                                 585 non-null    object
 6   Έχεις άτομα υπό την επίβλεψη σου; 

In [4]:
# create a translator
translator = Translator()

In [5]:
translations = translator.translate(data_original.columns.tolist(), src='el', dest='en')

for index, translation in enumerate(translations):
    print("Index:      ", index)
    print("Original:   ", translation.origin)
    print("Translated: ", translation.text, "\n")

Index:       0
Original:    Timestamp
Translated:  Timestamp 

Index:       1
Original:    Πόσα χρόνια δουλεύεις επαγγελματικά ως προγραμματιστής;
Translated:  How many years have you worked professionally as a programmer? 

Index:       2
Original:    Με τι είδος development ασχολείσαι επαγγελματικά αυτή την περίοδο;
Translated:  What kind of development do you do professionally at the moment? 

Index:       3
Original:    Σε ποιες γλώσσες προγραμματισμού δουλεύεις  επαγγελματικά αυτή την περίοδο;
Translated:  What programming languages ​​are you currently working on professionally? 

Index:       4
Original:    Τι μέγεθος είναι η εταιρεία που δουλεύεις;
Translated:  What size is the company you work for? 

Index:       5
Original:    Ποιος είναι ο τρόπος εργασίας;
Translated:  How does it work? 

Index:       6
Original:    Έχεις άτομα υπό την επίβλεψη σου;
Translated:  Do you have people under your supervision? 

Index:       7
Original:    Έχεις προσωπικά projects ή κάνεις freelanc

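The machine translations were reviewed, and two of them (the labels at index 5 and index 10) were corrected by hand. Label 5, for example, had been rendered as "How does it work?".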

In [6]:
translations[5].text = 'Do you work on-premises or remotely?'
translations[10].text = 'Sex'

data_original.columns = [translation.text for translation in translations]
data_original.head(1)

Unnamed: 0,Timestamp,How many years have you worked professionally as a programmer?,What kind of development do you do professionally at the moment?,What programming languages ​​are you currently working on professionally?,What size is the company you work for?,Do you work on-premises or remotely?,Do you have people under your supervision?,Do you have personal projects or do you do freelancing beyond your main job?,In which city do you live;,In what city do you work?,Sex,What is your annual net salary?
0,7/15/2020 12:03:11,4-5,"DevOps, Backend, Frontend","C#, JavaScript",11-50,Και τα δύο,Όχι,Ναι,Αθήνα,Αθήνα,Άντρας,18200
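The manual review step can also be expressed without mutating the translation objects. The following is a minimal sketch, assuming the machine output has already been reduced to plain strings; the names `machine`, `overrides`, and `labels` are illustrative and not part of the notebook, and the list is shortened, so the index differs from the notebook's index 5.

```python
# Machine translations as plain strings (shortened list for illustration).
machine = [
    "Timestamp",
    "How does it work?",
    "Do you have people under your supervision?",
]

# Hypothetical override table: position in the list -> hand-checked wording.
overrides = {1: "Do you work on-premises or remotely?"}

# Use the override where one exists, otherwise keep the machine translation.
labels = [overrides.get(i, text) for i, text in enumerate(machine)]

for original, final in zip(machine, labels):
    print(original, "-->", final)
```

Keeping the overrides in a separate dictionary leaves the raw machine output untouched, which makes it easier to see later exactly which labels were edited by hand.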


At this point the survey questions themselves serve as the DataFrame column labels. They are replaced with shorter names written in upper camel case.

In [7]:
data_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 585 entries, 0 to 584
Data columns (total 12 columns):
 #   Column                                                                        Non-Null Count  Dtype 
---  ------                                                                        --------------  ----- 
 0   Timestamp                                                                     585 non-null    object
 1   How many years have you worked professionally as a programmer?                585 non-null    object
 2   What kind of development do you do professionally at the moment?              585 non-null    object
 3   What programming languages ​​are you currently working on professionally?     585 non-null    object
 4   What size is the company you work for?                                        585 non-null    object
 5   Do you work on-premises or remotely?                                          585 non-null    object
 6   Do you have people under your supervision?

In [8]:
shortnames = [
    'TimeStamp',
    'YearsExperience',
    'DevelopmentType',
    'ProgrammingLanguages',
    'CompanySize',
    'WorkLocation',
    'SuperivisionRole',
    'WorkOutsideMainJob',
    'CityLive',
    'CityWork',
    'Sex',
    'NetSalary'
]

data_original.columns = shortnames
data_original.head(1)

Unnamed: 0,TimeStamp,YearsExperience,DevelopmentType,ProgrammingLanguages,CompanySize,WorkLocation,SuperivisionRole,WorkOutsideMainJob,CityLive,CityWork,Sex,NetSalary
0,7/15/2020 12:03:11,4-5,"DevOps, Backend, Frontend","C#, JavaScript",11-50,Και τα δύο,Όχι,Ναι,Αθήνα,Αθήνα,Άντρας,18200
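Assigning a list to `columns` relies on the list being in exactly the same order as the existing columns. As an alternative, here is a minimal sketch that renames through an explicit mapping with `DataFrame.rename`; the toy frame `df` and the names `label_map` and `missing` are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame with two of the translated question labels.
df = pd.DataFrame({
    "What size is the company you work for?": ["11-50"],
    "What is your annual net salary?": [18200],
})

# Explicit old-label -> short-name mapping; column order no longer matters.
label_map = {
    "What size is the company you work for?": "CompanySize",
    "What is your annual net salary?": "NetSalary",
}

# rename silently skips keys that are not present, so check for typos first.
missing = [label for label in label_map if label not in df.columns]

renamed = df.rename(columns=label_map)
print(list(renamed.columns), missing)
```

The mapping doubles as documentation of which question each short name stands for.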


## Translate data values

The second major step is to check and translate the values stored in the DataFrame columns.

In [9]:
data_original.nunique()

TimeStamp               585
YearsExperience           5
DevelopmentType          57
ProgrammingLanguages    126
CompanySize               6
WorkLocation              3
SuperivisionRole          2
WorkOutsideMainJob        2
CityLive                 43
CityWork                 40
Sex                       2
NetSalary               232
dtype: int64

The helper functions below report whether a string consists only of Latin letters. They are used to find the values that still contain Greek characters.

In [10]:
# Helper code to check whether a text contains non-Latin (e.g. Greek) characters
# https://stackoverflow.com/a/3308844

import unicodedata as ud

# cache: character -> True if its Unicode name contains 'LATIN'
latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    # only alphabetic characters are tested; digits, spaces and punctuation are ignored
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin
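`only_roman_chars` flags any non-Latin letter, not only Greek ones. If the goal is to detect Greek text specifically, a regular expression over the Greek Unicode blocks is one alternative. This is a sketch of a different technique, not the notebook's method; `GREEK_CHARS` and `contains_greek` are hypothetical names.

```python
import re

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF).
GREEK_CHARS = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

def contains_greek(text):
    """Return True if the text holds at least one character from the Greek blocks."""
    return bool(GREEK_CHARS.search(str(text)))

samples = ["Αθήνα", "Athens", "700€ το μήνα", "C#, JavaScript", 18200]
flags = [contains_greek(sample) for sample in samples]
print(flags)  # [True, False, True, False, False]
```

Converting the input with `str` lets the function accept numeric cells as well, mirroring the `str(x)` call used when the helper is applied to the DataFrame.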

In [11]:
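# keep only the cells that contain non-Latin letters; every other cell becomes NaN
temp_list = data_original[data_original.applymap(lambda x: not only_roman_chars(str(x)))].values.tolist()
# flatten the rows into a single list and drop duplicates
temp_list = [item for items in temp_list for item in items]
temp_list = list(set(temp_list))

# drop the NaN placeholders left by the mask
clean_list = [x for x in temp_list if str(x) != 'nan']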

value_translations = translator.translate(clean_list, src='el', dest='en')

for translation in value_translations:
    print(translation.origin,'-->',translation.text)

Κάλυμνος --> Kalymnos
Κύπρο --> Cyprus
15,360 για μερικη απασχοληση σαν εργαζομενος φοιτητης(2ο ετος) --> 15,360 for part-time work as a working student (2nd year)
Άντρας --> Man
Ρέθυμνο --> Rethimno
ΣΑΛΑΜΙΝΑ --> SALAMIS
Τρίπολη --> Tripoli
περίπου 26600 --> about 26,600
Λεμεσό --> Limassol
700€ το μήνα --> 700 € per month
Κομοτηνή --> Komotini
Όχι --> No
Σέρρες --> Serres
Απομακρυσμένα --> Remote
Και τα δύο --> Both
Χανια --> Chania
Γυναίκα --> Woman
Λάρισα --> Larissa
Δε δουλεύω ακόμα --> I'm not working yet
Δράμα --> Drama
Ξάνθη --> Blonde
70000 στο περιπου  --> 70000 in approx
Κοζανη --> Kozani
Καβάλα --> Kavala
Λάρνακα --> Larnaca
δεν έχω συγκεκριμένη πολη --> I do not have a specific city
Βόλος --> Marble
750 καθαρά x 12 + (750 + 375 δώρα) = 10.125 (χωρίς bonus) --> 750 net x 12 + (750 + 375 gifts) = 10.125 (without bonus)
Λευκωσια --> Nicosia
Ιωάννινα --> Janina
ΚΑΒΑΛΑ --> ΚΑΒΑΛΑ
860 για 32 ωρες τη βδομαδα --> 860 for 32 hours per week
Αθήνα --> Athena
Στον χώρο του εργοδότη -->

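The translated values were reviewed as well. Machine translation rendered some place names as common words (Ξάνθη became "Blonde" and Βόλος became "Marble") and left ΚΑΒΑΛΑ untranslated, so these entries were corrected by hand.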

In [12]:
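# map every original value to its machine translation
correction_dictionary = {x.origin: x.text for x in value_translations}
# override the entries that were mistranslated or left untranslated
correction_dictionary['Ξάνθη'] = 'Xanthi'
correction_dictionary['ΚΑΒΑΛΑ'] = 'KAVALA'
correction_dictionary['Βόλος'] = 'Volos'

# replace whole cell values that match a dictionary key
data_original = data_original.replace(correction_dictionary)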
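The effect of the replacement can be illustrated, and checked, on a small example. The sketch below uses a toy frame rather than the survey data; `df`, `corrections`, `greek_pattern`, and `remaining` are illustrative names, and the final check for leftover Greek characters is an added suggestion rather than a step taken in the notebook.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative).
df = pd.DataFrame({
    "CityWork": ["Ξάνθη", "Βόλος", "Athens"],
    "Sex": ["Άντρας", "Γυναίκα", "Άντρας"],
})

# Machine output first, then manual fixes, in the same shape as correction_dictionary.
corrections = {
    "Ξάνθη": "Blonde",
    "Βόλος": "Marble",
    "Άντρας": "Man",
    "Γυναίκα": "Woman",
}
corrections["Ξάνθη"] = "Xanthi"
corrections["Βόλος"] = "Volos"

# A flat dict passed to replace swaps whole cell values that equal a key.
translated = df.replace(corrections)

# Count the cells that still contain a character from the Greek Unicode blocks.
greek_pattern = r"[\u0370-\u03FF\u1F00-\u1FFF]"
still_greek = translated.apply(
    lambda col: col.astype(str).str.contains(greek_pattern, regex=True)
)
remaining = int(still_greek.values.sum())
print(translated)
print("cells still containing Greek:", remaining)
```

Because the replacement matches whole cell values, any value that is missing from the dictionary stays in Greek, and a non-zero count points to the values that still need a translation.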

## Save data

In [13]:
data_dir = 'data'
# create the output directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

output_file = 'data_translated.csv'
data_original.to_csv(os.path.join(data_dir, output_file), index=False)