# Cleaning and exploratory data analysis

The associated [Zenodo](https://zenodo.org/record/4324025#.Y_-XSXbMK3B) page says that about 5,000 respondents completed both surveys, so the first step is to find the identifier associated with each respondent and build a new file that combines the data from both files for each respondent.

In [1]:
# Import pandas to load the data
import pandas as pd

# Load the first data set
data1 = pd.read_csv("ocean157ksafe.csv")

# Load the second data set
data2 = pd.read_csv("concienciaysustancia10k7safe.csv")

# Translate the column names in each file to English; this time, overwrite the data files instead of creating new ones

# There are only a few columns in the first data file that need translating
data1.rename(columns={"2_edad":"2_age", 
                      "2_genero":"2_gender",
                      "2_nacionalidad":"2_nationality",
                      "2_pais_actual":"2_current_country"},
            inplace=True)

# Use a loop to translate the columns from data2
from googletrans import Translator
translator = Translator()
for spanish_name in data2.columns.values.tolist():
    to_english = translator.translate(spanish_name)
    data2.rename(columns={spanish_name:to_english.text}, 
                 inplace=True) 

# Translate the columns that the loop couldn't translate
data2.rename(columns={"2_programa_microdosificacion":"2_microdosing_program", 
                      "6_psychoactive_cosumidos":"6_psychoactives_consumed",
                      "9_feliz":"9_happy",
                      "11_encaro_oblicaciones_sin_problemas":"11_face_obligations_without_problems"},
            inplace=True)

Both data files have a column called "hash", which may be the identifier for each respondent. We can check this by seeing whether any of the hash values from `data1` appear in `data2`.

In [11]:
# Create a list of the hash ids that appear in both data files
common_hash_ids = []

# Append to the list using a loop
for hash_id in data1['hash'].values:
    if hash_id in data2['hash'].values:
        common_hash_ids.append(hash_id)

# Count the number of entries in the list
len(common_hash_ids)

7090
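
Note that the loop adds one entry for every row of `data1` whose hash appears in `data2`, so a hash that occurs more than once in `data1` is counted more than once. The following is a minimal sketch, using small made-up frames rather than the survey files, of how the row-wise count, the number of distinct shared hashes, and an inner merge on `hash` can differ. The frames `toy1` and `toy2` and the columns `age` and `happy` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-ins for data1 and data2 (made-up values, not the survey data)
toy1 = pd.DataFrame({"hash": ["a", "b", "b", "c"], "age": [30, 41, 41, 25]})
toy2 = pd.DataFrame({"hash": ["b", "c", "d"], "happy": [4, 2, 5]})

# Row-wise count, the same idea as the loop: rows of toy1 whose hash is in toy2
rows_with_match = int(toy1["hash"].isin(toy2["hash"]).sum())

# Distinct hashes that are present in both frames
shared_hashes = set(toy1["hash"]) & set(toy2["hash"])

# An inner merge keeps only the rows whose hash appears in both frames
merged = toy1.merge(toy2, on="hash", how="inner")

print(rows_with_match, len(shared_hashes), merged.shape)
```

If the row-wise count and the distinct count disagree on the real data, that would suggest repeated hashes, which would be worth inspecting before merging the two files.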