# Cleaning I

Before choosing a model, we must clean the two data sets, [`data1`](data/ocean157ksafe.csv) and [`data2`](data/concienciaysustancia10k7safe.csv), and then explore them.

Tasks:
- **Merge the data** 
- **Column descriptions for `data1`** 
- **Translations for `data2`**
- **Delete unwanted columns, final cleanup**

In this notebook we do the first two tasks.

## Merge the data

Each data set contains the results of one of two surveys. Respondents are identified by a `hash` column. We merge the two data sets, keeping only the respondents who completed both surveys.

In [1]:
# Import pandas to handle the data
import pandas as pd

# Load the data
data1 = pd.read_csv("data/ocean157ksafe.csv")
data2 = pd.read_csv("data/concienciaysustancia10k7safe.csv")

# Create the merged data frame
master_data = pd.merge(data1,
                       data2, 
                       how='inner',
                       on="hash")
master_data

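| Unnamed: 0 | 0_1 | 0_2 | 0_3 | 0_4 | 0_5 | 0_6 | 0_7 | 0_8 | 0_9 | 0_10 | ... | 12_usualmente_encuentro_cosas_de_que_reirme | 12_confianza_en_mi_pasar_tiempos_dificiles | 12_en_emergencias_pueden_confiar_en_mi | 12_ver_situacion_varios_puntos_de_vista | 12_mi_vida_tiene_sentido | 12_no_insito_en_cosas_que_no_puedo_hacer_nada | 12_encuentro_salida_en_situacion_dificil | 12_tengo_energia_para_lo_que_tengo_que_hacer | 12_siento_comodo_si_hay_gente_a_la_que_no_le_agrado | 13_login_disclaimer_fork |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | 5 | 5 | 1 | 5 | 3 | 5 | 1 | 5 | 5 | ... | 6 | 6 | 7 | 7 | 7 | 7 | 6 | 7 | 6 | 2 |
| 1 | 4 | 3 | 5 | 3 | 4 | 3 | 4 | 4 | 4 | 3 | ... | 6 | 7 | 7 | 5 | 5 | 6 | 5 | 4 | 6 | 2 |
| 2 | 5 | 5 | 5 | 3 | 3 | 5 | 5 | 1 | 1 | 5 | ... | 6 | 3 | 7 | 6 | 6 | 2 | 6 | 4 | 6 | 2 |
| 3 | 4 | 4 | 5 | 4 | 4 | 3 | 5 | 3 | 3 | 5 | ... | 4 | 2 | 6 | 6 | 3 | 3 | 6 | 3 | 2 | 2 |
| 4 | 5 | 3 | 2 | 2 | 5 | 5 | 5 | 4 | 4 | 5 | ... | 7 | 7 | 7 | 6 | 7 | 5 | 6 | 5 | 7 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7085 | 5 | 3 | 5 | 2 | 5 | 1 | 5 | 5 | 3 | 5 | ... | 7 | 7 | 7 | 5 | 6 | 2 | 6 | 4 | 2 | 2 |
| 7086 | 4 | 5 | 5 | 5 | 2 | 4 | 4 | 5 | 1 | 4 | ... | 5 | 2 | 7 | 6 | 1 | 7 | 6 | 2 | 4 | 2 |
| 7087 | 4 | 2 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 4 | ... | 7 | 4 | 7 | 7 | 3 | 4 | 6 | 4 | 4 | 2 |
| 7088 | 2 | 3 | 3 | 3 | 3 | 4 | 5 | 4 | 4 | 4 | ... | 4 | 5 | 7 | 7 | 6 | 4 | 5 | 4 | 1 | 2 |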
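
One way to sanity-check an inner merge like the one above is to run an outer merge with `indicator=True` and count where each `hash` came from. The sketch below uses two small toy frames (`survey_a`, `survey_b`, and their values are hypothetical stand-ins, not the real survey data), and it assumes each `hash` appears at most once per survey, which the real files may or may not satisfy.

```python
import pandas as pd

# Toy stand-ins for the two surveys (hypothetical values, not the real data)
survey_a = pd.DataFrame({"hash": ["a1", "b2", "c3"], "0_1": [4, 5, 3]})
survey_b = pd.DataFrame({"hash": ["b2", "c3", "d4"],
                         "12_mi_vida_tiene_sentido": [7, 6, 5]})

# Inner merge keeps only respondents present in both frames;
# validate="one_to_one" raises an error if a hash is duplicated on either side
both = pd.merge(survey_a, survey_b, how="inner", on="hash",
                validate="one_to_one")

# Outer merge with indicator=True adds a "_merge" column that records
# whether each hash was found in the left frame, the right frame, or both
audit = pd.merge(survey_a, survey_b, how="outer", on="hash", indicator=True)
merge_counts = audit["_merge"].value_counts().to_dict()
```

The number of rows marked `both` in the audit should equal the number of rows in the inner merge.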


## Column descriptions for `data1`

The [Kaggle](https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19) page for the first data set includes a table describing its columns, which is especially useful for the columns labelled only with numbers.  We want to replace those column names with the descriptions.  To do this we first need to scrape the table from the Kaggle page.

In [2]:
# Import BeautifulSoup to scrape data 
from bs4 import BeautifulSoup

# Import requests to get the html
import requests

# This is the url for the html we want
kaggle_url = "https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19"

# Request the html code and assign it to a variable
kaggle_html = requests.get(kaggle_url)

# Make a BeautifulSoup object, naming the parser explicitly to avoid a parser warning
kaggle_soup = BeautifulSoup(kaggle_html.content, "html.parser")

# Show the html
print(kaggle_soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Mental Health in Drug Users During COVID-19 | Kaggle
  </title>
  <meta charset="utf-8"/>
  <meta content="index, follow" name="robots"/>
  <meta content="Exploring Personality and Risk Profiles" name="description"/>
  <meta content="no-cache" name="turbolinks-cache-control"/>
  <meta content="health,covid19,mental health,research" name="keywords"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0" name="viewport"/>
  <meta content="#008ABC" name="theme-color"/>
  <script nonce="kA3urVWPu2Mk+CFGX5NVOw==" type="text/javascript">
   window["pageRequestStartTime"] = 1678992204117;
    window["pageRequestEndTime"] = 1678992204223;
    window["initialPageLoadStartTime"] = new Date().getTime();
  </script>
  <link crossorigin="anonymous" href="https://www.google-analytics.com" rel="preconnect"/>
  <link href="https://stats.g.doubleclick.net" rel="preconnect"/>
  <link href="https://storage.goo

The table appears twice, in the head element and in the body element.  It happens to occur in the first `script` element in the body, so we will use that information to extract it.  

In [3]:
# Find the html element that contains the table
kaggle_table = kaggle_soup.body.find('script').text # Turns the html into a string
kaggle_table

'var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"basics":{"datasetId":2837490,"slug":"mental-health-in-drug-users-during-covid-19","title":"Mental Health in Drug Users During COVID-19","description":"_____\\n# Mental Health in Drug Users During COVID-19\\n### Exploring Personality and Risk Profiles\\nBy  [[source]](https://zenodo.org/record/4324025#.Y8OqntJBwUE)\\n_____\\n\\n### About this dataset\\n\\u0026gt; This anonymous online survey dataset explores the mental health outcomes of psychedelic and non-psychedelic drug users during the COVID-19 pandemic. Using psychometric scales to assess personality traits, anxiety, negative and positive affect, well-being and resilience, principal component analysis was applied to ascertain drug use reports from the sample population. Risk profiles including risk taking/avoidance behaviours, risk perception and risk tolerance are analysed to gain a deeper insight into potential correlations with mental health outcome

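To further isolate the table within the `script` element, we can use the `split` method to separate the rows of the table at the string `|\n`, that is, a pipe followed by a literal backslash and the letter `n`.  Note that the text itself contains only one backslash.  It appears doubled in the output above, and is written as `"|\\n"` in the code below, because Python escapes a literal backslash both when displaying a string and inside a string literal.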

In [4]:
# Use "|\\n" to split the html according to the rows of the table
kaggle_strings = kaggle_table.split("|\\n")
kaggle_strings

['var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"basics":{"datasetId":2837490,"slug":"mental-health-in-drug-users-during-covid-19","title":"Mental Health in Drug Users During COVID-19","description":"_____\\n# Mental Health in Drug Users During COVID-19\\n### Exploring Personality and Risk Profiles\\nBy  [[source]](https://zenodo.org/record/4324025#.Y8OqntJBwUE)\\n_____\\n\\n### About this dataset\\n\\u0026gt; This anonymous online survey dataset explores the mental health outcomes of psychedelic and non-psychedelic drug users during the COVID-19 pandemic. Using psychometric scales to assess personality traits, anxiety, negative and positive affect, well-being and resilience, principal component analysis was applied to ascertain drug use reports from the sample population. Risk profiles including risk taking/avoidance behaviours, risk perception and risk tolerance are analysed to gain a deeper insight into potential correlations with mental health outcom
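
The backslash handling can be checked on a small example. The sketch below uses a made-up `fragment` (a hypothetical stand-in for part of the script text, not the real page) in which each row ends with a pipe, one literal backslash, and the letter `n`.

```python
# A raw string keeps the backslash literal: the fragment contains the two
# characters "\" and "n", not a real newline character
fragment = r"| **0_1** | Age |\n| **0_2** | Gender |\n"

# In an ordinary string literal, "\\" stands for a single backslash,
# so this separator is three characters long: pipe, backslash, n
separator = "|\\n"

rows = fragment.split(separator)
```

Here `rows` holds the two table rows followed by an empty string, which is why a trailing piece has to be discarded after splitting.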

Now we have a list of strings that includes each line of the table, but we don't want the first two strings or the last one, so we can get rid of them.

In [5]:
# Indices of the unwanted strings
unwanted = [0,1,len(kaggle_strings)-1]

# Delete the unwanted strings, highest index first,
# so that each deletion does not shift the indices still to be deleted
for kaggle_strings_index in sorted(unwanted, reverse=True):
    del kaggle_strings[kaggle_strings_index]
kaggle_strings

['| **0_1**                           | Age of the respondent. (Numeric)                                                              ',
 '| **0_2**                           | Gender of the respondent. (Categorical)                                                       ',
 '| **0_3**                           | Country of residence of the respondent. (Categorical)                                         ',
 '| **0_4**                           | Number of times the respondent has used psychedelics. (Numeric)                               ',
 '| **0_5**                           | Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 '| **0_6**                           | Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 '| **0_7**                           | Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 '| **0_8**                           | N
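
Because the unwanted strings sit at the two ends of the list, a slice gives the same result as deleting by index. The sketch below compares the two approaches on a toy list (`strings` and its contents are hypothetical).

```python
# Toy list: two leading strings and one trailing string are unwanted
strings = ["header", "divider", "row 1", "row 2", "footer"]

# Slicing drops the first two items and the last one in a single step
by_slice = strings[2:-1]

# Deleting by index, highest first, as in the cell above
by_delete = list(strings)
for index in sorted([0, 1, len(by_delete) - 1], reverse=True):
    del by_delete[index]
```

The slice returns a new list and leaves `strings` unchanged, whereas `del` modifies the list in place.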

Now that we've isolated the rows of the table we need to separate the two columns.

In [6]:
# Initialize the column lists
numbered_columns = []
descriptions = []

# Split each table line into two strings, the first goes into one column list and the second goes into the other
for line in kaggle_strings:
    split_line = line.split(" | ")
    numbered_columns.append(split_line[0])
    descriptions.append(split_line[1])
numbered_columns

['| **0_1**                          ',
 '| **0_2**                          ',
 '| **0_3**                          ',
 '| **0_4**                          ',
 '| **0_5**                          ',
 '| **0_6**                          ',
 '| **0_7**                          ',
 '| **0_8**                          ',
 '| **0_9**                          ',
 '| **0_10**                         ',
 '| **0_11**                         ',
 '| **0_12**                         ',
 '| **0_13**                         ',
 '| **0_14**                         ',
 '| **0_15**                         ',
 '| **0_16**                         ',
 '| **0_17**                         ',
 '| **0_18**                         ',
 '| **0_19**                         ',
 '| **0_20**                         ',
 '| **0_41**                         ',
 '| **0_42**                         ',
 '| **0_43**                         ',
 '| **0_44**                         ',
 '| **1_extraversion**               ',


In [7]:
descriptions

['Age of the respondent. (Numeric)                                                              ',
 'Gender of the respondent. (Categorical)                                                       ',
 'Country of residence of the respondent. (Categorical)                                         ',
 'Number of times the respondent has used psychedelics. (Numeric)                               ',
 'Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 'Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 'Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 'Number of times the respondent has used psychedelics in the past month. (Numeric)             ',
 'Number of times the respondent has used other psychoactive drugs in the past month. (Numeric) ',
 'Number of times the respondent has used psychedelics in the past week. (Numeric)              ',
 'Number o
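
The split above leaves padding, pipes, and asterisks in the pieces. As a sketch of an alternative, the hypothetical helper `parse_row` below cleans both cells in one pass. It assumes every row has exactly two cells and that no description contains a pipe character, which has not been checked against the full table.

```python
def parse_row(row):
    """Return (column_name, description) from one markdown table row."""
    # Drop outer whitespace and the leading pipe, then split into cells
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    name, description = cells[0], cells[1]
    # Remove the bold markers (**) around the column name
    return name.strip("*"), description


example_row = "| **0_1**                           | Age of the respondent. (Numeric)            "
column_name, column_description = parse_row(example_row)
```

With this helper the descriptions no longer carry the trailing spaces seen in the output above.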

Next we need to modify `numbered_columns` so that the column names match those in the data frame.

In [8]:
# Keep only the text between the "**" markers in each entry
numbered_columns = [entry.split("**")[1] for entry in numbered_columns]
numbered_columns

['0_1',
 '0_2',
 '0_3',
 '0_4',
 '0_5',
 '0_6',
 '0_7',
 '0_8',
 '0_9',
 '0_10',
 '0_11',
 '0_12',
 '0_13',
 '0_14',
 '0_15',
 '0_16',
 '0_17',
 '0_18',
 '0_19',
 '0_20',
 '0_41',
 '0_42',
 '0_43',
 '0_44',
 '1_extraversion',
 '1_agreeableness',
 '1_conscientiousness',
 '1_neuroticism',
 '1_openness',
 '1_percentil_extraversion',
 '1_percentil_agreeableness',
 '1_percentil_conscientiousness',
 '1_percentil_neuroticism',
 '1_percentil_openness',
 '2_edad',
 '2_genero',
 '2_pais_actual',
 '3_timestamp',
 'hash',
 '0_21',
 '0_22',
 '0_23']

Now make a loop that will change the column names in `master_data`.

In [9]:
# Change the names of the columns in master_data
for entry, description in zip(numbered_columns, descriptions):
    if entry in master_data.columns:
        # inplace=True modifies master_data directly instead of returning a renamed copy
        master_data.rename(columns={entry: description}, inplace=True)
master_data.columns.values.tolist()

['Age of the respondent. (Numeric)                                                              ',
 'Gender of the respondent. (Categorical)                                                       ',
 'Country of residence of the respondent. (Categorical)                                         ',
 'Number of times the respondent has used psychedelics. (Numeric)                               ',
 'Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 'Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 'Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 'Number of times the respondent has used psychedelics in the past month. (Numeric)             ',
 'Number of times the respondent has used other psychoactive drugs in the past month. (Numeric) ',
 'Number of times the respondent has used psychedelics in the past week. (Numeric)              ',
 'Number o
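
The new column names above keep the trailing spaces from the scraped table. A possible variation, sketched below on a toy frame, strips each description and renames all columns with a single dictionary. The names `frame`, `names`, and `labels` are hypothetical, and whether stripped names suit the later notebooks is an assumption.

```python
import pandas as pd

# Toy frame and lookup lists (hypothetical values, not the real data)
frame = pd.DataFrame({"0_1": [4, 5], "0_2": [5, 3], "hash": ["a1", "b2"]})
names = ["0_1", "0_2", "0_99"]
labels = ["Age of the respondent. (Numeric)   ",
          "Gender of the respondent. (Categorical) ",
          "Unused description"]

# Map each old name to its description with surrounding spaces removed
rename_map = {name: label.strip() for name, label in zip(names, labels)}

# rename ignores keys that are not column names, so "0_99" is skipped
renamed = frame.rename(columns=rename_map)
```

Because `rename` returns a new frame here, the original `frame` keeps its numbered names.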

Finally, save `master_data` as a `.csv` so we can load it in the next notebook, [CleaningII](CleaningII.ipynb).

In [10]:
# Save master_data to a csv file
master_data.to_csv("data/master_data.csv")
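
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether that matters depends on how the next notebook loads the file, which is not shown here, so the following is only a sketch on a toy frame (`frame` and the in-memory buffers are hypothetical) comparing the default with `index=False`.

```python
import io

import pandas as pd

# Toy frame standing in for master_data (hypothetical values)
frame = pd.DataFrame({"hash": ["a1", "b2"], "score": [4, 5]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the columns round-trip unchanged
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```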