# Cleaning and preprocessing I

## Merging the two data files

According to the codebook on [Kaggle](https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19), each respondent's identifier is given in the `hash` column of each data file.  First we load the translated data we saved in the ProblemStatement notebook.

In [1]:
# Import pandas to load the data
import pandas as pd

# Load the first data set
data1 = pd.read_csv("data/translated_data1.csv")

# Load the second data set
data2 = pd.read_csv("data/translated_data2.csv")

Now we can check how many respondents completed both surveys by looking for `hash` entries that appear in both data files.  

In [2]:
# Create a list of the hash ids that appear in both data files
common_hash_ids = []

# Append to the list using a loop
for hash_id in data1['hash'].values:
    if hash_id in data2['hash'].values:
        common_hash_ids.append(hash_id)

# Count the number of entries in the list
len(common_hash_ids)

7090

Based on the output, there are 7090 respondents who completed both surveys.  We shall create a data frame merging the data from both data files for each of those respondents.
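
The same count can be obtained without an explicit loop. Here is a minimal sketch using `Series.isin` on made-up toy frames (`toy1` and `toy2` are illustrative stand-ins, not the survey data):

```python
import pandas as pd

# Toy stand-ins for data1 and data2 (hypothetical values)
toy1 = pd.DataFrame({"hash": ["a", "b", "c", "d"], "x": [1, 2, 3, 4]})
toy2 = pd.DataFrame({"hash": ["b", "d", "e"], "y": [10, 20, 30]})

# Boolean mask: True where a hash from toy1 also appears in toy2
in_both = toy1["hash"].isin(toy2["hash"])

# Summing a boolean mask counts the True entries
n_common = int(in_both.sum())
common_ids = toy1.loc[in_both, "hash"].tolist()
print(n_common, common_ids)
```

Like the loop, this counts rows of the first frame, so a hash repeated in the first frame would be counted once per repetition.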

In [3]:
# Create the new data frame
master_data = pd.merge(data1,
                       data2, 
                       how='inner',
                       on="hash")

# Show the new data frame
master_data    

Unnamed: 0,Unnamed: 0_x,0_1,0_2,0_3,0_4,0_5,0_6,0_7,0_8,0_9,...,12_usually_find_things_to_laugh about,12_confidence_in_my_going_through_difficult_times,12_in_emergencies_you_can_trust_me,12_see_situation_several_points_of_view,12_my_life_has_sense,12_I_don't_insist_in_things_that_I_can't_do_anything,12_meeting_exit_in_difficult_situation,12_I_have_energy_for_what_I_have_to_do,12_I_feel_comfortable_if_there_are_people_who_don't_like_me,13_login_disclaimer_fork
0,6,4,5,5,1,5,3,5,1,5,...,6,6,7,7,7,7,6,7,6,2
1,7,4,3,5,3,4,3,4,4,4,...,6,7,7,5,5,6,5,4,6,2
2,8,5,5,5,3,3,5,5,1,1,...,6,3,7,6,6,2,6,4,6,2
3,9,4,4,5,4,4,3,5,3,3,...,4,2,6,6,3,3,6,3,2,2
4,10,5,3,2,2,5,5,5,4,4,...,7,7,7,6,7,5,6,5,7,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7085,157068,5,3,5,2,5,1,5,5,3,...,7,7,7,5,6,2,6,4,2,2
7086,157070,4,5,5,5,2,4,4,5,1,...,5,2,7,6,1,7,6,2,4,2
7087,157071,4,2,3,3,3,4,5,5,3,...,7,4,7,7,3,4,6,4,4,2
7088,157072,2,3,3,3,3,4,5,4,4,...,4,5,7,7,6,4,5,4,1,2


As expected, `master_data` has 7090 rows.
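
The row count matches the earlier tally only when each `hash` occurs at most once per file, since duplicated keys multiply rows in an inner merge. A hedged sketch of how that assumption could be checked with the `validate` argument of `pd.merge`, on toy frames with illustrative names and values:

```python
import pandas as pd

# Toy stand-ins for the two survey files (hypothetical values)
toy1 = pd.DataFrame({"hash": ["a", "b", "c"], "x": [1, 2, 3]})
toy2 = pd.DataFrame({"hash": ["b", "c", "d"], "y": [10, 20, 30]})

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side
merged = pd.merge(toy1, toy2, how="inner", on="hash", validate="one_to_one")

# With unique keys, the merged row count equals the number of shared hashes
n_shared = int(toy1["hash"].isin(toy2["hash"]).sum())
print(len(merged), n_shared)

# A duplicated key on the left makes the same call fail
toy1_dup = pd.concat([toy1, toy1.iloc[[1]]], ignore_index=True)
try:
    pd.merge(toy1_dup, toy2, how="inner", on="hash", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```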

## Renaming the columns in `data1`

Now it would be helpful if the numbered columns from `data1` were replaced with their descriptions.  To do this we can scrape the table of the column descriptions from [the Kaggle page](https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19).

In [4]:
# Import BeautifulSoup to scrape data 
from bs4 import BeautifulSoup

# Import requests to get the html
import requests

# This is the url for the html we want
kaggle_url = "https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19"

# Request the html code and assign it to a variable
kaggle_html = requests.get(kaggle_url)

# Make a BeautifulSoup object, naming the parser explicitly so bs4 does not have to guess one
kaggle_soup = BeautifulSoup(kaggle_html.content, "html.parser")

# Show the html
print(kaggle_soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Mental Health in Drug Users During COVID-19 | Kaggle
  </title>
  <meta charset="utf-8"/>
  <meta content="index, follow" name="robots"/>
  <meta content="Exploring Personality and Risk Profiles" name="description"/>
  <meta content="no-cache" name="turbolinks-cache-control"/>
  <meta content="health,covid19,mental health,research" name="keywords"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0" name="viewport"/>
  <meta content="#008ABC" name="theme-color"/>
  <script nonce="NEpbndpQ86ArfL+DafpzVw==" type="text/javascript">
   window["pageRequestStartTime"] = 1678663240921;
    window["pageRequestEndTime"] = 1678663241034;
    window["initialPageLoadStartTime"] = new Date().getTime();
  </script>
  <link crossorigin="anonymous" href="https://www.google-analytics.com" rel="preconnect"/>
  <link href="https://stats.g.doubleclick.net" rel="preconnect"/>
  <link href="https://storage.goo

The table appears twice: once in the head element and once in the body element.  In the body, it happens to sit in the first `script` element.

In [5]:
# Finds the html element that contains the table
kaggle_table = kaggle_soup.body.find('script').text # Turns the html into a string
kaggle_table

'var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"basics":{"datasetId":2837490,"slug":"mental-health-in-drug-users-during-covid-19","title":"Mental Health in Drug Users During COVID-19","description":"_____\\n# Mental Health in Drug Users During COVID-19\\n### Exploring Personality and Risk Profiles\\nBy  [[source]](https://zenodo.org/record/4324025#.Y8OqntJBwUE)\\n_____\\n\\n### About this dataset\\n\\u0026gt; This anonymous online survey dataset explores the mental health outcomes of psychedelic and non-psychedelic drug users during the COVID-19 pandemic. Using psychometric scales to assess personality traits, anxiety, negative and positive affect, well-being and resilience, principal component analysis was applied to ascertain drug use reports from the sample population. Risk profiles including risk taking/avoidance behaviours, risk perception and risk tolerance are analysed to gain a deeper insight into potential correlations with mental health outcome
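
Relying on the table being in the first `script` element is fragile if the page layout changes. A hedged sketch on a made-up HTML snippet: search every `script` element in the body for a marker string. The marker `Kaggle.State.push` comes from the output above; treating it as a stable identifier is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up page with the same overall shape as the real one
html = """
<html><head><script>var unrelated = 1;</script></head>
<body>
<script>var other = 2;</script>
<script>var Kaggle=window.Kaggle||{};Kaggle.State.push({"basics":{}});</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the contents of every body script that contains the marker
marker = "Kaggle.State.push"
matches = [s.string for s in soup.body.find_all("script")
           if s.string and marker in s.string]

# Take the first match, or None if the marker is not found
table_script = matches[0] if matches else None
print(table_script)
```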

The table is written in Markdown.  To extract it from the html element, we can use the `split` method to separate the rows of the table at the string `|\\n`.  Note that the page source contains a single backslash followed by `n`; the doubled backslash is only how Python writes a literal backslash, both in the string literal below and in the displayed output.
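
To make the backslash point concrete, here is a small sketch on a made-up string (not the real page source). The text holds a literal backslash followed by `n`, which Python writes as `\\n` in ordinary string literals and in `repr` output:

```python
# A made-up fragment with JSON-style escaped newlines: a backslash, then "n"
fragment = r"| **a** | first |\n| **b** | second |\n"

# The literal "|\\n" is three characters: "|", one backslash, "n"
separator = "|\\n"
print(len(separator))

# Splitting on it yields one piece per table row, plus a trailing empty string
pieces = fragment.split(separator)
print(pieces)
```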

In [6]:
# Use "|\\n" to split the html according to the rows of the table
kaggle_strings = kaggle_table.split("|\\n")
kaggle_strings

['var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"basics":{"datasetId":2837490,"slug":"mental-health-in-drug-users-during-covid-19","title":"Mental Health in Drug Users During COVID-19","description":"_____\\n# Mental Health in Drug Users During COVID-19\\n### Exploring Personality and Risk Profiles\\nBy  [[source]](https://zenodo.org/record/4324025#.Y8OqntJBwUE)\\n_____\\n\\n### About this dataset\\n\\u0026gt; This anonymous online survey dataset explores the mental health outcomes of psychedelic and non-psychedelic drug users during the COVID-19 pandemic. Using psychometric scales to assess personality traits, anxiety, negative and positive affect, well-being and resilience, principal component analysis was applied to ascertain drug use reports from the sample population. Risk profiles including risk taking/avoidance behaviours, risk perception and risk tolerance are analysed to gain a deeper insight into potential correlations with mental health outcom

Now we have a list of strings that includes each line of the table, but we don't want the first two strings or the last one, so we can get rid of them.

In [7]:
# Indices of the unwanted strings
unwanted = [0,1,len(kaggle_strings)-1]

# Delete the unwanted strings, highest index first so the remaining indices don't shift
for kaggle_strings_index in sorted(unwanted, reverse=True):
    del kaggle_strings[kaggle_strings_index]
kaggle_strings

['| **0_1**                           | Age of the respondent. (Numeric)                                                              ',
 '| **0_2**                           | Gender of the respondent. (Categorical)                                                       ',
 '| **0_3**                           | Country of residence of the respondent. (Categorical)                                         ',
 '| **0_4**                           | Number of times the respondent has used psychedelics. (Numeric)                               ',
 '| **0_5**                           | Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 '| **0_6**                           | Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 '| **0_7**                           | Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 '| **0_8**                           | N
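
Deleting by position depends on exactly where the table sits in the split result. An alternative sketch, on made-up strings, keeps only the pieces that look like data rows. The `| **` prefix is taken from the output above; the assumption is that no unwanted piece starts that way.

```python
# Made-up stand-in for the list of pieces before cleaning
pieces = [
    "var Kaggle=...| Column name | Description ",
    "|:------|:------",
    "| **0_1** | Age of the respondent. (Numeric) ",
    "| **0_2** | Gender of the respondent. (Categorical) ",
    '"}});',
]

# Keep only pieces that start like a table row with a bolded column name
rows = [piece for piece in pieces if piece.startswith("| **")]
print(rows)
```

When the positions are known in advance, the slice `pieces[2:-1]` drops the same three entries in one step.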

Now that we've isolated the rows of the table we need to separate the two columns.

In [8]:
# Initialize the column lists
numbered_columns = []
descriptions = []

# Split the strings into two strings, the first goes into one column and the second goes into the other
for line in kaggle_strings:
    split_line = line.split(" | ")
    numbered_columns.append(split_line[0])
    descriptions.append(split_line[1])

# Show the first column    
numbered_columns

['| **0_1**                          ',
 '| **0_2**                          ',
 '| **0_3**                          ',
 '| **0_4**                          ',
 '| **0_5**                          ',
 '| **0_6**                          ',
 '| **0_7**                          ',
 '| **0_8**                          ',
 '| **0_9**                          ',
 '| **0_10**                         ',
 '| **0_11**                         ',
 '| **0_12**                         ',
 '| **0_13**                         ',
 '| **0_14**                         ',
 '| **0_15**                         ',
 '| **0_16**                         ',
 '| **0_17**                         ',
 '| **0_18**                         ',
 '| **0_19**                         ',
 '| **0_20**                         ',
 '| **0_41**                         ',
 '| **0_42**                         ',
 '| **0_43**                         ',
 '| **0_44**                         ',
 '| **1_extraversion**               ',


In [9]:
# Show the second column
descriptions

['Age of the respondent. (Numeric)                                                              ',
 'Gender of the respondent. (Categorical)                                                       ',
 'Country of residence of the respondent. (Categorical)                                         ',
 'Number of times the respondent has used psychedelics. (Numeric)                               ',
 'Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 'Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 'Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 'Number of times the respondent has used psychedelics in the past month. (Numeric)             ',
 'Number of times the respondent has used other psychoactive drugs in the past month. (Numeric) ',
 'Number of times the respondent has used psychedelics in the past week. (Numeric)              ',
 'Number o

Now we need to make the entries in `numbered_columns` match the column names in `master_data`, which means removing the leading `| **` and the trailing `**` and padding.  We can use the `split` function again.

In [10]:
# Keep only the text between the "**" markers in each entry
numbered_columns = [entry.split("**")[1] for entry in numbered_columns]
numbered_columns

['0_1',
 '0_2',
 '0_3',
 '0_4',
 '0_5',
 '0_6',
 '0_7',
 '0_8',
 '0_9',
 '0_10',
 '0_11',
 '0_12',
 '0_13',
 '0_14',
 '0_15',
 '0_16',
 '0_17',
 '0_18',
 '0_19',
 '0_20',
 '0_41',
 '0_42',
 '0_43',
 '0_44',
 '1_extraversion',
 '1_agreeableness',
 '1_conscientiousness',
 '1_neuroticism',
 '1_openness',
 '1_percentil_extraversion',
 '1_percentil_agreeableness',
 '1_percentil_conscientiousness',
 '1_percentil_neuroticism',
 '1_percentil_openness',
 '2_edad',
 '2_genero',
 '2_pais_actual',
 '3_timestamp',
 'hash',
 '0_21',
 '0_22',
 '0_23']
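
The two splitting steps can also be combined. A sketch using a regular expression on two rows in the layout shown earlier; the pattern is an assumption about the row format and would need adjusting if a column name contained `**` or the layout changed:

```python
import re

# Two rows in the same layout as the scraped table
rows = [
    "| **0_1**                           | Age of the respondent. (Numeric)          ",
    "| **0_2**                           | Gender of the respondent. (Categorical)   ",
]

# "| **name** | description" with arbitrary padding around each part
row_pattern = re.compile(r"^\|\s*\*\*(?P<name>.+?)\*\*\s*\|\s*(?P<description>.*?)\s*$")

parsed = {}
for row in rows:
    match = row_pattern.match(row)
    if match:  # skip anything that is not a data row
        parsed[match.group("name")] = match.group("description")
print(parsed)
```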

Now make a loop that will change the column names in `master_data`.

In [11]:
# Change the names of the columns in master_data
# Map each numbered heading to its description; headings absent from master_data are ignored
master_data.rename(columns=dict(zip(numbered_columns, descriptions)),
                   inplace=True)
master_data.columns.values.tolist()

['Unnamed: 0_x',
 'Age of the respondent. (Numeric)                                                              ',
 'Gender of the respondent. (Categorical)                                                       ',
 'Country of residence of the respondent. (Categorical)                                         ',
 'Number of times the respondent has used psychedelics. (Numeric)                               ',
 'Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 'Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 'Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 'Number of times the respondent has used psychedelics in the past month. (Numeric)             ',
 'Number of times the respondent has used other psychoactive drugs in the past month. (Numeric) ',
 'Number of times the respondent has used psychedelics in the past week. (Numeric)          
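
The new column names above still carry the padding from the Markdown table. A sketch of one way to avoid that, on a toy frame with illustrative values: strip each description and pass a single mapping to `DataFrame.rename`, which leaves columns without an entry untouched.

```python
import pandas as pd

# Illustrative headings and padded descriptions
numbered = ["0_1", "0_2", "0_99"]  # "0_99" is deliberately absent from the frame
described = ["Age of the respondent. (Numeric)      ",
             "Gender of the respondent. (Categorical)   ",
             "Unused description   "]

toy = pd.DataFrame({"hash": ["a", "b"], "0_1": [3, 4], "0_2": [1, 2]})

# Build old-name -> new-name pairs, trimming surrounding whitespace
rename_map = {old: new.strip() for old, new in zip(numbered, described)}

# Keys missing from the frame are ignored by default (errors="ignore")
renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```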

We've used all the information from Kaggle to make the column names more descriptive, but it's still not perfect.  There are still many numbered columns with no description and many of the columns extracted from `data2` have titles that are not descriptive enough.  The next step is to delete unwanted columns.  This is done in the notebook [CleaningAndPreprocessingII](CleaningAndPreprocessingII.ipynb).

The last step in this notebook is to save `master_data` to a file so that it can be loaded in the next notebook.

In [12]:
# Save master_data to a csv file
master_data.to_csv("data/master_data.csv")
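
One caveat worth noting: `to_csv` writes the row index by default, and reading such a file back with `read_csv` produces an extra `Unnamed: 0` column, which is the likely origin of the `Unnamed: 0_x` column seen earlier. A sketch on a toy frame, writing to an in-memory buffer rather than to `data/master_data.csv`:

```python
import io
import pandas as pd

toy = pd.DataFrame({"hash": ["a", "b"], "score": [1, 2]})

# Default behaviour: the index is written as an unlabelled first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so the round trip preserves the columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```

Whether to pass `index=False` in the cell above depends on what the next notebook expects from the file.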