# Cleaning and preprocessing II

In this notebook we continue cleaning the data from [Kaggle](https://www.kaggle.com/datasets/thedevastator/mental-health-in-drug-users-during-covid-19). In the notebook [ProblemStatementStakeholdersKPIs](ProblemStatementStakeholdersKPIs.ipynb) we translated the Spanish columns in both data sets into English. In [CleaningAndPreprocessingI](CleaningAndPreprocessingI.ipynb) we merged the two files on the identifier `hash`, renamed the columns using the column descriptions in the first data file, and saved the result as `master_data.csv`.

The next steps in cleaning the data are to:
* Delete the columns in `master_data.csv` that carry no useful information, and rename the others where a more concise name is possible.
* Scale the responses so that higher entries correspond to better mental health outcomes for all columns.
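
The second step can be sketched as reverse-scoring. This is only an illustration on toy data: the column names, the 1 to 5 range, and the helper `reverse_score` are hypothetical, and the real scales in `master_data.csv` have to be checked column by column.

```python
import pandas as pd

def reverse_score(series, low, high):
    """Flip a numeric scale so that `low` maps to `high` and vice versa."""
    return (low + high) - series

# Hypothetical toy data: two items answered on an assumed 1-5 scale
toy = pd.DataFrame({
    "I_feel_anxious": [1, 4, 5],   # higher = worse, so it needs flipping
    "I_have_energy": [2, 3, 5],    # higher = better, so it is left alone
})
negatively_worded = ["I_feel_anxious"]  # assumed list, chosen by hand

toy_scaled = toy.copy()
for col in negatively_worded:
    toy_scaled[col] = reverse_score(toy_scaled[col], low=1, high=5)

print(toy_scaled)
```

After flipping, a higher entry in every column of `toy_scaled` points in the same direction.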

In [1]:
# Load pandas and the data
import pandas as pd
master_data = pd.read_csv("data/master_data.csv")

## Deleting or renaming columns

There is no simple automated way to delete or rename the columns in `master_data`, because different columns need to be deleted or renamed for different reasons, some of which are subjective.

In [2]:
# Count the number of columns
columns = master_data.columns.values.tolist()
len(columns)

198

There are 198 columns to go through, so examining each one by hand will take a long time. One way to save some time is a loop that asks, for each column, whether to delete it or rename it.
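
However the choices are collected, they amount to a list of columns to drop and a mapping from old names to new names. A minimal sketch of applying such choices in one step, on a toy frame: the helper `apply_column_decisions` and the toy column names are hypothetical and are not part of this notebook's workflow.

```python
import pandas as pd

def apply_column_decisions(df, to_drop, to_rename):
    """Return a new frame with `to_drop` removed and `to_rename` applied."""
    return df.drop(columns=to_drop).rename(columns=to_rename)

# Hypothetical toy frame standing in for master_data
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "12_I_have_energy": [3, 4],
    "hash": ["a", "b"],
})

decisions_drop = ["Unnamed: 0"]
decisions_rename = {"12_I_have_energy": "I_have_energy"}

toy_cleaned = apply_column_decisions(toy, decisions_drop, decisions_rename)
print(toy_cleaned.columns.tolist())
```

Keeping the choices in plain Python objects like these would also make it possible to rerun the cleaning without answering the prompts again.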

In [3]:
# Show the current columns of master_data
columns

['Unnamed: 0',
 'Unnamed: 0_x',
 'Age of the respondent. (Numeric)                                                              ',
 'Gender of the respondent. (Categorical)                                                       ',
 'Country of residence of the respondent. (Categorical)                                         ',
 'Number of times the respondent has used psychedelics. (Numeric)                               ',
 'Number of times the respondent has used other psychoactive drugs. (Numeric)                   ',
 'Number of times the respondent has used psychedelics in the past year. (Numeric)              ',
 'Number of times the respondent has used other psychoactive drugs in the past year. (Numeric)  ',
 'Number of times the respondent has used psychedelics in the past month. (Numeric)             ',
 'Number of times the respondent has used other psychoactive drugs in the past month. (Numeric) ',
 'Number of times the respondent has used psychedelics in the past week. (Num
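
Several names in the listing above end in padding spaces and a type tag such as `(Numeric)`. That part of the renaming is mechanical, and one possible sketch of it follows. The helper `tidy_name` is hypothetical, and it assumes the only tags are `(Numeric)` and `(Categorical)`, which the truncated listing cannot confirm.

```python
import re
import pandas as pd

def tidy_name(name):
    """Strip padding, a trailing type tag, and a final period from a name."""
    name = name.strip()
    # Assumption: the only type tags are (Numeric) and (Categorical)
    name = re.sub(r"\s*\((Numeric|Categorical)\)$", "", name)
    return name.rstrip(".")

# Toy frame using two of the names shown above
toy = pd.DataFrame(columns=[
    "Age of the respondent. (Numeric)      ",
    "Gender of the respondent. (Categorical)   ",
])

# DataFrame.rename accepts a function that is applied to each column name
toy_tidy = toy.rename(columns=tidy_name)
print(toy_tidy.columns.tolist())
```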

The loop makes many changes at once, so it works on a copy named `master_data_cleaned` and leaves `master_data` untouched as a backup.

In [4]:
# Work on a copy so that master_data stays intact as a backup
master_data_cleaned = master_data.copy()
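
As a quick check of why the copy is safe, the sketch below uses a toy frame (not the real data) to show that deleting and renaming columns on a copy leaves the original unchanged. `DataFrame.copy()` makes a deep copy by default.

```python
import pandas as pd

# Toy frame standing in for master_data
original = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
working = original.copy()  # deep copy by default

# The same kinds of edits the loop performs
del working["b"]
working.rename(columns={"a": "alpha"}, inplace=True)

print(original.columns.tolist())  # ['a', 'b']
print(working.columns.tolist())   # ['alpha']
```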

Now the loop that will change the columns:

In [None]:
# Loop that prompts me to delete or rename each column, from last to first
for i in reversed(range(len(columns))):
    # Prompt to delete
    print("Column index:", i)
    print("Column name:", columns[i])
    delete_choice = input("Delete column (y or n)?  ")
    # Make sure the input is valid
    while delete_choice not in ("y", "n"):
        delete_choice = input("Enter y or n.  ")
    if delete_choice == "y":
        # Delete if yes
        del master_data_cleaned[columns[i]]
        print()
    else:
        # Prompt to rename if no
        rename_choice = input("Rename the column (y or n)?  ")
        # Make sure the input is valid
        while rename_choice not in ("y", "n"):
            rename_choice = input("Enter y or n.  ")
        # Rename if yes, otherwise leave the column as it is
        if rename_choice == "y":
            new_name = input("Enter the new name: ")
            master_data_cleaned.rename(columns={columns[i]: new_name}, inplace=True)
            # Verify the name was changed
            new_columns = master_data_cleaned.columns.values.tolist()
            master_index = new_columns.index(new_name)
            print("Now new_columns[", master_index, "] = ", new_columns[master_index] + ".")
        print()
# Show the new data frame
master_data_cleaned

Column index: 197
Column name: 13_login_disclaimer_fork
Delete column (y or n)?  y

Column index: 196
Column name: 12_I_feel_comfortable_if_there_are_people_who_don't_like_me
Delete column (y or n)?  n
Rename the column (y or n)?  y
Enter the new name: I_feel_comfortable_if_there_are_people_who_don't_like_me
Now new_columns[ 196 ] =  I_feel_comfortable_if_there_are_people_who_don't_like_me.

Column index: 195
Column name: 12_I_have_energy_for_what_I_have_to_do
