## Using categorical data in mathematical models

You have probably heard of studies that fit models to categorical data. For instance, you might have heard that the incidence of a particular type of cancer is higher among people who work in nail salons. To reach such a conclusion, it is often necessary to consider several categorical variables (profession, zip code, etc.) along with quantitative variables such as age.

But how can a string value like profession be converted into a number for use in statistical modeling?

Let's start by considering a *binary* variable: one that has only two possible values.

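One way to convert such a variable into numbers is through what's called 'one-hot encoding'. This represents each category as its own column, which holds 1 for rows that belong to that category and 0 otherwise.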
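
As a rough illustration of the idea, the sketch below one-hot encodes a binary variable with `pd.get_dummies`. The `toy` table, its `turf_contact` column, and the sample names are invented for this example and are not taken from the GCMP metadata.

```python
import pandas as pd

# Toy data, invented for illustration (not from the GCMP metadata)
toy = pd.DataFrame(
    {"turf_contact": ["yes", "no", "no", "yes"]},
    index=["s1", "s2", "s3", "s4"],
)

# One indicator column per category; dtype=int gives 0/1 rather than True/False
one_hot = pd.get_dummies(toy["turf_contact"], prefix="turf_contact", dtype=int)
print(one_hot)

# For a binary variable the two columns mirror each other, so a single
# 0/1 column (here: turf_contact_yes) carries the same information
binary = pd.get_dummies(
    toy["turf_contact"], prefix="turf_contact", drop_first=True, dtype=int
)
print(binary)
```

Each row of `one_hot` contains exactly one 1, which is where the name 'one-hot' comes from.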

#### Set up required library imports and file paths


In [7]:
from qiime2 import Metadata
import pandas as pd
import numpy as np
from os.path import exists,join
from os import mkdir
from IPython.display import HTML
metadata_file = "../input/GCMP_EMP_map_r28_no_empty_samples.txt"
output_dir = "../output/"
required_files = [metadata_file]

#### Check that all required files are present

In [3]:
# Stop early if any required input file is missing
for required_file in required_files:
    if not exists(required_file):
        raise IOError(f"Required file {required_file} not found. Please ensure it exists at that path.")
print("Done.")

# Create the output directory on first run
if not exists(output_dir):
    print(f"Output directory {output_dir} does not yet exist, creating it...")
    mkdir(output_dir)
    print("Done.")

Done.


#### Load GCMP metadata

First we'll load the GCMP metadata table as a QIIME 2 `Metadata` object, then extract a pandas DataFrame from it so that the categorical variables can be one-hot encoded.
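
Placeholder labels such as 'Unknown' need to be removed before encoding, because otherwise they are treated as a category of their own. The sketch below shows this on a toy table; the table and its sample names are invented for illustration, and only the column name `ocean_area` and the placeholder strings mirror the notebook.

```python
import pandas as pd

# Toy table, invented for illustration; "Unknown" stands in for a missing label
toy = pd.DataFrame(
    {"ocean_area": ["Red Sea", "Unknown", "Tasman Sea", "Red Sea"]},
    index=["s1", "s2", "s3", "s4"],
)

# Without filtering, the placeholder becomes its own indicator column
unfiltered = pd.get_dummies(toy, dtype=int)

# Dropping placeholder rows first keeps only real categories
placeholders = ["", "Unknown", "unknown", "Missing: Not collected", "Not applicable"]
kept = toy[toy["ocean_area"].notnull() & ~toy["ocean_area"].isin(placeholders)]
filtered = pd.get_dummies(kept, dtype=int)

print(list(unfiltered.columns))
print(list(filtered.columns))
```

Note that filtering removes whole samples (here `s2`), so the encoded table can have fewer rows than the original metadata.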

In [10]:

categorical_cols = ["complex_robust","taxonomy_string_to_family","tissue_compartment","binary_turf_contact","ocean_area"]
numerical_cols = ["temperature","depth","latitude","colony_width1","longitude"]

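def get_numerical_values_only(metadata_filepath, categorical_cols, numerical_cols,
                              forbidden_values=('', 'Unknown', 'unknown', np.nan,
                                                'Missing: Not collected', 'Not applicable')):
    """Return a purely numerical table built from a QIIME 2 metadata file.

    Samples with missing or placeholder values in any requested column are
    dropped, the categorical columns are one-hot encoded, and the numerical
    columns are joined back on by sample ID.
    """
    # Load from the filepath argument rather than the global metadata_file
    metadata = Metadata.load(metadata_filepath)
    df = metadata.to_dataframe()

    # Drop samples whose category is missing or is a placeholder value
    for cat in categorical_cols:
        df = df[df[cat].notnull()]
        for forbidden_value in forbidden_values:
            df = df[df[cat] != forbidden_value]

    # One indicator column per category; dtype=int keeps the values as 0/1
    dummies = pd.get_dummies(df[categorical_cols], dtype=int)
    dummies = dummies.join(df[numerical_cols], how="inner")

    # Drop samples with missing numerical values
    for col in numerical_cols:
        dummies = dummies[dummies[col] != 'NaN']
        dummies = dummies[dummies[col].notnull()]
        dummies = dummies[dummies[col] != 'Missing: Not collected']
        dummies = dummies[dummies[col] != '']

    return dummies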

dummies = get_numerical_values_only(metadata_file,categorical_cols,numerical_cols)
dummies.to_csv(join(output_dir,"one_hot_encoded_metadata.tsv"),sep="\t")
dummies




#SampleID,complex_robust_complex,complex_robust_outgroup,complex_robust_robust,taxonomy_string_to_family_Cnidaria_Anthozoa_Actiniaria_Actiniidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Actiniaria_Aiptasiidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Actiniaria_Stichodactylidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Alcyonacea_Alcyoniidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Alcyonacea_Tubiporidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Alcyonacea_Xeniidae,taxonomy_string_to_family_Cnidaria_Anthozoa_Corallimorpharia_Discosomatidae,...,ocean_area_Eastern Pacific,ocean_area_Red Sea,ocean_area_South China Sea,ocean_area_Tasman Sea,ocean_area_Western Indian,temperature,depth,latitude,colony_width1,longitude
10895.E1.10.Poc.dami.1.20140728.M,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,22.8,4.8768,-14.68463,25,145.44175
10895.E1.10.Poc.dami.1.20140728.S,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,22.8,4.8768,-14.68463,25,145.44175
10895.E1.10.Poc.dami.1.20140728.T,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,22.8,4.8768,-14.68463,25,145.44175
10895.E1.10.Poc.dami.1.20140731.M,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,23.9,1.524,-14.68716,17,145.44449
10895.E1.10.Poc.dami.1.20140731.S,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,23.9,1.524,-14.68716,17,145.44449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10895.E9.Out.Mil.sp.1.20150821.T,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,23.89,9.14,-21.0175,25,55.238333
10895.E9.Out.Sar.sp.1.20150821.M,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,23.89,9.45,-21.0175,13,55.238333
10895.E9.Out.Sar.sp.1.20150821.S,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,23.89,9.45,-21.0175,13,55.238333
10895.E9.Out.Sar.sp.1.20150821.T,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,23.89,9.45,-21.0175,13,55.238333
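
Once every column is numerical, a table like this can be passed to a statistical model. The sketch below fits an ordinary least-squares model with NumPy on a toy table; the data, the response `y`, and the coefficient values are invented for illustration and are not results from the GCMP dataset. One assumption worth flagging: when a linear model includes an intercept, one indicator column per categorical variable is usually dropped (`drop_first=True`), because the full set of indicators sums to 1 in every row and is therefore collinear with the intercept. The table above keeps all indicator columns.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration. The response is constructed, without
# noise, as y = 2.0 + 0.5 * depth + 3.0 * (group == "robust")
toy = pd.DataFrame({
    "group": ["complex", "robust", "complex", "robust", "complex", "robust"],
    "depth": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
})
y = 2.0 + 0.5 * toy["depth"] + 3.0 * (toy["group"] == "robust")

# drop_first=True omits the reference category ("complex") so the remaining
# indicator column is not collinear with the intercept
X = pd.get_dummies(toy, columns=["group"], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)

# Ordinary least squares: find coefficients minimizing ||X @ coef - y||
coef, *_ = np.linalg.lstsq(X.to_numpy(), y.to_numpy(dtype=float), rcond=None)
fit = dict(zip(X.columns, coef))
print(fit)
```

Here the coefficient on `group_robust` is read as the difference between the 'robust' group and the reference 'complex' group at the same depth.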
