### Working with tabular data

We're going to predict whether a person's salary is at least 50k or below it, given some data about that person.

In [1]:
from fastai import *
from fastai.tabular import *

In [6]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path / 'adult.csv')

In [11]:
dep_var = 'salary'  # column we're predicting

# independent variables
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']  # categorical
cont_names = ['age', 'fnlwgt', 'education-num'] # continuous

# pre-processing steps ("procs"), applied once ahead of time; the tabular counterpart of transforms on image data
procs = [
            FillMissing,  # Fill in missing values in continuous columns and flag them in a `<col>_na` column
            Categorify,   # Find categorical variables and turn them into Pandas categories
            Normalize     # Scale continuous variables to zero mean and unit standard deviation
        ]

Then we've got something which is a lot like transforms in computer vision. Transforms in computer vision do things like flip a photo on its axis, rotate it a bit, brighten it, or normalize it. For tabular data, instead of transforms, we have things called processes (`procs`). They're nearly identical, but the key difference, which is quite important, is that a process is something that happens ahead of time: we pre-process the data frame once rather than doing it as we go. So transforms are really for data augmentation, where we want to randomize the change and do it differently each time, whereas processes are the things you want to do once, ahead of time.
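To make "once, ahead of time" concrete, here is a minimal pandas sketch of the idea. It is not fastai's implementation: the toy frames, the `apply_procs` helper, and the choice of median fill are assumptions made for illustration. The point it shows is that the statistics are computed from the training rows only and then reused unchanged on any other rows.

```python
import pandas as pd

# Toy training and validation frames (made-up values, not from the Adult dataset).
train = pd.DataFrame({
    "occupation": ["Sales", "Tech-support", "Sales", "Craft-repair"],
    "education-num": [9.0, None, 13.0, 10.0],
})
valid = pd.DataFrame({
    "occupation": ["Tech-support", "Sales"],
    "education-num": [None, 12.0],
})

# Compute every statistic once, from the training rows only.
median = train["education-num"].median()            # fill value for missing entries
categories = sorted(train["occupation"].unique())   # fixed category vocabulary
filled_train = train["education-num"].fillna(median)
mean, std = filled_train.mean(), filled_train.std()

def apply_procs(df):
    """Apply the stored training statistics to any frame (hypothetical helper)."""
    out = df.copy()
    # FillMissing-like step: flag missing rows, then fill with the training median.
    out["education-num_na"] = out["education-num"].isna()
    out["education-num"] = out["education-num"].fillna(median)
    # Categorify-like step: the same category-to-code mapping for every frame.
    out["occupation"] = pd.Categorical(out["occupation"], categories=categories)
    # Normalize-like step: subtract the training mean, divide by the training std.
    out["education-num"] = (out["education-num"] - mean) / std
    return out

train_proc = apply_procs(train)
valid_proc = apply_procs(valid)
print(valid_proc)
```

Because nothing here is random, running it twice gives the same frame, which is the contrast with augmentation-style transforms.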

In [12]:
test = (TabularList
           .from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names))

In [13]:
data = (TabularList
           .from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
           .split_by_idx(list(range(800, 1000)))
           .label_from_df(cols=dep_var)
           .add_test(test)
           .databunch())

In [14]:
data.show_batch(rows=10)

| workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 0.5434 | 0.2778 | -0.4224 | >=50k |
| Private | HS-grad | Married-civ-spouse | Handlers-cleaners | Husband | White | False | -1.0692 | -0.827 | -0.4224 | <50k |
| Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | -0.9226 | 1.5966 | -0.4224 | <50k |
| Private | 12th | Divorced | Machine-op-inspct | Not-in-family | White | False | 1.203 | -0.9532 | -0.8135 | <50k |
| Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | -1.0692 | -0.9038 | -0.0312 | <50k |
| Federal-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 0.9098 | -0.4711 | -0.4224 | >=50k |
| Private | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 0.7632 | 0.0559 | -0.0312 | >=50k |
| Local-gov | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 0.9098 | 3.4002 | 1.1422 | <50k |
| Private | HS-grad | Never-married | Sales | Own-child | White | False | -1.509 | 1.0727 | -0.4224 | <50k |
| Private | Some-college | Separated | Prof-specialty | Unmarried | White | False | 0.3235 | -1.4591 | -0.0312 | <50k |


In [15]:
learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)  # we'll learn about the layers param later

In [16]:
learn.fit(1, 1e-2)  # epochs, learn_rate

| epoch | train_loss | valid_loss | accuracy | time |
| --- | --- | --- | --- | --- |
| 0 | 0.357828 | 0.394037 | 0.83 | 00:04 |
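As a quick sanity check on what that accuracy means, the validation set is the 200 rows selected by `split_by_idx(list(range(800, 1000)))`. The sketch below converts the reported fraction back into a count of correct rows, assuming the displayed value is not rounded.

```python
valid_idx = list(range(800, 1000))   # same indices passed to split_by_idx
n_valid = len(valid_idx)             # number of validation rows
accuracy = 0.83                      # value copied from the training output

# Accuracy is correct predictions divided by validation rows, so invert it.
n_correct = round(accuracy * n_valid)
print(n_valid, n_correct)
```

With so few validation rows, each single prediction moves the accuracy by 0.005, so small differences between runs should not be over-interpreted.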


In [17]:
row = df.iloc[0]

In [19]:
learn.predict(row)

(Category >=50k, tensor(1), tensor([0.4094, 0.5906]))
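The three values returned by `learn.predict` are the predicted category, its class index, and the per-class probabilities. The plain-Python sketch below shows how they relate to one another. The class order in `classes` is an assumption, chosen to be consistent with index 1 mapping to `>=50k` in the output above.

```python
classes = ["<50k", ">=50k"]   # assumed class order (index 1 is >=50k in the output)
probs = [0.4094, 0.5906]      # probabilities copied from the output above

# The predicted index is the position of the largest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_class = classes[pred_idx]
print(pred_class, pred_idx, probs[pred_idx])
```

A probability of about 0.59 is only a little above one half, so the model is not especially confident about this row.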