# Tabular models

In [1]:
from fastai.tabular import *

Untar the data from `URLs.ADULT_SAMPLE` and save the result to `path`. Read the data from `path/'adult.csv'` to the variable `df`.

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [7]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |


Set `dep_var` to `'salary'`. Set `cat_names` to:
- 'workclass'
- 'education'
- 'marital-status'
- 'occupation'
- 'relationship'
- 'race'

Set `cont_names` to:
- 'age'
- 'fnlwgt'
- 'education-num'

Set `procs` to:
- FillMissing
- Categorify
- Normalize

In [3]:
dep_var = 'salary'

cat_names = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race'
]

cont_names = [
    'age',
    'fnlwgt',
    'education-num'
]

procs = [
    FillMissing,
    Categorify,
    Normalize
]
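
To make the three `procs` less opaque, here is a minimal pure-Python sketch of what each step does conceptually. It is not fastai's implementation: the helper names `fill_missing`, `categorify`, and `normalize` are hypothetical, the toy columns are made up for illustration, and the details (median fill, code 0 reserved for missing categories, sample standard deviation) are assumptions about the default behavior that you should check against your installed version.

```python
from statistics import mean, median, stdev

def fill_missing(values):
    """Replace None with the median of the present values and flag which rows were missing."""
    present = [v for v in values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def categorify(values):
    """Map each distinct category to an integer code; 0 is reserved for missing values."""
    vocab = sorted({v for v in values if v is not None})
    code_of = {cat: i + 1 for i, cat in enumerate(vocab)}
    return [code_of.get(v, 0) for v in values], vocab

def normalize(values):
    """Shift to zero mean and scale by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / (sigma + 1e-7) for v in values]

# Toy columns (illustrative only, not the real dataset).
education_num = [12.0, 14.0, None, 15.0, None]
workclass = ['Private', 'Private', None, 'Self-emp-inc']

filled, was_missing = fill_missing(education_num)
codes, vocab = categorify(workclass)
scaled = normalize(filled)

print(filled)       # gaps replaced by the median
print(was_missing)  # analogous to an `education-num_na` flag column
print(codes, vocab)
print(scaled)
```

The order matters in this sketch: missing values are filled before normalizing, so the mean and standard deviation are computed on a complete column.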

Create a `TabularList` from rows 800 to 999 of `df` (`df.iloc[800:1000]`, where the end index is exclusive) to use as the test set. Assign it to the variable `test`.

In [8]:
test = TabularList.from_df(df.iloc[800:1000].copy(), cat_names=cat_names, cont_names=cont_names, procs=procs)

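Build the data bunch: create a `TabularList` from `df` with `TabularList.from_df`, reusing `cat_names`, `cont_names`, and `procs`; split by index so that rows 800 to 999 form the validation set; label from the `dep_var` column; add the test set you created; and turn it all into a data bunch with a batch size of 48.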

In [12]:
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800,1000)))
        .label_from_df(dep_var)
        .add_test(test)
        .databunch(bs=48))
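
The index-based split can be pictured with a small pure-Python sketch. The helper below is hypothetical and only mirrors the idea behind `split_by_idx`: rows whose position appears in the index list go to the validation set, and everything else stays in the training set. The row count of 1200 is made up for illustration.

```python
def split_by_idx(rows, valid_idx):
    """Partition rows into (train, valid) according to a list of validation positions."""
    valid_set = set(valid_idx)
    train = [r for i, r in enumerate(rows) if i not in valid_set]
    valid = [r for i, r in enumerate(rows) if i in valid_set]
    return train, valid

rows = list(range(1200))            # stand-in for a table with 1200 rows
valid_idx = list(range(800, 1000))  # positions 800..999, as in the cell above

train, valid = split_by_idx(rows, valid_idx)
print(len(train), len(valid), valid[0], valid[-1])  # 1000 200 800 999
```

Note that `range(800, 1000)` stops at 999, which matches the exclusive end of the `iloc[800:1000]` slice.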

Show a batch.

In [13]:
data.show_batch()

workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,target
Private,10th,Never-married,Machine-op-inspct,Not-in-family,White,False,-0.776,-0.4634,-1.5958,<50k
Self-emp-not-inc,Doctorate,Married-civ-spouse,Prof-specialty,Husband,White,False,0.6166,-0.7821,2.3157,>=50k
Private,Some-college,Never-married,Craft-repair,Not-in-family,White,False,-0.776,2.0918,-0.0312,<50k
Self-emp-not-inc,Assoc-voc,Never-married,Craft-repair,Not-in-family,White,False,-0.4828,-0.8722,0.3599,<50k
Private,Doctorate,Married-civ-spouse,Exec-managerial,Husband,White,False,0.0303,0.9203,2.3157,>=50k


Create a learner using `tabular_learner` with layers [200, 100] and using the `accuracy` metric.

In [14]:
learn = tabular_learner(data, layers=[200,100], metrics=[accuracy])
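
As a rough illustration of what an accuracy metric measures for a classifier like this one, the sketch below takes the highest-scoring class per row and compares it with the target index. This is a standalone stand-in with made-up scores, not fastai's `accuracy` function, which operates on tensors.

```python
def accuracy(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target class index."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# Made-up per-class scores for four rows and two classes.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
targets = [0, 1, 1, 1]

print(accuracy(scores, targets))  # 0.75
```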

Fit the learner for one epoch with `fit_one_cycle`, using a maximum learning rate of 1e-2.

In [17]:
learn.fit_one_cycle(1, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.361656 | 0.377377 | 0.81 | 00:08 |
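
The `1e-2` passed to `fit_one_cycle` is a peak rather than a constant learning rate. The sketch below shows one plausible shape of a one-cycle schedule: a cosine warm-up to the peak followed by a cosine decay. The function names are hypothetical, the step count of 100 is made up, and the values `div_factor=25`, `pct_start=0.3`, and the final divisor are assumptions about the defaults, so treat this as an illustration rather than the library's exact schedule.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3):
    """Learning rate at a given step: warm up to max_lr, then decay far below it."""
    warmup_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor             # assumed starting learning rate
    final_lr = max_lr / (div_factor * 1e4)   # assumed final learning rate
    if step < warmup_steps:
        return annealing_cos(low_lr, max_lr, step / warmup_steps)
    decay_pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return annealing_cos(max_lr, final_lr, decay_pct)

total = 100  # made-up number of training batches
schedule = [one_cycle_lr(s, total, max_lr=1e-2) for s in range(total + 1)]

print(schedule[0], schedule[30], schedule[-1])
```

Under these assumptions the rate starts at 1/25 of the peak, reaches the peak 30% of the way through training, and ends several orders of magnitude below it.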


## Inference

Get a single row.

In [24]:
row = df.iloc[0,:]

Call `learn.predict` on it.

In [25]:
learn.predict(row)

(Category >=50k, tensor(1), tensor([0.3542, 0.6458]))
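
The three elements of the result can be read as the predicted category, its class index, and the per-class probabilities. The sketch below reproduces that relationship in plain Python using the probabilities printed above. The class order `['<50k', '>=50k']` is an assumption, chosen because it is consistent with index 1 mapping to `>=50k` in the output.

```python
classes = ['<50k', '>=50k']  # assumed class order, consistent with tensor(1) -> '>=50k'
probs = [0.3542, 0.6458]     # probabilities copied from the output above

# The predicted index is the position of the largest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_label = classes[pred_idx]

print(pred_label, pred_idx, round(sum(probs), 4))  # >=50k 1 1.0
```

The probabilities sum to 1, so the second value can also be read as the model's estimated probability that this row belongs to the `>=50k` class.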