### Copy Training and Prediction Data

In [1]:
%%bash
mkdir -p /content/mldata
wget https://storage.googleapis.com/cloud-datalab/sampledata/ml/census/census_train.csv -P /content/mldata -q
wget https://storage.googleapis.com/cloud-datalab/sampledata/ml/census/census_test.csv -P /content/mldata -q
wget https://storage.googleapis.com/cloud-datalab/sampledata/ml/census/census_predict.csv -P /content/mldata -q

### Browse and Explore Your CSV Data

View the first few lines of the file. The "columns" line in the cell is optional; without it, the columns are named col0, col1, ..., coln.

In [2]:
%%csv view -i /content/mldata/census_train.csv
columns: label, age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country

age,capital-gain,capital-loss,education,education-num,fnlwgt,hours-per-week,label,marital-status,native-country,occupation,race,relationship,sex,workclass
39,2174,0,Bachelors,13,77516,40,<=50K,Never-married,United-States,Adm-clerical,White,Not-in-family,Male,State-gov
50,0,0,Bachelors,13,83311,13,<=50K,Married-civ-spouse,United-States,Exec-managerial,White,Husband,Male,Self-emp-not-inc
38,0,0,HS-grad,9,215646,40,<=50K,Divorced,United-States,Handlers-cleaners,White,Not-in-family,Male,Private
53,0,0,11th,7,234721,40,<=50K,Married-civ-spouse,United-States,Handlers-cleaners,Black,Husband,Male,Private
28,0,0,Bachelors,13,338409,40,<=50K,Married-civ-spouse,Cuba,Prof-specialty,Black,Wife,Female,Private


Get per-column statistics with --profile. The optional -n flag sets the number of lines to read (default: 5).

In [None]:
%%csv view -i /content/mldata/census_train.csv --profile -n 200
columns: label, age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
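
Outside Datalab, a similar per-column summary can be approximated with the Python standard library. This is a minimal sketch, not the implementation behind --profile: profile_csv is a hypothetical helper, and the inline sample uses only four of the columns from the rows shown above.

```python
import csv
import io
from collections import Counter


def profile_csv(lines, columns, n=5):
    """Summarize the first n rows of a header-less CSV.

    Numeric columns get min/max/mean; all other columns get value counts.
    """
    values = {name: [] for name in columns}
    for i, row in enumerate(csv.reader(lines)):
        if i >= n:
            break
        for name, cell in zip(columns, row):
            values[name].append(cell.strip())

    profile = {}
    for name, cells in values.items():
        if not cells:
            continue  # nothing was read for this column
        try:
            nums = [float(c) for c in cells]
        except ValueError:
            # At least one cell is not a number: treat the column as text.
            profile[name] = {"type": "text", "counts": dict(Counter(cells))}
        else:
            profile[name] = {
                "type": "numeric",
                "min": min(nums),
                "max": max(nums),
                "mean": sum(nums) / len(nums),
            }
    return profile


# Hypothetical sample built from values in the rows displayed above.
sample = io.StringIO(
    "39,Bachelors,40,<=50K\n"
    "50,Bachelors,13,<=50K\n"
    "38,HS-grad,40,<=50K\n"
    "53,11th,40,<=50K\n"
    "28,Bachelors,40,<=50K\n"
)
result = profile_csv(sample, ["age", "education", "hours-per-week", "label"], n=5)
print(result["age"])
print(result["education"])
```

To profile the real file, pass an open file object for /content/mldata/census_train.csv together with the full column list and n=200.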

### Infer Schema and Generate Feature Class
Run the following command. It writes a feature class definition into the input of the next cell.
Note that --target (and --key) can be either a column name or a 0-based index into the columns, where -1 means the last column.

In [None]:
%%ml features --csv /content/mldata/census_train.csv --target label
columns: label, age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
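
The name-or-index rule for --target and --key can be illustrated with a short sketch. resolve_column is a hypothetical helper, not part of the %%ml magic; it assumes Python-style indexing, so negative values other than -1 also count from the end, which the magic may or may not accept.

```python
def resolve_column(columns, ref):
    """Return the column name that ref points to.

    ref may be a column name, or a 0-based index (int or numeric string)
    where -1 means the last column.
    """
    try:
        index = int(ref)
    except (TypeError, ValueError):
        # Not an integer: treat ref as a column name.
        if ref not in columns:
            raise ValueError("unknown column: %r" % (ref,))
        return ref
    if not -len(columns) <= index < len(columns):
        raise IndexError("column index out of range: %d" % index)
    return columns[index]


columns = [
    "label", "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]

print(resolve_column(columns, "label"))  # label
print(resolve_column(columns, 0))        # label
print(resolve_column(columns, "-1"))     # native-country
```

With this column list, --target label and --target 0 would therefore select the same column.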

### Define feature class
Now the feature class is generated. Modify it as appropriate, such as converting a column from text to categorical. Then execute the cell.

In [4]:
%%tensorflow feature

import google.cloud.ml.features as features


class CsvFeatures(features.CsvFeatureSet):
  """ This class is generated from command line:
         %%ml csv-schema ...
         Please modify it as appropriate!!!
  """

  def __init__(self):
    columns = 'label','age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country'
    super(CsvFeatures, self).__init__(columns)

  target = features.target('label').classification()
  attrs = [
      features.numeric('age').min_max_scale(-1.0, 1.0),
      features.numeric('capital-gain').min_max_scale(-1.0, 1.0),
      features.numeric('capital-loss').min_max_scale(-1.0, 1.0),
      features.numeric('education-num').min_max_scale(-1.0, 1.0),
      features.numeric('fnlwgt').min_max_scale(-1.0, 1.0),
      features.numeric('hours-per-week').min_max_scale(-1.0, 1.0),
      features.categorical('education').one_of_k(),
      features.categorical('marital-status').one_of_k(),
      features.categorical('occupation').one_of_k(),
      features.categorical('race').one_of_k(),
      features.categorical('relationship').one_of_k(),
      features.categorical('sex').one_of_k(),
      features.categorical('workclass').one_of_k(),
  ]
  native_country = features.text('native-country').bag_of_words(vocab_size=10000)
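
To give a sense of what the two most common transforms in this class mean, here is a minimal standalone sketch. The functions below only borrow the names min_max_scale and one_of_k; they use the standard textbook formulas and are an assumption about, not a copy of, what google.cloud.ml.features does internally. The value range and vocabulary are made up for illustration.

```python
def min_max_scale(value, data_min, data_max, out_min=-1.0, out_max=1.0):
    """Linearly map value from [data_min, data_max] to [out_min, out_max]."""
    if data_max == data_min:
        # Degenerate range: every value is the same, so pick the lower bound.
        return out_min
    fraction = (value - data_min) / (data_max - data_min)
    return out_min + fraction * (out_max - out_min)


def one_of_k(value, vocabulary):
    """Encode value as a one-hot list using a {category: index} vocabulary."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector


# Hypothetical range: hours-per-week observed between 0 and 80.
print(min_max_scale(40, 0, 80))  # 0.0

# Hypothetical vocabulary for the workclass column.
workclass_vocab = {"Private": 0, "State-gov": 1, "Self-emp-not-inc": 2}
print(one_of_k("State-gov", workclass_vocab))  # [0.0, 1.0, 0.0]
```

Scaling keeps numeric inputs in a common range, while one-of-k encoding avoids implying an order among categories.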


### Preprocess Training and Testing Data
This step outputs the preprocessed training data, the preprocessed test data, and metadata generated from the training data.

In [5]:
%%ml preprocess -o /content/mldata/
train: /content/mldata/census_train.csv
test: /content/mldata/census_test.csv

### Take a peek at the generated metadata

In [10]:
%%bash
cat /content/mldata/metadata.yaml

columns:
  age:
    max: 90.0
    mean: 38.58164675532078
    min: 17.0
    name: age
    scale:
      max: 1.0
      min: -1.0
    transform: scale
    type: numeric
  capital-gain:
    max: 99999.0
    mean: 1077.6488437087312
    min: 0.0
    name: capital-gain
    scale:
      max: 1.0
      min: -1.0
    transform: scale
    type: numeric
  capital-loss:
    max: 4356.0
    mean: 87.303829734959
    min: 0.0
    name: capital-loss
    scale:
      max: 1.0
      min: -1.0
    transform: scale
    type: numeric
  education:
    items:
      10th: 0
      11th: 1
      12th: 2
      1st-4th: 3
      5th-6th: 4
      7th-8th: 5
      9th: 6
      Assoc-acdm: 7
      Assoc-voc: 8
      Bachelors: 9
      Doctorate: 10
      HS-grad: 11
      Masters: 12
      Preschool: 13
      Prof-school: 14
      Some-college: 15
    name: education
    transform: one_of_k
    type: categorical
  education-num:
    max: 16.0
    mean: 10.0806793403151
    min: 1.0
    name: education-num
    scale

### Preprocessing Prediction Data
Use the metadata generated from the training data to preprocess the prediction data.

In [11]:
%%ml preprocess -o /content/mldata/
predict: /content/mldata/census_predict.csv
metadata: /content/mldata/metadata.yaml
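
Reusing the metadata matters because prediction rows must be encoded with the statistics and vocabularies learned from the training data, not with their own. The sketch below illustrates the idea with an excerpt of the metadata printed above, written as a Python dict. apply_metadata is a hypothetical helper, and the linear scaling formula is an assumption about how the "scale" transform is applied.

```python
# Excerpt of metadata.yaml as shown above (age and education only).
metadata = {
    "age": {
        "type": "numeric", "transform": "scale",
        "min": 17.0, "max": 90.0,
        "scale": {"min": -1.0, "max": 1.0},
    },
    "education": {
        "type": "categorical", "transform": "one_of_k",
        "items": {
            "10th": 0, "11th": 1, "12th": 2, "1st-4th": 3, "5th-6th": 4,
            "7th-8th": 5, "9th": 6, "Assoc-acdm": 7, "Assoc-voc": 8,
            "Bachelors": 9, "Doctorate": 10, "HS-grad": 11, "Masters": 12,
            "Preschool": 13, "Prof-school": 14, "Some-college": 15,
        },
    },
}


def apply_metadata(row, metadata):
    """Encode one raw row using training-time statistics and vocabularies."""
    encoded = {}
    for name, spec in metadata.items():
        raw = row[name]
        if spec["transform"] == "scale":
            lo, hi = spec["scale"]["min"], spec["scale"]["max"]
            fraction = (float(raw) - spec["min"]) / (spec["max"] - spec["min"])
            encoded[name] = lo + fraction * (hi - lo)
        elif spec["transform"] == "one_of_k":
            vector = [0.0] * len(spec["items"])
            vector[spec["items"][raw]] = 1.0
            encoded[name] = vector
    return encoded


# A raw prediction row, as strings read from a CSV file.
encoded = apply_metadata({"age": "39", "education": "Bachelors"}, metadata)
print(encoded["age"])                    # about -0.397
print(encoded["education"].index(1.0))   # 9
```

Because the same metadata is applied to every data set, an age of 39 maps to the same scaled value in the training, test, and prediction data.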

### Now take a look at all the data we preprocessed

In [12]:
%%bash
ls /content/mldata

census_predict.csv
census_test.csv
census_train.csv
metadata.yaml
preprocessed_predict
preprocessed_test
preprocessed_train
