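This notebook uses the PetFinder.my mini dataset to predict whether a pet will be adopted. The main steps are:
1. Load the CSV file into a DataFrame with pandas
2. Build an input pipeline with tf.data to batch and shuffle the rows
3. Process the feature columns with Keras preprocessing layers
4. Build, train, and evaluate the model with Keras built-in methods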

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras import layers

### Load the dataset

In [3]:
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')

dataframe = pd.read_csv(csv_file)

In [4]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,2
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,0
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,3
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,"Good guard dog, very alert, active, obedience ...",8,2
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,This handsome yet cute boy is up for adoption....,3,2


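### Preprocess the data
- Derive a binary target column from AdoptionSpeed: 0 means the pet was not adopted, 1 means it was adopted
- Drop the columns that are not used as features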

In [5]:
# In the original dataset, `'AdoptionSpeed'` of `4` indicates
# a pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop unused features.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])
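
The `np.where` call is an element-wise conditional. As a minimal plain-Python sketch of the same mapping, the snippet below applies it to the AdoptionSpeed values of the five rows shown by `dataframe.head()` plus one hypothetical row with speed 4. The function name `to_target` is made up for illustration.

```python
def to_target(adoption_speed):
    """Map AdoptionSpeed to a binary label: 4 means not adopted (0), anything else adopted (1)."""
    return 0 if adoption_speed == 4 else 1

# First five values come from the table above; the trailing 4 is a hypothetical extra row.
adoption_speeds = [2, 0, 3, 2, 2, 4]
targets = [to_target(speed) for speed in adoption_speeds]
print(targets)
```

The first five labels are all 1, which agrees with the `target` column printed in the next cell.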

In [6]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,1,1
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,2,1
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,7,1
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,8,1
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,3,1


### Split into training, validation, and test sets
Shuffle the rows and split them into training, validation, and test sets, here with an 80:10:10 ratio:

In [7]:
train, val, test = np.split(dataframe.sample(frac=1), [int(0.8*len(dataframe)), int(0.9*len(dataframe))])

In [8]:
print(len(train), 'training examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

9229 training examples
1154 validation examples
1154 test examples
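
The three sizes above follow directly from the cut points passed to `np.split`. The sketch below reproduces that arithmetic in plain Python on row indices rather than on the DataFrame. The total of 11537 rows is inferred from the printed sizes, and the function name and seed are hypothetical.

```python
import random

def split_indices(n_rows, cuts=(0.8, 0.9), seed=42):
    """Shuffle row indices and cut them into train/val/test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    train_end = int(cuts[0] * n_rows)
    val_end = int(cuts[1] * n_rows)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = split_indices(11537)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, passing `random_state` to `DataFrame.sample` gives the same kind of repeatability, which makes the test accuracy comparable between runs.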


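### Create an input pipeline with tf.data
Define a utility function that converts each of the training, validation, and test DataFrames into a tf.data.Dataset, then shuffles and batches the data.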

In [9]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  df = dataframe.copy()
  labels = df.pop('target')
  #df = {key: value[:,tf.newaxis] for key, value in dataframe.items()}
  ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds

In [10]:
batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
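
Because `ds.batch` keeps the final partial batch by default, the number of batches per epoch is the row count divided by the batch size, rounded up. Here is a small sketch of that arithmetic in plain Python, using the split sizes printed earlier. The helper name `batch_sizes` is made up for illustration.

```python
import math

def batch_sizes(n_rows, batch_size):
    """Return the size of each batch when the last partial batch is kept."""
    n_batches = math.ceil(n_rows / batch_size)
    return [min(batch_size, n_rows - i * batch_size) for i in range(n_batches)]

train_batches = batch_sizes(9229, 256)
val_batches = batch_sizes(1154, 256)
# Number of batches per epoch, followed by the size of the last batch.
print(len(train_batches), train_batches[-1])
print(len(val_batches), val_batches[-1])
```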

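### Process the feature columns with Keras preprocessing layers

The following four preprocessing layers handle preprocessing, structured data encoding, and feature engineering:

- tf.keras.layers.Normalization: performs feature-wise normalization of input features.
- tf.keras.layers.CategoryEncoding: turns integer categorical features into one-hot, multi-hot, or tf-idf dense representations.
- tf.keras.layers.StringLookup: turns string categorical values into integer indices.
- tf.keras.layers.IntegerLookup: turns integer categorical values into integer indices.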

In [11]:
def get_normalization_layer(name, dataset):
  # Create a Normalization layer for the feature.
  normalizer = layers.Normalization(axis=None)

  # Prepare a Dataset that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer
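
To make the effect of `adapt` concrete, here is a rough plain-Python sketch of what a `Normalization(axis=None)` layer computes: a single mean and variance over all values seen during `adapt`, then `(x - mean) / sqrt(variance)`. The helper name `fit_scalar_normalizer` is hypothetical, the input is just the Fee column of the five rows shown by `dataframe.head()`, and the real layer additionally guards against division by zero with a small epsilon.

```python
import math

def fit_scalar_normalizer(values):
    """Learn one mean/variance pair over all values, like adapt() with axis=None."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
    std = math.sqrt(variance)
    return lambda x: (x - mean) / std

fees = [100.0, 0.0, 0.0, 150.0, 0.0]  # Fee values of the five rows shown earlier
normalize_fee = fit_scalar_normalizer(fees)
normalized = [normalize_fee(fee) for fee in fees]
print([round(v, 3) for v in normalized])
```

After normalization the values have zero mean and unit variance, which keeps features on very different scales (such as fees and photo counts) comparable.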

In [12]:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply multi-hot encoding to the indices. The lambda function captures the
  # layer, so you can use them, or include them in the Keras Functional model later.
  return lambda feature: encoder(index(feature))
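
As an illustration of what the returned lambda does, the plain-Python sketch below mimics the two layers under their default settings. Index 0 is reserved for out-of-vocabulary values, the most frequent values get indices 1 and up, `max_tokens` caps the vocabulary size including the out-of-vocabulary slot, and each index becomes a vector with a single 1. The function names and the sample colors are hypothetical, and tie-breaking between equally frequent values may differ from Keras.

```python
from collections import Counter

OOV = "[UNK]"  # placeholder for values outside the learned vocabulary

def build_vocabulary(values, max_tokens=None):
    """Most frequent values first; slot 0 is reserved for unknown values."""
    limit = None if max_tokens is None else max_tokens - 1
    frequent = [value for value, _ in Counter(values).most_common(limit)]
    return [OOV] + frequent

def encode(value, vocabulary):
    """Look up the index of `value` (0 if unknown) and one-hot encode it."""
    index = vocabulary.index(value) if value in vocabulary[1:] else 0
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]

# Made-up color column with clearly distinct frequencies.
colors = ["Black"] * 5 + ["Brown"] * 4 + ["White"] * 3 + ["Golden"] * 2 + ["Cream"]
vocabulary = build_vocabulary(colors, max_tokens=4)
print(vocabulary)
print(encode("Brown", vocabulary))
print(encode("Cream", vocabulary))  # rare value falls into the unknown slot
```

A small `max_tokens` keeps the encoded vector short at the cost of lumping all rare values together, which matters for high-cardinality columns such as `Breed1`.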

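Normalize the numerical features (the number of pet photos and the adoption fee), appending each Keras input to all_inputs and each normalized output to encoded_features: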

In [13]:
all_inputs = []
encoded_features = []

# Numerical features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

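Convert the integer categorical values in the dataset (the pet's age) to integer indices, apply multi-hot encoding, and append the resulting feature to encoded_features: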

In [14]:
age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')

encoding_layer = get_category_encoding_layer(name='Age',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)

Repeat the same steps for the string categorical values:

In [15]:
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']

for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(name=header,
                                               dataset=train_ds,
                                               dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

### Create, compile, and train the model

In [16]:
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)
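
A quick way to sanity-check the head of this model is to count its trainable weights by hand. The sketch below assumes the 48-wide concatenated feature vector that `model.summary()` reports for this run. A different vocabulary would change that width. Dropout has no weights, so only the two Dense layers contribute.

```python
def dense_params(n_inputs, n_units):
    """A Dense layer holds one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 48  # width of the concatenated features in this run (assumption from the summary)
hidden_params = dense_params(n_features, 32)
output_params = dense_params(32, 1)
print(hidden_params, output_params, hidden_params + output_params)
```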

In [17]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
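
Since the final `Dense(1)` layer has no activation, the model outputs logits, and `from_logits=True` tells the loss to apply the sigmoid itself. As a rough sketch in plain Python (not the Keras implementation), the per-example binary cross-entropy can be computed from a logit in a numerically stable form, and it agrees with the textbook formula on probabilities. The logit and label pairs are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_from_probability(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

examples = [(2.0, 1), (2.0, 0), (-1.5, 1), (0.0, 0)]  # (logit, label) pairs
for z, y in examples:
    print(z, y, round(bce_from_logit(z, y), 6),
          round(bce_from_probability(sigmoid(z), y), 6))
```

Working from logits avoids taking the logarithm of a probability that has already been rounded to 0 or 1.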

In [18]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Age (InputLayer)               [(None, 1)]          0           []                               
                                                                                                  
 Type (InputLayer)              [(None, 1)]          0           []                               
                                                                                                  
 Color1 (InputLayer)            [(None, 1)]          0           []                               
                                                                                                  
 Color2 (InputLayer)            [(None, 1)]          0           []                               
                                                                                              

 category_encoding_9 (CategoryE  (None, 4)           0           ['string_lookup_8[0][0]']        
 ncoding)                                                                                         
                                                                                                  
 category_encoding_10 (Category  (None, 5)           0           ['string_lookup_9[0][0]']        
 Encoding)                                                                                        
                                                                                                  
 concatenate (Concatenate)      (None, 48)           0           ['normalization[0][0]',          
                                                                  'normalization_1[0][0]',        
                                                                  'category_encoding[0][0]',      
                                                                  'category_encoding_1[0][0]',    
          

In [19]:
model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1b1cf483880>

In [20]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.7512997984886169


### Run inference

In [21]:
model.save('my_pet_classifier')
reloaded_model = tf.keras.models.load_model('my_pet_classifier')



INFO:tensorflow:Assets written to: my_pet_classifier\assets




In [25]:
sample = {
    'Type': 'Cat',
    'Age': 3,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
print(input_dict)
predictions = reloaded_model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)

{'Type': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Cat'], dtype=object)>, 'Age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([3])>, 'Breed1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Tabby'], dtype=object)>, 'Gender': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Male'], dtype=object)>, 'Color1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Black'], dtype=object)>, 'Color2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'White'], dtype=object)>, 'MaturitySize': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Small'], dtype=object)>, 'FurLength': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Short'], dtype=object)>, 'Vaccinated': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'No'], dtype=object)>, 'Sterilized': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'No'], dtype=object)>, 'Health': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Healthy'], dtype=object)>, 'Fee': <tf.Tensor: shape=(1,), dtype=int32, nump
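
The last step of the cell above turns the raw model output into a probability. Here is a plain-Python sketch of that conversion, using a made-up logit value rather than an actual output of this model:

```python
import math

def logit_to_probability(logit):
    """Apply the sigmoid so that any real-valued logit maps into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

example_logit = 1.2  # hypothetical model output, not taken from this run
probability = logit_to_probability(example_logit)
message = (
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * probability)
)
print(message)
```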