This tutorial provides an example of how to load pandas DataFrames into a tf.data.Dataset.

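This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV has a few hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.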

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

In [2]:
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')

Downloading data from https://storage.googleapis.com/applied-dl/heart.csv


Read the CSV file using pandas.

In [3]:
df = pd.read_csv(csv_file)

In [4]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0


In [5]:
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which has dtype object in the DataFrame, to discrete numeric codes.

In [6]:
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
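To make the conversion above less opaque, here is a minimal sketch on a hypothetical toy Series (the variable names `toy_thal`, `categorical`, and `code_to_label` are illustrative and not part of the tutorial). `pd.Categorical` sorts the distinct string values, and each code is the position of a value in that sorted category list.

```python
import pandas as pd

# Hypothetical toy column with the same kind of string labels as 'thal'.
toy_thal = pd.Series(['fixed', 'normal', 'reversible', 'normal'])

categorical = pd.Categorical(toy_thal)

# Categories are the sorted distinct values.
print(list(categorical.categories))  # ['fixed', 'normal', 'reversible']

# Each code is the index of the value in the category list.
codes = categorical.codes.tolist()
print(codes)  # [0, 1, 2, 1]

# Keep the mapping if the original labels are needed later.
code_to_label = dict(enumerate(categorical.categories))
print(code_to_label)
```

In the real column the codes for 'fixed', 'normal', and 'reversible' come out as 2, 3, and 4, which suggests that the CSV contains other thal values that sort ahead of them. Saving the mapping before overwriting the column is a cheap way to keep the codes interpretable.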

In [7]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,2,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,4,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


In [8]:
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal           int8
target        int64
dtype: object

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas DataFrame.

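One of the advantages of using tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. Read the loading data guide to find out more.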

In [9]:
target = df.pop('target')

In [10]:
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

In [15]:
for feat, targ in dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))

Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
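The features print as floats even though most columns are integers. That follows from how `DataFrame.values` works: it returns a single NumPy array, so mixed integer and float columns are upcast to a common dtype. The sketch below shows this on a hypothetical toy frame (`toy_df` and `features32` are illustrative names). Casting to float32 is one optional way to match the default dtype of Keras layers; the tutorial itself does not do this.

```python
import pandas as pd

# Hypothetical toy frame: two integer columns and one float column.
toy_df = pd.DataFrame({
    'age': [63, 67],
    'sex': [1, 1],
    'oldpeak': [2.3, 1.5],
})

# .values builds one array, so int64 and float64 columns share float64.
values = toy_df.values
print(values.dtype)  # float64

# Optional: cast explicitly to float32 before building a dataset.
features32 = toy_df.values.astype('float32')
print(features32.dtype, features32.shape)  # float32 (2, 3)
```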


Since a pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.

In [17]:
tf.constant(df['thal'])

<tf.Tensor: id=54, shape=(303,), dtype=int32, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3
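The NumPy side of the same protocol can be sketched as follows, using a hypothetical toy Series (`toy_series` and `as_array` are illustrative names). `np.asarray` asks the object for its array form through `__array__`, which is the same hook that lets `tf.constant` accept a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series standing in for df['thal'].
toy_series = pd.Series([2, 3, 4, 3], name='thal')

# The Series exposes the __array__ hook that NumPy looks for.
print(hasattr(toy_series, '__array__'))  # True

# np.asarray converts the Series to a plain ndarray.
as_array = np.asarray(toy_series)
print(type(as_array).__name__, as_array.tolist())  # ndarray [2, 3, 4, 3]
```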

Shuffle and batch the dataset.

In [18]:
train_dataset = dataset.shuffle(len(df)).batch(1)
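As a rough illustration of what `shuffle` and `batch` do, the sketch below applies them to a hypothetical five-row toy dataset (`features`, `labels`, `toy`, and `batched` are illustrative names). A shuffle buffer as large as the dataset gives a full shuffle, and `batch(2)` groups consecutive elements, leaving a smaller final batch when the size does not divide evenly.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 5 rows with 2 features each, plus binary labels.
features = np.arange(10, dtype=np.float32).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 0])

toy = tf.data.Dataset.from_tensor_slices((features, labels))

# Buffer size equal to the dataset length shuffles all rows together.
batched = toy.shuffle(buffer_size=len(labels)).batch(2)

# Iterate once and record how many rows each batch holds.
batch_sizes = [int(x.shape[0]) for x, y in batched]
print(batch_sizes)  # [2, 2, 1]
```

The tutorial uses a batch size of 1, so every training step sees a single patient; a larger batch size would reduce the number of steps per epoch.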

Create and train a model

In [19]:
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
    return model

In [20]:
model = get_compiled_model()
model.fit(train_dataset, epochs=15)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x22ea483d470>

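Alternative to feature columns

Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.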

In [21]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0,2
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3
2,67,1,4,120,229,0,2,129,1,2.6,2,2,4
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3


In [22]:
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}

In [23]:
inputs

{'age': <tf.Tensor 'age:0' shape=(None,) dtype=float32>,
 'sex': <tf.Tensor 'sex:0' shape=(None,) dtype=float32>,
 'cp': <tf.Tensor 'cp:0' shape=(None,) dtype=float32>,
 'trestbps': <tf.Tensor 'trestbps:0' shape=(None,) dtype=float32>,
 'chol': <tf.Tensor 'chol:0' shape=(None,) dtype=float32>,
 'fbs': <tf.Tensor 'fbs:0' shape=(None,) dtype=float32>,
 'restecg': <tf.Tensor 'restecg:0' shape=(None,) dtype=float32>,
 'thalach': <tf.Tensor 'thalach:0' shape=(None,) dtype=float32>,
 'exang': <tf.Tensor 'exang:0' shape=(None,) dtype=float32>,
 'oldpeak': <tf.Tensor 'oldpeak:0' shape=(None,) dtype=float32>,
 'slope': <tf.Tensor 'slope:0' shape=(None,) dtype=float32>,
 'ca': <tf.Tensor 'ca:0' shape=(None,) dtype=float32>,
 'thal': <tf.Tensor 'thal:0' shape=(None,) dtype=float32>}

In [24]:
list(inputs.values())

[<tf.Tensor 'age:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'sex:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'cp:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'trestbps:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'chol:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'fbs:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'restecg:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'thalach:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'exang:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'oldpeak:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'slope:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'ca:0' shape=(None,) dtype=float32>,
 <tf.Tensor 'thal:0' shape=(None,) dtype=float32>]

In [25]:
tf.stack(list(inputs.values()), axis=-1)

<tf.Tensor 'stack:0' shape=(None, 13) dtype=float32>

In [15]:
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

In [28]:
len(df.to_dict('list'))

13

In [30]:
len(df.to_dict('list')['age'])

303

The simplest way to feed a pd.DataFrame to this model is to convert it to a dict of columns and slice that dictionary.

In [16]:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
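To see what the dictionary looks like before it is sliced, here is a small sketch on a hypothetical toy frame (`toy_df`, `columns`, and `rows` are illustrative names). `to_dict('list')` returns one Python list per column, keyed by column name. The last step imitates in plain Python how slicing along the first axis yields one dict per row; it is not how tf.data is implemented.

```python
import pandas as pd

# Hypothetical toy frame with two of the tutorial's column names.
toy_df = pd.DataFrame({'age': [63, 67, 37], 'sex': [1, 1, 1]})

# One list per column, keyed by column name.
columns = toy_df.to_dict('list')
print(columns)  # {'age': [63, 67, 37], 'sex': [1, 1, 1]}

# Plain-Python imitation of slicing: one dict per row, same keys.
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
print(rows[0])  # {'age': 63, 'sex': 1}
```

Because the keys are the column names, each sliced element lines up with the `Input` layers that were created with `name=key`.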

In [17]:
for dict_slice in dict_slices.take(1):
    print(dict_slice)

({'age': <tf.Tensor: id=14768, shape=(16,), dtype=int32, numpy=array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57])>, 'sex': <tf.Tensor: id=14776, shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1])>, 'cp': <tf.Tensor: id=14771, shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3])>, 'trestbps': <tf.Tensor: id=14780, shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150])>, 'chol': <tf.Tensor: id=14770, shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168])>, 'fbs': <tf.Tensor: id=14773, shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0])>, 'restecg': <tf.Tensor: id=14775, shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0])>, 'thalach': <tf.Tensor: id=14779, shape=(16,), dtype=int32, n

In [18]:
model_func.fit(dict_slices, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x209737786d8>