<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px; text-align:center"><h1 style='margin:10px 5px'>Import and Export Data</h1>
<hr>
<p style="color:#006666; text-align:right;font-size:10px">
Copyright by MachineLearningPlus. All Rights Reserved.
</p>

</div>

We rarely create the arrays we work with by hand. Most of the time the data lives in an external file, so you need to be able to load it into NumPy.

NumPy provides functions to load data from an external file and to save arrays back to disk.


<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:7px 5px; font-size:16px'>Import Data</h2>
</div>

The main import methods are:

1. numpy.loadtxt()
2. numpy.genfromtxt()

Use `np.loadtxt` when there is no missing data.

In [None]:
import numpy as np
data = np.loadtxt('Datasets/data.txt', delimiter="\t")
data

array([[ 1.     , 87.     , 57.54435],
       [ 2.     ,  8.     ,  7.31704],
       [ 3.     , 56.     , 56.82095],
       [ 4.     , 63.     , 64.15579],
       [ 5.     ,  2.     ,  5.74522],
       [ 6.     , 45.     , 19.56758],
       [ 7.     , 43.     , 39.62271],
       [ 8.     , 47.     , 34.95107],
       [ 9.     ,  2.     ,  9.38692],
       [10.     , 79.     , 36.41022],
       [11.     , 67.     , 49.83894],
       [12.     , 24.     , 23.47974],
       [13.     , 61.     , 72.55357],
       [14.     , 85.     , 39.24693],
       [15.     , 63.     , 53.6279 ],
       [16.     ,  2.     , 16.72441],
       [17.     , 29.     , 37.25533],
       [18.     , 45.     , 18.78498],
       [19.     , 33.     , 19.8089 ],
       [20.     , 28.     , 46.03384],
       [21.     , 21.     , 23.7864 ],
       [22.     , 27.     , 44.42627],
       [23.     , 65.     , 34.94804],
       [24.     , 61.     , 53.49576],
       [25.     , 10.     , 25.98564]])

In [None]:
data[0, :]

array([ 1.     , 87.     , 57.54435])

In [None]:
type(data)

numpy.ndarray
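
Since `Datasets/data.txt` may not be available, here is a minimal self-contained sketch of the same call. It assumes a small tab-separated sample held in memory with `io.StringIO` (the three rows are copied from the output above; the names `col0`, `col1`, `col2` are placeholders, since the file's column meanings are not stated).

```python
import io
import numpy as np

# Three tab-separated rows standing in for Datasets/data.txt (assumed layout).
raw = "1\t87\t57.54435\n2\t8\t7.31704\n3\t56\t56.82095\n"

# loadtxt accepts any file-like object, not only a path.
data = np.loadtxt(io.StringIO(raw), delimiter="\t")
print(data.shape)  # (3, 3)
print(data.dtype)  # float64

# unpack=True transposes the result so each column can be bound to its own name.
col0, col1, col2 = np.loadtxt(io.StringIO(raw), delimiter="\t", unpack=True)
print(col1)
```

Every value is parsed as a float because `np.loadtxt` defaults to `dtype=float`.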

When there are missing values, `np.loadtxt` raises an error.

In [None]:
data = np.loadtxt('Datasets/data_miss.txt', delimiter="\t")
data

ValueError: could not convert string to float: 

In such situations, use `np.genfromtxt()`. It fills in missing data with `nan`.

In [None]:
data = np.genfromtxt('Datasets/data_miss.txt', delimiter="\t")
data

array([[ 1.     , 87.     , 57.54435],
       [ 2.     ,  8.     ,  7.31704],
       [ 3.     , 56.     , 56.82095],
       [ 4.     , 63.     , 64.15579],
       [ 5.     ,  2.     ,  5.74522],
       [ 6.     , 45.     , 19.56758],
       [ 7.     , 43.     , 39.62271],
       [ 8.     , 47.     , 34.95107],
       [ 9.     ,  2.     ,      nan],
       [10.     , 79.     , 36.41022],
       [11.     , 67.     , 49.83894],
       [12.     , 24.     ,      inf],
       [13.     , 61.     , 72.55357],
       [14.     , 85.     , 39.24693],
       [15.     , 63.     , 53.6279 ],
       [16.     ,  2.     , 16.72441],
       [17.     , 29.     ,      nan],
       [18.     , 45.     , 18.78498],
       [19.     , 33.     , 19.8089 ],
       [20.     , 28.     , 46.03384],
       [21.     , 21.     , 23.7864 ],
       [22.     , 27.     , 44.42627],
       [23.     , 65.     , 34.94804],
       [24.     , 61.     , 53.49576],
       [25.     , 10.     , 25.98564]])
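
The difference between the two loaders can be reproduced without the dataset files. The sketch below assumes a made-up three-row sample with one empty field; `filling_values` is an optional `np.genfromtxt` parameter that substitutes a default of your choice instead of `nan`.

```python
import io
import numpy as np

# Made-up sample: the middle field of the second row is empty.
raw = "1\t87\t57.54435\n2\t\t7.31704\n3\t56\t56.82095\n"

# loadtxt cannot convert the empty field to a float.
try:
    np.loadtxt(io.StringIO(raw), delimiter="\t")
    loadtxt_failed = False
except ValueError as exc:
    loadtxt_failed = True
    print("loadtxt failed:", exc)

# genfromtxt replaces the empty field with nan.
with_nan = np.genfromtxt(io.StringIO(raw), delimiter="\t")
missing_mask = np.isnan(with_nan)

# nan-aware reductions skip the missing entries.
col_means = np.nanmean(with_nan, axis=0)

# Alternatively, substitute a default value while loading.
with_default = np.genfromtxt(io.StringIO(raw), delimiter="\t", filling_values=0)
print(with_default)
```

Whether to keep `nan` or substitute a default depends on the analysis: `nan` keeps the gap visible, while a fill value silently changes statistics such as the mean.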

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:7px 5px; font-size:16px'>CSV File</h2>
</div>

Let's try loading a CSV file with column names.

By default, `np.genfromtxt` uses `float` as the dtype. The header row and the text fields cannot be converted to floats, so they come back as `nan`.

In [None]:
data = np.genfromtxt('Datasets/Mall_Customers.csv', delimiter=",")
data

array([[ nan,  nan,  nan,  nan,  nan],
       [  1.,  nan,  19.,  15.,  39.],
       [  2.,  nan,  21.,  15.,  81.],
       ...,
       [198.,  nan,  32., 126.,  74.],
       [199.,  nan,  32., 137.,  18.],
       [200.,  nan,  30., 137.,  83.]])

So, specify the dtype explicitly and skip the header row.

In [None]:
# Change dtype and skip header
data = np.genfromtxt('Datasets/Mall_Customers.csv',
                     delimiter=",",
                     dtype='object',
                     skip_header=1)

data[:5, :]

array([[b'1', b'Male', b'19', b'15', b'39'],
       [b'2', b'Male', b'21', b'15', b'81'],
       [b'3', b'Female', b'20', b'16', b'6'],
       [b'4', b'Female', b'23', b'16', b'77'],
       [b'5', b'Female', b'31', b'17', b'40']], dtype=object)

The problem is that every value, including the numbers, is read as `bytes` rather than as a number, so arithmetic does not work directly.

In [None]:
# Divide column index 3 by column index 2 (the 4th column by the 3rd). ERROR!
data[:, 3] / data[:, 2] 

TypeError: unsupported operand type(s) for /: 'bytes' and 'bytes'

Convert to float and then divide. Works!

In [None]:
output = data[:, 3].astype('float') / data[:, 2].astype('float')
output[:10]

array([0.78947368, 0.71428571, 0.8       , 0.69565217, 0.5483871 ,
       0.77272727, 0.51428571, 0.7826087 , 0.296875  , 0.63333333])
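
If only the numeric columns are needed, another option is to load just those columns as floats with the `usecols` parameter of `np.genfromtxt`, which avoids the `bytes` conversion step. This is a sketch on an in-memory sample: the three rows are taken from the output above, and the header text is an assumption about the file.

```python
import io
import numpy as np

# In-memory stand-in for Mall_Customers.csv (header text is assumed).
raw = (
    "CustomerID,Genre,Age,Annual_Income,Spending_Score\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)

# Read only column indices 2 and 3; both parse cleanly as floats.
numeric = np.genfromtxt(io.StringIO(raw), delimiter=",",
                        skip_header=1, usecols=(2, 3))

# numeric[:, 0] is file column 2, numeric[:, 1] is file column 3.
ratio = numeric[:, 1] / numeric[:, 0]
print(ratio)
```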

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:7px 5px; font-size:16px'>Better Way: Define the Data Type and then import</h2>
</div>

In [None]:
dt = np.dtype({'names': ["CustomerID", "Genre", "Age", "Annual_Income", "Spending_Score"],
               'formats': [np.int16, 'U16', np.int16, np.int16, np.int16]})

In [None]:
# Use the structured dtype and skip header
data = np.genfromtxt('Datasets/Mall_Customers.csv',
                     delimiter=",",
                     dtype=dt,
                     skip_header=1)

data[:15]

array([( 1, 'Male', 19, 15, 39), ( 2, 'Male', 21, 15, 81),
       ( 3, 'Female', 20, 16,  6), ( 4, 'Female', 23, 16, 77),
       ( 5, 'Female', 31, 17, 40), ( 6, 'Female', 22, 17, 76),
       ( 7, 'Female', 35, 18,  6), ( 8, 'Female', 23, 18, 94),
       ( 9, 'Male', 64, 19,  3), (10, 'Female', 30, 19, 72),
       (11, 'Male', 67, 19, 14), (12, 'Female', 35, 19, 99),
       (13, 'Female', 58, 20, 15), (14, 'Female', 24, 20, 77),
       (15, 'Male', 37, 20, 13)],
      dtype=[('CustomerID', '<i2'), ('Genre', '<U16'), ('Age', '<i2'), ('Annual_Income', '<i2'), ('Spending_Score', '<i2')])

In [None]:
data.shape

(200,)

In [None]:
data[0]['Age']

19

In [None]:
data[0]['Genre']

'Male'

In [None]:
data['Age']

array([19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24, 37, 22, 35,
       20, 52, 35, 35, 25, 46, 31, 54, 29, 45, 35, 40, 23, 60, 21, 53, 18,
       49, 21, 42, 30, 36, 20, 65, 24, 48, 31, 49, 24, 50, 27, 29, 31, 49,
       33, 31, 59, 50, 47, 51, 69, 27, 53, 70, 19, 67, 54, 63, 18, 43, 68,
       19, 32, 70, 47, 60, 60, 59, 26, 45, 40, 23, 49, 57, 38, 67, 46, 21,
       48, 55, 22, 34, 50, 68, 18, 48, 40, 32, 24, 47, 27, 48, 20, 23, 49,
       67, 26, 49, 21, 66, 54, 68, 66, 65, 19, 38, 19, 18, 19, 63, 49, 51,
       50, 27, 38, 40, 39, 23, 31, 43, 40, 59, 38, 47, 39, 25, 31, 20, 29,
       44, 32, 19, 35, 57, 32, 28, 32, 25, 28, 48, 32, 34, 34, 43, 39, 44,
       38, 47, 27, 37, 30, 34, 30, 56, 29, 19, 31, 50, 36, 42, 33, 36, 32,
       40, 28, 36, 36, 52, 30, 58, 27, 59, 35, 37, 32, 46, 29, 41, 30, 54,
       28, 41, 36, 34, 32, 33, 38, 47, 35, 45, 32, 32, 30], dtype=int16)
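
As an alternative to writing the dtype by hand, `np.genfromtxt` can take the field names from the header row (`names=True`) and infer a type per column (`dtype=None`). The sketch below runs on an in-memory sample; the header text is assumed, and the inferred integer width depends on the platform, so it will generally not be `int16` as in the hand-written dtype.

```python
import io
import numpy as np

# In-memory stand-in for Mall_Customers.csv (header text is assumed).
raw = (
    "CustomerID,Genre,Age,Annual_Income,Spending_Score\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)

# names=True reads field names from the first line;
# dtype=None infers each column's type from its values;
# encoding="utf-8" makes text columns come back as str instead of bytes.
customers = np.genfromtxt(io.StringIO(raw), delimiter=",",
                          names=True, dtype=None, encoding="utf-8")

print(customers.dtype.names)
print(customers["Age"])
print(customers["Genre"])
```

Inference is convenient for exploration, while an explicit dtype keeps memory use and column types predictable.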

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:7px 5px; font-size:16px'>Export Data and Load it back</h2>
</div>

If it's a single array, save it in `.npy` format. If you have multiple arrays to save in the same file, use `.npz` format.

In [None]:
# Store the arrays to disk

# Single array
np.save('TEMP/output.npy', output)

# Multiple arrays: arrays will be saved with names "arr_0", "arr_1",..
np.savez('TEMP/outputs.npz', output, data)

Load it back

In [None]:
# Single array
a = np.load('TEMP/output.npy')
a[:5]

array([0.78947368, 0.71428571, 0.8       , 0.69565217, 0.5483871 ])

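`allow_pickle=True` is required only when the saved arrays contain Python objects (`dtype=object`); it is unrelated to the number of dimensions. Plain numeric and structured arrays load without it. Loading pickled data can execute arbitrary code, so enable it only for files you trust.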

In [None]:
# Multiple arrays
b = np.load('TEMP/outputs.npz', allow_pickle=True)
b

<numpy.lib.npyio.NpzFile at 0x1c369284f48>

See the arrays stored in it.

In [None]:
b.files

['arr_0', 'arr_1']
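
The default names `arr_0`, `arr_1` come from passing the arrays positionally. Passing them as keyword arguments to `np.savez` stores them under names of your choice. The sketch below assumes small stand-in arrays, the hypothetical names `ratio` and `customers`, and a temporary directory in place of `TEMP/`.

```python
import os
import tempfile
import numpy as np

# Small stand-ins for the arrays used above.
ratio = np.array([0.78947368, 0.71428571, 0.8])
dt = np.dtype([("CustomerID", np.int16), ("Genre", "U16"), ("Age", np.int16)])
customers = np.array([(1, "Male", 19), (2, "Male", 21), (3, "Female", 20)],
                     dtype=dt)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "outputs.npz")

    # Keyword arguments become the names of the arrays in the archive.
    np.savez(path, ratio=ratio, customers=customers)

    # The context manager closes the archive once the arrays are read.
    with np.load(path) as archive:
        names = sorted(archive.files)
        ratio_back = archive["ratio"]
        customers_back = archive["customers"]

print(names)
print(customers_back["Genre"])
```

Note that the structured array round-trips here without `allow_pickle=True`, because none of its fields hold Python objects.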

In [None]:
b['arr_0'][:5]

array([0.78947368, 0.71428571, 0.8       , 0.69565217, 0.5483871 ])

In [None]:
b['arr_1']

array([(  1, 'Male', 19,  15, 39), (  2, 'Male', 21,  15, 81),
       (  3, 'Female', 20,  16,  6), (  4, 'Female', 23,  16, 77),
       (  5, 'Female', 31,  17, 40), (  6, 'Female', 22,  17, 76),
       (  7, 'Female', 35,  18,  6), (  8, 'Female', 23,  18, 94),
       (  9, 'Male', 64,  19,  3), ( 10, 'Female', 30,  19, 72),
       ( 11, 'Male', 67,  19, 14), ( 12, 'Female', 35,  19, 99),
       ( 13, 'Female', 58,  20, 15), ( 14, 'Female', 24,  20, 77),
       ( 15, 'Male', 37,  20, 13), ( 16, 'Male', 22,  20, 79),
       ( 17, 'Female', 35,  21, 35), ( 18, 'Male', 20,  21, 66),
       ( 19, 'Male', 52,  23, 29), ( 20, 'Female', 35,  23, 98),
       ( 21, 'Male', 35,  24, 35), ( 22, 'Male', 25,  24, 73),
       ( 23, 'Female', 46,  25,  5), ( 24, 'Male', 31,  25, 73),
       ( 25, 'Female', 54,  28, 14), ( 26, 'Male', 29,  28, 82),
       ( 27, 'Female', 45,  28, 32), ( 28, 'Male', 35,  28, 61),
       ( 29, 'Female', 40,  29, 31), ( 30, 'Female', 23,  29, 87),
       ( 31, 'Male', 