# How to Load Machine Learning Data

Based on a video by **Patrick Loeber**: https://www.youtube.com/watch?v=eivDTOlA0TE&list=PLqnslRFeH2Upcrywf-u2etjdxxkL8nl7E&index=16

# Loading with pure Python and the csv module

**Not recommended**, because it is slower and needs more code than the other methods. This is the manual option.

In [5]:
import csv
import numpy as np

FILE_NAME = "spambase.data"
with open(FILE_NAME, 'r') as f:
    data = list(csv.reader(f, delimiter=","))
    
# csv.reader returns every field as a string, so convert to floats here
data = np.array(data, dtype=np.float32)
print(data.shape)
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4601, 58)
(4601, 57) (4601,)
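To illustrate why the manual route needs more code, here is a minimal sketch that uses only the standard library. The `load_rows` helper and the two-row `SAMPLE` string are hypothetical stand-ins for a real file such as spambase.data. The sketch assumes every field is numeric and that the label is the last column.

```python
import csv
import io

# Hypothetical two-row sample standing in for a file such as spambase.data.
SAMPLE = "0.1,0.2,1\n0.3,0.4,0\n"


def load_rows(handle):
    """Parse comma-separated numeric rows into feature lists and labels."""
    features, labels = [], []
    for row in csv.reader(handle, delimiter=","):
        if not row:  # csv.reader yields [] for blank lines, so skip them
            continue
        values = [float(cell) for cell in row]  # every field arrives as a string
        features.append(values[:-1])  # all columns except the last
        labels.append(values[-1])     # the last column is the label
    return features, labels


X_rows, y_rows = load_rows(io.StringIO(SAMPLE))
print(X_rows, y_rows)
```

The string-to-float conversion and the blank-line handling are both things the NumPy and pandas loaders do for us.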


# Loading with numpy

First method: np.loadtxt.

In [9]:
import numpy as np

FILE_NAME = "spambase.data"
data = np.loadtxt(FILE_NAME, delimiter=",")
    
print(data.shape)
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4601, 58)
(4601, 57) (4601,)
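Every cell in this notebook repeats the same feature/label split. As a sketch, the split could be wrapped in a small helper. The name `split_features_labels` is hypothetical, and it assumes, as in spambase.data, that the label is the last column.

```python
import numpy as np


def split_features_labels(data):
    """Return (X, y): y is the last column, X is every column before it."""
    return data[:, :-1], data[:, -1]


# Small made-up array: 3 samples, 3 features plus 1 label column.
demo = np.arange(12, dtype=np.float32).reshape(3, 4)
X_demo, y_demo = split_features_labels(demo)
print(X_demo.shape, y_demo.shape)
```

Negative indexing avoids computing n_features by hand, and the result is the same as with data[:, 0:n_features] and data[:, n_features].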


Second method: np.genfromtxt. This is the recommended one. It does the same thing as np.loadtxt but offers a few more parameters. For example, it can deal with missing data.

In [10]:
import numpy as np

FILE_NAME = "spambase.data"
data = np.genfromtxt(FILE_NAME, delimiter=",")
    
print(data.shape)
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4601, 58)
(4601, 57) (4601,)
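The difference shows up as soon as a field is empty. The sketch below uses a made-up in-memory sample (`SAMPLE_WITH_GAP`) instead of spambase.data. It assumes default settings for both functions: np.loadtxt refuses the empty field, while np.genfromtxt reads it as NaN.

```python
import io

import numpy as np

# Made-up sample: the second row has an empty middle field.
SAMPLE_WITH_GAP = "1.0,2.0,3.0\n4.0,,6.0\n"

try:
    np.loadtxt(io.StringIO(SAMPLE_WITH_GAP), delimiter=",")
    loadtxt_failed = False
except ValueError:
    # loadtxt cannot convert the empty string to a float
    loadtxt_failed = True

# genfromtxt treats the empty field as missing and stores NaN
filled = np.genfromtxt(io.StringIO(SAMPLE_WITH_GAP), delimiter=",")

print(loadtxt_failed)
print(np.isnan(filled))
```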


# Pandas

By default, pandas treats the first row as a header. Our file has none, so we pass header=None. read_csv offers even more options than the NumPy functions, and it should be a little faster.

In [13]:
import pandas as pd

FILE_NAME = "spambase.data"
df = pd.read_csv(FILE_NAME, header=None, delimiter=",")
data = df.to_numpy()

print(data.shape)
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4601, 58)
(4601, 57) (4601,)
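A small sketch of what happens when header=None is left out, using a made-up two-row sample (`HEADERLESS`) rather than the real file: the first data row is consumed as column names and one sample is lost.

```python
import io

import pandas as pd

# Made-up headerless sample with two data rows.
HEADERLESS = "1,2,3\n4,5,6\n"

# Default behaviour: the first row is used as the column names.
df_default = pd.read_csv(io.StringIO(HEADERLESS))

# header=None keeps every row as data and numbers the columns 0, 1, 2.
df_no_header = pd.read_csv(io.StringIO(HEADERLESS), header=None)

print(df_default.shape, list(df_default.columns))
print(df_no_header.shape, list(df_no_header.columns))
```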


## Good practices

If we know the data type, it is good practice to specify it. Otherwise the function has to figure out the data type by itself, which can take more time or give the wrong result. Some algorithms and classifiers expect the data to be float.

In [14]:
import numpy as np

FILE_NAME = "spambase.data"
data = np.genfromtxt(FILE_NAME, delimiter=",", dtype=np.float32)
    
print(data.shape, type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

import pandas as pd

FILE_NAME = "spambase.data"
df = pd.read_csv(FILE_NAME, header=None, delimiter=",", dtype=np.float32)
data = df.to_numpy()

print(data.shape,  type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4601, 58) <class 'numpy.float32'>
(4601, 57) (4601,)
(4601, 58) <class 'numpy.float32'>
(4601, 57) (4601,)
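To see what type inference does, here is a sketch with a made-up sample (`MIXED`) whose first column holds only integers. Without a dtype, pandas infers a type per column. With dtype=np.float32, every column comes out as float32.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample: an integer-looking column and a float column.
MIXED = "1,0.5\n2,1.5\n"

inferred = pd.read_csv(io.StringIO(MIXED), header=None)
explicit = pd.read_csv(io.StringIO(MIXED), header=None, dtype=np.float32)

print(inferred.dtypes.tolist())   # one inferred dtype per column
print(explicit.dtypes.tolist())   # float32 for every column
print(explicit.to_numpy().dtype)
```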


## Headers

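In np.genfromtxt, the skip_header argument specifies how many rows to skip at the start of the file. pandas read_csv does the same with the skiprows argument. Note that spambase.data has no header row, so skipping one row here drops the first sample, which is why the output below shows 4600 rows instead of 4601.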

In [15]:
import numpy as np

FILE_NAME = "spambase.data"
data = np.genfromtxt(FILE_NAME, delimiter=",", dtype=np.float32, skip_header=1)
    
print(data.shape, type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

import pandas as pd

FILE_NAME = "spambase.data"
df = pd.read_csv(FILE_NAME, header=None, delimiter=",", dtype=np.float32, skiprows=1)
data = df.to_numpy()

print(data.shape,  type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4600, 58) <class 'numpy.float32'>
(4600, 57) (4600,)
(4600, 58) <class 'numpy.float32'>
(4600, 57) (4600,)
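For comparison, here is a sketch with a made-up sample (`WITH_HEADER`) that really does start with a header row. The column names f1, f2 and label are invented for illustration. np.genfromtxt skips the row with skip_header=1, while pandas can keep it as column names with header=0.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample with a real header row.
WITH_HEADER = "f1,f2,label\n0.1,0.2,1\n0.3,0.4,0\n"

# NumPy: drop the header line, keep the two data rows.
arr = np.genfromtxt(
    io.StringIO(WITH_HEADER), delimiter=",", dtype=np.float32, skip_header=1
)

# pandas: use the first line as column names instead of discarding it.
df_named = pd.read_csv(io.StringIO(WITH_HEADER), header=0, dtype=np.float32)

print(arr.shape)
print(list(df_named.columns), df_named.shape)
```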


## Missing values

Missing values are very common, so we will run into them regularly. Usually they are detected automatically when the data is loaded (for example, empty fields), but there may be situations in which the function does not recognize them.

We can specify additional markers for missing values, such as particular strings. In np.genfromtxt we use the missing_values argument. A single string is used as the marker for all columns, while a list assigns one marker per column, in order. In pandas read_csv we use the na_values argument, which accepts a single value, a list, or a dict of per-column values.

By adding the filling_values argument to np.genfromtxt we can specify the value that replaces the missing ones. In pandas we can do something similar with the fillna() method, which we must apply after loading the data.

In [18]:
import numpy as np

FILE_NAME = "spambase.data"
data = np.genfromtxt(FILE_NAME, delimiter=",", dtype=np.float32, skip_header=1, missing_values="Hello", filling_values=9999.0)
    
print(data.shape, type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

import pandas as pd

FILE_NAME = "spambase.data"
df = pd.read_csv(FILE_NAME, header=None, delimiter=",", dtype=np.float32, skiprows=1, na_values=["Hello"])
df = df.fillna(9999.0)
data = df.to_numpy()

print(data.shape,  type(data[0][0]))
n_samples, n_features = data.shape
n_features -= 1
X = data[:,0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)

(4600, 58) <class 'numpy.float32'>
(4600, 57) (4600,)
(4600, 58) <class 'numpy.float32'>
(4600, 57) (4600,)
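spambase.data contains no "Hello" fields, so the cell above does not show a replacement actually happening. The sketch below uses a made-up sample (`WITH_MISSING`) with one "Hello" marker and one empty field. It assumes both should become 9999.0.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample: row 1 has a "Hello" marker, row 2 has an empty field.
WITH_MISSING = "1.0,Hello,3.0\n4.0,,6.0\n7.0,8.0,9.0\n"

# NumPy: mark "Hello" as missing and fill every missing field with 9999.0.
np_data = np.genfromtxt(
    io.StringIO(WITH_MISSING),
    delimiter=",",
    dtype=np.float32,
    missing_values="Hello",
    filling_values=9999.0,
)

# pandas: mark "Hello" as NaN while loading, then fill the NaNs afterwards.
pd_frame = pd.read_csv(
    io.StringIO(WITH_MISSING), header=None, dtype=np.float32, na_values=["Hello"]
)
pd_data = pd_frame.fillna(9999.0).to_numpy()

print(np_data)
print(pd_data)
```

Both loaders should produce the same array here, with 9999.0 in the middle column of the first two rows.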
