Source: [machinelearningmastery](https://machinelearningmastery.com/load-machine-learning-data-python/)

# 1. Considerations When Loading CSV Data

There are a number of considerations when loading your machine learning data from CSV files.

For reference, you can learn a lot about the expectations for CSV files by reviewing RFC 4180, the request for comments titled [Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180).

## 1.1 CSV File Header

Does your data have a file header?

If so, this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually.

Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.
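As a minimal sketch of stating the header explicitly, the snippet below uses `pandas.read_csv()` with its `header` and `names` parameters. The two in-memory samples and their column names are hypothetical and stand in for real files.

```python
import io
import pandas

# Hypothetical sample data: the same two rows, with and without a header line
with_header = "preg,plas,class\n6,148,1\n1,85,0\n"
without_header = "6,148,1\n1,85,0\n"

# header=0 states that the first line holds the column names
df_a = pandas.read_csv(io.StringIO(with_header), header=0)

# header=None states that there is no header line, so names are given manually
df_b = pandas.read_csv(io.StringIO(without_header), header=None,
                       names=['preg', 'plas', 'class'])

print(list(df_a.columns))  # ['preg', 'plas', 'class']
print(df_a.shape, df_b.shape)  # (2, 3) (2, 3)
```

Both calls should produce the same two-row table; only the way the column names are obtained differs.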

## 1.2 Comments

Does your data have comments?

Comments in a CSV file are conventionally indicated by a **hash ("#")** at the start of a line. This is a convention rather than part of RFC 4180, which does not define comments.

If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and the character to expect to signify a comment line.
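For illustration, here is a small sketch that tells `numpy.loadtxt()` which character marks a comment through its `comments` parameter. The sample text is made up; `pandas.read_csv()` offers a similar `comment` parameter.

```python
import io
import numpy

# Hypothetical sample data with one comment line at the top
sample = "# this line is a comment\n6,148,1\n1,85,0\n"

# comments="#" makes loadtxt ignore everything from '#' to the end of a line
data = numpy.loadtxt(io.StringIO(sample), delimiter=",", comments="#")
print(data.shape)  # (2, 3)
```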

## 1.3 Delimiter

The standard delimiter that separates values in fields is the **comma (",")** character.

Your file could use a different delimiter, such as **tab ("\t")**, in which case you must specify it explicitly.
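A short sketch of specifying a non-default delimiter, here with the standard library `csv.reader()` and its `delimiter` parameter. The tab-separated sample text is hypothetical.

```python
import csv
import io

# Hypothetical tab-separated sample data
sample = "6\t148\t1\n1\t85\t0\n"

# delimiter="\t" splits fields on tabs instead of commas
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(rows)  # [['6', '148', '1'], ['1', '85', '0']]
```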

## 1.4 Quotes

Sometimes field values contain spaces, or even the delimiter character itself. In these CSV files, the values are often quoted.

The default quote character is the **double quotation mark (")**. Other characters can be used, in which case you must specify the quote character used in your file.
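The sketch below shows one way to handle quoted values with `csv.reader()`, first with the default double quote and then with a single quote passed as `quotechar`. The sample rows are invented for the example.

```python
import csv
import io

# Hypothetical sample: values quoted with the default double quotation mark
sample_double = 'name,notes\n"Smith, J.","two words"\n'
rows_double = list(csv.reader(io.StringIO(sample_double)))

# Hypothetical sample: the same values quoted with single quotes instead
sample_single = "name,notes\n'Smith, J.','two words'\n"
rows_single = list(csv.reader(io.StringIO(sample_single), quotechar="'"))

# The comma inside the quoted name is kept as part of the value in both cases
print(rows_double[1])  # ['Smith, J.', 'two words']
print(rows_single[1])  # ['Smith, J.', 'two words']
```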

# 2. Machine Learning Data Loading Recipes

## 2.1 Load CSV with Python Standard Library

The Python standard library provides the `csv` module and its `reader()` function, which can be used to load CSV files.

Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning.

For example, you can download the [Pima Indians dataset](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv) into your local directory.

All fields are numeric and there is no header line. Running the recipe below will load the CSV file and convert it to a NumPy array.

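```python
# Load CSV using the Python standard library
import csv
import numpy

filename = 'pima-indians-diabetes.data.csv'
# Open in text mode; the with block closes the file once it has been read
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
# Convert the list of string rows to a 2D array of floats
data = numpy.array(x).astype('float')
print(data.shape)
```

```
(768, 9)
```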


For more information on the `csv.reader()` function, see [CSV File Reading and Writing](https://docs.python.org/3/library/csv.html) in the Python documentation.

## 2.2 Load CSV File With NumPy

You can load your CSV data using NumPy and the `numpy.loadtxt()` function.

This function assumes **no header row and all data has the same format**. The example below assumes that the file `pima-indians-diabetes.data.csv` is in your current working directory.

```python
# Load CSV using NumPy
import numpy

filename = 'pima-indians-diabetes.data.csv'
# loadtxt accepts a file name directly and closes the file itself
data = numpy.loadtxt(filename, delimiter=",")
print(data.shape)
```

```
(768, 9)
```


This example can be modified to load the same dataset directly from a URL as follows:

```python
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
# urlopen returns a file-like object; the with block closes the connection
with urlopen(url) as raw_data:
    dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)
```

```
(768, 9)
```
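The no-header assumption mentioned at the start of this section can be worked around. As a rough sketch, `numpy.loadtxt()` takes a `skiprows` parameter that skips leading lines, so a single header line can be dropped before parsing. The sample text and its column names are hypothetical.

```python
import io
import numpy

# Hypothetical sample data with one header line
sample = "preg,plas,class\n6,148,1\n1,85,0\n"

# skiprows=1 skips the header line so only the numeric rows are parsed
data = numpy.loadtxt(io.StringIO(sample), delimiter=",", skiprows=1)
print(data.shape)  # (2, 3)
```

The column names are discarded in this approach, so they would need to be tracked separately if you want them later.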


## 2.3 Load CSV File With Pandas

You can load your CSV data using Pandas and the `pandas.read_csv()` function.

This function is very flexible and is the approach I recommend for loading your machine learning data. The function returns a `pandas.DataFrame` that you can immediately start summarizing and plotting.

```python
# Load CSV using Pandas
import pandas

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)
```

```
(768, 9)
```


The same function can also load the dataset directly from a URL:

```python
# Load CSV using Pandas from URL
import pandas

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)
```

```
(768, 9)
```
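To give a sense of what summarizing a loaded `DataFrame` might look like, here is a small sketch on a made-up three-row sample with shortened, hypothetical column names. It computes summary statistics, counts rows per class, and converts the frame to a NumPy array split into inputs and output, assuming the last column holds the class label.

```python
import io
import pandas

# Hypothetical sample data: two input columns followed by a class column
sample = "6,148,1\n1,85,0\n8,183,1\n"
names = ['preg', 'plas', 'class']
data = pandas.read_csv(io.StringIO(sample), names=names)

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = data.describe()

# Number of rows for each class value
counts = data.groupby('class').size()

# Convert to a NumPy array and split into inputs (X) and output (y)
array = data.to_numpy()
X, y = array[:, :-1], array[:, -1]
print(X.shape, y.shape)  # (3, 2) (3,)
```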
