## 2.1 Loading a Sample Dataset
### Problem
You want to load a preexisting sample dataset.
### Solution
scikit-learn comes with a number of popular datasets for you to use:

In [1]:
# Load scikit-learn's datasets
from sklearn import datasets
# Load digits dataset
digits = datasets.load_digits()
# Create features matrix
features = digits.data
# Create target vector
target = digits.target
# View first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

In [2]:
# View first target value
target[0]

0
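
Each row of the feature matrix is an 8x8 grayscale image flattened into 64 values, which is why features[0] above has 64 entries. The following is a minimal sketch, assuming load_digits is called with its default arguments, that checks the shapes and confirms the first observation matches the unflattened copy stored in digits.images:

```python
# Load libraries
import numpy as np
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()
features = digits.data
target = digits.target

# Each observation is an 8x8 image flattened into 64 values
print(features.shape)
print(target.shape)

# Reshape the first observation back into its 8x8 grid
first_image = features[0].reshape(8, 8)

# digits.images holds the same pixel values in unflattened form
print(np.array_equal(first_image, digits.images[0]))
```

Reshaping back to 8x8 is handy when you want to plot an observation as an image instead of reading it as a flat vector.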

## 2.2 Creating a Simulated Dataset
### Problem
You need to generate a simulated dataset.
### Solution
scikit-learn offers many methods for creating simulated data. Of those, three methods are particularly useful.

When we want a dataset designed to be used with linear regression, make_regression is a good choice:

In [3]:
# Load library
from sklearn.datasets import make_regression
# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])



Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]


If we are interested in creating a simulated dataset for classification, we can use
make_classification:

In [4]:
# Load library
from sklearn.datasets import make_classification
# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]


Finally, if we want a dataset designed to work well with clustering techniques, scikit-learn offers make_blobs:

In [5]:
# Load library
from sklearn.datasets import make_blobs
# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target Vector
 [0 1 1]


__As might be apparent from the solutions, make_regression returns a feature matrix of float values and a target vector of float values, while make_classification and make_blobs return a feature matrix of float values and a target vector of integers representing membership in a class.__

__scikit-learn’s simulated datasets offer extensive options to control the type of data generated. scikit-learn’s documentation contains a full description of all the parameters, but a few are worth noting.__
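
The difference in return types can be checked directly by inspecting the dtype of each array. This is a small sketch rather than part of the original recipe. The variable names (reg_features, clf_target, and so on) are chosen here for illustration, and the exact integer width may vary by platform:

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_blobs, make_classification, make_regression

# Generate one small dataset with each function
reg_features, reg_target = make_regression(n_samples=100,
                                           n_features=3,
                                           n_informative=3,
                                           random_state=1)
clf_features, clf_target = make_classification(n_samples=100,
                                               n_features=3,
                                               n_informative=3,
                                               n_redundant=0,
                                               random_state=1)
blob_features, blob_target = make_blobs(n_samples=100,
                                        n_features=2,
                                        centers=3,
                                        random_state=1)

# Print the dtype of each feature matrix and target vector
for name, X, y in [("make_regression", reg_features, reg_target),
                   ("make_classification", clf_features, clf_target),
                   ("make_blobs", blob_features, blob_target)]:
    print(name, X.dtype, y.dtype)
```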


__In make_regression and make_classification, n_informative determines the number of features that are used to generate the target vector. If n_informative is less than the total number of features (n_features), the resulting dataset will have redundant features that can be identified through feature selection techniques.__

__In addition, make_classification contains a weights parameter that allows us to simulate datasets with imbalanced classes. For example, weights = [.25, .75] would return a dataset with 25% of observations belonging to one class and 75% of observations belonging to a second class.__
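
Both parameters can be seen at work in a short experiment. The sketch below is illustrative only: the sample sizes and the choice of five features with two informative ones are assumptions made for this example. With coef = True, make_regression returns the true coefficients, and the uninformative features should have a coefficient of zero. For make_classification, the class proportions come out approximately, not exactly, equal to the weights, because a small fraction of labels is randomly reassigned by default (the flip_y parameter):

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_classification, make_regression

# Five features, but only two of them are used to generate the target
features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=5,
                                                 n_informative=2,
                                                 noise=0.0,
                                                 coef=True,
                                                 random_state=1)

# Count the features with a nonzero true coefficient
n_nonzero = np.count_nonzero(coefficients)
print('True coefficients\n', coefficients)
print('Informative features:', n_nonzero)

# Two imbalanced classes with roughly a 25/75 split
features, target = make_classification(n_samples=1000,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)

# Count observations per class and convert to proportions
class_counts = np.bincount(target)
print('Class proportions\n', class_counts / class_counts.sum())
```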

## 2.3 Loading a CSV File
### Problem
You need to import a comma-separated values (CSV) file.
### Solution
Use the pandas library’s read_csv to load a local or hosted CSV file:

In [6]:
# Load library
import pandas as pd
# Load dataset
dataframe = pd.read_csv("files\\file.csv")
# View first two rows
dataframe.head(2)

   R1  R2  R3
0   1   2   3
1   4   5   6


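By default, read_csv treats the first line of the file as the header. Passing header=None tells pandas that the file has no header row, so the first line is read as data and the columns are numbered instead:

In [7]:
# Load library
import pandas as pd
# Load dataset, treating the first line as data rather than a header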
dataframe = pd.read_csv("files\\file.csv",header=None)
# View first two rows
dataframe.head(2)

    0   1   2
0  R1  R2  R3
1   1   2   3
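
The effect of the header parameter can be reproduced without a file on disk. In this sketch, csv_text is a hypothetical in-memory stand-in for files\\file.csv, wrapped in StringIO so that read_csv can read it like a file:

```python
# Load libraries
from io import StringIO
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "R1,R2,R3\n1,2,3\n4,5,6\n"

# Default: the first line becomes the column names
with_header = pd.read_csv(StringIO(csv_text))

# header=None: the first line is kept as a data row
without_header = pd.read_csv(StringIO(csv_text), header=None)

# Compare column names and the first row of each result
print(with_header.columns.tolist())
print(without_header.columns.tolist())
print(without_header.iloc[0].tolist())
```

Note that with header=None the values R1, R2, and R3 end up in the data, so every column is read as strings instead of integers.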


## 2.4 Querying a SQL Database
### Problem
You need to load data from a database using the structured query language (SQL).
### Solution
pandas’ read_sql_query lets us run a SQL query against a database and load the result into a DataFrame:


In [8]:
# Load libraries
import pandas as pd
from sqlalchemy import create_engine
# Create a connection to the database
#database_connection = create_engine('sqlite:///#database name')
# Load data
#dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)
# View first two rows
#dataframe.head(2)
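
Because the cell above depends on a database that is not included here, the following is a self-contained sketch of the same idea. It swaps the SQLAlchemy engine for a connection from Python's built-in sqlite3 module, which read_sql_query also accepts, and uses an in-memory database. The table name data matches the query above, but the columns and rows are invented purely for illustration:

```python
# Load libraries
import sqlite3
import pandas as pd

# Create an in-memory SQLite database (nothing is written to disk)
connection = sqlite3.connect(":memory:")

# Hypothetical rows for illustration only, stored in a table named "data"
sample = pd.DataFrame({"name": ["Ana", "Ben", "Cal"],
                       "age": [25, 35, 45]})
sample.to_sql("data", connection, index=False)

# Load the whole table into a DataFrame
dataframe = pd.read_sql_query("SELECT * FROM data", connection)
print(dataframe.head(2))

# Pass values through params instead of formatting them into the query
filtered = pd.read_sql_query("SELECT * FROM data WHERE age > ?",
                             connection,
                             params=(30,))
print(filtered)

# Close the connection when finished
connection.close()
```

Passing values through params keeps user-supplied input out of the query string, which is a safer habit than building SQL with string formatting.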