# Introduction to Scikit-Learn

Welcome!

We will be using the Scikit-Learn module to interactively learn about machine learning with Python. 

To install with Anaconda Prompt use:
**conda install scikit-learn**

To install with pip use:
**pip install scikit-learn**

## Algorithms

Scikit-Learn is a package that provides efficient implementations of common algorithms, or models. Because every model shares the same interface, once we learn to use one type of model, switching to another takes very little extra work.

Every algorithm is implemented via Scikit-Learn's Estimator API. A model is imported using this general pattern:

__*from sklearn.family import Model*__

Furthermore, in a Jupyter notebook, pressing Shift+Tab with the cursor at the end of the word "Model" shows that estimators come with default parameters that may be overridden during instantiation.
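The default parameters can also be inspected in code. The following is a minimal sketch using the estimator method `get_params()`; the variable name `model_example` is only illustrative and is not used elsewhere in this notebook.

```python
from sklearn.linear_model import LinearRegression

# Override one default at instantiation; every other parameter keeps its default.
model_example = LinearRegression(fit_intercept=False)

# get_params() returns a dict mapping parameter names to their current values.
params = model_example.get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value}")
```

The exact set of parameters printed depends on the installed Scikit-Learn version.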

### For example, linear regression:

In [4]:
from sklearn.linear_model import LinearRegression

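# NOTE: the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, omit it and scale features separately (e.g. with StandardScaler).
model = LinearRegression(normalize=True)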
print(model)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)


## Let's fit our model on some data!

We will first split our data into a training set and a test set. Keep in mind that this is only an introduction and we will get a lot more practice with this in later projects.

### Creating a Fake Data Set:

In [29]:
# NOTE: the sklearn.cross_validation submodule was replaced by sklearn.model_selection.
# The old module was deprecated in scikit-learn 0.18 and removed in 0.20.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape((10,2)), range(10)
print('Data: \n%s \n' % X)
print('Labels: %s' % list(y))

Data: 
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]] 

Labels: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


### Splitting our Data into Training and Testing Sets:

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print('Training Set and Labels: \n\n%s\n\n%s\n' % (X_train, y_train))
print('Testing Set and Labels: \n\n%s\n\n%s' % (X_test, y_test))

Training Set and Labels: 

[[ 2  3]
 [18 19]
 [10 11]
 [12 13]
 [ 6  7]
 [16 17]
 [14 15]]

[1, 9, 5, 6, 3, 8, 7]

Testing Set and Labels: 

[[8 9]
 [0 1]
 [4 5]]

[4, 0, 2]
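Because `train_test_split` shuffles the data, the split shown above will differ from run to run. A minimal sketch of a reproducible split follows, assuming we fix the `random_state` parameter; the names `X_demo`, `y_demo`, `split_a`, and `split_b` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same fake data set as above: 10 samples with 2 features each.
X_demo = np.arange(20).reshape((10, 2))
y_demo = list(range(10))

# Fixing random_state makes the shuffle repeatable across calls and runs.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)                    # 7 training rows, 3 testing rows
print(np.array_equal(split_a[1], split_b[1]))    # identical test sets
```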


### Training, or Fitting, our Model on the Training Data:

Recall that we instantiated a linear regression estimator that we named 'model.' What we are about to see is an example of a **supervised learning process**.

In [41]:
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

We use __*model.fit(data,labels)*__ for supervised learning.

In later projects, we will use __*model.fit(data)*__ for **unsupervised learning applications**, which can work with unlabeled data.
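As a brief preview of the unsupervised form, here is a hedged sketch using `KMeans` clustering. The estimator choice, the toy data, and the name `X_unlabeled` are assumptions made for illustration; note that `fit` receives only the data, with no labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points, with no labels attached.
X_unlabeled = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                        [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

# n_init and random_state are set explicitly so results are repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_unlabeled)  # unsupervised: data only

# labels_ holds the cluster index assigned to each training point.
print(kmeans.labels_)
```

The cluster indices themselves are arbitrary; what matters is that the first three points share one index and the last three share the other.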

### Using our Model to Predict Labels, or Values:

We will use a new estimator method to predict labels for our testing set.

For **supervised estimators** we use these methods:
- __*model.predict()*__ which, given a trained model, returns predicted labels (or values, for regression) for new data

- __*model.predict_proba()*__ which, for classification problems, returns the probability that a new observation has each categorical label

- __*model.score()*__ which returns a score where larger indicates a better fit: mean accuracy (between 0 and 1) for classifiers, and R² (at most 1, and possibly negative) for regressors
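The three methods above can be sketched on a tiny classification problem. `LogisticRegression` and the toy arrays `X_cls`, `y_cls`, and `X_new` are assumptions chosen for illustration, since `predict_proba` is not available on a regressor such as `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature, two classes: small values are class 0, large values are class 1.
X_cls = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_cls, y_cls)

X_new = np.array([[1.5], [8.5]])
labels = clf.predict(X_new)          # predicted class label per row
probs = clf.predict_proba(X_new)     # one probability per class, per row
accuracy = clf.score(X_cls, y_cls)   # mean accuracy on the given data

print(labels)
print(probs)
print(accuracy)
```

Each row of `probs` sums to 1, with one column per class in the order given by `clf.classes_`.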

For **unsupervised estimators** we use these methods:
- __*model.transform()*__ which, given a fitted unsupervised model, transforms new data into the new basis. It accepts one argument, X_new, and returns the new representation of the data based on the given model

- __*model.fit_transform()*__ which some unsupervised estimators provide to perform a fit and a transform on the same input data more efficiently
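A minimal sketch of these two methods follows, assuming `PCA` as the unsupervised estimator; the names `X_fit`, `X_new`, and `pca` are illustrative only. The data reuses the shape of the fake data set, whose points all lie on a single line.

```python
import numpy as np
from sklearn.decomposition import PCA

X_fit = np.arange(20, dtype=float).reshape((10, 2))
X_new = np.array([[20.0, 21.0]])

pca = PCA(n_components=1)

# fit_transform learns the new basis and projects the same data in one step.
X_reduced = pca.fit_transform(X_fit)

# transform projects unseen data onto the basis that was already learned.
X_new_reduced = pca.transform(X_new)

print(X_reduced.shape, X_new_reduced.shape)
print(pca.explained_variance_ratio_)
```

Since the points are collinear, a single component captures essentially all of the variance.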

In [42]:
predictions = model.predict(X_test)

In [43]:
predictions

array([4.00000000e+00, 1.63757896e-15, 2.00000000e+00])
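The predictions above match the test labels up to floating-point error (the value `1.63757896e-15` is effectively 0). One way to check this programmatically is sketched below. It rebuilds the data so that it runs on its own, and the names `X_demo`, `y_demo`, `reg`, and `preds` are illustrative, as is the fixed `random_state`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Rebuild the fake data set: each label is exactly half of the first feature.
X_demo = np.arange(20).reshape((10, 2))
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

reg = LinearRegression().fit(X_tr, y_tr)
preds = reg.predict(X_te)

# allclose compares within a small tolerance, absorbing floating-point noise.
print(np.allclose(preds, y_te))

# For regressors, score() reports R² on the given data.
print(reg.score(X_te, y_te))
```

Because the labels are an exact linear function of the features, the fit here is essentially perfect; real data sets will rarely behave this way.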