# Fundamentals of machine learning using Python 
## Introduction to scikit-learn library

***
<br>

## What is Scikit-Learn (Sklearn)?

<img src="img/sklearn-logo.png" style="width:300px">

* Scikit-learn (Sklearn) is one of the most widely used open-source libraries for machine learning in Python.
* It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, through a consistent interface.
* The library is largely written in Python and is built upon NumPy, SciPy and Matplotlib.
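
To make the idea of a consistent interface concrete, here is a minimal sketch. The choice of the two models and of the `iris` dataset is an assumption made for illustration only; the point is that both estimators are driven through the same `fit`, `predict` and `score` calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two very different models (chosen only for illustration), used through the same calls.
models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
predictions = {}
for model in models:
    model.fit(X, y)                            # learn from features X and labels y
    name = type(model).__name__
    predictions[name] = model.predict(X[:3])   # predict the class of the first three samples
    # score() here is accuracy on the training data, so it is optimistic
    print(name, predictions[name], round(model.score(X, y), 3))
```

Swapping one model for another usually requires changing only the line that constructs it.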

## Origin of Scikit-Learn

* It was originally called scikits.learn and was initially developed by __David Cournapeau__ as a Google Summer of Code project in __2007__.
* Later, in 2010, Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from INRIA (French Institute for Research in Computer Science and Automation), took over leadership of the project and made the first public release (v0.1 beta) on __1 February 2010__.

## Features of Scikit-Learn

* Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library focuses on modeling the data.
* Some of the most popular groups of models provided by Sklearn are as follows:
    * __Supervised Learning algorithms__ − Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.
    * __Unsupervised Learning algorithms__ − It also includes the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
    * __Clustering__ − These models are used for grouping unlabeled data.
    * __Cross Validation__ − It is used to estimate the accuracy of supervised models on unseen data.
    * __Dimensionality Reduction__ − It is used for reducing the number of attributes in data, which can then be used for summarisation, visualisation and feature selection.
    * __Ensemble methods__ − As the name suggests, these are used for combining the predictions of multiple supervised models.
    * __Feature extraction__ − It is used to define attributes (features) from raw data such as images and text.
    * __Feature selection__ − It is used to identify useful attributes to create supervised models.
* It is an open-source library and is also commercially usable under the BSD license.
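
As a rough illustration of how several of these groups fit together, the sketch below chains dimensionality reduction (PCA) with an ensemble method (a random forest) and evaluates the result with cross-validation. The dataset, the number of components and the number of trees are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Dimensionality reduction (4 -> 2 features) followed by an ensemble classifier.
model = make_pipeline(
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# 5-fold cross-validation: each fold is scored on samples the model did not see while fitting.
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape, scores.mean())
```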

## What is machine learning?

<img src="img/machine-learning.jpg" style="width:400px">

* In general, a learning problem considers a set of $n$ samples of data and then tries to predict properties of unknown data.
* If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several attributes or features.

Learning problems fall into a few main categories:
* __Supervised learning__, in which the data comes with additional attributes that we want to predict. This problem can be either:
    * __classification__
        * Samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
        * An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories.
        * Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and each of the $n$ samples is to be labeled with the correct category or class.
    * __regression__
         * If the desired output consists of one or more continuous variables, then the task is called regression.
         * An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
* __Unsupervised learning__, in which the training data consists of a set of input vectors $x$ without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data (__clustering__), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions for visualization.
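
The difference between the supervised and unsupervised settings can be sketched as follows. This is only an illustration under assumed choices (the `iris` data, a k-nearest-neighbours classifier, k-means with three clusters); regression is not covered here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised (classification): learn from labeled samples, then predict held-out samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)   # fraction of held-out samples labeled correctly

# Unsupervised (clustering): group the same samples without ever looking at y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_           # one cluster index per sample

print(accuracy, cluster_ids[:10])
```

Note that the cluster indices are arbitrary: cluster `0` need not correspond to class `0` in `y`.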



## Loading an example dataset

* scikit-learn comes with a few standard datasets.
* For instance, it includes the `iris` and `digits` datasets for classification and the `diabetes` dataset for regression.
* Each loaded object has, among other things, `data` and `target` attributes containing, respectively, the feature values of the samples in the set and their class memberships (or, for regression datasets, the values to predict).

In [1]:
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)
iris.data[:10], iris.target[:10]

(150, 4)


(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

In [2]:
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)
digits.data[:5], digits.target[:5]

(1797, 64)


(array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
         15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
         12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
          0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
         10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
          9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
         15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
          0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
         16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
        [ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
         14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
          1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
          0.,  0.,  9., 16., 16.,  5.,  0.,  0

In [3]:
from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.data.shape)
diabetes.data[:5], diabetes.target[:5]

(442, 10)


(array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
         -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
         -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
         -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
        [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
          0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
        [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
          0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]]),
 array([151.,  75., 141., 206., 135.]))
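
Besides `data` and `target`, the loaded objects also carry descriptive metadata. The sketch below shows two such attributes on the `iris` dataset, plus the `return_X_y=True` shortcut for getting the arrays directly; which attributes are worth inspecting is a choice made for this example.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object behaves like a dictionary with attribute access.
print(iris.feature_names)        # names of the four columns of iris.data
print(list(iris.target_names))   # class names indexed by the integers in iris.target

# return_X_y=True skips the container and returns the two arrays directly.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)
```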

## --- Exercise ---

Using the `sklearn.datasets.load_breast_cancer()` function, load the `breast cancer` dataset, then check the size of the data and the feature names.

In [None]:
# Write your code here
