# L2.1 - Introduction to SciKit-Learn

sklearn APIs are organized along the lines of our ML framework:
* Training data and preprocessing
* Model subsumes loss function and optimization procedure
* Model selection and evaluation

sklearn APIs are well designed and follow these principles:
* Consistency: all APIs share a consistent and simple interface
* Inspection: all learnable parameters as well as hyperparameters of all estimators are accessible directly via public instance variables
* Nonproliferation of classes: datasets are represented as NumPy arrays or SciPy sparse matrices instead of custom-designed classes
* Composition: existing building blocks are reused as much as possible
* Sensible defaults: parameters have default values that enable quick baseline building
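
The inspection and sensible-defaults principles can be illustrated with a minimal sketch. The choice of `Ridge` and the toy data are assumptions made for illustration only; they are not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (illustrative): y = 2*x0 + 3*x1, no noise
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1]

model = Ridge()              # no arguments: default hyperparameters are used
params = model.get_params()  # hyperparameters are publicly inspectable
print(params["alpha"])       # default regularization strength

model.fit(X, y)
# Learned parameters are exposed as attributes ending with an underscore
print(model.coef_, model.intercept_)
```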

## types of sklearn objects
1. Transformers
    * transform datasets
    * fit() learns parameters
    * transform() applies the learned transformation to a dataset
    * fit_transform() fits parameters and transforms the dataset in one step
2. Estimators
    * estimate model parameters based on training data and hyperparameters
    * fit() method
3. Predictors
    * make predictions on a dataset
    * predict() method takes a dataset as input and returns predictions
    * score() method measures the quality of predictions
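
The three object types can be sketched on synthetic data as follows. `StandardScaler` and `LinearRegression` are picked here only as familiar representatives of a transformer and a predictor; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0  # noiseless linear target

# Transformer: fit() learns per-feature mean and scale, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # same result as scaler.fit(X).transform(X)

# Estimator: fit() estimates model parameters from the training data
model = LinearRegression()
model.fit(X_scaled, y)

# Predictor: predict() returns predictions, score() measures their quality
y_pred = model.predict(X_scaled)
r2 = model.score(X_scaled, y)  # R^2 for regressors
print(y_pred.shape, r2)
```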

## Data API
Loading, generating and preprocessing data
* sklearn.datasets = loading datasets - custom as well as popular reference datasets
* sklearn.preprocessing = scaling, centering, normalization and binarization methods
* sklearn.impute = filling missing values
* sklearn.feature_selection = implements feature selection algorithms
* sklearn.feature_extraction = implements feature extraction from raw data

## Model API
Implements supervised and unsupervised models
### Regression
* sklearn.linear_model - linear, ridge, lasso models
* sklearn.tree

### Classification
* sklearn.linear_model
* sklearn.svm
* sklearn.tree
* sklearn.neighbors
* sklearn.naive_bayes
* sklearn.multiclass

### Multioutput
Implements multioutput classification and regression
* sklearn.multioutput

### Clustering
implements clustering algorithms
* sklearn.cluster

## Model evaluation API
sklearn.metrics implements different metrics for model evaluation
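
As a small illustration, two common metrics can be computed on hand-written labels. The specific metrics and numbers below are chosen for illustration and are not taken from the lecture.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: fraction of labels predicted correctly (3 of 4 here)
acc = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])

# Regression: mean of squared errors, ((3-2)^2 + (5-7)^2) / 2
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
print(acc, mse)  # 0.75 2.5
```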

## Model selection API
sklearn.model_selection implements various model selection strategies such as cross-validation, hyperparameter tuning and plotting learning curves
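
A minimal sketch of cross-validation and hyperparameter tuning, assuming the bundled iris data and a `KNeighborsClassifier`; both are illustrative choices, as is the small parameter grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Hyperparameter tuning: evaluate each n_neighbors value with cross-validation
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=5)
search.fit(X, y)
print(scores.mean(), search.best_params_)
```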

## Model inspection API
sklearn.inspection

In IPython or Jupyter, view the documentation of a function with ?function_name

# L2.2 - Data Loading

The general dataset API has three main kinds of interfaces
* loaders - used to load toy datasets bundled with sklearn - load_*
* fetchers - used to download and load datasets from the internet - fetch_*
* generators - used to generate controlled synthetic datasets - make_*

Both loaders and fetchers return a `Bunch` object, a dictionary-like object with two keys of interest:
* data - array of features
* target - array of labels

Generators return a tuple (X, y)

## Dataset loaders
These are bundled with sklearn and do not need to be downloaded from external sources
* load_iris - classification
* load_diabetes - regression
* load_digits - classification
* load_linnerud - multioutput
* load_wine - classification
* load_breast_cancer - classification

### loading external datasets
* fetch_openml() - datasets from openml.org
* pandas.io - tools to read common formats
* scipy.io - specializes in binary formats used in scientific computing, like .mat or .arff
* numpy I/O routines - loading columnar data
* datasets.load_files - directories of text files
* datasets.load_svmlight_files() - loads data in svmlight and libsvm sparse format
* skimage.io - provides tools to load images and videos into numpy arrays
* scipy.io.wavfile.read - reading WAV files

## Dataset fetchers
download and load datasets from outside

## generators
1. Regression:
    * make_regression() - produces regression targets as a sparse random linear combination of features with noise
2. Classification:
    * make_blobs() and make_classification() first create a bunch of normally distributed clusters of points and then assign one or more clusters to each class, thereby creating multiclass datasets
    * make_multilabel_classification() generates random samples with multiple labels using a specific generative process and rejection sampling
3. Clustering
    * make_blobs
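
For the clustering case, a minimal sketch of `make_blobs` could look like this; the sample count, number of centers and seed are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_blobs

# 150 points in 2-D drawn around 3 cluster centers
X, y = make_blobs(n_samples=150, n_features=2, centers=3, random_state=42)

print(X.shape, y.shape)  # (150, 2) (150,)
print(np.unique(y))      # cluster labels 0, 1, 2
```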

For managing numerical data, sklearn recommends using an optimized file format such as HDF5 to reduce data load times

# L2.3 - demonstration of sklearn dataset API

## Loaders

### iris dataset

In [1]:
# Loading iris dataset
from sklearn.datasets import load_iris
data = load_iris()
type(data)

sklearn.utils._bunch.Bunch

In [2]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [3]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [4]:
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [5]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [6]:
?load_iris

Signature: load_iris(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification
dataset.

Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive

Read more in the :ref:`User Guide <iris_dataset>`.

.. versionchanged:: 0.20
    Fixed two wrong data points according to Fisher's paper.
    The new version is the same as in R, but not as in the UCI
    Machine Learning Repository.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` object.

    .. versionadded:

In [7]:
feature_matrix, label_vector = load_iris(return_X_y=True)
feature_matrix.shape, label_vector.shape

((150, 4), (150,))

## fetchers

In [9]:
from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()

housing_data.data.shape

(20640, 8)

In [10]:
# fetch openml
from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
X.shape, y.shape

((70000, 784), (70000,))

## Generators

In [11]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

X.shape, y.shape

((100, 5), (100,))

In [12]:
# multi regression
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)
X.shape, y.shape

((100, 5), (100, 5))

In [13]:
# classification
from sklearn.datasets import make_classification

X, y = make_classification(n_samples = 100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)

X.shape, y.shape

((100, 10), (100,))

In [15]:
# multilabel
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=100, n_features=10, n_classes=5, n_labels=2)

X.shape, y.shape

((100, 10), (100, 5))

# L2.4 - Data Preprocessing

Real-world training data is not clean and has issues such as missing values, features on different scales, non-numeric attributes, etc.

Data often needs to be preprocessed to make it amenable to model training; sklearn provides a rich set of transformers for this job.

The same preprocessing should be applied to both the training and test sets.
sklearn provides pipelines that make it easier to chain multiple transformers together and apply them uniformly across train, eval and test sets.
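
A minimal sketch of such a pipeline, assuming synthetic data with a few artificially removed values. The step names (`"impute"`, `"scale"`, `"model"`) and the choice of transformers are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])
X[::10, 0] = np.nan  # knock out a few values to mimic missing data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# fit() learns the imputation and scaling statistics from the training split only
pipe.fit(X_train, y_train)

# predict() and score() reuse those learned statistics on the test split
y_pred = pipe.predict(X_test)
print(pipe.score(X_test, y_test))
```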

Once you get the data, the first job is to explore it and list the preprocessing needed.

## Pre processing methods
* Data cleaning - sklearn.preprocessing
* Feature extraction - sklearn.feature_extraction
* Feature reduction - sklearn.decomposition.PCA
* Feature expansion - sklearn.kernel_approximation

## transformer methods
* fit() - learns the transformation parameters from a training set
* transform() - applies the learned transformation to new data
* fit_transform() - performs fit() and then transform() on the same data

## feature extraction
sklearn.feature_extraction has useful APIs to extract features from data
* DictVectorizer - converts a list of mappings from feature name to feature value into a matrix
* FeatureHasher
    * high-speed, low-memory vectorizer that uses the feature hashing technique
    * instead of building a hash table of features, as the vectorizers do, it applies a hash function to the features to determine their column index in the sample matrix directly
    * this results in increased speed and reduced memory usage at the expense of inspectability: the hasher does not remember what the input features looked like and has no inverse_transform method

* sklearn.feature_extraction.image.* = functions for image datasets
* sklearn.feature_extraction.text.* = functions for text datasets
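
The difference between the two vectorizers can be sketched on a few hand-written records. The records and the choice of 8 hash columns are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "Dubai", "temperature": 35.0},
]

# DictVectorizer: string values are one-hot encoded, numeric values are kept
vec = DictVectorizer(sparse=False)
X_dict = vec.fit_transform(records)
print(vec.feature_names_)  # the learned column names remain inspectable
print(X_dict)

# FeatureHasher: column indices come from a hash function, no vocabulary is stored
hasher = FeatureHasher(n_features=8, input_type="dict")
X_hash = hasher.transform(records)  # SciPy sparse matrix
print(X_hash.shape)
```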

## dealing with missing values
sklearn.impute provides functionality to fill missing values in a dataset; MissingIndicator provides indicators for missing values
* SimpleImputer - fills missing values with the mean, median, most_frequent value or a constant
* KNNImputer - uses a k-nearest-neighbours approach: each missing value is filled with the mean value of the n closest neighbours, based on Euclidean distance
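
The two imputers can be compared on a tiny matrix with a single gap. The matrix and `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer, MissingIndicator, SimpleImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, 6.0],
    [8.0, 8.0],
])

# Mean imputation: the gap becomes the column mean, (1 + 7 + 8) / 3
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: the gap becomes the mean of the 2 nearest rows' values,
# with distances computed on the features that are present
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Boolean mask for the features that had missing values during fit
mask = MissingIndicator().fit_transform(X)

print(X_mean[1, 0], X_knn[1, 0], mask.ravel())
```

Here the two nearest rows to the incomplete one (judged by the second column) are the first and third, so the KNN estimate is (1 + 7) / 2 = 4, while the mean estimate is about 5.33.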

# L2.4.2 - handling missing data

In [None]:
# df is assumed to be an existing pandas DataFrame
# returns a boolean matrix marking which values are missing
df.isna()

# number of missing values per column
df.isna().sum()

# shows the unique values of a column
df['age'].unique()

# L2.5 - Categorical transformers

1. OneHotEncoder
    * encodes a categorical feature or label as a one-hot numeric array
2. LabelEncoder
    * encodes **target** variables with values between 0 and k-1 (k = number of distinct values)
3. OrdinalEncoder
    * encodes categorical features with values between 0 and k-1
4. LabelBinarizer
    * binarizes labels in a one-vs-all fashion
    * several regression and binary classification algorithms can be extended to the multiclass setup this way, which involves training a single regressor or classifier per class
5. MultiLabelBinarizer
    * encodes samples that each carry a set of labels as a binary indicator matrix with one column per label
6. add_dummy_feature
    * adds a column with all values equal to 1
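
The encoders above can be sketched on small hand-written inputs. The category names and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelBinarizer, LabelEncoder, MultiLabelBinarizer,
    OneHotEncoder, OrdinalEncoder, add_dummy_feature,
)

colors = np.array([["red"], ["green"], ["blue"], ["green"]])  # one categorical feature
labels = ["cat", "dog", "cat", "bird"]                        # target variable

# Categories are sorted alphabetically: blue, green, red
onehot = OneHotEncoder().fit_transform(colors).toarray()
ordinal = OrdinalEncoder().fit_transform(colors)

# Classes are sorted alphabetically: bird=0, cat=1, dog=2
encoded = LabelEncoder().fit_transform(labels)
binarized = LabelBinarizer().fit_transform(labels)  # one column per class

# Each sample carries a set of labels; output is a binary indicator matrix
multi = MultiLabelBinarizer().fit_transform([{"sci-fi", "thriller"}, {"comedy"}])

# Prepends a column of ones (useful as a bias/intercept term)
with_bias = add_dummy_feature(np.array([[1.0, 2.0], [3.0, 4.0]]))

print(onehot, ordinal.ravel(), encoded, binarized, multi, with_bias, sep="\n")
```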

# L2.6 - Numeric transformers


## Feature Scaling
* numerical features on different scales lead to slower convergence of iterative optimization procedures
* it is good practice to scale numerical features so that they are on the same scale

1. StandardScaler
    * transforms the original feature vector x into deviations from the mean, scaled by the standard deviation
    * x' = (x - mean) / std
2. MinMaxScaler
    * x' = (x - x.min) / (x.max - x.min)
3. MaxAbsScaler
    * transforms vector x so that all values fall in the range [-1, 1]
    * x' = x / MaxAbsValue, where MaxAbsValue = max(|x.max|, |x.min|)
4. FunctionTransformer
    * applies a user-defined function
5. PolynomialFeatures
    * generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree
6. KBinsDiscretizer
    * divides a continuous variable into bins; one-hot or ordinal encoding is then applied to the bin identifiers
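
A minimal sketch of these numeric transformers on tiny arrays, so that the formulas can be checked by hand. The input values and the use of `np.log1p` as the user-defined function are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer, KBinsDiscretizer, MaxAbsScaler,
    MinMaxScaler, PolynomialFeatures, StandardScaler,
)

x = np.array([[-4.0], [0.0], [2.0], [6.0]])

standard = StandardScaler().fit_transform(x)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
maxabs = MaxAbsScaler().fit_transform(x)      # x / max(|x|)

# User-defined function applied element-wise
logged = FunctionTransformer(np.log1p).fit_transform(np.array([[0.0], [1.0]]))

# Degree-2 expansion of [a, b] gives [1, a, b, a^2, a*b, b^2]
poly = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))

# Three equal-width bins over [0, 5], returned as ordinal bin indices
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(np.arange(6.0).reshape(-1, 1))

print(minmax.ravel(), maxabs.ravel(), poly, bins.ravel(), sep="\n")
```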

# L2.7&8 - Filter and Wrapper Based Feature Selection

Not all features contribute towards fitting a model. Features that do not contribute can be removed, which decreases the size of the dataset and the computational cost of fitting a model. sklearn.feature_selection provides APIs to accomplish this task.

## Filter
1. VarianceThreshold
    * removes features with variance below a certain threshold
2. SelectKBest - removes all but the k highest scoring features
3. SelectPercentile - removes all but a user-specified highest scoring percentage of features
4. GenericUnivariateSelect - performs univariate feature selection with a configurable strategy, which can be found by hyperparameter search
    * SelectFpr selects features based on a false positive rate test
    * SelectFdr selects features based on an estimated false discovery rate
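
The filter methods can be sketched as follows, assuming a small hand-made matrix for `VarianceThreshold` and the bundled iris data with `f_classif` as the scoring function for the others; these choices are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile, VarianceThreshold, f_classif,
)

# Columns 0 and 3 are constant, so their variance is zero
X_toy = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_toy)

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_k = kbest.transform(X)

# Keep the top 50% of features by the same score
X_pct = SelectPercentile(score_func=f_classif, percentile=50).fit_transform(X, y)

print(X_var.shape, X_k.shape, X_pct.shape, kbest.get_support())
```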

### Univariate scoring function
* Each API needs a scoring function to score each feature
* There are three classes of scoring functions: mutual information, chi-squared and F-statistics
* MI and F-statistics can be used in both classification and regression problems
    * mutual_info_regression, mutual_info_classif
    * f_regression, f_classif
* Chi-squared can be used only for classification - chi2

### Mutual Information
* Measures the dependency between two variables and returns a non-negative value; the higher the value, the higher the dependency

### Chi2
* Measures dependence between two variables
* Computes chi-squared statistics between non-negative features (boolean or frequencies) and the class label
* A higher value indicates a stronger dependence between the feature and the class label
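
The scoring functions can also be called directly to see the per-feature scores. This sketch assumes the bundled iris data, whose features are all non-negative; the `random_state` value is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)  # all features are non-negative, as chi2 requires

chi2_scores, chi2_pvalues = chi2(X, y)
f_scores, f_pvalues = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)  # estimates are non-negative

for name, scores in [("chi2", chi2_scores), ("F", f_scores), ("MI", mi_scores)]:
    print(name, scores.round(2))
```
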
## Wrapper
1. Recursive Feature Elimination (RFE)
    * uses an estimator to recursively remove features
    * initially fits the estimator on all features
    * obtains feature importance from the estimator and removes the least important feature
    * repeats until the desired number of features is obtained
    * RFECV can be used when the number of features to keep is not specified in advance; it finds the optimal number through cross-validation
2. SelectFromModel
    * selects the desired number of important features (as specified with max_features) above a certain threshold of feature importance, as obtained from the trained estimator
    * feature importance is obtained via coef_, feature_importances_ or importance_getter
    * the feature importance threshold can be specified either numerically or through a string argument based on built-in heuristics such as 'mean', 'median' and float multiples of these like '0.1*mean'
3. SequentialFeatureSelector
    * performs feature selection by selecting or deselecting features one by one in a greedy manner
        * forward selection - starts with zero features, finds the single feature that gives the best cross-validation score for an estimator trained on it, then repeats by adding one new feature at a time
        * backward selection - starts with all features and keeps removing them one by one
    * forward and backward selection do not in general yield the same results
    * choose the direction whose starting point is closer to the desired number of features, since it needs fewer iterations
# L2.9 - Heterogeneous features transformations

Composite transformer: 

sklearn.compose has useful classes and methods to apply transformations to subsets of features and combine the results

## ColumnTransformer
* Applies a set of transformers to columns of an array or pandas.DataFrame and concatenates the transformed outputs from the different transformers into a single matrix
* It is useful for transforming heterogeneous data by applying different transformers to separate subsets of features
* It combines different feature selection mechanisms and transformations into a single transformer object
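
A minimal sketch, assuming a small numeric array in which the second column holds integer category codes. The step names and the `sparse_threshold=0` setting (used to get a dense result) are illustrative choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column 0 is numeric; column 1 holds integer category codes (0, 1 or 2)
X = np.array([[1.0, 0], [2.0, 1], [3.0, 0], [4.0, 2]])

ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), [0]),
        ("onehot", OneHotEncoder(), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = ct.fit_transform(X)
print(X_out.shape)  # 1 scaled column + 3 one-hot columns
```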

## TransformedTargetRegressor
* Transforms the target variable y before fitting a regression model
* The predicted values are mapped back to the original space via an inverse transform
* Takes the regressor and the transformer to be applied to the target variable as arguments
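
A minimal sketch, assuming a target that is exactly linear after a log transform. The `func`/`inverse_func` pair (`np.log1p`, `np.expm1`) and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0.0, 2.0, 20).reshape(-1, 1)
y = np.expm1(1.5 * X.ravel() + 0.5)  # log1p(y) is exactly linear in X

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions to return to the original scale
)
model.fit(X, y)
y_pred = model.predict(X)
print(np.allclose(y_pred, y))
```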

# L2.10 - Dimensionality reduction by PCA

unsupervised dimensionality reduction
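
A minimal sketch of PCA as a transformer, assuming the bundled iris data and 2 components; both are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are not used: PCA is unsupervised

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # project 4 features onto 2 principal components

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```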