### Steps in a Machine Learning Project
A Machine Learning Project involves the following steps:

#### Defining the Problem:
Define a problem statement, which addresses a business problem.
#### Obtaining the Source Data:
The raw data required to build a model may reside in one or more sources, such as relational databases and social networking sites.
#### Understanding Data Through Visualization:
Explore the data and understand its important characteristics, such as its mean and spread.
#### Preparing Data for Machine Learning Algorithms:
Captured raw data usually cannot be fed directly to a Machine Learning algorithm. The raw data sets have to be manipulated or transformed through one or more pre-processing steps.
#### Choosing an algorithm:
Based on the features of the data set, pick a suitable algorithm.
#### Building the Model:
Train the algorithm on the chosen training data set and verify its performance with a metric.
#### Fine-tuning the Model:
Identify the values of the key parameters of the chosen model that give better performance.
#### Use the best model:
Use the best-performing model to address the defined problem.

### Scikit-Learn Utilities
The scikit-learn library provides many utilities for the following Machine Learning tasks:

1. Preprocessing
2. Model Selection
3. Classification
4. Regression
5. Clustering
6. Dimensionality Reduction

### Steps with scikit-learn
One typically performs the following steps while working on a Machine Learning problem with scikit-learn:

1. Cleaning raw data set.
2. Further transforming with many scikit-learn pre-processing utilities.
3. Splitting data into train and test sets with train_test_split utility.
4. Creating a suitable model with default parameters.
5. Training the Model using fit function.
6. Evaluating the Model and fine-tuning it.
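
The steps above can be sketched end to end as follows. This is a minimal illustration, not a prescribed recipe: the iris data set, `StandardScaler` for step 2, and `KNeighborsClassifier` for step 4 are assumptions chosen for the example. The split is done before scaling so that the scaler only learns from the training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Step 1: iris ships clean, so no cleaning is needed here
X, y = load_iris(return_X_y=True)

# Step 3: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 2: fit the preprocessing on the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Steps 4 and 5: create a model with default parameters and train it
model = KNeighborsClassifier().fit(X_train_s, y_train)

# Step 6: evaluate on the held-out data
test_accuracy = model.score(X_test_s, y_test)
print(test_accuracy)
```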

In [1]:
from sklearn.datasets import load_iris

dataset = load_iris()
# DESCR - printing description of data
print(dataset.DESCR)

# Using parameters "return_X_y" for returning (data, target) and "as_frame" to return as dataframe
data, target = load_iris(return_X_y = True, as_frame=True)
data

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [2]:
#returns feature data
dataset.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [3]:
# returns target column
dataset.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [4]:
# returns feature names i.e. columns of feature data
dataset.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [5]:
# returns target names
dataset.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [6]:
# DataFrame holding data and target; it is None unless load_iris(as_frame=True) was used
dataset.frame

In [7]:
#filename location
dataset.filename

'C:\\Users\\Shubham\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'

### Preprocessing - Introduction
**Preprocessing** is a step in which raw data is modified or transformed into a format suitable for downstream processing.

**scikit-learn** provides many preprocessing utilities, such as:

1. Standardization (mean removal)
2. Scaling
3. Normalization
4. Binarization
5. One Hot Encoding
6. Label Encoding
7. Imputation

## Standardization
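**Standardization or Mean Removal** is the process of rescaling each feature so that it has mean 0 and variance 1. It shifts and scales the values but does not change the shape of their distribution.

- This can be achieved using StandardScaler.
- An example with its output is shown below.

Methods:
1. StandardScaler(): creates the scaler.
2. fit(data): computes the mean and standard deviation of each feature.
3. transform(data): applies the stored mean and standard deviation to the data.
4. fit_transform(data): performs fit and transform in one call.

Calling fit and transform separately is useful in a pipeline, or when the statistics learned from the training data must be reused on test data; otherwise fit_transform() can be used directly.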
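
To make the roles of the methods concrete, here is a small sketch on a made-up 3x2 array (the array is an assumption for illustration). It shows that `fit` stores the per-column mean and standard deviation in `mean_` and `scale_`, and that `fit` followed by `transform` gives the same result as `fit_transform` and as the manual formula `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaler = StandardScaler()
scaler.fit(X)                                   # learns per-column mean_ and scale_
print(scaler.mean_, scaler.scale_)

X_two_step = scaler.transform(X)                # applies (x - mean_) / scale_
X_one_step = StandardScaler().fit_transform(X)  # fit and transform in one call
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_two_step, X_one_step), np.allclose(X_two_step, X_manual))
```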

In [25]:
from sklearn.datasets import load_breast_cancer
import sklearn.preprocessing as preprocessing

breast_cancer= load_breast_cancer()
standardizer = preprocessing.StandardScaler()
breast_cancer_standardized = standardizer.fit_transform(breast_cancer.data)

print('Mean of each feature after Standardization :\n\n')
print(breast_cancer_standardized.mean(axis=0))
print('\nStd. of each feature after Standardization :\n\n')
print(breast_cancer_standardized.std(axis=0))

Mean of each feature after Standardization :


[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2.57517109e-15]

Std. of each feature after Standardization :


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


## Scaling
Scaling transforms existing data values to lie between a minimum and a maximum value.
1. MinMaxScaler transforms data to the range 0 to 1.
2. MaxAbsScaler transforms data to the range -1 to 1.
- Transforming the **breast_cancer** dataset through scaling is shown in the next three examples.

In [9]:
min_max_scaler = preprocessing.MinMaxScaler().fit(breast_cancer.data)

breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)

By default, data is transformed to the range 0 to 1. The range can be customized with the feature_range argument, as shown in the next example.

In [10]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)

breast_cancer_minmaxscaled10 = min_max_scaler.transform(breast_cancer.data)

#### Using MaxAbsScaler
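Using MaxAbsScaler, each feature is divided by its maximum absolute value, so that the maximum absolute value of each feature becomes 1. It is intended for data that is already centered at zero, or for sparse data.

MaxAbsScaler transforms data to the range -1 to 1. The breast_cancer features are all non-negative, so the output below lies between 0 and 1.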

In [11]:
max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)

breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)
breast_cancer_maxabsscaled

array([[0.63998577, 0.26425662, 0.65145889, ..., 0.91202749, 0.69313046,
        0.57301205],
       [0.73176805, 0.45239308, 0.70503979, ..., 0.63917526, 0.41428141,
        0.42901205],
       [0.70046247, 0.54098778, 0.68965517, ..., 0.83505155, 0.54429045,
        0.42207229],
       ...,
       [0.59053718, 0.71486762, 0.57453581, ..., 0.48728522, 0.33413679,
        0.37686747],
       [0.73283529, 0.74669043, 0.74323607, ..., 0.91065292, 0.6156975 ,
        0.59759036],
       [0.27605834, 0.62474542, 0.25421751, ..., 0.        , 0.43250979,
        0.33922892]])
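
The difference between the two scalers is easiest to see on a column that contains negative values. The sketch below uses a made-up single-column array (an assumption for illustration) and applies the formulas `(x - min) / (max - min)` and `x / max(|x|)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[-4.0], [0.0], [2.0]])

# MinMaxScaler: (x - min) / (max - min), so the column spans exactly [0, 1]
min_max = MinMaxScaler().fit_transform(X)

# MaxAbsScaler: x / max(|x|), so zeros stay zero and values lie within [-1, 1]
max_abs = MaxAbsScaler().fit_transform(X)

print(min_max.ravel())   # values 0, 2/3, 1
print(max_abs.ravel())   # values -1, 0, 0.5
```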

## Normalization
Normalization scales each sample (row) to have a unit norm.
Normalization can be achieved with 'l1', 'l2', and 'max' norms.
- The 'l1' norm makes the sum of absolute values of each row equal to 1, and the 'l2' norm makes the sum of squares of each row equal to 1.
- The 'l1' norm is less sensitive to outliers than the 'l2' norm.

The 'l2' norm is used by default, so removing outliers is recommended before applying it.
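
The row-wise nature of the three norms can be checked directly. This is a small sketch on a made-up 2x2 array (an assumption for illustration); it verifies the property each norm enforces on every row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, -1.0]])

X_l1 = Normalizer(norm='l1').fit_transform(X)
X_l2 = Normalizer(norm='l2').fit_transform(X)
X_max = Normalizer(norm='max').fit_transform(X)

print(np.abs(X_l1).sum(axis=1))      # each row's absolute values sum to 1
print((X_l2 ** 2).sum(axis=1))       # each row's squares sum to 1
print(np.abs(X_max).max(axis=1))     # each row's largest absolute value is 1
```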



In [12]:
normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)

breast_cancer_normalized = normalizer.transform(breast_cancer.data)
breast_cancer_normalized

array([[5.04461573e-03, 2.91067878e-03, 3.44346198e-02, ...,
        7.44214015e-05, 1.29017660e-04, 3.33410122e-05],
       [5.49864230e-03, 4.75016401e-03, 3.55259874e-02, ...,
        4.97203436e-05, 7.35112606e-05, 2.37962634e-05],
       [5.81273050e-03, 6.27326171e-03, 3.83776011e-02, ...,
        7.17365928e-05, 1.06660210e-04, 2.58546946e-05],
       ...,
       [7.00344278e-03, 1.18467875e-02, 4.56911357e-02, ...,
        5.98245895e-05, 9.35761210e-05, 3.29921220e-05],
       [5.68390968e-03, 8.09267334e-03, 3.86561042e-02, ...,
        7.31182555e-05, 1.12767664e-04, 3.42138252e-05],
       [1.18802525e-02, 3.75697675e-02, 7.33636209e-02, ...,
        0.00000000e+00, 4.39538722e-04, 1.07764300e-04]])

## Binarization
Binarization is the process of transforming data points to 0 or 1 based on a given threshold.

- Any value above the threshold is transformed to 1, and any value equal to or below the threshold is transformed to 0.
- By default, a threshold of 0 is used.
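
A short sketch of the behaviour at the threshold itself, using three made-up values around 3.0 (an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[2.9, 3.0, 3.1]])
binary = Binarizer(threshold=3.0).fit_transform(X)
print(binary)   # 3.0 itself is not above the threshold, so it maps to 0
```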

In [13]:
binarizer = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])

[[1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]]


## OneHotEncoder
OneHotEncoder converts categorical values into one-hot vectors. In a one-hot vector, every category is represented by its own binary attribute, which takes only the values 0 and 1.

The example below creates three binary attributes for the categories red, green, and blue.

In [15]:
# example of a one hot encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# define data
data = pd.DataFrame({"Color":["red", "green", "blue"]})

# define one hot encoding
# (the sparse=False argument was renamed sparse_output in scikit-learn 1.2;
#  converting the default sparse result with toarray() works on either version)
encoder = OneHotEncoder()

# transform data
onehot = encoder.fit_transform(data).toarray()

print(onehot)
data.head()

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


Unnamed: 0,Color
0,red
1,green
2,blue


### One-Hot Encoding with get_dummies()

In [16]:
import pandas as pd
import numpy as np
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# generate binary values using get_dummies
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"] )
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(dum_df)
bridge_df

Unnamed: 0,Bridge_Types,Type_is_Arch,Type_is_Beam,Type_is_Cable,Type_is_Cantilever,Type_is_Suspension,Type_is_Tied Arch,Type_is_Truss
0,Arch,1,0,0,0,0,0,0
1,Beam,0,1,0,0,0,0,0
2,Truss,0,0,0,0,0,0,1
3,Cantilever,0,0,0,1,0,0,0
4,Tied Arch,0,0,0,0,0,1,0
5,Suspension,0,0,0,0,1,0,0
6,Cable,0,0,1,0,0,0,0


## Label Encoding
Label Encoding is a step in which categorical features are represented as categorical integers. An example of transforming the categorical values ["benign", "malignant"] into [0, 1] is shown below.

In [17]:
labels = ['malignant', 'benign', 'malignant', 'benign']

labelencoder = preprocessing.LabelEncoder()

labelencoder = labelencoder.fit(labels)

bc_labelencoded = labelencoder.transform(breast_cancer.target_names)
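
The mapping that LabelEncoder learns can be inspected through its `classes_` attribute and reversed with `inverse_transform`. The sketch below reuses the labels from the cell above; integers are assigned in sorted order of the labels.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(['malignant', 'benign', 'malignant', 'benign'])

print(encoder.classes_)                       # classes are stored in sorted order
codes = encoder.transform(['malignant', 'benign'])
print(codes)
decoded = encoder.inverse_transform(codes)    # maps the integers back to labels
print(decoded)
```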

## Imputation
Imputation replaces missing values with the median, the mean, or the most common value of the column in which the missing values occur.

The example below replaces missing values, represented by np.nan, with the mean of the respective column.

In [31]:
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed preprocessing.Imputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)
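
A small sketch of mean imputation with `SimpleImputer` from `sklearn.impute`, the class this notebook also uses in a later cell. The 3x2 array with two missing entries is made up for illustration; the learned column means are available in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X)

print(imputer.statistics_)   # per-column means computed from the observed values
print(X_imputed)
```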

In [None]:
import sklearn.preprocessing as preprocessing

x = [[7.8], [1.3], [4.5], [0.9]]
print(preprocessing.Binarizer().fit(x).transform(x))

# ML - Models

## KNN
### 1. Nearest Neighbors Regression
scikit-learn implements the following two regressors:

1. **KNeighborsRegressor** predicts based on the k nearest neighbors of each query point.
2. **RadiusNeighborsRegressor** predicts based on the neighbors present in a fixed radius r of the query point.
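
The two regressors can be compared on a tiny one-dimensional data set (made up for illustration). For the query point 1.5, both the two nearest neighbors and the neighbors within radius 1.0 are the training points at 1 and 2, so both predictions are the average of their targets.

```python
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

X = [[0], [1], [2], [3]]
y = [0.0, 0.0, 1.0, 1.0]

# k = 2: the prediction at 1.5 averages the targets of the points at 1 and 2
knn_reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
knn_pred = knn_reg.predict([[1.5]])

# radius = 1.0: averages every training target within distance 1.0 of the query
radius_reg = RadiusNeighborsRegressor(radius=1.0).fit(X, y)
radius_pred = radius_reg.predict([[1.5]])

print(knn_pred, radius_pred)
```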

### 2. Nearest Neighbors Classification
The nearest neighbors method finds a predefined number of training points closest to a sample point and predicts the label of the sample from them.

sklearn.neighbors provides utilities for unsupervised and supervised neighbors-based learning methods.

In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
#### scikit-learn implements two different nearest neighbors classifiers:

1. **KNeighborsClassifier** - classifies based on k nearest neighbors of every query point, where k is an integer value specified by the user.

2. **RadiusNeighborsClassifier** - classifies based on the number of neighbors present in a fixed radius r of every training point.


#### KDTree and BallTree Classes
Alternatively, one can use the KDTree or BallTree classes directly to find nearest neighbors. This is the functionality wrapped by the NearestNeighbors class.
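
A minimal sketch of querying a KDTree directly, on four made-up 2-D points (an assumption for illustration). `BallTree` from the same module can be used in the same way.

```python
import numpy as np
from sklearn.neighbors import KDTree

points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 2.0],
                   [5.0, 5.0]])

tree = KDTree(points)

# query returns distances and indices of the k nearest stored points
dist, ind = tree.query(np.array([[0.0, 0.0]]), k=2)
print(ind)    # indices into `points`, nearest first
print(dist)
```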


#### Important parameters - 
The first value in each list is the default.
1. n_neighbors = number of neighbors to use (default 5)
2. weights = ['uniform', 'distance']. Under some circumstances, it is better to weight the neighbors so that nearer neighbors contribute more to the fit; this is done with weights='distance'.
3. algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
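
One way to see the effect of these parameters is to compare a few settings with cross-validation. The sketch below is illustrative only: the breast cancer data, the candidate values of `n_neighbors`, and the use of `cross_val_score` with 5 folds are assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

results = {}
for weights in ('uniform', 'distance'):
    for k in (3, 5, 11):
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights, algorithm='auto')
        # mean accuracy over 5 cross-validation folds
        results[(weights, k)] = cross_val_score(clf, X, y, cv=5).mean()

for key, score in results.items():
    print(key, round(score, 3))
```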

In [19]:
import sklearn.datasets as datasets

from sklearn.model_selection import train_test_split

# neighbors is the module used 
from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()  # Loading the data set

X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
           stratify=cancer.target, random_state=42)

knn_classifier = KNeighborsClassifier()   

knn_classifier = knn_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9460093896713615
Accuracy of Test Data : 0.9300699300699301


## Decision Tree
1. DecisionTreeRegressor
2. DecisionTreeClassifier

#### Advantages of Decision Trees
1. Decision Trees are easy to understand.
2. They often do not require any preprocessing.
3. Decision Trees can learn from both numerical and categorical data.

#### Disadvantages of Decision Trees
1. Decision trees can become overly complex; such trees do not generalize well, which leads to overfitting. Overfitting can be addressed by requiring a minimum number of samples at a leaf node or by limiting the maximum depth of the tree.

2. A small variation in data can result in a completely different tree. This problem can be addressed by using decision trees within an ensemble.

### Important Parameters
1. max_depth = int, Number of levels
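
The effect of restricting a tree can be sketched by comparing an unrestricted tree with one limited by `max_depth` and `min_samples_leaf` (the latter is a scikit-learn parameter not named in the text above). The values 3 and 5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_tr, y_tr)

# depth, training accuracy and test accuracy of each tree
for name, tree in (('unrestricted', full_tree), ('restricted', small_tree)):
    print(name, tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```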

In [20]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()   

dt_classifier = dt_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.9230769230769231


## Ensemble Methods
Ensemble methods combine the predictions of several base estimators to improve generalization.

**Ensemble methods are of two types:**

1. Averaging Methods: They build several base estimators independently and finally average their predictions.
E.g.: Bagging Methods, Forests of randomised trees
2. Boosting Methods: They build base estimators sequentially and try to reduce the bias of the combined estimator.
E.g.: Adaboost, Gradient Tree Boosting

### 1. Bagging Methods
`Bagging Methods` draw random subsets of the original dataset, build an estimator on each subset, and aggregate the individual predictions to form a final one.

`BaggingClassifier` and `BaggingRegressor` are the utilities from sklearn.ensemble to deal with Bagging.

#### Randomized Trees
sklearn.ensemble offers two types of algorithms based on randomized trees: Random Forest and Extra-Trees (extremely randomized trees).

- RandomForestClassifier and RandomForestRegressor classes are used to deal with random forests.
- In random forests, each estimator is built from a sample drawn with replacement from the training set.

ExtraTreesClassifier and ExtraTreesRegressor classes are used to deal with extremely randomized forests.

In extremely randomized forests, more randomness is introduced, which further reduces the variance of the model.

### 2. Boosting Methods
Boosting Methods combine several weak models to create an improved ensemble.

- sklearn.ensemble also provides the following boosting algorithms:
  - AdaBoostClassifier
  - GradientBoostingClassifier

### Important Parameters
1. n_estimators = int, number of estimators 
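
The ensemble classes named in this section share the `n_estimators` parameter, so they can be compared side by side. The sketch below is illustrative: the breast cancer data, `n_estimators=50`, and the fixed `random_state` values are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    'bagging': BaggingClassifier(n_estimators=50, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'extra trees': ExtraTreesClassifier(n_estimators=50, random_state=0),
    'adaboost': AdaBoostClassifier(n_estimators=50, random_state=0),
    'gradient boosting': GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# test accuracy of each ensemble
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(scores[name], 3))
```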

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier = rf_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.958041958041958


## SVM
Support Vector Machines (SVMs) separate data points using decision planes, which separate objects belonging to different classes in a higher dimensional space.

An SVM uses a kernel function, chosen to suit the data, to separate data points into two or more classes.

Commonly used kernels are:

1. linear
2. polynomial
3. rbf
4. sigmoid

### 1. Support Vector Classification
scikit-learn provides the following three utilities for performing Support Vector Classification:

- SVC
- NuSVC: Same as SVC, but uses the parameter nu to control the number of support vectors.
- LinearSVC: Similar to SVC with the kernel parameter set to 'linear'.

### 2. Support Vector Regression
scikit-learn provides the following three utilities for performing Support Vector Regression.

- SVR
- NuSVR
- LinearSVR

#### Advantages of SVMs
1. SVM can distinguish the classes in a higher dimensional space.
2. SVM algorithms are memory efficient.
3. SVMs are versatile, and a different kernel can be used by a decision function.

#### Disadvantages of SVMs
1. SVMs do not perform well on high dimensional data with many samples.
2. SVMs work better only with Preprocessed data.
3. They are harder to visualize.

- Which parameter of SVC is used for fine-tuning the model? => 'C', which sets the regularization strength.

- What happens when a very small value is used for parameter C in support vector machines? => Regularization becomes stronger: the margin widens and more training points may be misclassified, which can help on noisy data but may underfit.

- Which approach is used by SVC and NuSVC for multi-class classification? => 'one vs one'
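
The influence of C can be checked empirically by scoring a few values with cross-validation. This is a sketch under assumptions: the breast cancer data, the candidate values of C, the RBF kernel, and a scaling pipeline built with `make_pipeline` are illustrative choices that do not appear in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

c_scores = {}
for C in (0.001, 0.1, 1.0, 10.0):
    # scaling is part of the pipeline so each fold is scaled on its own training part
    clf = make_pipeline(StandardScaler(), SVC(C=C, kernel='rbf'))
    c_scores[C] = cross_val_score(clf, X, y, cv=5).mean()
    print(C, round(c_scores[C], 3))
```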

In [35]:
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train, Y_train) 

# Getting the support vectors
print(svm_classifier.support_vectors_)

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

[[1.669e+01 2.020e+01 1.071e+02 ... 8.737e-02 4.677e-01 7.623e-02]
 [1.348e+01 2.082e+01 8.840e+01 ... 2.258e-01 2.807e-01 1.071e-01]
 [1.495e+01 1.757e+01 9.685e+01 ... 1.667e-01 3.414e-01 7.147e-02]
 ...
 [1.420e+01 2.053e+01 9.241e+01 ... 1.339e-01 2.534e-01 7.858e-02]
 [1.390e+01 1.924e+01 8.873e+01 ... 8.150e-02 2.356e-01 7.603e-02]
 [1.571e+01 1.393e+01 1.020e+02 ... 1.374e-01 2.723e-01 7.071e-02]]
Accuracy of Train Data : 0.9178403755868545
Accuracy of Test Data : 0.9230769230769231


#### Improving Accuracy Using Scaled Data
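In the following example, the input data is standardized with StandardScaler, with the aim of improving the accuracy of the SVM classifier. For scaling to have an effect, the classifier has to be fit on the standardized features; the cell below still fits on the unscaled X_train and X_test, so its scores match the previous cell.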

In [27]:
import sklearn.preprocessing as preprocessing
from sklearn import metrics

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

# making the predictions
Y_pred = svm_classifier.predict(X_test)

# Viewing the classification report
print('Classification report : \n',metrics.classification_report(Y_test, Y_pred))

Accuracy of Train Data : 0.9178403755868545
Accuracy of Test Data : 0.9230769230769231
Classification report : 
               precision    recall  f1-score   support

           0       0.96      0.83      0.89        53
           1       0.91      0.98      0.94        90

    accuracy                           0.92       143
   macro avg       0.93      0.90      0.92       143
weighted avg       0.93      0.92      0.92       143
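
A sketch of one way to make sure the scaled data actually reaches the classifier, assuming a pipeline built with `make_pipeline` (not used in the original cell). The pipeline fits the scaler on the training split only and applies the same scaling at prediction time; an SVC on the raw features is fitted alongside for comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# baseline: SVC on the raw features
raw_svc = SVC().fit(X_tr, y_tr)

# the pipeline fits the scaler on the training split and reuses it at predict time
scaled_svc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svc.score(X_te, y_te)
scaled_acc = scaled_svc.score(X_te, y_te)
print('raw   :', raw_acc)
print('scaled:', scaled_acc)
```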



## Clustering
- Clustering is an unsupervised learning technique.

- It is typically used to group similar data points into clusters.

#### Major clustering algorithms that can be implemented using scikit-learn are:
1. K-means Clustering
2. Agglomerative clustering
3. DBSCAN clustering
4. Mean-shift clustering
5. Affinity propagation
6. Spectral clustering

### 1. K-Means Clustering
In K-means Clustering, the entire data set is grouped into k clusters.

**Steps involved are:**
1. k centroids are chosen randomly.
2. The distance of each data point from the k centroids is calculated, and the data point is assigned to the cluster of the nearest centroid.
3. The centroids of the k clusters are recomputed.
4. Steps 2 and 3 are repeated until the assignment of data points to clusters stops changing (convergence).

KMeans from sklearn.cluster can be used for K-means clustering.
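
The four steps above can be sketched in plain NumPy. This is a teaching sketch, not how `KMeans` is implemented in scikit-learn: the function name `kmeans`, the choice of initial centroids from the data points, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain NumPy sketch of the four K-means steps."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# two well-separated groups of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```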

### 2. Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a bottom-up approach.

**Steps involved are:**
1. Each data point is treated as a single cluster at the beginning.
2. The distance between every pair of clusters is computed, and the two nearest clusters are merged.
3. The above step is repeated until a single cluster is formed.

AgglomerativeClustering from sklearn.cluster can be used for achieving this. The way two clusters are merged is set by the linkage type: ward, complete, or average.
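
A short sketch of the three linkage types on a made-up data set with two obvious groups (an assumption for illustration). Setting `n_clusters=2` stops the merging when two clusters remain instead of continuing down to one.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

agg_labels = {}
for linkage in ('ward', 'complete', 'average'):
    # merging stops when n_clusters clusters remain
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    agg_labels[linkage] = model.fit_predict(X)
    print(linkage, agg_labels[linkage])
```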

### 3. Mean Shift Clustering
Mean Shift Clustering aims at discovering dense areas.

**Steps Involved:**

1. Identify blob areas, starting from randomly guessed centroids.
2. Compute the mean of the points around each centroid and shift the centroid to it, if it differs.
3. Repeat the above step until the centroids converge.

`make_blobs` from `sklearn.datasets` can be used to generate sample blob data. MeanShift from `sklearn.cluster` can be used to perform Mean Shift clustering.
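
A minimal sketch of Mean Shift on synthetic blobs. Note that `make_blobs` is imported from `sklearn.datasets`. The blob centers, `cluster_std`, and the `bandwidth` value of 2.0 are assumptions picked for this toy data.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# synthetic data: three compact blobs around known centers
centers = [[0, 0], [10, 10], [-10, 10]]
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

# bandwidth is the radius of the window used to compute each shifted mean
mean_shift = MeanShift(bandwidth=2.0).fit(X)

print(mean_shift.cluster_centers_)
print(len(set(mean_shift.labels_)))
```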

### 4. Affinity Propagation
Affinity Propagation generates clusters by passing messages between pairs of data points, until convergence.

- AffinityPropagation class from sklearn.cluster can be used.
- The above class can be controlled with two major parameters:
   - preference: It controls the number of exemplars to be chosen by the algorithm.
   - damping: It controls numerical oscillations while updating messages.
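
A small sketch showing where the two parameters are passed. The six points are made up for illustration; `damping=0.5` and `preference=None` are the scikit-learn defaults, written out explicitly, and `random_state` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# damping must lie in [0.5, 1); preference=None lets scikit-learn use the
# median input similarity for every point
ap = AffinityPropagation(damping=0.5, preference=None, random_state=5).fit(X)

print(ap.cluster_centers_indices_)   # indices of the exemplars
print(ap.labels_)
```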
   
### 5. Spectral Clustering
Spectral Clustering is well suited to data that is connected but not necessarily compact.

In general, the following steps are followed:

1. Build an affinity matrix of data points.
2. Embed data points in a lower dimensional space.
3. Use a clustering method like k-means to partition the points in the lower dimensional space.

spectral_clustering from sklearn.cluster can be used for achieving this.
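
A sketch using the estimator class `SpectralClustering` from `sklearn.cluster`, as an alternative to the function named above. The two-moons data from `make_moons` and the parameter values are assumptions chosen to illustrate connected, non-compact clusters.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# two interleaved half-circles: connected shapes that are not compact blobs
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# nearest-neighbor affinity matrix, spectral embedding, then k-means on the embedding
spectral = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                              n_neighbors=10, assign_labels='kmeans',
                              random_state=0)
spectral_labels = spectral.fit_predict(X)
print(spectral_labels[:10])
```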

In [28]:
from sklearn.cluster import KMeans
kmeans_cluster = KMeans(n_clusters=2)
kmeans_cluster = kmeans_cluster.fit(X_train) 
kmeans_cluster.predict(X_test)

array([1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

### Evaluating a Clustering algorithm
A clustering algorithm is mainly evaluated using the following scores:

1. Homogeneity: Checks whether each cluster contains only members of a single class.
2. Completeness: Checks whether all members of a given class are assigned to the same cluster.
3. V-measure: Harmonic mean of Homogeneity and Completeness.
4. Adjusted Rand index: Measures similarity of two assignments.
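
The four scores can be illustrated on tiny hand-made labelings (made up for this example). scikit-learn's functions take the true labels first and the cluster labels second.

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1]

# each cluster holds a single class, but both classes are split: homogeneous, not complete
split = [0, 1, 2, 3]
# everything in one cluster: complete, not homogeneous
merged = [0, 0, 0, 0]
# same grouping with the cluster ids swapped: perfect on every score
swapped = [1, 1, 0, 0]

h_split = metrics.homogeneity_score(labels_true, split)
c_merged = metrics.completeness_score(labels_true, merged)
h_merged = metrics.homogeneity_score(labels_true, merged)
v_swapped = metrics.v_measure_score(labels_true, swapped)
ari_swapped = metrics.adjusted_rand_score(labels_true, swapped)
print(h_split, c_merged, h_merged, v_swapped, ari_swapped)
```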

In [29]:
from sklearn import metrics

# scikit-learn metrics take (labels_true, labels_pred), in that order
Y_cluster = kmeans_cluster.predict(X_test)

print(metrics.homogeneity_score(Y_test, Y_cluster))

print(metrics.completeness_score(Y_test, Y_cluster))

print(metrics.v_measure_score(Y_test, Y_cluster))

print(metrics.adjusted_rand_score(Y_test, Y_cluster))

0.4838627966070309
0.5732364668342355
0.5247715319687337
0.5498399411196141


In [30]:
import sklearn.preprocessing as preprocessing

regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']
print(preprocessing.LabelEncoder().fit(regions).transform(regions))

[1 0 3 1 2 0]


In [54]:
#Write your code here

import numpy 
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import load_iris
import sklearn.preprocessing as preprocessing
from sklearn.impute import SimpleImputer


iris = load_iris()
normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))



encoder = preprocessing.OneHotEncoder()
# toarray() converts the default sparse result to a dense array
iris_target_one_hot = encoder.fit_transform(iris.target.reshape(-1, 1)).toarray()
print(numpy.array(iris_target_one_hot)[[0, 50, 100]])



iris.data[:50,:] = numpy.nan
imputer = SimpleImputer(missing_values=numpy.nan, strategy='mean')
iris_imputed = imputer.fit_transform(iris.data)
print(iris_imputed.mean(axis=0))

[0.75140029 0.40517418 0.45478362 0.14107142]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[6.262 2.872 4.906 1.676]



In [79]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

X_train, X_test, Y_train, Y_test = train_test_split(
           iris.data, iris.target,
           stratify = iris.target,  random_state=30)
print(X_train.shape)
print(X_test.shape)

knn_classifier = KNeighborsClassifier()   
knn_clf = knn_classifier.fit(X_train, Y_train) 

print(knn_clf.score(X_train,Y_train))
print(knn_clf.score(X_test,Y_test))

li = []
neighbors, max_acc = 0, 0
for i in range(3, 11):
    knn_classifier = KNeighborsClassifier(n_neighbors = i)   
    knn_clf = knn_classifier.fit(X_train, Y_train) 
    acc = knn_clf.score(X_train,Y_train)
    li.append(acc)
    
    if acc>max_acc:
        neighbors = i
        max_acc = acc
    

print(neighbors)

(112, 4)
(38, 4)
0.9821428571428571
0.9473684210526315
9



In [76]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_boston was removed in scikit-learn 1.2; this cell needs an older release
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)

print(X_train.shape)
print(X_test.shape)


dt_reg = DecisionTreeRegressor()   
dt_reg = dt_reg.fit(X_train, Y_train) 
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))

a = dt_reg.predict([X_test[0]])
b = dt_reg.predict([X_test[1]])
print([float(a),float(b)])


li = []
max_acc = 0
max_depth_ = 0
for i in range(2, 6):
    dt_reg = DecisionTreeRegressor(max_depth = i)   
    dt_reg = dt_reg.fit(X_train, Y_train) 
    acc = dt_reg.score(X_train,Y_train)
    li.append(acc)
    
    if acc>max_acc:
        max_depth_ = i
        max_acc = acc
        
print(max_depth_)

(379, 13)
(127, 13)
1.0
0.6915222887792919
[18.2, 12.8]
5
