
**Training a Machine Learning model with scikit-learn**

## Reviewing the iris dataset

In [2]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [3]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [4]:
# print the names of the four features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [5]:
# print integers representing the species of each observation
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [6]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


# Requirements for working with data in scikit-learn

1. Features and response are separate objects
2. Features should always be numeric, and response should be numeric for regression problems
3. Features and response should be NumPy arrays (or array-likes that convert cleanly to them, such as nested lists)
4. Features and response should have specific shapes
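
The four requirements above can be checked together in one place. The following is only a minimal sketch: `check_sklearn_inputs` is a hypothetical helper written for illustration, not a scikit-learn function, and it assumes the inputs are already NumPy arrays.

```python
import numpy as np
from sklearn.datasets import load_iris


def check_sklearn_inputs(X, y):
    """Hypothetical helper (not part of scikit-learn): check the four requirements."""
    # 1. features and response are separate objects
    assert X is not y
    # 3. both are NumPy arrays (checked before dtype, which needs an array)
    assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
    # 2. features are numeric
    assert np.issubdtype(X.dtype, np.number)
    # 4. X is 2-D (observations x features); y is 1-D with one entry per observation
    assert X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]
    return True


iris = load_iris()
print(check_sklearn_inputs(iris.data, iris.target))
```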


In [8]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [9]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

(150, 4)


In [10]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)


1. 150 observations
2. 4 features (sepal length, sepal width, petal length, petal width)
3. Response variable is the iris species
4. Classification problem since response is categorical
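
To make the "categorical response" point concrete, the integer codes in `iris.target` can be tallied and paired with the species names. This is a small sketch using `np.bincount`; the variable name `counts` is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# count how many observations fall into each integer-coded class
counts = np.bincount(iris.target)

# pair each integer code with its species name and its count
for code, (name, count) in enumerate(zip(iris.target_names, counts)):
    print(code, name, count)
```

Each of the three species appears 50 times, so the 150 observations are evenly split across the classes.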

# Create Features and Target Arrays


In [7]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [11]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


# 4-step modeling pattern  
Step 1: Import the class you plan to use  
Step 2: "Instantiate" the "estimator" / "model"  
Step 3: Fit the model with data (aka "model training")  
Step 4: Predict the response for a new observation

In [32]:
# Import the class from scikit-learn; in our case, the KNN classifier
from sklearn.neighbors import KNeighborsClassifier

In [33]:
# Instantiate the model; here 'knn' is the classifier object.
# Use n_neighbors=1, i.e. K=1
knn = KNeighborsClassifier(n_neighbors=1)
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')


In [34]:
# Fit the model on feature and target arrays
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

In [35]:
# Predict the response for one new, unlabeled observation
knn.predict([[3, 5, 4, 2]])
# returns a NumPy array with one predicted label per input row

array([2])

In [36]:
# Predict on multiple feature vectors at once
# In our case, two observations (rows) of X
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
# returns an array with one predicted y per row of X_new

array([2, 1])
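
The predictions come back as integer codes. One way to read them as species is to index `iris.target_names` with the predicted array. A minimal sketch, refitting the same K=1 model so the block runs on its own; `predicted_codes` and `predicted_names` are names introduced here:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
predicted_codes = knn.predict(X_new)

# NumPy integer-array indexing maps each code to its species name
predicted_names = iris.target_names[predicted_codes]
print(predicted_names)
```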

# Modeling with different K values

In [31]:
# instantiate the model (using the value K=5)
knn_5 = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn_5.fit(X, y)

# predict the response for new observations
knn_5.predict(X_new)

array([1, 1])
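
Since K=1 and K=5 gave different answers for the first observation, it can help to compare several values of K side by side. The loop below is only a sketch: the set of K values, including 15, is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# fit one model per K and collect its predictions for the same new observations
predictions = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions[k] = model.predict(X_new)
    print(k, predictions[k])
```

With no labeled answers for `X_new`, this comparison shows only that the models disagree, not which K is better.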

# Using a different classifier: Logistic Regression

In [37]:
# import the class
from sklearn.linear_model import LogisticRegression

In [40]:
# instantiate the model
logreg = LogisticRegression(solver='liblinear')

In [41]:
# fit the model with data
logreg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [42]:
# predict the response for new observations
logreg.predict(X_new)

array([2, 0])
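
Logistic regression can also report how confident it is in each class through `predict_proba`. The sketch below makes two assumptions that differ from the cells above: it uses scikit-learn's default solver rather than `liblinear`, and it raises `max_iter` to 1000 to give the solver room to converge. Its probabilities and predictions may therefore not match the output shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# assumption: default solver with a higher iteration cap, not the liblinear setup above
logreg_default = LogisticRegression(max_iter=1000)
logreg_default.fit(X, y)

# one row per new observation, one column per class (setosa, versicolor, virginica)
proba = logreg_default.predict_proba(X_new)
print(np.round(proba, 3))
```

Each row sums to 1, and the column with the largest value is the class that `predict` returns.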