# Supervised Learning Foundations

Before we dive into supervised learning algorithms, we must first get to know the data we are working with.

## Meet the Data


There are several steps that can be taken to ensure that you meet the data correctly and get the most out of it:
- Understand the context of the data: Learn about the problem you're trying to solve, the industry, and the data source. This will give you a sense of what type of data you're working with, and what kind of insights you can expect to gain from it.

- Explore the data: Get a sense of the overall structure and characteristics of the data, including the number of observations, the types of variables, and their distribution.

- Clean and preprocess the data: Address any missing or invalid data, handle outliers, and make sure the data is in a format that can be easily used for analysis.

- Visualize the data: Use visualizations such as histograms, scatter plots, and box plots to gain insights into the distribution of the data and identify patterns or trends.

- Feature engineering: Create new features or modify existing ones to better capture the information in the data that is relevant to the problem you're trying to solve.

- Validate the data: Check that the data is reliable, and use techniques such as cross-validation to estimate how well a model trained on it will generalize to new data.

- Document the data: Keep detailed notes on the data, including the source, cleaning and preprocessing steps, and any other relevant information.

Let's take an example using the Iris dataset:

In [6]:
from sklearn.datasets import load_iris
iris_dataset=load_iris()

In [7]:
print("keys of iris_dataset: \n{}".format(iris_dataset.keys()))

keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [8]:
print(iris_dataset['DESCR'][:193] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...


In [9]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


In [10]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [11]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


In [12]:
iris_dataset['data'].shape

(150, 4)

In [13]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)

In [14]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (112, 4)
y_train shape: (112,)


In [15]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)
y_test shape: (38,)


In [16]:
import pandas as pd
df=pd.DataFrame(iris_dataset["data"])

### First Things First: Look at Your Data

In [17]:
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

In [18]:
iris_dataframe

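| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.9 | 3.0 | 4.2 | 1.5 |
| 1 | 5.8 | 2.6 | 4.0 | 1.2 |
| 2 | 6.8 | 3.0 | 5.5 | 2.1 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 |
| 4 | 6.9 | 3.1 | 5.1 | 2.3 |
| ... | ... | ... | ... | ... |
| 107 | 4.9 | 3.1 | 1.5 | 0.1 |
| 108 | 6.3 | 2.9 | 5.6 | 1.8 |
| 109 | 5.8 | 2.7 | 4.1 | 1.0 |
| 110 | 7.7 | 3.8 | 6.7 | 2.2 |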


In [19]:
pip install mglearn

Successfully installed mglearn-0.1.9


NOTE: mglearn is a helper library built on top of scikit-learn. It provides the plotting and dataset utilities used in this notebook.

In [20]:
import mglearn

TypeError: ignored

- To use mglearn in Google Colab, you first need to install it by running !pip install mglearn in a code cell.
- If importing mglearn fails, as with the TypeError above, the installed version may not be compatible with the versions of the other libraries in your environment. Check the versions of your libraries and use a version of mglearn that is compatible with them.

In [None]:
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

### Building Your First Model: k-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1)

Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use. Here we will use a k-nearest neighbors classifier, which is easy to understand. Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point.
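To make the nearest-neighbor rule concrete, here is a minimal sketch of a 1-nearest-neighbor prediction written directly in NumPy. It assumes Euclidean distance, and the function name predict_1nn and the toy data are made up for illustration; this is not how scikit-learn implements KNeighborsClassifier internally.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Predict labels with a 1-nearest-neighbor rule (Euclidean distance)."""
    # Pairwise differences via broadcasting: shape (n_new, n_train, n_features)
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance from every new point to every training point
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for each new point
    nearest = distances.argmin(axis=1)
    # Copy the label of that closest training point
    return y_train[nearest]

# Tiny hypothetical training set: two clusters with labels 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_toy = np.array([0, 0, 1, 1])

toy_predictions = predict_1nn(X_toy, y_toy, np.array([[1.1, 1.0], [4.0, 4.1]]))
print(toy_predictions)
```

"Training" here is nothing more than keeping X_toy and y_toy around; all of the work happens at prediction time.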


In [None]:
knn.fit(X_train,y_train)

## Making Predictions

In [None]:
import numpy as np

In [None]:
X_new=np.array([[5,2.9,1,0.2]])
print("X_new.shape:{}".format(X_new.shape))

In [None]:
prediction=knn.predict(X_new)
iris_dataset['target_names'][prediction]

## Evaluating the Model

This is where the test set that we created earlier comes in. This data was not used to
build the model, but we do know what the correct species is for each iris in the test
set.
Therefore, we can make a prediction for each iris in the test data and compare it
against its label (the known species). We can measure how well the model works by
computing the accuracy, which is the fraction of flowers for which the right species
was predicted:

In [None]:
y_pred=knn.predict(X_test)
y_pred

In [None]:
np.mean(y_pred==y_test)

In [None]:
knn.score(X_test,y_test)

# Chapter 2

## Classification and Regression

### Overfitting and underfitting

Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data. On the other hand, if your model is too simple (say, “Everybody who owns a house buys a boat”), then you might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set. This is called underfitting.

The more complex we allow our model to be, the better we will be able to predict on the training data. However, if our model becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data.
There is a sweet spot in between that will yield the best generalization performance. This is the model we want to find.
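The trade-off can be explored with a small experiment. The sketch below is an illustration under assumptions, not part of the original notebook: it fits polynomials of increasing degree to a hypothetical noisy sine wave and compares the error on the data used for fitting with the error on fresh data from the same process.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_noisy_sine(n_samples):
    """Hypothetical 1-D regression problem: a sine wave plus Gaussian noise."""
    x = rng.uniform(0, 1, n_samples)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n_samples)
    return x, y

x_fit, y_fit = make_noisy_sine(20)     # data used to fit the polynomials
x_new, y_new = make_noisy_sine(200)    # fresh data from the same process

train_mse = {}
new_mse = {}
for degree in [1, 3, 9]:
    # Higher degree = more complex model
    poly = Polynomial.fit(x_fit, y_fit, deg=degree)
    train_mse[degree] = np.mean((poly(x_fit) - y_fit) ** 2)
    new_mse[degree] = np.mean((poly(x_new) - y_new) ** 2)
    print("degree {}: train MSE {:.3f}, new-data MSE {:.3f}".format(
        degree, train_mse[degree], new_mse[degree]))
```

The training error can only go down as the degree grows, because each higher-degree polynomial can reproduce any lower-degree fit. The error on new data has no such guarantee, which is why it is the quantity to watch when looking for the sweet spot.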


### Classification vs. Regression
- In regression, the goal is to predict a continuous value. The input data is represented by a set of features, such as the size and age of a house, and the output is a continuous variable, such as the price of the house. Regression learns a mapping from the input features to the output variable, based on the labeled examples in the training data.
- In classification, the goal is to predict a categorical label or class from a set of input features. The algorithm learns from a labeled dataset and then uses this knowledge to predict the class of new examples.

In [None]:
import matplotlib.pyplot as plt
# generate dataset
X, y = mglearn.datasets.make_forge()
# plot dataset
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.legend(["Class 0", "Class 1"], loc=4)
plt.xlabel("First feature")
plt.ylabel("Second feature")
print("X.shape: {}".format(X.shape))

In [None]:
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
cancer.keys()

In [None]:
df=pd.DataFrame(cancer['data'])
df

In [None]:
df.info()

In [None]:
print("Sample counts per class:\n{}".format(
 {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}))

In [None]:
cancer.feature_names

In [None]:
print(cancer.DESCR)

### Analyzing the Boston housing dataset

Building a model follows these steps:
load or import the data -> split the data into training and test sets -> choose an algorithm and fit it to the training data -> predict -> score the model
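The five steps can be sketched end to end as follows. To keep the example self-contained, it uses the Iris data and a k-nearest neighbors classifier as stand-ins (an assumption made for illustration); the same pattern applies to other datasets and estimators.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. load the data
iris = load_iris()
# 2. split the data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)
# 3. choose an algorithm and fit it on the training set
model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
# 4. predict on data the model has not seen
y_te_pred = model.predict(X_te)
# 5. score the model (accuracy for classifiers, R^2 for regressors)
workflow_score = model.score(X_te, y_te)
print("Test score: {:.2f}".format(workflow_score))
```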

In [None]:
from mglearn.datasets import load_extended_boston
# call the function; it returns the feature matrix X and the target vector y
X, y = load_extended_boston()

In [None]:
X,y=mglearn.datasets.load_extended_boston()

In [None]:
mglearn.plots.plot_knn_classification(n_neighbors=1)

In [None]:
mglearn.plots.plot_knn_classification(n_neighbors=3)

In [None]:
from sklearn.model_selection import train_test_split
X,y=mglearn.datasets.make_forge()
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

In [None]:

from sklearn.neighbors import KNeighborsClassifier
clf=KNeighborsClassifier(n_neighbors=3)

In [None]:
clf.fit(X_train,y_train)

In [None]:
clf.predict(X_test)

In [None]:
clf.score(X_test,y_test)

In [None]:
fig,axes=plt.subplots(1,3,figsize=(10,3))
for n_neighbors,ax in zip([1,3,9],axes):
  clf=KNeighborsClassifier(n_neighbors=n_neighbors).fit(X,y)
  mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
  mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
  ax.set_title("{} neighbor(s)".format(n_neighbors))
  ax.set_xlabel("feature 0")
  ax.set_ylabel("feature 1")
axes[0].legend(loc=3)


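The plots compare the decision boundaries of the models using 1, 3, and 9 neighbors. With a single neighbor the boundary follows the training data closely; with more neighbors the boundary becomes smoother, which corresponds to a simpler model.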

### Using the breast cancer dataset

In [None]:
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,stratify=cancer.target,random_state=66)
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
 # build the model
 clf = KNeighborsClassifier(n_neighbors=n_neighbors)
 clf.fit(X_train, y_train)
 # record training set accuracy
 training_accuracy.append(clf.score(X_train, y_train))
 # record generalization accuracy
 test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()


The best performance is somewhere in the middle,
using around six neighbors. Still, it is good to keep the scale of the plot in mind. The
worst performance is around 88% accuracy, which might still be acceptable.


In [None]:
mglearn.plots.plot_knn_regression(n_neighbors=1)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
X,y=mglearn.datasets.make_wave(n_samples=40)
# split the wave dataset into a training and a test set
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
# instantiate the model and set the number of neighbors to consider to 3
reg=KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train,y_train)

In [None]:
reg.predict(X_test)

In [None]:
reg.score(X_test,y_test)

## Analyzing KNeighborsRegressor

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# create 1,000 data points, evenly spaced between -3 and 3
line = np.linspace(-3, 3, 1000).reshape(-1, 1)
for n_neighbors, ax in zip([1, 3, 9], axes):
 # make predictions using 1, 3, or 9 neighbors
 reg = KNeighborsRegressor(n_neighbors=n_neighbors)
 reg.fit(X_train, y_train)
 ax.plot(line, reg.predict(line))
 ax.plot(X_train, y_train, '^', c=mglearn.cm2(0), markersize=8)
 ax.plot(X_test, y_test, 'v', c=mglearn.cm2(1), markersize=8)
 ax.set_title(
 "{} neighbor(s)\n train score: {:.2f} test score: {:.2f}".format(
 n_neighbors, reg.score(X_train, y_train),
 reg.score(X_test, y_test)))
 ax.set_xlabel("Feature")
 ax.set_ylabel("Target")
axes[0].legend(["Model predictions", "Training data/target",
 "Test data/target"], loc="best")


One of the strengths of k-NN is that the model is very easy to understand, and it often gives reasonable performance without a lot of adjustments. Using this algorithm is a good baseline method to try before considering more advanced techniques. Building the nearest neighbors model is usually very fast, but when your training set is very large, prediction can be slow.

So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features. The method we discuss next has neither of these drawbacks.

# Linear Models


## Linear models for regression


In [None]:
mglearn.plots.plot_linear_regression_wave()

There are many different linear models for regression. The difference between these models lies in how the model parameters w and b are learned from the training data.
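As an illustration of what "learning w and b" means, the sketch below recovers the parameters of a linear model by least squares using np.linalg.lstsq. The synthetic data and the variable names are assumptions made for this example; scikit-learn's estimators wrap this kind of computation behind fit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data generated from y = 2*x0 - 1*x1 + 0.5 plus a little noise
X_lin = rng.normal(size=(50, 2))
y_lin = 2.0 * X_lin[:, 0] - 1.0 * X_lin[:, 1] + 0.5 + rng.normal(scale=0.01, size=50)

# Append a column of ones so the intercept b is learned together with the weights w
X_aug = np.column_stack([X_lin, np.ones(len(X_lin))])
solution, *_ = np.linalg.lstsq(X_aug, y_lin, rcond=None)
w, b = solution[:-1], solution[-1]

# A prediction is the weighted sum of the features plus the intercept
y_hat = X_lin @ w + b
print("w:", w, "b:", b)
```

Because the noise is small, the learned w and b land close to the values used to generate the data.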

## Linear regression (ordinary least squares)

In [None]:
from sklearn.linear_model import LinearRegression
X,y=mglearn.datasets.make_wave(n_samples=60)
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42)
lr=LinearRegression()
lr.fit(X_train,y_train)

In [None]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

An $R^2$ of around 0.66 is not very good, but we can see that the scores on the training and test sets are very close together. This means we are likely underfitting, not overfitting. For this one-dimensional dataset, there is little danger of overfitting, as the model is very simple.
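For regressors, the score method returns $R^2$, the coefficient of determination. A minimal sketch of the calculation, with made-up numbers and a hypothetical helper name, shows what the two reference points of the scale mean:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give R^2 = 1
perfect = r2_score_manual(y_true_demo, y_true_demo)
# Always predicting the mean of the targets gives R^2 = 0
mean_only = r2_score_manual(y_true_demo, np.full(4, y_true_demo.mean()))
print(perfect, mean_only)
```

A score of 0.66 therefore sits between "no better than predicting the mean" and "perfect".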

In [None]:
X,y=mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
lr=LinearRegression().fit(X_train,y_train)

In [None]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))


This discrepancy between performance on the training set and the test set is a clear
sign of overfitting, and therefore we should try to find a model that allows us to control complexity. One of the most commonly used alternatives to standard linear
regression is ridge regression, which we will look into next.

### Ridge regression

In [None]:
from sklearn.linear_model import Ridge
ridge=Ridge().fit(X_train,y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

A less complex model means worse performance on the training set, but better generalization. As we are only interested in generalization performance, we should choose the Ridge model over the LinearRegression model.

The optimum setting of alpha depends on the particular dataset we are using.
Increasing alpha forces coefficients to move more toward zero, which decreases
training set performance but might help generalization.
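The shrinking effect of alpha can be seen in a small sketch. It uses the closed-form ridge solution on hypothetical synthetic data and, as a simplifying assumption, leaves out the intercept; the helper name ridge_coefficients is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 30 samples, 5 features, known true coefficients plus noise
X_r = rng.normal(size=(30, 5))
y_r = X_r @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=30)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

coef_norms = []
for alpha in [0.1, 1, 10, 100]:
    coef = ridge_coefficients(X_r, y_r, alpha)
    coef_norms.append(np.linalg.norm(coef))
    print("alpha={:>5}: ||w|| = {:.3f}".format(alpha, coef_norms[-1]))
```

The length of the coefficient vector decreases as alpha grows, which is the "move toward zero" described above.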


The lesson here is that with enough training data, regularization becomes less important, and given enough data, ridge and linear regression will have the same performance.

### Lasso

Lasso is an alternative to Ridge for regularizing linear regression. It uses a different penalty, called L1 regularization. The consequence of L1 regularization is that when using the lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.
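One way to see where the exact zeros come from: in the idealized case of orthonormal features (an assumption that rarely holds for real data), the lasso solution is the unregularized coefficient passed through a soft-thresholding function. The sketch below shows only that function, with made-up coefficients; it is not the solver scikit-learn uses.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value toward zero by `threshold`; values within the threshold become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Hypothetical unregularized coefficients
ols_coefficients = np.array([3.0, -0.5, 0.2, -2.0])
lasso_like = soft_threshold(ols_coefficients, threshold=1.0)

print(lasso_like)
print("Number of features used: {}".format(np.sum(lasso_like != 0)))
```

Coefficients smaller in magnitude than the threshold are set to exactly zero, while the larger ones are only shrunk. An L2 penalty, by contrast, scales coefficients down without zeroing them.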

In [None]:
from sklearn.linear_model import Lasso
lasso=Lasso().fit(X_train,y_train)

In [None]:
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

In [None]:
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso001.coef_ != 0)))

If we set alpha too low, however, we again remove the effect of regularization and end
up overfitting, with a result similar to LinearRegression:

In [None]:
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

- Ridge regression is usually the first choice between these two models.
- Lasso might be a better choice if you would like to have a model that is easy to interpret. It will provide a model that is easier to understand, as it will select only a subset of the input features.
- scikit-learn also provides the ElasticNet class, which combines the penalties of Lasso and Ridge.
## Linear models for classification

Linear models are also extensively used for classification. Let’s look at binary classification first. In this case, a prediction is made using the following formula:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0

The formula looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we threshold the predicted value at zero. If the function is smaller than zero, we predict the class –1; if it is larger than zero, we predict the class +1. This prediction rule is common to all linear models for classification. Again, there are many different ways to find the coefficients (w) and the intercept (b).
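The prediction rule can be written out in a few lines of NumPy. The coefficients, intercept, and data points below are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

w_demo = np.array([1.0, -2.0])   # hypothetical coefficients
b_demo = 0.5                     # hypothetical intercept
X_demo = np.array([[3.0, 1.0],   # 3 - 2 + 0.5 =  1.5 -> class +1
                   [0.0, 1.0],   # 0 - 2 + 0.5 = -1.5 -> class -1
                   [1.0, 0.5]])  # 1 - 1 + 0.5 =  0.5 -> class +1

# Weighted sum of the features plus the intercept
decision_values = X_demo @ w_demo + b_demo
# Threshold at zero to obtain the predicted class
predicted_class = np.where(decision_values > 0, 1, -1)
print(decision_values, predicted_class)
```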

There are many algorithms for learning linear models. These algorithms all differ in the following two ways:

- The way in which they measure how well a particular combination of coefficients and intercept fits the training data
- If and what kind of regularization they use

In [None]:
# The two most common linear classification algorithms are
# logistic regression and linear support vector machines
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
 clf = model.fit(X, y)
 mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5,
 ax=ax, alpha=.7)
 mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
 ax.set_title("{}".format(clf.__class__.__name__))
 ax.set_xlabel("Feature 0")
 ax.set_ylabel("Feature 1")
axes[0].legend()


In [None]:
mglearn.plots.plot_linear_svc_regularization()


In the left plot, we have a very small C corresponding to a lot of regularization. The strongly regularized model chooses a relatively horizontal line, misclassifying two points.

In the center plot, C is slightly higher, and the model focuses more on the two misclassified samples, tilting the decision boundary.



In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer=load_breast_cancer()
X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=42)
logreg=LogisticRegression().fit(X_train,y_train)
print(logreg.score(X_train,y_train))
print(logreg.score(X_test,y_test))

The default value of C=1 provides quite good performance, with 95% accuracy on
both the training and the test set. But as training and test set performance are very
close, it is likely that we are underfitting. Let’s try to increase C to fit a more flexible
model:


In [None]:
logreg100=LogisticRegression(C=100).fit(X_train,y_train)
print(logreg100.score(X_train,y_train))
print(logreg100.score(X_test,y_test))

In [None]:
logreg001=LogisticRegression(C=0.01).fit(X_train,y_train)
print(logreg001.score(X_train,y_train))
print(logreg001.score(X_test,y_test))

### Linear models for multiclass classification

In [None]:
from sklearn import linear_model
from sklearn.datasets import make_blobs
X,y=make_blobs(random_state=42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0","Class 1","Class 2"])
linear_svm=LinearSVC().fit(X,y)
print(linear_svm.coef_.shape)
print(linear_svm.intercept_.shape)


### Decision Trees

To build a decision tree, the algorithm searches for the split (test) that is the most informative about the target variable, and then repeats this process recursively on each of the resulting regions.

The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value). A leaf of the tree that contains data points that all share the same target value is called pure.

A prediction on a new data point is made by checking which region of the partition of the feature space the point lies in, and then predicting the majority target (or the single target in the case of pure leaves) in that region. The region can be found by traversing the tree from the root and going left or right, depending on whether the test is fulfilled or not.
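The traversal can be sketched with a tiny hand-written tree. The dictionary layout, thresholds, and class names below are hypothetical and chosen only for illustration; scikit-learn stores its trees in a different, array-based structure.

```python
# Hypothetical hand-written tree: internal nodes test "x[feature] <= threshold",
# leaves store the majority class of the training points in that region.
toy_tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"leaf": "class A"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"leaf": "class B"},
        "right": {"leaf": "class A"},
    },
}

def predict_with_tree(node, x):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while "leaf" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]    # test fulfilled: go left
        else:
            node = node["right"]   # test not fulfilled: go right
    return node["leaf"]

print(predict_with_tree(toy_tree, [1.0, 5.0]))
print(predict_with_tree(toy_tree, [3.0, 0.5]))
print(predict_with_tree(toy_tree, [3.0, 2.0]))
```

Each prediction costs only as many comparisons as the tree is deep, regardless of how many training points were used to build it.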

In [None]:
mglearn.plots.plot_animal_tree()

### Controlling complexity of decision trees
There are two common strategies to prevent overfitting: stopping the creation of the tree early (also called pre-pruning), or building the tree but then removing or collapsing nodes that contain little information (also called post-pruning or just pruning). Possible criteria for pre-pruning include limiting the maximum depth of the tree, limiting the maximum number of leaves, or requiring a minimum number of points in a node to keep splitting it.

### Building decision trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
cancer=load_breast_cancer()
X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)
tree=DecisionTreeClassifier()
tree.fit(X_train,y_train)
print(tree.score(X_train,y_train))
print(tree.score(X_test,y_test))

If we don’t restrict the depth of a decision tree, the tree can become arbitrarily deep
and complex. Unpruned trees are therefore prone to overfitting and not generalizing
well to new data. Now let’s apply pre-pruning to the tree, which will stop developing
the tree before we perfectly fit to the training data. One option is to stop building the
tree after a certain depth has been reached. Here we set max_depth=4, meaning only
four consecutive questions can be asked. Limiting the
depth of the tree decreases overfitting. This leads to a lower accuracy on the training
set, but an improvement on the test set:

In [None]:
tree=DecisionTreeClassifier(max_depth=4,random_state=0)
tree.fit(X_train,y_train)
print(tree.score(X_train,y_train))
print(tree.score(X_test,y_test))

### Analyzing the decision tree

In [None]:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
 feature_names=cancer.feature_names, impurity=False, filled=True)

In [None]:
import graphviz
with open("tree.dot") as f:
  dot_graph=f.read()
# the expression must be at the top level of the cell for the notebook to render it
graphviz.Source(dot_graph)

### Feature importance in trees

The most commonly
used summary is feature importance, which rates how important each feature is for
the decision a tree makes. It is a number between 0 and 1 for each feature, where 0
means “not used at all” and 1 means “perfectly predicts the target.” The feature
importances always sum to 1:


In [None]:
print(tree.feature_importances_)

In [None]:
import matplotlib.pyplot as plt
def plot_feature_importances_cancer(model):
  n_features=cancer.data.shape[1]
  plt.barh(range(n_features),model.feature_importances_,align='center')
  plt.yticks(np.arange(n_features),cancer.feature_names)
  plt.xlabel("feature importance")
  plt.ylabel("features")
plot_feature_importances_cancer(tree)

However, if a feature has a low feature_importance, it doesn’t mean that this feature
is uninformative. It only means that the feature was not picked by the tree, likely
because another feature encodes the same information.

The parameters that control model complexity in decision trees are the pre-pruning parameters. Usually, picking one of them (max_depth, max_leaf_nodes, or min_samples_leaf) is sufficient to prevent overfitting.

In [None]:
tree=mglearn.plots.plot_tree_not_monotone()
display(tree)

In [None]:
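import os
import pandas as pd
# The original URL (https://www.kaggle.com/code/shubhankartiwari/decision-tree-regression?scriptVersionId=42755779&cellId=3)
# is a notebook page, not a CSV file, so read_csv cannot parse it.
# Assumption: the same RAM price data is the ram_price.csv file bundled with mglearn.
ram_prices = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, "ram_price.csv"))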
plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte")

### Strengths, weaknesses, and parameters

- The main parameter of linear models is the regularization parameter, called alpha in the regression models and C in LinearSVC and LogisticRegression. Large values for alpha or small values for C mean simple models. In particular for the regression models, tuning these parameters is quite important. Usually C and alpha are searched for on a logarithmic scale. The other decision you have to make is whether you want to use L1 regularization or L2 regularization. If you assume that only a few of your features are actually important, you should use L1. Otherwise, you should default to L2. L1 can also be useful if interpretability of the model is important. As L1 will use only a few features, it is easier to explain which features are important to the model, and what the effects of these features are.

- Linear models are very fast to train, and also fast to predict. They scale to very large datasets and work well with sparse data. If your data consists of hundreds of thousands or millions of samples, you might want to investigate using the solver='sag' option in LogisticRegression and Ridge, which can be faster than the default on large datasets. Other options are the SGDClassifier class and the SGDRegressor class, which implement even more scalable versions of the linear models described here.
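Searching on a logarithmic scale means trying values such as 0.001, 0.01, 0.1, and so on, rather than evenly spaced ones. The sketch below does this for Ridge on a hypothetical synthetic dataset created with make_regression. It scores on the test split only to keep the example short; in practice a separate validation set or cross-validation would be used to pick the value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression problem
X_s, y_s = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X_s, y_s, random_state=0)

# Candidate values spaced evenly on a logarithmic scale: 0.001, 0.01, ..., 1000
alphas = np.logspace(-3, 3, 7)
alpha_scores = {}
for alpha in alphas:
    ridge_demo = Ridge(alpha=alpha).fit(X_s_train, y_s_train)
    alpha_scores[alpha] = ridge_demo.score(X_s_test, y_s_test)
    print("alpha={:g}: test R^2 = {:.3f}".format(alpha, alpha_scores[alpha]))
```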

### Naive Bayes Classifiers
- Advantages: naive Bayes classifiers are a family of classifiers that are quite similar to the linear models. They are so efficient because they learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature.
- Disadvantages: the price for this efficiency is generalization performance that is often slightly worse than that of linear classifiers such as LogisticRegression and LinearSVC.
- There are three kinds of naive Bayes classifiers implemented in scikit-learn: GaussianNB, BernoulliNB, and MultinomialNB.


In [None]:
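# Example data (an assumption for illustration): four samples with four binary features each
X = np.array([[0, 1, 0, 1], [1, 0, 1, 1],
              [0, 0, 0, 1], [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])
counts = {}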
for label in np.unique(y):
 # iterate over each class
 # count (sum) entries of 1 per feature
 counts[label] = X[y == label].sum(axis=0)
print("Feature counts:\n{}".format(counts))

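MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity. The algorithm adds to the data alpha many virtual data points that have positive values for all the features. This results in a “smoothing” of the statistics. A large alpha means more smoothing, resulting in less complex models.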
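The effect of alpha can be sketched with additive (Laplace) smoothing of a single binary feature. The counts below are made up, and the formula shows the general idea rather than a guaranteed line-by-line match of scikit-learn's computation.

```python
# Hypothetical binary feature for one class: it was "on" in 0 of 4 samples
count_on = 0
n_samples_in_class = 4

smoothed = {}
for alpha in [0.0, 1.0, 10.0]:
    # Additive smoothing for a binary feature: act as if we had seen
    # `alpha` extra samples with the feature on and `alpha` with it off
    smoothed[alpha] = (count_on + alpha) / (n_samples_in_class + 2 * alpha)
    print("alpha={:>4}: P(feature=1 | class) = {:.3f}".format(alpha, smoothed[alpha]))
```

Without smoothing the estimate is exactly zero, so a single occurrence of the feature would rule the class out completely. Larger alpha values pull the estimate toward 0.5, making the model less sensitive to the particular counts in the training data.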

GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text. MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large number of nonzero features.

The naive Bayes models share many of the strengths and weaknesses of the linear models. They are very fast to train and to predict, and the training procedure is easy to understand. The models work very well with high-dimensional sparse data and are relatively robust to the parameters. Naive Bayes models are great baseline models and are often used on very large datasets, where training even a linear model might take too long.

## Ensembles of Decision Trees

Ensembles are methods that combine multiple machine learning models to create more powerful models. There are many models in the machine learning literature that belong to this category, but there are two ensemble models that have proven to be effective on a wide range of datasets for classification and regression, both of which use decision trees as their building blocks: random forests and gradient boosted decision trees.

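The main drawback of decision trees is that they tend to overfit the training data. Random forests are one way to address this problem. A random forest is a collection of decision trees, where each tree is slightly different from the others. Each tree might overfit in its own way, and averaging their results reduces the amount of overfitting.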

To implement this strategy, we need to build many decision trees. Each tree should do an acceptable job of predicting the target, and should also be different from the other trees. Random forests get their name from injecting randomness into the tree building to ensure each tree is different. There are two ways in which the trees in a random forest are randomized: by selecting the data points used to build a tree and by selecting the features in each split test. Let’s go into this process in more detail.

### Building random forests

1. Decide on the number of trees to build (the n_estimators parameter). The trees are built completely independently of each other, and the algorithm makes different random choices for each tree to make sure the trees are distinct.
2. To build a tree, first take a bootstrap sample of the data: draw data points from the dataset at random with replacement, so some points may be repeated and others left out.
3. Build a decision tree on this newly created dataset. For each split, the algorithm considers only a randomly selected subset of the features (the size of the subset is controlled by max_features).
4. A high max_features means that the trees in the random forest will be quite similar, and they will be able to fit the data easily, using the most distinctive features. A low max_features means that the trees in the random forest will be quite different, and that each tree might need to be very deep in order to fit the data well.
5. To make a prediction using the random forest, the algorithm first makes a prediction for every tree in the forest. For regression, we can average these results to get our final prediction. For classification, a “soft voting” strategy is used. This means each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest probability is predicted.
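Two of the steps above, bootstrap sampling and soft voting, can be sketched in a few lines of NumPy. The data points and the per-tree probabilities are hypothetical values for illustration; RandomForestClassifier performs both steps internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Bootstrap sample: draw n indices with replacement from n data points ---
data = np.array(["a", "b", "c", "d", "e"])
bootstrap_indices = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[bootstrap_indices]
# Same size as the original data; some points may repeat, others may be missing
print("Bootstrap sample:", bootstrap_sample)

# --- Soft voting: average the class probabilities predicted by each tree ---
# Hypothetical probabilities from three trees, for two samples and two classes
tree_probabilities = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # tree 0
    [[0.6, 0.4], [0.3, 0.7]],   # tree 1
    [[0.2, 0.8], [0.6, 0.4]],   # tree 2
])
mean_probabilities = tree_probabilities.mean(axis=0)
# The class with the highest average probability is predicted
forest_prediction = mean_probabilities.argmax(axis=1)
print(mean_probabilities, forest_prediction)
```

Note that tree 2 disagrees with the other two trees on both samples, but averaging the probabilities lets the more confident majority decide.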

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X,y=make_moons(n_samples=100,noise=0.25,random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
 random_state=42)
forest=RandomForestClassifier(n_estimators=5,random_state=2)
forest.fit(X_train,y_train)

In [None]:
fig,axes=plt.subplots(2,3,figsize=(10,5))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
  ax.set_title("Tree {}".format(i))
  mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1],
  alpha=.4)
axes[-1, -1].set_title("Random Forest")
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
  

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
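# Assumption: a new, larger forest is intended here (the import above is otherwise unused);
# without this line the 5-tree forest from the make_moons example would be refit.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train,y_train)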
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))

In [None]:
plot_feature_importances_cancer(forest)

Similarly to the decision tree, the random forest provides feature importances, which are computed by aggregating the feature importances over the trees in the forest. Typically, the feature importances provided by the random forest are more reliable than the ones provided by a single tree.

- The more trees there are in the forest, the more robust it will be against the choice of random state. If you want to have reproducible results, it is important to fix the random_state.
- The important parameters to adjust are n_estimators and max_features. For n_estimators, larger is better: averaging more trees will yield a more robust ensemble by reducing overfitting, at the cost of more memory and training time.
- Random forests don’t tend to perform well on very high dimensional, sparse data, such as text data.
- Random forests require more memory and are slower to train and to predict than linear models. If time and memory are important in an application, it might make sense to use a linear model instead.
- As described earlier, max_features determines how random each tree is, and a
smaller max_features reduces overfitting. In general, it’s a good rule of thumb to use
the default values, which in scikit-learn are max_features=sqrt(n_features) for classification and
max_features=n_features for regression. Setting max_features or max_leaf_nodes explicitly
might sometimes improve performance. It can also drastically reduce space and time
requirements for training and prediction.
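
To make the point about random_state and the number of trees concrete, here is a small sketch on the breast cancer data. The helper fit_forest and the comparison of predicted probabilities are just one possible way to look at this, not a standard recipe. With the same seed the forests are identical. With different seeds, you can compare how much the predicted probabilities differ for a small and a large forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)


def fit_forest(n_estimators, random_state):
    """Hypothetical helper: fit a forest on the training split."""
    forest = RandomForestClassifier(n_estimators=n_estimators,
                                    max_features="sqrt",
                                    random_state=random_state)
    return forest.fit(X_train, y_train)


# same random_state -> the two forests give identical probabilities
proba_a = fit_forest(100, random_state=0).predict_proba(X_test)
proba_b = fit_forest(100, random_state=0).predict_proba(X_test)
print("Identical with the same seed:", np.array_equal(proba_a, proba_b))

# different random_state -> mean absolute difference between the probabilities
seed_gap = {}
for n_estimators in (5, 100):
    p0 = fit_forest(n_estimators, random_state=0).predict_proba(X_test)
    p1 = fit_forest(n_estimators, random_state=1).predict_proba(X_test)
    seed_gap[n_estimators] = np.abs(p0 - p1).mean()
    print("n_estimators={}: mean difference between seeds = {:.4f}".format(
        n_estimators, seed_gap[n_estimators]))
```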

## Gradient boosted regression trees (gradient boosting machines)


- It's a method that combines multiple decision trees to create a more powerful model. It can be used for both regression and classification.
- It works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one.
- By default there is no randomization; instead, strong pre-pruning is used, so the individual trees are very shallow.
- Each tree can only provide good predictions on part of the data, and so more and more trees are added to iteratively improve performance.
- An important parameter is the learning rate: it controls how strongly each tree tries to correct the mistakes of the previous trees.
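
The idea of "each tree corrects the mistakes of the previous ones" can be sketched by hand for regression with squared error. This is a simplified illustration, not how scikit-learn implements gradient boosting internally; the toy sine data and all variable names are assumptions made for this example. Each depth-1 tree is fit to the current residuals, and the learning rate scales how much of its correction is added.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data (assumption for this sketch): a noisy sine curve
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
trees, errors = [], []
for _ in range(n_trees):
    # the "mistakes" of the current ensemble are the residuals
    residual = y - prediction
    # strong pre-pruning: every tree is a stump of depth 1
    tree = DecisionTreeRegressor(max_depth=1, random_state=0)
    tree.fit(X, residual)
    # add only a fraction of the correction, controlled by the learning rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print("Training MSE after 1 tree:   {:.4f}".format(errors[0]))
print("Training MSE after {} trees: {:.4f}".format(n_trees, errors[-1]))
```

A lower learning_rate makes each step smaller, so more trees are needed to reach the same training error.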

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
# default settings: 100 trees, max_depth=3, learning_rate=0.1
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

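If the training set accuracy is much higher than the test set accuracy, we are likely overfitting. To reduce overfitting, we can either apply stronger pre-pruning by limiting the maximum depth, or lower the learning rate: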

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Both methods of decreasing the model complexity reduced the training set accuracy,
as expected. In this case, lowering the maximum depth of the trees provided a significant improvement of the model, while lowering the learning rate only increased the
generalization performance slightly.
As for the other decision tree–based models, we can again visualize the feature
importances to get more insight into our model. As we used 100 trees, it
is impractical to inspect them all, even if they are all of depth 1:

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
plot_feature_importances_cancer(gbrt)

- Compared to the random forest, gradient boosting completely ignores some of the features.
- A common approach is to first try random forests, which work quite robustly.
- If we want more accuracy from the ML model, moving to gradient boosting often helps.

### Strengths, weaknesses, and parameters
- Gradient boosted decision trees are among the most powerful and widely used models for supervised learning.
- They require careful tuning of the parameters and may take a long time to train.
- Similarly to other tree-based models, the algorithm works well without scaling and on a mixture of binary and continuous features.
- It often does not work well on high-dimensional sparse data.
- Parameters: n_estimators and learning_rate are interconnected.
- A lower learning rate means that more trees are needed to build a model of similar complexity.
- In contrast to random forests, where a higher n_estimators is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting.
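
One way to see how n_estimators and learning_rate interact is to look at the test accuracy after each added tree. The sketch below uses staged_predict for this; the choice of 200 trees, max_depth=1 and the two learning rates is arbitrary and only meant as an illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

staged_scores = {}
for learning_rate in (0.01, 0.1):
    gbrt = GradientBoostingClassifier(n_estimators=200, max_depth=1,
                                      learning_rate=learning_rate,
                                      random_state=0)
    gbrt.fit(X_train, y_train)
    # staged_predict yields the predictions after 1, 2, ..., n_estimators trees
    staged_scores[learning_rate] = [accuracy_score(y_test, pred)
                                    for pred in gbrt.staged_predict(X_test)]

for learning_rate, scores in staged_scores.items():
    print("learning_rate={}: test accuracy after 10 / 50 / 200 trees: "
          "{:.3f} / {:.3f} / {:.3f}".format(
              learning_rate, scores[9], scores[49], scores[199]))
```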

## Kernelized Support Vector Machines

https://www.youtube.com/watch?v=Q7vT0--5VII

- Problem: the linear classification models we learned previously end up with a simple decision boundary (a line, a plane, or a hyperplane). This is often too limiting for real-world data, because in many datasets the classes cannot be separated by a linear boundary.
- Solution: kernels. One way to make a linear model more flexible is by adding more features, for example by adding interactions or polynomials of the input features.

In [None]:
# add the squared second feature (feature1 ** 2) as a third dimension
X_new = np.hstack([X, X[:, 1:] ** 2])
# importing Axes3D registers the 3D projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
figure = plt.figure()
# visualize in 3D
ax = figure.add_subplot(projection='3d')
ax.view_init(elev=-152, azim=-26)
# plot first all the points with y == 0, then all with y == 1
mask = y == 0
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           s=60)
ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")

In [None]:
from sklearn.svm import LinearSVC

linear_svm_3d = LinearSVC().fit(X_new, y)
coef, intercept = linear_svm_3d.coef_.ravel(), linear_svm_3d.intercept_
# show linear decision boundary
figure = plt.figure()
ax = figure.add_subplot(projection='3d')
ax.view_init(elev=-152, azim=-26)
xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)
XX, YY = np.meshgrid(xx, yy)
# solve coef[0]*x + coef[1]*y + coef[2]*z + intercept = 0 for z
ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]
ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           s=60)
ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")

### The kernel trick
Computing all the possible new nonlinear features gets really expensive, both in time and in memory. The solution is the kernel trick: it works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion.
- There are two ways to map the data into a higher-dimensional space that are commonly used with SVMs:
1) The polynomial kernel, which computes all possible polynomials up to a certain degree of the original features.
2) The Gaussian kernel, also known as the radial basis function (RBF) kernel. It is a bit harder to explain, as it corresponds to an infinite-dimensional feature space.
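
A tiny numeric check can make the kernel trick less abstract. The sketch below uses the simplest polynomial kernel of degree 2, k(x, z) = (x . z) ** 2, for 2D points; the helper names phi and poly_kernel are made up for this example. The kernel value computed in the original 2D space equals the scalar product in the expanded 3D space, without ever building the expansion.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature expansion of a 2D point (hypothetical helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2D space."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # scalar product after the expansion
implicit = poly_kernel(x, z)        # kernel trick: no expansion needed

print("explicit expansion:", explicit)
print("kernel trick:      ", implicit)
print("equal:", np.isclose(explicit, implicit))
```

For higher degrees and more input features the explicit expansion grows very quickly, while the kernel still needs only one scalar product in the original space.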

### Understanding the SVMs
Typically only a subset of the training points matter for defining the decision boundary: the ones that lie on the
border between the classes. These are called support vectors and give the support vector machine its name.

In [None]:
from sklearn.svm import SVC
X,y=mglearn.tools.make_handcrafted_dataset()
svm=SVC(kernel='rbf',C=10,gamma=0.1).fit(X,y)
mglearn.plots.plot_2d_separator(svm,X,eps=5)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
#plot support vectors
sv=svm.support_vectors_
# class labels of support vectors are given by the sign of the dual coefficients
sv_labels=svm.dual_coef_.ravel()>0
mglearn.discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")


- gamma: controls the width of the Gaussian kernel (a larger gamma means a narrower kernel). It determines the scale of what it means for points to be close together.
- C: a regularization parameter, similar to the one used in the linear models. It limits the importance of each point.
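
To see what gamma does, it helps to compute the Gaussian kernel directly: k(x1, x2) = exp(-gamma * ||x1 - x2||^2). The sketch below is only an illustration with two made-up points; gaussian_kernel is a hand-written helper, compared against scikit-learn's rbf_kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def gaussian_kernel(x1, x2, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))


# two example points (assumption for this sketch), squared distance = 2
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

similarities = {}
for gamma in (0.1, 1.0, 10.0):
    manual = gaussian_kernel(a[0], b[0], gamma)
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    similarities[gamma] = (manual, library)
    print("gamma={:>4}: manual={:.6f}  rbf_kernel={:.6f}".format(
        gamma, manual, library))
```

The same pair of points counts as "close" (similarity near 1) for a small gamma and as "far apart" (similarity near 0) for a large gamma.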

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
# rows vary C, columns vary gamma
for ax, C in zip(axes, [-1, 0, 3]):
    for a, gamma in zip(ax, range(-1, 2)):
        mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"],
                  ncol=4, loc=(.9, 1.2))

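- A small gamma means a large radius for the Gaussian kernel, so many points count as close together and the decision boundary is smooth. When gamma gets higher, the decision boundary focuses more on single points, which gives a more complex model.
- A small C means a very restricted model in which each data point has limited influence, so some points can stay misclassified. When C gets higher, these points have a stronger influence on the model and the decision boundary bends to classify them correctly.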

In [None]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
X_train, X_test, y_train, y_test = train_test_split(
 cancer.data, cancer.target, random_state=0)
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))

In [None]:
plt.plot(X_train.min(axis=0), 'o', label="min")
plt.plot(X_train.max(axis=0), '^', label="max")
plt.legend(loc=4)
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
plt.yscale("log")


### Preprocessing data for SVMs


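Problem: the features of this dataset are on completely different scales, and SVM models are very sensitive to the scaling of the data.
Solution: rescale each feature so that all features are on approximately the same scale. A common choice for kernel SVMs is to scale every feature to the range between 0 and 1, which is what the MinMaxScaler preprocessing method does. Here we do it by hand.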

In [None]:
# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))

In [None]:
X_test_scaled=(X_test-min_on_training)/range_on_training
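
The manual scaling above can also be done with scikit-learn's MinMaxScaler. The sketch below is a self-contained check that both give the same result; the variable names ending in _manual are made up for this comparison. Note that the scaler is fit on the training set only and then applied to the test set, exactly like min_on_training and range_on_training.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# learn the per-feature min and range on the training set only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# the same computation done by hand
min_on_training = X_train.min(axis=0)
range_on_training = (X_train - min_on_training).max(axis=0)
X_train_manual = (X_train - min_on_training) / range_on_training
X_test_manual = (X_test - min_on_training) / range_on_training

print("Train matches:", np.allclose(X_train_scaled, X_train_manual))
print("Test matches: ", np.allclose(X_test_scaled, X_test_manual))
```

Because the test set is scaled with the training minimum and range, its values can fall slightly outside the interval from 0 to 1.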

In [None]:
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
 svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))

In [None]:
svc=SVC(C=1000)
svc.fit(X_train_scaled,y_train)
print("Accuracy on training set: {:.3f}".format(
 svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))

### Strengths, weaknesses, and parameters
- Kernelized support vector machines are powerful models and perform well on a variety of datasets.
- They work well on both low-dimensional and high-dimensional data (i.e., few and many features).
- They don't scale very well with the number of samples: working with datasets of 100,000 samples or more can become challenging in terms of runtime and memory usage.
- Another downside of SVMs is that they require careful preprocessing of the data (you have to scale the features yourself) and tuning of the parameters. This is why many people use tree-based models such as random forests or gradient boosting instead.
- SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.
- Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units and they are on similar scales.

