<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2022/blob/main/Seminars/lab03_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear classification

We'll try to solve a customer churn task using data from a mobile network operator.

We have to predict whether a customer will switch to another mobile network operator.

The target field here is 'Churn'.

Let's transform the raw data, then build a Logistic Regression model and tune its parameters.

Upload the data and have a look at it

In [None]:
!wget -N https://raw.githubusercontent.com/yandexdataschool/MLatImperial2022/main/Data/telecom_churn2.csv

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

random_state=0
df = pd.read_csv('telecom_churn2.csv')
df.head()

Transform the target and some other fields

In [None]:
d = {'Yes' : 1, 'No' : 0}
df['International plan'] = df['International plan'].map(d)
df['Voice mail plan'] = df['Voice mail plan'].map(d)
df['Churn'] = df['Churn'].astype('int64')
df.head()


In [None]:
#<YOUR TURN>
# find out how many missing values (numerical and categorical) there are.


Split the data into the design matrix X and the target vector y.

Make a train-test split

In [None]:
df.head()
y=df['Churn']
X=df.drop('Churn',axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,
                                                    random_state=0)

In [None]:
X.describe()


In [None]:
#<YOUR TURN>
# analyse feature 'Area code' and transform it if necessary



Next we need to:
- Impute missing numeric and categorical values.
- Separate numerical and categorical fields.
- Scale numerical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

numeric_data = X_train.select_dtypes([np.number])
numeric_data_mean = numeric_data.mean()
X_train = X_train.fillna(numeric_data_mean)
#or use inplace = True
X_test = X_test.fillna(numeric_data_mean)
X_train.head()

In [None]:
numeric_data.head()

So far we have not extracted all the information from the data, simply because we have not used some of the features. These features are encoded as strings in the dataset, and each value represents a certain category.

Let's first fill in missing categorical values with a special category "NotGiven". Sometimes the fact that a value is missing can be an informative signal in itself.

In [None]:
numeric_features = numeric_data.columns
categorical = list(X_train.dtypes[X_train.dtypes == "object"].index)
X_train[categorical] = X_train[categorical].fillna("NotGiven")
X_test[categorical] = X_test[categorical].fillna("NotGiven")

In [None]:
X_train.head()

### Categorical features encoding

Many ML algorithms do not work with categorical features and accept only numeric ones. To turn categorical features into numeric ones, you can use encoding. Two standard transformers from sklearn for working with categorical features are `OrdinalEncoder` (simply renumbers feature values with integers starting from 0) and `OneHotEncoder` (dummy features).

### One Hot Encoding

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

<img src="https://russianblogs.com/images/855/ddd65f4f342886bb411d41a33c5528e7.png" width=50%> 




`OneHotEncoder` represents categorical variables as binary vectors.

It assigns to each feature value a whole vector consisting of zeros and a single one, which stands in the position corresponding to that value and thus encodes it.

In other words, the categories are first indexed with integers, and then each index is represented as a binary vector that is all zeros except for a 1 at that index.
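To make the difference between the two encoders concrete, here is a minimal sketch on a toy column. The values `NY`, `CA`, `TX` and the variable names are illustrative assumptions, not taken from the churn dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (hypothetical values, used only for illustration)
states = np.array([["NY"], ["CA"], ["TX"], ["CA"]])

# OrdinalEncoder: each category becomes one number (categories are sorted first)
ordinal = OrdinalEncoder().fit_transform(states)

# OneHotEncoder: each category becomes its own 0/1 column.
# The default output is a sparse matrix, so convert it to a dense array for display.
one_hot = OneHotEncoder().fit_transform(states).toarray()

print(ordinal.ravel())
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, while `ordinal` introduces an artificial ordering between categories.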



- Is it worth applying a scaling transformer to features encoded by `OneHotEncoder`?
- What about applying `OrdinalEncoder` in the case of a linear model? Tree models?

### Pipeline

We can write more streamlined code with Pipeline:

<img src="https://miro.medium.com/max/620/1*ONryJuHGGUZ6PUmYTMiFxQ.png" width=50%>

Model training is often a sequence of actions applied to the training and test sets. For example, you first scale the data (calling `fit` on the training set and only `transform` on the test set) and then train and apply the model (`fit` on the training set, `predict` on the test set).

The `sklearn.pipeline.Pipeline` class allows you to store this sequence of steps and applies it correctly to both the training and test sets.

sklearn also has a helper function that builds a pipeline without naming the steps explicitly: `sklearn.pipeline.make_pipeline`.
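As a small illustration of what a pipeline does, the sketch below compares `make_pipeline` with the same steps performed by hand. It uses synthetic data from `make_classification`; all variable names here are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Pipeline: fit() fits the scaler on the training set and then the model;
# predict() only transforms the test set with the already fitted scaler
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipeline.fit(X_tr, y_tr)
pipeline_pred = toy_pipeline.predict(X_te)

# The same sequence written by hand
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

print(list(toy_pipeline.named_steps))              # auto-generated step names
print(np.array_equal(pipeline_pred, manual_pred))  # both routes give the same predictions
```

`make_pipeline` names each step after its lowercased class name, which is why no explicit names are needed.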


### ColumnTransformer
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

<img src="https://miro.medium.com/max/537/1*BNwN3cmbLLoU9CQoJgFSKQ.png" width=40%> 


We often need to apply different sets of transformers to different groups of columns. For instance, we would want to apply `OneHotEncoder` only to categorical columns and not to numerical ones. This is where `ColumnTransformer` comes in. This time, we will partition the dataset keeping all columns so that we have both numerical and categorical features.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV,LogisticRegression

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error,mean_absolute_error

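# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False
column_transformer = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
    ('scaling', StandardScaler(), numeric_features)
])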
X_train_encoded=column_transformer.fit_transform(X_train)

pd.DataFrame(X_train_encoded).head()

In [None]:
# Question: is it necessary to scale features for a linear model?
# What if you don't have any one-hot features?

### LogisticRegression

sklearn offers 2 implementations of logistic regression:


https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


- class sklearn.linear_model.LogisticRegression ()
- class sklearn.linear_model.LogisticRegressionCV(*,

                     Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2',
                     
                     scoring=None,  solver='lbfgs', tol=0.0001, max_iter=100,
                     
                     class_weight=None, n_jobs=None, verbose=0, refit=True, 
                     
                     intercept_scaling=1.0, multi_class='auto', random_state=None, l1_ratios=None)
                     
   - Cs - int or list of floats, default=10. Each value in Cs is an inverse of regularization strength. If Cs is an int, a grid of Cs values is chosen on a logarithmic scale between 1e-4 and 1e4. Like in support vector machines, smaller values specify stronger regularization.
   - penalty - {‘l1’, ‘l2’, ‘elasticnet’}, default=’l2’
   - solver - {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’. Algorithm to use in the optimization problem.

   - cv - int or cross-validation generator, default=None (which means stratified 5-fold cross-validation)

   - l1_ratios - list of float, default=None. The list of Elastic-Net mixing parameters
   
In addition to the standard `fit` and `predict` methods, classifiers provide the `predict_proba()` method, which returns the predicted probability of each class.
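The following sketch shows how `predict_proba` relates to `predict` for a binary classifier. The data is synthetic and the names `X_demo`, `clf_demo` and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

proba = clf_demo.predict_proba(X_demo[:5])  # shape (5, 2): columns follow clf_demo.classes_
labels = clf_demo.predict(X_demo[:5])

print(proba.sum(axis=1))  # every row sums to 1
# predict() returns the class with the larger probability
print(np.array_equal(labels, clf_demo.classes_[proba.argmax(axis=1)]))

# Probabilities also allow a custom decision threshold (0.3 is an arbitrary choice)
custom_labels = (proba[:, 1] > 0.3).astype(int)
print(custom_labels)
```

A lower threshold for the positive class trades precision for recall, which can matter in churn-like tasks where positive examples are rare.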


Let's build a logistic regression with L2 regularization in a Pipeline with feature transformation, and find the best value of the regularization parameter C by cross-validation over the grid [0.0001,0.001,0.01,0.1,1,10,100].
We'll use `LogisticRegressionCV` with cv=5 cross-validation folds.

In [None]:
#import warnings
#warnings.simplefilter("ignore")

pipeline = Pipeline(steps=[
    ('ohe_and_scaling', column_transformer),
    ('regression', LogisticRegressionCV(penalty='l2',Cs=[0.0001,0.001,0.01,0.1,1,10,100],max_iter=400,
                                        cv=5,refit=True))
])

model = pipeline.fit(X_train, y_train)
y_pred = model.predict(X_test)

y_proba = model.predict_proba(X_test)
pd.DataFrame(y_proba[:, :]).head()


In [None]:
#<YOUR TURN>
# increase the max_iter parameter (or change the solver parameter) if the solver can't converge
# and you see warnings 'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'.
# As the loss function is convex, the solver should converge given enough iterations.

In [None]:
# Question: could feature scaling help if we see these warnings?

In [None]:
#<YOUR TURN>
#<calculate the accuracy of the model with warnings and compare it to the accuracy of the converged model without warnings>

In [None]:


pipeline = Pipeline(steps=[
    ('ohe_and_scaling', column_transformer),
    ('regression', LogisticRegressionCV(penalty='l1',Cs=[0.1],solver='saga', max_iter=400,
                                        cv=5,refit=True))
])

model = pipeline.fit(X_train, y_train)
y_pred = model.predict(X_test)

y_proba = model.predict_proba(X_test)
print(pd.DataFrame(y_proba[:, 1]).head())


In [None]:
from sklearn.metrics import accuracy_score
print('1. Test accuracy =' ,accuracy_score(y_pred,y_test))
print("2. C = " ,model['regression'].C_)


In [None]:
#<YOUR TURN>
# Try ElasticNet regularization instead of L1 and L2

In [None]:
#<YOUR TURN>
# try GridSearchCV with LogisticRegression here instead of LogisticRegressionCV.
# did you get the same accuracy?

### Feature binarization


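For feature binarization (binning), you can use the class `sklearn.preprocessing.KBinsDiscretizer`:
sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)

       strategy (default=’quantile’):
            - uniform - all bins in each feature have identical widths.
            - quantile - all bins in each feature have the same number of points.
            - kmeans - values in each bin have the same nearest center of a 1D k-means cluster.

Advantages of binarization: it lets a linear model capture non-monotonic and non-linear dependences of the target on a feature.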
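To see how the strategies differ, here is a minimal sketch that bins one skewed synthetic feature with the 'uniform' and 'quantile' strategies. The exponential sample and the names `skewed`, `edges`, `counts` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One skewed synthetic feature (exponential distribution), for illustration only
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=(1000, 1))

edges, counts = {}, {}
for strategy in ["uniform", "quantile"]:
    # encode="ordinal" returns the bin index instead of one-hot columns
    binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    codes = binner.fit_transform(skewed).ravel().astype(int)
    edges[strategy] = binner.bin_edges_[0]
    counts[strategy] = np.bincount(codes, minlength=4)
    print(strategy, np.round(edges[strategy], 2), counts[strategy])
```

With 'uniform' the bin widths are equal, so most points of a skewed feature land in the first bins; with 'quantile' the bins hold roughly the same number of points but have very different widths.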
       


Instead of `StandardScaler`, we apply `sklearn.preprocessing.KBinsDiscretizer` to the numerical features, splitting each of them into 20 bins with the 'uniform' strategy.
At the same time we apply `OneHotEncoder` to categorical features.
We use `ColumnTransformer` to apply these 2 transformations uniformly to the train and test datasets.



In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
%%time
column_transformer = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown="ignore"), categorical),
    ('kbins',  KBinsDiscretizer(n_bins=20, strategy='uniform'), numeric_features)
])

pipeline = Pipeline(steps=[
    ('ohe_and_scaling', column_transformer),
    ('regression',  LogisticRegressionCV(penalty='l2',Cs=[0.0001,0.001,0.01,0.1,1,10,100],cv=5,max_iter=1000,
                                       random_state=random_state))
])

model = pipeline.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Test accuracy =",accuracy_score(y_pred,y_test))
print("C= ", model[1].C_)


#### Visualization of quantile binarization of features

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt
df['q_minutes'] = pd.qcut(df['Total intl minutes'], 11)
df['service_calls'] = pd.cut(df['Customer service calls'], 5)
df['share'] = pd.qcut(df['Total intl charge']/df['Total day charge'],11,precision=2)
df.head(10)

In [None]:

plt.figure(figsize=(12,5))
sns.barplot(x= 'service_calls',y='Churn',data=df,color="blue",saturation=0.25)
plt.xlabel("Customer service calls (binned)")



In [None]:
plt.figure(figsize=(14,5))
sns.barplot(x= 'share',y='Churn',data=df,color="blue",saturation=0.25)
plt.xlabel("Total intl charge / Total day charge (quantile bins)")


In [None]:
#<YOUR TURN>
#make a new variable that makes sense and draw a similar plot for it

### Polynomial Features

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
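As a quick reminder of what the transformer produces, the sketch below expands a single toy row with two features, a=2 and b=3. The row is an assumption chosen to make the output easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of one toy row with features a=2, b=3
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))

# Columns are: 1, a, b, a^2, a*b, b^2
print(expanded)
print(poly.n_output_features_)
```

The number of output columns grows quickly with the degree and the number of input features, which is worth keeping in mind when comparing running times.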

In [None]:
#<YOUR TURN>
#apply polynomial features instead of KBinsDiscretizer and calculate the accuracy
#compare the running time (use the magic %%time)

In [None]:
%%time
import numpy as np
from sklearn.preprocessing import PolynomialFeatures



column_transformer = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown="ignore"), categorical),
    ('binner',   PolynomialFeatures(2), numeric_features)
])

pipeline = Pipeline(steps=[
    ('ohe_and_scaling', column_transformer),
    ('regression',  LogisticRegressionCV(penalty='l2',Cs=[0.0001,0.001,0.01,0.1,1,10,100],cv=5,max_iter=1000,
                                       random_state=random_state))
])

model = pipeline.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Test accuracy = ", accuracy_score(y_pred,y_test))
print("C= ",model[1].C_)

Why do polynomial features extract more information from the data than KBinsDiscretizer?

Is it worth trying polynomial features of degree higher than 2?

### GD & SGD for linear regression

$$L(\beta | X, y) = \| X\beta - y \|_2^2 \to \min_{\beta}$$

The closed-form solution is

$$\beta = (X^TX)^{-1}X^Ty.$$

Matrix inversion is a very time-consuming operation that sometimes requires an unacceptable amount of resources ($O(d^3)$) and can be numerically unstable.
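For reference, the closed-form solution can be computed without forming the inverse explicitly. The sketch below uses synthetic data; `X_ne`, `y_ne` and `beta_ne` are names invented for this example.

```python
import numpy as np

# Synthetic regression problem, used only for illustration
rng = np.random.default_rng(0)
X_ne = rng.normal(size=(100, 3))
beta_ne = np.array([1.0, -2.0, 0.5])
y_ne = X_ne @ beta_ne + rng.normal(scale=0.1, size=100)

# Solve the normal equations (X^T X) beta = X^T y as a linear system
# instead of computing the inverse of X^T X
beta_solve = np.linalg.solve(X_ne.T @ X_ne, X_ne.T @ y_ne)

# A dedicated least-squares solver gives the same answer and is usually more stable
beta_lstsq, *_ = np.linalg.lstsq(X_ne, y_ne, rcond=None)

print(beta_solve)
print(beta_lstsq)
```

Both approaches still scale poorly with the number of features, which motivates the iterative methods discussed next.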

Therefore, the parameters are often found with iterative methods. One of them is gradient descent.

Recall that at each gradient descent step, the parameter values at the next step are obtained from the current values by shifting in the direction of the antigradient of the loss:

$$\beta^{(t+1)} = \beta^{(t)} - \eta_t \nabla L(\beta^{(t)}),$$
where $\eta_t$ is the step size at iteration $t$.

The gradient in the MSE case is:

$$\nabla L(\beta) = -2X^Ty + 2X^TX\beta = 2X^T(X\beta - y).$$
 
Computing this gradient costs $O(dN)$ operations. Stochastic gradient descent differs from basic gradient descent by replacing the gradient with an unbiased estimate computed on one or more objects. In this case, the complexity becomes $O(kd)$, where $k$ is the number of objects used to estimate the gradient, $k \ll N$. This partly explains the popularity of stochastic optimization techniques.
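The gradient formula and the unbiasedness claim can both be checked numerically. The sketch below is an illustration on random data for the squared-error loss $\| X\beta - y \|_2^2$; all names in it are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X_g = rng.normal(size=(N, d))
y_g = rng.normal(size=N)
beta_g = rng.normal(size=d)

def loss(beta):
    """Squared-error loss ||X beta - y||^2."""
    r = X_g @ beta - y_g
    return r @ r

# Analytic gradient: 2 X^T (X beta - y)
full_grad = 2 * X_g.T @ (X_g @ beta_g - y_g)

# Central finite differences along each coordinate direction
h = 1e-6
numeric_grad = np.array([
    (loss(beta_g + h * e) - loss(beta_g - h * e)) / (2 * h) for e in np.eye(d)
])

# Single-object estimates: N times the gradient of the i-th term, one row per object.
# Averaging over all objects recovers the full gradient, i.e. the estimate is unbiased.
residuals = X_g @ beta_g - y_g
per_object = N * 2 * X_g * residuals[:, None]

print(np.allclose(full_grad, numeric_grad, atol=1e-4))
print(np.allclose(per_object.mean(axis=0), full_grad))
```

Each row of `per_object` costs $O(d)$ to compute, compared with $O(dN)$ for `full_grad`.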

### Visualization of GD & SGD

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# %pylab is deprecated; %matplotlib inline is enough here
%matplotlib inline
plt.rcParams['figure.figsize'] = (12.0, 8.0)

Let's generate an object-feature matrix $X$ and a weight vector $\beta_{true}$, compute the target vector $y$ as $X\beta_{true}$, and add Gaussian noise:

In [None]:
np.random.seed(16)
n_features = 2
n_objects = 300
batch_size = 10
num_steps = 43

beta_true = np.random.normal(size=(n_features, ))

X = np.random.uniform(-5, 5, (n_objects, n_features))
X *= (np.arange(n_features) * 2 + 1)[None, :]  # for different scales
Y = X.dot(beta_true) + np.random.normal(0, 1, (n_objects))
beta_0 = np.random.uniform(-2, 2, (n_features))

Let us train linear regression for MSE on the obtained data using full gradient descent - thereby we obtain a vector of parameters.

In [None]:
beta = beta_0.copy()
beta_list = [beta.copy()]
step_size = 1e-2

for i in range(num_steps):
    beta -= 2 * step_size * np.dot(X.T, np.dot(X, beta) - Y) / Y.shape[0]
    beta_list.append(beta.copy())
beta_list = np.array(beta_list)

In [None]:
#beta_list

Let's show the sequence of parameter estimates $\beta^{(t)}$ obtained during the iterations. The red dot is $\beta_{true}$.

In [None]:
# compute level set
A, B = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))

levels = np.empty_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        beta_tmp = np.array([A[i, j], B[i, j]])
        levels[i, j] = np.mean(np.power(np.dot(X, beta_tmp) - Y, 2))


plt.figure(figsize=(12, 8))
plt.title('GD trajectory')
plt.xlabel(r'$\beta_1$')
plt.ylabel(r'$\beta_2$')
plt.xlim((beta_list[:, 0].min() - 0.1, beta_list[:, 0].max() + 0.1))
plt.ylim((beta_list[:, 1].min() - 0.1, beta_list[:, 1].max() + 0.1))
plt.gca().set_aspect('equal')

# visualize the level set
CS = plt.contour(A, B, levels, levels=np.logspace(0, 1, num=20), cmap=plt.cm.rainbow_r)
CB = plt.colorbar(CS, shrink=0.8, extend='both')

# visualize trajectory
plt.scatter(beta_true[0], beta_true[1], c='r')
plt.scatter(beta_list[:, 0], beta_list[:, 1])
plt.plot(beta_list[:, 0], beta_list[:, 1])

plt.show()

We now visualize the trajectory of stochastic gradient descent, repeating the same steps but estimating the gradient from a subsample.

In [None]:
beta = beta_0.copy()
beta_list = [beta.copy()]
step_size = 0.2

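for i in range(num_steps):
    sample = np.random.randint(n_objects, size=batch_size)
    # Note: the batch gradient is divided by the full sample size (not by batch_size),
    # so the effective step on the mean batch gradient is step_size * batch_size / n_objects
    beta -= 2 * step_size * np.dot(X[sample].T, np.dot(X[sample], beta) - Y[sample]) / Y.shape[0]
    beta_list.append(beta.copy())
beta_list = np.array(beta_list)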

In [None]:
# compute level set
A, B = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))

levels = np.empty_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        beta_tmp = np.array([A[i, j], B[i, j]])
        levels[i, j] = np.mean(np.power(np.dot(X, beta_tmp) - Y, 2))


plt.figure(figsize=(12, 8))
plt.title('SGD trajectory')
plt.xlabel(r'$\beta_1$')
plt.ylabel(r'$\beta_2$')
plt.xlim((beta_list[:, 0].min() - 0.1, beta_list[:, 0].max() + 0.1))
plt.ylim((beta_list[:, 1].min() - 0.1, beta_list[:, 1].max() + 0.1))
plt.gca().set_aspect('equal')

# visualize the level set
CS = plt.contour(A, B, levels, levels=np.logspace(0, 1, num=20), cmap=plt.cm.rainbow_r)
CB = plt.colorbar(CS, shrink=0.8, extend='both')

# visualize trajectory
plt.scatter(beta_true[0], beta_true[1], c='r')
plt.scatter(beta_list[:, 0], beta_list[:, 1])
plt.plot(beta_list[:, 0], beta_list[:, 1])

plt.show()

As you can see, the stochastic gradient method "wanders" around the optimum. This is due to the choice of the gradient descent step size $\eta_k$. For stochastic gradient descent to converge, the sequence of step sizes $\eta_k$ must satisfy the Robbins-Monro conditions:
$$
\sum_{k = 1}^\infty \eta_k = \infty, \qquad \sum_{k = 1}^\infty \eta_k^2 < \infty.
$$
Intuitively, this means the following:

1. the sum of the steps must diverge so that the optimization method can reach any point in space,
2. but at the same time the steps must decrease quickly enough for the method to converge.
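A quick numerical check makes the two conditions tangible. The sketch below uses a schedule of the form $\eta_k = \eta_0 / k^{0.6}$ with $\eta_0 = 0.45$; these particular constants are an illustrative assumption, and any exponent in $(0.5, 1]$ would behave similarly.

```python
import numpy as np

# Step sizes eta_k = eta_0 / k^0.6 for k = 1..100000 (illustrative constants)
k = np.arange(1, 100_001)
eta = 0.45 / k ** 0.6

partial_sums = {}
for n in (1_000, 100_000):
    # The first sum keeps growing with n, the sum of squares levels off
    partial_sums[n] = (eta[:n].sum(), (eta[:n] ** 2).sum())
    print(n, partial_sums[n])
```

The sum of the steps grows without bound as more terms are added, while the sum of squared steps stays below a fixed constant, exactly as the two conditions require.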

Let's look at the SGD trajectory when the sequence of steps satisfies the Robbins-Monro conditions:

In [None]:
beta = beta_0.copy()
beta_list = [beta.copy()]
step_size_0 = 0.45
num_steps=100
for i in range(num_steps):
    step_size = step_size_0 / ((i+1)**0.6)
    sample = np.random.randint(n_objects, size=batch_size)
    beta -= 2 * step_size * np.dot(X[sample].T, np.dot(X[sample], beta) - Y[sample]) / Y.shape[0]
    beta_list.append(beta.copy())
beta_list = np.array(beta_list)

In [None]:
# compute level set
A, B = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))

levels = np.empty_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        beta_tmp = np.array([A[i, j], B[i, j]])
        levels[i, j] = np.mean(np.power(np.dot(X, beta_tmp) - Y, 2))


plt.figure(figsize=(12, 8))
plt.title('SGD trajectory')
plt.xlabel(r'$\beta_1$')
plt.ylabel(r'$\beta_2$')
plt.xlim((beta_list[:, 0].min() - 0.1, beta_list[:, 0].max() + 0.1))
plt.ylim((beta_list[:, 1].min() - 0.1, beta_list[:, 1].max() + 0.1))
#plt.gca().set_aspect('equal')

# visualize the level set
CS = plt.contour(A, B, levels, levels=np.logspace(0, 1, num=20), cmap=plt.cm.rainbow_r)
CB = plt.colorbar(CS, shrink=0.8, extend='both')

# visualize trajectory
plt.scatter(beta_true[0], beta_true[1], c='r')
plt.scatter(beta_list[:, 0], beta_list[:, 1])
plt.plot(beta_list[:, 0], beta_list[:, 1])

plt.show()

### Comparison of convergence rates

In [None]:
# data generation
n_features = 50
n_objects = 1000
num_steps = 200
batch_size = 2

beta_true = np.random.uniform(-2, 2, n_features)

X = np.random.uniform(-10, 10, (n_objects, n_features))
Y = X.dot(beta_true) + np.random.normal(0, 5, n_objects)

In [None]:
step_size_sgd = 1
step_size_gd = 1e-2
beta_sgd = np.random.uniform(-4, 4, n_features)
beta_gd = beta_sgd.copy()
residuals_sgd = [np.mean(np.power(np.dot(X, beta_sgd) - Y, 2))]
residuals_gd = [np.mean(np.power(np.dot(X, beta_gd) - Y, 2))]

for i in range(num_steps):
    step_size = step_size_sgd / ((i+1) ** 0.51)
    sample = np.random.randint(n_objects, size=batch_size)
    beta_sgd -= 2 * step_size * np.dot(X[sample].T, np.dot(X[sample], beta_sgd) - Y[sample]) / Y.shape[0]
    residuals_sgd.append(np.mean(np.power(np.dot(X, beta_sgd) - Y, 2)))
    
    beta_gd -= 2 * step_size_gd * np.dot(X.T, np.dot(X, beta_gd) - Y) / Y.shape[0]
    residuals_gd.append(np.mean(np.power(np.dot(X, beta_gd) - Y, 2)))

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(range(num_steps+1), residuals_gd, label='Basic Gradient Descent')
plt.plot(range(num_steps+1), residuals_sgd, label='Stochastic Gradient Descent')
plt.title('Empirical risk over iterations')
plt.xlim((-1, num_steps+1))
plt.legend()
plt.xlabel('Iter num')
plt.ylabel(r'Q($\beta$)')
plt.grid()
plt.show()

### SGD Classifier in sklearn


class sklearn.linear_model.SGDClassifier(loss='hinge', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)

- loss, default=’hinge’. The loss function to be used. Defaults to ‘hinge’, which gives a linear SVM. The possible options are ‘hinge’, ‘log_loss’ (called ‘log’ before scikit-learn 1.1), ‘modified_huber’, ‘squared_hinge’, ‘perceptron’, or a regression loss: ‘squared_error’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.
- penalty, {‘l2’, ‘l1’, ‘elasticnet’}, default=’l2’
- alpha, default=0.0001. Constant that multiplies the regularization term.
- max_iter, default=1000. The maximum number of passes over the training data (aka epochs).
- learning_rate, default=’optimal’:
  - ‘constant’: eta = eta0
  - ‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
  - ‘invscaling’: eta = eta0 / pow(t, power_t)
  - ‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
- eta0, default=0.0. The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.
 
#### The advantages of Stochastic Gradient Descent are:

- Efficiency.

- Ease of implementation (lots of opportunities for code tuning).

#### The disadvantages of Stochastic Gradient Descent include:

- SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.

- SGD is sensitive to feature scaling.
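The sensitivity to feature scaling is easy to probe. The sketch below is an illustration on synthetic data in which one feature is deliberately multiplied by 1000; the dataset parameters and all variable names are assumptions, and the exact accuracies depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated data; one feature is put on a much larger scale
X_s, y_s = make_classification(n_samples=600, n_features=5, n_informative=3,
                               n_redundant=0, n_clusters_per_class=1,
                               class_sep=2.0, random_state=0)
X_s[:, 0] *= 1000.0
X_s_tr, X_s_te, y_s_tr, y_s_te = train_test_split(X_s, y_s, test_size=0.3,
                                                  random_state=0)

# Same classifier, trained on raw features and on standardized features
raw_sgd = SGDClassifier(random_state=0).fit(X_s_tr, y_s_tr)
scaled_sgd = make_pipeline(StandardScaler(),
                           SGDClassifier(random_state=0)).fit(X_s_tr, y_s_tr)

acc_raw = raw_sgd.score(X_s_te, y_s_te)
acc_scaled = scaled_sgd.score(X_s_te, y_s_te)
print("raw features:   ", acc_raw)
print("scaled features:", acc_scaled)
```

On unscaled data the large-scale feature dominates the gradient, so training may be less stable; placing `StandardScaler` in front of the classifier is the usual remedy.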

In [None]:
from sklearn.linear_model import SGDClassifier

results=[]
for eps in [0.00001,0.0001,0.01,0.05,0.1,0.2,0.5,1.0]:
    pipeline = Pipeline(steps=[
        ('ohe_and_scaling', column_transformer),
        # loss='log_loss' gives logistic regression; it was called 'log' before scikit-learn 1.1
        ('regression', SGDClassifier(max_iter=100,loss='log_loss',penalty='l2',alpha=0.1,
                                     learning_rate='constant',eta0=eps,
                                     random_state=random_state,n_iter_no_change=20))
    ])

    model = pipeline.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(" Test accuracy = %.4f learning rate= %.6f n_iter_=%.f" % (accuracy_score(y_pred,y_test), eps,model[1].n_iter_))
    results.append((accuracy_score(y_pred,y_test), eps))


In [None]:
print("Max test accuracy = %.4f \nlearning rate= %.4f" % 
      (max(results, key = lambda i : i[0])[0],max(results, key = lambda i : i[0])[1]))

Just as in the previous task, we will train the model, this time with learning_rate='adaptive' (the learning rate is divided by 5 if the training loss does not improve for several consecutive epochs). If eps is set too large, the method is quite likely not to converge; this depends, in particular, on the n_iter_no_change parameter.

In [None]:
from sklearn.linear_model import SGDClassifier

results=[]
for eps in [1,5,10,100]:
    pipeline = Pipeline(steps=[
        ('ohe_and_scaling', column_transformer),
        # loss='log_loss' gives logistic regression; it was called 'log' before scikit-learn 1.1
        ('regression', SGDClassifier(max_iter=200,loss='log_loss',penalty='l2',alpha=0.1,
                                     learning_rate='adaptive',eta0=eps,
                                     random_state=random_state,n_iter_no_change=5 ))
    ])

    model = pipeline.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(eps,accuracy_score(y_pred,y_test),model[1].n_iter_)
    results.append((accuracy_score(y_pred,y_test), eps))

In [None]:
#<YOUR TURN>
#try changing the parameters to get better results