# Logistic Regression in Practice

In this session, we apply logistic regression to a real data set and examine the predictions it produces.

In [None]:
#First, we import the required packages
import pandas as pd #the pandas library is useful for data processing 
import numpy as np #numpy package will be useful for numerical computations
import matplotlib.pyplot as plt #the matplotlib library is useful for plotting purposes

# The following IPython magic command renders plots directly in the notebook
%matplotlib inline

Now let us consider an openly available data set: the iris data set that ships with scikit-learn.

In [None]:
from sklearn.datasets import load_iris  #importing the load_iris function
iris_data = load_iris() #loading the iris dataset in iris_data

print(iris_data['DESCR']) #checking out the description of the dataset

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [None]:
iris_data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [None]:
X = iris_data['data']
print(X)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

From the description, the *iris* dataset consists of 3 classes: *Iris-Setosa*, *Iris-Versicolour* and *Iris-Virginica*. Since logistic regression has been introduced as a binary classifier, we convert the task into a binary classification problem: deciding whether a flower belongs to *Iris-Virginica* or not. To do that, we map the labels $\{0,1\}$ to $0$ and the label $2$ (*Iris-Virginica*) to $1$.

In [None]:
y = np.where(iris_data['target'] == 2, 1, 0) #map label 2 (Iris-Virginica) to 1 and every other label to 0
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]


$\Large{\text{Logistic Regression with L2 regularisation}}:$

Assume that we have a random variable $X$ whose realization $x$ is a $d$-dimensional data point:  
$
x=
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots\\
x_d
\end{bmatrix}$.
Now we want to model the response variable $Y$ using logistic regression. 

Recall that we use $E[Y|X=x] = p(x)$. 

Also recall that we write $p(x)$ equivalently as:
$
p(x)=p(x;\beta_0,\beta_1,\ldots,\beta_d) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d)}}.
$

If we denote $p(x)$ simply as $p$ and if we have the notations $\mathbf{x}=\begin{bmatrix}
x \\ 1
\end{bmatrix} = \begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_d \\ 1
\end{bmatrix}, \beta=\begin{bmatrix}
\beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \\ \beta_0
\end{bmatrix}$
then we can write:

$
\begin{align}
p = \frac{1}{1+e^{-\beta^\top \mathbf{x}}}.
\end{align}
$

We also derived that

$
\begin{align}
\beta^\top \mathbf{x} &= \ln \frac{p}{1-p} \\
\implies \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d &= \ln \frac{p}{1-p}.
\end{align}
$

Thus, even though $Y$ does not depend linearly on an observation $x$ of $X$, the linear function $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d $ is related to the probability $p$ through:

$
\ln\frac{p}{1-p}=\beta^\top \mathbf{x}.
$

Note that the ratio $\frac{p}{1-p}$ is called the $\textbf{odds}$ that the event $Y=1$ occurs, and hence $\ln \frac{p}{1-p}$ denotes the $\textbf{log odds}$. 

More popularly, the log odds $\ln \frac{p}{1-p}$ is called the $\textbf{logit}$ function. 
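
As a quick numerical illustration of the relations above, the following minimal sketch evaluates the probability, the odds and the log odds for a few values of $\beta^\top \mathbf{x}$. The helper names `sigmoid` and `logit` and the sample values in `z` are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability p in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to its log odds ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 1.5])   # illustrative values of beta^T x
p = sigmoid(z)                   # corresponding probabilities P[Y=1]
odds = p / (1.0 - p)             # odds that Y = 1

print('p:', p)
print('odds:', odds)
print('logit(p):', logit(p))     # recovers z up to floating-point error
```

The last line reproduces `z`, which reflects the fact that the logit is the inverse of the sigmoid map.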


$\Large{\text{Likelihood function}}$ 

We defined a quantity useful in the estimation of the parameters $\beta$ used to model $p$.


Given an observation $X=x=\begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_d
\end{bmatrix}$ 
we define the $\textbf{likelihood function}$ as: 

$
L(y;p) = p^y(1-p)^{(1-y)}
$

where, recall, $p=p(x)=p(x;\beta)=\frac{1}{1+e^{-\beta^\top {\mathbf{x}}}}$. Note that the likelihood function is simply an equivalent way to represent $P[Y=y]$ when $Y$ is assumed to be a Bernoulli random variable with parameter $p$. 

Then observe that the natural goal is to maximize the likelihood function with respect to parameters $\beta$. 

Now given a data set $D$ containing $n$ observations of the form $\{({x}^1,y^1), ({x}^2,y^2), \ldots, ({x}^n,y^n)\}$, and assuming that the pairs $({x}^i,y^i)$ are independent observations, then it is possible to extend the likelihood function as: 

$
\begin{align}
L(y^1,\ldots,y^n;p^1,\ldots,p^n) = \prod_{i=1}^{n} {(p^i)}^{y^i}(1-p^i)^{(1-y^i)}.
\end{align}
$

We can now write the $\textbf{log likelihood}$ function as:

$
\begin{align}
\ln L(y^1,\ldots,y^n;p^1,\ldots,p^n) = \sum_{i=1}^{n} y^i \ln {(p^i)} + (1-y^i) \ln (1-p^i).
\end{align}
$

Since the natural logarithm is strictly increasing, maximizing the likelihood function is equivalent to maximizing the log likelihood function. 

Hence the concerned optimization problem is: 

$
\max_{\beta} \ln L(y^1,\ldots,y^n;p^1,\ldots,p^n)= \sum_{i=1}^{n} y^i \ln {(p^i)} + (1-y^i) \ln (1-p^i).
$

Note that $p^i = p(x^i) = p(x^i; \beta) = \frac{1}{1+e^{-\beta^\top\mathbf{x}^i}}, \forall i = 1,\ldots,n$.



$\Large{\text{Solving the regularized likelihood maximization problem}}:$

To solve 

$
\max_{\beta} f(\beta) = \ln L(y^1,\ldots,y^n;p^1,\ldots,p^n) - \frac{\lambda}{2} \|\beta\|_2^2 = \sum_{i=1}^{n} y^i \ln {(p^i)} + (1-y^i) \ln (1-p^i) - \frac{\lambda}{2} \|\beta\|_2^2,
$

we first compute the gradient of the log likelihood term $\ln L$ with respect to $\beta$; the penalty term is handled separately afterwards:

$
\begin{align}
\nabla_\beta {\ln L} =  \begin{bmatrix}
\frac{\partial{ \ln L }} {\partial \beta_1} \\
\frac{\partial{ \ln L }} {\partial \beta_2} \\
\vdots \\
\frac{\partial{ \ln L }} {\partial \beta_d} \\
\frac{\partial{ \ln L }} {\partial \beta_0} 
\end{bmatrix}
&= \sum_{i=1}^{n} \frac{y^i}{p^i} \nabla_\beta  p^i + \frac{(1-y^i)}{(1-p^i)} \nabla_\beta (1-p^i) \\ 
&= \sum_{i=1}^{n} \frac{y^i}{p^i} \nabla_\beta  p^i - \frac{(1-y^i)}{(1-p^i)} \nabla_\beta p^i 
\end{align}
$

where $\nabla_\beta  p^i = \nabla_\beta  \left ( \frac{1}{1+e^{-\beta^\top {\mathbf{x}}^i}}\right )$.

Note that $\nabla_\beta p^i$ can be computed as:
$
\begin{align}
\nabla_\beta p^i = \nabla_\beta \frac{1}{1+e^{-\beta^\top {\mathbf{x}}^i}} = 
\frac{e^{-\beta^\top {\mathbf{x}}^i}} {(1+e^{-\beta^\top {\mathbf{x}}^i})^2}
{\mathbf{x}}^i=
\left (\frac{1}{1+e^{-\beta^\top {\mathbf{x}}^i}}\right ) \left (\frac{e^{-\beta^\top {\mathbf{x}}^i}}{1+e^{-\beta^\top {\mathbf{x}}^i}}\right ) {\mathbf{x}}^i =   
p^i(1-p^i) {\mathbf{x}}^i
\end{align}
$

Thus we have 

$
\begin{align}
\nabla_\beta {\ln L} &= \sum_{i=1}^{n} \frac{y^i}{p^i} \nabla_\beta  p^i - \frac{(1-y^i)}{(1-p^i)} \nabla_\beta p^i  \\ 
&=  \sum_{i=1}^{n} \frac{y^i}{p^i} p^i(1-p^i){\mathbf{x}}^i - \frac{(1-y^i)}{(1-p^i)} p^i(1-p^i){\mathbf{x}}^i \\
&= \sum_{i=1}^{n} \left(y^i (1-p^i) - (1-y^i) p^i\right ){\mathbf{x}}^i\\
&= \sum_{i=1}^{n} \left(y^i - y^ i p^i - p^i + y^i p^i\right ){\mathbf{x}}^i \\
&= \sum_{i=1}^{n} \left(y^i - p^i \right ){\mathbf{x}}^i
\end{align}
$

Hence the overall gradient of the regularized objective function is:

$
\nabla_{\beta} f(\beta) = \sum_{i=1}^{n} \left(y^i - p^i \right ){\mathbf{x}}^i - \lambda \beta
$
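
A gradient formula of this kind can be sanity-checked numerically. The sketch below compares the closed-form expression $\sum_{i} (y^i - p^i){\mathbf{x}}^i - \lambda \beta$ with central finite differences of the regularized objective. The function names, the step `h` and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def objective(beta, lam, X, y):
    """Regularized log likelihood f(beta); X carries a trailing column of ones."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)) - 0.5 * lam * (beta @ beta)

def gradient(beta, lam, X, y):
    """Closed-form gradient: sum_i (y^i - p^i) x^i - lam * beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p) - lam * beta

def numerical_gradient(beta, lam, X, y, h=1e-6):
    """Central finite differences, one coordinate of beta at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = h
        grad[j] = (objective(beta + e, lam, X, y) - objective(beta - e, lam, X, y)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
X_demo = np.hstack((rng.normal(size=(20, 3)), np.ones((20, 1))))  # synthetic features plus bias column
y_demo = rng.integers(0, 2, size=20).astype(float)               # synthetic 0/1 labels
beta_demo = rng.normal(scale=0.1, size=4)                        # arbitrary test point

max_abs_diff = np.max(np.abs(gradient(beta_demo, 0.1, X_demo, y_demo)
                             - numerical_gradient(beta_demo, 0.1, X_demo, y_demo)))
print('max abs difference:', max_abs_diff)
```

A difference close to floating-point noise suggests that the analytical gradient and the objective agree; a large difference usually points to a missing factor such as ${\mathbf{x}}^i$ or a sign error.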

We generally adopt an iterative procedure as follows to find the optimal $\beta$. 

$\large{\text{Gradient ascent for solving the likelihood maximization problem}}:$

$
\begin{align}
&\textbf{Step 0:}  \text{Input data set $D$, tolerances $\epsilon_1, \epsilon_2$.} \\
&\textbf{Step 1:}  \text{Start with arbitrary $\beta_0$.} \\
&\textbf{Step 2:}  \text{For $k=0,1,2,\ldots$} \\
&\quad \quad \textbf{Step 2.1:} \text{Compute gradient $\nabla_\beta f(\beta_k)$}. \\
&\quad \quad \textbf{Step 2.2:}  \text{Compute step length $\eta_k$ using a line search procedure}. \\
&\quad \quad \textbf{Step 2.3:}  \beta_{k+1} = \beta_{k} + \eta_k \nabla_\beta f(\beta_k) \\
&\quad \quad \textbf{Step 2.4:}  \text{if $\|\nabla_{\beta} f(\beta_{k+1})\|_2 \leq \epsilon_1$ break from loop} \\
&\quad \quad \textbf{Step 2.5:}  \text{if relative change in function value is $\leq \epsilon_2$ break from loop} \\
&\textbf{Step 3:}  \text{ Output $\beta_{k+1}$}
\end{align}
$


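Note that $\eta_k$ denotes the step length (learning rate) at iteration $k$, and that in this algorithm $\beta_k$ denotes the $k$-th iterate of the full parameter vector, not the $k$-th coefficient. 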
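
Step 2.2 leaves the line search unspecified. One common choice is backtracking with a sufficient-increase (Armijo) condition, and the sketch below assumes that choice. The function names, the constant `c=1e-4`, the halving factor and the synthetic data are all assumptions made for this illustration.

```python
import numpy as np

def f(beta, lam, X, y):
    """Regularized log likelihood, written with logaddexp to avoid log(0)."""
    z = X @ beta
    return -np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)) - 0.5 * lam * (beta @ beta)

def grad_f(beta, lam, X, y):
    """Gradient of f: sum_i (y^i - p^i) x^i - lam * beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p) - lam * beta

def gradient_ascent(X, y, lam, eps_1=1e-6, eps_2=1e-12, max_iter=10000, c=1e-4):
    beta = np.zeros(X.shape[1])                        # Step 1: starting point
    f_old = f(beta, lam, X, y)
    g = grad_f(beta, lam, X, y)                        # Step 2.1 for the first iterate
    for k in range(max_iter):                          # Step 2
        eta = 1.0                                      # Step 2.2: backtracking line search
        while eta > 1e-12 and f(beta + eta * g, lam, X, y) < f_old + c * eta * (g @ g):
            eta *= 0.5                                 # shrink until f increases sufficiently
        beta = beta + eta * g                          # Step 2.3: ascent update
        g = grad_f(beta, lam, X, y)                    # gradient at the new iterate
        f_new = f(beta, lam, X, y)
        if np.linalg.norm(g) <= eps_1:                 # Step 2.4: small gradient norm
            break
        if abs(f_new - f_old) <= eps_2 * abs(f_old):   # Step 2.5: small relative change
            break
        f_old = f_new
    return beta, k + 1                                 # Step 3: iterate and iteration count

rng = np.random.default_rng(0)
X_demo = np.hstack((rng.normal(size=(40, 2)), np.ones((40, 1))))       # synthetic data with bias column
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(float)  # noisy linear labelling rule

beta_hat, num_iter = gradient_ascent(X_demo, y_demo, lam=0.1)
print('iterations:', num_iter, ' beta:', beta_hat)
print('gradient norm:', np.linalg.norm(grad_f(beta_hat, 0.1, X_demo, y_demo)))
```

Because the step length adapts to the local curvature, a scheme like this typically needs far fewer iterations than a small constant learning rate, at the cost of extra objective evaluations per iteration.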


$\large{\text{Module for objective value computation}}:$

In [None]:
import numpy as np
from numpy.linalg import norm
#computing the L2-regularized log likelihood function
def log_likelihood(beta,lamda, X, y):
  ll = 0.0 #running sum of the per-sample log likelihood terms
  num_samples = X.shape[0]
  for i in range(num_samples):
    x_i = X[i] #access i-th feature row
    y_i = float(y[i]) #access i-th label
    p_i = 1.0/(1.0+np.exp(-np.dot(beta, x_i))) #probability with the current beta
    ll += y_i * np.log(p_i) + (1.0-y_i)*np.log(1-p_i)  #adding the i-th log likelihood term
  return ll-(lamda/2)*np.dot(beta,beta) #subtracting the penalty term
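
The loop above evaluates `np.log(p_i)` and `np.log(1-p_i)` directly. When $\beta^\top \mathbf{x}$ is large in magnitude, `p_i` rounds to exactly 0 or 1 and the logarithm becomes `-inf`. A possible vectorized alternative, sketched here under the hypothetical name `log_likelihood_stable` (it is not part of the original notebook), uses the identities $\ln p = -\ln(1+e^{-z})$ and $\ln(1-p) = -\ln(1+e^{z})$ with $z=\beta^\top \mathbf{x}$, evaluated through `np.logaddexp`.

```python
import numpy as np

def log_likelihood_stable(beta, lamda, X, y):
    """Vectorized L2-regularized log likelihood that avoids log(0)."""
    z = X @ beta                                  # beta^T x for every row of X
    # np.logaddexp(0, t) computes ln(1 + e^t) without overflow
    ll = -np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))
    return ll - (lamda / 2.0) * np.dot(beta, beta)

X_demo = np.array([[5.1, 3.5, 1.4, 0.2]])         # first iris sample
y_demo = np.array([0])

# With beta = 0 every p_i equals 0.5, so the value is ln(1/2), about -0.6931
print(log_likelihood_stable(np.zeros(4), 1e-3, X_demo, y_demo))

# A deliberately extreme beta: z is about 510, and the result stays finite (about -510)
big_beta = np.array([100.0, 0.0, 0.0, 0.0])
print(log_likelihood_stable(big_beta, 0.0, X_demo, y_demo))
```

For moderate values of $\beta^\top \mathbf{x}$ the two formulations agree up to rounding, so the stable form can be substituted without changing the optimization problem.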

$\large{\text{Module for gradient computation}}:$

In [None]:
#computing gradient of objective function with respect to beta

def compute_gradient(beta,lamda,X,y):
  gradient = np.zeros(X.shape[1])
  num_samples = X.shape[0]
  for i in range(num_samples):
    x_i = X[i] #access i-th feature row
    y_i = float(y[i]) #access i-th label
    p_i = 1.0/(1.0+np.exp(-np.dot(beta, x_i))) #probability with the current beta
    gradient = np.add(gradient,(y_i -p_i)*x_i) 
  return np.add(gradient,-lamda*beta) #adding penalty term in gradient

In [None]:
#cross-check compute_gradient and log_likelihood functions
beta = np.zeros(X.shape[1])
print(X[0])
print(beta)
np.dot(X[0],beta)
print('log likelihood:',log_likelihood(beta,1e-3, np.array([X[0]]), np.array([y[0]])))
print('gradient:',compute_gradient(beta,1e-3,np.array([X[0]]), np.array([y[0]])))

[5.1 3.5 1.4 0.2]
[0. 0. 0. 0.]
log likelihood: -0.6931471805599453
gradient: [-2.55 -1.75 -0.7  -0.1 ]


$\large{\text{Module for Training}}:$

In [None]:
def solve_l2_regularized_logistic_regression(X,y, lamda, max_iter = 500000, eps_1=0.01, eps_2= 1e-9, verbose=0, plot_obj=False):
  #max_iter denotes maximum number of iterations
  #eps_1 is tolerance for gradient norm
  #eps_2 is tolerance for relative function value difference

  #gradient ascent for likelihood maximization 
  beta = np.zeros(X.shape[1])

  f_val_list = []
  if verbose>3:
    f_val = log_likelihood(beta,lamda, X, y)
    #store the objective function values for plotting purposes
    f_val_list.append(f_val)
    print('Initial values: beta:', beta, ' log likelihood:', f_val)
  #the loop 
  for k in range(max_iter):
    
    #compute gradient of objective function with respect to beta
    grad_beta = compute_gradient(beta,lamda, X, y)
      
    #lr = linesearch(beta_0, beta_1, grad_beta_0, grad_beta_1, float(f_val_list[-1]))
    lr = 0.01 #we use a constant step length (or) learning rate
    #update beta 
    beta = np.add(beta, lr*grad_beta)
    
    #print('k: ', k, ' grad beta: ', grad_beta,  'beta:',beta)
    grad_norm = np.linalg.norm(grad_beta)
    if verbose>3:
      f_val = log_likelihood(beta,lamda, X, y)
      f_val_list.append(f_val)
      rel_change_in_fval = np.abs((f_val - f_val_list[-2])/f_val_list[-2])
      if k%10000 == 0: #report progress every 10000 iterations
        print('k: ', k,  ' beta:', beta, ' grad norm:', grad_norm, 'log likelihood:', f_val)
    #stop when the gradient norm is small; the eps_2 test stays disabled because
    #rel_change_in_fval is only computed when verbose>3
    if grad_norm <= eps_1:# or rel_change_in_fval <= eps_2:
      break
  if verbose>=0:
    print('Final: k: ', k,  ' beta:', beta, ' grad norm:', grad_norm, 'log likelihood:', log_likelihood(beta,lamda, X, y))

  if plot_obj:
    #plot the function values during optimization (f_val_list is only filled when verbose>3)
    plt.plot( f_val_list, '-r', label='Log Likelihood values')
    plt.title("Log likelihood function values vs iterations")
    plt.xlabel("Iteration")
    plt.ylabel("Log Likelihood")
    plt.legend(loc='lower right')
    #plt.grid()
    plt.show()
  return beta

$\large{\text{Module for accuracy computation}}:$

In [None]:
def compute_accuracy(beta,X,y):
  num_samples = X.shape[0]
  num_correct_predicted = 0
  for i in range(num_samples):
    x_i = X[i] #access i-th feature row
    y_i = float(y[i]) #access i-th label
    p_i = 1.0/(1.0+np.exp(-np.dot(beta, x_i))) #probability with the current beta
    if p_i >= 0.5:
      y_i_pred = 1
    else:
      y_i_pred = 0
    if y[i] == y_i_pred:
      num_correct_predicted+=1  
  accuracy = num_correct_predicted/num_samples
  print('num samples:', num_samples, 'num correct predictions: ', num_correct_predicted, 'accuracy:', accuracy)
  return accuracy
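
The same computation can be written without an explicit loop. The sketch below is a vectorized variant under the hypothetical names `predict_labels` and `accuracy_vectorized`; the tiny data set is made up for illustration, with the last column of `X_demo` playing the role of the bias feature.

```python
import numpy as np

def predict_labels(beta, X, threshold=0.5):
    """Predict 1 where the modelled probability reaches the threshold, else 0."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # probabilities for all rows at once
    return (p >= threshold).astype(int)

def accuracy_vectorized(beta, X, y):
    """Fraction of samples whose predicted label equals the true label."""
    return np.mean(predict_labels(beta, X) == y)

X_demo = np.array([[ 1.0, 1.0],
                   [-1.0, 1.0],
                   [ 2.0, 1.0],
                   [-2.0, 1.0]])
y_demo = np.array([1, 0, 1, 1])
beta_demo = np.array([1.0, 0.0])          # predicts 1 exactly when the first feature is >= 0

print(predict_labels(beta_demo, X_demo))              # [1 0 1 0]
print(accuracy_vectorized(beta_demo, X_demo, y_demo)) # 0.75
```

With the default threshold of 0.5, the rule `p >= 0.5` is equivalent to $\beta^\top \mathbf{x} \geq 0$, so the decision boundary is the hyperplane $\beta^\top \mathbf{x} = 0$.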


$\large{\text{k-Fold Cross-Validation}}:$

In [None]:
from sklearn.model_selection import train_test_split

seed = 2000

X_train_initial, X_test, y_train_initial, y_test = train_test_split(X, y, test_size = 0.25, random_state = seed)
X_train_initial = np.hstack((X_train_initial,np.ones((X_train_initial.shape[0],1))))
X_test = np.hstack((X_test,np.ones((X_test.shape[0],1))))

print('X_train_initial shape:', X_train_initial.shape)
print('y_train_initial shape:', y_train_initial.shape)
print('y_train_initial:', y_train_initial)

print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
print('y_test:', y_test)

seeds = [100, 200, 300]

lambdas = [0.001, 0.01, 0.1]

train_acc_seeds = [] 
val_acc_seeds = [] 

for seed in seeds: 
  X_train, X_val, y_train, y_val = train_test_split(X_train_initial, y_train_initial, test_size = 0.2, random_state = seed)
  print('#################################################')
  print('K fold seed:', seed)
  print('X_train shape:', X_train.shape)
  print('y_train shape:', y_train.shape)
  #print('y_train:', y_train)
  
  print('X_test shape:', X_test.shape)
  print('y_test shape:', y_test.shape)
  #print('y_test:', y_test)
  
  
  
  train_acc_lambdas = []
  val_acc_lambdas = []
  
  for lambda_ in lambdas:
    print('*******************************************')
    print('lambda:',lambda_)
    beta = solve_l2_regularized_logistic_regression(X_train,y_train, lambda_)
    print('beta:',beta)
    print('***************')

    train_acc = compute_accuracy(beta,X_train,y_train)
    train_acc_lambdas.append(train_acc)

    val_acc = compute_accuracy(beta,X_val,y_val)
    val_acc_lambdas.append(val_acc)

  train_acc_seeds.append(np.array(train_acc_lambdas))

  val_acc_seeds.append(np.array(val_acc_lambdas))

print('***************** K fold cross validation complete ! **************')
val_acc_seeds = np.array(val_acc_seeds).squeeze()
print(val_acc_seeds)

mean_val_acc_lambdas = np.mean(val_acc_seeds,axis=0)

train_acc_seeds = np.array(train_acc_seeds).squeeze()
print(train_acc_seeds)

mean_train_acc_lambdas = np.mean(train_acc_seeds,axis=0)


print('k-fold val acc values:', mean_val_acc_lambdas.squeeze())
print('k-fold train acc values:', mean_train_acc_lambdas.squeeze())

best_lambda_idx = np.argmax(mean_val_acc_lambdas)
best_lambda = lambdas[best_lambda_idx]

print('*************************************')
print('best lambda:', best_lambda)
print('*************************************')

print('Solving with full train data and best lambda:')
beta = solve_l2_regularized_logistic_regression(X_train_initial,y_train_initial, best_lambda)
print('lambda:',best_lambda)
print('Final beta:',beta)
print('**********************************')

train_acc = compute_accuracy(beta,X_train_initial,y_train_initial)

test_acc = compute_accuracy(beta,X_test,y_test)

print('*************************************')
print('Printing train and test accuracies after training on full data and best lambda:')
print('Train acc :', train_acc)
print('Test acc:', test_acc)
print('************************************')


X_train_initial shape: (112, 5)
y_train_initial shape: (112,)
y_train_initial: [0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 1 1 0 0 1
 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1
 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0
 0]
X_test shape: (38, 5)
y_test shape: (38,)
y_test: [0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 0
 1]
#################################################
K fold seed: 100
X_train shape: (89, 5)
y_train shape: (89,)
X_test shape: (38, 5)
y_test shape: (38,)
*******************************************
lambda: 0.001
Final: k:  50416  beta: [-13.79412577  -5.5638296   18.70950498  14.91305099 -14.27554457]  grad norm: 0.009999883550355443 log likelihood: -1.4508850850234418
beta: [-13.79412577  -5.5638296   18.70950498  14.91305099 -14.27554457]
***************
num samples: 89 num correct predictions:  89 accuracy: 1.0
num samples: 23 num correct predictions:  23
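
The cell above draws a fresh random train/validation split for each seed, so a given sample may land in the validation set several times or never. In k-fold cross-validation proper, the data is partitioned into k disjoint folds and each fold serves as the validation set exactly once. A minimal sketch of that variant using scikit-learn's `KFold` follows. Here `fit_sketch` and `accuracy` are hypothetical stand-ins for the training and accuracy modules, the full iris data is used instead of `X_train_initial` to keep the block self-contained, and the fixed iteration budget means each fit is only approximate.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

def fit_sketch(X, y, lam, num_iter=5000):
    """Gradient ascent with a conservative constant step 1/L (hypothetical helper)."""
    L = 0.25 * np.linalg.norm(X, 2) ** 2 + lam   # upper bound on the curvature of the objective
    beta = np.zeros(X.shape[1])
    for _ in range(num_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta = beta + (X.T @ (y - p) - lam * beta) / L
    return beta

def accuracy(beta, X, y):
    """Fraction of correct predictions; p >= 0.5 is equivalent to X @ beta >= 0."""
    return np.mean((X @ beta >= 0.0).astype(int) == y)

iris = load_iris()
X_all = np.hstack((iris['data'], np.ones((iris['data'].shape[0], 1))))  # append bias column
y_all = np.where(iris['target'] == 2, 1, 0)                              # Iris-Virginica vs rest

lambdas = [0.001, 0.01, 0.1]
kf = KFold(n_splits=3, shuffle=True, random_state=100)

val_acc = np.zeros((kf.get_n_splits(), len(lambdas)))
val_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_all)):
    val_sizes.append(len(val_idx))               # each sample is validated exactly once overall
    for j, lam in enumerate(lambdas):
        beta = fit_sketch(X_all[train_idx], y_all[train_idx], lam)
        val_acc[fold, j] = accuracy(beta, X_all[val_idx], y_all[val_idx])

mean_val_acc = val_acc.mean(axis=0)              # average over folds for each lambda
best_lambda = lambdas[int(np.argmax(mean_val_acc))]
print('validation fold sizes:', val_sizes)
print('mean validation accuracy per lambda:', mean_val_acc)
print('best lambda:', best_lambda)
```

The selection rule is the same as in the cell above (pick the $\lambda$ with the highest mean validation accuracy); only the way the validation sets are formed differs.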

In [2]:
# plt.figure(figsize =(12,7))

# plt.plot(lambdas,mean_train_acc_lambdas,label = "Training accuracy value")
# plt.plot(lambdas, mean_val_acc_lambdas,label = "Validation accuracy value")
# plt.xlabel('Lambda')
# plt.ylabel('Acc. score ')
# plt.title("Avg. acc. score vs Lambda")
# plt.xscale('log')
# plt.grid()
# plt.legend()
# plt.xticks( [1e-3, 1e-2])
# plt.show()
