## Logistic Regression


In this notebook we are going to implement logistic regression to solve a binary classification problem. In particular, you will have to:

* Complete `logRegParamEstimates(XTrain,yTrain)` function that fits a logistic regressor to data using the Gradient Descent algorithm.
* Complete a function `logRegNEWRegrPredict(XTrain, yTrain, xTest)` to implement logistic regression algorithm and run it on the data.

# Import libraries

The required libraries for this notebook are pandas, sklearn, scipy, numpy and matplotlib.

In [1]:
# import libraries
# from pandas.tools.plotting import scatter_matrix
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as patches
from scipy.special import expit
import itertools

# Load the data
We will use the dataset ***classification data.txt***. It characterizes 2 different classes of fruits (apple or non apple) based on 7 features. 

In [2]:
# Loading the TXT file
fruits = pd.read_table('./classification_data.txt')


# Split data into training and testing

In [3]:
# Split the data
feature_names = ['mass', 'width', 'height', 'color_score']
x = fruits[feature_names]
y = fruits['fruit_label']

# Split the data into training and testing(75% training and 25% testing data)
x_train, x_test, y_train, y_test = train_test_split(x, y,random_state=0)

# Pre-process data
scaler = MinMaxScaler() # This estimator scales and translates each feature individually such that it is in the given range on the training set, default between(0,1)
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


# Task 1: Use logistic regression from a library

We will first see how logistic regression can be implemented using already available functions from the scikit-learn library.

In [4]:
# sklearn functions implementation
def logRegrPredict(x_train, y_train,xtest ):
    # Build Logistic Regression Model
    logreg = LogisticRegression(solver='lbfgs')
    # Train the model using the training sets
    logreg.fit(x_train, y_train)
    y_pred= logreg.predict(xtest)
    #print('Accuracy on test set: {:.2f}'.format(logreg.score(x_test, y_test)))
    return y_pred

y_pred = logRegrPredict(x_train, y_train,x_test)
print('Accuracy on test set: '+str(accuracy_score(y_test,y_pred)))
print(classification_report(y_test,y_pred))#text report showing the main classification metrics


Accuracy on test set: 0.7
              precision    recall  f1-score   support

           1       0.75      0.60      0.67         5
           3       0.67      0.80      0.73         5

    accuracy                           0.70        10
   macro avg       0.71      0.70      0.70        10
weighted avg       0.71      0.70      0.70        10



# Task 2: Implement your own logistic regression function


You will be given the partially-implemented`paramEstimates(xTrain, yTrain)` function that returns the parameters estimated by gradient descent. You are asked to complete the cost function as follows:

\begin{align}
J\left(\theta \right) & =  -{\frac{1}{n}}[\sum_{i=1}^n \left(y_i \log_2(P_r(\hat{y}=1|x_i;\theta))+(1-y_i)\log_2(1-P_r(\hat{y}=1|x_i;\theta)) \right)]\\
\end{align}

You are also asked to complete the `logRegrNEWRegrPredict(xTrain, yTrain, xTest)` function, or write your own, that returns the output variable y given the input features x as follows: 
\begin{align}
\hat{y} & = \frac{1}{1+e^{-\theta^{*t}x}}
\end{align}

***Remember that we train on `xTrain` and `yTrain`!***

In [14]:
def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def loss(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
 
def logRegParamEstimates(xTrain, yTrain):
    intercept = np.ones((xTrain.shape[0], 1))
    xTrain = np.concatenate((intercept, xTrain), axis=1)
    yTrain[yTrain > 1] = 0
    theta = np.zeros(xTrain.shape[1])
    for i in range(100):
        z = np.dot(xTrain, theta)
        h = sigmoid(z)
        lr = 0.01
        gradient = loss(h, yTrain)
        theta = theta - lr*gradient
    return theta

def logRegrNEWRegrPredict(xTrain, yTrain,xTest ):
    theta = logRegParamEstimates(xTrain, yTrain)
    intercept = np.ones((xTest.shape[0], 1))
    xTest = np.concatenate((intercept, xTest), axis=1)
    sig = sigmoid(np.dot(xTest, theta))
    y_pred1 = []
    for x in sig:
        print(x)
        if x > 0.5:
            y_pred1.append(3)
        else:
            y_pred1.append(1)
    
    return y_pred1


In [15]:
y_pred1 = logRegrNEWRegrPredict(x_train, y_train,x_test)
print('Accuracy on test set: '+str(accuracy_score(y_test,y_pred1)))
print(classification_report(y_test,y_pred1))#text report showing the main classification metrics




0.07941491499064234
0.14401630327980314
0.13658670929196423
0.12360927828779202
0.1718929088110426
0.031557077824981385
0.09187892023408127
0.17961732165833164
0.15524415948979306
0.20591419027398608
Accuracy on test set: 0.5
              precision    recall  f1-score   support

           1       0.50      1.00      0.67         5
           3       0.00      0.00      0.00         5

    accuracy                           0.50        10
   macro avg       0.25      0.50      0.33        10
weighted avg       0.25      0.50      0.33        10



  _warn_prf(average, modifier, msg_start, len(result))


In [8]:
print(y_pred1)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
