# Logistic Regression Coding Challenge

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## The Dataset

The dataset we will be using is the Iris dataset, which is commonly used in learning classification. The Iris dataset is a multivariate dataset where each class refers to a type of iris plant. This dataset is free and is publicly available at the UCI Machine Learning Repository.

<div>
    <img src=https://upload.wikimedia.org/wikipedia/commons/4/49/Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg width=300px>
</div>

This dataset contains a set of 150 records with five attributes - Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. Species is the type of iris plant we will be classifying.

Let's import the data to see what we are dealing with.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/iris.csv')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Question 1 - Data pre-processing

Write a function to pre-process the data so that we can run it through the classifier. The function should:
* Split the data into features and labels.
* Standardise the features using sklearn's `StandardScaler`.
* Split the data into 75% training and 25% testing data.
    * Use the `train_test_split` method from `sklearn` to do this.
    * Set `random_state` to 42 for this internal method.
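
To make the standardisation step concrete, here is a minimal sketch of the arithmetic `StandardScaler` performs: each column has its mean subtracted and is then divided by its population standard deviation. The matrix `X_toy` is made up for illustration and is not taken from the Iris data.

```python
import numpy as np

# Toy feature matrix (made-up values, not the Iris data).
X_toy = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 2.9],
                  [5.9, 3.2]])

# Column-wise mean and population standard deviation (ddof=0).
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)

# Standardised features: every column now has mean 0 and standard deviation 1.
X_toy_scaled = (X_toy - col_mean) / col_std
print(X_toy_scaled.mean(axis=0).round(6), X_toy_scaled.std(axis=0).round(6))
```

After this transformation the features are on a common scale, so no single measurement dominates the fit simply because of its units.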

_**Function Specifications:**_
* Should take a dataframe as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
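
The split and the return shape described above can be sketched without sklearn. The helper `toy_split` below is hypothetical: it shuffles row indices with NumPy and holds out 25% of them, rounding the test share up. It only illustrates the idea and the tuple layout, and it will not select the same rows as `train_test_split(..., random_state=42)`.

```python
import math
import numpy as np

def toy_split(X, y, test_size=0.25, seed=42):
    """Hypothetical stand-in for train_test_split, for illustration only."""
    n = len(X)
    n_test = math.ceil(test_size * n)                    # round the test share up
    order = np.random.default_rng(seed).permutation(n)   # shuffled row indices
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

# Made-up data with the same number of rows as the Iris dataset.
X_demo = np.arange(300, dtype=float).reshape(150, 2)
y_demo = np.arange(150) % 3

(X_tr, y_tr), (X_te, y_te) = toy_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

With 150 rows this gives 112 training rows and 38 test rows.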

In [87]:
def data_preprocess(df):
    
    dataset = df.copy()
    # Separate the features from the labels
    X = dataset.drop('species', axis=1)
    y = dataset['species'].values
    # Standardise the features to zero mean and unit variance
    scaler = preprocessing.StandardScaler()
    X_transform = scaler.fit_transform(X)
    # Hold out 25% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.25, random_state=42)
    
    return (X_train, y_train), (X_test, y_test)

In [88]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:2])
print(y_train[:2])
print(X_test[:2])
print(y_test[:2])

[[-1.02184904  1.26346019 -1.3412724  -1.31297673]
 [-0.7795133   2.42047502 -1.2844067  -1.4444497 ]]
['Iris-setosa' 'Iris-setosa']
[[ 0.31099753 -0.58776353  0.53529583  0.00175297]
 [-0.17367395  1.72626612 -1.17067529 -1.18150376]]
['Iris-versicolor' 'Iris-setosa']


_**Expected Outputs:**_

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:2])
print(y_train[:2])
print(X_test[:2])
print(y_test[:2])
```

```
[[-1.02184904  1.26346019 -1.3412724  -1.31297673]
 [-0.7795133   2.42047502 -1.2844067  -1.4444497 ]]
['Iris-setosa' 'Iris-setosa']
[[ 0.31099753 -0.58776353  0.53529583  0.00175297]
 [-0.17367395  1.72626612 -1.17067529 -1.18150376]]
['Iris-versicolor' 'Iris-setosa']
```

## Question 2 - Training the Model

Now that we have formatted our data, we can fit a model using sklearn's `LogisticRegression` class with its default parameters. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* The returned model should be fitted to the data.

In [41]:
def train_model(X_train, y_train):
    logic = LogisticRegression()
    results = logic.fit(X_train, y_train)
    return results

In [42]:
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)

-1.5171756431068295
[[-0.71766856  1.32377179 -1.60277458 -1.41952784]
 [ 0.16218186 -1.28166738  0.51973042 -0.65811392]
 [ 0.08010274 -0.1281464   1.71205143  2.34828867]]


In [43]:
np.array_equal(np.round(lm.intercept_, 8), np.array([-1.51717564, -0.85832387, -2.36787217]))

True

_**Expected Outputs:**_

```python
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)
```
```
-1.51717564311
[[-0.71766856  1.32377179 -1.60277458 -1.41952784]
 [ 0.16218186 -1.28166738  0.51973042 -0.65811392]
 [ 0.08010274 -0.1281464   1.71205143  2.34828867]]
```
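
As a rough check on what these numbers mean, the sketch below scores the first two test rows by hand. It assumes that, with more than two classes, `predict` returns the class with the largest linear score `X @ coef_.T + intercept_`, and that the rows of `coef_` follow the alphabetical order of the class names. The coefficients, the intercepts (rounded to eight decimals) and the two test rows are copied from the outputs printed earlier in this notebook.

```python
import numpy as np

# Assumed class order: alphabetical, one row of coefficients per class.
classes = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
coef = np.array([[-0.71766856,  1.32377179, -1.60277458, -1.41952784],
                 [ 0.16218186, -1.28166738,  0.51973042, -0.65811392],
                 [ 0.08010274, -0.1281464 ,  1.71205143,  2.34828867]])
intercept = np.array([-1.51717564, -0.85832387, -2.36787217])

# First two standardised test rows, as printed by X_test[:2].
X_first_two = np.array([[ 0.31099753, -0.58776353,  0.53529583,  0.00175297],
                        [-0.17367395,  1.72626612, -1.17067529, -1.18150376]])

# One linear score per (row, class); the largest score in each row wins.
scores_by_class = X_first_two @ coef.T + intercept
predicted = classes[scores_by_class.argmax(axis=1)]
print(predicted)
```

Both rows come out as `Iris-versicolor` and `Iris-setosa`, the same labels printed for `y_test[:2]` in Question 1.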

## Testing the model

### Question 3.1

Now that you have trained your model, let's see how well it does on the test set. Write a function which returns the accuracy of your trained model when tested with the test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a `float` of the accuracy of the model. This number should be between zero and one.
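
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the arrays `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the Iris data):

```python
import numpy as np

y_true_toy = np.array(['a', 'b', 'b', 'c', 'a'])
y_pred_toy = np.array(['a', 'b', 'c', 'c', 'a'])

# Element-wise comparison gives booleans; their mean is the share of correct predictions.
toy_accuracy = float(np.mean(y_true_toy == y_pred_toy))
print(toy_accuracy)  # 4 of the 5 labels match
```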

In [80]:
def calculate_accuracy(lm, X_test, y_test):
    
    y_pred = lm.predict(X_test)
    results = accuracy_score(y_test, y_pred)
    
    return results

In [81]:
print(calculate_accuracy(lm,X_test,y_test))

0.947368421053


_**Expected Outputs:**_
    
```python
print(calculate_accuracy(lm,X_test,y_test))
```
```
0.947368421053
```

### Question 3.2

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Confusion matrices give us more information on where our model is going wrong: they show which classes are being mistaken for one another, which in the binary case corresponds to the Type I and Type II errors. Write a function which returns the confusion matrix of your test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a confusion matrix.

_**Hint:**_ You don't need to do this manually; sklearn has a confusion matrix function.
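
For intuition only, the counting behind a confusion matrix can be sketched by hand. The labels below are made up, and the layout assumes rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

labels = ['setosa', 'versicolor', 'virginica']
y_true_toy = ['setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred_toy = ['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor', 'virginica', 'virginica']

# Map each label to its row/column position.
position = {label: k for k, label in enumerate(labels)}

# C[i, j] counts samples whose actual class is i and predicted class is j.
C_toy = np.zeros((len(labels), len(labels)), dtype=int)
for actual, predicted in zip(y_true_toy, y_pred_toy):
    C_toy[position[actual], position[predicted]] += 1

print(C_toy)
```

Correct predictions land on the diagonal; the single off-diagonal entry is the versicolor sample that was predicted as virginica.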

In [48]:
def conf_matrix(lm, X_test, y_test):
    
    y_pred = lm.predict(X_test)
    results = confusion_matrix(y_test, y_pred)
    
    return results

In [49]:
print(conf_matrix(lm,X_test,y_test))

[[15  0  0]
 [ 0  9  2]
 [ 0  0 12]]


_**Expected Outputs:**_
    
```python
print(conf_matrix(lm,X_test,y_test))
```
```
[[15  0  0]
 [ 0  9  2]
 [ 0  0 12]]
 ```

### Question 3.3

Write a function which calculates the _multi-class_ Accuracy, Precision, Recall and F1 scores. Recall from your trains that the precision, recall and F1 scores are calculated as

$$
{\rm Precision} = \frac{\rm TP}{\rm TP+FP}
$$

$$
{\rm Recall} = \frac{\rm TP}{\rm TP+FN}
$$

$$
{\rm F1} = 2 \times \frac{\rm Recall \times Precision}{\rm Recall + Precision}
$$

<div>
<img src=https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/confusion_matrix.png>
</div>

As per the image above, these scores can be calculated from the elements of either a single column (precision on "predicted true") or a single row (recall on "actual true").
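
As a quick worked example of the binary formulas, the sketch below plugs in made-up counts. The helper `binary_scores` is hypothetical and exists only for illustration.

```python
def binary_scores(tp, fp, fn):
    """Precision, recall and F1 from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f = binary_scores(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f, 4))
```

Here the precision is 8/10, the recall is 8/12, and the F1 score sits between the two at 16/22.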

Let's generalize this notion to multiple columns. If $i$ represents a row index, and $j$ represents a column index of a confusion matrix $C$ then we can write the recall for the $i^{\rm th}$ row as 
$$
R_i = \frac{ C_{ii} }{ \sum_j C_{ij} },
$$
the precision of the $j^{\rm th}$ column as
$$
P_j = \frac{ C_{jj} }{ \sum_i C_{ij} },
$$
and the F1 score as
$$
F_i = 2 \times \frac{P_i \times R_i}{P_i + R_i}.
$$

Using these, calculate the _average_ recall, precision, and F1 scores for our $3\times 3$ confusion matrix. As an example, the average recall is $R=\frac{1}{N}\sum_{i=1}^{N} R_i$, where $N$ is the number of rows.
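
A sketch of these formulas applied to the test-set confusion matrix printed in Question 3.2, assuming rows are the actual classes and columns are the predicted classes. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from Question 3.2.
C = np.array([[15, 0,  0],
              [ 0, 9,  2],
              [ 0, 0, 12]])

diag = np.diag(C)                              # correctly classified counts, C_ii
recall_per_class = diag / C.sum(axis=1)        # R_i: diagonal over row sums
precision_per_class = diag / C.sum(axis=0)     # P_j: diagonal over column sums
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

accuracy_from_C = diag.sum() / C.sum()
avg_precision = precision_per_class.mean()
avg_recall = recall_per_class.mean()
avg_f1 = f1_per_class.mean()

print('Accuracy: %f' % accuracy_from_C)
print('Precision: %f' % avg_precision)
print('Recall: %f' % avg_recall)
print('F1 score: %f' % avg_f1)
```

These averages agree with the expected outputs listed at the end of this question (0.947368, 0.952381, 0.939394 and 0.941026).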

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a tuple in the form (`Accuracy`, `Precision`, `Recall`, `F1-Score`)

_**HINT:**_
The autograder tests the value of each of these metrics separately. If you only know how to calculate one metric, then return a tuple with that metric, and zeros for the other metrics. For example, if you only know how to calculate accuracy, then return a tuple with `(accuracy, 0, 0, 0)`. If you can only calculate the accuracy and recall, then return `(accuracy, 0, recall, 0)`.

In [78]:
def scores(lm, X_test, y_test):
    from sklearn.metrics import precision_score,recall_score,f1_score

    y_pred = lm.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    # average='macro' takes the unweighted mean of the per-class scores
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    return (accuracy,precision,recall,f1)

In [82]:
(accuracy, precision, recall, f1) = scores(lm, X_test, y_test)    

print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)

Accuracy: 0.947368
Precision: 0.952381
Recall: 0.939394
F1 score: 0.941026


_**Expected Outputs:**_
```python
(accuracy, precision, recall, f1) = scores(lm,X_test,y_test)
    
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
```
```
Accuracy: 0.947368
Precision: 0.952381
Recall: 0.939394
F1 score: 0.941026
```