## Exercise: Logistic Regression

This notebook compares another class that both cuML and Scikit-learn provide: `LogisticRegression`. In its basic form, logistic regression models the probability of a class or event occurring, given a set of input variables.
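To make "models the probability of a class" concrete, here is a minimal pure-Python sketch of the prediction step: a weighted sum of the features passed through the sigmoid function. The weights and bias are hypothetical values chosen only for illustration; they are not fitted to the dataset used in this notebook, and this is not how either library implements the model internally.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical parameters for a two-feature model (illustration only).
example_weights = [1.5, -0.5]
example_bias = 0.0

p_origin = predict_proba([0.0, 0.0], example_weights, example_bias)
p_other = predict_proba([2.0, 1.0], example_weights, example_bias)
```

A sample at the origin gets a weighted sum of zero, so its predicted probability sits exactly on the decision boundary; the second sample has a positive weighted sum and is therefore assigned to the positive class.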

We also use this as an example of how cuML fits into other GPU-centric workflows, this time based on [CuPy](https://cupy.chainer.org), a GPU-centric, NumPy-like library for array manipulation.

Thanks to the [CUDA Array Interface](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html), cuML is compatible with multiple GPU memory libraries that conform to the spec, and can therefore use objects from libraries such as CuPy or PyTorch without additional memory copies!
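As a rough illustration of what "conforming to the spec" means: an object advertises its GPU buffer through a `__cuda_array_interface__` attribute, a dictionary with keys such as `shape`, `typestr`, `data`, and `version`. The sketch below inspects that attribute without needing a GPU. `FakeDeviceArray` and `cuda_interface_summary` are hypothetical names invented for this example, and the pointer value is made up.

```python
def cuda_interface_summary(obj):
    """Return a few interface fields, or None if obj does not expose the interface."""
    iface = getattr(obj, "__cuda_array_interface__", None)
    if iface is None:
        return None
    return {
        "shape": iface["shape"],
        "typestr": iface["typestr"],
        "device_pointer": iface["data"][0],  # data is (pointer, read_only)
    }


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array; it holds no real device memory."""

    __cuda_array_interface__ = {
        "shape": (8000, 2),
        "typestr": "<f8",            # little-endian 64-bit float
        "data": (0x1000, False),     # made-up pointer, not read-only
        "version": 3,
    }


summary = cuda_interface_summary(FakeDeviceArray())
host_summary = cuda_interface_summary([1, 2, 3])  # a plain list has no interface
```

On a machine with CuPy and a GPU, passing a real CuPy array to the same helper should return the array's actual shape, type string, and device pointer.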

Let's begin by importing the libraries we need:

In [None]:
import pandas as pd
# Let's use CuPy in much the same way we use NumPy
import cupy as cp

from sklearn import metrics, datasets
from sklearn.linear_model import LogisticRegression as skLogistic
from sklearn.preprocessing import binarize
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

cm_bright = ListedColormap(['#FF0000', '#0000FF'])

Once again, let's use Scikit-learn to create a dataset:

In [None]:
a = datasets.make_classification(10000, n_features=2, n_informative=2, n_redundant=0, 
                                     n_clusters_per_class=1, class_sep=0.5, random_state=1485)

Now let's create our `X` and `y` arrays in CuPy:

In [None]:
X = cp.array(a[0], order='F') # the API of CuPy is almost identical to NumPy
y = cp.array(a[1], order='F')

Let's see what the dataset looks like:

In [None]:
plt.scatter(cp.asnumpy(X[:,0]), cp.asnumpy(X[:,1]), c=[cm_bright.colors[i] for i in cp.asnumpy(y)], 
            alpha=0.1);

Now let's split our dataset into training and test sets in a simple manner:

In [None]:
# Split the data into a training and test set using NumPy-like syntax
X_train = X[:8000, :].copy(order='F')
X_test = X[-2000:, :].copy(order='F')
y_train = y[:8000]
y_test = y[8000:10000]

Note that the resulting objects are still CuPy arrays on the GPU:

In [None]:
X_train.__class__

## Exercise: Fit the cuML and Scikit-learn `LogisticRegression` objects and compare them using parameters that are as similar as possible

* Hint 1: The **default values** of parameters in cuML are **the same** as the default values in Scikit-learn most of the time, so we recommend leaving all parameters except `solver` at their defaults


* Hint 2: Remember that the **solver can differ significantly between the libraries**, so look into the solvers offered by both libraries to make them match


* Hint 3: Although the CuPy API closely mirrors NumPy, Scikit-learn **cannot** accept CuPy objects for many of its methods, since it expects NumPy objects whose memory is on the CPU (host), not on the GPU (device)

For convenience, the notebook offers a few cells to organize your work.

### 1. Fit Scikit-learn LogisticRegression and show its accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# useful methods: cp.asnumpy(cupy_array) converts a CuPy array to a NumPy array

### 2. Fit cuML LogisticRegression and show its accuracy

* Hint 1: Look at the data types expected by cuML methods: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.LogisticRegression.fit 
   One of the input vectors might not be of the expected data type!



* Hint 2: As mentioned above, cuML has native support for CuPy objects

In [None]:
from cuml import LogisticRegression as cuLogistic
import numpy as np

In [None]:
# useful methods: cupy_array.astype(np_dtype) converts an array to the data type np_dtype, where np_dtype can be something like np.float32, np.float64, etc.
# useful methods: cudf_series.to_array() converts a cuDF Series to a NumPy array
# useful methods: cp.asnumpy(cupy_array) converts a CuPy array to a NumPy array

**Expected accuracies for apples to apples comparison: 0.8025 vs 0.8695**
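For reference, the accuracy figures quoted above are the fraction of test samples whose predicted label matches the true label. The following is a plain-Python sketch of that definition, not the Scikit-learn implementation; the `accuracy` helper and the toy labels are made up for illustration and are unrelated to the dataset in this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (illustration only): 4 of the 5 predictions match.
labels = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]
score = accuracy(labels, predictions)
```

Since both arguments are iterated on the host, labels that live on the GPU would first need to be copied back (for example with `cp.asnumpy`) before being passed to a helper like this one.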

Additional Exercise: Play with the different parameters, particularly the different Scikit-learn solvers to see how they differ in behavior even in the same library!