# Quick Algorithm - ML Pipeline Challenge

https://gitlab.com/qa-public/qa-streaming-pipeline-challenge#user-interaction

## Dataframe Creation

### Importing necessary modules

In [2]:
import requests
import pandas as pd

### Finding the 'end-point' in database & Framing the Data

The database holds a large amount of data, so the code first needs to figure out where the list of employees ends. Rather than doing the entire project in one function, it is better practice to compartmentalize the flow, which makes future database updates easier to handle.

I will not use any fancy coding here, since this is merely environment setup.

For a shorter runtime, I will intentionally increment the 'search grid' by 1000, then manually check where the border lies. In a real application, one would be better off with an increment of 1, paired with more time-optimized code.

In [3]:
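def find(port):
    # Return the first multiple of 1000 whose page comes back empty.
    if not isinstance(port, str):
        raise TypeError('port must be a URL given as a string, with no page specified.')

    counter = 0

    # An empty JSON list means we have stepped past the last page of data.
    while requests.get(port + '?page={}'.format(counter)).json() != []:
        counter += 1000

    return counter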

In [4]:
find('http://localhost:5000/api/v1/data')

100000

The data therefore ends somewhere between pages 99000 and 100000. Indeed, a manual check showed the boundary to lie between pages 99999 and 100000.
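The manual check could also be automated. The following is a minimal sketch, not part of the original notebook: it keeps the coarse scan in steps of 1000 and then runs a binary search inside the last bracket. The name `first_empty_page` and the `is_empty` callable are hypothetical; against the real API, `is_empty` would wrap the same `requests.get(...).json() == []` test that `find` uses. The sketch assumes that once a page is empty, every later page is empty too.

```python
def first_empty_page(is_empty, step=1000):
    """Return the smallest page index for which is_empty(page) is True.

    Assumption: pages are contiguous, so every page after the first
    empty one is empty as well.
    """
    # Phase 1: coarse scan in jumps of `step` to bracket the boundary.
    low, high = 0, step
    while not is_empty(high):
        low, high = high, high + step

    # Phase 2: binary search in [low, high]; is_empty(high) is known to be True.
    while low < high:
        mid = (low + high) // 2
        if is_empty(mid):
            high = mid          # boundary is at mid or earlier
        else:
            low = mid + 1       # boundary is strictly after mid
    return low


# Stand-in for the API (hypothetical): pages 0..99999 hold data.
def fake_is_empty(page):
    return page > 99999


print(first_empty_page(fake_is_empty))  # 100000
```

With a step of 1000 this needs about 100 requests for the coarse scan plus roughly 10 for the binary search, instead of one request per page.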

Now we build a function that retrieves the data available from the database as a long list of JSON records, which we then convert to a pandas dataframe to make manipulation easier. The function is designed to fetch data up to an arbitrary page, but in this case we want all of it, so we select 100000 as our limit. Again, this piece of code prioritizes simplicity and readability over runtime, and since it has to go through 100000 iterations, each with a request and an append, it will take some time to complete.

In [5]:
def obtain_list(port, limit):
    # Fetch pages 0 .. limit-1 and return them as a list of pages.
    if not isinstance(port, str):
        raise TypeError('port must be a URL given as a string, with no page specified.')

    if not isinstance(limit, int) or limit < 0:
        raise ValueError('limit must be a non-negative integer.')

    data = []

    # One request per page; each response is a list of records.
    for i in range(limit):
        data.append(requests.get(port + '?page={}'.format(i)).json())

    return data
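
As a possible variant, the search for the end-point and the download could be merged into one loop that stops at the first empty page. This is only a sketch under the same assumption `find` relies on (an empty list marks the end of the data). The names `obtain_until_empty`, `get_page` and `fake_get_page` are hypothetical and do not appear in the original notebook; `get_page` is passed in so the logic can be tried without a running server.

```python
def obtain_until_empty(get_page, max_pages=None):
    """Collect get_page(0), get_page(1), ... until a page comes back empty.

    max_pages is an optional safety cap on the number of requests.
    """
    data = []
    page = 0
    while max_pages is None or page < max_pages:
        rows = get_page(page)
        if not rows:            # an empty list marks the end of the data
            break
        data.append(rows)
        page += 1
    return data


# Stand-in for the real API (hypothetical): 3 pages of 2 records each.
fake_pages = [
    [{'id': 0}, {'id': 1}],
    [{'id': 2}, {'id': 3}],
    [{'id': 4}, {'id': 5}],
]


def fake_get_page(page):
    return fake_pages[page] if page < len(fake_pages) else []


pages = obtain_until_empty(fake_get_page)
print(len(pages))  # 3
```

Against the real endpoint, `get_page` could be something like `lambda i: requests.get(port, params={'page': i}).json()`.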

Building the list this way gives us a nested list of shape (number of pages, 10). What we want is a flat list of length (number of pages × 10), so we resort to a very simple method of flattening it. Converting to a NumPy array and then flattening should give the same result.
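
For reference, here is a small sketch of an alternative way to flatten, using toy data rather than the real API response. `sum(pages, [])`, the idiom used in this notebook, builds a new list at every addition, so its cost grows quadratically with the number of pages; `itertools.chain.from_iterable` produces the same result in linear time. The variable names are illustrative only.

```python
from itertools import chain

# Toy stand-in for the list of pages returned by the API.
pages = [[{'id': 0}, {'id': 1}], [{'id': 2}], [{'id': 3}, {'id': 4}]]

# Concatenate all pages into one flat list of records.
flat = list(chain.from_iterable(pages))

print([row['id'] for row in flat])  # [0, 1, 2, 3, 4]
print(flat == sum(pages, []))       # True
```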

In [6]:
dataset = sum(obtain_list('http://localhost:5000/api/v1/data',100), [])

dataset

[{'competence': 2.1561929505672226,
  'id': 0,
  'network_ability': -0.13850255563080713,
  'promoted': 1},
 {'competence': 0.1503209340782082,
  'id': 1,
  'network_ability': 0.6802618180031226,
  'promoted': 0},
 {'competence': 0.439139006444416,
  'id': 2,
  'network_ability': 1.0768018489787832,
  'promoted': 1},
 {'competence': -0.5154817414166092,
  'id': 3,
  'network_ability': -0.9299164880823638,
  'promoted': 0},
 {'competence': -0.7928276291200661,
  'id': 4,
  'network_ability': 0.23833774107081732,
  'promoted': 0},
 {'competence': -0.026650045224544557,
  'id': 5,
  'network_ability': -0.6416174585974274,
  'promoted': 0},
 {'competence': -0.480352986575169,
  'id': 6,
  'network_ability': 1.5730391776913912,
  'promoted': 1},
 {'competence': 0.37663500837867614,
  'id': 7,
  'network_ability': -1.118131621983561,
  'promoted': 0},
 {'competence': 0.34780430681736957,
  'id': 8,
  'network_ability': 0.5897098603358912,
  'promoted': 0},
 {'competence': 1.3518261083995122,

In [15]:
df = pd.DataFrame(dataset)

df

| | competence | id | network_ability | promoted |
|---|---|---|---|---|
| 0 | 2.156193 | 0 | -0.138503 | 1 |
| 1 | 0.150321 | 1 | 0.680262 | 0 |
| 2 | 0.439139 | 2 | 1.076802 | 1 |
| 3 | -0.515482 | 3 | -0.929916 | 0 |
| 4 | -0.792828 | 4 | 0.238338 | 0 |
| ... | ... | ... | ... | ... |
| 995 | 0.366982 | 995 | -1.574646 | 0 |
| 996 | -1.100337 | 996 | -2.032756 | 0 |
| 997 | -1.320860 | 997 | 0.157454 | 0 |
| 998 | 2.411077 | 998 | -0.390600 | 1 |


Now we have a pandas dataframe, which is easy to analyse with machine learning modules.

## Machine Learning Modelling - Part 1 : Classification Algorithm

Here, the objective is to predict the binary variable `promoted` from the other variables.

### Importing Modules and Pre-processing

#### Importing Modules

This time I will be using the `sklearn` module as a basis. I chose sklearn because it covers the machine learning process from beginning to end. It supports both pre-processing and pipelines, which helps me avoid confusion and potential errors.

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#### Pre-processing - Slicing the 'answer' column

We have to drop the `promoted` column from the dataframe, as it holds the 'solutions' we are trying to predict. At the same time, we drop the `id` column, since it duplicates the index of the dataframe.

In [17]:
X = df.drop(["promoted", "id"], axis=1)

X

| | competence | network_ability |
|---|---|---|
| 0 | 2.156193 | -0.138503 |
| 1 | 0.150321 | 0.680262 |
| 2 | 0.439139 | 1.076802 |
| 3 | -0.515482 | -0.929916 |
| 4 | -0.792828 | 0.238338 |
| ... | ... | ... |
| 995 | 0.366982 | -1.574646 |
| 996 | -1.100337 | -2.032756 |
| 997 | -1.320860 | 0.157454 |
| 998 | 2.411077 | -0.390600 |


In [19]:
X['competence']

0      2.156193
1      0.150321
2      0.439139
3     -0.515482
4     -0.792828
         ...   
995    0.366982
996   -1.100337
997   -1.320860
998    2.411077
999   -1.086212
Name: competence, Length: 1000, dtype: float64

In [8]:
y = df["promoted"]

y

0        1
1        0
2        1
3        0
4        0
        ..
99995    0
99996    0
99997    0
99998    0
99999    0
Name: promoted, Length: 100000, dtype: int64

We are left with 2 explanatory variables, `competence` and `network_ability`, to determine whether an employee is promoted or not.

#### Pre-processing - Train Set and Test Set Splitting

I am using the sklearn module to randomly split the dataframe into a training set and a test set. I have used a common test-set ratio of 20%, and I have fixed `random_state` so that the split is reproducible.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)

### Pipeline Structuring

I will construct the pipeline in the following way.

1. Standardize the data with `StandardScaler` (zero mean and unit variance per feature), so it is easier to deal with.

2. Apply the classifiers (I will be comparing Ridge Classifier, Logistic Regression, SVC, KNN, and Decision Tree)

3. Choose the most suitable classifier for the dataset.


We will be omitting PCA, as there are only 2 explanatory variables, and it is quite unlikely that one of the variables is unnecessary.

In [12]:
pipeline_Ridge = Pipeline([('scale_1', StandardScaler()), ('class_Ridge', RidgeClassifier(random_state=1))])

pipeline_LR = Pipeline([('scale_2', StandardScaler()), ('class_LR', LogisticRegression(random_state=1))])

pipeline_SVC = Pipeline([('scale_3', StandardScaler()), ('class_SVC', SVC(random_state=1))])

pipeline_KNN = Pipeline([('scale_1', StandardScaler()), ('class_KNN', KNeighborsClassifier())])

pipeline_DT = Pipeline([('scale_1', StandardScaler()), ('class_DT', DecisionTreeClassifier(random_state=1))])


pipelines = [pipeline_Ridge, pipeline_LR, pipeline_SVC, pipeline_KNN, pipeline_DT]

Now that the pipelines have been defined, we will proceed to find the one with the highest accuracy.

It is good practice to keep a dictionary of the classifier names, which makes the later code more convenient.

In [13]:
pipes = {0: 'Ridge Classification', 1: 'Logistic Regression', 2: 'Support Vector Classifier', 3: 'K-Nearest Neighbors CLassification', 4: 'Decision Tree Classification'}
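
One possible alternative, sketched here with hypothetical names, is to key the dictionary by the display name and store the pipeline itself as the value, so the names and the pipelines cannot drift out of alignment. The synthetic data from `make_classification` is only a stand-in for the real `X_train`/`y_train`, and the accuracy printed here is measured on the same data used for fitting, purely to show the loop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Name -> pipeline mapping (hypothetical variable name, two models for brevity).
named_pipelines = {
    'Logistic Regression': Pipeline([
        ('scale', StandardScaler()),
        ('clf', LogisticRegression(random_state=1)),
    ]),
    'K-Nearest Neighbors Classification': Pipeline([
        ('scale', StandardScaler()),
        ('clf', KNeighborsClassifier()),
    ]),
}

# Synthetic two-feature data, standing in for the employee dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=1
)

for name, pipe in named_pipelines.items():
    pipe.fit(X_demo, y_demo)
    # Training-set accuracy, shown only to illustrate iterating by name.
    print('{} has an accuracy of: {:.2f}%.'.format(name, pipe.score(X_demo, y_demo) * 100))
```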

### Fitting The Model & Evaluation

Now we can simply fit the `pipelines`.

In [14]:
for item in pipelines:
    item.fit(X_train, y_train)

Now it is time to evaluate how well the models did, so we have each model predict on the test set. The one with the highest accuracy is chosen.

In [19]:
for i, j in enumerate(pipelines):
    print('{} has an accuracy of: {}%.'.format(pipes[i], j.score(X_test,y_test)*100))

Ridge Classification has an accuracy of: 81.635%.
Logistic Regression has an accuracy of: 81.47999999999999%.
Support Vector Classifier has an accuracy of: 94.17999999999999%.
K-Nearest Neighbors CLassification has an accuracy of: 94.69999999999999%.
Decision Tree Classification has an accuracy of: 90.225%.


The models fall into two groups: the two linear classifiers (Ridge and Logistic Regression) reach about 81.5%, while SVC, KNN and the Decision Tree reach between 90% and 95%. **K-Nearest Neighbors** is chosen as the best pipeline out of the 5 models compared, with an accuracy of **94.7%**.

Hyper-parameter tuning can be done through the pipeline with the following steps (for the most part, a grid search is enough):

1. Import `GridSearchCV`.
2. Create a pipe with the classifier.
3. Create a parameter grid, consulting the hyperparameters available for each classifier.
4. Fit the pipe with each parameter combination on the grid, and assess the accuracy.
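
The steps above can be sketched as follows. This is an illustration only, not something run in the original notebook: the data is synthetic, the grid values are arbitrary choices, and the names `pipe`, `param_grid` and `search` are hypothetical. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data, standing in for the employee dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Step 2: a pipe with scaler and classifier.
pipe = Pipeline([('scale', StandardScaler()), ('clf', KNeighborsClassifier())])

# Step 3: parameter grid (values chosen arbitrarily for this sketch).
param_grid = {
    'clf__n_neighbors': [3, 5, 11],
    'clf__weights': ['uniform', 'distance'],
}

# Step 4: fit every combination with 5-fold cross-validation on the training set.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print('held-out accuracy: {:.3f}'.format(search.score(X_te, y_te)))
```

Because the grid is scored by cross-validation on the training set only, the test set stays untouched until the final `score` call.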

In this case we omit this step for the following reasons:

- To keep the work simple.
- The score without hyperparameter tuning is already excellent.
- To avoid the risk of overfitting.

That said, when accuracy demands are high, there is time to keep the code clean, and overfitting can be guarded against, tuning is definitely possible, although it is uncertain how much it would contribute to the accuracy and the overall effectiveness.

## Machine Learning Modelling - Part 2 : Endogenous Variable