# Classification

Classification is the task of assigning a given data point to a class (category). It is often described as finding a function that best maps an input X to an output class Y. To learn that function, the classifier extracts patterns from labeled training data. In this project, classification is one of two steps (classification and ranking) used to decide which students are approved or rejected for participation in HackathonUSP.

# Regression

Unlike classification, regression predicts a continuous output: given an input X, a regression algorithm predicts a real-valued output. It is also a supervised technique, and it is more suitable for the purpose of this project. With a regression model, it is possible to predict a real number representing each person's score, which can then be used to create the ranking.
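
As a minimal sketch of how a predicted score can become a ranking, the snippet below fits a linear regression and sorts candidates by predicted score. The feature values, the scores, and all variable names are invented for illustration and are not taken from the registration data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data, invented for illustration only:
# columns are [NHackathon, GraduationYear]; targets are made-up scores.
X_train = np.array([[0, 1], [2, 2], [5, 2], [1, 3]], dtype=float)
y_train = np.array([1.0, 3.0, 6.0, 2.5])

regressor = LinearRegression().fit(X_train, y_train)

# Three hypothetical candidates to be ranked.
candidates = np.array([[3, 2], [0, 1], [4, 3]], dtype=float)
predicted_scores = regressor.predict(candidates)

# Sort candidate indices from the highest to the lowest predicted score.
ranking = np.argsort(-predicted_scores)
for position, idx in enumerate(ranking, start=1):
    print(position, idx, round(float(predicted_scores[idx]), 2))
```

The ranking depends only on the order of the predicted scores, so two models with different score scales can still produce the same ranking.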

## Logistic Regression

Despite its name, Logistic Regression is used here as a classification algorithm: it uses a statistical model based on the logistic function to predict the value of a categorical dependent variable. The idea is not to rank the candidates' registrations, only to classify whether each candidate was selected for HackathonUSP or not. Binary Logistic Regression is used because, in a given edition, a student can only be IN or OUT of HackathonUSP.
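
To make the role of the logistic function concrete, here is a small sketch of how a linear combination of features is turned into a probability and then into a binary decision. The weights, bias, feature values, and the 0.5 threshold are illustrative assumptions, not values learned from the registration data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one candidate under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, ("IN" if p >= threshold else "OUT")

# Hypothetical weights and features, chosen only for illustration.
probability, label = predict_label([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability, label)
```

Training a logistic regression amounts to choosing the weights and bias so that these probabilities agree with the observed labels as well as possible.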

In [98]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix

In [99]:
registrations = pd.read_csv('data/AllRegistrationsFile.csv', delimiter=',')
registrations

Unnamed: 0,Datetime,Edition,id,IsSelected,Gender,College,StudentProgram,GraduationYear,NHackathon,Role,HasTeam,TeamMembers,TeamId,TeamSize
0,09/08/2016 19:55:46,2016.1,46,S,F,IME,Bacharelado em Ciência da Computação,1,0,Desenvolvedor,Não,"46, 56, 309, 483",54,4
1,09/08/2016 20:24:09,2016.1,309,S,M,IME,Bacharelado em Ciência da Computação,1,0,Nenhuma,Não,"46, 56, 309, 483",54,4
2,09/08/2016 22:57:08,2016.1,56,S,M,IME,Bacharelado em Ciência da Computação,1,0,Desenvolvedor,Sim,"46, 56, 309, 483",54,4
3,18/08/2016 22:49:37,2016.1,47,S,F,Não USP,,,0,Desenvolvedor,Não,,0,1
4,19/08/2016 21:47:33,2016.1,35,S,M,Não USP,,,0,Desenvolvedor,Não,,0,1
5,23/08/2016 09:44:28,2016.1,483,S,M,IME,Matemática Aplicada,1,0,Desenvolvedor,Sim,"46, 56, 309, 483",54,4
6,24/07/2016 20:53:04,2016.1,227,S,M,IME,Bacharelado em Ciência da Computação,2,2,Desenvolvedor,Sim,"227, 303, 339, 402",31,4
7,24/07/2016 21:01:17,2016.1,342,S,M,IME,Bacharelado em Ciência da Computação,2,5,Desenvolvedor,Sim,"332, 342, 511, 513",47,4
8,24/07/2016 21:01:24,2016.1,438,S,M,Poli,Engenharia da Computação,5,0,Desenvolvedor,Não,,0,1
9,24/07/2016 21:01:25,2016.1,513,S,M,IME,Bacharelado em Ciência da Computação,3,1,Desenvolvedor,Sim,"332, 342, 511, 513",47,4


In [100]:
# Using LabelEncoder to convert categorical variables to numerical
categorical_variables = ('Edition','Gender','College','StudentProgram','Role','HasTeam','GraduationYear')
for variable in categorical_variables:
    # creating instance of labelencoder
    labelencoder = LabelEncoder()
    # Assigning numerical values and storing in another column
    registrations[variable] = labelencoder.fit_transform(registrations[variable])
registrations

Unnamed: 0,Datetime,Edition,id,IsSelected,Gender,College,StudentProgram,GraduationYear,NHackathon,Role,HasTeam,TeamMembers,TeamId,TeamSize
0,09/08/2016 19:55:46,0,46,S,1,22,8,1,0,5,0,"46, 56, 309, 483",54,4
1,09/08/2016 20:24:09,0,309,S,2,22,8,1,0,20,0,"46, 56, 309, 483",54,4
2,09/08/2016 22:57:08,0,56,S,2,22,8,1,0,5,1,"46, 56, 309, 483",54,4
3,18/08/2016 22:49:37,0,47,S,1,25,0,0,0,5,0,,0,1
4,19/08/2016 21:47:33,0,35,S,2,25,0,0,0,5,0,,0,1
5,23/08/2016 09:44:28,0,483,S,2,22,93,1,0,5,1,"46, 56, 309, 483",54,4
6,24/07/2016 20:53:04,0,227,S,2,22,8,2,2,5,1,"227, 303, 339, 402",31,4
7,24/07/2016 21:01:17,0,342,S,2,22,8,2,5,5,1,"332, 342, 511, 513",47,4
8,24/07/2016 21:01:24,0,438,S,2,26,59,5,0,5,0,,0,1
9,24/07/2016 21:01:25,0,513,S,2,22,8,3,1,5,1,"332, 342, 511, 513",47,4
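
One caveat with `LabelEncoder` is that the resulting integers (for example 22, 25, and 26 for `College`) have no real order, yet a linear model treats them as ordered magnitudes. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a tiny hand-made frame whose values mirror a few of the rows shown above; the frame itself and the variable names are assumptions for illustration, not the project's preprocessing.

```python
import pandas as pd

# Tiny hand-made frame; values mirror a few rows of the registration table.
toy = pd.DataFrame({
    "College": ["IME", "Não USP", "Poli", "IME"],
    "HasTeam": ["Não", "Sim", "Não", "Sim"],
    "NHackathon": [0, 0, 0, 2],
})

# Each category becomes its own indicator column instead of an arbitrary integer.
encoded = pd.get_dummies(toy, columns=["College", "HasTeam"])
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: a column with many distinct values, such as `StudentProgram`, would expand into one indicator column per value.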


In [101]:
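# Parse the timestamps as UTC. Note: the raw values look day-first (dd/mm/yyyy);
# without dayfirst=True, ambiguous dates such as 09/08/2016 are read month-first.
registrations['Datetime'] = pd.to_datetime(registrations['Datetime'], utc=True)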
registrations['Datetime']

0      2016-09-08 19:55:46+00:00
1      2016-09-08 20:24:09+00:00
2      2016-09-08 22:57:08+00:00
3      2016-08-18 22:49:37+00:00
4      2016-08-19 21:47:33+00:00
5      2016-08-23 09:44:28+00:00
6      2016-07-24 20:53:04+00:00
7      2016-07-24 21:01:17+00:00
8      2016-07-24 21:01:24+00:00
9      2016-07-24 21:01:25+00:00
10     2016-07-24 21:01:33+00:00
11     2016-07-24 21:01:55+00:00
12     2016-07-24 21:01:57+00:00
13     2016-07-24 21:02:00+00:00
14     2016-07-24 21:02:07+00:00
15     2016-07-24 21:02:17+00:00
16     2016-07-24 21:02:41+00:00
17     2016-07-24 21:02:44+00:00
18     2016-07-24 21:02:47+00:00
19     2016-07-24 21:02:48+00:00
20     2016-07-24 21:02:53+00:00
21     2016-07-24 21:03:00+00:00
22     2016-07-24 21:03:06+00:00
23     2016-07-24 21:03:11+00:00
24     2016-07-24 21:03:16+00:00
25     2016-07-24 21:03:18+00:00
26     2016-07-24 21:03:19+00:00
27     2016-07-24 21:03:25+00:00
28     2016-07-24 21:03:29+00:00
29     2016-07-24 21:03:35+00:00
          

In [102]:
# Convert each timestamp to nanoseconds since the Unix epoch, stored as a float
registrations['Datetime'] = registrations['Datetime'].astype('int64').astype(float)
registrations['Datetime']

0       1.473365e+18
1       1.473366e+18
2       1.473375e+18
3       1.471561e+18
4       1.471643e+18
5       1.471945e+18
6       1.469394e+18
7       1.469394e+18
8       1.469394e+18
9       1.469394e+18
10      1.469394e+18
11      1.469394e+18
12      1.469394e+18
13      1.469394e+18
14      1.469394e+18
15      1.469394e+18
16      1.469394e+18
17      1.469394e+18
18      1.469394e+18
19      1.469394e+18
20      1.469394e+18
21      1.469394e+18
22      1.469394e+18
23      1.469394e+18
24      1.469394e+18
25      1.469394e+18
26      1.469394e+18
27      1.469394e+18
28      1.469394e+18
29      1.469394e+18
            ...     
1274    1.526768e+18
1275    1.526775e+18
1276    1.526781e+18
1277    1.526802e+18
1278    1.526809e+18
1279    1.526819e+18
1280    1.526819e+18
1281    1.526823e+18
1282    1.526828e+18
1283    1.526842e+18
1284    1.526843e+18
1285    1.526844e+18
1286    1.526848e+18
1287    1.526849e+18
1288    1.526849e+18
1289    1.526849e+18
1290    1.526

In [103]:
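# Target: whether the candidate was selected
y = registrations['IsSelected']
# Features: everything except the target and the identifier-like columns
x = registrations.drop(columns=['IsSelected', 'TeamMembers', 'TeamId', 'id'])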
x

Unnamed: 0,Datetime,Edition,Gender,College,StudentProgram,GraduationYear,NHackathon,Role,HasTeam,TeamSize
0,1.473365e+18,0,1,22,8,1,0,5,0,4
1,1.473366e+18,0,2,22,8,1,0,20,0,4
2,1.473375e+18,0,2,22,8,1,0,5,1,4
3,1.471561e+18,0,1,25,0,0,0,5,0,1
4,1.471643e+18,0,2,25,0,0,0,5,0,1
5,1.471945e+18,0,2,22,93,1,0,5,1,4
6,1.469394e+18,0,2,22,8,2,2,5,1,4
7,1.469394e+18,0,2,22,8,2,5,5,1,4
8,1.469394e+18,0,2,26,59,5,0,5,0,1
9,1.469394e+18,0,2,22,8,3,1,5,1,4


In [104]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)
columns = x_train.columns
columns

Index([u'Datetime', u'Edition', u'Gender', u'College', u'StudentProgram',
       u'GraduationYear', u'NHackathon', u'Role', u'HasTeam', u'TeamSize'],
      dtype='object')

In [105]:
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
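
The model above is fitted on raw features, where `Datetime` is on the order of 1e18 while the other columns are small integers. Features on such different scales can make it hard for the solver to use the smaller ones. The sketch below shows one common remedy, standardizing the features in a pipeline before the logistic regression. It runs on synthetic data with an invented labeling rule, so it only illustrates the technique and says nothing about how the registration data would behave.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-ins: a timestamp-like column with huge values
# and a small count column (both invented for this sketch).
timestamp = 1.47e18 + rng.uniform(0, 5e16, size=n)
n_hackathon = rng.randint(0, 6, size=n)
X = np.column_stack([timestamp, n_hackathon])

# Invented rule: the label depends only on the small-scale feature.
y = (n_hackathon >= 2).astype(int)

# Standardize every column to zero mean and unit variance, then fit.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
accuracy = model.score(X, y)
print(accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics when the pipeline is used with `train_test_split`.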

In [106]:
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.66


In [107]:
# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[258   0]
 [134   0]]
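
The matrix above can be read directly: in scikit-learn's convention, rows are true classes and columns are predicted classes, so the second column being all zeros means the model never predicted the second class. The short calculation below derives the accuracy and the per-class recall from the printed numbers. The variable names are illustrative.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [[258, 0],
      [134, 0]]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions over the true count of that class.
recall_first = cm[0][0] / sum(cm[0])
recall_second = cm[1][1] / sum(cm[1])

# Number of samples the model assigned to the second class.
predicted_second = cm[0][1] + cm[1][1]

print(total, round(accuracy, 2), recall_first, recall_second, predicted_second)
```

The 0.66 accuracy therefore equals the share of the first class in the test set (258 of 392): the classifier reaches it by assigning every sample to that class.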


# Learning To Rank (LTR)

Learning To Rank is a family of techniques for generating a ranked list of items. LTR does not care about the precision of each item's final score; what matters is the final position of each item in the list. The input is slightly different from the previous methods: a list of items together with a list defining the position of each item. Because it produces an ordering directly, this method appears to be the best fit for the purpose of this project.
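
One way to illustrate the idea is the pairwise approach, which is only one of several LTR families: every pair of items becomes a training example whose label says which item should come first, and a classifier is trained on the feature differences. The sketch below uses invented items and positions and a plain `LogisticRegression` as the pairwise classifier. It is an assumption-laden illustration, not the method used in this project.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical items: feature rows and their known positions (0 = best).
X = np.array([[5.0, 1.0], [3.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
positions = np.array([0, 1, 2, 3])

# Pairwise transform: for each pair, the feature difference is labeled 1
# if the first item should be ranked ahead of the second, else 0.
diffs, labels = [], []
for i, j in combinations(range(len(X)), 2):
    diffs.append(X[i] - X[j])
    labels.append(int(positions[i] < positions[j]))
    diffs.append(X[j] - X[i])
    labels.append(int(positions[j] < positions[i]))

# No intercept: only the differences between items should matter.
ranker = LogisticRegression(fit_intercept=False)
ranker.fit(np.array(diffs), np.array(labels))

# The learned weight vector scores single items; sorting the scores gives the ranking.
scores = X @ ranker.coef_.ravel()
predicted_order = np.argsort(-scores)
print(predicted_order)
```

Only the relative order of the scores is used, which matches the point above that the final position matters more than the exact score.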
