### The dataset comes from DBLP, the computer science bibliography, which provides open bibliographic information on major computer science journals and proceedings. It summarizes co-authorship based on publications from 1999 to 2013. Authors who published at least one paper together are considered co-authors. Each example is generated from two random authors, A and B, who were not co-authors during 1999 to 2013.
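Several columns in the dataset (`neighbor_sum`, `common_neighbor_count`, `shortest_path_distance`) look like features computed on a co-authorship graph. The sketch below shows one plausible way to compute them for a pair of authors. The toy graph and the exact feature definitions are assumptions inferred from the column names, not taken from the dataset's documentation.

```python
from collections import deque

# Hypothetical toy co-authorship graph: author -> set of co-authors.
graph = {
    "A": {"C", "D"},
    "B": {"D", "E"},
    "C": {"A"},
    "D": {"A", "B"},
    "E": {"B"},
}

def common_neighbor_count(g, a, b):
    """Number of authors who have co-authored with both a and b."""
    return len(g[a] & g[b])

def neighbor_sum(g, a, b):
    """Total number of co-authors of a plus co-authors of b (assumed definition)."""
    return len(g[a]) + len(g[b])

def shortest_path_distance(g, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in g[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

pair_features = {
    "neighbor_sum": neighbor_sum(graph, "A", "B"),
    "common_neighbor_count": common_neighbor_count(graph, "A", "B"),
    "shortest_path_distance": shortest_path_distance(graph, "A", "B"),
}
print(pair_features)
```

Under these definitions, two authors with one common neighbor are at distance 2, which is consistent with row 9 of the data preview.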


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [2]:
data = pd.read_csv('Dataset_DBLP.csv')

In [3]:
data

Unnamed: 0,keyword_match_count,keyword_sum,paper_sum,neighbor_sum,journal_match_count,journal_sum,shortest_path_distance,common_neighbor_count,if_coauthor
0,0,50,12,19,0,1,6,0,0
1,0,32,4,10,0,2,5,0,0
2,4,191,36,86,0,10,4,0,1
3,2,51,8,9,0,5,5,0,1
4,0,36,6,12,0,2,3,0,1
5,1,26,3,5,1,2,8,0,1
6,3,98,20,16,0,4,5,0,0
7,0,59,9,16,0,4,7,0,1
8,0,44,6,47,0,2,6,0,1
9,2,98,25,32,0,2,2,1,1


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
keyword_match_count       5000 non-null int64
keyword_sum               5000 non-null int64
paper_sum                 5000 non-null int64
neighbor_sum              5000 non-null int64
journal_match_count       5000 non-null int64
journal_sum               5000 non-null int64
shortest_path_distance    5000 non-null int64
common_neighbor_count     5000 non-null int64
if_coauthor               5000 non-null int64
dtypes: int64(9)
memory usage: 351.6 KB


In [5]:
result = dict()

In [6]:
def printResult(result, method):
    # Print the stored metrics for one model
    print("Accuracy:", result[method]["Accuracy"])
    print("Precision:", result[method]["Precision"])
    print("Recall:", result[method]["Recall"])

In [7]:
def calculateMetrics(y_test, y_pred):
    # Accuracy, precision and recall for the positive class (if_coauthor == 1)
    return {
        "Accuracy": metrics.accuracy_score(y_test, y_pred),
        "Precision": metrics.precision_score(y_test, y_pred),
        "Recall": metrics.recall_score(y_test, y_pred),
    }
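To make the three metrics concrete, here is a minimal pure-Python sketch that derives them from confusion counts. The labels are made-up toy values, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def basic_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        # Fraction of all pairs labeled correctly
        "Accuracy": (tp + tn) / len(y_true),
        # Of the pairs predicted as co-authors, how many really are
        "Precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real co-author pairs, how many were found
        "Recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (hypothetical, not from the DBLP dataset)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_metrics = basic_metrics(y_true, y_pred)
print(toy_metrics)
```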

In [8]:
X = data.drop("if_coauthor", axis=1)
Y = data["if_coauthor"]

## 1. Linear regression model

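### Linear regression is not well suited to classification: its predictions are unbounded real values, so they cannot be easily interpreted as a probability or a binary label. It is included here as a baseline, with scores thresholded at 0.5.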
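The following sketch illustrates the problem on a single hypothetical feature: a least-squares line fitted to 0/1 labels produces scores below 0 and above 1, so the scores are not probabilities, although thresholding them at 0.5 still yields usable labels. The data are invented for illustration.

```python
# Hypothetical toy data: one feature (e.g. a path distance) and a 0/1 label.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Closed-form ordinary least squares for one feature.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Raw regression scores: some fall outside the [0, 1] range.
scores = [intercept + slope * xi for xi in x]
print([round(s, 3) for s in scores])

# Thresholding at 0.5 turns the scores into binary labels.
labels = [1 if s > 0.5 else 0 for s in scores]
print(labels)
```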

In [9]:
from sklearn.linear_model import LinearRegression

In [10]:
# Split the dataset into training (80%) and test (20%) sets
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, Y, test_size=0.2, random_state=4)
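Every section of this notebook repeats the split with `random_state=4`, so all models are trained and evaluated on the same rows and their metrics are directly comparable. The sketch below illustrates the idea of a seeded split in pure Python. It is not scikit-learn's actual shuffling algorithm, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle indices with a seeded RNG; the first test_size fraction is the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

# Two calls with the same seed give exactly the same partition.
train_a, test_a = split_indices(5000, 0.2, seed=4)
train_b, test_b = split_indices(5000, 0.2, seed=4)
print(len(train_a), len(test_a))  # 4000 1000
print(test_a == test_b)           # True
```

Because the partition is deterministic, a single split could be reused by all eight models instead of being recomputed in each section.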

In [11]:
model_1 = LinearRegression()

# Train the model on the training set
model_1.fit(X_train_1, y_train_1)

# Predict raw regression scores for the test set
y_pred_1_raw = model_1.predict(X_test_1)
# Label a pair as co-authors (1) if the score is above 0.5, otherwise 0
y_pred_1 = [1 if y > 0.5 else 0 for y in y_pred_1_raw]

In [12]:
result["Linear Regression"] = calculateMetrics(y_test_1, y_pred_1)

In [13]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Linear Regression")

Accuracy: 0.774
Precision: 0.8122171945701357
Recall: 0.7151394422310757


In [14]:
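# Score two new feature vectors (columns in the same order as X) with the linear regression model.
# The output is a raw regression score, not a label; a score above 0.5 means predicted co-authors.
checkif_coauthor_1 = np.array([(5, 1, 0, 1, 1, 1, 4, 0), (2, 0, 0, 1, 1, 1, 5, 0)])

model_1.predict(checkif_coauthor_1)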

array([0.62371254, 0.48529629])

## 2. Multinomial Naive Bayes

In [15]:
from sklearn.naive_bayes import MultinomialNB

In [16]:
# Split the dataset into training (80%) and test (20%) sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [17]:
model_2 = MultinomialNB()

In [18]:
#Train the model using the training sets
model_2.fit(X_train_2, y_train_2)

#Predict the response for test dataset
y_pred_2 = model_2.predict(X_test_2)

In [19]:
result["Multinomial Naive Bayes"] = calculateMetrics(y_test_2, y_pred_2)

In [20]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Multinomial Naive Bayes")

Accuracy: 0.75
Precision: 0.85
Recall: 0.6095617529880478


## 3. Gaussian Naive Bayes

In [21]:
from sklearn.naive_bayes import GaussianNB

In [22]:
# Split the dataset into training (80%) and test (20%) sets
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [23]:
model_3 = GaussianNB()

In [24]:
#Train the model using the training sets
model_3.fit(X_train_3, y_train_3)

#Predict the response for test dataset
y_pred_3 = model_3.predict(X_test_3)

In [25]:
result["Gaussian Naive Bayes"] = calculateMetrics(y_test_3, y_pred_3)

In [26]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Gaussian Naive Bayes")

Accuracy: 0.702
Precision: 0.9322033898305084
Recall: 0.43824701195219123


In [27]:
# Predict labels for the same two new feature vectors with Gaussian Naive Bayes
predicted_3 = model_3.predict(checkif_coauthor_1)
print("Predicted Values:", predicted_3)

Predicted Values: [1 1]


## 4. Decision Tree

In [28]:
## using decision tree to train the model
from sklearn import tree

In [29]:
# Split the dataset into training (80%) and test (20%) sets
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [30]:
model_4 = tree.DecisionTreeClassifier()

In [31]:
#Train the model using the training sets
model_4.fit(X_train_4, y_train_4)

#Predict the response for test dataset
y_pred_4 = model_4.predict(X_test_4)

In [32]:
result["Decision Tree"] = calculateMetrics(y_test_4, y_pred_4)

In [33]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Decision Tree")

Accuracy: 0.683
Precision: 0.6789168278529981
Recall: 0.6992031872509961


## 5. Random Forest

In [34]:
## initiate RF classifier
from sklearn.ensemble import RandomForestClassifier

In [35]:
# Split the dataset into training (80%) and test (20%) sets
X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [36]:
# Tune max_depth and n_estimators with a 5-fold cross-validated grid search
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier()
parameters = {
    'max_depth': [8, 16, 32, None],
    'n_estimators': [100, 200, 300],
}
gridCV = GridSearchCV(rf, parameters, cv=5)
model_5 = gridCV.fit(X_train_5, y_train_5)
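To get a sense of what this grid search costs, the sketch below enumerates the candidate settings from the same parameter grid and counts the cross-validation fits. It only reproduces the bookkeeping with `itertools.product`; it does not train any model.

```python
from itertools import product

# Same grid and fold count as in the random forest search above.
param_grid = {
    "max_depth": [8, 16, 32, None],
    "n_estimators": [100, 200, 300],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold.
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 12 60
```

With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the whole training set, which is why `model_5` can be used directly for prediction.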

In [37]:
# Show the best random forest found by the grid search
model_5.best_estimator_
# In this run the best configuration was max_depth=8, n_estimators=200.
# random_state is not set, so a rerun may select a different configuration.

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [38]:
#Predict the response for test dataset
y_pred_5 = model_5.predict(X_test_5)

In [39]:
result["Random Forest"] = calculateMetrics(y_test_5, y_pred_5)

In [40]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Random Forest")

Accuracy: 0.802
Precision: 0.8551401869158879
Recall: 0.7290836653386454


## 6. k-Nearest Neighbors

In [41]:
from sklearn.neighbors import KNeighborsClassifier

In [42]:
# Split the dataset into training (80%) and test (20%) sets
X_train_6, X_test_6, y_train_6, y_test_6 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [43]:
# Tune leaf_size and n_neighbors with a 5-fold cross-validated grid search
knn = KNeighborsClassifier()
parameters = {
    'leaf_size': [10, 20, 30, 40, 50],
    'n_neighbors': [5, 10, 15],
}
gridCV = GridSearchCV(knn, parameters, cv=5)
model_6 = gridCV.fit(X_train_6, y_train_6)

In [44]:
# Show the best kNN estimator found by the grid search
model_6.best_estimator_
# In this run the best configuration was leaf_size=10, n_neighbors=15.

KNeighborsClassifier(algorithm='auto', leaf_size=10, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=2,
           weights='uniform')
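The selected estimator uses uniform weights and the Euclidean metric (`minkowski` with `p=2`), so each prediction is a plain majority vote among the nearest training points. The brute-force sketch below shows that rule on invented two-feature data. `leaf_size` has no counterpart here because it only configures the tree-based neighbor search used for speed, which suggests that `n_neighbors` is the setting that mainly drives the predictions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per author pair, label 1 = co-authors.
train_X = [(0, 6), (1, 5), (0, 7), (4, 2), (3, 2), (5, 3)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (4, 3), k=3))
print(knn_predict(train_X, train_y, (0, 6), k=3))
```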

In [45]:
#Predict the response for test dataset
y_pred_6 = model_6.predict(X_test_6)

In [46]:
result["k-Nearest Neighbors"] = calculateMetrics(y_test_6, y_pred_6)

In [47]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "k-Nearest Neighbors")

Accuracy: 0.734
Precision: 0.7532188841201717
Recall: 0.6992031872509961


## 7. SVM Classifier

In [48]:
from sklearn.svm import SVC

In [49]:
# Split the dataset into training (80%) and test (20%) sets
X_train_7, X_test_7, y_train_7, y_test_7 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [50]:
# Use a linear kernel with default settings; no grid search was run because SVM training is too time-consuming.
model_7 = SVC(kernel='linear')
model_7.fit(X_train_7, y_train_7)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [51]:
#Predict the response for test dataset
y_pred_7 = model_7.predict(X_test_7)

In [52]:
result["SVM Classifier"] = calculateMetrics(y_test_7, y_pred_7)

In [53]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "SVM Classifier")

Accuracy: 0.795
Precision: 0.8648648648648649
Recall: 0.701195219123506


## 8. Logistic Regression

In [54]:
#Create a LR model
from sklearn.linear_model import LogisticRegression

In [55]:
# Split the dataset into training (80%) and test (20%) sets
X_train_8, X_test_8, y_train_8, y_test_8 = train_test_split(X, Y, test_size=0.2, random_state=4)

In [56]:
model_8 = LogisticRegression()

In [57]:
#Train the model using the training sets
model_8.fit(X_train_8, y_train_8)

#Predict the response for test dataset
y_pred_8 = model_8.predict(X_test_8)

In [58]:
result["Logistic Regression"] = calculateMetrics(y_test_8, y_pred_8)

In [59]:
# Model Accuracy, how often is the classifier correct?
printResult(result, "Logistic Regression")

Accuracy: 0.799
Precision: 0.8508158508158508
Recall: 0.7270916334661355


# Results

In [63]:
# Print out the result
pd.DataFrame.from_dict(result)

Metric,Linear Regression,Multinomial Naive Bayes,Gaussian Naive Bayes,Decision Tree,Random Forest,k-Nearest Neighbors,SVM Classifier,Logistic Regression
Accuracy,0.774,0.75,0.702,0.683,0.802,0.734,0.795,0.799
Precision,0.812217,0.85,0.932203,0.678917,0.85514,0.753219,0.864865,0.850816
Recall,0.715139,0.609562,0.438247,0.699203,0.729084,0.699203,0.701195,0.727092


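### Random Forest has the highest accuracy (0.802) and the highest recall (0.729). Its precision (0.855) is third, behind Gaussian Naive Bayes (0.932) and the SVM classifier (0.865). Logistic Regression is a close second on accuracy (0.799). Based on these results, we choose the Random Forest classifier for this task.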
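One way to cross-check this choice is to combine precision and recall into a single F1 score (their harmonic mean) for each model. The sketch below uses the precision and recall values from the results table. F1 is not computed in the notebook itself, so treat it as an additional, assumed comparison criterion.

```python
# (precision, recall) per model, copied from the results table.
results = {
    "Linear Regression": (0.812217, 0.715139),
    "Multinomial Naive Bayes": (0.85, 0.609562),
    "Gaussian Naive Bayes": (0.932203, 0.438247),
    "Decision Tree": (0.678917, 0.699203),
    "Random Forest": (0.85514, 0.729084),
    "k-Nearest Neighbors": (0.753219, 0.699203),
    "SVM Classifier": (0.864865, 0.701195),
    "Logistic Regression": (0.850816, 0.727092),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(p, r) for name, (p, r) in results.items()}

# Print the models from best to worst F1.
for name, score in sorted(f1_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")

best_model = max(f1_scores, key=f1_scores.get)
print("Best by F1:", best_model)
```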