## Enigma codeFest Machine Learning

#### The competition has been organised by [Analytics Vidhya](https://datahack.analyticsvidhya.com/contest/enigma-codefest-machine-learning/)

##### This is a regression problem: the goal is to predict the number of upvotes given by users.

##### The evaluation metric is RMSE (root mean squared error).
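
RMSE is the square root of the mean of the squared differences between actual and predicted values, so large misses are penalised more heavily than small ones. A minimal plain-Python sketch, using the first four `Upvotes` values shown in `train.head()` below together with made-up predictions (the predictions are an assumption, for illustration only):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Actual upvotes from the first rows of the data; predictions are hypothetical.
actual = [42.0, 1175.0, 60.0, 9.0]
predicted = [40.0, 1100.0, 70.0, 10.0]
score = rmse(actual, predicted)
print(round(score, 3))
```

The single error of 75 on the second row dominates the score, which is why a few badly predicted high-upvote rows can drive the RMSE values reported later.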

##### Import dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

In [2]:
%matplotlib inline

##### Read data

In [3]:
train= pd.read_csv('train.csv')
test= pd.read_csv('test.csv')
samp= pd.read_csv('sample.csv')

In [4]:
train.head()

|   | ID | Tag | Reputation | Answers | Username | Views | Upvotes |
|---|---|---|---|---|---|---|---|
| 0 | 52664 | a | 3942.0 | 2.0 | 155623 | 7855.0 | 42.0 |
| 1 | 327662 | a | 26046.0 | 12.0 | 21781 | 55801.0 | 1175.0 |
| 2 | 468453 | c | 1358.0 | 4.0 | 56177 | 8067.0 | 60.0 |
| 3 | 96996 | a | 264.0 | 3.0 | 168793 | 27064.0 | 9.0 |
| 4 | 131465 | c | 4271.0 | 4.0 | 112223 | 13986.0 | 83.0 |


### The first method is an ANN with two hidden layers (100 and 10 units) and a single output unit.

In [5]:
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split, KFold

In [6]:
le= LabelEncoder()
encoder= le.fit(train['Tag'].astype(str))
train['Tag']= encoder.transform(train['Tag'])
test['Tag']= encoder.transform(test['Tag'])
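
The encoder is fitted on the training tags and then reused for the test set, so each tag maps to the same integer in both frames. A plain-Python sketch of that idea follows. It assumes codes are assigned in sorted order, which is how `LabelEncoder` behaves, and it uses a hypothetical `-1` code for tags never seen in training (`LabelEncoder.transform` raises a `ValueError` in that case instead). The helper names and tag values are illustrative only.

```python
def fit_label_map(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def apply_label_map(values, label_map, unseen_code=-1):
    """Encode values with a previously fitted map; unseen labels get unseen_code."""
    return [label_map.get(v, unseen_code) for v in values]

# Hypothetical tag columns, for illustration only.
train_tags = ["a", "a", "c", "a", "c", "j"]
test_tags = ["c", "j", "a", "x"]  # "x" never appears in training

tag_map = fit_label_map(train_tags)          # fitted on training data only
train_codes = apply_label_map(train_tags, tag_map)
test_codes = apply_label_map(test_tags, tag_map)
print(tag_map, train_codes, test_codes)
```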

In [11]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.metrics import mean_squared_error

In [8]:
X_train, X_test, y_train, y_test= train_test_split(train.drop(['ID', 'Username', 'Upvotes'], axis=1), train['Upvotes'],
                                                   random_state= 0)

In [12]:
model= Sequential()
model.add(Dense(units= 100, activation= 'relu', kernel_initializer= 'uniform', input_dim= X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(units= 10, activation= 'relu'))
model.add(Dropout(0.1))
model.add(Dense(units= 1, activation= 'relu'))

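# Pass the metric by name: `mean_squared_error` was imported from both sklearn.metrics
# and keras.metrics, and the later Keras import shadows the scikit-learn function.
model.compile(loss='mean_squared_error', optimizer='sgd', metrics= ['mse'])

model.fit(X_train, y_train, batch_size= 200, epochs= 5, validation_data= (X_test, y_test))

y_pred= model.predict(X_test, batch_size= 200)

# evaluate() returns [loss, mse] for each split (these are errors, not accuracies)
train_acc= model.evaluate(X_train, y_train, batch_size= 200)
test_acc= model.evaluate(X_test, y_test, batch_size= 200)
r_squared= r2_score(y_test, y_pred)
# Re-import the scikit-learn function under an alias so the Keras one is not called here
from sklearn.metrics import mean_squared_error as sk_mean_squared_error
mse= sk_mean_squared_error(y_test, y_pred)
rmse= np.sqrt(mse)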

print('Train detail', train_acc)
print('Test detail', test_acc)
print('r_squared: {}' .format(r_squared))
print('rmse: {}' .format(rmse))

Train on 247533 samples, validate on 82512 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


MemoryError: 

#### Data preprocessing and Feature Engineering

In [17]:
train['Reputation_Views_Ratio']= train['Reputation']/train['Views']
test['Reputation_Views_Ratio']= test['Reputation']/test['Views']

train['Answers_Views_Ratio']= train['Answers']/train['Views']
test['Answers_Views_Ratio']= test['Answers']/test['Views']

train['Views_Scale']= train['Views']/train['Views'].mean()
test['Views_Scale']= test['Views']/test['Views'].mean()

train['Reputation_Scale']= train['Reputation']/train['Reputation'].mean()
test['Reputation_Scale']= test['Reputation']/test['Reputation'].mean()

train['Answers_Scale']= train['Answers']/train['Answers'].mean()
test['Answers_Scale']= test['Answers']/test['Answers'].mean()
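
The `*_Scale` features above divide each frame by its own column mean, so the same raw value is scaled slightly differently in `train` and `test`. A hedged alternative, sketched here in plain Python with hypothetical values, computes the mean on the training data only and reuses it for the test data. It also guards the ratio features against a zero denominator, which the cell above does not do (whether `Views` can be zero in this dataset is not shown in the notebook). The helper names are assumptions for illustration.

```python
from statistics import fmean

def scale_by_train_mean(train_values, test_values):
    """Divide both columns by the mean of the training column."""
    train_mean = fmean(train_values)
    return ([v / train_mean for v in train_values],
            [v / train_mean for v in test_values])

def safe_ratio(numerators, denominators, default=0.0):
    """Element-wise ratio that returns `default` where the denominator is zero."""
    return [n / d if d != 0 else default
            for n, d in zip(numerators, denominators)]

# Hypothetical values, for illustration only.
train_views = [8000.0, 56000.0, 8000.0, 28000.0]
test_views = [25000.0, 0.0]

train_scaled, test_scaled = scale_by_train_mean(train_views, test_views)
ratios = safe_ratio([4000.0, 100.0], test_views)
print(train_scaled, test_scaled, ratios)
```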


In [18]:
train.head()

|   | ID | Tag | Reputation | Answers | Username | Views | Upvotes | Reputation_Views_Ratio | Answers_Views_Ratio | Views_Scale | Reputation_Scale | Answers_Scale |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52664 | 0 | 3942.0 | 2.0 | 155623 | 7855.0 | 42.0 | 0.501846 | 0.000255 | 0.264968 | 0.507131 | 0.510507 |
| 1 | 327662 | 0 | 26046.0 | 12.0 | 21781 | 55801.0 | 1175.0 | 0.466766 | 0.000215 | 1.882303 | 3.350767 | 3.063044 |
| 2 | 468453 | 1 | 1358.0 | 4.0 | 56177 | 8067.0 | 60.0 | 0.16834 | 0.000496 | 0.272119 | 0.174704 | 1.021015 |
| 3 | 96996 | 0 | 264.0 | 3.0 | 168793 | 27064.0 | 9.0 | 0.009755 | 0.000111 | 0.912934 | 0.033963 | 0.765761 |
| 4 | 131465 | 1 | 4271.0 | 4.0 | 112223 | 13986.0 | 83.0 | 0.305377 | 0.000286 | 0.471782 | 0.549456 | 1.021015 |


In [23]:
train['Reputation']= train['Reputation'].astype(int)
test['Reputation']= test['Reputation'].astype(int)

train['Answers']= train['Answers'].astype(int)
test['Answers']= test['Answers'].astype(int)

train['Views']= train['Views'].astype(int)
test['Views']= test['Views'].astype(int)



#### Machine learning implementation

In [24]:
X_train, X_test, y_train, y_test= train_test_split(train.drop(['ID', 'Username', 'Upvotes'], axis= 1),
                                                   train['Upvotes'], random_state= 0)

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

In [26]:
import catboost as cb
import xgboost as xgb
import lightgbm as lgbm

#### CatBoost

In [27]:
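# Treat every non-float column as categorical. `np.float` was removed in NumPy 1.24, so compare against np.float64.
# Note: after the integer casts above, Reputation, Answers and Views are picked up here as well as Tag.
cat_feat_index= np.where(train.drop(['ID', 'Username', 'Upvotes'], axis= 1).dtypes!= np.float64)[0]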

In [29]:
def run_CB(train, target, test, cat_feat_index= cat_feat_index):
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
    X_train, X_test, y_train, y_test= train_test_split(train, target, random_state= 0)
    param_cb= {}
    param_cb['iterations']= 1000
    param_cb['learning_rate']= 0.1
    #param_cb['max_depth']= 3
    #param_cb['random_seed']= 2018
    model= cb.CatBoostRegressor(**param_cb)
    
    model.fit(X_train, y_train, cat_features= cat_feat_index, eval_set= (X_test, y_test), verbose= 100, early_stopping_rounds= 50)
    
    y_pred= model.predict(X_test)
    
    r_squared= r2_score(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    #msle= mean_squared_log_error(y_test, y_pred)
    #rmsle= np.sqrt(msle)
    print('The R-Squared value: {:.4f}' .format(r_squared))
    print('Mean Squared Error: {:.4f}' .format(mse))
    print('Root Mean Squared Error: {:.4f}' .format(rmse))
    #print('Root Mean Squared Log Error: {:.4f}' .format(rmsle))
    
    return model, model.predict(test)

In [30]:
model_1, pred_1= run_CB(train.drop(['ID', 'Username', 'Upvotes'], axis= 1), train['Upvotes'],
                        test.drop(['ID', 'Username'], axis= 1))

0:	learn: 3692.4595896	test: 2569.1815115	best: 2569.1815115 (0)	total: 790ms	remaining: 13m 8s
100:	learn: 1051.7016295	test: 1059.9045170	best: 1059.9045170 (100)	total: 50.3s	remaining: 7m 27s
200:	learn: 814.6666365	test: 1006.4085183	best: 1005.9317449 (184)	total: 1m 39s	remaining: 6m 35s
300:	learn: 719.4999745	test: 987.4857834	best: 987.3361094 (297)	total: 2m 29s	remaining: 5m 46s
400:	learn: 645.3981486	test: 982.0289381	best: 979.1885596 (375)	total: 3m 19s	remaining: 4m 57s
500:	learn: 588.3221075	test: 978.7037999	best: 976.1758268 (471)	total: 4m 9s	remaining: 4m 8s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 976.1758268
bestIteration = 471

Shrink model to first 472 iterations.
The R-Squared value: 0.8675
Mean Squared Error: 952919.2448
Root Mean Squared Error: 976.1758


#### Linear Regression

In [48]:
import math
def ml_modeling(model, train, target, test):
    from sklearn.metrics import r2_score, mean_squared_error
    X_train, X_test, y_train, y_test= train_test_split(train, target, random_state= 0)
    
    model.fit(X_train, y_train)
    
    y_pred= model.predict(X_test)
    
    r_squared= r2_score(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    #msle= mean_squared_log_error(y_test, y_pred)
    #rmsle= np.sqrt(msle)
    print('The R-Squared value: {:.4f}' .format(r_squared))
    print('Mean Squared Error: {:.4f}' .format(mse))
    print('Root Mean Squared Error: {:.4f}' .format(rmse))
    #print('Root Mean Squared Log Error: {:.4f}' .format(rmsle))
    
    return model.predict(test)

In [49]:
reg_lin= LinearRegression()

In [50]:
pred_2= ml_modeling(reg_lin, train.drop(['ID', 'Username', 'Upvotes'], axis= 1), train['Upvotes'],
                        test.drop(['ID', 'Username'], axis= 1))

The R-Squared value: 0.3333
Mean Squared Error: 4795905.4603
Root Mean Squared Error: 2189.9556
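
For reference, the R-squared value printed above is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean. A small plain-Python sketch with hypothetical numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
perfect = r_squared(actual, actual)        # predictions equal the targets
mean_only = r_squared(actual, [25.0] * 4)  # always predict the mean
print(perfect, mean_only)
```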


#### XGBoost

In [60]:
def run_XGB(train, target, test):
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
    X_train, X_test, y_train, y_test= train_test_split(train, target, random_state= 0)
    
    param_xgb= {}
    param_xgb['eta']= 0.1
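    # 'reg:linear' is the deprecated name of this objective; newer XGBoost releases call it 'reg:squarederror'
    param_xgb['objective']= 'reg:squarederror'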
    param_xgb['subsample']= 0.8
    param_xgb['colsample_bytree']= 0.7
    param_xgb['max_depth']= 3
    param_xgb['eval_metric']= 'rmse'
    param_xgb['seed']= 0
    
    dtrain= xgb.DMatrix(X_train, y_train, silent= True)
    dtest= xgb.DMatrix(X_test, y_test, silent= True)
    test_= xgb.DMatrix(test)

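    # xgb.train uses the LAST watchlist entry for early stopping, so this order monitors
    # train-rmse (see the log below); list (dtest, 'eval') last to stop on the hold-out score.
    watchlist= [(dtest, 'eval'), (dtrain, 'train')]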
    
    model= xgb.train(param_xgb, dtrain, 1000, watchlist, early_stopping_rounds= 50, verbose_eval= 100)
    
    y_pred= model.predict(dtest)

    r_squared= r2_score(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    #msle= mean_squared_log_error(y_test, y_pred)
    #rmsle= np.sqrt(msle)
    print('The R-Squared value: {:.4f}' .format(r_squared))
    print('Mean Squared Error: {:.4f}' .format(mse))
    print('Root Mean Squared Error: {:.4f}' .format(rmse))
    #print('Root Mean Squared Log Error: {:.4f}' .format(rmsle))
    
    return model.predict(test_)

In [61]:
pred_4= run_XGB(train.drop(['ID', 'Username', 'Upvotes'], axis= 1), train['Upvotes'],
                test.drop(['ID', 'Username'], axis= 1))

[0]	eval-rmse:2585.05	train-rmse:3655.52
Multiple eval metrics have been passed: 'train-rmse' will be used for early stopping.

Will train until train-rmse hasn't improved in 50 rounds.
[100]	eval-rmse:861.942	train-rmse:776.137
[200]	eval-rmse:902.681	train-rmse:632.025
[300]	eval-rmse:930.063	train-rmse:559.611
[400]	eval-rmse:931.952	train-rmse:515.458
[500]	eval-rmse:943.385	train-rmse:479.87
[600]	eval-rmse:952.558	train-rmse:457.164
[700]	eval-rmse:958.32	train-rmse:435.768
[800]	eval-rmse:966.928	train-rmse:419.903
[900]	eval-rmse:967.309	train-rmse:405.383
[999]	eval-rmse:970.289	train-rmse:391.61
The R-Squared value: 0.8691
Mean Squared Error: 941460.8451
Root Mean Squared Error: 970.2891
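
In the log above, `eval-rmse` rises from 861.942 at round 100 to 970.289 at round 999 while `train-rmse` keeps falling. When the watchlist has more than one entry, `xgb.train` uses the last one for early stopping, which here is `train`, so training runs all 1000 rounds. Listing the hold-out set last would make early stopping follow `eval-rmse` instead. The stopping rule itself can be sketched in plain Python; the function name and the per-round scores below are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, best_score) for a metric where lower is better.

    Scanning stops once `patience` rounds have passed without improving
    on the best score seen so far.
    """
    best_round, best_score = 0, scores[0]
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            break
    return best_round, best_score

# Hypothetical validation RMSE per boosting round, for illustration only.
eval_rmse = [2585.0, 1500.0, 900.0, 862.0, 870.0, 880.0, 903.0, 850.0]
best_round, best_score = early_stopping_round(eval_rmse, patience=3)
print(best_round, best_score)
```

With a patience of 3 the scan stops before it reaches the later dip to 850.0. Patience is a trade-off between training time and the risk of stopping too soon.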


In [80]:
def run_XGB(train, target, test):
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
    X_train, X_test, y_train, y_test= train_test_split(train, target, random_state= 0, test_size= 0.15)
    
    param_xgb= {}
    param_xgb['eta']= 0.1
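    # 'reg:linear' is the deprecated name of this objective; newer XGBoost releases call it 'reg:squarederror'
    param_xgb['objective']= 'reg:squarederror'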
    param_xgb['subsample']= 0.8
    param_xgb['colsample_bytree']= 0.7
    param_xgb['max_depth']= 3
    param_xgb['eval_metric']= 'rmse'
    param_xgb['seed']= 0
    
    dtrain= xgb.DMatrix(X_train, y_train, silent= True)
    dtest= xgb.DMatrix(X_test, y_test, silent= True)
    test_= xgb.DMatrix(test)

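    # The last watchlist entry drives early stopping, so this order monitors train-rmse;
    # list (dtest, 'eval') last to stop on the hold-out score.
    watchlist= [(dtest, 'eval'), (dtrain, 'train')]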
    
    model= xgb.train(param_xgb, dtrain, 1000, watchlist, early_stopping_rounds= 50, verbose_eval= 100)
    
    y_pred= model.predict(dtest)

    r_squared= r2_score(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    #msle= mean_squared_log_error(y_test, y_pred)
    #rmsle= np.sqrt(msle)
    print('The R-Squared value: {:.4f}' .format(r_squared))
    print('Mean Squared Error: {:.4f}' .format(mse))
    print('Root Mean Squared Error: {:.4f}' .format(rmse))
    #print('Root Mean Squared Log Error: {:.4f}' .format(rmsle))
    
    return model.predict(test_)

In [73]:
train_= pd.read_csv('train.csv')
test_= pd.read_csv('test.csv')
samp= pd.read_csv('sample.csv')

In [76]:
# Fit the encoder on the training tags and reuse it for the test set so both share one mapping
train_['Tag']= le.fit_transform(train_['Tag'].astype(str))
test_['Tag']= le.transform(test_['Tag'].astype(str))

In [81]:
pred_7= run_XGB(train.drop(['ID', 'Upvotes'], axis= 1), train['Upvotes'],
                test.drop(['ID'], axis= 1))

[0]	eval-rmse:2621.47	train-rmse:3500.65
Multiple eval metrics have been passed: 'train-rmse' will be used for early stopping.

Will train until train-rmse hasn't improved in 50 rounds.
[100]	eval-rmse:827.61	train-rmse:760.952
[200]	eval-rmse:840.4	train-rmse:625.349
[300]	eval-rmse:848.906	train-rmse:564.636
[400]	eval-rmse:857.448	train-rmse:514.769
[500]	eval-rmse:861.676	train-rmse:483.894
[600]	eval-rmse:876.373	train-rmse:463.058
[700]	eval-rmse:879.001	train-rmse:444.33
[800]	eval-rmse:883.476	train-rmse:426.683
[900]	eval-rmse:888.642	train-rmse:412.349
[999]	eval-rmse:898.327	train-rmse:401.482
The R-Squared value: 0.8942
Mean Squared Error: 806992.1483
Root Mean Squared Error: 898.3274


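##### This run gave the best result: an RMSE of 898.3274. It differs from the previous XGBoost run in keeping `Username` as a feature and in using a 15% hold-out (`test_size= 0.15`), so the two scores are not measured on the same validation set.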

In [82]:
samp_12= pd.DataFrame({
    'ID': test['ID'],
    'Upvotes': pred_7
})

In [83]:
samp_12.to_csv('sample_12.csv', index= False)

#### LightGBM

In [93]:
def run_LGBM(train, target, test):
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
    X_train, X_test, y_train, y_test= train_test_split(train, target, random_state= 2018)
    
    param_lgbm= {}
    param_lgbm['learning_rate']= 0.1
    param_lgbm['objective']= 'regression'
    param_lgbm['num_iterations']= 10000
    param_lgbm['num_threads']= 7
    param_lgbm['seed']= 0
    #param_lgbm['max_depth']= 3
    param_lgbm['early_stopping_round']= 50
    #param_lgbm['verbose']= 100
    param_lgbm['metric']= 'rmse'
    
    dtrain= lgbm.Dataset(X_train, y_train, silent= True)
    dtest= lgbm.Dataset(X_test, y_test, silent= True)
    
    model= lgbm.train(param_lgbm, dtrain, valid_sets= [dtrain, dtest], verbose_eval= 100)
    
    y_pred= model.predict(X_test)

    r_squared= r2_score(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    #msle= mean_squared_log_error(y_test, y_pred)
    #rmsle= np.sqrt(msle)
    print('The R-Squared value: {:.4f}' .format(r_squared))
    print('Mean Squared Error: {:.4f}' .format(mse))
    print('Root Mean Squared Error: {:.4f}' .format(rmse))
    #print('Root Mean Squared Log Error: {:.4f}' .format(rmsle))
    
    return model.predict(test)

In [94]:
pred_8= run_LGBM(train.drop(['ID', 'Upvotes', 'Username'], axis= 1), train['Upvotes'],
                test.drop(['ID', 'Username'], axis= 1))



Training until validation scores don't improve for 50 rounds.
[100]	training's rmse: 1024.24	valid_1's rmse: 1692.53
[200]	training's rmse: 770.82	valid_1's rmse: 1628.78
[300]	training's rmse: 638.169	valid_1's rmse: 1591.78
[400]	training's rmse: 566.011	valid_1's rmse: 1584.38
[500]	training's rmse: 519.451	valid_1's rmse: 1579.29
Early stopping, best iteration is:
[450]	training's rmse: 540.934	valid_1's rmse: 1578.56
The R-Squared value: 0.8119
Mean Squared Error: 2491842.6421
Root Mean Squared Error: 1578.5571


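#### Among the models tried, XGBoost gave the lowest hold-out RMSE (898.33), followed by CatBoost (976.18), LightGBM (1578.56) and linear regression (2189.96). These scores come from different splits (test sizes and random seeds), so the ranking is indicative rather than exact.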
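
One way to put all the models on the same footing is to score each of them on identical cross-validation folds. The sketch below generates such folds in plain Python. The helper name is hypothetical, and in practice scikit-learn's `KFold`, which is imported earlier in this notebook but not used, does the same job.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# Ten hypothetical rows split into three folds.
folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

Fitting every model on each `train_idx` and averaging its RMSE over the matching `valid_idx` sets would give figures that can be compared directly.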