# <font color='red'> Project 2 : Buzz Prediction on Twitter

Project Description:
- There are two different datasets for Regression and Classification tasks. Right-most column in both the datasets is a dependent variable i.e. buzz.
- Data description files are also provided for both the datasets.
- Deciding which dataset is for which task is part of the project.
- Read data into Jupyter notebook, use pandas to import data into a data frame.
- Preprocess data: Explore data, check for missing data and apply data scaling. Justify the type of scaling used.

Regression Task:
- Apply all the regression models you've learned so far. If your model has a scaling parameter(s) use Grid Search to find the best scaling parameter. Use plots and graphs to help you get a better glimpse of the results. 
- Then use cross-validation to find average training and testing score. 
- Your submission should have at least the following regression models: KNN regressor, linear regression, Ridge, Lasso, polynomial regression, SVM both simple and with kernels. 
- Finally, find the best regressor for this dataset and train your model on the entire dataset using the best parameters and predict buzz for the test_set.

Classification Task:
- Decide about a good evaluation strategy and justify your choice.
- Find best parameters for the following classification models: KNN classification, Logistic Regression, Linear Support Vector Machine, Kernelized Support Vector Machine, Decision Tree. 
- Which model gives the best results?

Deliverables:
- Submit IPython notebook. Use markdown to provide inline comments for this project.
- Rename notebook with your group number and submit only one notebook. Before submitting, make sure everything runs as expected. To check that, restart the kernel (in the menubar, select Kernel > Restart) and then run all cells (in the menubar, select Cell > Run All).
- Visualization encouraged.

Questions regarding the project:
- We have created a discussion board under Projects folder on e-learning. Create threads over there and post your queries related to project there.
- There is a high possibility that your classmate has also faced the same problem and knows the solution. So this is an effort to encourage collaborative learning, reducing mails for frequently asked queries and also making all the information available to everyone.
- Please check existing threads for your query before creating a new one. It goes without saying that do not share your code or complete solutions there.
- We will also answer queries there. We will not be answering any project related queries through the mail.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)

data=pd.read_csv('Twitter.data',header=None) ## Regression
# dataC=pd.read_csv('Twitter-Absolute-Sigma-500.data',header=None) ## Classification 


### Data Inspect

In [2]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,68,69,70,71,72,73,74,75,76,77
0,0,2,0,0,1,1,1,0,1,0,...,1.0,1.0,0,2,0,0,1,1,1,0.0
1,2,1,0,0,0,0,4,2,1,0,...,0.0,1.0,2,1,0,0,0,0,4,0.5
2,1,0,0,0,0,4,1,1,0,0,...,1.0,1.0,1,0,0,0,0,4,1,0.0
3,1,0,0,1,0,0,1,1,0,0,...,0.0,1.0,1,0,0,1,0,0,1,2.5
4,0,1,0,0,1,2,3,0,1,0,...,1.0,1.0,0,1,0,0,1,2,3,0.5


In [3]:
#  Columns Names as per the description Provided

colnames = ["NCD","AI","AS(NA)","BL","NAC","AS(NAC)","CS","AT","NA","ADL","NAD","Buzz_Magnitude"]

# Function for Column name in the data set

def colrange(n):
    if n in range(0,7):
        return(colnames[0]+'_'+str(n-0))
    elif n in range(7,14) :
        return(colnames[1]+'_'+str(n-7))
    elif n in range(14,21) :
        return(colnames[2]+'_'+str(n-14))
    elif n in range(21,28) :
        return(colnames[3]+'_'+str(n-21))
    elif n in range(28,35) :
        return(colnames[4]+'_'+str(n-28))
    elif n in range(35,42) :
        return(colnames[5]+'_'+str(n-35))
    elif n in range(42,49) :
        return(colnames[6]+'_'+str(n-42))
    elif n in range(49,56) :
        return(colnames[7]+'_'+str(n-49))
    elif n in range(56,63) :
        return(colnames[8]+'_'+str(n-56))
    elif n in range(63,70) :
        return(colnames[9]+'_'+str(n-63))
    elif n in range(70,77) :
        return(colnames[10]+'_'+str(n-70))
    elif n==77 :
        return(colnames[11])    

# Renaming Columns name 
data.columns=[colrange(col) for col in data.columns]

In [4]:
data.shape

(583250, 78)

In [5]:
data.head()

Unnamed: 0,NCD_0,NCD_1,NCD_2,NCD_3,NCD_4,NCD_5,NCD_6,AI_0,AI_1,AI_2,...,ADL_5,ADL_6,NAD_0,NAD_1,NAD_2,NAD_3,NAD_4,NAD_5,NAD_6,Buzz_Magnitude
0,0,2,0,0,1,1,1,0,1,0,...,1.0,1.0,0,2,0,0,1,1,1,0.0
1,2,1,0,0,0,0,4,2,1,0,...,0.0,1.0,2,1,0,0,0,0,4,0.5
2,1,0,0,0,0,4,1,1,0,0,...,1.0,1.0,1,0,0,0,0,4,1,0.0
3,1,0,0,1,0,0,1,1,0,0,...,0.0,1.0,1,0,0,1,0,0,1,2.5
4,0,1,0,0,1,2,3,0,1,0,...,1.0,1.0,0,1,0,0,1,2,3,0.5


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583250 entries, 0 to 583249
Data columns (total 78 columns):
NCD_0             583250 non-null int64
NCD_1             583250 non-null int64
NCD_2             583250 non-null int64
NCD_3             583250 non-null int64
NCD_4             583250 non-null int64
NCD_5             583250 non-null int64
NCD_6             583250 non-null int64
AI_0              583250 non-null int64
AI_1              583250 non-null int64
AI_2              583250 non-null int64
AI_3              583250 non-null int64
AI_4              583250 non-null int64
AI_5              583250 non-null int64
AI_6              583250 non-null int64
AS(NA)_0          583250 non-null float64
AS(NA)_1          583250 non-null float64
AS(NA)_2          583250 non-null float64
AS(NA)_3          583250 non-null float64
AS(NA)_4          583250 non-null float64
AS(NA)_5          583250 non-null float64
AS(NA)_6          583250 non-null float64
BL_0              583250 non-null f

In [7]:
data.describe()


Unnamed: 0,NCD_0,NCD_1,NCD_2,NCD_3,NCD_4,NCD_5,NCD_6,AI_0,AI_1,AI_2,...,ADL_5,ADL_6,NAD_0,NAD_1,NAD_2,NAD_3,NAD_4,NAD_5,NAD_6,Buzz_Magnitude
count,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,...,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0,583250.0
mean,140.33964,136.770147,159.679271,181.592091,201.097445,220.175371,219.388214,71.038051,69.829631,82.198203,...,1.136688,1.140372,140.78986,137.18127,160.105922,182.05744,201.596482,220.7059,219.936864,191.279493
std,431.772639,432.305129,502.057428,574.883713,630.448432,669.20593,672.182204,196.876718,202.199758,239.523042,...,1.432327,1.552313,432.624954,433.026611,502.774408,575.658022,631.258318,670.050977,673.032541,612.352354
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,4.0,4.0,5.0,6.0,6.0,2.0,2.0,2.0,...,1.0,1.0,3.0,3.0,4.0,4.0,5.0,6.0,6.0,4.5
50%,18.0,17.0,21.0,24.0,27.0,31.0,30.0,11.0,11.0,13.0,...,1.0,1.0,18.0,17.0,21.0,24.0,27.0,31.0,31.0,25.5
75%,104.0,100.0,115.0,131.0,147.0,166.0,164.0,59.0,57.0,65.0,...,1.090909,1.091296,104.0,101.0,115.0,131.0,148.0,167.0,165.0,139.0
max,24210.0,29574.0,37505.0,72366.0,79079.0,79079.0,79079.0,18654.0,22035.0,29402.0,...,262.0,295.0,24301.0,29574.0,37505.0,72366.0,79083.0,79083.0,79083.0,75724.5


### Data Leakage

In [8]:
# Determining the Correlation of Buzz_Magnitude with other attributes.
co_matrix=data.corr()
co_matrix["Buzz_Magnitude"].sort_values(ascending=False)

  

Buzz_Magnitude    1.000000
NCD_6             0.955330
NAD_6             0.955299
NAC_6             0.951809
NCD_5             0.918565
NAD_5             0.918533
NAC_5             0.915110
NA_6              0.905061
NCD_1             0.889884
NAD_1             0.889744
NCD_4             0.886262
NAD_4             0.886251
NAC_1             0.883589
NAC_4             0.883327
NCD_0             0.883316
NAD_0             0.883073
NAC_0             0.875533
NCD_2             0.875168
NAD_2             0.875099
NCD_3             0.873060
NAD_3             0.873026
NAC_2             0.869843
NAC_3             0.869301
NA_5              0.868421
AS(NAC)_6         0.866314
AS(NAC)_0         0.853992
NA_1              0.850840
AS(NAC)_1         0.850498
AS(NAC)_5         0.845093
NA_0              0.843521
                    ...   
AI_3              0.756556
AI_4              0.756219
BL_2              0.091175
CS_2              0.089560
BL_0              0.085147
BL_3              0.085041
C

### Split Data

In [9]:
#  Defingin Target Variable & Independent Variables
XFull = data.iloc[:, :-1]
YFull = data.iloc[:, -1]


In [10]:
# Splitting Data set Randomly such that size is 10% of the original data 

from sklearn.model_selection import train_test_split
_, sample_data, _, sample_target = train_test_split(XFull, YFull, shuffle = True, test_size = 0.1)

In [11]:
sample_data.shape

(58325, 77)

In [12]:
sample_target.shape

(58325,)

In [13]:
X_train_org, X_test_org, Y_train, Y_test = train_test_split(sample_data[:1000],sample_target[:1000])

In [14]:
def Skewness(data):
    s = (data.mean(),data.median())
    s = pd.concat(s,axis=1)
    s.columns = ['mean','median']
    return(s)
Skewness(data)

Unnamed: 0,mean,median
NCD_0,140.339640,18.000000
NCD_1,136.770147,17.000000
NCD_2,159.679271,21.000000
NCD_3,181.592091,24.000000
NCD_4,201.097445,27.000000
NCD_5,220.175371,31.000000
NCD_6,219.388214,30.000000
AI_0,71.038051,11.000000
AI_1,69.829631,11.000000
AI_2,82.198203,13.000000


#### The use of scaling algorithm in the listed below methods will be Min Max Scalar Method
#### as the independent variables are not normally distributed and are Right Skewed as mean is towards the right of median
#### which we can see in the listed above table having 'Mean' and 'Median' values for all variables
#### so Standard Scaler will not work well 
#### and Min Max scalar Works well with such data set distribution as compared to Standard Scalar 


In [15]:
from sklearn.preprocessing import MinMaxScaler StandardScaler
# scaler = MinMaxScaler()
scaler=StandardScaler()
X_train = scaler.fit_transform(X_train_org)
X_test = scaler.transform(X_test_org)

### Model 1 :  Linear regression (Normal Equation)

In [16]:
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
from sklearn.model_selection import cross_val_score


In [17]:
CvScores = cross_val_score(lreg, X_train, Y_train, cv=5)
print("CV Scores: {}".format(CvScores))
CvScores

CV Scores: [0.78181402 0.84931164 0.93561898 0.98952516 0.93717388]


array([0.78181402, 0.84931164, 0.93561898, 0.98952516, 0.93717388])

In [18]:
Train_Score = CvScores.mean()
print("Train Set Score: {:.2f}".format(Train_Score))
lreg.fit(X_train, Y_train)
Test_Score = lreg.score(X_test,Y_test)
print("Test Set Score: {:.2f}".format(Test_Score))


Train Set Score: 0.90
Test Set Score: 0.93


In [19]:
report_table = [['Linear Regression', 'None', Train_Score, Test_Score]]

### Model 2 : KNN Regressor 

In [20]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
model_knn = KNeighborsRegressor()


In [21]:
param_knn = {'n_neighbors':[1, 5, 10, 15, 20]}
print("Defined Parameters:\n{}".format(param_knn))


Defined Parameters:
{'n_neighbors': [1, 5, 10, 15, 20]}


In [22]:
grid_knn = GridSearchCV(model_knn, param_grid = param_knn, cv=5, return_train_score=True)
grid_knn.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 5, 10, 15, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [23]:
print("Best cross-validation accuracy: {:.2f}".format(grid_knn.best_score_))
print("Best parameters: {}".format(grid_knn.best_params_))
print("Train Set Score: {}".format(grid_knn.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_knn.score(X_test,Y_test)))

Best cross-validation accuracy: 0.66
Best parameters: {'n_neighbors': 1}
Train Set Score: 1.0
Test Set Score: 0.8980074818676087


In [24]:
report_table = report_table + [['KNN Regression', 'K = 5', grid_knn.score(X_train, Y_train), grid_knn.score(X_test,Y_test)]]

### Model 3 : Ridge

In [25]:
from sklearn.linear_model import Ridge
model_ridge = Ridge()
param_ridge = {'alpha': [0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300,1000]}
print("Defined Parameters:\n{}".format(param_ridge))


Defined Parameters:
{'alpha': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]}


In [26]:
grid_ridge = GridSearchCV(estimator = model_ridge,param_grid = param_ridge, cv=5, return_train_score=True)
grid_ridge.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [27]:
print("Best cross-validation accuracy: {:.2f}".format(grid_ridge.best_score_))
print("Best parameters: {}".format(grid_ridge.best_params_))
print("Train Set Score: {}".format(grid_ridge.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_ridge.score(X_test, Y_test)))

Best cross-validation accuracy: 0.92
Best parameters: {'alpha': 0.01}
Train Set Score: 0.9901926951809106
Test Set Score: 0.9336262424113778


In [28]:
report_table = report_table + [['Ridge Regression', 'alpha = 0.3', grid_ridge.score(X_train, Y_train), grid_ridge.score(X_test,Y_test)]]

### Model 4 : Lasso

In [29]:
import warnings
from  sklearn.linear_model import Lasso
model_lasso = Lasso()
warnings.filterwarnings('ignore')

In [30]:
param_lasso = {'alpha': [0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300,1000]}
print("Defined Parameters:\n{}".format(param_lasso))


Defined Parameters:
{'alpha': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]}


In [31]:
grid_lasso = GridSearchCV(model_lasso, param_grid = param_lasso, cv=5, return_train_score=True)
grid_lasso.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [32]:
print("Best cross-validation accuracy: {:.2f}".format(grid_lasso.best_score_))
print("Best parameters: {}".format(grid_lasso.best_params_))
print("Train Set Score: {}".format(grid_lasso.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_lasso.score(X_test, Y_test)))

Best cross-validation accuracy: 0.92
Best parameters: {'alpha': 0.01}
Train Set Score: 0.9905320861757804
Test Set Score: 0.9324472767715222


In [33]:
report_table = report_table + [['Lasso Regression', 'alpha = 1', grid_lasso.score(X_train, Y_train), grid_lasso.score(X_test,Y_test)]]

### Model 5 : Polynomial Regression



In [34]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)

In [35]:
Train_Score_List = []
Test_Score_List = []
lreg_poly = LinearRegression()

X_Train_Poly = poly.fit_transform(X_train)
X_Test_Poly = poly.transform(X_test)
PolCVScore = cross_val_score(lreg_poly, X_Train_Poly, Y_train, cv=5)
lreg_poly.fit(X_Train_Poly,Y_train)
Train_ScorePoly = lreg_poly.score(X_Test_Poly,Y_test)

Train_ScorePoly = PolCVScore.mean()
print("Train Set Score: {:.2f}".format(PolCVScore.mean()))
print("Test Set Score: {:.2f}".format(Train_ScorePoly))





Train Set Score: -33329.78
Test Set Score: -33329.78


In [36]:
report_table = report_table + [['Polynomial Regression', 'Degree = 2', PolCVScore.mean(), Train_ScorePoly]]

### Model 6 : Stochastic Gradient Descent Regressor

In [37]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDRegressor
sgd_reg=SGDRegressor()


In [38]:
param_sgd = {'max_iter': [10, 100, 1000],'learning_rate':['optimal'], 'penalty' :['l1','l2'],'random_state':[0]}
print("Defined Parameters:\n{}".format(param_sgd))


Defined Parameters:
{'max_iter': [10, 100, 1000], 'learning_rate': ['optimal'], 'penalty': ['l1', 'l2'], 'random_state': [0]}


In [39]:
grid_sgd = GridSearchCV(estimator = sgd_reg, param_grid = param_sgd, cv=5, return_train_score=True)
grid_sgd.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=None, n_iter=None, penalty='l2',
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_iter': [10, 100, 1000], 'learning_rate': ['optimal'], 'penalty': ['l1', 'l2'], 'random_state': [0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [40]:
print("Best parameters: {}".format(grid_sgd.best_params_))
grid_sgdata = pd.DataFrame(grid_sgd.cv_results_)
grid_sgdata = grid_sgdata[['params','mean_train_score','mean_test_score']]
grid_sgdata


Best parameters: {'learning_rate': 'optimal', 'max_iter': 1000, 'penalty': 'l2', 'random_state': 0}


Unnamed: 0,params,mean_train_score,mean_test_score
0,"{'learning_rate': 'optimal', 'max_iter': 10, '...",-6.337536e+20,-1.578949e+21
1,"{'learning_rate': 'optimal', 'max_iter': 10, '...",-4.218091e+20,-2.351593e+21
2,"{'learning_rate': 'optimal', 'max_iter': 100, ...",-1.329109e+18,-4.000668e+18
3,"{'learning_rate': 'optimal', 'max_iter': 100, ...",-1.021762e+18,-5.311891e+18
4,"{'learning_rate': 'optimal', 'max_iter': 1000,...",-596352400000000.0,-1.133848e+16
5,"{'learning_rate': 'optimal', 'max_iter': 1000,...",-96511960000.0,-1381004000000.0


In [41]:
print("Best cross-validation accuracy: {:.2f}".format(grid_sgd.best_score_))
print("Best parameters: {}".format(grid_sgd.best_params_))
print("Train Set Score: {}".format(grid_sgd.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_sgd.score(X_test, Y_test)))



Best cross-validation accuracy: -1381004065969.10
Best parameters: {'learning_rate': 'optimal', 'max_iter': 1000, 'penalty': 'l2', 'random_state': 0}
Train Set Score: -17240453019.346058
Test Set Score: -113425297454.25739


In [42]:
report_table = report_table + [['SGD Regression', grid_sgd.best_params_, grid_sgd.score(X_train, Y_train), grid_sgd.score(X_test, Y_test)]]

### Model 7 : Linear SVR

In [43]:
from sklearn.svm import LinearSVR
model_LSVM = LinearSVR()
param_LSVM = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
print("Defined Parameters:\n{}".format(param_LSVM))


Defined Parameters:
{'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}


In [44]:
grid_LSVM = GridSearchCV(model_LSVM, param_grid = param_LSVM, cv=5, return_train_score=True)
grid_LSVM.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [45]:
print("Best parameters: {}".format(grid_LSVM.best_params_))
grid_LSVMdata = pd.DataFrame(grid_LSVM.cv_results_)
grid_LSVMdata = grid_LSVMdata[['params','mean_train_score','mean_test_score']]
grid_LSVMdata


Best parameters: {'C': 1000}


Unnamed: 0,params,mean_train_score,mean_test_score
0,{'C': 0.001},-0.055393,-0.143262
1,{'C': 0.01},-0.047553,-0.119754
2,{'C': 0.1},-0.036016,-0.09306
3,{'C': 1},0.025304,0.002497
4,{'C': 10},0.411904,0.52103
5,{'C': 100},0.732406,0.79119
6,{'C': 1000},0.958092,0.853302


In [46]:
print("Best cross-validation accuracy: {:.2f}".format(grid_LSVM.best_score_))
print("Best parameters: {}".format(grid_LSVM.best_params_))
print("Train Set Score: {}".format(grid_LSVM.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_LSVM.score(X_test, Y_test)))

Best cross-validation accuracy: 0.85
Best parameters: {'C': 1000}
Train Set Score: 0.9650802118177149
Test Set Score: 0.9442482551780127


In [47]:
report_table = report_table + [['Linear SVM', grid_LSVM.best_params_, grid_LSVM.score(X_train, Y_train), grid_LSVM.score(X_test, Y_test)]]

### Model 8 : SVM Kernel ' rbf '

In [48]:
from sklearn.svm import SVR
model_RadSVM = SVR(kernel = 'rbf')
param_RadSVM = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
print("Defined Parameters:\n{}".format(param_RadSVM))


Defined Parameters:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}


In [49]:
grid_RadSVM = GridSearchCV(model_RadSVM, param_grid = param_RadSVM, cv=5, return_train_score=True)
grid_RadSVM.fit(X_train, Y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [50]:
print("Best parameters: {}".format(grid_RadSVM.best_params_))
grid_RadSVMdata = pd.DataFrame(grid_RadSVM.cv_results_)
grid_RadSVMdata = grid_RadSVMdata[['params','mean_train_score','mean_test_score']]
grid_RadSVMdata


Best parameters: {'C': 100, 'gamma': 0.1}


Unnamed: 0,params,mean_train_score,mean_test_score
0,"{'C': 0.001, 'gamma': 0.001}",-0.044536,-0.108639
1,"{'C': 0.001, 'gamma': 0.01}",-0.044535,-0.108638
2,"{'C': 0.001, 'gamma': 0.1}",-0.044537,-0.108632
3,"{'C': 0.001, 'gamma': 1}",-0.044547,-0.108645
4,"{'C': 0.001, 'gamma': 10}",-0.044537,-0.10863
5,"{'C': 0.001, 'gamma': 100}",-0.044519,-0.108583
6,"{'C': 0.01, 'gamma': 0.001}",-0.044535,-0.108637
7,"{'C': 0.01, 'gamma': 0.01}",-0.044526,-0.10862
8,"{'C': 0.01, 'gamma': 0.1}",-0.044543,-0.108564
9,"{'C': 0.01, 'gamma': 1}",-0.044642,-0.108678


In [51]:
print("Best cross-validation accuracy: {:.2f}".format(grid_RadSVM.best_score_))
print("Best parameters: {}".format(grid_RadSVM.best_params_))
print("Train Set Score: {}".format(grid_RadSVM.score(X_train, Y_train)))
print("Test Set Score: {}".format(grid_RadSVM.score(X_test, Y_test)))

Best cross-validation accuracy: 0.47
Best parameters: {'C': 100, 'gamma': 0.1}
Train Set Score: 0.15403078753853927
Test Set Score: 0.5351811726903493


In [52]:
report_table = report_table + [['SVM kernel Rbf', grid_RadSVM.best_params_, grid_RadSVM.score(X_train, Y_train), grid_RadSVM.score(X_test, Y_test)]]

### Model 9 : SVM Kernel 'poly' 

In [53]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
model_PolySVM = SVR(kernel = 'poly', degree = 2)
param_PolySVM = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
print("Defined Parameters:\n{}".format(param_PolySVM))


Defined Parameters:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}


In [54]:
# grid_PolySVM = GridSearchCV(model_PolySVM, param_grid = param_PolySVM, cv=5, return_train_score=True)
# grid_PolySVM.fit(X_train[:200], Y_train[:200])

In [55]:
# print("Best parameters: {}".format(grid_PolySVM.best_params_))
# grid_PolySVMdata = pd.DataFrame(grid_PolySVM.cv_results_)
# grid_PolySVMdata = grid_PolySVMdata[['params','mean_train_score','mean_test_score']]
# grid_PolySVMdata


In [56]:
# print("Best cross-validation accuracy: {:.2f}".format(grid_PolySVM.best_score_))
# print("Best parameters: {}".format(grid_PolySVM.best_params_))
# print("Train Set Score: {}".format(grid_PolySVM.score(X_train, Y_train)))
# print("Test Set Score: {}".format(grid_PolySVM.score(X_test, Y_test)))

In [57]:
# report_table = report_table + [['SVM kernel Poly', grid_PolySVM.best_params_, grid_PolySVM.score(X_train, Y_train), grid_PolySVM.score(X_test, Y_test)]]

In [58]:
report = pd.DataFrame(report_table,columns = ['Model name', 'Model parameter', 'Train accuracy', 'Test accuracy'])
report['Test accuracy']=report['Test accuracy'].apply(lambda x: '%.4f' % x)
report['Train accuracy']=report['Train accuracy'].apply(lambda x: '%.4f' % x)
# report['Train accuracy'].apply(lambda x: '%.4f' % x)
report

Unnamed: 0,Model name,Model parameter,Train accuracy,Test accuracy
0,Linear Regression,,0.8987,0.9285
1,KNN Regression,K = 5,1.0,0.898
2,Ridge Regression,alpha = 0.3,0.9902,0.9336
3,Lasso Regression,alpha = 1,0.9905,0.9324
4,Polynomial Regression,Degree = 2,-33329.7795,-33329.7795
5,SGD Regression,"{'learning_rate': 'optimal', 'max_iter': 1000,...",-17240453019.3461,-113425297454.2574
6,Linear SVM,{'C': 1000},0.9651,0.9442
7,SVM kernel Rbf,"{'C': 100, 'gamma': 0.1}",0.154,0.5352


In [59]:
report.iloc[6,:]

Model name          Linear SVM
Model parameter    {'C': 1000}
Train accuracy          0.9651
Test accuracy           0.9442
Name: 6, dtype: object

### On Comparing all the models we can se it is evident that the Linear SVM Model  is with more accuracy in Test set then in Train set . Hence we are going to predict for the entire dataset and check the scores.


In [64]:
YFull.head()

0    0.0
1    0.5
2    0.0
3    2.5
4    0.5
Name: Buzz_Magnitude, dtype: float64

In [73]:
X_train_org, X_test_org, y_train, y_test = train_test_split(XFull,YFull, random_state = 0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
X_test = scaler.transform(X_test_org)

In [76]:
from sklearn.svm import LinearSVR

svc_lin = LinearSVR()
param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100]}

grid_svc_lin = GridSearchCV(svc_lin, param_grid, cv = 5, return_train_score=True)
grid_svc_lin.fit(X_train_whole, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [77]:
Final_report = [['Linear SVM', grid_svc_lin.best_params_, grid_svc_lin.score(X_train_whole, y_train), grid_svc_lin.score(X_test_whole, y_test)]]

In [82]:
Final_report=pd.DataFrame(Final_report,columns=['Model name', 'Model parameter', 'Train accuracy', 'Test accuracy'])
Final_report

Unnamed: 0,Model name,Model parameter,Train accuracy,Test accuracy
0,Linear SVM,{'C': 100},0.902165,0.91423
