Algorithm for user bucketing by joke category.

Approach:
==================

Steps:
1.	Obtain the joke-categorized dataset.
2.	Retrieve the jokes for the specified category.
3.	Classify/cluster each user's likes and dislikes for the jokes in the specified category.
4.	List the users who like the jokes belonging to the specified category.

Training:
The dataset is split into 90% training and 10% test data.
1.	User likes and dislikes are derived from the ratings.
2.	Each model is trained on the user likes and dislikes.
3.	Each (user, joke) pair is classified as 0 (dislike) or 1 (like) using the algorithms listed under RESULTS.
4.	The classification output is combined with the joke categories to bucket users by joke category.
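
Training step 1 does not state the rule that turns a rating into a like or a dislike. A minimal sketch, assuming ratings on a signed scale such as Jester's -10 to +10 and a hypothetical cutoff of 0 (the function name `rating_to_like` and the threshold are illustrative and not taken from the notebook):

```python
def rating_to_like(rating, threshold=0.0):
    """Map a numeric rating to 1 (like) or 0 (dislike).

    The threshold is an assumption; the notebook only states that
    likes and dislikes are decided from ratings.
    """
    return 1 if rating > threshold else 0


ratings = [7.5, -3.2, 0.0, 9.9, -8.1]
likes = [rating_to_like(r) for r in ratings]
print(likes)  # [1, 0, 0, 1, 0]
```

A rating exactly at the cutoff is treated as a dislike here; a different convention would only change the comparison operator.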

Datasets used:
These datasets are generated by the recommendation system and the joke categorization system:

jester_dataset_2/joke_user_bucket_dataset.csv
jester_dataset_2/joke_category_dataset.csv


RESULTS:
==================
All bucket results are stored under buckets/<bucketname>_user_bucket.txt

Accuracy scores:

Model	Accuracy
Naïve Bayes	1
Random Forest	1
Nearest Centroid	0.51427645
K-Nearest Neighbors	0.911012989

Bucket Analysis:

Naïve Bayes:
Total Number of Users for bucket: politics 5161
Total Number of Users for bucket: animal 3159
Total Number of Users for bucket: doctor 3036
Total Number of Users for bucket: others 25020

KNN:
Total Number of Users for bucket: animal 3311
Total Number of Users for bucket: politics 5503
Total Number of Users for bucket: doctor 3197
Total Number of Users for bucket: others 26425


In [3]:
import pandas as pd
import numpy as np
import scipy as sp

from sklearn import svm
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [4]:

def read_joke_user_likes_data():
    # Load the user/joke like-dislike table produced by the recommendation system.
    return pd.read_csv('jester_dataset_2/joke_user_bucket_dataset.csv')

In [5]:
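# Features: the full table as loaded (it still includes the 'likes' column).
X = read_joke_user_likes_data()
# Target: 1 = like, 0 = dislike.
Y = X['likes']
print("X total length:",len(X))
print("Y total length:",len(Y))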

X total length: 1048575
Y total length: 1048575
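
Because Y is read from X without removing the column, X still contains `likes`, so every model is given the answer as an input feature. This is a likely explanation for the perfect scores that Naive Bayes and Random Forest report further down. A minimal sketch of separating features from the target, on a toy frame whose column names match those used in the notebook (the real file may hold more columns, and `X_clean` is a hypothetical name):

```python
import pandas as pd

# Toy stand-in for joke_user_bucket_dataset.csv (values are made up).
data = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "joke_id": [10, 11, 10, 12, 11, 12],
    "likes":   [1, 0, 1, 1, 0, 0],
})

y = data["likes"]                        # target
X_leaky = data                           # target is still among the features
X_clean = data.drop(columns=["likes"])   # features only

print("likes" in X_leaky.columns)  # True
print("likes" in X_clean.columns)  # False
```

`user_id` and `joke_id` stay in `X_clean`, so the later cell that rebuilds the output table from `X_test` would still work. Scores obtained this way would differ from the ones recorded in this notebook.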


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.10, random_state=99)

print("X_train total length:",len(X_train))
print("X_test total length:",len(X_test))
print("Y_train total length:",len(y_train))


X_train total length: 943717
X_test total length: 104858
Y_train total length: 943717


In [7]:
model_nb = GaussianNB()
model_svm = svm.SVC()
model_rfc = RandomForestClassifier()
model_knn_centroid = NearestCentroid()
model_knn = KNeighborsClassifier()


best_model = None
best_accuracy = 0
best_preds = None

In [8]:
# Naive Bayes analysis
model_nb.fit(X_train, y_train)
preds = model_nb.predict(X_test)
accuracy = accuracy_score(y_test, preds)
# mean_squared_error returns the MSE; no square root is taken
mse_nb = mean_squared_error(y_test, preds)
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = model_nb
    best_preds = preds

print("Name:Naive Bayes")
print("Accuracy score: ", accuracy)
print("MSE: ", mse_nb)

Name:Naive Bayes
Accuracy score:  1.0
MSE:  0.0


In [9]:
#svm analysis
# model_svm.fit(X_train, y_train)
# preds = model_svm.predict(X_test)
# accuracy = accuracy_score(y_test, preds)
# if accuracy > best_accuracy:
#     best_accuracy = accuracy
#     best_model = model_svm
#     best_preds = preds
    
# print("Name:SVM")
# print("Accuracy score: ", accuracy)

In [13]:
# Random Forest analysis
model_rfc.fit(X_train, y_train)
preds = model_rfc.predict(X_test)
accuracy = accuracy_score(y_test, preds)
# mean_squared_error returns the MSE; no square root is taken
mse_rfc = mean_squared_error(y_test, preds)
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = model_rfc
    best_preds = preds

print("Name:Random Forest")
print("Accuracy score: ", accuracy)
print("MSE: ", mse_rfc)

Name:Random Forest
Accuracy score:  1.0
MSE:  0.0


In [14]:
# Nearest Centroid analysis
model_knn_centroid.fit(X_train, y_train)
preds = model_knn_centroid.predict(X_test)
accuracy = accuracy_score(y_test, preds)
# mean_squared_error returns the MSE; no square root is taken
mse_centroid = mean_squared_error(y_test, preds)
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = model_knn_centroid
    best_preds = preds

print("Name:Nearest Centroid")
print("Accuracy score: ", accuracy)
print("MSE: ", mse_centroid)

Name:Nearest Centroid
Accuracy score:  0.514276450056
MSE:  0.485723549944


In [15]:
# KNN analysis
model_knn.fit(X_train, y_train)
preds = model_knn.predict(X_test)
accuracy = accuracy_score(y_test, preds)
# mean_squared_error returns the MSE; no square root is taken
mse_knn = mean_squared_error(y_test, preds)
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = model_knn
    best_preds = preds

print("Name:KNN ")
print("Accuracy score: ", accuracy)
print("MSE: ", mse_knn)

Name:KNN 
Accuracy score:  0.911012988995
MSE:  0.0889870110054
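
The error printed next to each accuracy is computed with `mean_squared_error`. For 0/1 labels a squared error is 1 exactly when the prediction is wrong, so that value equals 1 - accuracy, which is why each pair of numbers above sums to 1. A small plain-Python sketch on made-up labels (the helper names are illustrative):

```python
import math


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    # Mean of squared differences; for 0/1 labels each term is 0 or 1.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]   # two of eight predictions are wrong

acc = accuracy(y_true, y_pred)
err = mse(y_true, y_pred)
rmse = math.sqrt(err)               # a true RMSE needs the square root

print(acc, err, rmse)  # 0.75 0.25 0.5
```

The error column therefore adds no information beyond the accuracy for this binary task.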


In [16]:
print("Performance of models")
print("======================")

print("Best model:",best_model)
print("Best accuracy:",best_accuracy)
print("Best predictions:",best_preds)

Performance of models
======================
Best model: GaussianNB(priors=None)
Best accuracy: 1.0
Best predictions: [0 1 0 ..., 1 0 0]
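
Random Forest also scored 1.0, yet GaussianNB is reported as the best model. That follows from the strict `>` comparison used in each cell: a later model replaces the current best only if it is strictly better, so the first of several tied models is kept. A plain-Python sketch of that selection rule, using the accuracies recorded in this notebook in the order the cells ran:

```python
# (name, accuracy) pairs in evaluation order, copied from the outputs above.
scores = [
    ("Naive Bayes", 1.0),
    ("Random Forest", 1.0),
    ("Nearest Centroid", 0.514276450056),
    ("KNN", 0.911012988995),
]

best_name, best_accuracy = None, 0.0
for name, acc in scores:
    if acc > best_accuracy:          # strict: ties keep the earlier model
        best_name, best_accuracy = name, acc

print(best_name, best_accuracy)  # Naive Bayes 1.0
```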


In [17]:
# Pair each held-out (user, joke) row with the best model's predicted label.
output = X_test[['user_id', 'joke_id']].copy()
output['likes'] = best_preds

output.head()

        user_id  joke_id  likes
787146    29962       16      0
618543    41622      121      1
228797    16211       89      0
193161    19616       59      0
537285    18305       19      0


In [18]:
joke_category_data = pd.read_csv('jester_dataset_2/joke_category_dataset.csv')
# Drop the original 'joke_category' column; the reduced category is used for bucketing.
joke_category_data = joke_category_data.drop(columns='joke_category')

In [19]:
result = pd.merge(joke_category_data, output, how='right', on='joke_id')
result.head()

   joke_id joke_category_reduced  user_id  likes
0        1                doctor    19091      1
1        1                doctor    45460      1
2        1                doctor    44250      1
3        1                doctor    18022      1
4        1                doctor    14443      1


In [20]:
len(result)

104858

In [23]:
bucket_name = 'politics'
bucket = result.loc[(result['likes'] == 1) & (result['joke_category_reduced'] == bucket_name)]
print(" Total Number of Users for bucket:", bucket_name,len(bucket))
print(" Total Users")
print("============")
bucket['user_id']

 Total Number of Users for bucket: politics 5161
 Total Users
============

712        388
714        333
724        270
728        376
730        504
735         43
737         68
738        264
739        167
742        425
27693    30677
27694     4131
27696    37969
27697    30751
27698    13327
27699    50037
27700    28867
27701    41514
27703     5731
27704    47179
27705    36144
27706    29268
27707    43926
27708    28903
27709    29289
27710    42355
27711     4164
27712    41670
27715    25820
27716    47039
         ...  
96405    27387
96406    14814
96409    37383
96410    34872
96411     1020
96412    36751
96413    29076
96414    42338
96415    31916
96416    45306
96417    30515
96419    44461
96421    15337
96422    29593
96424    12194
96425    26702
96427    25530
96429    37526
96430     2289
96433    28923
96434    13684
96435    25264
96436    29817
96442    47928
96443     1109
96444    47283
96446    26345
96450    33442
96456     4238
96457    15040
Name: user_id, Length: 5161, dtype: int64
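
The count printed above is `len(bucket)`, which is the number of liked (user, joke) rows in the category. A user who likes several jokes in the same category is counted once per joke. A sketch of how rows and distinct users could be counted separately, on a toy frame with the same column names as `result` (values are made up):

```python
import pandas as pd

# Toy stand-in for the merged `result` table.
result = pd.DataFrame({
    "joke_id": [1, 1, 2, 3, 3, 4],
    "joke_category_reduced": ["doctor", "doctor", "politics",
                              "politics", "politics", "animal"],
    "user_id": [7, 8, 7, 7, 9, 8],
    "likes": [1, 1, 1, 1, 1, 0],
})

liked = result[result["likes"] == 1]

# Liked (user, joke) rows per category: what len(bucket) measures.
rows_per_bucket = liked.groupby("joke_category_reduced").size()

# Distinct users per category.
users_per_bucket = liked.groupby("joke_category_reduced")["user_id"].nunique()

print(rows_per_bucket["politics"], users_per_bucket["politics"])  # 3 2
```

In the toy data user 7 likes two politics jokes, so the politics bucket has three rows but only two distinct users.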

In [24]:
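import os

def writePredictions(bucket, bucketname):
    # Write one bucket's rows to buckets/<bucketname>_user_bucket.txt (tab-separated).
    os.makedirs("buckets", exist_ok=True)  # create the output directory if missing
    filename = "buckets/" + bucketname + "_user_bucket.txt"
    print(filename)
    np.savetxt(filename, bucket.values, header="joke_id\tjoke_category\tuser_id\tlikes", fmt='%s', delimiter="\t")

writePredictions(bucket, bucket_name)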

buckets/politics_user_bucket.txt
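
The cells above write only the politics bucket, while the summary says that every bucket is stored under buckets/. One way the remaining buckets could be produced is a loop over the categories. The sketch below is an assumption about how that might look: it uses a toy `result` frame, writes into a temporary directory rather than buckets/, and uses pandas `to_csv` instead of the notebook's `np.savetxt`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged `result` table (values are made up).
result = pd.DataFrame({
    "joke_id": [1, 2, 3, 4],
    "joke_category_reduced": ["doctor", "politics", "politics", "animal"],
    "user_id": [7, 7, 9, 8],
    "likes": [1, 1, 1, 0],
})

out_dir = tempfile.mkdtemp()  # stand-in for the "buckets/" directory
written = []

liked = result[result["likes"] == 1]
for bucket_name, bucket in liked.groupby("joke_category_reduced"):
    path = os.path.join(out_dir, bucket_name + "_user_bucket.txt")
    bucket.to_csv(path, sep="\t", index=False)  # header row + tab-separated values
    written.append(path)

print(sorted(os.path.basename(p) for p in written))
# ['doctor_user_bucket.txt', 'politics_user_bucket.txt']
```

Categories with no liked rows (animal in the toy data) produce no file under this approach.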
