# YouTube Spam Detection (5 Videos)
## Abstract
Detect spam comments on YouTube videos. The dataset contains comments from 5 videos, and each comment is labeled with one of 2 classes: spam or ham (not spam).<br>
Data Source: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection<br>

Data Set Information:<br>
The table below lists the datasets, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.<br>

| Dataset | YouTube ID | # Spam | # Ham | Total |
| --- | --- | --- | --- | --- |
| Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
| KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
| LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
| Eminem | uelHwf8o7_U | 245 | 203 | 448 |
| Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |

First, we will train the model on the first video (Psy) only. Then, we will train it on all five videos. In both cases we use Bag of Words (BoW) features and a Random Forest (RF) classifier.


In [1]:
import pandas as pd

In [2]:
# Read in The Psy file, and display the tail
df_psy = pd.read_csv('../Intro_AI_Project_3/Youtube01-Psy.csv')
df_psy.tail()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
345,z13th1q4yzihf1bll23qxzpjeujterydj,Carmen Racasanu,2014-11-14T13:27:52,How can this have 2 billion views when there's...,0
346,z13fcn1wfpb5e51xe04chdxakpzgchyaxzo0k,diego mogrovejo,2014-11-14T13:28:08,I don't now why I'm watching this in 2014﻿,0
347,z130zd5b3titudkoe04ccbeohojxuzppvbg,BlueYetiPlayz -Call Of Duty and More,2015-05-23T13:04:32,subscribe to me for call of duty vids and give...,1
348,z12he50arvrkivl5u04cctawgxzkjfsjcc4,Photo Editor,2015-06-05T14:14:48,hi guys please my android photo editor downloa...,1
349,z13vhvu54u3ewpp5h04ccb4zuoardrmjlyk0k,Ray Benich,2015-06-05T18:05:16,The first billion viewed this because they tho...,0


In [3]:
# find number of rows and columns
print(df_psy.shape)

(350, 5)


There are 350 rows and 5 columns

In [4]:
# count number of 0, 1 in CLASS
df_psy.CLASS.value_counts()

1    175
0    175
Name: CLASS, dtype: int64

Comparing the comment content with the class shows that class 0 means not spam and class 1 means spam. The Psy file contains 175 spam rows and 175 non-spam rows.

In [5]:
# Import the Bag of Words function and create an instance
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [6]:
# Fit and transform the comments in one step
dvec = vectorizer.fit_transform(df_psy['CONTENT'])

In [7]:
dvec

<350x1418 sparse matrix of type '<class 'numpy.int64'>'
	with 4354 stored elements in Compressed Sparse Row format>

The sparse matrix has shape 350x1418: 350 comments by 1,418 vocabulary terms.

In [8]:
# print out the comment at index 349 (the last row)
print(df_psy['CONTENT'][349])

The first billion viewed this because they thought it was really cool, the  other billion and a half came to see how stupid the first billion were...﻿


In [9]:
# Using vectorizer.build_analyzer(), display
# the breakdown of the comment at index 349 into a "bag of words"
analyze = vectorizer.build_analyzer()
analyze(df_psy['CONTENT'][349])

['the',
 'first',
 'billion',
 'viewed',
 'this',
 'because',
 'they',
 'thought',
 'it',
 'was',
 'really',
 'cool',
 'the',
 'other',
 'billion',
 'and',
 'half',
 'came',
 'to',
 'see',
 'how',
 'stupid',
 'the',
 'first',
 'billion',
 'were']

In [10]:
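# Output the feature names (the learned vocabulary).
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out() instead.
vectorizer.get_feature_names()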



['00',
 '000',
 '02',
 '034',
 '05',
 '08',
 '10',
 '100',
 '100000415527985',
 '10200253113705769',
 '1030',
 '1073741828',
 '11',
 '1111',
 '112720997191206369631',
 '12',
 '123',
 '124',
 '124923004',
 '126',
 '127',
 '13017194',
 '131338190916',
 '1340488',
 '1340489',
 '1340490',
 '1340491',
 '1340492',
 '1340493',
 '1340494',
 '1340499',
 '1340500',
 '1340502',
 '1340503',
 '1340504',
 '1340517',
 '1340518',
 '1340519',
 '1340520',
 '1340521',
 '1340522',
 '1340523',
 '1340524',
 '134470083389909',
 '1415297812',
 '1495323920744243',
 '1496241863981208',
 '1496273723978022',
 '1498561870415874',
 '161620527267482',
 '171183229277',
 '19',
 '19924',
 '1firo',
 '1m',
 '20',
 '2009',
 '2012',
 '2012bitches',
 '2013',
 '2014',
 '201470069872822',
 '2015',
 '2017',
 '210',
 '23',
 '24',
 '24398',
 '243a',
 '279',
 '29',
 '2b',
 '2billion',
 '2x10',
 '300',
 '3000',
 '313327',
 '315',
 '322',
 '33',
 '33gxrf',
 '39',
 '390875584405933',
 '391725794320912',
 '40beuutvu2zkxk4utgpz8k',
 ...

It returns the feature names, i.e. the vocabulary learned from the dataset. The list is sorted, so tokens that start with digits come first, followed by words.
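To make the steps above concrete (tokens, vocabulary, count matrix), here is a minimal pure-Python sketch of what a bag-of-words model computes. It is not scikit-learn's implementation; it only mimics the documented defaults of CountVectorizer (lowercasing, tokens of two or more word characters, a sorted vocabulary). The toy comments and the helper names `tokenize` and `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

# Same regex as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical comments, not taken from the dataset
docs = ["Check out my channel",
        "I love this song, love it",
        "2014 had 2 billion views"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix is one comment and each column is one vocabulary term, which is why the real matrix above has one row per comment and one column per feature name.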

In [11]:
# shuffle dataset
df_psy_shuffle = df_psy.sample(frac=1)

In [12]:
df_psy_shuffle

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
106,z13qgx0yzwf1uj1xm04ccbkhjnrsgz0i41g,firo mota,2014-11-04T14:38:42,please subscribe i am a new youtuber and need ...,1
343,z13sh3cz1kbqgrai504cf53qsq25ypmi5zs0k,Leonel Hernandez,2014-11-14T12:35:38,"Something to dance to, even if your sad JUST ...",0
112,z12ftpab5svihfffz23kf3iiymiwjzesi,trespasser4000,2014-11-04T22:07:22,This song never gets old love it.﻿,0
18,z12qth5j0ob1fx3q404chvy4fz32tbkpllk0k,Tony K Frazier,2013-11-28T23:57:13,http://ubuntuone.com/40beUutVu2ZKxK4uTgPZ8K﻿,1
303,z12cehoxozfgg3nok04cjj05xznbgrlpfjo,Elieo Cardiopulmonary,2014-11-08T15:29:52,im sorry for the spam but My name is Jenny. I ...,1
...,...,...,...,...,...
110,z123jlf4lzjbgpbcr23yhxyqbpe3gxpvm,TIGERIO_,2014-11-04T19:46:38,EHI GUYS CAN YOU SUBSCRIBE IN MY CHANNEL? I AM...,1
80,z122f1fy5muwdxdxd04cexyxes3hh5hrifc,Squir3,2014-11-02T18:34:03,http://woobox.com/33gxrf/brt0u5 FREE CS GO!!!!﻿,1
224,z12sjrqiurm3sd4rh04chz4oplrmhzmgzmg0k,Stronzo Chicheritr,2014-11-07T20:01:15,prehistoric song..has been﻿,0
144,z13osluqrpefv1hd323idhejzxanc3ai004,Tyrek Sings,2014-11-05T22:50:02,CHECK MY CHANNEL OUT PLEASE. I DO SINGING COVERS﻿,1


In [13]:
# create training and testing set
# first 300 rows for training and the remaining 50 for testing
d_train = df_psy_shuffle.iloc[:300]
d_test = df_psy_shuffle.iloc[300:]

In [14]:
# d_train

In [15]:
# d_test

In [16]:
# Create the training and testing BoW attributes:
# fit the vocabulary on the training set only, then reuse it to transform the test set
d_train_att = vectorizer.fit_transform(d_train['CONTENT'])
d_test_att = vectorizer.transform(d_test['CONTENT'])

In [17]:
d_train_att

<300x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 3759 stored elements in Compressed Sparse Row format>

In [18]:
d_test_att.shape

(50, 1274)

In [19]:
# training and testing labels
d_train_label = d_train['CLASS']
d_test_label = d_test['CLASS']

In [20]:
d_train_att

<300x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 3759 stored elements in Compressed Sparse Row format>

d_train_att is a 300x1274 matrix

In [21]:
d_test_att

<50x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 450 stored elements in Compressed Sparse Row format>

d_test_att is a 50x1274 matrix
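The test matrix has the same number of columns as the training matrix because `transform` reuses the vocabulary learned by `fit_transform`. A small sketch of that behavior, using made-up comments (the toy strings are assumptions, not rows from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical comments for illustration only
train_docs = ["check out my channel", "love this song"]
test_docs = ["subscribe to my channel please"]

vec = CountVectorizer()
train_att = vec.fit_transform(train_docs)  # learns the vocabulary from the training text
test_att = vec.transform(test_docs)        # reuses that vocabulary; unseen words are ignored

print(train_att.shape)
print(test_att.shape)
print(test_att.toarray())
```

Words that appear only in the test comments ("subscribe", "to", "please") have no column, so they do not contribute to the test features.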

In [22]:
# create random forest classifier with 80 trees
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=80)
clf.fit(d_train_att, d_train_label)

In [23]:
clf.score(d_test_att, d_test_label)

0.92

The random forest classifier scores 92% accuracy on the test set.

In [24]:
# Create a confusion matrix for the test labels and prediction labels, and output array
predicted_labels = clf.predict(d_test_att)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(d_test_label, predicted_labels)

In [25]:
cm

array([[24,  0],
       [ 4, 22]], dtype=int64)
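The confusion matrix can be turned into the usual metrics by hand. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 = not spam, 1 = spam) and plugs in the numbers from the output above.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[24, 0],
      [4, 22]]

tn, fp = cm[0]  # true not-spam: correctly kept / wrongly flagged as spam
fn, tp = cm[1]  # true spam: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the comments flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam comments, how many were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The accuracy of 46/50 = 0.92 agrees with `clf.score`. In this split the model never flagged a legitimate comment as spam, but it missed 4 of the 26 spam comments.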

In [26]:
# Cross validate and output your accuracy
from sklearn.model_selection import cross_val_score
score = cross_val_score(clf,d_train_att, d_train_label, cv=5)

print('Random forest accuracy: ', round(score.mean(), 4), '(+/- ', round(score.std()*2, 4),')')

Random forest accuracy:  0.96 (+/-  0.0686 )


Cross-validation splits the dataset into multiple portions (folds); in each round, one fold is used for testing and the rest for training. In the cross-validation above, the data is divided into 5 folds. In iteration 1, the first fold is used for testing and the remaining four for training. In iteration 2, the second fold is used for testing and the rest for training. This repeats until every fold has been used for testing once, and the reported accuracy is the mean over the 5 rounds.
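The rotation of folds can be sketched in a few lines of plain Python. This is an illustration of the idea only: `k_fold_indices` is a hypothetical helper, and it leaves out the class stratification that scikit-learn applies when `cross_val_score` is given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample so every sample is used once
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# 10 samples and 5 folds: each iteration holds out a different pair of samples
for fold, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"iteration {fold}: test={test} train={train}")
```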

In [27]:
# Load and concatenate all 5 comment files
concat = [pd.read_csv('Youtube01-Psy.csv'), 
          pd.read_csv('Youtube02-KatyPerry.csv'), 
          pd.read_csv('Youtube03-LMFAO.csv'), 
          pd.read_csv('Youtube04-Eminem.csv'), 
          pd.read_csv('Youtube05-Shakira.csv')]
combined_csv = pd.concat(concat, ignore_index=True)

In [28]:
combined_csv

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1
...,...,...,...,...,...
1951,_2viQ_Qnc6-bMSjqyL1NKj57ROicCSJV5SwTrw-RFFA,Katie Mettam,2013-07-13T13:27:39.441000,I love this song because we sing it at Camp al...,0
1952,_2viQ_Qnc6-pY-1yR6K2FhmC5i48-WuNx5CumlHLDAI,Sabina Pearson-Smith,2013-07-13T13:14:30.021000,I love this song for two reasons: 1.it is abou...,0
1953,_2viQ_Qnc6_k_n_Bse9zVhJP8tJReZpo8uM2uZfnzDs,jeffrey jules,2013-07-13T12:09:31.188000,wow,0
1954,_2viQ_Qnc6_yBt8UGMWyg3vh0PulTqcqyQtdE7d4Fl0,Aishlin Maciel,2013-07-13T11:17:52.308000,Shakira u are so wiredo,0


In [29]:
# Display the length of the file as well as the breakdown into spam and non-spam
print('Length of file: ', len(combined_csv))
print('Length of spam: ', len(combined_csv[combined_csv.CLASS == 1]))
print('Length of non spam: ', len(combined_csv[combined_csv.CLASS == 0]))

Length of file:  1956
Length of spam:  1005
Length of non spam:  951


In [30]:
# Shuffle the new data, and create content and label sets, d_content, and d_label
combined_df = combined_csv.sample(frac=1)
d_content = combined_df['CONTENT']
d_label = combined_df['CLASS']

The advantage of a pipeline is that it needs less code and leaves less room for mistakes: instead of calling fit and transform on each step separately, the programmer only calls fit on the pipeline, which also makes the code easier to read.

Reference: <br>
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html<br>
https://medium.com/mlearning-ai/how-to-use-sklearns-pipelines-to-optimize-your-analysis-b6cd91999be
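As a small illustration of that point, the sketch below trains the same model twice on made-up comments, once step by step and once through a pipeline. The toy texts, labels, and the step names "bow" and "rf" are assumptions for the example; `random_state` is fixed so that both versions build identical forests.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training data: 1 = spam, 0 = not spam
texts = ["check out my channel", "subscribe to my channel",
         "love this song", "this song is great"]
labels = [1, 1, 0, 0]
new_texts = ["please subscribe to my channel", "great song"]

# Manual version: every step has to be fitted and applied by hand
vec = CountVectorizer()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(vec.fit_transform(texts), labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline version: one fit call, one predict call
pipe = Pipeline([("bow", CountVectorizer()),
                 ("rf", RandomForestClassifier(n_estimators=10, random_state=0))])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

print(manual_pred, pipe_pred)
```

Both versions give the same predictions; the pipeline simply removes the chance of forgetting to transform new text with the fitted vectorizer.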

In [31]:
# Import both Pipeline and make_pipeline from sklearn
from sklearn.pipeline import Pipeline, make_pipeline

In [32]:
# create pipeline
pipeline = Pipeline([('bag-of-words', CountVectorizer()),
                     ('random forest', RandomForestClassifier())])

In [33]:
# output the pipeline
pipeline

In [34]:
# Fit your pipeline with the first 1500 entries of the content and the labels. Output.
pipeline.fit(d_content[:1500], d_label[:1500])

In [35]:
# Use .score to score your pipeline
pipeline.score(d_content[1500:], d_label[1500:])

0.9473684210526315

In [36]:
# Use pipeline to predict spam or not
pipeline.predict(["what a neat video!"])

array([0], dtype=int64)

In [37]:
# Use pipeline to predict spam or not
pipeline.predict(["plz subscribe to my channel"])

array([1], dtype=int64)

For "what a neat video!", the model predicts 0, meaning the comment is not spam.<br>
For "plz subscribe to my channel", the model predicts 1, meaning the comment is spam.

In [38]:
# Cross validate pipeline using d_content and d_label. Set cv=5
scores = cross_val_score(pipeline, d_content, d_label, cv=5)

In [39]:
# Print out the accuracy (scores.mean) +/- 2 standard deviations
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()*2))

Accuracy: 0.96 (+/- 0.02)


In [40]:
# Create second pipeline named pipeline2 which incorporates the TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer
pipeline2 = make_pipeline(CountVectorizer(), TfidfTransformer(norm=None), RandomForestClassifier())
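As a rough illustration of what the TfidfTransformer step does to the counts: with its default `smooth_idf=True`, scikit-learn documents the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, and with `norm=None` the result is simply count times idf, with no row normalization. The sketch below recomputes that by hand on a made-up count matrix; it is not the library code.

```python
import math

# Toy count matrix (assumed for illustration): 3 documents x 3 terms
counts = [[1, 1, 0],
          [1, 0, 2],
          [1, 0, 0]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency, one value per term
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# With norm=None the weighted matrix is just count * idf
tfidf = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

print([round(v, 3) for v in idf])
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that occurs in every document keeps a weight of 1, while rarer terms are weighted up.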

In [41]:
# Cross validate pipeline2 using d_content and d_label.
# Output the accuracy (scores.mean) +/- 2 standard deviations. Set cv=5
scores = cross_val_score(pipeline2, d_content, d_label, cv=5)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()*2))

Accuracy: 0.96 (+/- 0.02)


In [42]:
#Output the steps of pipeline2
pipeline2.steps

[('countvectorizer', CountVectorizer()),
 ('tfidftransformer', TfidfTransformer(norm=None)),
 ('randomforestclassifier', RandomForestClassifier())]

In [43]:
# parameter search
parameters = {
    'countvectorizer__max_features': (None, 1000, 2000),
    'countvectorizer__ngram_range': ((1,1), (1,2)), # unigrams or bigrams
    'countvectorizer__stop_words': ('english', None),
    'tfidftransformer__use_idf': (True, False), # effectively turn on/off tfidf
    'randomforestclassifier__n_estimators': (20, 59, 100)
}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1)

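The code above sets up a grid search to optimize the pipeline's parameters. The vocabulary can be unlimited or capped at 1000 or 2000 words. With n-grams we can use single words only, or single words plus pairs of words.
The English stop-word list can be used or not, TF-IDF weighting can be turned on or off, and the random forest can use 20, 59, or 100 trees. Together this gives 3 x 2 x 2 x 2 x 3 = 72 parameter combinations.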
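The size of the grid can be checked with `itertools.product`. This sketch only enumerates the combinations and does not train anything; `n_folds = 5` is an assumption that matches the 5-fold cross-validation used elsewhere in this notebook.

```python
from itertools import product

parameters = {
    'countvectorizer__max_features': (None, 1000, 2000),
    'countvectorizer__ngram_range': ((1, 1), (1, 2)),
    'countvectorizer__stop_words': ('english', None),
    'tfidftransformer__use_idf': (True, False),
    'randomforestclassifier__n_estimators': (20, 59, 100),
}

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

n_folds = 5  # each candidate is fitted once per fold
print(len(candidates), len(candidates) * n_folds)
```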

In [44]:
# Using .fit and d_content, and d_labels, perform the grid search
grid_search.fit(d_content, d_label)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


In [45]:
# grid_search.best_index_

In [46]:
# grid_search.best_estimator_

In [47]:
print('Best score: %0.3f ' % grid_search.best_score_)
print('Best parameters set: ')
for x in grid_search.best_params_:
    print ('\t', x ,':', grid_search.best_params_[x])
    

Best score: 0.959 
Best parameters set: 
	 countvectorizer__max_features : None
	 countvectorizer__ngram_range : (1, 2)
	 countvectorizer__stop_words : None
	 randomforestclassifier__n_estimators : 59
	 tfidftransformer__use_idf : True


According to this output, our best score was 0.959, obtained with no limit on the number of words, single words plus pairs of words, no stop-word removal, 59 trees, and TF-IDF turned on.

In [48]:
import gensim, logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Gensim: a package for NLP tasks that also supports word-embedding algorithms. <br>
Logging: logging.basicConfig does basic configuration of the logging system by creating a stream handler with a formatter, so that gensim's progress messages are displayed.

In [49]:
# load Google w2v model
gmodel = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

2022-11-06 18:57:46,373 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin
2022-11-06 18:59:02,610 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2022-11-06T18:59:02.610646', 'gensim': '4.1.2', 'python': '3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19044-SP0', 'event': 'load_word2vec_format'}


**Question 1cI**

In [50]:
gmodel_cat = gmodel['cat']
gmodel_cat

array([ 0.0123291 ,  0.20410156, -0.28515625,  0.21679688,  0.11816406,
        0.08300781,  0.04980469, -0.00952148,  0.22070312, -0.12597656,
        0.08056641, -0.5859375 , -0.00445557, -0.296875  , -0.01312256,
       -0.08349609,  0.05053711,  0.15136719, -0.44921875, -0.0135498 ,
        0.21484375, -0.14746094,  0.22460938, -0.125     , -0.09716797,
        0.24902344, -0.2890625 ,  0.36523438,  0.41210938, -0.0859375 ,
       -0.07861328, -0.19726562, -0.09082031, -0.14160156, -0.10253906,
        0.13085938, -0.00346375,  0.07226562,  0.04418945,  0.34570312,
        0.07470703, -0.11230469,  0.06738281,  0.11230469,  0.01977539,
       -0.12353516,  0.20996094, -0.07226562, -0.02783203,  0.05541992,
       -0.33398438,  0.08544922,  0.34375   ,  0.13964844,  0.04931641,
       -0.13476562,  0.16308594, -0.37304688,  0.39648438,  0.10693359,
        0.22167969,  0.21289062, -0.08984375,  0.20703125,  0.08935547,
       -0.08251953,  0.05957031,  0.10205078, -0.19238281, -0.09

In [51]:
gmodel_dog = gmodel['dog']
gmodel_dog

array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01,  1.61132812e-01,
       -8.44726562e-02,  5.73730469e-02,  5.85937500e-02, -8.25195312e-02,
       -1.53808594e-02, -6.34765625e-02,  1.79687500e-01, -4.23828125e-01,
       -2.25830078e-02, -1.66015625e-01, -2.51464844e-02,  1.07421875e-01,
       -1.99218750e-01,  1.59179688e-01, -1.87500000e-01, -1.20117188e-01,
        1.55273438e-01, -9.91210938e-02,  1.42578125e-01, -1.64062500e-01,
       -8.93554688e-02,  2.00195312e-01, -1.49414062e-01,  3.20312500e-01,
        3.28125000e-01,  2.44140625e-02, -9.71679688e-02, -8.20312500e-02,
       -3.63769531e-02, -8.59375000e-02, -9.86328125e-02,  7.78198242e-03,
       -1.34277344e-02,  5.27343750e-02,  1.48437500e-01,  3.33984375e-01,
        1.66015625e-02, -2.12890625e-01, -1.50756836e-02,  5.24902344e-02,
       -1.07421875e-01, -8.88671875e-02,  2.49023438e-01, -7.03125000e-02,
       -1.59912109e-02,  7.56835938e-02, -7.03125000e-02,  1.19140625e-01,
        2.29492188e-01,  

In [52]:
gmodel_spatula = gmodel['spatula']
gmodel_spatula

array([-0.19140625, -0.04296875,  0.27539062,  0.00488281, -0.3203125 ,
        0.08203125,  0.05566406, -0.03613281, -0.31445312,  0.10693359,
       -0.359375  ,  0.29882812,  0.02331543,  0.05517578, -0.140625  ,
        0.1953125 , -0.23632812, -0.22167969, -0.06542969, -0.3359375 ,
        0.25195312, -0.09326172,  0.54296875,  0.11328125, -0.28710938,
       -0.12011719, -0.11181641,  0.20996094, -0.33203125,  0.30273438,
       -0.3359375 , -0.12255859,  0.12890625, -0.28515625, -0.04223633,
        0.25585938,  0.3203125 ,  0.07177734,  0.19042969, -0.01379395,
        0.16992188, -0.22460938,  0.5078125 ,  0.08398438, -0.07519531,
       -0.06396484,  0.05371094,  0.34570312,  0.46289062, -0.16699219,
       -0.30664062,  0.15234375, -0.09765625, -0.26171875, -0.14160156,
        0.2265625 ,  0.49609375, -0.10791016, -0.08447266,  0.234375  ,
        0.04931641, -0.07128906,  0.05273438, -0.11914062,  0.09814453,
        0.11181641, -0.13574219, -0.46875   ,  0.26171875,  0.12

In [53]:
# output similarity score for 'dog', 'cat'
print('Similarity score for \'dog\' and \'cat\': ', gmodel.similarity('dog', 'cat'))

Similarity score for 'dog' and 'cat':  0.7609457


In [54]:
# output similarity score for 'dog', 'spatula'
print('Similarity score for \'dog\' and \'spatula\': ', gmodel.similarity('dog', 'spatula'))

Similarity score for 'dog' and 'spatula':  0.1705973
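The score that `similarity` reports is the cosine similarity between the two word vectors: the dot product divided by the product of the vector lengths. A minimal sketch with made-up 3-dimensional vectors (the real embeddings have 300 dimensions, and these numbers are not taken from the model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" for illustration only
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
spatula = [0.1, -0.2, 0.9]

print(round(cosine_similarity(dog, cat), 3))
print(round(cosine_similarity(dog, spatula), 3))
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0, which mirrors the gap between the dog/cat and dog/spatula scores above.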


In [55]:
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
import re

Gensim library: used for representing documents as vectors, retrieval by similarity, and other natural language processing tasks.<br>
TaggedDocument: the input document format for Doc2Vec; it pairs a document's list of words with a list of tags that identify the document.

In [81]:
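def extract_words(sent):
    sent = sent.lower()
    sent = re.sub(r'<[^>]+>', ' ', sent)  # strip HTML tags
    # drop apostrophes inside words; the replacement must be a raw string
    # so that \1 and \2 are group backreferences, not control characters
    sent = re.sub(r'(\w)\'(\w)', r'\1\2', sent)
    sent = re.sub(r'\W', ' ', sent)  # replace non-word characters with spaces
    sent = re.sub(r'\s+', ' ', sent)  # collapse runs of whitespace
    sent = sent.strip()
    return sent.split()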

The utility function above uses regular expressions to extract all the words from the input 'sent' and returns them as a list. It also converts upper case to lower case, removes anything between a < and a >, and replaces non-word characters with spaces.
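One detail that matters when removing apostrophes with `re.sub` is how the replacement string is written. A short sketch of the difference between a raw and a non-raw replacement, using the same pattern as the function above (the variable names `joined` and `mangled` are only for illustration):

```python
import re

text = "don't"

# Raw replacement string: \1 and \2 refer to the captured groups
joined = re.sub(r"(\w)'(\w)", r"\1\2", text)

# Non-raw replacement string: Python reads '\1\2' as the control
# characters chr(1) and chr(2) before re.sub ever sees it
mangled = re.sub(r"(\w)'(\w)", '\1\2', text)

print(repr(joined))
print(repr(mangled))
```

In the non-raw case the letters around the apostrophe are replaced by control characters, so a later cleanup of non-word characters would silently drop them.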

In [57]:
# unsupervised training data
import os
unsup_sentences = []

# source: http://ai.stanford.edu/~amaas/data/sentiment/, data from IMDB
for dirname in ["train/pos/", "train/neg/", "train/unsup/", "test/pos", "test/neg/"]:
    for fname in sorted(os.listdir("aclImdb/" + dirname)):
        if fname[-4:] == '.txt':
            with open("aclImdb/" + dirname + "/" + fname,  encoding='UTF-8') as f:
                sent = f.read()
                words = extract_words(sent)
                unsup_sentences.append(TaggedDocument(words, [dirname + "/" + fname]))

# source: http://www.cs.cornell.edu/people/pabo/movie-review-data/
for dirname in ["review_polarity/txt_sentoken/pos", "review_polarity/txt_sentoken/neg/"]:
    for fname in sorted(os.listdir(dirname)):
        if fname[-4:] == '.txt':
            with open(dirname + "/" + fname, encoding='UTF-8') as f:
                for i, sent in enumerate(f):
                    words = extract_words(sent)
                    unsup_sentences.append(TaggedDocument(words, ["%s%s-%d" % (dirname, fname, i)]))

# source: https://nlp.stanford.edu/sentiment/, data from Rotten Tomatoes
with open("stanfordSentimentTreebank/original_rt_snippets.txt", encoding='UTF-8') as f:
    for i, line in enumerate(f):
        words = extract_words(line)  # extract words from the current line
        unsup_sentences.append(TaggedDocument(words, ["rt-%d" % i]))

In [58]:
# unsup_sentences[0:1]

In [59]:
len(unsup_sentences)

175325

The first loop opens the aclImdb folder and goes through the "train/pos/", "train/neg/", "train/unsup/", "test/pos" and "test/neg/" paths. Each .txt file is read with UTF-8 encoding, the extract_words function extracts its words, and the result is appended to the unsup_sentences list with the file path as its tag.<br>

The second loop opens the "review_polarity/txt_sentoken/pos" and "review_polarity/txt_sentoken/neg/" folders and reads each .txt file with UTF-8 encoding, line by line. The words of each line are extracted with extract_words and appended to unsup_sentences, tagged with the file path plus the line number.<br>

The final loop opens original_rt_snippets.txt with UTF-8 encoding, extracts the words of each line with extract_words, and appends them to unsup_sentences, tagged with "rt-" plus the line number.<br>

All 3 loops attach a unique tag to every document they append.<br>


There are 175,325 training examples in total.



In [84]:
# display first 10 entries
unsup_sentences[:10]

[TaggedDocument(words=['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'hig', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 

In [61]:
import random
class PermuteSentences(object):
    def __init__(self, sents):
        self.sents = sents
    
    def __iter__(self):
        shuffled = list(self.sents)
        random.shuffle(shuffled)
        for sent in shuffled:
            yield sent
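A brief sketch of why this is written as a class with `__iter__` rather than a one-off generator: a model that makes several passes over its corpus (one to build the vocabulary, then one per training epoch) can restart the iteration each time, and every pass sees the same documents in a fresh random order. The class is repeated here so the snippet runs on its own; the toy list is an assumption that stands in for unsup_sentences.

```python
import random

class PermuteSentences(object):
    def __init__(self, sents):
        self.sents = sents

    def __iter__(self):
        shuffled = list(self.sents)  # copy, so the original list is left untouched
        random.shuffle(shuffled)
        for sent in shuffled:
            yield sent

# Hypothetical stand-in for the tagged documents
toy_sentences = ["doc-%d" % i for i in range(6)]
permuter = PermuteSentences(toy_sentences)

first_pass = list(permuter)   # each list() call restarts __iter__ and reshuffles
second_pass = list(permuter)
print(first_pass)
print(second_pass)
```

A plain generator would be exhausted after the first pass, whereas this object can be iterated as many times as the training loop needs.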

In [62]:
permuter = PermuteSentences(unsup_sentences)
model = Doc2Vec(permuter, dm=0, hs=1, vector_size=50)

2022-11-06 19:03:19,146 : INFO : collecting all words and their counts
2022-11-06 19:03:19,485 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2022-11-06 19:03:20,355 : INFO : PROGRESS: at example #10000, processed 1409813 words (1624646/s), 45653 word types, 10000 tags
2022-11-06 19:03:21,287 : INFO : PROGRESS: at example #20000, processed 2829499 words (1524535/s), 62029 word types, 20000 tags
2022-11-06 19:03:22,123 : INFO : PROGRESS: at example #30000, processed 4218642 words (1664668/s), 73756 word types, 30000 tags
2022-11-06 19:03:23,070 : INFO : PROGRESS: at example #40000, processed 5659233 words (1525977/s), 83627 word types, 40000 tags
2022-11-06 19:03:23,905 : INFO : PROGRESS: at example #50000, processed 7063117 words (1683053/s), 91525 word types, 50000 tags
2022-11-06 19:03:24,817 : INFO : PROGRESS: at example #60000, processed 8461968 words (1536336/s), 98818 word types, 60000 tags
2022-11-06 19:03:25,652 : INFO : PROGRESS: at example #70

2022-11-06 19:04:21,513 : INFO : EPOCH 1 - PROGRESS: at 48.41% examples, 269858 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:22,548 : INFO : EPOCH 1 - PROGRESS: at 49.81% examples, 269344 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:23,566 : INFO : EPOCH 1 - PROGRESS: at 51.24% examples, 269431 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:24,575 : INFO : EPOCH 1 - PROGRESS: at 52.69% examples, 269375 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:25,593 : INFO : EPOCH 1 - PROGRESS: at 54.27% examples, 270002 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:26,605 : INFO : EPOCH 1 - PROGRESS: at 55.84% examples, 270437 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:27,615 : INFO : EPOCH 1 - PROGRESS: at 57.24% examples, 270133 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:28,622 : INFO : EPOCH 1 - PROGRESS: at 58.60% examples, 269880 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:04:29,633 : INFO : EPOCH 1 - PROGRESS: at 60.07% examples, 269987 words/s, in_qsiz

2022-11-06 19:05:32,864 : INFO : EPOCH 2 - PROGRESS: at 51.15% examples, 269051 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:33,905 : INFO : EPOCH 2 - PROGRESS: at 52.61% examples, 268953 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:34,907 : INFO : EPOCH 2 - PROGRESS: at 54.01% examples, 268945 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:35,920 : INFO : EPOCH 2 - PROGRESS: at 55.42% examples, 268873 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:36,926 : INFO : EPOCH 2 - PROGRESS: at 56.90% examples, 269028 words/s, in_qsize 6, out_qsize 0
2022-11-06 19:05:37,977 : INFO : EPOCH 2 - PROGRESS: at 58.42% examples, 268833 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:39,012 : INFO : EPOCH 2 - PROGRESS: at 59.94% examples, 269163 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:40,061 : INFO : EPOCH 2 - PROGRESS: at 61.43% examples, 268852 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:05:41,078 : INFO : EPOCH 2 - PROGRESS: at 62.82% examples, 268744 words/s, in_qsiz

2022-11-06 19:06:43,995 : INFO : EPOCH 3 - PROGRESS: at 51.32% examples, 263247 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:45,030 : INFO : EPOCH 3 - PROGRESS: at 52.87% examples, 263732 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:46,031 : INFO : EPOCH 3 - PROGRESS: at 54.42% examples, 264234 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:47,087 : INFO : EPOCH 3 - PROGRESS: at 55.95% examples, 264183 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:48,097 : INFO : EPOCH 3 - PROGRESS: at 57.33% examples, 263844 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:49,115 : INFO : EPOCH 3 - PROGRESS: at 58.83% examples, 264028 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:50,126 : INFO : EPOCH 3 - PROGRESS: at 60.25% examples, 264113 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:51,134 : INFO : EPOCH 3 - PROGRESS: at 61.63% examples, 264172 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:06:52,136 : INFO : EPOCH 3 - PROGRESS: at 63.05% examples, 264081 words/s, in_qsiz

2022-11-06 19:07:54,773 : INFO : EPOCH 4 - PROGRESS: at 52.42% examples, 267388 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:07:55,795 : INFO : EPOCH 4 - PROGRESS: at 53.93% examples, 267688 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:07:56,806 : INFO : EPOCH 4 - PROGRESS: at 55.44% examples, 268017 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:07:57,860 : INFO : EPOCH 4 - PROGRESS: at 56.96% examples, 267870 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:07:58,892 : INFO : EPOCH 4 - PROGRESS: at 58.52% examples, 268252 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:07:59,925 : INFO : EPOCH 4 - PROGRESS: at 60.14% examples, 268789 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:08:00,938 : INFO : EPOCH 4 - PROGRESS: at 61.67% examples, 269209 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:08:01,956 : INFO : EPOCH 4 - PROGRESS: at 63.27% examples, 269922 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:08:02,998 : INFO : EPOCH 4 - PROGRESS: at 64.85% examples, 270168 words/s, in_qsiz

2022-11-06 19:09:05,862 : INFO : EPOCH 5 - PROGRESS: at 56.35% examples, 271818 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:06,889 : INFO : EPOCH 5 - PROGRESS: at 57.95% examples, 272134 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:07,904 : INFO : EPOCH 5 - PROGRESS: at 59.40% examples, 272118 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:08,915 : INFO : EPOCH 5 - PROGRESS: at 60.84% examples, 272138 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:09,926 : INFO : EPOCH 5 - PROGRESS: at 62.23% examples, 271957 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:10,955 : INFO : EPOCH 5 - PROGRESS: at 63.60% examples, 271207 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:11,958 : INFO : EPOCH 5 - PROGRESS: at 65.00% examples, 270961 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:12,971 : INFO : EPOCH 5 - PROGRESS: at 66.37% examples, 270676 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:09:14,012 : INFO : EPOCH 5 - PROGRESS: at 67.98% examples, 270892 words/s, in_qsiz

2022-11-06 19:10:17,220 : INFO : EPOCH 6 - PROGRESS: at 58.41% examples, 267912 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:18,226 : INFO : EPOCH 6 - PROGRESS: at 59.71% examples, 267523 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:19,238 : INFO : EPOCH 6 - PROGRESS: at 61.19% examples, 267643 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:20,245 : INFO : EPOCH 6 - PROGRESS: at 62.59% examples, 267432 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:21,257 : INFO : EPOCH 6 - PROGRESS: at 64.10% examples, 267714 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:22,265 : INFO : EPOCH 6 - PROGRESS: at 65.66% examples, 268181 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:23,280 : INFO : EPOCH 6 - PROGRESS: at 67.24% examples, 268593 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:24,299 : INFO : EPOCH 6 - PROGRESS: at 68.79% examples, 268942 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:10:25,313 : INFO : EPOCH 6 - PROGRESS: at 70.34% examples, 269280 words/s, in_qsiz

2022-11-06 19:11:28,254 : INFO : EPOCH 7 - PROGRESS: at 60.42% examples, 270779 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:29,283 : INFO : EPOCH 7 - PROGRESS: at 61.98% examples, 271217 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:30,287 : INFO : EPOCH 7 - PROGRESS: at 63.43% examples, 270990 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:31,288 : INFO : EPOCH 7 - PROGRESS: at 64.93% examples, 271428 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:32,337 : INFO : EPOCH 7 - PROGRESS: at 66.42% examples, 271236 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:33,358 : INFO : EPOCH 7 - PROGRESS: at 67.79% examples, 270770 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:34,361 : INFO : EPOCH 7 - PROGRESS: at 69.33% examples, 270998 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:35,367 : INFO : EPOCH 7 - PROGRESS: at 70.76% examples, 271081 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:11:36,411 : INFO : EPOCH 7 - PROGRESS: at 72.23% examples, 270928 words/s, in_qsiz

2022-11-06 19:12:39,223 : INFO : EPOCH 8 - PROGRESS: at 62.43% examples, 267761 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:40,242 : INFO : EPOCH 8 - PROGRESS: at 63.89% examples, 267800 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:41,258 : INFO : EPOCH 8 - PROGRESS: at 65.34% examples, 267914 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:42,307 : INFO : EPOCH 8 - PROGRESS: at 66.78% examples, 267658 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:43,320 : INFO : EPOCH 8 - PROGRESS: at 68.31% examples, 267954 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:44,339 : INFO : EPOCH 8 - PROGRESS: at 69.75% examples, 268030 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:45,339 : INFO : EPOCH 8 - PROGRESS: at 71.27% examples, 268197 words/s, in_qsize 6, out_qsize 0
2022-11-06 19:12:46,360 : INFO : EPOCH 8 - PROGRESS: at 72.75% examples, 268253 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:12:47,412 : INFO : EPOCH 8 - PROGRESS: at 74.36% examples, 268406 words/s, in_qsiz

2022-11-06 19:13:50,019 : INFO : EPOCH 9 - PROGRESS: at 64.51% examples, 269156 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:51,067 : INFO : EPOCH 9 - PROGRESS: at 66.02% examples, 269223 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:52,073 : INFO : EPOCH 9 - PROGRESS: at 67.52% examples, 269503 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:53,079 : INFO : EPOCH 9 - PROGRESS: at 69.05% examples, 269473 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:54,145 : INFO : EPOCH 9 - PROGRESS: at 70.55% examples, 269556 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:55,168 : INFO : EPOCH 9 - PROGRESS: at 72.07% examples, 269427 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:56,175 : INFO : EPOCH 9 - PROGRESS: at 73.57% examples, 269531 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:57,190 : INFO : EPOCH 9 - PROGRESS: at 74.92% examples, 269166 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:13:58,213 : INFO : EPOCH 9 - PROGRESS: at 76.28% examples, 269035 words/s, in_qsiz

2022-11-06 19:15:01,037 : INFO : EPOCH 10 - PROGRESS: at 66.37% examples, 269766 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:02,065 : INFO : EPOCH 10 - PROGRESS: at 67.86% examples, 269741 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:03,072 : INFO : EPOCH 10 - PROGRESS: at 69.28% examples, 269499 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:04,073 : INFO : EPOCH 10 - PROGRESS: at 70.75% examples, 269779 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:05,079 : INFO : EPOCH 10 - PROGRESS: at 72.31% examples, 270337 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:06,118 : INFO : EPOCH 10 - PROGRESS: at 73.89% examples, 270681 words/s, in_qsize 6, out_qsize 0
2022-11-06 19:15:07,136 : INFO : EPOCH 10 - PROGRESS: at 75.35% examples, 270705 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:08,140 : INFO : EPOCH 10 - PROGRESS: at 76.81% examples, 270787 words/s, in_qsize 5, out_qsize 0
2022-11-06 19:15:09,159 : INFO : EPOCH 10 - PROGRESS: at 78.31% examples, 271043 words/s

The unsup_sentences array is passed to the PermuteSentences class, which keeps it as 'self.sents' and assigns a shuffled copy of that list to the variable shuffled. <br>
The resulting object is assigned to the variable permuter. Finally, Doc2Vec is trained on permuter using distributed bag of words, hierarchical softmax, and a vector_size of 50.

This notebook uses gensim version 4.1.2, in which delete_temporary_training_data is no longer available, so the call in the next cell is commented out.

In [63]:
# # done with training, free up memory
# model.delete_temporary_training_data(keep_inference=True)
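
If the same notebook has to run on both older and newer gensim versions, one option is to call the method only when the model object still exposes it. This is a minimal sketch under that assumption: the helper name `free_training_memory` and the two stand-in classes are hypothetical and only mimic the presence or absence of the method, so the snippet runs without gensim.

```python
def free_training_memory(model):
    """Call delete_temporary_training_data only if the model still has it.

    Returns True if the method was called, False if it was skipped.
    """
    cleanup = getattr(model, "delete_temporary_training_data", None)
    if callable(cleanup):
        cleanup(keep_inference=True)
        return True
    return False


# Hypothetical stand-ins: one mimics a model that still has the method,
# the other mimics a model (as in gensim 4.1.2) that no longer has it.
class OldStyleModel:
    def __init__(self):
        self.cleaned = False

    def delete_temporary_training_data(self, keep_inference=True):
        self.cleaned = True


class NewStyleModel:
    pass


print(free_training_memory(OldStyleModel()))  # method exists, so it is called
print(free_training_memory(NewStyleModel()))  # method missing, so it is skipped
```

With the real model, the call would simply be `free_training_memory(model)`.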

In [64]:
# save the model
model.save('reviews.d2v')

2022-11-06 19:15:24,177 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'reviews.d2v', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-11-06T19:15:24.177330', 'gensim': '4.1.2', 'python': '3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19044-SP0', 'event': 'saving'}
2022-11-06 19:15:24,179 : INFO : not storing attribute cum_table
2022-11-06 19:15:25,238 : INFO : saved reviews.d2v


In [65]:
# Use model.infer_vector and the extract_words utility function to infer the vector of a phrase
phase = "This place is not worth your time, let alone Vegas."
phase_extract = extract_words(phase)
vector = model.infer_vector(phase_extract)

In [66]:
# vector

In [67]:
# import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

In [68]:
cosine_similarity(
    [model.infer_vector(extract_words('this place is not worth your time, let alone Vegas.'))],
    [model.infer_vector(extract_words('service sucks.'))]
)

array([[0.36717778]], dtype=float32)

In [69]:
cosine_similarity(
    [model.infer_vector(extract_words('highly recommended.'))],
    [model.infer_vector(extract_words('service sucks.'))]
)

array([[0.29771]], dtype=float32)
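
To make the two numbers above easier to read, here is a minimal sketch of what cosine similarity computes for a single pair of vectors: the dot product divided by the product of the two vector lengths. The function name `cosine` and the toy vectors are illustrative assumptions, not part of the notebook, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made up for illustration)
print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Values close to 1 mean the two vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.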

In [70]:
sentences = []
sentvecs = []
sentiments = []

for fname in ["yelp", "amazon_cells", "imdb"]:
    with open("sentiment labelled sentences/%s_labelled.txt" % fname, encoding='UTF-8') as f:
        for i, line in enumerate(f):
            line_split = line.strip().split('\t')
            sentences.append(line_split[0])
            words = extract_words(line_split[0])
            sentvecs.append(model.infer_vector(words, epochs=10)) # create a vector for this document
            sentiments.append(int(line_split[1]))

# shuffle sentences, sentvecs, sentiments together
combined = list(zip(sentences, sentvecs, sentiments))
random.shuffle(combined)
sentences, sentvecs, sentiments = zip(*combined)


In [71]:
len(combined)

3000


First, the loop iterates over the file names "yelp", "amazon_cells", and "imdb" and opens each fname_labelled.txt file in the "sentiment labelled sentences" folder with UTF-8 encoding. <br>
Each line is stripped and split on the TAB character used in the files, and the sentence part is added to the sentences array. The extract_words function then splits the sentence into words and removes unnecessary symbols.<br>
infer_vector turns the words into a vector, which is added to the sentvecs array; line_split[1] is converted to an integer and added to the sentiments array.<br>

Then zip combines sentences, sentvecs, and sentiments into one list called combined, and the combined list is shuffled so that each sentence stays paired with its vector and label. Finally, the combined list is unpacked back into sentences, sentvecs, and sentiments.
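
The zip, shuffle, and unpack pattern described above can be checked on a tiny example. This is a sketch with made-up toy data; the `toy_` names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import random

# Toy data (made up for illustration): three parallel lists
toy_sentences = ["good phone", "bad food", "great movie"]
toy_sentvecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_sentiments = [1, 0, 1]

# Zip the parallel lists into one list of (sentence, vector, label) triples
toy_combined = list(zip(toy_sentences, toy_sentvecs, toy_sentiments))

# Shuffle the triples, so the order changes but each triple stays intact
random.shuffle(toy_combined)

# Unpack back into three parallel sequences (these are tuples, not lists)
toy_sentences, toy_sentvecs, toy_sentiments = zip(*toy_combined)

for sentence, vector, label in zip(toy_sentences, toy_sentvecs, toy_sentiments):
    print(sentence, vector, label)
```

Note that after unpacking with `zip(*...)` the three names refer to tuples rather than lists, which scikit-learn accepts as input.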

In [72]:
from sklearn.neighbors import KNeighborsClassifier

In [73]:
clf = KNeighborsClassifier(n_neighbors=9)

In [88]:
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, sentvecs, sentiments, cv=5)
print('Mean: %0.3f' % np.mean(scores), '\nStandard deviation: %0.3f'% np.std(scores))

Mean: 0.782 
Standard deviation: 0.022
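
To illustrate what the classifier and the cross-validation step do conceptually, here is a pure-Python sketch of k-nearest-neighbors voting with a simple k-fold evaluation. It is not scikit-learn's implementation: the function names, the toy data, and the interleaved fold assignment are assumptions made for illustration. It uses Euclidean distance and an unweighted majority vote, and requires Python 3.8+ for `math.dist`.

```python
import math
import statistics
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, k, n_folds):
    """Return one accuracy score per fold (interleaved folds, a simplification)."""
    scores = []
    for fold in range(n_folds):
        # Every n_folds-th item, starting at `fold`, is held out for testing
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return scores


# Toy data (made up for illustration): two well-separated clusters
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_scores = cross_val_accuracy(toy_X, toy_y, k=3, n_folds=4)
print("Mean: %0.3f" % statistics.mean(toy_scores))
print("Standard deviation: %0.3f" % statistics.pstdev(toy_scores))
```

As in the cell above, the mean summarizes accuracy across folds and the standard deviation shows how much it varies from fold to fold; `statistics.pstdev` is the population standard deviation, which matches the default behavior of `np.std`.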
