### Youtube comments analysis

In this notebook, we have a dataset of user comments for youtube videos related to animals or pets. We will attempt to identify cat or dog owners based on these comments.

In [0]:
#pip install googledrivedownloader==0.4

Python interpreter will be restarted.
Collecting googledrivedownloader==0.4
  Using cached googledrivedownloader-0.4-py2.py3-none-any.whl (3.9 kB)
Installing collected packages: googledrivedownloader
Successfully installed googledrivedownloader-0.4
Python interpreter will be restarted.


In [0]:
#pip install wordcloud

Python interpreter will be restarted.
Collecting wordcloud
  Downloading wordcloud-1.9.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (513 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.3
Python interpreter will be restarted.


In [0]:
# link: https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing


#### 0. Data Exploration and Cleaning


In [0]:
df_clean=spark.read.csv("/FileStore/tables/animals_comments_csv.gz",inferSchema=True,header=True)
df_clean.show(10)

+--------------------+------+-------------------------------------+
|        creator_name|userid|                              comment|
+--------------------+------+-------------------------------------+
|        Doug The Pug|  87.0|                 I shared this to ...|
|        Doug The Pug|  87.0|                   Super cute  😀🐕🐶|
|         bulletproof| 530.0|                 stop saying get e...|
|       Meu Zoológico| 670.0|                 Tenho uma jiboia ...|
|              ojatro|1031.0|                 I wanna see what ...|
|     Tingle Triggers|1212.0|                 Well shit now Im ...|
|Hope For Paws - O...|1806.0|                 when I saw the en...|
|Hope For Paws - O...|2036.0|                 Holy crap. That i...|
|          Life Story|2637.0|武器はクエストで貰えるんじゃないん...|
|       Brian Barczyk|2698.0|                 Call the teddy Larry|
+--------------------+------+-------------------------------------+
only showing top 10 rows



In [0]:
from pyspark.sql.functions import rand 

df_clean.orderBy(rand(seed=0)).createOrReplaceTempView("table1")
df_clean = spark.sql("select * from table1 limit 1000000")

df_clean.count() 

Out[3]: 1000000

In [0]:
df_clean = df_clean.na.drop(subset=["comment"])
df_clean.count()

Out[4]: 999821

In [0]:
df_clean.show()

+-----------------------+---------+---------------------------------+
|           creator_name|   userid|                          comment|
+-----------------------+---------+---------------------------------+
|         LightningLpsTV|2383838.0|             I dare Dakota to ...|
|        Viktor Larkhill| 348139.0|               damn Im crying now|
|        Einstein Parrot| 585165.0|             Einstein youre so...|
|   REALITY TALK REVIEWS|1579903.0|             Ben is just a pup...|
|       Brave Wilderness| 413490.0|             coyote i wanted t...|
|            Info Marvel|1982636.0|             Quiero un funko p...|
|         Obscure Domain|2508747.0|             This has never be...|
|             The Fatman| 597566.0|             something to thin...|
|꼬부기아빠 My Pet Diary|2107492.0|오늘은 진짜 집사님이 부럽게 느...|
|              GoHerping|1467709.0|             Red eared sliders...|
|          eMusic Talent|1853119.0|             OMG! El segundo n...|
|     Taylor Nicole Dean|  63136.0|   

In [0]:
# find user with preference of dog and cat
from pyspark.sql.functions import when
from pyspark.sql.functions import col

df_clean = df_clean.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))

In [0]:
df_clean.show()

+-----------------------+---------+---------------------------------+-----+
|           creator_name|   userid|                          comment|label|
+-----------------------+---------+---------------------------------+-----+
|         LightningLpsTV|2383838.0|             I dare Dakota to ...|    0|
|        Viktor Larkhill| 348139.0|               damn Im crying now|    0|
|        Einstein Parrot| 585165.0|             Einstein youre so...|    0|
|   REALITY TALK REVIEWS|1579903.0|             Ben is just a pup...|    0|
|       Brave Wilderness| 413490.0|             coyote i wanted t...|    0|
|            Info Marvel|1982636.0|             Quiero un funko p...|    0|
|         Obscure Domain|2508747.0|             This has never be...|    0|
|             The Fatman| 597566.0|             something to thin...|    0|
|꼬부기아빠 My Pet Diary|2107492.0|오늘은 진짜 집사님이 부럽게 느...|    0|
|              GoHerping|1467709.0|             Red eared sliders...|    0|
|          eMusic Talent|18531

#### 1. Data preprocessing and Build the classifier 

In [0]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
word2Vec = Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="wordVector")

In [0]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df_clean)
dataset = pipelineFit.transform(df_clean)

In [0]:
dataset.show()

+-----------------------+---------+---------------------------------+-----+--------------------+--------------------+
|           creator_name|   userid|                          comment|label|               words|          wordVector|
+-----------------------+---------+---------------------------------+-----+--------------------+--------------------+
|         LightningLpsTV|2383838.0|             I dare Dakota to ...|    0|[i, dare, dakota,...|[-0.0084864338859...|
|        Viktor Larkhill| 348139.0|               damn Im crying now|    0|[damn, im, crying...|[-0.0179006536491...|
|        Einstein Parrot| 585165.0|             Einstein youre so...|    0|[einstein, youre,...|[-0.0763257555404...|
|   REALITY TALK REVIEWS|1579903.0|             Ben is just a pup...|    0|[ben, is, just, a...|[-0.1381724602745...|
|       Brave Wilderness| 413490.0|             coyote i wanted t...|    0|[coyote, i, wante...|[-0.0763060057150...|
|            Info Marvel|1982636.0|             Quiero u

In [0]:
(lable0_train,lable0_test)=dataset.filter(col('label')==1).randomSplit([0.7, 0.3],seed = 100)
(lable1_train, lable1_ex)=dataset.filter(col('label')==0).randomSplit([0.005, 0.995],seed = 100)
(lable1_test, lable1_ex2)=lable1_ex.randomSplit([0.002, 0.998],seed = 100)

In [0]:
trainingData = lable0_train.union(lable1_train)
testData=lable0_test.union(lable1_test)

In [0]:
print("Dataset Count: " + str(dataset.count()))
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Dataset Count: 999821
Training Dataset Count: 9788
Test Dataset Count: 4049


#### 2. Models
Logistic Regression

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression(featuresCol="wordVector",labelCol="label" , maxIter=10, regParam=0.1, elasticNetParam=0.8)
lrModel = lr.fit(trainingData)

# Make predictions on test data.
predictions = lrModel.transform(testData)
predictions.show(10)

+--------------------+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|        creator_name|   userid|             comment|label|               words|          wordVector|       rawPrediction|         probability|prediction|
+--------------------+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|                null|1265524.0|if i saw a snake ...|    1|[if, i, saw, a, s...|[-0.0768854211394...|[-0.6479574422907...|[0.34344997054044...|       1.0|
|           278pikelk| 152729.0|I just told my do...|    1|[i, just, told, m...|[-0.1650808578695...|[-0.1937053818619...|[0.45172450873112...|       1.0|
|2CAN.TV - Ripley ...| 231728.0|He acts like my d...|    1|[he, acts, like, ...|[-0.2551775995641...|[-1.1044695141026...|[0.24890337915506...|       1.0|
|      Aarons Animals| 117349.0|I love cats I hav...|    1|[i, love, c

In [0]:
trainingSummary = lrModel.summary


# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show()

+--------------------+--------------------+
|                 FPR|                 TPR|
+--------------------+--------------------+
|                 0.0|                 0.0|
|6.127450980392157E-4|0.001635322976287817|
|8.169934640522876E-4|0.003475061324611611|
|0.001429738562091...|0.004701553556827474|
|0.001633986928104...|0.006745707277187245|
|0.001838235294117647|0.008381030253475062|
|0.002246732026143...|0.009811937857726901|
|0.002450980392156...|0.011447260834014717|
|0.002450980392156...|0.013286999182338511|
|0.002450980392156...|0.015126737530662305|
|0.002655228758169...|  0.0169664758789861|
|0.003063725490196...| 0.01839738348323794|
|0.003267973856209...|0.020032706459525755|
|0.003267973856209...| 0.02187244480784955|
|0.003676470588235294| 0.02391659852820932|
|0.004084967320261438|0.025347506132461162|
| 0.00428921568627451| 0.02698282910874898|
| 0.00428921568627451|0.028822567457072772|
| 0.00428921568627451|0.030662305805396566|
|0.004493464052287581|0.03229762

In [0]:
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

areaUnderROC: 0.8925807240312325


In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator


def get_evaluation_result(predictions):
  evaluator = BinaryClassificationEvaluator(
      labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
  AUC = evaluator.evaluate(predictions)

  TP = predictions[(predictions["label"] == 1) & (predictions["prediction"] == 1.0)].count()
  FP = predictions[(predictions["label"] == 0) & (predictions["prediction"] == 1.0)].count()
  TN = predictions[(predictions["label"] == 0) & (predictions["prediction"] == 0.0)].count()
  FN = predictions[(predictions["label"] == 1) & (predictions["prediction"] == 0.0)].count()

  accuracy = (TP + TN)*1.0 / (TP + FP + TN + FN)
  precision = TP*1.0 / (TP + FP)
  recall = TP*1.0 / (TP + FN)


  print ("True Positives:", TP)
  print ("False Positives:", FP)
  print ("True Negatives:", TN)
  print ("False Negatives:", FN)
  print ("Test Accuracy:", accuracy)
  print ("Test Precision:", precision)
  print ("Test Recall:", recall)
  print ("Test AUC of ROC:", AUC)

print("Prediction result summary for Logistic Regression Model:  ")
get_evaluation_result(predictions)

Prediction result summary for Logistic Regression Model:  
True Positives: 1878
False Positives: 494
True Negatives: 1459
False Negatives: 218
Test Accuracy: 0.824154112126451
Test Precision: 0.7917369308600337
Test Recall: 0.8959923664122137
Test AUC of ROC: 0.8911314751625018


##### Parameter Tuning and K-fold cross-validation

##### Random Forest Model

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="wordVector", numTrees=15)

# Train model.  This also runs the indexers.
model = rf.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.show(10)

print("Prediction result summary for Random Forest Model:  ")
get_evaluation_result(predictions)

+--------------------+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|        creator_name|   userid|             comment|label|               words|          wordVector|       rawPrediction|         probability|prediction|
+--------------------+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|                null|1265524.0|if i saw a snake ...|    1|[if, i, saw, a, s...|[-0.0768854211394...|[5.05177563308591...|[0.33678504220572...|       1.0|
|           278pikelk| 152729.0|I just told my do...|    1|[i, just, told, m...|[-0.1650808578695...|[2.59045922781202...|[0.17269728185413...|       1.0|
|2CAN.TV - Ripley ...| 231728.0|He acts like my d...|    1|[he, acts, like, ...|[-0.2551775995641...|[3.32790614663259...|[0.22186040977550...|       1.0|
|      Aarons Animals| 117349.0|I love cats I hav...|    1|[i, love, c

#### 3. Classify All The Users

In [0]:
import pyspark.sql.functions as F
# get dataset for prediction (note to exclude people we already know the label)
df_unknown = dataset.filter(F.col('label') == False)
df_unknown = df_unknown.withColumn('label',df_unknown.label.cast('integer'))
print("There are {} users whose attribute is unclear.".format(df_unknown.count()))
pred_all = model.transform(df_unknown)
pred_all.show(10)

There are 992833 users whose attribute is unclear.
+-----------------------+---------+---------------------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|           creator_name|   userid|                          comment|label|               words|          wordVector|       rawPrediction|         probability|prediction|
+-----------------------+---------+---------------------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|         LightningLpsTV|2383838.0|             I dare Dakota to ...|    0|[i, dare, dakota,...|[-0.0084864338859...|[9.73575870205387...|[0.64905058013692...|       0.0|
|        Viktor Larkhill| 348139.0|               damn Im crying now|    0|[damn, im, crying...|[-0.0179006536491...|[10.4466641211398...|[0.69644427474265...|       0.0|
|        Einstein Parrot| 585165.0|             Einstein youre so...|    0|[einstein, youre,..

In [0]:
#number of total user
total_user = dataset.select('userid').distinct().count()
#number of labeled owner
owner_labeled = dataset.select('userid').distinct().count() 
#number of owner predicted
owner_pred = pred_all.filter(F.col('prediction') == 1.0).count()

fraction = (owner_labeled+owner_pred)/total_user
print('Fraction of the users who are cat/dog owners (ML estimate): ', round(fraction,3))

Fraction of the users who are cat/dog owners (ML estimate):  1.251


#### 4. Get insigts of Users

In [0]:
from pyspark.ml.feature import StopWordsRemover

df_all_owner = dataset.select('words').union(pred_all.filter(F.col('prediction') == 1.0).select('words'))

stopwords_custom = ['im', 'get', 'got', 'one', 'hes', 'shes', 'dog', 'dogs', 'cats', 'cat', 'kitty', 'much', 'really', 'love','like','dont','know','want','thin',\
                    'see','also','never','go','ive']

remover1 = StopWordsRemover(inputCol="raw", outputCol="filtered")
core = remover1.getStopWords()
core = core + stopwords_custom
remover = StopWordsRemover(inputCol="words", outputCol="filtered",stopWords=core)
df_all_owner = remover.transform(df_all_owner)

wc = df_all_owner.select('filtered').rdd.flatMap(lambda a: a.filtered).countByValue()

df_all_owner.show(1)

+--------------------+--------------------+
|               words|            filtered|
+--------------------+--------------------+
|[i, dare, dakota,...|[dare, dakota, pr...|
+--------------------+--------------------+
only showing top 1 row



In [0]:
wcSorted = sorted(wc.items(), key=lambda kv: kv[1],reverse = True)
wcSorted

Out[22]: [('video', 64795),
 ('good', 51323),
 ('people', 46221),
 ('cute', 40752),
 ('great', 39885),
 ('videos', 38589),
 ('u', 38494),
 ('think', 38333),
 ('animals', 35791),
 ('time', 34370),
 ('lol', 33660),
 ('coyote', 30644),
 ('make', 30294),
 ('thank', 29146),
 ('3', 28040),
 ('little', 27669),
 ('keep', 26591),
 ('day', 25574),
 ('even', 25515),
 ('happy', 25021),
 ('hope', 24746),
 ('always', 24520),
 ('please', 24407),
 ('n', 23912),
 ('channel', 23770),
 ('going', 23663),
 ('thats', 23656),
 ('cant', 23515),
 ('2', 23429),
 ('back', 23101),
 ('well', 22829),
 ('need', 22731),
 ('new', 22674),
 ('nice', 22297),
 ('first', 22041),
 ('beautiful', 21854),
 ('looks', 21850),
 ('way', 21832),
 ('look', 21684),
 ('thanks', 21493),
 ('name', 21248),
 ('m', 21236),
 ('take', 21096),
 ('best', 20947),
 ('amazing', 20822),
 ('ever', 20819),
 ('awesome', 20797),
 ('life', 20778),
 ('guys', 20011),
 ('man', 19846),
 ('1', 19754),
 ('still', 19425),
 ('thing', 18912),
 ('work', 18617),


In [0]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join([(k + " ")*v for k, v in wc.items()])

wcloud = WordCloud(background_color="white", max_words=20000, collocations=False,
                   contour_width=3, contour_color='steelblue', max_font_size=40)

# Generate a word cloud image
wcloud.generate(text)

# Display the generated image:
# the matplotlib way:
fig,ax0=plt.subplots(nrows=1,figsize=(12,8))
ax0.imshow(wcloud,interpolation='bilinear')

ax0.axis("off")
display(fig)

## not a lot of obvious features

[0;31m---------------------------------------------------------------------------[0m
[0;31mValueError[0m                                Traceback (most recent call last)
File [0;32m<command-4295172853616343>:10[0m
[1;32m      6[0m wcloud [38;5;241m=[39m WordCloud(background_color[38;5;241m=[39m[38;5;124m"[39m[38;5;124mwhite[39m[38;5;124m"[39m, max_words[38;5;241m=[39m[38;5;241m20000[39m, collocations[38;5;241m=[39m[38;5;28;01mFalse[39;00m,
[1;32m      7[0m                    contour_width[38;5;241m=[39m[38;5;241m3[39m, contour_color[38;5;241m=[39m[38;5;124m'[39m[38;5;124msteelblue[39m[38;5;124m'[39m, max_font_size[38;5;241m=[39m[38;5;241m40[39m)
[1;32m      9[0m [38;5;66;03m# Generate a word cloud image[39;00m
[0;32m---> 10[0m wcloud[38;5;241m.[39mgenerate(text)
[1;32m     12[0m [38;5;66;03m# Display the generated image:[39;00m
[1;32m     13[0m [38;5;66;03m# the matplotlib way:[39;00m
[1;32m     14[0m fig,ax0[38;5;241m=[39

#### 5. Identify Creators With Cat And Dog Owners In The Audience


In [0]:
#Get all creators whenever the users label is True(cat/dog owner)
df_create = dataset.select('creator_name').union(pred_all.filter(F.col('prediction') == 1.0).select('creator_name'))

df_create.createOrReplaceTempView("create_table")

#get count
create_count = spark.sql("select distinct creator_name, count(*) as name\
                          from create_table \
                          group by creator_name \
                          order by name DESC")

create_count.show()

+-----------------------+------+
|           creator_name|  name|
+-----------------------+------+
|       Brave Wilderness|198449|
|          Brian Barczyk| 75912|
|               The Dodo| 70884|
|     Taylor Nicole Dean| 50480|
|   Hope For Paws - O...| 27141|
|           Robin Seplut| 25595|
|              Vet Ranch| 22350|
|            Info Marvel| 20960|
|        Gohan The Husky| 20838|
|               ViralHog| 18274|
|꼬부기아빠 My Pet Diary| 17658|
|        Viktor Larkhill| 17459|
|      Talking Kitty Cat| 14667|
|              MonkeyBoo| 13827|
|    Keedes channel LIVE| 12570|
|     Think Like A Horse| 11953|
|   Gone to the Snow ...| 11523|
|               Mạnh CFM| 11272|
|       Cole & Marmalade| 10640|
|           Mr. Max T.V.| 10485|
+-----------------------+------+
only showing top 20 rows



#### 6. Analysis and Future work


Only part of the dataset was used due to the lack of computation power for the entire dataset, which could be the reason that output features and topics do not seem to be much related. Also, model fine tuning can be done to improve the model performances, which could also help with getting better results.
