### YouTube comments analysis

In this notebook, we work with a dataset of user comments on YouTube videos related to animals or pets. We attempt to identify cat or dog owners based on these comments, find the topics that matter to them, and then identify the video creators whose audiences contain the most cat or dog owners.

The dataset provided for this coding test consists of comments on videos related to animals and/or pets. It is 240MB compressed and can be downloaded here:
https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing

The dataset file is comma-separated, with a header line defining the field names listed here:
 
● creator_name. Name of the YouTube channel creator.
● userid. Integer identifier for the users commenting on the YouTube channels.
● comment. Text of the comments made by the users.

Step 1: Identify Cat And Dog Owners

● Find the users who are cat and/or dog owners.

Step 2: Build And Evaluate Classifiers

● Build classifiers for the cat and dog owners and measure the performance of the classifiers.

Step 3: Classify All The Users

● Apply the cat/dog classifiers to all the users in the dataset. 
● Estimate the fraction of all users who are cat/dog owners.

Step 4: Extract Insights About Cat And Dog Owners

● Find topics popular among cat and dog owners.

Step 5: Identify Creators With Cat And Dog Owners In The Audience

● Find creators with the most cat and/or dog owners in the audience. Find creators with the highest statistically significant percentages of cat and/or dog owners.

### 1. Import functions

In [4]:
%sh 
pip install nltk
pip install --upgrade pip
python -m nltk.downloader all

In [5]:
from pyspark.sql.functions import when, col, explode, udf  # udf is used in the LDA section
from pyspark.ml.feature import RegexTokenizer, Word2Vec,StopWordsRemover,CountVectorizer
from pyspark.ml.clustering import LDA
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator 
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from sklearn.metrics import confusion_matrix,roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

### 2. Data Exploration and Cleaning

* Look at the data

In [8]:
df_clean=spark.read.csv("/FileStore/tables/animals_comments.csv",inferSchema=True,header=True)
df_clean.show(10)

In [9]:
df_clean = df_clean.na.drop(subset=["comment"])
df_clean_cnt = df_clean.count()
print('Total number of comments is :',df_clean_cnt)

* Label the data

In [11]:
# Find users who mention owning a dog or cat and label them 1, otherwise 0.
df_clean = df_clean.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))
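
One caveat of the rules above: Spark's `like` is case-sensitive, so a comment that starts with "My dog" is not labeled, and `%my pup%` already covers `%my puppy%`. The sketch below shows a case-insensitive variant in plain Python. `OWNER_PATTERN` and `is_owner_comment` are hypothetical names, the phrase list is only a subset of the rules above, and the added word boundary is an assumption that makes the match slightly stricter than `like`. It is an illustration, not the labeling code this notebook ran.

```python
import re

# Hypothetical helper: case-insensitive version of a subset of the LIKE rules.
# The leading \b is an added assumption so that e.g. "tommy dog" does not match.
OWNER_PATTERN = re.compile(
    r"\b(my (dog|cat|pup|kitty)|i have a (dog|cat))",
    re.IGNORECASE,
)

def is_owner_comment(text):
    """Return 1 if the comment contains an ownership phrase, else 0."""
    return 1 if OWNER_PATTERN.search(text or "") else 0

examples = ["My dog loves this video", "i have a cat too", "what a cute dog"]
labels = [is_owner_comment(c) for c in examples]
print(labels)
```

In Spark, a similar effect could probably be achieved with `rlike` and an inline `(?i)` flag, at the cost of re-running the labeling step.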

In [12]:
df_clean.show(5)

In [13]:
label1_cnt = df_clean.filter(col('label')==1).count()
label0_cnt = df_clean.filter(col('label')==0).count()
print('label 1 count:', label1_cnt)
print('label 0 count:', label0_cnt)
print('label 0 count/label 1 count:', label0_cnt // label1_cnt)

### 2.1 Process data

In [15]:
# Establish pipeline and fit data
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
word2Vec = Word2Vec(inputCol="words", outputCol="features")
pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

pipelineFit = pipeline.fit(df_clean) # ~ 30 min
alldata = pipelineFit.transform(df_clean)
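
For intuition about the `features` column: Spark's `Word2VecModel` turns each comment into a single vector by averaging the vectors of its words. The toy example below reproduces that averaging in plain Python. The 2-dimensional `toy_vectors` are made up for illustration (real vectors are learned and have `vectorSize` dimensions, 100 by default), and `average_vector` is a hypothetical helper, not part of Spark.

```python
# Toy word vectors (made up for illustration; real ones are learned by Word2Vec).
toy_vectors = {
    "my": [0.2, 0.0],
    "dog": [1.0, 0.4],
    "barks": [0.6, 0.8],
}

def average_vector(words, vectors):
    """Average the vectors of the given words, dimension by dimension."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [t / len(words) for t in total]

features = average_vector(["my", "dog", "barks"], toy_vectors)
print(features)
```

Because word order is lost in the average, "my dog" and "dog my" produce the same feature vector.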

In [16]:
alldata.show(5)

Due to running time and memory limits, we use a small training and test set.

In [18]:
(lable1_train,lable1_test)=alldata.filter(col('label')==1).randomSplit([0.1, 0.9],seed = 42)[0].randomSplit([0.7, 0.3],seed = 100)
(lable0_train, lable0_test)=alldata.filter(col('label')==0).randomSplit([0.001, 0.999],seed = 42)[0].randomSplit([0.7, 0.3],seed = 100)

In [19]:
print('Number of label1_train, label0_train, label1_test, label0_test:')
print(lable1_train.count(),lable0_train.count(), lable1_test.count(), lable0_test.count())

In [20]:
training = lable0_train.union(lable1_train)
test=lable0_test.union(lable1_test)

### 3 Build the Classifier

### 3.1 Logistic Regression

In [23]:
# Build estimator, LR model
lr = LogisticRegression(maxIter=20)

# Build parameter grid
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
print ("Num models to be tested: ", len(lr_paramGrid))

# Build the evaluator
evaluator = BinaryClassificationEvaluator()
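
The grid above combines 2 values of `regParam` with 3 values of `elasticNetParam`, giving 6 candidate models. With `numFolds=2` in the cross-validation helper defined in the next cell, that means 12 fits during the search, plus one final refit of the best setting on the full training set. The sketch below just enumerates the combinations in plain Python to make the cost explicit; the variable names are illustrative.

```python
from itertools import product

reg_params = [0.1, 0.01]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 2  # matches numFolds in crossval_train

# Every combination of the two hyperparameters is one candidate model.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

# Each candidate is fit once per fold; +1 for the final refit of the best model.
total_fits = len(grid) * num_folds + 1
print(len(grid), total_fits)
```

This is worth keeping in mind when adding grid values: each new value multiplies the number of fits, which matters given the running times noted in this notebook.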

In [24]:
# Function1: Train model
def crossval_train(estimator,paramGrid,evaluator,training):
  crossval = CrossValidator(estimator=estimator,
                            estimatorParamMaps=paramGrid,
                            evaluator=evaluator,
                            numFolds=2)
  cv_model = crossval.fit(training)
  return cv_model


# Function2: Predict
def predict_from_model(model,test):
  best_model = model.bestModel
  prediction = best_model.transform(test)
  return best_model,prediction


# Function3 Evaluate model performance
def evaluate_model(prediction):
  # BinaryClassificationEvaluator has two metrics: areaUnderROC (the default) and areaUnderPR
  evaluator = BinaryClassificationEvaluator()
  areaUnderROC = evaluator.evaluate(prediction)
  areaUnderPR = evaluator.setMetricName("areaUnderPR").evaluate(prediction)
  
  # MulticlassClassificationEvaluator: metrics are weighted averages over both classes
  evaluator_multi = MulticlassClassificationEvaluator()
  f1 = evaluator_multi.setMetricName("f1").evaluate(prediction)
  weightedPrecision = evaluator_multi.setMetricName("weightedPrecision").evaluate(prediction)
  weightedRecall = evaluator_multi.setMetricName("weightedRecall").evaluate(prediction)
  accuracy = evaluator_multi.setMetricName("accuracy").evaluate(prediction)
  
  return [f1,weightedPrecision,weightedRecall,accuracy,areaUnderROC,areaUnderPR]


# Function4 Print model performance
def print_model_performance(result):
  print ('**Model Evaluation**')  # this helper is reused for LR, RF and GBT
  print(' F1:',result[0])
  print(' Precision:',result[1])
  print(' Recall:',result[2])
  print(' Accuracy:',result[3])
  print(' AreaUnderROC:',result[4])
  print(' AreaUnderPR:',result[5])
  
# Function5 Select partial data
def select(prediction):
  selected_pd = prediction.select('userid', 'comment', 'label','probability','prediction').toPandas()
  selected_pd['probability_of_pos'] = selected_pd['probability'].apply(lambda x: x[1])
  selected_pd['prediction'] = selected_pd['prediction'].apply(lambda x: int(x))
  return selected_pd 

# Function6 Generate confusion matrix and ROC
def metrics(selected_pd):
  cm = confusion_matrix(selected_pd['label'], selected_pd['prediction'])/len(selected_pd) # normalize by the total number of rows
  fpr,tpr,thres = roc_curve(selected_pd['label'],selected_pd['probability_of_pos'],pos_label=1)
  return cm, fpr,tpr

# Function7 Plot confusion matrix
def plot_cm(cm,model_name):
  cm_pd_df = pd.DataFrame(cm, range(2), range(2))
  fig,ax = plt.subplots(figsize = (5,4))
  sns.heatmap(cm_pd_df,cmap = 'Spectral_r',annot = True , square = True)
  plt.xlabel('Predicted')
  plt.ylabel('Actual')
  plt.title('Confusion Matrix of {}'.format(model_name))
  display()
  
# Function8 Plot ROC curve
def plot_roc(fpr,tpr,model_name):  
  fig,ax = plt.subplots(figsize = (5,4))
  # Pass x and y by keyword; recent seaborn versions no longer accept them positionally
  sns.lineplot(x=[0, 1], y=[0, 1],color = 'black',ax = ax)
  sns.lineplot(x=fpr, y=tpr, color = 'purple',ax = ax)
  plt.xlabel('False positive rate')
  plt.ylabel('True positive rate')
  plt.title('ROC curve of {}'.format(model_name))
  plt.legend(loc='best')
  display()

# Function9 Plot cm and ROC curve
def plot_cm_roc(cm,fpr,tpr,model_name):
  cm_pd_df = pd.DataFrame(cm, range(2), range(2))
  fig,ax = plt.subplots(1,2,figsize = (10,4))
  sns.heatmap(cm_pd_df,cmap = 'Spectral_r',annot = True , square = True,ax = ax[0])
  # Pass x and y by keyword; recent seaborn versions no longer accept them positionally
  sns.lineplot(x=[0, 1], y=[0, 1],color = 'black',ax = ax[1])
  sns.lineplot(x=fpr, y=tpr, color = 'purple',ax = ax[1])
  ax[0].set_xlabel('Predicted')
  ax[0].set_ylabel('Actual')
  ax[1].set_xlabel('False positive rate')
  ax[1].set_ylabel('True positive rate')
  ax[0].set_title('Confusion Matrix of {}'.format(model_name))
  ax[1].set_title('ROC curve of {}'.format(model_name))
  display()

In [25]:
lr_model = crossval_train(lr,lr_paramGrid,evaluator,training) # runs ~27 min
best_lr_model,lr_prediction = predict_from_model(lr_model,test)
lr_result= evaluate_model(lr_prediction)

In [26]:
#Print best model parameters and model performance
print ('**Best LR Model**')
print (' RegParam:',best_lr_model._java_obj.parent().getRegParam()) #parent()method will return an estimator, get the best params 
print (' ElasticNetParam:',best_lr_model._java_obj.parent().getElasticNetParam())

print_model_performance(lr_result)

In [27]:
# plot confusion matrix and ROC curve
lr_selected_pd = select(lr_prediction)
lr_cm, lr_fpr,lr_tpr = metrics(lr_selected_pd)
plot_cm_roc(lr_cm,lr_fpr,lr_tpr,'LR')

### 3.2 RandomForest

In [29]:
rf = RandomForestClassifier()
rf_paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [5,10])
             .addGrid(rf.maxBins, [20])
             .addGrid(rf.numTrees, [5,10,20])
             .build())
print ("Num models to be tested: ", len(rf_paramGrid))

In [30]:
rf_model = crossval_train(rf,rf_paramGrid,evaluator,training) # runs ~ 32 min
best_rf_model,rf_prediction = predict_from_model(rf_model,test)
rf_result= evaluate_model(rf_prediction)

In [31]:
#Print model parameters
print ('**Best RF Model**')
print (' MaxDepth:',best_rf_model._java_obj.parent().getMaxDepth())
print (' MaxBins:',best_rf_model._java_obj.parent().getMaxBins()) #parent()method will return an estimator, get the best params 
print (' NumTrees:',best_rf_model._java_obj.parent().getNumTrees())

print_model_performance(rf_result)

In [32]:
# plot confusion matrix and ROC curve
rf_selected_pd = select(rf_prediction)
rf_cm, rf_fpr,rf_tpr = metrics(rf_selected_pd)
plot_cm_roc(rf_cm,rf_fpr,rf_tpr,'RF')

In [33]:
best_rf_model.featureImportances  # a property, not a method

### 3.3 Gradient boosting trees

In [35]:
## Unable to implement paramGrid for GBT due to OutofMemory Error
'''
gbt = GBTClassifier(maxIter=3)
gbt_paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 5])
             .addGrid(gbt.maxBins, [10, 20])
             .build())

gbt_model = crossval_train(gbt,gbt_paramGrid,evaluator,training)  # runs 51.44 min
best_gbt_model,gbt_prediction = predict_from_model(gbt_model,test)
gbt_result= evaluate_model(gbt_prediction)
'''

## Print model parameters
'''
print ('**Best GBT Model**')
print (' MaxDepth:',best_gbt_model._java_obj.parent().getMaxDepth())
print (' MaxBins:',best_gbt_model._java_obj.parent().getMaxBins()) 
'''

## Print model performance
'''
print_model_performance(gbt_result)
'''

In [36]:
## Unable to implement paramGrid for GBT due to OutofMemory Error
gbt = GBTClassifier(maxIter=20,maxDepth=10,maxBins=20) #36mins
gbt_model = gbt.fit(training)
gbt_prediction = gbt_model.transform(test)
gbt_result= evaluate_model(gbt_prediction)

In [37]:
print_model_performance(gbt_result)

In [38]:
gbt_selected_pd = select(gbt_prediction)
gbt_cm, gbt_fpr,gbt_tpr = metrics(gbt_selected_pd)
plot_cm_roc(gbt_cm,gbt_fpr,gbt_tpr, 'GBT')

### 3.4 Compare between models

In [40]:
def plot_cm_all(cm_list,model_name_list):
  fig,ax = plt.subplots(1,len(cm_list),figsize = (12,4),sharey = True)
  count = 0
  cbar_ax = fig.add_axes([.03, .2, .01, .5])
  while count < len(cm_list):
    cm = cm_list[count]
    cm_pd_df = pd.DataFrame(cm, range(2), range(2))
    ax[count].set_title(model_name_list[count])
    sns.heatmap(cm_pd_df,cmap = 'Spectral_r',annot = True , square = True, ax = ax[count],cbar = count==0,cbar_ax = cbar_ax if count == 0 else None)
    if count == 0:
      ax[count].set_xlabel('Predicted')
      ax[count].set_ylabel('Actual')
    count+=1
  display()
  
  
def plot_roc_all(fpr_list,tpr_list,model_name_list): 
  color_list = ['blue','red','purple']
  fig,ax = plt.subplots(figsize = (5,4))
  sns.lineplot(x=[0, 1], y=[0, 1],color = 'black',ax = ax)
  count = 0
  while count < len(fpr_list):
    fpr = fpr_list[count]
    tpr = tpr_list[count]
    model_name = model_name_list[count]
    color = color_list[count]
    sns.lineplot(x=fpr, y=tpr, color = color,label = model_name,ax = ax)
    count += 1
  plt.xlabel('False positive rate')
  plt.ylabel('True positive rate')
  plt.title('ROC curve')
  plt.legend(loc='best')
  display()
  
def print_model_performance_all(LR_result,RF_result,GBT_result):
  index = ['F1','Precision','Recall','Accuracy','AreaUnderROC','AreaUnderPR']
  result_df = pd.DataFrame(list(zip(LR_result,RF_result,GBT_result)))
  result_df.columns = ['LR','RF','GBT']
  result_df.index = index
  return result_df

In [41]:
print_model_performance_all(lr_result,rf_result,gbt_result).head(6)

| Metric | LR | RF | GBT |
|---|---|---|---|
| F1 | 0.888545 | 0.906489 | 0.841065 |
| Precision | 0.888998 | 0.907755 | 0.849236 |
| Recall | 0.888328 | 0.906155 | 0.839892 |
| Accuracy | 0.888328 | 0.906155 | 0.839892 |
| AreaUnderROC | 0.952307 | 0.959084 | 0.935405 |
| AreaUnderPR | 0.913497 | 0.931339 | 0.907093 |
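
The Recall and Accuracy values are identical for every model. This is expected rather than a coincidence: `weightedRecall` weights each class's recall by that class's share of the samples, which algebraically reduces to overall accuracy. The check below demonstrates this in plain Python on a made-up 2x2 confusion matrix; the counts are illustrative and not taken from this notebook.

```python
# Rows = actual class, columns = predicted class (made-up counts).
cm = [[80, 20],
      [10, 90]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

weighted_recall = 0.0
for i, row in enumerate(cm):
    support = sum(row)            # number of actual samples in class i
    recall_i = row[i] / support   # per-class recall
    weighted_recall += (support / total) * recall_i

print(accuracy, weighted_recall)
```

As a consequence, the weighted recall row adds no information beyond accuracy; recall for the owner class alone would say more about how many owners the model actually finds.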


In [42]:
cm_list = [lr_cm,rf_cm,gbt_cm]
fpr_list = [lr_fpr,rf_fpr, gbt_fpr]
tpr_list = [lr_tpr,rf_tpr, gbt_tpr]
model_name_list = ['LR','RF','GBT']
plot_cm_all(cm_list,model_name_list)

In [43]:
plot_roc_all(fpr_list,tpr_list,model_name_list)

We choose the random forest model to classify all users because it scores highest on every metric in the comparison above (for example, an F1 of 0.906 and an area under the ROC curve of 0.959).

### 4. Classify All The Users

In [46]:
rf_prediction_all = best_rf_model.transform(alldata)

In [47]:
cnt1 = rf_prediction_all.filter(col('prediction')==1).count()
cnt0 = rf_prediction_all.filter(col('prediction')==0).count()
print('Overall, there are {} cat/dog owners and {} non cat/dog owners.'.format(cnt1,cnt0))
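
Note that the counts above are per comment, not per user: a user who comments several times is counted several times. A minimal sketch of a user-level estimate follows, assuming a user counts as an owner when at least one of their comments is predicted 1. Both that rule and the sample rows are assumptions made for illustration.

```python
# (userid, prediction) pairs, made up for illustration.
rows = [(1, 0), (1, 1), (2, 0), (3, 0), (3, 0), (4, 1)]

owner_flag = {}
for userid, prediction in rows:
    # A user is flagged as an owner if any of their comments is predicted 1.
    owner_flag[userid] = max(owner_flag.get(userid, 0), prediction)

owner_fraction = sum(owner_flag.values()) / len(owner_flag)
print(owner_fraction)
```

In Spark, the same aggregation could be expressed as a `groupBy('userid')` followed by a `max` over `prediction`, and the resulting fraction may differ from the per-comment one.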

In [48]:
fig,ax = plt.subplots(figsize = (5,4))
ax.pie([cnt1,cnt0],labels = ['owners', 'non-owners'],startangle=90,counterclock=False,autopct='%1.0f%%')
ax.set_title('Cat/dog owner percentage')
display()

### 5. Find the most popular topics among cat/dog owners (LDA)

In [50]:
stopword_list = stopwords.words('english')
stopword_list.extend(['dogs','cats'])

In [51]:
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
remover = StopWordsRemover(inputCol='words', outputCol='words_clean',stopWords = stopword_list)
vectorizer = CountVectorizer(inputCol='words_clean', outputCol="features")
pipeline_lda = Pipeline(stages=[regexTokenizer,remover,vectorizer])
pipeline_lda_model = pipeline_lda.fit(df_clean)
df_lda= pipeline_lda_model.transform(df_clean)

In [52]:
# choose only cat/dog owners
df_lda_1 = df_lda.filter(col('label')==1)

In [53]:
# choose Non cat/dog owners
df_lda_0 = df_lda.filter(col('label')==0)

In [54]:
lda = LDA(k=5, maxIter=20) # 4 mins
lda1_model = lda.fit(df_lda_1)

In [55]:
lda0 = LDA(k=5, maxIter=5)
lda0_model = lda0.fit(df_lda_0)

In [56]:
topics_1 = lda1_model.describeTopics(maxTermsPerTopic = 6)
topics_0 = lda0_model.describeTopics(maxTermsPerTopic = 6)
topics_1.show()

In [57]:
topics_0.show()

In [58]:
vectorizer_model = pipeline_lda_model.stages[2]
vocab_list = vectorizer_model.vocabulary

In [59]:
id_to_str = udf(lambda x : [vocab_list[i] for i in x])
round_term_weights = udf(lambda x : [round(i,5) for i in x])

topics_1 = topics_1 \
           .withColumn("terms", id_to_str("termIndices")) \
           .withColumn("weights", round_term_weights("termWeights")) \
           .select("topic", "terms", "weights")

topics_0 = topics_0 \
           .withColumn("terms", id_to_str("termIndices")) \
           .withColumn("weights", round_term_weights("termWeights")) \
           .select("topic", "terms", "weights")

Popular topics among cat/dog owners

In [61]:
topics_1.toPandas().head(10)

| topic | terms | weights |
|---|---|---|
| 0 | [dog, like, cat, would, looks, one] | [0.0023, 0.00198, 9.3E-4, 7.6E-4, 7.2E-4, 5.9E-4] |
| 1 | [dog, cat, dont, like, im, one] | [8.5E-4, 5.0E-4, 4.1E-4, 4.0E-4, 3.6E-4, 3.1E-4] |
| 2 | [cat, like, looks, love, dog, im] | [0.00176, 6.3E-4, 3.5E-4, 2.6E-4, 2.5E-4, 1.9E-4] |
| 3 | [cat, like, im, love, much, think] | [4.2E-4, 1.1E-4, 8.0E-5, 7.0E-5, 6.0E-5, 6.0E-5] |
| 4 | [dog, cat, like, love, one, get] | [0.02612, 0.01486, 0.00984, 0.0058, 0.00546, 0... |


Popular topics among non-owners of cats/dogs

In [63]:
topics_0.toPandas().head(10)

| topic | terms | weights |
|---|---|---|
| 0 | [like, one, video, would, great, im] | [0.01573, 0.00853, 0.00763, 0.00734, 0.00631, ... |
| 1 | [cute, love, que, de, dont, el] | [0.0122, 0.00953, 0.00752, 0.00727, 0.00578, 0... |
| 2 | [love, good, im, like, videos, get] | [0.01955, 0.01063, 0.00996, 0.00991, 0.00908, ... |
| 3 | [coyote, 3, im, love, like, wow] | [0.01724, 0.01347, 0.01017, 0.00982, 0.00659, ... |
| 4 | [love, n, u, c, lol, ng] | [0.01731, 0.01573, 0.01039, 0.00985, 0.00679, ... |


### 6. Identify Creators With the Most Cat And Dog Owners In The Audience

In [65]:
creators = rf_prediction_all \
           .filter(rf_prediction_all.prediction==1) \
           .groupBy('creator_name','userid') \
           .count() \
           .withColumnRenamed('count','number of comments') \
           .orderBy('number of comments',ascending = False) 

Active users - Top 10 owners who comment most

In [67]:
creators.show(10)

In [68]:
owners_under_creators = creators \
                        .groupBy('creator_name') \
                        .count() \
                        .withColumnRenamed('count','number of owners') \
                        .orderBy('number of owners',ascending = False)  # sort by the renamed column; 174148 owners in total

Top 10 creators with the most owners

In [70]:
owners_under_creators.show(10)
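
The ranking above uses raw owner counts, which favors large channels, while raw percentages would favor channels with only a handful of commenters. One common way to rank by percentage while accounting for sample size is the lower bound of the Wilson score interval. This is a swapped-in technique for illustration, not something computed in this notebook. The sketch assumes that for each creator we know the number of predicted owners and the total number of commenting users; `wilson_lower_bound`, the creator names and the counts are all hypothetical.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# Hypothetical (creator, predicted owners, total commenting users).
toy_creators = [("creator_a", 3, 4), ("creator_b", 400, 1000), ("creator_c", 30, 200)]

# Rank by the lower bound, so small samples are penalized for their uncertainty.
ranked = sorted(toy_creators, key=lambda c: wilson_lower_bound(c[1], c[2]), reverse=True)
for name, owners, users in ranked:
    print(name, round(wilson_lower_bound(owners, users), 3))
```

With these toy numbers, the creator with 3 owners out of 4 users has the highest raw percentage but ranks below the creator with 400 out of 1000, because four users are too few to support a confident estimate.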

### Summary

In this project, we analyzed a dataset of user comments on YouTube videos related to pets (cats and dogs). The goal was to identify users with pets and the topics that interest them.

To achieve this goal, we first looked at the data, removed rows with missing comments, and labeled the data. Specifically, we labeled users whose comments contain phrases such as 'my dog', 'I have a dog', or 'I have a cat' as pet owners. This rule inevitably misses some owners, so we trained a model in the hope of identifying more of them.

Before training, we converted the comment text to feature vectors using RegexTokenizer and Word2Vec in Spark ML. We then trained logistic regression, random forest, and gradient-boosted tree models. LR and RF gave good performance as measured by accuracy, F1 score, area under the ROC curve, and area under the PR curve. In addition, we extracted the confusion matrix and ROC curve to visualize model performance. Considering overall performance, we selected the RF model to apply to the full dataset.

We found that only about 11% of the total users are classified as pet owners, so we concluded that most users do not have a cat or dog. We therefore believe it would be helpful to find topics that interest these users in order to gain more views on the videos. This was implemented using Latent Dirichlet Allocation (LDA), and the resulting topics are displayed in the results above.