Apache Spark Pet Owners Prediction on YouTube Comments
Before we start, these packages are required for our analysis:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, size, udf
from pyspark.ml.feature import Word2Vec
from pyspark.sql import functions as f
from statsmodels.stats.proportion import proportions_ztest
from pyspark.ml.classification import LogisticRegression, GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
The given data is from YouTube comments that contains creatorID, userID, and Comments.
The goal is trying to identify whether the user is the owner of dogs or cat through the comments. The biggest challenge here is that I had to create the label through the comments by myself so that it would fit into the classification problem.
- Use Spark DataFrame to load the data
- Check for missing values and I found that missing values only occupy a small portion. Second, it is hard to impute the missing value of ID numbers or comments in this dataset so I would drop any NULL value in the future
- Count the number of creators and users for a brief understanding of the data
- Create a label(Y). Find users who own dogs or cats. In my perspective, people who have pets are more likely to say something involving these three elements: title + the number of pets + the type of their pet. For example, (I/My/Our) + (a/2/3/one) + (dog/cat). Hence, I assigned comments with this combination as dog or cats owners. I dealt with missing values later because the labeling step does not affect by the missing values and the code will be clearer if I put into further steps.
- Data preprocessing: Now, I removed the missing values tokenized the comments, and remove stop words so that I could fit the word lists into Word2Vec
- Word2Vec transformation: I transformed the word lists to vectors with set VectorSize of 256. The number 256 is because just by other professional's experience and number with the power of 2 is easy for the algorithm to compute.
Still, we need to get some insights into the data so that we can make strategies from these insights with the assistant with the predictive model.
- See if people who mention a dog or cat are more likely to have them. I did this analysis that if this statement holds, then people who talk about pets and yet to have them are more likely to own pets in the future. We can target them as potential audience.
- How many comments do pets owners leave in general. I wanted to know if there is a chance for me to segment users with different activeness
- Top creators that have most viewers that are pet owners I wanted to find out famous channels so that Youtube can set Ads according to the audience
- Down-sampling: Because I was facing imbalance with rate 1 to 100 so I downsampled it to 50 percent: 50 percent
- I implement LogisticRegression, GBTClassifier, and RandomForestClassifier without tuning the hyperparameters to see which model has the best performance
- Use CrossValidator with Grid Search Concept for Hyper-Parameter Tuning
I had 88% of accuracy the 89% recall on the Logistic Regression. Here I emphasized more on the recall since the goal is to identify the pets owners as possible. With the ability to predict pets onwers, the statements I made could be put those business strategy into practice.