Apache-Spark-Pet-Owners-Prediction-on-YouTube-Comments

Apache Spark Pet Owners Prediction on YouTube Comments

notebook link: https://drive.google.com/file/d/142RsBn70xTvK8J_vJdRrfZp_r-NtPcvm/view?usp=sharing

Prerequisites

Before we start, these packages are required for our analysis:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, size, udf
from pyspark.ml.feature import Word2Vec
from pyspark.sql import functions as f
from statsmodels.stats.proportion import proportions_ztest
from pyspark.ml.classification import LogisticRegression, GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

Situation:

The given data is from YouTube comments that contains creatorID, userID, and Comments.

Task:

The goal is trying to identify whether the user is the owner of dogs or cat through the comments. The biggest challenge here is that I had to create the label through the comments by myself so that it would fit into the classification problem.

Action:

1. Data Exploration and Cleaning

Use Spark DataFrame to load the data
Check for missing values and I found that missing values only occupy a small portion. Second, it is hard to impute the missing value of ID numbers or comments in this dataset so I would drop any NULL value in the future
Count the number of creators and users for a brief understanding of the data

2. Data preprocessing building classifier

Create a label(Y). Find users who own dogs or cats. In my perspective, people who have pets are more likely to say something involving these three elements: title + the number of pets + the type of their pet. For example, (I/My/Our) + (a/2/3/one) + (dog/cat). Hence, I assigned comments with this combination as dog or cats owners. I dealt with missing values later because the labeling step does not affect by the missing values and the code will be clearer if I put into further steps.
Data preprocessing: Now, I removed the missing values tokenized the comments, and remove stop words so that I could fit the word lists into Word2Vec
Word2Vec transformation: I transformed the word lists to vectors with set VectorSize of 256. The number 256 is because just by other professional's experience and number with the power of 2 is easy for the algorithm to compute.

3. Get insights of Data

Still, we need to get some insights into the data so that we can make strategies from these insights with the assistant with the predictive model.

See if people who mention a dog or cat are more likely to have them. I did this analysis that if this statement holds, then people who talk about pets and yet to have them are more likely to own pets in the future. We can target them as potential audience.
How many comments do pets owners leave in general. I wanted to know if there is a chance for me to segment users with different activeness
Top creators that have most viewers that are pet owners I wanted to find out famous channels so that Youtube can set Ads according to the audience

4. Identify Cat And Dog Owners through the Comments - Build the classifier

Down-sampling: Because I was facing imbalance with rate 1 to 100 so I downsampled it to 50 percent: 50 percent
I implement LogisticRegression, GBTClassifier, and RandomForestClassifier without tuning the hyperparameters to see which model has the best performance
Use CrossValidator with Grid Search Concept for Hyper-Parameter Tuning

Result

I had 88% of accuracy the 89% recall on the Logistic Regression. Here I emphasized more on the recall since the goal is to identify the pets owners as possible. With the ability to predict pets onwers, the statements I made could be put those business strategy into practice.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
Apache Spark Pet Owners Prediction on YouTube Comments.ipynb		Apache Spark Pet Owners Prediction on YouTube Comments.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly