In this notebook, we work with a dataset of user comments on YouTube videos related to animals or pets. We attempt to identify cat or dog owners from these comments, find the topics important to them, and then identify the video creators whose audiences contain the most cat or dog owners.

The dataset provided for this coding test consists of comments on videos related to animals and/or pets. The dataset is 240MB compressed; please download the file using this Google Drive link:
https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing

The dataset file is comma-separated, with a header line defining the field names, listed here:
● creator_name. Name of the YouTube channel creator.
● userid. Integer identifier for the users commenting on the YouTube channels.
● comment. Text of the comments made by the users.

Please use a recent version of PySpark (version 2.2 or higher) to analyze the data. Do not use
any external libraries; just use the native methods from pyspark.sql and pyspark.ml. (Do not
use pyspark.mllib as this has been deprecated.) Keep your code clean and efficient, with
enough documentation so that the grader can easily follow your train of thought. Summarize
the key results from each step. Explain how to execute your code from a command line
interface.

Step 1: Identify Cat And Dog Owners
Find the users who are cat and/or dog owners.

Step 2: Build And Evaluate Classifiers
Build classifiers for the cat and dog owners and measure the performance of the classifiers.

Step 3: Classify All The Users
Apply the cat/dog classifiers to all the users in the dataset. Estimate the fraction of all users
who are cat/dog owners.

Step 4: Extract Insights About Cat And Dog Owners
Find topics important to cat and dog owners.

Step 5: Identify Creators With Cat And Dog Owners In The Audience
Find creators with the most cat and/or dog owners. Find creators with the highest statistically
significant percentages of cat and/or dog owners.

#### 0. Data Exploration and Cleaning

In [5]:
df_clean=spark.read.csv("/FileStore/tables/animals_comments.csv",inferSchema=True,header=True)
df_clean.show(10)

In [6]:
df_clean.count() 

In [7]:
df_clean = df_clean.na.drop(subset=["comment"])
df_clean.count()

In [8]:
# Label a comment 1 if it contains a phrase suggesting the author owns a dog or cat, else 0
from pyspark.sql.functions import col, when

df_clean = df_clean.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%I have dogs%"), 1) \
                           .when(col("comment").like("%I have cats%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))
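
The `like` patterns above are case-sensitive, so a comment such as "MY DOG is adorable" would be labeled 0. One possible alternative, sketched here on the assumption that a single case-insensitive regular expression is acceptable, is `Column.rlike`. `OWNER_PATTERN` and `add_owner_label` are illustrative names that do not appear in the original notebook, and the pattern covers a similar but not identical set of phrases (the word boundaries make it stricter than `%my dog%`).

```python
import re

# Hypothetical pattern: "my <pet>" or "I have (a) <pet>(s)", matched case-insensitively.
OWNER_PATTERN = r"(?i)\b(?:my|i have an?|i have) (?:dog|cat|puppy|pup|kitty)s?\b"


def add_owner_label(df, text_col="comment"):
    """Return df with an integer `label` column: 1 if the text matches OWNER_PATTERN, else 0."""
    # Imported lazily so the pattern can be checked without a Spark installation.
    from pyspark.sql.functions import col
    return df.withColumn("label", col(text_col).rlike(OWNER_PATTERN).cast("int"))


# Quick local check of the pattern with Python's `re`; for this pattern the
# syntax is the same as in the Java regex engine that Spark uses.
examples = ["MY DOG is adorable", "I have cats too", "Im not allowed to have a dog"]
matches = [bool(re.search(OWNER_PATTERN, text)) for text in examples]
print(matches)  # [True, True, False]
```

With a Spark session available, `add_owner_label(df_clean)` would play the role of the chained `when(...)` expression; rows with a null comment would get a null label, so nulls should still be dropped first.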

In [9]:
df_clean.show()

In [10]:
n_pos = df_clean.filter(col('label') == 1).count()
n_neg = df_clean.count() - n_pos  # use the actual row count rather than a hard-coded total

print('number of labeled 1: ' + str(n_pos) + ' number of labeled 0: ' + str(n_neg) + ' ratio : ' + str(n_pos / n_neg))

#### 1. Preprocess Data and Build the Classifier

In [12]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")

word2Vec = Word2Vec(inputCol="words", outputCol="features")

In [13]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

# Fit the pipeline on all comments (Word2Vec is unsupervised), then add the tokens and feature vectors.
pipelineFit = pipeline.fit(df_clean)
dataset = pipelineFit.transform(df_clean)

In [14]:
dataset.count()

In [15]:
dataset.show()

In [16]:
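# Positives (label == 1): 70/30 train/test split.
(pos_train, pos_test) = dataset.filter(col('label') == 1).randomSplit([0.7, 0.3], seed=100)
# Negatives (label == 0) vastly outnumber positives, so keep only a small sample for training ...
(neg_train, neg_rest) = dataset.filter(col('label') == 0).randomSplit([0.005, 0.995], seed=100)
# ... and a small, disjoint sample of the remainder for testing.
(neg_test, _) = neg_rest.randomSplit([0.002, 0.998], seed=100)

In [17]:
trainingData = pos_train.union(neg_train)
testData = pos_test.union(neg_test)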
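
The three `randomSplit` calls downsample the majority class so that the training and test sets are closer to balanced. A rough sketch of the arithmetic follows; `expected_split_sizes` is an illustrative helper, and the class counts passed to it are hypothetical placeholders rather than figures from the dataset. Because `randomSplit` samples each row independently, the real counts will differ slightly from these expectations.

```python
def expected_split_sizes(n_pos, n_neg):
    """Expected row counts produced by the three randomSplit calls (actual counts vary slightly)."""
    neg_train_frac = 0.005
    # The negative test sample is drawn from the 99.5% of negatives not used for training.
    neg_test_frac = 0.995 * 0.002
    return {
        "train_pos": round(n_pos * 0.7),
        "train_neg": round(n_neg * neg_train_frac),
        "test_pos": round(n_pos * 0.3),
        "test_neg": round(n_neg * neg_test_frac),
    }


# Hypothetical class counts, for illustration only (not taken from the dataset).
sizes = expected_split_sizes(n_pos=40_000, n_neg=5_780_000)
print(sizes)
```

Plugging in the counts printed in the previous section shows whether the chosen fractions actually give a balanced training set.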

In [18]:
print("Dataset Count: " + str(dataset.count()))
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

##### LogisticRegression

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [21]:
lr = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.01)
lrModel = lr.fit(trainingData)
preds = lrModel.transform(testData)
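# areaUnderROC needs continuous scores, so evaluate on rawPrediction rather than the hard 0/1 prediction
Evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")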
AUC = Evaluator.evaluate(preds)
print(AUC)
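
`BinaryClassificationEvaluator` computes `areaUnderROC` from whichever column is passed as `rawPredictionCol`, so the choice of column matters. The toy example below, in plain Python with made-up labels and scores, sketches why: AUC computed from continuous scores reflects the full ranking, while AUC computed from thresholded 0/1 predictions collapses the ROC curve to a single operating point. The `auc` helper is illustrative and uses the pairwise (Mann-Whitney) definition, not Spark's implementation.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_scores = auc(labels, scores)  # 8/9, uses the full ranking
auc_hard = auc(labels, hard)      # 2/3, only one threshold is visible
print(auc_scores, auc_hard)
```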

##### Parameter Tuning and K-fold cross-validation

In [23]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [24]:
lr = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction',regParam = 0.01)
grid =( ParamGridBuilder()
        .addGrid(lr.maxIter, [1,10,100])
        .addGrid(lr.elasticNetParam, [0,0.3])
        .build())
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,numFolds=3)
cvModel = cv.fit(trainingData)
avg_AUC = max(cvModel.avgMetrics)  # best mean AUC across the grid; avgMetrics[0] is only the first parameter combination
best_AUC = evaluator.evaluate(cvModel.transform(trainingData))
best_lr_model =  cvModel.bestModel



In [25]:
print('average AUC on cross validation:', avg_AUC)
print(' best model AUC on training set: ',best_AUC)

In [26]:
best_lr_model.explainParams()
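
`avgMetrics` holds one cross-validated score per entry of the parameter grid, in the same order as the grid, so the winning combination can be looked up by index. A minimal sketch follows; `best_param_map` is an illustrative helper, and the grid and scores below are stand-in values rather than results from this notebook.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return (index, params, score) of the grid entry with the best average metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    pick = max if larger_is_better else min
    best_idx = pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return best_idx, param_maps[best_idx], avg_metrics[best_idx]


# Stand-in grid and scores, for illustration only.
grid_example = [
    {"maxIter": 1, "elasticNetParam": 0.0},
    {"maxIter": 1, "elasticNetParam": 0.3},
    {"maxIter": 10, "elasticNetParam": 0.0},
    {"maxIter": 10, "elasticNetParam": 0.3},
]
scores_example = [0.71, 0.69, 0.80, 0.78]

idx, params, score = best_param_map(grid_example, scores_example)
print(idx, params, score)
```

With the fitted model, the call would look something like `best_param_map(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)`, assuming `getEstimatorParamMaps` is available on the fitted model in your Spark version; otherwise the `grid` list built above can be passed directly.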


##### RandomForest

In [28]:
from pyspark.ml.classification import RandomForestClassifier

In [29]:
RF = RandomForestClassifier(featuresCol='features', labelCol='label', predictionCol='prediction')
grid =( ParamGridBuilder()
        .addGrid(RF.numTrees, [2,5,10])
        .addGrid(RF.maxDepth, [2,5])
        .build())
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=RF, estimatorParamMaps=grid, evaluator=evaluator,numFolds=3)
cvModel = cv.fit(trainingData)
avg_AUC = max(cvModel.avgMetrics)  # best mean AUC across the grid; avgMetrics[0] is only the first parameter combination
best_AUC = evaluator.evaluate(cvModel.transform(trainingData))
best_RF_model =  cvModel.bestModel
print('average AUC on cross validation:', avg_AUC)
print(' best model AUC on training set: ',best_AUC)


##### Gradient boosting

In [31]:
from pyspark.ml.classification import GBTClassifier

In [32]:
GBDT = GBTClassifier(featuresCol='features', labelCol='label', predictionCol='prediction')
grid =( ParamGridBuilder()
        .addGrid(GBDT.maxDepth, [2, 5])
        .addGrid(GBDT.maxIter, [5,10])
        .build())
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=GBDT, estimatorParamMaps=grid, evaluator=evaluator,numFolds=3)
cvModel = cv.fit(trainingData)
avg_AUC = max(cvModel.avgMetrics)  # best mean AUC across the grid; avgMetrics[0] is only the first parameter combination
best_AUC = evaluator.evaluate(cvModel.transform(trainingData))
best_GBDT_model =  cvModel.bestModel
print('average AUC on cross validation:', avg_AUC)
print(' best model AUC on training set:  ',best_AUC)


In [33]:
best_model =best_lr_model

#### 2. Classify All The Users

In [35]:
pred_df = best_model.transform(dataset)
pred_df.show()
pred_df.createOrReplaceTempView('users')

In [36]:
# Fraction of users with at least one comment predicted as coming from a cat or dog owner
spark.sql("select sum(category)/count(*) as ratio from (select userid, case when sum(prediction) = 0 then 0 else 1 end as category from users group by userid)").show()

#### 3. Get Insights About Users

In [38]:
sql_query = """
select creator_name ,comment, label, prediction
from users
where label = 0 and prediction = 1
"""
df_3 = spark.sql(sql_query)
display(df_3)

creator_name,comment,label,prediction
Doug The Pug,I shared this to my friends and mom the were lol,0,1.0
Hope For Paws - Official Rescue Channel,That mother cat looks like my own Im guessing she is a russian blue due to her looks and unusual coping skills.,0,1.0
Talking Kitty Cat,steve: No wet food for a month!:cats immediately stop fighting:,0,1.0
Cole & Marmalade,cat drugs,0,1.0
Taylor Nicole Dean,I dont understand how you think she will make a good service dogs. SD are handled by a company here in Quebec (and given for free fully trained to people in need). For the first year they are fostered by families who expose them to as many things as possible and even then after a year the majority of them are deemed not fit for service work. They have to be ultra confident and never startled never afraid of anything etc... She seem like a good pet but the very opposite of what a service dog should be.,0,1.0
Rachel Fusaro,Im not allowed to have a dog because of money and my apartment doesnt allow dogs!! WHAT DO I DO!!!!???!?!?!?!?,0,1.0
Zak Georges Dog Training rEvolution,Chestnut is so cute. Your videos areuoer helpful for me. Ii dont have a dog yet but learning to train a dog ahead of time is really good for me.,0,1.0
2CAN.TV - Ripley the Toucan!,cooking the bird would be easier if the stove was turned on just saying,0,1.0
The Dodo,Oh my how crule,0,1.0
MonkeyBoo,where does Boo go potty?,0,1.0


#### 4. Identify Creators With Cat And Dog Owners In The Audience

In [40]:
sql_query = """
select creator_name ,sum(prediction) /count(*) as owner_ratio
from users
group by creator_name
order by owner_ratio desc
"""
df_4 = spark.sql(sql_query)
display(df_4)

creator_name,owner_ratio
James Stein,1.0
FernDog Training,1.0
marifroggy,1.0
KL Daily,1.0
EdmondCats,1.0
SPCA of Texas,1.0
Dutch Hollow Acres,1.0
Working Woman Report,1.0
Kids CN,1.0
SHEBA Brand,1.0
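
Every creator at the top of this ranking has a ratio of exactly 1.0, which is what a raw proportion tends to produce for creators with only a handful of comments. The task asks for statistically significant percentages, so one common option (not used in the original notebook) is to rank by the lower bound of the Wilson score interval instead of the raw ratio. In the sketch below, `successes` would correspond to `sum(prediction)` and `n` to `count(*)` from the query above; the counts used are hypothetical.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


# Hypothetical creators: one with 3 of 3 comments flagged, one with 600 of 1000.
small_creator = wilson_lower_bound(3, 3)       # raw ratio 1.0, but very little evidence
large_creator = wilson_lower_bound(600, 1000)  # raw ratio 0.6, backed by many comments
print(small_creator, large_creator)
```

Under this ranking the creator with 600 of 1000 flagged comments scores higher than the one with 3 of 3, because the bound penalizes small samples. A simpler alternative would be to require a minimum number of comments per creator before computing the ratio.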


In [41]:
sql_query = """
select creator_name ,sum(prediction) as owner_num
from users
group by creator_name
order by owner_num desc
"""
df_4 = spark.sql(sql_query)
display(df_4)

creator_name,owner_num
Brave Wilderness,78373.0
The Dodo,61045.0
Taylor Nicole Dean,41380.0
Brian Barczyk,39915.0
Robin Seplut,39724.0
Hope For Paws - Official Rescue Channel,29445.0
Gohan The Husky,23157.0
Vet Ranch,19228.0
Gone to the Snow Dogs,17671.0
Cole & Marmalade,15381.0


#### 5. Analysis and Future work

Turning the project into CV format:
1. Overview of the project
2. Data cleaning and modeling
3. Data analysis
4. Building the ML models
5. Recommendations based on the model results