## Advertising Agency Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed substantial client churn, so this notebook builds a classifier to predict which clients are likely to churn.

Bookmarks
- Load Environment Libraries
- Load Spark
- Load Data & Constants
- Feature Exploration
- Machine Learning
    - Load Libraries
    - Feature Engineering
    - Data Split: training & CV
    - Train Model
    - Model Optimization
    - Model Selection

Load Environment Libraries

In [5]:
import yaml
import os
import matplotlib.pyplot as plt

Load Spark

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Churn Classification").getOrCreate()

from pyspark.sql import functions as F

Load Data & Constants

In [7]:
# load config file

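# build the config path with os.path.join so it also works outside Windows
config_path = os.path.join('..', 'config', 'config.yaml')
with open(config_path, 'r', encoding='utf-8') as yml:
    config = yaml.safe_load(yml)

# DATA_DIR and TRAINING_DATA are keys read from config.yaml
df = spark.read.csv(config['DATA_DIR'] + config['TRAINING_DATA'], inferSchema=True, header=True)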

# constants
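
The cell above reads two keys from the config file. A minimal sketch of checking those keys before they are used, assuming the config is a flat mapping; the helper name `training_data_path` and the example values are hypothetical and not part of the notebook:

```python
import os

REQUIRED_KEYS = ('DATA_DIR', 'TRAINING_DATA')  # the two keys the notebook reads


def training_data_path(config):
    """Return the CSV path built from the config, failing early on missing keys."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.yaml is missing keys: {missing}")
    # os.path.join tolerates DATA_DIR with or without a trailing separator
    return os.path.join(config['DATA_DIR'], config['TRAINING_DATA'])


# Hypothetical values, for illustration only
example_config = {'DATA_DIR': 'data', 'TRAINING_DATA': 'customer_churn.csv'}
print(training_data_path(example_config))
```

Failing on a missing key here gives a clearer message than the `KeyError` raised in the middle of `spark.read.csv`.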

Feature Exploration

In [None]:
df.printSchema()

In [None]:
df.show(3)

In [6]:
for row in df.head(3):
    print(row, '\n')

Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date=datetime.datetime(2013, 8, 30, 7, 0, 40), Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1) 

Row(Names='Kevin Mueller', Age=41.0, Total_Purchase=11916.22, Account_Manager=0, Years=6.5, Num_Sites=11.0, Onboard_date=datetime.datetime(2013, 8, 13, 0, 38, 46), Location='6157 Frank Gardens Suite 019 Carloshaven, RI 17756', Company='Wilson PLC', Churn=1) 

Row(Names='Eric Lozano', Age=38.0, Total_Purchase=12884.75, Account_Manager=0, Years=6.67, Num_Sites=12.0, Onboard_date=datetime.datetime(2016, 6, 29, 6, 20, 7), Location='1331 Keith Court Alyssahaven, DE 90114', Company='Miller, Johnson and Wallace', Churn=1) 



In [None]:
print(df.columns)

In [None]:
# Features of interest:
# 'Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites', 'Onboard_date', 'Location', 'Company'

# Feature to classify: 'Churn'

In [None]:
# Churn
df.groupBy('Churn').count().show()

In [None]:
# Age

df.select('Age').printSchema()
# bring only the first 1000 rows to the driver rather than collecting the whole column
age_Vect = df.select('Age').take(1000)
temp_age_Vect = [t[0] for t in age_Vect]
plt.title('Age Distribution')
plt.hist(temp_age_Vect)
plt.xlabel('Age')
plt.ylabel('# of users')
plt.show()
# Feature Engineering: Age - normalization

In [None]:
# Total_Purchase
purchase_Vect = df.select('Total_Purchase')
purchase_Vect.show(3)
purchase_Vect.printSchema()
temp_purchase_Vect = purchase_Vect.collect()
temp_purchase_Vect = [t[0] for t in temp_purchase_Vect]
plt.title('Purchase Distribution')
plt.hist(temp_purchase_Vect)
plt.xlabel('Total Purchase')
plt.ylabel('# of users')
plt.show()
# Feature Engineering: Total_Purchase - normalization

In [None]:
# Account_Manager
acc_mngr_Vect = df.select('Account_Manager')

acc_mngr_Vect.printSchema()

acc_mngr_Vect.distinct().show()

# number of classes : 2
acc_mngr_Vect.groupBy('Account_Manager').count().show()

# Feature Engineering: Not required

In [None]:
# Years
year_Vect = df.select('Years')
year_Vect.printSchema()
year_Vect.show(3)

# bring only the first 1000 rows to the driver rather than collecting the whole column
temp_year_Vect = year_Vect.take(1000)
temp_year_Vect = [t[0] for t in temp_year_Vect]
plt.title('Years Distribution')
plt.hist(temp_year_Vect)
plt.xlabel('Years')
plt.ylabel('# of users')
plt.show()
# Feature Engineering: Years - normalization

In [None]:
# Num_Sites

sites_Vect = df.select('Num_Sites')
sites_Vect.printSchema()
sites_Vect.show(3)

# bring only the first 1000 rows to the driver rather than collecting the whole column
temp_sites_Vect = sites_Vect.take(1000)
temp_sites_Vect = [t[0] for t in temp_sites_Vect]
plt.title('Number of Sites Distribution')
plt.hist(temp_sites_Vect)
plt.xlabel('# of Sites')
plt.ylabel('# of Users')
plt.show()
# Feature Engineering: Num_Sites - normalization

In [None]:
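# Onboard_date
# Assumption: the original expression was left unfinished (`F.`). Tenure is taken here
# to be the number of days between Onboard_date and today.
df = df.withColumn('Tenure', F.datediff(F.current_date(), F.col('Onboard_date')))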
df.select('Onboard_date', 'Tenure').show(10)
# TODO: explain why 'Location' & 'Company' are not used as features
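
The Tenure column is meant to be derived from Onboard_date. A minimal sketch in plain Python, assuming tenure means whole days between onboarding and a chosen reference date; both that definition and the reference date below are assumptions, not taken from the notebook:

```python
from datetime import date, datetime


def tenure_days(onboard, as_of):
    """Whole days between the onboarding date and a reference date."""
    return (as_of - onboard.date()).days


# Onboard_date of the first row printed earlier; the reference date is hypothetical
onboard = datetime(2013, 8, 30, 7, 0, 40)
print(tenure_days(onboard, date(2017, 1, 1)))  # 1220
```

Dropping the time of day before subtracting keeps the result a whole number of calendar days.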

Machine Learning
- Load Libraries
- Feature Engineering
- Data Split: training & CV
- Train Model
- Model Optimization
- Model Selection

Load Libraries

In [8]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier

from pyspark.ml.evaluation import BinaryClassificationEvaluator

Feature Engineering

In [9]:
# Feature Engineering
# Features: 'Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites'


# Assembler: combine the feature columns into a single vector to be scaled
assembler = VectorAssembler(
    inputCols=['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites'],
    outputCol="featuresToScale")

# Feature Standardizer: scales every assembled column
# (Age, Total_Purchase, Account_Manager, Years, Num_Sites) to unit standard deviation
scaler = StandardScaler(inputCol="featuresToScale", outputCol="features", withStd=True)

# Feature Engineering Pipeline
pipeline = Pipeline(stages = [assembler, scaler])

In [10]:
feature_engineering_pipeline = pipeline.fit(df)
feature_engineered_data = feature_engineering_pipeline.transform(df)
data = feature_engineered_data.select('features','Churn')
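
With `withStd=True` and the default `withMean=False`, `StandardScaler` divides each feature by its sample standard deviation without centering it. A minimal plain-Python sketch of that arithmetic on a single column, using the three Age values from the rows printed earlier; the helper name `scale_to_unit_std` is hypothetical:

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    s = stdev(values)  # sample standard deviation, n - 1 in the denominator
    if s == 0:
        # a constant column has no spread; this sketch maps it to zeros
        return [0.0 for _ in values]
    return [v / s for v in values]


ages = [42.0, 41.0, 38.0]  # Age values from the first three rows
scaled_ages = scale_to_unit_std(ages)
print(scaled_ages)
print(stdev(scaled_ages))  # the scaled column has unit standard deviation
```

Because the mean is not subtracted, the scaled values keep their sign and relative order; only their spread changes.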

Data Split: training & CV

In [11]:
train, test = data.randomSplit([0.7,0.3])
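
A random split without a seed gives different train and test sets on every run, which makes results hard to compare. A minimal plain-Python sketch of a seeded 70/30 split, assuming each row is assigned independently so the split sizes are only approximately 70/30; the helper name and the seed value are hypothetical:

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test independently, reproducibly for a given seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # each row lands in train with probability train_fraction
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(20))  # stand-in for the rows of the feature DataFrame
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

In Spark the same idea is available through the optional `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`.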

Train Model

We fit two models on the training set:
- Logistic Regression
- Random Forest

In [12]:
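# Note: the keyword is featuresCol; passing features=... raises
# "TypeError: __init__() got an unexpected keyword argument 'features'"
lr_churn = LogisticRegression(labelCol='Churn', featuresCol='features')
rf_churn = RandomForestClassifier(labelCol='Churn', featuresCol='features')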

In [14]:
!pip install findspark

Collecting findspark
  Downloading https://files.pythonhosted.org/packages/b1/c8/e6e1f6a303ae5122dc28d131b5a67c5eb87cbf8f7ac5b9f87764ea1b1e1e/findspark-1.3.0-py2.py3-none-any.whl
Installing collected packages: findspark
Successfully installed findspark-1.3.0


In [15]:
import findspark

In [17]:
findspark.init()

ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
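
The error above means `findspark` could not locate a Spark installation. A minimal sketch of resolving the location first, assuming Spark is either pointed to by the `SPARK_HOME` environment variable or installed in one of a few known directories; the helper name and the candidate path are hypothetical:

```python
import os


def resolve_spark_home(candidates=()):
    """Return SPARK_HOME if set, else the first candidate directory that exists, else None."""
    env_home = os.environ.get('SPARK_HOME')
    if env_home:
        return env_home
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


spark_home = resolve_spark_home(candidates=['/opt/spark'])  # hypothetical install location
if spark_home is None:
    print('Spark not found: set SPARK_HOME or pass the install directory explicitly')
else:
    print('Using Spark at', spark_home)
```

A resolved path can then be passed on as `findspark.init(spark_home)` instead of relying on auto-detection.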

Model Optimization

In [48]:
#prediction = model.transform(test)
#training_sum = model.stages[-1].summary
#training_sum.predictions.select('prediction','Churn').describe().show()
# prediction.select('features','Churn','rawPrediction','probability', 'prediction').show()
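# Evaluate AUC on the continuous 'rawPrediction' scores; the hard 0/1 'prediction'
# column reduces the ROC curve to a single operating point
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                           labelCol='Churn')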
# auc = churn_eval.evaluate(prediction)
# print (round(auc,2))
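
The column passed as `rawPredictionCol` changes what the AUC measures. A minimal plain-Python sketch, assuming AUC is computed as the probability that a random positive is scored above a random negative with ties counted as one half; the scores below are hypothetical, not model output:

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative label')
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
probabilities = [0.10, 0.30, 0.40, 0.45]  # hypothetical continuous scores
hard_predictions = [1 if p >= 0.5 else 0 for p in probabilities]

print(roc_auc(labels, probabilities))     # 1.0
print(roc_auc(labels, hard_predictions))  # 0.5
```

The continuous scores rank every churner above every non-churner, but thresholding at 0.5 labels all four rows as 0 and that ranking information is lost.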

Model Selection

In [18]:
sc

''