### Telco Customer Churn
Focused customer retention programs

#### Context
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

#### Content
Each row represents a customer; each column contains a customer attribute described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

#### Inspiration
To explore this type of model and learn more about the subject.

#### New version from IBM:
https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn").getOrCreate()
# use Arrow to speed up conversions between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

In [2]:
churn_df = spark.read.csv("Telco-Customer-Churn.csv", inferSchema=True, header=True)
churn_df.show()

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|7590-VHVEG|Female|            0|    Yes|        No|     1|  

In [3]:
churn_df.printSchema() # most of the columns are strings
## 1) in class we will convert the categorical variables to numbers
## 2) the numeric category values must then be converted to dummy variables -> OneHotEncoder

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



In [4]:
churn_df = churn_df.drop("customerID")

In [5]:
from pyspark.sql.types import StringType

# collect the names of the DataFrame columns whose type is string
문자변수 = [변수.name for 변수 in churn_df.schema.fields if isinstance(변수.dataType, StringType)]
문자변수 # these string columns will be converted to numeric columns below

['gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']
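Note that `TotalCharges` appears in this list even though it holds amounts: `inferSchema` read it as a string, presumably because some entries could not be parsed as numbers. Indexing it like a category replaces each amount with a frequency rank rather than its value. A minimal pure-Python sketch of the alternative, parsing the strings into floats, follows. The helper `to_double` and the sample values are hypothetical; in PySpark the analogous step would likely be a `cast("double")` on the column.

```python
from typing import Optional


def to_double(text: Optional[str]) -> Optional[float]:
    """Parse a numeric string; return None for blank or unparseable input."""
    if text is None:
        return None
    try:
        return float(text.strip())
    except ValueError:
        return None


# Hypothetical sample values, not taken from the dataset.
raw_total_charges = ["29.85", "1889.5", " ", "108.15"]
parsed = [to_double(v) for v in raw_total_charges]
print(parsed)  # [29.85, 1889.5, None, 108.15]
```

Rows that come back as `None` would then need to be dropped or imputed before modeling.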

##### Converting string variables to numeric variables

In [6]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder # OneHotEncoder: conversion to dummy variables

In [7]:
## StringIndexer
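indexer = StringIndexer(inputCols=문자변수, # columns to convert (inputCols accepts more than one column)
                        outputCols=["{}_SI".format(c) for c in 문자변수]) # output column names: original name + "_SI"
encode_df = indexer.fit(churn_df).transform(churn_df) # new columns are added to the existing DataFrame -> their type is numeric (double)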
encode_df.printSchema()

root
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)
 |-- gender_SI: double (nullable = false)
 |-- Partner_SI: double (nullable = false)
 |-- Dependents_SI: double (nullable = 
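By default `StringIndexer` orders labels by descending frequency, so the most common value in a column receives index 0.0. The rule can be sketched in plain Python as below. `fit_string_index` is a hypothetical helper (not part of PySpark), the sample column is made up, and ties are broken alphabetically as in Spark 3.x.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample column, for illustration only.
contract = [
    "Month-to-month", "Two year", "Month-to-month",
    "One year", "Month-to-month", "Two year",
]
mapping = fit_string_index(contract)
encoded = [mapping[v] for v in contract]
print(mapping)  # {'Month-to-month': 0.0, 'Two year': 1.0, 'One year': 2.0}
print(encoded)  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

The resulting indices reflect frequency, not any natural order of the categories.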

In [8]:
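# existing numeric columns + the newly indexed numeric columns, combined into one list
# (note: the integer column tenure is not included here)
설명변수 = ["SeniorCitizen", "MonthlyCharges"] + ["{}_SI".format(c) for c in 문자변수]
설명변수 = 설명변수[0:-1] # select the explanatory variables: drop the last entry, Churn_SI, which is the response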
설명변수

['SeniorCitizen',
 'MonthlyCharges',
 'gender_SI',
 'Partner_SI',
 'Dependents_SI',
 'PhoneService_SI',
 'MultipleLines_SI',
 'InternetService_SI',
 'OnlineSecurity_SI',
 'OnlineBackup_SI',
 'DeviceProtection_SI',
 'TechSupport_SI',
 'StreamingTV_SI',
 'StreamingMovies_SI',
 'Contract_SI',
 'PaperlessBilling_SI',
 'PaymentMethod_SI',
 'TotalCharges_SI']
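The `_SI` columns are used directly as explanatory variables here; the one-hot step mentioned earlier is not applied, so a multi-level index such as `PaymentMethod_SI` enters the model as if its values were ordered magnitudes. As a rough illustration of what the dummy-variable step would produce, here is a plain-Python sketch. `one_hot` is a hypothetical helper; its `drop_last` flag imitates the default `dropLast=True` behaviour of Spark's `OneHotEncoder`, where the last category is encoded as all zeros.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector


# A column with four levels, indexed 0..3 (illustrative only).
for idx in range(4):
    print(idx, one_hot(idx, 4))
# 0 [1.0, 0.0, 0.0]
# 1 [0.0, 1.0, 0.0]
# 2 [0.0, 0.0, 1.0]
# 3 [0.0, 0.0, 0.0]
```

Dropping one slot avoids a redundant column, since the last category is implied when all the others are zero.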

In [9]:
from pyspark.ml.feature import VectorAssembler

변수묶음 = VectorAssembler(inputCols=설명변수, outputCol="features") # the output is a single combined vector column
변환자료  = 변수묶음.transform(encode_df)
변환자료.select("features","Churn_SI").show()

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(18,[1,2,3,5,6,7,...|     0.0|
|(18,[1,7,8,10,14,...|     0.0|
|(18,[1,7,8,9,16,1...|     1.0|
|[0.0,42.3,0.0,0.0...|     0.0|
|(18,[1,2,17],[70....|     1.0|
|(18,[1,2,6,10,12,...|     1.0|
|(18,[1,4,6,9,12,1...|     0.0|
|(18,[1,2,5,6,7,8,...|     0.0|
|(18,[1,2,3,6,10,1...|     1.0|
|(18,[1,4,7,8,9,14...|     0.0|
|(18,[1,3,4,7,8,16...|     0.0|
|[0.0,18.95,0.0,0....|     0.0|
|(18,[1,3,6,10,12,...|     0.0|
|(18,[1,6,9,10,12,...|     1.0|
|(18,[1,8,10,11,12...|     0.0|
|[0.0,113.25,1.0,1...|     0.0|
|[0.0,20.65,1.0,0....|     0.0|
|[0.0,106.7,0.0,0....|     0.0|
|(18,[1,2,3,4,7,10...|     1.0|
|(18,[1,2,9,10,13,...|     0.0|
+--------------------+--------+
only showing top 20 rows
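The `features` column above mixes two notations. A row such as `(18,[1,2,17],[...])` is a sparse vector, written as (size, indices of the non-zero entries, their values), while a row starting with `[0.0,42.3,...` is a dense vector listing every entry. Both describe the same kind of 18-element vector. The sketch below expands the sparse form in plain Python; `sparse_to_dense` and the numbers are hypothetical and use a smaller size for readability.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list with zeros filled in."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical 6-element vector with non-zero entries at positions 1 and 4.
dense = sparse_to_dense(6, [1, 4], [20.5, 1.0])
print(dense)  # [0.0, 20.5, 0.0, 0.0, 1.0, 0.0]
```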



In [10]:
분류자료 = 변환자료.select(["features", "Churn_SI"]) # (explanatory variables, response variable); "features" is the default featuresCol name

In [14]:
분류자료.show()

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(18,[1,2,3,5,6,7,...|     0.0|
|(18,[1,7,8,10,14,...|     0.0|
|(18,[1,7,8,9,16,1...|     1.0|
|[0.0,42.3,0.0,0.0...|     0.0|
|(18,[1,2,17],[70....|     1.0|
|(18,[1,2,6,10,12,...|     1.0|
|(18,[1,4,6,9,12,1...|     0.0|
|(18,[1,2,5,6,7,8,...|     0.0|
|(18,[1,2,3,6,10,1...|     1.0|
|(18,[1,4,7,8,9,14...|     0.0|
|(18,[1,3,4,7,8,16...|     0.0|
|[0.0,18.95,0.0,0....|     0.0|
|(18,[1,3,6,10,12,...|     0.0|
|(18,[1,6,9,10,12,...|     1.0|
|(18,[1,8,10,11,12...|     0.0|
|[0.0,113.25,1.0,1...|     0.0|
|[0.0,20.65,1.0,0....|     0.0|
|[0.0,106.7,0.0,0....|     0.0|
|(18,[1,2,3,4,7,10...|     1.0|
|(18,[1,2,9,10,13,...|     0.0|
+--------------------+--------+
only showing top 20 rows



In [11]:
from pyspark.ml.classification import LogisticRegression
## split the data
train_data, test_data = 분류자료.randomSplit([0.7, 0.3], 316) # train 70%, test 30%; 316 is the random seed
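The weights passed to `randomSplit` act as probabilities rather than exact proportions: rows are assigned at random, so the resulting sizes are only approximately 70% and 30%, and the seed makes the assignment repeatable. The idea can be sketched in plain Python as follows. `random_split` is a hypothetical helper and does not reproduce Spark's actual sampler.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative, normalized weights
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding.
            if u < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits


train, test = random_split(list(range(1000)), [0.7, 0.3], seed=316)
print(len(train), len(test))  # close to 700 and 300, but not exactly
```

Running it again with the same seed yields the same split, which is why fixing the seed matters for reproducible results.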

In [12]:
분석모형 = LogisticRegression(labelCol="Churn_SI").fit(train_data) # labelCol: the response variable
분석모형.summary

<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x1f1d6589bb0>

In [16]:
분석모형.summary.predictions.show(truncate=False) # show the table including the predicted values of the response variable

+------------------------------------------------------------------------------+--------+--------------------------------------------+----------------------------------------+----------+
|features                                                                      |Churn_SI|rawPrediction                               |probability                             |prediction|
+------------------------------------------------------------------------------+--------+--------------------------------------------+----------------------------------------+----------+
|(18,[0,1,2,3,4,6,9,12,13,17],[1.0,100.4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3003.0])  |0.0     |[-0.3622345163698677,0.3622345163698677]    |[0.4104187603640363,0.5895812396359636] |1.0       |
|(18,[0,1,2,3,4,6,9,16,17],[1.0,79.6,1.0,1.0,1.0,1.0,1.0,1.0,2572.0])          |1.0     |[0.06812463409570224,-0.06812463409570224]  |[0.5170245748274009,0.48297542517259906]|0.0       |
|(18,[0,1,2,3,4,6,10,12,16,17],[1.0,91.35,1.0,1.0,1.0,1.0,1.0,1.0
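The three output columns are related: `rawPrediction` holds the margins for classes 0 and 1 (equal in magnitude, opposite in sign), `probability` is the logistic function of those margins, and `prediction` is 1.0 when the class-1 probability exceeds the threshold (0.5 by default). A small check in plain Python, using the class-1 margin from the first row above:

```python
import math


def sigmoid(z):
    """Logistic function: maps a margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Class-1 margin copied from the first row of the output above.
raw_class1 = 0.3622345163698677

p1 = sigmoid(raw_class1)  # probability of class 1
p0 = 1.0 - p1             # probability of class 0
prediction = 1.0 if p1 > 0.5 else 0.0

print(round(p0, 4), round(p1, 4), prediction)  # 0.4104 0.5896 1.0
```

This matches the `probability` and `prediction` values printed for that row.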



In [17]:
## compute descriptive statistics
분석모형.summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|           Churn_SI|         prediction|
+-------+-------------------+-------------------+
|  count|               4861|               4861|
|   mean|0.26290886648837686|0.20983336761983132|
| stddev| 0.4402586381230306| 0.4072314346458762|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
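Because both columns contain only 0.0 and 1.0, each mean is a rate, and multiplying it by the count recovers the number of rows equal to 1. A small worked check using the figures from the table above:

```python
# Figures copied from the describe() output above.
count = 4861
mean_label = 0.26290886648837686
mean_prediction = 0.20983336761983132

actual_churners = round(mean_label * count)
predicted_churners = round(mean_prediction * count)

print(actual_churners, predicted_churners)  # 1278 1020
```

On the training data the model therefore flags fewer customers as churners (1020) than actually churned (1278).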



##### Model evaluation

In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [19]:
예측 = 분석모형.evaluate(test_data) # the model must be evaluated on the test data
예측.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|Churn_SI|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(18,[0,1,2,3,4,5,...|     0.0|[0.03991562291143...|[0.50997758102550...|       0.0|
|(18,[0,1,2,3,4,6,...|     0.0|[0.52300261993518...|[0.62784961080325...|       0.0|
|(18,[0,1,2,3,6,8,...|     0.0|[0.18591998642556...|[0.54634657122375...|       0.0|
|(18,[0,1,2,3,6,8,...|     1.0|[0.31517619789560...|[0.57814820688422...|       0.0|
|(18,[0,1,2,3,6,8,...|     1.0|[-0.6255870934179...|[0.34851182313858...|       1.0|
|(18,[0,1,2,3,6,8,...|     0.0|[-0.0149024193696...|[0.49627446410524...|       1.0|
|(18,[0,1,2,3,6,8,...|     0.0|[1.61549463515296...|[0.83417284735446...|       0.0|
|(18,[0,1,2,3,6,8,...|     0.0|[1.46697880093720...|[0.81259774427616...|       0.0|
|(18,[0,1,2,3,6,8,...|     1.0|[0.36085810062148...|[0.5892481401

In [20]:
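# labelCol = the target column
# note: rawPredictionCol is set to the hard 0/1 "prediction" column here, so the AUC below
# reflects a single threshold; the evaluator's default column is "rawPrediction"
평가 = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="Churn_SI")
auc = 평가.evaluate(예측.predictions) # compute the metric (areaUnderROC by default)
auc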

0.6953054459252075
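Because the evaluator above was given the hard 0/1 `prediction` column rather than a continuous score, the ROC curve has only one interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below illustrates this in plain Python on made-up data. `roc_auc` is a hypothetical helper that computes AUC as the probability that a random positive outranks a random negative, counting ties as one half.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1, 0.05]
hard = [1.0 if s > 0.5 else 0.0 for s in scores]  # thresholded at 0.5

auc_scores = roc_auc(labels, scores)  # uses the full ranking
auc_hard = roc_auc(labels, hard)      # uses only the 0/1 decisions

tpr = 2 / 3  # two of three positives predicted 1
tnr = 4 / 5  # four of five negatives predicted 0
print(round(auc_scores, 4), round(auc_hard, 4), round((tpr + tnr) / 2, 4))
# 0.9333 0.7333 0.7333
```

In this toy example the thresholded AUC is lower than the score-based one. Passing `rawPrediction` or `probability` as `rawPredictionCol` would let the evaluator use the full ranking; the value it would give on this notebook's test data is not shown here.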