### Telco Customer Churn
Focused customer retention programs

#### Context
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

#### Content
Each row represents a customer; each column contains a customer attribute described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

#### Inspiration
To explore this type of model and learn more about the subject.

#### New version from IBM:
https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113


- Analyze why customers churned (customer churn model)
- Kaggle data

PySpark


- String (categorical) -> numeric


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

In [2]:
churn_df = spark.read.csv("Telco-Customer-Churn.csv", inferSchema=True, header=True)
churn_df.show()

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|7590-VHVEG|Female|            0|    Yes|        No|     1|  

In [3]:
churn_df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



How to turn each category into a number

String -> numeric index (0, 1, 2, ...) -> one-hot encoding -> dummy variables
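
The two steps above can be sketched in plain Python, without Spark. This is only an illustration of the idea, not the PySpark implementation: `index_labels` and `one_hot` are hypothetical helper names, the sample values are made up, and the frequency-based ordering mirrors the default behavior of `StringIndexer` (the most frequent label gets index 0).

```python
from collections import Counter

def index_labels(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a dummy-variable vector with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

# Made-up sample of a categorical column.
contracts = ["Month-to-month", "One year", "Month-to-month", "Two year",
             "Month-to-month", "One year"]
mapping = index_labels(contracts)
encoded = [one_hot(mapping[c], len(mapping)) for c in contracts]
```

Spark's `OneHotEncoder` drops the last category by default, so its vectors would be one element shorter than the ones built here.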

In [4]:
churn_df = churn_df.drop("customerID")  # identifier column, not needed for the analysis

In [5]:
from pyspark.sql.types import StringType

문자변수 = [변수.name for 변수 in churn_df.schema.fields if isinstance(변수.dataType, StringType)]  # names of the StringType columns
문자변수

['gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']
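
`TotalCharges` shows up in this list even though it holds monetary amounts, which suggests that some rows contain non-numeric text (blank strings, for instance) that kept `inferSchema` from reading the column as a number. That cause is an assumption here, not something the output confirms. A minimal plain-Python sketch of parsing such values, with `parse_charge` as a hypothetical helper and made-up inputs:

```python
def parse_charge(text):
    """Convert a charge string to float; return None for blank or invalid text."""
    try:
        return float(text.strip())
    except (ValueError, AttributeError):
        return None

raw_values = ["29.85", " ", "1889.5", ""]  # made-up sample values
parsed = [parse_charge(v) for v in raw_values]
```

In PySpark the comparable step would be casting the column to `double` before building the feature list, so that it is used as a numeric feature rather than indexed as a category; the resulting nulls would then need handling before the features are assembled.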

In [6]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
# StringIndexer: maps each distinct string value to a numeric index

In [7]:
# StringIndexer: add a numeric <column>_SI index column for every string column
indexer = StringIndexer(inputCols=문자변수,
                        outputCols=["{}_SI".format(c) for c in 문자변수])
encode_df = indexer.fit(churn_df).transform(churn_df)
encode_df.printSchema()

root
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)
 |-- gender_SI: double (nullable = false)
 |-- Partner_SI: double (nullable = false)
 |-- Dependents_SI: double (nullable = 

In [28]:
설명변수 = ["SeniorCitizen", "tenure", "MonthlyCharges"] + ["{}_SI".format(c) for c in 문자변수]  # list concatenation
설명변수 = 설명변수[0:-1]  # drop the last element, Churn_SI (the response variable)
설명변수

['SeniorCitizen',
 'tenure',
 'MonthlyCharges',
 'gender_SI',
 'Partner_SI',
 'Dependents_SI',
 'PhoneService_SI',
 'MultipleLines_SI',
 'InternetService_SI',
 'OnlineSecurity_SI',
 'OnlineBackup_SI',
 'DeviceProtection_SI',
 'TechSupport_SI',
 'StreamingTV_SI',
 'StreamingMovies_SI',
 'Contract_SI',
 'PaperlessBilling_SI',
 'PaymentMethod_SI',
 'TotalCharges_SI']
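
Dropping the last element works here only because `Churn` happens to be the final string column. A slightly more defensive sketch filters the response out by name; `label`, `string_cols`, `numeric_cols`, and `feature_cols` are illustrative names, and the column list is shortened for the example.

```python
label = "Churn"
string_cols = ["gender", "Partner", "Contract", "Churn"]  # shortened list
numeric_cols = ["SeniorCitizen", "tenure", "MonthlyCharges"]

# Keep every indexed column except the one derived from the response variable.
feature_cols = numeric_cols + [f"{c}_SI" for c in string_cols if c != label]
```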

In [29]:
# Which variable had to be removed from the explanatory variables? The response variable, Churn_SI.

In [30]:
from pyspark.ml.feature import VectorAssembler

변수묶음 = VectorAssembler(inputCols=설명변수, outputCol="features")
변환자료 = 변수묶음.transform(encode_df)
변환자료.select("features", "Churn_SI").show()

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(19,[1,2,3,4,6,7,...|     0.0|
|(19,[1,2,8,9,11,1...|     0.0|
|(19,[1,2,8,9,10,1...|     1.0|
|[0.0,45.0,42.3,0....|     0.0|
|(19,[1,2,3,18],[2...|     1.0|
|(19,[1,2,3,7,11,1...|     1.0|
|(19,[1,2,5,7,10,1...|     0.0|
|(19,[1,2,3,6,7,8,...|     0.0|
|(19,[1,2,3,4,7,11...|     1.0|
|(19,[1,2,5,8,9,10...|     0.0|
|(19,[1,2,4,5,8,9,...|     0.0|
|[0.0,16.0,18.95,0...|     0.0|
|(19,[1,2,4,7,11,1...|     0.0|
|(19,[1,2,7,10,11,...|     1.0|
|(19,[1,2,9,11,12,...|     0.0|
|[0.0,69.0,113.25,...|     0.0|
|[0.0,52.0,20.65,1...|     0.0|
|[0.0,71.0,106.7,0...|     0.0|
|(19,[1,2,3,4,5,8,...|     1.0|
|(19,[1,2,3,10,11,...|     0.0|
+--------------------+--------+
only showing top 20 rows
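
The `features` column mixes two notations. Rows such as `(19,[1,2,3,18],[2...` are sparse vectors, written as (size, indices of the non-zero entries, values), while rows that start with `[` are dense vectors listing every entry. A small plain-Python sketch of expanding the sparse form; `to_dense` is a hypothetical helper and the values are made up:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A made-up 6-element vector with non-zero entries at positions 1, 2 and 5.
dense = to_dense(6, [1, 2, 5], [34.0, 56.95, 1.0])
```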



In [31]:
분류자료 = 변환자료.select(["features","Churn_SI"])
분류자료.show()

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(19,[1,2,3,4,6,7,...|     0.0|
|(19,[1,2,8,9,11,1...|     0.0|
|(19,[1,2,8,9,10,1...|     1.0|
|[0.0,45.0,42.3,0....|     0.0|
|(19,[1,2,3,18],[2...|     1.0|
|(19,[1,2,3,7,11,1...|     1.0|
|(19,[1,2,5,7,10,1...|     0.0|
|(19,[1,2,3,6,7,8,...|     0.0|
|(19,[1,2,3,4,7,11...|     1.0|
|(19,[1,2,5,8,9,10...|     0.0|
|(19,[1,2,4,5,8,9,...|     0.0|
|[0.0,16.0,18.95,0...|     0.0|
|(19,[1,2,4,7,11,1...|     0.0|
|(19,[1,2,7,10,11,...|     1.0|
|(19,[1,2,9,11,12,...|     0.0|
|[0.0,69.0,113.25,...|     0.0|
|[0.0,52.0,20.65,1...|     0.0|
|[0.0,71.0,106.7,0...|     0.0|
|(19,[1,2,3,4,5,8,...|     1.0|
|(19,[1,2,3,10,11,...|     0.0|
+--------------------+--------+
only showing top 20 rows



In [32]:
from pyspark.ml.classification import LogisticRegression

train_data, test_data = 분류자료.randomSplit([0.7, 0.3], seed=316)

In [33]:
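# Set up and fit the model
분석모형 = LogisticRegression(labelCol="Churn_SI").fit(train_data)  # labelCol is the response variable; featuresCol defaults to "features" (the explanatory variables), so it is not set explicitly
# The input format differs depending on which package you use
분석모형.summary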

<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x19802cb9b50>

In [34]:
분석모형.summary.predictions.show(truncate=False)  # truncate=True is the default
# probability: [P(label = 0), P(label = 1)]; the first entry is the probability of class 0



+-------------------------------------------------------------------------------------+--------+------------------------------------------+----------------------------------------+----------+
|features                                                                             |Churn_SI|rawPrediction                             |probability                             |prediction|
+-------------------------------------------------------------------------------------+--------+------------------------------------------+----------------------------------------+----------+
|(19,[0,1,2,3,4,5,7,10,13,14,18],[1.0,32.0,100.4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3003.0]) |0.0     |[-0.36571838187741035,0.36571838187741035]|[0.4095760151697173,0.5904239848302827] |1.0       |
|(19,[0,1,2,3,4,5,7,10,17,18],[1.0,34.0,79.6,1.0,1.0,1.0,1.0,1.0,1.0,2572.0])         |1.0     |[0.2818408242783337,-0.2818408242783337]  |[0.5699974686390334,0.43000253136096656]|0.0       |
|(19,[0,1,2,3,4,5,7,11,13,17,18],[1.0,32
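
For binary logistic regression, the two entries of `rawPrediction` are the margin (log-odds) with a negative and a positive sign, and `probability` is the logistic (sigmoid) function applied to them. The first row of the output above can be checked by hand; `sigmoid` is a helper defined here for the illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

margin = 0.36571838187741035   # second entry of rawPrediction in the first row
p_label1 = sigmoid(margin)     # probability of label 1.0
p_label0 = sigmoid(-margin)    # probability of label 0.0
prediction = 1.0 if p_label1 > 0.5 else 0.0
```

The two probabilities sum to 1, and the predicted label is 1.0 because `p_label1` exceeds the default threshold of 0.5, matching the `prediction` column in that row.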

In [35]:
분석모형.summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|           Churn_SI|         prediction|
+-------+-------------------+-------------------+
|  count|               4861|               4861|
|   mean|0.26722896523349104|0.22053075498868546|
| stddev|  0.442558399615584|  0.414647212499701|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



In [36]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator  # evaluation for binary classification

In [37]:
예측 = 분석모형.evaluate(test_data)  # evaluate() is built into the fitted model; it scores the model on the test data
예측.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|Churn_SI|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(19,[0,1,2,3,4,5,...|     0.0|[0.39289610460907...|[0.59697968268466...|       0.0|
|(19,[0,1,2,3,4,5,...|     0.0|[-0.5106428469780...|[0.37504283928831...|       1.0|
|(19,[0,1,2,3,4,7,...|     0.0|[0.37758050400803...|[0.59328941763549...|       0.0|
|(19,[0,1,2,3,4,7,...|     1.0|[-0.2833131118982...|[0.42964171039365...|       1.0|
|(19,[0,1,2,3,4,7,...|     1.0|[-1.3971841060924...|[0.19826333152951...|       1.0|
|(19,[0,1,2,3,4,7,...|     0.0|[1.47083134139576...|[0.81318371294332...|       0.0|
|(19,[0,1,2,3,4,7,...|     0.0|[2.83696959075266...|[0.94464120275434...|       0.0|
|(19,[0,1,2,3,4,7,...|     0.0|[2.44372859870398...|[0.92010162407723...|       0.0|
|(19,[0,1,2,3,4,7,...|     1.0|[1.23476625008917...|[0.7746516906

In [38]:
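평가 = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="Churn_SI")
auc = 평가.evaluate(예측.predictions)
auc  # AUC: how well the model separates the two classes. 1.0 is perfect; 0.5 is no better than random guessing.
# Note: rawPredictionCol="prediction" scores the hard 0/1 predictions; the default column, "rawPrediction", would use the continuous scores instead.
# In an earlier run, where including or excluding tenure made no difference, the value was about 0.69.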

0.7145574855252274
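
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The plain-Python sketch below (`auc` is a hypothetical helper and the data is made up) illustrates why scoring hard 0/1 predictions, as the cell above does with `rawPredictionCol="prediction"`, usually gives a different value from scoring continuous outputs such as probabilities.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.3, 0.1]       # continuous scores
hard = [1 if p > 0.5 else 0 for p in probs]   # thresholded at 0.5

auc_from_probs = auc(probs, labels)
auc_from_hard = auc(hard, labels)
```

With hard predictions many pairs are tied, and the result reduces to the average of the true positive rate and the true negative rate at that single threshold.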