# Telecom Customer Churn

This dataset comes from an Iranian telecom company, with each row representing a customer over a one-year period. Along with a churn label, it includes information on each customer's activity, such as call failures and subscription length.

Not sure where to begin? Scroll to the bottom to find challenges!

In [1]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(caret))

churn <- read_csv('data/customer_churn.csv', show_col_types = FALSE)

## Data Dictionary
| Column                  | Explanation                                             |
|-------------------------|---------------------------------------------------------|
| Call Failure            | number of call failures                                 |
| Complains               | binary (0: No complaint, 1: complaint)                  |
| Subscription Length     | total months of subscription                            |
| Charge Amount           | ordinal attribute (0: lowest amount, 9: highest amount) |
| Seconds of Use          | total seconds of calls                                  |
| Frequency of use        | total number of calls                                   |
| Frequency of SMS        | total number of text messages                           |
| Distinct Called Numbers | total number of distinct phone numbers called           |
| Age Group               | ordinal attribute (1: younger age, 5: older age)        |
| Tariff Plan             | binary (1: Pay as you go, 2: contractual)               |
| Status                  | binary (1: active, 2: non-active)                       |
| Age                     | age of customer                                         |
| Customer Value          | the calculated value of customer                        |
| Churn                   | class label (1: churn, 0: non-churn)                    |

[Source](https://www.kaggle.com/royjafari/customer-churn) of dataset and [source](https://archive.ics.uci.edu/ml/datasets/Iranian+Churn+Dataset) of dataset description. 

**Citation**: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.

## Don't know where to start?

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: Which age groups send more SMS messages than make phone calls?
- 📊 **Visualize**: Create a plot visualizing the number of distinct phone calls by age group. Within the chart, differentiate between short, medium, and long calls (by the number of seconds).
- 🔎 **Analyze**: Are there significant differences between the length of phone calls between different tariff plans?

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

You have just been hired by a telecom company. A competitor has recently entered the market and is offering an attractive plan to new customers. The telecom company is worried that this competitor may start attracting its customers.

You have access to a dataset of the company's customers, including whether customers churned. The telecom company wants to know whether you can use this data to predict whether a customer will churn. They also want to know what factors increase the probability that a customer churns.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.


In [2]:
str(churn)

spec_tbl_df [3,150 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Call  Failure          : num [1:3150] 8 0 10 10 3 11 4 13 7 7 ...
 $ Complains              : num [1:3150] 0 0 0 0 0 0 0 0 0 0 ...
 $ Subscription  Length   : num [1:3150] 38 39 37 38 38 38 38 37 38 38 ...
 $ Charge  Amount         : num [1:3150] 0 0 0 0 0 1 0 2 0 1 ...
 $ Seconds of Use         : num [1:3150] 4370 318 2453 4198 2393 ...
 $ Frequency of use       : num [1:3150] 71 5 60 66 58 82 39 121 169 83 ...
 $ Frequency of SMS       : num [1:3150] 5 7 359 1 2 32 285 144 0 2 ...
 $ Distinct Called Numbers: num [1:3150] 17 4 24 35 33 28 18 43 44 25 ...
 $ Age Group              : num [1:3150] 3 2 3 1 1 3 3 3 3 3 ...
 $ Tariff Plan            : num [1:3150] 1 1 1 1 1 1 1 1 1 1 ...
 $ Status                 : num [1:3150] 1 2 1 1 1 1 1 1 1 1 ...
 $ Age                    : num [1:3150] 30 25 30 15 15 30 30 30 30 30 ...
 $ Customer Value         : num [1:3150] 198 46 1537 240 146 ...
 $ Churn                  : num [1:

In [3]:
churn <- churn %>%
    rename(
       "Call_Failure" = "Call  Failure",
       "Subscription_Length" = "Subscription  Length",
       "Charge_Amount" = "Charge  Amount",
       "Seconds_of_Use" = "Seconds of Use",
       "Frequency_of_use" = "Frequency of use",
       "Frequency_of_SMS" = "Frequency of SMS",
       "Distinct_Called_Numbers" = "Distinct Called Numbers",
       "Age_Group" = "Age Group",
       "Tariff_Plan" = "Tariff Plan",
       "Customer_Value" = "Customer Value"
    )
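The explicit mapping above could also be written programmatically. The following is a minimal sketch, assuming the only problem with the raw names is runs of spaces; the toy tibble `raw` is a hypothetical stand-in for the CSV, not the real data.

```r
library(dplyr)

# Hypothetical stand-in for the raw data: note the double space in the first name
raw <- tibble(
    `Call  Failure` = c(8, 0),
    `Seconds of Use` = c(4370, 318),
    Churn = c(0, 0)
)

# Collapse every run of whitespace in a column name into a single underscore
clean <- raw %>%
    rename_with(~ gsub("\\s+", "_", .x))

names(clean)
```

The explicit `rename()` call is easier to audit, while `rename_with()` keeps working if columns are added later, so either choice is defensible here.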

In [4]:
# Age takes only five distinct values in this data, so grouping by it yields one row per age band
(age_group_means <- churn %>%
    group_by(Age) %>%
    summarize(
       mean_sublength = mean(Subscription_Length),
       mean_charge_amount = mean(Charge_Amount),
       mean_usetime = mean(Seconds_of_Use),
       mean_smsfreq = mean(Frequency_of_SMS),
       mean_distinctnumberscalled = mean(Distinct_Called_Numbers),
       mean_customervalue = mean(Customer_Value)
    ))

Age,mean_sublength,mean_charge_amount,mean_usetime,mean_smsfreq,mean_distinctnumberscalled,mean_customervalue
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
15,32.05691,0.3333333,3986.854,20.19512,34.3252,334.5654
25,31.84764,0.6229508,4536.344,75.49952,22.98554,547.1267
30,33.34877,0.9649123,4463.154,90.04281,21.50246,541.4351
45,31.48354,1.035443,4042.089,42.05316,26.08608,207.6992
55,32.82353,2.9352941,5512.094,28.24706,29.72353,126.2149
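The Explore challenge (which age groups send more SMS messages than they make phone calls) could be approached with the same group-and-summarize pattern. The following is a rough sketch on a hypothetical toy tibble `toy` that mimics the renamed columns; the numbers are made up, and comparing group means is only one possible reading of the question.

```r
library(dplyr)

# Made-up values, for illustration only
toy <- tibble(
    Age_Group        = c(1, 1, 2, 2, 3, 3),
    Frequency_of_use = c(70, 60, 50, 40, 30, 20),
    Frequency_of_SMS = c(10, 20, 80, 90, 25, 15)
)

# Compare the mean number of calls with the mean number of SMS per age group
sms_vs_calls <- toy %>%
    group_by(Age_Group) %>%
    summarize(
        mean_calls = mean(Frequency_of_use),
        mean_sms = mean(Frequency_of_SMS)
    ) %>%
    mutate(more_sms_than_calls = mean_sms > mean_calls)

sms_vs_calls
```

On the real data, `toy` would be replaced by `churn`; the flag column then marks the age groups of interest.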


In [14]:
glm1 <- glm(Churn ~., churn, family = "binomial")
summary(glm1)
(glm1_glance <- glance(glm1))


Call:
glm(formula = Churn ~ ., family = "binomial", data = churn)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59049  -0.33912  -0.13595  -0.03192   2.92523  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.6579070  0.8039318  -3.306 0.000946 ***
Call_Failure             0.1314874  0.0178009   7.387 1.51e-13 ***
Complains                4.0421354  0.2820449  14.332  < 2e-16 ***
Subscription_Length     -0.0298973  0.0095140  -3.142 0.001675 ** 
Charge_Amount           -0.4146409  0.1206564  -3.437 0.000589 ***
Seconds_of_Use           0.0001086  0.0001442   0.753 0.451339    
Frequency_of_use        -0.0552806  0.0084907  -6.511 7.48e-11 ***
Frequency_of_SMS        -0.0471298  0.0124199  -3.795 0.000148 ***
Distinct_Called_Numbers -0.0110415  0.0096158  -1.148 0.250856    
Age_Group                0.0855448  0.2854640   0.300 0.764430    
Tariff_Plan              0.2302320  0.6335469   0.363 0.7
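The estimates in this summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: a value above 1 means higher odds of churn per unit increase in the predictor, holding the others fixed. The sketch below uses three estimates copied by hand from the output above; with the fitted model at hand, `exp(coef(glm1))` or `tidy(glm1, exponentiate = TRUE)` from broom should give the same quantities for all terms.

```r
# Estimates copied from the summary output above (log-odds scale)
log_odds <- c(
    Complains = 4.0421354,
    Call_Failure = 0.1314874,
    Charge_Amount = -0.4146409
)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)
```

Read this way, a complaint multiplies the estimated odds of churn by roughly 57, while each step up in charge amount reduces them by about a third.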

In [15]:
fmla1 <- Churn ~ Call_Failure + Complains + Charge_Amount + Frequency_of_use + Frequency_of_SMS + Status 

In [17]:
glm2 <- glm(
    fmla1,
    churn,
    family = "binomial"
)

glm2_glance <- glance(glm2)
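Because `glm2` uses a subset of the predictors in `glm1`, the two models are nested, and one way to check whether the smaller model gives up meaningful fit is a likelihood-ratio test alongside AIC. The sketch below shows the pattern on simulated stand-in data (the variables `x1`, `x2`, and `y` are invented for illustration); on the real models the call would be `anova(glm2, glm1, test = "Chisq")`.

```r
set.seed(1)

# Simulated stand-in data: y depends on x1 only, x2 is pure noise
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * sim$x1))

full <- glm(y ~ x1 + x2, data = sim, family = "binomial")
reduced <- glm(y ~ x1, data = sim, family = "binomial")

# Likelihood-ratio (chi-squared) test of the nested models
lrt <- anova(reduced, full, test = "Chisq")
lrt

# Lower AIC indicates a better trade-off between fit and complexity
AIC(reduced, full)
```

The `glance()` results already hold the AIC and deviance of each model, so they can be compared directly as well.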

In [None]:
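set.seed(69)

# twoClassSummary and classProbs = TRUE need a factor outcome whose levels are valid R names.
# ROC does not depend on level order; Sens/Spec treat the first level ("no") as the event.
if (is.numeric(churn$Churn)) {
    churn$Churn <- factor(churn$Churn, levels = c(0, 1), labels = c("no", "yes"))
}

# returnTrain = TRUE returns the training rows of each fold, which is what `index` expects
myFolds <- createFolds(churn$Churn, k = 5, returnTrain = TRUE)

myControl <- trainControl(
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    verboseIter = TRUE,
    savePredictions = TRUE,
    index = myFolds
)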

In [None]:

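# Penalized logistic regression (requires the glmnet package)
model_glmnet <- train(
    Churn ~ .,
    data = churn,
    metric = "ROC",
    method = "glmnet",
    trControl = myControl
)

# Random forest (requires the ranger package)
model_rf <- train(
    Churn ~ .,
    data = churn,
    metric = "ROC",
    method = "ranger",
    trControl = myControl
)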
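Once two models have been trained on the same resampling indices, caret's `resamples()` can line up their per-fold metrics for comparison. The sketch below is self-contained and runs on simulated stand-in data, with `glm` and `rpart` swapped in for glmnet and ranger so that it needs no extra packages beyond caret; the names `sim`, `folds`, `ctrl`, `fit_glm`, and `fit_tree` are illustrative only.

```r
library(caret)

set.seed(1)

# Simulated stand-in data with a two-level factor outcome
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(
    ifelse(runif(n) < plogis(1.5 * sim$x1), "yes", "no"),
    levels = c("no", "yes")
)

# Shared training indices so both models see identical resamples
folds <- createFolds(sim$y, k = 5, returnTrain = TRUE)
ctrl <- trainControl(
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    index = folds
)

fit_glm <- train(y ~ ., data = sim, method = "glm", metric = "ROC", trControl = ctrl)
fit_tree <- train(y ~ ., data = sim, method = "rpart", metric = "ROC", trControl = ctrl)

# Collect the per-fold metrics of both models and summarize them side by side
resamps <- resamples(list(glm = fit_glm, tree = fit_tree))
summary(resamps)
```

For the models in this notebook the equivalent call would be `resamples(list(glmnet = model_glmnet, rf = model_rf))`, assuming both were trained with `myControl`.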