<b><font size=5>Predicting client churn solution report</font></b>

<b><font size=4>Problem statement</font></b>

The telecom operator <b>Interconnect</b> wants to forecast client churn. If a user is identified as likely to leave, Interconnect can offer promotional deals and special plan options to try to retain them. Given current and past information about each client, including their personal details, plans, and contracts, can machine learning accurately predict which users are close to discontinuing service?

An acceptable model must yield an AUC-ROC of at least <b>0.75</b> to be promoted to production.

<b><font size=4>Data preparation</font></b>

<font size=3>Data sources</font>

The data consisted of plan and contract information for each client, spread across four CSV files.

- <b>contract.csv</b> - contract information
- <b>personal.csv</b> - client's personal data
- <b>internet.csv</b> - information about Internet services
- <b>phone.csv</b> - information about telephone services

<font size=3>Data preprocessing</font>

1. Load data from the CSV files into DataFrames
2. Merge all four datasets into one main DataFrame
    - Fill in missing values resulting from the merge
3. Update column names to all lowercase lettering
4. Replace single-space string values in the <mark>totalcharges</mark> field
5. Convert data types of certain fields
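
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small in-memory frames stand in for the four CSV files, the original column names (such as `customerID`) and the fill values (`"No"` for missing services, `0` for blank total charges) are assumptions, since the report does not state them.

```python
import pandas as pd

# Tiny in-memory stand-ins for the four CSV files (hypothetical values).
# With the real files, each frame would come from pd.read_csv(...).
contract = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "BeginDate": ["2019-01-01", "2020-01-01", "2018-06-01"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
    "MonthlyCharges": [29.85, 56.95, 70.70],
    "TotalCharges": ["358.2", " ", "1272.6"],
})
personal = pd.DataFrame({"customerID": ["A1", "B2", "C3"], "SeniorCitizen": [0, 0, 1]})
internet = pd.DataFrame({"customerID": ["A1", "C3"], "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["B2", "C3"], "MultipleLines": ["No", "Yes"]})

# Left-merge everything onto the contract table so every client is kept.
df = contract
for other in (personal, internet, phone):
    df = df.merge(other, on="customerID", how="left")

# Clients without a given service get NaN from the merge.
# Filling with "No" is an assumed choice.
service_cols = ["InternetService", "MultipleLines"]
df[service_cols] = df[service_cols].fillna("No")

# Lowercase all column names.
df.columns = df.columns.str.lower()

# Replace single-space strings, then convert types.
# Using 0 as the replacement value is an assumption.
df["totalcharges"] = pd.to_numeric(df["totalcharges"].replace(" ", "0"))
df["begindate"] = pd.to_datetime(df["begindate"])
```
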

<font size=3>Feature engineering</font>

1. Create a new Yes / No column called <mark>discontinued</mark> representing client churn
2. Create new month and year columns extracted from <mark>begindate</mark>
3. Apply one-hot encoding to categorical features
4. Apply feature scaling / standardization to numerical features
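
A rough sketch of these four steps is shown below. The sample rows are made up, and the column names `enddate`, `type`, `monthlycharges`, `begin_month`, and `begin_year` are assumptions for illustration, as is the rule that an `enddate` of `"No"` marks an active client.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample of the merged DataFrame.
df = pd.DataFrame({
    "begindate": pd.to_datetime(["2019-01-01", "2020-01-01", "2018-06-01", "2017-03-01"]),
    "enddate": ["No", "No", "2019-12-01 00:00:00", "2020-01-01 00:00:00"],
    "type": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "monthlycharges": [29.85, 56.95, 70.70, 99.65],
})

# 1. Yes / No churn flag (assumes enddate == "No" means the client is active).
df["discontinued"] = (df["enddate"] != "No").map({True: "Yes", False: "No"})

# 2. Month and year extracted from the start date.
df["begin_month"] = df["begindate"].dt.month
df["begin_year"] = df["begindate"].dt.year

# 3. One-hot encode categorical features, dropping the first level
#    of each to avoid redundant columns.
encoded = pd.get_dummies(
    df.drop(columns=["begindate", "enddate"]),
    columns=["type"],
    drop_first=True,
)

# 4. Standardize numerical features to zero mean and unit variance.
scaler = StandardScaler()
encoded[["monthlycharges"]] = scaler.fit_transform(encoded[["monthlycharges"]])
```

In a full pipeline the scaler would normally be fit on the training split only and then applied to the validation and test splits, so that no information leaks from held-out data.
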

<b><font size=4>Exploratory data analysis</font></b>

<font size=3>Client churn over time</font>

1. Client churn progression
2. Client tenure distribution

Conclusions

- All clients that have discontinued service have left within the past 4 months
- The number of new clients has been steady over the years, but there has been an abrupt spike within the past 5 months
- There is a net loss of around 200 customers over the past 4 months
- More recently joined clients are discontinuing service than longer-tenured clients

<font size=3>Client and plan characteristics churn comparison</font>

1. Gender of client
2. Senior citizen status of client
3. Internet services held by client
4. Contract length
5. Charges paid per client

Conclusions

- There are no major differences between genders in terms of client churn
- Proportionally more senior citizens are discontinuing service than non-senior citizens
- More customers on fiber optic have discontinued service compared to customers on DSL. This may be due to the higher monthly charges of fiber optic
- Month-to-month contracts account for the most discontinued clients, but this is expected
- The total charges of all services might not play a big role in the decision of a client leaving
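
Comparisons like the ones above usually come down to a churn rate per group. The sketch below shows one way to compute it. The helper `churn_rate_by` and the sample values are hypothetical, and the column names follow the lowercase convention described earlier.

```python
import pandas as pd

# Hypothetical sample; the real data would be the merged client DataFrame.
df = pd.DataFrame({
    "internetservice": ["DSL", "Fiber optic", "Fiber optic", "DSL", "Fiber optic", "No"],
    "discontinued": ["No", "Yes", "Yes", "No", "No", "No"],
})

def churn_rate_by(frame, column):
    """Return the share of discontinued clients for each value of `column`."""
    churned = frame["discontinued"].eq("Yes")
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)

rates = churn_rate_by(df, "internetservice")
```

Comparing rates rather than raw counts keeps groups of different sizes (for example senior versus non-senior citizens) on the same footing.
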

<b><font size=4>Model preparation</font></b>

1. Define features and target
2. Split data into training, validation, and test datasets
3. Upsample the minority class of the target variable to mitigate the class imbalance in the dataset
4. Build functions to streamline model building process
    - <b>build_model</b> - Gathers model predictions, retrieves AUC-ROC and accuracy scores, and plots a ROC curve
    - <b>build_confusion_matrix</b> - Creates a confusion matrix describing how the model made predictions
    - <b>build_table</b> - Creates a summary table of the AUC-ROC scores
5. Create a Logistic Regression base model for comparison and basic evaluation
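
The split, upsampling, and baseline steps might look something like the sketch below. It uses synthetic data from `make_classification` in place of the client data, and the 60/20/20 split, the `upsample` helper, and the repeat factor of 3 are all assumptions, since the report does not give these details.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Synthetic, imbalanced stand-in for the client features and churn target.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
y = pd.Series(y)

# Assumed 60/20/20 split into training, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

def upsample(features, target, repeat, random_state=0):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    pos = target == 1
    features_up = pd.concat([features[~pos]] + [features[pos]] * repeat)
    target_up = pd.concat([target[~pos]] + [target[pos]] * repeat)
    return shuffle(features_up, target_up, random_state=random_state)

# Upsample the training set only, so validation and test data
# keep the original class balance.
X_up, y_up = upsample(X_train, y_train, repeat=3)

# Logistic regression baseline scored by AUC-ROC on the validation set.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_up, y_up)
valid_auc = roc_auc_score(y_valid, baseline.predict_proba(X_valid)[:, 1])
```
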

<b><font size=4>Model building and analysis</font></b>

Three model types were built and fine-tuned to predict client churn: random forest, K-nearest neighbors, and gradient boosting (LightGBM). For each type, an unoptimized model was first built to establish a performance baseline, followed by an optimized model with tuned hyperparameters. Lastly, scores and confusion matrices were compared to evaluate each model's overall performance.

<font size=3>Random forest</font>

Hyperparameters tuned
- n_estimators
- max_depth
- min_samples_leaf
- min_samples_split
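
As an illustration, the four hyperparameters above could be tuned with scikit-learn's `GridSearchCV`, scored by AUC-ROC. The grid values, the 3-fold cross-validation, and the synthetic data below are assumptions chosen to keep the sketch fast. They are not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid over the four tuned hyperparameters.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

# Exhaustive search over the grid, ranking candidates by mean
# cross-validated AUC-ROC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```

The number of fits grows multiplicatively with each added parameter value (here 2 × 2 × 2 × 2 combinations times 3 folds), which is why a compact grid matters for tuning time.
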

AUC-ROC score comparison
- Logistic regression: <b>0.8648</b>
- Unoptimized model: <b>0.8534</b>
- Optimized model: <b>0.8498</b>

Accuracy score comparison
- Logistic regression: <b>0.7835</b>
- Unoptimized model: <b>0.8027</b>
- Optimized model: <b>0.7921</b>

<font size=3>K-nearest neighbors</font>

Hyperparameters tuned
- n_neighbors
- leaf_size
- p
- weights

AUC-ROC score comparison
- Logistic regression: <b>0.8648</b>
- Unoptimized model: <b>0.7946</b>
- Optimized model: <b>0.8169</b>

Accuracy score comparison
- Logistic regression: <b>0.7835</b>
- Unoptimized model: <b>0.7431</b>
- Optimized model: <b>0.7459</b>

<font size=3>Gradient boosting (LightGBM)</font>

Hyperparameters tuned
- num_leaves
- max_depth
- n_estimators
- learning_rate

AUC-ROC score comparison
- Logistic regression: <b>0.8648</b>
- Unoptimized model: <b>0.8471</b>
- Optimized model: <b>0.8641</b>

Accuracy score comparison
- Logistic regression: <b>0.7835</b>
- Unoptimized model: <b>0.7885</b>
- Optimized model: <b>0.7970</b>

<b><font size=4>Final model evaluation</font></b>

Interconnect requested a better way to predict which clients may be discontinuing service. Knowing this in advance allows the company to target those clients with promotional deals and special plan options that may prevent them from leaving. A means of identifying these clients lets Interconnect be proactive in keeping a client's business, rather than reactive after the client has left.

After building and fine-tuning all three model types, the gradient boosting model performed the best and was evaluated on the test dataset. Its predictions yielded an AUC-ROC score that beats the threshold of <b>0.75</b>. This performance suggests that the model can adequately predict which clients could potentially discontinue service, allowing Interconnect to act quickly to try to keep their business. Below are the final metrics.

- AUC-ROC threshold: <b>0.75</b>
- AUC-ROC score: <b>0.8407</b>
- Accuracy score: <b>0.7509</b>
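
A final check of this kind can be sketched as below. The `evaluate` helper is hypothetical, the data is synthetic, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for the LightGBM model so that the example needs no extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

AUC_THRESHOLD = 0.75  # acceptance threshold from the problem statement

def evaluate(model, features, target):
    """Return AUC-ROC, accuracy, and the confusion matrix for a fitted model."""
    proba = model.predict_proba(features)[:, 1]  # probability of the positive class
    preds = model.predict(features)
    return {
        "auc_roc": roc_auc_score(target, proba),
        "accuracy": accuracy_score(target, preds),
        "confusion_matrix": confusion_matrix(target, preds),
    }

# Synthetic stand-in for the client data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
metrics = evaluate(model, X_test, y_test)
meets_threshold = metrics["auc_roc"] >= AUC_THRESHOLD
```

AUC-ROC is computed from predicted probabilities rather than hard labels, so it measures how well the model ranks churners above non-churners independently of any single decision cutoff.
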

<b><font size=4>Difficulties and issues</font></b>

1. Hyperparameter tuning took a considerable amount of time. It was important to be strategic about which parameters to tune and how to build the parameter grid and grid search object in order to save time on the optimization process
2. Building models and recreating the same ROC plots and confusion matrices became tedious, so functions were written to streamline this process
3. It was decided not to perform bootstrapping, as it was not seen as necessary for this problem

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> Overall looks like a nice report! Well done and best of luck in your job search!
<a class="tocSkip"></a>