# D208 PA Task 1 - Predictive Modeling - Linear Regression Modeling

Shanay Murdock

WGU MSDA-DS Masters Candidate

# Part I: Research Question

## A1. Research Question

I aim to explore the following research question: "What factors contribute to customer tenure?" 

Customer churn is a significant issue in the highly competitive telecommunications market, with some companies experiencing churn rates as high as 25% annually (per D208 Data Dictionary). To keep customers from leaving for competitors, the company has much to gain from understanding which factors retain existing customers and lengthen tenure over time. This analysis can support informed decision-making by identifying which factors to capitalize on and which need dedicated improvement.

## A2. Goals

This analysis aims to gain greater insight into which factors directly contribute to customer tenure. Multiple linear regression can be used to assess which factors in the dataset (independent variables) are associated with customer tenure (dependent variable). Identifying these factors can help executives respond to what drives customers to stay with the company rather than leave for a competitor. Retaining customers reduces the resources spent recruiting new or returning customers and builds long-term loyalty. Linear regression also allows us to make predictions under given conditions; an early signal of potential churn gives the company time to pivot and address issues that affect retention, whether from a customer service perspective or a quality of service perspective.

# Part II: Method Justification

## B1. Summary of Assumptions

A multiple linear regression model rests on several assumptions. Four of them are summarized here [(Bobbitt, 2021)](https://www.statology.org/multiple-linear-regression-assumptions/):

**Assumption 1:** Linear relationships exist. A linear relation exists between the target variable and each predictor variable.

**Assumption 2:** There is no multicollinearity. None of the predictor variables are highly correlated with each other. Highly correlated predictors carry redundant information, which makes the individual coefficient estimates unstable and harder to interpret. This assumption is shared with multiple logistic regression.

**Assumption 3:** Observations are independent of each other. One observation cannot be the reason another observation exists in the dataset, the same entity cannot be measured more than once, and observations cannot be related to each other in any other way. This assumption is shared with multiple logistic regression.

**Assumption 4:** There is multivariate normality. The residuals of the model are normally distributed.
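Assumptions 2 and 4 can be checked numerically. The following is a minimal sketch on synthetic data, not the churn dataset: the variables `x1`, `x2`, `x3` and the helpers `fit_ols` and `vif` are hypothetical names introduced for illustration. It computes a variance inflation factor (VIF) for each predictor and then summarizes the shape of the residuals with skewness and excess kurtosis, both of which should be near zero for normally distributed residuals.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic predictors (hypothetical): x1 and x2 are independent,
# while x3 is almost a copy of x1 and is therefore collinear with it.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.1, size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef


def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        _, resid = fit_ols(others, target)
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Assumption 2: large VIFs flag x1 and x3 as redundant with each other.
vifs = vif(np.column_stack([x1, x2, x3]))

# Assumption 4: refit without the redundant predictor and inspect the residuals.
_, residuals = fit_ols(np.column_stack([x1, x2]), y)
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z**3)
excess_kurtosis = np.mean(z**4) - 3.0

print("VIF:", np.round(vifs, 2))
print("residual skewness:", round(skewness, 3))
print("residual excess kurtosis:", round(excess_kurtosis, 3))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cutoff is a judgment call. A histogram or Q-Q plot of the residuals is a useful visual companion to the two summary numbers.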

## B2. Tool Benefits

I use Python throughout this project because of its vast and ready-to-use collection of packages and libraries designed for data cleaning, analysis, exploration, visualization, and regression modeling.

Pandas gives Python a tabular data structure, the DataFrame, so data can be handled much like a spreadsheet. NumPy adds the numerical and statistical routines used to compute summary statistics and correlation metrics.

Matplotlib and Seaborn are data visualization packages that provide graphing functionality, aiding in data analysis and exploration in ways that reading raw data can't offer.

Multiple linear regression is a statistical method that predicts a response (target) variable by combining several explanatory (predictor) variables. Python and libraries like statsmodels and Scikit-learn provide an efficient and convenient way to implement multiple linear regression. With these packages, one can easily preprocess data, split it into training and testing sets, fit the model, and evaluate its performance.
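The split, fit, and evaluate workflow described above can be sketched as follows. This is an illustration on synthetic data that uses only NumPy so the steps stay visible; statsmodels and Scikit-learn wrap the same steps in higher-level interfaces. The data, the 80/20 split, and the helper `add_intercept` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (hypothetical): three predictors and a known linear signal.
X = rng.normal(size=(n, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(scale=0.5, size=n)

# Shuffle the row indices and hold out 20% of the rows for testing.
idx = rng.permutation(n)
cut = int(0.8 * n)
train, test = idx[:cut], idx[cut:]


def add_intercept(A):
    """Prepend a column of ones so the model estimates an intercept."""
    return np.column_stack([np.ones(len(A)), A])


# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Evaluate on the held-out rows with R^2.
pred = add_intercept(X[test]) @ beta
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2_test = 1.0 - ss_res / ss_tot

print("intercept and coefficients:", np.round(beta, 3))
print("test R^2:", round(r2_test, 3))
```

Evaluating on rows the model never saw gives a more honest estimate of predictive performance than R² computed on the training data.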

Leveraging Python's ecosystem ensures flexibility, readability, and access to a wide range of data exploration, visualization, and model assessment tools. It's a powerful choice for data professionals due to its simplicity, extensive community support, and robustness in handling complex tasks.

## B3. Appropriate Technique

Multiple linear regression (MLR) is a powerful technique for predicting a continuous target variable using two or more continuous or categorical predictor variables.

MLR offers flexibility by accommodating both continuous and categorical predictors, allowing it to model complex relationships with a continuous target variable. It is limited only by constraints on data types, not by context or subject matter, so it can be used in any discipline.
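Categorical predictors enter a linear model as numeric indicator (dummy) columns. The sketch below shows one common way to encode them with `pd.get_dummies`. The `toy` frame is hypothetical: it borrows column names from the churn dataset, but its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names follow the churn dataset.
toy = pd.DataFrame(
    {
        "Tenure": [6.8, 1.2, 15.8, 17.1, 1.7, 10.0],
        "MonthlyCharge": [172.5, 242.6, 159.9, 120.0, 149.9, 180.0],
        "Contract": [
            "One year",
            "Month-to-month",
            "Two Year",
            "Two Year",
            "Month-to-month",
            "One year",
        ],
    }
)

# drop_first=True omits one level ("Month-to-month") as the baseline, which
# avoids perfect collinearity between the dummy columns and the intercept.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=float)

print(encoded)
```

With this encoding, the coefficient on `Contract_Two Year` would be read as the expected difference in `Tenure` between two-year and month-to-month customers, holding the other predictors fixed.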

MLR offers a straightforward way to interpret findings: each coefficient represents the impact of its predictor variable on the target. It is also highly structured. Because of the assumptions stated in B1, the data must meet certain conditions for the model to produce reliable results, which constrains the types of data (and the relationships among them) that can be used in the model.

Because I want to understand what contributes to `Tenure`, a continuous variable, I believe multiple linear regression is an appropriate way to model it using the other numeric and categorical data available. While `Tenure` has a finite maximum value in the dataset, it has no theoretical maximum beyond how long the company has been providing telecommunication services to customers. `Tenure` is also measured precisely, using floats to record the number of full and partial months.

# Part III: Data Preparation

## Libraries and Data

In [5]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [4]:
# Load the dataset
df = pd.read_csv('churn_clean.csv', index_col=0)

# Set option to not truncate results with large number of columns
pd.set_option("display.max_columns", None)
df.head()

CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,Population,Area,TimeZone,Job,Children,Age,Income,Marital,Gender,Churn,Outage_sec_perweek,Email,Contacts,Yearly_equip_failure,Techie,Contract,Port_modem,Tablet,InternetService,Phone,Multiple,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,PaymentMethod,Tenure,MonthlyCharge,Bandwidth_GB_Year,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,K409198,aa90260b-4141-4a24-8e36-b04ce1f4f77b,e885b299883d4f9fb18e39c75155d990,Point Baker,AK,Prince of Wales-Hyder,99927,56.251,-133.37571,38,Urban,America/Sitka,Environmental health practitioner,0,68,28561.99,Widowed,Male,No,7.978323,10,0,1,No,One year,Yes,Yes,Fiber Optic,Yes,No,Yes,Yes,No,No,No,Yes,Yes,Credit Card (automatic),6.795513,172.455519,904.53611,5,5,5,3,4,4,3,4
2,S120509,fb76459f-c047-4a9d-8af9-e0f7d4ac2524,f2de8bef964785f41a2959829830fb8a,West Branch,MI,Ogemaw,48661,44.32893,-84.2408,10446,Urban,America/Detroit,"Programmer, multimedia",1,27,21704.77,Married,Female,Yes,11.69908,12,0,1,Yes,Month-to-month,No,Yes,Fiber Optic,Yes,Yes,Yes,No,No,No,Yes,Yes,Yes,Bank Transfer(automatic),1.156681,242.632554,800.982766,3,4,3,3,4,3,4,4
3,K191035,344d114c-3736-4be5-98f7-c72c281e2d35,f1784cfa9f6d92ae816197eb175d3c71,Yamhill,OR,Yamhill,97148,45.35589,-123.24657,3735,Urban,America/Los_Angeles,Chief Financial Officer,4,50,9609.57,Widowed,Female,No,10.7528,9,0,1,Yes,Two Year,Yes,No,DSL,Yes,Yes,No,No,No,No,No,Yes,Yes,Credit Card (automatic),15.754144,159.947583,2054.706961,4,4,2,4,4,3,3,3
4,D90850,abfa2b40-2d43-4994-b15a-989b8c79e311,dc8a365077241bb5cd5ccd305136b05e,Del Mar,CA,San Diego,92014,32.96687,-117.24798,13863,Suburban,America/Los_Angeles,Solicitor,1,48,18925.23,Married,Male,No,14.91354,15,2,0,Yes,Two Year,No,No,DSL,Yes,No,Yes,No,No,No,Yes,No,Yes,Mailed Check,17.087227,119.95684,2164.579412,4,4,4,2,5,4,3,3
5,K662701,68a861fd-0d20-4e51-a587-8a90407ee574,aabb64a116e83fdc4befc1fbab1663f9,Needville,TX,Fort Bend,77461,29.38012,-95.80673,11352,Suburban,America/Chicago,Medical illustrator,0,83,40074.19,Separated,Male,Yes,8.147417,16,2,1,No,Month-to-month,Yes,No,Fiber Optic,No,No,No,No,No,Yes,Yes,No,No,Mailed Check,1.670972,149.948316,271.493436,4,4,4,3,4,4,4,5
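Two quick checks tie the preview back to Parts I and II: whether any customer appears more than once (Assumption 3) and whether `Tenure` is stored as a float (B3). The sketch below rebuilds only the five rows displayed above, restricted to four columns, so it runs without `churn_clean.csv`. The names `sample`, `n_duplicate_ids`, `tenure_is_float`, and `numeric_cols` are introduced for illustration; on the full dataset the same calls would be applied to `df`.

```python
import pandas as pd

# The five rows shown in the preview above, limited to four columns.
sample = pd.DataFrame(
    {
        "Customer_id": ["K409198", "S120509", "K191035", "D90850", "K662701"],
        "Tenure": [6.795513, 1.156681, 15.754144, 17.087227, 1.670972],
        "MonthlyCharge": [172.455519, 242.632554, 159.947583, 119.95684, 149.948316],
        "Bandwidth_GB_Year": [904.53611, 800.982766, 2054.706961, 2164.579412, 271.493436],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="CaseOrder"),
)

# Assumption 3: repeated customer IDs would suggest repeated measurements.
n_duplicate_ids = int(sample["Customer_id"].duplicated().sum())

# B3: confirm the target is stored as a floating-point column.
tenure_is_float = pd.api.types.is_float_dtype(sample["Tenure"])

# Numeric columns are the candidates that need no encoding before modeling.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print("duplicate Customer_id values:", n_duplicate_ids)
print("Tenure is float:", tenure_is_float)
print("numeric columns:", numeric_cols)
```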


# Part IV: Model Comparison and Analysis

# Part V: Data Summary and Implications

# Part VI: Demonstration

## G. Panopto Video

## H. Sources of Third-Party Code

- [Boorman, G. and Weber, I. _DataCamp - Exploratory Data Analysis in Python_, 2024](https://app.datacamp.com/learn/courses/exploratory-data-analysis-in-python)
- [Jones, K. _DataCamp - Working with Categorical Data in Python_, 2024](https://app.datacamp.com/learn/courses/working-with-categorical-data-in-python)
- Walker, M. _Python Data Cleaning Cookbook_ (2nd ed.), Packt Publishing, 2024

## I. Other Sources

- [Bobbitt, Z. _The Five Assumptions of Multiple Linear Regression_, 2021](https://www.statology.org/multiple-linear-regression-assumptions/)