# <span style="color:turquoise">Performance Assessment NBM2 Task 1 | D208 Predictive Modeling
&emsp;Ryan L. Buchanan
<br>&emsp;Student ID:  001826691
<br>&emsp;Masters Data Analytics (12/01/2020)
<br>&emsp;Program Mentor:  Dan Estes
<br>&emsp;(385) 432-9281 (MST)
<br>&emsp;rbuch49@wgu.edu
</span>

### <span style="color:Gold"><b>Part I: Research Question</b></span>

A.  Describe the purpose of this data analysis by doing the following:

1.  Summarize one research question that is relevant to a real-world organizational situation captured in the data set you have selected and that you will answer using multiple regression.

2.  Define the objectives or goals of the data analysis. Ensure that your objectives or goals are reasonable within the scope of the data dictionary and are represented in the available data.

### <span style="color:green"><b>A1. Question for Analysis</b>:</span>
Which customers are at high risk of churn?  And, which customer features/variables are most significant to churn?

### <span style="color:green"><b>A2. Benefit from Analysis</b>:</span>
Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.

### <span style="color:green"><b>A3. Data Identification</b>:</span>
Most relevant to our decision-making process is the dependent variable "Churn", which is binary categorical with only two values, "Yes" or "No".  In cleaning the data, we identified three continuous numerical columns as relevant: "Tenure" (the number of months the customer has stayed with the provider), "MonthlyCharge" (the average monthly charge to the customer) & "Bandwidth_GB_Year" (the average yearly amount of data used, in GB, per customer).  Finally, the discrete numerical data from customer survey responses about various customer service features is relevant to the decision-making process.  In the surveys, customers provided ordinal numerical data by rating 8 customer service factors ("timely response", "timely fixes", "timely replacements", "reliability", "options", "respectful response", "courteous exchange" & "evidence of active listening") on a scale of 1 to 8 (1 = most important, 8 = least important).
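The column descriptions above can be checked against the data before any testing. The following is a minimal sketch, not part of the original analysis: the helper `check_churn_columns` and the three-row `sample` frame are hypothetical, and the survey columns are assumed to carry their raw names `Item1` through `Item8`.

```python
import pandas as pd

# Assumed raw column names for the eight survey items
SURVEY_COLS = [f"Item{i}" for i in range(1, 9)]
CONTINUOUS_COLS = ["Tenure", "MonthlyCharge", "Bandwidth_GB_Year"]

def check_churn_columns(frame):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Churn should contain only the two labels described above
    if set(frame["Churn"].dropna().unique()) - {"Yes", "No"}:
        problems.append("Churn has values other than 'Yes'/'No'")
    # The continuous columns should be numeric
    for col in CONTINUOUS_COLS:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
    # Survey ratings should fall on the 1 to 8 scale
    for col in SURVEY_COLS:
        if not frame[col].between(1, 8).all():
            problems.append(f"{col} has ratings outside 1-8")
    return problems

# Tiny made-up sample for illustration only (not real customer data)
sample = pd.DataFrame({
    "Churn": ["Yes", "No", "No"],
    "Tenure": [6.8, 1.2, 15.8],
    "MonthlyCharge": [172.5, 242.6, 159.9],
    "Bandwidth_GB_Year": [904.5, 800.9, 2054.7],
    **{col: [3, 4, 5] for col in SURVEY_COLS},
})
print(check_churn_columns(sample))  # prints []
```

Running the same helper on the loaded dataframe (before the survey columns are renamed) would flag any column that does not match the description.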

### <span style="color:green"><b>B1. Code</b>:</span>
Chi-square testing will be used.

### Standard imports

In [None]:
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Statistics packages
import pylab
import statsmodels.api as sm
import statistics
from scipy import stats

# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

In [None]:
# Load data set into Pandas dataframe
df = pd.read_csv('Data/churn_clean.csv')

In [None]:
# Rename last 8 survey columns for better description of variables
df.rename(columns={'Item1': 'TimelyResponse',
                   'Item2': 'Fixes',
                   'Item3': 'Replacements',
                   'Item4': 'Reliability',
                   'Item5': 'Options',
                   'Item6': 'Respectfulness',
                   'Item7': 'Courteous',
                   'Item8': 'Listening'},
          inplace=True)

In [None]:
contingency = pd.crosstab(df['Churn'], df['TimelyResponse'])
contingency

In [None]:
contingency_pct = pd.crosstab(df['Churn'], df['TimelyResponse'], normalize='index')
contingency_pct

In [None]:
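plt.figure(figsize=(12,8))
# fmt='d' prints the cell counts as whole numbers rather than in scientific notation
sns.heatmap(contingency, annot=True, fmt='d', cmap="YlGnBu")
plt.show()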

### <span style="color:green"><b>B2. Output</b>:</span>

In [None]:
# Chi-square test of independence
c, p, dof, expected = chi2_contingency(contingency)
print('p-value = ' + str(p))

### <span style="color:green"><b>B3. Justification</b>:</span>
In this analysis, we are looking at churn from a telecom company ("Did customers stay with or leave the company?").  "Churn" is a binary, categorical dependent variable.  Our other variable, "TimelyResponse", is categorical at the ordinal level.  Because both variables are categorical, we will use a chi-square test of independence, a non-parametric test suited to this "yes/no" target variable.
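To make the mechanics of the test concrete, here is a small sketch on a made-up 2 x 3 table of counts. The numbers are illustrative only and are not taken from the churn data. It also checks a common rule of thumb (every expected count at least 5) that is often used to judge whether the chi-square approximation is reasonable.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: rows are churn "No"/"Yes", columns are three rating levels
observed = np.array([[30, 45, 25],
                     [10, 15, 25]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Degrees of freedom for an r x c table are (r - 1) * (c - 1)
print("degrees of freedom:", dof)

# Rule of thumb: the approximation is usually trusted when all expected counts are >= 5
print("all expected counts >= 5:", bool((expected >= 5).all()))

# Decision at the conventional significance level
alpha = 0.05
print("reject independence at alpha = 0.05:", bool(p_value < alpha))
```

The `expected` array holds the counts that would be seen if the two variables were independent; the statistic grows as the observed counts move away from them.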

### <span style="color:green"><b>C. Univariate Statistics</b>:</span>


Two continuous variables:  
    1. MonthlyCharge
    2. Bandwidth_GB_Year
Two categorical (ordinal) variables:
    1. Item1 (Timely response) - relabeled "TimelyResponse"
    2. Item7 (Courteous exchange) - relabeled "Courteous" 
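
The two kinds of variables listed above are usually summarized differently. The sketch below uses synthetic data (the `demo` frame is hypothetical, generated only for illustration): quartiles and spread for a continuous column, and frequency counts plus the median for an ordinal rating.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, not the real churn file
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "MonthlyCharge": rng.normal(170, 40, size=200).round(2),
    "TimelyResponse": rng.integers(1, 9, size=200),  # ratings 1..8
})

# Continuous variable: count, mean, standard deviation and quartiles
continuous_summary = demo["MonthlyCharge"].describe()

# Ordinal variable: how often each rating level occurs, plus the median rating
rating_counts = demo["TimelyResponse"].value_counts().sort_index()
rating_median = demo["TimelyResponse"].median()

print(continuous_summary)
print(rating_counts)
print("median rating:", rating_median)
```

For ordinal ratings, counts per level and the median are often easier to interpret than a mean, because the distance between rating levels is not guaranteed to be equal.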

In [None]:
df.describe()

### <span style="color:green"><b>C1. Visual of Findings</b>:</span>


In [None]:
# Create histograms of continuous & categorical variables
df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()
# Apply the layout before saving so the saved image matches the displayed one
plt.tight_layout()
plt.savefig('churn_pyplot.jpg')

In [None]:
# Create Seaborn boxplots for continuous & categorical variables
# The column is passed as x= because recent Seaborn versions require keyword arguments
sns.boxplot(x='MonthlyCharge', data=df)
plt.show()

In [None]:
sns.boxplot(x='Bandwidth_GB_Year', data=df)
plt.show()

In [None]:
sns.boxplot(x='TimelyResponse', data=df)
plt.show()

In [None]:
sns.boxplot(x='Courteous', data=df)
plt.show()

### <span style="color:green"><b>D. Bivariate Statistics</b></span>


Two continuous variables:  
    1. MonthlyCharge
    2. Bandwidth_GB_Year
Two categorical (binomial & ordinal, respectively) variables:
    1. Churn
    2. Item7 (Courteous exchange) - relabeled "Courteous" 
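
One way to relate the binary "Churn" variable to the others is sketched below. It is an illustration on synthetic data, not the notebook's own analysis: the `demo` frame is hypothetical, with churn drawn at roughly a 27% rate. A row-normalized crosstab gives the churn share within each rating level, and a grouped summary compares charges between the two churn groups.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, not the real churn file
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Churn": rng.choice(["No", "Yes"], size=300, p=[0.73, 0.27]),
    "Courteous": rng.integers(1, 9, size=300),  # ratings 1..8
    "MonthlyCharge": rng.normal(170, 40, size=300),
})

# Categorical vs. categorical: share of "No"/"Yes" within each rating level
churn_share_by_rating = pd.crosstab(demo["Courteous"], demo["Churn"], normalize="index")

# Continuous vs. categorical: compare charges between the two churn groups
charge_by_churn = demo.groupby("Churn")["MonthlyCharge"].agg(["mean", "median", "count"])

print(churn_share_by_rating)
print(charge_by_churn)
```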

### <span style="color:green"><b>D1. Visual of Findings</b>:</span>


In [None]:
# Create dataframe for heatmap bivariate analysis of correlation
churn_bivariate = df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']]

In [None]:
sns.heatmap(churn_bivariate.corr(), annot=True)
plt.show()

In [None]:
# Create a scatter plot of continuous variables MonthlyCharge & Bandwidth_GB_Year
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].sample(100).plot.scatter(x='MonthlyCharge', 
                                                                                 y='Bandwidth_GB_Year')

# Create a scatter plot of categorical variables TimelyResponse & Courteous
churn_bivariate[churn_bivariate['TimelyResponse'] < 7].sample(100).plot.scatter(x='TimelyResponse', 
                                                                                 y='Courteous')

In [None]:
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].plot.hexbin(x='MonthlyCharge', y='Bandwidth_GB_Year', gridsize=15)

### <span style="color:green"><b>E1. Results of Analysis</b></span>
The chi-square test of independence between "Churn" and "TimelyResponse" returned p-value = 0.6318335816054494. Because this exceeds the standard significance level of alpha = 0.05, we cannot reject the null hypothesis that the two variables are independent. Given the cleaned data available, we therefore found no statistically significant relationship between this survey response (essentially, "How well did we, the telecom company, take care of you as a customer?") & whether or not customers left the company.
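
The same test can be repeated for each of the eight survey items rather than one. The sketch below is an assumption-laden illustration: `chi_square_by_item` is a hypothetical helper, and the `demo` frame is random synthetic data, so its p-values say nothing about the real customers.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Renamed survey columns, as used in this notebook
SURVEY_COLS = ["TimelyResponse", "Fixes", "Replacements", "Reliability",
               "Options", "Respectfulness", "Courteous", "Listening"]

def chi_square_by_item(frame, target="Churn", items=SURVEY_COLS):
    """Run a chi-square test of independence between target and each item."""
    rows = []
    for item in items:
        table = pd.crosstab(frame[target], frame[item])
        chi2_stat, p_value, dof, _ = chi2_contingency(table)
        rows.append({"item": item, "chi2": chi2_stat, "dof": dof, "p_value": p_value})
    # Smallest p-values first
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Synthetic stand-in data, not the real churn file
rng = np.random.default_rng(42)
demo = pd.DataFrame({col: rng.integers(1, 9, size=500) for col in SURVEY_COLS})
demo["Churn"] = rng.choice(["No", "Yes"], size=500, p=[0.73, 0.27])

results = chi_square_by_item(demo)
print(results)
```

With eight tests at alpha = 0.05, a small p-value can appear by chance alone, so any single significant item should be read with caution.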

### <span style="color:green"><b>E2. Limitations of Analysis</b>:</span>
With a p-value this high (p-value = 0.6318335816054494), we need to investigate further & perhaps gather more & better data.  The test also covers only one survey item, "TimelyResponse", so it says little about the remaining features.  It is troubling that this dataset has yielded so little meaningful & actionable information.

### <span style="color:green"><b>E3. Recommended Course of Action</b>:</span>
While tests show very little correlation & perhaps no linear relationship between the variables involved in timely action with regard to customer satisfaction (TimelyResponse, Fixes, Replacements & Respectfulness), we believe that these elements should be given greater emphasis. Targeting more resources toward prompt customer service will hopefully help reduce the churn rate from the large figure of 27% & "increase the retention period of customers" (Ahmad, 2019, p. 1). Again, this seems an intuitive result, but decision-makers in the company now have reasonable verification of what might have been a "hunch".

### <span style="color:green"><b>F. Video</b></span>
<span style="color:red">https://wgu.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=%2237a1d719-eece-4cea-949f-ac7201896b42%22</span>

### <span style="color:green">G. Sources for Third-Party Code</span>

Kaggle. (2018, May 01). Bivariate plotting with pandas. Kaggle. https://www.kaggle.com/residentmario/bivariate-plotting-with-pandas#

<br> Sree. &ensp; (2020, October 26). &ensp; <i>Predict Customer Churn in Python.</i> &ensp; Towards Data Science. https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7

<br> Wikipedia. (2021, May 31). Bivariate Analysis. https://en.wikipedia.org/wiki/Bivariate_analysis#:~:text=Bivariate%20analysis%20is%20one%20of,the%20empirical%20relationship%20between%20them.&text=Like%20univariate%20analysis%2C%20bivariate%20analysis%20can%20be%20descriptive%20or%20inferential.

### <span style="color:green">H. Sources</span>

Ahmad, A. K., Jafar, A., & Aljoumaa, K. &ensp; (2019, March 20). &ensp; <i>Customer churn prediction in telecom using machine learning in big data platform</i>. &ensp; Journal of Big Data. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

<br> Altexsoft. &ensp; (2019, March 27). &ensp; <i>Customer Churn Prediction Using Machine Learning: Main Approaches and Models</i>. Altexsoft. https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/

<br> Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical Statistics for Data Scientists. O'Reilly.

<br> Freedman, D., Pisani, R., & Purves, R. (2018). Statistics. W. W. Norton & Company, Inc.

<br> Frohbose, F. &ensp; (2020, November 24). &ensp; <i>Machine Learning Case Study: Telco Customer Churn Prediction</i>.  Towards Data Science. https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d

<br> Griffiths, D. (2009). A Brain-Friendly Guide: Head First Statistics. O'Reilly.

<br> NIH. (2020). National Library of Medicine. https://www.nlm.nih.gov/nichsr/stats_tutorial/section2/mod11_significance.html#:~:text=In%20statistical%20tests%2C%20statistical%20significance,set%20to%200.05%20(5%25).

<br> P-Values. (2020). StatsDirect Limited. https://www.statsdirect.com/help/basics/p_values.htm

In [None]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('D207_Performance_Assessment.ipynb')