# Exploring churn data with univariate/bivariate analysis and ANOVA testing 
## Data set: Cleaned customer churn data of a telecommunications company
## Sean Pharris  
## 16 Dec 2021  

### A1.  Research question: Do the survey questions point to a specific factor of importance that could help slow down customer churn?

### A2.  Benefit of the results: Answering the question above would show stakeholders how effective the survey questions are at revealing what customers value. After the study concludes, the stakeholders could use the results to revise the survey questions so that they better identify the customer base's factor of importance.

### A3.  Relevant data: The data analyzed to answer this question are the survey questions and the "Churn" variable. The "Churn" variable indicates whether the customer discontinued the service within the last month. It is a binary variable with the values "Yes" or "No". The survey questions are listed as Item1 through Item8, and each customer was asked to rate the importance of each item.
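One way the two pieces of data described above could be brought together is to compare the mean rating of each survey item between churned and retained customers. The sketch below is illustrative only: the `toy` DataFrame and its values are made up, and only two of the eight items are shown.

```python
import pandas as pd

# Hypothetical toy data: the column names follow the description above,
# but the values are invented for illustration only.
toy = pd.DataFrame({
    "Churn": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "Item1": [4, 3, 2, 5, 3, 4],
    "Item2": [3, 3, 4, 4, 2, 5],
})

# Mean rating of each survey item for churned ("Yes") vs. retained ("No") customers
mean_by_churn = toy.groupby("Churn")[["Item1", "Item2"]].mean()
print(mean_by_churn)
```

A large gap between the "Yes" and "No" rows for an item would suggest that the item is worth a closer look. It would not by itself establish that the item drives churn.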


### B1.  Type of analysis: ANOVA testing will be conducted in Python.

### B2.  The output of the analysis is below.

In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../input/clean-churn-data/churn_clean.csv')
df

In [None]:
# Renaming the survey columns
df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'TimelyFixes', 
                     'Item3':'TimelyReplacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'RespectfulResponse', 
                     'Item7':'CourteousExchange', 
                     'Item8':'EvidenceOfActiveListening'}, 
          inplace=True)

In [None]:
survey_columns = df[['TimelyResponse','TimelyFixes','TimelyReplacements','Reliability', 'Options', 'RespectfulResponse', 'CourteousExchange', 'EvidenceOfActiveListening']]
survey_columns

In [None]:
from matplotlib import pyplot as plt 

survey_columns.boxplot()
plt.show()

In [None]:
survey_columns.mean()

In [None]:
(survey_columns.mean()).mean()

In [None]:
df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['TimelyResponse','TimelyFixes','TimelyReplacements','Reliability', 'Options', 'RespectfulResponse', 'CourteousExchange', 'EvidenceOfActiveListening'])
# replace column names
df_melt.columns = ['index', 'survey_questions', 'value']

In [None]:
import scipy.stats as stats

# One-way ANOVA across the eight survey questions (one group per column)
fvalue, pvalue = stats.f_oneway(*(survey_columns[col] for col in survey_columns.columns))
print(f'f-value = {fvalue}')
print(f'p-value = {pvalue}')

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Ordinary Least Squares (OLS) model
model = ols('value ~ survey_questions', data=df_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

### B3.  Justification for technique used: ANOVA testing was chosen because there are more than two groups to compare (the eight survey questions). One-way ANOVA splits the variation in the ratings into between-group and within-group sums of squares, each with its own degrees of freedom, and compares the resulting mean squares through the F-statistic. It also lets us test the hypothesis behind the research question by deciding whether or not to reject the null hypothesis that the group means are equal.
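The sums of squares, degrees of freedom, and mean squares behind a one-way ANOVA can be worked through by hand. The sketch below does this with NumPy for three small made-up groups (hypothetical values, not the churn data) and checks the result against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings for three groups (made-up values, not the churn data)
groups = [
    np.array([3.0, 4.0, 5.0, 4.0]),
    np.array([2.0, 3.0, 3.0, 4.0]),
    np.array([4.0, 5.0, 5.0, 6.0]),
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n - k

# The F-statistic is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

With eight survey questions the same arithmetic applies, and the between-group degrees of freedom become 8 - 1 = 7.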


## C. Variables for univariate analysis  
## continuous variables:  
### -Age  
### -Income  

## categorical variables:  
### -TimelyResponse (Item1)  
### -TimelyFixes (Item2)  

1.

In [None]:
df[['Age']].boxplot()
plt.show()

In [None]:
df[['Income']].boxplot()

In [None]:
df[['TimelyResponse']].boxplot()

In [None]:
df[['TimelyFixes']].boxplot()

In [None]:
df[['Age']].hist()

In [None]:
df[['Income']].hist()

In [None]:
df[['TimelyResponse']].hist()

In [None]:
df[['TimelyFixes']].hist()

In [None]:
df.describe()

## D. Variables for Bivariate analysis 
## continuous variables:  
### -Age  
### -Income  

## categorical variables:  
### -TimelyResponse (Item1)  
### -TimelyFixes (Item2)  

In [None]:
plt.scatter(x=df['Age'], y=df['Income'])
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()

In [None]:
plt.scatter(x=df['TimelyResponse'], y=df['TimelyFixes'])
plt.xlabel("TimelyResponse")
plt.ylabel("TimelyFixes")
plt.show()

In [None]:
x = df['Age']
y = df['Income']
fig, ax = plt.subplots()
ax.bar(x, y, width=1, edgecolor="white", linewidth=0.7)
plt.show()

In [None]:
x = df['TimelyResponse']
y = df['TimelyFixes']
fig, ax = plt.subplots()
ax.bar(x, y, width=1, edgecolor="white", linewidth=0.7)
plt.show()

In [None]:
plt.style.use('_mpl-gallery-nogrid')
x = df['Age']
y = df['Income']
fig, ax = plt.subplots()
ax.hexbin(x, y, gridsize=10)
plt.show()

In [None]:
plt.style.use('_mpl-gallery-nogrid')
x = df['TimelyResponse']
y = df['TimelyFixes']
fig, ax = plt.subplots()
ax.hexbin(x, y, gridsize=10)
plt.show()

## E. 

### 1.  Hypothesis: There is a large difference between the customer survey results. The null hypothesis is that the mean ratings are equal across the eight survey questions. After completing the ANOVA test, we found that the sum of squares is equal to 3.5, the degrees of freedom is equal to 7, the f-value is .470908, and the p-value (the `PR(>F)` column of the ANOVA table) is .856288. The .856288 figure is a p-value rather than a critical f-value, so it is compared with the significance level, not with the f-value. If the p-value is below the significance level (conventionally 0.05), we reject the null hypothesis, and otherwise we do not reject it. Equivalently, we reject the null hypothesis only if the f-value exceeds the critical f-value for that significance level. The p-value of .856288 is far above 0.05, so we do not reject the null hypothesis. This means there is not a significant difference between the customer survey results to determine the order of importance in the survey factors. (Davila 2016)
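For reference, the critical value and the p-value of the F-distribution can be computed directly with `scipy.stats.f`. This is a minimal sketch under two assumptions that are not stated above: a significance level of 0.05 and 10,000 responses per survey question, which gives 7 and 79,992 degrees of freedom.

```python
from scipy import stats

alpha = 0.05            # assumed significance level
n_groups = 8            # eight survey questions
n_per_group = 10_000    # assumption: number of rows in the data set
f_value = 0.470908      # f-value reported above

df_between = n_groups - 1                       # 7
df_within = n_groups * n_per_group - n_groups   # 79,992 under the assumption

# Critical value: the F above which the null hypothesis is rejected at level alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
# p-value: probability of an F at least this large if the null hypothesis is true
p_value = stats.f.sf(f_value, df_between, df_within)

reject_null = f_value > f_critical
print(f"critical F = {f_critical:.4f}")
print(f"p-value    = {p_value:.6f}")
print(f"reject null hypothesis: {reject_null}")
```

Under these assumptions the critical value comes out near 2.01. The reported f-value of .470908 does not exceed it, so the decision not to reject the null hypothesis holds.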

### 2.  The data we currently have do not reveal which factors customers consider most important. We cannot identify any trends because the differences between the survey questions are not large enough.

### 3.  Based on this analysis, we recommend revising the survey questions and the rating system so that they reveal an order of importance among the attributes of the service. Once a better rating system is in place, we could then identify what needs to change in the service in order to slow down customer churn and provide a better service.

## F. https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=38d709bb-ba54-40b0-a16c-ae06012c7af4

## G. Sources for third-party code

### Jobs Admin (2020) Introduction to ANOVA for Statistics and Data Science (with COVID-19 Case Study using Python). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/

### Matplotlib documentation (2021) Matplotlib documentation. Matplotlib. https://matplotlib.org/stable/index.html 

### Scipy (2021) Perform one-way ANOVA. Scipy. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html



## H. References 

### Davila, E. (2016) Hypothesis test and F-statistic [Video] LinkedIn Learning. https://www.linkedin.com/learning/statistics-foundations-3/hypothesis-test-and-f-statistic?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=2045532


