# <span style="color:turquoise">D208 Performance Assessment NBM2 Task 1</span>
## <span style="color:turquoise">Multiple Regression for Predictive Modeling</span>
&emsp;Ryan L. Buchanan
<br>&emsp;Student ID:  001826691
<br>&emsp;Masters Data Analytics (12/01/2020)
<br>&emsp;Program Mentor:  Dan Estes
<br>&emsp;(385) 432-9281 (MST)
<br>&emsp;rbuch49@wgu.edu

### <span style="color:green"><b>A1. Research Question</b>:</span>
How many GBs of data will a customer use yearly?  Can this be predicted accurately from a list of explanatory variables?

### <span style="color:green"><b>A2. Objectives & Goals</b>:</span>
Stakeholders in the company will benefit by knowing, with some measure of confidence, how much data a customer might predictably use.
This will inform decisions on whether or not to expand customer data limits, provide unlimited (or metered) media streaming & expand company cloud computing resources to meet increased bandwidth demands.

### <span style="color:green"><b>B1. Summary of Assumptions</b>:</span>
Assumptions of a multiple regression model include:
* There is a linear relationship between the dependent variable & the independent variables.
* The independent variables are not too highly correlated with each other.
* y<sub>i</sub> observations are selected independently & randomly from the population.
* Residuals should be normally distributed with a mean of zero.
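
The multicollinearity assumption in the list above can be checked numerically. The following is a minimal sketch, assuming synthetic data and a hypothetical helper named `vif`, that computes a variance inflation factor (VIF) for each predictor with NumPy only. A VIF well above roughly 5 to 10 is a common rule of thumb for predictors that are too highly correlated; the cut-off is a convention, not a value taken from this analysis.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper).

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors (assumption): x3 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)
```

In this toy setup the first and third columns should show large VIFs, while the independent second column stays close to 1.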

### <span style="color:green"><b>B2. Tool Benefits</b>:</span>
Python & IPython Jupyter notebooks will be used to support this analysis.  Python offers a very intuitive, simple & versatile programming style & syntax, as well as a large ecosystem of mature packages for data science & machine learning.  Since Python is cross-platform, it will work well whether consumers of the analysis are using Windows PCs or a MacBook laptop.  It is fast when compared with other possible programming languages like R or MATLAB (Massaron, p. 8).
<br> &emsp; Also, there is strong support for Python as the most popular data science programming language in popular literature & media (<a target="_blank" href="https://www.cbtnuggets.com/blog/technology/data/why-data-scientists-love-python">CBTNuggets</a>).

### <span style="color:green"><b>B3. Appropriate Technique</b>:</span>
Multiple regression is an appropriate technique to analyze the research question because our target variable, the number of GBs used per year, is a continuous variable (how much data is used).  Also, there may be several (versus simply one) explanatory variables (area type, job, children, age, income, etc.) that will add to our understanding when trying to predict how much data a customer will use in a given year.  When adding or removing independent variables from our regression equation, we will find out whether they have a positive or negative relationship to our target variable & how that might affect company decisions on marketing segmentation.

### <span style="color:green"><b>C1. Data Goals</b>:</span>

My approach will include:
<br>&ensp; 1. Back up my data and the process I am following as a copy to my machine and, since this is a manageable dataset, to GitHub using the command line and Git Bash.
<br>&ensp; 2. Read the data set into Python using Pandas' read_csv command.
<br>&ensp; 3. Evaluate the data structure to better understand input data.
<br>&ensp; 4. Name the dataset as the variable "churn_df" and subsequent useful slices of the dataframe as "df".
<br>&ensp; 5. Examine potential misspellings, awkward variable naming & missing data.
<br>&ensp; 6. Find outliers that may create or hide statistical significance using histograms.
<br>&ensp; 7. Impute records with missing data using meaningful measures of central tendency (mean, median or mode), or simply remove outliers that are several standard deviations above the mean.
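
Steps 6 and 7 above can be sketched on toy data. This is a minimal illustration, assuming a synthetic `Income` column with one missing record and one implausibly extreme record; the cut-off of 3 standard deviations is an assumed convention, not a value fixed by this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric column (assumption, not the churn data)
rng = np.random.default_rng(42)
income = rng.normal(loc=40000, scale=8000, size=200)
income[5] = np.nan        # a missing record
income[10] = 500000.0     # an implausibly extreme record
toy = pd.DataFrame({'Income': income})

# Impute the missing value with the median, which is robust to the extreme record
toy['Income'] = toy['Income'].fillna(toy['Income'].median())

# Keep only rows within 3 standard deviations of the mean
z_scores = (toy['Income'] - toy['Income'].mean()) / toy['Income'].std()
toy_no_outliers = toy[z_scores.abs() <= 3]
print(len(toy), len(toy_no_outliers))
```

The median is used here rather than the mean because a single extreme value would pull the mean upward before the outlier is removed.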

Most relevant to our decision making process is the <b>dependent variable</b> of "Bandwidth_GB_Year" (the average yearly amount of data used, in GB, per customer) which will be our <b>continuous target variable</b>. We need to train & then test our model on the given dataset so that it can estimate how much data a customer may use, based on the amounts used by known customers & their respective values for the selected predictor variables.
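
The train-then-test idea can be sketched as follows. This is a minimal illustration on synthetic data, assuming an arbitrary linear relationship between tenure and bandwidth and a manual 80/20 split with NumPy; scikit-learn's `train_test_split` with `test_size=0.2` could be used to do the splitting instead.

```python
import numpy as np

# Synthetic data (assumption): coefficients and noise level are arbitrary
rng = np.random.default_rng(1)
n = 1000
tenure = rng.uniform(1, 72, size=n)
bandwidth = 500 + 82 * tenure + rng.normal(0, 100, n)

# Shuffle row indices and hold out 20% of rows for testing
idx = rng.permutation(n)
test_idx, train_idx = idx[:200], idx[200:]

# Fit intercept and slope on the training rows only
A_train = np.column_stack([np.ones(train_idx.size), tenure[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, bandwidth[train_idx], rcond=None)

# Evaluate on the held-out rows with root mean squared error
pred = coef[0] + coef[1] * tenure[test_idx]
rmse = np.sqrt(np.mean((bandwidth[test_idx] - pred) ** 2))
print(coef, rmse)
```

Because the test rows were never used for fitting, the RMSE gives a fairer picture of how the model might perform on customers it has not seen.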

<br>In cleaning the data, we may discover relevance of the <b>continuous predictor variables</b>: 
* Children
* Income
* Outage_sec_perweek
* Yearly_equip_failure
* Tenure (the number of months the customer has stayed with the provider)
* MonthlyCharge
* Age
    
<br>Likewise, we may discover relevance of the <b>categorical predictor variables</b> (all binary categorical with only two values, "Yes" or "No", except where noted): 
* Techie: Whether the customer considers themselves technically inclined (based on customer questionnaire when they signed up for services) (yes, no)
* Contract: The contract term of the customer (month-to-month, one year, two year)
* Port_modem: Whether the customer has a portable modem (yes, no)
* Tablet: Whether the customer owns a tablet such as iPad, Surface, etc. (yes, no)
* InternetService: Customer’s internet service provider (DSL, fiber optic, None)
* Phone: Whether the customer has a phone service (yes, no)
* Multiple: Whether the customer has multiple lines (yes, no)
* OnlineSecurity: Whether the customer has an online security add-on (yes, no)
* OnlineBackup: Whether the customer has an online backup add-on (yes, no)
* DeviceProtection: Whether the customer has device protection add-on (yes, no)
* TechSupport: Whether the customer has a technical support add-on (yes, no)
* StreamingTV: Whether the customer has streaming TV (yes, no)
* StreamingMovies: Whether the customer has streaming movies (yes, no)
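
Before categorical predictors like these can enter a regression, they need numeric encodings. The following is a minimal sketch on a toy frame; the exact category spellings are assumptions and may differ from those in the dataset. Binary columns are mapped to 1/0, and the three-level `Contract` column is expanded into dummy columns with one level dropped as the baseline.

```python
import pandas as pd

# Toy frame (assumption): category labels may not match the real file exactly
toy = pd.DataFrame({
    'Techie': ['Yes', 'No', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two Year'],
})

# Binary Yes/No column -> 1/0
toy['Techie'] = toy['Techie'].map({'Yes': 1, 'No': 0})

# Multi-level column -> dummy columns, dropping the first level as the baseline
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids perfect collinearity between the dummy columns and the intercept.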
    
<br>Finally, <b>discrete ordinal predictor variables</b> from the survey responses from customers regarding various customer service features may be relevant in the decision-making process. In the surveys, customers provided ordinal numerical data by rating 8 customer service factors on a scale of 1 to 8 (1 = most important, 8 = least important): 
    
* Item1: Timely response
* Item2: Timely fixes
* Item3: Timely replacements
* Item4: Reliability
* Item5: Options
* Item6: Respectful response
* Item7: Courteous exchange
* Item8: Evidence of active listening


### <span style="color:green"><b>C2. Summary Statistics</b>:</span>
Discuss the summary statistics, including the target variable and all predictor variables that you will need to gather from the data set to answer the research question.

### <span style="color:green"><b>C3. Steps to Prepare Data</b>:</span>
Explain the steps used to prepare the data for the analysis, including the annotated code.

* Impute records with missing data using meaningful measures of central tendency (mean, median or mode), or simply remove outliers that are several standard deviations above the mean.

<span style="color:red">*</span>

<span style="color:red">*</span>

<span style="color:red">*</span>


* Finally, the prepared dataset will be extracted & provided as "churn_prepared.csv"

In [1]:
# Increase Jupyter display cell-width
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [None]:
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Statistics packages
import pylab
from pylab import rcParams
import statsmodels.api as sm
import statistics
from scipy import stats

# Scikit-learn
import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report


# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

# Ignore Warning Code
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Change color of Matplotlib font
import matplotlib as mpl

COLOR = 'white'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR

In [None]:
# Load data set into Pandas dataframe
churn_df = pd.read_csv('Data/churn_clean.csv')

# Rename last 8 survey columns for better description of variables
churn_df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'Fixes', 
                     'Item3':'Replacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'Respectfulness', 
                     'Item7':'Courteous', 
                     'Item8':'Listening'}, 
          inplace=True)

In [None]:
# Display Churn dataframe
churn_df

In [None]:
# List of Dataframe Columns
df = churn_df.columns
print(df)

In [None]:
# Find number of records and columns of dataset
churn_df.shape

In [None]:
# Describe Churn dataset statistics
churn_df.describe()

In [None]:
# Remove less meaningful demographic variables from statistics description
churn_df = churn_df.drop(columns=['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 
                            'State', 'County', 'Zip', 'Lat', 'Lng', 'Population', 
                            'Area', 'TimeZone', 'Job', 'Marital'])
churn_df.describe()

In [None]:
# Discover missing data points within dataset
data_nulls = churn_df.isnull().sum()
print(data_nulls)

### <span style="color:green"><b>C4. Visualizations</b>:</span>
Generate univariate and bivariate visualizations of the distributions of variables in the cleaned data set. Include the target variable in your bivariate visualizations.

In [None]:
# Visualize missing values in dataset

# Install appropriate library
!pip install missingno

# Importing the libraries
import missingno as msno

# Visualize missing values as a matrix
msno.matrix(churn_df);

In [None]:
'''No need to impute any missing values as the dataset appears complete/cleaned'''
# Impute missing fields for variables Children, Age, Income, Tenure and Bandwidth_GB_Year with median or mean
# churn_df['Children'] = churn_df['Children'].fillna(churn_df['Children'].median())
# churn_df['Age'] = churn_df['Age'].fillna(churn_df['Age'].median())
# churn_df['Income'] = churn_df['Income'].fillna(churn_df['Income'].median())
# churn_df['Tenure'] = churn_df['Tenure'].fillna(churn_df['Tenure'].median())
# churn_df['Bandwidth_GB_Year'] = churn_df['Bandwidth_GB_Year'].fillna(churn_df['Bandwidth_GB_Year'].median())

## Univariate Statistics

In [None]:
# Create histograms of continuous variables
churn_df[['Children', 'Age', 'Income', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year']].hist()
# Adjust the layout before saving so the saved figure matches the display
plt.tight_layout()
plt.savefig('churn_pyplot.jpg')

In [None]:
# Create Seaborn boxplots for continuous variables
sns.boxplot(x='MonthlyCharge', data=churn_df)
plt.show()

In [None]:
sns.boxplot(x='Bandwidth_GB_Year', data=churn_df)
plt.show()

### It appears that anomalies have been removed from the "churn_clean.csv" dataset, as there are no remaining outliers.

## Bivariate Statistics

### Let's run some scatterplots to get an idea of our linear relationships with bandwidth usage & some of the respective predictor variables.

In [None]:
# Run scatterplots to show direct or inverse relationships between target & independent variables
sns.scatterplot(x=churn_df['Children'], y=churn_df['Bandwidth_GB_Year'], color='red')
plt.show();

In [None]:
sns.scatterplot(x=churn_df['Age'], y=churn_df['Bandwidth_GB_Year'], color='red')
plt.show();

In [None]:
sns.scatterplot(x=churn_df['Income'], y=churn_df['Bandwidth_GB_Year'], color='red')
plt.show();

In [None]:
sns.scatterplot(x=churn_df['Tenure'], y=churn_df['Bandwidth_GB_Year'], color='red')
plt.show();

In [None]:
sns.scatterplot(x=churn_df['MonthlyCharge'], y=churn_df['Bandwidth_GB_Year'], color='red')
plt.show();

In [None]:
# Create dataframe for heatmap bivariate analysis of correlation
churn_bivariate = churn_df[['Bandwidth_GB_Year', 'Children', 'Age', 'Income', 
                            'Outage_sec_perweek', 'Yearly_equip_failure', 
                            'Tenure', 'MonthlyCharge']]

In [None]:
# Run Seaborn heatmap
sns.heatmap(churn_bivariate.corr(), annot=True)
plt.show()

### Again, it appears that Tenure is the predictor that accounts for most of the variance in bandwidth usage.

### Scree plots & Principal Component Analysis (PCA) -> <span style="color:red"><i>which suggests we should only use <b>certain</b> variables</i></span>

In [None]:
# Set up for scree plots & PCA

# For a scree plot import matplotlib & seaborn libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import Scikit Learn PCA application
from sklearn.decomposition import PCA

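# Select the 11 numeric variables used for PCA (survey columns as renamed above)
data = churn_df[['Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse',
                 'Fixes', 'Replacements', 'Reliability', 'Options',
                 'Respectfulness', 'Courteous', 'Listening']]

# Normalize the data
churn_normalized = (data - data.mean()) / data.std()

# Select number of components to extract
pca = PCA(n_components = data.shape[1])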

In [None]:
# Create a list of PCA names
churn_numeric = churn_df[['Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 
                          'Fixes', 'Replacements', 'Reliability', 'Options', 
                          'Respectfulness', 'Courteous', 'Listening']]
pcs_names = []
for i, col in enumerate(churn_numeric.columns):
    pcs_names.append('PC' + str(i + 1))
print(pcs_names)

In [None]:
# Call PCA application & convert the dataset of 11 variables into a dataset of 11 components
pca.fit(churn_normalized)
churn_pca = pd.DataFrame(pca.transform(churn_normalized),
                        columns = pcs_names)

In [None]:
# Run the scree plot
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show();

In [None]:
# Extract the eigenvalues
cov_matrix = np.dot(churn_normalized.T, churn_normalized) / data.shape[0]
eigenvalues = [np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)) 
               for eigenvector in pca.components_]

In [None]:
# Plot the eigenvalues
plt.plot(eigenvalues)
plt.xlabel('Number of Components')
plt.ylabel('Eigenvalue')
plt.show();

In [None]:
# Select the fewest components 
for pc, var in zip(pcs_names, np.cumsum(pca.explained_variance_ratio_)):
    print(pc, var)
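
A cumulative printout like the one above can be turned into a component count by choosing a target share of variance. The following is a minimal sketch; the ratios are hypothetical illustrative numbers, not output from the churn data, and the 80% threshold is an assumed target.

```python
import numpy as np

# Hypothetical explained-variance ratios (illustrative only, not from the churn data)
ratios = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])
cumulative = np.cumsum(ratios)

threshold = 0.80  # assumed target share of variance
# First position where the cumulative share reaches the threshold (1-based count)
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep)
```

With the fitted model, `pca.explained_variance_ratio_` would take the place of the hypothetical `ratios` array.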

In [None]:
# Above, we see that 86% of variance is explained by 7 components
# Create a rotation 
rotation = pd.DataFrame(pca.components_.T, columns = pcs_names, index = churn_numeric.columns)
print(rotation)

In [None]:
# Output loadings for components
loadings = pd.DataFrame(pca.components_.T,
                       columns = pcs_names,
                       index = data.columns)
loadings

In [None]:
# Finally, extract reduced dataset & print 3 components
churn_reduced = churn_pca.iloc[ : , 0:3]
print(churn_reduced)

### There is clearly a direct linear relationship between customer tenure with the telecom company & the amount of data (in GBs) that is being used.  Let's run a linear regression model of bandwidth usage on Tenure, with Children as a second predictor.

In [None]:
churn_df['intercept'] = 1
lm_bandwidth = sm.OLS(churn_df['Bandwidth_GB_Year'], churn_df[['Children', 'Tenure', 'intercept']]).fit()
print(lm_bandwidth.summary())

### Initial Multiple Linear Regression Model
With <b><i>seven</i></b> independent variables: 
<br><br>&emsp;<span style="color:gold">y = 104.85 + 30.86 * Children - 3.31 * Age + 0.00 * Income - 0.26 * Outage_sec_perweek + 0.67 * Yearly_equip_failure + 82.01 * Tenure + 3.28 * MonthlyCharge</span>

### Reduced Multiple Linear Regression Model
With two independent variables: y = 497.78 + 31.18 * Children + 81.94 * Tenure
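
As a worked example of reading the reduced equation, the sketch below plugs in a hypothetical customer with 2 children and 12 months of tenure. The function name `predict_bandwidth` and the input values are illustrative assumptions; the coefficients are the ones stated in the equation above.

```python
def predict_bandwidth(children, tenure):
    """Predicted Bandwidth_GB_Year from the reduced equation stated above."""
    return 497.78 + 31.18 * children + 81.94 * tenure

# Hypothetical customer: 2 children, 12 months of tenure
estimate = predict_bandwidth(children=2, tenure=12)
print(round(estimate, 2))
```

The arithmetic is 497.78 + 62.36 + 983.28, which gives an estimate of about 1543.42 GB per year for this hypothetical customer.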

### <span style="color:green"><b>C5. Prepared Dataset</b>:</span>
Provide a copy of the prepared data set.

In [None]:
# Extract Clean dataset
churn_df.to_csv('churn_prepared.csv', index=False)  # omit the row index from the file

### <span style='color:Gold'><b>Part IV: Model Comparison and Analysis</b></span>

D.  Compare an initial and a reduced multiple regression model by doing the following:

1.  Construct an initial multiple regression model from all predictors that were identified in Part C2.

2.  Justify a statistically based variable selection procedure and a model evaluation metric to reduce the initial model in a way that aligns with the research question.

3.  Provide a reduced multiple regression model that includes both categorical and continuous variables.



<span style='color:red'>Note: The output should include a screenshot of each model.</span>

### <span style="color:green"><b>D1. Initial Model</b></span>
Construct an initial multiple regression model from <span style="color:red"><i>all predictors that were identified in Part C2</i></span>.

In [None]:
# Develop the initial estimated regression equation that could be used to predict the Bandwidth_GB_Year, 
# given the continuous variables
churn_df['intercept'] = 1
lm_bandwidth = sm.OLS(churn_df['Bandwidth_GB_Year'], churn_df[['Children', 'Age', 
                                                               'Income',
                                                               'Outage_sec_perweek', 
                                                               'Yearly_equip_failure', 
                                                               'Tenure', 'MonthlyCharge', 
                                                               'intercept']]).fit()
print(lm_bandwidth.summary())

### The initial model has an R<sup>2</sup> value of 0.989, so about 99% of the variation in Bandwidth_GB_Year is explained by this model.
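
For reference, R<sup>2</sup> is one minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R<sup>2</sup> penalizes that figure for the number of predictors. The following is a minimal sketch with toy numbers (not values from this model) and a hypothetical helper named `r_squared`.

```python
import numpy as np

def r_squared(y, y_hat, n_predictors):
    """Return (R^2, adjusted R^2) for observed y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)       # total variation
    r2 = 1.0 - ss_res / ss_tot
    n = y.size
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj

# Toy observations and predictions (assumption, for illustration only)
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_hat = np.array([12.0, 18.0, 31.0, 39.0, 50.0])
r2, adj_r2 = r_squared(y, y_hat, n_predictors=1)
print(r2, adj_r2)
```

Adjusted R<sup>2</sup> is useful when comparing models with different numbers of predictors, since plain R<sup>2</sup> never decreases when a predictor is added.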

### <span style="color:green"><b>D2. Justification of Model Reduction</b></span>
Justify a statistically based variable selection procedure and a model evaluation metric to reduce the initial model in a way that aligns with the research question.
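
One statistically based selection procedure is backward elimination: fit the full model, drop the predictor with the weakest t-statistic, and refit until every remaining predictor is significant. The following is a minimal sketch on synthetic data using NumPy only. The helper names are hypothetical, and the cut-off of 1.96 is the large-sample approximation to a two-sided 5% significance level; the `P>|t|` column of a statsmodels summary gives exact p-values.

```python
import numpy as np

def ols_t_stats(A, y):
    """t-statistics for y ~ A, where A already contains an intercept column."""
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # coefficient covariance
    return beta / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_cut=1.96):
    """Drop the weakest predictor until every remaining |t| >= t_cut."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        t_pred = np.abs(ols_t_stats(A, y)[1:])    # skip the intercept
        weakest = int(np.argmin(t_pred))
        if t_pred[weakest] >= t_cut:
            break
        keep.pop(weakest)
    return [names[i] for i in keep]

# Synthetic data (assumption): 'Noise' is unrelated to the target
rng = np.random.default_rng(7)
n = 2000
tenure = rng.uniform(1, 72, n)
children = rng.integers(0, 5, n).astype(float)
noise_var = rng.normal(size=n)
y = 500 + 82 * tenure + 31 * children + rng.normal(0, 50, n)

selected = backward_eliminate(np.column_stack([tenure, children, noise_var]),
                              y, ['Tenure', 'Children', 'Noise'])
print(selected)
```

An unrelated predictor can still survive by chance about 5% of the time at this cut-off, which is one reason to pair the procedure with an evaluation metric such as adjusted R<sup>2</sup>.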

<span style='color:red'>Note: Heatmap of missing values vs observed</span>

### <span style="color:green"><b>D3. Reduced Multiple Regression Model</b></span>
Provide a reduced multiple regression model that includes both categorical and continuous variables.

### Well, there it is.  After removing all the other predictor variables except "Tenure", our model still explains 98% of the variance.

### <span style='color:Gold'><b>Part IV: E</b></span>
E.  Analyze the data set using your reduced multiple regression model by doing the following:

1.  Explain your data analysis process by comparing the initial and reduced multiple regression models, including the following elements:
<ul>
    <li>
    the logic of the variable selection technique
    </li>
    <li>
    the model evaluation metric
    </li>
    <li>
    a residual plot
    </li>
</ul>
2.  Provide the output and any calculations of the analysis you performed, including the model’s residual error.



<span style='color:red'>Note: The output should include the predictions from the refined model you used to perform the analysis. </span>



3.  Provide the code used to support the implementation of the multiple regression models.



### <span style="color:green"><b>E1. Model Comparison</b></span>
Explain your data analysis process by comparing the initial and reduced multiple regression models, including the following elements:
<ul>
    <li>
    the logic of the variable selection technique
    </li>
    <li>
    the model evaluation metric
    </li>
    <li>
    a residual plot
    </li>
</ul>

<span style='color:red'>Note: Verbatim from fasttrack description of analysis of Titanic dataset, 
<br>"Since male is the dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces log odds by 0.037." </span>

### <span style="color:green"><b>E2. Output & Calculations</b></span>
Provide the output and any calculations of the analysis you performed, including the model’s residual error.
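
The residual error can be computed directly from the fitted values. The following is a minimal sketch on synthetic data (not the churn data) showing residuals and the residual standard error, i.e. the square root of the residual sum of squares divided by the residual degrees of freedom. For a fitted statsmodels OLS result, the `resid` and `mse_resid` attributes provide the same quantities.

```python
import numpy as np

# Synthetic data (assumption): coefficients and noise level are arbitrary
rng = np.random.default_rng(3)
n = 500
tenure = rng.uniform(1, 72, n)
y = 500 + 82 * tenure + rng.normal(0, 60, n)

# Fit intercept and slope by least squares
X = np.column_stack([np.ones(n), tenure])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Residual standard error: sqrt(SSE / (n - number of estimated coefficients))
rse = np.sqrt(np.sum(residuals ** 2) / (n - X.shape[1]))
print(residuals.mean(), rse)
```

Plotting `residuals` against the fitted values `X @ beta` gives the residual plot; a patternless band around zero is consistent with the linearity and constant-variance assumptions.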



<span style='color:red'>Note: The output should include the predictions from the refined model you used to perform the analysis. </span>

### <span style="color:green"><b>E3. Code</b></span>
Provide the code used to support the implementation of the multiple regression models.

### <span style='color:Gold'><b>Part V: Data Summary and Implications</b></span>

F.  Summarize your findings and assumptions by doing the following:

1.  Discuss the results of your data analysis, including the following elements:
<ul>
    <li>
    a regression equation for the reduced model
    </li>
    <li>
    an interpretation of coefficients of the statistically significant variables of the model
    </li>
    <li>
    the statistical and practical significance of the model
    </li>
    <li>
    the limitations of the data analysis
    </li>
</ul>
2.  Recommend a course of action based on your results.

### <span style="color:green"><b>F1. Results</b></span>
 Discuss the results of your data analysis, including the following elements:
<ul>
    <li>
    a regression equation for the reduced model
    </li>
    <li>
    an interpretation of coefficients of the statistically significant variables of the model
    </li>
    <li>
    the statistical and practical significance of the model
    </li>
    <li>
    the limitations of the data analysis
    </li>
</ul>

### <span style="color:green"><b>F2. Recommendations</b></span>
Recommend a course of action based on your results.

### <span style='color:Gold'><b>Part VI: Demonstration</b></span>

G.  Provide a Panopto video recording that includes all of the following elements:

•  a demonstration of the functionality of the code used for the analysis

•  an identification of the version of the programming environment

•  a comparison of the two multiple regression models you used in your analysis

•  an interpretation of the coefficients.



### <span style="color:green"><b>G. Video</b></span>
<span style="color:red">link</span>

### <span style="color:green">H. Sources for Third-Party Code</span>

Kaggle. (2018, May 01). Bivariate plotting with pandas. Kaggle. https://www.kaggle.com/residentmario/bivariate-plotting-with-pandas#

<br> Sree. &ensp; (2020, October 26). &ensp; <i>Predict Customer Churn in Python.</i> &ensp; Towards Data Science. https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7

<br> Wikipedia. (2021, May 31). Bivariate Analysis. https://en.wikipedia.org/wiki/Bivariate_analysis#:~:text=Bivariate%20analysis%20is%20one%20of,the%20empirical%20relationship%20between%20them.&text=Like%20univariate%20analysis%2C%20bivariate%20analysis%20can%20be%20descriptive%20or%20inferential.

### <span style="color:green">I. Sources</span>

Ahmad, A. K., Jafar, A. & Aljoumaa, K. &ensp; (2019, March 20). &ensp; <i>Customer churn prediction in telecom using machine learning in big data platform</i>. &ensp; Journal of Big Data. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

<br> Altexsoft. &ensp; (2019, March 27). &ensp; <i>Customer Churn Prediction Using Machine Learning: Main Approaches and Models</i>. &ensp; Altexsoft. &ensp; &ensp; https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/

<br> Bruce, P., Bruce, A. & Gedeck, P. &ensp; (2020). &ensp; <i>Practical Statistics for Data Scientists</i>. &ensp; O'Reilly.

<br> CBTNuggets. &ensp; (2018, September 20). &ensp; <i>Why Data Scientists Love Python</i>. &ensp; https://www.cbtnuggets.com/blog/technology/data/why-data-scientists-love-python

<br> Freedman, D., Pisani, R. & Purves, R. &ensp; (2018). &ensp; <i>Statistics</i>. &ensp; W. W. Norton & Company, Inc. 

<br> Frohbose, F. &ensp; (2020, November 24). &ensp; <i>Machine Learning Case Study: Telco Customer Churn Prediction</i>.  &ensp; Towards Data Science. &ensp; https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d

<br> Griffiths, D. &ensp; (2009). &ensp; <i>A Brain-Friendly Guide: Head First Statistics</i>. &ensp; O'Reilly.

<br> Grus, J. &ensp; (2015). &ensp; <i>Data Science from Scratch</i>. &ensp; O'Reilly.

<br> Massaron, L. & Boschetti, A. &ensp; (2016). &ensp; <i>Regression Analysis with Python</i>. &ensp; Packt Publishing.

<br> McKinney, W. &ensp; (2018). &ensp; <i>Python for Data Analysis</i>. O'Reilly.

<br> Rossant, C. (2018). &ensp; <i>IPython Interactive Computing & Visualization Cookbook, 2nd Edition</i>. &ensp; Packt Publishing.

<br> Rossant, C. (2015). &ensp; <i>Learning IPython Interactive Computing & Visualization, 2nd Edition</i>. &ensp; Packt Publishing.

<br> VanderPlas, J. &ensp; (2017). &ensp; <i>Python Data Science Handbook</i>. &ensp; O'Reilly.

In [None]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('D208_Performance_Assessment_NBM2_Task_1.ipynb')