# Sean Pharris
# Model: K Nearest Neighbor
# Data set: Customer data of a telecommunications company
# Date: Jan 21, 2022

## Part I: Research Question

A.

1.  What customers are likely to discontinue their services in the next few months?

2.  Answering this research question will help the stakeholders understand customer turnover (churn) in detail.
 

## Part II: Method Justification

B.  

1.  The K Nearest Neighbor (KNN) algorithm will classify our data case by case to determine whether each data point indicates customer churn or not. The method includes:
    * Determining what K will be (K is the number of nearest surrounding points that vote on the class of the point in focus)
    * Classifying each point by the majority class among its K nearest neighbors
    * Summarizing the overall outcome across all of the points
    
    Outcome: Which customers are at risk of churn

2.  Assumptions:
* KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or even multidimensional vectors. Since the points are in a feature space, they have a notion of distance. This need not be Euclidean distance, although it is the one commonly used (Thirumuruganathan, S. (2010)).

* Each training example consists of a vector and the class label associated with it. In the simplest case, the label is either + or - (for positive or negative classes), but KNN works equally well with an arbitrary number of classes (Thirumuruganathan, S. (2010)).

* We are also given a single number "k". This number decides how many neighbors (where neighbors are defined based on the distance metric) influence the classification. It is usually an odd number if the number of classes is 2, so that the vote cannot tie. If k=1, the algorithm is simply called the nearest neighbor algorithm (Thirumuruganathan, S. (2010)).

3.  The benefits of Python are vast, but the main reasons for choosing it are its versatility, ease of use, and strong community support. Many packages make data analysis and prediction straightforward.

* Some of those packages are:
    * Pandas and NumPy - make it easy to handle large sets of data
    * Seaborn and Matplotlib - make data visualization a breeze
    * Statsmodels and Scikit-learn - allow for easy data exploration and prediction
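
To make the method in B1 and the assumptions in B2 concrete, here is a minimal pure-Python sketch of the idea: measure the Euclidean distance from a query point to every labelled point, keep the k closest, and take a majority vote. The function name `knn_predict` and the toy tenure/charge values are assumptions made for illustration; the analysis itself uses Scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled training point
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the labels of the k closest points
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (tenure in months, monthly charge)
train_points = [(2, 90), (3, 85), (4, 95), (40, 50), (45, 55), (50, 45)]
train_labels = ["Yes", "Yes", "Yes", "No", "No", "No"]

short_tenure = knn_predict(train_points, train_labels, (5, 88))
long_tenure = knn_predict(train_points, train_labels, (42, 52))
print(short_tenure, long_tenure)
```

With k=3 and two classes the vote cannot tie, which is why an odd k is the usual choice for a binary target such as churn.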
 

## Part III: Data Preparation

C.  

1.  The customers who discontinued their services in the last month have "yes" in the binary "Churn" column, and the customers who have not discontinued their services have "no". We will encode "yes" as 1 and "no" as 0 so that the model can process the data.

2.  The initial data set will include the variables below and our dependent variable will be "Churn", which is categorical.

* Continuous:

    'Population', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'TimelyFixes', 'TimelyReplacements', 'Reliability', 'Options', 'RespectfulResponse', 'CourteousExchange', 'EvidenceOfActiveListening'
       
* Categorical: 

    'Area', 'TimeZone', 'Job', 'Marital', 'Gender', 'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity','OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod'


3.  Steps to prepare data:
    1. Read the data into the data frame ("df") using Pandas "read_csv()"
    2. Drop unneeded columns
    3. Rename columns to make the data more understandable
    4. Make sure there are no null values
    5. Encode the categorical columns as numbers
    6. Cap the outliers in the numerical columns
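
The yes/no encoding described in C1 could look like the following sketch. The frame `toy` and the column name `Churn_encoded` are hypothetical names used only for illustration; the notebook itself encodes the labels later with `LabelEncoder`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real "Churn" column
toy = pd.DataFrame({"Churn": ["Yes", "No", "No"]})

# Lower-case first so "Yes" and "yes" map the same way, then encode as 1/0
toy["Churn_encoded"] = toy["Churn"].str.lower().map({"yes": 1, "no": 0})

print(toy)
```

Any value other than yes/no would become `NaN` under this mapping, which makes unexpected labels easy to spot.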

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
import warnings

warnings.filterwarnings('ignore')

### Import the data

In [None]:
# Read the data set into the data frame
df = pd.read_csv('../input/clean-churn-data/churn_clean.csv')

### Removing unneeded columns

In [None]:
# Drop unnecessary columns
df.drop(columns=['CaseOrder','UID', 'Customer_id','Interaction', 'Job','State','City','County','Zip','Lat','Lng', 'TimeZone', 'Marital'], inplace=True)

In [None]:
df.head()

### Changing the name of columns to make the data more understandable

In [None]:
# Renaming the survey columns
df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'TimelyFixes', 
                     'Item3':'TimelyReplacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'RespectfulResponse', 
                     'Item7':'CourteousExchange', 
                     'Item8':'EvidenceOfActiveListening'}, 
          inplace=True)

In [None]:
df.columns

### Check for null values in the data set

In [None]:
df.isnull().sum()

### Below, we will find our categorical data types

In [None]:
# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)

In [None]:
# view the categorical variables

print(categorical)

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    print(var, ' contains ', df[var].nunique(), ' labels')

### Below, we will find our numerical data types and handle outliers

In [None]:
# find numerical variables

numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

#### Now we will look for outliers

In [None]:
# view summary statistics in numerical variables, rounded to 2 decimals

print(round(df.describe(), 2))

Variables that may contain outliers:

* Population
* Income
* Bandwidth_GB_Year

In [None]:
# draw boxplots to visualize outliers
import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.boxplot(column='Population')
fig.set_title('')
fig.set_ylabel('Population')


plt.subplot(2, 2, 2)
fig = df.boxplot(column='Income')
fig.set_title('')
fig.set_ylabel('Income')


plt.subplot(2, 2, 3)
fig = df.boxplot(column='Bandwidth_GB_Year')
fig.set_title('')
fig.set_ylabel('Bandwidth_GB_Year')

In [None]:
# plot histograms to check the distributions
# (the y-axis of a histogram is the number of customers in each bin)

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.Population.hist(bins=10)
fig.set_xlabel('Population')
fig.set_ylabel('Number of customers')


plt.subplot(2, 2, 2)
fig = df.Income.hist(bins=10)
fig.set_xlabel('Income')
fig.set_ylabel('Number of customers')


plt.subplot(2, 2, 3)
fig = df.Bandwidth_GB_Year.hist(bins=10)
fig.set_xlabel('Bandwidth_GB_Year')
fig.set_ylabel('Number of customers')

Finding the outliers in our numerical data types
* Bandwidth_GB_Year does not appear to be skewed.
* Population and Income appear to be skewed, so we will now compute outlier boundaries from the interquartile range (IQR).

In [None]:
# find outliers for Population

IQR = df.Population.quantile(0.75) - df.Population.quantile(0.25)
lower = df.Population.quantile(0.25) - (IQR * 3)
upper = df.Population.quantile(0.75) + (IQR * 3)
print('Population outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=lower, upperboundary=upper))

In [None]:
# find outliers for Income

IQR = df.Income.quantile(0.75) - df.Income.quantile(0.25)
lower = df.Income.quantile(0.25) - (IQR * 3)
upper = df.Income.quantile(0.75) + (IQR * 3)
print('Income outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=lower, upperboundary=upper))

Fixing the outliers in our numerical data types

* We have seen that the Population and Income columns contain outliers.
* We will use a top-coding approach, which caps values above the maximum boundary at that boundary instead of dropping the rows.

In [None]:
def max_value(df3, variable, top):
    return np.where(df3[variable]>top, top, df3[variable])

for df3 in [df]:
    df3['Population'] = max_value(df3, 'Population', 50458.0)
    df3['Income'] = max_value(df3, 'Income', 155310.5275)

In [None]:
print(df.Population.max(), df.Income.max())
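
The caps in the cell above are typed in by hand. As a hedged alternative, the sketch below derives the cap from the data using the same rule as the IQR cells (third quartile plus 3 times the IQR) and applies it with `Series.clip`. The helper name `top_code` and the toy income values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def top_code(series: pd.Series, factor: float = 3.0) -> pd.Series:
    """Cap values above Q3 + factor * IQR at that upper boundary."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper = q3 + factor * (q3 - q1)
    # clip leaves values at or below the boundary untouched
    return series.clip(upper=upper)


# Hypothetical incomes with one extreme value
income = pd.Series([20_000, 30_000, 40_000, 50_000, 60_000, 1_000_000])
capped = top_code(income)

print(capped.max())
```

Computing the boundary inside the helper keeps the cap in sync with the data if the input file ever changes.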

In [None]:
# plot histograms to check the distributions after capping the outliers

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.Population.hist(bins=10)
fig.set_xlabel('Population')
fig.set_ylabel('Number of customers')


plt.subplot(2, 2, 2)
fig = df.Income.hist(bins=10)
fig.set_xlabel('Income')
fig.set_ylabel('Number of customers')

### C4.  Provide a copy of the cleaned data set.

In [None]:
# Desired data set
df.to_csv('KNN_churn.csv', index=False)

### D1.  Split the data into training and test data sets and provide the file(s).

In [None]:
from sklearn.model_selection import train_test_split

# Split the cleaned data frame into training and test rows

train, test = train_test_split(df, test_size = 0.2, random_state = 0)

# check the shape of train and test

train.shape, test.shape

In [None]:
# Put training and test data into their own CSVs.

train.to_csv('training_churn.csv', index=False)

test.to_csv('test_churn.csv', index=False)

### Next, we separate the dependent variable "Churn" from the independent variables, then encode, scale, and split them for modeling.

In [None]:
# separate the features (X) from the target variable (y)

X = df.drop(['Churn'], axis=1)

y = df['Churn']

In [None]:
# removing churn from categorical list because we will loop through the categorical in the encoding below

categorical.remove('Churn')

categorical

In [None]:
from sklearn import preprocessing

# Replace each category with an integer code, one column at a time
le = preprocessing.LabelEncoder()
for feature in categorical:
    X[feature] = le.fit_transform(X[feature])
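
Label encoding assigns arbitrary integers to categories, and a distance-based method such as KNN then treats those integers as if they were ordered. One alternative, sketched here as an assumption rather than what this notebook does, is one-hot (dummy) encoding with `pd.get_dummies`. The frame `toy` and its payment-method values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one numeric and one multi-category column
toy = pd.DataFrame({
    "MonthlyCharge": [120.5, 89.9, 150.0],
    "PaymentMethod": ["Mailed Check", "Credit Card", "Electronic Check"],
})

# One 0/1 column per category, so no category is "closer" to another
encoded = pd.get_dummies(toy, columns=["PaymentMethod"], dtype=int)

print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which matters for KNN because every extra column adds to the distance calculation.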

In [None]:
# normalizing/feature scaling the data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)

In [None]:
# split X and y into training and testing sets

from sklearn.model_selection import cross_val_score, train_test_split

# Set seed for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# check the shape of X_train and X_test

X_train.shape, X_test.shape
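
The cells above scale all rows before splitting, so the test rows contribute to the scaler's mean and standard deviation. A possible refinement, shown here on synthetic data and not taken from the original analysis, is to put the scaler and the classifier in one Scikit-learn pipeline so that the scaler is fitted on the training rows only. All `*_demo` and `X_tr`/`X_te` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and standard deviations when scoring the test rows
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```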

## Part IV: Analysis

### D2.  Analysis and intermediate calculations

In [None]:
from sklearn.neighbors import KNeighborsClassifier

n_neighbors = 3

# Create a k-NN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=n_neighbors)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Predict churn for the test set (accuracy is computed in E1)
y_pred = knn.predict(X_test)

### D3.  Code is above.
 

## Part V: Data Summary and Implications

### E1.  Accuracy and AUC

In [None]:
# Fit the encoder once so predictions and true labels share one Yes/No mapping
le.fit(y)
measure = le.transform(y_pred)
measure_test = le.transform(y_test)
measure_test

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

# On 0/1 labels the MSE equals the misclassification rate (1 - accuracy)
mse = mean_squared_error(measure_test, measure)


model_acc = 'Model accuracy: {0:0.4f}'.format(accuracy_score(y_test, y_pred))
print(model_acc)
mse_acc = 'MSE: {0:0.4f}'.format(mse)
print(mse_acc)
# For a classifier, knn.score returns the mean accuracy, not R-squared
r_squared = 'Mean accuracy (knn.score): {0:0.4f}'.format(knn.score(X_test, y_test))
print(r_squared)
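
For a binary target coded as 0 and 1, the mean squared error is simply the share of wrong predictions, so it always equals 1 minus the accuracy. The short check below uses made-up labels (an assumption for illustration, not results from this model) to show the relationship.

```python
import numpy as np

# Hypothetical true labels and predictions, coded 0 = no churn, 1 = churn
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0])

accuracy = np.mean(y_true == y_hat)    # share of correct predictions
mse = np.mean((y_true - y_hat) ** 2)   # each error contributes exactly 1

print(accuracy, mse)
```

Because the two numbers carry the same information, reporting accuracy alongside precision and recall is usually more informative than reporting accuracy alongside MSE.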

In [None]:
from sklearn.metrics import classification_report

initial_model_report = classification_report(y_test, y_pred)
# classification metrics
print(initial_model_report)

In [None]:
# Import sklearn confusion_matrix & generate results

from sklearn.metrics import confusion_matrix

cfm = confusion_matrix(y_test, y_pred)
print(cfm)

In [None]:
import seaborn as sns

# chart confusion matrix
categories = ['True Negative', 'False Positive', 'False Negative', 'True Positive']

cat_amount = ["{0:0.0f}".format(value) for value in cfm.flatten()]

cat_percent = ["{0:.2%}".format(value) for value in cfm.flatten()/np.sum(cfm)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(categories,cat_amount,cat_percent)]

labels = np.asarray(labels).reshape(2,2)

sns.heatmap(cfm, annot=labels, fmt='', cmap='Greens')
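
The four cells of the confusion matrix are enough to recompute the headline metrics by hand. The sketch below assumes the same layout as the heatmap labels (true negatives, false positives, false negatives, true positives) and uses hypothetical counts, not the counts from this model.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]
cfm_demo = np.array([[50, 10],
                     [5, 35]])

tn, fp, fn, tp = cfm_demo.ravel()

accuracy = (tp + tn) / cfm_demo.sum()   # share of all predictions that are right
precision = tp / (tp + fp)              # share of predicted churners who churned
recall = tp / (tp + fn)                 # share of actual churners who were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a churn problem, recall on the churn class is often the number to watch, since a missed churner is a customer the company never gets the chance to retain.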

In [None]:
from sklearn.metrics import roc_auc_score

# AUC computed from the hard 0/1 predictions rather than predicted probabilities
auc_est = roc_auc_score(measure_test, measure)

print("The AUC on the test set is", auc_est)

In [None]:
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(measure_test, measure)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='AUC Estimate')
display.plot()

plt.show()
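
The AUC above is computed from hard 0/1 predictions, which gives the ROC curve a single operating point. A common alternative, sketched here on synthetic data as an assumption rather than the notebook's method, is to score each customer with `predict_proba`, which for KNN is the share of neighbors voting for the positive class. All `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Column 1 of predict_proba is the share of neighbors voting for class 1
churn_scores = knn_demo.predict_proba(X_te)[:, 1]

auc_from_scores = roc_auc_score(y_te, churn_scores)
auc_from_labels = roc_auc_score(y_te, knn_demo.predict(X_te))

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

Scores with several distinct values let the ROC curve trace more than one threshold, so the resulting AUC describes ranking quality rather than one fixed cut-off.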

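### Accuracy and AUC Conclusion:
    * The accuracy of the model is approximately .8.
    * The precision is approximately .8 as well.
    * The area under the curve is approximately .6.
        * When 0.5<AUC<1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values (Bhandari, A. (2020)). An AUC of about .6 is closer to 0.5 (random guessing) than to 1, so the model separates churners from non-churners only modestly.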

### Next, we attempt to improve accuracy and reduce the dimensions of the model for efficiency

In [None]:
to_be_reduced_X = X_test
to_be_reduced_y = y_test

In [None]:
from sklearn.inspection import permutation_importance

# define the model
model = KNeighborsClassifier()

# fit the model on the real training data; importance is then measured on the
# held-out rows in to_be_reduced_X / to_be_reduced_y (make_classification is
# not used here because it would replace them with synthetic data)
model.fit(X_train, y_train)

In [None]:
# find the feature importance

results = permutation_importance(model, to_be_reduced_X, to_be_reduced_y, scoring='accuracy')

importance = results.importances_mean

In [None]:
# summarize feature importance

for i,v in enumerate(importance):
    print(X_test.columns[i], ': %0d, Score: %.5f' % (i,v))
    
# plot feature importance
plt.figure(figsize=(10, 10))
plt.bar([x for x in range(len(importance))], importance)
plt.show()
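
Permutation importance shuffles one column at a time and records how much the score drops, so a feature the model relies on shows a large drop while an irrelevant one shows almost none. The sketch below illustrates this on hypothetical data where the first column decides the label and the second is pure noise. The data and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical data: column 0 decides the label, column 1 is pure noise
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Fit on the first 300 rows and measure importance on the remaining 100
X_tr, X_te = X_demo[:300], X_demo[300:]
y_tr, y_te = y_demo[:300], y_demo[300:]

model_demo = KNeighborsClassifier().fit(X_tr, y_tr)

result = permutation_importance(
    model_demo, X_te, y_te, scoring="accuracy", n_repeats=10, random_state=0
)

print(result.importances_mean)
```

Setting `n_repeats` and `random_state` makes the ranking repeatable, which helps when a fixed cut-off such as 0.003 decides which features are kept.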

In [None]:
feats = []
for i,v in enumerate(importance):
    if v > 0.003:
        feats.append(X_test.columns[i])
        print(X_test.columns[i], ': %0d, Score: %.5f' % (i,v))
print(len(feats))

In [None]:
feats

In [None]:
reduced_X_df = X_test[feats]
reduced_y_df = y_test

In [None]:
reduced_X_df

In [None]:
reduced_X_train = X_train[feats]
reduced_y_train = y_train

In [None]:
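# Refit the classifier using only the selected features
knn.fit(reduced_X_train, reduced_y_train)

# For a classifier, knn.score returns the mean accuracy, not R-squared
print('Mean accuracy (knn.score):', knn.score(reduced_X_df,reduced_y_df))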

In [None]:
print('Initial model:\n\n', 
      model_acc, 
      '\n\n', 'Best possible score for MSE is 0\n', 
      mse_acc, 
      '\n\n', 
      'Best possible score for mean accuracy is 1\n', 
      r_squared, "\n\n\n")

y_pred_reduced = knn.predict(reduced_X_df)

# Fit the encoder once so predictions and true labels share one Yes/No mapping
le.fit(y)
measure = le.transform(y_pred_reduced)
measure_test = le.transform(reduced_y_df)

mse = mean_squared_error(measure_test, measure)

# Score the reduced model with its own predictions (y_pred_reduced), not the
# initial model's y_pred
print('Reduced model:\n\n',
      'Model accuracy: {0:0.4f}\n\n'.format(accuracy_score(reduced_y_df, y_pred_reduced)),
      'Best possible score for MSE is 0\n',
      'MSE: {0:0.4f}\n\n'.format(mse),
      'Best possible score for mean accuracy is 1\n',
      'Mean accuracy (knn.score):', knn.score(reduced_X_df,reduced_y_df)
     )

In [None]:
# classification metrics for the initial and reduced models

reduced_model_report = classification_report(reduced_y_df, y_pred_reduced)

print("Initial model report:\n", initial_model_report, "\n\n\n", "Reduced model report:\n", reduced_model_report)

In [None]:
y_pred_reduced = knn.predict(reduced_X_df)
y_pred_reduced = pd.DataFrame({'Churn': y_pred_reduced})
y_pred_reduced.value_counts()

### E2.  Results and implications
    * In the results from the analysis, we found that 497 of the 2,000 customers in the test set are predicted to churn.
    * Oddly, we found that the Contacts variable had the highest importance of all the features.
    * The model actually performed better with more features than with fewer. We built an initial model and found the most important features, then removed the unimportant features, and the resulting model was less accurate and had a worse MSE.
    * The most important features were:
     'Area',
     'Age',
     'Gender',
     'Outage_sec_perweek',
     'Email',
     'Contacts',
     'Yearly_equip_failure',
     'Contract',
     'Port_modem',
     'Tablet',
     'PaperlessBilling',
     'PaymentMethod',
     'MonthlyCharge',
     'TimelyResponse',
     'RespectfulResponse'
    
### E3.  Discuss one limitation of your data analysis.
    * This technique can be time consuming because of the many settings of the model (hyperparameters, not features) that can be configured. K can be altered, which requires the model to be rerun each time, and the feature-importance algorithm required quite a bit of computational power.
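
The repeated reruns for different values of K can be automated with a cross-validated grid search. The sketch below runs on synthetic data and is an assumption about how such a search might be set up, not part of the original analysis. The candidate list and the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Odd values of K avoid tied votes for a two-class target
candidate_k = [1, 3, 5, 7, 9, 11]

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_k},
    cv=5,                 # 5-fold cross-validation for each candidate K
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the number of candidates times the number of folds, which is the computational burden this limitation describes.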

### E4.  Recommendation
    * Based on the information gained from the analysis, the top 3 features are the number of contacts the customer has had, whether the customer has a port modem, and the customer's payment method, meaning that the company should focus on these features to keep customers from churning. The model was not highly accurate, so I would recommend additional analysis with other techniques.
    
 

## Part VI: Demonstration

F.  Panopto video:
 https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=163ba291-0efe-4e87-9d4b-ae240168826b

G.  Third party code:

SciKit-Learn (2022). sklearn.neighbors.KNeighborsClassifier. Scikit Learn. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

H.  References:
Bhandari, A. (2020). AUC-ROC Curve in Machine Learning Clearly Explained. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

Thirumuruganathan, S. (2010). A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm. Wordpress. https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/