# Telco Customer Churn
**The data set includes information about:**

* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

###### Customer churn is the percentage of customers that stopped using your company's product or service during a certain time frame. You can calculate churn rate by dividing the number of customers you lost during that time period - say a quarter - by the number of customers you had at the beginning of that time period.

###### This notebook explores which customers churn and why, and whether churn is predictable enough to act on.

For example, if you start your quarter with 400 customers and end with 380, your churn rate is 5% because you lost 20 of your 400 customers.
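The arithmetic can be sketched as a small helper. The function name `churn_rate` and its parameters are illustrative and not part of this notebook; the sketch also assumes no new customers joined during the quarter.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# The quarter starts with 400 customers and ends with 380, so 20 were lost
rate = churn_rate(customers_at_start=400, customers_lost=400 - 380)
print(f"Churn rate: {rate:.1%}")  # prints "Churn rate: 5.0%"
```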

# Importing libraries and reading in the dataset

In [None]:
# Standard libraries for DataFrame, Dtypes & I/O
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option("display.precision", 5)

import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import plotly.offline as py
from plotly.subplots import make_subplots

# Profile report
from pandas_profiling import ProfileReport

# Prediction libraries - Logistic regression
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Warnings
import warnings
warnings.filterwarnings("ignore")

# Create an array with the colors
colors = ["#636DFA", "#EF563B"]

# Setting custom color palette
sns.set_palette(sns.color_palette(colors))

In [None]:
# Reading in the csv file
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
# Checking the data's columns
data.info()

# Dataset Overview

* **customerID** - Customer ID
* **Gender** - Whether the customer is a male or a female
* **SeniorCitizen** - Whether the customer is a senior citizen or not (1, 0)
* **Partner** - Whether the customer has a partner or not (Yes, No)
* **Dependents** - Whether the customer has dependents or not (Yes, No)
* **Tenure** - Number of months the customer has stayed with the company
* **PhoneService** - Whether the customer has a phone service or not (Yes, No)
* **MultipleLines** - Whether the customer has multiple lines or not (Yes, No, No phone service)
* **InternetService** - Customer’s internet service provider (DSL, Fiber optic, No)
* **OnlineSecurity** - Whether the customer has online security or not (Yes, No, No internet service)
* **OnlineBackup** - Whether the customer has online backup or not (Yes, No, No internet service)
* **DeviceProtection** - Whether the customer has device protection or not (Yes, No, No internet service)
* **TechSupport** - Whether the customer has tech support or not (Yes, No, No internet service)
* **StreamingTV** - Whether the customer has streaming TV or not (Yes, No, No internet service)
* **StreamingMovies** - Whether the customer has streaming movies or not (Yes, No, No internet service)
* **Contract** - The contract term of the customer (Month-to-month, One year, Two year)
* **PaperlessBilling** - Whether the customer has paperless billing or not (Yes, No)
* **PaymentMethod** - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* **MonthlyCharges** - The amount charged to the customer monthly
* **TotalCharges** - The total amount charged to the customer
* **Churn** - Whether the customer churned or not (Yes or No)

In [None]:
# Checking the shape of the data
print("Data shape:", data.shape)

# The raw data contains 7043 rows (customers) and 21 columns (features).

In [None]:
# Printing out the first 6 rows
display(data.head(6))

# Data cleaning and reshaping

In [None]:
# Creating missing data table
# Total - total number of missing data
# Percent - percentage of the dataset
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)

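`isnull()` reports no missing values in this dataset. Blank strings are not counted as null, though, so TotalCharges is checked again when it is converted to a numeric type below.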
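Since `isnull()` only counts real NaN/None values, a blank string in a text column goes unnoticed. A minimal sketch on a made-up frame (the values are invented for illustration) shows how coercing to numeric exposes such entries:

```python
import pandas as pd

# Toy data: the second TotalCharges entry is a blank string, not NaN
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

nulls_before = toy["TotalCharges"].isnull().sum()

# errors="coerce" turns anything that cannot be parsed into NaN
as_number = pd.to_numeric(toy["TotalCharges"], errors="coerce")
nulls_after = as_number.isnull().sum()

print("Nulls before coercion:", nulls_before)
print("Nulls after coercion:", nulls_after)
```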

In [None]:
# Checking unique values in each feature
for col in data.columns:
    unique_vals = data[col].unique()
    if len(unique_vals) < 10:
        print("Unique values for column {}: {}".format(col, unique_vals))
    else:
        if is_string_dtype(data[col]):
            print("Column {} has values string type".format(col))
        elif is_numeric_dtype(data[col]):
            print("Column {} is numerical".format(col))

In [None]:
# Renaming columns - Capital letters for better reading
data = data.rename(columns={'gender': 'Gender', 'tenure': 'Tenure'})

# Drop customerID - irrelevant for the analysis
del data['customerID']

# Convert integer type (0, 1) to categorical Yes/No
data['SeniorCitizen'] = data['SeniorCitizen'].map({1: 'Yes', 0: 'No'})

# Count of online services used and creating a new feature
# Integer - 0, 1, 2, 3, 4, 5, 6
data['Count_PlusServices'] = (data[['OnlineSecurity', 'DeviceProtection', 'StreamingMovies', 'TechSupport', 'StreamingTV', 'OnlineBackup']] == 'Yes').sum(axis=1)

# Change columns to type category
catCols = ['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
data[catCols] = data[catCols].astype('category')

# Fill N/A values in TotalCharges
# Unparseable entries become NaN, then are filled with Tenure * MonthlyCharges
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['Tenure'] * data['MonthlyCharges'])

data.head(6)

### Reshaping:
* Renaming columns which started with lowercase
* Dropping 'customerID' - we won't use it in the analysis
* Converting the 'SeniorCitizen' column into a categorical data type
* Creating a new feature - number of paid online services
* Changing Yes/No columns to a categorical data type
* Converting the 'TotalCharges' feature into a float type and filling N/A values

###### Also replacing 'No internet service' with 'No' in the online service features. This makes the data more readable and easier to convert into a categorical data type. A customer without internet service obviously does not pay for these services.

In [None]:
# Changing 'No internet service' to 'No'
# Easier to understand, simplify the feature
colsForReplacement = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for colName in colsForReplacement:
    data[colName] = data[colName].replace({'No internet service' : 'No'})
    data[colName] = data[colName].astype('category')

data.info()

In [None]:
# Using the interquartile range for outliers.
# IQR is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles.
cont_features = ["Tenure", "MonthlyCharges", "TotalCharges", "Count_PlusServices"]
dataframe_num = data[cont_features]
dataframe_num.describe()

Q1 = dataframe_num.quantile(0.25)
Q3 = dataframe_num.quantile(0.75)
IQR = Q3 - Q1
IQR
((dataframe_num < (Q1 - 1.5 * IQR)) | (dataframe_num > (Q3 + 1.5 * IQR))).any()

From the results, it seems that there are no outliers.
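As a sanity check of the rule itself, here is a sketch on an invented series with one planted extreme value. `iqr_outliers` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


toy = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is the planted outlier
mask = iqr_outliers(toy)
print(toy[mask].tolist())  # [95]
```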

In [None]:
# Generate descriptive statistics with Pandas Profiling
profile = ProfileReport(data, title='Telco Customer Churn - Pandas Profiling Report', progress_bar=False)
profile

# EDA - Exploratory Data Analysis

### Analyzing online services

In [None]:
# Correlation heatmap
plt.figure(figsize=(25, 10))

# Compute the correlation matrix
corr = data.apply(lambda x: pd.factorize(x)[0]).corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Draw the heatmap with the mask and correct aspect ratio
ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)

# Finding(s):
# Dependents and Partner are inversely correlated
# Churn and Contract are inversely correlated
# Internet Service and Online Security are inversely correlated
# Tech Support and Internet Service are inversely correlated
# Count Plus Services and Online Backup are inversely correlated

# Streaming TV and Streaming Movies are positively correlated
# Multiple Lines and Phone Service are positively correlated
# Count Plus Services and Contract are positively correlated
# Device Protection and Streaming Movies are positively correlated
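One caveat when reading the signs in the heatmap: `pd.factorize` numbers categories in order of first appearance, so the same label can receive different codes in different columns. The sketch below uses invented data, with an explicit Yes/No mapping added purely for comparison, to show how the sign of a coefficient can depend on that ordering.

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["No", "No", "Yes", "Yes"],
    "B": ["Yes", "Yes", "No", "No"],  # always the opposite of A
})

# factorize assigns codes by first appearance,
# so "No" in A and "Yes" in B both become code 0
factorized_corr = toy.apply(lambda s: pd.factorize(s)[0]).corr().loc["A", "B"]

# An explicit mapping keeps the meaning of the codes consistent across columns
explicit = toy.apply(lambda s: s.map({"Yes": 1, "No": 0}))
explicit_corr = explicit.corr().loc["A", "B"]

print("factorize:", round(factorized_corr, 2))
print("explicit mapping:", round(explicit_corr, 2))
```

The strength of each relationship is unaffected; only the direction needs to be confirmed against the raw categories.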

In [None]:
# Creating Pie chart about the churn distribution
trace = go.Pie(labels = ['Churn : no', 'Churn : yes'], values = data['Churn'].value_counts(), 
               textfont=dict(size=15), opacity = 0.8, marker=dict(colors=['blue','red'], line=dict(color='#000000', width=1.5)))

layout = dict(title = 'Distribution of churn', autosize=False, height=400, width=600, title_font=dict(size=20))        
fig = dict(data = [trace], layout=layout)
py.offline.init_notebook_mode()
py.iplot(fig)

# Finding(s):
# Roughly a quarter of the customers churned (~26%), which is a substantial share

In [None]:
# Creating Bar plot for customer count and number of online services distribution
plot_data = pd.DataFrame() 
plot_data['Count_PlusServices'] = data.Count_PlusServices.value_counts()
plot_data = plot_data.sort_index(ascending=True)
fig = px.bar(plot_data, x=plot_data.index, y=plot_data['Count_PlusServices'], labels={'index': 'Number of services', 'Count_PlusServices': 'Number of customers'})

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Number of online services', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Online services used by customers',
)

fig.show()

# Finding(s):
# ~31% of the customers do not use any of the online services
# Most of the customers use 1 to 3 services
# Only 4% use all 6 services; 12% use 5 services

In [None]:
# Creating Histogram about number of services used by customers
fig = px.histogram(data, x='Count_PlusServices', color='Churn', labels={'Churn': 'Churn', 'Count_PlusServices': 'Number of services'})

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Number of services', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Number of services used by customers',
    bargap=0.2,
    bargroupgap=0.1
)

fig.show()

# Finding(s):
# Customers who use just one online service churn the most
# Beyond one service, the proportion of churn falls as the number of online services increases
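The histogram shows counts, so proportions have to be judged by eye. A sketch of computing the churn rate per group directly, on invented data: `churn_rate_by` is a hypothetical helper and assumes `Churn` still holds 'Yes'/'No' values.

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of churned customers within each value of `column`."""
    churned = df["Churn"].eq("Yes")
    return churned.groupby(df[column]).mean()


toy = pd.DataFrame({
    "Count_PlusServices": [0, 0, 1, 1, 1, 1, 2, 2],
    "Churn": ["No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})
rates = churn_rate_by(toy, "Count_PlusServices")
print(rates)
```

Applied to the real `data` frame, the same call could be reused for any categorical column, such as `Contract` or `PaymentMethod`.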

In [None]:
# Creating Bar chart about tenure
plot_data = pd.DataFrame() 
plot_data['Tenure'] = data.Tenure.value_counts()
plot_data = plot_data.sort_index(ascending=True)
fig = px.bar(plot_data, x=plot_data.index, y=plot_data['Tenure'], labels={'index': 'Months', 'Tenure': 'Number of customers'})

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Tenure', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Distribution of customers by tenure',
)

fig.show()

In [None]:
# Preparing data to handle Plotly distribution plot
hist_data = [data['Tenure']]
group_labels = ['data']

# Creating Distribution Plot about tenure
fig = ff.create_distplot(hist_data, group_labels, curve_type='normal') # Normal Distribution

# Update yaxis properties
# create_distplot normalises the histogram to a probability density by default
fig.update_yaxes(title_text='Density')
# Update xaxis properties
fig.update_xaxes(title_text='Tenure in months')

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Distribution of customers by tenure',
)

fig.show()

# Finding(s):
# There are a lot of new customers
# There are also a lot of customers with 50 to 70 months of tenure

In [None]:
# Creating a DataFrame about services used by customers
plot_data = pd.DataFrame() 
plot_data['OnlineSecurity'] = data.OnlineSecurity.value_counts()
plot_data['DeviceProtection'] = data.DeviceProtection.value_counts()
plot_data['StreamingMovies'] = data.StreamingMovies.value_counts()
plot_data['TechSupport'] = data.TechSupport.value_counts()
plot_data['StreamingTV'] = data.StreamingTV.value_counts()
plot_data['OnlineBackup'] = data.OnlineBackup.value_counts()
   
plot_data

# Finding(s):
# Streaming Movies and Streaming TV are the 2 most popular services (~39%)
# Online Security and TechSupport are the least popular (~29%)

In [None]:
# Plotting this list of features with Churn
feature = ['OnlineSecurity', 'DeviceProtection', 'StreamingMovies', 'TechSupport', 'StreamingTV', 'OnlineBackup']

fig = plt.figure(figsize=(25,25))
plt.subplots_adjust(hspace=0.45)

for i, item in enumerate(feature, 1):
    fig.add_subplot(3,3,i)
    ax = sns.countplot(data=data, x=item, order=["No", "Yes"], hue='Churn')
    plt.title(f'{item}', fontsize=17)
    ax.set_ylabel('Number of customers', fontsize=12)
    ax.set_xlabel('Used by customers', fontsize=12)
    ax.set_ylim(0, 3750)
    
# Finding(s):
# Although Streaming Movies and Streaming TV are the 2 most popular services, their churn rate is high => quality may be a factor
# Online Security and TechSupport are the least popular, yet their churn rate is low => the services are probably just not in demand, rather than poor in quality

In [None]:
# Creating a DataFrame about other metrics
plot_data = pd.DataFrame() 
plot_data['SeniorCitizen'] = data.SeniorCitizen.value_counts()
plot_data['Partner'] = data.Partner.value_counts()
plot_data['Dependents'] = data.Dependents.value_counts()
plot_data['PhoneService'] = data.PhoneService.value_counts()
plot_data['MultipleLines'] = data.MultipleLines.value_counts()
plot_data['PaperlessBilling'] = data.PaperlessBilling.value_counts()
   
plot_data

# Finding(s):
# ~16% of the customers are Senior Citizens
# Paperless Billing is more popular than Regular Billing
# ~90% of the customers have Phone Service
# ~47% of them have Multiple Lines

In [None]:
# Plotting this list of features with Churn
feature = ['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'PaperlessBilling']

fig = plt.figure(figsize=(25,25))
plt.subplots_adjust(hspace=0.45)

for i, item in enumerate(feature, 1):
    fig.add_subplot(2, 3, i)
    ax = sns.countplot(data=data, x=item, order=["No", "Yes"], hue='Churn')
    plt.title(f'{item}', fontsize=17)
    ax.set_ylabel('Number of customers', fontsize=12)
    ax.set_ylim(0, 5000)
    
# Finding(s):
# Senior Citizens churn at a higher rate
# Other churn rates are close to the average churn rate (~26%)

In [None]:
# Creating Bar chart about average monthly charges by number of services used
agg = data.groupby('Count_PlusServices', as_index=False)[['MonthlyCharges']].mean()
agg[['MonthlyCharges']] = np.round(agg[['MonthlyCharges']], 0)
fig = px.bar(agg, x='Count_PlusServices', y='MonthlyCharges', labels={'Count_PlusServices': 'Number of services', 'MonthlyCharges': 'Average charge ($)'})

# Update yaxis properties
fig.update_yaxes(title_text='Average monthly charges ($)', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Number of online services used', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Average monthly charges by number of services',
)

fig.show()

# Finding(s):
# Customers who do not use any online service pay $33 on average
# Those with one service pay double that, $66
# As the number of services increases, the average monthly charges increase roughly linearly

In [None]:
# Creating Box plot about monthly charges and churn
fig = px.box(data, x='Churn', y = 'MonthlyCharges')

# Update yaxis properties
fig.update_yaxes(title_text='Monthly charges ($)', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Monthly charges - Churn',
)

fig.show()

# Finding(s):
# The higher the monthly charges, the higher the likelihood of Churn
# The median monthly charge is $64.45 for non-churners and $79.65 for churners
# The price may be too high
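The centre line of a box plot is the median, which can differ noticeably from the mean when charges are skewed. A sketch with invented charges shows both side by side:

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "MonthlyCharges": [20.0, 60.0, 70.0, 50.0, 80.0, 95.0],
})

# One row per churn group, with mean and median side by side
summary = toy.groupby("Churn")["MonthlyCharges"].agg(["mean", "median"])
print(summary)
```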

In [None]:
# Creating Box plot about tenure and churn
fig = px.box(data, x='Churn', y = 'Tenure')

# Update yaxis properties
fig.update_yaxes(title_text='Tenure (Months)', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Tenure - Churn',
)

fig.show()

# Finding(s):
# The shorter the tenure, the higher the likelihood of Churn (median tenure: ~38 months for non-churned customers, ~10 months for churned customers)

In [None]:
# Creating a DataFrame about other metrics
plot_data = pd.DataFrame() 
plot_data['Contract'] = data.Contract.value_counts()
   
plot_data

# Finding(s):
# Month-to-month is the most popular contract type (~55%)
# Two year and One year contracts have similar shares

In [None]:
# Creating Histogram about contract type and churn
fig = px.histogram(data, x='Churn', color='Contract')

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Churn by contract type',
)

fig.show()

# Finding(s):
# Customers on Month-to-month contracts churn far more than the others, while customers on Two year contracts churn the least

In [None]:
# Creating a DataFrame about other metrics
plot_data = pd.DataFrame() 
plot_data['InternetService'] = data.InternetService.value_counts()
   
plot_data

# Finding(s):
# Only 22% of the customers don't have internet
# 44% have Fiber optic and 34% have DSL

In [None]:
# Creating Histogram about internet service type and churn
fig = px.histogram(data, x='Churn', color='InternetService')

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Churn by internet service type',
)

fig.show()

# Finding(s):
# Customers with Fiber optic internet service churn at a much higher rate than the other groups

In [None]:
# Creating a DataFrame about other metrics
plot_data = pd.DataFrame() 
plot_data['PaymentMethod'] = data.PaymentMethod.value_counts()
   
plot_data

# Finding(s):
# Electronic check is the most common payment method
# ~44% use an automatic payment method
# Only ~23% pay by mailed check

In [None]:
# Creating Histogram about payment methods and churn
fig = px.histogram(data, x='Churn', color='PaymentMethod')

# Update yaxis properties
fig.update_yaxes(title_text='Number of customers', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='Churn by payment method',
)

fig.show()

# Finding(s):
# Customers paying by Electronic check churn in a higher proportion (almost 50%), while the other payment methods are at ~15%

**Most important information learned:**
* Streaming TV and Streaming Movies are usually used together and are the 2 most popular services, but their churn rate is high => quality may be a factor
* Online Security and TechSupport are the least popular, but their churn rate is low => the services are probably just not in demand
* Customers who use just one online service churn the most
* Senior Citizens churn at a higher rate
* ~31% of the customers do not use any of the online services
* Only 4% use all 6 services; 12% use 5 services
* The higher the monthly charges, the higher the likelihood of Churn
* The median monthly charge is about $64.45 for non-churners and about $79.65 for churners => the price may be too high
* The shorter the tenure, the higher the likelihood of Churn (median tenure: ~38 months for non-churned customers, ~10 months for churned customers)

# Predicting with Logistic Regression

###### Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables. For example, a logistic regression could be used to predict whether a political candidate will win or lose an election or whether a high school student will be admitted to a particular college.
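At its core, the model passes a weighted sum of the features through the logistic (sigmoid) function to obtain a probability between 0 and 1. A minimal sketch in plain Python: the weights and feature values are hypothetical, chosen only for illustration, and `predict_probability` is not a scikit-learn function.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features, purely for illustration
p = predict_probability(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)  # 0.5, because the weighted sum is exactly 0
```

A probability above a chosen threshold (0.5 by default in scikit-learn's `predict`) is classified as the positive class, here churn.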

In [None]:
# MonthlyCharges is continuous, so it is left out of label encoding
non_numeric_features = ['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
                       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 
                       'PaperlessBilling', 'PaymentMethod', 'Churn']

for feature in non_numeric_features:     
    # Encode each categorical column as integers between 0 and n_classes-1
    data[feature] = LabelEncoder().fit_transform(data[feature])
    
data.info()

In [None]:
cat_features = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 
                'PaymentMethod']
encoded_features = []

for feature in cat_features:
    # Encode categorical features as a one-hot numeric array
    encoded_feat = OneHotEncoder().fit_transform(data[feature].values.reshape(-1, 1)).toarray()
    n_categories = data[feature].nunique()
    cols = ['{}_{}'.format(feature, i) for i in range(1, n_categories + 1)]
    encoded_df = pd.DataFrame(encoded_feat, columns=cols)
    encoded_df.index = data.index
    encoded_features.append(encoded_df)
data = pd.concat([data, *encoded_features], axis=1)
    
print('Number of encoded features:', len(encoded_features))

# Drop the original columns that were replaced by one-hot encoded variables above
data2 = data.copy()
drop_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']
data.drop(columns=drop_cols, inplace=True)
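For comparison, pandas can produce the same kind of one-hot columns in a single call. This is a sketch on invented data using `pd.get_dummies`, not the method used above; the generated column names carry the original category labels rather than numeric suffixes.

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [70.0, 55.5, 20.0, 60.0],
})

# One indicator column per category; other columns pass through untouched
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```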

In [None]:
x = data.drop(columns=['Churn']).values
y = data["Churn"].values

# Splitting the data
# 75% train
# 25% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,  stratify=y, random_state=22)
print('X_train shape: {}'.format(x_train.shape))
print('X_test shape: {}'.format(x_test.shape))

In [None]:
%%time

# Provides train/test indices to split data in train/test sets.
skf = StratifiedKFold(n_splits=4)
val_auc_scores = []

for train_index, valid_index in skf.split(x_train, y_train):
    x_pseudo_train, x_pseudo_valid = x_train[train_index], x_train[valid_index]
    y_pseudo_train, y_pseudo_valid = y_train[train_index], y_train[valid_index]
    # Standardize features by removing the mean and scaling to unit variance
    ss = StandardScaler()
    # Fit to data, then transform it.
    x_pseudo_train_scaled = ss.fit_transform(x_pseudo_train)
    # Perform standardization by centering and scaling
    x_pseudo_valid_scaled = ss.transform(x_pseudo_valid)
    # Logistic Regression
    lr = LogisticRegression() # Using default parameters
    # Fit the model according to the given training data
    lr.fit(x_pseudo_train_scaled, y_pseudo_train)
    # Predict probability estimates and keep the positive class (churn) column
    y_pred_valid_probs = lr.predict_proba(x_pseudo_valid_scaled)[:, 1]
    # Compute Receiver operating characteristic (ROC)
    val_fpr, val_tpr, val_thresholds = roc_curve(y_pseudo_valid, y_pred_valid_probs)
    # Compute Area Under the Curve (AUC) using the trapezoidal rule
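    val_auc_score = auc(val_fpr, val_tpr)
    val_auc_scores.append(val_auc_score)

# Report the cross-validated AUC scores
print('Validation AUC per fold:', np.round(val_auc_scores, 4))
print('Mean validation AUC: {:.4f}'.format(np.mean(val_auc_scores)))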
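The same scale-then-fit loop can also be written more compactly with a scikit-learn pipeline. The sketch below is an alternative formulation, not the code used above, and runs on synthetic data from `make_classification` as a stand-in for `x_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=400, n_features=8, random_state=22)

# The pipeline refits the scaler inside every fold, so validation rows
# never influence the scaling statistics
model = make_pipeline(StandardScaler(), LogisticRegression())
skf = StratifiedKFold(n_splits=4)

scores = cross_val_score(model, X, y, cv=skf, scoring="roc_auc")
print("AUC per fold:", scores.round(3))
print("Mean AUC:", round(scores.mean(), 3))
```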

In [None]:
%%time

# Standardize features by removing the mean and scaling to unit variance
ss = StandardScaler()
# Fit to data, then transform it.
x_train_scaled = ss.fit_transform(x_train)
# Perform standardization by centering and scaling
x_test_scaled = ss.transform(x_test)

# Applying logistic regression classifier
lr = LogisticRegression()        # Using default parameters
lr.fit(x_train_scaled, y_train)  # Training the model with X_train, y_train

# Generate Confusion Matrix
y_pred = lr.predict(x_test_scaled)
y_pred = pd.Series(y_pred)
y_test = pd.Series(y_test)
pd.crosstab(y_pred, y_test, rownames=['Predicted'], colnames=['True'], margins=True)

In [None]:
# Checking overall accuracy
print("Overall Accuracy: {:.2%}".format((y_pred == y_test).mean()))
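Because churners are the minority class, accuracy alone can look reasonable even when many churners are missed. A sketch of deriving precision and recall for the churn class from confusion-matrix counts: the counts are invented for illustration and are not results from this notebook, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall for the positive (churn) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented counts: 50 churners caught, 25 false alarms, 50 churners missed
precision, recall = precision_recall(tp=50, fp=25, fn=50)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With the crosstab above, `tp` would be the cell where both Predicted and True equal 1, `fp` the cell with Predicted 1 and True 0, and `fn` the cell with Predicted 0 and True 1.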