# A Hotel Problem

You work at a hospitality company with many corporate customers who book hotel rooms for their employees. The company has noticed quite a bit of churn among these clients. Account managers are currently assigned more or less at random, but the company wants you to build a machine learning model that predicts which customers will churn (stop buying rooms), so that the customers most at risk can be assigned an account manager.

Luckily, the company has some historical data. Can you help them out? Create a classification algorithm that classifies whether or not a customer churned. The company can then apply it to incoming data for future customers to predict which ones will churn and assign them an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Name: Name of the latest contact at Company
    Age: Customer Age
    Total_Purchase: Total Purchased
    Account_Manager: Binary 0 = No manager, 1 = Account manager assigned
    Years: Total Years as a customer
    Num_sites: Number of employees that use the service
    Onboard_date: Date that the latest contact was onboarded
    Location: Client HQ Address (Masked)
    Company: Name of Client Company (Masked)
    
Once you've created and evaluated the model, test it on some new data that your hospitality company has provided, saved as new_customers.csv. Which new customers are most likely to churn given this data?

In [None]:
#Import your libraries
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline

df = pd.read_csv("../../data/customer_churn.csv")

In [None]:
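#Check out your data using at least 4 visualizations
#numeric_only=True (pandas >= 1.5) skips the non-numeric text columns, which corr() cannot handle
sns.heatmap(df.corr(numeric_only=True))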

In [None]:
sns.jointplot(x = "Num_Sites", y = "Years", data = df)

In [None]:
df.plot()

In [None]:
sns.regplot(x = "Num_Sites", y = "Churn", data = df, y_jitter=.03)

## Train Test Split

In [None]:
df.columns

In [None]:
from sklearn.model_selection import train_test_split
X = df[["Age", "Total_Purchase", "Years", "Num_Sites"]]
y = df.Churn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 42)
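When churners are a minority of the rows, a purely random split can leave the train and test sets with noticeably different churn rates. One option is the `stratify` argument of `train_test_split`. The sketch below uses a small synthetic frame rather than customer_churn.csv; the values and the names `demo`, `Xd` and `yd` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (hypothetical values, not the real CSV)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Years": rng.uniform(1, 10, size=200),
    "Num_Sites": rng.integers(3, 15, size=200),
})
# Roughly 1 in 5 customers churn in this made-up sample
demo["Churn"] = (rng.random(200) < 0.2).astype(int)

Xd = demo[["Years", "Num_Sites"]]
yd = demo["Churn"]

# stratify=yd keeps the churn rate (almost) identical in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.3, random_state=42, stratify=yd
)

print("overall churn rate:", yd.mean())
print("train churn rate:  ", yd_train.mean())
print("test churn rate:   ", yd_test.mean())
```

Whether stratifying changes the results on the real data depends on how imbalanced `Churn` is there, so treat this as something to try rather than a required step.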

## Create the Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
LogReg = LogisticRegression()

In [None]:
LogReg.fit(X_train, y_train)

In [None]:
y_pred = LogReg.predict(X_test)
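Features such as `Total_Purchase` and `Years` can sit on very different scales, which may slow the solver or trigger a convergence warning. A common remedy is to standardize the features inside a pipeline so the same scaling is reused at prediction time. This is a minimal sketch on synthetic data; the feature values, the churn rule and the name `scaled_logreg` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
# Two made-up features on very different scales
purchase = rng.normal(10000, 2500, size=n)
years = rng.uniform(1, 10, size=n)
X_demo = np.column_stack([purchase, years])
# Made-up rule: longer-tenured customers churn more often, plus noise
y_demo = (years + rng.normal(0, 2, size=n) > 6).astype(int)

# The scaler is fit on the training data and reapplied automatically in predict()
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_demo, y_demo)

print("training accuracy:", scaled_logreg.score(X_demo, y_demo))
```

Because the scaler lives inside the pipeline, any new data passed to `scaled_logreg.predict` gets the same transformation as the training data.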

## Evaluate the results

In [None]:
#Use the classification report to explain how good your results are
#What is F1, and when would it be the best measure?
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
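F1 is the harmonic mean of precision and recall, so it is only high when both are high. That makes it a useful summary when the classes are imbalanced and both missed churners and false alarms matter. The sketch below recomputes it by hand from a confusion matrix on a tiny made-up example (the labels are invented, not taken from the churn data).

```python
from sklearn.metrics import confusion_matrix, f1_score

# Tiny made-up example: 1 = churned, 0 = stayed
y_true_f1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_f1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_f1, y_pred_f1).ravel()

precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print("sklearn f1_score:", f1_score(y_true_f1, y_pred_f1))
```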

In [None]:
#Explain what AUC is, and create a score and visualization
#import the metrics for the AUC
from sklearn.metrics import roc_auc_score #Area under the ROC curve
from sklearn.metrics import roc_curve #Receiver operating characteristic curve

#Score with the predicted churn probabilities, not the hard 0/1 labels,
#so the area in the legend matches the plotted curve
y_proba = LogReg.predict_proba(X_test)[:,1]
log_roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, thresholds = roc_curve(y_test, y_proba)


#let's plot it!
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % log_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

#make it pretty!
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

In [None]:
print(log_roc_auc)
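One way to read AUC: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below checks that interpretation against `roc_auc_score` on a handful of made-up labels and scores (these are illustrative numbers, not model output).

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted churn probabilities
y_true_auc = [0, 0, 1, 1, 0, 1]
scores_auc = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for s, t in zip(scores_auc, y_true_auc) if t == 1]
neg = [s for s, t in zip(scores_auc, y_true_auc) if t == 0]

# Fraction of (churner, non-churner) pairs ranked correctly; ties count half
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print("pairwise estimate:", pairwise_auc)
print("roc_auc_score:    ", roc_auc_score(y_true_auc, scores_auc))
```

An AUC of 0.5 corresponds to random ranking (the dashed diagonal in the plot), and 1.0 to a perfect ranking.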

[Common question - what is a good AUC value?](https://stats.stackexchange.com/questions/113326/what-is-a-good-auc-for-a-precision-recall-curve)

## Predict on unlabeled data
Which of the new customers might be at risk of churning? Use new_customers.csv.

In [None]:
#Import the new dataframe as new_customer
new_customer = pd.read_csv("../../data/new_customers.csv", index_col = None)
new_customer.reset_index(inplace = True)
new_customer.columns = ["Names", "Age", "Total_Purchase", "Account_Manager", "Years", "Num_sites", 
                     "Onboard_date", "Location", "Company", "Ditch"]
new_customer.head()

In [None]:
#Be sure to apply any transformations from above to new_customers


In [None]:
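#apply the "predict" to create a new feature in new_customers
#The model was fit on a column named "Num_Sites", so rename "Num_sites" to match
new_features = new_customer[["Age", "Total_Purchase", "Years", "Num_sites"]].rename(columns = {"Num_sites": "Num_Sites"})
new_customer["Risk"] = LogReg.predict(new_features)
new_customer.head()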
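A hard 0/1 prediction says who crosses the default 0.5 threshold, but not who is *most* likely to churn. Sorting by the predicted probability gives a ranking instead. This is a sketch on synthetic data: the training rule, the company names and the column name `Churn_Probability` are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Made-up training data: churn becomes more likely as Num_Sites grows
train = pd.DataFrame({"Num_Sites": rng.integers(3, 15, size=300)})
train["Churn"] = (train["Num_Sites"] + rng.normal(0, 2, size=300) > 10).astype(int)

model = LogisticRegression().fit(train[["Num_Sites"]], train["Churn"])

# Hypothetical new customers (names are invented)
incoming = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "Num_Sites": [6, 13, 9],
})
# Column 1 of predict_proba is the probability of class 1 (churn)
incoming["Churn_Probability"] = model.predict_proba(incoming[["Num_Sites"]])[:, 1]

# Highest-risk customers first
ranked = incoming.sort_values("Churn_Probability", ascending=False)
print(ranked)
```

With a ranking like this, account managers could be assigned from the top of the list down, however many are available.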

In [None]:
#Which companies should we give account managers to?
new_customer.loc[new_customer["Risk"] == 1, ["Names", "Company", "Location"]]