# Exploring the Data: Part II 

Today, we will compare two quantities and build some basic models for making inferences from the relationships between them. We have the following goals for today's session:

- [Check in on EDA Routines and Strategies](#Customer-Churn)
- Discuss Linear Regression and Examine its use in Python
- Search for relationships between variables in our data
- Investigate metrics for the quality of fit of Linear Regression Models
- Use Multi-Variate Regression to explore data
- Investigate transformations to Linearize certain distributions
- NumPy, LOESS and smoothing

### Customer Churn

Let's revisit the customer churn dataset and how we might use an exploratory routine to investigate customer churn from our earlier notebook.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
#load the data
telecom = pd.read_csv('data/telecom_churn.csv')

In [None]:
telecom.info()

In [None]:
telecom['Churn'].value_counts()

In [None]:
telecom['International plan'].value_counts()

Note that Churn is presently a Boolean type. We can convert it to an integer type, after which the mean of the column tells us what fraction of our customers have churned.

In [None]:
telecom['Churn'] = telecom['Churn'].astype('int64')

In [None]:
telecom['Churn'].mean()
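
The cell above works because the mean of a 0/1 column is the share of 1s. A minimal sketch on toy data (the five rows below are made up for illustration and are not taken from the telecom file):

```python
import pandas as pd

# Toy data (hypothetical): 2 of 5 customers churned
toy = pd.DataFrame({'Churn': [True, False, False, True, False]})
toy['Churn'] = toy['Churn'].astype('int64')

# Mean of a 0/1 column equals the fraction of rows equal to 1
churn_rate = toy['Churn'].mean()

# The same share, read from the normalized counts
churn_share = toy['Churn'].value_counts(normalize=True)

print(churn_rate)
print(churn_share)
```

Both approaches should agree; `value_counts(normalize=True)` is handy when a column has more than two categories.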

If we wanted to understand something about the differences between these groups -- churned or not -- we might look at the average use of all services in each.  We do this with a boolean index, then find the mean of the results.

In [None]:
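# numeric_only=True skips text columns, which recent pandas versions refuse to average
telecom[telecom['Churn'] == 1].mean(numeric_only=True)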

In [None]:
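# numeric_only=True skips text columns, which recent pandas versions refuse to average
telecom[telecom['Churn'] == 0].mean(numeric_only=True)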
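
As an alternative to filtering twice, the two group means can be computed side by side with a single `groupby`. This is a sketch on toy data; the values are invented, and only the column names mirror the telecom dataset:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Churn': [1, 0, 0, 1, 0],
    'Total day minutes': [250.0, 180.0, 160.0, 270.0, 170.0],
    'Customer service calls': [4, 1, 2, 5, 0],
    'State': ['KS', 'OH', 'NJ', 'OH', 'OK'],
})

# One row per churn group; numeric_only=True drops the text column 'State'
group_means = toy.groupby('Churn').mean(numeric_only=True)
print(group_means)
```

Having both groups in one table makes it easier to compare a feature across churned and retained customers at a glance.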

In Pandas, the `.describe()` method summarizes only numeric columns by default. To see summary statistics for non-numeric columns, such as those of type `object` or `bool`, we name the types of interest in the `include` parameter.

In [None]:
telecom.describe(include = ['object'])

In [None]:
telecom[telecom['Churn'] == 1].describe(include = ['object'])

We can also use the `.groupby()` method to investigate the different distributions within churn categories.

In [None]:
telecom.groupby(['Churn'])['Total day minutes'].describe()

In [None]:
telecom.groupby(['Churn'])['Total day minutes'].describe(percentiles = [])

We saw the `.map()` method last class, and we can use it again here to convert the Yes/No values of a plan column to Booleans. Below we apply it to Voice mail plan; the same mapping works for International plan.

In [None]:
plan = {'No': False, 'Yes': True}

In [None]:
telecom['Voice mail plan'] = telecom['Voice mail plan'].map(plan)

In [None]:
telecom['Voice mail plan'].head()
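
One thing to watch with `.map()` and a dictionary: any value that is missing from the dictionary becomes `NaN`. A small sketch, where the lowercase `'yes'` is a deliberately mistyped, made-up entry:

```python
import pandas as pd

plan = {'No': False, 'Yes': True}

# Toy column (hypothetical); the last entry does not match any dictionary key
toy = pd.Series(['Yes', 'No', 'No', 'yes'])
mapped = toy.map(plan)

# Count the values the dictionary did not cover
n_unmapped = mapped.isna().sum()

print(mapped)
print(n_unmapped)
```

Checking `isna().sum()` after a mapping is a cheap way to confirm that every category was covered.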

Note that the mapping has also changed the column's data type. Finally, we can use the `pd.crosstab()` function to tabulate churn against International plan and see how the two are related.

In [None]:
pd.crosstab(telecom['Churn'], telecom['International plan'])
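
Raw counts can be hard to compare when the groups differ in size. One option is the `normalize` argument of `pd.crosstab`, which turns counts into proportions. A sketch on toy data (values invented for illustration):

```python
import pandas as pd

# Toy data (hypothetical): 3 customers with the plan, 5 without
toy = pd.DataFrame({
    'Churn': [1, 0, 0, 1, 0, 0, 1, 0],
    'International plan': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No'],
})

# normalize='columns' makes each plan column sum to 1,
# so the Churn == 1 row is the churn rate within each plan group
rates = pd.crosstab(toy['Churn'], toy['International plan'],
                    normalize='columns')
print(rates)
```

Here the row for `Churn == 1` can be read directly as the churn rate among customers with and without the plan.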

In [None]:
sns.countplot(x = 'International plan', hue = 'Churn', data = telecom)

In [None]:
pd.crosstab(telecom['Churn'], telecom['Customer service calls'])

In [None]:
sns.countplot(x = 'Customer service calls', hue = 'Churn', data = telecom)

In [None]:
plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
plt.hist(telecom['Total day minutes'])
plt.title('Total Day Minutes')

plt.subplot(1, 2, 2)
plt.hist(telecom['Total intl calls'], density = True)
plt.title('Total International Calls')
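
The two histograms above use different vertical scales: the first shows raw counts, while `density = True` rescales the bars so that their total area is 1. A sketch of that difference using `np.histogram` on simulated values (the normal sample below is an assumption for illustration, not the telecom data):

```python
import numpy as np

# Simulated values (hypothetical stand-in for a numeric column)
rng = np.random.default_rng(0)
values = rng.normal(loc=180, scale=50, size=1000)

# Raw counts per bin
counts, edges = np.histogram(values, bins=10)

# Density per bin: count / (n * bin width)
density, _ = np.histogram(values, bins=10, density=True)

# Bar heights times bin widths add up to 1
area = np.sum(density * np.diff(edges))
print(counts.sum(), area)
```

Density scaling is useful when comparing distributions with different sample sizes, since the bar heights no longer depend on the number of rows.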

In [None]:
telecom[['Total day minutes', 'Total intl calls']].plot(kind='density', subplots=True, 
                  layout=(1, 2), sharex=False, figsize=(12, 4))