**This notebook explores data visually using popular libraries such as pandas, matplotlib and seaborn. It is remarkable how much insight can be gained from seemingly simple charts created with readily available visualization tools.**

First, we import the libraries and set up the environment.

In [None]:
import numpy as np
import pandas as pd

# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')

# Matplotlib forms basis for visualization in Python
import matplotlib.pyplot as plt

# We will use the Seaborn library
import seaborn as sns
sns.set()

# Graphics in retina format are sharper and more legible
%config InlineBackend.figure_format = 'retina'

Creating the dataframe.

In [None]:
data=pd.read_csv('../input/edadata/telecom_churn.csv')

Let us have a look at the data.

In [None]:
data.head()

Churn, a binary feature, is the target: it indicates whether the company has lost the client or not.

In [None]:
data.shape

The dataset consists of 3333 rows and 20 columns in all.

In [None]:
data.info()

There are 16 numeric columns, 3 categorical columns and 1 boolean column. There are no null values.

# Univariate Visualization: visualizing one feature at a time

**For numeric variables.**

In [None]:
features = ['Total day minutes', 'Total intl calls']

In [None]:
data[features].describe()

To view these statistics visually, we can use a box plot.


In [None]:
plt.rcParams['figure.figsize']=(10,7)
# Pass the column as x=; recent seaborn versions no longer accept it positionally
sns.boxplot(x='Total day minutes', data=data);

The typical user (the median line in the box) lies at around 175-180 day minutes. The maximum usage is about 350 minutes, with high outliers starting from roughly 325-330 minutes. At the low end, the plot shows outliers below about 40 minutes.
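The values a box plot draws can also be computed directly. The following is a minimal sketch, assuming a synthetic stand-in series (`minutes` is hypothetical, not the real column) and seaborn's default whisker rule of 1.5 times the interquartile range:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['Total day minutes']; NOT the real dataset.
rng = np.random.default_rng(0)
minutes = pd.Series(np.clip(rng.normal(loc=180, scale=54, size=1000), 0, None),
                    name='Total day minutes')

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = minutes.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default setting, whiskers reach at most 1.5 * IQR beyond the box;
# points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = minutes[(minutes < lower_fence) | (minutes > upper_fence)]

print(f"median={median:.1f}, box=[{q1:.1f}, {q3:.1f}]")
print(f"fences=[{lower_fence:.1f}, {upper_fence:.1f}], outliers={len(outliers)}")
```

Running the same lines on data['Total day minutes'] would give the exact numbers behind the plot above.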

In [None]:
plt.rcParams['figure.figsize']=(10,7)
sns.boxplot(x='Total intl calls', data=data);

Outliers start beyond 10 international calls and run up to the maximum of 20. A typical customer makes about 4-4.5 international calls.

We could also use violin plots, which add a smoothed density estimate around the box. Here, however, the box plot already conveys the relevant information, so a violin plot would be redundant.

In [None]:
plt.rcParams['figure.figsize']=(10,7)
data[features].hist();

Total day minutes is distributed almost normally. Total intl calls, however, is skewed to the right: its tail is longer on the right-hand side.
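The visual impression of skew can be backed up with a number. This is a rough sketch on synthetic columns (both columns of `toy` are hypothetical stand-ins, not the real features), using the pandas `skew()` method: values near 0 indicate a symmetric shape, positive values a longer right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: one roughly normal, one count-like with a right tail.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'symmetric': rng.normal(180, 54, size=2000),
    'right_skewed': rng.poisson(4.5, size=2000),
})

# Sample skewness per column: about 0 for symmetric data, > 0 for a right tail.
skewness = toy.skew()
print(skewness.round(2))
```

On the real data, the equivalent call would be data[features].skew().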

To see a smoothed version of these distributions, we can use density plots.

In [None]:
data[features].plot(kind='density', subplots=True, layout=(1, 2), 
                  sharex=False, figsize=(10, 7));

With these kernel density plots the bins of the histogram are gone: the plots can be read as a smoothed version of the histogram that does not depend on the bin size.

In [None]:
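sns.histplot(data['Total intl calls'], kde=True, stat='density');

With stat='density' the histogram is normalized, so the bar heights show density rather than counts, and kde=True overlays the kernel density estimate. Older seaborn versions produced the same view with distplot(), which is now deprecated.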

**For categorical variables.**

In [None]:
data['Churn'].value_counts()

We can see the distribution of churned and loyal clients.
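Raw counts are easier to compare once they are turned into shares. A small sketch, assuming a hypothetical toy target (`churn` below is made up; the real column is data['Churn']):

```python
import pandas as pd

# Hypothetical toy target with 17 loyal and 3 churned clients.
churn = pd.Series([False] * 17 + [True] * 3, name='Churn')

counts = churn.value_counts()                 # absolute frequencies
shares = churn.value_counts(normalize=True)   # fractions that sum to 1

# For a boolean column, the mean equals the share of True values.
churn_rate = churn.mean()

print(pd.DataFrame({'count': counts, 'share': shares}))
print(f"churn rate: {churn_rate:.2f}")
```

The same two calls on data['Churn'] show how imbalanced the target is, which matters when judging a model's accuracy later.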

In [None]:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

sns.countplot(x='Churn', data=data, ax=axes[0]);
sns.countplot(x='Customer service calls', data=data, ax=axes[1]);

These distributions can be visualized with countplot(). For customer service calls, the most common value is 1 call, and the counts become very small by around 7 calls.

# Multivariate Visualization: comparing two or more variables at a time

**Numeric-numeric**

We can see the correlation between every pair of numerical variables using a correlation matrix. For that, we first need to remove the non-numeric variables.

In [None]:
num = list(set(data.columns) - 
                 set(['State', 'International plan', 'Voice mail plan', 
                      'Area code', 'Churn', 'Customer service calls']))

# Calculate and plot
corr_matrix = data[num].corr()
sns.heatmap(corr_matrix);

Some variables are dependent: Total day charge, Total eve charge, Total night charge and Total intl charge can each be computed from the corresponding minutes. So we will get rid of them.
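Such redundant pairs can also be found programmatically instead of by eye. The sketch below uses a hypothetical toy frame in which the charge is a flat per-minute rate (the 0.17 rate and the 0.99 threshold are assumptions chosen for illustration) and lists the column pairs whose correlation is close to 1:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: charge is minutes times an assumed flat rate.
rng = np.random.default_rng(2)
minutes = rng.uniform(0, 350, size=500)
toy = pd.DataFrame({
    'Total day minutes': minutes,
    'Total day charge': (0.17 * minutes).round(2),  # assumed rate, illustration only
})

corr = toy.corr()
r = corr.loc['Total day minutes', 'Total day charge']

# Collect each pair once (a < b) whose correlation exceeds the chosen threshold.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.99]

print(f"correlation: {r:.4f}")
print(redundant)
```

Applied to corr_matrix from the cell above, the same comprehension would list the minutes/charge pairs.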

In [None]:
num=list(set(num)-set(['Total day charge','Total night charge','Total eve charge','Total intl charge']))

Now we look at the relationship between two numeric variables using a scatter plot.

In [None]:
plt.scatter(data['Total day minutes'],data['Customer service calls']);

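The plot suggests that as the number of customer service calls increases, the range of day minutes narrows. One possible reading is that frequent callers are irritated and use the service less, which could lead to higher churn rates, although this plot alone does not show churn.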

In [None]:
sns.jointplot(x='Total day minutes', y='Customer service calls', 
              data=data, kind='scatter');

This is the same scatter plot drawn with seaborn's jointplot(), which also shows the distribution of each variable along the margins.

In [None]:
sns.jointplot(x='Total day minutes', y='Customer service calls', data=data,
              kind="kde", color="g");

This is a smoothed version of the joint plot. The densities also suggest that customers are patient enough to contact customer service up to about 4 times.

**Categorical-categorical**

In [None]:
sns.countplot(x='Voice mail plan', hue='Churn', data=data);
plt.title('Loyal & Churned with the Voice Mail Plan')

To compare two non-numeric features, we can start with a count plot.
Here, we can see that the voice mail plan does not look like a bad plan as far as churn is concerned: few of its subscribers have churned.

In [None]:
pd.crosstab(data['Voice mail plan'], data['Churn'],normalize=True)

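One of the best ways to compare two categorical variables is the crosstab, which is the table behind the count plot above.
Because normalize=True divides by the total number of clients, the contingency table shows that roughly 25% of all clients have the voice mail plan, and only about 2% of all clients both have the plan and have churned.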
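The `normalize` argument of `pd.crosstab` changes what the percentages mean. As a small sketch on a hypothetical toy frame of 100 clients (all counts below are made up), `normalize=True` gives shares of all clients, while `normalize='index'` gives the churn rate within each plan group:

```python
import pandas as pd

# Hypothetical toy data: 75 clients without the plan (15 churned),
# 25 clients with the plan (2 churned).
toy = pd.DataFrame({
    'Voice mail plan': ['No'] * 75 + ['Yes'] * 25,
    'Churn': [False] * 60 + [True] * 15 + [False] * 23 + [True] * 2,
})

# Shares of ALL clients: the four cells sum to 1.
joint = pd.crosstab(toy['Voice mail plan'], toy['Churn'], normalize=True)

# Shares WITHIN each plan group: every row sums to 1.
by_plan = pd.crosstab(toy['Voice mail plan'], toy['Churn'], normalize='index')

print(joint)    # Yes/True cell: 2 / 100 = 0.02
print(by_plan)  # Yes/True cell: 2 / 25 = 0.08, No/True cell: 15 / 75 = 0.20
```

The row-normalized table is usually the one to read when the question is whether subscribers of a plan churn more or less often than non-subscribers.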

**Numeric-categorical**

In [None]:
sns.lmplot(x='Total day minutes', y='Total night minutes', data=data, hue='Churn', fit_reg=False);

We can use lmplot to bring a categorical variable into a numeric comparison: it is essentially a scatter plot with the category (here Churn) added as a color dimension.
Here, we can see that churned customers concentrate at high total day minutes, whereas total night minutes does not separate the two groups as clearly.

In [None]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

sns.boxplot(x='Churn', y='Total day minutes', data=data, ax=axes[0]);
sns.violinplot(x='Churn', y='Total day minutes', data=data, ax=axes[1]);

We can observe from both plots that customers who talk more on the phone during the daytime tend to be disloyal.

In [None]:
sns.catplot(x='Churn', y='Total day minutes', col='Customer service calls',
               data=data[data['Customer service calls'] < 8], kind="box",
               col_wrap=6, height=5, aspect=.8);

catplot provides an interesting feature: it splits the box plots by a third variable. We can now see that total day minutes is not solely responsible for customers churning.
Rather, beyond 4 customer service calls, it appears that customers' problems are still not being resolved after repeated calls, which causes them to leave.
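The relationship between service calls and churn can also be summarized as a table. A minimal sketch on a hypothetical toy frame (the rows are made up and only illustrate the mechanics), grouping by the number of calls and averaging the boolean target:

```python
import pandas as pd

# Hypothetical toy data; the real columns live in `data`.
toy = pd.DataFrame({
    'Customer service calls': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4, 4, 5, 5, 6, 6],
    'Churn': [False, False, False, True, False, False, False, False,
              False, True, True, False, True, True, True, True],
})

# The mean of a boolean column is the share of True values, i.e. the churn rate.
rate = (toy.groupby('Customer service calls')['Churn']
           .agg(['mean', 'size'])
           .rename(columns={'mean': 'churn_rate', 'size': 'customers'}))

print(rate)
```

Keeping the group size next to the rate helps avoid over-reading groups that contain only a handful of customers.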

# Whole Dataset Visualization

**The first example is the correlation matrix itself.**

Another example can be the pairplot.

In [None]:
%config InlineBackend.figure_format = 'png'
sns.pairplot(data[num]);

In many cases, a good amount of information can be drawn from a pair plot. Here, there are no surprises as such.

**t-distributed Stochastic Neighbor Embedding (t-SNE)**: find a projection of a high-dimensional feature space onto a plane (or into 3D space, but it is almost always 2D) such that points that were far apart in the initial n-dimensional space end up far apart on the plane, while points that were originally close remain close to each other.

In [None]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

We have to drop the target and the object-typed State column, and map the Yes/No plan columns to 1/0 with the pandas Series.map method.

In [None]:
ok=data.drop(['Churn','State'],axis=1)
ok['International plan']=ok['International plan'].map({'Yes':1,'No':0})
ok['Voice mail plan']=ok['Voice mail plan'].map({'Yes':1,'No':0})

We also need to normalize the data. For this, we will subtract the mean from each variable and divide it by its standard deviation. All of this can be done with StandardScaler.
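The standardization itself is just $z = (x - \mu) / \sigma$ applied column by column. As a sketch on a small hypothetical array (`X` is made up), this is the calculation written out with NumPy; StandardScaler uses the population standard deviation, which corresponds to NumPy's default `ddof=0`:

```python
import numpy as np

# Hypothetical feature matrix: 4 rows, 2 columns on very different scales.
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [300.0, 5.0],
              [400.0, 7.0]])

mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population standard deviation (ddof=0)
Z = (X - mean) / std    # standardized features

# After scaling, every column has mean 0 and standard deviation 1.
print(Z.mean(axis=0), Z.std(axis=0))
```

Putting all features on the same scale matters for t-SNE because it works with distances between points, so an unscaled column with large values would dominate.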

In [None]:
scaler = StandardScaler()
scaled = scaler.fit_transform(ok)

Next, we create a TSNE object and fit it to the scaled data to obtain the 2D representation.

In [None]:
tsne=TSNE(random_state=18)
tsne_repr = tsne.fit_transform(scaled)

In [None]:
plt.scatter(tsne_repr[:, 0], tsne_repr[:, 1], alpha=.5);

In [None]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(14, 8))

for i, name in enumerate(['International plan', 'Voice mail plan']):
    axes[i].scatter(tsne_repr[:, 0], tsne_repr[:, 1], 
                    c=data[name].map({'Yes': 'orange', 'No': 'blue'}), alpha=.5);
    axes[i].set_title(name);

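The panels color the points by plan membership (orange for Yes, blue for No). Customers dissatisfied with the International plan appear to have churned to a noticeable extent, although that is not the case with the voice mail plan.
Coloring the same embedding by Churn would be needed to confirm this reading.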