# Churn Descriptive Analysis
Descriptive analysis using the [Churn Modelling](https://www.kaggle.com/shubh0799/churn-modelling/) data set.

Includes:
- Exploring and plotting continuous variables
- Exploring and plotting categorical variables
- Converting categorical variables to numeric indicators in preparation for machine learning
- Writing a cleaned CSV file

## Imports
Includes standard code provided by Kaggle when setting up a notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Read and Review the Data
We'll read the data into a data frame, then review basic info about it.

In [None]:
# Read the data into a pandas dataframe
data = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv', index_col = 'RowNumber')

# View the first 5 records
data.head()

In [None]:
data.info()

**NOTE: _No missing data!_** That's an issue we don't have to worry about with this data set.
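The non-null counts from `data.info()` can also be checked explicitly by counting nulls per column. A minimal sketch, using a small made-up frame (`toy`, not the real data) with one value deliberately missing so the check has something to find:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn data, with one missing CreditScore
toy = pd.DataFrame({
    'CreditScore': [619, 608, np.nan, 699],
    'Age': [42, 41, 42, 39],
    'Exited': [1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no column contains a missing value
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Run against the real `data` frame, every count should be zero if the `info()` reading above is right.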

## Explore Target Variable: Exited
- Exited = 0 --> did not leave our company
- Exited = 1 --> left the company

In [None]:
# Of these 10,000, how many exited?
data['Exited'].value_counts()

### Interpretation
- Approx. 20% of our customers exited
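The share can be read off directly rather than worked out from raw counts. A small sketch, assuming a made-up `Exited` column (`toy_exited`) in which 2 of 10 customers left:

```python
import pandas as pd

# Hypothetical target column: 8 customers remained, 2 exited
toy_exited = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name='Exited')

# normalize=True returns each class's share of all rows instead of raw counts
exit_share = toy_exited.value_counts(normalize=True)
print(exit_share)
```

The same call on `data['Exited']` would give the proportions behind the approximate 20% figure.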

## Explore Continuous Features
We'll explore these numeric features, which we treat as continuous (Tenure and NumOfProducts are really small integer counts):
- CreditScore
- Age
- Tenure
- Balance
- NumOfProducts
- EstimatedSalary

Keep Exited as the target variable for analysis.

In [None]:
# Use data_continuous as a slice of the data frame containing only the continuous variables
data_continuous = data[['Exited','CreditScore','Age','Tenure','Balance','NumOfProducts','EstimatedSalary']]

data_continuous.describe()

In [None]:
# Group by Exited and view the mean for each continuous variable
data_continuous.groupby('Exited').mean()

**Observation on above:** The group means show no dramatic differences between customers who exited and those who remained. The widest gaps are in Age and Balance, followed by CreditScore and NumOfProducts. On average, customers who exited are older and hold a larger balance, with slightly fewer products and a slightly lower CreditScore.
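Raw mean gaps are hard to compare because the variables sit on very different scales (Age in years, Balance in currency units). One option, which is not part of the original analysis, is to divide each gap by that variable's overall standard deviation. A sketch on made-up numbers (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: three customers who remained (0) and three who exited (1)
toy = pd.DataFrame({
    'Exited':  [0, 0, 0, 1, 1, 1],
    'Age':     [30, 35, 40, 45, 50, 55],
    'Balance': [0.0, 50000.0, 100000.0, 60000.0, 110000.0, 160000.0],
})

# Mean of each variable within each Exited group
group_means = toy.groupby('Exited').mean()

# Gap between the exited group (1) and the remained group (0)
mean_gap = group_means.loc[1] - group_means.loc[0]

# Express each gap in units of that variable's overall standard deviation
standardized_gap = mean_gap / toy.drop(columns='Exited').std()
print(standardized_gap)
```

On the standardized scale, gaps in Age and Balance can be ranked against each other directly.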

## Histograms for Continuous Variables

On seaborn distplot (deprecated since seaborn 0.11 in favor of histplot and displot) see: https://seaborn.pydata.org/generated/seaborn.distplot.html

To select colors, see:
https://python-graph-gallery.com/100-calling-a-color-with-seaborn/

In [None]:
# Generate overlaid histograms for continuous variables
# Remained = cornflowerblue
# Exited = orangered
for feature in ['CreditScore','Age', 'Tenure','Balance','NumOfProducts','EstimatedSalary']:
    remained = data.loc[data['Exited'] == 0, feature].dropna()
    exited = data.loc[data['Exited'] == 1, feature].dropna()
    xmin = min(remained.min(), exited.min())
    xmax = max(remained.max(), exited.max())
    # 41 shared edges give 40 equal-width bins; unlike np.arange, linspace includes xmax
    bins = np.linspace(xmin, xmax, 41)
    # histplot replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(remained, color="cornflowerblue", bins=bins, alpha=0.5, label='Remained')
    sns.histplot(exited, color="orangered", bins=bins, alpha=0.5, label='Exited')
    plt.legend()
    plt.title(f'Overlaid histogram for {feature}')
    plt.show()

**Interpretation:** As the group means above suggested, there are no huge discrepancies between customers who remained and those who exited.

## Explore Categorical Features
- Geography
- Gender
- HasCrCard
- IsActiveMember
- Target: Exited

In [None]:
data.head()

In [None]:
data_categorical = data[['Geography','Gender','HasCrCard','IsActiveMember','Exited']]
data_categorical.head()

In [None]:
data_categorical.info()

## Plot Categorical Features

### Point plots to indicate proportions exited per category

In [None]:
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
for feature in ['Geography','Gender','HasCrCard','IsActiveMember']:
    sns.catplot(x=feature, y='Exited', data=data, kind='point', aspect=2)

### Bar plots of the same proportions

In [None]:
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
for feature in ['Geography','Gender','HasCrCard','IsActiveMember']:
    sns.catplot(x=feature, y='Exited', data=data, kind='bar', palette='pastel', aspect=2)

### Interpretation
- **Geography:** German customers are much more likely to exit.
- **Gender:** Women are more likely to exit than men.
- **IsActiveMember:** Inactive members are more likely to exit.
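Because `Exited` is coded 0/1, its mean within a category is the proportion of that category who exited, which is the quantity the point and bar plots display. The numbers behind the plots can be tabulated with a group-by. A sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: two customers per country
toy = pd.DataFrame({
    'Geography': ['France', 'France', 'Germany', 'Germany', 'Spain', 'Spain'],
    'Exited':    [0, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group = proportion who exited in that group
exit_rate = toy.groupby('Geography')['Exited'].mean()
print(exit_rate)
```

Applied to `data`, the same call for each of the four features would put exact rates next to the visual impressions above.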

## Pivot Tables

Use pivot tables of customer counts to explore relationships between the categorical variables.

In [None]:
data.pivot_table('Exited', index='Gender', columns='Geography', aggfunc='count')

In [None]:
data.pivot_table('Exited', index='Gender', columns='IsActiveMember', aggfunc='count')

In [None]:
data.pivot_table('Exited', index='Geography', columns='IsActiveMember', aggfunc='count')

### Interpretation
- There are no strong relationships between these categorical variables.
- Thus, each appears to have its own degree of influence on whether a customer exits.
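The tables above use `aggfunc='count'`, so each cell is a number of customers rather than an exit rate. Switching to `aggfunc='mean'` on the 0/1 `Exited` column gives the proportion who exited in each cell, which is one way to look for combined effects. A sketch on made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: eight customers across gender and activity status
toy = pd.DataFrame({
    'Gender':         ['Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'IsActiveMember': [0, 1, 0, 1, 0, 1, 1, 0],
    'Exited':         [1, 0, 0, 0, 1, 0, 1, 1],
})

# Cell = number of customers in that Gender x IsActiveMember combination
counts = toy.pivot_table(values='Exited', index='Gender',
                         columns='IsActiveMember', aggfunc='count')

# Cell = proportion of those customers who exited
rates = toy.pivot_table(values='Exited', index='Gender',
                        columns='IsActiveMember', aggfunc='mean')

print(counts)
print(rates)
```

Reading the two tables together shows both how many customers fall in a cell and how often they leave.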

## Data Cleaning

In [None]:
data.head()

### Drop unneeded variables
- CustomerId
- Surname

In [None]:
data.drop(['CustomerId','Surname'], axis=1, inplace=True)
data.head()

### Convert Gender to 0 or 1
To prepare for machine learning:
- Male = 0
- Female = 1

In [None]:
gender_ind = {'Male': 0, 'Female': 1}

data['Gender'] = data['Gender'].map(gender_ind)
data.head()
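One thing worth checking after a dictionary `map`: any value missing from the dictionary becomes `NaN` without raising an error. A short sketch, assuming a made-up series (`toy_gender`) with one inconsistently capitalized entry:

```python
import pandas as pd

gender_ind = {'Male': 0, 'Female': 1}

# Hypothetical column with one label that is not a key in the dictionary
toy_gender = pd.Series(['Male', 'Female', 'female'])

# Values not found in the dictionary are mapped to NaN
mapped = toy_gender.map(gender_ind)

# Count rows that failed to map; zero means every label was covered
unmapped_count = int(mapped.isna().sum())
print(unmapped_count)
```

A count of zero on the real column confirms that the mapping did not introduce missing values.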

### Convert Geography to numeric
- France = 0
- Spain = 1
- Germany = 2

In [None]:
geo_ind = {'France': 0, 'Spain': 1, 'Germany': 2}

data['Geography'] = data['Geography'].map(geo_ind)
data.head()

In [None]:
data['Geography'].value_counts()
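Coding the countries as 0, 1, 2 gives them an order and spacing that some models may treat as meaningful. A common alternative, shown here only as a sketch and not as what this notebook does, is one indicator column per country via `pd.get_dummies`. The frame `toy` is made up for illustration:

```python
import pandas as pd

# Hypothetical data with the original string labels
toy = pd.DataFrame({
    'Age': [42, 41, 39, 43],
    'Geography': ['France', 'Spain', 'Germany', 'France'],
})

# One 0/1 column per country; column names are prefix + '_' + label
geo_dummies = pd.get_dummies(toy['Geography'], prefix='Geography', dtype=int)

# Swap the single label column for the indicator columns
toy_encoded = pd.concat([toy.drop(columns='Geography'), geo_dummies], axis=1)
print(toy_encoded)
```

Whether the integer coding or indicator columns is the better fit depends on the model used downstream.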

In [None]:
data.columns

## Write to CSV

Will be saved to the **/kaggle/working/** directory (the notebook's working directory on Kaggle)

File named: **churn_cleaned.csv**

In [None]:
data.to_csv('churn_cleaned.csv')
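Since the data frame is indexed by `RowNumber`, `to_csv` writes that index as the first column, and a later read needs `index_col` to restore it. A round-trip sketch, using a made-up frame (`toy`) and a temporary file with a hypothetical name rather than the real output:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data with the same kind of named index
toy = pd.DataFrame(
    {'Gender': [0, 1], 'Exited': [1, 0]},
    index=pd.Index([1, 2], name='RowNumber'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'churn_cleaned_example.csv')
    # The index is written as the first column, headed 'RowNumber'
    toy.to_csv(path)
    # index_col restores that column as the index when reading back
    restored = pd.read_csv(path, index_col='RowNumber')

print(restored)
```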

### View all file(s)

In [None]:
import os 
for dirname, _, filenames in os.walk('/kaggle/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))