## Exploratory Data Analysis Notebook

### Importing Libraries

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport
import lux
import numpy as np

### Loading Dataset

In [3]:
# Use the reader that matches the type of data source.

# To read the data from a CSV file instead, use:
# df = pd.read_csv(dataset_path)

dataset_path = '../dataset/Bank_Personal_Loan_Modelling.xlsx'
# sheet_name=1 selects the second sheet (zero-based); the first column becomes the index
df = pd.read_excel(dataset_path, sheet_name=1, index_col=0)

### Basic Data Exploration

In [4]:
# good for getting a feel of the data
df.head()

In [5]:
# can help in spotting the presence of null values
# also can be used to see the column types
df.info()

<class 'lux.core.frame.LuxDataFrame'>
Int64Index: 5000 entries, 1 to 5000
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIP Code            5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal Loan       5000 non-null   int64  
 9   Securities Account  5000 non-null   int64  
 10  CD Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 546.9 KB
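
The `df.info()` output above reports non-null counts per column. As a complementary check, the missing values can be counted directly. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the loan dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loan dataset (values are made up).
toy = pd.DataFrame({
    "Age": [25, 45, None, 39],
    "Income": [49, 34, 11, 100],
    "CCAvg": [1.6, 1.5, None, None],
})

# Count missing values per column; a column without nulls reports 0.
null_counts = toy.isna().sum()
print(null_counts)

# Names of the columns that contain at least one missing value.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

On the real `df`, the same `isna().sum()` call should report 0 for every column if the dataset has no missing values.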


In [6]:
# describe method is helpful in seeing the distribution of numerical columns 
df.describe()

| | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 45.3384 | 20.1046 | 73.7742 | 93152.503 | 2.3964 | 1.937913 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 11.463166 | 11.467954 | 46.033729 | 2121.852197 | 1.147663 | 1.747666 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
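
The summary above lists a minimum Experience of -3, which is not a plausible number of years and may be worth inspecting before modelling. A minimal sketch of how such rows could be located, using a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the real dataset would be filtered the same way.
toy = pd.DataFrame({
    "Age": [23, 24, 45, 60],
    "Experience": [-3, -1, 20, 35],
})

# Select the rows whose Experience is negative.
negative_rows = toy[toy["Experience"] < 0]
n_negative = len(negative_rows)
print(n_negative)
print(negative_rows)
```

How to treat such rows (dropping, clipping or taking the absolute value) is a judgment call that this sketch does not make.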


### Discovering insights using automated data exploration
We will use pandas profiling and LUX for fast data exploration.
1. Pandas profiling is a library built on top of pandas. With a single line of code it produces a basic exploration report, including the distribution of each attribute and the correlations between attributes, along with a useful overview of the dataset.
2. LUX automates visual discovery. It uses an intent-based language, so we can specify an intent to steer the data exploration process.

#### 1. Pandas Profiling

In [7]:
profile = ProfileReport(df, title="Pandas Profiling Report")
profile


Pandas profiling provides a good overview of the dataset.
We now know that for this dataset:
1. There are no missing values.
2. There are 7 numerical, 5 boolean and 2 categorical attributes.
3. Age and Experience are highly correlated.
4. Personal Loan shows some correlation with Income, CCAvg and CD Account.
5. CCAvg and Income are highly correlated.
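
The correlation findings above can be cross-checked outside the report with a plain correlation matrix. The sketch below assumes Pearson correlation (the report may use other measures) and runs on a hypothetical toy frame whose values are made up.

```python
import pandas as pd

# Hypothetical toy frame: Experience tracks Age closely, Income does not.
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "Experience": [1, 10, 20, 30, 40],
    "Income": [40, 120, 60, 30, 90],
})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = toy.corr()
age_exp = corr.loc["Age", "Experience"]
age_income = corr.loc["Age", "Income"]
print(corr.round(3))
```

On the real `df`, `df.corr()["Personal Loan"].sort_values()` would rank the attributes by their linear association with the target.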

#### 2. Data exploration using LUX
* LUX lets you take the exploration further by specifying an intent: you state which attributes you want to explore and which other attributes to compare them with. You can also specify attribute values to compare.

First, let's look at the basic LUX recommendations, which serve as our initial analysis step.

In [8]:
df

**Observations:**
1. Correlation
    * Experience and Age are highly correlated.
    * CCAvg and Mortgage are correlated with Income, i.e. a person with a higher income tends to have a larger mortgage amount and is more likely to have a CD Account.
2. Distribution
    * Mortgage, CCAvg and Income are right-skewed, whereas Age and Experience are more or less uniformly distributed.
3. Occurrence
    * Only ~10% of customers have taken a personal loan.
    * About 10% of customers have a Securities Account and about 6% have a CD Account.
    * About 30% of customers use a credit card.
    * A majority of customers (about 60%) are online.
    * The basic education level is the most common among customers.
    * A slightly higher number of customers have a family size of 1, i.e. fewer dependents.
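
The distribution and occurrence observations above can also be expressed numerically. The following is a rough sketch on a hypothetical toy frame (values are made up): a positive sample skewness indicates a right-skewed column, and normalized value counts give the share of each class.

```python
import pandas as pd

# Hypothetical toy frame: most mortgages are 0 with a few large values.
toy = pd.DataFrame({
    "Mortgage": [0, 0, 0, 0, 0, 0, 100, 150, 300, 600],
    "Personal Loan": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Sample skewness; values above 0 indicate a longer right tail.
mortgage_skew = toy["Mortgage"].skew()

# Share of each class, as fractions that sum to 1.
loan_share = toy["Personal Loan"].value_counts(normalize=True)

print(mortgage_skew)
print(loan_share)
```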

### Let's explore the following attributes, which may be the most significant in identifying a relationship with the target attribute.
1. CD Account
2. Securities Account
3. CCAvg
4. Income
5. Family and Education
6. Age or Experience
7. Mortgage
8. Online or Credit Card

CD Account

In [9]:
df.intent = ['CD Account']
df

**Observations**
* Customers with a CD Account have a higher mean mortgage amount, a higher CCAvg and a higher income.
* A significantly larger share of customers with a CD Account also have a Securities Account.
* Almost all customers with a CD Account are online. This suggests that a customer needs to be online to use a CD Account.
* Customers who opted for a Personal Loan are significantly more likely to have a CD Account, so this can be an important factor in predicting the target.
* Customers with a CD Account mostly have a CreditCard as well.
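
Comparisons like the ones above can be reproduced without the LUX widget by grouping on the flag and by cross-tabulating it against the target. This is a minimal sketch on a hypothetical toy frame (values are made up), not the exact charts that LUX renders.

```python
import pandas as pd

# Hypothetical toy frame with a binary flag, the target and one numeric column.
toy = pd.DataFrame({
    "CD Account":    [1, 1, 1, 0, 0, 0, 0, 0],
    "Personal Loan": [1, 1, 0, 0, 0, 0, 0, 1],
    "Income":        [150, 120, 90, 40, 60, 50, 70, 130],
})

# Mean income for customers without (0) and with (1) a CD Account.
mean_income = toy.groupby("CD Account")["Income"].mean()

# Row-normalized crosstab: share of loan takers within each CD Account group.
loan_rate = pd.crosstab(toy["CD Account"], toy["Personal Loan"], normalize="index")

print(mean_income)
print(loan_rate)
```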

In [10]:
df.intent = ['Securities Account']
df

**Observations** - Nothing significant.

In [11]:
df.intent = ['CCAvg']
df

**Observations**
* CCAvg increases as Income increases.
* Customers who opted for a Personal Loan have a higher mean CCAvg.
* Customers with the basic Education level tend to have a higher mean CCAvg.

In [14]:
df.intent = ['Income']
df

**Observations**
* Income is highly correlated with CCAvg, as noted previously.
* Customers who opted for a personal loan have a higher mean income, i.e. customers with a higher income are more likely to take a loan.
* Customers with a higher income also have a larger mortgage amount.
* Customers with a CD Account have a higher mean income, i.e. customers with a higher income are more likely to have a CD Account.
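
One way to make the income and loan relationship above concrete is to bin Income and compute the loan rate per bin. The sketch below runs on a hypothetical toy frame. The bin edges and the `Income band` column name are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "Income":        [20, 30, 40, 60, 80, 100, 150, 200],
    "Personal Loan": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Bin Income into three bands; intervals are right-closed: (0, 50], (50, 100], (100, 250].
toy["Income band"] = pd.cut(
    toy["Income"], bins=[0, 50, 100, 250], labels=["low", "mid", "high"]
)

# The mean of a 0/1 target is the share of loan takers in each band.
rate = toy.groupby("Income band", observed=True)["Personal Loan"].mean()
print(rate)
```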

In [15]:
df.intent = ['Personal Loan', ['Family', 'Education']]
df

**Observations**
* There is a slightly higher chance of taking a personal loan when the family is larger or the Education level is higher.
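
The observation above can be quantified as a loan rate per category. A minimal sketch, assuming a 0/1 target and using a hypothetical toy frame (values are made up):

```python
import pandas as pd

# Hypothetical toy frame with two categorical attributes and the target.
toy = pd.DataFrame({
    "Family":        [1, 1, 2, 2, 3, 3, 4, 4],
    "Education":     [1, 1, 1, 2, 2, 3, 3, 3],
    "Personal Loan": [0, 0, 0, 0, 1, 0, 1, 1],
})

# Share of loan takers for each family size and each education level.
rate_by_family = toy.groupby("Family")["Personal Loan"].mean()
rate_by_education = toy.groupby("Education")["Personal Loan"].mean()

print(rate_by_family)
print(rate_by_education)
```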

In [16]:
df.intent = ['Personal Loan', 'Age']
df

**Observations** - Nothing significant

In [17]:
df.intent = ['Personal Loan', 'Experience']
df

**Observations** - Nothing significant

In [18]:
df.intent = ['Personal Loan', 'Mortgage']
df

**Observations**:
* Customers with a high mortgage amount tend to take a personal loan.

In [19]:
df.intent = ['Personal Loan', 'Online']
df

**Observations** - nothing significant

In [20]:
df.intent = ['Personal Loan', 'CreditCard']
df

**Observations** - nothing significant

#### Let's create a new variable 'active', which is true when a customer uses either net banking (Online) or a credit card (CreditCard).
We will then look at its correlation with Personal Loan.

In [21]:
# 1 if the customer uses online banking or a credit card, else 0
df['active'] = ((df['Online'] == 1) | (df['CreditCard'] == 1)).astype(int)
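
As a rough illustration of how a combined flag such as 'active' relates to the target, the sketch below derives the flag on a hypothetical toy frame (values are made up) and compares the loan rate across the flag and its two source columns.

```python
import pandas as pd

# Hypothetical toy frame with the two source flags and the target.
toy = pd.DataFrame({
    "Online":        [1, 0, 1, 0, 0, 1],
    "CreditCard":    [0, 0, 1, 1, 0, 0],
    "Personal Loan": [0, 0, 1, 0, 1, 0],
})

# Vectorized logical OR of the two flags, stored as 0/1.
toy["active"] = ((toy["Online"] == 1) | (toy["CreditCard"] == 1)).astype(int)

# Loan rate for each value of the combined flag and of each source flag.
rates = {
    col: toy.groupby(col)["Personal Loan"].mean()
    for col in ["active", "Online", "CreditCard"]
}
for col, rate in rates.items():
    print(col, rate.to_dict())
```

If the rates for 0 and 1 are close on the real data, the flag carries little information about the target.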

In [22]:
from lux.vis.VisList import VisList
vc = VisList(["Personal Loan",['active', 'Online', 'CreditCard']],df)
vc

In [23]:
# 1 if the customer holds a Securities Account or a CD Account, else 0
df['Account'] = ((df['Securities Account'] == 1) | (df['CD Account'] == 1)).astype(int)
vc = VisList(["Personal Loan", ['Account', 'Securities Account', 'CD Account']], df)
vc
