# Introduction

This notebook presents the outcomes of the DATA@ANZ virtual experience program. Here is the link to the official website of the program: https://www.insidesherpa.com/show-firm-programs/AKkAyEwWc8wjPxx9n/ANZ

As stated in the program description, "*Data@ANZ is about mining and linking datasets to develop stories that matter and challenge the status quo, to deliver on ANZ’s purpose 'to shape a world where people and communities thrive'*". This program includes two tasks:
* Exploratory Data Analysis: Segment the dataset and draw unique insights, including visualisation of the transaction volume and assessing the effect of any outliers.
* Predictive Analytics: Explore correlations between customer attributes, build a regression and a decision-tree prediction model based on your findings.

Note that this is a “synthesised transaction dataset containing 3 months’ worth of transactions for 100 hypothetical customers”.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn import linear_model

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Exploratory Data Analysis

First, let's take a look at the data. There are 12043 records with 23 features each.

In [None]:
df = pd.read_csv ('../input/anz-synthesised-transaction-dataset/anz.csv')
df.head(5)

In [None]:
# check numerical data statistics
df.describe()

In [None]:
# check data types
df.dtypes

According to the missing-value ratios below, some features, such as card_present_flag, have a high proportion of missing values. Luckily, none of those features are needed in this project.

In [None]:
# check missing value ratio
df.isnull().sum() / len(df)

Here we check the total number of customers to verify the dataset. It turns out we do have 100 customers.

In [None]:
# check the number of customers, assuming each customer has one unique customer_id
print("number of customers: ", df.customer_id.nunique())

Another interesting thing to notice is that one day in the three-month period has no records. It turns out that there are no transactions on 2018-08-16, which might be due to system maintenance at ANZ.

In [None]:
print("first day of data: ", df.date.iloc[0])
print("last day of data: ", df.date.iloc[-1])
print("duration: ", 92)
print("recorded days: ", len(df.date.unique()))
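The check above hard-codes the duration and only counts the recorded days. As a minimal sketch of how the missing day itself could be located, the example below builds the full calendar range and subtracts the recorded dates. The toy data frame and its dates are hypothetical stand-ins for `df.date`, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data: a 5-day window in which one day has no transactions.
toy = pd.DataFrame(
    {"date": ["2018-08-01", "2018-08-02", "2018-08-02", "2018-08-04", "2018-08-05"]}
)

# Parse the strings so min/max follow calendar order rather than row order.
dates = pd.to_datetime(toy["date"])

# Every calendar day between the first and last recorded date, inclusive.
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Days in the full range that never appear in the data.
missing_days = full_range.difference(pd.DatetimeIndex(dates.unique()))

print("duration:", len(full_range))
print("recorded days:", dates.nunique())
print("missing days:", [d.strftime("%Y-%m-%d") for d in missing_days])
```

Deriving the duration from the data also avoids relying on the rows being sorted by date.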

# Model Analysis

## Data Preparation

In this section, we extract some key features from the raw data, along with some calculated features, for use in the decision tree and regression models. Then we build the correlation matrix to explore the relationships between the features.

In [None]:
# list all unique customer ids
customer_list = df.customer_id.unique()
# create an empty dataframe to hold the per-customer features
df_cus_info = pd.DataFrame(columns = ["customer_id", "annual salary", "age", "avg_transaction_amount", "transaction_number", "max_transaction_amount", "avg_balance", "gender", "state"])

for index, id in enumerate(customer_list):
    # extract payment information of this customer 
    df_cus = df[(df.customer_id == id) & (df.txn_description == 'PAY/SALARY')]
    # calculate annual salary
    pay_period = pd.to_datetime(df_cus.date).diff().mean().total_seconds() / 60 / 60 / 24
    pay_amount = df_cus.amount.mean()
    daily_pay = pay_amount / pay_period
    yearly_pay = 365 * daily_pay
    # get age
    age = df_cus.age.mean()
    # get average balance
    balance = df_cus.balance.mean()
    # get gender
    gender = df_cus["gender"].mode()[0]
    # store all the payment related info in the dataframe
    df_cus_info.loc[index, ["customer_id", "annual salary", "age", "avg_balance", "gender"]] = [id, yearly_pay, age, balance, gender]
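The annual salary estimate in the loop above is the mean salary payment divided by the mean gap between payments, scaled to 365 days. The sketch below works through that arithmetic for a single customer. The payment dates and amounts are hypothetical and chosen only to make the numbers easy to follow.

```python
import pandas as pd

# Hypothetical salary payments for one customer, paid every 7 days.
pay = pd.DataFrame(
    {
        "date": ["2018-08-01", "2018-08-08", "2018-08-15", "2018-08-22"],
        "amount": [1000.0, 1000.0, 1000.0, 1000.0],
    }
)

# Mean gap between consecutive payments, expressed in days.
gaps = pd.to_datetime(pay["date"]).sort_values().diff()
pay_period_days = gaps.mean() / pd.Timedelta(days=1)

# Scale the mean payment to a daily rate, then to a 365-day year.
annual_salary = 365 * pay["amount"].mean() / pay_period_days

print("pay period (days):", pay_period_days)
print("estimated annual salary:", round(annual_salary, 2))
```

Dividing by `pd.Timedelta(days=1)` is an alternative to converting seconds to days by hand; both give the same pay period.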

In [None]:
# iterate through each customer
for index, id in enumerate(customer_list):
    # extract all info of this customer
    df_cus = df[df.customer_id == id]
    # assume mode of transaction merchant state is the state of this customer                     
    state = df_cus["merchant_state"].mode()[0]
    # calculate average transaction amount of this customer
    avg_transaction_amount = df_cus["amount"].mean()
    # count the transactions of this customer over the recorded period
    transaction_number = df_cus["transaction_id"].count()
    # find the largest transaction amount over the recorded period
    max_transaction_amount = df_cus["amount"].max()
    # put all calculated results above in the data frame
    df_cus_info.loc[index, ["state", "avg_transaction_amount", "transaction_number", "max_transaction_amount"]] = [state, avg_transaction_amount, transaction_number, max_transaction_amount]
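The per-customer statistics computed in the loop above could also be expressed as a single `groupby` with named aggregations. This is a sketch of an alternative, not the approach used in this notebook. The toy data frame is hypothetical and reuses only the column names from the dataset.

```python
import pandas as pd

# Hypothetical toy transactions using the same column names as the dataset.
toy = pd.DataFrame(
    {
        "customer_id": ["A", "A", "A", "B", "B"],
        "transaction_id": ["t1", "t2", "t3", "t4", "t5"],
        "amount": [10.0, 30.0, 20.0, 5.0, 15.0],
        "merchant_state": ["NSW", "NSW", "VIC", "QLD", None],
    }
)

# One row per customer, one named aggregation per feature.
features = toy.groupby("customer_id").agg(
    avg_transaction_amount=("amount", "mean"),
    transaction_number=("transaction_id", "count"),
    max_transaction_amount=("amount", "max"),
    # Most frequent merchant state; mode() skips missing values by default.
    state=("merchant_state", lambda s: s.mode().iloc[0]),
)

print(features)
```

The numeric columns come out as numeric dtypes directly, so no later type conversion would be needed for them.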

In [None]:
# transform the data type
df_cus_info["annual salary"] = df_cus_info["annual salary"].astype(float)
df_cus_info["age"] = df_cus_info["age"].astype(float)
df_cus_info["avg_transaction_amount"] = df_cus_info["avg_transaction_amount"].astype(float)
df_cus_info["transaction_number"] = df_cus_info["transaction_number"].astype(float)
df_cus_info["max_transaction_amount"] = df_cus_info["max_transaction_amount"].astype(float)
df_cus_info["avg_balance"] = df_cus_info["avg_balance"].astype(float)
df_cus_info.dtypes

According to the correlation matrix below, there is a strong correlation between annual salary and both max transaction amount and average transaction amount. This makes sense, since people with higher incomes tend to make larger transactions.

In [None]:
# calculate correlation matrix
corrMatrix = df_cus_info.loc[:, ["annual salary", "age", "avg_balance", "avg_transaction_amount", "transaction_number", "max_transaction_amount"]].astype('float64').corr(method='pearson', min_periods=1)
# show the correlation matrix as an annotated heatmap
sn.heatmap(corrMatrix, annot=True)
plt.show()

## Decision Tree

In this section we build a decision tree model to fit the data, with the estimated annual salary as the label: an annual salary above 60K is marked as high (1), and anything else as low (0). It can be observed that avg_transaction_amount and transaction_number are the most effective features.

In [None]:
X = df_cus_info.drop(labels=["customer_id", "annual salary"], axis=1)
X_OHE = pd.get_dummies(X, columns=["state", "gender"])
Y = df_cus_info["annual salary"].apply(lambda x: 1 if x >  60000 else 0)
clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(X_OHE, Y)
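# label the split nodes with the feature names so the tree is readable
tree.plot_tree(clf, feature_names=list(X_OHE.columns))
plt.show()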
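One way to check which features a fitted tree relies on, besides reading the plot, is the `feature_importances_` attribute of the classifier. The sketch below shows the idea on hypothetical toy data in which the label depends only on `avg_transaction_amount`. The names `X_toy`, `y_toy` and `clf_toy` are illustrative and do not come from the notebook.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy features; the label depends only on avg_transaction_amount.
X_toy = pd.DataFrame(
    {
        "avg_transaction_amount": [20, 25, 30, 80, 90, 100],
        "transaction_number": [50, 60, 55, 52, 58, 61],
    }
)
y_toy = [0, 0, 0, 1, 1, 1]

clf_toy = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Pair each impurity-based importance with its column name, largest first.
importances = pd.Series(
    clf_toy.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(importances)
```

Applied to the model above, the same pattern would use `clf.feature_importances_` with `X_OHE.columns` as the index.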

## Linear Regression

In this section we build the linear regression model of the annual salary.

In [None]:
# select the numerical features and the target for the regression
X_rgr = df_cus_info[["age", "avg_transaction_amount", "transaction_number", "max_transaction_amount", "avg_balance"]]
Y_rgr = df_cus_info["annual salary"]

In [None]:
# build linear regression model
rgr = linear_model.LinearRegression()
# fit model
rgr.fit(X_rgr, Y_rgr)
# coefficient of determination R^2
print(rgr.score(X_rgr, Y_rgr))

From the regression coefficient plot we can see that age is negatively associated with annual salary, while transaction amount and transaction number are positively associated with it. This is consistent with the correlation matrix, and their impact is relatively high. Furthermore, average balance has little impact on the prediction in our regression model.

In [None]:
# plot regression coefficients
names = ["age", "avg_transaction_amount", "transaction_number", "max_transaction_amount", "avg_balance"]
values = rgr.coef_
plt.figure(figsize=(16,8))
plt.bar(names, values)
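Raw regression coefficients depend on the units of each feature, so a feature measured on a large scale (such as a balance) tends to get a small coefficient even when it matters. The sketch below illustrates this with hypothetical toy data and `StandardScaler`. It is an illustration of the scaling effect under those assumptions, not a re-analysis of the model above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
rng = np.random.default_rng(0)
small_scale = rng.normal(0, 1, 200)      # e.g. a count-like feature
large_scale = rng.normal(0, 1000, 200)   # e.g. a balance-like feature
X_toy = np.column_stack([small_scale, large_scale])

# Both features contribute about equally once their scale is accounted for.
y_toy = 5.0 * small_scale + 0.005 * large_scale

# Fit once on the raw features and once on standardized features.
raw = linear_model.LinearRegression().fit(X_toy, y_toy)
scaled = linear_model.LinearRegression().fit(
    StandardScaler().fit_transform(X_toy), y_toy
)

print("raw coefficients:", raw.coef_)
print("standardized coefficients:", scaled.coef_)
```

On the raw features the second coefficient looks negligible, while on the standardized features the two coefficients are of similar size.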

Considering that regularization might further improve our coefficient of determination, we try elastic net regression, which combines the L1 and L2 penalties. However, it turns out the final results are almost the same, even when we adjust the ratio between the L1 and L2 penalties.

In [None]:
# build elastic net regression model
rgr_elastic = linear_model.ElasticNet(random_state=0, l1_ratio=0.5)
# fit model
rgr_elastic.fit(X_rgr, Y_rgr)
# coefficient of determination R^2
print(rgr_elastic.score(X_rgr, Y_rgr))
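Adjusting the ratio between the two penalties can be sketched as a loop over several `l1_ratio` values. The toy data below is hypothetical and stands in for `X_rgr` and `Y_rgr`, and the three ratios are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy regression data standing in for X_rgr and Y_rgr.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=100)

# Fit one elastic net per mixing ratio and record the training R^2.
scores = {}
for l1_ratio in [0.1, 0.5, 0.9]:
    model = linear_model.ElasticNet(alpha=1.0, l1_ratio=l1_ratio, random_state=0)
    model.fit(X_toy, y_toy)
    scores[l1_ratio] = model.score(X_toy, y_toy)

for l1_ratio, r2 in scores.items():
    print(f"l1_ratio={l1_ratio}: R^2={r2:.3f}")
```

Here `alpha=1.0` is the scikit-learn default and controls the overall penalty strength, while `l1_ratio` only sets the mix between the L1 and L2 terms.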

In [None]:
# plot regression coefficients
names = ["age", "avg_transaction_amount", "transaction_number", "max_transaction_amount", "avg_balance"]
values = rgr_elastic.coef_
plt.figure(figsize=(16,8))
plt.bar(names, values)

# Visualization

This section includes some visualizations from the EDA.

## Histogram of Transaction Quantity 

In [None]:
# plot
plt.figure(figsize=(8, 5))
plt.hist(df_cus_info.transaction_number)
plt.xlabel('Transaction Quantity')
plt.ylabel("Frequency")
plt.title('Histogram of Transaction Quantity')
plt.show()

## Histogram of Transaction Amount

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(df_cus_info.avg_transaction_amount)
plt.xlabel('Transaction Amount')
plt.ylabel("Frequency")
plt.title('Histogram of Transaction Amount')
plt.show()

## Histogram of Annual Salary

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(df_cus_info["annual salary"])
plt.xlabel('Annual Salary')
plt.ylabel("Frequency")
plt.title('Histogram of Annual Salary')
plt.show()