# Project: ANZ synthesised transaction Dataset
## by Imene KOLLI

<a id='intro'></a>
## Introduction

> This task is based on a synthesised transaction dataset containing 3 months’ worth of transactions for 100 hypothetical customers. It contains purchases, recurring transactions, and salary transactions.
The dataset is designed to simulate realistic transaction behaviours that are observed in ANZ’s real transaction data, so many of the insights you can gather from the tasks below will be genuine.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('../input/anz-synthesised-transaction-dataset/anz.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

### 1. Data cleaning

In [None]:
# checking the unique values in currency
df.currency.value_counts()

The currency column shows no variation across transactions, so it adds no information and should be dropped.

In [None]:
df.drop(['currency'], axis = 1 , inplace = True)
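
Columns that hold a single value can also be found programmatically instead of inspecting each `value_counts()` output by hand. The sketch below is one possible way to do it; `constant_columns` is a hypothetical helper name and the toy frame is made-up data, not the ANZ file.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns with at most one distinct non-null value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "currency": ["AUD", "AUD", "AUD"],
    "amount": [12.5, 40.0, 7.2],
})
print(constant_columns(toy))  # ['currency']
```

On the real data the same call would be `constant_columns(df)`, and the returned names could be passed to `df.drop(columns=...)`.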

In [None]:
# let's check if we have NaNs in our dataset
df.isnull().sum()

In [None]:
# because there are so many NaNs in these two columns, I am going to drop them
df.drop(['bpay_biller_code', 'merchant_code'], axis = 1 , inplace = True)
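
The decision to drop a column "because there are so many NaNs" can be made explicit with a threshold on the share of missing values. This is only a sketch: the helper name `mostly_missing` and the 0.9 cut-off are assumptions chosen for illustration, and the toy frame is made-up data.

```python
import numpy as np
import pandas as pd

def mostly_missing(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose share of missing values exceeds `threshold`."""
    missing_share = frame.isnull().mean()  # fraction of NaNs per column
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame: one column is entirely missing, the other only partly.
toy = pd.DataFrame({
    "bpay_biller_code": [np.nan, np.nan, np.nan, np.nan],
    "amount": [10.0, 20.0, np.nan, 5.0],
})
print(mostly_missing(toy))  # ['bpay_biller_code']
```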

In [None]:
# let's count the duplicates in the dataset
df.duplicated().sum()

In [None]:
# it looks like the NaNs in the dataset all fall on the same rows.
df[df.card_present_flag.isnull()][['merchant_id', 'merchant_state', 'merchant_suburb', 'merchant_long_lat']].isnull().sum()
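
Whether several columns are missing on exactly the same rows can be tested directly by comparing their null masks. The helper `nulls_aligned` below is a hypothetical name, and the toy frame uses made-up values; it is a sketch of the idea rather than the check used above.

```python
import numpy as np
import pandas as pd

def nulls_aligned(frame: pd.DataFrame, cols: list) -> bool:
    """True if every column in `cols` is null on exactly the same rows."""
    null_mask = frame[cols].isnull()
    first = null_mask.iloc[:, 0]
    # Compare each column's mask with the first column's mask, row by row.
    return bool(null_mask.eq(first, axis=0).all().all())

toy = pd.DataFrame({
    "card_present_flag": [1.0, np.nan, 0.0, np.nan],
    "merchant_id": ["m1", np.nan, "m2", np.nan],
    "merchant_state": ["NSW", np.nan, "VIC", "QLD"],
})
print(nulls_aligned(toy, ["card_present_flag", "merchant_id"]))     # True
print(nulls_aligned(toy, ["card_present_flag", "merchant_state"]))  # False
```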

In [None]:
# these are the unique values in the status column.
df['status'].value_counts()

In [None]:
# all the transactions happened in Australia.
df['country'].value_counts()

In [None]:
# so I am dropping the country column
df.drop(['country'], axis = 1 , inplace = True)

In [None]:
# the unique movement types 
df.movement.value_counts()

In [None]:
# I am going to change the dtype of date column and extraction column to datetime.
df['date'] = pd.to_datetime(df['date'])
df['extraction'] = pd.to_datetime(df['extraction'])

In [None]:
# adding three columns: day of the week, month, and year of the transaction to make the segmentation by date easy.
df['day']= df['date'].dt.day_name()
df['month'] = df['date'].dt.month_name()
df['year'] = df['date'].dt.year
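
`dt.day_name()` returns plain strings, so sorting and plotting do not follow calendar order by default. One optional refinement, which is an assumption and not part of the original notebook, is to store the day as an ordered categorical. The dates below are made up for the example.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Three illustrative dates from August 2018 (Sunday, Wednesday, Monday).
dates = pd.Series(pd.to_datetime(["2018-08-05", "2018-08-01", "2018-08-06"]))

# An ordered categorical keeps Monday..Sunday order in sorts and plots.
day = pd.Series(pd.Categorical(dates.dt.day_name(),
                               categories=DAY_ORDER, ordered=True))
print(day.sort_values().tolist())  # ['Monday', 'Wednesday', 'Sunday']
```

The same pattern would apply to the month column with a list of month names in calendar order.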

In [None]:
# now let's check the unique values of our new columns
df.year.value_counts()

In [None]:
# because all the transactions are from 2018, I will drop the year column
df.drop(['year'], axis = 1 , inplace = True)

In [None]:
# it looks like transactions happened only in August, September, and October
df.month.value_counts()

In [None]:
# unique values of txn_description (mode of transaction)
df.txn_description.value_counts()

In [None]:
# finally let's check the distribution of the dataset by gender
df.gender.value_counts()

In [None]:
# this is a quick summary of our categorical variables showing the number of unique values in each
# column with the most frequent value in it and its frequency.
# the datetime_is_numeric argument was removed in pandas 2.0, so it is omitted here
df.describe(exclude = [np.number]).loc[['unique','top' ,'freq']]

- transaction_id has 12043 unique values. It is only an identifier and adds nothing to my analysis, so I will drop it.
- There are 100 customers in the dataset but only 80 unique first names, so the first name does not identify a customer. I will drop the first_name column and rely on customer_id.
- Each customer has a unique account, so I will keep only the customer_id column and drop the account column.
- For the location-based analysis, I will rely on merchant_state and drop long_lat, merchant_suburb, and merchant_long_lat.
- There are 5725 merchants, and the busiest one has only 14 transactions, so merchant_id is unlikely to have a significant impact on the transaction amount. I will drop it for now.
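
The claims in this list (one account per customer, first names that repeat across customers, few transactions per merchant) can each be checked with a short query. The sketch below uses a hypothetical helper, `is_one_to_one`, on a made-up toy frame; the column names match the dataset but the values do not come from it.

```python
import pandas as pd

def is_one_to_one(frame: pd.DataFrame, left: str, right: str) -> bool:
    """True if each `left` value maps to exactly one `right` value and vice versa."""
    pairs = frame[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3"],
    "account": ["A1", "A1", "A2", "A3"],
    "first_name": ["Sam", "Sam", "Alex", "Sam"],
    "merchant_id": ["m1", "m2", "m2", "m2"],
})
print(is_one_to_one(toy, "customer_id", "account"))     # True
print(is_one_to_one(toy, "customer_id", "first_name"))  # False: two customers share a name
print(toy["merchant_id"].value_counts().max())          # 3: transactions at the busiest merchant
```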

In [None]:
df.drop(['transaction_id', 'account', 'first_name', 'long_lat','merchant_id','merchant_suburb', 'merchant_long_lat'], axis = 1 , inplace = True)

In [None]:
# my final dataset

df.head()

## Univariate analysis

First, a univariate analysis for my categorical variables.

In [None]:
# there are more authorized transactions than posted

sns.countplot(x = 'status', data = df, palette = 'magma')
plt.title('count of transactions by transaction status');

In [None]:
# POS and SALES-POS are the most used modes of transaction

sns.countplot(y = 'txn_description', data = df, palette = 'magma')
plt.title('count of transactions by mode of transaction')
plt.ylabel('mode of transaction');

In [None]:
# more transactions were made by male customers than by female customers

sns.countplot(x = 'gender', data = df, palette = 'magma')
plt.title('count of transactions by gender');

In [None]:
# NSW and VIC are the states with the most transactions

sns.countplot(x = 'merchant_state', data = df, palette = 'magma')
plt.title('count of transactions by merchant state');

In [None]:
# there are many more debit transactions than credit ones.

sns.countplot(x = 'movement', data = df, palette = 'magma')
plt.title('count of transactions by movement type');

In [None]:
plt.figure(figsize=[15, 5])

# the left plot
plt.subplot(1,2,1)
sns.countplot(y = 'day', data = df, palette = 'magma')
plt.title('count of transactions by day of week')

# the right plot
plt.subplot(1,2,2)
sns.countplot(x = 'month', data = df, palette = 'magma')
plt.title('count of transactions by month');

- Wednesday and Friday are the days with the most transactions.
- Monday has the fewest transactions.
- The number of transactions differs little between the months.

Next, an analysis for the numeric variables.

In [None]:
# these are histograms of my numeric variables.
df.hist(figsize=(10,8));

In [None]:
# the datetime_is_numeric argument was removed in pandas 2.0, so it is omitted here
df.describe(percentiles = [0.75, 0.90, 0.95, 0.99])

- There seem to be some outliers in amount, age, and balance columns.
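
A common way to make "some outliers" concrete is the interquartile range rule, which flags values more than 1.5 IQRs outside the quartiles. This rule is an assumption introduced here for illustration, not the method used in this notebook; `iqr_outliers` is a hypothetical helper and the amounts are made up.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy amounts with one very large transaction.
amount = pd.Series([12.0, 18.0, 25.0, 30.0, 41.0, 55.0, 3000.0])
mask = iqr_outliers(amount)
print(amount[mask].tolist())  # [3000.0]
```

Applied to the real data, `iqr_outliers(df['amount']).sum()` would count the flagged transactions.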

## Transaction amount analysis

In [None]:
plt.figure(figsize=[15,4])
binsize = 10
bins = np.arange(-100, df['amount'].max()+binsize, binsize)
plt.hist(data = df, x = 'amount', bins = bins)
plt.xlabel('transaction amount');

Let's hide the outliers by limiting the x-axis to amounts up to 350.

In [None]:
plt.figure(figsize=[8, 5])

bin_edges = np.arange(-10, 350+1,1)
plt.hist(data = df, x = 'amount', bins = bin_edges)
plt.xlim(-10, 350)
plt.xlabel('transaction amount');

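Over the full range, the histogram is dominated by a long right tail of large amounts. With the axis limited, the bulk of the distribution is much easier to see; it looks roughly normal in this view, although the long tail means the amounts as a whole are not normally distributed.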
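
How long-tailed the amounts are can be quantified with the sample skewness, computed once on all amounts and once on the amounts inside the axis limits. The sketch below uses made-up values and the same 350 cut-off as the plots; a skewness near zero indicates a roughly symmetric shape, a large positive value a long right tail.

```python
import pandas as pd

# Toy amounts: eight small, evenly spaced values and one very large one.
amount = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 2000.0])

full_skew = amount.skew()                      # includes the large value
trimmed_skew = amount[amount <= 350].skew()    # same cut-off as the plots

print(f"skewness, all amounts:    {full_skew:.2f}")
print(f"skewness, amounts <= 350: {trimmed_skew:.2f}")
```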

### Amount by gender

In [None]:
sns.catplot(x="gender", y="amount", kind="bar", data=df.query('amount <= 350') , palette = 'magma')
plt.title('average transaction amount per gender');

This bar plot shows the average transaction amount for each gender, restricted to amounts of 350 or less. The error bars show seaborn's default 95% confidence interval around each mean.

In [None]:
# scatterplot
sns.catplot(x="gender", y = 'amount', data=df.query('amount <= 350'), palette = 'magma')
plt.title('transaction amount per gender');

This categorical scatter plot shows the individual transaction amounts for each gender.

- The highest amounts are transactions made by men.
- Transactions made by men are also higher on average.
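
Because a mean can be pulled up by a few large transactions, it may help to look at the median and the group size next to it. The following is a minimal sketch on made-up data, not a result from the ANZ dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "amount": [10.0, 20.0, 300.0, 15.0, 25.0, 35.0],
})

# Mean, median and number of transactions per gender.
summary = toy.groupby("gender")["amount"].agg(["mean", "median", "count"])
print(summary)
```

In this toy example the mean for "M" is far above its median because of a single large amount, while the two medians are close.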

### Amount by month

In [None]:
sns.catplot(x="month", y="amount", kind="bar", data=df.query('amount <= 350') , palette = 'magma')
plt.title('average transaction amount per month')
sns.catplot(x="month", y="amount", kind="point", data=df.query('amount <= 350'));

October has higher transaction amounts than August and September.

In [None]:
sns.catplot(x="month", y = 'amount', data=df.query('amount <= 350'), palette = 'magma')
plt.title('transaction amount per month');

Outliers in the transaction amount occur in all three months.

In [None]:
sns.catplot(x="month", y="amount", hue="gender",kind="point", data=df.query('amount <= 350'), palette = 'magma')
plt.title('average transaction amount by month per gender');

- males have higher transaction amounts.
- October is the month with the highest transaction amounts for both genders.

### Amount by day

In [None]:
sns.catplot(y="day", x="amount", kind="bar", data=df.query('amount <= 350'), palette = 'magma')
plt.title('average transaction amount by day of the week')
sns.catplot(y="day", x="amount", kind="point", data=df.query('amount <= 350'));

- The average transaction amount is lowest on Saturday.
- The highest average transaction amounts occur on Monday.

In [None]:
sns.catplot(y="day", x="amount", hue="gender",kind="point", data=df.query('amount <= 350'), palette = 'magma')
plt.title('average transaction amount by weekday per gender');

Only on Sunday is the average transaction amount for women almost the same as for men.
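
The gap between genders on each day can also be read from a table instead of the point plot. The sketch below builds a pivot table of mean amounts on made-up data; the column name `M_minus_F` is introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "day": ["Monday", "Monday", "Sunday", "Sunday", "Sunday"],
    "gender": ["M", "F", "M", "F", "F"],
    "amount": [40.0, 20.0, 30.0, 28.0, 32.0],
})

# Mean amount per day (rows) and gender (columns).
table = toy.pivot_table(index="day", columns="gender",
                        values="amount", aggfunc="mean")
table["M_minus_F"] = table["M"] - table["F"]  # difference between the two means
print(table)
```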

In [None]:
sns.catplot(y="day", x="amount",hue='month', kind="bar", data=df.query('amount <= 350'), palette = 'magma')
plt.title('transaction amount by day of the week per month');

### Amount by location

In [None]:
# min and max amount of transactions in each state
df.groupby('merchant_state')['amount'].agg(['min' , 'max'])

In [None]:
sns.catplot(x="merchant_state", y = 'amount', data=df.query('amount <= 350'), palette = 'magma')
plt.title('transaction amount per state');

In [None]:
sns.catplot(x="merchant_state", y="amount", kind="bar", data=df.query('amount <= 350') , palette = 'magma')
plt.title('average transaction amount per state')
sns.catplot(x="merchant_state", y="amount", kind="point", data=df.query('amount <= 350'));

In [None]:
sns.catplot(y="merchant_state", x="amount", hue="gender",kind="point", data=df.query('amount <= 350'), palette = 'magma')
plt.title('average transaction amount by state per gender');

Women's average transaction amounts are higher than men's in TAS and ACT.

### Amount by Age

In [None]:
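# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(df['age'], kde=True, stat='density');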

In [None]:
sns.relplot(x='age' , y='amount', kind="line" , data=df.query('amount <= 350'));

The highest average transaction amounts come from customers aged between 40 and 45.
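
A line plot averages the amounts at every single age, which can be noisy when few customers share an age. Grouping ages into bands is one way to check the pattern; the band edges below are hypothetical and the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [19, 23, 41, 44, 44, 67],
    "amount": [10.0, 20.0, 90.0, 70.0, 80.0, 30.0],
})

bins = [15, 30, 45, 60, 80]  # hypothetical band edges, right-inclusive
toy["age_band"] = pd.cut(toy["age"], bins=bins)

# Mean amount and number of transactions per band; empty bands are left out.
by_band = toy.groupby("age_band", observed=True)["amount"].agg(["mean", "count"])
print(by_band)
```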

### Amount by transaction type

In [None]:
sns.catplot(x="movement", y="amount", kind="bar", data=df , palette = 'magma')
plt.title('average transaction amount per movement')
sns.catplot(x="movement", y="amount", kind="point", data=df, join=False);

In [None]:
sns.catplot(x="status", y="amount", kind="bar", data=df.query('amount <= 350') , palette = 'magma')
plt.title('average transaction amount per status of transaction')
sns.catplot(x="status", y="amount", kind="point", data=df.query('amount <= 350'), join=False);

- Small transactions are mostly debit and authorised transactions.
- Large transactions are mostly credit and posted transactions.
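
Since movement and status point the same way here, a cross-tabulation shows how much the two columns overlap, and a grouped median shows the typical amount in each combination. This is a sketch on made-up rows; the category labels are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "movement": ["debit", "debit", "debit", "credit", "credit"],
    "status": ["authorized", "authorized", "posted", "posted", "posted"],
    "amount": [12.0, 30.0, 45.0, 1500.0, 2100.0],
})

# Number of transactions in each movement/status combination.
counts = pd.crosstab(toy["movement"], toy["status"])
# Typical amount in each combination that actually occurs.
medians = toy.groupby(["movement", "status"])["amount"].median()

print(counts)
print(medians)
```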

In [None]:
sns.catplot(x="movement", y="amount", hue="gender",kind="point", data=df, palette = 'magma', join=False)
plt.title('average transaction amount by movement per gender');

In [None]:
sns.catplot(y="txn_description", x="amount", kind="bar", data=df.query('amount <= 350') , palette = 'magma')
plt.title('average transaction amount per mode of transaction')
sns.catplot(y="txn_description", x="amount", kind="point", data=df.query('amount <= 350'), join=False);

- Phone bank transactions have the highest average amount of all transaction modes.

### Amount and Customers

In [None]:
# customers of the 10 highest transaction amounts
df1 = df.loc[df.amount.sort_values(ascending =False)[:10].index]
sns.catplot(y="customer_id", x="amount", kind="bar", data=df1 , palette = 'magma');

In [None]:
# customers with the highest number of transactions
df2 = df.loc[df['customer_id'].isin(df.customer_id.value_counts()[:10].index)]
sns.countplot(y = 'customer_id', data=df2 , palette = 'magma');

In [None]:
# total transactions amounts for the customers with highest number of transactions
df2.groupby('customer_id')['amount'].sum().plot.bar(color = 'pink', alpha = 0.8);

In [None]:
# customers with the highest total transaction amount
df.groupby('customer_id')['amount'].sum().sort_values(ascending=False)[:10].plot.bar(color = 'pink', alpha = 0.8);

In [None]:
highest_transactions_customers = df1.customer_id.value_counts().index.tolist()

In [None]:
most_dealing_customers = df2.customer_id.value_counts().index.tolist()

In [None]:
most_spending_customers = df.groupby('customer_id')['amount'].sum().sort_values(ascending=False)[:10].index.tolist()

In [None]:
# no customer appears in all three top-10 lists: most spending customers, highest transactions made, and highest number of transactions.
set(highest_transactions_customers).intersection(most_dealing_customers, most_spending_customers)

In [None]:
# this customer did the 4th highest transaction amount and ranked 9th in the customers who made the most transactions.
set(highest_transactions_customers).intersection(most_dealing_customers)

In [None]:
# this customer made the highest transaction amount and is the third most spending customer
set(highest_transactions_customers).intersection(most_spending_customers)

In [None]:
# this customer did more transactions than all others, and is the 9th most spending customer.
set(most_dealing_customers).intersection(most_spending_customers)
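
The positions quoted in the comments above (4th, 9th, and so on) can be computed rather than read off the plots by ranking every customer on each measure. The sketch below uses made-up data; the column names `total`, `n_txn`, and `largest` are introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "amount": [10.0, 20.0, 30.0, 500.0, 5.0, 100.0],
})

# One row per customer: total amount, number of transactions, largest transaction.
per_customer = toy.groupby("customer_id")["amount"].agg(
    total="sum", n_txn="count", largest="max"
)

# Rank 1 is the highest value in each column.
ranks = per_customer.rank(ascending=False, method="min").astype(int)
print(ranks)
```

A customer's row in `ranks` then gives their position on all three measures at once.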

### Amount and Balance

In [None]:
sns.heatmap(df[['amount', 'balance']].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

There is no significant linear relationship between the balance and the transaction amount.
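
The default `corr()` is the Pearson coefficient, which only measures linear association. A rank-based coefficient such as Spearman's can serve as a complementary check for monotonic but non-linear relationships. The sketch below illustrates the difference on made-up numbers and says nothing about the ANZ data itself.

```python
import pandas as pd

# Toy data: amount grows with balance, but not linearly.
toy = pd.DataFrame({"balance": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["amount"] = toy["balance"] ** 3

pearson = toy[["amount", "balance"]].corr().loc["amount", "balance"]
spearman = toy[["amount", "balance"]].corr(method="spearman").loc["amount", "balance"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, `df[['amount', 'balance']].corr(method='spearman')` could be passed to the same heatmap call.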

In [None]:
df.to_csv('ANZ-data.csv', index = False)

# IMENE KOLLI

### LinkedIn : https://www.linkedin.com/in/imenekolli/

### Github : https://github.com/imene-swaan