# FINANCIAL ADVISOR PROJECT 

<br>Questions:
1. [What would be the suggested steps to make this Supervised Learning Model?](#1)

1. [What are some of the challenges you foresee from the dataset shared?](#2)

1. [Do you consider a need to apply any preprocessing on the training dataset? If so, why?](#3)

1. [How would you execute data from any one of the indicators (highlighted in yellow on slide 4) for 1 of the profiles mentioned and across one of the 4 time frames referred on that same slide to analyze user consumption trends](#4)

1. [If you were the client "Bank", how would you envision a dashboard where all this data is collected? Which 3 features from the users' spending behavior do you think they'd be interested in taking a look at, and why?](#5)

<a id="1"></a> <br>
# 1 - What would be the suggested steps to make this Supervised Learning Model? 
<font color='green'>
    
1. First, clean the data (the preprocessing steps listed under question 3 happen at this stage).
1. Split the data into training and test sets, with proportions chosen according to how much data we have (we assume the customers are already labeled).
1. Start with logistic regression, the simplest classification baseline, and fit it on the training data.
1. Measure the model's error on the test data.
1. Repeat the process with the other common classification algorithms (Decision Trees, Random Forest, Naive Bayes, SVM, Neural Networks) and record the test error of each.
1. Select the algorithm with the lowest error (I expect tree-based algorithms to suit this problem best).
1. Tune the selected algorithm: use cross-validation to find the hyperparameters that give the lowest error.
1. Rebuild the model with those hyperparameters.
1. The final model can now be applied to future data.
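
The steps above can be sketched roughly as follows. This is only an illustration on synthetic data generated with scikit-learn (the real labeled customer data is not available here), and the three customer groups, the candidate models, and the small parameter grids are all assumptions. Unlike the list above, the sketch compares the candidates by cross-validation on the training split and touches the test set only once at the end, which is one common way to keep the final error estimate honest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labeled customer data (assumption: 3 customer groups).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

# Compare the candidates with 5-fold cross-validation on the training split only.
cv_accuracy = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
               for name, model in candidates.items()}
best_name = max(cv_accuracy, key=cv_accuracy.get)

# Tune the winner; the grids are deliberately tiny and purely illustrative.
param_grids = {
    'logistic_regression': {'C': [0.1, 1.0, 10.0]},
    'decision_tree': {'max_depth': [3, 5, None]},
    'random_forest': {'max_depth': [3, 5, None]},
}
search = GridSearchCV(candidates[best_name], param_grids[best_name], cv=5)
search.fit(X_train, y_train)

# The held-out test set is used once, for the final estimate.
test_accuracy = search.score(X_test, y_test)
print(best_name, search.best_params_, round(test_accuracy, 3))
```
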

<a id="2"></a> <br>
# 2 - What are some of the challenges you foresee from the dataset shared?
<span style='color:green'> 
I had difficulty identifying the customers in Group 3 because the dataset does not contain the credit card statement payment day.

<a id="3"></a> <br>
# 3 - Do you consider a need to apply any preprocessing on the training dataset? If so, why?

<font color='green'>

* Convert the date variable to a DateTime type so that we can do date arithmetic on it.
    
* The Amount variable contains values that cannot be used, such as '-'; we need to either fill them in or drop them.
    
* The Amount variable uses "," and "." in a way that cannot be parsed as a number, so we need to normalize the separators.
    
* We need to convert the Amount variable to float.
    
* We need to convert the Gender variable to a dummy variable.
    
* [OPTIONAL] To study spending on special days, add flags for days such as Black Friday, national holidays, and religious holidays to the dataset (for example, a variable called black_friday that is 1 on Black Friday and 0 on all other days).
    
* [OPTIONAL] Dollar exchange rates can be added to the dataset (spending may decrease on days when the dollar rises).
    
* [OPTIONAL] Inflation data can be added to the dataset (we can see how rising inflation is reflected in expenditures).
    
* [OPTIONAL] A variable called vip_customer can be created: 1 for customers with a high credit card limit (for example 40000) and high spending, 0 for everyone else. This may help the model.
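
As a compact illustration of the preprocessing points above, here is a minimal sketch on a few made-up rows. The column names follow the English names used later in this notebook; the values, the lowercase `female` column, and the VIP rule (limit threshold only, without the spending condition) are assumptions for the example.

```python
import pandas as pd

# Toy rows imitating the raw file (values are made up for illustration).
raw = pd.DataFrame({
    'date': ['05/01/2020', '17/03/2020', '02/04/2020'],
    'amount': ['1.234,56', ' -   ', '89,90'],
    'gender': ['F', 'M', 'F'],
    'total_credit_card_limit': [45000, 8000, 12000],
})

clean = raw.copy()
clean['date'] = pd.to_datetime(clean['date'], dayfirst=True)

# Drop the thousands separator, turn the decimal comma into a point,
# and let unparseable placeholders such as '-' become NaN.
amount_text = (clean['amount'].str.strip()
               .str.replace('.', '', regex=False)
               .str.replace(',', '.', regex=False))
clean['amount'] = pd.to_numeric(amount_text, errors='coerce')
clean = clean.dropna(subset=['amount'])

# One dummy column is enough for a two-level variable.
clean['female'] = (clean['gender'] == 'F').astype(int)

# Hypothetical VIP rule: only the limit threshold from the example above.
clean['vip_customer'] = (clean['total_credit_card_limit'] >= 40000).astype(int)
print(clean[['date', 'amount', 'female', 'vip_customer']])
```

Using `pd.to_numeric(..., errors='coerce')` turns the placeholder rows into NaN in one pass, so they can be dropped or imputed afterwards instead of being matched by their exact text.
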

In [None]:
# import the libraries we will use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read data
df = pd.read_csv('../input/brazilian-real-bank-dataset/MiBolsillo.csv',encoding = 'unicode_escape',sep=';')

In [None]:
# we take a look at the data
df.head(5)

In [None]:
# translate variable names into English

df.columns = ['id','branch_number','city','state','age','gender','total_credit_card_limit','current_available_limit' ,'date','amount','category_expense','purchase_city','purchase_country']

In [None]:
df.date = pd.to_datetime(df.date,dayfirst=True)
df.head()

In [None]:
# I create a dataframe as customers with old and new customer names

customers = pd.DataFrame(df.id.unique())
customers.columns = ['old_customer_name']
new_list = list(range(1, len(customers)+1))
customers['new_customer_name'] = new_list
customers

In [None]:
# Replace the original ids with sequential numbers so that customers are easier to tell apart

df["id"] = df["id"].map(dict(zip(customers['old_customer_name'], customers['new_customer_name'])))

In [None]:
# we have information about our dataset
df.info()

In [None]:
# The amount column uses '.' as thousands separator and ',' as decimal separator; normalize both before converting to float
df['amount'] = df['amount'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False)

# Keep the rows whose amount is the placeholder '-' in df_amount_nan for reference
df_amount_nan = df[df.amount == ' -   ']

# Drop those rows, because the placeholder cannot be converted to float
df = df[df.amount != ' -   '].copy()

# Now the amount variable is ready to convert to float type.
df['amount'] = df['amount'].astype(float)

In [None]:
# I am converting the date variable to date_time format
df.date = pd.to_datetime(df.date,dayfirst=True)

In [None]:
# We convert the gender variable to dummy_variable
dms = pd.get_dummies(df['gender'])
df = pd.concat([df,dms],axis=1)
df.drop(['gender', 'M'], axis=1,inplace=True)
df.rename(columns={'F': 'Female'}, inplace=True)
df

In [None]:
# I want to look at the number of transactions in each category

categ = df.category_expense.value_counts().sort_values(ascending=False)


plt.figure(figsize=(15,10))
sns.barplot(x=categ.index,y=categ.values)
plt.xlabel('Categories')
plt.ylabel('Count')
plt.title("Categories Count")
plt.xticks(rotation= 45);

In [None]:

# I look at the number of transactions each client has with a credit card

freq = df.groupby('id')[['amount']].count().sort_values('amount',ascending=False)
freq.rename(columns={'amount': 'Frequency'}, inplace=True)
freq

In [None]:
# I look at the total spending of each customer
total = df.groupby('id')[['amount']].sum().sort_values('amount',ascending=False)
total.rename(columns={'amount': 'Total_spending'}, inplace=True)
total

In [None]:
# I want to plot the total spending of each customer

plt.figure(figsize=(15,10))
sns.barplot(x=total.index,y=total['Total_spending'])
plt.xlabel('Customers')
plt.ylabel('Total Spending')
plt.title("Total Spending vs Customers")
plt.xticks(rotation= 45);

<span style='color:blue'> 
As the chart shows, customer 19 spends the most among all customers.

In [None]:
# I add the special days (Carnival, Good Friday, Christmas, Corpus Christi, New Year's Day, Black Friday, Halloween) of Brazil to the dataset.
# Each flag is 1 when the transaction date falls on that day and 0 otherwise.

special_dates = {
    'carnival': ['2019-03-04', '2019-03-05', '2020-02-24', '2020-02-25'],
    'good_friday': ['2019-04-19', '2020-04-10'],
    'christmas': ['2019-12-25', '2020-12-25'],
    'corpus_christi': ['2019-06-20', '2020-06-11'],
    'new_year': ['2019-01-01', '2020-01-01'],
    'black_friday': ['2019-11-29', '2020-11-27'],
    'halloween': ['2019-10-31', '2020-10-31'],
}

# Build each flag from a date mask instead of calling replace() on a constant column,
# which would overwrite every row as soon as a single date matched.
for name, dates in special_dates.items():
    df[name] = df.date.isin(pd.to_datetime(dates)).astype(int)



In [None]:
# I want to see the total transaction on special days as a barplot

special_days = ['carnival','good_friday','christmas','corpus_christi','new_year','black_friday','halloween']
counts = []

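for i in special_days:
    # count the flagged rows; works whether the flag is stored as a number or as the string '1'
    counts.append(int((df[i].astype(int) == 1).sum()))
special_days_dict = dict(zip(special_days, counts))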


plt.figure(figsize=(15,10))
sns.barplot(x=list(special_days_dict.keys()),y=list(special_days_dict.values()))
plt.xlabel('Special Days')
plt.ylabel('Total Transaction')
plt.title('Total Transaction of Special Days')
plt.xticks(rotation= 45);

<font color='blue'>
This chart is informative, but it is not enough on its own to compare the transactions made on special days. A fairer comparison needs a baseline: for example, the number of transactions in the week before each special day (except for Black Friday).
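
One way to build such a baseline is sketched below on made-up dates. The helper name `special_day_vs_baseline` and the 7-day window are assumptions for illustration, not part of the analysis above; the function compares the number of transactions on a given day with the average daily number over the preceding week.

```python
import pandas as pd

# Toy transaction log: one row per transaction (dates are made up).
tx = pd.DataFrame({'date': pd.to_datetime(
    ['2019-12-18', '2019-12-19', '2019-12-19', '2019-12-21', '2019-12-23',
     '2019-12-24', '2019-12-24', '2019-12-25', '2019-12-25', '2019-12-25'])})

def special_day_vs_baseline(frame, day, window=7):
    """Return (transactions on `day`, mean daily transactions in the `window` days before it)."""
    day = pd.Timestamp(day)
    daily = frame.groupby(frame['date'].dt.normalize()).size()
    # Reindex so that days without any transaction count as zero.
    previous_days = pd.date_range(end=day - pd.Timedelta(days=1), periods=window)
    baseline = daily.reindex(previous_days, fill_value=0).mean()
    return int(daily.get(day, 0)), float(baseline)

on_day, baseline = special_day_vs_baseline(tx, '2019-12-25')
print(on_day, round(baseline, 2))
```
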

<a id="4"></a> <br>
# 4 - How would you execute data from any one of the indicators (highlighted in yellow on slide 4) for 1 of the profiles mentioned and across one of the 4 time frames referred on that same slide to analyze user consumption trends 

In [None]:
# I create data sets covid and pre covid

# pre covid
pre_covid = df[(df.date > '2020-01-01') & (df.date < '2020-03-18')]

#covid
covid = df[(df.date >= '2020-03-18')]

## For Group 1

### During Covid

In [None]:
covid.head()

In [None]:
# Frequency of use

covid_freq = covid.groupby('id')[['age']].count().sort_values('age',ascending=False)
covid_freq.columns = ['frequency']
print('average number of transactions per customer: ',covid_freq.frequency.mean())
print('max number of transactions per customer: ',covid_freq.frequency.max())
print('min number of transactions per customer: ',covid_freq.frequency.min())

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=covid_freq.index,y=covid_freq['frequency'])
plt.xlabel('Customers')
plt.ylabel('Frequency of use')
plt.title("The frequency of transactions made by customers during the covid")
plt.xticks(rotation= 45);

In [None]:
# The frequency of using credit card for each customer during covid
covid_freq

In [None]:
covid['category_expense'].value_counts()

In [None]:
# Since there is no specific rule, I create the essential and non-essential list myself.

essential_list = ['FARMACIAS','VAREJO','HOSP E CLINICA','SUPERMERCADOS','POSTO DE GAS','TRANS FINANC']

non_essential_list = ['SERVI\x82O','M.O.T.O.','ARTIGOS ELETRO','LOJA DE DEPART','VESTUARIO','SEM RAMO','MAT CONSTRUCAO','RESTAURANTE','CIA AEREAS','MOVEIS E DECOR','JOALHERIA','AGENCIA DE TUR','HOTEIS','AUTO PE AS','INEXISTENTE','']

In [None]:
# Number of transactions in each essential category during COVID

covid[covid.category_expense.isin(essential_list)]['category_expense'].value_counts()

In [None]:
# Number of transactions in each non-essential category during COVID

covid[covid.category_expense.isin(non_essential_list)]['category_expense'].value_counts()

% of essential vs. % of non-essential expenses, with totals in Brazilian R$


In [None]:
# % of essential
covid_total = covid['amount'].sum()
covid_essential_total = covid[covid.category_expense.isin(essential_list)]['amount'].sum()
essential_covid = covid_essential_total * 100 / covid_total
print('Total spending during covid: ',covid_total,'Brazilian R$')
print('Total spending in essential category during covid: ',covid_essential_total,'Brazilian R$')
print('Essential : %',essential_covid)

In [None]:
# % of non essential
covid_total = covid['amount'].sum()
covid_non_essential_total = covid[covid.category_expense.isin(non_essential_list)]['amount'].sum()
non_essential_covid = covid_non_essential_total * 100 / covid_total
print('Total spending during covid: ',covid_total,'Brazilian R$')
print('Total spending in non essential category during covid: ',covid_non_essential_total,'Brazilian R$')
print('Non - Essential : %',non_essential_covid)
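
The same percentage is computed several times in this notebook, so it could be wrapped in a small helper. The sketch below uses made-up amounts, and `spending_share` is a hypothetical name, not something defined elsewhere in the notebook. It also derives the non-essential share as the complement of the essential share, which is an assumption that every category outside the essential list counts as non-essential.

```python
import pandas as pd

def spending_share(frame, categories):
    """Percentage of total `amount` spent in the given `category_expense` values."""
    total = frame['amount'].sum()
    if total == 0:
        return 0.0
    in_group = frame['category_expense'].isin(categories)
    return float(frame.loc[in_group, 'amount'].sum() * 100 / total)

# Toy data; category names follow the dataset, amounts are made up.
toy = pd.DataFrame({
    'category_expense': ['SUPERMERCADOS', 'FARMACIAS', 'RESTAURANTE', 'HOTEIS'],
    'amount': [300.0, 100.0, 400.0, 200.0],
})
essential = ['SUPERMERCADOS', 'FARMACIAS']

essential_pct = spending_share(toy, essential)
# Everything outside the essential list, so the two shares always add up to 100.
non_essential_pct = 100.0 - essential_pct
print(essential_pct, non_essential_pct)
```

Because the two category lists are maintained by hand, a category that is missing from both would silently drop out of both percentages; taking the complement avoids that gap.
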

In [None]:
# Top 3 essential expenses

In [None]:
covid[covid.category_expense.isin(essential_list)]['category_expense'].value_counts()[:3]

In [None]:
# Top 3 non - essential expenses

In [None]:
covid[covid.category_expense.isin(non_essential_list)]['category_expense'].value_counts()[:3]

In [None]:
# The three lowest transactions and their categories for each customer during covid

for i in range(1,30):
    print(covid[covid['id'] == i].sort_values('amount')[:3][['category_expense','amount']].set_index(covid[covid['id'] == i]['id'][:3]))

<font color='blue'>

According to these expenditures, even the lowest transactions of customers 4, 7, 10, 13, 17 and 18 appear high.

In [None]:
# The three highest transactions and their categories for each customer during covid

for i in range(1,30):
    print(covid[covid['id'] == i].sort_values('amount',ascending=False)[:3][['category_expense','amount']].set_index(covid[covid['id'] == i]['id'][:3]))

<font color='blue'>
I think we can assign at least customers 13, 20 and 26 to Group 1 :)

### Pre Covid

In [None]:
pre_covid.head()

In [None]:
# Frequency of use

pre_covid_freq = pre_covid.groupby('id')[['age']].count().sort_values('age',ascending=False)
pre_covid_freq.columns = ['frequency']
print('average number of transactions per customer: ',pre_covid_freq.frequency.mean())
print('max number of transactions per customer: ',pre_covid_freq.frequency.max())
print('min number of transactions per customer: ',pre_covid_freq.frequency.min())

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=pre_covid_freq.index,y=pre_covid_freq['frequency'])
plt.xlabel('Customers')
plt.ylabel('Frequency of use')
plt.title("The frequency of transactions made by customers before the covid")
plt.xticks(rotation= 45);

In [None]:
# The frequency of using credit card for each customer before covid
pre_covid_freq

In [None]:
# Since there is no specific rule, I create the essential and non-essential list myself.

essential_list = ['FARMACIAS','VAREJO','HOSP E CLINICA','SUPERMERCADOS','POSTO DE GAS','TRANS FINANC']

non_essential_list = ['SERVI\x82O','M.O.T.O.','ARTIGOS ELETRO','LOJA DE DEPART','VESTUARIO','SEM RAMO','MAT CONSTRUCAO','RESTAURANTE','CIA AEREAS','MOVEIS E DECOR','JOALHERIA','AGENCIA DE TUR','HOTEIS','AUTO PE AS','INEXISTENTE','']

In [None]:
# Number of transactions in each essential category before COVID

pre_covid[pre_covid.category_expense.isin(essential_list)]['category_expense'].value_counts()

In [None]:
# Number of transactions in each non-essential category before COVID

pre_covid[pre_covid.category_expense.isin(non_essential_list)]['category_expense'].value_counts()

% of essential vs. % of non-essential expenses, with totals in Brazilian R$


In [None]:
# % of essential
pre_covid_total = pre_covid['amount'].sum()
pre_covid_essential_total = pre_covid[pre_covid.category_expense.isin(essential_list)]['amount'].sum()
essential_pre_covid = pre_covid_essential_total * 100 / pre_covid_total
print('Total spending before covid: ',pre_covid_total,'Brazilian R$')
print('Total spending in essential category before covid: ',pre_covid_essential_total,'Brazilian R$')
print('Essential : %',essential_pre_covid)

In [None]:
# % of non essential
pre_covid_total = pre_covid['amount'].sum()
pre_covid_non_essential_total = pre_covid[pre_covid.category_expense.isin(non_essential_list)]['amount'].sum()
non_essential_pre_covid = pre_covid_non_essential_total * 100 / pre_covid_total
print('Total spending before covid: ',pre_covid_total,'Brazilian R$')
print('Total spending in non essential category before covid: ',pre_covid_non_essential_total,'Brazilian R$')
print('Non - Essential : %',non_essential_pre_covid)

In [None]:
# Top 3 essential expenses

In [None]:
pre_covid[pre_covid.category_expense.isin(essential_list)]['category_expense'].value_counts()[:3]

In [None]:
# Top 3 non - essential expenses

In [None]:
pre_covid[pre_covid.category_expense.isin(non_essential_list)]['category_expense'].value_counts()[:3]

In [None]:
# The three lowest transactions and their categories for each customer before covid

for i in range(1,30):
    print(pre_covid[pre_covid['id'] == i].sort_values('amount')[:3][['category_expense','amount']].set_index(pre_covid[pre_covid['id'] == i]['id'][:3]))

<font color='blue'>

According to these expenses, even the lowest transactions of customer 10 are high.

In [None]:
# The three highest transactions and their categories for each customer before covid

for i in range(1,30):
    print(pre_covid[pre_covid['id'] == i].sort_values('amount',ascending=False)[:3][['category_expense','amount']].set_index(pre_covid[pre_covid['id'] == i]['id'][:3]))
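
The per-customer loops above hard-code the id range 1 to 29. A possible alternative, sketched here on made-up data, lets pandas do the grouping so that every id present in the frame is covered. The helper name `extreme_transactions` is hypothetical; the column names are the ones used in this notebook.

```python
import pandas as pd

# Toy data with the notebook's column names; values are made up.
toy = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'category_expense': ['SUPERMERCADOS', 'HOTEIS', 'FARMACIAS', 'VESTUARIO',
                         'RESTAURANTE', 'POSTO DE GAS'],
    'amount': [50.0, 900.0, 20.0, 300.0, 80.0, 120.0],
})

def extreme_transactions(frame, n=3, largest=True):
    """Return the n largest (or smallest) transactions of every customer."""
    ordered = frame.sort_values(['id', 'amount'], ascending=[True, not largest])
    return ordered.groupby('id').head(n)[['id', 'category_expense', 'amount']]

top = extreme_transactions(toy, n=3, largest=True)
bottom = extreme_transactions(toy, n=3, largest=False)
print(top)
print(bottom)
```
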

In [None]:
freq = pd.concat([covid_freq.sort_index(),pre_covid_freq.sort_index()],axis=1)
freq.columns = ['covid_freq','pre_covid_freq']

In [None]:
# The frequency of using credit card for each customer before covid vs during covid


# one row with two panels, so that no empty axes are drawn
fig, ax = plt.subplots(1,2,sharey=True)


ax[0].plot(pre_covid_freq.sort_index().index,pre_covid_freq.sort_index().values,color='g',marker='o')
ax[0].set_title('Pre Covid')


ax[1].plot(covid_freq.sort_index().index,covid_freq.sort_index().values,color='b',marker='o')
ax[1].set_title('Covid')

plt.show();

In [None]:
# The frequency of using credit card for each customer before covid vs during covid

import plotly.graph_objs as go
#import chart_studio.plotly as py


fig = go.Figure()
fig.add_trace(go.Box(y=freq.covid_freq, name='The frequency of using credit card for each customer during covid',
                marker_color = 'indianred'))
fig.add_trace(go.Box(y=freq.pre_covid_freq, name = 'The frequency of using credit card for each customer before covid',
                marker_color = 'lightseagreen'))

fig.show()

In [None]:
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go


# Creating trace1
trace1 = go.Scatter(
                    x = freq.index,
                    y = freq.covid_freq,
                    mode = "lines",
                    name = "Covid Freq",
                    marker = dict(color = 'rgba(16, 112, 2, 0.8)'),
                    text= freq.covid_freq)
# Creating trace2
trace2 = go.Scatter(
                    x = freq.index,
                    y = freq.pre_covid_freq,
                    mode = "lines+markers",
                    name = "Pre Covid Freq",
                    marker = dict(color = 'rgba(80, 26, 80, 0.8)'),
                    text= freq.pre_covid_freq)
data = [trace1, trace2]
layout = dict(title = 'The frequency of using credit card for each customer before covid vs during covid',
              xaxis= dict(title= 'Customers',ticklen= 5,zeroline= False)
             )
fig = dict(data = data, layout = layout)
iplot(fig)
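
The two periods compared above do not have the same length, so raw transaction counts are not directly comparable. A rough sketch of one possible normalization, transactions per week, is shown below on made-up data. The helper `weekly_rate` is hypothetical, and measuring each period by the span between its first and last transaction is an assumption; the cut-off date is the one used earlier in this notebook.

```python
import pandas as pd

# Toy data; ids and dates are made up. The cut-off date is the one used above.
toy = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2020-01-10', '2020-02-10', '2020-03-20', '2020-04-02',
                            '2020-01-15', '2020-03-25', '2020-04-14']),
})
cutoff = pd.Timestamp('2020-03-18')

def weekly_rate(frame):
    """Transactions per week for each customer over the span covered by `frame`."""
    weeks = ((frame['date'].max() - frame['date'].min()).days + 1) / 7
    return frame.groupby('id').size() / weeks

rates = pd.concat(
    [weekly_rate(toy[toy['date'] < cutoff]), weekly_rate(toy[toy['date'] >= cutoff])],
    axis=1, keys=['pre_covid_per_week', 'covid_per_week']).fillna(0)
rates['change'] = rates['covid_per_week'] - rates['pre_covid_per_week']
print(rates.round(2))
```
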

<font color='green'>


<a id="5"></a> <br>
# 5 - If you were the client "Bank", how would you envision a dashboard where all this data is collected? Which 3 features from the users' spending behavior do you think they'd be interested in taking a look at, and why?
<font color='green'>

I would show the variables 'total_credit_card_limit', 'current_available_limit' and 'age'. In the correlation matrix, these three variables show the strongest relationships with one another.


In [None]:
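# restrict to numeric and boolean columns; text columns such as city or state cannot be correlated
df.select_dtypes(include=['number', 'bool']).corr()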

In [None]:
f,ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=0.1,linecolor="red", fmt= '.2f',ax=ax);
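
To read the strongest relationships off a correlation matrix without inspecting the heatmap by eye, the pairs can be ranked by absolute correlation. The sketch below uses a small made-up frame that borrows three column names from this notebook; the values are invented for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with three of the notebook's column names; values are made up.
toy = pd.DataFrame({
    'age': [25, 32, 41, 50, 63],
    'total_credit_card_limit': [5000, 9000, 15000, 22000, 30000],
    'current_available_limit': [4000, 2000, 9000, 3000, 12000],
})

corr = toy.corr()
# Keep each pair once by masking the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```
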

# Thanks 

