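# Bank Telemarketing Data Analysis
Banks offer their customers a wide range of financial products and often recommend them through telemarketing. To improve the success rate of these campaigns, banks use data analysis to identify which customers are more likely to buy. The data used here records the telemarketing campaigns of a Portuguese banking institution. The institution usually calls the same customer several times, and the dataset records whether the customer ultimately bought the product or not. The goal is to find patterns in this data that indicate whether a customer will buy a new product. The original dataset has roughly 45,000 records; this analysis uses a 10% random sample containing 4,521 records.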

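First, import the required libraries and read in the data file. The fields in the CSV file are separated by semicolons, so the delimiter must be specified when reading the file.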
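
As a minimal sketch of why the delimiter matters, the snippet below parses a small in-memory stand-in for the file. The text in `raw` and the name `sample` are made up for illustration and are not the real `bank.csv`.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for bank.csv.
raw = (
    'age;job;balance;y\n'
    '30;unemployed;1787;no\n'
    '33;services;4789;no\n'
)

# With the default comma separator, every row collapses into a single column.
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep=';' splits the fields as intended.
sample = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)   # one column only
print(sample.shape)  # four columns
```

In `read_csv`, `delimiter` is an alias for `sep`, so either spelling can be used.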

In [1]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('data/银行电话营销/bank.csv', delimiter=';')
df.head()

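| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |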


Next, check whether the data contains any missing values.

In [2]:
df.isnull().any()

age          False
job          False
marital      False
education    False
default      False
balance      False
housing      False
loan         False
contact      False
day          False
month        False
duration     False
campaign     False
pdays        False
previous     False
poutcome     False
y            False
dtype: bool
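
When checking for missing values, the choice of reduction matters. The toy frame below (hypothetical data, not taken from the dataset) sketches how `.all()`, `.any()` and `.sum()` behave on a column that is only partly missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' has one missing value, 'job' has none.
toy = pd.DataFrame({
    'age': [30, np.nan, 35],
    'job': ['services', 'admin.', 'management'],
})

all_null = toy.isnull().all()     # True only if every value in the column is missing
any_null = toy.isnull().any()     # True if at least one value is missing
null_counts = toy.isnull().sum()  # number of missing values per column

print(all_null)
print(any_null)
print(null_counts)
```

Here `.all()` reports `False` for `age` even though one value is missing, so `.any()` or `.sum()` is the safer check for completeness.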

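The output shows that the data is complete, with no missing values. The column y indicates whether the customer bought the product offered in the telemarketing call. We first split the data into two groups by y and compare the other fields across the groups, to see which fields are associated with a purchase and which with a refusal. Most fields in the dataset are categorical, so pie charts are used to show how each category is distributed within a group; comparing the charts shows which customers are more likely to buy or to decline.

Numeric fields such as age and average yearly balance are first converted into categories. Age is split into four groups: under 18, 18-25, 26-50, and 51 or older. Average yearly balance is split at 1000, 2000 and 5000.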

In [3]:
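def age_level(age):
    """Map an age to one of four groups: <18, 18-25, 26-50, 51+."""
    try:
        age = int(age)
    except (TypeError, ValueError):
        # Leave non-numeric values (e.g. labels that were already converted) unchanged.
        return age
    if age < 18:
        return 'Level 1'
    elif age <= 25:
        return 'Level 2'
    elif age <= 50:
        return 'Level 3'
    else:
        return 'Level 4'
df['age'] = df['age'].apply(age_level)

def balance_level(balance):
    """Map a balance to one of four groups, split at 1000, 2000 and 5000."""
    try:
        balance = int(balance)
    except (TypeError, ValueError):
        # Leave non-numeric values unchanged.
        return balance
    if balance < 1000:
        return 'Level 1'
    elif balance < 2000:
        return 'Level 2'
    elif balance < 5000:
        return 'Level 3'
    else:
        return 'Level 4'
df['balance'] = df['balance'].apply(balance_level)

# Customers who bought the product
buy_df = df[df['y'] == 'yes']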
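
The same grouping can also be expressed with `pd.cut`. This is an alternative technique rather than the approach used in this notebook, and the sample values below are invented to exercise the bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to sit on the bin boundaries.
ages = pd.Series([17, 18, 25, 26, 50, 51])
balances = pd.Series([-200, 999, 1000, 1999, 2000, 4999, 5000])
labels = ['Level 1', 'Level 2', 'Level 3', 'Level 4']

# Right-closed bins: (-inf, 17], (17, 25], (25, 50], (50, inf)
age_groups = pd.cut(ages, bins=[-np.inf, 17, 25, 50, np.inf], labels=labels)

# Left-closed bins: [-inf, 1000), [1000, 2000), [2000, 5000), [5000, inf)
balance_groups = pd.cut(balances, bins=[-np.inf, 1000, 2000, 5000, np.inf],
                        labels=labels, right=False)

print(age_groups.tolist())
print(balance_groups.tolist())
```

One difference to keep in mind: `pd.cut` returns a categorical column that remembers all four labels, including any that never occur, whereas the `apply` version produces plain strings.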

Compare the customers who bought the product along each dimension.

In [4]:
show_data = []

# Counts by age group
data = buy_df.groupby('age').size()
value = {}
value['Title'] = 'Age'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [5]:
# Counts by job
data = buy_df.groupby('job').size()
value = {}
value['Title'] = 'Job'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [6]:
# Counts by marital status
data = buy_df.groupby('marital').size()
value = {}
value['Title'] = 'marital'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [7]:
# Counts by education
data = buy_df.groupby('education').size()
value = {}
value['Title'] = 'education'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [8]:
# Counts by balance group
data = buy_df.groupby('balance').size()
value = {}
value['Title'] = 'balance'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [9]:
# Counts by housing
data = buy_df.groupby('housing').size()
value = {}
value['Title'] = 'housing'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [10]:
# Counts by loan
data = buy_df.groupby('loan').size()
value = {}
value['Title'] = 'loan'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)
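
The cells above repeat one pattern per column. A possible way to condense them is sketched below; `build_show_data` is a hypothetical helper introduced here, and `toy` is made-up data. Unlike the cells above, it uses the column name itself as the title.

```python
import pandas as pd

def build_show_data(frame, columns):
    """Return one {'Title', 'Attributes', 'Values'} dict per column (hypothetical helper)."""
    result = []
    for col in columns:
        counts = frame.groupby(col).size()
        result.append({
            'Title': col,
            'Attributes': counts.index.tolist(),
            'Values': counts.tolist(),  # plain Python ints
        })
    return result

# Made-up data standing in for buy_df.
toy = pd.DataFrame({
    'marital': ['married', 'single', 'married'],
    'loan': ['no', 'no', 'yes'],
})
toy_show_data = build_show_data(toy, ['marital', 'loan'])
print(toy_show_data)
```

With the real data, the call would presumably look like `build_show_data(buy_df, ['age', 'job', 'marital', 'education', 'balance', 'housing', 'loan'])`.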

In [11]:
from pyecharts import Pie, Style
pie = Pie('Customers who bought the product', "Comparison by category", title_pos='center')
style = Style()
pie_style = style.add(label_pos="center", is_label_show=False, label_text_color=None, is_legend_show=False)
i = 1
# print(len(show_data))
for value in show_data:
    j = 0
    k = 0
    if (i < 5):
        j = 0
        k = i
    else:
        j = 1
        k = i - 4
    pie.add(value['Title'], value['Attributes'], value['Values'], center=[k * 20, 30 + j * 40], 
            radius=[8, 34], **pie_style)
    i += 1
pie
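
The loop above places the pies on a grid with four per row. The position arithmetic can be sketched on its own with `divmod`. `pie_center` is a hypothetical helper, and its defaults mirror the constants used in the loop (20% horizontal step, first row at 30%, 40% between rows).

```python
def pie_center(index, per_row=4, x_step=20, y_start=30, y_step=40):
    """Centre, in percent, of the index-th pie (counting from 0); hypothetical helper."""
    row, col = divmod(index, per_row)
    return [(col + 1) * x_step, y_start + row * y_step]

# Centres for the seven pies drawn above.
centers = [pie_center(n) for n in range(7)]
print(centers)
```

For seven charts this yields the same centres as the `i`, `j`, `k` bookkeeping in the loop, and it would start a third row if a ninth chart were added.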

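The pie charts show that, among customers who bought the product, the largest shares are those with no personal loan (loan = no), aged 26-50, with no housing loan (housing = no), and with a higher level of education.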

For comparison, how are the customers who did not buy the product distributed?

In [12]:
non_buy_df = df[df['y'] == 'no']

In [13]:
show_data = []

# Counts by age group
data = non_buy_df.groupby('age').size()
value = {}
value['Title'] = 'Age'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [14]:
# Counts by job
data = non_buy_df.groupby('job').size()
value = {}
value['Title'] = 'Job'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [15]:
# Counts by education
data = non_buy_df.groupby('education').size()
value = {}
value['Title'] = 'education'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [16]:
# Counts by marital status
data = non_buy_df.groupby('marital').size()
value = {}
value['Title'] = 'marital'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [17]:
# Counts by balance group
data = non_buy_df.groupby('balance').size()
value = {}
value['Title'] = 'balance'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [18]:
# Counts by housing
data = non_buy_df.groupby('housing').size()
value = {}
value['Title'] = 'housing'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [19]:
# Counts by loan
data = non_buy_df.groupby('loan').size()
value = {}
value['Title'] = 'loan'
value['Attributes'] = data.index.tolist() 
value['Values'] = [data[i] for i in data.index.tolist() ]
show_data.append(value)

In [21]:
pie = Pie('Customers who did not buy the product', "Comparison by category", title_pos='center')
style = Style()
pie_style = style.add(label_pos="center", is_label_show=False, label_text_color=None, is_legend_show=False)
i = 1
# print(len(show_data))
for value in show_data:
    j = 0
    k = 0
    if (i < 5):
        j = 0
        k = i
    else:
        j = 1
        k = i - 4
    pie.add(value['Title'], value['Attributes'], value['Values'], center=[k * 20, 30 + j * 40], 
            radius=[0, 34], **pie_style)
    i += 1
pie

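The comparison shows that, among customers who did not buy the product, those with a housing loan (housing = yes) make up the larger share. For the other fields, the dominant categories are the same as for the buyers, with only slight differences in the proportions. This may be an effect of the uneven distribution of the data. We therefore compute correlations to determine more precisely which fields matter most to a customer's decision to buy, before analysing them in more depth. Correlation requires numeric inputs, so all categorical fields must first be converted to numbers.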
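
Because the two groups differ in size, the share of buyers within each category is a more direct comparison than two separate sets of pies. A minimal sketch with `pd.crosstab` on made-up data (the frame `toy` is hypothetical and does not reflect the real proportions):

```python
import pandas as pd

# Made-up data: four customers with a housing loan, two without.
toy = pd.DataFrame({
    'housing': ['yes', 'yes', 'yes', 'yes', 'no', 'no'],
    'y':       ['no',  'no',  'no',  'yes', 'yes', 'no'],
})

# normalize='index' makes each row sum to 1, giving the purchase rate per category.
rates = pd.crosstab(toy['housing'], toy['y'], normalize='index')
print(rates)
```

Each row then reads as "of the customers in this category, what fraction bought", which does not depend on how many customers the category contains.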

In [22]:
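# Note: new_df is another name for df, not a copy, so df['y'] also becomes 0/1 here.
new_df = df
new_df['y'] = new_df['y'].apply(lambda x: 1 if x == 'yes' else 0)
# One-hot encode the remaining categorical columns.
new_df_dummies = pd.get_dummies(new_df)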

In [23]:
new_df_y = new_df_dummies[new_df_dummies['y'] == 1]
corr = new_df_dummies.corr()
corr

| | day | duration | campaign | pdays | previous | y | age_Level 2 | age_Level 3 | age_Level 4 | job_admin. | ... | month_jun | month_mar | month_may | month_nov | month_oct | month_sep | poutcome_failure | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| day | 1.0 | -0.024629 | 0.160706 | -0.094352 | -0.059114 | -0.011244 | 0.013415 | -0.005098 | 0.000168 | 0.017052 | ... | -0.217517 | -0.02457 | -0.028992 | 0.095832 | 0.040235 | -0.043666 | -0.064235 | -0.021062 | -0.02772 | 0.0751 |
| duration | -0.024629 | 1.0 | -0.068382 | 0.01038 | 0.01808 | 0.401118 | -0.021683 | 0.021162 | -0.013733 | -0.038763 | ... | -0.016196 | -0.026212 | 0.008639 | 0.009572 | 0.004566 | -0.020023 | -0.012852 | 0.008109 | 0.049255 | -0.015239 |
| campaign | 0.160706 | -0.068382 | 1.0 | -0.093137 | -0.067833 | -0.061147 | -0.001422 | 0.011465 | -0.011399 | -0.017895 | ... | 0.044317 | -0.004045 | -0.076263 | -0.083385 | -0.058536 | -0.040207 | -0.094021 | -0.030435 | -0.058268 | 0.117375 |
| pdays | -0.094352 | 0.01038 | -0.093137 | 1.0 | 0.577562 | 0.104087 | -0.012108 | 0.025142 | -0.021549 | 0.035127 | ... | -0.110324 | 0.008673 | 0.090216 | 0.012549 | 0.059521 | 0.04789 | 0.70838 | 0.38297 | 0.212188 | -0.867713 |
| previous | -0.059114 | 0.01808 | -0.067833 | 0.577562 | 1.0 | 0.116714 | -0.01032 | 0.017143 | -0.013902 | 0.020665 | ... | -0.084432 | 0.019445 | 0.027549 | 0.0554 | 0.088764 | 0.059763 | 0.475289 | 0.358382 | 0.250277 | -0.682746 |
| y | -0.011244 | 0.401118 | -0.061147 | 0.104087 | 0.116714 | 1.0 | 0.045694 | -0.05498 | 0.039759 | 0.006568 | ... | -0.013323 | 0.102716 | -0.102077 | -0.014397 | 0.145964 | 0.07151 | 0.014556 | 0.051908 | 0.283481 | -0.162038 |
| age_Level 2 | 0.013415 | -0.021683 | -0.001422 | -0.012108 | -0.01032 | 0.045694 | 1.0 | -0.290616 | -0.080574 | 0.001228 | ... | 0.004274 | -0.002803 | 0.026831 | -0.033388 | 0.000388 | 0.063313 | -0.013934 | 0.008144 | -0.018605 | 0.014994 |
| age_Level 3 | -0.005098 | 0.021162 | 0.011465 | 0.025142 | 0.017143 | -0.05498 | -0.290616 | 1.0 | -0.930313 | 0.038905 | ... | -0.027909 | -0.044444 | 0.085316 | 0.030593 | -0.05837 | -0.029895 | 0.004232 | 0.023779 | -0.029637 | -0.003213 |
| age_Level 4 | 0.000168 | -0.013733 | -0.011399 | -0.021549 | -0.013902 | 0.039759 | -0.080574 | -0.930313 | 1.0 | -0.040999 | ... | 0.027434 | 0.047372 | -0.099159 | -0.019071 | 0.060656 | 0.006873 | 0.000932 | -0.027893 | 0.038005 | -0.0024 |
| job_admin. | 0.017052 | -0.038763 | -0.017895 | 0.035127 | 0.020665 | 0.006568 | 0.001228 | 0.038905 | -0.040999 | 1.0 | ... | -0.02266 | 0.019587 | 0.012749 | 0.004801 | 0.030236 | 0.003387 | 0.016644 | 0.018222 | 0.040445 | -0.040635 |


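The table shows that duration has the strongest correlation with y among the columns shown (about 0.40), and that previous is also positively correlated with y (about 0.12). Scatter plots are used here to show how these two fields are distributed for the customers who bought the product.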
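
Rather than reading the full matrix, the column for y can be extracted and sorted. The sketch below does this on a small made-up frame; `toy` and its values are hypothetical, and `dtype=float` is passed to `get_dummies` only to keep the indicator columns numeric across pandas versions.

```python
import pandas as pd

# Made-up data with one numeric and one categorical field.
toy = pd.DataFrame({
    'duration': [79, 220, 185, 600, 900, 1200],
    'contact': ['cellular', 'cellular', 'unknown', 'cellular', 'unknown', 'cellular'],
    'y': [0, 0, 0, 1, 1, 1],
})

# One-hot encode the categorical field, then correlate every column with y.
dummies = pd.get_dummies(toy, dtype=float)
corr_with_y = dummies.corr()['y'].drop('y').sort_values(ascending=False)
print(corr_with_y)
```

Sorting by absolute value instead (`corr_with_y.abs().sort_values()`) would surface strong negative correlations as well.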

In [24]:
# Keep only the successful cases (y was converted to 0/1 above).
data = df[df['y'] == 1]
v_y = data['y'].tolist()
v_duration = data['duration'].tolist()
v_previous = data['previous'].tolist()

In [27]:
from pyecharts import Scatter, Grid
scatter_1 = Scatter(width=1200)
scatter_1.add('Duration', v_y, v_duration)

scatter_2 = Scatter(width=1200)
scatter_2.add('Previous', v_y, v_previous)


grid = Grid()
grid.add(scatter_1, grid_left="60%")
grid.add(scatter_2, grid_right="60%")
grid

The plots show that most successful cases had fewer than 10 previous contacts (previous) and a last call shorter than 1800 seconds (duration).
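
A numeric summary per outcome can complement the scatter plots, since it puts buyers and non-buyers side by side. The following is a sketch on made-up numbers (the frame `toy` is hypothetical), using the median of each field within each group.

```python
import pandas as pd

# Made-up data: y is already encoded as 0 (no) / 1 (yes).
toy = pd.DataFrame({
    'duration': [79, 220, 185, 640, 910, 1200],
    'previous': [0, 4, 1, 2, 0, 3],
    'y': [0, 0, 0, 1, 1, 1],
})

# Median call duration and number of previous contacts for each outcome.
summary = toy.groupby('y')[['duration', 'previous']].median()
print(summary)
```

Replacing `.median()` with `.describe()` would give the full spread of each field for both groups.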

The distribution of the unsuccessful cases can be examined in the same way; this is left to the reader to verify.