### Data Set Information:

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. 

There are four datasets: 
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). 
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). 
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). 

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).


Attribute Information:

Input variables:
* Bank client data:
* 1 - age (numeric)
* 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
* 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
* 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
* 5 - default: has credit in default? (categorical: 'no','yes','unknown')
* 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
* 7 - loan: has personal loan? (categorical: 'no','yes','unknown')
* Related with the last contact of the current campaign:
* 8 - contact: contact communication type (categorical: 'cellular','telephone') 
* 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
* 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
* 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
* Other attributes:
v12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* 14 - previous: number of contacts performed before this campaign and for this client (numeric)
* 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
* Social and economic context attributes
* 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
* 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
* 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
* 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
* 20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
* 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import brewer2mpl
from matplotlib import rcParams
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 10})
%matplotlib inline
np.random.seed(0)
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [7]:
#colorbrewer2 Dark2 qualitative color table
dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)
dark2_colors = dark2_cmap.mpl_colors

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
# rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

In [93]:
# Convert the column to numbers
def pd_column_to_number(df,col_name):
    """
    Convert number in strings to number

    Args:
        df(dataframe): a pandas dataframe to perform the conversion on
        col_name (list): a list of column headers
    Returns:
        df: dataframe with numbers
    """
    
    for c in col_name:
        df[c] = [string_to_number(x) for x in df[c]]
    return df

In [94]:
# Convert a number in accounting format from string to float
def string_to_number(s):
    """
    Convert number in accounting format from string to float.

    Args:
        s: number as string in accounting format
    Returns:
        float number
    """

    if type(s).__name__=="str":
        s = s.strip()
        if s =="-":
            s = 0
        else:
            s = s.replace(",","").replace("$","")
            if s.find("(")>=0 and s.find(")")>=0:
                s = s.replace("(","-").replace(")","")
    return float(s)

In [121]:
df = pd.read_csv('/Users/stevalang/Galvanize/0002_capstones/capstone1/data/bank/bank-full.csv', delimiter=';')

In [122]:
df.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


In [117]:
plotPerColumnDistribution(df, 10, 5)

NameError: name 'plotPerColumnDistribution' is not defined

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [106]:
df.dropna(inplace=True)

In [107]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,no,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,no,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,no,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,no,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,no,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,yes,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,yes,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,yes,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,yes,17,nov,508,4,-1,0,unknown,no


In [97]:
df.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [57]:
# look at the first five rows of the nfl_data file. 
# I can see a handful of unknown data already!
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [100]:
list_item = []
for col in df.columns:
    list_item.append([col, df[col].dtype, df[col].isna().sum(), round((df[col].isna().sum()/len(df[col]))*100,2),
                      df[col].nunique(), list(df[col].sample(5).drop_duplicates().values)])

dfDesc = pd.DataFrame(columns=['feature', 'data_type', 'null', 'nulPct', 'unique', 'uniqueSample'],data=list_item)

In [101]:
dfDesc

Unnamed: 0,feature,data_type,null,nulPct,unique,uniqueSample
0,age,int64,0,0.0,77,"[51, 48, 29, 59, 47]"
1,job,object,0,0.0,12,"[unemployed, unknown, management, retired, blu..."
2,marital,object,0,0.0,3,"[single, married]"
3,education,object,0,0.0,4,"[primary, tertiary, secondary]"
4,default,object,0,0.0,2,"[yes, no]"
5,balance,int64,0,0.0,7168,"[-392, -236, 3217, 249, 507]"
6,housing,object,0,0.0,2,"[no, yes]"
7,loan,object,0,0.0,2,"[no, yes]"
8,contact,object,0,0.0,2,"[yes, no]"
9,day,int64,0,0.0,31,"[12, 28, 23, 30, 9]"


In [110]:
for x in df[['age', 'job','marital','education','default','balance','housing','loan','contact','day',
             'month','duration','campaign', 'campaign','pdays', 'previous','poutcome', 'y']].columns:
    print(f'{x}: \n{df[x].unique()}\n')

age: 
[58 44 33 47 35 28 42 43 41 29 53 57 51 45 60 56 32 25 40 39 52 46 36 49
 59 37 50 54 55 48 24 38 31 30 27 34 23 26 61 22 21 20 66 62 83 75 67 70
 65 68 64 69 72 71 19 76 85 63 90 82 73 74 78 80 94 79 77 86 95 81 18 89
 84 87 92 93 88]

job: 
['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']

marital: 
['married' 'single' 'divorced']

education: 
['tertiary' 'secondary' 'unknown' 'primary']

default: 
['no' 'yes']

balance: 
[ 2143    29     2 ...  8205 14204 16353]

housing: 
['yes' 'no']

loan: 
['no' 'yes']

contact: 
['no' 'yes']

day: 
[ 5  6  7  8  9 12 13 14 15 16 19 20 21 23 26 27 28 29 30  2  3  4 11 17
 18 24 25  1 10 22 31]

month: 
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']

duration: 
[ 261  151   76 ... 1298 1246 1556]

campaign: 
[ 1  2  3  5  4  6  7  8  9 10 11 12 13 19 14 24 16 32 18 22 15 17 25 21
 43 51 63 41 26 28 55 50 38 23 20 2

## 1. Feature Selection

### Target Feature (Y)
The feature that will be labeled as the target is the y feature. Where in this feature explains which client has subscribed a term deposit? (binary: 'yes','no')


### Train Feature (X)
#### 1. Personal & Institusion Information Feature
From SBA Data Frame we can make a conclusion that there are seven column that give personal and instituion information
* LoanNr_ChkDgt as borrower Identifeir
* Name as Borrower Name
* City as Borrower City
* State as Borrower State
* Zip as Borrower Zip Code
* Bank as Bank Name
* BankState as Bank State
* FranchiseCode as FranchiseCode
* UrbanRural as information about business
*This information is used only to identify the borrower and the Bank. With that reason we not use this feature for future modeling.

#### 2. NAICS (North American industry classification system code)
NAICS is a classification system of several types of industries registered in America. The first two digits of NAICS explained the type of business industries.
NAICS itself has the potential for decision making. The type of industry will affect the company's performance in business. So that this feature will be used later in the model.

#### 5. Jobs Columns
This feature explains how many employees are in the related business and how many jobs were created and existed before.
In this dataset there are three features: NoEmp, CreateJob dan RetainedJob

In [104]:
def outliers(DataFrame,Series):
    iqr = Series.quantile(.75) - Series.quantile(.25)
    lower_bound = Series.quantile(.25) - (1.5*iqr)
    upper_bound = Series.quantile(.75) + (1.5*iqr)
    return DataFrame[(Series >= upper_bound) | (Series <= lower_bound)]

In [135]:
# get the number of missing data points per column

missing_values_count = df.isnull().sum()
missing_values_count = df.isna().sum()
missing_values_count

age              0
job            288
marital          0
education     1857
default          0
balance          0
housing          0
loan             0
contact      13020
day              0
month            0
duration         0
campaign         0
pdays            0
previous         0
poutcome     36959
y                0
dtype: int64

In [136]:
# how many total missing values do we have?
total_cells = np.product(df.shape)
total_missing = missing_values_count.sum()

In [137]:
total_missing


52124

In [142]:
df.shape

(45211, 17)

In [138]:
df.isnull().values.any()


True

In [139]:
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
percent_missing

6.781795684808617

In [140]:
df.dropna()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
24060,33,admin.,married,tertiary,no,882,no,no,telephone,21,oct,39,1,151,3,failure,no
24062,42,admin.,single,secondary,no,-247,yes,yes,telephone,21,oct,519,1,166,1,other,yes
24064,33,services,married,secondary,no,3444,yes,no,telephone,21,oct,144,1,91,4,failure,yes
24072,36,management,married,tertiary,no,2415,yes,no,telephone,22,oct,73,1,86,4,other,no
24077,36,management,married,tertiary,no,0,yes,no,telephone,23,oct,140,1,143,3,failure,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45199,34,blue-collar,single,secondary,no,1475,yes,no,cellular,16,nov,1166,3,530,12,other,no
45201,53,management,married,tertiary,no,583,no,no,cellular,17,nov,226,1,184,4,success,yes
45204,73,retired,married,secondary,no,2850,no,no,cellular,17,nov,300,1,40,8,failure,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes


In [143]:
df.shape

(45211, 17)

In [30]:
# just how much data did we lose?
# print("Columns in original dataset: %d \n" % nfl_data.shape[1])
# print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

In [38]:
result = df.contact == 'cellular'

In [41]:
result.mean()

0.647740594103205

In [42]:
result = df.contact == 'telephone'
result.mean()

0.06427639291322908

In [45]:
result = df.contact == 'unknown'
result.mean()

0.28798301298356593

In [46]:
df[result]

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45061,30,self-employed,single,secondary,no,1031,no,no,unknown,20,oct,7,1,-1,0,unknown,no
45062,58,retired,married,primary,no,742,no,no,unknown,20,oct,5,1,-1,0,unknown,no
45122,40,entrepreneur,single,tertiary,no,262,yes,yes,unknown,26,oct,17,1,-1,0,unknown,no
45135,53,blue-collar,married,primary,no,1294,no,no,unknown,28,oct,71,1,-1,0,unknown,no


In [44]:
df.contact.unique()

array(['unknown', 'cellular', 'telephone'], dtype=object)

In [49]:
df[result]= 'no'

In [50]:
df[result]

0        no
1        no
2        no
3        no
4        no
         ..
45206    no
45207    no
45208    no
45209    no
45210    no
Name: yes, Length: 45211, dtype: object

In [51]:
df[ df.contact == 'telephone' ] = 'yes'

In [52]:
df[ df.contact == 'cellular' ] = 'yes'

In [58]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [70]:
contact_col = df['contact'].copy()

In [71]:
contact_col[df.contact == 'unknown'] = 'no'

In [72]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [73]:
contact_col[ df.contact == 'telephone' ] = 'yes'
contact_col[ df.contact == 'cellular' ] = 'yes'

In [74]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [75]:
contact_col

0         no
1         no
2         no
3         no
4         no
        ... 
45206    yes
45207    yes
45208    yes
45209    yes
45210    yes
Name: contact, Length: 45211, dtype: object

In [76]:
df.contact = contact_col

In [77]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,no,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,no,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,no,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,no,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,no,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,yes,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,yes,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,yes,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,yes,17,nov,508,4,-1,0,unknown,no


In [79]:
df.job.unique()

array(['management', 'technician', 'entrepreneur', 'blue-collar',
       'unknown', 'retired', 'admin.', 'services', 'self-employed',
       'unemployed', 'housemaid', 'student'], dtype=object)

In [81]:
(df.job == 'management').mean()

0.20919687686624935

In [87]:
managment = (df.job == 'management').mean()
technician = (df.job == 'technician').mean()
entrepreneur = (df.job == 'entrepreneur').mean()
blue_collar = (df.job == 'blue-collar').mean()
retired = (df.job == 'retired').mean()
admin = (df.job == 'admin').mean()
services = (df.job == 'services').mean()
unemployed = (df.job == 'unemployed').mean()
self_employed = (df.job == 'self-employed').mean()
housemaid = (df.job == 'housemaid').mean()
student = (df.job == 'student').mean()
unknown = (df.job == 'unknown').mean()


In [111]:
round(managment,2)

0.21

In [89]:
technician

0.16803432792904383

In [90]:
housemaid

0.027426953617482472

In [91]:
student

0.02074716330096658

In [92]:
unknown

0.006370131162770122

In [123]:
df = pd.read_csv('/Users/stevalang/Galvanize/0002_capstones/capstone1/data/bank/bank-full.csv', delimiter=';',
                na_values = 'unknown')

In [134]:
df.head(20)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,,5,may,261,1,-1,0,,no
1,44,technician,single,secondary,no,29,yes,no,,5,may,151,1,-1,0,,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,,5,may,76,1,-1,0,,no
3,47,blue-collar,married,,no,1506,yes,no,,5,may,92,1,-1,0,,no
4,33,,single,,no,1,no,no,,5,may,198,1,-1,0,,no
5,35,management,married,tertiary,no,231,yes,no,,5,may,139,1,-1,0,,no
6,28,management,single,tertiary,no,447,yes,yes,,5,may,217,1,-1,0,,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,,5,may,380,1,-1,0,,no
8,58,retired,married,primary,no,121,yes,no,,5,may,50,1,-1,0,,no
9,43,technician,single,secondary,no,593,yes,no,,5,may,55,1,-1,0,,no


In [130]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()
missing_values_count

age              0
job            288
marital          0
education     1857
default          0
balance          0
housing          0
loan             0
contact      13020
day              0
month            0
duration         0
campaign         0
pdays            0
previous         0
poutcome     36959
y                0
dtype: int64

In [133]:
df.shape

(45211, 17)

In [132]:
# how many total missing values do we have?
total_cells = np.product(df.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
round(percent_missing, 2)

6.78

* Is this value missing because it wasn't recorded or because it doesn't exist?