### Introduction

This post introduces the data set and some commonly used tools, such as the pandas library.<br/>


 To begin, let's look at the tools we will be using:<br/>
 -Jupyter Notebook - The environment we use to run Python code<br/>
 -pandas - Useful for reading in and manipulating data<br/>
 -Matplotlib - Useful for constructing plots<br/>
 -seaborn - Another library that produces plots<br/>
 -scikit-learn (sklearn) - A very useful library for machine learning<br/>
 
 If you are interested in using these tools, a convenient way to get them is to install the Anaconda distribution.
 [Anaconda](https://www.anaconda.com/download/#macos)


In [1]:
#This loads the relevant libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

Now that we have loaded the libraries, we need to load our data.<br/>
The data is from the UCI Machine Learning Repository. This is a great place to locate fairly clean data to test out software pipelines. The particular data set we will be using is the Bank Marketing Data Set. [Bank Marketing](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) 

We will use the pandas library to read in and examine the data.

In [2]:
#This is where we import the data using pandas (the file is semicolon-separated)
bank_df = pd.read_csv("bank-additional-full.csv", delimiter=";")
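
To see why the delimiter argument matters, here is a minimal sketch that reads a small in-memory sample instead of the real file. The three-column sample text and the names `sample_csv`, `wrong_df`, and `right_df` are assumptions made for illustration; only `pd.read_csv` and its `sep` argument (an alias of `delimiter`) come from pandas itself.

```python
import io
import pandas as pd

# Illustrative sample in a semicolon-separated layout (not the real file).
sample_csv = io.StringIO(
    "age;job;marital\n"
    "56;housemaid;married\n"
    "57;services;married\n"
)

# With the default comma separator, each line is read as one single column.
wrong_df = pd.read_csv(sample_csv)
sample_csv.seek(0)  # rewind the in-memory file before reading it again

# Passing sep=";" (equivalent to delimiter=";") splits the fields correctly.
right_df = pd.read_csv(sample_csv, sep=";")

print(wrong_df.shape)  # one column: the separator was not recognised
print(right_df.shape)  # three columns: age, job, marital
```

If a freshly loaded frame reports a single column, a wrong separator is a likely cause.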

Let's see how many rows and columns we will be working with.

In [5]:
bank_df.shape

(41188, 21)

We have 21 initial columns, or features, and 41188 records.

In [6]:
#Preview the first five rows of the data set
bank_df.head()

|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
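
The `...` column in the preview stands for columns that pandas hides when a frame is too wide to display. As a sketch, assuming you want to see every column, the `display.max_columns` option can be lifted temporarily with `pd.option_context`. The frame `wide_df` and its `col_N` names are made up for illustration.

```python
import pandas as pd

# wide_df is a hypothetical 25-column frame, used only to trigger truncation.
wide_df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(25)})

# Inside this block pandas shows every column instead of eliding the middle.
with pd.option_context("display.max_columns", None):
    full_text = repr(wide_df)
    print(full_text)
```

The option reverts to its previous value when the `with` block ends, so other output in the notebook is unaffected.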


Because many ML techniques require numeric input, we want to see which columns are numeric and which are categorical.

In [8]:
#Split the columns by dtype: numeric, boolean, and everything else
bank_num = bank_df.select_dtypes(include=['int','float'])
bank_bool = bank_df.select_dtypes(include=['bool'])
bank_other = bank_df.select_dtypes(exclude=['bool','int','float'])
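
The same three-way split can be tried on a toy frame to see what each call returns. This is a sketch under assumptions: `toy_df` and its `is_flagged` column are hypothetical, and it swaps in the `"number"` selector, which pandas documents as matching all numeric dtypes, in place of the `['int','float']` list used in this post.

```python
import pandas as pd

# toy_df is a made-up frame with one column of each kind.
toy_df = pd.DataFrame({
    "age": [56, 57, 37],                           # integer column
    "euribor3m": [4.857, 4.857, 4.857],            # float column
    "is_flagged": [True, False, True],             # hypothetical boolean column
    "job": ["housemaid", "services", "services"],  # string column
})

toy_num = toy_df.select_dtypes(include="number")              # ints and floats
toy_bool = toy_df.select_dtypes(include="bool")               # booleans only
toy_other = toy_df.select_dtypes(exclude=["number", "bool"])  # everything else

print(list(toy_num.columns))
print(list(toy_bool.columns))
print(list(toy_other.columns))
```

Every column lands in exactly one of the three frames, which is a handy sanity check: the column counts should add up to the width of the original frame.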

In [12]:
#Let's look at the numeric columns
print(bank_num.shape)
x = bank_num.columns
print(x)
bank_num.head()

(41188, 10)
Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype='object')


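|   | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 261 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 1 | 57 | 149 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 2 | 37 | 226 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 3 | 40 | 151 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 4 | 56 | 307 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |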


In [13]:
#Let's look at the boolean columns
print(bank_bool.shape)
x = bank_bool.columns
print(x)
bank_bool.head()

(41188, 0)
Index([], dtype='object')


0
1
2
3
4


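This data set does not contain any boolean features. Yes/no columns such as housing and loan are stored as strings, so they show up with the other non-numeric columns.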
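
An empty boolean selection does not mean the data has no yes/no information. As a rough sketch, the snippet below looks for string columns whose values all fall in a small set of flag-like labels. The frame `toy_df`, the set `flag_values`, and the list `flag_like` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# toy_df mimics three of the string columns from the preview (illustrative only).
toy_df = pd.DataFrame({
    "job": ["housemaid", "services", "services"],
    "default": ["no", "unknown", "no"],
    "housing": ["no", "no", "yes"],
})

# No column has a boolean dtype, so the selection has zero columns.
print(toy_df.select_dtypes(include="bool").shape)

# Assumed set of labels that mark a column as flag-like.
flag_values = {"yes", "no", "unknown"}

# Keep the columns whose distinct values are a subset of flag_values.
flag_like = [
    col for col in toy_df.columns
    if set(toy_df[col].unique()) <= flag_values
]
print(flag_like)
```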

In [14]:
#Let's look at the remaining (non-numeric) columns
print(bank_other.shape)
x = bank_other.columns
print(x)
bank_other.head()

(41188, 11)
Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'poutcome', 'y'],
      dtype='object')


|   | job | marital | education | default | housing | loan | contact | month | day_of_week | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | nonexistent | no |
| 1 | services | married | high.school | unknown | no | no | telephone | may | mon | nonexistent | no |
| 2 | services | married | high.school | no | yes | no | telephone | may | mon | nonexistent | no |
| 3 | admin. | married | basic.6y | no | no | no | telephone | may | mon | nonexistent | no |
| 4 | services | married | high.school | no | no | yes | telephone | may | mon | nonexistent | no |
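
A quick way to get a feel for categorical columns is to count their distinct values. The sketch below does this on a toy frame; `toy_other` here is a made-up two-column stand-in, not the full `bank_other` frame, and `job_counts` is an illustrative name.

```python
import pandas as pd

# toy_other is an illustrative stand-in for two of the categorical columns.
toy_other = pd.DataFrame({
    "job": ["housemaid", "services", "services", "admin.", "services"],
    "education": ["basic.4y", "high.school", "high.school", "basic.6y", "high.school"],
})

# Number of distinct values in each column.
print(toy_other.nunique())

# How often each job label occurs, most frequent first.
job_counts = toy_other["job"].value_counts()
print(job_counts)
```

Columns with only a handful of distinct values are usually the easiest to convert to numeric form.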


In our next post we will see how to transform categorical features into numeric features.