In [1]:
import pandas as pd

### Reading Data and Understanding Your Dataset

Before you start working on the model training, it is important to understand the following things about your dataset:
- What is the unit of observation, i.e. what does each row represent? (Is each row a user? A state? A type of loan?)
- What is the dimension of the dataset (how many rows and columns are there)?
- What is my response variable? If there is one, I should consider supervised learning models, and I need to know whether the response is categorical or numeric. If there is no response variable, I should think about unsupervised models.
    - Hint: If the model outputs a decimal number (e.g. 2.5), does that make sense as a value for the response variable? If it does (e.g. $2.50, 2.5 feet, 2.5 hours), the response variable is likely numeric. If it does not (e.g. halfway between red and blue), the response is categorical.
- How do the independent variables/features/covariates (all different names for the same thing) correlate with the response variable?
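The questions above can be sketched in code. This is a minimal illustration on a made-up toy table, not the real dataset: `toy_df`, its column names, and its values are all assumptions chosen only to show the checks.

```python
import pandas as pd

# Hypothetical toy data; the columns and values are invented for illustration.
toy_df = pd.DataFrame({
    "age": [25, 40, 31, 58, 46, 37],
    "balance": [100.0, 2500.0, 40.0, 900.0, 1500.0, 300.0],
    "job": ["admin", "technician", "admin", "retired", "management", "technician"],
    "y": ["no", "yes", "no", "yes", "yes", "no"],
})

# Dimension of the dataset: (number of rows, number of columns).
n_rows, n_cols = toy_df.shape

# Is the response numeric or categorical? Check its dtype and how many
# distinct values it takes.
response = toy_df["y"]
is_numeric_response = pd.api.types.is_numeric_dtype(response)
n_classes = response.nunique()

# Correlation of the numeric features with a 0/1 encoding of the response.
y_encoded = (response == "yes").astype(int)
correlations = toy_df[["age", "balance"]].corrwith(y_encoded)

print(f"{n_rows} rows, {n_cols} columns")
print(f"numeric response: {is_numeric_response}, distinct values: {n_classes}")
print(correlations)
```

Encoding the response as 0/1 only makes sense here because it has two classes; for a response with more categories, a per-class summary (for example a `groupby` on the response) is usually easier to read than a correlation.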

In [6]:
# Depending on the format of your file, you may need to specify the sep (separator) argument.
# read_csv assumes a comma by default; here the separator is set to a semicolon.
train_df = pd.read_csv("train.csv", sep = ";")

print (f"""The shape of the dataset is {train_df.shape} with 
       {train_df.shape[0]} rows and {train_df.shape[1]} columns.""")
# This is called a formatted string literal (f-string). It is commonly used to
# format print output.

# Another way to code this is
# print("The shape of the dataset is", train_df.shape, "with", train_df.shape[0], "rows and", train_df.shape[1], "columns.")

# df.head(n) grabs the first n rows
# df.tail(n) grabs the last n rows

# I usually start with head() to get a better idea of the columns in my
# dataframe and of the datatype (integer, float, boolean, etc.) of each
# column.
train_df.head(10)

The shape of the dataset is (45211, 17) with 
       45211 rows and 17 columns.


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


Looking at the dataset, it seems like I am dealing with mostly categorical variables, i.e. variables whose values fall into a fixed set of categories (for example job, marital, and y).
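One way to check this impression rather than eyeball it is to split the columns by datatype. The sketch below uses a small hand-built frame (`sample_df`, a hypothetical name) holding a few columns from the first three rows shown above, so it runs without the CSV file.

```python
import pandas as pd

# A few columns from the first three rows of the output above.
sample_df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "marital": ["married", "single", "married"],
    "balance": [2143, 29, 2],
    "y": ["no", "no", "no"],
})

# Columns with a numeric dtype (integers and floats).
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()

# Everything else; in this frame these are the text (categorical) columns.
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

A numeric dtype does not guarantee a numeric variable (a column of integer codes can still be categorical), so the split is a starting point and not a final answer.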

There is also a clear y variable (also called target variable, response variable, etc.). The y variable is binary (a special type of categorical variable with just 2 possible outcomes). It is either: 
- yes (the customer was converted and the campaign was successful in bringing in the customer)
- no (the customer was not converted and the campaign was not successful).
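For a binary response it is worth checking how the two classes are distributed. The following is a minimal sketch on an invented series: the values in `y` are placeholders and say nothing about the real class balance of this dataset.

```python
import pandas as pd

# Hypothetical responses, invented for illustration only.
y = pd.Series(["no", "no", "yes", "no", "yes", "no", "no", "no"], name="y")

# Number of rows in each class.
class_counts = y.value_counts()

# Share of rows in each class (the shares sum to 1).
class_shares = y.value_counts(normalize=True)

# Many models expect a numeric target, so map the labels to 0/1.
y_binary = y.map({"no": 0, "yes": 1})

print(class_counts)
print(class_shares)
```

On the real data the same calls would be applied to `train_df["y"]`.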

Next, I want to check how many nulls (missing values) each column has and see what I can do to reduce them.

In [None]:
# isna() returns a boolean DataFrame of the same shape as the dataset
# (True where a value is null), and sum() then counts the True values in
# each column, i.e. the number of nulls per column.
train_df.isna().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
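Every count is zero, yet the first rows printed earlier contain the string "unknown" in columns such as job, education, contact, and poutcome. `isna()` does not flag these because they are ordinary strings. Whether "unknown" should be treated as missing is a judgment call and an assumption here, not something the dataset states. A minimal sketch of counting such placeholders on a small hand-built frame (`sample_df` is a hypothetical name):

```python
import pandas as pd

# Hand-built frame that mimics the "unknown" placeholders seen above.
sample_df = pd.DataFrame({
    "job": ["management", "unknown", "technician"],
    "education": ["tertiary", "unknown", "secondary"],
    "contact": ["unknown", "unknown", "unknown"],
    "balance": [2143, 1, 29],
})

# isna() finds nothing, because "unknown" is a regular string value.
null_counts = sample_df.isna().sum()

# Count the "unknown" placeholder in each non-numeric column.
text_cols = sample_df.select_dtypes(exclude="number")
unknown_counts = text_cols.eq("unknown").sum()

print(null_counts)
print(unknown_counts)
```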