# Starting with Jupyter notebooks

There are two kinds of cells: markdown and 'code' cells. This one is a markdown cell. The next one is also a `markdown` cell, followed by a 'code' cell.

# The Business context of the data

Our consultancy firm has been contracted by a credit card client. The client has offered us a data set with data from account holders. Each row corresponds to one customer's account data. The data contain monthly records over a period of six months. Data from 30,000 account holders are provided. The data are classified according to whether an account owner has defaulted after the six-month period. In practice, defaulting means that the account holder did not make the minimum required payment.

# The goal for data analytics

The client wants us to develop a model to predict whether, in the month after the six-month period of the historical data, an account holder will default or not.


# The data set

The data set is a modified version of the following:
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This is taken from the UCI Machine Learning Repository, a public repository of benchmarking datasets.
Have a look at that web page to find out more about the data characteristics. Can you summarise the key characteristics?

Next, let's turn our attention to the code needed to load the data. 

In [1]:
import pandas as pd

With the first command we imported the pandas data management library. 

Next we will use it to load our first data set. 

This is taken from the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


Data loading is done in the next code cell. 

In [3]:
df = pd.read_excel('../data/default_of_credit_card_clients.xls')

This presupposes that a directory `data` exists in the parent of the working directory (matching the path `../data/` in the code) and that the above Excel file is placed there. The file is available on Nestor.
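If the file is not where the code expects it, `pd.read_excel` stops with an error. The sketch below checks the path first and reports where it looked. The helper name `load_credit_data` is hypothetical and not part of the original notebook. Note also that reading `.xls` files with pandas requires the `xlrd` package to be installed.

```python
from pathlib import Path

import pandas as pd


def load_credit_data(path):
    """Load the Excel file, failing early with a clear message if it is missing.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_excel directly.
    """
    path = Path(path)
    if not path.is_file():
        # Show the absolute location so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Expected the data file at {path.resolve()}")
    return pd.read_excel(path)


# Usage, assuming the file sits in ../data relative to the notebook:
# df = load_credit_data('../data/default_of_credit_card_clients.xls')
```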

We start our exploratory analysis by inspecting the size of the data file.

In [4]:
df.shape

(30000, 25)

So this file has 30000 records (rows) and 25 fields (columns).
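Because `shape` is an ordinary `(rows, columns)` tuple, it can be unpacked into two variables for later use. A minimal sketch on a small made-up frame (the name `toy` and its values are illustrative only, not the real data):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({'ID': ['a', 'b', 'c'],
                    'LIMIT_BAL': [20000, 120000, 90000]})

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} records, {n_cols} fields")
```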

We can also execute direct value assignment commands and view their outcomes, as below.

In [5]:
a = 5

In [6]:
a

5

# Step 3: Check Data Integrity

The data set contains monthly credit card account data for a period of six months. Let's perform a quality check to ensure we have the data for the accounts as expected. The account ID distinguishes one account from another. We can count unique IDs in pandas with the `.nunique()` method. But first let's check our data structure. To do so, we look at the Index of the different columns in the table, using the `.columns` attribute of the pandas DataFrame to see the column names.

In [7]:
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')
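The `Index` returned by `.columns` can also be used programmatically, for instance to confirm that the columns we expect are present. A minimal sketch on a made-up empty frame (the names `toy` and `expected` are illustrative assumptions):

```python
import pandas as pd

# Toy frame with only a few of the real column names
toy = pd.DataFrame(columns=['ID', 'LIMIT_BAL', 'SEX'])

# Convert the Index to a plain Python list
column_names = toy.columns.tolist()

# Report any expected column that is absent from the frame
expected = ['ID', 'LIMIT_BAL', 'AGE']
missing = [name for name in expected if name not in toy.columns]
print(missing)
```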

We then go through the column headings.

The first column is the account ID. The final column, `default payment next month`, records whether the account defaulted. The other columns contain the data set 'features':

LIMIT_BAL: Amount of credit provided (in New Taiwan (NT) dollars), including individual consumer credit and the family (supplementary) credit.

SEX: Gender (1 = male; 2 = female).

Note

We will not be using gender to determine credit-worthiness (ethical use of data). 

EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).

AGE: Age (year).

PAY_1–PAY_6: A record of past payments. Past monthly payments, recorded from April to September, are stored in these columns.

PAY_1 represents the repayment status in September; PAY_2 is the repayment status in August; and so on up to PAY_6, which represents the repayment status in April.

The measurement scale for the repayment status is as follows: -1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months; and so on up to 8 = payment delay for 8 months; 9 = payment delay for 9 months and above.

BILL_AMT1–BILL_AMT6: Bill statement amount (in NT dollar).

BILL_AMT1 represents the bill statement amount in September; BILL_AMT2 represents the bill statement amount in August; and so on up to BILL_AMT6, which represents the bill statement amount in April.

PAY_AMT1–PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 represents the amount paid in September; PAY_AMT2 represents the amount paid in August; and so on up to PAY_AMT6, which represents the amount paid in April.
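The coded columns described above can be made easier to read by translating the codes into labels. The sketch below does this with `Series.map` on a made-up frame. The dictionaries follow the code lists given above, while the names `toy`, `education_labels` and `marriage_labels` are illustrative assumptions. Any code that is not in a dictionary is mapped to `NaN`, which helps to spot undocumented values.

```python
import pandas as pd

# Code-to-label dictionaries taken from the data description above
education_labels = {1: 'graduate school', 2: 'university',
                    3: 'high school', 4: 'others'}
marriage_labels = {1: 'married', 2: 'single', 3: 'others'}

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({'EDUCATION': [2, 1, 3], 'MARRIAGE': [1, 2, 3]})

# map() looks up each code in the dictionary; unknown codes become NaN
toy['EDUCATION_LABEL'] = toy['EDUCATION'].map(education_labels)
toy['MARRIAGE_LABEL'] = toy['MARRIAGE'].map(marriage_labels)
print(toy)
```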
    
We can take a look at the first 5 data rows by invoking the 'head()' method of the data frame. 

In [8]:
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,798fc410-45c1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,8a8c8f3b-8eb4,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,85698822-43f5,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,0737c11b-be42,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,3b7f77cc-dbc0,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [9]:
df.shape

(30000, 25)

Earlier, we saw the type of data that the ID column contains. It appears to be a unique identifier. But is it? Let's now count the unique IDs.

In [10]:
df['ID'].nunique()

29687

That's interesting! We have fewer unique IDs than rows, so we clearly have duplicates! But we don't yet know how many IDs are repeated, or how many times. We can check the number of occurrences of each ID by counting them, as follows:

In [11]:
id_counts = df['ID'].value_counts()
id_counts.head()

ad23fe5c-7b09    2
1fb3e3e6-a68d    2
89f8f447-fca8    2
7c9b7473-cc2f    2
90330d02-82d9    2
Name: ID, dtype: int64

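This wasn't too helpful. `value_counts()` sorts by frequency, so `head()` only showed five of the IDs that occur twice. To summarise all IDs, we can count how often each number of occurrences appears: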

In [12]:
id_counts.value_counts()

1    29374
2      313
Name: ID, dtype: int64

So now we know that we have 29374 IDs which appear only once, but we also have 313 which appear twice!
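These figures are consistent with what we saw earlier: 29374 + 313 = 29687 unique IDs, and 29374 × 1 + 313 × 2 = 30000 rows. The sketch below repeats the same bookkeeping on a small made-up frame and then selects the rows whose ID is repeated, which is one possible next step and not necessarily the approach used later in this notebook. The name `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two repeated IDs ('b' and 'd'); values are made up
toy = pd.DataFrame({'ID': ['a', 'b', 'b', 'c', 'd', 'd'],
                    'AGE': [24, 26, 26, 34, 37, 0]})

id_counts = toy['ID'].value_counts()         # occurrences per ID
counts_of_counts = id_counts.value_counts()  # how many IDs occur once, twice, ...

# Every row must be accounted for: sum of (occurrences * number of IDs)
total_rows = sum(times * n_ids for times, n_ids in counts_of_counts.items())
print(total_rows == len(toy))

# Rows whose ID appears more than once; keep=False marks every copy
dupe_rows = toy[toy['ID'].duplicated(keep=False)]
print(dupe_rows)
```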