# Exercise 5: Exploring and Cleaning the Data

We start from the output of the first stage of data cleaning. It was saved as a CSV file, so we read it with the pandas read_csv function.

In [2]:
import pandas as pd

df_clean_1 = pd.read_csv('../data/df_clean_1.csv')

Let us examine the characteristics of the data, using the info method of the data frame. 

In [3]:
df_clean_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29685 entries, 0 to 29684
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   ID                          29685 non-null  object
 1   LIMIT_BAL                   29685 non-null  int64 
 2   SEX                         29685 non-null  int64 
 3   EDUCATION                   29685 non-null  int64 
 4   MARRIAGE                    29685 non-null  int64 
 5   AGE                         29685 non-null  int64 
 6   PAY_1                       29685 non-null  object
 7   PAY_2                       29685 non-null  int64 
 8   PAY_3                       29685 non-null  int64 
 9   PAY_4                       29685 non-null  int64 
 10  PAY_5                       29685 non-null  int64 
 11  PAY_6                       29685 non-null  int64 
 12  BILL_AMT1                   29685 non-null  int64 
 13  BILL_AMT2                   29685 non-null  in

So our data file has 29,685 records in 25 columns, and every column shown reports 29,685 non-null values. There are no empty (missing) values in the strict sense, but we do not yet know whether all of the values are meaningful. The columns are of type int64, with the exceptions of ID and PAY_1. We already know about the ID field, so let us take a look at PAY_1 now.

According to the data description at:
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
we are only expecting numbers, so why does PAY_1 appear as object? Let's have a look at the first few rows of the DataFrame using the head method:

In [4]:
df_clean_1.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,798fc410-45c1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,8a8c8f3b-8eb4,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,85698822-43f5,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,0737c11b-be42,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,3b7f77cc-dbc0,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
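
One way to see how a column that should hold only numbers ends up with a non-numeric type is to reproduce the effect on a toy file. The sketch below is not part of the exercise data: the miniature CSV text and the names `toy_csv`, `toy`, `non_numeric_columns` and `bad_values` are made up for illustration. It assumes that a single text entry is enough to stop pandas from parsing a column as numbers (reported as object in the output above; the exact label may depend on the pandas version), and it uses `pd.to_numeric` with `errors='coerce'` to locate the offending entries.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: PAY_1 holds one text entry, PAY_2 is purely numeric.
toy_csv = io.StringIO(
    "ID,PAY_1,PAY_2\n"
    "a,2,2\n"
    "b,-1,0\n"
    "c,Not available,0\n"
)
toy = pd.read_csv(toy_csv)

# Columns that pandas could not parse as numbers.
non_numeric_columns = toy.select_dtypes(exclude='number').columns.tolist()

# Entries of PAY_1 that cannot be converted to a number become NaN with
# errors='coerce'; selecting those rows reveals the offending values.
as_numbers = pd.to_numeric(toy['PAY_1'], errors='coerce')
bad_values = list(toy.loc[as_numbers.isna(), 'PAY_1'].unique())

print(toy.dtypes)
print(non_numeric_columns)
print(bad_values)
```

ID is expected to be non-numeric, so the only surprise in this toy frame is PAY_1, which mirrors the situation in the real data.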


Let us focus on PAY_1, the column listed at position 6 in the info output.

In [5]:
df_clean_1['PAY_1'].head(5)

0     2
1    -1
2     0
3     0
4    -1
Name: PAY_1, dtype: object

The integers on the left of the output are the DataFrame index, which is simply consecutive integers starting with 0, as expected. The column on the right is the payment status. According to the data description, it should take the following values:
" -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above."

However, we notice that several records have a value of 0, which the description does not list. This indicates that something is not quite right with the data in this column. Let's find out more. We will use the value_counts method to count the occurrences of each unique value in this column.


In [6]:
df_clean_1['PAY_1'].value_counts()

0                13087
-1                5047
1                 3261
Not available     3021
-2                2476
2                 2378
3                  292
4                   63
5                   23
8                   17
6                   11
7                    9
Name: PAY_1, dtype: int64

So we already see two values that were not expected, namely 0 and -2. Furthermore, pandas imported this column as object rather than integer. The reason is the 'Not available' string, which appears in 3,021 records, so this is a case of missing values in multiple records. There are many different ways of dealing with missing data. We will not use any of the more sophisticated ones now; we will simply discard the records with missing values. We will do so again with filtering, retaining all records that do not have a missing value in the PAY_1 column. Here is how:

In [7]:
valid_pay_1_mask = df_clean_1['PAY_1'] != 'Not available'

Just for a quick first check, let us inspect the first 5 records: 

In [8]:
valid_pay_1_mask[0:5]

0    True
1    True
2    True
3    True
4    True
Name: PAY_1, dtype: bool

That's good: none of the first five records contains a missing value. Since valid_pay_1_mask is a Boolean Series, we can sum it to count the valid records (each True counts as 1), as below:

In [9]:
valid_pay_1_mask.sum()

26664

So 26,664 of the 29,685 records have no missing value in PAY_1. Let's copy them to a "second stage" cleaned DataFrame: we use .loc with the mask to select the rows of the first-stage data that do not have a missing value, and store a copy of them as df_clean_2.

In [10]:
df_clean_2 = df_clean_1.loc[valid_pay_1_mask,:].copy()

In [11]:
df_clean_2.shape

(26664, 25)
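
The filtering steps above can be tried out on a small frame. This is a minimal sketch on made-up data, with hypothetical names (`toy`, `valid_mask`, `n_valid`, `toy_clean`), not the exercise data itself. It shows the three ideas used so far: building a Boolean mask, summing it to count the rows kept, and selecting those rows with `.loc` followed by `.copy()`.

```python
import pandas as pd

# Hypothetical stand-in for the first-stage data; PAY_1 is read as text.
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'PAY_1': ['2', 'Not available', '0', '-1'],
})

# Boolean mask: True where PAY_1 holds a usable value.
valid_mask = toy['PAY_1'] != 'Not available'

# Each True counts as 1, so the sum is the number of rows that will be kept.
n_valid = int(valid_mask.sum())

# .loc with a Boolean mask keeps only the True rows; .copy() returns an
# independent DataFrame rather than a selection tied to the original.
toy_clean = toy.loc[valid_mask, :].copy()

print(n_valid)
print(toy_clean.shape)
print(list(toy_clean.index))
```

Note that the filtered frame keeps the original index labels, so the labels are no longer consecutive after rows are dropped.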

Let's now inspect the value counts of PAY_1 in the cleaned data.

In [12]:
df_clean_2['PAY_1'].value_counts()

0     13087
-1     5047
1      3261
-2     2476
2      2378
3       292
4        63
5        23
8        17
6        11
7         9
Name: PAY_1, dtype: int64

So we have now removed the records containing a 'Not available' value. But the PAY_1 column is still stored as object, so we convert it to int64 with the .astype method. We then check the data types of PAY_1 and PAY_2 and compare them.

In [13]:
df_clean_2['PAY_1'] = df_clean_2['PAY_1'].astype('int64')

In [14]:
df_clean_2[['PAY_1', 'PAY_2']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26664 entries, 0 to 29684
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   PAY_1   26664 non-null  int64
 1   PAY_2   26664 non-null  int64
dtypes: int64(2)
memory usage: 624.9 KB
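
The order of the two steps matters: the conversion only works once the placeholder text is gone. The sketch below illustrates this on a made-up Series (the names `pay`, `conversion_failed` and `pay_int` are hypothetical), assuming the column is stored as object with string entries, as in the output shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for PAY_1 before cleaning, stored as object.
pay = pd.Series(['2', '-1', 'Not available', '0'], name='PAY_1', dtype=object)

# Converting directly fails: 'Not available' is not a valid integer literal.
try:
    pay.astype('int64')
    conversion_failed = False
except ValueError:
    conversion_failed = True

# After filtering out the placeholder, the same conversion succeeds.
pay_int = pay[pay != 'Not available'].astype('int64')

print(conversion_failed)
print(pay_int.dtype)
```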


We see that both columns now have the same data type, int64, and contain no null values. But we still cannot explain the occurrence of the values -2 and 0 in PAY_1. The response we got when we asked the provider of the data was:
-2 indicates the account started that month with 0 credit and balance.
-1 indicates the balance was paid in full.
0 indicates that at least the minimum payment was made, but there is still an outstanding balance.
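
For later reference, the codes can be collected in a small lookup. The sketch below is only an illustration: the names `PAY_STATUS_LABELS` and `describe_pay_status` are hypothetical, and the label texts paraphrase the data description and the provider's reply quoted above.

```python
# Labels for the codes explained by the data provider (illustrative wording).
PAY_STATUS_LABELS = {
    -2: 'account started that month with 0 credit and balance',
    -1: 'balance paid in full',
    0: 'at least the minimum payment made, balance still outstanding',
}


def describe_pay_status(code):
    """Return a human-readable label for a PAY_1 code."""
    if code in PAY_STATUS_LABELS:
        return PAY_STATUS_LABELS[code]
    # Codes 1 to 8 give the payment delay in months, per the data description.
    if 1 <= code <= 8:
        return f'payment delayed by {code} month(s)'
    if code == 9:
        return 'payment delayed by nine months or more'
    raise ValueError(f'unexpected PAY_1 code: {code}')


print(describe_pay_status(-2))
print(describe_pay_status(3))
```

Raising an error for anything outside the known range is a design choice here, so that an unexpected code is noticed rather than silently labelled.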
So there is scope for further work with the data, but in the meantime, let's export the outcome of the second stage of data cleaning.

In [15]:
df_clean_2.to_csv('../Data/df_clean_2.csv', index=False)
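
The `index=False` argument keeps the row labels out of the file. A quick way to see its effect is a round trip through a temporary file, sketched below on made-up data (the frame `toy` and the file names are hypothetical, and nothing is written to the exercise folders).

```python
import os
import tempfile
import pandas as pd

# Hypothetical filtered frame: the index has a gap, as after dropping rows.
toy = pd.DataFrame({'ID': ['a', 'c'], 'PAY_1': [2, 0]}, index=[0, 2])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Without the index: only the data columns are written.
    path_no_index = os.path.join(tmp_dir, 'toy_no_index.csv')
    toy.to_csv(path_no_index, index=False)
    reloaded = pd.read_csv(path_no_index)

    # With the index (the default): the labels come back as an extra column.
    path_with_index = os.path.join(tmp_dir, 'toy_with_index.csv')
    toy.to_csv(path_with_index)
    reloaded_with_index = pd.read_csv(path_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
print(reloaded['PAY_1'].dtype)
```

After the round trip, the reloaded frame gets a fresh consecutive index, and PAY_1 is read back as an integer column because no text entries remain.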