In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import os
import random

First let's look at `columns_description.csv`, which seems interesting...

In [None]:
# desc_df = pd.read_csv('../input/credit-card/columns_description.csv')

BUT! If we run the commented out line it gives us an error. Turns out columns_description.csv is actually a weirdly formatted .xls file. You can download it from the main dataset page: https://www.kaggle.com/mishra5001/credit-card?select=columns_description.csv, and analyse it, but otherwise let's just leave it be.
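If you ever want to check a suspicious file like this yourself, one option is to peek at its first few bytes. The sketch below is not part of the original analysis: `guess_format_from_header` and `sniff_file` are hypothetical helper names, and it only checks the well-known signatures for legacy `.xls` (OLE2 container) and `.xlsx` (zip archive) files.

```python
# Hypothetical helpers (not from the original notebook): guess a file's
# real format from its leading "magic" bytes instead of its extension.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 container)
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are zip archives


def guess_format_from_header(head):
    """Return a rough format guess from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown (possibly plain text)"


def sniff_file(path):
    """Read the first 8 bytes of `path` and guess the format."""
    with open(path, "rb") as f:
        return guess_format_from_header(f.read(8))
```

Something like `sniff_file('../input/credit-card/columns_description.csv')` would then give a hint about what the file really is. An "unknown" result doesn't prove the file is a clean CSV, it just means neither signature matched.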

On to the next file

In [None]:
df_app = pd.read_csv('../input/credit-card/application_data.csv')

Great, it's actually a CSV file. And now we've used `pandas` to import it as a `DataFrame` instance (hence the "df" in the name). How big is it I wonder?

In [None]:
print(df_app.shape)

Ok, about 300,000 rows, 122 columns. That's a lot of data. What does it look like?

In [None]:
df_app.head()

The most important column here is `TARGET`. The description for this variable is:

> Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)

Basically, this is the variable that the dataset is centered around. If this were a modeling competition, it would be the one we had to try to predict. Looking at the description, it seems to boil down to "did the person who made this application later miss a bunch of payments?" Strictly speaking that's payment difficulty (default) rather than fraud, but I'll loosely call it "fraud" for the rest of this notebook.

Let's see how many people committed fraud.

In [None]:
print("number of people who committed fraud:", df_app["TARGET"].sum())
print("proportion of people who committed fraud:", df_app["TARGET"].sum() / len(df_app))
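A small aside: because `TARGET` only contains 0s and 1s, its mean is the same thing as the proportion of positives. Here's a minimal sketch on a made-up toy frame (the `toy` data is invented for illustration, it is not the real dataset):

```python
import pandas as pd

# Toy stand-in for df_app: 1 = payment difficulties, 0 = everything else.
toy = pd.DataFrame({"TARGET": [0, 0, 1, 0, 1, 0, 0, 0]})

positives = toy["TARGET"].sum()   # number of rows with TARGET == 1
rate = toy["TARGET"].mean()       # identical to sum() / len(toy) for a 0/1 column
shares = toy["TARGET"].value_counts(normalize=True)  # share of each class

print(positives, rate)
print(shares)
```

On the real data, `df_app["TARGET"].mean()` should give the same number as the second `print` above.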

Now let's look at the other columns. Are there any columns that look like they might be predictive of fraud?

In [None]:
list(df_app.columns)

`FLAG_OWN_CAR` seems a likely candidate. I bet people who own cars are less likely to miss payments. Let's see if this bears out in the data. First we'll create a new dataframe containing only those rows belonging to applicants who own cars.

In [None]:
car_owners = df_app[df_app["FLAG_OWN_CAR"] == "Y"]
print(car_owners.shape)

Looks like there are about 100,000 car owners in the data, around 1/3rd of the dataset. Now let's see if fraud is more or less common among car owners.

In [None]:
def show_fraud_prop(df):
    print("number of rows in dataframe:", len(df))
    print("number of positive targets:", df["TARGET"].sum())
    print("proportion of positive targets:", df["TARGET"].sum() / len(df))
    
show_fraud_prop(car_owners)

Hmm... apparently car owners are slightly less likely to commit fraud, but only slightly. How about men vs women? First let's see what values are in this column:

In [None]:
df_app["CODE_GENDER"].value_counts()

Apparently there are basically only two genders in this dataset. Now, which one commits more fraud? Probably the men, right?

In [None]:
print("male fraud")
show_fraud_prop(df_app[df_app["CODE_GENDER"] == "M"])
print("female fraud")
show_fraud_prop(df_app[df_app["CODE_GENDER"] == "F"])

Knew it lol. Ok, how about income? Let's use the `pandas` `describe()` method to get a better idea of the distribution of the `AMT_INCOME_TOTAL` column:

In [None]:
df_app["AMT_INCOME_TOTAL"].describe()

Nice not having to calculate all those stats manually. Ok now let's compare fraud among high earners to fraud among low earners.

In [None]:
# Thresholds are the 75% and 25% values from the describe() output above.
print("high earner fraud")
show_fraud_prop(df_app[df_app["AMT_INCOME_TOTAL"] > 2.025000e+05])

print("low earner fraud")
show_fraud_prop(df_app[df_app["AMT_INCOME_TOTAL"] < 1.125000e+05])

There isn't too much difference here... Ok, enough messing around, let's start plotting! We'll start simple. Let's create a bar chart of the gender data we were looking at earlier, using the Python plotting library `seaborn` (imported as `sns`).

In [None]:
sns.countplot(x="CODE_GENDER", data=df_app)

That was easy eh? As we can see, it's plotted the number of rows in the dataset with male vs female applicants. Let's try another categorical column:

In [None]:
sns.countplot(x="NAME_CONTRACT_TYPE", data=df_app)

Apparently cash loans are much more popular. In general, looking at graphs is much easier on the eyes than squinting at printouts, but the magic of plotting really gets started when you compare different variables on the same plot.

Below is one of my favorite hand-spun custom plotting functions. I'm going to use it to plot the average fraud rate across genders.

In [None]:
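def mean_count_plot(df, col, target, rc=None):
    # Bars (left y-axis): number of rows per category of `col`.
    # Points (right y-axis): mean of `target` per category, with error bars.
    sns.set(rc=rc or {'figure.figsize': (15, 10)})
    ax = sns.countplot(x=col, data=df)
    ax.tick_params(axis='x', labelrotation=80)
    ax.grid(False)
    ax2 = ax.twinx()  # second y-axis sharing the same x-axis
    # Pass ax=ax2 explicitly instead of relying on the "current" axes.
    sns.pointplot(x=col, y=target, data=df, color='black', legend=False, errwidth=0.5, ax=ax2)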

In [None]:
mean_count_plot(df_app, "CODE_GENDER", "TARGET")

Not bad eh? Much easier to parse than the `prints` we were doing earlier.
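If you do still want the raw numbers behind a plot like this, a `groupby` gets you the count and the mean of `TARGET` for every category in one go, instead of one filtered `print` per value. A minimal sketch on invented toy data (the rows below are made up, not taken from the dataset):

```python
import pandas as pd

# Made-up rows standing in for df_app.
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F", "M", "F", "F"],
    "TARGET":      [1,   0,   0,   0,   1,   0],
})

# One row per category: how many applicants, and what share have TARGET == 1.
summary = toy.groupby("CODE_GENDER")["TARGET"].agg(["count", "mean"])
print(summary)
```

The same line with `df_app` in place of `toy` should work for any categorical column, e.g. `FLAG_OWN_CAR` or `NAME_CONTRACT_TYPE`.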

But we're just getting started baby. Now let's try some scatter plots over some of the numerical columns using `matplotlib` imported as `plt`. 

In [None]:
plt.scatter(df_app["AMT_CREDIT"], df_app["AMT_INCOME_TOTAL"])

Hmm... that wasn't super informative because the graph had to include that one person with a crazy high income. Let's remove that outlier and plot again. 

To help us do this we're going to use the `matplotlib` `Axes` class. Basically an `Axes` is a single graph (not to be confused with `Axis`, which is just one x- or y-axis of a graph). Usually I use the `Axes` object rather than `plt` when I need to make a more complicated graph.

In [None]:
_, ax = plt.subplots()
ax.set_ylim((0, 2e7))
ax.scatter(df_app["AMT_CREDIT"], df_app["AMT_INCOME_TOTAL"])

Hmm... still not super informative because all the blue dots are overlapping. To get a bit more insight we'll lower the ceiling again and make the dots a little transparent.

In [None]:
_, ax = plt.subplots()
ax.set_ylim((0, 1e6))
ax.scatter(df_app["AMT_CREDIT"], df_app["AMT_INCOME_TOTAL"],  alpha=0.01)

There we go. By being a bit more careful with our plotting we've revealed a clear positive relationship between the amount of credit applied for and the total income of the applicant, which wasn't at all obvious beforehand.
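A plot can suggest a relationship, but it's worth putting a number on it too. The sketch below uses synthetic data (invented here, with one absurd income planted on purpose) to show how a single outlier can flatten the Pearson correlation, and two hedged ways around it: trimming at a quantile, or correlating ranks instead of raw values. The 99% cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two columns; none of these numbers come from the dataset.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=12, sigma=0.5, size=5000)
credit = 3 * income + rng.normal(0, 100_000, size=5000)
toy = pd.DataFrame({"AMT_INCOME_TOTAL": income, "AMT_CREDIT": credit})
toy.loc[0, "AMT_INCOME_TOTAL"] = 1e9  # plant one crazy-high income

# Pearson correlation on everything, outlier included.
raw_corr = toy["AMT_CREDIT"].corr(toy["AMT_INCOME_TOTAL"])

# Option 1: drop the top 1% of incomes, then correlate.
cap = toy["AMT_INCOME_TOTAL"].quantile(0.99)
trimmed = toy[toy["AMT_INCOME_TOTAL"] <= cap]
trimmed_corr = trimmed["AMT_CREDIT"].corr(trimmed["AMT_INCOME_TOTAL"])

# Option 2: correlate ranks (this is Spearman's rank correlation).
rank_corr = toy["AMT_CREDIT"].rank().corr(toy["AMT_INCOME_TOTAL"].rank())

print(raw_corr, trimmed_corr, rank_corr)
```

Running the same three calculations on `df_app` would tell you how strong the relationship in the plot above really is.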

To end I'd like to show you a custom scatter plotting function and use it to compare fraud and non-fraud applicants on the same axes.

In [None]:
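def split_plot(data, x, y, compcol, aalpha=0.1, balpha=0.1, xlim=None, ylim=None):
    # Scatter x vs y twice: rows where compcol is False/0 in blue (opacity aalpha),
    # then rows where compcol is True/1 in red (opacity balpha) drawn on top.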
    alphamap = {False: aalpha, True: balpha}
    colormap = {False: "tab:blue", True: "red"}
    
    for val in [False, True]:          
        plt.scatter(x, y, data=data[data[compcol] == val], alpha=alphamap[val], s=20, c=colormap[val])
    
    plt.xlabel(x)
    plt.ylabel(y)
    
    if xlim is not None:
        plt.xlim(xlim)
        
    if ylim is not None:
        plt.ylim(ylim)

In [None]:
split_plot(df_app,  "AMT_CREDIT", "LIVINGAPARTMENTS_MODE", "TARGET", aalpha=0.01, balpha=0.07, ylim=(0, 0.4))

Here we plot the amount of credit applied for against the mode of the number of living apartments in the building where the applicant lives (see the data explanation: https://www.kaggle.com/mishra5001/credit-card?select=columns_description.csv). Blue dots represent non-fraudulent instances, and red dots represent fraud.

This allows us to see that fraud seems to occur slightly more often when the number of living apartments is a little higher, though in general fraud seems to be evenly distributed through the data along these axes. 
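Eyeballing red vs blue density is a bit subjective, so here's one way to sanity-check a claim like this: cut the y-axis variable into bins and compare the mean of `TARGET` per bin. This is only a sketch on random toy data (so no real pattern should show up in it), and the choice of four quantile bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Random toy data standing in for the two columns used in the plot.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "LIVINGAPARTMENTS_MODE": rng.uniform(0, 0.4, size=2000),
    "TARGET": rng.integers(0, 2, size=2000),
})

# Split the column into 4 equally populated bins (quartiles).
toy["bin"] = pd.qcut(toy["LIVINGAPARTMENTS_MODE"], q=4)

# Row count and positive rate within each bin.
rate_by_bin = toy.groupby("bin", observed=True)["TARGET"].agg(["count", "mean"])
print(rate_by_bin)
```

On the real data, a rate that climbs from the lowest to the highest bin would back up what the scatter plot hints at. Note that `qcut` leaves rows with missing values out of every bin.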

That's all, now get plotting!

In [None]:
df_prev = pd.read_csv("../input/credit-card/previous_application.csv")

In [None]:
list(df_prev.columns)

In [None]:
df_prev.shape

In [None]:
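# Match on the SK_ID_CURR column of both tables; join(on=...) would look the
# IDs up in df_prev's row index instead. how="left" keeps every application.
df = df_app.merge(df_prev, on="SK_ID_CURR", how="left", suffixes=("", "_PREV"))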
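A note on combining two tables by a shared ID column: `DataFrame.join(other, on=...)` matches the caller's column against `other`'s index, while `DataFrame.merge(other, on=...)` matches the named column in both frames. The toy frames below (invented for illustration) show the difference:

```python
import pandas as pd

# Tiny made-up tables sharing an ID column and an overlapping column name.
apps = pd.DataFrame({"SK_ID_CURR": [100, 101], "AMT_CREDIT": [5000, 7000]})
prev = pd.DataFrame({"SK_ID_CURR": [101, 101, 100], "AMT_CREDIT": [1000, 2000, 3000]})

# join(on=...) looks up apps["SK_ID_CURR"] in prev's *index* (0, 1, 2),
# so IDs 100 and 101 find nothing and the _PREV columns are all NaN.
by_index = apps.join(prev, on="SK_ID_CURR", rsuffix="_PREV")

# merge(on=...) matches the SK_ID_CURR columns of both frames;
# an application with two previous applications yields two rows.
by_key = apps.merge(prev, on="SK_ID_CURR", how="left", suffixes=("", "_PREV"))

print(by_index)
print(by_key)
```

An equivalent alternative is `apps.join(prev.set_index("SK_ID_CURR"), on="SK_ID_CURR", rsuffix="_PREV")`. Either way, the result has one row per (application, previous application) pair, so it can be a lot longer than the applications table.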

In [None]:
df.shape

In [None]:
df

In [None]:
split_plot(df.sample(100000),  "AMT_CREDIT", "AMT_CREDIT_PREV", "TARGET", aalpha=0.01, balpha=0.07, ylim=(0, 4e5), xlim=(0, 4e6))