# Credit Card Transactions Data


<span style='color:darkblue;font-size:20px;font-family:sans-serif'> The credit card market is large: in 2019 there were about 45 billion U.S. general-purpose credit card transactions, accounting for a volume of $4 trillion ([ref](https://www.creditcards.com/credit-card-news/market-share-statistics/)). <br/> With such a large presence, cardholders are natural targets of credit card fraud and identity theft. Credit card fraud occurs when someone uses your credit card to make an unauthorized charge. The card may be cloned or stolen, its credentials may be leaked, and there are many other possibilities.</span>


<p align='left'>    
 <img src=https://media.giphy.com/media/d3mmdNnW5hkoUxTG/giphy.gif align= 'left' width=240px />
 <img src=https://media.giphy.com/media/hgjNPEmAmpCMM/giphy.gif align= 'left' width=240px />
 
<br/>
<br/>
<br/>
</p>  
<br/>
<br/>    



### Summary

<p>
<span align='left' style='color:darkblue;font-size:17px;font-family:sans-serif'> This is a starter notebook that walks through an Exploratory Data Analysis of the Credit Card Fraud Transactions Dataset. You will see some summary statistics, followed by some visualizations. 
</span>
</p>


### Loading the required libraries

In [None]:
import numpy as np 
import pandas as pd 
import os
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import requests

### Reading the datafile

In [None]:
df = pd.read_csv("../input/credit-card-transactions/credit_card_transactions-ibm_v2.csv")

### Sneak Peek of the data

In [None]:
df.head().style.background_gradient(cmap='Blues')

#### Dtypes of the columns

In [None]:
df.dtypes.to_frame(name='Type').T.style.set_properties(**{'background-color': 'deepskyblue'})

<span style='color:darkblue;font-size:17px'>Based on the default types assigned by pandas:</span>


We can see that the Amount and Time columns are object types, but Amount should be processed as a numeric column and Time can be decomposed into Hour and Minutes.

Similarly, the User and Card columns should ideally be converted to object type: each unique User value represents a different user, and each Card value is the index of a card that the user owns. The Card values do not identify a unique card number.

Additionally, Merchant Name, Zip and MCC are treated as numeric but should be processed as categorical.

However, because the dataset is large and has to fit in a Kaggle kernel for visualization, we will not convert these columns to object/category types. Instead, we will downcast all numeric data types to reduce memory consumption.
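As a rough illustration of why downcasting matters, the sketch below builds a small synthetic frame (the values and the row count are made up, not taken from the dataset) and compares memory usage before and after. It also checks that a target type can actually hold the data, since a type that is too narrow silently corrupts values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few numeric columns; the values are made up.
rng = np.random.default_rng(0)
n_rows = 100_000
toy = pd.DataFrame({
    "Year": rng.integers(1991, 2021, n_rows),   # int64 by default
    "Month": rng.integers(1, 13, n_rows),
    "Day": rng.integers(1, 29, n_rows),
    "Card": rng.integers(0, 9, n_rows),
})
before = toy.memory_usage(deep=True).sum()

# Downcast each column to the smallest type that still covers its range.
small = toy.astype({"Year": np.int16, "Month": np.uint8,
                    "Day": np.uint8, "Card": np.uint8})
after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")

# A type that is too narrow loses data: float16 cannot exceed 65504,
# so a five-digit zip code such as 98516 overflows to infinity.
f16_max = float(np.finfo(np.float16).max)
with np.errstate(over="ignore"):
    zip_as_f16 = np.float32(98516).astype(np.float16)
print(f16_max, zip_as_f16)
```

Checking `np.iinfo` or `np.finfo` against the column's minimum and maximum before casting is a cheap way to avoid this kind of silent overflow.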

So let's convert these columns to the desired data types:

In [None]:
df['Amount'] = df['Amount'].apply(lambda value: value.split("$")[1])
df['Hour'] = df['Time'].apply(lambda value: value.split(":")[0])
df['Minutes'] = df['Time'].apply(lambda value: value.split(":")[1])
df.drop(['Time'],axis=1,inplace=True)
convert_dict = {'Amount': np.float32,
                'Minutes': np.uint8,
                'Hour': np.uint8,
                'Year': np.int16,
                'Month': np.uint8,
                'Day': np.uint8,
                'Zip': np.float32,  # float16 overflows above 65504 and would corrupt 5-digit zip codes
                
#                 'MCC':'category',
#                 'Zip':'category',
#                 'Merchant Name':'category',
                'User':np.uint16,
                'Card':np.uint8
               }

df = df.astype(convert_dict)
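The row-wise `apply` calls above can be slow on a frame of this size. A possible alternative, sketched here on a few made-up rows in the assumed raw format (`$<amount>` and `HH:MM`), is to use the vectorized pandas string methods:

```python
import numpy as np
import pandas as pd

# Toy rows in the assumed raw format; the values are made up.
toy = pd.DataFrame({"Amount": ["$134.09", "$-12.50", "$7.00"],
                    "Time": ["06:21", "13:45", "23:05"]})

# Strip the literal dollar sign, then cast to a compact float type.
toy["Amount"] = toy["Amount"].str.replace("$", "", regex=False).astype(np.float32)

# Split "HH:MM" once into two columns instead of splitting twice per row.
toy[["Hour", "Minutes"]] = toy["Time"].str.split(":", expand=True).astype(np.uint8)
toy = toy.drop(columns="Time")

print(toy.dtypes)
```

Passing `regex=False` matters here because `$` is a regular-expression anchor and would otherwise not be treated as a literal character.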

#### The data now looks like this:  

We now have two more columns, Hour and Minutes, while Time no longer exists.
Zip is still stored as a float because it contains missing values, so its values are still displayed with a trailing decimal.

In [None]:
df.head()

### Basic Statistics

In [None]:
df.describe(include='all').fillna("").T.style

<span style='color:darkblue;font-size:17px'> Statistics Analysis </span>

From the basic statistics we can see that the dataset consists of **24,386,900 transactions from 2,000 unique users, each of whom owns at most 9 cards.**

The most common transaction type is the swipe transaction, and the most frequent zip code in this dataset is **98516 (a postal code in Thurston County, Washington)**.

The median amount is $30, roughly the cost of one grocery shopping trip for a single person, which is consistent with the most common merchant category code: 5411 (Grocery Stores and Supermarkets).

#### Missing Values and their proportion

In [None]:
missing_count = df.isna().sum()
missing_df = (pd.concat([missing_count.rename('Missing count'),
                     missing_count.div(len(df))
                          .rename('Missing ratio')],axis = 1)
             .loc[missing_count.ne(0)])
missing_df.style.background_gradient(cmap="Reds")

<span style='color:darkblue;font-size:17px'> Missing Value Analysis </span>

Based on how the data was generated, Merchant State and Zip are not present when a transaction is processed online. Additionally, Zip is missing for transactions that are not US based.

The Errors column is empty for successful transactions, and the majority of transactions in this dataset are processed without errors, which explains the high missing ratio for that column.
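One way to check this explanation is to cross-tabulate missingness against the transaction channel. The sketch below uses a handful of made-up rows that mimic the three cases described above (online, US in-person, non-US in-person); the column names follow the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows: two online, one US in-person, one non-US in-person.
toy = pd.DataFrame({
    "Merchant City": ["ONLINE", "La Verne", "Rome", "ONLINE"],
    "Merchant State": [np.nan, "CA", "Italy", np.nan],
    "Zip": [np.nan, 91750.0, np.nan, np.nan],
})

# Label each row by channel and by whether Zip is missing.
channel = toy["Merchant City"].eq("ONLINE").map({True: "online", False: "in person"})
zip_status = toy["Zip"].isna().map({True: "missing", False: "present"})

zip_vs_channel = pd.crosstab(channel, zip_status,
                             rownames=["channel"], colnames=["zip"])
print(zip_vs_channel)

# If the explanation holds, every online row lacks a Merchant State.
online_state_missing = toy.loc[channel.eq("online"), "Merchant State"].isna().all()
print(online_state_missing)
```

On the real data, any online row with a non-missing Zip, or any in-person US row without one, would show up as an unexpected non-zero cell in this table.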

### Visualizations

#### Distribution of transactions over the Months

In [None]:
fig = px.histogram(df, x="Month")
fig.update_layout(bargap=0.2, title="Transactions Distribution over the months")
fig.show()

#### Distribution of transactions over the Hours in a Day

In [None]:
fig_hour = px.histogram(df, x="Hour")
fig_hour.update_layout(bargap=0.09, title="Transactions Distribution over Hour")
fig_hour.show()

#### Fraudulent Transactions over the Years

In [None]:
df_year_fraud = df.loc[:,['Year','Is Fraud?']]

df_year_fraud = df_year_fraud.groupby(['Year'])['Is Fraud?'].value_counts().to_frame(name="Count")

unique_year_vals = df.Year.unique()

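# Collect one [year, no, yes] row per year; DataFrame.append was removed in pandas 2.0
rows = []

for year in unique_year_vals:
    try:
        no = df_year_fraud.loc[(year,'No')]['Count']
    except KeyError:  # no non-fraud transactions recorded for this year
        no = 0
    try:
        yes = df_year_fraud.loc[(year,'Yes')]['Count']
    except KeyError:  # no fraud transactions recorded for this year
        yes = 0
    rows.append([year, no, yes])

to_plot_df = pd.DataFrame(rows, columns=['Year', 'No', 'Yes'])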
    
to_plot_df['No'] = to_plot_df['No'].replace(0, np.nan)
to_plot_df['Yes'] = to_plot_df['Yes'].replace(0, np.nan)
to_plot_df['No'] = to_plot_df['No'].apply(lambda x: np.log10(x))
to_plot_df['Yes'] = to_plot_df['Yes'].apply(lambda x: np.log10(x))


fig = go.Figure(data=[
    go.Bar(name='Non-Fraud', x=to_plot_df.Year, y=to_plot_df.No),
    go.Bar(name='Fraud', x=to_plot_df.Year, y=to_plot_df.Yes)
])
fig.update_layout(barmode='group',title="Logarithmic Count of Fraud and Non-Fraud Transactions over the Years")
fig.show()
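The per-year table built in the loop above can also be produced in one step. This is a minimal sketch on made-up rows, assuming the same `Year` and `Is Fraud?` columns; `pd.crosstab` fills absent combinations with 0, so no exception handling is needed.

```python
import numpy as np
import pandas as pd

# Made-up rows; 2020 deliberately has no fraudulent transaction.
toy = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2019, 2019, 2020],
    "Is Fraud?": ["No", "No", "Yes", "No", "Yes", "No"],
})

# One row per year, one column per label; absent combinations become 0.
counts = pd.crosstab(toy["Year"], toy["Is Fraud?"])

# Replace 0 with NaN before taking the log so empty cells become gaps, not -inf.
log_counts = np.log10(counts.replace(0, np.nan))

print(counts)
print(log_counts)
```

The resulting `log_counts["No"]` and `log_counts["Yes"]` columns could be passed to the two `go.Bar` traces in place of the hand-built frame.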

#### Fraudulent and Non-fraudulent transactions by type of card use

In [None]:
plot = sns.catplot(x="Is Fraud?", col="Use Chip",data=df,kind="count", height=6,aspect=.7);
plot.fig.suptitle("Card Type and Fraud", size = 20, y=1.05);

#### Transactions per state in the US

In [None]:
us_transactions = df[~df['Merchant State'].isna()]

transactions_per_state = us_transactions.groupby(['Merchant State'],as_index=False).count()

fraud_count_per_state = us_transactions.groupby(['Merchant State', 'Is Fraud?']).size()

merchant_state_fraud_dict = fraud_count_per_state.to_dict()

# rename_axis + reset_index(name=...) yields the same column names across pandas versions
merchant_plot_df = (us_transactions['Merchant State'].value_counts()
                    .rename_axis('State')
                    .reset_index(name='Total_Transactions'))

merchant_plot_df['FraudPercent'] = merchant_plot_df['State'].apply(lambda x: merchant_state_fraud_dict.get((x,"Yes"),0))

merchant_plot_df['NonFraudPercent'] = merchant_plot_df['State'].apply(lambda x: merchant_state_fraud_dict.get((x,"No"),0))

merchant_plot_df['FraudPercent'] = round(100 * (merchant_plot_df['FraudPercent'] / merchant_plot_df['Total_Transactions']),2)

merchant_plot_df['NonFraudPercent'] = round(100 * (merchant_plot_df['NonFraudPercent'] / merchant_plot_df['Total_Transactions']),2)

for col in merchant_plot_df.columns:
    merchant_plot_df[col] = merchant_plot_df[col].astype(str)
    

merchant_plot_df['text'] = "Fraudulent: " + merchant_plot_df['FraudPercent'] +  "% " +'<br>' +              "Non-Fraudulent: " + merchant_plot_df['NonFraudPercent'] + "% "+ '<br>'


fig = go.Figure(data=go.Choropleth(
    locations=merchant_plot_df['State'],
    z=merchant_plot_df['Total_Transactions'].astype(float),
    locationmode='USA-states',
    colorscale='deep',
    autocolorscale=False,
    text=merchant_plot_df['text'], # hover text
    marker_line_color='white', # line markers between states
    colorbar_title="Credit Card Transactions"
))

fig.update_layout(
    title_text='Credit Card Transactions per US state',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),
)

fig.show()
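The fraud percentages shown in the hover text can also be computed without the intermediate dictionary. The sketch below works on made-up rows and relies on the fact that the mean of a boolean column is the share of `True` values; the output column names mirror the ones used above.

```python
import pandas as pd

# Made-up rows: four transactions in CA (one fraudulent), two in TX.
toy = pd.DataFrame({
    "Merchant State": ["CA", "CA", "CA", "CA", "TX", "TX"],
    "Is Fraud?": ["No", "No", "No", "Yes", "No", "No"],
})

# Boolean fraud flag grouped by state: count gives totals, mean gives the share.
is_fraud = toy["Is Fraud?"].eq("Yes")
summary = (is_fraud.groupby(toy["Merchant State"])
           .agg(Total_Transactions="count", FraudPercent="mean"))

summary["FraudPercent"] = (100 * summary["FraudPercent"]).round(2)
summary["NonFraudPercent"] = (100 - summary["FraudPercent"]).round(2)
print(summary)
```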

#### Transactions Outside the US

In [None]:
df_nonusa = df[(df.Zip.isnull()) & (df['Merchant City'] != 'ONLINE')]
print(f"Transactions not in the United States: {len(df_nonusa)}")

In [None]:
df_usa = df[(~df.Zip.isnull()) & (df['Merchant City'] != 'ONLINE')]
print(f"Transactions in the United States: {len(df_usa)}")

In [None]:
# Name the columns explicitly so the result does not depend on the pandas version
transaction_count = (df_nonusa['Merchant State'].value_counts()
                     .rename_axis('Country')
                     .reset_index(name='Count'))

In [None]:
code_list = []

for country in transaction_count.Country.values:
    small_country = country.lower()
    try:
        response = requests.get("https://restcountries.eu/rest/v2/name/" + small_country + "?fullText=true")
        code_list.append(response.json()[0]['alpha3Code'])
    except Exception:  # request failed or the response had no alpha3Code
        code_list.append("")

In [None]:
transaction_count["ISO_Code"] = code_list

##### Countries whose 3-letter ISO codes are not found

In [None]:
transaction_count[transaction_count["ISO_Code"] == ""]

Manually assigning 3-letter ISO codes to the countries the API could not find:

In [None]:
transaction_count.at[3, "ISO_Code"] = "GBR"
transaction_count.at[13, "ISO_Code"] = "KOR"
transaction_count.at[17, "ISO_Code"] = "BHS"
transaction_count.at[47, "ISO_Code"] = "RUS"
transaction_count.at[54, "ISO_Code"] = "VAT"
transaction_count.at[55, "ISO_Code"] = "MKD"
transaction_count.at[57, "ISO_Code"] = "VNM"
transaction_count.at[91, "ISO_Code"] = "VEN"
transaction_count.at[97, "ISO_Code"] = "MDA"
transaction_count.at[99, "ISO_Code"] = "SYR"
transaction_count.at[107, "ISO_Code"] = "MMR"  # "BUR" is the withdrawn ISO 3166-1 code for Burma (now Myanmar)
transaction_count.at[110, "ISO_Code"] = "IRN"
transaction_count.at[116, "ISO_Code"] = "FSM"
transaction_count.at[123, "ISO_Code"] = "KOS"
transaction_count.at[124, "ISO_Code"] = "TLS"
transaction_count.at[136, "ISO_Code"] = "TZA"
transaction_count.at[140, "ISO_Code"] = "BRN"
transaction_count.at[148, "ISO_Code"] = "COG"
transaction_count.at[166, "ISO_Code"] = "COD"
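Assigning by row position works only as long as the order of `transaction_count` and the API results stay the same. A more robust variant, sketched here on made-up rows, keys the overrides by country name instead. The `manual_codes` dictionary is hypothetical, and its spellings are assumed to match the names in the data.

```python
import pandas as pd

# Made-up rows; "" marks a country whose lookup failed.
toy = pd.DataFrame({
    "Country": ["Mexico", "United Kingdom", "South Korea"],
    "ISO_Code": ["MEX", "", ""],
})

# Hypothetical override table keyed by name; spellings must match the data.
manual_codes = {"United Kingdom": "GBR", "South Korea": "KOR"}

# Fill only the rows that are still empty, leaving successful lookups alone.
missing = toy["ISO_Code"].eq("")
toy.loc[missing, "ISO_Code"] = (toy.loc[missing, "Country"]
                                .map(manual_codes)
                                .fillna(""))
print(toy)
```

Any country that is in neither the API results nor the override table keeps an empty code, so it can still be found with the same `== ""` filter used earlier.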

#### Converting to Logarithmic Scale

In [None]:
transaction_count['Count'] = np.log10(transaction_count['Count'].replace(0, np.nan))

In [None]:
fig = px.choropleth(transaction_count, locations="ISO_Code",
                    color="Count",
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Blues,
                    title = "Transaction Count Across the World")
fig.show()