In [1]:
import pandas as pd  # data frames for tabular and time-series data
import numpy as np  # numerical arrays and dtype limits
import matplotlib.pyplot as plt  # plotting: figures, axes, labels
import seaborn as sns  # statistical visualisation built on matplotlib

identity = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv')  # read the identity table
transaction = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')  # read the transaction table

In [2]:
#function to reduce the memory of dataset
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
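
The integer branch of the function above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's own code: the helper name `smallest_int_dtype` and the toy columns are assumptions made for illustration, and the bounds are checked inclusively.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data, not the competition tables).
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 100], dtype=np.int64),
    "big_int": np.array([0, 1, 100_000], dtype=np.int64),
})

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that can hold every value."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    return np.int64

before = toy.memory_usage(index=False).sum()
for col in toy.columns:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col]))
after = toy.memory_usage(index=False).sum()

print(toy.dtypes.to_dict())
print(before, "->", after, "bytes")
```

The float branch deserves more care than the integer one: float16 keeps only about three significant decimal digits, so choosing it by range alone can round the stored values.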

# ***Fraud detection for payment card transactions***

**INPUT DATA:**

The data is split into two tables, **identity** and **transaction**, which are joined by *TransactionID*.

1. **Identity Table**: (Information about cards and client details)

> The Identity Table has 144233 rows and 41 columns.

> Variables in this table are identity information: network connection information (IP, ISP, proxy, etc.) and digital signature (UA/browser/OS/version, etc.) associated with transactions.

> Features:

* DeviceType
* DeviceInfo
* id_12 – id_38
* TransactionID, etc.

2. **Transaction Table:** (Information about the transactions of the clients)

> It covers money transfers as well as gifted goods and services, for example booking a ticket for someone else.

> Transaction table has 590540 rows and 394 columns.

> Features:

* ProductCD 
* isFraud 
* card1 – card6
* TransactionID 
* addr1,addr2 
* P_emaildomain 
* R_emaildomain

> We have to merge the identity and transaction tables to build our training data.

**Output:**
In the dataset the target variable is the binary attribute ‘isFraud’, which is ‘0’ (not fraud) or ‘1’ (fraud). We have to predict the probability that an online transaction is fraudulent.


**Let's first check identity table**

In [3]:
identity.head()

In [4]:
identity.info()

In [5]:
identity.shape

The output above shows that the identity table has 144233 rows and 41 columns.

**Let's check transactions table**

In [6]:
transaction.head()

In [7]:
transaction.shape

The output above shows that the transaction table has 590540 rows and 394 columns.

In [8]:
# number of unique values in each column of the identity table
identity.nunique()

**Merging transaction and identity data into training data**

In [9]:
# Left-merge identity onto transaction, similar to a SQL LEFT OUTER JOIN:
# every transaction row is kept, and identity columns are NaN where there is no match.
# TransactionID is the column common to both tables, so it is named explicitly as the join key.
training = transaction.merge(identity, on='TransactionID', how='left')

In [10]:
training.head()
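
To make the left-join behaviour concrete, here is a small sketch on two hypothetical miniature tables (the toy frames and their values are assumptions, not the competition data). The `indicator=True` argument adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
transaction_toy = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "isFraud": [0, 1, 0],
})
identity_toy = pd.DataFrame({
    "TransactionID": [2],
    "DeviceType": ["mobile"],
})

# Keep every transaction; identity columns are NaN where no identity row exists.
merged = transaction_toy.merge(
    identity_toy, on="TransactionID", how="left", indicator=True
)
print(merged)
```

The merged frame has as many rows as the left table, which is why a left join from transaction to identity tends to introduce many missing values in the identity columns.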

In [11]:
#plotting bargraph for checking the count of transactions that are fraudulent and non-fraudulent
training.groupby('isFraud').count()['TransactionID'].plot(kind='bar',
          title='Distribution of Target in Train',color=('pink','purple'),                            
          figsize=(8, 5))
plt.show()

### The graph above shows that fraud cases are far fewer than non-fraud cases.

In [12]:
# .loc selects rows by label or by a boolean condition
non_legit = training.loc[training['isFraud'] == 1]  # all rows that are fraudulent
print('Non legit are', len(non_legit), 'transactions or', round(len(non_legit) / len(training) * 100, 2), '% of the dataset')

legit = training.loc[training['isFraud'] == 0]  # all rows that are not fraudulent
print('Legit are', len(legit), 'transactions or', round(len(legit) / len(training) * 100, 2), '% of the dataset')

We see that about 96.5% of transactions are non-fraud and only 3.5% are fraud. This high class imbalance will affect the validation and evaluation strategy that we choose.

Because the classes are highly imbalanced, we will use roc_auc_score as our evaluation metric (AUC measures how well a model is able to distinguish between the classes).
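
A quick sketch of why accuracy is a poor guide under this imbalance, using synthetic labels in the same 96.5 : 3.5 ratio (the arrays are made up for illustration and are not the competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with the same 96.5 : 3.5 imbalance (not the real data).
y_true = np.array([0] * 965 + [1] * 35)

# A "model" that calls every transaction legitimate.
always_legit = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_legit)
auc = roc_auc_score(y_true, always_legit)
print(acc, auc)
```

The constant predictor reaches 96.5% accuracy while its AUC stays at 0.5, the value expected from a score that carries no ranking information.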

# **Data Cleaning**

Features with more than 90% missing values were removed from the dataset: they carry almost no information for the model and only add to its run time.

We filled the missing values with the mode for categorical features and with the mean for continuous features.

In [13]:
training.isnull().sum()

id_36             449555
id_37             449555
id_38             449555
DeviceType        449730
DeviceInfo        471874
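We can see that these columns have a very high share of null values, roughly 76% to 80% of the 590540 rows. That is still below the 90% threshold used in the next cell, so that cell keeps them and removes only the columns that are more than 90% null.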

In [14]:
# collect the names of columns in which more than 90% of the values are null
null_columns = [col for col in training.columns if training[col].isnull().sum() / training.shape[0] > 0.9]
# axis=1 drops columns (axis=0 would drop rows); inplace=True modifies training itself
training.drop(null_columns, axis=1, inplace=True)
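
The threshold rule can be checked on a toy frame. This is a sketch with invented column names; it uses `isnull().mean()`, which gives the null fraction per column directly.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],     # 95% null
    "often_missing": [np.nan] * 16 + [1.0] * 4,  # 80% null
    "complete": list(range(20)),                 # 0% null
})

# Fraction of null values per column.
null_fraction = toy.isnull().mean()

# Only columns strictly above the 90% threshold are dropped.
to_drop = null_fraction[null_fraction > 0.9].index.tolist()
kept = toy.drop(columns=to_drop)
print(to_drop, list(kept.columns))
```

A column that is 80% null survives the 90% cut, so it still needs imputation afterwards.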

Next we fill the missing values with the mode for categorical features and with the mean for continuous features.
The columns with object dtype are the candidate categorical features in the dataset.

In [15]:
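# fill null values with the column mean for continuous variables
for i in training.columns:
    if training[i].dtypes == 'int64' or training[i].dtypes == 'float64':
        # assign the result back; a chained inplace fillna may not update training in newer pandas
        training[i] = training[i].fillna(training[i].mean())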

In [16]:
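# fill null values with the column mode for categorical variables
for i in training.columns:
    if training[i].dtypes == 'object':
        # assign the result back; a chained inplace fillna may not update training in newer pandas
        training[i] = training[i].fillna(training[i].mode()[0])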
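
The two imputation rules can be seen side by side on a toy frame. This is only a sketch: the column names and values are invented, and the dtype test uses `pd.api.types.is_numeric_dtype` rather than the string comparisons in the cells above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0],          # continuous feature
    "card_type": ["debit", None, "debit"],   # categorical feature
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # continuous: replace nulls with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # categorical: replace nulls with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the filled column back to the frame avoids relying on in-place modification of a column selection.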

In [17]:
#The columns with object dtype are the possible categorical features in your dataset.
catagorical_cols = ['id_12','id_15', 'id_16', 'id_23', 
            'id_27', 'id_28', 'id_29','id_30', 'id_31', 'id_33', 'id_34', 'id_35', 
            'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo', 'ProductCD', 'card4', 'card6', 'M4','P_emaildomain',
            'R_emaildomain', 'addr1', 'addr2', 'M1', 'M2', 'M3', 'M5', 'M6', 'M7', 'M8', 'M9']

# **Data Transformation**

We are applying label encoding on all the categorical features. For example, ‘Credit_card’ can be assigned as 0, ‘Debit_Card’ can be assigned as 1 and ‘Others’ can be assigned as 2.

For a label encoder, the fit method learns the sorted set of unique values (classes) in a column, and the transform method replaces each value with the integer index of its class. No mean or variance is involved; those belong to feature scaling, not to label encoding.
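
A short sketch of what fitting and transforming do for `LabelEncoder`, on a hypothetical list of card types (the values are invented for illustration):

```python
from sklearn.preprocessing import LabelEncoder

card_types = ["debit", "credit", "debit", "others"]  # hypothetical values

encoder = LabelEncoder()
encoder.fit(card_types)                # learns the sorted unique classes
codes = encoder.transform(card_types)  # maps each value to its class index

print(list(encoder.classes_))
print(codes.tolist())
print(list(encoder.inverse_transform(codes)))
```

The integer codes follow the sorted order of the classes, so they carry no meaning beyond identity; tree-based models usually tolerate this, while linear models may read the codes as an ordering.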

In [18]:
# LabelEncoder maps non-numerical labels (as long as they are hashable and comparable) to integer codes,
# e.g. DeviceType or DeviceInfo strings become 0, 1, 2, ...
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in catagorical_cols:
    if i in training.columns:
        # fit the encoder on the column's values (cast to str) and return the encoded labels
        training[i] = le.fit_transform(training[i].astype(str).values)

In [19]:
y= training['isFraud']
print(y.shape)


In [20]:
x = training.drop(['isFraud','TransactionID','TransactionDT'],axis=1)
print(x.shape)

> As seen above, the classes are highly imbalanced: only 3.5% of the transactions are fraud and 96.5% are non-fraud. The train and test sets should both preserve this 96.5 : 3.5 ratio, so we split with stratified sampling, using 70% of the dataset for training and 30% for testing.

> The features are the descriptive attributes, and the label is what you are trying to predict. Here the features are x and the label is y.

> We want to predict whether a transaction is fraudulent, which is recorded in isFraud, so the isFraud column becomes the label and the remaining columns become the features. The exceptions are TransactionID and the transaction time column TransactionDT, which we treat as not relevant here and remove from x.

The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

In [21]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,stratify = y,test_size = 0.3, random_state=1)

train_test_split splits arrays or matrices into random train and test subsets. Its signature is train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None), and it returns a list containing the train and test split of each input.


In [22]:
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)
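
The effect of `stratify` can be checked on synthetic labels. This sketch assumes a made-up label vector with 3.5% positives; it is not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 positives out of 2000 rows, i.e. 3.5% (not the real data).
y_toy = np.array([0] * 1930 + [1] * 70)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=1
)

# The positive rate is the same in both subsets.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in each subset would vary from split to split, which matters most when the minority class is this small.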

# **Logistic Regression**

A high ROC AUC means that a randomly chosen positive example is likely to receive a higher score than a randomly chosen negative example. In other words, the algorithm does a good job of ranking the test data, with most negative cases at one end of the scale and most positive cases at the other.
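
The ranking interpretation of AUC can be computed directly by counting pairs. The function name `pairwise_auc` and the toy labels and scores below are assumptions for illustration; `roc_auc_score` measures the same quantity far more efficiently.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model scores
auc = pairwise_auc(labels, scores)
print(auc)
```

Of the four positive-negative pairs here, three are ordered correctly, so the AUC is 0.75.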

In [23]:
#fitting the logistic regression on training set
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=1)                        
model.fit(x_train,y_train)#Fit the model according to the given training data.

In [24]:
predict = model.predict(x_test)  # predict the class label for each row of the test features x_test

In [25]:
model.score(x_train,y_train)      #checking the accuracy score for training by Returning the mean accuracy on the given train data and labels.

In [26]:
model.score(x_test,y_test)        #checking the accuracy score for testing by Returning the mean accuracy on the given test data and labels.

After fitting the logistic regression model, we saw an accuracy of 96.4% on both the training and test sets. As discussed above, accuracy is misleading here: a model that labels every transaction as non-fraud would score about 96.5% on this data. Instead of the accuracy score we will look at roc_auc_score.

The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC, which summarizes the curve in one number. The ROC curve measures the performance of a binary classifier across threshold settings, and its area tells us how well the model distinguishes between the classes: the higher the AUC, the better the model is at scoring 1s above 0s.

In [27]:
# predict probabilities
lr_probs = model.predict_proba(x_test)
print(lr_probs[16])
print(lr_probs[15])
print(lr_probs[14])

In [28]:
# keep probabilities for the positive outcome only
outcome = lr_probs[:,1] 

In [29]:
for j in np.array([0.14, 0.15, 0.16]):
    # label a transaction as fraud (1) when its predicted probability exceeds threshold j
    predict = (outcome > j).astype(float)
# only the predictions for the last threshold (0.16) remain in predict after the loop

In [30]:
from sklearn.metrics import roc_auc_score
# AUC of the thresholded 0/1 predictions, not of the probabilities in outcome
roc_auc_score(y_test, predict)

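After applying logistic regression we got an AUC of only 0.51, barely better than the 0.5 of random guessing. Note that this score is computed on thresholded 0/1 predictions rather than on the predicted probabilities. Our main goal was to flag most of the fraud cases as fraud, and this model does not achieve it.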
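
Whether the AUC is computed from probabilities or from thresholded labels can change the result considerably. The sketch below uses invented labels and scores to show the difference; it makes no claim about what the notebook's model would score on the real data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])  # hypothetical scores

# Scores rank both positives above every negative.
auc_from_scores = roc_auc_score(y_true, proba)

# Thresholding at 0.15 turns every score into 0 and discards the ranking.
threshold = 0.15
hard_labels = (proba > threshold).astype(int)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```

Passing the positive-class probabilities (the `outcome` array in the cells above) to `roc_auc_score` evaluates the ranking over all thresholds at once, which is the usual way to report this metric.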