Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being Logistic Regression.

In this project, you are a Data Scientist working for a credit card company. You have access to a dataset (based on a synthetic financial dataset) that represents a typical set of credit card transactions. `transactions.csv` is the original dataset containing 200k transactions. To start, we're going to work with a small portion of this dataset, `transactions_modified.csv`, which contains one thousand transactions. Your task is to use Logistic Regression to create a predictive model that determines whether a transaction is fraudulent.

Note that a `solution.py` file is loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers when you're done!



## Tasks

1. The file `transactions_modified.csv` contains data on 1000 simulated credit card transactions. Let's begin by loading the data into a pandas DataFrame named `transactions`. Take a peek at the dataset using `.head()`, and use `.info()` to examine how many rows there are and what datatypes they have. How many transactions are fraudulent? Print your answer.

    The `isFraud` column indicates fraud, with 1 representing a fraudulent transaction and 0 representing a non-fraudulent transaction. You can use the `.sum()` method to sum the column, and the number you get will be the number of fraudulent transactions!

In [8]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
# Load the data into a DataFrame
transactions = pd.read_csv('D:/Repositories/data/transactions.csv')

# View the first few rows of the dataset
print(transactions.head())

# View the summary information of the dataset
print(transactions.info())

fraudulent_transactions = transactions['isFraud'].sum()
print("Number of fraudulent transactions:", fraudulent_transactions)

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 co

In [10]:
# How many fraudulent transactions?



### Clean the Data

2. Looking at the dataset, combined with our knowledge of credit card transactions in general, we can see that there are a few interesting columns to look at. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column. What does the distribution look like?

    Use the `.describe()` method on a column like this:
    
    ```python
    df['column'].describe()
    ```

In [11]:
# Summary statistics on amount column
amount_stats = transactions['amount'].describe()
print(amount_stats)

count    6.362620e+06
mean     1.798619e+05
std      6.038582e+05
min      0.000000e+00
25%      1.338957e+04
50%      7.487194e+04
75%      2.087215e+05
max      9.244552e+07
Name: amount, dtype: float64
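In the summary above, the mean (about 1.8e5) sits well above the median (about 7.5e4), which is the usual sign of a right-skewed distribution with a few very large values. A minimal sketch of that check, using a small made-up series rather than the real `amount` column (all values here are hypothetical):

```python
import pandas as pd

# Hypothetical amounts: mostly small transactions plus one very large one
amounts = pd.Series([120.0, 450.0, 980.0, 1500.0, 2300.0, 250000.0], name="amount")

stats = amounts.describe()
mean, median = stats["mean"], stats["50%"]

# A mean far above the median suggests a long right tail
print("mean:", mean, "median:", median)
print("skewness:", amounts.skew())
is_right_skewed = mean > median
```

The same comparison can be run on `transactions['amount']` once the data is loaded.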


3. We have a lot of information about the type of transaction we are looking at. Let's create a new column called `isPayment` that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

    You can create a new column for a pandas DataFrame like this:

    ```python
    df['new_column'] = value
    ```

    You can filter a DataFrame for specific values like this:

    ```python
    df[df['filter_column'] == value]
    ```



In [12]:
# Create isPayment field
transactions['isPayment'] = (transactions['type'] == 'PAYMENT') | (transactions['type'] == 'DEBIT')
print(transactions.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  \
0  M1979787155             0.0             0.0        0               0   
1  M2044282225             0.0             0.0        0               0   
2   C553264065             0.0             0.0        1               0   
3    C38997010         21182.0             0.0        1               0   
4  M1230701703             0.0             0.0        0               0   

   isPayment  
0       True  
1       True  
2      False  
3      False  
4       True  
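The column printed above holds booleans rather than the 1s and 0s the task describes. NumPy casts `True`/`False` to 1.0/0.0 when the data is converted to floats, so later steps still run, but the integers can be made explicit. A minimal sketch on a made-up frame (`demo` and the `OTHER` type are hypothetical), using `.isin()` followed by `.astype(int)`:

```python
import pandas as pd

# Hypothetical mini-dataset with one row per transaction type
demo = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "OTHER"]})

# .isin() returns booleans; .astype(int) turns them into 1/0
demo["isPayment"] = demo["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
demo["isMovement"] = demo["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)
print(demo)
```

`.isin()` also scales better than chaining `|` comparisons when more categories need to be grouped.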


4. Similarly, create a column called `isMovement`, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either `CASH_OUT` or `TRANSFER`, and a 0 otherwise.

    You can create a new column for a pandas DataFrame like this:

    ```python
    df['new_column'] = value
    ```

    You can filter a DataFrame for specific values like this:

    ```python
    df[df['filter_column'] == value]
    ```

In [13]:
# Create isMovement field
transactions['isMovement'] = (transactions['type'] == 'CASH_OUT') | (transactions['type'] == 'TRANSFER')
print(transactions.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  \
0  M1979787155             0.0             0.0        0               0   
1  M2044282225             0.0             0.0        0               0   
2   C553264065             0.0             0.0        1               0   
3    C38997010         21182.0             0.0        1               0   
4  M1230701703             0.0             0.0        0               0   

   isPayment  isMovement  
0       True       False  
1       True       False  
2      False     

5. With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory here is that destination accounts whose balance differs significantly from the origin account's could be suspected of fraud.

   Let's create a column called `accountDiff` with the absolute difference of the `oldbalanceOrg` and `oldbalanceDest` columns.

   You can perform standard mathematical functions like `+`, `-`, `*`, and `/` with entire columns.
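As a small illustration of column arithmetic, the sketch below takes the two balance columns from the first and fourth rows of the printed head and computes their absolute difference. The frame name `demo` is hypothetical.

```python
import pandas as pd

# Rows 0 and 3 of the printed head, reduced to the two balance columns
demo = pd.DataFrame({
    "oldbalanceOrg": [170136.0, 181.0],
    "oldbalanceDest": [0.0, 21182.0],
})

# Subtraction is element-wise; .abs() removes the sign of the result
demo["accountDiff"] = (demo["oldbalanceOrg"] - demo["oldbalanceDest"]).abs()
print(demo)
```

The Series method `.abs()` and the built-in `abs()` give the same result on a pandas Series.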

In [14]:
# Create accountDiff field
transactions['accountDiff'] = abs(transactions['oldbalanceOrg'] - transactions['oldbalanceDest'])
print(transactions.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  \
0  M1979787155             0.0             0.0        0               0   
1  M2044282225             0.0             0.0        0               0   
2   C553264065             0.0             0.0        1               0   
3    C38997010         21182.0             0.0        1               0   
4  M1230701703             0.0             0.0        0               0   

   isPayment  isMovement  accountDiff  
0       True       False     170136.0  
1       True      

### Select and Split the Data

6. Before we can start training our model, we need to define our features and label columns. Our label column in this dataset is the `isFraud` field. Create a variable called `features` which will be an array consisting of the following fields:

   - `amount`
   - `isPayment`
   - `isMovement`
   - `accountDiff`

    Also create a variable called `label` with the column `isFraud`.

    You can assign an entire DataFrame or a pandas Series (one column) to a variable.

In [15]:
# Create features and label variables
features = transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']]
label = transactions['isFraud']

7. Split the data into training and test sets using sklearn's `train_test_split()` method. We'll use the training set to train the model and the test set to evaluate the model. Use a `test_size` value of 0.3.

    To capture all of the output arrays, use the method like this:

    `X_train, X_test, y_train, y_test = train_test_split(...)`

In [16]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3, random_state=42)
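Because fraud is usually a small share of the rows, a purely random split can leave the test set with very few positive examples. One option, which the cell above does not use, is the `stratify` argument of `train_test_split()`, which keeps the class proportions roughly equal in both sets. A sketch on synthetic data (the arrays `X` and `y` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # hypothetical feature matrix
y = np.array([1] * 10 + [0] * 90)    # 10% positive class

# stratify=y keeps the share of positives similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_tr.shape, X_te.shape)
print("positive share, train:", y_tr.mean(), "test:", y_te.mean())
```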


### Normalize the Data

8. Since sklearn's Logistic Regression implementation uses regularization by default, we need to scale our feature data. Create a `StandardScaler` object, `.fit_transform()` it on the training features, and `.transform()` the test features.

    Pass the entire feature variable as the argument to the scaler's `.fit_transform()` and `.transform()` methods.



In [17]:
# Normalize the features variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
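To see what the scaler does, the sketch below fits a `StandardScaler` on a tiny made-up training array and applies it to a made-up test row. The array names are hypothetical. The training columns end up with mean 0 and standard deviation 1, while the test row is scaled with the training statistics, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features with very different column scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std, then scales
test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)
```

Fitting only on the training set keeps information about the test set out of the preprocessing step.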

### Create and Evaluate the Model

9. Create a LogisticRegression model with sklearn and `.fit()` it on the training data.

    Fitting the model finds the best coefficients for our selected features so it can more accurately predict our label. We will start with the default threshold of 0.5.

    Pass the newly normalized training features as the argument to your `.fit()` method.

In [18]:
# Fit the model to the training data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
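The "default threshold of 0.5" means that `.predict()` labels a row as class 1 when the predicted probability of class 1 exceeds 0.5. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `clf` are hypothetical stand-ins for the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Hypothetical label: driven by the first feature plus some noise
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# Column 1 of predict_proba is the probability of class 1
proba_class1 = clf.predict_proba(X_demo)[:, 1]
manual_pred = (proba_class1 > 0.5).astype(int)

same = np.array_equal(manual_pred, clf.predict(X_demo))
print("predict() matches a 0.5 cutoff:", same)
```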

10. Run the model's `.score()` method on the training data and print the training score.

    Scoring the model on the training data will process the training data through the trained model and will predict which transactions are fraudulent. The score returned is the fraction of correct classifications, or the accuracy.

    Pass both the training features and label variables to the `.score()` method.

In [19]:
# Score the model on the training data
training_score = model.score(X_train_scaled, y_train)
print("Training score:", training_score)

Training score: 0.9986829325026483


11. Run the model's `.score()` method on the test data and print the test score.

    Scoring the model on the test data will process the test data through the trained model and will predict which transactions are fraudulent. The score returned is the fraction of correct classifications, or the accuracy, and will be an indicator of the success of your model.

    How did your model perform?

    Pass both the test features and label variables to the `.score()` method.



In [20]:
# Score the model on the test data
test_score = model.score(X_test_scaled, y_test)
print("Test score:", test_score)

Test score: 0.9987017926577416
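An accuracy close to 0.999 can be less informative than it looks when fraud is rare, because a model that never predicts fraud would score almost as well. The sketch below illustrates this with made-up labels; it uses `accuracy_score` and `recall_score` from `sklearn.metrics`, which the notebook itself does not use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 2 fraudulent transactions out of 1000
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A "model" that always predicts non-fraudulent
y_always_legit = np.zeros(1000, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_always_legit)
baseline_recall = recall_score(y_true, y_always_legit)
print("accuracy:", baseline_accuracy)  # high, despite catching no fraud
print("recall:", baseline_recall)      # share of fraud cases detected
```

Comparing the model's score against such a baseline gives a better sense of how much it has actually learned.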


12. Print the coefficients for our model to see how important each feature column was for prediction. Which feature was most important? Least important?

    To print the model coefficients, use the `.coef_` attribute of your model.

In [21]:
# Print the model coefficients
coefficients = model.coef_
feature_importance = pd.Series(coefficients[0], index=features.columns)
print("Feature importance:\n", feature_importance)

Feature importance:
 amount         0.221755
isPayment     -1.171221
isMovement     3.588887
accountDiff   -0.661338
dtype: float64
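One way to read these numbers is to rank them by absolute value, since the sign only gives the direction of the effect. The features were standardized, so each coefficient is the change in log-odds per one-standard-deviation increase in that feature. The sketch below copies the printed coefficients into a Series (the variable names are hypothetical) and converts them to odds ratios with `np.exp`.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed output
coefs = pd.Series({
    "amount": 0.221755,
    "isPayment": -1.171221,
    "isMovement": 3.588887,
    "accountDiff": -0.661338,
})

# Rank by magnitude, keeping the original signs
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)

# exp(coef): multiplicative change in the odds of fraud per one standard deviation
odds_ratios = np.exp(ranked)
print(odds_ratios)
```

By this ranking `isMovement` carries the most weight and `amount` the least.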


### Predict With the Model

13. Let's use our model to process more transactions that have gone through our systems. There are three numpy arrays pre-loaded in the workspace with information on new sample transactions under "New transaction data".

    ```python
    # New transaction data
    transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
    transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
    transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
    ```
    Create a fourth array, `your_transaction`, and add any transaction information you'd like. Make sure to enter all values as floats with a `.`!

    Make sure all of your array values are floats (have a decimal point)

In [22]:
# New transaction data
tr1 = np.array([123456.78, 0.0, 1.0, 54670.1])
tr2 = np.array([98765.43, 1.0, 0.0, 8524.75])
tr3 = np.array([543678.31, 1.0, 0.0, 510025.5])

14. Combine the new transactions and `your_transaction` into a single numpy array called `sample_transactions`.

    You can combine numpy arrays using the `np.stack()` function.

In [23]:
# Create a new transaction
your_tr = np.array([354748.31, 1.0, 0.0, 30725.5])

# Combine new transactions into a single array
sample_transactions = np.stack([tr1, tr2, tr3, your_tr])
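`np.stack()` joins the 1-D arrays along a new leading axis, giving one row per transaction and one column per feature, which is the 2-D layout the scaler and model expect. A short sketch with two of the sample arrays (the names `a`, `b`, and `stacked` are hypothetical):

```python
import numpy as np

a = np.array([123456.78, 0.0, 1.0, 54670.1])
b = np.array([98765.43, 1.0, 0.0, 8524.75])

# Stack along a new first axis: shape becomes (n_transactions, n_features)
stacked = np.stack([a, b])
print(stacked.shape)

# For equal-length 1-D arrays, np.array([...]) builds the same 2-D array
same_as_array = np.array_equal(stacked, np.array([a, b]))
print(same_as_array)
```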

15. Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the `StandardScaler` object created earlier, apply its `.transform()` method to `sample_transactions` and save the result to `sample_transactions`.

    Use the `.transform()` method with the same scaler as before.

In [24]:
# Normalize the new transactions
# Scale the sample transactions
sample_transactions_scaled = scaler.transform(sample_transactions)
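Calling `.transform()` on new rows reuses the mean and standard deviation learned from the training data, which a fitted `StandardScaler` stores in its `mean_` and `scale_` attributes. The sketch below, on made-up arrays with hypothetical names, checks that the method matches the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and new rows to score
train = np.array([[10.0, 0.0], [20.0, 1.0], [30.0, 1.0], [40.0, 0.0]])
new_rows = np.array([[25.0, 1.0], [100.0, 0.0]])

sc = StandardScaler().fit(train)

by_sklearn = sc.transform(new_rows)
by_hand = (new_rows - sc.mean_) / sc.scale_  # same statistics, applied manually

print(by_sklearn)
matches = np.allclose(by_sklearn, by_hand)
print("manual formula matches transform():", matches)
```

Refitting the scaler on the new rows instead would put them on a different scale from the one the model was trained on.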



16. Which transactions are fraudulent? Use your model's `.predict()` method on `sample_transactions` and print the result to find out.

    Want to see the probabilities that led to these predictions? Call your model's `.predict_proba()` method on `sample_transactions` and print the result. The 1st column is the probability of a transaction not being fraudulent, and the 2nd column is the probability of a transaction being fraudulent (which was calculated by our model to make the final classification decision).

    You can pass `sample_transactions` to both the `.predict()` and `.predict_proba()` methods.

In [25]:
# Predict fraud on the new transactions
# Make predictions
predictions = model.predict(sample_transactions_scaled)
probabilities = model.predict_proba(sample_transactions_scaled)

print("Predictions:", predictions)
print("Probabilities:\n", probabilities)

Predictions: [0 0 0 0]
Probabilities:
 [[9.96831107e-01 3.16889315e-03]
 [9.99999806e-01 1.94016916e-07]
 [9.99999790e-01 2.10487274e-07]
 [9.99999788e-01 2.12414213e-07]]
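All four predictions are 0 because none of the fraud probabilities exceeds the default 0.5 cutoff. The sketch below copies the printed probabilities into an array and applies a much lower cutoff by hand. The value 0.001 is a hypothetical choice for illustration only, not a recommendation.

```python
import numpy as np

# Probabilities copied from the printed output: columns are [not fraud, fraud]
probabilities = np.array([
    [9.96831107e-01, 3.16889315e-03],
    [9.99999806e-01, 1.94016916e-07],
    [9.99999790e-01, 2.10487274e-07],
    [9.99999788e-01, 2.12414213e-07],
])
fraud_proba = probabilities[:, 1]

# Default behaviour: flag only when the fraud probability exceeds 0.5
default_flags = (fraud_proba > 0.5).astype(int)

# Hypothetical lower cutoff, chosen only for illustration
strict_flags = (fraud_proba > 0.001).astype(int)

print("cutoff 0.5:  ", default_flags)
print("cutoff 0.001:", strict_flags)
```

With the lower cutoff only the first sample, the one with `isMovement` set to 1, is flagged.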


17. Congratulations on completing the project!

    Note that we used a modified version of the dataset. You can now try to re-run the project using the original dataset, `transactions.csv`. Examine how the results change. If you notice something weird, you're totally on to something! That "something" is what is known as an imbalanced class classification problem.

    We will cover this very relevant topic (among many other things) in the Logistic Regression II module!

    Check how many fraudulent transactions there are in the complete dataset. What percentage of the total number of transactions is this? What was this percentage in the modified dataset?
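Because `isFraud` is a 0/1 column, its mean is the fraction of fraudulent rows, so the percentage follows directly. A minimal sketch with a hypothetical helper `fraud_percentage()` and two made-up frames standing in for the two CSV files:

```python
import pandas as pd

def fraud_percentage(df: pd.DataFrame) -> float:
    """Share of rows with isFraud == 1, expressed as a percentage."""
    return df["isFraud"].mean() * 100

# Made-up frames, not the real datasets
small = pd.DataFrame({"isFraud": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]})
large = pd.DataFrame({"isFraud": [1] + [0] * 999})

small_pct = fraud_percentage(small)
large_pct = fraud_percentage(large)
print("small:", small_pct, "% fraudulent")
print("large:", large_pct, "% fraudulent")
```

Running the same helper on each loaded DataFrame gives the two percentages the task asks about.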


In [26]:
# Show probabilities on the new transactions
