# Capstone Project: Predicting Fraudulent Transactions Using Machine Learning - Preprocessing

## Author: Tolulope Oludemi

## Date: August 8, 2022

---

# Table of Contents

---

[I. Data Preprocessing](#Data-Preprocessing)<br>

[II. Import Libraries and Cleaned Dataset](#Import-Libraries-and-Cleaned-Dataset)<br>

[III. Converting to Numerical Variables](#Converting-to-Numerical-Variables)<br>
- [`trans_date_trans_time` Column](#trans_date_trans_time-Column)<br>
    - [Extracting from DateTime Column](#Extracting-Year,-Month,-Day,-and-Hour)<br>
- [`dob` Column](#dob-Column)<br>
    - [Extracting from DateTime Column](#Extracting-the-Year-from-the-dob-Column)<br>

[IV. Columns to Transform to Numeric Variables](#Columns-to-Transform-to-Numeric-Variables)<br>

[V. One Hot Encoding the Categorical Columns](#One-Hot-Encoding-the-Categorical-Columns)<br>

[VI. Summary of Notebook](#Summary)<br>

# Data Preprocessing


In this notebook, I load the cleaned and feature-engineered dataset (`cleaned_cc_trans_df.csv`) and process it so that it is ready for modeling. This mainly involves transforming the categorical columns into numeric columns.

---

# Import Libraries and Cleaned Dataset

In [None]:
# import necessary libraries

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Using Google Colab, so Google Drive must be mounted to access the data

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Read in the cleaned dataset

cc_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data/cleaned_cc_trans_df.csv", index_col=0)


# Inspect the data
cc_df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,zip,lat,long,city_pop,job,dob,unix_time,merch_lat,merch_long,is_fraud
5,2019-01-01 00:04:08,4767265376804500,"Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,...,18917,40.375,-75.2045,2158,Transport planner,1961-06-19 00:00:00,1325376248,40.653382,-76.152667,0
7,2019-01-01 00:05:08,6011360759745864,Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,...,22824,38.8432,-78.6003,6018,Designer,1947-08-21 00:00:00,1325376308,38.948089,-78.540296,0
12,2019-01-01 00:06:56,180042946491150,Lockman Ltd,grocery_pos,71.22,Charles,Robles,M,3337 Lisa Divide,Saint Petersburg,...,33710,27.7898,-82.7243,341043,Engineer,1989-02-28 00:00:00,1325376416,27.630593,-82.308891,0
14,2019-01-01 00:09:03,3514865930894695,Beier-Hyatt,shopping_pos,7.77,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,...,88325,32.9396,-105.8189,899,Naval architect,1967-08-30 00:00:00,1325376543,32.863258,-106.520205,0
15,2019-01-01 00:09:20,6011999606625827,Schmidt and Sons,shopping_net,3.26,Ronald,Carson,M,870 Rocha Drive,Harrington Park,...,7640,40.9918,-73.98,4664,Radiographer,1965-06-30 00:00:00,1325376560,41.831174,-74.335559,0


In [None]:
# Confirm the shape of the dataset

cc_df.shape

(128802, 21)

In [None]:
# Check for NaNs

cc_df.isna().sum().sum()

0

In [None]:
# Check for duplicates

cc_df.duplicated().sum()

0

In [None]:
# Check datatypes for each column

cc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128802 entries, 5 to 1296642
Data columns (total 21 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   trans_date_trans_time  128802 non-null  object 
 1   cc_num                 128802 non-null  int64  
 2   merchant               128802 non-null  object 
 3   category               128802 non-null  object 
 4   amt                    128802 non-null  float64
 5   first                  128802 non-null  object 
 6   last                   128802 non-null  object 
 7   gender                 128802 non-null  object 
 8   street                 128802 non-null  object 
 9   city                   128802 non-null  object 
 10  state                  128802 non-null  object 
 11  zip                    128802 non-null  int64  
 12  lat                    128802 non-null  float64
 13  long                   128802 non-null  float64
 14  city_pop               128802 non-n

There are 11 non-numeric (`object`) columns that need to be converted to numeric columns. Two of them hold dates stored as text, and the rest are categorical. These are:

1. `trans_date_trans_time`
2. `merchant`
3. `category`
4. `first`
5. `last`
6. `gender`
7. `street`
8. `city`
9. `state`
10. `job`
11. `dob`
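
As a minimal sketch of how such a list can be obtained programmatically rather than read off `info()`, the snippet below uses `select_dtypes` on a small toy frame. The frame `toy_df` and its values are hypothetical stand-ins for `cc_df`.

```python
import pandas as pd

# Toy frame standing in for cc_df (hypothetical values)
toy_df = pd.DataFrame({
    "trans_date_trans_time": ["2019-01-01 00:04:08", "2019-01-01 00:05:08"],
    "category": ["gas_transport", "grocery_pos"],
    "amt": [94.63, 71.22],
    "is_fraud": [0, 0],
})

# Every column that is not numeric still needs to be converted
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Excluding numeric dtypes, instead of including `object`, keeps the check valid even if text columns are stored with a dedicated string dtype.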

# Converting to Numerical Variables

## `trans_date_trans_time` Column


The first column to be transformed is the transaction date and time column. It first needs to be converted to the `datetime64` data type. I will then extract the year, month, day, and hour from the column.

In [None]:
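# Change to datetime format
# (pd.to_datetime is used because newer pandas versions reject the unitless 'datetime64' dtype in astype)

cc_df['trans_date_trans_time'] = pd.to_datetime(cc_df['trans_date_trans_time'])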

In [None]:
# Make sure it worked

cc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128802 entries, 5 to 1296642
Data columns (total 21 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   trans_date_trans_time  128802 non-null  datetime64[ns]
 1   cc_num                 128802 non-null  int64         
 2   merchant               128802 non-null  object        
 3   category               128802 non-null  object        
 4   amt                    128802 non-null  float64       
 5   first                  128802 non-null  object        
 6   last                   128802 non-null  object        
 7   gender                 128802 non-null  object        
 8   street                 128802 non-null  object        
 9   city                   128802 non-null  object        
 10  state                  128802 non-null  object        
 11  zip                    128802 non-null  int64         
 12  lat                    128802 non-null  flo

In [None]:
# Checking to make sure nothing unusual happened in the dataset

cc_df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,zip,lat,long,city_pop,job,dob,unix_time,merch_lat,merch_long,is_fraud
5,2019-01-01 00:04:08,4767265376804500,"Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,...,18917,40.375,-75.2045,2158,Transport planner,1961-06-19 00:00:00,1325376248,40.653382,-76.152667,0
7,2019-01-01 00:05:08,6011360759745864,Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,...,22824,38.8432,-78.6003,6018,Designer,1947-08-21 00:00:00,1325376308,38.948089,-78.540296,0
12,2019-01-01 00:06:56,180042946491150,Lockman Ltd,grocery_pos,71.22,Charles,Robles,M,3337 Lisa Divide,Saint Petersburg,...,33710,27.7898,-82.7243,341043,Engineer,1989-02-28 00:00:00,1325376416,27.630593,-82.308891,0
14,2019-01-01 00:09:03,3514865930894695,Beier-Hyatt,shopping_pos,7.77,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,...,88325,32.9396,-105.8189,899,Naval architect,1967-08-30 00:00:00,1325376543,32.863258,-106.520205,0
15,2019-01-01 00:09:20,6011999606625827,Schmidt and Sons,shopping_net,3.26,Ronald,Carson,M,870 Rocha Drive,Harrington Park,...,7640,40.9918,-73.98,4664,Radiographer,1965-06-30 00:00:00,1325376560,41.831174,-74.335559,0


### Extracting Year, Month, Day, and Hour

In [None]:
# Adding new columns for the transaction year, month, day and hour

cc_df['trans_year'] = cc_df['trans_date_trans_time'].dt.year
cc_df['trans_month'] = cc_df['trans_date_trans_time'].dt.month
cc_df['trans_day'] = cc_df['trans_date_trans_time'].dt.day
cc_df['trans_hour'] = cc_df['trans_date_trans_time'].dt.hour


# Check to make sure it worked
cc_df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,job,dob,unix_time,merch_lat,merch_long,is_fraud,trans_year,trans_month,trans_day,trans_hour
5,2019-01-01 00:04:08,4767265376804500,"Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,...,Transport planner,1961-06-19 00:00:00,1325376248,40.653382,-76.152667,0,2019,1,1,0
7,2019-01-01 00:05:08,6011360759745864,Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,...,Designer,1947-08-21 00:00:00,1325376308,38.948089,-78.540296,0,2019,1,1,0
12,2019-01-01 00:06:56,180042946491150,Lockman Ltd,grocery_pos,71.22,Charles,Robles,M,3337 Lisa Divide,Saint Petersburg,...,Engineer,1989-02-28 00:00:00,1325376416,27.630593,-82.308891,0,2019,1,1,0
14,2019-01-01 00:09:03,3514865930894695,Beier-Hyatt,shopping_pos,7.77,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,...,Naval architect,1967-08-30 00:00:00,1325376543,32.863258,-106.520205,0,2019,1,1,0
15,2019-01-01 00:09:20,6011999606625827,Schmidt and Sons,shopping_net,3.26,Ronald,Carson,M,870 Rocha Drive,Harrington Park,...,Radiographer,1965-06-30 00:00:00,1325376560,41.831174,-74.335559,0,2019,1,1,0


In [None]:
# After making sure it worked, I will now drop the original column

cc_df.drop('trans_date_trans_time', axis=1, inplace=True)
cc_df.head()

Unnamed: 0,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,job,dob,unix_time,merch_lat,merch_long,is_fraud,trans_year,trans_month,trans_day,trans_hour
5,4767265376804500,"Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,PA,...,Transport planner,1961-06-19 00:00:00,1325376248,40.653382,-76.152667,0,2019,1,1,0
7,6011360759745864,Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,VA,...,Designer,1947-08-21 00:00:00,1325376308,38.948089,-78.540296,0,2019,1,1,0
12,180042946491150,Lockman Ltd,grocery_pos,71.22,Charles,Robles,M,3337 Lisa Divide,Saint Petersburg,FL,...,Engineer,1989-02-28 00:00:00,1325376416,27.630593,-82.308891,0,2019,1,1,0
14,3514865930894695,Beier-Hyatt,shopping_pos,7.77,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,NM,...,Naval architect,1967-08-30 00:00:00,1325376543,32.863258,-106.520205,0,2019,1,1,0
15,6011999606625827,Schmidt and Sons,shopping_net,3.26,Ronald,Carson,M,870 Rocha Drive,Harrington Park,NJ,...,Radiographer,1965-06-30 00:00:00,1325376560,41.831174,-74.335559,0,2019,1,1,0


In [None]:
# Checking the shape of the new dataset

cc_df.shape

(128802, 24)
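
The conversion and extraction steps in this section can be sketched on a toy frame. This is only an illustration: `toy_df` and its two timestamps are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy frame with timestamps stored as text (hypothetical values)
toy_df = pd.DataFrame(
    {"trans_date_trans_time": ["2019-01-01 00:04:08", "2020-06-21 12:14:25"]}
)

# Parse the text into datetime64 values
timestamps = pd.to_datetime(toy_df["trans_date_trans_time"])

# .dt.day is the day of the month; .dt.dayofweek would give the weekday instead
toy_df["trans_year"] = timestamps.dt.year
toy_df["trans_month"] = timestamps.dt.month
toy_df["trans_day"] = timestamps.dt.day
toy_df["trans_hour"] = timestamps.dt.hour

# Drop the original column once its parts have been extracted
toy_df = toy_df.drop(columns="trans_date_trans_time")
print(toy_df)
```

Four columns are added and one is dropped, which is why the real dataset goes from 21 to 24 columns.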

## `dob` Column

For the date of birth column, I will first change it to the appropriate data type, `datetime64`. Afterwards, I will extract the year of birth and drop the original column, since the year of birth (a proxy for the cardholder's age) is likely to be more informative to a model than the month and day of birth.

In [None]:
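# Changing to the datetime format
# (pd.to_datetime is used because newer pandas versions reject the unitless 'datetime64' dtype in astype)

cc_df['dob'] = pd.to_datetime(cc_df['dob'])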

cc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128802 entries, 5 to 1296642
Data columns (total 24 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   cc_num       128802 non-null  int64         
 1   merchant     128802 non-null  object        
 2   category     128802 non-null  object        
 3   amt          128802 non-null  float64       
 4   first        128802 non-null  object        
 5   last         128802 non-null  object        
 6   gender       128802 non-null  object        
 7   street       128802 non-null  object        
 8   city         128802 non-null  object        
 9   state        128802 non-null  object        
 10  zip          128802 non-null  int64         
 11  lat          128802 non-null  float64       
 12  long         128802 non-null  float64       
 13  city_pop     128802 non-null  int64         
 14  job          128802 non-null  object        
 15  dob          128802 non-null  dat

### Extracting the Year from the `dob` Column

In [None]:
# Adding a new column for the year of birth

cc_df['dob_year'] = cc_df['dob'].dt.year


# Checking to make sure it worked
cc_df.head()

Unnamed: 0,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,dob,unix_time,merch_lat,merch_long,is_fraud,trans_year,trans_month,trans_day,trans_hour,dob_year
5,4767265376804500,"Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,PA,...,1961-06-19,1325376248,40.653382,-76.152667,0,2019,1,1,0,1961
7,6011360759745864,Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,VA,...,1947-08-21,1325376308,38.948089,-78.540296,0,2019,1,1,0,1947
12,180042946491150,Lockman Ltd,grocery_pos,71.22,Charles,Robles,M,3337 Lisa Divide,Saint Petersburg,FL,...,1989-02-28,1325376416,27.630593,-82.308891,0,2019,1,1,0,1989
14,3514865930894695,Beier-Hyatt,shopping_pos,7.77,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,NM,...,1967-08-30,1325376543,32.863258,-106.520205,0,2019,1,1,0,1967
15,6011999606625827,Schmidt and Sons,shopping_net,3.26,Ronald,Carson,M,870 Rocha Drive,Harrington Park,NJ,...,1965-06-30,1325376560,41.831174,-74.335559,0,2019,1,1,0,1965


In [None]:
# Drop the original dob column, then check the new shape of the dataset

cc_df.drop('dob', axis=1, inplace=True)
cc_df.shape

(128802, 24)
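
The column count stays at 24 because one column (`dob_year`) was added and one (`dob`) was removed. A small sketch of that swap, using a hypothetical `toy_df`:

```python
import pandas as pd

# Toy frame with dates of birth stored as text (hypothetical values)
toy_df = pd.DataFrame({
    "dob": ["1961-06-19 00:00:00", "1989-02-28 00:00:00"],
    "amt": [94.63, 71.22],
})
n_cols_before = toy_df.shape[1]

# Add the year of birth, then remove the original column
toy_df["dob_year"] = pd.to_datetime(toy_df["dob"]).dt.year
toy_df = toy_df.drop(columns="dob")

# One column in, one column out: the width is unchanged
print(n_cols_before, toy_df.shape[1])
```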

In [None]:
# See the updated datatypes for the columns 

cc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128802 entries, 5 to 1296642
Data columns (total 24 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   cc_num       128802 non-null  int64  
 1   merchant     128802 non-null  object 
 2   category     128802 non-null  object 
 3   amt          128802 non-null  float64
 4   first        128802 non-null  object 
 5   last         128802 non-null  object 
 6   gender       128802 non-null  object 
 7   street       128802 non-null  object 
 8   city         128802 non-null  object 
 9   state        128802 non-null  object 
 10  zip          128802 non-null  int64  
 11  lat          128802 non-null  float64
 12  long         128802 non-null  float64
 13  city_pop     128802 non-null  int64  
 14  job          128802 non-null  object 
 15  unix_time    128802 non-null  int64  
 16  merch_lat    128802 non-null  float64
 17  merch_long   128802 non-null  float64
 18  is_fraud     128802 non

# Columns to Transform to Numeric Variables

1. merchant
2. category
3. first
4. last
5. gender
6. state
7. job

**Columns to Drop from Categorical Columns**
1. street (largely unique to each cardholder, and its location is already captured by `lat` and `long`)
2. city (partly captured by `state`, so it might add little for the model)

Now that I know the columns that will be converted to numeric, I will use One Hot Encoding and Column Transformer to convert the categorical columns to numeric.
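
Before applying this to the full dataset, here is a minimal sketch of what the combination does on a toy frame. The frame `toy_df` and the name `toy_transformer` are hypothetical, and only two of the seven columns are included.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two categorical columns and one numeric column (hypothetical values)
toy_df = pd.DataFrame({
    "category": ["gas_transport", "grocery_pos", "gas_transport"],
    "gender": ["F", "M", "M"],
    "amt": [94.63, 71.22, 7.77],
})

# One (name, transformer, columns) entry per column to encode
toy_transformer = ColumnTransformer([
    ("category_transform", OneHotEncoder(), ["category"]),
    ("gender_transform", OneHotEncoder(), ["gender"]),
])

# Columns not listed (here amt) are dropped from the output by default
encoded = toy_transformer.fit_transform(toy_df)
print(encoded.shape)
```

Each distinct value becomes its own 0/1 column, so the output width is the sum of the number of distinct values in each encoded column (2 + 2 here). Every row contains exactly one 1 per encoded column.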

# One Hot Encoding the Categorical Columns

In [None]:
# Import necessary methods

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


# Putting the columns to be transformed in a list with the method to be used
column_transform_list = [
                         ('merchant_transform', OneHotEncoder(), ['merchant']),
                         ('category_transform', OneHotEncoder(), ['category']),
                         ('first_transform', OneHotEncoder(), ['first']),
                         ('last_transform', OneHotEncoder(), ['last']),
                         ('gender_transform', OneHotEncoder(), ['gender']),
                         ('state_transform', OneHotEncoder(), ['state']),
                         ('job_transform', OneHotEncoder(), ['job'])
]

# Creating the column transformer
column_transformer = ColumnTransformer(column_transform_list)

# Fit it on the original dataset
column_transformer.fit(cc_df)

ColumnTransformer(transformers=[('merchant_transform', OneHotEncoder(),
                                 ['merchant']),
                                ('category_transform', OneHotEncoder(),
                                 ['category']),
                                ('first_transform', OneHotEncoder(), ['first']),
                                ('last_transform', OneHotEncoder(), ['last']),
                                ('gender_transform', OneHotEncoder(),
                                 ['gender']),
                                ('state_transform', OneHotEncoder(), ['state']),
                                ('job_transform', OneHotEncoder(), ['job'])])

In [None]:
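# Get column names
# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out(), whose names use the input column name in place of x0
column_transformer.get_feature_names()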

['merchant_transform__x0_Altenwerth, Cartwright and Koss',
 'merchant_transform__x0_Ankunding LLC',
 'merchant_transform__x0_Auer-Mosciski',
 'merchant_transform__x0_Auer-West',
 'merchant_transform__x0_Bahringer, Schoen and Corkery',
 'merchant_transform__x0_Bailey-Morar',
 'merchant_transform__x0_Barrows PLC',
 'merchant_transform__x0_Bartoletti-Wunsch',
 'merchant_transform__x0_Barton Inc',
 'merchant_transform__x0_Bashirian Group',
 'merchant_transform__x0_Bauch-Raynor',
 'merchant_transform__x0_Baumbach, Feeney and Morar',
 'merchant_transform__x0_Baumbach, Hodkiewicz and Walsh',
 'merchant_transform__x0_Baumbach, Strosin and Nicolas',
 'merchant_transform__x0_Bednar Group',
 'merchant_transform__x0_Beier-Hyatt',
 'merchant_transform__x0_Bernhard Inc',
 'merchant_transform__x0_Bernhard, Grant and Langworth',
 'merchant_transform__x0_Bernier, Volkman and Hoeger',
 'merchant_transform__x0_Bins-Rice',
 'merchant_transform__x0_Block-Parisian',
 'merchant_transform__x0_Bogisich Inc',
 

In [None]:
# Transform the original dataset

transformed_col = column_transformer.transform(cc_df)

# Check that it worked
transformed_col

<128802x768 sparse matrix of type '<class 'numpy.float64'>'
	with 901614 stored elements in Compressed Sparse Row format>
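
The figures in this output can be checked with a little arithmetic. The sketch below uses only the numbers printed above; the variable names are illustrative.

```python
# Figures taken from the sparse matrix summary above
n_rows, n_cols, stored_elements = 128802, 768, 901614
n_encoded_columns = 7  # merchant, category, first, last, gender, state, job

# Each row holds exactly one 1 per encoded column
assert stored_elements == n_rows * n_encoded_columns

# Share of cells that are non-zero
density = stored_elements / (n_rows * n_cols)

# Size of the same matrix once converted to a dense float64 array
dense_mib = n_rows * n_cols * 8 / 2**20

print(f"density: {density:.2%}, dense size: {dense_mib:.1f} MiB")
```

Fewer than 1% of the cells are non-zero, which is why the encoder returns a sparse matrix. Calling `.toarray()` trades that compactness for a dense array that is easier to place in a dataframe.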

In [None]:
# Putting the info from transformed_col to a dataframe

transformed_df = pd.DataFrame(transformed_col.toarray(),
                              columns=column_transformer.get_feature_names(),
                              index=cc_df.index)

# Inspect that it worked
transformed_df.head()

Unnamed: 0,"merchant_transform__x0_Altenwerth, Cartwright and Koss",merchant_transform__x0_Ankunding LLC,merchant_transform__x0_Auer-Mosciski,merchant_transform__x0_Auer-West,"merchant_transform__x0_Bahringer, Schoen and Corkery",merchant_transform__x0_Bailey-Morar,merchant_transform__x0_Barrows PLC,merchant_transform__x0_Bartoletti-Wunsch,merchant_transform__x0_Barton Inc,merchant_transform__x0_Bashirian Group,...,job_transform__x0_Surveyor,job_transform__x0_Teacher,job_transform__x0_Television production assistant,job_transform__x0_Television/film/video producer,job_transform__x0_Therapist,job_transform__x0_Trading standards officer,job_transform__x0_Transport planner,job_transform__x0_Tree surgeon,job_transform__x0_Veterinary surgeon,job_transform__x0_Warehouse manager
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Concat the transformed_df and original dataframe together

cc_df = pd.concat([cc_df, transformed_df], axis=1)
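
Passing `index=cc_df.index` when building `transformed_df` matters here, because `cc_df` has a non-contiguous index (5, 7, 12, ...) and `pd.concat` with `axis=1` aligns rows by index label. A small sketch with hypothetical frames shows what would happen otherwise:

```python
import pandas as pd

# Left frame with a non-contiguous index, like cc_df (hypothetical values)
left = pd.DataFrame({"amt": [94.63, 71.65, 71.22]}, index=[5, 7, 12])
encoded_values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Same index as the left frame: rows line up one to one
aligned = pd.DataFrame(encoded_values, columns=["cat_a", "cat_b"], index=left.index)

# Default index 0, 1, 2: no label matches the left frame
misaligned = pd.DataFrame(encoded_values, columns=["cat_a", "cat_b"])

good = pd.concat([left, aligned], axis=1)
bad = pd.concat([left, misaligned], axis=1)

print(good.shape, int(good.isna().sum().sum()))
print(bad.shape, int(bad.isna().sum().sum()))
```

With mismatched indexes, the result has extra rows filled with NaNs instead of raising an error, so the mistake is easy to miss without a shape or NaN check.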

Now that the encoded columns have been added, the original columns that were transformed will be dropped from the dataframe:

1. merchant
2. category
3. first
4. last
5. gender
6. state
7. job

In [None]:
# Drop the columns
cc_df.drop(['merchant', 'category', 'first', 'last', 'gender', 'state', 'job'], axis=1, inplace=True)

# Check that it worked
cc_df.head()

Unnamed: 0,cc_num,amt,street,city,zip,lat,long,city_pop,unix_time,merch_lat,...,job_transform__x0_Surveyor,job_transform__x0_Teacher,job_transform__x0_Television production assistant,job_transform__x0_Television/film/video producer,job_transform__x0_Therapist,job_transform__x0_Trading standards officer,job_transform__x0_Transport planner,job_transform__x0_Tree surgeon,job_transform__x0_Veterinary surgeon,job_transform__x0_Warehouse manager
5,4767265376804500,94.63,4655 David Island,Dublin,18917,40.375,-75.2045,2158,1325376248,40.653382,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,6011360759745864,71.65,231 Flores Pass Suite 720,Edinburg,22824,38.8432,-78.6003,6018,1325376308,38.948089,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,180042946491150,71.22,3337 Lisa Divide,Saint Petersburg,33710,27.7898,-82.7243,341043,1325376416,27.630593,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,3514865930894695,7.77,1632 Cohen Drive Suite 639,High Rolls Mountain Park,88325,32.9396,-105.8189,899,1325376543,32.863258,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,6011999606625827,3.26,870 Rocha Drive,Harrington Park,7640,40.9918,-73.98,4664,1325376560,41.831174,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Also drop the street and city since they are represented by other numeric columns

cc_df.drop(["street", "city"], axis=1, inplace=True)

# Check the shape of the new dataset
cc_df.shape

(128802, 783)

In [None]:
# See the final dataset

cc_df.head()

Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud,...,job_transform__x0_Surveyor,job_transform__x0_Teacher,job_transform__x0_Television production assistant,job_transform__x0_Television/film/video producer,job_transform__x0_Therapist,job_transform__x0_Trading standards officer,job_transform__x0_Transport planner,job_transform__x0_Tree surgeon,job_transform__x0_Veterinary surgeon,job_transform__x0_Warehouse manager
5,4767265376804500,94.63,18917,40.375,-75.2045,2158,1325376248,40.653382,-76.152667,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,6011360759745864,71.65,22824,38.8432,-78.6003,6018,1325376308,38.948089,-78.540296,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,180042946491150,71.22,33710,27.7898,-82.7243,341043,1325376416,27.630593,-82.308891,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,3514865930894695,7.77,88325,32.9396,-105.8189,899,1325376543,32.863258,-106.520205,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,6011999606625827,3.26,7640,40.9918,-73.98,4664,1325376560,41.831174,-74.335559,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Check to make sure there are no more categorical columns

cc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128802 entries, 5 to 1296642
Columns: 783 entries, cc_num to job_transform__x0_Warehouse manager
dtypes: float64(773), int64(10)
memory usage: 770.4 MB
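
The reported memory usage follows from the shape and dtypes. As a rough check, assuming 8 bytes per value for both `float64` and `int64` and an 8-byte integer index:

```python
# Shape reported by cc_df.info() above
n_rows, n_cols = 128802, 783
bytes_per_value = 8  # float64 and int64 both take 8 bytes

data_bytes = n_rows * n_cols * bytes_per_value
index_bytes = n_rows * 8  # the integer index

# pandas reports memory in units of 1024 * 1024 bytes
total_mib = (data_bytes + index_bytes) / 2**20
print(f"{total_mib:.1f}")
```

Most of this comes from the 768 one-hot columns, which only ever hold 0.0 or 1.0. If memory becomes a concern, one option is to pass a smaller `dtype` to `OneHotEncoder`; that is a possible variation, not something done in this notebook.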


In [None]:
# Save the new dataset to be used for modeling

cc_df.to_csv('model_ready_dataset.csv')
!cp model_ready_dataset.csv "/content/drive/MyDrive/Colab Notebooks/data"
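
Because `to_csv` writes the index as the first column, the file should be read back with `index_col=0` (as was done at the top of this notebook) to restore the original row labels. A minimal round-trip sketch with a hypothetical toy frame and a temporary file:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index (hypothetical values)
toy_df = pd.DataFrame({"amt": [94.63, 71.65], "is_fraud": [0, 0]}, index=[5, 7])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model_ready.csv")
    toy_df.to_csv(path)  # the index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)  # restore it on load

print(reloaded.index.tolist(), reloaded.columns.tolist())
```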

# Summary

This notebook focused on transforming the categorical columns in the cleaned dataset to numeric columns for the purpose of modeling.

In the next notebook, **'Tolulope_Oludemi - Notebook #3 - Modeling'**, models will be trained on the dataset, and the best model will be tuned to improve its performance. The next notebook will also discuss insights and summarize the findings, along with practical applications and next steps.