# 1. Goal

E-commerce websites often process large volumes of money. Wherever large sums move, there is a high risk of fraudulent activity, such as the use of stolen credit cards or money laundering.
The goal here is to build a machine learning model that predicts the probability that a new user's first transaction is fraudulent.

In [1]:
import pandas as pd

In [2]:
df_ip = pd.read_csv('IpAddress_to_Country.csv')

In [3]:
df = pd.read_csv('Fraud_Data.csv')

# 2. Data cleaning

In this step, we get an overview of the input: the column types, value distributions, number of rows, etc.
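
A minimal sketch of such an overview, assuming a small stand-in table built from a few of the `Fraud_Data.csv` rows displayed in this notebook (the name `sample` is illustrative, not part of the original analysis):

```python
import pandas as pd

# Tiny stand-in for Fraud_Data.csv, copied from rows displayed by df.head().
sample = pd.DataFrame({
    'user_id': [22058, 333320, 1359],
    'signup_time': ['2015-02-24 22:55:49', '2015-06-07 20:39:50', '2015-01-01 18:52:44'],
    'purchase_time': ['2015-04-18 02:47:11', '2015-06-08 01:38:54', '2015-01-01 18:52:45'],
    'purchase_value': [34, 16, 15],
    'source': ['SEO', 'Ads', 'SEO'],
    'class': [0, 0, 1],
})

n_rows, n_cols = sample.shape          # length and width of the table
column_types = sample.dtypes           # dtype of each column
numeric_summary = sample.describe()    # count, mean, std and quantiles of numeric columns
class_share = sample['class'].value_counts(normalize=True)  # share of each label

print(n_rows, n_cols)
print(column_types)
print(numeric_summary)
print(class_share)
```

The same calls can be run on the full `df`; the label share is worth checking early because fraud labels are typically imbalanced.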

In [4]:
df.head()

Unnamed: 0,"""user_id""",signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0


In [15]:
df = df.rename(columns = {' "user_id"':'user_id'})
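
Instead of renaming one column by its exact raw name, the header can be cleaned in a single pass. The sketch below assumes the debris is limited to surrounding spaces and double quotes; the `raw` frame is a hypothetical stand-in for the loaded data:

```python
import pandas as pd

# Hypothetical header with the kind of debris seen above: stray spaces and quotes.
raw = pd.DataFrame({' "user_id"': [22058, 333320], 'signup_time': ['a', 'b']})

# Strip surrounding whitespace first, then surrounding double quotes.
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.strip('"')

print(list(cleaned.columns))
```

This avoids depending on the exact spelling of the raw column name, which is easy to get wrong when it contains invisible whitespace.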


In [20]:
df['ip_address']

0         7.327584e+08
1         3.503114e+08
2         2.621474e+09
3         3.840542e+09
4         4.155831e+08
5         2.809315e+09
6         3.987484e+09
7         1.692459e+09
8         3.719094e+09
9         3.416747e+08
10        1.819009e+09
11        4.038285e+09
12        4.161541e+09
13        3.178510e+09
14        4.203488e+09
15        9.957328e+08
16        3.503883e+09
17        1.314238e+05
18        3.037372e+09
19        1.044590e+09
20        3.847612e+09
21        3.836794e+09
22        1.008391e+09
23        3.442658e+09
24        1.120619e+09
25        1.752167e+09
26        7.458239e+08
27        1.799141e+09
28        6.987000e+08
29        2.836025e+09
              ...     
151082    3.067794e+09
151083    3.345437e+09
151084    3.995197e+09
151085    2.345685e+09
151086    9.549052e+08
151087    2.483980e+09
151088    1.448751e+09
151089    1.966801e+09
151090    2.086092e+09
151091    2.882102e+09
151092    1.353476e+09
151093    3.097327e+09
151094    1

In [26]:
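country = []
def match_country(row):
    ip = row['ip_address']
    # The bounds are inclusive, so compare with <= and >=.
    in_range = (df_ip['lower_bound_ip_address'] <= ip) & (df_ip['upper_bound_ip_address'] >= ip)
    # Appends a Series (one element on a match, empty otherwise), not a plain string.
    country.append(df_ip[in_range]['country'])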

In [27]:
df.apply(match_country, axis=1)

0         None
1         None
2         None
3         None
4         None
5         None
6         None
7         None
8         None
9         None
10        None
11        None
12        None
13        None
14        None
15        None
16        None
17        None
18        None
19        None
20        None
21        None
22        None
23        None
24        None
25        None
26        None
27        None
28        None
29        None
          ... 
151082    None
151083    None
151084    None
151085    None
151086    None
151087    None
151088    None
151089    None
151090    None
151091    None
151092    None
151093    None
151094    None
151095    None
151096    None
151097    None
151098    None
151099    None
151100    None
151101    None
151102    None
151103    None
151104    None
151105    None
151106    None
151107    None
151108    None
151109    None
151110    None
151111    None
Length: 151112, dtype: object

In [30]:
df['country'] = country

In [48]:
country[0].values[0]

'Japan'
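
The check above shows that each entry of `country` is a Series, so the name has to be unpacked with `.values[0]`. A possible alternative is to return a plain value from the function and let `apply` build the column directly. The sketch below uses hypothetical miniature tables (`ranges`, `users`) and a hypothetical helper `lookup_country`; the values are illustrative only:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; values are illustrative only.
ranges = pd.DataFrame({
    'lower_bound_ip_address': [100, 200, 300],
    'upper_bound_ip_address': [199, 299, 399],
    'country': ['A', 'B', 'C'],
})
users = pd.DataFrame({'ip_address': [150.0, 299.0, 50.0]})

def lookup_country(row, table=ranges):
    """Return the country whose inclusive range contains the row's IP, else None."""
    ip = row['ip_address']
    hit = table[(table['lower_bound_ip_address'] <= ip)
                & (table['upper_bound_ip_address'] >= ip)]
    return hit['country'].iloc[0] if len(hit) == 1 else None

# apply collects the returned scalars into a Series aligned with users.
users['country'] = users.apply(lookup_country, axis=1)
print(users)
```

Because the function returns a value, no global list is needed and the result stays aligned with the dataframe index.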

In [22]:
df_ip.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [23]:
df_ip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
lower_bound_ip_address    138846 non-null float64
upper_bound_ip_address    138846 non-null int64
country                   138846 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


Let's match each IP address to a country name and add it as a feature to our main dataframe.

In [None]:
df['country'] = '-'
for i in range(len(df)):
    ip_address = df.loc[i, 'ip_address']
    # The bounds are inclusive, so compare with <= and >=.
    match = df_ip[(df_ip['lower_bound_ip_address'] <= ip_address) & (df_ip['upper_bound_ip_address'] >= ip_address)]
    if len(match) == 1:
        # Assign with a single .loc[row, column] call so the write lands in df itself, not in a temporary copy.
        df.loc[i, 'country'] = match.iloc[0]['country']
  


In [11]:
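# Use an explicit 64-bit integer type: values above 2**31 - 1 do not fit in int32,
# which is what 'int' maps to on some platforms.
df['ip_address'] = df['ip_address'].astype('int64')
df_ip['lower_bound_ip_address'] = df_ip['lower_bound_ip_address'].astype('int64')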

In [12]:
df_ip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
lower_bound_ip_address    138846 non-null int64
upper_bound_ip_address    138846 non-null int64
country                   138846 non-null object
dtypes: int64(2), object(1)
memory usage: 3.2+ MB
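
With both bound columns stored as integers, the row-by-row IP-to-country match can also be done in one vectorized step, which avoids scanning the whole range table for every user. The sketch below is one possible approach using `np.searchsorted`, not the method used in this notebook. It assumes the ranges do not overlap, and the miniature tables `ranges` and `users` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables; values are illustrative only.
ranges = pd.DataFrame({
    'lower_bound_ip_address': [100, 200, 400],
    'upper_bound_ip_address': [199, 299, 499],
    'country': ['A', 'B', 'C'],
})
users = pd.DataFrame({'ip_address': [150, 299, 350, 50, 499]})

# Sort once by lower bound so that binary search is valid.
ranges = ranges.sort_values('lower_bound_ip_address').reset_index(drop=True)
lower = ranges['lower_bound_ip_address'].to_numpy()
upper = ranges['upper_bound_ip_address'].to_numpy()
ips = users['ip_address'].to_numpy()

# Index of the last range whose lower bound is <= ip (-1 if there is none).
idx = np.searchsorted(lower, ips, side='right') - 1
safe_idx = np.clip(idx, 0, None)   # keeps indexing in bounds for the -1 case
in_range = (idx >= 0) & (ips <= upper[safe_idx])

# '-' marks IPs outside every range, matching the placeholder used above.
users['country'] = np.where(in_range, ranges['country'].to_numpy()[safe_idx], '-')
print(users)
```

Each lookup is a binary search over the sorted lower bounds followed by a single upper-bound check, so the cost grows with the logarithm of the range table size rather than its full length.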
