## Fraud detection data challenge. There are 4 tasks:
#### 1) For each user, determine her country based on the numeric IP address.
#### 2) Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.
#### 3) Your boss is a bit worried about using a model she doesn't understand for something as important as fraud detection. How would you explain to her how the model is making its predictions? Not from a mathematical perspective (she couldn't care less about that), but from a user perspective. What kinds of users are more likely to be classified as at risk? What are their characteristics?
#### 4) Let's say you now have this model which can be used live to predict in real time if an activity is fraudulent or not. From a product perspective, how would you use it? That is, what kind of different user experiences would you build based on the model output?

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [54]:
# Load the transaction data and the IP-band-to-country lookup table
fraud = pd.read_csv('Fraud_Data.csv')
ip = pd.read_csv('IpAddress_to_Country.csv')

In [4]:
fraud.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0


In [25]:
fraud.info()  # large dataset (151,112 rows); every column is fully non-null

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
user_id           151112 non-null int64
signup_time       151112 non-null object
purchase_time     151112 non-null object
purchase_value    151112 non-null int64
device_id         151112 non-null object
source            151112 non-null object
browser           151112 non-null object
sex               151112 non-null object
age               151112 non-null int64
ip_address        151112 non-null float64
class             151112 non-null int64
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [26]:
fraud.isnull().sum() #no missing values

user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
dtype: int64

In [9]:
ip.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [27]:
ip.isnull().sum() #no missing values

lower_bound_ip_address    0
upper_bound_ip_address    0
country                   0
dtype: int64

In [13]:
ip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
lower_bound_ip_address    138846 non-null float64
upper_bound_ip_address    138846 non-null int64
country                   138846 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


#### 1) For each user, determine her country based on the numeric IP address.

In [28]:
# Sanity check: select the single band that covers a known range (the first band, Australia)
# before applying the same comparison to the whole dataset
ip[(ip['lower_bound_ip_address'] <= 16777216) & (ip['upper_bound_ip_address'] >= 16777471)]

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia


#### Build a dictionary that maps each (lower bound, upper bound) IP band to its country, so it can be applied to the larger dataset. A country owns many IP bands and a dictionary cannot hold duplicate keys, so the band is the key and the country name is the value.

In [23]:
ip_country_dict = {}
# Key: (lower bound, upper bound) of an IP band; value: that band's country
for lower, upper, country in zip(ip['lower_bound_ip_address'],
                                 ip['upper_bound_ip_address'],
                                 ip['country']):
    ip_country_dict[(lower, upper)] = country

In [30]:
ip_country_dict #just checking the dictionary

{(3639320576.0, 3639324671): 'United States',
 (3334992384.0, 3334992895): 'United States',
 (3560595456.0, 3560603647): 'Georgia',
 (3562176512.0, 3562184703): 'France',
 (3423431680.0, 3423432703): 'United States',
 (3448020992.0, 3448023039): 'United States',
 (3450296832.0, 3450297343): 'United States',
 (3664199680.0, 3664216063): 'Hong Kong',
 (3477143552.0, 3477209087): 'United States',
 (783065088.0, 783073279): 'Russian Federation',
 (1542367232.0, 1542368255): 'Russian Federation',
 (3663990272.0, 3663990527): 'Viet Nam',
 (625999872.0, 627048447): 'Germany',
 (3326017280.0, 3326017535): 'United States',
 (2707292160.0, 2707357695): 'Japan',
 (3332874240.0, 3332874495): 'Canada',
 (3234347264.0, 3234347519): 'United States',
 (3355262464.0, 3355262719): 'Canada',
 (3333681920.0, 3333682175): 'United States',
 (3389413376.0, 3389413887): 'China',
 (621993984.0, 621998079): 'Slovenia',
 (3407489024.0, 3407489279): 'Australia',
 (2680487936.0, 2680553471): 'United Kingdom',
 (33

In [60]:
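# Map the dictionary back onto the original dataset
def map_country(row, ip_country_dict):
    """Return the country whose IP band contains row['ip_address'], or None if no band matches."""
    ip_address = row['ip_address']
    # Linear scan over every band: simple, but slow when applied to every row of a large dataset
    for (lower, upper), country in ip_country_dict.items():
        if lower <= ip_address <= upper:
            return country
    return None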

In [61]:
fraud['country'] = fraud.apply(lambda row : map_country(row, ip_country_dict), axis = 1)
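
The row-by-row scan in `map_country` compares each address against every band, which is slow with roughly 151k rows and 139k bands. As a possible alternative, here is a minimal sketch of a binary-search lookup with `np.searchsorted`. It assumes the bands do not overlap; the function name `lookup_country` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def lookup_country(ip_values, ip_table):
    """Return one country per IP via binary search; None where no band matches.

    Assumes the bands in ip_table do not overlap.
    """
    bands = ip_table.sort_values('lower_bound_ip_address').reset_index(drop=True)
    lower = bands['lower_bound_ip_address'].to_numpy()
    upper = bands['upper_bound_ip_address'].to_numpy()
    country = bands['country'].to_numpy(dtype=object)
    values = np.asarray(ip_values, dtype=float)

    # Position of the last band whose lower bound is <= the IP (-1 if there is none)
    idx = np.searchsorted(lower, values, side='right') - 1
    safe_idx = np.clip(idx, 0, len(bands) - 1)
    # The IP belongs to that band only if it also sits at or below the upper bound
    in_band = (idx >= 0) & (values <= upper[safe_idx])
    matched = np.where(in_band, country[safe_idx], None)
    return pd.Series(matched, index=getattr(ip_values, 'index', None))

# Toy data (hypothetical): three bands with a gap between the second and third
toy_ip = pd.DataFrame({
    'lower_bound_ip_address': [16777216.0, 16777472.0, 16778240.0],
    'upper_bound_ip_address': [16777471, 16777727, 16779263],
    'country': ['Australia', 'China', 'Australia'],
})
toy_addresses = pd.Series([16777300.5, 16777500.0, 16778000.0, 1.0])
toy_result = lookup_country(toy_addresses, toy_ip)
print(toy_result)
```

On the real data this would be called as `fraud['country'] = lookup_country(fraud['ip_address'], ip)`, and it should give the same answers as `map_country` as long as the no-overlap assumption holds.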

In [62]:
fraud.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1,United States
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States
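
Row 3 in the output above has an empty country, which means its IP address fell outside every band in the lookup table. A minimal sketch of how such rows might be counted and labelled follows. The toy frame and the `'Unknown'` label are assumptions for illustration, not something the original notebook does.

```python
import pandas as pd

# Toy stand-in for the mapped frame (hypothetical values); None marks an unmapped IP
toy = pd.DataFrame({
    'ip_address': [732758400.0, 3840542000.0, 415583100.0],
    'country': ['Japan', None, 'United States'],
})

# Count the rows whose IP matched no band
n_unmapped = int(toy['country'].isna().sum())
print(n_unmapped)

# Assumption: keep these rows and give them an explicit label instead of dropping them
toy['country'] = toy['country'].fillna('Unknown')
print(toy)
```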


In [32]:
fraud.ip_address[1]

350311387.86590803
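
The stored value is a float with a fractional part, whereas a numeric IPv4 address is normally a whole number between 0 and 2^32 - 1. To make the encoding concrete, the sketch below converts a numeric value to dotted-quad form with the standard-library `ipaddress` module. Truncating the fraction is an assumption about this dataset, and `numeric_to_dotted` is a hypothetical helper.

```python
import ipaddress

def numeric_to_dotted(ip_value):
    """Convert a numeric IPv4 value to dotted-quad text, dropping any fractional part."""
    # Assumption: the fractional part carries no meaning, so truncation is acceptable
    return str(ipaddress.IPv4Address(int(ip_value)))

print(numeric_to_dotted(16777216.0))  # 1.0.0.0, the lower bound of the first band
print(numeric_to_dotted(350311387.86590803))
```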

#### 2) Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.