# Task 4.10 - Part 1

##### This script contains the following points:

#### Import libraries

#### 1. Import data

#### 2. Consider any security implications that might exist for this new data. Address any PII data in the data.

#### 3a. Create 'regions' column out of 'states' column

#### 3b. Create crosstab to determine possible difference in spending habits between U.S. regions

#### 4. Create exclusion flag for low-activity customers. Exclude this group from the data. Export sample. 

##### ...continued in next notebook

# ________________________________________________________

### Import libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

### 1. Import data

In [2]:
# Set path
path = r'/Users/mainframe/Documents/Instacart Basket Analysis'

In [3]:
df = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'IC_data.pkl'))

### 2. Consider any security implications that might exist for this new data. Address any PII data in the data.

##### In the previous task I dropped the customers' first and last names, mainly to reduce the size of the data frame. Had I not done so then, I would drop them now to keep PII secure. The user_id column is still attached to the rest of the data, so the records can still be analyzed effectively without the names.
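
As a rough sketch of how that PII step could be repeated defensively, the cell below drops name columns only if they are still present and then checks that none survived. The column names `first_name` and `last_name` and the toy `customers` frame are assumptions for illustration; they are not taken from the real data frame.

```python
import pandas as pd

# Hypothetical PII column names; the real frame may label them differently
pii_columns = ['first_name', 'last_name']

# Toy frame standing in for the customer data (made-up values)
customers = pd.DataFrame({
    'user_id': [1, 2],
    'first_name': ['Ann', 'Bob'],
    'last_name': ['Lee', 'Ray'],
    'state': ['Alabama', 'Maine'],
})

# errors='ignore' skips any listed column that has already been dropped
customers_safe = customers.drop(columns=pii_columns, errors='ignore')

# Confirm that no PII column survived
remaining = [col for col in pii_columns if col in customers_safe.columns]
print(remaining)
```

Because of `errors='ignore'`, the same cell can be run whether or not the names were already removed in an earlier task.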

### 3a. Create 'regions' column out of 'states' column

In [4]:
# Check for uniformity in names of states 
df['state'].value_counts(dropna = False)

state
Pennsylvania            667082
California              659783
Rhode Island            656913
Georgia                 656389
New Mexico              654494
Arizona                 653964
North Carolina          651900
Oklahoma                651739
Alaska                  648495
Minnesota               647825
Massachusetts           646358
Wyoming                 644255
Virginia                641421
Missouri                640732
Texas                   640394
Colorado                639280
Maine                   638583
North Dakota            638491
Alabama                 638003
Kansas                  637538
Louisiana               637482
Delaware                637024
South Carolina          636754
Oregon                  636425
Arkansas                636144
Nevada                  636139
New York                635983
Montana                 635265
South Dakota            633772
Illinois                633024
Hawaii                  632901
Washington              632852
Mi

In [5]:
# Define regions
Northeast = [
    'Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'New Jersey', 'New York', 'Pennsylvania', 'Rhode Island', 'Vermont'
]
Midwest = [
    'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin'
]
South = [
    'Alabama', 'Arkansas', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Kentucky', 'Louisiana', 'Maryland', 'Mississippi', 'North Carolina', 'Oklahoma', 'South Carolina', 'Tennessee', 'Texas', 'Virginia', 'West Virginia'
]
West = [
    'Alaska', 'Arizona', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Oregon', 'Utah', 'Washington', 'Wyoming'
]

In [6]:
# Function to determine region
def get_region(state):
    if state in Northeast:
        return 'Northeast'
    elif state in Midwest:
        return 'Midwest'
    elif state in South:
        return 'South'
    elif state in West:
        return 'West'
    else:
        return 'Unknown'

In [7]:
# Apply get_region to each state to create the 'region' column
df['region'] = df['state'].apply(get_region)

In [8]:
# Perform value counts of new labels for regions
df['region'].value_counts(dropna = False)

region
South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: count, dtype: int64
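
The region assignment above could also be written as a dictionary lookup with `Series.map`, which avoids calling a Python function once per row and may be noticeably faster on a frame of this size. This is only a sketch: the region lists are abbreviated, and the `toy` frame with its 'Atlantis' row is made up to show the 'Unknown' fallback.

```python
import pandas as pd

# Abbreviated lists for illustration; the full lists are defined in cell In [5]
regions = {
    'Northeast': ['Maine', 'New York'],
    'South': ['Alabama', 'Texas'],
}

# Invert the lists into a single state -> region lookup
state_to_region = {
    state: region
    for region, states in regions.items()
    for state in states
}

# Toy frame; 'Atlantis' stands in for a misspelled or missing state name
toy = pd.DataFrame({'state': ['Maine', 'Texas', 'Atlantis']})

# map() returns NaN for states not in the lookup; label those 'Unknown'
toy['region'] = toy['state'].map(state_to_region).fillna('Unknown')
print(toy)
```

The `fillna('Unknown')` call plays the same role as the final `else` branch in `get_region`.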

### 3b. Create crosstab to determine possible difference in spending habits between U.S. regions

In [9]:
# Create crosstab
crosstab = pd.crosstab(df['region'], df['spending_flag'], dropna = False)

In [10]:
print(crosstab)

spending_flag  High spender  Low spender
region                                  
Midwest              155975      7441350
Northeast            108225      5614511
South                209691     10582194
West                 160354      8132559
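
Since the regions differ a lot in size, the raw counts are hard to compare directly. The sketch below rebuilds the crosstab from the counts printed above and converts each row to within-region shares; on the full data frame, passing `normalize = 'index'` to `pd.crosstab` should give the same proportions.

```python
import pandas as pd

# Counts copied from the crosstab output above
counts = pd.DataFrame(
    {
        'High spender': [155975, 108225, 209691, 160354],
        'Low spender': [7441350, 5614511, 10582194, 8132559],
    },
    index=['Midwest', 'Northeast', 'South', 'West'],
)

# Divide each row by its total to get the share of each spending flag per region
shares = counts.div(counts.sum(axis=1), axis=0)

# Percentage of rows flagged 'High spender' in each region
high_pct = (shares['High spender'] * 100).round(2)
print(high_pct)
```

By these counts the share of high-spender rows is roughly 2% in every region, highest in the Midwest and lowest in the Northeast, so any regional difference in spending habits looks small.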


### 4. Create exclusion flag for low-activity customers. Exclude this group from the data. Export sample. 

In [11]:
# Create exclusion flag
df.loc[df['max_order'] < 5, 'exclusion_flag'] = 'Low-activity'
df.loc[df['max_order'] >= 5, 'exclusion_flag'] = 'Acceptable'

In [12]:
# Check the distribution of values in the new column
df['exclusion_flag'].value_counts(dropna = False)

exclusion_flag
Acceptable      30964564
Low-activity     1440295
Name: count, dtype: int64
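
The two `.loc` assignments in cell In [11] could be collapsed into a single `np.where` call. A minimal sketch on a made-up `toy` frame follows; the threshold of 5 orders is the one used above.

```python
import numpy as np
import pandas as pd

# Toy frame with one customer below, at, and above the threshold
toy = pd.DataFrame({'max_order': [3, 5, 12]})

# Fewer than 5 orders -> 'Low-activity', everything else -> 'Acceptable'
toy['exclusion_flag'] = np.where(
    toy['max_order'] < 5, 'Low-activity', 'Acceptable'
)
print(toy)
```

One difference to keep in mind: a missing `max_order` would be labeled 'Acceptable' here, because `NaN < 5` evaluates to False, whereas the two `.loc` assignments would leave the flag as NaN for such rows.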

In [13]:
# Exclude 'Low-activity' customers from the data
df_acc = df[df['exclusion_flag'] == 'Acceptable']

In [14]:
df_acc.head()

| | order_id | user_id | order_number | orders_day_of_week | hour_ordered | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | ... | gender | state | age | date_joined | dependents | marital | income | _merge | region | exclusion_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | | 196 | 1 | 0 | Soda | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Acceptable |
| 1 | 2539329 | 1 | 1 | 2 | 8 | | 14084 | 2 | 0 | Organic Unsweetened Vanilla Almond Milk | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Acceptable |
| 2 | 2539329 | 1 | 1 | 2 | 8 | | 12427 | 3 | 0 | Original Beef Jerky | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Acceptable |
| 3 | 2539329 | 1 | 1 | 2 | 8 | | 26088 | 4 | 0 | Aged White Cheddar Popcorn | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Acceptable |
| 4 | 2539329 | 1 | 1 | 2 | 8 | | 26405 | 5 | 0 | XL Pick-A-Size Paper Towel Rolls | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Acceptable |


In [15]:
# Check for excluded rows
df_acc['exclusion_flag'].value_counts(dropna = False)

exclusion_flag
Acceptable    30964564
Name: count, dtype: int64

In [16]:
# Drop unnecessary columns
df_ac = df_acc.drop(columns = ['_merge', 'exclusion_flag'])

In [17]:
# Export filtered data frame
df_ac.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'acceptable_customers.pkl'))

#### ...continued in next notebook