# Walmart Trip Type Classification - Data Wrangling
### Capstone Project - 1
### By: *Rajesh Dharmarajan*
******************
_This is my first capstone project for the Springboard Career Track. It is a classification problem that uses data from Walmart to predict the trip type, and it originally comes from a Kaggle competition._

**Overview**


Walmart uses both art and science to continually make progress on their core mission of better understanding and serving their customers. One way Walmart is able to improve customers' shopping experiences is by segmenting their store visits into different trip types. 

Whether they're on a last minute run for new puppy supplies or leisurely making their way through a weekly grocery list, classifying trip types enables Walmart to create the best shopping experience for every customer.

Currently, Walmart's trip types are created from a combination of existing customer insights ("art") and purchase history data ("science"). 

The challenge here is to classify customer trips using only a transactional dataset of the items they've purchased.

https://www.kaggle.com/c/walmart-recruiting-trip-type-classification


In [1]:
import pandas as pd
import tkinter
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
%matplotlib tk

**Read the file**

In [2]:
walmart_raw = pd.read_csv('wm_train.csv')

**Data fields**

* TripType - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.
* VisitNumber - an id corresponding to a single trip by a single customer
* Weekday - the weekday of the trip
* Upc - the UPC number of the product purchased
* ScanCount - the number of the given item that was purchased. A negative value indicates a product return.
* DepartmentDescription - a high-level description of the item's department
* FinelineNumber - a more refined category for each of the products, created by Walmart
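
As a small illustration of the ScanCount convention described above (a negative value is a return), the sketch below splits purchases from returns on a toy frame. The rows are made up, and the `Purchased` and `Returned` column names are hypothetical helpers, not fields of the Walmart data.

```python
import pandas as pd

# Toy rows using column names from the field list; the values are invented
toy = pd.DataFrame({
    "VisitNumber": [5, 7, 7, 8],
    "ScanCount": [-1, 1, 1, 2],
})

# Positive counts are purchases; negative counts are returns (per the field description)
toy["Purchased"] = toy["ScanCount"].clip(lower=0)
toy["Returned"] = (-toy["ScanCount"]).clip(lower=0)

# Hypothetical per-visit summary of items bought and items returned
summary = toy.groupby("VisitNumber")[["Purchased", "Returned"]].sum()
print(summary)
```

Keeping the two quantities in separate columns avoids returns cancelling out purchases when counts are summed per visit.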

In [3]:
walmart_raw.head()

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
0,999,5,Friday,68113150000.0,-1,FINANCIAL SERVICES,1000.0
1,30,7,Friday,60538820000.0,1,SHOES,8931.0
2,30,7,Friday,7410811000.0,1,PERSONAL CARE,4504.0
3,26,8,Friday,2238404000.0,2,PAINT AND ACCESSORIES,3565.0
4,26,8,Friday,2006614000.0,2,PAINT AND ACCESSORIES,1017.0


In [4]:
walmart_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647054 entries, 0 to 647053
Data columns (total 7 columns):
TripType                 647054 non-null int64
VisitNumber              647054 non-null int64
Weekday                  647054 non-null object
Upc                      642925 non-null float64
ScanCount                647054 non-null int64
DepartmentDescription    645693 non-null object
FinelineNumber           642925 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 34.6+ MB


There are rows where neither Upc nor DepartmentDescription contains a value. Remove these rows from the
data since, without knowing what the customer bought or returned, the trip cannot be classified.

In [42]:
discard_data = walmart_raw[(walmart_raw.Upc.isnull())&(walmart_raw.DepartmentDescription.isnull())]

In [30]:
len(discard_data)

1361

In [31]:
discard_data[discard_data.Upc.isnull()]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
25,26,8,Friday,,1,,
548,27,259,Friday,,3,,
549,27,259,Friday,,1,,
959,999,409,Friday,,-1,,
1116,39,479,Friday,,1,,
1134,999,484,Friday,,-2,,
1135,999,484,Friday,,-2,,
1926,32,845,Friday,,1,,
1927,32,845,Friday,,1,,
2294,40,1004,Friday,,1,,


In [32]:
walmart_cleanse1 = walmart_raw[(walmart_raw.Upc.notnull())|(walmart_raw.DepartmentDescription.notnull())]

In [33]:
len(walmart_cleanse1)

645693
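
The two filters above should split the data cleanly: the 1361 discarded rows plus the 645693 kept rows add up to the original 647054. A minimal sketch of that sanity check on a made-up toy frame (the values are invented; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is missing both fields, row 2 is missing only the UPC
toy = pd.DataFrame({
    "Upc": [68113150000.0, np.nan, np.nan, 7410811000.0],
    "DepartmentDescription": ["FINANCIAL SERVICES", None, "PHARMACY RX", "PERSONAL CARE"],
})

both_missing = toy.Upc.isnull() & toy.DepartmentDescription.isnull()
either_present = toy.Upc.notnull() | toy.DepartmentDescription.notnull()

# The two masks are exact complements, so no row is lost or counted twice
assert (both_missing == ~either_present).all()

discarded = toy[both_missing]
kept = toy[either_present]
assert len(discarded) + len(kept) == len(toy)
print(len(discarded), len(kept))  # 1 3
```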

In [36]:
walmart_cleanse1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 645693 entries, 0 to 647053
Data columns (total 7 columns):
TripType                 645693 non-null int64
VisitNumber              645693 non-null int64
Weekday                  645693 non-null object
Upc                      642925 non-null float64
ScanCount                645693 non-null int64
DepartmentDescription    645693 non-null object
FinelineNumber           642925 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 39.4+ MB


In [50]:
# Only the last expression in a cell is displayed, so the output below is for VisitNumber 521
walmart_cleanse1[walmart_cleanse1.VisitNumber == 496]
walmart_cleanse1[walmart_cleanse1.VisitNumber == 521]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
1216,5,521,Friday,,1,PHARMACY RX,


In [80]:
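# TripType 5 rows whose department name contains a '/' followed by a character other than P, H, A or R
walmart_cleanse1[(walmart_cleanse1.TripType==5) & (walmart_cleanse1.DepartmentDescription.str.contains('/[^PHAR]',regex=True))]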

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
5067,5,1903,Friday,1.980004e+09,1,HOUSEHOLD CHEMICALS/SUPP,1555.0
5069,5,1903,Friday,2.340000e+09,1,HOUSEHOLD CHEMICALS/SUPP,1555.0
15375,5,4998,Friday,1.920000e+09,2,HOUSEHOLD CHEMICALS/SUPP,3515.0
23794,5,7484,Saturday,4.650072e+09,1,HOUSEHOLD CHEMICALS/SUPP,8920.0
29649,5,9144,Saturday,3.700091e+09,1,HOUSEHOLD CHEMICALS/SUPP,8945.0
30546,5,9346,Saturday,2.340001e+09,2,HOUSEHOLD CHEMICALS/SUPP,1555.0
39314,5,11676,Saturday,1.980004e+09,1,HOUSEHOLD CHEMICALS/SUPP,1555.0
39318,5,11676,Saturday,1.980004e+09,1,HOUSEHOLD CHEMICALS/SUPP,1555.0
40549,5,12028,Saturday,7.533883e+09,1,SLEEPWEAR/FOUNDATIONS,1260.0
45965,5,13588,Sunday,3.500053e+09,1,HOUSEHOLD CHEMICALS/SUPP,55.0


In [81]:
count_series = walmart_cleanse1.Upc.value_counts()

In [113]:
count_series

[Float64Index([88786030657.0,  1305139482.0,  4269428770.0,  7644086478.0,
                2471931796.0, 88614490328.0,  2856272104.0, 88589800008.0,
               88711764844.0,  2856277258.0,
               ...
                8511845417.0, 68210220252.0,  2898194492.0,  1675132038.0,
               88882319989.0, 84374709043.0,  7644011360.0,  1119230643.0,
               76705282388.0,  8669413409.0],
              dtype='float64', length=37080)]

In [97]:
# Keep only the UPCs that appear exactly once
count_series = count_series[count_series < 2]

In [123]:
walmart_cleanse1.Upc[0:5]

0    6.811315e+10
1    6.053882e+10
2    7.410811e+09
3    2.238404e+09
4    2.006614e+09
Name: Upc, dtype: float64