
## Description
Walmart uses both art and science to continually make progress on their core mission of better understanding and serving their customers. One way Walmart is able to improve customers' shopping experiences is by segmenting their store visits into different trip types.

Whether they're on a last-minute run for new puppy supplies or leisurely making their way through a weekly grocery list, classifying trip types enables Walmart to create the best shopping experience for every customer.

Currently, Walmart's trip types are created from a combination of existing customer insights ("art") and purchase history data ("science"). In their third recruiting competition, Walmart is challenging Kagglers to focus on the (data) science and classify customer trips using only a transactional dataset of the items they've purchased. Improving the science behind trip type classification will help Walmart refine their segmentation process.

Walmart is hosting this competition to connect with data scientists who break the mold.

## Evaluation 

Submissions are evaluated using the multi-class logarithmic loss. For each visit, you must submit a set of predicted probabilities (one for every TripType). The formula is:

\\[ -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}), \\]

where \\(N\\) is the number of visits in the test set, \\(M\\) is the number of trip types, \\(\log\\) is the natural logarithm, \\(y_{ij}\\) is 1 if observation \\(i\\) is of class \\(j\\) and 0 otherwise, and \\(p_{ij}\\) is the predicted probability that observation \\(i\\) belongs to class \\(j\\).

The submitted probabilities for a given visit are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with \\(\max(\min(p, 1-10^{-15}), 10^{-15})\\).
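
The metric described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the competition's official scorer: the function name `multiclass_log_loss`, the toy inputs, and the order of operations (rescale first, then clip) are assumptions made for the example.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss with row rescaling and clipping.

    y_true: integer class indices, shape (N,)
    probs:  predicted scores, shape (N, M); rows need not sum to one
    """
    y_true = np.asarray(y_true, dtype=int)
    probs = np.asarray(probs, dtype=float)
    # Rescale each row by its sum (assumed to happen before clipping).
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip to [eps, 1 - eps] to avoid the extremes of the log function.
    probs = np.clip(probs, eps, 1 - eps)
    # y_ij is one-hot, so only the true-class probability contributes.
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# Toy example: 3 visits, 3 classes; the first row does not sum to one.
y_true = np.array([0, 2, 1])
probs = np.array([[2.0, 1.0, 1.0],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.6, 0.2]])
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

After rescaling, the true-class probabilities in the toy example are 0.5, 0.8 and 0.6, so the loss is the mean of their negative logarithms.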

## Data Description

For this competition, you are tasked with categorizing shopping trip types based on the items that customers purchased. To give a few hypothetical examples of trip types: a customer may make a small daily dinner trip, a weekly large grocery trip, a trip to buy gifts for an upcoming holiday, or a seasonal trip to buy clothes.

Walmart has categorized the trips contained in this data into 38 distinct types using a proprietary method applied to an extended set of data. You are challenged to recreate this categorization/clustering with a more limited set of features. This could provide new and more robust ways to categorize trips.

The training set (train.csv) contains a large number of customer visits with the TripType included. You must predict the TripType for each customer visit in the test set (test.csv). Each visit may only have one TripType. You will not be provided with more information than what is given in the data (e.g. what the TripTypes represent or more product information).

### Data fields

TripType - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.

VisitNumber - an id corresponding to a single trip by a single customer

Weekday - the weekday of the trip

Upc - the UPC number of the product purchased

ScanCount - the number of the given item that was purchased. A negative value indicates a product return.

DepartmentDescription - a high-level description of the item's department

FinelineNumber - a more refined category for each of the products, created by Walmart
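
Because the data has one row per scanned item while the label applies to a whole visit, it can help to look at the data per `VisitNumber`. The sketch below is one possible way to do that, using a small made-up frame with the column names listed above. The derived column names (`items_bought`, `items_returned`, `n_departments`) are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy rows using the documented column names; the values are illustrative.
rows = pd.DataFrame({
    "TripType": [999, 30, 30, 26, 26],
    "VisitNumber": [5, 7, 7, 8, 8],
    "ScanCount": [-1, 1, 1, 2, 2],
    "DepartmentDescription": [
        "FINANCIAL SERVICES", "SHOES", "PERSONAL CARE",
        "PAINT AND ACCESSORIES", "PAINT AND ACCESSORIES",
    ],
})

# Each visit should carry exactly one TripType.
labels_per_visit = rows.groupby("VisitNumber")["TripType"].nunique()
one_label_each = bool((labels_per_visit == 1).all())

# Collapse item rows into one row per visit.
visits = rows.groupby("VisitNumber").agg(
    TripType=("TripType", "first"),
    # Positive ScanCount values are purchases.
    items_bought=("ScanCount", lambda s: s.clip(lower=0).sum()),
    # Negative ScanCount values are returns; report them as a positive count.
    items_returned=("ScanCount", lambda s: -s.clip(upper=0).sum()),
    n_departments=("DepartmentDescription", "nunique"),
)
print(visits)
```

The same pattern would apply to the real training frame once it is loaded, assuming it has these columns.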


**Importing the packages** 




In [8]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (recall_score, precision_score, f1_score,
                             roc_auc_score, confusion_matrix, accuracy_score)
from sklearn.model_selection import train_test_split

from yellowbrick.classifier import ClassificationReport, ROCAUC

plt.style.use('ggplot')
pd.options.display.float_format = '{:,.2f}'.format
from IPython.display import display, HTML  # IPython.core.display is deprecated for these names
display(HTML("<style>.container { width:95% !important; }</style>"))




**Reading the train and test dataset** 

In [10]:
cdata = pd.read_csv('/content/wallmart-train.csv')
x_test = pd.read_csv('/content/wallmart-test.csv')

In [11]:
cdata.head()

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
0,999,5,Friday,68113152929.0,-1,FINANCIAL SERVICES,1000.0
1,30,7,Friday,60538815980.0,1,SHOES,8931.0
2,30,7,Friday,7410811099.0,1,PERSONAL CARE,4504.0
3,26,8,Friday,2238403510.0,2,PAINT AND ACCESSORIES,3565.0
4,26,8,Friday,2006613744.0,2,PAINT AND ACCESSORIES,1017.0


In [12]:
x_test.head()

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
0,1,Friday,72503389714.0,1,SHOES,3002.0
1,1,Friday,1707710732.0,1,DAIRY,1526.0
2,1,Friday,89470001026.0,1,DAIRY,1431.0
3,1,Friday,88491211470.0,1,GROCERY DRY GOODS,3555.0
4,2,Friday,2840015224.0,1,DSD GROCERY,4408.0


In [14]:
cdata.shape

(647054, 7)

In [16]:
cdata.nunique()

TripType                    38
VisitNumber              95674
Weekday                      7
Upc                      97714
ScanCount                   39
DepartmentDescription       68
FinelineNumber            5195
dtype: int64

In [19]:
cdata.isnull().sum()

TripType                    0
VisitNumber                 0
Weekday                     0
Upc                      4129
ScanCount                   0
DepartmentDescription    1361
FinelineNumber           4129
dtype: int64
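
`Upc` and `FinelineNumber` report the same number of missing values (4129), which suggests they may be missing on the same rows. One way to check this is to count rows per missing/present combination. The sketch below does so on a small made-up frame; the helper name `missing_patterns` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns; the values are illustrative.
toy = pd.DataFrame({
    "Upc": [68113152929.0, np.nan, np.nan, 7410811099.0],
    "DepartmentDescription": ["FINANCIAL SERVICES", "PHARMACY RX", None, "PERSONAL CARE"],
    "FinelineNumber": [1000.0, np.nan, np.nan, 4504.0],
})

def missing_patterns(df, cols):
    """Count rows for each combination of missing (True) / present (False) flags."""
    flags = df[cols].isnull()
    return flags.groupby(cols).size().rename("rows").reset_index()

cols = ["Upc", "DepartmentDescription", "FinelineNumber"]
patterns = missing_patterns(toy, cols)
print(patterns)

# True when the two columns are always missing together.
same_rows = bool((toy["Upc"].isnull() == toy["FinelineNumber"].isnull()).all())
print(same_rows)
```

Calling `missing_patterns(cdata, cols)` on the training frame would show whether the 4129 rows coincide and how they relate to the 1361 rows without a department.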

In [28]:
# Share of rows per department (no figure is drawn, so plt.show() is not needed)
print(cdata['DepartmentDescription'].value_counts(normalize=True))

GROCERY DRY GOODS        0.11
DSD GROCERY              0.11
PRODUCE                  0.08
DAIRY                    0.07
PERSONAL CARE            0.06
                         ... 
LARGE HOUSEHOLD GOODS    0.00
CONCEPT STORES           0.00
SEASONAL                 0.00
OTHER DEPARTMENTS        0.00
HEALTH AND BEAUTY AIDS   0.00
Name: DepartmentDescription, Length: 68, dtype: float64


In [39]:
cdata[cdata['DepartmentDescription'].notnull() & cdata['FinelineNumber'].isnull()]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
1155,44,496,Friday,,1,PHARMACY RX,
1216,5,521,Friday,,1,PHARMACY RX,
1373,5,585,Friday,,1,PHARMACY RX,
1455,5,619,Friday,,1,PHARMACY RX,
1456,5,619,Friday,,1,PHARMACY RX,
...,...,...,...,...,...,...,...
636715,5,188839,Sunday,,1,PHARMACY RX,
636716,5,188839,Sunday,,1,PHARMACY RX,
636717,5,188839,Sunday,,1,PHARMACY RX,
636847,5,188896,Sunday,,1,PHARMACY RX,


In [40]:
cdata.describe()

Unnamed: 0,TripType,VisitNumber,Upc,ScanCount,FinelineNumber
count,647054.0,647054.0,642925.0,647054.0,642925.0
mean,58.58,96167.64,30606982273.49,1.11,3726.88
std,157.64,55545.49,91201337280.41,0.7,2780.97
min,3.0,5.0,834.0,-12.0,0.0
25%,27.0,49268.0,3400000995.0,1.0,1404.0
50%,39.0,97074.0,7050102580.0,1.0,3352.0
75%,40.0,144316.0,30065314449.0,1.0,5501.0
max,999.0,191347.0,978970666419.0,71.0,9998.0


In [41]:
cdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647054 entries, 0 to 647053
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   TripType               647054 non-null  int64  
 1   VisitNumber            647054 non-null  int64  
 2   Weekday                647054 non-null  object 
 3   Upc                    642925 non-null  float64
 4   ScanCount              647054 non-null  int64  
 5   DepartmentDescription  645693 non-null  object 
 6   FinelineNumber         642925 non-null  float64
dtypes: float64(2), int64(3), object(2)
memory usage: 34.6+ MB
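
`Upc` and `FinelineNumber` show up as `float64` because pandas stores integer columns that contain missing values as floats by default. If integer identifiers are preferred, one option is pandas' nullable `Int64` dtype. The sketch below assumes the identifiers are whole numbers and uses a small made-up frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: identifier columns stored as floats because of the NaN rows.
toy = pd.DataFrame({
    "Upc": [68113152929.0, np.nan, 7410811099.0],
    "FinelineNumber": [1000.0, np.nan, 4504.0],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
ids = toy.astype({"Upc": "Int64", "FinelineNumber": "Int64"})
print(ids.dtypes)
print(ids)
```

The conversion raises an error if any value has a fractional part, which also acts as a check on the whole-number assumption.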
