# Template Explanation
I use this notebook multiple times across the life of a project, both as a starting overview and as a point of reference between cleaning and transformation steps.

It uses a personalized version of the [cookiecutter data science](https://drivendata.github.io/cookiecutter-data-science/) file structure to simplify file selection.

In addition to basic built-in `pandas` overview tools, like `.info()` and `.describe()`, I added my own functions for common tasks, including:
* Creating a single table that combines the data from `.info()` and `.describe()`
    * Writing the table to `.xlsx` with frozen column headers and empty columns for note-taking:
        * `Notes` - General info that I need to remember, including questions
        * `Drop?` - Mark if the column can be dropped
        * `Change Dtype?` - Add what the column data type should be changed to
* Displaying an overview of unique values by column
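
Once the note columns are filled in, they can drive the next cleaning step. The sketch below is a hypothetical follow-up and not part of this notebook: the helper `apply_overview_notes` and the sample `notes` table are assumptions, with `Drop?` treated as marked by any non-empty value and `Change Dtype?` assumed to hold a pandas dtype string.

```python
import pandas as pd

def apply_overview_notes(df, notes):
    """Hypothetical helper: drop and recast columns based on the note columns."""
    # Columns whose 'Drop?' cell is non-empty are removed
    drop_mask = notes['Drop?'].fillna('').astype(str).str.strip() != ''
    out = df.drop(columns=notes.index[drop_mask.to_numpy()])

    # Remaining columns with a 'Change Dtype?' entry are recast
    dtype_notes = notes.loc[~drop_mask, 'Change Dtype?'].fillna('').astype(str).str.strip()
    dtype_map = {col: dtype for col, dtype in dtype_notes.items() if dtype}
    return out.astype(dtype_map)

# Made-up sample data and notes, indexed by column name like the overview table
sample = pd.DataFrame({
    'SHOP_HOUR': [16, 19],
    'SHOP_WEEK': [200607, 200607],
    'STORE_FORMAT': ['LS', 'SS'],
})
notes = pd.DataFrame(
    {'Drop?': ['', 'x', ''], 'Change Dtype?': ['int8', '', 'category']},
    index=['SHOP_HOUR', 'SHOP_WEEK', 'STORE_FORMAT'],
)

cleaned = apply_overview_notes(sample, notes)
print(cleaned.dtypes)
```

Keeping the notes in the same table as the overview means the decisions and the evidence for them stay in one file.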

# Notebook Conclusions
Update this section with any findings and notes.

# Set Up

In [124]:
## Authorize Google Drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Library imports

In [125]:
#general analysis
import pandas as pd
import pprint as pp
import re
import numpy as np
import seaborn as sns

#file management
from pathlib import Path
from datetime import datetime

#stop words counter
#from collections import Counter

In [126]:
#unique marker for new files
today = datetime.now().strftime("%Y-%m-%d")

## Display Preferences

In [127]:

#current preferences
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None) #change column display width
#pd.set_option('display.precision', 2) #displays 2 decimal places on all numbers
pd.set_option('display.float_format',  '{:.2f}'.format)
pd.set_option('display.memory_usage', 'deep')
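
These settings apply for the whole session. When a single cell needs different settings, `pd.option_context` limits the change to a `with` block. A minimal sketch, using made-up values:

```python
import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)
prices = pd.Series([1.054, 0.7149])  # made-up values

# Inside the block, floats are shown with 4 decimal places
with pd.option_context('display.float_format', '{:.4f}'.format):
    detailed = repr(prices)

# Outside the block, the session-wide 2-decimal format applies again
default = repr(prices)

print(detailed)
print(default)
```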

# File Handling

In [128]:
# @title File Selection
# @markdown 1. Navigate to original file location in sidebar
# @markdown 2. Right-click the file and select 'Copy path'
# @markdown 3. Paste into 'file_path' field

file_path = '/content/drive/MyDrive/data_analysis/dunnhumby/data/raw/transactions_200607.csv'  # @param {type: "string"}

# @markdown ---

In [129]:
#set up file_path
file_path = Path(file_path)

In [130]:
file_path

PosixPath('/content/drive/MyDrive/data_analysis/dunnhumby/data/raw/transactions_200607.csv')

In [150]:
#select main drive path
main_drive = file_path.parent.parent.parent
print(main_drive)

#create output filename string using today for unique id
output_filename = f'{file_path.stem}_{today}'
print(output_filename)

/content/drive/MyDrive/data_analysis/dunnhumby
transactions_200607_2024-03-15
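
The chained `.parent` calls rely on the raw file sitting exactly two folders below the project root (`data/raw/`). As an alternative sketch, `parents[2]` expresses the same step in one lookup. `PurePosixPath` is used here only so the example behaves the same on any operating system.

```python
from pathlib import PurePosixPath

# Same layout as the path above: <project>/data/raw/<file>
file_path = PurePosixPath(
    '/content/drive/MyDrive/data_analysis/dunnhumby/data/raw/transactions_200607.csv'
)

# parents[0] is the file's folder, parents[1] is data/, parents[2] is the project root
main_drive = file_path.parents[2]
same_as_chained = main_drive == file_path.parent.parent.parent

print(main_drive)
print(same_as_chained)
```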


### Read into pandas dataframe

In [132]:
keep_cols = []

df = pd.read_csv(file_path,
#                     usecols= keep_cols,
#                      sep='\t',
#                        nrows=100,
#                       engine='python',
#                     encoding='ISO-8859-1'
                        )
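
The commented-out arguments are switches for awkward files. As a rough illustration of two of them, the sketch below reads a small in-memory stand-in for the raw file (a few values copied from the rows shown in this notebook) with `usecols` and `nrows` enabled.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the raw file
csv_text = (
    "SHOP_WEEK,SHOP_DATE,QUANTITY,SPEND,PROD_CODE\n"
    "200607,20060411,1,1.05,PRD0900011\n"
    "200607,20060411,3,1.65,PRD0900035\n"
    "200607,20060414,1,0.71,PRD0900043\n"
)

keep_cols = ['SHOP_DATE', 'SPEND', 'PROD_CODE']

# usecols limits which columns are parsed; nrows limits how many data rows are read
preview = pd.read_csv(io.StringIO(csv_text), usecols=keep_cols, nrows=2)
print(preview.shape)
```

Reading a small preview first is a cheap way to check the delimiter and encoding before loading the full file.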

# DataFrame Overview

## Row and Column Count

In [133]:
df.shape

(250546, 22)

## `.info()`

In [134]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250546 entries, 0 to 250545
Data columns (total 22 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SHOP_WEEK                 250546 non-null  int64  
 1   SHOP_DATE                 250546 non-null  int64  
 2   SHOP_WEEKDAY              250546 non-null  int64  
 3   SHOP_HOUR                 250546 non-null  int64  
 4   QUANTITY                  250546 non-null  int64  
 5   SPEND                     250546 non-null  float64
 6   PROD_CODE                 250546 non-null  object 
 7   PROD_CODE_10              250546 non-null  object 
 8   PROD_CODE_20              250546 non-null  object 
 9   PROD_CODE_30              250546 non-null  object 
 10  PROD_CODE_40              250546 non-null  object 
 11  CUST_CODE                 202211 non-null  object 
 12  CUST_PRICE_SENSITIVITY    202211 non-null  object 
 13  CUST_LIFESTAGE            175047 non-null  o

## Head and Tail Rows
First and last 5 rows

In [135]:
df.head(5)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
0,200607,20060411,3,16,1,1.05,PRD0900011,CL00033,DEP00008,G00004,D00002,,,,994100100100943,L,UM,Top Up,Fresh,STORE00001,LS,E02
1,200607,20060411,3,19,3,1.65,PRD0900035,CL00113,DEP00040,G00011,D00003,CUST0000173993,UM,OA,994100100257041,L,MM,Full Shop,Mixed,STORE00001,LS,E02
2,200607,20060414,6,21,1,0.71,PRD0900043,CL00148,DEP00052,G00015,D00003,,,,994100100052871,L,LA,Top Up,Mixed,STORE00001,LS,E02
3,200607,20060414,6,14,1,0.46,PRD0900057,CL00107,DEP00037,G00010,D00003,CUST0000644893,LA,PE,994100100539059,L,LA,Full Shop,Fresh,STORE00001,LS,E02
4,200607,20060411,3,16,1,1.87,PRD0900058,CL00020,DEP00005,G00003,D00001,,,,994100100100943,L,UM,Top Up,Fresh,STORE00001,LS,E02


In [136]:
df.tail(5)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
250541,200607,20060412,4,21,1,1.54,PRD0904358,CL00063,DEP00019,G00007,D00002,,,,994100100007204,M,LA,Top Up,Mixed,STORE02908,SS,N02
250542,200607,20060411,3,21,1,1.54,PRD0904358,CL00063,DEP00019,G00007,D00002,,,,994100100037346,L,UM,Top Up,Fresh,STORE02908,SS,N02
250543,200607,20060413,5,8,1,0.31,PRD0904471,CL00218,DEP00073,G00023,D00005,CUST0000325603,MM,YA,994100100347826,S,MM,Small Shop,Mixed,STORE02908,SS,N02
250544,200607,20060416,1,10,7,5.95,PRD0904660,CL00161,DEP00054,G00016,D00003,,,,994100100107776,M,MM,Small Shop,Grocery,STORE02908,SS,N02
250545,200607,20060411,3,17,3,9.81,PRD0904933,CL00150,DEP00052,G00015,D00003,,,,994100100038770,L,UM,Top Up,Grocery,STORE02908,SS,N02


## Descriptive Stats

In [137]:
df.describe(include='all')

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
count,250546.0,250546.0,250546.0,250546.0,250546.0,250546.0,250546,250546,250546,250546,250546,202211,202211,175047,250546.0,250546,250546,250546,250546,250546,250546,250546
unique,,,,,,,3809,242,89,31,9,17243,4,6,,3,4,4,5,748,4,12
top,,,,,,,PRD0903052,CL00063,DEP00019,G00007,D00002,CUST0000701659,MM,OT,,L,MM,Top Up,Fresh,STORE01423,LS,S02
freq,,,,,,,5968,11041,21616,56825,123764,101,91176,51914,,178405,119486,113151,129265,1224,160648,26370
mean,200607.0,20060413.02,4.0,14.99,1.47,1.93,,,,,,,,,994100100379997.6,,,,,,,
std,0.0,2.01,2.01,3.72,1.41,4.0,,,,,,,,,216601.67,,,,,,,
min,200607.0,20060410.0,1.0,8.0,1.0,0.01,,,,,,,,,994100100000007.0,,,,,,,
25%,200607.0,20060411.0,2.0,12.0,1.0,0.76,,,,,,,,,994100100193201.0,,,,,,,
50%,200607.0,20060413.0,4.0,15.0,1.0,1.23,,,,,,,,,994100100379527.0,,,,,,,
75%,200607.0,20060415.0,6.0,18.0,1.0,2.11,,,,,,,,,994100100568819.5,,,,,,,


## Testing

## Data Types, Memory Usage, Nulls, Value Counts

In [138]:
#pandas documentation re memory usage: base-2 representation; i.e. 1KB = 1024 bytes
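
To make the unit concrete: pandas reports memory in bytes, and dividing by 1024 twice converts it to base-2 megabytes. The sketch below is a small check that uses a zero-filled stand-in column with the row count from this file.

```python
import numpy as np
import pandas as pd

n_rows = 250_546  # row count reported by df.shape above
col = pd.Series(np.zeros(n_rows, dtype='int64'), name='SHOP_WEEK')  # stand-in values

# Each int64 value takes 8 bytes; index=False excludes the index from the total
n_bytes = col.memory_usage(index=False, deep=True)
n_mb = n_bytes / 1024 / 1024

print(n_bytes, round(n_mb, 2))
```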

In [139]:
def get_dataframe_info(df):
    """
    Recreates column-wise info from df.info() and df.describe() as a single
    DataFrame, for easier viewing once exported to a spreadsheet.
    """
    df_dtypes = pd.DataFrame(df.dtypes, columns=['Data Types'])

    # memory per column in bytes, converted to MB (base-2: 1 MB = 1024 * 1024 bytes)
    df_memory_usage = df.memory_usage(index=False, deep=True).to_frame(name='Memory Use')
    df_memory_usage['Memory Use'] = df_memory_usage['Memory Use'] / 1024 / 1024

    # share of missing values per column, as a percentage
    df_percent_null = ((1 - df.count() / len(df)) * 100).to_frame(name='Percent Null')

    df_described = df.describe(include='all').T

    df_info = pd.concat([df_dtypes, df_memory_usage, df_percent_null, df_described], axis=1)

    # Label the index and round numeric columns to 2 decimal places
    df_info = df_info.rename_axis('Column Names').round(2)

    return df_info

In [140]:
df_info = get_dataframe_info(df)

In [141]:
df_info.style \
    .format(precision=2) \
    .set_sticky(axis='index') \
    .bar(color='#89d8e0')

Column Names,Data Types,Memory Use,Percent Null,count,unique,top,freq,mean,std,min,25%,50%,75%,max
SHOP_WEEK,int64,1.91,0.0,250546.0,,,,200607.0,0.0,200607.0,200607.0,200607.0,200607.0,200607.0
SHOP_DATE,int64,1.91,0.0,250546.0,,,,20060413.02,2.01,20060410.0,20060411.0,20060413.0,20060415.0,20060416.0
SHOP_WEEKDAY,int64,1.91,0.0,250546.0,,,,4.0,2.01,1.0,2.0,4.0,6.0,7.0
SHOP_HOUR,int64,1.91,0.0,250546.0,,,,14.99,3.72,8.0,12.0,15.0,18.0,21.0
QUANTITY,int64,1.91,0.0,250546.0,,,,1.47,1.41,1.0,1.0,1.0,1.0,275.0
SPEND,float64,1.91,0.0,250546.0,,,,1.93,4.0,0.01,0.76,1.23,2.11,1397.64
PROD_CODE,object,16.01,0.0,250546.0,3809.0,PRD0903052,5968.0,,,,,,,
PROD_CODE_10,object,15.29,0.0,250546.0,242.0,CL00063,11041.0,,,,,,,
PROD_CODE_20,object,15.53,0.0,250546.0,89.0,DEP00019,21616.0,,,,,,,
PROD_CODE_30,object,15.05,0.0,250546.0,31.0,G00007,56825.0,,,,,,,


### Save overview output in `.xlsx` format

In [142]:
#add empty columns for note-taking in a spreadsheet
df_info.insert(loc=0,
               column='Notes',
               value = '')

df_info.insert(loc=1,
               column='Drop?',
               value = '')

df_info.insert(loc=2,
               column='Change Dtype?',
               value = '')

In [152]:
#write to excel file
df_info.to_excel(main_drive / 'notebooks' / 'eda' / f"{output_filename}_ov.xlsx",
                 sheet_name=f'Overview {file_path.stem}',
                 freeze_panes=(1,4)
                 )

## Correlation Table
For numeric columns only

In [153]:
df.corr(numeric_only=True)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,BASKET_ID
SHOP_WEEK,,,,,,,
SHOP_DATE,,1.0,0.24,0.0,-0.0,-0.0,0.01
SHOP_WEEKDAY,,0.24,1.0,-0.0,-0.0,0.0,-0.01
SHOP_HOUR,,0.0,-0.0,1.0,-0.02,0.0,-0.06
QUANTITY,,-0.0,-0.0,-0.02,1.0,0.35,-0.03
SPEND,,-0.0,0.0,0.0,0.35,1.0,0.0
BASKET_ID,,0.01,-0.01,-0.06,-0.03,0.0,1.0
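
The `SHOP_WEEK` row and column are empty because this file covers a single week: a constant column has zero standard deviation, so its Pearson correlation is undefined. A minimal sketch with made-up values reproduces the effect:

```python
import pandas as pd

toy = pd.DataFrame({
    'SHOP_WEEK': [200607, 200607, 200607, 200607],  # constant, as in this file
    'QUANTITY': [1, 3, 1, 7],
    'SPEND': [1.05, 1.65, 0.71, 5.95],
})

corr = toy.corr(numeric_only=True)
print(corr)
```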


## Duplicated Rows

In [154]:
df[df.duplicated(keep=False)]

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
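
The empty result above means no row is an exact copy of another. With `keep=False`, every member of a duplicate group is flagged, so copies can be inspected side by side, whereas the default `keep='first'` flags only the repeats. A sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'BASKET_ID': [1, 1, 2, 1],
    'PROD_CODE': ['A', 'A', 'B', 'A'],
})

all_members = toy.duplicated(keep=False)  # True for every row in a duplicate group
repeats_only = toy.duplicated()           # default keep='first': first occurrence is False

print(all_members.tolist())
print(repeats_only.tolist())
```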


## Null Columns
List of columns with all NaN values

In [155]:
list(df.columns[df.isnull().all(axis=0)])

[]

# Columns

## List of Column Names

In [156]:
list(df.columns)

['SHOP_WEEK',
 'SHOP_DATE',
 'SHOP_WEEKDAY',
 'SHOP_HOUR',
 'QUANTITY',
 'SPEND',
 'PROD_CODE',
 'PROD_CODE_10',
 'PROD_CODE_20',
 'PROD_CODE_30',
 'PROD_CODE_40',
 'CUST_CODE',
 'CUST_PRICE_SENSITIVITY',
 'CUST_LIFESTAGE',
 'BASKET_ID',
 'BASKET_SIZE',
 'BASKET_PRICE_SENSITIVITY',
 'BASKET_TYPE',
 'BASKET_DOMINANT_MISSION',
 'STORE_CODE',
 'STORE_FORMAT',
 'STORE_REGION']

## Column Overview
This function loops over each column to produce the following info:
* name
* count of unique values
* datatype
* string to copy into a new cell, for displaying all value counts for a column

For value counts under the cutoff:
* a transposed display frame of all unique values

For value counts over the cutoff:
* transposed display frames of the top and bottom 10 values and counts


In [157]:
from IPython.display import display  # available by default in notebooks; explicit for use elsewhere

def column_overview(df, columns, cutoff=100):
    """
    Display column name, count of unique values, and an easy-to-read dataframe of individual unique values and their counts.

    Parameters
    ----------
    df: dataframe
    columns: list
    cutoff: int, maximum unique value count to display in full
        Recommended for string or object columns with unclear or low expected unique values.

    Returns
    -------
    None. Prints out one result for each column in the provided list.
    """
    for col in columns:
        print("\n")
        print("Column name: " + col)
        num_unique = df[col].nunique()
        print(f"Number of Unique Values: {num_unique}")
        col_datatype = str(df[col].dtype)
        print(f"Column Datatype: {col_datatype}")
        print("\n")

        # value counts including NaN, computed once and reused below
        value_counts = pd.DataFrame(df[col].value_counts(dropna=False))

        if num_unique < cutoff:
            print("use line below for vertical results")
            print(f"pd.DataFrame(df['{col}'].value_counts(dropna=False))")
            display(value_counts.T)
            print(u'\u2500' * 80)
        else:
            print(f"more than {cutoff} results, showing Top 10 and Bottom 10")
            print("use line below for complete results")
            print(f"pd.DataFrame(df['{col}'].value_counts(dropna=False))")
            print("\n")
            print(f"Top 10 Unique Values of {col}")
            display(value_counts.head(10).T)
            print(f"Bottom 10 Unique Values of {col}")
            display(value_counts.tail(10).T)
            print(u'\u2500' * 80)
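
One detail to keep in mind when reading the output: `nunique()` ignores missing values by default, while `value_counts(dropna=False)` counts them. A column with nulls therefore shows one more entry in its table than in its unique count. For example, `CUST_LIFESTAGE` reports 6 unique values but lists 7 entries, including `NaN`. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

lifestage = pd.Series(['OT', 'YF', np.nan, 'OT', np.nan], name='CUST_LIFESTAGE')

n_unique = lifestage.nunique()                 # NaN is not counted
counts = lifestage.value_counts(dropna=False)  # NaN gets its own entry

print(n_unique, len(counts))
```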


In [158]:
column_overview(df, df.columns, 200)



Column name: SHOP_WEEK
Number of Unique Values: 1
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_WEEK'].value_counts(dropna=False))


Unnamed: 0,200607
SHOP_WEEK,250546


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_DATE
Number of Unique Values: 7
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_DATE'].value_counts(dropna=False))


Unnamed: 0,20060416,20060415,20060410,20060414,20060413,20060412,20060411
SHOP_DATE,36656,36183,35989,35922,35567,35183,35046


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_WEEKDAY
Number of Unique Values: 7
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_WEEKDAY'].value_counts(dropna=False))


Unnamed: 0,1,7,2,6,5,4,3
SHOP_WEEKDAY,36656,36183,35989,35922,35567,35183,35046


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_HOUR
Number of Unique Values: 14
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_HOUR'].value_counts(dropna=False))


Unnamed: 0,21,13,15,14,16,12,17,18,11,19,20,10,8,9
SHOP_HOUR,25092,24337,22892,22646,20916,20478,19713,17440,16900,15228,12779,12574,11179,8372


────────────────────────────────────────────────────────────────────────────────


Column name: QUANTITY
Number of Unique Values: 46
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['QUANTITY'].value_counts(dropna=False))


Unnamed: 0,1,3,4,6,5,7,8,9,11,10,12,14,13,16,17,15,18,19,21,22,27,20,23,26,24,40,34,32,35,28,31,30,25,29,60,33,45,53,36,37,59,68,81,275,76,73
QUANTITY,205634,35112,5080,1693,1018,607,455,324,112,89,84,64,42,40,39,36,24,11,9,8,8,6,6,6,5,3,3,3,3,3,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: SPEND
Number of Unique Values: 1797
Column Datatype: float64


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['SPEND'].value_counts(dropna=False))


Top 10 Unique Values of SPEND


Unnamed: 0,1.54,0.97,0.98,1.01,0.86,0.40,0.70,1.02,0.84,0.53
SPEND,4341,3269,3269,2551,2328,2236,1983,1937,1935,1912


Bottom 10 Unique Values of SPEND


Unnamed: 0,17.92,26.50,29.72,28.31,26.59,8.89,25.95,33.69,19.86,11.43
SPEND,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE
Number of Unique Values: 3809
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['PROD_CODE'].value_counts(dropna=False))


Top 10 Unique Values of PROD_CODE


Unnamed: 0,PRD0903052,PRD0904358,PRD0900121,PRD0901265,PRD0900830,PRD0903074,PRD0904887,PRD0900302,PRD0904976,PRD0901887
PROD_CODE,5968,3896,3453,2388,1667,1389,1346,1300,1266,1252


Bottom 10 Unique Values of PROD_CODE


Unnamed: 0,PRD0903752,PRD0903690,PRD0901462,PRD0904483,PRD0900847,PRD0903606,PRD0901793,PRD0903572,PRD0902003,PRD0902485
PROD_CODE,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_10
Number of Unique Values: 242
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['PROD_CODE_10'].value_counts(dropna=False))


Top 10 Unique Values of PROD_CODE_10


Unnamed: 0,CL00063,CL00031,CL00070,CL00045,CL00067,CL00073,CL00079,CL00043,CL00140,CL00201
PROD_CODE_10,11041,7789,5748,5522,5464,5139,4984,4548,4210,4185


Bottom 10 Unique Values of PROD_CODE_10


Unnamed: 0,CL00194,CL00242,CL00184,CL00019,CL00183,CL00109,CL00168,CL00192,CL00191,CL00223
PROD_CODE_10,11,10,8,8,8,7,6,6,3,3


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_20
Number of Unique Values: 89
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_20'].value_counts(dropna=False))


Unnamed: 0,DEP00019,DEP00008,DEP00011,DEP00022,DEP00052,DEP00067,DEP00055,DEP00021,DEP00054,DEP00020,DEP00053,DEP00001,DEP00010,DEP00046,DEP00002,DEP00012,DEP00049,DEP00048,DEP00069,DEP00003,DEP00004,DEP00005,DEP00025,DEP00013,DEP00035,DEP00047,DEP00051,DEP00037,DEP00073,DEP00036,DEP00039,DEP00024,DEP00023,DEP00016,DEP00030,DEP00034,DEP00009,DEP00033,DEP00050,DEP00044,DEP00042,DEP00070,DEP00081,DEP00071,DEP00084,DEP00040,DEP00028,DEP00041,DEP00018,DEP00068,DEP00027,DEP00043,DEP00083,DEP00089,DEP00062,DEP00026,DEP00088,DEP00031,DEP00029,DEP00056,DEP00077,DEP00061,DEP00090,DEP00085,DEP00076,DEP00086,DEP00007,DEP00017,DEP00079,DEP00063,DEP00082,DEP00057,DEP00006,DEP00014,DEP00078,DEP00032,DEP00045,DEP00058,DEP00060,DEP00080,DEP00015,DEP00074,DEP00059,DEP00075,DEP00066,DEP00087,DEP00038,DEP00065,DEP00064
PROD_CODE_20,21616,19600,18712,14744,12043,11699,9229,8962,7923,7268,7019,6512,5838,5348,4950,4825,4635,4151,4036,3966,3831,3744,3729,3483,3479,3465,2691,2485,2470,2307,2292,2167,2068,1901,1832,1770,1592,1461,1351,1336,1327,1243,1217,1193,1137,916,840,833,796,725,717,652,641,633,625,577,567,506,293,275,266,244,202,194,192,180,157,144,142,122,52,50,49,47,40,29,28,27,22,21,17,15,14,12,11,10,7,6,3


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_30
Number of Unique Values: 31
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_30'].value_counts(dropna=False))


Unnamed: 0,G00007,G00004,G00016,G00015,G00013,G00010,G00021,G00001,G00008,G00003,G00022,G00014,G00002,G00005,G00011,G00006,G00023,G00028,G00012,G00009,G00027,G00018,G00030,G00029,G00017,G00024,G00031,G00025,G00026,G00020,G00019
PROD_CODE_30,56825,50567,24171,14734,12964,12340,11699,11462,8494,7781,7197,5986,3966,3547,3076,2841,2689,2162,2016,1490,1269,991,633,567,388,306,202,142,21,11,9


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_40
Number of Unique Values: 9
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_40'].value_counts(dropna=False))


Unnamed: 0,D00002,D00003,D00001,D00005,D00008,D00004,D00009,D00006,D00007
PROD_CODE_40,123764,75287,23209,21585,3998,1399,835,448,21


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_CODE
Number of Unique Values: 17243
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['CUST_CODE'].value_counts(dropna=False))


Top 10 Unique Values of CUST_CODE


Unnamed: 0,NaN,CUST0000701659,CUST0000640576,CUST0000756781,CUST0000804366,CUST0000566247,CUST0000113121,CUST0000965332,CUST0000663738,CUST0000678780
CUST_CODE,48335,101,71,71,70,70,70,68,64,63


Bottom 10 Unique Values of CUST_CODE


Unnamed: 0,CUST0000873840,CUST0000239976,CUST0000989742,CUST0000937245,CUST0000642010,CUST0000838133,CUST0000406552,CUST0000576814,CUST0000923291,CUST0000583271
CUST_CODE,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_PRICE_SENSITIVITY
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['CUST_PRICE_SENSITIVITY'].value_counts(dropna=False))


Unnamed: 0,MM,UM,LA,NaN,XX
CUST_PRICE_SENSITIVITY,91176,55706,53060,48335,2269


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_LIFESTAGE
Number of Unique Values: 6
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['CUST_LIFESTAGE'].value_counts(dropna=False))


Unnamed: 0,NaN,OT,YF,OA,YA,PE,OF
CUST_LIFESTAGE,75499,51914,40553,27572,26875,15418,12715


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_ID
Number of Unique Values: 38007
Column Datatype: int64


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['BASKET_ID'].value_counts(dropna=False))


Top 10 Unique Values of BASKET_ID


Unnamed: 0,994100100550262,994100100542690,994100100567321,994100100395008,994100100067226,994100100425072,994100100259078,994100100573449,994100100199547,994100100692873
BASKET_ID,64,60,55,54,53,53,51,51,51,49


Bottom 10 Unique Values of BASKET_ID


Unnamed: 0,994100100125327,994100100091014,994100100703247,994100100749925,994100100282735,994100100307332,994100100271967,994100100625924,994100100681541,994100100549369
BASKET_ID,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_SIZE
Number of Unique Values: 3
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_SIZE'].value_counts(dropna=False))


Unnamed: 0,L,M,S
BASKET_SIZE,178405,58388,13753


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_PRICE_SENSITIVITY
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_PRICE_SENSITIVITY'].value_counts(dropna=False))


Unnamed: 0,MM,UM,LA,XX
BASKET_PRICE_SENSITIVITY,119486,66350,64291,419


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_TYPE
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_TYPE'].value_counts(dropna=False))


Unnamed: 0,Top Up,Full Shop,Small Shop,XX
BASKET_TYPE,113151,90407,46736,252


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_DOMINANT_MISSION
Number of Unique Values: 5
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_DOMINANT_MISSION'].value_counts(dropna=False))


Unnamed: 0,Fresh,Mixed,Grocery,Nonfood,XX
BASKET_DOMINANT_MISSION,129265,92796,24900,3333,252


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_CODE
Number of Unique Values: 748
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['STORE_CODE'].value_counts(dropna=False))


Top 10 Unique Values of STORE_CODE


Unnamed: 0,STORE01423,STORE00696,STORE00729,STORE01604,STORE00278,STORE02797,STORE02577,STORE01707,STORE01007,STORE00030
STORE_CODE,1224,1109,1088,1038,1006,948,918,911,911,902


Bottom 10 Unique Values of STORE_CODE


Unnamed: 0,STORE02457,STORE00843,STORE01804,STORE01769,STORE00183,STORE01535,STORE00962,STORE01856,STORE00691,STORE00779
STORE_CODE,15,14,14,14,14,9,7,6,5,3


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_FORMAT
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['STORE_FORMAT'].value_counts(dropna=False))


Unnamed: 0,LS,MS,XLS,SS
STORE_FORMAT,160648,51994,21781,16123


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_REGION
Number of Unique Values: 12
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['STORE_REGION'].value_counts(dropna=False))


Unnamed: 0,S02,N01,N03,W02,N02,W01,S01,S03,E03,E02,E01,W03
STORE_REGION,26370,24947,23024,22922,20669,20552,20278,20242,18591,18079,17876,16996


────────────────────────────────────────────────────────────────────────────────
