# Introduction
I use this notebook multiple times across the life of a project, both as a starting overview and as a point of reference between cleaning and transformation steps.

It uses Google Colab form entries and a personalized version of the [cookiecutter data science](https://drivendata.github.io/cookiecutter-data-science/) file structure to simplify file selection.

In addition to the basic built-in `pandas` overview tools, like `.info()` and `.describe()`, I added my own functions for common tasks, including:
*  Creating a single table that combines the data from `.info()` and `.describe()`
    * Writing the table to `.csv`
    * Writing the table to `.xlsx` with frozen column headers and a `Notes` column, which I use to take notes about processing steps
*  Displaying an overview of unique values by column

# Set Up

## Authorize Google Drive
Follow the pop-up prompts to authorize Drive access. This may not work in non-Chrome browsers, depending on ad-block and privacy settings.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Library imports

In [None]:
#general analysis
import pandas as pd
import pprint as ppr
import re
import numpy as np

#file management
from pathlib import Path
from datetime import datetime

#stop words counter
#from collections import Counter

## Display Preferences

In [None]:
#current preferences
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None) #change column display width
#pd.set_option('display.precision', 2) #displays 2 decimal places on all numbers
pd.set_option('display.float_format',  '{:.2f}'.format)
pd.set_option('display.memory_usage', 'deep')
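
To preview what the `display.float_format` preference does without changing the session-wide options, `pd.option_context` can apply it temporarily. This is a minimal sketch; the two numbers are borrowed from the `SPEND` statistics shown later and the series itself is made up for illustration.

```python
import pandas as pd

# Illustrative values only; not read from the project file
sample = pd.Series([1.839542, 2.429634], name="SPEND")

# The two-decimal format applies only inside the `with` block
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = sample.to_string()

print(formatted)
```

Outside the `with` block, whatever was set with `pd.set_option` stays in effect.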

# File Handling
This section uses parameterized forms in Google Colab to simplify file selection.

It may require running the same cells multiple times depending on how much information is needed to select the intended file/directory.
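
As a rough sketch of how such a form cell can look: in Colab, a trailing `#@param` comment turns an assignment into a form field. The project and folder values below are placeholders taken from the paths printed later in this notebook, and the file name is hypothetical.

```python
from pathlib import Path

# In Colab, `#@param` renders each assignment as a form field.
# Outside Colab these are ordinary assignments followed by comments.
project = "data_analysis/dunnhumby"  #@param {type:"string"}
folders = "data/raw"  #@param {type:"string"}
# Hypothetical file name, used only for this illustration
file_name = "example_file"  #@param {type:"string"}

# Build the full path from the Drive mount point used earlier
drive_root = Path("/content/drive/MyDrive")
file_path = drive_root / project / folders / f"{file_name}.csv"
print(file_path.as_posix())
```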

In [None]:
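file_path = ''
project = ''
folders = ''    #sub-folders inside the project directory, e.g. 'data/raw'
file_name = ''  #base name used for the overview output files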

In [None]:
#main project path
project_dir = Path.cwd().joinpath("drive", "MyDrive", project)
project_dir

PosixPath('/content/drive/MyDrive/data_analysis/dunnhumby')

In [None]:
#input files path
input_dir = project_dir.joinpath(folders)
input_dir

PosixPath('/content/drive/MyDrive/data_analysis/dunnhumby/data/raw')

In [None]:
#output files path
output_dir = project_dir.joinpath("notebooks", "eda")
output_dir

PosixPath('/content/drive/MyDrive/data_analysis/dunnhumby/notebooks/eda')

In [None]:
#unique marker for new files
today = datetime.now().strftime("%m-%d-%Y_%H%M%S")
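
To show what this marker looks like, here is a small sketch that applies the same format string to a fixed, made-up timestamp so the result is reproducible. The `example` base name is hypothetical.

```python
from datetime import datetime

# Fixed timestamp (assumption for illustration) instead of datetime.now()
fixed = datetime(2024, 3, 5, 14, 7, 9)
stamp = fixed.strftime("%m-%d-%Y_%H%M%S")

# The marker is a plain string, so it can be dropped straight into a file name
out_name = f"example_overview_{stamp}.csv"
print(out_name)  # example_overview_03-05-2024_140709.csv
```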

### Read into pandas dataframe

In [None]:
df = pd.read_csv(file_path,
#                        usecols= cols,
#                      sep='\t',
#                        nrows=100,
#                       engine='python',
#                     encoding='ISO-8859-1'
                        )
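
The commented-out arguments in the cell above are optional `pd.read_csv` parameters. As a minimal sketch of what a few of them do, the example below reads a small in-memory, tab-separated sample. The rows are made up and only reuse three of this dataset's column names.

```python
import io
import pandas as pd

# Made-up tab-separated sample standing in for a real file
raw = (
    "SHOP_WEEK\tQUANTITY\tSPEND\n"
    "200701\t1\t1.13\n"
    "200701\t3\t4.68\n"
    "200701\t1\t1.04\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                       # tab-delimited instead of comma-delimited
    usecols=["QUANTITY", "SPEND"],  # load only the listed columns
    nrows=2,                        # stop after the first two data rows
)
print(sample)
```

Limiting `nrows` and `usecols` is a quick way to check a large file's layout before loading all of it.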

# DataFrame Overview

## Row and Column Count

In [None]:
df.shape

(277100, 22)

## `.info()`

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277100 entries, 0 to 277099
Data columns (total 22 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SHOP_WEEK                 277100 non-null  int64  
 1   SHOP_DATE                 277100 non-null  int64  
 2   SHOP_WEEKDAY              277100 non-null  int64  
 3   SHOP_HOUR                 277100 non-null  int64  
 4   QUANTITY                  277100 non-null  int64  
 5   SPEND                     277100 non-null  float64
 6   PROD_CODE                 277100 non-null  object 
 7   PROD_CODE_10              277100 non-null  object 
 8   PROD_CODE_20              277100 non-null  object 
 9   PROD_CODE_30              277100 non-null  object 
 10  PROD_CODE_40              277100 non-null  object 
 11  CUST_CODE                 226383 non-null  object 
 12  CUST_PRICE_SENSITIVITY    226383 non-null  object 
 13  CUST_LIFESTAGE            198732 non-null  o

## Descriptive Stats

In [None]:
df.describe(include='all')

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
count,277100.0,277100.0,277100.0,277100.0,277100.0,277100.0,277100,277100,277100,277100,277100,226383,226383,198732,277100.0,277100,277100,277100,277100,277100,277100,277100
unique,,,,,,,4368,242,87,30,9,18301,4,6,,3,4,4,5,759,4,12
top,,,,,,,PRD0903052,CL00063,DEP00008,G00007,D00002,CUST0000413198,MM,OT,,L,MM,Top Up,Fresh,STORE00696,LS,S02
freq,,,,,,,5992,12215,23384,62306,136426,81,103544,60657,,198321,144730,121307,142006,1306,173282,27863
mean,200701.0,20070270.69,4.02,14.97,1.51,1.84,,,,,,,,,994104700412106.2,,,,,,,
std,0.0,37.28,2.0,3.65,2.06,2.43,,,,,,,,,233667.92,,,,,,,
min,200701.0,20070226.0,1.0,8.0,1.0,0.0,,,,,,,,,994104700000003.0,,,,,,,
25%,200701.0,20070227.0,2.0,12.0,1.0,0.76,,,,,,,,,994104700209805.0,,,,,,,
50%,200701.0,20070301.0,4.0,15.0,1.0,1.22,,,,,,,,,994104700411380.0,,,,,,,
75%,200701.0,20070303.0,6.0,18.0,1.0,2.08,,,,,,,,,994104700615334.1,,,,,,,
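
The blank cells in the table above are expected: with `include='all'`, `unique`, `top`, and `freq` are only filled for non-numeric columns, while `mean`, `std`, and the quantiles are only filled for numeric ones. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

# Made-up rows reusing two column names from this dataset
mixed = pd.DataFrame({
    "QUANTITY": [1, 3, 1],
    "BASKET_SIZE": ["L", "L", "M"],
})

summary = mixed.describe(include="all")

# Categorical statistics are NaN for the numeric column, and vice versa
print(summary.loc[["unique", "top", "freq", "mean"]])
```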


## Testing

## Data Types, Memory Usage, Nulls, Value Counts

In [None]:
#pandas documentation re memory usage: base-2 representation; i.e. 1 KB = 1024 bytes
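
As a worked example of the base-2 conversion: an `int64` value takes 8 bytes, so a column with the 277,100 rows reported by `df.shape` needs the following amount of memory.

```python
rows = 277_100        # row count from df.shape
bytes_per_int64 = 8   # fixed width of an int64 value

memory_bytes = rows * bytes_per_int64
memory_mb = memory_bytes / 1024 / 1024  # base-2: 1 MB = 1024 * 1024 bytes

print(memory_bytes, round(memory_mb, 6))
```

Object (string) columns take more space than this because each value is stored as a separate Python object.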

In [None]:
def get_dataframe_info(df):
    """
    Recreates column-wise info from df.info() and df.describe() as a DataFrame
    to allow for easier viewing from CSV
    """
    df_dtypes = pd.DataFrame(df.dtypes, columns=['Data Types'])

    df_memory_usage = df.memory_usage(index=False, deep=True).to_frame(name='Memory (Bytes)')
    df_memory_usage['Memory (MB)'] = df_memory_usage['Memory (Bytes)']/1024/1024

    df_percent_null = ((1 - df.count() / len(df)) * 100).to_frame(name='Percent Null')

    df_described = df.describe(include='all').T

    df_info = pd.concat([df_dtypes, df_memory_usage, df_percent_null, df_described], axis=1)

    # Reassign column names; describe() reports `count` as the number of non-null values
    new_column_names =  {'count': 'Count Non-Null',
                        'unique': 'Unique Counts',
                        'top': 'Top Value',
                        'freq': 'Frequency',
                        'mean': 'Mean',
                        'std': 'Standard Deviation',
                        'min': 'Minimum',
                        '25%': '25%',
                        '50%': '50%',
                        '75%': '75%',
                        'max': 'Maximum'}
    df_info = df_info.rename(columns=new_column_names).rename_axis('Column Names')

    return df_info

In [None]:
df_info = get_dataframe_info(df)

In [None]:
df_info.style.set_sticky(axis='index')

Column Names,Data Types,Memory (Bytes),Memory (MB),Percent Null,Count Non-Null,Unique Counts,Top Value,Frequency,Mean,Standard Deviation,Minimum,25%,50%,75%,Maximum
SHOP_WEEK,int64,2216800,2.114105,0.0,277100.0,,,,200701.0,0.0,200701.0,200701.0,200701.0,200701.0,200701.0
SHOP_DATE,int64,2216800,2.114105,0.0,277100.0,,,,20070270.693446,37.279772,20070226.0,20070227.0,20070301.0,20070303.0,20070304.0
SHOP_WEEKDAY,int64,2216800,2.114105,0.0,277100.0,,,,4.02432,2.000169,1.0,2.0,4.0,6.0,7.0
SHOP_HOUR,int64,2216800,2.114105,0.0,277100.0,,,,14.967286,3.652395,8.0,12.0,15.0,18.0,21.0
QUANTITY,int64,2216800,2.114105,0.0,277100.0,,,,1.506348,2.062051,1.0,1.0,1.0,1.0,692.0
SPEND,float64,2216800,2.114105,0.0,277100.0,,,,1.839542,2.429634,0.0,0.76,1.22,2.08,240.24
PROD_CODE,object,18565700,17.705631,0.0,277100.0,4368.0,PRD0903052,5992.0,,,,,,,
PROD_CODE_10,object,17734400,16.912842,0.0,277100.0,242.0,CL00063,12215.0,,,,,,,
PROD_CODE_20,object,18011500,17.177105,0.0,277100.0,87.0,DEP00008,23384.0,,,,,,,
PROD_CODE_30,object,17457300,16.648579,0.0,277100.0,30.0,G00007,62306.0,,,,,,,
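
The `Percent Null` column comes from comparing each column's non-null count with the total row count. As a small sketch on made-up rows, the same figure can be cross-checked with `isna().mean()`:

```python
import pandas as pd

# Made-up rows: two of the four customer codes are missing
sample = pd.DataFrame({
    "CUST_CODE": ["CUST0000361701", None, "CUST0000949903", None],
    "SPEND": [1.13, 1.10, 1.00, 4.68],
})

# Same expression as in get_dataframe_info()
percent_null = (1 - sample.count() / len(sample)) * 100

# Equivalent check: the mean of a boolean mask is the share of True values
percent_null_alt = sample.isna().mean() * 100

print(percent_null)
```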


### Save overview output in `.csv` and `.xlsx` format

In [None]:
#write to csv file, keeping the index because it holds the column names
#`today` is already a formatted string, so it is inserted as-is
df_info.to_csv(output_dir / f"{file_name}_overview_{today}.csv")
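
The row labels of `df_info` are the column names of the original data, so they only reach the CSV file when the index is written. A minimal round-trip sketch, using a made-up two-row overview and an in-memory buffer instead of a file on Drive:

```python
import io
import pandas as pd

# Made-up overview table shaped like df_info (illustrative values)
overview = pd.DataFrame(
    {"Data Types": ["int64", "object"], "Percent Null": [0.0, 18.3]},
    index=pd.Index(["SHOP_WEEK", "CUST_CODE"], name="Column Names"),
)

buffer = io.StringIO()
overview.to_csv(buffer)  # the index is written as the first CSV column
buffer.seek(0)

# Reading it back with index_col restores the row labels
restored = pd.read_csv(buffer, index_col="Column Names")

# With index=False the row labels would be missing from the output
without_index = overview.to_csv(index=False)
print(restored)
```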

In [None]:
#add a `Notes` column for use in a spreadsheet
df_info.insert(loc=0,
               column='Notes',
               value = '')

In [None]:
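#write to excel file; freeze_panes=(1,2) keeps the header row and the first two columns (column names and Notes) in view
#`today` is already a formatted string, so it is inserted as-is
df_info.to_excel(output_dir / f"{file_name}_overview_{today}.xlsx",
                 sheet_name=f'Overview {file_name}',
                 freeze_panes=(1,2)
                 )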

## Correlation Table
For numeric columns only

In [None]:
df.corr(numeric_only=True)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,BASKET_ID
SHOP_WEEK,,,,,,,
SHOP_DATE,,1.0,0.42,-0.0,-0.0,0.0,-0.01
SHOP_WEEKDAY,,0.42,1.0,0.01,-0.0,-0.0,0.0
SHOP_HOUR,,-0.0,0.01,1.0,-0.01,0.01,-0.07
QUANTITY,,-0.0,-0.0,-0.01,1.0,0.21,-0.0
SPEND,,0.0,-0.0,0.01,0.21,1.0,-0.0
BASKET_ID,,-0.01,0.0,-0.07,-0.0,-0.0,1.0
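
The empty `SHOP_WEEK` row and column above are expected: a column with a single constant value has zero variance, so its correlation with anything is undefined. A small sketch on made-up rows that reproduces the effect:

```python
import pandas as pd

# Made-up rows; SHOP_WEEK is constant, as in this extract
sample = pd.DataFrame({
    "SHOP_WEEK": [200701] * 4,
    "QUANTITY": [1, 3, 1, 2],
    "SPEND": [1.13, 4.68, 1.04, 2.36],
    "BASKET_SIZE": ["L", "L", "M", "S"],
})

# numeric_only=True drops the string column before computing correlations
corr = sample.corr(numeric_only=True)
print(corr)
```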


## Head and Tail Rows
First and last 10 rows

In [None]:
df.head(10)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
0,200701,20070304,1,17,1,1.13,PRD0900013,CL00015,DEP00004,G00003,D00001,CUST0000361701,MM,YA,994104700394636,L,MM,Full Shop,Mixed,STORE00001,LS,E02
1,200701,20070301,5,19,1,1.1,PRD0900015,CL00015,DEP00004,G00003,D00001,CUST0000871730,UM,,994104700728646,S,MM,Small Shop,Fresh,STORE00001,LS,E02
2,200701,20070303,7,12,1,1.0,PRD0900015,CL00015,DEP00004,G00003,D00001,CUST0000949903,MM,PE,994104700780122,M,UM,Top Up,Fresh,STORE00001,LS,E02
3,200701,20070303,7,15,3,4.68,PRD0900049,CL00160,DEP00054,G00016,D00003,CUST0000644893,LA,PE,994104700579780,L,LA,Top Up,Mixed,STORE00001,LS,E02
4,200701,20070302,6,14,1,1.04,PRD0900055,CL00230,DEP00081,G00027,D00008,CUST0000926111,UM,OT,994104700764453,L,UM,Top Up,Fresh,STORE00001,LS,E02
5,200701,20070304,1,14,1,1.6,PRD0900062,CL00175,DEP00059,G00017,D00004,CUST0000605487,LA,YF,994104700553719,M,MM,Top Up,Mixed,STORE00001,LS,E02
6,200701,20070301,5,21,1,2.36,PRD0900071,CL00086,DEP00024,G00007,D00002,CUST0000666576,MM,YA,994104700593739,L,MM,Full Shop,Mixed,STORE00001,LS,E02
7,200701,20070304,1,12,1,1.05,PRD0900077,CL00150,DEP00052,G00015,D00003,CUST0000710863,LA,YF,994104700622917,L,MM,Full Shop,Mixed,STORE00001,LS,E02
8,200701,20070304,1,12,1,1.05,PRD0900077,CL00150,DEP00052,G00015,D00003,CUST0000795333,MM,,994104700678351,L,MM,Full Shop,Mixed,STORE00001,LS,E02
9,200701,20070304,1,12,3,3.72,PRD0900086,CL00067,DEP00019,G00007,D00002,CUST0000710863,LA,YF,994104700622917,L,MM,Full Shop,Mixed,STORE00001,LS,E02


In [None]:
df.tail(10)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
277090,200701,20070302,6,18,3,0.45,PRD0903473,CL00014,DEP00004,G00003,D00001,CUST0000948225,MM,,994104700779080,M,MM,Small Shop,Fresh,STORE00558,SS,N03
277091,200701,20070301,5,20,3,4.02,PRD0904593,CL00010,DEP00003,G00002,D00001,,,,994104700052086,L,MM,Top Up,Fresh,STORE00558,SS,N03
277092,200701,20070303,7,15,1,9.24,PRD0904735,CL00235,DEP00083,G00028,D00008,CUST0000026510,MM,OT,994104700175791,L,MM,Top Up,Fresh,STORE00558,SS,N03
277093,200701,20070301,5,14,1,0.42,PRD0900436,CL00148,DEP00052,G00015,D00003,CUST0000927651,MM,OF,994104700765515,L,MM,Full Shop,Mixed,STORE00558,SS,N03
277094,200701,20070227,3,21,1,0.93,PRD0900580,CL00047,DEP00012,G00004,D00002,CUST0000872634,UM,YA,994104700729203,M,UM,Small Shop,Fresh,STORE00558,SS,N03
277095,200701,20070227,3,18,1,1.91,PRD0901045,CL00197,DEP00067,G00021,D00005,CUST0000880716,UM,OT,994104700734658,S,UM,Small Shop,Mixed,STORE00558,SS,N03
277096,200701,20070302,6,12,1,1.22,PRD0901711,CL00163,DEP00055,G00016,D00003,CUST0000980541,MM,OT,994104700800088,M,LA,Top Up,Fresh,STORE00558,SS,N03
277097,200701,20070301,5,13,1,1.08,PRD0902098,CL00002,DEP00001,G00001,D00001,CUST0000024585,MM,YF,994104700174614,S,MM,Small Shop,Fresh,STORE00558,SS,N03
277098,200701,20070226,2,14,1,1.08,PRD0902098,CL00002,DEP00001,G00001,D00001,CUST0000067777,MM,OT,994104700202723,M,MM,Top Up,Mixed,STORE00558,SS,N03
277099,200701,20070227,3,11,1,2.51,PRD0903360,CL00100,DEP00033,G00009,D00002,CUST0000024585,MM,YF,994104700174615,L,MM,Top Up,Mixed,STORE00558,SS,N03


## Duplicated Rows

In [None]:
df[df.duplicated(keep=False)]

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
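
The empty result above means no row is an exact copy of another. To illustrate what `keep=False` does when duplicates are present, here is a minimal sketch on made-up rows: it flags every copy, whereas the default flags only the later ones.

```python
import pandas as pd

# Made-up rows: the first two are identical
sample = pd.DataFrame({
    "PROD_CODE": ["PRD0900077", "PRD0900077", "PRD0900086"],
    "QUANTITY": [1, 1, 3],
})

all_copies = sample[sample.duplicated(keep=False)]  # every member of a duplicate group
later_copies = sample[sample.duplicated()]          # default keep='first': repeats only

print(all_copies)
```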


## Null Columns
List of columns with all NaN values

In [None]:
list(df.columns[df.isnull().all(axis=0)])

[]
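
An empty list means every column has at least one value. As a minimal sketch of a non-empty result, the made-up frame below includes one entirely empty column (`EMPTY_COL` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Made-up frame: SPEND is partly missing, EMPTY_COL is entirely missing
sample = pd.DataFrame({
    "SPEND": [1.13, np.nan],
    "EMPTY_COL": [np.nan, np.nan],
})

# .all(axis=0) is True only for columns where every value is null
all_null = list(sample.columns[sample.isnull().all(axis=0)])

# .any(axis=0) would also flag partly missing columns
any_null = list(sample.columns[sample.isnull().any(axis=0)])

print(all_null, any_null)
```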

# Columns

## List of Column Names

In [None]:
list(df.columns)

['SHOP_WEEK',
 'SHOP_DATE',
 'SHOP_WEEKDAY',
 'SHOP_HOUR',
 'QUANTITY',
 'SPEND',
 'PROD_CODE',
 'PROD_CODE_10',
 'PROD_CODE_20',
 'PROD_CODE_30',
 'PROD_CODE_40',
 'CUST_CODE',
 'CUST_PRICE_SENSITIVITY',
 'CUST_LIFESTAGE',
 'BASKET_ID',
 'BASKET_SIZE',
 'BASKET_PRICE_SENSITIVITY',
 'BASKET_TYPE',
 'BASKET_DOMINANT_MISSION',
 'STORE_CODE',
 'STORE_FORMAT',
 'STORE_REGION']

## Column Overview
This function loops over each column to produce the following info:
* name
* count of unique values
* datatype
* a line of code to copy into a new cell to display all value counts for the column

For columns with fewer unique values than the cutoff:
* a transposed display frame of all unique values and their counts

For columns with at least as many unique values as the cutoff:
* transposed display frames of the top and bottom 10 values and their counts


In [None]:
def column_overview(df, columns, cutoff=100):
    """
    Display column name, count of unique values, and an easy-to-read dataframe of individual unique values and their counts

    Parameters
    ----------
    df: dataframe
    columns: list of column names
    cutoff: int, maximum unique value count to display in full
    Recommended for string or object columns with unclear or low expected unique values.

    Returns
    -------
    None. Prints out one result for each column in the provided list.
    """
    for col in columns:
        #nunique() ignores NaN, while value_counts(dropna=False) lists it as its own value
        num_unique = df[col].nunique()
        counts = df[col].value_counts(dropna=False)

        print("\n")
        print(f"Column name: {col}")
        print(f"Number of Unique Values: {num_unique}")
        print(f"Column Datatype: {df[col].dtype}")
        print("\n")

        if num_unique < cutoff:
            print("use line below for vertical results")
            print(f"pd.DataFrame(df['{col}'].value_counts(dropna=False))")
            display(pd.DataFrame(counts).T)
        else:
            print(f"more than {cutoff} results, showing Top 10 and Bottom 10")
            print("use line below for complete results")
            print(f"pd.DataFrame(df['{col}'].value_counts(dropna=False))")
            print("\n")
            print(f"Top 10 Unique Values of {col}")
            display(pd.DataFrame(counts.head(10)).T)
            print(f"Bottom 10 Unique Values of {col}")
            display(pd.DataFrame(counts.tail(10)).T)
        #horizontal rule between columns
        print('\u2500' * 80)
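
One detail worth keeping in mind when reading the output: `nunique()` skips missing values by default, while `value_counts(dropna=False)` lists `NaN` as its own entry. That is why a column such as `CUST_PRICE_SENSITIVITY` can report 4 unique values yet show 5 entries in its table. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values with one missing entry
sensitivity = pd.Series(["MM", "LA", None, "MM", "XX"], name="CUST_PRICE_SENSITIVITY")

n_unique = sensitivity.nunique()                       # missing value not counted
n_unique_with_nan = sensitivity.nunique(dropna=False)  # missing value counted
counts = sensitivity.value_counts(dropna=False)        # missing value gets its own row

print(n_unique, n_unique_with_nan, len(counts))
```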


In [None]:
column_overview(df, df.columns, 200)



Column name: SHOP_WEEK
Number of Unique Values: 1
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_WEEK'].value_counts(dropna=False))


Unnamed: 0,200701
SHOP_WEEK,277100


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_DATE
Number of Unique Values: 7
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_DATE'].value_counts(dropna=False))


Unnamed: 0,20070302,20070228,20070301,20070304,20070303,20070226,20070227
SHOP_DATE,40639,40625,40216,39825,39665,38145,37985


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_WEEKDAY
Number of Unique Values: 7
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_WEEKDAY'].value_counts(dropna=False))


Unnamed: 0,6,4,5,1,7,2,3
SHOP_WEEKDAY,40639,40625,40216,39825,39665,38145,37985


────────────────────────────────────────────────────────────────────────────────


Column name: SHOP_HOUR
Number of Unique Values: 14
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['SHOP_HOUR'].value_counts(dropna=False))


Unnamed: 0,14,13,21,15,16,12,17,18,11,19,20,10,8,9
SHOP_HOUR,28217,26574,25762,25530,23342,23069,21020,19826,18822,16477,14498,13565,10822,9576


────────────────────────────────────────────────────────────────────────────────


Column name: QUANTITY
Number of Unique Values: 55
Column Datatype: int64


use line below for vertical results
pd.DataFrame(df['QUANTITY'].value_counts(dropna=False))


Unnamed: 0,1,3,4,6,5,7,8,9,11,12,10,14,13,16,15,17,18,20,19,21,22,26,23,27,29,25,24,28,32,35,44,33,30,58,36,54,34,45,31,40,75,90,39,154,692,38,81,62,43,52,59,50,65,47,41
QUANTITY,228292,37036,5027,2033,1181,834,692,524,278,178,153,149,115,109,100,90,69,31,28,23,22,18,15,12,11,8,8,7,6,6,5,5,4,3,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: SPEND
Number of Unique Values: 1668
Column Datatype: float64


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['SPEND'].value_counts(dropna=False))


Top 10 Unique Values of SPEND


Unnamed: 0,1.54,0.98,0.97,1.02,1.01,0.86,1.49,1.00,0.53,0.70
SPEND,4083,3911,3828,2553,2482,2448,2338,2234,2143,2002


Bottom 10 Unique Values of SPEND


Unnamed: 0,13.84,8.83,20.32,26.39,15.55,35.49,20.57,0.00,27.25,28.00
SPEND,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE
Number of Unique Values: 4368
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['PROD_CODE'].value_counts(dropna=False))


Top 10 Unique Values of PROD_CODE


Unnamed: 0,PRD0903052,PRD0903678,PRD0904358,PRD0900121,PRD0901265,PRD0900830,PRD0904976,PRD0900173,PRD0904887,PRD0901887
PROD_CODE,5992,5201,4254,3682,3359,1791,1448,1428,1380,1300


Bottom 10 Unique Values of PROD_CODE


Unnamed: 0,PRD0902670,PRD0901302,PRD0902687,PRD0901160,PRD0904800,PRD0902827,PRD0900057,PRD0903032,PRD0902095,PRD0901735
PROD_CODE,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_10
Number of Unique Values: 242
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['PROD_CODE_10'].value_counts(dropna=False))


Top 10 Unique Values of PROD_CODE_10


Unnamed: 0,CL00063,CL00031,CL00070,CL00045,CL00067,CL00079,CL00073,CL00222,CL00140,CL00201
PROD_CODE_10,12215,7554,7017,6382,6036,5781,5670,5323,4863,4645


Bottom 10 Unique Values of PROD_CODE_10


Unnamed: 0,CL00220,CL00242,CL00109,CL00174,CL00175,CL00241,CL00181,CL00223,CL00210,CL00166
PROD_CODE_10,15,15,15,13,13,12,11,6,6,5


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_20
Number of Unique Values: 87
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_20'].value_counts(dropna=False))


Unnamed: 0,DEP00008,DEP00019,DEP00011,DEP00022,DEP00067,DEP00052,DEP00021,DEP00020,DEP00053,DEP00055,DEP00054,DEP00001,DEP00046,DEP00012,DEP00076,DEP00049,DEP00002,DEP00010,DEP00069,DEP00003,DEP00048,DEP00025,DEP00004,DEP00035,DEP00073,DEP00005,DEP00013,DEP00047,DEP00039,DEP00051,DEP00036,DEP00009,DEP00037,DEP00016,DEP00034,DEP00023,DEP00024,DEP00027,DEP00042,DEP00050,DEP00033,DEP00030,DEP00071,DEP00044,DEP00083,DEP00070,DEP00026,DEP00081,DEP00031,DEP00018,DEP00041,DEP00062,DEP00040,DEP00089,DEP00028,DEP00068,DEP00043,DEP00084,DEP00056,DEP00088,DEP00061,DEP00029,DEP00077,DEP00090,DEP00079,DEP00085,DEP00017,DEP00007,DEP00086,DEP00063,DEP00057,DEP00059,DEP00006,DEP00066,DEP00014,DEP00082,DEP00058,DEP00078,DEP00045,DEP00032,DEP00080,DEP00060,DEP00075,DEP00015,DEP00074,DEP00038,DEP00087
PROD_CODE_20,23384,22894,19234,15603,13610,11915,10437,9443,9002,8525,6970,6670,6215,5608,5460,5427,5009,4760,4746,4148,4038,3987,3953,3740,3715,3605,3408,3304,3135,2939,2803,2772,2358,2323,2154,2062,1867,1565,1453,1427,1369,1329,1301,1236,1193,1157,1151,1069,989,966,912,847,845,771,750,711,695,617,524,469,263,259,259,206,190,175,163,154,127,97,71,69,63,62,56,50,40,38,34,28,25,19,19,19,15,15,15


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_30
Number of Unique Values: 30
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_30'].value_counts(dropna=False))


Unnamed: 0,G00007,G00004,G00016,G00015,G00010,G00021,G00013,G00001,G00008,G00023,G00022,G00003,G00014,G00002,G00005,G00006,G00011,G00028,G00012,G00009,G00018,G00027,G00030,G00017,G00029,G00024,G00031,G00025,G00020,G00026
PROD_CODE_30,62306,55758,24497,14854,14205,13610,13557,11679,10030,9209,7915,7775,6854,4148,3483,3452,3210,2127,1965,1397,1207,1119,771,723,469,297,206,190,62,25


────────────────────────────────────────────────────────────────────────────────


Column name: PROD_CODE_40
Number of Unique Values: 9
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['PROD_CODE_40'].value_counts(dropna=False))


Unnamed: 0,D00002,D00003,D00005,D00001,D00008,D00004,D00009,D00006,D00007
PROD_CODE_40,136426,79142,30734,23602,3715,1992,977,487,25


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_CODE
Number of Unique Values: 18301
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['CUST_CODE'].value_counts(dropna=False))


Top 10 Unique Values of CUST_CODE


Unnamed: 0,NaN,CUST0000413198,CUST0000222338,CUST0000468457,CUST0000357420,CUST0000105647,CUST0000372778,CUST0000275243,CUST0000284634,CUST0000351908
CUST_CODE,50717,81,79,77,73,73,70,70,69,69


Bottom 10 Unique Values of CUST_CODE


Unnamed: 0,CUST0000495839,CUST0000364432,CUST0000334217,CUST0000650687,CUST0000778755,CUST0000112392,CUST0000065275,CUST0000544090,CUST0000918559,CUST0000805364
CUST_CODE,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_PRICE_SENSITIVITY
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['CUST_PRICE_SENSITIVITY'].value_counts(dropna=False))


Unnamed: 0,MM,LA,UM,NaN,XX
CUST_PRICE_SENSITIVITY,103544,62720,59896,50717,223


────────────────────────────────────────────────────────────────────────────────


Column name: CUST_LIFESTAGE
Number of Unique Values: 6
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['CUST_LIFESTAGE'].value_counts(dropna=False))


Unnamed: 0,NaN,OT,YF,YA,OA,PE,OF
CUST_LIFESTAGE,78368,60657,45754,33559,29199,15618,13945


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_ID
Number of Unique Values: 41273
Column Datatype: int64


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['BASKET_ID'].value_counts(dropna=False))


Top 10 Unique Values of BASKET_ID


Unnamed: 0,994104700357918,994104700583306,994104700020560,994104700241263,994104700720925,994104700484776,994104700199213,994104700138299,994104700291735,994104700687116
BASKET_ID,57,55,54,54,52,52,52,51,51,51


Bottom 10 Unique Values of BASKET_ID


Unnamed: 0,994104700395742,994104700107965,994104700338686,994104700140844,994104700267632,994104700336080,994104700409887,994104700325613,994104700658152,994104700174614
BASKET_ID,1,1,1,1,1,1,1,1,1,1


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_SIZE
Number of Unique Values: 3
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_SIZE'].value_counts(dropna=False))


Unnamed: 0,L,M,S
BASKET_SIZE,198321,63383,15396


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_PRICE_SENSITIVITY
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_PRICE_SENSITIVITY'].value_counts(dropna=False))


Unnamed: 0,MM,UM,LA,XX
BASKET_PRICE_SENSITIVITY,144730,65958,65434,978


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_TYPE
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_TYPE'].value_counts(dropna=False))


Unnamed: 0,Top Up,Full Shop,Small Shop,XX
BASKET_TYPE,121307,101136,53975,682


────────────────────────────────────────────────────────────────────────────────


Column name: BASKET_DOMINANT_MISSION
Number of Unique Values: 5
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['BASKET_DOMINANT_MISSION'].value_counts(dropna=False))


Unnamed: 0,Fresh,Mixed,Grocery,Nonfood,XX
BASKET_DOMINANT_MISSION,142006,105295,25050,4067,682


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_CODE
Number of Unique Values: 759
Column Datatype: object


more than 200 results, showing Top 10 and Bottom 10
use line below for complete results
pd.DataFrame(df['STORE_CODE'].value_counts(dropna=False))


Top 10 Unique Values of STORE_CODE


Unnamed: 0,STORE00696,STORE01423,STORE02504,STORE02206,STORE01637,STORE01604,STORE02797,STORE01441,STORE00729,STORE01007
STORE_CODE,1306,1213,1170,1031,1025,1021,978,968,967,938


Bottom 10 Unique Values of STORE_CODE


Unnamed: 0,STORE00785,STORE01172,STORE02194,STORE01556,STORE00883,STORE00779,STORE00895,STORE02457,STORE02573,STORE02575
STORE_CODE,33,32,32,31,25,23,19,16,14,11


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_FORMAT
Number of Unique Values: 4
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['STORE_FORMAT'].value_counts(dropna=False))


Unnamed: 0,LS,MS,XLS,SS
STORE_FORMAT,173282,59923,23635,20260


────────────────────────────────────────────────────────────────────────────────


Column name: STORE_REGION
Number of Unique Values: 12
Column Datatype: object


use line below for vertical results
pd.DataFrame(df['STORE_REGION'].value_counts(dropna=False))


Unnamed: 0,S02,N01,W02,N03,S03,W01,N02,S01,E03,E01,W03,E02
STORE_REGION,27863,27356,26160,25992,22970,22827,22793,22358,21076,19709,19016,18980


────────────────────────────────────────────────────────────────────────────────
