diff --git a/examples/segmentation/segments.ipynb b/examples/segmentation/segments.ipynb
new file mode 100644
index 0000000000..eabf0b679c
--- /dev/null
+++ b/examples/segmentation/segments.ipynb
@@ -0,0 +1,1873 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Intro to Segmentation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Sometimes, certain subgroups of data can behave very differently from the overall dataset. When monitoring the health of a dataset, it’s often helpful to have visibility at the subgroup level to better understand how these subgroups contribute to trends in the overall dataset. whylogs supports data segmentation for this purpose.\n",
+    "\n",
+    "Data segmentation is done at the point of profiling a dataset.\n",
+    "\n",
+    "Segmentation can be done on a single feature or on multiple features simultaneously. For example, you could have different profiles according to your customers' gender (\"M\" or \"F\"), and also for combinations of features, say, Gender and City Code.\n",
+    "\n",
+    "Segments can be specified in two different ways (sketched below):\n",
+    "- At the feature level (i.e., a column name, such as \"Gender\" or \"Product Category\")\n",
+    "- At the feature-value level (i.e., a value of a given column, such as \"Product Category\": \"Books\")\n",
+    "\n"
+   ]
+  },
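+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For instance, a minimal sketch of both forms (each is demonstrated in full later in this notebook):\n",
+    "\n",
+    "```python\n",
+    "# Feature level: just column names\n",
+    "segments = ['Gender', 'Product Category']\n",
+    "\n",
+    "# Feature-value level: lists of key-value conditions\n",
+    "segments = [[{\"key\": \"Product Category\", \"value\": \"Books\"}]]\n",
+    "```"
+   ]
+  },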
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Table of Contents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Intro to Segmentation\n",
+    "- Segmentation on Different Features (Feature level)\n",
+    "- Segmentation on Key-Values (Feature-value level)\n",
+    "- Auto Segmentation\n",
+    "- Merging the Profiles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Segmentation on Different Features"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's use sample data for the following steps of this notebook.\n",
+    "We'll be using data from the [Retail Case Study Data](https://www.kaggle.com/darpan25bajaj/retail-case-study-data). The data was modified to contain features for a specific task: predicting whether a transaction is a purchase cancellation or not. For this example, we'll look only at the input features and target label, not at the prediction output itself.\n",
+    "\n",
+    "The dataset has information on the transaction itself, such as total amount and item price, as well as on the product (category and subcategory) and the customer (Age, Gender, City Code).\n",
+    "\n",
+    "We'll use data for one given day, logging a single batch of data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install whylogs\n",
+    "!pip install pybars3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "daily_df = pd.read_csv(\"https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/retail-daily-features.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "[styled dataframe table omitted -- see the text/plain rendering below]\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Transaction ID Customer ID  Product Subcategory Code  Product Category Code  \\\n",
+       "0  T25601292314     C268458                        12                      6   \n",
+       "1   T1465175267     C271344                         3                      5   \n",
+       "2   T4968790114     C272305                         4                      3   \n",
+       "3  T50504166310     C275057                         4                      4   \n",
+       "4  T10877729712     C270074                        10                      5   \n",
+       "\n",
+       "   Item Price  Total Tax  Total Amount Store Type  Product Category  \\\n",
+       "0       114.9    24.1290      253.9290   TeleShop  Home and kitchen   \n",
+       "1       107.7    22.6170      238.0170     e-Shop             Books   \n",
+       "2        14.6     7.6650       80.6650     e-Shop       Electronics   \n",
+       "3        15.7     4.9455       52.0455        MBR              Bags   \n",
+       "4       144.1    45.3915      477.6915     e-Shop             Books   \n",
+       "\n",
+       "  Product Subcategory Date of Birth Gender  City Code  \\\n",
+       "0               Tools    1976-10-08      M        1.0   \n",
+       "1              Comics    1970-01-29      F        5.0   \n",
+       "2             Mobiles    1975-08-25      F       10.0   \n",
+       "3               Women    1980-09-17      M        7.0   \n",
+       "4         Non-Fiction    1983-02-20      M       10.0   \n",
+       "\n",
+       "   Age at Transaction Date  Purchase Canceled  Transaction Day of Week  \\\n",
+       "0                     36.0                0.0                        0   \n",
+       "1                     43.0                0.0                        0   \n",
+       "2                     37.0                0.0                        0   \n",
+       "3                     32.0                0.0                        0   \n",
+       "4                     30.0                0.0                        0   \n",
+       "\n",
+       "   Transaction Week  Transaction Batch  \n",
+       "0                 0                  0  \n",
+       "1                 0                  0  \n",
+       "2                 0                  0  \n",
+       "3                 0                  0  \n",
+       "4                 0                  0  "
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "daily_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's first segment our profiles according to the customer's `Gender` and the `Product Category`.\n",
+    "Let's take a look at the possible categories of each feature:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['M', 'F']"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "daily_df['Gender'].unique().tolist()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['Home and kitchen', 'Books', 'Electronics', 'Bags', 'Footwear', 'Clothing']"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "daily_df['Product Category'].unique().tolist()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We obtain the segmented profiles when logging the dataframe, by specifying the column names we want to segment on:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "WARN: Missing config\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "from whylogs import get_or_create_session\n",
+    "\n",
+    "session = get_or_create_session()\n",
+    "\n",
+    "features_to_segment = ['Gender', 'Product Category']\n",
+    "now = datetime.today()\n",
+    "\n",
+    "with session.logger(\"segment-test\", dataset_timestamp=now) as logger:\n",
+    "    logger.log_dataframe(daily_df, segments=features_to_segment)\n",
+    "    profile_segments = logger.segmented_profiles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After the `with` statement is closed, the profiles are written to disk, but let's also store them in memory as `profile_segments` to make this easier.\n",
+    "\n",
+    "`profile_segments` is a dict with one entry per segment: each key identifies a combination of segment values, and each value is the corresponding profile.\n",
+    "We can see the different profiles by inspecting the `tags` of each one, which show the `Gender` and `Product Category` each profile is segmented on."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Bags', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Books', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Clothing', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Electronics', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Footwear', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Home and kitchen', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Bags', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Books', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Clothing', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Electronics', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Footwear', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Home and kitchen', 'name': 'segment-test'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "for k, prof in profile_segments.items():\n",
+    "    print(prof.tags)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Each profile is a statistical fingerprint of its slice of the data. Let's take a look at a simple summary that can be extracted from the profile for, say, male customers who bought (or cancelled) clothing products:"
+   ]
+  },
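+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a side note (a minimal sketch, not needed for the rest of the notebook): each value in `profile_segments` is a regular whylogs `DatasetProfile`, keyed by an internal segment identifier, which is why we can iterate over the dict and match on `tags` below.\n",
+    "\n",
+    "```python\n",
+    "# Peek at one entry of the mapping: segment key -> DatasetProfile\n",
+    "first_key = next(iter(profile_segments))\n",
+    "print(first_key, type(profile_segments[first_key]))\n",
+    "```"
+   ]
+  },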
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     column  count  null_count  bool_count  numeric_count  \\\n",
+      "0                 City Code   66.0         0.0         0.0           66.0   \n",
+      "1                Store Type   66.0         0.0         0.0            0.0   \n",
+      "2   Age at Transaction Date   66.0         0.0         0.0           66.0   \n",
+      "3             Date of Birth   66.0         0.0         0.0            0.0   \n",
+      "4  Product Subcategory Code   66.0         0.0         0.0           66.0   \n",
+      "\n",
+      "    max       mean   min    stddev  nunique_numbers  ...  stddev_token_length  \\\n",
+      "0  10.0   5.469697   1.0  2.735632             10.0  ...             0.000000   \n",
+      "1   0.0   0.000000   0.0  0.000000              0.0  ...             0.361298   \n",
+      "2  41.0  29.803030  20.0  6.043992             22.0  ...             0.000000   \n",
+      "3   0.0   0.000000   0.0  0.000000              0.0  ...             0.000000   \n",
+      "4   4.0   2.530303   1.0  1.338423              3.0  ...             0.000000   \n",
+      "\n",
+      "   quantile_0.0000  quantile_0.0100  quantile_0.0500  quantile_0.2500  \\\n",
+      "0              1.0              1.0              2.0              3.0   \n",
+      "1              NaN              NaN              NaN              NaN   \n",
+      "2             20.0             20.0             21.0             24.0   \n",
+      "3              NaN              NaN              NaN              NaN   \n",
+      "4              1.0              1.0              1.0              1.0   \n",
+      "\n",
+      "   quantile_0.5000  quantile_0.7500  quantile_0.9500  quantile_0.9900  \\\n",
+      "0              5.0              8.0             10.0             10.0   \n",
+      "1              NaN              NaN              NaN              NaN   \n",
+      "2             29.0             35.0             39.0             41.0   \n",
+      "3              NaN              NaN              NaN              NaN   \n",
+      "4              3.0              4.0              4.0              4.0   \n",
+      "\n",
+      "   quantile_1.0000  \n",
+      "0             10.0  \n",
+      "1              NaN  \n",
+      "2             41.0  \n",
+      "3              NaN  \n",
+      "4              4.0  \n",
+      "\n",
+      "[5 rows x 40 columns]\n"
+     ]
+    }
+   ],
+   "source": [
+    "product_label = 'Clothing'\n",
+    "gender_label = 'M'\n",
+    "for k, prof in profile_segments.items():\n",
+    "    if product_label == prof.tags['whylogs.tag.Product Category'] and gender_label == prof.tags['whylogs.tag.Gender']:\n",
+    "        profile_summary = prof.flat_summary()['summary']\n",
+    "        target_profile = prof\n",
+    "        print(profile_summary.head())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make this more readable, let's use one of the features of the `NotebookProfileViewer` to display simple statistics for a given feature.\n",
+    "We might be interested in the `Purchase Canceled` feature:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Feature Statistics for:\n",
+      " Feature:Purchase Canceled\n",
+      "For profile segment of:\n",
+      " Gender: M\n",
+      " Product Category: Clothing\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       ""
+      ],
+      "text/plain": [
+       ""
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from whylogs.viz import NotebookProfileViewer\n",
+    "\n",
+    "feature_name = \"Purchase Canceled\"\n",
+    "\n",
+    "print(\"Feature Statistics for:\\n Feature:{}\\nFor profile segment of:\\n Gender: {}\\n Product Category: {}\".format(feature_name, gender_label, product_label))\n",
+    "visualization = NotebookProfileViewer()\n",
+    "visualization.set_profiles(target_profile=target_profile)\n",
+    "visualization.feature_statistics(feature_name=feature_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Around 6% of the transactions by male customers for clothing products were cancellations. This is considerably lower than the cancellation rate of most other segments, and of the dataset overall. You can confirm this by inspecting the other segments!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Segmentation on Key-Values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The second way of defining segments is to specify the values of the columns you want to segment on. We do this with key-value pairs.\n",
+    "\n",
+    "Suppose we are only interested in transactions from a given type of store: `e-Shop`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "e-Shop            375\n",
+       "TeleShop          184\n",
+       "MBR               176\n",
+       "Flagship store    174\n",
+       "Name: Store Type, dtype: int64"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "daily_df['Store Type'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As before, we specify `features_to_segment`, but this time we pass a list of lists. Each inner list holds one or more dicts with `key` and `value` fields, like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from whylogs import get_or_create_session\n",
+    "\n",
+    "session = get_or_create_session()\n",
+    "\n",
+    "features_to_segment = [[{\"key\": \"Store Type\", \"value\": \"e-Shop\"}]]\n",
+    "\n",
+    "now = datetime.today()\n",
+    "\n",
+    "with session.logger(\"segment-test\", dataset_timestamp=now) as logger:\n",
+    "    logger.log_dataframe(daily_df, segments=features_to_segment)\n",
+    "    profile_segments = logger.segmented_profiles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As before, we can take a look at the segments. In this case, we have only the segment related to transactions that took place in the e-Shop:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'whylogs.tag.Store Type': 'e-Shop', 'name': 'segment-test'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "for k, prof in profile_segments.items():\n",
+    "    print(prof.tags)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When segmenting at the feature-value level in Python, each nested list defines one or more conditions that together define a segment. In the example above, we get only the segment for which `Store Type` equals `e-Shop`, whose profile covers exactly the 375 `e-Shop` rows counted earlier.\n",
+    "We could define additional segments, such as:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "features_to_segment = [\n",
+    "    [{\"key\": \"Store Type\", \"value\": \"e-Shop\"}],\n",
+    "    [{\"key\": \"Store Type\", \"value\": \"TeleShop\"}, {\"key\": \"Product Subcategory\", \"value\": \"Comics\"}],\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Auto Segmentation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In addition to manual segmentation, we can also automatically estimate the most important features and values on which to segment. whylogs does this using entropy-based methods: the intuition is that the columns with the highest entropy with respect to a target feature are likely to be interesting ones to segment on.\n",
+    "\n",
+    "To obtain the columns that have the most entropy with respect to our feature of interest (`Purchase Canceled`), we pass the dataframe to the `estimate_segments` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['City Code']"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from whylogs import get_or_create_session\n",
+    "\n",
+    "sess = get_or_create_session()\n",
+    "\n",
+    "auto_segments = sess.estimate_segments(daily_df, max_segments=20, target_field=\"Purchase Canceled\", name=\"demo1\")\n",
+    "\n",
+    "auto_segments"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that we can also specify the maximum number of segments; here we allow at most 20.\n",
+    "\n",
+    "If no target field is specified, the method will pick a suitable field based on the maximum-entropy column.\n",
+    "\n",
+    "`estimate_segments` returns a list of column names, which in turn can be passed as the `segments` argument to `log_dataframe`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from whylogs import get_or_create_session\n",
+    "\n",
+    "session = get_or_create_session()\n",
+    "\n",
+    "now = datetime.today()\n",
+    "\n",
+    "with session.logger(\"segment-test\", dataset_timestamp=now) as logger:\n",
+    "    logger.log_dataframe(daily_df, segments=auto_segments)\n",
+    "    profile_segments = logger.segmented_profiles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once again, let's check the segmented profiles.\n",
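+    "\n",
+    "Segments also make it easy to compare subgroups. As a minimal sketch (reusing the `flat_summary` API shown earlier in this notebook), we could print the mean of `Purchase Canceled` for each segment:\n",
+    "\n",
+    "```python\n",
+    "# Compare cancellation rates across the auto-detected City Code segments\n",
+    "for _, prof in profile_segments.items():\n",
+    "    summary = prof.flat_summary()['summary']\n",
+    "    rate = summary.loc[summary['column'] == 'Purchase Canceled', 'mean'].iloc[0]\n",
+    "    print(prof.tags['whylogs.tag.City Code'], round(rate, 3))\n",
+    "```\n",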
+    "In this case, we have 10 different segments, one for each city code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'whylogs.tag.City Code': '1.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '2.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '3.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '4.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '5.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '6.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '7.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '8.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '9.0', 'name': 'segment-test'}\n",
+      "{'whylogs.tag.City Code': '10.0', 'name': 'segment-test'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "for k, prof in profile_segments.items():\n",
+    "    print(prof.tags)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Merging the Profiles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want the complete profile back from the segmented ones, you can make use of the fact that DatasetProfiles are mergeable.\n",
+    "\n",
+    "Taking the last example: to get the complete profile back from the per-`City Code` profiles, just use the `.merge` method of the profile objects:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from functools import reduce\n",
+    "\n",
+    "profiles = [prof for _, prof in profile_segments.items()]\n",
+    "merged = reduce(lambda x, y: x.merge(y), profiles)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`merged` is now the profile for the complete original DataFrame. Let's take a look at the merged profile's summary:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "[styled summary table omitted -- see the text/plain rendering below]\n",
+       "</div>\n",
+       "
" + ], + "text/plain": [ + " column count null_count bool_count numeric_count \\\n", + "0 Date of Birth 908.0 0.0 0.0 0.0 \n", + "1 City Code 908.0 0.0 0.0 908.0 \n", + "2 Product Category 908.0 0.0 0.0 0.0 \n", + "3 Transaction Week 908.0 0.0 0.0 908.0 \n", + "4 Purchase Canceled 908.0 72.0 0.0 836.0 \n", + "5 Transaction Batch 908.0 0.0 0.0 908.0 \n", + "6 Product Subcategory Code 908.0 0.0 0.0 908.0 \n", + "7 Transaction ID 908.0 0.0 0.0 0.0 \n", + "8 Total Tax 908.0 0.0 0.0 908.0 \n", + "9 Product Subcategory 908.0 0.0 0.0 0.0 \n", + "10 Customer ID 908.0 0.0 0.0 0.0 \n", + "11 Age at Transaction Date 908.0 0.0 0.0 908.0 \n", + "12 Product Category Code 908.0 0.0 0.0 908.0 \n", + "13 Gender 908.0 0.0 0.0 0.0 \n", + "14 Store Type 908.0 0.0 0.0 0.0 \n", + "15 Item Price 908.0 0.0 0.0 908.0 \n", + "16 Total Amount 908.0 0.0 0.0 908.0 \n", + "17 Transaction Day of Week 908.0 0.0 0.0 908.0 \n", + "\n", + " max mean min stddev nunique_numbers ... \\\n", + "0 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "1 10.0000 5.390969 1.000 2.887491 10.0 ... \n", + "2 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "3 0.0000 0.000000 0.000 0.000000 1.0 ... \n", + "4 1.0000 0.095694 0.000 0.294347 2.0 ... \n", + "5 0.0000 0.000000 0.000 0.000000 1.0 ... \n", + "6 12.0000 6.008811 1.000 3.756645 12.0 ... \n", + "7 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "8 78.2775 24.808725 0.861 18.527800 809.0 ... \n", + "9 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "10 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "11 43.0000 30.579295 19.000 6.697701 25.0 ... \n", + "12 6.0000 3.719163 1.000 1.705231 6.0 ... \n", + "13 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "14 0.0000 0.000000 0.000 0.000000 0.0 ... \n", + "15 149.8000 79.096916 7.100 42.087292 664.0 ... \n", + "16 823.7775 219.909604 -767.975 240.507694 830.0 ... \n", + "17 0.0000 0.000000 0.000 0.000000 1.0 ... 
\n", + "\n", + " stddev_token_length quantile_0.0000 quantile_0.0100 quantile_0.0500 \\\n", + "0 0.000000 NaN NaN NaN \n", + "1 0.000000 1.000000 1.000000 1.000000 \n", + "2 0.767990 NaN NaN NaN \n", + "3 0.000000 0.000000 0.000000 0.000000 \n", + "4 0.000000 0.000000 0.000000 0.000000 \n", + "5 0.000000 0.000000 0.000000 0.000000 \n", + "6 0.000000 1.000000 1.000000 1.000000 \n", + "7 0.000000 NaN NaN NaN \n", + "8 0.000000 0.861000 1.701000 3.454500 \n", + "9 0.425731 NaN NaN NaN \n", + "10 0.000000 NaN NaN NaN \n", + "11 0.000000 19.000000 20.000000 20.000000 \n", + "12 0.000000 1.000000 1.000000 1.000000 \n", + "13 0.000000 NaN NaN NaN \n", + "14 0.392934 NaN NaN NaN \n", + "15 0.000000 7.100000 9.000000 13.200000 \n", + "16 0.000000 -767.974976 -495.924011 -107.626999 \n", + "17 0.000000 0.000000 0.000000 0.000000 \n", + "\n", + " quantile_0.2500 quantile_0.5000 quantile_0.7500 quantile_0.9500 \\\n", + "0 NaN NaN NaN NaN \n", + "1 3.000000 5.000000 8.000000 10.000000 \n", + "2 NaN NaN NaN NaN \n", + "3 0.000000 0.000000 0.000000 0.000000 \n", + "4 0.000000 0.000000 0.000000 1.000000 \n", + "5 0.000000 0.000000 0.000000 0.000000 \n", + "6 3.000000 5.000000 10.000000 12.000000 \n", + "7 NaN NaN NaN NaN \n", + "8 9.922500 20.643000 36.855000 61.698002 \n", + "9 NaN NaN NaN NaN \n", + "10 NaN NaN NaN NaN \n", + "11 25.000000 31.000000 36.000000 41.000000 \n", + "12 2.000000 4.000000 5.000000 6.000000 \n", + "13 NaN NaN NaN NaN \n", + "14 NaN NaN NaN NaN \n", + "15 43.400002 80.099998 116.199997 141.500000 \n", + "16 82.432999 185.639999 360.009003 641.341980 \n", + "17 0.000000 0.000000 0.000000 0.000000 \n", + "\n", + " quantile_0.9900 quantile_1.0000 \n", + "0 NaN NaN \n", + "1 10.000000 10.000000 \n", + "2 NaN NaN \n", + "3 0.000000 0.000000 \n", + "4 1.000000 1.000000 \n", + "5 0.000000 0.000000 \n", + "6 12.000000 12.000000 \n", + "7 NaN NaN \n", + "8 75.809998 78.277496 \n", + "9 NaN NaN \n", + "10 NaN NaN \n", + "11 42.000000 43.000000 \n", + "12 6.000000 6.000000 \n", + "13 NaN NaN \n", + "14 NaN NaN \n", + "15 148.300003 149.800003 \n", + "16 797.809998 823.777527 \n", + "17 0.000000 0.000000 \n", + "\n", + "[18 rows x 40 columns]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged.flat_summary()['summary']" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "323493c40bedb65fef2eec2a6e595ce0cca722dcb720da40c0d127c8422c938f" + }, + "kernelspec": { + "display_name": "Python 3.8.10 ('base')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +}