# LUFlow Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Data source: [https://www.kaggle.com/datasets/mryanm/luflow-network-intrusion-detection-data-set](https://www.kaggle.com/datasets/mryanm/luflow-network-intrusion-detection-data-set)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_columns', None)

In [2]:
DATA_PATH_BASE = '/data/disk2/luflow'

In [3]:
# List the paths to data and the number of files
all_paths = []

for path, subdirs, files in os.walk(DATA_PATH_BASE):
    for name in files:
        if name.endswith('.csv'):
            all_paths.append(os.path.join(path, name))

display(all_paths)

display(len(all_paths))

['/data/disk2/luflow/2021/02/2021.02.09/2021.02.09.csv',
 '/data/disk2/luflow/2021/02/2021.02.04/2021.02.04.csv',
 '/data/disk2/luflow/2021/02/2021.02.03/2021.02.03.csv',
 '/data/disk2/luflow/2021/02/2021.02.14/2021.02.14.csv',
 '/data/disk2/luflow/2021/02/2021.02.08/2021.02.08.csv',
 '/data/disk2/luflow/2021/02/2021.02.15/2021.02.15.csv',
 '/data/disk2/luflow/2021/02/2021.02.05/2021.02.05.csv',
 '/data/disk2/luflow/2021/02/2021.02.01/2021.02.01.csv',
 '/data/disk2/luflow/2021/02/2021.02.07/2021.02.07.csv',
 '/data/disk2/luflow/2021/02/2021.02.13/2021.02.13.csv',
 '/data/disk2/luflow/2021/02/2021.02.17/2021.02.17.csv',
 '/data/disk2/luflow/2021/02/2021.02.10/2021.02.10.csv',
 '/data/disk2/luflow/2021/02/2021.02.06/2021.02.06.csv',
 '/data/disk2/luflow/2021/02/2021.02.12/2021.02.12.csv',
 '/data/disk2/luflow/2021/02/2021.02.11/2021.02.11.csv',
 '/data/disk2/luflow/2021/02/2021.02.16/2021.02.16.csv',
 '/data/disk2/luflow/2021/02/2021.02.02/2021.02.02.csv',
 '/data/disk2/luflow/2021/01/20

241

## Sample Analysis

We choose a few indices in the list as samples to analyze together. Due to the sheer size of the data, we will stick to analyzing these eight days instead of the whole capture.

In [4]:
sample_paths = [all_paths[i] for i in (0, 50, 75, 100, 125, 150, 175, 200)]
sample_paths

['/data/disk2/luflow/2021/02/2021.02.09/2021.02.09.csv',
 '/data/disk2/luflow/2020/08/2020.08.13/2020.08.13.csv',
 '/data/disk2/luflow/2020/08/2020.08.16/2020.08.16.csv',
 '/data/disk2/luflow/2020/10/2020.10.30/2020.10.30.csv',
 '/data/disk2/luflow/2020/11/2020.11.04/2020.11.04.csv',
 '/data/disk2/luflow/2020/09/2020.09.23/2020.09.23.csv',
 '/data/disk2/luflow/2020/06/2020.06.19/2020.06.19.csv',
 '/data/disk2/luflow/2020/12/2020.12.05/2020.12.05.csv']

In [5]:
# Read every sampled file and concatenate once instead of growing a DataFrame in a loop
data_sample = pd.concat([pd.read_csv(filepath) for filepath in sample_paths])

In [6]:
data_sample.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 7053658 entries, 0 to 861472
Data columns (total 16 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   avg_ipt        7053658 non-null  float64
 1   bytes_in       7053658 non-null  int64  
 2   bytes_out      7053658 non-null  int64  
 3   dest_ip        7053658 non-null  int64  
 4   dest_port      6929209 non-null  float64
 5   entropy        7053658 non-null  float64
 6   num_pkts_out   7053658 non-null  int64  
 7   num_pkts_in    7053658 non-null  int64  
 8   proto          7053658 non-null  int64  
 9   src_ip         7053658 non-null  int64  
 10  src_port       6929209 non-null  float64
 11  time_end       7053658 non-null  int64  
 12  time_start     7053658 non-null  int64  
 13  total_entropy  7053658 non-null  float64
 14  label          7053658 non-null  object 
 15  duration       7053658 non-null  float64
dtypes: float64(6), int64(9), object(1)
memory usage: 914.9+ MB


In [7]:
data_sample.describe()

Unnamed: 0,avg_ipt,bytes_in,bytes_out,dest_ip,dest_port,entropy,num_pkts_out,num_pkts_in,proto,src_ip,src_port,time_end,time_start,total_entropy,duration
count,7053658.0,7053658.0,7053658.0,7053658.0,6929209.0,7053658.0,7053658.0,7053658.0,7053658.0,7053658.0,6929209.0,7053658.0,7053658.0,7053658.0,7053658.0
mean,2465881.0,575.9836,2670.398,3441.887,13275.55,3.232973,5.884168,3.596375,6.34941,21856.47,40579.75,1455979000000000.0,1455487000000000.0,11320.08,1.0286
std,55941210.0,3037.122,6568.602,16494.73,18011.7,2.389113,19.30359,13.47172,2.273362,48330.59,19993.21,436417500000000.0,437066400000000.0,49390.51,7.000111
min,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,1.0,3.0,1.0,15925360000.0,15925370000.0,0.0,0.0
25%,0.0,0.0,0.0,786.0,445.0,0.0,1.0,0.0,6.0,786.0,29644.0,1597293000000000.0,1597292000000000.0,0.0,0.0
50%,0.0,0.0,191.0,786.0,9200.0,3.832851,2.0,1.0,6.0,786.0,50032.0,1600829000000000.0,1600829000000000.0,2109.405,0.000167
75%,13.0,270.0,2021.0,786.0,9200.0,4.952367,6.0,3.0,6.0,19182.0,54438.0,1604463000000000.0,1604463000000000.0,14782.96,0.127098
max,4294967000.0,65512.0,65535.0,398722.0,65535.0,93.9307,255.0,255.0,132.0,398722.0,65535.0,1612912000000000.0,1612912000000000.0,5413978.0,10070.32


In [8]:
data_sample.head()

Unnamed: 0,avg_ipt,bytes_in,bytes_out,dest_ip,dest_port,entropy,num_pkts_out,num_pkts_in,proto,src_ip,src_port,time_end,time_start,total_entropy,label,duration
0,329.366667,19920,43375,786,9200.0,14.01139,224,134,6,786,42850.0,1612879933536483,1612879903869536,886850.94,benign,29.666946
1,354.571429,120,81,786,23.0,4.920487,20,27,6,786,6453.0,1612879927566457,1612879918705133,989.01794,outlier,8.861324
2,0.0,0,0,786,80.0,0.0,1,5,6,786,34024.0,1612879912432857,1612879912432857,0.0,malicious,0.0
3,0.0,56,56,786,,1.020244,1,1,1,786,,1612879927204083,1612879927203971,114.267365,malicious,0.000112
4,301.24,257,81,786,23.0,5.232285,22,28,6,786,6441.0,161287992767784,1612879918491693,1768.5123,outlier,9.186147


In [9]:
# Labels distribution
data_sample['label'].value_counts()

label
benign       3535169
malicious    2754751
outlier       763738
Name: count, dtype: int64

In [10]:
# IP addresses are anonymized via ordinal numbers
data_sample['src_ip'].value_counts()

src_ip
786       4061188
45899      181531
49877      131096
8048       112803
7552       111349
           ...   
32934           1
41682           1
49544           1
265862          1
34038           1
Name: count, Length: 2101, dtype: int64

In [11]:
data_sample['dest_ip'].value_counts()

dest_ip
786       6592136
18403       54180
49877       32097
45090       27489
15169       26692
           ...   
40065           1
265887          1
15879           1
28591           1
34038           1
Name: count, Length: 1022, dtype: int64

In [12]:
# Port distribution
data_sample['dest_port'].value_counts()

dest_port
9200.0     2561746
445.0      1918403
22.0        310432
5900.0      167620
5060.0      133589
            ...   
37240.0          1
33729.0          1
33542.0          1
31234.0          1
20373.0          1
Name: count, Length: 59313, dtype: int64

The ports do not look anonymized, but the most common ones, such as 443, would be expected at the top of the ranking.
What are these ports?

| Port | Service |
| ---- | ------- |
| 9200 | elasticsearch |
| 445  | microsoft-ds |
| 22   | ssh |
| 5900 | vnc |
| 5060 | sip |

These ports look realistic, although their distribution is rather specific to the capture network and might not represent other network types, where these services would not be as predominant.

In [13]:
# How much does HTTPS (port 443) cover - only about 0.37% of the flows ...
len(data_sample[data_sample['dest_port'] == 443]) / len(data_sample)

0.003663914524917426
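
The value above is a fraction, so it corresponds to roughly 0.37% of the sampled flows. A minimal sketch of computing such shares for several ports at once, directly in percent, might look as follows. The `port_share_percent` helper and the toy data are assumptions made for illustration and are not part of the original analysis.

```python
import pandas as pd


def port_share_percent(flows: pd.DataFrame, ports: list) -> pd.Series:
    """Return the share of flows per destination port, in percent of all flows."""
    # value_counts() skips NaN ports (e.g., ICMP flows), so we divide by the
    # full row count to keep the same denominator as len(flows) in the cell above.
    counts = flows['dest_port'].value_counts()
    wanted = [float(port) for port in ports]  # dest_port is stored as float64
    return counts.reindex(wanted, fill_value=0) / len(flows) * 100


# Toy data standing in for data_sample (hypothetical values)
toy = pd.DataFrame({'dest_port': [9200.0, 9200.0, 445.0, 443.0, None, 22.0, 9200.0, 445.0]})
result = port_share_percent(toy, [9200, 445, 443, 80])
print(result)
```

Dividing by `len(flows)` rather than using `value_counts(normalize=True)` is a deliberate choice here, because the latter would ignore flows without a port in the denominator.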

### Prepare Paths for Every File Based on Their Month

In [14]:
# Split them by months
paths_bymonth = {}

for path in sorted(all_paths):
    year_month = os.path.split(path)[1].rsplit('.', maxsplit=2)[0]

    if year_month not in paths_bymonth:
        paths_bymonth[year_month] = []

    paths_bymonth[year_month].append(path)

paths_bymonth

{'2020.06': ['/data/disk2/luflow/2020/06/2020.06.19/2020.06.19.csv',
  '/data/disk2/luflow/2020/06/2020.06.20/2020.06.20.csv',
  '/data/disk2/luflow/2020/06/2020.06.21/2020.06.21.csv',
  '/data/disk2/luflow/2020/06/2020.06.22/2020.06.22.csv',
  '/data/disk2/luflow/2020/06/2020.06.23/2020.06.23.csv',
  '/data/disk2/luflow/2020/06/2020.06.24/2020.06.24.csv',
  '/data/disk2/luflow/2020/06/2020.06.25/2020.06.25.csv',
  '/data/disk2/luflow/2020/06/2020.06.26/2020.06.26.csv',
  '/data/disk2/luflow/2020/06/2020.06.27/2020.06.27.csv',
  '/data/disk2/luflow/2020/06/2020.06.28/2020.06.28.csv',
  '/data/disk2/luflow/2020/06/2020.06.29/2020.06.29.csv',
  '/data/disk2/luflow/2020/06/2020.06.30/2020.06.30.csv'],
 '2020.07': ['/data/disk2/luflow/2020/07/2020.07.01/2020.07.01.csv',
  '/data/disk2/luflow/2020/07/2020.07.02/2020.07.02.csv',
  '/data/disk2/luflow/2020/07/2020.07.03/2020.07.03.csv',
  '/data/disk2/luflow/2020/07/2020.07.04/2020.07.04.csv',
  '/data/disk2/luflow/2020/07/2020.07.05/2020.07.

Based on this direct analysis, we see that the capture ran more or less continuously from 2020-06-19 to 2021-02-17, and that three more days from June 2022 are included. We will focus on analyzing the 2020 and 2021 data because of their continuity.

### 2022 Captures Brief Analysis

Analyze capture files from 2022 separately.

In [15]:
# Try to open 3 at once
display(paths_bymonth['2022.06'])

data2022 = pd.concat([pd.read_csv(flowfile) for flowfile in paths_bymonth['2022.06']])

['/data/disk2/luflow/2022/06/2022.06.12/2022.06.12.csv',
 '/data/disk2/luflow/2022/06/2022.06.13/2022.06.13.csv',
 '/data/disk2/luflow/2022/06/2022.06.14/2022.06.14.csv']

In [16]:
data2022.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 1068376 entries, 0 to 590085
Data columns (total 16 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   avg_ipt        1068376 non-null  float64
 1   bytes_in       1068376 non-null  int64  
 2   bytes_out      1068376 non-null  int64  
 3   dest_ip        1068376 non-null  int64  
 4   dest_port      964168 non-null   float64
 5   entropy        1068376 non-null  float64
 6   num_pkts_out   1068376 non-null  int64  
 7   num_pkts_in    1068376 non-null  int64  
 8   proto          1068376 non-null  int64  
 9   src_ip         1068376 non-null  int64  
 10  src_port       964168 non-null   float64
 11  time_end       1068376 non-null  int64  
 12  time_start     1068376 non-null  int64  
 13  total_entropy  1068376 non-null  float64
 14  label          1068376 non-null  object 
 15  duration       1068376 non-null  float64
dtypes: float64(6), int64(9), object(1)
memory usage: 138.6+ MB


In [17]:
data2022['src_ip'].value_counts()

src_ip
786    1068376
Name: count, dtype: int64

In [18]:
data2022['dest_ip'].value_counts()

dest_ip
786    1068376
Name: count, dtype: int64

In [19]:
data2022['label'].value_counts()

label
benign       517210
outlier      440178
malicious    110988
Name: count, dtype: int64

## Timestamp and Other Analyses

In [20]:
# Timestamp format
data_sample['time_start'].head()

0    1612879903869536
1    1612879918705133
2    1612879912432857
3    1612879927203971
4    1612879918491693
Name: time_start, dtype: int64

In [21]:
# When we sort the timestamps, we observe mixed precision: some values have 16 digits (microseconds), others only 11 digits...
data_sample['time_start'].sort_values()

510977         15925366400
464890         15925373126
296203         15925397203
332537         15925398899
155338         15925403809
                ...       
274391    1612912374009122
275284    1612912374009685
275285    1612912374009685
275679    1612912374009685
275283    1612912374009685
Name: time_start, Length: 7053658, dtype: int64

We need to reformat these timestamps before any meaningful analysis can be made.

In [22]:
# Based on manual analysis, we need to fix the timestamps as follows:
# We want every timestamp in millisecond precision (13 digits), so we append 00 to the short ones and cut the last 000 from the long ones
def unify_timestamp_length(timestamp):
  """Unifies the length of a timestamp to 13 decimal digits.

  Args:
      timestamp: An integer representing the timestamp.

  Returns:
      A string representation of the timestamp with 13 decimal digits,
      right-padded with zeros if necessary. Truncates if longer than 13 digits.
  """
  timestamp_str = str(timestamp)

  if len(timestamp_str) > 13:
    return timestamp_str[:13]  # Truncate if longer than 13 digits
  else:
    return timestamp_str.ljust(13, '0')  # Right-pad with zeros if necessary

In [23]:
# Verify the function works
data_sample['time_start'].apply(unify_timestamp_length).sort_values()

37435     1592524759061
37153     1592524764730
37413     1592524766901
37167     1592524766951
37412     1592524767904
              ...      
274388    1612912374009
274395    1612912374009
274392    1612912374009
274390    1612912374009
275288    1612912374009
Name: time_start, Length: 7053658, dtype: object
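
Applying a Python function row by row can be slow on millions of flows. The following is a minimal sketch of the same 13-digit unification using pandas string methods on the whole column at once. The `unify_timestamps_vectorized` name is hypothetical, and the sketch assumes non-negative integer timestamps like those shown in this notebook.

```python
import pandas as pd


def unify_timestamps_vectorized(timestamps: pd.Series) -> pd.Series:
    """Cut or zero-pad integer timestamps to 13 digits and return int64 milliseconds."""
    as_text = timestamps.astype(str)
    # Keep at most the first 13 digits, then right-pad shorter values with zeros
    unified_text = as_text.str.slice(0, 13).str.ljust(13, '0')
    return unified_text.astype('int64')


# Example values with 11, 16, and 13 digits, taken from this notebook's outputs
raw = pd.Series([15925366400, 1612879903869536, 1602520007326])
unified = unify_timestamps_vectorized(raw)
print(unified.tolist())
print(pd.to_datetime(unified, unit='ms').tolist())
```

Unlike `unify_timestamp_length`, this variant returns integers rather than strings, so the result can be passed straight to `pd.to_datetime(..., unit='ms')`.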

In [24]:
# Function for measuring capture duration
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Computes the total timespan of the capture.

    Expects a Series of timestamp objects sorted in descending order. Returns a tuple of
    (total duration without gaps, whether the capture is contiguous, durations of the
    contiguous blocks sorted from longest to shortest).
    """
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps to find gaps longer than gap_max_secs
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= pd.Timedelta(seconds=gap_max_secs):
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # A gap closes the current contiguous block and starts a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    # Log the final contiguous block and sort the blocks by their length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

Now, go day by day and search for capture gaps.

In [25]:
# Iterate over every daily file and determine its continuity
for month, paths in paths_bymonth.items():
    for datapath in paths:
        # Label each result with the current file's date. The run shown below used a
        # hardcoded path here, which is why every output line reads 2020.07.23.
        file = os.path.splitext(os.path.basename(datapath))[0]
        # Load the file
        data = pd.read_csv(datapath)

        # Normalize timestamps, sort in a descending manner and convert to datetime objects
        timestamps = data['time_start'].apply(unify_timestamp_length).astype(np.int64)
        timestamps = timestamps.sort_values(ascending=False)
        timestamps = pd.to_datetime(timestamps, origin='unix', unit='ms')

        # Print the result of the capture duration/continuity analysis (default gap threshold: 300 s)
        print(f"{file} : {measure_real_capture_dur(timestamps)}")


2020.07.23 : (Timedelta('1 days 00:00:24.009000'), True, [Timedelta('1 days 00:00:24.009000')])
2020.07.23 : (Timedelta('1 days 00:00:23.967000'), True, [Timedelta('1 days 00:00:23.967000')])
2020.07.23 : (Timedelta('1 days 00:00:23.387000'), True, [Timedelta('1 days 00:00:23.387000')])
2020.07.23 : (Timedelta('1 days 00:00:31.926000'), True, [Timedelta('1 days 00:00:31.926000')])
2020.07.23 : (Timedelta('1 days 00:00:43.610000'), False, [Timedelta('1 days 00:00:25.975000'), Timedelta('0 days 00:00:17.635000')])
2020.07.23 : (Timedelta('1 days 00:00:15.838000'), True, [Timedelta('1 days 00:00:15.838000')])
2020.07.23 : (Timedelta('1 days 00:00:37.176000'), True, [Timedelta('1 days 00:00:37.176000')])
2020.07.23 : (Timedelta('1 days 00:00:16.213000'), True, [Timedelta('1 days 00:00:16.213000')])
2020.07.23 : (Timedelta('1 days 00:00:11.298000'), True, [Timedelta('1 days 00:00:11.298000')])
2020.07.23 : (Timedelta('1 days 00:00:19.160000'), True, [Timedelta('1 days 00:00:19.160000')])
20

It looks like these daily captures are more or less continuous.
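
The same continuity check can also be expressed without an explicit Python loop. The following is a minimal sketch, assuming timestamps already converted with `pd.to_datetime`. The `find_capture_gaps` helper is hypothetical and reports the gaps themselves rather than the durations of the contiguous blocks.

```python
import pandas as pd


def find_capture_gaps(timestamps: pd.Series, gap_max_secs: int = 300) -> pd.DataFrame:
    """List every pause between consecutive flow starts longer than gap_max_secs."""
    ordered = timestamps.sort_values().reset_index(drop=True)
    deltas = ordered.diff()  # time elapsed since the previous flow start
    is_gap = deltas > pd.Timedelta(seconds=gap_max_secs)
    return pd.DataFrame({
        'gap_start': ordered.shift(1)[is_gap],
        'gap_end': ordered[is_gap],
        'gap_length': deltas[is_gap],
    }).reset_index(drop=True)


# Toy timestamps in milliseconds (hypothetical): one pause of 880 s between flows
toy_ms = [0, 60_000, 120_000, 1_000_000, 1_060_000]
toy_timestamps = pd.Series(pd.to_datetime(toy_ms, unit='ms'))
gaps = find_capture_gaps(toy_timestamps)
print(gaps)
```

An empty result would correspond to a capture that `measure_real_capture_dur` reports as contiguous for the same threshold.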

### General Information About Each Month

In [26]:
# NOTE: not used below; provide_capture_info() hardcodes a threshold of 100 flows per daily file
UNIQ_IPS_THRESHOLD = 25


def provide_capture_info(paths: list) -> dict:
    """Summarizes the given daily flow files: active IPs, file count, flow count, and total duration in seconds."""
    uniq_ips    = set()
    files_cnt   = 0
    flows_total = 0
    dur_total   = 0

    for flows_path in paths:
        flows_df = pd.read_csv(flows_path)

        # Update the unique IP set with addresses that appear in more than 100 flows of this file
        ips_activity = flows_df['src_ip'].value_counts()
        uniq_ips = uniq_ips.union(set(ips_activity[ips_activity > 100].keys()))

        ips_activity = flows_df['dest_ip'].value_counts()
        uniq_ips = uniq_ips.union(set(ips_activity[ips_activity > 100].keys()))

        # Determine the duration from the earliest start and the latest end timestamp
        start_stamp = flows_df['time_start'].apply(unify_timestamp_length).sort_values().iloc[0]
        end_stamp   = flows_df['time_end'].apply(unify_timestamp_length).sort_values().iloc[-1]

        # Timestamps are in milliseconds, the duration is accumulated in whole seconds
        dur_total += (int(end_stamp) - int(start_stamp)) // 1000

        # Update counters
        files_cnt   += 1
        flows_total += len(flows_df)

    return {'uniq_ips': uniq_ips, 'files_cnt': files_cnt, 'flows_total': flows_total, 'dur_total': dur_total}


In [27]:
# Provide capture info for all months and summary as well
total_stats = {
    'uniq_ips':  set(),
    'files_cnt': 0,
    'flows_total': 0,
    'dur_total': 0
}

for month, month_data in paths_bymonth.items():
    month_stats = provide_capture_info(month_data)
    total_stats['uniq_ips'] = total_stats['uniq_ips'].union(month_stats['uniq_ips'])
    total_stats['files_cnt'] += month_stats['files_cnt']
    total_stats['flows_total'] += month_stats['flows_total']
    total_stats['dur_total'] += month_stats['dur_total']

    month_stats['uniq_ips'] = len(month_stats['uniq_ips'])

    print(month)
    print(f'{month_stats}\n')

2020.06
{'uniq_ips': 870, 'files_cnt': 12, 'flows_total': 12594630, 'dur_total': 1045956}

2020.07
{'uniq_ips': 1403, 'files_cnt': 31, 'flows_total': 25056405, 'dur_total': 2682483}

2020.08
{'uniq_ips': 1534, 'files_cnt': 31, 'flows_total': 24652019, 'dur_total': 2867225}

2020.09
{'uniq_ips': 1668, 'files_cnt': 30, 'flows_total': 26206532, 'dur_total': 5912330}

2020.10
{'uniq_ips': 1760, 'files_cnt': 30, 'flows_total': 26948169, 'dur_total': 2609290}

2020.11
{'uniq_ips': 1676, 'files_cnt': 30, 'flows_total': 27679773, 'dur_total': 2566000}

2020.12
{'uniq_ips': 1122, 'files_cnt': 28, 'flows_total': 20689199, 'dur_total': 2419855}

2021.01
{'uniq_ips': 1083, 'files_cnt': 29, 'flows_total': 26986466, 'dur_total': 18422712}

2021.02
{'uniq_ips': 1, 'files_cnt': 17, 'flows_total': 14674978, 'dur_total': 1304073}

2022.06
{'uniq_ips': 1, 'files_cnt': 3, 'flows_total': 1068376, 'dur_total': 183398}



In [28]:
# Total number of unique IPs with more than 100 flows in a daily file, followed by file count, flow count, and duration in seconds
print(len(total_stats['uniq_ips']))
print(total_stats['files_cnt'])
print(total_stats['flows_total'])
print(total_stats['dur_total'])

4489
241
206556547
40013322
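
The durations above are in seconds, so a month with N daily files should total roughly N × 86,400 s. A minimal sketch of comparing the measured totals against this expectation is shown below. The figures are copied from the per-month output above, while the `duration_ratio` helper and the 5% tolerance are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (files_cnt, dur_total in seconds) copied from the per-month output above
monthly = {
    '2020.06': (12, 1045956),
    '2020.07': (31, 2682483),
    '2020.08': (31, 2867225),
    '2020.09': (30, 5912330),
    '2020.10': (30, 2609290),
    '2020.11': (30, 2566000),
    '2020.12': (28, 2419855),
    '2021.01': (29, 18422712),
    '2021.02': (17, 1304073),
    '2022.06': (3, 183398),
}


def duration_ratio(files_cnt: int, dur_total_secs: int) -> float:
    """Measured duration divided by the duration expected for files_cnt full days."""
    return dur_total_secs / (files_cnt * SECONDS_PER_DAY)


# Hypothetical tolerance: flag months exceeding the expected duration by more than 5%
RATIO_LIMIT = 1.05
suspicious = [m for m, (cnt, dur) in monthly.items() if duration_ratio(cnt, dur) > RATIO_LIMIT]

for month, (cnt, dur) in monthly.items():
    print(f'{month}: {dur / SECONDS_PER_DAY:7.2f} days measured for {cnt} files')
print(suspicious)
```

With these figures, 2020.09 and 2021.01 exceed the expectation by a wide margin, and 2020.08 slightly exceeds the 5% tolerance as well.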


### Investigation of Files With Broken Timestamps

In [29]:
# Find out which file in the set is broken
def find_faulty_captures(month_files_list: list):
    for file_path in month_files_list:
        file_pd = pd.read_csv(file_path)

        start = int(file_pd['time_start'].apply(unify_timestamp_length).sort_values().iloc[0])
        end   = int(file_pd['time_end'].apply(unify_timestamp_length).sort_values().iloc[-1])

        diff = (end - start) / 1000 / 3600

        if diff > 24.25:
            print(f'{file_path} : {diff}')

In [30]:
# Files from 2021.01 and 2020.09 look sketchy
find_faulty_captures(paths_bymonth['2021.01'])

/data/disk2/luflow/2021/01/2021.01.07/2021.01.07.csv : 2095.9662933333334
/data/disk2/luflow/2021/01/2021.01.20/2021.01.20.csv : 2407.9639880555555


In [31]:
find_faulty_captures(paths_bymonth['2020.09'])

/data/disk2/luflow/2020/09/2020.09.02/2020.09.02.csv : 193.5781397222222
/data/disk2/luflow/2020/09/2020.09.17/2020.09.17.csv : 823.4335019444444
/data/disk2/luflow/2020/09/2020.09.26/2020.09.26.csv : 25.534390277777774


In [34]:
# Let us analyze some of these faulty captures
faulty_sample = pd.read_csv(os.path.join(DATA_PATH_BASE, '2021/01/2021.01.07/2021.01.07.csv'))
faulty_sample_sort_start = faulty_sample['time_start'].apply(unify_timestamp_length).sort_values()
faulty_sample_sort_end   = faulty_sample['time_end'].apply(unify_timestamp_length).sort_values()

In [35]:
faulty_sample_sort_start

2186590    1602518516255
2186593    1602518517878
2186591    1602518518261
2186592    1602518518811
2186594    1602518519048
               ...      
2730253    1610063987267
2730255    1610063987527
2730256    1610063987799
2730257    1610063988008
2730258    1610063988066
Name: time_start, Length: 2730274, dtype: object

Apparently, the file contains some old feature vectors. Compare with the timestamps before unification.

In [36]:
# Before unification
faulty_sample['time_start'].sort_values().head()

893572     16099906359
902157     16099917048
1155085    16099923453
985410     16099923639
1185359    16099924370
Name: time_start, dtype: int64

In [37]:
# After unification - are the 1602* timestamps really there, or does the unification somehow corrupt them?
faulty_sample['time_start'].apply(unify_timestamp_length).sort_values().head()

2186590    1602518516255
2186593    1602518517878
2186591    1602518518261
2186592    1602518518811
2186594    1602518519048
Name: time_start, dtype: object

In [38]:
# Discard the 11-digit timestamps by keeping only values above 2e11
longer_than_11 = faulty_sample[faulty_sample['time_start'] > 200000000000]
longer_than_11['time_start'].sort_values()

2184362       1602520007326
2184364       1602520007328
2656086       1602524705624
2183128       1602525677093
2180790       1602529483742
                 ...       
2730245    1610063986922806
2730253    1610063987267019
2730255    1610063987527807
2730256    1610063987799245
2730258    1610063988066807
Name: time_start, Length: 2729977, dtype: int64

As is apparent, several flows carry timestamps from October 2020, which significantly breaks the duration measurement.

In [39]:
# Keep only flows timestamped from January 2021 onward (1609459200000 ms = 2021-01-01 00:00:00 UTC) to see whether the remaining timestamps are fine
after_jan = faulty_sample[faulty_sample['time_start'] > 1609459200000]

In [40]:
after_jan['time_start'].sort_values()

887528        1609979427388
887634        1609979609239
887914        1609980734992
887993        1609981297904
889075        1609981459439
                 ...       
2730245    1610063986922806
2730253    1610063987267019
2730255    1610063987527807
2730256    1610063987799245
2730258    1610063988066807
Name: time_start, Length: 2729945, dtype: int64

In this case, the result looks largely correct. For the purpose of duration reporting, we therefore consider each file a full-day capture, although the timestamps remain rather unreliable.
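
The manual filtering above could be generalized to any daily file by keeping only the flows whose normalized start time falls within the day the file is named after. The following is a minimal sketch under several assumptions: the timestamps are in UTC, a one-hour tolerance around the day is acceptable, and the `keep_flows_from_day` helper is hypothetical rather than part of the original analysis.

```python
import pandas as pd


def keep_flows_from_day(flows: pd.DataFrame, day: str, tolerance_hours: float = 1.0) -> pd.DataFrame:
    """Keep flows whose start time lies within the given day, plus a tolerance."""
    # Normalize mixed-precision timestamps to 13-digit milliseconds
    start_ms = flows['time_start'].astype(str).str.slice(0, 13).str.ljust(13, '0').astype('int64')
    start = pd.to_datetime(start_ms, unit='ms')

    tolerance = pd.Timedelta(hours=tolerance_hours)
    day_begin = pd.Timestamp(day) - tolerance
    day_end = pd.Timestamp(day) + pd.Timedelta(days=1) + tolerance
    return flows[(start >= day_begin) & (start < day_end)]


# Toy rows using timestamps seen in the 2021.01.07 file: 11, 13, and 16 digits
toy_flows = pd.DataFrame({'time_start': [16099906359, 1602520007326, 1610063987267019]})
kept = keep_flows_from_day(toy_flows, '2021-01-07')
print(kept)
```

In this toy example, the flow stamped in October 2020 is dropped while the two flows from 2021-01-07 are kept, regardless of their original precision.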