# Introduction to the Firewall Log File


---

The firewall log file records network activity, with one entry per event. Each entry carries the fields described below:

- **Time**: The timestamp when the event was logged.
- **Log comp**: The component of the firewall that generated the log.
- **Log subtype**: The subtype of the log event, indicating the action taken (e.g., Allowed, Blocked).
- **Username**: The username associated with the event, if applicable.
- **Firewall rule**: The ID of the firewall rule that processed the event.
- **Firewall rule name**: The name of the firewall rule that processed the event.
- **NAT rule**: The ID of the NAT rule applied to the event.
- **NAT rule name**: The name of the NAT rule applied to the event.
- **In interface**: The network interface on which the traffic entered the firewall.
- **Out interface**: The network interface through which the traffic left the firewall.
- **Src IP**: The source IP address of the traffic.
- **Dst IP**: The destination IP address of the traffic.
- **Src port**: The source port number of the traffic.
- **Dst port**: The destination port number of the traffic.
- **Protocol**: The protocol used by the traffic (e.g., TCP, UDP).
- **Rule type**: The type of rule (e.g., 1 for firewall rule).
- **Live PCAP**: Indicates whether a live packet capture is available for the event.
- **Message**: Additional message or information related to the event.
- **Log occurrence**: Number of times this log entry has been logged.



## Objective:


---
Students should focus on exploratory and explanatory data analysis through univariate exploration of the data, and should learn to tell good data visualizations from bad ones. The following questions guide the analysis:

**Basic Statistics**

---



* What is the total number of log entries in the dataset?
* How many unique firewall rules are present in the dataset?
* How many unique NAT rules are present in the dataset?
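
One possible way to answer the questions above is sketched here. It assumes the column names match the field list (`Firewall rule`, `NAT rule`); the `basic_stats` helper and the `sample` rows are hypothetical and only illustrate the idea. In the notebook, the loaded `df` would be passed instead of `sample`.

```python
import pandas as pd


def basic_stats(logs: pd.DataFrame) -> dict:
    """Return entry and unique-rule counts for a firewall log DataFrame."""
    return {
        "total_entries": len(logs),  # one row per log entry
        # nunique() ignores missing values by default
        "unique_firewall_rules": logs["Firewall rule"].nunique(),
        "unique_nat_rules": logs["NAT rule"].nunique(),
    }


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Firewall rule": [5, 5, 7, 0],
    "NAT rule": [2, 2, 3, None],
})
stats = basic_stats(sample)
print(stats)
```

Note that this counts rows. If the `Log occurrence` field is greater than 1 for some rows, summing that column would count events rather than entries.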

**Time-Based Analysis**

---



* What is the distribution of log entries over time? (Create a histogram of log entries by time)
* Identify peak times of network activity based on the number of log entries.
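
A minimal sketch of the time-based questions follows. It assumes the `Time` column holds timestamps that `pd.to_datetime` can parse; the actual format in the dataset is not known here, and the `entries_per_hour` helper and `sample` rows are hypothetical. Counts are grouped by hour of the day, so a log spanning several days is folded into 24 bins.

```python
import pandas as pd


def entries_per_hour(logs: pd.DataFrame) -> pd.Series:
    """Count log entries for each hour of the day (0-23)."""
    # Unparseable timestamps become NaT and are dropped
    times = pd.to_datetime(logs["Time"], errors="coerce").dropna()
    return times.dt.hour.value_counts().sort_index()


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Time": [
        "2024-05-01 09:05:00",
        "2024-05-01 09:40:00",
        "2024-05-01 09:59:00",
        "2024-05-01 10:15:00",
        "2024-05-01 12:30:00",
    ],
})
per_hour = entries_per_hour(sample)
peak_hour = per_hour.idxmax()  # hour with the most entries

# Bar chart of entries per hour (drawn with matplotlib via pandas)
ax = per_hour.plot(kind="bar")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Log entries")
ax.set_title("Log entries by hour")
```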

**Source and Destination IP Analysis**


---


* List the top 10 most frequent source IP addresses.
* List the top 10 most frequent destination IP addresses.
* Visualize the distribution of source and destination IP addresses.
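
The top-10 lists could be produced along these lines. The column names `Src IP` and `Dst IP` are assumed from the field list, and the `top_values` helper and `sample` rows are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def top_values(logs: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values in a column with their counts."""
    return logs[column].value_counts().head(n)


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Src IP": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"],
    "Dst IP": ["8.8.8.8", "1.1.1.1", "8.8.8.8", "8.8.8.8", "1.1.1.1"],
})
top_src = top_values(sample, "Src IP")
top_dst = top_values(sample, "Dst IP")

# Horizontal bars keep long IP labels readable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = zip(axes, (top_src, top_dst), ("source IPs", "destination IPs"))
for ax, counts, label in panels:
    counts.sort_values().plot(kind="barh", ax=ax)
    ax.set_xlabel("Log entries")
    ax.set_title(f"Top {label}")
fig.tight_layout()
```

With many distinct addresses, plotting only the top few tends to be clearer than plotting every address.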

**Port Analysis**


---



* Identify the most common source and destination ports.
* Visualize the distribution of source and destination ports using bar charts.
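
A sketch for the port questions, assuming columns named `Src port` and `Dst port`. The `port_counts` helper and `sample` rows are hypothetical. Port numbers are labels rather than magnitudes, so the sketch converts them to strings and counts them as categories instead of binning them in a numeric histogram.

```python
import pandas as pd


def port_counts(logs: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return counts for the n most common ports in a column."""
    # Coerce to numbers and drop rows with no port recorded
    ports = pd.to_numeric(logs[column], errors="coerce").dropna().astype(int)
    counts = ports.value_counts().head(n)
    # String labels make the plot treat ports as categories
    counts.index = counts.index.astype(str)
    return counts


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Src port": [51000, 51001, 51000, 51002, None],
    "Dst port": [443, 443, 80, 443, None],
})
src_ports = port_counts(sample, "Src port")
dst_ports = port_counts(sample, "Dst port")

ax = dst_ports.plot(kind="bar")
ax.set_xlabel("Destination port")
ax.set_ylabel("Log entries")
ax.set_title("Most common destination ports")
```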

**Protocol Analysis**

---



* What is the distribution of protocols used in the log entries?
* Visualize the protocol distribution with a pie chart or bar chart.
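
The protocol distribution could be computed as relative frequencies, as in this sketch. It assumes a `Protocol` column; the `protocol_share` helper and `sample` rows are hypothetical.

```python
import pandas as pd


def protocol_share(logs: pd.DataFrame) -> pd.Series:
    """Return each protocol's share of log entries as a fraction of 1."""
    return logs["Protocol"].value_counts(normalize=True)


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({"Protocol": ["TCP", "TCP", "UDP", "TCP", "ICMP"]})
share = protocol_share(sample)

# Plot as percentages so the axis is easy to read
ax = (share * 100).plot(kind="bar")
ax.set_xlabel("Protocol")
ax.set_ylabel("Share of log entries (%)")
ax.set_title("Protocol distribution")
```

A bar chart is used here because bar lengths are usually easier to compare than pie slices when several protocols have similar shares.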

**Firewall and NAT Rule Analysis**

---

* Which firewall rule has the most log entries?
* Which NAT rule is applied most frequently?
* Visualize the distribution of firewall and NAT rule usage.
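
Rule usage can be counted per ID and name pair, as sketched below. The column names are assumed from the field list, and the `rule_usage` helper, the `sample` rows, and the rule names in them are hypothetical.

```python
import pandas as pd


def rule_usage(logs: pd.DataFrame, id_col: str, name_col: str) -> pd.Series:
    """Count log entries per (rule ID, rule name) pair, most used first."""
    # groupby() skips rows where either key is missing
    return logs.groupby([id_col, name_col]).size().sort_values(ascending=False)


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Firewall rule": [5, 5, 7, 5],
    "Firewall rule name": ["Allow web", "Allow web", "Block p2p", "Allow web"],
    "NAT rule": [2, 2, 3, 2],
    "NAT rule name": ["MASQ out", "MASQ out", "DNAT web", "MASQ out"],
})
fw_usage = rule_usage(sample, "Firewall rule", "Firewall rule name")
nat_usage = rule_usage(sample, "NAT rule", "NAT rule name")
top_fw_rule = fw_usage.idxmax()    # (ID, name) of the busiest firewall rule
top_nat_rule = nat_usage.idxmax()  # (ID, name) of the busiest NAT rule

# Label bars by rule name only (drop the ID level of the index)
ax = fw_usage.droplevel(0).sort_values().plot(kind="barh")
ax.set_xlabel("Log entries")
ax.set_title("Firewall rule usage")
```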


In [1]:
!pip install skimpy


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from skimpy import skim

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')



In [None]:
# Load the firewall log export into a DataFrame
df = pd.read_csv('./new_logs.csv')

In [None]:
skim(df)

