Step 1: Create a Sample Log File

In [9]:
sample_logs = '''127.0.0.1 - - [12/Oct/2024:06:25:24 +0000] "GET /index.html HTTP/1.1" 200 1043
192.168.1.5 - - [12/Oct/2024:06:25:50 +0000] "POST /login HTTP/1.1" 302 512
10.0.0.10 - - [12/Oct/2024:06:26:12 +0000] "GET /dashboard HTTP/1.1" 200 2048
192.168.1.5 - - [12/Oct/2024:06:27:05 +0000] "GET /logout HTTP/1.1" 200 123'''

with open("access.log", "w") as f:
    f.write(sample_logs)

print("Sample access.log file created!")

Sample access.log file created!


Step 2: Import Required Libraries

In [10]:
import re
import pandas as pd

Step 3: Define the Regular Expression Pattern

In [11]:
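# Capture groups, in order: IP, timestamp, method, URL, status, size.
# The protocol (e.g. HTTP/1.1) is matched but not captured.
log_pattern = re.compile(r'(\S+) - - \[(.*?)\] "(\S+) (\S+) \S+" (\d+) (\d+)')
print("Regex pattern compiled successfully!")

Regex pattern compiled successfully!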
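The pattern above packs six fields into one line, which makes it hard to read. As a sketch only, the same expression can be written in verbose mode with named groups. The name LOG_PATTERN_VERBOSE and the group names (ip, timestamp, method, url, status, size) are assumptions for illustration and are not used elsewhere in this notebook.

```python
import re

# Hypothetical verbose rewrite of log_pattern; group names are illustrative.
# In re.VERBOSE mode unescaped whitespace is ignored, so literal spaces are written as "\ ".
LOG_PATTERN_VERBOSE = re.compile(r'''
    (?P<ip>\S+)\ -\ -\        # client address, then the two "-" placeholder fields
    \[(?P<timestamp>.*?)\]\   # timestamp between square brackets
    "(?P<method>\S+)\         # HTTP method
    (?P<url>\S+)\ \S+"\       # requested path, then the protocol (not captured)
    (?P<status>\d+)\          # status code
    (?P<size>\d+)             # response size in bytes
''', re.VERBOSE)

line = '127.0.0.1 - - [12/Oct/2024:06:25:24 +0000] "GET /index.html HTTP/1.1" 200 1043'
fields = LOG_PATTERN_VERBOSE.match(line).groupdict()
print(fields["ip"], fields["method"], fields["url"], fields["status"])
```

Named groups let later code refer to fields["status"] instead of relying on the position of each value in match.groups().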


Step 4: Read and Parse Log File

In [12]:
logs = []

with open("access.log", "r") as f:
    for line in f:
        match = log_pattern.match(line)
        # Lines that do not fit the pattern are skipped without a warning.
        if match:
            ip, timestamp, method, url, status, size = match.groups()
            # Store status and size as integers so they can be counted and summed.
            logs.append((ip, timestamp, method, url, int(status), int(size)))

df = pd.DataFrame(logs, columns=['IP', 'Timestamp', 'Method', 'URL', 'Status', 'Size'])
print("Raw log data loaded successfully!")
df.head()

Raw log data loaded successfully!


            IP                   Timestamp  Method          URL  Status  Size
0    127.0.0.1  12/Oct/2024:06:25:24 +0000     GET  /index.html     200  1043
1  192.168.1.5  12/Oct/2024:06:25:50 +0000    POST       /login     302   512
2    10.0.0.10  12/Oct/2024:06:26:12 +0000     GET   /dashboard     200  2048
3  192.168.1.5  12/Oct/2024:06:27:05 +0000     GET      /logout     200   123
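The parsing loop drops any line the pattern does not match, so a malformed or differently formatted entry disappears without notice. A minimal sketch of one way to keep track of them follows. The function name parse_lines and the second demo line are hypothetical: the demo assumes a server that logs "-" instead of a byte count, which the (\d+) size group cannot match.

```python
import re

# Same pattern as in Step 3, repeated so the example runs on its own.
log_pattern = re.compile(r'(\S+) - - \[(.*?)\] "(\S+) (\S+) \S+" (\d+) (\d+)')

def parse_lines(lines):
    """Return (records, skipped): parsed tuples plus the raw lines that did not match."""
    records, skipped = [], []
    for line in lines:
        match = log_pattern.match(line)
        if match:
            ip, timestamp, method, url, status, size = match.groups()
            records.append((ip, timestamp, method, url, int(status), int(size)))
        elif line.strip():  # ignore blank lines, keep everything else for review
            skipped.append(line.rstrip("\n"))
    return records, skipped

demo_lines = [
    '127.0.0.1 - - [12/Oct/2024:06:25:24 +0000] "GET /index.html HTTP/1.1" 200 1043\n',
    # Hypothetical entry with "-" as the size; it fails the final (\d+) group.
    '10.0.0.10 - - [12/Oct/2024:06:26:12 +0000] "GET /empty HTTP/1.1" 304 -\n',
]
records, skipped = parse_lines(demo_lines)
print(len(records), "parsed,", len(skipped), "skipped")
```

Inspecting skipped after a run shows whether the pattern needs to be loosened for the log format at hand.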


Step 5: Clean and Preprocess the Data

In [13]:
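# Parse the timestamp strings; errors='coerce' turns values that do not fit the format into NaT.
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d/%b/%Y:%H:%M:%S %z', errors='coerce')
# Remove exact duplicate rows, then rows whose timestamp could not be parsed.
df = df.drop_duplicates().dropna(subset=['Timestamp'])

df.to_csv("cleaned_logs.csv", index=False)
print("Cleaned log data saved as 'cleaned_logs.csv'")
df.head()

Cleaned log data saved as 'cleaned_logs.csv'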


            IP                  Timestamp  Method          URL  Status  Size
0    127.0.0.1  2024-10-12 06:25:24+00:00     GET  /index.html     200  1043
1  192.168.1.5  2024-10-12 06:25:50+00:00    POST       /login     302   512
2    10.0.0.10  2024-10-12 06:26:12+00:00     GET   /dashboard     200  2048
3  192.168.1.5  2024-10-12 06:27:05+00:00     GET      /logout     200   123
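To see what the format string in this step describes, the sketch below parses one timestamp with the standard library's datetime.strptime, which accepts the same directives. This is an analogy, not a description of how pandas works internally. The helper parse_log_time is hypothetical and returns None on failure, roughly mirroring what errors='coerce' does with NaT. Note that %b matches month abbreviations according to the current locale, so "Oct" assumes an English or C locale.

```python
from datetime import datetime, timezone

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # same directives as the format used in Step 5

def parse_log_time(text):
    """Hypothetical helper: return a timezone-aware datetime, or None if text does not fit."""
    try:
        return datetime.strptime(text, LOG_TIME_FORMAT)
    except ValueError:
        return None

good = parse_log_time("12/Oct/2024:06:25:24 +0000")
bad = parse_log_time("not a timestamp")
print(good.isoformat())  # 2024-10-12T06:25:24+00:00
print(bad)               # None
```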


Step 6: Perform Simple Analysis (Optional)

In [14]:
print("Requests per IP:")
print(df['IP'].value_counts())

print("\nMost Accessed Pages:")
print(df['URL'].value_counts())

print("\nHTTP Status Code Distribution:")
print(df['Status'].value_counts())

Requests per IP:
IP
192.168.1.5    2
127.0.0.1      1
10.0.0.10      1
Name: count, dtype: int64

Most Accessed Pages:
URL
/index.html    1
/login         1
/dashboard     1
/logout        1
Name: count, dtype: int64

HTTP Status Code Distribution:
Status
200    3
302    1
Name: count, dtype: int64
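The counts above use only one column at a time. As a possible next step, the sketch below combines columns: total bytes served per IP and the share of each status code. The small DataFrame is rebuilt from the four sample rows (without the Timestamp column) so the example runs on its own. The names bytes_per_ip and status_share are illustrative.

```python
import pandas as pd

# The four sample requests from Step 1, without the Timestamp column.
df = pd.DataFrame(
    [
        ("127.0.0.1", "GET", "/index.html", 200, 1043),
        ("192.168.1.5", "POST", "/login", 302, 512),
        ("10.0.0.10", "GET", "/dashboard", 200, 2048),
        ("192.168.1.5", "GET", "/logout", 200, 123),
    ],
    columns=["IP", "Method", "URL", "Status", "Size"],
)

# Sum the response sizes for each client, largest first.
bytes_per_ip = df.groupby("IP")["Size"].sum().sort_values(ascending=False)
print("Bytes per IP:")
print(bytes_per_ip)

# Fraction of requests per status code instead of raw counts.
status_share = df["Status"].value_counts(normalize=True)
print("\nStatus code share:")
print(status_share)
```

On the sample data, 10.0.0.10 leads with 2048 bytes even though 192.168.1.5 made the most requests (635 bytes in total), and status 200 accounts for 0.75 of all requests.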
