# ***Problem Statement:***

# **Assignment: Log Intelligence System**

Level 1: Foundational Log Processing

Level 2: Metadata Enrichment & Optimization

# **Background**

Your company operates several web-based services that generate large volumes of server logs. These logs are critical for understanding system performance, user behavior, and detecting operational issues. Additionally, business and product teams rely on enriched metrics for decision-making.


# **Level 1 Tasks: Log Intelligence Pipeline**
1. Load & Explore

  Ingest the log data and perform data profiling.

  A sample is provided; generate a log dataset of at least 100,000 rows.

  Connect to AWS S3 for data retrieval or storage.


2. Clean & Prepare

  Clean and standardize the dataset.

  Handle incorrect and missing values.

  Prepare the dataset for further processing.

3. Feature Engineering

  Derive new informative fields to enhance the dataset.
  Hint: hour_of_day, day_of_week, is_error (status >= 400), etc.

4. Analyze Usage Patterns and Insights

  Analyze traffic patterns, usage behavior, and performance trends.
  Hint: slowest APIs, most-hit endpoints, suspicious users detected.

5. Store & Summarize

  Store cleaned and processed data.

  Create summary outputs for further reporting
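
Task 1 asks for a generated dataset of at least 100,000 rows. The generator behind the uploaded file is not shown, so the following is only a minimal sketch of one way to produce such a file with the standard library. The column names come from the sample log shown later in this notebook; the value pools, the date range, the sequential `user_id` scheme, and the `null_rate` parameter are assumptions made for illustration.

```python
import csv
import random
import uuid
from datetime import datetime, timedelta

# Column names as they appear in the sample log.
COLUMNS = ["timestamp", "log_level", "user_id", "ip_address",
           "endpoint", "status_code", "response_time", "request_id"]
# Assumed value pools; adjust them to match the provided sample.
LOG_LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]
ENDPOINTS = ["/api/login", "/api/logout", "/api/register",
             "/api/update", "/api/delete"]


def generate_rows(n_rows=100_000, seed=42, null_rate=0.05):
    """Return n_rows synthetic log rows; each field is blanked with probability null_rate."""
    rng = random.Random(seed)
    start = datetime(2023, 1, 1)           # assumed start of the log window
    window_seconds = 2 * 365 * 24 * 3600   # assumed two-year window
    rows = []
    for i in range(1, n_rows + 1):
        ts = start + timedelta(seconds=rng.randrange(window_seconds))
        row = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%S"),
            "log_level": rng.choice(LOG_LEVELS),
            "user_id": f"user_{i}",
            "ip_address": ".".join(str(rng.randrange(1, 255)) for _ in range(4)),
            "endpoint": rng.choice(ENDPOINTS),
            "status_code": float(rng.randrange(100, 600)),
            "response_time": round(rng.uniform(0.05, 5.0), 3),
            "request_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        }
        # Blank out fields at random so the cleaning step has something to do.
        for name in COLUMNS:
            if rng.random() < null_rate:
                row[name] = ""
        rows.append(row)
    return rows


def write_csv(path, rows):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_csv("server_log_dataset.csv", generate_rows())` would produce a file with the same name as the one uploaded below. Fixing the seed keeps the output reproducible between runs.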


# **Level 2 Tasks: Metadata Enrichment & Optimization**
6. Enrichment

  Integrate the user_metadata dataset with the log dataset using user_id.
  Generate user_metadata data of size 5k.

7. Aggregated Analysis
  Perform grouped analysis using enriched fields.

  Analyze patterns across account types, regions, and activity status.

  Example: activity trends of Free vs. Premium users.

8. Optimization & Output Strategy

Apply appropriate optimization techniques to improve performance.

Organize output for efficient querying and reuse.

Structure your saved data so that subsets can be efficiently accessed without scanning the entire dataset
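
One common way to meet the last requirement is a partitioned directory layout, where each value of a chosen column gets its own `column=value` folder and a reader opens only the folder it needs. Spark's `DataFrameWriter.partitionBy` writes a comparable layout. The sketch below shows the idea with the standard library only. The function names, the `part-00000.csv` file name, and the `__UNKNOWN__` placeholder for missing values are hypothetical choices for illustration, not part of this notebook.

```python
import csv
import os
import tempfile
from collections import defaultdict


def partition_rows(rows, key):
    """Group rows by the value of `key`, keyed by a `key=value` directory name."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(key) or "__UNKNOWN__"  # hypothetical placeholder for missing values
        groups[f"{key}={value}"].append(row)
    return dict(groups)


def write_partitioned(rows, base_dir, key):
    """Write one CSV file per partition directory under base_dir."""
    for dirname, part in partition_rows(rows, key).items():
        target = os.path.join(base_dir, dirname)
        os.makedirs(target, exist_ok=True)
        # The partition value is stored in the path, so it is left out of the file.
        fields = [name for name in part[0] if name != key]
        with open(os.path.join(target, "part-00000.csv"), "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(part)


def read_partition(base_dir, key, value):
    """Read a single partition without touching the other directories."""
    path = os.path.join(base_dir, f"{key}={value}", "part-00000.csv")
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


# Illustrative rows only.
sample = [
    {"user_id": "user_1", "region": "Europe", "status_code": "463.0"},
    {"user_id": "user_4", "region": "Oceania", "status_code": "390.0"},
    {"user_id": "user_6", "region": None, "status_code": "309.0"},
]

with tempfile.TemporaryDirectory() as base:
    write_partitioned(sample, base, "region")
    europe = read_partition(base, "region", "Europe")
print(europe)
```

Partitioning pays off when queries usually filter on the partition column, such as region or account type. A column with very many distinct values, such as user_id, would create too many small files.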

In [1]:
from google.colab import files
uploaded = files.upload()

Saving server_log_dataset.csv to server_log_dataset.csv


In [2]:
%%writefile config.py
CONFIG = {
    "input_path": "server_log_dataset.csv",
    "output_path": "output/",
    "file_format": "csv"
}


Writing config.py


In [3]:
%%writefile ingestion.py
from config import CONFIG

def ingest_data(spark):
    return spark.read.option("header", True).csv(CONFIG["input_path"])

Writing ingestion.py


In [4]:
!pip install pyspark



In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Log Intelligence Pipeline").getOrCreate()

In [6]:
from ingestion import ingest_data

df_serverlog = ingest_data(spark)
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|               NULL|     INFO| user_2|  172.254.4.144|  /api/delete|      109.0|        1.927|4b15d36c-3b74-499...|
|2023-06-09T22:51:32|    ERROR| user_3|  192.15.237.65|  /api/logout|       NULL|        2.431|                NULL|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/up

In [7]:
%%writefile cleaning.py
from config import CONFIG

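def clean_data(spark):
    # For now this only re-reads the raw CSV; the actual cleaning steps
    # are applied interactively in the cells that follow.
    return spark.read.option("header", True).csv(CONFIG["input_path"])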

Writing cleaning.py


**Handling null values. Nulls in timestamp, ip_address, request_id, and user_id cannot sensibly be replaced with a value derived from the same column, so rows with a null in any of these columns are dropped.**
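
The rule can be sketched on a few rows in plain Python, independent of Spark. In PySpark the same rule is usually written as a single `DataFrame.dropna(subset=[...])` call. The request IDs below are hypothetical stand-ins, since the sample output truncates the real ones.

```python
# Columns that must be present for a row to be kept.
KEY_COLUMNS = ("timestamp", "user_id", "ip_address", "request_id")


def drop_missing_keys(rows, keys=KEY_COLUMNS):
    """Keep only rows in which every key column has a non-empty value."""
    return [
        row for row in rows
        if all(row.get(name) not in (None, "") for name in keys)
    ]


# Illustrative rows; the request_id values are hypothetical.
sample = [
    {"timestamp": "2023-02-22T19:28:34", "user_id": "user_1",
     "ip_address": "192.113.2.76", "request_id": "req-1"},
    {"timestamp": None, "user_id": "user_2",
     "ip_address": "172.254.4.144", "request_id": "req-2"},
    {"timestamp": "2023-06-09T22:51:32", "user_id": "user_3",
     "ip_address": "192.15.237.65", "request_id": None},
]

kept = drop_missing_keys(sample)
print([row["user_id"] for row in kept])
```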

In [8]:
from cleaning import clean_data

df_serverlog = clean_data(spark)
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|               NULL|     INFO| user_2|  172.254.4.144|  /api/delete|      109.0|        1.927|4b15d36c-3b74-499...|
|2023-06-09T22:51:32|    ERROR| user_3|  192.15.237.65|  /api/logout|       NULL|        2.431|                NULL|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/up

In [9]:
from pyspark.sql.functions import col

In [10]:
df_serverlog = df_serverlog.where(col("timestamp").isNotNull())

In [11]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2023-06-09T22:51:32|    ERROR| user_3|  192.15.237.65|  /api/logout|       NULL|        2.431|                NULL|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/lo

In [12]:
df_serverlog = df_serverlog.where(col("user_id").isNotNull())

In [13]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2023-06-09T22:51:32|    ERROR| user_3|  192.15.237.65|  /api/logout|       NULL|        2.431|                NULL|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/lo

In [14]:
df_serverlog = df_serverlog.where(col("ip_address").isNotNull())

In [15]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2023-06-09T22:51:32|    ERROR| user_3|  192.15.237.65|  /api/logout|       NULL|        2.431|                NULL|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/lo

In [16]:
df_serverlog = df_serverlog.where(col("request_id").isNotNull())

In [17]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/logout|      495.0|        1.314|1683c6a0-cacb-456...|
|2024-02-11T05:59:14|     WARN|user_10|   10.81.117.60|  /api/lo

**After dropping rows with nulls in timestamp, user_id, ip_address, or request_id, the nulls in the remaining columns are replaced with a representative value (most frequent, least frequent, average, or median), depending on what suits the column.**
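
As a rough illustration of two of these strategies, the sketch below fills a categorical column with its most frequent value and a numeric column with its rounded mean. The helper names and sample values are hypothetical. Unlike a grouped count that may also count a null group, `most_frequent` here ignores missing values when picking the replacement.

```python
from collections import Counter
from statistics import fmean


def most_frequent(rows, column):
    """Most common non-missing value of a column."""
    values = [row[column] for row in rows if row[column] is not None]
    return Counter(values).most_common(1)[0][0]


def rounded_mean(rows, column, ndigits=3):
    """Mean of the non-missing values of a numeric column, rounded."""
    values = [row[column] for row in rows if row[column] is not None]
    return round(fmean(values), ndigits)


def fill_missing(rows, column, replacement):
    """Return new rows with missing values in `column` replaced."""
    return [
        {**row, column: replacement} if row[column] is None else row
        for row in rows
    ]


# Illustrative rows only.
sample = [
    {"log_level": "ERROR", "response_time": 1.0},
    {"log_level": None, "response_time": 2.0},
    {"log_level": "INFO", "response_time": None},
    {"log_level": "ERROR", "response_time": 3.0},
]

filled = fill_missing(sample, "log_level", most_frequent(sample, "log_level"))
filled = fill_missing(filled, "response_time", rounded_mean(filled, "response_time"))
print(filled)
```

The mean is sensitive to outliers, so the median may be the safer choice for a skewed column such as response time.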

In [18]:
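# Note: round, sum, min, and max imported here shadow Python's built-ins of the same name.
from pyspark.sql.functions import avg, round, count, mean, median, mode, sum, when, min, max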

In [19]:
from pyspark.sql.functions import col, count

most_used_log_level = df_serverlog.groupBy("log_level").agg(
    count("*").alias("count")
).orderBy(col("count").desc()).first()[0]

In [20]:
print(most_used_log_level)

ERROR


In [21]:
df_serverlog = df_serverlog.fillna({'log_level' : most_used_log_level})

In [22]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/logout|      495.0|        1.314|1683c6a0-cacb-456...|
|2024-02-11T05:59:14|     WARN|user_10|   10.81.117.60|  /api/lo

In [23]:
most_used_endpoint = df_serverlog.groupBy("endpoint").agg(
    count("*").alias("count")
).orderBy(col("count").desc()).first()[0]

In [24]:
print(most_used_endpoint)

/api/delete


In [25]:
df_serverlog = df_serverlog.fillna({'endpoint' : most_used_endpoint})

In [26]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|       NULL|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|       NULL|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/logout|      495.0|        1.314|1683c6a0-cacb-456...|
|2024-02-11T05:59:14|     WARN|user_10|   10.81.117.60|  /api/lo

In [27]:
avg_status_code = df_serverlog.select(round(avg("status_code"))).collect()[0][0]

In [28]:
df_serverlog = df_serverlog.fillna({'status_code' : avg_status_code})

In [29]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+
|2023-02-22T19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|
|2024-11-26T03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|
|2023-06-18T23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0|        0.225|058ee99a-6a35-4b2...|
|2024-09-18T17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|         NULL|b427fa91-5bf8-4fa...|
|2023-12-08T11:24:56|    ERROR| user_9|   10.93.214.59|  /api/logout|      495.0|        1.314|1683c6a0-cacb-456...|
|2024-02-11T05:59:14|     WARN|user_10|   10.81.117.60|  /api/lo

In [30]:
mean_response_time = df_serverlog.select(round(mean("response_time"),3)).collect()[0][0]

In [31]:
df_serverlog = df_serverlog.fillna({'response_time' : mean_response_time})

In [32]:
df_serverlog.orderBy("status_code").show()

+-------------------+---------+----------+--------------+-------------+-----------+-------------+--------------------+
|          timestamp|log_level|   user_id|    ip_address|     endpoint|status_code|response_time|          request_id|
+-------------------+---------+----------+--------------+-------------+-----------+-------------+--------------------+
|2023-05-20T18:25:17|     INFO|user_70610| 10.217.42.230|   /api/login|      100.0|        1.931|69cc858a-2680-457...|
|2024-11-09T23:44:13|     WARN|user_11129|172.107.57.202|/api/register|      100.0|        3.479|3e8166cd-094f-42f...|
|2023-02-13T05:55:11|    ERROR|user_71952|172.88.177.186|   /api/login|      100.0|        2.523|7b52648c-e485-4dc...|
|2023-07-19T22:06:49|     WARN|user_18551| 10.128.67.191|  /api/update|      100.0|        3.479|1818cbe4-3a78-465...|
|2025-01-02T02:33:56|     WARN|user_72339| 172.229.23.61|/api/register|      100.0|        2.523|8b5531a4-aa1e-4a1...|
|2023-02-17T19:41:28|     INFO|user_11317|192.21

In [33]:
df_serverlog.count()

72864

**Cleaning the dataset. Status codes are assumed to be integers below 1000, and rows whose status code is not above 15% of that range (status_code / 1000 <= 0.15, i.e. 150 or lower) are dropped.**
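
The threshold is easier to see on a few concrete values. The sketch below applies the same comparison as the filter in the next cell; the function name, parameter names, and sample codes are made up for illustration.

```python
def passes_threshold(status_code, scale=1000, min_fraction=0.15):
    """True when status_code / scale is strictly greater than min_fraction."""
    return status_code / scale > min_fraction


# Illustrative status codes only.
codes = [100.0, 150.0, 151.0, 390.0, 463.0]
kept = [code for code in codes if passes_threshold(code)]
print(kept)  # 150.0 is dropped because the comparison is strict
```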

In [34]:
df_serverlog = df_serverlog.where(col("status_code")/1000 > 0.15)

In [35]:
df_serverlog.count()

62430

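**Adding the columns hour_of_the_day and day_of_the_week, plus is_error, which flags rows whose log_level is ERROR. Two further flags follow the assumptions used in the code: good_status is true for status codes of 400 or above, and good_response is true for response times of at most 2 seconds.**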
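
The derived columns can be sketched for a single row in plain Python. The thresholds mirror the conditions used in the notebook cells that follow (`status_code >= 400`, `response_time <= 2.0`); the function name and the `DAY_NAMES` tuple are hypothetical helpers.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def add_features(row):
    """Return a copy of the row with the derived columns added."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return {
        **row,
        "hour_of_the_day": ts.hour,
        "day_of_the_week": DAY_NAMES[ts.weekday()],
        "is_error": row["log_level"] == "ERROR",
        "good_status": row["status_code"] >= 400,
        "good_response": row["response_time"] <= 2.0,
    }


# First row of the sample log, with only the fields the features need.
row = {"timestamp": "2023-02-22T19:28:34", "log_level": "ERROR",
       "status_code": 463.0, "response_time": 1.411}
print(add_features(row))
```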

In [36]:
from pyspark.sql.functions import hour

df_serverlog = df_serverlog.withColumn("timestamp", df_serverlog["timestamp"].cast("timestamp"))
df_serverlog = df_serverlog.withColumn("hour_of_the_day", hour(df_serverlog["timestamp"]))

In [37]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+
|2023-02-22 19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|
|2024-11-26 03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|              3|
|2023-06-18 23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0|        0.225|058ee99a-6a35-4b2...|             23|
|2024-09-18 17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|        2.523|b427fa91-5bf8-4fa...|             17|
|2023-12-08 11:24:56|    ERROR| user_9|   10.93.214.59|  /api/logout|

In [38]:
from pyspark.sql.functions import date_format

df_serverlog = df_serverlog.withColumn("day_of_the_week", date_format(df_serverlog["timestamp"], "EEEE"))

In [39]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+
|2023-02-22 19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|
|2024-11-26 03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|              3|        Tuesday|
|2023-06-18 23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0|        0.225|058ee99a-6a35-4b2...|             23|         Sunday|
|2024-09-18 17:54:00|     INFO| user_6|172.126.189.228|  /api/update|      309.0|        2.523|b427fa91-5b

In [40]:
from pyspark.sql.functions import when

df_serverlog = df_serverlog.withColumn("is_error",when(df_serverlog['log_level'] == "ERROR", True).otherwise(False))

In [41]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|is_error|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+
|2023-02-22 19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|    true|
|2024-11-26 03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|              3|        Tuesday|   false|
|2023-06-18 23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0|        0.225|058ee99a-6a35-4b2...|             23|         Sunday|   false|
|2024-09-18 17:54:00|     INFO| user_6|172.126.189.2

In [42]:
df_serverlog = df_serverlog.withColumn("good_status",when(df_serverlog['status_code'] >=400, True).otherwise(False))

In [43]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|is_error|good_status|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+
|2023-02-22 19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|    true|       true|
|2024-11-26 03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|              3|        Tuesday|   false|      false|
|2023-06-18 23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0|        0.225|058ee99a-6a35-4b2...|             23|         Sunday|  

In [44]:
df_serverlog = df_serverlog.withColumn("good_response",when(df_serverlog['response_time'] <= 2.000, True).otherwise(False))

In [45]:
df_serverlog.show()

+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+
|          timestamp|log_level|user_id|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|is_error|good_status|good_response|
+-------------------+---------+-------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+
|2023-02-22 19:28:34|    ERROR| user_1|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|    true|       true|         true|
|2024-11-26 03:43:14|     WARN| user_4|  172.52.177.15|   /api/login|      390.0|         0.94|cd76fb1a-b0c0-4c2...|              3|        Tuesday|   false|      false|         true|
|2023-06-18 23:44:57|     INFO| user_5|   192.25.49.86|  /api/logout|      390.0

In [46]:
# Treat the slowest API as the endpoint with the highest average response_time.

slowest_api = df_serverlog.groupBy("endpoint").agg(
    avg("response_time").alias("mean_response_time")
).orderBy(col("mean_response_time").desc()).first()[0]  # sort by the average, not by endpoint name
print(slowest_api)

/api/update
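
The "slowest API" rule (highest average response time per endpoint) can be checked on a small sample in plain Python. The values are made up so that the slowest endpoint is not the alphabetically last one, which shows why the ordering column matters.

```python
from collections import defaultdict


def slowest_endpoint(rows):
    """Return (endpoint, mean_response_time) for the endpoint with the highest mean."""
    totals = defaultdict(lambda: [0.0, 0])  # endpoint -> [sum of times, request count]
    for row in rows:
        acc = totals[row["endpoint"]]
        acc[0] += row["response_time"]
        acc[1] += 1
    means = {endpoint: total / n for endpoint, (total, n) in totals.items()}
    return max(means.items(), key=lambda item: item[1])


# Illustrative values only.
sample = [
    {"endpoint": "/api/login", "response_time": 3.0},
    {"endpoint": "/api/login", "response_time": 1.0},
    {"endpoint": "/api/update", "response_time": 1.5},
    {"endpoint": "/api/delete", "response_time": 0.5},
]
print(slowest_endpoint(sample))
```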


In [47]:
# Treat a user as suspicious when a request has a response time below 1 second and a status code above 500 (null endpoints were already filled in above).

suspicious_user_id_df = df_serverlog.filter(col("status_code") > 500)

In [48]:
suspicious_user_id_df = suspicious_user_id_df.filter(col("response_time") < 1)

In [49]:
suspicious_user_id_df.select("user_id").show()

+--------+
| user_id|
+--------+
| user_33|
| user_54|
| user_64|
|user_113|
|user_139|
|user_158|
|user_183|
|user_194|
|user_221|
|user_257|
|user_269|
|user_296|
|user_305|
|user_324|
|user_343|
|user_365|
|user_377|
|user_409|
|user_503|
|user_573|
+--------+
only showing top 20 rows
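
The suspicious-user rule can be sketched as one function that also removes duplicates, so a user with several matching requests is listed once. The thresholds follow the notebook's filters; the function name, parameter names, and sample rows are hypothetical.

```python
def suspicious_user_ids(rows, min_status=500, max_response=1.0):
    """Distinct user IDs with at least one request where status > min_status
    and response_time < max_response, in first-seen order."""
    matches = (
        row["user_id"] for row in rows
        if row["status_code"] > min_status and row["response_time"] < max_response
    )
    return list(dict.fromkeys(matches))  # dict keys keep insertion order


# Illustrative rows only.
sample = [
    {"user_id": "user_a", "status_code": 503.0, "response_time": 0.4},
    {"user_id": "user_a", "status_code": 520.0, "response_time": 0.2},
    {"user_id": "user_b", "status_code": 503.0, "response_time": 1.8},
    {"user_id": "user_c", "status_code": 404.0, "response_time": 0.3},
    {"user_id": "user_d", "status_code": 500.0, "response_time": 0.5},
]
print(suspicious_user_ids(sample))
```

A single fast 5xx response is weak evidence on its own. Requiring a minimum number of matching requests per user would be a natural refinement.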



In [50]:
df_serverlog.write.option("header", True).mode("overwrite").csv(
    "output/cleaned_server_log_data"
)

In [51]:
suspicious_user_id_df.write.option("header", True).mode("overwrite").csv(
    "output/cleaned_suspicious_user_data"
)

In [52]:
server_log_summary_df = df_serverlog.summary()

In [53]:
server_log_summary_df.show()

+-------+---------+----------+-------------+-----------+------------------+-----------------+--------------------+-----------------+---------------+
|summary|log_level|   user_id|   ip_address|   endpoint|       status_code|    response_time|          request_id|  hour_of_the_day|day_of_the_week|
+-------+---------+----------+-------------+-----------+------------------+-----------------+--------------------+-----------------+---------------+
|  count|    62430|     62430|        62430|      62430|             62430|            62430|               62430|            62430|          62430|
|   mean|     NULL|      NULL|         NULL|       NULL| 439.4622777510812|2.524618789043812|                NULL| 11.4853275668749|           NULL|
| stddev|     NULL|      NULL|         NULL|       NULL|161.56784812341596|1.355815827281236|                NULL|6.912774143850606|           NULL|
|    min|    DEBUG|    user_1|    10.0.1.14|/api/delete|             151.0|             0.05|00008032-a958

In [54]:
server_log_summary_df.write.option("header", True).mode("overwrite").csv(
    "output/summary_server_log_data"
)

In [55]:
from google.colab import files
upload = files.upload()

Saving synthetic_user_metadata.csv to synthetic_user_metadata.csv
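
The problem statement asks for user metadata of size 5k, but the generator behind the uploaded file is not shown. The following is a hypothetical sketch of how such a file could be produced. The column names and category values are taken from the outputs in this notebook; the date range, the `null_rate` parameter, and the sequential `user_id` scheme are assumptions.

```python
import random
from datetime import date, timedelta

# Category values as they appear in the notebook's outputs.
ACCOUNT_TYPES = ["Free", "Premium", "Enterprise"]
REGIONS = ["North America", "South America", "Europe",
           "Asia", "Africa", "Oceania"]


def generate_user_metadata(n_users=5000, seed=7, null_rate=0.1):
    """Return n_users metadata rows; account_type and region are sometimes missing."""
    rng = random.Random(seed)
    first_day = date(2015, 1, 1)                        # assumed earliest signup
    span_days = (date(2024, 12, 31) - first_day).days   # assumed latest signup
    rows = []
    for i in range(1, n_users + 1):
        signup = first_day + timedelta(days=rng.randint(0, span_days))
        rows.append({
            "user_id": f"user_{i}",
            "account_type": None if rng.random() < null_rate else rng.choice(ACCOUNT_TYPES),
            "region": None if rng.random() < null_rate else rng.choice(REGIONS),
            "signup_date": signup.isoformat(),
            "is_active": rng.choice(["True", "False"]),
        })
    return rows


users = generate_user_metadata()
print(len(users), users[0]["user_id"])
```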


In [56]:
%%writefile config_user.py
CONFIG = {
    "input_path": "synthetic_user_metadata.csv",
    "output_path": "output/",
    "file_format": "csv"
}

Writing config_user.py


In [57]:
%%writefile ingestion_user.py
from config_user import CONFIG

def ingest_data(spark):
    return spark.read.option("header", True).csv(CONFIG["input_path"])

Writing ingestion_user.py


In [58]:
from ingestion_user import ingest_data

df_usermetadata = ingest_data(spark)
df_usermetadata.show()

+-------+------------+-------------+-----------+---------+
|user_id|account_type|       region|signup_date|is_active|
+-------+------------+-------------+-----------+---------+
| user_1|        Free|North America| 2016-09-10|     True|
| user_2|     Premium|      Oceania| 2018-05-03|    False|
| user_3|     Premium|       Europe| 2024-06-30|    False|
| user_4|  Enterprise|      Oceania| 2018-12-08|     True|
| user_5|     Premium|North America| 2024-10-26|     True|
| user_6|     Premium|         NULL| 2024-11-29|    False|
| user_7|     Premium|South America| 2015-03-17|    False|
| user_8|        Free|South America| 2016-04-14|     True|
| user_9|  Enterprise|South America| 2019-05-04|     True|
|user_10|  Enterprise|         Asia| 2019-05-06|     True|
|user_11|        NULL|North America| 2021-03-29|    False|
|user_12|  Enterprise|         NULL| 2024-01-28|     True|
|user_13|  Enterprise|         NULL| 2024-07-21|     True|
|user_14|        NULL|South America| 2024-01-24|    Fals

**Inner-joining the two datasets on user_id and carrying out the tasks from the problem statement.**
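
An inner join keeps only the rows whose key appears on both sides, so log rows without a matching metadata record are dropped. A minimal plain-Python sketch of that behaviour follows, assuming `user_id` is unique in the metadata; the function name and sample rows are illustrative.

```python
def inner_join(logs, users, key="user_id"):
    """Join each log row with its metadata row; unmatched log rows are dropped.
    Assumes `key` is unique within `users`."""
    lookup = {user[key]: user for user in users}
    return [
        {**log, **lookup[log[key]]}
        for log in logs
        if log[key] in lookup
    ]


# Illustrative rows only.
logs = [
    {"user_id": "user_1", "endpoint": "/api/register"},
    {"user_id": "user_4", "endpoint": "/api/login"},
    {"user_id": "user_99999", "endpoint": "/api/logout"},
]
users = [
    {"user_id": "user_1", "account_type": "Free"},
    {"user_id": "user_4", "account_type": "Enterprise"},
]

joined = inner_join(logs, users)
print(len(joined), joined[0])
```

If unmatched log rows should be kept for analysis, a left join would preserve them with empty metadata fields instead.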

In [59]:
df_joined = df_serverlog.join(df_usermetadata, on='user_id', how='inner')

In [60]:
df_joined.count()

3148

In [61]:
df_joined.show()

+-------+-------------------+---------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+------------+-------------+-----------+---------+
|user_id|          timestamp|log_level|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|is_error|good_status|good_response|account_type|       region|signup_date|is_active|
+-------+-------------------+---------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+------------+-------------+-----------+---------+
| user_1|2023-02-22 19:28:34|    ERROR|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|    true|       true|         true|        Free|North America| 2016-09-10|     True|
| user_4|2024-11-26 03:43:14|     WARN|  172.52.177.15|   /api/login

**Counting the joined log rows of active users per account type.
Inactive users and users with unknown status are not covered yet; covering them needs only a single change in the code.**

In [62]:
account_type_active_user = df_joined.groupBy("account_type").agg(
    count(when(col("is_active") == "True", True)).alias("active_users")
)

In [63]:
account_type_active_user.show()

+------------+------------+
|account_type|active_users|
+------------+------------+
|     Premium|         435|
|        NULL|         147|
|  Enterprise|         421|
|        Free|         432|
+------------+------------+
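
To cover inactive and unknown-status users as well, one option is to label every row by status and count all labels in a single pass. The sketch below does this in plain Python; the label names and function names are hypothetical, and `is_active` is compared as a string because the CSV was read without schema inference.

```python
from collections import Counter, defaultdict


def status_label(is_active):
    """Map the raw is_active string to a label; anything else counts as unknown."""
    if is_active == "True":
        return "active"
    if is_active == "False":
        return "inactive"
    return "unknown"


def activity_by_group(rows, group_key):
    """Count rows per status label within each value of group_key."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][status_label(row["is_active"])] += 1
    return {group: dict(counter) for group, counter in counts.items()}


# Illustrative rows only.
sample = [
    {"account_type": "Free", "is_active": "True"},
    {"account_type": "Free", "is_active": "False"},
    {"account_type": "Premium", "is_active": "True"},
    {"account_type": "Premium", "is_active": "True"},
    {"account_type": "Premium", "is_active": None},
    {"account_type": None, "is_active": "True"},
]
print(activity_by_group(sample, "account_type"))
```

The same function works for the region and signup-year breakdowns by passing a different `group_key`.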



**Counting the joined log rows of active users per region.
Inactive users and users with unknown status are not covered yet; covering them needs only a single change in the code.**

In [64]:
region_active_user = df_joined.groupBy("region").agg(
    count(when(col("is_active") == "True", True)).alias("active_users")
)

In [65]:
region_active_user.show()

+-------------+------------+
|       region|active_users|
+-------------+------------+
|       Europe|         221|
|       Africa|         207|
|         NULL|         136|
|North America|         209|
|South America|         212|
|      Oceania|         232|
|         Asia|         218|
+-------------+------------+



**Counting the joined log rows of active users per signup year.
Inactive users and users with unknown status are not covered yet; covering them needs only a single change in the code.**

In [66]:
from pyspark.sql.functions import year

df_joined = df_joined.withColumn("signup_year",year(df_joined["signup_date"]))

In [67]:
df_joined.show()

+-------+-------------------+---------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+------------+-------------+-----------+---------+-----------+
|user_id|          timestamp|log_level|     ip_address|     endpoint|status_code|response_time|          request_id|hour_of_the_day|day_of_the_week|is_error|good_status|good_response|account_type|       region|signup_date|is_active|signup_year|
+-------+-------------------+---------+---------------+-------------+-----------+-------------+--------------------+---------------+---------------+--------+-----------+-------------+------------+-------------+-----------+---------+-----------+
| user_1|2023-02-22 19:28:34|    ERROR|   192.113.2.76|/api/register|      463.0|        1.411|093c96b7-7dae-48e...|             19|      Wednesday|    true|       true|         true|        Free|North America| 2016-09-10|     True|       2016|
| user_4|2024-11-26 

In [68]:
signup_year_active_users = df_joined.groupBy("signup_year").agg(
    count(when(col("is_active") == "True" , True)).alias("signup_year_active_user")
).orderBy(col("signup_year").asc())

In [69]:
signup_year_active_users.show()

+-----------+-----------------------+
|signup_year|signup_year_active_user|
+-----------+-----------------------+
|       NULL|                    156|
|       2015|                    131|
|       2016|                    118|
|       2017|                    122|
|       2018|                    134|
|       2019|                    122|
|       2020|                    133|
|       2021|                    133|
|       2022|                    129|
|       2023|                    129|
|       2024|                    128|
+-----------+-----------------------+



In [70]:
account_type_active_user.write.option("header", True).mode("overwrite").csv(
    "output/account_type_active_user"
)

In [71]:
region_active_user.write.option("header", True).mode("overwrite").csv(
    "output/region_active_user"
)

In [72]:
signup_year_active_users.write.option("header", True).mode("overwrite").csv(
    "output/signup_year_active_user"
)

In [74]:
!zip -r /content/output.zip /content/output/

  adding: content/output/ (stored 0%)
  adding: content/output/summary_server_log_data/ (stored 0%)
  adding: content/output/summary_server_log_data/.part-00000-29145985-b928-4d92-9faf-360b464b6255-c000.csv.crc (stored 0%)
  adding: content/output/summary_server_log_data/_SUCCESS (stored 0%)
  adding: content/output/summary_server_log_data/._SUCCESS.crc (stored 0%)
  adding: content/output/summary_server_log_data/part-00000-29145985-b928-4d92-9faf-360b464b6255-c000.csv (deflated 37%)
  adding: content/output/cleaned_suspicious_user_data/ (stored 0%)
  adding: content/output/cleaned_suspicious_user_data/.part-00000-39b76bdc-ac6d-454a-86fe-41e0c192d732-c000.csv.crc (stored 0%)
  adding: content/output/cleaned_suspicious_user_data/.part-00001-39b76bdc-ac6d-454a-86fe-41e0c192d732-c000.csv.crc (stored 0%)
  adding: content/output/cleaned_suspicious_user_data/_SUCCESS (stored 0%)
  adding: content/output/cleaned_suspicious_user_data/part-00000-39b76bdc-ac6d-454a-86fe-41e0c192d732-c000.csv (d