# **To-Do Tasks**

- [ ] Create data frames for both datasets.

- [ ] Clean the data.

- [ ] Find the heaviest response sent to the client.

- [ ] Find the number of requests sent from various host machines.

- [ ] Find the host that sent the minimum and maximum number of requests.

- [ ] Find the rush hour per day.

- [ ] Find the user-requested KBDOC with its description.

- [ ] Find the most popular web page.

- [ ] Find the different types of web pages requested by users.

- [ ] Store the last output in Hive.



In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[4]").enableHiveSupport().getOrCreate()

25/06/23 04:46:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/23 04:46:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

In [3]:
spark

In [4]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

In [5]:
sch = StructType([
    StructField("Document_id", StringType(), True),
    StructField("Description", StringType(), True),
])

In [6]:
kblists = spark.read.csv("kblist.txt",sep=":",schema=sch, header=False)

In [7]:
kblists.describe().show()

                                                                                

+-------+-----------+--------------------+
|summary|Document_id|         Description|
+-------+-----------+--------------------+
|  count|        300|                 300|
|   mean|       null|                null|
| stddev|       null|                null|
|    min|KBDOC-00001|MeeToo 1.0 - Back...|
|    max|KBDOC-00300|  iFruit 5A - reboot|
+-------+-----------+--------------------+



In [8]:
kblists.show(5)

+-----------+--------------------+
|Document_id|         Description|
+-----------+--------------------+
|KBDOC-00087|Ronin Novelty Not...|
|KBDOC-00293|Ronin S2 - Batter...|
|KBDOC-00199|Titanic 2000 - Ch...|
|KBDOC-00211|MeeToo 5.1 - Chan...|
|KBDOC-00037|iFruit 2 - Change...|
+-----------+--------------------+
only showing top 5 rows



In [9]:
kblists = kblists.withColumn("mobiles", split(kblists["Description"], "-")[0]) \
    .withColumn("Issues", split(kblists["Description"], "-")[1])

In [10]:
kblists = kblists.drop("Description")

In [11]:
kblists.show()

+-----------+--------------------+--------------------+
|Document_id|             mobiles|              Issues|
+-----------+--------------------+--------------------+
|KBDOC-00087|Ronin Novelty Not...|       Back up files|
|KBDOC-00293|           Ronin S2 |        Battery Life|
|KBDOC-00199|       Titanic 2000 | Change the phone...|
|KBDOC-00211|         MeeToo 5.1 | Change the phone...|
|KBDOC-00037|           iFruit 2 | Change the phone...|
|KBDOC-00245|      Sorrento F31L |        Battery Life|
|KBDOC-00058|         MeeToo 1.0 |              reboot|
|KBDOC-00067|           iFruit 4 | Change the phone...|
|KBDOC-00116|          iFruit 3A |   Transfer Contacts|
|KBDOC-00164|       Titanic 4000 |   Transfer Contacts|
|KBDOC-00039|           iFruit 2 |       Back up files|
|KBDOC-00109|Ronin Novelty Not...| Change the phone...|
|KBDOC-00273|           Ronin S4 |       Back up files|
|KBDOC-00051|       Titanic 1000 |       Back up files|
|KBDOC-00156|      Sorrento F30L |            ov
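
The `mobiles` and `Issues` values above keep the spaces that surrounded the hyphen (for example `Ronin S2 ` and ` Battery Life`), because the description is split on a bare `-`. The following is a minimal plain-Python sketch of splitting on the ` - ` separator and trimming both parts. The helper name `split_description` is hypothetical and not part of the notebook; the sample value `iFruit 5A - reboot` is taken from the `describe()` output. In PySpark, `split` takes a regular expression, so a pattern that includes the surrounding whitespace should behave similarly, but that is not tested here.

```python
def split_description(description):
    """Split 'Model - Issue' into (model, issue) without surrounding spaces."""
    # partition() splits on the first " - " only, so hyphens that appear
    # later in the issue text are left untouched.
    model, sep, issue = description.partition(" - ")
    if not sep:
        # No separator found: keep the whole text as the model.
        return description.strip(), None
    return model.strip(), issue.strip()


sample = "iFruit 5A - reboot"
print(sample.split("-"))          # naive split keeps the spaces
print(split_description(sample))  # trimmed fields
```

The first `print` shows `['iFruit 5A ', ' reboot']`, which mirrors the padded values in the table; the second shows `('iFruit 5A', 'reboot')`.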

In [12]:
sch_logs = StructType([
                     StructField("IP_Address", StringType(), True),
                     StructField("placeholder", StringType(), True),
                     StructField("Timestamps", TimestampType(), True),
                     StructField("Request_method", StringType(), True),
                     StructField("URL", StringType(), True),
                     StructField("Method_type", StringType(), True),
                     StructField("Status_code", IntegerType(), True),
                     StructField("Response_size", IntegerType(), True),
                     StructField("refferrer", StringType(), True),
                     StructField("user_agent", StringType(), True)])

In [13]:
logs_str = spark.read.text("weblogs/2013-09-15.log")

In [14]:
import re
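
Before applying patterns to a whole DataFrame, it can help to prototype them on a single line with the `re` module. The sketch below is an illustration, not the notebook's method: the sample line is hypothetical and assumes a common access-log layout (the real `weblogs` file may differ), and the names `LOG_PATTERN` and `parse_line` are made up for this example. The group names loosely follow the fields in `sch_logs`.

```python
import re

# One pattern with named groups; every field name here is illustrative.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+)\s+'
    r'(?P<placeholder>.*?)\s+'             # everything between IP and "["
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)\s+'
    r'"(?P<referrer>[^"]*)"\s+"(?P<user_agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are converted so they can be aggregated later.
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields


# Hypothetical sample line (assumed layout, not copied from the dataset).
sample = (
    '192.0.2.10 - 128 [15/Sep/2013:23:58:36 +0100] '
    '"GET /KBDOC-00001.html HTTP/1.0" 200 14417 '
    '"http://www.example.com" "Example Browser 1.0"'
)
parsed = parse_line(sample)
print(parsed["url"], parsed["status"], parsed["size"])
```

Returning `None` for lines that do not match makes malformed records easy to count and inspect, rather than silently producing empty strings.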

In [15]:
log_rdd = logs_str.rdd

In [57]:
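# spark.read.text already returns a DataFrame with a single "value" column,
# so this round trip through the RDD is equivalent to using logs_str directly.
logs_df = spark.createDataFrame(log_rdd,["value"])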

                                                                                

In [62]:
request_pattern = r'"([A-Z]+)\s/([^.]+\.\w+)\s(\w+/\w+\.\w+)'  # method, URL, protocol
df_with_columns = (
    logs_df
    .withColumn("ip_addr", regexp_extract(col("value"), r'\d+\.\d+\.\d+\.\d+', 0))
    # "placeholder" is omitted: its pattern is truncated in the source notebook.
    .withColumn("timestamp", regexp_extract(col("value"), r'\[(\w+/\w+/\w+:\d+:\d+:\d+\s\+\d+)\]', 1))
    .withColumn("request_method", regexp_extract(col("value"), request_pattern, 1))
    .withColumn("url", regexp_extract(col("value"), request_pattern, 2))
    .withColumn("method_type", regexp_extract(col("value"), request_pattern, 3))
    # \w+ replaces the original bare w+, which only matched hosts beginning with "w".
    .withColumn("refferer", regexp_extract(col("value"), r'"(htt\w+://\w+\.\w+\.\w+)"', 1))
    # As in the original pattern, this assumes every user agent starts with "L".
    .withColumn("user_agent", regexp_extract(col("value"), r'"(L[^"]+)', 1))
)
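
The `timestamp` column extracted above is still a string, while the rush-hour task needs the hour of each request. The following is a minimal plain-Python sketch of that conversion and count. It assumes timestamps of the form `15/Sep/2013:23:58:36 +0100` (an abbreviated English month name and a numeric UTC offset, inferred from the pattern above rather than confirmed against the data); the sample values and the helper name `parse_log_timestamp` are hypothetical. In PySpark, `to_timestamp` with a matching format string would be the likely counterpart.

```python
from collections import Counter
from datetime import datetime


def parse_log_timestamp(text):
    """Parse an access-log timestamp into a timezone-aware datetime."""
    # %b expects an abbreviated month name in the current locale
    # (English names such as "Sep" work in the default C locale).
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


# Hypothetical sample timestamps, not taken from the dataset.
samples = [
    "15/Sep/2013:23:58:36 +0100",
    "15/Sep/2013:23:10:02 +0100",
    "15/Sep/2013:08:15:00 +0100",
]

# Count requests per (day, hour) and pick the busiest slot.
requests_per_hour = Counter(
    (ts.date(), ts.hour) for ts in map(parse_log_timestamp, samples)
)
busiest_slot, busiest_count = requests_per_hour.most_common(1)[0]
print(busiest_slot, busiest_count)
```

With these three samples, the busiest slot is hour 23 on 2013-09-15 with two requests.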