In [1]:
import sys
import os
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .appName("Article")\
        .getOrCreate()

### Question 1. Write PySpark code to generate the output below for the given input dataset (asked in service-based companies)

In [3]:
data = [(1, "Gaurav", "Pune, Bangalore, Hyderabad"),
       (2, "Rishabh", "Mumbai, Bangalore, Pune")]
cols = ["EmpId", "Name", "Locations"]
df = spark.createDataFrame(data, cols)

In [4]:
from pyspark.sql.functions import col, split, explode
# Split on a comma plus any following whitespace so the exploded values carry no leading space.
df.select(df.EmpId, df.Name, explode(split(df.Locations, ",\\s*")).alias("Locations")).show()

+-----+-------+---------+
|EmpId|   Name|Locations|
+-----+-------+---------+
|    1| Gaurav|     Pune|
|    1| Gaurav|Bangalore|
|    1| Gaurav|Hyderabad|
|    2|Rishabh|   Mumbai|
|    2|Rishabh|Bangalore|
|    2|Rishabh|     Pune|
+-----+-------+---------+
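To see what `split` followed by `explode` does to each row, here is a minimal plain-Python sketch that needs no Spark session. The helper name `split_and_explode` is hypothetical, and the pattern `,\s*` is an assumption chosen so that the values carry no leading space.

```python
import re

def split_and_explode(rows, sep_pattern=r",\s*"):
    """Plain-Python model of explode(split(column, pattern)) for (id, name, text) rows."""
    exploded = []
    for emp_id, name, locations in rows:
        # split(): turn the delimited string into a list of values
        for location in re.split(sep_pattern, locations):
            # explode(): emit one output row per list element
            exploded.append((emp_id, name, location))
    return exploded

data = [(1, "Gaurav", "Pune, Bangalore, Hyderabad"),
        (2, "Rishabh", "Mumbai, Bangalore, Pune")]

exploded_rows = split_and_explode(data)
for row in exploded_rows:
    print(row)
```

Passing a bare `","` as `sep_pattern` keeps the space after each comma, which is easy to miss in right-aligned `show()` output.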



### Question 2. Spark on Windows: what exactly is winutils, why do we need it, and how does it relate to PySpark?

- Hadoop requires native libraries on Windows to work properly. That includes accessing the file:// filesystem, where Hadoop uses some Windows APIs to implement POSIX-like file access permissions.
- This is implemented in HADOOP.DLL and WINUTILS.EXE. In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable.
- One known usage is running shell commands on Windows. You can find it in org.apache.hadoop.util.Shell; other modules depend on this class and use its methods.
- PySpark is an interface for Apache Spark in Python.
- It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
- PySpark supports most of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.
- PySpark drives a regular JVM-based Spark session, and Spark relies on Hadoop's filesystem classes for file access, so the winutils requirement typically applies to PySpark on Windows as well, for example when writing files to the local filesystem.
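As a rough illustration of the "must be locatable" requirement, the sketch below checks whether `winutils.exe` sits under `%HADOOP_HOME%\bin` before a Spark session is started. The helper name `find_winutils` is hypothetical, and the check assumes that `HADOOP_HOME` is how the location is configured. It is not Hadoop's own lookup logic.

```python
import os
from pathlib import Path

def find_winutils(environ=os.environ):
    """Return the path to winutils.exe under HADOOP_HOME, or None if it cannot be found."""
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        return None
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None

if os.name == "nt":  # the check only matters on Windows
    location = find_winutils()
    if location is None:
        print("winutils.exe not found: point HADOOP_HOME at a folder that contains bin\\winutils.exe")
    else:
        print(f"winutils.exe found at {location}")
```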

### Question 3. What's the difference between Hadoop and Spark?

- Apache Hadoop and Apache Spark are two open-source frameworks you can use to manage and process large volumes of data for analytics.
- Organizations must process data at scale and speed to gain real-time insights for business intelligence.
- Apache Hadoop allows you to cluster multiple computers to analyze massive datasets in parallel more quickly.
- Apache Spark uses in-memory caching and optimized query execution for fast analytic queries against data of any size.
- Spark is the newer technology: it keeps intermediate results in memory rather than writing them to disk between steps, and it ships with a machine learning library (MLlib) for AI/ML workloads.
- However, many companies use Spark and Hadoop together to meet their data analytics goals.

Distributed big data processing:
- Big data is collected frequently, continuously, and at scale in various formats.
- To store, manage, and process big data, Apache Hadoop separates datasets into smaller subsets or partitions.
- It then stores the partitions over a distributed network of servers.
- Likewise, Apache Spark processes and analyzes big data over distributed nodes to provide business insights.
- Depending on the use case, you might need to integrate both Hadoop and Spark with different software for optimum functionality.

Fault tolerance:
- Apache Hadoop continues to run even if one or several data processing nodes fail.
- It makes multiple copies of the same data block and stores them across several nodes.
- When a node fails, Hadoop retrieves the information from another node and prepares it for data processing.
- Meanwhile, Apache Spark relies on a data abstraction called the Resilient Distributed Dataset (RDD).
- With RDDs, Apache Spark remembers how it derived each dataset from storage (its lineage) and can reconstruct the data if the underlying storage fails.
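The lineage idea can be illustrated with a toy model in plain Python. `LineageDataset` is a hypothetical class written for this sketch, not a Spark API. It only records the source and the transformations, so a lost result can be rebuilt by replaying them.

```python
class LineageDataset:
    """Toy stand-in for an RDD: remembers its source and the steps applied to it."""

    def __init__(self, load_source, steps=()):
        self._load_source = load_source   # how to re-read the original data
        self._steps = tuple(steps)        # recorded transformations (the lineage)
        self._cached = None               # materialized result, may be lost

    def map(self, func):
        # Transformations are only recorded here, not executed.
        return LineageDataset(self._load_source, self._steps + (func,))

    def collect(self):
        if self._cached is None:
            # Rebuild from the source by replaying the recorded steps.
            data = list(self._load_source())
            for func in self._steps:
                data = [func(item) for item in data]
            self._cached = data
        return self._cached

    def lose_cache(self):
        # Simulate a failed node taking the computed data with it.
        self._cached = None

numbers = LineageDataset(lambda: range(1, 6)).map(lambda x: x * 10).map(lambda x: x + 1)
before = numbers.collect()
numbers.lose_cache()
after = numbers.collect()   # recomputed from the lineage
print(before == after, after)
```

Real RDDs track lineage per partition and recompute only the partitions that were lost. The toy above recomputes everything for brevity.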

Key components:

a. Hadoop components:
Apache Hadoop has four main components:
- Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across multiple computers. Together, these computers form a Hadoop cluster.
- Yet Another Resource Negotiator (YARN) schedules tasks and allocates resources to applications running on Hadoop.
- Hadoop MapReduce allows programs to break large data processing tasks into smaller ones and runs them in parallel on multiple servers.
- Hadoop Common, or Hadoop Core, provides the necessary software libraries for other Hadoop components.
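The MapReduce idea of breaking a job into small parallel tasks can be sketched with a single-process word count. The function names `map_phase`, `shuffle` and `reduce_phase` are illustrative assumptions. In Hadoop, the framework distributes these phases across servers.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values that share a key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

lines = ["spark and hadoop", "hadoop stores data", "spark processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends on only one line, and each call to `reduce_phase` on only one key. That independence is what lets the tasks run in parallel.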

b. Spark components:
Apache Spark runs with the following components:
- Spark Core coordinates the basic functions of Apache Spark, including memory management, data storage, task scheduling, and data processing.
- Spark SQL lets you query structured data with SQL or the DataFrame API.
- Spark Streaming and Structured Streaming allow Spark to process data streams in near real time, by default by splitting the incoming data into small batches (micro-batches).
- Machine Learning Library (MLlib) provides several machine learning algorithms that you can apply to big data.
- GraphX provides graph-parallel computation for analyzing graph-structured data.

### Question 4. Can you write a query to find the employee count under each manager?

In [5]:
data = [('4529', 'Nancy', 'Young', '4125'),
('4238','John', 'Simon', '4329'),
('4329', 'Martina', 'Candreva', '4125'),
('4009', 'Klaus', 'Koch', '4329'),
('4125', 'Mafalda', 'Ranieri', 'NULL'),
('4500', 'Jakub', 'Hrabal', '4529'),
('4118', 'Moira', 'Areas', '4952'),
('4012', 'Jon', 'Nilssen', '4952'),
('4952', 'Sandra', 'Rajkovic', '4529'),
('4444', 'Seamus', 'Quinn', '4329')]
schema = ['employee_id' ,'first_name', 'last_name', 'manager_id']

df = spark.createDataFrame(data=data, schema=schema)

In [7]:
df.createOrReplaceTempView("EMP")

In [11]:
spark.sql("""
            select e.manager_id as manager_id, 
            count(e.employee_id) as no_of_emp, m.first_name as manager_name from EMP e
            inner join EMP m        
            on m.employee_id = e.manager_id
            group by e.manager_id, m.first_name
            """).show()

+----------+---------+------------+
|manager_id|no_of_emp|manager_name|
+----------+---------+------------+
|      4125|        2|     Mafalda|
|      4329|        3|     Martina|
|      4529|        2|       Nancy|
|      4952|        2|      Sandra|
+----------+---------+------------+
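As a sanity check that does not depend on Spark, the self-join and grouping in the query above can be modeled in plain Python. This is only a sketch of the logic: the dictionary lookup plays the role of the `m` side of the join, and the variable names are illustrative.

```python
from collections import Counter

employees = [('4529', 'Nancy', 'Young', '4125'),
             ('4238', 'John', 'Simon', '4329'),
             ('4329', 'Martina', 'Candreva', '4125'),
             ('4009', 'Klaus', 'Koch', '4329'),
             ('4125', 'Mafalda', 'Ranieri', 'NULL'),
             ('4500', 'Jakub', 'Hrabal', '4529'),
             ('4118', 'Moira', 'Areas', '4952'),
             ('4012', 'Jon', 'Nilssen', '4952'),
             ('4952', 'Sandra', 'Rajkovic', '4529'),
             ('4444', 'Seamus', 'Quinn', '4329')]

# employee_id -> first_name: the "manager" side of the self-join
first_name_by_id = {emp_id: first for emp_id, first, _last, _mgr in employees}

# Inner join: keep only rows whose manager_id matches an existing employee_id,
# then count rows per manager_id (the GROUP BY).
reports = Counter(mgr for _id, _first, _last, mgr in employees if mgr in first_name_by_id)

manager_counts = sorted((mgr, n, first_name_by_id[mgr]) for mgr, n in reports.items())
for row in manager_counts:
    print(row)
```

The row with manager_id `'NULL'` drops out because no employee has that ID, which is the same reason the inner join in the SQL excludes it.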



In [13]:
result_df = df.alias("e").join(df.alias("m"), col("e.manager_id") == col("m.employee_id"), "inner") \
            .select(col("e.employee_id"), col("e.first_name"), col("e.last_name"), col("e.manager_id"), col("m.first_name").alias("manager_name"))
result_df.show()

+-----------+----------+---------+----------+------------+
|employee_id|first_name|last_name|manager_id|manager_name|
+-----------+----------+---------+----------+------------+
|       4529|     Nancy|    Young|      4125|     Mafalda|
|       4329|   Martina| Candreva|      4125|     Mafalda|
|       4238|      John|    Simon|      4329|     Martina|
|       4009|     Klaus|     Koch|      4329|     Martina|
|       4444|    Seamus|    Quinn|      4329|     Martina|
|       4500|     Jakub|   Hrabal|      4529|       Nancy|
|       4952|    Sandra| Rajkovic|      4529|       Nancy|
|       4118|     Moira|    Areas|      4952|      Sandra|
|       4012|       Jon|  Nilssen|      4952|      Sandra|
+-----------+----------+---------+----------+------------+



### Question 5. Write PySpark code to produce the output table below, with columns employeeid, default_number, total_entry, total_login, total_logout, latest_login, latest_logout.

In [21]:
'''
- The first step is to create two DataFrames called checkin_df and detail_df.
- The checkin_df DataFrame contains the following columns:
    a. employeeid: The employee ID
    b. entry_details: The type of entry (login or logout)
    c. timestamp_details: The timestamp of the entry
- The detail_df DataFrame contains the following columns:
    a. id: The employee ID
    b. phone_number: The employee's phone number
    c. isdefault: A flag indicating whether the employee is a default user
- The next step is to join the two DataFrames on the employee ID (checkin_df.employeeid and detail_df.id). This creates a new DataFrame called joined_df that contains all of the data from both DataFrames.
- The next step is to filter joined_df to only include rows where the isdefault column is equal to true. This ensures that we only consider default users in our analysis.
- The next step is to create separate aggregate DataFrames:
    a. total_entry_df: the total number of entries for each employee.
    b. total_login_df: the total number of logins for each employee.
    c. latest_login_df: the latest login timestamp for each employee.
    d. latest_logout_df: the latest logout timestamp for each employee.
- To create these DataFrames, we use the groupBy() and agg() functions. The groupBy() function groups the DataFrame by the employeeid column, and the agg() function calculates the total number of entries, total number of logins, and latest login timestamp for each group.
- The final step is to join these DataFrames together to create the final DataFrame. We use the join() function with the on and how arguments to join the DataFrames on the employeeid column.
'''

In [22]:
checkin_df = spark.createDataFrame([(1000, 'login', '2023-06-16 01:00:15.34'),
                                    (1000, 'login', '2023-06-16 02:00:15.34'),
                                    (1000, 'login', '2023-06-16 03:00:15.34'),
                                    (1000, 'logout', '2023-06-16 12:00:15.34'),
                                    (1001, 'login', '2023-06-16 01:00:15.34'),
                                    (1001, 'login', '2023-06-16 02:00:15.34'),
                                    (1001, 'login', '2023-06-16 03:00:15.34'),
                                    (1001, 'logout', '2023-06-16 12:00:15.34')],
                                   ["employeeid", "entry_details", "timestamp_details"])

detail_df = spark.createDataFrame([(1001, 9999, 'false'),
                                   (1001, 1111, 'false'),
                                   (1001, 2222, 'true'),
                                   (1003, 3333, 'false')],
                                  ["id", "phone_number", "isdefault"])

In [23]:
joined_df = checkin_df.join(detail_df, checkin_df.employeeid == detail_df.id)
joined_df.show()

+----------+-------------+--------------------+----+------------+---------+
|employeeid|entry_details|   timestamp_details|  id|phone_number|isdefault|
+----------+-------------+--------------------+----+------------+---------+
|      1001|        login|2023-06-16 01:00:...|1001|        9999|    false|
|      1001|        login|2023-06-16 01:00:...|1001|        1111|    false|
|      1001|        login|2023-06-16 01:00:...|1001|        2222|     true|
|      1001|        login|2023-06-16 02:00:...|1001|        9999|    false|
|      1001|        login|2023-06-16 02:00:...|1001|        1111|    false|
|      1001|        login|2023-06-16 02:00:...|1001|        2222|     true|
|      1001|        login|2023-06-16 03:00:...|1001|        9999|    false|
|      1001|        login|2023-06-16 03:00:...|1001|        1111|    false|
|      1001|        login|2023-06-16 03:00:...|1001|        2222|     true|
|      1001|       logout|2023-06-16 12:00:...|1001|        9999|    false|
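|      1001|       logout|2023-06-16 12:00:...|1001|        1111|    false|
|      1001|       logout|2023-06-16 12:00:...|1001|        2222|     true|
+----------+-------------+--------------------+----+------------+---------+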

In [24]:
joined_df = joined_df.where(joined_df["isdefault"] == 'true')
joined_df.show()

+----------+-------------+--------------------+----+------------+---------+
|employeeid|entry_details|   timestamp_details|  id|phone_number|isdefault|
+----------+-------------+--------------------+----+------------+---------+
|      1001|        login|2023-06-16 01:00:...|1001|        2222|     true|
|      1001|        login|2023-06-16 02:00:...|1001|        2222|     true|
|      1001|        login|2023-06-16 03:00:...|1001|        2222|     true|
|      1001|       logout|2023-06-16 12:00:...|1001|        2222|     true|
+----------+-------------+--------------------+----+------------+---------+



In [26]:
from pyspark.sql.functions import count
total_entry_df = joined_df.groupBy('employeeid').agg(count('*')
                                                     .alias('Total_entry'))
total_entry_df.show()

+----------+-----------+
|employeeid|Total_entry|
+----------+-----------+
|      1001|          4|
+----------+-----------+



In [27]:
total_login_df = joined_df.filter(joined_df['entry_details'] == 'login').groupBy('employeeid').agg(count('*').alias('total_login'))
total_login_df.show()

+----------+-----------+
|employeeid|total_login|
+----------+-----------+
|      1001|          3|
+----------+-----------+



In [30]:
from pyspark.sql.functions import first
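from pyspark.sql.functions import max as max_
# max() is deterministic; orderBy() followed by first() is not guaranteed to keep the sort order through groupBy().
latest_login_df = joined_df.filter(joined_df['entry_details'] == 'login').groupBy('employeeid').agg(max_('timestamp_details').alias('latest_login'))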
latest_login_df.show(truncate = False)

+----------+----------------------+
|employeeid|latest_login          |
+----------+----------------------+
|1001      |2023-06-16 03:00:15.34|
+----------+----------------------+



In [31]:
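from pyspark.sql.functions import max as max_
# max() is deterministic; orderBy() followed by first() is not guaranteed to keep the sort order through groupBy().
latest_logout_df = joined_df.filter(joined_df['entry_details'] == 'logout').\
            groupBy('employeeid').\
            agg(max_('timestamp_details').alias('latest_logout'))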
latest_logout_df.show()

+----------+--------------------+
|employeeid|       latest_logout|
+----------+--------------------+
|      1001|2023-06-16 12:00:...|
+----------+--------------------+



In [None]:
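# total_logout and default_number are asked for in the question but were not computed above.
total_logout_df = joined_df.filter(joined_df['entry_details'] == 'logout').groupBy('employeeid').agg(count('*').alias('total_logout'))
default_number_df = joined_df.select('employeeid', joined_df['phone_number'].alias('default_number')).distinct()

final_df = default_number_df.join(total_entry_df, on='employeeid', how='inner')
for part_df in (total_login_df, total_logout_df, latest_login_df, latest_logout_df):
    final_df = final_df.join(part_df, on='employeeid', how='inner')

final_df.show(truncate=False)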
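The question asks for employeeid, default_number, total_entry, total_login, total_logout, latest_login and latest_logout. As a cross-check on the expected row that does not depend on Spark, the plain-Python sketch below runs over the same input data. It assumes each employee has at most one phone number flagged `'true'`, and it compares timestamps as strings, which works only because they are zero-padded in a fixed format.

```python
checkins = [(1000, 'login', '2023-06-16 01:00:15.34'),
            (1000, 'login', '2023-06-16 02:00:15.34'),
            (1000, 'login', '2023-06-16 03:00:15.34'),
            (1000, 'logout', '2023-06-16 12:00:15.34'),
            (1001, 'login', '2023-06-16 01:00:15.34'),
            (1001, 'login', '2023-06-16 02:00:15.34'),
            (1001, 'login', '2023-06-16 03:00:15.34'),
            (1001, 'logout', '2023-06-16 12:00:15.34')]

details = [(1001, 9999, 'false'),
           (1001, 1111, 'false'),
           (1001, 2222, 'true'),
           (1003, 3333, 'false')]

# Keep only the default phone number per employee (the isdefault == 'true' filter).
default_number = {emp_id: phone for emp_id, phone, is_default in details if is_default == 'true'}

summary = {}
for emp_id, entry, ts in checkins:
    if emp_id not in default_number:
        continue  # inner join: employees without a default number drop out
    row = summary.setdefault(emp_id, {
        "employeeid": emp_id, "default_number": default_number[emp_id],
        "total_entry": 0, "total_login": 0, "total_logout": 0,
        "latest_login": None, "latest_logout": None})
    row["total_entry"] += 1
    row[f"total_{entry}"] += 1          # entry is 'login' or 'logout'
    latest_key = f"latest_{entry}"
    if row[latest_key] is None or ts > row[latest_key]:
        row[latest_key] = ts

expected_rows = list(summary.values())
print(expected_rows)
```

Employee 1000 has no row in the detail data and employee 1003 has no check-ins, so only employee 1001 appears in the result.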