# Apache Spark Project - Draft

# Introduction

Using this notebook as a template, complete the Apache Spark project according to the set assigned to you.

A few notes:

* Do not modify or delete the *markdown* paragraphs in this notebook unless the instructions clearly require it.
* Complete the existing paragraphs containing *code* as needed, following the instructions:
    - do not delete them
    - do not delete the instructions and code they contain
    - do not modify them unless the instructions explicitly say so
* You may add new paragraphs, both with code and with comments on that code (markdown)

# Set 4 – imdb-persons

## Two datasets

### `datasource1` – information about the most important people involved in individual films (1)

The data is in `TSV` format; the files have no header.

Fields in the file:

0. `tconst` – film identifier
1. `ordering` – sequence number of the person within the film
2. `nconst` – person identifier
3. `role` – the person's role in the film
4. `job` – job title (if applicable, otherwise `\N`)
5. `characters` – name of the character played by the person (if applicable, otherwise `\N`)
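As an illustration of the field list above, here is a minimal plain-Python sketch of parsing one `datasource1` line. The `Principal` type, the `parse_principal` helper and the sample line are hypothetical and made up for this example; only the field order and the `\N` marker come from the description.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    """One row of datasource1; field names follow the list above."""
    tconst: str
    ordering: int
    nconst: str
    role: str
    job: Optional[str]
    characters: Optional[str]


def none_if_missing(value: str) -> Optional[str]:
    """Map the `\\N` marker used in the files to None."""
    return None if value == r"\N" else value


def parse_principal(line: str) -> Principal:
    """Split a tab-separated line into its six fields."""
    tconst, ordering, nconst, role, job, characters = line.rstrip("\n").split("\t")
    return Principal(tconst, int(ordering), nconst, role,
                     none_if_missing(job), none_if_missing(characters))


# Hypothetical sample line, not taken from the real data.
sample = "tt0000001\t1\tnm0000001\tdirector\t\\N\t\\N"
row = parse_principal(sample)
print(row)
```

The same splitting logic can be passed to `map` on an RDD of text lines; the named tuple only makes the field positions easier to read.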

### `datasource4` – information about people involved in films (4)

The data is in `TSV` format; each file has a header.

Fields in the file:

0. `nconst` – person identifier
1. `primaryName` – the person's name (first and last name)
2. `birthYear` – year of birth
3. `deathYear` – year of death (`\N` if not applicable)
4. `primaryProfession` – the person's main professions
5. `knownForTitles` – identifiers of the films the person is known for



## Main mission

### Processing goal

For the four most popular professions, determine the three people who were involved in the largest number of films in that profession.
Films for which no "full cast" is defined are excluded from the calculation.

A film with a full cast is one that has:
- at least one actor (`role in (actor, actress, self)`),
- at least one director (`role = director`), and
- people performing at least two other roles of any kind.
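The three conditions above can be sketched as a small plain-Python predicate. This is only an illustration: `has_full_cast` and `ACTING_ROLES` are hypothetical names, and the sketch assumes that "two other roles" means two distinct role values other than the acting roles and `director`.

```python
ACTING_ROLES = {"actor", "actress", "self"}


def has_full_cast(roles):
    """Return True if the roles of one film satisfy the three conditions above."""
    distinct = set(roles)
    has_actor = bool(distinct & ACTING_ROLES)
    has_director = "director" in distinct
    # Assumption: "other roles" are counted as distinct role values.
    other_roles = distinct - ACTING_ROLES - {"director"}
    return has_actor and has_director and len(other_roles) >= 2


print(has_full_cast(["actor", "actress", "director", "writer", "producer"]))
print(has_full_cast(["actor", "director", "writer"]))
print(has_full_cast(["director", "writer", "producer", "editor"]))
```

The first call satisfies all three conditions; the second lacks a second "other" role and the third has no actor.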

The final result should contain the following attributes:
- `profession` – profession
- `primaryName` – the person's name
- `movies` – number of films

Suggested result schema
```
root
 |-- profession: string (nullable = false)
 |-- primaryName: string (nullable = true)
 |-- movies: long (nullable = false)
```

Notes
- Professions are the comma-separated components of `primaryProfession`, excluding the value `"miscellaneous"`
- The popularity of a profession is determined by how many people list that profession in `primaryProfession`
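The two notes above can be illustrated with a short plain-Python sketch. The helper names and the sample values are hypothetical; the sketch assumes a person counts at most once per profession.

```python
from collections import Counter


def split_professions(primary_profession):
    """Split the comma-separated value, dropping empty items and "miscellaneous"."""
    return [p for p in primary_profession.split(",")
            if p and p != "miscellaneous"]


def most_popular(profession_values, n=4):
    """Count how many people list each profession and return the n most common."""
    counts = Counter()
    for value in profession_values:
        # set() so that a person is counted once per profession (assumption)
        counts.update(set(split_professions(value)))
    return counts.most_common(n)


# Hypothetical primaryProfession values, one per person.
people = [
    "actor,producer",
    "actress",
    "actor,miscellaneous",
    "actor,writer",
    "actress,producer,actor",
    "actress",
]
popular = most_popular(people)
print(popular)
```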
 


## Side missions

### Mission 1
Analyze the data on deceased people and determine how many people lived a given number of years. State how many of them were actors and how many were directors.

The result should contain the following columns:
- `age` – number of years lived
- `persons` – number of people who lived the given number of years
- `actors` – number of actors (`primaryProfession` contains one of the values: `actor`, `actress`).
- `directors` – number of directors (`primaryProfession` contains the value `director`).
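A minimal plain-Python sketch of the aggregation described above, on made-up rows. It assumes that the age is `deathYear - birthYear` and that rows with a missing year (`\N`) are skipped; `age_statistics` is a hypothetical helper name.

```python
from collections import defaultdict


def age_statistics(persons):
    """persons: iterable of (birthYear, deathYear, primaryProfession) strings."""
    stats = defaultdict(lambda: {"persons": 0, "actors": 0, "directors": 0})
    for birth, death, professions in persons:
        if r"\N" in (birth, death):
            continue  # living person or unknown year: not part of this mission
        age = int(death) - int(birth)
        profs = set(professions.split(","))
        row = stats[age]
        row["persons"] += 1
        row["actors"] += int(bool(profs & {"actor", "actress"}))
        row["directors"] += int("director" in profs)
    return dict(stats)


# Hypothetical sample rows.
sample = [
    ("1900", "1970", "actor,director"),
    ("1910", "1980", "actress"),
    ("1950", r"\N", "actor"),
    ("1920", "1985", "director,writer"),
]
result = age_statistics(sample)
print(result)
```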

### Mission 2
Among the people who were born in the last century and lived more than 70 years, find the three who took part in the largest number of films. Determine in how many films they acted and how many of those films they directed.

The result should contain the following columns:
- `primaryName` – the person's name (first and last name)
- `birthYear` – year of birth
- `age` – age
- `filmCount` – number of films in which the person took part
- `filmCountAsActor` – number of films in which the person was an actor/actress (`role` in (`actor`, `actress`, `self`)).
- `FilmCountAsDirector` – number of films in which the person was the director (`role` = `director`).
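The selection described above can be sketched in plain Python as follows. Several points are assumptions rather than part of the task text: "last century" is read as 1901 to 2000, the age of living people is measured against a hypothetical `reference_year` parameter, and films are counted as distinct `tconst` values.

```python
ACTING_ROLES = {"actor", "actress", "self"}


def top_three_by_films(persons, credits, reference_year):
    """Rank eligible persons by the number of distinct films they took part in.

    persons: iterable of (nconst, primaryName, birthYear, deathYear),
             years as int, or None when unknown
    credits: iterable of (tconst, nconst, role)
    """
    eligible = {}
    for nconst, name, birth, death in persons:
        if birth is None or not 1901 <= birth <= 2000:
            continue
        age = (death if death is not None else reference_year) - birth
        if age > 70:
            eligible[nconst] = (name, birth, age)

    films, acted, directed = {}, {}, {}
    for tconst, nconst, role in credits:
        if nconst not in eligible:
            continue
        films.setdefault(nconst, set()).add(tconst)
        if role in ACTING_ROLES:
            acted.setdefault(nconst, set()).add(tconst)
        if role == "director":
            directed.setdefault(nconst, set()).add(tconst)

    ranked = sorted(films, key=lambda n: len(films[n]), reverse=True)[:3]
    return [
        (*eligible[n], len(films[n]), len(acted.get(n, ())), len(directed.get(n, ())))
        for n in ranked
    ]


# Hypothetical sample data.
persons = [
    ("nm1", "Person A", 1920, 1995),
    ("nm2", "Person B", 1950, None),
    ("nm3", "Person C", 1890, 1980),
    ("nm4", "Person D", 1960, 2000),
]
credits = [
    ("tt1", "nm1", "actor"), ("tt2", "nm1", "director"), ("tt3", "nm1", "writer"),
    ("tt1", "nm2", "actress"), ("tt1", "nm3", "actor"), ("tt4", "nm4", "actor"),
]
result = top_three_by_films(persons, credits, reference_year=2024)
print(result)
```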




In [None]:
Expected result for the main mission:
+----------+------------------+------+
|profession|       primaryName|movies|
+----------+------------------+------+
|     actor|Luis Eduardo Motoa|  3559|
|     actor|         Ronit Roy|  2602|
|     actor|       Dilip Joshi|  2385|
|   actress|Luz Stella Luengas|  3636|
|   actress| Rohini Hattangadi|  3240|
|   actress|        Kavita Lad|  3204|
|  producer|     Shobha Kapoor| 11833|
|  producer|       Ekta Kapoor|  8826|
|  producer| Valentin Pimstein|  6081|
|    writer|       Tony Warren|  6153|
|    writer|      Delia Fiallo|  6132|
|    writer|     Sampurn Anand|  5205|
+----------+------------------+------+

Destinations for the main mission results:
* Spark Core - RDD: HDFS directory /tmp/output1, files in SequenceFile format serialized with Pickle (saveAsPickleFile).
* Spark SQL - DataFrame: Delta Lake table output2.
* Spark SQL - Pandas API on Spark: file /tmp/output3.json in the local file system, in json format (json lines).
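For reference, "json lines" means one JSON object per line. The sketch below shows that layout with the standard `json` module only; it is not the Pandas API on Spark call the project requires. The helper names are hypothetical, the records come from the expected result above, and a temporary directory stands in for the real destination.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write each record as one JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read the file back, one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


records = [
    {"profession": "actor", "primaryName": "Luis Eduardo Motoa", "movies": 3559},
    {"profession": "actor", "primaryName": "Ronit Roy", "movies": 2602},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output3.json")  # stand-in for the real path
    write_json_lines(records, path)
    with open(path, encoding="utf-8") as fh:
        line_count = sum(1 for _ in fh)
    restored = read_json_lines(path)

print(line_count, restored)
```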

# Set 0 – template

**Note**

- The template does not comply with any of the project rules.
- It is not consistent in using the proper API within the individual parts
- The *main mission* task consists of counting words.

# Preliminary steps

Run the paragraph below to create the Spark context objects. Adjust these commands if necessary. Remember to include the required libraries.

In [2]:
%conda install delta-spark==3.0.0

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 23.9.0
  latest version: 24.11.0

Please update conda by running

    $ conda update -n base -c conda-forge conda

Or to minimize the number of packages updated during conda update use

     conda install conda=24.11.0



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - delta-spark==3.0.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2024.8.30  |       hbcca054_0         155 KB  conda-forge
    certifi-2024.8.30          |     pyhd8ed1ab_0         160 KB  conda-forge
    delta-spark-3.0.0          |     pyhd8ed1ab_0          23 KB  conda-forge
    py4j-0.10.9.7              |     pyhd8ed1ab_0         182 KB  conda-forge
    pyspark-3.5.3              |     pyhd8ed1ab_0       296.7 MB  conda-forge
    ---------------------------

In [1]:
from pyspark.sql import SparkSession
# from delta import *

# Spark session & context
spark = SparkSession.builder \
    .appName("zestaw4") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.databricks.delta.schema.autoMerge.enabled", "true") \
    .getOrCreate()
    
# spark = configure_spark_with_delta_pip(builder).getOrCreate()

sc = spark.sparkContext

In the paragraph below, complete the commands that define the individual variables.

Remember to:

* use these variables in all later code, in every part of the project. Treat them like parameters
* remove their values before the final submission of the project, so that nothing in the notebook could identify you as its author

In [2]:
# full path to the directory in the bucket that contains the `datasource1` and `datasource4` subdirectories
# with the source data
input_dir = "/home/jovyan/data" # DRAFT

In [2]:
!ls /home/jovyan/data

datasource1  datasource4


In [6]:
%conda list

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
alembic                   1.12.0             pyhd8ed1ab_0    conda-forge
altair                    5.1.2              pyhd8ed1ab_0    conda-forge
anyio                     4.0.0              pyhd8ed1ab_0    conda-forge
aom                       3.6.1                h59595ed_0    conda-forge
argon2-cffi               23.1.0             pyhd8ed1ab_0    conda-forge
argon2-cffi-bindings      21.2.0          py311h459d7ec_4    conda-forge
arrow                     1.3.0              pyhd8ed1ab_0    conda-forge
asttokens                 2.4.0              pyhd8ed1ab_0    conda-forge
async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
async_generator           1.10                       py_0    conda-forge
attrs         

Do not modify the paragraphs below. Run them and use the variables defined below as the parameters of your program.

In [3]:
# DO NOT CHANGE
# paths to the source data
datasource1_dir = input_dir + "/datasource1"
datasource4_dir = input_dir + "/datasource4"

# names and paths for the main mission results
# part 1 (Spark Core - RDD)
rdd_result_dir = "/tmp/output1"

# part 2 (Spark SQL - DataFrame)
df_result_table = "output2"

# part 3 (Pandas API on Spark)
ps_result_file = "/tmp/output3.json"

In [4]:
# DO NOT CHANGE
import os
def remove_file(file):
    if os.path.exists(file):
        os.remove(file)

remove_file("metric_functions.py")
remove_file("tools_functions.py")

In [5]:
# DO NOT CHANGE
import requests
r = requests.get("https://jankiewicz.pl/bigdata/metric_functions.py", allow_redirects=True)
open('metric_functions.py', 'wb').write(r.content)
r = requests.get("https://jankiewicz.pl/bigdata/tools_functions.py", allow_redirects=True)
open('tools_functions.py', 'wb').write(r.content)

3322

In [None]:
# DRAFT
from metric_functions import *
from tools_functions import *

In [15]:
# DO NOT CHANGE
%run metric_functions.py
%run tools_functions.py

The paragraphs below remove any leftovers from previous runs of this or other notebooks

In [9]:
# DO NOT CHANGE
# remove the destination for part 1 (Spark Core - RDD)
delete_dir(spark, rdd_result_dir)

Successfully deleted directory: /tmp/output1


In [10]:
# DO NOT CHANGE
# remove the destination for part 2 (Spark SQL - DataFrame)
drop_table(spark, df_result_table)

The table output2 does not exist.
Path file:/home/jovyan/spark-warehouse/output2 does not exist.


In [16]:
# DO NOT CHANGE
# remove the destination for part 3 (Pandas API on Spark)
remove_file(ps_result_file)

NameError: name 'remove_file' is not defined

In [4]:
# DO NOT CHANGE
spark

***Note!***

Run the paragraph below and check that the address of the *Apache Spark Application UI* is correct by running the test paragraph that follows.

If necessary, set the correct address of the *Apache Spark Application UI* yourself

In [18]:
# URL of the Apache Spark Application UI (REST API)
#
spark_ui_address = extract_host_and_port(spark, "http://localhost:4040")
spark_ui_address

'http://localhost:4040'

In [6]:
# test paragraph
test_metrics = get_current_metrics(spark_ui_address)
test_metrics

{'numTasks': 0,
 'numActiveTasks': 0,
 'numCompleteTasks': 0,
 'numFailedTasks': 0,
 'numKilledTasks': 0,
 'numCompletedIndices': 0,
 'executorDeserializeTime': 0,
 'executorDeserializeCpuTime': 0,
 'executorRunTime': 0,
 'executorCpuTime': 0,
 'resultSize': 0,
 'jvmGcTime': 0,
 'resultSerializationTime': 0,
 'memoryBytesSpilled': 0,
 'diskBytesSpilled': 0,
 'peakExecutionMemory': 0,
 'inputBytes': 0,
 'inputRecords': 0,
 'outputBytes': 0,
 'outputRecords': 0,
 'shuffleRemoteBlocksFetched': 0,
 'shuffleLocalBlocksFetched': 0,
 'shuffleFetchWaitTime': 0,
 'shuffleRemoteBytesRead': 0,
 'shuffleRemoteBytesReadToDisk': 0,
 'shuffleLocalBytesRead': 0,
 'shuffleReadBytes': 0,
 'shuffleReadRecords': 0,
 'shuffleWriteBytes': 0,
 'shuffleWriteTime': 0,
 'shuffleWriteRecords': 0}

# Part 1 - Spark Core (RDD)

## Side missions

Enter your solutions to the *side missions* in the paragraphs below only if you do **not** want the *main mission* to be graded. Otherwise you **MUST** leave them **empty**.

## Main mission

The paragraph below records the metrics before your *main mission* solution is run.

You do not need to run it while implementing the solution.

In [7]:
# DO NOT CHANGE
before_rdd_metrics = get_current_metrics(spark_ui_address)

In the paragraphs below, enter your **solution** to the *main mission* based on the *RDD API*.

Keep the performance of your processing in mind; the *RDD API* demands it.

Do not enter any code in the paragraphs below if you are using the *side missions*.

In [4]:
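# Load the text files and split each line on tabs
datasource1 = sc.textFile(datasource1_dir).map(lambda x: x.split("\t"))
# datasource4 files have a header; drop the row whose first field is the literal "nconst"
datasource4 = sc.textFile(datasource4_dir).map(lambda x: x.split("\t")).filter(lambda x: x[0] != "nconst")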


In [5]:
ds1_rdd = datasource1.map(
    lambda x: (
        x[0],
        x[2],
        x[3],
        "performer" if x[3] in {"actor", "actress", "self"} else x[3],
    )
)  # (tconst, nconst, role, role_normalized)

ds4_rdd = datasource4.map(
    lambda x: (x[0], x[1], x[4])
)  # (nconst, primaryName, primaryProfession)

In [8]:
ds1_rdd.collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 53 in stage 1.0 failed 1 times, most recent failure: Lost task 53.0 in stage 1.0 (TID 71) (d026eecc4cbb executor driver): TaskResultLost (result lost from block manager)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2844)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2780)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2779)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2779)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1242)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1242)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1242)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3048)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2971)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:984)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2438)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2463)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1046)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1045)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:195)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)


In [6]:
grouped_roles = (
    ds1_rdd.map(lambda x: (x[0], {x[3]}))  # Start with a set for roles
    .reduceByKey(lambda roles1, roles2: roles1.union(roles2))  # Merge sets of roles per key
    .filter(lambda x: {"performer", "director"}.issubset(x[1]) and len(x[1]) > 3)  # Apply filters
)


In [7]:
full_cast_movies = grouped_roles.map(lambda x: (x[0], "")) # tconst

In [9]:
full_cast_movies.collect()

[('tt0059226', ''),
 ('tt1265402', ''),
 ('tt1688757', ''),
 ('tt1691065', ''),
 ('tt2638816', ''),
 ('tt0532489', ''),
 ('tt2139596', ''),
 ('tt3417470', ''),
 ('tt5733554', ''),
 ('tt1211501', ''),
 ('tt3815310', ''),
 ('tt2989142', ''),
 ('tt0201707', ''),
 ('tt6411870', ''),
 ('tt0077915', ''),
 ('tt1067465', ''),
 ('tt1202522', ''),
 ('tt0105274', ''),
 ('tt0968834', ''),
 ('tt0055365', ''),
 ('tt10228366', ''),
 ('tt7417126', ''),
 ('tt9150880', ''),
 ('tt2372178', ''),
 ('tt0841309', ''),
 ('tt11117026', ''),
 ('tt5433028', ''),
 ('tt1518926', ''),
 ('tt0513167', ''),
 ('tt2204947', ''),
 ('tt10682728', ''),
 ('tt1471866', ''),
 ('tt0440766', ''),
 ('tt1429765', ''),
 ('tt1434453', ''),
 ('tt2621956', ''),
 ('tt0660626', ''),
 ('tt1087825', ''),
 ('tt3410480', ''),
 ('tt0152612', ''),
 ('tt10450066', ''),
 ('tt9545048', ''),
 ('tt10185196', ''),
 ('tt0966355', ''),
 ('tt1189498', ''),
 ('tt1068792', ''),
 ('tt4898734', ''),
 ('tt0563801', ''),
 ('tt0740867', ''),
 ('tt0279997', 

In [8]:
full_cast_roles = full_cast_movies.join(ds1_rdd.map(lambda x: (x[0], (x[1], x[2]))))
full_cast_roles_count = full_cast_roles.\
    map(lambda x: ((x[1][1][0], x[1][1][1]), 1)).reduceByKey(lambda a,b: a+b) # ((nconst, profession), movies)

In [17]:
full_cast_roles.collect()

[('tt1688757', ('', ('writer', 'writer'))),
 ('tt1688757', ('', ('writer', 'writer'))),
 ('tt1688757', ('', ('actor', 'performer'))),
 ('tt1688757', ('', ('actress', 'performer'))),
 ('tt1688757', ('', ('actress', 'performer'))),
 ('tt1688757', ('', ('actress', 'performer'))),
 ('tt1688757', ('', ('writer', 'writer'))),
 ('tt1688757', ('', ('director', 'director'))),
 ('tt1688757', ('', ('writer', 'writer'))),
 ('tt1688757', ('', ('producer', 'producer'))),
 ('tt1691065', ('', ('writer', 'writer'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('director', 'director'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('editor', 'editor'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('actor', 'performer'))),
 ('tt1691065', ('', ('cinematographer', 'cinematographer'))),
 ('tt0532489', ('', ('director', 'director'))),
 ('tt0532489', (

In [21]:
full_cast_roles_count.collect()

[(('nm1416800', 'actor'), 104),
 (('nm0489408', 'actress'), 258),
 (('nm0385543', 'actor'), 130),
 (('nm0000851', 'actor'), 58),
 (('nm3018310', 'actress'), 1),
 (('nm2981783', 'composer'), 65),
 (('nm0643608', 'producer'), 423),
 (('nm10504225', 'actress'), 1473),
 (('nm0290470', 'actor'), 26),
 (('nm0529414', 'actress'), 22),
 (('nm0437073', 'actress'), 1),
 (('nm0579489', 'director'), 580),
 (('nm2043752', 'actor'), 5),
 (('nm5999103', 'director'), 2),
 (('nm0907835', 'director'), 149),
 (('nm0283499', 'actor'), 178),
 (('nm2880902', 'director'), 18),
 (('nm10340011', 'cinematographer'), 2),
 (('nm1533304', 'cinematographer'), 45),
 (('nm3702570', 'composer'), 3),
 (('nm0466838', 'producer'), 89),
 (('nm9635755', 'actress'), 277),
 (('nm4872367', 'director'), 5),
 (('nm7649398', 'self'), 6),
 (('nm5391561', 'writer'), 2),
 (('nm0191159', 'actress'), 215),
 (('nm0644307', 'composer'), 1),
 (('nm1795428', 'actress'), 2),
 (('nm1700440', 'cinematographer'), 92),
 (('nm0000980', 'actor'

In [18]:
actor_data = ds4_rdd.flatMap(
    lambda row: [
        (profession, 1)
        for profession in row[2].split(",")
        if profession and profession != "miscellaneous"
    ]
)
top_professions = actor_data.reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], False)

top4_professions = top_professions.zipWithIndex().filter(lambda x: x[1] < 4).map(lambda x: x[0])

In [19]:
top4_professions.collect()

[('actor', 2259212),
 ('actress', 1354336),
 ('producer', 845163),
 ('writer', 641773)]

In [None]:
# actor_data.collect()
# actor_data_test = ds4_rdd.flatMap(lambda row: [(profession, row[0]) for profession in row[2].split(",") if profession and profession != "miscellaneous"])
# actor_data_test.filter(lambda x: x[0] == "").collect()

[]

In [None]:
# print(top_professions)

[('actor', 2259212), ('actress', 1354336), ('producer', 845163), ('writer', 641773)]


In [None]:
# top_professions_arr = [top_profession[0] for top_profession in top_professions]

In [None]:
# movies_per_person = full_cast_roles_count.filter(lambda x: x[0][1] in top_professions_arr).\
#     map(lambda x: (x[0][1], x[0][0], x[1]))

In [32]:
full_cast_roles_count_tx = full_cast_roles_count.map(lambda x: (x[0][1], (x[0][0], x[1])))

movies_per_person = top4_professions.\
    join(full_cast_roles_count_tx).\
    map(lambda x: (x[0], x[1][1][0], x[1][1][1], x[1][0]))

In [30]:
full_cast_roles_count_tx.collect()

[('actor', ('nm1416800', 104)),
 ('actress', ('nm0489408', 258)),
 ('actor', ('nm0385543', 130)),
 ('actor', ('nm0000851', 58)),
 ('actress', ('nm3018310', 1)),
 ('composer', ('nm2981783', 65)),
 ('producer', ('nm0643608', 423)),
 ('actress', ('nm10504225', 1473)),
 ('actor', ('nm0290470', 26)),
 ('actress', ('nm0529414', 22)),
 ('actress', ('nm0437073', 1)),
 ('director', ('nm0579489', 580)),
 ('actor', ('nm2043752', 5)),
 ('director', ('nm5999103', 2)),
 ('director', ('nm0907835', 149)),
 ('actor', ('nm0283499', 178)),
 ('director', ('nm2880902', 18)),
 ('cinematographer', ('nm10340011', 2)),
 ('cinematographer', ('nm1533304', 45)),
 ('composer', ('nm3702570', 3)),
 ('producer', ('nm0466838', 89)),
 ('actress', ('nm9635755', 277)),
 ('director', ('nm4872367', 5)),
 ('self', ('nm7649398', 6)),
 ('writer', ('nm5391561', 2)),
 ('actress', ('nm0191159', 215)),
 ('composer', ('nm0644307', 1)),
 ('actress', ('nm1795428', 2)),
 ('cinematographer', ('nm1700440', 92)),
 ('actor', ('nm0000980'

In [33]:
movies_per_person.collect()

[('producer', 'nm0643608', 423, 845163),
 ('producer', 'nm0466838', 89, 845163),
 ('producer', 'nm1303955', 83, 845163),
 ('producer', 'nm0002349', 69, 845163),
 ('producer', 'nm3658698', 2334, 845163),
 ('producer', 'nm2764478', 71, 845163),
 ('producer', 'nm1422179', 30, 845163),
 ('producer', 'nm4125375', 194, 845163),
 ('producer', 'nm3562882', 3, 845163),
 ('producer', 'nm1016998', 177, 845163),
 ('producer', 'nm1021806', 8, 845163),
 ('producer', 'nm0257426', 67, 845163),
 ('producer', 'nm2479179', 3, 845163),
 ('producer', 'nm0597362', 58, 845163),
 ('producer', 'nm1151555', 25, 845163),
 ('producer', 'nm0920709', 488, 845163),
 ('producer', 'nm1050754', 25, 845163),
 ('producer', 'nm1168599', 5, 845163),
 ('producer', 'nm4487543', 125, 845163),
 ('producer', 'nm0811569', 2, 845163),
 ('producer', 'nm2287334', 1, 845163),
 ('producer', 'nm0850862', 20, 845163),
 ('producer', 'nm2362265', 80, 845163),
 ('producer', 'nm2797682', 5, 845163),
 ('producer', 'nm0887234', 4, 845163),
 

In [34]:
ranked = movies_per_person.groupBy(lambda x: x[0]).\
    mapValues(lambda rows: sorted(rows, key=lambda x: x[2], reverse=True)[:3]).\
    flatMap(lambda x: x[1])
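The top-three-per-profession step above sorts every group in full. As a plain-Python illustration of the same selection, the sketch below uses `heapq.nlargest`, which keeps only the `n` best rows per group. The helper `top_n_per_group` is hypothetical, and the sample rows are a few `(profession, nconst, movies)` values taken from the intermediate outputs shown earlier, without the popularity column.

```python
import heapq
from collections import defaultdict


def top_n_per_group(rows, n=3):
    """rows: iterable of (profession, nconst, movies) tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    # nlargest returns the n rows with the highest movie count, best first
    return {
        profession: heapq.nlargest(n, members, key=lambda r: r[2])
        for profession, members in groups.items()
    }


rows = [
    ("producer", "nm0643608", 423),
    ("producer", "nm0466838", 89),
    ("producer", "nm1303955", 83),
    ("producer", "nm0002349", 69),
    ("producer", "nm3658698", 2334),
    ("actor", "nm1416800", 104),
    ("actor", "nm0385543", 130),
    ("actor", "nm0000851", 58),
    ("actor", "nm0290470", 26),
]
top = top_n_per_group(rows)
print(top)
```

In the RDD version a similar effect could be obtained with an aggregation that keeps a bounded list per key instead of `groupBy`, but that is a design choice for the solution's author rather than something the notebook prescribes.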


In [None]:
# ranked_arr = [movies_per_person.filter(lambda x: x[0] == profession).top(3, key=lambda x: x[2]) for profession in top_professions_arr]


In [None]:
# ranked = sc.parallelize(ranked_arr).flatMap(lambda arr: [a for a in arr])

In [35]:
ranked.collect()

[('producer', 'nm0438506', 11833, 845163),
 ('producer', 'nm0438471', 8826, 845163),
 ('producer', 'nm0683788', 6081, 845163),
 ('actor', 'nm0609391', 3559, 2259212),
 ('actor', 'nm0747172', 2602, 2259212),
 ('actor', 'nm1118516', 2385, 2259212),
 ('actress', 'nm0525123', 3636, 1354336),
 ('actress', 'nm0368990', 3240, 1354336),
 ('actress', 'nm2588610', 3204, 1354336),
 ('writer', 'nm0912726', 6153, 641773),
 ('writer', 'nm0275585', 6132, 641773),
 ('writer', 'nm2276735', 5205, 641773)]

In [38]:
semi_final_result = ranked.map(lambda x: (x[1], (x[0], x[2], x[3]))).\
    join(ds4_rdd.map(lambda x: (x[0], x[1]))).map(lambda x: (x[1][0][0], x[1][1], x[1][0][1], x[1][0][2]))

In [39]:
semi_final_result.collect()

[('writer', 'Delia Fiallo', 6132, 641773),
 ('producer', 'Ekta Kapoor', 8826, 845163),
 ('producer', 'Valentin Pimstein', 6081, 845163),
 ('actor', 'Luis Eduardo Motoa', 3559, 2259212),
 ('actor', 'Ronit Roy', 2602, 2259212),
 ('actor', 'Dilip Joshi', 2385, 2259212),
 ('writer', 'Sampurn Anand', 5205, 641773),
 ('producer', 'Shobha Kapoor', 11833, 845163),
 ('actress', 'Kavita Lad', 3204, 1354336),
 ('actress', 'Rohini Hattangadi', 3240, 1354336),
 ('writer', 'Tony Warren', 6153, 641773),
 ('actress', 'Luz Stella Luengas', 3636, 1354336)]

In [40]:
final_result_sorted = semi_final_result.sortBy(lambda x: (-x[3], -x[2])).map(lambda x: (x[0], x[1], x[2]))

In [None]:
# # Step 1: Add an index to the ranked RDD
# ranked_with_index = ranked.zipWithIndex().map(lambda x: (x[1], x[0]))  # (index, (role, id, count))

# # Step 2: Transform the ranked RDD while keeping the index
# final_result = ranked_with_index.map(lambda x: (x[1][1], (x[0], x[1][0], x[1][2]))) \
#     .join(ds4_rdd.map(lambda x: (x[0], x[1]))) \
#     .map(lambda x: (x[1][0][0], x[1][1], x[1][0][1], x[1][0][2]))  # (index, name, role, count)

# # Step 3: Sort by the original index to restore the order
# final_result_sorted = final_result.sortBy(lambda x: x[0]).map(lambda x: (x[2], x[1], x[3]))


In [41]:
final_result_sorted.collect()

[('actor', 'Luis Eduardo Motoa', 3559),
 ('actor', 'Ronit Roy', 2602),
 ('actor', 'Dilip Joshi', 2385),
 ('actress', 'Luz Stella Luengas', 3636),
 ('actress', 'Rohini Hattangadi', 3240),
 ('actress', 'Kavita Lad', 3204),
 ('producer', 'Shobha Kapoor', 11833),
 ('producer', 'Ekta Kapoor', 8826),
 ('producer', 'Valentin Pimstein', 6081),
 ('writer', 'Tony Warren', 6153),
 ('writer', 'Delia Fiallo', 6132),
 ('writer', 'Sampurn Anand', 5205)]

In [42]:
# Save the result as a pickle file
final_result_sorted.saveAsPickleFile(rdd_result_dir)

Py4JJavaError: An error occurred while calling o906.saveAsObjectFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/output1 already exists
	at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
	at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:299)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1091)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1089)
	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1062)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1027)
	at org.apache.spark.rdd.SequenceFileRDDFunctions.$anonfun$saveAsSequenceFile$1(SequenceFileRDDFunctions.scala:66)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
	at org.apache.spark.rdd.SequenceFileRDDFunctions.saveAsSequenceFile(SequenceFileRDDFunctions.scala:51)
	at org.apache.spark.rdd.RDD.$anonfun$saveAsObjectFile$1(RDD.scala:1629)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
	at org.apache.spark.rdd.RDD.saveAsObjectFile(RDD.scala:1629)
	at org.apache.spark.api.java.JavaRDDLike.saveAsObjectFile(JavaRDDLike.scala:579)
	at org.apache.spark.api.java.JavaRDDLike.saveAsObjectFile$(JavaRDDLike.scala:578)
	at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsObjectFile(JavaRDDLike.scala:45)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)


In [29]:
!ls /tmp/output1
!head /tmp/output1/part-00000

part-00000  part-00004	part-00008  part-00012	part-00016
part-00001  part-00005	part-00009  part-00013	part-00017
part-00002  part-00006	part-00010  part-00014	_SUCCESS
part-00003  part-00007	part-00011  part-00015
SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable      [���dbe{pĉp$!�  �      |�� ur [[BK�gg�7  xp   
nm0000158	Tom�K���nm0000159	Teri�K���nm0000166	Helen�K���OJones	1946	\N	actor,director,soundtrack	tt0106977,tt0477348,tt2398231,tt0443272�K���TLocklear	1961	\N	actress,producer,soundtrack	tt0380623,tt0119695,tt0103491,tt0087262�K���nm0000183	Traci�K���PPacino	1940	\N	actor,soundtrack,director	tt0099422,tt0068646,tt0070666,tt0072890�K���nm0000200	Bill�K���QPhillippe	1974	\N	actor,producer,director	tt0202677,tt0280707,tt0139134,tt0375679�K��e.uq ~   ����      ]�(�OPosey	1968	\N	actress,soundtrack,writer	tt0348150,tt0134084,tt0359013,tt0106677�K���PReeves	1964	\N	actor,producer,soundtrack	tt0111257,tt0133093,tt0102685,tt0234215�

Poniższy paragraf zapisuje metryki po uruchomieniu Twojego rozwiązania *misji głównej*. 

Nie musisz go uruchamiać podczas implementacji rozwiązania.

In [17]:
# NIE ZMIENIAĆ
after_rdd_metrics = get_current_metrics(spark_ui_address)

# Część 2 - Spark SQL (DataFrame)

## Misje poboczne

W ponizszych paragrafach wprowadź swoje rozwiązania *misji pobocznych*, o ile **nie** chcesz, aby oceniana była *misja główna*. W przeciwnym przypadku **KONIECZNIE** pozostaw je **puste**.  

## Misja główna 

Poniższy paragraf zapisuje metryki przed uruchomieniem Twojego rozwiązania *misji głównej*. 

Nie musisz go uruchamiać podczas implementacji rozwiązania.

In [18]:
# NIE ZMIENIAĆ
before_df_metrics = get_current_metrics(spark_ui_address)

In the paragraphs below, enter the **solution** to the *main mission* of your project, based on the *DataFrame API*.

Keep the performance of your processing in mind; the *DataFrame API* cannot "fix" everything.

Do not enter any code in the paragraphs below if you used the *side missions*.

In [32]:
from pyspark.sql.functions import col, explode, split, count, desc
# Load the data
datasource1 = spark.read.option("sep", "\t").csv(datasource1_dir, inferSchema=True)
datasource4 = spark.read.option("sep", "\t").csv(datasource4_dir, header=True, inferSchema=True)
datasource1 = datasource1.toDF("tconst", "ordering", "nconst", "role", "job", "characters")
datasource4 = datasource4.toDF("nconst", "primaryName", "birthYear", "deathYear", "primaryProfession", "knownForTitles")

In [None]:
from pyspark.sql.functions import collect_set, size, array_contains, row_number, when
from pyspark.sql import Window

normalized_roles = datasource1.withColumn(
    "normalized_role",
    when(col("role").isin("actor", "actress", "self"), "performer").otherwise(col("role"))
)

full_cast = normalized_roles.groupBy("tconst").agg(collect_set("normalized_role").alias("roles")).\
    filter(
        array_contains(col("roles"), "performer") & 
        array_contains(col("roles"), "director") & 
        (size(col("roles")) > 3)
    ).select("tconst")
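The filter above keeps a title when its set of distinct roles (with `actor`, `actress` and `self` merged into `performer`) contains both `performer` and `director` and has more than three entries. A minimal plain-Python sketch of the same predicate follows; the rows and the helper names `normalize` and `full_cast_titles` are made up for illustration and are not part of the notebook.

```python
# Toy (tconst, role) rows; identifiers and roles are invented for illustration.
rows = [
    ("tt1", "actor"), ("tt1", "actress"), ("tt1", "director"),
    ("tt1", "writer"), ("tt1", "producer"),
    ("tt2", "actor"), ("tt2", "director"), ("tt2", "writer"),
    ("tt3", "self"), ("tt3", "writer"), ("tt3", "producer"), ("tt3", "composer"),
]

PERFORMER_ROLES = {"actor", "actress", "self"}


def normalize(role):
    """Collapse the three acting roles into a single 'performer' label."""
    return "performer" if role in PERFORMER_ROLES else role


def full_cast_titles(rows):
    """Return the titles whose distinct normalized roles pass the filter."""
    roles_by_title = {}
    for tconst, role in rows:
        roles_by_title.setdefault(tconst, set()).add(normalize(role))
    return sorted(
        tconst
        for tconst, roles in roles_by_title.items()
        if "performer" in roles and "director" in roles and len(roles) > 3
    )


print(full_cast_titles(rows))
```

Only `tt1` passes: `tt2` has just three distinct roles and `tt3` has no director.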



In [None]:
# Join with datasource1 to keep only titles with a full cast
full_cast_roles = full_cast.join(datasource1, "tconst").select("tconst", "nconst", "role")

+---------+---------+---------------+
|   tconst|   nconst|           role|
+---------+---------+---------------+
|tt0000725|nm0226992|          actor|
|tt0000725|nm0366008|          actor|
|tt0000725|nm0000428|         writer|
|tt0000725|nm0005658|cinematographer|
|tt0000725|nm0567363|       director|
|tt0000861|nm0732651|        actress|
|tt0000861|nm0784407|          actor|
|tt0000861|nm0005658|cinematographer|
|tt0000861|nm0163559|          actor|
|tt0000861|nm0000428|       director|
|tt0000861|nm0456804|          actor|
|tt0000861|nm0910400|          actor|
|tt0000861|nm0253652|         writer|
|tt0000861|nm0642722|          actor|
|tt0000861|nm0940488|         writer|
|tt0000862|nm0386036|          actor|
|tt0000862|nm0264569|        actress|
|tt0000862|nm5289829|          actor|
|tt0000862|nm0511080|          actor|
|tt0000862|nm0878467|       director|
+---------+---------+---------------+
only showing top 20 rows



In [None]:
full_cast_roles_count = full_cast_roles.groupBy("nconst", "role").agg(count("tconst").alias("movies")).\
    select("nconst", col("role").alias("profession"), "movies")

+---------+---------------+------+
|   nconst|     profession|movies|
+---------+---------------+------+
|nm0577476|          actor|    16|
|nm0232704|        actress|     2|
|nm0852794|cinematographer|     2|
|nm0430756|         writer|    66|
|nm0354894|         writer|     7|
|nm0001273|        actress|    57|
|nm0222369|        actress|     2|
|nm0294571|        actress|    12|
|nm0746704|        actress|     4|
|nm0376221|       director|   102|
|nm0415405|        actress|     1|
|nm0319702|cinematographer|    26|
|nm0253296|       director|     7|
|nm0299343|         writer|    52|
|nm0221488|          actor|    69|
|nm0384716|        actress|    21|
|nm0006297|       composer|    37|
|nm0622404|          actor|    16|
|nm0518711|         writer|   193|
|nm0703642|       producer|    94|
+---------+---------------+------+
only showing top 20 rows



In [None]:
# Process datasource4: split the professions into separate rows
actor_data = datasource4.withColumn("profession", explode(split(col("primaryProfession"), ","))).\
    filter(col("profession") != "miscellaneous")

+---------+---------------+---------+---------+--------------------+--------------------+----------+
|   nconst|    primaryName|birthYear|deathYear|   primaryProfession|      knownForTitles|profession|
+---------+---------------+---------+---------+--------------------+--------------------+----------+
|nm0000001|   Fred Astaire|     1899|     1987|soundtrack,actor,...|tt0072308,tt00430...|soundtrack|
|nm0000001|   Fred Astaire|     1899|     1987|soundtrack,actor,...|tt0072308,tt00430...|     actor|
|nm0000002|  Lauren Bacall|     1924|     2014|  actress,soundtrack|tt0038355,tt01170...|   actress|
|nm0000002|  Lauren Bacall|     1924|     2014|  actress,soundtrack|tt0038355,tt01170...|soundtrack|
|nm0000003|Brigitte Bardot|     1934|       \N|actress,soundtrac...|tt0057345,tt00544...|   actress|
|nm0000003|Brigitte Bardot|     1934|       \N|actress,soundtrac...|tt0057345,tt00544...|soundtrack|
|nm0000003|Brigitte Bardot|     1934|       \N|actress,soundtrac...|tt0057345,tt00544...|  

In [None]:
# Most popular professions
top_professions = actor_data.groupBy("profession").agg(count("nconst").alias("count")).orderBy(desc("count")).limit(4)

+----------+-------+
|profession|  count|
+----------+-------+
|     actor|2259212|
|   actress|1354336|
|  producer| 845163|
|    writer| 641773|
+----------+-------+



In [None]:
movies_per_person = full_cast_roles_count.join(datasource4.select("nconst", "primaryName"), "nconst").join(top_professions, "profession")
window_spec = Window.partitionBy("profession").orderBy(desc("movies"))
ranked_movies_per_person2 = movies_per_person.withColumn(
    "rank", row_number().over(window_spec)
)
final_result = ranked_movies_per_person2.filter(col("rank") <= 3).\
    select("profession", "primaryName", "movies").orderBy("profession", desc("movies"))


In [None]:
final_result.show()

+----------+------------------+------+
|profession|       primaryName|movies|
+----------+------------------+------+
|     actor|Luis Eduardo Motoa|  3559|
|     actor|         Ronit Roy|  2602|
|     actor|       Dilip Joshi|  2385|
|   actress|Luz Stella Luengas|  3636|
|   actress| Rohini Hattangadi|  3240|
|   actress|        Kavita Lad|  3204|
|  producer|     Shobha Kapoor| 11833|
|  producer|       Ekta Kapoor|  8826|
|  producer| Valentin Pimstein|  6081|
|    writer|       Tony Warren|  6153|
|    writer|      Delia Fiallo|  6132|
|    writer|     Sampurn Anand|  5205|
+----------+------------------+------+
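The ranking above orders each profession by `movies` only, so when two people have the same count, `row_number` may place them in either order. A plain-Python sketch of top-N per group with an explicit tie-break is shown below; the rows and the helper name `top_n_per_group` are made up for illustration. In Spark, the analogous change would likely be a second ordering column in the window, for example `orderBy(desc("movies"), "nconst")`.

```python
from itertools import groupby

# Toy (profession, name, movies) rows; names and counts are invented.
rows = [
    ("actor", "B", 10), ("actor", "A", 10), ("actor", "C", 7), ("actor", "D", 12),
    ("writer", "X", 3), ("writer", "Y", 5),
]


def top_n_per_group(rows, n=3):
    """Keep the n rows with the most movies per profession, breaking ties by name."""
    # Sort by profession, then movies descending, then name ascending (the tie-break).
    ordered = sorted(rows, key=lambda r: (r[0], -r[2], r[1]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        result.extend(list(group)[:n])
    return result


for row in top_n_per_group(rows):
    print(row)
```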



In [None]:
# Save the results to a table
final_result.write.mode("overwrite").format("delta").saveAsTable(df_result_table)

The paragraph below records the metrics after running your *main mission* solution.

You do not need to run it while implementing the solution.

In [22]:
# DO NOT MODIFY
after_df_metrics = get_current_metrics(spark_ui_address)

# Part 3 - Pandas API on Spark

This part is a challenge, especially for those who do not program in Python every day or who have not used the Pandas API before.

Good luck!

## Side missions

In the paragraphs below, enter your *side mission* solutions, provided you do **not** want the *main mission* to be graded. Otherwise, you **MUST** leave them **empty**.

In [None]:
import pyspark.pandas as ps

# Mission 1
# Loading data
datasource1 = ps.read_csv(
    datasource1_dir,
    sep="\t",
    header=None,
    names=["tconst", "ordering", "nconst", "role", "job", "characters"],
)

datasource4 = ps.read_csv(datasource4_dir, sep="\t")

# Calculating age
datasource4["birthYear"] = datasource4["birthYear"].replace(r"\N", None)
datasource4["deathYear"] = datasource4["deathYear"].replace(r"\N", None)
datasource4_clean = datasource4.dropna(
    subset=["birthYear", "deathYear"]
)  # Filter the nulls

# print(datasource4_clean.head(50))
datasource4_clean["birthYear"] = datasource4_clean["birthYear"].astype("float")
datasource4_clean["deathYear"] = datasource4_clean["deathYear"].astype("float")
datasource4_clean["age"] = (datasource4_clean["deathYear"] - datasource4_clean["birthYear"]).astype(
    "float"
)

# Helper function to check if a profession exists in a comma-separated list
def contains_any(value, keywords):
    if value is None:
        return False
    professions = value.split(",")
    return int(any(keyword in professions for keyword in keywords))


# Flag actors and directors by exact match against the comma-separated professions.
# A substring test ("director" in x) would also match "assistant_director".
datasource4_clean["is_actor"] = datasource4_clean["primaryProfession"].apply(
    lambda x: int(contains_any(x, ["actor", "actress"]))
)

datasource4_clean["is_director"] = datasource4_clean["primaryProfession"].apply(
    lambda x: int(contains_any(x, ["director"]))
)

results1 = (
    datasource4_clean.groupby("age")
    .agg(
        persons=("nconst", "count"),
        actors=("is_actor", "sum"),
        directors=("is_director", "sum"),
    )
    .reset_index()
)

print(results1)


ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_com

ConnectionRefusedError: [Errno 111] Connection refused
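Because `primaryProfession` is a comma-separated list, a profession should be tested against the split tokens rather than against the raw string. The sketch below contrasts the two checks in plain Python; the helper name `has_profession` and the sample value are assumptions made for illustration.

```python
def has_profession(value, keywords):
    """Return True when any keyword is one of the comma-separated professions."""
    if not value:
        return False
    professions = value.split(",")
    return any(keyword in professions for keyword in keywords)


sample = "assistant_director,writer"

# Substring test: matches because "director" occurs inside "assistant_director".
print("director" in sample)

# Token test: no exact "director" entry in the list.
print(has_profession(sample, ["director"]))
```

The first check prints `True` and the second prints `False`, so the substring variant would count assistant directors as directors.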

In [None]:
# Mission 2
# 20th century lasted between 1901.01.01 and 2000.12.31
xx_century = datasource4_clean[
    (datasource4_clean['birthYear'] > 1900) & 
    (datasource4_clean['birthYear'] <= 2000) &
    (datasource4_clean['age'] > 70)
]

# Merge with ds1
merged = datasource1.merge(xx_century, on='nconst', how='inner')

# Calculate number of movies
merged['is_actor_in_film'] = merged['role'].apply(
    lambda x: int(x in ['actor', 'actress', 'self'])
)
merged['is_director_in_film'] = merged['role'].apply(
    lambda x: int(x == 'director')
)

results2 = merged.groupby(['nconst', 'primaryName', 'birthYear', 'age']).agg(
    filmCount=('tconst', 'count'),
    filmCountAsActor=('is_actor_in_film', 'sum'),
    filmCountAsDirector=('is_director_in_film', 'sum')
).reset_index()

# Top3
results2_top3 = results2.sort_values('filmCount', ascending=False).head(3)

print(results2_top3)

## Main mission

The paragraph below records the metrics before running your *main mission* solution.

You do not need to run it while implementing the solution.

In [23]:
# DO NOT MODIFY
before_ps_metrics = get_current_metrics(spark_ui_address)

In the paragraphs below, enter the **solution** to your project based on the *Pandas API on Spark*.

Keep the performance of your processing in mind; the *Pandas API on Spark* cannot "fix" everything.

Do not enter any code in the paragraphs below if you used the *side missions*.

In [5]:
import pyspark.pandas as ps

datasource1 = ps.read_csv(datasource1_dir, sep="\t", header=None, names=["tconst", "ordering", "nconst", "role", "job", "characters"])
datasource4 = ps.read_csv(datasource4_dir, sep="\t", header=0)



In [5]:
datasource1["normalized_role"] = datasource1["role"].apply(
    lambda x: "performer" if x in ["actor", "actress", "self"] else x
)

In [None]:
grouped_roles = datasource1.groupby("tconst")["normalized_role"].apply(list).reset_index()
grouped_roles["unique_roles"] = grouped_roles["normalized_role"].apply(lambda roles: list(set(roles)))
grouped_roles["role_count"] = grouped_roles["unique_roles"].apply(len)

filtered_roles = grouped_roles[
    grouped_roles["unique_roles"].apply(lambda roles: "performer" in roles and "director" in roles) &
    (grouped_roles["role_count"] > 3)
]


ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt



KeyboardInterrupt



In [None]:
full_cast_tconsts = filtered_roles["tconst"]


In [None]:
full_cast_roles = datasource1[datasource1["tconst"].isin(full_cast_tconsts)][["tconst", "nconst", "role"]]


In [None]:
full_cast_roles_count = full_cast_roles.groupby(["nconst", "role"]).size().reset_index(name="movies")
full_cast_roles_count = full_cast_roles_count.rename(columns={"role": "profession"})


In [None]:
datasource4["primaryProfession"] = datasource4["primaryProfession"].fillna("")
actor_data = datasource4.assign(profession=datasource4["primaryProfession"].str.split(",")).explode("profession")
actor_data = actor_data[actor_data["profession"] != "miscellaneous"]

In [None]:
top_professions = actor_data["profession"].value_counts().head(4).index.tolist()


In [None]:
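# Keep only roles that are among the top professions, then attach names.
# Merging with actor_data on "nconst" would duplicate the "profession" and
# "primaryName" columns (suffixes _x/_y) and break the groupby below.
movies_per_person = full_cast_roles_count[
    full_cast_roles_count["profession"].isin(top_professions)
].merge(datasource4[["nconst", "primaryName"]], on="nconst")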


In [None]:
movies_per_person["rank"] = movies_per_person.groupby("profession")["movies"].rank(method="first", ascending=False)


In [None]:
final_result = movies_per_person[movies_per_person["rank"] <= 3][["profession", "primaryName", "movies"]]
final_result = final_result.sort_values(by=["profession", "movies"], ascending=[True, False]).reset_index(drop=True)


In [None]:
print(final_result.to_pandas())  # Convert to pandas for better display if needed


----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 44620)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "/opt/conda/lib/python3.

ConnectionRefusedError: [Errno 111] Connection refused

In [None]:
# The result has only a few rows, so collect it to pandas and write a single JSON file
final_result.to_pandas().to_json(ps_result_file, orient='index')

The paragraph below records the metrics after running your *main mission* solution.

You do not need to run it while implementing the solution.

In [30]:
# DO NOT MODIFY
after_ps_metrics = get_current_metrics(spark_ui_address)

# Analysis of results and performance of the *main missions*

## Part 1 - Spark Core (RDD)

In [31]:
# Load the results from the pickle file
word_counts = sc.pickleFile(rdd_result_dir)

# Show the first 50 elements
result_sample = word_counts.take(50)
for item in result_sample:
    print(item)

('TeriyakiApps\x01~\x01teriyakiapps@gmail.com\x018589934594', 1)
('Koza\x01http://www.xynapse.pl\x01xynapse@xynapse.pl\x018589934595', 1)
('Tools\x01https://vcb30cb43.app-ads-txt.com/app-ads.txt\x01androtools222@gmail.com\x018589934596', 1)
('Muslim', 110)
('FireFlies', 2)
('Studio\x01~\x01manuariza95@gmail.com\x018589934602', 1)
('News', 494)
('IST-Development\x01https://istanbulit.com\x01info@istanbulit.com\x018589934604', 1)
('FAStuidoTI\x01~\x01karimkhalfy@gmail.com\x018589934605', 1)
('Web4Minds,', 1)
('V3', 8)
('Smart', 2437)
('Ltd\x01http://www.v3smarttech.com\x01support@v3smarttech.com\x018589934607', 1)
('Mobil', 143)
('UNDERSCORE:', 1)
('Apps', 6350)
('and', 5289)
('Games\x01~\x01ergamesapps@gmail.com\x018589934609', 1)
('tamapps\x01~\x01zakdermeister@gmail.com\x018589934614', 1)
('S.', 397)
('Connect', 331)
('Team\x01https://mewe.com/join/klwpdevelopersteam\x01designcorpviti@gmail.com\x018589934618', 1)
('for', 2565)
('with', 262)
('NETWORKS', 23)
('PTE', 226)
('Art\x01https

In [32]:
subtract_metrics(after_rdd_metrics, before_rdd_metrics)

{'numTasks': 6,
 'numActiveTasks': 0,
 'numCompleteTasks': 6,
 'numFailedTasks': 0,
 'numKilledTasks': 0,
 'numCompletedIndices': 6,
 'executorDeserializeTime': 763,
 'executorDeserializeCpuTime': 288417800,
 'executorRunTime': 52789,
 'executorCpuTime': 3791290300,
 'resultSize': 12143,
 'jvmGcTime': 1808,
 'resultSerializationTime': 19,
 'memoryBytesSpilled': 0,
 'diskBytesSpilled': 0,
 'peakExecutionMemory': 0,
 'inputBytes': 84276905,
 'inputRecords': 1179547,
 'outputBytes': 90624535,
 'outputRecords': 14566,
 'shuffleRemoteBlocksFetched': 0,
 'shuffleLocalBlocksFetched': 9,
 'shuffleFetchWaitTime': 0,
 'shuffleRemoteBytesRead': 0,
 'shuffleRemoteBytesReadToDisk': 0,
 'shuffleLocalBytesRead': 49730906,
 'shuffleReadBytes': 49730906,
 'shuffleReadRecords': 228,
 'shuffleWriteBytes': 49730906,
 'shuffleWriteTime': 405851800,
 'shuffleWriteRecords': 228}

## Part 2 - Spark SQL (DataFrame)

In [33]:
df = spark.table(df_result_table)

# Show the first 50 records
df.show(50)

+-------------------------+-----+
|                     word|count|
+-------------------------+-----+
|                      The| 9372|
|                   Bidhee|    7|
|                Solutions| 6041|
|                   ArtAce|    2|
|                  PuyTech|    1|
|                   McLeod|  208|
|                      RTV|   13|
|     Softwarehttp://p...|    1|
|紫荊雜誌社https://bau...|    1|
|                  Bacilio|    2|
|     Developerhttps:/...|    1|
|     Softwarehttp://w...|    1|
|                  Backend|   13|
|하이퍼펌프~hyper.cho...|    1|
|                    METRO|   21|
|     ADBANDhttp://www...|    1|
|                      Tcf|    1|
|                      Pug|   12|
|              Techologies|    4|
|     Tourismhttps://t...|    1|
|     Kinsale~gourmet...|    1|
|     Englishhttps://w...|    1|
|                    Darul|   10|
|                       📱|    3|
|                  Panipat|    2|
|     Konyukhovhttp://...|    1|
|                     Bol

In [34]:
subtract_metrics(after_df_metrics, before_df_metrics)

{'numTasks': 12,
 'numActiveTasks': 0,
 'numCompleteTasks': 8,
 'numFailedTasks': 0,
 'numKilledTasks': 0,
 'numCompletedIndices': 8,
 'executorDeserializeTime': 1254,
 'executorDeserializeCpuTime': 446474600,
 'executorRunTime': 54900,
 'executorCpuTime': 22626614900,
 'resultSize': 36185,
 'jvmGcTime': 2428,
 'resultSerializationTime': 110,
 'memoryBytesSpilled': 0,
 'diskBytesSpilled': 0,
 'peakExecutionMemory': 440400752,
 'inputBytes': 84344235,
 'inputRecords': 1179547,
 'outputBytes': 50321941,
 'outputRecords': 1456441,
 'shuffleRemoteBlocksFetched': 0,
 'shuffleLocalBlocksFetched': 16,
 'shuffleFetchWaitTime': 0,
 'shuffleRemoteBytesRead': 0,
 'shuffleRemoteBytesReadToDisk': 0,
 'shuffleLocalBytesRead': 63406957,
 'shuffleReadBytes': 63406957,
 'shuffleReadRecords': 1622698,
 'shuffleWriteBytes': 63406957,
 'shuffleWriteTime': 817119700,
 'shuffleWriteRecords': 1622698}

## Part 3 - Pandas API on Spark

In [15]:
import json

# Read the contents of the JSON file
with open(ps_result_file, 'r') as file:
    json_content = json.load(file)

# Print the contents
print(json.dumps(json_content, indent=2))

FileNotFoundError: [Errno 2] No such file or directory: 'D:/studia/Semestr 7/Big Data/laby/projekt2/output3.json'

In [36]:
subtract_metrics(after_ps_metrics, before_ps_metrics)

{'numTasks': 33,
 'numActiveTasks': 0,
 'numCompleteTasks': 25,
 'numFailedTasks': 0,
 'numKilledTasks': 0,
 'numCompletedIndices': 25,
 'executorDeserializeTime': 1838,
 'executorDeserializeCpuTime': 440241100,
 'executorRunTime': 166601,
 'executorCpuTime': 55323279000,
 'resultSize': 134363,
 'jvmGcTime': 4753,
 'resultSerializationTime': 123,
 'memoryBytesSpilled': 0,
 'diskBytesSpilled': 0,
 'peakExecutionMemory': 427817888,
 'inputBytes': 385819487,
 'inputRecords': 5409845,
 'outputBytes': 0,
 'outputRecords': 0,
 'shuffleRemoteBlocksFetched': 0,
 'shuffleLocalBlocksFetched': 20,
 'shuffleFetchWaitTime': 0,
 'shuffleRemoteBytesRead': 0,
 'shuffleRemoteBytesReadToDisk': 0,
 'shuffleLocalBytesRead': 61239298,
 'shuffleReadBytes': 61239298,
 'shuffleReadRecords': 1573467,
 'shuffleWriteBytes': 61239298,
 'shuffleWriteTime': 1111152100,
 'shuffleWriteRecords': 1573467}
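To compare the three runs side by side, the run-time and CPU-time values from the three `subtract_metrics` outputs can be put into one small table. The sketch below is only an illustration: the helper name `summarize` is hypothetical, and it assumes that `executorRunTime` is reported in milliseconds and `executorCpuTime` in nanoseconds, as in Spark's task metrics.

```python
# Values copied from the three subtract_metrics outputs in this section.
metrics = {
    "RDD": {"executorRunTime": 52789, "executorCpuTime": 3791290300},
    "DataFrame": {"executorRunTime": 54900, "executorCpuTime": 22626614900},
    "Pandas API on Spark": {"executorRunTime": 166601, "executorCpuTime": 55323279000},
}


def summarize(metrics, baseline="RDD"):
    """Convert the times to seconds and relate each run time to the baseline."""
    base_run_time = metrics[baseline]["executorRunTime"]
    summary = {}
    for name, values in metrics.items():
        summary[name] = {
            "run_time_s": values["executorRunTime"] / 1000,  # assumed milliseconds
            "cpu_time_s": values["executorCpuTime"] / 1e9,   # assumed nanoseconds
            "run_time_vs_baseline": values["executorRunTime"] / base_run_time,
        }
    return summary


summary = summarize(metrics)
for name, row in summary.items():
    print(
        f"{name}: run {row['run_time_s']:.1f} s, "
        f"cpu {row['cpu_time_s']:.1f} s, "
        f"x{row['run_time_vs_baseline']:.2f} vs RDD"
    )
```

Under these unit assumptions, the Pandas API on Spark run spent roughly three times as much executor run time as the RDD run, while the DataFrame run was about on par with it.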

In [23]:
s = "miscellaneous"
arr = [(profession, 1) for profession in s.split(",") if profession != "miscellaneous"]
for a in arr:
    print("NEXT: "+ a[0])

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize Spark session
spark = SparkSession.builder.appName("Mission1").getOrCreate()

# Load datasource1
datasource1 = spark.read.csv(
    datasource1_dir,
    sep="\t",
    header=False,
    inferSchema=True
).toDF("tconst", "ordering", "nconst", "role", "job", "characters")

# Load datasource4
datasource4 = spark.read.csv(
    datasource4_dir,
    sep="\t",
    header=True,
    inferSchema=True
)

# Replace "\\N" with None (null) in birthYear and deathYear
datasource4 = datasource4.withColumn(
    "birthYear", when(col("birthYear") == "\\N", None).otherwise(col("birthYear"))
).withColumn(
    "deathYear", when(col("deathYear") == "\\N", None).otherwise(col("deathYear"))
)

# Drop rows where birthYear or deathYear is null
datasource4_clean = datasource4.dropna(subset=["birthYear", "deathYear"])

# Convert birthYear and deathYear to float
datasource4_clean = datasource4_clean.withColumn("birthYear", col("birthYear").cast("float"))
datasource4_clean = datasource4_clean.withColumn("deathYear", col("deathYear").cast("float"))

# Calculate age
datasource4_clean = datasource4_clean.withColumn(
    "age", (col("deathYear") - col("birthYear")).cast("float")
)

In [6]:
# Filter rows where age < 0
datasource4_negative_age = datasource4_clean.filter(col("age") < 0)
datasource4_negative_age.show(10)

+---------+----------------+---------+---------+--------------------+--------------------+-----+
|   nconst|     primaryName|birthYear|deathYear|   primaryProfession|      knownForTitles|  age|
+---------+----------------+---------+---------+--------------------+--------------------+-----+
|nm0393598|    Michael Hook|   1946.0|   1913.0|assistant_directo...|tt0087414,tt00783...|-33.0|
|nm0515385|    Titus Livius|     59.0|     17.0|              writer|           tt0003740|-42.0|
|nm1076886|      Gerda Ital|   2002.0|   1988.0|              writer| tt0036586,tt0036298|-14.0|
|nm3623931|Stanislaw Kucner|   1932.0|   1918.0|cinematographer,c...|tt0200223,tt41170...|-14.0|
|nm5121666|   George Sanger|   1927.0|   1911.0|                NULL|           tt2228122|-16.0|
|nm9543104| Viktor Cholnoky|   1868.0|   1812.0|              writer|                  \N|-56.0|
+---------+----------------+---------+---------+--------------------+--------------------+-----+
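The rows above have a `deathYear` earlier than the `birthYear`, so the computed age is negative. One possible guard, sketched here in plain Python under the assumption that such records should simply be treated as having no valid age, is shown below; the helper name `lifespan` is hypothetical.

```python
def lifespan(birth_year, death_year):
    """Return death_year - birth_year, or None when a year is missing or the order is inverted."""
    if birth_year is None or death_year is None:
        return None
    age = death_year - birth_year
    return age if age >= 0 else None


print(lifespan(1899, 1987))  # years from the Fred Astaire row shown earlier
print(lifespan(1946, 1913))  # inverted years, as in the Michael Hook row
print(lifespan(1934, None))  # missing death year
```

In the DataFrame code, the same idea could be expressed as an additional filter on `age >= 0` before aggregating.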

