
# Spark

Let's start by installing Spark locally on Colab:

In [1]:
!pip install pyspark --quiet
!pip install -U -q PyDrive --quiet
!apt-get install -y openjdk-11-jdk-headless &> /dev/null



In [2]:
import os
import pyspark
from pyspark import SparkContext, SparkConf

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
conf = SparkConf().set('spark.ui.port', '4050').setAppName("mlibs").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf=conf)

In [3]:
import kagglehub

path = kagglehub.dataset_download("sobhanmoosavi/us-accidents")
print(path)

Using Colab cache for faster access to the 'us-accidents' dataset.
/kaggle/input/us-accidents


In [7]:
!cp "{path}/US_Accidents_March23.csv" /content/sample_data/US_Accidents_March23.csv


### **Summary of the car accident dataset:**

- **Identification:**
  - `ID`: Unique identifier of the accident.
  - `Source`: Data source (e.g. police reports, monitoring systems).

- **Time:**
  - `Start_Time`, `End_Time`: Start and end time of the accident.

- **Location:**
  - `Start_Lat`, `Start_Lng`, `End_Lat`, `End_Lng`: Geographic coordinates of the accident site.
  - `Distance(mi)`: Length of the road stretch affected by the accident, in miles.
  - `City`, `State`, `Zipcode`, `Country`: Administrative location information.
  - `Street`: Street on which the accident occurred.

- **Weather:**
  - `Weather_Condition`: Weather conditions (e.g. "Rain", "Clear").
  - `Temperature(F)`, `Humidity(%)`, `Visibility(mi)`: Key atmospheric measurements.
  - `Wind_Speed(mph)`, `Wind_Direction`: Wind speed and direction.
  - `Precipitation(in)`: Rain or snow precipitation, in inches.

- **Infrastructure:**
  - `Amenity`, `Bump`, `Crossing`, `Traffic_Signal`: Information about objects and road features near the accident (e.g. speed bumps, crossings, traffic signals).
  - `Junction`, `No_Exit`, `Railway`: Locations related to road infrastructure.

- **Lighting:**
  - `Sunrise_Sunset`: Time of day at the moment of the accident ("Day" or "Night").
  - `Civil_Twilight`, `Nautical_Twilight`, `Astronomical_Twilight`: The different phases of dusk and dawn.

- **Description and details:**
  - `Description`: Text description of the accident.
  - `Severity`: Accident severity scale (values 1 to 4 in this dataset).
  - `Airport_Code`: Code of the airport closest to the accident site.

---

### **Example uses of the data:**
- Identifying high-risk locations.
- Analyzing the impact of weather on road safety.
- Assessing road infrastructure and its relation to accidents.
- Monitoring temporal and spatial accident trends.

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .master("local[*]") \
                    .config("spark.executor.memory", "8g") \
                    .config("spark.driver.memory", "4g") \
                    .appName("mlibs") \
                    .getOrCreate()

inDF = spark.read.format("csv") \
  .option("sep", ",") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load(path + "/US_Accidents_March23.csv")

inDF.printSchema()

inDF.createOrReplaceTempView("accidents")

root
 |-- ID: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- Severity: integer (nullable = true)
 |-- Start_Time: timestamp (nullable = true)
 |-- End_Time: timestamp (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- Start_Lng: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- End_Lng: double (nullable = true)
 |-- Distance(mi): double (nullable = true)
 |-- Description: string (nullable = true)
 |-- Street: string (nullable = true)
 |-- City: string (nullable = true)
 |-- County: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Timezone: string (nullable = true)
 |-- Airport_Code: string (nullable = true)
 |-- Weather_Timestamp: timestamp (nullable = true)
 |-- Temperature(F): double (nullable = true)
 |-- Wind_Chill(F): double (nullable = true)
 |-- Humidity(%): double (nullable = true)
 |-- Pressure(in): double (nullable = true)
 |-- V

### **Tasks**

---

### **Task 1: Basic statistics for the `Severity` column**
Count the number of accidents for each value of the `Severity` column and sort the results in descending order.


+--------+---------------+
|Severity|Total_Accidents|
+--------+---------------+
|       2|        6156981|
|       3|        1299337|
|       4|         204710|
|       1|          67366|
+--------+---------------+



In [12]:
inDF.select("Severity").groupBy("Severity").count().orderBy("count", ascending=False).show(50)

+--------+-------+
|Severity|  count|
+--------+-------+
|       2|6156981|
|       3|1299337|
|       4| 204710|
|       1|  67366|
+--------+-------+




---

### **Task 2: Accidents by state (`State`)**
Find the states with the most accidents. Show only the states where the number of accidents exceeds 10,000.

---

+-----+---------------+
|State|Total_Accidents|
+-----+---------------+
|   CA|        1741433|
|   FL|         880192|
|   TX|         582837|
|   SC|         382557|
|   NY|         347960|
|   NC|         338199|
|   VA|         303301|
|   PA|         296620|
|   MN|         192084|
|   OR|         179660|
|   AZ|         170609|
|   GA|         169234|
|   IL|         168958|
|   TN|         167388|
|   MI|         162191|
|   LA|         149701|
|   NJ|         140719|
|   MD|         140417|
|   OH|         118115|
|   WA|         108221|
+-----+---------------+
only showing top 20 rows



In [14]:
inDF.select("State").groupBy("State").count().withColumnRenamed("count", "Total_Accidents").filter("Total_Accidents > 10000").orderBy("Total_Accidents", ascending=False).show(20)

+-----+---------------+
|State|Total_Accidents|
+-----+---------------+
|   CA|        1741433|
|   FL|         880192|
|   TX|         582837|
|   SC|         382557|
|   NY|         347960|
|   NC|         338199|
|   VA|         303301|
|   PA|         296620|
|   MN|         192084|
|   OR|         179660|
|   AZ|         170609|
|   GA|         169234|
|   IL|         168958|
|   TN|         167388|
|   MI|         162191|
|   LA|         149701|
|   NJ|         140719|
|   MD|         140417|
|   OH|         118115|
|   WA|         108221|
+-----+---------------+
only showing top 20 rows
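
The count-then-threshold step in this task plays the same role as a SQL `HAVING` clause: the filter is applied to the aggregated counts, not to the raw rows. A minimal plain-Python sketch of that logic follows; the state list and the cutoff of 2 are made-up toy values, not taken from the dataset.

```python
from collections import Counter

# Toy stand-in for the State column; values are made up for illustration.
states = ["CA", "CA", "CA", "FL", "FL", "TX", "NY", "NY", "NY", "NY"]
threshold = 2  # hypothetical cutoff; the task itself uses 10,000

counts = Counter(states)

# Keep only groups whose count exceeds the threshold, then sort descending.
result = sorted(
    ((state, n) for state, n in counts.items() if n > threshold),
    key=lambda pair: pair[1],
    reverse=True,
)
print(result)  # [('NY', 4), ('CA', 3)]
```

`FL` has exactly 2 rows and is dropped, because the comparison is strict (`>`), just like "exceeds 10,000" in the task.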





### **Task 3: Accident duration**
Compute the average accident duration (`End_Time - Start_Time`) for each `Severity` level. Report the result in minutes.

---

+--------+------------------------+
|Severity|Average_Duration_Minutes|
+--------+------------------------+
|       1|        53.7829938940514|
|       2|      485.43694180908903|
|       3|       74.81007870424173|
|       4|      1685.3479593082898|
+--------+------------------------+



In [21]:
from pyspark.sql.functions import avg, col


inDF.withColumn("Duration", (col("End_Time").cast("long") - col("Start_Time").cast("long")) / 60).groupBy("Severity").agg(avg("Duration").alias("Average_Duration_Minutes")).orderBy("Severity", ascending=True).show(50)

+--------+------------------------+
|Severity|Average_Duration_Minutes|
+--------+------------------------+
|       1|       53.78299389405067|
|       2|      485.43694180908466|
|       3|       74.81007870424233|
|       4|      1685.3479593082866|
+--------+------------------------+
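
Casting a timestamp to `long` yields seconds since the epoch, so the difference of the two casts is a duration in seconds and dividing by 60 gives minutes. The same arithmetic can be sketched in plain Python with `datetime`; the rows below are made-up toy values, not records from the dataset.

```python
from collections import defaultdict
from datetime import datetime

# Toy rows: (severity, start, end); made-up values for illustration only.
rows = [
    (2, datetime(2023, 1, 1, 8, 0), datetime(2023, 1, 1, 8, 30)),
    (2, datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 10, 30)),
    (3, datetime(2023, 1, 2, 23, 50), datetime(2023, 1, 3, 0, 35)),
]

durations = defaultdict(list)
for severity, start, end in rows:
    # Seconds between the two timestamps, divided by 60 to get minutes.
    minutes = (end - start).total_seconds() / 60
    durations[severity].append(minutes)

# Average duration in minutes per severity level, ordered by severity.
avg_minutes = {sev: sum(v) / len(v) for sev, v in sorted(durations.items())}
print(avg_minutes)  # {2: 60.0, 3: 45.0}
```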





### **Task 4: Most dangerous cities**
Find the 10 cities with the most accidents and sort the results in descending order.

---


+-----------+---------------+
|       City|Total_Accidents|
+-----------+---------------+
|      Miami|         186917|
|    Houston|         169609|
|Los Angeles|         156491|
|  Charlotte|         138652|
|     Dallas|         130939|
|    Orlando|         109733|
|     Austin|          97359|
|    Raleigh|          86079|
|  Nashville|          72930|
|Baton Rouge|          71588|
+-----------+---------------+



In [26]:
inDF.select("City").groupBy("City").count().withColumnRenamed("count", "Total_Accidents").orderBy("Total_Accidents", ascending=False).limit(10).show()

+-----------+---------------+
|       City|Total_Accidents|
+-----------+---------------+
|      Miami|         186917|
|    Houston|         169609|
|Los Angeles|         156491|
|  Charlotte|         138652|
|     Dallas|         130939|
|    Orlando|         109733|
|     Austin|          97359|
|    Raleigh|          86079|
|  Nashville|          72930|
|Baton Rouge|          71588|
+-----------+---------------+




### **Task 5: Accidents by weather conditions**
Count the number of accidents for each weather condition type (`Weather_Condition`) and display them in descending order.

---

+--------------------+---------------+
|   Weather_Condition|Total_Accidents|
+--------------------+---------------+
|                Fair|        2560802|
|       Mostly Cloudy|        1016195|
|              Cloudy|         817082|
|               Clear|         808743|
|       Partly Cloudy|         698972|
|            Overcast|         382866|
|          Light Rain|         352957|
|    Scattered Clouds|         204829|
|          Light Snow|         128680|
|                 Fog|          99238|
|                Rain|          84331|
|                Haze|          76223|
|        Fair / Windy|          35671|
|          Heavy Rain|          32309|
|       Light Drizzle|          22684|
|Thunder in the Vi...|          17611|
|      Cloudy / Windy|          17035|
|             T-Storm|          16810|
|Mostly Cloudy / W...|          16508|
|                Snow|          15537|
|             Thunder|          14202|
|Light Rain with T...|          13597|
|               Smoke|   

In [27]:
inDF.select("Weather_Condition").groupBy("Weather_Condition").count().withColumnRenamed("count", "Total_Accidents").orderBy("Total_Accidents", ascending=False).show(50)

+--------------------+---------------+
|   Weather_Condition|Total_Accidents|
+--------------------+---------------+
|                Fair|        2560802|
|       Mostly Cloudy|        1016195|
|              Cloudy|         817082|
|               Clear|         808743|
|       Partly Cloudy|         698972|
|            Overcast|         382866|
|          Light Rain|         352957|
|    Scattered Clouds|         204829|
|                NULL|         173459|
|          Light Snow|         128680|
|                 Fog|          99238|
|                Rain|          84331|
|                Haze|          76223|
|        Fair / Windy|          35671|
|          Heavy Rain|          32309|
|       Light Drizzle|          22684|
|Thunder in the Vi...|          17611|
|      Cloudy / Windy|          17035|
|             T-Storm|          16810|
|Mostly Cloudy / W...|          16508|
|                Snow|          15537|
|             Thunder|          14202|
|Light Rain with T...|   



### **Task 6: Accidents and time of day**
Check how the number of accidents is distributed by time of day (`Sunrise_Sunset`).

---

+--------------+---------------+
|Sunrise_Sunset|Total_Accidents|
+--------------+---------------+
|          NULL|          23246|
|         Night|        2370595|
|           Day|        5334553|
+--------------+---------------+



In [28]:
inDF.select("Sunrise_Sunset").groupBy("Sunrise_Sunset").count().withColumnRenamed("count", "Total_Accidents").show()

+--------------+---------------+
|Sunrise_Sunset|Total_Accidents|
+--------------+---------------+
|          NULL|          23246|
|         Night|        2370595|
|           Day|        5334553|
+--------------+---------------+





### **Task 7: Exporting processed data**
Export the accidents that had `Severity >= 3` and occurred during "Rain" or "Snow" weather conditions to a file in Parquet format.

---


In [32]:
inDF.filter(
    (inDF.Severity >= 3) &
    (inDF.Weather_Condition.isin("Rain", "Snow"))
).write.parquet("file.parquet", mode="overwrite")
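
Note that `isin("Rain", "Snow")` compares whole values, so conditions such as "Light Rain" or "Heavy Rain" from the Task 5 output are not matched, and rows with a missing condition are dropped as well. A plain-Python sketch of the same predicate on made-up toy rows illustrates this:

```python
# Toy rows: (severity, weather_condition); values chosen for illustration.
rows = [
    (3, "Rain"),
    (4, "Light Rain"),
    (2, "Snow"),
    (3, "Snow"),
    (4, None),
]

wanted = {"Rain", "Snow"}

# Exact-match membership test combined with the severity condition.
selected = [r for r in rows if r[0] >= 3 and r[1] in wanted]
print(selected)  # [(3, 'Rain'), (3, 'Snow')]
```

If the broader categories are wanted (any condition mentioning rain or snow), a substring match would be needed instead; that is a different reading of the task than the one implemented above.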

### Tips tasks


In [29]:
!wget https://students.mimuw.edu.pl/~mw404851/dane.zip


--2025-09-27 08:03:14--  https://students.mimuw.edu.pl/~mw404851/dane.zip
Resolving students.mimuw.edu.pl (students.mimuw.edu.pl)... 193.0.96.129, 2001:6a0:5001:1::3
Connecting to students.mimuw.edu.pl (students.mimuw.edu.pl)|193.0.96.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 396807884 (378M) [application/zip]
Saving to: ‘dane.zip’


2025-09-27 08:03:31 (23.1 MB/s) - ‘dane.zip’ saved [396807884/396807884]



In [30]:
!mkdir dane && unzip dane.zip -d dane

Archive:  dane.zip
  inflating: dane/data_part_1.gz.parquet  
  inflating: dane/data_part_0.gz.parquet  
  inflating: dane/data_part_8.gz.parquet  
  inflating: dane/data_part_2.gz.parquet  
  inflating: dane/data_part_4.gz.parquet  
  inflating: dane/data_part_7.gz.parquet  
  inflating: dane/data_part_5.gz.parquet  
  inflating: dane/data_part_9.gz.parquet  
  inflating: dane/data_part_3.gz.parquet  
  inflating: dane/data_part_6.gz.parquet  


Columns and their meaning:

- `total_bill` (float): Total bill (in dollars).
- `tip` (float): Tip amount (in dollars).
- `sex` (string): Sex of the person who paid the bill (`Male` or `Female`).
- `smoker` (string): Whether the person was a smoker (`Yes` or `No`).
- `day` (string): Day of the week on which the transaction took place (`Thur`, `Fri`, `Sat`, `Sun`).
- `time` (string): Time of day: `Lunch` or `Dinner`.
- `size` (integer): Number of people in the group.


In [33]:
data_path = "dane/*.parquet"

df = spark.read.parquet(data_path)

df.show(3)

+----------+----+------+------+----+------+----+
|total_bill| tip|   sex|smoker| day|  time|size|
+----------+----+------+------+----+------+----+
|     10.24|5.22|  Male|    No| Fri| Lunch|   3|
|     11.17|6.78|Female|    No| Fri|Dinner|   5|
|     29.01|6.84|Female|    No|Thur| Lunch|   4|
+----------+----+------+------+----+------+----+
only showing top 3 rows



#### **Task 1: Average bill for each time of day**
Compute the average bill (`total_bill`) for meals served at `Lunch` and at `Dinner`.


In [34]:
df.createOrReplaceTempView("tips")

In [58]:
from pyspark.sql import functions as F

In [59]:
df.groupBy("time").agg(avg("total_bill").alias("avg_total_bill")).orderBy("avg_total_bill", ascending=False).show()

+------+------------------+
|  time|    avg_total_bill|
+------+------------------+
|Dinner| 20.29296757667226|
| Lunch|20.291929660043618|
+------+------------------+



In [35]:
spark.sql("SELECT time, avg(total_bill) FROM tips GROUP BY time").show()

+------+------------------+
|  time|   avg(total_bill)|
+------+------------------+
| Lunch|20.291929660043618|
|Dinner| 20.29296757667226|
+------+------------------+




---

#### **Task 2: Average tip by sex**
Compute the average tip (`tip`) for `Male` and `Female` payers.

---

In [60]:
df.groupBy("sex").agg(avg("tip").alias("avg_tip")).orderBy("avg_tip", ascending=False).show()

+------+------------------+
|   sex|           avg_tip|
+------+------------------+
|Female|5.0171324587970965|
|  Male| 5.016569841822055|
+------+------------------+



In [36]:
spark.sql("SELECT sex, avg(tip) FROM tips GROUP BY sex").show()

+------+------------------+
|   sex|          avg(tip)|
+------+------------------+
|Female|5.0171324587970965|
|  Male| 5.016569841822055|
+------+------------------+





#### **Task 3: Number of transactions per day of the week**
Count the transactions (`COUNT(*)`) for each day of the week (`day`) and sort the results by transaction count in descending order.

---

In [61]:
df.groupBy("day").count().withColumnRenamed("count", "transaction_count").orderBy("transaction_count", ascending=False).show()

+----+-----------------+
| day|transaction_count|
+----+-----------------+
| Sat|         25012880|
| Sun|         24998593|
| Fri|         24997616|
|Thur|         24990911|
+----+-----------------+



In [37]:
spark.sql("SELECT day, count(*) as transaction_count FROM tips GROUP BY day ORDER BY transaction_count DESC").show()

+----+-----------------+
| day|transaction_count|
+----+-----------------+
| Sat|         25012880|
| Sun|         24998593|
| Fri|         24997616|
|Thur|         24990911|
+----+-----------------+





#### **Task 4: Tips by smoking status**
Compare the average tip (`tip`) for smokers (`smoker = 'Yes'`) and non-smokers (`smoker = 'No'`).

---


In [66]:
df.groupBy("smoker").agg(avg("tip").alias("avg_tip")).orderBy("avg_tip", ascending=False).show()

+------+-----------------+
|smoker|          avg_tip|
+------+-----------------+
|   Yes|5.017205265844466|
|    No|5.016496952168923|
+------+-----------------+



In [38]:
spark.sql("SELECT smoker, avg(tip) as avg_tip FROM tips GROUP BY smoker").show()

+------+-----------------+
|smoker|          avg_tip|
+------+-----------------+
|    No|5.016496952168923|
|   Yes|5.017205265844466|
+------+-----------------+




#### **Task 5: Average tip percentage by sex**
Compute the average tip percentage (`tip / total_bill * 100`) for `Male` and `Female`.

---

In [65]:
df.groupBy("sex").agg(avg(col("tip") / col("total_bill") * 100).alias("avg_tip_percentage")).orderBy("avg_tip_percentage", ascending=False).show()

+------+------------------+
|   sex|avg_tip_percentage|
+------+------------------+
|  Male| 33.74454066919119|
|Female| 33.74193257449481|
+------+------------------+



In [39]:
spark.sql("SELECT sex, avg(100 * tip / total_bill) as avg_tip_percentage FROM tips GROUP BY sex").show()

+------+------------------+
|   sex|avg_tip_percentage|
+------+------------------+
|Female| 33.74193257449481|
|  Male| 33.74454066919119|
+------+------------------+
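
The expression `avg(tip / total_bill * 100)` is the mean of per-row percentages, which is not the same statistic as total tips divided by total bills. That is why the roughly 33.7% above is higher than the roughly 24.7% obtained by dividing the average tip (about 5.02) by the average bill (about 20.29). A small plain-Python sketch on made-up toy values shows the difference:

```python
# Toy (total_bill, tip) pairs; made-up values for illustration only.
bills = [(10.0, 5.0), (40.0, 4.0)]

# Mean of per-row percentages: what avg(tip / total_bill * 100) computes.
mean_of_ratios = sum(100 * tip / bill for bill, tip in bills) / len(bills)

# Ratio of totals: sum(tip) / sum(total_bill) * 100, a different statistic.
total_tip = sum(tip for _, tip in bills)
total_bill = sum(bill for bill, _ in bills)
ratio_of_sums = 100 * total_tip / total_bill

print(mean_of_ratios)  # 30.0
print(ratio_of_sums)   # 18.0
```

Small bills with relatively large tips pull the mean of ratios up, while the ratio of totals weights each row by its bill.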





#### **Task 6: Highest-revenue day of the week**
Find the day of the week (`day`) with the highest sum of bills (`total_bill`).

---


In [69]:
df.groupBy("day").agg(F.sum("total_bill").alias("total_revenue")).orderBy("total_revenue", ascending=False).limit(1).show()

+---+-------------------+
|day|      total_revenue|
+---+-------------------+
|Sat|5.076083770100002E8|
+---+-------------------+



In [40]:
spark.sql("SELECT day, sum(total_bill) as total_revenue FROM tips GROUP BY day ORDER BY total_revenue DESC LIMIT 1").show()

+---+-------------------+
|day|      total_revenue|
+---+-------------------+
|Sat|5.076083770100002E8|
+---+-------------------+




#### **Task 7: Transactions for large groups**
For each day of the week, count the transactions where the number of people (`size`) was greater than 4.

---


In [73]:
df.filter(df.size > 4).groupBy("day").count().withColumnRenamed("count", "large_group_transactions").orderBy("large_group_transactions", ascending=False).show()

+----+------------------------+
| day|large_group_transactions|
+----+------------------------+
| Sat|                 8340277|
| Fri|                 8334496|
| Sun|                 8331552|
|Thur|                 8330339|
+----+------------------------+



In [43]:
spark.sql("SELECT day, count(*) as large_group_transactions FROM tips WHERE size > 4 GROUP BY day ORDER BY large_group_transactions DESC").show()

+----+------------------------+
| day|large_group_transactions|
+----+------------------------+
| Sat|                 8340277|
| Fri|                 8334496|
| Sun|                 8331552|
|Thur|                 8330339|
+----+------------------------+




#### **Task 8: Most generous groups**
Find the 5 largest tips (`tip`) in the dataset and report each as a percentage of the total bill (`tip / total_bill * 100`).

---

In [77]:
df.select(["total_bill", "tip", "size"]).withColumn("tip_percentage", (col("tip") / col("total_bill") * 100)).orderBy("tip", ascending=False).limit(5).show()

+----------+-----+----+-----------------+
|total_bill|  tip|size|   tip_percentage|
+----------+-----+----+-----------------+
|     18.53|15.63|   2| 84.3497031840259|
|     25.56|15.57|   3|60.91549295774649|
|     37.81|15.54|   1|41.10023803226659|
|     28.29|15.53|   3|54.89572287027218|
|      35.2|15.48|   3|43.97727272727273|
+----------+-----+----+-----------------+



In [45]:
spark.sql("SELECT total_bill, tip, size, tip / total_bill * 100 as tip_percentage FROM tips ORDER BY tip DESC LIMIT 5").show()

+----------+-----+----+-----------------+
|total_bill|  tip|size|   tip_percentage|
+----------+-----+----+-----------------+
|     18.53|15.63|   2| 84.3497031840259|
|     25.56|15.57|   3|60.91549295774649|
|     37.81|15.54|   1|41.10023803226659|
|     28.29|15.53|   3|54.89572287027218|
|      35.2|15.48|   3|43.97727272727273|
+----------+-----+----+-----------------+
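
Ordering by `tip` and taking the first rows is a top-N selection on the absolute tip; the percentage is computed afterwards and does not affect the ranking, so the percentages in the result are not themselves sorted. A plain-Python sketch with `heapq.nlargest` on made-up toy rows (N = 2 here instead of 5):

```python
import heapq

# Toy (total_bill, tip) rows; made-up values for illustration only.
rows = [(20.0, 2.0), (10.0, 5.0), (50.0, 10.0), (40.0, 4.0)]

# Pick the 2 rows with the largest absolute tip.
top = heapq.nlargest(2, rows, key=lambda r: r[1])

# Attach the tip as a percentage of the bill to each selected row.
result = [(bill, tip, 100 * tip / bill) for bill, tip in top]
print(result)  # [(50.0, 10.0, 20.0), (10.0, 5.0, 50.0)]
```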





#### **Task 9: Bill per person**
Compute the average bill per person (`total_bill / size`) for each day of the week (`day`).

---

In [79]:
df.groupBy("day").agg(avg(col("total_bill") / col("size")).alias("avg_bill_per_person")).orderBy("avg_bill_per_person", ascending=False).show()

+----+-------------------+
| day|avg_bill_per_person|
+----+-------------------+
| Sun|  8.287338644725219|
| Fri|  8.287220061105542|
| Sat|  8.285342872632492|
|Thur|  8.283478717149496|
+----+-------------------+



In [46]:
spark.sql("SELECT day, avg(total_bill / size) as avg_bill_per_person FROM tips GROUP BY day").show()

+----+-------------------+
| day|avg_bill_per_person|
+----+-------------------+
|Thur|  8.283478717149496|
| Sun|  8.287338644725219|
| Sat|  8.285342872632492|
| Fri|  8.287220061105542|
+----+-------------------+





#### **Task 10: Day with the highest share of smokers**
Find the day of the week with the highest share of transactions made by smokers (`smoker = 'Yes'`).

---

In [88]:
df.groupBy("day").agg(avg(F.when(col("smoker") == "Yes", 1.0).otherwise(0.0)).alias("smoker_ratio")).orderBy(col("smoker_ratio").desc()).limit(1).show()

+----+------------------+
| day|      smoker_ratio|
+----+------------------+
|Thur|0.5001874881631966|
+----+------------------+



In [50]:
spark.sql("SELECT day, AVG(CASE WHEN smoker = 'Yes' THEN 1.0 ELSE 0.0 END) as smoker_ratio FROM tips GROUP BY day ORDER BY smoker_ratio DESC LIMIT 1").show()

+----+------------+
| day|smoker_ratio|
+----+------------+
|Thur|     0.50019|
+----+------------+
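
Both versions rely on the same trick: averaging a 0/1 indicator gives the fraction of rows for which the condition holds. The SQL result prints fewer digits, most likely because the literal `1.0` is treated as a decimal rather than a double in Spark SQL. A plain-Python sketch of the indicator-average logic on made-up toy rows:

```python
# Toy (day, smoker) rows; made-up values for illustration only.
rows = [
    ("Sat", "Yes"), ("Sat", "No"), ("Sat", "No"), ("Sat", "No"),
    ("Sun", "Yes"), ("Sun", "Yes"), ("Sun", "No"), ("Sun", "No"),
]

totals = {}
for day, smoker in rows:
    yes, n = totals.get(day, (0, 0))
    # Add 1 for a smoker row and 0 otherwise, mirroring CASE WHEN ... END.
    totals[day] = (yes + (1 if smoker == "Yes" else 0), n + 1)

# Share of smoker transactions per day, then the day with the highest share.
ratios = {day: yes / n for day, (yes, n) in totals.items()}
top_day = max(ratios, key=ratios.get)
print(ratios)   # {'Sat': 0.25, 'Sun': 0.5}
print(top_day)  # Sun
```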





#### **Task 11: Comparing averages for Lunch and Dinner**
Compute the average `total_bill` and `tip` for `Lunch` and `Dinner` and compare them.

---

In [51]:
spark.sql("SELECT time, AVG(total_bill) as avg_total_bill, AVG(tip) as avg_tip FROM tips GROUP BY time").show()

+------+------------------+-----------------+
|  time|    avg_total_bill|          avg_tip|
+------+------------------+-----------------+
| Lunch|20.291929660043618| 5.01692891241368|
|Dinner| 20.29296757667226|5.016773372484346|
+------+------------------+-----------------+



In [83]:
df.groupBy("time").agg(avg("total_bill").alias("avg_total_bill"), avg("tip").alias("avg_tip")).show()

+------+------------------+-----------------+
|  time|    avg_total_bill|          avg_tip|
+------+------------------+-----------------+
| Lunch|20.291929660043618| 5.01692891241368|
|Dinner| 20.29296757667226|5.016773372484346|
+------+------------------+-----------------+





#### **Task 12: Number of transactions by group size**
Count the transactions for each group size (`size`).

---

In [52]:
spark.sql("SELECT size, count(*) as total_transactions FROM tips GROUP BY size ORDER BY total_transactions DESC").show()

+----+------------------+
|size|total_transactions|
+----+------------------+
|   3|          16672170|
|   6|          16669732|
|   5|          16666932|
|   4|          16665096|
|   2|          16664231|
|   1|          16661839|
+----+------------------+



In [84]:
df.groupBy("size").count().withColumnRenamed("count", "total_transactions").show()

+----+------------------+
|size|total_transactions|
+----+------------------+
|   6|          16669732|
|   5|          16666932|
|   1|          16661839|
|   3|          16672170|
|   2|          16664231|
|   4|          16665096|
+----+------------------+





#### **Task 13: Exporting large transactions**
Export the transactions where `total_bill > 50` to a CSV file.

---

In [54]:
spark.sql("SELECT * FROM tips WHERE total_bill > 50").write.mode("overwrite").csv("transactionsOver50", header=True)

In [56]:
df.filter(df.total_bill > 50).write.mode("overwrite").csv("transactions_over_50", header=True)



#### **Task 14: Largest tip for each time of day**
Find the largest tip (`tip`) for `Lunch` and for `Dinner`.

---


In [53]:
spark.sql("SELECT time, MAX(tip) as max_tip FROM tips GROUP BY time").show()

+------+-------+
|  time|max_tip|
+------+-------+
| Lunch|  15.57|
|Dinner|  15.63|
+------+-------+



In [85]:
df.groupBy("time").agg(F.max("tip").alias("max_tip")).show()

+------+-------+
|  time|max_tip|
+------+-------+
| Lunch|  15.57|
|Dinner|  15.63|
+------+-------+

