# Occupation

### Introduction:

Special thanks to https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

In [3]:
!wget -O users.csv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

--2025-12-02 13:30:28--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘users.csv’


2025-12-02 13:30:28 (15.4 MB/s) - ‘users.csv’ saved [22667/22667]



### Step 3. Assign it to a variable called users.

In [4]:
# u.user is pipe-delimited and has a header row
users = spark.read.csv('users.csv', header=True, sep='|', inferSchema=True)

In [4]:
users.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 4. Find the mean age per occupation

In [4]:
%%time
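# Mean computed by hand as sum / count, for comparison with F.avg in the next cell.
# F.count('age') skips nulls, as F.sum does, so the ratio agrees with F.avg.
age_avg = (
    users.groupBy('occupation')
    .agg(
        F.sum('age').alias('age_sum'),
        F.count('age').alias('age_count'),
    )
    .select(
        'occupation',
        (F.col('age_sum') / F.col('age_count')).alias('avg_age'),
    )
)

age_avg.show()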

+-------------+------------------+
|   occupation|           avg_age|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
+-------------+------------------+
only showing top 20 rows

CPU times: user 7.22 ms, sys: 1.42 ms, total: 8.64 ms
Wall time: 2.03 s


In [5]:
%%time
age_avg = (
    users.groupBy("occupation")
         .agg(F.avg("age").alias("avg_age"))
)

age_avg.show()


+-------------+------------------+
|   occupation|           avg_age|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
+-------------+------------------+
only showing top 20 rows

CPU times: user 4.96 ms, sys: 1.11 ms, total: 6.06 ms
Wall time: 2.58 s


### Step 5. Find the male ratio per occupation and sort it from highest to lowest

### Step 6. For each occupation, calculate the minimum and maximum ages

### Step 7. For each combination of occupation and gender, calculate the mean age

### Step 8. For each occupation, present the percentage of women and men