
# Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [32]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

In [33]:
!wget -O users.csv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

--2025-12-01 18:02:40--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘users.csv’


2025-12-01 18:02:40 (71.9 MB/s) - ‘users.csv’ saved [22667/22667]



### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [34]:
# Spark DataFrames have no index, so 'user_id' stays an ordinary column
users = spark.read.csv('users.csv', header=True, sep="|", inferSchema=True)
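
Because a Spark DataFrame carries no index, one way to picture what "use `user_id` as index" would mean is a plain-Python lookup keyed by `user_id`. The sketch below is not part of the Spark workflow: it parses three rows in the same pipe-delimited layout as `users.csv` with the standard `csv` module and keeps `zip_code` as text so that leading zeros survive. The names `SAMPLE` and `parse_users` are made up for illustration.

```python
import csv
import io

# Hand-copied sample in the same layout as users.csv: header row, "|" separator.
SAMPLE = """user_id|age|gender|occupation|zip_code
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
"""


def parse_users(text):
    """Parse pipe-delimited text into a dict keyed by user_id (a stand-in for an index)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    users_by_id = {}
    for row in reader:
        users_by_id[int(row["user_id"])] = {
            "age": int(row["age"]),
            "gender": row["gender"],
            "occupation": row["occupation"],
            "zip_code": row["zip_code"],  # kept as text so leading zeros survive
        }
    return users_by_id


sample_users = parse_users(SAMPLE)
print(sample_users[8]["zip_code"])  # prints 05201
```

Casting the same value with `int("05201")` would give `5201`, which is why a string type is the safer choice for zip codes.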

### Step 4. See the first 25 entries

In [35]:
users.show(25)

+-------+---+------+-------------+--------+
|user_id|age|gender|   occupation|zip_code|
+-------+---+------+-------------+--------+
|      1| 24|     M|   technician|   85711|
|      2| 53|     F|        other|   94043|
|      3| 23|     M|       writer|   32067|
|      4| 24|     M|   technician|   43537|
|      5| 33|     F|        other|   15213|
|      6| 42|     M|    executive|   98101|
|      7| 57|     M|administrator|   91344|
|      8| 36|     M|administrator|   05201|
|      9| 29|     M|      student|   01002|
|     10| 53|     M|       lawyer|   90703|
|     11| 39|     F|        other|   30329|
|     12| 28|     F|        other|   06405|
|     13| 47|     M|     educator|   29206|
|     14| 45|     M|    scientist|   55106|
|     15| 49|     F|     educator|   97301|
|     16| 21|     M|entertainment|   10309|
|     17| 30|     M|   programmer|   06355|
|     18| 35|     F|        other|   37212|
|     19| 40|     M|    librarian|   02138|
|     20| 42|     F|    homemake

### Step 5. See the last 10 entries

In [10]:
# Spark rows have no positional order, so sort by user_id descending and keep 10
users.orderBy(F.desc('user_id')).limit(10).show()

+-------+---+------+-------------+--------+
|user_id|age|gender|   occupation|zip_code|
+-------+---+------+-------------+--------+
|    943| 22|     M|      student|   77841|
|    942| 48|     F|    librarian|   78209|
|    941| 20|     M|      student|   97229|
|    940| 32|     M|administrator|   02215|
|    939| 26|     F|      student|   33319|
|    938| 38|     F|   technician|   55038|
|    937| 48|     M|     educator|   98072|
|    936| 24|     M|        other|   32789|
|    935| 42|     M|       doctor|   66221|
|    934| 61|     M|     engineer|   22902|
+-------+---+------+-------------+--------+



### Step 6. What is the number of observations in the dataset?

In [36]:
users.count()

943

### Step 7. What is the number of columns in the dataset?

In [12]:
len(users.columns)

5

### Step 8. Print the name of all the columns.

In [37]:
print(users.columns)

['user_id', 'age', 'gender', 'occupation', 'zip_code']


### Step 9. How is the dataset indexed?

In [None]:
# Spark DataFrames have no index; rows are identified only by their column values

### Step 10. What is the data type of each column?

In [38]:
users.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 11. Print only the occupation column

In [39]:
users.select(F.col('occupation')).show()

+-------------+
|   occupation|
+-------------+
|   technician|
|        other|
|       writer|
|   technician|
|        other|
|    executive|
|administrator|
|administrator|
|      student|
|       lawyer|
|        other|
|        other|
|     educator|
|    scientist|
|     educator|
|entertainment|
|   programmer|
|        other|
|    librarian|
|    homemaker|
+-------------+
only showing top 20 rows



### Step 12. How many different occupations are in this dataset?

In [40]:
users.select('occupation').distinct().count()

21

### Step 13. What is the most frequent occupation?

In [41]:
# F.mode requires Spark 3.4 or later
users.select(F.mode('occupation')).collect()[0][0]

'student'
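
`F.mode` hides a simple count-and-pick-the-largest step. As a rough plain-Python illustration (not Spark code), the sketch below counts the 20 occupation values printed in Step 11 with `collections.Counter`; `most_frequent` is a hypothetical helper name. Because it sees only those 20 rows, its answer differs from the full-dataset result above.

```python
from collections import Counter

# The 20 occupation values printed in Step 11 (first rows of the dataset only).
occupations = [
    "technician", "other", "writer", "technician", "other",
    "executive", "administrator", "administrator", "student", "lawyer",
    "other", "other", "educator", "scientist", "educator",
    "entertainment", "programmer", "other", "librarian", "homemaker",
]


def most_frequent(values):
    """Return (value, count) for the most common value; ties go to the first one seen."""
    counts = Counter(values)
    return counts.most_common(1)[0]


print(most_frequent(occupations))  # prints ('other', 5)
```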

### Step 14. Summarize the DataFrame.

In [42]:
users.describe().show()

+-------+-----------------+-----------------+------+-------------+------------------+
|summary|          user_id|              age|gender|   occupation|          zip_code|
+-------+-----------------+-----------------+------+-------------+------------------+
|  count|              943|              943|   943|          943|               943|
|   mean|            472.0|34.05196182396607|  NULL|         NULL| 50868.78810810811|
| stddev|272.3649512449549|12.19273973305903|  NULL|         NULL|30891.373254138176|
|    min|                1|                7|     F|administrator|             00000|
|    max|              943|               73|     M|       writer|             Y1A6B|
+-------+-----------------+-----------------+------+-------------+------------------+
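
The `zip_code` column is a string, yet the summary reports a mean for it. A plausible reading, stated here as an assumption rather than something the notebook confirms, is that `min` and `max` are compared as text while `mean` and `stddev` use only the values that parse as numbers. The plain-Python sketch below mimics that behaviour on a hand-picked sample of zip codes taken from the outputs above; `summarize_strings` is a hypothetical helper, not a Spark function.

```python
from statistics import mean

# Hand-picked zip codes that appear in the outputs above.
zip_codes = ["85711", "05201", "00000", "Y1A6B", "94043"]


def summarize_strings(values):
    """Text-based min/max plus a mean over the values made up only of digits."""
    numeric = [float(v) for v in values if v.isdigit()]
    return {
        "min": min(values),  # lexicographic comparison
        "max": max(values),  # uppercase letters sort after digits
        "mean": mean(numeric) if numeric else None,
        "numeric_count": len(numeric),
    }


result = summarize_strings(zip_codes)
print(result)
```

In this sample the non-numeric value `Y1A6B` wins the `max` comparison but is left out of the mean, which mirrors the mixed output in the `zip_code` column of the table.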



### Step 15. Summarize all the columns

In [43]:
users.summary().show()

+-------+-----------------+-----------------+------+-------------+------------------+
|summary|          user_id|              age|gender|   occupation|          zip_code|
+-------+-----------------+-----------------+------+-------------+------------------+
|  count|              943|              943|   943|          943|               943|
|   mean|            472.0|34.05196182396607|  NULL|         NULL| 50868.78810810811|
| stddev|272.3649512449549|12.19273973305903|  NULL|         NULL|30891.373254138176|
|    min|                1|                7|     F|administrator|             00000|
|    25%|              236|               25|  NULL|         NULL|           21227.0|
|    50%|              472|               31|  NULL|         NULL|           53711.0|
|    75%|              708|               43|  NULL|         NULL|           78741.0|
|    max|              943|               73|     M|       writer|             Y1A6B|
+-------+-----------------+-----------------+------+-------------+------------------+

### Step 16. Summarize only the occupation column

In [44]:
users.select(F.col('occupation')).summary().show()

+-------+-------------+
|summary|   occupation|
+-------+-------------+
|  count|          943|
|   mean|         NULL|
| stddev|         NULL|
|    min|administrator|
|    25%|         NULL|
|    50%|         NULL|
|    75%|         NULL|
|    max|       writer|
+-------+-------------+



### Step 17. What is the mean age of users?

In [45]:
users.select(F.avg('age')).collect()[0][0]

34.05196182396607

### Step 18. What is the age with least occurrence?

In [46]:
users.groupBy('age').count().orderBy('count').show(5)


+---+-----+
|age|count|
+---+-----+
| 10|    1|
|  7|    1|
| 73|    1|
| 11|    1|
| 66|    1|
+---+-----+
only showing top 5 rows
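
Ordering by `count` shows several ages tied at a single occurrence, so there is no single "least frequent age", and the order among tied rows is not guaranteed. A small plain-Python sketch of collecting every tied value, using the ages of the first 20 rows shown in Step 4 (`least_frequent` is a hypothetical helper name):

```python
from collections import Counter

# Ages of the first 20 users, copied from the Step 4 output.
ages = [24, 53, 23, 24, 33, 42, 57, 36, 29, 53,
        39, 28, 47, 45, 49, 21, 30, 35, 40, 42]


def least_frequent(values):
    """Return (sorted values tied for the lowest count, that count)."""
    counts = Counter(values)
    lowest = min(counts.values())
    tied = sorted(v for v, c in counts.items() if c == lowest)
    return tied, lowest


tied_ages, lowest_count = least_frequent(ages)
print(lowest_count, tied_ages)
```

In Spark, the same idea would amount to computing the minimum of the `count` column and filtering the grouped result on it, rather than relying on the first row of the sorted output.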

