# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter, see the excellent tutorial in
their [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not open automatically, go to http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The markdown cell contains "This Is a Title" and "This is text."
The code cell contains two lines of Python code.

![Markdown and code cells in the "test.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output shown in :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately, there is an alternative: native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.
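
If you prefer not to edit the configuration file by hand, the step above can be scripted. The following is a minimal sketch, not part of the book's tooling: the helper name `enable_notedown` is hypothetical, and the configuration path is passed in explicitly because its location may differ on your system. The function appends the line only if it is not already present, so it is safe to call more than once.

```python
from pathlib import Path

# The exact line given in the text above.
CONFIG_LINE = "c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'"


def enable_notedown(config_path):
    """Append CONFIG_LINE to the given config file if it is missing.

    Returns True if the file was modified, False if the line was already there.
    `enable_notedown` is a hypothetical helper written for this illustration.
    """
    path = Path(config_path).expanduser()
    text = path.read_text() if path.exists() else ""
    if CONFIG_LINE in text:
        return False
    # Make sure the new line starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + CONFIG_LINE + "\n")
    return True


# Example call (assumes the usual Linux/macOS location mentioned above):
# enable_notedown("~/.jupyter/jupyter_notebook_config.py")
```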

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access it through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

Here, `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail how to run Jupyter notebooks on AWS instances
later in this appendix.
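
Before opening the browser, it can be handy to check that the forwarded port actually accepts connections. The snippet below is a small sketch using only Python's standard library; the function name `port_is_open` is an illustrative choice, and the default host and port simply mirror the `ssh` command above. It only tests that something is listening on the port, not that it is Jupyter.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Prints True once the SSH tunnel and the remote Jupyter server are up.
    print(port_is_open())
```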

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```
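
If installing the extension is not an option, a rough manual equivalent is to wrap the code of interest in a timer. The sketch below is an assumption-laden illustration rather than what `ExecuteTime` does internally: the context manager `timed` and its optional `results` dictionary are hypothetical names introduced here. Jupyter's built-in `%%time` cell magic offers a similar one-off measurement.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the measurement for later use
        print(f"{label}: {elapsed:.4f} sec")


# Time a small computation, much like timing a single code cell.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```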

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


In [1]:
!pip install pyspark




In [2]:
from pyspark.sql import SparkSession

Spark=(
    SparkSession.builder.appName("Spark App").master("local[*]").getOrCreate()
)
print(Spark)
print(Spark.version)

<pyspark.sql.session.SparkSession object at 0x7d2484a200e0>
3.5.1


In [18]:
# Emp Data & Schema

emp_data = [
    ["001","101","John Doe","30","Male","50000","2015-01-01"],
    ["002","101","Jane Smith","25","Female","45000","2016-02-15"],
    ["003","102","Bob Brown","35","Male","55000","2014-05-01"],
    ["004","102","Alice Lee","28","Female","48000","2017-09-30"],
    ["005","103","Jack Chan","40","Male","60000","2013-04-01"],
    ["006","103","Jill Wong","32","Female","52000","2018-07-01"],
    ["007","101","James Johnson","42","Male","70000","2012-03-15"],
    ["008","102","Kate Kim","29","Female","51000","2019-10-01"],
    ["009","103","Tom Tan","33","Male","58000","2016-06-01"],
    ["010","104","Lisa Lee","27","Female","47000","2018-08-01"],
    ["011","104","David Park","38","Male","65000","2015-11-01"],
    ["012","105","Susan Chen","31","Female","54000","2017-02-15"],
    ["013","106","Brian Kim","45","Male","75000","2011-07-01"],
    ["014","107","Emily Lee","26","Female","46000","2019-01-01"],
    ["015","106","Michael Lee","37","Male","63000","2014-09-30"],
    ["016","107","Kelly Zhang","30","Female","49000","2018-04-01"],
    ["017","105","George Wang","34","Male","57000","2016-03-15"],
    ["018","104","Nancy Liu","29","Female","50000","2017-06-01"],
    ["019","103","Steven Chen","36","Male","62000","2015-08-01"],
    ["020","102","Grace Kim","32","Female","53000","2018-11-01"],
    ["021","102","Kim jerry","32","FFM","59000","2018-11-01"]

]

emp_schema = "employee_id string, department_id string, name string, age string, gender string, salary string, hire_date string"

emp = Spark.createDataFrame(data=emp_data, schema=emp_schema)

emp.show()

+-----------+-------------+-------------+---+------+------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|
+-----------+-------------+-------------+---+------+------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|
|        008|          102|     Kate Kim| 29|Female| 51000|2019-10-01|
|        009|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|
|        010|          104|     Lisa Lee| 27|Female| 47000|2018-08-01|
|        011|          104|   David Park| 38|  Male| 65000|2015-11-01|
|     

In [4]:
emp.rdd.getNumPartitions()
# Install pyngrok
!pip install pyngrok

# Start Spark with port 4040 (default)
!pyspark &

# Expose port 4040 to the internet
from pyngrok import ngrok
ngrok.connect(4040)

Collecting pyngrok
  Downloading pyngrok-7.3.0-py3-none-any.whl.metadata (8.1 kB)
Downloading pyngrok-7.3.0-py3-none-any.whl (25 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.3.0


ERROR:pyngrok.process.ngrok:t=2025-09-05T12:39:49+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-09-05T12:39:49+0000 lvl=eror msg="session closing" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-09-05T12:39:49+0000 lvl=eror msg="terminating with error" obj=app err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your aut

PyngrokNgrokError: The ngrok process errored on start: authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n.

In [5]:
from pyngrok import ngrok
ngrok.set_auth_token("<YOUR_NGROK_AUTHTOKEN>")   # paste your token; never commit a real token
ngrok.connect(4040)




<NgrokTunnel: "https://4c4f5a6fbb68.ngrok-free.app" -> "http://localhost:4040">

In [None]:
emp_final=emp.where("salary > 50000")
emp_final.show()

+-----------+-------------+-------------+---+------+------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|
+-----------+-------------+-------------+---+------+------+----------+
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|
|        008|          102|     Kate Kim| 29|Female| 51000|2019-10-01|
|        009|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|
|        011|          104|   David Park| 38|  Male| 65000|2015-11-01|
|        012|          105|   Susan Chen| 31|Female| 54000|2017-02-15|
|        013|          106|    Brian Kim| 45|  Male| 75000|2011-07-01|
|        015|          106|  Michael Lee| 37|  Male| 63000|2014-09-30|
|        017|          105|  George Wang| 34|  Male| 57000|2016-03-15|
|     

In [None]:
emp.schema
from pyspark.sql.types import StructType,StructField,StringType,IntegerType

spark_schema=StructType([
    StructField("name",StringType(),True),
    StructField("age",IntegerType(),True)
])
spark_schema

StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True)])

In [None]:
from pyspark.sql.functions import col,expr
col("name")
expr("name")
emp.employee_id
emp["salary"]

Column<'salary'>

In [None]:
emp_filtered=emp.select(col("name"),expr("age"),emp.salary,emp["department_id"],emp.employee_id)
emp_filtered.show()

In [None]:
emp_filtered.show()

+-------------+---+------+-------------+-----------+
|         name|age|salary|department_id|employee_id|
+-------------+---+------+-------------+-----------+
|     John Doe| 30| 50000|          101|        001|
|   Jane Smith| 25| 45000|          101|        002|
|    Bob Brown| 35| 55000|          102|        003|
|    Alice Lee| 28| 48000|          102|        004|
|    Jack Chan| 40| 60000|          103|        005|
|    Jill Wong| 32| 52000|          103|        006|
|James Johnson| 42| 70000|          101|        007|
|     Kate Kim| 29| 51000|          102|        008|
|      Tom Tan| 33| 58000|          103|        009|
|     Lisa Lee| 27| 47000|          104|        010|
|   David Park| 38| 65000|          104|        011|
|   Susan Chen| 31| 54000|          105|        012|
|    Brian Kim| 45| 75000|          106|        013|
|    Emily Lee| 26| 46000|          107|        014|
|  Michael Lee| 37| 63000|          106|        015|
|  Kelly Zhang| 30| 49000|          107|      

In [None]:
emp_casted=emp_filtered.select(expr("employee_id as emp_id"),emp.name,expr("cast(age as int) as age_int"),expr("salary as emp_salary"))

emp_casted

DataFrame[emp_id: string, name: string, age_int: int, emp_salary: string]

In [None]:
# Filter on the casted integer column that is actually selected here
emp_casted_1=emp_casted.selectExpr("emp_id","name","age_int","emp_salary").where("age_int > 20")

emp_casted_1.show()

+------+-------------+-------+----------+
|emp_id|         name|age_int|emp_salary|
+------+-------------+-------+----------+
|   001|     John Doe|     30|     50000|
|   002|   Jane Smith|     25|     45000|
|   003|    Bob Brown|     35|     55000|
|   004|    Alice Lee|     28|     48000|
|   005|    Jack Chan|     40|     60000|
|   006|    Jill Wong|     32|     52000|
|   007|James Johnson|     42|     70000|
|   008|     Kate Kim|     29|     51000|
|   009|      Tom Tan|     33|     58000|
|   010|     Lisa Lee|     27|     47000|
|   011|   David Park|     38|     65000|
|   012|   Susan Chen|     31|     54000|
|   013|    Brian Kim|     45|     75000|
|   014|    Emily Lee|     26|     46000|
|   015|  Michael Lee|     37|     63000|
|   016|  Kelly Zhang|     30|     49000|
|   017|  George Wang|     34|     57000|
|   018|    Nancy Liu|     29|     50000|
|   019|  Steven Chen|     36|     62000|
|   020|    Grace Kim|     32|     53000|
+------+-------------+-------+----

In [None]:
emp_casted_1.show()

+------+-------------+-------+----------+
|emp_id|         name|age_int|emp_salary|
+------+-------------+-------+----------+
|   001|     John Doe|     30|     50000|
|   002|   Jane Smith|     25|     45000|
|   003|    Bob Brown|     35|     55000|
|   004|    Alice Lee|     28|     48000|
|   005|    Jack Chan|     40|     60000|
|   006|    Jill Wong|     32|     52000|
|   007|James Johnson|     42|     70000|
|   008|     Kate Kim|     29|     51000|
|   009|      Tom Tan|     33|     58000|
|   010|     Lisa Lee|     27|     47000|
|   011|   David Park|     38|     65000|
|   012|   Susan Chen|     31|     54000|
|   013|    Brian Kim|     45|     75000|
|   014|    Emily Lee|     26|     46000|
|   015|  Michael Lee|     37|     63000|
|   016|  Kelly Zhang|     30|     49000|
|   017|  George Wang|     34|     57000|
|   018|    Nancy Liu|     29|     50000|
|   019|  Steven Chen|     36|     62000|
|   020|    Grace Kim|     32|     53000|
+------+-------------+-------+----

In [None]:
schema_str="name String,age int,employee_id String"
from pyspark.sql.types import _parse_datatype_string
schema_1=_parse_datatype_string(schema_str)
schema_1

StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True), StructField('employee_id', StringType(), True)])

In [None]:
emp.show(5)

+-----------+-------------+----------+---+------+------+----------+
|employee_id|department_id|      name|age|gender|salary| hire_date|
+-----------+-------------+----------+---+------+------+----------+
|        001|          101|  John Doe| 30|  Male| 50000|2015-01-01|
|        002|          101|Jane Smith| 25|Female| 45000|2016-02-15|
|        003|          102| Bob Brown| 35|  Male| 55000|2014-05-01|
|        004|          102| Alice Lee| 28|Female| 48000|2017-09-30|
|        005|          103| Jack Chan| 40|  Male| 60000|2013-04-01|
+-----------+-------------+----------+---+------+------+----------+
only showing top 5 rows



In [None]:
from pyspark.sql.functions import col  # cast is a Column method, so no separate import is needed

emp_casted=emp.select("employee_id","name","age","department_id",col("salary").cast("double"))


In [None]:
emp_casted.printSchema()

root
 |-- employee_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- department_id: string (nullable = true)
 |-- salary: double (nullable = true)



In [None]:
emp_taxed=emp_casted.withColumn("tax",col("salary") * 0.2)
emp_taxed.show()

+-----------+-------------+---+-------------+-------+-------+
|employee_id|         name|age|department_id| salary|    tax|
+-----------+-------------+---+-------------+-------+-------+
|        001|     John Doe| 30|          101|50000.0|10000.0|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|
|        007|James Johnson| 42|          101|70000.0|14000.0|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|
|        009|      Tom Tan| 33|          103|58000.0|11600.0|
|        010|     Lisa Lee| 27|          104|47000.0| 9400.0|
|        011|   David Park| 38|          104|65000.0|13000.0|
|        012|   Susan Chen| 31|          105|54000.0|10800.0|
|        013|    Brian Kim| 45|          106|75000.0|15000.0|
|       

In [None]:
from pyspark.sql.functions import lit

emp_new_cols=emp_taxed.withColumn("ColumnOne",lit(1)).withColumn("ColumnTwo",lit("two"))
emp_new_cols.show()

+-----------+-------------+---+-------------+-------+-------+---------+---------+
|employee_id|         name|age|department_id| salary|    tax|ColumnOne|ColumnTwo|
+-----------+-------------+---+-------------+-------+-------+---------+---------+
|        001|     John Doe| 30|          101|50000.0|10000.0|        1|      two|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|        1|      two|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|        1|      two|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|        1|      two|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|        1|      two|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|        1|      two|
|        007|James Johnson| 42|          101|70000.0|14000.0|        1|      two|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|        1|      two|
|        009|      Tom Tan| 33|          103|58000.0|11600.0|        1|      two|
|        010|   

In [None]:
emp1=emp_new_cols.withColumnRenamed("employee_id","emp_id")

emp1.show()

+------+-------------+---+-------------+-------+-------+---------+---------+
|emp_id|         name|age|department_id| salary|    tax|ColumnOne|ColumnTwo|
+------+-------------+---+-------------+-------+-------+---------+---------+
|   001|     John Doe| 30|          101|50000.0|10000.0|        1|      two|
|   002|   Jane Smith| 25|          101|45000.0| 9000.0|        1|      two|
|   003|    Bob Brown| 35|          102|55000.0|11000.0|        1|      two|
|   004|    Alice Lee| 28|          102|48000.0| 9600.0|        1|      two|
|   005|    Jack Chan| 40|          103|60000.0|12000.0|        1|      two|
|   006|    Jill Wong| 32|          103|52000.0|10400.0|        1|      two|
|   007|James Johnson| 42|          101|70000.0|14000.0|        1|      two|
|   008|     Kate Kim| 29|          102|51000.0|10200.0|        1|      two|
|   009|      Tom Tan| 33|          103|58000.0|11600.0|        1|      two|
|   010|     Lisa Lee| 27|          104|47000.0| 9400.0|        1|      two|

In [None]:
emp1=emp_new_cols.withColumnRenamed("employee_id","Emp Id")
emp1.show()

+------+-------------+---+-------------+-------+-------+---------+---------+
|Emp Id|         name|age|department_id| salary|    tax|ColumnOne|ColumnTwo|
+------+-------------+---+-------------+-------+-------+---------+---------+
|   001|     John Doe| 30|          101|50000.0|10000.0|        1|      two|
|   002|   Jane Smith| 25|          101|45000.0| 9000.0|        1|      two|
|   003|    Bob Brown| 35|          102|55000.0|11000.0|        1|      two|
|   004|    Alice Lee| 28|          102|48000.0| 9600.0|        1|      two|
|   005|    Jack Chan| 40|          103|60000.0|12000.0|        1|      two|
|   006|    Jill Wong| 32|          103|52000.0|10400.0|        1|      two|
|   007|James Johnson| 42|          101|70000.0|14000.0|        1|      two|
|   008|     Kate Kim| 29|          102|51000.0|10200.0|        1|      two|
|   009|      Tom Tan| 33|          103|58000.0|11600.0|        1|      two|
|   010|     Lisa Lee| 27|          104|47000.0| 9400.0|        1|      two|

In [None]:
emp_dropped=emp_new_cols.drop("ColumnOne")
emp_dropped.show()

+-----------+-------------+---+-------------+-------+-------+---------+
|employee_id|         name|age|department_id| salary|    tax|ColumnTwo|
+-----------+-------------+---+-------------+-------+-------+---------+
|        001|     John Doe| 30|          101|50000.0|10000.0|      two|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|      two|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|      two|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|      two|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|      two|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|      two|
|        007|James Johnson| 42|          101|70000.0|14000.0|      two|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|      two|
|        009|      Tom Tan| 33|          103|58000.0|11600.0|      two|
|        010|     Lisa Lee| 27|          104|47000.0| 9400.0|      two|
|        011|   David Park| 38|          104|65000.0|13000.0|   

In [None]:
emp_drop1=emp_taxed.drop("ColumnOne").drop("ColumnTwo")
emp_drop1.show()

+-----------+-------------+---+-------------+-------+-------+
|employee_id|         name|age|department_id| salary|    tax|
+-----------+-------------+---+-------------+-------+-------+
|        001|     John Doe| 30|          101|50000.0|10000.0|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|
|        007|James Johnson| 42|          101|70000.0|14000.0|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|
|        009|      Tom Tan| 33|          103|58000.0|11600.0|
|        010|     Lisa Lee| 27|          104|47000.0| 9400.0|
|        011|   David Park| 38|          104|65000.0|13000.0|
|        012|   Susan Chen| 31|          105|54000.0|10800.0|
|        013|    Brian Kim| 45|          106|75000.0|15000.0|
|       

In [None]:
emp_filter=emp_drop1.where("tax > 10000")
emp_filter.show()

+-----------+-------------+---+-------------+-------+-------+
|employee_id|         name|age|department_id| salary|    tax|
+-----------+-------------+---+-------------+-------+-------+
|        003|    Bob Brown| 35|          102|55000.0|11000.0|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|
|        007|James Johnson| 42|          101|70000.0|14000.0|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|
|        009|      Tom Tan| 33|          103|58000.0|11600.0|
|        011|   David Park| 38|          104|65000.0|13000.0|
|        012|   Susan Chen| 31|          105|54000.0|10800.0|
|        013|    Brian Kim| 45|          106|75000.0|15000.0|
|        015|  Michael Lee| 37|          106|63000.0|12600.0|
|        017|  George Wang| 34|          105|57000.0|11400.0|
|        019|  Steven Chen| 36|          103|62000.0|12400.0|
|        020|    Grace Kim| 32|          102|53000.0|10600.0|
+-------

In [None]:
emp_limit=emp_taxed.limit(8)
emp_limit.show()

+-----------+-------------+---+-------------+-------+-------+
|employee_id|         name|age|department_id| salary|    tax|
+-----------+-------------+---+-------------+-------+-------+
|        001|     John Doe| 30|          101|50000.0|10000.0|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|
|        007|James Johnson| 42|          101|70000.0|14000.0|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|
+-----------+-------------+---+-------------+-------+-------+



In [None]:
emp_limit.show(10)


+-----------+-------------+---+-------------+-------+-------+
|employee_id|         name|age|department_id| salary|    tax|
+-----------+-------------+---+-------------+-------+-------+
|        001|     John Doe| 30|          101|50000.0|10000.0|
|        002|   Jane Smith| 25|          101|45000.0| 9000.0|
|        003|    Bob Brown| 35|          102|55000.0|11000.0|
|        004|    Alice Lee| 28|          102|48000.0| 9600.0|
|        005|    Jack Chan| 40|          103|60000.0|12000.0|
|        006|    Jill Wong| 32|          103|52000.0|10400.0|
|        007|James Johnson| 42|          101|70000.0|14000.0|
|        008|     Kate Kim| 29|          102|51000.0|10200.0|
+-----------+-------------+---+-------------+-------+-------+



In [None]:
columns={
    "tax":col("salary") *0.2,
    "threeNumber":lit(3),
    "fourNumber":lit("Four Number")
}
emp_new=emp_taxed.withColumns(columns)

In [15]:
emp_new.show()

NameError: name 'emp_new' is not defined

In [16]:
emp.show()

+-----------+-------------+-------------+---+------+------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|
+-----------+-------------+-------------+---+------+------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|
|        008|          102|     Kate Kim| 29|Female| 51000|2019-10-01|
|        009|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|
|        010|          104|     Lisa Lee| 27|Female| 47000|2018-08-01|
|        011|          104|   David Park| 38|  Male| 65000|2015-11-01|
|     

In [14]:
from pyspark.sql.functions import when,col,expr
emp_gender=emp.withColumn("new_gender",when(col("gender")=='Male','M').when(col("gender")=="Female",'F').otherwise(None))
emp_gender.show()

+-----------+-------------+-------------+---+------+------+----------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|new_gender|
+-----------+-------------+-------------+---+------+------+----------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|         M|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|         F|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|         M|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|         F|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|         M|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|         F|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|         M|
|        008|          102|     Kate Kim| 29|Female| 51000|2019-10-01|         F|
|        009|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|         M|
|        010|   

In [19]:
emp_gender1=emp.withColumn("new_gender",expr("case when gender='Male' then 'M' when gender='Female' then 'F' else null end"))
emp_gender1.show()

+-----------+-------------+-------------+---+------+------+----------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|new_gender|
+-----------+-------------+-------------+---+------+------+----------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|         M|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|         F|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|         M|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|         F|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|         M|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|         F|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|         M|
|        008|          102|     Kate Kim| 29|Female| 51000|2019-10-01|         F|
|        009|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|         M|
|        010|   

In [21]:
from pyspark.sql.functions import regexp_replace,col
emp_name_fixed=emp_gender1.withColumn("new_name",regexp_replace(col("name"),'J','X'))
emp_name_fixed.show()

+-----------+-------------+-------------+---+------+------+----------+----------+-------------+
|employee_id|department_id|         name|age|gender|salary| hire_date|new_gender|     new_name|
+-----------+-------------+-------------+---+------+------+----------+----------+-------------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|         M|     Xohn Doe|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|         F|   Xane Smith|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|         M|    Bob Brown|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|         F|    Alice Lee|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|         M|    Xack Chan|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|         F|    Xill Wong|
|        007|          101|James Johnson| 42|  Male| 70000|2012-03-15|         M|Xames Xohnson|
|        008|          102|     Kate Kim

In [41]:
from pyspark.sql.functions import to_date,col
emp_dated=emp_name_fixed.withColumn('hire_dated',to_date(col("hire_date"),'yyyy-MM-dd'))
emp_dated.show()

+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|new_gender|     new_name|hire_dated|
+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|         M|     Xohn Doe|2015-01-01|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|         F|   Xane Smith|2016-02-15|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|         M|    Bob Brown|2014-05-01|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|         F|    Alice Lee|2017-09-30|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|         M|    Xack Chan|2013-04-01|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|         F|    Xill Wong|2018-07-01|
|        007|          101|James John

+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|new_gender|     new_name|hire_dated|
+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+
|        001|          101|     John Doe| 30|  Male| 50000|2015-01-01|         M|     Xohn Doe|      NULL|
|        002|          101|   Jane Smith| 25|Female| 45000|2016-02-15|         F|   Xane Smith|      NULL|
|        003|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|         M|    Bob Brown|      NULL|
|        004|          102|    Alice Lee| 28|Female| 48000|2017-09-30|         F|    Alice Lee|      NULL|
|        005|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|         M|    Xack Chan|      NULL|
|        006|          103|    Jill Wong| 32|Female| 52000|2018-07-01|         F|    Xill Wong|      NULL|
|        007|          101|James John
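
The second table above shows NULL in `hire_dated`, but the cell that produced it is not part of this export. One common cause, offered here only as an assumption, is a format pattern that does not match the strings: with ANSI mode off, Spark's `to_date` yields NULL for values it cannot parse. The plain-Python sketch below mimics that behaviour; `parse_or_none` is a hypothetical helper, not a Spark function, and the `%` patterns are Python's equivalents of the Spark patterns.

```python
from datetime import date, datetime
from typing import Optional


def parse_or_none(text: str, fmt: str) -> Optional[date]:
    """Hypothetical helper: parse text with fmt, or return None on a mismatch."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


# Pattern matches the data (Spark: yyyy-MM-dd), so a date comes back.
ok = parse_or_none("2015-01-01", "%Y-%m-%d")

# Pattern does not match the data (Spark: dd-MM-yyyy), so the result is None.
bad = parse_or_none("2015-01-01", "%d-%m-%Y")

print(ok, bad)
```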

In [43]:
from pyspark.sql.functions import current_date, current_timestamp

# Add the current date and the current timestamp as two new columns.
emp_date_fix = emp_dated.withColumn("currentDate", current_date()).withColumn("currentTimeStamp", current_timestamp())
emp_date_fix.show(truncate=False)


+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+-----------+--------------------------+
|employee_id|department_id|name         |age|gender|salary|hire_date |new_gender|new_name     |hire_dated|currentDate|currentTimeStamp          |
+-----------+-------------+-------------+---+------+------+----------+----------+-------------+----------+-----------+--------------------------+
|001        |101          |John Doe     |30 |Male  |50000 |2015-01-01|M         |Xohn Doe     |2015-01-01|2025-09-05 |2025-09-05 12:59:51.614309|
|002        |101          |Jane Smith   |25 |Female|45000 |2016-02-15|F         |Xane Smith   |2016-02-15|2025-09-05 |2025-09-05 12:59:51.614309|
|003        |102          |Bob Brown    |35 |Male  |55000 |2014-05-01|M         |Bob Brown    |2014-05-01|2025-09-05 |2025-09-05 12:59:51.614309|
|004        |102          |Alice Lee    |28 |Female|48000 |2017-09-30|F         |Alice Lee    |2017-09-30|2025-09-05 |2025-0

In [44]:
# Drop the original name and gender columns, then give the cleaned columns their names.
emp_final = emp_date_fix.drop("name", "gender").withColumnRenamed("new_name", "name").withColumnRenamed("new_gender", "gender")
emp_final.show()

+-----------+-------------+---+------+----------+------+-------------+----------+-----------+--------------------+
|employee_id|department_id|age|salary| hire_date|gender|         name|hire_dated|currentDate|    currentTimeStamp|
+-----------+-------------+---+------+----------+------+-------------+----------+-----------+--------------------+
|        001|          101| 30| 50000|2015-01-01|     M|     Xohn Doe|2015-01-01| 2025-09-05|2025-09-05 13:02:...|
|        002|          101| 25| 45000|2016-02-15|     F|   Xane Smith|2016-02-15| 2025-09-05|2025-09-05 13:02:...|
|        003|          102| 35| 55000|2014-05-01|     M|    Bob Brown|2014-05-01| 2025-09-05|2025-09-05 13:02:...|
|        004|          102| 28| 48000|2017-09-30|     F|    Alice Lee|2017-09-30| 2025-09-05|2025-09-05 13:02:...|
|        005|          103| 40| 60000|2013-04-01|     M|    Xack Chan|2013-04-01| 2025-09-05|2025-09-05 13:02:...|
|        006|          103| 32| 52000|2018-07-01|     F|    Xill Wong|2018-07-01

In [46]:
from pyspark.sql.functions import col, date_format

# Render hire_date as a dd/MM/yyyy string in a new date_string column.
emp_fixed = emp_final.withColumn("date_string", date_format(col("hire_date"), "dd/MM/yyyy"))
emp_fixed.show()

+-----------+-------------+---+------+----------+------+-------------+----------+-----------+--------------------+-----------+
|employee_id|department_id|age|salary| hire_date|gender|         name|hire_dated|currentDate|    currentTimeStamp|date_string|
+-----------+-------------+---+------+----------+------+-------------+----------+-----------+--------------------+-----------+
|        001|          101| 30| 50000|2015-01-01|     M|     Xohn Doe|2015-01-01| 2025-09-05|2025-09-05 13:04:...| 01/01/2015|
|        002|          101| 25| 45000|2016-02-15|     F|   Xane Smith|2016-02-15| 2025-09-05|2025-09-05 13:04:...| 15/02/2016|
|        003|          102| 35| 55000|2014-05-01|     M|    Bob Brown|2014-05-01| 2025-09-05|2025-09-05 13:04:...| 01/05/2014|
|        004|          102| 28| 48000|2017-09-30|     F|    Alice Lee|2017-09-30| 2025-09-05|2025-09-05 13:04:...| 30/09/2017|
|        005|          103| 40| 60000|2013-04-01|     M|    Xack Chan|2013-04-01| 2025-09-05|2025-09-05 13:04:.