# MGTA 466: Programming Assignment 1

## Setup of PySpark

#### Tasks: 

* Verify that the installation is correct by following the instructions in this notebook
* Read data from HDFS, perform Spark DataFrame operations, and write the results back to HDFS

#### Submission on Gradescope:
  * You need to submit the following files under "PA1"
      * The current notebook - **PA1_Starter.ipynb**
      * File containing the first 31 lines of the result (instructions given below) - **results_30.txt**
      
#### IMPORTANT submission guidelines enforced by autograder. Please read carefully:
  * Make sure that all the cells in this notebook are executed before submission
  * Some cells are marked **DO NOT DELETE**. These cells cannot be deleted, and their output will be used for autograding
  * You can add cells or delete (NOT recommended) other cells, but the **Expected Output** for each of the tasks MUST be the output of the cells marked as such
  * DO NOT print anything other than the *exact* expected output. Do not include any sentences describing the output. This is strictly enforced by the autograder which checks for an *exact* match of the expected output. For example, if you are expected to print the PySpark version:
      * '10.9.8' - <span style="color:#093">CORRECT</span>
      * 'The PySpark version is 10.9.8' - <span style="color:#FF0000">INCORRECT</span>
  * You can add cells for printing debugging information anywhere, but do not print anything else in **Expected Output** cells other than the expected output for the task
---
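
To illustrate why any extra text fails, here is a minimal sketch of an exact-match comparison. The function `matches_expected` is hypothetical and is not the actual autograder, whose implementation is not shown in this notebook; the sketch only assumes that a cell's output is compared to the expected text as a string.

```python
def matches_expected(cell_output: str, expected: str) -> bool:
    """Hypothetical check: True only when the output equals the expected text exactly."""
    return cell_output == expected


expected = "'10.9.8'"
print(matches_expected("'10.9.8'", expected))                         # True
print(matches_expected("'The PySpark version is 10.9.8'", expected))  # False
```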

Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame. 

PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html

---

In [19]:
# Suppress native-hadoop warning
!sed -i '$a\# Add the line for suppressing the NativeCodeLoader warning \nlog4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR,console' /$HADOOP_HOME/etc/hadoop/log4j.properties

### 1. Copy data file(s) to HDFS

Spark can work with many data sources, including the Hadoop File System (HDFS). Assuming you have opened JupyterLab in the container, follow these steps to copy files from the local file system to HDFS:

<ol>
    <li type ="a">Using the File Browser on the left, navigate to the working folder. We suggest maintaining a separate folder for each assignment.</li>
    <li type ="a">Upload this notebook <code>PA1_Starter.ipynb</code> and the data file <code>BookReviews_1M.txt</code> to your working folder.</li>
    <li type ="a">Open a terminal by navigating to <code>Terminal</code> &#8594; <code>New Terminal</code>. <code>cd</code> to your working folder.</li>
    <li type ="a">Initialize and start up HDFS, similar to the demo from class.
    </li>
    <li type ="a"> Now copy the <code>BookReviews_1M.txt</code> data file from your local file system to the root of the Hadoop File System, similar to the demo from class.</li>
    <li type ="a"> Run <code>hdfs dfs -ls /</code> to list the files and check that the file was copied.</li>
</ol>
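
The copy and listing steps can also be scripted from Python. The following is a minimal sketch, assuming the `hdfs` CLI is on the `PATH` and HDFS is already running; the helper `hdfs_dfs` is hypothetical, and the demo from class may use different commands or paths.

```python
import shlex
import shutil
import subprocess


def hdfs_dfs(*args: str) -> list[str]:
    """Build an `hdfs dfs` command line as an argument list (hypothetical helper)."""
    return ["hdfs", "dfs", *args]


# Copy the local data file to the HDFS root, then list the root directory.
put_cmd = hdfs_dfs("-put", "BookReviews_1M.txt", "/")
ls_cmd = hdfs_dfs("-ls", "/")

for cmd in (put_cmd, ls_cmd):
    print(shlex.join(cmd))
    # Only execute when the hdfs CLI is actually installed.
    if shutil.which("hdfs"):
        subprocess.run(cmd, check=False)
```

Building the command as a list avoids shell-quoting problems if a file name ever contains spaces.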

#### **Expected output**: None

### 2. Start Spark Session - 1 point

Start the Spark session with a configuration for local execution, then output the PySpark version. Remember to first import PySpark and SparkSession.

#### **Expected output** - PySpark version

In [20]:
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
	('spark.master', 'local[1]'),
	('spark.app.name', 'App Name')
])
spark = SparkSession.builder.config(conf=conf).getOrCreate()

pyspark.__version__

'3.5.1'

### 3. Load Data - 1 point

Read data from the `BookReviews_1M.txt` file on HDFS, and print the number of rows in that dataframe.

#### **Expected output**: Number of rows of the dataframe i.e. number of lines in the file

In [21]:
# Number of rows of the dataframe i.e. number of lines in the file
df = spark.read.text("hdfs:///Demo1-HDFS/BookReviews_1M.txt").cache()
df.head()
df.count()

1000000

### 4. Examine the data - 2 points

Examine the contents of the dataframe that you've just read from file.

Expected output: 
<ol>
    <li type = "a">Schema of the raw dataframe</li>
    <li type = "a">First 10 rows of the dataframe</li>
    <li type = "a">First 10 rows of the dataframe showing the <b>entire</b> sentence. Pass <code>truncate=False</code> as argument to <code>DataFrame.show</code> to see the entire sentence.</li>
</ol>


#### **Expected output**: Schema of the raw dataframe

In [22]:
df.printSchema()

root
 |-- value: string (nullable = true)



#### **Expected output**: First 10 rows of the dataframe.

In [23]:
df.show(10)

+--------------------+
|               value|
+--------------------+
|This was the firs...|
|Also after going ...|
|As with all of Ms...|
|I've not read any...|
|This romance nove...|
|Carolina Garcia A...|
|Not only can she ...|
|Once again Garcia...|
|The timing is jus...|
|Engaging. Dark. R...|
+--------------------+
only showing top 10 rows



#### **Expected output**: First 10 rows of the dataframe showing the entire sentence.

In [24]:
df.show(10, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                

### 5. Get the first 100 characters of text in each row of the data and convert them to lowercase - 2 points

<ol>
    <li type ="a">Get the first 100 characters of text in each row of the dataframe and convert them to lowercase.</li>
    <li type ="a">Alias/Rename the resulting column name to <code>lowerfirst100</code> in the new dataframe</li>
    <li type ="a">Useful functions - <p><a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.substring.html#pyspark.sql.functions.substring">substring</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.alias.html">Column.alias</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html">DataFrame.select</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lower.html">lower</a></p></li>
</ol>
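
One detail worth keeping in mind: the `pos` argument of Spark's `substring` is 1-based, so the first 100 characters start at position 1, not 0. The sketch below is a pure-Python analogue of the per-row transformation for typical text; it does not use Spark, and the helper name `lower_first_100` is hypothetical.

```python
from typing import Optional


def lower_first_100(text: Optional[str]) -> Optional[str]:
    """Pure-Python analogue of lower(substring(col, 1, 100)) for a single row."""
    if text is None:  # Spark propagates nulls through both functions
        return None
    # Spark position 1 corresponds to Python index 0.
    return text[:100].lower()


row = "This was the first time I read Garcia-Aguilera."
print(lower_first_100(row))  # this was the first time i read garcia-aguilera.
```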


#### **Expected output**: First 10 rows of the resulting dataframe with a single column name `lowerfirst100`

NOTE:  Pass <code>truncate=False</code> as argument to <code>DataFrame.show()</code>

In [25]:
from pyspark.sql.functions import substring, lower

# The single text column produced by spark.read.text is named "value"
df_transformed = df.select(
    lower(substring("value", 1, 100)).alias("lowerfirst100")
)

# Show the first 10 rows of the new dataframe
df_transformed.show(10, truncate=False)

+----------------------------------------------------------------------------------------------------+
|lowerfirst100                                                                                       |
+----------------------------------------------------------------------------------------------------+
|this was the first time i read garcia-aguilera.  i came upon the name of this book on live with regi|
|also after going through all of daisy's perils ... i closed the book with a feeling i had grown emot|
|as with all of ms. garcia-aguilera's books, i think this is a must read, impossible to put down. suc|
|i've not read any of ms aguilera's works before, but after having just finished one hot summer i'm g|
|this romance novel is right up there with the rest of her amazing mystery novels.  being a guy, i wa|
|carolina garcia aguilera has done it again.  she's written another highly enjoyable book and infused|
|not only can she write mysteries,but she sure can write a love story! th

### 6. Save results to HDFS - 1 point

NOTE: Spark uses a distributed memory system, and stores working data in fragments known as "partitions". This is advantageous when a Spark cluster spans multiple machines, as each machine will only require part of the working data to do its own job. By default, Spark will save each of these data partitions into an individual file to avoid I/O collisions. We want only one output file, so we'll need to fuse all the data into a single partition first. 

Your task: 
<ol>
    <li type="a">Coalesce the previous dataframe to one partition using <code>DataFrame.coalesce(1)</code>. This returns a 1-partition dataframe. This makes sure that all our results will end up in the same text file.</li>
    <li type = "a">Save the 1-partition dataframe to HDFS using the <code>DataFrame.write.text(&lt;path&gt;)</code> method to the root directory of the HDFS, i.e. <code>hdfs:///BookReviews_1M_lowerfirst100.txt</code>.</li>
</ol>
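
The effect of coalescing can be pictured without Spark. The following is a toy simulation, not Spark code: `write_partitions` and `coalesce_to_one` are hypothetical helpers, the `part-0000N.txt` names are simplified compared with the names Spark really generates, and row order is preserved only because this toy version chooses to.

```python
import tempfile
from pathlib import Path


def write_partitions(partitions: list[list[str]], out_dir: Path) -> list[str]:
    """Write one text file per partition, so the saved 'file' is really a folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for i, rows in enumerate(partitions):
        name = f"part-{i:05d}.txt"
        (out_dir / name).write_text("".join(r + "\n" for r in rows))
        names.append(name)
    return names


def coalesce_to_one(partitions: list[list[str]]) -> list[list[str]]:
    """Merge all partitions into a single partition."""
    return [[row for part in partitions for row in part]]


partitions = [["row a", "row b"], ["row c"], ["row d", "row e"]]
with tempfile.TemporaryDirectory() as tmp:
    many = write_partitions(partitions, Path(tmp) / "many.txt")
    one = write_partitions(coalesce_to_one(partitions), Path(tmp) / "one.txt")

print(many)  # three part files, one per partition
print(one)   # a single part file after coalescing
```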

#### **Expected output**: None

In [29]:
df_single_partition = df_transformed.coalesce(1)

df_single_partition.write.mode("overwrite").text("hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt")

Counterintuitively, the resultant file saved in the step above is actually a folder, which contains individually saved files from each partition of the saved dataframe. <br> <br>
Now, use an HDFS command to show the contents of the resulting folder on HDFS from the last step in the cell below. <br>
You will need to include `!` before the HDFS command for Jupyter Notebook to recognize it as an operating system command. For instance, `! pwd` displays the path of your current directory.

In [None]:
! pwd

/home/jovyan/MGTA466:Session1/Demo1-Local


#### **Expected output**: List of files in the result directory

In [30]:
! hdfs dfs -ls hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt

Found 2 items
-rw-r--r--   1 jovyan supergroup          0 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/_SUCCESS
-rw-r--r--   1 jovyan supergroup   81404214 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/part-00000-5b8e635d-fcc2-46c8-8143-0129a1fbf49d-c000.txt
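
The part file name differs on every run, so it can be convenient to extract it from the listing rather than copy it by hand. The following is a minimal sketch, assuming the listing has the eight-column layout shown above and that paths contain no spaces; `find_part_files` is a hypothetical helper.

```python
def find_part_files(ls_output: str) -> list[str]:
    """Return paths from `hdfs dfs -ls` output whose file name starts with 'part-'."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip the "Found N items" header and blank lines; entry lines have 8 fields.
        if len(fields) < 8:
            continue
        path = fields[-1]
        if path.rsplit("/", 1)[-1].startswith("part-"):
            paths.append(path)
    return paths


listing = """Found 2 items
-rw-r--r--   1 jovyan supergroup          0 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/_SUCCESS
-rw-r--r--   1 jovyan supergroup   81404214 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/part-00000-5b8e635d-fcc2-46c8-8143-0129a1fbf49d-c000.txt
"""
print(find_part_files(listing))
```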


### 7. Copy the results from HDFS to the local file system - 0.5 points

Now that we have our results stored in HDFS, we need to copy them back to the local file system to access them. This process may sound cumbersome, but it is a necessary consequence of Spark and Hadoop's distributed architecture, which lets them scale up to arbitrarily large datasets and computing operations. 

Copying the results from HDFS to the local file system is fairly simple. Here are the steps:
<ol>
    <li type ="a" >Run an hdfs command in the terminal to list the root directory of the HDFS. You should see the text file that you have saved. Counterintuitively, this text file is actually a folder, which contains individually saved files from each partition of the saved dataframe (see above for data partitioning). </li>
    <li type ="a">Run another hdfs command to see what's inside the saved folder. Since we made sure to coalesce our dataframe to just one partition, we should expect to find only one saved partition in this folder, saved also as a txt. Note the name of this file, it should look something like <code>part-00000-xx.....xx.txt</code>. </li>
    <li type ="a">Now, copy the resultant txt file from HDFS to the current folder on your local file system using an hdfs command in the terminal. You may rename this file to something more interpretable like <code>results.txt</code>.</li>
    <li type ="a">We only want you to submit a <code>.txt</code> containing the first 31 lines of the results file. To do this, you can use the command <code>head -n 31 results.txt > results_30.txt</code>.</li>
</ol>
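
For reference, the `head` step can also be done in Python. The following is a minimal sketch; the helper `copy_first_lines` is hypothetical, and the demonstration uses a throwaway 100-line file in a temporary directory as a stand-in for the real `results.txt`.

```python
import tempfile
from itertools import islice
from pathlib import Path


def copy_first_lines(src: Path, dst: Path, n: int = 31) -> int:
    """Copy the first n lines of src to dst, like `head -n 31 src > dst`."""
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        lines = list(islice(fin, n))
        fout.writelines(lines)
    return len(lines)


# Demonstration on a sample file standing in for results.txt.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "results.txt"
    dst = Path(tmp) / "results_30.txt"
    src.write_text("".join(f"line {i}\n" for i in range(1, 101)), encoding="utf-8")
    written = copy_first_lines(src, dst)
    last_line = dst.read_text(encoding="utf-8").splitlines()[-1]

print(written, last_line)  # 31 line 31
```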

#### **Expected output**: None

In [31]:
! hdfs dfs -ls /

Found 3 items
drwxr-xr-x   - jovyan supergroup          0 2025-02-15 16:05 /BookReviews_1M_lowerfirst100.txt
-rw-r--r--   1 jovyan supergroup  219041604 2025-02-15 15:59 /Demo1-
drwxr-xr-x   - jovyan supergroup          0 2025-02-15 16:21 /Demo1-HDFS


In [32]:
! hdfs dfs -ls hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt

Found 2 items
-rw-r--r--   1 jovyan supergroup          0 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/_SUCCESS
-rw-r--r--   1 jovyan supergroup   81404214 2025-02-15 16:21 hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/part-00000-5b8e635d-fcc2-46c8-8143-0129a1fbf49d-c000.txt


In [35]:
! hdfs dfs -get hdfs:///Demo1-HDFS/BookReviews_1M_lowerfirst100.txt/part-00000-5b8e635d-fcc2-46c8-8143-0129a1fbf49d-c000.txt results.txt

In [36]:
! head -n 31 results.txt > results_30.txt

### 8 a. Stop Spark Instance

In [None]:
# Stop Spark session

# spark.stop()

### 8 b. Stop HDFS

In [None]:
# Stop HDFS

# !$HADOOP_HOME/stop-dfs.sh

Stopping HDFS ...


### 9. Submission of `results_30.txt` - 2 points

#### **Expected output** (in the results_30.txt file) - First 31 lines of the results in step 5.
The autograder will check whether the results that you submit in the `results_30.txt` file match the expected results **exactly**.