# Consumer Complaints Challenge with Apache Spark

We are greatly inspired by the [Consumer Complaints](https://github.com/InsightDataScience/consumer_complaints) challenge from [InsightDataScience](https://github.com/InsightDataScience/). In fact, we are going to tackle the same challenge but using Apache Spark. Please read through the challenge at the following link:

<https://github.com/InsightDataScience/consumer_complaints>

The most important sections are **Input dataset** and **Expected output**, which are quoted below:

## Input dataset
For this challenge, when we grade your submission, an input file, `complaints.csv`, will be moved to the top-most `input` directory of your repository. Your code must read that input file, process it and write the results to an output file, `report.csv` that your code must place in the top-most `output` directory of your repository.

Below are the contents of an example `complaints.csv` file:
```
Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,"transworld systems inc. is trying to collect a debt that is not mine, not owed and is inaccurate.",,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,N/A,3384392
2019-09-19,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,N/A,3379500
2020-01-06,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,Experian Information Solutions Inc.,CA,92532,,N/A,Email,2020-01-06,In progress,Yes,N/A,3486776
2019-10-24,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",CA,925XX,,Other,Web,2019-10-24,Closed with explanation,Yes,N/A,3416481
2019-11-20,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Account information incorrect,I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX,Company has responded to the consumer and the CFPB and chooses not to provide a public response,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",TX,77004,,Consent provided,Web,2019-11-20,Closed with explanation,Yes,N/A,3444592
```
Each line of the input file, except for the first-line header, represents one complaint. Consult the [Consumer Financial Protection Bureau's technical documentation](https://cfpb.github.io/api/ccdb/fields.html) for a description of each field.

* Notice that complaints were not listed in chronological order
* In 2019, there was a complaint against `TRANSWORLD SYSTEMS INC` for `Debt collection`
* Also in 2019, `Experian Information Solutions Inc.` received one complaint for `Credit reporting, credit repair services, or other personal consumer reports` while `TRANSUNION INTERMEDIATE HOLDINGS, INC.` received two
* In 2020, `Experian Information Solutions Inc.` received a complaint for `Credit reporting, credit repair services, or other personal consumer reports`

In summary, that means:
* In 2019, there was one complaint for `Debt collection`, and 100% of it went to one company
* Also in 2019, three complaints against two companies were received for `Credit reporting, credit repair services, or other personal consumer reports`, and two-thirds of them (or 67% when rounded to the nearest whole number) were against one company (TRANSUNION INTERMEDIATE HOLDINGS, INC.)
* In 2020, only one complaint was received for `Credit reporting, credit repair services, or other personal consumer reports`, and so the highest percentage received by one company would be 100%

For this challenge, we want, for each product and year in which complaints were received, the total number of complaints, the number of companies receiving a complaint, and the highest percentage of complaints directed at a single company.

For the purposes of this challenge, all names, including company and product, should be treated as case insensitive. For example, "Acme", "ACME", and "acme" would represent the same company.
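
The aggregation described above can be sketched in plain Python on the five sample rows. This is only an illustration of the arithmetic, not the Spark solution; the `rows` list and the `summarize` helper are hypothetical names introduced here, and only the three columns that matter are kept.

```python
from collections import Counter, defaultdict

CREDIT = "Credit reporting, credit repair services, or other personal consumer reports"

# (date received, product, company) taken from the example complaints.csv above
rows = [
    ("2019-09-24", "Debt collection", "TRANSWORLD SYSTEMS INC"),
    ("2019-09-19", CREDIT, "Experian Information Solutions Inc."),
    ("2020-01-06", CREDIT, "Experian Information Solutions Inc."),
    ("2019-10-24", CREDIT, "TRANSUNION INTERMEDIATE HOLDINGS, INC."),
    ("2019-11-20", CREDIT, "TRANSUNION INTERMEDIATE HOLDINGS, INC."),
]

def summarize(rows):
    """Return (product, year, total, companies, max_pct) tuples sorted by product and year."""
    per_group = defaultdict(Counter)
    for date, product, company in rows:
        # names are case insensitive, so normalize them to lowercase
        per_group[(product.lower(), int(date[:4]))][company.lower()] += 1
    report = []
    for (product, year), counts in sorted(per_group.items()):
        total = sum(counts.values())
        top = max(counts.values())
        # integer arithmetic for 100 * top / total, rounded half up
        pct = (200 * top + total) // (2 * total)
        report.append((product, year, total, len(counts), pct))
    return report

report = summarize(rows)
for line in report:
    print(line)
```

Grouping first by (product, year) and then counting per company mirrors the two-level aggregation that the challenge asks for.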

## Expected output

After reading and processing the input file, your code should create an output file, `report.csv`, with as many lines as unique pairs of product and year (of `Date received`) in the input file.

Each line in the output file should list the following fields in the following order:
* product (name should be written in all lowercase)
* year
* total number of complaints received for that product and year
* total number of companies receiving at least one complaint for that product and year
* highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year. Use standard rounding conventions (i.e., any percentage between 0.5% and 1%, inclusive, should round to 1%, and anything less than 0.5% should round to 0%)
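
One caveat when implementing this rounding convention in Python: the built-in `round` sends exact halves to the nearest even number, so it does not follow the half-up rule stated above. A minimal sketch of half-up rounding with the standard `decimal` module follows; `round_half_up` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(numerator, denominator):
    """Percentage numerator/denominator, rounded half up to a whole number."""
    pct = Decimal(100 * numerator) / Decimal(denominator)
    return int(pct.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

print(round(0.5), round(12.5))   # built-in round: halves go to the even neighbour
print(round_half_up(1, 200))     # exactly 0.5%
print(round_half_up(1, 8))       # exactly 12.5%
print(round_half_up(2, 3))       # about 66.67%
```

In Spark, `F.round` rounds half up, whereas `F.bround` rounds half to even, so `F.round` is the one that matches this convention.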

The lines in the output file should be sorted by product (alphabetically) and year (ascending).

Given the above `complaints.csv` input file, we'd expect an output file, `report.csv`, in the following format:
```
"credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67
"credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100
debt collection,2019,1,1,100
```
Notice that because `debt collection` was only listed for 2019 and not 2020, the output file only has a single entry for debt collection. Also, notice that when a product has a comma (`,`) in the name, the name should be enclosed by double quotation marks (`"`). Finally, notice that percentages are listed as numbers and do not have `%` in them.
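
To check that a report follows these quoting rules, it can be parsed back with Python's `csv` module; each line should yield exactly five fields. The sketch below uses the three expected lines from above as an inline string (the `expected` variable exists only for illustration).

```python
import csv
import io

expected = (
    '"credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67\n'
    '"credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100\n'
    'debt collection,2019,1,1,100\n'
)

parsed = list(csv.reader(io.StringIO(expected)))
for fields in parsed:
    # a quoted product name stays one field despite its embedded commas
    print(len(fields), fields)
```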

# Objectives

In this homework, we will tackle the above problem in two steps (2 tasks):

1. In Task 1, we work on a solution with PySpark on Google Colab using a sample of the data. The sample is available on Google Drive and is downloaded with the `gdown` command in the setup cell below.

2. In Task 2, we create a standalone Python script that works on the full dataset using GCP DataProc. The full dataset can be downloaded [here](https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data) and is also available in the class bucket as `gs://bdma/data/complaints.csv`.



## Environment Setup

In [None]:
%%shell
gdown --quiet 1-IeoZDwT5wQzBUpsaS5B6vTaP-2ZBkam
pip --quiet install pyspark




In [None]:
COMPLAINTS_FN = 'complaints_sample.csv'

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
sc = pyspark.SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
spark

## Task 1

Use PySpark to derive the expected output. Your computation must be done entirely with Spark transformations. The output MUST be in CSV form, i.e. each output line is a complete comma-separated string that can be fed into a CSV reader. It is okay if your output is divided into multiple parts (due to the distributed nature of Spark).

In [None]:
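import csv
# count records with Python's CSV reader (it handles quoted multi-line fields); minus 1 for the header
with open(COMPLAINTS_FN, 'r', newline='') as f:
    print(len(list(csv.reader(f))) - 1)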

6623

In [None]:
!ls -lh .

total 3.8M
-rw-r--r-- 1 root root 3.8M Apr 15 04:48 complaints_sample.csv
drwxr-xr-x 1 root root 4.0K Apr 13 13:30 sample_data


In [None]:
!wc -l {COMPLAINTS_FN}

9965 complaints_sample.csv


In [None]:
!head -n 7 {COMPLAINTS_FN}

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
2015-12-31,Bank account or service,Checking account,"Making/receiving payments, sending money",,,,FIRSTBANK PUERTO RICO,PR,00902,Older American,N/A,Referral,2016-02-04,Closed with explanation,Yes,No,1723943
2016-03-15,Bank account or service,Other bank product/service,Problems caused by my funds being low,,,,FIRSTBANK PUERTO RICO,PR,00926,,Consent not provided,Web,2016-03-15,Closed with explanation,Yes,No,1833740
2016-10-24,Bank account or service,Checking account,"Account opening, closing, or management",,"In the month of XX/XX/2015, my email address ( XXXX ) was hacked and used to send messages to people associated with my business. At that time, transactions for the purchase and sales of products were made. The ha

In [None]:
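# naive read for comparison: without the multiLine option, records that contain
# line breaks are split, so this count does not match the csv.reader count above
df = spark.read.load(COMPLAINTS_FN, format='csv', header=True)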
print(df.count())
df.show(10)

8595
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+-----------------------+--------------------+-----+----------+--------------------+--------------------------+-------------+--------------------+----------------------------+----------------+------------------+------------+
|       Date received|             Product|         Sub-product|               Issue|           Sub-issue|Consumer complaint narrative|Company public response|             Company|State|  ZIP code|                Tags|Consumer consent provided?|Submitted via|Date sent to company|Company response to consumer|Timely response?|Consumer disputed?|Complaint ID|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+-----------------------+--------------------+-----+----------+--------------------+--------------------------+-------------+------------------

In [None]:
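# multiLine keeps quoted fields with embedded newlines in one record;
# escape='"' treats a doubled quote inside a quoted field as a literal quote
dfA = spark.read.option('escape', '"').option('multiLine', True).csv(COMPLAINTS_FN, header=True, inferSchema=True) \
     .select('Date received','Product','Company')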
print(dfA.count())
dfA.show(20)

6623
+-------------+--------------------+--------------------+
|Date received|             Product|             Company|
+-------------+--------------------+--------------------+
|   2015-12-31|Bank account or s...|FIRSTBANK PUERTO ...|
|   2016-03-15|Bank account or s...|FIRSTBANK PUERTO ...|
|   2016-10-24|Bank account or s...|WELLS FARGO & COM...|
|   2017-09-08|Checking or savin...|            Comerica|
|   2018-09-19|Checking or savin...|WELLS FARGO & COM...|
|   2018-12-04|Checking or savin...|BANK OF AMERICA, ...|
|   2018-12-05|Checking or savin...|HSBC NORTH AMERIC...|
|   2018-12-06|Checking or savin...|JPMORGAN CHASE & CO.|
|   2018-12-10|Checking or savin...|NAVY FEDERAL CRED...|
|   2018-12-10|Checking or savin...|JPMORGAN CHASE & CO.|
|   2018-12-13|Checking or savin...|       PNC Bank N.A.|
|   2018-12-13|Checking or savin...|REGIONS FINANCIAL...|
|   2018-12-16|Checking or savin...|UNITED SERVICES A...|
|   2018-12-17|Checking or savin...|BBVA FINANCIAL CO...|
|   2018-

In [None]:
from pyspark.sql.functions import year, lower
dfB = dfA.withColumn('Product', lower('Product')) \
         .withColumn('Company', lower('Company'))

print(dfB.count())
dfB.show(20)

6623
+-------------+--------------------+--------------------+
|Date received|             Product|             Company|
+-------------+--------------------+--------------------+
|   2015-12-31|bank account or s...|firstbank puerto ...|
|   2016-03-15|bank account or s...|firstbank puerto ...|
|   2016-10-24|bank account or s...|wells fargo & com...|
|   2017-09-08|checking or savin...|            comerica|
|   2018-09-19|checking or savin...|wells fargo & com...|
|   2018-12-04|checking or savin...|bank of america, ...|
|   2018-12-05|checking or savin...|hsbc north americ...|
|   2018-12-06|checking or savin...|jpmorgan chase & co.|
|   2018-12-10|checking or savin...|navy federal cred...|
|   2018-12-10|checking or savin...|jpmorgan chase & co.|
|   2018-12-13|checking or savin...|       pnc bank n.a.|
|   2018-12-13|checking or savin...|regions financial...|
|   2018-12-16|checking or savin...|united services a...|
|   2018-12-17|checking or savin...|bbva financial co...|
|   2018-

In [None]:
# calculate the required values: count complaints per (product, year, company),
# then aggregate per (product, year); F.round rounds half up
df2 = dfB.groupBy('Product', year('Date received').alias('Year'), 'Company') \
       .agg(F.count('*').alias('Complaints')) \
       .groupBy('Product', 'Year') \
       .agg(F.sum('Complaints').alias('Total Complaints'),
            F.countDistinct('Company').alias('Total Companies'),
            F.max('Complaints').alias('Max Complaints')) \
       .withColumn('Max Percentage', F.round(F.col('Max Complaints')/F.col('Total Complaints') * 100).cast(T.IntegerType())) \
       .select('Product', 'Year', 'Total Complaints', 'Total Companies', 'Max Percentage')

print(df2.count())
df2.show(20)

46
+--------------------+----+----------------+---------------+--------------+
|             Product|Year|Total Complaints|Total Companies|Max Percentage|
+--------------------+----+----------------+---------------+--------------+
|credit reporting,...|2019|            3114|            203|            50|
|         credit card|2016|               4|              4|            25|
|checking or savin...|2020|               3|              3|            33|
|money transfer, v...|2019|              87|             33|            33|
|            mortgage|2018|              39|             27|            10|
|credit card or pr...|2019|             437|             42|            15|
|       consumer loan|2015|               1|              1|           100|
|     debt collection|2017|              13|             11|            15|
|            mortgage|2019|             415|             98|            10|
|        student loan|2019|             157|             37|            37|
|payday l

In [None]:
# sort the output by product and year
df3 = df2.orderBy('Product', 'Year')
print(df3.count())
df3.show(20)

46
+--------------------+----+----------------+---------------+--------------+
|             Product|Year|Total Complaints|Total Companies|Max Percentage|
+--------------------+----+----------------+---------------+--------------+
|bank account or s...|2015|               1|              1|           100|
|bank account or s...|2016|               2|              2|            50|
|checking or savin...|2017|               1|              1|           100|
|checking or savin...|2018|              20|             10|            25|
|checking or savin...|2019|             461|             72|            13|
|checking or savin...|2020|               3|              3|            33|
|       consumer loan|2015|               1|              1|           100|
|       consumer loan|2016|               1|              1|           100|
|       consumer loan|2017|               1|              1|           100|
|         credit card|2016|               4|              4|            25|
|        

In [None]:
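# outputTask1 is an output RDD; a DataFrame works as well, but each line
# still needs to be a string.
# Assumption: the cell that built outputTask1 is not shown. A plain join like this
# one matches the output below (note that it does not quote embedded commas).
outputTask1 = df3.rdd.map(lambda row: ','.join(str(v) for v in row))
outputTask1.take(20)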

['bank account or service,2015,1,1,100',
 'bank account or service,2016,2,2,50',
 'checking or savings account,2017,1,1,100',
 'checking or savings account,2018,20,10,25',
 'checking or savings account,2019,461,72,13',
 'checking or savings account,2020,3,3,33',
 'consumer loan,2015,1,1,100',
 'consumer loan,2016,1,1,100',
 'consumer loan,2017,1,1,100',
 'credit card,2016,4,4,25',
 'credit card,2017,1,1,100',
 'credit card or prepaid card,2017,1,1,100',
 'credit card or prepaid card,2018,27,12,33',
 'credit card or prepaid card,2019,437,42,15',
 'credit card or prepaid card,2020,13,10,23',
 'credit reporting, credit repair services, or other personal consumer reports,2017,7,5,29',
 'credit reporting, credit repair services, or other personal consumer reports,2018,238,22,56',
 'credit reporting, credit repair services, or other personal consumer reports,2019,3114,203,50',
 'credit reporting, credit repair services, or other personal consumer reports,2020,144,10,51',
 'debt collection,20
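
Note that the lines above leave product names containing commas unquoted, so a CSV reader would split them into extra fields. One possible fix, sketched here under the assumption that each row is a sequence of five values, is to build every line with Python's `csv` writer. `to_csv_line` is a hypothetical helper; in Spark it could be applied with something like `df3.rdd.map(to_csv_line)`.

```python
import csv
import io

def to_csv_line(row):
    """Format one row of values as a single CSV line, quoting only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    # drop the line terminator so the result is a bare string
    return buffer.getvalue().rstrip("\n")

print(to_csv_line(("debt collection", 2019, 1, 1, 100)))
print(to_csv_line(("credit reporting, credit repair services, or other personal consumer reports", 2017, 7, 5, 29)))
```

The writer's default minimal quoting adds double quotes only to fields that need them, which is the behavior shown in the expected `report.csv`.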

## Task 2

For this task, please convert what you have in Task 1 to a standalone file that can be run on any DataProc cluster. The input and output locations must be taken from the command line, e.g. using my cluster named `bdma`:

```shell
gcloud dataproc jobs submit pyspark --cluster bdma BDM_HW3_EMPLID_LastName.py -- gs://bdma/data/complaints.csv gs://bdma/shared/2023_spring/HW3/EMPLID_LastName
```

As part of the test, you must be able to run your code and output to the class shared folder, i.e.: `gs://bdma/shared/2023_spring/HW3/EMPLID_LastName`, **replacing `EMPLID` and `LastName` with your actual EMPL ID and Last Name**.

Note that if you run your code multiple times, write to the shared folder only with your working version; otherwise, you must remove the existing output before running your code again.
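
Reading the two locations from the command line can be as simple as indexing `sys.argv`, but a small check makes a wrong invocation fail with a clear message. The sketch below uses the standard `argparse` module; the `parse_paths` name and the argument names are assumptions for illustration only.

```python
import argparse

def parse_paths(argv=None):
    """Return (input_path, output_path) parsed from the command line."""
    parser = argparse.ArgumentParser(description="Consumer complaints report")
    parser.add_argument("input_path", help="location of complaints.csv, e.g. a gs:// URI")
    parser.add_argument("output_path", help="folder where the CSV report is written")
    args = parser.parse_args(argv)
    return args.input_path, args.output_path

# Example with an explicit argument list; a script would call parse_paths() instead
input_path, output_path = parse_paths([
    "gs://bdma/data/complaints.csv",
    "gs://bdma/shared/2023_spring/HW3/EMPLID_LastName",
])
print(input_path, output_path)
```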

### For PhD students

Your solution must take multi-line records into account, without coalescing into a single partition, and count them all in the output. We should have *3,458,906* records in the full dataset.
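
A record becomes multi-line when a quoted field, typically the complaint narrative, contains a line break, so counting physical lines overstates the number of complaints. The toy example below (made-up data, not taken from the dataset) shows the difference using Python's `csv` module.

```python
import csv
import io

# Two records; the narrative of the first one spans three physical lines (made-up data)
text = (
    'Date received,Product,Narrative,Company\n'
    '2019-01-02,Mortgage,"first line\n'
    'second line\n'
    'third line",ACME\n'
    '2019-01-03,Mortgage,,ACME\n'
)

physical_lines = len(text.splitlines()) - 1               # minus the header
records = len(list(csv.reader(io.StringIO(text)))) - 1    # minus the header
print(physical_lines, records)
```

In Spark, the CSV reader's `multiLine` option serves the same purpose of keeping such a record together.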

In [None]:
!pip install google-cloud-dataproc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-cloud-dataproc
  Downloading google_cloud_dataproc-5.4.1-py2.py3-none-any.whl (307 kB)
Collecting grpc-google-iam-v1<1.0.0dev,>=0.12.4
  Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Installing collected packages: grpc-google-iam-v1, google-cloud-dataproc
Successfully installed google-cloud-dataproc-5.4.1 grpc-google-iam-v1-0.12.6


In [None]:
!gcloud auth login

You are now logged in as [sumaiyauddin1995@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_I

In [None]:
!gcloud projects list

PROJECT_ID             NAME              PROJECT_NUMBER
big-data-380022        big data          901723776519
rich-suprstate-380020  My First Project  72957410723


In [None]:
!gcloud config set project big-data-380022
!gcloud config set compute/region us-west1
!gcloud config set compute/zone us-west1-a
!gcloud config set dataproc/region us-west1

Updated property [core/project].
Updated property [compute/region].
Updated property [compute/zone].
Updated property [dataproc/region].


In [None]:
!gcloud dataproc clusters create bdma --enable-component-gateway --region us-west1 --zone us-west1-a --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 2.0-debian10 --project big-data-380022

ERROR: (gcloud.dataproc.clusters.create) ALREADY_EXISTS: Already exists: Failed to create cluster: Cluster projects/big-data-380022/regions/us-west1/clusters/bdma


In [None]:
!gcloud dataproc clusters list

NAME  PLATFORM  WORKER_COUNT  PREEMPTIBLE_WORKER_COUNT  STATUS   ZONE        SCHEDULED_DELETE
bdma  GCE       2                                       RUNNING  us-west1-a


In [None]:
!gcloud dataproc clusters describe bdma

clusterName: bdma
clusterUuid: a199b803-8e6d-40e6-8f25-be0de8a8d5db
config:
  configBucket: dataproc-staging-us-west1-901723776519-ddnkwogk
  endpointConfig:
    enableHttpPortAccess: true
    httpPorts:
      HDFS NameNode: https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/hdfs/dfshealth.html
      HiveServer2 (bdma-m): https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/hiveserver2ui/bdma-m?host=bdma-m
      MapReduce Job History: https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/jobhistory/
      Spark History Server: https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/sparkhistory/
      Tez: https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/apphistory/tez-ui/
      YARN Application Timeline: https://kf5ele4aq5cqvdvfjmutptdjpu-dot-us-west1.dataproc.googleusercontent.com/apphistory/
      YARN ResourceManager: https://kf5ele4aq5cqvdvfjmutptdjpu

In [None]:
%%writefile BDM_HW3_24373710_Uddin.py
#!/usr/bin/python

import pyspark
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import lower, year

if __name__ == '__main__':
    # read input and output paths from command line arguments
    input_path = sys.argv[1]
    output_path = sys.argv[2]

    spark = SparkSession.builder.config("spark.sql.shuffle.partitions", "200").getOrCreate()

    # read the CSV file into a DataFrame; multiLine keeps quoted fields with
    # embedded newlines in one record, and the schema is inferred from the data
    dfA = spark.read.option('escape', '"').option('multiLine', True).csv(input_path, header=True, inferSchema=True) \
     .select('Date received','Product','Company')

    # lowercase product and company columns
    dfB = dfA.withColumn('Product', lower('Product')) \
         .withColumn('Company', lower('Company'))

    # calculate the required values: count complaints per (product, year, company),
    # then aggregate per (product, year); F.round rounds half up
    df2 = dfB.groupBy('Product', year('Date received').alias('Year'), 'Company') \
          .agg(F.count('*').alias('Complaints')) \
          .groupBy('Product', 'Year') \
          .agg(F.sum('Complaints').alias('Total Complaints'),
                F.countDistinct('Company').alias('Total Companies'),
                F.max('Complaints').alias('Max Complaints')) \
          .withColumn('Max Percentage', F.round(F.col('Max Complaints')/F.col('Total Complaints') * 100).cast(T.IntegerType())) \
          .select('Product', 'Year', 'Total Complaints', 'Total Companies', 'Max Percentage')

    # sort by product and year
    df3 = df2.orderBy('Product', 'Year')

    # write the report as CSV; the expected report.csv has no header row
    df3.write.format('csv').option('header', False).mode('overwrite').save(output_path)

    # stop SparkSession
    spark.stop()

    print('Completed')


Writing BDM_HW3_24373710_Uddin.py


In [None]:
# !gcloud dataproc jobs submit pyspark --cluster bdma BDM_HW3_24373710_Uddin.py gs://bdma/data/complaints.csv gs://bdma/shared/2023_spring/HW3/24373710_Uddin

In [None]:
!gcloud dataproc jobs submit pyspark --cluster bdma BDM_HW3_24373710_Uddin.py -- gs://bdma/data/complaints.csv gs://bdma/shared/2023_spring/HW3/24373710_Uddin

Job [fbdf9c7d90ec4e5080259b35e88f4c2c] submitted.
Waiting for job output...
23/04/15 06:01:24 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
23/04/15 06:01:24 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
23/04/15 06:01:24 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
23/04/15 06:01:24 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
23/04/15 06:01:25 INFO org.sparkproject.jetty.util.log: Logging initialized @4793ms to org.sparkproject.jetty.util.log.Slf4jLog
23/04/15 06:01:25 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_362-b09
23/04/15 06:01:25 INFO org.sparkproject.jetty.server.Server: Started @4934ms
23/04/15 06:01:25 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@762eec30{HTTP/1.1, (http/1.1)}{0.0.0.0:34819}
23/04/15 06:01:26 INFO org.apache.hadoop.yarn.client.RMPro

In [None]:
!gsutil ls gs://bdma/shared/2023_spring/HW3

gs://bdma/shared/2023_spring/HW3/
gs://bdma/shared/2023_spring/HW3/16141003_Olsen/
gs://bdma/shared/2023_spring/HW3/24363838_Lau/
gs://bdma/shared/2023_spring/HW3/24369480_Chandani/
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/
gs://bdma/shared/2023_spring/HW3/24438996_Radaelli/


In [None]:
!gsutil ls gs://bdma/shared/2023_spring/HW3/24373710_Uddin/

gs://bdma/shared/2023_spring/HW3/24373710_Uddin/
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/_SUCCESS
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00000-fc38ba72-13b6-455c-8cbf-b98f2df80384-c000.csv
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00001-fc38ba72-13b6-455c-8cbf-b98f2df80384-c000.csv
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00002-fc38ba72-13b6-455c-8cbf-b98f2df80384-c000.csv
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00003-fc38ba72-13b6-455c-8cbf-b98f2df80384-c000.csv
gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00004-fc38ba72-13b6-455c-8cbf-b98f2df80384-c000.csv


In [None]:
# !gsutil rm -r gs://bdma/shared/2023_spring/HW3/24373710_Uddin/

Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/#1681537550695534...
Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/_SUCCESS#1681537550979022...
Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00000-50802079-6825-4993-bf8a-eb57ff1d4948-c000.csv#1681537527060904...
Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00001-50802079-6825-4993-bf8a-eb57ff1d4948-c000.csv#1681537527650720...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00002-50802079-6825-4993-bf8a-eb57ff1d4948-c000.csv#1681537525605261...
Removing gs://bdma/shared/2023_spring/HW3/24373710_Uddin/part-00003-50802079-6825-4993-bf8a-eb57ff1d4

# IMPORTANT: DELETE YOUR CLUSTER WHEN DONE

In [None]:
!gcloud dataproc clusters delete bdma -q
!gcloud dataproc clusters list

Waiting on operation [projects/big-data-380022/regions/us-west1/operations/5949efbe-adc7-3c66-9cd5-3ab382fff8be].
Deleted [https://dataproc.googleapis.com/v1/projects/big-data-380022/regions/us-west1/clusters/bdma].
Listed 0 items.
