# Data Analysis
This notebook connects to Google Cloud Platform (GCP) and reads the project data from a Google Cloud Storage (GCS) bucket.

To make the connection, we use a [GCP service account](https://cloud.google.com/iam/docs/service-account-overview) that has permission to access our bucket.
### Steps
1. Go to Service accounts in the GCP account.
2. Open `deng-capstone-service-account`.
3. Create a new key file and download it locally for use in the next step. Rename the file to something concise.
4. Set the path of the key file as an option in your Spark configuration: `spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "/path/to/file/<renamed>.json")`
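
Before handing the key file path to Spark, it can help to confirm that the path exists and points at a service account key. The following is a minimal sketch, not part of the original notebook: the helper `resolve_keyfile` and the environment variable `GCP_KEYFILE` are hypothetical names chosen for illustration, and the check relies on downloaded JSON keys carrying `"type": "service_account"`.

```python
import json
import os

# Hadoop property name taken from step 4 above.
KEYFILE_PROPERTY = "google.cloud.auth.service.account.json.keyfile"


def resolve_keyfile(path=None, env_var="GCP_KEYFILE"):
    """Return the absolute path of a service account key file, or raise.

    `env_var` is a hypothetical variable name used only in this sketch.
    """
    # Fall back to the environment variable when no explicit path is given.
    candidate = path or os.environ.get(env_var)
    if not candidate:
        raise ValueError(f"No key file path given and {env_var} is not set")

    candidate = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isfile(candidate):
        raise FileNotFoundError(f"Key file not found: {candidate}")

    with open(candidate, encoding="utf-8") as fh:
        key = json.load(fh)

    # JSON keys downloaded for a service account carry this type marker.
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise ValueError(f"{candidate} is not a service account key file")
    return candidate
```

With a Spark session available, the validated path could then be passed as `spark._jsc.hadoopConfiguration().set(KEYFILE_PROPERTY, resolve_keyfile())`, which keeps the machine-specific path out of the notebook itself.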

### Resources
1. https://gobiviswa.medium.com/google-cloud-storage-handson-connecting-using-pyspark-5eefc0d8d932
2. https://cloud.google.com/iam/docs/service-account-overview


## Spark Application setup

In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession


spark = SparkSession.builder \
    .appName('data-engineering-capstone') \
    .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()

# Set GCS credentials. Ensure the path points to your downloaded key file.
spark._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile",
    "/path/service/account/gcp-service-account.json")



Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/12 13:31:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
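
The cell above sets the credentials through the private `spark._jsc` handle after the session exists. As an alternative, Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration, so the key file can also be supplied when the session is built. The sketch below assumes that approach; `build_gcs_conf` and `create_session` are hypothetical helper names, and the key file property is the one used in this notebook, which may need checking against the connector version you load.

```python
# URL of the connector jar, copied from the setup cell above.
GCS_CONNECTOR_JAR = (
    "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar"
)


def build_gcs_conf(keyfile_path, app_name="data-engineering-capstone"):
    """Return Spark options that configure GCS access at session build time."""
    return {
        "spark.app.name": app_name,
        "spark.jars": GCS_CONNECTOR_JAR,
        "spark.sql.repl.eagerEval.enabled": "true",
        # The "spark.hadoop." prefix forwards the property to Hadoop.
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }


def create_session(options):
    """Build (or reuse) a SparkSession from a dict of options."""
    # Imported here so the options can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call such as `spark = create_session(build_gcs_conf("/path/service/account/gcp-service-account.json"))` would then replace both the builder chain and the later `hadoopConfiguration().set(...)` call. Note that `getOrCreate()` reuses an already running session, so these options only take full effect on a fresh kernel.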


## Read from GCS

In [4]:
# file path to data in GCS bucket

file_path = "gs://ecommerce-customer-bucket/e-commerce-customer-behavior.csv"

df = spark.read.csv(file_path, header=True, inferSchema=True)

df.show(5)


+-----------+------+---+-------------+---------------+-----------+---------------+--------------+----------------+------------------------+------------------+
|Customer ID|Gender|Age|         City|Membership Type|Total Spend|Items Purchased|Average Rating|Discount Applied|Days Since Last Purchase|Satisfaction Level|
+-----------+------+---+-------------+---------------+-----------+---------------+--------------+----------------+------------------------+------------------+
|        101|Female| 29|     New York|           Gold|     1120.2|             14|           4.6|            true|                      25|         Satisfied|
|        102|  Male| 34|  Los Angeles|         Silver|      780.5|             11|           4.1|           false|                      18|           Neutral|
|        103|Female| 43|      Chicago|         Bronze|     510.75|              9|           3.4|            true|                      42|       Unsatisfied|
|        104|  Male| 30|San Francisco|        