### **PySpark**
- PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
- PySpark combines Python’s learnability and ease of use with the power of Apache Spark to enable processing and analysis of data at any size for everyone familiar with Python.
- PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib) and Spark Core.

In [1]:
# Import the SparkSession class from PySpark
from pyspark.sql import SparkSession

# Create a SparkSession
# A SparkSession is the entry point to using Spark functionality.
# It configures Spark and sets up an environment for using Spark.
# The `builder` object is used to configure various settings for Spark.
# `appName` is used to set a name for your Spark application.
# `getOrCreate()` tries to get an existing SparkSession or creates a new one if it doesn't exist.
spark = SparkSession.builder.appName("4gb_Data_Handling!").getOrCreate()

# Now, you have a SparkSession named 'spark' that you can use for various Spark operations.


In [2]:
spark

**Read the dataset**

In [11]:
raw_data = spark.read.options(delimiter="\t", header=True).csv("./Data/en.openfoodfacts.org.products.csv")

- The result is a `pyspark.sql.DataFrame`. Keep in mind that at this point the data has not been loaded into memory. Spark evaluates lazily: it reads the data only when an action is called, that is, an operation that must return a computed value. If I ask for a count of the products in the dataset, for instance, Spark scans the file partition by partition instead of loading the whole 8 GB into memory at once (almost 3 million products).
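The same idea can be pictured with a plain-Python generator. This is only an analogy using the standard library, not how Spark works internally, and the three-row sample and the helper names `read_rows` and `count_rows` are made up for illustration. Building the pipeline reads nothing. Only the final counting step pulls rows through, one at a time.

```python
import csv
import io

# Hypothetical three-row sample standing in for the real tab-separated file.
SAMPLE_TSV = "code\tproduct_name\n001\tApple juice\n002\tOat bar\n003\tOlive oil\n"


def read_rows(stream):
    """Lazily yield one parsed row at a time; nothing runs until iteration starts."""
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield row


def count_rows(rows):
    """The 'action': consume the lazy pipeline and return a single number."""
    return sum(1 for _ in rows)


# Defining the pipeline does not read any data yet.
rows = read_rows(io.StringIO(SAMPLE_TSV))

# Only this call pulls the data through, without keeping the rows in memory.
row_count = count_rows(rows)
print(row_count)
```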

In [12]:
raw_data.count()

2977334

In [13]:
raw_data.printSchema()

root
 |-- code: string (nullable = true)
 |-- url: string (nullable = true)
 |-- creator: string (nullable = true)
 |-- created_t: string (nullable = true)
 |-- created_datetime: string (nullable = true)
 |-- last_modified_t: string (nullable = true)
 |-- last_modified_datetime: string (nullable = true)
 |-- last_modified_by: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- abbreviated_product_name: string (nullable = true)
 |-- generic_name: string (nullable = true)
 |-- quantity: string (nullable = true)
 |-- packaging: string (nullable = true)
 |-- packaging_tags: string (nullable = true)
 |-- packaging_en: string (nullable = true)
 |-- packaging_text: string (nullable = true)
 |-- brands: string (nullable = true)
 |-- brands_tags: string (nullable = true)
 |-- categories: string (nullable = true)
 |-- categories_tags: string (nullable = true)
 |-- categories_en: string (nullable = true)
 |-- origins: string (nullable = true)
 |-- origins_tags: string (nullable = true)
 ...

In [14]:
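from pyspark.sql.functions import col

# Count the products for each distinct countries_tags value.
# persist() marks the aggregated result for caching so later cells can reuse it.
BDD_countries = raw_data.groupBy("countries_tags").count().persist()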

In [15]:
BDD_countries.printSchema()

root
 |-- countries_tags: string (nullable = true)
 |-- count: long (nullable = false)



In [16]:
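# Keep values with more than 5000 products, sort by count (descending), and collect the result as a pandas DataFrame.
BDD_res = BDD_countries.filter(col("count") > 5000).orderBy("count", ascending=False).toPandas()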

In [19]:
type(BDD_res)

pandas.core.frame.DataFrame
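
To make the aggregation in cells 14 to 16 concrete, here is roughly what the group, count, filter and sort steps compute, written in plain Python. The tag values and the threshold of 1 are made up for illustration (the notebook uses 5000 on the real data), and this is only a sketch of the logic, not of how Spark executes it.

```python
from collections import Counter

# Hypothetical countries_tags values, one per product row.
countries_tags = [
    "en:france", "en:france", "en:germany",
    "en:france", "en:spain", "en:germany",
]

# groupBy("countries_tags").count(): one entry per distinct value.
counts = Counter(countries_tags)

# filter(col("count") > threshold): keep only the frequent values.
THRESHOLD = 1  # illustrative; the notebook uses 5000
frequent = {tag: n for tag, n in counts.items() if n > THRESHOLD}

# orderBy("count", ascending=False): sort by count, largest first.
result = sorted(frequent.items(), key=lambda item: item[1], reverse=True)
print(result)
```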

- Now we can carry out the exploratory data analysis (EDA) with pandas functions.
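
As a small, hedged illustration of that pandas step, the sketch below works on a made-up stand-in for `BDD_res` with the same two columns (`countries_tags` and `count`). The values and the `share_pct` column are assumptions for the example, not results from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for BDD_res: same columns, made-up values.
bdd_res = pd.DataFrame(
    {"countries_tags": ["en:a", "en:b", "en:c"], "count": [600, 300, 100]}
)

# Use bracket access: bdd_res.count is the DataFrame method, not the column.
total = bdd_res["count"].sum()

# Share of each value among the rows kept, as a percentage.
bdd_res["share_pct"] = (bdd_res["count"] / total * 100).round(1)

# The two most frequent values.
top = bdd_res.sort_values("count", ascending=False).head(2)
print(top)
```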
