# EDA with PySpark

The training set is large: 957,919 rows and 120 columns. This is a good opportunity to use PySpark to query the data efficiently. PySpark is the Python API for Apache Spark, an engine that runs tasks in parallel across the partitions of a dataset. The data is loaded into a Spark dataframe, which resembles the dataframes known from R and Pandas but is a distributed table. Spark dataframes can therefore handle data that would not fit comfortably in a single Pandas dataframe. Finally, we'll see that Spark dataframes can also be queried with SQL.

If you're browsing this notebook, feel free to comment. I'll take into account the advice and comments to improve it.

[I. Setting up spark](#spark)

[II. Loading the dataset](#load)

[III. Summary statistics](#stats)

[IV. Filtering based on claim values](#filters)

[V. Grouping and aggregating data](#groupagg)

[VI. Using SparkSQL](#sql)

For more EDA, including visualizations with Pandas and Seaborn, see also: [EDA skewness](https://www.kaggle.com/cmarquay/eda-skewness)

For a tutorial on manipulating data with PySpark: [try this video](https://www.youtube.com/watch?v=3-pnWVWyH-s)

<a id="spark"></a>
## I. Setting up spark

We start by creating a Spark session, the entry point for working with Spark dataframes.

In [None]:
!pip3 install pyspark

In [None]:
from pyspark.sql import SparkSession  # required to create a dataframe
spark = SparkSession.builder.appName("EDA_PySpark").getOrCreate()

import pyspark.sql.functions

from pyspark.sql.types import DoubleType, IntegerType

<a id="load"></a>
## II. Loading the dataset

We load the training set to perform the EDA. We deliberately leave the test set aside: exploring it would risk tailoring our choices to it, in other words overfitting it.

When reading a CSV file, PySpark reads every column as a string by default. It can infer column types, but setting them manually is a good habit (this avoids unpleasant surprises). We print the schema to verify that it matches our [EDA with Pandas](https://www.kaggle.com/cmarquay/eda-skewness).

We then display the first few rows and some basic information. We see that the training set contains 957,919 rows and 120 columns. The id column is actually the index, and the claim column is our target y: both are of type integer. Finally, the 118 features of type double constitute our X.

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
X = spark.read.csv("/kaggle/input/tabular-playground-series-sep-2021/train.csv", header=True)

In [None]:
X.printSchema()

In [None]:
X = X.withColumn("id", X["id"].cast(IntegerType()))
X = X.withColumn("claim", X["claim"].cast(IntegerType()))
for i in range(1, 119):
    X = X.withColumn("f"+str(i), X["f"+str(i)].cast(DoubleType()))

In [None]:
X.printSchema()

In [None]:
for i in range(12):
    X.select(X.columns[(i*10):(i*10+10)]).show(5, truncate=0)

In [None]:
(X.count(), len(X.columns))

<a id="stats"></a>
## III. Summary statistics

We display some statistics concerning the 120 columns.

In [None]:
for i in range(24):
    X.select(X.columns[(i*5):(i*5+5)]).describe().show(truncate=0)

<a id="filters"></a>
## IV. Filtering based on claim values

The where() and filter() methods are identical (where() is an alias of filter()): both keep the rows of a Spark dataframe that satisfy a logical condition.

The claim column contains the binary values 0 and 1. Here, we look at the column statistics separately for each value of claim: we first display them for rows where claim is 0, then for rows where claim is 1.

In [None]:
for i in range(24):
    X.where(X["claim"] == 0)[X.columns[(i*5):(i*5+5)]].describe().show(truncate=0)

In [None]:
for i in range(24):
    X.where(X["claim"] == 1)[X.columns[(i*5):(i*5+5)]].describe().show(truncate=0)

<a id="groupagg"></a>
## V. Grouping and aggregating data

We display a lot of stats, but we may also want to use them. Spark dataframes can be grouped by column values to obtain statistics through aggregations. With min, max, and skewness, we can find the minimum, the maximum, and the skewness of a column. With mean and stddev, we can standardize the data.

Here, we compare the statistics obtained just above to see whether the distribution of each feature differs between claim equal to 0 and claim equal to 1. For each feature, we compute a z-score: the difference between the mean when claim==1 and the mean when claim==0, divided by the standard deviation when claim==0. A z-score is the distance of a value from the mean, measured in standard deviations. In a normal distribution, about 68% of the data have a z-score between -1 and 1, about 95% between -2 and 2, and about 99.7% between -3 and 3. This is why z-scores below -3 or above 3 are usually considered outliers.
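The share of a normal distribution lying within a given number of standard deviations of the mean can be checked numerically. The sketch below uses only the standard library; the helper name `normal_coverage` is introduced here for illustration, and it assumes the data are normally distributed.

```python
import math


def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Coverage for |z| <= 1, 2 and 3.
coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"|z| <= {k}: {share:.2%}")
```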

We find that no z-score exceeds 0.05, so the feature means of the two groups are very similar. The objective of this contest is to differentiate claim==0 from claim==1 even though, on these summary statistics, the two groups are almost indistinguishable.

In [None]:
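# Standardized difference of each feature's mean between the two claim groups.
# orderBy("claim") guarantees that row 0 is claim == 0 and row 1 is claim == 1;
# without it, the order of the collected groups is not guaranteed.
features = [c for c in X.columns if c not in ("id", "claim")]
for c in features:
    means = X.groupBy("claim").agg({c: "mean"}).orderBy("claim").collect()
    stDev = X.groupBy("claim").agg({c: "stddev"}).orderBy("claim").collect()
    xbar = means[1]["avg(" + c + ")"]      # mean when claim == 1
    mu = means[0]["avg(" + c + ")"]        # mean when claim == 0
    sigma = stDev[0]["stddev(" + c + ")"]  # standard deviation when claim == 0
    print(c, ": ", (xbar - mu) / sigma)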

<a id="sql"></a>
## VI. Using SparkSQL

The createOrReplaceTempView() method registers the Spark dataframe as a temporary view that SQL queries can refer to by name.

The feature with the largest gap between claim==0 and claim==1 is f34, with a z-score of 0.04313979653511135. However, the two means of f34 are very close to each other: their difference is much smaller than their standard deviations.

In [None]:
# createOrReplaceTempView() returns None, so there is nothing to assign
X.createOrReplaceTempView("playground")

In [None]:
spark.sql("SELECT claim, AVG(f34), STD(f34) FROM playground GROUP BY claim ORDER BY claim").show(truncate=0)

Feel free to comment. I'll take into account the advice and comments to improve this notebook.
