## ENTITY RESOLUTION

## PySpark and System Imports

Import the libraries needed to run Spark applications and to handle system-related tasks.



In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext

## PySpark SparkSession Configuration

Import the SparkSession class from PySpark. SparkSession is the entry point to programming Spark with the Dataset and DataFrame API. Create a SparkSession with specific configuration settings. The `config` method is used to set the `spark.driver.memory` property to "16g", allocating 16 gigabytes of memory to the driver program. The `appName` method sets the name of the Spark application to "chapter_2". The `getOrCreate` method either retrieves an existing SparkSession or creates a new one if it doesn't exist.


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_2').getOrCreate()

## Reading CSV Data with PySpark

The code snippet reads a CSV file into a PySpark DataFrame using the `spark.read.csv` method:

- `prev = spark.read.csv("data/linkage/donation/block_1/block_1.csv")`: Reads the CSV file located at the specified path ("data/linkage/donation/block_1/block_1.csv") into a DataFrame named `prev`. Because no options are set, Spark treats the header row as data, names the columns `_c0`, `_c1`, and so on, and reads every column as a string.

- `prev`: Evaluating the variable on the last line of a notebook cell shows the DataFrame's column names and types, not its rows. Use the `show` method to see the data itself.


In [None]:
prev = spark.read.csv("data/linkage/donation/block_1/block_1.csv")
prev

## Displaying DataFrame Contents in PySpark

The code snippet uses the `show` method to display the first two rows of the DataFrame `prev`:

- `prev.show(2)`: Displays the first two rows of the DataFrame `prev`. The `show` method is used to show a specific number of rows from the DataFrame. In this case, `2` indicates that the first two rows of the DataFrame will be displayed.


In [None]:
prev.show(2)

## Reading and Parsing CSV Data with PySpark

The code snippet reads and parses a CSV file into a PySpark DataFrame using various options:

- `parsed = spark.read.option("header", "true").option("nullValue", "?").option("inferSchema", "true").csv("data/linkage/donation/block_1/block_1.csv")`: Reads the CSV file located at the specified path ("data/linkage/donation/block_1/block_1.csv") into a DataFrame named `parsed`. Several options are used to customize the parsing behavior:
  - `header="true"`: Specifies that the first row of the CSV file contains the header, which is used to name the columns of the DataFrame.
  - `nullValue="?"`: Specifies that the string "?" should be interpreted as a null value in the DataFrame.
  - `inferSchema="true"`: Specifies that the data types of columns should be inferred from the data in the CSV file.

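Schema inference happens only because `inferSchema` is set: Spark makes an extra pass over the data to determine each column's type. Without that option, every column is read as a string.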


In [None]:
parsed = spark.read.option("header", "true").option("nullValue", "?").option("inferSchema", "true").csv("data/linkage/donation/block_1/block_1.csv")
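
The effect of the three options can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the sample text and the `infer` helper are hypothetical, and it converts each value on its own, whereas Spark infers one type per column from the whole file.

```python
import csv
import io

# Hypothetical two-row sample; the real file has many more columns.
SAMPLE = "id_1,id_2,cmp_fname_c1,is_match\n1,2,0.5,TRUE\n3,4,?,FALSE\n"


def infer(value, null_value="?"):
    """Map the null marker to None, then try int, float, and bool in turn."""
    if value == null_value:
        return None
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value  # fall back to the raw string


# DictReader takes the column names from the first row, like header=true.
rows = [
    {name: infer(value) for name, value in record.items()}
    for record in csv.DictReader(io.StringIO(SAMPLE))
]
print(rows)
```

The "?" in the second row becomes `None`, which mirrors how `nullValue` turns the marker into a null instead of a string.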

## Printing DataFrame Schema and Displaying Data in PySpark

The code snippet prints the schema of a DataFrame and displays the first five rows of the DataFrame:

- `parsed.printSchema()`: Prints the schema of the DataFrame `parsed`, showing the data types of each column.
- `parsed.show(5)`: Displays the first five rows of the DataFrame `parsed`. The `show` method is used to display a specific number of rows from the DataFrame. In this case, `5` indicates that the first five rows will be displayed.


In [None]:
parsed.printSchema()
parsed.show(5)

## Counting Rows in a DataFrame with PySpark

The code snippet counts the number of rows in the DataFrame `parsed`:

- `parsed.count()`: Returns the number of rows in the DataFrame `parsed`. The `count` method is used to calculate the total number of rows in a DataFrame.


In [None]:
parsed.count()

## Caching DataFrame in PySpark

The code snippet marks the DataFrame `parsed` to be cached so that later operations can reuse it:

- `parsed.cache()`: Marks the DataFrame `parsed` for caching. Caching is lazy: the data is materialized the first time an action (such as `count`) runs on the DataFrame, and later operations reuse the cached copy instead of re-reading and re-parsing the CSV file. This is particularly useful when you plan to reuse the DataFrame in multiple operations.


In [None]:
parsed.cache()
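
The lazy behaviour of caching can be pictured with a toy model. The sketch below is plain Python and says nothing about how Spark works internally; the `LazyTable` class and its `loads` counter are hypothetical names used only for illustration.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated, cacheable table (not Spark's implementation)."""

    def __init__(self, loader):
        self._loader = loader  # function that produces the rows
        self._cache_requested = False
        self._cached_rows = None
        self.loads = 0  # how many times the loader actually ran

    def cache(self):
        # Only records the request; nothing is loaded yet.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows
        self.loads += 1
        rows = self._loader()
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    def count(self):
        # An "action": forces the rows to be produced.
        return len(self._rows())


table = LazyTable(lambda: [("a", 1), ("b", 2), ("c", 3)]).cache()
loads_after_cache = table.loads  # cache() alone triggers no load
first = table.count()  # first action loads and stores the rows
second = table.count()  # second action reuses the stored rows
print(loads_after_cache, first, second, table.loads)
```

The loader runs once even though `count` is called twice, which is the saving that caching is meant to provide.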

## Grouping, Aggregating, and Ordering Data in PySpark

The code snippet demonstrates how to group, aggregate, and order data in a PySpark DataFrame:

- `from pyspark.sql.functions import col`: Imports the `col` function from `pyspark.sql.functions`. The `col` function is used to create a column expression that represents a column in a DataFrame.

- `parsed.groupBy("is_match").count()`: Groups the DataFrame `parsed` by the column "is_match" and calculates the count of each group.

- `orderBy(col("count").desc())`: Orders the result by the "count" column in descending order. The `desc` method is used to specify descending order.

- `show()`: Displays the result of the grouping, aggregation, and ordering operations. The `show` method is used to display the resulting DataFrame.


In [None]:
from pyspark.sql.functions import col
parsed.groupBy("is_match").count().orderBy(col("count").desc()).show()
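
For intuition, the same group, count, and sort steps can be sketched on a small in-memory list with the standard library. The values below are hypothetical stand-ins for the `is_match` column.

```python
from collections import Counter

# Hypothetical is_match values standing in for the real column.
is_match = [False, False, True, False, True, False]

# Counter groups equal values and counts them;
# most_common() returns the groups sorted by count, largest first.
counts = Counter(is_match).most_common()
for value, count in counts:
    print(value, count)
```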

## Creating a Temporary View in PySpark

The code snippet creates a temporary view named "linkage" from the DataFrame `parsed`:

- `parsed.createOrReplaceTempView("linkage")`: Creates a temporary view named "linkage" from the DataFrame `parsed`. A temporary view allows you to query the DataFrame using Spark SQL. This view is temporary and will be available only within the current Spark session.


In [None]:
parsed.createOrReplaceTempView("linkage")

## Executing SQL Queries on a DataFrame in PySpark

The code snippet executes a SQL query against the temporary view `linkage` using Spark SQL:

- `spark.sql("""...""")`: Executes a SQL query using Spark SQL. The triple quotes (`"""`) are used to create a multi-line string that contains the SQL query.

- The SQL query:
  - `SELECT is_match, COUNT(*) cnt`: Selects the "is_match" column from the "linkage" table and calculates the count of rows for each value of "is_match". The result column is named "cnt".
  - `FROM linkage`: Specifies the "linkage" table as the source of data for the query.
  - `GROUP BY is_match`: Groups the rows by the "is_match" column.
  - `ORDER BY cnt DESC`: Orders the result by the "cnt" column in descending order.

- `show()`: Displays the result of the SQL query.
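
The query text is standard SQL, so its behaviour can be tried on a tiny table without Spark. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine; the table contents are hypothetical, and SQLite stores the boolean `is_match` as 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE linkage (id_1 INTEGER, id_2 INTEGER, is_match INTEGER)")
# Hypothetical rows: one match (1) and three misses (0).
conn.executemany(
    "INSERT INTO linkage VALUES (?, ?, ?)",
    [(1, 2, 1), (3, 4, 0), (5, 6, 0), (7, 8, 0)],
)

# Same query text as the Spark SQL version.
result = conn.execute("""
SELECT is_match, COUNT(*) cnt
FROM linkage
GROUP BY is_match
ORDER BY cnt DESC
""").fetchall()
conn.close()
print(result)
```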


In [None]:
spark.sql("""
SELECT is_match, COUNT(*) cnt
FROM linkage
GROUP BY is_match
ORDER BY cnt DESC
""").show()

## Describing Data in a PySpark DataFrame

The code snippet computes summary statistics for the DataFrame `parsed`:

- `summary = parsed.describe()`: Computes summary statistics for the numeric and string columns of the DataFrame `parsed` and stores the result in the DataFrame `summary`. The `describe` method reports the count of non-null values, mean, standard deviation, min, and max for each of those columns, with one row per statistic and a "summary" column naming it.


In [None]:
summary = parsed.describe()
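
As a rough illustration of what these statistics mean for one column, the sketch below computes them in plain Python while skipping nulls. The `describe` helper and the sample values are hypothetical; the sample (n - 1) standard deviation is used here on the assumption that it matches what Spark reports.

```python
import statistics


def describe(values):
    """Summary statistics over the non-null values of one column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# Hypothetical column with two missing values.
stats = describe([1.0, None, 0.5, 0.0, None, 1.0])
print(stats)
```

Note that `count` is the number of non-null values, so it can differ from column to column.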

## Selecting Columns from Summary Statistics in PySpark

The code snippet selects specific columns from the `summary` DataFrame to display:

- `summary.select("summary", "cmp_fname_c1", "cmp_fname_c2")`: Selects the columns "summary", "cmp_fname_c1", and "cmp_fname_c2" from the `summary` DataFrame. The `select` method is used to choose specific columns from a DataFrame.

- `show()`: Displays the selected columns from the `summary` DataFrame.


In [None]:
summary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()

## Filtering Data and Computing Summary Statistics in PySpark

The code snippet filters the DataFrame `parsed` to separate matches and misses based on the "is_match" column, and then computes summary statistics for each:

- `matches = parsed.where("is_match = true")`: Filters the DataFrame `parsed` to include only rows where the "is_match" column is true. The `where` method is used to filter rows based on a condition specified in SQL syntax.

- `match_summary = matches.describe()`: Computes summary statistics for each numerical column in the `matches` DataFrame and stores the result in the DataFrame `match_summary`.

- `misses = parsed.filter(col("is_match") == False)`: Filters the DataFrame `parsed` to include only rows where the "is_match" column is false. The `filter` method is used to filter rows based on a condition specified using the `col` function.

- `miss_summary = misses.describe()`: Computes summary statistics for each numerical column in the `misses` DataFrame and stores the result in the DataFrame `miss_summary`.


In [None]:
matches = parsed.where("is_match = true")
match_summary = matches.describe()
misses = parsed.filter(col("is_match") == False)
miss_summary = misses.describe()

## Converting PySpark DataFrame to Pandas DataFrame

The code snippet converts a PySpark DataFrame `summary` to a Pandas DataFrame `summary_p`:

- `summary_p = summary.toPandas()`: Converts the PySpark DataFrame `summary` to a Pandas DataFrame `summary_p`. The `toPandas` method collects all rows to the driver, so it is best suited to small results such as this summary table, where it allows for easier manipulation and analysis using Pandas.


In [None]:
summary_p = summary.toPandas()

## Displaying Summary Statistics in Pandas DataFrame

The code snippet displays the first few rows and shape of the Pandas DataFrame `summary_p`:

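- `summary_p.head()`: Returns the first few rows of the Pandas DataFrame `summary_p` using the `head` method. By default a notebook cell displays only the value of its last expression, so wrap this call in `print` or `display` to see its output alongside the shape.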

- `summary_p.shape`: Returns the dimensions (number of rows, number of columns) of the Pandas DataFrame `summary_p` using the `shape` attribute.


In [None]:
summary_p.head()
summary_p.shape

## Manipulating Pandas DataFrame Structure

The code snippet manipulates the structure of the Pandas DataFrame `summary_p`:

- `summary_p = summary_p.set_index('summary').transpose().reset_index()`: Sets the "summary" column as the index, transposes the DataFrame so that each original column becomes a row, and then calls `reset_index()` to move the former column names into a regular column named "index".

- `summary_p = summary_p.rename(columns={'index':'field'})`: Renames the "index" column to "field" in the DataFrame `summary_p`.

- `summary_p = summary_p.rename_axis(None, axis=1)`: Removes the axis name from the columns of the DataFrame `summary_p`.

- `summary_p.shape`: Returns the dimensions (number of rows, number of columns) of the Pandas DataFrame `summary_p` after manipulation.


In [None]:
summary_p = summary_p.set_index('summary').transpose().reset_index()
summary_p = summary_p.rename(columns={'index':'field'})
summary_p = summary_p.rename_axis(None, axis=1)
summary_p.shape
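
The reshaping can be hard to picture, so here is a minimal plain-Python sketch of the same transpose on a tiny table. The two statistic rows and their values are hypothetical; they only mimic the layout of the summary table, where every value is a string.

```python
# Hypothetical summary table: one row per statistic, values stored as strings.
summary_rows = [
    {"summary": "count", "cmp_by": "4", "cmp_bd": "3"},
    {"summary": "mean", "cmp_by": "0.25", "cmp_bd": "0.5"},
]

# Every key other than "summary" is a field name.
fields = [key for key in summary_rows[0] if key != "summary"]

# One output row per field, with one entry per statistic.
transposed = [
    {"field": field, **{row["summary"]: row[field] for row in summary_rows}}
    for field in fields
]
print(transposed)
```

After the transpose, each field occupies one row and each statistic one column, which is the layout the later comparison relies on.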

## Creating a PySpark DataFrame from Pandas DataFrame

The code snippet creates a PySpark DataFrame `summaryT` from the Pandas DataFrame `summary_p`:

- `summaryT = spark.createDataFrame(summary_p)`: Creates a PySpark DataFrame `summaryT` from the Pandas DataFrame `summary_p`. The `createDataFrame` method is used to convert a Pandas DataFrame to a PySpark DataFrame.


In [None]:
summaryT = spark.createDataFrame(summary_p)
summaryT

## Printing Schema of PySpark DataFrame

The code snippet prints the schema of the PySpark DataFrame `summaryT`:

- `summaryT.printSchema()`: Prints the schema of the PySpark DataFrame `summaryT`, showing the data types of each column.


In [None]:
summaryT.printSchema()

## Converting Column Data Types in PySpark DataFrame

The code snippet converts the data type of columns in the PySpark DataFrame `summaryT` to `DoubleType`:

- `from pyspark.sql.types import DoubleType`: Imports the `DoubleType` class from `pyspark.sql.types`. This class represents double precision (64-bit) floating-point numbers.

- The loop iterates over each column (`c`) in the DataFrame `summaryT`:
  - `if c == 'field': continue`: Skips the 'field' column to avoid attempting to convert it to `DoubleType`.

  - `summaryT = summaryT.withColumn(c, summaryT[c].cast(DoubleType()))`: Converts the data type of the column `c` to `DoubleType` using the `cast` method and updates the DataFrame `summaryT` with the new column data type.

- `summaryT.printSchema()`: Prints the schema of the PySpark DataFrame `summaryT` after converting column data types, showing the updated data types of each column.


In [None]:
from pyspark.sql.types import DoubleType
for c in summaryT.columns:
    if c == 'field':
        continue
    # Cast each statistic column from string to double
    summaryT = summaryT.withColumn(c, summaryT[c].cast(DoubleType()))
summaryT.printSchema()
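
The conversion step can be sketched in plain Python as follows. The `to_double` helper and the sample row are hypothetical. Returning `None` for text that is not numeric is an assumption that mirrors Spark's behaviour when ANSI mode is off; with ANSI mode on, such a cast raises an error instead.

```python
def to_double(text):
    """Return float(text), or None when the text is missing or not numeric."""
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical transposed row: statistics are still strings.
row = {"field": "cmp_by", "count": "4", "mean": "0.25", "min": None}

# Leave the 'field' name alone and convert every other entry.
converted = {
    key: (value if key == "field" else to_double(value))
    for key, value in row.items()
}
print(converted)
```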

## Function for Pivoting Summary Statistics in PySpark

The code defines a function `pivot_summary` that pivots summary statistics in a PySpark DataFrame:

- `from pyspark.sql import DataFrame`: Imports the `DataFrame` class from `pyspark.sql`.

- `from pyspark.sql.types import DoubleType`: Imports the `DoubleType` class from `pyspark.sql.types`. This class represents double precision (64-bit) floating-point numbers.

- The `pivot_summary` function:
  - Accepts a PySpark DataFrame `desc` containing summary statistics.
  - Converts the PySpark DataFrame `desc` to a Pandas DataFrame `desc_p` using the `toPandas` method.
  - Sets the "summary" column as the index, transposes the Pandas DataFrame `desc_p`, and resets the index so that the former column names end up in a "field" column.
  - Converts the Pandas DataFrame `desc_p` back to a PySpark DataFrame `descT` using the `createDataFrame` method.
  - Converts metric columns to `DoubleType` from string in the PySpark DataFrame `descT` using a loop and the `cast` method.
  - Returns the pivoted PySpark DataFrame `descT`.


In [None]:
from pyspark.sql import DataFrame
from pyspark.sql.types import DoubleType
def pivot_summary(desc):
    # convert to pandas dataframe
    desc_p = desc.toPandas()
    # transpose
    desc_p = desc_p.set_index('summary').transpose().reset_index()
    desc_p = desc_p.rename(columns={'index':'field'})
    desc_p = desc_p.rename_axis(None, axis=1)
    # convert to Spark dataframe
    descT = spark.createDataFrame(desc_p)
    # convert metric columns to double from string
    for c in descT.columns:
        if c == 'field':
            continue
        else:
            descT = descT.withColumn(c, descT[c].cast(DoubleType()))
    # Return only after every metric column has been cast
    return descT
match_summaryT = pivot_summary(match_summary)
miss_summaryT = pivot_summary(miss_summary)

## Creating Temporary Views and Executing SQL Query in PySpark

The code snippet creates temporary views for the DataFrames `match_summaryT` and `miss_summaryT` and then executes a SQL query to compare summary statistics between the two views:

- `match_summaryT.createOrReplaceTempView("match_desc")`: Creates a temporary view named "match_desc" from the DataFrame `match_summaryT`.

- `miss_summaryT.createOrReplaceTempView("miss_desc")`: Creates a temporary view named "miss_desc" from the DataFrame `miss_summaryT`.

- The SQL query:
  - `SELECT a.field, a.count + b.count total, a.mean - b.mean delta`: Selects the "field" column, adds the non-null counts from the two views to get how many records have a value for that field (`total`), and subtracts the mean among misses from the mean among matches (`delta`). A field with a large `delta` and a large `total` is a promising candidate for telling matches from misses.
  
  - `FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field`: Performs an inner join between the "match_desc" and "miss_desc" views on the "field" column to compare summary statistics for each field.

  - `WHERE a.field NOT IN ("id_1", "id_2")`: Filters out fields "id_1" and "id_2" from the result to focus on other fields.

  - `ORDER BY delta DESC, total DESC`: Orders the result by the "delta" column in descending order (largest differences first), and then by the "total" column in descending order (largest totals first).

The SQL query provides insights into how summary statistics differ between matches and misses, excluding the "id_1" and "id_2" fields.
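
The join and the two derived columns can be sketched in plain Python. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow; the dictionaries stand in for the two views.

```python
# Hypothetical per-field statistics for matches and misses.
match_desc = {
    "id_1": {"count": 4.0, "mean": 10.0},
    "cmp_plz": {"count": 4.0, "mean": 1.0},
    "cmp_fname_c2": {"count": 2.0, "mean": 0.5},
}
miss_desc = {
    "id_1": {"count": 6.0, "mean": 20.0},
    "cmp_plz": {"count": 6.0, "mean": 0.25},
    "cmp_fname_c2": {"count": 1.0, "mean": 0.5},
}

# Inner join on field, skip the id columns, compute total and delta.
rows = [
    (
        field,
        match_desc[field]["count"] + miss_desc[field]["count"],
        match_desc[field]["mean"] - miss_desc[field]["mean"],
    )
    for field in match_desc
    if field in miss_desc and field not in ("id_1", "id_2")
]

# ORDER BY delta DESC, total DESC
rows.sort(key=lambda r: (r[2], r[1]), reverse=True)
print(rows)
```

Here `cmp_plz` ranks first because its mean differs most between the two groups and it is populated in every record.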


In [None]:
match_summaryT.createOrReplaceTempView("match_desc")
miss_summaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
SELECT a.field, a.count + b.count total, a.mean - b.mean delta
FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
WHERE a.field NOT IN ("id_1", "id_2")
ORDER BY delta DESC, total DESC
""").show()

## Creating Summation Expression for Good Features

The code snippet creates a summation expression for the list of good features:

- `good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]`: Defines a list of good features.

- `sum_expression = " + ".join(good_features)`: Joins the elements of the `good_features` list with the " + " separator to create a summation expression. The `join` method concatenates the elements of the list into a single string separated by the specified separator.

The `sum_expression` string contains the summation expression for the good features, which can be used in further computations or expressions.


In [None]:
good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]
sum_expression = " + ".join(good_features)
sum_expression

## Calculating Score for Good Features in PySpark DataFrame

The code snippet calculates a score for the good features in the PySpark DataFrame `parsed` and displays the result:

- `from pyspark.sql.functions import expr`: Imports the `expr` function from `pyspark.sql.functions`. This function parses a SQL expression string and returns the corresponding column expression, which can then be passed to methods such as `withColumn`.

- `scored = parsed.fillna(0, subset=good_features).withColumn('score', expr(sum_expression)).select('score', 'is_match')`: Calculates the score for the good features in the `parsed` DataFrame. The `fillna` method is used to replace null values with 0 in the good features columns. The `withColumn` method adds a new column named 'score' to the DataFrame, which is calculated using the `sum_expression`. Finally, the `select` method is used to select only the 'score' and 'is_match' columns from the DataFrame.

- `scored.show()`: Displays the 'score' and 'is_match' columns of the `scored` DataFrame, showing the calculated score for each row along with the 'is_match' value.


In [None]:
from pyspark.sql.functions import expr
scored = parsed.fillna(0, subset=good_features).\
withColumn('score', expr(sum_expression)).\
select('score', 'is_match')
scored.show()
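
The reason for filling nulls before summing is that, in SQL, adding a null to anything yields null. The plain-Python sketch below illustrates this on one hypothetical record, with `None` playing the role of a null; `sql_sum` is an illustrative helper, not a Spark function.

```python
good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]
sum_expression = " + ".join(good_features)

# Hypothetical record; None plays the role of a SQL NULL.
record = {"cmp_lname_c1": 1.0, "cmp_plz": None, "cmp_by": 1, "cmp_bd": 1, "cmp_bm": 0}


def sql_sum(values):
    """Add values the way SQL does: any NULL operand makes the result NULL."""
    values = list(values)
    return None if any(v is None for v in values) else sum(values)


# Without filling, the single missing value wipes out the whole score.
unfilled_score = sql_sum(record[name] for name in good_features)

# Filling nulls with 0 first keeps the contribution of the other features.
filled = {name: 0 if record[name] is None else record[name] for name in good_features}
score = sql_sum(filled.values())
print(sum_expression, unfilled_score, score)
```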

## Creating CrossTabs in PySpark

The code defines a function `crossTabs` that cross-tabulates the `scored` DataFrame against a threshold `t`:

- `def crossTabs(scored: DataFrame, t: float) -> DataFrame:`: Defines a function `crossTabs` that takes a DataFrame `scored` and a numeric threshold `t` and returns a DataFrame. The threshold is a plain Python `float` that is interpolated into the SQL expression.

- `scored.selectExpr(f"score >= {t} as above", "is_match")`: Creates a boolean column 'above' that is true when the score is greater than or equal to the threshold `t`, and keeps the 'is_match' column.

- `groupBy("above").pivot("is_match", ("true", "false"))`: Groups the rows by the 'above' column and pivots on 'is_match', producing one output column for each of the listed values, "true" and "false".

- `count()`: Counts the rows for each combination of 'above' and 'is_match'.

The function returns a DataFrame with one row per 'above' value and one count column per 'is_match' value.


In [None]:
def crossTabs(scored: DataFrame, t: float) -> DataFrame:
    # Flag each row by whether its score reaches the threshold, then count by is_match
    return scored.selectExpr(f"score >= {t} as above", "is_match").\
    groupBy("above").pivot("is_match", ("true", "false")).\
    count()

## Displaying CrossTabs Result in PySpark

The code snippet calls the `crossTabs` function with the `scored` DataFrame and a threshold of `4.0`, and then displays the result:

- `crossTabs(scored, 4.0)`: Calls the `crossTabs` function with the `scored` DataFrame and a threshold of `4.0` to calculate cross-tabulations.

- `show()`: Displays the result of the cross-tabulations, showing the count of 'true' and 'false' values for each 'above' value (indicating whether the score is above or equal to the threshold).


In [None]:
crossTabs(scored, 4.0).show()

## Displaying CrossTabs Result with a Lower Threshold

The code snippet calls the `crossTabs` function with the `scored` DataFrame and a threshold of `2.0`, and then displays the result:

- `crossTabs(scored, 2.0)`: Calls the `crossTabs` function with the `scored` DataFrame and a threshold of `2.0` to calculate cross-tabulations.

- `show()`: Displays the result of the cross-tabulations, showing the count of 'true' and 'false' values for each 'above' value (indicating whether the score is above or equal to the threshold).


In [None]:
crossTabs(scored, 2.0).show()
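
To see how the choice of threshold changes the table, the cross-tabulation can be sketched in plain Python on a handful of hypothetical (score, is_match) pairs. The `cross_tabs` helper below is an illustrative stand-in, not the Spark function.

```python
from collections import Counter


def cross_tabs(scored, t):
    """Count (above, is_match) pairs, where above means score >= t."""
    return Counter((score >= t, is_match) for score, is_match in scored)


# Hypothetical scored records: (score, is_match).
scored = [(5.0, True), (4.0, True), (3.0, True), (2.5, False), (1.0, False), (0.0, False)]

strict = cross_tabs(scored, 4.0)
loose = cross_tabs(scored, 2.0)
print(strict)
print(loose)
```

With these sample values, the stricter threshold lets no miss through but leaves one true match below the line, while the looser threshold captures every match at the cost of admitting one miss.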