To import the necessary PySpark components, run the following:

In [2]:
from pyspark.sql import SparkSession, SQLContext

Next, we create our Spark session.

In [4]:
spark = SparkSession \
    .builder \
    .appName('Import Texas PUDF Data') \
    .getOrCreate()

Now, we import the data into a DataFrame:

In [5]:
data_sp = spark.read.load(
    "C:\\Users\\Vikas\\PUDF_base1_1q2012_tab.txt",
    format="csv",
    sep="\t",
    header=True
)

Let us now show the first five rows:

In [18]:
data_sp.show(5, truncate=False)

+------------+---------+--------+---------------------+-----------------+-------------------+-----------+-----------+-----------+-----------+-----------+---------+-------+-----------+------+--------------------+----------+--------+----+---------+-------------+--------------+-------+-----------------+---------------------+------------+-------------+---------------------+--------------------+----------------------------+-------------------+---------------------------+----------------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+---------------+-------------------+----------------+--------------------+----------------+--------------------+----------------+--------------------+-------------

By scrolling through the output, you can verify that the first five rows were printed (below the row of dashes and plus signs).

Let's now measure how long it takes to import the data via Pandas.

In [12]:
import pandas as pd
import timeit

stmt_pandas = 'import pandas as pd; ' + \
    'pd.read_csv("C:\\\\Users\\\\Vikas\\\\PUDF_base1_1q2012_tab.txt",' + \
        'delimiter="\t",' + \
        'header=0,' + \
        'dtype="str"' + \
    ')'

timeit.timeit(stmt=stmt_pandas, number = 1)

35.551525896599244

It takes about 35.6 seconds in Pandas (on my laptop).
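As an aside, `timeit` also accepts a callable, which avoids building the statement as a string and quadrupling the backslashes in the Windows path. The following is a minimal sketch of that approach, not the code used for the timing above. The helper names `time_load` and `read_tsv` and the tiny sample file are hypothetical, and the standard-library `csv` module stands in for pandas so the sketch runs without extra packages.

```python
import csv
import os
import tempfile
import timeit


def time_load(loader, path, repeat=3):
    """Return the best wall-clock time (in seconds) over `repeat` single runs."""
    return min(timeit.repeat(lambda: loader(path), number=1, repeat=repeat))


def read_tsv(path):
    # Stand-in loader that uses only the standard library. With pandas
    # installed, a loader such as
    #   lambda p: pd.read_csv(p, delimiter="\t", header=0, dtype="str")
    # could be passed to time_load instead.
    with open(path, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


# A tiny, hypothetical tab-delimited file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_tab.txt")
    with open(sample, "w", newline="") as f:
        f.write("id\tname\n1\talpha\n2\tbeta\n")
    rows = read_tsv(sample)
    elapsed = time_load(read_tsv, sample)

print(rows[0])
print(f"best of 3 runs: {elapsed:.6f} s")
```

Taking the minimum of several runs tends to reduce noise from other processes. For a file that takes tens of seconds to load, a single run (`repeat=1`) may be all you want to wait for.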

We now benchmark the data import in Spark:

In [14]:
stmt_spark = 'from pyspark.sql import SparkSession; ' + \
    'spark = SparkSession.builder.appName("test").getOrCreate();' + \
    'spark.read.load("C:\\\\Users\\\\Vikas\\\\PUDF_base1_1q2012_tab.txt",' + \
        'format="csv",' + \
        'sep="\t",' + \
        'header=True' + \
    ')'

timeit.timeit(stmt=stmt_spark, number = 1)

1.9205595757441074

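It takes ~2 seconds using Spark. Keep in mind that Spark reads lazily: with `header=True` and no schema inference, this call only needs the header line to set up the DataFrame, whereas Pandas loads every row into memory. The two timings are therefore not a like-for-like comparison.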
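Because Spark defers most work until an action is called, a timing that includes an action such as `count()` is closer to what the Pandas benchmark measures. Below is a minimal sketch, assuming a `SparkSession` like the `spark` object created earlier. The function name `time_spark_load` and its `force_action` flag are hypothetical, introduced only for illustration.

```python
import timeit


def time_spark_load(spark, path, force_action=True):
    """Time one tab-delimited CSV load, optionally forcing a full scan.

    `spark` is assumed to be a SparkSession. `force_action` is a
    hypothetical flag for this sketch: when True, count() is called so
    that Spark actually reads every row instead of just the header.
    """
    def run():
        df = spark.read.load(path, format="csv", sep="\t", header=True)
        if force_action:
            df.count()  # an action: triggers a scan of the whole file

    return timeit.timeit(run, number=1)


# Example usage (requires a running SparkSession named `spark`):
# lazy_s = time_spark_load(spark, "C:\\Users\\Vikas\\PUDF_base1_1q2012_tab.txt", force_action=False)
# full_s = time_spark_load(spark, "C:\\Users\\Vikas\\PUDF_base1_1q2012_tab.txt")
```

Even with `count()`, the comparison is only approximate: Spark scans the file but does not necessarily materialize all columns in memory the way `pd.read_csv` does.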