
In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sparkContext=sc)

In [None]:
spark

In [2]:
# from a list
rdd = sc.parallelize([1,2,3])
rdd.collect()

[1, 2, 3]

In [3]:
type(rdd)

In [4]:
# from a tuple
rdd = sc.parallelize(('cat', 'dog', 'fish'))
rdd.collect()

['cat', 'dog', 'fish']

In [5]:
# from a list of tuple
list_t = [('cat', 'dog', 'fish'), ('orange', 'apple')]
rdd = sc.parallelize(list_t)
rdd.collect()

[('cat', 'dog', 'fish'), ('orange', 'apple')]

In [6]:
# from a set
s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}
rdd = sc.parallelize(s)
rdd.collect()

['cat', 'fish', 'dog']

In [7]:
# from a dict
d = {
    'a': 100,
    'b': 200,
    'c': 300
}
rdd = sc.parallelize(d)
rdd.collect()

['a', 'b', 'c']

In [8]:
# from numpy array
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9])
rdd = sc.parallelize(arr)
rdd.collect()

[np.int64(1),
 np.int64(2),
 np.int64(3),
 np.int64(4),
 np.int64(5),
 np.int64(6),
 np.int64(7),
 np.int64(8),
 np.int64(9)]

In [None]:
from google.colab import files
files.upload()

{}

In [None]:
mtcars = spark.read.csv(path='./mtcars.csv',
                        sep=',',
                        encoding='UTF-8',
                        comment=None,
                        header=True,
                        inferSchema=True)
mtcars.show(n=5, truncate=False)

+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|_c0              |mpg |cyl|disp |hp |drat|wt   |qsec |vs |am |gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|Mazda RX4        |21.0|6  |160.0|110|3.9 |2.62 |16.46|0  |1  |4   |4   |
|Mazda RX4 Wag    |21.0|6  |160.0|110|3.9 |2.875|17.02|0  |1  |4   |4   |
|Datsun 710       |22.8|4  |108.0|93 |3.85|2.32 |18.61|1  |1  |4   |1   |
|Hornet 4 Drive   |21.4|6  |258.0|110|3.08|3.215|19.44|1  |0  |3   |1   |
|Hornet Sportabout|18.7|8  |360.0|175|3.15|3.44 |17.02|0  |0  |3   |2   |
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows



In [None]:
type(mtcars)

In [None]:
from pyspark.sql import Row
rdd = sc.parallelize([
    Row(x=[1,2,3], y=['a','b','c']),
    Row(x=[4,5,6], y=['e','f','g'])
])
rdd.collect()

[Row(x=[1, 2, 3], y=['a', 'b', 'c']), Row(x=[4, 5, 6], y=['e', 'f', 'g'])]

In [None]:
spark_df = spark.createDataFrame(rdd)
spark_df.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+



In [None]:
type(spark_df)

In [None]:
import pandas as pd
pdf = pd.DataFrame({
    'x': [[1,2,3], [4,5,6]],
    'y': [['a','b','c'], ['e','f','g']]
})
pdf

           x          y
0  [1, 2, 3]  [a, b, c]
1  [4, 5, 6]  [e, f, g]


In [None]:
type(pdf)

In [None]:
df = spark.createDataFrame(pdf)
df.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+



In [None]:
type(df)

In [None]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['letter', 'number'])
df.show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
+------+------+



In [None]:
df.dtypes

[('letter', 'string'), ('number', 'bigint')]

In [None]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['my_column'])
df.show()

+---------+---+
|my_column| _2|
+---------+---+
|        a|  1|
|        b|  2|
+---------+---+



In [None]:
df.dtypes

[('my_column', 'string'), ('_2', 'bigint')]

In [None]:
my_list = [(['a', 1], ['b', 2])]
df = spark.createDataFrame(my_list, ['x', 'y'])
df.show()

+------+------+
|     x|     y|
+------+------+
|[a, 1]|[b, 2]|
+------+------+





## Column instance

Column instances can be created in two ways:

1. directly select a column out of a *DataFrame*: `df.colName`
2. create from a column expression: `df.colName + 1`

Technically, there is only one way to create a column instance. Column expressions start from a column instance.

**Remember how to create column instances, because this is usually the starting point when we want to operate on DataFrame columns.**

The `Column` class comes with methods that operate on a column instance. ***In addition, almost all functions from the `pyspark.sql.functions` module take one or more column instances as argument(s)***. These functions are the main tools for data manipulation.

## DataFrame column methods

### Methods that take column names as arguments:

* `corr(col1, col2)`: two column names.
* `cov(col1, col2)`: two column names.
* `crosstab(col1, col2)`: two column names.
* `describe(*cols)`: ***`*cols` refers to only column names (strings).***

### Methods that take column names or column expressions or **both** as arguments:

* `cube(*cols)`: column names (string) or column expressions or **both**.
* `drop(*cols)`: ***a list of column names OR a single column expression.***
* `groupBy(*cols)`: column name (string) or column expression or **both**.
* `rollup(*cols)`: column name (string) or column expression or **both**.
* `select(*cols)`: column name (string) or column expression or **both**.
* `sort(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sortWithinPartitions(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `orderBy(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sampleBy(col, fractions, seed=None)`: a column name.
* `toDF(*cols)`: **a list of column names (string).**
* `withColumn(colName, col)`: `colName` refers to column name; `col` refers to a column expression.
* `withColumnRenamed(existing, new)`: takes column names as arguments.
* `filter(condition)`: `condition` refers to a column expression that evaluates to values of `types.BooleanType`.

## DataFrame to RDD
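A **DataFrame** can be easily converted to an **RDD** through its `rdd` attribute (`pyspark.sql.DataFrame.rdd`). It is a property rather than a method, so it is used without parentheses. Each element in the returned RDD is a **pyspark.sql.Row** object. A Row is an ordered collection of named fields, similar to a list of key-value pairs.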

In [None]:
mtcars.rdd.take(2)

[Row(_c0='Mazda RX4', mpg=21.0, cyl=6, disp=160.0, hp=110, drat=3.9, wt=2.62, qsec=16.46, vs=0, am=1, gear=4, carb=4),
 Row(_c0='Mazda RX4 Wag', mpg=21.0, cyl=6, disp=160.0, hp=110, drat=3.9, wt=2.875, qsec=17.02, vs=0, am=1, gear=4, carb=4)]

With an RDD object, we can apply a set of mapping functions, such as **map**, **mapValues**, **flatMap**, **flatMapValues** and a lot of other methods that come from RDD.

In [None]:
mtcars_map = mtcars.rdd.map(lambda x: (x['_c0'], x['mpg']))
mtcars_map.take(5)

[('Mazda RX4', 21.0),
 ('Mazda RX4 Wag', 21.0),
 ('Datsun 710', 22.8),
 ('Hornet 4 Drive', 21.4),
 ('Hornet Sportabout', 18.7)]

In [None]:
mtcars_mapvalues = mtcars_map.mapValues(lambda x: [x, x * 10])
mtcars_mapvalues.take(5)

[('Mazda RX4', [21.0, 210.0]),
 ('Mazda RX4 Wag', [21.0, 210.0]),
 ('Datsun 710', [22.8, 228.0]),
 ('Hornet 4 Drive', [21.4, 214.0]),
 ('Hornet Sportabout', [18.7, 187.0])]

## RDD to DataFrame

To convert an RDD to a DataFrame, we can use the `SparkSession.createDataFrame()` function. To get named columns without passing a schema, every element in the RDD **should be a Row object**.

Create an RDD

In [None]:
rdd_raw = sc.textFile('./mtcars.csv')
rdd_raw.take(5)

[',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb',
 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4',
 'Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4',
 'Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1',
 'Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1']

In [None]:
header = rdd_raw.map(lambda x: x.split(',')).filter(lambda x: x[1] == 'mpg').collect()[0]
header[0] = 'model'
header

['model',
 'mpg',
 'cyl',
 'disp',
 'hp',
 'drat',
 'wt',
 'qsec',
 'vs',
 'am',
 'gear',
 'carb']

#### Save the rest to a new RDD

In [None]:
rdd = rdd_raw.map(lambda x: x.split(',')).filter(lambda x: x[1] != 'mpg')
rdd.take(2)

[['Mazda RX4',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.62',
  '16.46',
  '0',
  '1',
  '4',
  '4'],
 ['Mazda RX4 Wag',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.875',
  '17.02',
  '0',
  '1',
  '4',
  '4']]

First we define a function that takes a list of column names and a list of values and creates a Row of key-value pairs. **Since keys in a Row object are passed as keyword-argument names, we can’t simply pass a dictionary to the Row() function**. Instead, we can treat the dictionary as an argument list and unpack it with `**`.

See an example.

In [None]:
from pyspark.sql import Row
my_dict = dict(zip(['a', 'b', 'c'], range(1, 4)))
Row(**my_dict)

Row(a=1, b=2, c=3)

#### Let’s define the function.

In [None]:
def dict_to_row(keys, values):
    row_dict = dict(zip(keys, values))
    return Row(**row_dict)

In [None]:
rdd_rows = rdd.map(lambda x: dict_to_row(header, x))
rdd_rows.take(3)

[Row(model='Mazda RX4', mpg='21', cyl='6', disp='160', hp='110', drat='3.9', wt='2.62', qsec='16.46', vs='0', am='1', gear='4', carb='4'),
 Row(model='Mazda RX4 Wag', mpg='21', cyl='6', disp='160', hp='110', drat='3.9', wt='2.875', qsec='17.02', vs='0', am='1', gear='4', carb='4'),
 Row(model='Datsun 710', mpg='22.8', cyl='4', disp='108', hp='93', drat='3.85', wt='2.32', qsec='18.61', vs='1', am='1', gear='4', carb='1')]

In [None]:
# check
type(rdd_rows)

In [None]:
df = spark.createDataFrame(rdd_rows)
df.show(5)

+-----------------+----+---+----+---+----+-----+-----+---+---+----+----+
|            model| mpg|cyl|disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+----+---+----+-----+-----+---+---+----+----+
|        Mazda RX4|  21|  6| 160|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|  21|  6| 160|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4| 108| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6| 258|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8| 360|175|3.15| 3.44|17.02|  0|  0|   3|   2|
+-----------------+----+---+----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows
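Because the values come from `split(',')`, every field in the rows above is a string. One possible way to get numeric columns is to convert the values before building each Row. The sketch below is plain Python; the names `COLUMN_TYPES` and `typed_row_dict` are hypothetical helpers, not part of the notebook, and only a few columns are listed.

```python
# Hypothetical mapping from column name to the Python type it should hold.
# Giving each column one fixed type avoids mixing int and float in a column.
COLUMN_TYPES = {"mpg": float, "cyl": int, "hp": int}


def typed_row_dict(keys, values, column_types=COLUMN_TYPES):
    """Zip keys and values, converting the columns listed in column_types."""
    return {
        key: column_types.get(key, str)(value)
        for key, value in zip(keys, values)
    }


record = typed_row_dict(
    ["model", "mpg", "cyl", "hp"],
    ["Mazda RX4", "21", "6", "110"],
)
print(record)
```

Passing `record` to `Row(**record)` would then give a Row with one float and two integer fields instead of strings.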



## Merge and split columns

Sometimes we need to merge multiple columns in a DataFrame into one column, or split a column into multiple columns. We can easily achieve this by converting a DataFrame to an RDD, applying map functions to manipulate elements, and then converting the RDD back to a DataFrame.

In [None]:
# adjust first column name
colnames = mtcars.columns
colnames[0] = 'model'
mtcars = mtcars.rdd.toDF(colnames)
mtcars.show(5)

+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|            model| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows



### Merge multiple columns

We convert the DataFrame to an RDD and then apply the **map** function to merge values and convert elements to **Row** objects.

In [None]:
type(mtcars)

In [None]:
from pyspark.sql import Row
mtcars_rdd = mtcars.rdd.map(lambda x: Row(model=x[0], values=x[1:]))
mtcars_rdd.take(5)

[Row(model='Mazda RX4', values=(21.0, 6, 160.0, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4)),
 Row(model='Mazda RX4 Wag', values=(21.0, 6, 160.0, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)),
 Row(model='Datsun 710', values=(22.8, 4, 108.0, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1)),
 Row(model='Hornet 4 Drive', values=(21.4, 6, 258.0, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1)),
 Row(model='Hornet Sportabout', values=(18.7, 8, 360.0, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2))]

In [None]:
type(mtcars_rdd)

Then we create a new DataFrame from the obtained RDD.

In [None]:
mtcars_df = spark.createDataFrame(mtcars_rdd)
mtcars_df.show(5, truncate=False)

+-----------------+-----------------------------------------------------+
|model            |values                                               |
+-----------------+-----------------------------------------------------+
|Mazda RX4        |{21.0, 6, 160.0, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4}  |
|Mazda RX4 Wag    |{21.0, 6, 160.0, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4} |
|Datsun 710       |{22.8, 4, 108.0, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1}  |
|Hornet 4 Drive   |{21.4, 6, 258.0, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1}|
|Hornet Sportabout|{18.7, 8, 360.0, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2} |
+-----------------+-----------------------------------------------------+
only showing top 5 rows



### Split one column

We use the above DataFrame as our example data. Again, we need to convert the DataFrame to an RDD to achieve our goal.

Let's split the **values** column into two columns: x1 and x2. The first 5 values will be in column **x1** and the remaining values will be in column **x2**.

In [None]:
mtcars_rdd_2 = mtcars_df.rdd.map(lambda x: Row(model=x[0], x1=x[1][:5], x2=x[1][5:]))
# convert RDD back to DataFrame
mtcars_df_2 = spark.createDataFrame(mtcars_rdd_2)
mtcars_df_2.show(5, truncate=False)

+-----------------+---------------------------+--------------------------+
|model            |x1                         |x2                        |
+-----------------+---------------------------+--------------------------+
|Mazda RX4        |{21.0, 6, 160.0, 110, 3.9} |{2.62, 16.46, 0, 1, 4, 4} |
|Mazda RX4 Wag    |{21.0, 6, 160.0, 110, 3.9} |{2.875, 17.02, 0, 1, 4, 4}|
|Datsun 710       |{22.8, 4, 108.0, 93, 3.85} |{2.32, 18.61, 1, 1, 4, 1} |
|Hornet 4 Drive   |{21.4, 6, 258.0, 110, 3.08}|{3.215, 19.44, 1, 0, 3, 1}|
|Hornet Sportabout|{18.7, 8, 360.0, 175, 3.15}|{3.44, 17.02, 0, 0, 3, 2} |
+-----------------+---------------------------+--------------------------+
only showing top 5 rows



In [None]:
mtcars_rdd_3 = mtcars_df.rdd.map(lambda x: Row(model=x[0], x1=x[1][:7], x2=x[1][7:]))
# convert RDD back to DataFrame
mtcars_df_3 = spark.createDataFrame(mtcars_rdd_3)
mtcars_df_3.show(5, truncate=False)

+-----------------+-----------------------------------------+------------+
|model            |x1                                       |x2          |
+-----------------+-----------------------------------------+------------+
|Mazda RX4        |{21.0, 6, 160.0, 110, 3.9, 2.62, 16.46}  |{0, 1, 4, 4}|
|Mazda RX4 Wag    |{21.0, 6, 160.0, 110, 3.9, 2.875, 17.02} |{0, 1, 4, 4}|
|Datsun 710       |{22.8, 4, 108.0, 93, 3.85, 2.32, 18.61}  |{1, 1, 4, 1}|
|Hornet 4 Drive   |{21.4, 6, 258.0, 110, 3.08, 3.215, 19.44}|{1, 0, 3, 1}|
|Hornet Sportabout|{18.7, 8, 360.0, 175, 3.15, 3.44, 17.02} |{0, 0, 3, 2}|
+-----------------+-----------------------------------------+------------+
only showing top 5 rows



In [None]:
mtcars_rdd_4 = mtcars_df.rdd.map(lambda x: Row(model=x[0], x1=x[1][:5], x2=x[1][5:8], x3=x[1][8:]))
# convert RDD back to DataFrame
mtcars_df_4 = spark.createDataFrame(mtcars_rdd_4)
mtcars_df_4.show(5, truncate=False)

+-----------------+---------------------------+-----------------+---------+
|model            |x1                         |x2               |x3       |
+-----------------+---------------------------+-----------------+---------+
|Mazda RX4        |{21.0, 6, 160.0, 110, 3.9} |{2.62, 16.46, 0} |{1, 4, 4}|
|Mazda RX4 Wag    |{21.0, 6, 160.0, 110, 3.9} |{2.875, 17.02, 0}|{1, 4, 4}|
|Datsun 710       |{22.8, 4, 108.0, 93, 3.85} |{2.32, 18.61, 1} |{1, 4, 1}|
|Hornet 4 Drive   |{21.4, 6, 258.0, 110, 3.08}|{3.215, 19.44, 1}|{0, 3, 1}|
|Hornet Sportabout|{18.7, 8, 360.0, 175, 3.15}|{3.44, 17.02, 0} |{0, 3, 2}|
+-----------------+---------------------------+-----------------+---------+
only showing top 5 rows



### Exercise
1. Split `mtcars_df` into 5 columns, where the first column is `model`. Call the new DataFrame `mtcars_df_5`.
2. Merge all columns of `mtcars_df_5` (except the first column, `model`) into one column `X`. Call the new DataFrame `mtcars_df_6`.
3. Split `mtcars_df` into several columns, where the first column is `model`. From the second column onward, each column holds three values, and the last column holds the remaining values. Call the new DataFrame `mtcars_df_7`.


In [None]:
#Split mtcars_df into 5 columns given that the first column is model,
#call the new dataframe as mtcars_df_5.
mtcars_rdd_5 = mtcars_df.rdd.map(lambda x: Row(model=x[0], x1=x[1][:2], x2=x[1][2:4], x3=x[1][4:6], x4=x[1][6:]))
mtcars_df_5 = spark.createDataFrame(mtcars_rdd_5)
mtcars_df_5.show(5, truncate=False)

+-----------------+---------+------------+-------------+-------------------+
|model            |x1       |x2          |x3           |x4                 |
+-----------------+---------+------------+-------------+-------------------+
|Mazda RX4        |{21.0, 6}|{160.0, 110}|{3.9, 2.62}  |{16.46, 0, 1, 4, 4}|
|Mazda RX4 Wag    |{21.0, 6}|{160.0, 110}|{3.9, 2.875} |{17.02, 0, 1, 4, 4}|
|Datsun 710       |{22.8, 4}|{108.0, 93} |{3.85, 2.32} |{18.61, 1, 1, 4, 1}|
|Hornet 4 Drive   |{21.4, 6}|{258.0, 110}|{3.08, 3.215}|{19.44, 1, 0, 3, 1}|
|Hornet Sportabout|{18.7, 8}|{360.0, 175}|{3.15, 3.44} |{17.02, 0, 0, 3, 2}|
+-----------------+---------+------------+-------------+-------------------+
only showing top 5 rows



In [None]:
# Merge all columns (except the first column, model) of mtcars_df_5 into one column X.
# Call the new DataFrame mtcars_df_6.
mtcars_rdd_6 = mtcars_df_5.rdd.map(lambda x: Row(model=x[0], X=x[1:]))
mtcars_df_6 = spark.createDataFrame(mtcars_rdd_6)
mtcars_df_6.show(5, truncate=False)

+-----------------+-------------------------------------------------------------+
|model            |X                                                            |
+-----------------+-------------------------------------------------------------+
|Mazda RX4        |{{21.0, 6}, {160.0, 110}, {3.9, 2.62}, {16.46, 0, 1, 4, 4}}  |
|Mazda RX4 Wag    |{{21.0, 6}, {160.0, 110}, {3.9, 2.875}, {17.02, 0, 1, 4, 4}} |
|Datsun 710       |{{22.8, 4}, {108.0, 93}, {3.85, 2.32}, {18.61, 1, 1, 4, 1}}  |
|Hornet 4 Drive   |{{21.4, 6}, {258.0, 110}, {3.08, 3.215}, {19.44, 1, 0, 3, 1}}|
|Hornet Sportabout|{{18.7, 8}, {360.0, 175}, {3.15, 3.44}, {17.02, 0, 0, 3, 2}} |
+-----------------+-------------------------------------------------------------+
only showing top 5 rows



In [None]:
# Split mtcars_df into several columns, where the first column is model.
# From the second column onward, each column holds three values, and the last column holds the remaining values.
# Call the new DataFrame mtcars_df_7.

mtcars_rdd_7 = mtcars_df.rdd.map(lambda x: Row(model=x[0], x1=x[1][:2], x2=x[1][2:5], x3=x[1][5:8], x4=x[1][8:]))
mtcars_df_7 = spark.createDataFrame(mtcars_rdd_7)
mtcars_df_7.show(5, truncate=False)


+-----------------+---------+------------------+-----------------+---------+
|model            |x1       |x2                |x3               |x4       |
+-----------------+---------+------------------+-----------------+---------+
|Mazda RX4        |{21.0, 6}|{160.0, 110, 3.9} |{2.62, 16.46, 0} |{1, 4, 4}|
|Mazda RX4 Wag    |{21.0, 6}|{160.0, 110, 3.9} |{2.875, 17.02, 0}|{1, 4, 4}|
|Datsun 710       |{22.8, 4}|{108.0, 93, 3.85} |{2.32, 18.61, 1} |{1, 4, 1}|
|Hornet 4 Drive   |{21.4, 6}|{258.0, 110, 3.08}|{3.215, 19.44, 1}|{0, 3, 1}|
|Hornet Sportabout|{18.7, 8}|{360.0, 175, 3.15}|{3.44, 17.02, 0} |{0, 3, 2}|
+-----------------+---------+------------------+-----------------+---------+
only showing top 5 rows
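Writing slice boundaries by hand becomes error-prone as the number of columns grows. As a sketch of one reading of exercise 3, in which every column after `model` holds three values and the last one keeps the remainder, the helper below computes the chunks. The name `chunk_values` is hypothetical. Note that this grouping differs from the slices in the cell above, whose first chunk holds two values.

```python
def chunk_values(values, size):
    """Split a sequence into consecutive chunks of `size`.

    The last chunk holds whatever is left over, so it may be shorter.
    """
    return [tuple(values[i:i + size]) for i in range(0, len(values), size)]


# The values of the first mtcars row, as shown in the merged DataFrame.
values = (21.0, 6, 160.0, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4)
chunks = chunk_values(values, 3)

# Name the chunks x1, x2, ... so they can be unpacked as keyword arguments.
named = {f"x{i}": chunk for i, chunk in enumerate(chunks, start=1)}
print(named)
```

Inside the `map` call, `Row(model=x[0], **named)` would then build the row, in the same way `dict_to_row` unpacks a dictionary.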



## Exercise

Do the same thing for the Titanic dataset.

In [10]:
rdd_raw_titanic = sc.textFile('./kaggle-titanic-test.csv')
rdd_raw_titanic.take(5)


['PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
 '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q',
 '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S',
 '894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q',
 '895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S']
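Unlike mtcars, this file has quoted fields that contain commas (the `Name` column), so `split(',')` does not return one item per column. The sketch below compares it with Python's standard `csv` module on one of the lines shown above:

```python
import csv

line = '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q'

# Plain split cuts the quoted name in two.
naive = line.split(",")

# csv.reader expects an iterable of lines, so the single line is wrapped in a list.
parsed = next(csv.reader([line]))

print(len(naive), len(parsed))
print(parsed[2])
```

The header has 11 columns; `parsed` has 11 fields while `naive` has 12, which would shift every value after the name by one position.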

In [25]:
from pyspark.sql import Row
def dict_to_row(keys, values):
    row_dict = dict(zip(keys, values))
    return Row(**row_dict)

In [14]:
import csv  # csv.reader keeps quoted fields such as "Kelly, Mr. James" intact
rdd_titanic = rdd_raw_titanic.map(lambda x: next(csv.reader([x])))

In [26]:
# The first line of the file is the header; first() avoids collecting the whole RDD.
header_titanic = rdd_raw_titanic.first()
# Keep every line except the header.
rdd_no_header = rdd_raw_titanic.filter(lambda x: x != header_titanic)