<font color=red>
Spark_Version:2.0.1<br/>
Python_Version:Python 3.5.2 | Anaconda4.2.0(64-bit)<br/>
Jupyter_Version:4.2.3  &emsp;Kernel:Python[default]<br/>
System:Ubuntu 16.04 LTS(64-bit)
</font>

In [1]:
import platform
print("Spark_Version:",sc.version)
print("Python_Version:",platform.python_version())
print("System:",platform.system())

Spark_Version: 2.0.1
Python_Version: 3.5.2
System: Linux


## Data acquisition
<p><strong>Data acquisition</strong>, or <strong>data collection</strong>, is the very first step in any data science project. <font color=red>Usually, you won't find the complete set of required data in one place, as it is distributed across line-of-business (LOB) applications and systems.</font></p>
<p>The majority of this section has already been covered in the previous chapter, which outlined how to source data from different data sources and store the data in DataFrames for easier analysis. There is a built-in mechanism in Spark to fetch data from some of the common data sources and the Data Source API is provided for the ones not supported out of the box on Spark.</p><p>To get a better understanding of the data acquisition and preparation phases, let us assume a scenario and try to address all the steps involved with example code snippets. The scenario is such that employee data is present across native RDDs, JSON files, and on a SQL server. So, let's see how we can get those to Spark DataFrames:</p>

In [2]:
#From RDD: Create an RDD and convert to DataFrame
employees = sc.parallelize([(1, 'John', 25), (2, 'Ray', 35), (3, 'Mike', 24), 
                            (4, 'Jane', 28), (5, 'Kevin', 26), (6, 'Vincent', 35), 
                            (7, 'James', 38), (8, 'Shane', 32), (9, 'Larry', 29), 
                            (10, 'Kimberly', 29), (11, 'Alex', 28), (12, 'Garry', 25), 
                            (13, 'Max', 31)]).toDF(['emp_id', 'name', 'age'])

In [3]:
employees.show()

+------+--------+---+
|emp_id|    name|age|
+------+--------+---+
|     1|    John| 25|
|     2|     Ray| 35|
|     3|    Mike| 24|
|     4|    Jane| 28|
|     5|   Kevin| 26|
|     6| Vincent| 35|
|     7|   James| 38|
|     8|   Shane| 32|
|     9|   Larry| 29|
|    10|Kimberly| 29|
|    11|    Alex| 28|
|    12|   Garry| 25|
|    13|     Max| 31|
+------+--------+---+



In [4]:
#From JSON: reading a JSON file
salary = sqlContext.read.json("resource/salary.json")
designation = sqlContext.read.json("resource/designation.json")

In [5]:
salary.show()

+----+------+
|e_id|salary|
+----+------+
|   1| 10000|
|   2| 12000|
|   3| 12000|
|   4|  null|
|   5|   120|
|   6| 22000|
|   7| 20000|
|   8| 12000|
|   9| 10000|
|  10|  8000|
|  11| 12000|
|  12| 12000|
|  13|120000|
+----+------+



In [6]:
designation.show()

+---+--------------+
| id|          role|
+---+--------------+
|  1|     Associate|
|  2|       Manager|
|  3|       Manager|
|  4|     Associate|
|  5|       Manager|
|  6|Senior Manager|
|  7|Senior Manager|
|  8|       Manager|
|  9|       Manager|
| 10|     Associate|
| 11|       Manager|
| 12|       Manager|
| 13|       Manager|
+---+--------------+
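The contents of `resource/salary.json` are not shown in this notebook. By default, Spark's JSON reader in this version expects one JSON object per line (often called JSON Lines) rather than a single top-level array. The following is a minimal standard-library sketch of that layout; the records are hypothetical and only mirror the `e_id` and `salary` columns seen above.

```python
import json

# Hypothetical records mirroring a few rows of the salary table above;
# the real salary.json is not reproduced in this notebook.
records = [
    {"e_id": 1, "salary": 10000},
    {"e_id": 2, "salary": 12000},
    {"e_id": 4, "salary": None},  # a missing salary is serialized as JSON null
]

# JSON Lines: one self-contained JSON object per line
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)

# Reading it back line by line, much as a line-oriented reader would
parsed = [json.loads(line) for line in json_lines.splitlines()]
```

A `null` value like the one for employee 4 is one plausible source of the `null` salary shown in the table.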



## Data preparation
<p>Data quality has always been a pervasive problem in the industry. Incorrect or inconsistent data can produce misleading results in your analysis. Implementing a better algorithm or building better models will not help much if the data is not cleansed and prepared well for the requirement at hand. The industry term <strong>data engineering</strong> refers to <strong>data sourcing</strong> and <strong>preparation</strong>. This is typically done by data scientists, and in a few organizations there is a dedicated team for this purpose. However, <font color=red>while preparing data, a scientific perspective is often needed to do it right. As an example, rather than simply applying mean substitution to treat missing values, you may need to look into the data distribution to find more appropriate values to substitute. Similarly, looking at a box plot or scatter plot of a single variable may not be enough to find outliers, as there could be multivariate outliers that are not visible when you plot a single variable.</font> There are different approaches, such as <strong>Gaussian Mixture Models (GMMs)</strong> and <strong>Expectation Maximization (EM)</strong> algorithms, that use <strong>Mahalanobis distance</strong> to look for multivariate outliers.</p>
<p>The data preparation phase is an extremely important phase, not only for the algorithms to work properly, but also for you to develop a better understanding of your data so that you can take the right approach while implementing an algorithm.</p>
<p>Once the data has been acquired from different sources, the next step is to consolidate them all so that the data as a whole can be cleaned, formatted, and transformed to the format needed for your analysis. Please note that you might have to take samples of data from the sources, depending on the scenario, and then prepare the data for further analysis. Various sampling techniques that can be used are discussed later in this chapter.</p>
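To make the idea of a multivariate outlier concrete, here is a minimal sketch that computes the Mahalanobis distance directly for two variables. It is not a GMM or EM implementation, and the (age, years of experience) points are hypothetical. The last point has an unremarkable age and an unremarkable experience value on their own, but the combination is unusual.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Population covariance matrix entries
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for the 2x2 case
        d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d_squared))
    return distances

# Hypothetical (age, years_of_experience) pairs; the last one is the odd one out
points = [(22, 1), (25, 3), (28, 6), (31, 9), (35, 13),
          (38, 16), (41, 19), (45, 22), (25, 18)]

distances = mahalanobis_2d(points)
most_unusual = points[distances.index(max(distances))]
print(most_unusual)
```

A single-variable box plot of either column would not flag this point, because both of its values lie inside the range of the other rows.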

## Data consolidation
In this section, we will take a look at how to combine data acquired from various data sources:

In [7]:
#Creating the final data matrix using the join operation
final_data = employees.join(salary, employees.emp_id == salary.e_id).\
                       join(designation, employees.emp_id == designation.id).\
                       select("emp_id", "name", "age", "role", "salary")
final_data.show(5)

+------+-----+---+---------+------+
|emp_id| name|age|     role|salary|
+------+-----+---+---------+------+
|     1| John| 25|Associate| 10000|
|     2|  Ray| 35|  Manager| 12000|
|     3| Mike| 24|  Manager| 12000|
|     4| Jane| 28|Associate|  null|
|     5|Kevin| 26|  Manager|   120|
+------+-----+---+---------+------+
only showing top 5 rows
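The `join` calls above use Spark's default inner join, so only keys present on both sides survive. The following plain-Python sketch illustrates those semantics; the helper name `inner_join` and the extra employee with `emp_id` 14 are hypothetical and are not part of the dataset above.

```python
def inner_join(left, right, left_key, right_key):
    """Pair up rows whose key values match; unmatched rows are dropped."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[left_key], []):
            merged = dict(left_row)
            merged.update(right_row)
            joined.append(merged)
    return joined

employees_rows = [
    {"emp_id": 1, "name": "John"},
    {"emp_id": 2, "name": "Ray"},
    {"emp_id": 14, "name": "Pat"},  # hypothetical employee with no salary record
]
salary_rows = [
    {"e_id": 1, "salary": 10000},
    {"e_id": 2, "salary": 12000},
]

joined_rows = inner_join(employees_rows, salary_rows, "emp_id", "e_id")
print([row["name"] for row in joined_rows])
```

If rows without a match need to be kept, an outer join is the usual alternative, at the cost of introducing nulls that then need treatment.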



## Data cleansing
<p>Once you have the data consolidated in one place, it is extremely important that you spend enough time and effort in cleaning it before analyzing it. This is an iterative process because you have to validate the actions you have taken on the data and continue till you are satisfied with the data quality. It is advisable that you spend time analyzing the causes of anomalies you detect in the data.</p>
<p>Some level of impurity usually exists in any dataset. There can be various kinds of issues with data, but we are going to address a few common cases, such as <strong>missing values</strong>, <strong>duplicate values</strong>, and <strong>transforming</strong> or <strong>formatting</strong> (adding or removing digits from a number, splitting a column into two, merging two columns into one).</p>

## Missing value treatment

<p>There are various ways of handling missing values. One way is to drop rows containing missing values. We may want to drop a row even if a single column has a missing value, or we may have different strategies for different columns. We may want to retain the row as long as the total number of missing values in that row is under a threshold. Another approach is to replace nulls with a constant value, say the mean value in the case of numeric variables.</p>
<p>In this section, we will provide some examples in Python and will try to cover various scenarios to give you a broader perspective.</p>

In [8]:
#Dropping rows with missing value(s)
clean_data = final_data.na.drop()
clean_data.show()

+------+--------+---+--------------+------+
|emp_id|    name|age|          role|salary|
+------+--------+---+--------------+------+
|     1|    John| 25|     Associate| 10000|
|     2|     Ray| 35|       Manager| 12000|
|     3|    Mike| 24|       Manager| 12000|
|     5|   Kevin| 26|       Manager|   120|
|     6| Vincent| 35|Senior Manager| 22000|
|     7|   James| 38|Senior Manager| 20000|
|     8|   Shane| 32|       Manager| 12000|
|     9|   Larry| 29|       Manager| 10000|
|    10|Kimberly| 29|     Associate|  8000|
|    11|    Alex| 28|       Manager| 12000|
|    12|   Garry| 25|       Manager| 12000|
|    13|     Max| 31|       Manager|120000|
+------+--------+---+--------------+------+



In [9]:
#Replacing missing value by mean
import math
from pyspark.sql import functions as F
mean_salary = math.floor(salary.select(F.mean('salary')).collect()[0][0])

In [10]:
mean_salary

20843

In [11]:
clean_data = final_data.na.fill({'salary' : mean_salary})
clean_data.show()

+------+--------+---+--------------+------+
|emp_id|    name|age|          role|salary|
+------+--------+---+--------------+------+
|     1|    John| 25|     Associate| 10000|
|     2|     Ray| 35|       Manager| 12000|
|     3|    Mike| 24|       Manager| 12000|
|     4|    Jane| 28|     Associate| 20843|
|     5|   Kevin| 26|       Manager|   120|
|     6| Vincent| 35|Senior Manager| 22000|
|     7|   James| 38|Senior Manager| 20000|
|     8|   Shane| 32|       Manager| 12000|
|     9|   Larry| 29|       Manager| 10000|
|    10|Kimberly| 29|     Associate|  8000|
|    11|    Alex| 28|       Manager| 12000|
|    12|   Garry| 25|       Manager| 12000|
|    13|     Max| 31|       Manager|120000|
+------+--------+---+--------------+------+
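Mean substitution is sensitive to extreme values, which is why it is worth looking at the distribution first. As a hedged illustration using only the standard library, the sketch below compares the mean with the median for the non-null salaries listed in the table above. The median is a swapped-in alternative here, not the method used in this notebook.

```python
import math
import statistics

# Non-null salaries from the salary table above (employee 4 has no value)
salaries = [10000, 12000, 12000, 120, 22000, 20000,
            12000, 10000, 8000, 12000, 12000, 120000]

mean_salary = math.floor(statistics.mean(salaries))
median_salary = statistics.median(salaries)

print(mean_salary)    # pulled upward by the single 120000 entry
print(median_salary)  # closer to what most rows in this table look like
```

Which substitute is more appropriate depends on the data and the downstream analysis.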



In [12]:
#Another example for missing value treatment
authors = [['Thomas', 'Hardy', 'June 2, 1840'], 
           ['Charles', 'Dickens', '7 February 1812'],
           ['Mark', 'Twain', None],
           ['Jane', 'Austen', '16 December 1775'],
           ['Emily', None, None]]
df1 = sc.parallelize(authors).toDF(['FirstName', 'LastName', 'Dob'])
df1.show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|    June 2, 1840|
|  Charles| Dickens| 7 February 1812|
|     Mark|   Twain|            null|
|     Jane|  Austen|16 December 1775|
|    Emily|    null|            null|
+---------+--------+----------------+



In [13]:
#Drop rows with missing values
df1.na.drop().show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|    June 2, 1840|
|  Charles| Dickens| 7 February 1812|
|     Jane|  Austen|16 December 1775|
+---------+--------+----------------+



In [14]:
#Keep only rows with at least 2 non-null values (rows with fewer are dropped)
df1.na.drop(thresh=2).show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|    June 2, 1840|
|  Charles| Dickens| 7 February 1812|
|     Mark|   Twain|            null|
|     Jane|  Austen|16 December 1775|
+---------+--------+----------------+
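The `thresh` argument counts non-null values, not missing ones: a row is kept only if it has at least `thresh` non-null entries. A plain-Python sketch of that rule, using a hypothetical helper name and the same authors data, may help to read the output above:

```python
authors = [['Thomas', 'Hardy', 'June 2, 1840'],
           ['Charles', 'Dickens', '7 February 1812'],
           ['Mark', 'Twain', None],
           ['Jane', 'Austen', '16 December 1775'],
           ['Emily', None, None]]

def drop_sparse_rows(rows, thresh):
    """Keep rows that have at least `thresh` non-None values."""
    return [row for row in rows
            if sum(value is not None for value in row) >= thresh]

kept_two = drop_sparse_rows(authors, thresh=2)    # only the Emily row is dropped
kept_three = drop_sparse_rows(authors, thresh=3)  # only fully populated rows stay
print(len(kept_two), len(kept_three))
```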



In [15]:
#Fill all missing values with a given string
df1.na.fill("Unknown").show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|    June 2, 1840|
|  Charles| Dickens| 7 February 1812|
|     Mark|   Twain|         Unknown|
|     Jane|  Austen|16 December 1775|
|    Emily| Unknown|         Unknown|
+---------+--------+----------------+



In [16]:
#Fill missing values in each column with a given string
df1.na.fill({'LastName':'--','Dob':'Unknown'}).show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|    June 2, 1840|
|  Charles| Dickens| 7 February 1812|
|     Mark|   Twain|         Unknown|
|     Jane|  Austen|16 December 1775|
|    Emily|      --|         Unknown|
+---------+--------+----------------+



## Outlier treatment

Understanding what an outlier is is also important in order to treat it well. To put it simply, an outlier is a data point that does not share the same characteristics as the rest of the data points. For example, if you have a dataset of schoolchildren and there are a few age values in the range of 30-40, then they could be outliers. Let us look at a different example: if you have a dataset where a variable can have data points only in two ranges, say, 10-20 or 80-90, then data points with values in between these two ranges (say, 40 or 55) could also be outliers. In this example, 40 and 55 belong neither to the 10-20 range nor to the 80-90 range, and so are outliers. Also, there can be **univariate outliers** and **multivariate outliers**. We will focus on univariate outliers in this book for simplicity's sake, as Spark MLlib may not have all the algorithms needed at the time of writing this book.

In order to treat outliers, you first have to see whether there are any. There are different ways to find outliers, such as summary statistics and plotting techniques. You can use libraries such as **matplotlib** in Python to visualize your data. You can do so by connecting to Spark through a notebook (for example, **Jupyter**) so that the visuals can be generated, which may not be possible on a command shell. Once you find outliers, you can either delete the rows containing them, impute the **mean value** in their place, or do something more relevant to your case. Let us have a look at the **mean substitution** method here:

In [17]:
# Identify outliers and replace them with mean
# The following example reuses the final_data dataset and mean_salary computed in previous examples
mean_salary

20843

In [18]:
#Compute deviation for each row
devs = final_data.select(((final_data.salary - mean_salary) ** 2).alias("deviation"))
devs.show()

+-------------+
|    deviation|
+-------------+
| 1.17570649E8|
|  7.8198649E7|
|  7.8198649E7|
|         null|
| 4.29442729E8|
|    1338649.0|
|     710649.0|
|  7.8198649E7|
| 1.17570649E8|
| 1.64942649E8|
|  7.8198649E7|
|  7.8198649E7|
|9.832110649E9|
+-------------+



In [19]:
#Compute standard deviation
stddev = math.floor(math.sqrt(devs.groupBy().avg("deviation").first()[0]))
stddev

30351

In [20]:
round(stddev, 2)

30351

<dl>
<dt>
<tt>pyspark.sql.functions.</tt><tt>when</tt><big>(</big><em>condition</em>, <em>value</em><big>)</big></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <tt>Column.otherwise()</tt> is not invoked, None is returned for unmatched conditions.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>condition</strong> &#8211; a boolean <tt class="xref py py-class docutils literal"><span class="pre">Column</span></tt> expression.</li>
<li><strong>value</strong> &#8211; a literal value, or a <tt class="xref py py-class docutils literal"><span class="pre">Column</span></tt> expression.</li>
</ul>
</td>
</tr>
</tbody>
</table>

<pre>
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]

>>> df.select(when(df.age == 2, df.age + 1).alias("age")).collect()
[Row(age=3), Row(age=None)]

</pre>

</dd></dl>

In [21]:
# Replace outliers beyond 2 standard deviations with the mean salary
no_outlier = final_data.select(final_data.emp_id, final_data.name, 
                               final_data.age, final_data.salary, final_data.role,
                               F.when(final_data.salary.between(mean_salary-(2*stddev),mean_salary+(2*stddev)),
                                    final_data.salary).otherwise(mean_salary).alias("updated_salary"))

In [22]:
# Observe modified values
no_outlier.filter(no_outlier.salary != no_outlier.updated_salary).show()

+------+----+---+------+-------+--------------+
|emp_id|name|age|salary|   role|updated_salary|
+------+----+---+------+-------+--------------+
|    13| Max| 31|120000|Manager|         20843|
+------+----+---+------+-------+--------------+



In [23]:
no_outlier_devs = no_outlier.select(((no_outlier.updated_salary - mean_salary) ** 2).alias("no_outlier_deviation"))
no_outlier_devs.show()

+--------------------+
|no_outlier_deviation|
+--------------------+
|        1.17570649E8|
|         7.8198649E7|
|         7.8198649E7|
|                 0.0|
|        4.29442729E8|
|           1338649.0|
|            710649.0|
|         7.8198649E7|
|        1.17570649E8|
|        1.64942649E8|
|         7.8198649E7|
|         7.8198649E7|
|                 0.0|
+--------------------+



In [24]:
no_outlier_stddev = math.floor(math.sqrt(no_outlier_devs.groupBy().avg("no_outlier_deviation").first()[0]))
no_outlier_stddev

9697
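As a cross-check, the same steps can be reproduced in plain Python. This is a sketch that mirrors the cells above, not a replacement for them. It assumes, as the Spark output suggests, that the first average skips the null salary (12 values) while the second covers all 13 rows after substitution.

```python
import math

# Salaries in emp_id order; None stands for the null salary of employee 4
salaries = [10000, 12000, 12000, None, 120, 22000, 20000,
            12000, 10000, 8000, 12000, 12000, 120000]

known = [s for s in salaries if s is not None]
mean_salary = math.floor(sum(known) / len(known))

def floor_stddev(values, center):
    """Floor of the square root of the mean squared deviation from `center`."""
    return math.floor(math.sqrt(sum((v - center) ** 2 for v in values) / len(values)))

stddev = floor_stddev(known, mean_salary)

# Keep values within two standard deviations; everything else gets the mean
low, high = mean_salary - 2 * stddev, mean_salary + 2 * stddev
updated = [s if s is not None and low <= s <= high else mean_salary
           for s in salaries]

no_outlier_stddev = floor_stddev(updated, mean_salary)
print(stddev, no_outlier_stddev)
```

Note that the null salary also falls through to the `otherwise` branch, so it is replaced by the mean in `updated_salary` as well, even though the filter in cell 22 does not list it.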

## Duplicate values treatment
There are different ways of treating the duplicate records in a dataset. We will demonstrate those in the following code snippets:

In [25]:
#Deleting the duplicate rows
authors = [['Thomas', 'Hardy', 'June 2,1840'],
           ['Thomas', 'Hardy', 'June 2,1840'],
           ['Thomas', 'H', None], 
           ['Jane', 'Austen', '16 December 1775'],
           ['Emily', None,None]]
df1 = sc.parallelize(authors).toDF(["FirstName","LastName","Dob"])
df1.show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|     June 2,1840|
|   Thomas|   Hardy|     June 2,1840|
|   Thomas|       H|            null|
|     Jane|  Austen|16 December 1775|
|    Emily|    null|            null|
+---------+--------+----------------+



In [26]:
# Drop duplicated rows
df1.dropDuplicates().show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|     Jane|  Austen|16 December 1775|
|    Emily|    null|            null|
|   Thomas|   Hardy|     June 2,1840|
|   Thomas|       H|            null|
+---------+--------+----------------+



In [27]:
# Drop duplicates based on a subset of columns
df1.dropDuplicates(subset=['FirstName']).show()

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|    Emily|    null|            null|
|     Jane|  Austen|16 December 1775|
|   Thomas|   Hardy|     June 2,1840|
+---------+--------+----------------+
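Deduplicating on a subset of columns keeps one row per distinct key and discards the rest, so information in the other columns can be lost. The sketch below shows the idea in plain Python with a hypothetical helper that keeps the first row seen. As far as I know, Spark does not guarantee which of the duplicate rows it retains, so treat the "first seen" rule as an assumption of this sketch only.

```python
authors = [['Thomas', 'Hardy', 'June 2,1840'],
           ['Thomas', 'Hardy', 'June 2,1840'],
           ['Thomas', 'H', None],
           ['Jane', 'Austen', '16 December 1775'],
           ['Emily', None, None]]

def drop_duplicates(rows, key_indexes=None):
    """Keep the first row for each distinct key (whole row if no key given)."""
    seen = set()
    result = []
    for row in rows:
        if key_indexes is None:
            key = tuple(row)
        else:
            key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

unique_rows = drop_duplicates(authors)                      # whole-row comparison
unique_first_names = drop_duplicates(authors, key_indexes=[0])  # FirstName only
print(len(unique_rows), len(unique_first_names))
```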



## Data transformation

There can be various kinds of data transformation needs, and almost every case is unique. We are going to cover some basic types of transformations, as follows:
* Merging two columns into one
* Adding characters/numbers to the existing ones
* Deleting or replacing characters/numbers from the existing ones
* Changing date formats

<dl>
    <dt><tt>withColumn</tt><big>(</big><em>colName</em>, <em>col</em><big>)</big></dt>
    <dd>
    <p>Returns a new <tt >DataFrame</tt> by adding a column or replacing the existing column that has the same name.</p>

<table frame="void" rules="none">
<tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>colName</strong> &#8211; string, name of the new column.</li>
<li><strong>col</strong> &#8211; a <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=when#pyspark.sql.Column"><tt>Column</tt></a> expression for the new column.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<pre>
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
</pre>
</dd></dl>

In [28]:
# Merging columns
# Create a udf to concatenate two column values
import pyspark.sql.functions
concat_func = pyspark.sql.functions.udf(lambda name, age: name + '_' + str(age))
# Apply the udf to create merged column
concat_df = final_data.withColumn('name_age', concat_func(final_data.name, final_data.age))
concat_df.show(4)

+------+----+---+---------+------+--------+
|emp_id|name|age|     role|salary|name_age|
+------+----+---+---------+------+--------+
|     1|John| 25|Associate| 10000| John_25|
|     2| Ray| 35|  Manager| 12000|  Ray_35|
|     3|Mike| 24|  Manager| 12000| Mike_24|
|     4|Jane| 28|Associate|  null| Jane_28|
+------+----+---+---------+------+--------+
only showing top 4 rows
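The lambda above assumes that neither column is null; in plain Python, `None + '_'` raises a `TypeError`. A hedged sketch of a null-safe version follows. The function name `concat_name_age` is hypothetical, and it could be wrapped with `udf` in the same way as the lambda.

```python
def concat_name_age(name, age, sep="_"):
    """Join name and age with a separator; return None if either is missing."""
    if name is None or age is None:
        return None
    return name + sep + str(age)

print(concat_name_age("John", 25))
print(concat_name_age(None, 25))
```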



In [29]:
# Adding constant to data
data_new = concat_df.withColumn('age_incremented', concat_df.age + 10)
data_new.show(4)

+------+----+---+---------+------+--------+---------------+
|emp_id|name|age|     role|salary|name_age|age_incremented|
+------+----+---+---------+------+--------+---------------+
|     1|John| 25|Associate| 10000| John_25|             35|
|     2| Ray| 35|  Manager| 12000|  Ray_35|             45|
|     3|Mike| 24|  Manager| 12000| Mike_24|             34|
|     4|Jane| 28|Associate|  null| Jane_28|             38|
+------+----+---+---------+------+--------+---------------+
only showing top 4 rows



<dl>
<dt>
<tt>replace</tt><big>(</big><em>to_replace</em>, <em>value</em>, <em>subset=None</em><big>)</big></dt>
<dd><p>Returns a new <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=replace#pyspark.sql.DataFrame"><tt>DataFrame</tt></a> replacing a value with another value.
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=replace#pyspark.sql.DataFrame.replace"><tt>DataFrame.replace()</tt></a> and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=replace#pyspark.sql.DataFrameNaFunctions.replace"><tt>DataFrameNaFunctions.replace()</tt></a> are
aliases of each other.</p>
<table frame="void" rules="none">

<tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>to_replace</strong> &#8211; int, long, float, string, or list.<strong>Value to be replaced</strong>.If the value is a dict, then <cite>value</cite> is ignored and <cite>to_replace</cite> must be a mapping from column name (string) to replacement value. The value to be replaced must be an int, long, float, or string.</li>
<li><strong>value</strong> &#8211; int, long, float, string, or list.
<strong>Value to use to replace holes</strong>.The replacement value must be an int, long, float, or string. If <cite>value</cite> is a list or tuple, <cite>value</cite> should be of the same length with <cite>to_replace</cite>.</li>
<li><strong>subset</strong> &#8211; optional list of column names to consider.Columns specified in subset that do not have matching data type are ignored.For example, if <cite>value</cite> is a string, and subset contains a non-string column,then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<pre>
>>> df4.na.replace(10, 20).show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  20|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+

>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+
</pre>
</dd></dl>

In [30]:
# Replace values in a column
df1.replace("Emily", "Charlotte", "FirstName").show()
#If the column name argument is omitted in replace, then replacement is applicable to all columns

+---------+--------+----------------+
|FirstName|LastName|             Dob|
+---------+--------+----------------+
|   Thomas|   Hardy|     June 2,1840|
|   Thomas|   Hardy|     June 2,1840|
|   Thomas|       H|            null|
|     Jane|  Austen|16 December 1775|
|Charlotte|    null|            null|
+---------+--------+----------------+



<dl>
<dt>
<tt>substr</tt><big>(</big><em>startPos</em>, <em>length</em><big>)</big></dt>
<dd><p>Return a <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substr#pyspark.sql.Column" ><tt>Column</tt></a> which is a substring of the column.</p>
<table frame="void" rules="none">

<tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>startPos</strong> &#8211; start position (int or Column)</li>
<li><strong>length</strong> &#8211; length of the substring (int or Column)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<pre>
>>> df.select(df.name.substr(1, 3).alias("col")).collect()
[Row(col=u'Ali'), Row(col=u'Bob')]
</pre>
</dd></dl>

In [31]:
#Append new columns based on existing values in a column
#Give 'LastName' instead of 'Initial' if you want to overwrite
df1.withColumn('Initial', df1.LastName.substr(1, 1)).show()

+---------+--------+----------------+-------+
|FirstName|LastName|             Dob|Initial|
+---------+--------+----------------+-------+
|   Thomas|   Hardy|     June 2,1840|      H|
|   Thomas|   Hardy|     June 2,1840|      H|
|   Thomas|       H|            null|      H|
|     Jane|  Austen|16 December 1775|      A|
|    Emily|    null|            null|   null|
+---------+--------+----------------+-------+



Now that we are familiar with basic examples, let us put together a somewhat more complex example. You might have noticed that the date column in the authors data has different date formats: in some cases the month is followed by the day, and in others the day comes first. Such anomalies are common in the real world, where data might be collected from different sources. We need to standardize all the different date formats into one format. To do so, we first have to create a user-defined function (udf) that can handle the different formats and convert them to one common format.

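**Format specifiers**
Python's `str.format` supports a rich set of "format specifiers" (written after a `:` inside the `{}`), for example:

<strong>Fill and alignment</strong><br/>
Fill is usually combined with alignment: `^` centers, `<` left-aligns, and `>` right-aligns, followed by the width. The fill character comes right after the `:` and must be a single character; if it is omitted, spaces are used. For example:<br/>

```
In [01]: '{:>8}'.format('189')
Out[01]: '     189'
In [02]: '{:0>8}'.format('189')
Out[02]: '00000189'
In [03]: '{:a>8}'.format('189')
Out[03]: 'aaaaa189'
```
<strong>Precision and the f type</strong><br/>
Precision is usually combined with the `f` type:<br/>
```
In [04]: '{:.2f}'.format(321.33345)
Out[04]: '321.33'
```
Here `.2` means a precision of two decimal places, and `f` means the float type.<br/>
<strong>Other types</strong> are mainly number bases: `b`, `d`, `o`, and `x` stand for binary, decimal, octal, and hexadecimal respectively:<br/>
```
In [05]: '{:b}'.format(17)
Out[05]: '10001'
In [06]: '{:d}'.format(17)
Out[06]: '17'
In [07]: '{:o}'.format(17)
Out[07]: '21'
In [08]: '{:x}'.format(17)
Out[08]: '11'
```
The `,` can also be used as a thousands separator for amounts:<br/>
```
In [09]: '{:,}'.format(1234567890)
Out[09]: '1,234,567,890'
```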

In [35]:
#Date conversions
#Create udf for date conversion that converts incoming string to YYYY-MM-DD format
#The function assumes month is full month name and year is always 4 digits
#Separator is always a space or comma
#Month, date and year may come in any order
#Reusing authors data

authors = [['Thomas', 'Hardy', 'June 2, 1840'], 
          ['Charles', 'Dickens', '7 February 1812'],
          ['Mark', 'Twain', None], 
          ['Jane', 'Austen', '16 December 1775'], 
          ['Emily', None, None]]

df1 = sc.parallelize(authors).toDF(['FirstName', 'LastName', 'Dob'])

#define udf
#Note: You may create this in a script file and run it with exec(open('filename.py').read()); execfile() exists only in Python 2
def toDate(s):
    import re
    year = month = day = ""
    if not s:
        return None
    mn = [0,'January', 'February', 'March', 
          'April', 'May', 'June', 'July', 
          'August', 'September', 'October', 'November', 'December']
    #Split the string and remove empty tokens
    l = [tok for tok in re.split(',| ',s) if tok]
    #Assign token to year ,month or day
    for a in l:
        if a in mn:
            month = '{:0>2d}'.format(mn.index(a))
        elif len(a) == 4:
            year = a
        elif len(a) == 1:
            day = '0' + a
        else:
            day = a
    return year + '-' + month + '-' + day

#Register the udf
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
toDateUDF = udf(toDate, StringType())

#Apply udf
df1.withColumn("Dob", toDateUDF("Dob")).show()

+---------+--------+----------+
|FirstName|LastName|       Dob|
+---------+--------+----------+
|   Thomas|   Hardy|1840-06-02|
|  Charles| Dickens|1812-02-07|
|     Mark|   Twain|      null|
|     Jane|  Austen|1775-12-16|
|    Emily|    null|      null|
+---------+--------+----------+
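An alternative to tokenizing by hand is to try a short list of known formats with `datetime.strptime`. The sketch below is a swapped-in technique, not the approach used in the cell above. The function name `to_iso_date` and the `KNOWN_FORMATS` tuple are hypothetical, and it assumes an English locale for month names. Unlike `toDate`, it returns `None` for strings that match none of the listed formats.

```python
from datetime import datetime

# Formats seen in the authors data: "June 2, 1840" and "7 February 1812"
KNOWN_FORMATS = ("%B %d, %Y", "%d %B %Y")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if it is missing or unrecognized."""
    if not text:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None

for raw in ['June 2, 1840', '7 February 1812', None, '16 December 1775']:
    print(raw, '->', to_iso_date(raw))
```

Listing the formats explicitly makes it obvious which inputs are supported, at the cost of having to extend the tuple whenever a new format shows up.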

