# Some questions I have about my code:

- My code mostly uses RDDs. I have code to read files as both Spark SQL DataFrames and RDDs, but everything after the "read" phase uses the RDDs. I'm not sure which is preferable for my purposes. Is there a good reason to choose RDDs over SQL DataFrames, or vice versa?

- On my RDDs, I have written a few custom functions that (1) set the correct delimiter, (2) strip non-ASCII (unicode) characters, and (3) keep only the columns I need from each dataset and drop the rest.

    - However, what other map/reduce/filter/transformation functions should I be using to prepare my data for my database?
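
The three steps described above (delimiter, non-ASCII removal, column selection) can be sketched as one plain Python function that is then passed to `map`. This is only a sketch: the name `clean_line` and the `keep` indexes are hypothetical, and the sample line is copied from the broadband output further down in this notebook.

```python
def clean_line(line, keep, sep="|"):
    """Strip non-ASCII characters, split on the delimiter, keep selected columns."""
    # encode/decode round trip drops non-ASCII characters and keeps a text string
    # (on Python 3, encode() alone would return bytes and split('|') would fail).
    ascii_line = line.encode("ascii", "ignore").decode("ascii")
    fields = ascii_line.split(sep)
    # Return a tuple containing only the requested column positions.
    return tuple(fields[i] for i in keep)


# Hypothetical usage on an RDD of raw lines (not executed here):
#   cleaned = rdd_BROADBAND.map(lambda line: clean_line(line, keep=(0, 8, 10)))

sample = (u"16600|4837795020014440017|RoadSegment|0000012781|Neu Ventures, Inc."
          u"|Mountain Zone TV Systems|240068|Neu Ventures, Inc.|TX|483779502001444|48377")
print(clean_line(sample, keep=(0, 8, 10)))
```

Keeping the logic in a named function rather than three chained lambdas makes it possible to test the cleaning step on a single line before running it on the cluster.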


#### Towards the end of my code here, I used the following format to write my data back to S3:

cleaned_rdd_BROADBAND.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_BROADBAND")


- I have not yet tested whether this works. My questions are:

    - "Is this the correct/best method to write data back to S3?"
    - "If not, is there a better method for writing data back to S3?"
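
One thing that may be worth checking before settling on `saveAsTextFile`: it writes the string form of each RDD element, so a tuple would be saved looking like `('TX', '48377')` rather than as a delimited row. A minimal sketch of converting each record to a delimited line first is below; `to_delimited_line` is a hypothetical helper name, not something from my code above.

```python
def to_delimited_line(fields, sep="|"):
    """Join one record's fields into a single delimited text line."""
    # None becomes an empty field so that column positions are preserved.
    return sep.join("" if f is None else str(f) for f in fields)


# Hypothetical usage with an RDD of tuples (not executed here):
#   cleaned_rdd_BROADBAND.map(to_delimited_line).saveAsTextFile(output_path)

print(to_delimited_line(("TX", "48377", None, 7)))
```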

In [12]:
# This code imports the modules/libraries needed for reading, transforming, and writing my input data

# *****I need to check if some of these import lines are either redundant or otherwise not needed*****

import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used for the SQL DataFrame reads below
spark = SparkSession \
    .builder \
    .appName("whatever name") \
    .getOrCreate()

In [2]:
# This code gets the SparkContext -- sc -- that belongs to the SparkSession created above. Only one
# SparkContext can be active at a time, so the session's context is reused instead of being stopped and restarted.

sc = spark.sparkContext



# Paths for Data

# Warning: need to fix paths to correct paths in S3 bucket for Broadband, Census, GPS, and other datasets

In [3]:
# Create string object with path and name of the relevant **BROADBAND** dataset available in S3 bucket
# This dataset shows available broadband speeds by zip code, city, and other census geographies, and
# includes relevant information for this project like zip codes, census tracts, and other information about
# the United States

pathname_BROADBAND_dataset = 's3a://sparkforinsightproject/XXXXXBROADBANDXXXXX.txt'


# Create string object with path and name of the single most recent **ACQUISITIONS** dataset available in S3 bucket
# This dataset contains HOUSING information from the last quarter of 2017 and includes things like housing values 
# by zip code (and possibly by census tract?, city, state)

pathname_ACQUISIT_2017Q4 = 's3a://sparkforinsightproject/fannie_freddie_data/Acquisition_2017Q4.txt'



# Create string object with path and name of the single most recent **PERFORMANCE** dataset available in S3 bucket
# This dataset contains HOUSING information from the last quarter of 2017 and includes things like housing values 
# by zip code (and possibly by census tract?, city, state)

pathname_PERFORM_2017Q4 = 's3a://sparkforinsightproject/fannie_freddie_data/Performance_2017Q4.txt'



# Create string object with path and name of the relevant **CENSUS** dataset available in S3 bucket
# This dataset is from the US Census and includes relevant information for this project like zip codes, census 
# tracts, and other information about the United States


pathname_CENSUS_dataset = 's3a://sparkforinsightproject/XXXXXCENSUSXXXXXXX.txt'


# Create string object with path and name of the relevant **GPS COORDINATES** dataset available in S3 bucket
# This dataset contains GPS COORDINATE information (longitude and latitude) alongside relevant information for
# identifying the GPS coordinates in the US, paired with data such as zip codes, city names, states, and
# possibly census tracts/counties. Data will be used to calculate the distance of various zip codes from the
# city center of cities around the nation, along with the typical/median/average housing prices of those zip
# codes / census tracts as well

# NOTE: placeholder path -- still needs to be replaced with the real GPS file in the S3 bucket
path_name_GPS_COORDINATES_data = 's3a://sparkforinsightproject/XXXXXGPSXXXXXXX.txt'



# Broadband data (RDD Dataframe)

In [None]:
# Read in the BROADBAND dataset that shows information about typical broadband speeds in all zip codes/counties, 
# etc... in the United States. NOTE: RDD is an acronym for Resilient Distributed Dataset (RDD)

rdd_BROADBAND = sc.textFile(pathname_BROADBAND_dataset)


In [None]:
# View the first 10 rows (lines) of the raw BROADBAND RDD

rdd_BROADBAND.take(10)

In [None]:
# Clean/Process/Transform BROADBAND dataset for loading into database
# NOTE: map() is a transformation. Transformations are lazy: they return a reference to a new RDD, but
# Spark doesn't actually run them when the reference is created. The work happens only when an action
# (e.g. take(), count(), saveAsTextFile()) is performed on that RDD.
# NOTE: the column indexes below are placeholders. The broadband sample in the scratch section has only
# 20 columns, so indexes such as x[34] must be replaced with the columns actually needed.

cleaned_rdd_BROADBAND = rdd_BROADBAND.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|')).\
                                                        map(lambda x: (x[4], x[8], x[11], x[19], x[34]))
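
The lazy behaviour described in the note above can be seen without a cluster. The sketch below is only an analogy: it uses Python 3's built-in `map`, which is also lazy, and it is not Spark itself. The `calls` list and the function name are hypothetical and exist only to record when the work actually happens.

```python
calls = []


def strip_non_ascii(line):
    calls.append(line)  # record that the function actually ran
    return line.encode("ascii", "ignore").decode("ascii")


lines = ["a|b", "c\u00e9|d"]

# Like an RDD transformation, building the map object does no work yet.
lazy = map(strip_non_ascii, lines)
calls_before = len(calls)

# Consuming the iterator plays the role of an action such as take() or collect().
result = list(lazy)
calls_after = len(calls)

print(calls_before, calls_after, result)
```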

# Housing data (SQL Dataframes)

In [16]:
# Read in most recent single ACQUISITIONS dataset as SQL Dataframe

sql_ACQUISIT_2017Q4 = spark.read.csv(pathname_ACQUISIT_2017Q4, header=False, mode="DROPMALFORMED", encoding='UTF-8', sep='|')

In [17]:
# Read in most recent single PERFORMANCE dataset as SQL Dataframe

sql_PERFORM_2017Q4 = spark.read.csv(pathname_PERFORM_2017Q4, header=False, mode="DROPMALFORMED", encoding='UTF-8', sep='|')

# Housing data (RDD Dataframes)

In [None]:
# Read in most recent single ACQUISITIONS dataset as RDD Dataframe

rdd_ACQUISIT_2017Q4 = sc.textFile(pathname_ACQUISIT_2017Q4)



In [None]:
# View the first 10 rows of the raw ACQUISITIONS RDD

rdd_ACQUISIT_2017Q4.take(10)

In [None]:
# Read in most recent single PERFORMANCE dataset as RDD Dataframe

rdd_PERFORM_2007q4 = sc.textFile(pathname_PERFORM_2017Q4)



In [None]:
# View the first 10 rows of the raw PERFORMANCE RDD


rdd_PERFORM_2007q4.take(10)

# Warning: for the Clean/Process/Transform code blocks below, need to fix the lambda function -- need to identify which columns are most important/necessary to keep and then select those columns in the third map(lambda...) command -- eliminating unnecessary columns reduces the amount of data that has to be loaded into the database

# Also should convert columns to correct datatypes in this step
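
A rough sketch of what the datatype conversion step could look like is below. The helper names are hypothetical, and which columns hold numbers is only a guess from the acquisition sample rows in the scratch output (column 3 and column 4 look numeric, column 18 looks like a state code); the real column meanings still need to be confirmed against the data dictionary.

```python
def to_int(value):
    """Return int(value), or None when the field is empty or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_float(value):
    """Return float(value), or None when the field is empty or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def cast_acquisition_row(fields):
    # Column positions are assumptions taken from the sample rows only.
    return (fields[0], to_float(fields[3]), to_int(fields[4]), fields[18])


sample = ("100001461640|R|PNC BANK, N.A.|6.25|137000|360|01/2007|03/2007|56|56|2|37"
          "|741|N|C|SF|1|P|MI|486||FRM|734||N")
print(cast_acquisition_row(sample.split("|")))

# Hypothetical usage after the split step (not executed here):
#   rdd.map(lambda x: x.split('|')).map(cast_acquisition_row)
```

Returning `None` for empty or malformed fields keeps one bad value from failing the whole job, at the cost of having to handle nulls when loading the database.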

# Also should consider

In [None]:
# Clean/Process/Transform ACQUISITIONS dataset for loading into database

# The selected columns must be wrapped in parentheses so the lambda returns one tuple per row
cleaned_rdd_ACQUISIT_2017Q4 = rdd_ACQUISIT_2017Q4.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|')).\
                                                        map(lambda x: (x[4], x[8], x[11], x[19], x[34]))

In [None]:
'''
# Clean/Process/Transform ACQUISITIONS RDD Dataframe for loading into database


cleaned_rdd_ACQUISIT_2017Q4 = data_Acquistion_2017Q1.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|')).\
                                                        map(lambda x: x[4], x[8], x[11], x[19], x[34])
                                                        '''

In [None]:
# Clean/Process/Transform PERFORMANCE RDD Dataframe for loading into database


# The selected columns must be wrapped in parentheses so the lambda returns one tuple per row
cleaned_rdd_PERFORM_2007q4 = rdd_PERFORM_2007q4.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|')).\
                                                        map(lambda x: (x[4], x[8], x[11], x[19], x[34]))

# Census data

In [None]:
# Read in Census data with population, zip codes, census tracts, and other relevant population and geographic data for all parts of the United States

rdd_CENSUS = sc.textFile(pathname_CENSUS_dataset)

In [None]:
# View first 10 rows of CENSUS RDD

rdd_CENSUS.take(10)


# GPS coordinates data

In [None]:
# Read in GPS coordinates data with GPS coordinates (latitude and longitude), zip codes, and other relevant population and geographic data in the United States

rdd_GPS_COORDINATE_data = sc.textFile(path_name_GPS_COORDINATES_data)

In [None]:
# View first 10 rows of GPS COORDINATES RDD
rdd_GPS_COORDINATE_data.take(10)


# Write all datasets back to S3

In [None]:
# Write BROADBAND dataset to S3

cleaned_rdd_BROADBAND.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_BROADBAND")


In [None]:
# Write HOUSING ACQUISITIONS dataset to S3

cleaned_rdd_ACQUISIT_2017Q4.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_ACQUISIT2017Q4")

In [None]:
# Write HOUSING PERFORMANCE dataset to S3

cleaned_rdd_PERFORM_2007q4.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_PERFORM2017Q4")

In [None]:
# Write CENSUS dataset to S3

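# NOTE: cleaned_rdd_CENSUS is not created anywhere above; a cleaning step for rdd_CENSUS is still needed
cleaned_rdd_CENSUS.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_CENSUS")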

In [None]:
# Write GPS COORDINATES dataset to S3
# NOTE: cleaned_rdd_GPS is not created anywhere above; a cleaning step for the GPS RDD is still needed
cleaned_rdd_GPS.repartition(1).saveAsTextFile("s3n://sparkforinsightproject/database_data/cleaned_GPS")

# YOU CAN IGNORE EVERYTHING BELOW THIS POINT 
# -- IT IS ALL ROUGH/SCRATCH CODE BELOW THIS POINT

In [31]:
'''
# Read in the SINGLE, most recent Fannie Mae/Freddie Mac Acquisitions dataset as RDD
data_ACQUISIT_2017Q1_rdd = sc.textFile(df_sql_ACQUISIT_2007q4)

        
# df = SQLContext.read.csv(
#     broadband_filename, header=True, mode="DROPMALFORMED"
# )
'''

AttributeError: 'property' object has no attribute 'csv'

In [None]:
# Clean and transform single dataset to be loaded into database

data_Acquistion_2017_rdd_cleaned = rdd_ACQUISIT_2017Q4.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|'))

In [None]:
data_Acquistion_2017_rdd_cleaned = data_PERFORM_2017Q4_rdd.map(lambda x: x.encode('ascii', 'ignore')).\
                                                        map(lambda x: x.split('|'))

In [None]:
# Read in a LIST of the most recent Fannie Mae/Freddie Mac Acquisitions datasets from each quarter in 2017 that shows housing values by zip code (and possibly by census tract?)

# for ?? in ???:   (unfinished: the list of quarterly paths to loop over is not defined yet)
data_Acquistion_2017Q1_rdd = sc.textFile(single_quarter4_acquisitions_dataset)
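
A possible way to finish the "list of quarterly datasets" idea without a loop is sketched below. It assumes the other 2017 quarters follow the same file naming as `Acquisition_2017Q4.txt`, which I have not verified in the bucket, and `quarterly_paths` is a hypothetical helper name. As far as I understand, `sc.textFile` accepts a comma-separated list of paths, so the four files could be read into one RDD.

```python
def quarterly_paths(prefix, year):
    """Build the four quarterly Acquisition file paths for one year."""
    # Assumes files are named Acquisition_<year>Q<quarter>.txt under the prefix.
    return ["{0}/Acquisition_{1}Q{2}.txt".format(prefix, year, q) for q in range(1, 5)]


paths = quarterly_paths("s3a://sparkforinsightproject/fannie_freddie_data", 2017)
joined = ",".join(paths)
print(joined)

# Hypothetical usage (not executed here):
#   rdd_ACQUISIT_2017 = sc.textFile(joined)
```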

# Read in the most recent Fannie Mae/Freddie Mac Acquisitions dataset from 2017 that shows housing values by zip code (and possibly by census tract?)

In [43]:
data_Acquistion_2007_rdd.take(4)

[['objectid',
  'random_pt_objectid',
  'datasource',
  'frn',
  'provname',
  'dbaname',
  'hoconum',
  'hoconame',
  'stateabbr',
  'fullfipsid',
  'county_fips',
  'transtech',
  'maxaddown',
  'maxadup',
  'typicdown',
  'typicup',
  'downloadspeed',
  'uploadspeed',
  'provider_type',
  'end_user_cat'],
 ['16600',
  '4837795020014440017',
  'RoadSegment',
  '0000012781',
  'Neu Ventures, Inc.',
  'Mountain Zone TV Systems',
  '240068',
  'Neu Ventures, Inc.',
  'TX',
  '483779502001444',
  '48377',
  '41',
  '7',
  '5',
  '5',
  '2',
  '7',
  '5',
  '1',
  '5'],
 ['16600',
  '4837795020014440018',
  'RoadSegment',
  '0000012781',
  'Neu Ventures, Inc.',
  'Mountain Zone TV Systems',
  '240068',
  'Neu Ventures, Inc.',
  'TX',
  '483779502001444',
  '48377',
  '41',
  '7',
  '5',
  '5',
  '2',
  '7',
  '5',
  '1',
  '5'],
 ['16685',
  '4837795020014440018',
  'RoadSegment',
  '0000012781',
  'Neu Ventures, Inc.',
  'Mountain Zone TV Systems',
  '240068',
  'Neu Ventures, Inc.',
  '

# Broadband data

In [4]:
broadband_rdd = sc.textFile(broadband_filename)

broadband_rdd.take(3)

[u'objectid|random_pt_objectid|datasource|frn|provname|dbaname|hoconum|hoconame|stateabbr|fullfipsid|county_fips|transtech|maxaddown|maxadup|typicdown|typicup|downloadspeed|uploadspeed|provider_type|end_user_cat',
 u'16600|4837795020014440017|RoadSegment|0000012781|Neu Ventures, Inc.|Mountain Zone TV Systems|240068|Neu Ventures, Inc.|TX|483779502001444|48377|41|7|5|5|2|7|5|1|5',
 u'16600|4837795020014440018|RoadSegment|0000012781|Neu Ventures, Inc.|Mountain Zone TV Systems|240068|Neu Ventures, Inc.|TX|483779502001444|48377|41|7|5|5|2|7|5|1|5']

AttributeError: 'module' object has no attribute 'read'

In [16]:
df_sql_acquisition_2007q1

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string, _c15: string, _c16: string, _c17: string, _c18: string, _c19: string, _c20: string, _c21: string, _c22: string, _c23: string, _c24: string]

In [17]:
type(df_sql_acquisition_2007q1)

pyspark.sql.dataframe.DataFrame

In [20]:
type(broadband_filename)

str

In [18]:
df_sql_acquisition_2007q1.take(5)

[Row(_c0=u'100001461640', _c1=u'R', _c2=u'PNC BANK, N.A.', _c3=u'6.25', _c4=u'137000', _c5=u'360', _c6=u'01/2007', _c7=u'03/2007', _c8=u'56', _c9=u'56', _c10=u'2', _c11=u'37', _c12=u'741', _c13=u'N', _c14=u'C', _c15=u'SF', _c16=u'1', _c17=u'P', _c18=u'MI', _c19=u'486', _c20=None, _c21=u'FRM', _c22=u'734', _c23=None, _c24=u'N'),
 Row(_c0=u'100015135004', _c1=u'R', _c2=u'SUNTRUST MORTGAGE INC.', _c3=u'6', _c4=u'116000', _c5=u'360', _c6=u'02/2007', _c7=u'04/2007', _c8=u'80', _c9=u'80', _c10=u'2', _c11=u'11', _c12=u'796', _c13=u'N', _c14=u'R', _c15=u'SF', _c16=u'1', _c17=u'S', _c18=u'GA', _c19=u'302', _c20=None, _c21=u'FRM', _c22=u'762', _c23=None, _c24=u'N'),
 Row(_c0=u'100015306566', _c1=u'C', _c2=u'CITIMORTGAGE, INC.', _c3=u'6.375', _c4=u'58000', _c5=u'180', _c6=u'02/2007', _c7=u'03/2007', _c8=u'78', _c9=u'78', _c10=u'2', _c11=u'30', _c12=u'710', _c13=u'N', _c14=u'R', _c15=u'SF', _c16=u'1', _c17=u'P', _c18=u'IN', _c19=u'465', _c20=None, _c21=u'FRM', _c22=None, _c23=None, _c24=u'N'),
 Ro

# RDD Dataframe

In [15]:
data_Acquistion_2007_rdd = sc.textFile(filename).map(lamdba x: x.split('|'))

SyntaxError: invalid syntax (<ipython-input-15-2ae4886beff9>, line 1)

In [None]:
mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (y[0], y[2], y[1]))

In [28]:
broadband_first = broadband_rdd.first()
broadband_header = sc.parallelize([broadband_first])
broadband_w0_header_rdd = broadband_first.subtract(broadband_header)

AttributeError: subtract
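
The error above most likely comes from calling `subtract` on `broadband_first`, which is a plain string (the result of `first()`), not an RDD. A simpler way to drop the header is to filter out any line equal to it. The sketch below shows the idea on a plain Python list with shortened sample lines; the Spark equivalents are in the comments and are not executed here.

```python
sample_lines = [
    "objectid|random_pt_objectid|datasource",
    "16600|4837795020014440017|RoadSegment",
    "16600|4837795020014440018|RoadSegment",
]

# With Spark: header = broadband_rdd.first()
header = sample_lines[0]

# With Spark: data_rdd = broadband_rdd.filter(lambda line: line != header)
data_lines = list(filter(lambda line: line != header, sample_lines))

# Once the header row is gone, int() on the first column no longer hits 'objectid'.
object_ids = [int(line.split("|")[0]) for line in data_lines]
print(object_ids)
```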

In [None]:
#broadband_xxx = broadband_rdd.map(lambda x: x.split('|')).map(lambda x: x.encode('ascii', 'ignore'))

broadband_rdd = spark.read.option("header","true").csv(broadband_filename)

# header = broadband_rdd.first() #extract header
# broadband_rdd = broadband_rdd.filter(row => row != header) 

broadband_xxx = broadband_rdd.map(lambda x: x.encode('ascii', 'ignore').\
                                  split('|')).\
                                 map(lambda y: (y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8], y[9], y[10], y[11], y[12], y[13], y[14], y[15], y[16], y[17], y[18], y[19]))
                                

In [18]:
broadband_xxx.take(4)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1371, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-14-130e883d5712>", line 3, in <lambda>
ValueError: invalid literal for int() with base 10: 'objectid'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:149)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1371, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-14-130e883d5712>", line 3, in <lambda>
ValueError: invalid literal for int() with base 10: 'objectid'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


In [27]:
type(data_Acquistion_2007_rdd)

pyspark.rdd.RDD

In [19]:
data_Acquistion_2007_rdd.take(5)

[u'100001461640|R|PNC BANK, N.A.|6.25|137000|360|01/2007|03/2007|56|56|2|37|741|N|C|SF|1|P|MI|486||FRM|734||N',
 u'100015135004|R|SUNTRUST MORTGAGE INC.|6|116000|360|02/2007|04/2007|80|80|2|11|796|N|R|SF|1|S|GA|302||FRM|762||N',
 u'100015306566|C|CITIMORTGAGE, INC.|6.375|58000|180|02/2007|03/2007|78|78|2|30|710|N|R|SF|1|P|IN|465||FRM|||N',
 u'100015319835|C|BANK OF AMERICA, N.A.|6.125|353000|360|12/2006|02/2007|80|80|2|28|778|N|R|SF|1|P|MA|021||FRM|656||N',
 u'100030521552|C|GMAC MORTGAGE, LLC|5.875|385000|360|12/2006|03/2007|70|70|2|50|720|N|C|SF|1|P|CA|917||FRM|700||N']

In [20]:
df1 = sc.textFile(filename)\
... .map(lambda x: x.encode('ascii', 'ignore').split('|'))

In [21]:
df1

def toCSVLine(data):
  return ','.join(str(d) for d in data)

lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')

PythonRDD[54] at RDD at PythonRDD.scala:49

In [22]:
df1.take(10)

[['100001461640',
  'R',
  'PNC BANK, N.A.',
  '6.25',
  '137000',
  '360',
  '01/2007',
  '03/2007',
  '56',
  '56',
  '2',
  '37',
  '741',
  'N',
  'C',
  'SF',
  '1',
  'P',
  'MI',
  '486',
  '',
  'FRM',
  '734',
  '',
  'N'],
 ['100015135004',
  'R',
  'SUNTRUST MORTGAGE INC.',
  '6',
  '116000',
  '360',
  '02/2007',
  '04/2007',
  '80',
  '80',
  '2',
  '11',
  '796',
  'N',
  'R',
  'SF',
  '1',
  'S',
  'GA',
  '302',
  '',
  'FRM',
  '762',
  '',
  'N'],
 ['100015306566',
  'C',
  'CITIMORTGAGE, INC.',
  '6.375',
  '58000',
  '180',
  '02/2007',
  '03/2007',
  '78',
  '78',
  '2',
  '30',
  '710',
  'N',
  'R',
  'SF',
  '1',
  'P',
  'IN',
  '465',
  '',
  'FRM',
  '',
  '',
  'N'],
 ['100015319835',
  'C',
  'BANK OF AMERICA, N.A.',
  '6.125',
  '353000',
  '360',
  '12/2006',
  '02/2007',
  '80',
  '80',
  '2',
  '28',
  '778',
  'N',
  'R',
  'SF',
  '1',
  'P',
  'MA',
  '021',
  '',
  'FRM',
  '656',
  '',
  'N'],
 ['100030521552',
  'C',
  'GMAC MORTGAGE, LLC',
  '5.

In [23]:
df1.count()

253292

In [28]:
for i in df1.take(253292):
    print(x)

NameError: name 'x' is not defined