### Reading a file in PySpark Shell

In [1]:
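import pyspark

# Create the SparkContext, the entry point for working with RDDs
sc = pyspark.SparkContext()
# Load the text file as an RDD with one element per line
RDD_collections = sc.textFile("wikipedia.txt")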

### Viewing the contents with collect()

In [2]:
RDD_collections.collect()

['Amazon rainforest',
 'From Wikipedia, the free encyclopedia',
 'Jump to navigationJump to search',
 '"Amazonia" redirects here. For the river, see Amazon River. For other uses, see Amazon and Amazonia (disambiguation).',
 'Amazon rainforest',
 'Portuguese: Floresta amazônica',
 'Spanish: Selva amazónica',
 'Amazon Manaus forest.jpg',
 'Amazon rainforest, near Manaus, Brazil',
 'Geography',
 'Amazon biome outline map.svg',
 'Map of the Amazon rainforest ecoregions as delineated by the WWF in white[1] and the Amazon drainage basin in blue.',
 'Location\tBrazil, Peru, Colombia, Venezuela, Ecuador, Bolivia, Guyana, Suriname, France (French Guiana)',
 'Coordinates\tCoordinates: 3°S 60°W',
 'Area\t5,500,000 km2 (2,100,000 sq mi)',
 'The Amazon rainforest,[a] also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 km2 (2,700,000 sq mi), of whic

In [3]:
RDD_collections.first()

'Amazon rainforest'

### Using Take Command

In [4]:
samples = 10
RDD_collections.take(samples)

['Amazon rainforest',
 'From Wikipedia, the free encyclopedia',
 'Jump to navigationJump to search',
 '"Amazonia" redirects here. For the river, see Amazon River. For other uses, see Amazon and Amazonia (disambiguation).',
 'Amazon rainforest',
 'Portuguese: Floresta amazônica',
 'Spanish: Selva amazónica',
 'Amazon Manaus forest.jpg',
 'Amazon rainforest, near Manaus, Brazil',
 'Geography']

### Counting elements

In [5]:
RDD_collections.count()

106

In [6]:
RDD_collections.cache()  # marks the RDD for caching; it is materialized by the next action

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [7]:
print('Number of elements: ', RDD_collections.count())

Number of elements:  106


### Apache Spark Transformations

### Using Map operations in Spark

In [8]:
new_collections = RDD_collections.map(lambda line: line.split(" "))

In [9]:
new_collections.take(10)

[['Amazon', 'rainforest'],
 ['From', 'Wikipedia,', 'the', 'free', 'encyclopedia'],
 ['Jump', 'to', 'navigationJump', 'to', 'search'],
 ['"Amazonia"',
  'redirects',
  'here.',
  'For',
  'the',
  'river,',
  'see',
  'Amazon',
  'River.',
  'For',
  'other',
  'uses,',
  'see',
  'Amazon',
  'and',
  'Amazonia',
  '(disambiguation).'],
 ['Amazon', 'rainforest'],
 ['Portuguese:', 'Floresta', 'amazônica'],
 ['Spanish:', 'Selva', 'amazónica'],
 ['Amazon', 'Manaus', 'forest.jpg'],
 ['Amazon', 'rainforest,', 'near', 'Manaus,', 'Brazil'],
 ['Geography']]
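
The `map` cell above prints nothing, because Spark transformations are lazy and only an action such as `take` triggers the work. The following is a plain-Python analogy rather than Spark code: Python's built-in `map` is also lazy, so it can illustrate the idea without a cluster. The helper `split_line`, the `processed` list, and the three sample lines are hypothetical names chosen for illustration.

```python
from itertools import islice

processed = []  # records which lines were actually split


def split_line(line):
    """Split a line into words and remember that it was processed."""
    processed.append(line)
    return line.split(" ")


lines = [
    "Amazon rainforest",
    "From Wikipedia, the free encyclopedia",
    "Geography",
]

# Like an RDD transformation: this only describes the work, nothing runs yet.
lazy_words = map(split_line, lines)
calls_before = len(processed)

# Like the take(2) action: only as many lines as needed are processed.
first_two = list(islice(lazy_words, 2))
calls_after = len(processed)

print(calls_before, calls_after)
print(first_two)
```

In this sketch no line is split until `islice` asks for results, and the third line is never touched, which loosely mirrors how `take(n)` can avoid processing the whole dataset.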

### Using flatMap operations in Apache Spark

In [10]:
new_collections = RDD_collections.flatMap(lambda line: line.split(" "))

In [11]:
new_collections.take(10)

['Amazon',
 'rainforest',
 'From',
 'Wikipedia,',
 'the',
 'free',
 'encyclopedia',
 'Jump',
 'to',
 'navigationJump']
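
The two outputs above differ only in nesting: `map` yields one list of words per line, while `flatMap` flattens those lists into a single sequence of words. A minimal sketch of the same difference in plain Python, assuming two sample lines taken from the file (no Spark required; the variable names are illustrative):

```python
from itertools import chain

lines = ["Amazon rainforest", "From Wikipedia, the free encyclopedia"]

# Counterpart of RDD.map: one output element (a list of words) per input line.
mapped = [line.split(" ") for line in lines]

# Counterpart of RDD.flatMap: the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With `map` the number of output elements always equals the number of input lines; with `flatMap` it equals the total number of words.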

### Using persist in Apache Spark

In [12]:
# persist() with no argument uses MEMORY_ONLY and returns the same RDD
RDD_storage = RDD_collections.persist()

#### Memory Only

In [13]:
RDD_storage.persist(pyspark.StorageLevel.MEMORY_ONLY)

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [14]:
RDD_storage.unpersist()

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

#### Disk Only

In [15]:
RDD_storage.persist(pyspark.StorageLevel.DISK_ONLY)

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [16]:
RDD_storage.unpersist()

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

#### Memory and Disk

In [17]:
RDD_storage.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

wikipedia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0