## Analytics Vidhya DataFrame Examples

### 1. Characteristics shared with RDDs

    Immutable: Once created, a DataFrame / RDD cannot be changed. Applying a transformation returns a new DataFrame / RDD instead.
    Lazy evaluation: A task is not executed until an action is performed.
    Distributed: Both RDDs and DataFrames are distributed across the cluster.
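
The first two properties can be illustrated with a plain-Python analogy. This is not Spark code: a generator stands in for a transformation, and `list()` stands in for an action. The names `double`, `log`, and `pipeline` are made up for the illustration.

```python
# Plain-Python analogy for immutability and lazy evaluation (NOT Spark code).
log = []  # records when work actually happens


def double(values):
    # A "transformation": it describes work but does nothing until iterated.
    for v in values:
        log.append(f"doubling {v}")
        yield v * 2


source = (1, 2, 3)           # immutable input, standing in for a DataFrame / RDD
pipeline = double(source)    # building the pipeline runs nothing
ran_before_action = len(log)

result = list(pipeline)      # the "action" forces evaluation

print(ran_before_action)     # 0
print(result)                # [2, 4, 6]
print(source)                # (1, 2, 3), the input is unchanged
```

Spark behaves similarly in spirit: transformations only build up a plan, and the work is triggered by actions such as `count()` or `show()`.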
    
### 2. Why are DataFrames useful?

I am sure this question must be lingering in your mind. To make things simpler, here are a few advantages of DataFrames:

    DataFrames are designed for processing large collections of structured or semi-structured data.
    Observations in a Spark DataFrame are organised under named columns, which lets Apache Spark understand the schema of the DataFrame and optimize the execution plan for queries on it.
    A DataFrame in Apache Spark can handle petabytes of data.
    DataFrames support a wide range of data formats and sources.
    The API is available in several languages: Python, R, Scala, and Java.

#### Connecting to a cluster with SparkContext

SparkContext is required when we want to execute operations on a cluster. It tells Spark how and where to access the cluster, so the first step is to connect to the Spark cluster.

        from pyspark import SparkContext
        sc = SparkContext()
        
        
#### A DataFrame in Apache Spark can be created in multiple ways:
    By loading data from different formats, for example JSON or CSV.
    By converting an existing RDD.
    By programmatically specifying a schema.


### Pandas vs PySpark DataFrame

Pandas and Spark DataFrames are both designed for processing structured and semi-structured data, and they share some properties (which I have discussed above). A few differences between pandas and PySpark DataFrames are:

    Operations on a PySpark DataFrame run in parallel on different nodes of a cluster, which is not possible with pandas.
    Operations on a PySpark DataFrame are lazy, whereas pandas returns the result as soon as an operation is applied.
    A PySpark DataFrame is immutable, so it cannot be changed in place; we transform it into a new DataFrame instead. A pandas DataFrame can be modified in place.
    The pandas API supports more operations than the PySpark DataFrame API and is the richer of the two.
    Complex operations are easier to perform in pandas than on a PySpark DataFrame.

In [9]:
import findspark
findspark.init('C:\\spark\\spark-3.0.0-hadoop2.7')
import pyspark

In [10]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Row

In [12]:
conf = pyspark.SparkConf().setAppName('appVidiya').setMaster('local')
# getOrCreate reuses a SparkContext that is already running instead of
# raising "ValueError: Cannot run multiple SparkContexts at once"
sc = pyspark.SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

In [14]:
spark = SparkSession.builder.appName('Basic').getOrCreate()

In [13]:
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
# sqlContext is not defined in this session, so create the DataFrame
# through the SparkSession instead
schemaPeople = spark.createDataFrame(people)

In [35]:
schemaPeople = spark.createDataFrame([
    ("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"),
      ("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"),
      ("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"),
      ("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico"),
      ("Banana",0,"Canada"),(None,2000,"Canada"),("Beans",2000,"Mexico"),
      ("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")
], ["Product","Amount","Country"])

In [28]:
schemaPeople.head(5)

[Row(Product='Banana', Amount=1000, Country='USA'),
 Row(Product='Carrots', Amount=1500, Country='USA'),
 Row(Product='Beans', Amount=1600, Country='USA'),
 Row(Product='Orange', Amount=2000, Country='USA'),
 Row(Product='Orange', Amount=2000, Country='USA')]

In [17]:
# number of rows in DataFrame
schemaPeople.count()

12

In [18]:
schemaPeople.columns

['Product', 'Amount', 'Country']

In [19]:
len(schemaPeople.columns)

3

In [20]:
# get the summary statistics (count, mean, standard deviation, min, max) of the columns in a DataFrame

schemaPeople.describe().show()

+-------+-------+------------------+-------+
|summary|Product|            Amount|Country|
+-------+-------+------------------+-------+
|  count|     12|                12|     12|
|   mean|   null|1766.6666666666667|   null|
| stddev|   null| 863.7479991644589|   null|
|    min| Banana|               400| Canada|
|    max| Orange|              4000|    USA|
+-------+-------+------------------+-------+
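
As a rough cross-check of what `describe()` reports for a numeric column, the same statistics can be recomputed in plain Python. This sketch assumes that the output above was captured while the DataFrame held only the first 12 rows defined earlier, since their mean matches the table. It also shows that `stddev` is the sample standard deviation (dividing by n - 1).

```python
import statistics

# Amount values of the first 12 rows defined above (assumption: the
# describe() output shown was produced from these rows only).
amounts = [1000, 1500, 1600, 2000, 2000, 400,
           1200, 1500, 4000, 2000, 2000, 2000]

summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "stddev": statistics.stdev(amounts),  # sample standard deviation (n - 1)
    "min": min(amounts),
    "max": max(amounts),
}

for name, value in summary.items():
    print(name, value)
```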



In [21]:
schemaPeople.describe('Amount').show()

+-------+------------------+
|summary|            Amount|
+-------+------------------+
|  count|                12|
|   mean|1766.6666666666667|
| stddev| 863.7479991644589|
|    min|               400|
|    max|              4000|
+-------+------------------+



In [22]:
schemaPeople.select('Product','Amount').show(5)

+-------+------+
|Product|Amount|
+-------+------+
| Banana|  1000|
|Carrots|  1500|
|  Beans|  1600|
| Orange|  2000|
| Orange|  2000|
+-------+------+
only showing top 5 rows



In [23]:
schemaPeople.select('Product').distinct().count()

4

In [24]:
# calculate the pairwise frequency of two categorical columns
schemaPeople.crosstab('Product', 'Country').show()

+---------------+------+-----+------+---+
|Product_Country|Canada|China|Mexico|USA|
+---------------+------+-----+------+---+
|         Banana|     1|    1|     0|  1|
|        Carrots|     1|    1|     0|  1|
|         Orange|     0|    1|     0|  2|
|          Beans|     0|    1|     1|  1|
+---------------+------+-----+------+---+
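
What `crosstab` computes can be sketched in plain Python with a `Counter` over (Product, Country) pairs. The pairs below are taken from the first 12 rows defined earlier, which is what the table above appears to reflect; the variable names are illustrative only.

```python
from collections import Counter

# (Product, Country) pairs from the first 12 rows defined above.
pairs = [
    ("Banana", "USA"), ("Carrots", "USA"), ("Beans", "USA"),
    ("Orange", "USA"), ("Orange", "USA"), ("Banana", "China"),
    ("Carrots", "China"), ("Beans", "China"), ("Orange", "China"),
    ("Banana", "Canada"), ("Carrots", "Canada"), ("Beans", "Mexico"),
]

counts = Counter(pairs)  # frequency of each (Product, Country) pair
countries = sorted({country for _, country in pairs})
products = sorted({product for product, _ in pairs})

print("Product_Country", *countries)
for product in products:
    # a Counter returns 0 for pairs that never occur
    print(product, *(counts[(product, c)] for c in countries))
```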



In [25]:
# get a DataFrame without the duplicate rows of the given DataFrame
schemaPeople.select('Country','Product').dropDuplicates().show()

+-------+-------+
|Country|Product|
+-------+-------+
| Mexico|  Beans|
|  China|  Beans|
|    USA|Carrots|
|  China| Banana|
|  China| Orange|
| Canada|Carrots|
| Canada| Banana|
|  China|Carrots|
|    USA| Orange|
|    USA|  Beans|
|    USA| Banana|
+-------+-------+
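
The effect of `dropDuplicates()` on the two selected columns can be mimicked in plain Python. This is only a sketch over the first 12 rows defined earlier (the 11 rows printed above match that subset). Unlike this version, Spark does not promise any particular row order in the result.

```python
# (Country, Product) pairs from the first 12 rows defined above.
pairs = [
    ("USA", "Banana"), ("USA", "Carrots"), ("USA", "Beans"),
    ("USA", "Orange"), ("USA", "Orange"), ("China", "Banana"),
    ("China", "Carrots"), ("China", "Beans"), ("China", "Orange"),
    ("Canada", "Banana"), ("Canada", "Carrots"), ("Mexico", "Beans"),
]

# dict keys are unique and keep insertion order, so this drops repeated
# pairs while keeping the first occurrence of each
unique_pairs = list(dict.fromkeys(pairs))

print(len(pairs), len(unique_pairs))
```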



In [29]:
# drop every row that contains a null value (here, the one row whose Product is None)
schemaPeople.dropna().count()

17

In [36]:
schemaPeople.fillna('NA').show()

+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
| Banana|  1000|    USA|
|Carrots|  1500|    USA|
|  Beans|  1600|    USA|
| Orange|  2000|    USA|
| Orange|  2000|    USA|
| Banana|   400|  China|
|Carrots|  1200|  China|
|  Beans|  1500|  China|
| Orange|  4000|  China|
| Banana|  2000| Canada|
|Carrots|  2000| Canada|
|  Beans|  2000| Mexico|
| Banana|     0| Canada|
|     NA|  2000| Canada|
|  Beans|  2000| Mexico|
| Banana|  2000| Canada|
|Carrots|  2000| Canada|
|  Beans|  2000| Mexico|
+-------+------+-------+
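
The two null-handling calls above can be sketched in plain Python on three of the rows. The list names are made up for the illustration. One difference to keep in mind: as far as I know, PySpark applies a string fill value such as `'NA'` only to string columns, whereas this sketch replaces any missing field.

```python
# Three rows from the data above; one has a missing Product.
rows = [
    ("Banana", 0, "Canada"),
    (None, 2000, "Canada"),
    ("Beans", 2000, "Mexico"),
]

# like dropna(): keep only rows in which no field is missing
without_nulls = [row for row in rows
                 if all(field is not None for field in row)]

# like fillna('NA'): replace missing values with a placeholder string
filled = [tuple("NA" if field is None else field for field in row)
          for row in rows]

print(without_nulls)
print(filled)
```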



In [38]:
schemaPeople.filter(schemaPeople['Amount']>100).count()

17

In [39]:
schemaPeople.groupBy('Product').agg({'Amount': 'mean'}).show()

+-------+------------------+
|Product|       avg(Amount)|
+-------+------------------+
| Orange|2666.6666666666665|
|  Beans|            1820.0|
| Banana|            1080.0|
|   null|            2000.0|
|Carrots|            1675.0|
+-------+------------------+



In [41]:
schemaPeople.groupBy('Product').count().show()

+-------+-----+
|Product|count|
+-------+-----+
| Orange|    3|
|  Beans|    5|
| Banana|    5|
|   null|    1|
|Carrots|    4|
+-------+-----+
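
The two `groupBy` results above (mean and count per Product) can be reproduced with a dictionary of lists. This is a plain-Python sketch over the 18 rows defined earlier, not how Spark executes the aggregation. Rows with a missing Product form their own group, which is why a `null` row appears in both tables.

```python
from collections import defaultdict

# (Product, Amount) pairs from the 18 rows defined above.
rows = [
    ("Banana", 1000), ("Carrots", 1500), ("Beans", 1600),
    ("Orange", 2000), ("Orange", 2000), ("Banana", 400),
    ("Carrots", 1200), ("Beans", 1500), ("Orange", 4000),
    ("Banana", 2000), ("Carrots", 2000), ("Beans", 2000),
    ("Banana", 0), (None, 2000), ("Beans", 2000),
    ("Banana", 2000), ("Carrots", 2000), ("Beans", 2000),
]

groups = defaultdict(list)
for product, amount in rows:
    groups[product].append(amount)  # None is a valid key, so nulls get a group

for product, amounts in groups.items():
    # product, count, mean
    print(product, len(amounts), sum(amounts) / len(amounts))
```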



#### How to create a sample DataFrame from the base DataFrame?

We can use the sample operation to take a sample of a DataFrame. The sample method returns a new DataFrame containing a sample of the base DataFrame. It takes 3 parameters:

    withReplacement = True or False, to select observations with or without replacement.
    fraction = x, where x = .5 means that we want roughly 50% of the data in the sample DataFrame. The fraction is an expected proportion, not an exact row count.
    seed = a number that makes the result reproducible.

In [42]:
schemaPeople.count()

18

In [44]:
df_samp1 = schemaPeople.sample(False, 0.3, 42)
df_samp1.count()

3

In [45]:
df_samp2 = schemaPeople.sample(False,0.3, 43)
df_samp2.count()

5
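
The two samples above contain 3 and 5 rows rather than exactly 30% of 18. One way to see why is to model sampling without replacement as keeping each row independently with probability `fraction`. The helper `bernoulli_sample` below is hypothetical and uses Python's own random generator, so its counts will not match Spark's.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    # Hypothetical helper: keep each row independently with probability
    # `fraction`. The sample size is therefore random, not exactly
    # fraction * len(rows).
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(18))  # stand-in for the 18 rows of the DataFrame
sample_a = bernoulli_sample(rows, 0.3, seed=42)
sample_b = bernoulli_sample(rows, 0.3, seed=43)

# sizes depend on the seed; the same seed always gives the same sample
print(len(sample_a), len(sample_b))
```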

### How to apply map operation on DataFrame columns?

We can apply a function to each row of a DataFrame using the map operation, which returns the result as an RDD. A PySpark DataFrame has no map method of its own, so we go through the underlying RDD with .rdd. Let’s apply a map operation on the Product column and take the first 5 elements of the mapped RDD (x, 1), using a lambda function.

In [47]:
# each element of the RDD is a Row, so read the value with prod.Product
schemaPeople.select('Product').rdd.map(lambda prod: (prod.Product, 1)).take(5)

In [50]:
schemaPeople.orderBy(schemaPeople.Amount.desc()).show()

+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
| Orange|  4000|  China|
|Carrots|  2000| Canada|
| Orange|  2000|    USA|
| Banana|  2000| Canada|
|Carrots|  2000| Canada|
|  Beans|  2000| Mexico|
|  Beans|  2000| Mexico|
| Orange|  2000|    USA|
|   null|  2000| Canada|
|  Beans|  2000| Mexico|
| Banana|  2000| Canada|
|  Beans|  1600|    USA|
|Carrots|  1500|    USA|
|  Beans|  1500|  China|
|Carrots|  1200|  China|
| Banana|  1000|    USA|
| Banana|   400|  China|
| Banana|     0| Canada|
+-------+------+-------+



In [57]:
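usa_products = schemaPeople.filter(schemaPeople.Country=='USA').select('Product')
china_products = schemaPeople.filter(schemaPeople.Country=='China').select('Product')
# products sold in the USA but not in China
usa_products.subtract(china_products).show()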

+-------+
|Product|
+-------+
+-------+



In [53]:
schemaPeople.filter(schemaPeople.Country=='USA').select('Product').show()

+-------+
|Product|
+-------+
| Banana|
|Carrots|
|  Beans|
| Orange|
| Orange|
+-------+



In [55]:
schemaPeople.filter(schemaPeople.Country=='China').select('Product').show()

+-------+
|Product|
+-------+
| Banana|
|Carrots|
|  Beans|
| Orange|
+-------+
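
The empty result of the `subtract` call follows from these two product lists. `subtract` behaves like SQL's `EXCEPT DISTINCT`, which corresponds to a set difference. A plain-Python sketch with illustrative variable names:

```python
# Product values for each country, as shown in the two outputs above.
usa = ["Banana", "Carrots", "Beans", "Orange", "Orange"]
china = ["Banana", "Carrots", "Beans", "Orange"]

# set difference: products that appear for the USA but not for China
only_in_usa = set(usa) - set(china)

print(only_in_usa)  # set()
```

Every product sold in the USA is also sold in China, so nothing is left after the subtraction.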

