# Data Exploration

Let's load a dataset with PySpark.

## Packages

In [4]:
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext

In [5]:
conf = SparkConf().setMaster("local").setAppName("daily_water")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

In [7]:
!dir

 Volume in drive D is DATOS
 Volume Serial Number is 3449-1924

 Directory of D:\Usuarios\rhaps\Documents\GitHub\Apache-Spark\python

15/09/2019  09:50 p. m.    <DIR>          .
15/09/2019  09:50 p. m.    <DIR>          ..
15/09/2019  09:48 p. m.    <DIR>          .ipynb_checkpoints
15/09/2019  09:43 p. m.               887 customer-orders.py
15/09/2019  09:50 p. m.             1.340 Data Exploration - Pyspark.ipynb
15/09/2019  09:43 p. m.             4.400 degrees-of-separation.py
15/09/2019  09:43 p. m.               800 friends-by-age.py
15/09/2019  09:43 p. m.               946 max-temperatures.py
15/09/2019  09:43 p. m.               966 min-temperatures.py
15/09/2019  09:43 p. m.             1.001 most-popular-marveleheroe.py
15/09/2019  09:43 p. m.             3.712 movie-similarities-1m.py
15/09/2019  09:43 p. m.             3.769 movie-similarities.py
15/09/2019  09:43 p. m.             1.484 popular-movies-dataframe.py
15/09/2019  09:43 p. m.             1

In [10]:
# format selects the spark-csv data source; inferSchema asks Spark to infer the column types
df = sqlContext.read.load("../datasets/daily_weather.csv", 
                          format="com.databricks.spark.csv",
                          header=True, inferSchema="true")

In [11]:
type(df)

pyspark.sql.dataframe.DataFrame

In [12]:
df.columns

['number',
 'air_pressure_9am',
 'air_temp_9am',
 'avg_wind_direction_9am',
 'avg_wind_speed_9am',
 'max_wind_direction_9am',
 'max_wind_speed_9am',
 'rain_accumulation_9am',
 'rain_duration_9am',
 'relative_humidity_9am',
 'relative_humidity_3pm']

In [13]:
df.dtypes

[('number', 'int'),
 ('air_pressure_9am', 'double'),
 ('air_temp_9am', 'double'),
 ('avg_wind_direction_9am', 'double'),
 ('avg_wind_speed_9am', 'double'),
 ('max_wind_direction_9am', 'double'),
 ('max_wind_speed_9am', 'double'),
 ('rain_accumulation_9am', 'double'),
 ('rain_duration_9am', 'double'),
 ('relative_humidity_9am', 'double'),
 ('relative_humidity_3pm', 'double')]

In [14]:
df.printSchema()

root
 |-- number: integer (nullable = true)
 |-- air_pressure_9am: double (nullable = true)
 |-- air_temp_9am: double (nullable = true)
 |-- avg_wind_direction_9am: double (nullable = true)
 |-- avg_wind_speed_9am: double (nullable = true)
 |-- max_wind_direction_9am: double (nullable = true)
 |-- max_wind_speed_9am: double (nullable = true)
 |-- rain_accumulation_9am: double (nullable = true)
 |-- rain_duration_9am: double (nullable = true)
 |-- relative_humidity_9am: double (nullable = true)
 |-- relative_humidity_3pm: double (nullable = true)



In [17]:
df.describe().toPandas().T

                            0                    1                   2                   3                   4
summary                 count                 mean              stddev                 min                 max
number                   1095                547.0  316.24357700987383                   0                1094
air_pressure_9am         1092    918.8825513138094   3.184161180386833   907.9900000000024   929.3200000000012
air_temp_9am             1090    64.93300141287072  11.175514003175877  36.752000000000685   98.90599999999992
avg_wind_direction_9am   1091    142.2355107005759   69.13785928889189  15.500000000000046               343.4
avg_wind_speed_9am       1092     5.50828424225493  4.5528134655317185    0.69345139999974  23.554978199999763
max_wind_direction_9am   1092   148.95351796516923   67.23801294602953   28.89999999999991  312.19999999999993
max_wind_speed_9am       1091    7.019513529175272   5.598209170780958  1.1855782000000479   29.84077959999996
rain_accumulation_9am    1089  0.20307895225211126  1.5939521253574893                 0.0   24.01999999999907
rain_duration_9am        1092    294.1080522756142  1598.0787786601481                 0.0             17704.0
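
The number row of this summary can be checked by hand. The following is a minimal sketch, assuming that number is simply the row index 0 through 1094 (which its count, min, and max suggest) and that the stddev row is the sample standard deviation, with n - 1 in the denominator. It uses only Python's statistics module, not Spark.

```python
import statistics

# Assumption: the number column holds the consecutive integers 0..1094.
number = list(range(1095))

summary = {
    "count": len(number),
    "mean": statistics.mean(number),
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    "stddev": statistics.stdev(number),
    "min": min(number),
    "max": max(number),
}

for name, value in summary.items():
    print(name, value)
```

Under those assumptions the printed stddev should agree with the value reported for number, about 316.24.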


In [21]:
df.describe().toPandas().air_pressure_9am

0                 1092
1    918.8825513138094
2    3.184161180386833
3    907.9900000000024
4    929.3200000000012
Name: air_pressure_9am, dtype: object

In [24]:
df.describe("air_pressure_9am").show()

+-------+-----------------+
|summary| air_pressure_9am|
+-------+-----------------+
|  count|             1092|
|   mean|918.8825513138094|
| stddev|3.184161180386833|
|    min|907.9900000000024|
|    max|929.3200000000012|
+-------+-----------------+



In [26]:
df.toPandas().shape

(1095, 11)

In [27]:
df.count()

1095

Let's count the NaN values in each column of our dataset

In [29]:
df.toPandas().isna().sum()

number                    0
air_pressure_9am          3
air_temp_9am              5
avg_wind_direction_9am    4
avg_wind_speed_9am        3
max_wind_direction_9am    3
max_wind_speed_9am        4
rain_accumulation_9am     6
rain_duration_9am         3
relative_humidity_9am     0
relative_humidity_3pm     0
dtype: int64

We are going to drop the rows where air_pressure_9am is missing

In [30]:
df2 = df.na.drop(subset="air_pressure_9am")

In [31]:
df2.count()

1092
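
Dropping on a subset only looks at the listed column, so the other columns can still contain missing values afterwards. The following is a minimal plain-Python sketch of that behaviour on made-up rows. The values and the helper names count_missing and drop_missing are hypothetical, None stands in for a missing entry, and nothing here calls Spark.

```python
# Made-up rows; None marks a missing value.
rows = [
    {"air_pressure_9am": 918.1, "air_temp_9am": 74.8},
    {"air_pressure_9am": None, "air_temp_9am": 71.4},
    {"air_pressure_9am": 923.0, "air_temp_9am": None},
    {"air_pressure_9am": 920.5, "air_temp_9am": 70.1},
]


def count_missing(rows):
    """Return the number of missing entries per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def drop_missing(rows, subset):
    """Keep only the rows with no missing value in the subset columns."""
    return [row for row in rows if all(row[c] is not None for c in subset)]


kept = drop_missing(rows, subset=["air_pressure_9am"])
print(len(rows), len(kept))
print(count_missing(kept))
```

In this toy data one row is removed, and air_temp_9am still has one missing entry in the rows that remain.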

Let's compute the correlation between two columns 

In [32]:
# Both columns may still contain NaN values, but the corr method can still be used
df2.stat.corr("rain_accumulation_9am","rain_duration_9am")

0.7298253479609021
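
For reference, the Pearson correlation coefficient can be written out directly. The following is a minimal sketch in plain Python on made-up data. The function name pearson is hypothetical, and pairs with a missing value (None) in either column are skipped; that skipping rule is an assumption of this sketch, not a verified description of what Spark does internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Made-up values; None marks a missing entry.
# The three complete pairs lie on the line y = 20x, so r is close to 1.
accumulation = [0.0, 0.5, None, 2.0, 4.0]
duration = [0.0, 10.0, 30.0, None, 80.0]

r = pearson(accumulation, duration)
print(r)
```

A value near 1 means the two columns rise together almost linearly, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.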