#Using SparkSQL with SparkR

In this notebook we look at how to use SparkR, following the [SparkR documentation](http://spark.apache.org/docs/latest/sparkr.html). We load the data into a SparkSQL data frame and then inspect its schema.

#Creating the SparkSQL context

In this and the following notebooks, we first need a SparkSQL context in order to load data into a data frame. We also need to set basic environment variables such as SPARK_HOME to appropriate values.

In [1]:
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/usr/local/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
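
If SPARK_HOME points to the wrong directory, the `library(SparkR)` call fails with a message that does not mention the path. As a minimal sketch, the check below validates the directory first. `sparkr_lib_dir` is a hypothetical helper written for this notebook, not part of SparkR, and it assumes the same `R/lib` layout used in the cell above.

```r
# Hypothetical helper (not part of SparkR): returns the SparkR library
# directory under a Spark installation, or stops with a clear message.
sparkr_lib_dir <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(spark_home)) {
    stop("SPARK_HOME is not set")
  }
  # Same layout as in the cell above: <SPARK_HOME>/R/lib
  lib_dir <- file.path(spark_home, "R", "lib")
  if (!dir.exists(lib_dir)) {
    stop("SparkR library directory not found: ", lib_dir)
  }
  lib_dir
}
```

With this helper, the library path could be set as `.libPaths(c(sparkr_lib_dir(), .libPaths()))`.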

Let's load the SparkR library.

In [2]:
library(SparkR)


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform



To use Spark we need a SparkContext. Following the [Spark documentation](http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparkcontext-sqlcontext), we create it with the sparkR.init command. As master we pass the address of the machine running Spark, or the word *local* when Spark runs on the local machine.

In [3]:
sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.2.0")

Launching java with spark-submit command /usr/local/spark/bin/spark-submit  --packages com.databricks:spark-csv_2.11:1.2.0 sparkr-shell /tmp/RtmpIMFoFJ/backend_port25ee41e3aa9a 


We now have a Spark instance waiting for our commands. The package passed in sparkPackages is used for reading CSV files. We can now create the SparkSQL context needed to build data frames.

In [4]:
sqlContext <- sparkRSQL.init(sc)
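
The value passed to sparkPackages in cell 3 is a Maven-style coordinate of the form `groupId:artifactId:version`. The sketch below only splits that string in base R to make the three parts visible. `parse_package_coordinate` is a hypothetical helper, not a SparkR function.

```r
# Split a Maven-style coordinate "groupId:artifactId:version" into its parts.
# parse_package_coordinate is a hypothetical helper, not a SparkR function.
parse_package_coordinate <- function(coordinate) {
  parts <- strsplit(coordinate, ":", fixed = TRUE)[[1]]
  if (length(parts) != 3) {
    stop("expected groupId:artifactId:version, got: ", coordinate)
  }
  list(group = parts[1], artifact = parts[2], version = parts[3])
}

# The coordinate used in cell 3
spark_csv <- parse_package_coordinate("com.databricks:spark-csv_2.11:1.2.0")
print(spark_csv$artifact)
```

The `_2.11` suffix in the artifact name is the Scala version the package was built for, so it generally needs to match the Scala version of the Spark installation.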

#Creating SparkSQL data frames

#Reading the CSV file
We load the data into a data frame using the [package](https://github.com/databricks/spark-csv) that Databricks provides for creating data frames from CSV files.

In [5]:
data_file_path <- '/home/dsuser/shared'

In [6]:
# Build the full path from the shared data directory defined above
traffic_injuries_file_path <- file.path(data_file_path, 'Road_Traffic_Injuries.txt')

In [7]:
system.time(
    traffic_injuries_df <- read.df(sqlContext, 
                        paste('file:', traffic_injuries_file_path, sep=''), 
                        header='true', 
                        source = "com.databricks.spark.csv", 
                        inferSchema='true')
)

   user  system elapsed 
  0.004   0.000  18.437 
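
The second argument of read.df in cell 7 is the file path with a `file:` prefix, which tells Spark to read from the local file system rather than from its default file system. The base R sketch below rebuilds that string outside Spark so it can be inspected. The variable name `traffic_injuries_uri` is introduced here only for illustration.

```r
# Same path components as in the notebook; the leading "" yields a leading "/"
traffic_injuries_file_path <- file.path("", "home", "dsuser", "shared",
                                        "Road_Traffic_Injuries.txt")

# Prefix with "file:" exactly as the paste(..., sep='') call in cell 7 does
traffic_injuries_uri <- paste0("file:", traffic_injuries_file_path)
print(traffic_injuries_uri)
```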

Let's look at the schema.

In [8]:
system.time(
    printSchema(traffic_injuries_df)
)

root
 |-- ind_id: integer (nullable = true)
 |-- ind_definition: string (nullable = true)
 |-- reportyear: string (nullable = true)
 |-- race_eth_code: integer (nullable = true)
 |-- race_eth_name: string (nullable = true)
 |-- geotype: string (nullable = true)
 |-- geotypevalue: long (nullable = true)
 |-- geoname: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- county_fips: integer (nullable = true)
 |-- region_name: string (nullable = true)
 |-- region_code: integer (nullable = true)
 |-- mode: string (nullable = true)
 |-- severity: string (nullable = true)
 |-- injuries: double (nullable = true)
 |-- totalpop: double (nullable = true)
 |-- poprate: double (nullable = true)
 |-- LL95CI_poprate: double (nullable = true)
 |-- UL95CI_poprate: double (nullable = true)
 |-- poprate_se: double (nullable = true)
 |-- poprate_rse: double (nullable = true)
 |-- CA_decile_pop: string (nullable = true)
 |-- CA_RR_poprate: double (nullable = true)
 |-- avmttotal: doubl

   user  system elapsed 
  0.000   0.000   0.145 

In [9]:
head(traffic_injuries_df)

Unnamed: 0,ind_id,ind_definition,reportyear,race_eth_code,race_eth_name,geotype,geotypevalue,geoname,county_name,county_fips,ellip.h,avmttotal,avmtrate,LL95CI_avmtrate,UL95CI_avmtrate,avmtrate_se,avmtrate_rse,CA_decile_avmt,CA_RR_avmtrate,groupquarters,version
1,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,326842416136.0,12.51062,12.12715,12.89408,0.1956456,1.563837,,1.0,823151.0,10/10/2014 12:00:00 AM
2,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,326842416136.0,41.12991,40.43462,41.8252,0.3547396,0.8624857,,1.0,823151.0,10/10/2014 12:00:00 AM
3,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,1214809885.0,69.9698,45.99164,93.94795,12.23375,17.48433,,1.0,,10/10/2014 12:00:00 AM
4,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,1214809885.0,452.7457,325.3094,580.1821,65.01855,14.36094,,1.0,,10/10/2014 12:00:00 AM
5,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,,,,,,,,,823151.0,10/10/2014 12:00:00 AM
6,753,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002,9,Total,CA,6,California,,,⋯,,,,,,,,,823151.0,10/10/2014 12:00:00 AM


In [10]:
nrow(traffic_injuries_df)

In [11]:
str(traffic_injuries_df)

Formal class 'DataFrame' [package "SparkR"] with 2 slots
  ..@ env:<environment: 0x47fb148> 
  ..@ sdf:Class 'jobj' <environment: 0x4800088> 


Here the DataFrame object is what lets us run operations in SparkSQL.

In [12]:
traffic_injuries_dfx <- filter(
    traffic_injuries_df, 
    isNotNull(traffic_injuries_df$reportyear) 
    & isNotNull(traffic_injuries_df$geotype)
    & isNotNull(traffic_injuries_df$race_eth_code)
)

In [13]:
nrows <- nrow(traffic_injuries_dfx)
nrows

There are no null values in these columns.
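
To see what the filter in cell 12 does, the same condition can be reproduced on an ordinary local R data frame. This is only an analogy in base R: it does not use SparkR, and the values are made up for illustration.

```r
# Small local data frame with the same three column names (made-up values)
local_df <- data.frame(
  reportyear    = c("2002", NA, "2004"),
  geotype       = c("CA", "CO", NA),
  race_eth_code = c(9L, 9L, 9L),
  stringsAsFactors = FALSE
)

# Keep a row only if all three columns are non-null, as in cell 12
keep <- !is.na(local_df$reportyear) &
  !is.na(local_df$geotype) &
  !is.na(local_df$race_eth_code)
local_dfx <- local_df[keep, ]

# Number of rows dropped by the filter
print(nrow(local_df) - nrow(local_dfx))
```

Comparing the row counts before and after the filter, as done here, is also how the effect of the filter on the Spark data frame can be checked.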

In [14]:
?summary

describe {SparkR}    R Documentation

x         A DataFrame to be computed.
col       A string of name
...       Additional expressions
object    A fitted MLlib model


In [15]:
system.time(
    traffic_injuries_dfx_summary <- describe(traffic_injuries_dfx)
)

   user  system elapsed 
  0.040   0.008 154.043 

In [16]:
collect(traffic_injuries_dfx_summary)

Unnamed: 0,summary,ind_id,ind_definition,reportyear,race_eth_code,race_eth_name,geotype,geotypevalue,geoname,county_name,ellip.h,avmttotal,avmtrate,LL95CI_avmtrate,UL95CI_avmtrate,avmtrate_se,avmtrate_rse,CA_decile_avmt,CA_RR_avmtrate,groupquarters,version
1,count,494226.0,494226,494226.0,494226.0,494226,494226,494226.0,494226,494226,⋯,12320.0,11774.0,11774.0,11774.0,11774.0,11774.0,494226.0,11774.0,59418.0,494226
2,mean,753.0,,2005.8811921576623,9.0,,,4431001576.189009,1847.7611721738963,,⋯,3030276587.57078,63.90611302781159,20.57549144469093,122.50280012022058,29.896268924698862,52.30289196387992,5.50058207217695,2.658148755860837,9092.740300918913,
3,stddev,0.0,,2.5442331668765643,0.0,,,3256269125.340502,2376.7218177940317,,⋯,23100238529.79673,130.3678309097662,57.25034022249537,290.4571520031402,89.09777729450262,39.04897780494879,2.8690840032038887,5.41543275468215,49322.98790573563,
4,min,753.0,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2002.0,9.0,Total,CA,1.0,0001.00,,⋯,0.0,0.674050088276818,0.0,1.60823584930548,0.136645672215764,0.858187328755105,,0.091088010318861,0.0,10/10/2014 12:00:00 AM
5,max,753.0,Annual number of fatal and severe road traffic injuries per population and per miles traveled by transport mode,2010.0,9.0,Total,RE,99999999999.0,Zayante CDP,Yuba,⋯,336306148012.608,4949.82133549228,770.840342077224,11809.9237101896,3500.05223198844,223.606797749979,9.0,246.054811599176,834673.0,10/10/2014 12:00:00 AM


Of course, only the columns for which these statistics are meaningful should be considered here.

In [18]:
collect(select(traffic_injuries_dfx_summary,"summary","avmtrate"))

Unnamed: 0,summary,avmtrate
1,count,11774.0
2,mean,63.90611302781159
3,stddev,130.3678309097662
4,min,0.674050088276818
5,max,4949.82133549228
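
One way to narrow down the columns worth summarising is to start from the column types. The sketch below is plain base R: the name/type pairs are copied by hand from part of the printSchema output in cell 8, and `schema_pairs`, `numeric_types` and `numeric_columns` are names introduced here for illustration. SparkR's dtypes function is understood to return similar name/type pairs, but its type labels may differ from those printed by printSchema, so treat that as an assumption.

```r
# A few name/type pairs copied by hand from the printSchema output
schema_pairs <- list(
  c("ind_definition", "string"),
  c("reportyear", "string"),
  c("geotypevalue", "long"),
  c("injuries", "double"),
  c("totalpop", "double"),
  c("poprate", "double")
)

# Type labels treated as numeric in this sketch
numeric_types <- c("integer", "long", "double")

# Keep only the names whose type is numeric
is_numeric <- vapply(schema_pairs, function(p) p[2] %in% numeric_types, logical(1))
numeric_columns <- vapply(schema_pairs[is_numeric], function(p) p[1], character(1))
print(numeric_columns)
```

A type check is only a first pass: identifier-like columns such as geotypevalue are numeric but their mean or standard deviation carries no meaning, so the final choice still needs a look at what each column represents.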


In this notebook we loaded CSV data into a SparkSQL data frame using SparkR and then computed per-column summary statistics.