Let's do some simple data analysis on Apple stock data. Every answer must be produced with PySpark.

First, install PySpark and Py4J.

In [10]:
!pip install pyspark==3.0.1 py4j==0.10.9 



#### Creating a Spark Session

In [11]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark Dataframe basic example") \
    .getOrCreate()

#### Loading the Apple stock CSV file: https://pyspark-test-sj.s3-us-west-2.amazonaws.com/appl_stock.csv
Load it into a pandas DataFrame first, then convert it to a Spark DataFrame.

In [12]:
import pandas as pd

apple_pandas_df = pd.read_csv("https://pyspark-test-sj.s3-us-west-2.amazonaws.com/appl_stock.csv")
apple_spark_df = spark.createDataFrame(apple_pandas_df)

#### 1> What are the column names?

In [13]:
apple_spark_df.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

#### 2> Print the schema

In [14]:
apple_spark_df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: long (nullable = true)
 |-- Adj Close: double (nullable = true)



#### 3> Print the first 5 records

In [15]:
apple_spark_df.show(n=5)

+----------+----------+----------+----------+----------+---------+------------------+
|      Date|      Open|      High|       Low|     Close|   Volume|         Adj Close|
+----------+----------+----------+----------+----------+---------+------------------+
|2010-01-04|213.429998|214.499996|212.380001|214.009998|123432400|         27.727039|
|2010-01-05|214.599998|215.589994|213.249994|214.379993|150476200|         27.774976|
|2010-01-06|214.379993|    215.23|210.750004|210.969995|138040000|27.333178000000004|
|2010-01-07|    211.75|212.000006|209.050005|    210.58|119282800|          27.28265|
|2010-01-08|210.299994|212.000006|209.060005|211.980005|111902700|         27.464034|
+----------+----------+----------+----------+----------+---------+------------------+
only showing top 5 rows



#### 4> Use describe to view per-column statistics for the DataFrame

In [16]:
apple_spark_df.describe().show()

+-------+----------+------------------+------------------+-----------------+-----------------+-------------------+------------------+
|summary|      Date|              Open|              High|              Low|            Close|             Volume|         Adj Close|
+-------+----------+------------------+------------------+-----------------+-----------------+-------------------+------------------+
|  count|      1762|              1762|              1762|             1762|             1762|               1762|              1762|
|   mean|      null|313.07631115891036| 315.9112880164586|309.8282405079455|312.9270656379115|9.422577587968218E7| 75.00174115607263|
| stddev|      null|185.29946803981542|186.89817686485773|183.3839166437097|185.1471036170944|6.020518776592715E7|28.574929721799045|
|    min|2010-01-04|              90.0|         90.699997|        89.470001|        90.279999|           11475900|         24.881912|
|    max|2016-12-30|        702.409988|        705.070023|    
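
A note on the `stddev` row: `describe` reports the sample standard deviation (dividing by n - 1), not the population one. The plain-Python sketch below shows the difference on the five `Close` values from the `show(n=5)` output. It only illustrates the two formulas and does not reproduce the full-table numbers.

```python
import statistics

# Close values of the first five rows shown in the show(n=5) output.
close_values = [214.009998, 214.379993, 210.969995, 210.58, 211.980005]

mean_close = statistics.mean(close_values)
sample_std = statistics.stdev(close_values)       # divides by n - 1
population_std = statistics.pstdev(close_values)  # divides by n

print(f"mean={mean_close:.6f}")
print(f"sample stddev={sample_std:.6f}, population stddev={population_std:.6f}")
```

With 1762 rows the two values are nearly identical, but on small samples the gap is noticeable.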

#### 5> What is the mean of the Close column?

In [17]:
from pyspark.sql.functions import mean

apple_spark_df.select(mean("Close")).show()

+-----------------+
|       avg(Close)|
+-----------------+
|312.9270656379115|
+-----------------+



#### 6> What are the maximum and minimum values of the Volume column?

In [18]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in min and max

apple_spark_df.select(F.max("Volume"), F.min("Volume")).show()

+-----------+-----------+
|max(Volume)|min(Volume)|
+-----------+-----------+
|  470249500|   11475900|
+-----------+-----------+



#### Bonus question: create a DataFrame with a new column named HV ratio (`hv ratio` in the code), computed as High / Volume

In [19]:
apple_spark_df_with_hv = apple_spark_df.withColumn("hv ratio", apple_spark_df.High/apple_spark_df.Volume) 

In [20]:
apple_spark_df_with_hv.show(5)

+----------+----------+----------+----------+----------+---------+------------------+--------------------+
|      Date|      Open|      High|       Low|     Close|   Volume|         Adj Close|            hv ratio|
+----------+----------+----------+----------+----------+---------+------------------+--------------------+
|2010-01-04|213.429998|214.499996|212.380001|214.009998|123432400|         27.727039|1.737793286041590...|
|2010-01-05|214.599998|215.589994|213.249994|214.379993|150476200|         27.774976|1.432718223878593...|
|2010-01-06|214.379993|    215.23|210.750004|210.969995|138040000|27.333178000000004|1.559185743262822...|
|2010-01-07|    211.75|212.000006|209.050005|    210.58|119282800|          27.28265|1.777288980473295...|
|2010-01-08|210.299994|212.000006|209.060005|211.980005|111902700|         27.464034|1.894503045949740...|
+----------+----------+----------+----------+----------+---------+------------------+--------------------+
only showing top 5 rows
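
The `hv ratio` values look large at first glance because `show` cuts any cell longer than 20 characters and replaces the tail with `...`, which hides the exponent. Passing `truncate=False` to `show` prints the full value. As a quick sanity check, the plain-Python sketch below recomputes the ratio for the first two rows by hand; the variable names are illustrative only.

```python
# (High, Volume) pairs copied from the first two rows of the output above.
rows = [(214.499996, 123432400), (215.589994, 150476200)]

hv_ratios = [high / volume for high, volume in rows]
for ratio in hv_ratios:
    # Printed in scientific notation, on the order of 1e-06.
    print(ratio)
```

The leading digits match the truncated cells in the table, so the actual ratios are on the order of one millionth.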



#### Bonus question: what is the average of the Close column for each month?

In [21]:
from pyspark.sql.functions import month

monthdf = apple_spark_df.withColumn("Month", month("Date"))

In [22]:
monthavgdf = monthdf.select(["Month", "Close"]).groupBy("Month").mean()

In [23]:
monthavgdf.show()

+-----+----------+------------------+
|Month|avg(Month)|        avg(Close)|
+-----+----------+------------------+
|   12|      12.0| 302.3505362684563|
|    1|       1.0| 322.2097142571429|
|    6|       6.0|      288.12546566|
|    3|       3.0| 332.9115673137254|
|    5|       5.0| 351.6210208571428|
|    9|       9.0|301.07631959027776|
|    4|       4.0| 340.5104108150685|
|    8|       8.0|300.43858096129026|
|    7|       7.0|281.72216211486483|
|   10|      10.0| 308.3055256315789|
|   11|      11.0| 306.2725174895105|
|    2|       2.0| 321.3595563037037|
+-----+----------+------------------+



In [24]:
monthavgdf.select(["Month", "avg(Close)"]).orderBy("Month").show()

+-----+------------------+
|Month|        avg(Close)|
+-----+------------------+
|    1| 322.2097142571429|
|    2| 321.3595563037037|
|    3| 332.9115673137254|
|    4| 340.5104108150685|
|    5| 351.6210208571428|
|    6|      288.12546566|
|    7|281.72216211486483|
|    8|300.43858096129026|
|    9|301.07631959027776|
|   10| 308.3055256315789|
|   11| 306.2725174895105|
|   12| 302.3505362684563|
+-----+------------------+

