# "[Spark] PySpark Unsupervised Learning Models"
> Unsupervised learning with PySpark

- toc: true 
- badges: true
- comments: true
- categories: [Spark]
- tags: [spark, pyspark, unsupervised, kmeans]

# Unsupervised Learning

In [1]:
import os

# MinIO credentials are read from environment variables
MINIO_ACCESS_KEY = os.environ['MINIO_ACCESS_KEY']
MINIO_SECRET_KEY = os.environ['MINIO_SECRET_KEY']

# Point the S3A connector at the MinIO endpoint
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", MINIO_ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", MINIO_SECRET_KEY)
hadoop_conf.set("fs.s3a.endpoint", "http://lab101:10170")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("com.amazonaws.services.s3.enableV2", "true")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

In [2]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler()\
    .setInputCols(["Quantity", "UnitPrice"])\
    .setOutputCol("features")
sales_raw = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("s3a://data/retail-data/by-day/*.csv")\
    .limit(50)\
    .coalesce(1)\
    .where("Description IS NOT NULL")
sales_raw.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



In [3]:
sales = va.transform(sales_raw)
sales.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: string, UnitPrice: double, CustomerID: double, Country: string, features: vector]

In [4]:
sales.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|   features|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|[48.0,1.79]|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|[20.0,1.25]|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|[24.0,1.65]|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|[24.0,1.25]|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom| [6.0,2.55]|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+

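# K-Means
- The user chooses the number of clusters, $k$, and $k$ distinct points in the dataset are picked at random as the initial clusters
- Every remaining point is assigned to the nearest cluster, measured by its Euclidean distance to those initially chosen points
- Once every point has been assigned to one of the $k$ predefined clusters, the center of each cluster, called the centroid, is computed, and the process repeats
- In each pass, all points are assigned to their nearest centroid and new centroids are then computed
- This continues for a specified number of iterations or until convergence, that is, until the centroids stop changing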
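
The steps above can be sketched in plain Python. This is only an illustration of the basic iteration, not Spark's implementation (Spark defaults to the `k-means||` initialization rather than purely random picks). The function name `kmeans` and the toy points are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct points at random as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once the centers no longer move (convergence).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical toy points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

The `max_iter` argument plays the role of the iteration limit, and the equality check on the centers is the convergence test described in the last bullet.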

In [5]:
from pyspark.ml.clustering import KMeans

km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)

distanceMeasure: the distance measure. Supported options: 'euclidean' and 'cosine'. (default: euclidean)
featuresCol: features column name. (default: features)
initMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)
initSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)
k: The number of clusters to create. Must be > 1. (default: 2, current: 5)
maxIter: max number of iterations (>= 0). (default: 20)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: 2732625111183068529)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)


In [6]:
summary = kmModel.summary
print(summary.clusterSizes)

[10, 3, 10, 19, 8]


In [7]:
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[12.    0.93]
[44.          1.16333333]
[23.2    0.956]
[5.21052632 3.74105263]
[ 2.5     11.24375]
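
To see how these centers are used, here is a rough sketch of assigning a point to its closest center by Euclidean distance, the default `distanceMeasure` shown above. In Spark itself, `kmModel.transform(sales)` adds a `prediction` column for you; the helper `nearest_center` below is a hypothetical name used only for illustration.

```python
import math

# Cluster centers copied from the k-means output above: (Quantity, UnitPrice).
centers = [
    (12.0, 0.93),
    (44.0, 1.16333333),
    (23.2, 0.956),
    (5.21052632, 3.74105263),
    (2.5, 11.24375),
]


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# First row of the sample data: Quantity=48, UnitPrice=1.79.
first_row_cluster = nearest_center((48.0, 1.79), centers)
# A hypothetical small, expensive order: Quantity=2, UnitPrice=12.0.
small_order_cluster = nearest_center((2.0, 12.0), centers)
print(first_row_cluster, small_order_cluster)
```

Because `Quantity` spans a much wider range than `UnitPrice`, it dominates the distance here; scaling the features first is a common remedy.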


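# Bisecting $k$-means
- Bisecting $k$-means is a variant of $k$-means
- $k$-means is a bottom-up clustering method that groups data points by first assigning many groups to the data, whereas bisecting $k$-means works the other way around, as a top-down method
- It starts with a single group, splits it into smaller groups, and stops once it reaches the number of clusters the user specified
- It is often faster than $k$-means and generally produces a different clustering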
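
The top-down idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Spark's `BisectingKMeans`: in particular, always splitting the cluster with the largest sum of squared errors is a choice made for this sketch, and all function names and the toy points are hypothetical.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def two_means(points, rng, max_iter=5):
    """Split one cluster into two with a few k-means iterations (k=2)."""
    # Start from two distinct points so that both halves are non-empty.
    centers = rng.sample(sorted(set(points)), 2)
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break
        centers = [mean(halves[0]), mean(halves[1])]
    return halves


def bisecting_kmeans(points, k, seed=0):
    """Top-down clustering: start with one cluster and split until there are k."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two distinct points can be split.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        # Assumption of this sketch: split the cluster with the largest SSE.
        target = max(splittable, key=sse)
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Hypothetical toy points forming three loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0), (9.1, 1.2)]
clusters = bisecting_kmeans(points, k=3)
for cluster in clusters:
    print(cluster)
```

Each pass through the `while` loop replaces one cluster with two, so the number of clusters grows from 1 up to the requested `k`.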

In [8]:
from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans().setK(5).setMaxIter(5)
bkmModel = bkm.fit(sales)

In [9]:
summary = bkmModel.summary
print(summary.clusterSizes)

[8, 16, 13, 10, 3]


In [11]:
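# computeCost returns the sum of squared distances to the nearest center (deprecated since Spark 3.0)
bkmModel.computeCost(sales)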
centers = bkmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 2.5     11.24375]
[4.8125   4.095625]
[10.92307692  1.14230769]
[23.2    0.956]
[44.          1.16333333]




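# Gaussian Mixture Models
- A Gaussian mixture model (GMM) makes different assumptions from bisecting $k$-means and $k$-means
- Whereas $k$-means groups data by reducing the sum of squared distances from each cluster's center, a Gaussian mixture model assumes that each cluster generates its data by random draws from a Gaussian distribution
- As a result, data should be unlikely near the edge of a cluster (as reflected in the Gaussian distribution) and much more likely near its center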
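
A minimal sketch of the "soft" assignment this implies, for a one-dimensional mixture: each point gets a probability of belonging to each component, which is high near a component's center and low toward its edge. The two-component mixture and the function names are hypothetical; Spark's `GaussianMixture` works with multivariate Gaussians and estimates these parameters itself.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each component given x (soft assignment)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture over a single feature.
weights = [0.5, 0.5]
means = [0.0, 10.0]
stds = [2.0, 2.0]

for x in (0.0, 5.0, 10.0):
    print(x, [round(r, 4) for r in responsibilities(x, weights, means, stds)])
```

A point halfway between the two means is shared equally by both components, while a point sitting on one mean belongs almost entirely to that component.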

In [12]:
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture().setK(5)
print(gmm.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
featuresCol: features column name. (default: features)
k: Number of independent Gaussians in the mixture model. Must be > 1. (default: 2, current: 5)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
seed: random seed. (default: 1554811416070000952)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.01)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)


In [13]:
model = gmm.fit(sales)

In [14]:
summary = model.summary
print(model.weights)

[0.19996361501436077, 0.2650758391757304, 0.4549241736967528, 0.0600363721131339, 0.02000000000002211]


In [15]:
model.gaussiansDF.show()



+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[23.1998838713342...|2.560278699114912...|
|[3.12998187641630...|1.222198686648849...|
|[8.33175706727699...|11.78027654132310...|
|[43.9877858780125...|32.22708770259816...|
|[8.00000000001486...|6.062776947139928...|
+--------------------+--------------------+



In [16]:
summary.cluster.show()

+----------+
|prediction|
+----------+
|         3|
|         0|
|         0|
|         0|
|         2|
|         3|
|         4|
|         0|
|         2|
|         2|
|         1|
|         2|
|         2|
|         2|
|         2|
|         2|
|         3|
|         0|
|         2|
|         2|
+----------+
only showing top 20 rows



In [18]:
summary.clusterSizes

[10, 11, 25, 3, 1]
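
The mixture weights and the cluster sizes can be compared with a little arithmetic. As a rough reading, a weight multiplied by the number of rows gives a "soft" count for that component, whereas `clusterSizes` counts each row once under its most probable component, so the two need not match exactly.

```python
# Values copied from the model.weights and summary.clusterSizes outputs above.
weights = [0.19996361501436077, 0.2650758391757304, 0.4549241736967528,
           0.0600363721131339, 0.02000000000002211]
cluster_sizes = [10, 11, 25, 3, 1]

n_rows = sum(cluster_sizes)  # 50 rows were clustered
# Each weight times the row count is the soft number of rows per component.
soft_counts = [w * n_rows for w in weights]

for k, (soft, hard) in enumerate(zip(soft_counts, cluster_sizes)):
    print(f"component {k}: soft count {soft:.2f}, hard count {hard}")
```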

In [19]:
summary.probability.show()

+--------------------+
|         probability|
+--------------------+
|[1.37632436456510...|
|[0.99999639767391...|
|[0.99999717659721...|
|[0.99997707446759...|
|[1.33915680121729...|
|[1.37558540775791...|
|[8.24399594549003...|
|[0.99999776336008...|
|[1.06905010900131...|
|[1.06905010900131...|
|[1.65702583224003...|
|[1.98857135239437...|
|[8.70737611462661...|
|[8.70737611462661...|
|[4.99008297332649...|
|[1.08580292118476...|
|[4.39868445087168...|
|[0.99997707446759...|
|[1.06905010900131...|
|[1.06905010900131...|
+--------------------+
only showing top 20 rows



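# Latent Dirichlet Allocation
- Latent Dirichlet allocation (LDA) is a hierarchical clustering model typically used for topic modeling on text documents
- It extracts topics from a set of documents and the keywords associated with those topics, then describes each document by how much each topic contributes to it
- Spark offers two ways to fit an **LDA** model: online **LDA** and expectation maximization (EM)
- In general, online **LDA** works better when there are many examples, and expectation maximization works better when the vocabulary is large
- In terms of scalability, it can handle hundreds to thousands of topics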

In [22]:
from pyspark.ml.feature import Tokenizer, CountVectorizer

tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
    .setInputCol("DescOut")\
    .setOutputCol("features")\
    .setVocabSize(500)\
    .setMinTF(0)\
    .setMinDF(0)\
    .setBinary(True)
cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)
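
As a rough picture of what this preprocessing produces, the sketch below builds a binary bag-of-words vector in plain Python. It only approximates the Spark classes: Python's `split()` drops empty tokens, the tie-breaking order of the vocabulary is an assumption, and the last description is a made-up example. All function names are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace, similar in spirit to Tokenizer."""
    return text.lower().split()


def fit_vocabulary(docs, vocab_size):
    """Keep the vocab_size most frequent terms across all documents."""
    counts = {}
    for tokens in docs:
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:vocab_size]


def binary_vector(tokens, vocabulary):
    """1.0 if the term occurs in the document, else 0.0 (like setBinary(True))."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


# Three descriptions from the sample rows plus one hypothetical description.
descriptions = [
    "RABBIT NIGHT LIGHT",
    "DOUGHNUT LIP GLOSS",
    "GUMBALL COAT RACK",
    "NIGHT LIGHT HOLDER",
]
docs = [tokenize(d) for d in descriptions]
vocabulary = fit_vocabulary(docs, vocab_size=5)
vectors = [binary_vector(tokens, vocabulary) for tokens in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one vocabulary term, which is why the topic output later refers to terms by index.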

In [23]:
from pyspark.ml.clustering import LDA
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())

checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
docConcentration: Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). (undefined)
featuresCol: features column name. (default: features)
k: The number of topics (clusters) to infer. Must be > 1. (default: 10, current: 10)
keepLastCheckpoint: (For EM optimizer) If using checkpointing, this indicates whether to keep the last checkpoint. If false, then the checkpoint will be deleted. Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care. (default: True)
learningDecay: Learning rate, set as anexponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. (default: 0.51)
learningOffset: A (pos

In [24]:
model = lda.fit(prepped)

In [25]:
model.describeTopics(3).show()

+-----+---------------+--------------------+
|topic|    termIndices|         termWeights|
+-----+---------------+--------------------+
|    0| [60, 116, 124]|[0.00913991758035...|
|    1|  [114, 11, 92]|[0.01183209974899...|
|    2|  [55, 112, 27]|[0.01574372445964...|
|    3|  [56, 106, 88]|[0.01117898536785...|
|    4|   [7, 40, 102]|[0.01211664967569...|
|    5|[112, 139, 121]|[0.00933025066380...|
|    6|    [122, 2, 4]|[0.01118652206493...|
|    7|    [78, 6, 51]|[0.00880822582151...|
|    8|  [23, 48, 100]|[0.01152078111328...|
|    9|  [59, 127, 19]|[0.01128129509911...|
+-----+---------------+--------------------+
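
The `termIndices` column holds positions in the `CountVectorizer` vocabulary, so each index can be looked up in `cvFitted.vocabulary`. A minimal sketch for topic 7, using only the three vocabulary entries it needs as they appear in this post's vocabulary listing (the dictionary `vocabulary_subset` is a stand-in for the full list):

```python
# Entries copied from the cvFitted.vocabulary listing in this post (index -> term).
vocabulary_subset = {6: "home", 51: "tins", 78: "shimmering"}

# termIndices for topic 7 from the describeTopics(3) output above.
topic_7_indices = [78, 6, 51]

# Translate each index into its vocabulary term, keeping the weight order.
topic_7_terms = [vocabulary_subset[i] for i in topic_7_indices]
print(topic_7_terms)
```

With the fitted model at hand, the same lookup would be `[cvFitted.vocabulary[i] for i in indices]` for each row of the topics table.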



In [26]:
cvFitted.vocabulary

['water',
 'hot',
 'vintage',
 'bottle',
 'paperweight',
 '6',
 'home',
 'doormat',
 'landmark',
 'bicycle',
 'frame',
 'ribbons',
 '',
 'classic',
 'rose',
 'kit',
 'leaf',
 'sweet',
 'bag',
 'airline',
 'doorstop',
 'light',
 'in',
 'christmas',
 'heart',
 'calm',
 'set',
 'keep',
 'balloons',
 'night',
 'lights',
 '12',
 'tin',
 'english',
 'caravan',
 'stuff',
 'tidy',
 'oxford',
 'full',
 'cottage',
 'notting',
 'drawer',
 'mushrooms',
 'chrome',
 'champion',
 'amelie',
 'mini',
 'the',
 'giant',
 'design',
 'elegant',
 'tins',
 'jet',
 'fairy',
 "50's",
 'holder',
 'message',
 'blue',
 'storage',
 'tier',
 'covent',
 'world',
 'skulls',
 'font',
 'hearts',
 'skull',
 'clips',
 'bell',
 'red',
 'party',
 'chalkboard',
 'save',
 '4',
 'coloured',
 'poppies',
 'garden',
 'nine',
 'girl',
 'shimmering',
 'doughnut',
 'dog',
 '3',
 'tattoos',
 'chilli',
 'coat',
 'torch',
 'sunflower',
 'tale',
 'cards',
 'puncture',
 'woodland',
 'bomb',
 'knack',
 'lip',
 'collage',
 'rabbit',
 'sex