### Visualization

### Session 2

The classes of all columns can be defined manually with `StructType` and `StructField`. The latter takes three parameters:
- `name` of the column
- `dataType` of the column
- `nullable`, which defines whether the column can be null: true/false

In [None]:
# define the schemas
transactionSchema  = StructType([StructField('_c0', IntegerType(), True),
                                StructField('InvoiceNo', StringType(), True),
                                StructField('StockCode', StringType(), True),
                                StructField('Quantity', IntegerType(), True)])

In [19]:
# change the datatype of InvoiceDate from string to timestamp
invoices = invoices.withColumn("InvoiceDate", F.to_timestamp("InvoiceDate", "M/d/yyyy H:m"))

In [None]:
# option 2: pyspark DataFrame functions
totalSold_2 = transactions.groupBy("StockCode") \
                            .agg({"Quantity": "sum"}) \
                            .withColumnRenamed("sum(Quantity)", "totalQuantitySold") \
                            .filter("totalQuantitySold > 25000") \
                            .sort("totalQuantitySold", ascending=False)

In [None]:
# get number of returned deliveries
df_ret = transactions.join(inventory, "StockCode") \
                    .select("InvoiceNo", (transactions.Quantity * inventory.UnitPrice).alias("Revenue")) \
                    .groupBy("InvoiceNo") \
                    .sum("Revenue") \
                    .withColumnRenamed("sum(Revenue)", "TotalRevenue") \
                    .filter("TotalRevenue < 0") \
                    .join(invoices, "InvoiceNo") \
                    .groupBy("CustomerID") \
                    .count() \
                    .withColumnRenamed("count", "nbReturned")

In [180]:
# check
df_ret.show(5)

+----------+----------+
|CustomerID|nbReturned|
+----------+----------+
|     13282|         3|
|     13610|         2|
|     15555|         4|
|     15271|         1|
|     14157|         1|
+----------+----------+
only showing top 5 rows



In [None]:
# define function
def add_one(var):
    var_new = var + 1
    return(var_new)

# wrap in udf
add_one_udf = udf(add_one, returnType=LongType())

# create new column
df = df.withColumn("x2", add_one_udf(df.value))

In [269]:
# check
df.show()

+---+-----+-------+---+
| id|value|and_one| x2|
+---+-----+-------+---+
|  A|    5|      6|  6|
|  B|   67|     68| 68|
|  C|  567|    568|568|
+---+-----+-------+---+



### Regression

In [None]:
read data -> create basetable -> pipeline

In [None]:
# remove the observations containing missing values
houses = houses.dropna('any')

# Keyword 'any' removes the row if any value of that row is NULL
# Keyword 'all' removes the row only if all values of that row are NULL
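
To make the difference between the two keywords concrete, here is a minimal plain-Python sketch (no Spark involved). The rows and column names are made up for illustration; `None` plays the role of NULL.

```python
# Hypothetical rows; None stands in for a NULL value.
rows = [
    {"price": 221900.0, "sqft_living": None},
    {"price": None, "sqft_living": None},
    {"price": 538000.0, "sqft_living": 2570.0},
]

# 'any': drop the row as soon as one value is missing.
kept_any = [r for r in rows if not any(v is None for v in r.values())]

# 'all': drop the row only when every value is missing.
kept_all = [r for r in rows if not all(v is None for v in r.values())]

print(len(kept_any), len(kept_all))
```

With `'any'` only the fully populated row survives, while `'all'` removes just the row in which every value is missing.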

<h4> Pipelines </h4>
   
    - Because we are working with Big Data processing infrastructure (using distributed processing), the way we code is slightly different from other data processing tools. Here the infrastructure has inherent built-in functionality that optimizes the way our code is processed. 
    - In short: the program chooses which steps to run when, and how they are distributed over the nodes. 
    - For this to be done efficiently, we need to give as many instructions as possible at the same time. This way the machine can decide how to divide and conquer. This is done by using pipelines. 
    - Each step in a pipeline is called a pipeline stage.
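
The idea of chaining stages can be sketched in plain Python. This is a toy illustration, not Spark's implementation: the class names `MeanCenterer`, `Doubler` and `ToyPipeline` are hypothetical, and unlike Spark (where `fit` returns a separate `PipelineModel`) the toy pipeline simply returns itself.

```python
class MeanCenterer:
    """Toy estimator: fit() learns the mean, transform() subtracts it."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [x - self.mean for x in rows]


class Doubler:
    """Toy transformer: nothing to learn, so fit() does nothing."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [2 * x for x in rows]


class ToyPipeline:
    """Runs the stages in order; each stage is fitted on the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


toy_pipeline = ToyPipeline([MeanCenterer(), Doubler()]).fit([1.0, 2.0, 3.0])
result = toy_pipeline.transform([4.0])
print(result)
```

The mean learned during `fit` (2.0) is reused when transforming new data, which is the same fit-once, apply-anywhere pattern the notebook relies on.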

<br> **Pipelines consist of different stages (transformers & estimators)**, some examples:
- **`StringIndexer`**: <br> As a first general step we need to check if there are any text variables (usually categorical variables) in the dataset. Not all ML algorithms are able to handle this type of data. That's why it's always good practice to translate textual categories into numerical categories (e.g. A,B,C -> 0,1,2). For our dataset it is not needed.<br> https://spark.apache.org/docs/latest/ml-features.html#stringindexer <br>
- **`OneHotEncoderEstimator`**:<br> Another way to handle categorical labels is by transforming them into a vector with 0's and 1's. <br> https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator <br>
- **`VectorIndexer`**: <br> Helps index categorical features in datasets of Vectors. Required for Tree methods.  <br>https://spark.apache.org/docs/latest/ml-features.html#vectorindexer <br>
- **`StandardScaler`**: <br> Transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. <br>https://spark.apache.org/docs/latest/ml-features.html#standardscaler <br>
- **`VectorAssembler`**: <br> Transforms a number of input columns into one vector. This is used for combining features in order to train ML models like LR and DTs. <br>https://spark.apache.org/docs/latest/ml-features.html#vectorassembler <br>
- **Full overview**: <br> https://spark.apache.org/docs/latest/ml-features.html<br> Select only what is needed for the data you have at hand.
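
As a rough illustration of the first two stages, the sketch below indexes string categories by descending frequency (the default ordering of `StringIndexer`) and then one-hot encodes the indices. The helper names and the sample data are hypothetical, ties are broken alphabetically as an assumption, and the sketch keeps every category, whereas Spark's encoder drops the last one by default.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index; the most frequent category gets 0.0."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec


countries = ["france", "belgium", "france", "spain", "france", "belgium"]
mapping = fit_string_index(countries)
indexed = [mapping[c] for c in countries]
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(encoded[0])
```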

<br> **Exercise:** Apply several transformers and estimators to the dataset to create the final basetable
<br> **NOTE:** Take into account that some transformers need to be applied to the entire dataset, while others need to be applied to the train/test sets separately.

In [None]:
# define the categorical variables
cat_cols = ['waterfront', 'view', 'floors', 'condition', 'grade', 'zipcode', 'renovated', 'bedrooms', 'bathrooms']

# define the assembler
VA_cat = VectorAssembler(inputCols=cat_cols, outputCol="cat_features")

In [None]:
# define indexer
VI = VectorIndexer(inputCol="cat_features", outputCol="cat_features_indexed")

<h5> Define, fit and apply Pipeline on data </h5>

In [None]:
# define pipeline model and fit on data
preprocessing_pipeline = Pipeline(stages=[VA_num, VA_cat, VI]).fit(houses)
# transform data by applying pipeline model on data
preprocessed_data = preprocessing_pipeline.transform(houses)

In [None]:
# select features and labels
preprocessed_data = preprocessed_data.select(["num_features", "cat_features_indexed", "price"])
# rename price to label
preprocessed_data = preprocessed_data.withColumnRenamed("price", "label")

In [23]:
# check
preprocessed_data.show(5)

+--------------------+--------------------+--------+
|        num_features|cat_features_indexed|   label|
+--------------------+--------------------+--------+
|[1180.0,5650.0,11...|[0.0,0.0,0.0,2.0,...|221900.0|
|[2570.0,7242.0,21...|[0.0,0.0,2.0,2.0,...|538000.0|
|[770.0,10000.0,77...|[0.0,0.0,0.0,2.0,...|180000.0|
|[1960.0,5000.0,10...|[0.0,0.0,0.0,4.0,...|604000.0|
|[1680.0,8080.0,16...|[0.0,0.0,0.0,2.0,...|510000.0|
+--------------------+--------------------+--------+
only showing top 5 rows



In [None]:
The StandardScaler should be fitted on the training set only. The mean and standard deviation learned there are then reused to scale the test set, so that both sets are scaled identically and no information from the test set leaks into the preprocessing.
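
A minimal plain-Python sketch of this idea, with made-up numbers: the scaling parameters come from the training values only and are then applied unchanged to the test values. The helper names are hypothetical. The sample standard deviation is used and centering is off unless requested, which mirrors the default `withStd=True`, `withMean=False` behaviour.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Learn the scaling parameters from the training values only."""
    return {"mean": mean(values), "std": stdev(values)}


def apply_scaler(values, params, with_mean=False):
    """Scale values with previously learned parameters."""
    shift = params["mean"] if with_mean else 0.0
    return [(v - shift) / params["std"] for v in values]


train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [6.0, 0.0]

params = fit_scaler(train_values)                 # fitted on train only
train_scaled = apply_scaler(train_values, params)
test_scaled = apply_scaler(test_values, params)   # reuses the train parameters

print(params)
print(test_scaled)
```

Putting the scaler inside a pipeline that is fitted on `train` achieves the same thing: the test set is only ever passed through `transform`.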

In [None]:
# split data in train and test set
train, test = preprocessed_data.randomSplit([0.7, 0.3])

# define scaler
SC = StandardScaler(inputCol="num_features", outputCol="num_features_scaled")

# define assembler
VA = VectorAssembler(inputCols=["cat_features_indexed", "num_features_scaled"], outputCol="features")

In [None]:
# define linear regression model
LR = LinearRegression(featuresCol="features", labelCol="label")

# define decision tree model
DT = DecisionTreeRegressor(featuresCol="features", labelCol="label")

# define random forest model
RF = RandomForestRegressor(featuresCol="features", labelCol="label")

<h5> Define Pipeline for each model and fit on data </h5>

In [None]:
# define linear regression model pipeline and fit on training data
LR_Pipeline = Pipeline(stages=[SC, VA, LR]).fit(train)

# define decision tree model pipeline and fit on training data
DT_Pipeline = Pipeline(stages=[SC, VA, DT]).fit(train)


# define random forest model pipeline and fit on data
RF_Pipeline = Pipeline(stages=[SC, VA, RF]).fit(train)

<h5> Get predictions on test set by applying each model pipeline on test data </h5>

In [None]:
# get predictions of linear regression model on test data
lr_preds = LR_Pipeline.transform(test)

# get predictions of decision tree model on test data
dt_preds = DT_Pipeline.transform(test)

# get predictions of random forest model on test data
rf_preds = RF_Pipeline.transform(test)

#### Evaluation -lr

In [None]:
# define evaluator
lrEvaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction") 

# Get different metrics using your created evaluator object
lrsq = lrEvaluator.evaluate(lr_preds, {lrEvaluator.metricName: 'r2'})
lrmae = lrEvaluator.evaluate(lr_preds, {lrEvaluator.metricName: 'mae'})
lrrmse = lrEvaluator.evaluate(lr_preds, {lrEvaluator.metricName: 'rmse'})
lrmse = lrEvaluator.evaluate(lr_preds, {lrEvaluator.metricName: 'mse'})

#### Cross validation

<h5> 1.3.3. Cross Validation</h5>
    
    - Try to see if you can improve your model's performance even more by adding cross-validation into the mix.
    - Search the web to understand the concept of Cross Validation.
    - Cross validate the random forest model with three values for the `maxDepth`, three values for the `maxBins` and three for the `numTrees`. Use five-fold cross validation.
    - Cross validate the logistic regression with three values for the `regParam`, three values for the `maxIter` and three for the `elasticNetParam`. Use five-fold cross validation.

<h5>Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.</h5>
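
The mechanics can be sketched without Spark. The snippet below enumerates the 3 x 3 x 3 parameter grid from the exercise and builds five train/validation splits. The helper `k_fold_indices` is hypothetical and assigns rows to folds round-robin for reproducibility, whereas Spark's `CrossValidator` assigns them randomly.

```python
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (training, validation) row indices for each of the k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# every combination of maxDepth, maxBins and numTrees from the exercise
grid = list(product([2, 5, 10], [15, 20, 25], [5, 20, 50]))

# five folds over a toy dataset of 10 rows
splits = list(k_fold_indices(10, 5))

# one model fit per (parameter combination, fold) pair
n_fits = len(grid) * len(splits)
print(len(grid), len(splits), n_fits)
```

The count of 27 combinations times 5 folds explains why the cross-validated pipeline takes far longer to fit than a single model.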

In [None]:
# define random forest model
RF = RandomForestRegressor(labelCol="label", featuresCol="features")

# define evaluator
rf_evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction")

# define the parameter space
param_grid = (ParamGridBuilder().addGrid(RF.maxDepth, [2, 5, 10])
                                 .addGrid(RF.maxBins, [15, 20, 25])
                                 .addGrid(RF.numTrees, [5, 20, 50])
                                 .build())

In [None]:
# perform 5-fold cross validation
CV = CrossValidator(estimator=RF,
                    estimatorParamMaps=param_grid, 
                    evaluator=rf_evaluator,
                    numFolds=5)

In [None]:
# define pipeline model and fit on training set
CV_Pipeline = Pipeline(stages=[SC, VA, CV]).fit(train)

In [None]:
# get preds on test set
cv_preds = CV_Pipeline.transform(test)

In [None]:
# define evaluator
cv_evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")

In [None]:
# evaluate model
cv_rmse = cv_evaluator.evaluate(cv_preds)

### Classification 

In [None]:
alias: rename column

In [None]:
# define scaler
SS = StandardScaler(inputCol = 'numFeatures', outputCol = 'scaledNumFeatures', withStd = True, withMean = False)

# define vector assembler
VA = VectorAssembler(inputCols = ['scaledNumFeatures', 'catFeatures'], outputCol = 'features')

# define logistic regression model
LR = LogisticRegression(labelCol = 'label', featuresCol = 'features', maxIter = 10)

In [None]:
# define pipeline stages
stages = [SS, VA, LR]
# create pipeline and fit on training set
lrModelPipeline = Pipeline().setStages(stages).fit(train)
# apply pipeline on test set to get predictions
predictions = lrModelPipeline.transform(test)

In [None]:
# define evaluator
evaluator = BinaryClassificationEvaluator()
# get evaluation metric
lrAUC = evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})
# inspect model performance
print('AUC lr: %f' %(lrAUC))

**Random Forest**
- Build a Random Forest model using the same train and test set.

In [None]:
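# define random forest model
RF = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')
# create pipeline with the same scaler and assembler, and fit on training set
rfModelPipeline = Pipeline().setStages([SS, VA, RF]).fit(train)
# apply pipeline on test set to get predictions
rfPredictions = rfModelPipeline.transform(test)
# define evaluator
evaluator = BinaryClassificationEvaluator()
# evaluate model
rfAUC = evaluator.evaluate(rfPredictions, {evaluator.metricName: 'areaUnderROC'})
# inspect model performance
print('AUC lr: %f' %(lrAUC))
print('AUC rf: %f' %(rfAUC))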

**Vector Indexer**

As you can see, performance of the Random Forest is lower than the one of Logistic Regression. One reason is the fact that we did not take into account the main advantage of the Random Forest model. This Machine Learning model is able to process real categorical variables. Up until this moment, we fed only binary categoricals to the model by using OneHotEncoding.
- Start from the initial houses dataset (at the end of cmd 11).
- Transform the data as already done, but replace the OneHotEncoder with a VectorIndexer.
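
What the `VectorIndexer` does can be approximated in a few lines of plain Python: every feature with at most `maxCategories` distinct values is treated as categorical and its values are replaced by category indices, while the other features are left untouched. The helper names and data below are hypothetical, and assigning indices in ascending order of the original values is an assumption of this sketch.

```python
def fit_vector_index(rows, max_categories):
    """Decide per feature whether it is categorical and build its index map."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            # Assumption: indices follow the ascending order of the values.
            category_maps[j] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def apply_vector_index(rows, category_maps):
    """Replace categorical values by their index; keep continuous values."""
    return [
        [category_maps[j][v] if j in category_maps else v for j, v in enumerate(row)]
        for row in rows
    ]


# feature 0: number of bedrooms (few distinct values), feature 1: sqft_living
rows = [[3.0, 1180.0], [5.0, 2570.0], [3.0, 770.0], [4.0, 1960.0]]
maps = fit_vector_index(rows, max_categories=3)
indexed_rows = apply_vector_index(rows, maps)

print(maps)
print(indexed_rows)
```

Tree-based models can then split on the indexed feature as a real categorical variable instead of on a set of binary columns.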

In [None]:
vector indexer is not suitable for linear regression but it works well with decision tree, random forest

In [None]:
# define binarizer
houses = houses.withColumn('price', F.col('price').cast(DoubleType()))
BI = Binarizer(threshold = 500000, inputCol = 'price', outputCol = 'price_bin')

# define string indexer
SI = StringIndexer(inputCol = 'price_bin', outputCol = 'label')

# define vector assembler for numeric features
numColumns = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'age', 'bathrooms', 'floors']
VAnum = VectorAssembler(inputCols = numColumns,  outputCol = 'numFeatures')

# define vector assembler for categorical features
catColumns = ['bedrooms', 'waterfront', 'view', 'condition', 'grade', 'renovated']
VAcat = VectorAssembler(inputCols = catColumns, outputCol = 'catFeatures')

# define vector indexer
VI = VectorIndexer(inputCol = 'catFeatures', outputCol = 'indexedCatFeatures', maxCategories = 10)

In [None]:
# define pipeline stages
stages = [BI, SI, VAnum, VAcat, VI]
# create pipeline and fit on data
preprocessingPipeline = Pipeline().setStages(stages).fit(houses)
# apply pipeline on data
basetable = preprocessingPipeline.transform(houses)

In [None]:
# define scaler
SS = StandardScaler(inputCol = 'numFeatures', outputCol = 'scaledNumFeatures', withStd = True, withMean = False)

# define vector assembler
VA = VectorAssembler(inputCols = ['scaledNumFeatures', 'indexedCatFeatures'], outputCol = 'features')

# define random forest model
RF = RandomForestClassifier(labelCol="label", featuresCol="features")

In [None]:
# define pipeline stages
stages = [SS, VA, RF]
# create pipeline and fit on training data
rfModelPipeline = Pipeline().setStages(stages).fit(train)
# apply pipeline on test data to make predictions
predictions = rfModelPipeline.transform(test)

### NLP

In [None]:
# convert to lower case
reviews = reviews.withColumn("to_lower", F.lower(F.col("verified_reviews")))

# remove numbers
reviews = reviews.withColumn("no_num", F.regexp_replace(F.col("to_lower"), "[0-9]", ""))

# remove punctuation
reviews = reviews.withColumn("only_str", F.regexp_replace(F.col("no_num"), 
                                                          "[{0}]".format(re.escape(PUNCTUATION)), 
                                                          ""))


In [None]:
Pipeline Model 1: Tokenization --> Stop word removal --> BOW --> Logistic Regression
Pipeline Model 2: Tokenization --> Stop word removal --> WORD2VEC --> Random Forest
Pipeline Model 3: Tokenization --> Stop word removal --> TF-IDF --> Logistic Regression
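
To make the difference between the BOW and TF-IDF representations concrete, here is a small plain-Python sketch on three made-up, already tokenized reviews. It uses the smoothed formula idf = log((m + 1) / (df + 1)) with m the number of documents. The vocabulary is ordered alphabetically here for readability, which is an assumption of the sketch and not how `CountVectorizer` orders its vocabulary.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews (stop words already removed).
docs = [["love", "echo"], ["love", "love", "music"], ["bad", "sound"]]

# vocabulary: one vector position per distinct word
vocab = sorted({word for doc in docs for word in doc})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


tf = [bag_of_words(doc) for doc in docs]

# document frequency: in how many documents does each word occur?
n_docs = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# smoothed inverse document frequency
idf = [math.log((n_docs + 1) / (d + 1)) for d in df]

# TF-IDF: term counts reweighted so that common words count less
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

print(vocab)
print(tf[1])
print(tfidf[1])
```

The word "love" occurs in two of the three reviews, so it receives a lower weight per occurrence than "music", which occurs in only one.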

In [None]:
# define the tokenizer
TO = Tokenizer(inputCol="only_str", outputCol="words")

In [None]:
# define the stop word remover
SWR = StopWordsRemover(inputCol='words', outputCol='filtered')

In [None]:
# inspect the output of the stop word remover
temp_pipeline = Pipeline().setStages([TO, SWR]).fit(reviews)

In [None]:
# define bow model
BOW = CountVectorizer(inputCol = 'filtered', outputCol = 'features')


# define tf model
TF = CountVectorizer(inputCol = 'filtered', outputCol = 'featuresTF')
# define tf-idf model
IdF = IDF(inputCol = 'featuresTF', outputCol = 'features')


# define word2vec model
W2V = Word2Vec(inputCol = 'filtered', outputCol = 'features')

In [None]:
# define the logistic regression model
LR = LogisticRegression(labelCol = 'label', featuresCol = 'features', maxIter = 100)

# define the random forest model
RF = RandomForestClassifier(labelCol = 'label', featuresCol = 'features', numTrees = 500)

In [None]:
# define logistic regression pipeline models and fit on training data
lr_BOW_model = Pipeline().setStages([TO, SWR, BOW, LR]).fit(train)
lr_TFIDF_model = Pipeline().setStages([TO, SWR, TF, IdF, LR]).fit(train)
lr_W2V_model = Pipeline().setStages([TO, SWR, W2V, LR]).fit(train)

In [None]:
# define random forest pipeline models and fit on training data
rf_BOW_model = Pipeline().setStages([TO, SWR, BOW, RF]).fit(train)
rf_TFIDF_model = Pipeline().setStages([TO, SWR, TF, IdF, RF]).fit(train)
rf_W2V_model = Pipeline().setStages([TO, SWR, W2V, RF]).fit(train)

In [None]:
# get predictions of logistic regression pipeline models on test data
lr_BOW_predictions = lr_BOW_model.transform(test)
lr_TFIDF_predictions = lr_TFIDF_model.transform(test)
lr_W2V_predictions = lr_W2V_model.transform(test)

In [None]:
# get predictions of random forest pipeline models on test data
rf_BOW_predictions = rf_BOW_model.transform(test)
rf_TFIDF_predictions = rf_TFIDF_model.transform(test)
rf_W2V_predictions = rf_W2V_model.transform(test)

#### Model Selection

In [None]:
# define evaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", 
                                          rawPredictionCol="probability", 
                                          metricName="areaUnderROC")

In [None]:
# evaluate the logistic regression pipeline models in terms of AUC
lr_BOW_AUC = evaluator.evaluate(lr_BOW_predictions)
lr_TFIDF_AUC = evaluator.evaluate(lr_TFIDF_predictions)
lr_W2V_AUC = evaluator.evaluate(lr_W2V_predictions)

In [None]:
# accuracy
# define evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", 
                                              probabilityCol="probability", 
                                              metricName="accuracy")

In [None]:
# evaluate the logistic regression pipeline models in terms of accuracy
lr_BOW_ACC = evaluator.evaluate(lr_BOW_predictions)
lr_TFIDF_ACC = evaluator.evaluate(lr_TFIDF_predictions)
lr_W2V_ACC = evaluator.evaluate(lr_W2V_predictions)

Inspect the total number of 0's and 1's in the predictions of the random forest pipeline model and in the real labels.
-> Always do this check before concluding that your model works properly, even when the AUC and ACC are high.
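
A small plain-Python sketch shows why this check matters for accuracy in particular. The labels and predictions are made up: a model that always predicts the majority class still reaches 90% accuracy on data with 10% positives, and only the class counts reveal the problem.

```python
from collections import Counter

# Hypothetical imbalanced test set: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
predictions = [0] * 100

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
label_counts = Counter(labels)
prediction_counts = Counter(predictions)

print(accuracy)
print(label_counts)
print(prediction_counts)
```

On a Spark DataFrame of predictions, the same comparison can be made with `groupBy("prediction").count()` and `groupBy("label").count()`.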

#### Sentiment Analysis

In [None]:
# define the function to extract the sentiment
def get_sentiment(sentence):
    
    # initialize sentiment analyzer
    sid_obj = SentimentIntensityAnalyzer()

    # get sentiment dict
    sentiment_dict = sid_obj.polarity_scores(sentence)
    
    # get positive sentiment score
    pos_sentiment = sentiment_dict["pos"]
    
    # return positive sentiment score
    return(pos_sentiment)

In [None]:
# register functions as udf
get_sentiment_udf = udf(get_sentiment, DoubleType())

In [None]:
# extract positive sentiment score from reviews and store in new columns
reviews = reviews.withColumn("sentiment", get_sentiment_udf("verified_reviews"))

### Instagram tutorial 

In [None]:
# import text files into spark dataframe
text_df = spark.read.text(all_text_file_paths, wholetext=True) \
                    .withColumnRenamed("value", "text") \
                    .withColumn("file_path", F.input_file_name()) \
                    .withColumn("post_id", F.regexp_extract(F.col("file_path"), pattern="(raw_data/)(.*)(.txt)", idx=2)) \
                    .drop("file_path")

In [None]:
# define punctuation and stopwords
PUNCTUATION = [char for char in punctuation if char not in ["!", "@", "#"]]
STOPWORDS = stopwords.words("english")

In [None]:
# define function to remove punctuation
def remove_punct(text):
    # remove punctuation
    text = "".join([char for char in text if char not in PUNCTUATION])
    return(text)

In [None]:
# define function to remove stopwords
def remove_stops(text_tokenized):
    # remove stopwords
    text_tokenized = [word for word in text_tokenized if word not in STOPWORDS]
    return(text_tokenized)

In [None]:
# define function to count hashtags
def get_hashtags(tokenized_text):
    counter = 0
    for word in tokenized_text:
        if "#" in word:
            counter += 1
    return(counter)

In [None]:
# register functions as udf
remove_punct_udf = F.udf(remove_punct, StringType())
remove_stops_udf = F.udf(remove_stops, ArrayType(StringType()))
get_hashtags_udf = F.udf(get_hashtags, IntegerType())
get_tags_udf = F.udf(get_tags, IntegerType())
get_exclamation_marks_udf = F.udf(get_exclamation_marks, IntegerType())
get_sentiment_udf = F.udf(get_sentiment, DoubleType())

In [None]:
# extract features from text
text_df_f = text_df.withColumn("text_lower", F.lower("text")) \
                 .withColumn("text_cleaned", remove_punct_udf("text_lower")) \
                 .withColumn("text_tokenized", F.split("text_cleaned", " ")) \
                 .withColumn("num_words", F.size("text_tokenized")) \
                 .withColumn("num_hashtags", get_hashtags_udf("text_tokenized")) \
                 .withColumn("num_tags", get_tags_udf("text_tokenized")) \
                 .withColumn("num_exclamation_marks", get_exclamation_marks_udf("text_tokenized")) \
                 .withColumn("sentiment", get_sentiment_udf("text")) \
                 .filter("num_words > 0") \
                 .drop("text_tokenized") \
                 .drop("text_lower")

#### Likes model 

In [None]:
# define the binarizer
LABEL_BIN = Binarizer(inputCol="num_likes", threshold=29666, outputCol="num_likes_bin")
# define indexer
LABEL_IDX = StringIndexer(inputCol="num_likes_bin", outputCol="label")
# define the pipeline
pipeline = Pipeline(stages=[LABEL_BIN, LABEL_IDX]).fit(basetable)
# get preprocessed basetable
basetable_preprocessed = pipeline.transform(basetable)

In [None]:
# split data
train, val, test = basetable_preprocessed.randomSplit([0.6, 0.2, 0.2])

In [None]:
# define the class weights
weight_1 = (55 + 161) / (55 * 2)
weight_0 = (55 + 161) / (161 * 2)
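
The two formulas above follow the common "balanced" weighting scheme, weight = n_samples / (n_classes * class_count). The sketch below generalizes it in plain Python. The helper name is hypothetical, and reading 55 and 161 as the number of positive and negative training rows is an assumption based on the formulas.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Assumption: 55 rows with label 1 and 161 rows with label 0.
weights = balanced_class_weights({1: 55, 0: 161})

# total weight carried by each class after weighting
total_weight_1 = 55 * weights[1]
total_weight_0 = 161 * weights[0]

print(weights)
print(total_weight_1, total_weight_0)
```

After weighting, both classes carry the same total weight (half of the 216 rows each), so the minority class is no longer drowned out during training.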

In [None]:
# add class weights column
train = train.withColumn("weight", F.when(F.col("label") == 1, weight_1).otherwise(weight_0))
val = val.withColumn("weight", F.when(F.col("label") == 1, weight_1).otherwise(weight_0))

In [None]:
# define categorical variables
cat_var = ['ad', 'video', 'location_cat']
# define indexed cat var
cat_var_idx = [name + "_idx" for name in cat_var]
# define the indexer
CAT_IDX = StringIndexer(inputCols=cat_var, outputCols=cat_var_idx)
# define the assembler
CAT_VA = VectorAssembler(inputCols=cat_var_idx, outputCol="cat_features_idx")

In [None]:
# define numeric variables
num_var = ['num_words',
           'num_hashtags',
           'num_tags',
           'num_exclamation_marks',
           'sentiment',
           'avg_likes_comments',
           'number_ats_comments',
           'num_comments',
           'num_verified_comments',
           'num_followers',
           'num_followed']
# define the vector assembler
NUM_AS = VectorAssembler(inputCols=num_var, outputCol="num_features")
# define the scaler
NUM_SC = StandardScaler(inputCol="num_features", outputCol="num_features_scaled")

In [None]:
# define the tokenizer
TOK = Tokenizer(inputCol="text_cleaned", outputCol="text_tokenized")
# define stop word remover
STOP = StopWordsRemover(inputCol="text_tokenized", outputCol="text_no_stops")
# define word2vec
W2V = Word2Vec(inputCol="text_no_stops", outputCol="text_features")

In [None]:
# define final assembler
AS = VectorAssembler(inputCols=["num_features_scaled", "cat_features_idx", "text_features"], outputCol="features")

In [None]:
# define the models
RF_1 = RandomForestClassifier(featuresCol="features", labelCol="label")
RF_2 = RandomForestClassifier(featuresCol="features", labelCol="label", weightCol="weight")
RF_3 = RandomForestClassifier(featuresCol="features", labelCol="label", weightCol="weight", numTrees=500)

#### Img analysis

In [None]:
# convert image container to matrix
img_matrix = np.array(img_container)
# check
img_matrix.shape

In [None]:
# define pca model to reduce dimensionality
pca_model = PCA(n_components=10)
# fit
pca_model = pca_model.fit(img_matrix)
# get principal components
img_matrix_pca = pca_model.transform(img_matrix)

In [None]:
# define clustering model
kmeans_model = KMeans(n_clusters=10)
# fit model
kmeans_model = kmeans_model.fit(img_matrix_pca)
# get labels
img_labels = kmeans_model.predict(img_matrix_pca)

In [None]:
# get cluster images
clusters_dict = dict()
# loop through clusters
for i in range(len(img_labels)):
    # get label
    label = img_labels[i]
    # get image
    img = img_container[i]
    # add to clusters dict
    if label not in clusters_dict.keys():
        clusters_dict[label] = [img]
    else:
        clusters_dict[label].append(img)

In [None]:
# define function to plot image
def plot_images(clusters_dict, label):
    # plot
    # get 10 random imgs
    random_idx = np.random.choice(range(len(clusters_dict[label])), size=10, replace=False)
    random_imgs = [clusters_dict[label][idx] for idx in random_idx]
    plt.figure(figsize=(20, 6))
    print("IMAGES FOR CLUSTER %s" %label)
    for i in range(10): 
        plt.subplot(1, 10, i+1)
        plot_img_array(random_imgs[i])

### Visualization 

In [None]:
# define binarizer
# 1 if the value exceeds the threshold, otherwise 0
BI = Binarizer(threshold = 500000, inputCol = 'price', outputCol = 'price_bin')

# define string indexer
# makes string as a number, for example, belgium is 1 and france is 0
SI_lab = StringIndexer(inputCol = 'price_bin', outputCol = 'label')

# define string indexer for categorical features
catColumns = ['waterfront', 'view', 'condition', 'grade', 'renovated']
catColumnsIDX = [col + "_IDX" for col in catColumns]
SI_cat = StringIndexer(inputCols = catColumns, outputCols = catColumnsIDX)

# define vector assembler for categorical features
# combines a given list of columns into a single vector column
VA_cat = VectorAssembler(inputCols = catColumnsIDX, outputCol = 'catFeatures')

# define vector assembler for numeric features
numColumns = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'age', 'bedrooms', 'bathrooms', 'floors']
VA_num = VectorAssembler(inputCols = numColumns, outputCol = 'numFeatures')


In [None]:
# define scaler
SS = StandardScaler(inputCol = 'numFeatures', outputCol = 'scaledNumFeatures')

# define vector assembler for all features
VA_all = VectorAssembler(inputCols = ['scaledNumFeatures', 'catFeatures'], outputCol = 'features')

# define gradient-boosted trees model
GB = GBTClassifier(labelCol="label", featuresCol="features")

In [None]:
# define pipeline model
model_pipeline = Pipeline().setStages([SS, VA_all, GB]).fit(train)
# get predictions on test set
predictions = model_pipeline.transform(test)

In [None]:
# define evaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="probability", metricName="areaUnderROC")
# get auc
gbtAUC = evaluator.evaluate(predictions)
# print auc
print('AUC gbt: %f' %(gbtAUC))

In [None]:
# get array of labels
y_true_arr = np.squeeze(np.array(predictions.select("label").collect()))
# get array of predicted probabilities
y_pred_arr = np.squeeze(np.array(predictions.select("probability").collect()))[:, 1]
# get fpr and tpr
fpr, tpr, t = roc_curve(y_true_arr, y_pred_arr)