### Lab: Tuning and Topic Modeling

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---  

**INSTRUCTIONS**  
In this assignment, you will do three things:
1) Tune a logistic regression model  
2) Label-balance a dataset  
3) Run the Topic Modeling notebook, making small tweaks and capturing results  

**TOTAL POINTS: 10**

In [1]:
import os

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext

/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/19 02:00:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/19 02:00:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### PARAMETERS

In [3]:
# update to match your path
directory_path = '/sfs/qumulo/qhome/apt4c/ds5110/07_tuning_and_nlp_task/'
full_path_to_file = os.path.join(directory_path, 'breast_cancer_wisconsin.csv')
path_to_data = full_path_to_file  # os.path.join with a single argument returns it unchanged

In [8]:
# class = 2 for benign (negative class), 4 for malignant (positive class)
target = 'class'
positive_label = 4
negative_label = 2

SEED = 314

### READ IN DATA

In [4]:
brca = spark.read.csv(path_to_data, header=True, inferSchema=True)

                                                                                

In [5]:
brca.printSchema()

root
 |-- id: integer (nullable = true)
 |-- clump_thickness: integer (nullable = true)
 |-- uniformity_cell_size: integer (nullable = true)
 |-- uniformity_cell_shape: integer (nullable = true)
 |-- marginal_adhesion: integer (nullable = true)
 |-- single_epithelial_cell_size: integer (nullable = true)
 |-- bare_nuclei: string (nullable = true)
 |-- bland_chromatin: integer (nullable = true)
 |-- normal_nucleoli: integer (nullable = true)
 |-- mitoses: integer (nullable = true)
 |-- class: integer (nullable = true)



In [6]:
brca.count()

699

In [9]:
# compute distribution of target variable
brca.groupBy(target).count().show()

+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  458|
+-----+-----+



### Task 1: Cross-Validate a Logistic Regression Model
i) (**4 PTS**) This task has the following requirements:
- Import the necessary modules.
- Use these features as predictors: `clump_thickness`, `uniformity_cell_size`, `uniformity_cell_shape`, `marginal_adhesion`.
- Use `class` as the response variable. Recode it as needed. Hint: save the recoded values as a new variable.
- Use 3 folds in the cross-validator object.
- Use `BinaryClassificationEvaluator`.
- Use a logistic regression model with `maxIter`=10.
- Use a tuning grid with `regParam` values of 0.1 and 0.01.
- Finally, print the average metric for each `regParam` value. The `avgMetrics` attribute of the fitted cross-validator model holds these.

### Task 2:  Balancing a DataFrame with Downsampling  
i) (**2 PTS**) Write a function to implement downsampling.  Enter code into the cell containing the `downsample` function.  

INPUTS  
* df               - Spark dataframe  
* target           - string, target variable  
* positive_label   - integer, value of positive label  
* negative_label   - integer, value of negative label  

OUTPUT  
balanced spark dataframe  

Downsampling = sample from larger class to match smaller class  

**Example:**  

INITIAL STATE  
Smaller class has 100 records  
Larger class has 400 records

ACTION  
Sample 100 records from larger class, without replacement  
Retain all records from smaller class

END STATE    
This produces a balanced dataset containing 100 records from each class

ii) **(1 PT)** Print the target distribution of the balanced dataset to show that the label counts nearly match.

#### IMPORTANT NOTE:
Sampling won't produce the exact fraction you request. To sample efficiently, Spark uses Bernoulli sampling: 
each row is independently assigned a probability of being included. If you request a 10% sample, each row has a 10% chance of being included, but this does not guarantee an exact 10% sample   
(it should be close, however).
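
The effect described in this note can be simulated without Spark. The block below is a minimal pure-Python sketch, not Spark's implementation: the helper `bernoulli_sample` is a hypothetical name introduced for illustration, and the row ids are stand-ins rather than real records. It uses the class counts from the target distribution shown earlier (458 and 241) to compute the fraction that would shrink the larger class to the size of the smaller one, then keeps each row independently with that probability.

```python
import random

SEED = 314  # same seed value as in the PARAMETERS section


def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    return [row for row in rows if rng.random() < fraction]


# Class counts taken from the target distribution printed earlier in this lab.
n_larger, n_smaller = 458, 241

# Fraction of the larger class needed to roughly match the smaller class.
fraction = n_smaller / n_larger

rng = random.Random(SEED)
larger_rows = list(range(n_larger))  # stand-in row ids, not real data
sampled = bernoulli_sample(larger_rows, fraction, rng)

print(f"requested fraction: {fraction:.3f}")
print(f"target size: {n_smaller}, actual size: {len(sampled)}")
```

Because every row is an independent coin flip, the actual size usually lands near the target without matching it exactly, which is why the label counts in a downsampled DataFrame only nearly match.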

### Task 3:  Topic Modeling

In this exercise, you will run the `topic_modeling.ipynb` notebook and answer the questions below.

i) **(1 PT)** The code processes the first headline in the dataset and extracts tokens. Provide a list of these tokens.

+-------------+---------------------------------------------------+  
|publish_date | headline_text                                     |  
+-------------+---------------------------------------------------+  
|20030219     | aba decides against community broadcasting licence|  
+-------------+---------------------------------------------------+  

ii) **(1 PT)** The code created a count vectorizer and extracted features. 

`cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=500, minDF=3.0)`

The first document had six tokens and the feature vector looked like this:

(500, [118, 498], [1.0, 1.0])   

Explain why there are only two non-zero elements in this feature vector.
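
As an aid to reading the notation, the block below is a small pure-Python sketch that assumes the vector is displayed in the usual sparse form (size, indices, values). The helper `to_dense` is a hypothetical name, not part of the notebook. It expands the triple shown above into a dense list so that the length and the non-zero positions can be inspected directly.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list (hypothetical helper)."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The feature vector reported for the first document.
size, indices, values = 500, [118, 498], [1.0, 1.0]
dense = to_dense(size, indices, values)

print(len(dense))                                   # 500
print([i for i, v in enumerate(dense) if v != 0])   # [118, 498]
print(sum(dense))                                   # 2.0
```

The length corresponds to the `vocabSize` passed to `CountVectorizer`, each index identifies one vocabulary term, and each value is the count of that term in the document.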

iii) **(1 PT)** Change the number of topics to 2, rerun LDA, and visualize the topics by showing the topic words.