In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd

'''
 TIP: 1. Research and import desired preprocessing methods from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
      2. Research and import clustering algorithm(s) of your choice from https://scikit-learn.org/stable/modules/clustering.html
      3. Research and import cluster quality metric(s) of your choice from https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
'''

# Instructions

For each question, a rough outline has been provided to help you get started under "Part 1.x: Work". Feel free to either follow the outline or use your own method for solving the problem. In either case, however, please make sure to include your work in these sections and fill in your answer in the cell titled "Part 1.x: Answer".

**Embedding Images in the Notebook**

To upload an image in a markdown cell in Jupyter Notebook:
1. Go to the menu bar and select Edit -> Insert Image.

2. Select image from your disk and upload.

3. Press Ctrl+Enter or Shift+Enter.

This embeds the image in the notebook itself, so you do not need to upload it to the directory separately.

**Export Jupyter Notebooks**  
On your local computer, open the notebook you would like to export and navigate to 'File' in the top menu bar. Clicking 'File' reveals 'Download as' in the drop-down menu. Select the format you want to export the notebook as: either directly as a PDF or, if you download it as an HTML file, use a website like [html2pdf.com](https://html2pdf.com) to convert it to a PDF file for submission on Gradescope.

Colab does not seem to support exporting its notebooks to other formats, so if you choose to use Colab, you will need to download the notebook as an .ipynb file and then follow the steps above on your local machine.

# Question 1

## **Part 1.1: Work**



#### Read Data

In [None]:
PATH_TO_Q1_DATA = 'HW1_Q1_Data.csv' # TODO: Change if your path to data is different
df = pd.read_csv(PATH_TO_Q1_DATA) 

#### Standardize Data in Columns 1-52

In [None]:
'''
 TODO: Standardize columns 1-52 by subtracting off mean of each column and scaling to unit variance
'''
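
As a rough illustration of the standardization step, the sketch below applies scikit-learn's `StandardScaler` to a small synthetic frame. The frame `demo_df`, its column names, and the assumption that the numeric block is a contiguous slice of leading columns are all hypothetical; check the layout of the real file before choosing the slice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the homework data: 6 numeric columns plus 1 categorical column.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(100, 6)),
                       columns=[f'var_{i + 1}' for i in range(6)])
demo_df['category'] = rng.choice(['a', 'b'], size=100)

# Assumption: the numeric variables are the leading columns of the frame.
numeric_cols = demo_df.columns[:6]

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(demo_df[numeric_cols]),
                            columns=numeric_cols, index=demo_df.index)
```

Wrapping the result back into a DataFrame keeps the column names and row index, which can make later steps such as attaching cluster labels easier.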

#### Cluster Standardized Data in Columns 1-52

In [None]:
# 1 < k < 11
possible_cluster_nums = [2,3,4,5,6,7,8,9,10]

for k in possible_cluster_nums:
  '''
    TODO: 1. Fit data to k clusters using imported clustering algorithm
          2. Compute quality of results for k clusters using imported
             cluster quality metric and store in a list
  '''
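
One possible shape for this loop is sketched below, assuming k-means as the clustering algorithm and the silhouette score as the quality metric. Both are only one choice among those listed in the scikit-learn documentation, and the synthetic array `X` stands in for the standardized data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for the standardized columns).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

possible_cluster_nums = [2, 3, 4, 5, 6, 7, 8, 9, 10]
silhouette_results = []

for k in possible_cluster_nums:
  # Fixing random_state makes the run repeatable; n_init restarts reduce bad initializations.
  model = KMeans(n_clusters=k, n_init=10, random_state=0)
  labels = model.fit_predict(X)
  # Silhouette ranges from -1 to 1; higher values indicate tighter, better-separated clusters.
  silhouette_results.append(silhouette_score(X, labels))

# Candidate k with the highest silhouette score on this synthetic data.
best_k = possible_cluster_nums[int(np.argmax(silhouette_results))]
```

A list such as `silhouette_results` holds one value per entry of `possible_cluster_nums`, in order, which is the layout a line chart of metric versus k expects.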

#### Visualize Cluster Quality Metrics

In [None]:
# Creates line chart to visualize values of cluster quality metric for each possible number of clusters
def plotMetricByK(metric_name, metric_results, PATH_TO_SAVE=None):
  '''
  metric_name: Name of cluster quality metric for title and axis label
  metric_results: List containing value of metric, in order, for each
                  possible number of clusters
  PATH_TO_SAVE: Path of file to save plot. If path is not provided, image is not saved
  '''

  plt.clf()

  plt.title(f'{metric_name} by Number of Clusters')
  plt.xlabel('Number of Clusters')
  plt.ylabel(metric_name)
  plt.plot(possible_cluster_nums, metric_results)
  
  if PATH_TO_SAVE:
    plt.savefig(PATH_TO_SAVE)

In [None]:
'''
 TODO: Plot your choice of cluster quality metric by cluster number to help determine k.

 TIP: If using the above function, place each function call in a separate 
      cell to visualize multiple cluster quality metrics
'''

## **Part 1.1: Answer**

How many clusters are there in the data? **YOUR ANSWER HERE** 

Explanation: **Please make sure any relevant plots are either included in the above cells or embedded in this cell and replace this line with a brief explanation of how they justify your choice**



## **Part 1.2: Work**

#### Cluster Data

In [None]:
'''
  TODO: Cluster data with the number of clusters you determined in part 1.1 and store resulting labels
'''

#### Univariate Analysis

In [None]:
'''
  TODO: Find 4 variables that have statistically significant differences between values in the clusters (i.e., p < 0.05 using pairwise t-tests)
'''
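
A minimal sketch of pairwise t-tests for a single variable follows, using `scipy.stats.ttest_ind`. The helper `pairwise_p_values`, the frame `demo_df`, and the column name `cluster` are hypothetical, and Welch's unequal-variance variant is an assumption rather than a requirement of the assignment.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic data: one variable whose mean depends on the cluster, one that does not.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 40)
demo_df = pd.DataFrame({
    'shifted': rng.normal(size=120) + 5.0 * labels,
    'noise': rng.normal(size=120),
    'cluster': labels,
})

def pairwise_p_values(frame, variable, cluster_col='cluster'):
  '''Return {(cluster_a, cluster_b): p-value} from Welch's t-test for every pair of clusters.'''
  groups = {name: grp[variable].to_numpy() for name, grp in frame.groupby(cluster_col)}
  return {(a, b): ttest_ind(groups[a], groups[b], equal_var=False).pvalue
          for a, b in combinations(sorted(groups), 2)}

shifted_p = pairwise_p_values(demo_df, 'shifted')
noise_p = pairwise_p_values(demo_df, 'noise')
```

Because many variables and cluster pairs are tested, some p-values below 0.05 are expected by chance alone; a multiple-comparison correction such as Bonferroni may be worth considering when interpreting the results.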

## **Part 1.2: Answer**

VARIABLE 1, VARIABLE 2, VARIABLE 3, VARIABLE 4

**Replace the above line with 4 variables that you found from the univariate analysis and output the corresponding box plots in the cells below or as images in this cell**

In [None]:
'''
  TODO: Create and display boxplots for each of the 4 variables like those in Fig. 1 from the paper by Wu et al.

  TIP: 1. If you add the cluster labels as an additional column to the dataframe, then you can follow the example at https://www.pythonprogramming.in/boxplot-group-by-column-data.html
       2. If you want to use subplots to put all 4 boxplots in the same figure, you can plot the dataframes on a specific subplot using the ax keyword. For example,
          
          fig, axs = plt.subplots(2, 2)

          df.boxplot(column=['Variable'], by=['Cluster'], ax=axs[0,0])
          df.boxplot(column=['Variable'], by=['Cluster'], ax=axs[0,1])
''' 

## **Part 1.3: Work**


In [None]:
# Formats Pandas series to string of form 'index_1: value_1, ..., index_n: value_n' w/ indexes alphabetically sorted
def formatValueCounts(value_counts):
  '''
    value_counts: Pandas series
  '''
  # Build each 'index: value' pair directly from the series so that index
  # names and index values containing spaces do not corrupt the output
  formatted_counts = [f'{index}: {count}' for index, count in value_counts.items()]
  formatted_counts.sort()
  return ', '.join(formatted_counts)

# Creates table summarizing data by cluster and categorical feature
def plotSummaryTable(cellText, PATH_TO_SAVE=None):
  '''
  cellText: num_clusters x 9 2D List where cellText[i][j] contains a string summarizing
            the statistics for cluster i and column (53 + j) in the data
  PATH_TO_SAVE: Path of file to save plot. If path is not provided, image is not saved
  '''

  k = len(cellText)
  colLabels = [f'Cluster {i + 1}' for i in range(k)]

  rowLabels = list(df.columns)[52:61]

  cellText = np.array(cellText).T

  plt.figure(figsize=(10,10))
  table = plt.table(cellText, 
              colLabels=colLabels,
              colColours=['#D3D3D3'] * len(colLabels),
              rowLabels=rowLabels,
              rowColours=['#D3D3D3'] * len(rowLabels),
              cellLoc='center',
              loc='upper center')
  table.scale(2,5)
  table.auto_set_font_size(False)
  table.set_fontsize(12)

  plt.axis('off')
  plt.grid(False)

  if PATH_TO_SAVE:
    plt.savefig(PATH_TO_SAVE)

In [None]:
'''
  TODO: Create a table where the rows correspond to the variables in columns 53-61, and the columns correspond to the k clusters you identified.
        For each cell in the table, put summary statistics for that (variable, cluster) pair

  TIP: 1. If you create a 2D list, cellText, where cellText[i][j] contains a string summarizing the statistics for cluster i and column (53 + j) in
          the data, you can pass this into plotSummaryTable (provided above) to automatically create the table with matplotlib
       2. If you use value_counts() from Pandas, you can pass the resulting series to formatValueCounts (provided above) to convert it to a formatted string
'''
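
The nested list described in the tip could be assembled along these lines. This is only a sketch on synthetic data: `demo_df`, the column names `sex`, `site`, and `cluster`, and the list `categorical_cols` are hypothetical stand-ins for the real frame and its categorical columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two categorical variables and a column of cluster labels.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'sex': rng.choice(['F', 'M'], size=60),
    'site': rng.choice(['A', 'B', 'C'], size=60),
    'cluster': np.repeat([0, 1], 30),
})
categorical_cols = ['sex', 'site']

cell_text = []
for cluster_id in sorted(demo_df['cluster'].unique()):
  members = demo_df[demo_df['cluster'] == cluster_id]
  row = []
  for col in categorical_cols:
    # Count each category within the cluster, ordered by category name.
    counts = members[col].value_counts().sort_index()
    row.append(', '.join(f'{value}: {count}' for value, count in counts.items()))
  cell_text.append(row)
```

Here `cell_text[i][j]` summarizes cluster `i` and categorical column `j`, so the outer list has one row per cluster and the inner lists have one entry per variable.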

## **Part 1.3: Answer**

**Plot the table in one of the above cells or include it as an image in this cell**

Are any of the clusters significantly enriched for some particular value? **YOUR ANSWER HERE**
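
One way to put a number on "enriched" is a chi-squared test of independence between cluster membership and a categorical variable. The sketch below uses `pandas.crosstab` and `scipy.stats.chi2_contingency` on made-up data; the choice of test and the column names `cluster` and `site` are assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: cluster 0 is mostly site A, cluster 1 is mostly site B.
demo_df = pd.DataFrame({
    'cluster': np.repeat([0, 1], 50),
    'site': ['A'] * 45 + ['B'] * 5 + ['A'] * 10 + ['B'] * 40,
})

# Rows are clusters, columns are category values, cells are counts.
contingency = pd.crosstab(demo_df['cluster'], demo_df['site'])

# A small p-value suggests that the category distribution differs across clusters.
chi2, p_value, dof, expected = chi2_contingency(contingency)
```

Comparing `contingency` with `expected` shows which (cluster, value) cells hold more observations than independence would predict.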



## **Part 1.4: Work**

In [None]:
'''
TODO: Cluster the numeric variables (Columns 1-52)
'''
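
Clustering variables rather than observations can be done by transposing the standardized matrix so that each row is a variable. The sketch below assumes agglomerative clustering with scikit-learn's default Ward linkage and a known number of groups. Both are illustrative choices, and the synthetic matrix `X` is a stand-in for the real data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic data: variables 0-2 track one latent signal, variables 3-5 track another.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 200))
X = np.column_stack([z1 + 0.1 * rng.normal(size=200) for _ in range(3)] +
                    [z2 + 0.1 * rng.normal(size=200) for _ in range(3)])
X = StandardScaler().fit_transform(X)

# Transposing makes each variable a point described by its values across observations,
# so variables with similar profiles end up in the same cluster.
variable_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.T)
```

On real data the number of variable clusters is not known in advance, so the same metric-by-k comparison used for the observations can be repeated on the transposed matrix.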

## **Part 1.4: Answer**

How many clusters are there in the numeric variables? **YOUR ANSWER HERE** 

Explanation: **Please make sure any relevant plots are either included in the above cells or embedded in this cell and replace this line with a brief explanation of how they justify your choice**

## **Part 1.5: Work**

In [None]:
'''
TODO: 1. Choose a representative variable from each cluster you determined in Part 1.4
         and create a low-dimensional version of the data using those variables
      2. Re-cluster the data with the reduced representation, using the same
         choices you made for Part 1.1.
'''
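
The assignment leaves the definition of "representative" open. One heuristic, sketched below on synthetic data, picks from each variable cluster the member most correlated with that cluster's mean profile. The mapping `variable_clusters`, the column names, and the heuristic itself are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'a1' and 'a2' follow one latent signal, 'b1' and 'b2' follow another.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 200))
data = pd.DataFrame({
    'a1': z1 + 0.1 * rng.normal(size=200),
    'a2': z1 + 0.5 * rng.normal(size=200),
    'b1': z2 + 0.1 * rng.normal(size=200),
    'b2': z2 + 0.5 * rng.normal(size=200),
})

# Hypothetical result of a variable-clustering step: variable name -> cluster id.
variable_clusters = {'a1': 0, 'a2': 0, 'b1': 1, 'b2': 1}

representatives = []
for cluster_id in sorted(set(variable_clusters.values())):
  members = [name for name, c in variable_clusters.items() if c == cluster_id]
  # Mean profile of the cluster's variables across observations.
  centroid = data[members].mean(axis=1)
  # Keep the member whose values correlate most strongly with that mean profile.
  representatives.append(max(members, key=lambda name: data[name].corr(centroid)))

# Low-dimensional version of the data: one column per variable cluster.
reduced = data[representatives]
```

The frame `reduced` can then be passed through the same clustering and metric comparison that was applied to the full set of numeric columns.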

## **Part 1.5: Answer**

Representative Variables: **Replace this with the representative variables you chose using your work from Part 1.4** 

How many clusters are there in the data when using the reduced representation? **YOUR ANSWER HERE** 

Explanation: **Please make sure any relevant plots are either included in the above cells or linked in this cell and replace this line with a brief explanation of how they justify your choice**

## **Part 1.6: Work**

In [None]:
'''
  TODO: Create a table where the rows correspond to the variables in columns 53-61, and the columns correspond to the k clusters you identified
        in Part 1.5. For each cell in the table, put summary statistics for that (variable, cluster) pair
'''

## **Part 1.6: Answer**

**Plot the table in one of the above cells or include it as an image in this cell**

Are any of the clusters significantly enriched for some particular value? **YOUR ANSWER HERE**