<a href="https://colab.research.google.com/github/yh0010/NYU_Summer_Tandon_Scholar_Intro2ML/blob/main/9_hw_unsupervised_state_of_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment: the state of machine learning and data science

Summer 2021

**Attribution**: this notebook is modeled after similar work by [Paige Bailey](https://twitter.com/DynamicWebPaige/status/1406250082194841601).

* **Name**: 
* **Net ID**: 

Now that we're wrapping up our survey of machine learning, you may be wondering what to do next. What are machine learning engineers and data scientists currently most excited about? What software frameworks and tools do they want to try out? Where are they going to learn new things?

Of course, if you ask different people these questions, you'll get many different answers. Or, if you ask 20,000 people, you'll get 20,000 different answers…

In this notebook, we'll work with the 2020 Kaggle Machine Learning & Data Science Survey.  Kaggle is an online community for machine learning and data science enthusiasts to find and share data sets and models. In their annual survey, they ask their users to answer questions about how they use machine learning and what they are looking forward to doing next.

The survey results can potentially give us some insight into what's next in machine learning.

Of course, you could just look at the most common answers to each question and stop there! But that won't give us the full picture. We expect that there may be "cohorts" among Kaggle users who have different interests or different backgrounds: for example, there might be some respondents who use machine learning mainly for business analytics, some who use it as a hobby, some who are students, etc. The most popular tools and techniques are likely to differ from one "cohort" to another.

Depending on which "cohort" you identify with most closely, the overall most common answers may not be very useful to you - you may be more interested in what other members of "your cohort" are doing and anticipating.

In this notebook, we will use unsupervised learning methods to try and find that underlying "cohort" structure in the data, and use it to gain insight into the state of machine learning and data science.

## Load and install libraries

We'll start by loading some familiar libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, fixed
import ipywidgets as widgets
from mpl_toolkits import mplot3d
from matplotlib import cm, colors
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

We'll also install a library that's not pre-installed in Colab. The `umap-learn` library implements UMAP, a dimensionality reduction method that we'll use later in the notebook:

In [None]:
!pip install umap-learn

In [None]:
from umap import UMAP

## Read in and process data


First, download the data and the survey documentation:

* [Kaggle 2020 survey data](https://drive.google.com/file/d/1fGNDBlpziYMAVHSXQJpcLd_AhrsFkVf3/view?usp=sharing)
* [Kaggle 2020 survey questions and answer options](https://drive.google.com/file/d/1yVsd9r1E6s6qh6n5UYlLxs8mKSl5VMzC/view?usp=sharing)
* [Kaggle 2020 methodology](https://drive.google.com/file/d/1Babng7-Ivfnf34jy5k6jdR8LAFxZuTgo/view?usp=sharing)

Review the survey questions and the answer options for each question.

Upload the survey data (CSV file) to your Colab workspace:

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

The CSV file has two header rows - one with a question number, and one with the question text. We'll read the question text into one data frame, and the responses into another data frame.


In [None]:
questions = pd.read_csv('kaggle_survey_2020_responses.csv', header=[0], nrows=1)
questions

In [None]:
responses = pd.read_csv('kaggle_survey_2020_responses.csv', header=[0], skiprows=[1])
responses

We're going to focus specifically on answers to machine learning-related questions, and exclude demographic information. Also, to make it easier, we'll just use the columns that are already essentially one-hot encoded.

So, we will drop the following columns from the data:

In [None]:
drop_cols = ['Time from Start to Finish (seconds)', 
             'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q8', 'Q11', 
             'Q13', 'Q15', 'Q20', 'Q21', 'Q22', 'Q24', 'Q25', 
             'Q30', 'Q32', 'Q38']
responses_sub = responses.drop(columns = drop_cols)
questions_sub = questions.drop(columns = drop_cols)

In [None]:
responses_sub.describe()

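Now, each remaining column corresponds to a single answer option, so it holds only one possible value (the text of that option) or NaN if the respondent did not select it.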

We can encode those values as 1s:

In [None]:
responses_oh = responses_sub.notnull().astype('int')
responses_oh

How do we interpret this data? 

To take an example: if the response value in row 0 is 1 for `Q7_Part_1`, this means that respondent 0 selected the first option for question 7. Looking at the survey questions and answers, we can see that this means they selected "Python" as a programming language they use on a regular basis.
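As a rough illustration of this interpretation, here is a minimal sketch on a made-up miniature of the survey. The frames `toy_responses` and `toy_questions` and the question text in them are hypothetical stand-ins for `responses_sub` and `questions_sub`, not the real survey contents.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey: each column holds one answer string or NaN.
toy_responses = pd.DataFrame({
    "Q7_Part_1": ["Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "R", "R"],
})

# Hypothetical question-text row, shaped like the `questions` data frame.
toy_questions = pd.DataFrame({
    "Q7_Part_1": ["Languages used on a regular basis - Python"],
    "Q7_Part_2": ["Languages used on a regular basis - R"],
})

# Same encoding as above: 1 if the option was selected, 0 otherwise.
toy_oh = toy_responses.notnull().astype("int")

# Look up what a column means and how many respondents selected it.
col = "Q7_Part_1"
print(toy_questions.loc[0, col])
print(toy_oh[col].sum())
```

The same lookup pattern should carry over to the real data frames, assuming the column names match between the question row and the encoded responses.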

## Exploratory data analysis


### To do 1: explore the data and look for high-level insight

Later in this notebook, we'll use dimensionality reduction and clustering to try and gain some deeper insight into this data. First, though, see what you can find out from the high-dimensional data. 

Use exploratory data analysis to review the data and describe your high-level insights. According to the data, what are machine learning and data science enthusiasts using right now? What are they hoping to gain more experience with soon?

Show your exploratory data analysis (code + output and visualizations) and also summarize your findings in a text cell.

(You can use either `responses_oh` or `responses_sub` for this step.)
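One possible starting point, sketched here on a small hypothetical one-hot frame (`toy_oh` is made up, not the survey data): the mean of a 0/1 column is the fraction of respondents who selected that option, so sorting the column means gives a quick ranking of options.

```python
import pandas as pd

# Hypothetical one-hot frame standing in for `responses_oh`.
toy_oh = pd.DataFrame({
    "Q7_Part_1": [1, 1, 1, 0],
    "Q7_Part_2": [0, 1, 0, 0],
    "Q9_Part_1": [1, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of respondents who selected it.
selection_rate = toy_oh.mean().sort_values(ascending=False)
print(selection_rate)

# Count how many option columns belong to each question
# (the question number is the text before the first underscore).
question_id = toy_oh.columns.str.split("_").str[0]
options_per_question = pd.Series(question_id).value_counts().sort_index()
print(options_per_question)
```

This is only a sketch of one summary statistic; a fuller analysis would likely look at individual questions and plot the results.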

## Dimensionality reduction


Our ultimate goal is to gain deeper insight by clustering the respondents into "cohorts", and then looking at the state of machine learning and data science as described by each cohort separately.

Because the data is high-dimensional (hundreds of columns), K-means clustering will not work very well on it directly. K-means suffers from the curse of dimensionality: high-dimensional data is often very sparse in the overall feature space, so the "closest" cluster mean to a particular sample may not really be much closer than the other cluster means.

Also, since K-means clustering involves distance computations, it is expensive to apply to high-dimensional data.

Finally, we want to be able to visually explore the data and the clusters, and it is very difficult to do this in hundreds of dimensions!

To address this, we'll reduce the dimension of the data to 3D. This will help with clustering, and will also make it easier to visualize the data.
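The point about "closest" not meaning much in high dimensions can be illustrated with a small experiment. This is a sketch on uniformly random synthetic points, not on the survey data, and the helper `relative_contrast` is a name made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from one random point to the rest."""
    # Synthetic points drawn uniformly from the unit hypercube (an assumption of this sketch).
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(3)
contrast_high = relative_contrast(300)

# In low dimension the nearest point is far closer than the farthest one;
# in high dimension all distances tend to look alike.
print("3 dimensions:  ", round(contrast_low, 2))
print("300 dimensions:", round(contrast_high, 2))
```

When the contrast is small, assigning a sample to its nearest cluster mean becomes a much less meaningful decision, which is part of the motivation for reducing the dimension first.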

### PCA

The "classic" dimensionality reduction method is PCA. Let's try to apply PCA to this data.

#### To do 2: Apply PCA

In the following cell, use the `sklearn` implementation of PCA. Create a PCA instance in a variable called `pca_reducer`, with `n_components = 3`. Then, fit it using the `responses_oh` data. Finally, use the `transform` method to project the `responses_oh` data into the 3D feature space learned by PCA. Save the result in a variable called `pca_responses`.
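For reference, the general fit-then-transform pattern in `sklearn` looks roughly like the sketch below. It uses a synthetic 0/1 matrix and hypothetical variable names (`toy_oh`, `toy_reducer`, `toy_reduced`), so it is an illustration of the API rather than the solution cell itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 0/1 matrix standing in for a one-hot response table (200 rows, 50 options).
rng = np.random.default_rng(0)
toy_oh = (rng.random((200, 50)) < 0.3).astype(int)

# Create the reducer, learn the principal components, then project the data.
toy_reducer = PCA(n_components=3)
toy_reducer.fit(toy_oh)
toy_reduced = toy_reducer.transform(toy_oh)

print(toy_reduced.shape)  # (200, 3)

# Fraction of the total variance captured by each of the three components.
print(toy_reducer.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute can be a useful sanity check: if the three components together capture only a small share of the variance, the 3D view is discarding a lot of structure.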

In [None]:
# TODO 2

# pca_reducer = ...
# pca_responses = ...

Verify that the `pca_responses` dataset has reduced dimensionality:

In [None]:
pca_responses.shape

Let's visualize this result in 3D to see if it will make clustering easier:

In [None]:
def plot_3D(elev=20, azim=20, pca_responses=pca_responses):

    fig = plt.figure(figsize=(10,10))
    ax = plt.axes(projection='3d')

    ax.scatter3D(pca_responses[:,0], pca_responses[:,1], pca_responses[:,2], s=0.2);

    ax.view_init(elev=elev, azim=azim)


interact(plot_3D, elev=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         azim=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         pca_responses=fixed(pca_responses));

Use the elevation and azimuth sliders to view the data from different perspectives.

### UMAP

UMAP, a more recent approach, often produces better results than classic methods when reducing dimensionality for visualization or clustering.

Here are some useful resources for learning about UMAP:

* [Understanding UMAP](https://pair-code.github.io/understanding-umap/)
* [How UMAP works](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)



Let's try it! We can use `UMAP` in exactly the same way that we used `PCA` - specify the number of components as 3, fit the model using the `responses_oh` data, and then use the fitted model to transform the data.

In [None]:
umap_reducer = UMAP(n_components=3).fit(responses_oh)

In [None]:
umap_responses = umap_reducer.transform(responses_oh)

Verify that the `umap_responses` dataset has reduced dimensionality:

In [None]:
umap_responses.shape

And let's plot this version of the data, too:

In [None]:
def plot_3D(elev=20, azim=20, umap_responses=umap_responses):

    fig = plt.figure(figsize=(10,10))
    ax = plt.axes(projection='3d')

    ax.scatter3D(umap_responses[:,0], umap_responses[:,1], umap_responses[:,2], s=0.2);

    ax.view_init(elev=elev, azim=azim)


interact(plot_3D, elev=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         azim=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         umap_responses=fixed(umap_responses));

Use the elevation and azimuth sliders to view the data from different perspectives.

Which transformation of the data seems more useful for clustering?

## Clustering

Next, let's use a clustering algorithm to try and define distinct "cohorts" among the respondents.


#### To do 3: apply a clustering algorithm

Use a clustering algorithm from `sklearn` to find cohorts in the data. The following design choices are up to you:

* You can apply the clustering to `pca_responses` or to `umap_responses` - whichever you think is most useful for clustering.
* You can use `KMeans` or [any other clustering method](https://scikit-learn.org/stable/modules/clustering.html) implemented in `sklearn`. 
* You can decide how to initialize the cluster centers. Read the function documentation to learn about the initialization options available in the method you have chosen.
* You can choose how many clusters to find, but you must have at least 3 clusters. Save the number of clusters in a variable called `n_clusters`.

Save the cluster labels learned by your model in a variable called `c_responses`, and save the list of cluster centers in `c_centers`.
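As a rough guide to the `sklearn` clustering API, here is a sketch that runs `KMeans` on three synthetic blobs in 3D. The data and the names `toy_points`, `toy_k`, `toy_labels`, and `toy_centers` are made up for this example, and the parameter values are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs in 3D, 100 points each (not the survey data).
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]])
toy_points = np.vstack(
    [center + rng.normal(scale=0.5, size=(100, 3)) for center in blob_centers]
)

toy_k = 3
# `init` controls how the initial centers are chosen; `n_init` repeats the
# algorithm from several initializations and keeps the best run.
toy_model = KMeans(n_clusters=toy_k, init="k-means++", n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_points)   # one cluster label per sample
toy_centers = toy_model.cluster_centers_         # array of shape (toy_k, 3)

print(toy_labels.shape, toy_centers.shape)
print(np.bincount(toy_labels))                   # number of samples in each cluster
```

Other `sklearn` clustering methods follow a similar `fit_predict` pattern, but not all of them expose `cluster_centers_`, so check the documentation of the method you choose.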

In [None]:
# TODO 3

# n_clusters = ....
# c_responses = ...
# c_centers = ...

Let's visualize the results.

If you used the PCA-transformed data, use this function to visualize the results:

In [None]:
def plot_3D(elev=20, azim=20, pca_responses=pca_responses, n_clusters=n_clusters,
            c_responses=c_responses, c_centers=c_centers):

    fig = plt.figure(figsize=(15,10))
    ax = plt.axes(projection='3d')

    cmap = plt.cm.Dark2
    norm = colors.BoundaryNorm(np.arange(0, n_clusters+1, 1), cmap.N)

    p = ax.scatter3D(pca_responses[:,0], pca_responses[:,1], pca_responses[:,2], 
                 c=c_responses, s=0.2, alpha=0.4, cmap=cmap, norm=norm);
    fig.colorbar(p)
    # note: you can adjust the value of s here to change the size of the cluster centers
    p = ax.scatter3D(c_centers[:,0], c_centers[:,1], c_centers[:,2], edgecolor='black',
                 c=range(n_clusters), s=100, cmap=cmap, norm=norm);

    ax.view_init(elev=elev, azim=azim)


interact(plot_3D, elev=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         azim=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         pca_responses=fixed(pca_responses),  n_clusters=fixed(n_clusters),
         c_responses=fixed(c_responses), c_centers=fixed(c_centers));

If you used the UMAP-transformed data, use this function to visualize the results:

In [None]:
def plot_3D(elev=20, azim=20, umap_responses=umap_responses, n_clusters=n_clusters,
            c_responses=c_responses, c_centers=c_centers):

    fig = plt.figure(figsize=(15,10))
    ax = plt.axes(projection='3d')

    cmap = plt.cm.Dark2
    norm = colors.BoundaryNorm(np.arange(0, n_clusters+1, 1), cmap.N)

    p = ax.scatter3D(umap_responses[:,0], umap_responses[:,1], umap_responses[:,2], 
                 c=c_responses, s=0.2, alpha=0.4, cmap=cmap, norm=norm);
    fig.colorbar(p)
    # note: you can adjust the value of s here to change the size of the cluster centers
    p = ax.scatter3D(c_centers[:,0], c_centers[:,1], c_centers[:,2], edgecolor='black',
                 c=range(n_clusters), s=100, cmap=cmap, norm=norm);

    ax.view_init(elev=elev, azim=azim)


interact(plot_3D, elev=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         azim=widgets.FloatSlider(min=-90,max=90,step=1, value=20),
         umap_responses=fixed(umap_responses),  n_clusters=fixed(n_clusters),
         c_responses=fixed(c_responses), c_centers=fixed(c_centers));

Are you satisfied with your clusters? Do the cluster centers look like a good representation of the samples in the cluster?

Adjust your clustering (you can change the initialization, the number of clusters, or the clustering algorithm) until you are satisfied with the results.

### To do 4: Apply inverse transform to the cluster centers

Next, we'll look at the cluster centers in the original high-dimensional feature space. 

Use the `inverse_transform` method of your reducer (either the `pca_reducer` or the `umap_reducer`, depending on which type of transformed data you used for clustering). Apply this to the `c_centers` variable to get the cluster centers in the original high-dimensional feature space. Save the result in `c_centers_highd`.
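To show what `inverse_transform` does, here is a sketch using PCA on synthetic data. The names `toy_oh`, `toy_reducer`, and `toy_centers` are hypothetical, and the two "centers" are arbitrary points chosen for illustration rather than the output of a clustering model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 0/1 matrix standing in for a one-hot response table.
rng = np.random.default_rng(0)
toy_oh = (rng.random((200, 50)) < 0.3).astype(int)

toy_reducer = PCA(n_components=3).fit(toy_oh)

# Hypothetical cluster centers in the 3D space (two arbitrary points).
toy_centers = np.array([[0.0, 0.0, 0.0], [1.0, -0.5, 0.2]])

# Map the 3D points back into the original 50-dimensional feature space.
toy_centers_highd = toy_reducer.inverse_transform(toy_centers)
print(toy_centers_highd.shape)  # (2, 50)

# For PCA, the origin of the reduced space maps back to the per-column mean.
print(np.allclose(toy_centers_highd[0], toy_oh.mean(axis=0)))
```

Because the PCA reconstruction is linear, the recovered values are not constrained to lie between 0 and 1, so they are better read as approximate selection rates than as exact ones.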

In [None]:
# TODO 4

# c_centers_highd = ...

Now, we can look at the cluster centers in the high dimensional feature space to see what the "typical" survey answers are for each cluster.

In [None]:
plt.figure(figsize=(15,n_clusters*2))
for i, c in enumerate(c_centers_highd):
  plt.subplot(n_clusters,1,i+1)
  # note: the `use_line_collection` argument was removed in recent matplotlib versions
  plt.stem(c, markerfmt='.');
  plt.ylim(-0.5, 1.25) # adjust this as needed to display the data

A value close to 0 for a particular feature means that most respondents in the cluster did *not* select that option. A value close to 1 for a feature means that most respondents in the cluster *did* select that option.

## Cohort analysis

#### To do 5: use the cluster centers in high dimension feature space to explore cohorts

Use the `c_centers_highd`, the cluster labels `c_responses`, and the original data (either `responses`, `responses_sub`, or `responses_oh`) to explore *each cluster* in greater detail.

For each cluster, see if you can identify:

* What do members of the cluster tend to have in common?
* What do members of the cluster say about the state of machine learning and data science? What tools and techniques do they often use? What are they hoping to use?
* Is the cluster center a good representation of the cluster members?

Also note any important differences between clusters. 

Use this analysis to draw high-level conclusions about the state of machine learning and data science.

Show your analysis (code + output and visualizations) and also summarize your findings in one or more text cells.

At the end, please summarize your findings with a brief description of each "cohort" that you found.
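One way to begin this kind of comparison is to compute, for each cluster, the fraction of members who selected each option. The sketch below does this on a tiny hypothetical one-hot frame with made-up cluster labels (`toy_oh` and `toy_labels` stand in for `responses_oh` and `c_responses`).

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot responses and cluster labels for six respondents.
toy_oh = pd.DataFrame({
    "Q7_Part_1": [1, 1, 1, 0, 0, 0],
    "Q7_Part_2": [0, 0, 1, 1, 1, 1],
    "Q9_Part_1": [1, 1, 0, 0, 1, 0],
})
toy_labels = np.array([0, 0, 0, 1, 1, 1])

# Fraction of each cluster's members who selected each option.
cluster_rates = toy_oh.groupby(toy_labels).mean()
print(cluster_rates)

# Number of respondents in each cluster.
cluster_sizes = pd.Series(toy_labels).value_counts().sort_index()
print(cluster_sizes)

# Options whose selection rate differs most between cluster 0 and cluster 1.
gap = (cluster_rates.loc[0] - cluster_rates.loc[1]).sort_values()
print(gap)
```

Comparing these empirical per-cluster rates against the inverse-transformed cluster centers is one way to judge whether a center is a good representation of its members.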

(To help you understand the level of effort expected - this section is worth 4/10 points for this assignment. For full credit, the graders will expect to see an analysis of sufficient detail to justify this point value.)