# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Palaniappan Muthukkaruppan
## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Any code is allowed from the class text books or class provided code.__
- Please do not change the file names. The FAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and FAs will use __additional__ tests for your answer. Think about cases where your code should still run correctly even if it passes all the tests you see.**
- Before submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).

In [2]:
# load these packages
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pyspark.ml import Pipeline, clustering, evaluation, feature, regression
from pyspark.ml.feature import PCA
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

The following cell determines whether the environment is Databricks, the grading machine, or a personal computer, and loads the CSV file from the matching path.

In [4]:
def get_training_dataframe():
    data_file_name = "colleges_data_science_programs.csv"
    
    # get the databricks runtime version
    db_env = os.getenv("DATABRICKS_RUNTIME_VERSION")
    grading_env = os.getenv("GRADING_RUNTIME_ENV")
    
    # if the databricks env var exists
    if db_env is not None:
        full_path_name = "/FileStore/tables/%s" % data_file_name
    elif grading_env is not None:
        full_path_name = "C:/Users/Will/Desktop/SU/datasets/%s" % data_file_name
    else:
        full_path_name = data_file_name
    
    return spark.read.csv(full_path_name, inferSchema=True, header=True).fillna('').orderBy('id')

ds_programs_df = get_training_dataframe()

# Unsupervised learning

It is recommended to follow the notebook `unsupervised_learning.ipynb`.

The following dataset contains information about dozens of "data science" programs across the US.

In [7]:
ds_programs_df = get_training_dataframe()
ds_programs_df.head()

## Question 1: (10 pts)

This dataset contains many columns that we can use to understand how these data science programs differ from one another. In this question, you will create a dataframe `ds_programs_text_df` which simply adds a column `text` to the dataframe `ds_programs_df`. This column will have the concatenation of the following columns separated by a space: `program`, `degree` and `department` (find the appropriate function in the `fn` package)

In [9]:
# (10 pts) Create ds_programs_text_df here
from pyspark.sql.functions import concat, col, lit

ds_programs_text_df = ds_programs_df.withColumn('text',concat(col("program"), lit(" "), col("degree"), lit(" "), col("department")))

#raise NotImplementedError()

An example of the `ds_programs_text_df` should give you:

```python
ds_programs_text_df.orderBy('id').first().text
```

```console
'Data Science Masters Mathematics and Statistics'
```

In [11]:
# (10 pts)
np.testing.assert_equal(ds_programs_text_df.count(), 222)
np.testing.assert_equal(set(ds_programs_text_df.columns), {'admit_reqs',
 'business',
 'capstone',
 'cost',
 'country',
 'courses',
 'created_at',
 'databases',
 'degree',
 'department',
 'ethics',
 'id',
 'machine learning',
 'mapreduce',
 'name',
 'notes',
 'oncampus',
 'online',
 'part-time',
 'program',
 'program_size',
 'programminglanguages',
 'state',
 'text',
 'university_count',
 'updated_at',
 'url',
 'visualization', 
 'year_founded'})
np.testing.assert_array_equal(ds_programs_text_df.orderBy('id').rdd.map(lambda x: x.text).take(5),
                              ['Data Science Masters Mathematics and Statistics',
 'Analytics Masters Business and Information Systems',
 'Data Science Masters Computer Science',
 'Business Intelligence & Analytics Masters Business',
 'Advanced Computer Science(Data Analytics) Masters Computer Science'])

## Question 2: (10 pts)

The following code creates a dataframe `ds_features_df` which adds a column `features` to `ds_programs_text_df` that contains the `tfidf` of the column `text`:

In [13]:
# read-only
pipe_features = \
    Pipeline(stages=[
        feature.Tokenizer(inputCol='text', outputCol='words'),
        feature.CountVectorizer(inputCol='words', outputCol='tf'),
        feature.IDF(inputCol='tf', outputCol='tfidf'),
        feature.StandardScaler(withStd=False, withMean=True, inputCol='tfidf', outputCol='features')]).\
    fit(ds_programs_text_df)
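
To make the stages of `pipe_features` concrete, here is a minimal pure-Python sketch of the same idea: tokenize, count terms, weight by inverse document frequency, then center each column. It is not Spark's implementation. The function name `tfidf_centered` is hypothetical, the alphabetical tie-break in the vocabulary order is an assumption, and `str.split()` is a simplification of Spark's whitespace tokenizer. The IDF weight uses the smoothed form $\log\frac{m+1}{\mathrm{df}+1}$, where $m$ is the number of documents.

```python
import math
from collections import Counter

def tfidf_centered(texts):
    """Return (vocabulary, centered TF-IDF rows) for a list of strings."""
    # Tokenizer step: lowercase and split on whitespace (simplified).
    docs = [t.lower().split() for t in texts]
    # CountVectorizer step: vocabulary ordered by descending corpus frequency
    # (alphabetical tie-break is an assumption made for determinism).
    counts = Counter(w for d in docs for w in d)
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    m = len(docs)
    # IDF step: smoothed inverse document frequency log((m + 1) / (df + 1)).
    df = {w: sum(w in d for d in docs) for w in vocab}
    idf = {w: math.log((m + 1) / (df[w] + 1)) for w in vocab}
    rows = [[d.count(w) * idf[w] for w in vocab] for d in docs]
    # StandardScaler(withMean=True, withStd=False) step: subtract column means.
    means = [sum(col) / m for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(r, means)] for r in rows]
    return vocab, centered
```

Because of the centering step, every column of the result sums to zero, which is why the `features` vectors in this notebook are dense even though `tf` and `tfidf` are sparse.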

Create a pipeline model `pipe_pca` that computes the first two principal components of `features` as computed by `pipe_features` and outputs a column `scores`. Use that pipeline to create a dataframe `ds_features_df` with the columns `id`, `name`, `url`, and `scores`.

In [15]:
# create the pipe_pca PipelineModel below (10 pts)
pca = PCA(k=2, inputCol="features", outputCol="scores")
pipe_pca = Pipeline(stages=[pipe_features, pca]).fit(ds_programs_text_df)
ds_features_df = pipe_pca.transform(ds_programs_text_df).select('id', 'name', 'url','scores')
#raise NotImplementedError()

In [16]:
# Tests for (10 pts)
np.testing.assert_equal(pipe_pca.stages[0],  pipe_features)
np.testing.assert_equal(type(pipe_pca.stages[1]),  feature.PCAModel)
np.testing.assert_equal(set(ds_features_df.columns), {'id', 'name', 'scores', 'url'})
np.testing.assert_equal(ds_features_df.first().scores.shape, (2, ))

## Question 3: (10 pts)

Create a scatter plot with the x axis containing the first principal component (loading) and the y axis containing the second principal component (loading) of `ds_features_df`
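
The terms "loading" and "score" are easy to mix up, so the following NumPy sketch (not Spark's implementation; the function name `pca_sketch` is hypothetical) shows the difference. Loadings form a matrix with one row per feature (here, one per vocabulary word), which plays the role of `pc.toArray()` on the fitted `PCAModel`. Scores have one row per sample (here, one per program), which plays the role of the `scores` column. The sign of each component may differ from what Spark returns.

```python
import numpy as np

def pca_sketch(X, k):
    """Return (loadings, scores) for the top-k principal components of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # center each feature
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:k].T                  # shape (n_features, k)
    scores = Xc @ loadings               # shape (n_samples, k)
    return loadings, scores
```

A scatter plot of `loadings[:, 0]` against `loadings[:, 1]` therefore has one point per word, while a scatter plot of the two score columns has one point per program.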

In [18]:
# scatter plot of the loadings of the first two principal components
loadings = pipe_pca.stages[-1].pc.toArray()
plt.figure(figsize=(8,8))
plot01 = plt.scatter(loadings[:,0], loadings[:,1], edgecolor='none', alpha=0.5)
plt.title("PCA Scatter Plot")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
display(plot01.figure)
#raise NotImplementedError()

## Question 4: (10 pts)

Create two Pandas dataframes `pc1_pd` and `pc2_pd` with the columns `word` and `abs_loading` that contain the top 5 words in absolute loading for the principal components 1 and 2, respectively. You can extract the vocabulary from the stage that contains the count vectorizer in `pipe_features`:

In [20]:
# your code here
pc1_pd = pd.DataFrame({'word' : pipe_features.stages[1].vocabulary, 
                      'abs_loading' : np.abs(loadings[:,0])}).sort_values('abs_loading', ascending=False).head()
pc2_pd = pd.DataFrame({'word' : pipe_features.stages[1].vocabulary, 
                      'abs_loading' : np.abs(loadings[:,1])}).sort_values('abs_loading', ascending=False).head()
#raise NotImplementedError()

In [21]:
pc1_pd

| | word | abs_loading |
|---|---|---|
| 19 | computational | 0.440858 |
| 16 | sciences | 0.437846 |
| 13 | of | 0.389599 |
| 21 | school | 0.269706 |
| 41 | physics, | 0.239209 |


In [22]:
pc2_pd

| | word | abs_loading |
|---|---|---|
| 7 | computer | 0.495368 |
| 5 | science | 0.405065 |
| 19 | computational | 0.317074 |
| 31 | college | 0.302729 |
| 2 | business | 0.201894 |


In [23]:
# (10 pts)
assert type(pc1_pd) == pd.core.frame.DataFrame
assert type(pc2_pd) == pd.core.frame.DataFrame
np.testing.assert_array_equal(pc1_pd.shape, (5, 2))
np.testing.assert_array_equal(pc2_pd.shape, (5, 2))
np.testing.assert_equal(set(pc1_pd.columns), {'abs_loading', 'word'})
np.testing.assert_equal(set(pc2_pd.columns), {'abs_loading', 'word'})

## Question 5: (10 pts)

Create a new pipeline for PCA called `pipe_pca2` where you fit 50 principal components. Extract the `PCAModel` from the stages of this pipeline, and assign to a variable `explainedVariance` the variance explained by the components of that model. Finally, assign to a variable `best_k` the value $k$ such that the ($k+1$)-th component is not able to explain more than 0.01 variance. You can use a for-loop to find such a best $k$.
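
The search for $k$ can be sketched as a small helper. The name `find_best_k` and the `threshold` parameter are hypothetical, and reading "not more than 0.01" as a strict `< threshold` comparison is an assumption. The helper takes any sequence of explained-variance ratios in descending order, for example `explainedVariance.toArray()`.

```python
def find_best_k(explained, threshold=0.01):
    """Number of leading components that each explain at least `threshold`."""
    for k, ratio in enumerate(explained):
        # `k` is a 0-based index, so `k` components precede this one.
        if ratio < threshold:
            return k
    # Every component clears the threshold: keep them all.
    return len(explained)
```

Returning `len(explained)` when no component falls below the threshold avoids silently reporting the last loop index as the answer.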

In [25]:
# your code here
pca1 = PCA(k=50, inputCol="features", outputCol="scores")
pipe_pca2 = Pipeline(stages=[pipe_features, pca1]).fit(ds_programs_text_df)
explainedVariance = pipe_pca2.stages[-1].explainedVariance
#raise NotImplementedError()

In [26]:
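# best_k counts the leading components that each explain at least 0.01 of the variance
best_k = len(explainedVariance)  # fallback if no component drops below 0.01
for k in range(len(explainedVariance)):
    if explainedVariance[k] < 0.01:
        best_k = k
        break
best_k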

In [27]:
# Tests for (10 pts)
np.testing.assert_equal(pipe_pca2.stages[0],  pipe_features)
np.testing.assert_equal(type(pipe_pca2.stages[1]),  feature.PCAModel)
np.testing.assert_equal(len(explainedVariance), 50)
np.testing.assert_array_less(5, best_k)

## Question 6: (10 pts)

Create a new pipeline for PCA called `pipe_pca3` where you fit all possible principal components for this dataset. Extract the `PCAModel` from the stages of this pipeline, and use the object property named `explainedVariance` to create 2 separate plots: a scree plot and a plot of cumulative variance explained. Use the plots to check your code in the above question. Is your answer from the above question believable?
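
A cumulative-variance curve can also be read off numerically. The sketch below is a hypothetical helper (`components_for_share` is not part of the assignment) that returns how many leading components are needed before the running total of explained variance reaches a chosen share.

```python
from itertools import accumulate

def components_for_share(explained, target):
    """Smallest number of components whose cumulative share reaches `target`."""
    for n, total in enumerate(accumulate(explained), start=1):
        if total >= target:
            return n
    return None  # the target share is never reached
```

Comparing this count with `best_k` from the previous question is one way to judge whether the threshold-based answer is believable.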

In [29]:
# your code here
pca2 = PCA(k=118, inputCol="features", outputCol="scores")
pipe_pca3 = Pipeline(stages=[pipe_features, pca2]).fit(ds_programs_text_df)
#raise NotImplementedError()

In [30]:
plt.figure()
explained_var = pipe_pca3.stages[-1].explainedVariance
plot02 = plt.scatter(np.arange(1, len(explained_var)+1), explained_var)
plt.title("DS Programs Explained Variance")
plt.xlabel('Number of components')
plt.ylabel('Explained variance')
display(plot02.figure)

In [31]:
cum_sum = np.cumsum(explained_var)
plt.figure()
cumsum_plot = plt.scatter(np.arange(1, len(explained_var)+1), cum_sum)
plt.title("DS Programs Cumulative Sum of Explained Variance")
plt.xlabel("Number of components")
plt.ylabel("Cumulative Sum of Variance Explained")
display(cumsum_plot.figure)

## Question 7: (10 pts)

Create a pipeline named `k_means_pipe` below by adding a `feature.Normalizer` and a `clustering.KMeans` object onto the end of the `pipe_features` pipeline object. Set the seed of the `KMeans` clustering object to 2 and the number of clusters K to 5. Use the new pipe to transform the `ds_programs_text_df` dataframe, creating a new dataframe named `k_means_df`.

In [34]:
# your code here
norm = feature.Normalizer(inputCol="features", outputCol="norm_features", p=2.0)
kmeans = clustering.KMeans(k=5, seed=2, featuresCol='norm_features', predictionCol='kmeans_clust')
k_means_pipe = Pipeline(stages=[pipe_features, norm, kmeans]).fit(ds_programs_text_df)
k_means_df = k_means_pipe.transform(ds_programs_text_df)
#raise NotImplementedError()

The `kmeans_clust` column in the resulting dataframe contains the cluster assignment. It's hard to visualize the clusters because we trained on all of the features in `ds_programs_text_df`.
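
One reason to put an L2 `Normalizer` in front of `KMeans` is that, for unit-length vectors, the squared Euclidean distance equals $2 - 2\cos\theta$, so clustering on normalized vectors groups programs by the direction of their feature vectors rather than by their magnitude. The pure-Python sketch below illustrates that identity; the helper names are hypothetical and it does not use Spark.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Check the identity on one pair of vectors.
a, b = [3.0, 4.0], [1.0, 0.0]
lhs = squared_distance(l2_normalize(a), l2_normalize(b))
rhs = 2 - 2 * cosine_similarity(a, b)
assert abs(lhs - rhs) < 1e-12
```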

In [36]:
# Show the head of the dataframe from above
display(k_means_df.toPandas().head())

Selected columns of the output (the long vector columns `words`, `tf`, `tfidf`, `features`, and `norm_features` are omitted):

| id | name | program | degree | department | text | kmeans_clust |
|---|---|---|---|---|---|---|
| 1 | South Dakota State University | Data Science | Masters | Mathematics and Statistics | Data Science Masters Mathematics and Statistics | 3 |
| 2 | Dakota State University | Analytics | Masters | Business and Information Systems | Analytics Masters Business and Information Systems | 0 |
| 3 | Lewis University | Data Science | Masters | Computer Science | Data Science Masters Computer Science | 2 |
| 4 | Saint Joseph's University | Business Intelligence & Analytics | Masters | Business | Business Intelligence & Analytics Masters Business | 1 |
| 5 | University Of Leeds | Advanced Computer Science(Data Analytics) | Masters | Computer Science | Advanced Computer Science(Data Analytics) Masters Computer Science | … |
-0.03881585119632441, -0.03881585119632441, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, 4.692790245896119, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, 4.692790245896119, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835, -0.021234345004054835))","List(1, 118, List(), List(0.006274887491516716, -0.04013264380488611, -0.05437444523522305, -0.03983024357510631, -0.04295667122245671, 0.09483609913255597, -0.026445478877147894, 0.5030324105733194, -0.026782253764009875, -0.02622443846590764, -0.024043032605102384, -0.022648378675102054, -0.022902409726895657, -0.02239382909126266, -0.02191275307964444, -0.021630053630957005, -0.019669496470293688, -0.018679172305294093, -0.01778830388591888, -0.01745411044818011, -0.014875420118183448, -0.014875420118183448, -0.01586737313470919, -0.014875420118183448, -0.014875420118183448, -0.015346556275581891, -0.011514717793789365, -0.011514717793789365, 
-0.011514717793789365, -0.011514717793789365, -0.011514717793789365, -0.010722998086762895, -0.010265800592623596, -0.010265800592623596, -0.008935831738969079, -0.008935831738969079, -0.008935831738969079, -0.007509165299391396, -0.007509165299391396, -0.007509165299391396, -0.005962786146350137, -0.005962786146350137, -0.005962786146350137, -0.005962786146350137, 0.43528338868356, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004660462158812744, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.004259604090166951, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, 0.5149810685488083, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, 0.5149810685488083, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, -0.002330231079406372, 
-0.002330231079406372, -0.002330231079406372, -0.002330231079406372))",2


In [37]:
np.testing.assert_equal(len(pd.unique(k_means_df.toPandas()['kmeans_clust'])), 5)

# Question 8: (10 pts)

The goal of this question is to make it easier to visualize the clusters by reducing the dimensionality of ds_programs_text_df to 2 dimensions using PCA.  Replace the Normalizer in the k_means_pipe above with a PCA object that reduces dimensionality to 2 dimensions. Save the PCA output scores in a column named `scores` and the KMeans cluster assignments in a column named `kmeans_clust`.  Fit the pipeline on ds_programs_text_df and use it to transform ds_programs_text_df, creating a new dataframe named k_means_pca_df.

Define a function named get_kmeans_df which takes a parameter num_clusters, builds the pipeline described above using the parameter num_clusters to determine the number of KMeans clusters, and returns a pandas dataframe by transforming the pipeline on ds_programs_text_df.

In [39]:
# your code here
kmeans1 = clustering.KMeans(k=5, seed=2, featuresCol='scores', predictionCol='kmeans_clust')
k_means_pipe = Pipeline(stages=[pipe_features, pca, kmeans1]).fit(ds_programs_text_df)
k_means_pca_df = k_means_pipe.transform(ds_programs_text_df)
#raise NotImplementedError()

In [40]:
def get_kmeans_df(num_clusters):
  kmeans = clustering.KMeans(k=num_clusters, seed=2, featuresCol='scores', predictionCol='kmeans_clust')
  k_means_pipe = Pipeline(stages=[pipe_features, pca, kmeans]).fit(ds_programs_text_df)
  df = k_means_pipe.transform(ds_programs_text_df).toPandas()
  return df
  

In [41]:
# test
k_means_df = get_kmeans_df(3)
np.testing.assert_equal(k_means_df.shape[0], 222)
np.testing.assert_equal(k_means_df.shape[1], 35)
np.testing.assert_equal(len(pd.unique(k_means_df['kmeans_clust'])), 3)
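
As a rough illustration of what reducing to 2 principal components means, the sketch below projects a synthetic matrix onto its top two principal directions using plain NumPy. It is a generic sketch, not Spark's implementation: the function name `pca_project` and the random data are assumptions made for illustration, and the 222 x 118 shape simply mirrors the row count in the test above and the length of the feature vectors shown earlier.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    # Center each feature so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Synthetic stand-in data; the real features come from pipe_features.
rng = np.random.default_rng(0)
X = rng.normal(size=(222, 118))
scores = pca_project(X)
print(scores.shape)
```

Each row of `scores` plays the role of one entry in the `scores` column that KMeans clusters on.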

# Question 9: (10 pts)

In the cell below, define a function named plot_kmeans which takes the dataframe returned by the get_kmeans_df function defined above as a single argument.  The plot_kmeans function shall create a scatter plot of the PCA scores from the input dataframe where each score point is colored by its cluster assignment.  The plot title shall state the number of clusters in the dataframe.  The X and Y axes shall have descriptive labels.  The plot shall include a legend mapping colors to cluster numbers.  Feel free to use whatever plotting technique you like, but the seaborn scatterplot function is one easy way of creating this function.

In [44]:
# your code here
def plot_kmeans(df):
  # Work on a copy so the caller's dataframe is not modified.
  df = df.copy()
  # Split each 2-D PCA score vector into separate x and y columns.
  df['score1'] = [float(s[0]) for s in df['scores']]
  df['score2'] = [float(s[1]) for s in df['scores']]
  num_clusters = df['kmeans_clust'].nunique()
  plt.figure(figsize=(8,8))
  # Pass x and y by keyword; newer seaborn releases reject positional data arguments.
  plot01 = sns.scatterplot(data=df, x='score1', y='score2',
                           hue='kmeans_clust', size='kmeans_clust', palette="Set1")
  plt.title("PCA Scatter Plot for " + str(num_clusters) + " Clusters")
  plt.xlabel("Principal Component 1")
  plt.ylabel("Principal Component 2")
  # display() is provided by the Databricks notebook environment.
  return display(plot01.figure)
#raise NotImplementedError()
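
The two score components can also be pulled out in one step by stacking the row vectors into a single array. The sketch below uses a small hand-made list as a hypothetical stand-in for `df['scores']`; it assumes each entry converts cleanly to a NumPy array (for Spark vectors, calling `.toArray()` on each entry first is one way to get there).

```python
import numpy as np

# Hypothetical stand-in for df['scores']: one 2-element vector per row.
scores_column = [np.array([0.5, -1.0]), np.array([2.0, 0.25]), np.array([-1.5, 3.0])]

# Stack the row vectors into an (n_rows, 2) array, then slice out each component.
stacked = np.vstack(scores_column)
score1, score2 = stacked[:, 0], stacked[:, 1]
print(score1)
print(score2)
```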

In [45]:
print(sns.__version__)

In [46]:
# test your plot function here
# The plot should have 3 clusters, a title indicating that K=3, a legend, and labeled X and Y axes

plot_kmeans(get_kmeans_df(3))

Create 5 cluster plots below, picking various values of K of your choice, using get_kmeans_df and plot_kmeans.

In [48]:
# your code here
plot_kmeans(get_kmeans_df(4))
#raise NotImplementedError()

In [49]:
plot_kmeans(get_kmeans_df(2))

In [50]:
plot_kmeans(get_kmeans_df(6))

In [51]:
plot_kmeans(get_kmeans_df(9))

# Question 10: (10 pts)

Think about the cumulative variance and the scree plot from the question above.  What is one problem related to variance which might make the above KMeans a suboptimal choice when using only 2 principal components?

Your Text Answer Here:
The KMeans above may not be an optimal choice because of how little variance the first 2 principal components capture. Although the first two components possess the highest variance in the data, together they still preserve less than 10% of the total variance. Ideally, we should preserve 90-95% of the total variance, for example by choosing about 50 components, so that we keep the variables that contribute most to the “cluster separation”.
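
One way to make this argument concrete is to compute the cumulative explained variance and find the smallest number of components that reaches a target share. The sketch below does this with plain NumPy on synthetic data; the function name `components_for_variance`, the 90% target, and the random matrix are assumptions for illustration, so the resulting count says nothing about the homework dataset itself.

```python
import numpy as np

def components_for_variance(X, target=0.90):
    """Return the smallest k whose cumulative explained variance reaches target."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance along each component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative share is at least the target, as a count.
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))
    return k, cumulative

# Synthetic stand-in data with unequal feature scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20)) * np.linspace(1.0, 5.0, 20)
k, cumulative = components_for_variance(X, target=0.90)
print(k, cumulative[:2])
```

Comparing `cumulative[1]` (the share kept by 2 components) with the target shows how much information a 2-D projection discards before clustering.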