# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [92]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [93]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

### What are the benefits of using a pipeline?

In [94]:
# call_on_students(1)

Create a pipeline → it becomes an sklearn estimator → fit, transform, cross-validate, etc.
- Compartmentalizes each part of the process
- Flexible: each part can be changed separately
- Helps prevent data leakage in cross-validation: the transformers are fit only on the training folds
- Column transformer: applies different transformation objects to different columns, so one-hot encoding, scaling, and imputation run only on the columns that need them. Like “mini” pipelines for each step
- Missing values (nulls) can be handled differently on different columns
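
To make the column transformer idea concrete, here is a minimal sketch. The toy dataframe, its column names (`age`, `city`), and the variable names are made up for illustration and are not part of the challenge data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a null, one categorical column
df = pd.DataFrame({
    'age': [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'SF'],
})
target = np.array([0, 1, 0, 1, 1, 0])

# "Mini" pipeline for the numeric column: impute first, then scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Each transformer only touches the columns listed for it
preprocessor = ColumnTransformer([
    ('num', numeric_steps, ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

# The full pipeline behaves like any other sklearn estimator
toy_pipeline = Pipeline([
    ('prep', preprocessor),
    ('lr', LogisticRegression()),
])
toy_pipeline.fit(df, target)
toy_preds = toy_pipeline.predict(df)
print(toy_preds)
```

Because the imputer, scaler, and encoder live inside the pipeline, cross-validating `toy_pipeline` would refit them on each training fold.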


### What does a gridsearch achieve?

In [95]:
# call_on_students(1)

The problem with tuning a model by hand: adjustments interact, so the best value for one hyperparameter can change when you adjust another.
- Really, you need to tune the hyperparameters in tandem

Grid Search:
Automatically and systematically searches for the optimal combination of hyperparameters. We provide a grid with the values of each hyperparameter that we want to try.

Grid search evaluates all of those combinations with the evaluation metric that we choose, then returns the best combination of hyperparameters (the best model).

scikit-learn's `GridSearchCV` class with its `.fit()` method cross-validates each of those combinations.
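
As a small illustration of what "all those combos" means, the sketch below enumerates a grid with scikit-learn's `ParameterGrid`. The grid itself (`example_grid`, with a second `lr__solver` entry) is an assumption made up for this example.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of C x 2 solvers
example_grid = {
    'lr__C': [0.1, 1, 10],
    'lr__solver': ['lbfgs', 'liblinear'],
}

# ParameterGrid expands the grid into every combination
combos = list(ParameterGrid(example_grid))
for combo in combos:
    print(combo)

# With 5-fold cross-validation, each candidate is fit 5 times
n_folds = 5
print(f"{len(combos)} candidates x {n_folds} folds = {len(combos) * n_folds} fits")
```

The number of fits grows multiplicatively with each hyperparameter added, which is why large grids get slow quickly.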


### Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant (target = 1). Don't worry for now about a train-test split.

In [96]:
# call_on_students(1)

**Answer**:

In [97]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [98]:
# Your code here
pipeline = Pipeline([('scaler', StandardScaler()),
                   ('lr', LogisticRegression(random_state=1, C=1))])

In [99]:
# call_on_students(1)

### Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

In [100]:
# call_on_students(1)

**Answer**:

In [101]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [102]:
# Your code here
grid = {
    'lr__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, grid, verbose=2, return_train_score=True)
#default cv=5

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=1 .........................................................
[CV] .......................................... lr__C=1, total=   0.0s
[CV] lr__C=1 .........................................................
[CV] ............

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ......................................... lr__C=10, total=   0.0s


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.2s finished


GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('lr',
                                        LogisticRegression(C=1,
                                                           random_state=1))]),
             param_grid={'lr__C': [0.1, 1, 10]}, return_train_score=True,
             verbose=2)

In [103]:
grid_search.cv_results_

{'mean_fit_time': array([0.00904102, 0.00924492, 0.01708112]),
 'std_fit_time': array([0.00586284, 0.000918  , 0.00178607]),
 'mean_score_time': array([0.00047827, 0.00034547, 0.00042601]),
 'std_score_time': array([1.69403875e-04, 2.55644314e-05, 1.28482921e-04]),
 'param_lr__C': masked_array(data=[0.1, 1, 10],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'lr__C': 0.1}, {'lr__C': 1}, {'lr__C': 10}],
 'split0_test_score': array([0.95348837, 0.95348837, 0.94186047]),
 'split1_test_score': array([0.95294118, 0.95294118, 0.94117647]),
 'split2_test_score': array([0.97647059, 0.97647059, 1.        ]),
 'split3_test_score': array([0.98823529, 0.98823529, 0.96470588]),
 'split4_test_score': array([0.98823529, 0.98823529, 0.97647059]),
 'mean_test_score': array([0.97187415, 0.97187415, 0.96484268]),
 'std_test_score': array([0.01583032, 0.01583032, 0.02217898]),
 'rank_test_score': array([1, 1, 3], dtype=int32),
 'split0_train_scor

^ In the above, there are 5 splits (`split0` through `split4`), and each split's array holds 3 scores, one per candidate value of `C`.
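
The `cv_results_` dictionary is often easier to read as a dataframe. The sketch below refits the same kind of search under new, hypothetical variable names (`X_bc`, `search`, `results`) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr', LogisticRegression(random_state=1))])
search = GridSearchCV(pipe, {'lr__C': [0.1, 1, 10]}, return_train_score=True)
search.fit(X_tr, y_tr)

# One row per candidate, one column per recorded statistic
results = pd.DataFrame(search.cv_results_)
summary = results[['param_lr__C', 'mean_train_score',
                   'mean_test_score', 'rank_test_score']]
print(summary)
```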

In [104]:
grid_search.best_params_

{'lr__C': 0.1}

In [105]:
grid_search.best_estimator_
# could save this off as final_model

Pipeline(steps=[('scaler', StandardScaler()),
                ('lr', LogisticRegression(C=0.1, random_state=1))])

In [106]:
grid_search.best_score_

0.97187414500684

In [107]:
#testing the best estimator out on test data
grid_search.best_estimator_.score(X_test, y_test)

0.958041958041958

# 2) Ensemble Methods

### What sorts of ensembling methods have we looked at?

In [108]:
# call_on_students(1)

Ensemble: a model consisting of a bunch of other models.
We want differently randomized models that are overfit in different ways.

PROS:
- Generalize (perform on unseen test data) well
- Decreases variance → less overfitting

CONS:
- More complexity: have to train each model or part of the model
- Sometimes means → less interpretability
    - Ensembles consisting of a bunch of trees still have feature importances
    - But ensembles consisting of trees, logistic regressions, and KNNs have feature importances and coefficients for some sub-models, but none for the whole model
- Takes up more space: have to keep each model
- Also means → more computational power and time

FULL LIST
1. Random forest
2. Gradient Boosting:
    - XGBoost
3. AdaBoost
4. Bagging
5. Stacking
6. Voting
7. Extra trees
 
Bagging: a bag of marbles, each marble is a model
- Many model types naturally overfit
- So, we can randomly assign data to each model
- Each model overfits in different ways
- Then we aggregate all the models 
- Bootstrap (replacement) aggregating
- Repeat algorithm many times w/ replacement (bootstrap)
- Create multiple samples from your data
- Train models on those samples
- Final model predicts by averaging or voting across those models
    - →  hope that that smooths over/cancels out the different ways each model overfits to reduce variance
    - →  low variance since it averages out quirks the individual models might’ve learned

Why it is called bagging:
- Because we're resampling our data with replacement, we're bootstrapping.
- Because we're making use of our many samples' predictions, we're aggregating.
- Because we're bootstrapping and aggregating all in the same algorithm, we're bagging.
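
The bootstrap-then-aggregate idea can be sketched by hand. This is an illustrative version under assumed names (`X_bc`, `trees`, `bagged_preds`), not how `BaggingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

rng = np.random.default_rng(1)
n_trees = 25  # odd, so a majority vote can never tie
trees = []

for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Aggregate: majority vote across the individual trees' predictions
all_preds = np.array([tree.predict(X_te) for tree in trees])  # (n_trees, n_test)
bagged_preds = (all_preds.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = (bagged_preds == y_te).mean()
print(bagged_accuracy)
```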

Levels of Randomization
1. Simple Bag: BaggingClassifier
- Train each model on random samples w/ replacement
- Bagging Algorithm
    1. Take a sample of your X_train and fit a decision tree to it.
    2. Replace the first batch of data and repeat.
    3. When you've got as many trees as you like, make use of all your individual trees' predictions to come up with some holistic prediction. (Most obviously, we could take the average of our predictions, but there are other methods we might try.)

2. Random Forest: random sample AND give the model all the features, but at each node only allow it to choose from a random subset of those features to split on
    - Each tree in the forest is very different
    - Finds the best split within that subset based on a criterion like gini or entropy, like normal decision trees
3. Extra Trees: random sample, AND random subset of features at each node, AND for each feature in that subset, a randomly chosen split value. Gini or entropy is calculated for those random splits, and the tree splits on whichever is best
    - Note: scikit-learn's `ExtraTreesClassifier` defaults to `bootstrap=False`, so each tree sees the full training set unless bootstrapping is turned on
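
For a rough side-by-side of these levels, the sketch below cross-validates the three scikit-learn classes on the breast cancer data. The variable names and `n_estimators=50` are assumptions chosen for a quick run; the scores are not tuned and will vary with settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)

models = {
    # BaggingClassifier's default base model is a decision tree
    'bagging': BaggingClassifier(n_estimators=50, random_state=1),
    # Adds a random feature subset at each split
    'random forest': RandomForestClassifier(n_estimators=50, random_state=1),
    # Adds random split thresholds (and no bootstrapping by default)
    'extra trees': ExtraTreesClassifier(n_estimators=50, random_state=1),
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_bc, y_bc, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```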

Ensembles of Different Model Types:
Maybe you have some linear data, some non-linear data, and a lot of categorical data.
Logistic regression models are better for linear relationships, but trees are better for the last two.
2 methods:
1. VotingClassifier or VotingRegressor: taking a simple average (or a vote) of the outputs of the sub-models
    - Sometimes uses weighted averaging
2. StackingClassifier or StackingRegressor: feeding the outputted predictions of a bunch of sub-models as features into another final model
    - Also known as a Meta-Classifier/Meta-Regressor
    - Approaching a neural network
    - Estimates the weights in the averaging for us
    - This means we'll have one layer of base estimators and another layer that is "trained to optimally combine the model predictions to form a new set of predictions".
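
A minimal sketch of both approaches on the breast cancer data follows. The choice of sub-models, the soft voting, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

# Three different model types; the scale-sensitive ones get their own scaler
base_models = [
    ('lr', make_pipeline(StandardScaler(), LogisticRegression(random_state=1))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
]

# Voting: average the predicted probabilities of the sub-models ("soft" vote)
voter = VotingClassifier(estimators=base_models, voting='soft')
voter.fit(X_tr, y_tr)

# Stacking: a final model learns how to combine the sub-models' predictions
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression())
stacker.fit(X_tr, y_tr)

voting_score = voter.score(X_te, y_te)
stacking_score = stacker.score(X_te, y_te)
print(voting_score, stacking_score)
```

Passing `weights=[...]` to `VotingClassifier` gives the weighted-average variant mentioned above.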


### What is random about a random forest?

In [109]:
# call_on_students(1)

- Still bootstrapping: random samples w/ replacement
- Now give each tree all the features, but at each node we give the tree a random subset of the features to choose from/split on

Pros:

- High performance, low variance
- Somewhat transparent: feature importances are inherited from decision trees, though a whole forest is harder to interpret than a single tree

Cons:

- So many trees to plant
- Computationally expensive
- Memory: every tree is stored in memory, almost as memory intensive as KNN


### What hyperparameters of a random forest might it be useful to tune? How so?

In [110]:
# call_on_students(1)

- `n_estimators`: number of sub-models (trees) to train
- `max_features`: feature subsetting at each split
    - Can also set this parameter in a Bagging Classifier and it will make that bagging ~ Random Forest (there the subset is drawn once per model rather than at each node)
- `bootstrap`: whether to randomly sample the dataset with replacement or not. We do want this in order to grow different trees each time
- `max_samples`: size of the random sample going into each model
    - As a float, must be between 0 and 1 (a fraction of the rows); it can also be an integer number of rows
    - Default = None, which samples the length of X
    - If it is 1.0, each sample is the same size as the full dataset
    - This works because we allow replacement after we take each data point
    - So we can have a bunch of different samples that are all the same size as the original for each model
- Also all the same parameters as a decision tree: `max_depth`, `min_samples_leaf`, `min_samples_split`, `criterion`
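
One way to tune several of these together is a grid search over the forest. The grid values below are arbitrary assumptions kept small so the search runs quickly; they are not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

# Hypothetical grid: 2 x 2 x 2 x 2 = 16 candidates
rf_grid = {
    'n_estimators': [25, 50],
    'max_depth': [3, None],
    'max_features': ['sqrt', 0.5],
    'max_samples': [0.5, None],  # only used because bootstrap=True by default
}

rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=3)
rf_search.fit(X_tr, y_tr)

print(rf_search.best_params_)
print(rf_search.best_score_)
```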

### Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant (target = 1).

In [111]:
# call_on_students(1)

**Answer**:

In [112]:
# Your code here
rfc = RandomForestClassifier(random_state=1)
rfc.fit(X_train, y_train)
# from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
# y_pred = rfc.predict(X_test)
# ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred), display_labels=['0 class', '1 class']).plot()

# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=5)
# scores

print(rfc.score(X_train, y_train)) # train accuracy
print(rfc.score(X_test, y_test)) # test accuracy


1.0
0.9440559440559441


^ This is expected because tree-based models tend to overfit: training accuracy is perfect while test accuracy is lower.

# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [113]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [114]:
# call_on_students(1)

1. Remove stop words like "is" that have no semantic value
2. Lower case all the words
3. Remove punctuation
4. Stem or lemmatize
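
The first three steps can be sketched in plain Python. The stop word list here is a tiny made-up set for illustration, and `preprocess` is a hypothetical helper; stemming or lemmatizing would need an additional library such as NLTK and is left out.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration
STOP_WORDS = {'is', 'the', 'about', 'of', 'to', 'a', 'i', 'me', 'it'}

def preprocess(document):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = document.lower()
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

example = "Harry Potter is the best young adult book about wizards"
print(preprocess(example))
# ['harry', 'potter', 'best', 'young', 'adult', 'book', 'wizards']
```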

### Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

In [115]:
# call_on_students(1)

Represented as a document term matrix:
- Each row is a document in our corpus
- Each column is a unique vocabulary n-gram (a token)
- The values can be a count, presence (1 or 0 for whether that document has the token), or some form of score (like tf-idf)
- Typically returned as a sparse matrix given lots of 0 values
- So each row is a numerical vector


### What does TF-IDF do?

Also, what does TF-IDF stand for?

In [116]:
# call_on_students(1)

TF-IDF: probably the most commonly used vectorizer. Very powerful for content-based classification because it adds an importance weight to certain tokens.
Useful when the goal is to distinguish the content of documents from others in the corpus.

TF-IDF score: the higher the score, the more important that word is in that document compared to how important it is across all the documents.

`TfidfVectorizer` will weight each word by how important it is to that document.
So the sorted values from each type of vectorizer will look different for each document.
- TF = term frequency (frequency of that word in that document)
- IDF = inverse document frequency (log of the inverse of the fraction of documents in the full corpus that contain that word)
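
A small sketch on a made-up three-document corpus shows the weighting in action. It relies on scikit-learn's default smoothed formula, idf = ln((1 + n_docs) / (1 + df)) + 1, followed by L2 normalization of each row; `toy_docs` and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "the" is in every document, "cat" in only one
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs).toarray()
vocab = tfidf.vocabulary_  # maps each token to its column index

# idf_ holds one inverse-document-frequency value per token
print("idf 'the':", tfidf.idf_[vocab['the']])  # appears in all 3 docs
print("idf 'cat':", tfidf.idf_[vocab['cat']])  # appears in 1 doc

# Nonzero tf-idf weights for the first document
first_doc = {tok: round(weights[0, col], 3)
             for tok, col in vocab.items() if weights[0, col] > 0}
print(first_doc)
```

In the first document, the rare token "cat" ends up with a larger weight than "sat", which in turn outweighs "the".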


## NLP in Code

### Set Up

In [117]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label == 'warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [118]:
policies.head()

Unnamed: 0.1,Unnamed: 0,name,policy,candidate
0,0,100% Clean Energy for America,"As published on Medium on September 3rd, 2019:...",1
1,1,A Comprehensive Agenda to Boost America’s Smal...,Small businesses are the heart of our economy....,1
2,2,A Fair and Welcoming Immigration System,"As published on Medium on July 11th, 2019:\nIm...",1
3,3,A Fair Workweek for America’s Part-Time Workers,Working families all across the country are ge...,1
4,4,A Great Public School Education for Every Student,I attended public school growing up in Oklahom...,1


The documents for this activity are in the `policy` column, and the target is `candidate`.

### Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [119]:
# call_on_students(1)

In [120]:
policies.candidate.value_counts()

0    114
1     75
Name: candidate, dtype: int64

In [121]:
# First! Train-test split the dataset
X = policies.policy
y = policies.candidate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


In [122]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [123]:
# Instantiate it
vec = CountVectorizer(stop_words='english', strip_accents='unicode')
# could enter stop words here

In [124]:
# Fit it
vec.fit(X_train)

CountVectorizer(stop_words='english', strip_accents='unicode')

### Vectorize Your Text, Then Model

In [None]:
# call_on_students(1)

In [125]:
# Code here to transform train and test sets with the vectorizer
X_train_vectorized = vec.transform(X_train)
X_test_vectorized = vec.transform(X_test)

In [126]:
# Code here to instantiate and fit a Random Forest model
rfc2 = RandomForestClassifier(random_state=1)
rfc2.fit(X_train_vectorized, y_train)

RandomForestClassifier(random_state=1)
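
An alternative to transforming each split by hand is to put the vectorizer and the model in one pipeline, so raw text can be passed straight to `fit` and `predict`. The sketch below uses a tiny made-up corpus and hypothetical names (`train_docs`, `text_clf`) rather than the policies data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical stand-in documents and labels
train_docs = ["tax the wealthy", "break up big banks", "cut taxes now",
              "wealth tax plan", "lower taxes for business",
              "tax big corporations"]
train_labels = [1, 1, 0, 1, 0, 1]
test_docs = ["a plan to tax wealth", "cut business taxes"]

text_clf = Pipeline([
    ('vec', CountVectorizer(stop_words='english', strip_accents='unicode')),
    ('rfc', RandomForestClassifier(n_estimators=50, random_state=1)),
])

# The pipeline vectorizes internally, so it accepts raw strings
text_clf.fit(train_docs, train_labels)
text_preds = text_clf.predict(test_docs)
print(text_preds)
```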

In [38]:
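# Code here to evaluate your model on the test set
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
y_pred2 = rfc2.predict(X_test)
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred2)).plot()

# ^ why doesn't this work?
# X_test is still raw text. The model was fit on vectorized data,
# so predict needs X_test_vectorized instead.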

ValueError: could not convert string to float: "In 1987, the United Church of Christ's Commission on Racial Justice commissioned one of the first studies on hazardous waste in communities of color. A few years later -  28 years ago this month - delegates to the First National People of Color Environmental Leadership Summit adopted 17 principles of environmental justice. But in the years since, the federal government has largely failed to live up to the vision these trailblazing leaders outlined, and to its responsibilities to the communities they represent. \nFrom predominantly black neighborhoods in Detroit to Navajo communities in the southwest to Louisiana’s Cancer Alley, industrial pollution has been concentrated in low-income communities for decades - communities that the federal government has tacitly written off as so-called “sacrifice zones.” But it’s not just about poverty, it’s also about race. A seminal study found that black families are more likely to live in neighborhoods with higher concentrations of air pollution than white families - even when they have the same or more income. A more recent study found that while whites largely cause air pollution, Blacks and Latinxs are more likely to breathe it in. Unsurprisingly, these groups also experience higher rates of childhood asthma. And many more low-income and minority communities are exposed to toxins in their water - including lead and chemicals from industrial and agricultural run-off.\nENVIRONMENTAL RACISM ACROSS THE U.S.\nSources: Michigan Radio (for Detroit, MI data); California Department of Public Health (for Los Angeles County, CA data); Coral Davenport & Campbell Robertson, “Resettling the First American ‘Climate Refugees,’” New York Times (May 2, 2016) (for Isle de Jean Charles, LA data). View in full screen.\nAnd these studies don’t tell the whole story. As I’ve traveled this country, I’ve heard the human stories as well. 
In Detroit, I met with community members diagnosed with cancer linked to exposure to toxins after years of living in the shadow of a massive oil refinery. In New Hampshire, I talked with mothers fighting for clean drinking water free of harmful PFAS chemicals for their children. In South Carolina, I've heard the stories of the most vulnerable coastal communities who face the greatest threats, from not just sea-level rise, but a century of encroaching industrial polluters. In West Virginia, I saw the consequences of the coal industry’s abandonment of the communities that made their shareholders and their executives wealthy - stolen pensions, poisoned miners, and ruined land and water.\nWe didn’t get here by accident. Our crisis of environmental injustice is the result of decades of discrimination and environmental racism compounding in communities that have been overlooked for too long. It is the result of multiple choices that put corporate profits before people, while our government looked the other way. It is unacceptable, and it must change. \nJustice cannot be a secondary concern - it must be at the center of our response to climate change. The Green New Deal commits us to a “just transition” for all communities and all workers. But we won’t create true justice by cleaning up polluted neighborhoods and tweaking a few regulations at the EPA. We also need to prioritize communities that have experienced historic disinvestment, across their range of needs: affordable housing, better infrastructure, good schools, access to health care, and good jobs. We need strong, resilient communities who are prepared and properly resourced to withstand the impacts of climate change. We need big, bottom-up change - focused on, and led by, members of these communities. 
\nADD YOUR NAME TO SUPPORT OUR PLAN\nWe need big, bottom-up change - focused on, and led by, the communities who have been in this fight since the beginning.\nEmail\nZip\nSUBMIT\nNO COMMUNITY LEFT BEHIND\nThe same communities that have borne the brunt of industrial pollution are now on the front lines of climate change, often getting hit first and worst. In response, local community leaders are leading the fight to hold polluters responsible and combat the effects of the climate crisis.  In Detroit’s 48217 zip code, for example, community members living in the midst of industrial pollution told me how they have banded together to identify refinery leakages and inform their neighbors. In Alabama and Mississippi, I met with residents of formerly redlined neighborhoods who spoke to me about their fight against drinking water pollution caused by inadequate municipal sewage systems. Tribal Nations, which have been disproportionately impacted by environmental racism and the effects of climate change, are leading the way in climate resilience and adaptation strategies, and in supporting healthy ecosystems. The federal government must do more to support and uplift the efforts of these and other communities. Here’s how we can do that:\nImprove environmental equity mapping. The EPA currently maps communities based on basic environmental and demographic indicators, but more can be done across the federal government to identify at-risk communities. We need a rigorous interagency effort to identify cumulative environmental health disparities and climate vulnerabilities and cross-reference that data with other indicators of socioeconomic health. We’ll use these data to adjust permitting rules under Clean Air and Clean Water Act authorities to better consider the impact of cumulative and overlapping pollution, and we’ll make them publicly available online to help communities measure their own health.\nImplement an equity screen for climate investments. 
Identifying at-risk communities is only the first step. The Green New Deal will involve deploying trillions of dollars to transform the way we source and use energy. In doing so, the government must prioritize resources to support vulnerable communities and remediate historic injustices. My friend Governor Jay Inslee rightly challenged us to fund the most vulnerable communities first, and both New York and California have passed laws to direct funding specifically to frontline and fenceline communities. The federal government should do the same. I’ll direct one-third of my proposed climate investment into the most vulnerable communities - a commitment that would funnel at least $1 trillion into these areas over the next decade. \nStrengthen tools to mitigate environmental harms. Signed into law in 1970, the National Environmental Policy Act provides the original authority for many of our existing environmental protections. But even as climate change has made it clear that we must eliminate our dependence on fossil fuels, the Trump Administration has tried to weaken NEPA with the goal of expediting even more fossil fuel infrastructure projects. At the same time, the Trump Administration has moved to devalue the consideration of climate impacts in all federal decisions. This is entirely unacceptable in the face of the climate emergency our world is facing. As president, I would mandate that all federal agencies consider climate impacts in their permitting and rulemaking processes. Climate action needs to be mainstreamed in everything the federal government does. But we also need a standard that requires the government to do more than merely “assess” the environmental impact of proposed projects - we need to mitigate negative environmental impacts entirely. \nBeyond that, a Warren Administration will do more to give the people who live in a community a greater say in what is sited there - too often today, local desires are discounted or disregarded. 
And when Tribal Nations are involved, projects should not proceed unless developers have obtained the free, prior and informed consent of the tribal governments concerned. I’ll use the full extent of my executive authority under NEPA to protect these communities and give them a voice in the process. And I’ll fight to improve the law to reflect the realities of today’s climate crisis. \nBuild wealth in frontline communities. People of color are more likely to live in neighborhoods that are vulnerable to climate change risks or where they’re subject to environmental hazards like pollution. That’s not a coincidence - decades of racist housing policy and officially sanctioned segregation that denied people of color the opportunity to build wealth also denied them the opportunity to choose the best neighborhood for their families. Then, these same communities were targeted with the worst of the worst mortgages before the financial crisis, while the government looked the other way. My housing plan includes a first-of-its-kind down-payment assistance program that provides grants to long-term residents of formerly redlined communities so that they can buy homes in the neighborhood of their choice and start to build wealth, beginning to reverse that damage. It provides assistance to homeowners in these communities who still owe more than their homes were worth, which can be used to preserve their homes and revitalize their communities. These communities should have the opportunity to lead us in the climate fight, and have access to the economic opportunities created by the clean energy sector. With the right investments and with community-led planning, we can lift up communities that have experienced historic repression and racism, putting them on a path to a more resilient future.\nExpand health care. People in frontline communities disproportionately suffer from certain cancers and other illnesses associated with environmental pollution. 
To make matters worse, they are less likely to have access to quality health care. Under Medicare for All, everyone will have high quality health care at a lower cost, allowing disadvantaged communities to get lifesaving services. And beyond providing high quality coverage for all, the simplified Medicare for All system will make it easier for the federal government to quickly tailor health care responses to specific environmental disasters in affected communities when they occur.\nResearch equity. For years we’ve invested in broad-based strategies that are intended to lift all boats, but too often leave communities of color behind. True justice calls for more than ‘one-size-fits-all’ solutions - instead we need targeted strategies that take into account the unique challenges individual frontline communities face. I’ve proposed a historic $400 billion investment in clean energy research and development. We’ll use that funding to research place-based interventions specifically targeting the communities that need more assistance.\nPOLLUTION EXPOSURE BY POPULATION (2003–2015)\nSource: Christopher W. Tessum et al., “Inequity in consumption of goods and services adds to racial–ethnic disparities in air pollution exposure,” Proceedings of the National Academy of Sciences (March 2019). View in full screen.\nNO WORKER LEFT BEHIND\nThe climate crisis will leave no one untouched. But it also represents a once-in-a-generation opportunity: to create millions of good-paying American jobs in clean and renewable energy, infrastructure, and manufacturing; to unleash the best of American innovation and creativity; to rebuild our unions and create real progress and justice for workers; and to directly confront the racial and economic inequality embedded in our fossil fuel economy. \nThe task before us is huge and demands all of us to act. It will require massive retrofits to our nation’s infrastructure and our manufacturing base. 
It will also require readjusting our economic approach to ensure that communities of color and others who have been systematically harmed from our fossil fuel economy are not left further behind during the transition to clean energy.\nBut it is also an opportunity. We’ll need millions of workers: people who know how to build things and manufacture them; skilled and experienced contractors to plan and execute large construction and engineering projects; and training and joint labor management apprenticeships to ensure a continuous supply of skilled, available workers. This can be a great moment of national unity, of common purpose, of lives transformed for the better. But we cannot succeed in fighting climate change unless the people who have the skills to get the job done are in the room as full partners. \nWe also cannot fight climate change with a low-wage economy.\nWorkers should not be forced to make an impossible choice between fossil fuel industry jobs with superior wages and benefits and green economy jobs that pay far less. For too long, there has been a tension between transitioning to a green economy and creating good, middle class, union jobs. In a Warren Administration we will do both: creating good new jobs through investments in a clean economy coupled with the strongest possible protections for workers. For instance, my Green Manufacturing plan makes a $1.5 trillion procurement commitment to domestic manufacturing contingent on companies providing fair wages, paid family and medical leave, fair scheduling practices, and collective bargaining rights. Similarly, my 100% Clean Energy Plan will require retrofitting our nation’s buildings, reengineering our electrical grid, and adapting our manufacturing base - creating good, union jobs, with prevailing wages determined through collective bargaining, for millions of skilled and experienced workers. 
\nOur commitment to a Green New Deal is a commitment to a better future for the working people of our country. \nAnd it starts with a real commitment to workers from the person sitting in the White House: I will fight for your job, your family, and your community like I would my own. But there’s so much more we can do to take care of America’s workers before, during, and after this transition. Here are a few ways we can start: \nHonor our commitment to fossil fuel workers. Coal miners, oil rig workers, pipeline builders and millions of other workers have given their life’s blood to build the infrastructure that powered the American economy throughout the 20th century. In return, they deserve more than platitudes - and if we expect them to use their skills to help reengineer America, we owe them a fair day’s pay for the work we need them to do. I’m committed to providing job training and guaranteed wage and benefit parity for workers transitioning into new industries. And for those Americans who choose not to find new employment and wish to retire with dignity, we’ll ensure full financial security, including promised pensions and early retirement benefits. \nDefend worker pensions, benefits, and secure retirement. Together, we will ensure that employers and our government honor the promises they made to workers in fossil fuel industries. I’ve fought for years to protect pensions and health benefits for retired coal workers, and I’ll continue fighting to maintain the solvency of multi-employer pension plans. As president, I’ll protect those benefits that fossil fuel workers have earned. My plan to empower American workers commits to defending pensions, recognizing the value of defined-benefit pensions, and pushing to pass the Butch-Lewis Act to create a loan program for the most financially distressed pension plans in the country. 
And my Social Security plan would increase benefits by $200 a month for every beneficiary, lifting nearly 5 million seniors out of poverty and expanding benefits for workers with disabilities and their families. \nCreate joint safety-health committees. In 2016, more than 50,000 workers died from occupational-related diseases. And since the beginning of his administration, Trump has rolled back rules and regulations that limit exposure to certain chemicals and requirements around facility safety inspections, further jeopardizing workers and the community around them. When workers have the power to keep themselves safe, they make their communities safer too. A Warren Administration will reinstate the work safety rules and regulations Trump eliminated, and will work to require large companies to create joint safety-health committees with representation from workers and impacted communities. \nForce fossil fuel companies to honor their obligations. As a matter of justice, we should tighten bankruptcy laws to prevent coal and other fossil fuel companies from evading their responsibility to their workers and to the communities that they have helped to pollute. In the Senate, I have fought to improve the standing of coal worker pensions and benefits in bankruptcy - as president, I will work with Congress to pass legislation to make these changes a reality.  \nAnd as part of our commitment, we must take care of all workers, including those who were left behind decades ago by the fossil fuel economy. Although Franklin D. Roosevelt’s New Deal is the inspiration for this full scale mobilization of the federal government to defeat the climate crisis, it was not perfect. The truth is that too often, many New Deal agencies and policies were tainted by structural racism. 
And as deindustrialization led to prolonged disinvestment, communities of color were too often both the first to lose their job base, and the first place policymakers thought of to dump the refuse of the vanished industries. Now there is a real risk that poor communities dependent on carbon fuels will be asked to bear the costs of fighting climate change on their own. We must take care not to replicate the failings and limitations of the original New Deal as we implement a Green New Deal and transition our economy to 100% clean energy. Instead we need to build an economy that works for every American - and leaves no one behind.\nPRIORITIZING ENVIRONMENTAL JUSTICE AT THE HIGHEST LEVELS\nAs we work to enact a Green New Deal, our commitment to environmental justice cannot be an afterthought - it must be central to our efforts to fight back against climate change. That means structuring our government agencies to ensure that we’re centering frontline and fenceline communities in implementing a just transition. It means ensuring that the most vulnerable have a voice in decision-making that impacts their communities, and direct access to the White House itself. Here’s how we’ll do that:\nElevate environmental justice at the White House. I’ll transform the Council on Environmental Quality into a Council on Climate Action with a broader mandate, including making environmental justice a priority. I’ll update the 1994 executive order that directed federal agencies to make achieving environmental justice part of their missions, and revitalize the cabinet-level interagency council on environmental justice. We will raise the National Environmental Justice Advisory Council to report directly to the White House, bringing in the voices of frontline community leaders at the highest levels. 
And I will bring these leaders to the White House for an environmental justice summit within my first 100 days in office, to honor the contributions of frontline activists over decades in this fight and to listen to ideas for how we can make progress.  \nEmpower the EPA to support frontline communities. The Trump Administration has proposed dramatic cuts to the EPA, including to its Civil Rights office, and threatened to eliminate EPA’s Office of Environmental Justice entirely. I’ll restore and grow both offices, including by expanding the Community Action for a Renewed Environment (CARE) and Environmental Justice Small Grant programs. We’ll condition these competitive grant funds on the development of state- and local-level environmental justice plans, and ensure that regional EPA offices stay open to provide support and capacity. But it’s not just a matter of size. Historically, EPA’s Office of Civil Rights has rejected nine out of ten cases brought to it for review. In a Warren Administration, we will aggressively pursue cases of environmental discrimination wherever they occur. \nBolster the CDC to play a larger role in environmental justice. The links between industrial pollution and negative public health outcomes are clear. A Warren Administration will fully fund the Center for Disease Control’s environmental health programs, such as childhood lead poisoning prevention, and community health investigations. We will also provide additional grant funding for independent research into environmental health effects.\nDiminish the influence of Big Oil. Powerful corporations rig the system to work for themselves, exploiting and influencing the regulatory process and placing industry representatives in positions of decision-making authority within agencies. 
My plan to end Washington corruption would slam shut the revolving door between industry and government, reducing industry’s ability to influence the regulatory process and ensuring that the rules promulgated by our environmental agencies reflect the needs of communities, not the fossil fuel industry. \nRIGHT TO AFFORDABLE ENERGY AND CLEAN WATER\nNearly one-third of American households struggle to pay their energy bills, and Native American, Black, and Latinx households are more likely to be energy insecure. Renters are also often disadvantaged by landlords unwilling to invest in safer buildings, weatherization, or cheaper energy. And clean energy adoption is unequal along racial lines, even after accounting for differences in wealth. I have a plan to move the United States to 100% clean, renewable, and zero-emission energy in electricity generation by 2035 - but energy justice must be an integral part of our transition to clean energy. Here’s what that means:\nAddress high energy cost burdens. Low-income families, particularly in rural areas, are spending too much of their income on energy, often the result of older or mobile homes that are not weatherized or that lack energy efficient upgrades. I’ve committed to meet Governor Inslee’s goal of retrofitting 4% of U.S. buildings annually to increase energy efficiency - and we’ll start that national initiative by prioritizing frontline and fenceline communities. In addition, my housing plan includes over $10 billion in competitive grant programs for communities that invest in well-located affordable housing - funding that can be used for modernization and weatherization of homes, infrastructure, and schools. It also targets additional funding to tribal governments, rural communities, and jurisdictions - often majority minority - where homeowners are still struggling with the aftermath of the 2008 housing crash. 
Energy retrofits can be a large source of green jobs, and I’m committed to ensuring that these are good jobs, with full federal labor protections and the right to organize. \nSupport community power. Consumer-owned energy cooperatives, many of which were established to electrify rural areas during the New Deal, serve an estimated  42 million people across our country. While some co-ops are beginning to transition their assets to renewable energy resources, too many are locked into long-term contracts that make them dependent on coal and other dirty fuels for their power. To speed the transition to clean energy, my administration will offer assistance to write down debt and restructure loans to help cooperatives get out of long-term coal contracts, and provide additional low- or no-cost financing for zero-carbon electricity generation and transmission projects for cooperatives via the Rural Utilities Service. I’ll work with Congress to extend and expand clean energy bonds to allow community groups and nonprofits without tax revenue to access  clean energy incentives. I’ll also provide dedicated support for the four Power Marketing Administrations, the Tennessee Valley Authority, and the Appalachian Regional Commission to help them build publicly-owned clean energy assets and deploy clean power to help communities transition off fossil fuels. Accelerating the transition to clean energy will both reduce carbon emissions, clean up our air,  and help bring down rural consumers’ utility bills.\nProtect local equities. Communities that host large energy projects are entitled to receive a share of the benefits. But too often, large energy companies are offered millions in tax subsidies to locate in a particular area -- without any commitment that they will make a corresponding commitment in that community. 
Community Benefit Agreements can help address power imbalances between project developers and low-income communities by setting labor, environmental, and transparency standards before work begins. I’ll make additional federal subsidies or tax benefits for large utility projects contingent on strong Community Benefits Agreements, which should include requirements for prevailing wages and collective bargaining rights. And I’ll insist on a clawback provision if a company doesn’t hold up its end of the deal. If developers work with communities to ensure that everyone benefits from clean energy development, we will be able to reduce our emissions faster. \nIt’s simple: access to clean water is a basic human right. Water quality is an issue in both urban and rural communities. In rural areas, for example, runoff into rivers and streams by Big Agriculture has poisoned local drinking water. In urban areas, lack of infrastructure investment has resulted in lead and other poisons seeping into aging community water systems. We need to take action to protect our drinking water. Here’s how we can do that: \nInvest in our nation’s public water systems. America’s water is a public asset and should be owned by and for the public. A Warren Administration will end decades of disinvestment and privatization of our nation’s water system -- our government at every level should invest in safe, affordable drinking water for all of us.\nIncrease and enforce water quality standards. Our government should enforce strict regulations to ensure clean water is available to all Americans. I’ll restore the Obama-era water rule that protected our lakes, rivers, and streams, and the drinking water they provide. We also need a strong and nationwide safe drinking water standard that covers PFAS and other chemicals. A Warren Administration will fully enforce Safe Drinking Water Act standards for all public water systems. 
I’ll aggressively regulate chemicals that make their way into our water supply, including by designating PFAS as a hazardous substance.\nFund access to clean water. Our clean drinking water challenge goes beyond lead, and beyond Flint and Newark. To respond, a Warren Administration will commit to fully capitalize the Drinking Water State Revolving Fund and the Clean Water State Revolving Fund to refurbish old water infrastructure and support ongoing water treatment operations and maintenance, prioritizing the communities most heavily impacted by inadequate water infrastructure. In rural areas, I’ll increase funding for the Conservation Stewardship Program to $15 billion annually, empowering family farmers to help limit the agricultural runoff that harms local wells and water systems. To address lead specifically, we will establish a lead abatement grant program with a focus on schools and daycare centers, and commit to remediating lead in all federal buildings. We’ll provide a Lead Safety Tax Credit for homeowners to invest in remediation. And a Warren Administration will also fully fund IDEA and other support programs that help children with developmental challenges as a result of lead exposure.\nPROTECTING THE MOST VULNERABLE DURING CLIMATE-RELATED DISASTERS\nIn 2018, the U.S. was home to the world’s three costliest environmental catastrophes. And while any community can be hit by a hurricane, flood, extreme weather, or fire, the impact of these kinds of disasters are particularly devastating for low-income communities, people with disabilities, and people of color. Take Puerto Rico for example. When Hurricane Maria hit the island, decades of racism and neglect were multiplied by the government’s failure to prepare and Trump’s racist post-disaster response - resulting in the deaths of at least 3,000 Puerto Ricans and long-term harm to many more. 
Even as we fight climate change, we must also prepare for its impacts - building resiliency not just in some communities, but everywhere. Here’s how we can start to do that:\nInvest in pre-disaster mitigation. For every dollar invested in mitigation, the government and communities save $6 overall. But true to form, the Trump Administration has proposed to steep cuts to  FEMA’s Pre-Disaster Mitigation Program, abandoning communities just as the risk of climate-related disasters is on the rise. As president, I’ll invest in programs that help vulnerable communities build resiliency by quintupling this program’s funding. \nBetter prepare for flood events. When I visited Pacific Junction, Iowa, I saw scenes of devastation: crops ruined for the season, cars permanently stalled, a water line 7 or 8 feet high in residents’ living rooms. And many residents in Pacific Junction fear that this could happen all over again next year. Local governments rely on FEMA’s flood maps, but some of these maps haven’t been updated in decades. In my first term as president, I will direct FEMA to fully update flood maps with forward-looking data, prioritizing and including frontline communities in this process. We’ll raise standards for new construction, including by reinstating the Federal Flood Risk Management Standard. And we’ll make it easier for vulnerable residents to move out of flood-prone properties - including by buying back those properties for low-income homeowners at a value that will allow them to relocate, and then tearing down the flood-prone properties, so we can protect everyone.\nMitigate wildfire risk. We must also invest in improved fire mapping and prevention programs. In a Warren Administration, we will dramatically improve fire mapping and prevention by investing in advanced modeling with a focus on helping the most vulnerable - incorporating not only fire vulnerability but community demographics. 
We will prioritize these data to invest in land management, particularly near the most vulnerable communities, supporting forest restoration, lowering fire risk, and creating jobs all at once. We will also invest in microgrid technology, so that we can de-energize high-risk areas when required without impacting the larger community’s energy supply. And as president, I will collaborate with Tribal governments on land management practices to reduce wildfires, including by incorporating traditional ecological practices and exploring co-management and the return of public resources to indigenous protection wherever possible. \nPrioritize at-risk populations in disaster planning and response. When the most deadly fire in California’s history struck the town of Paradise last November, a majority of the victims were disabled or elderly. People with disabilities face increased difficulties in evacuation assistance and accessing critical medical care. For people who are homeless, disasters exacerbate existing challenges around housing and health. And fear of deportation can deter undocumented people from contacting emergency services for help evacuating or from going to an emergency shelter. As president, I will strengthen rules to require disaster response plans to uphold the rights of vulnerable populations. In my immigration plan, I committed to putting in place strict guidelines to protect sensitive locations, including emergency shelters. We’ll also develop best practices at the federal level to help state and local governments develop plans for at-risk communities - including for extreme heat or cold - and require that evacuation services and shelters are fully accessible to people with disabilities. During emergencies, we will work to ensure that critical information is shared in ways that reflect the diverse needs of people with disabilities and other at-risk communities, including through ASL and Braille and languages spoken in the community. 
We will establish a National Commission on Disability Rights and Disasters, ensure that federal disaster spending is ADA compliant, and support people with disabilities in disaster planning. We will make certain that individuals have ongoing access to health care services if they have to leave their community or if there is a disruption in care.  And we will ensure that a sufficient number of disability specialists are present in state emergency management teams and FEMA’s disaster response corps. \nEnsure a just and equitable recovery. In the aftermath of Hurricane Katrina, disaster scammers and profiteers swarmed, capitalizing on others’ suffering to make a quick buck. And after George W. Bush suspended the Davis-Bacon Act, the doors were opened for contractors to under-pay and subject workers to dangerous working conditions, particularly low-income and immigrant workers. As president, I’ll put strong protections in place to ensure that federal tax dollars go toward community recovery, not to line the pockets of contractors. And we must maintain high standards for workers even when disaster strikes. \nStudies show that the white and wealthy receive more federal disaster aid, even though they are most able to financially withstand a disaster.\nThis is particularly true when it comes to housing - FEMA’s programs are designed to protect homeowners, even as homeownership has slipped out of reach for an increasing number of Americans. As president, I will reform post-disaster housing assistance to better protect renters, including a commitment to a minimum of one-to-one replacement for any damaged federally-subsidized affordable housing, to better protect low-income families. I will work with Congress to amend the Stafford Act to make grant funding more flexible to allow families and communities to rebuild in more resilient ways. 
And we will establish a competitive grant program, based on the post-Sandy Rebuild by Design pilot, to offer states and local governments the opportunity to compete for additional funding for creative resilience projects.\nUnder a Warren Administration, we will monitor post-disaster recovery to help states and local governments better understand the long-term consequences and effectiveness of differing recovery strategies, including how to address climate gentrification, to ensure equitable recovery for all communities. We’ll center a right to return for individuals who have been displaced during a disaster and prioritize the voices of frontline communities in the planning of their return or relocation. And while relocation should be a last resort, when it occurs, we must improve living standards and keep communities together whenever possible.\nHOLDING POLLUTERS ACCOUNTABLE\nIn Manchester, Texas, Hurricane Harvey’s damage wasn’t apparent until after the storm had passed - when a thick, chemical smell started wafting through the majority Latinx community, which is surrounded by nearly 30 refineries and chemical plants. A tanker failure had released 1,188 pounds of benzene into the air, one of at least one hundred area leaks that happened in Harvey’s aftermath. But because regulators had turned off air quality and toxic monitoring in anticipation of the storm, the leaks went unnoticed and the community uninformed. \nAVERAGE NEIGHBORHOOD TOXIC CONCENTRATION VALUES BY RACE AND INCOME CATEGORY (2000)\nSource: Liam Downey & Brian Hawkins, “Race, Income, and Environmental Inequality in the United States,” Sociological Perspectives (Dec. 2008). View in full screen.\nThis should have never been allowed to happen. But Manchester is also subject to 484,000 pounds of toxic chemical leaks on an average year. That’s not just a tragedy - it’s an outrage. We must hold polluters accountable for their role in ongoing, systemic damage in frontline communities. 
As president, I will use all my authorities to hold companies accountable for their role in the climate crisis. Here’s how we can do that: \nExercise all the oversight tools of the federal government. A Warren Administration will encourage the EPA and Department of Justice to aggressively go after corporate polluters, particularly in cases of environmental discrimination. We need real consequences for corporate polluters that break our environmental law. That means steep fines, which we will reinvest in impacted communities. And under my Corporate Executive Accountability Act, we’ll press for criminal penalties for executives when their companies hurt people through criminal negligence.\nUse the power of the courts. Thanks to a Supreme Court decision, companies are often let completely off the hook, even when their operations inflict harm on thousands of victims each year. I’ll work with Congress to create a private right of action for environmental harm at the federal level, allowing individuals and communities impacted by environmental discrimination to sue for damages and hold corporate polluters accountable.\nReinstitute the Superfund Waste Tax. There are over 1300 remaining Superfund sites across the country, many located in or adjacent to frontline communities. So-called “orphan” toxic waste clean-ups were originally funded by a series of excise taxes on the petroleum and chemical industries. But thanks to Big Oil and other industry lobbyists, when that tax authority expired in 1995 it was not renewed. Polluters must pay for the consequences of their actions - not leave them for the communities to clean up. I’ll work with Congress to reinstate and then triple the Superfund tax, generating needed revenue to clean up the mess.\nHold the finance industry accountable for its role in the climate crisis. 
Financial institutions and the insurance industry underwrite and fund fossil fuel investments around the world, and can play a key role in stopping the climate crisis. Earlier this year, Chubb became the first U.S. insurer to commit to stop insuring coal projects, a welcome development. Unfortunately, many banks and insurers seem to be moving in the opposite direction. In fact, since the Paris Agreement was signed, U.S. banks including JPMorgan Chase, Wells Fargo, Citigroup, and Bank of America have actually increased their fossil fuel investments. And there is evidence that big banks are replicating a tactic they first employed prior to the 2008 crash - shielding themselves from climate losses by selling the mortgages most at risk from climate impacts to Fannie Mae and Freddie Mac to shift the burden off their books and onto taxpayers at a discount. \nTo accelerate the transition to clean energy, my Climate Risk Disclosure Act would require banks and other companies to disclose their greenhouse gas emissions and price their exposure to climate risk into their valuations, raising public awareness of just how dependent our economy is on fossil fuels. And let me be clear: in a Warren Administration, they will no longer be allowed to shift that burden to the rest of us.\nHELP OUR CAMPAIGN KEEP FIGHTING.\nWe're counting on grassroots donors to make this campaign possible.\nIf you've saved your information with ActBlue Express, your donation will go through immediately.\n$15\n$28\n$50\n$100\n$250\nOTHER"

In [35]:
# cross_val_score was not included in the basic imports at the top
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=rfc2, X=X_train_vectorized, y=y_train, cv=5)
scores

In [40]:
print(rfc2.score(X_train_vectorized, y_train)) # train accuracy
print(rfc2.score(X_test_vectorized, y_test)) # test accuracy

1.0
0.9375


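The model is very overfit: it scores 1.0 on the train data but only 0.9375 on the test data. How do we address this? We let the count vectorizer run unchecked on all of the tokens in the train data --> too many features --> too much complexity --> overfitting

We should tune the CountVectorizer, for example by adding a `max_features` parameter to cap the vocabulary size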

In [None]:
# Add max_features here to cap the vocabulary size (the value still needs to be tuned)
cv = CountVectorizer(stop_words='english', strip_accents='unicode')
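
As a rough illustration of what `max_features` does, here is a minimal sketch on a tiny made-up corpus. The documents and the cap of 3 are assumptions chosen only for the demo; they are not taken from the notebook's dataset, and a real value would be tuned (for example with a grid search).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration only)
docs = [
    "clean energy jobs for workers",
    "clean water for communities",
    "energy jobs and clean water",
    "workers deserve fair wages",
]

# No cap: every non-stop-word token becomes a feature
cv_full = CountVectorizer(stop_words='english', strip_accents='unicode')
X_full = cv_full.fit_transform(docs)

# Capped: keep only the 3 most frequent tokens across the corpus
cv_capped = CountVectorizer(stop_words='english', strip_accents='unicode',
                            max_features=3)
X_capped = cv_capped.fit_transform(docs)

print(X_full.shape, X_capped.shape)
print(sorted(cv_capped.vocabulary_))
```

Fewer columns means fewer ways for the random forest to memorize the training documents, which is why capping the vocabulary can reduce overfitting.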

# 4) Clustering

## Clustering Concepts

### Describe how the K-Means algorithm updates its cluster centers (centroids) after initialization.

In [None]:
# call_on_students(1)

Process (see the animation in the nb under methods), with 4 clusters set as an example:
1. Randomly choose 4 “centroids” = data points that we randomly decide to become our “center points”
    - n_clusters parameter = # of clusters = # of centroids. Fewer = easier to visualize
2. We assign all the other data points in the dataset to their closest centroid, forming n clusters
3. We take the mean of each of those 4 clusters and move each centroid to the true mean of its cluster
    - Goal is to get the tightest clusters possible: minimizing total intra-cluster distance (sum of squared distances from each centroid to the data points in its cluster)
    - Ideally the clusters also end up far apart (large inter-cluster distance between data points in one cluster and data points in other clusters), although K-Means only directly minimizes the intra-cluster term
4. Using the new centroids, we repeat from step 2 until the centroids stop moving (or the maximum number of iterations is reached)
5. Problem: the resultant 4 clusters will look different depending on the 4 random centroids you chose in step 1
    - Solution: run the whole process a bunch of times with a bunch of different random starting points, then the algorithm keeps the run with the lowest inertia
    - n_init parameter = the number of times the k-means algorithm is run with different centroid seeds
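
The steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's actual implementation: the helper name `kmeans_step`, the toy blobs, and k = 2 are all assumptions made for the demo.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One assignment + update pass (illustrative helper, not sklearn's code)."""
    # Step 2: assign every point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of the points assigned to it
    # (an empty cluster keeps its old centroid)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

rng = np.random.default_rng(0)
# Two toy blobs (made-up data for illustration)
X_toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                   rng.normal(5, 0.5, size=(50, 2))])

# Step 1: pick k random data points as the starting centroids
k = 2
centroids = X_toy[rng.choice(len(X_toy), size=k, replace=False)]

# Step 4: repeat until the centroids stop moving
for _ in range(100):
    labels, new_centroids = kmeans_step(X_toy, centroids)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

# Sum of squared distances from each point to its assigned centroid
inertia = ((X_toy - centroids[labels]) ** 2).sum()
print(centroids, inertia)
```

Step 5 would wrap this whole loop in an outer loop over several random starts and keep the result with the smallest `inertia`.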

### What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# call_on_students(1)

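Inertia: squared intra-cluster distance. The sum of squared distances (SSE) from each sample to its closest cluster center, weighted by the sample weights if provided. Squared = outliers have a big effect
- Available after fitting as `model.inertia_`
- K-Means runs n_init times with different starting centroids and keeps the run with the lowest inertia as the best estimator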

Elbow Method:
- Calculate the inertia for a bunch of different k’s
- Plot the SSEs at the various k’s (code in the nb): k is on the x axis, inertia is on the y axis
- Inertia will always decrease as you increase k, b/c no matter what, the more clusters you add (k), the tighter those clusters will be (lower intra-cluster distance, smaller inertia)
- Therefore, we can't just pick the k with the lowest inertia. We are looking for the elbow: the k value where the curve bends, i.e. the last k that gives a big drop in inertia before the improvements level off.
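
A minimal sketch of the inertia calculation behind the elbow method, assuming the scaled iris data as a stand-in dataset and an arbitrary range of k from 1 to 9 (neither choice comes from the notes above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Stand-in data for the demo: standard-scaled iris features
X_demo = StandardScaler().fit_transform(load_iris()['data'])

ks = list(range(1, 10))
inertias = []
for k in ks:
    # n_init=10 restarts per k; the fitted model keeps the lowest-inertia run
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting these with `plt.plot(ks, inertias)` gives the elbow curve; the drop between consecutive values shrinks as k grows, and the bend is where that shrinkage becomes obvious.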


### What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [None]:
# call_on_students(1)

Silhouette Coefficient: more interpretable than the elbow method, b/c sometimes the inertia plot can have multiple “elbows”/bends
- `sklearn.metrics.silhouette_score`
- Calculated for each data point using:
    - a = another measure of intra-cluster distance, diff from inertia though. Not the sum of squares, just the avg distance between a point and all the other points in its cluster
    - b = INTER-cluster distance: the avg distance between a point and all the points in the nearest other cluster. Trying to maximize
    - Silhouette coefficient for the point = (b - a) / max(a, b)
- Silhouette coefficient ranges from -1 to 1
    - Closer to 1 = more clearly defined clusters
    - Close to 0 = overlapping clusters
    - Closer to -1 = data points are incorrectly assigned
- Calculate the silhouette coefficient for each data point, then average them to get the silhouette score
- Unlike inertia, it does not automatically improve as k increases, so we just pick the k with the highest silhouette score (peak in the graph below)
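
To make the a/b calculation concrete, here is a small sketch that computes the coefficient by hand and compares it with scikit-learn. The helper `silhouette_for_point`, the use of scaled iris data, and the K-Means labels with k = 3 are assumptions for the demo; the helper also assumes no cluster contains only a single point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris()['data'])
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

def silhouette_for_point(i, X, labels):
    """Hand-rolled silhouette coefficient for one point (illustrative helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a excludes the point itself
    a = dists[same].mean()
    # b: mean distance to the points of the nearest *other* cluster
    b = min(dists[labels == other].mean()
            for other in set(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_for_point(i, X_demo, labels)
                   for i in range(len(X_demo))])

print(manual[0], silhouette_samples(X_demo, labels)[0])
print(manual.mean(), silhouette_score(X_demo, labels))
```

The per-point values should agree with `silhouette_samples`, and their mean with `silhouette_score`, up to floating-point error.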


## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [41]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

In [42]:
X

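|     | 0   | 1   | 2   | 3   |
|-----|-----|-----|-----|-----|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |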


### Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?


In [None]:
# call_on_students(1)

Need to standard scale - clustering is distance based, so features measured on larger scales would otherwise dominate the distance calculations

In [54]:
# Code to preprocess the data
scaler = StandardScaler()

# Name the processed data X_processed
X_processed = scaler.fit_transform(X)

In [58]:
X_processed

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
      

### Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [None]:
# call_on_students(1)

In [127]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate and fit
agc = AgglomerativeClustering(n_clusters = 2)
preds = agc.fit_predict(X_processed)

# or just .fit without assigning it to preds, 
# then do preds = agc.labels_ 
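
A quick sketch of the two equivalent ways to get the cluster labels mentioned in the comment above. It rebuilds the scaled iris data under the assumed name `X_demo` so that it runs on its own:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris()['data'])

# Option 1: fit and get the labels in one call
preds_a = AgglomerativeClustering(n_clusters=2).fit_predict(X_demo)

# Option 2: fit first, then read the labels_ attribute
model = AgglomerativeClustering(n_clusters=2).fit(X_demo)
preds_b = model.labels_

print(np.array_equal(preds_a, preds_b))
```

Note that `AgglomerativeClustering` has no separate `predict` method for new data, so the labels always describe the data the model was fit on.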

In [56]:
# Calculate a silhouette score
from sklearn.metrics import silhouette_score

silhouette_score(X_processed, preds)

0.5770346019475989

### Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [57]:
# call_on_students(1)

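def cluster_test(n_clusters, X):
    """Fit agglomerative clustering with n_clusters on X.

    Note: returns the silhouette score together with the labels
    (as a tuple) instead of printing the score.
    """
    agc = AgglomerativeClustering(n_clusters = n_clusters)
    preds = agc.fit_predict(X)
    return silhouette_score(X, preds), agc.labels_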

cluster_test(3, X_processed)

(0.446689041028591,
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
        1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
        2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
        2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

In [130]:
for k in range(2, 10):
    print(cluster_test(k, X_processed))

(0.5770346019475989, array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
(0.446689041028591, array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
       2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0
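
Printing the full label arrays makes the scores hard to compare. Below is a sketch of a variant that follows the prompt more literally (print the score, return only the labels) and then collects the scores per k. The names `cluster_test_verbose` and `X_demo` are hypothetical and not part of the original notebook.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris()['data'])

def cluster_test_verbose(n_clusters, X):
    """Hypothetical variant: print the silhouette score, return only the labels."""
    model = AgglomerativeClustering(n_clusters=n_clusters)
    model.fit(X)
    score = silhouette_score(X, model.labels_)
    print(f"n_clusters={n_clusters}: silhouette={score:.4f}")
    return model.labels_

# Collect one silhouette score per candidate k
scores = {}
for k in range(2, 10):
    labels_k = cluster_test_verbose(k, X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```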
