<h1>Clustering US states and territories for fair distribution of US education budget among the states</h1>

<p>
Suppose the US government wants to classify its states into three tiers so that it can distribute its education budget
among them in such a way that states in the third tier (less educationally developed) receive a larger share, and states in
the first tier (more educationally developed) receive a smaller share.
<br>
Basically, the task here is to divide the states into three clusters. 
<ul>
    <li>Tier 1: More Educationally Developed</li>
    <li>Tier 2: Moderately Educationally Developed</li>
    <li>Tier 3: Less Educationally Developed</li>
</ul>
One very important question arises here. <br/><br/>
<b> How can one measure the educational development of a state?</b>
<br/>

Let's talk about the 'search' endpoint of the Foursquare API. One useful property is that when you pass a word in the query to
this endpoint, the Foursquare API, behind the scenes, searches not only for the query word you mentioned but also for similar words.
<br/><br/>
Let us assume here that educational development can be measured by the number of educational institutions: the more educational institutions
a state has, the more educationally developed it is.
<br/><br/>
Conveniently, if you search the Foursquare API for school, it will return schools, colleges, universities, etc.
<br/><br/>
So, we can use the Foursquare API to extract information about the educational institutions in each state. Since we have to apply clustering to the states, we will try to sort the educational institutions into categories, e.g. number of primary schools, number of secondary schools, universities, etc. We can then apply clustering to divide the states into three clusters. Finally, we will check whether the literacy data for each state supports the results of the clustering. From the same data, we will deduce which clusters should be labelled tier 1, tier 2, and tier 3.
<br/><br/>

In [494]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!pip install geopy 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium==0.5.0
import folium # map rendering library

import re

print('Libraries imported.')

Libraries imported.


<h4>First, let's generate a choropleth of the US based on the literacy rates of the different states to make sure that the GeoJSON is accurate.</h4>

Extracting the literacy data for the US states from Wikipedia.

In [498]:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_educational_attainment')
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find_all('table')[1]  # use the second table on the page
with open('literacy_data.csv', 'w') as f:
    f.write('Name,Literacy\n')
    for tr in table.find_all('tr')[1:]:  # skip the header row
        tds = tr.find_all('td')
        try:
            # first cell: state name; second cell: percentage such as "85.3%"
            f.write(tds[0].text.strip() + ',' + tds[1].text.strip().rstrip('%') + '\n')
        except IndexError:
            print(tds)  # rows with fewer than two <td> cells are printed for inspection
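
<p>The scraping loop above relies on each percentage cell being plain text such as "85.3%". As a minimal sketch, the helper below (the name <code>parse_percent</code> is hypothetical, not part of the notebook) shows one way to make that step more tolerant of trailing whitespace and bracketed footnote markers, assuming such markers can appear in the table.</p>

```python
import re

def parse_percent(cell_text):
    """Turn a table cell such as ' 85.3%[a]' into the float 85.3, or None."""
    # Drop bracketed footnote markers, then surrounding whitespace.
    cleaned = re.sub(r'\[.*?\]', '', cell_text).strip()
    # Accept an integer or decimal number with an optional trailing percent sign.
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*%?', cleaned)
    return float(match.group(1)) if match else None

print(parse_percent(' 85.3%\n'))
print(parse_percent('92.4%[a]'))
print(parse_percent('N/A'))
```

<p>Cells that cannot be parsed come back as <code>None</code>, so they can be reported rather than written to the CSV as malformed values.</p>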

In [500]:
df = pd.read_csv('literacy_data.csv')
df.sort_values('Name')

Unnamed: 0,Name,Literacy
45,Alabama,85.3
4,Alaska,92.4
53,American Samoa,82.1
39,Arizona,86.5
44,Arkansas,85.6
51,California,82.5
13,Colorado,91.1
21,Connecticut,90.2
26,Delaware,89.3
17,District of Columbia,90.3


In [501]:
df.shape

(57, 2)

In [502]:
import json
with open('us_light.geojson') as f:
    data = json.load(f)
names = []
for each in data["features"]:
    names.append(each["properties"]["NAME"])
names.sort()
len(names)

52

In [504]:
# keep only the rows whose name also appears in the GeoJSON
df = df[df['Name'].isin(names)]
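
<p>The filter above silently discards every row whose name is missing from the GeoJSON. A small sketch on toy data shows how the dropped names could be listed before filtering, which helps to spot spelling mismatches between the two sources. The GeoJSON name list and the last row of the table are made up for illustration.</p>

```python
import pandas as pd

# Toy stand-ins: the first three rows mirror values shown in this notebook,
# the last row and the GeoJSON names are hypothetical.
table = pd.DataFrame({
    'Name': ['Alabama', 'Alaska', 'American Samoa', 'Some Other Row'],
    'Literacy': [85.3, 92.4, 82.1, 80.0],
})
geo_names = ['Alabama', 'Alaska', 'Puerto Rico']

in_geojson = table['Name'].isin(geo_names)
dropped = table.loc[~in_geojson, 'Name'].tolist()   # names with no matching shape
kept = table[in_geojson].reset_index(drop=True)      # rows that can be drawn

print('Dropped:', dropped)
print(kept)
```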

In [505]:
df = df.reset_index(drop=True)

In [506]:
df.shape

(52, 2)

In [507]:
df.to_csv('us_literacy_data.csv')

In [508]:
import folium
fmap = folium.Map(location=(37.090240, -95.712891), zoom_start=4)
folium.Choropleth(
    geo_data=r'us_light.geojson', data=df, columns=['Name', 'Literacy'], 
    key_on='feature.properties.NAME', fill_color = 'YlOrRd', fill_opacity = 0.7, line_opacity=0.2, 
    legend_name = 'Literacy Rates in United States'
).add_to(fmap)

<folium.features.Choropleth at 0x1b5a4346b70>

In [511]:
display(fmap)

In [509]:
df['Latitude'] = 0
df['Longitude'] = 0

In [510]:
df.head()

Unnamed: 0,Name,Literacy,Latitude,Longitude
0,Montana,93.0,0,0
1,New Hampshire,92.8,0,0
2,Minnesota,92.8,0,0
3,Wyoming,92.8,0,0
4,Alaska,92.4,0,0


In [512]:
geolocator = Nominatim(user_agent="ny_explorer")
for i in range(df.shape[0]):
    location = geolocator.geocode(df.loc[i, 'Name'])
    df.loc[i, 'Latitude'],  df.loc[i, 'Longitude'] = location.latitude, location.longitude
df.head()

Unnamed: 0,Name,Literacy,Latitude,Longitude
0,Montana,93.0,47.375267,-109.638758
1,New Hampshire,92.8,43.484913,-71.655399
2,Minnesota,92.8,45.989659,-94.611329
3,Wyoming,92.8,43.170026,-107.568535
4,Alaska,92.4,64.445961,-149.680909
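
<p>The geocoding loop above assumes that every lookup succeeds. geopy's <code>geocode</code> returns <code>None</code> when nothing is found, and a bare state name may be ambiguous, so a more defensive loop can be useful. The sketch below is one possible shape for it: <code>geocode_names</code> and <code>fake_geocode</code> are hypothetical names, the geocoder is passed in as a callable so the example runs without network access, and the one-second default delay is an assumption based on the usual guidance for public geocoding services.</p>

```python
import time
from types import SimpleNamespace

def geocode_names(names, geocode, delay_seconds=1.0):
    """Return ({name: (lat, lng)}, [names that could not be resolved])."""
    coords, missing = {}, []
    for name in names:
        location = geocode(name)
        if location is None:
            missing.append(name)          # keep track instead of crashing
        else:
            coords[name] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)         # pause between requests
    return coords, missing

def fake_geocode(query):
    """Stand-in for geolocator.geocode, used only for this illustration."""
    known = {'Montana': SimpleNamespace(latitude=47.375267, longitude=-109.638758)}
    return known.get(query)

coords, missing = geocode_names(['Montana', 'Atlantis'], fake_geocode, delay_seconds=0)
print(coords)
print(missing)
```

<p>In the notebook, the same function could be called as <code>geocode_names(df['Name'], geolocator.geocode)</code>, and any names in <code>missing</code> could then be handled by hand.</p>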


In [513]:
df.to_csv('literacy_data_us_with_coords.csv')

In [514]:
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='#3186cc', fill_opacity=0.7, parse_html=False
    ).add_to(fmap)  
    
display(fmap)

In [515]:
CLIENT_ID = 'FTDP12XL5IIEN3DQLLNOE25ZK3OI0R13EJJTOWTF35JBN41Q' # your Foursquare ID
CLIENT_SECRET = 'YRQP2EDWPHNRFF4LURJ4EUP5MR2XMYKKG44A5VZT5NWN4YTQ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FTDP12XL5IIEN3DQLLNOE25ZK3OI0R13EJJTOWTF35JBN41Q
CLIENT_SECRET:YRQP2EDWPHNRFF4LURJ4EUP5MR2XMYKKG44A5VZT5NWN4YTQ


In [516]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [632]:
def getNearbyVenues(names, latitudes, longitudes, radius=100000):
    """Search Foursquare for 'education' venues around each state's coordinates."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/search?query=education&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        #print(url)

        results = requests.get(url).json()["response"]['venues']
        for v in results:
            try:
                venues_list.append((
                    name, lat, lng,
                    v['name'],
                    v['location']['lat'],
                    v['location']['lng'],
                    v['categories'][0]['name']))
            except (KeyError, IndexError):
                pass  # skip venues without a location or a category
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues
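
<p>The function above keeps a venue only when its name, coordinates and first category are all present. The sketch below isolates that step so it can be checked without calling the API. The helper <code>flatten_venue</code> and the sample dictionaries are hypothetical; they only mirror the fields that <code>getNearbyVenues</code> reads and are not a full description of the Foursquare response.</p>

```python
def flatten_venue(state, venue):
    """Return (state, name, lat, lng, category), or None if a field is missing."""
    categories = venue.get('categories') or []
    location = venue.get('location') or {}
    if not categories or 'lat' not in location or 'lng' not in location:
        return None
    return (state, venue.get('name'), location['lat'], location['lng'],
            categories[0]['name'])

# Made-up sample records shaped like the fields used in this notebook.
sample = [
    {'name': 'Example Education Center',
     'location': {'lat': 47.5, 'lng': -111.2},
     'categories': [{'name': 'General College & University'}]},
    {'name': 'Uncategorised place',
     'location': {'lat': 47.0, 'lng': -110.0},
     'categories': []},
]

rows = [row for row in (flatten_venue('Montana', v) for v in sample) if row is not None]
print(rows)
```

<p>Counting how many venues are skipped per state in this way would also show how much of a state's result set is lost before clustering.</p>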

In [633]:
venues = getNearbyVenues(names=df['Name'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

In [634]:
print(venues.shape)

(1747, 7)


In [635]:
venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Montana,47.375267,-109.638758,Malmstrom AFB Education Center,47.517351,-111.191237,General College & University
1,New Hampshire,43.484913,-71.655399,underhill driver education,43.3792,-71.716019,Driving School
2,New Hampshire,43.484913,-71.655399,NH Department of Education,43.197506,-71.542538,Government Building
3,New Hampshire,43.484913,-71.655399,NH Bureau Education and Training,43.214075,-71.495159,Government Building
4,New Hampshire,43.484913,-71.655399,Prescott Farm Environmental Education Center,43.596139,-71.449591,Trail
5,New Hampshire,43.484913,-71.655399,Merrimack River Outdoor Education and Conserva...,43.233028,-71.531381,Trail
6,New Hampshire,43.484913,-71.655399,Physical Education Center,43.761665,-71.681564,College Gym
7,New Hampshire,43.484913,-71.655399,Global Education Office,43.758009,-71.691313,Student Center
8,New Hampshire,43.484913,-71.655399,Nh Association Of Special Education Administra...,43.192372,-71.533292,Building
9,New Hampshire,43.484913,-71.655399,St Katherine religious education class,43.535906,-71.13016,Church


In [636]:
venues.groupby('Neighborhood').count()[['Venue']]

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Alabama,48
Alaska,8
Arizona,48
Arkansas,48
California,46
Colorado,49
Connecticut,49
Delaware,48
District of Columbia,47
Florida,48


In [637]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 160 unique categories.


In [689]:
#venues['Venue Category'].unique()
# feature_sets = {
#     "feature_set_1" : ['School', 'Elementary School', 'Private School', 'Preschool', 'High School'],
#     "feature_set_2" :  ['College Academic Building', 'College Science Building', 'General College & University', 'University','College Engineering Building', 'College Math Building', 'College Classroom' , 'College Administrative Building', 'College Arts Building', 'College Library', 'Community College', 'College Lab', 'College Theater', 'College Technology Building', 'College Rec Center', 'College Auditorium'],
#     "feature_set_3" : ['Medical School', 'Medical Lab', 'Trade School', 'Cooking School'],
#     "feature_set_4" : ['History Museum', 'Auditorium', 'Convention Center',   'Library', 'Art Gallery',  'Bookstore', 'Arts & Crafts Store',  'Adult Education Center',  'Science Museum', 'Art Museum',  'Cultural Center']
# }
feature_sets = {
    "feature_set_1" : ['School', 'Elementary School', 'Private School', 'Preschool', 'High School', 'College Academic Building', 'College Science Building', 'General College & University', 'University','College Engineering Building', 'College Math Building', 'College Classroom' , 'College Administrative Building', 'College Arts Building', 'College Library', 'Community College', 'College Lab', 'College Theater', 'College Technology Building', 'College Rec Center', 'College Auditorium', 'Medical School', 'Medical Lab', 'Trade School', 'Cooking School', 'History Museum', 'Auditorium', 'Convention Center',   'Library', 'Art Gallery',  'Bookstore', 'Arts & Crafts Store',  'Adult Education Center',  'Science Museum', 'Art Museum',  'Cultural Center']
}

In [690]:
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
onehot['Neighborhood'] = venues['Neighborhood'] 
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
onehot.head()

Unnamed: 0,Neighborhood,ATM,Adult Education Center,Alternative Healer,Animal Shelter,Aquarium,Arcade,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,Athletics & Sports,Auditorium,Automotive Shop,Bank,Boat or Ferry,Bookstore,Building,Business Center,Business Service,Campground,Capitol Building,Child Care Service,Chiropractor,Church,City Hall,Clothing Store,Coffee Shop,College Academic Building,College Administrative Building,College Arts Building,College Auditorium,College Basketball Court,College Cafeteria,College Classroom,College Communications Building,College Cricket Pitch,College Engineering Building,College Football Field,College Gym,College Lab,College Library,College Math Building,College Quad,College Rec Center,College Residence Hall,College Science Building,College Technology Building,College Theater,College Track,Comedy Club,Community Center,Community College,Concert Hall,Conference,Conference Room,Convention Center,Cooking School,Corporate Amenity,Courthouse,Coworking Space,Credit Union,Cultural Center,Dance Studio,Daycare,Distribution Center,Doctor's Office,Driving School,Elementary School,Event Space,Exhibit,Fair,Farm,Financial or Legal Service,Fire Station,Food Court,Food Service,Food Truck,Garden,General College & University,General Entertainment,Government Building,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,High School,Historic Site,History Museum,Hospital,Hospital Ward,Hunting Supply,IT Services,Indian Restaurant,Indie Theater,Library,Martial Arts Dojo,Medical Center,Medical Lab,Medical School,Meeting Room,Mental Health Office,Middle School,Military Base,Miscellaneous Shop,Monument / Landmark,Mosque,Moving Target,Museum,Music School,Music Venue,Nightlife Spot,Non-Profit,Nursery School,Office,Other Great Outdoors,Other Nightlife,Outdoors & Recreation,Paper / Office Supplies Store,Park,Performing Arts Venue,Pet Service,Piano Bar,Planetarium,Playground,Preschool,Private School,Professional & Other Places,Real Estate Office,Recycling Facility,Rehab Center,Research Station,Scenic Lookout,School,Science Museum,Shop & Service,Ski Area,Ski Chalet,Social Club,Sorority House,Spa,Spiritual Center,Sports Club,Stables,Student Center,TV Station,Tech Startup,Tennis Court,Theater,Tourist Information Center,Toy / Game Store,Trade School,Trail,Transportation Service,University,Veterinarian,Voting Booth,Warehouse,Zoo,Zoo Exhibit
0,Montana,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,New Hampshire,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,New Hampshire,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,New Hampshire,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,New Hampshire,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [691]:
onehot.shape

(1747, 161)

In [692]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped.head()

Unnamed: 0,Neighborhood,ATM,Adult Education Center,Alternative Healer,Animal Shelter,Aquarium,Arcade,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,Athletics & Sports,Auditorium,Automotive Shop,Bank,Boat or Ferry,Bookstore,Building,Business Center,Business Service,Campground,Capitol Building,Child Care Service,Chiropractor,Church,City Hall,Clothing Store,Coffee Shop,College Academic Building,College Administrative Building,College Arts Building,College Auditorium,College Basketball Court,College Cafeteria,College Classroom,College Communications Building,College Cricket Pitch,College Engineering Building,College Football Field,College Gym,College Lab,College Library,College Math Building,College Quad,College Rec Center,College Residence Hall,College Science Building,College Technology Building,College Theater,College Track,Comedy Club,Community Center,Community College,Concert Hall,Conference,Conference Room,Convention Center,Cooking School,Corporate Amenity,Courthouse,Coworking Space,Credit Union,Cultural Center,Dance Studio,Daycare,Distribution Center,Doctor's Office,Driving School,Elementary School,Event Space,Exhibit,Fair,Farm,Financial or Legal Service,Fire Station,Food Court,Food Service,Food Truck,Garden,General College & University,General Entertainment,Government Building,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,High School,Historic Site,History Museum,Hospital,Hospital Ward,Hunting Supply,IT Services,Indian Restaurant,Indie Theater,Library,Martial Arts Dojo,Medical Center,Medical Lab,Medical School,Meeting Room,Mental Health Office,Middle School,Military Base,Miscellaneous Shop,Monument / Landmark,Mosque,Moving Target,Museum,Music School,Music Venue,Nightlife Spot,Non-Profit,Nursery School,Office,Other Great Outdoors,Other Nightlife,Outdoors & Recreation,Paper / Office Supplies Store,Park,Performing Arts Venue,Pet Service,Piano Bar,Planetarium,Playground,Preschool,Private School,Professional & Other Places,Real Estate Office,Recycling Facility,Rehab Center,Research Station,Scenic Lookout,School,Science Museum,Shop & Service,Ski Area,Ski Chalet,Social Club,Sorority House,Spa,Spiritual Center,Sports Club,Stables,Student Center,TV Station,Tech Startup,Tennis Court,Theater,Tourist Information Center,Toy / Game Store,Trade School,Trail,Transportation Service,University,Veterinarian,Voting Booth,Warehouse,Zoo,Zoo Exhibit
0,Alabama,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145833,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.104167,0.104167,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0
1,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.125,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0
2,Arizona,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0625,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.104167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arkansas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.041667,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.145833,0.020833,0.0,0.020833,0.0,0.0,0.041667,0.020833,0.0,0.0,0.0,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.104167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.083333,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,California,0.086957,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.108696,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.086957,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065217,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.021739,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
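
<p>It may help to see on a tiny example what the one-hot encoding followed by <code>groupby(...).mean()</code> produces: for each state, the share of its returned venues that fall into each category, not a raw count. The toy frame below is made up; only the column names follow the notebook.</p>

```python
import pandas as pd

# Toy data: state A has four venues, state B has a single one.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['School', 'University', 'Trail', 'Church', 'University'],
})

onehot_toy = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot_toy['Neighborhood'] = toy['Neighborhood']

# Mean of 0/1 indicators per state = share of that state's venues per category.
shares = onehot_toy.groupby('Neighborhood').mean()

educational = ['School', 'University']            # a miniature feature set
shares['feature_set_1'] = shares[educational].sum(axis=1)
print(shares['feature_set_1'])
```

<p>State A gets 0.5 (two of four venues are educational) while state B gets 1.0 from a single venue, which is why states with very few results can end up with extreme scores.</p>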


In [693]:
for each_set, categories in feature_sets.items():
    # add up the per-category shares of every category in the set
    grouped[each_set] = grouped[categories].sum(axis=1)
grouped = grouped[['Neighborhood'] + list(feature_sets.keys())]

In [694]:
grouped.shape

(50, 2)

In [695]:
kclusters = 3
grouped_clustering = grouped.drop(columns='Neighborhood')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 2, 2, 2, 2, 2, 2])
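
<p>The label numbers returned by k-means are arbitrary: cluster 0 is not necessarily the "lowest" or "highest" group. One way to turn labels into ordered tiers, sketched below on made-up one-dimensional scores, is to rank the clusters by their centroid. The mapping "highest centroid is tier 1" is an assumption that follows the notebook's premise that a larger educational score means more development.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up scores forming three well-separated groups.
scores = np.array([0.10, 0.15, 0.12, 0.45, 0.50, 0.48, 0.90, 0.85, 0.95]).reshape(-1, 1)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(scores)

# Rank clusters by centroid, highest first, so the top cluster becomes tier 1.
order = np.argsort(-km.cluster_centers_.ravel())
tier_of_cluster = {int(cluster): rank + 1 for rank, cluster in enumerate(order)}

tiers = [tier_of_cluster[int(label)] for label in km.labels_]
print(tiers)
```

<p>Because the tiers are derived from the centroids, the result does not depend on which numeric label k-means happens to assign to each cluster.</p>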

In [696]:
clustered_regions = pd.DataFrame({"region": grouped['Neighborhood'], "label": kmeans.labels_})
clustered_regions

Unnamed: 0,region,label
0,Alabama,0
1,Alaska,0
2,Arizona,2
3,Arkansas,0
4,California,2
5,Colorado,2
6,Connecticut,2
7,Delaware,2
8,District of Columbia,2
9,Florida,2


In [697]:
cmap = folium.Map(location=(39.381266, -97.922211), zoom_start=4)
folium.Choropleth(
    geo_data=r'us_light.geojson', data=clustered_regions, columns=['region', 'label'], 
    key_on='feature.properties.NAME', fill_color = 'YlOrRd', fill_opacity = 1.0, line_opacity=0.2, 
    legend_name = 'Clusters of US States'
).add_to(cmap)

<folium.features.Choropleth at 0x1b5a6d9da90>

In [698]:
display(cmap)

<p>Let us look at a few states from different tiers.</p>

In [699]:
grouped[grouped['Neighborhood'] == 'Montana']

Unnamed: 0,Neighborhood,feature_set_1
26,Montana,1.0


<p>Clearly, Montana is an outlier: only one venue was returned for it, so its score of 1.0 rests on a single data point. We would still put it into tier 3, but for now, let's ignore Montana.</p>
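
<p>Rather than excluding one state by name, states with too few venues could be filtered out with a general rule. The sketch below uses toy data with placeholder state names, and the threshold <code>MIN_VENUES</code> is a hypothetical value chosen only for illustration.</p>

```python
import pandas as pd

# Toy venue table: "State A" has one venue, "State B" has six.
venues_toy = pd.DataFrame({
    'Neighborhood': ['State A'] + ['State B'] * 6,
    'Venue Category': ['University', 'School', 'University', 'Trail',
                       'School', 'Church', 'Library'],
})

MIN_VENUES = 5  # hypothetical cut-off; tune it to the data at hand

counts = venues_toy.groupby('Neighborhood')['Venue Category'].count()
reliable_states = counts[counts >= MIN_VENUES].index
filtered = venues_toy[venues_toy['Neighborhood'].isin(reliable_states)]

print(counts)
print(sorted(filtered['Neighborhood'].unique()))
```

<p>The states removed this way can still be assigned a tier manually, as the text does for Montana.</p>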

In [700]:
kclusters = 3
grouped_clustering = grouped[grouped['Neighborhood'] != 'Montana'].drop(columns='Neighborhood')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 2, 2, 2, 2, 2, 2])

In [701]:
clustered_regions = pd.DataFrame({"region": grouped[grouped['Neighborhood'] != 'Montana']['Neighborhood'], "label": kmeans.labels_})
clustered_regions

Unnamed: 0,region,label
0,Alabama,0
1,Alaska,0
2,Arizona,2
3,Arkansas,0
4,California,2
5,Colorado,2
6,Connecticut,2
7,Delaware,2
8,District of Columbia,2
9,Florida,2
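
<p>The introduction proposes checking the clusters against the literacy data and using that data to decide which cluster is which tier. A minimal sketch of that comparison is given below. The column names (<code>region</code>, <code>label</code>, <code>Name</code>, <code>Literacy</code>) follow the notebook, but all values are made up, and ranking clusters by mean literacy is only one possible way to assign tiers.</p>

```python
import pandas as pd

# Made-up cluster labels and literacy rates for six placeholder states.
clusters = pd.DataFrame({
    'region': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'label': [0, 0, 2, 2, 1, 1],
})
literacy = pd.DataFrame({
    'Name': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'Literacy': [85.0, 86.0, 91.0, 92.0, 88.0, 89.0],
})

merged = clusters.merge(literacy, left_on='region', right_on='Name', how='inner')

# Mean literacy per cluster, highest first; the top cluster becomes tier 1.
mean_literacy = merged.groupby('label')['Literacy'].mean().sort_values(ascending=False)
tier_of_label = {int(label): tier for tier, label in enumerate(mean_literacy.index, start=1)}
merged['tier'] = merged['label'].map(tier_of_label)

print(mean_literacy)
print(merged[['region', 'label', 'Literacy', 'tier']])
```

<p>If the cluster means turn out to be close to each other, that would suggest the venue-based clusters are not well supported by the literacy data.</p>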


In [688]:
cmap = folium.Map(location=(39.381266, -97.922211), zoom_start=4)
folium.Choropleth(
    geo_data=r'us_light.geojson', data=clustered_regions, columns=['region', 'label'], 
    key_on='feature.properties.NAME', fill_color = 'YlOrRd', fill_opacity = 0.3, line_opacity=0.2, 
    legend_name = 'Clusters of US States'
).add_to(cmap)
display(cmap)

In [702]:
grouped[grouped['Neighborhood'] == 'Washington']

Unnamed: 0,Neighborhood,feature_set_1
46,Washington,0.191489


In [703]:
grouped[grouped['Neighborhood'] == 'Utah']

Unnamed: 0,Neighborhood,feature_set_1
43,Utah,0.666667


In [704]:
grouped[grouped['Neighborhood'] == 'Massachusetts']  

Unnamed: 0,Neighborhood,feature_set_1
21,Massachusetts,0.5625


In [705]:
grouped[grouped['Neighborhood'] == 'Virginia']

Unnamed: 0,Neighborhood,feature_set_1
45,Virginia,0.5625


In [708]:
grouped[grouped['Neighborhood'] == 'Colorado']

Unnamed: 0,Neighborhood,feature_set_1
5,Colorado,0.306122


<p>Because there is not enough data, a few states could not be processed; they are shown in black. However, the clusters look reasonable. Most of the states in <button style = "background: yellow">this</button> color, e.g. Massachusetts, Virginia, Utah, etc., have a larger share of educational institutions among the venues returned for them.</p>

<p>The states in <button style = "background: maroon">this</button> color, by contrast, have a smaller share of educational institutions.</p>

<ul>
<li><button style = "background: yellow">>></button> Tier 1</li>
<li><button style = "background: orange">>></button> Tier 2</li>
<li><button style = "background: maroon">>></button> Tier 3</li>
</ul>