# Clustering Practice

In this notebook you'll use the techniques presented in Notebook 1 - Clustering to work through two clustering problems.

These problems are presented in the following resources:
<ol>
    <li><a href="https://www.amazon.com/Hands-Unsupervised-Learning-Using-Python-ebook/dp/B07NY447H8/ref=sr_1_1?crid=2ENKBD4O6EPNX&dchild=1&keywords=hands+on+unsupervised+learning&qid=1589566161&sprefix=hands+on+un%2Caps%2C170&sr=8-1">Hands-On Unsupervised Learning Using Python</a>, Chapter 6, and</li>
    <li><a href="https://www.amazon.com/Basketball-Data-Science-Applications-Chapman-ebook/dp/B083G6PQV2/ref=sr_1_1?crid=3D5B29MXA6E3C&dchild=1&keywords=basketball+data+science+with+applications+in+r&qid=1589566208&sprefix=basketball+data+scie%2Caps%2C172&sr=8-1">Basketball Data Science With Applications in R</a>.</li>
</ol>

## What You'll Accomplish

You'll work on:
<ul>
    <li>clustering loan applications to help identify good investments,</li>
    <li>using clustering to identify basketball players that play similar styles.</li>
</ul>

In [1]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## Lending Club Data

Below we load and preprocess publicly available data from <a href="https://www.lendingclub.com/">Lending Club</a>, a US peer-to-peer lending company. At the time this data was collected, borrowers could request a loan of $\$1,000$ to $\$40,000$.

Each row of the data set represents a funded loan from the platform. The columns have information about the loan itself and the person who applied for it.

A loan `grade` column is also included; a better grade generally implies that the loan was a good one from the lender's point of view.

One type of problem you may want to solve is to build a model that classifies a loan's grade based on the other features in the data set.

However, another interesting problem is to cluster the observations and look for patterns in the data, not just to identify highly graded loans but also to uncover non-obvious lending patterns.

### Cleaning Data

I first walk through cleaning the data a little bit, then I leave it to you to explore.

In [2]:
# This reads in the raw csv
# from a github repo
# don't worry if pandas prints a warning here; that is expected
lending = pd.read_csv("https://raw.githubusercontent.com/aapatel09/handson-unsupervised-learning/master/datasets/lending_club_data/LoanStats3a.csv")

print("The data has", len(lending), "observations.")

The data has 42542 observations.




In [3]:
# Many columns are empty, so we list the ones worth keeping
columns_to_keep = ['grade','sub_grade','loan_amnt','funded_amnt',
                 'funded_amnt_inv','term', 'int_rate','installment', 
                 'annual_inc', 'dti','delinq_2yrs','open_acc','pub_rec','revol_bal',
                 'revol_util','total_acc','out_prncp', 
                 'out_prncp_inv','total_pymnt','total_pymnt_inv', 
                 'total_rec_prncp','total_rec_int','total_rec_late_fee', 
                 'recoveries','collection_recovery_fee','last_pymnt_amnt']

In [4]:
# I now keep those columns
lending = lending.loc[:,columns_to_keep]

In [5]:
# let's examine the first 5 rows
lending.head()

Unnamed: 0,grade,sub_grade,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,annual_inc,dti,...,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt
0,B,B2,5000.0,5000.0,4975.0,36 months,10.65%,162.87,24000.0,27.65,...,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,171.62
1,C,C4,2500.0,2500.0,2500.0,60 months,15.27%,59.83,30000.0,1.0,...,0.0,0.0,1014.53,1014.53,456.46,435.17,0.0,122.9,1.11,119.66
2,C,C5,2400.0,2400.0,2400.0,36 months,15.96%,84.33,12252.0,8.72,...,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,649.91
3,C,C1,10000.0,10000.0,10000.0,36 months,13.49%,339.31,49200.0,20.0,...,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,357.48
4,B,B5,3000.0,3000.0,3000.0,60 months,12.69%,67.79,80000.0,17.94,...,0.0,0.0,4066.908161,4066.91,3000.0,1066.91,0.0,0.0,0.0,67.3


In [6]:
# These three columns are strings but should be numbers
lending[['term','int_rate','revol_util']]

Unnamed: 0,term,int_rate,revol_util
0,36 months,10.65%,83.70%
1,60 months,15.27%,9.40%
2,36 months,15.96%,98.50%
3,36 months,13.49%,21%
4,60 months,12.69%,53.90%
...,...,...,...
42537,36 months,7.75%,
42538,,,
42539,,,
42540,,,


In [7]:
# Keep only the number of months, e.g. "36 months" -> 36
lending.loc[lending.term.notna(),'term'] = lending['term'].dropna().apply(lambda x: int(x.strip().split(" ")[0]))

In [8]:
# Strip the percent sign from int_rate, e.g. "10.65%" -> 10.65
lending.loc[lending.int_rate.notna(),'int_rate'] = lending['int_rate'].dropna().apply(lambda x: float(x.strip().split("%")[0]))

In [9]:
# Strip the percent sign from revol_util, e.g. "83.70%" -> 83.7
lending.loc[lending.revol_util.notna(),'revol_util'] = lending['revol_util'].dropna().apply(lambda x: float(x.strip().split("%")[0]))

In [10]:
# Much Better
lending[['term','int_rate','revol_util']]

Unnamed: 0,term,int_rate,revol_util
0,36,10.65,83.7
1,60,15.27,9.4
2,36,15.96,98.5
3,36,13.49,21
4,60,12.69,53.9
...,...,...,...
42537,36,7.75,
42538,,,
42539,,,
42540,,,
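
As an aside, the same cleanup can be written with pandas' vectorized string methods. The sketch below runs on a small made-up frame (`toy` and its values are stand-ins, not rows from the lending data) and uses `pd.to_numeric` so the cleaned columns end up with a numeric dtype. Treat it as one possible alternative rather than the approach this notebook relies on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three string columns (values are made up for illustration)
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", np.nan],
    "int_rate": ["10.65%", "15.27%", np.nan],
    "revol_util": ["83.70%", np.nan, "21%"],
})

# Pull out the digits of `term`; missing values stay missing
toy["term"] = pd.to_numeric(toy["term"].str.extract(r"(\d+)", expand=False))

# Drop the trailing percent sign and convert to float
for col in ["int_rate", "revol_util"]:
    toy[col] = pd.to_numeric(toy[col].str.rstrip("%"))

print(toy.dtypes)
```

Note that `term` comes out as a float column here (36.0 rather than 36) because it still contains missing values.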


In [11]:
# let's impute those nan values with the median
for c in columns_to_keep[2:]:
    replace = lending[c].median()
    
    lending.loc[lending[c].isna(),c] = replace
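
For reference, median imputation can also be done in one call with `fillna`. This is a minimal sketch on a made-up frame (`toy` is hypothetical), assuming every column involved is already numeric.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value per column
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, np.nan, 10000.0],
    "dti": [27.65, np.nan, 8.72, 20.0],
})

# Per-column medians; missing values are ignored when computing them
medians = toy.median()

# Each missing value is replaced by the median of its own column
filled = toy.fillna(medians)

print(filled)
```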

In [12]:
lending

Unnamed: 0,grade,sub_grade,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,annual_inc,dti,...,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt
0,B,B2,5000.0,5000.0,4975.0,36,10.65,162.87,24000.0,27.65,...,0.0,0.0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,171.62
1,C,C4,2500.0,2500.0,2500.0,60,15.27,59.83,30000.0,1.00,...,0.0,0.0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,119.66
2,C,C5,2400.0,2400.0,2400.0,36,15.96,84.33,12252.0,8.72,...,0.0,0.0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,649.91
3,C,C1,10000.0,10000.0,10000.0,36,13.49,339.31,49200.0,20.00,...,0.0,0.0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,357.48
4,B,B5,3000.0,3000.0,3000.0,60,12.69,67.79,80000.0,17.94,...,0.0,0.0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,67.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42537,A,A3,5000.0,5000.0,0.0,36,7.75,156.11,70000.0,8.81,...,0.0,0.0,5619.762090,0.00,5000.00,619.76,0.00,0.0,0.00,156.39
42538,,,9700.0,9600.0,8500.0,36,11.99,277.69,59000.0,13.47,...,0.0,0.0,9682.251696,8955.87,8000.00,1339.16,0.00,0.0,0.00,528.36
42539,,,9700.0,9600.0,8500.0,36,11.99,277.69,59000.0,13.47,...,0.0,0.0,9682.251696,8955.87,8000.00,1339.16,0.00,0.0,0.00,528.36
42540,,,9700.0,9600.0,8500.0,36,11.99,277.69,59000.0,13.47,...,0.0,0.0,9682.251696,8955.87,8000.00,1339.16,0.00,0.0,0.00,528.36


### Ready to Go

Now that the data is clean, let's go ahead and apply clustering.

Feel free to apply any of the clustering techniques we've used so far. You may want to compare the output to the `grade` and `sub_grade` columns.

The goal here is to:
<ol>
    <li>get more familiar with the clustering techniques we learned in Notebook 1, and</li>
    <li>see if you can gain any insight about the lending data set.</li>
</ol>

Remember that you'll need to scale the data!
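
Scaling matters because a column like `loan_amnt` is measured in thousands while `int_rate` is measured in tens, so unscaled distances would be driven almost entirely by the loan amount. Here is one way the workflow might look, as a sketch only: it uses synthetic data (`toy` is made up, not the lending data), picks $k$-means with an arbitrary `n_clusters=4` as the example technique, and compares the clusters to `grade` with a cross-tabulation. Swap in the real columns and whichever clustering method you prefer.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a few numeric lending columns (not real data)
rng = np.random.default_rng(440)
toy = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 40000, size=200),
    "int_rate": rng.uniform(5, 25, size=200),
    "dti": rng.uniform(0, 30, size=200),
    "grade": rng.choice(list("ABCD"), size=200),
})

features = ["loan_amnt", "int_rate", "dti"]

# Standardize so that each column has mean 0 and variance 1
X = StandardScaler().fit_transform(toy[features])

# k = 4 is an arbitrary choice for illustration
kmeans = KMeans(n_clusters=4, n_init=10, random_state=440)
toy["cluster"] = kmeans.fit_predict(X)

# Rows are grades, columns are clusters
table = pd.crosstab(toy["grade"], toy["cluster"])
print(table)
```

Because the toy grades are assigned at random, the table here shows no structure; on the real data, an uneven table would suggest that the clusters line up with loan grade.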

In [None]:
## Code here








## Shooters Gotta Shoot

These clustering techniques are often applied in sports analytics.

They can be helpful in finding commonalities between players and teams in a number of ways.

In this problem you'll load in some shot distributions for NBA players from the 2018-19 NBA season.

### The Data

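Each row of the data set is a different player. For each region of the basketball court there are two columns: `zone_k_attempted`, the share of the player's shots taken from zone `k`, and `zone_k_made`, the share of those shots that went in. Both are stored as fractions (for example 0.366) rather than as percentages.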

Here is a basketball court cut into 15 regions:
<img src="CourtZones.png" style="width:50%;"></img>

Let's load in the data and look at it.

In [13]:
nba = pd.read_csv("players_18_19.csv")

In [14]:
nba.head()

Unnamed: 0,player_name,zone_1_attempted,zone_1_made,zone_2_attempted,zone_2_made,zone_3_attempted,zone_3_made,zone_4_attempted,zone_4_made,zone_5_attempted,...,zone_11_attempted,zone_11_made,zone_12_attempted,zone_12_made,zone_13_attempted,zone_13_made,zone_14_attempted,zone_14_made,zone_15_attempted,zone_15_made
0,Aaron Gordon,0.366157,0.634465,0.119503,0.344,0.006692,0.428571,0.007648,0.25,0.044933,...,0.038241,0.475,0.106119,0.369369,0.067878,0.323944,0.065966,0.304348,0.003824,0.0
1,Al Horford,0.284924,0.718447,0.172891,0.504,0.011065,0.5,0.013831,0.6,0.040111,...,0.016598,0.166667,0.067773,0.306122,0.070539,0.45098,0.113416,0.353659,0.001383,0.0
2,Al-Farouq Aminu,0.374368,0.603604,0.08769,0.365385,0.006745,0.25,0.001686,0.0,0.008432,...,0.080944,0.354167,0.129848,0.363636,0.146712,0.310345,0.035413,0.428571,0.0,0.0
3,Alec Burks,0.331224,0.566879,0.132911,0.269841,0.004219,0.5,0.025316,0.166667,0.048523,...,0.021097,0.4,0.124473,0.457627,0.084388,0.25,0.06962,0.363636,0.004219,0.0
4,Alex Len,0.532407,0.62029,0.100309,0.338462,0.003086,0.0,0.001543,1.0,0.015432,...,0.098765,0.375,0.040123,0.384615,0.040123,0.5,0.072531,0.276596,0.001543,0.0


Explore this data set and perform clustering on either the whole data set or just selected columns. Are you able to identify different shooting styles?
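
If you are unsure where to start, one option is to cluster only the `_attempted` columns and compare a few values of $k$ with the silhouette score. The sketch below does this on synthetic data: `toy` is a made-up frame with two artificial "styles", not the real `nba` data, and $k$-means is just one of the techniques you could try. Since these columns are all fractions on the same scale, the sketch skips scaling, which is a judgment call you may want to revisit.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for nba: two made-up styles, 15 attempt-share columns
rng = np.random.default_rng(440)
inside = rng.dirichlet([20] + [1] * 14, size=30)      # most shots in zone 1
outside = rng.dirichlet([1] * 10 + [8] * 5, size=30)  # most shots in zones 11-15
toy = pd.DataFrame(
    np.vstack([inside, outside]),
    columns=[f"zone_{i}_attempted" for i in range(1, 16)],
)

# Keep only the attempt-share columns
attempt_cols = [c for c in toy.columns if c.endswith("_attempted")]
X = toy[attempt_cols].to_numpy()

# Compare a few values of k using the silhouette score (higher is better)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=440).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data you would replace `toy` with `nba`, and then look at which players land in each cluster to judge whether the groups correspond to recognizable shooting styles.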

In [None]:
## Code here








See you in notebook 3!