# Objective

Create clusters of users and interpret them.

# Instructions
- Import the k-means class from the Scikit-learn library.
- Apply normalization to all columns in feature_df to bring all features to the same scale before modeling.
- Create a k-means model with n_clusters = 5 and random_state = 0.
- Predict the clusters of users in feature_df and assign them into a new column.
- Print out the size of clusters and averages of each feature grouped by clusters.

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing

In [2]:
df = pd.read_csv('features.csv')
df.head()

|   | target | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|---|
| 0 | 1 | 3.544248 | 1.640546 | 226 | 215 | 0.005028 |
| 1 | 2 | 3.000000 | 1.735650 | 41 | 45 | 0.000978 |
| 2 | 3 | -0.285714 | 2.819381 | 21 | 0 | 0.000382 |
| 3 | 4 | 3.111111 | 1.812079 | 54 | 63 | 0.001289 |
| 4 | 5 | 2.333333 | 2.591068 | 3 | 3 | 0.000093 |


## Scale features

In [3]:
# Min-max scale every column except 'target' to the [0, 1] range
for col in [x for x in df.columns if x != 'target']:
    df[col] = preprocessing.minmax_scale(df[col])
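To make the effect of this step concrete, here is a minimal sketch on a small made-up frame. The `toy` frame and its values are assumptions for illustration, not rows from `features.csv`. It checks that `minmax_scale` matches the formula `(x - min) / (max - min)` and that all feature columns can be scaled in a single call instead of a loop.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data; 'target' is left unscaled, as in the cell above
toy = pd.DataFrame({
    "target": [1, 2, 3, 4],
    "in_degree": [0.0, 10.0, 20.0, 40.0],
    "avg_rating": [-1.0, 0.0, 1.0, 3.0],
})
feature_cols = [c for c in toy.columns if c != "target"]

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (toy[feature_cols] - toy[feature_cols].min()) / (
    toy[feature_cols].max() - toy[feature_cols].min()
)

# Same scaling with scikit-learn, applied to all feature columns at once
scaled = toy.copy()
scaled[feature_cols] = preprocessing.minmax_scale(toy[feature_cols])

print(scaled)
```

After scaling, every feature column has minimum 0 and maximum 1, which is what the `describe()` output in the next cell shows for the real data.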

In [4]:
df.describe()

|   | target | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|---|
| count | 5858.0 | 5858.0 | 5858.0 | 5858.0 | 5858.0 | 5858.0 |
| mean | 3003.711676 | 0.53643 | 0.638773 | 0.009505 | 0.007953 | 0.008946 |
| std | 1721.680985 | 0.141352 | 0.069471 | 0.033157 | 0.027689 | 0.028096 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1509.25 | 0.55 | 0.623478 | 0.0 | 0.001311 | 0.00121 |
| 50% | 2998.5 | 0.55 | 0.646909 | 0.001873 | 0.002621 | 0.002656 |
| 75% | 4494.75 | 0.585 | 0.667866 | 0.007491 | 0.005242 | 0.007088 |
| max | 6005.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Create KMeans model and add cluster as a new feature

In [5]:
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans = kmeans.fit(df.drop(columns=['target']))

In [6]:
df['cluster'] = kmeans.predict(df.drop(columns=['target']))
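The fit and predict steps above can also be combined. The sketch below uses synthetic data (two well-separated blobs, purely hypothetical) to show that `fit` followed by `predict` on the training rows gives the same labels as `fit_predict`. Setting `n_init=10` explicitly is an assumption made here so that the example behaves the same across scikit-learn versions; the notebook itself relies on the library default.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature columns (hypothetical data)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(0.2, 0.02, size=(50, 2)),
        rng.normal(0.8, 0.02, size=(50, 2)),
    ]),
    columns=["feat_a", "feat_b"],
)

# Two-step version, mirroring the cells above
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy)
labels_two_step = model.predict(toy)

# One-step version
labels_one_step = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(toy)

print((labels_two_step == labels_one_step).all())
```

On the real data this would correspond to assigning the result of `fit_predict` on the feature columns directly to `df['cluster']`.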

## Analyse clusters

### Cluster sizes

In [7]:
df.groupby('cluster')['target'].count()

cluster
0     448
1     291
2    4758
3     321
4      40
Name: target, dtype: int64

In [8]:
df.groupby('cluster')['target'].count() / len(df)

cluster
0    0.076477
1    0.049676
2    0.812223
3    0.054797
4    0.006828
Name: target, dtype: float64

The vast majority of rows (81%) have been assigned to cluster 2, while cluster 4 holds fewer than 1%. Is this imbalance a problem?
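The sizes and shares can also be read side by side. This is a small sketch that rebuilds a label column from the cluster sizes reported above (the `labels` series is reconstructed for illustration, not the notebook's `df`) and uses `value_counts` with and without `normalize=True`.

```python
import numpy as np
import pandas as pd

# Rebuild a label column from the cluster sizes reported above
sizes = {0: 448, 1: 291, 2: 4758, 3: 321, 4: 40}
labels = pd.Series(
    np.repeat(list(sizes.keys()), list(sizes.values())), name="cluster"
)

counts = labels.value_counts().sort_index()                # rows per cluster
shares = labels.value_counts(normalize=True).sort_index()  # fraction of all rows

summary = pd.DataFrame({"size": counts, "share": shares})
print(summary)
```

In the notebook, the same table would come from calling `value_counts` on `df['cluster']`.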

### Feature averages per cluster

#### Overall averages for comparison

In [9]:
df.drop(columns=['target', 'cluster']).mean().to_frame().T

|   | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|
| 0 | 0.53643 | 0.638773 | 0.009505 | 0.007953 | 0.008946 |


#### Averages per cluster

In [10]:
df.drop(columns=['target']).groupby('cluster').mean()

| cluster | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|
| 0 | 0.358987 | 0.567509 | 0.007231 | 0.003952 | 0.00776 |
| 1 | 0.053494 | 0.561366 | 0.006429 | 0.004274 | 0.004149 |
| 2 | 0.566511 | 0.646612 | 0.00783 | 0.006829 | 0.007591 |
| 3 | 0.76985 | 0.694604 | 0.003115 | 0.003103 | 0.004583 |
| 4 | 0.585869 | 0.619588 | 0.307959 | 0.252097 | 0.25327 |


The feature averages per cluster differ from the overall averages as follows:
- Cluster 0
    - lower avg_rating and avg_rating_inbound_users
    - lower in_degree and out_degree
    - Interpretation: below-average rated nodes with low connectivity
- Cluster 1
    - much lower avg_rating, the lowest of all clusters
    - lower avg_rating_inbound_users
    - lower in_degree and out_degree, and also lower than cluster 2
    - Interpretation: lowest rated nodes with low connectivity
- Cluster 2
    - similar avg_rating and avg_rating_inbound_users (both slightly higher)
    - slightly lower in_degree and out_degree, but higher than clusters 0, 1 and 3
    - Interpretation: slightly above-average rated nodes with close to average connectivity
- Cluster 3
    - higher avg_rating and avg_rating_inbound_users, the highest of all clusters
    - lower in_degree and out_degree, the lowest of all clusters
    - Interpretation: highest rated nodes, with low connectivity
- Cluster 4
    - the only cluster with higher in_degree and out_degree, and they are *much* higher
    - much higher page_rank
    - Interpretation: nodes that are far more connected than the others
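
One way to make these comparisons easier to read is to divide each cluster's averages by the overall averages. The sketch below does this with the numbers copied from the two tables above. The names `overall`, `cluster_means` and `ratio` are introduced here for illustration only.

```python
import pandas as pd

cols = ["avg_rating", "avg_rating_inbound_users", "in_degree", "out_degree", "page_rank"]

# Overall averages, copied from the output above
overall = pd.Series([0.53643, 0.638773, 0.009505, 0.007953, 0.008946], index=cols)

# Per-cluster averages, copied from the output above
cluster_means = pd.DataFrame(
    [
        [0.358987, 0.567509, 0.007231, 0.003952, 0.00776],
        [0.053494, 0.561366, 0.006429, 0.004274, 0.004149],
        [0.566511, 0.646612, 0.00783, 0.006829, 0.007591],
        [0.76985, 0.694604, 0.003115, 0.003103, 0.004583],
        [0.585869, 0.619588, 0.307959, 0.252097, 0.25327],
    ],
    columns=cols,
)
cluster_means.index.name = "cluster"

# Ratio > 1: the cluster is above the overall average for that feature
ratio = cluster_means.div(overall, axis="columns")
print(ratio.round(2))
```

In the notebook, the same ratio could be computed by dividing the grouped means from cell 10 by the column means of the feature columns.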