## Assigning first names for users in each cluster

GoodReads user information appears to be inaccessible for privacy reasons, and it is unavailable through the GoodReads API unless users allow it. To make the user clusters simpler and nicer to read, I assigned a first name to each user in each cluster of the final clustering.

The first names are selected from the 2019 Insight DS cohorts in NYC, LA and Seattle :)

In [1]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
from collections import Counter

In [2]:
# open clusters and names csv files
clusters = pd.read_csv('clusters_final.csv')
names = np.genfromtxt('names_64.csv', dtype='str')

In [3]:
def get_names_clust(dfin, names):
    """Assign a name to each user in each cluster.
    Names are drawn at random from the list, without repeats within a group.
    Args:
        dfin (:obj:`DataFrame`): pandas DataFrame of user clusters
        names (:obj:`list` of :obj:`str`): first names from the Insight DS cohorts
    Returns:
        :obj:`DataFrame`: pandas DataFrame with group id and associated name
    """
    # get the unique group numbers
    list_groups = list(set(dfin['group']))
    
    # build a list of dicts, one per user, holding the group id and the assigned name
    all_users_names = []
    for group_id in list_groups:
        group = dfin[dfin['group'] == group_id]
        # random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
        names_group = random.sample(sorted(set(names)), len(group))
        for name in names_group:
            all_users_names.append({'group': group_id, 'names': name})
    
    # list of dicts to DataFrame
    df_users_names = pd.DataFrame(all_users_names)
    return df_users_names

In [4]:
# get a DataFrame with one (group id, name) row per user
names_clusters = get_names_clust(clusters, names)

# merge dataframes on index (otherwise the user_idx will be repeated)
names_users_clusters = pd.merge(clusters, names_clusters, left_index=True, right_index=True)

# drop the duplicate group column created by the merge
names_users_clusters.drop('group_y', axis=1, inplace=True)

# restore the original column names (the merge renamed 'group' to 'group_x')
names_users_clusters.columns = ['user_idx', 'group', 'names']

# export to csv
names_users_clusters.to_csv('clusters_group_names.csv', index=False)
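
One caveat on the merge above: joining on the index pairs rows by position, so the names only stay unique within a cluster if the rows of `clusters_final.csv` are already ordered by group in the same order the loop visits them. The sketch below is one possible alternative that writes the names straight onto the rows of each group, so no merge is needed. The function name `assign_names_by_group`, the `seed` parameter and the toy data are hypothetical and only there for illustration; they are not part of the original notebook.

```python
import random

import pandas as pd


def assign_names_by_group(df, names, group_col="group", seed=None):
    """Return a copy of df with a 'names' column, unique within each group."""
    rng = random.Random(seed)          # local generator, optional seed for repeatability
    pool = sorted(set(names))          # random.sample needs a sequence, not a set
    out = df.copy()
    out["names"] = None
    # .groups maps each group id to the index labels of its rows
    for group_id, idx in out.groupby(group_col).groups.items():
        if len(idx) > len(pool):
            raise ValueError(f"not enough distinct names for group {group_id}")
        out.loc[idx, "names"] = rng.sample(pool, len(idx))
    return out


# Hypothetical toy data, not the real clusters_final.csv
toy = pd.DataFrame({"user_idx": [10, 11, 12, 13, 14], "group": [1, 0, 1, 0, 1]})
toy_names = ["Ada", "Grace", "Alan", "Edsger"]
named = assign_names_by_group(toy, toy_names, seed=0)
print(named)
```

Because the names are assigned through the row labels of each group, the original row order is preserved and the result does not depend on how the input file is sorted.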