# Mobile User Segmentation
## Unsupervised Learning
### Domain
#### Mobile, Apps

Business Context
A key challenge for Mobile App businesses is to analyze the trend in the market to increase their sales/usage.
The trend can be easily observed if the companies can group the customers based on their activity on the
network.
We have access to the user's demographic characteristics, their app usage, geolocation, and mobile device
properties. This grouping can be done by applying different criteria like App installed status, Actively using or
not, phone brand compatibility and so on.
The machine learning clustering algorithms can provide an analytical method to cluster user segments with
similar interests/habits. This will help App/mobile providers better understand and interact with their
subscribers.
Objective
We will cluster the users into groups that show similar interests in their app usage, and identify which
factors drive the formation of those clusters.

Dataset description
* App_event.csv - Whether each app is installed and whether it is actively used
* App_labels.csv - Apps and their labels; the label_id can be used to join with label categories
* events.csv - Event data has an event id, location detail (lat/long), and timestamp when the user is using an app on their device
* gender_age.csv - Details of users' age & gender
* label_categories.csv - Apps' labels and their categories
* phone_device.csv - Device ids, brand, and model names. The brand names are in Chinese; you can translate them into English for better understanding, but we will not do it here. We will use this as a profiling variable, not as a clustering variable.

1. Preprocessing the data (10 points)
a. Import required libraries and read all the CSVs.
b. Check for duplicate device id (phones) and remove them
c. Merge the provided files into a dataframe.
d. Drop unnecessary columns

2. Exploratory Data Analysis (10 points)
a. Check dimensions of the dataframe in terms of rows and columns and study few of the
variables
b. Check the data type
c. Sample only 5 % of the total dataset for our clustering problem since it’s a very large dataset
and can lead to memory issues. Use random_state = 1111 [ Note - take a smaller sample if you
face memory issues ]
d. Check the frequency and distribution of the relevant features
e. Convert string features into categories and make them numerical
f. Study summary statistics and mention your findings
g. Check for missing values and impute missing values if any
h. Standardize the data

3. Build a clustering algorithm for clustering mobile users. Kindly follow the below steps: (10 points) [ Hint
- you can try both k-means and hierarchical clustering]
a. Evaluate the clustering algorithm you’ve used
b. Mention the hyperparameters that perform the best

4. Cluster Profiling: (10 points)
a. Comment on the optimal cluster size
b. Compute the statistical summary for observations in the cluster
c. Check mean, sd, freq, modes, min, max, range..all basic central tendency numbers

5. Do dimensionality reduction using PCA (10 points)

6. Apply k means clustering on the PCA transformed data (10 points)

7. Mention your comments and findings (10 points)

Optional:

1. Try KPrototypes algorithm to cluster the data
a. https://medium.com/@guruprasad0o_o0/notes-on-k-prototype-for-clustering-mixed-typeddata-e80eb526b226
b. https://medium.com/datadriveninvestor/k-prototype-in-clustering-mixed-attributese6907db91914
2. Try kmodes library to cluster the data
a. https://pypi.org/project/kmodes/
b. https://medium.com/@davidmasse8/unsupervised-learning-for-categorical-datadd7e497033ae

Food for thought

* Does applying PCA give a better result in comparison to earlier?
* Can you apply any other algorithms to create clusters of data?
* How can clustering be helpful for your analysis?
* What can you infer about the properties of users of different clusters formed in this project?

Learning Outcomes

* PCA
* k-means clustering
* Scaling
* Silhouette Coefficient

In [None]:
#!pip install --upgrade pip
!pip install gdown

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
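
One side effect of downcasting to float16 is worth keeping in mind: its largest finite value is 65504, so a sum over many rows can overflow even when every single value fits. The sketch below uses synthetic values, not the notebook's data, and only NumPy; the array contents are an assumption chosen for illustration.

```python
import numpy as np

# Synthetic stand-in for a coordinate column: 1,000 rows of the value 100.0
values16 = np.full(1000, 100.0, dtype=np.float16)

# Largest finite float16; the true sum (100,000) lies above it
print(np.finfo(np.float16).max)

sum16 = values16.sum()                      # result is stored as float16
sum32 = values16.astype(np.float32).sum()   # upcast before reducing

print(sum16, sum32)
```

If summary statistics on a downcast column look invalid, casting that column back to float32 or float64 before aggregating is a cheap thing to try.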

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt
#To parallelize the task
import dask.dataframe as dd
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/app_events.csv')
app_events = pd.DataFrame(data)
app_events.head()
app_events=reduce_mem_usage(app_events)

In [None]:
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/app_labels.csv')
app_labels = pd.DataFrame(data)
app_labels.head()
app_labels=reduce_mem_usage(app_labels)

In [None]:
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/events.csv')
events = pd.DataFrame(data)
events.head()
events=reduce_mem_usage(events)

In [None]:
#data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/gender_age_test.csv')
#gender_age_test = pd.DataFrame(data)
#gender_age_test.head()
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/gender_age_train.csv')
gender_age = pd.DataFrame(data)
gender_age.head()
gender_age=reduce_mem_usage(gender_age)
#gender_age = pd.concat([gender_age_test,gender_age_train],axis=0)
#gender_age.head()

In [None]:
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/label_categories.csv')
label_categories = pd.DataFrame(data)
label_categories.head()
label_categories=reduce_mem_usage(label_categories)

In [None]:
data = pd.read_csv('/kaggle/input/china-mobile-user-gemographics/phone_brand_device_model.csv')
phone_device = pd.DataFrame(data)
phone_device.head()
phone_device=reduce_mem_usage(phone_device)

In [None]:
#Check for duplicates in phone_device dataframe
#There are several duplicates in the dataframe which we would need to drop
phone_device[phone_device.duplicated(subset="device_id", keep=False)].sort_values("device_id")

In [None]:
#Drop the duplicates and check if they have actually dropped
phone_device2 = phone_device.drop_duplicates(subset=['device_id'])
print(phone_device2[phone_device2.device_id==-9194249084574705214])
print(phone_device2[phone_device2.device_id==9196371203514832504])
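
Instead of spot-checking individual ids, the result of the de-duplication can be confirmed for the whole column. The following is a minimal sketch on a hypothetical four-row frame (`toy` and its values are made up for illustration); the same two calls would apply to `phone_device` and `phone_device2`.

```python
import pandas as pd

# Hypothetical miniature of phone_device with one repeated device_id
toy = pd.DataFrame({
    "device_id": [1, 2, 2, 3],
    "phone_brand": ["A", "B", "B", "C"],
})

# Rows flagged as repeats of an earlier device_id
n_dupes = toy.duplicated(subset=["device_id"]).sum()

# Keep the first row per device_id, then confirm the ids are unique
deduped = toy.drop_duplicates(subset=["device_id"])

print(n_dupes)
print(deduped["device_id"].is_unique)
```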

In [None]:
app_event_label = pd.merge(app_events, app_labels, on='app_id',how='inner')
#del [[app_events,app_labels]]
#import gc as gc
#gc.collect()
del app_events
del app_labels
#app_events=pd.DataFrame()
#app_labels=pd.DataFrame()
app_event_label.head()
#
#app_event_label.head()
app_event_label=reduce_mem_usage(app_event_label)
app_event_label.drop(['app_id'],axis=1,inplace=True)

In [None]:
app_event_label2 = pd.merge(app_event_label,events, on='event_id',how='inner')
del app_event_label
del events
#app_event_label=pd.DataFrame()
#events=pd.DataFrame()
app_event_label2.head()
app_event_label2=reduce_mem_usage(app_event_label2)
app_event_label2.drop(['event_id'],axis=1,inplace=True)

In [None]:
#app_event_label2.drop(['event_id'],axis=1,inplace=True)

In [None]:
app_event_label_gender = pd.merge(app_event_label2,gender_age, on='device_id')
del app_event_label2
del gender_age
app_event_label_gender.head()
app_event_label_gender=reduce_mem_usage(app_event_label_gender)
#app_event_label2.drop(['event_id'],axis=1,inplace=True)


In [None]:
app_event_label_gender2 = pd.merge(app_event_label_gender,label_categories, on='label_id')
del app_event_label_gender
del label_categories
app_event_label_gender2.head()
app_event_label_gender2=reduce_mem_usage(app_event_label_gender2)


In [None]:
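#Use the de-duplicated phone table so that repeated device ids do not multiply rows
app_event_label_gender_phone = pd.merge(app_event_label_gender2,phone_device2, on='device_id')
del app_event_label_gender2
del phone_device
del phone_device2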
app_event_label_gender_phone.head()
app_event_label_gender_phone=reduce_mem_usage(app_event_label_gender_phone)

In [None]:
app_event_label_gender_phone.head()

In [None]:
#The data frame has 79,061,565 (~80 MM) rows and 13 columns
#In total there were 15 columns, we dropped 2 columns due to datasize issue during merge step 
#namely app_id and event_id therefore current dataset has 13 columns
app_event_label_gender_phone.shape

In [None]:
#Different datatypes 
#is_installed, is_active, label_id, device_id and age are integers
#longitude, latitude are float
#gender, group,category, phone_brand and device_model are objects (categorical variables)
app_event_label_gender_phone.dtypes

In [None]:
#Taking a 0.1% sample (5% sample is causing lot of memory issues)
df = app_event_label_gender_phone.sample(frac=0.001,  random_state=1111)

In [None]:
del app_event_label_gender_phone

In [None]:
df.head()

In [None]:
#Keep only necessary columns and drop the columns used for joining like label_id, device_id etc
#We can see that phone_brand and device_model have part English and part Chinese characters
df2 = df.drop(['label_id','device_id'],axis=1)
df2.head()

In [None]:
#Most app records are inactive, and male users form the majority in both the active
#and the inactive segments
sns.countplot(x ='is_active', hue = "gender", data = df2); 

In [None]:
#Each group label is gender-specific (prefix F or M), so every group contains only one gender
a4_dims = (20, 10)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.countplot(x ='group', hue = "gender", data = df2); 

In [None]:
#conda install googletrans

In [None]:
#In a 5% sample we have 3,953,078 (~4MM) rows and 11 columns
#In a 0.1% sample we have 79,062 (~80K) rows and 11 columns
df2.shape

In [None]:
#import googletrans
#from googletrans import Translator
#translator = Translator()
#result = translator.translate('Mitä sinä teet')
#print(result.text)
#df2['phone_brand_eng'] = str(df2.phone_brand)
#print(df2.head())
#df2['device_model_eng'] = str(df2.device_model)
#print(df2.head())
#print(df2['phone_brand_eng'].unique)
#rint(str(df.iloc[0, df.columns.get_loc('phone_brand')]))
#import re
#for x in range(0, df.shape[0]):
#    result = translator.translate(str(df2.iloc[x, df2.columns.get_loc('phone_brand')]))
#    s = result.text
#    df2.iloc[x, df2.columns.get_loc('phone_brand_eng')] = re.sub('[^a-zA-Z]+', '', s)    
#print(df.phone_brand_eng)

In [None]:
#a4_dims = (20, 10)
#fig, ax = pyplot.subplots(figsize=a4_dims)
#sns.countplot(x ='phone_brand_eng',  data = df2); 

In [None]:
#Summary statistics of the numerical data
#Longitude and latitude show a mean of NaN (Not a Number) and a standard deviation of zero
#A likely cause is the float16 downcast in reduce_mem_usage: float16 tops out at 65504, so sums can overflow
#With an invalid mean and standard deviation we cannot standardize latitude and longitude as they are
#We also need to check for missing values there and impute or remove them if present
df2.describe().transpose()

In [None]:
#All mobiles have app installed - no zero values in the random sample
#So this column is practically useless for us and can be removed later
#app_event_label_gender_phone
df2["is_installed"].value_counts()

In [None]:
#About 66% users are not actively using the app and 34% are actively using
df2["is_active"].value_counts(normalize=True)

In [None]:
#Timestamp holds very detailed values, covering both date and time
#We could create a few derived variables from it, such as the date and the time
#Time could then be further classified as morning, afternoon, evening, night etc
#However, due to memory constraints this part could not be explored
df2["timestamp"].value_counts(normalize=True)
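
The derived variables described in the comments above could be built along these lines. This is only a sketch on a hypothetical four-row frame: the timestamps are invented (in the `%Y-%m-%d %H:%M:%S` layout), and the bucket boundaries for the parts of the day are an assumption, not something the dataset defines.

```python
import pandas as pd

# Hypothetical timestamps in "%Y-%m-%d %H:%M:%S" layout
toy = pd.DataFrame({"timestamp": [
    "2016-05-01 00:55:25", "2016-05-01 08:10:00",
    "2016-05-01 13:30:00", "2016-05-01 19:45:10",
]})
ts = pd.to_datetime(toy["timestamp"])

toy["date"] = ts.dt.date
toy["hour"] = ts.dt.hour

# Assumed buckets: [0,6) night, [6,12) morning, [12,18) afternoon, [18,24) evening
toy["part_of_day"] = pd.cut(
    toy["hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # left-closed intervals, so hour 0 falls in the first bucket
)
print(toy)
```

Because the operations are vectorized, they avoid the row-by-row `strptime` calls, which may help with the memory and runtime problems noted above.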

In [None]:
#from datetime import datetime

#df2['date'] = df2['timestamp'].map(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date)
#df2['time'] = df2['timestamp'].map(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").time)
#df2["date"] = datetime.strptime(str(df2["timestamp"]),"%d/%m/%Y %H:%M")
#print(df2["date"]);

In [None]:
#Most of the users have a longitude of 0 or 1, which probably means the data is missing or invalid
#China lies far east of the prime meridian (longitude 0), which runs through Greenwich in England
df2["longitude"].value_counts()

In [None]:
#Most of the users have a latitude of 0 or 1, which probably means the data is missing or invalid
#China lies well north of the equator (latitude 0)
df2["latitude"].value_counts()

In [None]:
#74% male population and 26% female population - Data skewed towards male group
df2["gender"].value_counts(normalize=True)

In [None]:
#Data is fairly well distributed among age groups for the male population - prefix M
#Data is more skewed towards older women in the female population (Age 29 and above) - prefix F
df2["group"].value_counts(normalize=True)

In [None]:
#Most of the apps belong to Industry tag, Property industry and services
#A large number is unknown as well
df2["category"].value_counts()

In [None]:
#With help of Google translate the top few Chinese phone brands are Huawei, Xiaomi, Samsung, Meizu and Oppo below
df2["phone_brand"].value_counts()

In [None]:
#Top few device models are Glory 6, Mate 7, MI 4, Honor 6 Plus, Galaxy Note 3 etc
df2["device_model"].value_counts()

In [None]:
#A lot of categorical variables are there which we need to convert to numerical for clustering
#Convert gender into dummy/indicator variables using One Hot Encoding
df3 = pd.get_dummies(df2, columns=["gender"])

In [None]:
df3.head()

In [None]:
df4 = pd.get_dummies(df3, columns=["group"])

In [None]:
df4.head()
del df3
#del df2
del df

In [None]:
df5 = pd.get_dummies(df4, columns=["category"])
del df4

In [None]:
df5.head()

In [None]:
#df6 = pd.get_dummies(df5, columns=["phone_brand"])
df6 = df5

In [None]:
del df5

In [None]:

df6.head()

In [None]:
#Majority of the values for latitude and longitude are clustered at values 0 and 1 as noted earlier
#We would not be able to normalize latitude and longitude as std deviation is zero and mean is NaN
#So we will drop it from further analysis
#Also is_installed is not useful as it has value of 1 throughout and will be dropped later
df6.describe().transpose()

In [None]:
# Around 1.5MM rows and 535 columns in the dataset for 5%
# Around 80K rows and 384 columns in the dataset for 0.1%
df6.shape

In [None]:
#Check for null values - No null values found
df6.isnull().sum()

In [None]:
#No NA values found
df6.isna().sum()

In [None]:
#Datetime is an object - let us change that into datetime
df6.dtypes

In [None]:
df6["timestamp"] = pd.to_datetime(df6["timestamp"])

In [None]:
df6.dtypes

In [None]:
#Truncate each timestamp to midnight to get the calendar day
df6["day"] = df6["timestamp"].dt.normalize()
df6["day"].value_counts()

In [None]:
#Timestamp and day variable could not be used due to memory issues
#latitude and longitude values could not be standardized as std dev is zero and mean is NaN
#Device model and phone brand are having Chinese values so could not gather much information
#Is_installed is 1 throughout so practically useless for clustering
df7 = df6.drop(['timestamp','device_model','day','latitude','longitude','phone_brand','is_installed'],axis=1)
#del df6

In [None]:
# For running the clustering algorithm, we would require the data to be scaled
df7.dtypes

In [None]:
#Scale the data using z-scores (zero mean, unit variance per column)
from scipy.stats import zscore
#Suppress divide-by-zero warnings raised for constant columns, which become NaN
np.seterr(divide='ignore', invalid='ignore')
scaled_data = df7.apply(zscore)

In [None]:
scaled_data.head()

In [None]:
#Hopkins Statistic is a way of measuring the cluster tendency of a data set.
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) # heuristic from article [1]
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
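        # Distance from a uniformly random point to its nearest real observation
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 
                                    1, return_distance=True)
        ujd.append(u_dist[0][0])
        # Distance from a sampled observation to its nearest neighbour other than itself
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])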
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H


In [None]:
#Checking whether data can be clustered through Hopkins Coefficient
#A very high coefficient value (~0.99) shows that the data can be clustered well
Num_features = scaled_data.select_dtypes(include=[np.number]).columns
hopkins(scaled_data[Num_features])

In [None]:
#sns.pairplot(scaled_data,diag_kind='kde')

In [None]:
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(scaled_data)
    prediction=model.predict(scaled_data)
    meanDistortions.append(sum(np.min(cdist(scaled_data, model.cluster_centers_, 'euclidean'), axis=1)) 
                           / scaled_data.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
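
The elbow plot can be complemented with the silhouette coefficient, which is listed among the learning outcomes. The sketch below runs on synthetic blobs rather than `scaled_data`; the blob centres, the range of `k` and the `random_state` values are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data: three well-separated blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, 0], [10, 0], [0, 15]],
    cluster_std=1.0,
    random_state=1111,
)

# Silhouette is only defined for at least two clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1111).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is the candidate to inspect first
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On a large sample, `silhouette_score` accepts a `sample_size` argument that scores a random subset, which keeps the pairwise-distance computation manageable.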


In [None]:
# Let us first start with K = 3 as the elbow appears to be at 3
final_model=KMeans(3)
final_model.fit(scaled_data)
prediction=final_model.predict(scaled_data)

#Append the prediction 
df7["GROUP"] = prediction
#scaled_data["GROUP"] = prediction
print("Groups Assigned : \n")
df7.head()

In [None]:
#The two clusters are made on the basis of gender - one is male another is female
DataClust = df7.groupby(['GROUP'])
DataClust.mean()

In [None]:
# Let us try with K = 6 (slight elbow is there)
final_model=KMeans(6)
final_model.fit(scaled_data)
prediction=final_model.predict(scaled_data)

#Append the prediction 
#tech_supp_df["GROUP"] = prediction
df7["GROUP"] = prediction
print("Groups Assigned : \n")
df7.head()

In [None]:
#The five clusters are made on the basis of gender - one is male another is female, other three clusters are a 
#mix of male and female group 
DataClust = df7.groupby(['GROUP'])
DataClust.mean()

In [None]:
# Let us try with K = 8 as there is an elbow there as well
final_model=KMeans(8)
final_model.fit(scaled_data)
prediction=final_model.predict(scaled_data)

#Append the prediction 
df7["GROUP"] = prediction
#scaled_data["GROUP"] = prediction
print("Groups Assigned : \n")
df7.head()

In [None]:
DataClust = df7.groupby(['GROUP'])
DataClust.mean()
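
For the profiling step, the per-cluster mean can be extended to the other basic statistics named in the brief (count, standard deviation, min, max, range, mode). A minimal sketch on a hypothetical six-row frame follows; `toy`, `value_range` and `first_mode` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of df7 after the cluster labels were appended as "GROUP"
toy = pd.DataFrame({
    "GROUP":     [0, 0, 0, 1, 1, 1],
    "age":       [22, 25, 25, 40, 44, 48],
    "is_active": [1, 0, 0, 1, 1, 0],
})

def value_range(s):
    """Spread between the largest and smallest value."""
    return s.max() - s.min()

def first_mode(s):
    """Most frequent value (the smallest one when there is a tie)."""
    return s.mode().iloc[0]

# One row per cluster, one column per (feature, statistic) pair
profile = toy.groupby("GROUP").agg(
    ["count", "mean", "std", "min", "max", value_range, first_mode]
)
print(profile)
```

With several hundred one-hot columns the resulting table gets wide, so it may be easier to read after selecting a handful of features first.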

In [None]:
#from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
#from scipy.spatial.distance import pdist
#Z = linkage(scaled_data, method = 'centroid', metric = 'euclidean')
#c, coph_dists = cophenet(Z,pdist(scaled_data))

In [None]:
# PCA
# Step 1 - Create covariance matrix

cov_matrix = np.cov(scaled_data.T)
print('Covariance Matrix \n', cov_matrix)


In [None]:
# Step 2 - Get eigenvalues and eigenvectors
# The covariance matrix is symmetric, so eigh is used and returns real-valued results
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)

In [None]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)

In [None]:
plt.plot(var_exp)

In [None]:
# Ploting 
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

In [None]:
#We need at least 350 out of the 378 to get proper coverage of variance
scaled_data.shape

In [None]:
from sklearn.decomposition import PCA

# NOTE - we are generating only 350 PCA dimensions (dimensionality reduction from 379 to 350)

pca = PCA(n_components=350)
data_reduced = pca.fit_transform(scaled_data)
data_reduced.transpose()
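
Rather than reading the number of components off the plot, it can be derived from a target share of explained variance. The sketch below uses synthetic data with three hidden factors instead of `scaled_data`, and the 95% threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1111)

# Synthetic stand-in for scaled_data: 10 columns driven by 3 hidden factors
factors = rng.normal(size=(500, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

# Fit a full PCA, then find the smallest component count whose
# cumulative explained variance exceeds 95%
cum_ratio = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_needed = int(np.searchsorted(cum_ratio, 0.95, side="right") + 1)

# scikit-learn makes the same selection when n_components is a float in (0, 1)
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(n_needed, pca_95.n_components_)
```

Passing the fraction directly avoids hard-coding a component count that has to be revisited whenever the sample or the encoded columns change.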

In [None]:
pca.components_

In [None]:
#df_comp = pd.DataFrame(pca.components_,columns=list(scaled_data))
#df_comp.head()

In [None]:
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(data_reduced)
    prediction=model.predict(data_reduced)
    meanDistortions.append(sum(np.min(cdist(data_reduced, model.cluster_centers_, 'euclidean'), axis=1)) 
                           / data_reduced.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')

In [None]:
#We see a very clear elbow at k = 3
#Let us use k-means to do the clustering with K = 3 first
#K = 5 also shows an elbow and is tried in a later cell
final_model=KMeans(3)
final_model.fit(data_reduced)
prediction=final_model.predict(data_reduced)

#Append the prediction 
df7["GROUP"] = prediction
#scaled_data["GROUP"] = prediction
print("Groups Assigned : \n")
df7.head()

In [None]:
#Very clear clusters formed 1 for female and 3 for male, another group is 84% male and 16% female
#The last group has overlapping interests
DataClust = df7.groupby(['GROUP'])
DataClust.mean()

In [None]:
final_model=KMeans(5)
final_model.fit(data_reduced)
prediction=final_model.predict(data_reduced)

#Append the prediction 
df7["GROUP"] = prediction
#scaled_data["GROUP"] = prediction
print("Groups Assigned : \n")
df7.head()

In [None]:
#Very clear clusters formed 1 for female and 3 for male, another group is 84% male and 16% female
#The last group has overlapping interests
DataClust = df7.groupby(['GROUP'])
DataClust.mean()

In [None]:
#Optional part try k prototype clustering
#conda install kmodes

In [None]:
import numpy as np
from kmodes.kprototypes import KPrototypes
import matplotlib.pyplot as plt
from matplotlib import style
from kmodes.kmodes import KModes
style.use("ggplot")

Parameters for KMode run
    -----------
    n_clusters : int, optional, default: 8
        The number of clusters to form as well as the number of
        centroids to generate.
    max_iter : int, default: 300
        Maximum number of iterations of the k-modes algorithm for a
        single run.
    cat_dissim : func, default: matching_dissim
        Dissimilarity function used by the algorithm for categorical variables.
        Defaults to the matching dissimilarity function.
    init : {'Huang', 'Cao', 'random' or an ndarray}, default: 'Cao'
        Method for initialization:
        'Huang': Method in Huang [1997, 1998]
        'Cao': Method in Cao et al. [2009]
        'random': choose 'n_clusters' observations (rows) at random from
        data for the initial centroids.
        If an ndarray is passed, it should be of shape (n_clusters, n_features)
        and gives the initial centroids.
    n_init : int, default: 10
        Number of time the k-modes algorithm will be run with different
        centroid seeds. The final results will be the best output of
        n_init consecutive runs in terms of cost.
    verbose : int, optional
        Verbosity mode.

In [None]:
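#Running data with KModes type
#KModes compares category labels, so it is run on the categorical columns
#rather than on the continuous PCA output

km = KModes(n_clusters=5, init='Huang', n_init=6, verbose=1)

#clusters = KPrototypes().fit_predict(X, categorical=[1, 2])

clusters = km.fit_predict(df2[["gender", "group", "category"]])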

# Print the cluster centroids
print(km.cluster_centroids_)

In [None]:
from sklearn.preprocessing import StandardScaler
np.seterr(divide='ignore', invalid='ignore')
Num_features =df2.select_dtypes(include=[np.number]).columns
# 4 samples/observations and 2 variables/features
#X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])# the scaler object (model)
#scaler = StandardScaler()# fit and transform the data
#scaled_data2 = scaler.fit_transform(df2) 
df2scaled = df2[Num_features].apply(zscore)

In [None]:
df2scaled.head()

In [None]:
df2scaled.drop(['is_installed'],axis=1,inplace=True)

In [None]:
#Principal Component
from sklearn.decomposition import PCA
pca = PCA(n_components=3, whiten=True)
Num_features=df2scaled.select_dtypes(include=[np.number]).columns
x=df2scaled[Num_features]
principalComponents = pca.fit_transform(x)

# Cumulative Explained Variance
cum_explained_var = np.cumsum(pca.explained_variance_ratio_)

print(cum_explained_var)

In [None]:
#Principal Components converted to a Data frame
principalDf  = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2', 'principal component 3'])
principalDf.shape

In [None]:
df2.head()

In [None]:
df2.reset_index(drop=True,inplace=True)

In [None]:
principalDf.reset_index(drop=True, inplace=True)


In [None]:
#Concatenating the PCAs with the categorical variable
finalDf_Cat = pd.concat([principalDf, df2["gender"],df2["group"],df2["category"]], axis = 1)
finalDf_Cat.head(2)

In [None]:
##### Choosing optimal K value
cost = []
X = finalDf_Cat
for num_clusters in list(range(2,7)):
    kproto = KPrototypes(n_clusters=num_clusters, init='Huang', random_state=1111,n_jobs=-2,
                         max_iter=15,n_init=6,verbose=2) 
    kproto.fit_predict(X, categorical=[3,4,5])
    cost.append(kproto.cost_)

#Plot cost against the actual K values that were tried (2 to 6)
plt.plot(range(2, 7), cost)
plt.xlabel('K')
plt.ylabel('cost')
plt.show()

In [None]:
# Converting the dataset into matrix
X = finalDf_Cat.to_numpy()

In [None]:
# Running K-Prototype clustering K =2 gives a good elbow point as above
kproto = KPrototypes(n_clusters=2, init='Huang', verbose=2, 
                     random_state=1111,max_iter=20, n_init=6,n_jobs=-2,gamma=.25) 
clusters = kproto.fit_predict(X, categorical=[3,4,5])

In [None]:
#Visualize K-Prototype clustering on the PCA projected Data
df=pd.DataFrame(finalDf_Cat)
df['Cluster_id']=clusters
print(df['Cluster_id'].value_counts())
sns.pairplot(df,hue='Cluster_id',palette='Dark2',diag_kind='kde')

In [None]:
del finalDf_Cat
del df

In [None]:
#df7.boxplot(by='GROUP', layout = (200,40),figsize=(15,10))
#from sklearn.cluster import AgglomerativeClustering
#model = AgglomerativeClustering(n_clusters=3,affinity='euclidean',linkage='average')
#model.fit(scaled_data)
#df7['GROUP'] = model.labels_
#df7.head()

In [None]:
#Very clear clusters formed 1 for female and 3 for male, another group is 84% male and 16% female
#The last group has overlapping interests
#DataClust = df7.groupby(['GROUP'])
#DataClust.mean()

In [None]:
#Convert age group into indicator/dummy variables using One hot encoding
#pd.get_dummies(df2, columns=["group"]).head()

In [None]:
#Convert category into indicator/dummy variables using One hot encoding
#pd.get_dummies(df2, columns=["category"]).head()

In [None]:
#Convert Phone brand group into indicator/dummy variables using One hot encoding
#pd.get_dummies(df2, columns=["phone_brand"]).head()

In [None]:
#Convert device model into indicator/dummy variables using One hot encoding
#We will not do this as it is making too many columns 
#We might have two issue 1) curse of dimensionality 2) memory issues
#pd.get_dummies(df2, columns=["device_model"]).head()

In [None]:
#df2.head()

In [None]:
#df2.shape

In [None]:
#df2.describe().transpose()