# **Clustering Facebook Live Sellers**

The code below is taken from Prashant Banerjee's submission on [kaggle.com](https://www.kaggle.com/prashant111/k-means-clustering-with-python/notebook).

You are encouraged to follow the link above and read the full code. In this lab, you will explore the data and prepare it for scikit-learn algorithms.

**About the data set**

Live selling is becoming increasingly popular in Asian countries. Small vendors can now reach a wider audience and connect with many customers. 

K-Means clustering is used to find intrinsic groups within the unlabelled dataset and draw inferences from them. In this kernel, you will implement K-Means clustering to find intrinsic groups within the dataset that display the same status_type behaviour.

**Import libraries**

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for statistical data visualization
%matplotlib inline

# Acquire data

In [3]:
df = pd.read_csv('UnsupervisedLearning/FacebookLiveSellers/Live.csv')

#TODO: Write code to inspect the first five rows of the data frame
df.head()

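| | status_id | status_type | status_published | num_reactions | num_comments | num_shares | num_likes | num_loves | num_wows | num_hahas | num_sads | num_angrys | Column1 | Column2 | Column3 | Column4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 246675545449582_1649696485147474 | video | 4/22/2018 6:00 | 529 | 512 | 262 | 432 | 92 | 3 | 1 | 1 | 0 | NaN | NaN | NaN | NaN |
| 1 | 246675545449582_1649426988507757 | photo | 4/21/2018 22:45 | 150 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN |
| 2 | 246675545449582_1648730588577397 | video | 4/21/2018 6:17 | 227 | 236 | 57 | 204 | 21 | 1 | 1 | 0 | 0 | NaN | NaN | NaN | NaN |
| 3 | 246675545449582_1648576705259452 | photo | 4/21/2018 2:29 | 111 | 0 | 0 | 111 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN |
| 4 | 246675545449582_1645700502213739 | photo | 4/18/2018 3:22 | 213 | 0 | 0 | 204 | 9 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN |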


# Inspect data

In [4]:
#TODO: Write code to inspect the shape of the data frame
df.shape  # shape is an attribute, not a method, so no parentheses

In [5]:
#TODO: Write code to get information about null values in the data frame
df.info()

# Clean data

In [6]:
# Drop the four redundant columns in the data set
df.drop(['Column1', 'Column2', 'Column3', 'Column4'], 
axis=1, inplace=True)
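
Columns that hold no data at all can also be found programmatically instead of being typed by hand. The sketch below is only an illustration on a small made-up frame: the values and the names `toy`, `empty_cols` and `cleaned` are hypothetical, and it assumes the redundant columns are entirely null, as the empty cells in the preview of the first five rows suggest.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: two useful columns, two empty ones
toy = pd.DataFrame({
    'num_likes': [432, 150, 204],
    'num_loves': [92, 0, 21],
    'Column1': [np.nan, np.nan, np.nan],
    'Column2': [np.nan, np.nan, np.nan],
})

# List the columns in which every value is null
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# Drop every column that is entirely null (returns a new frame)
cleaned = toy.dropna(axis=1, how='all')
print(cleaned.columns.tolist())
```

Unlike the explicit drop above, `dropna(axis=1, how='all')` keeps working if the empty columns are named differently, but it would also keep a column that has even one non-null value.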

In [7]:
# Check the summary again
# to confirm that no redundant columns remain
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7050 entries, 0 to 7049
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   status_id         7050 non-null   object
 1   status_type       7050 non-null   object
 2   status_published  7050 non-null   object
 3   num_reactions     7050 non-null   int64 
 4   num_comments      7050 non-null   int64 
 5   num_shares        7050 non-null   int64 
 6   num_likes         7050 non-null   int64 
 7   num_loves         7050 non-null   int64 
 8   num_wows          7050 non-null   int64 
 9   num_hahas         7050 non-null   int64 
 10  num_sads          7050 non-null   int64 
 11  num_angrys        7050 non-null   int64 
dtypes: int64(9), object(3)
memory usage: 661.1+ KB


In [8]:
#TODO: Write code to inspect statistical information about the data set
df.describe()

Note that there are 3 categorical (object-typed) variables in the dataset: status_id, status_type and status_published. We will explore them one by one.

In [9]:
# View the labels in the variable

df['status_id'].unique()

array(['246675545449582_1649696485147474',
       '246675545449582_1649426988507757',
       '246675545449582_1648730588577397', ...,
       '1050855161656896_1060126464063099',
       '1050855161656896_1058663487542730',
       '1050855161656896_1050858841656528'], dtype=object)

In [10]:
# View how many unique values the variable has

len(df['status_id'].unique())

6997

In [11]:
#TODO: Write code to view the labels in the variable status_published


In [12]:
#TODO: Write code to view how many unique values
# there are in status_published


In [13]:
#TODO: Write code to view the labels in the variable status_type


In [14]:
#TODO: Write code to view how many unique values
# there are in status_type


The inspection above shows that there are 6997 unique labels in the status_id variable and 6913 unique labels in the status_published variable.

The total number of instances in the dataset is 7050, which means that these two variables are approximately unique identifiers for each of the instances. Near-unique identifiers carry no grouping information for clustering, so we should drop them.
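
The same reasoning can be expressed as a ratio of unique values to rows: 6997 / 7050 is about 0.99 and 6913 / 7050 is about 0.98. The sketch below applies that idea to a small made-up frame. The data, the names `toy`, `unique_ratio`, `id_like` and `reduced`, and the 0.8 cut-off are all assumptions chosen for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame: status_id is almost unique per row, status_type is not
toy = pd.DataFrame({
    'status_id': ['a_1', 'a_2', 'a_3', 'a_4', 'a_5', 'a_5'],
    'status_type': ['video', 'photo', 'video', 'photo', 'photo', 'video'],
    'num_likes': [432, 150, 204, 111, 204, 10],
})

# Share of unique values in each text column (1.0 means every row differs)
unique_ratio = toy[['status_id', 'status_type']].nunique() / len(toy)
print(unique_ratio)

# Hypothetical cut-off: treat a column as an identifier above 80% uniqueness
id_like = unique_ratio[unique_ratio > 0.8].index.tolist()

# Drop the identifier-like columns and keep the rest
reduced = toy.drop(columns=id_like)
print(reduced.columns.tolist())
```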

In [15]:
#TODO: Write code to drop status_id and status_published


In [16]:
# View a summary of the data set again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7050 entries, 0 to 7049
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   status_id         7050 non-null   object
 1   status_type       7050 non-null   object
 2   status_published  7050 non-null   object
 3   num_reactions     7050 non-null   int64 
 4   num_comments      7050 non-null   int64 
 5   num_shares        7050 non-null   int64 
 6   num_likes         7050 non-null   int64 
 7   num_loves         7050 non-null   int64 
 8   num_wows          7050 non-null   int64 
 9   num_hahas         7050 non-null   int64 
 10  num_sads          7050 non-null   int64 
 11  num_angrys        7050 non-null   int64 
dtypes: int64(9), object(3)
memory usage: 661.1+ KB


In [17]:
#TODO: Write code to inspect the first five rows of the data frame
df.head()

**Converting**

Once the two identifier columns are dropped, status_type is the only non-numeric column left in the dataset. We will convert its labels into integer codes.

In [18]:
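from sklearn.preprocessing import LabelEncoder

# Split the data set into X (features) and y (the status_type labels)
X = df  # X refers to the same object as df, not a copy
y = df['status_type']

# Encode the status_type strings as integers
le = LabelEncoder()
X['status_type'] = le.fit_transform(X['status_type'])
y = le.transform(y)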
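
To see which integer stands for which label, the fitted encoder can be inspected. LabelEncoder stores the labels in sorted order in `classes_`, and a label's position there is its code. The sketch below uses an illustrative list of labels (the real column may contain a different set of values) and the hypothetical names `le_demo`, `mapping` and `restored`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the real column may contain a different set of values
labels = ['video', 'photo', 'video', 'link', 'status', 'photo']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the original labels in sorted order; position = integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

The codes are arbitrary identifiers: the ordering between them comes from alphabetical sorting, not from any ranking of the status types.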

Inspect X 

In [19]:
X.head()

| | status_id | status_type | status_published | num_reactions | num_comments | num_shares | num_likes | num_loves | num_wows | num_hahas | num_sads | num_angrys |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 246675545449582_1649696485147474 | 3 | 4/22/2018 6:00 | 529 | 512 | 262 | 432 | 92 | 3 | 1 | 1 | 0 |
| 1 | 246675545449582_1649426988507757 | 1 | 4/21/2018 22:45 | 150 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 |
| 2 | 246675545449582_1648730588577397 | 3 | 4/21/2018 6:17 | 227 | 236 | 57 | 204 | 21 | 1 | 1 | 0 | 0 |
| 3 | 246675545449582_1648576705259452 | 1 | 4/21/2018 2:29 | 111 | 0 | 0 | 111 | 0 | 0 | 0 | 0 | 0 |
| 4 | 246675545449582_1645700502213739 | 1 | 4/18/2018 3:22 | 213 | 0 | 0 | 204 | 9 | 0 | 0 | 0 | 0 |


**Feature Scaling**

In [20]:
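# Like in lab 1, use the MinMax scaler to bring every feature into the
# range [0, 1] so that no single feature dominates the distance calculations.
# The scaler expects numeric input, so status_id and status_published
# must already have been dropped from X.

from sklearn.preprocessing import MinMaxScaler
cols = X.columns
ms = MinMaxScaler()

X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=cols)  # pass cols directly; [cols] would create a MultiIndex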

ValueError: could not convert string to float: '4/22/2018 6:00'
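
The error message quotes a status_published value, which suggests that this column was still part of X when the scaler ran; MinMaxScaler needs every column to be numeric. As a minimal sketch of the intended result, the example below scales a small made-up, all-numeric frame. The values and the names `toy`, `scaler` and `scaled` are illustrative assumptions. Each column is rescaled with (x - min) / (max - min), so its smallest value becomes 0 and its largest becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame: an encoded status_type plus two count columns
toy = pd.DataFrame({
    'status_type': [3, 1, 3, 1],
    'num_reactions': [529, 150, 227, 111],
    'num_comments': [512, 0, 236, 0],
})

# fit_transform returns a NumPy array, so rebuild the frame with the old names
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled)
```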

# Earn Your Wings

Implement a K-Means Clustering algorithm on the cleaned data set. Use the elbow method to find the right value of k to use.
Add comments in your code to explain each step that you take in your implementation.
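
As a starting point, here is a hedged sketch of the elbow method on synthetic data, not on the Live.csv data set. It assumes scikit-learn's KMeans and make_blobs; the cluster centres, the range of k and the variable names are made up for illustration. The idea is to fit K-Means for several values of k, record the inertia (the sum of squared distances from each point to its nearest centroid), and look for the k after which the inertia stops falling sharply.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (not the Live.csv data)
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means for a range of k and record the inertia of each fit
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    inertias.append(model.inertia_)

# The "elbow" is the k after which extra clusters barely reduce the inertia
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

In a notebook, plotting `inertias` against `k_values` with `plt.plot` makes the bend easier to spot. On real data the elbow is often less sharp than on this synthetic example, so the choice of k may need some judgment.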