### QUANTUMBLACK CONFIDENTIAL


##### Copyright (c) 2019 - present QuantumBlack Visual Analytics Ltd. All
##### Rights Reserved.

NOTICE: All information contained herein is, and remains the property of
QuantumBlack Visual Analytics Ltd. and its suppliers, if any. The
intellectual and technical concepts contained herein are proprietary to
QuantumBlack Visual Analytics Ltd. and its suppliers and may be covered
by UK and Foreign Patents, patents in process, and are protected by trade
secret or copyright law. Dissemination of this information or
reproduction of this material is strictly forbidden unless prior written
permission is obtained from QuantumBlack Visual Analytics Ltd.

## Descriptive analytics

Descriptive analytics is one of the most critical stages within any analytics project. Although projects can vary quite significantly, this phase typically lasts for ~3 weeks and should allow us to answer the following questions:  

* Do I have all the data required to make actionable recommendations at the end of the project?
* Are there any underlying issues with the data such as missing values or data inconsistencies?
* What has happened in the past and why?
* Does the data make business sense and align with previous results observed by the business?

### Data preprocessing

#### Importing packages
Packages allow us to carry out more complex operations. The packages used throughout this exercise are described below:  

* `os`         : provides a portable way of using operating system dependent functionality
* `pandas`     : enables us to do data manipulation (on DataFrames)
* `numpy`      : applies operations on data (processed as arrays and matrices)
* `matplotlib` : enables us to interactively visualise and plot data
* `seaborn`    : interacts well with matplotlib and gives diagrams which are more visually appealing
* `sklearn`    : provides many machine learning algorithms
* `scipy`      : enables us to use mathematical functions (e.g., Euclidean distance)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn.cluster as clust
from scipy.spatial.distance import cdist
%matplotlib inline
import os


#### Get the big picture of a dataset
A few functions are particularly useful to get a holistic view of a dataset:

* `df.head()`: Returns the first n (default=5) rows of the dataset
* `df.info()`: Gives overall information about the dataset: column names, non-null counts and data types
* `df['categorical_column_name'].value_counts()`: Counts the occurrences of each distinct value (**For Categorical Columns**)
* `df.describe()`: Basic statistics (count, mean, standard deviation, min, quartiles, max) (**For Numerical Columns by default**). Use **df.describe(include='all')** to include both numerical and categorical variables. See the documentation for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [None]:
# Read the data from the 'data' folder
df_raw = pd.read_csv("data/adults.csv")

In [None]:
# Let's check the data and scan through its structure
"""
Hint : You can use .head(x) to look at the top x rows of data
"""
df_raw.head(5)

In [None]:
# A good way to get the bigger picture of the dataset is to use .info()
df_raw.info()

In [None]:
# descriptive summary of numerical columns
df_raw.describe()

In [None]:
# descriptive summary of all columns (numerical and categorical)
df_raw.describe(include="all")

In [None]:
# value counts for categorical columns
df_raw['income'].value_counts()

In [None]:
# value counts for categorical columns
df_raw['occupation'].value_counts()

In [None]:
""" 
Feel free to try more (for free).
Uncomment the following lines.
Shortcut to comment/uncomment is: CTRL (or cmd) + "/"

"""

# df_raw['occupation'].value_counts()
# df_raw['workclass'].value_counts()
# df_raw['education'].value_counts()
# df_raw['marital-status'].value_counts()
# df_raw['relationship'].value_counts()
# df_raw['race'].value_counts()
# df_raw['sex'].value_counts()
# df_raw['native-country'].value_counts()

#### Inspect data quality
1. Eye-ball whether each **data type** is correct
>1. It is common for some columns to be read in with the wrong data type (e.g., the data type for age can be set to `object` instead of `int` if the column contains any strange characters).
>1. You can use `pd.to_numeric(df['column_name'])` or `df.astype()` to change the data type accordingly.

1. Numerical columns usually require some fixing (e.g., values stored as text with a comma as the decimal separator)
> `df_raw[fix_cols] = df_raw[fix_cols] \
    .apply(lambda x : x.str.replace(",",".")) \
    .apply(pd.to_numeric)`

1. Pay attention to missing (or null) values
> 1. remove empty rows/columns with `dropna()`
> 1. impute empty values (often with the median or mean) with `fillna()`
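
The checks above can be sketched on a small, made-up DataFrame. The column names and values below are hypothetical and are not taken from `adults.csv`; the sketch only illustrates one way of combining `pd.to_numeric`, `dropna()` and `fillna()`.

```python
import pandas as pd

# Hypothetical toy data: 'age' contains a stray character and a missing value,
# 'price' is stored as text with a comma as the decimal separator.
toy = pd.DataFrame({
    "age": ["39", "50", "4o", None],
    "price": ["1,5", "2,25", "3,0", "4,75"],
})

# Convert 'age' to numbers; entries that cannot be parsed become NaN
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Swap the decimal comma for a dot, then convert to float
toy["price"] = pd.to_numeric(toy["price"].str.replace(",", ".", regex=False))

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()

# Option 1: drop every row that has a missing value
dropped = toy.dropna()

# Option 2: impute missing ages with the median of the observed ages
imputed = toy.fillna({"age": toy["age"].median()})
```

Whether dropping or imputing is more appropriate depends on how much data is missing and why, so treat both options as starting points rather than a recommendation.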

In [None]:
# Identify for which variables the " ?" is present and its frequency (%)
col_names = df_raw.columns
num_data = df_raw.shape[0]
for c in col_names:
    num_non = df_raw[c].isin([" ?"]).sum()
    if num_non > 0:
        print (c)
        print (num_non)
        print ("{0:.2f}%".format(float(num_non) / num_data * 100))
        print ("\n")

In [None]:
# Remove rows with '?'
df_raw = df_raw[df_raw["workclass"] != " ?"]
df_raw = df_raw[df_raw["occupation"] != " ?"]
df_raw = df_raw[df_raw["native-country"] != " ?"]

#### Dealing with categorical values
Most models cannot handle categorical (`object`) values, so you need to transform them before modelling if you want to include them in your model. Two approaches are possible:

1. **Label Encoding** (for interpretation)
> Label encoding is simply converting each categorical value in a column to a number. 
> * **Advantage**: straightforward
> * **Disadvantage**: the numeric values can be **“misinterpreted”** by the algorithms, as it enforces an order to values, which might not reflect the truth in your data - e.g., for occupation, can we say that one occupation is more valuable than another? (short answer: no).

1. **One-hot Encoding** (for modelling)
> One-hot encoding converts each categorical value into a new column and assigns a 1 or 0 (True/False) value to each row.
> * **Advantage**: "neutral" representation of the data (does not assign an order)
> * **Disadvantage**: can **significantly increase** the number of columns in the dataset
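
As a minimal sketch of the difference between the two encodings, the example below applies both to a made-up `occupation` column (the values are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy column, not taken from the adults dataset
toy = pd.DataFrame({"occupation": ["Sales", "Tech-support", "Sales", "Farming"]})

# Label encoding: categories are sorted alphabetically and numbered from 0,
# which implies an order (Farming < Sales < Tech-support) that has no real meaning
toy["occupation_code"] = toy["occupation"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["occupation"], prefix="occupation", dtype=int)

# drop_first=True removes the first category, which becomes the implicit baseline
one_hot_reduced = pd.get_dummies(
    toy["occupation"], prefix="occupation", dtype=int, drop_first=True
)
```

With three categories, one-hot encoding produces three columns (two with `drop_first=True`), which is why the column count grows quickly for variables with many distinct values.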

In [None]:
# select numerical columns and categorical columns separately
int64_df = df_raw.select_dtypes(include=['int64'])
# change object to category data type
object_df = df_raw.select_dtypes(include=['object']).astype('category')

In [None]:
# view changed types
object_df.dtypes

In [None]:
# View summary of categorical columns
# It indicates the top value (mode) and its associated frequency
object_df.describe()

In [None]:
# LABEL ENCODING
# change 'income' to label encoding
object_df["income_cat"] = object_df["income"].cat.codes

In [None]:
# View your transformed variable in the dataset
object_df.head()

In [None]:
# ONE-HOT ENCODED
# With `pd.get_dummies`
one_hot_df = pd.get_dummies(df_raw, drop_first=True)

In [None]:
# 15 columns in the initial dataset
df_raw.shape

In [None]:
# One-hot encoding added 82 columns to the initial dataset!
# Food for thought: why do you think this could be an issue?
one_hot_df.shape 

In [None]:
# Visualise your dummy variables
one_hot_df.head()

> Our data is now in the correct format and we can do some descriptive analyses!

### Turning data into insights
Data visualisation is vital to obtain preliminary insights on the data before any modelling step. The following plots are commonly used:

1. `histogram`
> A histogram represents the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable.

1. `box plot`
> The box plot (a.k.a. box and whisker diagram) is a standardized way to display the distribution of data based on the five following summary statistics: minimum, first quartile, median, third quartile, and maximum. It is a very useful plot to very quickly visualise the distribution of a continuous variable across multiple categories.

1. `correlation plot`/`clustermap (using hierarchical clustering)`
> A correlation matrix is a table showing Pearson correlation coefficients between selected variables. Each cell in the table shows the Pearson correlation between two variables. A clustermap is a correlation matrix to which hierarchical clustering has been applied.

1. `scatter plot`
> A scatterplot helps identify a linear relationship between two variables. A scatterplot can also be called a scattergram or a scatter diagram. It is another way to illustrate correlation between two variables.
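
To make the statistics behind these plots concrete, here is a small sketch that computes the five-number summary used by a box plot and the Pearson correlation coefficient from its definition. The values of `x` and `y` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical toy values, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five_numbers = np.percentile(x, [0, 25, 50, 75, 100])

# Pearson correlation from its definition:
# sum of products of centred values / sqrt(product of sums of squares)
x_centred = x - x.mean()
y_centred = y - y.mean()
pearson_r = (x_centred * y_centred).sum() / np.sqrt(
    (x_centred ** 2).sum() * (y_centred ** 2).sum()
)

# The hand-computed value should agree with NumPy's correlation matrix
assert np.isclose(pearson_r, np.corrcoef(x, y)[0, 1])
```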

In [None]:
# Histogram
df_raw.hist(bins=15, figsize=(20,15))

In [None]:
# Simple Boxplot
one_hot_df.boxplot(column=['age'])

In [None]:
# Get correlation matrix
corr = one_hot_df.corr().values
# Set up the figure and its dimensions
f, ax = plt.subplots(figsize=(30, 30))
# Build correlation visualisation
sns.heatmap(corr)

In [None]:
# Build a simpler and less saturated correlation matrix
# Mask the upper triangle; the builtin `bool` works across NumPy versions, unlike the `np.bool` alias
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the figure and its dimensions
f, ax = plt.subplots(figsize=(50, 30))

# Generate a colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the correlation matrix with the colormap above
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
# Build clustermap
sns.clustermap(corr)

In [None]:
# Draw the clustermap with the customized colormap `cmap` defined above
# Note: clustermap manages its own layout, so `square=True` is not used here
sns.clustermap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
# Look at the pairplot of the whole dataset
sns.pairplot(df_raw)

... but how can I **explain** these scatter plots? You can refer to the plot below, together with the Pearson correlation coefficients computed earlier, to interpret them:

In [None]:
# Colour data points according to their income category
sns.pairplot(df_raw, hue='income')

### K-means clustering

Visualising data in 2D and 3D is simple; however, more than three dimensions are difficult to visualise. K-means clustering is an **unsupervised learning approach** which aims to split the data into K distinct clusters, condensing a large number of features into a single cluster label per observation.  

There are three main steps for `K-means clustering`:
> * **Initialisation** – K points are randomly set as cluster centres (a.k.a., centroids or 'means')
> * **Assignment** – K clusters are created by associating each observation to the nearest centroid
> * **Update** – the position of each cluster's centroid is updated to the mean of the observations assigned to it
>
> The assignment and update steps are repeated until the assignments stop changing.
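
The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not what `sklearn.cluster.KMeans` does internally (scikit-learn uses a smarter initialisation and several restarts); the function name `kmeans_sketch`, its parameters and the toy points are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy K-means returning (centroids, labels); for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick k distinct observations as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: index of the nearest centroid for every observation
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned observations
        # (a centroid with no observations is left where it is)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated hypothetical groups of 2-D points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(points, k=2)
```

A fixed number of iterations keeps the sketch short; a fuller version would stop as soon as the labels no longer change.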

One issue is that K-means gives equal weight to all features used for clustering, which is misleading when the features are on different scales: features with larger values dominate the distance calculation. To correct for this, we generally scale our data by standardising it. If $x$ denotes the original data point and $x_{std}$ the standardised data point:


$x_{std} = \frac{x - {\mu}}{\sigma}$ where ${\mu}$ is the mean of the feature and ${\sigma}$ is the standard deviation of the feature
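
The effect of standardisation on distances can be seen on a tiny example. The numbers below are hypothetical (an age-like column and a much larger capital-gain-like column) and only illustrate the formula above.

```python
import numpy as np

# Hypothetical features on very different scales
features = np.array([[25.0, 1000.0],
                     [60.0, 1100.0],
                     [30.0, 5000.0]])

# Raw Euclidean distances from the first row to the other two rows:
# the second column dominates because its values are much larger
raw_dist = np.linalg.norm(features[1:] - features[0], axis=1)

# Standardise each column: subtract its mean, divide by its standard deviation
mu = features.mean(axis=0)
sigma = features.std(axis=0)  # population standard deviation, as np.std computes by default
standardised = (features - mu) / sigma

# After standardising, both columns contribute on a comparable scale
std_dist = np.linalg.norm(standardised[1:] - standardised[0], axis=1)
```

In this toy case the row with the large age difference is the closer one before scaling and the farther one after scaling, which is exactly the kind of reordering that can change cluster assignments.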


In [None]:
# Look at average values for our One-hot encoded data
one_hot_df.apply(np.mean)

In [None]:
# Standardise the One-hot encoded features
std_features = (one_hot_df
                .apply(lambda x : (x - np.mean(x)) / np.std(x)))
std_features.head()

In [None]:
# Apply K-means
## Always set a random seed, to be able to reproduce results
np.random.seed(2)
std_features_sample = std_features

avg_dist = []
# You can try building more clusters; range(1, 10) tries 1 to 9 clusters
K = range(1,10)
for k in K:
    # Select number of clusters to build
    kmeanModel = clust.KMeans(n_clusters=k).fit(std_features_sample)
    # Find the distance from each point to its closest cluster centre
    min_distances = (np.min(cdist(std_features_sample, kmeanModel.cluster_centers_, 'euclidean'), axis=1))
    # Find the average of these distances for this number of clusters
    avg_dist.append(sum(min_distances) / std_features_sample.shape[0])

plt.plot(K, avg_dist, 'bx-')
plt.grid()
plt.xlabel('# of clusters')
plt.ylabel('Average distance to closest cluster centre')
plt.title('The Elbow Method showing the optimal # of clusters')
plt.show()

The average intra-cluster distance will almost always decrease as we create more clusters. In the extreme case, we could build as many clusters as there are data points and obtain an intra-cluster distance of 0. However, the marginal benefit of each additional cluster also declines, which results in the elbow-like shape. 
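
The declining marginal benefit can also be read off numerically rather than by eye. The sketch below uses made-up average distances (they are not results from the adults dataset), and the "largest drop in gain" rule is only one crude heuristic among several.

```python
# Hypothetical average distances for k = 1..6, made up for illustration only
avg_dist_example = [10.0, 7.0, 5.0, 4.5, 4.2, 4.0]

# Improvement gained by adding one more cluster (k -> k + 1)
marginal_gain = [a - b for a, b in zip(avg_dist_example, avg_dist_example[1:])]

# Crude heuristic: the elbow is the k after which the gain drops the most.
# drop_in_gain[i] compares the gain of reaching k = i + 2 with the gain of
# going one cluster further, hence the "+ 2" when converting the index to k.
drop_in_gain = [a - b for a, b in zip(marginal_gain, marginal_gain[1:])]
elbow_k = drop_in_gain.index(max(drop_in_gain)) + 2
```

Here the gain falls from 2.0 (going from 2 to 3 clusters) to 0.5 (going from 3 to 4), so the heuristic places the elbow at 3 clusters for these made-up values.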

Based on the plot, the curve *elbows* at around 5 clusters. We will choose 3 for now and evaluate the results in the next steps. Once the features are standardised, one advantage of this algorithm is that it takes all of them into account with equal importance. 

This means that we no longer need to visualise many dimensions and hand-pick our boundaries.   

In [None]:
# Apply K-means - build models with 3 centres
## Always set a random seed, to be able to reproduce results
np.random.seed(2) # If you ran this in the previous cell, you do not need to run it again

# We can run this experiment with a varying number of clusters
number_of_clusters = 3
kmeanModel = clust.KMeans(n_clusters = number_of_clusters).fit(std_features_sample)

# Select the rows that were used for clustering (copy so that one_hot_df is left unchanged)
df_sample = one_hot_df.loc[std_features_sample.index].copy()
df_sample['cluster'] = np.argmin(cdist(std_features_sample, kmeanModel.cluster_centers_, 'euclidean'), axis = 1)

df_sample.head(10)

In [None]:
# Look at each cluster's average characteristics (will define centroid)
df_sample.groupby('cluster').mean()

In [None]:
# Look at each cluster's size
df_sample.cluster.value_counts()