# Representation Learning

## What is Representation Learning?

I would describe Representation Learning as transforming raw data into machine-understandable vector representations that are learned from the relations within the data. One very well-known example is word vectors: words are represented as vectors in a way that preserves their semantic similarities. Here is a nice example from [GloVe embeddings](https://nlp.stanford.edu/projects/glove/):

<img src="https://nlp.stanford.edu/projects/glove/images/man_woman.jpg" style="width: 400px;"/>
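To make "preserving semantic similarities" a bit more concrete, here is a minimal sketch with toy vectors. The numbers are invented for illustration and are not real GloVe embeddings; only the idea (similar words point in similar directions, and analogous word pairs share a similar offset) is what the picture above shows.

```python
import math

# Toy 3-dimensional "word vectors"; the values are invented for illustration.
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Offsets between analogous pairs (man -> woman, king -> queen).
offset_gender = [w - m for m, w in zip(vectors["man"], vectors["woman"])]
offset_royal = [q - k for k, q in zip(vectors["king"], vectors["queen"])]

print(cosine(vectors["king"], vectors["man"]))    # higher similarity
print(cosine(vectors["king"], vectors["woman"]))  # lower similarity
print(cosine(offset_gender, offset_royal))        # the two offsets are parallel
```

In this toy space the two offsets are parallel by construction; in real embeddings the relation is only approximate.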


## What are we going to do with Representation Learning?

Representation learning has two main benefits:
* It allows us to visually understand the data in 2D plots.
* Learned vectors can later be reused in other Machine Learning tasks. This is called Transfer Learning. It is especially useful when the representation is learned on a large dataset and then applied to a problem with little data.

We are going to use the Age, Country, Gender, Job Title and Education information of the survey respondents as input categories. Then we will train a model to predict how long it takes them to complete the survey, their salaries, whether they use Python, R, GPUs, AutoML, Cloud Services, TF-Keras or Pytorch-Fast.ai, and whether their favorite media source is Kaggle. All these inputs and outputs interact during training, which should give better representations of the input categories.

## How are we going to do that?

We will train a neural network model in Keras with the Tensorflow backend and feed these categories to embedding layers.

![tfkeras](https://i.imgur.com/aPTRfSn.png)
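As a rough intuition for what an embedding layer does with a label-encoded category, the sketch below shows that the forward pass is just a row lookup in a trainable table, which is equivalent to multiplying a one-hot vector by that table. The table values and helper names are made up for illustration and are not taken from the trained model.

```python
# Toy embedding table: 4 categories, embedding size 3 (values are invented).
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(table, index):
    """Forward pass of an embedding layer: pick the row for this category id."""
    return table[index]


def one_hot_matmul(table, index):
    """Equivalent formulation: one-hot vector times the embedding table."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(table))]
    n_dims = len(table[0])
    return [
        sum(one_hot[i] * table[i][j] for i in range(len(table)))
        for j in range(n_dims)
    ]


print(lookup(embedding_table, 2))
print(one_hot_matmul(embedding_table, 2))
```

The lookup form is what makes embeddings cheap: no multiplication with a large sparse one-hot matrix is needed.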

### First let's make our Numpy-Tensorflow deterministic. God does not play dice, right?
We set the seeds and disable parallelism. With different seeds, you may get slightly different embeddings.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import backend as K


np.random.seed(0)
tf.set_random_seed(0)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
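The effect of seeding can be illustrated without any framework. The sketch below uses Python's standard `random` module instead of NumPy or Tensorflow, so it only demonstrates the principle: the same seed reproduces the same pseudo-random sequence, while a different seed gives a different one. The `draw` helper is hypothetical.

```python
import random


def draw(seed, n=3):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_a = draw(0)
same_b = draw(0)
other = draw(1)

print(same_a == same_b)  # same seed, same sequence
print(same_a == other)   # different seed, different sequence
```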

### Reading the data that is relevant to us

In [None]:
df = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv', skiprows=1)

rename_cols = {"Duration (in seconds)": "Duration",
               "What is your age (# years)?": "Age",
               "What is your gender? - Selected Choice": "Gender",
               "In which country do you currently reside?": "Country",
               "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?": "Education",
               "Select the title most similar to your current role (or most recent title if retired): - Selected Choice": "JobTitle",
               "What is your current yearly compensation (approximate $USD)?": "Salary",
               "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (forums, blog, social media, etc)": "FavoriteKaggle",
               "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python": "PythonUser",
               "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R": "RUser",
               "Which types of specialized hardware do you use on a regular basis?  (Select all that apply) - Selected Choice - GPUs": "GPUUser",
               "Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice - None": "CloudUser",
               "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   TensorFlow ": "TensorflowUser",
               "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Keras ": "KerasUser",
               "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  PyTorch ": "PytorchUser",
               "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Fast.ai ": "FastaiUser",
               "Which automated machine learning tools (or partial AutoML tools) do you use on a regular basis?  (Select all that apply) - Selected Choice - None": "AutoMLUser"
              }

df = df.rename(columns=rename_cols)[list(rename_cols.values())]

for col in ["FavoriteKaggle", "PythonUser", "RUser", "GPUUser", 
            "TensorflowUser", "KerasUser", "PytorchUser", "FastaiUser", "AutoMLUser", "CloudUser"]:
    df[col] = df[col].notnull()
    if col in ["AutoMLUser", "CloudUser"]:
        df[col] = ~df[col]
        
df["TensorflowKerasUser"] = df["TensorflowUser"] | df["KerasUser"]
df["PytorchFastaiUser"] = df["PytorchUser"] | df["FastaiUser"]

df = df.drop(["TensorflowUser", "KerasUser", "PytorchUser", "FastaiUser"], axis=1)
binary_columns = ["FavoriteKaggle", "PythonUser", "RUser", "GPUUser", 
                  "TensorflowKerasUser", "PytorchFastaiUser", "AutoMLUser", "CloudUser"]

df.head()
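The flag logic in the cell above can be sketched on plain Python lists. In the survey's multi-select columns, a cell holds the option text when the option was ticked and is missing otherwise. For the two columns that map to the "None" option, ticking the box means the respondent does *not* use the technology, so the flag is inverted. The sample values below are invented.

```python
# Hypothetical raw answers: option text when selected, None when not selected.
raw = {
    "PythonUser": ["Python", None, "Python"],
    "CloudUser": [None, "None", None],  # this column is the "None" option
}
inverted_columns = {"CloudUser"}

flags = {}
for col, values in raw.items():
    selected = [v is not None for v in values]   # like Series.notnull()
    if col in inverted_columns:
        selected = [not s for s in selected]     # like ~Series
    flags[col] = selected

print(flags)
```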

### Preprocessing the data

* Replacing NaNs
* Clipping Survey Duration and log scaling salaries
* Getting ready for Neural Network: Standard scaling and label encoding
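To show what the last step computes, here is a small pure-Python sketch of label encoding and standard scaling. It mimics the behaviour of scikit-learn's `LabelEncoder` (integer codes over the sorted class labels) and `StandardScaler` (subtract the mean, divide by the population standard deviation); the helper names and sample values are hypothetical, and the notebook itself uses the scikit-learn classes.

```python
import math


def label_encode(values):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]


def standard_scale(values):
    """Rescale to zero mean and unit variance (population std, ddof=0)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std, [(v - mean) / std for v in values]


classes, codes = label_encode(["USA", "India", "UK", "India"])
mean, std, scaled = standard_scale([2.0, 4.0, 6.0])

print(classes, codes)
print(mean, std, scaled)
```

The integer codes feed the embedding layers, while the scaled values become regression targets on a comparable scale.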

In [None]:
import re
from sklearn.preprocessing import LabelEncoder, StandardScaler

number_re = re.compile(r"[\d,]+")
COUNT_LIMIT = 100
NO_SALARY_REPLACE = "100"
categoricals = ["Age", "Gender", "Country", "Education", "JobTitle"]
le_dict = dict()
ss_dict = dict()

def extract_log_salary(x):
    x = str(x)
    x = number_re.findall(x)
    if len(x) == 0:
        return None
    x = [int(num.replace(",", "")) for num in x]
    return np.log10(np.mean(x))

# replace long country names with their short form for better visualization
df.loc[df["Country"] == "United Kingdom of Great Britain and Northern Ireland", "Country"] = "UK"
df.loc[df["Country"] == "United States of America", "Country"] = "USA"

# log scale salaries so that our model doesn't focus on only rich countries (considering that the loss is MSE)
df.loc[df["JobTitle"] == "Not employed", "Salary"] = NO_SALARY_REPLACE
df.loc[df["JobTitle"] == "Student", "Salary"] = NO_SALARY_REPLACE
df["Salary"] = df["Salary"].apply(extract_log_salary)
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# People who spend more than 30 minutes on the survey are assumed to be done after 30 minutes
df["Duration"] = np.clip(df["Duration"], 0, 1800)*1.0

for col in df.columns:
    if col in categoricals:
        # Merge all the rare categories into one category to prevent overfitting
        df.loc[df[col].isnull(), col] = "No {}".format(col)
        df.loc[df.groupby(col)["Duration"].transform("count") < COUNT_LIMIT, col] = "Other {}".format(col)
        le_dict[col] = LabelEncoder()
        df[col + "_le"] = le_dict[col].fit_transform(df[col])
    elif col in ["Salary", "Duration"]:
        ss_dict[col] = StandardScaler()
        df[col + "_ss"] = ss_dict[col].fit_transform(df[col].values.reshape(-1, 1))

df.head()
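The salary handling above is easier to follow on a few sample strings. The sketch below restates the same idea as `extract_log_salary` using only the standard library: pull the numbers out of a salary-range string, average them to get the midpoint, and take the base-10 logarithm. The function name and the sample strings are illustrative assumptions.

```python
import math
import re

number_re = re.compile(r"[\d,]+")


def log_salary_midpoint(text):
    """Return log10 of the mean of all numbers found in `text`, or None."""
    numbers = [int(n.replace(",", "")) for n in number_re.findall(str(text))]
    if not numbers:
        return None
    return math.log10(sum(numbers) / len(numbers))


for sample in ["30,000-39,999", "$0-999", "> $500,000", "100", float("nan")]:
    print(repr(sample), "->", log_salary_midpoint(sample))
```

On the log scale, the gap between 1,000 and 10,000 counts as much as the gap between 10,000 and 100,000, which is why a squared-error loss no longer concentrates on the highest salaries.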

### Neural Network Architecture

* Embeddings of size 8 for each category are concatenated
* 2 Dense-BN-Relu-Dropout hidden layers to add some non-linearity
* 2 output heads: Duration and Salary as linear outputs; FavoriteKaggle, PythonUser, RUser, GPUUser, TensorflowKerasUser, PytorchFastaiUser, AutoMLUser, CloudUser as sigmoid (binary) outputs.

In [None]:
from IPython.display import SVG
from keras.utils import model_to_dot
from keras.layers import *
from keras.models import Model

EMB_SIZE = 8
dense_outputs = ["Duration_ss", "Salary_ss"]

def nn_block(input_layer, size, dropout_rate, activation):
    out_layer = Dense(size, activation=None)(input_layer)
    out_layer = BatchNormalization()(out_layer)
    out_layer = Activation(activation)(out_layer)
    out_layer = Dropout(dropout_rate)(out_layer)
    return out_layer


def get_model():
    cat_inputs = []
    cat_embs = []
    for cat in categoricals:
        cat_input = Input(shape=(1,), name=cat + "_input")
        cat_emb = Embedding(len(le_dict[cat].classes_), EMB_SIZE, name=cat)(cat_input)
        cat_embs.append(Flatten(name=cat + "_1D")(cat_emb))
        cat_inputs.append(cat_input)
    
    hidden_layer = concatenate(cat_embs)
    hidden_layer = nn_block(hidden_layer, 64, 0.1, "relu")
    hidden_layer = nn_block(hidden_layer, 16, 0.1, "relu")
    dense_out = Dense(len(dense_outputs), name="linear_out")(hidden_layer)
    binary_out = Dense(len(binary_columns), activation="sigmoid", name="binary_out")(hidden_layer)
    
    model = Model(inputs=cat_inputs, outputs=[dense_out, binary_out])
    return model

def get_input(df):
    return [df[cat + "_le"].values for cat in categoricals]

def get_output(df):
    return [df[dense_outputs].values, df[binary_columns].values]

SVG(model_to_dot(get_model(), dpi=64).create(prog='dot', format='svg'))
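To get a feel for the size of this architecture, the layer widths can be worked out by hand. The sketch below counts the parameters of the embedding and dense layers (batch normalization parameters are left out). The category counts are hypothetical placeholders; the real ones come from `len(le_dict[cat].classes_)`.

```python
EMB_SIZE = 8

# Hypothetical category counts, for illustration only.
n_classes = {"Age": 11, "Gender": 4, "Country": 40, "Education": 7, "JobTitle": 12}


def dense_params(n_in, n_out):
    """A dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


# One embedding row of size EMB_SIZE per category value.
embedding_params = {cat: n * EMB_SIZE for cat, n in n_classes.items()}

# The five flattened embeddings are concatenated into one vector.
concat_width = len(n_classes) * EMB_SIZE

hidden1 = dense_params(concat_width, 64)
hidden2 = dense_params(64, 16)
linear_out = dense_params(16, 2)   # Duration_ss, Salary_ss
binary_out = dense_params(16, 8)   # the 8 binary columns

print(embedding_params)
print(concat_width, hidden1, hidden2, linear_out, binary_out)
```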

### Model Training

* First we check that the model is not overfitting by splitting the data into train/validation sets
* Then the model is retrained on the whole dataset to learn better embeddings

In [None]:
from sklearn.model_selection import train_test_split
from keras.optimizers import Nadam


class ModelConfig:
    batch_size = 64
    training_scheme = [(0.002, 10), (0.0005, 5), (0.0001, 5)] # lr, epochs
    loss = ["mean_squared_error", "binary_crossentropy"] # MSE for linear output, bce for binary output
    optimizer = Nadam

train_df, val_df = train_test_split(df, shuffle=True, random_state=0, test_size=0.2)
model = get_model()
for lr, epochs in ModelConfig.training_scheme:
    model.compile(loss=ModelConfig.loss, optimizer=ModelConfig.optimizer(lr=lr))
    hist = model.fit(get_input(train_df), get_output(train_df), batch_size=ModelConfig.batch_size, epochs=epochs,
                     validation_data=(get_input(val_df), get_output(val_df)),
                     verbose=2, shuffle=True)
    
model = get_model()
for lr, epochs in ModelConfig.training_scheme:
    model.compile(loss=ModelConfig.loss, optimizer=ModelConfig.optimizer(lr=lr))
    hist = model.fit(get_input(df), get_output(df), batch_size=ModelConfig.batch_size, epochs=epochs,
                     verbose=0, shuffle=True)
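The two losses in `ModelConfig.loss` can be written out by hand. The sketch below computes mean squared error for the linear head and binary cross-entropy for the sigmoid head on tiny invented examples. Since the notebook does not pass `loss_weights`, Keras minimizes the plain sum of the two. The clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error, used for the Duration/Salary head."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, used for the 8 yes/no outputs."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


linear_loss = mse([0.5, -1.0], [0.0, -1.0])
binary_loss = binary_crossentropy([1, 0], [0.5, 0.5])
total_loss = linear_loss + binary_loss  # equal weights for both heads

print(linear_loss, binary_loss, total_loss)
```

Because both heads share the same hidden layers and embeddings, every target contributes gradients to the category vectors.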

### Visualizing the embeddings

* Since the embeddings are trained with size 8, we need PCA to visualize them in 2 dimensions (top principal components).
* Plotly is used to create interactive plots so that you can zoom in/out etc.
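For readers who want to see what "top principal components" means mechanically, here is a small pure-Python sketch. It centers the data, builds the covariance matrix, and finds the two leading directions with power iteration and deflation. This is a teaching substitute: scikit-learn's `PCA` relies on an SVD-based solver, not on this loop, and the toy 3-dimensional "embeddings" are invented.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(mat, vec):
    return [dot(row, vec) for row in mat]


def top_eigenpair(mat, iters=500):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    vec = [1.0 / math.sqrt(len(mat))] * len(mat)
    for _ in range(iters):
        nxt = mat_vec(mat, vec)
        norm = math.sqrt(dot(nxt, nxt))
        vec = [x / norm for x in nxt]
    return dot(vec, mat_vec(mat, vec)), vec


def pca_2d(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    val1, pc1 = top_eigenpair(cov)
    # Deflation: subtract the first component, then search for the next one.
    deflated = [[cov[i][j] - val1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = top_eigenpair(deflated)
    return [[dot(r, pc1), dot(r, pc2)] for r in centered]


toy_embeddings = [
    [-2.0, -1.0, 0.1],
    [-1.0, 1.0, -0.1],
    [0.0, 0.0, 0.0],
    [1.0, -1.0, 0.1],
    [2.0, 1.0, -0.1],
]
projected = pca_2d(toy_embeddings)
for point in projected:
    print(point)
```

The sign of each axis is arbitrary, so a plot may come out mirrored between runs or implementations without changing the distances between points.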

In [None]:
from sklearn.decomposition import PCA
from plotly.offline import init_notebook_mode, iplot 
import plotly.graph_objs as go
init_notebook_mode(connected=True)


def plot_emb(model, cat, exclude=None):
    L = le_dict[cat].classes_
    W = model.get_layer(cat).get_weights()[0]
    if exclude:
        include = np.where([l not in exclude for l in L])[0]
        L = L[include]
        W = W[include]
    
    W = PCA(n_components=2, random_state=0).fit_transform(W)

    pobj = [go.Scatter(x=W[:, 0], y=W[:, 1], mode = 'markers+text', text=L, hoverinfo="none", textposition="bottom center",
                      marker = dict(size = 20, color = 'rgba(64, 64, 192, .7)'))]

    fig = go.Figure(pobj, layout=go.Layout(title=go.layout.Title(text=cat)))

    iplot(fig)
    
plot_emb(model, "JobTitle", exclude=["Student", "Not employed", "No JobTitle"])

* It seems **Data Analyst** and **Business Analyst** are very similar to each other.
* **Statistician** and **Research Scientist** are the outliers which are not similar to other jobs.
* We also observe a triangle with **Software Engineer**, **Data Engineer** and **Data Scientist**.

In [None]:
plot_emb(model, "Education")

* There is no significant difference between **Bachelor's degree** and **Master's degree**.
* People who **prefer not to answer** this question are most similar to the ones with **no formal education past high school**.

In [None]:
plot_emb(model, "Country")

* Country embeddings seem to capture geographical, economic and cultural similarities.
* The x axis seems to be an economic axis. It may correlate with salaries.
* The y axis seems more geographical and cultural. Above zero, we see more Mediterranean or South Asian countries. Below zero, we see more East Asian and South American countries.
* USA, Canada, Australia and Western European countries are clustered together. The rest of the EU forms another cluster. There is more diversity along the y axis in the middle of the x axis.

In [None]:
plot_emb(model, "Age")

* **18-21** and **22-24** are probably university students and they are very similar.
* **25-29** are mostly at the beginning of their careers.
* After the age of 30, there is no significant difference between the age groups.

# Parallel Universes

Using our model, we can actually do some sensitivity analysis. I wonder about the expected salary and skills for my profile. I also wonder how much my salary and skills would be different if I made different decisions in my life. Let's have a look at me in 6 parallel universes.

* **C-137**: The original universe. A 28-year-old male data scientist living in the Netherlands with a Master's degree.
* **C-510**: I was living in Turkey before. I decided to stay in Turkey instead of moving to the Netherlands.
* **C-425**: I was born 10 years earlier.
* **C-841**: I decided to do a PhD after my MSc.
* **C-707**: I was working as a Software Engineer before. I decided to stay a Software Engineer.
* **C-210**: This universe went a bit radical. It seems I decided to change gender in that reality.

**DISCLAIMER:** During this session, I may have some very subjective theories and opinions on the outcome.
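Each universe above changes exactly one attribute of the original profile, which is what makes this a one-at-a-time sensitivity analysis. A compact way to express that idea is sketched below; the `make_universes` helper is hypothetical, and the attribute values follow the survey's own labels.

```python
base = {"Age": "25-29", "Gender": "Male", "Country": "Netherlands",
        "Education": "Master’s degree", "JobTitle": "Data Scientist"}

# Each universe overrides a single attribute of the base profile.
changes = {
    "C-510 (Turkey)": {"Country": "Turkey"},
    "C-425 (Old)": {"Age": "35-39"},
    "C-841 (PhD)": {"Education": "Doctoral degree"},
    "C-707 (SE)": {"JobTitle": "Software Engineer"},
    "C-210 (Female)": {"Gender": "Female"},
}


def make_universes(base, changes, base_name="C-137 (Original)"):
    """Return the base profile plus one modified copy per entry in `changes`."""
    rows = [dict(base, Universe=base_name)]
    for name, override in changes.items():
        rows.append(dict(base, Universe=name, **override))
    return rows


universe_rows = make_universes(base, changes)
for row in universe_rows:
    print(row)
```

Keeping every other attribute fixed means any difference in the predictions can be attributed, within the model, to the single attribute that changed.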

In [None]:
def make_prediction(model, df):
    df = df.copy()
    for col in categoricals:
        df[col + "_le"] = le_dict[col].transform(df[col])
    dense_result, bin_result = model.predict(get_input(df))
    
    for i, col in enumerate(binary_columns):
        df[col] = bin_result[:, i]
        df[col] = df[col].apply(lambda x: "{}%".format(int(x*100)))

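    # inverse_transform expects a 2D array, so keep the column axis and flatten afterwards
    df["Duration"] = ss_dict["Duration"].inverse_transform(dense_result[:, [0]]).ravel()
    df["Salary"] = np.power(10, ss_dict["Salary"].inverse_transform(dense_result[:, [1]]).ravel())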
    df["Salary"] = df["Salary"].apply(lambda x: "{}000".format(int(x//1000)))
    return df

universes = [{"Universe": "C-137 (Original)", "Age":"25-29", "Gender":"Male", "Country":"Netherlands", 
                  "Education":"Master’s degree", "JobTitle":"Data Scientist"},
             {"Universe": "C-510 (Turkey)", "Age":"25-29", "Gender":"Male", "Country":"Turkey", 
                  "Education":"Master’s degree", "JobTitle":"Data Scientist"},
             {"Universe": "C-425 (Old)", "Age":"35-39", "Gender":"Male", "Country":"Netherlands", 
                  "Education":"Master’s degree", "JobTitle":"Data Scientist"},
             {"Universe": "C-841 (PhD)", "Age":"25-29", "Gender":"Male", "Country":"Netherlands", 
                  "Education":"Doctoral degree", "JobTitle":"Data Scientist"},
             {"Universe": "C-707 (SE)", "Age":"25-29", "Gender":"Male", "Country":"Netherlands", 
                  "Education":"Master’s degree", "JobTitle":"Software Engineer"},
             {"Universe": "C-210 (Female)", "Age":"25-29", "Gender":"Female", "Country":"Netherlands", 
                  "Education":"Master’s degree", "JobTitle":"Data Scientist"}]

universes = pd.DataFrame(universes)
universes

In [None]:
def compare(res, topics, title):
    fig = go.Figure(data=[
        go.Bar(name=topic, x=res[topic], y=res["Universe"], orientation="h") for topic in topics
    ], layout=go.Layout(title=go.layout.Title(text=title)))
    iplot(fig)

res = make_prediction(model, universes)

compare(res, ["Salary"], "Salary")
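The salary shown in the chart is obtained by undoing the preprocessing in reverse order: first the standard scaling, then the base-10 logarithm. The sketch below walks through that arithmetic for a single prediction. The mean and standard deviation are made-up numbers; in the notebook they live inside `ss_dict["Salary"]`.

```python
def decode_salary(z, mean_log10, std_log10):
    """Turn a standardized log-salary prediction back into a salary label."""
    log_salary = z * std_log10 + mean_log10      # undo standard scaling
    salary = 10 ** log_salary                    # undo log10
    return "{}000".format(int(salary // 1000))   # floor to thousands, as in make_prediction


# Hypothetical scaler statistics and model output.
print(decode_salary(0.5, mean_log10=4.0, std_log10=0.8))
print(decode_salary(-2.0, mean_log10=4.0, std_log10=0.8))
```

Note that this formatting floors to whole thousands, so a predicted salary below 1000 is displayed as "0000".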

## Salary Conclusion
* Salary data may be a bit noisy: I believe some people reported their gross salary while others reported their net salary.
* Me in **Turkey** earns significantly less than me in the original universe. As far as I know, the cost of living in Istanbul is around one third of that in Amsterdam, but the same person in Amsterdam seems to earn almost 6 times more. So the purchasing power of a data scientist in Istanbul is roughly half of that in Amsterdam. This can be explained by the dramatic loss of value of the Turkish Lira in the last couple of years.
* Having a **PhD** degree seems to make the salary worse. I guess this is due to data bias. People who are around 25-29 years old are probably still doing their PhD, and PhD positions are paid less than private sector jobs. So we should interpret it as "doing a PhD" instead of "having a PhD".
* The **Software Engineer** version of me also makes less money. Even though Software Engineering and Data Science are equally challenging, the recent hype in Data Science has probably made data scientists more expensive.
* The **10 years older** version of me has the highest compensation. Given that age and years of experience are highly correlated, this makes sense.
* The most surprising result for me was the **female** one. I always hear that fewer women are interested in Data Science and that companies try to keep a gender balance and therefore try to hire more women. So it looks like a low-supply, high-demand situation, and my basic knowledge of economics tells me that women should earn more than men. But the data says the opposite. There is a clear **gender bias** here. I hope there is a reasonable hidden factor that can explain this, because I kept all other attributes the same while changing the gender. Otherwise, if women are getting less when it comes to job offers, it is totally unfair.
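The purchasing-power estimate in the Turkey bullet is a ratio of two ratios. A quick check with the rough figures quoted there (salary about one sixth, cost of living about one third) is sketched below; both inputs are the author's approximations, not measured values.

```python
from fractions import Fraction

salary_ratio = Fraction(1, 6)  # Istanbul salary / Amsterdam salary (rough figure)
cost_ratio = Fraction(1, 3)    # Istanbul cost of living / Amsterdam cost of living (rough figure)

# Relative purchasing power = relative salary divided by relative cost of living.
purchasing_power_ratio = salary_ratio / cost_ratio

print(purchasing_power_ratio)
```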

In [None]:
compare(res, ["Duration"], "Duration")

## Duration Conclusion
Duration (survey time) was a very experimental target. I wanted to see if there are any patterns in how long it takes people to complete the survey. I re-ran this experiment offline with other countries too. My hypothesis is that Duration measures fluency in English and attention to detail. People whose mother tongue is not closely related to English seem to spend more time on the survey. In addition, having a PhD or being older slightly increases the time spent on the survey.

In [None]:
compare(res, binary_columns, "Technology")

## Technology Conclusion
* Living in **Turkey** makes it more likely that **Kaggle** is the favorite ML media source. Given that the Turkish education system is more competitive than the Dutch one, the competitive nature of Kaggle may make it a better learning place for Turkish people.
* **Software Engineers** use both **Python** and **R** less, while they use **AutoML** tools more. Some of them might be building machine learning applications in Java/C++ etc. using AutoML tools, without needing to know machine learning in detail.
* **Keras-Tensorflow** usage is correlated with **Pytorch-Fastai** usage. None of the 5 attributes seems to be related to the choice between Tensorflow and Pytorch.
* **Cloud** usage is another metric that is insensitive to the 5 attributes.
* **PhDs** have the highest **GPU** usage. In an era in which Neural Networks are SOTA in many tasks, this makes sense.

### Thanks for reading! Enjoyed the notebook? You can fork it and:
* Play with the model and get more experience on Representation Learning
* Export the learned embeddings for Transfer Learning
* Try the model with your age, gender, education etc and learn about your profile
* Thinking about doing a PhD? Do you have a job offer in another country? You can get insights on your next move.
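For the export idea in the list above, one possible starting point is sketched here. It assumes the labels and the weight matrix have already been pulled out of the model the same way `plot_emb` does (`le_dict[cat].classes_` and `model.get_layer(cat).get_weights()[0]`) and converted to plain lists. The `embeddings_to_csv` helper and the sample values are hypothetical.

```python
import csv
import io


def embeddings_to_csv(labels, weights):
    """Serialize one embedding row per label as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    n_dims = len(weights[0])
    writer.writerow(["label"] + ["dim_{}".format(i) for i in range(n_dims)])
    for label, row in zip(labels, weights):
        writer.writerow([label] + list(row))
    return buffer.getvalue()


# Invented 2-dimensional vectors standing in for the learned size-8 embeddings.
csv_text = embeddings_to_csv(["Data Scientist", "Student"],
                             [[0.1, -0.2], [0.3, 0.4]])
print(csv_text)
```

The resulting text can be written to a file and joined onto another dataset by label, which is the simplest form of reusing the learned vectors.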