# Distinct Values in Some Nominal Features

As noted in the discussion post [https://www.kaggle.com/c/cat-in-the-dat/discussion/105537](https://www.kaggle.com/c/cat-in-the-dat/discussion/105537), some of the features contain values in the test set that do not occur in the training set. This notebook provides a quick assessment of the magnitude of this issue.

The intuition here is that this problem becomes worse as the proportion of non-overlapping values increases. If only one row of the test set contains a value that never appears in the training set, that row is unlikely to devastate our analysis, though we might want to drop it. On the other hand, if half the rows of the test set contain values that never appear in the training set, any analysis that tries to ignore this discrepancy is highly dubious.
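
To make that intuition concrete, here is a minimal sketch on invented toy data (not the competition files). The helper name `unseen_share` and the `colour` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy data, invented for illustration; not the competition files.
toy_train = pd.DataFrame({"colour": ["red", "green", "blue", "red"]})
toy_test = pd.DataFrame({"colour": ["red", "teal", "blue", "pink", "red"]})


def unseen_share(train_col, test_col):
    """Fraction of rows in test_col whose value never appears in train_col.

    Missing values in test_col are counted as unseen in this sketch.
    """
    seen = set(train_col.dropna().unique())
    return (~test_col.isin(seen)).mean()


share = unseen_share(toy_train["colour"], toy_test["colour"])
print(f"{share:.0%} of toy test rows hold a value unseen in training")
```

Two of the five toy test rows ("teal" and "pink") carry values the toy training set never shows, so the share is 0.4. A share that large would be hard to ignore, while a share near zero would matter far less.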

## Initial Loading and Checking

In [None]:
import pandas as pd 

In [None]:
# df_sample_submission = pd.read_csv("../input/cat-in-the-dat/sample_submission.csv")
df_test = pd.read_csv("../input/cat-in-the-dat/test.csv")
df_train = pd.read_csv("../input/cat-in-the-dat/train.csv")

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
print(f"The training set has {len(df_train)} samples and the test set has {len(df_test)} samples.")

## Enumerating Values

First, let's do a quick count of the distinct values in each column of the training set.

In [None]:
unique_vals = []
for col in df_train.columns:
    unique_vals.append([col, df_train[col].nunique()])
unique_vals

And the same distinct value count for the columns of the test set.

In [None]:
unique_test_vals = []
for col in df_test.columns:
    unique_test_vals.append([col, df_test[col].nunique()])
unique_test_vals
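
The two lists above can be hard to compare by eye. As a sketch of one alternative, the counts can be lined up in a single table with `DataFrame.nunique()`. The toy frames and the `nom_a` column below are invented for illustration; on the real data the same pattern would apply to `df_train` and `df_test`.

```python
import pandas as pd

# Toy frames, invented for illustration; the real ones are df_train and df_test.
toy_train = pd.DataFrame(
    {"id": [0, 1, 2], "nom_a": ["x", "y", "x"], "target": [0, 1, 0]}
)
toy_test = pd.DataFrame({"id": [3, 4, 5], "nom_a": ["x", "y", "z"]})

# Align the two per-column counts on column name. A column missing from
# one frame (here "target") shows up as NaN on that side.
counts = pd.DataFrame({"train": toy_train.nunique(), "test": toy_test.nunique()})
counts["differs"] = counts["train"] != counts["test"]
print(counts)
```

Note that equal counts do not guarantee equal values: **id** has the same number of distinct values in both toy frames even though no value is shared, which is why the value sets themselves still need to be compared.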

## Searching for Discrepancies

Some columns have different numbers of unique values. Let's check to see which columns in the training set and test set contain the same values.

In [None]:
same_unique_vals = []
for col in df_test.columns:
    same_unique_vals.append([col, set(df_train[col].value_counts().index.tolist()) == set(df_test[col].value_counts().index.tolist())])
same_unique_vals

As expected, **id** has different values in the training and test sets. But it looks like **nom_7**, **nom_8**, and **nom_9** are the problem features. Let's see which values actually occur in one data set and not the other. To do so, for each of these three nominal features we'll compute two set differences: the training-set values minus the test-set values, and the test-set values minus the training-set values.

In [None]:
diff_cols = ['nom_7', 'nom_8', 'nom_9']
vals_diff = []
for col in diff_cols:
    vals_diff.append([col, 
                      set(df_train[col].value_counts().index.tolist()) - set(df_test[col].value_counts().index.tolist()), 
                      set(df_test[col].value_counts().index.tolist()) - set(df_train[col].value_counts().index.tolist())])
vals_diff
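
Set difference is not symmetric, which is why both directions are computed above. A small sketch with invented value sets shows what each direction captures:

```python
# Toy value sets, invented for illustration.
train_vals = {"a", "b", "c"}
test_vals = {"b", "c", "d", "e"}

only_in_train = train_vals - test_vals  # values the test set never shows
only_in_test = test_vals - train_vals   # values never seen during training
either_side = train_vals ^ test_vals    # symmetric difference: union of both

print(sorted(only_in_train), sorted(only_in_test), sorted(either_side))
```

The second direction, values that appear only in the test set, is usually the more worrying one for prediction, since a model fitted on the training set has no information about them.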

## Assessing Disparity

Looks like the situation may be bad for **nom_8** and **nom_9** - not only does the training set have values that don't occur in the test set, but the test set *also* has values that don't occur in the training set. Depending on the frequency with which such values occur in each column, this could be an issue.

In [None]:
diff_sizes = []
for val_diff in vals_diff:
    diff_sizes.append([val_diff[0], len(val_diff[1]), len(val_diff[2])])

for var in diff_sizes:
    print("The number of values in {col} that occur in the training set but not the test set is {count}".format(col=var[0], count=var[1]))
    print("The number of values in {col} that occur in the test set but not the training set is {count}".format(col=var[0], count=var[2]))


The situation might not be as perilous as it appeared. Let's compute the relative frequencies of these values.

In [None]:
perc_diff = []
for var in vals_diff:
    # Share of rows (not of distinct values) whose value is absent from the other set
    test_perc = len(df_test.loc[df_test[var[0]].isin(var[2])])/len(df_test)
    train_perc = len(df_train.loc[df_train[var[0]].isin(var[1])])/len(df_train)
    perc_diff.append([var[0], test_perc, train_perc])

for perc in perc_diff:
    print("The percentage of training set rows whose {col} value does not occur in the test set is {freq:.4%}".format(col=perc[0], freq=perc[2]))
    print("The percentage of test set rows whose {col} value does not occur in the training set is {freq:.4%}".format(col=perc[0], freq=perc[1]))

So while **nom_7**, **nom_8**, and **nom_9** each contain values that occur in only one of the two sets, it looks like the relative frequency of such values is fairly low. Hopefully this means that these disparate values do not affect the joint and marginal distributional properties of the features in a way that is detrimental to our attempts at prediction.