Just a brief look at what's in this data. I'll start by filling in missing numeric values with the column mean and missing categories with 'unknown'.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


maindata = pd.read_csv('../input/main_data.csv') 

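#Fill in any null data: column mean for numeric columns, 'unknown' for categories
#(assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about)
maindata['AoA'] = maindata['AoA'].fillna(maindata['AoA'].mean())
maindata['VSoA'] = maindata['VSoA'].fillna(maindata['VSoA'].mean())
maindata['Freq'] = maindata['Freq'].fillna(maindata['Freq'].mean())
maindata['CDS_freq'] = maindata['CDS_freq'].fillna(maindata['CDS_freq'].mean())
maindata['Lex_cat'] = maindata['Lex_cat'].fillna('unknown')
maindata['Broad_lex'] = maindata['Broad_lex'].fillna('unknown')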
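
The same fill-in step can also be written as a small helper, so that adding another column later only means editing a list. This is just a sketch: the function name `fill_missing` and the three-row example frame are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def fill_missing(df, numeric_cols, categorical_cols, placeholder="unknown"):
    """Return a copy with numeric NaNs replaced by the column mean and
    categorical NaNs replaced by a placeholder label."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in categorical_cols:
        out[col] = out[col].fillna(placeholder)
    return out

# Made-up rows (hypothetical values and labels), only to show the behaviour
toy = pd.DataFrame({
    "AoA": [20.0, np.nan, 30.0],
    "Lex_cat": ["nouns", None, "verbs"],
})
filled = fill_missing(toy, ["AoA"], ["Lex_cat"])
print(filled)
```

On the real data this would be called as `fill_missing(maindata, ['AoA','VSoA','Freq','CDS_freq'], ['Lex_cat','Broad_lex'])`. It returns a new frame and leaves the original untouched.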

Let's see what kind of information is here.

In [2]:
print(maindata.head(5))

I'll start by plotting the age at which a word is learned (AoA) against the numeric columns: VSoA, Freq, and CDS_freq.

In [3]:
#x and y are passed by keyword; sns.plt is not available in current seaborn, so use plt directly
sns.lmplot(x='VSoA', y='AoA', data=maindata, fit_reg=True)
plt.title('Typical age to learn word vs how many other words are known at this age')
plt.show()

In [4]:
sns.lmplot(x='Freq', y='AoA', data=maindata, fit_reg=True)
plt.title('Typical age to learn word vs word frequency in Norwegian')
plt.show()

In [5]:
sns.lmplot(x='CDS_freq', y='AoA', data=maindata, fit_reg=True)
plt.title('Typical age to learn word vs how often this word is used when talking to children')
plt.show()

Interesting. Judging by these plots, how often a word is used in the language seems to have little to do with how old a child is when they learn it. Let's see how the breakdown by word category looks.

In [6]:
sns.boxplot(x=maindata['Broad_lex'], y=maindata['AoA'])
plt.xticks(rotation=45)
plt.show()


In [7]:
print(maindata[maindata['Broad_lex']=='games & routines']['Translation'].head(10))

In [8]:
print(maindata[maindata['Broad_lex']=='nominals']['Translation'].head(10))

In [9]:
print(maindata[maindata['Broad_lex']=='closed-class']['Translation'].head(10))

It looks like there might be some relationship between the word category and the age at which a word is learned. Let's try to put that relationship into numbers by looking at the correlation.

In [10]:
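#Convert categorical data to numeric values
#(LabelEncoder assigns integer codes in sorted label order, so the codes carry no real ordering)
var_mod = ['Broad_lex','Lex_cat']
le = LabelEncoder()
for col in var_mod:
    maindata[col] = le.fit_transform(maindata[col])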
temp = ['AoA','VSoA','Lex_cat','Broad_lex','Freq','CDS_freq']
corr = maindata[temp].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

sns.heatmap(corr, mask=mask,linewidths=.5,cbar_kws={"shrink":.5})
print(corr)

VSoA still looks like the strongest indicator, with everything else showing very little correlation with AoA. It's interesting to see that the typical age at which a word is learned is so strongly correlated with how many words the child knows when they learn that word.
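
One caveat on the two category columns: they were label-encoded before the correlation was computed, and the integer codes follow sorted label order, so Pearson's r for Lex_cat and Broad_lex depends on that arbitrary ordering. A different measure that does not depend on the order of the labels is the correlation ratio (eta squared), the share of the variance in AoA that is explained by category membership. The sketch below is a swapped-in technique and not what the notebook computes above; the function name `correlation_ratio` and the example values are made up for illustration.

```python
import pandas as pd

def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by the total sum
    of squares. Ranges from 0 (groups share one mean) to 1 (no spread
    within any group)."""
    df = pd.DataFrame({"cat": list(categories), "val": list(values)}).dropna()
    grand_mean = df["val"].mean()
    grouped = df.groupby("cat")["val"]
    # Group size times squared distance of each group mean from the grand mean
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((df["val"] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Made-up values, only to show the two extremes
separated = correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])
mixed = correlation_ratio(["a", "b", "a", "b"], [1.0, 1.0, 3.0, 3.0])
print(separated, mixed)
```

On the real data this would be `correlation_ratio(maindata['Broad_lex'], maindata['AoA'])`. It works on either the original labels or the encoded integers, since only group membership matters.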

In the future, it would be interesting to dig deeper into the progression of which words are learned first. The boxplot above seems to indicate that words related to games or animal noises are learned earlier on, but I couldn't think of any other way to look at that information.
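
One possible starting point is to rank the categories by their median AoA, which turns the boxplot into a table that can be sorted and compared. This is only a sketch: the helper name `rank_categories_by_age` is hypothetical, the category labels are the ones printed earlier, and the AoA values in the example frame are made up.

```python
import pandas as pd

def rank_categories_by_age(df, category_col="Broad_lex", age_col="AoA"):
    """Summarise the age column per category, earliest-learned category first."""
    return (
        df.groupby(category_col)[age_col]
        .agg(["count", "median", "min", "max"])
        .sort_values("median")
    )

# Made-up AoA values, only to show the shape of the result
toy = pd.DataFrame({
    "Broad_lex": ["games & routines", "nominals", "nominals", "closed-class"],
    "AoA": [14.0, 20.0, 22.0, 30.0],
})
ranked = rank_categories_by_age(toy)
print(ranked)
```

To use it on `maindata`, it would need to run before the label-encoding cell, since that cell overwrites Broad_lex with integer codes.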