# When do children learn words?
by ChrisDewa
Thanks for sharing this amazing dataset that was so helpful for learning the gist of pandas and other scipy libraries.

In [None]:
dataset_filepath = "../input/when-do-children-learn-words/main_data.csv"

## Column name keys in the main_data.csv
_the dataset for the current project_

- ID_CDI_I: Word ID from the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories, version 1
- ID_CDI_II: Word ID from the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories, version 2
- Word_NW: The word in Norwegian
- Word_CDI: The form of the word found in the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories
- Translation: the English translation of the Norwegian word
- AoA: how old a child generally is when they learn this word, in months (estimated from the MacArthur-Bates Communicative Development Inventories)
- VSoA: how many other words a child generally knows when they learn this word (rounded up to the nearest 10)
- Lex_cat: the specific part of speech of the word
- Broad_lex: the broad part of speech of the word
- Freq: a measure of how commonly this word occurs in Norwegian
- CDS_freq: a measure of how commonly this word occurs when a Norwegian adult is talking to a Norwegian child

## Data Preparation and Cleaning

The dataset contains identifiers from the MacArthur-Bates Communicative Development Inventories that are not useful for this analysis, so they are removed. Because the author of this project has no knowledge of Norwegian, the columns holding the word in its original language are removed as well.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option("display.precision", 4)

%matplotlib inline

In [None]:
df = pd.read_csv(dataset_filepath)

In [None]:
# DataFrame information
df.info()

In [None]:
# Description of the DataFrame's numerical variables (AoA, VSoA)
df[['AoA', 'VSoA']].describe()

In [None]:
# Word speech categories (wrapping each array in a Series pads the shorter column with NaN)
d = {'Broad speech categories': df['Broad_lex'].unique(), 'Specific speech categories': df['Lex_cat'].unique()}
cat_vars = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
cat_vars

In [None]:
# Drop the ID columns and the columns holding the word in the original language,
# because they are not useful for the analysis
df.drop(columns=['ID_CDI_I', 'ID_CDI_II', 'Word_NW', 'Word_CDI'], inplace=True)

# Replace categorical NaNs with 'unknown'
df['Lex_cat'] = df['Lex_cat'].fillna('unknown')
df['Broad_lex'] = df['Broad_lex'].fillna('unknown')

# Keep only rows that have both AoA and VSoA. There is no proper way to fill these gaps
# because of the nature of the words with regard to age and development: "toast" is much
# simpler than "yourself" from a developmental point of view
df = df.dropna(subset=['AoA', 'VSoA']).copy()

# Fill the remaining numeric NaNs by linear interpolation,
# which for a single gap averages the cells above and below
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].interpolate()
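
The cleaning steps above can be tried on a small made-up frame to see what each one does. This is only a sketch: the rows and values below are hypothetical and merely mimic the column names of main_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not taken from the real dataset)
toy = pd.DataFrame({
    "Translation": ["mommy", "hi", "toast", "yourself", "ball"],
    "AoA": [12.0, 13.0, np.nan, 30.0, 16.0],
    "VSoA": [10.0, 20.0, 150.0, np.nan, 40.0],
    "Lex_cat": ["people", None, "food and drink", "pronouns", "toys"],
    "Freq": [5.0, np.nan, 2.0, 7.0, 9.0],
})

# 1. Label missing categories instead of dropping the rows
toy["Lex_cat"] = toy["Lex_cat"].fillna("unknown")

# 2. Keep only rows where both AoA and VSoA are known
toy_clean = toy.dropna(subset=["AoA", "VSoA"]).copy()

# 3. Linearly interpolate the remaining gaps in the numeric columns only
num_cols = toy_clean.select_dtypes(include="number").columns
toy_clean[num_cols] = toy_clean[num_cols].interpolate()

print(toy_clean)
```

Note that interpolation depends on row order: the filled value comes from whichever rows happen to sit above and below the gap, so the result is a rough placeholder rather than an estimate for that specific word.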

## Exploratory Analysis and Visualization


In [None]:
# seaborn and pyplot were already imported above; only matplotlib itself is new here
import matplotlib

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
sns.pairplot(df.drop(columns=['Freq', 'CDS_freq']), hue='Broad_lex');

In the graph above we can explore how words are distributed across categories.
It is notable that closed-class words such as "that" and "the" appear late, at around 20 months, yet they make up the majority of the words learned later on.
We can also observe that both the age of acquisition (AoA) and the vocabulary size at acquisition (VSoA) cluster at around 18 to 30 months. This is commonly known as the "language explosion", which is also typically seen in English-speaking children at around 18 to 30 months.

In [None]:
sns.histplot(df['AoA'], bins=10);
plt.title("Graph 2. Distribution of age of acquisition of words");

This graph shows the distribution of learned words by age, with the majority of words being learned at around 18 to 30 months of age.

In [None]:
sns.lmplot(x='AoA', y='VSoA', data=df, hue='Broad_lex');
plt.title("Graph 3. Age at word learning vs. words known at that age");

The graph shows the speed at which the majority of words are learned in the interval from 18 to 30 months.

## Asking and Answering Questions
I'll examine the data and answer questions regarding language development.

#### Q1: What are the first words of Norwegian children?

In [None]:
# The interval taken is one month from the first word, as per regular developmental milestones (according to the Denver charts).
age_of_first_word = df.AoA.min()
df[df.AoA.between(age_of_first_word, age_of_first_word+1)].sort_values(by='AoA')

Here we see that the first words are "mommy", "daddy", some words used to interact with people ("hi"), and general sounds of things.

#### Q2: What is the distribution of the learned words by their broad speech category?

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Broad_lex'], y=df['AoA']).set(xlabel=None);
plt.title("Graph 4. Broad part of speech of the words and the age at which they're acquired");

Above we see the distribution of acquisition ages for the words, grouped by the broad part of speech they belong to.
It appears that children's first words are nominals, that they then quickly learn words about their daily life, and that only later do they start learning other parts of speech such as pronouns and word connectors (closed-class items).
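
A numeric summary per category can back up what the boxplot suggests. The sketch below uses hypothetical rows and category labels; on the real data the same groupby would be applied to df.

```python
import pandas as pd

# Hypothetical toy data: one row per word (values are made up)
toy = pd.DataFrame({
    "Broad_lex": ["nominals", "nominals", "nominals",
                  "predicates", "predicates",
                  "closed-class items", "closed-class items"],
    "AoA": [12, 18, 24, 22, 26, 25, 31],
})

# Count, earliest, median and latest acquisition age per broad category,
# ordered from the earliest-learned category to the latest
summary = (
    toy.groupby("Broad_lex")["AoA"]
       .agg(["count", "min", "median", "max"])
       .sort_values("median")
)
print(summary)
```

Sorting by the median makes the order in which categories are typically acquired easy to read off, while the min column shows which category contains the very first words.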

#### Q3: What is the distribution of the learned words by their specific speech category?

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Lex_cat'], y=df['AoA']).set(xlabel=None)
plt.xticks(rotation=20)
plt.title("Graph 5. Specific part of speech of the words and the age at which they're acquired");

Here we see the specific part of speech the learned words belong to.
It's interesting to see that the first nominals they learn refer to the people who care for them (mommy and daddy), but they only resume learning people's names after 16 months, a gap of 4 months, which is a lot from a developmental point of view.
We can also observe that most other nominals are learned after the child is 2 years of age, while words about their daily lives appear to come earlier.

#### Q4: Does the language explosion happen at the same interval in Norwegian children as in English speakers?
_Language explosion_ refers to the time around 18 to 30 months of age when the maximum speed of word acquisition is reached.

In [None]:
median = df['AoA'].median()
median

In [None]:
q1 = df['AoA'].quantile(0.25)
q3 = df['AoA'].quantile(0.75)
print(f"The interval in which the middle 50% of the words are learned is {round(q1)}-{round(q3)} months")

As seen in **Graph 1**, Norwegian children appear to have their language explosion in a similar way to English-speaking children. The interquartile range of 23-28 months is interesting as well, but it only represents the ages at which the middle half of the words are learned, not the speed of acquisition.
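
One way to get closer to the speed of acquisition is to count how many words are acquired at each age and look for the peak. This is a minimal sketch on made-up AoA values, not the result for the real dataset.

```python
import pandas as pd

# Hypothetical AoA values in months, one entry per word
aoa = pd.Series([12, 13, 16, 18, 20, 20, 22, 22, 22,
                 24, 24, 24, 24, 26, 26, 28, 30])

# New words acquired at each age: a simple proxy for acquisition speed
words_per_month = aoa.value_counts().sort_index()

# Running total of words known by each age
cumulative = words_per_month.cumsum()

# Age with the most newly acquired words
peak_age = words_per_month.idxmax()

# Interquartile range, for comparison with the peak
q1, q3 = aoa.quantile([0.25, 0.75])

print(words_per_month)
print(f"Peak age: {peak_age} months, IQR: {q1}-{q3} months")
```

The interquartile range says where the middle half of the words falls, whereas the per-month counts show how fast new words arrive; the two need not point at the same age.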

In [None]:
# Let's build a matrix of word counts: specific part of speech (rows) by AoA (columns)
aoa_matrix_lex = pd.crosstab(index=df['Lex_cat'], columns=df['AoA'])

In [None]:
aoa_matrix_broad = pd.crosstab(index=df['Broad_lex'], columns=df['AoA'])
sns.heatmap(aoa_matrix_broad, cmap="Blues", vmin=0, vmax=25);
plt.xticks(rotation=60);
plt.title("Graph 6. Number of words learned by age by their broad part of speech\n");

In [None]:
sns.heatmap(aoa_matrix_lex, cmap="Blues", vmin=0, vmax=25);
plt.xticks(rotation=60);
plt.title("Graph 7. Number of words learned by age by their specific part of speech\n");

The heatmaps above clearly show the language explosion.
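
For readers unfamiliar with pd.crosstab, the count matrix behind these heatmaps can be inspected on a few made-up rows. The category labels and ages below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per word
toy = pd.DataFrame({
    "Broad_lex": ["nominals", "nominals", "predicates",
                  "nominals", "closed-class items"],
    "AoA": [12, 24, 24, 24, 30],
})

# Rows are categories, columns are ages, cells are word counts
matrix = pd.crosstab(index=toy["Broad_lex"], columns=toy["AoA"])
print(matrix)

# Column totals give the number of words acquired at each age
busiest_age = matrix.sum(axis=0).idxmax()
print(f"Most words acquired at {busiest_age} months")
```

Each cell of the matrix becomes one tile of the heatmap, so a dark vertical band corresponds to an age at which many categories gain words at once.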

#### Q5: What is the difference in VSoA at 17 months (pre-explosion) vs. 25 months (the median of learned words)?

In [None]:
vsoa_mean_17 = round(df[df.AoA == 17.0].VSoA.mean(), 2)
vsoa_mean_25 = round(df[df.AoA == 25.0].VSoA.mean(), 2)

print(f"At 17 months children know a mean of {vsoa_mean_17} words.\n"
      f"At 25 months children know a mean of {vsoa_mean_25} words.\n"
      f"The difference is {round(vsoa_mean_25-vsoa_mean_17, 2)}\n"
      f"This difference happens in the span of only {25-17} months.")

The calculation above shows how impressive language development in children is: they go the first 12 months without speaking, and then in a matter of only a few months they are able to express feelings, interact verbally with others, and know their world by name!
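
The same comparison can be expressed as an average gain per month, which makes the pace easier to grasp. The numbers below are invented for illustration and are not the values from the dataset.

```python
import pandas as pd

# Hypothetical toy data: VSoA of words acquired at 17 and at 25 months
toy = pd.DataFrame({
    "AoA": [17, 17, 25, 25],
    "VSoA": [40, 60, 300, 340],
})

# Mean vocabulary size for words acquired at each age
mean_vsoa = toy.groupby("AoA")["VSoA"].mean()

start_age, end_age = 17, 25
gain = mean_vsoa.loc[end_age] - mean_vsoa.loc[start_age]

# Average number of additional words known per month over the interval
rate_per_month = gain / (end_age - start_age)

print(f"Gain: {gain} words in {end_age - start_age} months "
      f"(about {rate_per_month:.2f} words per month)")
```

This is a straight-line average over the interval, so it hides any acceleration within it; the per-age counts in the heatmaps show that part of the picture.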

## Inferences and Conclusion

#### Main inferences:
- Norwegian children develop speech in a similar way to English-speaking children.
- The main age interval in which they learn the majority of words is between 23 and 28 months of age.
- They start by learning words for their caretakers, but they quickly move on to learning about their world and life before learning other people's names.

This exercise was very fun, informative and constructive for me. I found it challenging and refreshing.

## References and Future Work

Knowledge of the normal pattern of child development is fundamental both to our understanding of the human person and to making interventions when atypical patterns are observed. It is important to keep studying normal child development. Studies such as that of Hansen need to be replicated in other populations, as this information is not available for most areas of the world.

It would be interesting to follow other developmental trajectories, for example the phonological development of children, in similar datasets.

References:

- Hansen P. What makes a word easy to acquire? The effects of word class, frequency, imageability and phonological neighbourhood density on lexical development. First Language. 2017;37(2):205-225. doi: [10.1177/0142723716679956](https://www.doi.org/10.1177/0142723716679956).
- Dataset [link](https://www.kaggle.com/rtatman/when-do-children-learn-words) on Kaggle.
- Kuperman V, Stadthagen-Gonzalez H, Brysbaert M. Age-of-acquisition ratings for 30,000 English words. Behav Res Methods. 2012 Dec;44(4):978-90. Erratum in: Behav Res Methods. 2013 Sep;45(3):900. PMID: 22581493 [doi](https://www.doi.org/10.3758/s13428-012-0210-4).
