# Replication Analysis

### Partial replication of [Trott & Bergen (2022)](https://www.sciencedirect.com/science/article/pii/S0010027722000828)

### Load libraries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf



In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

### Load data

In [3]:
df_homophones = pd.read_csv("data/homophones.csv")
df_homophones.head(5)

Unnamed: 0,Word,PhonDISC,num_sylls_est,num_homophones,neighborhood_size,p_normalized,meanings,num_phones,normalized_surprisal
0,a,1,1,6,34,0.000382,7,1,5.240941
1,able,1bP,1,0,8,0.000135,1,3,1.89802
2,ace,1s,1,0,26,0.000307,1,2,2.668154
3,ache,1k,1,1,25,0.000436,2,2,2.592028
4,act,{kt,1,1,9,8.8e-05,2,3,1.958978


In [4]:
df_frequency_sum = pd.read_csv("data/wordform_frequency.csv")
df_frequency_sum.head(5)

Unnamed: 0,PhonDISC,wordform_frequency
0,#,1208
1,"#""tIkjUl1SH",29
2,"#""tizj@nwEl",1
3,#J@R,113
4,#J@rI,9


## Part 1: Descriptive statistics

##### Use `sns.histplot` to plot out the distribution of `meanings`.

In [240]:
#### Your code here

##### Use `sns.histplot` to plot out the distribution of `num_sylls_est`.

In [242]:
#### Your code here

##### Use `sns.lineplot` to plot out `meanings ~ num_sylls_est`. 

In [244]:
#### Your code here

## Part 2: Calculate expected number of meanings per wordform.

##### Use `groupby` and `sum()` to figure out how many meanings are assigned, in total, to wordforms with each number of syllables.

In [246]:
#### Your response here
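One possible way to approach this step, shown on a small toy dataframe rather than the real `df_homophones`. The wordforms, the values, and the names `df_toy`, `df_meanings_per_syll`, and `M` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for df_homophones (hypothetical values, not the real data)
df_toy = pd.DataFrame({
    "Word": ["w1", "w2", "w3", "w4", "w5"],
    "num_sylls_est": [1, 1, 1, 2, 2],
    "meanings": [7, 1, 2, 1, 2],
})

# Sum `meanings` within each syllable count; call the total M (assumed name)
df_meanings_per_syll = (
    df_toy.groupby("num_sylls_est", as_index=False)["meanings"]
    .sum()
    .rename(columns={"meanings": "M"})
)
print(df_meanings_per_syll)
```

Using `as_index=False` keeps `num_sylls_est` as an ordinary column, which makes it easy to use as a merge key afterwards.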

##### Now, merge this new dataframe with `df_homophones` using `pd.merge`.

In [248]:
#### Your response here
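A minimal sketch of the merge, again on toy data. It assumes the per-syllable totals live in a dataframe with columns `num_sylls_est` and `M`; both dataframes and the column name `M` are hypothetical.

```python
import pandas as pd

# Toy wordform-level data (hypothetical values)
df_toy = pd.DataFrame({
    "Word": ["w1", "w2", "w3"],
    "num_sylls_est": [1, 1, 2],
    "meanings": [7, 1, 2],
})

# Toy per-syllable totals (hypothetical; one row per syllable count)
df_m = pd.DataFrame({"num_sylls_est": [1, 2], "M": [8, 2]})

# A left merge keeps every wordform and attaches the matching total M
df_merged = pd.merge(df_toy, df_m, on="num_sylls_est", how="left")
print(df_merged)
```

Because `df_m` has exactly one row per syllable count, the merged dataframe should have the same number of rows as the wordform-level dataframe. Checking `len(df_merged)` is a quick sanity test.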

##### Calculate the expected number of meanings for each wordform.

This is calculated as:

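$E_w = M * p(w)$, where $M$ is the total number of meanings among wordforms with the same number of syllables as $w$ (computed above), and $p(w)$ corresponds to `p_normalized`.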

In [250]:
#### Your response here
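As a worked illustration of the formula, here is a sketch on toy values. It assumes the syllable-level total has already been merged in as a column named `M` (a hypothetical name), and the numbers are invented so that the arithmetic is easy to follow.

```python
import pandas as pd

# Toy merged data (hypothetical values): M is the syllable-level total,
# p_normalized is the phonotactic probability of each wordform
df_toy = pd.DataFrame({
    "Word": ["w1", "w2", "w3"],
    "num_sylls_est": [1, 1, 2],
    "M": [8, 8, 2],
    "p_normalized": [0.75, 0.25, 1.0],
})

# E_w = M * p(w), computed row by row via vectorized multiplication
df_toy["expected"] = df_toy["M"] * df_toy["p_normalized"]
print(df_toy[["Word", "M", "p_normalized", "expected"]])
```

In this toy example the probabilities sum to 1 within each syllable count, so the expected values within a syllable count add up to `M`.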

##### Finally, calculate homophony delta: the difference between `meanings` and `expected`.

In [252]:
#### Your response here
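A short sketch of the subtraction on toy values (hypothetical numbers). The column name `delta` follows the name used in the regression in Part 4.

```python
import pandas as pd

# Toy data (hypothetical values): observed and expected meanings per wordform
df_toy = pd.DataFrame({
    "Word": ["w1", "w2", "w3"],
    "meanings": [7, 1, 2],
    "expected": [6.0, 2.0, 2.0],
})

# Positive delta: more meanings than expected; negative delta: fewer
df_toy["delta"] = df_toy["meanings"] - df_toy["expected"]
print(df_toy)
```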

## Part 3: Merge with frequency data

##### Now, we need to merge with our frequency data. Use `pd.merge`.

In [254]:
#### Your response here
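One way this merge could look, sketched on toy data. Both files share a `PhonDISC` column, so that is used as the key here. The frequency counts below are hypothetical, and the choice of `how="inner"` is an assumption rather than something the notebook specifies. Both files also carry an `Unnamed: 0` index column, so the sketch drops it first to avoid `_x`/`_y` suffixed duplicates.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (frequency counts are hypothetical)
df_homophones_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Word": ["ace", "ache", "act"],
    "PhonDISC": ["1s", "1k", "{kt"],
    "meanings": [1, 2, 2],
})
df_frequency_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "PhonDISC": ["1s", "1k", "{kt"],
    "wordform_frequency": [120, 45, 900],
})

# Drop the leftover index columns, then merge on the shared phonological form
df_merged_frequency = pd.merge(
    df_homophones_toy.drop(columns="Unnamed: 0"),
    df_frequency_toy.drop(columns="Unnamed: 0"),
    on="PhonDISC",
    how="inner",  # assumption: keep only wordforms present in both tables
)
print(df_merged_frequency)
```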

##### The paper used log frequency, so let's convert `wordform_frequency` to `log_freq`.

In [257]:
#### Your response here
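A minimal sketch of the transformation on toy counts (hypothetical values). It assumes base-10 logs, which matches the "order of magnitude" interpretation used later, and it assumes all frequencies are positive, since the log of zero is undefined.

```python
import numpy as np
import pandas as pd

# Toy frequency data (hypothetical counts, all strictly positive)
df_toy = pd.DataFrame({
    "PhonDISC": ["1s", "1k", "{kt"],
    "wordform_frequency": [1, 10, 1000],
})

# Base-10 log: each unit of log_freq is one order of magnitude in frequency
df_toy["log_freq"] = np.log10(df_toy["wordform_frequency"])
print(df_toy)
```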

## Part 4: Replicate primary analysis

##### Build an OLS regression model predicting `delta` from `log_freq + num_sylls_est + normalized_surprisal`.

In [259]:
#### Your response here
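A sketch of how such a model could be fit with the `statsmodels` formula API. The data below are simulated only so that the example runs on its own. The simulated effect size and noise are arbitrary and are not the paper's results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (hypothetical; NOT the real dataset)
rng = np.random.default_rng(0)
n = 200
df_sim = pd.DataFrame({
    "log_freq": rng.uniform(0, 4, n),
    "num_sylls_est": rng.integers(1, 4, n),
    "normalized_surprisal": rng.normal(3, 1, n),
})
# Build in a negative effect of log_freq so there is something to recover
df_sim["delta"] = -0.5 * df_sim["log_freq"] + rng.normal(0, 0.5, n)

# OLS with the same formula structure as in the exercise
model = smf.ols(
    "delta ~ log_freq + num_sylls_est + normalized_surprisal", data=df_sim
).fit()

print(model.params)   # estimated coefficients
print(model.pvalues)  # p-value for each coefficient
```

`model.summary()` prints the full regression table. `model.params["log_freq"]` and `model.pvalues["log_freq"]` give direct access to the coefficient of interest and its p-value.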

##### Is the coefficient for `log_freq` significant? How would you interpret this?

In [261]:
#### Your response here

**Answer**: Yes, and it's negative: more frequent wordforms have *fewer* meanings than their phonotactics predict. For every order-of-magnitude increase in frequency (e.g., from 10 to 100 occurrences), a wordform has about $0.44$ fewer meanings than its phonotactics predict, holding the other predictors constant.

##### Now let's plot our results. To start, we need to bin `log_freq`. Use the code below.

In [264]:
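# Quantile-bin log frequency. With duplicates="drop", repeated bin edges are
# merged, so fewer than q bins can remain; `labels` must match the number of
# bins that are left (the 19 labels here imply 19 bins for this dataset).
df_merged_frequency['freq_binned'] = pd.qcut(df_merged_frequency['log_freq'],
                                             q = 21,
                                             duplicates = "drop",
                                             labels = list(range(1, 20)))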
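The number of labels passed to `pd.qcut` has to equal the number of bins that remain after `duplicates="drop"` merges repeated edges, which is why the label list can be shorter than `q`. The sketch below illustrates this on toy values with many ties (hypothetical data). It first asks `pd.qcut` for the bin edges, then builds a label list of matching length.

```python
import pandas as pd

# Toy values with many ties, so several quantile edges coincide
values = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

# First pass: integer codes plus the edges that survive duplicates="drop"
codes, edges = pd.qcut(values, q=5, duplicates="drop",
                       labels=False, retbins=True)
n_bins = len(edges) - 1  # fewer than the 5 bins requested

# Second pass: supply exactly n_bins labels
binned = pd.qcut(values, q=5, duplicates="drop",
                 labels=list(range(1, n_bins + 1)))

print(n_bins)
print(binned.value_counts().sort_index())
```

If the label count is hard-coded and the data change, `pd.qcut` raises an error about mismatched labels and bin edges. Deriving the count from `retbins=True` avoids that.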

##### Now plot `delta` by `freq_binned`.

In [265]:
#### Your code here

##### You can also plot both `meanings ~ freq_binned` and `expected ~ freq_binned`.

In [265]:
#### Your code here