In [1]:
from torch import nn
import pandas as pd

## Read the data from a CSV file

The raw dataset contains 517,176 examples in 3 columns.

In [13]:
df = pd.read_csv("wiktionary_raw.csv")
df.shape

(517176, 3)

## Lowercasing the text

In French, Polish, and Spanish, only proper nouns are capitalized. German is the exception, since it capitalizes every noun.
This brings us to our first data-cleaning decision:
- Do we keep entries that are duplicates apart from a difference in capitalization?
- Does capitalization matter for German if we are only looking at nouns?
- Do we attempt to remove proper nouns (which tend to be capitalized) by checking them with a POS tagger?

For the time being, we will lowercase all nouns.

In [17]:
df["noun"] = df["noun"].str.lower()
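A minimal sketch of why this choice matters for duplicates. The toy rows below are invented for illustration and are not taken from `wiktionary_raw.csv`; only the column names `lang`, `noun`, and `gender` come from the notebook. Rows that differ only in case are not exact duplicates until the noun column is lowercased.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "lang": ["de", "de", "fr", "fr", "es"],
        "noun": ["Haus", "haus", "Paris", "paris", "casa"],
        "gender": ["neuter", "neuter", "masculine", "masculine", "feminine"],
    }
)

# Before lowercasing, no row is an exact duplicate because the case differs.
dups_before = int(toy.duplicated().sum())

# Lowercase a copy of the noun column, leaving the original frame untouched.
lowered = toy.assign(noun=toy["noun"].str.lower())
dups_after = int(lowered.duplicated().sum())

print(f"duplicates before: {dups_before}, after: {dups_after}")
```

Note that the second pair also shows the cost of this decision: a proper noun such as "Paris" becomes indistinguishable from a lowercase entry with the same spelling.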

Next we remove duplicate rows, meaning rows whose language, gender, and noun all match.

The total drops to 378,830 rows, a loss of about 27%.

In [16]:
# remove duplicates
df_no_dups = df.drop_duplicates()
df_no_dups.shape

(378830, 3)
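The size of the drop can be checked directly from the two shapes reported above. The variable names here are illustrative and do not appear in the notebook.

```python
rows_before = 517_176  # df.shape[0], as reported above
rows_after = 378_830   # df_no_dups.shape[0], as reported above

removed = rows_before - rows_after
loss_pct = 100 * removed / rows_before

print(f"{removed:,} rows removed ({loss_pct:.2f}% of the original)")
# prints: 138,346 rows removed (26.75% of the original)
```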

## Split df into smaller dfs for each language

In [5]:
languages = df_no_dups['lang'].unique()
languages

array(['fr', 'de', 'pl', 'es'], dtype=object)

In [6]:
dataframes = [df_no_dups[df_no_dups['lang'] == lang] for lang in languages]
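As an alternative sketch, the same split can be kept in a dictionary keyed by language code, which avoids having to line up list positions with the `languages` array later. The toy frame and the name `by_lang` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_no_dups; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "lang": ["fr", "de", "de", "pl"],
        "noun": ["maison", "haus", "hund", "dom"],
        "gender": ["feminine", "neuter", "masculine", "masculine"],
    }
)

# One sub-frame per language; sort=False keeps languages in order of first appearance.
by_lang = {lang: frame for lang, frame in toy.groupby("lang", sort=False)}

for lang, frame in by_lang.items():
    print(f"{lang} dataframe has {frame.shape[0]} nouns")
```

With this layout, `by_lang["de"]` returns the German rows directly.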

### Display distribution of each language dataset
French has the fewest nouns at about 67K, and German the most at about 118K.


In [20]:
for idx, lang in enumerate(languages):
    print(f"{lang} dataframe has {dataframes[idx].shape[0]} nouns")

fr dataframe has 67349 nouns
de dataframe has 117928 nouns
pl dataframe has 100593 nouns
es dataframe has 97743 nouns


### Display distribution by gender for each language

In [26]:
new_df = pd.concat(dataframes)

new_df.groupby(['gender','lang']).size().unstack()

| gender | de | es | fr | pl |
|---|---|---|---|---|
| feminine | 52613.0 | 40864.0 | 28231.0 | 39009.0 |
| masculine | 24532.0 | 51187.0 | 34687.0 | 44517.0 |
| neuter | 32664.0 | NaN | NaN | 16939.0 |


The smallest gender class is Polish neuter, with 16,939 nouns.

In [25]:
grouped = new_df.groupby(['gender','lang']).size().unstack()
lowest_value = int(grouped.min().min())
lowest_value

16939
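`grouped.min().min()` returns only the count. A small sketch of how the matching language and gender could be recovered as well, using `idxmin`. The `counts` frame below is rebuilt by hand from the table printed above so that the example is self-contained; in the notebook, `grouped` would take its place.

```python
import pandas as pd

nan = float("nan")  # Spanish and French have no neuter rows in the table above

# Counts copied from the gender-by-language table printed above.
counts = pd.DataFrame(
    {
        "de": [52613.0, 24532.0, 32664.0],
        "es": [40864.0, 51187.0, nan],
        "fr": [28231.0, 34687.0, nan],
        "pl": [39009.0, 44517.0, 16939.0],
    },
    index=pd.Index(["feminine", "masculine", "neuter"], name="gender"),
)

# Smallest count per language (NaN is skipped), then the language holding the overall minimum.
lowest_lang = counts.min().idxmin()
# Within that language, the gender with the smallest count.
lowest_gender = counts[lowest_lang].idxmin()
lowest_value = int(counts.loc[lowest_gender, lowest_lang])

print(lowest_lang, lowest_gender, lowest_value)
```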