The best dataset I've found so far contains conversational sentences from film and series subtitles, with translations into multiple languages:

https://opus.nlpl.eu/OpenSubtitles-v2018.php
https://github.com/PolyAI-LDN/conversational-datasets

In [2]:
import os
import constants
import pandas as pd

In [3]:
constants.language_code = 'fr'

In [5]:
filepath_en = f"../input_files/{constants.language_code}/open_subtitles/OpenSubtitles_en-{constants.language_code}.en"
filepath_lang = f"../input_files/{constants.language_code}/open_subtitles/OpenSubtitles_en-{constants.language_code}.{constants.language_code}"

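# '\\t' is more than one character, so pandas treats it as a regex separator,
# which only the Python parser supports. Naming the engine explicitly avoids
# the ParserWarning about falling back from the C engine.
en_series = pd.read_csv(filepath_en, sep='\\t', engine='python')
lang_series = pd.read_csv(filepath_lang, sep='\\t', engine='python')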



In [6]:
lang_series.head()

Unnamed: 0,I've never dreamed before I'm gonna knock the door
0,Into the world of perfect free You ain't no lo...
1,You're gonna say I'm lying I'm gonna get the c...
2,I thought a chance is far from me You ain't no...
3,I was made to hit in America
4,I was made to hit in America


In [11]:
print(len(lang_series) == len(en_series))

True
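
The check above only confirms that both files parsed to the same number of rows. It also shows (in the `head()` output) that the first line of each file was consumed as a column header. As a minimal sketch, assuming each OPUS file holds exactly one sentence per line and that line i in one file translates line i in the other, the files can be read with plain file I/O so that no line is lost and quote characters are never interpreted. `read_lines` and `read_parallel` are hypothetical helper names, not part of this notebook.

```python
def read_lines(path):
    """Return every line of a UTF-8 text file, without the trailing newline."""
    # newline="\n" makes Python split on "\n" only, so stray carriage returns
    # inside a subtitle line cannot shift the alignment.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(path_lang, path_en):
    """Pair up two line-aligned files as (sentence, translation) tuples.

    Hypothetical helper: raises if the files do not have the same line count.
    """
    lang_lines = read_lines(path_lang)
    en_lines = read_lines(path_en)
    if len(lang_lines) != len(en_lines):
        raise ValueError(
            f"line counts differ: {len(lang_lines)} vs {len(en_lines)}"
        )
    return list(zip(lang_lines, en_lines))
```

The resulting list of pairs could be passed to `pd.DataFrame(pairs, columns=['sentence', 'translation'])`. For the full corpus this holds everything in memory at once, which is the same trade-off the cells here already make.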


In [20]:
# Combine into a dataframe and randomly sample n rows
n = 3_000_000

lang_series = lang_series.reset_index(drop=True)
en_series = en_series.reset_index(drop=True)

lang_series_list = lang_series.values.tolist()
en_series_list = en_series.values.tolist()

# Flatten the lists
lang_series_list = [item[0] for item in lang_series_list]
en_series_list = [item[0] for item in en_series_list]


df = pd.DataFrame({
    'sentence': lang_series_list,
    'translation': en_series_list
})

# First 100 or so lines are in English for some reason
#df = df[df.index > 100]

df_sample = df.sample(n, random_state=1)
df_sample.to_csv(f'../input_files/{constants.language_code}/open_subtitles_uncleaned_sentences.csv', sep='\t')
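
One caveat with the sampling step: `DataFrame.sample` raises a `ValueError` when `n` is larger than the number of rows (unless `replace=True`), which could happen for a language pair with a smaller corpus. A small sketch of a guard, where `sample_rows` is a hypothetical helper name and the seed of 1 mirrors the cell above:

```python
import pandas as pd


def sample_rows(df, n, seed=1):
    """Return up to n randomly chosen rows (hypothetical helper).

    The fixed seed makes the draw repeatable across runs.
    """
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed)
```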

In [25]:
# OLD DATASET
"""
filepath = os.path.join(f"../input_files/{constants.language_code}", "uncleaned_sentences.csv")
df = pd.read_csv(filepath, delimiter='\t', header=None)
df.columns = ["id", "sentence"]
"""

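filepath = os.path.join(f"../input_files/{constants.language_code}", "open_subtitles_uncleaned_sentences.csv")
# Note: with header=None the header line written by to_csv is read as a data row
# (row 0 in the output below). The length filter further down removes it.
df = pd.read_csv(filepath, delimiter='\t', header=None)
df.columns = ["id", "sentence", "translation"]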

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000001 entries, 0 to 3000000
Data columns (total 3 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           float64
 1   sentence     object 
 2   translation  object 
dtypes: float64(1), object(2)
memory usage: 68.7+ MB


In [27]:
df.head()

Unnamed: 0,id,sentence,translation
0,,sentence,translation
1,16271971.0,Peut-être qu'elle paye toutes les vacheries qu...,Maybe it was her comeuppance for all the bad s...
2,40639131.0,En voilà une.,There's one.
3,3582787.0,Des pétards.,Firecrackers.
4,30418571.0,Tout est redevenu normal.,It's fucking awesome.
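
Row 0 in the output above is the header line that `to_csv` wrote, read back as data because of `header=None`. The resulting NaN is also why `id` shows up as float64. A minimal sketch of reading the same layout with the header row consumed, using a small in-memory stand-in for the real file (the two rows are copied from the output above; `tsv` and `df_check` are illustrative names):

```python
import io

import pandas as pd

# Stand-in for the file written earlier: tab-separated, with a header row
# whose first (index) column name is empty.
tsv = (
    "\tsentence\ttranslation\n"
    "40639131\tEn voilà une.\tThere's one.\n"
    "3582787\tDes pétards.\tFirecrackers.\n"
)

# header=0 consumes the header line; names= replaces it with our own labels.
df_check = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=0,
    names=["id", "sentence", "translation"],
)
```

Read this way, `id` stays an integer column and there is no stray header row to filter out later.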


In [28]:
# Let's see if there are any duplicates in the dataset
df[df["sentence"].duplicated(keep=False)].sort_values("sentence").head(8)

Unnamed: 0,id,sentence,translation
1113572,26683912.0,!,!
298352,23302528.0,!,!
304512,7234797.0,!,!
2347053,26662096.0,!,!
2973841,28515917.0,!,- What am I doing?
2926332,30346512.0,!,!
1801615,15441934.0,!,!
251512,36312573.0,!,What?


In [29]:
df.dtypes

id             float64
sentence        object
translation     object
dtype: object

In [30]:
# Drop rows whose sentence has already appeared, keeping the first occurrence of each
df = df.drop_duplicates("sentence")
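
Because `drop_duplicates` keeps the first occurrence by default, the translation that survives for a repeated sentence depends on row order. A toy illustration (the data here is made up for the example, not taken from the corpus):

```python
import pandas as pd

toy = pd.DataFrame({
    "sentence": ["Oui.", "Oui.", "En voilà une.", "Oui."],
    "translation": ["Yes.", "Yeah.", "There's one.", "Right."],
})

# Number of rows that repeat an earlier sentence and will be dropped.
n_extra = int(toy["sentence"].duplicated().sum())

# Only the first "Oui." row (translated "Yes.") is kept.
deduped = toy.drop_duplicates("sentence")
```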

In [31]:
lengths: pd.Series = df['sentence'].str.len()
max_characters: int = lengths.max()
max_index = lengths.idxmax()

# Find the sentence with the most characters to check for delimiter issues.
print(f'Longest sentence: {max_characters} characters')
print(df.loc[max_index, 'sentence'][:600])  # Print the first 600 characters

Longest sentence: 476.0 characters
Moi, Samantha Jane Lockwood... je prends Clayton Beresford Junior... je prends Clayton Beresford Junior... pour époux... pour époux... et je le garde... et je le garde... à partir de cette nuit... à partir de cette nuit... - pour le meilleur et pour le pire... - pour le meilleur et pour le pire... dans la richesse et la pauvreté... dans la richesse et la pauvreté... dans la maladie et dans la santé... dans la maladie et dans la santé... jusqu'à ce que la mort nous sépare.


In [32]:
# Keep only sentences with more than 30 and fewer than 200 characters;
# shorter and longer ones are dropped, not truncated.
df = df[
    (df['sentence'].str.len() < 200)
    & (df['sentence'].str.len() > 30)
]
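
The same filter can be written so that the lengths are computed once. This is a sketch, with `filter_by_length` as a hypothetical helper name. Since character counts are whole numbers, the inclusive range 31 to 199 matches the strict bounds used above.

```python
import pandas as pd


def filter_by_length(df, min_chars=31, max_chars=199, column="sentence"):
    """Keep rows whose text length lies in [min_chars, max_chars] (hypothetical helper)."""
    lengths = df[column].str.len()
    # between() is inclusive on both ends; missing values compare as False,
    # so rows without a sentence are dropped as well.
    return df[lengths.between(min_chars, max_chars)]
```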


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1276184 entries, 1 to 3000000
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1276184 non-null  float64
 1   sentence     1276184 non-null  object 
 2   translation  1276184 non-null  object 
dtypes: float64(1), object(2)
memory usage: 38.9+ MB


In [34]:
# Randomly sample n rows to get a reduced dataset for easier training while testing out this method. Set a seed for reproducibility.
#n_rows = 30000

#reduced_df = df.sample(n=n_rows, random_state=1)

In [35]:
# Save the dataframe as a tab-separated file without the pandas index (the id column is kept)
df.to_csv(f"../output_files/{constants.language_code}/step0_sentences.csv", sep='\t', index=False)
#reduced_df.to_csv("./french_sentences_reduced.csv", sep='\t', index=False)