# Some characteristics of the dataset underlying this project

This notebook prints the characteristics of the dataset that are reported in the introduction of the paper.

### Load the relevant data

Note that not all Birkenau survivors could be included; some were excluded because of problems encountered during data processing.

In [1]:
import constants
import pandas as pd
import os
from IPython.display import display
import json
# Resolve the parent of the current working directory
path = os.getcwd()
parent = os.path.abspath(os.path.join(path, os.pardir))
# Load the document index of the segment-keyword matrix and collect the interview codes
input_directory = parent + '/' + constants.output_data_segment_keyword_matrix
segment_df = pd.read_csv(input_directory + constants.output_segment_keyword_matrix_document_index)
int_codes = segment_df['IntCode'].to_list()
# Load the metadata partitions stored next to the matrix
with open(input_directory + "metadata_partitions.json") as read_file:
    metadata_partitions = json.load(read_file)
# Load the biodata and keep only the interviews that appear in the matrix
input_folder = parent + '/' + constants.input_data
input_file = constants.input_files_biodata_birkenau
df_biodata = pd.read_csv(input_folder + input_file)
# .copy() avoids a SettingWithCopyWarning when a column is added later
df_biodata = df_biodata[df_biodata.IntCode.isin(int_codes)].copy()

In [2]:
# df_biodata was already restricted to int_codes in the previous cell
assert len(metadata_partitions['complete']) == len(df_biodata)

### Total number of interviews included

In [3]:
print (len(metadata_partitions['complete']))

6628


### Total number of women survivors

In [4]:
print (len(metadata_partitions['complete_w']))

4595


### Total number of men survivors

In [5]:
print (len(metadata_partitions['complete_m']))

2033


### Number of different birthplaces

In [6]:
nationalities = [element for element in metadata_partitions.keys() if (element[0].isupper()) and (element[-2] !="_")]
len(nationalities)

13
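
The filter above appears to rely on a naming convention in `metadata_partitions`: country partitions start with an uppercase letter, and gender-specific partitions end in `_w` or `_m`. The following is a minimal sketch of that logic on hypothetical keys (the list `partition_keys` and the helper `is_country_key` are illustrative assumptions, not part of the project code):

```python
# Hypothetical keys imitating the naming pattern used in this notebook;
# the real metadata_partitions.json may contain other keys.
partition_keys = [
    "complete", "complete_w", "complete_m",
    "Poland", "Poland_w", "Poland_m",
    "Hungary", "Hungary_w", "Hungary_m",
]


def is_country_key(key):
    """Keep capitalised keys that do not end in a one-letter suffix such as '_w'."""
    return key[0].isupper() and key[-2] != "_"


nationalities = [key for key in partition_keys if is_country_key(key)]
print(nationalities)
```

Note that the check on `key[-2]` assumes every key has at least two characters.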

The majority of the victims were born in five countries. We focus on those five countries; for the other countries, the number of victims is too small.

In [7]:
countries = []
for country in nationalities:
    countries.append({
        'country': country,
        'total': len(metadata_partitions[country]),
        'women': len(metadata_partitions[country + '_w']),
        'men': len(metadata_partitions[country + '_m']),
    })
countries = pd.DataFrame(countries)

In [8]:
countries.sort_values('total',ascending=False)

| | country | total | women | men |
|---|---|---|---|---|
| 2 | Czechoslovakia (historical) | 1817 | 1317 | 500 |
| 9 | Poland | 1404 | 849 | 555 |
| 6 | Hungary | 1030 | 832 | 198 |
| 10 | Romania | 948 | 690 | 258 |
| 1 | Austria-Hungary (historical) | 478 | 358 | 120 |
| 11 | Russia | 199 | 81 | 118 |
| 4 | Germany | 147 | 82 | 65 |
| 5 | Greece | 129 | 68 | 61 |
| 12 | Yugoslavia (historical) | 100 | 84 | 16 |
| 8 | Netherlands | 90 | 53 | 37 |


In [9]:
countries.sort_values('total',ascending=False)[0:5].total.sum()

5677

In [10]:
countries.sort_values('total',ascending=False)[5:].total.sum()

866

In [11]:
countries.sort_values('total',ascending=False)[0:5].country

2      Czechoslovakia (historical)
9                           Poland
6                          Hungary
10                         Romania
1     Austria-Hungary (historical)
Name: country, dtype: object
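
To put the top-five figure in proportion, the sketch below recomputes it from the totals printed in the table above and relates it to the 6628 included interviews. The frame is rebuilt by hand from the ten rows shown (an assumption made so that the example runs on its own), and `nlargest` is used as a shorthand for sorting and slicing:

```python
import pandas as pd

# Totals copied from the ten rows of the table printed above.
countries = pd.DataFrame({
    "country": [
        "Czechoslovakia (historical)", "Poland", "Hungary", "Romania",
        "Austria-Hungary (historical)", "Russia", "Germany", "Greece",
        "Yugoslavia (historical)", "Netherlands",
    ],
    "total": [1817, 1404, 1030, 948, 478, 199, 147, 129, 100, 90],
})
total_interviews = 6628  # value printed by len(metadata_partitions['complete'])

# Five largest birthplace groups and their share of all included interviews
top_five = countries.nlargest(5, "total")
top_five_total = int(top_five["total"].sum())
share = top_five_total / total_interviews
print(top_five_total, round(share, 3))
```

Under these numbers, the five countries account for roughly 86% of the included interviews.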

### Number of different countries where interviews were recorded

In [12]:
df_biodata.InterviewCountry.unique().shape[0]

43

In [13]:
interview_countries = df_biodata.groupby('InterviewCountry')['IntCode'].unique().to_frame().reset_index()

In [14]:
interview_countries['count'] = interview_countries.IntCode.apply(len)

In [15]:
interview_countries.sort_values('count',ascending=False)

| | InterviewCountry | IntCode | count |
|---|---|---|---|
| 36 | U.S.A. | [2, 3, 7, 8, 16, 21, 33, 39, 47, 61, 64, 65, 6... | 2536 |
| 19 | Israel | [3651, 4712, 4722, 5099, 5709, 5770, 5780, 591... | 1666 |
| 7 | Canada | [160, 246, 285, 364, 368, 430, 434, 466, 474, ... | 662 |
| 1 | Australia | [1089, 1095, 1101, 1103, 1275, 1302, 1323, 133... | 314 |
| 18 | Hungary | [48475, 48476, 48478, 48480, 48982, 49010, 490... | 192 |
| 15 | France | [2610, 2613, 2961, 2981, 3152, 3320, 3325, 333... | 184 |
| 12 | Czech Republic | [6809, 7552, 7573, 7634, 7636, 7943, 7959, 803... | 152 |
| 5 | Brazil | [14524, 16683, 16807, 16869, 17096, 17309, 177... | 86 |
| 30 | Slovakia | [12463, 14031, 14043, 14764, 14874, 15765, 164... | 84 |
| 23 | Netherlands | [3440, 3496, 4195, 4201, 4213, 4261, 4435, 444... | 77 |
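
The cells above count interviews per country by collecting the unique `IntCode` values of each group and then taking the length of each array. A more compact alternative, sketched here on made-up rows (the `toy` frame is hypothetical), is to call `nunique()` on the grouped column, which yields the counts directly without keeping the lists of codes:

```python
import pandas as pd

# Made-up rows for illustration; the real df_biodata has many more columns.
toy = pd.DataFrame({
    "IntCode": [1, 2, 3, 4, 5],
    "InterviewCountry": ["U.S.A.", "Israel", "U.S.A.", "Canada", "U.S.A."],
})

# Number of distinct interview codes per country, largest first
counts = (
    toy.groupby("InterviewCountry")["IntCode"]
    .nunique()
    .sort_values(ascending=False)
    .rename("count")
    .reset_index()
)
print(counts)
```

The same pattern would apply to `InterviewLanguage`.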


### Number of different interview languages

In [16]:
df_biodata.InterviewLanguage.unique().shape[0]

40

In [17]:
interview_languages = df_biodata.groupby('InterviewLanguage')['IntCode'].unique().to_frame().reset_index()

In [18]:
interview_languages['count'] = interview_languages.IntCode.apply(len)

In [19]:
interview_languages.sort_values('count',ascending=False)

| | InterviewLanguage | IntCode | count |
|---|---|---|---|
| 5 | English | [2, 3, 7, 8, 16, 21, 33, 39, 47, 61, 64, 65, 6... | 3472 |
| 18 | Hebrew | [604, 621, 627, 631, 633, 2176, 3651, 4712, 47... | 1490 |
| 20 | Hungarian | [77, 109, 279, 351, 538, 600, 675, 699, 701, 7... | 444 |
| 13 | French | [2610, 2613, 2961, 2981, 3152, 3320, 3325, 333... | 206 |
| 2 | Czech | [829, 6809, 7552, 7573, 7634, 7636, 7943, 7959... | 152 |
| 36 | Spanish | [2202, 5208, 6634, 8672, 8930, 9526, 9563, 102... | 150 |
| 14 | German | [4352, 4356, 6466, 8108, 8207, 8663, 8910, 932... | 104 |
| 27 | Portuguese | [14524, 16683, 16807, 16869, 17096, 17309, 177... | 84 |
| 4 | Dutch | [3440, 3496, 4195, 4201, 4213, 4261, 4435, 444... | 81 |
| 30 | Russian | [254, 2323, 4699, 7689, 8494, 10145, 10368, 10... | 66 |


### Total length of the dataset (in hours)

In [20]:
df_biodata.length_in_minutes.sum()/60

2442.6

### Average time an interviewee speaks (in minutes)

In [21]:
df_biodata.length_in_minutes.mean()

22.11164755582378

### Average time a man speaks (in minutes)

In [22]:
df_biodata[df_biodata.Gender=="M"].length_in_minutes.mean()

20.087555336940483

### Average time a woman speaks (in minutes)

In [23]:
df_biodata[df_biodata.Gender=="F"].length_in_minutes.mean()

23.007181719260064
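
The two gender-specific means above can also be obtained in a single step with a `groupby`. The sketch below uses made-up values (the `toy` frame is hypothetical) and assumes the same `Gender` and `length_in_minutes` columns as `df_biodata`:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "length_in_minutes": [30.0, 20.0, 10.0, 40.0],
})

# Group size and mean length per gender in one table
summary = toy.groupby("Gender")["length_in_minutes"].agg(["count", "mean"])
print(summary)
```

Reporting the group size next to the mean makes it easier to see how many interviews each average is based on.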

### Chronological extremes

In [24]:
df_biodata.sort_values('InterviewDate')['InterviewDate']

6647             0
6574    1981-08-06
6575    1981-10-22
6576    1981-10-22
6577    1981-10-25
           ...    
6819    2017-03-02
6820    2018-12-13
6821    2019-01-18
6827    2019-06-07
6828    2019-09-19
Name: InterviewDate, Length: 6628, dtype: object

In [25]:
df_biodata['dates'] = pd.to_datetime(df_biodata.InterviewDate,format="%Y-%m-%d",errors="coerce")

In [26]:
df_biodata['dates'].describe(datetime_is_numeric=True)

count                             6627
mean     1996-12-20 05:10:30.692621184
min                1981-08-06 00:00:00
25%                1995-12-27 00:00:00
50%                1996-10-28 00:00:00
75%                1997-11-23 00:00:00
max                2019-09-19 00:00:00
Name: dates, dtype: object
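
The summary reports 6627 dates although the column has 6628 entries. This is consistent with the entry shown as `0` in the sorted listing: with `errors="coerce"`, values that do not match the given format become `NaT` and are ignored by `describe`. A minimal sketch of that behaviour on made-up input (the placeholder is assumed here to be the string `"0"`):

```python
import pandas as pd

# One placeholder value and two valid dates; values are illustrative.
raw_dates = pd.Series(["0", "1981-08-06", "2019-09-19"])

# Unparseable entries are turned into NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
n_invalid = int(parsed.isna().sum())
print(parsed.min().date(), parsed.max().date(), n_invalid)
```

Counting the `NaT` values after parsing is a quick way to check how many dates were dropped.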

In [27]:
df_biodata[df_biodata['dates']>'1995-12-27'].shape

(4967, 35)