In [1]:
import pandas as pd 
from tqdm import tqdm
from pathlib import Path
import re
import numpy as np
import matplotlib.pyplot as plt
import itertools
from itertools import permutations

In [2]:
def single_size(df, c1, ascending = False): 
    """Count rows per value of c1, sorted by count."""
    return df.groupby(c1).size().reset_index(name = "count").sort_values("count", ascending = ascending)

def groupby_size(df, c1, c2, ascending = False): 
    """Count rows per (c1, c2) pair, sorted by count."""
    df_grouped = df.groupby([c1, c2]).size().reset_index(name = 'count').sort_values('count', ascending = ascending)
    return df_grouped 

def distinct_size(df, c1, c2, ascending = False): 
    """Over the distinct (c1, c2) pairs, count the c2 values per c1 and the c1 values per c2."""
    df = df[[c1, c2]].drop_duplicates()
    d_c1 = df.groupby(c1).size().reset_index(name = 'count').sort_values("count", ascending = ascending)
    d_c2 = df.groupby(c2).size().reset_index(name = 'count').sort_values("count", ascending = ascending)
    return d_c1, d_c2 

In [3]:
df = pd.read_csv("../data/df_raw.csv")



In [18]:
# df.dtypes
df[["q", "related_parent_q", "parent_question", "branching_q"]] # the top-level ("super") questions have NaN in the parent columns

Unnamed: 0,q,related_parent_q,parent_question,branching_q
0,Are the priests paid by polity:,Does the religion have official political support,Does the religion have official political support,"Status of Participants: Elite, Non-elite (comm..."
1,Are other religious groups in cultural contact...,,,"Status of Participants: Elite, Non-elite (comm..."
2,Print sources for understanding this subject:,,,"Status of Participants: Elite, Non-elite (comm..."
3,Is the cultural contact competitive:,Are other religious groups in cultural contact...,Are other religious groups in cultural contact...,"Status of Participants: Elite, Non-elite (comm..."
4,Does the religious group have scriptures:,,,"Status of Participants: Elite, Non-elite (comm..."
...,...,...,...,...
172065,Other important virtues advocated by the relig...,Are there centrally important virtues advocate...,Are there centrally important virtues advocate...,"Status of Participants: Elite, Religious Speci..."
172066,Other important virtues advocated by the relig...,Are there centrally important virtues advocate...,Are there centrally important virtues advocate...,Status of Participants: Religious Specialists
172067,Other important virtues advocated by the relig...,Are there centrally important virtues advocate...,Are there centrally important virtues advocate...,"Status of Participants: Elite, Non-elite (comm..."
172068,Other important virtues advocated by the relig...,Are there centrally important virtues advocate...,Are there centrally important virtues advocate...,Status of Participants: Religious Specialists


# Entry name and Entry id

In [4]:
e_id, e_name = distinct_size(df, 'entry_id', 'entry_name')
e_id.head(5) # entry_id: is unique 
e_name.head(5) # entry_name: not unique

Unnamed: 0,entry_name,count
769,"West Bengal, India",6
584,Syria,5
285,Italy: The Catholic Church,4
481,"Punjab/Delhi, India",4
119,"Central Macedonia, Greece: Eastern Orthodoxy",3


The mapping between entry_id and entry_name is not one-to-one: the same entry name can be entered multiple times, and each entry then gets its own entry_id. These entries are not actually independent, so we need to decide what to do here. Typically it is the same expert covering different periods, see e.g. https://religiondatabase.org/browse/search?q=West%20Bengal,%20India
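
A minimal sketch of how the repeated entry names could be listed directly. The toy frame and its values are made up; only the column names `entry_id` and `entry_name` come from the data.

```python
import pandas as pd

# Toy stand-in for df (values are invented for illustration).
toy = pd.DataFrame({
    "entry_id": [1, 1, 2, 3, 4],
    "entry_name": ["A", "A", "A", "B", "C"],
})

# Number of distinct entry_ids behind each entry_name.
ids_per_name = toy.groupby("entry_name")["entry_id"].nunique()

# Names entered more than once, i.e. candidates for non-independent entries.
repeated_names = ids_per_name[ids_per_name > 1].sort_values(ascending=False)
print(repeated_names)
```

On the real data the same two lines would run on `df` instead of `toy`.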

# Question id and Question

In [5]:
q_id, q = distinct_size(df, "q_id", "q")
q_id.head(3) # almost unique, but not quite

Unnamed: 0,q_id,count
713,4786,2
0,2231,1
2232,6845,1


In [6]:
df[df["q_id"] == 4786][["q_id", "q", "entry_id", "entry_name"]].head(4)

Unnamed: 0,q_id,q,entry_id,entry_name
782,4786,"Afterlife located in """"other"""" space:",1386,Liumen (Liu School)
827,4786,"Afterlife located in """"other"""" space:",1521,Order of the Hermits of St Augustine (Augustin...
125450,4786,"Afterlife located in """"""""""""""""other"""""""""""""""" space:",688,Darul Uloom Deoband
125497,4786,"Afterlife located in """"""""""""""""other"""""""""""""""" space:",690,Ahl-e-Sunnat wa Jamaat


The only question id with more than one associated question text is 4786 ("Afterlife located in "other" space"), and the variants differ only in how the quotes are escaped. I take this to mean that each question id corresponds to exactly one question. 
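
One way to confirm that the variants differ only in quoting is to collapse runs of double quotes before comparing. This is a sketch: `normalize_question` is a hypothetical helper, and the toy strings are shortened versions of the ones shown above.

```python
import re
import pandas as pd

def normalize_question(text: str) -> str:
    """Collapse runs of double quotes into one and trim whitespace (hypothetical helper)."""
    return re.sub(r'"+', '"', text).strip()

# Toy stand-in for df[["q_id", "q"]]; the third row is invented.
toy = pd.DataFrame({
    "q_id": [4786, 4786, 2231],
    "q": [
        'Afterlife located in ""other"" space:',
        'Afterlife located in """"""""other"""""""" space:',
        "Some other question:",
    ],
})

toy["q_norm"] = toy["q"].map(normalize_question)

# After normalization each q_id should have exactly one question text.
variants = toy.groupby("q_id")["q_norm"].nunique()
print(variants)
```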

In [7]:
q.head(3) # not unique

Unnamed: 0,q,count
1198,Other,41
1199,Other [Specify],19
1424,Specify,18


This means that the same question text (e.g. "Other") appears with multiple question ids (count = 41). I take this to mean that the question "Other" appears multiple times within the questionnaire, and thus it makes sense to treat these as different questions. 

In [14]:
df[df["q"] == "Other"].groupby(["q", "entry_id"]).size().reset_index(name = "count").sort_values("count", ascending=False).head(3)

Unnamed: 0,q,entry_id,count
28,Other,1068,14
5,Other,683,14
148,Other,1519,12


The above supports this idea: in entry 1068 the question "Other" appears 14 times. 

In [15]:
df[df["q"] == "Other"].groupby(["q", "q_id", "entry_id"]).size().reset_index(name = "count").sort_values("count", ascending=False).head(3)


Unnamed: 0,q,q_id,entry_id,count
340,Other,6662,1519,3
0,Other,5346,642,2
385,Other,6710,1519,2


Interestingly, even if we also group by q_id we still get duplicates. In entry 1519 the question "Other" with id 6662 appears 3 times. At least in this case (https://religiondatabase.org/browse/1519/#/) this makes sense, because the question (Other) appears three times in the same place (i.e. under the same parent question), so it should have the same id. 

In [28]:
df[(df["entry_id"] == 1519) & (df["q"] == "Other")][["entry_name", "expert", "q", "q_id", "related_parent_q", "answers"]].head(4)

Unnamed: 0,entry_name,expert,q,q_id,related_parent_q,answers
20061,"The Mausoleum of Sunan Muria, Colo (Central Java)",Vivek Neelakantan,Other,6662,Are pilgrimages to this place associated with ...,Other [specify]: Pilgrims undertake ziarah (to...
20062,"The Mausoleum of Sunan Muria, Colo (Central Java)",Vivek Neelakantan,Other,6662,Are pilgrimages to this place associated with ...,Other [specify]: Many married couples undertak...
20063,"The Mausoleum of Sunan Muria, Colo (Central Java)",Vivek Neelakantan,Other,6662,Are pilgrimages to this place associated with ...,Other [specify]: Specific shrines near Sunan M...
20065,"The Mausoleum of Sunan Muria, Colo (Central Java)",Vivek Neelakantan,Other,6671,Is feasting sponsored by the same entity that ...,"Other [specify]: Yayasan Sunan Muria, the foun..."
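
A sketch of how all such repeats could be pulled out at once with `DataFrame.duplicated`. The toy frame mirrors the ids above, but the answer strings are placeholders.

```python
import pandas as pd

# Toy stand-in for df; answers are placeholders, not real data.
toy = pd.DataFrame({
    "entry_id": [1519, 1519, 1519, 1519],
    "q_id": [6662, 6662, 6662, 6671],
    "answers": ["a", "b", "c", "d"],
})

# keep=False marks every member of a repeated (entry_id, q_id) pair,
# not only the second and later occurrences.
repeated = toy[toy.duplicated(subset=["entry_id", "q_id"], keep=False)]
print(repeated)
```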


# Quality of Questions

## Some Questions have weird formatting (e.g. asterisks)

In [33]:
q_quality = single_size(df, "q")
q_quality = q_quality.assign(asterisk = lambda x: x["q"].str.contains("*", regex = False)) # match a literal asterisk
q_asterisk = q_quality[q_quality["asterisk"]]
q_asterisk = df.merge(q_asterisk, on = "q", how = "inner")[["q", "q_id", "entry_id", "entry_name", "expert"]]
q_asterisk.head(3) # entry_id 1025

Unnamed: 0,q,q_id,entry_id,entry_name,expert
0,*Are there methods of permanently tabling or r...,6860,1025,Dalikal Ppo Klaong Garai,William Noseworthy
1,*Is education gendered with respect to this te...,7560,1025,Dalikal Ppo Klaong Garai,William Noseworthy
2,*Does the text specify teaching relationships ...,7561,1025,Dalikal Ppo Klaong Garai,William Noseworthy


In [34]:
q_asterisk_c = pd.DataFrame([x.replace("*", "") for x in q_asterisk["q"]], columns = ["q"]) # strip the asterisks
df_asterisk = df.merge(q_asterisk_c, on = "q", how = "inner")[["q", "q_id", "entry_id", "entry_name"]]
df_asterisk.head(3)

Unnamed: 0,q,q_id,entry_id,entry_name
0,Are there methods of permanently tabling or re...,7700,1042,Zichiji 資持記
1,Are there methods of permanently tabling or re...,7700,1054,Wenzi 文子
2,Are there methods of permanently tabling or re...,7700,1058,Syŏnggyŏng chikhae kwang-ik


As we can see above, this is an error: the same question appears without an asterisk ("Are there methods of permanently...") and, in one case, with an asterisk ("*Are there methods of permanently..."). Unfortunately, the two do not share a question id (q_id) as one would have hoped. Below we can see that the correctly formatted question (q_id 7700) has 14 records, while the misformatted question (q_id 6860) has only one. This looks like a genuine error in the database. 

In [41]:
df[(df["q_id"] == 6860) | (df["q_id"] == 7700)].groupby(['q_id', "q"]).size().reset_index(name = "count")

Unnamed: 0,q_id,q,count
0,6860,*Are there methods of permanently tabling or r...,1
1,7700,Are there methods of permanently tabling or re...,14


## Some questions are misspelled

In [52]:
q_spelling = df[df["q"] == "Is this palce a tomb/burial:"][["q", "q_id", "entry_id", "entry_name", "expert"]]
q_spelling.head(5)

Unnamed: 0,q,q_id,entry_id,entry_name,expert
1494,Is this palce a tomb/burial:,5408,642,The Arch of Titus,Gretel Rodríguez
1627,Is this palce a tomb/burial:,5408,653,The Arch of Constantine,Gretel Rodríguez


In [58]:
df_spelling = df[df["q"] == "Is this place a tomb/burial:"][["q", "q_id", "entry_id", "entry_name"]]
df_spelling.head(3)

Unnamed: 0,q,q_id,entry_id,entry_name
1230,Is this place a tomb/burial:,5832,687,"Temple of Minerva, Forum Transitorium"
1256,Is this place a tomb/burial:,6509,772,The Apollonion at Syracuse
1274,Is this place a tomb/burial:,6509,774,Altar of Hieron II


As we can see above, the correctly spelled "Is this place a tomb/burial:" appears in different places, but none of these share a question id with the misspelled text ("Is this palce a tomb/burial:"). As we can see below, the question id that matches the misspelled version (5408) appears only in those two records. 

In [55]:
len(df[df["q_id"] == 5408]) # two rows (only the two records from above)

2

This is really bad news, because it appears that both misformatting and misspelling create new question ids. The result is more NA values for the actual (i.e. correctly spelled and correctly formatted) questions. 
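
A rough sketch of how the affected ids could be grouped under a cleaned question text. `TYPO_FIXES` and `clean_question` are hypothetical, the "Toy question" rows are invented, and whether ids grouped this way should really be merged depends on whether they belong to the same questionnaire format.

```python
import pandas as pd

# Hand-curated typo fixes (hypothetical; extend as more are found).
TYPO_FIXES = {"palce": "place"}

def clean_question(text: str) -> str:
    """Drop leading asterisks and apply the known typo fixes."""
    text = text.lstrip("*").strip()
    for wrong, right in TYPO_FIXES.items():
        text = text.replace(wrong, right)
    return text

# Toy stand-in for df[["q_id", "q"]].
toy = pd.DataFrame({
    "q_id": [5408, 5832, 6860, 7700],
    "q": [
        "Is this palce a tomb/burial:",
        "Is this place a tomb/burial:",
        "*Toy question:",
        "Toy question:",
    ],
})

toy["q_clean"] = toy["q"].map(clean_question)

# All question ids that end up under the same cleaned text.
ids_per_question = toy.groupby("q_clean")["q_id"].agg(list)
print(ids_per_question)
```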

# Check whether these mistakes are online

## 1. spelling mistakes (just checking one of them)

In [62]:
df[df["q"] == "Is this palce a tomb/burial:"][["q", "q_id", "entry_id", "entry_name", "expert", "related_parent_q"]]


Unnamed: 0,q,q_id,entry_id,entry_name,expert,related_parent_q
1494,Is this palce a tomb/burial:,5408,642,The Arch of Titus,Gretel Rodríguez,
1627,Is this palce a tomb/burial:,5408,653,The Arch of Constantine,Gretel Rodríguez,


The Arch of Titus (yes, it appears misspelled): https://religiondatabase.org/browse/642/#/ 

The Arch of Constantine (yes, it appears misspelled): https://religiondatabase.org/browse/653/#/

And yes, compared with the other entries where this question is spelled correctly, it appears in the same place in the questionnaire (so it should share q_id). This is a big problem. However, another big problem is that the entries use different formats and questionnaires, which I think explains why we have so many NaNs. All of the entries that answer the "tomb/burial" question (misspelled or correctly spelled) use one format (https://religiondatabase.org/browse/653/#/), while most other entries appear to use another (https://religiondatabase.org/browse/1060/#/).

This could be a huge problem for us. 

## 2. Formatting mistakes

In [64]:
q_asterisk.head(2)

Unnamed: 0,q,q_id,entry_id,entry_name,expert
0,*Are there methods of permanently tabling or r...,6860,1025,Dalikal Ppo Klaong Garai,William Noseworthy
1,*Is education gendered with respect to this te...,7560,1025,Dalikal Ppo Klaong Garai,William Noseworthy


When we check this entry (Dalikal Ppo Klaong Garai), it has yet another, third overall format (https://religiondatabase.org/browse/1025/#/), which is problematic. 

# Only overall Questions 

In [65]:
df_o = df[df["related_parent_q"].isna()]

In [70]:
n_o_q, n_q = len(df_o["q"].drop_duplicates()), len(df["q"].drop_duplicates())
n_o_q_id, n_q_id = len(df_o["q_id"].drop_duplicates()), len(df["q_id"].drop_duplicates())
n_o_e, n_e = len(df_o["entry_name"].drop_duplicates()), len(df["entry_name"].drop_duplicates())
n_o_e_id, n_e_id = len(df_o["entry_id"].drop_duplicates()), len(df["entry_id"].drop_duplicates())
print(f"N questions: ({n_q}, {n_o_q})\nN question ID: ({n_q_id}, {n_o_q_id})")
print(f"N entry name: ({n_e}, {n_o_e})\nN entry ID: ({n_e_id}, {n_o_e_id})") # each pair is (all questions, overall questions only)

N questions: (1672, 287)
N question ID: (3360, 596)
N entry name: (799, 799)
N entry ID: (838, 838)


When we select only the overall questions we (naturally) get far fewer questions (both question texts and question ids). The problem is that we still have far too many unique questions (N question ID = 596). 

# Only questions with BINARY answers

In [72]:
# remove questions with non-binary answers
## code answer types
conditions = [
    (df_o['answers'] == "Yes") | (df_o["answers"] == "No"),
    (df_o['answers'] == "Field doesn't know") | (df_o["answers"] == "I don't know"),
    (df_o['answers'] == "NaN")  # string comparison: a missing value (float NaN) does not match and falls through to "non-binary"
]
choices = ["Yes/No", "Don't know", "NaN"]
# assign() returns a new frame, which avoids the SettingWithCopyWarning raised by writing into the df_o slice
df_o = df_o.assign(answer_types = np.select(conditions, choices, default="non-binary"))

## find the non-binary (q_id, answer_types) pairs and drop the matching rows
df_ = df_o.groupby(['q_id', 'answer_types']).size().reset_index(name="count")
df_ = df_[df_["answer_types"] == "non-binary"]
# the merge joins on the shared columns (q_id and answer_types, assuming df_o has no "count" column),
# so a q_id is kept as long as it has at least one answer that is not non-binary
df_ = df_o.merge(df_, how = "outer", indicator = True)
df_b = df_[(df_._merge=="left_only")].drop("_merge", axis = 1)

## how many questions did we exclude?
q_total = len(df_o["q_id"].drop_duplicates())
q_binary = len(df_b["q_id"].drop_duplicates()) 
print(f"total Q: {q_total}\nbinary Q: {q_binary} ({q_binary/q_total:.0%})")
answer_vals = ["No", "Yes", "Field doesn't know", "I don't know"]
df2 = df_o.loc[df_o['answers'].isin(answer_vals)] 

total Q: 596
binary Q: 548 (92%)
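
If the goal is to drop a question entirely as soon as it has any non-binary answer, a stricter filter is possible. The sketch below is one assumed way to do it on a toy frame (values invented); it is not what the cell above computes.

```python
import pandas as pd

# Toy stand-in for df_o[["q_id", "answers"]].
toy = pd.DataFrame({
    "q_id": [1, 1, 2, 2, 3],
    "answers": ["Yes", "No", "Yes", "free text", "I don't know"],
})

allowed = ["Yes", "No", "Field doesn't know", "I don't know"]
is_allowed = toy["answers"].isin(allowed)

# True for every row of a q_id only if all of its answers are allowed.
all_allowed = is_allowed.groupby(toy["q_id"]).transform("all")

# q_id 2 is removed completely because one of its answers is free text.
toy_binary = toy[all_allowed]
print(toy_binary)
```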


In [77]:
df2.groupby('answers').size().reset_index(name = "count").sort_values('count', ascending=False)

Unnamed: 0,answers,count
2,No,22756
3,Yes,22257
0,Field doesn't know,4439
1,I don't know,1611


We still have a lot of questions (n = 548). I believe we are suffering from the coding errors and from the inconsistent questionnaire formats. 

In [80]:
df2.groupby('q').size().reset_index(name = "count").sort_values("count", ascending = True).head(10)

Unnamed: 0,q,count
0,*Are there specific relationships to teachers ...,1
1,*Are there worldly rewards/benefits to educati...,1
2,*Does the text specify teaching relationships ...,1
3,*Is education gendered with respect to this te...,1
249,Number of audience within the sample region (e...,1
16,Are messianic beliefs presentin the text?,1
243,Nature of audience,1
262,Was the place created as the birthplace of a s...,2
236,Is this palce a tomb/burial:,2
257,The society to which the religious group belon...,3


In [82]:
df2.groupby('q_id').size().reset_index(name = "count").sort_values("count", ascending = True).head(5)

Unnamed: 0,q_id,count
343,6842,1
368,7109,1
369,7121,1
370,7129,1
371,7135,1


As we can see above, many of the questions with very few answers (count) are the ones we discussed earlier: the questions that are misspelled and the questions that have weird formatting. We get a lot of question ids with an answer in only one entry, which we cannot really use. 
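
A sketch of filtering out the rarely answered question ids. `MIN_ENTRIES` is a hypothetical threshold and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for df2[["q_id", "entry_id"]].
toy = pd.DataFrame({
    "q_id": [1, 1, 1, 2, 3, 3],
    "entry_id": [10, 11, 12, 10, 10, 11],
})

MIN_ENTRIES = 2  # hypothetical cutoff; the right value is a modelling decision

# Number of distinct entries that answered each question id.
coverage = toy.groupby("q_id")["entry_id"].nunique()

# Keep only the question ids answered by at least MIN_ENTRIES entries.
keep_ids = coverage[coverage >= MIN_ENTRIES].index
toy_kept = toy[toy["q_id"].isin(keep_ids)]
print(toy_kept)
```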

# Other quirks

## Same (BINARY) question, multiple answers

In [86]:
df2.groupby(['q_id', 'entry_id']).size().reset_index(name = "count").sort_values("count", ascending=False).head(5)

Unnamed: 0,q_id,entry_id,count
37961,6352,1230,8
12010,4760,613,4
1056,2347,189,4
37954,6352,1213,4
447,2283,257,3


In [90]:
df2[(df2["entry_id"] == 1230) & (df2["q_id"] == 6352)][["entry_id", "entry_name", "expert", "answers", "q", "q_id"]]

Unnamed: 0,entry_id,entry_name,expert,answers,q,q_id
4919,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4924,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4928,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4931,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4933,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4935,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4959,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352
4967,1230,Cult at the Athenian Agora,Laura Gawlinski,Yes,Are there structures or features present:,6352


In this case it is not a problem, because the answer is the same (Yes) for all of the records. Do we observe inconsistent answers in any cases?

In [96]:
df_ = df2.groupby(['q_id', 'entry_id', 'answers']).size().reset_index(name = "count").sort_values("count", ascending=False)
df_.groupby(['q_id', 'entry_id']).size().reset_index(name = "count").sort_values("count", ascending=False).head(5)

Unnamed: 0,q_id,entry_id,count
43932,7658,1144,3
23827,5151,1466,2
2631,2964,239,2
2602,2964,174,2
27271,5181,1231,2


Unfortunately, it appears that we do observe inconsistent responses to the same BINARY question within some entries. 

In [99]:
df2[(df2["q_id"] == 7658) & (df2["entry_id"] == 1144)][["q", "answers", "answer_val", "q_id", "entry_name", "expert"]]

Unnamed: 0,q,answers,answer_val,q_id,entry_name,expert
27854,Is the production of the text funded by the po...,Field doesn't know,-1,7658,The Yijing 易經 (The Classic of Changes),Matthew Hamm
27857,Is the production of the text funded by the po...,Yes,1,7658,The Yijing 易經 (The Classic of Changes),Matthew Hamm
28069,Is the production of the text funded by the po...,No,0,7658,The Yijing 易經 (The Classic of Changes),Matthew Hamm


The above is an example of the same question being answered with all three of:
(1) Field doesn't know
(2) Yes
(3) No
In these cases, we need to decide what to do (e.g. weight the answers, sample one at random, ...).
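
Before choosing a rule, it may help to separate the conflicting (q_id, entry_id) pairs from the consistent ones. This is only a sketch on a toy frame built from the two examples above; it does not decide how conflicts should be resolved.

```python
import pandas as pd

# Toy stand-in for df2[["q_id", "entry_id", "answers"]].
toy = pd.DataFrame({
    "q_id": [7658, 7658, 7658, 6352, 6352],
    "entry_id": [1144, 1144, 1144, 1230, 1230],
    "answers": ["Field doesn't know", "Yes", "No", "Yes", "Yes"],
})

keys = ["q_id", "entry_id"]

# Number of distinct answers per (q_id, entry_id) pair.
n_distinct = toy.groupby(keys)["answers"].nunique().reset_index(name="n_answers")

# Pairs with more than one distinct answer need a resolution rule.
conflicts = n_distinct[n_distinct["n_answers"] > 1]

# Pairs with a single distinct answer can be collapsed to one row.
ok_pairs = n_distinct.loc[n_distinct["n_answers"] == 1, keys]
consistent = toy.merge(ok_pairs, on=keys).drop_duplicates()

print(conflicts)
print(consistent)
```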

## Consistency of temporal information within entries

In [106]:
e_id, e_y = distinct_size(df, 'entry_id', 'end_year')
e_id.head(3) 

Unnamed: 0,entry_id,count
19,197,12
711,1324,11
604,1180,8


In [113]:
df[df["entry_id"] == 197][["end_year", "entry_name", "entry_id", "expert"]].drop_duplicates().head(5) # ...

Unnamed: 0,end_year,entry_name,entry_id,expert
52694,1958,Gönlung Monastery (dgon lung byams pa gling),197,Brenton Sullivan
52696,1957,Gönlung Monastery (dgon lung byams pa gling),197,Brenton Sullivan
52698,1700,Gönlung Monastery (dgon lung byams pa gling),197,Brenton Sullivan
77672,1620,Gönlung Monastery (dgon lung byams pa gling),197,Brenton Sullivan
77673,1650,Gönlung Monastery (dgon lung byams pa gling),197,Brenton Sullivan


Each entry can have multiple end_year values, so again (if we are going to use this for anything) we have to decide how to handle this (e.g. mean, ...?). We have some of the same issues with regions: multiple regions can be assigned to an entry, which of course can be correct, but the region can also change over time, and it gets really complicated.
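
A sketch of summarizing the spread of end_year per entry before settling on a single value. The rows for entry 197 reuse three of the years shown above; the second entry is invented.

```python
import pandas as pd

# Toy stand-in for df[["entry_id", "end_year"]].
toy = pd.DataFrame({
    "entry_id": [197, 197, 197, 5],
    "end_year": [1958, 1700, 1620, 300],
})

# Earliest, latest and number of distinct end_year values per entry.
span = toy.groupby("entry_id")["end_year"].agg(["min", "max", "nunique"])
print(span)
```

Entries with `nunique` above 1 are the ones that need an explicit rule (e.g. latest year, mean).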